Description

Build real-time voice AI applications using Azure AI Voice Live SDK (azure-ai-voicelive). Use this skill when creating Python applications that need real-time bidirectional audio communication with Azure AI, including voice assistants, voice-enabled chatbots, real-time speech-to-speech translation, voice-driven avatars, or any WebSocket-based audio streaming with AI models. Supports Server VAD (Voice Activity Detection), turn-based conversation, function calling, MCP tools, avatar integration, and transcription.

README (SKILL.md)

Azure AI Voice Live SDK

Name: Azure Ai Voicelive Py
Author: thegovind

Build real-time voice AI applications with bidirectional WebSocket communication.

Installation

pip install azure-ai-voicelive aiohttp azure-identity

Environment Variables

AZURE_COGNITIVE_SERVICES_ENDPOINT=https://<region>.api.cognitive.microsoft.com
# For API key auth (not recommended for production)
AZURE_COGNITIVE_SERVICES_KEY=<api-key>

Authentication

DefaultAzureCredential (preferred):

import os

from azure.ai.voicelive.aio import connect
from azure.identity.aio import DefaultAzureCredential

async with connect(
    endpoint=os.environ["AZURE_COGNITIVE_SERVICES_ENDPOINT"],
    credential=DefaultAzureCredential(),
    model="gpt-4o-realtime-preview",
    credential_scopes=["https://cognitiveservices.azure.com/.default"]
) as conn:
    ...

API Key:

import os

from azure.ai.voicelive.aio import connect
from azure.core.credentials import AzureKeyCredential

async with connect(
    endpoint=os.environ["AZURE_COGNITIVE_SERVICES_ENDPOINT"],
    credential=AzureKeyCredential(os.environ["AZURE_COGNITIVE_SERVICES_KEY"]),
    model="gpt-4o-realtime-preview"
) as conn:
    ...

Quick Start

import asyncio
import os
from azure.ai.voicelive.aio import connect
from azure.identity.aio import DefaultAzureCredential

async def main():
    async with connect(
        endpoint=os.environ["AZURE_COGNITIVE_SERVICES_ENDPOINT"],
        credential=DefaultAzureCredential(),
        model="gpt-4o-realtime-preview",
        credential_scopes=["https://cognitiveservices.azure.com/.default"]
    ) as conn:
        # Update session with instructions
        await conn.session.update(session={
            "instructions": "You are a helpful assistant.",
            "modalities": ["text", "audio"],
            "voice": "alloy"
        })
        
        # Listen for events
        async for event in conn:
            print(f"Event: {event.type}")
            if event.type == "response.audio_transcript.done":
                print(f"Transcript: {event.transcript}")
            elif event.type == "response.done":
                break

asyncio.run(main())

Core Architecture

Connection Resources

The VoiceLiveConnection exposes these resources:

Resource	Purpose	Key Methods
`conn.session`	Session configuration	`update(session=...)`
`conn.response`	Model responses	`create()`, `cancel()`
`conn.input_audio_buffer`	Audio input	`append()`, `commit()`, `clear()`
`conn.output_audio_buffer`	Audio output	`clear()`
`conn.conversation`	Conversation state	`item.create()`, `item.delete()`, `item.truncate()`
`conn.transcription_session`	Transcription config	`update(session=...)`

Session Configuration

from azure.ai.voicelive.models import RequestSession, FunctionTool

await conn.session.update(session=RequestSession(
    instructions="You are a helpful voice assistant.",
    modalities=["text", "audio"],
    voice="alloy",  # or "echo", "shimmer", "sage", etc.
    input_audio_format="pcm16",
    output_audio_format="pcm16",
    turn_detection={
        "type": "server_vad",
        "threshold": 0.5,
        "prefix_padding_ms": 300,
        "silence_duration_ms": 500
    },
    tools=[
        FunctionTool(
            type="function",
            name="get_weather",
            description="Get current weather",
            parameters={
                "type": "object",
                "properties": {
                    "location": {"type": "string"}
                },
                "required": ["location"]
            }
        )
    ]
))

Audio Streaming

Send Audio (Base64 PCM16)

import base64

# Read an audio chunk (16-bit PCM, 24 kHz, mono).
# read_audio_from_microphone() is a placeholder for your own capture code.
audio_chunk = await read_audio_from_microphone()
b64_audio = base64.b64encode(audio_chunk).decode()

await conn.input_audio_buffer.append(audio=b64_audio)
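
The snippet above sends a single chunk. As a minimal sketch of how a longer recording might be split before sending, the helper below slices raw PCM16 bytes into fixed-duration pieces and base64-encodes each one. The name `iter_b64_chunks` and the 100 ms default are illustrative assumptions, not part of the SDK; the sample rate follows the 24 kHz mono `pcm16` format described in this document.

```python
import base64

SAMPLE_RATE_HZ = 24_000  # "pcm16" default sample rate, per this document
BYTES_PER_SAMPLE = 2     # 16-bit PCM, mono


def iter_b64_chunks(pcm: bytes, chunk_ms: int = 100):
    """Yield base64 strings, each covering at most chunk_ms of audio.

    Hypothetical helper for illustration; not part of azure-ai-voicelive.
    """
    if chunk_ms <= 0:
        raise ValueError("chunk_ms must be positive")
    chunk_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * chunk_ms // 1000
    for start in range(0, len(pcm), chunk_bytes):
        yield base64.b64encode(pcm[start:start + chunk_bytes]).decode("ascii")


# One second of silence splits into ten 100 ms chunks of 4800 bytes each.
silence = bytes(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
chunks = list(iter_b64_chunks(silence))
```

Each yielded string could then be passed to `conn.input_audio_buffer.append(audio=...)` in a loop.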

Receive Audio

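import base64

async for event in conn:
    if event.type == "response.audio.delta":
        # Assumption: event.delta is a base64 string. If your SDK version
        # delivers already-decoded bytes, skip the decode step.
        audio_bytes = base64.b64decode(event.delta)
        # play_audio() is a placeholder for your own playback code.
        await play_audio(audio_bytes)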
    elif event.type == "response.audio.done":
        print("Audio complete")

Event Handling

import base64
import json

# The match statement requires Python 3.10 or later.
async for event in conn:
    match event.type:
        # Session events
        case "session.created":
            print(f"Session: {event.session}")
        case "session.updated":
            print("Session updated")
        
        # Audio input events
        case "input_audio_buffer.speech_started":
            print(f"Speech started at {event.audio_start_ms}ms")
        case "input_audio_buffer.speech_stopped":
            print(f"Speech stopped at {event.audio_end_ms}ms")
        
        # Transcription events
        case "conversation.item.input_audio_transcription.completed":
            print(f"User said: {event.transcript}")
        case "conversation.item.input_audio_transcription.delta":
            print(f"Partial: {event.delta}")
        
        # Response events
        case "response.created":
            print(f"Response started: {event.response.id}")
        case "response.audio_transcript.delta":
            print(event.delta, end="", flush=True)
        case "response.audio.delta":
            audio = base64.b64decode(event.delta)
        case "response.done":
            print(f"Response complete: {event.response.status}")
        
        # Function calls
        case "response.function_call_arguments.done":
            # handle_function() is an application-defined dispatcher (placeholder).
            result = handle_function(event.name, event.arguments)
            await conn.conversation.item.create(item={
                "type": "function_call_output",
                "call_id": event.call_id,
                "output": json.dumps(result)
            })
            await conn.response.create()
        
        # Errors
        case "error":
            print(f"Error: {event.error.message}")
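
The function-call branch above relies on a `handle_function` helper that the document does not define. One possible shape for it is sketched below, assuming `event.arguments` arrives as a JSON-encoded string. The registry, the placeholder `get_weather` body, and the error-dictionary format are all hypothetical.

```python
import json


def get_weather(location: str) -> dict:
    # Placeholder implementation; a real app would call a weather service.
    return {"location": location, "forecast": "unknown"}


# Maps tool names (as declared in the session's tools list) to callables.
TOOL_REGISTRY = {"get_weather": get_weather}


def handle_function(name: str, arguments: str) -> dict:
    """Parse JSON-encoded arguments and dispatch to a registered tool."""
    tool = TOOL_REGISTRY.get(name)
    if tool is None:
        return {"error": f"unknown function: {name}"}
    try:
        kwargs = json.loads(arguments or "{}")
    except json.JSONDecodeError as exc:
        return {"error": f"invalid arguments: {exc.msg}"}
    if not isinstance(kwargs, dict):
        return {"error": "arguments must be a JSON object"}
    try:
        return tool(**kwargs)
    except TypeError as exc:
        # Raised when the model supplies missing or unexpected parameters.
        return {"error": str(exc)}


result = handle_function("get_weather", '{"location": "Paris"}')
```

Returning an error dictionary instead of raising keeps the event loop alive and lets the model see what went wrong in the `function_call_output` item.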

Common Patterns

Manual Turn Mode (No VAD)

await conn.session.update(session={"turn_detection": None})

# Manually control turns
await conn.input_audio_buffer.append(audio=b64_audio)
await conn.input_audio_buffer.commit()  # End of user turn
await conn.response.create()  # Trigger response

Interrupt Handling

async for event in conn:
    if event.type == "input_audio_buffer.speech_started":
        # User interrupted - cancel current response
        await conn.response.cancel()
        await conn.output_audio_buffer.clear()
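
Clearing the server-side buffer does not remove audio the client has already received and queued for playback. A minimal sketch of draining a local queue on interruption follows, assuming the application buffers decoded audio in an `asyncio.Queue`. That design, and the helper name `flush_queue`, are assumptions and not something the SDK prescribes.

```python
import asyncio


def flush_queue(queue: asyncio.Queue) -> int:
    """Discard everything currently queued and return how many items were dropped."""
    dropped = 0
    while True:
        try:
            queue.get_nowait()
        except asyncio.QueueEmpty:
            return dropped
        dropped += 1


async def demo():
    playback_queue = asyncio.Queue()
    for _ in range(3):
        playback_queue.put_nowait(b"\x00\x00")  # stand-in for decoded audio
    return flush_queue(playback_queue), playback_queue.qsize()


dropped, remaining = asyncio.run(demo())
```

In the interrupt handler, `flush_queue(playback_queue)` would be called alongside `conn.response.cancel()`.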

Conversation History

# Add system message
await conn.conversation.item.create(item={
    "type": "message",
    "role": "system",
    "content": [{"type": "input_text", "text": "Be concise."}]
})

# Add user message
await conn.conversation.item.create(item={
    "type": "message",
    "role": "user", 
    "content": [{"type": "input_text", "text": "Hello!"}]
})

await conn.response.create()

Voice Options

Voice	Description
`alloy`	Neutral, balanced
`echo`	Warm, conversational
`shimmer`	Clear, professional
`sage`	Calm, authoritative
`coral`	Friendly, upbeat
`ash`	Deep, measured
`ballad`	Expressive
`verse`	Storytelling

Azure voices: Use AzureStandardVoice, AzureCustomVoice, or AzurePersonalVoice models.

Audio Formats

Format	Sample Rate	Use Case
`pcm16`	24kHz	Default, high quality
`pcm16-8000hz`	8kHz	Telephony
`pcm16-16000hz`	16kHz	Voice assistants
`g711_ulaw`	8kHz	Telephony (US)
`g711_alaw`	8kHz	Telephony (EU)
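
The table above determines how many bytes a chunk of a given duration occupies, which is useful when sizing buffers. The sketch below assumes mono audio, 2 bytes per sample for the PCM16 variants, and 1 byte per sample for G.711. The mapping and the function name `bytes_per_chunk` are illustrative and not taken from the SDK.

```python
# format name -> (sample rate in Hz, bytes per sample), mono audio assumed
AUDIO_FORMATS = {
    "pcm16": (24_000, 2),
    "pcm16-8000hz": (8_000, 2),
    "pcm16-16000hz": (16_000, 2),
    "g711_ulaw": (8_000, 1),
    "g711_alaw": (8_000, 1),
}


def bytes_per_chunk(audio_format: str, chunk_ms: int = 20) -> int:
    """Return the raw byte size of chunk_ms milliseconds of audio."""
    sample_rate, sample_width = AUDIO_FORMATS[audio_format]
    return sample_rate * sample_width * chunk_ms // 1000


pcm16_20ms = bytes_per_chunk("pcm16")      # 960 bytes
ulaw_20ms = bytes_per_chunk("g711_ulaw")   # 160 bytes
```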

Turn Detection Options

# Server VAD (default)
{"type": "server_vad", "threshold": 0.5, "silence_duration_ms": 500}

# Azure Semantic VAD (smarter detection)
{"type": "azure_semantic_vad"}
{"type": "azure_semantic_vad_en"}  # English optimized
{"type": "azure_semantic_vad_multilingual"}
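
When these dictionaries are built from user-supplied settings, a small validating helper can catch mistakes before the session update is sent. The sketch below is hypothetical: the function name `server_vad` is made up, and the assumption that `threshold` must lie between 0.0 and 1.0 is not stated in this document.

```python
def server_vad(
    threshold: float = 0.5,
    prefix_padding_ms: int = 300,
    silence_duration_ms: int = 500,
) -> dict:
    """Build a server_vad turn_detection dictionary with basic sanity checks."""
    # Assumption: threshold is an activation probability in [0.0, 1.0].
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    if prefix_padding_ms < 0 or silence_duration_ms < 0:
        raise ValueError("durations must be non-negative")
    return {
        "type": "server_vad",
        "threshold": threshold,
        "prefix_padding_ms": prefix_padding_ms,
        "silence_duration_ms": silence_duration_ms,
    }


# A longer silence window makes the server wait longer before ending a turn.
config = server_vad(silence_duration_ms=800)
```

The result can be passed as the `turn_detection` value in `conn.session.update(session={...})`.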

Error Handling

# Note: this ConnectionError shadows Python's built-in exception of the same name.
from azure.ai.voicelive.aio import ConnectionClosed, ConnectionError, connect

try:
    async with connect(...) as conn:
        async for event in conn:
            if event.type == "error":
                print(f"API Error: {event.error.code} - {event.error.message}")
except ConnectionClosed as e:
    print(f"Connection closed: {e.code} - {e.reason}")
except ConnectionError as e:
    print(f"Connection error: {e}")

References

Detailed API Reference: See references/api-reference.md
Complete Examples: See references/examples.md
All Models & Types: See references/models.md

Usage Guidance

This skill appears to be legitimate documentation for using Azure's Voice Live SDK, but the package metadata does not declare the environment variables and credential access that the SKILL.md actually requires. Before installing or running code derived from this skill:

- Treat the mismatch as a red flag: confirm with the skill author/source why required env vars and credentials are not declared. Ask for an authoritative homepage or repository.
- Do not expose broad Azure credentials (AZURE_CLIENT_ID/SECRET, Azure CLI tokens, or subscription-level keys) to untrusted code. Prefer creating a dedicated Azure resource with minimal permissions and a short-lived credential for testing.
- Be aware that DefaultAzureCredential will attempt multiple auth methods (env vars, managed identity, Azure CLI cache) and could cause the agent to use existing credentials on the host. Run in an isolated environment if you want to limit exposure.
- Verify that packages (azure-ai-voicelive, azure-identity) come from a trusted source (PyPI/GitHub) before pip installing.

If you cannot validate the source or get corrected metadata, treat the skill as suspicious and avoid giving it credentials or running it in a privileged environment.
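
As a rough pre-flight check for the exposure described above, the sketch below lists which credential-bearing environment variables are visible to the current process without printing their values. The helper name `visible_credential_vars` is hypothetical, and the check covers only environment variables. It cannot detect managed identity or a cached Azure CLI login.

```python
import os
from typing import List, Mapping, Optional

# Variable names taken from this document; extend the tuple for your environment.
CREDENTIAL_ENV_VARS = (
    "AZURE_COGNITIVE_SERVICES_KEY",
    "AZURE_CLIENT_ID",
    "AZURE_TENANT_ID",
    "AZURE_CLIENT_SECRET",
)


def visible_credential_vars(environ: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the names (never the values) of credential variables that are set."""
    env = os.environ if environ is None else environ
    return [name for name in CREDENTIAL_ENV_VARS if env.get(name)]


# Example with an explicit mapping, so the result does not depend on the host.
found = visible_credential_vars({"AZURE_CLIENT_ID": "example", "PATH": "/usr/bin"})
```

An empty result from `visible_credential_vars()` only rules out the environment-variable path of DefaultAzureCredential.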

Capability Analysis

Type: OpenClaw Skill
Name: azure-ai-voicelive-py
Version: 0.1.0

The skill bundle provides documentation and code examples for integrating with Azure AI Voice Live SDK. All files describe legitimate usage of the SDK, including standard Python package installation (`pip install azure-ai-voicelive aiohttp azure-identity`), Azure authentication methods (environment variables, `DefaultAzureCredential`), and various real-time audio processing scenarios. There is no evidence of data exfiltration, malicious execution, persistence mechanisms, obfuscation, or prompt injection attempts against the OpenClaw agent. The content is clearly aligned with its stated purpose of building real-time voice AI applications using Azure services.

Capability Assessment

⚠ Purpose & Capability

The SKILL.md describes building real-time voice apps with Azure's Voice Live SDK, which is logical and coherent. However, the registry metadata declares no required environment variables or primary credential, even though the instructions explicitly require AZURE_COGNITIVE_SERVICES_ENDPOINT, optionally use AZURE_COGNITIVE_SERVICES_KEY, and recommend DefaultAzureCredential. The missing declarations are inconsistent with the documented purpose; this is likely an oversight, but it is still an incoherence.

⚠ Instruction Scope

The runtime instructions direct the agent to read environment variables (endpoint, optional API key) and to use DefaultAzureCredential, which may surface additional Azure credentials (AZURE_CLIENT_ID/TENANT_ID/CLIENT_SECRET, Azure CLI tokens, managed identity). The examples also read local audio files and stream microphone audio, which is expected for the stated purpose. The problem is that the instructions access credentials and auth surfaces that are not declared in the skill metadata, giving the agent potential access to broader Azure credentials than the registry advertises.

✓ Install Mechanism

This is an instruction-only skill with no install spec or shipped code, which is the lowest install risk. The SKILL.md recommends installing pip packages (azure-ai-voicelive, aiohttp, azure-identity), which is expected for this SDK and would be a normal developer dependency, but those installs would happen outside the skill bundle.

⚠ Credentials

The skill's documentation requires AZURE_COGNITIVE_SERVICES_ENDPOINT and optionally AZURE_COGNITIVE_SERVICES_KEY, and suggests DefaultAzureCredential (which uses other Azure auth sources). Yet the skill metadata lists no required env vars or primary credential. Requesting (via instructions) broad Azure credential sources without declaring them is disproportionate and opaque; it could cause unexpected use of existing Azure credentials on the host.

✓ Persistence & Privilege

The skill sets always:false, and it requests no install code or persistent modifications. It is user-invocable and can be invoked autonomously (platform default), but there is no evidence that it requests persistent elevated privileges or modifies other skills.

Version History

v0.1.0

Initial release: enables building real-time voice AI apps using Azure AI Voice Live SDK for Python.

- Real-time bidirectional WebSocket audio streaming with Azure AI models.
- Supports Server VAD, turn-based conversation, function calls, tools, transcription, and avatar integration.
- Easy authentication via `DefaultAzureCredential` or API key.
- Provides structured resources for session, response, audio buffers, and conversation state.
- Includes example snippets for session config, event handling, audio streaming, and interrupt management.
- Supports voice selection, multiple audio formats, and manual or VAD turn-taking.

Metadata

Slug azure-ai-voicelive-py

Version 0.1.0

License —

All-time Installs 0

Active Installs 0

Total Versions 1

Frequently Asked Questions

What is Azure Ai Voicelive Py?

An AI Agent Skill for Claude Code / OpenClaw for building real-time voice AI applications in Python with the Azure AI Voice Live SDK (azure-ai-voicelive). It has 2144 downloads so far.

How do I install Azure Ai Voicelive Py?

Run "/install azure-ai-voicelive-py" in the OpenClaw or Claude Code chat to install it in one step; no extra setup is required.

Is Azure Ai Voicelive Py free?

Yes, Azure Ai Voicelive Py is completely free (open-source). You can download, install and use it at no cost.

Which platforms does Azure Ai Voicelive Py support?

Azure Ai Voicelive Py is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Azure Ai Voicelive Py?

It is built and maintained by thegovind (@thegovind); the current version is v0.1.0.
