autoglmasr

by IsabellaZhangYM · GitHub ↗ · v0.0.1
cross-platform ⚠ suspicious
Downloads: 375 · Stars: 0 · Active Installs: 0 · Versions: 1
Install in OpenClaw
/install autoglmasr
Description
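AutoGLM ASR MCP service: concurrent transcription of long audio, context passing, and timestamped segments. Built on Zhipu GLM-ASR-2512. Trigger words: speech recognition (语音识别), ASR, transcription (转录), transcribe audio (转录音频), long audio (长音频)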
README (SKILL.md)

AutoGLM ASR MCP Server

GitHub: https://github.com/Starrylyn/autoglm-asr-mcp

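An Agent-oriented speech-to-text MCP service. Core features:

  • Automatic chunking of long audio
  • Concurrent API calls (configurable concurrency)
  • Context-passing modes
  • Timestamped segment output

Installation

# Prerequisite: ffmpeg
brew install ffmpeg  # macOS

# Run the MCP server
npx autoglm-asr-mcp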

MCP configuration

{
  "mcpServers": {
    "autoglm-asr": {
      "command": "npx",
      "args": ["-y", "autoglm-asr-mcp"],
      "env": {
        "AUTOGLM_ASR_API_KEY": "your-api-key"
      }
    }
  }
}

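Core tools

transcribe_audio

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| audio_path | string | yes | - | Absolute path to the audio file |
| context_mode | string | no | sliding | Context mode |
| max_concurrency | int | no | 5 | Number of concurrent requests (1-20) |

Returns:

  • The full transcript text
  • A list of timestamped segments
  • Run statistics (chunk count, elapsed time, mode)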
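
As a rough illustration of the parameter table above, the sketch below checks arguments before a transcribe_audio call. The helper name validate_transcribe_args is hypothetical and not part of the server; the allowed modes, the 1-20 range, and the format list are taken from this README, and the real server may validate differently.

```python
from pathlib import PurePosixPath, PureWindowsPath

# Values copied from this README; the helper itself is an illustration only.
CONTEXT_MODES = ("sliding", "none", "full_serial")
SUPPORTED_FORMATS = ("mp3", "wav", "m4a", "flac", "ogg", "webm")


def validate_transcribe_args(audio_path, context_mode="sliding", max_concurrency=5):
    """Hypothetical pre-check: return the arguments as a dict or raise ValueError."""
    # Accept absolute POSIX or Windows paths without touching the filesystem.
    absolute = (PurePosixPath(audio_path).is_absolute()
                or PureWindowsPath(audio_path).is_absolute())
    if not absolute:
        raise ValueError(f"audio_path must be absolute: {audio_path!r}")

    extension = audio_path.rsplit(".", 1)[-1].lower() if "." in audio_path else ""
    if extension not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported audio format: {extension!r}")

    if context_mode not in CONTEXT_MODES:
        raise ValueError(f"context_mode must be one of {CONTEXT_MODES}")

    if not 1 <= max_concurrency <= 20:
        raise ValueError("max_concurrency must be between 1 and 20")

    return {
        "audio_path": audio_path,
        "context_mode": context_mode,
        "max_concurrency": max_concurrency,
    }
```

Calling validate_transcribe_args("/tmp/talk.mp3") returns the three arguments with their defaults filled in, which can then be passed to the tool call.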

get_audio_info

Returns information about an audio file (duration, format, estimated number of chunks).


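Core implementation walkthrough

1. Concurrent calls

# Use a Semaphore to cap the number of concurrent requests
semaphore = asyncio.Semaphore(concurrency)

async def transcribe_with_semaphore(chunk: AudioChunk) -> None:
    async with semaphore:
        result = await self._transcribe_chunk(chunk, audio_format=audio_format)
        text_results[chunk.index] = result["text"]
        # ...

# Schedule every non-silent chunk; the semaphore bounds how many run at once
tasks = [transcribe_with_semaphore(chunk) for chunk in non_silent_chunks]
await asyncio.gather(*tasks)

Key points:

  • The Semaphore caps the maximum number of in-flight requests
  • asyncio.gather() runs all tasks concurrently
  • Results are stored in a dict, text_results: dict[int, str], keyed by chunk index so they can be reassembled in order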
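
The excerpt above depends on the server's internals (self, AudioChunk), so it cannot be run on its own. A self-contained sketch of the same Semaphore-plus-gather pattern follows. The names run_limited, demo, and fake_transcribe are hypothetical, and the fake worker stands in for the real network call.

```python
import asyncio


async def run_limited(items, worker, max_concurrency=5):
    """Run worker(item) for every item, at most max_concurrency at a time.

    Results come back in input order, regardless of completion order.
    """
    semaphore = asyncio.Semaphore(max_concurrency)
    results = {}

    async def guarded(index, item):
        async with semaphore:
            results[index] = await worker(item)

    await asyncio.gather(*(guarded(i, item) for i, item in enumerate(items)))
    return [results[i] for i in range(len(results))]


async def demo():
    in_flight = 0
    peak = 0

    async def fake_transcribe(chunk_id):
        # Stand-in for an API call; records how many calls overlap.
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)
        in_flight -= 1
        return f"text-{chunk_id}"

    texts = await run_limited(range(10), fake_transcribe, max_concurrency=3)
    return texts, peak
```

Running asyncio.run(demo()) returns the ten texts in chunk order together with the observed peak overlap, which never exceeds the configured limit.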

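2. Context modes

| Mode | Speed | Quality | Description |
|---|---|---|---|
| sliding | | | The first chunk initializes the context; the remaining chunks run in parallel |
| none | Fastest | | Chunks run independently in parallel, with no context passing |
| full_serial | | Best | Sequential execution with a full context chain |

Note: the newer /audio/transcriptions API does not require context passing; all chunks run in parallel by default.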
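
To make the difference between the three modes concrete, here is a minimal scheduling sketch. It is an assumption about how such modes could be wired, not the project's actual code. The function transcribe_with_mode and the transcribe(chunk, context) callback signature are hypothetical, and the concurrency limit is omitted for brevity.

```python
import asyncio


def _tail(text, max_chars):
    # Keep only the last max_chars characters (empty string when max_chars <= 0).
    return text[-max_chars:] if max_chars > 0 else ""


async def transcribe_with_mode(chunks, transcribe, mode="sliding", context_max_chars=2000):
    """Hypothetical scheduler; transcribe(chunk, context) is an async callable."""
    if not chunks:
        return []
    if mode == "none":
        # Every chunk is independent: no context, all scheduled at once.
        return list(await asyncio.gather(*(transcribe(c, "") for c in chunks)))
    if mode == "sliding":
        # The first chunk runs alone; its text seeds the context for the rest.
        first = await transcribe(chunks[0], "")
        context = _tail(first, context_max_chars)
        rest = await asyncio.gather(*(transcribe(c, context) for c in chunks[1:]))
        return [first, *rest]
    if mode == "full_serial":
        # Strictly sequential: each chunk sees the tail of everything before it.
        texts, context = [], ""
        for chunk in chunks:
            text = await transcribe(chunk, context)
            texts.append(text)
            context = _tail(context + text, context_max_chars)
        return texts
    raise ValueError(f"unknown context mode: {mode}")
```

The trade-off in the table falls out of the structure: none never waits, sliding waits for one chunk, and full_serial waits for every chunk in turn.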

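3. Automatic chunking

chunks = split_audio_on_silence(
    audio,
    max_chunk_duration_ms=self.config.max_chunk_duration * 1000,  # default: 25 s
)

  • Audio is split at silence points
  • Each chunk is at most 25 seconds long (configurable)
  • Silent chunks are skipped automatically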
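
The README does not show the body of split_audio_on_silence. One plausible way to choose cut points is sketched below, assuming silence positions are already known. The function plan_chunks and its greedy strategy are illustrative assumptions, not the project's algorithm.

```python
def plan_chunks(total_ms, silence_points_ms, max_chunk_ms=25_000):
    """Greedy split (assumed strategy): cut at the latest silence point that
    keeps the chunk within max_chunk_ms, or hard-cut when there is none."""
    if max_chunk_ms <= 0:
        raise ValueError("max_chunk_ms must be positive")
    points = sorted(p for p in silence_points_ms if 0 < p < total_ms)
    chunks, start = [], 0
    while total_ms - start > max_chunk_ms:
        limit = start + max_chunk_ms
        candidates = [p for p in points if start < p <= limit]
        cut = candidates[-1] if candidates else limit
        chunks.append((start, cut))
        start = cut
    if total_ms > start:
        chunks.append((start, total_ms))  # final chunk, already short enough
    return chunks
```

For a 60-second file with silences at 10 s, 24 s, 40 s, and 55 s, this yields cuts at 24 s and 40 s, so no chunk exceeds the 25-second default.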

4. Silence detection (VAD)

non_silent_chunks = [c for c in chunks if not c.is_silent]
skipped_silent = len(chunks) - len(non_silent_chunks)

  • VAD is used to detect silent segments
  • Silent chunks are never sent to the API, which saves cost
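
The README does not say how is_silent is computed. As a stand-in, the sketch below uses a simple RMS energy threshold over 16-bit PCM samples. Both the method and the threshold value of 500 are assumptions chosen for illustration; a real VAD is usually more sophisticated.

```python
import math


def rms(samples):
    """Root-mean-square amplitude of a sequence of PCM samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def is_silent(samples, threshold=500.0):
    # Assumed rule: a chunk is silent when its RMS energy is below the threshold.
    return rms(samples) < threshold


def drop_silent(chunks, threshold=500.0):
    """Return (non_silent_chunks, skipped_count), mirroring the excerpt above."""
    non_silent = [c for c in chunks if not is_silent(c, threshold)]
    return non_silent, len(chunks) - len(non_silent)
```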

5. Merging results

# Join the text in chunk order
full_text = "".join(text_results.get(chunk.index, "") for chunk in chunks)

# Merge timestamped segments, shifting each by its chunk's start offset
for seg in result["segments"]:
    offset_segments.append(TranscriptionSegment(
        start=seg.start + chunk.start_ms / 1000.0,  # add the chunk's start offset (ms to s)
        end=seg.end + chunk.start_ms / 1000.0,
        text=seg.text,
    ))
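
A self-contained version of the same offset arithmetic is sketched below. The Segment dataclass and merge_results function are hypothetical simplifications of the project's TranscriptionSegment handling. Chunk results are passed as (chunk_start_ms, text, segments) tuples, in any order.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds
    end: float    # seconds
    text: str


def merge_results(chunk_results):
    """Merge per-chunk output into one transcript with absolute timestamps.

    chunk_results: iterable of (chunk_start_ms, text, segments), where the
    segment times are relative to the start of their own chunk.
    """
    ordered = sorted(chunk_results, key=lambda r: r[0])
    full_text = "".join(text for _, text, _ in ordered)
    merged = []
    for start_ms, _, segments in ordered:
        offset = start_ms / 1000.0  # milliseconds to seconds
        merged.extend(
            Segment(s.start + offset, s.end + offset, s.text) for s in segments
        )
    return full_text, merged
```

A segment at 0.5 s inside a chunk that starts at 25 000 ms ends up at 25.5 s in the merged output.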

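Environment variables

| Variable | Default | Description |
|---|---|---|
| AUTOGLM_ASR_API_KEY | (required) | Zhipu API key |
| AUTOGLM_ASR_API_BASE | https://open.bigmodel.cn/api/paas/v4/audio/transcriptions | API endpoint |
| AUTOGLM_ASR_MODEL | glm-asr-2512 | ASR model |
| AUTOGLM_ASR_MAX_CHUNK_DURATION | 25 | Maximum chunk duration (seconds) |
| AUTOGLM_ASR_MAX_CONCURRENCY | 5 | Default concurrency |
| AUTOGLM_ASR_CONTEXT_MAX_CHARS | 2000 | Maximum context length (characters) |
| AUTOGLM_ASR_REQUEST_TIMEOUT | 60 | Request timeout (seconds) |
| AUTOGLM_ASR_MAX_RETRIES | 2 | Number of retries |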
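
For readers who call the API from their own Python code, the table can be mirrored in a small settings loader. This is a sketch under assumptions: AsrConfig and load_config are hypothetical names, and the server (a Node package run through npx) reads these variables in its own way. Only the variable names, defaults, and the error message come from this README.

```python
import os
from dataclasses import dataclass

PREFIX = "AUTOGLM_ASR_"


@dataclass(frozen=True)
class AsrConfig:
    api_key: str
    api_base: str = "https://open.bigmodel.cn/api/paas/v4/audio/transcriptions"
    model: str = "glm-asr-2512"
    max_chunk_duration: int = 25   # seconds
    max_concurrency: int = 5
    context_max_chars: int = 2000
    request_timeout: float = 60.0  # seconds
    max_retries: int = 2


def load_config(env=None):
    """Build an AsrConfig from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    api_key = env.get(PREFIX + "API_KEY", "")
    if not api_key:
        raise RuntimeError("AUTOGLM_ASR_API_KEY environment variable is required")
    d = AsrConfig(api_key=api_key)  # source of the documented defaults
    return AsrConfig(
        api_key=api_key,
        api_base=env.get(PREFIX + "API_BASE", d.api_base),
        model=env.get(PREFIX + "MODEL", d.model),
        max_chunk_duration=int(env.get(PREFIX + "MAX_CHUNK_DURATION", d.max_chunk_duration)),
        max_concurrency=int(env.get(PREFIX + "MAX_CONCURRENCY", d.max_concurrency)),
        context_max_chars=int(env.get(PREFIX + "CONTEXT_MAX_CHARS", d.context_max_chars)),
        request_timeout=float(env.get(PREFIX + "REQUEST_TIMEOUT", d.request_timeout)),
        max_retries=int(env.get(PREFIX + "MAX_RETRIES", d.max_retries)),
    )
```

Passing a plain dict instead of relying on os.environ keeps the loader easy to test.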

Supported audio formats

mp3, wav, m4a, flac, ogg, webm


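Calling the API directly (without MCP)

# Short audio (/path/to/audio.wav is a placeholder for your own file)
curl --request POST \
  --url https://open.bigmodel.cn/api/paas/v4/audio/transcriptions \
  --header 'Authorization: Bearer YOUR_API_KEY' \
  --form model=glm-asr-2512 \
  --form stream=false \
  --form file=@/path/to/audio.wav

# Long audio: you have to implement chunking, concurrency, and result merging yourself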

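Best practices

  1. Short audio (<30s): call the API directly
  2. Long audio: use the MCP service for automatic chunking and concurrency
  3. Quality-critical work: use full_serial mode
  4. Fast turnaround: use none mode with high concurrency (10-20)
  5. Balanced choice: sliding mode with concurrency 5 (the default)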
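
These rules of thumb can be encoded in a small helper. The function recommend_settings and its "route" and priority labels are hypothetical. The concurrency of 10 for the speed preset is simply the low end of the 10-20 range given above.

```python
def recommend_settings(duration_s, priority="balanced"):
    """Map audio length and a priority label to the suggestions listed above."""
    presets = {
        "quality": {"context_mode": "full_serial"},
        "speed": {"context_mode": "none", "max_concurrency": 10},
        "balanced": {"context_mode": "sliding", "max_concurrency": 5},
    }
    if priority not in presets:
        raise ValueError(f"priority must be one of {sorted(presets)}")
    if duration_s < 30:
        # Short clips need no chunking, so the direct API call is enough.
        return {"route": "direct_api"}
    return {"route": "mcp", **presets[priority]}
```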

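Common errors

| Error | Cause | Fix |
|---|---|---|
| ffmpeg not found | ffmpeg is not installed | brew install ffmpeg |
| File not found | Wrong path | Use an absolute path |
| AUTOGLM_ASR_API_KEY environment variable is required | API key not set | Set it in the MCP configuration |
| transcriptions文件只支持单声道 ("transcription files must be mono") | The audio is stereo | Audio is converted to mono automatically |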
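
Most of these errors can be caught before the server is ever started. The preflight sketch below checks the three local preconditions using only the standard library. The function name preflight and its messages are hypothetical.

```python
import os
import shutil


def preflight(audio_path, env=None):
    """Return a list of problems found; an empty list means ready to go."""
    env = os.environ if env is None else env
    problems = []
    if shutil.which("ffmpeg") is None:
        problems.append("ffmpeg not found on PATH (e.g. brew install ffmpeg)")
    if not env.get("AUTOGLM_ASR_API_KEY"):
        problems.append("AUTOGLM_ASR_API_KEY is not set")
    if not os.path.isabs(audio_path):
        problems.append(f"audio_path is not absolute: {audio_path!r}")
    elif not os.path.isfile(audio_path):
        problems.append(f"file not found: {audio_path!r}")
    return problems
```

Printing each entry of preflight(path) before launching a long job gives a quick readiness report.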

Key code snippets (reference implementation)

Python async concurrency example

import asyncio
import httpx

API_URL = "https://open.bigmodel.cn/api/paas/v4/audio/transcriptions"

async def transcribe_chunk(client, chunk_data, api_key):
    """Transcribe a single audio chunk (WAV bytes) and return its text."""
    headers = {"Authorization": f"Bearer {api_key}"}
    files = {"file": ("audio.wav", chunk_data, "audio/wav")}
    data = {"model": "glm-asr-2512"}

    response = await client.post(API_URL, headers=headers, files=files, data=data)
    response.raise_for_status()  # surface HTTP errors instead of silently returning ""
    return response.json().get("text", "")

async def transcribe_parallel(chunks, api_key, max_concurrency=5):
    """Transcribe chunks concurrently and join the text in chunk order."""
    semaphore = asyncio.Semaphore(max_concurrency)
    results = {}

    # The context manager closes the client even if a request fails
    async with httpx.AsyncClient(timeout=60) as client:
        async def limited_transcribe(chunk, index):
            async with semaphore:
                results[index] = await transcribe_chunk(client, chunk, api_key)

        await asyncio.gather(
            *(limited_transcribe(chunk, i) for i, chunk in enumerate(chunks))
        )

    # Merge in chunk order, independent of completion order
    return "".join(results.get(i, "") for i in range(len(chunks)))
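
The reference code above makes a single attempt per chunk, while the environment table lists AUTOGLM_ASR_MAX_RETRIES with a default of 2. A retry wrapper could look like the sketch below. The helper with_retries, the exponential backoff, and the reading of "retries" as extra attempts after the first are all assumptions; the server's actual retry policy is not documented here.

```python
import asyncio


async def with_retries(call, max_retries=2, base_delay=0.5):
    """Await call() up to 1 + max_retries times, backing off between attempts.

    call is a zero-argument function returning a fresh awaitable each time.
    """
    if max_retries < 0:
        raise ValueError("max_retries must be >= 0")
    for attempt in range(max_retries + 1):
        try:
            return await call()
        except Exception:
            # Broad catch keeps the sketch short; real code would likely
            # retry only on timeouts and transient server errors.
            if attempt == max_retries:
                raise
            await asyncio.sleep(base_delay * (2 ** attempt))
```

Inside the semaphore block, a call could then be wrapped as with_retries(lambda: transcribe_chunk(client, chunk, api_key)), so a retry still occupies only one concurrency slot.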

Usage Guidance
Before installing or invoking this skill:
  1. Verify the upstream project and npm package (the SKILL.md references a GitHub repo and expects 'npx autoglm-asr-mcp'); confirm the package name, maintainers, and that the code is trustworthy.
  2. Understand that you must provide AUTOGLM_ASR_API_KEY and that audio files will be uploaded to open.bigmodel.cn; avoid sending sensitive audio, or use a scoped/test API key.
  3. Expect runtime downloads via npx and that ffmpeg must be installed; consider running in an isolated or sandboxed environment.
  4. If you require stricter control, prefer a self-hosted/local ASR or vendor-reviewed package, and update the skill registry metadata to declare required env vars/binaries before trusting it.
Capability Analysis
Type: OpenClaw Skill · Name: autoglmasr · Version: 0.0.1
The skill instructs the agent to execute `npx autoglm-asr-mcp` for installation and operation. This command downloads and runs an external Node.js package, introducing a supply chain risk where a compromised package could lead to arbitrary code execution on the agent's host. Additionally, the `transcribe_audio` tool accepts an `audio_path` parameter as an absolute file path, which, if not properly validated by the underlying service, could be exploited for Local File Inclusion (LFI) by a malicious user. The skill's core functionality involves legitimate external network calls to `https://open.bigmodel.cn/api/paas/v4/audio/transcriptions`.
Capability Assessment
Purpose & Capability
The skill's stated purpose (long-audio ASR, chunking, concurrency, timestamps using GLM-ASR-2512) matches the instructions and examples (split on silence, concurrent HTTP calls to open.bigmodel.cn). However the registry metadata lists no required environment variables or binaries while the SKILL.md clearly requires ffmpeg and an AUTOGLM_ASR_API_KEY; that metadata mismatch is an incoherence.
Instruction Scope
The instructions tell the agent to read local audio files (absolute paths), run 'npx autoglm-asr-mcp' (which will fetch and execute code from npm at runtime), and use an API key to POST audio to https://open.bigmodel.cn. Reading local audio is expected for ASR, but the guidance to fetch/execute remote npm code and to use an API key (not declared in registry) increases risk and scope beyond what the registry claims.
Install Mechanism
There is no install spec in the registry, yet the SKILL.md relies on 'npx autoglm-asr-mcp' (dynamic download/execute from npm) and on installing ffmpeg. Dynamically pulling an npm package at runtime is an implicit install step not captured in metadata and has higher risk than a pure instruction-only skill.
Credentials
The environment variables listed in SKILL.md (AUTOGLM_ASR_API_KEY, API_BASE, model, concurrency, timeouts, etc.) are proportionate to an ASR client. But the registry declares none of these; importantly the API key will be sent to an external third-party (open.bigmodel.cn), which is a privacy and credential-exposure consideration the user should weigh.
Persistence & Privilege
The skill is not 'always: true', has no declared install hooks or config path modifications in the registry, and does not ask to modify other skills or system-wide agent settings. Autonomous invocation is allowed by default but is not in itself a new privilege here.
How to Use
  1. Make sure OpenClaw is installed (local or Docker)
  2. Run the install command in chat: /install autoglmasr
  3. After installation, invoke the skill by name or use /autoglmasr
  4. Provide required inputs per the skill's parameter spec and get structured output
Version History
v0.0.1
AutoGLM ASR MCP: High-concurrency, context-aware, long audio transcription server based on GLM-ASR-2512.
  • Supports automatic chunking of long audio files and concurrent transcription.
  • Offers selectable context modes: sliding (default), none, or full serial for quality/speed tradeoffs.
  • Returns full transcript, timestamped segments, and detailed statistics.
  • Skips silent chunks with VAD to save on API calls and costs.
  • Configurable via environment variables and designed for Agent/MCP integration.
Metadata
Slug autoglmasr
Version 0.0.1
License
All-time Installs 0
Active Installs 0
Total Versions 1
Frequently Asked Questions

What is autoglmasr?

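AutoGLM ASR MCP service: concurrent transcription of long audio, context passing, and timestamped segments, built on Zhipu GLM-ASR-2512. Trigger words: speech recognition (语音识别), ASR, transcription (转录), transcribe audio (转录音频), long audio (长音频). It is an AI Agent Skill for Claude Code / OpenClaw, with 375 downloads so far.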

How do I install autoglmasr?

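Run "/install autoglmasr" in the OpenClaw or Claude Code chat to install it in one step. To actually transcribe audio, the README additionally requires ffmpeg on the host and an AUTOGLM_ASR_API_KEY in the MCP configuration.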

Is autoglmasr free?

Yes, autoglmasr is completely free (open-source). You can download, install and use it at no cost.

Which platforms does autoglmasr support?

autoglmasr is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created autoglmasr?

It is built and maintained by IsabellaZhangYM (@isabellazhangym); the current version is v0.0.1.
