MCP Server Monitoring and Alerting: Exposing Prometheus Metrics

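During last year's Double 11 (Singles' Day) sale, the AI customer-service system of the e-commerce platform I was responsible for was hit by an unprecedented traffic peak. At exactly 00:00, the order system and the AI customer service were flooded with requests at the same time. The MCP Server, the hub connecting the large model to our business logic, saw its response time jump from the usual 80ms to 2.3 seconds, and user complaints passed 500 within 15 minutes. That was when I realized that an MCP Server without monitoring is a time bomb.

This article documents how I built Prometheus monitoring for an MCP Server from scratch, including reusable code templates, real latency figures, and lessons learned from the pitfalls I hit. It also covers how I connect to HolySheep AI, because a stable upstream API is a prerequisite for the monitoring to be useful.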

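Why an MCP Server must expose Prometheus metrics

In this deployment the MCP Server (Model Context Protocol Server) is essentially an HTTP service: it receives tool-call requests from clients, talks to the upstream LLM API, and returns the result. Under high concurrency, the following must be monitored in real time:

Upstream API timeouts: fluctuations in LLM API response time directly affect user experience
Token consumption rate: unmonitored token usage can blow up your bill on a promotion day
Concurrent connections: exceeding the MCP Server's capacity triggers circuit breaking
Tool-call success rate: a single failing MCP tool can break the whole conversation flow

I compared several LLM API relay services available in China and settled on HolySheep AI. Its direct-connection latency from within China stays under 50ms, and it supports usage statistics in Prometheus format, which fits well with a self-hosted monitoring stack.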

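Quick start with the HolySheep API (a prerequisite for monitoring)

Before configuring any monitoring, make sure you are connected to a stable upstream API. This is the standard way to connect with the Python SDK: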

# Install dependencies
pip install openai httpx prometheus-client

# holy_sheep_client.py
from openai import APITimeoutError, OpenAI

class HolySheepClient:
    def __init__(self, api_key: str):
        self.client = OpenAI(
            api_key=api_key,
            base_url="https://api.holysheep.ai/v1"  # exchange rate ¥1=$1, no markup
        )

    def chat_completion(self, messages: list, model: str = "gpt-4o"):
        """Call the HolySheep API with a request timeout; retries are left to the SDK."""
        try:
            response = self.client.chat.completions.create(
                model=model,
                messages=messages,
                timeout=30.0  # timeout threshold in seconds
            )
            return response
        except APITimeoutError:
            # The OpenAI SDK wraps httpx timeouts in APITimeoutError
            raise TimeoutError(f"HolySheep API timed out, model={model}")
        except Exception as e:
            raise RuntimeError(f"API call failed: {e}") from e

# Usage example
client = HolySheepClient(api_key="YOUR_HOLYSHEEP_API_KEY")
response = client.chat_completion([
    {"role": "user", "content": "What promotions are available for Double 11?"}
])
print(f"Tokens used: {response.usage.total_tokens}")
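
The client above sets a timeout but contains no retry loop of its own. If explicit control over retries is wanted, one option is a small wrapper with exponential backoff. The following is only a sketch: `call_with_retry`, its parameters, and the injectable `sleep` argument are hypothetical names introduced for illustration, not part of the original client or of any SDK.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def call_with_retry(
    fn: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    retry_on: tuple = (TimeoutError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(), retrying on the given exceptions with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

With the client above this could be called as `call_with_retry(lambda: client.chat_completion(messages))`. The OpenAI SDK may also retry on its own (see its `max_retries` setting), so the two layers multiply unless one of them is turned off.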

Adding a Prometheus metrics endpoint to the MCP Server

Below is a complete MCP Server implementation with Prometheus metrics exposure built in:

# mcp_server_with_metrics.py
import time

import uvicorn
from fastapi import FastAPI, HTTPException, Response
from prometheus_client import (
    CONTENT_TYPE_LATEST, Counter, Gauge, Histogram,
    generate_latest, start_http_server,
)
from pydantic import BaseModel

from holy_sheep_client import HolySheepClient

# ==================== Prometheus metrics definitions ====================
REQUEST_COUNT = Counter(
    'mcp_server_requests_total',
    'Total MCP requests',
    ['endpoint', 'status']
)

REQUEST_LATENCY = Histogram(
    'mcp_server_request_duration_seconds',
    'Request latency in seconds',
    ['endpoint'],
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
)

TOKEN_USAGE = Counter(
    'mcp_server_tokens_total',
    'Total tokens consumed',
    ['model', 'type']  # type: prompt/completion
)

ACTIVE_CONNECTIONS = Gauge(
    'mcp_server_active_connections',
    'Number of active connections'
)

MODEL_RESPONSE_TIME = Histogram(
    'mcp_model_response_seconds',
    'Upstream model response time',
    ['model'],
    buckets=[0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0]
)

# ==================== MCP Server implementation ====================
app = FastAPI(title="MCP Server with Prometheus")

# Create the upstream client once instead of on every request
client = HolySheepClient(api_key="YOUR_HOLYSHEEP_API_KEY")

class MCPRequest(BaseModel):
    tool_name: str
    parameters: dict
    model: str = "gpt-4o"

class MCPResponse(BaseModel):
    result: dict
    latency_ms: float
    tokens_used: int

@app.post("/mcp/execute", response_model=MCPResponse)
def execute_mcp_tool(request: MCPRequest):
    """Execute an MCP tool call (example: product lookup)."""
    # Declared with plain `def` so FastAPI runs it in a worker thread;
    # the blocking SDK call would otherwise stall the event loop.
    ACTIVE_CONNECTIONS.inc()
    start_time = time.time()
    status = "error"

    try:
        # Build the prompt (stand-in for real business logic)
        messages = [{
            "role": "user",
            "content": f"Execute tool: {request.tool_name}, params: {request.parameters}"
        }]

        # Call the upstream API and record how long it takes
        model_start = time.time()
        response = client.chat_completion(messages, model=request.model)
        MODEL_RESPONSE_TIME.labels(model=request.model).observe(time.time() - model_start)

        # Record token usage, split by the declared `type` label
        usage = response.usage
        TOKEN_USAGE.labels(model=request.model, type="prompt").inc(usage.prompt_tokens)
        TOKEN_USAGE.labels(model=request.model, type="completion").inc(usage.completion_tokens)

        status = "success"
        elapsed_ms = (time.time() - start_time) * 1000
        return MCPResponse(
            result={"status": "success", "data": "mock data"},
            latency_ms=round(elapsed_ms, 2),
            tokens_used=usage.total_tokens
        )

    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
    finally:
        # Record count and latency for failures as well as successes
        REQUEST_COUNT.labels(endpoint="/mcp/execute", status=status).inc()
        REQUEST_LATENCY.labels(endpoint="/mcp/execute").observe(time.time() - start_time)
        ACTIVE_CONNECTIONS.dec()

@app.get("/metrics")
def metrics():
    """Prometheus scrape endpoint."""
    # generate_latest() returns bytes, so send them with the exposition content type
    return Response(content=generate_latest(), media_type=CONTENT_TYPE_LATEST)

@app.get("/health")
def health():
    """Health check endpoint."""
    # _value is a private attribute of prometheus_client: fine for a demo, not a stable API
    return {"status": "healthy", "active_connections": ACTIVE_CONNECTIONS._value.get()}

if __name__ == "__main__":
    # Optional dedicated metrics port. start_http_server() has no default port and
    # returns immediately (it serves from a daemon thread). 9090 is also the
    # Prometheus server's own default, so pick another port if both share a host.
    start_http_server(9090)
    print("Prometheus metrics exposed on :9090/metrics")

    # Start the FastAPI app (port 8000); it serves the same metrics at /metrics
    uvicorn.run(app, host="0.0.0.0", port=8000)
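
The handler above measures latency by hand with `time.time()`. The same measurement can be factored into a small context manager that reports the elapsed time even when the wrapped code raises, so failed requests are not missing from the histogram. This is a minimal sketch; `observe_duration` is a hypothetical helper name, and it takes any callable that accepts a number of seconds.

```python
import time
from contextlib import contextmanager
from typing import Callable, Iterator

@contextmanager
def observe_duration(observe: Callable[[float], None]) -> Iterator[None]:
    """Time the enclosed block and report elapsed seconds, even if it raises."""
    start = time.perf_counter()  # monotonic clock, unaffected by wall-clock jumps
    try:
        yield
    finally:
        observe(time.perf_counter() - start)
```

In the server it could wrap the upstream call as `with observe_duration(MODEL_RESPONSE_TIME.labels(model=request.model).observe): ...`. prometheus_client also offers a built-in `Histogram.time()` helper for the same purpose; the hand-written version is shown only to make the mechanics visible.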

Prometheus alert rule configuration

Once the metrics are exposed, alert rules need to be configured. Here is the alert configuration I use:

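# prometheus_alerts.yml
groups:
  - name: mcp_server_alerts
    rules:
      # Alert 1: P99 latency above 2 seconds
      - alert: MCPServerHighLatency
        expr: histogram_quantile(0.99, sum by (le) (rate(mcp_server_request_duration_seconds_bucket[5m]))) > 2
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "MCP Server P99 latency is high"
          description: "P99 latency has been {{ $value }}s for 2 minutes; the server may be overloaded"

      # Alert 2: upstream model responses are timing out
      - alert: ModelResponseTimeout
        expr: rate(mcp_model_response_seconds_count[5m]) > 0 and
              histogram_quantile(0.95, rate(mcp_model_response_seconds_bucket[5m])) > 10
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "LLM API responses are timing out"
          description: "P95 response time exceeds 10 seconds; check the upstream API status"

      # Alert 3: request error rate above 5%
      # Both sides are summed first: dividing the raw series would only match
      # status="error" against itself and always yield 1.
      - alert: HighErrorRate
        expr: sum(rate(mcp_server_requests_total{status="error"}[5m])) /
              sum(rate(mcp_server_requests_total[5m])) > 0.05
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "MCP Server error rate is elevated"
          description: "Error rate is {{ $value | humanizePercentage }}; check the logs"

      # Alert 4: abnormal number of active connections
      - alert: ConnectionSpike
        expr: mcp_server_active_connections > 100
        for: 30s
        labels:
          severity: warning
        annotations:
          summary: "MCP Server connection count is spiking"
          description: "{{ $value }} active connections, close to the processing limit"

      # Alert 5: abnormal token consumption (guards against a runaway bill)
      # increase() gives tokens over the window; rate() would be tokens per second.
      - alert: TokenConsumptionSpike
        expr: sum(increase(mcp_server_tokens_total[1h])) > 1000000  # more than 1M tokens in the last hour
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Token consumption rate is abnormal"
          description: "At the current rate the daily bill may exceed expectations; check for malicious requests"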
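
The latency alerts above rely on `histogram_quantile()`, which does not see individual requests. It estimates a quantile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The sketch below is a simplified re-implementation of that idea for illustration only; it ignores edge cases the real PromQL function handles, and the bucket counts used in the usage note are made up.

```python
import math
from typing import List, Tuple

def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with (math.inf, total).
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be in [0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations in the window
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # Rank falls in the +Inf bucket (or an empty first bucket):
                # the best available answer is the last finite bound seen
                return prev_bound
            # Assume observations are spread evenly inside this bucket
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound
```

For example, with cumulative counts `[(0.5, 90), (1.0, 98), (2.5, 100), (math.inf, 100)]` the P99 estimate lands inside the 1.0 to 2.5 second bucket. Because the result is interpolated, an alert threshold such as 2 seconds is only as precise as the bucket boundaries around it, which is a reason to place bucket bounds near the thresholds you alert on.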

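Grafana dashboard configuration

To let the team see the system state at a glance, I built a real-time dashboard in Grafana. The core panel configuration: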

{
  "panels": [
    {
      "title": "MCP Server QPS",
      "type": "graph",
      "targets": [
        {
          "expr": "rate(mcp_server_requests_total[1m])",
          "legendFormat": "{{endpoint}} - {{status}}"
        }
      ]
    },
    {
      "title": "P50/P95/P99 latency",
      "type": "graph",
      "targets": [
        {"expr": "histogram_quantile(0.50, rate(mcp_server_request_duration_seconds_bucket[5m]))", "legendFormat": "P50"},
        {"expr": "histogram_quantile(0.95, rate(mcp_server_request_duration_seconds_bucket[5m]))", "legendFormat": "P95"},
        {"expr": "histogram_quantile(0.99, rate(mcp_server_request_duration_seconds_bucket[5m]))", "legendFormat": "P99"}
      ]
    },
    {
      "title": "Upstream model response time (by model)",
      "type": "graph",
      "targets": [
        {"expr": "histogram_quantile(0.95, rate(mcp_model_response_seconds_bucket[5m]))", "legendFormat": "{{model}} P95"}
      ]
    },
    {
      "title": "Token consumption rate (tokens per hour)",
      "type": "graph",
      "targets": [
        {"expr": "rate(mcp_server_tokens_total[1h]) * 3600", "legendFormat": "{{model}} {{type}}"}
      ],
      "thresholds": {
        "mode": "absolute",
        "steps": [
          {"color": "green", "value": null},
          {"color": "yellow", "value": 500000},
          {"color": "red", "value": 1000000}
        ]
      }
    }
  ]
}

Common errors and solutions

I hit quite a few pitfalls in real deployments. Here are the 3 most common errors and the code that fixes them:

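Error 1: Prometheus cannot scrape metrics (Connection Refused)

# Symptom: curl localhost:9090/metrics returns "Connection refused"
# Cause: nothing is listening on :9090. start_http_server() does not block (it
# serves from a daemon thread and returns at once), so call order is not the
# problem. A typical cause is launching the app with `uvicorn module:app`, which
# never runs the `if __name__ == "__main__"` block.

# ❌ Wrong: the metrics server only starts when the file is run as a script
if __name__ == "__main__":
    start_http_server(9090)  # skipped under `uvicorn mcp_server_with_metrics:app`
    uvicorn.run(app, port=8000)

# ✅ Right: start the metrics server from a FastAPI startup hook
# (with several workers, each process would try to bind the same port;
# in that case scrape the app's own /metrics route instead)
@app.on_event("startup")
def start_metrics_server():
    start_http_server(9090)
    print("Metrics server running on :9090")

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)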
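
When a scrape fails with "Connection refused", it helps to confirm first whether anything is listening on the metrics port at all, before looking at Prometheus itself. The probe below is a minimal sketch using only the standard library; `is_port_open` is a hypothetical helper name.

```python
import socket

def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts
        return False
```

Calling `is_port_open("127.0.0.1", 9090)` from the host or container that runs the MCP Server separates "the metrics server never started" from "the port is not reachable from where Prometheus runs".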

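Error 2: Label cardinality explosion (high cardinality)

# Symptom: Prometheus reports too many series for a metric as label values keep multiplying
# Cause: user_id or request_id is used as a label value

# ❌ Wrong (high cardinality)
REQUEST_COUNT.labels(
    endpoint="/mcp/execute",
    user_id=user.id,  # dangerous: every user creates a new series
    request_id=request.id  # worse: every request is unique
).inc()

# ✅ Right: label only the dimensions you need to aggregate on
# (tool_name must also be added to the label list where REQUEST_COUNT is declared)
REQUEST_COUNT.labels(
    endpoint="/mcp/execute",
    tool_name=request.tool_name,  # finite enumeration; safe once validated against known tools
    status="success"
).inc()

# To trace individual requests, use a logger rather than metrics
import structlog
logger = structlog.get_logger()
logger.info("request_completed", user_id=user.id, request_id=request.id)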
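
A label such as `tool_name` is only low-cardinality if its values are bounded, and a value taken straight from a client request is not. One way to enforce the bound is to map anything outside a known set to a single fallback value before it reaches `.labels()`. The sketch below assumes a fixed allow-list; the tool names in `KNOWN_TOOLS` and the helper `tool_label` are hypothetical.

```python
from typing import FrozenSet

# Hypothetical tool names; replace with the tools your server actually registers
KNOWN_TOOLS: FrozenSet[str] = frozenset(
    {"product_search", "order_lookup", "refund_status"}
)

def tool_label(tool_name: str, known: FrozenSet[str] = KNOWN_TOOLS) -> str:
    """Map arbitrary client input to a bounded set of label values."""
    name = tool_name.strip().lower()
    return name if name in known else "other"
```

With this in place, the number of distinct label values can never exceed `len(KNOWN_TOOLS) + 1`, no matter what clients send.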

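Error 3: Memory growth (Histogram series accumulate and are never cleaned up)

# Symptom: Grafana shows latency suddenly dropping to zero while the service is healthy
# Cause: a Histogram keeps one set of bucket series per label combination for the
# life of the process. Zeroing its internals to "clean up" does not free them and
# produces the false drop to zero in dashboards.

# ❌ Wrong: no namespace, and internals reset by hand
REQUEST_LATENCY = Histogram(
    'request_duration',  # no namespace, easy to collide with other metrics
    'Request latency'
)
REQUEST_LATENCY._sum.set(0)    # private attributes; resetting corrupts rate() and quantiles
REQUEST_LATENCY._count.set(0)

# ✅ Right: manage the registry explicitly and drop only stale label sets
from prometheus_client import CollectorRegistry, Histogram, generate_latest

# Create a dedicated registry
metrics_registry = CollectorRegistry()
REQUEST_LATENCY = Histogram(
    'mcp_server_request_duration_seconds',
    'Request latency in seconds',
    ['endpoint'],
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5],
    registry=metrics_registry
)

# Expose this registry explicitly: generate_latest() without arguments reads
# only the default REGISTRY and would show none of these series
payload = generate_latest(metrics_registry)

# Cleanup: remove the series of a label value that is no longer in use
def drop_endpoint(endpoint: str):
    REQUEST_LATENCY.remove(endpoint)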
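
Stale series can be dropped one label set at a time instead of touching a metric's values. The sketch below shows this with a dedicated registry, assuming `prometheus_client` is installed; the endpoint `/mcp/legacy` and the helper `series_count` are hypothetical and exist only to make the effect observable.

```python
from prometheus_client import CollectorRegistry, Histogram, generate_latest

registry = CollectorRegistry()
latency = Histogram(
    "mcp_server_request_duration_seconds",
    "Request latency in seconds",
    ["endpoint"],
    buckets=[0.1, 0.5, 1.0],
    registry=registry,
)

latency.labels(endpoint="/mcp/execute").observe(0.3)
latency.labels(endpoint="/mcp/legacy").observe(0.7)  # hypothetical retired endpoint

def series_count(endpoint: str):
    """Observation count for one endpoint, or None if the series does not exist."""
    return registry.get_sample_value(
        "mcp_server_request_duration_seconds_count", {"endpoint": endpoint}
    )

before = series_count("/mcp/legacy")
latency.remove("/mcp/legacy")  # drop only the stale label set
after = series_count("/mcp/legacy")

# Render this registry explicitly; the default registry knows nothing about it
exposition = generate_latest(registry).decode()
```

After `remove()`, the retired endpoint no longer appears in the exposition output, while the series for `/mcp/execute` keeps its accumulated counts, so `rate()` and quantile queries on it are unaffected.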

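In practice: monitoring strategy during Double 11

Back to the Double 11 scenario from the start of this article. My final monitoring setup included these key elements:

Upstream API latency SLO: P95 < 1.5s; on timeout, fail over to a backup API immediately
Automatic circuit breaking: after 10 consecutive failed requests, stop calling the upstream API for 30 seconds
Token budget control: an alert threshold of 5 million tokens per hour, with automatic rate limiting once it is exceeded

In the HolySheep AI console I found that the API itself provides a usage statistics API. Combined with my Prometheus collection logic, this gave me end-to-end observability. On the promotion day, request volume was 47 times the usual level, yet the system ran smoothly and P99 latency stayed under 800ms.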
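
The circuit-breaking rule in the list above (10 consecutive failures, then a 30-second pause) can be sketched as a small state holder. This is an illustration under simplifying assumptions, not the implementation used in production: the class and method names are hypothetical, there is no half-open probing state, and it is not thread-safe. The `clock` parameter is injectable so the behavior can be checked without waiting.

```python
import time
from typing import Callable, Optional

class CircuitBreaker:
    """Open after N consecutive failures; reject calls until the cooldown ends."""

    def __init__(
        self,
        failure_threshold: int = 10,
        cooldown_seconds: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock
        self._consecutive_failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True
        if self._clock() - self._opened_at >= self.cooldown_seconds:
            # Cooldown over: close the circuit and start counting afresh
            self._opened_at = None
            self._consecutive_failures = 0
            return True
        return False

    def record_success(self) -> None:
        # Any success breaks the run of consecutive failures
        self._consecutive_failures = 0

    def record_failure(self) -> None:
        self._consecutive_failures += 1
        if self._consecutive_failures >= self.failure_threshold:
            self._opened_at = self._clock()
```

A caller would check `allow_request()` before each upstream call and report the outcome with `record_success()` or `record_failure()`. Exporting the open or closed state as a Gauge would make breaker trips visible on the same dashboard as the latency panels.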

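Summary: monitoring is the foundation of a stable MCP Server

A system without monitoring is like driving blindfolded: you can feel the speed but cannot tell whether you are about to hit a wall. With the approach in this article you can:

Expose a Prometheus metrics endpoint on the MCP Server
Define sensible alert rules (latency, error rate, token consumption)
Build a visual dashboard in Grafana
Avoid 3 common deployment errors

When choosing an upstream API, I recommend giving priority to providers that offer direct connectivity from within China and stable latency. HolySheep AI's sub-50ms latency and ¥1=$1 exchange rate let me spend more of my effort on business optimization rather than infrastructure operations.

A monitoring system is not built overnight. Start from the minimal viable setup in this article, add business-specific metrics step by step, and grow it into a complete observability stack.

👉 Sign up for HolySheep AI for free and get first-month bonus credit to try a stable, low-latency LLM API service.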
