Last year during Double 11 (the November 11 shopping festival), the e-commerce AI customer-service system I run was hit by an unprecedented traffic peak. At 00:00 sharp, the order system and the AI customer service were flooded with requests simultaneously. The MCP Server, the hub connecting the LLM to our business logic, saw its response time spike from the usual 80 ms to 2.3 s, and user complaints passed 500 within 15 minutes. That was the moment I realized: an MCP Server without monitoring is a time bomb waiting to go off.

This article is a complete record of how I built a Prometheus monitoring stack for an MCP Server from scratch, including reusable code templates, real latency numbers, and lessons learned from the pitfalls. At the end you'll find the HolySheep AI onboarding steps, because a stable upstream API is the precondition for any monitoring stack to be worth having.

Why an MCP Server must expose Prometheus metrics

An MCP Server (Model Context Protocol server) is essentially an HTTP service: it receives tool-call requests from clients, then talks to the upstream LLM API and returns the result. Under high concurrency, the following must be monitored in real time:

- request latency (P95/P99, not just the average)
- per-endpoint error rate
- upstream model response time
- token consumption rate (this is your bill)
- number of active connections
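
All of these boil down to numeric time series that Prometheus scrapes over HTTP in a plain-text exposition format. As a taste of what the endpoint we build below will return (the sample values here are made up):

```text
mcp_server_requests_total{endpoint="/mcp/execute",status="success"} 1842.0
mcp_server_request_duration_seconds_bucket{endpoint="/mcp/execute",le="0.25"} 1793.0
mcp_server_active_connections 12.0
```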

I compared several domestic LLM API relay services and ultimately chose HolySheep AI: its direct-connect latency from mainland China stays under 50 ms, and it exposes usage statistics in a Prometheus-compatible format, which fits neatly into a self-hosted monitoring stack.

Quick start: connecting to the HolySheep API (the prerequisite for monitoring)

Before configuring any monitoring, make sure you are connected to a stable upstream API. Here is the standard way in with the Python SDK:

```bash
# install dependencies
pip install openai httpx prometheus-client
```

```python
# holy_sheep_client.py
from openai import OpenAI, APITimeoutError


class HolySheepClient:
    def __init__(self, api_key: str):
        self.client = OpenAI(
            api_key=api_key,
            base_url="https://api.holysheep.ai/v1",  # ¥1 = $1 exchange rate, no markup
            max_retries=3,  # let the SDK retry transient failures itself
        )

    def chat_completion(self, messages: list, model: str = "gpt-4o"):
        """Call the HolySheep API with timeout control and built-in retries."""
        try:
            return self.client.chat.completions.create(
                model=model,
                messages=messages,
                timeout=30.0,  # per-request timeout threshold
            )
        except APITimeoutError:
            # the OpenAI SDK wraps httpx timeouts in its own exception type,
            # so catch APITimeoutError rather than httpx.TimeoutException
            raise TimeoutError(f"HolySheep API timed out, model={model}")
        except Exception as e:
            raise RuntimeError(f"API call failed: {e}")
```

```python
# usage example
client = HolySheepClient(api_key="YOUR_HOLYSHEEP_API_KEY")
response = client.chat_completion([
    {"role": "user", "content": "What Double 11 promotions are running?"}
])
print(f"Tokens consumed: {response.usage.total_tokens}")
```

Adding a Prometheus metrics endpoint to the MCP Server

Below is the full MCP Server implementation with Prometheus metrics exposure built in:

```python
# mcp_server_with_metrics.py
from prometheus_client import (
    Counter, Histogram, Gauge, generate_latest, CONTENT_TYPE_LATEST, start_http_server,
)
from fastapi import FastAPI, HTTPException, Response
from pydantic import BaseModel
import time

# ==================== Prometheus metric definitions ====================

REQUEST_COUNT = Counter(
    'mcp_server_requests_total',
    'Total MCP requests',
    ['endpoint', 'status']
)
REQUEST_LATENCY = Histogram(
    'mcp_server_request_duration_seconds',
    'Request latency in seconds',
    ['endpoint'],
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
)
TOKEN_USAGE = Counter(
    'mcp_server_tokens_total',
    'Total tokens consumed',
    ['model', 'type']  # type: prompt/completion
)
ACTIVE_CONNECTIONS = Gauge(
    'mcp_server_active_connections',
    'Number of active connections'
)
MODEL_RESPONSE_TIME = Histogram(
    'mcp_model_response_seconds',
    'Upstream model response time',
    ['model'],
    buckets=[0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0]
)

# ==================== MCP Server implementation ====================

app = FastAPI(title="MCP Server with Prometheus")


class MCPRequest(BaseModel):
    tool_name: str
    parameters: dict
    model: str = "gpt-4o"


class MCPResponse(BaseModel):
    result: dict
    latency_ms: float
    tokens_used: int


@app.post("/mcp/execute", response_model=MCPResponse)
async def execute_mcp_tool(request: MCPRequest):
    """Execute an MCP tool call (example: product lookup)."""
    ACTIVE_CONNECTIONS.inc()
    start_time = time.time()
    try:
        # simulated business logic
        from holy_sheep_client import HolySheepClient
        client = HolySheepClient(api_key="YOUR_HOLYSHEEP_API_KEY")

        # build the prompt
        messages = [{
            "role": "user",
            "content": f"Execute tool: {request.tool_name}, params: {request.parameters}"
        }]

        # call the upstream API and time it
        model_start = time.time()
        response = client.chat_completion(messages, model=request.model)
        model_elapsed = time.time() - model_start
        MODEL_RESPONSE_TIME.labels(model=request.model).observe(model_elapsed)

        # record token consumption
        tokens = response.usage.total_tokens
        TOKEN_USAGE.labels(model=request.model, type="total").inc(tokens)

        elapsed_ms = (time.time() - start_time) * 1000
        REQUEST_COUNT.labels(endpoint="/mcp/execute", status="success").inc()
        REQUEST_LATENCY.labels(endpoint="/mcp/execute").observe(time.time() - start_time)

        return MCPResponse(
            result={"status": "success", "data": "mock data"},
            latency_ms=round(elapsed_ms, 2),
            tokens_used=tokens
        )
    except Exception as e:
        REQUEST_COUNT.labels(endpoint="/mcp/execute", status="error").inc()
        raise HTTPException(status_code=500, detail=str(e))
    finally:
        ACTIVE_CONNECTIONS.dec()


@app.get("/metrics")
async def metrics():
    """Prometheus scrape endpoint, served from the app itself on :8000."""
    # generate_latest() returns bytes; wrap it in a plain-text Response so
    # FastAPI doesn't try to JSON-serialize it
    return Response(content=generate_latest(), media_type=CONTENT_TYPE_LATEST)


@app.get("/health")
async def health():
    """Health-check endpoint (_value is private API, acceptable for a quick probe)."""
    return {"status": "healthy", "active_connections": ACTIVE_CONNECTIONS._value.get()}


if __name__ == "__main__":
    # also expose metrics on a standalone HTTP server (port 9090);
    # start_http_server() runs in a daemon thread and returns immediately
    start_http_server(9090)
    print("Prometheus metrics exposed on :9090/metrics")

    # start the FastAPI app (port 8000)
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
```
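
Instrumenting each endpoint by hand gets repetitive as the server grows. If you prefer automatic coverage, a FastAPI HTTP middleware can feed the same two generic metrics for every route. A minimal sketch, assuming the REQUEST_COUNT and REQUEST_LATENCY objects defined above (note: it records the raw URL path as a label, so keep the route space small, see Error 2 below):

```python
import time
from fastapi import Request

@app.middleware("http")
async def record_metrics(request: Request, call_next):
    """Record request count and latency for every route automatically."""
    start = time.time()
    status = "error"
    try:
        response = await call_next(request)
        status = "success" if response.status_code < 500 else "error"
        return response
    finally:
        REQUEST_COUNT.labels(endpoint=request.url.path, status=status).inc()
        REQUEST_LATENCY.labels(endpoint=request.url.path).observe(time.time() - start)
```

If you adopt the middleware, drop the manual REQUEST_COUNT / REQUEST_LATENCY calls inside execute_mcp_tool so requests aren't counted twice.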

Prometheus alert rule configuration

Once the metrics are exposed, you need alert rules on top of them. Here is the alert configuration I actually run:

```yaml
# prometheus_alerts.yml
groups:
  - name: mcp_server_alerts
    rules:
      # Alert 1: P99 latency above 2 seconds
      - alert: MCPServerHighLatency
        expr: histogram_quantile(0.99, rate(mcp_server_request_duration_seconds_bucket[5m])) > 2
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "MCP Server P99 latency too high"
          description: "P99 latency has been {{ $value }}s for 2 minutes; the server may be overloaded"

      # Alert 2: upstream model responses timing out
      - alert: ModelResponseTimeout
        expr: >
          rate(mcp_model_response_seconds_count[5m]) > 0
          and histogram_quantile(0.95, rate(mcp_model_response_seconds_bucket[5m])) > 10
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "LLM API responses timing out"
          description: "P95 response time is above 10 seconds; check the upstream API"

      # Alert 3: request error rate above 5%
      # (aggregate both sides so the status label doesn't break vector matching)
      - alert: HighErrorRate
        expr: >
          sum by (endpoint) (rate(mcp_server_requests_total{status="error"}[5m]))
          / sum by (endpoint) (rate(mcp_server_requests_total[5m])) > 0.05
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "MCP Server error rate rising"
          description: "Error rate is {{ $value | humanizePercentage }}; check the logs"

      # Alert 4: abnormal number of active connections
      - alert: ConnectionSpike
        expr: mcp_server_active_connections > 100
        for: 30s
        labels:
          severity: warning
        annotations:
          summary: "MCP Server connection spike"
          description: "{{ $value }} active connections, approaching the processing limit"

      # Alert 5: abnormal token burn rate (bill protection)
      # rate() is per-second, so multiply by 3600 to compare against an hourly budget
      - alert: TokenConsumptionSpike
        expr: rate(mcp_server_tokens_total[1h]) * 3600 > 1000000  # over 1M tokens per hour
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Abnormal token consumption rate"
          description: "The current burn rate may blow past the daily budget; check for abusive traffic"
```
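
One piece the alert file doesn't cover: Prometheus still needs a scrape job pointing at the server. A minimal sketch of the relevant prometheus.yml section (the job name and 15s interval are my own choices). Note that Prometheus itself listens on port 9090 by default, so on a single host either move the standalone metrics server or simply scrape the FastAPI port:

```yaml
# prometheus.yml (excerpt, minimal sketch)
global:
  scrape_interval: 15s

rule_files:
  - prometheus_alerts.yml

scrape_configs:
  - job_name: mcp_server
    static_configs:
      - targets: ["localhost:8000"]  # FastAPI app serving /metrics
```

Before reloading, `promtool check rules prometheus_alerts.yml` catches syntax mistakes in the alert file.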

Grafana dashboard configuration

To give the team instant visibility into system state, I built a real-time Grafana dashboard. The core panel configuration:

```json
{
  "panels": [
    {
      "title": "MCP Server QPS",
      "type": "graph",
      "targets": [
        {
          "expr": "rate(mcp_server_requests_total[1m])",
          "legendFormat": "{{endpoint}} - {{status}}"
        }
      ]
    },
    {
      "title": "P50/P95/P99 latency",
      "type": "graph",
      "targets": [
        {"expr": "histogram_quantile(0.50, rate(mcp_server_request_duration_seconds_bucket[5m]))", "legendFormat": "P50"},
        {"expr": "histogram_quantile(0.95, rate(mcp_server_request_duration_seconds_bucket[5m]))", "legendFormat": "P95"},
        {"expr": "histogram_quantile(0.99, rate(mcp_server_request_duration_seconds_bucket[5m]))", "legendFormat": "P99"}
      ]
    },
    {
      "title": "Upstream model response time (by model)",
      "type": "graph",
      "targets": [
        {"expr": "histogram_quantile(0.95, rate(mcp_model_response_seconds_bucket[5m]))", "legendFormat": "{{model}} P95"}
      ]
    },
    {
      "title": "Token consumption rate (per hour)",
      "type": "graph",
      "targets": [
        {"expr": "rate(mcp_server_tokens_total[1h]) * 3600", "legendFormat": "{{model}} {{type}}"}
      ],
      "thresholds": {
        "mode": "absolute",
        "steps": [
          {"color": "green", "value": null},
          {"color": "yellow", "value": 500000},
          {"color": "red", "value": 1000000}
        ]
      }
    }
  ]
}
```
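
At Double 11 traffic levels, evaluating histogram_quantile on every dashboard refresh gets expensive. One option is to precompute the quantiles with Prometheus recording rules and point the panels at the recorded series instead. A sketch (the rule naming convention is my own):

```yaml
# recording_rules.yml (sketch)
groups:
  - name: mcp_server_recording
    rules:
      - record: mcp:request_duration_seconds:p99_5m
        expr: histogram_quantile(0.99, rate(mcp_server_request_duration_seconds_bucket[5m]))
      - record: mcp:request_duration_seconds:p95_5m
        expr: histogram_quantile(0.95, rate(mcp_server_request_duration_seconds_bucket[5m]))
```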

Common errors and how to fix them

I hit plenty of pitfalls deploying this in production. Here are the 3 most common errors and the code that fixes them:

Error 1: Prometheus can't scrape metrics (Connection Refused)

Symptom: `curl localhost:9090/metrics` returns "Connection refused".

Cause: start_http_server(9090) only ever runs inside the `if __name__ == "__main__":` block. Contrary to what you might expect, the call itself does not block: it starts a daemon thread and returns immediately. The real trap is launching the app through the uvicorn CLI, which imports the module and never executes that block, so the metrics server never starts.

❌ Broken deployment

```bash
# imports mcp_server_with_metrics but skips the __main__ block,
# so start_http_server(9090) is never called
uvicorn mcp_server_with_metrics:app --host 0.0.0.0 --port 8000
```

✅ Fix: start the metrics server on application startup

```python
from prometheus_client import start_http_server

@app.on_event("startup")
def start_metrics_server():
    start_http_server(9090)  # non-blocking: serves from a daemon thread
    print("Metrics server running on :9090")
```

This way the metrics port comes up no matter how the app is launched. Simpler still: drop the second port entirely and let Prometheus scrape the /metrics route already exposed on :8000.

Error 2: label cardinality explosion (high cardinality)

Symptom: Prometheus memory balloons and it complains that one metric has far too many series.

Cause: using user_id or request_id as label values.

❌ Wrong (high cardinality)

```python
# every distinct label value mints a brand-new time series
# (assumes a counter declared with these label names; don't declare one)
REQUEST_COUNT.labels(
    endpoint="/mcp/execute",
    user_id=user.id,        # dangerous: one series per user
    request_id=request.id   # worse: one series per request
).inc()
```

✅ Right: only record dimensions you actually aggregate over

```python
REQUEST_COUNT.labels(
    endpoint="/mcp/execute",
    tool_name=request.tool_name,  # bounded enumeration, safe
    status="success"
).inc()
```

To trace individual requests, use a logger, not metrics:

```python
import structlog

logger = structlog.get_logger()
logger.info("request_completed", user_id=user.id, request_id=request.id)
```

Error 3: "memory leak" (histogram series accumulate and never expire)

Symptom: Grafana suddenly shows latency dropping to zero while the service is actually fine.

Cause: histogram buckets are cumulative counters that never expire or reset on their own, so a sudden zero means something reset the series, typically a restarting process re-registering its collectors in the shared global registry, or code that manually zeroes metric internals.

❌ Wrong: no namespace, no registry isolation

```python
REQUEST_LATENCY = Histogram(
    'request_duration',  # no namespace: easy to clash with other modules
    'Request latency'
)
```

✅ Right: manage an explicit Registry (and never reset it)

```python
from prometheus_client import CollectorRegistry, Histogram

# a dedicated registry isolates these metrics from the global default REGISTRY
metrics_registry = CollectorRegistry()

REQUEST_LATENCY = Histogram(
    'mcp_server_request_duration_seconds',  # namespaced, collision-free
    'Request latency in seconds',
    ['endpoint'],
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5],
    registry=metrics_registry
)
```

One thing I do not recommend, despite seeing it suggested: periodically "cleaning up" a histogram by writing to its private `_sum` and `_count` attributes. Those are internal (and don't even exist on the parent object of a labeled histogram), and a manual reset is exactly what produces the latency-drops-to-zero artifact. Prometheus expects cumulative series; `rate()` and `increase()` already compensate for restarts. If memory genuinely keeps growing, the culprit is almost always label cardinality (see Error 2), not histogram accumulation.
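
One practical consequence of a custom registry: the /metrics endpoint must export that registry explicitly, otherwise you silently serve the default one. A minimal sketch reusing the FastAPI app from earlier:

```python
from fastapi import Response
from prometheus_client import generate_latest, CONTENT_TYPE_LATEST

@app.get("/metrics")
async def metrics():
    # pass the custom registry explicitly; generate_latest() with no
    # argument would export the global default REGISTRY instead
    return Response(content=generate_latest(metrics_registry),
                    media_type=CONTENT_TYPE_LATEST)
```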

Field notes: the monitoring strategy during Double 11

Back to the Double 11 scene from the opening. My final monitoring setup centered on these key metrics, each backed by one of the alert rules above:

  1. P99 request latency (alert at 2 s)
  2. Upstream model P95 response time (alert at 10 s)
  3. Request error rate (alert at 5%)
  4. Active connections (alert at 100)
  5. Token consumption (alert at 1M tokens/hour)

In the HolySheep AI console I also found that the API provides its own usage-statistics API; combined with my Prometheus collection, this gave end-to-end observability. On the day of the sale, request volume ran at 47 times the daily norm, yet the system stayed stable with P99 latency under 800 ms throughout.
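
During the event itself I checked the live P99 from scripts rather than eyeballing Grafana. A small sketch using Prometheus's HTTP query API (assumes the `requests` package and a Prometheus server at localhost:9090; adjust to your setup):

```python
import requests

PROM_URL = "http://localhost:9090"  # assumption: local Prometheus server

def current_p99() -> float:
    """Query the live P99 request latency via the Prometheus HTTP API."""
    query = 'histogram_quantile(0.99, rate(mcp_server_request_duration_seconds_bucket[5m]))'
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query}, timeout=5)
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    # take the worst endpoint; each result carries a [timestamp, value] pair
    return max(float(r["value"][1]) for r in results) if results else 0.0

if __name__ == "__main__":
    print(f"P99 latency: {current_p99():.3f}s")
```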

Conclusion: monitoring is the foundation of a stable MCP Server

A system without monitoring is like driving blindfolded: you can feel the speed, but you can't tell whether you're about to hit a wall. With the approach in this article you can:

  1. Expose a Prometheus metrics endpoint on your MCP Server
  2. Define sensible alert rules (latency, error rate, token consumption)
  3. Build a Grafana dashboard for at-a-glance visibility
  4. Avoid the 3 most common deployment mistakes

When picking an upstream API, I'd favor providers with direct mainland connectivity and stable latency. HolySheep AI's sub-50 ms latency and ¥1 = $1 exchange rate let me put my energy into business optimization instead of babysitting infrastructure.

A monitoring stack is not built in one go. Start from the minimal viable setup in this article, layer in business-specific metrics over time, and grow it into a complete observability system.

👉 Register for HolySheep AI for free, claim the first-month credit, and try a stable, low-latency LLM API for yourself.