AI API Monitoring and Alerting Guide: AI Service Observability with Prometheus / Grafana / PagerDuty in 2026

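AI API calls are expensive, slow, and error-prone, and running them without monitoring is flying blind. In 2026, as AI services take on a larger role in production, monitoring and alerting have become a necessity. This article walks through building an observability stack for AI APIs with Prometheus and Grafana.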

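⚠️ Core metrics for AI API monitoring: token consumption (drives cost directly), API latency (user experience), error rate (service quality), and QPS (capacity planning). Without monitoring, the end-of-month bill may come as a shock.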

AI API Monitoring Metrics

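Category	Metrics	Importance
Cost	Token consumption (input/output), daily cost, monthly cost	⭐⭐⭐⭐⭐
Performance	API latency (P50/P95/P99), throughput	⭐⭐⭐⭐⭐
Quality	Error rate, 429 rate-limit count, 401 authentication failures	⭐⭐⭐⭐⭐
Capacity	QPS, concurrent connections, utilization	⭐⭐⭐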

Prometheus Instrumentation

# pip install prometheus-client

from prometheus_client import Counter, Histogram
import time

# Metric definitions
api_requests_total = Counter(
    'ai_api_requests_total',
    'Total AI API requests',
    ['model', 'endpoint', 'status']
)

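# The default buckets top out at 10 s, and histogram_quantile cannot report a
# value above the highest finite bucket, so a "P99 > 10s" alert would never fire.
# The wider buckets below are a starting point; tune them to your latencies.
api_request_duration = Histogram(
    'ai_api_request_duration_seconds',
    'AI API request duration',
    ['model', 'endpoint'],
    buckets=(0.25, 0.5, 1, 2.5, 5, 10, 20, 30, 60, 120)
)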

token_usage_input = Counter(
    'ai_token_usage_input_total',
    'Total input tokens consumed',
    ['model']
)

token_usage_output = Counter(
    'ai_token_usage_output_total',
    'Total output tokens consumed',
    ['model']
)

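# prometheus_client appends `_total` to counter names that lack it, so this
# metric is exposed as `ai_api_cost_total_yuan_total`; use that name in queries.
api_cost_total = Counter(
    'ai_api_cost_total_yuan',
    'Total API cost in yuan',
    ['model']
)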

# Record metrics around each API call.
# `client` is assumed to be an already-initialised SDK client; `estimate_tokens`
# and `calculate_cost` are helpers you supply for your provider and price list.
def call_ai_api_with_metrics(model: str, messages: list):
    start = time.perf_counter()  # monotonic clock, safe for measuring durations

    try:
        response = client.messages.create(
            model=model,
            messages=messages
        )
    except Exception:
        # Count the failure, still record how long it took, then re-raise
        api_requests_total.labels(model=model, endpoint='chat', status='error').inc()
        api_request_duration.labels(model=model, endpoint='chat').observe(time.perf_counter() - start)
        raise

    duration = time.perf_counter() - start

    # Request count
    api_requests_total.labels(model=model, endpoint='chat', status='success').inc()

    # Latency
    api_request_duration.labels(model=model, endpoint='chat').observe(duration)

    # Token counts (estimated here; if the response reports exact usage, prefer that)
    input_tokens = estimate_tokens(messages)
    output_tokens = estimate_tokens(response.content)

    token_usage_input.labels(model=model).inc(input_tokens)
    token_usage_output.labels(model=model).inc(output_tokens)

    # Cost
    cost = calculate_cost(model, input_tokens, output_tokens)
    api_cost_total.labels(model=model).inc(cost)

    return response
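
The wrapper relies on a `calculate_cost` helper that is not shown. A minimal sketch of one, assuming per-million-token pricing; the model name and prices in `PRICES_YUAN_PER_MILLION` are placeholders for illustration, not real list prices:

```python
# Hypothetical price table: yuan per 1,000,000 tokens.
# Replace the entries with your provider's actual prices.
PRICES_YUAN_PER_MILLION = {
    "example-model": {"input": 2.0, "output": 8.0},
}


def calculate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in yuan for a single request."""
    try:
        price = PRICES_YUAN_PER_MILLION[model]
    except KeyError:
        # Fail loudly so a missing price does not silently hide spend
        raise ValueError(f"no price configured for model {model!r}") from None
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


if __name__ == "__main__":
    # 1M input tokens at ¥2 plus 0.5M output tokens at ¥8
    print(calculate_cost("example-model", 1_000_000, 500_000))  # 6.0
```

Raising on an unknown model keeps a missing price from being recorded as zero spend. Inside the request path you may prefer to catch the error and log it so that a pricing gap never discards a successful response.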

Grafana Dashboard Configuration

# Grafana dashboard JSON (key panel configurations)

# Panel 1: API request rate
{
  "title": "AI API request rate",
  "type": "timeseries",
  "targets": [{
    "expr": "sum by (model, status) (rate(ai_api_requests_total[5m]))",
    "legendFormat": "{{model}} - {{status}}"
  }],
  "fieldConfig": {
    "defaults": {
      "unit": "reqps",
      "color": {"mode": "palette-classic"}
    }
  }
}

# Panel 2: Token consumption (cumulative)
{
  "title": "Token consumption (cumulative)",
  "type": "timeseries",
  "targets": [{
    "expr": "ai_token_usage_input_total",
    "legendFormat": "Input tokens - {{model}}"
  }, {
    "expr": "ai_token_usage_output_total",
    "legendFormat": "Output tokens - {{model}}"
  }]
}

# Panel 3: API latency P99
{
  "title": "API latency P99",
  "type": "gauge",
  "targets": [{
    "expr": "histogram_quantile(0.99, rate(ai_api_request_duration_seconds_bucket[5m]))",
    "legendFormat": "P99 latency"
  }],
  "fieldConfig": {
    "defaults": {
      "unit": "s",
      "thresholds": {
        "steps": [
          {"value": 0, "color": "green"},
          {"value": 5, "color": "yellow"},
          {"value": 10, "color": "red"}
        ]
      }
    }
  }
}

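# Panel 4: Estimated daily cost
# The raw counter is cumulative since process start, so use increase() over 24h.
# The counter is exposed with a `_total` suffix added by prometheus_client.
{
  "title": "Estimated daily cost (¥)",
  "type": "stat",
  "targets": [{
    "expr": "sum(increase(ai_api_cost_total_yuan_total[24h]))",
    "legendFormat": "Cost, last 24h"
  }],
  "options": {
    "colorMode": "value"
  },
  "fieldConfig": {
    "defaults": {
      "thresholds": {
        "steps": [
          {"value": 0, "color": "green"},
          {"value": 100, "color": "yellow"},
          {"value": 500, "color": "red"}
        ]
      }
    }
  }
}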
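
The panels above are fragments; to load them, they need to sit inside a dashboard object with an `id` and a `gridPos` each. The sketch below shows one way to assemble them in Python. `build_dashboard` is a hypothetical helper, and it fills in only a small subset of Grafana's dashboard JSON model (`title`, `panels`, `time`, `gridPos`), so an import may still need further fields such as `schemaVersion` or `uid`.

```python
import json


def build_dashboard(title: str, panels: list, columns: int = 2, height: int = 8) -> dict:
    """Wrap panel dicts in a dashboard dict, laid out left to right on Grafana's 24-unit-wide grid."""
    width = 24 // columns
    placed = []
    for index, panel in enumerate(panels):
        row, col = divmod(index, columns)
        placed.append({
            **panel,
            "id": index + 1,  # panel ids must be unique within a dashboard
            "gridPos": {"x": col * width, "y": row * height, "w": width, "h": height},
        })
    return {"title": title, "panels": placed, "time": {"from": "now-6h", "to": "now"}}


if __name__ == "__main__":
    # Placeholder panels; in practice, pass the panel dicts from this section
    demo = build_dashboard("AI API", [{"title": "Requests"}, {"title": "Tokens"}, {"title": "P99"}])
    print(json.dumps(demo, indent=2))
```

Writing the result to a file under the mounted dashboards directory keeps the layout in version control instead of in hand-edited JSON.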

Alertmanager Alerting Rules

# alertmanager.yml
global:
  smtp_smarthost: 'smtp.qq.com:587'
  smtp_from: '[email protected]'

route:
  group_by: ['alertname']
  receiver: 'email-alerts'

receivers:
- name: 'email-alerts'
  email_configs:
  - to: '[email protected]'

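# Prometheus alerting rules - prometheus_rules.yml
groups:
- name: ai_api_alerts
  rules:
  # Error rate above 5%.
  # Both sides are aggregated by model: dividing the raw vectors would only pair
  # error series with themselves and always yield 1.
  - alert: HighAPIErrorRate
    expr: |
      sum by (model) (rate(ai_api_requests_total{status="error"}[5m]))
      / sum by (model) (rate(ai_api_requests_total[5m])) > 0.05
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "AI API error rate above 5%"
      description: "{{ $labels.model }} error rate is {{ $value | humanizePercentage }}"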

  # Daily spend above ¥500 (the cost counter is exposed with a `_total` suffix)
  - alert: HighDailyCost
    expr: |
      sum(increase(ai_api_cost_total_yuan_total[24h])) > 500
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "AI API daily cost above ¥500"
      description: "Cost over the past 24 hours has reached ¥{{ $value | printf \"%.2f\" }}"

  # P99 latency above 10 seconds
  - alert: HighAPILatency
    expr: |
      histogram_quantile(0.99, sum by (le, model) (rate(ai_api_request_duration_seconds_bucket[5m]))) > 10
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "AI API P99 latency above 10 seconds"
      description: "{{ $labels.model }} P99 latency is {{ $value }}s"

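  # Abnormal per-minute token consumption (possibly abuse).
  # rate() is per second, so increase() is used to compare against a
  # per-minute budget of 1M input tokens.
  - alert: TokenSpike
    expr: |
      increase(ai_token_usage_input_total[1m]) > 1000000
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Abnormal token consumption, possible abuse"
      description: "Input tokens consumed in the last minute: {{ $value }}"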
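
An error-rate alert divides one vector by another, and PromQL pairs series only when their label sets are identical. The sketch below imitates that matching rule in plain Python with made-up rates, to show why dividing error series by all series without aggregating gives 1, while summing by `model` first gives the actual ratio. It is a simplified model of the matching behaviour, not Prometheus code.

```python
from collections import defaultdict

# Per-second request rates keyed by (model, status). Made-up numbers.
rates = {
    ("model-a", "success"): 9.0,
    ("model-a", "error"): 1.0,
}


def divide_matching_all_labels(left: dict, right: dict) -> dict:
    """Mimic default one-to-one matching: only identical label sets pair up."""
    return {labels: value / right[labels] for labels, value in left.items() if labels in right}


def sum_by_model(series: dict) -> dict:
    """Mimic `sum by (model) (...)`: drop every label except model."""
    totals = defaultdict(float)
    for (model, _status), value in series.items():
        totals[model] += value
    return dict(totals)


errors = {labels: value for labels, value in rates.items() if labels[1] == "error"}

# Error series only match themselves, so every ratio is 1
naive = divide_matching_all_labels(errors, rates)

# Aggregating both sides first gives errors / all requests per model
totals = sum_by_model(rates)
aggregated = {model: value / totals[model] for model, value in sum_by_model(errors).items()}

print(naive)       # {('model-a', 'error'): 1.0}
print(aggregated)  # {'model-a': 0.1}
```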

Complete Monitoring Architecture

# docker-compose.yml (the full monitoring stack)
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus_rules.yml:/etc/prometheus/rules.yml
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      # Prometheus has no command-line flag for rule files; list
      # /etc/prometheus/rules.yml under `rule_files:` in prometheus.yml instead.

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=your_password
    volumes:
      - ./grafana/dashboards:/var/lib/grafana/dashboards
      - ./grafana/datasources:/etc/grafana/provisioning/datasources

  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml

  # Your AI API service
  your_ai_app:
    build: .
    ports:
      - "8000:8000"
    environment:
      - HOLYSHEEP_API_KEY=${HOLYSHEEP_API_KEY}
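
The compose file mounts `./prometheus.yml`, but its contents are not shown. As a rough sketch, the script below writes a minimal one that loads the rule file, points at Alertmanager, and scrapes the application. It assumes the application serves Prometheus metrics at `/metrics` on port 8000 and that the compose service names resolve as hostnames; the job name `ai_api` and the 15s interval are arbitrary choices.

```python
from pathlib import Path

# Minimal Prometheus configuration matching the compose service names above.
PROMETHEUS_YML = """\
global:
  scrape_interval: 15s

rule_files:
  - /etc/prometheus/rules.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

scrape_configs:
  - job_name: ai_api
    static_configs:
      - targets: ['your_ai_app:8000']
"""


def write_prometheus_config(path: str = "prometheus.yml") -> Path:
    """Write the configuration to `path` and return the resulting Path."""
    target = Path(path)
    target.write_text(PROMETHEUS_YML, encoding="utf-8")
    return target


if __name__ == "__main__":
    print(write_prometheus_config())
```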

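Recommended Metrics for Key Dashboard Panels

Panel	What it shows	Suggested threshold
Request rate (QPS)	Requests per second, per model	Investigate above 100 QPS
Token consumption	Cumulative input/output tokens	Alert when daily spend > ¥1000
P50/P95/P99 latency	API response time distribution	Alert when P99 > 10s
Error rate	Broken down by error type	Alert above 5%
Cost trend	Daily/weekly/monthly cost curves	Investigate when daily average > ¥500
429 rate-limit count	Number of rate-limit hits	Alert above 10 per minute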
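
Panels that break errors down by type, or count 429 responses, need more than a success/error split in the `status` label. One possible mapping from HTTP status code to label value is sketched below; `classify_status` and the label values it returns are illustrative choices, not an established convention.

```python
from typing import Optional


def classify_status(http_status: Optional[int]) -> str:
    """Map an HTTP status code to a low-cardinality value for the `status` label."""
    if http_status is None:
        return "error"  # no HTTP response at all (timeout, connection failure)
    if 200 <= http_status < 300:
        return "success"
    if http_status == 429:
        return "rate_limited"
    if http_status in (401, 403):
        return "auth_error"
    if http_status >= 500:
        return "server_error"
    return "client_error"


if __name__ == "__main__":
    for code in (200, 401, 429, 503, None):
        print(code, classify_status(code))
```

With labels like these, a 429 panel could query the request counter filtered on `status="rate_limited"`, and an error-rate expression would need to select `status!="success"` rather than a single error value.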
