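AI API Monitoring and Alerting Guide: Building AI Service Observability with Prometheus / Grafana / PagerDuty in 2026
AI API calls are expensive, slow, and error-prone, so running them without monitoring is flying blind. In 2026, as AI services carry more weight in production, monitoring and alerting have become a necessity. This article walks through building an observability stack for AI APIs with Prometheus + Grafana.
⚠️ Core metrics for AI API monitoring: token consumption (drives cost directly), API latency (user experience), error rate (service quality), and QPS (capacity planning). Without monitoring, the end-of-month bill may come as a shock.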
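AI API monitoring metrics
| Metric type | Specific metrics | Importance |
|---|---|---|
| Cost | Token consumption (input/output), daily cost, monthly cost | ⭐⭐⭐⭐⭐ |
| Performance | API latency (P50/P95/P99), throughput | ⭐⭐⭐⭐⭐ |
| Quality | Error rate, 429 rate-limit count, 401 authentication failures | ⭐⭐⭐⭐⭐ |
| Capacity | QPS, concurrent connections, utilization | ⭐⭐⭐ |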
Prometheus instrumentation
# pip install prometheus-client
import time

from prometheus_client import Counter, Histogram

# Metric definitions
api_requests_total = Counter(
    'ai_api_requests_total',
    'Total AI API requests',
    ['model', 'endpoint', 'status']
)
api_request_duration = Histogram(
    'ai_api_request_duration_seconds',
    'AI API request duration',
    ['model', 'endpoint'],
    # The default buckets stop at 10s, so a P99 above 10s could never be
    # reported. These wider buckets are a suggested starting point.
    buckets=(0.5, 1, 2, 5, 10, 20, 30, 60, 120)
)
token_usage_input = Counter(
    'ai_token_usage_input_total',
    'Total input tokens consumed',
    ['model']
)
token_usage_output = Counter(
    'ai_token_usage_output_total',
    'Total output tokens consumed',
    ['model']
)
api_cost_total = Counter(
    # prometheus_client appends "_total" to counter names that do not end
    # with it, so this series is exposed as ai_api_cost_total_yuan_total.
    'ai_api_cost_total_yuan',
    'Total API cost in yuan',
    ['model']
)

# Record metrics around each API call.
# `client`, `estimate_tokens`, and `calculate_cost` are not defined in this
# snippet; they are assumed to be provided elsewhere in your application.
def call_ai_api_with_metrics(model: str, messages: list):
    start = time.perf_counter()
    status = 'error'  # assume failure until the call returns
    try:
        response = client.messages.create(
            model=model,
            messages=messages
        )
        status = 'success'
    finally:
        # Record the request and its latency exactly once, on success or
        # failure; any exception still propagates to the caller.
        duration = time.perf_counter() - start
        api_requests_total.labels(model=model, endpoint='chat', status=status).inc()
        api_request_duration.labels(model=model, endpoint='chat').observe(duration)

    # Record token usage (estimated per model). If your SDK returns exact
    # usage counts in the response, prefer those over estimates.
    input_tokens = estimate_tokens(messages)
    output_tokens = estimate_tokens(response.content)
    token_usage_input.labels(model=model).inc(input_tokens)
    token_usage_output.labels(model=model).inc(output_tokens)

    # Record cost
    cost = calculate_cost(model, input_tokens, output_tokens)
    api_cost_total.labels(model=model).inc(cost)
    return response
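The instrumentation code calls `calculate_cost` without defining it. One possible shape for that helper is sketched below. The model names and per-million-token prices in `PRICES_YUAN_PER_MILLION` are placeholders invented for illustration, not real pricing, and returning `0.0` for an unknown model is a design assumption rather than a recommendation.

```python
from decimal import Decimal

# Hypothetical prices in yuan per one million tokens: (input, output).
# Replace these with your provider's actual price sheet.
PRICES_YUAN_PER_MILLION = {
    "example-small": (Decimal("1.00"), Decimal("4.00")),
    "example-large": (Decimal("20.00"), Decimal("80.00")),
}


def calculate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the cost of one call in yuan, or 0.0 for an unknown model."""
    if model not in PRICES_YUAN_PER_MILLION:
        return 0.0
    in_price, out_price = PRICES_YUAN_PER_MILLION[model]
    # Decimal keeps the price arithmetic exact before converting to float
    # for the Prometheus counter.
    cost = (in_price * input_tokens + out_price * output_tokens) / Decimal(1_000_000)
    return float(cost)
```

Returning `0.0` keeps a missing price entry from breaking the request path, at the price of silently undercounting; raising an error or logging the unknown model may suit stricter cost tracking better.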
Grafana dashboard configuration
# Grafana dashboard JSON (key panel configurations)
# Panel 1: API request rate
{
  "title": "AI API Request Rate",
  "type": "timeseries",
  "targets": [{
    "expr": "sum by (model, status) (rate(ai_api_requests_total[5m]))",
    "legendFormat": "{{model}} - {{status}}"
  }],
  "fieldConfig": {
    "defaults": {
      "unit": "reqps",
      "color": {"mode": "palette-classic"}
    }
  }
}
# Panel 2: token consumption (cumulative)
{
  "title": "Token Consumption (Cumulative)",
  "type": "timeseries",
  "targets": [{
    "expr": "ai_token_usage_input_total",
    "legendFormat": "Input tokens - {{model}}"
  }, {
    "expr": "ai_token_usage_output_total",
    "legendFormat": "Output tokens - {{model}}"
  }]
}
# Panel 3: API latency P99 (aggregated across models and endpoints)
{
  "title": "API Latency P99",
  "type": "gauge",
  "targets": [{
    "expr": "histogram_quantile(0.99, sum by (le) (rate(ai_api_request_duration_seconds_bucket[5m])))",
    "legendFormat": "P99 latency"
  }],
  "fieldConfig": {
    "defaults": {
      "unit": "s",
      "thresholds": {
        "steps": [
          {"value": 0, "color": "green"},
          {"value": 5, "color": "yellow"},
          {"value": 10, "color": "red"}
        ]
      }
    }
  }
}
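# Panel 4: estimated daily cost
# (prometheus_client exposes the cost counter with a _total suffix;
# increase() over 24h gives daily spend instead of the all-time total)
{
  "title": "Estimated Daily Cost (¥)",
  "type": "stat",
  "targets": [{
    "expr": "sum(increase(ai_api_cost_total_yuan_total[24h]))",
    "legendFormat": "Cost over the last 24h"
  }],
  "options": {
    "colorMode": "value"
  },
  "fieldConfig": {
    "defaults": {
      "thresholds": {
        "steps": [
          {"value": 0, "color": "green"},
          {"value": 100, "color": "yellow"},
          {"value": 500, "color": "red"}
        ]
      }
    }
  }
}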
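The P99 panel depends on how `histogram_quantile` estimates a quantile from cumulative bucket counts. The sketch below is a simplified Python re-implementation of that interpolation, written for illustration only and not taken from Prometheus itself; the bucket counts are made up. It suggests why the largest finite bucket caps the highest latency the panel can report.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Approximate a quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound, contain at least one finite
    bucket, and end with an upper bound of math.inf.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for i, (bound, count) in enumerate(buckets):
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls past the last finite bucket, so the
                # best available answer is that bucket's upper bound.
                return buckets[i - 1][0]
            if count == prev_count:
                return bound
            # Linear interpolation inside the matching bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Buckets that stop at 10s: 5 of 100 requests were slower than 10s,
# yet the estimate cannot exceed 10.
capped = histogram_quantile(0.99, [(5, 90), (10, 95), (math.inf, 100)])

# Adding a 30s bucket lets the estimate move past 10s.
wider = histogram_quantile(0.99, [(5, 90), (10, 95), (30, 100), (math.inf, 100)])
```

With buckets capped at 10 seconds, a threshold of "P99 above 10s" can never be crossed, so the histogram's bucket boundaries should extend beyond any latency you intend to alert on.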
Alertmanager alerting rules
# alertmanager.yml
global:
  smtp_smarthost: 'smtp.qq.com:587'
  smtp_from: '[email protected]'
  # Most SMTP providers also require smtp_auth_username and smtp_auth_password
route:
  group_by: ['alertname']
  receiver: 'email-alerts'
receivers:
  - name: 'email-alerts'
    email_configs:
      - to: '[email protected]'
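For teams that page through PagerDuty rather than email, alerts can also be forwarded as PagerDuty Events API v2 events. The sketch below is one way to build and send such an event from Python, under stated assumptions: `routing_key` is your own integration key, and the function names, default `source`, and use of the alert name as `dedup_key` are choices made for illustration. Alertmanager also has a built-in PagerDuty receiver type, which is usually the simpler route.

```python
import json
import urllib.request

PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"
ALLOWED_SEVERITIES = {"critical", "error", "warning", "info"}


def build_pagerduty_event(routing_key: str, alertname: str, summary: str,
                          severity: str = "critical",
                          source: str = "ai-api-monitoring") -> dict:
    """Build an Events API v2 'trigger' event for one alert."""
    if severity not in ALLOWED_SEVERITIES:
        raise ValueError(f"unsupported severity: {severity}")
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        # Reusing the alert name as dedup key groups repeats into one incident.
        "dedup_key": alertname,
        "payload": {"summary": summary, "source": source, "severity": severity},
    }


def send_pagerduty_event(event: dict, timeout: float = 10.0) -> int:
    """POST the event to PagerDuty and return the HTTP status code."""
    request = urllib.request.Request(
        PAGERDUTY_EVENTS_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Example event; the routing key here is a placeholder, not a real key.
event = build_pagerduty_event(
    "YOUR_INTEGRATION_KEY", "HighAPIErrorRate", "AI API error rate above 5%"
)
```

Calling `send_pagerduty_event(event)` performs the network request, so it is left out of the example above.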
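# Prometheus alerting rules - prometheus_rules.yml
groups:
  - name: ai_api_alerts
    rules:
      # Error rate above 5%
      # Both sides are aggregated by model before dividing; otherwise the
      # division only matches error series against themselves.
      - alert: HighAPIErrorRate
        expr: |
          sum by (model) (rate(ai_api_requests_total{status="error"}[5m]))
            / sum by (model) (rate(ai_api_requests_total[5m])) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "AI API error rate above 5%"
          description: "{{ $labels.model }} error rate reached {{ $value | humanizePercentage }}"
      # Daily cost above ¥500
      # prometheus_client exposes the cost counter with a _total suffix.
      - alert: HighDailyCost
        expr: |
          sum(increase(ai_api_cost_total_yuan_total[24h])) > 500
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "AI API daily cost above ¥500"
          description: "Cost over the past 24 hours has reached ¥{{ $value | printf \"%.2f\" }}"
      # P99 latency above 10 seconds
      # Requires histogram buckets that extend beyond 10s.
      - alert: HighAPILatency
        expr: |
          histogram_quantile(0.99, sum by (le, model) (rate(ai_api_request_duration_seconds_bucket[5m]))) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "AI API P99 latency above 10 seconds"
          description: "{{ $labels.model }} P99 latency reached {{ $value }}s"
      # Abnormal per-minute token consumption (possible abuse)
      # increase() counts tokens over the window; rate() alone is per second.
      - alert: TokenSpike
        expr: |
          sum by (model) (increase(ai_token_usage_input_total[1m])) > 1000000
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Abnormal token consumption, possible abuse"
          description: "Input tokens consumed in the last minute: {{ $value }}"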
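When an error ratio is computed in PromQL, a division matches series on identical label sets, so the way the two sides are aggregated changes the result. The sketch below imitates that matching in plain Python with made-up request rates; it is a simplified model of the behavior, not Prometheus code, and the model names are hypothetical.

```python
from collections import defaultdict

# Hypothetical per-second request rates keyed by (model, status).
rates = {
    ("model-a", "success"): 9.5,
    ("model-a", "error"): 0.5,
    ("model-b", "success"): 20.0,
}


def naive_ratio(rates: dict) -> dict:
    """Divide each error series by the series with the same full label set.

    The only series sharing the labels (model, status="error") is the error
    series itself, so every result is 1.0.
    """
    return {
        labels: rate / rates[labels]
        for labels, rate in rates.items()
        if labels[1] == "error"
    }


def error_ratio_by_model(rates: dict) -> dict:
    """Sum errors and totals per model first, then divide."""
    errors = defaultdict(float)
    totals = defaultdict(float)
    for (model, status), rate in rates.items():
        totals[model] += rate
        if status == "error":
            errors[model] += rate
    # Models with no error series produce no result, as in PromQL.
    return {m: errors[m] / totals[m] for m in errors if totals[m] > 0}


naive = naive_ratio(rates)
aggregated = error_ratio_by_model(rates)
```

Here `naive` reports a ratio of 1.0 for `model-a`, which would trip a 5% threshold whenever any error occurs, while `aggregated` gives the intended 5%.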
Complete monitoring architecture
# docker-compose.yml (full monitoring stack)
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus_rules.yml:/etc/prometheus/rules.yml
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      # Prometheus has no command-line flag for rule files; list
      # /etc/prometheus/rules.yml under rule_files: in prometheus.yml instead.
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=your_password
    volumes:
      - ./grafana/dashboards:/var/lib/grafana/dashboards
      - ./grafana/datasources:/etc/grafana/provisioning/datasources
  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
  # Your AI API service
  your_ai_app:
    build: .
    ports:
      - "8000:8000"
    environment:
      - HOLYSHEEP_API_KEY=${HOLYSHEEP_API_KEY}
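Recommended metrics for key dashboard panels
| Panel | What it shows | Suggested alert threshold |
|---|---|---|
| Request rate (QPS) | Requests per second, by model | Watch above 100 QPS |
| Token consumption | Cumulative input/output tokens | Alert when daily spend exceeds ¥1000 |
| P50/P95/P99 latency | API response time distribution | Alert when P99 > 10s |
| Error rate | Broken down by error type | Alert above 5% |
| Cost trend | Daily/weekly/monthly cost curves | Watch when the daily average exceeds ¥500 |
| 429 rate-limit count | Number of rate-limit hits | Alert above 10 per minute |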
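The thresholds in the table can be kept in one place and checked against a snapshot of current values. The sketch below is a minimal illustration of that idea; the metric keys such as `p99_latency_seconds` and the `Threshold` type are names invented here, and in a real deployment these checks would normally live in Prometheus alerting rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str   # hypothetical key used in the snapshot dict
    limit: float  # value above which the threshold is crossed
    action: str   # "alert" or "watch", following the table


THRESHOLDS = [
    Threshold("qps", 100, "watch"),
    Threshold("daily_cost_yuan", 1000, "alert"),
    Threshold("p99_latency_seconds", 10, "alert"),
    Threshold("error_rate", 0.05, "alert"),
    Threshold("daily_avg_cost_yuan", 500, "watch"),
    Threshold("rate_limit_hits_per_minute", 10, "alert"),
]


def evaluate(snapshot: dict[str, float]) -> list[str]:
    """Return one finding per threshold that the snapshot exceeds."""
    findings = []
    for t in THRESHOLDS:
        value = snapshot.get(t.metric)
        # Metrics missing from the snapshot are skipped, not treated as zero.
        if value is not None and value > t.limit:
            findings.append(f"{t.action}: {t.metric}={value} exceeds {t.limit}")
    return findings


findings = evaluate({"qps": 120, "error_rate": 0.02, "p99_latency_seconds": 12.5})
```

Findings come back in the order of `THRESHOLDS`, so the most important checks can be listed first.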