AI API Concurrent Request Design: A Complete asyncio + aiohttp Implementation

As AI adoption grows, calling LLM (Large Language Model) APIs efficiently has become a key factor for every application that depends on them. This article shows how to build a robust concurrent request system with asyncio and aiohttp, and compares costs across the leading API providers.

API Cost Comparison

Before getting into the code, here is a cost comparison of popular API providers:

Provider	Exchange rate	Payment	Latency	GPT-4.1 ($/MTok)	Claude Sonnet 4.5 ($/MTok)
HolySheep AI	¥1 = $1	WeChat/Alipay, Visa	<50ms	$8	$15
Official API	Market rate	Visa, Wire	~100ms	$15	$25
Other relay services	Floating rate	Limited, risky	~80ms	$10-12	$18-20

Conclusion: with a fixed exchange rate of ¥1 = $1 and costs up to 85% lower than the official API, HolySheep AI is the best fit for projects that need to process large volumes of requests.

Why use asyncio for AI APIs?

When building AI applications, you typically run into tasks such as:

Calling several models at once (multi-model inference)
Processing large batches of requests
Aggregating results from multiple sources
Retrying when the API applies rate limits

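asyncio lets a single thread keep many I/O-bound requests in flight: while one request is waiting on the network, the event loop runs another. That is how you can handle thousands of concurrent requests without creating a thread for each one, which saves significant memory and scheduling overhead.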
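To make this concrete, here is a minimal, self-contained sketch in which asyncio.sleep stands in for the network wait of an API call (no real API is contacted; fake_api_call is a hypothetical placeholder). One hundred simulated 0.1-second calls finish in roughly 0.1 seconds of wall time on one thread, because the event loop switches between them while they wait.

```python
import asyncio
import time


async def fake_api_call(i: int) -> int:
    # Hypothetical stand-in for an HTTP request: while "waiting on the network",
    # the coroutine yields control to the event loop, as an aiohttp call would
    await asyncio.sleep(0.1)
    return i


async def run_concurrently(n: int) -> tuple[list[int], float]:
    start = time.perf_counter()
    # gather runs all n coroutines concurrently on one thread and keeps input order
    results = await asyncio.gather(*(fake_api_call(i) for i in range(n)))
    return list(results), time.perf_counter() - start


if __name__ == "__main__":
    results, elapsed = asyncio.run(run_concurrently(100))
    # Run one after another, the same calls would take about 100 * 0.1 = 10 s
    print(f"{len(results)} calls finished in {elapsed:.2f}s")
```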

Setting up the environment

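pip install aiohttp aiolimiter

asyncio ships with the Python standard library, so it should not be installed with pip (the asyncio package on PyPI is an obsolete backport). aiolimiter is only needed for the optional rate limiter in the troubleshooting section below.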

Implementing the AI API client with asyncio

1. Configuration and imports

import asyncio
import aiohttp
import json
from typing import List, Dict, Any, Optional
from dataclasses import dataclass
from datetime import datetime
import time

@dataclass
class AIRequest:
    model: str
    messages: List[Dict[str, str]]
    temperature: float = 0.7
    max_tokens: int = 2048

@dataclass
class AIResponse:
    model: str
    content: str
    usage: Dict[str, int]
    latency: float
    success: bool
    error: Optional[str] = None

class HolySheepAIClient:
    """HolySheep AI API client with support for concurrent requests"""
    
    def __init__(
        self,
        api_key: str,
        base_url: str = "https://api.holysheep.ai/v1"
    ):
        self.api_key = api_key
        self.base_url = base_url
        self._session: Optional[aiohttp.ClientSession] = None
    
    async def __aenter__(self):
        timeout = aiohttp.ClientTimeout(total=120)
        self._session = aiohttp.ClientSession(
            headers={
                "Authorization": f"Bearer {self.api_key}",
                "Content-Type": "application/json"
            },
            timeout=timeout
        )
        return self
    
    async def __aexit__(self, exc_type, exc_val, exc_tb):
        if self._session:
            await self._session.close()
    
    async def chat_completion(
        self,
        request: AIRequest,
        retry_count: int = 3
    ) -> AIResponse:
        """Send a single request, retrying on HTTP 429 and on network errors"""
        if self._session is None:
            # The session is created in __aenter__ (async with, or a manual __aenter__ call)
            raise RuntimeError("Session not created: use 'async with HolySheepAIClient(...)'")
        
        url = f"{self.base_url}/chat/completions"
        payload = {
            "model": request.model,
            "messages": request.messages,
            "temperature": request.temperature,
            "max_tokens": request.max_tokens
        }
        
        for attempt in range(retry_count):
            start_time = time.time()
            try:
                async with self._session.post(url, json=payload) as response:
                    latency = time.time() - start_time
                    
                    if response.status == 200:
                        data = await response.json()
                        return AIResponse(
                            model=request.model,
                            content=data["choices"][0]["message"]["content"],
                            usage=data.get("usage", {}),
                            latency=latency,
                            success=True
                        )
                    elif response.status == 429:
                        # Rate limited: back off 1s, 2s, 4s, ... but don't
                        # sleep after the final attempt
                        if attempt < retry_count - 1:
                            await asyncio.sleep(2 ** attempt)
                        continue
                    else:
                        # Other HTTP errors (400, 401, 5xx, ...) are returned without retrying
                        error_text = await response.text()
                        return AIResponse(
                            model=request.model,
                            content="",
                            usage={},
                            latency=latency,
                            success=False,
                            error=f"HTTP {response.status}: {error_text}"
                        )
                        
            except (aiohttp.ClientError, asyncio.TimeoutError) as e:
                # Connection errors and timeouts: retry after 1s, give up on the last attempt
                if attempt == retry_count - 1:
                    return AIResponse(
                        model=request.model,
                        content="",
                        usage={},
                        latency=time.time() - start_time,
                        success=False,
                        error=str(e) or type(e).__name__
                    )
                await asyncio.sleep(1)
        
        # Reached only when the final attempt was still rate limited (HTTP 429)
        return AIResponse(
            model=request.model,
            content="",
            usage={},
            latency=0,
            success=False,
            error="Max retries exceeded"
        )
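The retry loop above waits a fixed 2 ** attempt seconds after each HTTP 429. A common refinement is to honor the Retry-After header when the server sends one, and to add random jitter so that many concurrent requests do not all retry at the same moment. The sketch below is a hypothetical helper (backoff_delay is not part of aiohttp or of the HolySheep API), and it assumes, without confirmation from this article, that the server may send Retry-After as a number of seconds; the HTTP-date form of the header is ignored.

```python
import random
from typing import Optional


def backoff_delay(
    attempt: int,
    retry_after: Optional[str] = None,
    base: float = 1.0,
    cap: float = 30.0,
) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    # Prefer the server's hint when it is a plain number of seconds
    # (assumption: HTTP-date values fall through to the jittered backoff)
    if retry_after is not None:
        try:
            return min(cap, max(0.0, float(retry_after)))
        except ValueError:
            pass
    # "Full jitter": pick uniformly between 0 and the exponential ceiling
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0, ceiling)
```

Inside chat_completion, the 429 branch could then sleep for backoff_delay(attempt, response.headers.get("Retry-After")) instead of 2 ** attempt.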

2. Concurrent Request Handler

class ConcurrentAIHandler:
    """Processes batches of requests, using a semaphore to limit concurrency"""
    
    def __init__(
        self,
        client: HolySheepAIClient,
        max_concurrent: int = 10
    ):
        self.client = client
        self.semaphore = asyncio.Semaphore(max_concurrent)
    
    async def process_single(
        self,
        request: AIRequest,
        request_id: int
    ) -> tuple[int, AIResponse]:
        """Process one request while holding a semaphore slot"""
        async with self.semaphore:
            response = await self.client.chat_completion(request)
            return request_id, response
    
    async def batch_process(
        self,
        requests: List[AIRequest]
    ) -> List[tuple[int, AIResponse]]:
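        """Process a batch of requests concurrently (at most max_concurrent at a time)"""
        tasks = [
            self.process_single(request, idx)
            for idx, request in enumerate(requests)
        ]
        results = await asyncio.gather(*tasks, return_exceptions=True)
        # With return_exceptions=True, unexpected exceptions come back as values;
        # turn them into failed responses so callers can always unpack (id, response)
        normalized = []
        for idx, result in enumerate(results):
            if isinstance(result, BaseException):
                normalized.append((idx, AIResponse(
                    model=requests[idx].model,
                    content="",
                    usage={},
                    latency=0,
                    success=False,
                    error=repr(result)
                )))
            else:
                normalized.append(result)
        return normalized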
    
    async def multi_model_inference(
        self,
        prompts: List[str],
        models: List[str]
    ) -> Dict[str, List[AIResponse]]:
        """Call several models concurrently for each prompt (results grouped by model)"""
        results = {model: [] for model in models}
        
        async def call_model(prompt: str, model: str):
            request = AIRequest(
                model=model,
                messages=[{"role": "user", "content": prompt}]
            )
            # Hold a semaphore slot so this path also respects max_concurrent
            async with self.semaphore:
                return model, await self.client.chat_completion(request)
        
        tasks = [
            call_model(prompt, model)
            for prompt in prompts
            for model in models
        ]
        
        raw_results = await asyncio.gather(*tasks)
        
        for model, response in raw_results:
            results[model].append(response)
        
        return results
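asyncio.Semaphore(max_concurrent) lets at most max_concurrent coroutines be inside the async with block at the same time; the others wait until a slot is released. The self-contained sketch below checks that behavior by recording the peak number of in-flight calls, with asyncio.sleep standing in for chat_completion (run_with_limit and fake_request are illustrative names, not part of the handler above).

```python
import asyncio


async def run_with_limit(n_requests: int, max_concurrent: int) -> int:
    """Run n fake requests behind a semaphore and return the peak number in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)
    in_flight = 0
    peak = 0

    async def fake_request() -> None:
        nonlocal in_flight, peak
        async with semaphore:
            # At most max_concurrent coroutines can be inside this block at once
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # simulated network wait
            in_flight -= 1

    await asyncio.gather(*(fake_request() for _ in range(n_requests)))
    return peak


if __name__ == "__main__":
    print(asyncio.run(run_with_limit(20, 5)))  # prints 5
```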

# Example usage
async def main():
    api_key = "YOUR_HOLYSHEEP_API_KEY"
    
    async with HolySheepAIClient(api_key) as client:
        handler = ConcurrentAIHandler(client, max_concurrent=5)
        
        # Build a batch of 20 requests
        batch_requests = [
            AIRequest(
                model="gpt-4.1",
                messages=[{"role": "user", "content": f"Prompt {i}: Analyze dataset #{i}"}]
            )
            for i in range(20)
        ]
        
        # Process the batch (at most 5 requests in flight at once)
        start = time.time()
        results = await handler.batch_process(batch_requests)
        total_time = time.time() - start
        
        # Statistics
        success_count = sum(1 for _, r in results if r.success)
        print(f"Completed: {success_count}/{len(results)} requests")
        print(f"Total time: {total_time:.2f}s")
        # Wall time divided by request count (throughput), not per-request latency
        print(f"Average: {total_time/len(results):.2f}s/request")

if __name__ == "__main__":
    asyncio.run(main())

3. Streaming Response Handler

async def stream_chat_completion(
    client: HolySheepAIClient,
    request: AIRequest
):
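    """Handle a streaming (Server-Sent Events) response from the API"""
    
    url = f"{client.base_url}/chat/completions"
    payload = {
        "model": request.model,
        "messages": request.messages,
        "temperature": request.temperature,
        "max_tokens": request.max_tokens,
        "stream": True
    }
    
    # Note: the session's ClientTimeout(total=120) also bounds the whole stream
    async with client._session.post(url, json=payload) as response:
        # Fail loudly on 401/429/5xx instead of silently yielding nothing
        response.raise_for_status()
        # Iterating response.content yields the body line by line
        async for line in response.content:
            line = line.decode('utf-8').strip()
            if line.startswith("data: "):
                if line == "data: [DONE]":
                    break
                data = json.loads(line[6:])
                if "choices" in data and len(data["choices"]) > 0:
                    delta = data["choices"][0].get("delta", {})
                    # Some chunks carry only a role or finish_reason; content may be null
                    if delta.get("content"):
                        yield delta["content"]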
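The streaming handler relies on the Server-Sent Events format used by OpenAI-compatible APIs: each event line starts with "data:", carries one JSON chunk, and the stream ends with "data: [DONE]". Pulling the parsing into a pure function, as in the sketch below, makes it testable without a network connection. The chunk layout (choices[0].delta.content) is the same assumption the code above makes; parse_sse_line and DONE are hypothetical names.

```python
import json

# Sentinel returned when the server signals the end of the stream
DONE = object()


def parse_sse_line(raw: bytes) -> object:
    """Return the text delta carried by one SSE line, None if there is none, or DONE."""
    line = raw.decode("utf-8").strip()
    if not line.startswith("data:"):
        # Blank separator lines, ": keep-alive" comments, "event:" lines, etc.
        return None
    data = line[len("data:"):].strip()
    if data == "[DONE]":
        return DONE
    chunk = json.loads(data)
    choices = chunk.get("choices") or []
    if not choices:
        return None
    # Some chunks carry only a role or a finish_reason; content may be missing or null
    return (choices[0].get("delta") or {}).get("content") or None


if __name__ == "__main__":
    sample = [
        b'data: {"choices":[{"delta":{"role":"assistant"}}]}\n',
        b"\n",
        b'data: {"choices":[{"delta":{"content":"Hel"}}]}\n',
        b'data: {"choices":[{"delta":{"content":"lo"}}]}\n',
        b"data: [DONE]\n",
    ]
    text = ""
    for raw in sample:
        piece = parse_sse_line(raw)
        if piece is DONE:
            break
        if piece:
            text += piece
    print(text)  # prints Hello
```

Inside stream_chat_completion, the loop body then reduces to calling parse_sse_line(line), breaking when it returns DONE and yielding any non-empty string.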

# Using streaming
async def main_streaming():
    api_key = "YOUR_HOLYSHEEP_API_KEY"
    
    async with HolySheepAIClient(api_key) as client:
        request = AIRequest(
            model="gpt-4.1",
            messages=[{"role": "user", "content": "Write a long paragraph"}]
        )
        
        print("Streaming response: ", end="", flush=True)
        async for chunk in stream_chat_completion(client, request):
            print(chunk, end="", flush=True)
        print()

HolySheep pricing (2026)

Model	Input price ($/MTok)	Output price ($/MTok)	Savings vs. official API
GPT-4.1	$8	$32	~50%
Claude Sonnet 4.5	$15	$75	~40%
Gemini 2.5 Flash	$2.50	$10	~60%
DeepSeek V3.2	$0.42	$1.68	~70%
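Per-million-token prices turn into a budget with simple arithmetic: cost = input_tokens / 1,000,000 × input price + output_tokens / 1,000,000 × output price. The sketch below encodes the table above (prices as given in this article, which may change; the dictionary keys are illustrative and not necessarily the exact API model IDs) together with a helper for the percentage saved between two prices, applied to the GPT-4.1 row of the first comparison table.

```python
# USD per million tokens (input, output), copied from the pricing table above
PRICES = {
    "gpt-4.1": (8.00, 32.00),
    "claude-sonnet-4.5": (15.00, 75.00),
    "gemini-2.5-flash": (2.50, 10.00),
    "deepseek-v3.2": (0.42, 1.68),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated cost in USD for the given numbers of input and output tokens."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def savings_pct(our_price: float, reference_price: float) -> float:
    """Percentage saved by paying our_price instead of reference_price."""
    return (1 - our_price / reference_price) * 100


if __name__ == "__main__":
    # 1M input + 200k output tokens on GPT-4.1: 8.00 + 6.40 = 14.40 USD
    print(f"${estimate_cost('gpt-4.1', 1_000_000, 200_000):.2f}")  # prints $14.40
    # GPT-4.1 in the first table: $8 vs $15
    print(f"{savings_pct(8, 15):.1f}%")  # prints 46.7%
```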

Common errors and how to fix them

1. Rate limit error (HTTP 429)

Cause: too many requests are sent in a short period of time.

Fix:

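class RateLimitedClient:
    """Client-side throttle: starts at most requests_per_second requests per second"""
    
    def __init__(self, requests_per_second: int = 10):
        self.min_interval = 1.0 / requests_per_second
        self.last_request_time = float("-inf")
        # Serializes the timing check so concurrent callers can't all pass at once
        self._lock = asyncio.Lock()
    
    async def throttled_request(self, coro):
        async with self._lock:
            elapsed = time.monotonic() - self.last_request_time
            if elapsed < self.min_interval:
                await asyncio.sleep(self.min_interval - elapsed)
            self.last_request_time = time.monotonic()
        # The request itself runs outside the lock, so requests can still overlap
        return await coro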

# Or use aiolimiter (pip install aiolimiter)
from aiolimiter import AsyncLimiter

# At most 50 requests per 1-second period
rate_limiter = AsyncLimiter(max_rate=50, time_period=1)

async def rate_limited_call(client, request):
    async with rate_limiter:
        return await client.chat_completion(request)
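aiolimiter implements the leaky bucket algorithm. If you would rather not add a dependency, a sliding-window limiter with the same async with interface takes only a few lines; note that this is a different algorithm from aiolimiter's, and SlidingWindowLimiter is a hypothetical name used only in this sketch. It allows at most max_rate acquisitions in any time_period-second window and makes later callers wait until the oldest acquisition leaves the window.

```python
import asyncio
import time
from collections import deque


class SlidingWindowLimiter:
    """At most max_rate acquisitions in any time_period-second window."""

    def __init__(self, max_rate: int, time_period: float = 1.0):
        self.max_rate = max_rate
        self.time_period = time_period
        self._timestamps = deque()  # monotonic times of recent acquisitions
        self._lock = asyncio.Lock()  # waiters are admitted one at a time

    async def __aenter__(self) -> "SlidingWindowLimiter":
        async with self._lock:
            while True:
                now = time.monotonic()
                # Drop acquisitions that have slid out of the window
                while self._timestamps and now - self._timestamps[0] >= self.time_period:
                    self._timestamps.popleft()
                if len(self._timestamps) < self.max_rate:
                    self._timestamps.append(now)
                    return self
                # Window is full: sleep until the oldest acquisition expires
                await asyncio.sleep(self.time_period - (now - self._timestamps[0]))

    async def __aexit__(self, exc_type, exc, tb) -> bool:
        # Nothing to release (slots expire on their own); don't swallow exceptions
        return False
```

It can replace the aiolimiter object above, for example rate_limiter = SlidingWindowLimiter(max_rate=50, time_period=1), with rate_limited_call unchanged.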

2. "Session not created" error

Cause: the API is called before the session has been created, or after it has been closed.

Fix:

# Wrong: the session has not been created yet
client = HolySheepAIClient("key")
await client.chat_completion(request)  # Error: _session is still None

# Correct: use the context manager
async with HolySheepAIClient("key") as client:
    result = await client.chat_completion(request)
    
# Or initialize it manually and always clean up
client = HolySheepAIClient("key")
await client.__aenter__()
try:
    result = await client.chat_completion(request)
finally:
    await client.__aexit__(None, None, None)

3. Timeout error

Cause: the request takes longer than the allowed time, usually because the model is busy or because of network problems.

Fix:

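# Increase the session-wide timeout for batch processing
# (HolySheepAIClient.__aenter__ hardcodes total=120, so pass this value there instead)
timeout = aiohttp.ClientTimeout(total=300)  # 5 minutes

# Or apply a separate timeout to each request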
async def request_with_timeout(coro, timeout_seconds=60):
    try:
        return await asyncio.wait_for(coro, timeout=timeout_seconds)
    except asyncio.TimeoutError:
        return AIResponse(
            model="",
            content="",
            usage={},
            latency=timeout_seconds,
            success=False,
            error=f"Request timeout after {timeout_seconds}s"
        )

# Usage
result = await request_with_timeout(
    client.chat_completion(request),
    timeout_seconds=120
)
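Because asyncio.wait_for cancels the wrapped coroutine when the timeout expires, wrapping each request before handing it to asyncio.gather bounds the wall time of the whole batch: a slow request is reported as a failure while the others still return normally. The sketch below shows this with simulated requests; fake_request, guarded and run_batch are hypothetical names, and asyncio.sleep stands in for client.chat_completion.

```python
import asyncio
import time
from typing import List, Tuple


async def fake_request(delay: float) -> str:
    # Stand-in for an API call that takes `delay` seconds
    await asyncio.sleep(delay)
    return f"done after {delay}s"


async def guarded(coro, timeout_seconds: float) -> Tuple[bool, str]:
    try:
        return True, await asyncio.wait_for(coro, timeout=timeout_seconds)
    except asyncio.TimeoutError:
        # wait_for has already cancelled the slow coroutine at this point
        return False, f"timeout after {timeout_seconds}s"


async def run_batch(delays: List[float], timeout_seconds: float):
    start = time.perf_counter()
    results = await asyncio.gather(
        *(guarded(fake_request(d), timeout_seconds) for d in delays)
    )
    return results, time.perf_counter() - start


if __name__ == "__main__":
    results, elapsed = asyncio.run(run_batch([0.05, 0.1, 5.0], timeout_seconds=0.3))
    for ok, message in results:
        print(ok, message)
    print(f"batch finished in {elapsed:.2f}s")  # about 0.3s rather than 5s
```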

4. Invalid API key error

Cause: the API key is wrong or has not been activated yet.

Fix:

async def verify_api_key(client: HolySheepAIClient) -> bool:
    """Check the API key before ..."""