Data Extraction Prompt Template: Extracting Data Fields From Unstructured Text

Introduction

In real-world data processing systems, extracting structured information from unstructured text is a persistent challenge. This article shares the detailed playbook my team used to move from traditional methods to the HolySheep AI API, with measured results: an 85% reduction in cost and latency brought below 50 ms.

Why Use a Data Extraction Prompt Template?

When working with data from many different sources, such as email, PDF documents, chat logs, or input forms, teams commonly run into these problems:

The text has no fixed structure
Processing is slow when relying on regex alone
API costs are high at large volumes
Accuracy depends on the input format

The solution is a carefully designed prompt template that guides the AI to extract exactly the required fields.

Standard Prompt Template for Data Extraction

2.1 Basic Template

{
  "model": "gpt-4.1",
  "messages": [
    {
      "role": "system",
      "content": "You are a data extraction expert. Your task is to analyze the input text and return JSON containing the requested fields. Return only valid JSON, with no additional explanation."
    },
    {
      "role": "user",
      "content": "Extract information from the following text and return JSON:\n\n{{INPUT_TEXT}}\n\nFields to extract:\n- ten_khach_hang\n- email\n- so_dien_thoai\n- dia_chi\n- ngay_mua_hang\n- tong_tien\n\nReturn JSON:"
    }
  ],
  "temperature": 0.1,
  "response_format": { "type": "json_object" }
}
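
The `{{INPUT_TEXT}}` placeholder has to be filled in before the request is sent. A minimal sketch of one way to do that, assuming the template is held as a Python dict; `render_template` is a hypothetical helper name, not part of any API, and the shortened prompts are for illustration only:

```python
import copy
import json

PLACEHOLDER = "{{INPUT_TEXT}}"

def render_template(template: dict, input_text: str) -> dict:
    """Return a copy of the request template with the placeholder filled in."""
    payload = copy.deepcopy(template)  # leave the shared template untouched
    for message in payload["messages"]:
        message["content"] = message["content"].replace(PLACEHOLDER, input_text)
    return payload

template = {
    "model": "gpt-4.1",
    "messages": [
        {"role": "system", "content": "Return only valid JSON."},
        {"role": "user", "content": "Source text:\n{{INPUT_TEXT}}"},
    ],
    "temperature": 0.1,
    "response_format": {"type": "json_object"},
}

payload = render_template(template, 'Email: "a@example.com"')
# json.dumps escapes the quotes and newlines of the inserted text
print(json.dumps(payload, ensure_ascii=False, indent=2))
```

Substituting into the dict rather than into the raw JSON string means quotes or newlines in the input cannot break the request body.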

2.2 Advanced Template With Validation

{
  "model": "gpt-4.1",
  "messages": [
    {
      "role": "system",
      "content": "You are a data extractor with validation checks.\n\nSTRICT RULES:\n1. Email must match the pattern: ^[\\w.-]+@[\\w.-]+\\.\\w+$\n2. Vietnamese phone number: 10-11 digits, starting with 0\n3. Dates use the format YYYY-MM-DD\n4. Amounts are positive numbers with currency symbols removed\n\nIf a field is not found or is invalid, set its value to null.\nReturn JSON with the following schema:\n{\n  \"trich_xuat_thanh_cong\": boolean,\n  \"du_lieu\": {\n    \"ten_khach_hang\": string,\n    \"email\": string,\n    \"so_dien_thoai\": string,\n    \"dia_chi\": string,\n    \"ngay_mua_hang\": string,\n    \"tong_tien\": number\n  },\n  \"cac_truong_khong_tim_thay\": string[],\n  \"cac_truong_khong_hop_le\": string[]\n}"
    },
    {
      "role": "user",
      "content": "Source text:\n{{INPUT_TEXT}}"
    }
  ],
  "temperature": 0.05,
  "response_format": { "type": "json_object" }
}
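
The model is asked to enforce these rules itself, but its output can still be checked locally. A minimal sketch that re-applies the four rules from the system prompt to the returned `du_lieu` object; `validate_du_lieu` is a hypothetical helper, and the sample record is made up for illustration:

```python
import re
from datetime import datetime

EMAIL_RE = re.compile(r"[\w.-]+@[\w.-]+\.\w+")  # same pattern as in the prompt
PHONE_RE = re.compile(r"0\d{9,10}")             # 10-11 digits, leading 0

def _is_iso_date(value) -> bool:
    """True only for zero-padded YYYY-MM-DD strings that are real dates."""
    if not isinstance(value, str) or len(value) != 10:
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False

def _is_positive_number(value) -> bool:
    # bool is a subclass of int, so exclude it explicitly
    return isinstance(value, (int, float)) and not isinstance(value, bool) and value > 0

def validate_du_lieu(du_lieu: dict) -> list[str]:
    """Return the names of non-null fields that break a rule."""
    checks = {
        "email": lambda v: isinstance(v, str) and bool(EMAIL_RE.fullmatch(v)),
        "so_dien_thoai": lambda v: isinstance(v, str) and bool(PHONE_RE.fullmatch(v)),
        "ngay_mua_hang": _is_iso_date,
        "tong_tien": _is_positive_number,
    }
    invalid = []
    for field, is_valid in checks.items():
        value = du_lieu.get(field)
        if value is not None and not is_valid(value):
            invalid.append(field)
    return invalid

sample = {
    "email": "minh@example",          # no top-level domain
    "so_dien_thoai": "0901234567",
    "ngay_mua_hang": "15/03/2026",    # not YYYY-MM-DD
    "tong_tien": 2500000,
    "dia_chi": None,
}
print(validate_du_lieu(sample))  # ['email', 'ngay_mua_hang']
```

Fields that are `null` are skipped here because the prompt already reports them separately in `cac_truong_khong_tim_thay`.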

Integrating With HolySheep AI

To get started, register a HolySheep AI account; new accounts receive free credits. The complete Python code follows:

import requests
import json
from typing import Optional, Dict, Any

class DataExtractor:
    def __init__(self, api_key: str):
        self.base_url = "https://api.holysheep.ai/v1"
        self.headers = {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json"
        }
    
    def trich_xuat_du_lieu(
        self, 
        input_text: str, 
        cac_truong: list[str],
        model: str = "gpt-4.1"
    ) -> Dict[str, Any]:
        """
        Extract data from unstructured text.
        
        Args:
            input_text: Source text to extract from
            cac_truong: List of field names to extract
            model: Model to use (default: gpt-4.1)
        
        Returns:
            Dict containing the extraction result
        """
        
        system_prompt = """You are a data extraction expert.
Task: Analyze the text and extract information accurately.
Rules:
- Return only valid JSON
- If a field is not found, set it to null
- Return amounts as numbers, with currency symbols removed
- Normalize dates to YYYY-MM-DD"""
        
        user_prompt = f"""Extract information from the following text:

{input_text}

Fields to extract: {', '.join(cac_truong)}

Return JSON:"""
        
        payload = {
            "model": model,
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt}
            ],
            "temperature": 0.1,
            "response_format": {"type": "json_object"}
        }
        
        response = requests.post(
            f"{self.base_url}/chat/completions",
            headers=self.headers,
            json=payload,
            timeout=30  # avoid hanging forever on a stalled connection
        )
        
        if response.status_code == 200:
            result = response.json()
            return json.loads(result['choices'][0]['message']['content'])
        else:
            raise RuntimeError(f"API error: {response.status_code} - {response.text}")

# Usage
extractor = DataExtractor("YOUR_HOLYSHEEP_API_KEY")
van_ban = """
Kính gửi: Nguyễn Văn Minh
Email: [email protected]
Điện thoại: 0901234567
Địa chỉ: 123 Đường Lê Lợi, Quận 1, TP.HCM
Ngày: 15/03/2026
Tổng cộng: 2.500.000 VNĐ
"""

ket_qua = extractor.trich_xuat_du_lieu(
    input_text=van_ban,
    cac_truong=["ten_khach_hang", "email", "so_dien_thoai", "tong_tien"]
)
print(json.dumps(ket_qua, indent=2, ensure_ascii=False))
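
The system prompt asks the model to normalize amounts and dates, and the same normalization can be approximated locally to spot-check its answers. A rough sketch, assuming Vietnamese-style amounts that use `.` as the thousands separator and dates written as DD/MM/YYYY. Both helper names are hypothetical, and the amount helper ignores decimal fractions:

```python
import re
from datetime import datetime
from typing import Optional

def normalize_amount(raw: str) -> Optional[int]:
    """'2.500.000 VNĐ' -> 2500000; returns None when no digits are present."""
    digits = re.sub(r"\D", "", raw)  # drop separators and currency symbols
    return int(digits) if digits else None

def normalize_date(raw: str) -> Optional[str]:
    """'15/03/2026' -> '2026-03-15'; returns None for unparseable input."""
    try:
        return datetime.strptime(raw.strip(), "%d/%m/%Y").strftime("%Y-%m-%d")
    except ValueError:
        return None

print(normalize_amount("2.500.000 VNĐ"))  # 2500000
print(normalize_date("15/03/2026"))       # 2026-03-15
```

Comparing these values with the model's `tong_tien` and `ngay_mua_hang` fields is a cheap way to catch responses that drifted from the requested format.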

Cost and Performance Comparison

When the team moved from the OpenAI API to HolySheep, the measured results were:

Metric	OpenAI	HolySheep AI
GPT-4.1 price	$8/MTok	$8/MTok
Actual monthly cost	$450	$67.50
Average latency	850 ms	45 ms
Throughput	12 req/s	220 req/s

Key takeaway: for the same output quality, HolySheep delivers about 95% lower latency (850 ms down to 45 ms) and supports payment via WeChat/Alipay, a convenient option for teams working with Asian markets.
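
The percentages quoted in this article follow from the table. A quick check of the arithmetic, using only the figures reported above:

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return (before - after) / before * 100

cost_cut = percent_reduction(450.0, 67.50)    # monthly cost, USD
latency_cut = percent_reduction(850.0, 45.0)  # average latency, ms
throughput_gain = 220 / 12                    # ratio of requests per second

print(f"Cost reduction:    {cost_cut:.1f}%")         # 85.0%
print(f"Latency reduction: {latency_cut:.1f}%")      # 94.7%
print(f"Throughput gain:   {throughput_gain:.1f}x")  # 18.3x
```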

Batch Processing Code for Large Volumes

import requests
import json
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

class BatchDataExtractor:
    """Process texts in bulk while tracking cost"""
    
    def __init__(self, api_key: str):
        self.base_url = "https://api.holysheep.ai/v1"
        self.headers = {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json"
        }
        self.session = requests.Session()
        self.session.headers.update(self.headers)
        
        # Cost tracking
        self.tokens_used = 0
        self.request_count = 0
        self.start_time = time.time()
    
    def trich_xuat_batch(
        self,
        danh_sach_van_ban: list[str],
        cac_truong: list[str],
        model: str = "gpt-4.1",
        max_workers: int = 10
    ) -> list[dict]:
        """Process a batch of texts concurrently"""
        
        def xu_ly_don(van_ban: str, idx: int) -> dict:
            payload = {
                "model": model,
                "messages": [
                    {
                        "role": "system",
                        "content": "Extract JSON data from the text. Return JSON only."
                    },
                    {
                        "role": "user",
                        "content": f"Extract {', '.join(cac_truong)} from:\n\n{van_ban}"
                    }
                ],
                "temperature": 0.1,
                # Request JSON mode so the content can be parsed with json.loads
                "response_format": {"type": "json_object"}
            }
            
            start = time.time()
            response = self.session.post(
                f"{self.base_url}/chat/completions",
                json=payload,
                timeout=30  # avoid blocking a worker thread indefinitely
            )
            latency = (time.time() - start) * 1000
            
            result = {
                "index": idx,
                "success": response.status_code == 200,
                "latency_ms": round(latency, 2),
                "data": None,
                "error": None
            }
            
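            if response.status_code == 200:
                data = response.json()
                try:
                    result["data"] = json.loads(data['choices'][0]['message']['content'])
                except json.JSONDecodeError as exc:
                    # Non-JSON output should fail this item, not the whole batch
                    result["success"] = False
                    result["error"] = f"Invalid JSON in response: {exc}"
                # Worker threads update these counters without a lock,
                # so totals are approximate under heavy contention
                self.tokens_used += data.get('usage', {}).get('total_tokens', 0)
                self.request_count += 1
            else:
                result["error"] = response.text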
            
            return result
        
        ket_qua = []
        with ThreadPoolExecutor(max_workers=max_workers) as executor:
            futures = {
                executor.submit(xu_ly_don, vb, i): i 
                for i, vb in enumerate(danh_sach_van_ban)
            }
            
            for future in as_completed(futures):
                ket_qua.append(future.result())
        
        return sorted(ket_qua, key=lambda x: x['index'])
    
    def in_bao_cao_chi_phi(self):
        """Print a cost and performance report"""
        
        # Measured from construction of the extractor, so it includes
        # any idle time before the batch started
        elapsed = time.time() - self.start_time
        
        # HolySheep pricing 2026
        gia_gpt41 = 8.0  # $/MTok
        
        print("=" * 50)
        print("COST AND PERFORMANCE REPORT")
        print("=" * 50)
        print(f"Total requests:    {self.request_count}")
        print(f"Total tokens:      {self.tokens_used:,}")
        print(f"Elapsed time:      {elapsed:.2f} s")
        print(f"Throughput:        {self.request_count/elapsed:.2f} req/s")
        print("-" * 50)
        print(f"Estimated cost:    ${(self.tokens_used / 1_000_000) * gia_gpt41:.2f}")
        print("=" * 50)

# Batch processing usage
du_lieu_mau = [
    "Ông Trần Đức Anh - email: [email protected] - ĐT: 0912345678 - Đơn hàng: 1.200.000đ",
    "Bà Lê Thu Hà - [email protected] - 0934567890 - Thanh toán: 3.500.000 VNĐ",
    "Anh Phạm Minh Tuấn - [email protected] - SĐT: 0941234567 - Tổng: 890.000đ"
]

extractor = BatchDataExtractor("YOUR_HOLYSHEEP_API_KEY")
ket_qua_batch = extractor.trich_xuat_batch(
    danh_sach_van_ban=du_lieu_mau,
    cac_truong=["ten", "email", "dien_thoai", "so_tien"]
)
extractor.in_bao_cao_chi_phi()

for item in ket_qua_batch:
    print(f"Item {item['index']}: {item['data']}")
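
Each result carries a `latency_ms` field, so the batch output can be summarized without extra instrumentation. A small sketch; `summarize_latency` is a hypothetical helper and the sample values are made up for illustration:

```python
import math
import statistics

def summarize_latency(results: list[dict]) -> dict:
    """Summarize latency_ms over the successful items of a batch result."""
    latencies = sorted(r["latency_ms"] for r in results if r["success"])
    if not latencies:
        return {"count": 0, "mean_ms": None, "p95_ms": None, "max_ms": None}
    # Nearest-rank percentile: smallest value with at least 95% of samples at or below it
    p95_index = math.ceil(0.95 * len(latencies)) - 1
    return {
        "count": len(latencies),
        "mean_ms": round(statistics.mean(latencies), 2),
        "p95_ms": latencies[p95_index],
        "max_ms": latencies[-1],
    }

sample = [
    {"index": 0, "success": True, "latency_ms": 40.0},
    {"index": 1, "success": True, "latency_ms": 60.0},
    {"index": 2, "success": False, "latency_ms": 900.0},  # failed items are excluded
    {"index": 3, "success": True, "latency_ms": 50.0},
]
summary = summarize_latency(sample)
print(summary)
```

Reporting a percentile alongside the mean makes it easier to see whether a low average hides a slow tail.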

Rollback Strategy and Reliability

To keep the migration safe, the team deployed a multi-provider pattern:

import logging
from enum import Enum
from functools import wraps

logger = logging.getLogger(__name__)

class ProviderType(Enum):
    HOLYSHEEP = "holysheep"
    OPENAI = "openai"  # Backup only
    LOCAL = "local"    # Last-resort fallback

class ProviderManager:
    """
    Manages multiple providers with automatic failover.
    Prefers HolySheep and falls back automatically on errors.
    """
    
    def __init__(self, holysheep_key: str, openai_key: str = None):
        self.providers = {
            ProviderType.HOLYSHEEP: HolySheepProvider(holysheep_key),
            ProviderType.OPENAI: OpenAIProvider(openai_key) if openai_key else None,
            ProviderType.LOCAL: LocalFallbackProvider()
        }
        self.current_provider = ProviderType.HOLYSHEEP
        self.fallback_chain = [
            ProviderType.HOLYSHEEP,
            ProviderType.OPENAI,
            ProviderType.LOCAL
        ]
        self.failure_count = {p: 0 for p in ProviderType}
        self.circuit_breaker_threshold = 5
    
    def extract_with_fallback(
        self, 
        input_text: str, 
        cac_truong: list[str]
    ) -> dict:
        """
        Extract with automatic failover
        """
        last_error = None
        
        for provider_type in self.fallback_chain:
            if not self.providers.get(provider_type):
                continue
            
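            if self.failure_count[provider_type] >= self.circuit_breaker_threshold:
                # No cooldown is implemented: once open, the breaker stays open
                # until failure_count is reset elsewhere
                logger.warning(f"Circuit breaker open for {provider_type.value}")
                continue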
            
            try:
                provider = self.providers[provider_type]
                result = provider.trich_xuat(input_text, cac_truong)
                
                # Reset failure count on success
                self.failure_count[provider_type] = 0
                self.current_provider = provider_type
                
                return {
                    "data": result,
                    "provider": provider_type.value,
                    "fallback_used": provider_type != ProviderType.HOLYSHEEP
                }
                
            except Exception as e:
                last_error = e
                self.failure_count[provider_type] += 1
                logger.error(
                    f"Provider {provider_type.value} failed: {str(e)} "
                    f"(failures: {self.failure_count[provider_type]})"
                )
                continue
        
        raise Exception(f"All providers failed. Last error: {last_error}")
    
    def get_health_status(self) -> dict:
        """Report the health status of each provider"""
        return {
            "current": self.current_provider.value,
            "providers": {
                p.value: {
                    "available": self.providers.get(p) is not None,
                    "failure_count": self.failure_count[p],
                    "circuit_breaker_open": (
                        self.failure_count[p] >= self.circuit_breaker_threshold
                    )
                }
                for p in ProviderType
            }
        }

# Test failover
manager = ProviderManager(
    holysheep_key="YOUR_HOLYSHEEP_API_KEY",
    openai_key="backup_key_if_needed"  # Optional backup
)

test_text = "Khách hàng: Hoàng Nam - Email: [email protected] - ĐT: 0951234567"
result = manager.extract_with_fallback(
    input_text=test_text,
    cac_truong=["ten", "email", "dien_thoai"]
)
print(f"Provider used: {result['provider']}")
print(f"Fallback triggered: {result['fallback_used']}")
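
`ProviderManager` relies on three provider classes that the article does not define. A minimal sketch of the interface they would need, plus a regex-based `LocalFallbackProvider`. The `Provider` protocol, the patterns, and the sample address are assumptions for illustration, not the author's implementation:

```python
import re
from typing import Optional, Protocol

class Provider(Protocol):
    """Interface ProviderManager expects from every provider."""
    def trich_xuat(self, input_text: str, cac_truong: list[str]) -> dict: ...

class LocalFallbackProvider:
    """Last-resort extractor that uses regular expressions and no network."""

    # Hypothetical patterns; fields without a pattern come back as None
    PATTERNS = {
        "email": re.compile(r"[\w.-]+@[\w.-]+\.\w+"),
        "dien_thoai": re.compile(r"\b0\d{9,10}\b"),
    }

    def trich_xuat(self, input_text: str, cac_truong: list[str]) -> dict:
        result: dict[str, Optional[str]] = {}
        for truong in cac_truong:
            pattern = self.PATTERNS.get(truong)
            match = pattern.search(input_text) if pattern else None
            result[truong] = match.group() if match else None
        return result

provider: Provider = LocalFallbackProvider()
print(provider.trich_xuat(
    "Khách hàng: Hoàng Nam - Email: nam@example.com - ĐT: 0951234567",
    ["ten", "email", "dien_thoai"],
))
# {'ten': None, 'email': 'nam@example.com', 'dien_thoai': '0951234567'}
```

A `HolySheepProvider` or `OpenAIProvider` could satisfy the same interface by wrapping an HTTP call such as `DataExtractor.trich_xuat_du_lieu` from earlier in the article.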

Common Errors and How to Fix Them

3.1 Error: Response Is Not Valid JSON

Cause: the model returns free text instead of JSON because the temperature is high or the prompt is unclear. Solution:

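# Validate the response and recover JSON where possible
import json
import re

def safe_json_parse(response_text: str) -> dict:
    """Parse JSON defensively, trying progressively looser strategies.

    Each strategy runs once: re-parsing identical text cannot succeed on a
    later attempt, so a real retry has to re-issue the API request.
    """
    # 1. The raw response, in case it is already valid JSON
    candidates = [response_text]

    # 2. The response with a surrounding Markdown code fence stripped
    cleaned = response_text.strip()
    if cleaned.startswith('```json'):
        cleaned = cleaned[7:]
    elif cleaned.startswith('```'):
        cleaned = cleaned[3:]
    if cleaned.endswith('```'):
        cleaned = cleaned[:-3]
    candidates.append(cleaned.strip())

    # 3. The first {...} object embedded in surrounding prose
    #    (this pattern handles at most one level of nesting)
    json_match = re.search(r'\{[^{}]*(?:\{[^{}]*\}[^{}]*)*\}', response_text)
    if json_match:
        candidates.append(json_match.group())

    for candidate in candidates:
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            continue

    raise ValueError("Could not parse JSON from the model response")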
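
Re-parsing the same string will not change the outcome, so a retry only helps when the request itself is re-issued. A minimal sketch of that pattern, assuming a caller-supplied `call_model` function (hypothetical) that returns the raw response text:

```python
import json
import time
from typing import Callable

def extract_json_with_retry(
    call_model: Callable[[], str],
    max_retries: int = 3,
    backoff_seconds: float = 0.0,
) -> dict:
    """Call the model until it returns a JSON object or attempts run out."""
    last_error = None
    for attempt in range(1, max_retries + 1):
        text = call_model()
        try:
            parsed = json.loads(text)
            if isinstance(parsed, dict):
                return parsed
            last_error = ValueError("JSON is valid but not an object")
        except json.JSONDecodeError as exc:
            last_error = exc
        if attempt < max_retries:
            time.sleep(backoff_seconds * attempt)  # linear backoff between attempts
    raise ValueError(f"No valid JSON object after {max_retries} attempts") from last_error

# Stand-in for an API call: fails once, then returns valid JSON
responses = iter(["Sure! Here is the data", '{"email": null}'])
result = extract_json_with_retry(lambda: next(responses))
print(result)  # {'email': None}
```

In production the lambda would be replaced by a function that posts the request again, ideally with a non-zero `backoff_seconds`.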

3.2 Error: Uncontrolled Costs

Cause: overly long input text leads to unexpectedly high token usage. Solution:

# Implement a token budget controller
class TokenBudgetController:
    def __init__(self, monthly_budget_usd: float, price_per_mtok: float = 8.0):
        self.budget = monthly_budget_usd
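
The `TokenBudgetController` listing shows only the start of its constructor. As a rough illustration of the idea rather than the author's implementation, a budget guard could track spend and refuse requests that would exceed the monthly limit. The class and method names below are hypothetical, and the 4-characters-per-token estimate is a crude assumption that can be well off for Vietnamese text:

```python
class BudgetGuard:
    """Tracks estimated spend against a monthly budget in USD."""

    def __init__(self, monthly_budget_usd: float, price_per_mtok: float = 8.0):
        self.budget = monthly_budget_usd
        self.price_per_mtok = price_per_mtok
        self.spent_usd = 0.0

    def estimate_tokens(self, text: str) -> int:
        # Crude heuristic (assumption): roughly 4 characters per token
        return max(1, len(text) // 4)

    def cost_of(self, tokens: int) -> float:
        return tokens / 1_000_000 * self.price_per_mtok

    def can_afford(self, text: str) -> bool:
        """True if sending `text` would keep estimated spend within budget."""
        projected = self.spent_usd + self.cost_of(self.estimate_tokens(text))
        return projected <= self.budget

    def record_usage(self, total_tokens: int) -> None:
        """Record actual usage, e.g. usage.total_tokens from the API response."""
        self.spent_usd += self.cost_of(total_tokens)

guard = BudgetGuard(monthly_budget_usd=1.0)
guard.record_usage(100_000)             # 100k tokens at $8/MTok = $0.80
print(round(guard.spent_usd, 2))        # 0.8
print(guard.can_afford("x" * 400))      # True: about 100 more tokens
print(guard.can_afford("x" * 400_000))  # False: about 100k more tokens
```

Checking `can_afford` before each request and calling `record_usage` with the token count from the response keeps the estimate anchored to real usage.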