AI News Summarization System: A Complete Guide to Multi-Source Aggregation and Real-Time Updates

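Hello. I'm a developer who has spent the past three years working on AI API integration. In this tutorial we'll build an AI news summarization system that beginners can follow step by step. By the end, you'll have a system that automatically collects news from several kinds of sources (RSS feeds, web pages, and JSON APIs) and condenses each article to its key points.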

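What this system does

Collects articles in real time from multiple news sources (RSS, web pages, APIs)
Extracts key keywords from each collected article so that articles can be ranked by importance
Generates a three-sentence AI summary of each article
Detects and filters duplicate news automatically
Refreshes automatically at a fixed interval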

With HolySheep AI, a single API key gives access to multiple AI models, which can cut costs substantially. For example, a hybrid strategy is possible: draft summaries go to the inexpensive DeepSeek V3.2 ($0.42/MTok), and final review goes to Claude Sonnet 4.5 ($15/MTok).
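To make the hybrid strategy concrete, here is a rough cost comparison. The two prices are the ones quoted above; the workload (1,000 articles, about 800 tokens each, 10% sent for review) and the function names are assumptions for illustration only, and the sketch ignores any difference between input and output token pricing.

```python
# Prices per million tokens (MTok) as quoted in this guide.
PRICE_PER_MTOK = {
    "draft": 0.42,    # DeepSeek figure from the text
    "review": 15.00,  # Claude figure from the text
}


def cost_usd(tokens: int, price_per_mtok: float) -> float:
    """Cost in USD for a token count at a given per-MTok price."""
    return tokens / 1_000_000 * price_per_mtok


def hybrid_cost(articles: int, tokens_per_article: int, review_fraction: float) -> dict:
    """Compare 'draft everything cheaply, review a fraction' with 'premium model for everything'."""
    total_tokens = articles * tokens_per_article
    review_tokens = round(total_tokens * review_fraction)
    draft = cost_usd(total_tokens, PRICE_PER_MTOK["draft"])
    review = cost_usd(review_tokens, PRICE_PER_MTOK["review"])
    premium_only = cost_usd(total_tokens, PRICE_PER_MTOK["review"])
    return {"hybrid": round(draft + review, 4), "premium_only": round(premium_only, 4)}


if __name__ == "__main__":
    # Hypothetical workload: 1,000 articles, ~800 tokens each, 10% reviewed.
    print(hybrid_cost(articles=1000, tokens_per_article=800, review_fraction=0.10))
```

Under these assumed numbers the hybrid route costs a small fraction of sending everything to the premium model; the actual ratio depends on your token counts and on how much you choose to review.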

Prerequisites

A computer with Python 3.8 or later installed
A HolySheep AI API key (sign up to receive free credits)
A basic understanding of Python syntax (variables, functions, lists)
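Since every later step needs the API key, one option is to keep it out of source files and read it from an environment variable. This is a minimal sketch; the variable name `HOLYSHEEP_API_KEY` and the helper `load_api_key` are assumptions, not something the service prescribes.

```python
import os


def load_api_key(env_var: str = "HOLYSHEEP_API_KEY") -> str:
    """Return the API key stored in an environment variable, or raise if it is unset."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before running.")
    return key
```

The returned value can be passed wherever the listings in this guide use the placeholder string `"YOUR_HOLYSHEEP_API_KEY"`.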

Step 1: Set up the project environment

First, create a project folder and install the required libraries. Run the following commands in a terminal.

# Create the project folder and move into it
mkdir news-summary-system
cd news-summary-system

# Create a virtual environment (recommended)
python -m venv venv

# On Windows:
venv\Scripts\activate

# On macOS/Linux:
source venv/bin/activate

# Install the required libraries
pip install requests feedparser beautifulsoup4 python-dateutil

# Verify the installation
pip list | grep -E "requests|feedparser|beautifulsoup4|python-dateutil"

If the installation succeeded, the output lists the packages with versions similar to these.

beautifulsoup4 4.12.x
feedparser 6.x.x
requests 2.31.x
python-dateutil 2.8.x
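Because `grep` is not available in every shell, the same check can be done from Python itself. The sketch below looks up each library's import name with the standard library's `importlib.util.find_spec`; the helper name `missing_packages` is hypothetical.

```python
import importlib.util

# Import names differ from the pip package names for two of the libraries.
REQUIRED_MODULES = {
    "requests": "requests",
    "feedparser": "feedparser",
    "beautifulsoup4": "bs4",
    "python-dateutil": "dateutil",
}


def missing_packages(required: dict) -> list:
    """Return the pip package names whose import module cannot be found."""
    return [pkg for pkg, module in required.items()
            if importlib.util.find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED_MODULES)
    print("All packages found." if not missing else f"Missing: {', '.join(missing)}")
```

Run it inside the activated virtual environment so that it inspects the same interpreter that `pip install` targeted.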

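Step 2: Set up the HolySheep AI client

Next, we'll build a basic client that connects to the HolySheep AI API. HolySheep AI serves many AI models from a single endpoint, so you can switch models by changing only the model name, with no other code changes.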
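As a small illustration of what "switching models" amounts to, the sketch below builds two chat-completions request bodies and checks which fields differ. The helper name `build_chat_payload` is hypothetical; the body layout and the two model identifiers mirror what the client in this step sends.

```python
from typing import Any, Dict


def build_chat_payload(model: str, prompt: str, temperature: float = 0.3,
                       max_tokens: int = 500) -> Dict[str, Any]:
    """Build a chat-completions request body; only `model` changes between models."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }


if __name__ == "__main__":
    prompt = "Summarize this article in three sentences: ..."
    cheap = build_chat_payload("deepseek/deepseek-chat-v3-0324", prompt)
    premium = build_chat_payload("anthropic/claude-3-5-sonnet-latest", prompt)
    # Collect the keys whose values differ between the two request bodies.
    differing = {key for key in cheap if cheap[key] != premium[key]}
    print(differing)
```

Everything except the `model` string stays identical, which is why the client can expose the model as a plain function argument.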

# holysheep_client.py
import requests
from typing import List

class HolySheepAIClient:
    """HolySheep AI gateway client with support for multiple AI models"""
    
    def __init__(self, api_key: str, base_url: str = "https://api.holysheep.ai/v1"):
        self.api_key = api_key
        self.base_url = base_url.rstrip('/')
        self.headers = {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json"
        }
    
    def summarize(self, text: str, model: str = "deepseek/deepseek-chat-v3-0324") -> str:
        """
        Summarize text with an AI model.

        Args:
            text: Source text to summarize
            model: Model to use (default: DeepSeek V3.2)
                   Available models:
                   - deepseek/deepseek-chat-v3-0324 (lowest cost: $0.42/MTok)
                   - anthropic/claude-3-5-sonnet-latest (high quality: $15/MTok)
                   - openai/gpt-4o-mini (good value: $3.5/MTok)

        Returns:
            The summarized text
        """
        endpoint = f"{self.base_url}/chat/completions"

        prompt = f"""Summarize the key points of the following news article in three sentences.
Include important facts, people, places, and figures, and keep it concise.

---
{text}
---

Summary:"""

        payload = {
            "model": model,
            "messages": [
                {"role": "system", "content": "You are an accurate news summarization expert."},
                {"role": "user", "content": prompt}
            ],
            "temperature": 0.3,  # Low temperature for consistent results
            "max_tokens": 500
        }

        try:
            response = requests.post(endpoint, headers=self.headers, json=payload, timeout=30)
            response.raise_for_status()

            result = response.json()
            return result['choices'][0]['message']['content'].strip()

        except requests.exceptions.Timeout as e:
            raise RuntimeError("API request timed out (30 s). Check your network connection.") from e
        except requests.exceptions.RequestException as e:
            raise RuntimeError(f"API request failed: {e}") from e
        except (KeyError, IndexError) as e:
            raise RuntimeError("Unexpected response format. Check your API key.") from e
    
    def extract_keywords(self, text: str) -> List[str]:
        """
        Extract the key keywords from text.
        """
        endpoint = f"{self.base_url}/chat/completions"

        prompt = f"""Extract the 5 most important keywords from the following text.
List the keywords separated by commas.

---
{text}
---

Keywords:"""

        payload = {
            "model": "deepseek/deepseek-chat-v3-0324",
            "messages": [
                {"role": "user", "content": prompt}
            ],
            "temperature": 0.1,
            "max_tokens": 50
        }

        response = requests.post(endpoint, headers=self.headers, json=payload, timeout=30)
        response.raise_for_status()

        result = response.json()
        keywords_text = result['choices'][0]['message']['content'].strip()

        # Split on commas and drop empty entries
        return [k.strip() for k in keywords_text.split(',') if k.strip()]


# Usage example
if __name__ == "__main__":
    # Set the API key (replace with your real key)
    client = HolySheepAIClient(api_key="YOUR_HOLYSHEEP_API_KEY")

    # Sample news text for testing
    sample_news = """
    The Seoul Metropolitan Government held a briefing on its new environmental policy on December 15, 2024.
    The policy aims to cut carbon emissions by 40% by 2030.
    The mayor said the city plans to raise the share of electric buses to 50% and expand solar power facilities.
    Total project costs are expected to reach about 5 trillion won, and private investment will also be encouraged.
    """

    print("=" * 50)
    print("AI news summarization system test")
    print("=" * 50)

    try:
        summary = client.summarize(sample_news)
        print(f"\n📰 Summary:\n{summary}")

        keywords = client.extract_keywords(sample_news)
        print(f"\n🔑 Keywords: {', '.join(keywords)}")

    except Exception as e:
        print(f"\n❌ Error: {e}")
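The keyword prompt asks for a comma-separated list, but a model may still answer with a numbered or bulleted list. As a hedged sketch of a more tolerant parser (the function `parse_keywords` is hypothetical and not part of the client above), the reply can be split on several separators and stripped of list markers:

```python
import re
from typing import List


def parse_keywords(raw: str, limit: int = 5) -> List[str]:
    """Split a model reply into clean keywords, tolerating commas, newlines, and list markers."""
    # Split on commas (ASCII or full-width), semicolons, or line breaks.
    parts = re.split(r"[,，;\n]+", raw)
    keywords: List[str] = []
    for part in parts:
        # Drop a leading bullet or numbering such as "1. ", "2) ", "- ", "* ".
        cleaned = re.sub(r"^\s*(?:\d+[.)]|[-*•])\s+", "", part).strip()
        # Skip blanks and repeats while preserving the original order.
        if cleaned and cleaned not in keywords:
            keywords.append(cleaned)
    return keywords[:limit]


if __name__ == "__main__":
    print(parse_keywords("1. Seoul\n2. carbon emissions\n3. electric buses"))
    print(parse_keywords("Seoul, policy, , Seoul, solar"))
```

A parser like this could replace the plain `split(',')` at the end of `extract_keywords` if the replies turn out to be inconsistent.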

Step 3: Implement the news collection module

Now we'll build the module that collects news from different sources. It supports RSS feeds, ordinary web pages, and JSON APIs.

# news_collector.py
import requests
import feedparser
from bs4 import BeautifulSoup
from datetime import datetime
from typing import List, Dict, Optional
from dateutil import parser as date_parser
import hashlib
import re

class NewsCollector:
    """Multi-source news collector"""

    def __init__(self):
        self.session = requests.Session()
        self.session.headers.update({
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        })
        self.seen_hashes = set()  # URL hashes of collected articles, used to filter duplicates

    def collect_rss(self, feed_url: str, max_items: int = 10) -> List[Dict]:
        """
        Collect news from an RSS feed.

        Args:
            feed_url: RSS feed URL
            max_items: Maximum number of articles to collect

        Returns:
            [{'title', 'url', 'content', 'published', 'source'}]
        """
        articles = []

        try:
            feed = feedparser.parse(feed_url)

            for entry in feed.entries[:max_items]:
                # Duplicate check: skip articles whose URL was already seen
                link = entry.get('link', '')
                content_hash = hashlib.md5(link.encode()).hexdigest()
                if content_hash in self.seen_hashes:
                    continue

                # Parse the publication date
                published = None
                if hasattr(entry, 'published_parsed') and entry.published_parsed:
                    try:
                        published = datetime(*entry.published_parsed[:6])
                    except (TypeError, ValueError):
                        pass

                # Extract the body text
                content = ""
                if hasattr(entry, 'summary'):
                    content = self._clean_html(entry.summary)
                elif hasattr(entry, 'description'):
                    content = self._clean_html(entry.description)

                articles.append({
                    'title': entry.get('title', 'Untitled'),
                    'url': link,
                    'content': content[:1000],  # Limit to 1,000 characters
                    'published': published,
                    'source': feed.feed.get('title', feed_url)
                })

                self.seen_hashes.add(content_hash)

        except Exception as e:
            print(f"RSS collection error ({feed_url}): {e}")

        return articles
    
    def collect_webpage(self, url: str) -> Optional[Dict]:
        """
        Extract an article from an ordinary web page.
        """
        try:
            response = self.session.get(url, timeout=10)
            response.raise_for_status()

            soup = BeautifulSoup(response.content, 'html.parser')

            # Extract the title (prefer <h1>, fall back to <title>)
            title = ""
            if soup.find('h1'):
                title = soup.find('h1').get_text(strip=True)
            elif soup.find('title'):
                title = soup.find('title').get_text(strip=True)

            # Extract the body (prefer the <article> tag)
            article = soup.find('article')
            if not article:
                article = soup.find('div', class_=re.compile(r'article|content|post'))

            content = ""
            if article:
                paragraphs = article.find_all('p')
                content = ' '.join([p.get_text(strip=True) for p in paragraphs[:10]])

            return {
                'title': title,
                'url': url,
                'content': content[:1000],
                'published': datetime.now(),  # Collection time; the page's own date is not parsed
                'source': url
            }

        except Exception as e:
            print(f"Web page collection error ({url}): {e}")
            return None
    
    def collect_api(self, api_url: str, headers: Optional[Dict] = None) -> List[Dict]:
        """
        Collect news from a JSON API.
        """
        articles = []

        try:
            response = self.session.get(api_url, headers=headers, timeout=10)
            response.raise_for_status()

            data = response.json()

            # Handle common response shapes: a bare list, or a dict with 'articles' or 'items'
            items = data if isinstance(data, list) else data.get('articles', data.get('items', []))

            for item in items[:10]:
                if isinstance(item, dict):
                    articles.append({
                        'title': item.get('title', ''),
                        'url': item.get('url', item.get('link', '')),
                        # Fall back to '' when the field is missing or null
                        'content': (item.get('description') or item.get('content') or '')[:1000],
                        'published': self._parse_date(item.get('publishedAt', item.get('pubDate', ''))),
                        'source': item.get('source', {}).get('name', api_url) if isinstance(item.get('source'), dict) else api_url
                    })

        except Exception as e:
            print(f"API collection error ({api_url}): {e}")

        return articles

    def _clean_html(self, html_text: str) -> str:
        """Strip HTML tags and tidy the text"""
        if not html_text:
            return ""
        soup = BeautifulSoup(html_text, 'html.parser')
        return soup.get_text(strip=True)

    def _parse_date(self, date_str: str) -> Optional[datetime]:
        """Parse dates in a variety of formats"""
        if not date_str:
            return None
        try:
            return date_parser.parse(date_str)
        except (ValueError, OverflowError):
            return None


# Usage example
if __name__ == "__main__":
    collector = NewsCollector()

    print("=" * 50)
    print("News collector test")
    print("=" * 50)

    # RSS feeds for testing
    test_feeds = [
        ("BBC News", "https://feeds.bbci.co.uk/news/world/rss.xml"),
        ("Korean News", "https://www.yna.co.kr/rss/news.xml")
    ]

    for name, url in test_feeds:
        print(f"\n📡 Collecting {name}...")
        articles = collector.collect_rss(url, max_items=3)
        print(f"   → {len(articles)} articles collected")

        for i, article in enumerate(articles[:2], 1):
            print(f"\n   [{i}] {article['title'][:50]}...")
            print(f"       {article['url'][:60]}...")
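The collector filters duplicates by hashing the article URL, which catches the same link appearing twice but not the same story published under different URLs. One possible extension, sketched here under the assumption that near-identical titles indicate the same story, compares titles by Jaccard similarity of their word sets. The function names and the 0.8 threshold are illustrative choices, not part of the collector above.

```python
import re
from typing import Dict, List, Set


def title_tokens(title: str) -> Set[str]:
    """Lowercase a title and return its set of word tokens (punctuation ignored)."""
    return set(re.findall(r"\w+", title.lower()))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity: shared tokens divided by all distinct tokens."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def drop_near_duplicates(articles: List[Dict], threshold: float = 0.8) -> List[Dict]:
    """Keep the first article of each group whose titles overlap at or above `threshold`."""
    kept: List[Dict] = []
    kept_tokens: List[Set[str]] = []
    for article in articles:
        tokens = title_tokens(article.get("title", ""))
        if any(jaccard(tokens, seen) >= threshold for seen in kept_tokens):
            continue  # Treated as a repeat of an article already kept.
        kept.append(article)
        kept_tokens.append(tokens)
    return kept


if __name__ == "__main__":
    sample = [
        {"title": "Seoul unveils new climate policy", "url": "https://a.example/1"},
        {"title": "Seoul Unveils New Climate Policy!", "url": "https://b.example/9"},
        {"title": "NASA announces lunar mission delay", "url": "https://c.example/3"},
    ]
    for article in drop_near_duplicates(sample):
        print(article["url"])
```

The comparison is quadratic in the number of articles, which is fine for a few dozen items per run; articles with empty titles are never treated as duplicates.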

Step 4: Integrate the news summarization system

Now we'll combine the collector and the AI client into a complete news summarization system.

# news_summary_system.py
import time
import json
from datetime import datetime
from holysheep_client import HolySheepAIClient
from news_collector import NewsCollector

class NewsSummarySystem:
    """AI-based news summarization system - fully integrated version"""

    def __init__(self, api_key: str):
        self.ai_client = HolySheepAIClient(api_key)
        self.collector = NewsCollector()
        self.news_db = []  # Processed news items
        self.feeds = []    # Registered collection sources

    def add_feed(self, name: str, url: str, feed_type: str = "rss"):
        """Add a collection source ('rss', 'web', or 'api')"""
        self.feeds.append({
            'name': name,
            'url': url,
            'type': feed_type
        })
    
    def run(self, summary_model: str = "deepseek/deepseek-chat-v3-0324"):
        """Run the whole pipeline"""
        print("=" * 60)
        print(f"🤖 AI news summarization system started - {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
        print("=" * 60)

        all_articles = []

        # Stage 1: collect news
        print("\n📡 Stage 1: collecting news...")
        for feed in getattr(self, 'feeds', []):
            print(f"   • Processing {feed['name']}...")

            if feed['type'] == 'rss':
                articles = self.collector.collect_rss(feed['url'], max_items=5)
            elif feed['type'] == 'web':
                article = self.collector.collect_webpage(feed['url'])
                articles = [article] if article else []
            elif feed['type'] == 'api':
                articles = self.collector.collect_api(feed['url'])
            else:
                articles = []

            all_articles.extend(articles)
            print(f"     → {len(articles)} collected")
            time.sleep(0.5)  # Avoid overloading the source servers

        print(f"\n   {len(all_articles)} articles collected in total")

        # Stage 2: AI summarization
        print("\n🤖 Stage 2: summarizing with AI...")
        processed_count = 0

        for i, article in enumerate(all_articles, 1):
            if not article.get('content'):
                continue

            try:
                # Generate the summary
                summary = self.ai_client.summarize(
                    article['content'],
                    model=summary_model
                )

                # Extract keywords
                keywords = self.ai_client.extract_keywords(article['content'])

                # Store the processed result
                processed_article = {
                    **article,
                    'summary': summary,
                    'keywords': keywords,
                    'processed_at': datetime.now().isoformat()
                }

                self.news_db.append(processed_article)
                processed_count += 1

                print(f"\n   [{processed_count}/{len(all_articles)}] {article['title'][:40]}...")
                print(f"       📝 Summary: {summary[:100]}...")
                print(f"       🔑 Keywords: {', '.join(keywords)}")

                # Delay between API calls (to avoid rate limiting)
                time.sleep(1.0)

            except Exception as e:
                print(f"\n   ⚠️ Processing failed: {article.get('title', 'Unknown')[:30]}...")
                print(f"      Error: {e}")
                continue

        # Stage 3: report results
        print("\n" + "=" * 60)
        print(f"✅ Done: {processed_count} articles summarized")
        print("=" * 60)

        return self.news_db
    
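    def save_results(self, filename: str = "news_summary_results.json"):
        """Save the results to a JSON file"""
        with open(filename, 'w', encoding='utf-8') as f:
            # default=str converts datetime values such as 'published', which json cannot serialize directly
            json.dump(self.news_db, f, ensure_ascii=False, indent=2, default=str)
        print(f"\n💾 Results saved to {filename}.")

    def get_top_news(self, count: int = 5):
        """Return the first `count` processed articles"""
        return self.news_db[:count]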


# Main entry point
if __name__ == "__main__":
    # Set the HolySheep AI API key
    API_KEY = "YOUR_HOLYSHEEP_API_KEY"

    # Initialize the system
    system = NewsSummarySystem(API_KEY)

    # Configure the collection sources
    system.add_feed("BBC World News", "https://feeds.bbci.co.uk/news/world/rss.xml", "rss")
    system.add_feed("Yonhap News", "https://www.yna.co.kr/rss/news.xml", "rss")
    system.add_feed("TechCrunch", "https://techcrunch.com/feed/", "rss")

    # Run the system
    # Uses DeepSeek V3.2 ($0.42/MTok, low cost)
    results = system.run(summary_model="deepseek/deepseek-chat-v3-0324")

    # Save the results
    system.save_results()

    # Print the top news
    print("\n" + "=" * 60)
    print("🏆 Top 5 news")
    print("=" * 60)

    for i, news in enumerate(system.get_top_news(5), 1):
        print(f"\n{i}. {news['title']}")
        print(f"   Source: {news['source']}")
        print(f"   Summary: {news.get('summary', 'N/A')[:150]}...")
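`get_top_news` returns articles in the order they were processed. If a ranking by importance is wanted, one simple option is to score each article by how many of its keywords match a watchlist, plus a bonus for recency. The sketch below is one such scoring under stated assumptions: the watchlist, the 24-hour decay, the function names, and the choice to treat naive timestamps as UTC are all illustrative.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Iterable, List, Optional


def _as_naive_utc(dt: datetime) -> datetime:
    """Convert aware datetimes to naive UTC; naive ones are assumed to be UTC already."""
    if dt.tzinfo is not None:
        return dt.astimezone(timezone.utc).replace(tzinfo=None)
    return dt


def importance_score(article: Dict, watchlist: Iterable[str],
                     now: Optional[datetime] = None) -> float:
    """Score = number of watchlist keyword hits + a recency bonus decaying to 0 over 24 hours."""
    now = _as_naive_utc(now or datetime.now(timezone.utc))
    wanted = {word.lower() for word in watchlist}
    hits = sum(1 for k in article.get("keywords", []) if k.lower() in wanted)
    recency = 0.0
    published = article.get("published")
    if isinstance(published, datetime):
        age_hours = (now - _as_naive_utc(published)).total_seconds() / 3600
        recency = min(1.0, max(0.0, 1.0 - age_hours / 24))
    return hits + recency


def rank_articles(articles: List[Dict], watchlist: Iterable[str],
                  now: Optional[datetime] = None) -> List[Dict]:
    """Return articles sorted from highest to lowest importance score."""
    watch = list(watchlist)
    return sorted(articles, key=lambda a: importance_score(a, watch, now), reverse=True)


if __name__ == "__main__":
    now = datetime(2024, 12, 15, 12, 0, 0)
    demo = [
        {"title": "Old NASA story", "keywords": ["NASA"], "published": now - timedelta(hours=30)},
        {"title": "Fresh climate story", "keywords": ["Seoul", "climate"], "published": now - timedelta(hours=2)},
    ]
    for article in rank_articles(demo, ["climate", "nasa"], now=now):
        print(article["title"])
```

With scoring like this, `get_top_news` could return `rank_articles(self.news_db, watchlist)[:count]` instead of a plain slice.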

Step 5: Set up real-time automatic updates

Let's set up a scheduler so the system runs at regular intervals.

# scheduler.py
# Requires the schedule package: pip install schedule
import time
import schedule
from datetime import datetime
from news_summary_system import NewsSummarySystem

# HolySheep AI API key (replace with your real key)
API_KEY = "YOUR_HOLYSHEEP_API_KEY"

def job_news_update():
    """Periodic news update job"""
    print(f"\n{'='*60}")
    print(f"🕐 Automatic update started: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
    print(f"{'='*60}")

    system = NewsSummarySystem(API_KEY)

    # Configure the collection sources
    system.add_feed("BBC World News", "https://feeds.bbci.co.uk/news/world/rss.xml", "rss")
    system.add_feed("Yonhap News", "https://www.yna.co.kr/rss/news.xml", "rss")
    system.add_feed("TechCrunch", "https://techcrunch.com/feed/", "rss")
    system.add_feed("NASA News", "https://www.nasa.gov/news-release/feed/", "rss")

    # Run (uses DeepSeek V3.2 for cost optimization)
    results = system.run(summary_model="deepseek/deepseek-chat-v3-0324")

    # Save the results
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
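One way a job like `job_news_update` could be driven on a fixed interval is sketched below. Note the swap: it uses only the standard library instead of the `schedule` package imported above, and the helper name `run_periodically`, its parameters, and the hourly interval in the usage note are assumptions for illustration.

```python
import time
from typing import Callable, Optional


def run_periodically(job: Callable[[], None], interval_seconds: float,
                     max_runs: Optional[int] = None,
                     sleep: Callable[[float], None] = time.sleep) -> int:
    """Call `job` every `interval_seconds`; stop after `max_runs` (None means run until interrupted)."""
    runs = 0
    while max_runs is None or runs < max_runs:
        started = time.monotonic()
        try:
            job()
        except Exception as exc:  # Keep the loop alive if one update fails.
            print(f"Update failed: {exc}")
        runs += 1
        if max_runs is not None and runs >= max_runs:
            break
        # Sleep for whatever remains of the interval after the job's own runtime.
        elapsed = time.monotonic() - started
        sleep(max(0.0, interval_seconds - elapsed))
    return runs


if __name__ == "__main__":
    # Demo with a trivial job and a short interval; three runs, then exit.
    run_periodically(lambda: print("tick"), interval_seconds=0.1, max_runs=3)
```

For the news system, the call might look like `run_periodically(job_news_update, interval_seconds=3600)` for an hourly refresh; stop it with Ctrl+C.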