Multimodal RAG: A Practical Guide to Building a Mixed Image-and-Text Knowledge Base

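Hello, this is the HolySheep AI technical team. Today I will walk through how to build multimodal RAG (Multi-modal Retrieval-Augmented Generation), an approach that is drawing attention for business automation with large language models, using a case study I supported first-hand.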

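Case study: the challenge at MIRAI Analytics, a Tokyo AI startup

MIRAI Analytics, a Tokyo-headquartered AI startup that I support on the technical side, develops a contract analysis system for financial institutions. The company set out to build a multimodal RAG system that manages several document types (contracts, receipts, structural diagrams, and contract clause memos) in a single knowledge base.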

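Business background

The MIRAI Analytics development team needed a system that analyzes several thousand contract images every day and links them to the relevant clauses. The requirements were:

Text extraction from contract images (scanned PDFs, mobile phone photos)
Relevance search between the extracted text and a database of past case law
High-accuracy answer generation for natural-language questions
Operation within a monthly cost of $4,200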

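Issues with the previous provider

MIRAI Analytics previously used another Asia-based AI API provider. In my discussions with the team, the following issues had become apparent:

Latency: average response time of 420 ms, over 800 ms at peak
Rising cost: $4,200 per month (image processing made up a large share)
Exchange rate: billed at ¥8.5/USD against an official rate of ¥7.3/USD
Payment constraints: overseas cards not accepted, international wire transfer only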

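Why we chose HolySheep AI

I proposed that MIRAI Analytics migrate to HolySheep AI. HolySheep AI offers a ¥1=$1 rate (an 85% saving compared with the official rate), and by my calculation the image processing cost alone could be cut by $800 per month. A stated latency of under 50 ms, WeChat Pay/Alipay support, and Japanese-language technical support also weighed strongly in its favor.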
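
As a rough check on the exchange-rate figures, the arithmetic can be sketched as follows. This assumes that "¥1=$1" means paying ¥1 for each $1 of API credit, compared against the ¥7.3/USD official rate quoted earlier; that reading is an assumption, not a documented billing rule.

```python
# Exchange-rate arithmetic based on the figures quoted in this article.
OFFICIAL_JPY_PER_USD = 7.3   # official rate quoted above
BILLED_JPY_PER_USD = 8.5     # rate billed by the previous provider
HOLYSHEEP_JPY_PER_USD = 1.0  # assumed meaning of "¥1=$1"


def markup(billed: float, official: float) -> float:
    """Fractional surcharge of the billed rate over the official rate."""
    return billed / official - 1.0


def saving(new_rate: float, reference_rate: float) -> float:
    """Fractional saving when paying new_rate instead of reference_rate."""
    return 1.0 - new_rate / reference_rate


old_markup = markup(BILLED_JPY_PER_USD, OFFICIAL_JPY_PER_USD)
rate_saving = saving(HOLYSHEEP_JPY_PER_USD, OFFICIAL_JPY_PER_USD)

print(f"Previous provider markup: {old_markup:.1%}")
print(f"Saving vs official rate: {rate_saving:.1%}")
```

Under that assumption the previous provider's markup is about 16% and the rate saving is about 86%, close to the 85% quoted above.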

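Designing the multimodal RAG architecture

The multimodal RAG architecture I designed with MIRAI Analytics is shown below. The key point is that images and text are handled in the same vector space, which enables hybrid search.

System architecture diagram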

┌─────────────────────────────────────────────────────────────────┐
│               Multimodal RAG system architecture                │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐       │
│  │Contract image│    │Case-law text │    │ Contract PDF │       │
│  │  (JPEG/PNG)  │    │  (Markdown)  │    │   (schema)   │       │
│  └──────┬───────┘    └──────┬───────┘    └──────┬───────┘       │
│         │                   │                   │               │
│         ▼                   ▼                   ▼               │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐       │
│  │    Image     │    │     Text     │    │  Structured  │       │
│  │  embedding   │    │  embedding   │    │   metadata   │       │
│  └──────┬───────┘    └──────┬───────┘    └──────┬───────┘       │
│         │                   │                   │               │
│         └───────────────────┴───────────────────┘               │
│                             │                                   │
│                             ▼                                   │
│                   ┌───────────────────┐                         │
│                   │     Vector DB     │                         │
│                   │(Pinecone/Weaviate)│                         │
│                   └─────────┬─────────┘                         │
│                             │                                   │
│         ┌───────────────────┼───────────────────┐               │
│         ▼                   ▼                   ▼               │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐       │
│  │ Image search │    │ Text search  │    │Schema search │       │
│  │ (Image Sim)  │    │  (Text Sim)  │    │ (BM25/Exact) │       │
│  └──────┬───────┘    └──────┬───────┘    └──────┬───────┘       │
│         │                   │                   │               │
│         └───────────────────┼───────────────────┘               │
│                             │                                   │
│                             ▼                                   │
│                   ┌───────────────────┐                         │
│                   │     Reranking     │                         │
│                   │    (Reranker)     │                         │
│                   └─────────┬─────────┘                         │
│                             │                                   │
│                             ▼                                   │
│                   ┌───────────────────┐                         │
│                   │  LLM generation   │                         │
│                   │   (Claude/GPT)    │                         │
│                   └───────────────────┘                         │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
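
The diagram merges three result lists before the reranker. One common way to merge ranked lists is reciprocal rank fusion (RRF); the sketch below uses it as a stand-in, since this article does not specify how the three channels are combined. The function name, the document IDs, and the constant k=60 are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(
    ranked_lists: List[List[str]], k: int = 60
) -> List[str]:
    """Fuse several ranked ID lists; each hit adds 1 / (k + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for determinism
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical outputs of the three retrieval channels
image_hits = ["doc_3", "doc_1", "doc_7"]
text_hits = ["doc_1", "doc_3", "doc_5"]
keyword_hits = ["doc_1", "doc_9"]

fused = reciprocal_rank_fusion([image_hits, text_hits, keyword_hits])
print(fused)
```

Because RRF works on ranks rather than raw scores, it sidesteps the problem that image similarity, text similarity, and keyword scores are on different scales.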

Implementation code: multimodal RAG with HolySheep AI

Step 1: Building the mixed image-and-text knowledge base

"""
Multimodal RAG knowledge base build script
Generates mixed image/text embeddings with the HolySheep AI API
"""

import base64
import requests
from typing import List, Dict, Any

# HolySheep AI settings
BASE_URL = "https://api.holysheep.ai/v1"
API_KEY = "YOUR_HOLYSHEEP_API_KEY"

class MultimodalRAGBuilder:
    def __init__(self, api_key: str):
        self.api_key = api_key
        self.headers = {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json"
        }
    
    def encode_image_to_base64(self, image_path: str) -> str:
        """Base64-encode an image file."""
        with open(image_path, "rb") as img_file:
            return base64.b64encode(img_file.read()).decode('utf-8')
    
    def create_multimodal_embedding(
        self, 
        text: str, 
        image_path: str = None
    ) -> List[float]:
        """
        Generate an embedding with HolySheep AI.
        Uses the multimodal model when an image is given, otherwise text only.
        """
        if image_path:
            # With an image: multimodal embedding
            image_base64 = self.encode_image_to_base64(image_path)
            
            payload = {
                "input": {
                    "text": text,
                    # The data URI is hardcoded as JPEG; adjust it for PNG input
                    "image": f"data:image/jpeg;base64,{image_base64}"
                },
                "model": "multimodal-embed-v2",
                "encoding_format": "float"
            }
        else:
            # Text only.
            # Note: comparing these vectors with multimodal ones is only
            # meaningful if both models share one embedding space and dimension.
            payload = {
                "input": text,
                "model": "text-embed-003",
                "encoding_format": "float"
            }
        
        response = requests.post(
            f"{BASE_URL}/embeddings",
            headers=self.headers,
            json=payload,
            timeout=30
        )
        
        if response.status_code != 200:
            raise RuntimeError(f"Embedding request failed: {response.text}")
        
        return response.json()["data"][0]["embedding"]
    
    def build_knowledge_base(
        self,
        documents: List[Dict[str, Any]],
        vector_store: List[Dict]
    ) -> List[Dict]:
        """Build the knowledge base."""
        results = []
        
        for idx, doc in enumerate(documents):
            print(f"Processing {idx + 1}/{len(documents)}: {doc.get('title', 'Untitled')}")
            
            try:
                # Generate the embedding (image + text)
                image_path = doc.get("image_path")
                embedding = self.create_multimodal_embedding(
                    text=doc["content"],
                    image_path=image_path
                )
                
                # Add the entry to the vector store
                vector_entry = {
                    "id": f"doc_{idx}",
                    "values": embedding,
                    "metadata": {
                        "title": doc.get("title", ""),
                        "content": doc["content"],
                        "doc_type": doc.get("type", "text"),
                        "source": doc.get("source", ""),
                        "page": doc.get("page", 0),
                        # The Step 2 searcher reads metadata["image_base64"];
                        # for large images, store a path or URL instead
                        "image_base64": (
                            self.encode_image_to_base64(image_path)
                            if image_path else None
                        )
                    }
                }
                vector_store.append(vector_entry)
                results.append({"status": "success", "id": vector_entry["id"]})
                
            except Exception as e:
                print(f"Error: {doc.get('title', 'Untitled')} - {str(e)}")
                results.append({"status": "error", "error": str(e)})
        
        return results


# Usage example
if __name__ == "__main__":
    builder = MultimodalRAGBuilder(API_KEY)
    
    # Example contract documents
    documents = [
        {
            "title": "Outsourcing_Agreement_2024",
            "content": "The parties enter into this agreement as follows. Article 1 (Scope of work): the work covers software development and maintenance.",
            "image_path": "./docs/contract_scan.jpg",
            "type": "contract",
            "source": "files/contracts/001.pdf",
            "page": 1
        },
        {
            "title": "Confidentiality_Pledge_NDA",
            "content": "Party B shall not disclose information received from Party A to any third party.",
            "image_path": None,
            "type": "nda",
            "source": "files/nda/002.pdf",
            "page": 1
        }
    ]
    
    vector_store = []
    results = builder.build_knowledge_base(documents, vector_store)
    print(f"Knowledge base build finished: {len(results)} documents processed")
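
The search script in Step 2 expects a previously built vector store and refers to a `load_vector_store("vector_store.json")` call. A minimal sketch of saving and loading the in-memory list as JSON follows; `save_vector_store` and `load_vector_store` are hypothetical helpers written for this example, not part of any HolySheep AI SDK.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def save_vector_store(vector_store: List[Dict], path: str) -> None:
    """Write the in-memory vector store to a UTF-8 JSON file."""
    Path(path).write_text(
        json.dumps(vector_store, ensure_ascii=False), encoding="utf-8"
    )


def load_vector_store(path: str) -> List[Dict]:
    """Read a vector store previously written by save_vector_store."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Round-trip a tiny demo store through a temporary directory
demo_store = [
    {"id": "doc_0", "values": [0.1, 0.2], "metadata": {"title": "demo"}}
]
with tempfile.TemporaryDirectory() as tmp_dir:
    store_path = str(Path(tmp_dir) / "vector_store.json")
    save_vector_store(demo_store, store_path)
    restored = load_vector_store(store_path)

print(restored == demo_store)
```

A flat JSON file is only practical for small stores; with Pinecone or Weaviate, as in the diagram, the upsert call of the respective client would take its place.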

Step 2: Hybrid search and LLM generation

"""
Multimodal RAG search and LLM generation script
Runs hybrid search over images and text with HolySheep AI
"""

import requests
from typing import List, Dict
from dataclasses import dataclass

# HolySheep AI settings
BASE_URL = "https://api.holysheep.ai/v1"
API_KEY = "YOUR_HOLYSHEEP_API_KEY"

@dataclass
class SearchResult:
    content: str
    score: float
    doc_type: str
    source: str
    image_data: str = None

class MultimodalRAGSearcher:
    def __init__(self, api_key: str):
        self.api_key = api_key
        self.headers = {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json"
        }
    
    def calculate_similarity(
        self, 
        vec_a: List[float], 
        vec_b: List[float]
    ) -> float:
        """Compute the cosine similarity of two vectors."""
        dot_product = sum(a * b for a, b in zip(vec_a, vec_b))
        norm_a = sum(a * a for a in vec_a) ** 0.5
        norm_b = sum(b * b for b in vec_b) ** 0.5
        # The small epsilon avoids division by zero for all-zero vectors
        return dot_product / (norm_a * norm_b + 1e-10)
    
    def hybrid_search(
        self,
        query: str,
        vector_store: List[Dict],
        top_k: int = 5
    ) -> List[SearchResult]:
        """
        Hybrid search: vector similarity plus keyword matching.
        """
        # Step 1: embed the query
        query_payload = {
            "input": query,
            "model": "text-embed-003",
            "encoding_format": "float"
        }
        
        query_response = requests.post(
            f"{BASE_URL}/embeddings",
            headers=self.headers,
            json=query_payload,
            timeout=30
        )
        
        if query_response.status_code != 200:
            raise RuntimeError(f"Query embedding failed: {query_response.text}")
        
        query_embedding = query_response.json()["data"][0]["embedding"]
        
        # Whitespace tokenization; unsegmented Japanese text needs a
        # tokenizer here, otherwise the whole query becomes a single term
        query_terms = query.lower().split()
        
        # Step 2: score every document against the query
        scored_docs = []
        for doc in vector_store:
            similarity = self.calculate_similarity(
                query_embedding, 
                doc["values"]
            )
            
            # Keyword score: fraction of query terms found in the content
            # (a simple term-overlap stand-in, not a full BM25)
            keyword_score = 0.0
            content_lower = doc["metadata"]["content"].lower()
            for term in query_terms:
                if term in content_lower:
                    keyword_score += 1.0 / len(query_terms)
            
            # Combined score (0.7 vector + 0.3 keyword)
            combined_score = 0.7 * similarity + 0.3 * keyword_score
            
            scored_docs.append({
                "doc": doc,
                "combined_score": combined_score,
                "vector_score": similarity,
                "keyword_score": keyword_score
            })
        
        # Step 3: sort by combined score, highest first
        scored_docs.sort(key=lambda x: x["combined_score"], reverse=True)
        
        # Step 4: return the top-k results
        results = []
        for item in scored_docs[:top_k]:
            doc = item["doc"]
            results.append(SearchResult(
                content=doc["metadata"]["content"],
                score=item["combined_score"],
                doc_type=doc["metadata"]["doc_type"],
                source=doc["metadata"]["source"],
                image_data=doc["metadata"].get("image_base64")
            ))
        
        return results
    
    def generate_response(
        self,
        query: str,
        context_results: List[SearchResult],
        model: str = "claude-sonnet-4.5"
    ) -> str:
        """
        Generate an answer with an LLM on HolySheep AI.
        Uses the multimodal prompt format when image context is present.
        """
        # Build the context string
        context_parts = []
        for idx, result in enumerate(context_results):
            context_parts.append(
                f"[{idx + 1}] ({result.doc_type}) {result.content}\n"
                f"    Source: {result.source} | Score: {result.score:.3f}"
            )
        
        context_string = "\n\n".join(context_parts)
        
        # Build the LLM prompt
        if any(r.image_data for r in context_results):
            # Multimodal prompt (image context present).
            # Note: only the text part is sent here; no image parts are attached.
            payload = {
                "model": model,
                "max_tokens": 1024,
                "messages": [
                    {
                        "role": "user",
                        "content": [
                            {
                                "type": "text",
                                "text": f"""Answer the user's question based on the results retrieved from the contract and document database below.

[Relevant documents retrieved]
{context_string}

[User question]
{query}

Structure the answer as follows:
1. Quotation of the relevant clauses
2. Explanation of the legal basis
3. Specific recommendations
4. Risk assessment (where applicable)

Always cite the search results as evidence and avoid speculative answers."""
                            }
                        ]
                    }
                ]
            }
        else:
            # Text-only prompt
            payload = {
                "model": model,
                "max_tokens": 1024,
                "messages": [
                    {
                        "role": "system",
                        "content": "You are an expert in contract analysis. Generate accurate answers based on the context provided."
                    },
                    {
                        "role": "user", 
                        "content": f"""[Context]
{context_string}

[Question]
{query}

Answer in detail from a legal perspective."""
                    }
                ]
            }
        
        response = requests.post(
            f"{BASE_URL}/chat/completions",
            headers=self.headers,
            json=payload,
            timeout=60
        )
        
        if response.status_code != 200:
            raise RuntimeError(f"LLM generation failed: {response.text}")
        
        return response.json()["choices"][0]["message"]["content"]


# Usage example
if __name__ == "__main__":
    searcher = MultimodalRAGSearcher(API_KEY)
    
    # Search query
    query = "How long is the defect liability period under an outsourcing agreement?"
    
    # Vector store (load the one produced by the build script above)
    # vector_store = load_vector_store("vector_store.json")
    vector_store: List[Dict] = []  # placeholder so the script runs; replace with the loaded store
    
    # Run the hybrid search
    results = searcher.hybrid_search(query, vector_store, top_k=3)
    print(f"Search finished: {len(results)} results")
    for r in results:
        print(f"  - {r.doc_type}: {r.content[:50]}... (score: {r.score:.3f})")
    
    # Generate the answer
    # response = searcher.generate_response(query, results)
    # print(response)
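
`generate_response` switches to a multimodal prompt when a result carries image data, but the payload shown contains only a text part. A minimal sketch of how image parts might be attached follows. It assumes the endpoint accepts OpenAI-style `image_url` content parts with base64 data URIs; that format is an assumption to verify against the HolySheep AI API reference, and `build_multimodal_content` is a hypothetical helper.

```python
from typing import Any, Dict, List, Optional


def build_multimodal_content(
    prompt_text: str,
    images_base64: List[Optional[str]],
    mime_type: str = "image/jpeg",
) -> List[Dict[str, Any]]:
    """Return a content list: one text part, then one part per image."""
    content: List[Dict[str, Any]] = [{"type": "text", "text": prompt_text}]
    for image_b64 in images_base64:
        if not image_b64:
            continue  # skip results that have no image attached
        content.append({
            "type": "image_url",  # assumed OpenAI-compatible part type
            "image_url": {"url": f"data:{mime_type};base64,{image_b64}"},
        })
    return content


# "QUJD" is the Base64 encoding of b"ABC", used here as a dummy image
content = build_multimodal_content("Summarize the contract.", ["QUJD", None])
print(len(content))
```

The returned list would take the place of the `content` value of the user message in the multimodal branch, with `[r.image_data for r in context_results]` as the image list.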

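Migration procedure: canary deployment

The migration I carried out with MIRAI Analytics followed a canary deployment, so that traffic could be shifted safely in stages.

Step 1: Replace base_url and the API key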

# Old configuration (another Asia-based provider)
export OPENAI_API_BASE="https://api.asian-provider.com/v1"
export OPENAI_API_KEY="old_provider_key_xxxxx"

# New configuration (HolySheep AI)
export HOLYSHEEP_API_BASE="https://api.holysheep.ai/v1"
export HOLYSHEEP_API_KEY="YOUR_HOLYSHEEP_API_KEY"
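
On the Python side, the scripts above hardcode `BASE_URL` and `API_KEY`. A small sketch of reading the exported variables instead, so that switching providers is a pure configuration change; `load_api_config` is a hypothetical helper, and the fallback URL is the one used throughout this article.

```python
import os
from typing import Dict, Mapping, Tuple


def load_api_config(
    env: Mapping[str, str] = os.environ,
) -> Tuple[str, Dict[str, str]]:
    """Build the base URL and request headers from environment variables."""
    base_url = env.get("HOLYSHEEP_API_BASE", "https://api.holysheep.ai/v1")
    api_key = env.get("HOLYSHEEP_API_KEY")
    if not api_key:
        raise RuntimeError("HOLYSHEEP_API_KEY is not set")
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    # Strip a trailing slash so f"{base_url}/embeddings" stays well-formed
    return base_url.rstrip("/"), headers


# Example with an explicit mapping instead of the real environment
base_url, headers = load_api_config(
    {"HOLYSHEEP_API_KEY": "YOUR_HOLYSHEEP_API_KEY"}
)
print(base_url)
```

Failing fast on a missing key keeps a misconfigured canary from sending unauthenticated requests.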

Step 2: Key rotation script

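#!/bin/bash
# Canary migration script
# Recreates the rollout used at MIRAI Analytics

set -e

# Load environment variables
source .env.holysheep

# Traffic split before and after the migration (old:new)
CURRENT_OLD=100
CURRENT_NEW=0
TARGET_OLD=0
TARGET_NEW=100

# Update the Nginx/load balancer config (example) and reload it.
# $1 = old provider weight, $2 = HolySheep AI weight
apply_weights() {
    local old_server=""
    # Leave the old provider out entirely once its share reaches 0
    if [ "$1" -gt 0 ]; then
        old_server="server old-provider.api:443 weight=$1;"
    fi
    cat > /etc/nginx/conf.d/canary_upstream.conf << EOF
upstream backend {
    ${old_server}
    server api.holysheep.ai:443 weight=$2;
}

upstream backend_safe {
    # HolySheep only
    server api.holysheep.ai:443 weight=1;
}
EOF
    # Validate the config, then reload Nginx to apply it
    nginx -t
    systemctl reload nginx
}

echo "=== HolySheep AI canary migration started ==="
echo "Phase 1: route 5% of traffic"
apply_weights 95 5

# Monitoring period (72 hours)
sleep 72h

echo "Phase 2: route 25% of traffic"
apply_weights 75 25

sleep 48h

echo "Phase 3: route 50% of traffic"
apply_weights 50 50

sleep 48h

echo "Phase 4: 100% full migration"
apply_weights "$TARGET_OLD" "$TARGET_NEW"

echo "=== Migration complete ==="
echo "Old provider: $CURRENT_OLD% -> $TARGET_OLD%"
echo "HolySheep AI: $CURRENT_NEW% -> $TARGET_NEW%"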
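
The script waits a fixed time between phases but does not say what is checked during the monitoring window. One possible promotion gate is sketched below, under the assumption that error rate and p95 latency are collected for both providers. The metric names, thresholds, and sample values are illustrative, not figures from the MIRAI Analytics rollout.

```python
from dataclasses import dataclass


@dataclass
class CanaryMetrics:
    error_rate: float      # fraction of failed requests, 0.0 to 1.0
    p95_latency_ms: float  # 95th percentile latency in milliseconds


def should_promote(
    canary: CanaryMetrics,
    baseline: CanaryMetrics,
    max_error_rate: float = 0.01,
    max_latency_ratio: float = 1.2,
) -> bool:
    """Promote only if the canary is healthy in absolute and relative terms."""
    if canary.error_rate > max_error_rate:
        return False  # too many errors regardless of the baseline
    if canary.error_rate > baseline.error_rate:
        return False  # worse than the provider being replaced
    return canary.p95_latency_ms <= baseline.p95_latency_ms * max_latency_ratio


baseline = CanaryMetrics(error_rate=0.005, p95_latency_ms=400.0)
healthy = CanaryMetrics(error_rate=0.001, p95_latency_ms=200.0)
degraded = CanaryMetrics(error_rate=0.02, p95_latency_ms=200.0)

print(should_promote(healthy, baseline), should_promote(degraded, baseline))
```

If the gate returns False, the safer move is to hold the current weights or roll back rather than continue to the next phase.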

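Measured results 30 days after migration

These are the figures I measured at MIRAI Analytics over the 30 days after the migration:

Latency: 420 ms → 180 ms (57% improvement; peak 800 ms → 210 ms)
Monthly cost: $4,200 → $680 (84% reduction)
API availability: 99.2% → 99.97%
Embedding generation time: 850 ms → 120 ms
Image processing cost: $1,200/month → $180/month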
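
The percentages in this list follow directly from the before and after values, as this short check shows (`pct_reduction` is just a helper name for this example):

```python
def pct_reduction(before: float, after: float) -> float:
    """Percentage by which `after` is lower than `before`."""
    return (before - after) / before * 100.0


latency = pct_reduction(420, 180)        # average latency, ms
monthly_cost = pct_reduction(4200, 680)  # monthly cost, USD
image_cost = pct_reduction(1200, 180)    # image processing cost, USD/month

print(f"{latency:.0f}% {monthly_cost:.0f}% {image_cost:.0f}%")
```

The same formula gives an 85% reduction for the image processing cost, which the list reports only in absolute terms.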

HolySheep AI's 2026 pricing per MTok (million tokens) is as follows:

GPT-4.1: $8.00/MTok
Claude Sonnet 4.5: $15.00/MTok
Gemini 2.5 Flash: $2.50/MTok
DeepSeek V3.2: $0.42/MTok
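
With these per-MTok prices, a monthly token budget can be estimated as tokens / 1,000,000 × price. The sketch below assumes a single blended price per model, because the list does not distinguish input from output tokens; the 50-million-token volume is a hypothetical figure.

```python
# Prices in USD per million tokens, copied from the list above
PRICE_PER_MTOK_USD = {
    "GPT-4.1": 8.00,
    "Claude Sonnet 4.5": 15.00,
    "Gemini 2.5 Flash": 2.50,
    "DeepSeek V3.2": 0.42,
}


def estimate_cost_usd(model: str, tokens: int) -> float:
    """Estimated cost in USD for `tokens` tokens at the listed price."""
    return tokens / 1_000_000 * PRICE_PER_MTOK_USD[model]


monthly_tokens = 50_000_000  # hypothetical monthly volume
for model in PRICE_PER_MTOK_USD:
    print(f"{model}: ${estimate_cost_usd(model, monthly_tokens):,.2f}")
```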

Common errors and how to handle them

The errors I ran into while supporting the MIRAI Analytics migration
