Most RAG systems don’t understand sophisticated documents — they shred them
Dippu Kumar Singh | January 31, 2026
Illustration: RAG shredding (CleoP, made with Midjourney)

By now, many enterprises have deployed some form of RAG. The promise is seductive: index your PDFs, connect an LLM and instantly democratize your corporate knowledge. But for industries dependent on heavy engineering, the reality has been underwhelming. Engineers ask specific questions about infrastructure, and the bot hallucinates.

The failure isn't in the LLM. The failure is in the preprocessing. Standard RAG pipelines treat documents as flat strings of text. They use "fixed-size chunking" (cutting a document every 500 characters). This works for prose, but it destroys the logic of technical manuals. It slices tables in half, severs captions from images, and ignores the visual hierarchy of the page. Improving RAG reliability isn't about buying a bigger model; it's about fixing the "dark data" problem through semantic chunking and multimodal textualization. Here is the architectural framework for building a RAG system that can actually read a manual.

The fallacy of fixed-size chunking

In a standard Python RAG tutorial, you split text by character count. In an enterprise PDF, this is disastrous. If a safety specification table spans 1,000 tokens, and your chunk size is 500, you have just split the "voltage limit" header from the "240V" value. The vector database stores them separately. When a user asks, "What is the voltage limit?", the retrieval system finds the header but not the value. The LLM, forced to answer, often guesses.
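To make that failure mode concrete, here is a minimal, self-contained Python sketch of naive fixed-size chunking. It is not the article's pipeline: the chunker, the 500-character setting, and the sample manual excerpt (padded so the chunk boundary falls inside the table) are all illustrative assumptions.

```python
def fixed_size_chunks(text: str, chunk_size: int = 500) -> list[str]:
    """Naive chunking: cut the document every chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

# Hypothetical manual excerpt: prose padded to 458 characters so the
# 500-character boundary lands inside the second row of the safety table.
prose = "Section 4.2 Electrical safety. Installation and maintenance notes."
document = prose.ljust(458) + (
    "| Parameter     | Value |\n"
    "| Voltage limit | 240V  |\n"
    "| Current limit | 16A   |\n"
)

chunks = fixed_size_chunks(document)

# The table row is sliced mid-row: the "Voltage limit" header is embedded
# in one chunk and its "240V" value in the next, so a query about the
# voltage limit retrieves a chunk that never contains the answer.
for i, chunk in enumerate(chunks):
    print(f"chunk {i}: header={'Voltage limit' in chunk}, value={'240V' in chunk}")
# chunk 0: header=True, value=False
# chunk 1: header=False, value=True
```

A structure-aware splitter that treats a whole table, or a heading plus its paragraph, as one atomic unit keeps the header and its value in the same chunk; that is the direction the semantic-chunking fix described above takes.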