Your RAG isn't broken. Your chunking is.
When founders bring us a RAG project that's already in production and "not working," nine times out of ten the problem isn't the embedding model, the vector database, or the LLM doing the answering. It's the chunking.
Chunking is the first transformation your data goes through and the one that gets the least thought. Most teams pick a chunk size — 512 tokens is the popular default — and ship it. Then they spend three months tuning rerankers and prompt templates trying to fix problems the chunker created.
Here are the failure modes we see most.
Failure mode 1: Cutting mid-thought. Fixed-size chunkers are blind to structure. They'll split a paragraph between sentences, between bullets, sometimes mid-sentence. The chunk that gets retrieved looks plausible but doesn't actually contain the answer: the answer is split across two chunks, and neither half scores well enough on its own to win the retrieval.
Fix: chunk by structure first, size second. Markdown headings, HTML sections, code blocks — these are natural boundaries. Use them. Fall back to token-based splitting only when you've run out of structure.
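A minimal sketch of what "structure first, size second" can look like for markdown. The 512-token budget, the whitespace token count, and the paragraph-packing fallback are illustrative assumptions, not any particular library's API:

```python
import re

MAX_TOKENS = 512  # illustrative budget; tune per embedding model

def n_tokens(text: str) -> int:
    # Crude whitespace proxy; swap in your embedding model's real tokenizer.
    return len(text.split())

def chunk_markdown(doc: str) -> list[str]:
    # Split at heading boundaries, keeping each heading with its section.
    sections = re.split(r"(?m)^(?=#{1,6} )", doc)
    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        if n_tokens(section) <= MAX_TOKENS:
            chunks.append(section)  # the structural boundary was enough
            continue
        # Out of structure: fall back to packing paragraphs under the budget.
        buf: list[str] = []
        for para in section.split("\n\n"):
            if buf and n_tokens("\n\n".join(buf + [para])) > MAX_TOKENS:
                chunks.append("\n\n".join(buf))
                buf = []
            buf.append(para)
        if buf:
            chunks.append("\n\n".join(buf))
    return chunks
```

The same idea extends to HTML sections and code blocks; only the boundary pattern changes.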
Failure mode 2: Chunks too small to be self-contained. A 256-token chunk often loses its referent. "It increased by 18%" is meaningless without the surrounding "Quarterly revenue from the SMB segment." If your retrieval surfaces the small chunk, the LLM downstream has to either confabulate the missing context or refuse.
Fix: bigger chunks (or smaller chunks plus parent-document retrieval — store the small chunk for retrieval but pass the parent to the model).
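A sketch of the parent-document pattern. To keep the example self-contained, a word-overlap score stands in for real embedding similarity; Chunk, parents, and retrieve are illustrative names, not a particular framework's API:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str       # small chunk, matched against queries
    parent_id: str  # key back to the full section it was cut from

def score(query: str, text: str) -> float:
    # Word-overlap stand-in for cosine similarity over real embeddings.
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / (len(q) or 1)

def retrieve(query: str, chunks: list[Chunk],
             parents: dict[str, str], k: int = 5) -> list[str]:
    # Match on the small chunks, but return the parent documents.
    best = sorted(chunks, key=lambda c: score(query, c.text), reverse=True)[:k]
    seen, out = set(), []
    for c in best:
        if c.parent_id not in seen:  # dedupe: sibling chunks share a parent
            seen.add(c.parent_id)
            out.append(parents[c.parent_id])
    return out
```

The dedupe matters: several sibling chunks often match the same query, and you want the parent once, not five overlapping copies of it.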
Failure mode 3: Chunks too big to be specific. The opposite problem. A 2,000-token chunk is hard to vector-match because the embedding gets averaged across too many topics. You retrieve chunks that are topically adjacent to the query, not ones that answer it.
Fix: tighter semantic chunking, sometimes paired with summary embeddings (embed a summary of the chunk, retrieve, then pass the full chunk to the model).
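The summary-embedding half of that fix, as a sketch: embed a short summary, retrieve against it, hand the full chunk to the model. Here summarize is a stand-in for an LLM call, and the overlap score again stands in for embedding similarity:

```python
from dataclasses import dataclass

@dataclass
class IndexEntry:
    summary: str    # short, single-topic text; this is what gets embedded
    full_text: str  # the original chunk; this is what the model sees

def summarize(chunk: str) -> str:
    # Stand-in for an LLM summarization call. One or two sentences naming
    # the chunk's topic keeps the embedding from averaging across topics.
    return chunk[:200]

def build_index(chunks: list[str]) -> list[IndexEntry]:
    return [IndexEntry(summary=summarize(c), full_text=c) for c in chunks]

def retrieve(query: str, index: list[IndexEntry], k: int = 3) -> list[str]:
    # Rank by similarity to the summary, return the full chunk.
    def score(a: str, b: str) -> float:
        sa, sb = set(a.lower().split()), set(b.lower().split())
        return len(sa & sb) / (len(sa) or 1)
    ranked = sorted(index, key=lambda e: score(query, e.summary), reverse=True)
    return [e.full_text for e in ranked[:k]]
```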
Failure mode 4: One chunker for all data shapes. A PDF whitepaper, a Slack thread, a support ticket, and a code file have radically different shapes. The same chunker treats them all like a wall of prose. The wall-of-prose chunker is wrong for at least three of them.
Fix: one chunking strategy per data shape. Code: chunk by function or class. Threads: chunk by message or by conversation turn-window. Tables: don't chunk — keep them whole and let the retrieval handle the indirection.
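One way to wire that up is a dispatch table keyed by data shape. The shape labels are assumptions about your ingestion pipeline; the code chunker uses Python's ast module to cut at top-level functions and classes:

```python
import ast

def chunk_code(source: str) -> list[str]:
    # One chunk per top-level function or class (Python 3.8+ for end_lineno).
    tree = ast.parse(source)
    lines = source.splitlines()
    return [
        "\n".join(lines[node.lineno - 1 : node.end_lineno])
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    ]

def chunk_thread(messages: list[str], window: int = 5) -> list[str]:
    # Non-overlapping turn-windows over a conversation.
    return ["\n".join(messages[i : i + window])
            for i in range(0, len(messages), window)]

CHUNKERS = {
    "code": chunk_code,
    "thread": chunk_thread,
    "table": lambda table: [table],  # tables stay whole
    # "prose" falls through to the structure-first chunker sketched earlier
}
```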
The diagnostic. Before you reach for a different embedding model or a fancier reranker, do this: take 20 queries that returned bad answers. For each one, look at what was retrieved. In our experience, in roughly 70% of broken RAGs the answer is in the corpus but not in any chunk that retrieval surfaced. That's a chunking problem, not a model problem.
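That diagnostic as a loop, if you want to automate it. Here retrieve is whatever your pipeline already exposes, and substring containment is a crude stand-in for the by-hand check; with 20 queries, eyeballing the retrieved chunks works just as well:

```python
from typing import Callable

def diagnose(failures: list[tuple[str, str]],
             retrieve: Callable[[str], list[str]]) -> float:
    """failures: (query, known_answer_snippet) pairs from bad answers.
    Returns the fraction where no retrieved chunk contained the answer."""
    missed = 0
    for query, answer in failures:
        chunks = retrieve(query)
        if not any(answer.lower() in chunk.lower() for chunk in chunks):
            missed += 1  # answer is in the corpus, but no surfaced chunk has it
    return missed / len(failures)  # high ratio: chunking problem, not model problem
```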
Fix the chunker first. Most of the time, it turns out the rest of the system was fine all along.
— Wash Candido