How We Shrunk Our RAG Dataset from 450GB to Under 10GB in One Day
A real-world story of diagnosing and fixing RAG pipeline bloat — and what it reveals about AI data architecture for production systems.
Ajay Jetty
Founder, Jetty AI & JettyIQ
When our AI lead told us the dataset was approaching 450GB, our first instinct was to panic. We were building a wellness AI assistant, not training a foundation model. Something was clearly wrong.
This is the story of how we diagnosed the problem, fixed it in under 24 hours, and got our production RAG datastore down to under 10GB — exactly where it needed to be for a disciplined V1 launch.
The Architecture We Were Building
Our system — GutBut — is a memory-driven wellness AI assistant. The core retrieval philosophy is deliberately narrow:
- 20 intent archetypes (bloating, sleep, energy, stress, etc.)
- ~1,000 structured questions mapped to those intents
- 3 curated expert sources per question
- Intent-structured evidence-layer RAG, not internet-scale crawling
The math is simple. 1,000 questions × 3 sources = ~3,000 source pages. At 30–80KB of cleaned text per medical article, the expected corpus is roughly 90–240MB of raw text. After chunking and embeddings, the total datastore should land between 10GB and 15GB.
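As a back-of-envelope check (using the per-article sizes above; exact figures vary by source):

```python
pages = 1_000 * 3                      # questions × curated sources
kb_low, kb_high = 30, 80               # cleaned text per medical article, in KB

corpus_mb_low = pages * kb_low / 1024
corpus_mb_high = pages * kb_high / 1024
print(f"raw text corpus: ~{corpus_mb_low:.0f}-{corpus_mb_high:.0f} MB")
# well under half a gigabyte of raw text before chunking and embedding
```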
So when the number came back at 450GB, we knew immediately: this was not a storage problem. It was a scope control problem.
The 7 Causes of RAG Dataset Bloat
After auditing the pipeline, we identified the exact causes. In order of contribution:
1. Multiple Representations of the Same Source
The single biggest contributor. For every source document, the pipeline was storing:
```
raw_html/
clean_text/
chunked_text/
embeddings/
metadata_extraction_outputs/
debug_logs/
```
One article was being stored 5–6 times in different forms. Multiply that by 3,000 sources and the numbers explode fast.
Fix: Store only cleaned text + embeddings + metadata. Delete all intermediate artifacts after the pipeline completes.
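A minimal cleanup step, assuming a pipeline root laid out like the directory listing above (the directory names and `KEEP` set are illustrative, not our exact layout):

```python
import shutil
from pathlib import Path

# Only these survive a completed pipeline run; everything else is intermediate.
KEEP = {"clean_text", "embeddings", "metadata"}

def prune_intermediates(pipeline_root: str) -> list[str]:
    """Delete intermediate artifact directories (raw HTML, debug logs, etc.)."""
    removed = []
    for child in Path(pipeline_root).iterdir():
        if child.is_dir() and child.name not in KEEP:
            shutil.rmtree(child)
            removed.append(child.name)
    return sorted(removed)
```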
2. Crawler Following Internal Links
The scraper had follow_links=True and depth=2 set as defaults. What was supposed to be one Mayo Clinic article on lactose intolerance became the entire Mayo Clinic digestion section — related articles, pagination, sidebars, category indexes.
Fix: Enforce strict URL-level ingestion. Each source is a specific URL, not a domain entry point.
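In config terms, the fix flips the two defaults named above. The `CrawlerConfig` class here is a sketch of the idea, not our actual scraper code:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CrawlerConfig:
    follow_links: bool = False   # was True: one article pulled in a whole site section
    depth: int = 0               # was 2: now one URL fetches exactly one page

# V1 ingestion: every source is a specific article URL, never a domain entry point
config = CrawlerConfig()
assert config.follow_links is False and config.depth == 0
```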
3. Aggressive Chunk Overlap
The chunking configuration was:
```
chunk_size = 400
overlap = 300
```
An overlap of 300 on a chunk size of 400 means 75% of every chunk is duplicated content. One article that should produce 20 chunks was producing 80.
Fix:
```
chunk_size = 800
overlap = 100
```
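The blow-up is easy to reproduce: each new chunk advances the window by only `chunk_size - overlap` characters, so the chunk count is governed by the stride, not the chunk size. (The 8,000-character article length below is illustrative.)

```python
def approx_chunks(text_len: int, chunk_size: int, overlap: int) -> int:
    # Each chunk advances the sliding window by (chunk_size - overlap) characters
    stride = chunk_size - overlap
    return text_len // stride

article_len = 8_000                              # chars in a typical article, illustrative
print(approx_chunks(article_len, 400, 0))        # 20 chunks: no overlap
print(approx_chunks(article_len, 400, 300))      # 80 chunks: 75% overlap, 4x blow-up
print(approx_chunks(article_len, 800, 100))      # 11 chunks: the fixed config
```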
4. Metadata Duplicated Per Chunk Instead of Per Document
Author, citations, trust score, source authority, intent mapping — all of it was being stored inside every single chunk record. For a document split into 50 chunks, that metadata was written 50 times.
Fix: Store metadata once per document. Reference it by document ID from chunk records.
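Schematically, the record shapes look like this (field names are illustrative, not our exact schema):

```python
from dataclasses import dataclass, field

@dataclass
class DocumentRecord:
    doc_id: str
    author: str
    trust_score: float
    source_authority: str
    intent: str                          # one of the 20 intent archetypes

@dataclass
class ChunkRecord:
    doc_id: str                          # back-reference only: no copied metadata
    chunk_index: int
    text: str
    embedding: list[float] = field(default_factory=list)
```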
5. Multiple Embedding Versions Saved During Testing
During development, the team had run the embedding pipeline multiple times with different models and parameters. All versions were saved:
```
embeddings_v1/
embeddings_v2/
embeddings_v3/
```
Fix: Keep only the current production embedding version. Archive or delete test runs.
6. Hardcoded Behavior from the Previous Developer
Several of the worst offenders were hardcoded values left by a previous developer — hardcoded storage paths that wrote to multiple directories simultaneously, hardcoded crawler depth settings, hardcoded retry logic that re-saved failed documents on each attempt.
This is the invisible tax of inherited codebases. The pipeline looked correct from the outside, but was silently multiplying storage on every run.
7. Intermediate Pipeline Artifacts Persisted
Temporary JSON outputs, extraction intermediaries, and debug files were all being persisted to disk and never cleaned up. These accumulated across every pipeline run.
The Fix: One Day, 97% Reduction
After identifying all seven causes, the team applied the fixes in a single day:
| Fix Applied | Size Reduction |
|---|---|
| Remove duplicate storage layers | ~60% reduction |
| Fix crawler to URL-only ingestion | ~20% reduction |
| Fix chunk overlap (400/300 → 800/100) | ~10% reduction |
| Metadata per document instead of per chunk | ~5% reduction |
| Delete old embedding versions + artifacts | ~5% reduction |
Result: 450GB → under 10GB.
That's a 97.8% reduction without losing a single piece of meaningful data.
Why 10GB Is the Right Number
For our exact architecture:
| Layer | Expected Size |
|---|---|
| Cleaned text corpus | ~200–400 MB |
| Chunks | ~1–2 GB |
| Embeddings (text-embedding-3-small) | ~4–6 GB |
| Metadata | < 1 GB |
| Index overhead | ~1–2 GB |
| Total | ~7–12 GB |
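The embeddings row is the easiest to sanity-check: text-embedding-3-small produces 1,536-dimensional vectors, which at float32 is ~6KB per chunk. The chunk-count range below is an assumption, chosen to be consistent with the chunk-layer estimate above:

```python
dims = 1536                              # text-embedding-3-small output dimension
bytes_per_vector = dims * 4              # float32 storage, before index overhead

for n_chunks in (700_000, 1_000_000):    # assumed chunk-count range for this corpus
    gb = n_chunks * bytes_per_vector / 1024**3
    print(f"{n_chunks:,} chunks -> {gb:.1f} GB")
```

That lands at roughly 4.0 and 5.7 GB, consistent with the table's 4–6GB row.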
This is not a coincidence. A 10GB datastore for 1,000 questions × 3 expert sources is architecturally correct. It means the pipeline is behaving like an intent-scoped expert datastore, not a crawler.
The Deeper Lesson: RAG Architecture Philosophy
The 450GB problem was a symptom of a more fundamental confusion that appears in almost every early-stage AI build:
Engineers build a knowledge warehouse. Founders need a retrieval layer.
A knowledge warehouse ingests everything, stores everything, and figures out relevance later. A retrieval layer ingests only what the product needs, stores it efficiently, and retrieves it precisely.
For consumer AI products, the retrieval layer approach wins every time:
- Faster latency — smaller index, faster vector search
- Cheaper embeddings — 3,000 documents vs 300,000 documents
- Stronger citations — curated expert sources vs internet noise
- Controllable personalization — memory weighting works better on a clean signal
- Better explainability — you know exactly what's in the datastore
The moment you let ingestion drift toward domain-scale crawling, you lose all of these properties. Your AI assistant becomes a generic internet RAG wrapper. That's not a product. That's a demo.
The Guardrail We Now Enforce
After this incident, we added a simple pipeline guardrail:
```python
MAX_V1_PAGES = 3500            # 1,000 questions × 3 sources + ~15% buffer
MAX_V1_DATASTORE_GB = 15

if ingested_pages > MAX_V1_PAGES:
    raise PipelineGuardrailError(f"Ingestion exceeded V1 scope: {ingested_pages} pages")

if datastore_size_gb > MAX_V1_DATASTORE_GB:
    raise PipelineGuardrailError(f"Datastore exceeded V1 size limit: {datastore_size_gb}GB")
```
This runs as a pre-flight check before any embedding job starts. It cannot be bypassed without a deliberate scope change decision.
What This Means for AI Readiness
At JettyIQ, we score companies on their AI readiness across five dimensions. Data is one of the most commonly misunderstood. Companies often assume that more data is always better — that a larger dataset means a smarter AI.
The GutBut story illustrates the opposite principle: disciplined data architecture beats raw data volume every time. A 10GB intent-scoped expert datastore will outperform a 450GB internet crawl on every metric that matters for a production product.
If your company is building AI systems and your Data dimension score is lower than you'd like, the problem is rarely that you don't have enough data. It's almost always that the data you have isn't structured, scoped, or governed correctly.
That's a solvable problem — and it's exactly the kind of problem Jetty AI helps companies fix.
Ajay Jetty is the founder of Jetty AI and JettyIQ. JettyIQ scores your company's AI readiness in 3 minutes — free, no credit card required.