How We Shrunk Our RAG Dataset from 450GB to Under 10GB in One Day
A real-world story of diagnosing and fixing RAG pipeline bloat — and what it reveals about AI data architecture for production systems.
Ajay Jetty
Founder, Jetty AI & JettyIQ
When our AI lead told us the dataset was approaching 450GB, our first instinct was to panic. We were building a wellness AI assistant, not training a foundation model. Something was clearly wrong.
This is the story of how we diagnosed the problem, fixed it in under 24 hours, and got our production RAG datastore down to under 10GB — exactly where it needed to be for a disciplined V1 launch.
The Architecture We Were Building
Our system — GutBut — is a memory-driven wellness AI assistant. The core retrieval philosophy is deliberately narrow:
- 20 intent archetypes (bloating, sleep, energy, stress, etc.)
- ~1,000 structured questions mapped to those intents
- 3 curated expert sources per question
- Intent-structured evidence-layer RAG, not internet-scale crawling
The math is simple. 1,000 questions × 3 sources = ~3,000 source pages. At 30–80KB of cleaned text per medical article, the expected corpus is roughly 90–240MB of raw text. After chunking and embeddings, the total datastore should land between 10GB and 15GB.
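As a back-of-envelope check (using the per-article sizes above; exact figures vary by source):

```python
pages = 1_000 * 3                      # questions × curated sources
kb_low, kb_high = 30, 80               # cleaned text per medical article, in KB

corpus_mb_low = pages * kb_low / 1024
corpus_mb_high = pages * kb_high / 1024
print(f"raw text corpus: ~{corpus_mb_low:.0f}-{corpus_mb_high:.0f} MB")
# well under half a gigabyte of raw text before chunking and embedding
```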
So when the number came back at 450GB, we knew immediately: this was not a storage problem. It was a scope control problem.
The 7 Causes of RAG Dataset Bloat
After auditing the pipeline, we identified the exact causes. In order of contribution:
1. Multiple Representations of the Same Source
The single biggest contributor. For every source document, the pipeline was storing:
```
raw_html/
clean_text/
chunked_text/
embeddings/
metadata_extraction_outputs/
debug_logs/
```
One article was being stored 5–6 times in different forms. Multiply that by 3,000 sources and the numbers explode fast.
Fix: Store only cleaned text + embeddings + metadata. Delete all intermediate artifacts after the pipeline completes.
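A minimal cleanup step, assuming a pipeline root laid out like the directory listing above (the directory names and `KEEP` set are illustrative, not our exact layout):

```python
import shutil
from pathlib import Path

# Only these survive a completed pipeline run; everything else is intermediate.
KEEP = {"clean_text", "embeddings", "metadata"}

def prune_intermediates(pipeline_root: str) -> list[str]:
    """Delete intermediate artifact directories (raw HTML, debug logs, etc.)."""
    removed = []
    for child in Path(pipeline_root).iterdir():
        if child.is_dir() and child.name not in KEEP:
            shutil.rmtree(child)
            removed.append(child.name)
    return sorted(removed)
```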
2. Crawler Following Internal Links
The scraper had follow_links=True and depth=2 set as defaults. What was supposed to be one Mayo Clinic article on lactose intolerance became the entire Mayo Clinic digestion section — related articles, pagination, sidebars, category indexes.
Fix: Enforce strict URL-level ingestion. Each source is a specific URL, not a domain entry point.
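In config terms, the fix flips the two defaults named above. The `CrawlerConfig` class here is a sketch of the idea, not our actual scraper code:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CrawlerConfig:
    follow_links: bool = False   # was True: one article pulled in a whole site section
    depth: int = 0               # was 2: now one URL fetches exactly one page

# V1 ingestion: every source is a specific article URL, never a domain entry point
config = CrawlerConfig()
assert config.follow_links is False and config.depth == 0
```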
3. Aggressive Chunk Overlap
The chunking configuration was:
```
chunk_size = 400
overlap = 300
```
An overlap of 300 on a chunk size of 400 means 75% of every chunk is duplicated content. One article that should produce 20 chunks was producing 80.
Fix:
```
chunk_size = 800
overlap = 100
```
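The blow-up is easy to reproduce: each new chunk advances the window by only `chunk_size - overlap` characters, so the chunk count is governed by the stride, not the chunk size. (The 8,000-character article length below is illustrative.)

```python
def approx_chunks(text_len: int, chunk_size: int, overlap: int) -> int:
    # Each chunk advances the sliding window by (chunk_size - overlap) characters
    stride = chunk_size - overlap
    return text_len // stride

article_len = 8_000                              # chars in a typical article, illustrative
print(approx_chunks(article_len, 400, 0))        # 20 chunks: no overlap
print(approx_chunks(article_len, 400, 300))      # 80 chunks: 75% overlap, 4x blow-up
print(approx_chunks(article_len, 800, 100))      # 11 chunks: the fixed config
```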
4. Metadata Duplicated Per Chunk Instead of Per Document
Author, citations, trust score, source authority, intent mapping — all of it was being stored inside every single chunk record. For a document split into 50 chunks, that metadata was written 50 times.
Fix: Store metadata once per document. Reference it by document ID from chunk records.
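Schematically, the record shapes look like this (field names are illustrative, not our exact schema):

```python
from dataclasses import dataclass, field

@dataclass
class DocumentRecord:
    doc_id: str
    author: str
    trust_score: float
    source_authority: str
    intent: str                          # one of the 20 intent archetypes

@dataclass
class ChunkRecord:
    doc_id: str                          # back-reference only: no copied metadata
    chunk_index: int
    text: str
    embedding: list[float] = field(default_factory=list)
```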
5. Multiple Embedding Versions Saved During Testing
During development, the team had run the embedding pipeline multiple times with different models and parameters. All versions were saved:
```
embeddings_v1/
embeddings_v2/
embeddings_v3/
```
Fix: Keep only the current production embedding version. Archive or delete test runs.
6. Hardcoded Behavior from the Previous Developer
Several of the worst offenders were hardcoded values left by a previous developer — hardcoded storage paths that wrote to multiple directories simultaneously, hardcoded crawler depth settings, hardcoded retry logic that re-saved failed documents on each attempt.
This is the invisible tax of inherited codebases. The pipeline looked correct from the outside, but was silently multiplying storage on every run.
7. Intermediate Pipeline Artifacts Persisted
Temporary JSON outputs, extraction intermediaries, and debug files were all being persisted to disk and never cleaned up. These accumulated across every pipeline run.
The Fix: One Day, 97% Reduction
After identifying all seven causes, the team applied the fixes in a single day:
| Fix Applied | Size Reduction |
|---|---|
| Remove duplicate storage layers | ~60% reduction |
| Fix crawler to URL-only ingestion | ~20% reduction |
| Fix chunk overlap (400/300 → 800/100) | ~10% reduction |
| Metadata per document instead of per chunk | ~5% reduction |
| Delete old embedding versions + artifacts | ~5% reduction |
Result: 450GB → under 10GB.
That's a 97.8% reduction without losing a single piece of meaningful data.
Why 10GB Is the Right Number
For our exact architecture:
| Layer | Expected Size |
|---|---|
| Cleaned text corpus | ~200–400 MB |
| Chunks | ~1–2 GB |
| Embeddings (text-embedding-3-small) | ~4–6 GB |
| Metadata | < 1 GB |
| Index overhead | ~1–2 GB |
| Total | ~7–12 GB |
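The embeddings row is the easiest to sanity-check: text-embedding-3-small produces 1,536-dimensional vectors, which at float32 is ~6KB per chunk. The chunk-count range below is an assumption, chosen to be consistent with the chunk-layer estimate above:

```python
dims = 1536                              # text-embedding-3-small output dimension
bytes_per_vector = dims * 4              # float32 storage, before index overhead

for n_chunks in (700_000, 1_000_000):    # assumed chunk-count range for this corpus
    gb = n_chunks * bytes_per_vector / 1024**3
    print(f"{n_chunks:,} chunks -> {gb:.1f} GB")
```

That lands at roughly 4.0 and 5.7 GB, consistent with the table's 4–6GB row.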
This is not a coincidence. A 10GB datastore for 1,000 questions × 3 expert sources is architecturally correct. It means the pipeline is behaving like an intent-scoped expert datastore, not a crawler.
The Deeper Lesson: RAG Architecture Philosophy
The 450GB problem was a symptom of a more fundamental confusion that appears in almost every early-stage AI build:
Engineers build a knowledge warehouse. Founders need a retrieval layer.
A knowledge warehouse ingests everything, stores everything, and figures out relevance later. A retrieval layer ingests only what the product needs, stores it efficiently, and retrieves it precisely.
For consumer AI products, the retrieval layer approach wins every time:
- Faster latency — smaller index, faster vector search
- Cheaper embeddings — 3,000 documents vs 300,000 documents
- Stronger citations — curated expert sources vs internet noise
- Controllable personalization — memory weighting works better on a clean signal
- Better explainability — you know exactly what's in the datastore
The moment you let ingestion drift toward domain-scale crawling, you lose all of these properties. Your AI assistant becomes a generic internet RAG wrapper. That's not a product. That's a demo.
The Guardrail We Now Enforce
After this incident, we added a simple pipeline guardrail:
```python
MAX_V1_PAGES = 3500            # 1,000 questions × 3 sources + ~15% buffer
MAX_V1_DATASTORE_GB = 15

if ingested_pages > MAX_V1_PAGES:
    raise PipelineGuardrailError(f"Ingestion exceeded V1 scope: {ingested_pages} pages")

if datastore_size_gb > MAX_V1_DATASTORE_GB:
    raise PipelineGuardrailError(f"Datastore exceeded V1 size limit: {datastore_size_gb}GB")
```
This runs as a pre-flight check before any embedding job starts. It cannot be bypassed without a deliberate scope change decision.
What This Means for AI Readiness
At JettyIQ, we score companies on their AI readiness across five dimensions. Data is one of the most commonly misunderstood. Companies often assume that more data is always better — that a larger dataset means a smarter AI.
The GutBut story illustrates the opposite principle: disciplined data architecture beats raw data volume every time. A 10GB intent-scoped expert datastore will outperform a 450GB internet crawl on every metric that matters for a production product.
If your company is building AI systems and your Data dimension score is lower than you'd like, the problem is rarely that you don't have enough data. It's almost always that the data you have isn't structured, scoped, or governed correctly.
That's a solvable problem — and it's exactly the kind of problem Jetty AI helps companies fix.
Ajay Jetty is the founder of Jetty AI and JettyIQ. JettyIQ scores your company's AI readiness in 3 minutes — free, no credit card required.