Cookbook

Ingest documents into a RAG pipeline

Chunk per page, keep provenance metadata on every chunk, and hand the result to your vector store of choice.

The shape you want to feed your vector store

A good RAG chunk has text + citation metadata. OCRQueen returns both natively — every block carries page, bbox, text_source, and confidence. You don't have to invent provenance later; it's there from the start.
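
To make that concrete, here is a minimal sketch of a chunk that carries its own citation metadata. The block below is hand-written for illustration: the field names come from the list above, but the bbox layout (`[x0, y0, x1, y1]`), the sample values, and the helpers `to_chunk` and `format_citation` are assumptions, not OCRQueen's documented schema.

```python
# Hypothetical block: field names follow the text above, values are made up.
sample_block = {
    "text": "Revenue grew 12% year over year.",
    "page": 4,
    "bbox": [72.0, 510.5, 540.0, 534.0],  # assumed [x0, y0, x1, y1] layout
    "text_source": "pdf_text_layer",
    "confidence": 1.0,
}


def to_chunk(source_file, block):
    """Pair a block's text with the metadata needed to cite it later."""
    return {
        "text": block["text"],
        "metadata": {
            "source_file": source_file,
            "page": block["page"],
            "bbox": block["bbox"],
            "text_source": block["text_source"],
            "confidence": block["confidence"],
        },
    }


def format_citation(chunk):
    """Render a human-readable citation from a chunk's provenance fields."""
    meta = chunk["metadata"]
    return f"{meta['source_file']}, p. {meta['page']} ({meta['text_source']})"


chunk = to_chunk("annual-report.pdf", sample_block)
citation = format_citation(chunk)
print(citation)
```

Because the provenance travels with the text, the citation can be rebuilt from whatever the retriever returns, without a second lookup against the original document.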

Python — chunk per page, attach metadata

python
from ocrqueen import OCRQueen

client = OCRQueen(api_key="pk_live_xxx")
result = client.extract("annual-report.pdf")

# One chunk per page. For longer pages you can sub-chunk text blocks
# further; for most docs page-level chunks land in the sweet spot for
# retrieval recall (the LLM gets one cohesive section per match).
chunks = []
for page in result.document.pages:
    page_number = page["number"]
    text_pieces = [
        b.get("text") for b in (page.get("blocks") or [])
        if b.get("text")
    ]
    if not text_pieces:
        continue
    chunks.append({
        "text": "\n\n".join(text_pieces),
        "metadata": {
            "source_file": result.document.source.get("filename"),
            "page": page_number,
            # Record which extraction paths produced this page's text
            # (e.g. the byte-perfect PDF text layer vs. OCR), so a page
            # with mixed sources is visible downstream.
            "text_sources": sorted({
                b.get("text_source", "unknown")
                for b in (page.get("blocks") or [])
            }),
        },
    })

# Now hand to your vector store. Example with a generic interface:
#   for chunk in chunks:
#       vectorstore.upsert(
#           id=f"{chunk['metadata']['source_file']}#{chunk['metadata']['page']}",
#           text=chunk["text"],
#           metadata=chunk["metadata"],
#       )
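
For pages that run long, the sub-chunking mentioned in the comment above could look roughly like the sketch below. It works on plain page dictionaries shaped like the ones in the loop (`number`, `blocks`, and `text` per block). The helper names, the `part` metadata key, and the 1500-character budget are illustrative assumptions, not part of the OCRQueen client.

```python
def sub_chunk_blocks(blocks, max_chars=1500):
    """Group consecutive text blocks into sub-chunks of at most max_chars.

    A single block longer than max_chars is kept whole in its own sub-chunk
    rather than being cut mid-sentence.
    """
    groups, current, current_len = [], [], 0
    for block in blocks:
        text = block.get("text")
        if not text:
            continue
        # +2 accounts for the "\n\n" separator used when joining.
        added = len(text) + (2 if current else 0)
        if current and current_len + added > max_chars:
            groups.append(current)
            current, current_len = [], 0
            added = len(text)
        current.append(text)
        current_len += added
    if current:
        groups.append(current)
    return ["\n\n".join(group) for group in groups]


def page_to_chunks(page, source_file, max_chars=1500):
    """Turn one page dict into sub-chunks that all keep page-level provenance."""
    texts = sub_chunk_blocks(page.get("blocks") or [], max_chars)
    return [
        {
            "text": text,
            "metadata": {
                "source_file": source_file,
                "page": page["number"],
                "part": index,  # position of this sub-chunk within the page
            },
        }
        for index, text in enumerate(texts)
    ]
```

If you sub-chunk this way, the vector store id needs the part index as well as the page number, for example `f"{source_file}#{page}-{part}"`, so that sub-chunks from the same page do not overwrite each other.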

Or just use the Markdown render

If your pipeline is simpler (e.g. you split markdown by headings and embed the resulting sections), the deterministic result.markdown field gives you a flat, LLM-ready string with page separators baked in as HTML comments:

python
md = result.markdown          # "# Title\n\n<!-- page 1 -->\n\n..."
sections = md.split("<!-- page ")  # split on the deterministic page marker
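
A plain `str.split` leaves the tail of each marker (`1 -->`) at the start of every section and discards nothing before the first marker. If you want page numbers attached to each section, a regular expression with a capture group is one option. The sketch below assumes the marker looks exactly like the one in the comment above (`<!-- page N -->`) and that each marker precedes its page's content. `split_markdown_by_page` is a hypothetical helper, not part of the OCRQueen client.

```python
import re

# Assumed marker format, taken from the example string above: <!-- page N -->
PAGE_MARKER = re.compile(r"<!--\s*page\s+(\d+)\s*-->")


def split_markdown_by_page(md):
    """Return (page_number, text) pairs.

    Text that appears before the first marker is returned with page None.
    """
    # With one capture group, re.split yields
    # [preamble, number_1, text_1, number_2, text_2, ...].
    parts = PAGE_MARKER.split(md)
    sections = []
    preamble = parts[0].strip()
    if preamble:
        sections.append((None, preamble))
    for number, text in zip(parts[1::2], parts[2::2]):
        text = text.strip()
        if text:
            sections.append((int(number), text))
    return sections


sample = "# Title\n\n<!-- page 1 -->\n\nIntro text.\n\n<!-- page 2 -->\n\nMore text."
pages = split_markdown_by_page(sample)
```

Each pair can then be embedded with its page number as metadata, which keeps the Markdown route citable at page granularity.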

Filter by confidence before embedding

For high-stakes domains (legal, medical, financial), drop blocks below a confidence threshold so your retriever doesn't cite probabilistic OCR text as if it were source-of-truth:

python
HIGH_TRUST_SOURCES = {"pdf_text_layer", "pptx_native", "pptx_chart_data"}

def trusted_blocks(result):
    """Yield only the blocks that are safe to embed and cite."""
    for page in result.document.pages:
        for block in page.get("blocks") or []:
            if block.get("text_source") in HIGH_TRUST_SOURCES:
                yield block            # byte-perfect: always cite
            elif (block.get("confidence") or 0) >= 0.85:
                yield block            # LLM Vision but verified by faithfulness check
            # else: skip, too uncertain to feed the model
