Cookbook
Ingest documents into a RAG pipeline
Chunk per-page, keep provenance metadata on every chunk, hand to your vector store of choice.
The shape you want to feed your vector store
A good RAG chunk has text + citation metadata. OCRQueen returns both natively — every block carries page, bbox, text_source, and confidence. You don't have to invent provenance later; it's there from the start.
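To make the target shape concrete, here is a minimal sketch of a chunk record that keeps the text and its citation metadata together. The `Chunk` dataclass, its `cite` helper, and the sample values are illustrative assumptions and not part of the OCRQueen SDK; only the idea of carrying page and text_source alongside the text comes from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical chunk record: the text to embed plus citation metadata."""
    text: str
    source_file: str
    page: int
    text_sources: list = field(default_factory=list)

    def cite(self) -> str:
        # Human-readable citation a retriever can show next to an answer.
        return f"{self.source_file}, p. {self.page}"


# Sample values are made up for illustration.
chunk = Chunk(
    text="Revenue grew year over year.",
    source_file="annual-report.pdf",
    page=12,
    text_sources=["pdf_text_layer"],
)
print(chunk.cite())
```

Keeping the metadata on the record itself means the citation travels with the text through embedding and retrieval, rather than being looked up afterwards.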
Python — chunk per page, attach metadata
from ocrqueen import OCRQueen
client = OCRQueen(api_key="pk_live_xxx")
result = client.extract("annual-report.pdf")
# One chunk per page. For longer pages you can sub-chunk text blocks
# further; for most docs page-level chunks land in the sweet spot for
# retrieval recall (the LLM gets one cohesive section per match).
chunks = []
for page in result.document.pages:
    page_number = page["number"]
    text_pieces = [
        b.get("text") for b in (page.get("blocks") or [])
        if b.get("text")
    ]
    if not text_pieces:
        continue
    chunks.append({
        "text": "\n\n".join(text_pieces),
        "metadata": {
            "source_file": result.document.source.get("filename"),
            "page": page_number,
            # Every text block on this page came from the PDF text layer
            # (byte-perfect) unless any was OCR'd; capture the mix.
            "text_sources": sorted({
                b.get("text_source", "unknown")
                for b in (page.get("blocks") or [])
            }),
        },
    })
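For pages too long for a single embedding, the sub-chunking mentioned in the comment above could be sketched as a greedy packer over a page's text blocks. The `sub_chunk` helper and its `max_chars` default are hypothetical, and character count is only a rough stand-in for whatever token limit your embedding model imposes.

```python
def sub_chunk(text_pieces, max_chars=2000, sep="\n\n"):
    """Greedily pack consecutive text blocks into chunks of at most max_chars.

    Blocks are never split: a single block longer than max_chars is kept
    whole as its own chunk.
    """
    chunks, current, current_len = [], [], 0
    for piece in text_pieces:
        # Cost of appending this piece, including the separator if needed.
        added = len(piece) + (len(sep) if current else 0)
        if current and current_len + added > max_chars:
            # Current chunk is full: flush it and start a new one.
            chunks.append(sep.join(current))
            current, current_len = [], 0
            added = len(piece)
        current.append(piece)
        current_len += added
    if current:
        chunks.append(sep.join(current))
    return chunks


pieces = ["a" * 30, "b" * 30, "c" * 30]
parts = sub_chunk(pieces, max_chars=70)
print([len(p) for p in parts])  # [62, 30]
```

Each sub-chunk would reuse the page's metadata dict, so a retrieved match still cites the same file and page.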
# Now hand the chunks to your vector store. Example with a generic interface:
# for chunk in chunks:
#     vectorstore.upsert(
#         id=f"{chunk['metadata']['source_file']}#{chunk['metadata']['page']}",
#         text=chunk["text"],
#         metadata=chunk["metadata"],
#     )
Or just use the Markdown render
If your pipeline is simpler (e.g. you split markdown by headings and embed the resulting sections), the deterministic result.markdown field gives you a flat, LLM-ready string with page separators baked in as HTML comments:
md = result.markdown # "# Title\n\n<!-- page 1 -->\n\n..."
sections = md.split("<!-- page ")  # split on the deterministic page marker
Filter by confidence before embedding
For high-stakes domains (legal, medical, financial), drop blocks below a confidence threshold so your retriever doesn't cite probabilistic OCR text as if it were source-of-truth:
HIGH_TRUST_SOURCES = {"pdf_text_layer", "pptx_native", "pptx_chart_data"}

def trusted_blocks(result, min_confidence=0.85):
    # Generator that yields only the blocks safe to embed.
    for page in result.document.pages:
        for block in page.get("blocks") or []:
            if block.get("text_source") in HIGH_TRUST_SOURCES:
                yield block  # byte-perfect: always cite
            elif block.get("confidence", 0) >= min_confidence:
                yield block  # LLM Vision but verified by faithfulness check
            # else: skip, too uncertain to feed the model
Next
- Patent extraction: switch to profile=advanced for diagram graphs + reference cross-linking.
- Batch with webhooks: process thousands of files without polling.
