About

We turn documents into data that LLMs can actually reason over.

OCRQueen is a document extraction API for AI builders. We extract structured JSON from PDFs and PowerPoint files: diagrams as graphs, charts as exact numbers, math preserved as LaTeX, and every block carrying its provenance and a confidence score.

Why we built it

Existing OCR APIs treat documents like images of text. They rasterize a chart and ask a vision model to read the bars, then return approximate numbers. They flatten diagrams into alt text. They lose math entirely or stringify it. The output is fine for full-text search and brutal for anything an LLM needs to reason about.

We took the opposite approach. Where the original document already encodes a piece of information losslessly — text in a PDF text layer, chart data in a PPTX Chart shape, a table's cells — we read it byte-perfect. We only call an LLM for things that genuinely require semantics: classifying a scanned page, describing a photo, transcribing handwriting, or extracting structure from a vector diagram. Even then we verify every LLM output against the source before returning it.
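To make "read it instead of guessing it" concrete, here is a minimal sketch of pulling chart values straight out of a PPTX file with the Python standard library. It is an illustration, not OCRQueen's implementation: the function name read_chart_series and the returned shape are hypothetical. It relies only on the Office Open XML layout, where a .pptx is a zip archive, chart parts live under ppt/charts/, and each series caches its category labels and values as text. It covers category/value series only (scatter charts store their data differently).

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# DrawingML chart namespace used by chart parts in Office Open XML files.
C_NS = {"c": "http://schemas.openxmlformats.org/drawingml/2006/chart"}


def read_chart_series(pptx_bytes):
    """Hypothetical helper: map each chart part to its cached series data.

    Values are returned as the exact strings stored in the file, so nothing
    is rounded or re-estimated along the way.
    """
    charts = {}
    with zipfile.ZipFile(io.BytesIO(pptx_bytes)) as archive:
        for name in sorted(archive.namelist()):
            if not (name.startswith("ppt/charts/chart") and name.endswith(".xml")):
                continue
            root = ET.fromstring(archive.read(name))
            series = []
            for ser in root.iter("{%s}ser" % C_NS["c"]):
                # Category labels and numeric values are cached as <c:pt><c:v> nodes.
                cats = [v.text for v in ser.findall("c:cat//c:pt/c:v", C_NS)]
                vals = [v.text for v in ser.findall("c:val//c:pt/c:v", C_NS)]
                series.append({"categories": cats, "values": vals})
            charts[name] = series
    return charts
```

A caller would pass the raw bytes of a deck, for example read_chart_series(open("deck.pptx", "rb").read()), and get the numbers the author entered rather than numbers read off a rendered image.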

The result is an extraction you can audit. Every block tells you where it came from. Every confidence score reflects how the block was obtained and checked. Every chart carries the exact numbers its author entered, not an OCR'd approximation of them.

Who it's for

Teams building RAG pipelines, document understanding workflows, and AI agents that read structured business content. Customers tell us they reach for OCRQueen when the accuracy of the structure, not just the text, matters: legal briefs, patents, financial reports, technical specs, research papers, sales decks.

We're developer-first. There's a free tier, a one-line SDK in Python and Node / TypeScript, predictable per-page pricing, and documentation a human can actually read.

How we think about quality

  • Lossless when possible. If a piece of information exists losslessly in the source, we use that — never an LLM rephrasing of it.
  • Verifiable when not. When the LLM does run, its output is faithfulness-checked against the source and surfaced with a confidence score.
  • Honest about uncertainty. Every block carries text_source and confidence so high-stakes callers can drop blocks below their threshold instead of blindly trusting them.
  • Reproducible. Same document in, same JSON out. Temperature is zero, model versions are pinned, extractions are cited.
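As a rough sketch of what "drop blocks below their threshold" can look like on the caller's side: the field names text_source and confidence come from the description above, but the rest of the block shape, the example text_source values, and the helper name split_by_confidence are assumptions made for illustration, not the documented response format.

```python
def split_by_confidence(blocks, threshold):
    """Hypothetical helper: separate blocks that meet the threshold from those that do not."""
    kept = [b for b in blocks if b["confidence"] >= threshold]
    dropped = [b for b in blocks if b["confidence"] < threshold]
    return kept, dropped


# Assumed block shape; only text_source and confidence are named in the text above.
blocks = [
    {"text": "Revenue grew 12%.", "text_source": "pdf_text_layer", "confidence": 1.0},
    {"text": "Signed: J. Doe", "text_source": "llm_handwriting", "confidence": 0.62},
]

kept, dropped = split_by_confidence(blocks, threshold=0.9)
for block in dropped:
    # Route low-confidence blocks to human review instead of trusting them.
    print("needs review:", block["text_source"], block["text"])
```

Keeping the dropped blocks instead of discarding them silently lets a high-stakes pipeline send them to a reviewer.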

Where we are

Early. Today OCRQueen has a stable extraction API, two SDKs, a dashboard, webhooks, and a free tier you can use without talking to anyone. We're building toward enterprise features (SSO, BYOS, audit log exports, on-prem deployment) on customer demand — not roadmap inertia.

Found a bug, want a feature, or comparing us to a competitor? Email support@ocrqueen.com. We read everything.

Get in touch

General questions: hello@ocrqueen.com
Sales + enterprise: sales@ocrqueen.com
Support + security: support@ocrqueen.com