POST /v1/extract
Submit a document for extraction. Returns 200 with the result when finished within the sync window, otherwise 202 with a job handle to poll.
Request
Content type: multipart/form-data. Max file size: 100 MB. Accepted file types (matched by MIME type): PDF, JPEG, PNG, WebP, HEIC, HEIF, PPTX, PPT.
Headers
| Header | Required | Notes |
|---|---|---|
| Authorization | Yes | Bearer pk_live_xxx or pk_test_xxx. Test keys are free. |
| Idempotency-Key | No | 1–255 chars. Retrying with the same key returns the original job (and the same result) without re-extracting or re-billing. See Idempotency. |
Multipart fields
| Field | Type | Default | Notes |
|---|---|---|---|
| file | file | n/a | The document. Required. |
| options | JSON string | {} | Extraction options. See options object. |
| sync | boolean | false | Block for up to sync_timeout_seconds while the job finishes, then return 200 with the result inline. See Sync mode. |
| sync_timeout_seconds | integer | 25 | Max wait when sync=true. Clamped to 1–30. |
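A minimal sketch of how a client might assemble the non-file form fields. `build_form_fields` is a hypothetical helper, not part of any SDK. It mirrors the documented 1–30 clamp locally and sends booleans as lowercase `true`/`false`, matching the curl example on this page.

```python
import json


def build_form_fields(options=None, sync=False, sync_timeout_seconds=25):
    """Return the non-file multipart fields as strings (hypothetical helper)."""
    # Mirror the documented server-side clamp (1-30) so the value we send
    # is the value that will actually apply.
    timeout = max(1, min(30, int(sync_timeout_seconds)))
    fields = {
        # options travels as a JSON string because multipart fields are flat.
        "options": json.dumps(options or {}, separators=(",", ":")),
        # Lowercase booleans, as in the curl example (-F "sync=true").
        "sync": "true" if sync else "false",
    }
    if sync:
        # The timeout only matters when sync=true.
        fields["sync_timeout_seconds"] = str(timeout)
    return fields
```

The returned dict can be passed alongside the file part to whichever HTTP client you use.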
options object
JSON-encoded because multipart can't carry nested objects.
| Field | Type | Default | Notes |
|---|---|---|---|
| extraction_profile | "standard" \| "advanced" | "standard" | standard = text, tables, images, math. advanced = standard + diagram graphs + alt-text + embedded OCR + reference linking. |
| include_layout | boolean | false | Include bbox on every block. Forced true on advanced. |
| preserve_images | boolean | true | If false, extracted images are dropped from the result. |
| storage_destination_id | string \| null | null | BYOS destination. If null, images go to OCRQueen-managed storage. |
| callback_url | URL \| null | null | One-shot webhook. Must be HTTPS in production. |
| retain_hours | integer (0–168) | 24 | Source-file retention. 0 deletes immediately after extraction. |
| result_retain_hours | integer (0–168) \| null | same as retain_hours | How long the extracted result stays in our database after completion. Set to 0 and use a webhook for ephemeral processing. See Data retention. |
| bypass_cache | boolean | false | Skip the content-hash cache. You will be billed for the new extraction even if an identical cached result exists. |
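To make the interaction between defaults concrete, here is an illustrative client-side model of the table above. `effective_options` is a hypothetical name, and the function only restates the documented rules (`include_layout` forced on for `advanced`, a null `result_retain_hours` following `retain_hours`). The server's actual resolution logic may differ in details not described here.

```python
def effective_options(options):
    """Apply the documented defaults to a partial options dict (illustrative only)."""
    resolved = {
        "extraction_profile": "standard",
        "include_layout": False,
        "preserve_images": True,
        "storage_destination_id": None,
        "callback_url": None,
        "retain_hours": 24,
        "result_retain_hours": None,
        "bypass_cache": False,
    }
    resolved.update(options)
    # include_layout is forced on for the advanced profile.
    if resolved["extraction_profile"] == "advanced":
        resolved["include_layout"] = True
    # A null result_retain_hours follows retain_hours.
    if resolved["result_retain_hours"] is None:
        resolved["result_retain_hours"] = resolved["retain_hours"]
    return resolved
```

For example, `effective_options({"retain_hours": 0})` resolves `result_retain_hours` to 0 as well, which is the ephemeral-processing setup the table describes.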
Example
```shell
curl https://api.ocrqueen.com/v1/extract \
  -H "Authorization: Bearer pk_test_xxx" \
  -H "Idempotency-Key: invoice-3034-2026-05-14" \
  -F "file=@invoice.pdf" \
  -F 'options={"extraction_profile":"advanced"}' \
  -F "sync=true" \
  -F "sync_timeout_seconds=20"
```
Response
200 OK when the body carries the full extraction result; 202 Accepted when the client still needs to poll. Both responses share the same shape, but document and markdown are populated only on 200.
```json
{
  "job_id": "5f8a...",
  "status": "completed", // queued | processing | completed | failed | cancelled
  "file_info": {
    "filename": "invoice.pdf",
    "size_bytes": 184321,
    "mime_type": "application/pdf"
  },
  "created_at": "2026-05-14T12:00:00Z",
  "started_at": "2026-05-14T12:00:01Z",
  "completed_at": "2026-05-14T12:00:08Z",
  "estimated_seconds": 0,
  "status_url": "https://api.ocrqueen.com/v1/jobs/5f8a...",
  "expires_at": "2026-05-15T12:00:00Z",
  "document": { /* full DigitisedDocument, see GET /v1/jobs/{id} */ },
  "markdown": "# Invoice\n\n…",
  "cache_hit": false
}
```
The full schema for document is documented under GET /v1/jobs/{id}.
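A small sketch of how a client could branch on the two success responses. `interpret_response` is a hypothetical helper; it assumes the body has already been parsed from JSON into a dict.

```python
def interpret_response(status_code, body):
    """Return ("done", body) or ("poll", status_url) for an extract response."""
    if status_code == 200:
        # 200 carries the full result: document and markdown are populated.
        return ("done", body)
    if status_code == 202:
        # 202 has the same shape, but the client must poll status_url.
        return ("poll", body["status_url"])
    raise ValueError(f"unexpected status code: {status_code}")
```

Error statuses (4xx) are deliberately left to the caller here, since they carry a different body.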
Response headers
| Header | When | Meaning |
|---|---|---|
| X-Cache | always | HIT when the content-hash cache served this request, so it is not billed. MISS otherwise. |
| Idempotent-Replayed | idempotency replay | true when the original job was returned for a previously seen Idempotency-Key. |
| Retry-After | on 202 | Seconds to wait before the next poll. Defaults to a conservative 2. |
| X-RateLimit-* | always | Limit, remaining, reset (epoch seconds). |
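As a sketch, the headers above could be read like this. `read_extract_headers` is a hypothetical helper, and falling back to 2 seconds when Retry-After is absent is a client-side assumption based on the documented default, not stated server behaviour.

```python
def read_extract_headers(headers):
    """Pull the documented response headers out of a header mapping."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    h = {k.lower(): v for k, v in headers.items()}
    return {
        "cache_hit": h.get("x-cache", "").upper() == "HIT",
        "replayed": h.get("idempotent-replayed", "").lower() == "true",
        # Assumed fallback: the documented conservative default of 2 seconds.
        "retry_after": int(h.get("retry-after", 2)),
    }
```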
Sync mode
Pass sync=true to ask the server to hold the connection open for up to sync_timeout_seconds (max 30) and return 200 with the full result inline. If the job is still running when the timeout elapses, you get 202 with Retry-After and poll status_url as usual, at no extra cost and without losing the job.
SDKs default to sync=true and fall back to polling transparently. Most small documents (≤20 pages) come back inline.
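The polling fallback might look roughly like the sketch below. This is not the SDK implementation. `wait_for_job` and the `fetch_status` callable are hypothetical: `fetch_status` is assumed to GET status_url and return the parsed body plus a Retry-After value (or None). Treating completed, failed, and cancelled as terminal is also an assumption.

```python
import time


def wait_for_job(fetch_status, first_delay=2.0, max_wait=300.0, sleep=time.sleep):
    """Poll until the job reaches a terminal status or max_wait elapses.

    fetch_status is a caller-supplied callable that returns
    (body_dict, retry_after_seconds_or_None).
    """
    terminal = {"completed", "failed", "cancelled"}  # assumed terminal states
    waited, delay = 0.0, first_delay
    while True:
        sleep(delay)
        waited += delay
        body, retry_after = fetch_status()
        if body["status"] in terminal:
            return body
        if waited >= max_wait:
            raise TimeoutError(f"job {body.get('job_id')} still {body['status']}")
        # Honour the server's hint when present, otherwise keep the last delay.
        delay = retry_after if retry_after is not None else delay
```

Injecting `sleep` keeps the loop testable without real waiting; in production the default `time.sleep` applies.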
Idempotency
Pass any stable string in Idempotency-Key to make the request safe to retry across queue redelivery, crashes, and network blips. The first call wins; every subsequent call with the same key returns the original job, with Idempotent-Replayed: true set.
Keys are scoped per customer and live for 24 hours. After that, the same key starts a fresh extraction. See the idempotency notes for edge cases.
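One way to get a "stable string" is to derive it from data you already have. The sketch below combines a business identifier with a short content hash; `make_idempotency_key` and this key format are illustrative choices, not a requirement of the API, which accepts any string of 1–255 characters.

```python
import hashlib


def make_idempotency_key(business_id, file_bytes):
    """Derive a stable 1-255 char key from a business ID and the file content."""
    digest = hashlib.sha256(file_bytes).hexdigest()[:16]
    # Same inputs always give the same key, so retries map to the original job.
    key = f"{business_id}-{digest}"
    if not 1 <= len(key) <= 255:
        raise ValueError("Idempotency-Key must be 1-255 characters")
    return key
```

Including the content hash means a corrected file uploaded under the same business ID gets a new key and therefore a fresh extraction.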
Errors
Errors carry a string detail of the form CODE: message, for example UNSUPPORTED_FILE_TYPE: application/zip. Common codes for this endpoint:
| Status | Code | When |
|---|---|---|
| 400 | UNSUPPORTED_FILE_TYPE | MIME not in the allow-list. |
| 400 | FILE_TOO_LARGE | Body > 100 MB. |
| 400 | EMPTY_FILE | 0-byte upload. |
| 400 | INVALID_OPTIONS | Bad JSON in options. |
| 401 | MISSING_API_KEY / INVALID_AUTH_FORMAT / INVALID_API_KEY | Header missing, malformed, or doesn't match a live key. |
| 402 | FREE_QUOTA_EXCEEDED | Free-tier monthly pages exhausted. Add a card in Settings. |
| 403 | INSUFFICIENT_SCOPE | Key lacks extract:read. Create a new key with the right scopes. |
| 429 | RATE_LIMITED | Back off per Retry-After. |
Per-document failures (password-protected PDF, scan unreadable, etc.) come back via the job — see Errors.
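A hedged sketch of client-side handling for the request-level errors in the table. `parse_error_detail` and `should_retry` are hypothetical helpers. The parser assumes the detail string is either a bare code or `CODE: message`, as in the example above, and the retry policy (retry only on 429 RATE_LIMITED) is an assumption rather than documented guidance.

```python
def parse_error_detail(detail):
    """Split "CODE: message" into (code, message); message may be empty."""
    code, _, message = detail.partition(":")
    return code.strip(), message.strip()


# Assumption: only rate limiting is worth retrying automatically; the other
# codes in the table need a changed request, key, or plan.
RETRYABLE_CODES = {"RATE_LIMITED"}


def should_retry(status_code, detail):
    """Return True when the error looks safe to retry after backing off."""
    code, _ = parse_error_detail(detail)
    return status_code == 429 and code in RETRYABLE_CODES
```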
