POST /v1/extract

Submit a document for extraction. With sync=true, returns 200 with the result if the job finishes within the sync window; otherwise returns 202 with a job handle to poll.

Request

Content type: multipart/form-data. Max file size: 100 MB. Accepted formats: PDF, JPEG, PNG, WebP, HEIC, HEIF, PPTX, PPT.

Headers

| Header | Required | Notes |
| --- | --- | --- |
| Authorization | Yes | Bearer pk_live_xxx or pk_test_xxx. Test keys are free. |
| Idempotency-Key | No | 1–255 chars. Retrying with the same key returns the original job (and the same result) without re-extracting or re-billing. See below. |

Multipart fields

| Field | Type | Default | Notes |
| --- | --- | --- | --- |
| file | file | | The document. Required. |
| options | JSON string | {} | Extraction options. See below. |
| sync | boolean | false | Block up to sync_timeout_seconds for the job to finish, return 200 inline. See sync mode. |
| sync_timeout_seconds | integer | 25 | Max wait when sync=true. Clamped 1–30. |
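
The non-file fields can be prepared client-side before the upload. The sketch below is illustrative only: build_form_fields is a hypothetical helper, and it assumes booleans are sent as the lowercase strings "true" and "false", as in the curl example on this page. It mirrors the documented 1–30 clamp so the value sent matches the value the server would apply.

```python
def build_form_fields(sync: bool = False, sync_timeout_seconds: int = 25) -> dict[str, str]:
    """Return the non-file multipart fields as plain strings (hypothetical helper)."""
    # Mirror the documented clamp (1-30) so the value sent is the value used.
    clamped = max(1, min(30, int(sync_timeout_seconds)))
    # Assumption: booleans travel as lowercase "true"/"false".
    fields = {"sync": "true" if sync else "false"}
    if sync:
        # The timeout only matters when sync mode is requested.
        fields["sync_timeout_seconds"] = str(clamped)
    return fields
```

For example, build_form_fields(sync=True, sync_timeout_seconds=45) sends a timeout of 30 rather than relying on the server to clamp it.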

options object

JSON-encoded because multipart can't carry nested objects.

| Field | Type | Default | Notes |
| --- | --- | --- | --- |
| extraction_profile | "standard" \| "advanced" | "standard" | standard = text, tables, images, math. advanced = standard + diagram graphs + alt-text + embedded OCR + reference linking. |
| include_layout | boolean | false | Include bbox on every block. Forced true on advanced. |
| preserve_images | boolean | true | If false, extracted images are dropped from the result. |
| storage_destination_id | string \| null | null | BYOS destination. If null, images go to OCRQueen-managed storage. |
| callback_url | URL \| null | null | One-shot webhook. Must be HTTPS in production. |
| retain_hours | integer (0–168) | 24 | Source-file retention. 0 deletes immediately after extraction. |
| result_retain_hours | integer (0–168) \| null | same as retain_hours | How long the extracted result stays in our database after completion. Set to 0 and use a webhook for ephemeral processing. See Data retention. |
| bypass_cache | boolean | false | Skip content-hash cache. You will be billed for the new extraction even if an identical cached result exists. |
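
Because options travels as a JSON string, it helps to validate and serialize it in one place. A minimal sketch, assuming only the ranges listed in the table above; encode_options is a hypothetical helper, not part of any SDK, and the server remains the authority on what is valid.

```python
import json

ALLOWED_PROFILES = {"standard", "advanced"}

def encode_options(**options) -> str:
    """Validate a few documented constraints, then serialize for the options field."""
    profile = options.get("extraction_profile", "standard")
    if profile not in ALLOWED_PROFILES:
        raise ValueError(f"unknown extraction_profile: {profile!r}")
    # Both retention fields are documented as integers in the range 0-168.
    for key in ("retain_hours", "result_retain_hours"):
        value = options.get(key)
        if value is not None and not 0 <= value <= 168:
            raise ValueError(f"{key} must be between 0 and 168, got {value}")
    # Compact separators keep the form field short; key order is preserved.
    return json.dumps(options, separators=(",", ":"))
```

Calling encode_options(retain_hours=0, result_retain_hours=0, callback_url="https://example.com/hook") would produce the string for the ephemeral-processing setup described in the table; the callback URL here is a placeholder.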

Example

```bash
curl https://api.ocrqueen.com/v1/extract \
  -H "Authorization: Bearer pk_test_xxx" \
  -H "Idempotency-Key: invoice-3034-2026-05-14" \
  -F "file=@invoice.pdf" \
  -F 'options={"extraction_profile":"advanced"}' \
  -F "sync=true" \
  -F "sync_timeout_seconds=20"
```

Response

200 OK when the body carries the full extraction result; 202 Accepted when the client still needs to poll. Both responses have the same shape, but document and markdown are populated only on 200.

```json
{
  "job_id": "5f8a...",
  "status": "completed",          // queued | processing | completed | failed | cancelled
  "file_info": {
    "filename": "invoice.pdf",
    "size_bytes": 184321,
    "mime_type": "application/pdf"
  },
  "created_at": "2026-05-14T12:00:00Z",
  "started_at": "2026-05-14T12:00:01Z",
  "completed_at": "2026-05-14T12:00:08Z",
  "estimated_seconds": 0,
  "status_url": "https://api.ocrqueen.com/v1/jobs/5f8a...",
  "expires_at": "2026-05-15T12:00:00Z",
  "document": { /* full DigitisedDocument; see GET /v1/jobs/{id} */ },
  "markdown": "# Invoice\n\n…",
  "cache_hit": false
}
```

The full schema for document is documented under GET /v1/jobs/{id}.
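
Since 200 and 202 share one shape, a client can branch on the status code alone. The following is a sketch rather than SDK code: ExtractOutcome and interpret_extract_response are hypothetical names, and the body is assumed to be the already-parsed JSON shown above.

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class ExtractOutcome:
    job_id: str
    done: bool                 # True when the result arrived inline (200)
    status_url: str            # where to poll when done is False
    markdown: Optional[str] = None
    document: Optional[dict] = None

def interpret_extract_response(status_code: int, body: dict[str, Any]) -> ExtractOutcome:
    """Map a POST /v1/extract response onto 'finished' or 'keep polling'."""
    if status_code == 200:
        # Full result inline: document and markdown are populated.
        return ExtractOutcome(body["job_id"], True, body["status_url"],
                              body.get("markdown"), body.get("document"))
    if status_code == 202:
        # Job accepted but not finished: only the handle is useful.
        return ExtractOutcome(body["job_id"], False, body["status_url"])
    raise ValueError(f"unexpected status code: {status_code}")
```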

Response headers

| Header | When | Meaning |
| --- | --- | --- |
| X-Cache | always | HIT when a content-hash cache served this (no billing). MISS otherwise. |
| Idempotent-Replayed | idempotency replay | true when the original job was returned for a previously-seen Idempotency-Key. |
| Retry-After | on 202 | Seconds to wait before the next poll. Conservative default 2. |
| X-RateLimit-* | always | Limit, remaining, reset (epoch seconds). |
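
These headers can be folded into a small summary for logging or backoff decisions. A minimal sketch: summarize_headers is a hypothetical helper, and X-RateLimit-Remaining is an assumed concrete name for the "remaining" member of the X-RateLimit-* family, which this page does not spell out.

```python
from typing import Optional

def summarize_headers(headers: dict[str, str]) -> dict:
    """Extract the documented response headers into typed values."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    h = {name.lower(): value for name, value in headers.items()}
    remaining: Optional[int] = None
    if "x-ratelimit-remaining" in h:  # assumed header name
        remaining = int(h["x-ratelimit-remaining"])
    return {
        "cache_hit": h.get("x-cache", "").upper() == "HIT",
        "replayed": h.get("idempotent-replayed", "").lower() == "true",
        # Fall back to the documented conservative default of 2 seconds.
        "retry_after": float(h.get("retry-after", 2)),
        "rate_remaining": remaining,
    }
```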

Sync mode

Pass sync=true to ask the server to hold the connection up to sync_timeout_seconds (max 30) and return 200 with the full result inline. If the job is still running when the timeout elapses, you get 202 with Retry-After and poll status_url as usual, with no extra cost and no lost job.

SDKs default to sync=true and fall back to polling transparently. Most small documents (≤20 pages) come back inline.
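
For clients that do not use an SDK, the fallback after a 202 can be sketched as a polling loop. This is an illustration under assumptions, not the SDKs' actual implementation: wait_for_job and fetch_status are hypothetical names, and fetch_status is expected to GET status_url and return the parsed body together with the delay to wait (the Retry-After value, or 2 seconds when absent).

```python
import time
from typing import Callable

TERMINAL_STATUSES = {"completed", "failed", "cancelled"}

def wait_for_job(
    fetch_status: Callable[[], tuple[dict, float]],
    max_wait_seconds: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the job reaches a terminal status or the overall deadline passes."""
    deadline = time.monotonic() + max_wait_seconds
    while True:
        body, retry_after = fetch_status()
        if body["status"] in TERMINAL_STATUSES:
            return body
        # Give up rather than sleep past the caller's overall deadline.
        if time.monotonic() + retry_after > deadline:
            raise TimeoutError("job did not finish before max_wait_seconds")
        sleep(retry_after)
```

Injecting fetch_status and sleep keeps the loop independent of any particular HTTP library and easy to exercise without network access.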

Idempotency

Pass any stable string in Idempotency-Key to make the request safe to retry across queue redelivery, crashes, and network blips. The first call wins; every subsequent call with the same key returns the original job, with Idempotent-Replayed: true set.

Keys are scoped per customer and live for 24h. After that, the same key starts a fresh extraction. See the idempotency notes for edge cases.
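
One way to obtain a stable key is to derive it from data the caller already has. The sketch below is one possible scheme, not a requirement of the API: make_idempotency_key is a hypothetical helper that joins a caller-chosen logical ID with a short SHA-256 prefix of the file content and enforces the documented 1–255 character limit.

```python
import hashlib

def make_idempotency_key(logical_id: str, file_bytes: bytes) -> str:
    """Build a deterministic Idempotency-Key from a logical ID and the file content."""
    # A 16-hex-character prefix is enough to tell revisions of a file apart.
    digest = hashlib.sha256(file_bytes).hexdigest()[:16]
    key = f"{logical_id}-{digest}"
    # The header is documented as 1-255 characters.
    if not 1 <= len(key) <= 255:
        raise ValueError(f"key length {len(key)} is outside 1-255")
    return key
```

Including the content hash means a retry of the same upload replays the original job, while a corrected file under the same logical ID gets a fresh extraction instead of the earlier result.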

Errors

Errors carry a string detail like UNSUPPORTED_FILE_TYPE: application/zip. Common errors for this endpoint:

| Status | Code | When |
| --- | --- | --- |
| 400 | UNSUPPORTED_FILE_TYPE | MIME not in the allow-list. |
| 400 | FILE_TOO_LARGE | Body > 100 MB. |
| 400 | EMPTY_FILE | 0-byte upload. |
| 400 | INVALID_OPTIONS | Bad JSON in options. |
| 401 | MISSING_API_KEY / INVALID_AUTH_FORMAT / INVALID_API_KEY | Header missing, malformed, or doesn't match a live key. |
| 402 | FREE_QUOTA_EXCEEDED | Free-tier monthly pages exhausted. Add a card in Settings. |
| 403 | INSUFFICIENT_SCOPE | Key lacks extract:read. Create a new key with the right scopes. |
| 429 | RATE_LIMITED | Back off per Retry-After. |

Per-document failures (password-protected PDF, unreadable scan, etc.) are reported on the job; see Errors.
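
The detail string can be split into a machine-readable code and a human-readable message. A minimal sketch, assuming the detail always has the form CODE or CODE: message as in the example above; parse_error_detail and should_retry are hypothetical helpers, and treating only RATE_LIMITED as retryable is a conservative assumption.

```python
def parse_error_detail(detail: str) -> tuple[str, str]:
    """Split 'CODE: message' into (code, message); message is empty if absent."""
    code, _, message = detail.partition(":")
    return code.strip(), message.strip()

def should_retry(status_code: int, detail: str) -> bool:
    """Only rate limiting is treated as retryable; other listed errors need a fix first."""
    code, _ = parse_error_detail(detail)
    return status_code == 429 or code == "RATE_LIMITED"
```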