Data retention & deletion

Data is retained for 24 hours by default. For ephemeral processing, set retain_hours and result_retain_hours to 0 and receive results by webhook. To erase on demand, call POST /v1/jobs/{id}/purge. Billing runs from a tombstone row that holds no document content.

What we store, and for how long

Every extraction job creates three logically separate artifacts. The source file and the extracted content each have their own retention window, which you control independently per request; the billing tombstone is kept indefinitely.

| Artifact | Where it lives | Default lifetime | Controlled by |
| --- | --- | --- | --- |
| Source file (your PDF/PPT/etc.) | Cloudflare R2 (or your BYOS bucket) | 24 hours | retain_hours |
| Extracted content (markdown + structured JSON) | Our Postgres | 24 hours (matches retain_hours) | result_retain_hours |
| Billing tombstone (job id, customer, pages, timestamps) | Our Postgres | Indefinite | Tax + audit retention; no document content |

A tombstone is the row that remains after we delete your data. It carries only the fields we need to produce invoices and usage reports (for example id, customer_id, pages_extracted, completed_at) and no document content: no filename, no hash, no result.
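
As a rough illustration of how the two configurable windows in the table above behave, the sketch below computes when each artifact would be deleted. It assumes both clocks start at job creation, which this page does not state, and the function and key names are made up for the example.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Optional

DEFAULT_RETAIN_HOURS = 24  # documented default for the source file


def deletion_times(
    created_at: datetime,
    retain_hours: int = DEFAULT_RETAIN_HOURS,
    result_retain_hours: Optional[int] = None,
) -> Dict[str, datetime]:
    """Return the assumed deletion time of the source file and of the result.

    Assumption: both windows are measured from job creation.
    """
    # Per the table, the result window matches retain_hours unless set explicitly.
    if result_retain_hours is None:
        result_retain_hours = retain_hours
    return {
        "source_deleted_at": created_at + timedelta(hours=retain_hours),
        "result_deleted_at": created_at + timedelta(hours=result_retain_hours),
    }


created = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
defaults = deletion_times(created)
split = deletion_times(created, retain_hours=1, result_retain_hours=72)
```

Here `split` drops the source file after one hour while keeping the result for three days, which is the kind of independent control the table describes.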

Ephemeral processing (recommended for sensitive data)

Set both retention windows to zero and have the result delivered to you via webhook. Our database holds your content for the duration of the extraction and the webhook delivery, then it's gone.

bash
curl -X POST https://api.ocrqueen.com/v1/extract \
  -H "Authorization: Bearer pk_..." \
  -F "file=@confidential.pdf" \
  -F 'options={
    "retain_hours": 0,
    "result_retain_hours": 0,
    "callback_url": "https://your-app.com/ocrqueen-webhook"
  }'
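
For callers who are not using curl, the same request can be sketched in Python with only the standard library. This is an illustration, not an official client: `ephemeral_options` and `build_extract_request` are hypothetical helper names, and the multipart layout simply mirrors the two `-F` fields in the curl command above.

```python
import json
import urllib.request
import uuid

API_BASE = "https://api.ocrqueen.com/v1"


def ephemeral_options(callback_url: str) -> dict:
    """Options that zero both retention windows and deliver by webhook."""
    return {
        "retain_hours": 0,
        "result_retain_hours": 0,
        "callback_url": callback_url,
    }


def build_extract_request(
    file_name: str, file_bytes: bytes, options: dict, api_key: str
) -> urllib.request.Request:
    """Build (but do not send) a multipart POST to /v1/extract."""
    boundary = uuid.uuid4().hex
    file_head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{file_name}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    options_head = (
        f"\r\n--{boundary}\r\n"
        'Content-Disposition: form-data; name="options"\r\n\r\n'
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    body = (
        file_head
        + file_bytes
        + options_head
        + json.dumps(options).encode("utf-8")
        + tail
    )
    return urllib.request.Request(
        f"{API_BASE}/extract",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

Passing the returned object to `urllib.request.urlopen` would send it; keeping the build step separate lets you inspect or test the request without network access.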

The webhook payload contains the full extraction result. You're responsible for persisting whatever you need from there — once we deliver, we forget.
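
Because nothing is kept on our side after delivery, a receiver should write the payload to durable storage before acknowledging it. The sketch below shows one way to do that with Python's standard library. The payload schema is not documented on this page, so the `job_id` field, the `ocr_results` directory, and the status codes returned are all assumptions.

```python
import json
import re
from http.server import BaseHTTPRequestHandler, HTTPServer
from pathlib import Path

RESULTS_DIR = Path("ocr_results")  # hypothetical storage location


def persist_result(raw_body: bytes, directory: Path) -> Path:
    """Parse the webhook body and write it to <directory>/<job_id>.json."""
    payload = json.loads(raw_body)
    job_id = str(payload["job_id"])  # assumed field name
    # Refuse ids that could escape the target directory.
    if not re.fullmatch(r"[A-Za-z0-9_-]+", job_id):
        raise ValueError("unexpected job id format")
    directory.mkdir(parents=True, exist_ok=True)
    target = directory / f"{job_id}.json"
    tmp = directory / f"{job_id}.json.tmp"
    # Write to a temp file, then rename, so a crash never leaves a partial file.
    tmp.write_text(json.dumps(payload), encoding="utf-8")
    tmp.replace(target)
    return target


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        try:
            persist_result(body, RESULTS_DIR)
        except (ValueError, KeyError):
            self.send_response(400)  # malformed payload
        else:
            self.send_response(204)  # acknowledge only after the write succeeded
        self.end_headers()


def serve(port: int = 8080) -> None:
    """Run the receiver until interrupted."""
    HTTPServer(("0.0.0.0", port), WebhookHandler).serve_forever()
```

Calling `serve()` starts the listener. Whether a failed delivery is retried is not specified here, so treat the 400 response as illustrative.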

Erase on demand

For GDPR right-to-erasure requests, end-of-customer-engagement cleanups, or any one-off deletion, call the purge endpoint. It's idempotent: calling it twice on the same job is harmless.

bash
# Set JOB_ID to the id of the job you want to erase.
curl -X POST "https://api.ocrqueen.com/v1/jobs/${JOB_ID}/purge" \
  -H "Authorization: Bearer pk_..."

# 204 No Content. The source bytes are deleted from R2,
# the extracted result + request options are nulled, and
# the row remains as a billing tombstone.

The endpoint requires the jobs:write scope on the API key. Destructive operations are deliberately gated behind a scope separate from extract:write, so you can issue keys that submit jobs and fetch results without being able to delete them.
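
For a bulk cleanup such as an erasure request that spans many jobs, the purge call can be wrapped in a small helper. The sketch below uses only Python's standard library; `build_purge_request`, `purge_job`, and `purge_many` are hypothetical names, and the only documented response it relies on is the 204 on success. Because the endpoint is idempotent, re-running the whole batch after a partial failure should be safe.

```python
import urllib.error
import urllib.parse
import urllib.request
from typing import Dict, Iterable

API_BASE = "https://api.ocrqueen.com/v1"


def build_purge_request(job_id: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) POST /v1/jobs/{job_id}/purge."""
    safe_id = urllib.parse.quote(job_id, safe="")
    return urllib.request.Request(
        f"{API_BASE}/jobs/{safe_id}/purge",
        method="POST",
        headers={"Authorization": f"Bearer {api_key}"},
    )


def purge_job(job_id: str, api_key: str) -> int:
    """Send the purge and return the HTTP status code (204 means purged)."""
    try:
        with urllib.request.urlopen(
            build_purge_request(job_id, api_key), timeout=30
        ) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # For example, a key without the jobs:write scope; the exact
        # status code for that case is not documented on this page.
        return err.code


def purge_many(job_ids: Iterable[str], api_key: str) -> Dict[str, int]:
    """Purge each job and report the status code per job id."""
    return {job_id: purge_job(job_id, api_key) for job_id in job_ids}
```

Any entry in the returned mapping that is not 204 can simply be retried.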

Which scope does what

| Scope | What it permits |
| --- | --- |
| extract:read | Read job status & results, cancel queued jobs |
| extract:write | Submit new extraction jobs |
| jobs:write | Hard-purge a job (source + result). Separate from extract:write on purpose. |
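
The scope table can be read as a lookup from operation to required scope. The sketch below models that in Python to show why a key that can submit and read jobs still cannot purge them. The operation names are invented for the example and are not API identifiers.

```python
from typing import Dict, Set

# Operation names are hypothetical; the scopes come from the table above.
REQUIRED_SCOPE: Dict[str, str] = {
    "read_job": "extract:read",
    "cancel_queued_job": "extract:read",
    "submit_job": "extract:write",
    "purge_job": "jobs:write",
}


def is_allowed(key_scopes: Set[str], operation: str) -> bool:
    """Return True if a key holding key_scopes may perform the operation."""
    return REQUIRED_SCOPE[operation] in key_scopes


pipeline_key = {"extract:read", "extract:write"}  # submits and reads, cannot purge
cleanup_key = {"jobs:write"}                      # purges, cannot submit or read
```

Splitting keys this way keeps the credential used by an ingestion pipeline from being able to delete anything.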

What stays after a purge

The job row remains so that:

  • You can still see purged jobs in your usage reports.
  • We can produce accurate invoices and tax records.
  • An idempotent retry of the same request returns a clear “this job was purged” response instead of silently re-running and re-billing.

The fields we keep are: job id, customer id, pages extracted, file size in bytes (no filename, no hash), status, created / started / completed / purged timestamps, and the usage events that drive billing.
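
To make the split between kept and erased fields concrete, here is a sketch of a job row turning into a tombstone. It is a model for illustration only, not our actual schema: the `JobRow` class, the `source_key` pointer, and the exact column names are assumptions.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class JobRow:
    # Kept after a purge (billing and audit fields).
    id: str
    customer_id: str
    pages_extracted: int
    file_size_bytes: int
    status: str
    created_at: datetime
    started_at: Optional[datetime]
    completed_at: Optional[datetime]
    purged_at: Optional[datetime] = None
    # Nulled by a purge (document content and request details).
    result: Optional[dict] = None
    request_options: Optional[dict] = None
    source_key: Optional[str] = None  # hypothetical pointer to the stored source


def purge(row: JobRow, now: datetime) -> JobRow:
    """Return the tombstone for a row; purging an already purged row is a no-op."""
    if row.purged_at is not None:
        return row
    return replace(
        row, result=None, request_options=None, source_key=None, purged_at=now
    )
```

Returning the row unchanged on a second call mirrors the idempotent behaviour of the purge endpoint.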

BYOS and retention

If you provided a BYOS destination for output artifacts (extracted images, page renders), those live in your bucket and OCRQueen never touches them again, including on purge. Their lifecycle is governed by your bucket's own policies.

Source-side BYOS (uploading directly to your bucket instead of ours) is not available yet but is on the roadmap. Until then, the source file is uploaded to our R2 and deleted per retain_hours.

What we never persist

  • Plaintext API keys. We hash on creation and store only the hash + a short prefix for display.
  • Plaintext BYOS credentials. Encrypted at rest with our KMS-backed envelope encryption; workers decrypt to memory only, never to disk.
  • Request bodies in logs. Structured logs record metadata (job id, page count, latency, error code) and never the document contents.
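
The first point, storing only a hash plus a display prefix, follows a common pattern that can be sketched as below. This illustrates the general technique rather than our implementation: the use of SHA-256, the 8-character prefix, and the function names are assumptions.

```python
import hashlib
import hmac
import secrets
from typing import Dict, Tuple

PREFIX_LENGTH = 8  # assumed length of the prefix shown in a dashboard


def create_api_key() -> Tuple[str, Dict[str, str]]:
    """Return (plaintext shown once to the user, record that gets stored)."""
    plaintext = "pk_" + secrets.token_urlsafe(32)
    record = {
        "prefix": plaintext[:PREFIX_LENGTH],
        "sha256": hashlib.sha256(plaintext.encode("utf-8")).hexdigest(),
    }
    return plaintext, record


def verify_api_key(presented: str, record: Dict[str, str]) -> bool:
    """Check a presented key against the stored hash in constant time."""
    digest = hashlib.sha256(presented.encode("utf-8")).hexdigest()
    return hmac.compare_digest(digest, record["sha256"])
```

A fast hash is generally considered acceptable here only because the key is long and random; the same approach would be unsuitable for user-chosen passwords.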

Changes to this policy

We'll announce material changes here and to API-key owners by email at least 30 days before they take effect. Tighter defaults (shorter retention) ship immediately; looser defaults (longer retention) require the 30-day notice.