
Storage, retention, and BYOS

Where your documents and extracted images live, for how long, and how to keep them around longer when you need to.

Retention by default

Every job has two independent retention deadlines — retain_until for the source document and image artifacts, and result_retain_until for the extracted markdown / JSON in our database. After each deadline, that artifact is permanently deleted.

  • Default: 24 hours after extraction completes (both source and result).
  • Minimum: 0 hours — purged immediately after the response is sent. Pair with a callback_url webhook so you receive the full result before our copy disappears.
  • Maximum: 168 hours (7 days).

Set both deadlines per request with options.retain_hours (source and images) and options.result_retain_hours (extracted content). The full contract lives at Data retention & deletion.

Backstop: regardless of retain_hours, our R2 bucket has a hard 30-day object lifecycle rule. Nothing ever lives longer than 30 days, even if a bug in our retention sweeper misses an object.
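
The limits above can be sketched as a small helper. This is an illustration only, not OCRQueen's implementation: the name `retention_deadline` is hypothetical, and the sketch assumes the deadline is counted from the moment extraction completes, as the default bullet describes.

```python
from datetime import datetime, timedelta, timezone

MIN_RETAIN_HOURS = 0
MAX_RETAIN_HOURS = 168  # 7 days
DEFAULT_RETAIN_HOURS = 24


def retention_deadline(completed_at: datetime,
                       retain_hours: int = DEFAULT_RETAIN_HOURS) -> datetime:
    """Return the time after which an artifact is eligible for deletion.

    Hypothetical helper: it only mirrors the documented 0-168 hour range.
    """
    if not MIN_RETAIN_HOURS <= retain_hours <= MAX_RETAIN_HOURS:
        raise ValueError(
            f"retain_hours must be between {MIN_RETAIN_HOURS} and {MAX_RETAIN_HOURS}"
        )
    return completed_at + timedelta(hours=retain_hours)


completed = datetime(2026, 5, 15, 10, 0, tzinfo=timezone.utc)
print(retention_deadline(completed))       # default: 24 hours after completion
print(retention_deadline(completed, 168))  # maximum: 7 days after completion
```

Running the same check on your side before submitting a job is one way to catch an out-of-range value early instead of relying on the API to reject it.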

Set retention per request

python
from ocrqueen import OCRQueen

client = OCRQueen(api_key="pk_...")

# Default 24h retention — fine for most pipelines.
job = client.extract.create(file=open("doc.pdf", "rb"))

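# Or: short retention for sensitive docs (source and images purged after 1 hour).
# Extracted content has its own deadline; set result_retain_hours to shorten it too.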
job = client.extract.create(
    file=open("contract.pdf", "rb"),
    options={"retain_hours": 1},
)

# Or: keep for the full 7 days while your batch pipeline catches up.
job = client.extract.create(
    file=open("paper.pdf", "rb"),
    options={"retain_hours": 168},
)

Force-delete a job before its retention window expires

Call DELETE /v1/jobs/{id} at any time to wipe a job's source and extracted images immediately. The job row itself is kept (anonymized) for billing audit, but its result is nulled and all storage is purged.

python
client.jobs.cancel(job.id)  # also purges storage
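
If you are not using the Python SDK, the same delete can be issued as a plain HTTP request. The sketch below uses only the standard library and assumes the base URL and Bearer header shown in the curl examples on this page; `build_delete_request` is a hypothetical helper name, and the request is built but not sent.

```python
import urllib.parse
import urllib.request

API_BASE = "https://api.ocrqueen.com"  # base URL as used in the curl examples


def build_delete_request(job_id: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a DELETE /v1/jobs/{id} request."""
    safe_id = urllib.parse.quote(job_id, safe="")  # keep odd characters out of the path
    return urllib.request.Request(
        f"{API_BASE}/v1/jobs/{safe_id}",
        method="DELETE",
        headers={"Authorization": f"Bearer {api_key}"},
    )


req = build_delete_request("job_abc", "pk_...")
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req)
```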

How to get extracted images

Image blocks in the response carry a presigned URL pointing at OCRQueen's R2 bucket. The URL is signed for a limited time (X-Amz-Expires=7200, i.e. two hours, in the example below): long enough to fetch the bytes, short enough that a leaked URL doesn't become a permanent backdoor.

json
{
  "id": "job_abc",
  "status": "completed",
  "expires_at": "2026-05-16T10:00:00Z",
  "result": {
    "pages": [
      {
        "page": 1,
        "blocks": [
          {
            "id": "img_1",
            "type": "image",
            "image_url": "https://...r2.cloudflarestorage.com/.../page-1.png?X-Amz-Expires=7200&X-Amz-Signature=...",
            "alt": "Network topology diagram",
            "image_type": "diagram"
          }
        ]
      }
    ]
  }
}

If you need to keep images long-term, download them within your retention window

Within retain_hours, you can fetch every image URL with a plain HTTP GET — no auth header needed (the signature is in the URL). Save the bytes wherever you want; after that point you can forget about us.

python
import httpx
from ocrqueen import OCRQueen

client = OCRQueen(api_key="pk_...")

job = client.extract.create(file=open("doc.pdf", "rb"))
result = client.jobs.wait(job)

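# Walk every image block, download the bytes, store however you like.
# my_storage is a placeholder for your own storage client (S3, disk, database...).
for page in result.raw["pages"]:
    for block in page["blocks"]:
        if block.get("type") == "image":
            resp = httpx.get(block["image_url"])
            # An expired URL returns 403; fail loudly instead of storing an error body.
            resp.raise_for_status()
            my_storage.put(f"{job.id}/{block['id']}.png", resp.content)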

Once your storage has the bytes, OCRQueen's R2 can purge the originals safely — your downstream serving doesn't depend on us.

URLs vs the JSON response: store the result, but refresh URLs before serving

The image_url values are time-limited. If you save the full JSON to your database and try to use those URLs after a few hours, they'll return 403. Two clean fixes:

  1. Re-fetch the job via GET /v1/jobs/{id} to get fresh URLs. This works as long as retain_until hasn't passed.
  2. Download once, host the bytes yourself — recommended for any production app that re-renders images.
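
To decide when a stored URL needs refreshing without waiting for a 403, you can read the expiry out of the URL itself. The sketch below assumes the URLs follow the usual AWS Signature V4 query format, where `X-Amz-Date` carries the signing time alongside the `X-Amz-Expires` lifetime seen in the sample response; the helper names and the sample host are made up for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional
from urllib.parse import parse_qs, urlsplit


def presigned_expiry(url: str) -> Optional[datetime]:
    """Return when a SigV4-style presigned URL stops working, or None if unknown."""
    params = parse_qs(urlsplit(url).query)
    try:
        signed_at = datetime.strptime(params["X-Amz-Date"][0], "%Y%m%dT%H%M%SZ")
        lifetime = int(params["X-Amz-Expires"][0])
    except (KeyError, ValueError):
        return None  # not a recognizable presigned URL
    return signed_at.replace(tzinfo=timezone.utc) + timedelta(seconds=lifetime)


def needs_refresh(url: str, now: Optional[datetime] = None,
                  margin_seconds: int = 60) -> bool:
    """True if the URL has expired, is about to expire, or cannot be parsed."""
    expiry = presigned_expiry(url)
    if expiry is None:
        return True
    now = now or datetime.now(timezone.utc)
    return now >= expiry - timedelta(seconds=margin_seconds)


# Made-up URL in the same shape as the sample response.
sample_url = (
    "https://example.r2.cloudflarestorage.com/bucket/page-1.png"
    "?X-Amz-Date=20260515T100000Z&X-Amz-Expires=7200&X-Amz-Signature=abc"
)
print(presigned_expiry(sample_url))
```

When `needs_refresh` returns True, re-fetch the job (fix 1) or serve your own copy (fix 2). The safety margin avoids handing a browser a URL that expires mid-download.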

Bring Your Own Storage (BYOS)

For regulated workloads or customers who need data residency, OCRQueen can write extracted images directly into a bucket you own. We never store the bytes in our R2; the URLs in the response point at your bucket, signed with the credentials you provided.

Supported destinations:

  • Cloudflare R2
  • Amazon S3 (any region)
  • Google Cloud Storage (S3-compatible XML API)
  • Any S3-compatible endpoint (MinIO, Backblaze B2, Wasabi…)

1. Register a destination

One-time setup per bucket. Credentials are encrypted at rest using AES-256-GCM keyed by your tenant ID — extraction workers decrypt them in memory only.

bash
curl -X POST https://api.ocrqueen.com/v1/storage-destinations \
  -H "Authorization: Bearer pk_..." \
  -H "Content-Type: application/json" \
  -d '{
    "kind": "r2",
    "bucket": "my-extracts",
    "endpoint_url": "https://abc123.r2.cloudflarestorage.com",
    "access_key_id": "...",
    "secret_access_key": "...",
    "region": "auto"
  }'

# Response
# {
#   "id": "dest_xyz",
#   "kind": "r2",
#   "bucket": "my-extracts",
#   "verified_at": null,
#   ...
# }

2. Verify it

This call confirms that we can write to and read from your bucket: it writes a small probe object, then immediately deletes it.

bash
curl -X POST https://api.ocrqueen.com/v1/storage-destinations/dest_xyz/verify \
  -H "Authorization: Bearer pk_..."

# Response
# { "ok": true, "verified_at": "2026-05-15T12:34:56Z" }
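
Steps 1 and 2 can be chained from Python. The following is a minimal sketch using only the standard library, not an official client: `make_sender` and `register_and_verify` are hypothetical names, and the sketch assumes a failed verification is reported as `"ok": false` in the response body, which this page does not confirm.

```python
import json
import urllib.request
from typing import Callable, Optional

API_BASE = "https://api.ocrqueen.com"  # base URL as used in the curl examples

# (method, path, JSON body or None) -> decoded JSON response
Send = Callable[[str, str, Optional[dict]], dict]


def make_sender(api_key: str) -> Send:
    """Return a function that sends one JSON request and decodes the JSON reply."""
    def send(method: str, path: str, body: Optional[dict]) -> dict:
        data = json.dumps(body).encode() if body is not None else None
        req = urllib.request.Request(
            API_BASE + path,
            data=data,
            method=method,
            headers={
                "Authorization": f"Bearer {api_key}",
                "Content-Type": "application/json",
            },
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)
    return send


def register_and_verify(send: Send, destination: dict) -> str:
    """Register a destination, verify it, and return its ID."""
    created = send("POST", "/v1/storage-destinations", destination)
    dest_id = created["id"]
    verified = send("POST", f"/v1/storage-destinations/{dest_id}/verify", None)
    if not verified.get("ok"):  # assumed failure shape, see the note above
        raise RuntimeError(f"destination {dest_id} failed verification")
    return dest_id


# Usage (performs real HTTP calls, so it is left commented out):
# dest_id = register_and_verify(make_sender("pk_..."), {"kind": "r2", "bucket": "my-extracts"})
```

Passing the sender in as a parameter keeps the flow testable with a fake transport, so you can exercise it in CI without touching a real bucket.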

3. Use it for an extraction

Reference the destination by ID in options. Any extracted images will be written to your bucket; the URLs in the response point there, not at OCRQueen's R2.

python
job = client.extract.create(
    file=open("doc.pdf", "rb"),
    options={
        "extraction_profile": "advanced",
        "storage_destination_id": "dest_xyz",
    },
)

List + delete destinations

bash
# List all destinations for the authenticated customer
curl https://api.ocrqueen.com/v1/storage-destinations \
  -H "Authorization: Bearer pk_..."

# Remove a destination (does NOT delete the bucket, just removes the binding)
curl -X DELETE https://api.ocrqueen.com/v1/storage-destinations/dest_xyz \
  -H "Authorization: Bearer pk_..."

When you use BYOS, retain_hours on OCRQueen's side becomes a non-issue for extracted images: we never wrote those bytes to our storage in the first place. Your bucket's own lifecycle rules govern when objects expire.
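
Since expiry is now your bucket's job, you may want a lifecycle rule of your own. The sketch below builds a rule in the shape Amazon S3's lifecycle API expects; `expiry_lifecycle_rule` is a hypothetical helper, the 90-day value is only an example, and other S3-compatible providers may support a different subset of lifecycle options.

```python
def expiry_lifecycle_rule(prefix: str, days: int) -> dict:
    """Build an S3 lifecycle configuration that expires objects under `prefix`.

    An empty prefix applies the rule to every object in the bucket.
    """
    if days < 1:
        raise ValueError("days must be at least 1")
    label = prefix.strip("/") or "all"
    return {
        "Rules": [
            {
                "ID": f"expire-{label}-after-{days}d",
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "Expiration": {"Days": days},
            }
        ]
    }


config = expiry_lifecycle_rule("", 90)
print(config)

# Applying it needs boto3 and credentials with permission on the bucket:
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-extracts", LifecycleConfiguration=config
# )
```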

Picking a strategy

Three patterns; pick whichever matches your use case:

Strategy | Setup | When to pick it
Default 24h + download | None | RAG pipelines that ingest text once. Indie products. Most developers.
Short retention (0–1h) | Pass retain_hours: 0 | Sensitive documents where you can't tolerate data living on our side past the extraction call.
BYOS | One-time bucket registration | Regulated workloads, data residency requirements, enterprise contracts, archival use cases. Customer owns the bucket lifetime.

See POST /v1/extract for the full options schema, or GET /v1/jobs/{id} for how to refresh expired URLs.