Storage, retention, and BYOS
Where your documents and extracted images live, for how long, and how to keep them around longer when you need to.
Retention by default
Every job has two independent retention deadlines — retain_until for the source document and image artifacts, and result_retain_until for the extracted markdown / JSON in our database. After each deadline, that artifact is permanently deleted.
- Default: 24 hours after extraction completes (both source and result).
- Minimum: 0 hours. The artifact is purged immediately after the response is sent. Pair with a callback_url webhook so you receive the full result before our copy disappears.
- Maximum: 168 hours (7 days).
Set retention per request with options.retain_hours (source) and options.result_retain_hours (extracted content). The full contract lives at Data retention & deletion.
Backstop: regardless of retain_hours, our R2 bucket has a hard 30-day object lifecycle rule. Nothing ever lives longer than 30 days, even if a bug in our retention sweeper misses an object.
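To make the two deadlines concrete, here is a minimal sketch of how they follow from the completion time. `retention_deadlines` is a hypothetical helper, not part of the SDK, and it assumes both deadlines are counted from the moment extraction completes. Raising on out-of-range values is also an assumption; the API may reject or clamp them differently.

```python
from datetime import datetime, timedelta, timezone

MAX_RETAIN_HOURS = 168  # documented per-request maximum (7 days)


def retention_deadlines(completed_at, retain_hours=24, result_retain_hours=24):
    """Return (retain_until, result_retain_until) for a finished job.

    Hypothetical helper: mirrors the documented defaults and limits locally.
    """
    for hours in (retain_hours, result_retain_hours):
        if not 0 <= hours <= MAX_RETAIN_HOURS:
            raise ValueError(
                f"retention must be 0-{MAX_RETAIN_HOURS} hours, got {hours}"
            )
    retain_until = completed_at + timedelta(hours=retain_hours)
    result_retain_until = completed_at + timedelta(hours=result_retain_hours)
    return retain_until, result_retain_until


done = datetime(2026, 5, 15, 10, 0, tzinfo=timezone.utc)
source_deadline, result_deadline = retention_deadlines(done, retain_hours=1)
print(source_deadline.isoformat())  # 2026-05-15T11:00:00+00:00
print(result_deadline.isoformat())  # 2026-05-16T10:00:00+00:00
```

Shortening only `retain_hours` leaves the extracted content on its own 24-hour default, which is why the two values are passed separately.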
Set retention per request
from ocrqueen import OCRQueen

client = OCRQueen(api_key="pk_...")

# Default 24h retention: fine for most pipelines.
job = client.extract.create(file=open("doc.pdf", "rb"))

# Or: short retention for sensitive docs (purge in 1 hour).
# retain_hours covers the source and images; result_retain_hours
# covers the extracted content, which otherwise keeps the 24h default.
job = client.extract.create(
    file=open("contract.pdf", "rb"),
    options={"retain_hours": 1, "result_retain_hours": 1},
)

# Or: keep for the full 7 days while your batch pipeline catches up.
job = client.extract.create(
    file=open("paper.pdf", "rb"),
    options={"retain_hours": 168},
)
Force-delete a job before its retention window expires
Call DELETE /v1/jobs/{id} at any time to wipe a job's source and extracted images immediately. The job row itself is kept (anonymized) for billing audit, but its result is nulled and all storage is purged.
client.jobs.cancel(job.id)  # also purges storage
How to get extracted images
Image blocks in the response carry a presigned URL pointing at OCRQueen's R2 bucket. The URL is signed for a few hours — long enough to fetch the bytes, short enough that a leaked URL doesn't become a permanent backdoor.
{
  "id": "job_abc",
  "status": "completed",
  "expires_at": "2026-05-16T10:00:00Z",
  "result": {
    "pages": [
      {
        "page": 1,
        "blocks": [
          {
            "id": "img_1",
            "type": "image",
            "image_url": "https://...r2.cloudflarestorage.com/.../page-1.png?X-Amz-Expires=7200&X-Amz-Signature=...",
            "alt": "Network topology diagram",
            "image_type": "diagram"
          }
        ]
      }
    ]
  }
}
If you need to keep images long-term, download them within your retention window
Until retain_until passes, you can fetch every image URL with a plain HTTP GET; no auth header is needed because the signature is in the URL. Save the bytes wherever you want; after that point you can forget about us.
import httpx
from ocrqueen import OCRQueen

client = OCRQueen(api_key="pk_...")
job = client.extract.create(file=open("doc.pdf", "rb"))
result = client.jobs.wait(job)

# Walk every image block, download the bytes, store however you like.
# my_storage is a placeholder for your own storage client.
for page in result.raw["pages"]:
    for block in page["blocks"]:
        if block.get("type") == "image":
            bytes_ = httpx.get(block["image_url"]).content
            my_storage.put(f"{job.id}/{block['id']}.png", bytes_)
Once your storage has the bytes, OCRQueen's R2 can purge the originals safely; your downstream serving doesn't depend on us.
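The walk over pages and blocks can be factored into a small generator, which makes it easy to test without network access. This is a minimal sketch that assumes the result has the `pages` / `blocks` shape shown in the sample response; `iter_image_blocks` is a hypothetical helper, not part of the SDK.

```python
def iter_image_blocks(result):
    """Yield (page_number, block) for every image block in a result dict."""
    for page in result.get("pages", []):
        for block in page.get("blocks", []):
            if block.get("type") == "image":
                yield page.get("page"), block


# Sample data modeled on the response above; the URL is a placeholder.
sample = {
    "pages": [
        {
            "page": 1,
            "blocks": [
                {"id": "txt_1", "type": "text"},
                {
                    "id": "img_1",
                    "type": "image",
                    "image_url": "https://example.invalid/page-1.png",
                    "image_type": "diagram",
                },
            ],
        }
    ]
}

for page_no, block in iter_image_blocks(sample):
    print(page_no, block["id"], block["image_type"])  # 1 img_1 diagram
```

Using `.get` with defaults means a page with no blocks, or a result with no pages, yields nothing rather than raising.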
URLs vs the JSON response: store the result, but refresh URLs before serving
The image_url values are time-limited. If you save the full JSON to your database and try to use those URLs after a few hours, they'll return 403. Two clean fixes:
- Re-fetch the job via GET /v1/jobs/{id} to get fresh URLs. This works as long as retain_until hasn't passed.
- Download once and host the bytes yourself. Recommended for any production app that re-renders images.
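If you store the JSON and want to know when a refresh is due without waiting for a 403, the expiry can usually be read from the URL itself. The sketch below assumes the URLs are standard SigV4 presigned URLs that carry an `X-Amz-Date` parameter alongside the `X-Amz-Expires` shown in the sample response; `presigned_expiry` and `needs_refresh` are hypothetical helpers, and the five-minute margin is an arbitrary choice.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlsplit


def presigned_expiry(url):
    """Return the expiry time of a SigV4 presigned URL, or None if unreadable."""
    query = parse_qs(urlsplit(url).query)
    try:
        signed_at = datetime.strptime(query["X-Amz-Date"][0], "%Y%m%dT%H%M%SZ")
        lifetime = timedelta(seconds=int(query["X-Amz-Expires"][0]))
    except (KeyError, ValueError):
        return None
    return signed_at.replace(tzinfo=timezone.utc) + lifetime


def needs_refresh(url, now=None, margin=timedelta(minutes=5)):
    """True when the URL is expired, expires within `margin`, or has no readable expiry."""
    expiry = presigned_expiry(url)
    now = now or datetime.now(timezone.utc)
    return expiry is None or now + margin >= expiry


# Placeholder URL: signed at 10:00 UTC with a 7200-second lifetime.
url = (
    "https://example.invalid/page-1.png"
    "?X-Amz-Date=20260515T100000Z&X-Amz-Expires=7200&X-Amz-Signature=abc"
)
print(presigned_expiry(url).isoformat())  # 2026-05-15T12:00:00+00:00
```

When `needs_refresh` returns True, re-fetch the job to get fresh URLs; treating an unreadable URL as stale errs on the side of refreshing.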
Bring Your Own Storage (BYOS)
For regulated workloads or customers who need data residency, OCRQueen can write extracted images directly into a bucket you own. We never store the bytes in our R2; the URLs in the response point at your bucket, signed with the credentials you provided.
Supported destinations:
- Cloudflare R2
- Amazon S3 (any region)
- Google Cloud Storage (S3-compatible XML API)
- Any S3-compatible endpoint (MinIO, Backblaze B2, Wasabi…)
1. Register a destination
One-time setup per bucket. Credentials are encrypted at rest using AES-256-GCM keyed by your tenant ID — extraction workers decrypt them in memory only.
curl -X POST https://api.ocrqueen.com/v1/storage-destinations \
  -H "Authorization: Bearer pk_..." \
  -H "Content-Type: application/json" \
  -d '{
    "kind": "r2",
    "bucket": "my-extracts",
    "endpoint_url": "https://abc123.r2.cloudflarestorage.com",
    "access_key_id": "...",
    "secret_access_key": "...",
    "region": "auto"
  }'
# Response
# {
#   "id": "dest_xyz",
#   "kind": "r2",
#   "bucket": "my-extracts",
#   "verified_at": null,
#   ...
# }
2. Verify it
This confirms that we can write to and read from your bucket: it writes a small probe object, then immediately deletes it.
curl -X POST https://api.ocrqueen.com/v1/storage-destinations/dest_xyz/verify \
  -H "Authorization: Bearer pk_..."
# Response
# { "ok": true, "verified_at": "2026-05-15T12:34:56Z" }
3. Use it for an extraction
Reference the destination by ID in options. Any extracted images will be written to your bucket; the URLs in the response point there, not at OCRQueen's R2.
job = client.extract.create(
    file=open("doc.pdf", "rb"),
    options={
        "extraction_profile": "advanced",
        "storage_destination_id": "dest_xyz",
    },
)
List + delete destinations
# List all destinations for the authenticated customer
curl https://api.ocrqueen.com/v1/storage-destinations \
  -H "Authorization: Bearer pk_..."
# Remove a destination (does NOT delete the bucket, just removes the binding)
curl -X DELETE https://api.ocrqueen.com/v1/storage-destinations/dest_xyz \
  -H "Authorization: Bearer pk_..."
When you use BYOS, retain_hours on OCRQueen's side becomes a non-issue, because we never wrote bytes to our storage in the first place. Your bucket's own lifecycle rules govern when objects expire.
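Since expiry under BYOS is configured on your bucket, here is a minimal sketch of an S3-style lifecycle configuration. It is built as a plain dictionary in the shape that boto3's `put_bucket_lifecycle_configuration` accepts as its `LifecycleConfiguration` argument. The `extracts/` prefix and the 90-day period are hypothetical; the key layout OCRQueen uses inside your bucket is not specified here, and lifecycle support varies between S3-compatible providers, so check your provider's documentation.

```python
def expiration_rule(prefix, days, rule_id="expire-extracted-images"):
    """Build an S3-style lifecycle configuration that expires objects under `prefix`."""
    if days < 1:
        raise ValueError("days must be at least 1")
    return {
        "Rules": [
            {
                "ID": rule_id,
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Expiration": {"Days": days},
            }
        ]
    }


# Hypothetical values: adjust the prefix and period to your own bucket layout.
config = expiration_rule("extracts/", days=90)
print(config["Rules"][0]["Expiration"])  # {'Days': 90}
```

The dictionary can then be applied with your provider's S3 client or pasted into its console as the equivalent rule.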
Picking a strategy
There are three patterns; pick whichever matches your use case:
| Strategy | Setup | When to pick it |
|---|---|---|
| Default 24h + download | None | RAG pipelines that ingest text once. Indie products. Most developers. |
| Short retention (0–1h) | Pass retain_hours: 0 | Sensitive documents where you can't tolerate data living on our side past the extraction call. |
| BYOS | One-time bucket registration | Regulated workloads, data residency requirements, enterprise contracts, archival use cases. Customer owns the bucket lifetime. |
See POST /v1/extract for the full options schema, or GET /v1/jobs/{id} for how to refresh expired URLs.
