Data Extraction

Define a schema once, then extract structured data from any text, consistently, every time.

How It Works

Backbone takes a standard JSON Schema definition and uses it to extract structured output from unstructured text. You define the fields, types, and constraints — Backbone returns validated, schema-conformant JSON.

The workflow is: define schema → commit version → send text → receive structured output.

Schemas use JSON Schema

If you've used JSON Schema before, you already know the format. If not, it's just a way to describe which fields you expect, their types, and which ones are required.

Schema Management

Before you can extract anything, you need a schema. Schemas live inside projects and support full version control.

Creating a Schema

Schemas have a name, optional description, and a JSON Schema definition:

{
  "type": "object",
  "properties": {
    "company": { "type": "string" },
    "invoice_number": { "type": "string" },
    "total": { "type": "number" },
    "line_items": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "description": { "type": "string" },
          "quantity": { "type": "integer" },
          "price": { "type": "number" }
        }
      }
    }
  },
  "required": ["company", "invoice_number"]
}
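
To make the role of `properties` and `required` concrete, here is a minimal Python sketch that checks a candidate result against a trimmed copy of the schema above. The `check` helper is hypothetical and handles only two JSON Schema keywords; it illustrates the idea and is not how Backbone validates output.

```python
# Illustrative only: a hand-rolled check for the "required" and "type" keywords.
# A real JSON Schema validator covers far more of the specification.
SCHEMA = {
    "type": "object",
    "properties": {
        "company": {"type": "string"},
        "invoice_number": {"type": "string"},
        "total": {"type": "number"},
    },
    "required": ["company", "invoice_number"],
}

# Subset of JSON Schema type names mapped to Python types.
PY_TYPES = {"string": str, "number": (int, float), "integer": int}


def check(document, schema):
    """Return a list of problems; an empty list means none were found."""
    problems = []
    for field in schema.get("required", []):
        if field not in document:
            problems.append(f"missing required field: {field}")
    for field, spec in schema.get("properties", {}).items():
        expected = PY_TYPES.get(spec.get("type"))
        if field in document and expected and not isinstance(document[field], expected):
            problems.append(f"wrong type for field: {field}")
    return problems
```

Note that `total` is optional here: a document without it passes, while a document without `invoice_number` does not.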

Version Control

Every time you change your schema, you create a new version. Versions are immutable, which gives you a full audit trail of every change.

You can:

Commit new versions with a change description
Deactivate versions you don't want used anymore
Re-activate a historical version (creates a new version as a copy, preserving the audit trail)
Pin extractions to a specific version for reproducibility

Each new version auto-increments the version number and updates the latest label.

Re-activation creates a copy

Re-activating version 3 doesn't revert the schema. It creates version N+1, where N is the current highest version number, with the same content as version 3. The original stays untouched.
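
The append-only behaviour can be modelled in a few lines of Python. This is a conceptual sketch, not Backbone's implementation: `SchemaVersion`, `commit`, and `reactivate` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen mirrors the immutability of versions
class SchemaVersion:
    number: int
    json_schema: str
    description: str


def commit(history, json_schema, description):
    """Return a new history with one more version; the number auto-increments."""
    version = SchemaVersion(len(history) + 1, json_schema, description)
    return history + [version]


def reactivate(history, number):
    """Copy an old version's content into a brand-new version at the end."""
    old = next(v for v in history if v.number == number)
    return commit(history, old.json_schema, f"Re-activate v{number}")
```

Because `reactivate` only appends, every earlier entry in the history remains exactly as it was committed.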

Labels

Labels are named pointers to specific schema versions. They decouple your application code from version numbers, enabling controlled rollouts.

How Labels Work

Every schema gets a latest label automatically, updated on each new version. You create custom labels for deployment stages:

Label	Points to	Purpose
`latest`	v5 (auto)	Always the newest version
`production`	v3	What your API consumers use
`staging`	v5	What you're testing

Your application resolves schemas by ID + label. When you're ready to promote, just move the label pointer.
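
As a rough mental model, labels behave like a dictionary from label name to version, and promotion is a single pointer update. The Python sketch below uses the values from the table above; `resolve` and `promote` are hypothetical helpers, not part of any Backbone client.

```python
# Label name -> version number, mirroring the table above.
labels = {"latest": 5, "production": 3, "staging": 5}


def resolve(labels, label="latest"):
    """Look up the version a label currently points to."""
    return labels[label]


def promote(labels, source, target):
    """Return a new mapping where target points to the same version as source."""
    updated = dict(labels)
    updated[target] = labels[source]
    return updated
```

Application code that always asks for `production` picks up the new version after promotion without any code change.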

Managing Labels

`POST /api/v1/projects/{projectId}/schemas/{schemaId}/labels`

Create a label:

{
  "name": "production",
  "schemaVersionId": "version-uuid"
}

Update a label to point to a different version:

`PUT /api/v1/projects/{projectId}/schemas/{schemaId}/labels/{labelName}`

{
  "schemaVersionId": "new-version-uuid"
}

Label names must be lowercase alphanumeric with hyphens (e.g., production, staging, v2-rollback). The system latest label cannot be deleted.
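
A client can reject bad label names before calling the API. The pattern in this Python sketch is an assumption derived from the rule above (lowercase letters, digits, and hyphens); the server may apply stricter rules, for example on leading or trailing hyphens or on length.

```python
import re

# Assumed pattern: one or more lowercase letters, digits, or hyphens.
LABEL_PATTERN = re.compile(r"[a-z0-9-]+")


def is_valid_label_name(name):
    """Return True if the whole name matches the assumed label pattern."""
    return LABEL_PATTERN.fullmatch(name) is not None
```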

Resolving Schemas

`GET /api/v1/projects/{projectId}/schemas/{schemaId}/resolve`

Resolve a schema by ID and optional label to get the JSON Schema definition for a specific version:

Request

curl "https://backbone.manfred-kunze.dev/api/v1/projects/{projectId}/schemas/{schemaId}/resolve?label=production" \
  -H "Authorization: Bearer sk_your_api_key"

Parameter	Type	Required	Description
`label`	string	No	Label to resolve (defaults to `latest`)

Response:

{
  "schemaId": "uuid",
  "schemaName": "invoice-schema",
  "versionId": "uuid",
  "versionNumber": 3,
  "jsonSchema": {
    "type": "object",
    "properties": {
      "company": { "type": "string" },
      "total": { "type": "number" }
    }
  },
  "label": "production"
}
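
The same call can be made from Python with only the standard library. This is a minimal sketch: the helper names `resolve_url` and `fetch_json_schema` are hypothetical, and error handling is omitted.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://backbone.manfred-kunze.dev/api/v1"


def resolve_url(project_id, schema_id, label=None):
    """Build the resolve URL; omitting label lets the server default to latest."""
    url = f"{BASE_URL}/projects/{project_id}/schemas/{schema_id}/resolve"
    if label is not None:
        url += "?" + urllib.parse.urlencode({"label": label})
    return url


def fetch_json_schema(project_id, schema_id, api_key, label=None):
    """Call the resolve endpoint and return only the jsonSchema field."""
    request = urllib.request.Request(
        resolve_url(project_id, schema_id, label),
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["jsonSchema"]
```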

Validate Before Committing

`POST /api/v1/projects/{projectId}/schemas/{schemaId}/validate`

Check whether a schema is valid before you commit it. The response returns structured errors and warnings. Send the candidate schema in the request body:

{
  "jsonSchema": {
    "type": "object",
    "properties": {
      "name": { "type": "string" }
    }
  }
}
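
A validation request could be assembled in Python as sketched below. The function name `build_validate_request` is hypothetical, and the body shape is assumed to be the `jsonSchema` wrapper shown above. The response format is not documented here, so the sketch stops at building the request.

```python
import json
import urllib.request

BASE_URL = "https://backbone.manfred-kunze.dev/api/v1"


def build_validate_request(project_id, schema_id, api_key, json_schema):
    """Build (but do not send) a POST request for the validate endpoint."""
    body = json.dumps({"jsonSchema": json_schema}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/projects/{project_id}/schemas/{schema_id}/validate",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )
```

Passing the returned object to `urllib.request.urlopen` sends the request.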

Test Against Sample Text

`POST /api/v1/projects/{projectId}/schemas/{schemaId}/test`

Run an extraction against sample text without persisting the result. Great for iterating on your schema:

Request

curl -X POST https://backbone.manfred-kunze.dev/api/v1/projects/{projectId}/schemas/{schemaId}/test \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk_your_api_key" \
  -d '{
    "jsonSchema": {
      "type": "object",
      "properties": {
        "name": { "type": "string" },
        "email": { "type": "string" }
      }
    },
    "sampleText": "Contact Jane Smith at jane.smith@example.com",
    "model": "gpt-4o"
  }'

The response includes token usage and processing time so you can estimate costs:

{
  "success": true,
  "extractedData": { "name": "Jane Smith", "email": "jane.smith@example.com" },
  "inputTokens": 85,
  "outputTokens": 22,
  "processingDurationMs": 890,
  "modelUsed": "gpt-4o"
}
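
Turning the token counts into a cost figure is simple arithmetic. In this Python sketch the per-1,000-token prices are placeholder parameters, not Backbone's or any provider's actual pricing; substitute the rates that apply to your model.

```python
def estimate_cost(result, input_price_per_1k, output_price_per_1k):
    """Estimate the cost of one run from a test response.

    Prices are per 1,000 tokens and must be supplied by the caller.
    """
    input_cost = result["inputTokens"] / 1000 * input_price_per_1k
    output_cost = result["outputTokens"] / 1000 * output_price_per_1k
    return input_cost + output_cost
```

Multiplying the result by the expected number of documents gives a rough budget for a batch.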

Running Extractions

Synchronous

`POST /api/v1/projects/{projectId}/extractions`

The standard extraction endpoint. Send text, get structured data back immediately.

Request body:

Field	Type	Required	Description
`schemaId`	string	Yes	Schema to use for extraction
`schemaVersionId`	string	No	Pin to a specific version (defaults to latest active)
`inputText`	string	Yes	Text to extract data from
`model`	string	Yes	Platform model name (e.g., `gpt-4o`) or `provider/model` for BYOK

Request

curl -X POST https://backbone.manfred-kunze.dev/api/v1/projects/{projectId}/extractions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk_your_api_key" \
  -d '{
    "schemaId": "your-schema-id",
    "inputText": "Invoice #INV-2024-001 from Acme Corp. Total: $1,250.00",
    "model": "gpt-4o"
  }'
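
When building the request body in code, the optional `schemaVersionId` is best left out entirely unless you want to pin a version. A minimal Python sketch, with the hypothetical helper name `build_extraction_body`:

```python
def build_extraction_body(schema_id, input_text, model, schema_version_id=None):
    """Build the JSON body for a synchronous extraction request."""
    body = {"schemaId": schema_id, "inputText": input_text, "model": model}
    # Only pin a version when the caller asks for one; otherwise the
    # server-side default (latest active version) applies.
    if schema_version_id is not None:
        body["schemaVersionId"] = schema_version_id
    return body
```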

Asynchronous

`POST /api/v1/projects/{projectId}/extractions/async`

For large inputs, submit an extraction for background processing. You get back a 202 Accepted with a Location header to poll:

HTTP/1.1 202 Accepted
Location: /api/v1/projects/{projectId}/extractions/{id}
Retry-After: 5

Poll the Location URL until status changes from PENDING/PROCESSING to COMPLETED or FAILED.
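
A polling loop might look like the Python sketch below. It assumes the polled resource is a JSON object with a `status` field; `fetch_status` is a caller-supplied function that performs the GET on the Location URL and returns the parsed body. The attempt limit is an arbitrary safeguard, not a documented value.

```python
import time

TERMINAL_STATUSES = {"COMPLETED", "FAILED"}


def poll_until_done(fetch_status, retry_after=5.0, max_attempts=60, sleep=time.sleep):
    """Call fetch_status until the extraction reaches a terminal status."""
    for attempt in range(max_attempts):
        extraction = fetch_status()
        if extraction["status"] in TERMINAL_STATUSES:
            return extraction
        # Wait between polls, but not after the final attempt.
        if attempt < max_attempts - 1:
            sleep(retry_after)
    raise TimeoutError(f"extraction not finished after {max_attempts} polls")
```

Passing the `Retry-After` value from the 202 response as `retry_after` keeps the client aligned with the server's suggested interval.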

Estimate Tokens First

`POST /api/v1/projects/{projectId}/extractions/estimate`

Before running an extraction, you can estimate its token usage, and therefore its cost. The response looks like this:

{
  "inputTokens": 4200,
  "estimatedOutputTokens": 350,
  "strategy": "SINGLE_SHOT"
}

The strategy field tells you which processing path will be used.

Extraction Strategies

Backbone automatically picks the best strategy based on input size:

Strategy	Token Range	What Happens
Single-shot	Under 10K tokens	Entire input processed in one LLM call
Chunked	10K - 100K tokens	Input split into overlapping chunks, results merged
Async	Over 100K tokens	Background processing with polling
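
The selection rule in the table can be written as a small Python function. This is a sketch of the documented thresholds only: `SINGLE_SHOT` appears in the estimate response, while the `CHUNKED` and `ASYNC` names and the handling of the exact boundary values are assumptions.

```python
def pick_strategy(input_tokens):
    """Map an input token count to a strategy name per the table above."""
    if input_tokens < 10_000:
        return "SINGLE_SHOT"
    if input_tokens <= 100_000:
        return "CHUNKED"  # assumed name
    return "ASYNC"  # assumed name
```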

Chunking details

Chunks are max 8,000 tokens with 200 tokens of overlap to avoid losing context at boundaries. Results are automatically merged.
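
To illustrate how overlapping chunks cover an input, here is a minimal Python sketch of a sliding window with the sizes given above. It operates on an already-tokenized list and is not Backbone's chunker; how tokens are counted and how results are merged are outside its scope.

```python
def chunk_tokens(tokens, max_size=8000, overlap=200):
    """Split tokens into windows of at most max_size that share overlap tokens."""
    if not 0 <= overlap < max_size:
        raise ValueError("overlap must be non-negative and smaller than max_size")
    step = max_size - overlap  # how far each window advances
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + max_size])
        if start + max_size >= len(tokens):
            break  # this window already reached the end of the input
        start += step
    return chunks
```

With these defaults each window advances 7,800 tokens, so the last 200 tokens of one chunk are repeated at the start of the next.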

Re-running Extractions

`POST /api/v1/projects/{projectId}/extractions/{id}/rerun`

Need to retry an extraction with the same config? Hit the rerun endpoint. It creates a new extraction; the original stays untouched.

Listing and Filtering

`GET /api/v1/projects/{projectId}/extractions`

Parameter	Type	Description
`search`	string	Filter by model name
`schemaVersionId`	string	Filter by schema version
`status`	string	`PENDING`, `PROCESSING`, `COMPLETED`, or `FAILED`
`page`	number	Page number (0-based)
`size`	number	Page size (default: 20)
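
The filters combine as ordinary query parameters. The Python sketch below builds the query string with the standard library; `build_list_query` is a hypothetical helper, and it sends `page` and `size` explicitly rather than relying on server defaults.

```python
from urllib.parse import urlencode


def build_list_query(status=None, schema_version_id=None, search=None, page=0, size=20):
    """Build the query string for the list endpoint, skipping unset filters."""
    params = {
        "search": search,
        "schemaVersionId": schema_version_id,
        "status": status,
        "page": page,  # 0-based
        "size": size,
    }
    return urlencode({key: value for key, value in params.items() if value is not None})
```

Appending the result after a `?` on the list URL yields, for example, the third page of failed extractions when called with `status="FAILED", page=2`.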

Use Cases

Contact extraction: names, emails, phone numbers from unstructured text
Invoice processing: invoice numbers, dates, amounts, line items from documents
Resume parsing: skills, experience, education from CVs
Document analysis: key fields from contracts, reports, forms
Data entry automation: turn free-text notes into structured database records