Multi-Format Ingest Strategy

decisions architecture ingest ai config-driven

How to handle incoming data in multiple formats from multiple sources. The core decision: known formats get deterministic config-driven mapping; unknown formats get AI-assisted draft config followed by human review; PDFs and unstructured documents get AI extraction at the edges.

Decision made 2026-04. Validated against cost and reliability analysis.

The Problem

A single integration may receive data in dozens of formats from different sources. Options:

Pure AI — let an LLM parse and map every incoming document at runtime.
Pure deterministic — hand-code a parser for every format.
Hybrid — config-driven mapping for known formats, AI only for unknown formats and unstructured documents.

The Decision: Hybrid Approach

Known format = deterministic config mapping. Unknown format = AI draft config, then human review. Unstructured document (PDF, etc.) = AI extraction, then config validation.

Known Formats

When you know what a source sends (e.g., a trading partner always sends the same schema), store the mapping in a config record and process deterministically. Zero AI at runtime. Fully auditable. Fast.

[ingest] --> [lookup config by source + type] --> [apply mapping] --> [validate] --> [process]

Config record example:

source: partner-a
format: TYPE_X_V2
field_map:
  their_field_name: our_canonical_field
  their_date_format: ISO8601
  their_status_codes: { "01": "ACCEPTED", "05": "REJECTED" }

Adding a new source = adding a config record. No code change. No deployment.
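As a rough illustration of the lookup-and-map step, the sketch below applies a config record to one incoming record. The store layout, the field names (`ord_no`, `stat`, `ord_dt`), and the `apply_mapping` helper are all hypothetical; the split into `field_map`, `date_fields`, and `code_maps` is one possible shape for a config record, not the one the config store necessarily uses.

```python
from datetime import datetime

# Hypothetical in-memory config store keyed by (source, format).
CONFIG_STORE = {
    ("partner-a", "TYPE_X_V2"): {
        # their field name -> our canonical field name
        "field_map": {"ord_no": "order_id", "stat": "status", "ord_dt": "order_date"},
        # canonical field -> the date pattern the partner uses (assumed)
        "date_fields": {"order_date": "%d/%m/%Y"},
        # canonical field -> partner code -> canonical value
        "code_maps": {"status": {"01": "ACCEPTED", "05": "REJECTED"}},
    }
}


def apply_mapping(source, fmt, record):
    """Map one partner record to canonical fields using only the stored config."""
    config = CONFIG_STORE.get((source, fmt))
    if config is None:
        # Unknown source/format pair: the caller routes this elsewhere.
        raise LookupError(f"no config for {source}/{fmt}")

    out = {}
    for theirs, ours in config["field_map"].items():
        if theirs not in record:
            raise ValueError(f"missing field {theirs!r}")
        out[ours] = record[theirs]

    # Normalize dates to ISO 8601.
    for name, pattern in config["date_fields"].items():
        out[name] = datetime.strptime(out[name], pattern).date().isoformat()

    # Translate partner codes to canonical values.
    for name, codes in config["code_maps"].items():
        if out[name] not in codes:
            raise ValueError(f"unmapped code {out[name]!r} for {name}")
        out[name] = codes[out[name]]
    return out


result = apply_mapping(
    "partner-a", "TYPE_X_V2",
    {"ord_no": "A-100", "stat": "01", "ord_dt": "03/04/2026"},
)
```

Under these assumptions, onboarding another partner means adding one more entry to `CONFIG_STORE`; `apply_mapping` itself does not change.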

Unknown Formats

When a new source sends a format not in the config store:

AI reads a sample document and drafts a config record.
Human reviews and approves the draft.
Approved config enters the config store.
All future documents from that source use deterministic processing.

AI is used once per format, at setup time, not on every document at runtime.
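The four steps above could be sketched as a store with separate pending and approved areas. Everything here is an assumption made for illustration: the `ConfigStore` class, the `route` function, and especially `draft_config_with_ai`, which stands in for a real model call and only normalizes field names.

```python
from dataclasses import dataclass, field


@dataclass
class ConfigStore:
    """Hypothetical store: drafts stay in `pending` until a human approves them."""
    approved: dict = field(default_factory=dict)
    pending: dict = field(default_factory=dict)

    def submit_draft(self, key, draft):
        self.pending[key] = draft

    def approve(self, key, reviewer):
        # Only an explicit approval moves a draft into the live store.
        draft = self.pending.pop(key)
        self.approved[key] = {**draft, "approved_by": reviewer}

    def lookup(self, key):
        return self.approved.get(key)


def draft_config_with_ai(sample):
    # Placeholder for the AI drafting step (assumption): a real system would
    # send the sample document to a model. Here we just normalize field names.
    return {"field_map": {name: name.strip().lower().replace(" ", "_") for name in sample}}


def route(store, key, sample):
    """Decide how a document is handled based on config state."""
    if store.lookup(key) is not None:
        return "deterministic"
    if key not in store.pending:
        # The draft is created once per format, not once per document.
        store.submit_draft(key, draft_config_with_ai(sample))
    return "awaiting_review"


store = ConfigStore()
key = ("partner-b", "TYPE_Y_V1")
first = route(store, key, {"Order No": "B-7"})
store.approve(key, reviewer="reviewer-1")
second = route(store, key, {"Order No": "B-8"})
```

In this sketch `first` is `"awaiting_review"` and `second` is `"deterministic"`: nothing a draft proposes is used for processing until `approve` has been called.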

Unstructured Documents (PDF, Free Text)

AI extraction is appropriate here because deterministic parsing is not feasible. The AI extracts structured fields, then those fields are validated against a known schema and processed deterministically.

[PDF ingest] --> [AI extraction] --> [schema validation] --> [deterministic processing]
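A minimal sketch of the schema validation step in that pipeline, assuming the AI extraction returns a plain dictionary. The schema contents, the `validate_extraction` and `gate` names, and the `"exception_queue"` label are illustrative assumptions, not a prescribed interface.

```python
# Hypothetical schema for one document type: field name -> required Python type.
INVOICE_SCHEMA = {"invoice_number": str, "total": float, "currency": str}


def validate_extraction(fields, schema=INVOICE_SCHEMA):
    """Return a list of problems found in AI-extracted fields (empty if none)."""
    errors = []
    for name, expected_type in schema.items():
        if name not in fields:
            errors.append(f"missing: {name}")
        elif not isinstance(fields[name], expected_type):
            errors.append(f"wrong type: {name}")
    # Fields the schema does not know about are rejected, not passed through.
    for name in sorted(set(fields) - set(schema)):
        errors.append(f"unexpected: {name}")
    return errors


def gate(fields):
    """Send clean extractions on to processing and everything else to review."""
    errors = validate_extraction(fields)
    if errors:
        return ("exception_queue", errors)
    return ("process", [])


ok = gate({"invoice_number": "INV-1", "total": 120.5, "currency": "EUR"})
bad = gate({"invoice_number": "INV-2", "total": "120.50"})
```

The design choice being illustrated is that the gate only inspects the extracted fields. It has no knowledge of how they were produced, which keeps the AI step and the deterministic step cleanly separated.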

Cost Analysis

Approach	AI calls per document	Relative cost at scale
Pure AI	1 per document	400x baseline
Hybrid (known format)	0 per document	1x (config lookup only)
Hybrid (new format setup)	1 per format, amortized	~0 at volume
Hybrid (PDF extraction)	1 per document	400x, but only for PDFs

For known-format documents, the hybrid approach is approximately 400x cheaper at runtime than pure AI. Known-format documents typically make up the majority of volume in a mature integration. The per-document cost ratio is constant, so the absolute cost gap grows linearly with volume.
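To make the arithmetic concrete, the sketch below compares total relative cost for a mixed workload, using the 400x figure from the table as the only cost input. The document counts are invented for illustration, and `relative_cost` is a hypothetical helper.

```python
# Relative cost of one AI call versus one config lookup, taken from the table.
AI_COST_RATIO = 400


def relative_cost(known_docs, pdf_docs, new_formats=0):
    """Return (hybrid_cost, pure_ai_cost) in units of one config lookup."""
    hybrid = (
        known_docs * 1                  # deterministic config lookup
        + pdf_docs * AI_COST_RATIO      # AI extraction for every PDF
        + new_formats * AI_COST_RATIO   # one AI drafting call per new format
    )
    pure_ai = (known_docs + pdf_docs) * AI_COST_RATIO
    return hybrid, pure_ai


# Illustrative workload (assumed numbers): 95,000 known-format documents,
# 5,000 PDFs, and 10 newly onboarded formats.
hybrid, pure_ai = relative_cost(known_docs=95_000, pdf_docs=5_000, new_formats=10)
```

With these assumed numbers the hybrid total is 2,099,000 units against 40,000,000 for pure AI, roughly a 19x saving overall. The 400x figure applies only to the known-format share, so the blended saving depends heavily on how much of the volume is PDFs.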

Parts Catalog Model

For integrations with a known universe of data types (e.g., a parts catalog, a product catalog, a code table), the config store functions as the canonical reference. Incoming data is looked up against the catalog rather than parsed freeform.

Known item in catalog: deterministic match, zero AI.
Unknown item: AI attempts to match it against the nearest catalog entry and flags the proposed match for human review.
Confirmed match: added to catalog for future deterministic resolution.

This is a specialization of the config-driven routing pattern applied to data content rather than data format.
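The catalog flow can be sketched as follows. The catalog contents and the `resolve` and `confirm` helpers are hypothetical, and `difflib.get_close_matches` from the standard library is used purely as a stand-in for the AI matching step so that the example runs on its own.

```python
from difflib import get_close_matches

# Hypothetical catalog: normalized item description -> part number.
CATALOG = {
    "hex bolt m8": "P-1001",
    "hex nut m8": "P-1002",
    "flat washer m8": "P-1003",
}


def resolve(item, catalog, review_queue):
    """Return the part number for a known item, else queue a suggestion."""
    key = item.strip().lower()
    if key in catalog:
        return catalog[key]  # deterministic match, no AI involved
    # Stand-in for the AI nearest-match step (assumption): fuzzy string match.
    candidates = get_close_matches(key, list(catalog), n=1, cutoff=0.6)
    review_queue.append({"item": key, "suggested": candidates[0] if candidates else None})
    return None  # unresolved until a human confirms


def confirm(entry, catalog):
    """A reviewer accepted the suggestion: add the alias to the catalog."""
    if entry["suggested"] is None:
        raise ValueError("nothing to confirm for this entry")
    catalog[entry["item"]] = catalog[entry["suggested"]]


queue = []
known = resolve("Hex Bolt M8", CATALOG, queue)
unknown = resolve("hex bolt m-8", CATALOG, queue)
confirm(queue[0], CATALOG)
after_review = resolve("hex bolt m-8", CATALOG, queue)
```

After the reviewer confirms the suggestion, the same spelling resolves deterministically on every later document, which mirrors the once-per-format use of AI in the format-level flow.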

When to Use Pure AI

Pure AI at runtime is appropriate when:

Documents are genuinely unstructured and format varies per document (not per source).
Volume is low enough that cost is not a constraint.
Speed of setup matters more than runtime cost.
The domain is novel and a catalog/config approach would require constant maintenance.

Pure AI is not appropriate when:

Volume is high (cost scales linearly with documents).
Auditability is required (AI mappings are not deterministically reproducible).
The same source sends the same format repeatedly (config-driven is strictly better).

Implementation Notes

Start with AI-assisted config generation even for known formats if you are building from scratch. Let the AI draft the config, then lock it down. Faster than hand-coding from scratch.
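Version config records. Adding a source needs no code deployment, but a bad config change can break processing just as a bad release can, so treat every config change as a deployment event: reviewed, versioned, and reversible.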
Keep the AI extraction and the deterministic validation as separate steps with a clear boundary. Do not let AI decisions flow through to output without a validation gate.
Human review of AI-drafted configs is not optional. The AI will be wrong on edge cases. Review cadence: full review of every new format, plus sampled review of AI-extracted PDFs.
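One way to version config records is an append-only history in which a rollback is published as a new version. The `VersionedConfig` class below is a hypothetical sketch of that idea, not a description of an existing component.

```python
class VersionedConfig:
    """Append-only version history for one config record (illustrative only)."""

    def __init__(self):
        self._versions = []

    def publish(self, config, author):
        """Store a copy of the config as the next version and return its number."""
        version = len(self._versions) + 1
        self._versions.append({"version": version, "author": author, "config": dict(config)})
        return version

    def current(self):
        return self._versions[-1]["config"]

    def rollback(self, author):
        """Republish the previous version as a new one, keeping the full audit trail."""
        if len(self._versions) < 2:
            raise ValueError("no earlier version to roll back to")
        return self.publish(self._versions[-2]["config"], author)

    def history(self):
        return [(v["version"], v["author"]) for v in self._versions]


record = VersionedConfig()
record.publish({"status_codes": {"01": "ACCEPTED"}}, author="alice")
record.publish({"status_codes": {"01": "REJECTED"}}, author="bob")  # the bad change
rolled_back_to = record.rollback(author="alice")
```

Because the rollback is itself a new version, the history still shows the bad change and who reverted it, which supports the auditability goal.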

Related

config-driven-routing: the config-driven approach this decision implements
error-handling: unknown formats that fail AI extraction route to exception queues
mock-data-strategy: mock configs for testing new formats before real data arrives
pre-release-checklist: config records must be reviewed before release

Clarity Wiki
