Source types

Provisioning reads your existing material and turns it into engine configuration. The one decision that shapes the result is what you put in the request's `sources` array: the engine you get back is only as good as what you point it at.

Each entry is one of two kinds. A link source is a URL the platform fetches and crawls for you; a content source is raw text or markdown you pass in directly. You can mix both in the same request, up to ten sources total. This page covers the two kinds, what the platform does with each, and how to pick sources that produce a useful configuration rather than an empty one. The sources field itself lives on the create request – see Create a provisioning job for the full payload and the 202 response.

The two source kinds

Every source is an object with a type and a payload. The type decides how the platform reads the payload.

| Type | Payload | Reach for it when |
| --- | --- | --- |
| `link` | A URL to crawl | The context already lives on the web – your brand page, public docs, a published style guide, a glossary page. |
| `content` | Raw text or markdown | The context lives in your head or a private doc – terminology lists, tone rules, product-name conventions, translation do's and don'ts. |

```json
{
  "sources": [
    { "type": "link", "payload": "https://acme.com/brand-guidelines" },
    { "type": "link", "payload": "https://acme.com/docs/style-guide" },
    {
      "type": "content",
      "payload": "Brand name 'Acme' is never translated. Use formal tone in German (Sie-form). Product names: AcmeFlow, AcmeSync, AcmeVault - always keep in English."
    }
  ]
}
```

Two links and one content block in the same array. The links point at pages that already hold the context; the content block carries rules that live nowhere public. Both feed the same extraction step.
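The same array can be assembled and sanity-checked in code before you send it. The sketch below is a minimal Python illustration, not part of any official SDK: the helper names (`link`, `content`, `check_sources`) and the `MAX_SOURCES` constant are hypothetical, and the checks mirror only what this page states (two kinds, at most ten sources). The empty-payload check is an extra client-side assumption.

```python
import json

MAX_SOURCES = 10  # limit stated on this page: up to ten sources per request
VALID_TYPES = {"link", "content"}


def link(url: str) -> dict:
    """Build a link source: a URL the platform fetches and crawls."""
    return {"type": "link", "payload": url}


def content(text: str) -> dict:
    """Build a content source: raw text or markdown passed in directly."""
    return {"type": "content", "payload": text}


def check_sources(sources: list[dict]) -> list[str]:
    """Return the problems found; an empty list means nothing was flagged."""
    problems = []
    if len(sources) > MAX_SOURCES:
        problems.append(f"{len(sources)} sources given, limit is {MAX_SOURCES}")
    for i, src in enumerate(sources):
        if src.get("type") not in VALID_TYPES:
            problems.append(f"source {i}: unknown type {src.get('type')!r}")
        payload = src.get("payload")
        # Assumption: an empty payload is never useful, so flag it client-side.
        if not isinstance(payload, str) or not payload.strip():
            problems.append(f"source {i}: empty payload")
    return problems


sources = [
    link("https://acme.com/brand-guidelines"),
    link("https://acme.com/docs/style-guide"),
    content("Brand name 'Acme' is never translated. Use formal tone in German (Sie-form)."),
]

problems = check_sources(sources)
# Serialize only the "sources" part; the rest of the create request is documented elsewhere.
body = json.dumps({"sources": sources}, indent=2)
print(problems or "sources look fine")
```

Returning a list of problems rather than raising on the first one lets you report every issue in a candidate set at once.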

What the platform does with each

The two kinds differ in one step – getting the text in front of the AI agent – and converge after that.

A link source is fetched and converted to markdown before analysis. The platform crawls link sources in parallel, so ten URLs are not ten sequential round-trips – they are read concurrently, then handed to the agent as text. You give a URL; the platform does the fetching and the HTML-to-markdown reduction so the agent reads prose, not page markup.
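To make the "concurrent, not sequential" point concrete, here is a rough Python sketch of the general pattern. It is not the platform's crawler: `fetch_as_markdown` is a hypothetical stand-in that fakes the fetch-and-convert step with a short sleep so the example runs offline, and the URLs are made up.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_as_markdown(url: str) -> str:
    """Hypothetical stand-in for fetch + HTML-to-markdown; sleeps to mimic latency."""
    time.sleep(0.1)
    return f"# Page at {url}\n\n(converted markdown would go here)"


def read_links(urls: list[str]) -> list[str]:
    """Read all URLs concurrently; results keep the order of the input list."""
    with ThreadPoolExecutor(max_workers=max(1, len(urls))) as pool:
        return list(pool.map(fetch_as_markdown, urls))


urls = [f"https://acme.com/docs/page-{i}" for i in range(10)]

start = time.perf_counter()
pages = read_links(urls)
elapsed = time.perf_counter() - start

# The ten simulated reads overlap, so the total stays well under the
# one second that ten sequential 0.1 s reads would take.
print(f"read {len(pages)} pages in {elapsed:.2f}s")
```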

A content source skips that step. The text you send is passed to the AI agent directly, exactly as written. There is no crawl, no conversion, nothing between your words and the agent – which is why a content source is the most precise way to state a rule you already know.

From there both kinds are the same input: the agent reads all of it and extracts brand voices, glossary items, and instructions. What it produces from that text, and the summary it returns, is its own subject – see What the AI extracts.

How deep does a link crawl go?

A link source is fetched and converted to markdown before the agent analyzes it. This page does not specify whether the crawler follows links beyond the URL you supply, or to what depth. If you need a specific set of pages analyzed, list each one as its own link source rather than relying on a single URL to fan out.
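One way to follow that advice is to generate one link source per page from an explicit list. This is a minimal Python sketch: the helper name `link_sources_for` is hypothetical, and the page URLs are made up for illustration.

```python
MAX_SOURCES = 10  # per-request limit stated earlier on this page


def link_sources_for(urls: list[str]) -> list[dict]:
    """Return one link source per distinct URL, in first-seen order."""
    # dict.fromkeys drops duplicates while preserving insertion order.
    unique = list(dict.fromkeys(u.strip() for u in urls if u.strip()))
    if len(unique) > MAX_SOURCES:
        raise ValueError(
            f"{len(unique)} pages listed, but a request takes at most {MAX_SOURCES} sources"
        )
    return [{"type": "link", "payload": u} for u in unique]


pages = [
    "https://acme.com/docs/style-guide",
    "https://acme.com/docs/style-guide/tone",
    "https://acme.com/docs/glossary",
    "https://acme.com/docs/style-guide",  # duplicate, dropped
]
sources = link_sources_for(pages)
print(len(sources))  # 3
```

Listing pages explicitly trades a little typing for certainty about what gets read, and the ten-source limit becomes visible before the request is sent.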

Pick sources that carry signal

Source selection decides whether provisioning is worth running. The extraction is only as good as its input, and the failure here is quiet: a job against weak sources still completes and still creates an engine, but a nearly empty one, and you find out later when translations ignore conventions you assumed were captured. The completion arrives like any other (see Webhook delivery), so nothing flags the gap for you.

Provide meaningful sources

The quality of the extracted configuration depends on the quality of your input. Link sources should point to pages with useful context – brand guidelines, style guides, product documentation, glossaries. Raw content sources should contain concrete terminology, tone guidance, or translation rules. Generic marketing pages or login screens produce little useful configuration.

The pattern behind the callout: the agent extracts what is stated, not what is implied. A page that says "we write in a friendly, direct German that uses Sie, never Du" yields a brand voice. A glossary page that lists "workspace → Arbeitsbereich" yields a glossary item. A polished landing page that demonstrates good tone without naming a single rule yields almost nothing, because there is no rule on it to lift. When in doubt, prefer the source that says the rule out loud – which is often a content block you write in a sentence rather than a page you hope the agent infers from.
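If the rules already exist as structured data, a content block that states each one explicitly is easy to generate. The Python sketch below is one possible layout, not a required format: the function name `rules_to_content` and the markdown headings are hypothetical choices, and the sample rules are borrowed from the examples on this page.

```python
def rules_to_content(
    brand: str,
    tone_rules: list[str],
    glossary: dict[str, str],
    keep_as_is: list[str],
) -> dict:
    """Render explicit rules as a markdown content source."""
    lines = [f"# Translation rules for {brand}", "", "## Tone"]
    lines += [f"- {rule}" for rule in tone_rules]
    lines += ["", "## Glossary"]
    lines += [f"- {term} → {translation}" for term, translation in glossary.items()]
    lines += ["", "## Never translate"]
    lines += [f"- {name}" for name in keep_as_is]
    return {"type": "content", "payload": "\n".join(lines)}


source = rules_to_content(
    brand="Acme",
    tone_rules=["Write friendly, direct German that uses Sie, never Du."],
    glossary={"workspace": "Arbeitsbereich"},
    keep_as_is=["Acme", "AcmeFlow", "AcmeSync", "AcmeVault"],
)
print(source["payload"])
```

Every line of the payload names a rule outright, which is the kind of input the paragraph above describes as yielding a brand voice or glossary item.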

One weak source won't sink the job

A natural worry follows from feeding several sources at once: if one URL is dead or one block is thin, does the whole request fail? It does not. Sources are read independently, and a per-item failure is recorded rather than fatal – a dead link or an unreadable block is skipped, and the agent works from what it could read. The job fails as a whole only when no source could be read at all, leaving nothing to analyze. The exact shapes of those outcomes – the per-item failures recorded on success, and the failure payload when nothing could be read – belong to What the AI extracts and Webhook delivery.

So you can list a candidate set without auditing every URL first: the strong sources contribute, the weak ones drop out, and you read the output summary to see what actually landed. Point it at what you already have – then check what came back.
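The behavior described in this section can be modeled in a few lines. The Python sketch below is only a mental model, not the platform's code: `read_all`, the fake reader, and the shape of the failure records are all hypothetical, and the real payloads are documented in What the AI extracts and Webhook delivery.

```python
def read_all(sources: list[dict], read_one) -> tuple[list[str], list[dict]]:
    """Read each source independently; fail only if nothing could be read."""
    texts, failures = [], []
    for index, source in enumerate(sources):
        try:
            texts.append(read_one(source))
        except Exception as exc:
            # A per-item failure is recorded, not fatal.
            failures.append({"index": index, "reason": str(exc)})
    if not texts:
        raise RuntimeError("no source could be read, nothing to analyze")
    return texts, failures


def fake_reader(source: dict) -> str:
    """Hypothetical reader: treats any URL containing 'dead' as unreachable."""
    if source["type"] == "link" and "dead" in source["payload"]:
        raise ConnectionError("could not fetch " + source["payload"])
    return source["payload"]


sources = [
    {"type": "link", "payload": "https://acme.com/brand-guidelines"},
    {"type": "link", "payload": "https://acme.com/dead-page"},
    {"type": "content", "payload": "Use formal tone in German (Sie-form)."},
]

texts, failures = read_all(sources, fake_reader)
print(len(texts), "read,", len(failures), "skipped")  # 2 read, 1 skipped
```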

Next steps

Create a provisioning job

The full create request the sources array is part of, with the 202 response and engine ID.

What the AI extracts

Brand voices, glossary items, and instructions the agent builds from your sources, plus the output summary.

Live progress (WebSocket)

Watch crawling and configuring steps as the job reads your sources and builds the engine.