A nine-step workflow for AI-assisted causal coding

From documents to a conclusion you can defend

You have a stack of documents

Interviews. Reports. Open-ended survey answers. Hundreds of pages of people telling you what changed and why.

Somewhere in there are the answers to your evaluation or research questions.

How do you get to those answers in a rigorous way?

Two tempting shortcuts, both bad

Hand it to the black box

“ChatGPT, what does this say?”

Fast, fluent, and you have no idea what it leaned on. You cannot show your working, so you cannot defend the answer.

Read it all yourself

Thorough, but it does not scale. Three hundred transcripts? How do you make sure your synthesis really reflects what they all say, without jumping to early conclusions? And how do you synthesise in a way that answers the causal questions in research and evaluation?

There is a third way.

The promise

One workflow from raw text to a defensible conclusion, using AI as a clerk and making the judgements yourself.

Nine steps. Built for AI coding at scale, and it works just as well by hand.

Human first, AI next.

First, the big idea

Causal mapping is an evidence broker

It does not compete with the evaluation methods you already know.

It has a 50-year tradition.

It is the step that finds the causal claims, organises them, and hands them to contribution analysis, process tracing, Outcome Harvesting, QuIP, or your own judgement.

Stories in, organised evidence out. The evaluative call stays with you.

What a coded claim is

You read the text and mark each causal claim as a link from one factor to another.

“The training gave me confidence, and that is why I started the business.”

becomes training → confidence → started a business, with the verbatim quote and the source kept on every link.
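A minimal sketch of how such a coded claim could be stored, assuming one record per link. The `Link` class, its field names, and the source ID `S01` are illustrative assumptions, not the format of any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    cause: str      # the factor doing the influencing
    effect: str     # the factor being influenced
    quote: str      # verbatim text that supports the link
    source_id: str  # who said it (hypothetical ID scheme)


quote = "The training gave me confidence, and that is why I started the business."

# One sentence, two links. The same quote and source travel with both.
links = [
    Link("training", "confidence", quote, "S01"),
    Link("confidence", "started a business", quote, "S01"),
]

for link in links:
    print(f"{link.cause} -> {link.effect}  [{link.source_id}]")
```

Note what is absent: no strength, no polarity, no score. Just who claimed what influenced what, and the words they used.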

We code claims, not facts

A coded link means: there is evidence that this source claims X influenced Y.

Not that X really did influence Y. Twenty people saying so is not proof. It is evidence you can now weigh. Crossing from claims to conclusions is your job, and the back half of this workflow is about doing it well.

Keep the coding minimalist

We deliberately do not code strength, polarity, or hidden meaning.

People say “X made Y happen”. They rarely say how strongly, or whether it was linear, or what the counterfactual was. So we do not invent it.

Minimalist coding is fast, automatable, and stays close to what people actually said. That is exactly why AI can do the heavy lifting.

AI as a clerk, not an oracle

The clerk’s job

Find every causal claim. Attach a quote. Tireless, exhaustive, cheap.

Your job

Decide the question, check the work, judge what it all means.

The minimalist task is simple enough to hand over. The judgement is not, so we keep it.

How the workflow is built

Three acts

Collect

Steps 1 to 2. Decide what you want to answer and gather data that can answer it.

Code

Steps 3 to 5. Turn the text into a checked table of claims, each with a quote.

Query

Steps 6 to 9. Weigh the evidence and answer the question.

Capture cheaply, then judge

A few wide, cheap passes to capture the evidence. Then steadily narrower judgement.

1000 claims → 30 bundles → 25 assessed links → 1 conclusion

Coding is broad and cheap on purpose. The value is added later, spending that volume down into something you can vouch for.

The nine steps

Collect

Overall planning
Data gathering

Code

Manage the codebook
Code the claims
Check and enrich the links

Query

From claims to bundles
From bundles to pathways
Judge value and contribution
Holistic judgement

Not a strict sequence. You revisit the early ones. Only the last is strictly required.

Collect

Step 1: Start from the question

Before you touch a codebook, write down what you want to be able to say at the end, and to whom.

Every later choice (the data, the labels, the columns, the queries) follows from that one sentence.

Tip: sketch the map or table that would answer your question. That sketch is your target.

Be realistic about what it can answer

Good at

Which factors matter most
What drives or follows from a factor
How groups differ
Whether evidence fits a theory of change
The overall structure of the system

Not for

Effect sizes
Proving X causes Y on its own

Pick questions the method can serve, and only as many as the evaluation needs.

Step 2: Gather data that can answer it

The question decides the data.

Narrative material works best. Ask people what changed and why, and you get causal claims to code.

Want to compare women and men, or staff and clients, or early and late? Those groups must be in the data and recorded in the source metadata. You cannot compare what you did not capture.

Code

Step 3: Manage the codebook

How much freedom does the coder get to invent labels?

Forced: only your labels
Mostly fixed: your labels, flag new ones for review
Hierarchical: fix the top level, free the detail
Free: invent everything

Loose finds more but leaves more to tidy. Tight is cleaner but misses links. Most projects start loose and tighten later by recoding.
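The four settings can be read as one rule applied to each label the coder proposes. The sketch below is one possible reading, not how any app implements it: the `Mode` names, the `accept` / `review` / `reject` outcomes, and the `;` separator between top level and detail are all assumptions made for illustration.

```python
from enum import Enum


class Mode(Enum):
    FORCED = "forced"
    MOSTLY_FIXED = "mostly fixed"
    HIERARCHICAL = "hierarchical"
    FREE = "free"


def check_label(label, codebook, mode, sep=";"):
    """Return 'accept', 'review' or 'reject' for a label the coder proposes."""
    if mode is Mode.FREE or label in codebook:
        return "accept"
    if mode is Mode.FORCED:
        return "reject"   # only your labels are allowed
    if mode is Mode.MOSTLY_FIXED:
        return "review"   # new label: flag it for a human to look at
    # HIERARCHICAL: the top level must be known, the detail is free.
    top = label.split(sep)[0].strip()
    return "accept" if top in codebook else "reject"


codebook = {"Training", "Income", "Confidence"}

print(check_label("Wellbeing", codebook, Mode.MOSTLY_FIXED))
print(check_label("Income; crop sales", codebook, Mode.HIERARCHICAL))
```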

Four tensions behind every coding choice

The settings look like separate knobs. They all pull on the same four tensions.

Precision and recall: are the links right, and did you catch them all?
Freedom and control: how much rope you give the model
Capture now, judge later: grab it cheaply, commit late
Cost and time: how much you spend tuning the rest

Turn one knob and the others move.
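Precision and recall can be put into numbers if you hand-code a small sample and compare. A rough sketch, assuming a link counts as a match only when both labels are identical to the hand-coded ones (in practice labels vary, so this undercounts). The function name and the toy links are made up.

```python
def precision_recall(ai_links, hand_links):
    """Compare AI-coded links with hand-coded links for the same text.

    Both arguments are sets of (cause, effect) pairs.
    """
    correct = ai_links & hand_links
    precision = len(correct) / len(ai_links) if ai_links else 0.0
    recall = len(correct) / len(hand_links) if hand_links else 0.0
    return precision, recall


hand = {
    ("training", "confidence"),
    ("confidence", "started a business"),
    ("cash grant", "more cash"),
    ("more cash", "more seed"),
}
ai = {
    ("training", "confidence"),
    ("cash grant", "more cash"),
    ("training", "more cash"),  # not in the hand coding
}

precision, recall = precision_recall(ai, hand)
print(f"precision {precision:.2f}, recall {recall:.2f}")
```

Here two of the three AI links are right, and two of the four hand-coded links were found.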

The choices feed the tensions

flowchart LR
    subgraph CHOICES["Your coding choices"]
        direction TB
        c1["Codebook strictness"]
        c2["Chunk size"]
        c3["Model"]
        c4["Iterations"]
        c5["Custom columns"]
        c6["Context"]
    end
    subgraph TENSIONS["The four tensions"]
        direction TB
        PR["Precision vs Recall"]
        FC["AI freedom vs your control"]
        CN["Capture now vs judge later"]
        CT["Cost and time"]
    end
    c1 -.-> FC
    c2 -.-> PR
    c3 -.-> PR
    c4 -.-> CT
    c5 -.-> PR
    c6 -.-> PR
    PR <==> FC
    FC <==> CN
    CN <==> CT
    CT <==> PR
    linkStyle 0,1,2,3,4,5 stroke:#999,stroke-width:1px
    linkStyle 6,7,8,9 stroke:#1a936f,stroke-width:3px

Dotted: each choice feeds the tensions. Solid green: the tensions pull on each other.

Step 4: Code the claims

You write an instruction for the AI, like a chatbot prompt, and paste it in.

In a hurry, or one short text? Press one-click and accept the defaults.

The golden rule: test on a small, varied sample → work out exactly why the output is wrong or thin → fix the instruction → iterate → scale up.


Holistic or claim by claim?

Holistic

One connected diagram per chunk. A more joined-up story. Best for a single short text. The model has more freedom.

Claim by claim

Every link separately. Fuller coverage. Best for many texts. Links join up less, so you recode later.

The one rule you never break

Every link needs a quote.

Without a verbatim quote behind each link, you cannot show your working, and you cannot defend the conclusion. Ask for it explicitly, every time.
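Asking for a quote is one thing. Checking that it really is verbatim is cheap to automate. A minimal sketch, assuming links arrive as dictionaries with a `quote` field and that differences in case and whitespace are acceptable. Both helper names are hypothetical.

```python
import re


def normalise(text):
    """Collapse whitespace and lower-case, so line breaks do not cause false alarms."""
    return re.sub(r"\s+", " ", text).strip().lower()


def quote_is_verbatim(quote, source_text):
    """True if the quote appears word for word in the source text."""
    q = normalise(quote)
    return bool(q) and q in normalise(source_text)


source_text = "The training gave me confidence,\nand that is why I started the business."

links = [
    {"cause": "training", "effect": "confidence",
     "quote": "The training gave me confidence"},
    {"cause": "confidence", "effect": "started a business",
     "quote": "Confidence made me start a business"},  # a paraphrase, not a quote
]

unsupported = [l for l in links if not quote_is_verbatim(l["quote"], source_text)]
for l in unsupported:
    print("No verbatim quote for:", l["cause"], "->", l["effect"])
```

Links that fail the check go back for recoding or get dropped. An empty quote fails too.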

Step 5: Check and enrich the links

However careful the coding, some links will be wrong. Check them before you analyse.

Tag the doubtful or surprising ones, to filter later
Add a conviction column: how sure the source sounds
Score sources too: reliability, role

Most claims are unmarked, so most are neutral. Neutral means “not mentioned”, not “medium”. Do not read these as scores.

Query

Your links are a knowledge graph

Not a static report. A model you can query, over and over, to answer different questions.

It is a knowledge graph with one kind of relation: “influences”. Because every link is causal, a lot of evaluation questions can be answered almost out of the box.

Stories in. A queryable model out.

A filter is a question, and filters chain

You query the graph with filters. Each filter is a question. Stack them and you answer a bigger one.

Links → women only → trace training to income → zoom out → bundle → map

The same data gives very different maps with no contradiction. Each map is just the result of a different chain of filters. Order usually matters.
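To see why order matters, here is a toy sketch in which each filter is a function from a list of links to a list of links, and a chain simply applies them in turn. The filter names, the `gender` field, and the data are invented for the example.

```python
from collections import Counter
from functools import reduce

links = [
    {"cause": "training", "effect": "confidence", "source": "S1", "gender": "woman"},
    {"cause": "confidence", "effect": "income", "source": "S1", "gender": "woman"},
    {"cause": "training", "effect": "income", "source": "S2", "gender": "man"},
    {"cause": "training", "effect": "income", "source": "S3", "gender": "man"},
    {"cause": "training", "effect": "income", "source": "S6", "gender": "man"},
    {"cause": "cash grant", "effect": "income", "source": "S4", "gender": "woman"},
    {"cause": "cash grant", "effect": "income", "source": "S5", "gender": "woman"},
]


def women_only(ls):
    return [l for l in ls if l["gender"] == "woman"]


def most_common_pair(ls):
    """Keep only the links for the most frequently mentioned cause-effect pair."""
    counts = Counter((l["cause"], l["effect"]) for l in ls)
    if not counts:
        return []
    top, _ = counts.most_common(1)[0]
    return [l for l in ls if (l["cause"], l["effect"]) == top]


def chain(ls, *filters):
    """Apply the filters left to right."""
    return reduce(lambda acc, f: f(acc), filters, ls)


women_then_top = chain(links, women_only, most_common_pair)
top_then_women = chain(links, most_common_pair, women_only)

print(len(women_then_top), len(top_then_women))
```

Women first, then the top link: the cash grant story. Top link first, then women: nothing, because the most mentioned link overall comes only from men. Same data, no contradiction, different question.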

The general questions, answered off the links

The moment coding is done, these answer themselves.

Which factors come up most often?
Which are the main drivers?
Which are the main outcomes?
What leads to a factor you care about?
What follows from it?

How do groups differ: women and men, arm A and arm B?
Are there hidden subgroups?
What is surprising or emerging?
What is the overall structure, and are there feedback loops?

Two everyday examples

“What did the cash transfer lead to?”

Filter to that factor, look downstream. Out comes the map of every reported consequence, with counts.

“Do women and men tell different stories?”

Split by group on the same map. The links each group stresses light up differently.
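A sketch of the group split, assuming each link carries a `group` value copied from the source metadata and that you count distinct sources rather than mentions. The function name and the toy data are illustrative only.

```python
from collections import defaultdict

links = [
    {"cause": "cash transfer", "effect": "school fees", "source": "S1", "group": "women"},
    {"cause": "cash transfer", "effect": "school fees", "source": "S2", "group": "women"},
    {"cause": "cash transfer", "effect": "school fees", "source": "S2", "group": "women"},
    {"cause": "cash transfer", "effect": "livestock", "source": "S3", "group": "men"},
    {"cause": "cash transfer", "effect": "livestock", "source": "S4", "group": "men"},
    {"cause": "cash transfer", "effect": "school fees", "source": "S4", "group": "men"},
]


def sources_per_link_by_group(links):
    """For each group, count how many distinct sources mention each link."""
    seen = defaultdict(lambda: defaultdict(set))
    for l in links:
        seen[l["group"]][(l["cause"], l["effect"])].add(l["source"])
    return {
        group: {pair: len(sources) for pair, sources in pairs.items()}
        for group, pairs in seen.items()
    }


table = sources_per_link_by_group(links)
for group, pairs in table.items():
    for (cause, effect), n in pairs.items():
        print(f"{group}: {cause} -> {effect} ({n} sources)")
```

Source S2 says the same thing twice and still counts once. Counting sources, not mentions, stops one talkative person from dominating the map.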

These quick wins also sharpen your question before the harder work. The next four steps are for the questions that need a defensible answer.

Step 6: From claims to bundles

A bundle is all the claims that say the same X influences Y, from different sources or different parts of one source.

Whatever else you do, weigh each bundle as a whole. How many sources? How convincing? Do they agree or pull apart?

Recording the verdict: assessed links

You can weigh bundles by eye, or record the judgement formally.

Collapse a bundle into a single assessed link that carries your quality score. The raw claims are not deleted; a switch shows either view, never both at once. Thin bundles earn no assessed link.

Either way the move is the same: from a mass of raw claims to a smaller set you are willing to vouch for. The app makes you write the criteria down first, on purpose.
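The move from raw claims to assessed links can be sketched in a few lines. This is an illustration under stated assumptions, not the app's logic: the criterion (`min_sources`), the analyst's `scores` dictionary, and the claims themselves are invented. The quality score is supplied by you. The code only carries it.

```python
from collections import defaultdict

# Criteria written down first, before looking at the bundles.
MIN_SOURCES = 2  # hypothetical threshold: fewer sources means a thin bundle

claims = [
    {"cause": "training", "effect": "confidence", "source": "S1"},
    {"cause": "training", "effect": "confidence", "source": "S2"},
    {"cause": "training", "effect": "confidence", "source": "S2"},
    {"cause": "training", "effect": "income", "source": "S3"},
]


def bundle(claims):
    """Group claims that say the same X influences Y."""
    bundles = defaultdict(list)
    for c in claims:
        bundles[(c["cause"], c["effect"])].append(c)
    return dict(bundles)


def assess(bundles, min_sources, scores):
    """Collapse each sufficiently supported bundle into one assessed link."""
    assessed = []
    for (cause, effect), group in bundles.items():
        n_sources = len({c["source"] for c in group})
        if n_sources < min_sources:
            continue  # thin bundle: earns no assessed link
        assessed.append({
            "cause": cause,
            "effect": effect,
            "n_claims": len(group),
            "n_sources": n_sources,
            "quality": scores.get((cause, effect)),  # the analyst's verdict
        })
    return assessed


bundles = bundle(claims)
scores = {("training", "confidence"): "strong"}  # judged by a human
assessed = assess(bundles, MIN_SOURCES, scores)
print(assessed)
```

The raw claims list is untouched, so either view stays available.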

Step 7: From bundles to pathways

Now the indirect questions. How does an intervention reach an outcome, two or three steps down the chain?

This is where causal mapping earns its name, and where the biggest trap lives.

The transitivity trap

A pig farmer says:

the cash grant gave me more cash

A wheat farmer says:

more cash let me buy more seed

So cash grants lead to more seed?

No. The first step is true only for pig farmers, the second only for wheat farmers. Two people, two stories, stitched into one that nobody told.

Source tracing keeps you safe

Path tracing shows every link on a route between two factors, across all sources. Easy to misread.

Source tracing keeps only the sources whose own account runs all the way through. Every link is then part of one complete story told by one person. That is the safe move.
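The difference between the two is easy to show on the farmers. A minimal sketch, assuming links are dictionaries with `cause`, `effect` and `source`. The function names and the third farmer are made up for the example.

```python
from collections import defaultdict


def has_route(links, start, end):
    """Path tracing: is there any route from start to end in these links?"""
    graph = defaultdict(set)
    for l in links:
        graph[l["cause"]].add(l["effect"])
    stack, seen = [start], {start}
    while stack:
        node = stack.pop()
        if node == end:
            return True
        for nxt in graph[node] - seen:
            seen.add(nxt)
            stack.append(nxt)
    return False


def source_trace(links, start, end):
    """Source tracing: keep only sources whose own links run all the way through."""
    by_source = defaultdict(list)
    for l in links:
        by_source[l["source"]].append(l)
    return sorted(s for s, own in by_source.items() if has_route(own, start, end))


trap = [
    {"cause": "cash grant", "effect": "more cash", "source": "pig farmer"},
    {"cause": "more cash", "effect": "more seed", "source": "wheat farmer"},
]

print(has_route(trap, "cash grant", "more seed"))     # pooled links suggest a route
print(source_trace(trap, "cash grant", "more seed"))  # but nobody told that story

# A hypothetical third farmer who tells the whole story.
full = trap + [
    {"cause": "cash grant", "effect": "more cash", "source": "maize farmer"},
    {"cause": "more cash", "effect": "more seed", "source": "maize farmer"},
]
print(source_trace(full, "cash grant", "more seed"))
```

Pooled, the route exists. Source by source, it exists only once someone actually tells it end to end.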

Step 8: Judge value and contribution

How much did it matter, and compared to what?

Compare the influence you care about against the rival explanations, on the same map, not in isolation. Count the sources whose narratives actually run from your driver to your outcome.
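One way to put rivals side by side is to count, for each candidate driver, the sources whose own narrative runs from that driver to the outcome. A sketch with invented data. The tuple layout `(source, cause, effect)` and both function names are assumptions. The tally is evidence to weigh, not an effect size.

```python
def reaches(own_links, driver, outcome):
    """True if one source's own links lead from driver to outcome."""
    reached = {driver}
    changed = True
    while changed:
        changed = False
        for cause, effect in own_links:
            if cause in reached and effect not in reached:
                reached.add(effect)
                changed = True
    return outcome in reached


def sources_supporting(links, driver, outcome):
    """Sources whose narrative runs all the way from driver to outcome."""
    by_source = {}
    for source, cause, effect in links:
        by_source.setdefault(source, []).append((cause, effect))
    return {s for s, own in by_source.items() if reaches(own, driver, outcome)}


links = [
    ("S1", "training", "confidence"),
    ("S1", "confidence", "income"),
    ("S2", "training", "income"),
    ("S3", "good harvest", "income"),
    ("S4", "good harvest", "income"),
    ("S5", "good harvest", "income"),
    ("S5", "training", "confidence"),  # mentions training, but not as a route to income
    ("S6", "remittances", "income"),
]

drivers = ["training", "good harvest", "remittances"]
tally = {d: len(sources_supporting(links, d, "income")) for d in drivers}
print(tally)
```

S5 mentions training but does not connect it to income, so it counts for the harvest and not for the training.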

Step 9: Holistic judgement

Finally, step back and draw the conclusion.

Behind a single tidy map there may still be hundreds of quotes. Does the claim hold up? Do all the links really belong to the same context?

The AI can draft a vignette: a source-by-source commentary on the pathways, judging how coherent each account is. It does only what a patient reader could. Treat it as a first draft and edit it.

Close the loop

Does the evidence answer your Step 1 question?

If yes, you have a conclusion with every step on show, from quote to claim to bundle to pathway to judgement.

So what

The whole thing in one line

Code “X influenced Y, with a quote”.

Capture everything. Capture cheaply. AI can do this.

Then judge, in the open.

None of this is statistical causal inference. It is a disciplined way to assemble evidence, weigh it transparently, and reach conclusions you can stand behind.

Where it fits

Causal mapping is the evidence broker. It feeds the methods you already trust.

Contribution analysis
Process tracing
Outcome Harvesting

QuIP
Realist evaluation
Most Significant Change

Most real evaluations combine several. Pick the methods to fit the question.

We use this every day

This is how we work at Causal Map Ltd, and it keeps evolving.

If you want to go from a stack of documents to a conclusion you can defend, come and try it with us.

Companion working paper: “A workflow for AI-assisted causal coding”. App: app.causalmap.app