A nine-step workflow for AI-assisted causal coding
From documents to a conclusion you can defend
Interviews. Reports. Open-ended survey answers. Hundreds of pages of people telling you what changed and why.
Somewhere in there is the answer to your evaluation or research questions.
How do you get to those answers rigorously?
Hand it to the black box
“ChatGPT, what does this say?”
Fast, fluent, and you have no idea what it leaned on. You cannot show your working, so you cannot defend the answer.
Read it all yourself
Thorough, but it does not scale. Three hundred transcripts? How do you make sure your synthesis really reflects what they all say, without jumping to early conclusions? And how do you synthesise in a way that answers the causal questions in research and evaluation?
There is a third way.
One workflow from raw text to a defensible conclusion, using AI as a clerk and making the judgements yourself.
Nine steps. Built for AI coding at scale, and it works just as well by hand.
Human first, AI next.
It does not compete with the evaluation methods you already know.
Causal mapping has a 50-year tradition.
It is the step that finds the causal claims, organises them, and hands them to contribution analysis, process tracing, Outcome Harvesting, QuIP, or your own judgement.
Stories in, organised evidence out. The evaluative call stays with you.
You read the text and mark each causal claim as a link from one factor to another.
“The training gave me confidence, and that is why I started the business.”
becomes training → confidence → started a business, with the verbatim quote and the source kept on every link.
A coded link means: there is evidence that this source claims X influenced Y.
Not that X really did influence Y. Twenty people saying so is not proof. It is evidence you can now weigh. Crossing from claims to conclusions is your job, and the back half of this workflow is about doing it well.
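A minimal sketch of what one coded link could look like as data. The class name, field names and the source label "interview_01" are illustrative assumptions, not the app's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """One causal claim: this source says `cause` influenced `effect`."""
    cause: str
    effect: str
    quote: str      # verbatim text behind the claim
    source_id: str  # who said it (hypothetical identifier)


def code_chain(factors, quote, source_id):
    """Turn a chain of factors into one link per adjacent pair."""
    return [Link(c, e, quote, source_id) for c, e in zip(factors, factors[1:])]


quote = "The training gave me confidence, and that is why I started the business."
links = code_chain(
    ["training", "confidence", "started a business"], quote, "interview_01"
)

for link in links:
    print(f"{link.cause} -> {link.effect} [{link.source_id}]")
```

Note what is absent: no strength, no polarity, no score. The record holds only the claim, the quote and the source.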
We deliberately do not code strength, polarity, or hidden meaning.
People say “X made Y happen”. They rarely say how strongly, or whether it was linear, or what the counterfactual was. So we do not invent it.
Minimalist coding is fast, automatable, and stays close to what people actually said. That is exactly why AI can do the heavy lifting.
The clerk’s job
Find every causal claim. Attach a quote. Tireless, exhaustive, cheap.
Your job
Decide the question, check the work, judge what it all means.
The minimalist task is simple enough to hand over. The judgement is not, so we keep it.
Plan
Steps 1 to 2. Decide what you want to answer and gather data that can answer it.
Code
Steps 3 to 5. Turn the text into a checked table of claims, each with a quote.
Query
Steps 6 to 9. Weigh the evidence and answer the question.
A few wide, cheap passes to capture the evidence. Then steadily narrower judgement.
1000 claims → 30 bundles → 25 assessed links → 1 conclusion
Coding is broad and cheap on purpose. The value is added later, spending that volume down into something you can vouch for.
Plan → Code → Query
Not a strict sequence. You revisit the early ones. Only the last is strictly required.
Before you touch a codebook, write down what you want to be able to say at the end, and to whom.
Every later choice, the data, the labels, the columns, the queries, follows from that one sentence.
Tip: sketch the map or table that would answer your question. That sketch is your target.
Good at
Not for
Pick questions the method can serve, and only as many as the evaluation needs.
The question decides the data.
Narrative material works best. Ask people what changed and why, and you get causal claims to code.
Want to compare women and men, or staff and clients, or early and late? Those groups must be in the data and recorded in the source metadata. You cannot compare what you did not capture.
How much freedom does the coder get to invent labels?
Loose finds more but leaves more to tidy. Tight is cleaner but misses links. Most projects start loose and tighten later by recoding.
The settings look like separate knobs. They all pull on the same four tensions.
Turn one knob and the others move.
You write an instruction for the AI, like a chatbot prompt, and paste it in.
In a hurry, or one short text? Press one-click and accept the defaults. Often that is enough.
The golden rule: test on a small, varied sample, work out exactly why the output is wrong or thin, fix the instruction, run again. Then scale up.
Holistic
One connected diagram per chunk. A more joined-up story. Best for a single short text. The model has more freedom.
Claim by claim
Every link separately. Fuller coverage. Best for many texts. Links join up less, so you recode later.
Every link needs a quote.
Without a verbatim quote behind each link, you cannot show your working, and you cannot defend the conclusion. Ask for it explicitly, every time.
However careful the coding, some links will be wrong. Check them before you analyse.
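One cheap check, sketched here under assumptions: confirm that each quote really appears in the source text, and send the rest for human review. The function names, the dictionary keys and the sample text are hypothetical. Matching ignores case, line breaks and curly versus straight quote marks, which is a choice you may want to tighten.

```python
import re


def normalise(text):
    """Lower-case, straighten curly quotes, collapse whitespace."""
    text = text.replace("\u2019", "'").replace("\u2018", "'")
    text = text.replace("\u201c", '"').replace("\u201d", '"')
    return re.sub(r"\s+", " ", text).strip().lower()


def quote_is_verbatim(quote, source_text):
    """True if a non-empty quote occurs in the source after normalising."""
    return bool(quote.strip()) and normalise(quote) in normalise(source_text)


source_text = (
    "Well, the training gave me\n"
    "confidence, and that is why I started the business."
)

coded = [
    {"cause": "training", "effect": "confidence",
     "quote": "the training gave me confidence"},
    {"cause": "training", "effect": "income",
     "quote": "the training raised my income"},  # not in the source
]

# Links whose quote cannot be found go to a human for checking.
to_review = [c for c in coded if not quote_is_verbatim(c["quote"], source_text)]
```

A passing quote only shows the words exist. Whether they actually support the link is still a judgement for the reader.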
Most claims are unmarked, so most are neutral. Neutral means “not mentioned”, not “medium”. Do not read these as scores.
Not a static report. A model you can query, over and over, to answer different questions.
It is a knowledge graph with one kind of relation: “influences”. Because every link is causal, a lot of evaluation questions can be answered almost out of the box.
Stories in. A queryable model out.
You query the graph with filters. Each filter is a question. Stack them and you answer a bigger one.
Links → women only → trace training to income → zoom out → bundle → map
The same data gives very different maps with no contradiction. Each map is just the result of a different chain of filters. Order usually matters.
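A small sketch of why order matters, assuming links are stored as plain dictionaries. The two filters, the field names and the data are invented for illustration and are not the app's filter set.

```python
from collections import Counter


def only_group(group):
    """Filter: keep links from sources in one group."""
    def apply(links):
        return [link for link in links if link["group"] == group]
    return apply


def top_pairs(n):
    """Filter: keep only the n most frequently mentioned cause-effect pairs."""
    def apply(links):
        counts = Counter((link["cause"], link["effect"]) for link in links)
        keep = {pair for pair, _ in counts.most_common(n)}
        return [link for link in links if (link["cause"], link["effect"]) in keep]
    return apply


def run(links, *filters):
    """Apply filters left to right, each one to the output of the last."""
    for f in filters:
        links = f(links)
    return links


def pairs(links):
    return {(link["cause"], link["effect"]) for link in links}


links = [
    {"cause": "training", "effect": "confidence", "source": "s1", "group": "women"},
    {"cause": "training", "effect": "confidence", "source": "s2", "group": "women"},
    {"cause": "grant", "effect": "income", "source": "s3", "group": "men"},
    {"cause": "grant", "effect": "income", "source": "s4", "group": "men"},
    {"cause": "grant", "effect": "income", "source": "s5", "group": "men"},
    {"cause": "grant", "effect": "income", "source": "s6", "group": "women"},
]

women_then_top = run(links, only_group("women"), top_pairs(1))
top_then_women = run(links, top_pairs(1), only_group("women"))
```

Here `women_then_top` answers "what do women mention most?" while `top_then_women` answers "what do women say about the link everyone mentions most?". Same data, two different maps, no contradiction.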
The moment coding is done, these answer themselves.
“What did the cash transfer lead to?”
Filter to that factor, look downstream. Out comes the map of every reported consequence, with counts.
“Do women and men tell different stories?”
Split by group on the same map. The links each group stresses light up differently.
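The first of these quick wins can be sketched as a simple walk through the links, assuming each link is a (cause, effect, source) tuple. The function name and every factor label except "cash transfer" are invented for illustration.

```python
from collections import defaultdict, deque


def downstream(links, start):
    """Map each link reachable from `start` to its number of distinct sources."""
    out = defaultdict(list)
    for cause, effect, source in links:
        out[cause].append((effect, source))

    sources = defaultdict(set)
    seen, queue = {start}, deque([start])
    while queue:
        factor = queue.popleft()
        for effect, source in out[factor]:
            sources[(factor, effect)].add(source)
            if effect not in seen:
                seen.add(effect)
                queue.append(effect)
    return {pair: len(s) for pair, s in sources.items()}


links = [
    ("cash transfer", "school fees paid", "s1"),
    ("cash transfer", "school fees paid", "s2"),
    ("school fees paid", "children in school", "s1"),
    ("drought", "crop loss", "s3"),  # unrelated, so not in the result
]

consequences = downstream(links, "cash transfer")
```

This pools links across all sources, which is fine for a first look at direct consequences. Longer chains read off a pooled map need more care.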
These quick wins also sharpen your question before the harder work. The next four steps are for the questions that need a defensible answer.
A bundle is all the claims that say the same X influences Y, from different sources or different parts of one source.
Whatever else you do, weigh each bundle as a whole. How many sources? How convincing? Do they agree or pull apart?
You can weigh bundles by eye, or record the judgement formally.
Collapse a bundle into a single assessed link that carries your quality score. The raw claims are not deleted; a switch shows either view, never both at once. Thin bundles earn no assessed link.
Either way the move is the same: from a mass of raw claims to a smaller set you are willing to vouch for. The app makes you write the criteria down first, on purpose.
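A rough sketch of bundling, assuming links are dictionaries with a cause, an effect and a source. The threshold `min_sources` stands in for whatever written criteria you choose; it is an assumption, not the app's rule, and the counts it records are an input to your quality judgement, not a replacement for it.

```python
from collections import defaultdict


def bundle(links):
    """Group raw claims by their (cause, effect) pair."""
    bundles = defaultdict(list)
    for link in links:
        bundles[(link["cause"], link["effect"])].append(link)
    return dict(bundles)


def assess(bundles, min_sources=2):
    """Summarise each bundle; thin bundles earn no assessed link."""
    assessed = {}
    for pair, claims in bundles.items():
        n_sources = len({claim["source"] for claim in claims})
        if n_sources >= min_sources:
            assessed[pair] = {"sources": n_sources, "claims": len(claims)}
    return assessed


links = [
    {"cause": "training", "effect": "confidence", "source": "s1"},
    {"cause": "training", "effect": "confidence", "source": "s1"},  # same source, later passage
    {"cause": "training", "effect": "confidence", "source": "s2"},
    {"cause": "confidence", "effect": "started a business", "source": "s3"},
]

bundles = bundle(links)
assessed = assess(bundles)
```

The raw claims stay in `bundles`; `assessed` is a second, smaller view on top of them.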
Now the indirect questions. How does an intervention reach an outcome, two or three steps down the chain?
This is where causal mapping earns its name, and where the biggest trap lives.
A pig farmer says:
the cash grant gave me more cash
A wheat farmer says:
more cash let me buy more seed
So cash grants lead to more seed?
No. The first step is true only for pig farmers, the second only for wheat farmers. Two people, two stories, stitched into one that nobody told.
Path tracing shows every link on a route between two factors, across all sources. Easy to misread.
Source tracing keeps only the sources whose own account runs all the way through. Every link is then part of one complete story told by one person. That is the safe move.
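The two farmers can be put into a few lines of code. This is a simplified sketch: it only asks whether a route exists, first with all sources pooled and then within each source's own links. The function names and source labels are illustrative.

```python
from collections import defaultdict


def reachable(edges, start, goal):
    """True if `goal` can be reached from `start` along the given edges."""
    graph = defaultdict(set)
    for cause, effect in edges:
        graph[cause].add(effect)
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        for nxt in graph[node] - seen:
            seen.add(nxt)
            stack.append(nxt)
    return False


def path_exists_pooled(links, start, goal):
    """Path tracing, simplified: pool every source's links together."""
    return reachable([(c, e) for c, e, _ in links], start, goal)


def sources_with_full_path(links, start, goal):
    """Source tracing: keep sources whose own links run all the way through."""
    by_source = defaultdict(list)
    for cause, effect, source in links:
        by_source[source].append((cause, effect))
    return sorted(s for s, edges in by_source.items() if reachable(edges, start, goal))


links = [
    ("cash grant", "more cash", "pig farmer"),
    ("more cash", "more seed", "wheat farmer"),
]

pooled = path_exists_pooled(links, "cash grant", "more seed")
complete = sources_with_full_path(links, "cash grant", "more seed")
```

Pooled, the route from cash grant to more seed appears to exist. Source by source, nobody tells it, so `complete` is empty.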
How much did it matter, and compared to what?
Compare the influence you care about against the rival explanations, on the same map, not in isolation. Count the sources whose narratives actually run from your driver to your outcome.
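A sketch of that count, assuming (cause, effect, source) tuples. The rival driver "good rains" and all the data are invented for illustration.

```python
from collections import defaultdict


def runs_through(edges, start, goal):
    """True if one source's own links connect `start` to `goal`."""
    graph = defaultdict(set)
    for cause, effect in edges:
        graph[cause].add(effect)
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        for nxt in graph[node] - seen:
            seen.add(nxt)
            stack.append(nxt)
    return False


def sources_per_driver(links, drivers, outcome):
    """For each driver, count sources whose narrative reaches the outcome."""
    by_source = defaultdict(list)
    for cause, effect, source in links:
        by_source[source].append((cause, effect))
    return {
        driver: sum(runs_through(edges, driver, outcome) for edges in by_source.values())
        for driver in drivers
    }


links = [
    ("training", "confidence", "s1"),
    ("confidence", "income", "s1"),
    ("training", "income", "s2"),
    ("good rains", "income", "s3"),
    ("training", "confidence", "s4"),  # stops short of the outcome
]

counts = sources_per_driver(links, ["training", "good rains"], "income")
```

The counts put your driver and its rivals side by side. They say how many people told each story, not how large each effect was.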
Finally, step back and draw the conclusion.
Behind a single tidy map there may still be hundreds of quotes. Does the claim hold up? Do all the links really belong to the same context?
The AI can draft a vignette: a source-by-source commentary on the pathways, judging how coherent each account is. It does only what a patient reader could. Treat it as a first draft and edit it.
Does the evidence answer your Step 1 question?
If yes, you have a conclusion with every step on show, from quote to claim to bundle to pathway to judgement.
Code “X influenced Y, with a quote”. Capture cheaply. Then judge, in the open, until you have something you can defend.
None of this is statistical causal inference. It is a disciplined way to assemble evidence, weigh it transparently, and reach conclusions you can stand behind.
Causal mapping is the evidence broker. It feeds the methods you already trust.
Most real evaluations combine several. Pick the methods to fit the question.
This is how we work at Causal Map Ltd, and it keeps evolving.
If you want to go from a stack of documents to a conclusion you can defend, come and try it with us.
Companion working paper: “A workflow for AI-assisted causal coding”. App: app.causalmap.app