When we started building the Multimodal Reasoning Agent, the obvious approach was a single prompt: upload image, ask question, get answer. We tried it. The outputs were shallow, inconsistent, and frequently missed nuance that was clearly visible in the image.

So we built a 4-stage visual reasoning pipeline instead. Here's exactly how it works under the hood.

The Architecture Decision

The fundamental insight was that image understanding, like research, benefits from staged cognitive processing. Human visual experts don't glance at an X-ray and immediately diagnose. They look, catalogue, hypothesise, reason, and then conclude. We built an AI that does the same thing.

Stage 1 — Initial Perception
👁 What Is in This Image?
The model sees the image cold — without the user's task influencing its interpretation. It produces a raw perception report: subject, scene type, dominant colours, composition, text detected, and quality assessment. This "clean look" prevents confirmation bias in later stages.
Stage 2 — Deep Visual Analysis
🔮 What Is Actually Here?
Now the model goes deep: cataloguing every object, mapping spatial relationships, extracting data from charts or tables, identifying the domain (medical, engineering, artistic, etc.), and noting visual cues like lighting, composition style, and UI state for screenshots.
Stage 3 — Reasoning
🧩 What Does It Mean?
Armed with a rich visual knowledge base from stages 1 and 2, the model now applies the user's specific task. It chains visual evidence to conclusions explicitly, notes assumptions, identifies patterns and anomalies, and produces a confidence rating with reasoning.
Stage 4 — Final Response
✅ What Should You Know?
The final stage synthesises everything into a structured, evidence-based answer: direct response, key insights with visual evidence, detailed explanation, and recommended next steps. Every claim is traceable back to something visible in the image.
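The four stages above can be written down as a simple prompt table. This is an illustrative sketch — the wording is paraphrased from the descriptions above, not the product's actual system prompts:

```python
# Each stage is (name, instruction). Wording is illustrative only.
STAGES = [
    ("Initial Perception",
     "Describe the image cold: subject, scene type, dominant colours, "
     "composition, any detected text, and quality. Do NOT consider the "
     "user's task yet."),
    ("Deep Visual Analysis",
     "Catalogue every object, map spatial relationships, extract data "
     "from charts or tables, identify the domain, and note visual cues "
     "such as lighting, composition style, and UI state."),
    ("Reasoning",
     "Apply the user's task to the visual evidence. Chain evidence to "
     "conclusions explicitly, note assumptions, flag patterns and "
     "anomalies, and rate your confidence with reasoning."),
    ("Final Response",
     "Synthesise a structured, evidence-based answer: direct response, "
     "key insights with visual evidence, detailed explanation, and "
     "recommended next steps."),
]
```

Because later stages receive the earlier stages' outputs as context, the order of this list is load-bearing: shuffling it would reintroduce the task-first bias the pipeline is designed to avoid.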

The Technical Implementation

Each stage calls GPT-4o's vision API with the same image (as base64) plus the structured outputs of the prior stages as context. The image is not stored server-side — it lives in the browser as a Blob (referenced via an object URL) and is re-sent with each API call, then discarded.
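A minimal sketch of how one stage's request can be assembled — the same base64 image on every call, with earlier stages' text prepended as context. `build_stage_messages` is a hypothetical helper, not the product's actual code:

```python
import base64


def build_stage_messages(stage_prompt, image_bytes, prior_outputs):
    """Assemble chat messages for one pipeline stage.

    The same image is re-sent (as a base64 data URL) on every call,
    together with the structured text produced by earlier stages.
    """
    data_url = "data:image/png;base64," + base64.b64encode(image_bytes).decode("ascii")
    context = "\n\n".join(
        f"## Stage {i + 1} output\n{out}" for i, out in enumerate(prior_outputs)
    )
    return [
        {"role": "system", "content": stage_prompt},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": context or "No prior stages."},
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        },
    ]


# A stage would then be invoked roughly like:
#   resp = client.chat.completions.create(model="gpt-4o", messages=msgs)
```

The accumulating-context design means token cost grows with each stage, which is the price of letting stage 3 reason over stage 1's unbiased description.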

Streaming is handled via Django's StreamingHttpResponse with NDJSON — one JSON line emitted per completed stage. This lets the UI update progressively in real time rather than waiting for all four stages to complete.
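The NDJSON emission reduces to a generator that yields one JSON line per finished stage; in the Django view this generator is wrapped in `StreamingHttpResponse(..., content_type="application/x-ndjson")`. A sketch with stubbed stage functions (field names are illustrative):

```python
import json


def stage_events(stages):
    """Yield one NDJSON line per completed pipeline stage.

    `stages` is a list of (name, callable) pairs; each callable blocks
    until that stage's model call returns.
    """
    for i, (name, run) in enumerate(stages, start=1):
        output = run()  # blocking model call for this stage
        yield json.dumps({"stage": i, "name": name, "output": output}) + "\n"


# Stubbed example -- real stages would call the vision API:
lines = list(stage_events([
    ("perception", lambda: "raw perception report"),
    ("analysis", lambda: "object catalogue"),
]))
```

Because each line is a complete JSON document, the browser can `JSON.parse` on every newline without buffering the whole response.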

What We Learned

The biggest surprise: the "Initial Perception" stage, which we almost cut as redundant, turned out to be the most important. By forcing the model to describe the image without task context first, we dramatically reduced the rate of confirmation bias in later stages — the model was less likely to "see" things that confirmed the user's question rather than what was actually present.

🔗 Try It Yourself

The Multimodal Reasoning Agent is available to all Pro and Business subscribers. Upload any image — medical scans, charts, architectural drawings, screenshots — and watch the 4-stage reasoning unfold in real time.