When we started building the Multimodal Reasoning Agent, the obvious approach was a single prompt: upload image, ask question, get answer. We tried it. The outputs were shallow, inconsistent, and frequently missed nuance that was clearly visible in the image.
So we built a 4-stage visual reasoning pipeline instead. Here's exactly how it works under the hood.
The Architecture Decision
The fundamental insight was that image understanding, like research, benefits from staged cognitive processing. Human visual experts don't glance at an X-ray and immediately diagnose. They look, catalogue, hypothesise, reason, and then conclude. We built an AI that does the same thing.
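The staged flow can be pictured as a plain sequential loop in which each stage sees the accumulated output of every stage before it and nothing else. A minimal sketch follows; the stage names other than "Initial Perception" and the `run_stage` helper are illustrative, not the product's actual internals.

```python
def run_stage(stage: str, prior: dict) -> str:
    # Stand-in for the per-stage model call; it just records which stage
    # ran and how much earlier context it received.
    return f"{stage} (context from {len(prior)} earlier stages)"

STAGES = [
    "Initial Perception",      # describe the image with no task context
    "Detail Analysis",         # hypothetical: catalogue salient elements
    "Hypothesis & Reasoning",  # hypothetical: reason against the question
    "Conclusion",              # hypothetical: synthesise a final answer
]

def run_pipeline() -> dict:
    outputs: dict[str, str] = {}
    for stage in STAGES:
        # Each stage sees everything produced so far, never future stages.
        outputs[stage] = run_stage(stage, dict(outputs))
    return outputs
```

The key design choice is that context only flows forward: stage one runs blind, and each later stage builds on a frozen copy of what came before.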
The Technical Implementation
Each stage calls GPT-4o's vision API with the same image (as base64) plus the structured output from prior stages as context. The image is never stored server-side: the browser holds it as a JavaScript Blob URL for display, re-encodes it as base64 for each API call, and it is discarded once the request completes.
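A sketch of what one stage's request payload might look like, using the standard OpenAI chat-completions vision format (base64 image embedded as a data URI). The prompt wording and the way prior outputs are framed are assumptions; only the payload shape comes from the documented API.

```python
import base64
import json

def build_stage_messages(image_bytes: bytes, stage_prompt: str,
                         prior_outputs: dict[str, str]) -> list[dict]:
    """Build the messages for one stage: the same image every time,
    plus the structured outputs of earlier stages as text context."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    content = [
        {"type": "image_url",
         "image_url": {"url": f"data:image/png;base64,{b64}"}},
        {"type": "text", "text": stage_prompt},
    ]
    if prior_outputs:  # stage 1 runs with no prior context
        content.append({"type": "text",
                        "text": "Earlier stages:\n" + json.dumps(prior_outputs)})
    return [{"role": "user", "content": content}]

# The actual call would then look roughly like (not executed here):
# client.chat.completions.create(model="gpt-4o",
#                                messages=build_stage_messages(...))
```

Because the image is re-sent per stage, the payload builder is pure: nothing persists between calls except the text outputs threaded through as context.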
Streaming is handled via Django's StreamingHttpResponse with NDJSON: one JSON line emitted per completed stage. The UI updates as each stage finishes rather than waiting for all four to complete.
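The NDJSON emission can be sketched as a plain generator that yields one JSON line per completed stage. In the real view this generator would be wrapped in Django's `StreamingHttpResponse` (shown only as a comment so the sketch stays dependency-free); the stage-runner callable is a hypothetical stand-in for the model call.

```python
import json
from typing import Callable, Iterator

def stream_stages(stages: list[str],
                  run_stage: Callable[[str, dict], str]) -> Iterator[str]:
    """Yield one NDJSON line per completed stage, threading prior
    outputs into each subsequent stage as context."""
    outputs: dict[str, str] = {}
    for i, stage in enumerate(stages, start=1):
        outputs[stage] = run_stage(stage, dict(outputs))
        yield json.dumps({"stage": i, "name": stage,
                          "output": outputs[stage]}) + "\n"

# In the Django view (sketch, requires Django):
# return StreamingHttpResponse(
#     stream_stages(STAGES, call_gpt4o),
#     content_type="application/x-ndjson",
# )
```

One line per stage keeps the client parser trivial: split on newlines, `JSON.parse` each fragment, render.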
What We Learned
The biggest surprise: the "Initial Perception" stage, which we almost cut as redundant, turned out to be the most important. Forcing the model to describe the image without task context first dramatically reduced confirmation bias in later stages: the model became far less likely to "see" whatever would confirm the user's question instead of what was actually present.
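One way to picture the task-blind first stage: the stage-one prompt never mentions the user's question, which only enters the conversation from stage two onward. Both prompt strings below are illustrative, not the product's actual wording.

```python
# Illustrative prompts only; the product's real prompts are not public.
PERCEPTION_PROMPT = (
    "Describe everything visible in this image: objects, text, layout, "
    "colours, and anything unusual. Do not interpret it or attempt any task."
)

def analysis_prompt(user_question: str, perception: str) -> str:
    # The user's question is withheld until after the neutral description,
    # which is what reduced confirmation bias in later stages.
    return (
        f"Neutral description of the image:\n{perception}\n\n"
        f"Now address the user's question: {user_question}"
    )
```

Keeping the question out of stage one means the description cannot be steered toward an expected answer.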
The Multimodal Reasoning Agent is available to all Pro and Business subscribers. Upload any image — medical scans, charts, architectural drawings, screenshots — and watch the 4-stage reasoning unfold in real time.