Everyone talks about multimodal AI. Few people have systematically tested its actual capabilities and limits. We spent two weeks uploading hundreds of images to GPT-4o through WriterPilots' 4-stage reasoning pipeline and documenting exactly what it could and couldn't see. Here's what we found.

The Test Methodology

We tested across 8 image categories: medical scans, architectural drawings, data visualisations, handwritten text, UI screenshots, photographs, scientific diagrams, and financial charts. For each category, we defined specific tasks and evaluated outputs on accuracy, depth, and confidence calibration.
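
For context, the basic harness behind these runs is easy to reproduce. The sketch below is illustrative: the category prompts are abbreviated stand-ins for our actual task definitions, and a direct GPT-4o call via the standard OpenAI Python SDK stands in for the WriterPilots pipeline.

```python
import base64
from pathlib import Path
from openai import OpenAI  # standard OpenAI Python SDK

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative task prompts for a few of the eight categories --
# not the exact wording we used.
CATEGORY_TASKS = {
    "charts": "Extract every data series from this chart as a table.",
    "ui_screenshots": "List every interactive element and its current state.",
    "handwriting": "Transcribe all handwritten text verbatim.",
    "photographs": "Describe the subjects, setting, and spatial layout.",
}

def encode_image(path: str) -> str:
    """Base64-encode an image so it can be sent inline as a data URL."""
    return base64.b64encode(Path(path).read_bytes()).decode("utf-8")

def run_task(image_path: str, category: str) -> str:
    """Send one image plus its category task prompt to GPT-4o."""
    b64 = encode_image(image_path)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": CATEGORY_TASKS[category]},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```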

What Multimodal AI Sees Extremely Well

1. Charts and Data Visualisations

This was the most impressive category. GPT-4o read bar chart values to within 2-3% of their true values, identified trend lines, extracted axis labels, and produced structured data tables from visual charts. It's genuinely better at reading a chart than many humans are.
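
If you want to reproduce the table-extraction behaviour, asking explicitly for JSON works well. A minimal sketch, again calling the OpenAI SDK directly rather than the pipeline; the file name and schema are illustrative choices:

```python
import base64
import json
from openai import OpenAI

client = OpenAI()

with open("revenue_by_quarter.png", "rb") as f:  # hypothetical chart image
    b64 = base64.b64encode(f.read()).decode("utf-8")

prompt = (
    "Read this chart and return JSON with keys 'title', 'x_axis', 'y_axis', "
    "and 'series', where each series is a list of {label, value} pairs. "
    "Estimate values as precisely as the chart allows."
)

response = client.chat.completions.create(
    model="gpt-4o",
    response_format={"type": "json_object"},  # ask for machine-readable output
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }],
)

table = json.loads(response.choices[0].message.content)
print(table["series"])
```

Requesting machine-readable output like this also makes the result straightforward to compare against the chart's source data.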

2. UI Screenshots

The model excels at UI analysis. It identifies interactive elements, notes their states (active/inactive, checked/unchecked), reads all visible text, and can accurately describe the information architecture of a complex screen. Particularly useful for UX analysis and accessibility audits.
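
For accessibility audits specifically, the same pattern can flag obvious problems automatically. A minimal sketch; the element schema and the missing-label check are illustrative additions, not part of the pipeline:

```python
import base64
import json
from openai import OpenAI

client = OpenAI()

with open("checkout_screen.png", "rb") as f:  # hypothetical UI screenshot
    b64 = base64.b64encode(f.read()).decode("utf-8")

prompt = (
    "Inventory every interactive element in this screenshot as JSON: "
    "{'elements': [{'role', 'label', 'state'}]}. Use an empty label if none "
    "is visible, and states like 'enabled', 'disabled', 'checked'."
)

response = client.chat.completions.create(
    model="gpt-4o",
    response_format={"type": "json_object"},
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }],
)

elements = json.loads(response.choices[0].message.content)["elements"]
# Flag controls with no visible label -- a quick first pass at an audit.
unlabelled = [e for e in elements if not e.get("label")]
print(f"{len(unlabelled)} interactive elements have no visible label")
```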

3. Photographic Scene Understanding

For photographs, GPT-4o consistently identifies subjects, estimates spatial relationships, infers context (indoor/outdoor, time of day, setting type), and detects mood. Its object detection is accurate for common items but degrades with unusual objects or heavy occlusion.

4. Document Text Extraction

For clearly photographed documents, text extraction accuracy is high — typically 95%+ for printed text in good lighting. Handwritten text is harder, with accuracy dropping to 70-80% for clear handwriting and significantly lower for cursive or poor lighting.
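
The accuracy figures above depend on how a transcription is scored against a reference. One simple, reproducible option is character-level similarity from Python's standard library; this is an illustrative metric rather than the definitive scoring method:

```python
from difflib import SequenceMatcher

def transcription_accuracy(model_text: str, reference_text: str) -> float:
    """Character-level similarity between model output and ground truth (0-1).

    Whitespace is normalised so line-wrapping differences are not penalised.
    """
    norm = lambda s: " ".join(s.split())
    return SequenceMatcher(None, norm(model_text), norm(reference_text)).ratio()

# Hypothetical example: a clean printed-document transcription.
reference = "Invoice 4417 is due on 30 June 2024 for a total of £1,250.00."
model_out = "Invoice 4417 is due on 30 June 2024 for a total of £1,250.00"
print(f"accuracy: {transcription_accuracy(model_out, reference):.2%}")
```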

Where It Struggles

1. Precise Measurements

When we showed architectural drawings with dimensions, the model could read labelled measurements but could not estimate unlabelled distances with useful accuracy. It knows that one object is "roughly twice as wide" as another, but not that it's 47cm vs 23cm.

2. Medical Interpretation

This was the expected weak spot, but the failure mode was interesting. GPT-4o describes what it sees in a medical image accurately (structures, densities, relative sizes) but appropriately declines to diagnose, recommending professional interpretation. The description layer is actually useful for non-clinical purposes (education, record organisation).

3. Counting Large Numbers of Similar Items

Ask it to count roughly 200 people in a stadium photograph and the count is typically off by 15-20%. It's better than a human eyeballing it, but not reliable for precise counts above ~20 similar items.

The 4-Stage Advantage

Running images through WriterPilots' 4-stage pipeline versus a single prompt produced measurably different results. The pipeline approach (a simplified code sketch follows this list):

  • Caught 40% more text elements in complex images (Stage 1's cold perception vs. task-biased single-shot)
  • Produced significantly more nuanced spatial relationship descriptions (Stage 2's dedicated analysis)
  • Generated more calibrated confidence scores (Stage 3's explicit reasoning about evidence quality)
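
To make the staging concrete, here is a simplified, illustrative reconstruction of a staged image pipeline. It is not WriterPilots' actual implementation: the stage prompts are our own wording, the final synthesis stage is an assumption about how the stages combine, and a plain OpenAI SDK call stands in for the product.

```python
import base64
from pathlib import Path
from openai import OpenAI

client = OpenAI()

def ask(prompt: str, image_b64: str | None = None, context: str = "") -> str:
    """One GPT-4o call; the image is attached only when a stage needs it."""
    content = [{"type": "text", "text": (context + "\n\n" + prompt).strip()}]
    if image_b64:
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{image_b64}"}})
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
    )
    return resp.choices[0].message.content

def staged_analysis(image_path: str, task: str) -> str:
    b64 = base64.b64encode(Path(image_path).read_bytes()).decode("utf-8")

    # Stage 1: "cold" perception -- inventory the image with no task bias.
    perception = ask("List everything visible in this image: text, objects, "
                     "people, symbols. Do not interpret, just enumerate.", b64)

    # Stage 2: dedicated spatial/relational analysis of what was found.
    analysis = ask("Describe the spatial and logical relationships between "
                   "the elements listed below.", b64, context=perception)

    # Stage 3: explicit reasoning about evidence quality and confidence.
    confidence = ask("For each claim below, rate your confidence (high/medium/"
                     "low) and say what visual evidence supports it.",
                     context=perception + "\n\n" + analysis)

    # Stage 4: synthesis -- answer the original task using the prior stages.
    return ask(f"Using the notes below, answer this task: {task}", b64,
               context=perception + "\n\n" + analysis + "\n\n" + confidence)
```

The property that matters here is that Stage 1 runs before the model sees the task, which is what avoids the task bias mentioned above.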

Test It Yourself

The Multimodal Reasoning Agent is available to Pro subscribers. Upload any image and see the 4-stage analysis unfold in real time. The first run is free for all registered users.