XLSX backend + agent tools

Full-stack agent integration: Python service, TypeScript tools, prompt engineering, formula recalc.

Methodology

This evaluation tests cross-stack agent integration. The model must implement Python FastAPI routes for Excel manipulation, wire Convex TypeScript tools, and write agent prompts. Formula recalculation via LibreOffice is specified but often stubbed.

Spec: Python FastAPI routes + Convex tool wiring + agent prompts + tests/evals for Excel manipulation.

RESULTS BY MODEL

Opus 4.5 + GPT-5.2 High (Flow-Next)
GPT-5.2-codex medium (Codex CLI)
GPT-5.2 xhigh (Codex CLI)
GPT-5.2 medium (Codex CLI)
Opus 4.5 thinking (Claude Code)
GPT-5.1-codex-max medium (Codex CLI)
Gemini 3 Pro (Gemini CLI)

KEY TAKEAWAYS

GPT-5.2 xhigh leads (90), followed by medium (76); Claude and GPT-5.1 tie at 70; Gemini trails (56).
Formula recalculation is a common gap: models stub it rather than integrating LibreOffice.
Gemini regressed existing tools (it dropped refetchParagraphs), which shows the change-hygiene risk.
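
One way a test or eval could detect a stubbed recalculation step is to look for formula cells that carry no cached value in the saved workbook. The sketch below is illustrative only and is not taken from the eval harness: the function names are hypothetical, it uses only the Python standard library, and it assumes the workbook uses the transitional SpreadsheetML namespace (Strict OOXML files use a different one).

```python
from __future__ import annotations

import io
import xml.etree.ElementTree as ET
import zipfile

# Transitional SpreadsheetML namespace (assumption: not a Strict OOXML file).
NS = {"m": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}


def uncached_formula_cells(sheet_xml: bytes) -> list[str]:
    """Return references of cells that have a formula (<f>) but no cached value (<v>)."""
    root = ET.fromstring(sheet_xml)
    missing = []
    for cell in root.iterfind(".//m:c", NS):
        has_formula = cell.find("m:f", NS) is not None
        value = cell.find("m:v", NS)
        has_value = value is not None and bool((value.text or "").strip())
        if has_formula and not has_value:
            missing.append(cell.get("r", "?"))
    return missing


def uncached_formulas_in_workbook(xlsx_bytes: bytes) -> dict[str, list[str]]:
    """Map each worksheet part to its formula cells that lack cached values."""
    report = {}
    with zipfile.ZipFile(io.BytesIO(xlsx_bytes)) as zf:
        for name in zf.namelist():
            if name.startswith("xl/worksheets/") and name.endswith(".xml"):
                cells = uncached_formula_cells(zf.read(name))
                if cells:
                    report[name] = cells
    return report
```

An empty report after the service writes a workbook suggests that something computed the formulas; a non-empty one suggests the recalculation step was skipped or stubbed.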

Note

Cross-stack complexity

This eval touches Python (FastAPI, openpyxl, pandas), TypeScript (Convex actions, agent tools), and prompt engineering. Even with detailed instructions on Excel manipulation, none of the models fully integrated formula recalculation via LibreOffice as specified. The pattern: models implement the happy path but skip the hard system integration.
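
For a sense of what the skipped integration involves, here is a minimal sketch of driving headless LibreOffice from Python. It is not the eval's reference implementation: the function names are hypothetical, and it assumes a `soffice` (or `libreoffice`) binary on the PATH. Whether formulas are actually recalculated when the file is loaded and re-saved depends on LibreOffice's recalculation-on-load settings, so treat that behavior as an assumption to verify.

```python
from __future__ import annotations

import shutil
import subprocess
from pathlib import Path
from typing import Optional


def build_recalc_command(xlsx_path, out_dir, soffice: str = "soffice") -> list[str]:
    """Build a headless LibreOffice command that re-saves the workbook as xlsx."""
    return [
        soffice,
        "--headless",
        "--convert-to", "xlsx",
        "--outdir", str(out_dir),
        str(xlsx_path),
    ]


def recalc_with_libreoffice(xlsx_path, out_dir, timeout_s: float = 60.0) -> Optional[Path]:
    """Re-save the workbook through LibreOffice; return the new path, or None on failure.

    out_dir should differ from the source directory so the input is not overwritten.
    """
    soffice = shutil.which("soffice") or shutil.which("libreoffice")
    if soffice is None:
        # No LibreOffice available: report it rather than silently stubbing.
        return None
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        build_recalc_command(xlsx_path, out_dir, soffice),
        check=True,
        timeout=timeout_s,
        capture_output=True,
    )
    result = out_dir / Path(xlsx_path).name
    return result if result.exists() else None
```

The hard part the models tended to skip is everything around this call: installing LibreOffice in the service image, handling timeouts and concurrent invocations, and surfacing failures to the agent tool instead of returning stale values.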