XLSX backend + agent tools
Full-stack agent integration: Python service, TypeScript tools, prompt engineering, formula recalc.
This evaluation tests cross-stack agent integration. The model must implement Python FastAPI routes for Excel manipulation, wire Convex TypeScript tools, and write agent prompts. Formula recalculation via LibreOffice is specified but often stubbed.
Spec: Python FastAPI routes + Convex tool wiring + agent prompts + tests/evals for Excel manipulation.
RESULTS BY MODEL
- GPT-5.2-codex medium (Codex CLI)
- GPT-5.2 xhigh (Codex CLI)
- GPT-5.2 medium (Codex CLI)
- Opus 4.5 thinking (Claude Code)
- GPT-5.1-codex-max medium (Codex CLI)
- Gemini 3 Pro (Gemini CLI)
KEY TAKEAWAYS
- GPT-5.2 xhigh leads (90), followed by medium (76); Opus 4.5 thinking and GPT-5.1-codex-max tie (70); Gemini 3 Pro trails (56).
- Formula recalculation is a common gap: models stub it rather than integrating LibreOffice.
- Gemini regressed existing tools (it dropped refetchParagraphs), which shows the change-hygiene risk of editing a shared tool set.
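The regression in the last takeaway is the kind of thing a small guard can catch. Below is a minimal sketch, assuming the tool names registered before and after a change can be collected as sets of strings. Only refetchParagraphs comes from this eval; `dropped_tools` and the other tool names are hypothetical.

```python
def dropped_tools(before: set[str], after: set[str]) -> set[str]:
    """Return tool names that existed before a change but are missing after it."""
    return before - after


# Hypothetical tool registries. Only "refetchParagraphs" is named in the eval;
# "readSheet" and "writeCells" are made-up names for illustration.
baseline = {"refetchParagraphs", "readSheet"}
candidate = {"readSheet", "writeCells"}

missing = dropped_tools(baseline, candidate)
if missing:
    print("Tools removed by this change:", sorted(missing))
```

Run as a pre-merge check, a non-empty result would flag the change for review before an existing tool silently disappears.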
Cross-stack complexity
This eval touches Python (FastAPI, openpyxl, pandas), TypeScript (Convex actions, agent tools), and prompt engineering. Even with detailed instructions on Excel manipulation, none of the models fully integrated formula recalculation via LibreOffice as specified. The pattern: models implement the happy path but skip the hard system integration.
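For reference, the skipped integration can be fairly small. The sketch below is one possible approach, not the eval's reference solution: it re-saves a workbook through headless LibreOffice using the `soffice --headless --convert-to` command line. The function names, timeout, and directory layout are assumptions. Whether formulas are actually recomputed on load depends on LibreOffice's recalculation settings and on whether the file carries cached values, so that behaviour should be verified in the target environment.

```python
import shutil
import subprocess
from pathlib import Path


def build_recalc_command(src: Path, out_dir: Path, soffice: str = "soffice") -> list[str]:
    """Build a headless LibreOffice command that re-saves src as .xlsx in out_dir."""
    return [
        soffice,
        "--headless",
        "--convert-to", "xlsx",
        "--outdir", str(out_dir),
        str(src),
    ]


def recalc_workbook(src: Path, out_dir: Path, timeout_s: int = 60) -> Path:
    """Round-trip a workbook through LibreOffice and return the re-saved file.

    Assumes a LibreOffice binary is on PATH. out_dir should differ from
    src's directory so the original file is not overwritten.
    """
    soffice = shutil.which("soffice") or shutil.which("libreoffice")
    if soffice is None:
        raise RuntimeError("LibreOffice binary not found on PATH")

    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        build_recalc_command(src, out_dir, soffice),
        check=True,            # raise if LibreOffice exits with a non-zero status
        timeout=timeout_s,     # avoid hanging a request on a stuck conversion
        capture_output=True,
    )

    result = out_dir / (src.stem + ".xlsx")
    if not result.exists():
        raise RuntimeError(f"LibreOffice produced no output for {src}")
    return result
```

A service route could call `recalc_workbook` after writing formulas and then read the computed values from the returned file, for example with openpyxl's `load_workbook(path, data_only=True)`.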