gmickel bench — real client-grade evals
Living scoreboard for agentic coding + design tasks pulled from my actual work surfaces. Each result is the top score across three independent runs, mirroring how humans retry.
Benchmarks
MCP auth, ACL sharing, AI dashboard, tiny GPT, macOS utility, XLSX agent
Stacks covered
Full-stack web · macOS utility · systems programming
Signal mix
Mix of LLM judge + human evaluation/acceptance runs
SCOREBOARD // LLM + HUMAN
Mix of LLM judge plus human evaluation/acceptance, best-of-three per model
TOTALS // LLM + HUMAN
Sum + average across all benches (best-of-3 per model)
CATEGORIES // NORMALIZED
BENCHMARKS // INSIGHTS
Scores ≈ capability; notes = why it mattered
Security-sensitive slice with OAuth semantics, streaming MCP transport, and admin surface.
- Dense plans still leave scope edges (org metadata, pagination, export scopes) uncovered.
- Vertical slices over Convex stack are viable for LLMs but need human review on security invariants.
Tests whether agents can extend ACL patterns without regressions when specs are intentionally light.
- Inference from existing ACLs is error-prone: owner checks skipped, guest filters incomplete, activation hooks missing.
- UI dialogs often wired incorrectly without explicit triggers; best-of-3 helps catch flukes.
Separates layout craft from data wiring; measures taste + speed under strong aesthetic constraints.
- Design plugins boosted polish but navigation completeness still lags (missing routes).
- Codex produced fastest single-page polish; Claude+plugin best overall design score.
Cross-language generalisation on Karpathy's minGPT/nanoGPT lineage, rebuilt in Zig without ML libs.
- Runtime bonus matters: only Gemini passed build→train→sample; others crashed on backprop/matmul.
- Initialization hygiene and buffer sizing are frequent failure points even when shape math looks right.
Why Gemini scored higher
Gemini 3.0 Pro was the only model to achieve a working build→train→sample cycle. Its harness kept iterating on errors rather than stopping early—we didn't explicitly prompt it to do this. Other models produced comparable or better code quality but their runs terminated before resolving runtime issues. Points for getting everything to run, but instruction following... YMMV, depends on how autonomous you like these things to be.
System-integration slice: macOS menu bar UI, clipboard healing, global hotkey, launch-at-login.
- Claude and Codex pass the tests and deliver real clipboard healing; Gemini leaves minor ghost indentation.
- Strict concurrency + SwiftUI + system APIs are tractable; heuristics still need human-tuned edges.
Full-stack agent integration: Python service, TypeScript tools, prompt engineering, formula recalc.
- Codex and Claude tie on LLM score, but Claude wins on human review (more tests passing).
- Formula recalculation is a common gap: models stub it rather than integrating LibreOffice.
- Gemini regressed existing tools (dropped refetchParagraphs), showing change hygiene risks.
Cross-stack complexity
This eval touches Python (FastAPI, openpyxl, pandas), TypeScript (Convex actions, agent tools), and prompt engineering. Even with detailed instructions on Excel manipulation, none of the models fully integrated formula recalculation via LibreOffice as specified. The pattern: models implement the happy path but skip the hard system integration.
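For context on what that integration involves, here is a minimal sketch of one common approach: re-save the workbook through headless LibreOffice so formula results get computed (openpyxl writes formulas but not their cached values). This is illustrative only, not the bench's actual spec; it assumes `soffice` is on PATH, and recalculation-on-load behaviour can depend on LibreOffice settings.

```python
import subprocess
from pathlib import Path

def recalc_with_libreoffice(xlsx_path: Path, out_dir: Path) -> Path:
    """Re-save a workbook through headless LibreOffice so formula results
    are computed (openpyxl writes formulas without cached values).

    Assumes `soffice` is on PATH; out_dir should differ from the source
    directory so the conversion doesn't collide with the input file.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["soffice", "--headless", "--convert-to", "xlsx",
         "--outdir", str(out_dir), str(xlsx_path)],
        check=True,
    )
    return out_dir / xlsx_path.name
```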
METHODOLOGY // HOW SCORES ARE CALCULATED
Real repositories, consistent prompts, rigorous scoring
Real-world source
Every eval starts from a tagged checkpoint in an actual repository. DocIQ Sphere provides the bulk of evals (MCP server, permissions, docshare). Side projects like SmartTrim and Zig experiments add language diversity.
Best-of-3 execution
Each model runs the identical prompt three times with consistent settings. All MCP servers, plugins, custom commands, and subagents are disabled (exception: Anthropic's frontend-design plugin for design evals). The highest score is recorded, mirroring how humans retry when the first attempt doesn't land.
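As a concrete sketch of that loop (assuming a hypothetical run_eval callable that executes one attempt and returns its score; the real harness and its settings aren't shown here):

```python
from typing import Callable

def best_of_three(model: str, prompt: str,
                  run_eval: Callable[[str, str], float]) -> float:
    """Run the identical prompt three times and keep the highest score."""
    # Same prompt and settings on every attempt; only the best result is recorded.
    return max(run_eval(model, prompt) for _ in range(3))
```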
Dual scoring
Scores combine GPT-5.1 Pro as LLM judge (code quality, structure, patterns) with human review (instruction following, functional correctness, does it actually work?).
Scoring dimensions
INSTRUCTION FOLLOWING
Did it do what was asked? All requirements met?
CODE QUALITY
Clean patterns, proper error handling, maintainable structure.
CHANGE HYGIENE
Minimal diffs, no regressions, respects existing patterns.
FUNCTIONAL CORRECTNESS
Does it compile, run, and pass acceptance criteria?
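Putting the dual scoring and these four dimensions together, here is a hedged sketch of how a single run's score could be rolled up; the equal dimension weights and the 50/50 judge/human blend are illustrative assumptions, not the scoreboard's actual formula.

```python
from dataclasses import dataclass

# The four rubric dimensions listed above.
DIMENSIONS = (
    "instruction_following",
    "code_quality",
    "change_hygiene",
    "functional_correctness",
)

@dataclass
class RunScore:
    llm_judge: dict[str, float]  # per-dimension scores from the LLM judge
    human: dict[str, float]      # per-dimension scores from human review

    def combined(self) -> float:
        # Assumption for illustration: 50/50 judge/human blend, equal dimension weights.
        per_dim = [0.5 * self.llm_judge[d] + 0.5 * self.human[d] for d in DIMENSIONS]
        return sum(per_dim) / len(per_dim)
```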
Evaluation sources
DocIQ Sphere
MCP server, permissions, docshare, XLSX agent tools
SmartTrim
Swift 6 / SwiftUI macOS system integration
Zig experiments
Low-level systems programming, ML from scratch
INSIGHTS // WHAT THE SCORES SAY
• Claude leads overall with the strongest instruction following and change hygiene; Codex edges ahead on pure code quality.
• Gemini wins low-level systems (Zig), though see the notes on that test; it struggles with change hygiene on cross-stack work.
• Common gaps: models implement happy paths but skip hard system integration (formula recalc, LibreOffice). Manual verification stays mandatory.
• Single runs are noisy—models can fluke or get unlucky with context.
• Humans naturally retry; taking best score reflects real workflow.
• Three runs balance signal quality against compute cost.
Claude Code
v2.0.64
Codex CLI
v0.66.0
Gemini CLI
v0.19.4
Legacy code port
Deep refactor/port of a service with sparse docs.
Microservice integration
Cross-service change touching contracts and ACLs.
Next live client task
New surface queued; revealed when it ships.