gmickel bench — real client-grade evals
Living scoreboard for agentic coding + design tasks pulled from my actual work surfaces. Each result is the top score across three independent runs, mirroring how humans retry.
Benchmarks
MCP auth, ACL sharing, AI dashboard, tiny GPT, macOS utility, XLSX agent
Stacks covered
Full-stack web · macOS utility · systems programming
Signal mix
Mix of LLM judge + human evaluation/acceptance runs
SCOREBOARD // LLM + HUMAN
Mix of LLM judge plus human evaluation/acceptance, best-of-three per model
TOTALS // LLM + HUMAN
Sum + average across all benches (best-of-3 per model)
CATEGORIES // NORMALIZED
BENCHMARKS // INSIGHTS
Scores ≈ capability; notes = why it mattered
MCP auth: security-sensitive slice with OAuth semantics, streaming MCP transport, and admin surface.
- GPT-5.2 medium edges xhigh (76 vs 75); both dominate Claude (65), Gemini (63).
- Dense plans still leave scope edges uncovered—human review mandatory on security invariants.
ACL sharing: tests whether agents can extend ACL patterns without regressions when specs are intentionally light.
- GPT-5.2 medium and xhigh tie (78) with best ACL inference; Claude (65) solid; Gemini struggles (49).
- Owner checks, guest filters, activation hooks—common gaps even with light specs.
AI dashboard: separates layout craft from data wiring; measures taste + speed under strong aesthetic constraints.
- Claude + frontend-design plugin wins (86) with polished visuals; xhigh second (82) despite higher LLM score.
- GPT-5.1 (77) had better aesthetics than 5.2 despite lower functional scores—proves taste ≠ code quality.
- Without explicit design prompting, models converge on AI slop—navigation completeness still lags.
Design eval reweighted 11 Dec 2025
Originally scored LLM 0-90 + design 0-10 = 100 max. This underweighted aesthetics—a model could score 83/90 functionally but lose only 3 pts for mediocre design. Reweighted to 70/30 split: llm_weighted = (llm/90)*70, design_weighted = (design/10)*30. High design scores now properly amplified; functional-but-boring outputs penalized.
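As a worked illustration of the 70/30 formula above, here is a minimal Python sketch. The function name and the example score pairs are assumptions for illustration, not recorded bench results:

```python
def design_eval_total(llm: float, design: float) -> float:
    """Reweighted design-eval total: 70% functional (LLM judge out of 90),
    30% aesthetics (design out of 10)."""
    llm_weighted = (llm / 90) * 70
    design_weighted = (design / 10) * 30
    return llm_weighted + design_weighted

# Hypothetical score pairs (not actual results):
# old scheme, functional-but-boring output: 83 + 5 = 88 / 100
print(round(design_eval_total(83, 5), 1))  # 79.6 under the reweighting
# old scheme, strong design:                75 + 9 = 84 / 100
print(round(design_eval_total(75, 9), 1))  # 85.3 -> design now flips the ranking
```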
Tiny GPT: cross-language generalisation on Karpathy's minGPT/nanoGPT lineage, rebuilt in Zig without ML libs.
- GPT-5.2 xhigh (82) beats Gemini (81)—first OpenAI model to win Zig; medium (70) also solid.
- Claude (62) succeeded on 3rd attempt via self-correction; Codex 5.1 (36) still crashes on matmul.
Claude recovers via best-of-3
GPT-5.2 xhigh (82) beats Gemini 3.0 Pro (81) on the Zig eval—first OpenAI model to win Zig. Claude (62) initially crashed but succeeded on 3rd attempt through self-correction. Codex 5.1 (36) still crashes on matmul assertions.
macOS utility (SmartTrim): system-integration slice covering macOS menu bar UI, clipboard healing, global hotkey, and launch-at-login.
- GPT-5.1 wins (92), Claude/xhigh tied (88), medium (85); Gemini lags (67) with ghost indentation.
- Swift 6 strict concurrency + SwiftUI tractable; heuristics still need human-tuned edges.
XLSX agent: full-stack agent integration covering a Python service, TypeScript tools, prompt engineering, and formula recalc.
- GPT-5.2 xhigh dominates (90); medium (76), Claude/5.1 tied (70); Gemini struggles (56).
- Formula recalculation common gap—models stub it rather than integrating LibreOffice.
- Gemini regressed existing tools (dropped refetchParagraphs), showing change hygiene risks.
Cross-stack complexity
This eval touches Python (FastAPI, openpyxl, pandas), TypeScript (Convex actions, agent tools), and prompt engineering. Even with detailed instructions on Excel manipulation, none of the models fully integrated formula recalculation via LibreOffice as specified. The pattern: models implement the happy path but skip the hard system integration.
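For context on the LibreOffice piece: openpyxl writes formula strings but not their computed results, so one plausible route is to round-trip the workbook through a headless LibreOffice conversion. A rough Python sketch follows; the function name is hypothetical, the soffice flags are standard LibreOffice CLI options, and it assumes soffice is on PATH with the profile set to recalculate formulas on load. This is not the eval's reference solution.

```python
import subprocess
from pathlib import Path

def recalculate_xlsx(path: Path, outdir: Path) -> Path:
    """Round-trip a workbook through headless LibreOffice so formula results
    are recomputed and saved as cached values. Assumes `soffice` is on PATH
    and LibreOffice is configured to recalculate on file load."""
    outdir.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["soffice", "--headless", "--convert-to", "xlsx",
         "--outdir", str(outdir), str(path)],
        check=True,
        timeout=120,
    )
    return outdir / path.name  # converted copy with refreshed formula results
```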
METHODOLOGY // HOW SCORES ARE CALCULATED
Real repositories, consistent prompts, rigorous scoring
Real-world source
Every eval starts from a tagged checkpoint in an actual repository. DocIQ Sphere provides the bulk of evals (MCP server, permissions, docshare). Side projects like SmartTrim and Zig experiments add language diversity.
Best-of-3 execution
Each model runs the identical prompt three times with consistent settings. All MCP servers, plugins, custom commands, and subagents are disabled (exception: Anthropic's frontend-design plugin for design evals). The highest score is recorded, mirroring how humans retry when the first attempt doesn't land.
Dual scoring
Scores combine GPT-5.2 Pro as LLM judge (code quality, structure, patterns) with human review (instruction following, functional correctness, does it actually work?).
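As a rough sketch of this pipeline (best-of-three plus a blended judge/human score), the snippet below shows the shape in Python. The record layout and the 50/50 blend are placeholder assumptions; outside the design eval, no exact combination formula is published here.

```python
from dataclasses import dataclass

@dataclass
class RunScore:
    llm_judge: float  # GPT-5.2 Pro rubric: code quality, structure, patterns
    human: float      # human review: instruction following, functional correctness

def combine(run: RunScore, llm_weight: float = 0.5) -> float:
    """Blend judge and human scores; the 50/50 split is an assumption."""
    return llm_weight * run.llm_judge + (1 - llm_weight) * run.human

def best_of_three(runs: list[RunScore]) -> float:
    """Record the highest combined score across three identical-prompt runs."""
    assert len(runs) == 3
    return max(combine(r) for r in runs)

# Made-up run scores for one model on one eval:
runs = [RunScore(70, 62), RunScore(78, 80), RunScore(74, 71)]
print(best_of_three(runs))  # 79.0 -> the second attempt is the one recorded
```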
Scoring dimensions
INSTRUCTION FOLLOWING
Did it do what was asked? All requirements met?
CODE QUALITY
Clean patterns, proper error handling, maintainable structure.
CHANGE HYGIENE
Minimal diffs, no regressions, respects existing patterns.
FUNCTIONAL CORRECTNESS
Does it compile, run, and pass acceptance criteria?
Evaluation sources
DocIQ Sphere
MCP server, permissions, docshare, XLSX agent tools
SmartTrim
Swift 6 / SwiftUI macOS system integration
Zig experiments
Low-level systems programming, ML from scratch
INSIGHTS // WHAT THE SCORES SAY
• Flow-Next leads (88.3 avg), +7% over GPT-5.2 xhigh (82.5); orchestration and cross-model review deliver consistent gains across all 6 evals.
• Biggest Flow-Next gains on open-ended tasks: Zig (+9 vs xhigh), SmartTrim (+8 vs 5.1), MCP (+6 vs xhigh).
• Brownfield strength: Flow-Next's docs analyzer and pattern matching excel on Convex evals (MCP +6, Permissions +5, XLSX +2 vs xhigh).
• Tradeoff: Flow-Next uses more tokens and takes longer—but extra compute buys quality. Review loops catch blind spots single-model runs miss.
• Single runs are noisy—models can fluke or get unlucky with context.
• Humans naturally retry; taking best score reflects real workflow.
• Three runs balance signal quality against compute cost.
Flow-swarm
Parallelized orchestration with token efficiency and cost optimization.
Legacy code port
Deep refactor/port of a service with sparse docs.
Microservice integration
Cross-service change touching contracts and ACLs.
Cross-model orchestration delivers +7% gains
Flow-Next combines Opus 4.5 for implementation with GPT-5.2 High for review loops. The dual-model approach catches blind spots that single-model runs miss—different models have different failure modes.
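The sketch below is not Flow-Next's code; it is a hypothetical illustration of the implement-then-review shape described above, with `implement` and `review` standing in for calls to the two models and `Review` as an assumed result type.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Review:
    passed: bool
    notes: str

def implement_review_loop(
    task: str,
    acceptance_criteria: list[str],
    implement: Callable[[str, Optional[str]], str],  # e.g. the implementer model producing a diff
    review: Callable[[str, list[str]], Review],      # e.g. the reviewer model gating that diff
    max_rounds: int = 3,
) -> str:
    """Hypothetical cross-model loop: each diff is gated against acceptance
    criteria by a second model, and reviewer notes feed the next attempt."""
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        diff = implement(task, feedback)
        verdict = review(diff, acceptance_criteria)
        if verdict.passed:
            return diff            # gate satisfied, accept the change
        feedback = verdict.notes   # blind spots become next round's input
    raise RuntimeError("acceptance criteria not met within the retry budget")
```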
WHY IT WORKS
Review loops + acceptance criteria + epic-review gating. Brownfield pattern matching via docs analyzer.
TRADEOFFS
Slower execution. Higher token usage. More expensive—but extra compute buys quality.
EVAL-BY-EVAL GAINS VS XHIGH
Note: This isn't apples-to-apples—Flow-Next includes skills, review loops, gating, and acceptance criteria that baseline harnesses lack. The point: "you don't need a complicated setup" isn't universally true. Orchestration can yield meaningful improvements.
High reasoning mode underperforms on implementation tasks
Claude Opus 4.5 with high reasoning mode scored 393.3 vs default's 436 (−42.7 pts). The pattern: overthinking with detailed specs leads to custom solutions instead of following prescribed patterns, incomplete integration despite good individual components, and over-engineered concurrency that introduces bugs.
WORSE
SmartTrim (−21), Sharing (−11), MCP (−8), XLSX (−6)
BETTER
Design (+6.6), Zig (+4)
DIMENSION BREAKDOWN
Functional correctness hit hardest—builds complex stuff that doesn't work. Instruction following second—deviates from explicit specs.
Takeaway: Reserve high reasoning for open-ended problems where default fails entirely, or for planning/reviewing. For implementation tasks with detailed specs, less thinking = more faithful execution.