gmickel bench — real client-grade evals
Living scoreboard for agentic coding + design tasks pulled from my actual work surfaces. Each result is the top score across three independent runs, mirroring how humans retry.
Benchmarks
MCP auth, ACL sharing, AI dashboard, tiny GPT, macOS utility, XLSX agent
Stacks covered
Full-stack web · macOS utility · systems programming
Signal mix
Mix of LLM judge + human evaluation/acceptance runs
SCOREBOARD // LLM + HUMAN
Mix of LLM judge plus human evaluation/acceptance, best-of-three per model
TOTALS // LLM + HUMAN
Sum + average across all benches (best-of-3 per model)
CATEGORIES // NORMALIZED
BENCHMARKS // INSIGHTS
Scores ≈ capability; notes = why it mattered
MCP auth: security-sensitive slice with OAuth semantics, streaming MCP transport, and admin surface.
- GPT-5.2 medium edges xhigh (76 vs 75); both dominate Claude (65), Gemini (63).
- Dense plans still leave scope edges uncovered—human review mandatory on security invariants.
ACL sharing: tests whether agents can extend ACL patterns without regressions when specs are intentionally light.
- GPT-5.2 medium and xhigh tie (78) with best ACL inference; Claude (65) solid; Gemini struggles (49).
- Owner checks, guest filters, activation hooks—common gaps even with light specs.
AI dashboard: separates layout craft from data wiring; measures taste + speed under strong aesthetic constraints.
- Claude + frontend-design plugin wins (86) with polished visuals; xhigh second (82) despite higher LLM score.
- GPT-5.1 (77) had better aesthetics than 5.2 despite lower functional scores—proves taste ≠ code quality.
- Without explicit design prompting, models converge on AI slop—navigation completeness still lags.
Design eval reweighted 11 Dec 2025
Originally scored LLM 0-90 + design 0-10 = 100 max. This underweighted aesthetics—a model could score 83/90 functionally but lose only 3 pts for mediocre design. Reweighted to 70/30 split: llm_weighted = (llm/90)*70, design_weighted = (design/10)*30. High design scores now properly amplified; functional-but-boring outputs penalized.
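As a worked illustration of the 70/30 formula above, here is a minimal Python sketch. The function name and the example score pairs are assumptions for illustration, not recorded bench results:

```python
def design_eval_total(llm: float, design: float) -> float:
    """Reweighted design-eval total: 70% functional (LLM judge out of 90),
    30% aesthetics (design out of 10)."""
    llm_weighted = (llm / 90) * 70
    design_weighted = (design / 10) * 30
    return llm_weighted + design_weighted

# Hypothetical score pairs (not actual results):
# old scheme, functional-but-boring output: 83 + 5 = 88 / 100
print(round(design_eval_total(83, 5), 1))  # 79.6 under the reweighting
# old scheme, strong design:                75 + 9 = 84 / 100
print(round(design_eval_total(75, 9), 1))  # 85.3 -> design now flips the ranking
```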
Tiny GPT: cross-language generalisation on Karpathy's minGPT/nanoGPT lineage, rebuilt in Zig without ML libs.
- GPT-5.2 xhigh (82) beats Gemini (81)—first OpenAI model to win Zig; medium (70) also solid.
- Claude (62) succeeded on 3rd attempt via self-correction; Codex 5.1 (36) still crashes on matmul.
Claude recovers via best-of-3
GPT-5.2 xhigh (82) beats Gemini 3.0 Pro (81) on the Zig eval—first OpenAI model to win Zig. Claude (62) initially crashed but succeeded on 3rd attempt through self-correction. Codex 5.1 (36) still crashes on matmul assertions.
macOS utility (SmartTrim): system-integration slice covering macOS menu bar UI, clipboard healing, global hotkey, and launch-at-login.
- GPT-5.1 wins (92), Claude/xhigh tied (88), medium (85); Gemini lags (67) with ghost indentation.
- Swift 6 strict concurrency + SwiftUI tractable; heuristics still need human-tuned edges.
XLSX agent: full-stack agent integration covering a Python service, TypeScript tools, prompt engineering, and formula recalc.
- GPT-5.2 xhigh dominates (90); medium (76), Claude/5.1 tied (70); Gemini struggles (56).
- Formula recalculation common gap—models stub it rather than integrating LibreOffice.
- Gemini regressed existing tools (dropped refetchParagraphs), showing change hygiene risks.
Cross-stack complexity
This eval touches Python (FastAPI, openpyxl, pandas), TypeScript (Convex actions, agent tools), and prompt engineering. Even with detailed instructions on Excel manipulation, none of the models fully integrated formula recalculation via LibreOffice as specified. The pattern: models implement the happy path but skip the hard system integration.
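For context on the LibreOffice piece: openpyxl writes formula strings but not their computed results, so one plausible route is to round-trip the workbook through a headless LibreOffice conversion. A rough Python sketch follows; the function name is hypothetical, the soffice flags are standard LibreOffice CLI options, and it assumes soffice is on PATH with the profile set to recalculate formulas on load. This is not the eval's reference solution.

```python
import subprocess
from pathlib import Path

def recalculate_xlsx(path: Path, outdir: Path) -> Path:
    """Round-trip a workbook through headless LibreOffice so formula results
    are recomputed and saved as cached values. Assumes `soffice` is on PATH
    and LibreOffice is configured to recalculate on file load."""
    outdir.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["soffice", "--headless", "--convert-to", "xlsx",
         "--outdir", str(outdir), str(path)],
        check=True,
        timeout=120,
    )
    return outdir / path.name  # converted copy with refreshed formula results
```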
METHODOLOGY // HOW SCORES ARE CALCULATED
Real repositories, consistent prompts, rigorous scoring
Real-world source
Every eval starts from a tagged checkpoint in an actual repository. DocIQ Sphere provides the bulk of evals (MCP server, permissions, docshare). Side projects like SmartTrim and Zig experiments add language diversity.
Best-of-3 execution
Each model runs the identical prompt three times with consistent settings. All MCP servers, plugins, custom commands, and subagents are disabled (exception: Anthropic's frontend-design plugin for design evals). The highest score is recorded, mirroring how humans retry when the first attempt doesn't land.
Dual scoring
Scores combine GPT-5.2 Pro as LLM judge (code quality, structure, patterns) with human review (instruction following, functional correctness, does it actually work?).
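As a rough sketch of this pipeline (best-of-three plus a blended judge/human score), the snippet below shows the shape in Python. The record layout and the 50/50 blend are placeholder assumptions; outside the design eval, no exact combination formula is published here.

```python
from dataclasses import dataclass

@dataclass
class RunScore:
    llm_judge: float  # GPT-5.2 Pro rubric: code quality, structure, patterns
    human: float      # human review: instruction following, functional correctness

def combine(run: RunScore, llm_weight: float = 0.5) -> float:
    """Blend judge and human scores; the 50/50 split is an assumption."""
    return llm_weight * run.llm_judge + (1 - llm_weight) * run.human

def best_of_three(runs: list[RunScore]) -> float:
    """Record the highest combined score across three identical-prompt runs."""
    assert len(runs) == 3
    return max(combine(r) for r in runs)

# Made-up run scores for one model on one eval:
runs = [RunScore(70, 62), RunScore(78, 80), RunScore(74, 71)]
print(best_of_three(runs))  # 79.0 -> the second attempt is the one recorded
```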
Scoring dimensions
INSTRUCTION FOLLOWING
Did it do what was asked? All requirements met?
CODE QUALITY
Clean patterns, proper error handling, maintainable structure.
CHANGE HYGIENE
Minimal diffs, no regressions, respects existing patterns.
FUNCTIONAL CORRECTNESS
Does it compile, run, and pass acceptance criteria?
Evaluation sources
DocIQ Sphere
MCP server, permissions, docshare, XLSX agent tools
SmartTrim
Swift 6 / SwiftUI macOS system integration
Zig experiments
Low-level systems programming, ML from scratch
INSIGHTS // WHAT THE SCORES SAY
• Flow-Next leads (88.3 avg), +7% over GPT-5.2 xhigh (82.5); orchestration and cross-model review deliver consistent gains across all 6 evals.
• Biggest Flow-Next gains on open-ended tasks: Zig (+9 vs xhigh), SmartTrim (+8 vs 5.1), MCP (+6 vs xhigh).
• Brownfield strength: Flow-Next's docs analyzer and pattern matching excel on Convex evals (MCP +6, Permissions +5, XLSX +2 vs xhigh).
• Tradeoff: Flow-Next uses more tokens and takes longer—but extra compute buys quality. Review loops catch blind spots single-model runs miss.
• Single runs are noisy—models can fluke or get unlucky with context.
• Humans naturally retry; taking best score reflects real workflow.
• Three runs balance signal quality against compute cost.
Flow-swarm
Parallelized orchestration with token efficiency and cost optimization.
Legacy code port
Deep refactor/port of a service with sparse docs.
Microservice integration
Cross-service change touching contracts and ACLs.
Cross-model orchestration delivers +7% gains
Flow-Next combines Opus 4.5 for implementation with GPT-5.2 High for review loops. The dual-model approach catches blind spots that single-model runs miss—different models have different failure modes.
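The sketch below is not Flow-Next's code; it is a hypothetical illustration of the implement-then-review shape described above, with `implement` and `review` standing in for calls to the two models and `Review` as an assumed result type.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Review:
    passed: bool
    notes: str

def implement_review_loop(
    task: str,
    acceptance_criteria: list[str],
    implement: Callable[[str, Optional[str]], str],  # e.g. the implementer model producing a diff
    review: Callable[[str, list[str]], Review],      # e.g. the reviewer model gating that diff
    max_rounds: int = 3,
) -> str:
    """Hypothetical cross-model loop: each diff is gated against acceptance
    criteria by a second model, and reviewer notes feed the next attempt."""
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        diff = implement(task, feedback)
        verdict = review(diff, acceptance_criteria)
        if verdict.passed:
            return diff            # gate satisfied, accept the change
        feedback = verdict.notes   # blind spots become next round's input
    raise RuntimeError("acceptance criteria not met within the retry budget")
```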
WHY IT WORKS
Review loops + acceptance criteria + epic-review gating. Brownfield pattern matching via docs analyzer.
TRADEOFFS
Slower execution. Higher token usage. More expensive—but extra compute buys quality.
EVAL-BY-EVAL GAINS VS XHIGH
Note: This isn't apples-to-apples—Flow-Next includes skills, review loops, gating, and acceptance criteria that baseline harnesses lack. The point: "you don't need a complicated setup" isn't universally true. Orchestration can yield meaningful improvements.
High reasoning mode underperforms on implementation tasks
Claude Opus 4.5 with high reasoning mode scored 393.3 vs default's 436 (−42.7 pts). The pattern: overthinking with detailed specs leads to custom solutions instead of following prescribed patterns, incomplete integration despite good individual components, and over-engineered concurrency that introduces bugs.
WORSE
SmartTrim (−21), Sharing (−11), MCP (−8), XLSX (−6)
BETTER
Design (+6.6), Zig (+4)
DIMENSION BREAKDOWN
Functional correctness hit hardest—builds complex stuff that doesn't work. Instruction following second—deviates from explicit specs.
Takeaway: Reserve high reasoning for open-ended problems where default fails entirely, or for planning/reviewing. For implementation tasks with detailed specs, less thinking = more faithful execution.