SYSTEM_LOG ENTRY
Been wanting to do this for a while.
Most AI benchmarks tell you if a model can solve a LeetCode puzzle. They don't tell you if it can ship a product.
I got tired of wondering which model would actually help me build real things. So I built gmickel-bench: a living scoreboard based on the real-world engineering tasks I actually ship.
Three domains. Three very different skill sets.
I ran three models through the gauntlet:
Claude Code: The Product Architect
Best overall. Crushes frontend design and auth flows. Struggles when the math gets low-level.
Gemini CLI: The Systems Hacker
The only model that survived the Zig torture test (build → train → sample) without crashing. Less polish on the product side.
OpenAI Codex: The Pattern Matcher
Good at refactoring. Good at isolated pages. Feels lazy when you need comprehensive implementations.
The full benchmark, scoring methodology, and detailed breakdown are live at mickel.tech/gmickel-bench.