- Agent Engineering
- AI Agents
- ai
- benchmarks
- evals
- claude
- gemini
- codex
Announcing gmickel-bench: Real-World Evals
Most AI benchmarks tell you if a model can solve a LeetCode puzzle. They don't tell you if it can ship a product.