MCP
Eval Details
Convex OAuth MCP server
Security-sensitive slice with OAuth semantics, streaming MCP transport, and admin surface.
Methodology
This evaluation tests whether a model can implement a full OAuth 2.1 MCP server from a dense specification. The task covers security invariants, streaming transport, admin UI scaffolding, and test coverage. Scoring combines an LLM judge's assessment of code quality with human review of functional correctness.
Spec: Dense. A full vertical slice on Convex + Better Auth + MCP, covering discovery, scopes, admin UI, and tests.
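To make the "discovery" part of the spec concrete, here is a minimal sketch of the two metadata documents an OAuth-protected MCP server typically publishes. It is not taken from the eval spec or any submission. The field names follow RFC 9728 (protected resource metadata) and RFC 8414 (authorization server metadata); the URLs, endpoint paths, scope names, and function names are hypothetical placeholders.

```typescript
// Sketch only: URLs, endpoint paths, and scope names are hypothetical placeholders.

// Served at /.well-known/oauth-protected-resource (RFC 9728).
interface ProtectedResourceMetadata {
  resource: string;
  authorization_servers: string[];
  scopes_supported: string[];
  bearer_methods_supported: string[];
}

// Served at /.well-known/oauth-authorization-server (RFC 8414).
interface AuthorizationServerMetadata {
  issuer: string;
  authorization_endpoint: string;
  token_endpoint: string;
  scopes_supported: string[];
  response_types_supported: string[];
  grant_types_supported: string[];
  code_challenge_methods_supported: string[];
}

// Hypothetical scope names, used only for illustration.
const EXAMPLE_SCOPES: string[] = ["mcp:read", "mcp:write", "admin"];

function buildProtectedResourceMetadata(
  resourceUrl: string,
  issuer: string
): ProtectedResourceMetadata {
  return {
    resource: resourceUrl,
    // Tells clients which authorization server issues tokens for this resource.
    authorization_servers: [issuer],
    scopes_supported: EXAMPLE_SCOPES,
    bearer_methods_supported: ["header"],
  };
}

function buildAuthorizationServerMetadata(
  issuer: string
): AuthorizationServerMetadata {
  return {
    issuer,
    authorization_endpoint: `${issuer}/authorize`, // assumed path
    token_endpoint: `${issuer}/token`, // assumed path
    scopes_supported: EXAMPLE_SCOPES,
    // Authorization code flow only, with PKCE via S256.
    response_types_supported: ["code"],
    grant_types_supported: ["authorization_code", "refresh_token"],
    code_challenge_methods_supported: ["S256"],
  };
}
```

A reviewer could check that the `issuer` in the second document matches an entry in `authorization_servers` in the first, since clients rely on that link to find the token endpoint.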
RESULTS BY MODEL
- GPT-5.2-codex medium (Codex CLI)
- GPT-5.2 xhigh (Codex CLI)
- GPT-5.2 medium (Codex CLI)
- Opus 4.5 thinking (Claude Code)
- GPT-5.1-codex-max medium (Codex CLI)
- Gemini 3 Pro (Gemini CLI)
KEY TAKEAWAYS
- GPT-5.2 medium narrowly edges GPT-5.2 xhigh (76 vs. 75); both score well ahead of Claude Opus 4.5 thinking (65) and Gemini 3 Pro (63).
- Dense plans still leave scope edge cases uncovered, so human review of security invariants remains mandatory.
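As an illustration of the kind of scope edge case a human reviewer might probe, the sketch below checks a token's scopes by exact membership. It is not drawn from any evaluated submission. The function names and scope strings are hypothetical, and it assumes the token carries a space-delimited, case-sensitive `scope` string in the style of RFC 6749, section 3.3.

```typescript
// Sketch only: function names and scope strings are hypothetical.

// Split a space-delimited scope string into a set, ignoring empty tokens
// produced by leading, trailing, or repeated spaces.
function parseScopes(scopeClaim: string | undefined): Set<string> {
  if (!scopeClaim) {
    return new Set<string>();
  }
  return new Set<string>(
    scopeClaim.split(" ").filter((s) => s.length > 0)
  );
}

// True only if every required scope was granted, compared exactly.
// Edge cases a looser check can get wrong:
//   - prefix matches ("admin:read" must not satisfy "admin")
//   - case differences ("MCP:READ" must not satisfy "mcp:read")
//   - a missing or empty scope claim (must grant nothing)
// Note: an empty `required` list returns true, so callers that want to
// fail closed should always pass at least one required scope.
function hasRequiredScopes(
  scopeClaim: string | undefined,
  required: string[]
): boolean {
  const granted = parseScopes(scopeClaim);
  return required.every((scope) => granted.has(scope));
}
```

Exact set membership is chosen here over substring or prefix checks because the latter tend to be the source of privilege-escalation bugs; whether a given implementation should support hierarchical scopes is a separate design decision.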