    MCP
    Eval Details

    Convex OAuth MCP server

    Security-sensitive slice with OAuth semantics, streaming MCP transport, and admin surface.

    Methodology

    This evaluation tests whether a model can implement a full OAuth 2.1 MCP server from a dense specification. The task requires understanding security invariants, streaming transport, admin UI scaffolding, and test coverage. Scoring combines LLM judge assessment of code quality with human review of functional correctness.

    Spec: Dense. A full vertical slice on Convex + Better Auth + MCP, covering discovery, scopes, admin UI, and tests.

    RESULTS BY MODEL

    Model                       Tool          Score
    GPT-5.2-codex medium        Codex CLI     72
    GPT-5.2 xhigh               Codex CLI     75
    GPT-5.2 medium              Codex CLI     76
    Opus 4.5 thinking           Claude Code   65
    GPT-5.1-codex-max medium    Codex CLI     60
    Gemini 3 Pro                Gemini CLI    63

    KEY TAKEAWAYS

    • GPT-5.2 medium edges out xhigh (76 vs 75); both score well ahead of Opus 4.5 thinking (65) and Gemini 3 Pro (63).
    • Dense plans still leave scope edges uncovered, so human review of security invariants remains mandatory.
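
To make "scope edges" concrete, here is a minimal TypeScript sketch of the kind of check a reviewer might probe. It is not taken from any evaluated submission. The function names and the scope strings (`notes:read`, `notes:write`, `profile`) are hypothetical, since the page does not list the spec's actual scopes. It assumes only that scopes arrive as a space-delimited, case-sensitive string, as OAuth 2.0 defines the `scope` parameter.

```typescript
// Hypothetical helpers for illustration; names and scope values are assumptions.

/** Parse a space-delimited OAuth scope string into a set of scope tokens. */
function parseScopes(raw: string | undefined): Set<string> {
  if (!raw) return new Set<string>();
  // Filter out empty tokens produced by repeated or leading/trailing spaces.
  return new Set(raw.split(" ").filter((token) => token.length > 0));
}

/**
 * Return true only if every required scope was granted.
 * Matching is exact and case-sensitive: no prefix or wildcard matching.
 */
function hasRequiredScopes(
  granted: Set<string>,
  required: readonly string[],
): boolean {
  return required.every((scope) => granted.has(scope));
}

// Edge cases that are easy to miss:
const granted = parseScopes("notes:read  profile"); // note the double space

hasRequiredScopes(granted, ["notes:read"]);        // true: exact match
hasRequiredScopes(granted, ["notes:write"]);       // false: not granted
hasRequiredScopes(granted, ["notes:rea"]);         // false: prefix is not a match
hasRequiredScopes(granted, ["Notes:Read"]);        // false: case-sensitive
hasRequiredScopes(parseScopes(undefined), ["notes:read"]); // false: no scopes at all
```

One design point worth checking by hand: with an empty `required` list this sketch returns true, so a handler that forgets to declare its required scopes is silently open. That is the sort of gap the takeaway above flags for human review.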