Convex OAuth MCP server

Security-sensitive slice with OAuth semantics, streaming MCP transport, and admin surface.

Methodology

This evaluation tests whether a model can implement a full OAuth 2.1 MCP server from a dense specification. The task requires understanding security invariants, streaming transport, admin UI scaffolding, and test coverage. Scoring combines LLM judge assessment of code quality with human review of functional correctness.

Spec: Dense. A full vertical slice on Convex + Better Auth + MCP, covering discovery, scopes, admin UI, and tests.
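To make the "discovery" part of the spec concrete, here is a minimal sketch of an OAuth authorization server metadata document of the kind a client would fetch from /.well-known/oauth-authorization-server. The field names follow RFC 8414. The issuer URL, endpoint paths, and scope names are assumptions made for illustration, and the function name buildMetadata is hypothetical; none of them come from the evaluated spec.

```typescript
// Field names follow RFC 8414 (OAuth 2.0 Authorization Server Metadata).
// The endpoint paths and scope names below are hypothetical placeholders.
interface AuthServerMetadata {
  issuer: string;
  authorization_endpoint: string;
  token_endpoint: string;
  response_types_supported: string[];
  grant_types_supported: string[];
  code_challenge_methods_supported: string[];
  scopes_supported: string[];
}

function buildMetadata(issuer: string, scopes: string[]): AuthServerMetadata {
  // Clients compare the issuer value exactly, so reject a trailing slash
  // up front instead of silently normalizing it.
  if (issuer.endsWith("/")) {
    throw new Error("issuer must not end with a slash");
  }
  return {
    issuer,
    authorization_endpoint: `${issuer}/oauth/authorize`, // assumed path
    token_endpoint: `${issuer}/oauth/token`, // assumed path
    response_types_supported: ["code"],
    grant_types_supported: ["authorization_code", "refresh_token"],
    // PKCE with S256 is the challenge method OAuth 2.1 drafts expect.
    code_challenge_methods_supported: ["S256"],
    scopes_supported: [...scopes],
  };
}

const metadata = buildMetadata("https://auth.example.com", ["tools:read"]);
console.log(JSON.stringify(metadata, null, 2));
```

A handler would serve this object as JSON. Whether any evaluated submission structured discovery this way is not stated in the results.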

RESULTS BY MODEL

Opus 4.5 + GPT-5.2 High (Flow-Next)
GPT-5.2-codex medium (Codex CLI)
GPT-5.2 xhigh (Codex CLI)
GPT-5.2 medium (Codex CLI)
Opus 4.5 thinking (Claude Code)
GPT-5.1-codex-max medium (Codex CLI)
Gemini 3 Pro (Gemini CLI)

KEY TAKEAWAYS

GPT-5.2 medium narrowly beats GPT-5.2 xhigh (76 vs 75); both score well ahead of Claude (65) and Gemini (63).
Dense plans still leave scope edge cases uncovered, so human review of security invariants remains mandatory.
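As an illustration of the kind of scope edge case a reviewer might probe, the sketch below checks required scopes by exact token match rather than substring match. It relies only on the OAuth rule that a scope string is a space-delimited list of case-sensitive tokens (RFC 6749, section 3.3). The helper names and the scope names in the usage lines are hypothetical and are not taken from the evaluated spec or any submission.

```typescript
// Hypothetical helpers for illustration only.
// Parse a raw OAuth scope string into a set of exact, case-sensitive tokens.
function parseScopes(raw: string | null | undefined): Set<string> {
  if (!raw) return new Set();
  // Filter out empty strings produced by repeated spaces.
  return new Set(raw.split(" ").filter((token) => token.length > 0));
}

// True only if every required scope appears as a whole token in the grant.
// Note: an empty `required` list returns true, so callers should decide
// explicitly whether an endpoint with no required scopes is intended.
function hasAllScopes(
  granted: string | null | undefined,
  required: readonly string[],
): boolean {
  const grantedSet = parseScopes(granted);
  return required.every((scope) => grantedSet.has(scope));
}

// Edge cases that a naive `granted.includes(scope)` check gets wrong or
// leaves ambiguous:
console.log(hasAllScopes("tools:read", ["tools:read"])); // exact match
console.log(hasAllScopes("tools:read:admin", ["tools:read"])); // prefix only
console.log(hasAllScopes("Tools:Read", ["tools:read"])); // case differs
console.log(hasAllScopes(undefined, ["tools:read"])); // no scope claim
```

Only the first call is granted; the prefix, case-mismatch, and missing-claim cases are all rejected.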

Mickel.tech
