Remote secretarial service dashboard
Separates layout craft from data wiring; measures taste + speed under strong aesthetic constraints.
This evaluation tests frontend design implementation from a detailed UX brief. The focus is on visual craft, layout fidelity, and component composition. Scoring was reweighted to a 70/30 LLM/design split to value aesthetic quality properly alongside functional implementation.
Spec: High-spec UX brief; Next.js 16 App Router + Tailwind + shadcn; multi-page customer portal.
RESULTS BY MODEL

| Model | Harness |
| --- | --- |
| GPT-5.2-codex medium | Codex CLI |
| GPT-5.2 xhigh | Codex CLI |
| GPT-5.2 medium | Codex CLI |
| Opus 4.5 thinking | Claude Code |
| GPT-5.1-codex-max medium | Codex CLI |
| Gemini 3 Pro | Gemini CLI |
KEY TAKEAWAYS
- Claude + frontend-design plugin wins (86) with polished visuals; GPT-5.2 xhigh takes second (82) despite a higher LLM score.
- GPT-5.1-codex-max (77) had better aesthetics than 5.2 despite lower functional scores, evidence that taste ≠ code quality.
- Without explicit design prompting, models converge on AI slop; navigation completeness still lags.
Design eval reweighted 11 Dec 2025
Originally scored as LLM 0-90 + design 0-10 = 100 max. This underweighted aesthetics: a model could score 83/90 functionally but lose only 3 points for mediocre design. Reweighted to a 70/30 split: llm_weighted = (llm/90)*70, design_weighted = (design/10)*30. High design scores are now properly amplified, and functional-but-boring outputs are penalized.
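The reweighting formulas above can be sketched as a small helper; the function name and rounding choice are illustrative, not part of the eval's actual tooling:

```python
def reweight(llm_raw: float, design_raw: float) -> float:
    """Reweight a 0-90 LLM score and a 0-10 design score to a 70/30 split.

    llm_weighted    = (llm_raw / 90) * 70
    design_weighted = (design_raw / 10) * 30
    """
    llm_weighted = (llm_raw / 90) * 70
    design_weighted = (design_raw / 10) * 30
    return round(llm_weighted + design_weighted, 1)

# Old scheme: 83/90 functional + 5/10 design = 88/100 total.
# Reweighted: (83/90)*70 + (5/10)*30 = 79.6, so mediocre design
# now costs far more than the 5 points it did before.
print(reweight(83, 5))   # 79.6
print(reweight(90, 10))  # 100.0 (perfect on both axes)
```

A strong design score (say 9/10 on a 75/90 functional run) now contributes 27 of 100 points rather than 9, which is what amplifies taste in the final ranking.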