Remote secretarial service dashboard
Separates layout craft from data wiring; measures taste + speed under strong aesthetic constraints.
This evaluation tests frontend design implementation from a detailed UX brief. The focus is on visual craft, layout fidelity, and component composition. Scoring was reweighted to a 70/30 LLM/design split to value aesthetic quality properly alongside functional implementation.
Spec: High-spec UX brief; Next.js 16 App Router + Tailwind + shadcn; multi-page customer portal.
RESULTS BY MODEL

| Model | Harness |
| --- | --- |
| GPT-5.2-codex medium | Codex CLI |
| GPT-5.2 xhigh | Codex CLI |
| GPT-5.2 medium | Codex CLI |
| Opus 4.5 thinking | Claude Code |
| GPT-5.1-codex-max medium | Codex CLI |
| Gemini 3 Pro | Gemini CLI |
KEY TAKEAWAYS
- Claude + frontend-design plugin wins (86) with polished visuals; GPT-5.2 xhigh takes second (82) despite a higher LLM score.
- GPT-5.1-codex-max (77) had better aesthetics than 5.2 despite lower functional scores, evidence that taste ≠ code quality.
- Without explicit design prompting, models converge on AI slop; navigation completeness still lags.
Design eval reweighted 11 Dec 2025
Originally scored as LLM 0-90 + design 0-10 = 100 max. This underweighted aesthetics: a model could score 83/90 functionally but lose only 3 points for mediocre design. Reweighted to a 70/30 split: llm_weighted = (llm/90)*70, design_weighted = (design/10)*30. High design scores are now properly amplified, and functional-but-boring outputs are penalized.
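The reweighting formulas above can be sketched as a small helper; the function name and rounding choice are illustrative, not part of the eval's actual tooling:

```python
def reweight(llm_raw: float, design_raw: float) -> float:
    """Reweight a 0-90 LLM score and a 0-10 design score to a 70/30 split.

    llm_weighted    = (llm_raw / 90) * 70
    design_weighted = (design_raw / 10) * 30
    """
    llm_weighted = (llm_raw / 90) * 70
    design_weighted = (design_raw / 10) * 30
    return round(llm_weighted + design_weighted, 1)

# Old scheme: 83/90 functional + 5/10 design = 88/100 total.
# Reweighted: (83/90)*70 + (5/10)*30 = 79.6, so mediocre design
# now costs far more than the 5 points it did before.
print(reweight(83, 5))   # 79.6
print(reweight(90, 10))  # 100.0 (perfect on both axes)
```

A strong design score (say 9/10 on a 75/90 functional run) now contributes 27 of 100 points rather than 9, which is what amplifies taste in the final ranking.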