
    Remote secretarial service dashboard

    Separates layout craft from data wiring; measures taste + speed under strong aesthetic constraints.

    Methodology

    This evaluation tests frontend design implementation from a detailed UX brief. The focus is on visual craft, layout fidelity, and component composition. Scoring is reweighted to a 70/30 split (LLM/design) to properly value aesthetic quality alongside functional implementation.

    Spec: High-spec UX brief; Next.js 16 App Router + Tailwind + shadcn; multi-page customer portal.

    RESULTS BY MODEL

    Model                       Harness       Score
    GPT-5.2-codex medium        Codex CLI     75
    GPT-5.2 xhigh               Codex CLI     82
    GPT-5.2 medium              Codex CLI     80
    Opus 4.5 thinking           Claude Code   86
    GPT-5.1-codex-max medium    Codex CLI     77
    Gemini 3 Pro                Gemini CLI    69

    KEY TAKEAWAYS

    • Claude + frontend-design plugin wins (86) with polished visuals; GPT-5.2 xhigh is second (82) despite a higher LLM score.
    • GPT-5.1-codex-max (77) had better aesthetics than the 5.2 variants despite lower functional scores, showing that taste and code quality are separate skills.
    • Without explicit design prompting, models converge on generic "AI slop" layouts; navigation completeness still lags.
    Note

    Design eval reweighted 11 Dec 2025

    Originally scored as LLM 0-90 + design 0-10 = 100 max. This underweighted aesthetics: a model could score 83/90 functionally but lose only 3 points for mediocre design. Reweighted to a 70/30 split: llm_weighted = (llm/90)*70, design_weighted = (design/10)*30. High design scores are now properly amplified; functional-but-boring outputs are penalized.
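
    The reweighting formula above can be sketched as follows (the function name is illustrative; only the 70/30 formula comes from the note):

    ```typescript
    // Reweighted eval score: the LLM (functional) score out of 90 contributes
    // up to 70 points, and the design score out of 10 contributes up to 30.
    function weightedScore(llm: number, design: number): number {
      const llmWeighted = (llm / 90) * 70;
      const designWeighted = (design / 10) * 30;
      return llmWeighted + designWeighted;
    }

    // The note's example: 83/90 functional with a mediocre 3/10 design.
    // Old scheme: 83 + 3 = 86/100. Reweighted: (83/90)*70 + (3/10)*30 ≈ 73.6.
    console.log(weightedScore(83, 3).toFixed(1)); // "73.6"
    ```

    Under the old additive scheme that model lost only 7 of 100 points for weak design; under the reweighted scheme it loses roughly 26, which is the intended amplification.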