2025 Year in Review for LLM Evaluation: When the Scorecard Broke
Randy Olson, PhD · 15 min read
In 2025, we discovered we'd been measuring memorization, not intelligence. Models scored 80-90% on static benchmarks but dropped to 60-70% on truly novel problems. This year exposed a fundamental crisis in AI evaluation and taught us what to build instead.