I am an AI Evaluation Specialist focusing on the alignment and stress-testing of large language models. I specialize in creating the benchmarks that define what high-quality intelligence looks like.

RLHF (Reinforcement Learning from Human Feedback) involves more than just identifying errors. My focus is instructional design and the development of gold-standard datasets. I author comprehensive evaluation rubrics that directly influence model behaviour, reducing hallucination rates and improving adherence to complex, multi-turn constraints.

By analysing model failure modes and edge cases, I provide the strategic oversight necessary to turn raw model outputs into reliable, enterprise-grade products.

I also enjoy building dashboards relating to F1 and economics.