Research Blog
Hello! We are a sub-group of the Machine Learning and Statistical Data Analysis Lab (mslab) at The University of Tokyo, with a focus on model evaluation and scalable supervision.
CapBencher: A simple protocol that gives LLM benchmarks a built-in alarm for leakage and gaming
CapBencher caps the best-possible accuracy by injecting controlled randomness into logically correct answers, enabling public benchmarks that can detect leakage and leaderboard gaming.
February 2026