These aren’t really indicative of real world performance. Retrieving a single fact is pretty much the simplest possible task for a long context model. Real world use cases require considering many facts at the same time while ignoring others, all the while avoiding the overall performance degradation that current models seem susceptible to when the context is sufficiently full.