10:01 | AI Coding - Building an LLM Benchmark, Part 1: Foundation | Sergey Eroshenkov | 152 views
4:57 | AI Coding - Building an LLM Benchmark, Part 3: First Real Runs | Sergey Eroshenkov | 119 views
31:24 | Introducing Terminal-Bench: Evaluating LLM Agents in Realistic Terminal Settings | Ray Summit 2025 | Anyscale | 1.9K views
46:52 | Inside ParseBench How to Evaluate Document Parsing for AI Agents | LlamaIndex | 31 views
1:09 | Introducing ParseBench: The First Document Parsing Benchmark for AI Agents | LlamaIndex | 1.3K views
4:44 | The OpenHands Index: Benchmarking LLMs as Software Engineering Agents | OpenHands | 749 views
20:26 | ProgramBench: Can Language Models Rebuild Programs From Scratch? (May 2026) | AI Paper Slop | 160 views
7:05 | DeepSWE: The Coding Benchmark That Tests Long-Horizon Agents | Fluid Coding & AI | 19 views