EgoBench: An Interactive Egocentric Multimodal Benchmark for Tool-Using Agents
EgoBench, introduced in arXiv paper 2605.27820, is presented as the first interactive multimodal benchmark designed for tool-using agents. It comprises 1,045 tasks grounded in egocentric video and spanning four daily scenarios (e.g., cooking, assembly). The benchmark provides a user-agent-tool interactive environment for evaluation, along with a multi-agent simulated user that generates task-aligned responses. A three-stage synergistic pipeline ensures that each task requires joint visual perception and tool-augmented multi-hop reasoning, and a deterministic joint validation framework enables objective evaluation of dynamic interactions. Together, these components address a gap in existing benchmarks, which fail to jointly evaluate multimodal perception, tool invocation, and user interaction. The benchmark is publicly available on arXiv.
EgoBench enables rigorous evaluation of AI agents that must perceive, reason, and interact in real-world tool-use scenarios.
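
To make the idea of deterministic joint validation more concrete, the following Python sketch shows one way such a check could work. It is not EgoBench's actual implementation: the summary does not specify what "joint" covers, so the sketch assumes it means checking both the agent's tool-call trace and the final environment state. The names `ToolCall`, `make_call`, and `validate`, as well as the cooking-style tool names, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """One tool invocation (hypothetical structure, not from the paper)."""
    name: str
    args: tuple  # sorted (key, value) pairs, so equal calls compare equal


def make_call(name, **kwargs):
    # Sorting the arguments makes the comparison independent of keyword order.
    return ToolCall(name, tuple(sorted(kwargs.items())))


def validate(expected_calls, expected_state, actual_calls, actual_state):
    """Assumed 'joint' check: pass only if every required tool call was made
    AND every expected key in the final state has the expected value."""
    calls_ok = set(expected_calls) <= set(actual_calls)
    state_ok = all(actual_state.get(k) == v for k, v in expected_state.items())
    return calls_ok and state_ok


# Illustrative task: the agent is expected to set a 10-minute timer.
expected_calls = [make_call("set_timer", minutes=10)]
expected_state = {"timer_minutes": 10}

# A trace with one extra lookup call; extra calls are tolerated in this sketch.
actual_calls = [
    make_call("lookup_recipe", dish="pasta"),
    make_call("set_timer", minutes=10),
]
actual_state = {"timer_minutes": 10, "recipe": "pasta"}

passed = validate(expected_calls, expected_state, actual_calls, actual_state)

# The same trace fails if the final state is wrong.
failed = validate(expected_calls, expected_state, actual_calls, {"timer_minutes": 5})
```

Because the check relies only on exact comparisons, with no model-based judge, the same trace always yields the same verdict. That is the property the word "deterministic" suggests, though the real framework may define and enforce it differently.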