EgoTV: Egocentric Task Verification Dataset
EgoTV is a benchmark and dataset created to systematically investigate vision-language models' compositional, causal, and temporal reasoning in egocentric settings. Introduced in the ICCV 2023 paper “EgoTV: Egocentric Task Verification from Natural Language Task Descriptions,” the benchmark aims to enable progress towards egocentric agents capable of reasoning about everyday tasks specified in natural language.
EgoTV's main objective is to verify, from an agent's egocentric video, whether the agent executed a task as specified by the task's natural language description.
EgoTV Dataset Details
EgoTV contains pairs of synthetic egocentric videos of an agent's task execution and associated natural language descriptions of multi-step tasks. These tasks involve multiple sub-task decompositions, state changes, object interactions, and sub-task ordering constraints. EgoTV additionally provides abstracted task descriptions that contain only partial details of how a task is accomplished. Each description-video pair in EgoTV is accompanied by a label indicating whether the task was performed as described (positive label) or not (negative label), so verification requires causal, temporal, and compositional reasoning over both the video and language modalities.
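As a minimal sketch, a single sample in such a dataset can be thought of as the triple below. The field names and file path are illustrative assumptions, not EgoTV's actual release schema.

```python
# Illustrative only: field names and path below are assumptions,
# not EgoTV's actual schema.
from dataclasses import dataclass

@dataclass
class EgoTVSample:
    video_path: str        # egocentric video of the agent's task execution
    task_description: str  # natural language description of the task
    label: int             # 1 if the video matches the description, else 0

sample = EgoTVSample(
    video_path="videos/heat_then_slice_apple_0001.mp4",  # hypothetical path
    task_description="heat the apple, then slice it",
    label=1,
)
```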
Vision-Language Task Tracking and Verification
The goal is to determine if an agent correctly executes a task described in natural language from its egocentric video. Tasks consist of multiple partially ordered sub-tasks (e.g., heat, clean, slice, cool, place, pick) with ordering constraints (e.g., and, then, before, after) instantiated on an object of interaction.
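The ordering check at the heart of this setup can be sketched as a partial-order test: every "before" constraint between sub-tasks must hold in the executed sequence. The representation below is an illustration of the paper's setup, not code from the EgoTV release.

```python
# Check that an executed sub-task sequence respects a set of
# pairwise ordering constraints (a must occur before b).
def ordering_satisfied(executed, before_constraints):
    position = {subtask: i for i, subtask in enumerate(executed)}
    return all(
        a in position and b in position and position[a] < position[b]
        for a, b in before_constraints
    )

# Task: "clean and slice the tomato, then heat it"
executed = ["clean", "slice", "heat"]
constraints = [("clean", "heat"), ("slice", "heat")]  # "and" leaves clean/slice unordered
print(ordering_satisfied(executed, constraints))  # True
```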
Metrics: Task difficulty is measured using Complexity (# sub-tasks in a task) and Ordering (# ordering constraints in a task). Model performance is measured using Accuracy and F1-Score.
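Since verification is a binary decision per description-video pair, the performance metrics reduce to standard binary classification scores. A minimal sketch with toy placeholder labels:

```python
# Accuracy and F1 for binary verification labels;
# y_true / y_pred are placeholder toy data.
from sklearn.metrics import accuracy_score, f1_score

y_true = [1, 0, 1, 1, 0, 1]  # ground truth: task performed as described?
y_pred = [1, 0, 0, 1, 0, 1]  # model's verification decisions

print(f"Accuracy: {accuracy_score(y_true, y_pred):.3f}")
print(f"F1-Score: {f1_score(y_true, y_pred):.3f}")
```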
Generalization Splits:
1. Novel Tasks (Unseen compositions of seen sub-tasks)
2. Novel Steps (Unseen affordances)
3. Novel Scenes (Unseen environments)
4. Abstraction (High-level task definitions)
Dataset statistics:
- 7,673 samples in total (train set: 5,363; test set: 2,310)
- Generalization split sizes: Novel Tasks: 540, Novel Steps: 350, Novel Scenes: 1,082, Abstraction: 3,387
- 168 hours of video, 82 tasks, 1,038 task-object combinations
- Average video length of 84 seconds
- 4.6 sub-tasks per task on average; each sub-task spans ~14 frames
- ~2.4 ways, on average, to verify a task from its natural language description
EgoTV tasks are specified in the Planning Domain Definition Language (PDDL). Viable plans that fulfill the goal conditions and respect the ordering constraints are generated with the Metric-FF planner, and the corresponding videos are recorded by executing those plans in the AI2-THOR simulator.
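To make the pipeline concrete, the fragment below sketches what a goal specification in this style could look like. The predicate names and goal encoding are illustrative assumptions for exposition, not the actual PDDL domain files released with EgoTV.

```python
# Illustrative goal in a PDDL-like style; predicate names and structure
# are assumptions for exposition, not EgoTV's released domain files.
goal = """
(:goal (and
    (is-sliced apple)              ; outcome of the "slice" sub-task
    (is-heated apple)              ; outcome of the "heat" sub-task
    (in-receptacle apple plate)))  ; outcome of the "place" sub-task
"""
# A planner such as Metric-FF searches for an action sequence reaching this
# goal state; executing that plan in AI2-THOR yields the paired video.
print(goal)
```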
The EgoTV dataset is licensed under CC-BY-NC and is open-sourced.




















