Abstract

Following step-by-step procedures is an essential part of many everyday activities. These procedures serve as a guiding framework that helps achieve goals efficiently, whether assembling furniture or preparing a recipe. However, the complexity and duration of procedural activities inherently increase the likelihood of making errors. Understanding such procedural activities from a sequence of frames is a challenging task that demands an accurate interpretation of visual information and an ability to reason about the structure of the activity. To this end, we collected a new egocentric 4D dataset, CaptainCook4D, comprising 384 recordings (94.5 hours) of people performing recipes in real kitchen environments. The dataset consists of two distinct activity types: one in which participants adhere to the provided recipe instructions and another in which they deviate and induce errors. We provide 5.3K step annotations and 10K fine-grained action annotations and benchmark the dataset on the following tasks: error recognition (supervised and zero-shot), multi-step localization, and procedure learning.

Normal & Error Steps

Technique Error: In the recipe butter corn cup, the first two video snippets exhibit the outcome of correctly following the instruction Mix the contents of the bowl well without any spillage, whereas the subsequent three snippets display the result of inducing errors by spilling corn from the bowl while mixing.

Measurement Error: In the recipe scrambled eggs, the first two video snippets exhibit the outcome of correctly following the instruction Peel 2 garlic cloves, whereas the subsequent three snippets display the result when a different number of garlic cloves (4, 1, and 1, respectively) is peeled instead of the intended 2 cloves.

Order Error: In the recipe spicy tuna avocado wraps, the first two video snippets exhibit the outcome of correctly following the instruction Top lettuce leaves with tuna mixture, whereas the subsequent three snippets display the result when an incorrect order is followed, with avocado added after topping the leaves with the mixture.

Preparation Error: In the recipe mug cake, the first two video snippets exhibit the outcome of correctly following the instruction Whisk batter, while the remaining snippets depict the same task performed with the wrong tool: a spoon, a tablespoon, and a hand.

Technique Error: In the recipe cucumber raita, the first two video snippets exhibit the outcome of correctly following the instruction Chop or grate the cucumber, while the next three snippets show the results when the cucumber is cut improperly, sliced vertically, and sliced horizontally, respectively.
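
To make the order error category concrete, the following is a minimal sketch of how a performed step sequence could be checked against precedence constraints of the kind a task graph encodes. The step names, the `PREREQUISITES` mapping, and the `find_order_violations` helper are all hypothetical and chosen for illustration; they are not the dataset's annotation format or the authors' method.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical precedence constraints for a fragment of a wrap recipe:
# each key may only be performed after every step in its value set.
PREREQUISITES: Dict[str, Set[str]] = {
    "add avocado to tuna mixture": {"prepare tuna mixture"},
    "top lettuce leaves with tuna mixture": {"add avocado to tuna mixture"},
}


def find_order_violations(
    performed: List[str], prerequisites: Dict[str, Set[str]]
) -> List[Tuple[str, str]]:
    """Return (step, unmet_prerequisite) pairs for steps performed too early."""
    seen: Set[str] = set()
    violations: List[Tuple[str, str]] = []
    for step in performed:
        # A prerequisite is unmet if it has not appeared earlier in the sequence.
        for required in sorted(prerequisites.get(step, set())):
            if required not in seen:
                violations.append((step, required))
        seen.add(step)
    return violations


# A recording that follows the intended order.
normal_recording = [
    "prepare tuna mixture",
    "add avocado to tuna mixture",
    "top lettuce leaves with tuna mixture",
]

# A recording in which avocado is added after topping the leaves.
error_recording = [
    "prepare tuna mixture",
    "top lettuce leaves with tuna mixture",
    "add avocado to tuna mixture",
]

normal_violations = find_order_violations(normal_recording, PREREQUISITES)
error_violations = find_order_violations(error_recording, PREREQUISITES)
```

Under these assumed constraints, the normal recording yields no violations, while the error recording flags the topping step as performed before the avocado was added. Other error categories, such as technique or measurement errors, depend on visual evidence and cannot be detected from step order alone.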

4D Snippets of Data

Task Graphs

Task graphs for all selected recipes

Annotation Overview

RecipeTimeline

Data Collection & Annotation Illustration


Data Statistics

The chart displays the error and normal recording statistics for each recipe. The x-axis lists all selected recipes, while the y-axis corresponds to the number of recordings and total duration of each type of recording.

NormalErrorRecordingStatistics

Distribution of error types and counts of different types of errors performed for each recipe.

ErrorRecipeCount
StepDurationStatistics
VideoDurationStatistics

A structured synopsis of different types of errors and their short descriptions compiled from the annotations.

ErrorCategories

Baselines

Error Recognition

SupervisedErrorRecognition
ZeroShotErrorRecognition

Multi-Step Localization

MultiStepLocalization