Cooking shows have moved beyond unglamorous narrations like “bring three litres of water to the boil” or even “dice the kiwi.” These days, cooking is performance — dynamic, dramatic, and designed to impact not only the palate but also the other senses. Award-winning chef Emeril Lagasse sums it up with his trademark catchphrase: “BAM!”
What people do and say in their kitchens, cooking included, is the focus of the EPIC-KITCHENS dataset. Introduced in 2018, this collection of annotated first-person videos of individuals cooking and interacting with objects in their kitchens has enabled AI researchers to explore a variety of challenges in video understanding.
In a new paper, researchers from the University of Bristol, the University of Toronto and the University of Catania explain how they created EPIC-KITCHENS and introduce new baselines that emphasize the multimodal nature of the largest such egocentric video benchmark.
Unlike previous action classification benchmarks, whose videos tend to be short or recorded in scripted environments, EPIC-KITCHENS was created to capture unscripted, natural interactions from everyday scenarios, whether one grills chicken with the gusto of Lagasse or bakes cookies like a grandma.
The researchers note that the recordings also show the multitasking that home chefs naturally perform, like washing a few dishes during the cooking process. “Such parallel-goal interactions have not been captured in existing datasets, making this both a more realistic as well as a more challenging set of recordings.”
The researchers instructed 32 participants covering 10 nationalities and five languages to record their kitchen time for at least three consecutive days using a head-mounted GoPro camera.
The participants then watched their videos and recorded a live commentary on the actions they had performed, producing “coarse annotation” speech data. The researchers say recent attempts at annotating images by speech have produced speed-ups of up to 15x on ImageNet. They also believe the participants can describe the actions better than independent observers simply because they were the ones performing them.
Some issues emerged, such as synonyms in the free text that participants used in their annotations. Different people said “put”, “place”, “put down”, “put back”, “leave”, or “return” when describing similar object-placing actions. The researchers grouped such annotations into classes to minimize semantic overlap and to accommodate common approaches to multiclass detection and recognition, where each example is assumed to belong to one class only.
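The grouping step can be illustrated with a minimal Python sketch. It is not the researchers' actual pipeline: the class label "put" and the names VERB_CLASSES, build_lookup and verb_class are hypothetical, and only the six synonyms quoted above come from the article.

```python
# Illustrative only: collapse free-text verb synonyms into one class label.
# The label "put" and every name below are assumptions, not the paper's own.
VERB_CLASSES = {
    "put": ["put", "place", "put down", "put back", "leave", "return"],
}


def build_lookup(classes):
    """Invert {label: [synonyms]} into {synonym: label} for fast lookup."""
    lookup = {}
    for label, synonyms in classes.items():
        for synonym in synonyms:
            lookup[synonym.lower()] = label
    return lookup


def verb_class(phrase, lookup):
    """Return the class label for a narrated verb, or None if it is unmapped."""
    return lookup.get(phrase.strip().lower())


LOOKUP = build_lookup(VERB_CLASSES)
print(verb_class("Put back", LOOKUP))
print(verb_class("stir", LOOKUP))
```

Because every synonym maps to exactly one label, each annotated example ends up in a single class, which is the assumption that standard multiclass recognition methods rely on.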
The resulting dataset features 55 hours of video (11.5 M frames), and a total of 39.6K action segments with 454.2K labelled object bounding boxes.
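To put those totals in perspective, the short Python calculation below derives a few averages from the rounded figures quoted above. The variable names are illustrative, and the results are rough dataset-wide averages rather than numbers reported by the researchers.

```python
# Back-of-the-envelope averages from the rounded totals quoted in the article.
hours = 55            # total video duration
frames = 11.5e6       # total frames
segments = 39.6e3     # annotated action segments
boxes = 454.2e3       # labelled object bounding boxes

fps = frames / (hours * 3600)            # average frames per second
segments_per_hour = segments / hours     # average action segments per hour
boxes_per_segment = boxes / segments     # average boxes per action segment

print(f"{fps:.1f} frames per second")
print(f"{segments_per_hour:.0f} action segments per hour")
print(f"{boxes_per_segment:.1f} bounding boxes per action segment")
```

On these figures the footage averages about 58 frames per second and about 720 action segments per hour, or roughly one new action every five seconds.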
The EPIC-KITCHENS researchers chose three challenges for testing (object detection, action recognition, and action anticipation), which they say form the basis for a higher-level understanding of the participants’ actions and intentions.
The team evaluated several existing methods to demonstrate how challenging EPIC-KITCHENS is and to identify shortcomings in current state-of-the-art approaches. Object detection results using Faster R-CNN showed that objects in the EPIC-KITCHENS dataset are generally harder to detect than those in most other current datasets. The team also noted the importance of explicit temporal modelling in action recognition: models that built temporal modelling into their architecture showed improved accuracy, for example on verb classification.
The paper “The EPIC-KITCHENS Dataset: Collection, Challenges and Baselines” is on arXiv.
Journalist: Fangyu Cai | Editor: Michael Sarazen