Locating Moving Objects Using Stereo Sound Instead of Visual Input

Object localization involves predicting the location of a moving object within a scene. Researchers have tended to rely on visual data as input, which, together with some understanding of physics, generally enables a machine to perform the task. This camera-based approach, however, can be compromised by low light, fog, occlusions and similar conditions.

In a bid to improve object localization in such less-than-ideal circumstances, an MIT and IBM research group has proposed a cross-modal auditory localization framework that can effectively locate objects using stereo sound.

Although vision is humans’ go-to sense for understanding environments, we instinctively draw on additional senses when vision is insufficient. Auditory cues can play a huge role, for example, in localizing an approaching ambulance on a busy street or a meowing cat in a dark room. Sound localization and cross-modal learning are research directions that aim to augment machines’ abilities in this regard.

Sound localization uses microphone arrays and beam-forming to analyze the delays with which a sound reaches differently positioned microphones, and from those delays estimates the location of the object emitting the sound. Because audio-visual data contains a wealth of resources for knowledge transfer between different modalities, cross-modal learning is also a growing research area.
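The delay analysis described above can be illustrated with a small sketch. This is not beam-forming and not the method from the paper. It is a simpler two-microphone estimate that cross-correlates the channels to find the arrival delay, then converts that delay to a bearing under a far-field assumption. The function names, the 343 m/s speed of sound, the 16 kHz sample rate and the 0.2 m microphone spacing are all assumptions chosen for illustration.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed approximate value for air at room temperature


def estimate_delay_samples(left, right):
    """Lag in samples by which `right` trails `left` (positive: right hears it later)."""
    # Full cross-correlation; index k corresponds to lag k - (len(left) - 1).
    corr = np.correlate(right, left, mode="full")
    lags = np.arange(-(len(left) - 1), len(right))
    return int(lags[np.argmax(corr)])


def delay_to_angle(delay_samples, sample_rate, mic_spacing):
    """Far-field bearing in degrees from broadside; positive means nearer the left mic."""
    tau = delay_samples / sample_rate
    # Clip so noise-induced delays beyond the physical limit do not break arcsin.
    ratio = np.clip(SPEED_OF_SOUND * tau / mic_spacing, -1.0, 1.0)
    return float(np.degrees(np.arcsin(ratio)))


# Synthetic check: white noise that reaches the right microphone 5 samples late.
rng = np.random.default_rng(0)
source = rng.standard_normal(2048)
true_delay = 5
left = source
right = np.concatenate([np.zeros(true_delay), source[:-true_delay]])

lag = estimate_delay_samples(left, right)
angle = delay_to_angle(lag, sample_rate=16000, mic_spacing=0.2)
print(lag, angle)
```

Real recordings contain reverberation and background noise, so practical systems typically use more microphones and more robust correlation weighting than this sketch does.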

The MIT and IBM paper proposes a framework comprising a “teacher” vision network and a “student” stereo sound network. During training, the student network learns to mimic the teacher network’s outputs, which transfers object detection knowledge across modalities. The vision network detects an object in a video and marks it with a bounding box, then the stereo sound network learns to map audio signals to the predicted bounding box coordinates. At inference time, the student network directly predicts an object’s location using sound, without any visual inputs.
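The teacher-student idea can be sketched in a few lines. The example below is a toy stand-in, not the paper's architecture: the student here is a plain linear model trained by gradient descent, the "audio features" are random numbers, and the "teacher boxes" are synthetic. The names `train_student` and `predict_boxes` are hypothetical. The only point it illustrates is that the student is fitted to the teacher's bounding boxes rather than to human labels, and that prediction afterwards needs audio features alone.

```python
import numpy as np


def train_student(audio_features, teacher_boxes, lr=0.1, epochs=500):
    """Fit a linear student so audio_features @ W + b approximates the teacher's boxes."""
    n, d = audio_features.shape
    k = teacher_boxes.shape[1]
    W = np.zeros((d, k))
    b = np.zeros(k)
    for _ in range(epochs):
        pred = audio_features @ W + b
        err = pred - teacher_boxes  # student output minus teacher pseudo-label
        # Gradient step on half the mean squared error.
        W -= lr * audio_features.T @ err / n
        b -= lr * err.mean(axis=0)
    return W, b


def predict_boxes(audio_features, W, b):
    """Inference uses audio features only; no visual input is involved."""
    return audio_features @ W + b


# Synthetic stand-ins (assumptions): 200 clips, 8 audio features, 4 box coordinates.
rng = np.random.default_rng(0)
features = rng.standard_normal((200, 8))
true_W = rng.standard_normal((8, 4))
teacher_boxes = features @ true_W + np.array([0.5, 0.5, 0.2, 0.2])

W, b = train_student(features, teacher_boxes)
student_boxes = predict_boxes(features, W, b)
print(np.abs(student_boxes - teacher_boxes).max())
```

Because the targets come from the vision network rather than manual annotation, no hand-labeled boxes are needed, which is the sense in which the paper's title calls the approach self-supervised.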

*Network structure and training and testing procedure for cross-modal auditory localization*

*Average Precision (AP) and Center Distances (CD) results for cross-modal auditory localization*

*Cross-modal auditory localization improves object localization under poor lighting conditions*

The researchers’ proposed algorithm outperformed all audio-only baselines in experiments on more than 3,000 video clips. Particularly under poor lighting conditions, such as nighttime scenarios where traditional visual tracking systems struggle, cross-modal auditory localization appears to have the potential to make significant contributions to object localization techniques and visual tracking systems.

The paper Self-supervised Moving Vehicle Tracking with Stereo Sound is on arXiv.

Author: Hecate He | Editor: Michael Sarazen
