Elon Musk tweeted last week that Tesla is recruiting AI and chip talent for the company’s neural network training supercomputer project “Dojo.” Musk boasted the “beast” will be able to process “truly vast amount of video data,” and that “the FSD [full self-driving] improvement will come as a quantum leap, because it’s a fundamental architectural rewrite, not an incremental tweak.” Musk added his own use case: “I drive the bleeding edge alpha build in my car personally. Almost at zero interventions between home & work.”
The Dojo talent recruitment drive reflects Musk’s determination to achieve full (L5) autonomy for his vehicles. The plan is to advance Tesla’s Autopilot capabilities by upgrading the system’s dimensional comprehension from its current level, which Musk has pegged at “about 2.5D,” to a 4D architecture.
Musk mused on self-driving dimensions and milestones during last month’s Q2 2020 Tesla Earnings Call:
“The actual major milestone that is happening right now is really transition of the autonomy systems of cars, like AI, if you will, from thinking about things like 2.5D, things like isolated pictures… and transitioning to 4D, which is videos essentially. That architectural change, which has been underway for some time… really matters for full self-driving. It’s hard to convey how much better a fully 4D system would work. This is fundamental, the car will seem to have a giant improvement. Probably not later than this year, it will be able to do traffic lights, stops, turns, everything.”
Why is this upgrade from 2.5D to 4D so critical for the next self-driving breakthrough? UC Berkeley researcher Dr. Fisher Yu spoke with Synced to provide some context.
Yu explains that when humans see objects, even with occluded views, we can naturally recognize their semantic categories and predict their underlying 3D structures. On the road, this would entail, for example, a driver understanding the geometry of other vehicles from the partial views provided by their own rear-view mirrors. Yu credits British neuroscientist and physiologist David Marr with initiating one of the most promising theories of vision in the 1970s and ’80s, when Marr asserted that recognition involves several intermediate representations and steps.
From 2D images, humans first infer the surface layout of objects, a 2.5D representation that adds specific visuospatial properties; this is then processed into a 3D volumetric representation with depth and volume perception of the object.
“In the field of autonomous driving, the process of 2.5D to 3D has already offered a lot of information,” Yu notes, “for example, when given the 2.5D representations and the speed of the vehicle, it is easy to predict when to brake to avoid collision with a car in front — 2.5D representations are sufficient here.” Yu says however that 3D representations are required to achieve more robust systems, as 3D information such as the dimensions of other cars can be leveraged for generating safer driving routes and can even be used to infer vehicle functionality, such as where and how doors can be opened.
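Yu’s braking example can be sketched numerically. The following is a hypothetical illustration, not Tesla’s actual logic: given a distance estimate to the car ahead (a 2.5D depth cue) and the closing speed, a simple time-to-collision check is enough to decide when to brake. The function names and the reaction-time and deceleration constants are assumptions for the sketch.

```python
def time_to_collision(distance_m: float, closing_speed_mps: float) -> float:
    """Seconds until collision if the closing speed stays constant.

    distance_m: estimated gap to the lead vehicle (a 2.5D depth cue).
    closing_speed_mps: rate at which the gap is shrinking, in m/s.
    Returns infinity when the gap is not closing.
    """
    if closing_speed_mps <= 0:
        return float("inf")
    return distance_m / closing_speed_mps


def should_brake(distance_m: float, closing_speed_mps: float,
                 reaction_time_s: float = 1.5,
                 max_decel_mps2: float = 6.0) -> bool:
    """Brake when time-to-collision drops below the time needed to react and stop."""
    stopping_time = reaction_time_s + closing_speed_mps / max_decel_mps2
    return time_to_collision(distance_m, closing_speed_mps) < stopping_time
```

The point of the sketch is that no 3D shape information is needed: a scalar depth estimate plus relative speed, exactly the 2.5D cues Yu describes, already determine the braking decision.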
Moving from 2.5D to 3D increases the capabilities of the self-driving systems beyond obtaining and processing information about surrounding obstacles and the speed of vehicles, etc. “It also enables the systems to think like humans and predict the intention of a certain object and potential interaction with it,” says Yu. “It is still challenging to predict 3D information accurately only based on video feeds. If we ask people to estimate the exact distance of a car, it is easy to say it is 20 meters away. But it is hard to imagine someone can confidently say ‘the car is 24.3 meters away’.”
“Introducing temporal information can bring out many very specific benefits for developing autonomous driving systems to be safer and more comfortable,” says Yu. “For instance, to predict the potential routes a car can take, it is critical to consider temporal information such as previous routes and speed through referencing past frames of the videos.” Obtaining the required quality temporal information is possible due to the massive amount of large-scale video data currently being collected by robots and intelligent vehicles.
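The kind of temporal reasoning Yu describes can be sketched with a minimal example: tracking an object’s position across past frames yields a velocity estimate, which can then be extrapolated forward. This is a simplified, hypothetical constant-velocity model (real systems use far richer motion models); all names here are illustrative.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) position in metres


def estimate_velocity(track: List[Point], dt: float) -> Point:
    """Average velocity over a track of past positions sampled every dt seconds."""
    if len(track) < 2:
        return (0.0, 0.0)
    (x0, y0), (xn, yn) = track[0], track[-1]
    elapsed = (len(track) - 1) * dt
    return ((xn - x0) / elapsed, (yn - y0) / elapsed)


def predict_position(track: List[Point], dt: float, horizon_s: float) -> Point:
    """Extrapolate the last observed position horizon_s seconds into the future,
    assuming the velocity estimated from past frames stays constant."""
    vx, vy = estimate_velocity(track, dt)
    x, y = track[-1]
    return (x + vx * horizon_s, y + vy * horizon_s)
```

A single frame gives no velocity at all; only the temporal sequence of frames makes the prediction possible, which is the benefit Yu points to.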
The recent Tesla patent Generating Ground Truth for Machine Learning from Time Series Elements provides further insights into what Musk envisions with the move toward 4D:
“As one example, a series of images for a time period, such as 30 seconds, is used to determine the actual path of a vehicle lane line over the time period the vehicle travels. The vehicle lane line is determined by using the most accurate images of the vehicle lane over the time period. Different portions (or locations) of the lane line may be identified from different image data of the time series. As the vehicle travels in a lane alongside a lane line, more accurate data is captured for different portions of the lane line. In some examples, occluded portions of the lane line are revealed as the vehicle travels, for example, along a hidden curve or over a crest of a hill. The most accurate portions of the lane line from each image of the time series may be used to identify a lane line over the entire group of image data. Image data of the lane line in the distance is typically less detailed than image data of the lane line near the vehicle. By capturing a time series of image data as a vehicle travels along a lane, accurate image data and corresponding odometry data for all portions of the corresponding lane line are collected.”
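The aggregation idea in the patent excerpt can be sketched as follows. This is a hypothetical reconstruction of the described technique, not Tesla’s implementation: for each section of a lane line, keep the observation from the frame in which the vehicle was closest to that section, since near-field image data is the most detailed. The data layout and function names are assumptions for illustration.

```python
from typing import Dict, List, Tuple

# One frame's observations: section_id -> (estimated_lateral_offset_m,
#                                          distance_from_vehicle_m)
Frame = Dict[int, Tuple[float, float]]


def fuse_lane_line(frames: List[Frame]) -> Dict[int, float]:
    """Combine per-frame lane-line estimates into a single ground-truth line.

    For each lane-line section, prefer the estimate from the frame where the
    vehicle was nearest to it, mirroring the patent's observation that image
    data of the lane line near the vehicle is more detailed than in the distance.
    """
    best: Dict[int, Tuple[float, float]] = {}
    for frame in frames:
        for section, (offset, dist) in frame.items():
            if section not in best or dist < best[section][1]:
                best[section] = (offset, dist)
    return {section: offset for section, (offset, _) in best.items()}
```

As the vehicle advances, later frames contribute close-range observations of sections that earlier frames saw only at a distance (or not at all, if occluded), so the fused line is built from the most accurate portion of each frame.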
In his November 2019 talk PyTorch at Tesla, Tesla Senior AI Director Andrej Karpathy said the goal of the Dojo training supercomputer is to increase performance by orders of magnitude at a lower cost. If the ambitious development of the Dojo supercomputer and Autopilot system architectural change to 4D all goes well, it would give Tesla vehicles a huge lead in the race to the self-driving L5 finish line.
Elon Musk has hinted that the 4D Autopilot FSD upgrades will see a limited public release in “6 to 10 weeks.” For those interested in joining the team that wants to make history, the main Dojo engineering recruitment locations are Palo Alto, Austin, and Seattle. Tesla says working remotely would be acceptable for “exceptional candidates.”
Synced will update readers when additional information becomes available.
Reporter: Fangyu Cai | Editor: Michael Sarazen