The Staggering Cost of Training SOTA AI Models

While it is exhilarating to see AI researchers pushing the performance of cutting-edge models to new heights, the costs of such processes are also rising at a dizzying rate.

Synced recently reported on XLNet, a new language model developed by CMU and Google Research that outperforms the previous SOTA model BERT (Bidirectional Encoder Representations from Transformers) on 20 language tasks, including SQuAD, GLUE, and RACE, and achieves new SOTA results on 18 of them.

What may surprise many is the staggering cost of training an XLNet model. A recent tweet from Elliot Turner — the serial entrepreneur and AI expert who is now the CEO and Co-Founder of Hologram AI — has prompted heated discussion on social media. Turner wrote “it costs $245,000 to train the XLNet model (the one that’s beating BERT on NLP tasks).” His calculation is based on a resource breakdown provided in the paper: “We train XLNet-Large on 512 TPU v3 chips for 500K steps with an Adam optimizer, linear learning rate decay and a batch size of 2048, which takes about 2.5 days.”

Reaction from researchers and academics included this comment on Reddit: “I think I’d just cry if I had to try and convince my boss to spend 250k on AWS for a single model that may or may not perform as well as needed.”

Synced has discovered, however, that Turner's math might be off. A Cloud TPU v3 device, which costs US$8 per hour on Google Cloud Platform, contains four independent chips. Since the paper's authors specified "TPU v3 chips", the calculation should be 512 (chips) / 4 (chips per device) * US$8 (per device-hour) * 24 (hours) * 2.5 (days) = US$61,440. Google researcher James Bradbury expressed the same idea on Twitter: "512 TPU chips is 128 TPU devices, or $61,440 for 2.5 days. The authors could also have meant 512 cores, which is 64 devices or $30,720."
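For readers who want to check the arithmetic themselves, here is a minimal sketch, assuming Google Cloud's on-demand price of US$8 per hour for a Cloud TPU v3 device (four chips, eight cores) and the paper's 2.5 days of training, comparing the possible readings of "512 TPU v3 chips":

```python
# Back-of-envelope XLNet training cost under different readings of "512 TPU v3 chips".
# Assumes US$8/hour per Cloud TPU v3 device, 4 chips (8 cores) per device, 2.5 days.

DEVICE_PRICE_PER_HOUR = 8.0   # US$ per Cloud TPU v3 device (on-demand)
CHIPS_PER_DEVICE = 4
CORES_PER_DEVICE = 8
TRAINING_HOURS = 2.5 * 24

def training_cost(num_devices: float) -> float:
    """Total on-demand cost for the given number of TPU v3 devices."""
    return num_devices * DEVICE_PRICE_PER_HOUR * TRAINING_HOURS

print(training_cost(512))                      # read as 512 devices: $245,760 (Turner's figure)
print(training_cost(512 / CHIPS_PER_DEVICE))   # 512 chips = 128 devices: $61,440
print(training_cost(512 / CORES_PER_DEVICE))   # 512 cores = 64 devices: $30,720
```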

Even so, spending some US$61,000 to train a single language model is pricey. Of course, since Google is co-leading the XLNet research, the company's cloud division is unlikely to charge its own research team full price.

So why is it so expensive to train XLNet? For starters, the model is huge. From the paper: "Our largest model XLNet-Large has the same architecture hyperparameters as BERT-Large, which results in a similar model size." XLNet-Large has 24 Transformer blocks, 1,024 hidden units in each layer, and 16 attention heads. The researchers also collected a total of 32.89 billion subword pieces as pretraining data.
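To get a feel for that scale, here is a rough, illustrative parameter count for a BERT-Large-sized Transformer encoder; the ~32,000-token subword vocabulary is our assumption for illustration, not a figure quoted from the paper.

```python
# Rough parameter count for a BERT-Large-sized Transformer encoder
# (24 layers, 1024 hidden units, 16 heads). The ~32,000-token vocabulary
# is an assumption for illustration; biases, LayerNorm and position
# embeddings are ignored, so this slightly undercounts.

layers, hidden, ffn, vocab = 24, 1024, 4 * 1024, 32_000

embedding_params = vocab * hidden              # token embedding matrix
attention_params = 4 * hidden * hidden         # Q, K, V and output projections
ffn_params       = 2 * hidden * ffn            # two feed-forward weight matrices
per_layer        = attention_params + ffn_params
total            = embedding_params + layers * per_layer

print(f"~{total / 1e6:.0f}M parameters")       # roughly 335M, in BERT-Large territory
```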

Synced took a look at cost estimates for training other large AI models:

University of Washington’s Grover-Mega — total training cost: US$25,000

Grover is a 1.5-billion-parameter neural network tailored for both the generation and detection of fake news. Grover can generate the rest of an article from any headline, and it outperforms other fake news detectors when defending against Grover itself. It was developed by the University of Washington and the Allen Institute for Artificial Intelligence in May 2019 and was recently open-sourced on GitHub.

Training the largest Grover-Mega model cost US$25k in total, based on information in the research paper: "training Grover-Mega is relatively inexpensive: at a cost of $0.30 per TPU v3 core-hour and two weeks of training."
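That figure is easy to reproduce under a plausible core count. A minimal sketch, assuming 256 TPU v3 cores (the core count is our assumption; the quote above gives only the per-core-hour rate and the two-week duration):

```python
# Rough reconstruction of the ~US$25k Grover-Mega training cost.
# The 256-core count is an assumption for illustration; the quoted paper text
# gives only the $0.30 per TPU v3 core-hour rate and the two-week duration.

cores = 256
hours = 14 * 24            # two weeks
rate_per_core_hour = 0.30  # US$

print(cores * hours * rate_per_core_hour)   # 25804.8 — consistent with ~US$25k
```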

Google BERT — estimated total training cost: US$6,912

Released last year by Google Research, BERT is a bidirectional transformer model that redefined the state of the art for 11 natural language processing tasks. Many language models today are built on top of the BERT architecture.

From the Google research paper: "training of BERT-Large was performed on 16 Cloud TPUs (64 TPU chips total). Each pretraining took 4 days to complete." Assuming the training devices were Cloud TPU v2, the total price of a one-time pretraining would be 16 (devices) * 4 (days) * 24 (hours) * US$4.50 (per hour) = US$6,912. Google suggests researchers with tight budgets could pretrain a smaller BERT-Base model on a single preemptible Cloud TPU v2, which takes about two weeks at a cost of about US$500.
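The same arithmetic in code, assuming the Cloud TPU v2 on-demand price of US$4.50 per device-hour:

```python
# One-time BERT-Large pretraining cost, assuming Cloud TPU v2 at US$4.50 per device-hour.
devices, days, price_per_hour = 16, 4, 4.50
print(devices * days * 24 * price_per_hour)   # 6912.0
```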

OpenAI GPT-2 — training cost: US$256 per hour

GPT-2 is a large language model recently developed by OpenAI that can generate realistic paragraphs of text. Without any task-specific training data, the model still demonstrates compelling performance across a range of language tasks, such as machine translation, question answering, reading comprehension, and summarization.

The Register reports the GPT-2 model used 256 Google Cloud TPU v3 cores for training, which costs US$256 per hour. OpenAI didn’t specify the training duration.
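That hourly figure is consistent with the device pricing above: at US$8 per eight-core Cloud TPU v3 device, each core works out to US$1 per hour.

```python
# Sanity check on the reported GPT-2 hourly cost: 256 TPU v3 cores at device pricing.
cores, cores_per_device, device_price_per_hour = 256, 8, 8.0
print(cores / cores_per_device * device_price_per_hour)   # 256.0 US$ per hour
```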

While the numbers may look scary, most machine learning models are nowhere near as demanding as these high-profile examples associated with tech giants. As Turing Award laureate Yoshua Bengio told Synced in a recent interview, "Some of the models are so big that even in MILA (Montreal Institute for Learning Algorithms) we can't run them because we don't have the infrastructure for that. Only a few companies can run these very big models they're talking about."

The cost of the compute used to train models is also expected to become significantly cheaper with continuing advances in algorithms, computing devices, and engineering efforts. As a Reddit user commented: "Google's cat neuron paper used days/tens of thousands of cores but now people are generating fake cats in real time. To take an example from progression of ImageNet models to 75% top-1, first DAWN benchmark submission cost $2k, then cost went down to $40 within couple of years."

The paper XLNet: Generalized Autoregressive Pretraining for Language Understanding is on arXiv.


Journalist: Tony Peng | Editor: Michael Sarazen
