The Staggering Cost of Training SOTA AI Models

While it is exhilarating to see AI researchers pushing the performance of cutting-edge models to new heights, the costs of such processes are also rising at a dizzying rate.

Synced recently reported on XLNet, a new language model developed by CMU and Google Research that outperforms the previous SOTA model BERT (Bidirectional Encoder Representations from Transformers) on 20 language tasks, including SQuAD, GLUE, and RACE, and achieves new SOTA results on 18 of them.

What may surprise many is the staggering cost of training an XLNet model. A recent tweet from Elliot Turner — the serial entrepreneur and AI expert who is now the CEO and Co-Founder of Hologram AI — has prompted heated discussion on social media. Turner wrote “it costs $245,000 to train the XLNet model (the one that’s beating BERT on NLP tasks).” His calculation is based on a resource breakdown provided in the paper: “We train XLNet-Large on 512 TPU v3 chips for 500K steps with an Adam optimizer, linear learning rate decay and a batch size of 2048, which takes about 2.5 days.”

Reaction from researchers and academics included this comment on Reddit: “I think I’d just cry if I had to try and convince my boss to spend 250k on AWS for a single model that may or may not perform as well as needed.”

Synced has discovered, however, that Turner's math might be off. A Cloud TPU v3 device, which costs US$8 per hour on Google Cloud Platform, contains four independent chips. Since the paper's authors specified "TPU v3 chips", the calculation should be 512 (chips) / 4 (chips per device) * US$8 (per device-hour) * 24 (hours) * 2.5 (days) = US$61,440. Google researcher James Bradbury expressed the same idea on Twitter: "512 TPU chips is 128 TPU devices, or $61,440 for 2.5 days. The authors could also have meant 512 cores, which is 64 devices or $30,720."
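For readers who want to check the arithmetic themselves, here is a minimal sketch, assuming Google Cloud's on-demand price of US$8 per hour for a Cloud TPU v3 device (four chips, eight cores) and the paper's 2.5 days of training, comparing the possible readings of "512 TPU v3 chips":

```python
# Back-of-envelope XLNet training cost under different readings of "512 TPU v3 chips".
# Assumes US$8/hour per Cloud TPU v3 device, 4 chips (8 cores) per device, 2.5 days.

DEVICE_PRICE_PER_HOUR = 8.0   # US$ per Cloud TPU v3 device (on-demand)
CHIPS_PER_DEVICE = 4
CORES_PER_DEVICE = 8
TRAINING_HOURS = 2.5 * 24

def training_cost(num_devices: float) -> float:
    """Total on-demand cost for the given number of TPU v3 devices."""
    return num_devices * DEVICE_PRICE_PER_HOUR * TRAINING_HOURS

print(training_cost(512))                      # read as 512 devices: $245,760 (Turner's figure)
print(training_cost(512 / CHIPS_PER_DEVICE))   # 512 chips = 128 devices: $61,440
print(training_cost(512 / CORES_PER_DEVICE))   # 512 cores = 64 devices: $30,720
```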

Even so, spending some US$61,000 to train a single language model is pricey. Of course, since Google is co-leading the XLNet research, the company's cloud division is unlikely to charge its own research team full price.

So why is it so expensive to train XLNet? For starters, the model is huge. From the paper: "Our largest model XLNet-Large has the same architecture hyperparameters as BERT-Large, which results in a similar model size." XLNet-Large has 24 Transformer blocks, 1,024 hidden units in each layer, and 16 attention heads. The researchers also collected a total of 32.89 billion subword pieces as pretraining data.
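To get a feel for that scale, here is a rough, illustrative parameter count for a BERT-Large-sized Transformer encoder; the ~32,000-token subword vocabulary is our assumption for illustration, not a figure quoted from the paper.

```python
# Rough parameter count for a BERT-Large-sized Transformer encoder
# (24 layers, 1024 hidden units, 16 heads). The ~32,000-token vocabulary
# is an assumption for illustration; biases, LayerNorm and position
# embeddings are ignored, so this slightly undercounts.

layers, hidden, ffn, vocab = 24, 1024, 4 * 1024, 32_000

embedding_params = vocab * hidden              # token embedding matrix
attention_params = 4 * hidden * hidden         # Q, K, V and output projections
ffn_params       = 2 * hidden * ffn            # two feed-forward weight matrices
per_layer        = attention_params + ffn_params
total            = embedding_params + layers * per_layer

print(f"~{total / 1e6:.0f}M parameters")       # roughly 335M, in BERT-Large territory
```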

Synced took a look at cost estimates for training other large AI models:

University of Washington’s Grover-Mega — total training cost: US$25,000

Grover is a 1.5-billion-parameter neural network tailored for both the generation and detection of fake news. Grover can generate the rest of an article from any headline, and it outperforms other fake news detectors when defending against Grover itself. It was developed by the University of Washington and the Allen Institute for Artificial Intelligence in May 2019 and was recently open-sourced on GitHub.

Training the largest Grover-Mega model cost US$25k in total, based on information in the research paper: "training Grover-Mega is relatively inexpensive: at a cost of $0.30 per TPU v3 core-hour and two weeks of training."
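That figure is easy to reproduce under a plausible core count. A minimal sketch, assuming 256 TPU v3 cores (the core count is our assumption; the quote above gives only the per-core-hour rate and the two-week duration):

```python
# Rough reconstruction of the ~US$25k Grover-Mega training cost.
# The 256-core count is an assumption for illustration; the quoted paper text
# gives only the $0.30 per TPU v3 core-hour rate and the two-week duration.

cores = 256
hours = 14 * 24            # two weeks
rate_per_core_hour = 0.30  # US$

print(cores * hours * rate_per_core_hour)   # 25804.8 — consistent with ~US$25k
```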

Google BERT — estimated total training cost: US$6,912

Released last year by Google Research, BERT is a bidirectional transformer model that redefined the state of the art for 11 natural language processing tasks. Many language models today are built on top of the BERT architecture.

From the Google research paper: "training of BERT-Large was performed on 16 Cloud TPUs (64 TPU chips total). Each pretraining took 4 days to complete." Assuming the training devices were Cloud TPU v2, the total price of a one-time pretraining would be 16 (devices) * 4 (days) * 24 (hours) * US$4.50 (per hour) = US$6,912. Google suggests researchers with tight budgets could pretrain a smaller BERT-Base model on a single preemptible Cloud TPU v2, which takes about two weeks at a cost of about US$500.
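The same arithmetic in code, assuming the Cloud TPU v2 on-demand price of US$4.50 per device-hour:

```python
# One-time BERT-Large pretraining cost, assuming Cloud TPU v2 at US$4.50 per device-hour.
devices, days, price_per_hour = 16, 4, 4.50
print(devices * days * 24 * price_per_hour)   # 6912.0
```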

OpenAI GPT-2 — training cost: US$256 per hour

GPT-2 is a large language model recently developed by OpenAI that can generate realistic paragraphs of text. Without any task-specific training data, the model still demonstrates compelling performance across a range of language tasks, such as machine translation, question answering, reading comprehension, and summarization.

The Register reports the GPT-2 model used 256 Google Cloud TPU v3 cores for training, which costs US$256 per hour. OpenAI didn’t specify the training duration.
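That hourly figure is consistent with the device pricing above: at US$8 per eight-core Cloud TPU v3 device, each core works out to US$1 per hour.

```python
# Sanity check on the reported GPT-2 hourly cost: 256 TPU v3 cores at device pricing.
cores, cores_per_device, device_price_per_hour = 256, 8, 8.0
print(cores / cores_per_device * device_price_per_hour)   # 256.0 US$ per hour
```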

While the numbers may look scary, most machine learning models are nowhere near as demanding as these high-profile examples associated with tech giants. As Turing Award laureate Yoshua Bengio told Synced in a recent interview, "Some of the models are so big that even in MILA (Montreal Institute for Learning Algorithms) we can't run them because we don't have the infrastructure for that. Only a few companies can run these very big models they're talking about."

The cost of the compute used to train models is also expected to become significantly cheaper with continuing advances in algorithms, computing devices, and engineering efforts. As a Reddit user commented: "Google's cat neuron paper used days/tens of thousands of cores but now people are generating fake cats in real time. To take an example from progression of ImageNet models to 75% top-1, first DAWN benchmark submission cost $2k, then cost went down to $40 within couple of years."

The paper XLNet: Generalized Autoregressive Pretraining for Language Understanding is on arXiv.


Journalist: Tony Peng | Editor: Michael Sarazen
