Large language models (LLMs) have taken a step toward task-agnostic machine learning by leveraging user prompts, instructions written in natural language, to target specific tasks without additional training or fine-tuning. Prompts can significantly boost model performance, but designing an effective prompt, a process known as “prompt engineering,” remains time-consuming and hands-on, and often comes down to trial and error.
In the new paper Ask Me Anything: A Simple Strategy for Prompting Language Models, a research team from Stanford University, Numbers Station, and the University of Wisconsin-Madison presents Ask Me Anything Prompting (AMA), a simple LLM prompting strategy that aggregates multiple “effective yet imperfect” prompts to enable a 30x smaller language model to outperform few-shot GPT3-175B.
The team summarizes their main contributions as follows:
- We identify properties of prompts that improve effectiveness across tasks, model types, and model sizes.
- We propose a strategy for scalably reformatting task inputs to effective formats.
- We propose the use of weak supervision (WS) to reliably aggregate predictions.
The researchers first explore different prompt formats, concluding that open-ended question-answering (QA) prompts (e.g. “Who went to the park?”) outperform prompts that restrict the model to particular tokens (e.g. “John went to the park. Output True or False”). They recursively use the LLM to transform task inputs to the effective open-ended question-answering format noted above, collecting multiple candidate prompts with different accuracies and complex dependencies. Finally, they apply a weak supervision (WS) technique to aggregate the outputs and produce final predictions that demonstrably improve the prompting reliability and performance of off-the-shelf LLMs without further training.
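The pipeline described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `llm` is a hypothetical callable that maps a prompt string to a completion string, the prompt templates are invented for this example, and a plain majority vote stands in for the weak supervision aggregator, which in the paper also accounts for the differing accuracies and dependencies among prompts.

```python
from collections import Counter
from typing import Callable, List

# Hypothetical templates (not from the paper): each asks the model to
# rewrite a restrictive True/False claim as an open-ended question.
QUESTION_TEMPLATES: List[str] = [
    "Rewrite the statement as a question.\nStatement: {claim}\nQuestion:",
    "Turn the claim into a yes/no question.\nClaim: {claim}\nQuestion:",
    "What question would check this statement?\nStatement: {claim}\nQuestion:",
]

# Hypothetical template for answering the generated question in context.
ANSWER_TEMPLATE = "Context: {context}\nQuestion: {question}\nAnswer:"


def run_chain(llm: Callable[[str], str], context: str, claim: str,
              question_template: str) -> str:
    """Run one two-step chain: generate a question, then answer it."""
    question = llm(question_template.format(claim=claim)).strip()
    prompt = ANSWER_TEMPLATE.format(context=context, question=question)
    return llm(prompt).strip()


def map_to_label(answer: str) -> str:
    """Map a free-form answer back to the task's label space."""
    words = answer.lower().split()
    first = words[0].strip(".,!") if words else ""
    return "True" if first in ("yes", "true") else "False"


def majority_vote(votes: List[str]) -> str:
    """Simplified aggregator: an assumption standing in for weak supervision."""
    return Counter(votes).most_common(1)[0][0]


def ama_predict(llm: Callable[[str], str], context: str, claim: str) -> str:
    """Collect one vote per prompt chain and aggregate them."""
    votes = [
        map_to_label(run_chain(llm, context, claim, template))
        for template in QUESTION_TEMPLATES
    ]
    return majority_vote(votes)
```

Any function that takes a prompt and returns text can be passed as `llm`, so the sketch can be tried with a stub before wiring in a real model. Replacing `majority_vote` with a learned aggregator is where this sketch would move closer to the approach the article describes.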
In their empirical study, the team evaluated AMA’s impact on the out-of-the-box few-shot performance of four open-source LLM families (EleutherAI, OPT, BLOOM, and T0) on seven tasks. In the experiments, AMA achieved an average improvement of 10.2 percent over the few-shot baselines and enabled a 30x smaller LLM to outperform few-shot GPT3-175B on 15 of 20 popular benchmarks.
Overall, this work validates the effectiveness of the proposed AMA prompting strategy. The team believes AMA could also benefit LLM applications that involve private data or require operating over large amounts of data.
Author: Hecate He | Editor: Michael Sarazen