Programming competition problems are pervasive in the AI community. They can be used to evaluate human programmers’ problem-solving abilities as well as to test the limits of state-of-the-art algorithms.
A research team from MIT, the Allen Institute for AI and Microsoft Research recently introduced Python Programming Puzzles (P3), a novel open-source collection of programming challenges that capture the essence of puzzles and can be used to teach and evaluate an AI’s programming proficiency.
The team summarizes their contributions as:
- Introduce programming puzzles: a new type of problem suitable for algorithmic problem-solving (for both machines and humans).
- Propose P3, an open-source dataset of puzzles covering diverse domains and difficulties.
- Provide an evaluation of humans and baselines demonstrating that puzzles can be used to measure algorithmic problem-solving progress.
Each proposed puzzle takes the form of a Python function f whose argument is a candidate answer. The goal is to find an input x that makes the function return True; that is, a valid answer x satisfies f(x) == True.
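As a minimal sketch of this format (the puzzle below is illustrative, not taken from the dataset):

```python
def f(s: str) -> bool:
    # Puzzle: find a string s that makes this function return True.
    return "Hello " + s == "Hello world!"

# A valid answer x satisfies f(x) == True:
print(f("world!"))  # True
```

The puzzle itself doubles as the grader: any string for which f returns True is a correct answer.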
The open-source P3 dataset comprises puzzles spanning a wide range of difficulties, domains and algorithmic tools, drawing inspiration from sources such as Wikipedia and programming competitions. It includes classic puzzles such as Towers of Hanoi and chess puzzles (e.g., knight’s-tour and n-queens variants); two-player challenges such as finding optimal strategies for Tic-Tac-Toe, Rock-Paper-Scissors and Mastermind, or finding the Nash equilibria of general-sum games; puzzles from the International Mathematical Olympiad (IMO) and the International Collegiate Programming Contest (ICPC); graph-theory puzzles such as shortest path and planted clique; and elementary algebra and number-theory puzzles, among many others.
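A number-theory puzzle in this style might look as follows (a hypothetical sketch; the parameter-with-default convention is assumed here, and the puzzle is not taken from the dataset):

```python
def f(x: int, n: int = 15) -> bool:
    # Puzzle: find a nontrivial factor x of n.
    return 1 < x < n and n % x == 0

print(f(3))  # True: 3 divides 15
print(f(1))  # False: 1 is a trivial factor
```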
The suite of puzzles enables objective evaluation: because the validity of a candidate answer can be tested simply by running the puzzle, no answer key is needed, and solvers cannot exploit answer-key biases.
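Grading without an answer key can be sketched as follows (an illustrative helper, not the paper’s evaluation code; the `check` function and the sample puzzle are assumptions):

```python
def check(f, candidate) -> bool:
    # Objective grading: just execute the puzzle on the candidate answer.
    try:
        return f(candidate) is True
    except Exception:
        return False  # a malformed answer is simply wrong, not a crash

def f(x: int) -> bool:
    # Sample puzzle: find an integer whose square is 144.
    return x * x == 144

print(check(f, 12))    # True
print(check(f, "12"))  # False (the TypeError is caught)
```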
The team conducted extensive experiments comparing parametric enumerative top-down solvers based on random forests and transformers with several styles of GPT-3 prompting. They also ran a user study to test the puzzles’ ability to measure programming proficiency.
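The simplest enumerative baseline just tries candidate answers until one verifies. A toy version (an assumed illustration, not the paper’s solvers, which learn which candidates to enumerate first):

```python
def brute_force(f, candidates):
    # Trivial enumerative solver: return the first candidate that
    # makes the puzzle return True, or None if none does.
    for x in candidates:
        try:
            if f(x) is True:
                return x
        except Exception:
            pass  # skip candidates that crash the puzzle
    return None

# Example: "find a nontrivial factor of 91" solved by enumeration.
print(brute_force(lambda x: 1 < x < 91 and 91 % x == 0, range(100)))  # 7
```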
The results show that human programmers often outperform AI solvers such as GPT-3 and enumerative techniques. For instance, GPT-3 with bootstrapping solved 60 percent of the puzzles, while beginner and experienced human participants solved 76 percent and 87 percent, respectively. The team also identified a positive correlation between AI solver performance and the puzzles’ difficulty for human programmers.
The paper Programming Puzzles is on arXiv.
Author: Hecate He | Editor: Michael Sarazen, Chain Zhang