Most machine learning techniques operate with the assumption that the training and test data were generated from the same statistical distribution (IID, for independent identically distributed). But it’s not that simple in the real world, where adversaries may supply data that violates such statistical assumptions. Typical examples are adversarial and stochastic games like Go and Chess.
In previous research in this area, as demonstrated by DeepMind’s AlphaGo and its successors, the policy and heuristic function are based on the polynomial upper confidence trees (PUCT) search algorithm, which be can be quite effective for guiding search in adversarial games. However, PUCT lacks guarantees on its search effort and is computationally inefficient. While other methods such as LevinTS provide guarantees on search steps, they do not make use of a heuristic function.
To address the deficiencies of such search algorithms, a research team from DeepMind and Alberta University has proposed Policy-guided Heuristic Search (PHS), a novel search algorithm that uses both a heuristic function and a policy while also offering guarantees on the search loss that relate to the quality of both the heuristics and the policy.
The idea behind the PUCT search algorithm is to guarantee that a model’s value function converges to the true value function in the limit of exploration. The guarantee however does not hold when actual rewards are replaced with an estimated value, and it is not possible to ensure that the level of PUCT generality is the best fit for difficult deterministic single-agent problems.
The more recent A* combined algorithm has a learned heuristic function that trades-off solution quality for search time, and tends to work better than PUCT. But the purpose of using A* is to find solutions of minimum cost, which is inconsistent with the goal of minimizing the search loss (e.g., the number of search steps).
Levin Tree Search (LevinTS), meanwhile, is an approach that uses a learned policy to guide its search and minimize the number of search steps that account for the quality of the policy but is incapable of learning heuristic functions.
The proposed PHS combines LevinTS’s policy-guided search with the heuristic-guided search of A* in a novel algorithm. The PHS algorithm is designed such that, when given a set of unknown tasks (starting with little to no knowledge of the tasks), it will solve all of them as quickly as possible. Unlike traditional search algorithms that seek minimum path loss solutions for all tasks, PHS minimizes the total search loss, hence it does not require the search algorithm solutions to be path-cost optimal.
PHS generalizes LevinTS by considering arbitrary nonnegative loss rather than enforcing the value of loss to be one, and by introducing a heuristic factor that is bigger than or equal to one. The researchers validate that for any nonnegative loss function, and for any given policy and any given heuristic factor, PHS will return a solution node. They also show that the search loss for PHS has an upper bound and that an accurate heuristic can greatly reduce the number of search steps.
The researchers compared their PHS algorithm with policy-guided and heuristic search algorithms PUCT, LevinTS, A*, Weighted A* (WA*) and Greedy Best-First Search (GBFS). They evaluate the algorithms on the 5×5 sliding-tile puzzle, Sokoban (Boxoban), and on a puzzle from the 2016 video game The Witness.
The PHSh and PHS* algorithms obtained the best results on Sokoban. On The Witness, a challenge for learning good heuristic functions, the policy-guided BFS-based algorithms (PUCT, PHS*, LevinTS and PHSh) were the winners. GBFS performed poorly on this challenge as it relies exclusively on the quality of the heuristic. WA* and PHS* meanwhile could solve the Sliding Tile Puzzle problems, clearly outperforming all other algorithms.
Overall, the study show PHS enables the rapid learning of both a policy and a heuristic function and performs well on all three test domains in terms of the number of problems solved and the search time required.
The paper Policy-Guided Heuristic Search with Guarantees is on arXiv.
Author: Hecate He | Editor: Michael Sarazen
We know you don’t want to miss any news or research breakthroughs. Subscribe to our popular newsletter Synced Global AI Weekly to get weekly AI updates.