Bengio Team Proposes Flow Network-Based Generative Models That Learn a Stochastic Policy From a Sequence of Actions

For standard reinforcement learning (RL) algorithms, the maximization of expected return is achieved by selecting the single highest-reward sequence of actions. But for tasks in a combinatorial domain — such as drug molecule synthesis, where exploration is important — the desired goal is no longer to simply generate the single highest-reward sequence of actions, but rather to carefully sample a diverse set of high-return solutions.

To address this specific machine learning problem, a research team from Mila, McGill University, Université de Montréal, DeepMind and Microsoft has proposed GFlowNet, a novel flow-network-based generative method that can turn a given positive reward into a generative policy that samples with a probability proportional to the return.

The team summarizes their contributions as:

Propose GFlowNet, a novel generative method for unnormalized probability distributions based on flow networks and local flow-matching conditions: the flow incoming to a state must match the outgoing flow.
Prove the crucial properties of GFlowNet, including the link between the flow-matching conditions (which many training objectives can provide) and the resulting match of the generated policy with the target reward function. Also, prove its offline properties and asymptotic convergence (if the training objective can be minimized). Demonstrate that previous related work (Buesing et al., 2019) which sees the generative process like a tree, would fail when there are many action sequences that can lead to the same state.
Demonstrate on synthetic data the usefulness of departing from seeking one mode of the return, and instead seeking to model the entire distribution and all its modes.
Successfully apply GFlowNet to a large-scale molecule synthesis domain, with comparative experiments against PPO and MCMC methods.

In graph theory, a flow network is a directed graph with sources and sinks, where each edge has a capacity, and each edge receives a flow. The motivation task of this flow network is iterative black-box optimization, where the agent has to compute a reward for a large batch of candidates at each round. The idea behind the proposed GFlowNet is to view the probability assigned to an action given a state as the flow associated with a network whose nodes are states and outgoing edges from that node are deterministic transitions driven by an action.

The team defines a flow network with a single source, where the sinks of the network correspond to the terminal states. Given the graph structure and the outflow of the sinks, they attempt to calculate a valid flow between nodes. Notably, such construction corresponds to a generative model, and the researchers conduct rigorous proofs and explanations to prove that the result is a terminal state (a sink), with probability proportional to the return if we follow the flow.

Based on the above theoretical results, the researchers then create a learning algorithm. They propose approximating the flows such that the flow conditions are obtained at convergence with enough capacity in their estimator of the flows. In this way, they will yield an objection function for GFlowNet where the minimization result of this objective achieves their desiderata — a generative policy that samples with a probability proportional to the return.

The team conducted various experiments to evaluate the performance of the proposed GFlowNet. For example, in an experiment that generates small molecules, the team reported the empirical distribution of rewards and the average reward of the top-k as a function of learning.

Compared to the baseline MARS (Xie et al., 2021), GFlowNet found more high-reward molecules. The results also show that for both GFlowNet and MARS, the more molecules are visited, the better they become, with a slow convergence towards the proxy’s max reward.

Overall, GFlowNet achieves competitive results against baseline methods on the molecule synthesis domain task and performs well on a simple domain where there are many modes to the reward function. The research team believes GFlowNet can serve as an alternative approach for turning an energy function into a fast generative model.

The implementations are available on the project GitHub. The paper Flow Network based Generative Models for Non-Iterative Diverse Candidate Generation is on arXiv.

Author: Hecate He | Editor: Michael Sarazen, Chain Zhang

We know you don’t want to miss any news or research breakthroughs. Subscribe to our popular newsletter Synced Global AI Weekly to get weekly AI updates.

1 comment on “Bengio Team Proposes Flow Network-Based Generative Models That Learn a Stochastic Policy From a Sequence of Actions”

Pingback: r/artificial - [R] Bengio Team Proposes Flow Network-Based Generative Models That Learn a Stochastic Policy From a Sequence of Actions - Cyber Bharat

Bengio Team Proposes Flow Network-Based Generative Models That Learn a Stochastic Policy From a Sequence of Actions

Like this:

1 comment on “Bengio Team Proposes Flow Network-Based Generative Models That Learn a Stochastic Policy From a Sequence of Actions”

Leave a Reply Cancel reply

Related

Share this:

Like this:

1 comment on “Bengio Team Proposes Flow Network-Based Generative Models That Learn a Stochastic Policy From a Sequence of Actions”

Leave a Reply Cancel reply

Related