Building a Reasoning Machine - Ph.D. thesis
This is the introduction to my PhD thesis.
Despite the immense progress of ML systems in interactive and generative tasks [Brown et al., 2020; Ho et al., 2020], the best systems to date still struggle with tasks that require deliberate and sequential reasoning, such as solving complex math problems or creating a detailed and coherent plan of action [Wei et al., 2022; Kojima et al., 2022].
Several features of human intelligence are not yet present in ML systems. First, humans spend much time “thinking” about a problem before solving it, whereas current ML systems are largely divided into a training phase and an inference phase: most of the compute is spent on training, and inference is performed synchronously with much less compute. (While true at the time of writing, the latest ML systems, based on principles discussed in this thesis, have since changed this [OpenAI, 2024].) Second, humans often consciously search for solutions in a low-dimensional discrete space, such as the space of theorem proofs or possible moves in a game; current ML systems, barring highly specialized ones like AlphaGo [Silver et al., 2016], spend most of their compute searching the high-dimensional continuous parameter space guided by gradient descent [Bottou, 2010]. Third, humans make use of compositional structures to reason about the world, such as decision trees, knowledge graphs, and causal graphs. While attempts have been made to incorporate these structures into ML systems at the architectural level, there is a lack of general and principled approaches to using them to aid reasoning.
We now offer two complementary perspectives on how filling the gaps above can extend the frontier of ML capabilities.
The Search Perspective
The history of AI can be seen as a progression of how compute is turned into intelligence through search.
In GOFAI (Good Old Fashioned AI), search is done directly in the solution space, e.g., the spaces of moves for chess or proofs [Smolensky, 1987; Newell & Simon, 1972; Nilsson, 1998]. This is exemplified by the use of inference algorithms like Markov chain Monte Carlo (MCMC), which performs unamortized search with proposal distributions informed by domain knowledge [MacKay, 2002]. However, MCMC is known to mix poorly among modes and can be prohibitively expensive at scale. As a result, GOFAI systems are inefficient at turning compute into intelligence, partly because no amortized learning takes place.
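As a toy illustration of this kind of unamortized search, the sketch below runs a Metropolis–Hastings chain on a made-up bimodal one-dimensional target with a Gaussian random-walk proposal (both are illustrative assumptions, not any system discussed above). Every new query would have to re-run the chain from scratch, and the two well-separated modes hint at the mixing problem just mentioned.

```python
import math
import random

def energy(x):
    # Unnormalized bimodal target: two Gaussian wells centered at -2 and +2,
    # with density proportional to exp(-energy(x)).
    return -math.log(math.exp(-(x - 2) ** 2) + math.exp(-(x + 2) ** 2))

def metropolis_hastings(n_steps, step_size=0.5, seed=0):
    """Unamortized search: nothing is learned, the chain just walks."""
    rng = random.Random(seed)
    x = 0.0
    samples = []
    for _ in range(n_steps):
        # Domain knowledge enters only through the proposal distribution.
        proposal = x + rng.gauss(0.0, step_size)
        # Accept with probability min(1, exp(E(x) - E(x'))).
        # (+1e-300 guards against log(0) when rng.random() returns 0.)
        if math.log(rng.random() + 1e-300) < energy(x) - energy(proposal):
            x = proposal
        samples.append(x)
    return samples

samples = metropolis_hastings(5000)
```

Nothing is amortized here: the compute spent on one chain produces samples for one target and is thrown away afterwards, which is precisely the inefficiency the amortized methods below address.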
Deep learning [Goodfellow et al., 2016] searches the space of neural network parameters for a solution that generalizes well to downstream tasks. On the surface, this appears to be a more daunting task because the search space is both continuous and high-dimensional. (At the time of writing, the state-of-the-art deep learning system has over 1 trillion trainable parameters.) However, gradients through the differentiable operators of the neural network, together with inductive biases baked into the optimization process, such as stochastic gradient descent [Bottou, 2010], make this search both tractable and efficient. Large-scale deep learning systems have proven excellent at both discriminative [Krizhevsky et al., 2012; He et al., 2016] and generative tasks [Goodfellow et al., 2014; Kingma & Welling, 2013], in both the supervised [Simonyan & Zisserman, 2014; He et al., 2016] and unsupervised settings [Kingma & Welling, 2013; Radford et al., 2015]. However, the search takes place only in the high-dimensional parameter space, which does not cover the slow, discrete, structured, and low-dimensional search process humans consciously experience as we reason.
Many have hypothesized the advent of a slow-thinking System 2 in ML to complement the fast and intuitive System 1 [Kahneman, 2011; Goyal & Bengio, 2022]. We posit that System 2 is not a separate component with a different model architecture; rather, it is the structured computation of System 1, which includes searching in a discrete and low-dimensional latent space. Just as humans explicitly reason about concepts and objects, ML systems with strong reasoning capabilities should do so as well, without being explicitly fed these concepts and objects or the rules for manipulating them. This can be framed as the learning of a latent variable model (LVM) over compositional concepts and objects, an idea that dates back to statistical modeling [Dempster et al., 1977], with the reasoning process then corresponding to inference in the LVM. This highlights the need to perform inference over complex compositional structures, a task typically solved with brute force in GOFAI and largely circumvented by deep learning via simplification of the latent space, such as strong conditional independence assumptions [Kingma & Welling, 2013; Rezende et al., 2014].
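To make the correspondence between reasoning and LVM inference concrete, here is a minimal sketch, assuming a made-up two-component Gaussian mixture with a binary latent (the numbers and setup are illustrative, not from any chapter): inferring which discrete "concept" generated an observation is exact Bayesian inference over the latent variable.

```python
import math

# Toy discrete LVM: a latent z in {0, 1} picks one of two Gaussian "concepts",
# and the observation x is real-valued. Reasoning about x amounts to
# inferring the posterior p(z | x).
priors = [0.5, 0.5]
means = [-1.0, 1.0]

def likelihood(x, z):
    # Unit-variance Gaussian; the shared normalizing constant cancels
    # in the posterior, so it is omitted.
    return math.exp(-0.5 * (x - means[z]) ** 2)

def posterior(x):
    # Bayes' rule: p(z | x) = p(z) p(x | z) / sum_z' p(z') p(x | z').
    joint = [priors[z] * likelihood(x, z) for z in (0, 1)]
    total = sum(joint)
    return [j / total for j in joint]

# An observation near +1 is confidently attributed to concept z = 1.
p = posterior(0.9)
```

With two latent values, the posterior sum is trivial; over compositional latent spaces the analogous sum is intractable, which is exactly where the amortized samplers discussed next come in.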
Recently, a new algorithm, called generative flow networks (GFlowNets) [Bengio et al., 2021a; 2021b], was proposed to train efficient amortized samplers over compositional objects. It combines insights from both variational inference [MacKay, 2002] and reinforcement learning [Sutton & Barto, 2018] to train samplers of distributions specified by energy functions that are generally intractable to normalize. In other words, given enough neural network capacity and training time, we can train a GFlowNet to draw samples from an arbitrary distribution over compositional objects for which we have an energy function. It is one of the vital tools we use in this thesis to learn complex latent variable models to equip ML systems with the ability to reason by searching in a discrete and low-dimensional latent space. There are two key distinctions compared to GOFAI, which similarly searches in a discrete, structured space. First, the search here is amortized through the optimization of a neural network, leveraging the success of System 1 deep learning. Second, the search space is not prescribed by human experts but automatically discovered from data, with possible guidance from domain knowledge in the form of inductive biases.
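The following is a minimal sketch of the sampling problem GFlowNets solve, on a made-up compositional space of binary strings built one bit at a time (the space and reward are illustrative assumptions). For a space this small, the flows can be computed exactly by recursion; in an actual GFlowNet, the flows or the forward policy are amortized by a neural network trained with objectives such as trajectory balance.

```python
import random
from collections import Counter

# Toy compositional space: binary strings of length 3, constructed
# sequentially, with an unnormalized reward on finished objects.
LENGTH = 3

def reward(bits):
    return 2.0 ** sum(bits)  # strings with more 1s are exponentially preferred

def flow(state):
    """Exact flow: F(x) = R(x) at terminal states, F(s) = sum over children."""
    if len(state) == LENGTH:
        return reward(state)
    return flow(state + (0,)) + flow(state + (1,))

def sample(rng):
    """Sample forward with P(child | s) = F(child) / F(s); the resulting
    terminal object x is distributed proportionally to R(x)."""
    state = ()
    while len(state) < LENGTH:
        children = [state + (0,), state + (1,)]
        state = rng.choices(children, weights=[flow(c) for c in children])[0]
    return state

rng = random.Random(0)
counts = Counter(sample(rng) for _ in range(20000))
# The empirical frequency of (1,1,1) approaches R(111) / Z = 8 / 27.
```

Here the partition function Z = 27 is computable by brute force; the point of GFlowNets is to learn this sampler when enumerating the space, as `flow` does, is impossible.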
On the separation of knowledge and inference. Reasoning can be broken down into knowledge and inference. For example, to play Go well, one needs to both know the rules of the game and search for the best moves on the fly — the latter is known as the inference problem. GOFAI systems rely on manually encoded knowledge, such as in the Epilog project [Schubert, 2000], and a separate inference engine, such as MCMC. Deep learning systems automatically encode knowledge from data through reconstruction losses and benefit from a unified objective for both knowledge and inference. For instance, an autoregressive LLM [Brown et al., 2020; Radford et al., 2015] encodes knowledge and performs inference at the same time through the next-word prediction objective.
However, knowledge and inference components have conflicting design principles. On one hand, the knowledge component should be compact per Occam’s razor [Sober, 2015] and make use of inductive biases to boost sample efficiency [Goyal & Bengio, 2022]. On the other hand, the inference component needs enough capacity to perform complex amortized inference over the knowledge. Deep learning allows us to train powerful inference machines at scale, but often the same machinery is used to encode knowledge, which leads to low sample efficiency and a lack of generalization. We leverage this insight in the GFlowNet-LLM chapter and separate the knowledge and inference components in LLMs to boost sample efficiency and generalization.
The Data Perspective
The advent of LLMs poses a question: is scaling all we need to keep making progress?
The answer is a resounding yes — if we have access to the data distribution underlying the full spectrum of human intelligence, which some might call artificial general intelligence (AGI) [Bostrom, 2014; Goertzel, 2007]. This elusive AGI distribution might include detailed thinking processes and all the mental models we use to reason about the world. With enough compute and capacity, we can optimize a neural network with the maximum likelihood estimation (MLE) objective [MacKay, 2002] and eventually fully model this AGI distribution. However, such an idealized data distribution is not readily available at scale; instead, we are left with the Internet [Caliskan et al., 2017; Bolukbasi et al., 2016] in practice, which is biased and not representative of the full spectrum of intelligent behaviors we want to model. Seen from this perspective, what is missing in ML today is the right data, and the solution can include both the collection of more data and data augmentation.
Data augmentation is the process of generating more data from existing data. It can be done with both explicit and implicit inductive biases. For example, we can make copies of an image through rotations, scaling, or cropping [Shorten & Khoshgoftaar, 2020; Cubuk et al., 2019] — operations that don’t change the semantics of the image — when explicitly augmenting the training data for image classification. An implicit inductive bias for data augmentation could be the use of convolutional layers [LeCun et al., 1998; Krizhevsky et al., 2012], which have spatial invariance built in and are akin to making infinite copies of the training data with various spatial translations. Data augmentation can also be done through self-play, such as in the example of AlphaStar [Vinyals et al., 2019] and AlphaZero [Silver et al., 2018], where high-quality data is created through simulated experiences and a reward model of the environment. More recently, Lin et al. (2025) outperformed models an order of magnitude larger on mathematical theorem proving by combining the inductive bias of having a library of lemmas and self-play for data augmentation.
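The explicit, label-preserving kind of augmentation can be sketched minimally on a toy 2-D grid standing in for an image (the transforms and grid are illustrative assumptions, not the pipelines cited above):

```python
import random

def rotate90(img):
    """Rotate a 2-D grid (a list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def hflip(img):
    """Mirror each row horizontally."""
    return [row[::-1] for row in img]

def augment(img, rng):
    """Label-preserving augmentation: a random rotation plus an optional
    flip. The semantics (and hence the label) of the image are unchanged."""
    for _ in range(rng.randrange(4)):
        img = rotate90(img)
    if rng.random() < 0.5:
        img = hflip(img)
    return img

# One labeled example expands into many semantically equivalent copies.
rng = random.Random(0)
img = [[1, 2],
       [3, 4]]
copies = [augment(img, rng) for _ in range(8)]
```

A convolutional layer's translation invariance plays the same role implicitly: rather than materializing shifted copies, the architecture treats them as equivalent by construction.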
Concretely, we can improve the thinking abilities of existing ML systems by strategically injecting inductive biases [Goyal & Bengio, 2022] that correspond to infinite data augmentation and generating high-quality data through a form of self-play, which effectively turns compute into data beyond the training process.
These two complementary perspectives are the common threads behind the four articles included in this thesis. Amortized inference over compositional structures, enabled by GFlowNets, is the backbone of the GFlowNet-EM and GFlowNet-LLM chapters, where we construct a discrete latent space from data and perform search over it. The Blackboard chapter directly and losslessly embeds symbolic objects into a vector space and introduces gradient-guided search in the parameter space of neural networks to the learning of symbolic manipulations, a discrete and structured search process. The GFlowNet-EM and GFlowNet-CD chapters discuss constraints on the action and state spaces of GFlowNets as inductive biases to improve sample efficiency. The Blackboard architecture is strongly biased towards tree operations, such as through the explicit use of car, cdr, and cons in the action space, which yields exceptional compositional generalization for trees. Finally, the GFlowNet-LLM chapter uses compute at inference time to create chain-of-thought (CoT) samples from their Bayesian posterior, given by the unnormalized joint distribution, which is akin to self-play with an unnormalized reward model. Similarly, the step-by-step design of the Blackboard architecture can be understood as a highly structured form of CoT reasoning, which contributes to its excellent compositional generalization.
My Publications Referenced in the Thesis
- Amortizing Intractable Inference in Large Language Models
- GFlowNet-EM for Learning Compositional Latent Variable Models
- Differentiable Tree Operations Promote Compositional Generalization
- GFlowNets for Causal Discovery: an Overview
- GFlowNets and Variational Inference
- GFlowNet Foundations
Acknowledgements
I’d like to thank Yoshua Bengio, Aaron Courville, Jason Eisner, and Dhanya Sridhar for being on my thesis committee!