Over the last few years we have experienced an enormous data deluge, which has
played a key role in the surge of interest in AI. A partial list of large
datasets:
- ImageNet, with over 14 million images for classification and object detection.
- MovieLens, with 20 million user ratings of movies for collaborative filtering.
- Udacity’s car dataset (at least 223GB) for training self-driving cars.
- Yahoo’s 13.5 TB dataset of user-news interaction for studying human behavior.
Stochastic Gradient Descent (SGD) has been the engine fueling the
development of large-scale models for these datasets. SGD is remarkably
well-suited to large datasets: it estimates the gradient of the loss function on
a full dataset using only a fixed-size minibatch, and updates a model many
times with each pass over the dataset.
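As a minimal sketch (not tied to any particular framework), one epoch of minibatch SGD looks roughly like this; `data`, `grad_loss`, `batch_size`, and `lr` are placeholder names:

```python
import numpy as np

def sgd_epoch(theta, data, grad_loss, batch_size, lr):
    """One pass over the data with plain minibatch SGD: each update estimates
    the full-data gradient from a fixed-size minibatch."""
    idx = np.random.permutation(len(data))
    for start in range(0, len(data), batch_size):
        batch = data[idx[start:start + batch_size]]   # fixed-size minibatch
        theta = theta - lr * grad_loss(theta, batch)  # gradient step on the loss
    return theta
```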
But SGD has limitations. When we construct a model, we use a loss function
$L_\theta(x)$ with dataset x and model parameters θ and attempt to
minimize the loss by gradient descent on θ. This shortcut approach makes
optimization easy, but is vulnerable to a variety of problems including
over-fitting, excessively sensitive coefficient values, and possibly slow
convergence. A more robust approach is to treat the inference problem for
θ as a full-blown posterior inference, deriving a joint distribution
p(x,θ) from the loss function, and computing the posterior p(θ|x).
This is the Bayesian modeling approach, and specifically the Bayesian Neural
Network approach when applied to deep models. This recent tutorial by Zoubin
Ghahramani discusses some of the advantages of this approach.
The model posterior p(θ|x) for most problems is intractable (no closed
form). There are two methods in Machine Learning to work around intractable
posteriors: Variational Bayesian methods and Markov Chain Monte Carlo
(MCMC). In variational methods, the posterior is approximated with a simpler
distribution (e.g. a normal distribution) and its distance to the true posterior
is minimized. In MCMC methods, the posterior is approximated as a sequence of
correlated samples (points or particle densities). Variational Bayes methods
have been widely used but often introduce significant error — see this recent
comparison with Gibbs Sampling, also Figure 3 from the Variational
Autoencoder (VAE) paper. Variational methods are also more computationally
expensive than direct parameter SGD (it’s a small constant factor, but a small
constant times 1-10 days can be quite important).
MCMC methods have no such bias. You can think of MCMC particles as rather like
quantum-mechanical particles: you only observe individual instances, but they
follow an arbitrarily-complex joint distribution. By taking multiple samples you
can infer useful statistics, apply regularizing terms, etc. But MCMC methods
have one overriding problem with respect to large datasets: other than the
important class of conjugate models which admit Gibbs sampling, there has been
no efficient way to do the Metropolis-Hastings tests required by general MCMC
methods on minibatches of data (we will define/review MH tests in a moment). In
response, researchers had to design models to make inference tractable, e.g.
Restricted Boltzmann Machines (RBMs) use a layered, undirected design to
make Gibbs sampling possible. In a recent breakthrough, VAEs use
variational methods to support more general posterior distributions in
probabilistic auto-encoders. But with VAEs, like other variational models, one
has to live with the fact that the model is a best-fit approximation, with
(usually) no quantification of how close the approximation is. Although they
typically offer better accuracy, MCMC methods have been sidelined recently in
auto-encoder applications, lacking an efficient scalable MH test.
A bridge between SGD and Bayesian modeling has been forged recently by papers on
Stochastic Gradient Langevin Dynamics (SGLD) and Stochastic Gradient
Hamiltonian Monte Carlo (SGHMC). These methods involve minor variations to
typical SGD updates which generate samples from a probability distribution which
is approximately the Bayesian model posterior p(θ|x). These approaches
turn SGD into an MCMC method, and as such require Metropolis-Hastings (MH) tests
for accurate results, the topic of this blog post.
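For concreteness, here is a sketch of a single SGLD update in the spirit of Welling & Teh (2011): an ordinary minibatch gradient step on the log posterior, plus injected Gaussian noise whose variance matches the step size. The names `grad_log_prior` and `grad_log_lik` are assumed placeholders:

```python
import numpy as np

def sgld_step(theta, minibatch, N, step_size, grad_log_prior, grad_log_lik):
    """One SGLD update: a minibatch estimate of the gradient of log p(theta|x),
    followed by Gaussian noise with variance equal to the step size."""
    b = len(minibatch)
    grad = grad_log_prior(theta) + (N / b) * sum(grad_log_lik(x, theta) for x in minibatch)
    noise = np.random.normal(0.0, np.sqrt(step_size), size=theta.shape)
    return theta + 0.5 * step_size * grad + noise
```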
Because of these developments, interest has warmed recently in scalable MCMC and
in particular in doing the MH tests required by general MCMC models on large
datasets. Normally an MH test requires a scan of the full dataset and is applied
each time one wants a posterior sample. Clearly for large datasets, it’s intractable
to do this. Two papers from ICML 2014, Korattikara et al. and Bardenet et
al., attempt to reduce the cost of MH tests. They both use concentration
bounds, and both achieve constant-factor improvements relative to a full dataset
scan. Other recent work improves performance but makes even stronger
assumptions about the model which limits applicability, especially for deep
networks. None of these approaches come close to matching the performance of
SGD, i.e. generating a posterior sample from small constant-size batches of
data.
In this post we describe a new approach to MH testing which moves the cost of MH
testing from O(N) to O(1) relative to dataset size. It avoids the need for
global statistics and does not use tail bounds (which lead to long-tailed
distributions for the amount of data required for a test). Instead we use a
novel correction distribution to directly “morph” the distribution of a noisy
minibatch estimator into a smooth MH test distribution. Our method is a true
“black-box” method which provides estimates on the accuracy of each MH test
using only data from a minibatch of small expected size. It can even be applied to
unbounded data streams. It can be “piggy-backed” on existing SGD implementations
to provide full posterior samples (via SGLD or SGHMC) for almost the same cost
as SGD samples. Thus full Bayesian neural network modeling is now possible for
about the same cost as SGD optimization. Our approach is also a potential
substitute for variational methods and VAEs, providing unbiased posterior
samples at lower cost.
To explain the approach, we review the role of MH tests in MCMC models.
Markov Chain Monte Carlo Review
Markov Chains
MCMC methods are designed to sample from a target distribution which is
difficult to compute. To generate samples, they utilize Markov Chains, which
consist of nodes representing states of the system and probability distributions
for transitioning from one state to another.
A key concept is the Markovian assumption, which states that the probability
of being in a state at time t+1 can be inferred entirely based on the current
state at time t. Mathematically, letting $\theta_t$ represent the current
state of the Markov chain at time $t$, we have

$$p(\theta_{t+1} \mid \theta_t, \ldots, \theta_0) = p(\theta_{t+1} \mid \theta_t).$$

By using these probability distributions, we can generate a chain of samples
$(\theta_i)_{i=1}^T$ for some large $T$.
Since the probability of being in state $\theta_{t+1}$ directly depends on
$\theta_t$, the samples are correlated. Rather surprisingly, it can be shown
that, under mild assumptions, in the limit of many samples the distribution of
the chain’s samples approximates the target distribution.
A full review of MCMC methods is beyond the scope of this post, but a good
reference is the Handbook of Markov Chain Monte Carlo (2011). Standard
machine learning textbooks such as Koller & Friedman (2009) and Murphy
(2012) also cover MCMC methods.
Metropolis-Hastings
One of the most general and powerful MCMC methods is
Metropolis-Hastings. This uses a test to filter samples. To define
it properly, let p(θ) be the target distribution we want to
approximate. In general, it’s intractable to sample directly from it.
Metropolis-Hastings uses a simpler proposal distribution q(θ′|θ)
to generate samples. Here, θ represents our current sample in the
chain, and θ′ represents the proposed sample. For simple cases, it’s
common to use a Gaussian proposal centered at θ.
If we were to just use a Gaussian to generate samples in our chain, there’s no
way we could approximate our target p, since the samples would form a random
walk. The MH test cleverly resolves this by filtering samples with the
following test. Draw a uniform random variable $u \in [0,1]$ and determine
whether the following is true:

$$u \overset{?}{<} \min\left\{ \frac{p(\theta')\,q(\theta \mid \theta')}{p(\theta)\,q(\theta' \mid \theta)},\ 1 \right\}$$
If true, we accept θ′. Otherwise, we reject and reuse the old sample
θ. Notice that
- It doesn’t require knowledge of a normalizing constant (independent of
θ and θ′), because that cancels out in the
p(θ′)/p(θ) ratio. This is great, because normalizing constants are
arguably the biggest reason why distributions become intractable.
- The higher the value of p(θ′), the more likely we are to accept.
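To make the mechanics concrete, here is a minimal random-walk Metropolis-Hastings sketch with a symmetric Gaussian proposal (so the q terms cancel in the ratio); `log_p` is assumed to be the unnormalized log target density:

```python
import numpy as np

def metropolis_hastings(log_p, theta0, n_samples, proposal_std=1.0):
    """Random-walk MH: propose from a Gaussian centered at the current sample,
    then accept or reject with the MH test (done in log space for stability)."""
    theta = np.asarray(theta0, dtype=float)
    samples = []
    for _ in range(n_samples):
        theta_prop = theta + proposal_std * np.random.randn(*theta.shape)
        log_ratio = log_p(theta_prop) - log_p(theta)   # q terms cancel (symmetric proposal)
        if np.log(np.random.uniform()) < min(log_ratio, 0.0):
            theta = theta_prop                          # accept
        samples.append(theta.copy())                    # otherwise reuse the old sample
    return np.array(samples)
```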
To get more intuition on how the test works, we’ve created the following figure
from this Jupyter Notebook, showing the progression of samples to
approximate a target posterior. This example is derived from Welling & Teh
(2011).
A quick example of the MH test in action on a Gaussian mixture posterior. The
parameter is $\theta \in \mathbb{R}^2$ with the x and y axes representing
$\theta_1$ and $\theta_2$, respectively. The target posterior has contours shown
in the fourth plot; the probability mass is concentrated in the diagonal between
points (0,1) and (1,−1). (This posterior depends on sampled Gaussians.) The
plots show the progression of the MH test after 50, 500, and 5000 samples in our
MCMC chain. After 5000 samples, it's clear that our samples are concentrated in
the regions with higher posterior probability.
Reducing Metropolis-Hastings Data Usage
What happens when we consider the Bayesian posterior inference case with large
datasets? (Perhaps we’re interested in the same example in the figure above,
except that the posterior is based on more data points.) Then our goal is to
sample to approximate the distribution $p(\theta \mid x_1, \ldots, x_N)$ for large
$N$. By Bayes’ rule, this is

$$p(\theta \mid x_1, \ldots, x_N) = \frac{p_0(\theta)\, p(x_1, \ldots, x_N \mid \theta)}{p(x_1, \ldots, x_N)},$$

where $p_0$ is the prior. We additionally assume that the $x_i$ are
conditionally independent given $\theta$. The MH test therefore becomes:

$$u \overset{?}{<} \min\left\{ \frac{p_0(\theta') \prod_{i=1}^N p(x_i \mid \theta')\, q(\theta \mid \theta')}{p_0(\theta) \prod_{i=1}^N p(x_i \mid \theta)\, q(\theta' \mid \theta)},\ 1 \right\}$$
Or, after taking logarithms and rearranging (while ignoring the minimum
operator, which technically isn’t needed here), we get
$$\log\left( u \, \frac{q(\theta' \mid \theta)\, p_0(\theta)}{q(\theta \mid \theta')\, p_0(\theta')} \right) \overset{?}{<} \sum_{i=1}^N \log \frac{p(x_i \mid \theta')}{p(x_i \mid \theta)}$$
The problem now is apparent: it’s expensive to compute all the $p(x_i \mid \theta')$
terms, and this has to be done every time we sample since it depends on θ′.
The naive way to deal with this is to apply the same test, but with a minibatch
of $b$ elements $x_1^*, \ldots, x_b^*$ drawn from the full dataset:

$$\log\left( u \, \frac{q(\theta' \mid \theta)\, p_0(\theta)}{q(\theta \mid \theta')\, p_0(\theta')} \right) \overset{?}{<} \frac{N}{b} \sum_{i=1}^b \log \frac{p(x_i^* \mid \theta')}{p(x_i^* \mid \theta)}$$
Unfortunately, this won’t sample from the correct target distribution; see
Section 6.1 in Bardenet et al. (2017) for details.
A better strategy is to start with the same batch of b points, but then gauge
the confidence of the batch test relative to using the full data. If, after
seeing b points, we already know that our proposed sample θ′ is
significantly worse than our current sample θ, then we should reject
right away. If θ′ is significantly better, we should accept. If it’s
ambiguous, then we increase the size of our test batch, perhaps to 2b
elements, and then measure the test’s confidence. Lather, rinse, repeat. As
mentioned earlier, Korattikara et al. (2014) and Bardenet et al.
(2014) developed algorithms following this framework.
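As a rough schematic of that framework (not the exact algorithms from either paper, which rely on carefully calibrated concentration bounds), the batch-growing loop might look like the following. Here `log_lik_ratios` is assumed to be a NumPy array of per-example terms log p(x_i|θ′)/p(x_i|θ) for a shuffled dataset, `lhs` is the left-hand side of the log-space test, and the confidence rule is a placeholder t-statistic check:

```python
import numpy as np

def adaptive_mh_test(lhs, log_lik_ratios, b0=100):
    """Grow the test minibatch until the accept/reject decision looks confident
    or the full dataset is used. The test asks whether lhs < sum_i of the
    log-likelihood ratios, i.e. whether the per-example mean exceeds lhs / N."""
    N = len(log_lik_ratios)
    b = b0
    while True:
        batch = log_lik_ratios[:b]
        mean = batch.mean()
        sem = batch.std(ddof=1) / np.sqrt(b)       # standard error of the mean
        t = (mean - lhs / N) / max(sem, 1e-12)     # placeholder confidence score
        if abs(t) > 2.0 or b >= N:                 # confident enough, or out of data
            return mean > lhs / N
        b = min(2 * b, N)                          # ambiguous: double the batch and retry
```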
A weakness of the above approach is that it’s doing repeated testing and one
must reduce the allowable test error each time one increments the test batch
size. Unfortunately, there is also a significant probability that the approaches
above will grow the test batch all the way to the full dataset, and they offer
at most constant factor speedups over testing the full dataset.
Minibatch Metropolis-Hastings: Our Contribution
Change the Acceptance Function
To set up our test, we first define the log transition probability ratio
Δ:
$$\Delta(\theta, \theta') = \log \frac{p_0(\theta') \prod_{i=1}^N p(x_i \mid \theta')\, q(\theta \mid \theta')}{p_0(\theta) \prod_{i=1}^N p(x_i \mid \theta)\, q(\theta' \mid \theta)}$$
This log ratio factors into a sum of per-sample terms, so when we approximate
its value by computing on a minibatch we get an unbiased estimator of its
full-data value plus some noise (which is asymptotically normal by the Central
Limit Theorem).
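A sketch of that minibatch estimator: the prior and proposal terms are computed exactly, and the full-data sum of per-example log-likelihood ratios is estimated by scaling a minibatch sum by N/b. The name `log_lik_ratio_fn` is an assumed placeholder for a function returning log p(x|θ′)/p(x|θ):

```python
def delta_estimate(log_prior_prop_term, log_lik_ratio_fn, minibatch, N):
    """Unbiased minibatch estimate of Delta; by the CLT its error is
    approximately Gaussian for reasonably large minibatches."""
    b = len(minibatch)
    return log_prior_prop_term + (N / b) * sum(log_lik_ratio_fn(x) for x in minibatch)
```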
The first step for applying our MH test is to use a different acceptance
function. Expressed in terms of Δ, the classical MH accepts a transition
with probability given by the blue curve.
Functions f and g can serve as acceptance tests for Metropolis-Hastings.
Given current sample θ and proposed sample θ′, the vertical axis
represents the probability of accepting θ′.
Instead of using the classical test, we’ll use the sigmoid function. It might
not be apparent why this is allowed, but there’s some elegant theory that
explains why using this alternative function as the acceptance test for MH
still results in the correct semantics of MCMC. That is, under the same mild
assumptions, the distribution of samples (θi)Ti=1 approaches the
target distribution.
The density of the standard logistic random variable, denoted $X_{\text{log}}$,
along with the equivalent MH test expression ($X_{\text{log}} + \Delta > 0$) with the
sigmoid acceptance function.
Our acceptance test is now the sigmoid function. Note that the sigmoid function
is the cumulative distribution function of a (standard) logistic random
variable; the figure above plots the density. One can show that the MH test
under the sigmoid acceptance function reduces to determining whether
$X_{\text{log}} + \Delta > 0$ for a sampled $X_{\text{log}}$ value.
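In code, the sigmoid-acceptance version of the test is a one-liner: sample a standard logistic variable and check the sign of its sum with Δ. The acceptance probability is then exactly sigmoid(Δ):

```python
import numpy as np

def sigmoid_mh_accept(delta):
    """Accept iff X_log + delta > 0 with X_log a standard logistic sample, so
    P(accept) = 1 / (1 + exp(-delta)) = sigmoid(delta)."""
    x_log = np.random.logistic(loc=0.0, scale=1.0)
    return x_log + delta > 0
```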
New MH Test
This is nice, but we don’t want to compute Δ because it depends on all
$p(x_i \mid \theta')$ terms. When we estimate Δ using a minibatch, we
introduce an additive error which is approximately normal, $X_{\text{normal}}$. The
key observation in our work is that the distribution of the minibatch estimate
of Δ (approximately Gaussian) is already very close to the desired test
distribution $X_{\text{log}}$, as shown below.
A plot of the logistic CDF in red (as we had earlier) along with a normal CDF
curve, colored in lime, which corresponds to a standard deviation of 1.7.
Rather than resorting to tail bounds as in prior work, we directly bridge these
two distributions using an additive correction variable $X_{\text{correction}}$:

A diagram of our minibatch MH test. On the right we have the full-data test that
we want, but we can't use it since Δ is intractable. Instead, we have
$\Delta + X_{\text{normal}}$ (from the left side) and must add a correction $X_{\text{correction}}$.

We want to make the LHS and RHS distributions equal, so we add in a correction
$X_{\text{correction}}$, which is a symmetric random variable centered at zero.
Adding independent random variables gives a random variable whose distribution
is the convolution of the summands’ distributions. So finding the correction
distribution involves “deconvolution” of a logistic and normal distribution.
It’s not always possible to do this, and several conditions must be met (e.g.
the tails of the normal distribution must be thinner than the logistic’s), but
luckily for us they are. In our paper, to appear at UAI 2017, we show that
the correction distribution can be tabulated to approximately single-precision
floating-point accuracy.
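Putting the pieces together, one step of the minibatch test has roughly the following shape. This is a sketch under stated assumptions: `sample_correction` stands in for a sampler from the tabulated correction distribution (which also requires the noise level to be small enough), and `log_lik_ratio_fn` and `log_prior_prop_term` are the same placeholders as above:

```python
import numpy as np

def minibatch_mh_step(theta, theta_prop, minibatch, N,
                      log_prior_prop_term, log_lik_ratio_fn, sample_correction):
    """One minibatch MH decision: form the noisy estimate Delta + X_normal,
    estimate its standard deviation from the same minibatch, add the
    correction variable, and test the sign."""
    b = len(minibatch)
    terms = np.array([log_lik_ratio_fn(x) for x in minibatch])
    delta_hat = log_prior_prop_term + (N / b) * terms.sum()   # Delta + X_normal
    noise_std = (N / np.sqrt(b)) * terms.std(ddof=1)          # estimated std of X_normal
    x_corr = sample_correction(noise_std)                     # so X_normal + X_corr ~ X_log
    return theta_prop if delta_hat + x_corr > 0 else theta
```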
In our paper, we also prove theoretical results bounding the error of our test,
and present experimental results showing that our method results in accurate
posterior estimation for a Gaussian Mixture Model, and that it is also highly
sample-efficient in Logistic Regression for classification of MNIST digits.
Histograms showing the batch sizes used for Metropolis-Hastings for the three
algorithms benchmarked in our paper. The posterior is similar to the earlier
example from the Jupyter Notebook, except generated with one million data
points. The left histogram is our result; the other two are from Korattikara et al. (2014)
and Bardenet et al. (2014), respectively. Our algorithm uses an average of just
172 data points per iteration. Note the log-log scale of the histograms.