What kind of a graphical model is the brain? Geoffrey E. Hinton Canadian Institute for Advanced Research & Department of Computer Science, University of Toronto 10 Kings College Road Toronto, Canada M5S 3G4 hinton@cs.toronto.edu Abstract recent work in collaboration with Simon Osindero and Yee- Whye Teh that combines these two tricks in a surprising way If neurons are treated as latent variables, our vi- to learn a hybrid generative model that was first proposed by sual systems are non-linear, densely-connected Yee-Whye Teh. In this model, the top two layers form an graphical models containing billions of variables undirected associative memory. The remaining layers form and thousands of billions of parameters. Cur- a directed acyclic graph that converts the representation in rent algorithms would have difficulty learning a the associative memory into observable variables such as the graphical model of this scale. Starting with an pixels of an image. In addition to working well, this hybrid algorithm that has difficulty learning more than model has some other nice features: a few thousand parameters, I describe a series 1. The learning finds a fairly good model quickly even in of progressively better learning algorithms all of deep directed networks with millions of parameters and which are designed to run on neuron-like hard- many hidden layers. For optimal performance, how- ware. The latest member of this series can learn ever, a slower fine-tuning phase is required. deep, multi-layer belief nets quite rapidly. It turns a generic network with three hidden layers and 2. The learning algorithm builds a full generative model of 1.7 million connections into a very good genera- the data which makes it easy to see what the distributed tive model of handwritten digits. After learning, representations in the deeper layers have in mind. the model gives classification performance that is 3. The inference required for forming a percept is both fast comparable to the best discriminative methods. and accurate. 4. The learning algorithm is unsupervised. For labeled 1 Introduction data, it learns a model that generates both the label and the data. Our perceptual systems make sense of the visual input using a neural network that contains about 1013 synapses. There 5. The learning is local: adjustments to a synapse strength has been much debate about whether our perceptual abilities depend only on the states of the pre-synaptic and post- should be attributed to a few million generations of blind evo- synaptic neuron. lution or to a few hundred million seconds of visual experi- 6. The communication is simple: neurons only need to ence. Evolutionary search suffers from an information bottle- communicate their stochastic binary states. neck because fitness is a scalar, so my bet is that the main contribution of evolution was to endow us with a learning Section 2 describes a simple learning algorithm for undi- algorithm that could make use of high-dimensional gradient rected, densely-connected, networks composed of stochastic vectors. These vectors provide millions of bits of informa- binary variables some of which are unobserved. Section 3 tion every second thus allowing us to perform a much larger shows how to make this simple algorithm efficient by re- search in one lifetime than evolution could perform in our stricting the architecture of the network. Section 4 introduces entire evolutionary history. the idea of variational approximations for learning directed graphical models in which correct inference is intractable So what is this magic learning algorithm? I have been in- and describes the "wake-sleep" algorithm that makes use of a volved in attempts to answer this question using undirected graphical models [Hinton and Sejnowski, 1986], directed variational approximation in a multi-layer, directed network graphical models [Hinton et al., 1995], or no graphical model of stochastic binary variables. All of these sections can be at all [Rumelhart et al., 1986]. These attempts have failed as safely ignored by people already familiar with these ideas. Section 5 introduces the novel idea of a "complementary" scientific theories of how the brain learns because they sim- prior. Complementary priors seem about as probable as father ply do not work well enough. They have, however, produced Christmas because, by definition, they exactly cancel the "ex- two neat tricks, one for learning undirected models and one plaining away" phenomenon that makes inference difficult in for learning directed models. In this paper, I describe some directed models. Section 5.1 includes a simple example of a complementary prior and shows the equivalence between re- stricted Boltzmann machines and infinite directed networks with tied weights. Section 6 introduces a fast, greedy learning algorithm for constructing multi-layer directed networks one layer at a time. Using a variational bound it shows that as each new layer is added, the overall generative model improves. The greedy algorithm resembles boosting in its repeated use of the same "weak" learner, but instead of re-weighting each data- vector to ensure that the next step learns something new, it re-represents it. Curiously, the weak learner that is used to construct deep directed nets is itself an undirected graphical model. Section 7 shows how the weights produced by the efficient Figure 1: A Boltzmann machine composed of stochastic greedy algorithm can be fine-tuned using the "up-down" al- binary units with symmetric connections. When data is gorithm which is a contrastive version of the wake-sleep al- clamped on the visible units, a simple stochastic updating gorithm. rule infers a configuration of states of the hidden units that Section 8 shows the pattern recognition performance of is a good interpretation of the data. If no data is clamped and a network with three hidden layers and about 1.7 million the same updating rule is used for all of the units, the network weights on the standard MNIST set of handwritten digits. generates visible vectors from its model. When no knowledge of geometry is provided and there is no special preprocessing, the generalization performance of the network is 1.25% errors on the official test set. This beats the where s is the binary state of unit i in the global binary 1.5% achieved by the best back-propagation nets when they i are not hand-crafted for this particular application, and it is configuration . If units are chosen at random and their quite close to the 1.1% or 1.0% achieved by the best support binary states are updated using the stochastic activation rule vector machines. in Eq. 1 the Boltzmann machine will eventually converge to Finally, section 9 shows what happens in the mind of the a stationary probability distribution in which the probability model when it is running without being constrained by visual of finding it in any global state is determined by the energy of input. The network has a full generative model, so it is easy that state relative to the energies of all the other global states: to look into its mind ­ we simply generate an image from its exp(-E ) high-level representations. P = (3) exp(-E ) Throughout the paper, we will consider nets composed of stochastic binary variables but the ideas can be generalized to If we sum over all configurations of the hidden units, we get other models in which the log probability of a variable is an the probability, at thermal equilibrium, of finding the visible additive function of the states of its directly-connected neigh- units in configuration bours. exp(-E ) 2 The Boltzmann machine learning algorithm P= (4) exp(-E ) A Boltzmann machine (Hinton and Sejnowski, 1986) is a net- work of stochastic binary units with symmetric connections. A Boltzmann machine can be viewed as a generative It is usually divided into a set of "visible" units which can model that assigns a probability, via Eq. 4, to each possi- have data-vectors clamped on them, and a set of hidden units ble binary state vector over its visible units. By changing that act as latent variables (see figure 1). Each unit, i, adopts the weights and biases, we can change the probability that its "on" state with a probability that is a logistic function of the model assigns to each possible visible vector. So we can the inputs it receives from other units, j : model a set of training vectors by adjusting the weights and biases to maximize the sum of their log probabilities. A nice 1 p(si = 1) = (1) feature of Boltzmann machines is that the maximum likeli- j 1 + exp(-bi - sj wij ) hood learning rule for the weights is both simple and local. where bi is the bias of unit i, and wij is the weight on the sym- We can learn a locally optimal set of weights by collecting metric connection between i and j . The weights and biases of two sets of statistics: a Boltzmann machine define an energy function over global · In the positive phase we clamp a training vector on the configurations (i.e. binary state vectors) of the net. Using visible units and then repeatedly update the hidden units as an index over configurations of the visible units, and for in random order using Eq. 1. Once the distribution configurations of the hidden units: over hidden configurations has reached "thermal" equi- i i bi s - s s wij librium with the clamped data-vector, we sample the E = - (2) i i j hidden state and record which pairs of units are both + , the average correlation between i and j when data is clamped on the visible units. · In the negative phase we let the network run freely with the visible units unclamped. Their states are updated in just the same way as the hidden units. Once the distri- bution over global configurations has reached equilib- rium, we sample the states of all the units and record which pairs are both on. By repeating this many times, Figure 2: This depicts a Markov chain that uses alternating we can compute < si sj >- , the average correlation be- Gibbs sampling. In one full step of Gibbs sampling, the hid- tween i and j when the network is running freely and den units in the top layer are all updated in parallel by apply- therefore producing samples from its generative model. ing Eq. 1 to the inputs received from the the current states of the visible units in the bottom layer, then the visible units We can then follow the gradient of the log probability of are all updated in parallel given the current hidden states. The the data by using the simple rule chain is initialized by setting the binary states of the visible wij = (+ - - ) (5) units to be the same as a data-vector. The correlations in the activities of a visible and a hidden unit are measured after the where is a learning rate. It is very surprising that the learn- first update of the hidden units and again at the end of the ing rule is so simple because the gradient of the log likeli- chain. The difference of these two correlations provides the hood with respect to one weight depends in complicated ways learning signal for updating the weight on the connection. on all the other weights. In the back-propagation algorithm, these dependencies are computed explicitly in the backward pass. In the Boltzmann machine they show up as the differ- ence between the local correlations in the positive and nega- where the superscript 0 indicates that the correlation is mea- tive phases. sured at the start of the chain with a data-vector clamped on Unfortunately, the simplicity and generality of the Boltz- the visible units and n indicates that the correlation is mea- mann machine learning algorithm come at a price. It can take sured after n full steps of Gibbs sampling. The angle brack- a very long time for the network to settle to thermal equi- ets denote expectations over both the choice of data-vector librium, especially in the negative phase when it is uncon- and the stochastic updating used in the Gibbs sampling. This strained by data but needs to be highly multi-modal. Also, learning rule does not follow the gradient of the log like- the gradient used for learning is very noisy because it is the lihood, but it does closely approximate the gradient of an- difference of two noisy expectations. These problems make other function, contrastive divergence, which is the difference the general form of the algorithm impractical for large net- of two Kullback-Leibler divergences [Hinton, 2002]. Intu- works with many hidden units. itively, it is not necessary to run the chain to equilibrium in order to see how the data distribution is being systematically 3 Restricted Boltzmann machines and distorted by the model. If we just run the chain for a few steps and then lower the energy of the data and raise the energy of contrastive divergence learning whichever configuration the chain preferred to the data, we If we are willing to restrict the architecture of a Boltzmann will make the model more likely to generate the data and machine by not allowing connections between hidden units, less likely to generate the alternatives. An empirical inves- the positive phase no longer requires any settling. With a tigation of the relationship between the maximum likelihood data-vector clamped on the visible units, the hidden units are and the contrastive divergence learning rules can be found in all conditionally independent, so we can apply the update rule [Carreira-Perpinan and Hinton, 2005]. in Eq. 1 to all the units at the same time to get an unbiased Contrastive divergence learning in a restricted Boltzmann sample from the posterior distribution over hidden configura- machine is efficient enough to be practical [Mayraz and Hin- tions. This makes it easy to measure the first correlation in ton, 2001]. Variations that use real-valued units and differ- Eq. 5. ent sampling schemes are described in [Teh et al., 2003] and If we also prohibit connections between visible units, we have been quite successful for modeling the formation of to- can update all of the visible units in parallel given a hidden pographic maps [Welling et al., 2003] and for denoising nat- configuration. So the second correlation in Eq. 5 can be found ural images [Roth and Black, 2005]. However, it appears that by alternating Gibbs sampling as shown in figure 2. Unfortu- the efficiency has been bought at a high price: It is not pos- nately we may need to run the alternating Gibbs sampling for sible to have deep, multilayer nets because these take far too a long time before the Markov chain converges to the equilib- long even to reach conditional equilibrium with a clamped rium distribution. Fortunately, if we start the Markov chain at data-vector. Also, nets with symmetric connections do not the data distribution, learning still works well even if we only give a causal model in which the data is explained in terms of run the chain for a few steps [Hinton, 2002]. This gives an underlying causes. efficient learning rule: The next section describes a simple learning algorithm for wij = (0 - n ) (6) an apparently quite different type of network that uses di- rected connections. This learning algorithm also has defi- ciencies, but it can be combined with contrastive divergence learning in a surprising way to produce an algorithm that works much better and is significantly more similar to the real brain. 4 Variational learning Inference in directed graphical models that use non-linear, distributed representations is difficult because of a phe- nomenon called "explaining away" [Pearl, 1988] which cre- ates dependencies between hidden variables. This is illus- trated in figure 3. Radford Neal [Neal, 1992] showed that it was possible to use Gibbs sampling to perform inference cor- rectly in multilayer directed networks composed of the same type of binary stochastic units as are used in Boltzmann ma- Figure 3: A simple logistic belief net containing two inde- chines. The communication required is more complicated pendent, rare causes that become highly anti-correlated when than in a Boltzmann machine because in addition to seeing we observe the house jumping. The bias of -10 on the earth- the binary states of its ancestors and descendants, a unit needs quake node means that, in the absence of any observation, this to see the probability that each of its descendants would be node is e10 times more likely to be off than on. If the earth- turned on by the current states of all that descendant's an- quake node is on and the truck node is off, the jump node has cestors. However, once we have a sample from the poste- a total input of 0 which means that it has an even chance of rior distribution over configurations of the hidden units, the being on. This is a much better explanation of the observa- maximum likelihood learning rule for updating the directed tion that the house jumped than the odds of e-20 which apply connection from j to i is very simple: if neither of the hidden causes is active. But it is wasteful to turn on both hidden causes to explain the observation because wj i = sj (si - si ) ^ (7) the odds on them both happening are e-10 × e-10 = e-20 . ^ where is a learning rate and si is the probability that i would be turned on by the current states of all its ancestors. There where y is a configuration of the hidden units, Q(y |x) is is no need for a "negative phase" because directed models do the probability that the sender will choose to use y in order not require the awkward normalizing term that shows up in to communicate data-vector x, log p(y ) is the cost of com- the denominator of Eq. 3. Radford Neal showed that logistic municating y to a receiver who already has the model and belief nets are somewhat easier to learn than Boltzmann ma- log p(x|y ) is the cost of communicating x to a receiver who chines, but the use of Gibbs sampling to get samples from the has both the model and the hidden configuration y . Rich posterior distribution still makes it very tedious to learn large, deep nets. soon discovered that it was better to minimize a different function and we eventually understood why. Rich Zemel and I realised that it might still be possible Suppose there are two different hidden configurations that to learn a belief net that contained a layer of binary stochas- give the same communication cost for a data-vector. The tic hidden units even when the cost of computing the pos- sender can flip a coin to decide which one to use. After receiv- terior distribution, or sampling from it, was prohibitive. In- ing the data-vector, the receiver can figure out what choices stead of trying to perform maximum likelihood learning, we were available to the sender and he can therefore figure out adopted a coding perspective and attempted to learn a model that would minimize the description length of the data [Zemel the value of the random bit produced by the coin. So if there and Hinton, 1995]. The idea is that the sender and receiver are two equally good hidden configurations, the sender can communicate one additional bit from a random bit stream by both have access to the model and instead of communicating his choice of configuration. In general, the number of extra a data-vector directly, the sender first communicates a hid- bits is equal to the entropy of the sender's distribution across den configuration of the model. This costs some bits, but it hidden configurations. All of these extra bits can be used to also gives the receiver a good idea of what data to expect. communicate some other bit string, so we need to subtract Given these expectations, the data-vector can be communi- cated more cheaply1 . So it appears that the expected cost of them from the communication cost of the data-vector: communicating a data-vector is: C (x) = - Q(y |x) [log p(y ) + log p(x|y )] C (x) = - Q(y |x) [log p(y ) + log p(x|y )] (8) ( - Q(y |x) log Q(y |x) - 9) 1 Shannon showed that, using an efficient block coding scheme, the cost of communicating a discrete value to a receiver is asymp- If the sender picks hidden configurations from their true totically equal to the negative log probability of that value under posterior distribution, the communication cost is minimized a probability distribution that has already been agreed upon by the and is equal to the negative log probability of the data-vector sender and the receiver. under the model. But if it is hard to sample from the true pos- The wake-sleep algorithm works quite well, but the sleep terior, the sender can use any other distribution he chooses. phase is not exactly following the gradient of the variational The communication cost goes up, but it is still perfectly well- bound. As a result, it does the wrong thing when a data- defined. The sender could, for example, insist on using a fac- vector can be generated by two quite different hidden configu- torial distribution in which the states of the hidden units are rations: Instead of picking one of these hidden configurations chosen independently. The communication cost will then be and sticking with it, it averages the configurations to produce an upper bound on the negative log probability of the data un- a vague factorial distribution that gives significant probability der the model and by minimizing the communication cost we to many poor configurations. will either push down the negative log probability of the data or we will make the bound tighter. The looseness of the bound 5 Complementary priors is just the Kullback-Liebler divergence between the distribu- The phenomenon of explaining away makes inference diffi- tion used by the sender and the true posterior, P (y |x). cult in directed networks. It is comforting that we can still Q(y |x) improve the parameters even when inference is done incor- Q(y |x) log - log p(x) = C (x) - (10) p(y |x) rectly, but it would be much better to find a way of eliminat- ing explaining away altogether, even in models whose hidden The use of an approximate posterior distribution to bound variables have highly correlated effects on the visible vari- log p(x) [Neal and Hinton, 1998] is now a standard way to ables. Most people who use directed graphical models regard learn belief nets in which inference is intractable [Jordan et this as impossible. al., 1999]. In a logistic belief net with one hidden layer, the prior dis- tribution over the hidden variables is factorial because their 4.1 The wake-sleep algorithm binary states are chosen independently when the model is A simple way to make use of variational learning in a multi- used to generate data. The non-independence in the posterior layer logistic belief net is to use a set of "recognition" con- distribution is created by the likelihood term coming from the nections that compute a factorial approximation to the pos- data. Perhaps we could eliminate explaining away in the first terior distribution in one layer when given the binary states hidden layer by using extra hidden layers to create a "com- of the units in the layer below [Hinton et al., 1995]. These plementary" prior that has exactly the opposite correlations recognition connections will not, in general, have the same to those in the likelihood term. Then, when the likelihood values as the corresponding generative connections. Given a term is multiplied by the prior, we will get a posterior that set of recognition weights, it is easy to update the generative is exactly factorial. This seems pretty implausible, but figure weights to follow the gradient of the description cost in Eq. 9. 4 shows a simple example of a logistic belief net with repli- We use a data-vector to set the states of the visible units and cated weights in which the priors are complementary at every then we use the recognition connections to compute a prob- hidden layer. This net has some interesting properties. ability for each unit in the first hidden layer. Then we use these probabilities to pick independent binary states for all 5.1 An infinite directed model with tied weights the units in that layer. This is repeated for each hidden layer We can generate data from the infinite directed net by start- in turn until we have a sample from the approximate poste- ing with a random configuration at an infinitely deep hidden rior. Given this sample the learning rule for the generative, layer and then performing an ancestral pass all the way down top-down weights is given by Eq. 7. This is the "wake" phase to the visible variables. Clearly, the distribution that we will of the wake-sleep algorithm. get over the visible variables is exactly the same as the distri- The "correct" way to learn the recognition weights is to bution produced by the Markov chain in figure 2 so the infi- follow the derivative of the cost in Eq. 9. The recognition nite directed net with tied weights is equivalent to a restricted weights only affect the Q terms; they do not affect the p terms. Boltzmann machine.2 However, the derivatives w.r.t the Q terms are messy because We can sample from the true posterior distribution over all Q comes outside the log. So we make a further approxima- of the hidden layers of the infinite directed net by starting with tion. In the "sleep" phase, we perform an ancestral pass to a data vector on the visible units and then using the transposed generate a sample from the generative model. Starting at the weight matrices to infer the factorial distributions over each top layer, we pick binary states for the units from their inde- hidden layer in turn. At each hidden layer we sample from the pendent priors. Then we pick states for the units in each lower factorial posterior before computing the factorial posterior for layer using the probabilities computed by applying the gener- the layer above. This is exactly the same process as starting a ative weights to the states in the layer above. Once we have restricted Boltzmann machine at the data and letting it settle completed an ancestral pass, we have both a visible vector to equilibrium. It is also exactly the same as the inference and its true hidden causes. So we can adjust the recognition procedure used in the wake-sleep algorithm, but in this net weights to be better at recovering the hidden causes from the it gives unbiased samples because the complementary prior states of the units in the layer below: at each layer ensures that the posterior distribution really is wij = si (sj - sj ) ^ factorial. (11) ^ where sj is the probability that j would be turned on by the 2 We can interpret any undirected model that uses Gibbs sampling current states of all its descendants. as an infinite directed model with tied weights. Since we can easily sample from the true posterior, it is easy to learn the weights in the infinite directed net. Let us start by computing the derivative for a generative weight, wij0 , from a unit j in layer H0 to unit i in layer V0 (see figure 0 4). In a logistic belief net, the maximum likelihood learning rule for a single data-vector, x, is: log p(x) = ^i (12) ji wij0 0 where < · > denotes an average over the sampled states and s0 is the probability that unit i would be turned on if ^i the visible vector was stochastically reconstructed from the sampled hidden states. Computing the posterior distribution over the second hidden layer, V1 , from the sampled binary states in the first hidden layer, H0 , is exactly the same process as reconstructing the data, so s1 is a sample from s0 and the ^i i learning rule becomes: log p(x) = (13) i ji wij0 0 Since the weights are replicated, the full derivative for a gen- erative weight is obtained by summing the derivatives of the generative weights between all pairs of layers: Figure 4: An infinite logistic belief net with tied weights. log p(x) = ji i wij performs a non-linear transformation on its input vectors and + produces as output the vectors that will be used as input for ij j + the next model in the sequence. i ji Figure 5 shows a multilayer generative model in which the +... top two layers interact via undirected connections and all of All of the vertically aligned terms cancel leaving the Boltz- the other connections are directed. There are no intra-layer mann machine learning rule of Eq. 5. The same weights are connections and, to simplify the analysis, all layers have the also used for inference and one might expect this to contribute same number of units. It is possible to learn sensible (though extra derivatives. Fortunately, all of these derivatives are zero. not optimal) values for the parameters W0 by assuming that The inference is exact so, to first order, small changes in the the parameters between higher layers will be used to construct inferred posteriors make no change in the log probability of a complimentary prior for W0 . This is equivalent to assuming the data. The only reason the recognition weights ever change that all of the weight matrices are constrained to be equal. The is because they are tied to the generative weights. task of learning W0 under this assumption reduces to the task of learning an RBM and although this is still difficult, good 6 A greedy learning algorithm for approximate solutions can be found rapidly by minimizing contrastive divergence (Hinton, 2002). Once W0 has been transforming representations T learned, the data can be mapped through W0 to create higher- An efficient way to learn a complicated model is to combine level "data" at the first hidden layer. a set of simpler models that are learned sequentially. To force If the RBM is a perfect model of the original data, the each model in the sequence to learn something different from higher-level "data" will already be modeled perfectly by the the previous models, the data is modified in some way af- higher-level weight matrices. Generally, however, the RBM ter each model has been learned. In boosting [Freund, 1995], will not be able to model the original data perfectly and we each model in the sequence is trained on re-weighted data that can make the generative model better using the following emphasizes the cases that the preceding models got wrong. In greedy algorithm: one version of principal components analysis, the variance in 1. Learn W0 assuming all the weight matrices are tied. a modeled direction is removed thus forcing the next modeled direction to lie in the orthogonal subspace. In projection pur- T 2. Freeze W0 and commit ourselves to using W0 to infer suit [Friedman and Stuetzle, 1981], the data is transformed factorial approximate posterior distributions over the by nonlinearly distorting one direction in the data-space to states of the variables in the first hidden layer. remove all non-Gaussianity in that direction. The idea behind 3. Keeping all the higher weight matrices tied to each our greedy algorithm is to allow each model in the sequence other, but untied from W0 , learn an RBM model of the to receive a different representation of the data. The model a dataset in which y occurs with probability Q(y |x). If the bound becomes tighter, it is possible for log p(x) to fall even though the lower bound on it increases, but log p(x) can never fall below its value at step 2 of the greedy algorithm because the bound is tight at this point and the bound always increases. The greedy algorithm can clearly be applied recursively, so if we use the full maximum likelihood Boltzmann machine learning algorithm to learn each set of tied weights and then we untie the bottom layer of the set from the weights above, we can learn the weights one layer at a time with a guar- antee3 that we will never decrease the log probability of the data under the full generative model. In practice, we replace maximum likelihood Boltzmann machine learning algorithm by contrastive divergence learning because it works well and is much faster. The use of contrastive divergence voids the guarantee, but it is still reassuring to know that extra layers are guaranteed to improve imperfect models if we learn each layer with sufficient patience. To guarantee that the generative model is improved by greedily learning more layers, it is convenient to consider models in which all layers are the same size so that the higher- Figure 5: A hybrid network. The top two layers have undi- level weights can be initialized to the values learned before rected connections and form an associative memory. The lay- they are untied from the weights in the layer below. The same ers below have directed, top-down, generative connections greedy algorithm, however, can be applied even when the lay- that can be used to map a state of the associative memory ers are different sizes. to an image. There are also directed, bottom-up, recognition connections that are used to infer a factorial representation in one layer from the binary activities in the layer below. In the 7 Back-Fitting with the up-down algorithm greedy initial learning the recognition connections are tied to Learning the weight matrices one layer at a time is efficient the generative connections. but far from optimal. Once the weights in higher layers have been learned, neither the weights nor the simple infer- T higher-level "data" that is produced by using W0 to ence procedure are optimal for the lower layers. The sub- transform the original data. optimality produced by greedy learning is relatively innocu- ous for supervised methods like boosting. Labels are often If this greedy algorithm changes the higher-level weight scarce and each label may only provide a few bits of con- matrices, it is guaranteed to improve the generative model. straint on the parameters, so over-fitting is typically more of The log probability of a single data-vector, x, under the mul- a problem than under-fitting. Going back and refitting the tilayer generative model is bounded by earlier models may, therefore, cause more harm than good. log p(x) Q(y |x) (log p(y ) + log p(x|y )) Unsupervised methods, however, can use very large unla- beled datasets and each case may be very high-dimensional thus providing many bits of constraint on a generative model. - Q(y |x) log Q(y |x) Under-fitting is then a serious problem which can be alle- viated by a subsequent stage of back-fitting in which the where y is a binary configuration of the units in the first hid- weights that were learned first are revised to fit in better with den layer, p(y ) is the prior probability of y under the model the weights that were learned later. defined by the weights above H0 and Q(·|x) is any probabil- After greedily learning good initial values for the weights ity distribution over y . The bound becomes an equality if and in every layer, we untie the "recognition" weights that are only if Q(·|x) is the true posterior distribution. used for inference from the "generative" weights that de- When all of the weight matrices are tied together, the fac- fine the model, but retain the restriction that the posterior in T torial distribution over H0 produced by applying W0 to a each layer must be approximated by a factorial distribution in data-vector is the true posterior distribution, so at step 2 of which the variables within a layer are conditionally indepen- the greedy algorithm log p(x) is equal to the bound. Step 2 dent given the values of the variables in the layer below. A freezes both Q(·|x) and p(x|y ) and with these terms fixed, variant of the wake-sleep algorithm described in section 4.1 the derivative of the ound is the same as the derivative of b can then be used to allow the higher-level weights to influ- Q(y |x) log p(y ) (14) ence the lower level ones. In the "up-pass", the recognition weights are used in a bottom-up pass that stochastically picks So maximizing the bound w.r.t. the weights in the higher lay- 3 ers is exactly equivalent to maximizing the log probability of The guarantee is on the expected change in the log probability. a state for every hidden variable. The generative weights on the directed connections are then adjusted using the maxi- mum likelihood learning rule in Eq. 7. Because weights are ^ no longer tied to the weights above them, si must be com- puted using the states of the variables in the layer above i and the generative weights from these variables to i. The weights on the undirected connections at the top level are learned as before by fitting the top-level RBM to the posterior distribu- tion of the penultimate layer. The "down-pass" starts with a state of the top-level asso- ciative memory and uses the top-down generative connections to stochastically activate each lower layer in turn. During the down-pass, the top-level undirected connections and the generative directed connections are not changed. Only the bottom-up recognition weights are modified. This is equiva- lent to the sleep phase of the wake-sleep algorithm if the as- sociative memory is allowed to settle to its equilibrium distri- bution before initiating the down-pass. But if the associative memory is initialized by an up-pass and then only allowed to Figure 6: The network used to model the joint distribution of run for a few iterations of alternating Gibbs sampling before digit images and digit labels. initiating the down-pass, this is a "contrastive" form of the wake-sleep algorithm which eliminates the need to sample from the equilibrium distribution of the associative memory. separately, starting at the bottom. Each layer was trained for The contrastive form also fixes several other problems of the 30 sweeps through the training set (called "epochs"). Dur- sleep phase. It ensures that the recognition weights are being ing training, the units in the "visible" layer of each RBM had learned for representations that resemble those used for real real-valued activities between 0 and 1. These were the nor- data and it also helps to eliminate the problem of mode aver- malized pixel intensities when learning the the bottom layer aging. If, given a particular data vector, the current recogni- of weights. For training higher layers of weights, the real- tion weights always pick a particular mode at the level above valued activities of the visible units in the RBM were the ac- and ignore other very different modes that are equally good at tivation probabilities of the hidden units in the lower-level generating the data, the learning in the down-pass will not try RBM. The hidden layer of each RBM used stochastic binary to alter those recognition weights to recover any of the other values when that RBM was being trained. The greedy train- modes as it would if the sleep phase used a pure ancestral ing took a few hours in Matlab on a 3GHz Xeon processor pass. and when it was done, the error-rate on the test set was 2.49% By using a top-level associative memory we also elimi- (see below for details of how the network is tested). nate a problem in the wake phase: Independent top-level units When training the top layer of weights (the ones in the seem to be required to allow an ancestral pass, but they mean associative memory) the labels were provided as part of the that the variational approximation is very poor for the top input. The labels were represented by turning on one unit in a layer of weights. "softmax" group of 10 units. When the activities in this group were reconstructed from the activities in the layer above, ex- 8 Performance on the MNIST database actly one unit was allowed to be active and the probability of 8.1 Training the network picking unit i was given by: The MNIST database of handwritten digits contains 60,000 exp(xi ) training images and 10,000 test images. Results for many pi = (15) j exp(xj ) different pattern recognition techniques are already published for this database so it is ideal for evaluating new pattern where xi is the total input received by unit i. Curiously, recognition methods. The network4 shown in figure 6 was the learning rules are unaffected by the competition between trained on 44,000 of the training images that were divided units in a softmax group, so the synapses do not need to know into 440 balanced mini-batches each containing 10 examples which unit is competing with which other unit. The competi- of each digit class. The weights were updated after each mini- tion affects the probability of a unit turning on, but it is only batch. this probability that affects the learning. In the initial phase of training, the greedy algorithm de- After the greedy layer-by-layer training, the network was scribed in section 6 was used to train each layer of weights trained, with a different learning rate and weight-decay, for 4 Preliminary experiments with 16 × 16 images of handwritten 300 epochs using the up-down algorithm described in section 7. The learning rate, momentum, and weight-decay5 were digits from the USPS database showed that a good way to model the joint distribution of digit images and their labels was to use an 5 architecture of this type, but for 16 × 16 images, only 3/5 as many No attempt was made to use different learning rates or weight- units were used in each hidden layer. decays for different layers and the learning rate and momentum were Figure 7: All 49 cases in which the network guessed right but had a second guess whose probability was within 0.3 of the probability of the best guess. The true classes are arranged in standard scan order. chosen by training the network several times and observing its performance on a separate validation set of 10,000 im- ages that were taken from the remainder of the full training set. For the first 100 epochs of the up-down algorithm, the up-pass was followed by three full iterations of alternating Gibbs sampling in the associative memory before perform- Figure 8: The 125 test cases that the network got wrong. Each ing the down-pass. For the second 100 epochs, six iterations case is labeled by the network's guess. The true classes are were performed, and for the last 100 epochs, ten iterations arranged in standard scan order. were performed. Each time the number of iterations of Gibbs sampling was raised, the error on the validation set decreased noticeably. The network that performed best on the validation set was then tested and had an error rate of 1.39%. This network was initial weights, "softmax" output units, a cross-entropy error then trained on all 60,000 training images6 until its error-rate function, and either very gentle learning (John Platt, personal on the full training set was as low as its final error-rate had communication) or a regularizer that penalizes the squared been on the initial training set of 44,000 images. This took weights by an amount that is carefully chosen using a valida- a further 59 epochs making the total learning time about a tion set. For comparison, nearest neighbor has a reported er- week. The final network had an error-rate of 1.25%. The ror rate (google MNIST) of 3.1% if all 60,000 training cases errors made by the network are shown in figure 8. The 49 are used (which is extremely slow) and 4.4% if 20,000 are cases that the network gets correct but for which the second used. This can be reduced to 2.8% and 4.0% by using an L3 best probability is within 0.3 of the best probability are shown norm. in figure 7. The error-rate of 1.25% compares very favorably with the The only standard machine learning algorithm that outper- error-rates achieved by discriminative, feed-forward neural forms the generative model is support vector machines which networks that have one or two hidden layers and are trained give error rates of 1.1% or 1.0%. But it is hard to see how by back-propagation. When the detailed connectivity of these support vector machines can make use of the domain-specific networks is not hand-crafted for this particular task, the best tricks, like weight-sharing and sub-sampling, which LeCun reported error-rate for stochastic on-line learning with a sep- et.al. use to improve the performance of discriminative neu- arate squared error on each of the 10 output units is 2.95%. ral networks from 1.5% to 0.95%. There is no obvious reason These error-rates can be reduced to 1.51% by using small why weight-sharing and sub-sampling cannot be used to re- duce the error-rate of the generative model. Further improve- always set quite conservatively to avoid oscillations. It is highly ments can always be achieved by averaging the opinions of likely that the learning speed could be considerably improved by a multiple networks or by enhancing the training set with dis- more careful choice of learning parameters, though it is possible that torted data, but these techniques are available to all methods, this would lead to worse solutions. 6 though data enhancement can seriously slow-down methods The training set has unequal numbers of each class, so images that do not scale sub-linearly with the size of the training set. were assigned randomly to each of the 600 mini-batches. Figure 9: Each row shows 10 samples from the generative Figure 10: Each row shows 10 samples from the generative model with a particular label clamped on. The top-level asso- model with a particular label clamped on. The top-level as- ciative memory is run for 1000 iterations of alternating Gibbs sociative memory is initialized by an up-pass from a random sampling between samples. binary image in which each pixel is on with a probability of 0.5. The first column shows the results of a down-pass from this initial high-level state. Subsequent columns are produced by 20 iterations of alternating Gibbs sampling in the associa- 8.2 Testing the network tive memory. One way to test the network is use a stochastic up-pass from the image to fix the binary states of the 500 units in the lower layer of the associative memory. With these states fixed, the we use a sample from this distribution as input to the layers label units are given initial real-valued activities of 0.1 and a below and generate an image by a single down-pass through few iterations of alternating Gibbs sampling are then used to the generative connections. If we clamp the label units to a activate the correct label unit. This method of testing gives particular class during the Gibbs sampling we can see im- error rates that are almost 1% higher than the rates reported ages from the model's class-conditional distributions. Figure above. 9 shows a sequence of images for each class that were gener- A better method is to first fix the binary states of the 500 ated by allowing 1000 iterations of Gibbs sampling between units in the lower layer of the associative memory and to then samples. turn on each of the label units in turn and compute the ex- We can also initialize the state of the top two layers by act free energy of the resulting 510 component binary vector. providing a random binary image as input. Figure 10 shows Almost all the computation required is independent of which how the class-conditional state of the associative memory label unit is turned on [Teh and Hinton, 2001] and this method then evolves when it is allowed to run freely, but with the computes the exact conditional equilibrium distribution over label clamped. This internal state is "observed" by perform- labels instead of approximating it by Gibbs sampling which ing a down-pass every 20 iterations to see what the associa- is what the previous method is doing. This method gives er- tive memory has in mind. This use of the word "mind" is ror rates that are about 0.5% higher than the ones quoted be- not metaphorical. A mental state is the state of a hypotheti- cause of the stochastic decisions made in the up-pass. We cal world in which a high-level internal representation would can remove this noise in two ways. The simplest is to make constitute veridical perception. That hypothetical world is the up-pass deterministic by using probabilities of activation what the figure shows. in place of stochastic binary states. The second is to repeat the stochastic up-pass twenty times and average either the la- 10 Conclusion bel probabilities or the label log probabilities over the twenty repetitions before picking the best one. The two types of av- The network in figure 6 has about as many parameters as 0.002 cubic millimeters of mouse cortex. Several hundred erage give almost identical results and these results are also very similar to using a deterministic up-pass, which was the networks of this complexity would fit within a single voxel of method used for the reported results. a high resolution fMRI scan. Learning algorithms are getting better, but they still have a long way to go. 9 Looking into the mind of a neural network To generate samples from the model, we perform alternating Gibbs sampling in the top-level associative memory until the Markov chain converges to the equilibrium distribution. Then Acknowledgments Conf. on Computer Vision and Pattern Recognition, June 2005. The ideas presented in this paper came from research col- laborations with Terry Sejnowski, Yann LeCun, Jay McClel- [Rumelhart et al., 1986] D. E. Rumelhart, G. E. Hinton, leand, Radford Neal, Rich Zemel, Peter Dayan, Michael Jor- and R. J. Williams. Learning representations by back- dan, Brendan Frey, Zoubin Ghahramani, Yee-Whye Teh, Max propagating errors. Nature, 323:533­536, 1986. Welling, Simon Osindero and many other researchers. An- [Teh and Hinton, 2001] Y.W. Teh and G. E. Hinton. Rate- driy Mnih helped to prepare the manuscript. The research coded restricted Boltzmann machines for face recognition. was supported by NSERC, the Gatsby Charitable Founda- In Advances in Neural Information Processing Systems, tion, CFI and OIT. GEH holds a Canada Research Chair in volume 13, 2001. machine learning. [Teh et al., 2003] Y.W. Teh, M. Welling, S. Osindero, and G. E. Hinton. Energy-based models for sparse overcom- References plete representations. Journal of Machine Learning Re- [Carreira-Perpinan and Hinton, 2005] M. A. Carreira- search, 4:1235­1260, Dec 2003. Perpinan and G. E. Hinton. On contrastive divergence [Welling et al., 2003] M. Welling, G. Hinton, and S. Osin- learning. In Artificial Intelligence and Statistics, 2005, dero. Learning sparse topographic representations with 2005. products of Student-t distributions. In S. Thrun S. Becker [Freund, 1995] Y. Freund. Boosting a weak learning al- and K. Obermayer, editors, Advances in Neural Informa- gorithm by majority. Information and Computation, tion Processing Systems 15, pages 1359­1366. MIT Press, 12(2):256 ­ 285, 1995. Cambridge, MA, 2003. [Friedman and Stuetzle, 1981] J.H. Friedman and W. Stuet- [Zemel and Hinton, 1995] R. S. Zemel and G. E. Hinton. zle. Projection pursuit regression. Journal of the American Learning population codes by minimizing description Statistical Association, 76:817­823, 1981. length. Neural Computation, 7:549­564, 1995. [Hinton and Sejnowski, 1986] G. E. Hinton and T. J. Se- jnowski. Learning and relearning in Boltzmann machines. In D. E. Rumelhart and J. L. McClelland, editors, Paral- lel Distributed Processing: Volume 1: Foundations, pages 282­317. MIT Press, Cambridge, 1986. [Hinton et al., 1995] G. E. Hinton, P. Dayan, B. J. Frey, and R. Neal. The wake-sleep algorithm for self-organizing neural networks. Science, 268:1158­1161, 1995. [Hinton, 2002] G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1711­1800, 2002. [Jordan et al., 1999] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. In M. I. Jordan, editor, Learning in Graphical Models. MIT Press, Cambridge, MA, 1999. [Mayraz and Hinton, 2001] G. Mayraz and G. E. Hinton. Recognizing hand-written digits using hierarchical prod- ucts of experts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24:189­197, 2001. [Neal and Hinton, 1998] R. M. Neal and G. E. Hinton. A new view of the EM algorithm that justifies incremental, sparse and other variants. In M. I. Jordan, editor, Learning in Graphical Models, pages 355--368. Kluwer Academic Publishers, 1998. [Neal, 1992] R. Neal. Connectionist learning of belief net- works. Artificial Intelligence, 56:71­113, 1992. [Pearl, 1988] J. Pearl. Probabilistic Inference in Intelligent Systems: Networks of Plausible Inference. Morgan Kauf- mann, San Mateo, CA, 1988. [Roth and Black, 2005] S. Roth and M. J. Black. Fields of experts: A framework for learning image priors. In IEEE