# Types of Boltzmann Machines

Deep Learning models are broadly classified into supervised and unsupervised models.

**Supervised DL models:**

- Artificial Neural Networks (ANNs)
- Recurrent Neural Networks (RNNs)
- Convolutional Neural Networks (CNNs)

**Unsupervised DL models:**

- Self Organizing Maps (SOMs)
- Boltzmann Machines
- Autoencoders

Let us learn what exactly Boltzmann machines are, how they work, and how they can be used in a recommender system that predicts whether a user will like a movie based on the movies they have previously watched.

A **Boltzmann Machine** is an unsupervised DL model in which every node is connected to every other node. Unlike ANNs, CNNs, RNNs and SOMs, Boltzmann Machines are **undirected** (that is, the connections are bidirectional). A Boltzmann Machine is not a deterministic DL model but a **stochastic**, **generative** DL model. It is best thought of as a representation of a certain system. There are two types of nodes in a Boltzmann Machine: **visible nodes**, which we can and do measure, and **hidden nodes**, which we cannot or do not measure. Although the node types are different, the Boltzmann Machine treats them the same way, and everything works as one single system. The training data is fed into the Boltzmann Machine and the weights of the system are adjusted accordingly. Because a Boltzmann Machine learns how the system behaves under normal conditions, it can help us detect abnormalities.

**Energy-Based Models:**

The **Boltzmann Distribution** is used as the sampling distribution of the Boltzmann Machine. It is governed by the equation:

$$
P_i = \frac{e^{-\epsilon_i / kT}}{\sum_j e^{-\epsilon_j / kT}}
$$

where:

- $P_i$ is the probability of the system being in state $i$
- $\epsilon_i$ is the energy of the system in state $i$
- $T$ is the temperature of the system
- $k$ is the Boltzmann constant
- $\sum_j e^{-\epsilon_j / kT}$ is the sum of values over all possible states of the system

The Boltzmann Distribution describes the probabilities of the different states of a system, and Boltzmann machines use this distribution to generate different states of the machine. From the above equation, as the energy of a state increases, the probability of the system being in that state decreases. Thus, the system is most stable in its lowest energy state (a gas is most stable when it spreads out). In Boltzmann machines, the energy of the system is defined in terms of the **weights of the synapses**. Training adjusts these weights; once they are set, the system always tries to settle into the lowest energy state available to it.
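The relationship between energy and probability can be checked numerically. The following is a minimal sketch, not part of any library: the function name `boltzmann_probabilities`, the combined parameter `kT` and the three energy values are all illustrative assumptions.

```python
import math

def boltzmann_probabilities(energies, kT=1.0):
    """Return P_i = exp(-E_i / kT) / sum_j exp(-E_j / kT) for each state."""
    # Shift by the minimum energy for numerical stability; the shift
    # cancels between numerator and denominator.
    e_min = min(energies)
    weights = [math.exp(-(e - e_min) / kT) for e in energies]
    z = sum(weights)  # normalising sum over all states
    return [w / z for w in weights]

# Illustrative energies for three states (assumed values).
energies = [0.0, 1.0, 2.0]
probs = boltzmann_probabilities(energies)
print(probs)
```

The lowest-energy state receives the highest probability, and raising `kT` flattens the distribution.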

**Types of Boltzmann Machines:**

- Restricted Boltzmann Machines (RBMs)
- Deep Belief Networks (DBNs)
- Deep Boltzmann Machines (DBMs)

**Restricted Boltzmann Machines (RBMs):**

In a full Boltzmann machine, each node is connected to every other node, so the number of connections grows very quickly as nodes are added (**quadratically** in the number of nodes) and training becomes impractical. This is the reason we use RBMs. The restrictions on the node connections in RBMs are as follows:

- Hidden nodes cannot be connected to one another.
- Visible nodes cannot be connected to one another.
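To see how much the restriction reduces the number of connections, here is a small sketch that counts them. It assumes one undirected connection per pair of nodes in the full machine, and connections only between a visible node and a hidden node in the RBM. The function names and the layer sizes are illustrative.

```python
def full_bm_connections(n_visible, n_hidden):
    """Every pair of nodes is connected: n * (n - 1) / 2 connections."""
    n = n_visible + n_hidden
    return n * (n - 1) // 2

def rbm_connections(n_visible, n_hidden):
    """Only visible-hidden pairs are connected: n_visible * n_hidden."""
    return n_visible * n_hidden

# Example sizes (assumed): 6 visible nodes and 3 hidden nodes.
print(full_bm_connections(6, 3))  # 36
print(rbm_connections(6, 3))      # 18
```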

**Energy function example for Restricted Boltzmann Machine –**

$$
E(v, h) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_i \sum_j v_i w_{ij} h_j
$$

where:

- $a$, $b$ are the biases in the system (constants)
- $v_i$, $h_j$ are a visible node and a hidden node

The probability of being in a certain state is:

$$
P(v, h) = \frac{e^{-E(v, h)}}{Z}
$$

where $Z$ is the sum of values over all possible states.
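For a very small RBM, the energy function and the probability of a state can be evaluated directly by enumerating every configuration. The sketch below assumes binary (0/1) nodes. The parameter values are made up for illustration, and brute-force enumeration is only feasible for a handful of nodes.

```python
import itertools
import math

def energy(v, h, a, b, W):
    """E(v, h) = -sum_i a_i v_i - sum_j b_j h_j - sum_ij v_i W[i][j] h_j."""
    visible_term = sum(a_i * v_i for a_i, v_i in zip(a, v))
    hidden_term = sum(b_j * h_j for b_j, h_j in zip(b, h))
    interaction = sum(v[i] * W[i][j] * h[j]
                      for i in range(len(v)) for j in range(len(h)))
    return -visible_term - hidden_term - interaction

def partition_function(a, b, W):
    """Z: sum of exp(-E) over every binary (v, h) configuration."""
    return sum(
        math.exp(-energy(v, h, a, b, W))
        for v in itertools.product([0, 1], repeat=len(a))
        for h in itertools.product([0, 1], repeat=len(b))
    )

def joint_probability(v, h, a, b, W):
    """P(v, h) = exp(-E(v, h)) / Z."""
    return math.exp(-energy(v, h, a, b, W)) / partition_function(a, b, W)

# Hypothetical parameters: 2 visible nodes, 2 hidden nodes.
a = [0.1, -0.2]                 # visible biases
b = [0.0, 0.3]                  # hidden biases
W = [[0.5, -0.4], [0.2, 0.1]]   # W[i][j] links visible i and hidden j

print(energy((1, 0), (1, 0), a, b, W))
print(joint_probability((1, 0), (1, 0), a, b, W))
```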

Suppose that we are using our RBM to build a recommender system that works on six (6) movies. The RBM learns how to allocate its hidden nodes to certain features. Through the process of **Contrastive Divergence**, we bring the RBM close to our set of movies, that is, to our case or scenario. The training data for each movie is 1 if a user liked the movie, 0 if the user disliked it, or missing if the user did not watch it. Through the training process, the RBM automatically identifies which features are important.

**Contrastive Divergence:**

An **RBM** adjusts its weights by this method. Starting from randomly assigned initial weights, the RBM calculates the hidden nodes, which in turn use the same weights to reconstruct the input nodes. Each hidden node is constructed from all the visible nodes, and each visible node is reconstructed from all the hidden nodes; hence the reconstructed input differs from the original input even though the weights are the same. The process continues until the reconstructed input matches the previous input, at which point the process is said to have converged. This back-and-forth sampling procedure is known as **Gibbs Sampling**.
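One pass of this procedure, from the visible nodes to the hidden nodes and back, can be sketched as follows. The sketch assumes binary nodes and uses the standard sigmoid conditionals for a binary RBM, which the text above does not spell out. All function names, layer sizes and the initial weight scale are illustrative assumptions.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample_hidden(v, W, b, rng):
    """Sample each hidden node from p(h_j = 1 | v) = sigmoid(b_j + sum_i v_i W[i][j])."""
    probs = [sigmoid(b[j] + sum(v[i] * W[i][j] for i in range(len(v))))
             for j in range(len(b))]
    return [1 if rng.random() < p else 0 for p in probs]

def sample_visible(h, W, a, rng):
    """Reconstruct each visible node from p(v_i = 1 | h), using the same weights."""
    probs = [sigmoid(a[i] + sum(W[i][j] * h[j] for j in range(len(h))))
             for i in range(len(a))]
    return [1 if rng.random() < p else 0 for p in probs]

def gibbs_step(v, W, a, b, rng):
    """One visible -> hidden -> visible pass."""
    h = sample_hidden(v, W, b, rng)
    v_reconstructed = sample_visible(h, W, a, rng)
    return h, v_reconstructed

rng = random.Random(0)            # fixed seed so runs are repeatable
n_visible, n_hidden = 6, 2        # assumed sizes
W = [[rng.gauss(0.0, 0.1) for _ in range(n_hidden)] for _ in range(n_visible)]
a = [0.0] * n_visible             # visible biases
b = [0.0] * n_hidden              # hidden biases

v0 = [0, 0, 1, 0, 1, 0]           # an example input
h0, v1 = gibbs_step(v0, W, a, b, rng)
print(h0, v1)
```

Repeating `gibbs_step` on its own output gives the chain of reconstructions described above.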

The **Gradient Formula** gives the gradient of the log probability of a given state of the system with respect to the weights of the system. It is given as follows:

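$$
\frac{\partial}{\partial w_{ij}} \log P(v^0) = \langle v_i^0 h_j^0 \rangle - \langle v_i^\infty h_j^\infty \rangle
$$

where:

- $v$ is the visible state and $h$ is the hidden state
- $\langle v_i^0 h_j^0 \rangle$ is the initial state of the system
- $\langle v_i^\infty h_j^\infty \rangle$ is the final state of the system
- $P(v^0)$ is the probability that the system is in state $v^0$
- $w_{ij}$ are the weights of the system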

The above equation tells us how a change in the weights of the system changes the log probability of the system being in a particular state. The system tries to end up in the lowest possible energy state (the most stable one). Instead of continuing the weight-adjustment process until the current input matches the previous one, we can consider only the first few passes. This is sufficient to understand how to adjust the energy curve so as to reach the lowest energy state. Therefore, we adjust the weights, which reshapes the system and its energy curve, so that we get the lowest energy for the current position. This is known as **Hinton's shortcut**.
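Under this shortcut, the "final state" term of the gradient is replaced by the state reached after only a few passes. A minimal sketch of the resulting weight update for a single training example is shown below. The function name `cd_weight_update`, the `learning_rate` parameter and the example states are assumptions made for illustration. In practice the statistics would come from sampling and would usually be averaged over many examples.

```python
def cd_weight_update(W, v0, h0, vk, hk, learning_rate=0.1):
    """Return weights after W[i][j] += lr * (v0_i * h0_j - vk_i * hk_j).

    (v0, h0) is the initial state; (vk, hk) is the state after k passes,
    standing in for the final state in the gradient formula.
    """
    return [
        [W[i][j] + learning_rate * (v0[i] * h0[j] - vk[i] * hk[j])
         for j in range(len(h0))]
        for i in range(len(v0))
    ]

# Illustrative states for 2 visible and 2 hidden nodes (assumed values).
W = [[0.0, 0.0], [0.0, 0.0]]
v0, h0 = [1, 0], [1, 1]   # initial visible and hidden states
vk, hk = [0, 1], [1, 0]   # states after a few passes
W_new = cd_weight_update(W, v0, h0, vk, hk)
print(W_new)
```

Weights between nodes that are active together in the data are pushed up, and weights between nodes that are active together only in the reconstruction are pushed down.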

**Working of RBM – Illustrative Example –**

Consider Mary, who watches four of the six available movies and rates all four of them. Say she watched $m_1$, $m_3$, $m_4$ and $m_5$; she likes $m_3$ and $m_5$ (rated 1) and dislikes the other two, $m_1$ and $m_4$ (rated 0), whereas the remaining two movies, $m_2$ and $m_6$, are unrated. Now, using our RBM, we will recommend one of these unrated movies for her to watch next. Say:

- $m_3$, $m_5$ are of the 'Drama' genre.
- $m_1$, $m_4$ are of the 'Action' genre.
- 'DiCaprio' played a role in $m_5$.
- $m_3$, $m_5$ have won an 'Oscar'.
- 'Tarantino' directed $m_4$.
- $m_2$ is of the 'Action' genre.
- $m_6$ is of both the 'Action' and 'Drama' genres, 'DiCaprio' acted in it, and it has won an 'Oscar'.

We have the following observations –

- Mary likes $m_3$, $m_5$ and they are of the 'Drama' genre, so she probably **likes 'Drama'** movies.
- Mary dislikes $m_1$, $m_4$ and they are of the 'Action' genre, so she probably **dislikes 'Action'** movies.
- Mary likes $m_3$, $m_5$ and they have won an 'Oscar', so she probably **likes** an **'Oscar'** movie.
- Since 'DiCaprio' acted in $m_5$ and Mary likes it, she will probably **like** a movie in which **'DiCaprio'** acted.
- Mary does not like $m_4$, which is directed by Tarantino, so she probably **dislikes** any movie directed by **'Tarantino'**.

Therefore, based on these observations and the details of $m_2$ and $m_6$, our RBM **recommends $m_6$** to Mary ('Drama', 'DiCaprio' and 'Oscar' match both Mary's interests and $m_6$). This is how an RBM works, and this is why RBMs are used to build recommender systems.
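The reasoning in this example can be mimicked with a tiny hand-wired network. The sketch below is only an illustration. The weights and biases are set by hand to mirror the features listed above rather than learned through contrastive divergence, and the hidden nodes are labelled for readability. A trained RBM would discover its own unlabelled features from many users' ratings. It also uses activation probabilities directly instead of sampling, so the result is deterministic.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

movies = ["m1", "m2", "m3", "m4", "m5", "m6"]
features = ["Drama", "Action", "Oscar", "DiCaprio", "Tarantino"]

# Which feature each movie has, following the example in the text.
movie_features = {
    "m1": {"Action"},
    "m2": {"Action"},
    "m3": {"Drama", "Oscar"},
    "m4": {"Action", "Tarantino"},
    "m5": {"Drama", "Oscar", "DiCaprio"},
    "m6": {"Action", "Drama", "Oscar", "DiCaprio"},
}

# Hand-set parameters (assumptions, NOT learned values).
WEIGHT, VISIBLE_BIAS, HIDDEN_BIAS = 2.0, -2.0, -1.0

def weight(movie, feature):
    """Positive weight when the movie has the feature, otherwise zero."""
    return WEIGHT if feature in movie_features[movie] else 0.0

# Mary's ratings: 1 = liked, 0 = disliked; unrated movies are left out.
ratings = {"m1": 0, "m3": 1, "m4": 0, "m5": 1}

# Upward pass: activation probability of each feature node given the ratings.
hidden = {
    f: sigmoid(HIDDEN_BIAS + sum(r * weight(m, f) for m, r in ratings.items()))
    for f in features
}

# Downward pass: predicted probability that Mary likes each unrated movie.
scores = {
    m: sigmoid(VISIBLE_BIAS + sum(weight(m, f) * hidden[f] for f in features))
    for m in movies if m not in ratings
}

recommended = max(scores, key=scores.get)
print(recommended, scores)
```

With these assumed parameters, the 'Drama', 'Oscar' and 'DiCaprio' nodes activate strongly, so `m6` scores well above `m2`.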

**Deep Belief Networks (DBNs):**

Suppose we stack several RBMs on top of each other so that the outputs of the first RBM are the inputs to the second RBM, and so on. Such networks are known as Deep Belief Networks. Within each individual RBM, the connections are undirected. Once the RBMs are stacked into a DBN, however, the connections between the layers are directed, except for the top two layers, where the connection remains undirected. There are two ways to train DBNs:

- **Greedy Layer-wise Training Algorithm** – The RBMs are trained layer by layer. Once the individual RBMs are trained (that is, their parameters, the weights and biases, are set), the direction is set up between the DBN layers.
- **Wake-Sleep Algorithm** – The DBN is trained all the way up the network (connections going up: wake) and then down the network (connections going down: sleep).

Therefore, we stack the RBMs, train them, and once we have the parameters trained, we make sure that the connections between the layers only work downwards (except for the top two layers).
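The layer-by-layer stacking can be sketched as follows. This is a simplified illustration under several assumptions: binary input data, a single contrastive-divergence pass (CD-1) per update, hidden probabilities passed upward as the next layer's input, and made-up layer sizes, learning rate and epoch count. It covers only the greedy pre-training step, not the wake-sleep algorithm or the setting of connection directions.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hidden_probs(v, W, b):
    return [sigmoid(b[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(b))]

def visible_probs(h, W, a):
    return [sigmoid(a[i] + sum(W[i][j] * h[j] for j in range(len(h))))
            for i in range(len(a))]

def sample(probs, rng):
    return [1 if rng.random() < p else 0 for p in probs]

def train_rbm(data, n_hidden, rng, epochs=50, lr=0.1):
    """Train a single RBM with CD-1 and return its parameters (W, a, b)."""
    n_visible = len(data[0])
    W = [[rng.gauss(0.0, 0.1) for _ in range(n_hidden)] for _ in range(n_visible)]
    a = [0.0] * n_visible
    b = [0.0] * n_hidden
    for _ in range(epochs):
        for v0 in data:
            ph0 = hidden_probs(v0, W, b)               # upward pass
            h0 = sample(ph0, rng)
            v1 = sample(visible_probs(h0, W, a), rng)  # reconstruction
            ph1 = hidden_probs(v1, W, b)
            for i in range(n_visible):
                for j in range(n_hidden):
                    W[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
                a[i] += lr * (v0[i] - v1[i])
            for j in range(n_hidden):
                b[j] += lr * (ph0[j] - ph1[j])
    return W, a, b

def train_dbn(data, layer_sizes, rng):
    """Greedy layer-wise training: each RBM learns from the layer below it."""
    layers = []
    layer_input = data
    for n_hidden in layer_sizes:
        W, a, b = train_rbm(layer_input, n_hidden, rng)
        layers.append((W, a, b))
        # The hidden activations of this RBM become the next RBM's input.
        layer_input = [hidden_probs(v, W, b) for v in layer_input]
    return layers

# Toy binary data (assumed): 3 examples with 6 visible nodes each.
data = [[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1], [1, 1, 0, 0, 0, 0]]
layers = train_dbn(data, [4, 2], random.Random(0))
print([(len(W), len(W[0])) for W, _, _ in layers])  # weight shapes per layer
```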

**Deep Boltzmann Machines (DBMs):**

DBMs are similar to DBNs, except that the connections between all the layers are **undirected** (unlike a DBN, in which the connections between layers are directed, apart from the top two). DBMs can extract more complex or sophisticated features and hence can be used for more complex tasks.