Unintentionally inconsiderate: Learning to learn
http://rocknrollnerd.github.io/
Thu, 26 Nov 2015 13:35:51 +0000

<h1>Wrong whales</h1>
<p>OK, it’s been quite some time since I wrote anything here: partly because of all the stuff going on in my life (the thesis defense process and so on), but mostly because of a machine learning project that has been taking a lot of my time. I mean <a href="https://www.kaggle.com/c/noaa-right-whale-recognition">Kaggle’s right whale recognition challenge</a>.</p>
<p class="center"><img src="https://kaggle2.blob.core.windows.net/competitions/kaggle/4521/media/ChristinKhan_RightWhaleMomCalf_640.png" alt="" /></p>
<p>The challenge consists of two stages: first construct a “whale face detection” algorithm that can extract whale heads from an aerial-view image like the one displayed on top, and then make a classification model that can discriminate between different whales based on the head’s <a href="http://www.neaq.org/conservation_and_research/projects/endangered_species_habitats/right_whale_research/right_whale_projects/monitoring_individuals_and_family_trees/identifying_with_photographs/how_it_works/callosity_patterns.php">callosity pattern</a>, which is considered the main distinguishing feature.</p>
<p>I’m still struggling with the first problem, which, although presented by the organizers as a quite simple step (they even included some <a href="https://www.kaggle.com/c/noaa-right-whale-recognition/details/creating-a-face-detector-for-whales">guidelines</a> on that), appears to be quite non-trivial. So I’ll try to explain the algorithm I’m working on and the path that led me to it here in some detail, without actually sharing the code.</p>
<h1 id="why-is-it-non-trivial">Why is it non-trivial</h1>
<p class="center"><img src="/assets/article_images/2015-11-26-right-whales/easy_whales.jpg" alt="" /></p>
<p>So when you look at some of the images, it actually appears quite easy to detect heads without any machine learning at all: there are large, distinct white-pattern tiles that can be selected with some simple hand-crafted filter. The background is always water; different colours are possible, but still, there’s no background clutter and no extra objects that would require scene segmentation. There’s always one point of view (from above) and one whale per picture. It looks so simple compared to human face recognition, where faces can appear against lots of different backgrounds and take lots of different 3D poses.</p>
<p>But then there are different pictures like these:</p>
<p class="center"><img src="/assets/article_images/2015-11-26-right-whales/hard_whales.jpg" alt="" /></p>
<p>Well, now… that makes things a bit more complicated. There are extra foam patterns which are bright and white and can mess with face detection, and the sun is clearly not helping with all the bright spots and reflections. Sometimes the head pattern is heavily obscured by water and foam (see the top-right picture, for example), and sometimes it seems really hard to distinguish it from some other body part. For example, which picture corresponds to a head here?</p>
<p class="center"><img src="/assets/article_images/2015-11-26-right-whales/tail_head.png" alt="" /></p>
<p>The left one is a lower-body patch, the right one is the actual head; but the distinction is quite subtle when you look at both patches without any context.</p>
<p class="center"><img src="/assets/article_images/2015-11-26-right-whales/tail.jpg" alt="" /></p>
<h1 id="the-first-attempt-naive-sliding-window-detection">The first attempt: naive sliding window detection</h1>
<p>I decided to ignore the cascade detector approach suggested by the organizers and go straight to a convolutional network. I annotated a subset of 500 whales (I thought it would take a lot of time, but I was actually done in a couple of nights), wrote a script that extracts head patches and samples a bunch of random background patches from a single image, and made a really simple CNN with Theano and Lasagne (basically I just tried to build something like VGGNet: lots of layers with small 3x3 kernels). I trained the network on my laptop: after <em>finally</em> installing the correct Nvidia driver (and by “correct” I mean “the one that doesn’t make my Ubuntu boot into a black screen”) I could fully use the almighty power of my GT 650M.</p>
<p>The results, however, were quite bad: the network produced lots and lots of false positives, labeling as positive almost every patch that wasn’t monochromatic background water. I finally managed to get something useful out of it by selecting the most confident patch (the one the network detected with, say, 99% confidence; there were lots of less certain positives with scores around 65%), and, well, some results were obtained:</p>
<p class="center"><img src="/assets/article_images/2015-11-26-right-whales/right_whales.jpg" alt="" /></p>
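<p>The max-confidence selection step can be sketched roughly like this (a minimal reconstruction, not the actual code; <code>predict_proba</code> is a hypothetical stand-in for the trained CNN, and the patch/stride sizes are made up):</p>

```python
import numpy as np

def best_patch(image, predict_proba, patch=96, stride=32):
    """Scan the image with a sliding window and return the patch
    the classifier is most confident about, plus its score."""
    h, w = image.shape[:2]
    best_score, best_box = -1.0, None
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            score = predict_proba(image[y:y + patch, x:x + patch])
            if score > best_score:
                best_score, best_box = score, (x, y, patch, patch)
    return best_box, best_score
```

<p>With no multi-scale search and a single winner per image, this is exactly as naive as described below.</p>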
<p>Of course it was an extremely naive approach: I didn’t even use different scales of the image, and I used a poorly augmented dataset (no arbitrary rotations, just horizontal and vertical flips), and yet it worked, at least partly. And, of course, I immediately ran into the expected problems:</p>
<p class="center"><img src="/assets/article_images/2015-11-26-right-whales/wrong_whales.jpg" alt="" /></p>
<p>Damn you, foam patches. Look at the bottom-right picture: the head is barely visible, almost covered by water, and the brightest white patches correspond to foam. Then it’s just a matter of luck whether the pattern looks similar to a whale head, and bam, we’re lost.</p>
<h1 id="context-matters">Context matters</h1>
<p>So the most important thing we’ve understood so far is that local patch-based detectors handle this problem poorly. This, I think, is what makes the problem so interesting: consider, for example, the difference between the ImageNet localization challenge and this one. An object in an ImageNet image usually occupies a lot of space and is quite visible; there are hardly any similar-looking false-positive background patches, and not much extra space to search. The main difficulty there is that the objects can take lots of different shapes and points of view, and can belong to different classes. This problem is quite different: sometimes the object we have to find is so occluded, or looks so much like background, that even a human eye can be confused <em>when it looks directly at it</em>.</p>
<p>Well, but that same human eye is actually <em>able</em> to find the head in that ambiguous bottom-right picture, right? We can do it simply by looking at the spot where the whale’s body ends. So an interesting idea emerges: <strong>to recognize the thing, you have to look somewhere near it, not directly at it</strong>. In other words, context matters. To locate a whale’s head, we need to know the location of the whale’s body first. And maybe <em>that</em> should be at least a little bit easier.</p>
<h1 id="looking-for-body">Looking for body</h1>
<p>Okay, the problem changes slightly, but it’s still object localization. The <a href="https://www.kaggle.com/c/noaa-right-whale-recognition/forums">forums</a> seem to be full of pure computer-vision solutions for whale body detection: people use color filtering and histogram similarity; I’ve also tried local entropy and random walker segmentation (just for fun, basically).</p>
<p class="center"><img src="/assets/article_images/2015-11-26-right-whales/entropy.jpg" alt="" /></p>
<p><em>Measuring local entropy with scikit-image. Sometimes it’s quite good, other times waves and foam create enough entropy to render it meaningless.</em></p>
<p>After all that, I returned to convolutional networks, simultaneously looking up more sophisticated approaches like <a href="http://arxiv.org/abs/1312.6229">OverFeat</a> and <a href="http://arxiv.org/abs/1311.2524">R-CNN</a>. My search led me to a slide <a href="https://www.google.ru/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=0ahUKEwiMnZmM7a3JAhWIGCwKHVilC_0QFggfMAA&url=http%3A%2F%2Fcourses.cs.tau.ac.il%2FCaffe_workshop%2FBootcamp%2FLecture%25206%2520CNN%2520-%2520detection.pptx&usg=AFQjCNFlswHpK0kC6bVL1KRN1oDnl4kfyg&sig2=LqH0hdb0_ELK2uc4BiHDeg&bvm=bv.108194040,d.bGg&cad=rja">comparison</a> of these methods, which brought some clarity into what’s going on in CNN-based object localization. Unfortunately, neither R-CNN nor OverFeat seemed quite good enough for what I had in mind: detecting whale bodies with <em>oriented</em> (tight) bounding boxes. Like this (left) and <em>not</em> like this (right):</p>
<p class="center"><img src="/assets/article_images/2015-11-26-right-whales/bboxes.jpg" alt="" /></p>
<p>But then I stumbled upon a <a href="http://papers.nips.cc/paper/5207-deep-neural-networks-for-object-detection.pdf">Szegedy paper</a> that suggests using a convolutional network as a model that predicts a black-and-white mask corresponding to the object’s location. This is really cool because we’re now allowed to use arbitrary “bounding shapes” if we like, but also because it’s so simple. We don’t need sliding windows, different scales or extra feature extraction: just feed the whole image into the network, and output a black-and-white version of it, maybe downscaled. So I decided to give it a try and began to annotate whale bodies.</p>
<p>(I use <a href="http://sloth.readthedocs.org/">Sloth</a> for annotations, which doesn’t support oriented bounding boxes as annotation items; on the other hand, the docs suggest the possibility of creating a custom item from Qt’s graphics toolset. I tried that and failed, so my annotations ended up being approximately rectangular polygons, which I later fitted to rectangles using some geometry.)</p>
<p>So, the pipeline looked somewhat like this:</p>
<ul>
<li>annotate the dataset with polygons in Sloth</li>
<li>run the annotations through a script that fits oriented rectangles to the polygons (nothing fancy here: estimate the center, locate two groups of far-away vertices, calculate the slope and then the width and height; an oriented bounding box is described by 5 parameters: center coordinates (ox, oy), width, height and angle)</li>
<li>paint a black-and-white mask and store it alongside the original image</li>
</ul>
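<p>The mask-painting step (the last bullet) can be sketched like this; a minimal version using PIL’s <code>ImageDraw</code>, with a made-up polygon and image size (the actual annotation format isn’t shared in the post):</p>

```python
import numpy as np
from PIL import Image, ImageDraw

def paint_mask(polygon, size):
    """Rasterize an annotation polygon into a black-and-white mask:
    white inside the whale body, black everywhere else."""
    mask = Image.new('L', size, 0)                  # all-black image
    ImageDraw.Draw(mask).polygon(polygon, fill=255)
    return np.array(mask)

# a hypothetical quadrilateral annotation on a 100x80 image
mask = paint_mask([(10, 10), (90, 20), (80, 60), (15, 50)], size=(100, 80))
```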
<p>It’s actually possible to skip the second step: since we’re allowed to use arbitrary shapes for the object location, there’s no need to make them strict rectangles. But there were two reasons for it: first, after we locate the body we need to crop it out of the image, which means fitting a rectangle to the resulting mask, so it’s better if the network outputs a shape that’s already close to rectangular. The second reason is that I also tried a slightly different approach, described in another Szegedy paper, where the network predicts not the mask but the 5 output parameters directly (ox, oy, width, height and angle). This didn’t work out well, unfortunately.</p>
<p>So after my train set was ready, I spent a week or so tinkering with the network parameters and ended up with quite nice predictions:</p>
<p class="center"><img src="/assets/article_images/2015-11-26-right-whales/mask.png" alt="" /></p>
<p><em>The last panel displays the resulting bounding box, fitted to the mask. Again nothing fancy, just some thresholding and PCA.</em></p>
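<p>A minimal sketch of what that thresholding-and-PCA fit might look like (my own reconstruction in numpy, not the actual code; the 1:3 aspect-ratio constraint mentioned later is left out):</p>

```python
import numpy as np

def fit_oriented_box(mask, threshold=0.5):
    """Threshold the predicted mask, then run PCA on the white pixels
    to get an oriented box (ox, oy, width, height, angle)."""
    ys, xs = np.nonzero(mask > threshold)
    pts = np.column_stack([xs, ys]).astype(float)
    ox, oy = pts.mean(axis=0)
    # principal axes of the white-pixel cloud
    eigvals, eigvecs = np.linalg.eigh(np.cov((pts - [ox, oy]).T))
    major = eigvecs[:, np.argmax(eigvals)]        # the whale's long axis
    angle = np.arctan2(major[1], major[0])
    # express the pixels in the box frame to measure extents
    rot = np.array([[np.cos(angle),  np.sin(angle)],
                    [-np.sin(angle), np.cos(angle)]])
    local = (pts - [ox, oy]) @ rot.T
    width, height = local.max(axis=0) - local.min(axis=0)
    return ox, oy, width, height, angle
```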
<p>Some nuances and tips I’ve encountered during the training:</p>
<ul>
<li>augmenting your dataset is key. I started with 10 random rotations per image and got quite poor results; extending that to 50 rotations per image improved things considerably. Unfortunately, my laptop couldn’t handle that anymore, so I moved to an Amazon EC2 GPU instance.</li>
<li>the paper suggests using a slightly different objective function that penalizes “predicting a black pixel where there’s actually a white pixel” more than the other way around. This is based on the fact that the masks are mostly black with just small white parts, so the network can fall into the trivial solution of always predicting black images (maybe with a centered bright spot). I couldn’t quite understand Szegedy’s function and came up with my own, which looks like this: <code>(prediction - target) ** 2 * (target * _lambda + 1)</code>, where <code>_lambda</code> controls the penalty (and <code>target</code> is the binary target mask). But at some point I decided not to use it at all: standard mean squared error was just fine. Maybe because whales are quite large and occupy enough white space that the network can’t just ignore them.</li>
<li>the VGGNet-like architecture (lots of layers, tiny kernels) showed worse results than the opposite approach: 4 convolutional layers and quite large kernels (9x9 and so on).</li>
<li>adaptive histogram equalization (CLAHE) applied to the images significantly improves the training score. Other than that and downscaling, I didn’t try any preprocessing steps.</li>
<li>after the convolutional layers I placed a couple of fully-connected layers approximately twice as big as the mask size. This turned out to be good enough.</li>
</ul>
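<p>The asymmetric objective from the second bullet can be written down directly; a numpy sketch (the real training code used Theano expressions, but the formula is the one quoted above, and the <code>_lambda</code> value here is an arbitrary illustration):</p>

```python
import numpy as np

def weighted_mse(prediction, target, _lambda=5.0):
    """MSE where the (target * _lambda + 1) factor multiplies the squared
    error only where the target mask is white, so missing a whale pixel
    costs more than hallucinating one."""
    return np.mean((prediction - target) ** 2 * (target * _lambda + 1))
```

<p>With <code>_lambda = 0</code> this reduces to the plain mean squared error that ended up working just fine.</p>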
<p>The “body pipeline” is basically: predict a mask, fit an oriented bounding box (I also constrained it to a 1:3 aspect ratio), rotate the image and crop the whale, ending up with something like this:</p>
<p class="center"><img src="/assets/article_images/2015-11-26-right-whales/extracted_whales.jpg" alt="" /></p>
<p><em>There are two whales in the last picture: an extremely rare case.</em></p>
<p>Not bad, huh.</p>
<h1 id="next-step-locating-the-head">Next step: locating the head</h1>
<p>Though I like the results so far (about 98% of whales correctly cropped), we still have to locate the head. And we cannot do that (as I secretly hoped) by simply chopping off the leftmost and rightmost parts of the cropped image (“look at the end of the body”), since the cropping is still quite inaccurate and includes water background.</p>
<p>But we’ve still won something: there’s much less space to search, we’ve thrown away lots of confusing foam patches and bright reflection spots, and most importantly, the whales are now rotated horizontally, which greatly reduces the spatial variance. So why don’t we use the same mask prediction technique again?</p>
<p>This is where I’m stuck right now, because head detection still works worse than body detection (about a 75% success rate). An average successful prediction looks like this:</p>
<p class="center"><img src="/assets/article_images/2015-11-26-right-whales/head_success.jpg" alt="" /></p>
<p>But quite often the network is still confused by the bright foam regions:</p>
<p class="center"><img src="/assets/article_images/2015-11-26-right-whales/head_failure.jpg" alt="" /></p>
<p>And sometimes, quite interestingly, it makes two predictions (when it cannot decide which part looks more like a head):</p>
<p class="center"><img src="/assets/article_images/2015-11-26-right-whales/head_double.jpg" alt="" /></p>
<p>I’m trying to solve the latter situation with <a href="https://en.wikipedia.org/wiki/DBSCAN">DBSCAN clustering</a>, which allows me to decide between multiple region predictions, but the decision rule is quite arbitrary (I always select the bigger region, with some randomness, which, strictly speaking, doesn’t have to correspond to the head).</p>
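<p>The multiple-regions situation boils down to “group the white pixels, then pick one group”. A sketch of that decision rule, using scipy’s connected-component labelling as a simpler stand-in for DBSCAN (and without the randomness; “biggest region wins” is the same arbitrary heuristic):</p>

```python
import numpy as np
from scipy import ndimage

def pick_head_region(mask, threshold=0.5):
    """Label connected white regions in the predicted mask and keep
    only the biggest one (which doesn't have to be the head)."""
    binary = mask > threshold
    labels, n = ndimage.label(binary)
    if n == 0:
        return binary
    sizes = ndimage.sum(binary, labels, index=range(1, n + 1))
    return labels == (int(np.argmax(sizes)) + 1)
```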
<p>So, at the moment, I’ve got a couple of ideas on how to improve it:</p>
<ul>
<li>try to make better body predictions. Can we, for example, make the box tighter, so that the head precisely touches the boundary? Maybe it’s time to use some kind of histogram comparison technique: considering that the outer parts of the “body” image are most likely water, we could take an average histogram statistic for the background and try to eliminate it.</li>
<li>somehow rotate all the whales in the dataset so that they share the same orientation. As you may notice, the body detection network cannot discriminate between right and left, but maybe we can make a separate (simpler) model that just predicts the orientation as a binary label? That would reduce the variance even more.</li>
<li>just throw more images into the dataset! Although I’m already dead tired of manual annotation: last time it was about 1000 images. But after seeing some wonderful person annotate the <a href="https://www.kaggle.com/c/noaa-right-whale-recognition/forums/t/17421/complete-train-set-head-annotations">whole training set</a>, the option starts to look feasible…</li>
</ul>
<p>Okay, that’s all for now. The interesting parts are still ahead (I haven’t yet tried the classification itself and haven’t made a single leaderboard submission), and I’m fascinated so far by the switch from toy projects and papers to an actually big machine learning project. I hope I’ll post an update on that soon.</p>
Thu, 26 Nov 2015 20:11:00 +0000
http://rocknrollnerd.github.io/ml/2015/11/26/right-whales.html

<h1>Probability and variational methods, part 1</h1>
<p>Apparently what started as a couple of lectures and assignments has turned into a completely different understanding of probability and machine learning that I’m still trying to digest. So this is going to be a big (and possibly really oversimplified) post.</p>
<h1 id="everything-is-random">Everything is random</h1>
<p>We start from the idea that the world around us, natively represented by our sensors, can be compressed into some kind of collection of smaller representations.</p>
<p>The truth of this assumption is not actually self-evident. We can roughly estimate the brain’s storage capacity by counting the possible number of connections between its neurons, somehow estimate the amount of information we experience on a daily basis (the latter seems kind of difficult, though), and compare the two. We can also suggest that the universe around us is not unique everywhere: it is composed of objects of repeated and similar structure, so at least <em>in theory</em> there’s some room for compression, and it would be natural for evolution to exploit this structure and save the brain’s computational power. We can also argue that the usual tasks our brain performs, including classification, recognition and pattern matching, assume the existence of some kind of common ground for objects in the same categories. (Imagine you have to make a classification rule for completely different objects/data vectors. You’ll need <em>extra</em> bits of information to explicitly map all the objects to their categories if there’s nothing common in the data <em>itself</em>.)</p>
<p>Anyway, we’re going to stick with the assumption for a while. There are some parameters that represent the structure of the data, they can be arranged in lots of different ways, maybe as a graphical model of some kind, maybe interacting with each other or maybe just a bunch of numbers that correspond to inner mechanics of the data. Doesn’t matter for now. Let’s call the parameters <script type="math/tex">\theta</script> and the data <script type="math/tex">X</script>, without any specifics considering their form (scalar/vector/matrix/etc).</p>
<p>Now let’s assume the data is random.</p>
<p>This also takes some time to digest. At least it did for me when I first thought about it, mostly because by “random” we usually mean something like complete mess, while the world we perceive seems solid, structured and deterministic (at least at large scale). This is, of course, a misconception (actually, even a single number can be a random variable, considering it corresponds to a <a href="https://en.wikipedia.org/wiki/Dirac_delta_function">delta distribution</a>). By saying that the data is random we just mean that there’s some kind of noise in our observations, which may correspond to the flaws of our sensors, or to unpredictable small changes in the world itself. When we see a bunch of handwritten examples of some digit, each image is unique, written slightly differently, and yet they all share the same structure, which corresponds to the underlying distribution of the data. Naturally, we’d like to configure the parameters <script type="math/tex">\theta</script> so that they describe this distribution and capture our knowledge of the world.</p>
<p>Suppose we toss a coin twice, not knowing anything about coins in advance, and it comes up first heads, then tails. Suppose we have the simplest parameter possible: <script type="math/tex">\theta</script>, a number from 0 to 1 that denotes the probability of heads. What value of <script type="math/tex">\theta</script> should we choose so that our model is consistent with the data? Given our one-parameter model, and the fact that the two tosses are independent, we can express the probability of the observed data given the parameters as <script type="math/tex">P(X\mid\theta)=P(H)P(T)=P(H)(1-P(H))=\theta(1-\theta)</script>. One possible way to find a good value of <script type="math/tex">\theta</script> is then to pick the value that <em>maximizes</em> this probability, which is actually the likelihood function (the method is called <strong>maximum likelihood estimation</strong>, or MLE). Let’s do that in the dumbest way possible: by brute-force search.</p>
<div class="highlight"><pre><code class="language-python" data-lang="python">import numpy as np

ml = 0
theta = 0
for t in np.linspace(0, 1, 1000):
    likelihood = t * (1 - t)
    if likelihood &gt; ml:
        ml = likelihood
        theta = t
print(theta)
# &gt;&gt; 0.49949949949949951</code></pre></div>
<p>Okay, that was suddenly not surprising. And of course, there’s no need to iterate over all possible values of <script type="math/tex">\theta</script> when we can find the maximum simply by differentiating the likelihood: <script type="math/tex">\frac{d}{d\theta}\,\theta(1-\theta) = 1 - 2\theta</script>, which yields <script type="math/tex">\theta=0.5</script>. Plain and simple. It can be shown that lots of machine learning methods perform maximum likelihood estimation, sometimes without using probabilistic vocabulary at all: simple feed-forward neural networks, for example.</p>
<p>Now let’s imagine a different example. Suppose the coin is tossed twice again, but now it comes up heads both times. Following the same path as before, we get the likelihood <script type="math/tex">P(X\mid\theta)=\theta^{2}</script>, whose derivative <script type="math/tex">2\theta</script> points to a maximum at <script type="math/tex">\theta=1</script>. And this is perfectly right in terms of maximizing the likelihood, but it doesn’t look like a good answer in general (do we really expect the coin to only ever show heads, after just two tosses?). One solution to this problem is to always get lots of data before tweaking your model, and this is perfectly good advice we’d like to follow whenever possible, except that sometimes it just might not be possible. If we’re dealing with data that can result in multiple outcomes, each missing outcome gets probability 0 from the model (like the probability of tails in the second coin example), and that might be too strong an assumption to make. The other problem is that our data has to be <em>equally partitioned</em>. Think about the case when we perform 100 coin tosses and obtain 48 heads: the maximum likelihood estimate would be 0.48. A good estimate, but still, wouldn’t a probability of 0.5 be more likely? (Or at least as good?)</p>
<p>To tackle this issue we’re going to allow some uncertainty in our answers. Concretely, we’re going to make <script type="math/tex">\theta</script> a random variable, and instead of trying to find a single good value of it we’re going to estimate its distribution given the observed data, <script type="math/tex">P(\theta \mid X)</script>. This is what’s called Bayesian inference, and naturally, the main tool we’re going to use to estimate this distribution is Bayes’ rule: <script type="math/tex">P(\theta \mid X)=\frac{P(X \mid \theta)P(\theta)}{P(X)}</script>.</p>
<p>So not only the data is random, but our model estimate is random <em>as well</em>. Nice. Let’s try this approach for the second coin example.</p>
<p><em>(Note: there are other reasons why the full Bayesian approach may be better than MLE. A well-known example is a cancer test that detects cancer with very high probability, but is still practically useless without considering the prior probability of having cancer. Usually, however, we don’t have a good prior distribution for the parameters <script type="math/tex">\theta</script>, or we start from a uniform prior, so that’s not the main reason to go Bayesian, I guess.)</em></p>
<h1 id="an-impractical-example">An impractical example</h1>
<p>Since we don’t know anything about coins, let’s start from the uniform distribution as the initial estimate, meaning that we expect any value of <script type="math/tex">\theta</script> between 0 and 1 with the same probability.</p>
<p class="center"><img src="/assets/article_images/2015-09-10-variational-methods/uniform.png" alt="" /></p>
<p>After observing one head, here’s what happens: <script type="math/tex">P(\theta \mid \{H\})=\frac{P(H\mid \theta)P(\theta)}{P(H)}</script>. To find the numerator, we have to multiply our initial distribution <script type="math/tex">P(\theta)</script> by the likelihood of obtaining a heads given this particular value of theta. How do we get the likelihood, this <script type="math/tex">P(H\mid \theta)</script> term? Well, we’ve chosen the <script type="math/tex">\theta</script> wisely to define the probability of heads, so it’s easy to see that <script type="math/tex">P(H\mid \theta)=\theta</script>. Let’s plot the numerator expression:</p>
<p class="center"><img src="/assets/article_images/2015-09-10-variational-methods/unnormalized.png" alt="" /></p>
<p>Notice that this is not a probability distribution: it’s an <em>unnormalized</em> posterior, since the area under the line doesn’t integrate to 1. We have to normalize it by dividing by the denominator <script type="math/tex">P(H)</script>, the <em>evidence</em> term. This is simply the integral of the unnormalized posterior over <script type="math/tex">\theta</script>: <script type="math/tex">\int_{0}^{1} P(H\mid \theta)P(\theta) d\theta</script>, which in our case is just half of the unit square, an area of <script type="math/tex">\frac{1}{2}</script>.</p>
<p class="center"><img src="/assets/article_images/2015-09-10-variational-methods/normalized.png" alt="" /></p>
<p>Let’s make another coin toss: thanks to the fact that both tosses are independent, we can perform the updates sequentially, using <script type="math/tex">P(\theta \mid \{H\})</script> as the new prior for calculating <script type="math/tex">P(\theta \mid \{H, H\})</script>. The likelihood stays the same, so basically we just multiply our distribution by <script type="math/tex">\theta</script> again and normalize:</p>
<p class="center"><img src="/assets/article_images/2015-09-10-variational-methods/two_tosses.png" alt="" /></p>
<p>Look how much better this is than the maximum likelihood estimate! The distribution is still biased towards heads, but there’s enough uncertainty to allow other explanations. You can do a couple of integrations to see the actual probabilities: for example, <script type="math/tex">P(0.4 \leqslant \theta \leqslant 0.6)=0.152</script> and <script type="math/tex">P(\theta \geqslant 0.9)=0.271</script>, so the difference is significant, but not quite one-vs-all.</p>
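<p>The whole sequential update is easy to reproduce with a grid approximation of the posterior; a sketch for the two-heads case (the grid size is arbitrary, and the sums approximate the integrals up to grid error):</p>

```python
import numpy as np

theta = np.linspace(0, 1, 1001)
d = theta[1] - theta[0]
posterior = np.ones_like(theta)              # uniform prior P(theta)

for toss in 'HH':                            # the two observed heads
    likelihood = theta if toss == 'H' else 1 - theta
    posterior = posterior * likelihood       # unnormalized posterior
    posterior /= posterior.sum() * d         # divide by the evidence

# the probabilities quoted above, up to grid error
mid = posterior[(theta >= 0.4) & (theta <= 0.6)].sum() * d   # ~0.152
tail = posterior[theta >= 0.9].sum() * d                     # ~0.271
```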
<p>Let’s run it for some more steps, now using actual random number generator and see where it’ll take us:</p>
<div class="photo_frame_center">
<video width="612" height="512" controls="" preload="none" poster="/assets/article_images/2015-09-10-variational-methods/uniform.png">
<source src="/assets/article_images/2015-09-10-variational-methods/learning_coin.webm" type='video/webm; codecs="vp8, vorbis"' />
</video>
</div>
<p><br /></p>
<p>Notice how the spread of the distribution stops changing after a certain amount of time, meaning that we still allow uncertainty to affect our decisions.</p>
<p>Okay, that was great, but it’s actually even better, because full Bayesian learning makes some problems specific to maximum likelihood estimation magically disappear. For example, we aren’t afraid of getting stuck in a local optimum during gradient descent, because we’re not looking for a single best set of parameters anymore. We can also make use of prior knowledge about the model: for example, if we’re dealing with a neural network, we can start from the assumption that the parameters, i.e. the weights, shouldn’t be very large.</p>
<p>Well, let’s try to use it for something that looks more like a machine learning model.</p>
<h1 id="a-slightly-less-impractical-example">A slightly less impractical example</h1>
<p>Consider this super-small neural network with two input units and one output unit:</p>
<p class="center"><img src="/assets/article_images/2015-09-10-variational-methods/toy_nn.png" alt="" /></p>
<p>where the parameters of the model <script type="math/tex">\theta</script> correspond to the weights of the network <script type="math/tex">[\theta_0, \theta_1]</script>. There’s not much this tiny network can do, but we’ll give it a toy problem: functioning as a logical OR gate, for example. But first, let’s give the model a prior parameter distribution.</p>
<p>I use these <a href="http://www.cs.cmu.edu/afs/cs/academic/class/15782-f06/slides/bayesian.pdf">excellent slides</a> by Aaron Courville as the source of the problem setup, but instead of using a Gaussian prior, as he does, I’m going to stick with a uniform one, just because I can. So,</p>
<script type="math/tex; mode=display">
P(\theta)=P(\theta_0, \theta_1) \sim Uniform(-5, 5)
</script>
<p>Looks just like our coin toss example prior, just in 3D. Nothing fancy.</p>
<p class="center"><img src="/assets/article_images/2015-09-10-variational-methods/3d_prior.png" alt="" /></p>
<p>As for the likelihood, that’s easy too: we’re going to use the sigmoid function to output a number between 0 and 1, which can conveniently represent a probability.</p>
<script type="math/tex; mode=display">% <![CDATA[
P([X_i, t_i] \mid \theta)=
\begin{cases}
\frac{1}{1+e^{-X_i \theta}}, & \text{if } t_i = 1\\
1 - \frac{1}{1+e^{-X_i \theta}}, & \text{if } t_i = 0
\end{cases}
%]]></script>
<p>where <script type="math/tex">i</script> denotes the <script type="math/tex">i</script>-th observation/dataset element. Let’s pick an observation, for example <script type="math/tex">[(0, 1), 1]</script>, and plot the unnormalized posterior (which is, again, the product of the prior and the likelihood):</p>
<p class="center"><img src="/assets/article_images/2015-09-10-variational-methods/3d_unnormalized.png" alt="" /></p>
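<p>The surface can be computed the same way as in the coin example, just on a 2D grid (the grid resolution here is arbitrary):</p>

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# grid over the two weights, matching the Uniform(-5, 5) prior
grid = np.linspace(-5, 5, 101)
theta0, theta1 = np.meshgrid(grid, grid)
prior = np.ones_like(theta0) / 100.0         # uniform density on the square

# one observation: X = (0, 1), target t = 1
likelihood = sigmoid(0 * theta0 + 1 * theta1)
unnormalized = prior * likelihood              # the plotted surface
posterior = unnormalized / unnormalized.sum()  # grid-normalized
```

<p>Since the first input is 0, the posterior doesn’t depend on <code>theta0</code> at all and rises monotonically in <code>theta1</code>.</p>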
<p>Nice! Then we’re going to normalize, update again, and so on; the process actually looks very similar to the coin example:</p>
<div class="photo_frame_center">
<video width="612" height="512" controls="" preload="none" poster="/assets/article_images/2015-09-10-variational-methods/3d_prior.png">
<source src="/assets/article_images/2015-09-10-variational-methods/logic_or.webm" type='video/webm; codecs="vp8, vorbis"' />
</video>
</div>
<p><br /></p>
<p>The only difference is that our dataset is now noise-free (I simply generated a bunch of number pairs and logical-OR-ed them), so the posterior peak doesn’t jump around.</p>
<p>At this point it might become obvious what the main problem with the Bayesian approach to machine learning is: you have to evaluate that posterior for <em>every combination of parameters</em> your model allows. And when we use graphical models and neural networks, especially deep ones, we usually have <em>lots</em> of parameters, hundreds and thousands of them. Think about it this way: when we’re doing maximum likelihood learning via gradient descent, having an extra parameter means we have to compute an extra derivative. But if we’re doing full Bayesian learning, it means we have to compute all the combinations that this new parameter makes with the other parameters, so the total number of computations is</p>
<script type="math/tex; mode=display">
(\text{possible parameter values})^{\text{number of parameters}}
</script>
<p>which grows exponentially. Not cool.</p>
<h1 id="variational-bayes-and-boltzmann-machine-example">Variational Bayes and Boltzmann machine example</h1>
<p>So the idea is that instead of computing the exact posterior, which is often (for sufficiently complicated models) intractable, why don’t we resort to some approximation of the posterior that doesn’t involve an exponential number of computations? This is called <em>variational approximation</em>, and we’re going to illustrate this idea with yet another example - now with the Boltzmann machine model.</p>
<p>Just as a reminder, the likelihood function for a Boltzmann machine is written with respect to the energy of the model and looks like this:</p>
<script type="math/tex; mode=display">
P(x \mid \theta) = \frac{e^{-E(x)}}{Z}
</script>
<p>where <script type="math/tex">Z</script> is yet another partition function, which equals to the sum over <script type="math/tex">e^{-E(x)}</script> for all possible data vectors <script type="math/tex">x</script>. So the likelihood alone involves <script type="math/tex">2^N + number \: of \: hidden \: units</script> computations, and we still have to normalize it by <script type="math/tex">P(X)</script>, which is the sum over all parameters <script type="math/tex">\theta</script>. So not only Bayesian learning for Boltzmann machine is a nightmare; even exact maximum likelihood learning is often intractable (that’s why we use Gibbs sampling and tricks like conrastive divergence).</p>
<p>So let’s step back a bit and, instead of trying to approximate the posterior, first deal with the likelihood. After all, we can approximate any distribution we want. So there’s <script type="math/tex">P(X \mid \theta)</script> and some approximating distribution <script type="math/tex">q(X \mid \mu)</script>, where <script type="math/tex">\mu</script> are variational parameters. For simplicity I’m going to denote the distributions just as <script type="math/tex">P(x)</script> and <script type="math/tex">q(x)</script>, using little <script type="math/tex">x</script> so that it refers to single observations and cannot be confused with the evidence term. So, to make the approximation close to the original, we’re going to minimize the KL divergence:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
KL(q(x)||P(x)) & = \sum_{x} q(x) \log \frac{q(x)}{P(x)} \\
& = \sum_{x} q(x) \log q(x) - \sum_{x} q(x) \log P(x) \\
& = -H(q) - \sum_{x} q(x) \left[ -E(x) - \log Z \right] \\
& = \sum_{x} q(x) E(x) - H(q) + \log Z \\
& = \mathbb{E}_{q(x)} \left[ E(x) \right] - H(q) + \log Z
\end{align*}
%]]></script>
<p>where <script type="math/tex">H</script> means information entropy, and <script type="math/tex">\mathbb{E}_{q(x)}</script> means expected value with respect to <script type="math/tex">q(x)</script> (I borrowed the derivation from these excellent <a href="http://cvn.ecp.fr/personnel/iasonas/course/DL4.pdf">slides</a>).</p>
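The chain of equalities above can be checked numerically on a model small enough to enumerate. A sketch with a toy fully visible machine and an arbitrary factorized q (sizes and probabilities are my own choices):

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n = 5                                         # small enough to enumerate 2^5 states
W = rng.normal(scale=0.2, size=(n, n))
W = (W + W.T) / 2.0
np.fill_diagonal(W, 0.0)

states = np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)
E = -0.5 * np.einsum('si,ij,sj->s', states, W, states)   # energy of every state
logZ = np.log(np.exp(-E).sum())
P = np.exp(-E - logZ)                                    # exact model distribution

mu = rng.uniform(0.1, 0.9, size=n)                       # arbitrary factorized q
q = np.prod(np.where(states == 1.0, mu, 1.0 - mu), axis=1)

kl_direct = np.sum(q * np.log(q / P))                    # first line of the derivation
H_q = -np.sum(q * np.log(q))                             # entropy of q
kl_rewritten = np.sum(q * E) - H_q + logZ                # E_q[E(x)] - H(q) + log Z
```

The two expressions agree to machine precision, which is a comforting sanity check on the algebra.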
<p>Notice that minimization of the KL divergence allows us to ignore <script type="math/tex">\log Z</script>, because it doesn’t depend on <script type="math/tex">q(x)</script>. A popular choice for <script type="math/tex">q(x)</script> is a fully factorized distribution, a.k.a. the <em>mean field</em> distribution, which assumes that the probability of an overall configuration is just the product of the probabilities of each binary unit turning on:</p>
<script type="math/tex; mode=display">
q(x) = \prod_{i} q_{i}(x_{i}) = \prod_{i} \mu_{i}^{x_{i}} (1 - \mu_{i})^{1-x_{i}}
</script>
<p>So each unit is on with probability <script type="math/tex">\mu_{i}</script> and off with probability <script type="math/tex">1-\mu_{i}</script>. It can then be shown that the mean field update equations look like this:</p>
<script type="math/tex; mode=display">
\mu_i=\sigma \left( \sum_{j\neq i} w_{ij}\mu_{j} \right)
</script>
<p>(I couldn’t quite follow the derivation, although I really wanted to.) The other rather obscure thing is the reason we cannot use these equations during the “negative phase” of Boltzmann machine learning: there’s a brief note on it in the Hinton & Salakhutdinov <a href="http://jmlr.org/proceedings/papers/v5/salakhutdinov09a/salakhutdinov09a.pdf">paper</a>, but again I couldn’t follow why it is true. So to summarize:</p>
<ol>
<li>Each hidden unit gets a parameter <script type="math/tex">\mu_i</script>, a number between 0 and 1 that denotes the probability of that unit turning on.</li>
<li>During the positive phase we can update all units in parallel, using mean field equations.</li>
<li>During the negative phase we do the same as we used to (sequential updates). There are tricks to speed up the negative phase too (in fact those two tricks are the basics of Deep Boltzmann Machine learning algorithm I hope to get back to one day), but let’s ignore them for now.</li>
</ol>
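The parallel updates from step 2 can be sketched like this (a toy fully visible model, not the full positive/negative-phase training loop; the sizes, weights and iteration count are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mean_field_step(W, mu):
    # W has a zero diagonal, so W @ mu already sums over j != i:
    # every mu_i can be refreshed in parallel from the current vector
    return sigmoid(W @ mu)

rng = np.random.default_rng(2)
n = 8
W = rng.normal(scale=0.5, size=(n, n))
W = (W + W.T) / 2.0               # symmetric weights
np.fill_diagonal(W, 0.0)          # no self-connections

mu = np.full(n, 0.5)              # start every unit at probability 0.5
for _ in range(200):              # iterate the fixed-point equations
    mu = mean_field_step(W, mu)
```

Each iteration is one matrix-vector product instead of a sweep of sequential stochastic updates, which is the whole appeal of the mean field trick.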
<p>So I took all this and put it into my <a href="http://rocknrollnerd.github.io/ml/2015/07/18/general-boltzmann-machines.html">previous attempt</a> to make a general toy Boltzmann machine. To monitor the performance, however, I cannot use visualized weights anymore (because our parameters are not just weights now, but mean-field values too), so the only way to see if the model actually learns something is to sample data from it.</p>
<p class="center"><img src="/assets/article_images/2015-09-10-variational-methods/bm_samples.png" alt="" /></p>
<p><em>Left: samples from the original Boltzmann machine. Right: samples from mean field-approximated Boltzmann machine. Both models observed a toy subset of MNIST digits that included only 3 classes</em></p>
<h1 id="summary">Summary</h1>
<p>Okay, that took longer than expected. I sincerely hoped that working through mean field for Boltzmann machines would get me closer to understanding variational methods, but apparently it was a poor choice to start with, because of all the difficulties: not being able to derive the mean field equations (which was the most frustrating part), the confusing bit about still doing weight updates during the negative phase, and, after all, the whole thing was a distraction from the full Bayesian approach, because the model still does maximum likelihood learning (we end up with one single value of <script type="math/tex">\mu_i</script> for each unit). The models I’m mostly interested in are <a href="http://jmlr.csail.mit.edu/proceedings/papers/v15/larochelle11a/larochelle11a.pdf">NADE</a>, <a href="http://arxiv.org/pdf/1312.6114v10.pdf">variational autoencoders</a> and <a href="http://arxiv.org/pdf/1505.05424v2.pdf">Bayesian neural networks</a>, which were discussed during the DTU summer school and really got my interest, so I hope I’ll get to them in the next part.</p>
Sun, 20 Sep 2015 20:11:00 +0000
http://rocknrollnerd.github.io/ml/2015/09/20/variational-methods.html
http://rocknrollnerd.github.io/ml/2015/09/20/variational-methods.html
Stuff I've been reading<p>So before the summer school (which I’m going to briefly mention a little bit later) I was going to speed-read a randomly chosen sample of recent papers, just to catch up with what’s going on. Here are some short notes I made during the reading. Most of the papers are not really recent: that’s mostly because I tried to choose the latest publications from researchers/groups I haven’t been paying close attention to (like Coates/Salakhutdinov/LeCun etc.).</p>
<ol>
<li>
<p><strong>Learning to Disentangle Factors of Variation with Manifold Interaction</strong> by S. Reed, K. Sohn, Y. Zhang, H. Lee.</p>
<p>The paper describes a model (a kind of RBM) designed to capture different factors of the input data: for example, human face pose and expression. Extremely interesting; after all, that’s the big question — looking for independencies in the data and trying to find separate latent representations for them. The practical approach, however, looks like it requires a lot of hand engineering — the number of factors we’re trying to extract is defined in terms of the network architecture and cannot be adjusted during learning. I also haven’t got the math right, mainly because it seems to be based on variational learning, which I’m still struggling to understand.</p>
</li>
<li>
<p><strong>Dropout: A simple way to prevent neural networks from overfitting</strong> by N. Srivastava, G. Hinton, A. Krizhevsky.</p>
<p>Yeah, despite the fact that dropout is a famous technique now I never read the original paper, and actually shame on me, because it was awesome. Some people I tried to discuss dropout with were really skeptical about the concept, arguing that there’s no theory behind it and it’s purely a heuristic that works. From my perspective this looks like a great concept certainly worth investigating maybe even on a biological level (I particularly liked the section about evolutionary inspiration for dropout). Does it really make sense for a single neuron to follow the lonely wolf strategy by refusing to cooperate with its neighbor neurons that receive the same input? And if it does, how can it possibly do such a thing?</p>
</li>
<li>
<p><strong>Deep Boltzmann Machines</strong> by R. Salakhutdinov and G. Hinton.</p>
<p>Another famous paper; but now I couldn’t understand <em>a thing</em> from it. That’s really sad. Damn you variational learning!</p>
</li>
<li>
<p><strong>Deep generative stochastic networks trainable by backprop</strong> by Y. Bengio, E. Thibodeau-Laufer, G. Alain.</p>
<p>Again I understood almost nothing, <em>but</em> this paper was mentioned in a talk by Tapani Raiko during the summer school and it seems I’ve grasped a bit of intuition from it (the relationship between models with probabilistic hidden units and denoising models). So I’m definitely going to get back to it right after I’m done with all the stochastic stuff which is suddenly all over the neural networks.</p>
</li>
<li>
<p><strong>Deep learning of invariant features via simulated fixations in video</strong> by W. Zou, S. Zhu, K. Yu, A. Ng.</p>
<p><em>Ahhh.</em> The first time I met machine learning was during Andrew Ng’s Coursera class, and now I’m starting to understand why I’m experiencing trouble trying to read about probability and neural networks: because that course didn’t mention probability <em>at all</em>. For precisely the same (I guess) reasons it’s super-easy to read his papers: everything is loud and clear, moving on. The biggest mystery of this paper was the usage of the tracking algorithm to track video frames: doesn’t it produce redundant sequences of replicated patches? Anyway, I absolutely like the idea of time as a weak supervisor, although it seems that the model learns features not really different from the well-known edges and corners.</p>
</li>
<li>
<p><strong>Learning Feature Representations with K-means</strong> by A. Coates and A. Ng.</p>
<p>I used to consider this paper something of a blast: after all, being able to produce state-of-the-art unsupervised learning results with plain and simple K-means kinda implies that maybe we should throw off all the fancy algorithms like RBMs, autoencoders and sparse coding models. So I kept asking <em>everyone</em> at the summer school about it and, surprisingly, almost never encountered this reaction. There’s a post on <a href="https://www.reddit.com/r/MachineLearning/comments/1rsmlt/whats_wrong_with_kmeans_clustering_compared_to/">reddit</a> which conveys the tone of my question, and the answers basically sound like “wellll, convolutional networks are better anyway”. Interesting. As for the paper itself: I should try to reproduce the result one day; it seems to be a useful tool to play with unsupervised learning methods.</p>
</li>
<li>
<p><strong>Emergence of Object-Selective Features in Unsupervised Feature Learning</strong> by A. Coates, A. Karpathy, A. Ng</p>
<p>A continuation of the previous paper, which introduces a deep hierarchy model built on K-means cells. I like the pooling approach they used (pooling “similar-looking” patches together), although I still believe that the idea of grouping edge-like features spatially trying to get higher-level features is wrong…</p>
</li>
<li>
<p><strong>The Importance of Encoding Versus Training with Sparse Coding and Vector Quantization</strong> by A. Coates and A. Ng.</p>
<p>That was… strange. I’m not sure I get the idea right: combining different learning/encoding methods produces comparable results? It’s possible to encode an image using a linear combination of random code components? Something like this, I guess. The paper is from 2011, so not recent at all; and is it just me, or is the interest in unsupervised learning methods lower now than it used to be? Maybe that’s because we don’t need unsupervised pre-training anymore, with ReLU and stuff.</p>
</li>
</ol>
Fri, 04 Sep 2015 19:00:00 +0000
http://rocknrollnerd.github.io/ml/2015/09/04/stuff-ive-been-reading.html
http://rocknrollnerd.github.io/ml/2015/09/04/stuff-ive-been-reading.html
Deepdreaming pets<p>So I’ve made an attempt at Kaggle’s cats and dogs dataset, as I <a href="http://rocknrollnerd.github.io/ml/2015/08/20/i-want-my-corners.html">intended to</a>. In short, some good and bad news:</p>
<ul>
<li>I’ve managed to achieve a 75% recognition rate, which is, to be honest, not good at all for binary classification. Still, I hoped that would give me some features to look at.</li>
<li>Implementing a deconvolutional network in Theano is a tricky business. There’s an <a href="https://github.com/Theano/Theano/issues/2022">issue</a> on Github about it, with no progress, it seems. People also suggest using <a href="http://arxiv.org/abs/1312.6034">Karen Simonyan et al.</a>’s technique for visualization, which is nice, but it cannot give me a sneak peek into the second or third network layers (those are the ones I’m interested in, mostly). Still, better than nothing.</li>
<li>Apparently my network is still not good enough, because that’s what I’ve managed to obtain:</li>
</ul>
<p class="center"><img src="/assets/article_images/2015-08-22-deepdreaming-pets/catdog.png" alt="" /></p>
<p><em>Left: an image that maximizes “dog” class. Right: an image that maximizes “cat” class. Both: I sincerely hoped for deeper dreams.</em></p>
<p><em>Actually</em> after staring at them for several minutes it seems that I start to figure out some doggish patterns. Or it’s just plain wishful thinking, I guess.</p>
<p>The technique is very simple and produces valid results for simpler images. For example, here are MNIST digits:</p>
<p class="center"><img src="/assets/article_images/2015-08-22-deepdreaming-pets/mnist.png" alt="" /></p>
<p>Which is, again, very nice, but isn’t what I wanted to see. I’d like to check out what’s going on in layer 3 or somewhere around — what are the features that the network learns after it’s done with edges and corners?</p>
<p>Maybe it’s better to switch from Theano and Lasagne to something else. Also, to find a more powerful computing unit than my laptop: it runs fast and nice on MNIST digits, but this little catdog network took something like 3 hours to learn (images were resized to 60x60). Also, still trying to make CUDA work, but given the performance boost people report, there isn’t going to be a miracle anyway.</p>
Sat, 22 Aug 2015 09:00:00 +0000
http://rocknrollnerd.github.io/ml/2015/08/22/deepdreaming-pets.html
http://rocknrollnerd.github.io/ml/2015/08/22/deepdreaming-pets.html
"I want my corners" (Avon Barksdale)<p>So, how do we learn feature hierarchies from images?</p>
<ol>
<li>Cut a lot of patches from image data, put them into an unsupervised algorithm. Get a bunch of features that look like edges (which is well-known and reproduced many times).</li>
<li>Then we can make bigger patches, represent them in terms of the learned edges and feed them into the second layer using the same algorithm. Surprisingly, I’ve seen just <a href="http://ai.stanford.edu/~ang/papers/nips07-sparsedeepbeliefnetworkv2.pdf">one paper</a> that shows that you can get corner-like features as a result (and also different contour parts). The key part here is that we look for first-level features that occur together (close to each other). That’s what spatial pooling does.</li>
<li><em>Then</em> as different deep learning architectures suggest, we perform the same pooling/feature learning steps over and over. But wait a minute: if we do believe that second-level features are indeed corners, it doesn’t make sense. We can find corners by looking at (different) edges that are active in some close proximity, but we can’t use the same logic for features of higher order, because combining different corners together produces junctions and gratings, which are structurally the same. What we’d like to get at the abstract level 3 is maybe some simple geometric figures or <a href="https://en.wikipedia.org/wiki/Geon_%28psychology%29">geons</a>, and we <strong>can’t</strong> do that by using spatial pooling, simply because the different corners that compose a figure can be far away from each other. Actually, isn’t this where scale invariance kicks in? Maybe we can get away from the absolute-valued spatial metric here by representing an image as some elastic corner graph?</li>
</ol>
<p>An idea: look for coincident corners (pairs of corners that correspond to each other, i.e. can be connected together). That might include corners that don’t have an actual edge between them - <a href="https://en.wikipedia.org/wiki/Illusory_contours">illusory contours</a>? To do that, we have to know corner orientation (or somehow learn all these corresponding pairs, which may be expensive). I hoped I could do that by using <a href="http://arxiv.org/pdf/1109.6638v2.pdf">factored sparse coding</a>, which enables angle parametrization for edge features, but I failed to understand and implement the algorithm, sadly.</p>
<p>Still, what features do convolutional networks learn when they use the spatial pooling approach? As <a href="http://www.matthewzeiler.com/pubs/arxive2013/eccv2014.pdf">this paper</a> shows, it’s mostly pieces of aligned image parts, if they are available (it also learns corners and junctions and gratings, no surprise). That reminds me of a simple sparse coding <a href="https://github.com/rocknrollnerd/deep_hierarchy">hierarchy</a> I tried to implement some time ago which <em>didn’t</em> learn higher-level features when run against Kaggle’s cats and dogs dataset. Does it mean this set doesn’t contain any aligned parts (meaning each dog/cat is unique; kinda unlikely, I guess), or maybe my network/sample set wasn’t big enough, or maybe sparse coding is just worse than a convnet somehow? Now I want to build a convnet and run it against this dataset, using a feature visualization technique. An interesting question: what if we really construct a dataset of a lot of unique dogs with no repeated aligned image parts to learn the corresponding filters? Will a convnet fail there?</p>
<p>(if anybody’s reading it, I apologize for that stream of consciousness)</p>
<p>As a side note: started reading “Machine Learning: probabilistic perspective”. Great book, hope it’ll bring some clarity into crazy probability-heavy papers that I can’t understand at all for now.</p>
Thu, 20 Aug 2015 00:50:00 +0000
http://rocknrollnerd.github.io/ml/2015/08/20/i-want-my-corners.html
http://rocknrollnerd.github.io/ml/2015/08/20/i-want-my-corners.html
Occasional thoughts on spatial coding<p>I’ve been trying to read Norbert Wiener’s <em>Cybernetics</em> recently, just to look up the origins of the whole AI thing. Among other cases he talks about the problem of perceiving a geometric figure (a square) the same way regardless of its size, a.k.a. the scale invariance problem. I thought about how this problem is solved nowadays, which is basically just by using image pyramids — either calculating SIFT features or performing max/mean pooling by deep convnet layers. There are other approaches that try to separate an object’s location and content by using separate units encoding rotation and scale, such as Hinton’s transforming autoencoders or the quite recent DeepMind <a href="http://arxiv.org/pdf/1506.02025v1.pdf">spatial transformer networks</a>, and personally I like those even more, but anyway, it seems we’ve got scale invariance covered now.</p>
<p>There are a couple of things that still look a bit suspicious, however. All these methods can be divided into roughly two groups: “global” ones, that scale or transform the entire image at once (Gaussian pyramids and transforming autoencoders), and “local” ones, that perform pooling operations on some local image region (the convnet’s max pooling). And you can notice that max pooling performs the same replicated operation across all the pooling regions, so we can call it global too. The point I’m trying to make here is that none of these methods can perform a heterogeneous scaling transformation on an image; for example, for a human figure, to scale up just its head in a caricature style. A transforming autoencoder can decode a more complex affine transformation, but that’s still quite a limited case of invariance. Then there are SIFT features, which operate on all the pyramid levels at once — they’ll match an enlarged human head/face part, but then again, if we encode our image using a whole bunch of SIFT features, the spatial correspondence between the head and the body will be different now, which is a problem.</p>
<p>Or to make a simpler example, something like these digits, when a part of a digit is scaled differently from the other parts:</p>
<p class="center"><img src="/assets/article_images/2015-08-10-spatial-coding/digits.png" alt="" /></p>
<p>Still pretty much recognizable, right?</p>
<p>Now, it looks like a heterogeneous scaling transform can <em>only</em> be applied to structurally separate object parts. We can think about an enlarged eye but not a half-eye — it doesn’t even make sense, considering there’s no way to connect two parts of the transformed eye in a realistic manner. So, we’d like to use some kind of constellation model to split a digit into separate parts, and how do we do <em>that</em>? There are multiple possible ways, like:</p>
<ul>
<li>a <a href="https://www.cs.princeton.edu/courses/archive/fall07/cos429/slides/constellation_model.pdf">classic</a> method: extracting interesting patterns, applying them to images, using EM to find proper combinations. Generally applicable, but I don’t like the idea of processing a big dataset just to understand that “8” consists of two circles.</li>
<li>using something like time as a supervisor: if a part of an object moves in a separate (from other parts) direction, say, then we can safely assume some kind of structural independence. Sadly, our handwritten digits are pretty much static, although this method can still be used in some cases — for example, extracting parts from different images of the “1” digit (with or without the bottom horizontal bar) that are sometimes missing.</li>
<li>one-shot learning approach: extracting salient regions and assuming they belong to constellation parts. Actually I’d like to see how that works for digits…</li>
<li>maybe going low-level a bit? We can safely assume that part of our digits’ structure can be described by penstroke intersections/corners. Digits with no corners aren’t prone to heterogeneous scaling: you can’t scale a part of “1” or a part of “0” without breaking the contour. Now: these corner features should be a well-known thing, mentioned in the Neocognitron network and convnets, but then again, convnet max pooling doesn’t look like the scalable solution we’re looking for.</li>
</ul>
<p>I guess before I start jumping to conclusions, I should investigate what features a convnet actually learns on MNIST digits and how it combines them together. But for now it looks like we can’t achieve scale invariance simply by stacking together lots of pooling layers, let alone heterogeneous scaling.</p>
<p>What if at hypothetical second layer, where we can represent an image by a significantly smaller subset of corners, we are no longer interested in features’ absolute positions and instead look at corners’ mutual relations? That would automatically solve <em>all</em> our invariance problems, including translation and rotation, but it seems that knowing corners only is simply not enough to represent an object…</p>
Mon, 10 Aug 2015 12:49:00 +0000
http://rocknrollnerd.github.io/ml/2015/08/10/spatial-coding.html
http://rocknrollnerd.github.io/ml/2015/08/10/spatial-coding.html
RBMs: part 2<p>I’ve interrupted my previous <a href="http://rocknrollnerd.github.io/ml/2015/07/23/finally-rbms.html">post</a> because of the nasty error with whitened data. So, good news, me (I guess): it seems that whitening the MNIST dataset is just a bad idea and I shouldn’t do it at all.</p>
<p>Now, I’m still not quite sure if this is in fact the case. The strongest counterargument against this position is simply the way whitened digits look (the <a href="http://rocknrollnerd.github.io/assets/article_images/2015-07-23-finally-rbms-part-1/whitened_digits.png">example</a>, again) — they are still perfectly distinguishable by the human eye, although a little noisy. But whenever I try to run third-party, well-tested models against the whitened dataset, it’s always a failure:</p>
<ul>
<li>I’ve tried two RBM implementations — by <a href="http://scikit-learn.org/stable/modules/generated/sklearn.neural_network.BernoulliRBM.html">scikit-learn</a> and <a href="https://github.com/lmjohns3/py-rbm">Leif Johnson</a>, both gave me the same randomized weights.</li>
<li>I’ve also tried an autoencoder solution by <a href="https://github.com/lmjohns3/theanets">theanets</a>, with, again, the same kind of error.</li>
<li>I’ve re-implemented whitening from the <a href="http://ufldl.stanford.edu/wiki/index.php/Implementing_PCA/Whitening">UFLDL exercise</a> in Octave in case my Python whitening code is somehow wrong, but nope, the two outputs look pretty much the same.</li>
<li>I’ve found some <a href="https://plus.google.com/+AlexSusemihl/posts/LTE53tun5DC">posts</a> that report the same problem.</li>
</ul>
<p>The whitening transform makes features less correlated with each other; so is there something wrong with these correlations in MNIST digits? One can safely assume that the border around the digits is quite a redundant and intercorrelated piece of data, but maybe that’s not true for the digit pixels themselves… Well, anyway, I’m going to drop it for now and assume I’m done with Gaussian RBMs. No penstrokes still, which is sad.</p>
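For context, the whitening in question is roughly the UFLDL-style ZCA transform; a sketch (the epsilon is the usual regularizing fudge factor, and the exact values here are my own choices, not necessarily what any of the implementations above use):

```python
import numpy as np

def zca_whiten(X, eps=1e-2):
    """ZCA-whiten the rows of X, shape (n_samples, n_features)."""
    X = X - X.mean(axis=0)                 # center each feature
    cov = X.T @ X / X.shape[0]             # sample covariance
    U, S, _ = np.linalg.svd(cov)           # cov is symmetric: U holds its eigenvectors
    W = U @ np.diag(1.0 / np.sqrt(S + eps)) @ U.T
    return X @ W

# quick sanity check on random correlated data
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3)) @ rng.normal(size=(3, 3))
Xw = zca_whiten(X, eps=1e-10)              # near-zero eps: covariance becomes ~identity
```

With a larger eps (like the 1e-2 default, or 0.1 in the UFLDL exercise) low-variance directions are damped rather than amplified, which is exactly the knob that decides how noisy the whitened digits come out.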
<h1 id="convolutional-rbm">Convolutional RBM</h1>
<p>And <em>this</em> one took an enormous amount of time. The architecture is quite simple: if one’s familiar with convolutional networks, it’s basically just the same, with a couple of caveats. The convolutional RBM is introduced in <a href="http://www.cs.toronto.edu/~rgrosse/icml09-cdbn.pdf">this</a> paper as a part of a larger convolutional DBN structure. The key differences with the Bernoulli RBM are the following:</p>
<ul>
<li>Instead of <script type="math/tex">k</script> hidden units we now have <script type="math/tex">k</script> <em>feature maps</em> which are 2d collections of binary units. So actually there are <script type="math/tex">k \times H_{h} \times W_{h}</script> hidden units total (assuming <script type="math/tex">H_{h}</script> and <script type="math/tex">W_{h}</script> are feature map’s height and width). I find it easier to think about just <script type="math/tex">k</script> hidden variables, when each variable is a 2d array but essentially still represents one feature (and has one scalar bias value <script type="math/tex">b_{k}</script>).</li>
<li>Each of <script type="math/tex">k</script> feature maps has a filter weight matrix <script type="math/tex">W_{k}</script> which is also 2-dimensional (<script type="math/tex">H_{W} \times W_{W}</script>). To obtain a feature map, the input image is convolved with the corresponding filter (that’s a forward pass).</li>
<li>Now there’s a trick: remember, RBMs are symmetric. So to obtain visible data given the hidden we’re going to perform convolution again, but in “full” mode (instead of the “valid” mode we used in the forward pass). So if the input image has size <script type="math/tex">H_{I} \times W_{I}</script>, feature maps obtained by valid convolutions have size <script type="math/tex">H_{h}=H_{I}-H_{W}+1, W_{h}=W_{I}-W_{W}+1</script>. And to perform the backward pass we have to full-convolve the feature maps with the filter weights, producing a reconstruction of size <script type="math/tex">H_{I} \times W_{I}</script> again. Funny that it’s not mentioned anywhere except the py-rbm implementation. Also don’t forget that a reconstruction is actually a sum of these convolutions across all feature maps.</li>
<li>And to compute visible-hidden associations required for contrastive divergence we perform another convolution, namely convolve visible data with hidden (in “valid” mode again), so the result is shaped like filter weights.</li>
<li>Ah, and also visible bias is just a scalar value now. I wonder why.</li>
</ul>
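The shape bookkeeping from the list above can be sketched with scipy (all sizes are toy choices of mine, and I’m glossing over the filter-flipping convention and the actual Gibbs sampling — this only shows the valid/full convolution symmetry):

```python
import numpy as np
from scipy.signal import convolve2d

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
H_I = W_I = 28                  # input image size (MNIST-like)
H_W = W_W = 9                   # filter size
k = 4                           # number of feature maps

v = rng.random((H_I, W_I))                        # visible image
filters = rng.normal(scale=0.01, size=(k, H_W, W_W))
b = np.zeros(k)                                   # one scalar bias per feature map
c = 0.0                                           # single scalar visible bias

# forward pass: "valid" convolution -> k maps of size (H_I - H_W + 1)
hidden = np.array([sigmoid(convolve2d(v, f, mode='valid') + bk)
                   for f, bk in zip(filters, b)])

# backward pass: "full" convolution of each map with its filter,
# summed across maps, recovers the original H_I x W_I shape
v_recon = sigmoid(sum(convolve2d(h, f, mode='full')
                      for h, f in zip(hidden, filters)) + c)
```

With a 28×28 image and 9×9 filters the maps come out 20×20, and the full convolution brings the reconstruction back to 28×28, which is the symmetry the list above describes.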
<p><strong>Note</strong>: if you’re trying to implement a convolutional RBM only, without further extending it to a DBN, forget about the pooling layer — you don’t need it. I’ve spent quite some time here before I realized that Lee’s paper places pooling in a DBN-related section and it’s perfectly possible to perform sampling without it. It’s included in <a href="https://github.com/lmjohns3/py-rbm">py-rbm</a> for some reason, though.</p>
<p>Now the problem is, it seems, that a convolutional RBM (even a binary one) requires a lot of fine-tuning, playing with parameters and weights. The most surprising part, I guess, is that sometimes it actually needs <em>just a few</em> hidden variables/feature maps to learn something meaningful. I tried to feed little toy datasets to it (something like 5 random MNIST digits) and was able to observe some patterns in its behaviour:</p>
<ul>
<li>the larger the filter size, the better the CRBM works. With <script type="math/tex">28 \times 28</script> filters it actually reduces to a Bernoulli RBM, showing comparable results.</li>
<li>when the filter size is small (not even <em>very</em> small, something like <script type="math/tex">17 \times 17</script>) and the number of hidden feature maps is “large” (like 5 for a dataset of 5 MNIST digits), it very quickly learns a redundant set of round-shaped weights and then seemingly stops learning, which looks like this:</li>
</ul>
<div class="photo_frame_center">
<video width="650" height="250" controls="" preload="none">
<source src="/assets/article_images/2015-07-31-rbms-part-2/5_weights.webm" type='video/webm; codecs="vp8, vorbis"' />
</video>
</div>
<p><br /></p>
<ul>
<li>it’s also unable to produce a correct reconstruction, ending up with more or less random noise:</li>
</ul>
<div class="photo_frame_center">
<video width="650" height="250" controls="" preload="none">
<source src="/assets/article_images/2015-07-31-rbms-part-2/5_recon.webm" type='video/webm; codecs="vp8, vorbis"' />
</video>
</div>
<p><br /></p>
<ul>
<li>when you <em>decrease</em> the number of feature maps, performance actually improves (for example, 3 feature maps instead of 5 work much better). An alternative is to add a sparsity constraint. For example, here’s the same number of feature maps (5) but with a 0.1 sparsity target:</li>
</ul>
<div class="photo_frame_center">
<video width="650" height="250" controls="" preload="none">
<source src="/assets/article_images/2015-07-31-rbms-part-2/5_sparse_weights.webm" type='video/webm; codecs="vp8, vorbis"' />
</video>
</div>
<p><br /></p>
<p>And reconstructions:</p>
<div class="photo_frame_center">
<video width="650" height="250" controls="" preload="none">
<source src="/assets/article_images/2015-07-31-rbms-part-2/5_sparse_recon.webm" type='video/webm; codecs="vp8, vorbis"' />
</video>
</div>
<p><br /></p>
<p>I can’t explain this strange behaviour: I’d say increasing the number of hidden units results in redundant features, yes, but it should also increase reconstruction precision…</p>
<p>I haven’t tried larger datasets, mainly because my convolution implementation is not very efficient (I use scipy’s <code>convolve2d</code> in a loop over training examples). Apparently Theano is the way to go. And maybe a step further: a full-sized convolutional DBN.</p>
Fri, 31 Jul 2015 12:49:00 +0000
http://rocknrollnerd.github.io/ml/2015/07/31/rbms-part-2.html
http://rocknrollnerd.github.io/ml/2015/07/31/rbms-part-2.htmlmlFinally, RBMs: part 1<p>Checking my progress: about 10 days dedicated to energy-based models, including Hopfield nets, Boltzmann machines and their little sisters, RBMs. And I’m starting to notice that I can read and understand (some) machine learning papers now, not just scroll through all the equations in a panic. Another thing I’ve discovered during the run is that among other concepts, I find the probabilistic ones the hardest to understand. While it was all single neurons and energy landscapes, everything was loud and clear, but “to get an unbiased sample from the posterior” still looks a bit vague to me. So now I’ll try to watch the amazing Probabilistic Graphical Models course to get more comfortable with sampling, Markov chains and belief nets.</p>
<p>And now, let’s play with some RBMs.</p>
<p><strong>A short disclaimer</strong>: I’m going to keep the following implementations as simple as possible, mainly because the RBM is a popular model and adding all the bells and whistles like momentum, mini-batch gradient descent and validation comes close to reinventing the wheel. So these pieces of code are mostly educational. However, I don’t like the idea of concentrating on toy problems too much, so I’m also going to try to make a full-sized implementation in Theano (since I’ve made a few baby steps with it already) and run it on the GPU (since it can’t handle Witcher 3 in all its glory).</p>
<h1 id="bernoulli-rbm">Bernoulli RBM</h1>
<p>This is the simplest form of RBM, with binary hidden and visible units (where Bernoulli means binary, naturally). There are a few difficulties, however, concerning the different ways of collecting statistics for contrastive-divergence learning: sometimes we can use activation probabilities, and sometimes sampled binary states. Hinton’s <a href="https://www.cs.toronto.edu/~hinton/absps/guideTR.pdf">practical guide</a> (I absolutely love the idea of making such a document) specifies it in detail, but at first I found the section on collecting statistics kinda obscure (and I’m <a href="http://stats.stackexchange.com/questions/93010/contrastive-divergence-making-hidden-states-binary">not alone</a>). But after googling some other implementations, everything fell into place.</p>
<ul>
<li>First, assume for the sake of simplicity that we’re going to use CD-1 (i.e., perform one step of contrastive divergence).</li>
<li>When performing CD updates, always sample binary states, except maybe for the last hidden-unit update (you won’t need those states; that’s just for economy. I’ve actually performed sampling on every update).</li>
<li>When collecting statistics, you can use probabilities everywhere except for the input data (which is fixed and binary) and the first hidden state vector <script type="math/tex">h_{0}</script>. Or you can just use sampled states everywhere and be fine.</li>
</ul>
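<p>Put together, the whole CD-1 recipe can be sketched like this (my own toy code rather than anything from the practical guide; names and shapes are illustrative, and the bias updates are omitted for brevity):</p>

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b_h, b_v, lr=0.1, rng=np.random):
    """One CD-1 weight update for a Bernoulli RBM on a batch of binary rows."""
    # positive phase: sample binary hidden states from the (binary) data
    p_h0 = sigmoid(v0 @ W + b_h)
    h0 = (rng.rand(*p_h0.shape) < p_h0).astype(float)
    # negative phase: probabilities are fine from here on
    p_v1 = sigmoid(h0 @ W.T + b_v)
    p_h1 = sigmoid(p_v1 @ W + b_h)  # last hidden update: no sampling needed
    # <v h>_data - <v h>_recon, averaged over the batch
    dW = (v0.T @ h0 - p_v1.T @ p_h1) / v0.shape[0]
    return W + lr * dW
```

<p>The only sampling that actually happens is for <code>h0</code>; everything downstream uses probabilities, which is exactly the economy the guide describes.</p>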
<p>I’ve chosen 100 hidden units, clamped a subset of the MNIST dataset to the network, and here’s the result:</p>
<p><img src="/assets/article_images/2015-07-23-finally-rbms-part-1/dense-binary.png" alt="" /></p>
<p>Looks kinda feature-ish, but a bit messed up. Let’s try to add some sparsity to make the hidden units represent more independent features. The basic way to do it is described by <a href="http://web.eecs.umich.edu/~honglak/nips07-sparseDBN.pdf">Honglak Lee</a>: just add <script type="math/tex">\rho - mean(h_{0})</script> to the weights and hidden biases, where <script type="math/tex">\rho</script> is the desired sparsity target, i.e. the average probability that a given unit is on. <em>Actually</em>, it is possible to add that term only to the hidden biases, and we’ll get back to this in a moment.</p>
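<p>In code the sparsity term is tiny (a sketch; the function name and the default target are illustrative):</p>

```python
import numpy as np

def sparsity_delta(p_h0, rho=0.05):
    """Lee-style sparsity term: target rho minus the mean hidden activation.

    p_h0 holds hidden probabilities for a batch, shape (batch, num_hidden);
    the result is scaled by a learning rate and added to the hidden biases
    (and optionally to the incoming weights of each hidden unit).
    """
    return rho - p_h0.mean(axis=0)
```
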
<p>Sparse RBM filters look like this:</p>
<p><img src="/assets/article_images/2015-07-23-finally-rbms-part-1/sparse-binary.png" alt="" /></p>
<p>I’m still not sure whether that’s cool or rather not: on the one hand, the average hidden unit activation definitely goes down, and units start to represent different (independent) features. On the other hand, too many of them look like “1”; maybe that’s because those diagonal straight lines tend to occur in “1” and “7” quite often. I’d also like to end up with nice penstroke-like features like those pictured in Lee’s paper, and after some digging I discovered that whitening (a preprocessing operation that makes different image pixels less correlated with each other) helps to achieve something like that. But I couldn’t apply it right away, because I needed my data to be real-valued, and <em>that</em>, in turn, required my visible units to be Gaussian.</p>
<p><img src="/assets/article_images/2015-07-23-finally-rbms-part-1/wdwhb.jpg" alt="" /></p>
<h1 id="gaussian-rbm">Gaussian RBM</h1>
<p>First I’d like to point out that I haven’t found any real analysis on the matter. The energy function for the Gaussian RBM is just stated (by Hinton, Lee and the other papers I’ve managed to google), and no further comments are made. That’s really disappointing, <em>especially</em> because everyone keeps saying Gaussian RBMs are hard to train. And, by the way, that is so true.</p>
<p>First, it doesn’t even work at all with uniformly initialized <script type="math/tex">[-\frac{1}{m}, \frac{1}{m}]</script> weights. To make it learn, I had to replace them with normally distributed weights with zero mean and 0.001 standard deviation (thanks to the practical guide again). Any attempt to increase the std value completely breaks learning.</p>
<p>Oh, and I forgot to mention the actual change: for the visible units I’ve replaced the activation function with sampling from a normal distribution with mean <script type="math/tex">(hw + b_{vis})</script> and unit variance. To be able to do that I had to rescale the input data to zero mean and unit variance (otherwise it’s also possible to learn a precise variance parameter for each unit). Also, I guess I can’t use the “raw” <script type="math/tex">hw + b_{vis}</script> value to collect learning statistics (as I did with Bernoulli probabilities), so I’m going to sample states everywhere.</p>
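<p>Both changes fit in a few lines (a sketch with illustrative names; the input data is assumed to be rescaled to zero mean and unit variance already):</p>

```python
import numpy as np

def init_weights(num_visible, num_hidden, rng=np.random):
    # the initialization that made it learn at all:
    # zero mean, 0.001 standard deviation
    return 0.001 * rng.randn(num_visible, num_hidden)

def sample_gaussian_visible(h, W, b_v, rng=np.random):
    # visible units are unit-variance Gaussians centred
    # on the top-down input h @ W.T + b_v
    mean = h @ W.T + b_v
    return mean + rng.randn(*mean.shape)
```
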
<p>According to my (kinda sloppy) observations, sparsity doesn’t work so well for the Gaussian RBM either: adding the sparsity penalty to the weights seems to push the gradient in the wrong direction, maybe? Anyway, the average hidden activation doesn’t change properly. I’ve followed Lee’s advice about adding the sparsity penalty to the hidden biases only, and now it works better.</p>
<p>The results are… slightly better as well, I guess:</p>
<p><img src="/assets/article_images/2015-07-23-finally-rbms-part-1/sparse-gaussian.png" alt="" /></p>
<p><em>(this is with sparsity enabled)</em></p>
<p>I can distinguish specific digits, but still no penstrokes. Let’s whiten the data and try again:</p>
<p><img src="/assets/article_images/2015-07-23-finally-rbms-part-1/whitened_digits.png" alt="" /></p>
<p><em>Whitened MNIST digits. Can’t say they actually change a lot</em></p>
<p><img src="/assets/article_images/2015-07-23-finally-rbms-part-1/whitened.png" alt="" /></p>
<p><em>Whoops</em></p>
<p>…and this is where I’m stuck at the moment. This doesn’t look like features in any way, and I don’t understand what the problem is. The curious thing is that the model still learns and the reconstruction error goes down, but it learns these pretty much random noise filters while still managing to make better reconstructions.</p>
<p>I’m going to pause for a bit and maybe look for external help. All the cases mentioned are uploaded to my playground repo.</p>
Thu, 23 Jul 2015 15:05:00 +0000
http://rocknrollnerd.github.io/ml/2015/07/23/finally-rbms.html
http://rocknrollnerd.github.io/ml/2015/07/23/finally-rbms.htmlmlOversimplified introduction to Boltzmann Machines<p>This is a continuation of the <a href="http://rocknrollnerd.github.io/ml/2015/07/14/memory-is-a-lazy-mistress.html">previous</a> post dedicated to (eventually) understanding Restricted Boltzmann Machines. I’ve already seen Hopfield nets, which act like associative memory systems by storing memories in local minima and reaching them from corrupted inputs by minimizing energy, and now… to something completely different.</p>
<p>The first unexpected thing is understanding that Boltzmann Machines are nothing like Hopfield nets, yet bear a strong resemblance to them. So, let’s start with similarities:</p>
<ul>
<li>A Boltzmann Machine is a network of binary units, all of which are connected together — just like a Hopfield net</li>
<li>There’s an energy function, which is exactly the same as Hopfield’s</li>
<li>When we update units, we use roughly the same rule of summing up all the weighted inputs and determining the unit’s output from the sum. We’re going to use a different activation function, though, instead of Hopfield’s binary threshold one.</li>
</ul>
<p>As for the differences, there are plenty of them too:</p>
<ul>
<li>The main one, I guess, is that a Boltzmann Machine is <strong>not</strong> a memory network. We’re not trying to store things anymore. Instead we’re going to look for <em>representations</em> of our input data (like MNIST digits), i.e., some probably more compact and meaningful way to represent all training examples, capturing their inner structure and regularities.</li>
<li>And we’re going to do that by adding an extra bunch of neurons called <em>hidden units</em> to the network. They are essentially just the same neurons as the other (<em>visible units</em>), except their values aren’t directly set from training data (but are, of course, influenced by it via input connections).</li>
<li>There’s another global objective — instead of simply minimizing the energy function, we’re trying to minimize the error between the input data and the reconstruction produced by hidden units and their weights.</li>
<li>Remember how having local minima was fine for Hopfield nets, because different minima corresponded to different memories? Well, that’s not the case anymore. When we do representation learning, there’s one final objective we’re trying to achieve, and therefore local minima become an issue. To avoid them, we add noise to the network by making neurons’ activations stochastic (that’s what I meant by having a “different” activation function before), so that they are active/inactive with some probability.</li>
</ul>
<p>So… well, having stochastic neurons is a novel thing, but it doesn’t actually change the inner logic of the model much, does it? These new hidden units, however, do. Remember how we used to minimize energy in Hopfield nets? Right, by applying the Hebbian rule to each pair of neurons, and this worked because we knew the exact value of each neuron (it was set, or “clamped”, by our training examples). Now that there are hidden units, they are free from external interference and their values are unknown to us. Hence we cannot update their weights; hence, a problem.</p>
<h1 id="representative-democracy-of-hidden-units">Representative democracy of hidden units</h1>
<p>I’ve suddenly discovered that my previous metaphor of “voting” neurons can actually be useful again. Remember how units in a Hopfield net used to cooperate with each other, voting for their neighbors to change their values? They don’t vote personally now; instead, they use a group of established hidden units to represent their collective will. Initially, when there’s no training data, the hidden units don’t have an opinion of their own; they’re like slick politicians waiting to hear the voice of the crowd (this metaphor starts to get out of hand). But when a training example is available, they have to mimic it as well as they can.</p>
<p>Well, how do they do it? To find out, we should use gradient descent. Because this is not a Hopfield net anymore, we’re trying to minimize not the energy function but the discrepancy between the input data and the reconstruction. The objective function can be written as the KL-divergence between two probability distributions (input and reconstruction), and it turns out that its derivative is quite simple and equals <script type="math/tex">(s_{i}s_{j})_{+} - (s_{i}s_{j})_{-}</script> (the derivation is provided in the <a href="http://web.archive.org/web/20110718022336/http://learning.cs.toronto.edu/~hinton/absps/cogscibm.pdf">original paper</a>, which is surprisingly easy to read). The first term of the derivative corresponds to mutual neuron activations in the so-called “positive phase” (when the network observes the training data), and the second one is the same but for the “negative phase” (when the network tries to reconstruct the input). So, if you are a slick-politician hidden unit trying to capture the many voices of the crowd, you do the following:</p>
<ul>
<li>say something at random (activate hidden unit by a random set of weights)</li>
<li>when listening to the crowd, strengthen the connections to the people you happen to agree with (units with the same state as the hidden unit), and similarly, weaken the connections to the people that disagree with you.</li>
<li>later at home, trying to rehearse your speech, do exactly the opposite — weaken the connections to the units that are different from the hidden one, and vice versa.</li>
</ul>
<p>This is indeed a very interesting learning rule, and while I understand it numerically, I still can’t construct a good intuitive explanation for, say, why we need the negative phase. In Hinton’s course the second term is described as the unlearning term, but is that the same unlearning that Hopfield nets perform, and if so, why is it now possible to measure it precisely (wait, I know: because Hopfield nets are not error-driven)? Another reason to have the negative phase is that Hebbian learning in its simplest form is unstoppable, so the weights will grow until learning is stopped. The negative phase allows us to control this growth: for example, imagine we’ve constructed the ideal representation, so that the input data equals the reconstruction. Then <script type="math/tex">(s_{i}s_{j})_{+} = (s_{i}s_{j})_{-}</script>, i.e. the derivative becomes zero and learning naturally stops. Still, that’s surely not the only way to cap the weights…</p>
<p>I’ve made a simple toy Boltzmann Machine with binary threshold units that performs <em>only the positive</em> phase, just to see what happens in that case. One interesting thing, apart from the endlessly growing weights, is that opposite-valued units in different states force some weights to stay at zero. For example, if there’s one hidden unit and three visible units, the weights are <script type="math/tex">(0, 0, 0)</script> and the network observes the two states <script type="math/tex">(-1, 1, -1)</script> and <script type="math/tex">(1, 1, -1)</script>, the first weight will never change from zero.</p>
<p class="center"><img src="/assets/article_images/2015-07-18-general-boltzmann-machines/1hidden.png" alt="" /></p>
<p><em>My example network. Notice that it’s actually a Restricted Boltzmann Machine, because there are no connections other than visible-hidden ones</em></p>
<p>If we choose a different set of weights, like <script type="math/tex">(2, 0, 0)</script>, the opposite thing happens: the first weight grows, suppressing the other two.</p>
<p>Here’s the class to play with:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">numpy</span> <span class="kn">as</span> <span class="nn">np</span>
<span class="k">class</span> <span class="nc">PositiveToyRBM</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
<span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">num_visible</span><span class="p">,</span> <span class="n">num_hidden</span><span class="p">,</span> <span class="n">w</span><span class="o">=</span><span class="bp">None</span><span class="p">):</span>
<span class="bp">self</span><span class="o">.</span><span class="n">num_visible</span> <span class="o">=</span> <span class="n">num_visible</span>
<span class="bp">self</span><span class="o">.</span><span class="n">num_hidden</span> <span class="o">=</span> <span class="n">num_hidden</span>
<span class="k">if</span> <span class="n">w</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
<span class="bp">self</span><span class="o">.</span><span class="n">w</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">((</span><span class="n">num_visible</span><span class="p">,</span> <span class="n">num_hidden</span><span class="p">))</span>
<span class="k">else</span><span class="p">:</span>
<span class="bp">self</span><span class="o">.</span><span class="n">w</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">float32</span><span class="p">(</span><span class="n">w</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">threshold</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">arr</span><span class="p">):</span>
<span class="n">arr</span><span class="p">[</span><span class="n">arr</span> <span class="o">>=</span> <span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="mi">1</span>
<span class="n">arr</span><span class="p">[</span><span class="n">arr</span> <span class="o"><</span> <span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span>
<span class="k">return</span> <span class="n">arr</span>
<span class="k">def</span> <span class="nf">hebbian</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">visible</span><span class="p">,</span> <span class="n">hidden</span><span class="p">):</span>
<span class="c"># for each pair of units determine if they are both on</span>
<span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">visible</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">visible</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="mi">1</span><span class="p">),</span>
<span class="n">hidden</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">hidden</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="mi">1</span><span class="p">))</span>
<span class="k">def</span> <span class="nf">pp</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">arr</span><span class="p">):</span>
<span class="c"># pretty print</span>
<span class="k">return</span> <span class="nb">list</span><span class="p">([</span><span class="nb">list</span><span class="p">(</span><span class="n">i</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">arr</span><span class="p">])</span>
<span class="k">def</span> <span class="nf">try_reconstruct</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">):</span>
<span class="n">h</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">threshold</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">w</span><span class="p">))</span>
<span class="n">recon</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">threshold</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">h</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">w</span><span class="o">.</span><span class="n">T</span><span class="p">))</span>
<span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span class="n">data</span> <span class="o">&</span><span class="n">mdash</span><span class="p">;</span> <span class="n">recon</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span>
<span class="k">def</span> <span class="nf">train</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">,</span> <span class="n">epochs</span><span class="o">=</span><span class="mi">10</span><span class="p">):</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
<span class="k">for</span> <span class="n">e</span> <span class="ow">in</span> <span class="nb">xrange</span><span class="p">(</span><span class="n">epochs</span><span class="p">):</span>
<span class="n">delta_w</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">example</span> <span class="ow">in</span> <span class="n">data</span><span class="p">:</span>
<span class="n">h</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">threshold</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">example</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">w</span><span class="p">))</span>
<span class="n">delta_w</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">hebbian</span><span class="p">(</span><span class="n">example</span><span class="p">,</span> <span class="n">h</span><span class="p">))</span>
<span class="c"># average</span>
<span class="n">delta_w</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">delta_w</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">w</span> <span class="o">+=</span> <span class="n">delta_w</span>
<span class="n">result</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">try_reconstruct</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
<span class="k">print</span> <span class="s">'epoch'</span><span class="p">,</span> <span class="n">e</span><span class="p">,</span> <span class="s">'delta w ='</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">pp</span><span class="p">(</span><span class="n">delta_w</span><span class="p">),</span> <span class="s">'new weights ='</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">pp</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">w</span><span class="p">),</span> <span class="s">'reconstruction ok?'</span><span class="p">,</span> <span class="n">result</span></code></pre></div>
<h1 id="how-do-we-daydream">How do we daydream</h1>
<p>Time to add the negative phase (also called daydreaming phase) term! And surprisingly, this is not so simple, because in the negative phase we need the network to be <em>free</em> from external interference, and… well, how do we do that? We generally cannot just set hidden units to random values with some fixed probability, because there will be no learning in that case. Turns out we should do the following: set units to random values and let them settle down by updating the units one at a time so that each unit takes the most probable value considering its neighbors. For example, if a neuron is surrounded by positive neighbors with positive weights, it will most likely become positive itself.</p>
<p>You can imagine that the weights define an energy landscape, and each state of the network corresponds to a certain point on it. When the state is set by external data example, the network is “forced” to keep the high ground, but what it “wants” to do is to become free of external influence and fall down to the (hopefully global) energy minimum (this is also called thermal equilibrium). The learning algorithm tries to make these two states the same by terraforming energy landscape (modifying weights) — this is actually the same process as in Hopfield nets.</p>
<p>Now we can implement it:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">collect_negative_stats</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="c"># we don't know in advance how many loops required to reach equilibrium</span>
<span class="n">stats</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">e</span> <span class="ow">in</span> <span class="nb">xrange</span><span class="p">(</span><span class="mi">10</span><span class="p">):</span>
<span class="c"># initial random state</span>
<span class="n">visible</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">threshold</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">rand</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">num_visible</span><span class="p">),</span> <span class="mf">0.5</span><span class="p">)</span>
<span class="n">hidden</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">threshold</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">rand</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">num_hidden</span><span class="p">),</span> <span class="mf">0.5</span><span class="p">)</span>
<span class="n">idx</span> <span class="o">=</span> <span class="nb">range</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">num_visible</span> <span class="o">+</span> <span class="bp">self</span><span class="o">.</span><span class="n">num_hidden</span><span class="p">)</span>
<span class="c"># settling for equilibrium</span>
<span class="c"># again, number of loops is guessed</span>
<span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">xrange</span><span class="p">(</span><span class="mi">50</span><span class="p">):</span>
<span class="n">i</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">choice</span><span class="p">(</span><span class="n">idx</span><span class="p">)</span> <span class="c"># selecting random neuron</span>
<span class="k">if</span> <span class="n">i</span> <span class="o"><</span> <span class="bp">self</span><span class="o">.</span><span class="n">num_visible</span><span class="p">:</span> <span class="c"># visible neuron</span>
<span class="n">visible</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">threshold</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">w</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">*</span> <span class="n">visible</span><span class="p">[</span><span class="n">i</span><span class="p">]))</span>
<span class="k">else</span><span class="p">:</span> <span class="c"># hidden neuron</span>
<span class="n">i</span> <span class="o">-=</span> <span class="bp">self</span><span class="o">.</span><span class="n">num_visible</span>
<span class="n">hidden</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">threshold</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">w</span><span class="p">[:,</span> <span class="n">i</span><span class="p">]</span> <span class="o">*</span> <span class="n">hidden</span><span class="p">[</span><span class="n">i</span><span class="p">]))</span>
<span class="c"># hopefully done, now make a reconstruction and collect stats</span>
<span class="n">recon</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">threshold</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">hidden</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">w</span><span class="o">.</span><span class="n">T</span><span class="p">))</span>
<span class="n">stats</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">hebbian</span><span class="p">(</span><span class="n">recon</span><span class="p">,</span> <span class="n">hidden</span><span class="p">))</span>
<span class="c"># average</span>
<span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">stats</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span></code></pre></div>
<p>And subtract the stats from the value of <code>delta_w</code>:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">delta_w</span> <span class="o">-=</span> <span class="bp">self</span><span class="o">.</span><span class="n">collect_negative_stats</span><span class="p">()</span></code></pre></div>
<p>And now it works pretty nicely, requiring just one pass to learn the correct reconstruction:</p>
<pre><code>if __name__ == '__main__':
rbm = ToyRBM(3, 1, w=[[0], [0], [0]])
rbm.train([[1, -1, 1], [-1, 1, -1]], with_negative=True)
>> epoch 0 delta w = [[-1.0], [-1.0], [-1.0]] new weights = [[-1.0], [-1.0], [-1.0]] reconstruction ok? True
</code></pre>
<h1 id="a-little-bit-more-serious-example">A little bit more serious example</h1>
<p>There are still some things left before we can apply our Boltzmann Machine to a “real” problem like representing MNIST digits.</p>
<ul>
<li>First of all, our toy examples use a deterministic activation function. While technically there’s nothing wrong with that, it makes our network vulnerable to local minima, meaning we won’t be able to reach a “relaxed” equilibrium state. So we’re going to replace our activation function with a coin toss with the following probability: <script type="math/tex">p(s_{i}) = \frac{1}{1 + e^{-\Delta E}}</script>, where <script type="math/tex">\Delta E</script> is the weighted sum of the neuron’s inputs (<script type="math/tex">\Delta E =\sum_{j}w_{ij}s_{j}+b_{i}</script>). Why exactly this function? This is another question I still haven’t intuitively understood, but the answer is that the Boltzmann distribution has a property that allows us to express the probability of a single unit turning on as a function of its energy gap. The function is derived step-by-step right in <a href="https://en.wikipedia.org/wiki/Boltzmann_machine">Wikipedia</a>.</li>
<li>Next, so far we’ve used only visible-to-hidden connections. The original model assumes units are connected to each other, so we’re going to add these hidden-to-hidden and visible-to-visible connections to the network.</li>
<li>This will slightly change the procedure, namely the positive phase, because now hidden neurons can influence each other. We’re going to apply the same logic here, letting the network settle down to a minimum while updating hidden units only (because visible units are fixed to training data).</li>
<li>The negative phase won’t change at all, just don’t forget to take into account these new weights.</li>
</ul>
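<p>The stochastic update from the first bullet can be sketched in a few lines (a minimal sketch with 1/-1 units; the function and variable names here are my own, not the post’s code):</p>

```python
import numpy as np

def sample_unit(w, s, b, i, rng=np.random):
    """Sample a new 1/-1 state for unit i given the rest of the network."""
    # energy gap: weighted sum of the other units' states plus the bias
    # (w is symmetric with a zero diagonal, so w[i, i] * s[i] contributes nothing)
    delta_e = np.dot(w[i], s) + b[i]
    # sigmoid coin toss: turn on with probability 1 / (1 + exp(-delta_e))
    p_on = 1.0 / (1.0 + np.exp(-delta_e))
    return 1 if rng.rand() < p_on else -1
```

<p>With a large positive energy gap the unit is almost certainly on, and with a gap of zero it’s a fair coin, which is exactly the randomness that lets the network escape local minima.</p>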
<p>I’ve also tried to switch to 0/1 binary unit values, which, I guess, adds a bit of extra computation to the Hebbian update (which should still be 1/-1). The need to update units one at a time makes learning quite slow (a hint: generating all the random choices at once speeds things up), so I’ve used just a small subset of MNIST digits restricted to 3 classes. And it seems we’re learning something:</p>
<div class="photo_frame_center">
<video width="650" height="250" controls="" preload="none" poster="/assets/article_images/2015-07-18-general-boltzmann-machines/poster.png">
<source src="/assets/article_images/2015-07-18-general-boltzmann-machines/learning.webm" type='video/webm; codecs="vp8, vorbis"' />
</video>
</div>
<p><br /></p>
<p>The unpleasant surprise is that showing more digit classes to the network makes the weights merge together into these ugly blobs (you can see a 4 on one of the weights, and a 1 elsewhere, but there’s also a 2 which is kinda missing). I didn’t try to use simulated annealing, mainly because it’s said to be a distraction from how Restricted Boltzmann Machines work. Playing with different parameters sometimes gives interesting outcomes: for example, if we don’t let the network settle down enough in the negative phase, we get these clumsy weights:</p>
<p><img src="/assets/article_images/2015-07-18-general-boltzmann-machines/negative_short.png" alt="" /></p>
<p>But strangely, when we let it settle down <em>too much</em> (how much?), the weights get weird too:</p>
<p><img src="/assets/article_images/2015-07-18-general-boltzmann-machines/negative_long.png" alt="" /></p>
<p>That’s really not the way it’s supposed to work, so I guess I should debug my implementation. Which is available <a href="https://github.com/rocknrollnerd/ml-playground">here</a>, by the way.</p>
<h1 id="quick-rbm-note">Quick RBM note</h1>
<p>Now it’s actually quite easy to understand how Restricted Boltzmann Machines are trained. In RBMs there are no hidden-to-hidden or visible-to-visible connections, so influence flows just between hidden and visible units. Meaning we can now update them in parallel — first compute hidden units’ activations, then visible, then hidden again, and so on until the network settles down to equilibrium. That’s called contrastive divergence. And it turns out, this method works even if we make <em>one</em> iteration of it — when the network is certainly not in equilibrium, but still gets updated in the right direction.</p>
<h1 id="summary">Summary</h1>
<p>Whoa, that took longer than I expected. But the incredible feeling of finally understanding, if not every part of it, then certainly the main idea — that’s absolutely worth looking up Hopfield nets and original Boltzmann Machines for. The next thing I want to try is to implement some different RBMs (convolutional, gaussian, softmax) and maybe compare their results with autoencoders, because now I’m starting to favor RBMs more, perhaps just because of the beauty of the concept. They don’t even use backprop, how cool is that? I wonder if there are any attempts to discover similar positive-negative cycles in real neurons, which, of course, don’t have symmetric connections, but still may constitute the same kind of structures. Or, do real neurons tend to “settle down” and minimize their energy? Oh snap, now I’m going to google it for hours.</p>
Sat, 18 Jul 2015 15:05:00 +0000
http://rocknrollnerd.github.io/ml/2015/07/18/general-boltzmann-machines.html
http://rocknrollnerd.github.io/ml/2015/07/18/general-boltzmann-machines.htmlmlMemory is a lazy mistress<p>Trying to jump on the deep learning bandwagon, I often miss things. Sometimes I find my mind filled with models and algorithms I hardly fully understand: they become obscure concepts and fancy buzzwords. That actually bothers me, so I’ve decided to make a couple of detailed runs across the stuff I’m kinda already familiar with — this time, Restricted Boltzmann Machines. And it turns out that the best way for me to understand something is to write it down in a stupidly oversimplified, tutorial style, leaving out intimidating equations and trying to make small code examples all the way through. So that’s what I’m going to do now. Maybe someone else will find it helpful too.</p>
<p>What do I <em>already</em> know about RBMs? They are models that perform a kind of factor analysis on input data, extracting a smaller set of hidden variables that can be used as a data representation. The first time I encountered RBMs, I wasn’t too excited about them — after all, there are <em>lots</em> of representation algorithms in machine learning, including autoencoders, which are simple perceptrons and can be learned via the already familiar backprop. An RBM is different by being both stochastic (its neurons’ values are calculated by tossing a probabilistic coin) and generative — meaning that it can generate data on its own after learning. Aaand basically that’s what I start with.</p>
<p>So let’s start with Hinton’s amazingly complete yet sometimes still-hard-for-me-to-grasp course, <a href="https://class.coursera.org/neuralnets-2012-001">Neural Networks for Machine Learning</a>, lectures 11 and 12. And surprisingly, the journey begins with Hopfield nets.</p>
<h1 id="memory-networks">Memory networks</h1>
<p>A Hopfield net is a bunch of binary neurons that can take values of 1 or -1. Each neuron is connected to every other neuron except itself, the weights are real-valued and symmetric — so if there are <script type="math/tex">N</script> neurons, there are going to be <script type="math/tex">\frac{N(N -1)}{2}</script> weights.</p>
<p>Now, there’s a function that calculates a special scalar property of a network, called the <strong>energy function</strong>. To obtain the energy for a single neuron, we do the following:</p>
<ul>
<li>for all neighboring neurons (there are exactly <script type="math/tex">N-1</script> of them), multiply each neighbor’s value by the value of its weight and by the value of the current neuron.</li>
<li>then simply add these products together with a minus sign.</li>
</ul>
<p>The energy of the whole network is basically just the sum of these terms, calculated for each neuron. Let’s forget about why exactly this value is called “energy” for a moment and think about how it depends on the state of the network. Suppose we have a neuron <script type="math/tex">s_{i}</script>, one of its neighbors <script type="math/tex">s_{j}</script> and a weight <script type="math/tex">w_{ij}</script>. Their product <script type="math/tex">s_{i}s_{j}w_{ij}</script> is a part of the global energy value, and when it’s positive, the global energy goes down (remember the minus sign). Now let’s think about the weight as something we can adjust to be whatever we want, and concentrate on the neurons’ values. The product is positive if both neurons are positive or negative at the same time and the weight is positive too, or if the two values differ and the weight is negative. Now, that does actually look familiar… because that’s the famous <a href="https://en.wikipedia.org/wiki/Hebbian_theory">Hebbian learning</a> rule — neurons that fire together, wire together (i.e., have a positive connection). So one thing we learn for now is that the energy dynamics depend on the neurons’ behaviour — if neurons that “agree” with each other are connected by positive weights, and neurons that “disagree” are connected by negative weights, the energy goes down; otherwise it goes up.</p>
<p>Now why does that matter? Because minimizing the energy function is exactly the thing we’re going to do when learning weights for a Hopfield net. The purpose of training is, however, different from the usual supervised learning objective: we’re not comparing data to some labeled target, but instead trying to <em>store</em> it in the network so that the state of the network corresponds to energy minimum. And the storage rule is quite simple: set the neurons to the values of our data vector (say, a single binary image), and update the weights the following way:</p>
<script type="math/tex; mode=display">
w_{ij} = w_{ij} + s_{i}s_{j}
</script>
<p>Time to write a piece of code!</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">class</span> <span class="nc">HopfieldNet</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
<span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">num_units</span><span class="p">):</span>
<span class="bp">self</span><span class="o">.</span><span class="n">num_units</span> <span class="o">=</span> <span class="n">num_units</span>
<span class="bp">self</span><span class="o">.</span><span class="n">w</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">((</span><span class="n">num_units</span><span class="p">,</span> <span class="n">num_units</span><span class="p">))</span>
<span class="bp">self</span><span class="o">.</span><span class="n">b</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">num_units</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">store</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">):</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">data</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">data</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="mi">1</span><span class="p">)</span>
<span class="n">activations</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">data</span><span class="o">.</span><span class="n">T</span><span class="p">)</span>
<span class="n">np</span><span class="o">.</span><span class="n">fill_diagonal</span><span class="p">(</span><span class="n">activations</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span> <span class="c"># because there are no connections to itself</span>
<span class="bp">self</span><span class="o">.</span><span class="n">w</span> <span class="o">+=</span> <span class="n">activations</span>
<span class="bp">self</span><span class="o">.</span><span class="n">b</span> <span class="o">+=</span> <span class="n">data</span><span class="o">.</span><span class="n">ravel</span><span class="p">()</span>
<span class="k">def</span> <span class="nf">get_energy</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">):</span>
<span class="c"># first let's again compute product of activations</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">data</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">data</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="mi">1</span><span class="p">)</span>
<span class="n">activations</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">float32</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">data</span><span class="o">.</span><span class="n">T</span><span class="p">))</span>
<span class="n">np</span><span class="o">.</span><span class="n">fill_diagonal</span><span class="p">(</span><span class="n">activations</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
<span class="c"># then multiply each activation by a weight elementwise</span>
<span class="n">activations</span> <span class="o">*=</span> <span class="bp">self</span><span class="o">.</span><span class="n">w</span>
<span class="c"># total energy consists of weight and bias term</span>
<span class="n">weight_term</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span class="n">activations</span><span class="p">)</span> <span class="o">/</span> <span class="mi">2</span> <span class="c"># divide by 2, because we've counted neurons twice</span>
<span class="n">bias_term</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">b</span><span class="p">,</span> <span class="n">data</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
<span class="k">return</span> <span class="o">-</span> <span class="n">bias_term</span> <span class="o">-</span> <span class="n">weight_term</span></code></pre></div>
<p><strong>Some things I’ve purposely left out:</strong></p>
<ul>
<li>biases (notice <code>self.b = np.zeros(num_units)</code>). If you’re already familiar with neural networks they work just the same way here. If not, look up a <a href="http://stackoverflow.com/questions/2480650/role-of-bias-in-neural-networks">great explanation</a> why we need biases.</li>
<li>data is formatted as binary and takes values of 1 and -1.</li>
</ul>
<p>Let’s put a MNIST digit in it (resized to 8x8 px). If we try to visualize the network, it kinda looks like a messy hairball…</p>
<p><img src="/assets/article_images/2015-07-14-memory-is-a-lazy-mistress/hopfield-full.png" alt="" /></p>
<p><em>Red weights are positive and blue weights are negative, but they’re too messed up to see any pattern</em></p>
<p>So let’s instead show only positive weights between positive neurons.</p>
<p><img src="/assets/article_images/2015-07-14-memory-is-a-lazy-mistress/hopfield-positive.png" alt="" /></p>
<p>Active neurons are highlighted in purple. It seems that negative neurons <em>inside</em> the zero symbol have incoming positive connections too, but that’s just an overlay — they really are not connected to any positive neurons.</p>
<p>Now, what do we see here? The storage rule has brought us exactly to the state Hebbian learning predicts: positive neurons “support” each other. Negative units, not shown here, also support each other, and (guess what) these two fractions have negative connections between each other’s units too.</p>
<p>So let’s think about any possible application of such a network, like, what can we do with all this support? Literally the first thing that comes to mind is error correction. For example, there are 16 positive neurons. Suppose one of them has flipped its state and become negative, and we want to check what state it actually should be in. Sometimes it’s convenient for me to think about a neural network as a council of voters, and here’s what happens in that case:</p>
<ul>
<li>the neuron asks 15 of its <em>positive</em> neighbors something like “should I behave like you?” and they tell it “yes, you should”.</li>
<li>then, the neuron asks 48 of its <em>negative</em> neighbors the same question and they tell it “no”.</li>
<li>so, our neuron receives a total score of 63 votes telling it to become positive, and immediately does so, changing its initial value.</li>
</ul>
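<p>The vote count above is easy to verify in a couple of lines (a toy 8x8 setup with a single stored pattern, which gives unit-magnitude weights; my own sketch, not the post’s code):</p>

```python
import numpy as np

# 64 units: 16 positive (+1) and 48 negative (-1), as in the example
s = np.concatenate([np.ones(16), -np.ones(48)])
# Hebbian weights from the single stored pattern: w_ij = s_i * s_j, no self-connections
w = np.outer(s, s)
np.fill_diagonal(w, 0)
# flip unit 0 and poll its neighbors
s[0] = -1
weighted_sum = np.sum(w[0] * s)
print(weighted_sum)  # 63.0: 15 "yes" votes from positive neighbors plus 48 from negative ones
```

<p>Since the weighted sum is positive, the restoring rule tells the flipped unit to turn back on.</p>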
<p>A 63 to 0 vote ratio leaves our neuron literally no doubt, and that also means we can afford to have more errors in our corrupted data. If another one of the positive neurons flips and becomes negative, the ratio would be 62 to 1, and so on. So we can actually put in just a small chunk of the original data vector (or a piece of image) and still be able to correctly restore it. That is what’s called <em>associative memory</em> — a kind of memory that can be restored by observing just a tiny part of it. It is believed to be the kind of memory (at least part of it) we humans have, because we’re incredibly good at recognizing wholes by parts.</p>
<p>Let’s formalize our voting procedure to get the restoring rule for Hopfield nets:</p>
<ul>
<li>for each neuron, compute a weighted sum of all its inputs, i.e. <script type="math/tex">\sum\limits_{j=1}^{N-1} s_{j}w_{ij}</script>.</li>
<li>if this sum <script type="math/tex">>0</script>, set neuron’s value to 1, else to -1…</li>
<li>…or alternatively, compute and store network’s energy <script type="math/tex">E</script>, then flip a neuron and compute the energy again after that state change (<script type="math/tex">E_{new}</script>). If <script type="math/tex">% <![CDATA[
E_{new}<E %]]></script>, remember the new energy minimum and keep the change, otherwise do nothing and pick another neuron to try.</li>
</ul>
<p>The amazing thing here is that these two rules are equivalent (because we’ve defined energy to support the agreement between neurons that vote the same way and to repel neurons that disagree). The first rule, however, is much cheaper computationally, because it only requires local information (while the second rule requires access to the whole network).</p>
<p>Let’s implement it:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">restore</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">):</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">copy</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
<span class="n">idx</span> <span class="o">=</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">data</span><span class="p">))</span>
<span class="c"># make 10 passes through the data</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">xrange</span><span class="p">(</span><span class="mi">10</span><span class="p">):</span>
<span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">xrange</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">data</span><span class="p">)):</span>
<span class="n">j</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">choice</span><span class="p">(</span><span class="n">idx</span><span class="p">)</span>
<span class="n">inputs</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span class="n">data</span> <span class="o">*</span> <span class="bp">self</span><span class="o">.</span><span class="n">w</span><span class="p">[</span><span class="n">j</span><span class="p">])</span>
<span class="k">if</span> <span class="n">inputs</span> <span class="o">></span> <span class="mi">0</span><span class="p">:</span>
<span class="n">data</span><span class="p">[</span><span class="n">j</span><span class="p">]</span> <span class="o">=</span> <span class="mi">1</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">data</span><span class="p">[</span><span class="n">j</span><span class="p">]</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span>
<span class="k">return</span> <span class="n">data</span></code></pre></div>
<p>And try to restore something corrupted:</p>
<div class="photo_frame_center">
<video width="650" height="500" controls="" preload="none" poster="/assets/article_images/2015-07-14-memory-is-a-lazy-mistress/poster.png">
<source src="/assets/article_images/2015-07-14-memory-is-a-lazy-mistress/restore.webm" type='video/webm; codecs="vp8, vorbis"' />
</video>
</div>
<p><br /></p>
<h1 id="how-many-errors-we-can-make">How many errors we can make</h1>
<p>If you are familiar with the concept of <a href="https://en.wikipedia.org/wiki/Forward_error_correction">error correction coding</a>, you should be, at least for now, a bit disappointed. Think about it this way: error correction becomes possible when we carry some extra information with our precious data. The amount of this information can be determined by the means of information theory, and it’s usually not much, because data transfer and storage cost money and resources. With a Hopfield net we can correct literally <em>any</em> amount of errors (when only one image is stored, you can start from a blank image and still get the correct answer), but we pay the quadratic price of <script type="math/tex">\mathcal{O}(N^2)</script> connections (each neuron is connected to every other neuron, remember), and that’s quite a lot.</p>
<p>The cool thing that really surprised me is that we can store <em>multiple</em> memories in the same network, and we don’t even have to modify our storage rule. Simply apply it again for a new piece of data, and then again and so on. To understand how this works, let’s get back to our voting example again:</p>
<ul>
<li>Remember, we calculated <script type="math/tex">\Delta w_{ij}</script> as the product of neurons’ activations, so it could be either 1 or -1.
We can say that was a “one weight, one voice” model, meaning that all the absolute values were the same.</li>
<li>Now when we apply the storage rule one more time, weight values accumulate. Some of the weights (that connect the same active/inactive neurons for both images) will end up having values of +/- 2, meaning that their vote costs more now.</li>
<li>Think about these “doubled” weights as values the network is certain about. For example, if we’d like to continue storing different MNIST digits, each of them would have a negative border (the background). All the agreement connections between these background neurons will add up, and when it’s time to ask neighbor voters about a certain neuron’s state, it will receive strong support from every one of them. The network kinda tells us “I don’t know what MNIST digit that neuron belongs to, but it really should be negative anyway”.</li>
<li>Other neurons now start competing with each other by gathering votes in their support. If a certain neuron should be active in one image and inactive in the other, here’s what happens: a neuron asks for support from its neighbors and they divide in two fractions, that tell it to switch on or off. The fraction which casts more votes wins and subdues a neuron to its collective will. And of course, if some of the neurons from both fractions are corrupted too, that messes up the output decision.</li>
</ul>
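<p>The weight accumulation described above is easy to observe directly (a toy sketch of my own, storing two 4-unit patterns with the same Hebbian rule):</p>

```python
import numpy as np

# two patterns that agree on units 2 and 3 (the "background") and disagree elsewhere
a = np.array([1, -1, -1, -1])
b = np.array([-1, 1, -1, -1])

w = np.zeros((4, 4))
for p in (a, b):
    delta = np.outer(p, p)
    np.fill_diagonal(delta, 0)  # no self-connections
    w += delta

print(w[2, 3])  # 2.0 -- both patterns cast the same vote, so this connection doubled
print(w[0, 2])  # 0.0 -- the patterns disagree here, and their votes cancel out
```

<p>The doubled background connection is exactly the “certain” vote described above, while the canceled one leaves the decision to the remaining neighbors.</p>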
<p><em>Now</em> you can see that the network’s ability to correct errors has decreased. We cannot restore images from almost random noise now, because we don’t know which fraction is going to prevail. We have to show the network a relatively distinct piece of the image to obtain the correctly restored version of it, and that really looks more like actual memory now.</p>
<h1 id="learning-backwards">Learning backwards</h1>
<p>There are additional complications, though: it turns out different memories can merge together when they correspond to the same local minimum (another cool thing about Hopfield nets is that we’re not trying to escape local minima anymore, because they are memory storage locations). It’s been shown that you can store about <script type="math/tex">0.138N</script> memories in an <script type="math/tex">N</script>-neuron net, but my MNIST example actually breaks at the third — I guess that’s because some memories (0 and 2, for example) are partially similar (sic).</p>
<p><img src="/assets/article_images/2015-07-14-memory-is-a-lazy-mistress/mixed.png" alt="" /></p>
<p>To avoid the issue, you can use a curious technique called <em>unlearning</em> or <em>reverse learning</em>, which is basically this: you set the network to a random state, and then apply the same Hebbian learning rule but with a minus sign. The idea of reverse learning was actually introduced before Hopfield nets by Crick (no less) and Mitchison, who considered it to be a possible theory of dreams. I’d certainly like to read something on the matter, partially because (getting a little bit ahead of myself) unlearning plays an important part in Boltzmann Machine learning, but mostly because it looks like an awesomely cool concept. And, by the way, so does the whole concept of memories as energy minima! As George Carlin said, <em>“When I first heard of entropy in high school science I was attracted to it immediately”</em>.</p>
<p>I didn’t try to implement the unlearning procedure, mostly because I felt I’d already dug deep enough into Hopfield nets. The full implementation is available <a href="https://github.com/rocknrollnerd/ml-playground">here</a>. Next stop is Boltzmann Machine station!</p>
Tue, 14 Jul 2015 15:05:00 +0000
http://rocknrollnerd.github.io/ml/2015/07/14/memory-is-a-lazy-mistress.html
http://rocknrollnerd.github.io/ml/2015/07/14/memory-is-a-lazy-mistress.htmlml