A dramatic increase in computing power, coupled with the exponential expansion of training data, is powering Deep Learning (DL), a subset of Artificial Intelligence (AI). As a result, the range of problems DL can solve keeps growing. Both academia and industry are turning to DL, the former to advance research and the latter to solve existing business problems. To help you gain a deeper understanding of its inner workings, I'll provide an overview of some recent work by Naftali Tishby, a leader in machine learning research and computational neuroscience.
The Current State of Deep Learning
Despite the wide adoption of DL, there remains little theoretical understanding of it; much of what we know is derived solely from applying algorithms in practice. That is, DL model selection and parameter tuning are still settled by experiment rather than by provable, justifiable reasoning. And though there are some rules of thumb for these, there is rarely a provable theoretical conclusion behind them.
Tishby's Advancements in Deep Learning
In 2017, Naftali Tishby and his students made significant advances toward a theoretical understanding of DL via information theory [i]. They visualized Deep Neural Networks (DNNs) on the Information Plane, the plane whose two coordinates are the mutual information that each layer preserves about the input variable and about the output variable. This showed that the training of a DNN can be divided into two phases: the Empirical Error Minimization (ERM) phase and the Representation Compression (RC) phase. They also illustrated the effect that the number of layers in a DNN and the size of its training set have on the results. It's worth noting that Tishby and his team arrived at these conclusions by observing the behaviour of a DNN during training, rather than basing them on strict theoretical proof. Two further observations are interesting here:
- Deep Neural Networks learn like humans. The two phases observed in training are strikingly consistent with the stages of human learning. On one hand, the ERM phase is consistent with how humans learn from concrete examples. For instance, when we were kids and learned the concept of the number “1”, we were taught that “one apple” is “1” and “one banana” is also “1”. Throughout the learning process, we were given multiple examples of these instances and tried to memorize them all. On the other hand, the RC phase is consistent with how humans learn abstract concepts. Based on all the concrete examples that we learn of the number “1”, we are gradually able to grasp the abstract concept of the number.
- Information Theory will help us understand Deep Learning. The work by Tishby and his team also illustrated that concepts from information theory, such as entropy and mutual information, are useful for understanding DL. The idea that DL and information theory are connected, and that the latter can be used to analyze DL models, isn't new: finding a good DL model closely resembles finding a good encoder, which is a typical problem in information theory. What Tishby and his team brought to the table, however, was a demonstration that information theory is a valuable tool for observing the behaviour of DL models. The ability to prove a conclusion using information theory is still very rare, and before Tishby, ideas and concepts from information theory were used only in very restricted ways. For example, entropy was used only to define a loss function.
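To make the two Information Plane coordinates more concrete, here is a minimal sketch of how entropy and mutual information can be computed for small discrete variables. It is an illustration only, not the estimation procedure used in [i]: the helper names (entropy, joint_table, mutual_information) and the toy variables are assumptions made up for this example, and a "layer" is reduced to a deterministic function of a four-valued input.

```python
import numpy as np

def entropy(p):
    """Shannon entropy, in bits, of a discrete distribution p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # by convention, 0 * log(0) contributes nothing
    return float(-np.sum(p * np.log2(p)))

def joint_table(a, b):
    """Empirical joint distribution of two equal-length integer sequences."""
    a = np.asarray(a)
    b = np.asarray(b)
    table = np.zeros((a.max() + 1, b.max() + 1))
    np.add.at(table, (a, b), 1.0)  # count co-occurrences of (a[i], b[i])
    return table / table.sum()

def mutual_information(joint):
    """I(A;B) = H(A) + H(B) - H(A,B), in bits, from a joint table."""
    joint = np.asarray(joint, dtype=float)
    h_a = entropy(joint.sum(axis=1))
    h_b = entropy(joint.sum(axis=0))
    return h_a + h_b - entropy(joint.ravel())

# Toy setup (hypothetical): the input X takes four equally likely values,
# and the label Y only depends on which half X falls in.
x = np.arange(4)
y = x // 2

# Two candidate "layer representations" T of the input.
t_identity = x          # keeps everything about X
t_compressed = x // 2   # keeps only what is needed to predict Y

for name, t in [("identity", t_identity), ("compressed", t_compressed)]:
    i_xt = mutual_information(joint_table(x, t))  # information about the input
    i_ty = mutual_information(joint_table(t, y))  # information about the output
    print(f"{name}: I(X;T) = {i_xt:.2f} bits, I(T;Y) = {i_ty:.2f} bits")
```

In this toy case both representations keep about 1 bit of information on the label, but the compressed one keeps about 1 bit on the input instead of 2. That is the kind of movement on the Information Plane that the RC phase describes: information about the input is discarded while information about the output is retained.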
What the Future Holds
Though there have been successful efforts since Tishby's publication to demonstrate that his findings cannot be generalized to all DNNs [ii], his work has helped the community take a large step toward a better understanding of DL. Next, I expect that information theory will be used to analyze DL models and to provide a means of achieving provable results for model selection across different DL problems. I also expect that we will soon be able not only to 'open the black box of DNNs' but also to 'shine a light' into it to gain an even deeper understanding.
[i] Shwartz-Ziv R, Tishby N. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810, 2017.
[ii] Saxe A M, Bansal Y, Dapello J, et al. On the information bottleneck theory of deep learning. 2018.