# Projects

This page is a collection of projects that I have worked on. They’re listed in approximately reverse chronological order: entries closer to the top are newer and/or more exciting than those near the bottom.

# Learning to Learn 2016 – 2017

Much of the modern work in optimization is based on designing update rules tailored to specific classes of problems, with the types of problems of interest differing between research communities. For example, in the deep learning community we have seen a proliferation of optimization methods specialized for high-dimensional, non-convex problems. In contrast, communities that focus on sparsity tend to favour very different approaches, and this is even more the case for combinatorial optimization, for which relaxations are often the norm.

Each of these communities makes different assumptions about its problems of interest. In nearly every case, the more that is known about the target problem, the more focused the methods that can be developed to exploit its structural regularities.

Instead of characterising structural regularities mathematically, we can treat their identification as a learning problem, and train models that learn about the learning process itself.

We applied this insight to learning update rules for gradient-based optimization using recurrent neural networks.

- Marcin Andrychowicz, Misha Denil, Sergio Gómez Colmenarejo, Matthew W. Hoffman, David Pfau, Tom Schaul, and Nando de Freitas. **Learning to Learn by Gradient Descent by Gradient Descent**. In *Neural Information Processing Systems*, 2016. [arXiv](https://arxiv.org/abs/1606.04474)

We have code for training optimizers available on GitHub.
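As a toy illustration of the interface (a simplified sketch, not the released code or the paper's coordinatewise LSTM): the learned optimizer replaces a hand-designed rule like SGD with a recurrent network m that maps the current gradient and a hidden state to a parameter update. Here the RNN weights are random and untrained; training them by backpropagating through the unrolled optimization is the "learning to learn" step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy objective: f(theta) = ||theta||^2, so grad f(theta) = 2 * theta.
def grad_f(theta):
    return 2.0 * theta

# A tiny coordinatewise RNN optimizer m(grad, h) -> (update, h').
# The same weights are shared across all coordinates; they are random
# (untrained) here, purely to show the shape of the computation.
H = 5  # hidden size per coordinate
W_gh = rng.normal(scale=0.1, size=(H,))    # gradient -> hidden
W_hh = rng.normal(scale=0.1, size=(H, H))  # hidden -> hidden
W_hu = rng.normal(scale=0.1, size=(H,))    # hidden -> update

def learned_update(grad, h):
    h_new = np.tanh(np.outer(grad, W_gh) + h @ W_hh)  # (D, H)
    update = h_new @ W_hu                             # (D,)
    return update, h_new

# Run the learned update rule: theta_{t+1} = theta_t + m(grad, h).
theta = rng.normal(size=3)
h = np.zeros((3, H))
for t in range(10):
    g = grad_f(theta)
    update, h = learned_update(g, h)
    theta = theta + update
```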

This works surprisingly well more or less out of the box, but with two weaknesses: the trained optimizer has a fairly high memory overhead, and generalisation to new problems is not always reliable. We explored these problems in some detail, and found that the generalisation properties of our learned optimizers could be greatly improved through careful design of the model and training procedure.

- Olga Wichrowska, Niru Maheswaranathan, Matthew W. Hoffman, Sergio Gómez Colmenarejo, Misha Denil, Nando de Freitas, and Jascha Sohl-Dickstein. **Learned Optimizers that Scale and Generalize**. In *International Conference on Machine Learning*, 2017. [arXiv](https://arxiv.org/abs/1703.04813)

In addition to learning gradient-based updates, we have also looked at learning *global* optimizers. Here the challenges differ from those of gradient-based methods, and we show how to design supervised objectives that explicitly encourage the optimizer to explore the search space. We also show how to exploit gradients at training time in order to learn optimizers that operate gradient-free at test time.

- Yutian Chen, Matthew W. Hoffman, Sergio Gómez Colmenarejo, Misha Denil, Timothy P. Lillicrap, and Nando de Freitas. **Learning to Learn for Global Optimization of Black Box Functions**. In *International Conference on Machine Learning*, 2017. [arXiv](https://arxiv.org/abs/1611.03824)

# Learning to Experiment 2016

When encountering novel objects, humans are able to infer a wide range of physical properties such as mass, friction and deformability by interacting with them in a goal-driven way. This process of active interaction is in the same spirit as a scientist performing an experiment to discover hidden facts.

Recent advances in artificial intelligence have yielded machines that can achieve superhuman performance in Go, Atari, natural language processing, and complex control problems, but it is not clear that these systems can rival the scientific intuition of even a young child.

We introduce two tasks that require agents to interact with the world in order to identify a non-visual physical property of the objects it contains. In the first environment agents must poke blocks to determine which of them is the heaviest, and in the second environment agents must knock down block towers and count the number of rigid bodies they are composed of.

- Misha Denil, Pulkit Agrawal, Tejas D Kulkarni, Tom Erez, Peter Battaglia, and Nando de Freitas. **Learning to Perform Physics Experiments via Deep Reinforcement Learning**. In *International Conference on Learning Representations*, 2017. [arXiv](https://arxiv.org/abs/1611.01843) [paper](http://openreview.net/forum?id=r1nTpv9eg)

Several news sites have written nice articles about this work for a general audience.

# Predicting Parameters in Deep Learning 2013 – 2016

It is well known that the first layer features in many deep learning architectures tend to be globally smooth with local edge features, similar to local Gabor functions, when trained on natural images. We show that it is possible to exploit the redundancy in this type of structure when learning neural nets.

Our key innovation is to store explicit values for only a sparse subset of the weights in each feature, and predict the remaining values using ridge regression. By representing only a small number of weights explicitly we are able to dramatically reduce the number of parameters which must be learned. In the best case we are able to predict more than 95% of the parameters in a network without any drop in accuracy.
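A minimal one-dimensional sketch of the idea (illustrative toy values, not the paper's exact construction): treat a smooth filter as a function of weight position, store explicit values for only a few positions, and fill in the rest with kernel ridge regression under a smoothness kernel.

```python
import numpy as np

# A smooth 1-D "filter" of length n stands in for one learned feature.
n = 64
x = np.linspace(0.0, 1.0, n)
w_true = np.sin(2.0 * np.pi * x)

# Store explicit values for only a sparse subset of the weights.
idx = np.arange(0, n, 8)  # keep 8 of the 64 weights
w_obs = w_true[idx]

# A squared-exponential kernel over weight positions encodes smoothness.
def kernel(a, b, ell=0.2):
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / (2.0 * ell ** 2))

# Kernel ridge regression predicts the remaining weights from the subset.
lam = 1e-6
K = kernel(x[idx], x[idx])
alpha = np.linalg.solve(K + lam * np.eye(len(idx)), w_obs)
w_pred = kernel(x, x[idx]) @ alpha

rel_err = np.linalg.norm(w_pred - w_true) / np.linalg.norm(w_true)
```

Because the kernel encodes the right prior (smoothness), a small stored subset suffices to reconstruct the full weight vector accurately.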

Our technique is very general, and can be applied to almost any neural network model; it is also orthogonal to, and can be combined with, other recent advances in deep learning such as dropout, rectified units and maxout.

Reducing the number of learned parameters could also be of great benefit in distributed settings. The bottleneck in large scale distributed learning of neural networks is the cost of synchronizing parameter updates between machines. If fewer parameters need to be learned then this burden can be reduced.

- Misha Denil, Babak Shakibi, Laurent Dinh, Marc’Aurelio Ranzato, and Nando de Freitas. **Predicting Parameters in Deep Learning**. In *Neural Information Processing Systems*, 2013. [paper](http://papers.nips.cc/paper/5025-predicting-parameters-in-deep-learning) [poster](2013-predicting-parameters-poster-mdenil.pdf)

I also presented this work at the ICML 2013 Challenges in Representation Learning Workshop.

**Feb 2014:** Our lab received a Google research award to continue this work.

We applied the predicting parameters idea to reducing the memory footprint of a standard convolutional network (e.g., AlexNet). In particular, we show how the fully connected layers can be replaced with a randomized kernel machine. This transformation is efficient in terms of both memory and computation, and importantly we can learn the parameters of the kernel transformation jointly with the parameters of the convolutional layers of the network.
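As a hedged sketch of what "randomized kernel machine" means here: the paper builds on Fastfood, which approximates the Gaussian-kernel random Fourier feature map of Rahimi and Recht using fast Hadamard transforms. The slower dense-matrix version of that map, standing in for the fully connected layers, looks like this (toy dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, n_feats, n_classes = 64, 256, 10

# Random Fourier features phi(x) = sqrt(2/D) * cos(W x + b) approximate a
# Gaussian kernel; Fastfood reproduces this map with structured matrices
# (Hadamard transforms) instead of the dense W used here.
W = rng.normal(size=(n_feats, d_in))
b = rng.uniform(0.0, 2.0 * np.pi, size=n_feats)

def phi(x):
    return np.sqrt(2.0 / n_feats) * np.cos(W @ x + b)

# The fully connected layers are replaced by a softmax on top of phi(x).
V = rng.normal(scale=0.01, size=(n_classes, n_feats))
x = rng.normal(size=d_in)  # e.g. output of the convolutional layers
logits = V @ phi(x)
probs = np.exp(logits - logits.max())
probs /= probs.sum()
```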

- Zichao Yang, Marcin Moczulski, Misha Denil, Nando de Freitas, Alex Smola, Le Song, and Ziyu Wang. **Deep Fried Convnets**. In *International Conference on Computer Vision*, 2015. [arXiv](http://arxiv.org/abs/1412.7149)

A major drawback of the Deep Fried approach is that the randomized kernel machine requires a lot of features in order to work well. The feature mapping itself is efficient, but its high output dimensionality introduces a large number of parameters into the final softmax layer, whose input dimensionality grows accordingly.

To correct this drawback we can make the network deep instead of wide. We designed a new layer composed of diagonal matrices and cosine transforms that can be stacked to form very deep hierarchies, which we call ACDC. At a high level ACDC is similar to the Deep Fried layer, but the construction is simplified and the interpretation as a kernel machine is abandoned.
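A minimal numpy sketch of one natural reading of a single such layer: y = C⁻¹ D C A x, where A and D are the learned diagonals and C is an orthonormal DCT. The real implementation uses fast transforms; this illustrative version builds the DCT matrix explicitly.

```python
import numpy as np

def dct_matrix(n):
    # Orthonormal DCT-II matrix, so C @ C.T == I and C.T is the inverse DCT.
    j = np.arange(n)
    C = np.cos(np.pi * (j[None, :] + 0.5) * j[:, None] / n)
    C *= np.sqrt(2.0 / n)
    C[0] *= np.sqrt(0.5)
    return C

rng = np.random.default_rng(0)
n = 16
C = dct_matrix(n)

# Learnable parameters: two length-n diagonals, i.e. 2n parameters
# versus n^2 for a dense linear layer of the same size.
a = rng.normal(size=n)
d = rng.normal(size=n)

def acdc_layer(x):
    # A (diagonal), C (DCT), D (diagonal), C^{-1} (inverse DCT).
    return C.T @ (d * (C @ (a * x)))

x = rng.normal(size=n)
y = acdc_layer(x)
```

Stacking many such layers recovers an expressive linear map while keeping the parameter count linear in the layer width.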

- Marcin Moczulski, Misha Denil, Jeremy Appleyard, and Nando de Freitas. **ACDC: A Structured Efficient Linear Layer**. In *International Conference on Learning Representations*, 2016. [arXiv](http://arxiv.org/abs/1511.05946)

# Deep Learning for Natural Language Processing 2014 – 2015

Capturing the compositional process which maps the meaning of words to that of documents is a central challenge for researchers in Natural Language Processing and Information Retrieval. We introduce a model that is able to represent the meaning of documents by embedding them in a low dimensional vector space, while preserving distinctions of word and sentence order crucial for capturing nuanced semantics.

Our model learns convolution filters at both the sentence and document level, hierarchically learning to capture and compose low level lexical features into high level semantic concepts.
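An illustrative sketch of this hierarchy (toy dimensions, random filters, not the paper's trained model): sentence-level filters with max-pooling turn variable-length sentences into fixed-size vectors, and document-level filters compose those into a single document embedding.

```python
import numpy as np

rng = np.random.default_rng(0)
emb_dim = 8

# A toy document: three sentences of word embeddings.
doc = [rng.normal(size=(n_words, emb_dim)) for n_words in (5, 7, 4)]

def encode(seq, filters):
    # 1-D valid convolution over time followed by max-pooling, giving a
    # fixed-size vector for a variable-length sequence.
    feats = []
    for f in filters:
        k = f.shape[0]
        resp = [np.sum(seq[t:t + k] * f) for t in range(len(seq) - k + 1)]
        feats.append(max(resp))
    return np.array(feats)

sent_filters = rng.normal(size=(4, 3, emb_dim))  # 4 filters of width 3 over words
doc_filters = rng.normal(size=(6, 2, 4))         # 6 filters of width 2 over sentence vectors

sent_vecs = np.stack([encode(s, sent_filters) for s in doc])  # one vector per sentence
doc_vec = encode(sent_vecs, doc_filters)                      # one vector per document
```

Because the same pipeline produces word, sentence, and document representations, document-level labels can supervise all three levels at once.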

Inspired by recent advances in visualising deep convolution networks for computer vision, we present a novel visualisation technique for our document networks which not only provides insight into their learning process, but also yields a compelling automatic summarisation system for texts.

- Misha Denil, Alban Demiraj, and Nando de Freitas. **Extraction of Salient Sentences from Labelled Documents**. Technical Report, University of Oxford, 2014. [arXiv](http://arxiv.org/abs/1412.6815)

The code for this paper is available on GitHub.

An interesting property of our deep convolutional document model is that it simultaneously produces embeddings for words, sentences and documents using labels only at the document level. We exploit this to develop a new approach to training a sentence classifier by using the sentence embeddings from the document model as a similarity measure. Our approach, which combines ideas from transfer learning, deep learning and multi-instance learning, reduces the need for laborious human labelling of sentences when abundant document labels are available.

- Dimitrios Kotzias, Misha Denil, Nando de Freitas, and Padhraic Smyth. **From Group to Individual Labels using Deep Features**. In *ACM SIGKDD*, 2015. [paper](/static/papers/2015-deep-multi-instance-learning.pdf) [poster](2015-group-to-individual-labels-using-deep-features-poster.pdf)

Dimitrios Kotzias has more information about our multi-instance learning work on his website.

# Consistency of Random Forests 2013 – 2014

Random forests are a popular classification algorithm, originally developed by Leo Breiman in the early 2000s. While the algorithm has been extremely successful in practice, relatively little is known about the underlying mathematical forces which drive this success.

We have been working on narrowing the gap between the theory and practice of random forests. To this end we have designed a variant of random regression forests which achieves the closest match to date between theoretically tractable models and what is actually used in practice. This match is achieved both in terms of similarity of the algorithms and in terms of performance on real world tasks.

Our work also shows empirically that data splitting, a simplification made by recent theoretical works on random forests, is responsible for nearly all of the remaining performance gap between the theoretically tractable models and their more practical counterparts.

- Misha Denil, David Matheson, and Nando de Freitas. **Narrowing the Gap: Random Forests In Theory and In Practice**. In *International Conference on Machine Learning*, 2014. [arXiv](http://arxiv.org/abs/1310.1415) [paper](http://jmlr.org/proceedings/papers/v32/denil14.html) [poster](2014-rf-theory-practice-icml-poster.pdf)

We have also looked at the problem of training random forests in an online setting. The main limitation of online tree building algorithms is memory. Because the trees are built online they must be built depth first, and collecting statistics in the fringe becomes very expensive in large trees. We work around this by limiting the number of leaves permitted to be “active” at any time, which allows us to control the amount of memory used to store the fringe. Our analysis shows that the algorithm is still consistent when this refinement is included.
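The bounded-fringe bookkeeping can be sketched as follows (an illustrative sketch, not the paper's exact algorithm; the scoring of waiting leaves is a hypothetical placeholder): at most `max_active` leaves collect split statistics, the rest wait in a priority queue, and a waiting leaf is promoted whenever an active leaf is split.

```python
import heapq

class BoundedFringe:
    """Track at most `max_active` leaves that collect split statistics."""

    def __init__(self, max_active):
        self.max_active = max_active
        self.active = set()
        self.waiting = []  # max-heap via negated scores

    def add(self, leaf_id, score):
        # New leaves become active if there is room; otherwise they wait,
        # ordered by some estimate of how promising they are.
        if len(self.active) < self.max_active:
            self.active.add(leaf_id)
        else:
            heapq.heappush(self.waiting, (-score, leaf_id))

    def split(self, leaf_id):
        # Splitting an active leaf frees a slot: promote the best waiter.
        self.active.discard(leaf_id)
        if self.waiting:
            _, promoted = heapq.heappop(self.waiting)
            self.active.add(promoted)

fringe = BoundedFringe(max_active=2)
for leaf, score in [("a", 1.0), ("b", 2.0), ("c", 3.0), ("d", 0.5)]:
    fringe.add(leaf, score)
fringe.split("a")  # "c" is the most promising waiting leaf and is promoted
```

Memory use is then proportional to `max_active` rather than to the number of leaves in the tree.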

Our consistent algorithm compares favourably to an existing online random forest algorithm on a simple problem, and outperforms it on a more realistic task.

- Misha Denil, David Matheson, and Nando de Freitas. **Consistency of Online Random Forests**. In *International Conference on Machine Learning*, 2013. [arXiv](http://arxiv.org/abs/1302.4853) [paper](http://jmlr.org/proceedings/papers/v28/denil13.html) [poster](2013-online_random_forests-icml-poster.pdf)

The code for this paper is available on GitHub. You can also download the Kinect data we used in the experiments.

# Distributed Composite Likelihood 2014

This work lays out the theoretical foundations for distributed parameter estimation in Markov Random Fields. We take a composite likelihood approach, which allows us to distribute sub-problems across machines to be solved in parallel. We provide a general condition on composite likelihood factorisations which ensures that the distributed estimators are consistent.

- Yariv Dror Mizrahi, Misha Denil, and Nando de Freitas. **Distributed Parameter Estimation in Probabilistic Graphical Models**. In *Neural Information Processing Systems*, 2014. [arXiv](http://arxiv.org/abs/1406.3070) [poster](2014-lap-nips-poster.pdf)

Our work on composite likelihood grew out of our work on the LAP algorithm, which can be seen as a specific instantiation of the distributed composite likelihood framework.

- Yariv Dror Mizrahi, Misha Denil, and Nando de Freitas. **Linear and Parallel Learning of Markov Random Fields**. In *International Conference on Machine Learning*, 2014. [arXiv](http://arxiv.org/abs/1308.6342)

# Recklessly Approximate Sparse Coding 2012

The introduction of “k-means” features by Coates et al., along with their later refinement into soft threshold features by the same authors, caused considerable discussion in the deep learning community. These very simple and efficiently computed features were shown to achieve state-of-the-art performance on several image classification tasks, beating out a variety of more sophisticated feature learning techniques.

The purpose of this work is to provide a theoretical explanation for the surprising efficacy of these feature encodings. This explanation is realized by a formal connection between the design of the soft threshold features and an approximate solution to a non-negative sparse coding problem using proximal gradient descent.
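The connection can be verified numerically in a few lines: one proximal gradient step from zero on the non-negative sparse coding objective min_z ½‖x − Dz‖² + λ‖z‖₁ (z ≥ 0) reproduces the soft threshold encoding exactly. The dictionary below is a toy random stand-in for the k-means centroids.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dictionary D with unit-norm columns (standing in for k-means
# centroids) and a data point x.
n_features, n_atoms = 20, 50
D = rng.normal(size=(n_features, n_atoms))
D /= np.linalg.norm(D, axis=0)
x = rng.normal(size=n_features)

lam = 0.25  # sparsity threshold

# The soft threshold feature encoding: z = max(0, D^T x - lam).
z_soft = np.maximum(0.0, D.T @ x - lam)

# One proximal gradient step on non-negative sparse coding, starting
# from z = 0 with unit step size: the gradient at 0 is -D^T x, and the
# prox of the non-negative l1 penalty is max(0, . - lam).
z0 = np.zeros(n_atoms)
grad = D.T @ (D @ z0 - x)
z_prox = np.maximum(0.0, z0 - grad - lam)

# The two encodings coincide exactly.
```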

- Misha Denil and Nando de Freitas. **Recklessly Approximate Sparse Coding**. Technical Report, University of British Columbia, 2012. [arXiv](http://arxiv.org/abs/1208.0959)

# Quantum Deep Learning 2011

I spent the summer of 2011 as an intern in the applications group at D-Wave Systems. D-Wave is a company whose flagship product is a quantum computer which they have designed and built.

While I was at D-Wave I worked on applying their quantum computer to deep learning problems. There are a lot of interesting challenges in this area, both in finding ways to express problems in a form the machine can handle, and in dealing with the idiosyncrasies of what is still very new technology.

The result of this project was a paper in the 2011 NIPS Deep Learning Workshop, where we presented an algorithm for training RBMs with lateral connections using the quantum computer. The algorithm itself is sound (in the sense that it works in software), but unfortunately practical issues prevented it from working on the actual hardware. The main difficulty is that the algorithm requires setting parameters on the machine with much higher precision than the hardware was able to support. However, the algorithm may become feasible on future generations of hardware.

- Misha Denil and Nando de Freitas. **Toward the Implementation of a Quantum RBM**. In *NIPS Deep Learning and Unsupervised Feature Learning Workshop*, 2011. [paper](/static/papers/2011-mdenil-quantum_deep_learning-nips_workshop.pdf) [poster](2011-quantum-deep-learning-poster-mdenil.pdf)

# Gaze Selection for Object Tracking in Video 2011

My involvement in this project began as a project for a machine learning course I took in the winter semester of the 2010–2011 school year. The larger goal is to build an object tracking system inspired by the human visual system.

My contribution was to build a model for gaze selection (that is, a control mechanism for saccades). During tracking the system maintains a template of the target object being tracked and at each time step it selects a small sub-region of the template to compare with its estimated location in the frame. I built a model to decide which sub-region of the template to select for comparison at each time step.

- Misha Denil, Loris Bazzani, Hugo Larochelle, and Nando de Freitas. **Learning where to Attend with Deep Architectures for Image Tracking**. *Neural Computation*, 2012. [arXiv](http://arxiv.org/abs/1109.3737) [paper](http://www.mitpressjournals.org/doi/abs/10.1162/NECO_a_00312)

- My project report gives a much shorter description of the model and focuses specifically on my contribution.
- The source code is available for download. It’s a large file because the archive includes several trained models needed for actually running the system.