Projects

This page is a collection of projects that I have worked on. They're listed here in approximately reverse chronological order: entries closer to the top are newer and/or more exciting than projects near the bottom.

Data Driven Robotics 2019 – 2024

I have spent the past several years working on data-driven robotics. In particular, I have focused on addressing challenges that arise at the intersection of large-scale machine learning and real robots.

Working with real hardware that runs in real time in the real world poses a unique set of constraints that are not present in many other areas of machine learning. My work has focused on developing algorithmic and operational solutions that leverage large-scale machine learning to address these challenges.

My work on data-driven robotics falls into four categories:

  1. Learning from demonstrations
  2. Reward learning
  3. Real world evaluation
  4. Complete agent systems

The first three are fundamental challenges faced when deploying learned policies in the real world. Complete agent systems are projects that synthesize our isolated solutions to these more fundamental problems into complete systems that work together in concert.

Learning from demonstrations

An important challenge in training policies that act in the real world is task specification. How does an agent know what it is supposed to be doing, and how does it recognize when it has done so successfully?

One possible answer to this question is demonstrations. We provide the agent with some examples of correct behavior, and it aims to imitate that behavior in novel settings. In the real world this can be especially challenging because many tasks of interest are highly variable and we typically do not have access to a smooth indication of progress to learn from.

With Making Efficient Use of Demonstrations to Solve Hard Exploration Problems we tackled a particularly challenging version of this problem in simulation, where we also focused on partially observable tasks, requiring the agent to remember information across time in order to complete its task.

In this paper we showed that mixing a small amount of demonstration data with the agent's own self-collected experience provided enough signal to learn eight very difficult long-horizon tasks on which no other method we tested succeeded even a single time.

One challenge of using demonstrations as task specifications is that they both under- and over-constrain the task in different ways. They under-constrain the task by failing to cover all possible initial conditions, and they over-constrain the task by specifying states for irrelevant features of the environment. Imagine a demonstration of pouring water with a clock in the background. The demonstration shows what should happen to the water, but it also specifies a particular time at which the water should be poured.

In Task-Relevant Adversarial Imitation Learning (TRAIL) we studied a particular solution to the over-specification problem.

In this paper we showed that when using GAIL, a learning-from-demonstrations method that is particularly prone to latching onto irrelevant details of the demonstrations, we can encourage the model to focus on task-relevant features of the problem through a careful choice of regularization for the discriminator.
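
To make the shape of this concrete, here is a minimal sketch of a GAIL-style discriminator update in PyTorch. The gradient penalty shown is an illustrative stand-in for a discriminator regularizer, not necessarily the choice made in TRAIL, and all names and dimensions are assumptions.

```python
# Minimal GAIL-style discriminator sketch (illustrative; the gradient
# penalty is a stand-in for TRAIL's actual regularizer choice).
import torch
import torch.nn as nn

obs_dim = 32  # assumed observation dimensionality

disc = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(disc.parameters(), lr=1e-4)

def discriminator_loss(demo_obs, agent_obs, penalty_weight=10.0):
    # Standard GAIL objective: demonstrations are "real" (label 1),
    # agent experience is "fake" (label 0).
    bce = nn.functional.binary_cross_entropy_with_logits
    loss = bce(disc(demo_obs), torch.ones(len(demo_obs), 1)) + \
           bce(disc(agent_obs), torch.zeros(len(agent_obs), 1))

    # Regularizer: penalize the discriminator's input gradients so it
    # cannot sharply key on small, potentially task-irrelevant details.
    mix = torch.lerp(demo_obs, agent_obs, torch.rand(len(demo_obs), 1))
    mix.requires_grad_(True)
    grad = torch.autograd.grad(disc(mix).sum(), mix, create_graph=True)[0]
    return loss + penalty_weight * (grad.norm(dim=1) ** 2).mean()

# One update step given batches of demonstration and agent observations.
demo_obs, agent_obs = torch.randn(16, obs_dim), torch.randn(16, obs_dim)
opt.zero_grad()
discriminator_loss(demo_obs, agent_obs).backward()
opt.step()
```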

Another challenge of working with demonstrations is that you need demonstrations for each task of interest, and it is unclear how to exploit demonstrations for one task to improve performance or learning efficiency on other tasks. In Offline Learning from Demonstrations and Unlabeled Experience we extended TRAIL to the offline setting by using demonstrations to learn a reward function (regularized as in TRAIL), and then applying that reward function to label an offline dataset that includes demonstrations from other tasks.
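
The second stage is mechanically simple. Here is a sketch, under assumed names and dataset layout, of using a learned reward model to annotate offline transitions before handing them to an offline RL learner.

```python
# Sketch: relabel an offline dataset with a learned reward model.
# `reward_model` and the dataset layout are assumptions for illustration.
import torch

def relabel(dataset, reward_model):
    """dataset: list of (obs, action, next_obs) transitions from any task."""
    labeled = []
    with torch.no_grad():
        for obs, action, next_obs in dataset:
            r = reward_model(obs).item()  # e.g. a sigmoid of discriminator logits
            labeled.append((obs, action, r, next_obs))
    return labeled  # feed to any offline / batch RL learner
```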

Reward learning

GAIL-based learning-from-demonstrations methods are all built around learning a reward function in a particular way. In addition to the GAIL-based methods discussed above, I have also worked on the problem of learning reward functions directly. Discarding the trappings of GAIL opens up many different possibilities for how to train reward functions, and reward functions are useful for many things beyond providing a training signal to agents, so focusing on the reward learning problem in its own right has merit.

The motivation of Positive-Unlabeled Reward Learning is very similar to that of Offline Learning from Demonstrations and Unlabeled Experience (and the work actually predates the latter, although its final publication date is later). One perspective on PURL is that it fixes an incorrect assumption in GAIL, namely that all agent-generated episodes are worse than all of the demonstrations. If the agent is good then this assumption fails, and this failure is a major source of instability in GAIL. Our offline learning paper addressed this by regularizing the discriminator, while in PURL we approach the problem by removing the problematic assumption entirely.
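
Positive-unlabeled (PU) learning treats the agent's experience as unlabeled data, a mixture of successes and failures, rather than as purely negative. The following is a minimal sketch of a non-negative PU risk estimator in the style of Kiryo et al., applied to discriminator logits; the class prior `pi_p` and all names are assumptions.

```python
import torch
import torch.nn.functional as F

def nn_pu_loss(logits_pos, logits_unl, pi_p=0.5):
    """Non-negative PU risk estimator (after Kiryo et al. 2017).

    logits_pos: discriminator logits on demonstrations (positives).
    logits_unl: logits on agent experience, treated as UNLABELED
                rather than negative -- removing GAIL's bad assumption.
    pi_p: assumed prior probability that unlabeled data is positive.
    """
    def loss(logits, target):  # mean sigmoid cross-entropy
        return F.binary_cross_entropy_with_logits(
            logits, torch.full_like(logits, target))

    risk_pos = pi_p * loss(logits_pos, 1.0)
    # Estimate the negative-class risk from unlabeled data, subtracting
    # the contribution of the positives hiding inside it.
    risk_neg = loss(logits_unl, 0.0) - pi_p * loss(logits_pos, 0.0)
    # Clamp at zero so the estimator cannot go negative (the "nn" part).
    return risk_pos + torch.clamp(risk_neg, min=0.0)
```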

We have also looked at the reward learning problem from the perspective of leveraging the generalization capabilities of modern foundation models to build accurate, robust and universal success detectors. In Vision-language models as success detectors we applied a state-of-the-art (at the time) vision language model to the problem of success detection. The results are very promising: VLM-based success detectors dramatically improved robustness over baseline models, and this paper also shows a path towards truly universal, rather than single- or multi-task, reward models.
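
In outline the recipe is simple: show the model a frame and ask whether the task was completed. The sketch below is hypothetical; `query_vlm` stands in for whatever vision-language model interface is available, and the prompt format is an assumption rather than the one used in the paper.

```python
# Hypothetical sketch of VLM-based success detection. `query_vlm` is a
# placeholder for a real vision-language model API; the prompt is assumed.
def detect_success(query_vlm, frame, task_description):
    prompt = (f"Task: {task_description}\n"
              "Has the robot completed this task in the image? "
              "Answer 'yes' or 'no'.")
    answer = query_vlm(image=frame, text=prompt)
    return answer.strip().lower().startswith("yes")
```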

Real world evaluation

When operationalizing trained policies on real robots, one of the key bottlenecks is evaluation. The most consistently effective strategy for improving model performance across all domains of machine learning is to measure a metric that matters, and then build models that drive that metric up. In robotics, all of those metrics ultimately depend on real world performance, and measuring the real world performance of a policy is time consuming and noisy.

In Active offline policy selection we address the problem of evaluation throughput by modeling our state of knowledge about the entire population of models under consideration. We model our beliefs about the true performance of each model given the evaluations we have observed, as well as correlations between the performance of different models. Quantifying our uncertainty about the performance of each model allows us to choose evaluations that will maximize information gain about which model is best, and modeling correlations in performance allows us to use evaluations of one model to inform our performance estimates of the others.
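
As a sketch of the idea, and not the paper's exact model: place a Gaussian process prior over true policy returns, let a kernel over policy features supply the correlations, and after each noisy evaluation pick the next policy with an uncertainty-driven rule. The kernel, the acquisition rule, and all names below are simplifying assumptions.

```python
import numpy as np

def rbf_kernel(F, lengthscale=1.0):
    """Correlations between policies from their feature vectors F (n, d)."""
    d2 = ((F[:, None, :] - F[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def posterior(F, evals, noise=1.0):
    """evals: list of (policy_index, observed_return) from real episodes."""
    K = rbf_kernel(F)
    idx = np.array([i for i, _ in evals])
    y = np.array([r for _, r in evals])
    K_oo = K[np.ix_(idx, idx)] + noise * np.eye(len(idx))
    K_ao = K[:, idx]
    solve = np.linalg.solve(K_oo, K_ao.T)   # (n_obs, n_policies)
    mean = solve.T @ y                       # posterior mean returns
    cov = K - K_ao @ solve                   # posterior covariance
    return mean, cov

def next_policy(mean, cov, top_k=5):
    """Pick the most uncertain among the plausibly-best policies
    (a simplification of the information-gain rules in the paper)."""
    candidates = np.argsort(-mean)[:top_k]
    return candidates[np.argmax(np.diag(cov)[candidates])]
```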

One of the key challenges in the AOPS paper was determining a good way to measure the similarity of different policies. In the AOPS paper we used a simple method of featurizing policies that works well when the policies are very similar; in our follow-up work $\pi$2vec we developed a more sophisticated method that works better than the AOPS featurization strategy when the policies are more different.
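
The simple featurization can be sketched as follows: run every policy on a shared set of probe observations and concatenate the actions it takes, so that policies which behave alike land close together in feature space. The probe set and shapes are assumptions.

```python
import numpy as np

def featurize_policy(policy, probe_observations):
    """Concatenate the policy's actions on a fixed, shared probe set.

    Policies that behave similarly on the probes get similar features,
    which is what the correlation model needs. pi2vec replaces this
    with a richer representation that degrades less when the policies
    differ substantially.
    """
    actions = [np.asarray(policy(obs)).ravel() for obs in probe_observations]
    return np.concatenate(actions)
```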

Accurate, sound and efficient evaluation is a deep problem, and my public works only scratch the surface of this important topic.

Complete agent systems

My earliest work in this area is Scaling data-driven robotics with reward sketching and batch reinforcement learning. This paper presents a complete end-to-end agent system, and laid the foundations for many of the works that followed. It was one of the first to show that learned rewards can be used in combination with a large dataset of experience from different tasks to learn a robot policy offline using batch RL. We showed that with our approach it is possible to train agents to perform a variety of challenging manipulation tasks, including stacking rigid objects and handling cloth.

Many of the ideas originally developed in this paper can be seen operating in more refined forms, and at much larger scale, in RoboCat: A Self-Improving Foundation Agent for Robotic Manipulation.

Learning to Learn 2016 – 2017

Much of the modern work in optimization is based around designing update rules tailored to specific classes of problems, with the types of problems of interest differing between different research communities. For example, in the deep learning community we have seen a proliferation of optimization methods specialized for high-dimensional, non-convex optimization problems. In contrast, communities who focus on sparsity tend to favour very different approaches, and this is even more the case for combinatorial optimization for which relaxations are often the norm.

Each of these communities makes different assumptions about its problems of interest. In nearly every case, the more that is known about the target problem, the more focused the methods that can be developed to exploit its structural regularities.

Instead of characterising structural regularities mathematically we can treat their identification as a learning problem, and train models which learn about the model learning process itself.

We applied this insight to learning update rules for gradient based optimization using recurrent neural networks.
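
A minimal sketch of the idea in PyTorch: one small LSTM is shared across all parameter coordinates, reads each coordinate's gradient, and emits an update, and meta-training unrolls the optimizer on sampled problems and backpropagates the summed losses. The toy quadratic problems and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class LearnedOptimizer(nn.Module):
    """Coordinatewise RNN optimizer: one shared LSTM applied per parameter."""
    def __init__(self, hidden=20):
        super().__init__()
        self.cell = nn.LSTMCell(1, hidden)
        self.out = nn.Linear(hidden, 1)

    def forward(self, grad, state):
        # grad: (n_params, 1); the same small network serves every coordinate.
        h, c = self.cell(grad, state)
        return self.out(h), (h, c)

opt_net = LearnedOptimizer()
meta_opt = torch.optim.Adam(opt_net.parameters(), lr=1e-3)

for _ in range(100):                         # meta-training loop
    A = torch.randn(10, 10)                  # a random quadratic problem
    theta = torch.randn(10, 1, requires_grad=True)
    state = (torch.zeros(10, 20), torch.zeros(10, 20))
    meta_loss = 0.0
    for t in range(20):                      # unroll the learned optimizer
        loss = ((A @ theta) ** 2).mean()
        grad, = torch.autograd.grad(loss, theta, create_graph=True)
        update, state = opt_net(grad, state)
        theta = theta + update               # the RNN's proposed update
        meta_loss = meta_loss + loss
    meta_opt.zero_grad()
    meta_loss.backward()                     # backprop through the unroll
    meta_opt.step()
```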

We have code for training optimizers available on github.

This works surprisingly well more or less out of the box, with two weaknesses: the trained optimizer has a fairly high memory overhead, and generalising to new problems does not always work reliably. We explored these problems in some detail, and found that the generalisation properties of our learned optimizers could be greatly improved through careful design of the model and training procedure.

In addition to learning gradient based updates, we have also looked at learning global optimizers. Here the challenges are different than with gradient based methods, and we show how to design supervised objectives to explicitly encourage the optimizer to explore the search space. We also show how to exploit gradients at training time in order to learn optimizers that operate gradient free at test time.
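
A compressed sketch of that setup, under assumed shapes, a toy function family, and a simplified loss: an RNN proposes query points and is scored on the sum of the best values seen so far, which rewards finding good points early and therefore encourages exploration. Gradients of the objective are used only inside the training loop; at test time the RNN consumes only function values, so the search runs gradient-free.

```python
import torch
import torch.nn as nn

class RNNSearcher(nn.Module):
    """Proposes the next query point from (previous point, best value)."""
    def __init__(self, dim=2, hidden=32):
        super().__init__()
        self.cell = nn.LSTMCell(dim + 1, hidden)
        self.out = nn.Linear(hidden, dim)

    def forward(self, x_prev, y_prev, state):
        h, c = self.cell(torch.cat([x_prev, y_prev], dim=-1), state)
        return self.out(h), (h, c)

searcher = RNNSearcher()
meta_opt = torch.optim.Adam(searcher.parameters(), lr=1e-3)

for _ in range(100):
    target = torch.randn(1, 2)               # random quadratic: f(x)=|x-t|^2
    f = lambda x: ((x - target) ** 2).sum(-1, keepdim=True)
    x = torch.zeros(1, 2)
    state = (torch.zeros(1, 32), torch.zeros(1, 32))
    best, meta_loss = f(x), 0.0
    for t in range(15):
        x, state = searcher(x, best, state)  # only values are fed back,
        best = torch.minimum(best, f(x))     # so test-time search needs
        meta_loss = meta_loss + best.sum()   # no gradients of f; summed
    meta_opt.zero_grad()                     # best-so-far rewards finding
    meta_loss.backward()                     # good points early (explore).
    meta_opt.step()                          # Gradients of f used only here.
```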

Learning to Experiment 2016

When encountering novel objects, humans are able to infer a wide range of physical properties, such as mass, friction and deformability, by interacting with them in a goal driven way. This process of active interaction is in the same spirit as a scientist performing an experiment to discover hidden facts.

Recent advances in artificial intelligence have yielded machines that can achieve superhuman performance in Go, Atari, natural language processing, and complex control problems, but it is not clear that these systems can rival the scientific intuition of even a young child.

We introduce two tasks that require agents to interact with the world in order to identify some non-visual physical property of the objects it contains. In the first environment agents must poke blocks in order to determine which of them is the heaviest, and in the second environment agents must knock down block towers and count the number of rigid bodies they are composed of.

Several news sites have written nice articles about this work for a general audience.

Predicting Parameters in Deep Learning 2013 – 2016

It is well known that the first layer features in many deep learning architectures tend to be globally smooth with local edge features, similar to local Gabor functions, when trained on natural images. We show that it is possible to exploit the redundancy in this type of structure when learning neural nets.

Our key innovation is to store explicit values for only a sparse subset of the weights in each feature, and predict the remaining values using ridge regression. By representing only a small number of weights explicitly we are able to dramatically reduce the number of parameters which must be learned. In the best case we are able to predict more than 95% of the parameters in a network without any drop in accuracy.
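
Concretely, a minimal numpy sketch, with an assumed smoothness kernel over pixel locations: store the weights at a few locations per feature and fill in the rest with kernel ridge regression.

```python
import numpy as np

def predict_weights(known_idx, known_w, coords, lam=0.1, lengthscale=2.0):
    """Fill in a feature's weight vector from a sparse subset of weights.

    coords: (n, 2) pixel locations for each weight in the feature.
    known_idx / known_w: indices and values of the stored weights.
    The smoothness kernel over pixel locations is an assumed choice.
    """
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
    K = np.exp(-0.5 * d2 / lengthscale**2)        # smooth spatial kernel
    K_kk = K[np.ix_(known_idx, known_idx)]
    alpha = np.linalg.solve(K_kk + lam * np.eye(len(known_idx)), known_w)
    return K[:, known_idx] @ alpha                # ridge-regression predictions

# Example: store 25% of an 8x8 feature's weights and predict the rest.
coords = np.stack(np.meshgrid(np.arange(8), np.arange(8)), -1).reshape(-1, 2)
rng = np.random.default_rng(0)
known_idx = rng.choice(64, size=16, replace=False)
w_full = predict_weights(known_idx, rng.standard_normal(16), coords)
```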

Our technique is very general, and can be applied to almost any neural network model; it is also orthogonal but complementary to other recent advances in deep learning, such as dropout, rectified units and maxout.

Reducing the number of learned parameters could also be of great benefit in distributed settings. The bottleneck in large scale distributed learning of neural networks is the cost of synchronizing parameter updates between machines. If fewer parameters need to be learned then this burden can be reduced.

I also presented this work in the ICML 2013 Challenges in Representation Learning Workshop.

Feb 2014: Our lab received a Google research award to continue this work.

We applied the predicting parameters idea to reducing the memory footprint of a standard convolutional network (e.g., AlexNet). In particular, we show how the fully connected layers can be replaced with a randomized kernel machine. This transformation is efficient in terms of both memory and computation, and importantly we can learn the parameters of the kernel transformation jointly with the parameters of the convolutional layers of the network.
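
The sketch below conveys the flavour in PyTorch, using plain random Fourier features as a simplified stand-in for the structured transform the paper uses (the structure is what makes the real version memory- and compute-efficient). Here only a per-dimension scale in the kernel part is learned, jointly with the rest of the network.

```python
import math
import torch
import torch.nn as nn

class RandomFeatureClassifier(nn.Module):
    """Conv features -> randomized kernel machine -> softmax.

    Plain random Fourier features stand in for the structured transform
    of the paper; a learned per-dimension scale keeps some adaptivity,
    trained jointly with the convolutional layers that feed this module.
    """
    def __init__(self, in_dim=256, n_features=1024, n_classes=10):
        super().__init__()
        self.register_buffer("W", torch.randn(in_dim, n_features))
        self.register_buffer("b", 2 * math.pi * torch.rand(n_features))
        self.scale = nn.Parameter(torch.ones(in_dim))   # learned jointly
        self.softmax_layer = nn.Linear(n_features, n_classes)

    def forward(self, x):
        z = torch.cos((x * self.scale) @ self.W + self.b)  # kernel features
        return self.softmax_layer(z)
```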

A major drawback of the Deep Fried approach is that the randomized kernel machine requires a lot of features in order to work well. The feature mapping itself is efficient, but its high dimensionality introduces a lot of parameters into the final softmax layer, whose input dimensionality has been increased.

To correct this drawback we can make the network deep instead of wide. We designed a new layer, composed of diagonal matrices and cosine transforms, which can be stacked to form very deep hierarchies that we call ACDC. At a high level ACDC is similar to the Deep Fried layer, but the construction is simplified and the interpretation as a kernel machine is abandoned.
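
A sketch of one such layer in numpy, assuming the natural reading of the name: two learned diagonal matrices A and D interleaved with a cosine transform C and its inverse. The exact ordering and normalization used in the paper may differ.

```python
import numpy as np
from scipy.fft import dct, idct

def acdc_layer(x, a, d):
    """One ACDC-style layer: A . C^-1 . D . C applied to x.

    a, d: learned diagonal parameters, O(n) each, versus O(n^2) for a
    dense linear layer; the cosine transforms cost O(n log n) and carry
    no parameters at all. Ordering/normalization here are assumptions.
    """
    return a * idct(d * dct(x, norm="ortho"), norm="ortho")

# Stacking many such layers approximates a general linear map cheaply.
n = 64
x = np.random.randn(n)
for _ in range(8):
    x = acdc_layer(x, np.random.randn(n), np.random.randn(n))
```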

Deep Learning for Natural Language Processing 2014 – 2015

Capturing the compositional process which maps the meaning of words to that of documents is a central challenge for researchers in Natural Language Processing and Information Retrieval. We introduce a model that is able to represent the meaning of documents by embedding them in a low dimensional vector space, while preserving distinctions of word and sentence order crucial for capturing nuanced semantics.

Our model learns convolution filters at both the sentence and document level, hierarchically learning to capture and compose low level lexical features into high level semantic concepts.
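
A minimal PyTorch sketch of this hierarchy, with assumed dimensions and pooling choices: one convolution composes word embeddings into sentence embeddings, and a second composes sentence embeddings into a document embedding.

```python
import torch
import torch.nn as nn

class HierarchicalConvDoc(nn.Module):
    """Words -> sentence embeddings -> document embedding, by convolution."""
    def __init__(self, vocab=10000, word_dim=64, sent_dim=96, doc_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, word_dim)
        self.sent_conv = nn.Conv1d(word_dim, sent_dim, kernel_size=3, padding=1)
        self.doc_conv = nn.Conv1d(sent_dim, doc_dim, kernel_size=3, padding=1)

    def forward(self, doc):
        # doc: (n_sentences, n_words) of word ids.
        words = self.embed(doc).transpose(1, 2)      # (n_sent, word_dim, n_words)
        sents = self.sent_conv(words).relu().max(dim=2).values  # sentence embeddings
        sents = sents.T.unsqueeze(0)                 # (1, sent_dim, n_sent)
        return self.doc_conv(sents).relu().max(dim=2).values    # (1, doc_dim)

doc = torch.randint(0, 10000, (5, 12))              # 5 sentences, 12 words each
embedding = HierarchicalConvDoc()(doc)              # supervised by doc labels only
```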

Inspired by recent advances in visualising deep convolutional networks for computer vision, we present a novel visualisation technique for our document networks which not only provides insight into their learning process, but can also be interpreted to produce a compelling automatic summarisation system for texts.

The code for this paper is available on github.

An interesting property of our deep convolutional document model is that it simultaneously produces embeddings for words, sentences and documents using labels only at the document level. We exploit this to develop a new approach to training a sentence classifier by using the sentence embeddings from the document model as a similarity measure. Our approach, which combines ideas from transfer learning, deep learning and multi-instance learning, reduces the need for laborious human labelling of sentences when abundant document labels are available.

Dimitrios Kotzias has more information about our multi-instance learning work on his website.

Consistency of Random Forests 2013 – 2014

Random forests are a popular classification algorithm, originally developed by Leo Breiman in the early 2000s. While the algorithm has been extremely successful in practice, relatively little is known about the underlying mathematical forces that drive this success.

We have been working on narrowing the gap between the theory and practice of random forests. To this end we have designed a variant of random regression forests which achieves the closest match to date between theoretically tractable models and what is actually used in practice. This match is achieved both in terms of similarity of the algorithms and in terms of performance on real world tasks.

Our work also shows empirically that data splitting, a simplification made by recent theoretical works on random forests, is responsible for nearly all of the remaining performance gap between the theoretically tractable models and their more practical counterparts.

We have also looked at the problem of training random forests in an online setting. The main limitation of online tree building algorithms is memory: because the trees are built online they must be built depth first, and collecting statistics in the fringe becomes very expensive in large trees. We work around this by limiting the number of leaves permitted to be "active" at any time, which allows us to control the amount of memory used to store the fringe. Our analysis shows that the algorithm remains consistent when this refinement is included.
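
The memory control can be sketched as a bounded set of active leaves; the statistics collected and the rule for admitting leaves below are placeholders for the ones analysed in the paper.

```python
class BoundedFringe:
    """Keep statistics for at most `max_active` candidate leaves.

    Data reaching an inactive leaf is simply discarded, which bounds the
    memory used by the fringe. The admission rule (first come, first
    served) and the stored statistics are placeholders for the paper's.
    """
    def __init__(self, max_active=100):
        self.max_active = max_active
        self.active = {}            # leaf id -> list of (x, y) statistics

    def observe(self, leaf_id, x, y):
        if leaf_id not in self.active:
            if len(self.active) >= self.max_active:
                return              # fringe full: drop the point
            self.active[leaf_id] = []
        self.active[leaf_id].append((x, y))

    def pop_ready(self, min_points=50):
        """Return leaves with enough data to split, freeing their slots."""
        ready = [k for k, v in self.active.items() if len(v) >= min_points]
        return {k: self.active.pop(k) for k in ready}
```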

Our consistent algorithm compares favourably to an existing online random forest algorithm on a simple problem, and outperforms it on a more realistic task.

The code for this paper is available on github. You can also download the Kinect data we used in the experiments.

Distributed Composite Likelihood 2014

This work lays out the theoretical foundations for distributed parameter estimation in Markov Random Fields. We take a composite likelihood approach, which allows us to distribute sub-problems across machines to be solved in parallel. We provide a general condition on composite likelihood factorisations which ensures that the distributed estimators are consistent.
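
As a concrete instance, and a sketch under simplifying assumptions rather than the paper's general construction: for a binary pairwise MRF, each composite likelihood factor P(x_i | x_rest) reduces to a logistic regression, and the factors can be fit on separate machines (here, separate processes) in parallel.

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor
from sklearn.linear_model import LogisticRegression

def fit_conditional(args):
    """One composite likelihood factor: P(x_i | x_rest), a logistic fit."""
    i, X = args
    rest = np.delete(X, i, axis=1)
    model = LogisticRegression().fit(rest, X[:, i])
    return i, model.coef_.ravel()   # estimates of couplings touching node i

if __name__ == "__main__":
    # X: (n_samples, n_nodes) binary data from a pairwise MRF (toy stand-in).
    X = (np.random.rand(500, 8) > 0.5).astype(int)
    with ProcessPoolExecutor() as pool:
        estimates = dict(pool.map(fit_conditional, [(i, X) for i in range(8)]))
    # Each shared coupling is estimated by two sub-problems (node i and
    # node j); when the combined estimator is consistent is exactly the
    # sort of question the theory addresses.
```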

Our work on composite likelihood grew out of our work on the LAP algorithm, which can be seen as a specific instantiation of the distributed composite likelihood framework.

Recklessly Approximate Sparse Coding 2012

The introduction of “k-means” features by Coates et al., along with their later refinement into soft threshold features by the same authors, caused considerable discussion in the deep learning community. These very simple and efficiently computed features were shown to achieve state of the art performance on several image classification tasks, beating out a variety of more sophisticated feature learning techniques.

The purpose of this work is to provide a theoretical explanation for the surprising efficacy of these feature encodings. This explanation is realized by a formal connection between the design of the soft threshold features and an approximate solution to a non-negative sparse coding problem using proximal gradient descent.
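
The connection can be stated compactly: one step of proximal gradient descent (ISTA) on the non-negative sparse coding objective, started from zero, is exactly a soft threshold feature. A small numpy sketch, with an assumed dictionary and step size:

```python
import numpy as np

def soft_threshold_features(D, x, alpha):
    """Soft threshold encoding in the style of Coates et al.: max(0, D^T x - alpha)."""
    return np.maximum(0.0, D.T @ x - alpha)

def ista_step(z, D, x, lam, eta):
    """One proximal gradient (ISTA) step on the non-negative sparse
    coding objective  0.5*||x - D z||^2 + lam*||z||_1,  z >= 0."""
    grad = D.T @ (D @ z - x)
    return np.maximum(0.0, z - eta * (grad + lam))

# Starting from z = 0 with step size eta = 1, a single ISTA step
# recovers the soft threshold features exactly.
rng = np.random.default_rng(0)
D, x = rng.standard_normal((16, 32)), rng.standard_normal(16)
z0 = np.zeros(32)
assert np.allclose(ista_step(z0, D, x, lam=0.5, eta=1.0),
                   soft_threshold_features(D, x, alpha=0.5))
```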

Quantum Deep Learning 2011

I spent the summer of 2011 as an intern in the applications group at D-Wave Systems. D-Wave is a company whose flagship product is a quantum computer which they have designed and built.

While I was at D-Wave I worked on applying their quantum computer to deep learning problems. There are a lot of interesting challenges in this area, both in finding ways to express problems in terms the machine can handle, and in dealing with the idiosyncrasies of what is still a very new technology.

The result of this project was a paper in the 2011 NIPS Deep Learning Workshop, where we presented an algorithm for training RBMs with lateral connections using the quantum computer. The algorithm itself is sound (in the sense that it works in software), but unfortunately practical issues prevented it from working in vivo. The main difficulty is that the algorithm requires the ability to set parameters on the machine with much higher precision than the hardware was able to support. However, the algorithm may become feasible on future generations of hardware.

Gaze Selection for Object Tracking in Video 2011

My involvement in this project began as a project for a machine learning course I took in the winter semester of the 2010–2011 school year. The larger goal is to build an object tracking system inspired by the human visual system.

My contribution was to build a model for gaze selection (that is, a control mechanism for saccades). During tracking the system maintains a template of the target object, and at each time step it selects a small sub-region of the template to compare against the target's estimated location in the current frame. I built a model to decide which sub-region of the template to select for comparison at each time step.