*Posted: 6 August 2017*

This is the first ever academic conference that I’m attending. I paid for it out
of my own pocket because I wanted to experience first hand what an academic
conference is like. This is an **uncurated** rant.

- Day 1 - Tutorials
- Day 2 - Conference
- Day 3 - Conference
  - Session 1 - Deep Learning 5 - Fisher Approximations
    - Relative Fisher Information and Natural Gradient for Learning Large Modular Models
    - Learning Deep Architectures via Generalized Whitened Neural Networks
    - Continual Learning Through Synaptic Intelligence
    - Adaptive Neural Networks for Efficient Inference
    - Combined Group and Exclusive Sparsity for Deep Neural Networks
  - Session 2 - Recurrent Neural Networks 3
  - Session 3 - Deep Learning 7 - Analysis
  - Invited Talk: Genomics, Big Data, and Machine Learning: Understanding the Human Wiring Diagram and Driving the Healthcare Revolution, Peter Donnelly
- Day 4 - Conference
- Day 5 - Workshops
  - Invited Talk by Sanjeev Arora: Do GANs Actually Learn the Distribution? Some Theory and Empirics.
  - Towards a Deeper Understanding of Training Quantized Networks
  - Invited Talk by Surya Ganguli: On the Beneficial Role of Dynamic Criticality and Chaos in Deep Learning.
  - Invited Talk by Pedro Domingos - The Sum-Product Theorem: A Foundation for Learning Tractable Deep Models.
  - Invited Talk by Tomaso Poggio - Why and When Can Deep - but Not Shallow - Networks Avoid the Curse of Dimensionality: Theoretical Results.
  - Invited Talk by Zhou Bolei: Quantifying the Interpretability of Deep Visual Representations.
  - Invited Talk by Dhruv Batra: Visual Explanations from Deep Networks
  - Invited Talk by Pierre Andrews: Visual Exploration of Industry-Scale Deep Neural Network Models.
  - Principled Approaches to Deep Learning Panel
- Day 6 - Workshops
- End

There were 9 different tutorials at 3 different locations today. You could only choose 3 to go for. Selecting tutorials is a tough choice. Going for all would be optimal. In fact, going for every single talk would be great! However, that would mean that the conference would span a month or more. Well, I guess it’s structured this way because not everyone can master everything. There will come a point in time where it’s beyond the capacity of a human to be well versed in so many domains.

I was expecting more methods that show us exactly how to interpret a deep learning model. Some introductions to papers in this area were made but I’ve seen them previously. Other than that, I got the idea that it’s more about formalizing exactly what interpretability is. I think this is a nascent idea and definitely very important in the field of Machine Learning (hardcore theory people definitely think otherwise). I’m always of the idea that diversity of ideas is important and everyone will play a small part in advancing the field and everything will somehow converge eventually. Interpretability is important for practical users; formal proofs are important for academic rigour and potentially practical use (if they exist).

Allen Zhu is an incredible speaker. He’s got a really impressive CV and is extremely technical but can bring the talk down to a level where most can understand (I think?). I don’t claim to understand everything fully, but I thought I could at least follow most of it.

I was introduced to primal, primal-dual, and dual forms of optimization functions. This was totally new to me. I have no idea how these forms work and I should really look into that if I intend to get a stronger foundation in optimization. I was also introduced to the ideas of coordinate descent and mirror descent. I think all these theories are extremely interesting (like how coordinate descent guarantees a correct step).

What struck me the most was when he said that momentum fails for SGD and SVRG.
He mentioned this in the context of convex optimization and offline learning.
Empirically, they work very well on almost all deep learning problems (which are
non-convex). So how can he say that it fails? My interpretation would be that
he’s looking at it from a very strict optimization standpoint of reaching
global minima. In that perspective, momentum is definitely worse as the
mini-batch gradient has high variance and taking momentum of a wrong gradient
can’t be good. When we put this in the framework of deep learning, it starts to
make sense because there could be multiple local minima (which are all very
good) and saddle points (which you could escape from with SGD naturally). After
sitting through this lecture, **I’m really interested to know more about
optimization and formal theories of deep learning.**

This is something that has been really popular in recent times. I’ve glanced through some papers on it. I’d really love to try an implementation of this on a problem.

Oriol and Navdeep did a great job in introducing the various applications and
state-of-the-art methods and for me, what interests me the most would be
attention. **I’d like to learn more about how attention methods
work.**

Started with an introduction of PixelCNN and sampling in an autoregressive manner.

They introduce x_hat, an auxiliary variable, that generates x through p(x|x_hat). These auxiliary variables are images themselves. 4-bit grayscale images are used for initial experiments.

Pyramid PixelCNN. x_hat is a low resolution view of x, and then generate recursively. This is simply framing super resolution as image generation.

Nice results on CelebA: generating 128x128 images from 8x8 images.

Results on: https://github.com/kolesman/FaceGeneration

PixelRNN and PixelCNN achieve the best natural image density estimation performance, but sampling is costly: 3N sequential sampling steps for N pixels.

Fast-sampling models still have a big gap in performance (IAF, ConvDraw, NVP, etc.).

How can we accelerate autoregressive image models with the least performance degradation?

Main idea of this model is to combine pixels into multiple groups G, where all the pixels in a group are generated in parallel, at the same time, conditioned on the previous group. By introducing independence assumptions in each group, we can speed up generation a lot. Every group in the hierarchy uses a different network for prediction. An improved version adds in local autoregressive dependencies among output pixels.

All these models can be done recursively and super resolution is a natural extension of this. They generate very realistic looking 8x8 -> 512x512 images! They have a 100x speed up because it’s O(log N)!

So how far can we close this gap? Well, we can’t really do it because there are always conditional dependencies. But we can get better.

An interesting insight: full sized grid models actually perform worse.

Conclusion:

- Autoregressive models can be fast to sample from and can generate high-resolution images and video.
- Modeling some pixels as conditionally independent is an effective way to enable parallelism during sampling.
- Multiscale image structure provides a good basis on which to model pixels as conditionally independent.

For learning the distribution of natural videos, we condition not just on the spatial dependencies but also the previous frames. They have full dependency across space, time and RGB channels.

In practice, there are actually blind spots so you have to be aware of that.

See how the lower bound is computed in the paper.

VPN can produce crisp images 10 steps into the future conditioned on 10 previously.

Examples on nal.ai/vpn

Future work on this would include speeding up and generating beyond 18 frames to perhaps 100 frames.

The talk began with an introduction to texture synthesis. Why is it important?

- The look and feel of a surface is important both in nature and CGI.
- Textures as stochastic processes. Stationary, periodic, ergodic, non-ergodic mixing and non-mixing.

Goal: given an example texture image, learn the generating process and sample textures with the “right” properties. A recent deep parametric approach would be “Texture Synthesis Using CNNs” (Gatys et al. 2015). This is basically the style transfer paper.

All of the current methods have drawbacks:

- Slow to generate large output textures. This is important in many applications.
- Cannot handle multiple diverse textures.
- Can’t deal with periodic textures. Portilla and Efros can do this in a limited way.

They have a previous paper “Texture Synthesis with Spatial GANs” (Jetchev et al. 2016) in NIPS. However, this was limited to stationary, ergodic and mixing textures. It can’t handle periodic structures.

Key idea: Combine 3 types of noise tensors Z. Z_local, sampled spatially iid from noise prior as in SGAN. Z_global, equal on each spatial position. Z_periodic, periodic filters that look a little like gabor filters and stuff. A pretty neat idea I must say!
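
To make the three noise types concrete, here is a minimal sketch of how such a noise tensor could be assembled. It is not the authors' code: the function name `make_psgan_noise`, the uniform noise prior, and the plain sinusoid used for the periodic part are all assumptions made for illustration.

```python
import math
import random


def make_psgan_noise(height, width, d_local, d_global, wave_vectors, seed=0):
    """Build a height x width grid of noise vectors [local | global | periodic].

    wave_vectors is a list of (kx, ky) pairs; they are hypothetical inputs
    here, chosen by hand rather than learned.
    """
    rng = random.Random(seed)
    # Z_global: drawn once, then copied to every spatial position.
    global_part = [rng.uniform(-1.0, 1.0) for _ in range(d_global)]
    # One random phase offset per periodic channel.
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in wave_vectors]

    grid = []
    for y in range(height):
        row = []
        for x in range(width):
            # Z_local: drawn independently at every spatial position.
            local_part = [rng.uniform(-1.0, 1.0) for _ in range(d_local)]
            # Z_periodic: a sinusoid over the spatial coordinates.
            periodic_part = [
                math.sin(kx * x + ky * y + phase)
                for (kx, ky), phase in zip(wave_vectors, phases)
            ]
            row.append(local_part + global_part + periodic_part)
        grid.append(row)
    return grid
```

With a wave vector of `(math.pi / 2, 0.0)` the periodic channel repeats every 4 positions along x, while the global channels are identical everywhere and the local channels differ per position.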

Key abilities:

- learn periodical textures
- learn textures of great variability from large image datasets
- learn whole manifolds of textures and smoothly blend between their elements, thus creating novel textures.

Only PSGAN was accurate for the honeycomb. A question I have immediately would be why is there an input honeycomb? Isn’t it supposed to be generating from noise? I have to read the paper to get the details.

The smooth texture morphing image was pretty cool.

http://www.offconvex.org

This talk addresses the following questions:

- Does an equilibrium exist in this 2-person game when Generator has capacity S and Discriminator has capacity N?
- If Generator wins, does this mean it has learnt the target distribution (from a small number of samples)?

Goodfellow did answer this: if the Discriminator capacity, training time, and number of samples are large (this of course means exponentially large), then yes to both.

Their theorems:

- Near-equilibrium exists when Generator capacity >= (Discriminator capacity)^2. Near in this sense means less than an epsilon.
- But this learnt distribution may be far from target distribution w.r.t any standard metric.
- Only good property they can guarantee is that learnt distribution is indistinguishable from target distribution by every Discriminator with the stated bound.

They have an empirical contribution of an effective way to add capacity to
generator. **Replace generator by weighted mixture of k generators.** Train
mixing weights via backpropagation; use entropy regularizer to discourage
mixture from collapsing. This often stabilizes and improves training for GANs.
They call it MIX + GAN.
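
A minimal sketch of the mixing-weight part, assuming the weights come from a softmax over free logits and that the entropy term is simply subtracted from the loss. The names `mixture_loss` and `reg_strength` are made up for illustration, and the per-generator losses are plain numbers standing in for whatever GAN loss each generator gets.

```python
import math


def softmax(logits):
    """Turn unconstrained logits into mixing weights that sum to 1."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(weights):
    """Shannon entropy of the mixing weights (high when weights are spread out)."""
    return -sum(w * math.log(w) for w in weights if w > 0.0)


def mixture_loss(per_generator_losses, logits, reg_strength=0.1):
    """Weighted generator loss minus an entropy bonus.

    Subtracting the entropy rewards spread-out weights, which discourages
    the mixture from collapsing onto a single generator.
    """
    weights = softmax(logits)
    expected = sum(w * loss for w, loss in zip(weights, per_generator_losses))
    return expected - reg_strength * entropy(weights)
```

If all generators have the same loss, the regularized objective prefers uniform weights over a collapsed mixture.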

Follow up read: “Do GANs actually learn the distribution? An empirical study.”

The focus of this paper is on the discriminator.

Known issues in GAN training:

- Loss uncorrelated with sample quality.
- Vanishing gradient needs ad-hoc log(D(G(z))) for G training, so different loss for D and G.
- Optimizing a problematic metric (Arjovsky and Bottou 2017).
- Unstable training, mode collapse.

Use Integral Probability Metrics for GAN training. There’s lots of theory to this so I have to read up more on it.

The idea that I took away was that there’s a target distribution that you’re trying to match, and you use a neural network to map the 2 probability distributions into some other feature space and measure the distance between these distributions in the transformed space.
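
A toy sketch of that idea, assuming the simplest possible version: compare the mean feature vectors of two sample sets. The fixed `toy_features` map is a stand-in for the learned network, and all function names are hypothetical; the paper's actual metric and training procedure are more involved.

```python
import math


def mean_embedding(samples, feature_map):
    """Average feature vector of a set of samples."""
    feats = [feature_map(x) for x in samples]
    dim = len(feats[0])
    return [sum(f[i] for f in feats) / len(feats) for i in range(dim)]


def mean_feature_distance(samples_p, samples_q, feature_map):
    """Euclidean distance between the two mean embeddings."""
    mu_p = mean_embedding(samples_p, feature_map)
    mu_q = mean_embedding(samples_q, feature_map)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(mu_p, mu_q)))


def toy_features(x):
    """Hand-picked features; in a GAN this would be a trained network."""
    return [x, x * x]
```

Two identical sample sets give a distance of 0, and shifting one of them makes the distance grow.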

They have some primal-dual form again which I don’t understand. And you can optimize both forms and get similar sample qualities. This is an insanely hard presentation to understand.

The motivation for this paper is that it is hard to collect image pairs. Like the exact same person but only changing the hair colour or taking off the glasses. Humans can do it easily (infer the transformation given 2 groups of images), so can we do it on GANs?

They assume that there exists a transformation in the form of a generator. The generator takes a blond hair portrait as input and tries to transform it into a black hair portrait. The discriminator is supposed to distinguish real black hair from fake black hair. A vanilla GAN used this way does not make sense at all as the hair might be correct but the face is entirely wrong. This is their baseline.

They use another GAN that tries to reconstruct the original image. They coin this a GAN with reconstruction.

Their final method is coined DiscoGAN. It’s a pretty neat idea because there’s a mirror model. And the GANs have shared weights. Very nice results on gender conversion. Nice results on handbag to shoe conversion as well.

Code: https://github.com/SKTBrain/DiscoGAN Follow up work: Unsupervised Visual Attribute Transfer with Reconfigurable GAN.

Questions from audience:

- How is this different from CycleGAN from UC Berkeley? CycleGAN is a parallel line of work. https://arxiv.org/abs/1703.10593.

Poor flow and structure to overall presentation. It was really hard to follow. I’d read the paper if I want to know more.

The paper that I’d really like to hear about! I see lots of posts online on how WGAN really improves GAN.

TLDR: Check out Reddit for the nice TLDR.

- As the discriminator gets better, updates to the generator get worse.
- Discriminator achieves a test accuracy of 1 quickly.
- Real data is usually concentrated in a low dimensional manifold.

Basically, there are problems with JSD and KL. Wasserstein distance is a “better” metric for measuring the distance between the 2 distributions you’re trying to match.

It looks tough at the start, but thankfully, there’s a dual.

Idea: train one net f (critic) to maximize the dual then do gradient descent on theta.

To make 1-Lipschitz or K-Lipschitz, clip weights! It’s a terrible idea but it kind of works. There is also follow up work for this.

Conclusion:

- Use difference instead of CE, not sigmoid or logs.
- Enforce a Lipschitz constraint (clipping or gradient penalty).
- Train disc/critic to optimality (ncritic=5).
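
The first two bullets can be sketched in a few lines. This is only an illustration under simple assumptions: the critic is any Python callable returning a score, batches are lists of samples, and the clipping threshold `c` is a hypothetical default. It shows the shape of the objectives, not a full training loop.

```python
def critic_objective(critic, real_batch, fake_batch):
    """Difference of mean critic scores; the critic is trained to maximize this."""
    mean_real = sum(critic(x) for x in real_batch) / len(real_batch)
    mean_fake = sum(critic(x) for x in fake_batch) / len(fake_batch)
    return mean_real - mean_fake


def generator_loss(critic, fake_batch):
    """The generator tries to raise the critic's score on its samples."""
    return -sum(critic(x) for x in fake_batch) / len(fake_batch)


def clip_weights(weights, c=0.01):
    """Crude Lipschitz control: force every weight into [-c, c]."""
    return [max(-c, min(c, w)) for w in weights]
```

Note that there is no sigmoid and no log anywhere: the critic outputs an unbounded score rather than a probability.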

Hierarchical injection of randomness. As we go up the feature hierarchy, it becomes increasingly abstract.

Limitations of Stacked Hierarchical Generative Models.

- Bottlenecked by bottom layer, higher layers are redundant.
- Stacked HGMs with simple conditionals do not learn hierarchical features.

The basic assumption is that more complex abstract features require deeper networks to model. They coin it architectural hierarchy. Their model is Variational Ladder Autoencoder (VLAE). It’s different from LVAE. This paper gives a significant step forward on the ability to disentangle the latent codes.

Main motivation of their work is conditional density estimation. Take the top half of the face and try to impute the bottom half (basically inpainting). Very mathematically heavy presentation and there’s a need to read the paper to get the details.

Very mathematically heavy and there’s a need to read the paper to get the details.

Joined this talk towards the end and saw many mathematical equations. I doubt I would have understood it even if I sat in from the start. Interesting title though.

Goal of this paper is to have data efficient black-box optimization. Let the RNN do optimization.

Started off with a background on Bayesian Optimization. Model based sequential optimization algorithm. They use some posterior and expected improvement. Traditionally very difficult or slow to compute. I don’t know much about it and need to read more.
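
For reference, here is a minimal sketch of the textbook expected improvement formula, assuming a Gaussian posterior at the candidate point and a maximization problem. This is the classical acquisition function, not the RNN-based method from the talk, and the function names are my own.

```python
import math


def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mean, std, best_so_far):
    """Expected amount by which a candidate beats the best value seen so far.

    mean and std describe the posterior at the candidate point.
    """
    if std <= 0.0:
        # No uncertainty left: the improvement is known exactly.
        return max(mean - best_so_far, 0.0)
    z = (mean - best_so_far) / std
    return (mean - best_so_far) * normal_cdf(z) + std * normal_pdf(z)
```

The second term is why uncertain points stay attractive: even when the predicted mean equals the current best, a larger `std` gives a larger expected improvement.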

Excellent results overall. Competitive convergence properties and speed compared to existing methods.

Optimization requires hand tuning and manual supervision. Can optimizers be learned that work well on a variety of problems? Previous work can outperform existing optimizers but generalize and scale poorly. They introduce hierarchical RNNs.

In essence, it’s a black box that takes in parameters and gradients and tries to generate a parameter update. Generally, history is important for this and thus RNNs are a natural choice.

Per-parameter RNN has some advantages and disadvantages. A key disadvantage is that it fails to generalize (for some reason). Their contributions include architecture, design features, and the training process.

In terms of architecture, they still have a per-parameter RNN. They now pipe this into a block and have a Tensor RNN. Every layer has a different Tensor RNN that can coordinate information within each parameter. They then add another Global RNN. Natural idea of parameter sharing.

In terms of design features, they incorporate design features from optimization literature, like momentum and dynamic input scaling (similar to RMSProp and Adam).

They use a meta-training ensemble to train this black box. Convex and non-convex objectives, stochastic and deterministic, problems with poor scaling, no neural networks in the training set! Just small toy problems.

Contributions:

- Random scaling.
- Combination with Convex Functions.
- RNNprop. Normalized gradient from RMSprop and normalized momentum from Adam. They propose to feed in some square-rooted value instead.

Active learning. Goal is to train a model that makes useful predictions on new data. But in the active learning setting, there may be few or no labels available for training. Gathering labels may be costly too. The solution is to collect labels for a subset of points - balance labelling costs and prediction quality. Some strategies include updating classifier online, then label an instance that confuses the classifier. Bayesian approach is to label the instance expected to provide the most information.
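
The "label an instance that confuses the classifier" strategy is usually called uncertainty sampling. A minimal sketch, assuming the classifier gives a probability vector per unlabelled instance and that confusion is measured by entropy (one common choice among several); this is the hand-designed baseline, not the learned strategy from the paper.

```python
import math


def prediction_entropy(probs):
    """Entropy of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def most_uncertain(prob_rows):
    """Index of the unlabelled instance the classifier is least sure about."""
    return max(range(len(prob_rows)), key=lambda i: prediction_entropy(prob_rows[i]))
```

For predictions `[[0.9, 0.1], [0.5, 0.5], [0.99, 0.01]]` this asks for a label on the second instance, since a 50/50 prediction has the highest entropy.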

An example is movie recommendations. With a cold start, how do we ask a new user? Their approach is to use other users’ ratings to learn a strategy for selecting informative movies.

Key motivation for their paper is that many methods have been developed for active learning but make strong assumptions and require approximations. Through metalearning, they can learn an active learning algorithm end-to-end. They train a model to do active learning on data from related tasks. The learned strategy should apply to new tasks from the same distribution. This allows co-adaptation.

Related papers include “Active One-shot Learning” (Woodward and Finn, 2016) and Matching Networks (Vinyals et al. 2016).

In summary, their model learns to actively build the labeled support set for a Matching Network.

Lots of stuff on Fisher, don’t understand at all.

Lots of stuff on Fisher, don’t understand at all. They compare their results to BatchNorm and stuff and show better results. Apparently it can be added as something like a BatchNorm.

Humans learn continuously throughout their life. However, for machines, there’s a training phase on stationary data. If you want to update the model with new data, you usually need to run through the entire data set again or you suffer from catastrophic forgetting. The new memories overwrite the old ones, and capacity is not the issue.

In biology, we have synapses that connect neurons and they are complex biochemical dynamical systems. See Redondo and Morris (2010) for more information. In machines, synapses are single scalar values, which doesn’t really agree with computational neuroscience. See Fusi et al. (2005), Lahiri and Ganguli (2013), Benna and Fusi (2016) for more.

Existing approaches to alleviate this are architectural (progressive networks), functional (some regularization) or structural. They are excited about structural methods like Elastic Weight Consolidation by Kirkpatrick et al. 2017. They seem to model the synapse pretty well. Nice illustration by Zenke on the loss landscape on 2 different tasks.

They take ideas from EWC but change the Fisher to an Omega. They show that their importance metric is similar to Fisher as well. Their results work on MNIST, CIFAR10, CIFAR100, etc.
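
The shared structure of these methods is a quadratic penalty that anchors each weight to its value after the previous task, scaled by a per-weight importance. A minimal sketch of just that penalty, with hypothetical names; how the importance (Fisher or Omega) is computed is the actual contribution of each paper and is not shown here.

```python
def consolidation_penalty(params, anchor_params, importance, strength=1.0):
    """Penalize moving important weights away from their old-task values."""
    return strength * sum(
        omega * (p - anchor) ** 2
        for p, anchor, omega in zip(params, anchor_params, importance)
    )
```

This term would be added to the new task's loss, so a weight with high importance is expensive to change while an unimportant weight is free to move.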

https://github.com/ganguli-lab/pathint

There are various approaches for efficient inference. First, models more efficient by design. Second, model compression. The authors propose to adapt the model at test time. Every sample sees a different model. The main advantage of doing it this way is that you can leverage the existing set of models and use the best one for each sample. The motivation for adaptation is to maintain accuracy of existing models with improved efficiency through better utilization.

The key reason why sample adaptivity reduces cost is because not all samples are equal in complexity. Some are a lot easier than others. The goal is to learn a policy that can recognize easy examples at run time. This policy is something like gates that route the input to different networks and even within each layer, you can exit any time.
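
A rough sketch of the routing idea, with one big simplification: the exit decision here is a fixed confidence threshold, whereas the paper learns the policy. The `stages` list of `(predict, cost)` pairs and the threshold value are hypothetical.

```python
def adaptive_predict(x, stages, threshold=0.9):
    """Run cheap models first and stop as soon as one is confident enough.

    stages is a list of (predict, cost) pairs ordered from cheap to expensive,
    where predict(x) returns (label, confidence). The last stage always answers.
    """
    total_cost = 0
    label = None
    for predict, cost in stages:
        total_cost += cost
        label, confidence = predict(x)
        if confidence >= threshold:
            break
    return label, total_cost
```

An easy sample exits after the cheap model and pays only its cost, while a hard sample falls through to the expensive model and pays for both.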

I like this idea.

Code: https://github.com/tolga-b/ann

For large scale classification, the memory cost of a softmax layer with 1000 classes is 65.5MB. For 1000000 classes, it’s 131GB. It just doesn’t make sense to use softmax. It’s too memory intensive and compute intensive.

The authors propose to reduce redundancy in feature space by learning them to be as different as possible. They propose some regularizer to learn features that are as different as possible, while allowing them to share important features.

It’s interesting how the motivation they mentioned at the start is different from their solution. They talk about softmax, which is totally irrelevant in my opinion.

Code: https://github.com/jaehong-yoon93/CGES

Not too familiar with RNNs but decided to come for this because the last 2 papers on music look really interesting.

No idea what this paper is about.

Generally interesting talk on better training methods of RNN for sequence generation. The music samples were really cool and Mechanical Turk reliably rated their methods as better. Music is pretty generic, therefore they tried their method on molecule generation, which can be quantitatively measured. Their method performs better.

In conclusion:

- Optimizes for task-specific reward with RL.
- Maintains information learned from data.
- Encourages diversity in generated samples.

The balance of data and reward is important, because datasets are often incomplete, imperfect or biased, and sometimes it is not possible to design a sufficiently good reward.

Code: TF Magenta Repository, RL Tuner

As an aside, it was a really well planned and informative talk that was easy to follow.

The goal of TTS is to take a string of text and turn it into natural sounding human speech. With current approaches, an existing TTS system is necessary to get started, domain expertise is required to construct the features, and hand labelled data is really hard to get because it is specialized.

Three main points to take away:

- Make TTS much simpler as you can start from scratch. Fully neural.
- Sound quality tied to several components. Only 2 mattered, F0 (pitch) and phoneme duration.
- Accelerate research and productize autoregressive models. Currently human evaluation is key for measuring progress. The speed of the train-generate-listen loop is critical for research.

HELLO! gets converted into phonemes sil HH AH L OW sil. I’ve no idea what these phonemes are actually. Next, the phonemes need duration and F0, which is like “hello”, or “hellllllooooo”. Pitch is necessary after that too.

“A cat has 9 lives”. Lives in this case is L AY V Z. “A cat lives a long time”. Lives in this case is L IH V Z.

If you now have text to phonemes, you need to figure out phonemes to pitch and duration next. For every phoneme, how long am I going to stay at this phoneme? Next, what’s the pitch at every phoneme?

With F0, durations, and phonemes, we send all these into WaveNet++. Autoregressive WaveNet predicts waveforms. These waveforms can be converted into audio. However, for WaveNet, every time step requires a convolution. Which is extremely expensive! This paper makes it real time through a series of many optimizations. It can be applied to PixelCNN and others too (as claimed by author).

To evaluate the audio quality, Mean Opinion Score on Mechanical Turk is used. The results show that F0 and duration are really important to make it realistic.

As an aside, it was a really well planned and informative talk that was easy to follow.

The focus is on chorale harmonization. It’s like the MNIST for polyphonic music. There have been many approaches in the last 30 years, but they often need expert knowledge, lack musical evaluation, or can’t be used in a creative way.

Bach Chorales: given one line of notes, write the three lower parts. It’s relatively homogeneous (same timing, same rhythms, same rules, etc.), large corpus, and has been studied for centuries so it is easy to evaluate for experts.

Very nice audio results!

WaveNet works pretty well when trained on stuff like “only piano” or “only something”. Not entirely sure what the purpose of this paper is as I’m not well versed in this area.

Code: g.co/soundmaker

Excited for this session, especially the first and second ones. Finally something I might actually have an inkling of understanding today.

Related work included the ICLR 2017 paper by Zhang et al, which showed that DNNs can fit random labels. We need data-dependent explanations of DNN generalization ability.

3 Experiments:

- Qualitative differences in fitting noise vs. real data.
- Deep networks learn simple patterns first.
- Regularization can reduce memorization.

See the paper for more details of results of each experiment. It’s mostly empirical, but the intuition is that if there are patterns, then it’s easy for the network to fit it.

Does memorization imply generalization? It’s really hard to answer that question. K-NN does memorize but it can generalize quite well when K is chosen properly. It’s an interesting insight to me because I always thought a neural network that memorizes almost always does not generalize.
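
A tiny sketch of the k-NN point, on made-up 1D data with one deliberately mislabelled point. The classifier literally stores the training set, and the only knob is how many neighbours vote.

```python
from collections import Counter


def knn_predict(train, query, k):
    """Majority vote among the k training points closest to the query.

    train is a list of (position, label) pairs; it is kept verbatim,
    which is the "memorization".
    """
    neighbours = sorted(train, key=lambda item: abs(item[0] - query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Toy data: class "a" lives near 0, class "b" near 1, and the point at 0.2
# carries a noisy "b" label.
toy_train = [
    (0.0, "a"), (0.1, "a"), (0.2, "b"), (0.3, "a"),
    (1.0, "b"), (1.1, "b"), (1.2, "b"),
]
```

With `k=1` a query at 0.21 copies the noisy label of its nearest neighbour, while `k=3` lets the two clean neighbours outvote it.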

Deep neural networks have become very powerful recently but we don’t understand these systems. Engineering, mathematical, neuroscience approaches have been developed to try and understand it. The last one is the psychological approach. This paper tries to merge psychology with neural networks.

This work is based on Matching Networks. Basically we have Inception pre-trained features and compare these features.

Imagine you visit a culture that is entirely different and you have to learn their language. It’s something like that alien movie and “kangaroo”. “Kangaroo” can mean many things, but many hypotheses can be ruled out. For example “Kangaroo” could be the name of that animal or the “animal” class. The author presented the example of matching a shape. You were first given a shape and told this is a “dax”. In the next example, you’re given a “dax” with a different colour and an object with the same colour as the previous “dax”. In their experiments, they showed that humans have a shape bias instead of a colour bias. They use the CogPsych data set and a few others with these triplets. They showed that there is a very strong shape bias in Inception networks.

The conclusion is that you have to be wary when shape matters or when color matters. Accuracy isn’t everything. Having high accuracy doesn’t mean your model is great. Stopping time matters as well, as the model might have different properties as the training goes on.

**Human bias changes over time. In young kids, it’s really small, and over time
as we grow older, shape bias almost takes over. I think this is a really nice
point that we have to take into account when designing real world AI.**

One of the questions was “ImageNet doesn’t have colour labels, could this be why Inception is biased too?”. It’s a great question IMO, and the author agrees that this is probably true.

Great presentation and paper.

This paper aims to study the activations and weights. The first thing they did was to analyse the activations in each node through the average node response and then running a hierarchical clustering (seems iffy though). The author acknowledges this and goes on to talk about other analysis in activations.

They analyse the weights next.

Overall, it’s done on speech stuff so I don’t really understand it fully. However, it would be a better presentation if the author draws the differences between her work and other previous papers that do similar things.

Attribute a complex deep network’s prediction to its input features. Many methods have been trying to do this. The basic question to ask: What makes for a good attribution method?

The author proposes to list desirable criteria (axioms) and develop X, the only method that satisfies these desirable criteria.

Interesting results on fireboat, watch, jackfruit, school bus. Very similar to LRP (aside: oh they actually cited LRP! Typed this before that slide came up). Another paper he cited was DeepLift 2017. Previous methods fail because they use chain rule and it’s not “technically correct” according to the author.

They show nice results on other domains as well. It solves many problems as well but there’s one deficiency on predicting even number of black pixels, for example.

Overall a nice paper to help in understanding what the network learned.
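
Going by the name of the repository linked for this talk, the method is integrated gradients: average the gradient along a straight path from a baseline input to the real input, then scale by the input difference. A minimal sketch under simplifying assumptions: gradients are taken by finite differences instead of backprop, and `toy_model` is a made-up function standing in for a network.

```python
def numerical_gradient(f, x, h=1e-5):
    """Central-difference gradient of a scalar function of a list."""
    grad = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2.0 * h))
    return grad


def integrated_gradients(f, x, baseline, steps=50):
    """Midpoint Riemann sum of gradients along the path baseline -> x."""
    total = [0.0] * len(x)
    for s in range(steps):
        alpha = (s + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = numerical_gradient(f, point)
        total = [t + g for t, g in zip(total, grad)]
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]


def toy_model(x):
    """Stand-in for a network output."""
    return x[0] * x[1] + x[2] ** 2
```

A handy sanity check is that the attributions add up to the difference between the model output at the input and at the baseline.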

Code: https://github.com/ankurtaly/Integrated-Gradients

Basically a prediction probability together with a confidence value. Nice idea! Always wanted a paper like that to refer to. Oops I got this wrong. The author mentioned this at the start of the presentation but turns out this is not what he truly meant. He just used it to frame the problem.

The problem that happens quite often today is that neural networks are overconfident. He refers to this problem as miscalibration, and we cannot really trust the probability output of neural networks.

Questions to answer:

- How can we measure and visualize miscalibration?
  - Get all the probabilities and bin them. In each bin, compute the accuracy. Take the gap between average confidence and accuracy in each bin, weight it by the fraction of samples in that bin, sum over bins, and you get the Expected Calibration Error (ECE).

- What makes neural networks miscalibrated?
  - They were well calibrated in 2005. However, miscalibration is a big problem in 2017 (showed graph on ResNet).
  - Increased network capacity
  - Batch Normalization
  - Less regularization
  - Nice insight on why it miscalibrates. Author claims we are actually overfitting on negative log likelihood and become more confident about misclassified results.

- How can we correct miscalibration?
  - Temperature scaling. Manages to decrease negative log likelihood and ECE by factor of 10.
  - Divides the logits by a single parameter T before the softmax. T is chosen by minimizing NLL on the validation set after training.
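
The binning recipe for Expected Calibration Error can be written down directly. A minimal sketch, assuming equal-width confidence bins and taking the model's top-class probability as its confidence; the function name and the default of 10 bins are my own choices.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Sample-weighted average gap between confidence and accuracy per bin.

    confidences: top-class probability for each prediction.
    correct: whether each prediction was right.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece
```

A model that says 0.95 on four samples but gets only two right has an ECE of about 0.45, while one that says 0.75 and gets three of four right has an ECE of 0.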

Miscalibration is inherently low dimensional because a single parameter can change it. The miscalibrated confidences are ordered too, since the same parameter is applied to every prediction. It’s also cool that this can correct miscalibration without changing predictions.
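
A minimal sketch of temperature scaling, assuming T is picked by a plain grid search over a few hypothetical candidate values (an optimizer would be used in practice). Dividing all logits by the same positive T keeps their order, which is why the predicted class never changes.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / T; T > 1 softens the probabilities."""
    scaled = [v / temperature for v in logits]
    shift = max(scaled)
    exps = [math.exp(v - shift) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def negative_log_likelihood(logit_rows, labels, temperature):
    """Average NLL of the true labels on a held-out set."""
    losses = [
        -math.log(softmax_with_temperature(row, temperature)[label])
        for row, label in zip(logit_rows, labels)
    ]
    return sum(losses) / len(losses)


def fit_temperature(logit_rows, labels, candidates):
    """Pick the candidate T with the lowest validation NLL."""
    return min(candidates, key=lambda t: negative_log_likelihood(logit_rows, labels, t))
```

On a toy validation set where the model is about 99% confident but only 75% accurate, the search settles on a temperature above 1, pulling the confidence down towards the accuracy.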

Damn cool animation on the slides. Good paper and worth a read.

Code: https://github.com/gpleiss/temperature_scaling

There is about to be a huge explosion of data in genomics and there are going to be a lot of opportunities for that and Peter would really like more people to start working on ML for genomics. The projection is that there would be a billion people with their genome sequenced in 2025.

“All models are false, but some are useful” - George Box

How important are genomes compared to Mitochondrial DNA, RNA, Microbiome etc.? The key thing about DNA is that it is primary and causes a lot of things.

peter.donnelly@genomicsplc.com

Angry Birds is not solved from an RL perspective. Really? I didn’t know that. Robocup, a robot football team that can beat the world cup winners. This is “due” in 2050.

This talk is about algorithms and environments that will hopefully bring us closer to AGI.

**Montezuma’s Revenge and Hierarchical RL**

It is extremely hard because reward signal is weak and delayed. Hard for the network to generalize. Can we abstract away primitive actions and have coarser temporal resolution?

One possible better representation would be to have subgoals. However, can we learn these subgoals? Hierarchical RL aims to be able to solve this. Feudal RL (NIPS ‘93) is a nice paper to read for this. FeUdal Nets is a paper presented at ICML this year. Interesting demo on the game itself.

**Multiple Tasks and Continual Learning**

If we have 3 tasks, we can take all 3 tasks and learn them at the same time. Or, we can learn each task separately, one after another. Of course, the second option runs into what is known as catastrophic forgetting in current neural network literature. And even if we only have one task, the performance of the RL agent degrades quickly when we change the enemy to black, add Gaussian noise to the input image, or invert the colour (and the RL agent actually gave up). Their solution is Elastic Weight Consolidation, which was described in a talk previously.

If the tasks don’t really get along, one potential solution is to use progressive networks or Distral. Distral does pretty well exploring mazes for apples and exploring deserts for mushrooms. It explores the environment well.

**Auxiliary Tasks and Labyrinth Mazes**

The mazes are procedurally generated and the game involves random start, find the goal, teleport randomly, find the goal again, repeat. Adding depth prediction on the visual features causes their curves to take off in terms of performance and it’s more stable as well. These are the auxiliary tasks they are talking about. It gives the visual features more information.

**Multimodal Agents and StreetLearn**

They took StreetView and converted it into StreetLearn. It’s a really cool data set. They came up with a task: The Taxi Task.

- Spawn randomly and navigate to a random target location.
- Small reward on random 1% of nodes in graph.
- Start receiving reward when close to target (within 400m).
- If target is reached (100m), navigate to a new random target.

Really nice demo though it seemed to go against traffic. This isn’t published yet but will be in a few weeks.

**Continuous Control and Parkour**

Proprioceptive - “near the body”. These include joint angles and velocities, touch sensors, etc.

Exteroceptive - “away from the body”. These include position/velocity, vision, task-related information.

They trained the spider with proprioception and terrain; a single uniform reward, based on forward progress; a penalty on energy consumed. Curriculum structure in the terrain helps. Curriculum in this sense means giving it easier tasks first and then increasingly harder tasks. The demos on the humanoid and 4 legged creature are really cool.

**Conclusion**

A nice question was asked: Do you think the work we are doing now will lead to AGI?

“Well, once upon a time, people said AI was captioning an image. When we got there, they kinda said oh that isn’t AI any more. The term AI/AGI will always be evolving. I think that memory, attention, and adaptive computation are very interesting areas of work today and they definitely start to resemble a baby’s intelligence.”

DL/RL has allowed us to replace greedy heuristics in many applications. It’s a natural extension to use DL and RL for this. The policy they are learning is the assignment of operations to hardware. It’s cast as a seq2seq problem. This allows them to condition the placement of new ops on previous assignments.

Any placement problem could be solved with this. Like circuit placement and stuff.

Why are we talking about CPU?

- Latency sensitive application
- Autoregressive
- Memory constrained
- And we have a lot of CPUs

We can actually do fast things on CPU through better algorithms, better utilization and scalability. See paper for technical details of algorithms but it’s basically some transforms that we see in many other implementations. They claim to work a lot better for higher N-dimensional convolutions.

Parallelizing into the batch is going to cause non-contiguous blocks in memory.

The author argues that there are applications where CPU is faster like A3C.

MEC offers reduced memory requirements, faster runtime due to improved cache-locality, exact convolution results. Read this paper already so didn’t learn anything new from the talk.

Similar to the other works on model compression. Their future work is evolution compression. Sounds cool.

Code: https://github.com/YunheWang/RedCNN

Language modeling is an important task in NLP but we all know that a softmax over a gigantic vocabulary is extremely slow. It currently takes days to train large models on relatively small datasets.

Hierarchical softmax is one way to circumvent this problem but in a batch, to parallelize the hierarchical softmax, you have a lot of memory overhead. Word vocabularies follow Zipf’s law. 80% of occurrences come from just 1400 words. This paper exploits Zipf’s law by splitting the vocabulary into a few small clusters.
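To get a feel for why this helps, here is a rough sketch under my own assumptions: an idealized Zipf distribution with exponent 1 over a 100,000-word vocabulary, split into one frequent "head" cluster and one "tail" cluster. This is not the paper's clustering procedure, just the back-of-the-envelope version.

```python
import numpy as np

def zipf_probs(vocab_size, exponent=1.0):
    # Idealized Zipf's law: frequency of the r-th word is proportional to 1 / r**exponent.
    ranks = np.arange(1, vocab_size + 1)
    weights = 1.0 / ranks ** exponent
    return weights / weights.sum()

def head_size_for_coverage(probs, coverage):
    # Smallest number of top-ranked words whose total mass reaches `coverage`.
    cumulative = np.cumsum(probs)
    return int(np.searchsorted(cumulative, coverage) + 1)

probs = zipf_probs(vocab_size=100_000)
head = head_size_for_coverage(probs, 0.8)

# Two-level softmax: always score the head words plus one "tail" slot,
# and only score the tail words when the target actually falls in the tail.
tail_mass = 1.0 - probs[:head].sum()
expected_units = (head + 1) + tail_mass * (len(probs) - head)

print(f"{head} of {len(probs)} words cover 80% of occurrences")
print(f"expected output units per token: {expected_units:.0f} vs {len(probs)} for a full softmax")
```

Real corpora are not perfectly Zipfian, so the exact head size will differ, but the point stands: most tokens only ever need the small head.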

Code: https://github.com/facebookresearch/adaptive-softmax

Don’t know enough of background to understand this paper.

It is currently infeasible to train a household robot to do every possible combination of instructions. We want to train the agent on a small set of tasks such that it can generalize over a larger set of tasks without retraining.

For example, a robot has an interaction of eat and throw with an apple or a ball. But when the robot now sees chips, what interaction should it do?

The idea that they have is to learn the correspondences between similar tasks and to teach the neural network with analogies. For example, if we visit A and pick up A, and A is similar to B, do we pick up B when we visit B?

Very cool demo in the end on how the agent can execute many lines of instructions.

Experience replay stabilizes training and they are stabilizing the stabilizer. Why do we care about multi-agent systems? Well, the real world is made up of them and the agents have to make decentralized decisions while optimizing a common reward. DQN relies heavily on the experience replay buffer.

Model-based ML trains a model first and then uses it. Can we build a model and use the model at the same time? This is what model-based RL aims to do.

Using a model is generally called planning. There are different types of planning and here they focus on search like MCTS.

Choreographing music notes into steps for playing. It’s hard to craft this because they must be challenging enough (and with different levels), follow rhythm, make sure you don’t have a sequence that causes the user to face away from the screen, etc.

Inputs are audio features. The model converts these into two decisions: when to place steps and which steps to place.

Cool demo and well received by the community.

Can’t really be done real time in front of a band because you need to “predict into the future”. The arrows come up from the bottom.

Code: https://github.com/chrisdonahue/ddc Demo: http://deepx.ucsd.edu/ddc

The agent controls the keyboard and mouse, communicates with the web and gets information (pixel input and DOM) from the screen. It’s easy to use because you have access to the entire internet and the mouse and keyboard are your collection method.

Mostly demonstrated on doing small tasks on the Internet like clicking or filling up forms.

Web brings open-domain realism to RL agents. Very challenging due to sparse rewards. Tradeoff depending on goals. For generality, low-level pixels to actions, good for RL. For efficiency, higher-level DOM, type actions, good for progress.

Site: world-of-bits.com

Nice results showing they beat many other standards. Good work!

Their model predicts 11 out of 13 targets to chemical accuracy on the DFT (apparently very expensive).

There are some papers on generative models on chemical space and this is really interesting!

This might be a fun read in future, that’s about it.

Code: https://github.com/brain-research/mpnn

The goal is to get cool looking explosion simulations for graphics applications.

The problem is purely unsupervised. The model is essentially volumetric convnet in a multi scale approach.

Nice images but not my interest so it might be a fun read in future.

American Jurisprudence in cameras, phones, and camcorders. In 1983, the Sony camcorder had no mute button. A Boston University student was recording a policeman making an arrest. He was arrested instead. A kid was getting bullied on the bus by the driver. The kid’s mom went to get a camera to film the driver. She brought the video to the police and she was arrested instead. Fast forward to today, cameras are now everywhere and we can film freely. They are even used as evidence!

The design decisions that we make have a tremendous effect on our society. Like today, it’s not hard to add a mute button for video but we don’t add it in! Bed sensor data is also sent to the cloud. We can’t just simply say stuff like “previous generations have done this”.

Latanya claims that we now live in a technocracy. As researchers, we are essentially policy makers now. We make policies through the technology we create. The simple decision of choosing to add this button or not, to add this feature or not, etc.

In the past, advertisements had to be vetted before they were posted. In today’s age where ads are freely posted by algorithms, we really have to be careful. A black college fraternity in the USA was celebrating their centennial. There were the standard ads, but banners like “check if you are arrested” and ads for credit cards that had been harshly criticized were posted as well.

Is your medical record really private? Even with “shared anonymous data”, it is possible to get your data. By cross referencing hospital records (with the names removed) against a newspaper article (that contains your name) on the details they share, it is possible to recover the identity of the individual.
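A toy version of such a linkage attack, as I understand it. Everything here is hypothetical: the records, the field names, and the choice of age, zip code and date as the shared details. The idea is simply that a combination of innocuous fields can be unique.

```python
# Hypothetical "anonymous" hospital records: no names, but other details remain.
hospital_records = [
    {"age": 34, "zip": "02139", "admitted": "2017-03-02", "diagnosis": "fracture"},
    {"age": 34, "zip": "02140", "admitted": "2017-03-02", "diagnosis": "asthma"},
    {"age": 61, "zip": "02139", "admitted": "2017-03-05", "diagnosis": "concussion"},
]

# Hypothetical newspaper reports: names plus the same kind of details.
news_articles = [
    {"name": "Jane Doe", "age": 34, "zip": "02139", "date": "2017-03-02"},
]

def link(records, articles):
    """Match each article to a hospital record using the shared details."""
    matches = []
    for article in articles:
        candidates = [
            r for r in records
            if r["age"] == article["age"]
            and r["zip"] == article["zip"]
            and r["admitted"] == article["date"]
        ]
        # A unique candidate means the "anonymous" record is re-identified.
        if len(candidates) == 1:
            matches.append((article["name"], candidates[0]["diagnosis"]))
    return matches

print(link(hospital_records, news_articles))
```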

For recommender systems for employers, if the system recommends young people, and you picked them, they eventually start to recommend only young people and this is not what was originally designed! And it violates US employment laws.

She is publishing a paper in 2 weeks time on the 2016 Presidential Election. They showed vulnerability in 36 states.

Latanya taught a class called “Data Science to Save the World”. Basically, use data to find a fact scientifically where technology has had unforeseen consequences. At the end of this class, she had 26 students who crossed the mark and she brought them all down to DC to meet with regulators. At the end of the class they felt that it shouldn’t stop here and they started the Journal of Technology Science. Some of the initial papers include “Price Discrimination in The Princeton Review’s Online SAT Tutoring Service”. Some of the results show Asians are 1.8x as likely to be quoted a higher price. Many other papers were shown, which talk about AirBnB discriminatory pricing and Facebook’s geolocation automatic sharing (Aran Khanna).

The conclusion is basically, technology now rules the world through Facebook, Google Chrome, GMail, etc. As technologists, we hold a lot of power now and in future and we have to think carefully about that.

Rotated between 2 workshops today. Principled Approaches to Deep Learning and Visualization for Deep Learning.

This talk started with the oral paper, Generalization and Equilibrium in GANs, but went further to talk about other things.

The theorem suggests that GAN training objectives are not guaranteed to avoid mode collapse, so does this happen during real life training? Sanjeev went back to think about this question and sent a reply to Google.

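If you put 23 random people in a room, the chance is more than 50% that two of them share a birthday. Suppose a distribution is supported on N images. Then a sample of size roughly sqrt(N) is likely to contain a duplicate image (the probability passes 0.5 at about 1.2 * sqrt(N), which is where the 23 for N = 365 comes from).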

Follow up work by them is the birthday paradox test on arXiv. They ran DCGAN on CelebA and found near-duplicates among 500 samples, which implies the GAN has a support size of about 250K (500 squared), slightly larger than the 200K training set. So is it learning the true distribution? Well, not really. There was quite a bit of debate on this during the workshop, but of course more work is necessary.
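The arithmetic is easy to check. This is my own sketch of the reasoning, assuming a uniform distribution over the support and exact duplicates (the actual test looks for near-duplicates by eye, which this does not model). The function names are mine.

```python
def prob_duplicate(sample_size, support_size):
    # Probability that a uniform sample of `sample_size` items drawn from
    # `support_size` possibilities contains at least one duplicate.
    p_all_distinct = 1.0
    for i in range(sample_size):
        p_all_distinct *= (support_size - i) / support_size
    return 1.0 - p_all_distinct

def support_estimate(sample_size):
    # Heuristic from the talk: duplicates showing up at sample size s
    # suggest a support of roughly s squared.
    return sample_size ** 2

print(prob_duplicate(23, 365))   # classic birthday paradox, about 0.507
print(support_estimate(500))     # 250000, the 250K quoted for DCGAN on CelebA
```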

Why can’t Stochastic Rounding beat BinaryConnect in terms of classification error rates? It’s well known that floating point methods have better performance. If you can train without floating point computations at all, you can speed up training by a lot. The goal in this talk is to develop principled frameworks for training quantized networks. There are 2 questions to ask:

Why are we able to train quantized networks at all? Why does training require floating points?

Convergence theory for stochastic rounding states that stochastic rounding converges until it reaches an “accuracy floor”, which is determined by the quantization error.
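For reference, this is what I understand stochastic rounding to be: round up or down at random so that the result is correct in expectation. A minimal sketch with a made-up weight value and grid step, not the speaker's code.

```python
import numpy as np

def stochastic_round(x, step, rng):
    # Round each value to a multiple of `step`. The probability of rounding up
    # equals the fractional distance to the lower grid point, so E[result] = x.
    scaled = np.asarray(x, dtype=float) / step
    lower = np.floor(scaled)
    prob_up = scaled - lower
    rounded = lower + (rng.random(scaled.shape) < prob_up)
    return rounded * step

rng = np.random.default_rng(0)
w = np.full(100_000, 0.3)            # hypothetical weight value between grid points
q = stochastic_round(w, step=1.0, rng=rng)

print(np.unique(q))   # only grid values survive
print(q.mean())       # close to 0.3 on average
```

Each individual weight is still off by up to one grid step, and that rounding noise does not shrink with the learning rate, which is my reading of where the accuracy floor comes from.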

Convergence theory for BinaryConnect states that it converges until an accuracy floor as well, but BC finds exact solutions to quadratic problems.

What can we say about SR on non-convex problems?

Exploration vs Exploitation. Floating point computations start to exploit when we shrink the learning rate. But in SR, we explore at the start and when we shrink the learning rate, it doesn’t exploit as much as compared to floating point.

He went on to talk about a Markov Chain Interpretation which was slightly more complicated.

In summary, for convex problems, both methods converge until an accuracy floor but for non-convex problems, the annealing properties don’t really work in stochastic rounding.

What does a generic deep function look like? How can we exploit this knowledge to improve performance? This talk combines diverse theoretical techniques from many things I’ve never heard of before. I’m probably going to get blown away.

The origins of the exponential growth can be attributed to chaos theory. In fact the author claims that chaos theory is a special case of deep learning.

Propagation of a manifold (a circle) through a deep network. That was interesting though I don’t understand it fully.

In the chaotic regime as the circle propagates through, the linear function expands the circle, and the non-linearity folds it.

Deep networks can disentangle manifolds whose curvature grows exponentially with depth.

Conclusion by Surya:

- Our results reveal the existence of a transient chaotic phase in which the network expands input manifolds without straightening them out, leading to “space filling” curves that explore many dimensions while turning at a constant rate. The number of turns grows exponentially with depth.
- Such exponential growth does not happen with width in a shallow net.

**Beyond manifold geometry to entire Jacobian singular value distributions**

How do random initializations and nonlinearities impact learning dynamics? Backpropagation is essentially a product of Jacobians.
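A quick experiment to see the "product of Jacobians" point for myself. Big assumption: I drop the nonlinearity entirely, so each layer's Jacobian is just its weight matrix. The talk is about nonlinear networks, so this only illustrates the linear part of the story. Width, depth and seed are arbitrary.

```python
import numpy as np

def random_orthogonal(n, rng):
    # QR decomposition of a Gaussian matrix gives a random orthogonal matrix.
    q, r = np.linalg.qr(rng.normal(size=(n, n)))
    return q * np.sign(np.diag(r))  # sign fix so the distribution is uniform

def product_singular_values(matrices):
    # Singular values of the end-to-end Jacobian of a deep linear network.
    product = np.eye(matrices[0].shape[0])
    for m in matrices:
        product = m @ product
    return np.linalg.svd(product, compute_uv=False)

rng = np.random.default_rng(0)
n, depth = 64, 20
gaussian = [rng.normal(size=(n, n)) / np.sqrt(n) for _ in range(depth)]
orthogonal = [random_orthogonal(n, rng) for _ in range(depth)]

sv_gauss = product_singular_values(gaussian)
sv_orth = product_singular_values(orthogonal)

print("Gaussian   condition number:", sv_gauss.max() / sv_gauss.min())
print("Orthogonal condition number:", sv_orth.max() / sv_orth.min())
```

With orthogonal weights every singular value of the product stays at 1, while the Gaussian product becomes badly conditioned as depth grows, even though both are scaled to keep the average gradient norm in check.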

The brain actually has sigmoidal nonlinearities and the work he has done shows that with orthogonal weights, sigmoids can outperform ReLU.

We should actually initialize weights at the edge of chaos so the network can decide on what to do given the input training data.

Conclusion by Surya:

- An order to chaos phase transition governs the dynamics of random deep networks, often used for initialization.
- Not all networks at the edge of chaos - with neither vanishing nor exploding gradients - are created equal.
- The entire Jacobian singular value distribution, and not just its second moment impacts learning speed.
- We introduced free probability theory to deep learning to compute this entire distribution.
- We found that tanh networks with orthogonal weights have well conditioned Jacobians, but ReLU networks with orthogonal weights, or any network with Gaussian weights, do not.
- Correspondingly, we found that with orthogonal weights, tanh networks learn faster than ReLU networks.

Papers for further reading:

- Exponential expressivity in deep neural networks through transient chaos, NIPS 2016.
- On the expressive power of deep neural networks, ICML 2017.
- Exact solutions to the nonlinear dynamics of learning in deep linear networks, ICLR 2014.
- Investigating the learning dynamics of deep neural networks using random matrix theory, under review.
- Deep information propagation, ICLR 2017.

Site: ganguli-gang.stanford.edu

Overall, it was a thoroughly interesting and well presented talk and has greatly sparked my interest in theory of deep learning.

This was a pretty tough talk that I don’t understand but there’s a new library called LibSPN that seems promising so I’d really like to read this. As history would have it, SVMs were superbly popular until Deep Learning, who knows what’s next?

I’ve chanced upon this paper before and heard about Tomaso Poggio. In fact, I’m going to buy his book. I’m getting more attracted to methods that are truly inspired by neuroscience rather than just linear algebra that works. Man has always taken inspiration from nature, and we definitely should take inspiration from analysis of our brains.

**It is time for a Theory of Deep Learning**

Wow, the title itself is exciting. CBMM’s main goal is the Science and the Engineering of Intelligence.

We aim to make progress in understanding human intelligence, that is in understanding how the brain makes the mind, how the brain works and how to build intelligent machines. We believe the science of intelligence will enable better engineering of intelligence.

Tomaso believes a theory of deep learning is emerging. There are three main scientific questions:

Approximation theory: when and why are deep networks better than shallow networks?

Optimization: what is the landscape of the empirical risk?

Generalization by SGD: how can overparametrized networks generalize?

Conventional wisdom states otherwise. We have more parameters than input data points! We should overfit, so how come we can generalize?

Since the 1980s, it is well known that if you have networks with 1 hidden layer with a nonlinearity, you can arbitrarily approximate a continuous function. While it is indeed true that you can do this well, you might need a huge number of weights. And this is called the curse of dimensionality.

There are certain classes of functions that allow you to avoid the curse of dimensionality. These are functions that are compositional in a certain hierarchical, local way. The important point is that convolutional neural networks are a special case of these compositional functions. It is interesting that the “weight sharing” is not what helps, it is the “locality” that is key. Based on this statement, he did some experiments and the results corroborate his theory. It is important to note that this is an extension of the classical theorem by Hastad in 1987. See also Telgarsky, 2016.

So why are compositional functions important for perception (vision, text, audio)? Neuroscience makes the argument that our brain evolved in a way that favours local connectivity. The visual cortex has a hierarchical structure, with different areas stacked one on top of another. It’s easy to see how our brain might be compositional because language in itself is made up of letters, words, sentences, paragraphs, texts, which is hierarchical. This concept of hierarchical locally connected elements can be found from as early as the Perceptrons book by Minsky.

We are now in the phase of Big Data. But children don’t really learn this way. Is it true?

The next phase of ML: implicitly supervised learning, learning like children do from small datasets.

Some of the thoughts he has:

- Evolution has put in the genes prior information that is key for learning. In this case, the prior does not really mean probabilistic priors. He means implicit priors in the architecture, so certain contexts can be learned more easily.
- A good example that has come up in recent times is colourization. The system can have as input a black and white version and output colour images.
- Time continuity. When we look at the face of a person, during that time period, we see a few perspectives of his face and relatively few discontinuities. From that short instance, we can instantly “generalize” his face.
- The question of what is intelligence came up again and we can’t really define it right? In the 1950s, machines that could do integrals a lot better than all mathematicians were present but is that AI? Well, he doesn’t know. But his definition would be “human intelligence”.

Refer to Liao, Poggio 2017 for more information.

This was CVPR 2017’s best paper so it’s really nice to see it at ICML and hear it from the presenter himself. I got to ask him questions about it so it was nice too. The basic intuition I got was that you simply take the activations in the channel you want to analyse and upsample it into the original image space. With that region, it is possible to then compute an IOU score with the ground truth and identify the meaning of the feature. This is the rough idea I get which might be wrong so I’ll have to check the paper. This is definitely on my to try list as it’s pretty important to interpret your models so you can “prove” your model is doing the right thing.
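Here is that rough idea as code, with the same caveat that it might not match the paper. My assumptions: nearest-neighbour upsampling by an integer factor, a single hand-picked threshold, and a tiny made-up activation map and concept mask. Check the paper for how the upsampling and thresholds are really chosen.

```python
import numpy as np

def upsample_nearest(activation, factor):
    # Blow up each activation cell into a factor x factor block of pixels.
    return np.kron(activation, np.ones((factor, factor)))

def unit_iou(activation, concept_mask, threshold):
    # Threshold the upsampled activation to get the region the unit "fires" on,
    # then compare that region with the ground truth mask of a concept.
    factor = concept_mask.shape[0] // activation.shape[0]
    unit_mask = upsample_nearest(activation, factor) > threshold
    intersection = np.logical_and(unit_mask, concept_mask).sum()
    union = np.logical_or(unit_mask, concept_mask).sum()
    return intersection / union if union else 0.0

# Hypothetical 2x2 activation map for one channel and a 4x4 image.
activation = np.array([[0.1, 0.9],
                       [0.2, 0.8]])
concept_mask = np.zeros((4, 4), dtype=bool)
concept_mask[:, 2:] = True   # the concept occupies the right half of the image

print(unit_iou(activation, concept_mask, threshold=0.5))  # unit fires on the right half too
```

A unit whose IoU with some concept is high across the dataset would then get labelled with that concept.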

Site: netdissect.csail.mit.edu

This paper is an extension to existing visualization methods like the first few papers of simply using gradients by Simonyan, or deconvolution by Zeiler, or guided backpropagation by Springenberg. Dhruv talks about using Grad-CAM and shows much better visualizations.

Questions were raised on whether gradient based methods are the best way of visualizing networks, and Dhruv does acknowledge that there could be better ways but right now it appears that such visualization methods actually work and are useful.

Not everyone is a hardcore machine learning programmer. Some engineers just need some tools to be able to run some models and visualize them so they can deploy it into products. As such, they developed FBLearner Flow which is used by 25% of all engineers in Facebook. It has compute capability equivalent to 40 PFLOPS of GPU and 50x more AI experiments per day than a year ago.

Pierre showed a live demo which is essentially NVIDIA Digits on Drugs.

He wanted to show some nice pictures of their visualizations but couldn’t for confidentiality reasons.

They developed ActiVis for some visual analysis. It basically looks like TensorBoard and the likes.

The panel is made up of Pedro Domingos, Surya Ganguli, Pascal Poupart, and Nati Srebro.

Trends have come and gone. What we see in DL today is the combination of big data and hardware. The future is about sequences of tasks, continuous learning, stuff like that. We are pretty far away from human level intelligence and the likes.

Ganguli has really great work and his intuition on manifolds and stuff like that is really amazing.

We have replaced feature engineering with architecture engineering and hyperparameter tuning. The best role that theory can play here is as a theory of hyperparameters. Ganguli started quoting condensed matter physics stuff on solids, liquids and glasses. In the past, plotting some features and getting a straight line was considered a breakthrough. He thinks that DL is now at that stage. If we have some nonlinear quantities on the x-axis and the same thing on the y-axis and manage to get a straight line, we can start having a clue of what’s going on.

4 years ago, Andrew Ng gave an invited talk and said it’s a magical tool. On the acknowledgment slides, he had 6 projects, and each project had 4 different graduate students. But now, you don’t need 4 graduate students worth of engineering on each project.

Closing thoughts and predictions of deep learning:

- Quantum deep learning?
- Ganguli would be super happy if we had better understanding of adversarial examples and generalization.
- Reinforcement learning so we won’t need labellers.
- Unsupervised learning? But the problem with this is that if you don’t have any type of feedback, your model can generalize in a way that you don’t want it to. Ganguli believes babies are unsupervised but the other panellists believe that there is definitely some form of supervision.
- Ganguli believes that there must be some mechanism of learning without a task.
- If AI is a cake, RL is the cherry, Supervised Learning is the icing, and the entire cake below is Unsupervised Learning.

Shattered Gradients won the best paper award in this workshop and was an oral paper too. Might make sense to decipher thoroughly.

Only going for the Reliable Machine Learning in the Wild workshop. Dylan Hadfield-Menell gave the opening remarks. The main theme is “how do we advance and develop reliability engineering for artificial intelligence and machine learning?”. Civil engineers need to ensure that the bridge is reliable, etc. This is pretty high stakes and if something goes wrong, people will die. There are 2 flavours to reliability: models and systems. As both become more complex, it is necessary to have some research in this area to prepare for the future.

A recent paper that came out talked about attacks on vision systems. Basically, they stick some black and white tape on stop signs and the system fails. An interesting paper to look at would be “Summoning Demons. The Pursuit of Exploitable Bugs in Machine Learning.” by Stevens et al. Another recent piece of news that had lots of press is a security robot falling into a pool and “drowning”.

Dylan gave some remarks on various questions we should think about on various sub tasks of reliability and I reproduce them in full below:

Some questions to think about on robustness:

- How can we make a system robust to novel or potentially adversarial inputs?
- What are ways of handling model mis-specification or corrupted training data?
- What can be done if the training data is potentially a function of system behaviour or of other agents in the environment? For example, when collecting data on users that respond to changes in the system and might also behave strategically.

Some questions to think about on awareness:

- How do we make a system aware of its environment and of its own limitations, so that it can recognize and signal when it is no longer able to make reliable predictions or decisions?
- Can it successfully identify “strange” inputs or situations and take appropriately conservative actions?
- How can it detect when changes in the environment have occurred that require re-training?
- How can it detect that its model might be mis-specified or poorly calibrated?

Some questions to think about on adaptation:

- How can machine learning systems detect and adapt to changes in their environment, especially large changes? For example, low overlap between train and test distributions, poor initial model assumptions, or shifts in the underlying prediction function.
- How should an autonomous agent act when confronting radically new contexts?

Some questions to think about on value alignment:

- For systems with complex desiderata, how can we learn a value function that captures and balances all relevant considerations?
- How should a system act given uncertainty about its value function?
- Can we make sure that a system reflects the values of the humans who use it?

Some questions to think about on human factors:

- How do we build systems in light of the fact that actual humans will be interacting with and adapting to these systems when they are deployed?
- How can we monitor large-scale systems in order to judge if they are performing well? If things go wrong, what tools can help?
- How do properties of humans affect the guarantees of performance that the system has?
- What if the humans are suboptimal or even adversarial?

The motivation of this paper comes from safety. A natural way to incorporate safety in reinforcement learning is via constraints. We normally do this in RL by giving a very large negative reward. But this is insufficient! If the penalty is too small, it’s unsafe. If it’s too large, it’s too cautious. He showed videos on the effect of the penalty.
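The penalty problem is easy to reproduce on paper. This is a toy of my own, not the algorithm from the talk: three hypothetical policies with made-up returns and safety costs, chosen either by a penalized objective or by an explicit cost limit.

```python
# Hypothetical policies: (name, expected return, expected safety cost).
policies = [
    ("reckless", 10.0, 5.0),
    ("balanced", 8.0, 1.0),
    ("frozen", 1.0, 0.0),
]

def best_with_penalty(policies, penalty):
    # Penalty approach: fold the safety cost into the reward with a fixed weight.
    return max(policies, key=lambda p: p[1] - penalty * p[2])[0]

def best_with_constraint(policies, cost_limit):
    # Constraint approach: maximize return among policies that stay within the limit.
    feasible = [p for p in policies if p[2] <= cost_limit]
    return max(feasible, key=lambda p: p[1])[0]

for penalty in (0.1, 1.0, 10.0):
    print(f"penalty {penalty}: {best_with_penalty(policies, penalty)}")
print("cost limit 1.0:", best_with_constraint(policies, cost_limit=1.0))
```

With the penalty you have to guess the right weight: too small picks the unsafe policy, too large picks the one that does nothing. With a constraint you state the acceptable cost directly.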

Risk sensitive RL is another way of getting safety in RL.

This work is the first RL and constraints algorithm with constraint-satisfying exploration.

There’s been lots of work on developing principled approaches to test automotive control systems for years. But how do we test perception systems in light of self-driving cars?

Formal verification in ML requires **model** and **specification**. But this is
a really hard question! How do we model complex CNNs? What is a specification
for a CNN detector?

To analyze a CNN, one should focus on the domain of interest and realistic alterations.

They find the vanishing point and then realistically alter the images. Perhaps a multi-view approach might be useful for the synthesis of images. Can we fully train an autonomous vehicle on synthetic images?

Tony Jebara and Justin Basilico.

In 2006, there were 6m members in the US only. They had a task of predicting the rating of a movie. Now, they have 100m members in 190 countries. They don’t really use stars any more but use thumbs up and thumbs down. Their goal is to help members find content to watch and enjoy to maximize member satisfaction and retention.

They use a lot of algorithms: Bayesian Nonparametrics, Gaussian Processes, Bandits, etc. Basically anything under the sun that works.

They run A/B tests and they run it in online production.

Idea -> Offline Experiment -> Online A/B Test -> Full Deployment

Traditionally, they collect a lot of data over a period of time and learn a model of the data in the offline setting. Once they have the model, they bring it online and do A/B testing on it. And then roll-out. This is a standard machine learning approach but there’s a lot of regret.

As such, for many reasons, they are moving towards online learning and bandit learning. This allows them to have less regret, helps cold start models in the wild, maintain some exploration for nonstationarity, adapt reliably to changing and new data. They use bandits to select the best image to place on the front page for every show. But that’s not really enough, because each individual has different preferences. They use contextual bandits to get even better images.
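They didn't say which bandit algorithm they run, so here is the simplest one I know, epsilon-greedy, as a sketch of the non-contextual image selection problem. The click rates, epsilon and number of steps are all made up.

```python
import random

def epsilon_greedy(click_rates, steps, epsilon, seed=0):
    # Each arm is a candidate image; the reward is 1 if the member clicks.
    rng = random.Random(seed)
    pulls = [0] * len(click_rates)
    estimates = [0.0] * len(click_rates)
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: show a random image.
            arm = rng.randrange(len(click_rates))
        else:
            # Exploit: show the image with the best estimated click rate so far.
            arm = max(range(len(click_rates)), key=lambda a: estimates[a])
        reward = 1.0 if rng.random() < click_rates[arm] else 0.0
        pulls[arm] += 1
        # Incremental update of the running mean for this arm.
        estimates[arm] += (reward - estimates[arm]) / pulls[arm]
    return pulls, estimates

# Hypothetical true click rates for three candidate images of one show.
pulls, estimates = epsilon_greedy([0.05, 0.10, 0.30], steps=5000, epsilon=0.1)
print("times shown:", pulls)
print("estimated click rates:", [round(e, 3) for e in estimates])
```

Unlike the offline then A/B test pipeline, the bad images stop being shown as soon as the estimates separate, which is where the lower regret comes from. A contextual bandit would additionally condition the choice on features of the member.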

The model needs to train reliably offline and needs to predict reliably online. The reliability approach:

- Detection. How do you detect if there’s a problem?
- Response
- Prevention

Reliability in training:

- Learning must be repeatable.
- Automate retraining of models. Akin to continuous deployment in software engineering. How often depends on application; typically daily.
- Detect problems and fail fast to avoid using a bad model. There’s a series of metrics that they use and these fire off alarms when they happen.

Metrics:

- Absolute and relative performance to previous runs
- Offline metrics on hold-out set
- Data size for train and test
- Feature distributions
- Feature importance
- Large (unexpected) changes in output between models
- Model coverage (e.g. number of items it can predict)
- Model integrity
- Error counters
- etc.

Conclusion

- Consider online learning and bandits
- Build off best practices from software engineering
- Automate as much as possible
- Adjust your data to cover the conditions you want your model to be resilient to
- Detect problems and degrade gracefully

Koh Pang Wei is the author of this and is also the author of the ICML best paper. The normal attack would be to change the test data. But what if an attacker changes the training data? This is not unrealistic because in many cases, training data is from the outside world. If these attacks are successful, they subsequently affect all predictions.

It’s like an arms race. An attack is published, someone defends against it, etc. The space of attacks is so large and we can’t possibly empirically try every attack. Given a defense and dataset, can we bound the damage of an attacker? In their work, their defense focuses on discarding outliers outside some feasible set F.
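A minimal sketch of what "discard everything outside F" could look like. The specific F here is my own choice for illustration: a ball of fixed radius around the coordinate-wise median of the data. The points and the radius are made up, and the paper's feasible sets and its bound on the attacker's damage are not reproduced.

```python
import numpy as np

def filter_outliers(points, radius):
    # Assumed feasible set F: a ball of the given radius around the
    # coordinate-wise median. The median is used because a plain mean
    # would itself be dragged towards the poisoned points.
    center = np.median(points, axis=0)
    distances = np.linalg.norm(points - center, axis=1)
    keep = distances <= radius
    return points[keep], keep

clean = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
poison = np.array([[10.0, 10.0]])   # hypothetical point injected by an attacker
data = np.vstack([clean, poison])

kept, keep = filter_outliers(data, radius=3.0)
print(keep)   # the injected point falls outside F and is dropped
```

The catch is that an attacker who knows F can place poison just inside it, which is why bounding the worst-case damage for a given defense is the interesting question.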

With that, this is the entire **uncurated** rant of ICML 2017.