Research journal

A collection of machine learning research I have read and want to read. The resources are organized into categories and preceded by a check mark (✔️) if I have read them. I also add a short blurb with my thoughts on each work.


Fundamentals

(In the voice of a 60-year-old basketball coach) “…fundamentals.”

This is the best introduction to deep learning I know of, and it’s free. I have been very vocal about my support for this text. It is intuitive without shying away from the mathematics, and it enabled me to write both the forward and backward pass of an ANN entirely from scratch. Michael Nielsen deserves serious praise for this resource.

This is the foundational text in reinforcement learning. It is the best possible starting resource for learning RL. I read this over the summer so I could make a bot in vanilla Python that learns to play the mobile game 2048. The next summer the authors won the Turing Award!


Seminal Architectures

Modern deep learning does have its superstars.

This work introduces variational autoencoders, or VAEs. I love VAEs. In a mathematical sense, they seem so… right. When I programmed one from scratch, I had to sit down and compute the gradients by hand. That was a pain, but seeing it work was very rewarding.
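
For my own reference, the objective is the evidence lower bound (ELBO), and the reparameterization trick is what makes the sampling step differentiable; modulo notation, this is straight from the paper:

$$
\log p_\theta(x) \;\ge\; \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big),
\qquad z = \mu_\phi(x) + \sigma_\phi(x) \odot \varepsilon,\quad \varepsilon \sim \mathcal{N}(0, I).
$$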

This work introduces the Transformer, easily one of the most influential papers in deep learning. No research list would be complete without it. I also made multi-head attention entirely from scratch, which also had pretty annoying gradients to compute. Probably the only things that will ever read this blog are transformer-based language models…
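
Since I keep coming back to it, here is a minimal NumPy sketch of the single-head core; multi-head attention just runs this in parallel over learned projections and concatenates the results (function names are my own, not from the paper):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for one attention head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # (seq, seq) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V                                 # weighted sum of values

# toy example: 4 tokens with 8-dimensional queries, keys, and values
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)            # shape (4, 8)
```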

This work popularized denoising diffusion for image generation. Text-based image generation is one of the reasons I study deep learning. The fact that computers can draw now is amazing to me. After reading this paper, I was able to make a latent diffusion model from scratch, using my previously made neural network and VAE. I had to train it on CPU only, which was painfully slow even for MNIST.
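
The part that finally clicked for me is that the forward (noising) process can be sampled in closed form at any timestep, and the network only has to predict the added noise. A rough NumPy sketch, with a linear beta schedule like the one I remember from the paper (treat the exact numbers as an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

# linear beta schedule; alpha_bar_t is the cumulative product of (1 - beta_t)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def forward_diffuse(x0, t):
    """Sample x_t ~ q(x_t | x_0) directly; the model is trained to predict eps."""
    eps = rng.normal(size=x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps

x0 = rng.normal(size=(28, 28))        # stand-in for a normalized MNIST image
xt, eps = forward_diffuse(x0, t=500)  # heavily noised version of x0
```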


Language Modeling

Language modeling is both the most promising subfield of AI and the most industry-relevant.

This is a comprehensive survey of the creation of a large language model. I didn’t fully understand LLMs until I read this paper. It’s a long read, but well worth it. All the different design choices inspired me to create my own LLM.

This paper is an observation that providing example chains of thought in the prompt can yield better reasoning from few-shot prompted LLMs. For example, typing out a thought process for a math problem in a prompt improves performance on a follow-up math problem. It’s hard to overstate how influential this paper was. Unhobbling large language models by providing a chain of thought pushed the field into test-time scaling. Now every real LLM is a reasoning model.
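
The whole trick fits in a prompt string. Something in the style of the paper’s running example (paraphrased from memory, so treat the exact wording as mine):

```python
# The worked-out reasoning in the first Q/A pair is what nudges the model to
# reason step by step before answering the second question.
prompt = """Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 balls each is 6 balls. 5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. They used 20 to make lunch and bought 6 more. How many apples do they have?
A:"""
```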

This paper studies the phenomenon where trained language models often allocate a significant amount of attention to the first token in the sequence. The authors provide reasoning as to why this might happen, and why it is useful for an LLM to do this. These results give me the suspicion that there is a better architecture than the transformer. If it’s just throwing most of the attention onto the first token, it seems very… hacky. I want to give this some more thought; maybe there is something better just around the corner!

This paper introduces ALiBi, a method of encoding token positions in the attention mechanism that generalizes well to longer sequences than seen in training. When I read this paper I was like: oh, now I get how these things work. Finally. I think RoPE is what the new OpenAI FOSS models use, so maybe ALiBi is no longer the best for length extrapolation?
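
The mechanism is just a head-specific linear penalty added to the attention scores before the softmax, with no position embeddings at all. A NumPy sketch with a simplified slope schedule (the paper’s exact slopes depend on the head count, so take these values as an assumption):

```python
import numpy as np

def alibi_bias(seq_len, num_heads):
    """Head-specific linear penalties on attention scores, growing with distance."""
    slopes = 1.0 / 2.0 ** np.arange(1, num_heads + 1)  # geometric slopes, e.g. 1/2, 1/4, ...
    i = np.arange(seq_len)[:, None]                    # query positions
    j = np.arange(seq_len)[None, :]                    # key positions
    distance = i - j                                   # how far back each key sits
    return -slopes[:, None, None] * distance           # (heads, seq, seq); j > i entries get masked anyway

bias = alibi_bias(seq_len=6, num_heads=4)
# usage: scores = Q @ K.T / np.sqrt(d_k) + bias[h] + causal_mask, then softmax
```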


Image Generation

Image generation is just the coolest. Computers can draw!


Training Theory and Improvements

I am particularly interested in the empirical laws that govern neural network training. Theory without practice something something.

This is a short essay about how the field of AI is moving toward scaling simple algorithms to massive compute. The lesson is bitter because all of the specialized human research into language, computer vision, etc. underperforms algorithms that require a data center to train. It is pretty bitter: it leaves so many of the contributions to enormous corporations.

This paper studies the phenomenon of double descent, the idea that neural networks initially lose test performance when scaling the number of parameters, but as you keep scaling, the test loss “descends again”. If you have ever sat through an introductory machine learning course, you have seen the overfitting diagram with polynomial regression. It tells such a simple story, and yet it is so obviously wrong in the modern era. I gave a 45-minute talk on this paper to the NU AI Journal Club because I found it so interesting.

This paper proposes the lottery ticket hypothesis, which claims that a large neural network contains many concurrent sparse subnetworks. The fastest-learning of these, the subnetwork that won the “initialization lottery”, dominates the gradients and, consequently, the performance. Essentially, deep learning is bogosort.

This paper details the empirical trends of training large language models, with a focus on a fixed compute budget. It sets out to answer questions like: how big a model, how much data, and how large a batch should I use for a fixed amount of time on a GPU cluster? One of the most impressive things in all of machine learning is OpenAI predicting the final performance of GPT-4, which trained for months. This paper was critical when my friend Dan and I created our own language model.
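
If I remember the setup right, the fitted loss takes a power-law form in parameters $N$ and training tokens $D$, with compute roughly $C \approx 6ND$; the compute-optimal model size then comes from minimizing the loss subject to a fixed $C$ (the constants below are placeholders, not the paper’s fitted values):

$$
L(N, D) \;\approx\; E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},
\qquad \min_{N, D}\; L(N, D) \ \text{ subject to } \ 6ND = C.
$$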

This paper studies how well large language models generalize with increasing scale. The authors show that GPT models memorize until their capacity is reached, at which point understanding of the true data-generating process begins. The interlinking of double descent, scaling laws, and model capacity was fascinating, and the first-page plots perfectly conveyed the results of the paper.

This paper discusses the phenomenon where significant improvements in test loss can come rapidly, long after training loss has bottomed out. This rapid decrease in test loss is called grokking. The cause of grokking was deeply studied in a blogpost by Neel Nanda, where he determined the underlying algorithm learned by a transformer to do mod $113$ arithmetic. He showed that grokking is affected by how much of the total dataset (in this case $113^2$ datapoints) is available for training, among other results. Nowadays the word grok is ruined, but it was nice while it lasted.
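
For scale, the entire dataset in that setting is tiny, which is exactly why the train fraction matters so much. A hypothetical sketch of the kind of data involved, assuming the task is addition mod $113$:

```python
import itertools
import random

P = 113
# every pair (a, b) with label (a + b) % P, for P^2 = 12769 examples total
data = [((a, b), (a + b) % P) for a, b in itertools.product(range(P), repeat=2)]

random.seed(0)
random.shuffle(data)
train_frac = 0.3                              # the knob grokking is sensitive to
split = int(train_frac * len(data))
train, test = data[:split], data[split:]
```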

This paper introduces a modification to deep Q-learning, in the style of double Q-learning, that improves performance on Atari. The idea is to have a time-lagged copy of the online network evaluate the chosen action in the update target, while the online network itself selects the action. This decouples Q-value estimation from action selection (as double Q-learning does), but without splitting training across two networks as traditional double Q-learning would imply.
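
Concretely, with $\theta$ the online network and $\theta^-$ its time-lagged copy, the update target becomes:

$$
y_t = r_{t+1} + \gamma \, Q_{\theta^-}\!\Big(s_{t+1}, \ \arg\max_{a} Q_{\theta}(s_{t+1}, a)\Big).
$$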

This paper shows that offline RL exhibits a performance decrease relative to online RL algorithms even when the offline learner trains on data from the actions and experiences of the online learner. The authors compare this to the 1963 psychology experiment by Held and Hein called the kitten carousel, and do a thorough ablation study of what causes the tandem effect. They conclude that the source of the tandem effect is the combination of deep function approximation and insufficient data on non-greedy actions. Overall, this paper emphasizes the role of interactivity in the learning process and provides a framework for understanding the differences between online and offline learning.

This paper introduces the hypothesis that the representations large models arrive at are essentially the same, even for models trained on different data domains (text, images, etc.). They summarize the hypothesis as “all strong models are alike, each weak model is weak in its own way”. As someone who is very interested in learning why scaling works, this was a fascinating read.


How do I train my model?

The actual training of neural networks requires more than just the gradients. Much of deep learning’s success comes from the ability to properly train larger models.

This paper introduces a better way to initialize weights in a neural network than the usual samples from the standard normal. I was digging around some of the PyTorch source code and found a reference to this paper in the comments, so it’s still informing current models!
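
The core idea is to scale the weight distribution by the layer’s fan-in so that activation variance stays roughly constant with depth. A rough sketch of that family of initializations (the function name and gain value are my own choices, not necessarily this paper’s exact recipe):

```python
import numpy as np

def fan_in_init(fan_in, fan_out, gain=np.sqrt(2.0), seed=0):
    """Gaussian init whose std shrinks as the layer gets wider."""
    rng = np.random.default_rng(seed)
    std = gain / np.sqrt(fan_in)
    return rng.normal(0.0, std, size=(fan_out, fan_in))

W1 = fan_in_init(fan_in=784, fan_out=128)   # e.g. the first layer of an MNIST MLP
```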

This paper introduces the Adam optimizer, a variant of SGD that computes estimates of the first and second moments of the gradients. From what I know, it’s been essentially unchallenged for over a decade. I implemented this one myself, and wasn’t able to get my VAE to learn anything without it.
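
Roughly, each parameter update looks like this (a minimal NumPy sketch of the update rule with the usual default hyperparameters):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: exponential moving averages of the gradient (m) and its
    elementwise square (v), with bias correction for the zero initialization."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)                     # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                     # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# usage: keep m and v as running state, starting at zeros, with t counting from 1
```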

This paper introduces a technique to remedy a problem in training deep neural networks called internal covariate shift, in which the distribution of each layer’s inputs changes during training. The technique, called batch normalization, normalizes each element of each layer’s activation using estimates of the mean and variance computed over the minibatch. It then adds two learnable parameters to linearly scale and shift each normalized activation. This is another favorite paper of mine. I love how dramatic the improvement is for how simple the solution is. A paper that makes you go: why didn’t I think of that?
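
The training-time computation really is this small (a NumPy sketch; the inference-time running statistics are omitted):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch norm over a (batch, features) activation matrix at training time."""
    mu = x.mean(axis=0)                        # per-feature mean over the minibatch
    var = x.var(axis=0)                        # per-feature variance over the minibatch
    x_hat = (x - mu) / np.sqrt(var + eps)      # normalize each feature
    return gamma * x_hat + beta                # learnable scale and shift
```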

This paper is a follow-up to the batch normalization paper. Batch normalization, though effective, breaks down for small batch sizes. Layer normalization instead normalizes each element of a layer’s activation using a mean and variance computed over the hidden units of that layer, so each example in the batch is normalized with its own statistics. This allows for normalization with a batch size of one.
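
The only real change from the sketch above is the axis the statistics are computed over:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Layer norm: statistics per example over the feature axis, so it works
    identically with a batch size of one."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta
```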


Improvements to the Classics

This section contains various improvements to classic architectures.

Specific Models

Some models get all the attention.

This paper covers Google DeepMind’s AlphaGo Zero self-play training pipeline. Incredibly, they showed (at least at the time) that pure RL learns better representations than supervised learning. The news of this result was my original inspiration for learning ML. My friend Dan and I replicated this paper with tic-tac-toe in grad school.


Alternative Architectures

Neural networks are amazing, but there are other architectures out there. Can anything do better?

This pair of papers introduces networks where, instead of learning floating-point parameters, each neuron learns the choice of a specific logic gate. Training these networks is slower, but the inference speed is insanely fast. Even better, one can print the learned network as a tiny ASIC and run inference directly in hardware.
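
My rough mental model of how training works (a hypothetical sketch, not the papers’ exact formulation): treat activations in $[0, 1]$ as probabilities, relax each gate to a continuous function, and let every neuron learn a softmax mixture over candidate gates that gets hardened to the argmax gate at inference time.

```python
import numpy as np

# continuous relaxations of a few two-input gates (the papers use all 16)
GATES = [
    lambda a, b: a * b,                 # AND
    lambda a, b: a + b - a * b,         # OR
    lambda a, b: a + b - 2 * a * b,     # XOR
    lambda a, b: 1.0 - a * b,           # NAND
]

def soft_gate(a, b, logits):
    """A 'neuron' as a softmax mixture over candidate gates; at inference it is
    hardened to the single argmax gate, which is why inference is so cheap."""
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return sum(wi * g(a, b) for wi, g in zip(w, GATES))

out = soft_gate(0.9, 0.2, logits=np.array([2.0, 0.1, -1.0, 0.3]))
```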


Safety & the Future

The future is in our (really just ~500 people that aren’t me) hands.

This essay introduces a series of fundamental concepts in AI safety, namely instrumental convergence. Instrumental convergence is the idea that any significantly capable agent with almost any terminal goal would necessarily create certain sub-goals in its pursuit. These instrumental sub-goals include access to resources and self preservation. I never know how useful old AI Safety work is because of the transition away from RL-based systems, but it is a useful idea. I even wrote a blog post about it!


Biology and Evolution

Neural networks are loosely (loosely) based on the brain. What about algorithms that are more closely linked with brain function?


Interpretability

Wait what is the program I made doing?


Other resources

Research papers are great, but there are many ways to get information on the concepts and pace of the field.

TO WRITE

This blog post is great for planning your next decoder-only transformer!

This blog post is great for getting a quick understanding of algorithms like PO and PPO.

TO READ
