Here's where I keep a list of papers I have read.

This list is curated by Lexington Whalen, beginning in the first year of his PhD and continuing to the end. As he is me, I hope he keeps going!

I typically use this to organize papers I found interesting. Please feel free to do whatever you want with it. Note that this is not every single paper I have ever read, just a collection of the ones I remembered to write down.

So far, we have read 133 papers. Let's keep it up!

Title Author Year Topic Publication Venue Description Link
Improving Training Efficiency of Diffusion Models via Multi-Stage Framework and Tailored Multi-Decoder Architecture Huijie Zhang et al 2024 diffusion, multi-stage CVPR This paper proposes a multi-stage framework for diffusion models that uses a shared encoder and separate decoders for different timestep intervals, along with an optimal denoiser-based timestep clustering method, to improve training and sampling efficiency while maintaining or enhancing image generation quality. Link
Temporal Dynamic Quantization for Diffusion Models Junhyuk So et al 2023 diffusion, quantization NeurIPS Temporal Dynamic Quantization (TDQ) addresses the challenge of quantizing diffusion models by dynamically adjusting quantization parameters based on the denoising time step. TDQ employs a trainable module consisting of frequency encoding, a multi-layer perceptron (MLP), and a SoftPlus activation to predict optimal quantization intervals for each time step. This module maps the temporal information to appropriate quantization parameters, allowing the method to adapt to the varying activation distributions across different stages of the diffusion process. By pre-computing these quantization intervals, TDQ avoids the runtime overhead associated with traditional dynamic quantization methods while still providing the necessary flexibility to handle the temporal dynamics of diffusion models. Link
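As a concrete illustration of the TDQ idea, here is a minimal PyTorch sketch of a timestep-conditioned interval predictor. The encoding size, hidden width, and frequency choice are my own assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class TDQIntervalPredictor(nn.Module):
    """Predicts a per-timestep quantization interval (step size) from t.

    Sketch of the TDQ idea: frequency-encode the timestep, pass it through
    a small MLP, and force the output positive with SoftPlus. Dimensions
    and the encoding are illustrative assumptions."""
    def __init__(self, num_freqs: int = 8, hidden: int = 64):
        super().__init__()
        # Geometrically spaced frequencies for the sinusoidal encoding.
        self.register_buffer("freqs", 2.0 ** torch.arange(num_freqs))
        self.mlp = nn.Sequential(
            nn.Linear(2 * num_freqs, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
            nn.Softplus(),  # quantization interval must be positive
        )

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        # t: (batch,) diffusion timesteps, treated as floats
        angles = t.float().unsqueeze(-1) * self.freqs        # (batch, num_freqs)
        enc = torch.cat([angles.sin(), angles.cos()], dim=-1)
        return self.mlp(enc)                                  # (batch, 1) interval per timestep

# The per-timestep intervals can be precomputed once after training,
# so there is no extra cost at inference time.
predictor = TDQIntervalPredictor()
lut = predictor(torch.arange(1000)).detach()  # lookup table of per-step intervals
```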
Learning Efficient Convolutional Networks through Network Slimming Zhuang Liu et al 2017 pruning, importance CVPR This paper introduces *network slimming*, a method to reduce the size, memory footprint, and computation of CNNs by enforcing channel-level sparsity without sacrificing accuracy. It works by identifying and pruning insignificant channels during training, leveraging the γ scaling factors in Batch Normalization (BN) layers to effectively determine channel importance. The approach introduces minimal training overhead and is compatible with modern CNN architectures, eliminating the need for specialized hardware or software. Using the BN layer’s built-in scaling properties makes this pruning efficient, avoiding redundant scaling layers or issues that arise from linear transformations in convolution layers. Link
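A toy sketch of the channel-selection step based on BN γ magnitudes. The global threshold and the 70% prune ratio are illustrative assumptions, not the paper's settings:

```python
import torch
import torch.nn as nn

def select_channels_by_bn_gamma(model: nn.Module, prune_ratio: float = 0.7):
    """Rank channels by the magnitude of their BN scale (gamma) and return a
    per-layer keep mask. In network slimming, gamma is additionally pushed
    toward zero during training with an L1 penalty before this step."""
    gammas = torch.cat([m.weight.data.abs().flatten()
                        for m in model.modules() if isinstance(m, nn.BatchNorm2d)])
    # Global threshold: prune the smallest `prune_ratio` fraction of channels.
    threshold = torch.quantile(gammas, prune_ratio)
    masks = {}
    for name, m in model.named_modules():
        if isinstance(m, nn.BatchNorm2d):
            masks[name] = m.weight.data.abs() > threshold  # True = keep channel
    return masks

# The channel-sparsity term added to the training loss would look like:
# loss = task_loss + lam * sum(m.weight.abs().sum()
#                              for m in model.modules()
#                              if isinstance(m, nn.BatchNorm2d))
```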
Q-Diffusion: Quantizing Diffusion Models Xiuyu Li et al 2023 diffusion, sampling ICCV This paper tackles the inefficiencies of diffusion models, such as slow inference and high computational cost, by proposing a post-training quantization (PTQ) method designed specifically for their multi-timestep process. The key innovation includes a *time step-aware calibration data sampling* approach, which uniformly samples inputs across multiple time steps to better reflect real inference data, addressing quantization errors and varying activation distributions without the need for additional data. Additionally, the paper introduces *shortcut-splitting quantization* to handle the bimodal activation distributions caused by the concatenation of deep and shallow feature channels in shortcuts, quantizing them separately before concatenation for improved accuracy with minimal extra resources. Link
Mixture of Efficient Diffusion Experts Through Automatic Interval and Sub-Network Selection Alireza Ganjdanesh et al 2024 diffusion, sampling Arxiv This paper reduces the cost of sampling by pruning a pretrained diffusion model into a mixture of experts (MoE), one expert per time interval, with a routing agent that predicts the architecture needed to generate each expert. Link
A Closer Look at Time Steps is Worthy of Triple Speed-Up for Diffusion Model Training Kai Wang et al 2024 diffusion, sampling Arxiv This paper introduces SpeeD, a novel approach for accelerating the training of diffusion models without compromising performance. The authors analyze the diffusion process and identify three distinct areas: acceleration, deceleration, and convergence, each with different characteristics and importance for model training. Based on these insights, SpeeD implements two key components: asymmetric sampling, which reduces the sampling of less informative time steps in the convergence area, and change-aware weighting, which gives more importance to the rapidly changing areas between acceleration and deceleration. The authors' key insight is that not all time steps in the diffusion process are equally valuable for training, with the convergence area providing limited benefits despite occupying a large proportion of time steps, while the rapidly changing area between acceleration and deceleration is crucial but often undersampled. To address this, SpeeD introduces an asymmetric sampling strategy using a two-step probability function: $P(t) = \begin{cases} \frac{k}{T + \tau(k-1)}, & 0 < t \leq \tau \\ \frac{1}{T + \tau(k-1)}, & \tau < t \leq T \end{cases}$, where τ is a carefully selected threshold marking the beginning of the convergence area, k is a suppression intensity factor, T is the total number of time steps, and t is the current time step. This function increases sampling probability before τ and suppresses it after. Additionally, SpeeD employs a change-aware weighting scheme based on the gradient of the process increment's variance, assigning higher weights to time steps with faster changes. By combining these strategies, SpeeD aims to focus computational resources on the most informative parts of the diffusion process, potentially leading to significant speedups in training time without sacrificing model quality. Link
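A small sketch of the asymmetric sampling distribution above; τ and k are placeholder values, not the paper's tuned settings:

```python
import numpy as np

def speed_sampling_probs(T: int = 1000, tau: int = 700, k: float = 2.0) -> np.ndarray:
    """Two-step sampling distribution over timesteps 1..T from the SpeeD formula:
    timesteps up to tau get probability k / (T + tau*(k-1)), later timesteps get
    1 / (T + tau*(k-1)). tau and k here are illustrative placeholder values."""
    t = np.arange(1, T + 1)
    denom = T + tau * (k - 1)
    return np.where(t <= tau, k / denom, 1.0 / denom)

p = speed_sampling_probs()
assert abs(p.sum() - 1.0) < 1e-9                                  # probabilities sum to one
timesteps = np.random.choice(np.arange(1, 1001), size=64, p=p)    # a training batch of timesteps
```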
HyperGAN: A Generative Model for Diverse, Performant Neural Networks Neale Ratzlaff et al 2019 gan, ensemble ICML This paper introduces HyperGAN, a novel generative model designed to learn a distribution of neural network parameters, addressing the issue of overconfidence in standard neural networks when faced with out-of-distribution data. Unlike traditional approaches, HyperGAN doesn't require restrictive prior assumptions and can rapidly generate large, diverse ensembles of neural networks. The model employs a unique "mixer" component that projects prior samples into a correlated latent space, from which layer-specific generators create weights for a deep neural network. Experimental results show that HyperGAN can achieve competitive performance on datasets like MNIST and CIFAR-10 while providing improved uncertainty estimates for out-of-distribution and adversarial data compared to standard ensembles. NOTE: There has actually been a diffusion variant of this idea: https://arxiv.org/pdf/2402.13144 Link
Diffusion Models Already Have a Semantic Latent Space Mingi Kwon et al 2023 diffusion, latent space ICLR This paper introduces Asymmetric Reverse Process (Asyrp), a method that discovers a semantic latent space (h-space) in pretrained diffusion models, enabling controlled image manipulation with desirable properties such as homogeneity, linearity, and consistency across timesteps, while also proposing a principled design for versatile editing and quality enhancement in the generative process. Asyrp modifies only the P_{t} term of the reverse process while preserving the D_{t} term. This makes sense because it a) breaks the destructive interference seen in previous methods, b) allows for controlled modification of the generation process towards target attributes, and c) maintains the overall structure and quality of the diffusion process. Link
One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale Fan Bao et al 2023 diffusion, multi-modal ICML The authors present a method of sampling from joint and conditional distributions using a small modification to diffusion models. UniDiffuser handles multiple modalities (such as images and text) within a single diffusion model. In general, here is what they do: 1. Perturb data in all modalities: For a given data point (x_0, y_0), where x_0 is an image and y_0 is text, UniDiffuser adds noise to both simultaneously. The noisy versions are represented as x_{t_x} and y_{t_y}, where t_x and t_y are the respective timesteps. 2. Use individual timesteps for different modalities: Instead of using a single timestep t for both modalities, UniDiffuser uses separate timesteps t_x and t_y. This allows for more flexibility in handling the different characteristics of each modality. 3. Predict noise for all modalities simultaneously: UniDiffuser uses a joint noise prediction network \epsilon_{\theta}(x_{t_x}, y_{t_y}, t_x, t_y) that takes in the noisy versions of both modalities and their respective timesteps. The network then outputs predicted noise for both modalities in one forward pass. Link
Diffusion Models as a Representation Learner Xingyi Yang et al 2023 diffusion, representation learner ICCV This paper studies pretrained diffusion models as representation learners: the intermediate features a diffusion model learns for generation are distilled into a student network for recognition tasks (RepFusion), with the distillation timestep chosen dynamically. Shows that knowledge learned for generation transfers to discriminative tasks. Link
Masked Diffusion Transformer is a Strong Image Synthesizer Shanghua Gao et al 2023 diffusion, masking, transformer ICCV This paper (smartly!) notices that one of the major reasons for long training and poor results of diffusion models is the lack of fast learning of relationships. For instance, they remark on the learning of one eye of a dog before both eyes. They propose to mask the input image in the latent space and learn how to predict the masks, and then diffuse these masks. Brilliant! Link
Generative Modeling by Estimating Gradients of the Data Distribution Yang Song et al 2019 diffusion, score matching NeurIPS This paper introduces Noise Conditional Score Networks (NCSNs), a novel approach to generative modeling that learns to estimate the score function of a data distribution at multiple noise levels. NCSNs are trained using score matching, avoiding the need to compute normalizing constants, and generate samples using annealed Langevin dynamics. The method addresses challenges in modeling complex, high-dimensional data distributions, particularly for data lying on or near low-dimensional manifolds. Link
LAPTOP-Diff: Layer Pruning and Normalized Distillation for Compressing Diffusion Models Dingkun Zhang et al 2024 diffusion, pruning Arxiv This paper proposes layer pruning and normalized distillation for pruning diffusion models. They use a surrogate function and show that it implies a property called "additivity", where the output distortion caused by many perturbations approximately equals the sum of the output distortions caused by each individual perturbation. They then show that the pruning problem can be formulated as a 0-1 Knapsack problem. Next, they analyze which objective matters for retraining and find an imbalance in the feature distillation used in previous retraining phases: the L2 norms of feature maps at the end of different stages and the values of the different feature loss terms vary significantly. For instance, the highest loss term is ~10k times greater than the lowest one throughout the distillation process and produces about 1k times larger gradients, which dilutes the gradients of the numerically insignificant feature loss terms. So, they opt to normalize the feature loss. Link
Classifier-Free Diffusion Guidance Jonathan Ho et al 2022 diffusion, guidance NeurIPS This paper introduces classifier-free guidance, a novel technique for improving sample quality in conditional diffusion models without using a separate classifier. Unlike traditional classifier guidance, which relies on gradients from an additional classifier model, classifier-free guidance achieves similar results by combining score estimates from jointly trained conditional and unconditional diffusion models. The method involves training a single neural network that can produce both conditional and unconditional score estimates, and then using a weighted combination of these estimates during the sampling process. This approach simplifies the training pipeline, avoids potential issues associated with training classifiers on noisy data, and eliminates the need for adversarial attacks on classifiers during sampling. The authors demonstrate that classifier-free guidance can achieve a similar trade-off between Fréchet Inception Distance (FID) and Inception Score (IS) as classifier guidance, effectively boosting sample quality while reducing diversity. The key difference is that classifier-free guidance operates purely within the generative model framework, without relying on external classifier gradients. This method provides an intuitive explanation for how guidance works: it increases conditional likelihood while decreasing unconditional likelihood, pushing generated samples towards more characteristic features of the desired condition. Link
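The guided score combination itself is a one-liner; a minimal sketch follows (the guidance weight w = 3.0 in the usage comment is just an example value):

```python
import torch

def cfg_epsilon(eps_cond: torch.Tensor, eps_uncond: torch.Tensor, w: float) -> torch.Tensor:
    """Classifier-free guidance: blend conditional and unconditional score estimates.
    With w = 0 this is plain conditional sampling; larger w pushes samples toward
    the condition at the cost of diversity."""
    return (1.0 + w) * eps_cond - w * eps_uncond

# At each denoising step the same network is queried twice: once with the
# condition c and once with a null condition (e.g. an empty embedding):
# eps = cfg_epsilon(model(x_t, t, c), model(x_t, t, null_c), w=3.0)
```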
LD-Pruner: Efficient Pruning of Latent Diffusion Models using Task-Agnostic Insights Thibault Castells et al 2024 pruning, diffusion, ldm CVPR This paper presents LD-Pruner. The main interesting part is how they frame the pruning problem. Basically, they define an "operator" (any fundamental building block of a net, like convolutional layers, activation functions, transformer blocks) and try to either 1) remove it or 2) replace it with a less demanding operation. Since they operate on the latent space, this work can be applied to any generation task that uses diffusion (task-agnostic). It is interesting to note their limitations: the approach does not extend to pruning the decoder, and it does not consider dependencies between operators (which I think is a big deal). Finally, their score function seems a bit arbitrary (maybe this could be learned?). Link
RoFormer: Enhanced Transformer with Rotary Position Embedding Jianlin Su et al 2021 attention, positional embedding Arxiv This paper introduces Rotary Position Embedding (RoPE), a method for integrating positional information into transformer models by using a rotation matrix to encode absolute positions and incorporating relative position dependencies. Link
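A minimal sketch of one common way to apply the rotation to a (seq_len, dim) tensor of queries or keys; the interleaved channel pairing follows the paper's formulation, and the sizes are illustrative:

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotary position embedding on a (seq_len, dim) tensor with even dim.
    Each channel pair (2i, 2i+1) is rotated by angle m * theta_i, where m is
    the token position and theta_i = base ** (-2i / dim)."""
    seq_len, dim = x.shape
    theta = base ** (-torch.arange(0, dim, 2).float() / dim)           # (dim/2,)
    angles = torch.arange(seq_len).float()[:, None] * theta[None, :]   # (seq, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Applied to queries and keys (not values) before the dot-product attention,
# so that q_m . k_n depends only on the relative offset m - n.
q = apply_rope(torch.randn(16, 64))
```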
GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models Alex Nichol et al 2022 text-conditioned diffusion, inpainting Arxiv This paper explores text-conditional image synthesis using diffusion models, comparing CLIP guidance and classifier-free guidance, and finds that classifier-free guidance produces more photorealistic and caption-aligned images. Link
LLM Inference Unveiled: Survey and Roofline Model Insights Zhihang Yuan et al 2024 llms, survey Arxiv This paper surveys recent advancements in LLM inference, like speculative decoding and operator fusion. The authors also analyze the findings using the Roofline model, making this likely the first survey to do so for LLM inference. Good for checking out other papers that have recently been published. Link
An Empirical Study of Mamba-based Language Models Roger Waleffe et al 2024 mamba, llms, transformer Arxiv This paper compares Mamba-based, Transformer-based, and hybrid language models in a controlled setting where model and dataset sizes are larger than in past comparisons (8B params / 3.5T tokens). They find that Mamba and Mamba-2 lag behind Transformer models on copying and in-context learning tasks. They then see that a hybrid architecture of 43% Mamba, 7% self-attention, and 50% MLP layers performs better than all others. Link
Diffusion Models Beat GANs on Image Synthesis Prafulla Dhariwal et al 2021 diffusion, gan Arxiv This work demonstrates that diffusion models surpass the current state-of-the-art generative models in image quality, achieved through architecture improvements and classifier guidance, which balances diversity and fidelity. The model attains FID scores of 2.97 on ImageNet 128×128 and 4.59 on ImageNet 256×256, matching BigGAN-deep with as few as 25 forward passes while maintaining better distribution coverage. Additionally, combining classifier guidance with upsampling diffusion models further enhances FID scores to 3.94 on ImageNet 256×256 and 3.85 on ImageNet 512×512. Link
Progressive Distillation for Fast Sampling of Diffusion Models Tim Salimans et al 2022 diffusion, distillation, sampling ICLR Diffusion models excel in generative modeling, surpassing GANs in perceptual quality and autoregressive models in density estimation, but they suffer from slow sampling times. This paper introduces two key contributions: new parameterizations that improve stability with fewer sampling steps and a distillation method that progressively reduces the number of required steps by half each time. Applied to benchmarks like CIFAR-10 and ImageNet, the approach distills models from 8192 steps down to as few as 4 steps, maintaining high image quality while offering a more efficient solution for both training and inference. Link
On Distillation of Guided Diffusion Models Chenlin Meng et al 2023 diffusion, classifier-free guidance Arxiv Classifier-free guided diffusion models are effective for high-resolution image generation but are computationally expensive during inference due to the need to evaluate both conditional and unconditional models many times. This paper proposes a method to distill these models into faster ones by learning a single model that approximates the combined outputs, then progressively reducing the number of sampling steps. The approach significantly accelerates inference, generating images with comparable quality to the original model using as few as 1-4 denoising steps, achieving up to 256× speedup on datasets like ImageNet and LAION. Link
Diffusion Probabilistic Models Made Slim Xingyi Yang et al 2022 diffusion, dpms, spectral diffusion Arxiv Diffusion Probabilistic Models (DPMs) produce impressive visual results but suffer from high computational costs, limiting their use on resource-limited platforms. This paper introduces Spectral Diffusion (SD), a lightweight model designed to address DPMs' bias against high-frequency generation, which smaller networks struggle to capture. SD incorporates wavelet gating for frequency dynamics and spectrum-aware distillation to enhance high-frequency recovery, achieving 8-18× computational efficiency while maintaining competitive image fidelity. Link
Structural Pruning for Diffusion Models Gongfan Fang et al 2023 diffusion, pruning NeurIPS Generative modeling has advanced significantly with Diffusion Probabilistic Models (DPMs), but these models often require substantial computational resources. To address this, Diff-Pruning is introduced as a compression method that reduces the computational load by pruning unnecessary diffusion steps, using a Taylor expansion to identify key weights without extensive re-training. Empirical results show that Diff-Pruning can cut FLOPs by around 50%, while maintaining consistent generative performance at only 10-20% of the original training cost. Link
Diffusion Models: A Comprehensive Survey of Methods and Applications Ling Yang et al 2024 diffusion, survey ACM Diffusion models are a powerful class of deep generative models known for their success in tasks like image synthesis, video generation, and molecule design. This survey categorizes diffusion model research into efficient sampling, improved likelihood estimation, and handling specialized data structures, while also discussing the potential for combining them with other generative models. The review highlights their broad applications across fields such as computer vision, NLP, temporal data modeling, and interdisciplinary sciences, suggesting areas for further exploration. Link
GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium Martin Heusel et al 2017 gan, equilibrium, fid, is NeurIPS This paper introduces a two time-scale update rule (TTUR) for GANs and proves that it makes GANs converge to a local Nash equilibrium. More widely cited is the FID score introduced here. FID improves on IS by comparing the distributions of real and generated images directly. This is done by using the Inception model to extract features from images and then assuming these features follow a multidimensional Gaussian distribution. FID measures the difference between the Gaussians (representing the real and generated images) using the Fréchet distance, which effectively captures differences in the mean and covariance (the first two moments) of the distributions. FID makes sense as it directly compares the distributions of real and generated images via the features extracted by Inception. These features are assumed to follow a multidimensional Gaussian, which simplifies the comparison. The Gaussian is chosen as it is the maximum entropy distribution for a given mean and covariance (proof: https://medium.com/mathematical-musings/how-gaussian-distribution-maximizes-entropy-the-proof-7f7dcb2caf4d) -- maximum entropy is important because the Gaussian then makes the fewest additional assumptions about the data, keeping the model as non-committal as possible given the available information. Then, we calculate the statistics of the real and generated image features, i.e. their means and covariances. Finally, we compute the FID score using the Fréchet (AKA Wasserstein-2) distance. Link
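A compact sketch of the Fréchet distance computation described above (standard formula; the Inception feature extraction is assumed to happen elsewhere):

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Fréchet (Wasserstein-2) distance between Gaussians N(mu1, sigma1) and N(mu2, sigma2):
        ||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 * (sigma1 @ sigma2)^{1/2})
    For FID, mu/sigma are the mean and covariance of Inception features computed
    over real and generated images respectively."""
    diff = mu1 - mu2
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    covmean = covmean.real  # discard tiny imaginary parts from numerical error
    return diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean)

# feats_real, feats_fake: (N, 2048) Inception pool features
# fid = frechet_distance(feats_real.mean(0), np.cov(feats_real, rowvar=False),
#                        feats_fake.mean(0), np.cov(feats_fake, rowvar=False))
```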
Scalable Diffusion Models with Transformers William Peebles et al 2023 diffusion,ddpm, dit CVPR The authors explore using transformers in the latent space, rather than U-Nets. They find that their methods can lead to lower FID scores compared to prior SOTA. In this paper, their image generation pipeline is roughly: 1) Input high resolution image x 2) Encoder z = E(x), where E is a pre-trained frozen VAE encoder, and z is the latent representation 3) The DiT model operates on z 4) New latent representation z’ is sampled from the diffusion model 5) We then decode the z’ using the pre-trained frozen VAE decoder D, and x’ is now the generated high resolution image. Link
Max-Affine Spline Insights Into Deep Network Pruning Haoran You et al 2022 early-bird, lottery-hypothesis, pruning, low-precision TMLR The authors make connections between spline theory (AKA considering DNNs as continuous piecewise affine mappings) and pruning. Basically, they say that pruning removes redundant decision boundaries in the layers that are pruned, and that we can compare the decision boundaries of unpruned networks to their pruned counterparts to show this (they have some nice visualizations). They also note that the final decision boundary does not always depend on existing subdivision lines. Finally, they demonstrate another way of finding EB tickets using this spline formulation. Link
Drawing Early-Bird Tickets: Towards More Efficient Training of Deep Networks Haoran You et al 2020 early-bird, lottery-hypothesis, pruning, low-precision ICLR The authors show that there exist early-bird (EB) tickets: small but critical subnetworks of dense, randomly initialized networks that can be found using low-cost training schemes (low precision, early stopping). They also design a practical low-compute method for finding these, based on mask distance. Basically, for each pruning iteration a binary mask is created. This mask represents which parts of the network are kept (the "ticket", or pruned subnet) and which parts are removed. They consider the scaling factor r in BN layers as an indicator of significance. This r is learned during training and is used to scale normalized activations; its magnitude indicates how important the channel is to the network's performance. After deciding which channels to prune based on r, the binary mask is created: if a channel is kept (not pruned), it is marked as 1 in the mask, else 0. For any two subnets, they then compute the "mask distance" (AKA Hamming distance) between the two ticket masks. They measure the mask distance between consecutive epochs and draw EB tickets when that distance falls below some threshold. Link
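A small sketch of the mask-distance stopping rule; the ε threshold and window size are illustrative, not the paper's exact values:

```python
import torch

def mask_distance(mask_a: torch.Tensor, mask_b: torch.Tensor) -> float:
    """Normalized Hamming distance between two binary pruning masks
    (1 = channel kept, 0 = channel pruned)."""
    return (mask_a != mask_b).float().mean().item()

def found_early_bird(recent_masks, eps: float = 0.1) -> bool:
    """Declare an early-bird ticket once the masks from the last few epochs
    have stopped changing. Window size and eps are illustrative."""
    return all(mask_distance(recent_masks[-1], m) < eps for m in recent_masks[:-1])

# masks: list of per-epoch BN-gamma-based keep masks (as in network slimming)
# if found_early_bird(masks[-5:]): draw the ticket, prune, and retrain
```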
Learning both Weights and Connections for Efficient Neural Networks Song Han et al 2015 pruning, compression, regularization NeurIPS The authors show a method of pruning neural networks in three steps: 1) train the network to learn which connections are important, 2) prune unimportant connections, 3) retrain and fine-tune. When training to learn which connections are important, they do not focus on learning the final weight values, but rather on the importance of the connections. They don't explicitly mention how this is done, but one could look at the Hessian of the loss or the magnitude of the weights; I'd imagine you could do this within only a few training iterations. In their "Regularization" section, it is interesting to note that L1 regularization (which penalizes non-zero params, resulting in more params near zero) gave better accuracy after pruning but before retraining. However, these remaining connections are not as good as the ones kept with L2. The authors also present a discussion of what dropout rate to use. Link
Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference Jiaming Tang et al 2024 KV cache, sparsity, LLM ICML Long-context LLM inference is slow, and the speed decreases significantly as sequence lengths grow. This is mainly due to needing to load a big KV cache during self-attention. Prior works have used methods to evict tokens in the attention maps to promote sparsity, but the Han lab (smartly!) found that the criticality of tokens strongly correlates with the current query token. Thus, they employ a KV cache method that retains the full KV cache (since past evicted tokens may be needed to handle future queries) while selecting only the top-K tokens most relevant to a particular query. This allows for speedups in self-attention at low cost to accuracy. Link
BigNAS: Scaling Up Neural Architecture Search with Big Single-Stage Models Jiahui Yu et al 2020 NAS, one-shot Arxiv Most NAS frameworks train a one-shot model to rank the quality of different child architectures. However, these rankings often differ from reality, so frameworks typically fine-tune architectures after finding them. BigNAS proposes that this fine-tuning / post-processing is not necessary. They find some interesting points, such as that "big models converge faster while small child models converge slower". Thus, at a training step t when the performance of a big model peaks, the small child models are not yet fully trained, and at a step t' where the child models are fully trained, the big model is overfitting. To handle this, they use an exponentially decaying learning rate schedule with a constant ending: the learning rate is held constant at the end of training once it reaches 5% of the initial learning rate. Another point they bring up is a "coarse-to-fine" strategy where one first finds a rough sketch of promising network candidates and then samples multiple finer-grained variations around the sketch of interest. Link
Meta-Learning of Neural Architectures for Few-Shot Learning Thomas Elsken et al 2021 NAS, meta-learning, few-shot, fsl Arxiv The authors propose MetaNAS, the first method that fully integrates NAS with gradient-based meta-learning. Basically, they combine gradient-based NAS methods like DARTS with gradient-based meta-learning, meta-learning the architecture itself along with its weights. Their goal is thus: meta-learn an architecture \alpha_{meta} with corresponding meta-learned weights w_{meta}. When given a new task \mathcal{T}_{i}, both \alpha_{meta} and w_{meta} adapt quickly to \mathcal{T}_{i} based on a few samples. One interesting technique they use is adding a temperature term that is annealed to 0 over the course of task training; this helps with the sparsity of the mixture weights over operations when using the DARTS search. Link
MetAdapt: Meta-Learned Task-Adaptive Architecture for Few-Shot Classification Sivan Doveh et al 2020 NAS, meta-learning, few-shot, fsl Arxiv The authors propose a method using a DARTS-like search for FSL architectures. "Our goal is to learn a neural network where connections are controllable and adapt to the few-shot task with novel categories... However, unlike DARTS, our goal is not to learn a one time architecture to be used for all tasks... we need to make our architecture task adaptive so it would be able to quickly rewire for each new target task.". Basically, they design a thing called a MetAdapt Controller that changes the connection in the main network according to some given task. Link
Distilling the Knowledge in a Neural Network Geoffrey Hinton et al 2015 distillation, ensemble, MoE Arxiv The first proposal of knowledge distillation. The main interesting point I found was that they raise the temperature of the softmax to produce softer targets. This allows for understanding which 2's look like 3's (in an MNIST example). Basically, this adds a sort of regularization, since more information can be carried in these softer targets compared to a single 0 or 1. They also propose the idea of having an ensemble of models and then learning a smaller distilled model. The biological example of a clumsy larva that then becomes a more specialized bug was good. Link
HyperTuning: Toward Adapting Large Language Models without Back-propagation Jason Phang et al 2023 hypernetworks, adaptation, tuning, LoRA, LLMs ICML The authors show that we can use a hypernetwork for model adaptation to generate task-specific parameters. They try two approaches: generating prefixes and generating LoRA parameters for a frozen T5 model using few-shot examples. They also note the importance of hyperpretraining, i.e., an additional stage to adapt the hypernet to generate parameters for the downstream model, and propose a scheme for this. NOTE! "We also observe a consistent trend where HyperT5-Prefix outperforms HyperT5-LoRA. We speculate that it is easier for hypermodels to learn to generate soft prefixes as compared to LoRA weights..." Link
Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning Armen Aghajanyan et al 2020 fine-tuning, intrinsic dimension, lora Arxiv Large models with billions of parameters can be fine-tuned using only a few hundred examples. Why is this? Furthermore, large models often allow for significant sparsification, which implies that there is much redundancy. This paper targets both of these ideas, by showing that many common models have an "intrinsic dimension" much less than the full parameterization. Link
LoRA: Low-Rank Adaptation of Large Language Models Edward Hu et al 2021 low rank adaptation, lora, llm, fine-tuning Arxiv Fine-tuning large models is expensive because we update all of the original parameters. Taking inspiration from Aghajanyan et al., 2020 (pre-trained language models have a low "intrinsic dimension"), the LoRA authors hypothesized that the weight updates would also have low intrinsic rank. Thus, they decompose Delta W = BA, where B and A are low-rank and trainable. They initialize A with a Gaussian and B with zeros, so Delta W = BA is zero initially. They then optimize and find this method to be more efficient in terms of both time and space. Link
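A minimal sketch of a LoRA-wrapped linear layer following the initialization described above; the rank r = 8 and scaling alpha = 16 are illustrative defaults, not prescribed values:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update W + (alpha/r) * B @ A.
    A is Gaussian-initialized, B starts at zero so the update is zero at init."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                       # original weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

layer = LoRALinear(nn.Linear(768, 768), r=8)   # only A and B receive gradients
```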
Learning to Compress Prompts with Gist Tokens Jesse Mu et al 2023 llms, prompting, compression, tokens NeurIPS The authors describe a method of using a distilling function G (similar to a hypernet) that is able to compress LM prompts into a smaller set of "gist" tokens. These tokens can then be cached and reused. The neat trick is that they reuse the LM itself as G, so gisting itself incurs no additional training cost. Note that in their "Failure Cases" section, they mention "... While it is unclear why only the gist models exhibit this behavior (i.e. the fail example behavior), these issues can likely be mitigated with more careful sampling techniques." Link
Once-For-All: Train One Network and Specialize it For Efficient Deployment Han Cai et al 2020 nas, supernets ICLR The authors proposed training one large supernetwork and then sampling subnetworks as an approach for NAS. This method allows for the simultaneous generation of many different subnetworks that could satisfy different constraints (i.e. hardware, latency, accuracy, etc). The authors also propose a progressive shrinking method to train the net (start by training the big supernet, then progressively shrink down), which can be seen as a generalized pruning method. Furthermore, they introduce an idea of training a twin neural network to help estimate latency / accuracy given some architecture, which allows for fast feedback when conducting the search for subnetworks. Link
Dataless Knowledge Fusion by Merging Weights Xisen Jin et al 2023 knowledge fusion, weight merging ICLR The paper introduces RegMean, a method for merging pre-trained language models from different datasets by solving a linear optimization problem, which improves generalization across domains without requiring the original training data. Compared to existing methods like Simple Averaging and Fisher Averaging, RegMean offers higher computational efficiency and comparable memory overhead, while achieving better or equivalent performance across various natural language tasks, including out-of-domain generalization. The method is evaluated using GLUE datasets and demonstrates superior performance in most tasks, outperforming traditional model ensembling and multi-task learning approaches. Link
Superposition of Many Models into One Cheung et al 2019 superposition, online learning, tasks, continual learning NeurIPS A method of storing multiple models using only one set of parameters via parameter superposition; it shares similarities with superposition in Fourier analysis for signal processing. Link
Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation Yoshua Bengio et al 2013 gradients, stochasticity, backpropagation Arxiv The authors introduce several methods of gradient estimation / propagation for networks that have stochastic neurons. This is used often in quantization-aware networks, as they sometimes have decision boundaries in the neurons that are not ordinarily differentiable. The paper also introduces the "Straight-Through Estimator", which was actually first introduced in one of Hinton's lectures. One interesting idea they present (that I think may have also been introduced in Kingma's VAE paper?) is that we can model the output h_{i} of a stochastic neuron as the application of a deterministic function that also depends on some noise source z_{i}: h_{i} = f(a_{i}, z_{i}). TLDR: straight-through units are typically the go-to due to ease of use and good performance. Link
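A minimal sketch of a straight-through estimator for sign binarization; the hard-tanh-style gradient clipping is one common variant and is an assumption here, not the paper's only formulation:

```python
import torch

class BinarizeSTE(torch.autograd.Function):
    """Sign binarization with a straight-through gradient: the forward pass is
    piecewise constant, so the backward pass simply copies the incoming gradient
    (zeroed where |x| > 1, the hard-tanh variant)."""
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # Pass the gradient straight through, clipping it outside [-1, 1].
        return grad_output * (x.abs() <= 1).float()

x = torch.randn(4, requires_grad=True)
y = BinarizeSTE.apply(x).sum()
y.backward()        # x.grad is nonzero despite the non-differentiable forward
```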
DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients Shuchang Zhou et al 2018 quantization, cnn, gradients Arxiv The authors introduce a method to train CNNs with low-bitwidth weights and activations using low-bitwidth parameter gradients. They use deterministic quantization for weights and activations while stochastically quantizing gradients. Note that they generally do not quantize the weights of the first CNN layer, as they noted it would often degrade performance (Han et al. 2015 notes a similar thing). Another interesting thing they do is add noise to the gradient after quantization to increase performance. This paper also uses the straight-through estimator (Bengio et al. 2013) for propagating gradients through their quantization scheme. Link
Training Deep Neural Networks with 8-bit Floating Point Numbers Naigang Wang et al 2018 quantization, floating-point, precision NeurIPS The authors show that it is possible to train DNNs with 8-bit fp values while maintaining decent accuracy. To do this, they make a new FP8 format, develop a technique "chunk-based computations" that allow matrix and convolution ops to be computed using 8-bit multiplications and 16 bit additions, and use fp stochastic rounding in weight updates. One interesting point they make is that swamping (the issue of truncation in large-to-small number addition) is a serious problem in DNN bit-precision reduction. Link
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference Benoit Jacob et al 2017 quantization, quantization schemes, efficient inference, floating-point Arxiv The authors propose a quantization scheme that allows us to only use integer arithmetic to approximate fp computations in a neural network. They also describe a training approach that simulates the effect of quantization in the forward pass. Backprop still occurs, but all weights and biases are stored in fp. The forward prop pass then simulates quantized inference by rounding off using the quantization scheme they describe that changes fp to int. Link
PACT: Parameterized Clipping Activation for Quantized Neural Networks Jungwook Choi et al 2018 quantization, clipping, activations ICLR The authors present a quantization method that clips activations using a learnable parameter, alpha. They show that this can lead to smaller accuracy drops than other quantization methods, and they note that activations have historically been harder to quantize than weights. They also prove that PACT is as expressive as ReLU, by showing it can reach the same solution as ReLU when SGD is used, and describe the hardware benefits that can be gained. Link
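A hedged sketch of a PACT-style activation: the bit width and alpha initialization are illustrative, and the rounding uses a straight-through estimator as in the quantization papers above:

```python
import torch
import torch.nn as nn

class PACT(nn.Module):
    """PACT-style activation: clip to [0, alpha] with a learnable alpha, then
    linearly quantize the clipped range to k bits. Gradients reach alpha through
    the clipping term; the rounding uses a straight-through estimator."""
    def __init__(self, bits: int = 4, alpha_init: float = 6.0):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(alpha_init))
        self.levels = 2 ** bits - 1

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # clip(x, 0, alpha) written as in the paper: 0.5*(|x| - |x - alpha| + alpha)
        y = torch.clamp(x, min=0.0) - torch.clamp(x - self.alpha, min=0.0)
        scale = self.alpha / self.levels
        y_q = torch.round(y / scale) * scale
        return y + (y_q - y).detach()   # straight-through rounding

act = PACT(bits=4)
out = act(torch.randn(8, 16))           # alpha is trained jointly with the weights
```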
SMASH: One-Shot Model Architecture Search through Hypernetworks Andrew Brock et al 2017 hypernetworks, nas, one-shot, few-shot Arxiv The authors propose a technique to speed up NAS by using a hypernet. Basically, they train a hypernet to generate the weights of a main model with variable architecture. The input to the hypernet is a binarized representation of the model architecture; the hypernet takes this representation in and outputs weights. They then train for only a few epochs and compare the validation scores obtained across different representations. Finally, they fully train the model that had the best validation score. Link
Example-based Hypernetworks for Multi-source Adaptation to Unseen Domains Tomer Volk et al 2023 hypernetworks, multi-source adaptation, unseen domains, NLP EMNLP The authors apply hypernets to unsupervised domain adaptation in NLP. They use example-based adaptation. The main idea is that they use an encoder-decoder to initially create the unique signatures from an input example, and then they embed it within the source domain's semantic space. The signature is then used by a hypernet to generate the task classifier's weights. The paper focuses on improving generalization to unseen domains by explicitly modeling the shared and domain specific characteristics of the input. To allow for parameter sharing, they propose modeling based on hypernets, which allow soft weight sharing. Link
Meta-Learning via Hypernetworks Dominic Zhao et al 2020 hypernetworks, meta-learning NeurIPS The authors propose a soft weight-sharing hypernet architecture that performs well on meta-learning tasks. A good paper to show efforts in meta-learning with regards to hypernets, and comparing them to SOTA methods like Model-Agnostic Meta-Learning (MAML). Link
HyperDynamics: Meta-Learning Object and Agent Dynamics with Hypernetworks Zhou Xian et al 2021 hypernetworks, meta-learning, dynamics ICLR The authors present a dynamics meta-learning framework that conditions on an agent's interactions with its environment and (optionally) its visual input. From this, they can generate the parameters of a neural dynamics model. The three modules they use are 1) an encoding module that encodes a few agent-environment interactions / the agent's visual observations into a feature code, 2) a hypernet that conditions on the latent feature code to generate the parameters of a dynamics model dedicated to the observed system, and 3) a target dynamics model built from the generated parameters, which takes a low-dimensional system state / agent action as input and outputs the prediction of the next system state. Link
Principled Weight Initialization for Hypernetworks Oscar Chang et al 2020 hypernetworks, weight initialization ICLR Classical weight initialization techniques don't really work on hypernets, because they fail to produce weights for the mainnet in the correct scale. The authors derive formulas for hyperfan-out and hyperfan-in weight initialization, and show that it works well for the mainnet. Link
Continual Learning with Hypernetworks Johannes von Oswald et al 2020 hypernetworks, continual learning, meta learning ICLR The authors present a method of preventing catastrophic forgetting, by using task-conditioned hypernets (i.e., hypernets that generate weights of target model based on some task embedding). Thus, rather than memorizing many data characteristics, we can split the problem into just learning a single point per task, given the task embedding. Link
Stochastic Hyperparameter Optimization through Hypernetworks Jonathan Lorraine et al 2018 hypernetworks, hyperparameters ICLR Using hypernetworks to learn hyperparameters. They replace the training optimization loop in favor of a differentiable hypernetwork to allow for tuning of hyperparameters using grad descent. Link
Playing Atari with Deep Reinforcement Learning Volodymyr Mnih et al 2013 q-learning, reinforcement learning Arxiv The authors present the first deep learning model that can learn complex control policies, and they teach it to play Atari 2600 games using Q-learning. Their goal was to create one net that can play as many games as possible. Link
Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding Song Han et al 2016 quantization, encoding, pruning ICLR A three-pronged approach to compressing nets: prune the network, then quantize and share weights, and then apply Huffman coding. Link
Binarized Neural Networks: Training Neural Networks with Weights and Activations Constrained to +1 or -1 Matthieu Courbariaux et al 2016 quantization, efficiency, binary Arxiv Introduction of training Binary Neural Networks, or nets with binary weights and activations. They also present experiments on deterministic vs stochastic binarization. They use the deterministic one for the most part, except for activations. Link
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks Mingxing Tan et al 2020 efficiency, scaling ICML A study of model scaling is presented. The authors propose a novel method to uniformly scale all dimensions of depth/width/resolution using a compound coefficient: for instance, if you want to use 2^{N} more compute resources, you scale depth, width, and resolution by their respective coefficients raised to N. They also quantify the relationship between width, depth, and resolution. Link
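A tiny numeric sketch of compound scaling using the coefficients reported in the paper (alpha = 1.2, beta = 1.1, gamma = 1.15, chosen so that alpha * beta^2 * gamma^2 is roughly 2):

```python
# Compound scaling: depth, width, and resolution are scaled together by a
# single compound coefficient phi, using the paper's reported base coefficients.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution base factors

def compound_scale(phi: float):
    """Return (depth, width, resolution) multipliers for compound coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi

d, w, r = compound_scale(phi=3)
flops_factor = d * w**2 * r**2     # roughly doubles per unit phi (about 7x here)
print(f"depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}, ~{flops_factor:.1f}x FLOPs")
```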
2-in-1 Accelerator: Enabling Random Precision Switch for Winning Both Adversarial Robustness and Efficiency Yonggan Fu et al 2021 precision, adversarial, efficiency, quantization, accelerator ACM Introduction of a Random Precision Switch algorithm that can defend against adversarial attacks while promoting efficiency. The most interesting point of this paper (among many things!) is the smart idea of using quantization as a way to boost DNN robustness. Cool! Link
The wake-sleep algorithm for unsupervised neural networks Geoffrey Hinton et al 1995 representation, generative Arxiv One of the first generative neural networks that kind of resembles diffusion. Link
ViTCoD: Vision Transformer Acceleration via Dedicated Algorithm and Accelerator Co-Design Haoran You et al 2022 vit, accelerator, attention Arxiv Co-design for ViTs: prunes and polarizes attention maps into denser/sparser patterns, with a dedicated hardware accelerator developed as well. Link
Evolving Neural Networks through Augmenting Topologies Kenneth O. Stanley et al 2002 nas, evolution Arxiv Evolution for NAS. Link
A Brief Review of Hypernetworks in Deep Learning Vinod Kumar Chauhan et al 2024 hypernetwork Arxiv Review of hypernets. Link
HyperNetworks David Ha et al 2016 hypernetwork Arxiv Looking at HyperNetworks: networks that generate weights for other networks. Link
Deep Learners Benefit More from Out-of-Distribution Examples Yoshua Bengio et al 2024 ood ICML Evidence that OOD samples can help learning. They also argue that intermediate levels of representation can benefit models in multi-task settings. Link
Balanced Data, Imbalanced Spectra: Unveiling Class Disparities with Spectral Imbalance Chiraag Kaushik et al 2024 spectra, class imbalance ICML Introduction of the idea of "spectral imbalance", which can affect classification accuracy even when classes are balanced. Basically, they look at how the distributions of eigenvalues in different classes affect classification accuracy. Link
DeepArchitect: Automatically Designing and Training Deep Architectures Renato Negrinho et al 2017 nas Arxiv Proposal of a language to describe neural networks architectures. Can then describe them as trees to search through. Show different search methods for going through the trees (Monte Carlo tree search, random, use of surrogate function, etc.). Link
Graph neural networks: A review of methods and applications Jie Zhou et al 2020 gnn AI Open What graph neural networks are, what they are made of, and how to train them, with examples. They describe a general design pipeline (find the graph structure, specify graph type and scale, design the loss function) and explain the main modules in GNNs (a propagation module to propagate information between nodes, a sampling module to conduct the propagation, and a pooling module to extract information from nodes). Link
1D convolution neural networks and applications: A survey Serkan Kiranyaz et al 2020 cnn, survey Mechanical Systems and Signal Processing A brief overview of applications of 1D CNNs is performed. It is largely focused on medicine (for instance, ECG) and fault detection (for instance, vibration based structural damage). Link
Token Picker: Accelerating Attention in Text Generation with Minimized Memory Transfer via Probability Estimation Junyoung Park et al 2024 efficiency, hardware, accelerator, attention DAC In autoregressive models with attention, off-chip memory accesses need to be minimized. The authors note that there have been efforts to prune unimportant tokens, but these do not do much for removing tokens with attention scores near zero. The authors (smartly!) notice this issue and provide a fast method of deciding whether to prune by estimating the probability that a token is important. An architecture for this is also provided. Link
Maxout Networks Ian Goodfellow et al 2013 dropout, maxout ICML The authors note that dropout is "most effective when taking relatively large steps in parameter space. In this regime, each update can be seen as making a significant update to a different model on a different subset of the training set". I really liked that quote. They then develop the maxout unit, which essentially takes the maximum across some number of affine transformations, allowing for learning of piecewise linear approximations of nonlinear functions. Link
Geometric deep learning: Going beyond Euclidean data Michael Bronstein et al 2017 geometric deep learning IEEE SIG Provides an overview of geometric deep learning, which are methods of generalizing DNNs to non Euclidean domains (graphs, manifolds, etc). Link
Sampling in Constrained Domains with Orthogonal Space Variational Gradient Descent Ruqi Zhang et al 2022 variational gradient descent, gradient flow NeurIPS The authors propose a new variational framework called O Gradient for sampling in implicitly defined constrained domains, using two orthogonal directions to drive the sampler towards the domain and explore it by decreasing a KL divergence. They prove the convergence of O Gradient and apply it to both Langevin dynamics and Stein Variational Gradient Descent (SVGD), demonstrating its effectiveness on various machine learning tasks. Link
Entropy MCMC: Sampling from Flat Basins with Ease Bolian Li et al 2024 sampling, bayesian, flat basins ICML The authors propose a practical MCMC algorithm for sampling from flat basins of DNN posterior distributions, using a guiding variable based on local entropy to steer the sampler. They prove the fast convergence rate of their method compared to existing flatness aware methods and demonstrate its superior performance on various tasks through comprehensive experiments. The method is mathematically simple and computationally efficient, making it suitable as a drop in replacement for standard sampling methods like SGLD. Link
AdderNet: Do We Really Need Multiplications in Deep Learning? Hanting Chen et al 2021 multiplication-less, efficiency CVPR The authors show that, at some cost in accuracy, you can use additions instead of multiplications. They mainly tested CNNs. Link
Explaining and Harnessing Adversarial Examples Ian Goodfellow et al 2015 adversarial examples ICLR Adversarial examples (adding "small but intentionally worst-case perturbations to examples from the dataset") prove to be an interesting way to train models. The authors also (smartly!) describe a linear method for generating adversarial examples. Link
Identifying and attacking the saddle point problem in high dimensional non convex optimization Yann Dauphin et al 2014 saddle points, optimization NeurIPS The authors argue that saddle points, rather than local minima, are the primary challenge in minimizing non convex error functions in high dimensional spaces, based on insights from various scientific fields and empirical evidence. They explain that saddle points surrounded by high error plateaus can significantly slow down learning and create the illusion of local minima, particularly in high dimensional problems of practical interest. To address this challenge, the authors propose a new approach called the saddle free Newton method, designed to quickly escape high dimensional saddle points, unlike traditional gradient descent and quasi Newton methods. Link
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift Sergey Ioffe et al 2015 batch, normalization PMLR The authors identify internal covariate shift as a significant challenge in training deep neural networks, where the distribution of each layer's inputs changes during training due to parameter updates in previous layers. To address this issue, they propose Batch Normalization, a method that normalizes layer inputs as part of the model architecture, performing normalization for each training mini-batch. Batch Normalization enables the use of much higher learning rates, reduces sensitivity to initialization, and acts as a regularizer, sometimes eliminating the need for Dropout. Link
Bayesian Deep Learning and a Probabilistic Perspective of Generalization Andrew Wilson et al 2020 bayesian, marginalization NeurIPS The authors emphasize that marginalization, rather than using a single set of weights, is the key distinguishing feature of a Bayesian approach, which can significantly improve the accuracy and calibration of modern deep neural networks. They demonstrate that deep ensembles provide an effective mechanism for approximate Bayesian marginalization and propose a related approach that further improves the predictive distribution by marginalizing within basins of attraction. The paper investigates the prior over functions implied by a vague distribution over neural network weights, explaining neural network generalization from a probabilistic perspective and showing that seemingly mysterious results (like fitting random labels) can be reproduced with Gaussian processes. The authors demonstrate that Bayesian model averaging mitigates the double descent phenomenon, leading to monotonic performance improvements as model flexibility increases. Link
A Practical Bayesian Framework for Backpropagation Networks David MacKay et al 1992 bayesian Neural Computation The authors present a quantitative and practical Bayesian framework for learning mappings in feedforward networks, enabling objective comparisons between different network architectures and providing stopping rules for network pruning or growing procedures. This framework allows for objective selection of weight decay terms or regularizers, measures the effective number of well determined parameters in a model, and provides quantified estimates of error bars on network parameters and outputs. The approach helps detect poor underlying assumptions in learning models and demonstrates a good correlation between generalization ability and Bayesian evidence for well matched learning models. Link
Bayesian Neural Network Priors Revisited Vincent Fortuin et al 2022 bayesian, priors ICLR Isotropic Gaussian priors are the standard for modern Bayesian neural network inference, but their accuracy and optimal performance are uncertain. By studying summary statistics of neural network weights trained with stochastic gradient descent (SGD), the authors find that CNN and ResNet weights exhibit strong spatial correlations, while FCNNs display heavy tailed weight distributions. Incorporating these observations into priors improves performance on various image classification datasets, mitigating the cold posterior effect in FCNNs but slightly increasing it in ResNets. Link
Hands on Bayesian Neural Networks A Tutorial for Deep Learning Users Laurent Jospin et al 2022 bayesian IEEE A good summary / tutorial for using Bayesian Nets. Also provides some good paper references within. Link
Position Paper: Bayesian Deep Learning in the Age of Large Scale AI Theodore Papamarkou et al 2024 bayesian, mcmc ICML A good summary of the strengths of BDL (Bayesian Deep Learning) with regard to modern deep learning, while also addressing some weaknesses. A good paper if you need an overview of modern challenges (as of 2024). Link
A Neural Probabilistic Language Model Bengio et al 2003 statistical language modeling JMLR One of the first papers about modern methods of using neural systems to estimate probability functions of word sequences. They show that MLPs can model these better than the SOTA (at the time). A classic. Link
Bit Fusion: Bit Level Dynamically Composable Architecture for Accelerating Deep Neural Networks Hardik Sharma et al 2018 accelerator, quantization, bit fusion ISCA Hardware acceleration of Deep Neural Networks (DNNs) aims to address their high compute intensity, with the paper focusing on the potential of reducing bitwidth in operations without compromising classification accuracy. To prevent accuracy loss, the bitwidth varies significantly across DNNs, and a fixed bitwidth accelerator may lead to limited benefits or degraded accuracy. The authors introduce Bit Fusion, a bit flexible accelerator that dynamically adjusts bitwidth for individual DNN layers, resulting in significant speedup and energy savings compared to state of the art accelerators, Eyeriss and Stripes, and achieving performance close to a 250 Watt Titan Xp GPU while consuming much less power. Link
A Framework for the Cooperation of Learning Algorithms Leon Bottou et al 1990 learning algorithms, modules NeurIPS Cooperative training of modular systems offers a unified approach to many learning algorithms and hybrid systems, allowing the design and implementation of complex learning systems that incorporate structural a priori knowledge about tasks. The authors introduce a framework using a statistical formulation of learning systems to define and combine modules into cooperative systems, enabling the creation of hybrid systems that combine the advantages of connectionist and other learning algorithms. By decomposing complex tasks into simpler subtasks, modular architectures can be built, where each module corresponds to a subtask, facilitating easier achievement of the learning goal by introducing a modular decomposition of the global task. Link
CNP: An FPGA Based Processor for Convolutional Networks Clement Farabet et al 2009 fpga, cnn IEEE One of the first attempts (that I have found) at putting a CNN into an FPGA and showing it can be done to perform some task (face detection). Link
A Complete Recipe for Stochastic Gradient MCMC Yi An Ma et al 2015 hamiltonian, mcmc NeurIPS Recent Markov chain Monte Carlo (MCMC) samplers use continuous dynamics and scalable variants with stochastic gradients to efficiently explore target distributions, but proving convergence with stochastic gradient noise remains challenging. The authors provide a general framework for constructing MCMC samplers, including stochastic gradient versions, based on continuous Markov processes defined by two matrices, demonstrating that any such process can be represented within this framework. Using this framework, they propose a new state adaptive sampler, stochastic gradient Riemann Hamiltonian Monte Carlo (SGRHMC), which combines the benefits of Riemann HMC with the scalability of stochastic gradient methods, as shown in experiments with simulated data and a streaming Wikipedia analysis. Link
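For reference, my sketch of the "complete recipe" SDE (up to notation): a positive semidefinite diffusion matrix D(z) and a skew symmetric curl matrix Q(z) fully parameterize samplers whose stationary distribution is the target:

```latex
dz = \left[-\big(D(z) + Q(z)\big)\nabla H(z) + \Gamma(z)\right] dt + \sqrt{2\,D(z)}\, dW(t),
\qquad
\Gamma_i(z) = \sum_j \frac{\partial}{\partial z_j}\big(D_{ij}(z) + Q_{ij}(z)\big)
```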
CPT: Efficient Deep Neural Network Training via Cyclic Precision Yonggan Fu et al 2021 precision, efficiency, wide minima ICLR Low precision deep neural network (DNN) training is an effective method for improving training time and energy efficiency, with this paper proposing a new perspective: that DNN precision may act similarly to the learning rate during training. The authors introduce Cyclic Precision Training (CPT), which cyclically varies precision between two boundary values identified through a simple precision range test in the initial training epochs, aiming to boost time and energy efficiency further. Link
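A minimal sketch of what a cyclic precision schedule could look like; the bounds, cycle length, and cosine shape here are illustrative rather than the paper's exact settings:

```python
import math

def cyclic_precision(step: int, cycle_len: int, b_min: int = 3, b_max: int = 8) -> int:
    """Illustrative cyclic precision schedule: sweep the quantization
    bit-width between b_min and b_max over each cycle, analogous to a
    cyclical learning rate."""
    phase = (step % cycle_len) / cycle_len                  # position within the cycle, in [0, 1)
    level = 0.5 * (1.0 - math.cos(2.0 * math.pi * phase))   # rises from 0 to 1, then falls back to 0
    return round(b_min + (b_max - b_min) * level)
```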
Approximation by Superpositions of a Sigmoidal Function G. Cybenko 1989 universal approximator, completeness Mathematics of Control, Signals, and Systems This paper demonstrates that finite linear combinations of compositions of a fixed univariate function and a set of affine functionals can uniformly approximate any continuous function of n real variables within the unit hypercube, under mild conditions on the univariate function. These findings resolve an open question about the representability of functions by single hidden layer neural networks, specifically showing that arbitrary decision regions can be well approximated by continuous feedforward neural networks with a single hidden layer and any continuous sigmoidal nonlinearity. Link
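The approximating family in the theorem is the set of finite sums below; Cybenko shows these are dense in C(I_n) whenever σ is a continuous sigmoidal function:

```latex
G(x) = \sum_{j=1}^{N} \alpha_j\, \sigma\!\left(w_j^{\top} x + \theta_j\right),
\qquad x \in I_n = [0,1]^n
```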
Cyclical Stochastic Gradient MCMC for Bayesian Deep Learning Ruqi Zhang et al 2020 mcmc, bayesian ICLR The posteriors over neural network weights are high dimensional and multimodal, with each mode representing a different meaningful interpretation of the data. The authors introduce Cyclical Stochastic Gradient MCMC (SG MCMC) with a cyclical stepsize schedule, where larger steps discover new modes and smaller steps characterize each mode, and they prove the non asymptotic convergence of this algorithm. Link
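As I recall, the cyclical schedule is a cosine decay restarted every cycle, roughly of the form below (K total iterations split into M cycles, initial stepsize α_0):

```latex
\alpha_k = \frac{\alpha_0}{2}\left[\cos\!\left(\frac{\pi\,\mathrm{mod}\!\left(k-1,\ \lceil K/M \rceil\right)}{\lceil K/M \rceil}\right) + 1\right]
```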
DaDianNao: A Machine Learning Supercomputer Yunji Chen et al 2014 accelerator, gpu IEEE/ACM This paper introduces a custom multi chip architecture optimized for Convolutional and Deep Neural Networks (CNNs and DNNs), addressing their computational and memory intensive nature by leveraging on chip storage to enhance internal bandwidth and reduce external communication bottlenecks. The authors demonstrate significant performance gains with their 64 chip system achieving up to a 450.65x speedup over GPUs and reducing energy consumption by up to 150.31x on large neural network layers, implemented with custom storage, computational units, and robust interconnects at 28nm scale. Link
DARTS: Differentiable Architecture Search Hanxiao Liu et al 2019 nas ICLR This paper introduces a differentiable approach to architecture search, tackling scalability challenges by reformulating the task to allow gradient based optimization over a continuous relaxation of architecture representations. Unlike traditional methods relying on evolutionary or reinforcement learning in discrete, non differentiable spaces, the proposed method efficiently discovers high performance convolutional architectures for image classification and recurrent architectures for language modeling. Link
Decoupled Contrastive Learning Chun Hsiao Yeh et al 2022 contrastive learning, self-supervised learning ACM This paper introduces decoupled contrastive learning (DCL), which removes the negative positive coupling (NPC) effect from the InfoNCE loss, significantly improving the efficiency of self supervised learning (SSL) tasks with smaller batch sizes. DCL achieves efficient and reliable performance enhancements across various benchmarks, outperforming the SimCLR baseline without requiring momentum encoding, large batch sizes, or extensive epochs. Link
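A minimal sketch of the coupling being removed, using cross view negatives only to keep it short (the paper also uses within view negatives); names and the temperature value are illustrative:

```python
import torch

def infonce_vs_dcl(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.1):
    """z1, z2: (N, d) L2-normalized embeddings of two augmented views.
    Returns mean InfoNCE and decoupled (DCL-style) losses over the anchors."""
    logits = (z1 @ z2.T) / tau                            # (N, N) scaled cosine similarities
    pos = logits.diag()                                   # positive-pair logits
    exp_logits = logits.exp()
    neg_sum = exp_logits.sum(dim=1) - exp_logits.diag()   # negatives only

    infonce = -pos + torch.log(exp_logits.diag() + neg_sum)  # positive kept in the denominator
    dcl = -pos + torch.log(neg_sum)                           # positive removed (decoupled)
    return infonce.mean(), dcl.mean()
```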
Deep Image Prior Dmitry Ulyanov et al 2020 inpainting, super-resolution, denoising IEEE This paper challenges the conventional wisdom by demonstrating that the structure of a generator network, even when randomly initialized, can effectively capture low level image statistics without any specific training on example images. The authors show that this randomly initialized neural network can serve as a powerful handcrafted prior, yielding excellent results in standard image processing tasks such as denoising, super resolution, and inpainting. Furthermore, the same network structure can invert deep neural representations for diagnostic purposes and restore images based on input pairs like flash and no flash conditions, showcasing its versatility and effectiveness across various image restoration applications. Link
Deep Double Descent: Where Bigger Models and More Data Hurt Preetum Nakkiran et al 2019 capacity, double descent Arxiv This paper explores the "double descent" phenomenon in modern deep learning tasks, showing that as model size or training epochs increase, performance initially worsens before improving. The authors unify these observations by introducing a new complexity measure termed effective model complexity, conjecturing a generalized double descent across this measure. Link
DeepShift: Towards Multiplication Less Neural Networks Mostafa Elhoushi et al 2021 multiplication-less, efficiency Arxiv This paper addresses the computational challenges of deploying convolutional neural networks (CNNs) on edge computing platforms by introducing convolutional shifts and fully connected shifts, replacing multiplications with efficient bitwise operations during both training and inference. The proposed DeepShift models achieve competitive or higher accuracies compared to baseline models like ResNet18, ResNet50, VGG16, and GoogleNet, while significantly reducing the memory footprint by using only 5 bits or less for weight representation during inference. Link
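The core trick is representing each weight as a signed power of two, so multiplication collapses to a sign flip and a shift; a toy sketch (variable names mine, not the paper's kernels):

```python
import numpy as np

def shift_multiply(x: np.ndarray, sign: np.ndarray, p: np.ndarray) -> np.ndarray:
    """Approximate x * w where w = sign * 2**p: the product becomes a sign
    flip plus a power-of-two scaling (a plain bit shift on integer data)."""
    return sign * np.ldexp(x, p)   # ldexp(x, p) == x * 2**p

# e.g. a weight of roughly -0.25 is stored as sign = -1, p = -2:
# shift_multiply(np.array([8.0]), np.array([-1]), np.array([-2]))  ->  array([-2.])
```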
DepthShrinker: A New Compression Paradigm Towards Boosting Real Hardware Efficiency of Compact Neural Networks Yonggan Fu et al 2022 compression, efficiency, pruning ICML This paper introduces DepthShrinker, a framework designed to enhance hardware efficiency of deep neural networks (DNNs) by transforming irregular computation patterns of compact operators into dense ones, thereby improving hardware utilization without sacrificing model accuracy. By leveraging insights that certain activation functions can be removed post training without loss of accuracy, DepthShrinker pioneers a compression paradigm that optimizes DNNs for real hardware efficiency, presenting a significant advancement in efficient model deployment. Link
Dimensionality Reduction by Learning an Invariant Mapping Raia Hadsell et al 2006 dimensionality reduction, mapping CVPR DrLIM, or Dimensionality Reduction by Learning an Invariant Mapping, addresses key limitations of existing dimensionality reduction techniques by learning a non linear function that maps high dimensional data to a low dimensional manifold based solely on neighborhood relationships, without requiring a predefined distance metric in input space. The method is distinguished by its ability to handle transformations and maintain invariant mappings, demonstrated through experiments that show its effectiveness in preserving neighborhood relationships and accurately mapping new, unseen samples to meaningful locations on the manifold. Unlike methods like LLE, which may struggle with variability and registration issues in input data, DrLIM's contrastive loss function ensures robustness by balancing similarity and dissimilarity in output space, offering a promising approach for applications requiring invariant mappings, such as learning positional information from image sequences in robotics. Link
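The contrastive loss at the heart of DrLIM, with Y = 0 for similar pairs, Y = 1 for dissimilar pairs, D_W the learned distance, and m a margin:

```latex
L(W, Y, X_1, X_2) = (1 - Y)\,\tfrac{1}{2}\,D_W^2 \;+\; Y\,\tfrac{1}{2}\,\big\{\max\!\left(0,\ m - D_W\right)\big\}^2
```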
Disentangling Trainability and Generalization in Deep Neural Networks Lechao Xiao et al 2020 neural tangent kernel, ntk ICML This study focuses on characterizing the trainability and generalization of deep neural networks, particularly under conditions of very wide and very deep architectures, leveraging insights from the Neural Tangent Kernel (NTK). By analyzing the NTK's spectrum, the study formulates necessary conditions for both memorization and generalization across architectures like Fully Connected Networks (FCNs) and Convolutional Neural Networks (CNNs). The research identifies key spectral quantities such as λ_max, λ_bulk, κ, and P(Θ^(l)) that critically influence the performance of deep networks, providing a precise theoretical framework validated by extensive experiments on CIFAR10. It highlights distinctions in generalization behavior between CNNs with and without global average pooling. Link
Finding Structure in Time Jeffrey Elman 1990 rnn Cognitive Science The original simple recurrent network (Elman network) paper. Good insights on learning in time dependent systems. Link
E2 Train: Training State of the art CNNs with Over 80% Energy Savings Yue Wang et al 2019 cnn, batch, energy NeurIPS This paper introduces E2 Train, a framework for energy efficient CNN training on resource constrained platforms. E2 Train optimizes training energy costs through stochastic mini batch dropping, selective layer updates, and low cost, low precision back propagation strategies. Experimental results on CIFAR 10 and CIFAR 100 demonstrate significant energy savings of over 90% and 84%, respectively, with minimal loss in accuracy. E2 Train addresses the challenge of on device CNN training by integrating three levels of energy saving techniques: data level stochastic mini batch dropping, model level selective layer updates, and algorithm level low precision back propagation. Real energy measurements on an FPGA validate its effectiveness, achieving notable energy reductions in training ResNet models on CIFAR datasets. Link
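The data level technique is simple enough to sketch in a few lines; the skip probability below is illustrative, not the paper's setting:

```python
import random

def stochastic_minibatch_dropping(batches, p_skip: float = 0.5):
    """Yield only a random subset of mini-batches: each batch is skipped
    with probability p_skip, cutting per-epoch training compute (and thus
    energy) roughly in proportion."""
    for batch in batches:
        if random.random() < p_skip:
            continue          # drop this mini-batch entirely
        yield batch
```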
cuDNN: Efficient Primitives for Deep Learning Sharan Chetlur et al 2014 cuda, gpu Arxiv This paper introduces cuDNN, a library designed to optimize deep learning primitives akin to BLAS for HPC. cuDNN offers efficient implementations of key deep learning kernels tailored for GPUs, improving performance and reducing memory usage in frameworks like Caffe by up to 36%. Link
EIE: Efficient Inference Engine on Compressed Deep Neural Network Song Han et al 2016 compression, accelerator, co-design Arxiv This paper introduces EIE, an energy efficient inference engine designed for compressed deep neural networks, achieving significant energy savings by exploiting weight sharing, sparsity, and quantization. EIE performs sparse matrix vector multiplications directly on compressed models, enabling 189× and 13× faster inference speeds compared to CPU and GPU implementations of uncompressed DNNs. Link
An Empirical Analysis of Deep Network Loss Surfaces Daniel Jiwoong Im et al 2016 optimization, loss surface, saddle points Arxiv This paper empirically investigates the geometry of loss functions in state of the art neural networks, employing various stochastic optimization methods. Through visualizations in low dimensional subspaces, it explores how different optimization procedures lead to distinct local minima, even when algorithms are changed late in the optimization process. The study reveals that modifications to optimization procedures consistently yield different local minima, each affecting the network's performance on test examples differently. Interestingly, while different optimization algorithms find varied local minima from different initializations, the shape of the loss function around these minima remains characteristic to the algorithm used, with ADAM showing larger basins compared to vanilla SGD. Link
EyeCoD: Eye Tracking System Accelerator via FlatCam based Algorithm & Accelerator Co Design Haoran You et al 2023 accelerator, co-design, eye-tracking ACM This paper introduces EyeCoD, a lensless FlatCam based eye tracking system designed to overcome limitations of traditional systems, such as large form factor and high communication costs. By integrating a predict then focus algorithm pipeline and dedicated hardware accelerator, EyeCoD achieves significant reductions in computation and communication overhead while maintaining high tracking accuracy. Link
Eyeriss v2: A Flexible Accelerator for Emerging Deep Neural Networks on Mobile Devices Yu Hsin Chen et al 2019 efficiency, sparsity Arxiv This paper introduces Eyeriss v2, a specialized DNN accelerator architecture designed to efficiently handle compact and sparse neural networks. Unlike traditional DNN accelerators, Eyeriss v2 incorporates a hierarchical mesh network on chip to adapt to varying layer shapes and sizes, optimizing data reuse and bandwidth utilization. Eyeriss v2 excels in processing sparse data directly in the compressed domain, both for weights and activations, thereby enhancing processing speed and energy efficiency particularly suited for sparse models. Link
Eyeriss: A Spatial Architecture for Energy Efficient Dataflow for Convolutional Neural Networks Yu Hsin Chen et al 2016 cnn, row-stationary, efficiency ACM/IEEE The paper addresses the high energy consumption in deep convolutional neural networks (CNNs) due to extensive data movement, despite advancements in parallel computing paradigms like SIMD/SIMT. It introduces a novel row stationary (RS) dataflow designed for spatial architectures. RS maximizes local data reuse and minimizes data movement during convolutions, leveraging PE local storage, inter PE communication, and spatial parallelism. Link
Flat Minima Sepp Hochreiter et al 1997 flat minima, low complexity, gibbs Neural Computation The algorithm focuses on identifying "flat" minima of the error function in weight space. A flat minimum is characterized by a large connected region where the error remains approximately constant. This property suggests simplicity in the network structure and low expected overfitting, supported by an MDL based Bayesian argument. Unlike traditional approaches that rely on Gaussian assumptions or specific weight priors, this algorithm uses a Bayesian framework with a prior over input output functions. This approach considers both network architecture and the training set, facilitating the identification of simpler and more generalizable models. Link
Fused Layer CNN Accelerators Manoj Alwani et al 2016 cnn, accelerator, fusion IEEE This work introduces a novel approach to CNN accelerator design by fusing the computation of multiple convolutional layers. By rearranging the dataflow across layers, intermediate data can be cached on chip between adjacent layers, reducing the need for off chip memory storage and minimizing data transfer. Specifically, the study demonstrates the effectiveness of this approach by implementing a fused layer CNN accelerator for the initial five convolutional layers of VGGNet E. Using 362KB of on chip storage, the accelerator significantly reduces off chip feature map data transfer by 95%, from 77MB to 3.6MB per image processed. This innovation targets early convolutional layers where data transfer typically dominates. By optimizing data reuse and minimizing off chip memory usage, the proposed design strategy enhances the efficiency of CNN accelerators, paving the way for improved performance in various machine learning tasks. Link
EnlightenGAN: Deep Light Enhancement Without Paired Supervision Yifan Jiang et al 2021 gan, enhancement, unsupervised IEEE Explores low light to well lit image generation using GANs. Also provides an interesting global local discriminator and a self regularized perceptual loss fusion, with a simplified attention mechanism (the attention map is essentially just the inverse of the image's brightness). Link
Understanding the difficulty of training deep feedforward neural networks Xavier Glorot et al 2010 activation, saturation, initialization AISTATS The logistic sigmoid activation function is problematic for deep networks due to its mean value, which can lead to saturation of the top hidden layer. This saturation slows down learning and can cause training plateaus. The difficulty in training deep networks correlates with the singular values of the Jacobian matrix for each layer. When these values deviate significantly from 1, it indicates poor activation and gradient flow across layers, complicating training. New initialization schemes have been proposed to address issues with activation saturation and gradient flow. These schemes aim to achieve faster convergence by ensuring that activations and gradients flow well across layers, thereby improving overall training efficiency. Link
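The normalized ("Xavier") initialization proposed in the paper keeps activation and gradient variances roughly constant across layers; a minimal NumPy sketch:

```python
import numpy as np

def glorot_uniform(n_in: int, n_out: int, rng=None) -> np.ndarray:
    """Draw an (n_in, n_out) weight matrix uniformly from
    [-sqrt(6/(n_in+n_out)), +sqrt(6/(n_in+n_out))], i.e. Var(W) = 2/(n_in+n_out)."""
    rng = rng or np.random.default_rng()
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))
```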
Group Normalization Yuxin Wu et al 2018 normalization Arxiv Batch Normalization performs normalization along the batch dimension, which causes errors to increase rapidly as batch sizes decrease. This limitation makes BN less effective for training larger models and tasks that require smaller batches due to memory constraints. GN divides channels into groups and computes normalization statistics (mean and variance) independently within each group. Unlike BN, GN's computation is not dependent on batch sizes, leading to stable performance across a wide range of batch sizes. Link
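A minimal NumPy sketch of the normalization itself for NCHW inputs (the paper includes a similarly short reference implementation):

```python
import numpy as np

def group_norm(x, gamma, beta, G=32, eps=1e-5):
    """Group Normalization: split the C channels into G groups (assumes
    C % G == 0) and normalize each (sample, group) slice with its own
    mean/variance, independent of the batch size."""
    N, C, H, W = x.shape
    xg = x.reshape(N, G, C // G, H, W)
    mean = xg.mean(axis=(2, 3, 4), keepdims=True)
    var = xg.var(axis=(2, 3, 4), keepdims=True)
    xg = (xg - mean) / np.sqrt(var + eps)
    x = xg.reshape(N, C, H, W)
    return x * gamma.reshape(1, C, 1, 1) + beta.reshape(1, C, 1, 1)
```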
Singularity of the Hessian in Deep Learning Levent Sagun et al 2017 eigenvalues, hessian ICLR The bulk of eigenvalues concentrated around zero indicates how overparametrized the model is. In deep learning, overparametrization often leads to better generalization despite the potential for higher computational costs. The edges of the eigenvalue distribution, scattered away from zero, reflect the complexity of the input data. This complexity influences how the loss landscape is structured and affects optimization difficulty. Second order optimization methods, which leverage information from the Hessian, can potentially accelerate training and find better solutions by providing insights into the loss landscape's curvature. The top discrete eigenvalues of the Hessian are influenced by the data characteristics, indicating that different datasets may require different optimization strategies or model architectures for optimal performance. Link
Long Short Term Memory Sepp Hochreiter et al 1997 lstm, rnn Neural Computation The original paper on the LSTM. A classic, and demonstrated the power of gating. Link
On the importance of initialization and momentum in deep learning Ilya Sutskever et al 2013 initialization, momentum ICML Traditionally, training DNNs and RNNs with stochastic gradient descent (SGD) with momentum was considered challenging due to issues with gradient propagation and vanishing/exploding gradients, especially in networks with many layers or long term dependencies. The paper demonstrates that using a well designed random initialization significantly improves the training success of deep and recurrent networks with SGD and momentum. Link
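For reference, the classical vs. Nesterov momentum updates discussed in the paper (μ the momentum coefficient, ε the learning rate), up to notation:

```latex
\text{Classical:}\quad v_{t+1} = \mu v_t - \varepsilon \nabla f(\theta_t), \qquad \theta_{t+1} = \theta_t + v_{t+1}
\\
\text{Nesterov:}\quad v_{t+1} = \mu v_t - \varepsilon \nabla f(\theta_t + \mu v_t), \qquad \theta_{t+1} = \theta_t + v_{t+1}
```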
Algorithms for manifold learning Lawrence Cayton 2005 manifold learning, dimensionality reduction Arxiv Many datasets exhibit complex relationships that cannot be effectively captured by linear methods like Principal Component Analysis (PCA). Manifold hypothesis: Despite high dimensional appearances, data points often lie on or near a much lower dimensional manifold embedded within the higher dimensional space. Manifold learning aims to uncover this underlying low dimensional structure to provide a more meaningful and compact representation of the data. Link
NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis Ben Mildenhall et al 2020 nerf, view synthesis, 3d, scene representation, volume rendering Arxiv The method utilizes a fully connected deep network to represent scenes as continuous volumetric functions. This network takes a 5D input (spatial location and viewing direction) and outputs volume density and view dependent radiance. By querying these 5D coordinates along camera rays and employing differentiable volume rendering techniques, the method synthesizes novel views of scenes. Link
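The rendering step integrates density and radiance along each camera ray; the continuous form below is what the paper approximates with stratified sampling and quadrature:

```latex
C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\, \sigma(\mathbf{r}(t))\, \mathbf{c}(\mathbf{r}(t), \mathbf{d})\, dt,
\qquad
T(t) = \exp\!\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\, ds\right)
```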
On Large Batch Training for Deep Learning Generalization Gap and Sharp Minima Nitish Shirish Keskar et al 2017 sharp minima, large batch ICLR The study identifies a phenomenon where large batch SGD methods tend to converge towards sharp minimizers of training and testing functions. Sharp minima are associated with poorer generalization, meaning the model performs worse on unseen data. In contrast, small batch methods more consistently converge towards flat minimizers. This behavior is attributed to the inherent noise in gradient estimation during training with small batches. Link
Optimizing FPGA based Accelerator Design for Deep Convolutional Neural Networks Chen Zhang et al 2015 cnn, fpga, accelerator ACM The study employs quantitative analysis techniques, including loop tiling and transformation, to optimize the CNN accelerator design. These techniques aim to maximize computation throughput while minimizing the resource utilization on the FPGA, particularly balancing logic resource usage and memory bandwidth. Link
Learning Phrase Representation using RNN Encoder Decoder for Statistical Machine Translation Kyunghyun Cho et al 2014 encoder decoder, machine translation Arxiv Introduces a novel neural network architecture called RNN Encoder Decoder, comprising two recurrent neural networks. One RNN serves as an encoder, converting a sequence of symbols into a fixed length vector representation. The other RNN acts as a decoder, generating another sequence of symbols based on the encoded representation. Link
Qualitatively Characterizing Neural Network Optimization Problems Ian Goodfellow et al 2015 optimization, visualization ICLR Demonstrates that contemporary neural networks can achieve minimal training error through direct training with stochastic gradient descent alone, without needing complex schemes like unsupervised pretraining. This finding challenges earlier beliefs about the difficulty of navigating non convex optimization landscapes in neural network training. They also introduce a nice graphical tool to show the energy landscape. Link
Language Models are Unsupervised Multitask Learners Alec Radford et al 2019 unsupervised, GPT Arxiv Demonstrates that language models, specifically GPT 2, trained on the WebText dataset, start to learn various natural language processing tasks (question answering, machine translation, reading comprehension, summarization) without explicit task specific supervision. For instance, when conditioned on a document and questions, the model achieves an F1 score of 55 on the CoQA dataset, matching or exceeding several baseline systems that were trained with over 127,000 examples. Link
On the difficulty of training Recurrent Neural Networks Razvan Pascanu et al 2013 exploding gradient, vanishing gradient, gradient clipping, normalization Arxiv Explanation of issues in RNNs (vanishing / exploding gradient) and proposal of gradient clipping. Link
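Gradient clipping by norm, as proposed for the exploding gradient problem, fits in a few lines; a minimal sketch:

```python
import numpy as np

def clip_by_global_norm(grads, threshold: float):
    """If the overall gradient norm exceeds the threshold, rescale all
    gradients so the norm equals the threshold; the direction is preserved."""
    norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    if norm > threshold:
        grads = [g * (threshold / norm) for g in grads]
    return grads
```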
Learning representations by back propagating errors David Rumelhart et al 1986 backpropagation, learning procedure, convergence Nature The main paper for backprop. Link
The Shattered Gradients Problem: If resnets are the answer, then what is the question? David Balduzzi et al 2017 shattering, initialization ICML The paper identifies the "shattered gradients" problem in standard feedforward neural networks. It shows that gradients in these networks exhibit an exponential decay in correlation with depth, leading to gradients that resemble white noise. In contrast, architectures like highway and ResNets with skip connections demonstrate gradients that decay sublinearly, indicating greater resilience against shattering. The paper introduces a new initialization technique termed "Looks Linear" (LL) that addresses the shattered gradients issue. Preliminary experiments demonstrate that LL initialization enables the training of very deep networks without the need for skip connections like those in ResNets or highway networks. This initialization method offers a promising alternative to achieving stable gradient propagation in deep networks, potentially simplifying network architecture and improving training efficiency. Link
A Simple Baseline for Bayesian Uncertainty in Deep Learning Wesley Maddox et al 2019 bayesian, uncertainty, gaussian NeurIPS SWAG combines Stochastic Weight Averaging (SWA) with Gaussian fitting to provide an approximate posterior distribution over neural network weights. SWA computes the first moment of SGD iterates using a modified learning rate schedule. SWAG extends this by fitting a Gaussian distribution using SWA's solution as the first moment and incorporating a low rank plus diagonal covariance derived from SGD iterates. Link
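A diagonal only sketch of the idea (the full method adds a low rank term to the covariance); class and method names are mine:

```python
import numpy as np

class DiagonalSWAG:
    """Collect SGD iterates along the SWA trajectory, then sample weights
    from a Gaussian fit to their running first and second moments."""
    def __init__(self, dim: int):
        self.n = 0
        self.mean = np.zeros(dim)
        self.sq_mean = np.zeros(dim)

    def collect(self, w: np.ndarray) -> None:
        self.n += 1
        self.mean += (w - self.mean) / self.n
        self.sq_mean += (w ** 2 - self.sq_mean) / self.n

    def sample(self, rng=None) -> np.ndarray:
        rng = rng or np.random.default_rng()
        var = np.maximum(self.sq_mean - self.mean ** 2, 1e-12)  # diagonal covariance estimate
        return self.mean + np.sqrt(var) * rng.standard_normal(self.mean.shape)
```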
SmartExchange: Trading Higher cost Memory Storage/Access for Lower cost Computation Yang Zhao et al 2020 compression, accelerator, pruning, decomposition, quantization ACM/IEEE SmartExchange integrates sparsification or pruning, decomposition, and quantization techniques into a unified algorithm. It aims to enforce a structured DNN weight format where each layer's weight matrix is represented as a product of a small basis matrix and a large sparse coefficient matrix with power of 2 elements. Link
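The enforced weight structure can be sketched as a small dense basis times a large sparse, power-of-two coefficient matrix; the factor order and names below are illustrative, not the paper's exact layout:

```python
import numpy as np

def smartexchange_reconstruct(coeff_exp: np.ndarray,
                              coeff_mask: np.ndarray,
                              basis: np.ndarray) -> np.ndarray:
    """Rebuild W ~= Ce @ B, where Ce is sparse with power-of-two nonzeros
    (so reconstruction needs only shifts and adds) and B is a small basis."""
    coeff = coeff_mask * np.ldexp(1.0, coeff_exp)   # zeros where mask is 0, 2**exp elsewhere
    return coeff @ basis
```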
On the Spectral Bias of Neural Networks Nasim Rahaman et al 2019 spectra, fourier analysis, manifold learning ICML Neural networks, particularly deep ReLU networks, exhibit a learning bias towards low frequency functions. This bias means they tend to prioritize learning global variations over local fluctuations in data. This property aligns with their ability to generalize well across different samples and datasets. Contrary to intuition, as the complexity of the data manifold increases, deep networks find it easier to learn higher frequency functions. This suggests that while they naturally favor low frequency patterns, they can also adapt to more complex data structures to capture higher frequency variations. Link
Sequence to Sequence Learning with Neural Networks Ilya Sutskever et al 2014 seq2seq Arxiv The paper introduces an end to end approach for sequence learning using multilayered Long Short Term Memory (LSTM) networks. This method requires minimal assumptions about the structure of the sequences and effectively maps input sequences to a fixed dimensional vector using one LSTM layer, and decodes target sequences using another deep LSTM layer. Link
Tiled convolutional neural networks Quoc Le et al 2010 tiling, cnn NeurIPS Tiled CNNs introduce a novel approach to learning invariances by using a regular "tiled" pattern of tied weights. Unlike traditional CNNs where adjacent hidden units share identical weights, Tiled CNNs require only that hidden units at a certain distance from each other share tied weights. This allows the network to learn complex invariances such as scale and rotational invariance, in addition to translational invariance. Link
Unsupervised Learning of Image Manifolds by Semidefinite Programming Kilian Weinberger et al 2004 manifold learning, dimensionality reduction IEEE The paper proposes a new approach to detect low dimensional structure in high dimensional datasets using semidefinite programming (SDP). SDP is leveraged to analyze data that resides on or near a low dimensional manifold, which is a common challenge in computer vision and pattern recognition. The algorithm introduced overcomes limitations observed in previous manifold learning techniques like Isomap and locally linear embedding (LLE). These traditional methods often struggle with certain types of data distributions or computational complexities, which the proposed SDP based approach aims to address more effectively. Link