Here's where I keep a list of papers I have read.

This list is curated by Lexington Whalen, beginning in the first year of his PhD and continuing onward. As he is me, I hope he keeps going!

I typically use this to organize papers I found interesting. Please feel free to do whatever you want with it. Note that this is not every single paper I have ever read, just a collection of ones that I remember to put down.

So far, we have read 176 papers. Let's keep it up!

Each entry lists: Title, Author, Year, Topic, Publication Venue, Description, and a Link.
AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising Zigeng Chen, et al 2024 diffusion, parallelization, denoising Arxiv This paper introduces AsyncDiff, a novel approach to accelerate diffusion models through parallel processing across multiple devices. The key insight is that hidden states between consecutive diffusion steps are highly similar, which allows them to break the traditional sequential dependency chain of the denoising process by transforming it into an asynchronous one. They execute this by dividing the denoising model into multiple components distributed across different devices, where each component uses the output from the previous component's prior step as an approximation of its input, enabling parallel computation. To further enhance efficiency, they introduce stride denoising, which completes multiple denoising steps simultaneously through a single parallel computation batch and reduces the frequency of communication between devices. This solution is particularly elegant because it's universal and plug-and-play, requiring no model retraining or architectural changes to achieve significant speedups while maintaining generation quality. Link
DoRA: Weight-Decomposed Low-Rank Adaptation Shih-Yang Liu et al 2024 peft, lora Arxiv This paper introduces DoRA (Weight-Decomposed Low-Rank Adaptation), a novel parameter-efficient fine-tuning method that decomposes pre-trained weights into magnitude and direction components for separate optimization. Through a detailed weight decomposition analysis, the authors reveal that LoRA and full fine-tuning exhibit distinct learning patterns, with LoRA showing proportional changes in magnitude and direction while full fine-tuning demonstrates more nuanced, independent adjustments between these components. Based on this insight, DoRA uses LoRA specifically for directional updates while allowing independent magnitude optimization, which simplifies the learning task compared to having LoRA learn both components simultaneously. The authors also provide theoretical analysis showing how this decomposition benefits optimization by aligning the gradient's covariance matrix more closely with the identity matrix and demonstrate mathematically why DoRA's learning pattern more closely resembles full fine-tuning. Link
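To make the magnitude/direction split concrete, here is a minimal NumPy sketch of a DoRA-style recombination for a single linear layer. The shapes, the column-wise norm, and the initialization are illustrative assumptions rather than the paper's exact implementation; B and A play the role of the LoRA factors that handle the directional update while m is the separately trainable magnitude.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 32, 4

W0 = rng.normal(size=(d_out, d_in))              # frozen pre-trained weight
m  = np.linalg.norm(W0, axis=0, keepdims=True)   # trainable magnitude (one scalar per column)
B  = np.zeros((d_out, r))                        # LoRA factors handle the *direction*
A  = rng.normal(size=(r, d_in)) * 0.01

def dora_weight(W0, m, B, A, eps=1e-8):
    """Recombine magnitude and direction: W' = m * (W0 + BA) / ||W0 + BA||_col."""
    V = W0 + B @ A                               # directional component
    V_norm = np.linalg.norm(V, axis=0, keepdims=True)
    return m * V / (V_norm + eps)

x = rng.normal(size=(d_in,))
y = dora_weight(W0, m, B, A) @ x                 # forward pass with the adapted weight
print(y.shape)                                   # (64,)
```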
SphereFed: Hyperspherical Federated Learning Xin Dong et al 2022 federated learning Arxiv This paper presents a novel approach to addressing the non-i.i.d. (non-independent and identically distributed) data challenge in federated learning by introducing hyperspherical federated learning (SphereFed). The key insight is that instead of letting clients independently learn their classifiers, which leads to inconsistent learning targets across clients, they should share a fixed classifier whose weights span a unit hypersphere, ensuring all clients work toward the same learning objectives. The approach normalizes features to project them onto this same hypersphere and uses mean squared error loss instead of cross-entropy to avoid scaling issues that arise when working with normalized features. Finally, after federated training is complete, they propose a computationally efficient way to calibrate the classifier using a closed-form solution that can be computed in a distributed manner without requiring direct access to private client data. Link
A deeper look at depth pruning of LLMs Shoaib Ahmed Siddiqui et al 2024 pruning, depth pruning, llm ICML This paper explored different approaches to pruning large language models, revealing that while static metrics like cosine similarity work well for maintaining MMLU performance, adaptive metrics like Shapley values show interesting trade-offs between different tasks. A key insight was that self-attention layers are significantly more amenable to pruning compared to feed-forward layers, suggesting that models can maintain performance even with substantial attention layer reduction. The paper also demonstrated that simple performance recovery techniques, like applying an average update in place of removed layers, can be as effective or better than more complex approaches like low-rank adapters. Finally, the work highlighted how pruning affects different tasks unequally - while some metrics preserve performance on certain tasks like MMLU, they may significantly degrade performance on others like mathematical reasoning tasks. Link
Editing Models with Task Arithmetic Gabriel Ilharco et al 2023 task arithmetic, finetuning, task ICLR This paper introduces a novel method for model editing called task arithmetic, where "task vectors" represent specific tasks by capturing the difference between pre-trained and fine-tuned model weights. Task vectors can be manipulated mathematically, such as being negated to unlearn tasks or added together to enable multi-tasking or improve performance in novel settings. A standout finding is the ability to create new task capabilities through analogies (e.g., "A is to B as C is to D"), which allows performance improvement on tasks with little or no data. This method is computationally efficient, leveraging linear operations on model weights without incurring extra inference costs, providing a flexible and modular framework for modifying models post-training. The approach highlights significant advantages in adapting existing models while bypassing costly re-training or data access constraints. Link
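A minimal sketch of task-vector arithmetic on flat parameter dictionaries, under the assumption that a pre-trained and a fine-tuned checkpoint share the same keys and shapes; the toy tensors and scaling coefficients below are placeholders, not anything from the paper.

```python
import numpy as np

def task_vector(pretrained, finetuned):
    """tau = theta_finetuned - theta_pretrained, per parameter tensor."""
    return {k: finetuned[k] - pretrained[k] for k in pretrained}

def apply(pretrained, vectors, coeffs):
    """theta_new = theta_pretrained + sum_i lambda_i * tau_i."""
    out = {k: v.copy() for k, v in pretrained.items()}
    for tau, lam in zip(vectors, coeffs):
        for k in out:
            out[k] += lam * tau[k]
    return out

rng = np.random.default_rng(0)
theta_pre = {"w": rng.normal(size=(4, 4))}
theta_a   = {"w": theta_pre["w"] + 0.1 * rng.normal(size=(4, 4))}  # "fine-tuned on task A"
tau_a = task_vector(theta_pre, theta_a)

theta_add    = apply(theta_pre, [tau_a], [+1.0])   # add the task
theta_forget = apply(theta_pre, [tau_a], [-1.0])   # negate the vector to unlearn it
```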
SaySelf: Teaching LLMs to Express Confidence with Self-Reflective Rationales Tianyang Xu et al 2024 confidence estimation, llm Arxiv The SaySelf framework trains large language models (LLMs) to produce fine-grained confidence estimates and self-reflective rationales by focusing on internal uncertainties. It consists of two stages: supervised fine-tuning and reinforcement learning (RL). In the first stage, multiple reasoning chains are sampled from the LLM, clustered for semantic similarity, and analyzed by an advanced LLM to generate rationales summarizing uncertainties. The model is fine-tuned on a dataset that pairs questions with reasoning chains, rationales, and confidence estimates, using a loss function that optimizes the generation of all three outputs. In the second stage, RL refines the confidence predictions using a reward function that encourages accurate, high-confidence outputs while penalizing overconfidence in incorrect responses. The framework ensures that LLMs not only generate confidence scores but also provide explanations for their uncertainty, making their outputs more interpretable and calibrated. Link
Deep Reinforcement Learning from Human Preferences Paul F Christiano et al 2017 rl, rlhf Arxiv This paper introduces a method to train reinforcement learning (RL) systems using human preferences over trajectory segments rather than traditional reward functions. The approach allows agents to learn tasks that are hard to define programmatically, enabling non-expert users to provide feedback on agent behavior through comparisons of short video clips. By learning a reward model from these preferences, the method dramatically reduces the need for human oversight while maintaining adaptability to large-scale and complex RL environments. This paradigm bridges the gap between human-defined objectives and scalable RL systems, addressing challenges in alignment and usability for real-world applications. Link
The Cost of Down-Scaling Language Models: Fact Recall Deteriorates before In-Context Learning Tian Jin et al 2023 pruning, icl Arxiv This paper explores the effects of scaling the parameter count of large language models (LLMs) on two distinct capabilities: fact recall from pre-training and in-context learning (ICL). By investigating both dense scaling (training models of varying sizes) and pruning (removing weights), the authors identify that these approaches disproportionately affect fact recall while preserving ICL abilities. They demonstrate that a model's ability to learn from in-context information remains robust under significant parameter reductions, whereas the ability to recall pre-trained facts degrades with even moderate scaling down. This dichotomy highlights a fundamental difference in how these capabilities rely on model size and opens avenues for more efficient model design and deployment, emphasizing trade-offs between memory augmentation and parameter efficiency. Link
Fine-Tuning Language Models with Just Forward Passes Sadhika Malladi et al 2024 finetuning, zo, optimization Arxiv The paper introduces MeZO, a memory-efficient zeroth-order optimization method, to fine-tune large language models using forward passes alone. Classical zeroth-order methods scale poorly with model size, but MeZO adapts these approaches to leverage structured pre-trained model landscapes, avoiding catastrophic slowdown even with billions of parameters. The authors theoretically show that MeZO’s convergence depends on the local effective rank of the Hessian, not the number of parameters, enabling efficient optimization despite prior bounds suggesting otherwise. Furthermore, MeZO’s flexibility allows optimization of non-differentiable objectives (e.g., accuracy or F1 score) and compatibility with parameter-efficient tuning methods like LoRA and prefix-tuning. Link
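A toy sketch of the two-forward-pass zeroth-order estimator that MeZO builds on; the quadratic loss stands in for a real model's forward pass, and resampling the perturbation from a stored seed is the memory-saving trick, shown here only schematically.

```python
import numpy as np

def loss(theta):
    # placeholder objective; in MeZO this would be a full forward pass of the LLM
    return np.sum((theta - 1.0) ** 2)

def mezo_step(theta, lr=0.01, eps=1e-3, seed=0):
    # MeZO resamples z from the stored seed instead of keeping it in memory
    z = np.random.default_rng(seed).normal(size=theta.shape)
    proj_grad = (loss(theta + eps * z) - loss(theta - eps * z)) / (2 * eps)
    return theta - lr * proj_grad * z

theta = np.zeros(10)
for step in range(200):
    theta = mezo_step(theta, seed=step)
print(theta.round(2))   # moves toward the optimum at 1.0
```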
ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference Hanshi Sun et al 2024 kv cache Arxiv The key insight of this paper lies in optimizing long-context large language model inference by addressing the memory and latency bottlenecks associated with managing the key-value (KV) cache. The authors observe that pre-Rotary Position Embedding (RoPE) keys exhibit a low-rank structure, allowing them to be compressed without accuracy loss, while value caches lack this property and are therefore offloaded to the CPU to reduce GPU memory usage. To minimize decoding latency, they leverage landmarks—compact representations of the low-rank key cache—and identify a small set of outliers to be retained on the GPU, enabling efficient reconstruction of sparse KV pairs on-the-fly. This approach allows the system to handle significantly longer contexts and larger batch sizes while maintaining inference throughput and accuracy. Link
LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning Rui Pan et al 2024 peft, finetuning, sampling Arxiv The key insight of this paper is the discovery of a skewed weight-norm distribution across layers during LoRA fine-tuning, where the majority of updates occur in the bottom (embedding) and top (language modeling head) layers, leaving middle layers underutilized. This highlights that different layers have varied importance and suggests that selectively updating layers could improve efficiency without sacrificing performance. Building on this, the authors propose Layerwise Importance Sampling AdamW (LISA), which randomly freezes most middle layers during training, using importance sampling to emulate LoRA’s fast learning pattern while avoiding its low-rank constraints. This approach achieves significant memory savings, faster convergence, and superior performance compared to LoRA and full-parameter fine-tuning, particularly in large-scale and domain-specific tasks. Link
SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models Muyang Li et al 2024 quantization, diffusion Arxiv SVDQuant introduces a novel approach to 4-bit quantization of diffusion models by using a low-rank branch to absorb outliers in both weights and activations, making quantization more feasible at such aggressive bit reduction. The method first consolidates outliers from activations to weights through smoothing, then decomposes the weights using Singular Value Decomposition (SVD) to separate the dominant components into a 16-bit low-rank branch while keeping the residual in 4 bits. To make this practical, they developed an inference engine called Nunchaku that fuses the low-rank and low-bit branch kernels together, eliminating redundant memory access that would otherwise negate the performance benefits. The approach is designed to work across different diffusion model architectures and can seamlessly integrate with existing low-rank adapters (LoRAs) without requiring re-quantization. Link
One Weight Bitwidth to Rule Them All Ting-Wu Chin et al 2020 quantization, bitwidth Arxiv This paper examines weight quantization in deep neural networks and challenges the common assumption that using the lowest possible bitwidth without accuracy loss is optimal. The key insight is that when considering model size as a constraint and allowing network width to vary, some bitwidths consistently outperform others - specifically, networks with standard convolutions work better with binary weights while networks with depthwise convolutions prefer higher bitwidths. The authors discover that this difference is related to the number of input channels (fan-in) per convolutional kernel, with higher fan-in making networks more resilient to aggressive quantization. Most surprisingly, they demonstrate that using a single well-chosen bitwidth throughout the network can outperform more complex mixed-precision quantization approaches when comparing networks of equal size, suggesting that the traditional focus on minimizing bitwidth without considering network width may be suboptimal. Link
Consistency Models Yang Song et al 2023 diffusion, ode, consistency ICML This paper introduces consistency models, a new family of generative models that can generate high-quality samples in a single step while preserving the ability to trade compute for quality through multi-step sampling. The key innovation is training models to map any point on a probability flow ODE trajectory to its origin point, enforcing consistency across different time steps through either distillation from pre-trained diffusion models or direct training. The models support zero-shot data editing capabilities like inpainting, colorization, and super-resolution without requiring explicit training on these tasks, similar to diffusion models. The authors provide two training approaches - consistency distillation which leverages existing diffusion models, and consistency training which allows training from scratch without any pre-trained models, establishing consistency models as an independent class of generative models. Link
One Step Diffusion via ShortCut Models Kevin Frans et al 2024 diffusion, ode, flow-matching Arxiv This paper introduces shortcut models, a new type of diffusion model that enables high-quality image generation in a single forward pass by conditioning the model not only on the timestep but also on the desired step size, allowing it to learn larger jumps during the denoising process. Unlike previous approaches that require multiple training phases or complex scheduling, shortcut models can be trained end-to-end in a single phase by leveraging a self-consistency property where one large step should equal two consecutive smaller steps, combined with flow-matching loss as a base case. The key insight is that by conditioning on step size, the model can account for future curvature in the denoising path and jump directly to the correct next point rather than following the curved path naively, which would lead to errors with large steps. The approach simplifies the training pipeline while maintaining flexibility in inference budget, as the same model can generate samples using either single or multiple steps after training. Link
Attention-Driven Training-Free Efficiency Enhancement of Diffusion Models Hongjie Wang et al 2024 diffusion, training-free, attention, token pruning CVPR This paper introduces AT-EDM, a training-free framework to accelerate diffusion models by pruning redundant tokens during inference without requiring model retraining. The key innovation is a Generalized Weighted PageRank (G-WPR) algorithm that uses attention maps to identify and prune less important tokens, along with a novel similarity-based token recovery method that fills in pruned tokens based on attention patterns to maintain compatibility with convolutional layers. The authors also propose a Denoising-Steps-Aware Pruning (DSAP) schedule that prunes fewer tokens in early denoising steps when attention maps are more chaotic and less informative, and more tokens in later steps when attention patterns are better established. The overall approach focuses on making diffusion models more efficient by leveraging the rich information contained in attention maps to guide token pruning decisions while maintaining image generation quality. Link
Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks Tim Salimans et al 2016 normalization, gradient descent Arxiv This paper introduces weight normalization, a simple reparameterization technique that decouples a neural network's weight vectors into their direction and magnitude by expressing w = (g/||v||)v, where g is a scalar and v is a vector. The key insight is that this decoupling improves optimization by making the conditioning of the gradient better - the direction and scale of weight updates can be learned somewhat independently, which helps avoid problems with pathological curvature in the optimization landscape. While inspired by batch normalization, weight normalization is deterministic and doesn't add noise to gradients or create dependencies between minibatch examples, making it well-suited for scenarios like reinforcement learning and RNNs where batch normalization is problematic. The authors also propose a data-dependent initialization scheme where g and bias terms are initialized to normalize the initial pre-activations of neurons, helping ensure good scaling of activations across layers at the start of training. Link
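A small NumPy sketch of the reparameterization w = (g/||v||)·v for one layer, applied row-wise so each neuron's scale is decoupled from its direction; the shapes and the data-dependent initialization below are simplified assumptions in the spirit of the paper, not its exact recipe.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in = 8, 16
v = rng.normal(size=(d_out, d_in))            # direction parameters (one row per neuron)
g = np.ones((d_out, 1))                       # magnitude parameters

def weight_norm(v, g, eps=1e-8):
    """w = g * v / ||v||, so magnitude and direction can be learned separately."""
    return g * v / (np.linalg.norm(v, axis=1, keepdims=True) + eps)

x = rng.normal(size=(d_in, 32))               # a batch of 32 inputs
pre_act = weight_norm(v, g) @ x

# data-dependent init: rescale g and set the bias so the initial
# pre-activations have roughly zero mean and unit variance
mu, sigma = pre_act.mean(axis=1, keepdims=True), pre_act.std(axis=1, keepdims=True)
g = g / sigma
b = -mu / sigma
```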
Applying Guidance in a Limited Interval Improves Sample and Distribution Quality in Diffusion Models Tuomas Kynkäänniemi et al 2024 diffusion, cfg, guidance Arxiv This paper's key insight is that classifier-free guidance (CFG) in diffusion models should only be applied during a specific interval of noise levels in the middle of the sampling process, rather than throughout the entire sampling chain as traditionally done. The intuition is that guidance is harmful at high noise levels (where it causes mode collapse and template-like outputs), largely unnecessary at low noise levels, and only truly beneficial in the middle range. They demonstrate this theoretically using a 1D synthetic example where they can visualize how guidance at high noise levels causes sampling trajectories to drift far from the smoothed data distribution, leading to mode dropping. Beyond this theoretical demonstration, they propose a simple solution of making the guidance weight a piecewise function that only applies guidance within a specific noise level interval. Link
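A sketch of the interval-gated guidance weight the paper proposes, dropped into a generic denoising call; the noise-level thresholds, the guidance scale, the denoiser signatures, and the toy stand-in models are all placeholder assumptions rather than the paper's tuned values.

```python
import numpy as np

def guidance_weight(sigma, w=3.0, sigma_lo=0.3, sigma_hi=5.0):
    """Apply guidance only for noise levels inside (sigma_lo, sigma_hi); elsewhere use w = 1."""
    return w if sigma_lo < sigma < sigma_hi else 1.0

def guided_denoise(denoise_cond, denoise_uncond, x, sigma):
    w = guidance_weight(sigma)
    d_c, d_u = denoise_cond(x, sigma), denoise_uncond(x, sigma)
    return d_u + w * (d_c - d_u)              # usual CFG combination, with the interval-gated weight

# toy stand-ins for the conditional / unconditional denoisers
denoise_c = lambda x, sigma: 0.9 * x
denoise_u = lambda x, sigma: 0.8 * x

x = np.ones(4)
for sigma in [10.0, 2.0, 0.1]:                # high, mid, and low noise levels
    print(sigma, guided_denoise(denoise_c, denoise_u, x, sigma))
```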
Cache Me if You Can: Accelerating Diffusion Models through Block Caching Felix Wimbauer et al 2024 diffusion, caching, distillation Arxiv This paper introduces "block caching" to accelerate diffusion models by reusing computations across denoising steps. The key insight is that many layer blocks (particularly attention blocks) in diffusion models change very gradually during the denoising process, making their repeated computation redundant. The authors propose automatically determining which blocks to cache and when to refresh them based on measuring the relative changes in block outputs across timesteps. They also introduce a lightweight scale-shift adjustment mechanism that uses a student-teacher setup, where the student (cached model) learns additional scale and shift parameters to better align its cached block outputs with those of the teacher (uncached model), while keeping the original model weights frozen. Link
DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads Guangxuan Xiao et al 2024 llm, kv cache, attention Arxiv The key insight of DuoAttention is the observation that attention heads in LLMs naturally fall into two distinct categories: retrieval heads that need to access the full context to make connections across long distances, and streaming heads that mainly focus on recent tokens and attention sinks. This dichotomy makes intuitive sense because not all parts of language processing require long-range dependencies - while some aspects like fact recall or logical reasoning need broad context, others like local grammar or immediate context processing can work with nearby tokens. The paper's approach of using optimization to identify these heads (rather than just looking at attention patterns) is clever because it directly measures the impact on model outputs, capturing the true functional role of each head rather than just its surface behavior. Finally, the insight to maintain two separate KV caches (full for retrieval heads, minimal for streaming heads) is an elegant way to preserve the model's capabilities while reducing memory usage, since it aligns the memory allocation with each head's actual needs. Link
Efficient Streaming Language Models with Attention Sinks Guangxuan Xiao et al 2024 llm, kv cache, attention ICLR This paper introduces StreamingLLM, a framework that enables large language models to process infinitely long text sequences efficiently without fine-tuning, based on a key insight about "attention sinks." The authors discover that LLMs allocate surprisingly high attention scores to initial tokens regardless of their semantic relevance, which they explain is due to the softmax operation requiring attention scores to sum to one - even when a token has no strong matches in context, the model must distribute attention somewhere, and initial tokens become natural "sinks" since they're visible to all subsequent tokens during autoregressive training. Building on this insight, StreamingLLM maintains just a few initial tokens (as attention sinks) along with a sliding window of recent tokens, achieving up to 22.2x speedup compared to baselines while maintaining performance on sequences up to 4 million tokens long. Additionally, they show that incorporating a dedicated learnable "sink token" during model pre-training can further improve streaming capabilities by providing an explicit token for collecting excess attention. Link
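A minimal sketch of the StreamingLLM-style cache policy (keep a few initial "sink" tokens plus a sliding window of recent tokens). The cache here is just a list of token ids, so positional re-indexing and the actual attention computation are omitted, and the sink/window sizes are illustrative.

```python
def evict(cache, n_sink=4, window=8):
    """Keep the first n_sink entries (attention sinks) plus the most recent `window` entries."""
    if len(cache) <= n_sink + window:
        return cache
    return cache[:n_sink] + cache[-window:]

cache = []
for token_id in range(20):                    # pretend we decode 20 tokens
    cache.append(token_id)
    cache = evict(cache)

print(cache)   # [0, 1, 2, 3, 12, 13, ..., 19] -> sinks + sliding window
```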
MagicPIG: LSH Sampling for Efficient LLM Generation Zhuoming Chen et al 2024 llm, kv cache Arxiv This paper challenges the common assumption that attention in LLMs is naturally sparse, showing that TopK attention (selecting only the highest attention scores) can significantly degrade performance on tasks that require aggregating information across the full context. The authors demonstrate that sampling-based approaches to attention can be more effective than TopK selection, leading them to develop MagicPIG, a system that uses Locality Sensitive Hashing (LSH) to efficiently sample attention keys and values. A key insight is that the geometry of attention in LLMs has specific patterns - notably that the initial attention sink token remains almost static regardless of input, and that query and key vectors typically lie in opposite directions - which helps explain why simple TopK selection is suboptimal. Their solution involves a heterogeneous system design that leverages both GPU and CPU resources, with hash computations on GPU and attention computation on CPU, allowing for efficient processing of longer contexts while maintaining accuracy. Link
Guiding a Diffusion Model with a Bad Version of Itself Tero Karras et al 2024 diffusion, guidance Arxiv The paper makes two key contributions: First, they show that Classifier-Free Guidance (CFG) improves image quality not just through better prompt alignment, but because the unconditional model D0 learns a more spread-out distribution than the conditional model D1, causing the guidance term ∇_x log(p₁/p₀) to push samples toward high-probability regions of the data manifold. Second, based on this insight, they introduce "autoguidance" - using a smaller, less-trained version of the model itself as the guiding model D0 rather than an unconditional model, which allows for quality improvements without reducing variation and works even for unconditional models. Link
LLM-Pruner: On the Structural Pruning of Large Language Models Xinyin Ma et al 2023 llm, structural pruning Arxiv The authors introduce LLM-Pruner, a novel approach for compressing large language models that operates in a task-agnostic manner while requiring minimal access to the original training data. Their key insight is to first automatically identify groups of interdependent neural structures within the LLM by analyzing dependency patterns, ensuring that coupled structures are pruned together to maintain model coherence. The method then estimates the importance of these structural groups using both first-order gradients and approximated Hessian information from a small set of calibration samples, allowing them to selectively remove less critical groups while preserving the model's core functionality. Finally, they employ a rapid recovery phase using low-rank adaptation (LoRA) to fine-tune the pruned model with a limited dataset in just a few hours, enabling efficient compression while maintaining the LLM's general-purpose capabilities. Link
SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models Guangxuan Xiao et al 2023 llm, quantization, activations ICML The key insight of SmoothQuant is that in large language models, while weights are relatively easy to quantize, activations are much harder due to outliers. They observed that these outliers persistently appear in specific channels across different tokens, suggesting that the difficulty could be redistributed. Their solution is to mathematically transform the model by scaling down problematic activation channels while scaling up the corresponding weight channels proportionally, which maintains mathematical equivalence while making both weights and activations easier to quantize. This "difficulty migration" approach allows them to balance the quantization challenges between weights and activations using a tunable parameter α, rather than having all the difficulty concentrated in the activation values. Link
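A NumPy sketch of the per-channel "difficulty migration": divide each activation channel by s_j and multiply the matching weight row by s_j, which leaves X·W unchanged while shrinking activation outliers. The α value, tensor shapes, and injected outlier are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(128, 64))                # activations: (tokens, in_channels)
X[:, 7] *= 50.0                               # inject an outlier channel
W = rng.normal(size=(64, 32))                 # weights: (in_channels, out_channels)

alpha = 0.5
s = (np.abs(X).max(axis=0) ** alpha) / (np.abs(W).max(axis=1) ** (1 - alpha))

X_smooth = X / s                              # activation channel j divided by s_j
W_smooth = W * s[:, None]                     # weight row j multiplied by s_j

assert np.allclose(X @ W, X_smooth @ W_smooth)        # mathematically equivalent
print(np.abs(X).max(), np.abs(X_smooth).max())        # outlier magnitude is reduced
```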
ESPACE: Dimensionality Reduction of Activations for Model Compression Charbel Sakr et al 2024 llm, dimensionality reduction, activations, compression NeurIPS Instead of decomposing weight matrices as done in previous work, ESPACE reduces the dimensionality of activation tensors by projecting them onto a pre-calibrated set of principal components using a static projection matrix P, where for an activation x, its projection is x̃ = PPᵀx. The projection matrix P is carefully constructed (using eigendecomposition of activation statistics) to preserve the most important components while reducing dimensionality, taking advantage of natural redundancies that exist in activation patterns due to properties like the Central Limit Theorem when stacking sequence/batch dimensions. During training, the weights remain uncompressed and fully trainable (maintaining model expressivity), while at inference time, the weight matrices can be folded with the projection matrix offline (storing WᵀP) to achieve compression through matrix multiplication associativity: Y = WᵀX ≈ Wᵀ(PPᵀX) = (WᵀP)(PᵀX). This activation-centric approach is fundamentally different from previous methods because it maintains full model expressivity during training while still achieving compression at inference time, and it takes advantage of natural statistical redundancies in activation patterns rather than trying to directly compress weights. Link
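A NumPy sketch of the activation-side projection: build P from the eigenvectors of calibration activation statistics, then fold WᵀP offline so inference multiplies two smaller matrices. The calibration set, the rank choice, and the synthetic low-rank structure are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_out, k = 64, 32, 16                      # activation dim, output dim, kept components

# calibration activations with hidden low-rank structure (so the projection has something to find)
basis = np.linalg.qr(rng.normal(size=(d, k)))[0]
X_cal = basis @ rng.normal(size=(k, 4096)) + 0.01 * rng.normal(size=(d, 4096))
W = rng.normal(size=(d, d_out))               # layer weight, used as y = W^T x

cov = X_cal @ X_cal.T / X_cal.shape[1]        # activation second-moment statistics
eigvals, eigvecs = np.linalg.eigh(cov)
P = eigvecs[:, -k:]                           # top-k eigenvectors -> projection matrix (d, k)

W_folded = W.T @ P                            # fold W^T P offline, shape (d_out, k)

x = basis @ rng.normal(size=(k,)) + 0.01 * rng.normal(size=(d,))
y_exact = W.T @ x
y_approx = W_folded @ (P.T @ x)               # (W^T P)(P^T x): two smaller matmuls at inference
print(np.linalg.norm(y_exact - y_approx) / np.linalg.norm(y_exact))   # small relative error
```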
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model Chunting Zhou et al 2024 diffusion, transformer, multi-modal Arxiv The key insight of this paper is that a single transformer model can effectively handle both discrete data (like text) and continuous data (like images) by using different training objectives for each modality within the same model. They introduce "Transfusion," which uses traditional language modeling (next token prediction) for text sequences while simultaneously applying diffusion modeling for image sequences, combining these distinct objectives into a unified training approach. The architecture employs a novel attention pattern that allows for causal attention across the entire sequence while enabling bidirectional attention within individual images, letting image patches attend to each other freely while maintaining proper causality for text generation. This unified approach avoids the need for separate specialized models or complex architectures while still allowing each modality to be processed according to its most effective paradigm. Link
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection Jiawei Zhao et al 2024 lora, low-rank projection ICML This paper introduces GaLore, a memory-efficient approach for training large language models that exploits the inherent low-rank structure of gradients rather than imposing low-rank constraints on the model weights themselves. The key insight is that while weight matrices may need to be full-rank for optimal performance, their gradients naturally become low-rank during training due to the specific structure of backpropagated gradients in neural networks, particularly in cases where the batch size is smaller than the matrix dimensions or when the gradients follow certain parametric forms. Building on this observation, GaLore projects gradients into low-rank spaces for memory-efficient optimization while still allowing full-parameter learning, contrasting with previous approaches like LoRA that restrict the weight updates to low-rank spaces. By periodically switching between different low-rank subspaces during training, GaLore maintains the flexibility of full-rank training while significantly reducing memory usage, particularly in storing optimizer states. Link
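A simplified NumPy sketch of gradient low-rank projection: project the gradient onto a subspace taken from its SVD, keep optimizer state only in that small space, and project the update back, refreshing the subspace periodically. The rank, refresh interval, toy objective, and plain momentum state are simplifications of GaLore's Adam-based version.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 256, 128, 4
W = rng.normal(size=(m, n))
M = np.zeros((r, n))                           # momentum lives in the r x n projected space
P = None

def grad(W):                                   # toy objective: pull W toward a rank-1 target
    target = np.outer(np.ones(m), np.ones(n))
    return W - target

for step in range(200):
    G = grad(W)
    if step % 50 == 0:                         # periodically refresh the low-rank subspace
        U, _, _ = np.linalg.svd(G, full_matrices=False)
        P = U[:, :r]                           # (m, r) projection
    R = P.T @ G                                # projected gradient, (r, n)
    M = 0.9 * M + R                            # optimizer state is only r x n, not m x n
    W = W - 0.1 * (P @ M)                      # project the update back to full size

print(np.linalg.norm(grad(W)))                 # residual shrinks within the chosen subspaces
```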
Neural Discrete Representation Learning Aaron van den Oord et al 2017 generative models, vae NeurIPS The key innovation of this paper is the introduction of the Vector Quantised-Variational AutoEncoder (VQ-VAE), which combines vector quantization with VAEs to learn discrete latent representations instead of continuous ones. Unlike previous approaches to discrete latent variables which struggled with high variance or optimization challenges, VQ-VAE uses a simple but effective nearest-neighbor lookup system in the latent space, along with a straight-through gradient estimator, to learn meaningful discrete codes. This approach allows the model to avoid the common posterior collapse problem where latents are ignored when paired with powerful decoders, while still maintaining good reconstruction quality comparable to continuous VAEs. The discrete nature of the latent space enables the model to focus on capturing important high-level features that span many dimensions in the input space (like objects in images or phonemes in speech) rather than local details, and these discrete latents can then be effectively modeled using powerful autoregressive priors for generation. Link
Improved Precision and Recall Metric for Assessing Generative Models Tuomas Kynkaanniemi et al 2019 generative models, precision, recall NeurIPS This paper introduces an improved metric for evaluating generative models by separately measuring precision (quality of generated samples) and recall (coverage/diversity of generated distribution) using k-nearest neighbors to construct non-parametric manifold approximations of real and generated data distributions. The authors demonstrate their metric's effectiveness using StyleGAN and BigGAN, showing how it provides more nuanced insights than existing metrics like FID, particularly in revealing tradeoffs between image quality and variation that other metrics obscure. They use their metric to analyze and improve StyleGAN's architecture and training configurations, identifying new variants that achieve state-of-the-art results, and perform the first principled analysis of truncation methods. Finally, they extend their metric to evaluate individual sample quality, enabling quality assessment of interpolations and providing insights into the shape of the latent space that produces realistic images. Link
Generative Pretraining from Pixels Mark Chen et al 2020 pretraining, gpt PMLR The paper demonstrates that transformer models can learn high-quality image representations by simply predicting pixels in a generative way, without incorporating any knowledge of the 2D structure of images. They show that as the generative models get better at predicting pixels (measured by log probability), they also learn better representations that can be used for downstream image classification tasks. The authors discover that, unlike in supervised learning where the best representations are in the final layers, their generative models learn the best representations in the middle layers - suggesting the model first builds up representations before using them to predict pixels. Finally, while their approach requires significant compute and works best at lower resolutions, it achieves competitive results with other self-supervised methods and shows that generative pre-training can be a promising direction for learning visual representations without labels. Link
Why Does Unsupervised Pre-Training Help Deep Learning? Dumitru Erhan et al 2010 pretraining, unsupervised JMLR This paper argues that standard training schemes place parameters in regions of the parameter space that generalize poorly, while greedy layer-wise unsupervised pre-training allows each layer to learn a nonlinear transformation of its input that captures the main variations in the input, which acts as a regularizer: minimizing variance and introducing bias towards good initializations for the parameters. They argue that defining particular initialization points implicitly imposes constraints on the parameters in that it specifies which minima (out of many possible minima) of the cost function are allowed. They further argue that small perturbations in the trajectory of the parameters have a larger effect early on, and hint that early examples have larger influence and may trap model parameters in particular regions of parameter space corresponding to the arbitrary ordering of training examples (similar to the "critical period" in developmental psychology). Link
Improving Language Understanding by Generative Pre-Training Alec Radford et al 2018 pretraining Arxiv The key insight of this paper is that language models can learn deep linguistic and world knowledge through unsupervised pre-training on large corpora of contiguous text, which can then be effectively transferred to downstream tasks. The authors demonstrate this by using a Transformer architecture that can capture long-range dependencies, pre-training it on a books dataset that contains extended narratives rather than shuffled sentences, making it particularly effective at understanding context. Their innovation extends to how they handle transfer learning - rather than creating complex task-specific architectures, they show that simple input transformations can adapt their pre-trained model to various tasks while preserving its learned capabilities. This elegant approach proves remarkably effective, with their single task-agnostic model outperforming specially-designed architectures across nine different natural language understanding tasks, suggesting that their pre-training method captures fundamental aspects of language understanding. Link
Learning Transferable Visual Models from Natural Language Supervision Alec Radford et al 2021 CLIP Arxiv CLIP (Contrastive Language-Image Pre-training) works by simultaneously training two neural networks - one that encodes images and another that encodes text - to project their inputs into a shared multi-dimensional space where similar concepts end up close together. During training, CLIP takes a batch of image-text pairs and learns to identify which text descriptions actually match which images, doing this by maximizing the cosine similarity between embeddings of genuine pairs while minimizing similarity between mismatched pairs. The training data consists of hundreds of millions of (image, text) pairs collected from the internet, which helps CLIP learn broad visual concepts and their relationships to language without requiring hand-labeled data. What makes CLIP particularly powerful is its zero-shot capability - after training, it can make predictions about images it has never seen before by comparing them against any arbitrary text descriptions, rather than being limited to a fixed set of predetermined labels. Link
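A compact sketch of the symmetric contrastive objective CLIP trains with, using random features in place of real image/text encoders; the batch size, embedding dimension, and temperature are illustrative assumptions.

```python
import numpy as np

def l2_normalize(z):
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE: matching (i, i) pairs are positives, everything else negative."""
    logits = l2_normalize(img_emb) @ l2_normalize(txt_emb).T / temperature   # (N, N) cosine sims
    labels = np.arange(len(logits))
    def ce(l):                                   # row-wise softmax cross-entropy on the diagonal
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()
    return 0.5 * (ce(logits) + ce(logits.T))     # image->text and text->image directions

rng = np.random.default_rng(0)
img, txt = rng.normal(size=(8, 512)), rng.normal(size=(8, 512))   # stand-in encoder outputs
print(clip_loss(img, txt))
```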
Adam: A Method for Stochastic Optimization Diederik Kingma et al 2015 optimizers ICLR Adam combines momentum (through an exponential moving average of gradients m_t) and adaptive learning rates (through an exponential moving average of squared gradients v_t) to create an efficient optimizer, where m_t captures the direction of updates while v_t adapts the step size for each parameter based on its gradient history. The optimizer corrects initialization bias in these moving averages by scaling them with factors 1/(1-β₁ᵗ) and 1/(1-β₂ᵗ) respectively, ensuring unbiased estimates even in early training. The parameter update θ_t ← θ_{t-1} - α·m̂_t/(√v̂_t + ε), where hats denote the bias-corrected estimates, is invariant to gradient scaling because it uses the ratio m̂_t/√v̂_t, while the adaptive learning rate 1/√v̂_t approximates the diagonal of the Fisher Information Matrix's square root, making it a more conservative version of natural gradient descent that works well with sparse gradients and non-stationary objectives. The hyperparameters β₁ = 0.9 and β₂ = 0.999 mean the momentum term considers roughly the last 10 steps while the variance term considers the last 1000 steps, allowing Adam to both move quickly in consistent directions and stay careful in directions with high historical variance. Link
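A direct NumPy transcription of the update described above, including the bias-corrected moments; the toy quadratic objective and step count are only there to show the loop running.

```python
import numpy as np

def adam(grad_fn, theta, steps=1000, lr=1e-2, b1=0.9, b2=0.999, eps=1e-8):
    m = np.zeros_like(theta)                  # first moment (EMA of gradients)
    v = np.zeros_like(theta)                  # second moment (EMA of squared gradients)
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g**2
        m_hat = m / (1 - b1**t)               # bias correction for early steps
        v_hat = v / (1 - b2**t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# minimize ||theta - 3||^2 starting from zero
grad_fn = lambda th: 2 * (th - 3.0)
print(adam(grad_fn, np.zeros(5)).round(3))    # approaches 3.0
```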
Simplifying Neural Networks by Soft Weight-Sharing Steven Nowlan et al 1992 soft weight sharing, mog Neural Computation This paper tackles the challenge of penalizing complexity and preventing overfitting in neural networks. Traditional methods, like L2 regularization, penalize the sum of squared weights but can favor multiple weak connections over a single strong one, leading to suboptimal weight configurations. To address this, the authors propose a mixture of Gaussians (MoG) prior: a narrow Gaussian encourages small weights to shrink to zero, while a broad Gaussian preserves large weights essential for modeling the data accurately. By clustering weights into near-zero and larger groups, this data-driven regularization avoids forcing all weights toward zero equally and demonstrates better generalization on 12 toy tasks compared to early stopping and traditional squared-weight penalties. Link
DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models Muyang Li et al 2024 diffusion, distributed inference Arxiv DistriFusion introduces *displaced patch parallelism*, where the input image is split into patches, each processed independently by different GPUs. To maintain fidelity and reduce communication costs, the method reuses activations from the previous timestep as context for the current step, ensuring interaction between patches without excessive synchronization. Synchronous communication is only used at the initial step, while subsequent steps leverage asynchronous communication, hiding communication overhead within computation. This technique allows each device to process only a portion of the workload efficiently, avoiding artifacts and achieving scalable parallelism tailored to the sequential nature of diffusion models. Link
Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching Xinyin Ma et al 2024 diffusion, caching Arxiv This paper proposes interpolating between computationally cheap but suboptimal solutions and optimal but expensive ones by training a router to learn which layers of the diffusion transformer can be cached. Link
Flash Attention Tri Dao et al 2022 attention, transformer Arxiv This introduces FlashAttention, which is an IO-aware exact attention algo that uses tiling. Basically, they use tiling to prevent needing to put the large NxN attention matrix on GPU HBM; FlashAttention goes through blocks of the K and V matrices, loads them to on-chip SRAM, which increases speed! Neat! Link
Token Merging for Fast Stable Diffusion Daniel Bolya et al 2023 diffusion, token merging Arxiv This paper seeks to apply ToMe (https://arxiv.org/pdf/2210.09461) to diffusion models, introducing techniques for token partitioning (changing how the src and dst token sets are chosen and merged) and a token unmerging operation (which simply copies each merged token's averaged value back to the positions of the original tokens). Remarkably, this works very well! Link
DeepCache: Accelerating Diffusion Models for Free Xinyin Ma et al 2023 diffusion, cache Arxiv Similarly to Faster Diffusion (Senma Li et al, 2024), this paper uses the temporal redundancy in the denoising stages. They then cache features across the UNet by skipping some of the skip branches / paths. Basically, for timesteps t and t+1 that are similar, we can cache some of the high level features between them and directly use them. Also smart! Link
Faster Diffusion: Rethinking the Role of UNet Encoder in Diffusion Models Senmao Li et al 2024 diffusion, encoder NeurIPS This paper notes that the UNet encoder in diffusion models produces very similar features across adjacent timesteps (whereas the decoder varies much more). Thus, they cyclically reuse encoder features from earlier timesteps for the decoder. Smart! Link
Improved Denoising Diffusion Probabilistic Models Alex Nichol et al 2021 diffusion, precision, recall Arxiv This paper is the first to show that DDPMs can get competitive log-likelihoods. They use a reparameterization and a hybrid learning objective to more tightly optimize the variational lower bound, and find that their objective has less gradient noise during training. They use learned variances and find that they can get convincing samples using fewer steps. They also use the improved precision and recall metrics (Kynkaanniemi et al 2019) to show that diffusion models have higher recall for similar FID, which suggests they cover a large portion of the target distribution. They focused on optimizing log-likelihood as it is believed that optimizing it forces the model to capture all modes of the data distribution (Razavi et al 2019). Henighan et al 2020 also showed that small improvements in log-likelihood can dramatically impact sample quality and learned feature representations. The authors argue that fixing \sigma_{t} (as Ho et al 2020 does) is reasonable in terms of sample quality, but does not explain much about the log-likelihood. Thus, to improve log-likelihood they look for a better choice of \Sigma_{\theta}(x_{t},t), and choose to learn it. They note that it is better to parameterize the variance as an interpolation between \beta_{t} and \tilde{\beta}_{t} in the log domain. Remember that \beta_{t} is the noise schedule, typically a small value that increases over time following some schedule; \tilde{\beta}_{t} is a reparameterization of \beta_{t} (the posterior variance) used to simplify calculations, and they are related via \alpha_{t} = 1-\beta_{t}. Finally, they note that a linear noise schedule destroys information faster than necessary, and they propose a different noise schedule. Lots of insights! Link
Improving Training Efficiency of Diffusion Models via Multi-Stage Framework and Tailored Multi-Decoder Architecture Huijie Zhang et al 2024 diffusion, multi-stage CVPR This paper proposes a multi-stage framework for diffusion models that uses a shared encoder and separate decoders for different timestep intervals, along with an optimal denoiser-based timestep clustering method, to improve training and sampling efficiency while maintaining or enhancing image generation quality. Link
Temporal Dynamic Quantization for Diffusion Models Junhyuk So et al 2023 diffusion, quantization NeurIPS Temporal Dynamic Quantization (TDQ) addresses the challenge of quantizing diffusion models by dynamically adjusting quantization parameters based on the denoising time step. TDQ employs a trainable module consisting of frequency encoding, a multi-layer perceptron (MLP), and a SoftPlus activation to predict optimal quantization intervals for each time step. This module maps the temporal information to appropriate quantization parameters, allowing the method to adapt to the varying activation distributions across different stages of the diffusion process. By pre-computing these quantization intervals, TDQ avoids the runtime overhead associated with traditional dynamic quantization methods while still providing the necessary flexibility to handle the temporal dynamics of diffusion models. Link
Learning Efficient Convolutional Networks through Network Slimming Zhuang Liu et al 2017 pruning, importance CVPR This paper introduces *network slimming*, a method to reduce the size, memory footprint, and computation of CNNs by enforcing channel-level sparsity without sacrificing accuracy. It works by identifying and pruning insignificant channels during training, leveraging the γ scaling factors in Batch Normalization (BN) layers to effectively determine channel importance. The approach introduces minimal training overhead and is compatible with modern CNN architectures, eliminating the need for specialized hardware or software. Using the BN layer’s built-in scaling properties makes this pruning efficient, avoiding redundant scaling layers or issues that arise from linear transformations in convolution layers. Link
Q-Diffusion: Quantizing Diffusion Models Xiuyu Li et al 2023 diffusion, sampling ICCV This paper tackles the inefficiencies of diffusion models, such as slow inference and high computational cost, by proposing a post-training quantization (PTQ) method designed specifically for their multi-timestep process. The key innovation includes a *time step-aware calibration data sampling* approach, which uniformly samples inputs across multiple time steps to better reflect real inference data, addressing quantization errors and varying activation distributions without the need for additional data. Additionally, the paper introduces *shortcut-splitting quantization* to handle the bimodal activation distributions caused by the concatenation of deep and shallow feature channels in shortcuts, quantizing them separately before concatenation for improved accuracy with minimal extra resources. Link
Mixture of Efficient Diffusion Experts Through Automatic Interval and Sub-Network Selection Alireza Ganjdanesh et al 2024 diffusion, sampling Arxiv This paper reduces the cost of sampling by pruning a pretrained diffusion model into a mixture of efficient experts, one per time interval, with a routing agent that predicts the sub-network architecture used to form each expert. Link
A Closer Look at Time Steps is Worthy of Triple Speed-Up for Diffusion Model Training Kai Wang et al 2024 diffusion, sampling Arxiv This paper introduces SpeeD, a novel approach for accelerating the training of diffusion models without compromising performance. The authors analyze the diffusion process and identify three distinct areas: acceleration, deceleration, and convergence, each with different characteristics and importance for model training. Based on these insights, SpeeD implements two key components: asymmetric sampling, which reduces the sampling of less informative time steps in the convergence area, and change-aware weighting, which gives more importance to the rapidly changing areas between acceleration and deceleration. The authors' key insight is that not all time steps in the diffusion process are equally valuable for training, with the convergence area providing limited benefits despite occupying a large proportion of time steps, while the rapidly changing area between acceleration and deceleration is crucial but often undersampled. To address this, SpeeD introduces an asymmetric sampling strategy using a two-step probability function: $P(t) = \begin{cases} \frac{k}{T + \tau(k-1)}, & 0 < t \leq \tau \\ \frac{1}{T + \tau(k-1)}, & \tau < t \leq T \end{cases}$, where τ is a carefully selected threshold marking the beginning of the convergence area, k is a suppression intensity factor, T is the total number of time steps, and t is the current time step. This function increases sampling probability before τ and suppresses it after. Additionally, SpeeD employs a change-aware weighting scheme based on the gradient of the process increment's variance, assigning higher weights to time steps with faster changes. By combining these strategies, SpeeD aims to focus computational resources on the most informative parts of the diffusion process, potentially leading to significant speedups in training time without sacrificing model quality. Link
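A small sketch of the asymmetric sampling distribution from the formula above, plus a draw of training timesteps from it; the values of τ and k here are placeholders, not the paper's settings.

```python
import numpy as np

def speed_probs(T=1000, tau=700, k=3.0):
    """P(t) = k/(T + tau*(k-1)) for t <= tau, else 1/(T + tau*(k-1))."""
    denom = T + tau * (k - 1)
    return np.where(np.arange(1, T + 1) <= tau, k / denom, 1.0 / denom)

p = speed_probs()
print(p.sum())                                           # 1.0: a valid distribution
rng = np.random.default_rng(0)
timesteps = rng.choice(np.arange(1, 1001), size=8, p=p)  # oversamples the pre-convergence area
print(timesteps)
```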
HyperGAN: A Generative Model for Diverse, Performant Neural Networks Neale Ratzlaff et al 2019 gan, ensemble ICML This paper introduces HyperGAN, a novel generative model designed to learn a distribution of neural network parameters, addressing the issue of overconfidence in standard neural networks when faced with out-of-distribution data. Unlike traditional approaches, HyperGAN doesn't require restrictive prior assumptions and can rapidly generate large, diverse ensembles of neural networks. The model employs a unique "mixer" component that projects prior samples into a correlated latent space, from which layer-specific generators create weights for a deep neural network. Experimental results show that HyperGAN can achieve competitive performance on datasets like MNIST and CIFAR-10 while providing improved uncertainty estimates for out-of-distribution and adversarial data compared to standard ensembles. NOTE: There has actually been a diffusion variant of this idea: https://arxiv.org/pdf/2402.13144 Link
Diffusion Models Already Have a Semantic Latent Space Mingi Kwon et al 2023 diffusion, latent space ICLR This paper introduces Asymmetric Reverse Process (Asyrp), a method that discovers a semantic latent space (h-space) in pretrained diffusion models, enabling controlled image manipulation with desirable properties such as homogeneity, linearity, and consistency across timesteps, while also proposing a principled design for versatile editing and quality enhancement in the generative process. The authors propose Asymmetric Reverse Process (Asyrp). It modifies only the P_{t} term while preserving the D_{t} term in the reverse process. This makes sense because it a) breaks the destructive interference seen in previous methods, b) allows for controlled modification of the generation process towards target attributes, and c) maintains the overall structure and quality of the diffusion process. Link
One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale Fan Bao et al 2023 diffusion, multi-modal ICML The authors present a method of sampling from joint and conditional distributions using a small modification on diffusion models. UniDiffuser’s proposed method involves handling multiple modalities (such as images and text) within a single diffusion model. Here is in general what they do: 1. Perturb data in all modalities: for a given data point (x_0, y_0), where x_0 is an image and y_0 is text, UniDiffuser adds noise to both simultaneously. The noisy versions are x_{t_x} and y_{t_y}, where t_x and t_y are the respective timesteps. 2. Use individual timesteps for different modalities: instead of a single timestep t for both modalities, UniDiffuser uses separate timesteps t_x and t_y. This allows for more flexibility in handling the different characteristics of each modality. 3. Predict noise for all modalities simultaneously: UniDiffuser uses a joint noise prediction network \epsilon_{\theta}(x_{t_x}, y_{t_y}, t_x, t_y) that takes in the noisy versions of both modalities and their respective timesteps, and outputs the predicted noise for both modalities in one forward pass. Link
Diffusion Models as a Representation Learner Xingyi Yang et al 2023 diffusion, representation learner ICCV This paper studies whether the representations learned by pre-trained diffusion models transfer to recognition tasks. They propose RepFusion, which distills intermediate features of an off-the-shelf diffusion model into a student network, using reinforcement learning to select which timestep's features to distill, and show gains on tasks such as image classification, semantic segmentation, and landmark detection. Link
Masked Diffusion Transformer is a Strong Image Synthesizer Shanghua Gao et al 2023 diffusion, masking, transformer ICCV This paper (smartly!) notices that one of the major reasons for long training and poor results of diffusion models is the slow learning of relationships among an image's parts. For instance, they remark on the model learning one eye of a dog before both eyes. They propose to mask tokens of the input image in the latent space and train the model to reconstruct the masked tokens alongside the usual denoising objective, forcing it to learn these contextual relations. Brilliant! Link
Generative Modeling by Estimating Gradients of the Data Distribution Yang Song et al 2019 diffusion, score matching NeurIPS This paper introduces Noise Conditional Score Networks (NCSNs), a novel approach to generative modeling that learns to estimate the score function of a data distribution at multiple noise levels. NCSNs are trained using score matching, avoiding the need to compute normalizing constants, and generate samples using annealed Langevin dynamics. The method addresses challenges in modeling complex, high-dimensional data distributions, particularly for data lying on or near low-dimensional manifolds. Link
LAPTOP-Diff: Layer Pruning and Normalized Distillation for Compressing Diffusion Models Dingkun Zhang et al 2024 diffusion, pruning Arxiv This paper proposes layer pruning and normalized distillation for pruning diffusion models. They use a surrogate function and show that their surrogate implies a property called "additivity", where the output distortion caused by many perturbations approximately equals the sum of the output distortion caused by each single perturbation. They then show that their layer selection can be formulated as a 0-1 Knapsack problem. They then analyze what the important objective for retraining is, and see that there is an imbalance in previous feature distillation approaches employed in the retraining phase. They note that the L2-norms of feature maps at the end of different stages and the values of different feature loss terms vary significantly; for instance, the highest loss term is ~10k times greater than the lowest one throughout the distillation process, and produces about 1k times larger gradients. This dilutes the gradients of the numerically insignificant feature loss terms. So, they opt to normalize the feature loss. Link
Classifier-Free Diffusion Guidance Jonathan Ho et al 2022 diffusion, guidance NeurIPS This paper introduces classifier-free guidance, a novel technique for improving sample quality in conditional diffusion models without using a separate classifier. Unlike traditional classifier guidance, which relies on gradients from an additional classifier model, classifier-free guidance achieves similar results by combining score estimates from jointly trained conditional and unconditional diffusion models. The method involves training a single neural network that can produce both conditional and unconditional score estimates, and then using a weighted combination of these estimates during the sampling process. This approach simplifies the training pipeline, avoids potential issues associated with training classifiers on noisy data, and eliminates the need for adversarial attacks on classifiers during sampling. The authors demonstrate that classifier-free guidance can achieve a similar trade-off between Fréchet Inception Distance (FID) and Inception Score (IS) as classifier guidance, effectively boosting sample quality while reducing diversity. The key difference is that classifier-free guidance operates purely within the generative model framework, without relying on external classifier gradients. This method provides an intuitive explanation for how guidance works: it increases conditional likelihood while decreasing unconditional likelihood, pushing generated samples towards more characteristic features of the desired condition. Link
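A sketch of the classifier-free guidance combination at sampling time, with a toy stand-in for the jointly trained network; the "null" condition token, the guidance scale, and the sign convention are assumptions here (papers differ on whether w = 0 or w = 1 corresponds to the plain conditional model).

```python
import numpy as np

NULL = None                                    # "empty" condition used for the unconditional branch

def eps_model(x, t, cond):                     # toy stand-in for the jointly trained network
    shift = 0.0 if cond is NULL else float(cond)
    return 0.1 * x + shift

def cfg_eps(x, t, cond, w=2.0):
    """eps_guided = eps(x, t, empty) + w * (eps(x, t, c) - eps(x, t, empty))."""
    e_uncond = eps_model(x, t, NULL)
    e_cond = eps_model(x, t, cond)
    return e_uncond + w * (e_cond - e_uncond)

x = np.zeros(4)
print(cfg_eps(x, t=10, cond=1, w=2.0))         # pushed beyond the plain conditional prediction
```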
LD-Pruner: Efficient Pruning of Latent Diffusion Models using Task-Agnostic Insights Thibault Castells et al 2024 pruning, diffusion, ldm CVPR This paper presents LD-Pruner. The main interesting part is how they frame the pruning problem. Basically, they define an "operator" (any fundamental building block of a net, like convolutional layers, activation functions, transformer blocks), and try to either 1) remove it or 2) replace it with a less demanding operation. As they operate on the latent space, this work can be applied to any generative task that uses latent diffusion (it is task agnostic). It is interesting to note their limitations: the approach does not extend to pruning the decoder, and it does not consider dependencies between operators (which is a big deal I think). Finally, their score function seems a bit arbitrary (maybe this could be learned?). Link
RoFormer: Enhanced Transformer with Rotary Position Embedding Jianlin Su et al 2021 attention, positional embedding Arxiv This paper introduces Rotary Position Embedding (RoPE), a method for integrating positional information into transformer models by using a rotation matrix to encode absolute positions and incorporating relative position dependencies. Link
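Below is a small sketch of one common variant of RoPE (the half-split layout used by many open implementations), assuming an input of shape (seq_len, dim) with an even dim; it is meant to show the rotation idea, not to mirror the paper's exact code.

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate pairs of feature dimensions by a position-dependent angle."""
    seq_len, dim = x.shape
    half = dim // 2
    # Per-pair frequencies: lower-indexed pairs rotate faster than higher-indexed ones.
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    # A 2D rotation per (x1, x2) pair encodes absolute position; dot products between
    # rotated queries and keys then depend only on relative position.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```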
GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models Alex Nichol et al 2022 text-conditioned diffusion, inpainting Arxiv This paper explores text-conditional image synthesis using diffusion models, comparing CLIP guidance and classifier-free guidance, and finds that classifier-free guidance produces more photorealistic and caption-aligned images. Link
LLM Inference Unveiled: Survey and Roofline Model Insights Zhihang Yuan et al 2024 llms, survey Arxiv This paper surveys recent advancements in LLM inference, like speculative decoding and operator fusion. They also analyze the findings using the Roofline model, and are likely the first to do so for LLM inference. Good for checking out other papers that have recently been published. Link
An Empirical Study of Mamba-based Language Models Roger Waleffe et al 2024 mamba, llms, transformer Arxiv This paper compares Mamba-based, Transformer-based, and hybrid language models in a controlled setting where model sizes and datasets are larger than in past comparisons (8B params / 3.5T tokens). They find that Mamba and Mamba-2 lag behind Transformer models on copying and in-context learning tasks. They then show that a hybrid architecture of 43% Mamba, 7% self-attention, and 50% MLP layers performs better than all others. Link
Diffusion Models Beat GANs on Image Synthesis Prafulla Dhariwal et al 2021 diffusion, gan Arxiv This work demonstrates that diffusion models surpass the current state-of-the-art generative models in image quality, achieved through architecture improvements and classifier guidance, which balances diversity and fidelity. The model attains FID scores of 2.97 on ImageNet 128×128 and 4.59 on ImageNet 256×256, matching BigGAN-deep with as few as 25 forward passes while maintaining better distribution coverage. Additionally, combining classifier guidance with upsampling diffusion models further enhances FID scores to 3.94 on ImageNet 256×256 and 3.85 on ImageNet 512×512. Link
Progressive Distillation for Fast Sampling of Diffusion Models Tim Salimans et al 2022 diffusion, distillation, sampling ICLR Diffusion models excel in generative modeling, surpassing GANs in perceptual quality and autoregressive models in density estimation, but they suffer from slow sampling times. This paper introduces two key contributions: new parameterizations that improve stability with fewer sampling steps and a distillation method that progressively reduces the number of required steps by half each time. Applied to benchmarks like CIFAR-10 and ImageNet, the approach distills models from 8192 steps down to as few as 4 steps, maintaining high image quality while offering a more efficient solution for both training and inference. Link
On Distillation of Guided Diffusion Models Chenlin Meng et al 2023 diffusion, classifier-free guidance Arxiv Classifier-free guided diffusion models are effective for high-resolution image generation but are computationally expensive during inference due to the need to evaluate both conditional and unconditional models many times. This paper proposes a method to distill these models into faster ones by learning a single model that approximates the combined outputs, then progressively reducing the number of sampling steps. The approach significantly accelerates inference, generating images with comparable quality to the original model using as few as 1-4 denoising steps, achieving up to 256× speedup on datasets like ImageNet and LAION. Link
Diffusion Probabilistic Models Made Slim Xingyi Yang et al 2022 diffusion, dpms, spectral diffusion Arxiv Diffusion Probabilistic Models (DPMs) produce impressive visual results but suffer from high computational costs, limiting their use on resource-limited platforms. This paper introduces Spectral Diffusion (SD), a lightweight model designed to address DPMs' bias against high-frequency generation, which smaller networks struggle to capture. SD incorporates wavelet gating for frequency dynamics and spectrum-aware distillation to enhance high-frequency recovery, achieving 8-18× computational efficiency while maintaining competitive image fidelity. Link
Structural Pruning for Diffusion Models Gongfan Fang et al 2023 diffusion, pruning NeurIPS Generative modeling has advanced significantly with Diffusion Probabilistic Models (DPMs), but these models often require substantial computational resources. To address this, Diff-Pruning is introduced as a compression method that prunes unimportant weights using a Taylor expansion over pruned diffusion timesteps, identifying which weights matter without extensive re-training. Empirical results show that Diff-Pruning can cut FLOPs by around 50% while maintaining consistent generative performance at only 10-20% of the original training cost. Link
Diffusion Models: A Comprehensive Survey of Methods and Applications Ling Yang et al 2024 diffusion, survey ACM Diffusion models are a powerful class of deep generative models known for their success in tasks like image synthesis, video generation, and molecule design. This survey categorizes diffusion model research into efficient sampling, improved likelihood estimation, and handling specialized data structures, while also discussing the potential for combining them with other generative models. The review highlights their broad applications across fields such as computer vision, NLP, temporal data modeling, and interdisciplinary sciences, suggesting areas for further exploration. Link
GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium Martin Heusel et al 2017 gan, equilibrium, fid, is NeurIPS This paper introduces a two time-scale update rule (TTUR) for GANs, and proves that this makes GANs converge to a local Nash equilibrium. Even more cited is the FID score introduced here. FID improves on IS by comparing the distributions of real and generated images directly: the Inception model extracts features from images, and these features are assumed to follow a multidimensional Gaussian distribution. FID then measures the difference between the two Gaussians (representing the real and generated images) using the Fréchet distance, which captures differences in the mean and covariance (the first two moments) of the distributions. The Gaussian is a sensible choice because it is the maximum entropy distribution for a given mean and covariance (proof: https://medium.com/mathematical-musings/how-gaussian-distribution-maximizes-entropy-the-proof-7f7dcb2caf4d) -- maximum entropy matters because it means the Gaussian makes the fewest additional assumptions about the data, keeping the model as non-committal as possible given the available information. In practice, we compute the mean and covariance of the real and generated image features, then compute the FID score using the Fréchet (AKA Wasserstein-2) distance between the two Gaussians. Link
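The final step, the Fréchet distance between the two feature Gaussians, is simple enough to sketch; the snippet below assumes you have already computed feature means and covariances (e.g. from Inception activations) and only illustrates the formula, it is not a reference FID implementation.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(mu_real, cov_real, mu_gen, cov_gen) -> float:
    """Wasserstein-2 distance between two Gaussians given their first two moments."""
    diff = mu_real - mu_gen
    covmean = sqrtm(cov_real @ cov_gen)   # matrix square root of the covariance product
    covmean = np.asarray(covmean).real    # drop tiny imaginary parts from numerical error
    return float(diff @ diff + np.trace(cov_real + cov_gen - 2.0 * covmean))
```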
Scalable Diffusion Models with Transformers William Peebles et al 2023 diffusion,ddpm, dit CVPR The authors explore using transformers in the latent space, rather than U-Nets. They find that their methods can lead to lower FID scores compared to prior SOTA. In this paper, their image generation pipeline is roughly: 1) Input high resolution image x 2) Encoder z = E(x), where E is a pre-trained frozen VAE encoder, and z is the latent representation 3) The DiT model operates on z 4) New latent representation z’ is sampled from the diffusion model 5) We then decode the z’ using the pre-trained frozen VAE decoder D, and x’ is now the generated high resolution image. Link
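The five-step pipeline above reduces to a few lines; here is a hedged sketch in which `vae_encode`, `vae_decode`, and `dit_sample` are stand-ins for the frozen VAE and the DiT-based reverse diffusion process (they are assumptions, not the authors' API).

```python
import torch

@torch.no_grad()
def generate(vae_encode, vae_decode, dit_sample, x: torch.Tensor) -> torch.Tensor:
    z = vae_encode(x)            # 2) project the high-resolution image into latent space
    z_new = dit_sample(z.shape)  # 3-4) run the diffusion transformer in latent space to sample z'
    return vae_decode(z_new)     # 5) decode z' back to a high-resolution image
```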
Max-Affine Spline Insights Into Deep Network Pruning Haoran You et al 2022 early-bird, lottery-hypothesis, pruning, low-precision TMLR The authors make connections between spline theory (AKA considering DNNs as continuous piecewise affine mappings) and pruning. Basically, they say that pruning removes redundant decision boundaries in the layers that are pruned, and that we can compare the decision boundaries of unpruned networks to their pruned counterparts to show this (they have some nice visualizations). They also note that the final decision boundary often does not depend on existing subdivision lines. Finally, they demonstrate another way of finding EB tickets using this spline formulation. Link
Drawing Early-Bird Tickets: Towards More Efficient Training of Deep Networks Haoran You et al 2020 early-bird, lottery-hypothesis, pruning, low-precision ICLR The authors show that there exist early-bird (EB) tickets: small but critical subnetworks of dense, randomly initialized networks that can be found using low-cost training schemes (low precision, early stopping). They also design a practical low-compute method for finding these, based on mask distance. Basically, for each pruning iteration, a binary mask is created. This mask represents which parts of the network are kept (the "ticket", or pruned subnet) and which parts are removed. They consider the scaling factor "r" in BN layers as an indicator of significance: r is learned during training and scales the normalized activations, and its magnitude indicates how important the channel is to the network's performance. After deciding which channels to prune based on r, the binary mask is created: if a channel is kept (not pruned), it is marked as 1 in the mask, else 0. For any two subnets, they then compute the "mask distance" (AKA Hamming distance) between the two ticket masks. They measure the mask distance between consecutive epochs and draw EB tickets when this distance is smaller than some threshold. Link
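A minimal sketch of the ticket-mask bookkeeping described above, assuming the per-channel BN scaling factors are available as a tensor; the function names and stopping threshold are illustrative.

```python
import torch

def channel_mask(bn_scales: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """1 for kept channels (largest |r|), 0 for pruned channels."""
    k = max(1, int(keep_ratio * bn_scales.numel()))
    threshold = bn_scales.abs().sort(descending=True).values[k - 1]
    return (bn_scales.abs() >= threshold).int()

def mask_distance(mask_a: torch.Tensor, mask_b: torch.Tensor) -> float:
    """Normalized Hamming distance between two ticket masks."""
    return (mask_a != mask_b).float().mean().item()

# EB criterion (sketch): draw the ticket once masks from consecutive epochs stop changing.
# if mask_distance(mask_epoch[t], mask_epoch[t - 1]) < 0.1: stop and prune with mask_epoch[t]
```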
Learning both Weights and Connections for Efficient Neural Networks Song Han et al 2015 pruning, compression, regularization NeurIPS The authors show a method of pruning neural networks in three steps: 1) train the network to learn what connections are important, 2) prune unimportant connections, 3) retrain and fine-tune. In order to train for learning what connections are important, they do not focus on learning the final weight values, but rather just focus on the importance of connections. They don't explicitly mention how this is done, but one could look at the Hessian of the loss or the magnitude of the weights. I'd imagine you could do this within only a few training iterations. In their "Regularization" section, it is interesting to note that L1 regularization (penalizes non-zero params resulting in more params near zero) gave better accuracy after pruning, but before retraining. But, these remaining connections are not as good as with using L2. The authors also present a discussion of what dropout rate to use. Link
Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference Jiaming Tang et al 2024 KV cache, sparsity, LLM ICML Long context LLM inference is slow, and the speed decreases significantly as sequence lengths grow. This is mainly due to needing to load a big KV cache during self-attention. Prior works have used methods to evict tokens in the attention maps to promote sparsity, but the Han lab (smartly!) found that the criticality of tokens strongly correlates with the current query token. Thus, they employ a method that retains the full KV cache (since past evicted tokens may be needed to handle future queries) while selecting only the top-K tokens relevant to a particular query. This allows for speedups in self-attention at low cost to accuracy. Link
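The core query-aware selection can be sketched as a top-K lookup over the cached keys; note this is a deliberately simplified, token-level illustration (Quest itself estimates criticality at a coarser granularity with cheap per-block metadata), and the names are mine.

```python
import torch

def query_aware_attention(query, keys, values, k: int):
    """Attend only to the k cached tokens judged most relevant to the current query."""
    scores = keys @ query                         # relevance of every cached token to this query
    top = scores.topk(min(k, scores.numel())).indices
    weights = torch.softmax(scores[top], dim=0)
    return weights @ values[top]                  # attention output over the selected subset
```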
BigNAS: Scaling Up Neural Architecture Search with Big Single-Stage Models Jiahui Yu et al 2020 NAS, one-shot Arxiv Most NAS frameworks train some one-shot model to rank the quality of different child architectures. However, these rankings often differ from reality, so frameworks typically finetune architectures after finding them. BigNAS proposes that this fine-tuning / post-processing is not necessary. They find some interesting points, such as that "big models converge faster while small child models converge slower". Thus, at some training step t when the performance of a big model peaks, the small child models are not yet fully trained, and at a later step t' where the child models are fully trained, the big model is overfitting. They therefore use an exponentially decaying learning rate schedule with a constant ending: the learning rate is held constant at the end of training once it reaches 5% of the initial learning rate. Another point they bring up is a "coarse-to-fine" strategy where one first finds a rough sketch of promising network candidates, and then samples multiple finer grained variations around the sketch of interest. Link
Meta-Learning of Neural Architectures for Few-Shot Learning Thomas Elsken et al 2021 NAS, meta-learning, few-shot, fsl Arxiv The authors propose MetaNAS, which is the first method that fully integrates NAS with gradient-based meta-learning. Basically, they learn a method of joint learning gradient-based NAS methods like DARTS and meta-learning the architecture itself. Their goal is thus: meta-learn an architecture \alpha_{meta} with corresponding meta-learned weights w_{meta}. When given a new task \mathcal{T}_{i}, both \alpha_{meta} and w_{meta} adapt quickly to \mathcal{T}_{i} based on a few samples. One interesting technique they do is add a temperature term that is annealed to 0 over the course of task training; this is to help with sparsity of the mixture weights of the operations when using the DARTS search. Link
MetAdapt: Meta-Learned Task-Adaptive Architecture for Few-Shot Classification Sivan Doveh et al 2020 NAS, meta-learning, few-shot, fsl Arxiv The authors propose a method using a DARTS-like search for FSL architectures. "Our goal is to learn a neural network where connections are controllable and adapt to the few-shot task with novel categories... However, unlike DARTS, our goal is not to learn a one time architecture to be used for all tasks... we need to make our architecture task adaptive so it would be able to quickly rewire for each new target task.". Basically, they design a thing called a MetAdapt Controller that changes the connection in the main network according to some given task. Link
Distilling the Knowledge in a Neural Network Geoffrey Hinton et al 2015 distillation, ensemble, MoE Arxiv The first proposal of knowledge distillation. The main interesting point I found was that they raise the temperature of the softmax to produce softer targets. This allows for understanding which 2's look like 3's (in an MNIST example). Basically, this adds a sort of regularization, since more information can be carried in these softer targets compared to a single 0 or 1. They also propose the idea of having an ensemble of models and then learning a smaller distilled model. The biological example of a clumsy larva that then becomes a more specialized insect was good. Link
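A small sketch of the temperature-softened distillation loss described above (the T² factor keeps soft-target gradients on the same scale as the hard-label term, as in the paper); the hyperparameter values here are arbitrary.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T: float = 4.0, alpha: float = 0.5):
    """Soft targets at temperature T plus the usual hard-label cross-entropy."""
    soft_targets = F.softmax(teacher_logits / T, dim=-1)   # softer than T = 1: which 2's look like 3's survives
    log_student = F.log_softmax(student_logits / T, dim=-1)
    soft_loss = F.kl_div(log_student, soft_targets, reduction="batchmean") * (T * T)
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```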
HyperTuning: Toward Adapting Large Language Models without Back-propagation Jason Phang et al 2023 hypernetworks, adaptation, tuning, LoRA, LLMs ICML The authors show that we can use a hypernetwork for model adaptation, generating task-specific parameters from few-shot examples. They try two approaches: generating soft prefixes and generating LoRA parameters for a frozen T5 model. They also note the importance of hyperpretraining, i.e., an additional stage to adapt the hypernet to generate parameters for the downstream model, and propose a scheme for this. NOTE! "We also observe a consistent trend where HyperT5-Prefix outperforms HyperT5-LoRA. We speculate that it is easier for hypermodels to learn to generate soft prefixes as compared to LoRA weights..." Link
Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning Armen Aghajanyan et al 2020 fine-tuning, intrinsic dimension, lora Arxiv Large models with billions of parameters can be fine-tuned using only a few hundred examples. Why is this? Furthermore, large models often allow for significant sparsification, which implies that there is much redundancy. This paper targets both of these ideas, by showing that many common models have an "intrinsic dimension" much less than the full parameterization. Link
LoRA: Low-Rank Adaptation of Large Language Models Edward Hu et al 2021 low rank adaptation, lora, llm, fine-tuning Arxiv Fine-tuning large models is expensive because we update all the original parameters. Taking inspiration from Aghajanyan et al., 2020 (pre-trained language models have a low "intrinsic dimension"), the authors hypothesized that the weight updates would also have low intrinsic rank. Thus, they decompose Delta W = BA, where B and A are low-rank matrices, and only A and B are trainable. They initialize A with a Gaussian and B as zero, so Delta W = BA is zero initially. They then optimize and find this method to be more efficient in terms of both time and space. Link
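A minimal sketch of a LoRA-augmented linear layer matching the description above: the pre-trained weight is frozen, A gets a Gaussian init, B a zero init, and only A and B train. The (alpha / r) scaling convention is common practice but treat it as an assumption.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = W x + (alpha / r) * B A x, with W frozen and only A, B trainable."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                                      # freeze pre-trained W
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)   # Gaussian init
        self.B = nn.Parameter(torch.zeros(base.out_features, r))         # zero init: Delta W = BA = 0 at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T
```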
Learning to Compress Prompts with Gist Tokens Jesse Mu et al 2023 llms, prompting, compression, tokens NeurIPS The authors describe a method of using a distilling function G (similar to a hypernet) that is able to compress LM prompts into a smaller set of "gist" tokens. These tokens can then be cached and reused. The neat trick is that they reuse the LM itself as G, so gisting itself incurs no additional training cost. Note that in their "Failure Cases" section, they mention "... While it is unclear why only the gist models exhibit this behavior (i.e. the fail example behavior), these issues can likely be mitigated with more careful sampling techniques." Link
Once-For-All: Train One Network and Specialize it For Efficient Deployment Han Cai et al 2020 nas, supernets ICLR The authors proposed training one large supernetwork and then sampling subnetworks as an approach for NAS. This method allows for the simultaneous generation of many different subnetworks that could satisfy different constraints (i.e. hardware, latency, accuracy, etc). The authors also propose a progressive shrinking method to train the net (start by training the big supernet, then progressively shrink down), which can be seen as a generalized pruning method. Furthermore, they introduce an idea of training a twin neural network to help estimate latency / accuracy given some architecture, which allows for fast feedback when conducting the search for subnetworks. Link
Dataless Knowledge Fusion by Merging Weights Xisen Jin et al 2023 knowledge fusion, weight merging ICLR The paper introduces RegMean, a method for merging pre-trained language models from different datasets by solving a linear optimization problem, which improves generalization across domains without requiring the original training data. Compared to existing methods like Simple Averaging and Fisher Averaging, RegMean offers higher computational efficiency and comparable memory overhead, while achieving better or equivalent performance across various natural language tasks, including out-of-domain generalization. The method is evaluated using GLUE datasets and demonstrates superior performance in most tasks, outperforming traditional model ensembling and multi-task learning approaches. Link
Superposition of Many Models into One Cheung et al 2019 superposition, online learning, tasks, continual learning NeurIPS A method of storing multiple models using only one set of parameters via parameter superposition is provided; it shares similarities with superposition in Fourier analysis for signal processing. Link
Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation Yoshua Bengio et al 2013 gradients, stochasticity, backpropagation Arxiv The authors introduce several methods for estimating or propagating gradients through networks that have stochastic neurons. This comes up often in quantization-aware networks, which can have decision boundaries in the neurons that are not differentiable in the usual sense. The paper also popularizes the "Straight Through Estimator", which was actually first introduced in one of Hinton's lectures. One interesting idea they present (that I think may have also been introduced in Kingma's VAE paper?) is that we can model the output h_{i} of some stochastic neuron as the application of a deterministic function that also depends on some noise source z_{i}: h_{i} = f(a_{i}, z_{i}). TLDR: straight through units are typically the go-to due to ease of use and good performance. Link
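The straight-through estimator is easiest to see as a custom autograd function; here is the simplest variant (identity backward, no gradient clipping), meant as a sketch rather than the paper's exact estimator.

```python
import torch

class SignSTE(torch.autograd.Function):
    """Hard sign() in the forward pass, identity gradient in the backward pass."""

    @staticmethod
    def forward(ctx, x):
        return torch.sign(x)      # non-differentiable hard decision stand-in

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output        # straight through: pretend the forward was the identity

x = torch.randn(4, requires_grad=True)
SignSTE.apply(x).sum().backward()
print(x.grad)                     # all ones, despite sign() having zero gradient almost everywhere
```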
DoReFaNet: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients Shuchang Zhou et al 2018 quantization, cnn, gradients Arxiv The authors introduce a method to train CNNs with low bitwidth weights and activations using low bitwidth param gradients. They use deterministic quantization for weights and activations, while stochastically quantizing gradients. Note that they do not quantize the weights of the first CNN layer for the most part, as they noted that it would often degrade performance (Han et al. 2015 also notes a similar thing). Another interesting thing they do is add noise to the gradient after quantization to increase performance. This paper also uses the straight through estimator (Bengio et al 2013) for propagating gradients when using their quantization scheme. Link
Training Deep Neural Networks with 8-bit Floating Point Numbers Naigang Wang et al 2018 quantization, floating-point, precision NeurIPS The authors show that it is possible to train DNNs with 8-bit fp values while maintaining decent accuracy. To do this, they make a new FP8 format, develop a technique "chunk-based computations" that allow matrix and convolution ops to be computed using 8-bit multiplications and 16 bit additions, and use fp stochastic rounding in weight updates. One interesting point they make is that swamping (the issue of truncation in large-to-small number addition) is a serious problem in DNN bit-precision reduction. Link
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference Benoit Jacob et al 2017 quantization, quantization schemes, efficient inference, floating-point Arxiv The authors propose a quantization scheme that allows us to only use integer arithmetic to approximate fp computations in a neural network. They also describe a training approach that simulates the effect of quantization in the forward pass. Backprop still occurs, but all weights and biases are stored in fp. The forward prop pass then simulates quantized inference by rounding off using the quantization scheme they describe that changes fp to int. Link
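The flavor of such a scheme is an affine map between floats and integers defined by a scale and zero-point; the sketch below shows the quantize/dequantize round trip that the simulated-quantization forward pass relies on (a generic illustration, not the paper's exact scheme).

```python
import numpy as np

def quantize(x: np.ndarray, num_bits: int = 8):
    """Map floats to unsigned integers via q = round(x / scale) + zero_point."""
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(round(qmin - x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Recover approximate floats; training simulates this round trip in the forward pass."""
    return scale * (q.astype(np.float32) - zero_point)
```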
PACT: Parameterized Clipping Activation for Quantized Neural Networks Jungwook Choi et al 2018 quantization, clipping, activations ICLR The authors present a method of quantization by clipping activations using a learnable parameter, alpha. They show that this can lead to lower decreases in accuracy compared to other quantization methods. They also note that activations have been hard to quantize compared to weights in the past. They also prove that PACT is as expressive as ReLU, by showing it can reach the same solution as ReLU if SGD is used. They also describe the hardware benefits that can be incurred. Link
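A sketch of a PACT-style activation: clip at a learnable alpha, quantize to k bits, and use a straight-through trick so gradients reach both the input and alpha. Details like the level count follow the common formulation but should be read as assumptions.

```python
import torch
import torch.nn as nn

class PACT(nn.Module):
    """Clipped activation with a learnable clipping level alpha, then k-bit quantization."""

    def __init__(self, alpha_init: float = 10.0, k: int = 4):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(alpha_init))
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Clamp to [0, alpha], written so the gradient w.r.t. alpha is nonzero where x > alpha.
        y = torch.clamp(x, min=0.0) - torch.clamp(x - self.alpha, min=0.0)
        levels = 2 ** self.k - 1
        y_q = torch.round(y * levels / self.alpha) * self.alpha / levels
        return y + (y_q - y).detach()   # straight-through: quantized forward, continuous backward
```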
SMASH: One-Shot Model Architecture Search through Hypernetworks Andrew Brock et al 2017 hypernetworks, nas, one-shot, few-shot Arxiv The authors propose a technique to speed up NAS by using a hypernet. Basically, they train a hypernet to generate the weights of a main model that has a variable architecture. The input to the hypernet is a binarized representation of the model architecture, and the output is the corresponding weights. They then train only for a few epochs and compare the validation scores obtained across different representations. Then, they fully train the model that had the best validation score. Link
Example-based Hypernetworks for Multi-source Adaptation to Unseen Domains Tomer Volk et al 2023 hypernetworks, multi-source adaptation, unseen domains, NLP EMNLP The authors apply hypernets to unsupervised domain adaptation in NLP. They use example-based adaptation. The main idea is that they use an encoder-decoder to initially create the unique signatures from an input example, and then they embed it within the source domain's semantic space. The signature is then used by a hypernet to generate the task classifier's weights. The paper focuses on improving generalization to unseen domains by explicitly modeling the shared and domain specific characteristics of the input. To allow for parameter sharing, they propose modeling based on hypernets, which allow soft weight sharing. Link
Meta-Learning via Hypernetworks Dominic Zhao et al 2020 hypernetworks, meta-learning NeurIPS The authors propose a soft weight-sharing hypernet architecture that performs well on meta-learning tasks. A good paper to show efforts in meta-learning with regards to hypernets, and comparing them to SOTA methods like Model-Agnostic Meta-Learning (MAML). Link
HyperDynamics: Meta-Learning Object and Agent Dynamics with Hypernetworks Zhou Xian et al 2021 hypernetworks, meta-learning, dynamics ICLR The authors present a dynamics meta-learning framework which conditions on an agent's interactions with its environment and (optionally) the visual input from it. From this, they can generate the parameters of a neural dynamics model. The three modules they use are 1) an encoding module that encodes a few agent-environment interactions / the agent's visual observations into a feature code, 2) a hypernet that conditions on the latent feature code to generate the parameters of a dynamics model dedicated to this observed system, and 3) a target dynamics model built from the generated parameters, which takes a low-dim system state / agent action as input and outputs the prediction of the next system state. Link
Principled Weight Initialization for Hypernetworks Oscar Chang et al 2020 hypernetworks, weight initialization ICLR Classical weight initialization techniques don't really work on hypernets, because they fail to produce weights for the mainnet in the correct scale. The authors derive formulas for hyperfan-out and hyperfan-in weight initialization, and show that it works well for the mainnet. Link
Continual Learning with Hypernetworks Johannes von Oswald et al 2020 hypernetworks, continual learning, meta learning ICLR The authors present a method of preventing catastrophic forgetting, by using task-conditioned hypernets (i.e., hypernets that generate weights of target model based on some task embedding). Thus, rather than memorizing many data characteristics, we can split the problem into just learning a single point per task, given the task embedding. Link
Stochastic Hyperparameter Optimization through Hypernetworks Jonathan Lorraine et al 2018 hypernetworks, hyperparameters ICLR Using hypernetworks to learn hyperparameters. They replace the training optimization loop in favor of a differentiable hypernetwork to allow for tuning of hyperparameters using grad descent. Link
Playing Atari with Deep Reinforcement Learning Volodymyr Mnih et al 2013 q-learning, reinforcement learning Arxiv The authors present the first deep learning model that can learn complex control policies, and they teach it to play Atari 2600 games using Q-learning. Their goal was to create one net that can play as many games as possible. Link
Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding Song Han et al 2016 quantization, encoding, pruning ICLR A three-pronged approach to compressing nets. They prune networks, then quantize and share weights, and then apply Huffman coding. Link
Binarized Neural Networks: Training Neural Networks with Weights and Activations Constrained to +1 or -1 Matthieu Courbariaux et al 2016 quantization, efficiency, binary Arxiv Introduction of training Binary Neural Networks, or nets with binary weights and activations. They also present experiments on deterministic vs stochastic binarization. They use the deterministic one for the most part, except for activations. Link
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks Mingxing Tan et al 2020 efficiency, scaling ICML A study of model scaling is presented. They propose a novel scaling method that uniformly scales all dimensions of depth/width/resolution using a compound coefficient: for instance, if you want to use 2^{N} more compute resources, you scale depth, width, and resolution by their respective coefficients raised to the power N. They also quantify the relationship between width, depth, and resolution. Link
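The compound scaling rule is just three coupled exponentials; the sketch below uses coefficients close to those reported for the EfficientNet baseline (treat the exact values as assumptions), chosen so that raising them to the power phi costs roughly 2^phi more FLOPs.

```python
def compound_scaling(phi: float, alpha: float = 1.2, beta: float = 1.1, gamma: float = 1.15):
    """Return (depth, width, resolution) multipliers for a given compound coefficient phi."""
    # The search constrains alpha * beta**2 * gamma**2 to be about 2, so each unit of phi
    # roughly doubles the FLOPs of the scaled network.
    return alpha ** phi, beta ** phi, gamma ** phi

print(compound_scaling(phi=1.0))  # modest, balanced growth in all three dimensions
```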
2-in-1 Accelerator: Enabling Random Precision Switch for Winning Both Adversarial Robustness and Efficiency Yonggan Fu et al 2021 precision, adversarial, efficiency ACM Introduction of a Random Precision Switch algorithm that has potential for defending against adversarial attacks while promoting efficiency. Link
The wake-sleep algorithm for unsupervised neural networks Geoffrey Hinton et al 1995 representation, generative Arxiv One of the first generative neural networks that kind of resembles diffusion. Link
ViTCoD: Vision Transformer Acceleration via Dedicated Algorithm and Accelerator Co-Design Haoran You et al 2022 vit, accelerator, attention Arxiv Algorithm-accelerator co-design for ViTs. Prunes and polarizes attention maps into denser/sparser patterns. Develops a dedicated hardware accelerator as well. Link
Evolving Neural Networks through Augmenting Topologies Kenneth O. Stanley et al 2002 nas, evolution Arxiv Evolution for NAS. Link
A Brief Review of Hypernetworks in Deep Learning Vinod Kumar Chauhan et al 2024 hypernetwork Arxiv Review of hypernets. Link
HyperNetworks David Ha et al 2016 hypernetwork Arxiv Looking at HyperNetworks: networks that generate weights for other networks. Link
Deep Learners Benefit More from Out-of-Distribution Examples Yoshua Bengio et al 2024 ood ICML Evidence that ood samples can help learning. They also argue that intermediate levels of representation can benefit the models in multi-task settings. Link
Balanced Data, Imbalanced Spectra: Unveiling Class Disparities with Spectral Imbalance Chiraag Kaushik et al 2024 spectra, class imbalance ICML Introduction of the idea of "spectral imbalance", which can affect classification accuracy even when classes are balanced. Basically, they look at how the distributions of eigenvalues in different classes affect classification accuracy. Link
DeepArchitect: Automatically Designing and Training Deep Architectures Renato Negrinho et al 2017 nas Arxiv Proposal of a language to describe neural networks architectures. Can then describe them as trees to search through. Show different search methods for going through the trees (Monte Carlo tree search, random, use of surrogate function, etc.). Link
Graph neural networks: A review of methods and applications Jie Zhou et al 2020 gnn AI Open What graph neural networks are, what they are made of, how to train them. And examples. They describe a general design pipeline (find the graph structure, specify graph type and scale, design the loss function) and explain the main modules in GNNs (a propagation module to propagate information between nodes, a sampling module to conduct the propagation on large graphs, and a pooling module to extract information from nodes). Link
1D convolution neural networks and applications: A survey Serkan Kiranyaz et al 2020 cnn, survey Mechanical Systems and Signal Processing A brief overview of applications of 1D CNNs is performed. It is largely focused on medicine (for instance, ECG) and fault detection (for instance, vibration based structural damage). Link
2 in 1 Accelerator: Enabling Random Precision Switch for Winning Both Adversarial Robustness and Efficiency Yonggan Fu et al 2021 quantization, accelerator ACM The most interesting point of this paper (among many things!) is the smart idea to use quantization as a way to boost DNN robustness. Cool! Link
Token Picker: Accelerating Attention in Text Generation with Minimized Memory Transfer via Probability Estimation Junyoung Park et al 2024 efficiency, hardware, accelerator, attention DAC In autoregressive models with attention, off chip memory accesses need to be minimized. The authors note that there have been efforts to prune unimportant tokens, but these do not do much for removing tokens with attention scores near zero. The authors (smartly!) notice this issue and provide a fast method of estimating whether a token is important, which drives the decision to prune it or not. An architecture for this is also provided. Link
Maxout Networks Ian Goodfellow et al 2013 dropout, maxout ICML The authors note that dropout is "most effective when taking relatively large steps in parameter space. In this regime, each update can be seen as making a significant update to a different model on a different subset of the training set". I really liked that quote. They then develop the maxout unit, which is essentially the maximum across some number of affine transformations, allowing for learning of piecewise linear approximations of nonlinear functions. Link
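A maxout unit in a few lines: compute k affine transformations and keep the elementwise maximum. This is a generic sketch (the grouping of the k pieces is arbitrary), not the paper's implementation.

```python
import torch
import torch.nn as nn

class Maxout(nn.Module):
    """Max over k affine pieces, giving a learned piecewise-linear activation."""

    def __init__(self, in_features: int, out_features: int, k: int = 4):
        super().__init__()
        self.k = k
        self.linear = nn.Linear(in_features, out_features * k)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.linear(x)                        # all k affine pieces computed at once
        z = z.view(*x.shape[:-1], -1, self.k)     # (..., out_features, k)
        return z.max(dim=-1).values
```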
Geometric deep learning: Going beyond Euclidean data Michael Bronstein et al 2017 geometric deep learning IEEE SIG Provides an overview of geometric deep learning, which are methods of generalizing DNNs to non Euclidean domains (graphs, manifolds, etc). Link
Sampling in Constrained Domains with Orthogonal Space Variational Gradient Descent Ruqi Zhang et al 2022 variational gradient descent, gradient flow NeurIPS The authors propose a new variational framework called O Gradient for sampling in implicitly defined constrained domains, using two orthogonal directions to drive the sampler towards the domain and explore it by decreasing a KL divergence. They prove the convergence of O Gradient and apply it to both Langevin dynamics and Stein Variational Gradient Descent (SVGD), demonstrating its effectiveness on various machine learning tasks. Link
Entropy MCMC: Sampling from Flat Basins with Ease Bolian Li et al 2024 sampling, bayesian, flat basins ICML The authors propose a practical MCMC algorithm for sampling from flat basins of DNN posterior distributions, using a guiding variable based on local entropy to steer the sampler. They prove the fast convergence rate of their method compared to existing flatness aware methods and demonstrate its superior performance on various tasks through comprehensive experiments. The method is mathematically simple and computationally efficient, making it suitable as a drop in replacement for standard sampling methods like SGLD. Link
AdderNet: Do We Really Need Multiplications in Deep Learning? Hanting Chen et al 2021 multiplication-less, efficiency CVPR The authors show that, at some cost in accuracy, you can replace multiplications with additions. They mainly tested CNNs. Link
Explaining and Harnessing Adversarial Examples Ian Goodfellow et al 2015 adversarial examples ICLR Adversarial examples (adding "small but intentionally worst case perturbations to examples from the dataset") prove to be an interesting method to train models. The authors also (smartly!) describe a linear method to generate adversarial examples (the fast gradient sign method). Link
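A minimal sketch of that linear (fast gradient sign) perturbation, assuming a differentiable `model` and `loss_fn`; the epsilon value is arbitrary.

```python
import torch

def fgsm_perturb(model, loss_fn, x, y, epsilon: float = 0.01):
    """Perturb x in the sign of the loss gradient, the direction that most increases the loss."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x_adv), y)
    loss.backward()
    return (x_adv + epsilon * x_adv.grad.sign()).detach()
```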
Identifying and attacking the saddle point problem in high dimensional non convex optimization Yann Dauphin et al 2014 saddle points, optimization NeurIPS The authors argue that saddle points, rather than local minima, are the primary challenge in minimizing non convex error functions in high dimensional spaces, based on insights from various scientific fields and empirical evidence. They explain that saddle points surrounded by high error plateaus can significantly slow down learning and create the illusion of local minima, particularly in high dimensional problems of practical interest. To address this challenge, the authors propose a new approach called the saddle free Newton method, designed to quickly escape high dimensional saddle points, unlike traditional gradient descent and quasi Newton methods. Link
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift Sergey Ioffe et al 2015 batch, normalization PMLR The authors identify internal covariate shift as a significant challenge in training deep neural networks, where the distribution of each layer's inputs changes during training due to parameter updates in previous layers. To address this issue, they propose Batch Normalization, a method that normalizes layer inputs as part of the model architecture, performing normalization for each training mini batch. Batch Normalization enables the use of much higher learning rates, reduces sensitivity to initialization, and acts as a regularizer, sometimes eliminating the need for Dropout. Link
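The training-time transform itself is short; below is a sketch for a (batch, features) tensor, omitting the running statistics used at inference time.

```python
import torch

def batch_norm(x: torch.Tensor, gamma: torch.Tensor, beta: torch.Tensor, eps: float = 1e-5):
    """Normalize each feature over the mini-batch, then apply a learned scale and shift."""
    mean = x.mean(dim=0)
    var = x.var(dim=0, unbiased=False)
    x_hat = (x - mean) / torch.sqrt(var + eps)   # zero-mean, unit-variance per feature
    return gamma * x_hat + beta                  # gamma/beta restore representational power
```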
Bayesian Deep Learning and a Probabilistic Perspective of Generalization Andrew Wilson et al 2020 bayesian, marginalization NeurIPS The authors emphasize that marginalization, rather than using a single set of weights, is the key distinguishing feature of a Bayesian approach, which can significantly improve the accuracy and calibration of modern deep neural networks. They demonstrate that deep ensembles provide an effective mechanism for approximate Bayesian marginalization and propose a related approach that further improves the predictive distribution by marginalizing within basins of attraction. The paper investigates the prior over functions implied by a vague distribution over neural network weights, explaining neural network generalization from a probabilistic perspective and showing that seemingly mysterious results (like fitting random labels) can be reproduced with Gaussian processes. The authors demonstrate that Bayesian model averaging mitigates the double descent phenomenon, leading to monotonic performance improvements as model flexibility increases. Link
A Practical Bayesian Framework for Backpropagation Networks David MacKay et al 1992 bayesian Neural Computation The authors present a quantitative and practical Bayesian framework for learning mappings in feedforward networks, enabling objective comparisons between different network architectures and providing stopping rules for network pruning or growing procedures. This framework allows for objective selection of weight decay terms or regularizers, measures the effective number of well determined parameters in a model, and provides quantified estimates of error bars on network parameters and outputs. The approach helps detect poor underlying assumptions in learning models and demonstrates a good correlation between generalization ability and Bayesian evidence for well matched learning models. Link
Bayesian Neural Network Priors Revisited Vincent Fortuin et al 2022 bayesian, priors ICLR Isotropic Gaussian priors are the standard for modern Bayesian neural network inference, but their accuracy and optimal performance are uncertain. By studying summary statistics of neural network weights trained with stochastic gradient descent (SGD), the authors find that CNN and ResNet weights exhibit strong spatial correlations, while FCNNs display heavy tailed weight distributions. Incorporating these observations into priors improves performance on various image classification datasets, mitigating the cold posterior effect in FCNNs but slightly increasing it in ResNets. Link
Hands on Bayesian Neural Networks A Tutorial for Deep Learning Users Laurent Jospin et al 2022 bayesian IEEE A good summary / tutorial for using Bayesian Nets. Also provides some good paper references within. Link
Position Paper: Bayesian Deep Learning in the Age of Large Scale AI Theodore Papamarkou et al 2024 bayesian, mcmc ICML A good summary of the strengths of BDL (Bayesian Deep Learning) with regards to modern deep learning, while also addressing some weaknesses. A good paper if you need an overview of modern challenges (as of 2024). Link
A Neural Probabilistic Language Model Bengio et al 2003 statistical language modeling JMLR One of the first papers about modern methods of using neural systems to estimate probability functions of word sequences. They show that MLPs can model word sequences better than the SOTA (at that time). A classic. Link
Bit Fusion: Bit Level Dynamically Composable Architecture for Accelerating Deep Neural Networks Hardik Sharma et al 2018 accelerator, quantization, bit fusion ISCA Hardware acceleration of Deep Neural Networks (DNNs) aims to address their high compute intensity, with the paper focusing on the potential of reducing bitwidth in operations without compromising classification accuracy. To prevent accuracy loss, the bitwidth varies significantly across DNNs, and a fixed bitwidth accelerator may lead to limited benefits or degraded accuracy. The authors introduce Bit Fusion, a bit flexible accelerator that dynamically adjusts bitwidth for individual DNN layers, resulting in significant speedup and energy savings compared to state of the art accelerators, Eyeriss and Stripes, and achieving performance close to a 250 Watt Titan Xp GPU while consuming much less power. Link
A Framework for the Cooperation of Learning Algorithms Leon Bottou et al 1990 learning algorithms, modules NeurIPS Cooperative training of modular systems offers a unified approach to many learning algorithms and hybrid systems, allowing the design and implementation of complex learning systems that incorporate structural a priori knowledge about tasks. The authors introduce a framework using a statistical formulation of learning systems to define and combine modules into cooperative systems, enabling the creation of hybrid systems that combine the advantages of connectionist and other learning algorithms. By decomposing complex tasks into simpler subtasks, modular architectures can be built, where each module corresponds to a subtask, facilitating easier achievement of the learning goal by introducing a modular decomposition of the global task. Link
CNP: An FPGA Based Processor for Convolutional Networks Clement Farabet et al 2009 fpga, cnn IEEE One of the first attempts (that I have found) at putting a CNN into an FPGA and showing it can be done to perform some task (face detection). Link
A Complete Recipe for Stochastic Gradient MCMC Yi An Ma et al 2015 hamiltonian, mcmc NeurIPS Recent Markov chain Monte Carlo (MCMC) samplers use continuous dynamics and scalable variants with stochastic gradients to efficiently explore target distributions, but proving convergence with stochastic gradient noise remains challenging. The authors provide a general framework for constructing MCMC samplers, including stochastic gradient versions, based on continuous Markov processes defined by two matrices, demonstrating that any such process can be represented within this framework. Using this framework, they propose a new state adaptive sampler, stochastic gradient Riemann Hamiltonian Monte Carlo (SGRHMC), which combines the benefits of Riemann HMC with the scalability of stochastic gradient methods, as shown in experiments with simulated data and a streaming Wikipedia analysis. Link
CPT: Efficient Deep Neural Network Training via Cyclic Precision Yonggan Fu et al 2021 precision, efficiency, wide minima ICLR Low precision deep neural network (DNN) training is an effective method for improving training time and energy efficiency, with this paper proposing a new perspective: that DNN precision may act similarly to the learning rate during training. The authors introduce Cyclic Precision Training (CPT), which cyclically varies precision between two boundary values identified through a simple precision range test in the initial training epochs, aiming to boost time and energy efficiency further. Link
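The "precision as a learning-rate-like knob" idea amounts to a cyclic bitwidth schedule; here is a hedged sketch using a cosine-shaped cycle between two boundary precisions (the exact schedule shape and bounds are assumptions; the paper finds the bounds via a precision range test).

```python
import math

def cyclic_precision(step: int, cycle_len: int, prec_min: int = 3, prec_max: int = 8) -> int:
    """Bitwidth for the current training step, cycling between prec_min and prec_max."""
    phase = (step % cycle_len) / cycle_len
    prec = prec_min + 0.5 * (prec_max - prec_min) * (1.0 - math.cos(2.0 * math.pi * phase))
    return int(round(prec))

# Example: over one cycle the precision rises from 3 bits to 8 bits and back down.
print([cyclic_precision(s, cycle_len=8) for s in range(8)])
```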
Approximation by Superpositions of a Sigmoidal Function G. Cybenko 1989 universal approximator, completeness Mathematics of Control, Signals, and Systems This paper demonstrates that finite linear combinations of compositions of a fixed univariate function and a set of affine functionals can uniformly approximate any continuous function of n real variables within the unit hypercube, under mild conditions on the univariate function. These findings resolve an open question about the representability of functions by single hidden layer neural networks, specifically showing that arbitrary decision regions can be well approximated by continuous feedforward neural networks with a single hidden layer and any continuous sigmoidal nonlinearity. Link
Cyclical Stochastic Gradient MCMC for Bayesian Deep Learning Ruqi Zhang et al 2020 mcmc, bayesian ICLR The posteriors over neural network weights are high dimensional and multimodal, with each mode representing a different meaningful interpretation of the data. The authors introduce Cyclical Stochastic Gradient MCMC (SG MCMC) with a cyclical stepsize schedule, where larger steps discover new modes and smaller steps characterize each mode, and they prove the non asymptotic convergence of this algorithm. Link
DaDianNao: A Machine Learning Supercomputer Yunji Chen et al 2014 accelerator, gpu IEEE/ACM This paper introduces a custom multi chip architecture optimized for Convolutional and Deep Neural Networks (CNNs and DNNs), addressing their computational and memory intensive nature by leveraging on chip storage to enhance internal bandwidth and reduce external communication bottlenecks. The authors demonstrate significant performance gains with their 64 chip system achieving up to a 450.65x speedup over GPUs and reducing energy consumption by up to 150.31x on large neural network layers, implemented with custom storage, computational units, and robust interconnects at 28nm scale. Link
DARTS: Differentiable Architecture Search Hanxiao Liu et al 2019 nas ICLR This paper introduces a differentiable approach to architecture search, tackling scalability challenges by reformulating the task to allow gradient based optimization over a continuous relaxation of architecture representations. Unlike traditional methods relying on evolutionary or reinforcement learning in discrete, non differentiable spaces, the proposed method efficiently discovers high performance convolutional architectures for image classification and recurrent architectures for language modeling. Link
Decoupled Contrastive Learning Chun Hsiao Yeh et al 2022 contrastive learning, self-supervised learning ACM This paper introduces decoupled contrastive learning (DCL), which removes the negative positive coupling (NPC) effect from the InfoNCE loss, significantly improving the efficiency of self supervised learning (SSL) tasks with smaller batch sizes. DCL achieves efficient and reliable performance enhancements across various benchmarks, outperforming the SimCLR baseline without requiring momentum encoding, large batch sizes, or extensive epochs. Link
Deep Image Prior Dmitry Ulyanov et al 2020 inpainting, super-resolution, denoising IEEE This paper challenges the conventional wisdom by demonstrating that the structure of a generator network, even when randomly initialized, can effectively capture low level image statistics without any specific training on example images. The authors show that this randomly initialized neural network can serve as a powerful handcrafted prior, yielding excellent results in standard image processing tasks such as denoising, super resolution, and inpainting. Furthermore, the same network structure can invert deep neural representations for diagnostic purposes and restore images based on input pairs like flash and no flash conditions, showcasing its versatility and effectiveness across various image restoration applications. Link
Deep Double Descent: Where Bigger Models and More Data Hurt Preetum Nakkiran et al 2019 capacity, double descent Arxiv This paper explores the "double descent" phenomenon in modern deep learning tasks, showing that as model size or training epochs increase, performance initially worsens before improving. The authors unify these observations by introducing a new complexity measure termed effective model complexity, conjecturing a generalized double descent across this measure. Link
DeepShift: Towards Multiplication Less Neural Networks Mostafa Elhoushi et al 2021 multiplication-less, efficiency Arxiv This paper addresses the computational challenges of deploying convolutional neural networks (CNNs) on edge computing platforms by introducing convolutional shifts and fully connected shifts, replacing multiplications with efficient bitwise operations during both training and inference. The proposed DeepShift models achieve competitive or higher accuracies compared to baseline models like ResNet18, ResNet50, VGG16, and GoogleNet, while significantly reducing the memory footprint by using only 5 bits or less for weight representation during inference. Link
DepthShrinker: A New Compression Paradigm Towards Boosting Real Hardware Efficiency of Compact Neural Networks Yonggan Fu et al 2022 compression, efficiency, pruning ICML This paper introduces DepthShrinker, a framework designed to enhance hardware efficiency of deep neural networks (DNNs) by transforming irregular computation patterns of compact operators into dense ones, thereby improving hardware utilization without sacrificing model accuracy. By leveraging insights that certain activation functions can be removed post training without loss of accuracy, DepthShrinker pioneers a compression paradigm that optimizes DNNs for real hardware efficiency, presenting a significant advancement in efficient model deployment. Link
Dimensionality Reduction by Learning an Invariant Mapping Raia Hadsell et al 2006 dimensionality reduction, mapping CVPR DrLIM, or Dimensionality Reduction by Learning an Invariant Mapping, addresses key limitations of existing dimensionality reduction techniques by learning a non linear function that maps high dimensional data to a low dimensional manifold based solely on neighborhood relationships, without requiring a predefined distance metric in input space. The method is distinguished by its ability to handle transformations and maintain invariant mappings, demonstrated through experiments that show its effectiveness in preserving neighborhood relationships and accurately mapping new, unseen samples to meaningful locations on the manifold. Unlike methods like LLE, which may struggle with variability and registration issues in input data, DrLIM's contrastive loss function ensures robustness by balancing similarity and dissimilarity in output space, offering a promising approach for applications requiring invariant mappings, such as learning positional information from image sequences in robotics. Link
Disentangling Trainability and Generalization in Deep Neural Networks Lechao Xiao et al 2020 neural tangent kernel, ntk ICML This study focuses on characterizing the trainability and generalization of deep neural networks, particularly under conditions of very wide and very deep architectures, leveraging insights from the Neural Tangent Kernel (NTK). By analyzing the NTK's spectrum, the study formulates necessary conditions for both memorization and generalization across architectures like Fully Connected Networks (FCNs) and Convolutional Neural Networks (CNNs). The research identifies key spectral quantities such as λmax, λbulk, κ, and P(Θ(l)) that critically influence the performance of deep networks, providing a precise theoretical framework validated by extensive experiments on CIFAR10. It highlights distinctions in generalization behavior between CNNs with and without global average pooling. Link
Finding Structure in Time Jeffrey Elman 1990 rnn Cognitive Science This is the original simple recurrent (Elman) network paper, an early demonstration of learning temporal structure with recurrent connections. Good insights on time dependent system learning. Link
E2 Train: Training State of the art CNNs with Over 80% Energy Savings Yue Wang et al 2019 cnn, batch, energy NeurIPS This paper introduces E2 Train, a framework for energy efficient CNN training on resource constrained platforms. It integrates three levels of energy saving techniques: data level stochastic mini batch dropping, model level selective layer updates, and algorithm level low cost, low precision back propagation. Experimental results on CIFAR 10 and CIFAR 100 demonstrate energy savings of over 90% and 84%, respectively, with minimal loss in accuracy, and real energy measurements on an FPGA validate its effectiveness for training ResNet models on the CIFAR datasets. Link
cuDNN: Efficient Primitives for Deep Learning Sharan Chetlur et al 2014 cuda, gpu Arxiv This paper introduces cuDNN, a library designed to optimize deep learning primitives akin to BLAS for HPC. cuDNN offers efficient implementations of key deep learning kernels tailored for GPUs, improving performance and reducing memory usage in frameworks like Caffe by up to 36%. Link
EIE: Efficient Inference Engine on Compressed Deep Neural Network Song Han et al 2016 compression, accelerator, co-design Arxiv This paper introduces EIE, an energy efficient inference engine designed for compressed deep neural networks, achieving significant energy savings by exploiting weight sharing, sparsity, and quantization. EIE performs sparse matrix vector multiplications directly on compressed models, enabling 189× and 13× faster inference speeds compared to CPU and GPU implementations of uncompressed DNNs. Link
An Empirical Analysis of Deep Network Loss Surfaces Daniel Jiwoong Im et al 2016 optimization, loss surface, saddle points Arxiv This paper empirically investigates the geometry of loss functions in state of the art neural networks, employing various stochastic optimization methods. Through visualizations in low dimensional subspaces, it explores how different optimization procedures lead to distinct local minima, even when algorithms are changed late in the optimization process. The study reveals that modifications to optimization procedures consistently yield different local minima, each affecting the network's performance on test examples differently. Interestingly, while different optimization algorithms find varied local minima from different initializations, the shape of the loss function around these minima remains characteristic to the algorithm used, with ADAM showing larger basins compared to vanilla SGD. Link
EyeCoD: Eye Tracking System Accelerator via FlatCam based Algorithm & Accelerator Co Design Haoran You et al 2023 accelerator, co-design, eye-tracking ACM This paper introduces EyeCoD, a lensless FlatCam based eye tracking system designed to overcome limitations of traditional systems, such as large form factor and high communication costs. By integrating a predict then focus algorithm pipeline and dedicated hardware accelerator, EyeCoD achieves significant reductions in computation and communication overhead while maintaining high tracking accuracy. Link
Eyeriss v2: A Flexible Accelerator for Emerging Deep Neural Networks on Mobile Devices Yu Hsin Chen et al 2019 efficiency, sparsity Arxiv This paper introduces Eyeriss v2, a specialized DNN accelerator architecture designed to efficiently handle compact and sparse neural networks. Unlike traditional DNN accelerators, Eyeriss v2 incorporates a hierarchical mesh network on chip to adapt to varying layer shapes and sizes, optimizing data reuse and bandwidth utilization. Eyeriss v2 excels in processing sparse data directly in the compressed domain, both for weights and activations, thereby enhancing processing speed and energy efficiency particularly suited for sparse models. Link
Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks Yu-Hsin Chen et al 2016 cnn, row-stationary, efficiency ACM/IEEE The paper addresses the high energy consumption of deep convolutional neural networks (CNNs) caused by extensive data movement, which persists despite parallel computing paradigms like SIMD/SIMT. It introduces a novel row-stationary (RS) dataflow designed for spatial architectures. RS maximizes local data reuse and minimizes data movement during convolutions by leveraging PE-local storage, inter-PE communication, and spatial parallelism. Link
Flat Minima Sepp Hochreiter et al 1997 flat minima, low complexity, gibbs Neural Computation The algorithm focuses on identifying "flat" minima of the error function in weight space. A flat minimum is characterized by a large connected region where the error remains approximately constant. This property suggests simplicity in the network structure and low expected overfitting, supported by an MDL-based Bayesian argument. Unlike traditional approaches that rely on Gaussian assumptions or specific weight priors, this algorithm uses a Bayesian framework with a prior over input-output functions. This approach considers both the network architecture and the training set, facilitating the identification of simpler and more generalizable models. Link
Fused-Layer CNN Accelerators Manoj Alwani et al 2016 cnn, accelerator, fusion IEEE This work introduces a novel approach to CNN accelerator design by fusing the computation of multiple convolutional layers. By rearranging the dataflow across layers, intermediate data can be cached on chip between adjacent layers, reducing the need for off-chip memory storage and minimizing data transfer. Specifically, the study demonstrates the effectiveness of this approach by implementing a fused-layer CNN accelerator for the first five convolutional layers of VGGNet-E. Using 362 KB of on-chip storage, the accelerator reduces off-chip feature-map traffic by 95%, from 77 MB to 3.6 MB per image processed. The strategy targets early convolutional layers, where data transfer typically dominates; by maximizing data reuse and minimizing off-chip memory usage, it improves the efficiency of CNN accelerators. Link
EnlightenGAN: Deep Light Enhancement Without Paired Supervision Yifan Jiang et al 2021 gan, enhancement, unsupervised IEEE Explores low-light to well-lit image generation using GANs without paired supervision. Also introduces an interesting global-local discriminator and a self-regularized perceptual loss, together with a simplified attention map that is essentially the inverse of the image's illumination. Link
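A tiny NumPy sketch of that simplified attention idea, assuming the illumination of an RGB image is approximated by its per-pixel max over channels (a stand-in, not necessarily the paper's exact definition):

```python
import numpy as np

def illumination_attention(rgb):
    """Invert a normalized illumination estimate so that darker regions
    receive larger attention weights."""
    illum = rgb.astype(np.float32).max(axis=-1) / 255.0   # H x W illumination proxy
    return 1.0 - illum                                     # H x W attention map

img = np.random.randint(0, 256, size=(4, 4, 3), dtype=np.uint8)
attn = illumination_attention(img)   # values near 1 where the image is dark
```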
Understanding the difficulty of training deep feedforward neural networks Xavier Glorot et al 2010 activation, saturation, initialization AISTATS The logistic sigmoid activation is problematic for deep networks because of its non-zero mean, which can drive the top hidden layer into saturation; this saturation slows learning and can cause training plateaus. The difficulty of training deep networks correlates with the singular values of each layer's Jacobian: when these deviate significantly from 1, activations and gradients flow poorly across layers, complicating training. The paper proposes a new initialization scheme (now known as Xavier/Glorot initialization) that keeps activation and gradient variances roughly constant across layers, yielding faster convergence. Link
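The proposed initialization is easy to write down; below is a short NumPy sketch of the uniform variant, with `fan_in`/`fan_out` denoting the layer's input and output widths:

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng=None):
    """Normalized ("Xavier") uniform initialization: the bound is chosen so
    that activation and gradient variances stay roughly constant across layers."""
    rng = rng or np.random.default_rng()
    bound = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-bound, bound, size=(fan_in, fan_out))

W = xavier_uniform(256, 128)   # weights for a 256 -> 128 fully connected layer
```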
Group Normalization Yuxin Wu et al 2018 normalization Arxiv Batch Normalization performs normalization along the batch dimension, which causes errors to increase rapidly as batch sizes decrease. This limitation makes BN less effective for training larger models and tasks that require smaller batches due to memory constraints. GN divides channels into groups and computes normalization statistics (mean and variance) independently within each group. Unlike BN, GN's computation is not dependent on batch sizes, leading to stable performance across a wide range of batch sizes. Link
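A minimal NumPy sketch of the normalization step for an (N, C, H, W) tensor, illustrating that the statistics never touch the batch dimension (learnable `gamma`/`beta` included for completeness):

```python
import numpy as np

def group_norm(x, num_groups, gamma, beta, eps=1e-5):
    """Normalize each sample within groups of C // num_groups channels;
    nothing depends on the batch size N."""
    n, c, h, w = x.shape
    xg = x.reshape(n, num_groups, c // num_groups, h, w)
    mean = xg.mean(axis=(2, 3, 4), keepdims=True)
    var = xg.var(axis=(2, 3, 4), keepdims=True)
    xg = (xg - mean) / np.sqrt(var + eps)
    return xg.reshape(n, c, h, w) * gamma.reshape(1, c, 1, 1) + beta.reshape(1, c, 1, 1)

x = np.random.randn(2, 8, 4, 4)
y = group_norm(x, num_groups=4, gamma=np.ones(8), beta=np.zeros(8))
```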
Singularity of the Hessian in Deep Learning Levent Sagun et al 2017 eigenvalues, hessian ICLR The bulk of the Hessian's eigenvalues concentrate around zero, indicating how overparametrized the model is; in deep learning, overparametrization often leads to better generalization despite higher computational cost. The edges of the eigenvalue distribution, scattered away from zero, reflect the complexity of the input data, which shapes the loss landscape and affects optimization difficulty. Second-order optimization methods, which leverage information from the Hessian, can potentially accelerate training and find better solutions by exposing the curvature of the loss landscape. The top discrete eigenvalues of the Hessian are driven by the data, suggesting that different datasets may require different optimization strategies or model architectures for optimal performance. Link
Long Short-Term Memory Sepp Hochreiter et al 1997 lstm, rnn Neural Computation The original paper on the LSTM. A classic, and demonstrated the power of gating. Link
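For reference, a compact NumPy sketch of one step of the now-standard LSTM cell; note that the forget gate shown here was added in later work (Gers et al.), so this is the modern formulation rather than the 1997 original:

```python
import numpy as np

def lstm_step(x, h, c, W, U, b):
    """One step of a (modern) LSTM cell.  W: (4H, D), U: (4H, H), b: (4H,);
    gate pre-activations are stacked as [input, forget, output, candidate]."""
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    H = h.shape[0]
    z = W @ x + U @ h + b                 # all four gates in one affine map
    i, f, o = sigmoid(z[:H]), sigmoid(z[H:2*H]), sigmoid(z[2*H:3*H])
    g = np.tanh(z[3*H:])                  # candidate cell update
    c_new = f * c + i * g                 # gated memory cell
    h_new = o * np.tanh(c_new)            # gated hidden output
    return h_new, c_new
```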
On the importance of initialization and momentum in deep learning Ilya Sutskever et al 2013 initialization, momentum ICML Traditionally, training DNNs and RNNs with stochastic gradient descent (SGD) with momentum was considered challenging due to poor gradient propagation and vanishing/exploding gradients, especially in networks with many layers or long-term dependencies. The paper demonstrates that a well-designed random initialization, combined with a carefully tuned momentum schedule, makes SGD with momentum surprisingly effective for training deep and recurrent networks. Link
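A small sketch contrasting the classical and Nesterov momentum updates, the two schemes analyzed in the paper; `grad` is assumed to be a user-supplied gradient function:

```python
def sgd_momentum_step(w, v, grad, lr=0.01, mu=0.9):
    """Classical momentum: the velocity accumulates an exponentially
    weighted sum of past gradients."""
    v = mu * v - lr * grad(w)
    return w + v, v

def nesterov_step(w, v, grad, lr=0.01, mu=0.9):
    """Nesterov momentum: the gradient is evaluated at the look-ahead
    point w + mu * v before the velocity is updated."""
    v = mu * v - lr * grad(w + mu * v)
    return w + v, v
```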
Algorithms for manifold learning Lawrence Cayton 2005 manifold learning, dimensionality reduction Arxiv Many datasets exhibit complex relationships that cannot be effectively captured by linear methods like Principal Component Analysis (PCA). Manifold hypothesis: despite high-dimensional appearances, data points often lie on or near a much lower-dimensional manifold embedded within the higher-dimensional space. Manifold learning aims to uncover this underlying low-dimensional structure to provide a more meaningful and compact representation of the data. Link
NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis Ben Mildenhall et al 2020 nerf, view synthesis, 3d, scene representation, volume rendering Arxiv The method uses a fully connected deep network to represent scenes as continuous volumetric functions. The network takes a 5D input (spatial location and viewing direction) and outputs volume density and view-dependent radiance. By querying these 5D coordinates along camera rays and employing differentiable volume rendering, the method synthesizes novel views of scenes. Link
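A short NumPy sketch of the compositing step only (the 5D MLP itself is omitted), assuming per-ray sample densities, colors, and inter-sample distances have already been computed:

```python
import numpy as np

def render_ray(sigmas, colors, deltas):
    """Discrete volume rendering along one ray: sigmas (S,) are densities,
    colors (S, 3) are radiance values, deltas (S,) are inter-sample distances."""
    alphas = 1.0 - np.exp(-sigmas * deltas)                           # per-sample opacity
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alphas)))[:-1]    # accumulated transmittance
    weights = trans * alphas
    return (weights[:, None] * colors).sum(axis=0)                    # composited RGB
```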
On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima Nitish Shirish Keskar et al 2017 sharp minima, large batch ICLR The study identifies a phenomenon where large-batch SGD methods tend to converge towards sharp minimizers of the training and testing functions. Sharp minima are associated with poorer generalization, meaning the model performs worse on unseen data. In contrast, small-batch methods consistently converge towards flat minimizers, a behavior attributed to the inherent noise in small-batch gradient estimates. Link
Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks Chen Zhang et al 2015 cnn, fpga, accelerator ACM The study employs quantitative analysis techniques, including loop tiling and loop transformation, to optimize the CNN accelerator design. These techniques aim to maximize computation throughput while minimizing resource utilization on the FPGA, in particular balancing logic resource usage against memory bandwidth. Link
Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation Kyunghyun Cho et al 2014 encoder decoder, machine translation Arxiv Introduces a novel neural network architecture called the RNN Encoder-Decoder, comprising two recurrent neural networks: one RNN serves as an encoder, converting a sequence of symbols into a fixed-length vector representation, and the other acts as a decoder, generating another sequence of symbols from the encoded representation. Link
Qualitatively Characterizing Neural Network Optimization Problems Ian Goodfellow et al 2015 optimization, visualization ICLR Demonstrates that contemporary neural networks can achieve minimal training error through direct training with stochastic gradient descent alone, without complex schemes like unsupervised pretraining. This finding challenges earlier beliefs about the difficulty of navigating non-convex optimization landscapes in neural network training. They also introduce a simple graphical tool, plotting the loss along a linear interpolation between two parameter settings (for example, the initial and final weights), to visualize the loss landscape. Link
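A minimal sketch of that visualization, assuming a user-supplied `loss_fn` over flattened parameter vectors: evaluate the loss along the straight line between two parameter settings.

```python
import numpy as np

def linear_path_losses(theta_a, theta_b, loss_fn, num_points=50):
    """Evaluate the loss along theta(alpha) = (1 - alpha) * theta_a + alpha * theta_b,
    the 1-D slice used to visualize the optimization problem."""
    alphas = np.linspace(0.0, 1.0, num_points)
    losses = [loss_fn((1.0 - a) * theta_a + a * theta_b) for a in alphas]
    return alphas, losses
```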
Language Models are Unsupervised Multitask Learners Alec Radford et al 2019 unsupervised, GPT Arxiv Demonstrates that language models, specifically GPT-2 trained on the WebText dataset, begin to learn various natural language processing tasks (question answering, machine translation, reading comprehension, summarization) without explicit task-specific supervision. For instance, when conditioned on a document and questions, the model achieves an F1 score of 55 on the CoQA dataset, matching or exceeding several baseline systems trained with over 127,000 examples. Link
On the difficulty of training Recurrent Neural Networks Razvan Pascanu et al 2013 exploding gradient, vanishing gradient, gradient clipping, normalization Arxiv Explanation of issues in RNNs (vanishing / exploding gradient) and proposal of gradient clipping. Link
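A minimal NumPy sketch of clipping a (flattened) gradient by its global norm, in the spirit of the proposed fix for exploding gradients:

```python
import numpy as np

def clip_by_norm(grad, threshold):
    """If the global gradient norm exceeds the threshold, rescale the gradient
    so its norm equals the threshold; otherwise leave it untouched."""
    norm = np.linalg.norm(grad)
    return grad * (threshold / norm) if norm > threshold else grad
```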
Learning representations by back propagating errors David Rumelhart et al 1986 backpropagation, learning procedure, convergence Nature The main paper for backprop. Link
The Shattered Gradients Problem: If resnets are the answer, then what is the question? David Balduzzi et al 2018 shattering, initialization ICML The paper identifies the "shattered gradients" problem in standard feedforward neural networks: gradients in these networks exhibit an exponential decay in correlation with depth, so they come to resemble white noise. In contrast, architectures with skip connections, such as highway networks and ResNets, show gradients whose correlation decays only sublinearly, indicating greater resilience against shattering. The paper also introduces a new initialization technique termed "Looks Linear" (LL) that addresses the shattered gradients issue. Preliminary experiments demonstrate that LL initialization enables the training of very deep networks without the skip connections used in ResNets or highway networks, offering a promising alternative for achieving stable gradient propagation in deep networks, potentially simplifying architectures and improving training efficiency. Link
A Simple Baseline for Bayesian Uncertainty in Deep Learning Wesley Maddox et al 2019 bayesian, uncertainty, gaussian NeurIPS SWAG combines Stochastic Weight Averaging (SWA) with Gaussian fitting to provide an approximate posterior distribution over neural network weights. SWA computes the first moment of the SGD iterates using a modified learning rate schedule; SWAG extends this by fitting a Gaussian whose mean is the SWA solution and whose covariance is a low-rank plus diagonal matrix derived from the SGD iterates. Link
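A rough NumPy sketch of the diagonal part of SWAG, assuming a list of flattened weight snapshots collected along the SGD trajectory; the low-rank covariance component from the paper is omitted here:

```python
import numpy as np

def swag_diag_moments(weight_snapshots):
    """First and diagonal second moments of SGD iterates: the mean is the SWA
    solution, and the clipped variance gives the diagonal covariance term."""
    W = np.stack(weight_snapshots)                 # (num_snapshots, num_params)
    mean = W.mean(axis=0)
    diag_var = np.clip((W ** 2).mean(axis=0) - mean ** 2, 0.0, None)
    return mean, diag_var

def sample_swag_diag(mean, diag_var, rng=None):
    """Draw one weight sample from the diagonal-Gaussian approximation."""
    rng = rng or np.random.default_rng()
    return mean + np.sqrt(diag_var) * rng.standard_normal(mean.shape)
```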
SmartExchange: Trading Higher-cost Memory Storage/Access for Lower-cost Computation Yang Zhao et al 2020 compression, accelerator, pruning, decomposition, quantization ACM/IEEE SmartExchange integrates sparsification/pruning, decomposition, and quantization into a unified algorithm. It enforces a structured DNN weight format in which each layer's weight matrix is represented as the product of a small basis matrix and a large sparse coefficient matrix whose nonzero elements are powers of 2. Link
On the Spectral Bias of Neural Networks Nasim Rahaman et al 2019 spectra, fourier analysis, manifold learning ICML Neural networks, particularly deep ReLU networks, exhibit a learning bias towards low-frequency functions. This bias means they tend to prioritize learning global variations over local fluctuations in data, a property that aligns with their ability to generalize well across different samples and datasets. Contrary to intuition, as the complexity of the data manifold increases, deep networks find it easier to learn higher-frequency functions. This suggests that while they naturally favor low-frequency patterns, they can also adapt to more complex data structures to capture higher-frequency variations. Link
Sequence to Sequence Learning with Neural Networks Ilya Sutskever et al 2014 seq2seq Arxiv The paper introduces an end-to-end approach for sequence learning using multilayered Long Short-Term Memory (LSTM) networks. The method requires minimal assumptions about sequence structure: one deep LSTM maps the input sequence to a fixed-dimensional vector, and another deep LSTM decodes the target sequence from that vector. Link
Tiled convolutional neural networks Quoc Le et al 2010 tiling, cnn NeurIPS Tiled CNNs introduce a novel approach to learning invariances by using a regular "tiled" pattern of tied weights. Unlike traditional CNNs where adjacent hidden units share identical weights, Tiled CNNs require only that hidden units at a certain distance from each other share tied weights. This allows the network to learn complex invariances such as scale and rotational invariance, in addition to translational invariance. Link
Unsupervised Learning of Image Manifolds by Semidefinite Programming Kilian Weinberger et al 2004 manifold learning, dimensionality reduction IEEE The paper proposes a new approach to detect low-dimensional structure in high-dimensional datasets using semidefinite programming (SDP). SDP is leveraged to analyze data that resides on or near a low-dimensional manifold, which is a common challenge in computer vision and pattern recognition. The algorithm introduced overcomes limitations observed in previous manifold learning techniques like Isomap and locally linear embedding (LLE). These traditional methods often struggle with certain types of data distributions or computational complexities, which the proposed SDP-based approach aims to address more effectively. Link