This list was curated by Lexington Whalen, from the first year of his PhD to its end. As he is me, I hope he keeps going!
I typically use this to organize papers I found interesting. Please feel free to do whatever you want with it. Note that this is not every single paper I have ever read, just a collection of ones that I remember to put down.
So far, we have read 208 papers. Let's keep it up!

Title | Author | Year | Topic | Venue | Description | Link |
---|---|---|---|---|---|---|
Think Before You Speak: Training Language Models with Pause Tokens | Sachin Goyal et al | 2024 | test-time compute, meta-tokens | Arxiv | This paper introduces "Pause Tokens" which are a way of appending a sequence of tokens to the input prefix, and then delaying the output until the last pause token is seen. | Link |
Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters | Charlie Snell et al | 2024 | test-time compute | Arxiv | This paper explores the question of "If an LLM is allowed to use a fixed but non-trivial amount of inference-time compute, how much can it improve its performance on a challenging prompt?". Good for references on various test-time compute strategies. | Link |
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads | Tianle Cai et al | 2024 | speculative decoding, drafting, llm | ICML | This paper presents Medusa, which augments LLM inference by adding extra decoding heads to predict multiple subsequent tokens in parallel. They also introduce a form of tree-based attention to process candidates. Through the Medusa heads, they obtain probability predictions for the subsequent K+1 tokens. These predictions enable them to create length-K+1 continuations as the candidates. In order to process multiple candidates concurrently, they structure their attention such that only tokens from the same continuation are regarded as historical data. For instance, Figure 2 shows an example where the first Medusa head generates its top two predictions while the second Medusa head generates a top three for each of the top two from the first head. Instead of filling the entire attention mask, they only consider the mask from these 2*3 = 6 tokens, plus the standard identity line. | Link |
Recurrent Drafter for Fast Speculative Decoding in Large Language Models | Yunfei Cheng et al | 2024 | speculative decoding, drafting, llm | Arxiv | This paper introduces ReDrafter (Recurrent Drafter), which uses an RNN as the draft model and conditions on the LLM's hidden states. They use a beam search to explore the candidate sequences and then apply a dynamic tree attention algorithm to remove duplicated prefixes among the candidates to improve the speedup. They also train via knowledge distillation from LLMs to improve the alignment of the draft model's predictions with those of the LLM. | Link |
QuIP: 2-Bit Quantization of Large Language Models With Guarantees | Jerry Chee et al | 2024 | quantization, block-wise | Arxiv | QuIP (quantization with incoherence processing) is a method based on the insight that quantization benefits from incoherent weight and Hessian matrices, meaning that it benefits from the weights being even in magnitude and from the directions in which they are rounded being unaligned with the coordinate axes. | Link |
BlockDialect: Block-wise Fine-grained Mixed Format for Energy-Efficient LLM Inference | Wonsuk Jang et al | 2024 | quantization, block-wise | Arxiv | This paper introduces a block-wise quantization scheme that assigns a per-block optimal number format from a format book (they make their own format book called "DialectFP4"). "Focusing on how to represent over how to scale". | Link |
SpinQuant: LLM Quantization with Learned Rotations | Zechun Liu et al | 2024 | quantization, spins, rotation | Arxiv | This paper uses two mergeable rotation matrices (R1, R2) that leave the full-precision network's output unchanged (rotational invariance), and then applies two online Hadamard rotations (R3, R4) to further reduce the outliers so they can quantize the activations and the KV cache. They then show how one can optimize these rotation matrices on Stiefel manifolds (orthogonal manifolds) using Cayley SGD. The reason for Cayley SGD and Stiefel manifolds is that they need to optimize rotation matrices (R1, R2) such that they stay orthogonal during optimization. Regular SGD would break this constraint. By optimizing on Stiefel manifolds (the space of all orthonormal matrices), they can specify that the optimization stays on a surface that only contains rotation matrices. | Link |
SnapKV: LLM Knows What You are Looking for Before Generation | Yuhong Li et al | 2024 | llm, kv cache | Arxiv | This paper identifies and selects the most important features per head to create a compressed KV cache. It works in two stages: 1) vote for important previous features: take the last segment of the prompt (the "observation window") and use it to analyze which parts of the earlier text (the prefix) are most important. For each attention head, the attention weights from queries in the observation window are aggregated, and the top-k positions are selected based on the aggregated weights (k=p*L_prefix, where p is the compression weight); 2) cluster and perform context preservation: a pooling layer is used to cluster the selected important features. The last part of the prompt is kept as the observation window because they note that the attention patterns observed in the last window of the input sequence have high overlap rates (~80-90%) with the actual attention patterns used during generation. | Link |
Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention | Angelos Katharopoulos et al | 2020 | attention, transformer | ICML | This paper rephrases transformers as RNNs (title). They express the self-attention mechanism as a linear dot-product of kernel feature maps to make the complexity go from O(N^2) to O(N). Personal note: this is the 200th paper recorded on here, and the last of 2024! Summer of 2024 was when I began studying machine learning. Let's keep it up! | Link |
Prefix-Tuning: Optimizing Continuous Prompts for Generation | Xiang Lisa Li et al | 2021 | prefix-tuning, prompting, llm | Arxiv | This paper proposes prefix-tuning, which keeps language model params frozen but optimizes a continuous task-specific vector (prefix). | Link |
The Power of Scale for Parameter-Efficient Prompt Tuning | Brian Lester et al | 2021 | prompting, llm | Arxiv | This paper explores adding soft prompts to condition frozen language models. Basically, soft prompts are learned through back-propagation and can be used to finetune language models without fully retraining. They also introduce the idea of "prompt ensembling" which is basically using multiple soft prompts on a model and ensembling their outputs. | Link |
Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs | Nguyen Nhat Minh et al | 2024 | sampling, llm | Arxiv | This paper introduces a neat trick to sample the next token. Min-p sampling basically adjusts the sampling threshold based on the model's confidence. It does so by scaling according to the top token's probability. This is a compelling alternative to other common sampling methods, like nucleus sampling. | Link |
LASER: Attention with Exponential Transformation | Sai Surya Duvvuri et al | 2024 | attention, gradients | Arxiv | This paper identifies that gradients backpropagated through the softmax operation often can be quite small. To mitigate this, they propose doing a dot-product attention with an exp()-transformed value matrix V (meaning, they do the attention calculation on exp(V)), which allows for a larger Jacobian (mitigating the small gradient issue). | Link |
Hyper-Connections | Defa Zhu et al | 2024 | residual connections, hyper-connections | Arxiv | This paper introduces hyper-connections, which is a novel alternative to residual connections. Basically, they introduce learnable depth and width connections. | Link |
Remix-DiT: Mixing Diffusion Transformers for Multi-Expert Denoising | Gongfan Fang et al | 2024 | dit, diffusion, moe | NeurIPS | This paper introduces a method of mixing diffusion models for multi-expert denoising. Basically, they increase the width of the linear layers by a factor of K, and then modify the forward pass to support it. This allows for K experts that are initialized from the original weights. | Link |
Hymba: A Hybrid-head Architecture for Small Language Models | Xin Dong et al | 2024 | llm, hybrid, meta-tokens | Arxiv | This paper introduces a family of small language models that have a hybrid-head architecture running attention and SSM heads in parallel. There are many interesting architectural designs to note here, but my favorite is the use of "meta tokens", learnable tokens that are prepended to prompts. These tokens help reduce the entropy of attention and SSM heads, and can be seen as a good initialization for the KV cache and the SSM state. | Link |
All are Worth Words: A ViT Backbone for Diffusion Models | Fan Bao et al | 2023 | diffusion, vit | Arxiv | This paper designs a general ViT-based architecture for diffusion models. Notably, it treats all inputs (time, condition, noisy image patches) as tokens and uses long skip connections between the shallow and deep layers. | Link |
SliceGPT: Compress Large Language Models by Deleting Rows and Columns | Saleh Ashkboos et al | 2024 | pruning, llm | ICLR | The authors propose a method of slicing off entire rows or columns of weight matrices. They do this by applying a transformation that leaves the predictions invariant prior to the slice. The authors also introduce the notion of "computational invariance", i.e. that one can apply orthogonal matrix transformations to each weight matrix in the transformer without changing the model, which they use to edit the blocks in a transformer to project the activation matrix between blocks onto its principal components, and then slice. They make the key insight that if you insert linear layers with the orthogonal matrix Q before RMSNorm and Q^{T} after, the network remains unchanged, i.e. RMSNorm(XQ)Q^{T} = RMSNorm(X). They also note that since LayerNorm can be converted to RMSNorm, LayerNorm is the same story. To find the Qs they use a calibration dataset from the training set and run it through the model. They then use the output of the network to find the orthogonal matrices of the next layers by computing the covariance matrix and then taking its eigendecomposition (read the paper for more). | Link |
Visual Autoregressive Modeling: Scaling Image Generation via Next-Scale Prediction | Keyu Tian et al | 2024 | tokens, reference model | NeurIPS | The paper proposes Visual AutoRegressive (VAR) modeling, which shifts the paradigm of autoregressive learning for image generation from sequential "next-token prediction" to "next-scale prediction." This approach treats entire token maps at progressively finer resolutions as the autoregressive units, reflecting the coarse-to-fine manner in which humans perceive images. Unlike traditional models that flatten 2D spatial structures into 1D sequences, VAR preserves spatial locality and leverages multi-scale visual representations to reduce computational inefficiencies. By adopting hierarchical generation aligned with natural image structures, VAR overcomes the limitations of standard autoregressive models, such as mathematical premise violations and loss of spatial coherence. Its design integrates autoregressive transformers with multi-scale tokenization, creating a framework that is theoretically scalable and generalizable across diverse visual generation tasks. | Link |
Rho-1: Not All Tokens Are What You Need | Zhenghao Lin et al | 2024 | tokens, reference model | NeurIPS | This paper scores tokens using a reference model and then trains a language model to focus on the tokens with higher scores. They find that they can improve performance while training on fewer tokens. | Link |
QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs | Saleh Ashkboos et al | 2024 | quantization, rotation | NeurIPS | This paper introduces a quantization scheme based on rotations that allows quantization down to 4 bits for weights, activations, and the KV cache. They rotate LLMs in a way that removes outliers from the hidden state without changing the output. In particular, they use randomized Hadamard transformations on the weight matrices to remove outlier features and make activations easier to quantize. They then extend this by applying online Hadamard transformations to the attention module to remove outlier features in keys and values, which allows the KV cache to be quantized. | Link |
Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch | Le Yu et al | 2024 | model merging | ICML | This paper shows that language models (LMs) can get new abilities by assimilating params from homologous models. They also note that LMs after Supervised Fine-Tuning (SFT) have many redundant delta parameters (i.e., the change in the model params between before and after SFT). They then present DARE (Drop And REscale) as a means of setting delta parameters to zero with a drop rate of p and then rescaling the remaining ones by a factor of 1/(1-p). They then use DARE to remove redundant delta parameters in each model prior to merging, which they find can help mitigate the interference of params among multiple models. Then they use standard model merging techniques to merge the models. | Link |
Training-Free Pretrained Model Merging | Zhengqi Xu et al | 2024 | model merging | CVPR | This paper introduces Merging under Dual-Space Constraints (MuDSC), a novel framework for merging pretrained neural network models without additional training or requiring the same pretraining initialization. Unlike prior approaches that operate solely in either the weight space or the activation space, MuDSC addresses inconsistencies between these two spaces by combining their similarity measures into a unified objective using a weighted linear combination. This approach ensures that merged units are similar in both their structure and behavior, leading to more consistent and effective merging outcomes. The framework also adapts to networks with group structures, such as those using multi-head attention or group normalization, by proposing modifications to unit-matching algorithms. Overall, MuDSC simplifies model merging while enhancing performance across diverse architectures and tasks, enabling merged models to achieve balanced and overlapping multi-task performance. | Link |
Similarity of Neural Network Representations Revisited | Simon Kornblith et al | 2019 | network similarity | ICML | This paper examines methods for comparing neural network representations and proposes Centered Kernel Alignment (CKA) as a more effective similarity measure. The authors provide key theoretical insights about what properties a similarity metric should have - arguing it should be invariant to orthogonal transformations and isotropic scaling, but not to arbitrary invertible linear transformations, as neural network training itself isn't invariant to such transformations. They show that for representations with more dimensions than training examples, any metric invariant to arbitrary invertible transformations will give meaningless results. CKA works by first measuring the similarity between every pair of examples in each representation separately (creating representational similarity matrices), then comparing these similarity structures - when using inner products, this reduces to computing normalized Hilbert-Schmidt Independence Criterion between the representations. They demonstrate theoretically that CKA is closely related to canonical correlation analysis (CCA) and regression, but incorporates feature scale information that CCA discards. Finally, they show that unlike previous methods like CCA and SVCCA, CKA can reliably identify corresponding layers between networks trained from different initializations and reveal meaningful relationships between different architectures. | Link |
What is being transferred in transfer learning? | Behnam Neyshabur et al | 2020 | transfer learning | NeurIPS | This paper investigated what exactly gets transferred during transfer learning in neural networks through a comprehensive series of analyses. Through experiments with block-shuffled images, the researchers demonstrated that successful transfer learning relies on both feature reuse and low-level statistics of the data, showing that even when visual features are disrupted, transfer learning still provides benefits. The study revealed that models fine-tuned from pre-trained weights tend to stay in the same basin in the loss landscape, make similar mistakes, and remain close to each other in parameter space, while models trained from scratch end up in different basins with more diverse behaviors. By analyzing module criticality, they found that lower layers handle more general features while higher layers become more specialized, confirming previous theories about feature hierarchy in neural networks. Finally, they showed that transfer learning can begin from earlier checkpoints of the pre-trained model without losing accuracy, suggesting that the benefits of pre-training emerge before the model fully converges on the source task. | Link |
ZipIt! Merging Models from Different Tasks without Training | George Stoica et al | 2024 | model merging | ICLR | This paper presents a novel approach to model merging that significantly improves upon previous methods by recognizing that similar features can exist within the same model, not just across different models. The key insight is that when merging models trained on different tasks, it's often better to combine similar features within each model first, rather than forcing dissimilar features from different models to merge, as these features may have developed to solve fundamentally different problems. Their method first concatenates the feature spaces of both models and computes a comprehensive correlation matrix between all features (both within and across models), using these correlations to guide intelligent feature merging decisions. To handle the multi-layer nature of neural networks, they introduce an "unmerge" operation that allows the merged features to remain compatible with later layers in both original networks, essentially decompressing the merged features before they're processed by subsequent layers. Theoretically, they prove that this approach provides better guarantees than traditional cross-model merging, showing that when models have internal redundancy (which is common in practice), their method can achieve perfect merging with zero performance loss. | Link |
TheoremLlama: Transforming General-Purpose LLMs into Lean4 Experts | Ruida Wang et al | 2024 | llm, llm agent | Arxiv | This research introduces TheoremLlama, a framework that transforms general-purpose Large Language Models (LLMs) into expert theorem provers for the Lean4 formal mathematics language, addressing a significant challenge in automated theorem proving. The key innovation is their "NL-FL bootstrapping" method, which integrates natural language reasoning steps directly into formal mathematical proofs as comments during training, helping LLMs bridge the gap between natural language understanding and formal mathematical reasoning. The researchers also contribute the Open Bootstrapped Theorems (OBT) dataset, containing over 100,000 theorem-proof pairs with aligned natural and formal language, helping address the scarcity of training data in this field. The framework introduces specialized training techniques like block training and curriculum learning that help LLMs gradually build theorem-proving capabilities, potentially offering a blueprint for adapting LLMs to other specialized domains that lack extensive training data. | Link |
A Simple Early Exiting Framework for Accelerating Sampling in Diffusion Models | Taehong Moon et al | 2024 | diffusion, early exit | ICML | This paper presents Adaptive Score Estimation (ASE), a novel framework that accelerates diffusion model sampling by adaptively allocating computational resources based on the time step being processed. The authors observe that score estimation near the noise distribution (t→1) requires less computational power than estimation near the data distribution (t→0), leading them to develop a time-dependent early-exiting scheme where more neural network blocks are skipped during the noise-phase sampling steps. Their approach differs between architectures - for DiT models they skip entire blocks, while for U-ViT models they preserve the linear layers connected to skip connections while dropping other block components to maintain the residual pathway information. The authors fine-tune their models using a specially designed training procedure that employs exponential moving averages and weighted coefficients to ensure minimal information updates near t→0 while allowing more updates near t→1. | Link |
Active Prompting with Chain-of-Thought for Large Language Models | Shizhe Diao et al | 2023 | prompting, cot | Arxiv | The paper introduces Active-Prompt, a novel method that improves chain-of-thought (CoT) prompting by strategically selecting which examples to use as demonstrations for large language models. Rather than using randomly selected or manually crafted examples, Active-Prompt identifies the most informative examples by measuring the model's uncertainty on different potential prompts through metrics like disagreement and entropy across multiple model outputs. The key insight is that by systematically choosing examples where the model shows high uncertainty, and then having humans provide detailed reasoning chains for those specific cases, the resulting prompts will be more effective at teaching the model how to approach challenging problems. This approach shifts the human effort from trying to intuitively guess good examples to a more principled selection process guided by the model's own uncertainty signals. | Link |
RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment | Hanze Dong et al | 2023 | rlhf, alignment | TMLR | The paper introduces RAFT (Reward rAnked FineTuning), a simpler alternative to RLHF for aligning generative models with human preferences. The key insight is decoupling the data generation and model fine-tuning steps - instead of using complex reinforcement learning, RAFT generates multiple samples for each prompt, ranks them by reward, and then fine-tunes the model on only the highest-scoring samples in an iterative process. This approach is more stable and efficient than RLHF because it uses standard supervised learning techniques rather than RL, while being less sensitive to reward scaling issues since it only uses relative rankings rather than absolute reward values. Additionally, the decoupled nature of RAFT means it requires less memory (only needs to load one model at a time vs multiple for RLHF) and allows more flexibility in data collection and processing. | Link |
Finding needles in a haystack: A Black-Box Approach to Invisible Watermark Detection | Minzhou Pan et al | 2024 | watermark, offset learning | Arxiv | The key insight of this paper centers on using "offset learning" to detect invisible watermarks in images. The intuition is that by having a clean reference dataset of similar images, you can effectively "cancel out" the normal image features that are common between clean and watermarked images, leaving only the watermark perturbations. They design an asymmetric loss function where clean images use exponential/softmax loss (to focus on hard examples) while detection dataset uses linear loss (to give equal weight to all examples), helping isolate the watermark signal. This is combined with an iterative pruning strategy that gradually removes likely-clean images from the detection set, allowing the model to better focus on and learn the watermark patterns. By formulating watermark detection this way, they avoid needing any prior knowledge of watermarking techniques or labeled data, making it a truly black-box approach. | Link |
Mitigating the Alignment Tax of RLHF | Yong Lin et al | 2024 | rlhf, alignment | Arxiv | This paper investigates the "alignment tax" problem where large language models lose some of their pre-trained abilities when aligned with human preferences through RLHF. The key insight is that model averaging (interpolating between pre-RLHF and post-RLHF model weights) is surprisingly effective at mitigating this trade-off because tasks share overlapping feature spaces, particularly in lower layers of the model. Building on this understanding, they propose Heterogeneous Model Averaging (HMA) which applies different averaging ratios to different layers of the transformer model, allowing optimization of the alignment-forgetting trade-off. The intuition is that since different layers capture different levels of features and task similarities, they should not be averaged equally, and finding optimal layer-specific averaging ratios can better preserve both alignment and pre-trained capabilities. | Link |
AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising | Zigeng Chen et al | 2024 | diffusion, parallelization, denoising | Arxiv | This paper introduces AsyncDiff, a novel approach to accelerate diffusion models through parallel processing across multiple devices. The key insight is that hidden states between consecutive diffusion steps are highly similar, which allows them to break the traditional sequential dependency chain of the denoising process by transforming it into an asynchronous one. They execute this by dividing the denoising model into multiple components distributed across different devices, where each component uses the output from the previous component's prior step as an approximation of its input, enabling parallel computation. To further enhance efficiency, they introduce stride denoising, which completes multiple denoising steps simultaneously through a single parallel computation batch and reduces the frequency of communication between devices. This solution is particularly elegant because it's universal and plug-and-play, requiring no model retraining or architectural changes to achieve significant speedups while maintaining generation quality. | Link |
DoRA: Weight-Decomposed Low-Rank Adaptation | Shih-Yang Liu et al | 2024 | peft, lora | Arxiv | This paper introduces DoRA (Weight-Decomposed Low-Rank Adaptation), a novel parameter-efficient fine-tuning method that decomposes pre-trained weights into magnitude and direction components for separate optimization. Through a detailed weight decomposition analysis, the authors reveal that LoRA and full fine-tuning exhibit distinct learning patterns, with LoRA showing proportional changes in magnitude and direction while full fine-tuning demonstrates more nuanced, independent adjustments between these components. Based on this insight, DoRA uses LoRA specifically for directional updates while allowing independent magnitude optimization, which simplifies the learning task compared to having LoRA learn both components simultaneously. The authors also provide theoretical analysis showing how this decomposition benefits optimization by aligning the gradient's covariance matrix more closely with the identity matrix and demonstrate mathematically why DoRA's learning pattern more closely resembles full fine-tuning. | Link |
SphereFed: Hyperspherical Federated Learning | Xin Dong et al | 2022 | federated learning | Arxiv | This paper presents a novel approach to addressing the non-i.i.d. (non-independent and identically distributed) data challenge in federated learning by introducing hyperspherical federated learning (SphereFed). The key insight is that instead of letting clients independently learn their classifiers, which leads to inconsistent learning targets across clients, they should share a fixed classifier whose weights span a unit hypersphere, ensuring all clients work toward the same learning objectives. The approach normalizes features to project them onto this same hypersphere and uses mean squared error loss instead of cross-entropy to avoid scaling issues that arise when working with normalized features. Finally, after federated training is complete, they propose a computationally efficient way to calibrate the classifier using a closed-form solution that can be computed in a distributed manner without requiring direct access to private client data. | Link |
A deeper look at depth pruning of LLMs | Shoaib Ahmed Siddiqui et al | 2024 | pruning, depth pruning, llm | ICML | This paper explored different approaches to pruning large language models, revealing that while static metrics like cosine similarity work well for maintaining MMLU performance, adaptive metrics like Shapley values show interesting trade-offs between different tasks. A key insight was that self-attention layers are significantly more amenable to pruning compared to feed-forward layers, suggesting that models can maintain performance even with substantial attention layer reduction. The paper also demonstrated that simple performance recovery techniques, like applying an average update in place of removed layers, can be as effective or better than more complex approaches like low-rank adapters. Finally, the work highlighted how pruning affects different tasks unequally - while some metrics preserve performance on certain tasks like MMLU, they may significantly degrade performance on others like mathematical reasoning tasks. | Link |
Editing Models with Task Arithmetic | Gabriel Ilharco et al | 2023 | task arithmetic, finetuning, task | ICLR | This paper introduces a novel method for model editing called task arithmetic, where "task vectors" represent specific tasks by capturing the difference between pre-trained and fine-tuned model weights. Task vectors can be manipulated mathematically, such as being negated to unlearn tasks or added together to enable multi-tasking or improve performance in novel settings. A standout finding is the ability to create new task capabilities through analogies (e.g., "A is to B as C is to D"), which allows performance improvement on tasks with little or no data. This method is computationally efficient, leveraging linear operations on model weights without incurring extra inference costs, providing a flexible and modular framework for modifying models post-training. The approach highlights significant advantages in adapting existing models while bypassing costly re-training or data access constraints. | Link |
SaySelf: Teaching LLMs to Express Confidence with Self-Reflective Rationales | Tianyang Xu et al | 2024 | confidence estimation, llm | Arxiv | The SaySelf framework trains large language models (LLMs) to produce fine-grained confidence estimates and self-reflective rationales by focusing on internal uncertainties. It consists of two stages: supervised fine-tuning and reinforcement learning (RL). In the first stage, multiple reasoning chains are sampled from the LLM, clustered for semantic similarity, and analyzed by an advanced LLM to generate rationales summarizing uncertainties. The model is fine-tuned on a dataset that pairs questions with reasoning chains, rationales, and confidence estimates, using a loss function that optimizes the generation of all three outputs. In the second stage, RL refines the confidence predictions using a reward function that encourages accurate, high-confidence outputs while penalizing overconfidence in incorrect responses. The framework ensures that LLMs not only generate confidence scores but also provide explanations for their uncertainty, making their outputs more interpretable and calibrated. | Link |
Deep Reinforcement Learning from Human Preferences | Paul F Christiano et al | 2017 | rl, rlhf | Arxiv | This paper introduces a method to train reinforcement learning (RL) systems using human preferences over trajectory segments rather than traditional reward functions. The approach allows agents to learn tasks that are hard to define programmatically, enabling non-expert users to provide feedback on agent behavior through comparisons of short video clips. By learning a reward model from these preferences, the method dramatically reduces the need for human oversight while maintaining adaptability to large-scale and complex RL environments. This paradigm bridges the gap between human-defined objectives and scalable RL systems, addressing challenges in alignment and usability for real-world applications. | Link |
The Cost of Down-Scaling Language Models: Fact Recall Deteriorates before In-Context Learning | Tian Jin et al | 2023 | pruning, icl | Arxiv | This paper explores the effects of scaling the parameter count of large language models (LLMs) on two distinct capabilities: fact recall from pre-training and in-context learning (ICL). By investigating both dense scaling (training models of varying sizes) and pruning (removing weights), the authors identify that these approaches disproportionately affect fact recall while preserving ICL abilities. They demonstrate that a model's ability to learn from in-context information remains robust under significant parameter reductions, whereas the ability to recall pre-trained facts degrades with even moderate scaling down. This dichotomy highlights a fundamental difference in how these capabilities rely on model size and opens avenues for more efficient model design and deployment, emphasizing trade-offs between memory augmentation and parameter efficiency. | Link |
Fine-Tuning Language Models with Just Forward Passes | Sadhika Malladi et al | 2024 | finetuning, zo, optimization | Arxiv | The paper introduces MeZO, a memory-efficient zeroth-order optimization method, to fine-tune large language models using forward passes alone. Classical zeroth-order methods scale poorly with model size, but MeZO adapts these approaches to leverage structured pre-trained model landscapes, avoiding catastrophic slowdown even with billions of parameters. The authors theoretically show that MeZO’s convergence depends on the local effective rank of the Hessian, not the number of parameters, enabling efficient optimization despite prior bounds suggesting otherwise. Furthermore, MeZO’s flexibility allows optimization of non-differentiable objectives (e.g., accuracy or F1 score) and compatibility with parameter-efficient tuning methods like LoRA and prefix-tuning. | Link |
ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference | Hanshi Sun et al | 2024 | kv cache | Arxiv | The key insight of this paper lies in optimizing long-context large language model inference by addressing the memory and latency bottlenecks associated with managing the key-value (KV) cache. The authors observe that pre-Rotary Position Embedding (RoPE) keys exhibit a low-rank structure, allowing them to be compressed without accuracy loss, while value caches lack this property and are therefore offloaded to the CPU to reduce GPU memory usage. To minimize decoding latency, they leverage landmarks—compact representations of the low-rank key cache—and identify a small set of outliers to be retained on the GPU, enabling efficient reconstruction of sparse KV pairs on-the-fly. This approach allows the system to handle significantly longer contexts and larger batch sizes while maintaining inference throughput and accuracy. | Link |
LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning | Rui Pan et al | 2024 | peft, finetuning, sampling | Arxiv | The key insight of this paper is the discovery of a skewed weight-norm distribution across layers during LoRA fine-tuning, where the majority of updates occur in the bottom (embedding) and top (language modeling head) layers, leaving middle layers underutilized. This highlights that different layers have varied importance and suggests that selectively updating layers could improve efficiency without sacrificing performance. Building on this, the authors propose Layerwise Importance Sampling AdamW (LISA), which randomly freezes most middle layers during training, using importance sampling to emulate LoRA’s fast learning pattern while avoiding its low-rank constraints. This approach achieves significant memory savings, faster convergence, and superior performance compared to LoRA and full-parameter fine-tuning, particularly in large-scale and domain-specific tasks. | Link |
SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion | Muyang Li et al | 2024 | quantization, diffusion | Arxiv | SVDQuant introduces a novel approach to 4-bit quantization of diffusion models by using a low-rank branch to absorb outliers in both weights and activations, making quantization more feasible at such aggressive bit reduction. The method first consolidates outliers from activations to weights through smoothing, then decomposes the weights using Singular Value Decomposition (SVD) to separate the dominant components into a 16-bit low-rank branch while keeping the residual in 4 bits. To make this practical, they developed an inference engine called Nunchaku that fuses the low-rank and low-bit branch kernels together, eliminating redundant memory access that would otherwise negate the performance benefits. The approach is designed to work across different diffusion model architectures and can seamlessly integrate with existing low-rank adapters (LoRAs) without requiring re-quantization. | Link |
One Weight Bitwidth to Rule Them All | Ting-Wu Chin et al | 2020 | quantization, bitwidth | Arxiv | This paper examines weight quantization in deep neural networks and challenges the common assumption that using the lowest possible bitwidth without accuracy loss is optimal. The key insight is that when considering model size as a constraint and allowing network width to vary, some bitwidths consistently outperform others - specifically, networks with standard convolutions work better with binary weights while networks with depthwise convolutions prefer higher bitwidths. The authors discover that this difference is related to the number of input channels (fan-in) per convolutional kernel, with higher fan-in making networks more resilient to aggressive quantization. Most surprisingly, they demonstrate that using a single well-chosen bitwidth throughout the network can outperform more complex mixed-precision quantization approaches when comparing networks of equal size, suggesting that the traditional focus on minimizing bitwidth without considering network width may be suboptimal. | Link |
Consistency Models | Yang Song et al | 2023 | diffusion, ode, consistency | ICML | This paper introduces consistency models, a new family of generative models that can generate high-quality samples in a single step while preserving the ability to trade compute for quality through multi-step sampling. The key innovation is training models to map any point on a probability flow ODE trajectory to its origin point, enforcing consistency across different time steps through either distillation from pre-trained diffusion models or direct training. The models support zero-shot data editing capabilities like inpainting, colorization, and super-resolution without requiring explicit training on these tasks, similar to diffusion models. The authors provide two training approaches - consistency distillation which leverages existing diffusion models, and consistency training which allows training from scratch without any pre-trained models, establishing consistency models as an independent class of generative models. | Link |
One Step Diffusion via ShortCut Models | Kevin Frans et al | 2024 | diffusion, ode, flow-matching | Arxiv | This paper introduces shortcut models, a new type of diffusion model that enables high-quality image generation in a single forward pass by conditioning the model not only on the timestep but also on the desired step size, allowing it to learn larger jumps during the denoising process. Unlike previous approaches that require multiple training phases or complex scheduling, shortcut models can be trained end-to-end in a single phase by leveraging a self-consistency property where one large step should equal two consecutive smaller steps, combined with flow-matching loss as a base case. The key insight is that by conditioning on step size, the model can account for future curvature in the denoising path and jump directly to the correct next point rather than following the curved path naively, which would lead to errors with large steps. The approach simplifies the training pipeline while maintaining flexibility in inference budget, as the same model can generate samples using either single or multiple steps after training. | Link |
Attention-Driven Training-Free Efficiency Enhancement of Diffusion Models | Hongjie Wang et al | 2024 | diffusion, training-free, attention, token pruning | CVPR | This paper introduces AT-EDM, a training-free framework to accelerate diffusion models by pruning redundant tokens during inference without requiring model retraining. The key innovation is a Generalized Weighted PageRank (G-WPR) algorithm that uses attention maps to identify and prune less important tokens, along with a novel similarity-based token recovery method that fills in pruned tokens based on attention patterns to maintain compatibility with convolutional layers. The authors also propose a Denoising-Steps-Aware Pruning (DSAP) schedule that prunes fewer tokens in early denoising steps when attention maps are more chaotic and less informative, and more tokens in later steps when attention patterns are better established. The overall approach focuses on making diffusion models more efficient by leveraging the rich information contained in attention maps to guide token pruning decisions while maintaining image generation quality. | Link |
Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks | Tim Salimans et al | 2016 | normalization, gradient descent | Arxiv | This paper introduces weight normalization, a simple reparameterization technique that decouples a neural network's weight vectors into their direction and magnitude by expressing w = (g/||v||)v, where g is a scalar and v is a vector. The key insight is that this decoupling improves optimization by making the conditioning of the gradient better - the direction and scale of weight updates can be learned somewhat independently, which helps avoid problems with pathological curvature in the optimization landscape. While inspired by batch normalization, weight normalization is deterministic and doesn't add noise to gradients or create dependencies between minibatch examples, making it well-suited for scenarios like reinforcement learning and RNNs where batch normalization is problematic. The authors also propose a data-dependent initialization scheme where g and bias terms are initialized to normalize the initial pre-activations of neurons, helping ensure good scaling of activations across layers at the start of training. | Link |
Applying Guidance in a Limited Interval Improves Sample and Distribution Quality in Diffusion Models | Tuomas Kynkäänniemi et al | 2024 | diffusion, cfg, guidance | Arxiv | This paper's key insight is that classifier-free guidance (CFG) in diffusion models should only be applied during a specific interval of noise levels in the middle of the sampling process, rather than throughout the entire sampling chain as traditionally done. The intuition is that guidance is harmful at high noise levels (where it causes mode collapse and template-like outputs), largely unnecessary at low noise levels, and only truly beneficial in the middle range. They demonstrate this theoretically using a 1D synthetic example where they can visualize how guidance at high noise levels causes sampling trajectories to drift far from the smoothed data distribution, leading to mode dropping. Beyond this theoretical demonstration, they propose a simple solution of making the guidance weight a piecewise function that only applies guidance within a specific noise level interval. | Link |
Cache Me if You Can: Accelerating Diffusion Models through Block Caching | Felix Wimbauer et al | 2024 | diffusion, caching, distillation | Arxiv | This paper introduces "block caching" to accelerate diffusion models by reusing computations across denoising steps. The key insight is that many layer blocks (particularly attention blocks) in diffusion models change very gradually during the denoising process, making their repeated computation redundant. The authors propose automatically determining which blocks to cache and when to refresh them based on measuring the relative changes in block outputs across timesteps. They also introduce a lightweight scale-shift adjustment mechanism that uses a student-teacher setup, where the student (cached model) learns additional scale and shift parameters to better align its cached block outputs with those of the teacher (uncached model), while keeping the original model weights frozen. | Link |
DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads | Guangxuan Xiao et al | 2024 | llm, kv cache, attention | Arxiv | The key insight of DuoAttention is the observation that attention heads in LLMs naturally fall into two distinct categories: retrieval heads that need to access the full context to make connections across long distances, and streaming heads that mainly focus on recent tokens and attention sinks. This dichotomy makes intuitive sense because not all parts of language processing require long-range dependencies - while some aspects like fact recall or logical reasoning need broad context, others like local grammar or immediate context processing can work with nearby tokens. The paper's approach of using optimization to identify these heads (rather than just looking at attention patterns) is clever because it directly measures the impact on model outputs, capturing the true functional role of each head rather than just its surface behavior. Finally, the insight to maintain two separate KV caches (full for retrieval heads, minimal for streaming heads) is an elegant way to preserve the model's capabilities while reducing memory usage, since it aligns the memory allocation with each head's actual needs. | Link |
Efficient Streaming Language Models with Attention Sinks | Guangxuan Xiao et al | 2024 | llm, kv cache, attention | ICLR | This paper introduces StreamingLLM, a framework that enables large language models to process infinitely long text sequences efficiently without fine-tuning, based on a key insight about "attention sinks." The authors discover that LLMs allocate surprisingly high attention scores to initial tokens regardless of their semantic relevance, which they explain is due to the softmax operation requiring attention scores to sum to one - even when a token has no strong matches in context, the model must distribute attention somewhere, and initial tokens become natural "sinks" since they're visible to all subsequent tokens during autoregressive training. Building on this insight, StreamingLLM maintains just a few initial tokens (as attention sinks) along with a sliding window of recent tokens, achieving up to 22.2x speedup compared to baselines while maintaining performance on sequences up to 4 million tokens long. Additionally, they show that incorporating a dedicated learnable "sink token" during model pre-training can further improve streaming capabilities by providing an explicit token for collecting excess attention. | Link |
MagicPIG: LSH Sampling for Efficient LLM Generation | Zhuoming Chen et al | 2024 | llm, kv cache | Arxiv | This paper challenges the common assumption that attention in LLMs is naturally sparse, showing that TopK attention (selecting only the highest attention scores) can significantly degrade performance on tasks that require aggregating information across the full context. The authors demonstrate that sampling-based approaches to attention can be more effective than TopK selection, leading them to develop MagicPIG, a system that uses Locality Sensitive Hashing (LSH) to efficiently sample attention keys and values. A key insight is that the geometry of attention in LLMs has specific patterns - notably that the initial attention sink token remains almost static regardless of input, and that query and key vectors typically lie in opposite directions - which helps explain why simple TopK selection is suboptimal. Their solution involves a heterogeneous system design that leverages both GPU and CPU resources, with hash computations on GPU and attention computation on CPU, allowing for efficient processing of longer contexts while maintaining accuracy. | Link |
Guiding a Diffusion Model with a Bad Version of Itself | Tero Karras et al | 2024 | diffusion, guidance | Arxiv | The paper makes two key contributions: First, they show that Classifier-Free Guidance (CFG) improves image quality not just through better prompt alignment, but because the unconditional model D0 learns a more spread-out distribution than the conditional model D1, causing the guidance term ∇x log(p1/p0) to push samples toward high-probability regions of the data manifold. Second, based on this insight, they introduce "autoguidance" - using a smaller, less-trained version of the model itself as the guiding model D0 rather than an unconditional model, which allows for quality improvements without reducing variation and works even for unconditional models. | Link |
LLM-Pruner: On the Structural Pruning of Large Language Models | Xinyin Ma et al | 2023 | llm, structural pruning | Arxiv | The authors introduce LLM-Pruner, a novel approach for compressing large language models that operates in a task-agnostic manner while requiring minimal access to the original training data. Their key insight is to first automatically identify groups of interdependent neural structures within the LLM by analyzing dependency patterns, ensuring that coupled structures are pruned together to maintain model coherence. The method then estimates the importance of these structural groups using both first-order gradients and approximated Hessian information from a small set of calibration samples, allowing them to selectively remove less critical groups while preserving the model's core functionality. Finally, they employ a rapid recovery phase using low-rank adaptation (LoRA) to fine-tune the pruned model with a limited dataset in just a few hours, enabling efficient compression while maintaining the LLM's general-purpose capabilities. | Link |
SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models | Guangxuan Xiao et al | 2023 | llm, quantization, activations | ICML | The key insight of SmoothQuant is that in large language models, while weights are relatively easy to quantize, activations are much harder due to outliers. They observed that these outliers persistently appear in specific channels across different tokens, suggesting that the difficulty could be redistributed. Their solution is to mathematically transform the model by scaling down problematic activation channels while scaling up the corresponding weight channels proportionally, which maintains mathematical equivalence while making both weights and activations easier to quantize. This "difficulty migration" approach allows them to balance the quantization challenges between weights and activations using a tunable parameter α, rather than having all the difficulty concentrated in the activation values. | Link |
ESPACE: Dimensionality Reduction of Activations for Model Compression | Charbel Sakr et al | 2024 | llm, dimensionality reduction, activations, compression | NeurIPS | Instead of decomposing weight matrices as done in previous work, ESPACE reduces the dimensionality of activation tensors by projecting them onto a pre-calibrated set of principal components using a static projection matrix P, where for an activation x, its projection is x̃ = PPᵀx. The projection matrix P is carefully constructed (using eigendecomposition of activation statistics) to preserve the most important components while reducing dimensionality, taking advantage of natural redundancies that exist in activation patterns due to properties like the Central Limit Theorem when stacking sequence/batch dimensions. During training, the weights remain uncompressed and fully trainable (maintaining model expressivity), while at inference time, the weight matrices can be pre-multiplied with the projection matrix (giving PᵀW) to achieve compression through matrix multiplication associativity: Y = WᵀX ≈ Wᵀ(PPᵀX) = (PᵀW)ᵀ(PᵀX). This activation-centric approach is fundamentally different from previous methods because it maintains full model expressivity during training while still achieving compression at inference time, and it takes advantage of natural statistical redundancies in activation patterns rather than trying to directly compress weights. | Link |
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model | Chunting Zhou et al | 2024 | diffusion, transformer, multi-modal | Arxiv | The key insight of this paper is that a single transformer model can effectively handle both discrete data (like text) and continuous data (like images) by using different training objectives for each modality within the same model. They introduce "Transfusion," which uses traditional language modeling (next token prediction) for text sequences while simultaneously applying diffusion modeling for image sequences, combining these distinct objectives into a unified training approach. The architecture employs a novel attention pattern that allows for causal attention across the entire sequence while enabling bidirectional attention within individual images, letting image patches attend to each other freely while maintaining proper causality for text generation. This unified approach avoids the need for separate specialized models or complex architectures while still allowing each modality to be processed according to its most effective paradigm. | Link |
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection | Jiawei Zhao et al | 2024 | lora, low-rank projection | ICML | This paper introduces GaLore, a memory-efficient approach for training large language models that exploits the inherent low-rank structure of gradients rather than imposing low-rank constraints on the model weights themselves. The key insight is that while weight matrices may need to be full-rank for optimal performance, their gradients naturally become low-rank during training due to the specific structure of backpropagated gradients in neural networks, particularly in cases where the batch size is smaller than the matrix dimensions or when the gradients follow certain parametric forms. Building on this observation, GaLore projects gradients into low-rank spaces for memory-efficient optimization while still allowing full-parameter learning, contrasting with previous approaches like LoRA that restrict the weight updates to low-rank spaces. By periodically switching between different low-rank subspaces during training, GaLore maintains the flexibility of full-rank training while significantly reducing memory usage, particularly in storing optimizer states. | Link |
Neural Discrete Representation Learning | Aaron van den Oord et al | 2017 | generative models, vae | NeurIPS | The key innovation of this paper is the introduction of the Vector Quantised-Variational AutoEncoder (VQ-VAE), which combines vector quantization with VAEs to learn discrete latent representations instead of continuous ones. Unlike previous approaches to discrete latent variables which struggled with high variance or optimization challenges, VQ-VAE uses a simple but effective nearest-neighbor lookup system in the latent space, along with a straight-through gradient estimator, to learn meaningful discrete codes. This approach allows the model to avoid the common posterior collapse problem where latents are ignored when paired with powerful decoders, while still maintaining good reconstruction quality comparable to continuous VAEs. The discrete nature of the latent space enables the model to focus on capturing important high-level features that span many dimensions in the input space (like objects in images or phonemes in speech) rather than local details, and these discrete latents can then be effectively modeled using powerful autoregressive priors for generation. | Link |
Improved Precision and Recall Metric for Assessing Generative Models | Tuomas Kynkäänniemi et al | 2019 | generative models, precision, recall | NeurIPS | This paper introduces an improved metric for evaluating generative models by separately measuring precision (quality of generated samples) and recall (coverage/diversity of generated distribution) using k-nearest neighbors to construct non-parametric manifold approximations of real and generated data distributions. The authors demonstrate their metric's effectiveness using StyleGAN and BigGAN, showing how it provides more nuanced insights than existing metrics like FID, particularly in revealing tradeoffs between image quality and variation that other metrics obscure. They use their metric to analyze and improve StyleGAN's architecture and training configurations, identifying new variants that achieve state-of-the-art results, and perform the first principled analysis of truncation methods. Finally, they extend their metric to evaluate individual sample quality, enabling quality assessment of interpolations and providing insights into the shape of the latent space that produces realistic images. | Link |
Generative Pretraining from Pixels | Mark Chen et al | 2020 | pretraining, gpt | PMLR | The paper demonstrates that transformer models can learn high-quality image representations by simply predicting pixels in a generative way, without incorporating any knowledge of the 2D structure of images. They show that as the generative models get better at predicting pixels (measured by log probability), they also learn better representations that can be used for downstream image classification tasks. The authors discover that, unlike in supervised learning where the best representations are in the final layers, their generative models learn the best representations in the middle layers - suggesting the model first builds up representations before using them to predict pixels. Finally, while their approach requires significant compute and works best at lower resolutions, it achieves competitive results with other self-supervised methods and shows that generative pre-training can be a promising direction for learning visual representations without labels. | Link |
Why Does Unsupervised Pre-Training Help Deep Learning? | Dumitru Erhan et al | 2010 | pretraining, unsupervised | JMLR | This paper argues that standard training schemes place parameters in regions of the parameter space that generalize poorly, while greedy layer-wise unsupervised pre-training allows each layer to learn a nonlinear transformation of its input that captures the main variations in the input, which acts as a regularizer: minimizing variance and introducing bias towards good initializations for the parameters. They argue that defining particular initialization points implicitly imposes constraints on the parameters in that it specifies which minima (out of many possible minima) of the cost function are allowed. They further argue that small perturbations in the trajectory of the parameters have a larger effect early on, and hint that early examples have larger influence and may trap model parameters in particular regions of parameter space corresponding to the arbitrary ordering of training examples (similar to the "critical period" in developmental psychology). | Link |
Improving Language Understanding by Generative Pre-Training | Alec Radford et al | 2018 | pretraining | Arxiv | The key insight of this paper is that language models can learn deep linguistic and world knowledge through unsupervised pre-training on large corpora of contiguous text, which can then be effectively transferred to downstream tasks. The authors demonstrate this by using a Transformer architecture that can capture long-range dependencies, pre-training it on a books dataset that contains extended narratives rather than shuffled sentences, making it particularly effective at understanding context. Their innovation extends to how they handle transfer learning - rather than creating complex task-specific architectures, they show that simple input transformations can adapt their pre-trained model to various tasks while preserving its learned capabilities. This elegant approach proves remarkably effective, with their single task-agnostic model outperforming specially-designed architectures on nine of the twelve natural language understanding tasks studied, suggesting that their pre-training method captures fundamental aspects of language understanding. | Link |
Learning Transferable Visual Models from Natural Language Supervision | Alec Radford et al | 2021 | CLIP | Arxiv | CLIP (Contrastive Language-Image Pre-training) works by simultaneously training two neural networks - one that encodes images and another that encodes text - to project their inputs into a shared multi-dimensional space where similar concepts end up close together. During training, CLIP takes a batch of image-text pairs and learns to identify which text descriptions actually match which images, doing this by maximizing the cosine similarity between embeddings of genuine pairs while minimizing similarity between mismatched pairs. The training data consists of hundreds of millions of (image, text) pairs collected from the internet, which helps CLIP learn broad visual concepts and their relationships to language without requiring hand-labeled data. What makes CLIP particularly powerful is its zero-shot capability - after training, it can make predictions about images it has never seen before by comparing them against any arbitrary text descriptions, rather than being limited to a fixed set of predetermined labels. | Link |
Adam: A Method for Stochastic Optimization | Diederik Kingma et al | 2015 | optimizers | ICLR | Adam combines momentum (through exponential moving average of gradients mt) and adaptive learning rates (through exponential moving average of squared gradients vt) to create an efficient optimizer, where mt captures the direction of updates while vt adapts the step size for each parameter based on its gradient history. The optimizer corrects initialization bias in these moving averages by scaling them with factors 1/(1-β₁ᵗ) and 1/(1-β₂ᵗ) respectively, giving the bias-corrected estimates m̂t and v̂t and ensuring unbiased estimates even in early training. The parameter update θt ← θt-1 - α·m̂t/(√v̂t + ϵ) is invariant to gradient scaling (up to the small ϵ term) because it uses the ratio m̂t/√v̂t, while the adaptive learning rate 1/√v̂t approximates the diagonal of the Fisher Information Matrix's square root, making it a more conservative version of natural gradient descent that works well with sparse gradients and non-stationary objectives. The hyperparameters β₁ = 0.9 and β₂ = 0.999 mean the momentum term considers roughly the last 10 steps while the variance term considers the last 1000 steps, allowing Adam to both move quickly in consistent directions while being careful in directions with high historical variance. | Link |
Simplifying Neural Networks by Soft Weight-Sharing | Steven Nowlan et al | 1992 | soft weight sharing, mog | Neural Computation | This paper tackles the challenge of penalizing complexity and preventing overfitting in neural networks. Traditional methods, like L2 regularization, penalize the sum of squared weights but can favor multiple weak connections over a single strong one, leading to suboptimal weight configurations. To address this, the authors propose a mixture of Gaussians (MoG) prior: a narrow Gaussian encourages small weights to shrink to zero, while a broad Gaussian preserves large weights essential for modeling the data accurately. By clustering weights into near-zero and larger groups, this data-driven regularization avoids forcing all weights toward zero equally and demonstrates better generalization on 12 toy tasks compared to early stopping and traditional squared-weight penalties. | Link |
DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models | Muyang Li et al | 2024 | diffusion, distributed inference | Arxiv | DistriFusion introduces *displaced patch parallelism*, where the input image is split into patches, each processed independently by different GPUs. To maintain fidelity and reduce communication costs, the method reuses activations from the previous timestep as context for the current step, ensuring interaction between patches without excessive synchronization. Synchronous communication is only used at the initial step, while subsequent steps leverage asynchronous communication, hiding communication overhead within computation. This technique allows each device to process only a portion of the workload efficiently, avoiding artifacts and achieving scalable parallelism tailored to the sequential nature of diffusion models. | Link |
Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching | Xinyin Ma et al | 2024 | diffusion, caching | Arxiv | This paper proposes interpolating between computationally inexpensive but suboptimal solutions and optimal but expensive ones by training a router to learn which layers of the diffusion transformer to cache. | Link |
Flash Attention | Tri Dao et al | 2022 | attention, transformer | Arxiv | This introduces FlashAttention, which is an IO-aware exact attention algo that uses tiling. Basically, they use tiling to prevent needing to put the large NxN attention matrix on GPU HBM; FlashAttention goes through blocks of the K and V matrices, loads them to on-chip SRAM, which increases speed! Neat! | Link |
Token Merging for Fast Stable Diffusion | Daniel Bolya et al | 2023 | diffusion, token merging | Arxiv | This paper seeks to apply ToMe (https://arxiv.org/pdf/2210.09461) to diffusion models, introducing techniques for token partitioning (by changing the way src and dst are chosen for merging) and a token unmerging operation (which basically sets the merged token to the average of the two original tokens, and then restores both tokens with that average). Remarkably, this works very well! | Link |
DeepCache: Accelerating Diffusion Models for Free | Xinyin Ma et al | 2023 | diffusion, cache | Arxiv | Similarly to Faster Diffusion (Senmao Li et al, 2024), this paper exploits the temporal redundancy in the denoising stages. They cache features across the UNet and reuse them through the skip branches / paths. Basically, for timesteps t and t+1 that are similar, we can cache some of the high level features between them and directly use them. Also smart! | Link |
Faster Diffusion: Rethinking the Role of UNet Encoder in Diffusion Models | Senmao Li et al | 2024 | diffusion, encoder | NeurIPS | This paper notes that the UNet encoder in diffusion models has similar output between timesteps. Thus, they seek to basically cyclically reuse encoder features for the decoder. Smart! | Link |
Improved Denoising Diffusion Probabilistic Models | Alex Nichol et al | 2021 | diffusion, precision, recall | Arxiv | This paper is the first to show that DDPMs can get competitive log-likelihoods. They use a reparameterization and a hybrid learning objective to more tightly optimize the variational lower bound, and find that their objective has less gradient noise during training. They use learned variances and find that they can get convincing samples using fewer steps. They also use the improved precision and recall metrics (Kynkaanniemi et al 2019) to show that diffusion models have higher recall for similar FID, which suggests they cover a large portion of the target distribution. They focused on optimizing log-likelihood (ll) as it is believed that optimizing ll forces the model to capture all modes of the data distribution (Razavi et al 2019). Henighan et al 2020 has also shown that small improvements in ll can dramatically impact sample quality / learned feature representations. The authors argue that fixing \sigma_{t} (as Ho et al 2020 does) is reasonable in terms of sample quality, but does not explain much about the ll. Thus, to improve ll they look for a better choice of \Sigma_{\theta}(x_{t},t), and choose to learn it. They note that it is better to parameterize the variance as an interpolation between \beta_{t} and \tilde{\beta}_{t} in the log domain. Remember that \beta_{t} is the noise schedule, which is typically a small value that increases over time following some schedule. \tilde{\beta}_{t} is a reparameterization of \beta_{t} used to simplify calculations. They are related via \alpha_{t}, which is 1-\beta_{t}. Finally, they note that a linear noise schedule destroys information faster than necessary, and propose a different noise schedule. Lots of insights! | Link |
Improving Training Efficiency of Diffusion Models via Multi-Stage Framework and Tailored Multi-Decoder Architecture | Huijie Zhang et al | 2024 | diffusion, multi-stage | CVPR | This paper proposes a multi-stage framework for diffusion models that uses a shared encoder and separate decoders for different timestep intervals, along with an optimal denoiser-based timestep clustering method, to improve training and sampling efficiency while maintaining or enhancing image generation quality. | Link |
Temporal Dynamic Quantization for Diffusion Models | Junhyuk So et al | 2023 | diffusion, quantization | NeurIPS | Temporal Dynamic Quantization (TDQ) addresses the challenge of quantizing diffusion models by dynamically adjusting quantization parameters based on the denoising time step. TDQ employs a trainable module consisting of frequency encoding, a multi-layer perceptron (MLP), and a SoftPlus activation to predict optimal quantization intervals for each time step. This module maps the temporal information to appropriate quantization parameters, allowing the method to adapt to the varying activation distributions across different stages of the diffusion process. By pre-computing these quantization intervals, TDQ avoids the runtime overhead associated with traditional dynamic quantization methods while still providing the necessary flexibility to handle the temporal dynamics of diffusion models. | Link |
Learning Efficient Convolutional Networks through Network Slimming | Zhuang Liu et al | 2017 | pruning, importance | CVPR | This paper introduces *network slimming*, a method to reduce the size, memory footprint, and computation of CNNs by enforcing channel-level sparsity without sacrificing accuracy. It works by identifying and pruning insignificant channels during training, leveraging the γ scaling factors in Batch Normalization (BN) layers to effectively determine channel importance. The approach introduces minimal training overhead and is compatible with modern CNN architectures, eliminating the need for specialized hardware or software. Using the BN layer’s built-in scaling properties makes this pruning efficient, avoiding redundant scaling layers or issues that arise from linear transformations in convolution layers. | Link |
Q-Diffusion: Quantizing Diffusion Models | Xiuyu Li et al | 2023 | diffusion, sampling | ICCV | This paper tackles the inefficiencies of diffusion models, such as slow inference and high computational cost, by proposing a post-training quantization (PTQ) method designed specifically for their multi-timestep process. The key innovation includes a *time step-aware calibration data sampling* approach, which uniformly samples inputs across multiple time steps to better reflect real inference data, addressing quantization errors and varying activation distributions without the need for additional data. Additionally, the paper introduces *shortcut-splitting quantization* to handle the bimodal activation distributions caused by the concatenation of deep and shallow feature channels in shortcuts, quantizing them separately before concatenation for improved accuracy with minimal extra resources. | Link |
Mixture of Efficient Diffusion Experts Through Automatic Interval and Sub-Network Selection | Alireza Ganjdanesh et al | 2024 | diffusion, sampling | Arxiv | This paper reduces the cost of sampling via pruning a pretrained diffusion model into a mixture of experts (MoE) for their respective time intervals, via a routing agent that predicts the architecture needed to generate the experts. | Link |
A Closer Look at Time Steps is Worthy of Triple Speed-Up for Diffusion Model Training | Kai Wang et al | 2024 | diffusion, sampling | Arxiv | This paper introduces SpeeD, a novel approach for accelerating the training of diffusion models without compromising performance. The authors analyze the diffusion process and identify three distinct areas: acceleration, deceleration, and convergence, each with different characteristics and importance for model training. Based on these insights, SpeeD implements two key components: asymmetric sampling, which reduces the sampling of less informative time steps in the convergence area, and change-aware weighting, which gives more importance to the rapidly changing areas between acceleration and deceleration. The authors' key insight is that not all time steps in the diffusion process are equally valuable for training, with the convergence area providing limited benefits despite occupying a large proportion of time steps, while the rapidly changing area between acceleration and deceleration is crucial but often undersampled. To address this, SpeeD introduces an asymmetric sampling strategy using a two-step probability function: $P(t) = \begin{cases} \frac{k}{T + \tau(k-1)}, & 0 < t \leq \tau \\ \frac{1}{T + \tau(k-1)}, & \tau < t \leq T \end{cases}$, where τ is a carefully selected threshold marking the beginning of the convergence area, k is a suppression intensity factor, T is the total number of time steps, and t is the current time step. This function increases sampling probability before τ and suppresses it after. Additionally, SpeeD employs a change-aware weighting scheme based on the gradient of the process increment's variance, assigning higher weights to time steps with faster changes. By combining these strategies, SpeeD aims to focus computational resources on the most informative parts of the diffusion process, potentially leading to significant speedups in training time without sacrificing model quality. | Link |
HyperGAN: A Generative Model for Diverse, Performant Neural Networks | Neale Ratzlaff et al | 2019 | gan, ensemble | ICML | This paper introduces HyperGAN, a novel generative model designed to learn a distribution of neural network parameters, addressing the issue of overconfidence in standard neural networks when faced with out-of-distribution data. Unlike traditional approaches, HyperGAN doesn't require restrictive prior assumptions and can rapidly generate large, diverse ensembles of neural networks. The model employs a unique "mixer" component that projects prior samples into a correlated latent space, from which layer-specific generators create weights for a deep neural network. Experimental results show that HyperGAN can achieve competitive performance on datasets like MNIST and CIFAR-10 while providing improved uncertainty estimates for out-of-distribution and adversarial data compared to standard ensembles. NOTE: There has actually been a diffusion variant of this idea: https://arxiv.org/pdf/2402.13144 | Link |
Diffusion Models Already Have a Semantic Latent Space | Mingi Kwon et al | 2023 | diffusion, latent space | ICLR | This paper introduces Asymmetric Reverse Process (Asyrp), a method that discovers a semantic latent space (h-space) in pretrained diffusion models, enabling controlled image manipulation with desirable properties such as homogeneity, linearity, and consistency across timesteps, while also proposing a principled design for versatile editing and quality enhancement in the generative process. The authors propose Asymmetric Reverse Process (Asyrp). It modifies only the P_{t} term while preserving the D_{t} term in the reverse process. This makes sense because it a) breaks the destructive interference seen in previous methods, b) allows for controlled modification of the generation process towards target attributes, and c) maintains the overall structure and quality of the diffusion process. | Link |
One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale | Fan Bao et al | 2023 | diffusion, multi-modal | ICML | The authors present UniDiffuser, a method of sampling from joint and conditional distributions using a small modification to diffusion models. The method handles multiple modalities (such as images and text) within a single diffusion model. Here is in general what they do: 1. Perturb data in all modalities: For a given data point (x_{0},y_{0}), where x_{0} is an image and y_{0} is text, UniDiffuser adds noise to both simultaneously. The noisy versions are represented as x_{t_{x}} and y_{t_{y}}, where t_{x} and t_{y} are the respective timesteps. 2. Use of individual timesteps for different modalities: Instead of using a single timestep t for both modalities, UniDiffuser uses separate timesteps t_{x} and t_{y}. This allows for more flexibility in handling the different characteristics of each modality. 3. Predicting noise for all modalities simultaneously: UniDiffuser uses a joint noise prediction network \epsilon_{\theta}(x_{t_{x}},y_{t_{y}},t_{x},t_{y}) that takes in the noisy versions of both modalities and their respective timesteps. The network then outputs predicted noise for both modalities in one forward pass. | Link |
Diffusion Models as a Representation Learner | Xingyi Yang et al | 2023 | diffusion, representation learner | ICCV | This paper studies the representations learned by pretrained diffusion models and proposes RepFusion, a knowledge transfer method that extracts representations from an off-the-shelf diffusion model at different timesteps and uses them as supervision for student networks on recognition tasks, with the timestep to distill from chosen via reinforcement learning. | Link |
Masked Diffusion Transformer is a Strong Image Synthesizer | Shanghua Gao et al | 2023 | diffusion, masking, transformer | ICCV | This paper (smartly!) notices that one of the major reasons for long training and poor results of diffusion models is the lack of fast learning of relationships. For instance, they remark on the learning of one eye of a dog before both eyes. They propose to mask the input image in the latent space and learn how to predict the masks, and then diffuse these masks. Brilliant! | Link |
Generative Modeling by Estimating Gradients of the Data Distribution | Yang Song et al | 2019 | diffusion, score matching | NeurIPS | This paper introduces Noise Conditional Score Networks (NCSNs), a novel approach to generative modeling that learns to estimate the score function of a data distribution at multiple noise levels. NCSNs are trained using score matching, avoiding the need to compute normalizing constants, and generate samples using annealed Langevin dynamics. The method addresses challenges in modeling complex, high-dimensional data distributions, particularly for data lying on or near low-dimensional manifolds. | Link |
LAPTOP-Diff: Layer Pruning and Normalized Distillation for Compressing Diffusion Models | Dingkun Zhang et al | 2024 | diffusion, pruning | Arxiv | This paper proposes layer pruning and normalized distillation for compressing diffusion models. They use a surrogate function and show that their surrogate implies a property called "additivity", where the output distortion caused by many perturbations approximately equals the sum of the output distortions caused by each single perturbation. They then show that layer selection can be formulated as a 0-1 Knapsack problem. They then analyze which objective matters for retraining, and see that there is an imbalance in previous feature distillation approaches employed in the retraining phase. They note that the L2-Norms of feature maps at the end of different stages and the values of different feature loss terms vary significantly; for instance, the highest loss term is ~10k times greater than the lowest one throughout the distillation process, and produces about 1k times larger gradients. This dilutes the gradients of the numerically insignificant feature loss terms. So, they opt to normalize the feature loss. | Link |
Classifier-Free Diffusion Guidance | Jonathan Ho et al | 2022 | diffusion, guidance | NeurIPS | This paper introduces classifier-free guidance, a novel technique for improving sample quality in conditional diffusion models without using a separate classifier. Unlike traditional classifier guidance, which relies on gradients from an additional classifier model, classifier-free guidance achieves similar results by combining score estimates from jointly trained conditional and unconditional diffusion models. The method involves training a single neural network that can produce both conditional and unconditional score estimates, and then using a weighted combination of these estimates during the sampling process. This approach simplifies the training pipeline, avoids potential issues associated with training classifiers on noisy data, and eliminates the need for adversarial attacks on classifiers during sampling. The authors demonstrate that classifier-free guidance can achieve a similar trade-off between Fréchet Inception Distance (FID) and Inception Score (IS) as classifier guidance, effectively boosting sample quality while reducing diversity. The key difference is that classifier-free guidance operates purely within the generative model framework, without relying on external classifier gradients. This method provides an intuitive explanation for how guidance works: it increases conditional likelihood while decreasing unconditional likelihood, pushing generated samples towards more characteristic features of the desired condition. | Link |
LD-Pruner: Efficient Pruning of Latent Diffusion Models using Task-Agnostic Insights | Thibault Castells et al | 2024 | pruning, diffusion, ldm | CVPR | This paper presents LD-Pruner. The main interesting part is how they frame the pruning problem. Basically, they define an "operator" (any fundamental building block of a net, like convolutional layers, activation functions, transformer blocks), and try to either 1) remove it or 2) replace it with a less demanding operation. As they operate on the latent space, this work can be applied to any generation task that uses latent diffusion (task agnostic). It is interesting to note their limitations: the approach does not extend to pruning the decoder, and it does not consider dependencies between operators (which is a big deal I think). Finally, their score function seems a bit arbitrary (maybe this could be learned?). | Link |
RoFormer: Enhanced Transformer with Rotary Position Embedding | Jianlin Su et al | 2021 | attention, positional embedding | Arxiv | This paper introduces Rotary Position Embedding (RoPE), a method for integrating positional information into transformer models by using a rotation matrix to encode absolute positions and incorporating relative position dependencies. | Link |
GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models | Alex Nichol et al | 2022 | text-conditioned diffusion, inpainting | Arxiv | This paper explores text-conditional image synthesis using diffusion models, comparing CLIP guidance and classifier-free guidance, and finds that classifier-free guidance produces more photorealistic and caption-aligned images. | Link |
LLM Inference Unveiled: Survey and Roofline Model Insights | Zhihang Yuan et al | 2024 | llms, survey | Arxiv | This paper surveys some recent advancements in LLM inference, like speculative decoding or operator fusion. They also analyze the findings using the Roofline model, making it likely the first paper to do so for LLM inference. Good for checking out other papers that have recently been published. | Link |
An Empirical Study of Mamba-based Language Models | Roger Waleffe et al | 2024 | mamba, llms, transformer | Arxiv | This paper compares Mamba-based, Transformer-based, and hybrid-based language models in a controlled setting where sizes and datasets are larger than the past (8B-params / 3.5T tokens). They find that Mamba and Mamba-2 lag behind Transformer models on copying and in-context learning tasks. They then see that a hybrid architecture of 43% Mamba, 7% self attention, and 50% MLP layers performs better than all others. | Link |
Diffusion Models Beat GANs on Image Synthesis | Prafulla Dhariwal et al | 2021 | diffusion, gan | Arxiv | This work demonstrates that diffusion models surpass the current state-of-the-art generative models in image quality, achieved through architecture improvements and classifier guidance, which balances diversity and fidelity. The model attains FID scores of 2.97 on ImageNet 128×128 and 4.59 on ImageNet 256×256, matching BigGAN-deep with as few as 25 forward passes while maintaining better distribution coverage. Additionally, combining classifier guidance with upsampling diffusion models further enhances FID scores to 3.94 on ImageNet 256×256 and 3.85 on ImageNet 512×512. | Link |
Progressive Distillation for Fast Sampling of Diffusion Models | Tim Salimans et al | 2022 | diffusion, distillation, sampling | ICLR | Diffusion models excel in generative modeling, surpassing GANs in perceptual quality and autoregressive models in density estimation, but they suffer from slow sampling times. This paper introduces two key contributions: new parameterizations that improve stability with fewer sampling steps and a distillation method that progressively reduces the number of required steps by half each time. Applied to benchmarks like CIFAR-10 and ImageNet, the approach distills models from 8192 steps down to as few as 4 steps, maintaining high image quality while offering a more efficient solution for both training and inference. | Link |
On Distillation of Guided Diffusion Models | Chenlin Meng et al | 2023 | diffusion, classifier-free guidance | Arxiv | Classifier-free guided diffusion models are effective for high-resolution image generation but are computationally expensive during inference due to the need to evaluate both conditional and unconditional models many times. This paper proposes a method to distill these models into faster ones by learning a single model that approximates the combined outputs, then progressively reducing the number of sampling steps. The approach significantly accelerates inference, generating images with comparable quality to the original model using as few as 1-4 denoising steps, achieving up to 256× speedup on datasets like ImageNet and LAION. | Link |
Diffusion Probabilistic Models Made Slim | Xingyi Yang et al | 2022 | diffusion, dpms, spectral diffusion | Arxiv | Diffusion Probabilistic Models (DPMs) produce impressive visual results but suffer from high computational costs, limiting their use on resource-limited platforms. This paper introduces Spectral Diffusion (SD), a lightweight model designed to address DPMs' bias against high-frequency generation, which smaller networks struggle to capture. SD incorporates wavelet gating for frequency dynamics and spectrum-aware distillation to enhance high-frequency recovery, achieving 8-18× computational efficiency while maintaining competitive image fidelity. | Link |
Structural Pruning for Diffusion Models | Gongfan Fang et al | 2023 | diffusion, pruning | NeurIPS | Generative modeling has advanced significantly with Diffusion Probabilistic Models (DPMs), but these models often require substantial computational resources. To address this, Diff-Pruning is introduced as a compression method that reduces the computational load by pruning unnecessary diffusion steps, using a Taylor expansion to identify key weights without extensive re-training. Empirical results show that Diff-Pruning can cut FLOPs by around 50%, while maintaining consistent generative performance at only 10-20% of the original training cost. | Link |
Diffusion Models: A Comprehensive Survey of Methods and Applications | Ling Yang et al | 2024 | diffusion, survey | ACM | Diffusion models are a powerful class of deep generative models known for their success in tasks like image synthesis, video generation, and molecule design. This survey categorizes diffusion model research into efficient sampling, improved likelihood estimation, and handling specialized data structures, while also discussing the potential for combining them with other generative models. The review highlights their broad applications across fields such as computer vision, NLP, temporal data modeling, and interdisciplinary sciences, suggesting areas for further exploration. | Link |
GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium | Martin Heusel et al | 2017 | gan, equilibrium, fid, is | NeurIPS | This paper introduces a two time-scale update rule (TTUR) for GANs, and proves that this makes GANs converge to a local Nash equilibrium. More cited is the FID score introduced here. FID improves on IS by comparing the distributions of real and generated images directly. This is done by using the Inception model to extract features from images and then assuming these features follow a multidimensional Gaussian distribution. FID measures the difference between the two Gaussians (representing the real and generated images) using the Frechet distance, which effectively captures differences in the mean and covariance (the first two moments) of the distributions. The Gaussian is chosen as it is the maximum entropy distribution for a given mean and covariance (proof: https://medium.com/mathematical-musings/how-gaussian-distribution-maximizes-entropy-the-proof-7f7dcb2caf4d) -- maximum entropy is important, because this means that the Gaussian makes the fewest additional assumptions about the data, making sure the model is as non-committal as possible given the available information. In practice, we compute the mean and covariance of the real and generated image features, and then compute the FID score from them using the Frechet AKA Wasserstein-2 distance. | Link |
Scalable Diffusion Models with Transformers | William Peebles et al | 2023 | diffusion, ddpm, dit | CVPR | The authors explore using transformers in the latent space, rather than U-Nets. They find that their methods can lead to lower FID scores compared to prior SOTA. In this paper, their image generation pipeline is roughly: 1) Input high resolution image x 2) Encode z = E(x), where E is a pre-trained frozen VAE encoder, and z is the latent representation 3) The DiT model operates on z 4) New latent representation z’ is sampled from the diffusion model 5) We then decode z’ using the pre-trained frozen VAE decoder D, and x’ is now the generated high resolution image. | Link |
Max-Affine Spline Insights Into Deep Network Pruning | Haoran You et al | 2022 | early-bird, lottery-hypothesis, pruning, low-precision | TMLR | The authors make connections between spline theory (AKA, considering DNNs as continuous piecewise affine mappings) and pruning. Basically, they say that pruning removes redundant decision boundaries in layers that are pruned, and that we can compare the decision boundaries of unpruned networks to their pruned counterparts to show this (they have some nice visualizations). They also note that the final decision boundary does not always depend on existing subdivision lines. Finally, they demonstrate another way of finding EB tickets using this spline formulation. | Link |
Drawing Early-Bird Tickets: Towards More Efficient Training of Deep Networks | Haoran You et al | 2020 | early-bird, lottery-hypothesis, pruning, low-precision | ICLR | The authors show that there exist early-bird (EB) tickets: small, but critical subnetworks of dense randomly initialized networks, that can be found using low-cost training schemes (low precision, early stopping). They also design a practical low compute method for finding these. They use mask distance. Basically, for each pruning iteration, a binary mask is created. This mask represents which parts of the network are kept (the "ticket", or pruned subnet) and which parts are removed. They then consider the scaling factor "r" in BN layers as an indicator of significance. This r is learned during training and is used to scale normalized activations. The magnitude of r is an indicator of how important the channel is to the network's performance. After deciding which channels to prune based on r, the binary mask is created. If the channel is kept (not pruned), it is marked as 1 in the mask. Else, 0. For any two subnets, they then compute the "mask distance" (AKA Hamming distance) between the two ticket masks. They measure the mask distance between consecutive epochs and draw EB tickets when such distance is smaller than some threshold. | Link |
Learning both Weights and Connections for Efficient Neural Networks | Song Han et al | 2015 | pruning, compression, regularization | NeurIPS | The authors show a method of pruning neural networks in three steps: 1) train the network to learn what connections are important, 2) prune unimportant connections, 3) retrain and fine-tune. In order to train for learning what connections are important, they do not focus on learning the final weight values, but rather just focus on the importance of connections. They don't explicitly mention how this is done, but one could look at the Hessian of the loss or the magnitude of the weights. I'd imagine you could do this within only a few training iterations. In their "Regularization" section, it is interesting to note that L1 regularization (penalizes non-zero params resulting in more params near zero) gave better accuracy after pruning, but before retraining. But, these remaining connections are not as good as with using L2. The authors also present a discussion of what dropout rate to use. | Link |
Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference | Jiaming Tang et al | 2024 | KV cache, sparsity, LLM | ICML | Long context LLM inference is slow and the speed decreases significantly as sequence lengths grow. This is mainly due to needing to load a big KV cache during self-attention. Prior works have used methods that evict tokens from the KV cache to promote sparsity, but the Han lab (smartly!) found that the criticality of tokens strongly correlates with the current query token. Thus, they employ a query-aware KV cache selection method that retains the entire KV cache (since tokens that look unimportant now may be needed to handle future queries), while selecting only the top K tokens most relevant to a particular query. This allows for speedups in self-attention at low cost to accuracy. | Link |
BigNAS: Scaling Up Neural Architecture Search with Big Single-Stage Models | Jiahui Yu et al | 2020 | NAS, one-shot | Arxiv | Most NAS frameworks train some one-shot model to rank the quality of different child architectures. However, these rankings often are different than reality, so frameworks typically finetune architecture after finding them. BigNAS proposes that this fine-tuning / post-processing is not necessary. They find some interesting points, such as that "big models converge faster while small child models converge slower". Thus, at some training step t when the performance of a big model peaks, the small child models are not yet fully-trained, and at a t' where the child models are fully trained, the big model is overfitting. Thus, they use an exponentially decaying with constant ending learning rate scheduler, which has constant learning rate at the end of training when it reaches 5% of initial learning rate. Another point they bring up is a "coarse-to-fine" strategy where one first finds a rough sketch of promising network candidates, and then samples multiple finer grained variations around the sketch of interest. | Link |
Meta-Learning of Neural Architectures for Few-Shot Learning | Thomas Elsken et al | 2021 | NAS, meta-learning, few-shot, fsl | Arxiv | The authors propose MetaNAS, which is the first method that fully integrates NAS with gradient-based meta-learning. Basically, they combine gradient-based NAS methods like DARTS with meta-learning, so that the architecture itself is meta-learned jointly with the weights. Their goal is thus: meta-learn an architecture \alpha_{meta} with corresponding meta-learned weights w_{meta}. When given a new task \mathcal{T}_{i}, both \alpha_{meta} and w_{meta} adapt quickly to \mathcal{T}_{i} based on a few samples. One interesting technique they use is a temperature term that is annealed to 0 over the course of task training; this helps with sparsity of the mixture weights of the operations when using the DARTS search. | Link |
MetAdapt: Meta-Learned Task-Adaptive Architecture for Few-Shot Classification | Sivan Doveh et al | 2020 | NAS, meta-learning, few-shot, fsl | Arxiv | The authors propose a method using a DARTS-like search for FSL architectures. "Our goal is to learn a neural network where connections are controllable and adapt to the few-shot task with novel categories... However, unlike DARTS, our goal is not to learn a one time architecture to be used for all tasks... we need to make our architecture task adaptive so it would be able to quickly rewire for each new target task." Basically, they design a module called a MetAdapt Controller that changes the connections in the main network according to the given task. | Link |
Distilling the Knowledge in a Neural Network | Geoffrey Hinton et al | 2015 | distillation, ensemble, MoE | Arxiv | The first proposal of knowledge distillation. The main interesting point I found was that they raise the temperature of the softmax to allow for softer targets. This allows for understanding which 2's look like 3's (in an MNIST example). Basically, it adds a sort of regularization, since more information can be carried in these softer targets compared to a single 0 or 1. They also propose the idea of having an ensemble of models, and then learning a distilled model that is smaller. The biological example of a clumsy larva that then becomes a more specialized bug was good. | Link |
HyperTuning: Toward Adapting Large Language Models without Back-propagation | Jason Phang et al | 2023 | hypernetworks, adaptation, tuning, LoRA, LLMs | ICML | The authors show that we can use a hypernetwork for model adaptation in order to generate task-specific parameters. They try two approaches: generating prefixes and generating LoRA parameters for a frozen T5 model using few-shot examples. They also note the importance of hyperpretraining, i.e., an additional stage to adapt the hypernet to generate parameters for the downstream model, and they propose a scheme for this. NOTE! "We also observe a consistent trend where HyperT5-Prefix outperforms HyperT5-LoRA. We speculate that it is easier for hypermodels to learn to generate soft prefixes as compared to LoRA weights..." | Link |
Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning | Armen Aghajanyan et al | 2020 | fine-tuning, intrinsic dimension, lora | Arxiv | Large models with billions of parameters can be fine-tuned using only a few hundred examples. Why is this? Furthermore, large models often allow for significant sparsification, which implies that there is much redundancy. This paper targets both of these ideas, by showing that many common models have an "intrinsic dimension" much less than the full parameterization. | Link |
LoRA: Low-Rank Adaptation of Large Language Models | Edward Hu et al | 2021 | low rank adaptation, lora, llm, fine-tuning | Arxiv | Fine-tuning large models is expensive, because we update all the original parameters. Taking inspiration from Aghajanyan et al, 2020 (pre-trained language models have a low "intrinsic dimension"), the LoRA authors hypothesize that the weight updates also have low intrinsic rank. Thus, they decompose Delta W = BA, where B and A are low-rank matrices. Both A and B are trainable. They initialize A with a Gaussian and B as zero, so Delta W = BA is zero initially. They then optimize and find this method to be more efficient in terms of both time and space. | Link |
Learning to Compress Prompts with Gist Tokens | Jesse Mu et al | 2023 | llms, prompting, compression, tokens | NeurIPS | The authors describe a method of using a distilling function G (similar to a hypernet) that is able to compress LM prompts into a smaller set of "gist" tokens. These tokens can then be cached and reused. The neat trick is that they reuse the LM itself as G, so gisting itself incurs no additional training cost. Note that in their "Failure Cases" section, they mention "... While it is unclear why only the gist models exhibit this behavior (i.e. the fail example behavior), these issues can likely be mitigated with more careful sampling techniques." | Link |
Once-For-All: Train One Network and Specialize it For Efficient Deployment | Han Cai et al | 2020 | nas, supernets | ICLR | The authors propose training one large supernetwork and then sampling subnetworks as an approach for NAS. This method allows for the simultaneous generation of many different subnetworks that could satisfy different constraints (e.g. hardware, latency, accuracy, etc). The authors also propose a progressive shrinking method to train the net (start by training the big supernet, then progressively shrink down), which can be seen as a generalized pruning method. Furthermore, they introduce an idea of training a twin neural network to help estimate latency / accuracy given some architecture, which allows for fast feedback when conducting the search for subnetworks. | Link |
Dataless Knowledge Fusion by Merging Weights | Xisen Jin et al | 2023 | knowledge fusion, weight merging | ICLR | The paper introduces RegMean, a method for merging pre-trained language models from different datasets by solving a linear optimization problem, which improves generalization across domains without requiring the original training data. Compared to existing methods like Simple Averaging and Fisher Averaging, RegMean offers higher computational efficiency and comparable memory overhead, while achieving better or equivalent performance across various natural language tasks, including out-of-domain generalization. The method is evaluated using GLUE datasets and demonstrates superior performance in most tasks, outperforming traditional model ensembling and multi-task learning approaches. | Link |
Superposition of Many Models into One | Cheung et al | 2019 | superposition, online learning, tasks, continual learning | NeurIPS | The authors provide a method of storing multiple models using only one set of parameters via parameter superposition; it shares similarities with superposition in Fourier analysis for signal processing. | Link |
Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation | Yoshua Bengio et al | 2013 | gradients, stochasticity, backpropagation | Arxiv | The authors introduce several methods of estimation / propagation for networks that have stochastic neurons. This is used often in networks that are quantization-aware, as they sometimes have decision boundaries in the neurons that are not differentiable in the usual sense. The paper also introduces the "Straight Through Estimator", which was actually first introduced in one of Hinton's lectures. One interesting idea they present (that I think may have also been introduced in Kingma's VAE paper?) is that we can model the output h_{i} of some stochastic neuron as the application of a deterministic function that also depends on some noise source z_{i}: h_{i} = f(a_{i},z_{i}). TLDR: Straight through units are typically the go-to due to ease of use and good performance. | Link |
DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients | Shuchang Zhou et al | 2018 | quantization, cnn, gradients | Arxiv | The authors introduce a method to train CNNs with low bitwidth weights and activations using low bitwidth parameter gradients. They use deterministic quantization for weights and activations, while stochastically quantizing gradients. Note that they do not quantize the weights of the first CNN layer for the most part, as they noted that it would often degrade performance (Han et al. 2015 also note a similar thing). Another interesting thing they do is add noise to the gradient after quantization to increase performance. This paper also uses the straight through estimator (Bengio et al 2013) for propagating gradients when using their quantization scheme. | Link |
Training Deep Neural Networks with 8-bit Floating Point Numbers | Naigang Wang et al | 2018 | quantization, floating-point, precision | NeurIPS | The authors show that it is possible to train DNNs with 8-bit fp values while maintaining decent accuracy. To do this, they make a new FP8 format, develop a technique called "chunk-based computations" that allows matrix and convolution ops to be computed using 8-bit multiplications and 16-bit additions, and use fp stochastic rounding in weight updates. One interesting point they make is that swamping (the issue of truncation in large-to-small number addition) is a serious problem in DNN bit-precision reduction. | Link |
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference | Benoit Jacob et al | 2017 | quantization, quantization schemes, efficient inference, floating-point | Arxiv | The authors propose a quantization scheme that allows us to only use integer arithmetic to approximate fp computations in a neural network. They also describe a training approach that simulates the effect of quantization in the forward pass. Backprop still occurs, but all weights and biases are stored in fp. The forward prop pass then simulates quantized inference by rounding off using the quantization scheme they describe that changes fp to int. | Link |
PACT: Parameterized Clipping Activation for Quantized Neural Networks | Jungwook Choi et al | 2018 | quantization, clipping, activations | ICLR | The authors present a method of quantization by clipping activations using a learnable parameter, alpha. They show that this can lead to lower decreases in accuracy compared to other quantization methods. They also note that activations have been hard to quantize compared to weights in the past. They also prove that PACT is as expressive as ReLU, by showing it can reach the same solution as ReLU if SGD is used. They also describe the hardware benefits that can be incurred. | Link |
SMASH: One-Shot Model Architecture Search through Hypernetworks | Andrew Brock et al | 2017 | hypernetworks, nas, one-shot, few-shot | Arxiv | The authors propose a technique to speed up NAS by using a hypernet. Basically, they train a hypernet to generate weights of a main model that has variable architecture. The input to the hypernet is a binarized representation of model architecture. The hypernet takes this representation in, and then outputs weights. They then train only for a few epochs, and compare the validation scores obtained across different representations. Then, they fully train the model that had the best validation score. | Link |
Example-based Hypernetworks for Multi-source Adaptation to Unseen Domains | Tomer Volk et al | 2023 | hypernetworks, multi-source adaptation, unseen domains, NLP | EMNLP | The authors apply hypernets to unsupervised domain adaptation in NLP. They use example-based adaptation. The main idea is that they use an encoder-decoder to initially create the unique signatures from an input example, and then they embed it within the source domain's semantic space. The signature is then used by a hypernet to generate the task classifier's weights. The paper focuses on improving generalization to unseen domains by explicitly modeling the shared and domain specific characteristics of the input. To allow for parameter sharing, they propose modeling based on hypernets, which allow soft weight sharing. | Link |
Meta-Learning via Hypernetworks | Dominic Zhao et al | 2020 | hypernetworks, meta-learning | NeurIPS | The authors propose a soft weight-sharing hypernet architecture that performs well on meta-learning tasks. A good paper to show efforts in meta-learning with regards to hypernets, and comparing them to SOTA methods like Model-Agnostic Meta-Learning (MAML). | Link |
HyperDynamics: Meta-Learning Object and Agent Dynamics with Hypernetworks | Zhou Xian et al | 2021 | hypernetworks, meta-learning, dynamics | ICLR | The authors present a dynamics meta-learning framework which conditions on an agent's interactions with the environment and (optionally) the visual input from it. From this, they can generate params of a neural dynamics model. The three modules they use are 1) an encoding module that encodes a few agent-environment interactions / the agent's visual observations into a feature code, 2) a hypernet that conditions on the latent feature code to generate params of a dynamics model dedicated to this observed system, and 3) a target dynamics model that is built from the generated parameters, takes a low-dim system state / agent action as input, and outputs the prediction of the next system state. | Link |
Principled Weight Initialization for Hypernetworks | Oscar Chang et al | 2020 | hypernetworks, weight initialization | ICLR | Classical weight initialization techniques don't really work on hypernets, because they fail to produce weights for the mainnet in the correct scale. The authors derive formulas for hyperfan-out and hyperfan-in weight initialization, and show that it works well for the mainnet. | Link |
Continual Learning with Hypernetworks | Johannes von Oswald et al | 2020 | hypernetworks, continual learning, meta learning | ICLR | The authors present a method of preventing catastrophic forgetting, by using task-conditioned hypernets (i.e., hypernets that generate weights of target model based on some task embedding). Thus, rather than memorizing many data characteristics, we can split the problem into just learning a single point per task, given the task embedding. | Link |
Stochastic Hyperparameter Optimization through Hypernetworks | Jonathan Lorraine et al | 2018 | hypernetworks, hyperparameters | ICLR | Using hypernetworks to learn hyperparameters. They replace the training optimization loop in favor of a differentiable hypernetwork to allow for tuning of hyperparameters using grad descent. | Link |
Playing Atari with Deep Reinforcement Learning | Volodymyr Mnih et al | 2013 | q-learning, reinforcement learning | Arxiv | The authors present the first deep learning model that successfully learns control policies directly from high-dimensional sensory input using reinforcement learning, and they teach it to play Atari 2600 games using Q-learning. Their goal was to create one net that can play as many games as possible. | Link |
Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Encoding | Song Han et al | 2016 | quantization, encoding, pruning | ICLR | A three-pronged approach to compressing nets. They prune networks, then quantize and share weights, and then apply Huffman encoding. | Link |
Binarized Neural Networks: Training Neural Networks with Weights and Activations Constrained to +1 or -1 | Matthieu Courbariaux et al | 2016 | quantization, efficiency, binary | Arxiv | Introduction of training Binary Neural Networks, or nets with binary weights and activations. They also present experiments on deterministic vs stochastic binarization. They use the deterministic one for the most part, except for activations. | Link |
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks | Mingxing Tan et al | 2019 | efficiency, scaling | ICML | A study of model scaling is presented. They propose a novel scaling method that uniformly scales all dimensions of depth/width/resolution using a compound coefficient; for instance, if you want to use 2^{N} times more compute resources, then you can scale by their coefficients to do so. They also quantify the relationship between width, depth, and resolution. | Link |
The wake-sleep algorithm for unsupervised neural networks | Geoffrey Hinton et al | 1995 | representation, generative | Science | One of the first generative neural networks that kind of resembles diffusion. | Link |
ViTCoD: Vision Transformer Acceleration via Dedicated Algorithm and Accelerator Co-Design | Haoran You et al | 2022 | vit, accelerator, attention | Arxiv | Co-design for ViTs. Prunes and polarizes attention maps to have denser/sparser patterns. Development of a hardware accelerator as well. | Link |
Evolving Neural Networks through Augmenting Topologies | Kenneth O. Stanley et al | 2002 | nas, evolution | Evolutionary Computation | Evolution for NAS. | Link |
A Brief Review of Hypernetworks in Deep Learning | Vinod Kumar Chauhan et al | 2024 | hypernetwork | Arxiv | Review of hypernets. | Link |
HyperNetworks | David Ha et al | 2016 | hypernetwork | Arxiv | Looking at HyperNetworks: networks that generate weights for other networks. | Link |
Deep Learners Benefit More from Out-of-Distribution Examples | Yoshua Bengio et al | 2024 | ood | ICML | Evidence that ood samples can help learning. They also argue that intermediate levels of representation can benefit the models in multi-task settings. | Link |
Balanced Data, Imbalanced Spectra: Unveiling Class Disparities with Spectral Imbalance | Chiraag Kaushik et al | 2024 | spectra, class imbalance | ICML | Introduction of the idea of "spectral imbalance", which can affect classification accuracy even when classes are balanced. Basically, they look at how the distributions of eigenvalues in different classes affect classification accuracy. | Link |
DeepArchitect: Automatically Designing and Training Deep Architectures | Renato Negrinho et al | 2017 | nas | Arxiv | Proposal of a language to describe neural network architectures, which can then be represented as trees to search through. They show different search methods for going through the trees (Monte Carlo tree search, random, use of a surrogate function, etc.). | Link |
Graph neural networks: A review of methods and applications | Jie Zhou et al | 2020 | gnn | AI Open | What graph neural networks are, what they are made of, and how to train them, with examples. They describe a general design pipeline (find graph structure, specify graph type and scale, design loss function) and explain the main modules in GNNs (propagation module to propagate information between nodes, sampling module to conduct the propagation, pooling module to extract information from nodes). | Link |
1D convolution neural networks and applications: A survey | Serkan Kiranyaz et al | 2020 | cnn, survey | Mechanical Systems and Signal Processing | A brief overview of applications of 1D CNNs is performed. It is largely focused on medicine (for instance, ECG) and fault detection (for instance, vibration based structural damage). | Link |
2 in 1 Accelerator: Enabling Random Precision Switch for Winning Both Adversarial Robustness and Efficiency | Yonggan Fu et al | 2021 | quantization, accelerator | ACM | The most interesting point of this paper (among many things!) is the smart idea to use quantization as a way to boost DNN robustness. Cool! | Link |
Token Picker: Accelerating Attention in Text Generation with Minimized Memory Transfer via Probability Estimation | Junyoung Park et al | 2024 | efficiency, hardware, accelerator, attention | DAC | In autoregressive models with attention, off-chip memory accesses need to be minimized. The authors note that there have been efforts to prune unimportant tokens, but these do not do much for removing tokens with attention scores near zero. The authors (smartly!) notice this issue, and provide a fast method of deciding whether or not to prune a token based on an estimate of the probability that the token is important. An architecture for this is also provided. | Link |
Maxout Networks | Ian Goodfellow et al | 2013 | dropout, maxout | ICML | The authors note that dropout is "most effective when taking relatively large steps in parameter space. In this regime, each update can be seen as making a significant update to a different model on a different subset of the training set". I really liked that quote. They then develop the maxout unit, which essentially takes the maximum across some number of affine transformations, allowing for learning of piecewise linear approximations of nonlinear functions. | Link |
Geometric deep learning: Going beyond Euclidean data | Michael Bronstein et al | 2017 | geometric deep learning | IEEE SIG | Provides an overview of geometric deep learning, which are methods of generalizing DNNs to non Euclidean domains (graphs, manifolds, etc). | Link |
Sampling in Constrained Domains with Orthogonal-Space Variational Gradient Descent | Ruqi Zhang et al | 2022 | variational gradient descent, gradient flow | NeurIPS | The authors propose a new variational framework called O-Gradient for sampling in implicitly defined constrained domains, using two orthogonal directions to drive the sampler towards the domain and explore it by decreasing a KL divergence. They prove the convergence of O-Gradient and apply it to both Langevin dynamics and Stein Variational Gradient Descent (SVGD), demonstrating its effectiveness on various machine learning tasks. | Link |
Entropy-MCMC: Sampling from Flat Basins with Ease | Bolian Li et al | 2024 | sampling, bayesian, flat basins | ICML | The authors propose a practical MCMC algorithm for sampling from flat basins of DNN posterior distributions, using a guiding variable based on local entropy to steer the sampler. They prove the fast convergence rate of their method compared to existing flatness-aware methods and demonstrate its superior performance on various tasks through comprehensive experiments. The method is mathematically simple and computationally efficient, making it suitable as a drop-in replacement for standard sampling methods like SGLD. | Link |
AdderNet: Do We Really Need Multiplications in Deep Learning? | Hanting Chen et al | 2021 | multiplication-less, efficiency | CVPR | The authors show that with a cost of accuracy you can use additions instead of multiplications. They mainly tested CNNs. | Link |
Explaining and Harnessing Adversarial Examples | Ian Goodfellow et al | 2015 | adversarial examples | ICLR | Training on adversarial examples (made by adding "small but intentionally worst-case perturbations to examples from the dataset") proves to be an interesting method to train models. The authors also (smartly!) describe a method to generate adversarial examples by a linear method. | Link |
Identifying and attacking the saddle point problem in high-dimensional non-convex optimization | Yann Dauphin et al | 2014 | saddle points, optimization | NeurIPS | The authors argue that saddle points, rather than local minima, are the primary challenge in minimizing non-convex error functions in high-dimensional spaces, based on insights from various scientific fields and empirical evidence. They explain that saddle points surrounded by high-error plateaus can significantly slow down learning and create the illusion of local minima, particularly in high-dimensional problems of practical interest. To address this challenge, the authors propose a new approach called the saddle-free Newton method, designed to quickly escape high-dimensional saddle points, unlike traditional gradient descent and quasi-Newton methods. | Link |
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift | Sergey Ioffe et al | 2015 | batch, normalization | PMLR | The authors identify internal covariate shift as a significant challenge in training deep neural networks, where the distribution of each layer's inputs changes during training due to parameter updates in previous layers. To address this issue, they propose Batch Normalization, a method that normalizes layer inputs as part of the model architecture, performing normalization for each training mini-batch. Batch Normalization enables the use of much higher learning rates, reduces sensitivity to initialization, and acts as a regularizer, sometimes eliminating the need for Dropout. | Link |
Bayesian Deep Learning and a Probabilistic Perspective of Generalization | Andrew Wilson et al | 2020 | bayesian, marginalization | NeurIPS | The authors emphasize that marginalization, rather than using a single set of weights, is the key distinguishing feature of a Bayesian approach, which can significantly improve the accuracy and calibration of modern deep neural networks. They demonstrate that deep ensembles provide an effective mechanism for approximate Bayesian marginalization and propose a related approach that further improves the predictive distribution by marginalizing within basins of attraction. The paper investigates the prior over functions implied by a vague distribution over neural network weights, explaining neural network generalization from a probabilistic perspective and showing that seemingly mysterious results (like fitting random labels) can be reproduced with Gaussian processes. The authors demonstrate that Bayesian model averaging mitigates the double descent phenomenon, leading to monotonic performance improvements as model flexibility increases. | Link |
A Practical Bayesian Framework for Backpropagation Networks | David MacKay et al | 1992 | bayesian | Neural Computation | The authors present a quantitative and practical Bayesian framework for learning mappings in feedforward networks, enabling objective comparisons between different network architectures and providing stopping rules for network pruning or growing procedures. This framework allows for objective selection of weight decay terms or regularizers, measures the effective number of well determined parameters in a model, and provides quantified estimates of error bars on network parameters and outputs. The approach helps detect poor underlying assumptions in learning models and demonstrates a good correlation between generalization ability and Bayesian evidence for well matched learning models. | Link |
Bayesian Neural Network Priors Revisited | Vincent Fortuin et al | 2022 | bayesian, priors | ICLR | Isotropic Gaussian priors are the standard for modern Bayesian neural network inference, but their accuracy and optimal performance are uncertain. By studying summary statistics of neural network weights trained with stochastic gradient descent (SGD), the authors find that CNN and ResNet weights exhibit strong spatial correlations, while FCNNs display heavy tailed weight distributions. Incorporating these observations into priors improves performance on various image classification datasets, mitigating the cold posterior effect in FCNNs but slightly increasing it in ResNets. | Link |
Hands-on Bayesian Neural Networks: A Tutorial for Deep Learning Users | Laurent Jospin et al | 2022 | bayesian | IEEE | A good summary / tutorial for using Bayesian Nets. Also provides some good paper references within. | Link |
Position Paper: Bayesian Deep Learning in the Age of Large-Scale AI | Theodore Papamarkou et al | 2024 | bayesian, mcmc | ICML | A good summary of the strengths of BDL (Bayesian Deep Learning) with regards to modern deep learning, while also addressing some weaknesses. A good paper if you need an overview of modern challenges (as of 2024). | Link |
A Neural Probabilistic Language Model | Bengio et al | 2003 | statistical language modeling | JMLR | One of the first papers about modern methods in using neural systems to estimate probability functions of word sequences. They show that MLPs can model better than the SOTA (at that time). A classic. | Link |
Bit Fusion: Bit-Level Dynamically Composable Architecture for Accelerating Deep Neural Networks | Hardik Sharma et al | 2018 | accelerator, quantization, bit fusion | ISCA | Hardware acceleration of Deep Neural Networks (DNNs) aims to address their high compute intensity, with the paper focusing on the potential of reducing bitwidth in operations without compromising classification accuracy. The bitwidth needed to avoid accuracy loss varies significantly across DNNs (and across layers), so a fixed-bitwidth accelerator may lead to limited benefits or degraded accuracy. The authors introduce Bit Fusion, a bit-flexible accelerator that dynamically adjusts bitwidth for individual DNN layers, resulting in significant speedup and energy savings compared to state-of-the-art accelerators, Eyeriss and Stripes, and achieving performance close to a 250 Watt Titan Xp GPU while consuming much less power. | Link |
A Framework for the Cooperation of Learning Algorithms | Leon Bottou et al | 1990 | learning algorithms, modules | NeurIPS | Cooperative training of modular systems offers a unified approach to many learning algorithms and hybrid systems, allowing the design and implementation of complex learning systems that incorporate structural a priori knowledge about tasks. The authors introduce a framework using a statistical formulation of learning systems to define and combine modules into cooperative systems, enabling the creation of hybrid systems that combine the advantages of connectionist and other learning algorithms. By decomposing complex tasks into simpler subtasks, modular architectures can be built, where each module corresponds to a subtask, facilitating easier achievement of the learning goal by introducing a modular decomposition of the global task. | Link |
CNP: An FPGA Based Processor for Convolutional Networks | Clement Farabet et al | 2009 | fpga, cnn | IEEE | One of the first attempts (that I have found) at putting a CNN into an FPGA and showing it can be done to perform some task (face detection). | Link |
A Complete Recipe for Stochastic Gradient MCMC | Yi An Ma et al | 2015 | hamiltonian, mcmc | NeurIPS | Recent Markov chain Monte Carlo (MCMC) samplers use continuous dynamics and scalable variants with stochastic gradients to efficiently explore target distributions, but proving convergence with stochastic gradient noise remains challenging. The authors provide a general framework for constructing MCMC samplers, including stochastic gradient versions, based on continuous Markov processes defined by two matrices, demonstrating that any such process can be represented within this framework. Using this framework, they propose a new state adaptive sampler, stochastic gradient Riemann Hamiltonian Monte Carlo (SGRHMC), which combines the benefits of Riemann HMC with the scalability of stochastic gradient methods, as shown in experiments with simulated data and a streaming Wikipedia analysis. | Link |
CPT: Efficient Deep Neural Network Training via Cyclic Precision | Yonggan Fu et al | 2021 | precision, efficiency, wide minima | ICLR | Low precision deep neural network (DNN) training is an effective method for improving training time and energy efficiency, with this paper proposing a new perspective: that DNN precision may act similarly to the learning rate during training. The authors introduce Cyclic Precision Training (CPT), which cyclically varies precision between two boundary values identified through a simple precision range test in the initial training epochs, aiming to boost time and energy efficiency further. | Link |
Approximation by Superpositions of a Sigmoidal Function | G. Cybenko | TODO | universal approximator, completeness | Mathematics of Control, Signals, and Systems | This paper demonstrates that finite linear combinations of compositions of a fixed univariate function and a set of affine functionals can uniformly approximate any continuous function of n real variables within the unit hypercube, under mild conditions on the univariate function. These findings resolve an open question about the representability of functions by single hidden layer neural networks, specifically showing that arbitrary decision regions can be well approximated by continuous feedforward neural networks with a single hidden layer and any continuous sigmoidal nonlinearity. | Link |
Cyclical Stochastic Gradient MCMC for Bayesian Deep Learning | Ruqi Zhang et al | 2020 | mcmc, bayesian | ICLR | The posteriors over neural network weights are high dimensional and multimodal, with each mode representing a different meaningful interpretation of the data. The authors introduce Cyclical Stochastic Gradient MCMC (SG MCMC) with a cyclical stepsize schedule, where larger steps discover new modes and smaller steps characterize each mode, and they prove the non asymptotic convergence of this algorithm. | Link |
DaDianNao: A Machine Learning Supercomputer | Yunji Chen et al | 2014 | accelerator, gpu | IEEE/ACM | This paper introduces a custom multi chip architecture optimized for Convolutional and Deep Neural Networks (CNNs and DNNs), addressing their computational and memory intensive nature by leveraging on chip storage to enhance internal bandwidth and reduce external communication bottlenecks. The authors demonstrate significant performance gains with their 64 chip system achieving up to a 450.65x speedup over GPUs and reducing energy consumption by up to 150.31x on large neural network layers, implemented with custom storage, computational units, and robust interconnects at 28nm scale. | Link |
DARTS: Differentiable Architecture Search | Hanxiao Liu et al | 2019 | nas | ICLR | This paper introduces a differentiable approach to architecture search, tackling scalability challenges by reformulating the task to allow gradient based optimization over a continuous relaxation of architecture representations. Unlike traditional methods relying on evolutionary or reinforcement learning in discrete, non differentiable spaces, the proposed method efficiently discovers high performance convolutional architectures for image classification and recurrent architectures for language modeling. | Link |
Decoupled Contrastive Learning | Chun Hsiao Yeh et al | 2022 | contrastive learning, self-supervised learning | ACM | This paper introduces decoupled contrastive learning (DCL), which removes the negative positive coupling (NPC) effect from the InfoNCE loss, significantly improving the efficiency of self supervised learning (SSL) tasks with smaller batch sizes. DCL achieves efficient and reliable performance enhancements across various benchmarks, outperforming the SimCLR baseline without requiring momentum encoding, large batch sizes, or extensive epochs. | Link |
Deep Image Prior | Dmitry Ulyanov et al | 2020 | inpainting, super-resolution, denoising | IEEE | This paper challenges the conventional wisdom by demonstrating that the structure of a generator network, even when randomly initialized, can effectively capture low level image statistics without any specific training on example images. The authors show that this randomly initialized neural network can serve as a powerful handcrafted prior, yielding excellent results in standard image processing tasks such as denoising, super resolution, and inpainting. Furthermore, the same network structure can invert deep neural representations for diagnostic purposes and restore images based on input pairs like flash and no flash conditions, showcasing its versatility and effectiveness across various image restoration applications. | Link |
Deep Double Descent: Where Bigger Models and More Data Hurt | Preetum Nakkiran et al | 2019 | capacity, double descent | Arxiv | This paper explores the "double descent" phenomenon in modern deep learning tasks, showing that as model size or training epochs increase, performance initially worsens before improving. The authors unify these observations by introducing a new complexity measure termed effective model complexity, conjecturing a generalized double descent across this measure. | Link |
DeepShift: Towards Multiplication Less Neural Networks | Mostafa Elhoushi et al | 2021 | multiplication-less, efficiency | Arxiv | This paper addresses the computational challenges of deploying convolutional neural networks (CNNs) on edge computing platforms by introducing convolutional shifts and fully connected shifts, replacing multiplications with efficient bitwise operations during both training and inference. The proposed DeepShift models achieve competitive or higher accuracies compared to baseline models like ResNet18, ResNet50, VGG16, and GoogleNet, while significantly reducing the memory footprint by using only 5 bits or less for weight representation during inference. | Link |
DepthShrinker: A New Compression Paradigm Towards Boosting Real Hardware Efficiency of Compact Neural Networks | Yonggan Fu et al | 2022 | compression, efficiency, pruning | ICML | This paper introduces DepthShrinker, a framework designed to enhance hardware efficiency of deep neural networks (DNNs) by transforming irregular computation patterns of compact operators into dense ones, thereby improving hardware utilization without sacrificing model accuracy. By leveraging insights that certain activation functions can be removed post training without loss of accuracy, DepthShrinker pioneers a compression paradigm that optimizes DNNs for real hardware efficiency, presenting a significant advancement in efficient model deployment. | Link |
Dimensionality Reduction by Learning an Invariant Mapping | Raia Hadsell et al | 2006 | dimensionality reduction, mapping | CVPR | DrLIM, or Dimensionality Reduction by Learning an Invariant Mapping, addresses key limitations of existing dimensionality reduction techniques by learning a non linear function that maps high dimensional data to a low dimensional manifold based solely on neighborhood relationships, without requiring a predefined distance metric in input space. The method is distinguished by its ability to handle transformations and maintain invariant mappings, demonstrated through experiments that show its effectiveness in preserving neighborhood relationships and accurately mapping new, unseen samples to meaningful locations on the manifold. Unlike methods like LLE, which may struggle with variability and registration issues in input data, DrLIM's contrastive loss function ensures robustness by balancing similarity and dissimilarity in output space, offering a promising approach for applications requiring invariant mappings, such as learning positional information from image sequences in robotics. | Link |
Disentangling Trainability and Generalization in Deep Neural Networks | Lechao Xiao et al | 2020 | neural tangent kernel, ntk | ICML | This study focuses on characterizing the trainability and generalization of deep neural networks, particularly under conditions of very wide and very deep architectures, leveraging insights from the Neural Tangent Kernel (NTK). By analyzing the NTK's spectrum, the study formulates necessary conditions for both memorization and generalization across architectures like Fully Connected Networks (FCNs) and Convolutional Neural Networks (CNNs). The research identifies key spectral quantities such as λmax, λbulk, κ, and P(Θ(l)) that critically influence the performance of deep networks, providing a precise theoretical framework validated by extensive experiments on CIFAR10. It highlights distinctions in generalization behavior between CNNs with and without global average pooling. | Link |
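Finding Structure in Time | Jeffrey Elman | 1990 | rnn | Cognitive Science | Introduces the simple recurrent network (the Elman network), in which the hidden state is fed back as context for the next time step; this is not the backpropagation through time paper itself. Good insights on time dependent system learning. | Link |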
E2 Train: Training State of the art CNNs with Over 80% Energy Savings | Yue Wang et al | 2019 | cnn, batch, energy | NeurIPS | This paper introduces E2 Train, a framework for energy efficient CNN training on resource constrained platforms. E2 Train optimizes training energy costs through stochastic mini batch dropping, selective layer updates, and low cost, low precision back propagation strategies. Experimental results on CIFAR 10 and CIFAR 100 demonstrate significant energy savings of over 90% and 84%, respectively, with minimal loss in accuracy. E2 Train addresses the challenge of on device CNN training by integrating three levels of energy saving techniques: data level stochastic mini batch dropping, model level selective layer updates, and algorithm level low precision back propagation. Real energy measurements on an FPGA validate its effectiveness, achieving notable energy reductions in training ResNet models on CIFAR datasets. | Link |
cuDNN: Efficient Primitives for Deep Learning | Sharan Chetlur et al | 2014 | cuda, gpu | Arxiv | This paper introduces cuDNN, a library of optimized deep learning primitives, akin to what BLAS provides for HPC. cuDNN offers efficient implementations of key deep learning kernels tailored for GPUs; integrating it into frameworks like Caffe improves performance by up to 36% while also reducing memory usage. | Link |
EIE: Efficient Inference Engine on Compressed Deep Neural Network | Song Han et al | 2016 | compression, accelerator, co-design | Arxiv | This paper introduces EIE, an energy efficient inference engine designed for compressed deep neural networks, achieving significant energy savings by exploiting weight sharing, sparsity, and quantization. EIE performs sparse matrix vector multiplications directly on compressed models, enabling 189× and 13× faster inference speeds compared to CPU and GPU implementations of uncompressed DNNs. | Link |
An Empirical Analysis of Deep Network Loss Surfaces | Daniel Jiwoong Im et al | 2016 | optimization, loss surface, saddle points | Arxiv | This paper empirically investigates the geometry of loss functions in state of the art neural networks, employing various stochastic optimization methods. Through visualizations in low dimensional subspaces, it explores how different optimization procedures lead to distinct local minima, even when algorithms are changed late in the optimization process. The study reveals that modifications to optimization procedures consistently yield different local minima, each affecting the network's performance on test examples differently. Interestingly, while different optimization algorithms find varied local minima from different initializations, the shape of the loss function around these minima remains characteristic to the algorithm used, with ADAM showing larger basins compared to vanilla SGD. | Link |
EyeCoD: Eye Tracking System Accelerator via FlatCam based Algorithm & Accelerator Co Design | Haoran You et al | 2023 | accelerator, co-design, eye-tracking | ACM | This paper introduces EyeCoD, a lensless FlatCam based eye tracking system designed to overcome limitations of traditional systems, such as large form factor and high communication costs. By integrating a predict then focus algorithm pipeline and dedicated hardware accelerator, EyeCoD achieves significant reductions in computation and communication overhead while maintaining high tracking accuracy. | Link |
Eyeriss v2: A Flexible Accelerator for Emerging Deep Neural Networks on Mobile Devices | Yu Hsin Chen et al | 2019 | efficiency, sparsity | Arxiv | This paper introduces Eyeriss v2, a specialized DNN accelerator architecture designed to efficiently handle compact and sparse neural networks. Unlike traditional DNN accelerators, Eyeriss v2 incorporates a hierarchical mesh network on chip to adapt to varying layer shapes and sizes, optimizing data reuse and bandwidth utilization. Eyeriss v2 excels in processing sparse data directly in the compressed domain, both for weights and activations, thereby enhancing processing speed and energy efficiency particularly suited for sparse models. | Link |
Eyeriss: A Spatial Architecture for Energy Efficient Dataflow for Convolutional Neural Networks | Yu Hsin Chen et al | 2016 | cnn, row-stationary, efficiency | ACM/IEEE | The paper addresses the high energy consumption in deep convolutional neural networks (CNNs) due to extensive data movement, despite advancements in parallel computing paradigms like SIMD/SIMT. It introduces a novel row stationary (RS) dataflow designed for spatial architectures. RS maximizes local data reuse and minimizes data movement during convolutions, leveraging PE local storage, inter PE communication, and spatial parallelism. | Link |
Flat Minima | Sepp Hochreiter et al | 1997 | flat minima, low complexity, gibbs | Neural Computation | The algorithm focuses on identifying "flat" minima of the error function in weight space. A flat minimum is characterized by a large connected region where the error remains approximately constant. This property suggests simplicity in the network structure and low expected overfitting, supported by an MDL based Bayesian argument. Unlike traditional approaches that rely on Gaussian assumptions or specific weight priors, this algorithm uses a Bayesian framework with a prior over input output functions. This approach considers both network architecture and the training set, facilitating the identification of simpler and more generalizable models. | Link |
Fused Layer CNN Accelerators | Manoj Alwani et al | 2016 | cnn, accelerator, fusion | IEEE | This work introduces a novel approach to CNN accelerator design by fusing the computation of multiple convolutional layers. By rearranging the dataflow across layers, intermediate data can be cached on chip between adjacent layers, reducing the need for off chip memory storage and minimizing data transfer. Specifically, the study demonstrates the effectiveness of this approach by implementing a fused layer CNN accelerator for the initial five convolutional layers of VGGNet E. Using 362KB of on chip storage, the accelerator significantly reduces off chip feature map data transfer by 95%, from 77MB to 3.6MB per image processed. This innovation targets early convolutional layers where data transfer typically dominates. By optimizing data reuse and minimizing off chip memory usage, the proposed design strategy enhances the efficiency of CNN accelerators, paving the way for improved performance in various machine learning tasks. | Link |
EnlightenGAN: Deep Light Enhancement Without Paired Supervision | Yifan Jiang et al | 2021 | gan, enhancement, unsupervised | IEEE | Exploration of low light to well lit image generation using GANs. Also provides an interesting global local discriminator and self regularized perceptual loss fusion, with a simplified attention (the attention is just an inverse of the brightness of the image essentially). | Link |
Understanding the difficulty of training deep feedforward neural networks | Xavier Glorot et al | 2010 | activation, saturation, initialization | AISTATS | The logistic sigmoid activation function is problematic for deep networks due to its mean value, which can lead to saturation of the top hidden layer. This saturation slows down learning and can cause training plateaus. The difficulty in training deep networks correlates with the singular values of the Jacobian matrix for each layer. When these values deviate significantly from 1, it indicates poor activation and gradient flow across layers, complicating training. New initialization schemes have been proposed to address issues with activation saturation and gradient flow. These schemes aim to achieve faster convergence by ensuring that activations and gradients flow well across layers, thereby improving overall training efficiency. | Link |
Group Normalization | Yuxin Wu et al | 2018 | normalization | Arxiv | Batch Normalization (BN) performs normalization along the batch dimension, which causes errors to increase rapidly as batch sizes decrease. This limitation makes BN less effective for training larger models and tasks that require smaller batches due to memory constraints. Group Normalization (GN) divides channels into groups and computes normalization statistics (mean and variance) independently within each group. Unlike BN, GN's computation is not dependent on batch sizes, leading to stable performance across a wide range of batch sizes. | Link |
Singularity of the Hessian in Deep Learning | Levent Sagun et al | 2017 | eigenvalues, hessian | ICLR | The bulk of eigenvalues concentrated around zero indicates how overparametrized the model is. In deep learning, overparametrization often leads to better generalization despite the potential for higher computational costs. The edges of the eigenvalue distribution, scattered away from zero, reflect the complexity of the input data. This complexity influences how the loss landscape is structured and affects optimization difficulty. Second order optimization methods, which leverage information from the Hessian, can potentially accelerate training and find better solutions by providing insights into the loss landscape's curvature. The top discrete eigenvalues of the Hessian are influenced by the data characteristics, indicating that different datasets may require different optimization strategies or model architectures for optimal performance. | Link |
Long Short Term Memory | Sepp Hochreiter et al | 1997 | lstm, rnn | Neural Computation | The original paper on the LSTM. A classic, and demonstrated the power of gating. | Link |
On the importance of initialization and momentum in deep learning | Ilya Sutskever et al | 2013 | initialization, momentum | ICML | Traditionally, training DNNs and RNNs with stochastic gradient descent (SGD) with momentum was considered challenging due to issues with gradient propagation and vanishing/exploding gradients, especially in networks with many layers or long term dependencies. The paper demonstrates that using a well designed random initialization significantly improves the training success of deep and recurrent networks with SGD and momentum. | Link |
Algorithms for manifold learning | Lawrence Cayton | 2005 | manifold learning, dimensionality reduction | Arxiv | Many datasets exhibit complex relationships that cannot be effectively captured by linear methods like Principal Component Analysis (PCA). Manifold hypothesis: Despite high dimensional appearances, data points often lie on or near a much lower dimensional manifold embedded within the higher dimensional space. Manifold learning aims to uncover this underlying low dimensional structure to provide a more meaningful and compact representation of the data. | Link |
NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis | Ben Mildenhall et al | 2020 | nerf, view synthesis, 3d, scene representation, volume rendering | Arxiv | The method utilizes a fully connected deep network to represent scenes as continuous volumetric functions. This network takes a 5D input (spatial location and viewing direction) and outputs volume density and view dependent radiance. By querying these 5D coordinates along camera rays and employing differentiable volume rendering techniques, the method synthesizes novel views of scenes. | Link |
On Large Batch Training for Deep Learning: Generalization Gap and Sharp Minima | Nitish Shirish Keskar et al | 2017 | sharp minima, large batch | ICLR | The study identifies a phenomenon where large batch SGD methods tend to converge towards sharp minimizers of training and testing functions. Sharp minima are associated with poorer generalization, meaning the model performs worse on unseen data. In contrast, small batch methods more consistently converge towards flat minimizers. This behavior is attributed to the inherent noise in gradient estimation during training with small batches. | Link |
Optimizing FPGA based Accelerator Design for Deep Convolutional Neural Networks | Chen Zhang et al | 2015 | cnn, fpga, accelerator | ACM | The study employs quantitative analysis techniques, including loop tiling and transformation, to optimize the CNN accelerator design. These techniques aim to maximize computation throughput while minimizing the resource utilization on the FPGA, particularly balancing logic resource usage and memory bandwidth. | Link |
Learning Phrase Representations using RNN Encoder Decoder for Statistical Machine Translation | Kyunghyun Cho et al | 2014 | encoder decoder, machine translation | Arxiv | Introduces a novel neural network architecture called RNN Encoder Decoder, comprising two recurrent neural networks. One RNN serves as an encoder, converting a sequence of symbols into a fixed length vector representation. The other RNN acts as a decoder, generating another sequence of symbols based on the encoded representation. | Link |
Qualitatively Characterizing Neural Network Optimization Problems | Ian Goodfellow et al | 2015 | optimization, visualization | ICLR | Demonstrates that contemporary neural networks can achieve minimal training error through direct training with stochastic gradient descent alone, without needing complex schemes like unsupervised pretraining. This finding challenges earlier beliefs about the difficulty of navigating non convex optimization landscapes in neural network training. They also introduce a nice graphical tool to show the energy landscape. | Link |
Language Models are Unsupervised Multitask Learners | Alec Radford et al | 2019 | unsupervised, GPT | Arxiv | Demonstrates that language models, specifically GPT 2, trained on the WebText dataset, start to learn various natural language processing tasks (question answering, machine translation, reading comprehension, summarization) without explicit task specific supervision. For instance, when conditioned on a document and questions, the model achieves an F1 score of 55 on the CoQA dataset, matching or exceeding several baseline systems that were trained with over 127,000 examples. | Link |
On the difficulty of training Recurrent Neural Networks | Razvan Pascanu et al | 2013 | exploding gradient, vanishing gradient, gradient clipping, normalization | Arxiv | Explanation of issues in RNNs (vanishing / exploding gradient) and proposal of gradient clipping. | Link |
Learning representations by back propagating errors | David Rumelhart et al | 1986 | backpropagation, learning procedure, convergence | Nature | The main paper for backprop. | Link |
The Shattered Gradients Problem: If resnets are the answer, then what is the question? | David Balduzzi et al | 2018 | shattering, initialization | ICML | The paper identifies the "shattered gradients" problem in standard feedforward neural networks. It shows that gradients in these networks exhibit an exponential decay in correlation with depth, leading to gradients that resemble white noise. In contrast, architectures like highway and ResNets with skip connections demonstrate gradients that decay sublinearly, indicating greater resilience against shattering. The paper introduces a new initialization technique termed "Looks Linear" (LL) that addresses the shattered gradients issue. Preliminary experiments demonstrate that LL initialization enables the training of very deep networks without the need for skip connections like those in ResNets or highway networks. This initialization method offers a promising alternative to achieving stable gradient propagation in deep networks, potentially simplifying network architecture and improving training efficiency. | Link |
A Simple Baseline for Bayesian Uncertainty in Deep Learning | Wesley Maddox et al | 2019 | bayesian, uncertainty, gaussian | NeurIPS | SWAG combines Stochastic Weight Averaging (SWA) with Gaussian fitting to provide an approximate posterior distribution over neural network weights. SWA computes the first moment of SGD iterates using a modified learning rate schedule. SWAG extends this by fitting a Gaussian distribution using SWA's solution as the first moment and incorporating a low rank plus diagonal covariance derived from SGD iterates. | Link |
SmartExchange: Trading Higher cost Memory Storage/Access for Lower cost Computation | Yang Zhao et al | 2020 | compression, accelerator, pruning, decomposition, quantization | ACM/IEEE | SmartExchange integrates sparsification or pruning, decomposition, and quantization techniques into a unified algorithm. It aims to enforce a structured DNN weight format where each layer's weight matrix is represented as a product of a small basis matrix and a large sparse coefficient matrix with power of 2 elements. | Link |
On the Spectral Bias of Neural Networks | Nasim Rahaman et al | 2019 | spectra, fourier analysis, manifold learning | ICML | Neural networks, particularly deep ReLU networks, exhibit a learning bias towards low frequency functions. This bias means they tend to prioritize learning global variations over local fluctuations in data. This property aligns with their ability to generalize well across different samples and datasets. Contrary to intuition, as the complexity of the data manifold increases, deep networks find it easier to learn higher frequency functions. This suggests that while they naturally favor low frequency patterns, they can also adapt to more complex data structures to capture higher frequency variations. | Link |
Sequence to Sequence Learning with Neural Networks | Ilya Sutskever et al | 2014 | seq2seq | Arxiv | The paper introduces an end to end approach for sequence learning using multilayered Long Short Term Memory (LSTM) networks. This method requires minimal assumptions about the structure of the sequences: one multilayered LSTM maps the input sequence to a fixed dimensional vector, and another deep LSTM decodes the target sequence from that vector. | Link |
Tiled convolutional neural networks | Quoc Le et al | 2010 | tiling, cnn | NeurIPS | Tiled CNNs introduce a novel approach to learning invariances by using a regular "tiled" pattern of tied weights. Unlike traditional CNNs where adjacent hidden units share identical weights, Tiled CNNs require only that hidden units at a certain distance from each other share tied weights. This allows the network to learn complex invariances such as scale and rotational invariance, in addition to translational invariance. | Link |
Unsupervised Learning of Image Manifolds by Semidefinite Programming | Kilian Weinberger et al | 2004 | manifold learning, dimensionality reduction | IEEE | The paper proposes a new approach to detect low dimensional structure in high dimensional datasets using semidefinite programming (SDP). SDP is leveraged to analyze data that resides on or near a low dimensional manifold, which is a common challenge in computer vision and pattern recognition. The algorithm introduced overcomes limitations observed in previous manifold learning techniques like Isomap and locally linear embedding (LLE). These traditional methods often struggle with certain types of data distributions or computational complexities, which the proposed SDP based approach aims to address more effectively. | Link |