Here's where I keep a list of papers I have read.

This list is curated by Lexington Whalen, beginning in the first year of his PhD and continuing onward. As he is me, I hope he keeps going!

I typically use this to organize papers I found interesting. Please feel free to do whatever you want with it. Note that this is not every single paper I have ever read, just a collection of ones that I remember to put down.

So far, we have read 176 papers. Let's keep it up!

Each entry lists: Title, Author, Year, Topic, Publication Venue, Description, and a Link.
AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising Zigeng Chen, et al 2024 diffusion, parallelization, denoising Arxiv This paper introduces AsyncDiff, a novel approach to accelerate diffusion models through parallel processing across multiple devices. The key insight is that hidden states between consecutive diffusion steps are highly similar, which allows them to break the traditional sequential dependency chain of the denoising process by transforming it into an asynchronous one. They execute this by dividing the denoising model into multiple components distributed across different devices, where each component uses the output from the previous component's prior step as an approximation of its input, enabling parallel computation. To further enhance efficiency, they introduce stride denoising, which completes multiple denoising steps simultaneously through a single parallel computation batch and reduces the frequency of communication between devices. This solution is particularly elegant because it's universal and plug-and-play, requiring no model retraining or architectural changes to achieve significant speedups while maintaining generation quality. Link
DoRA: Weight-Decomposed Low-Rank Adaptation Shih-Yang Liu et al 2024 peft, lora Arxiv This paper introduces DoRA (Weight-Decomposed Low-Rank Adaptation), a novel parameter-efficient fine-tuning method that decomposes pre-trained weights into magnitude and direction components for separate optimization. Through a detailed weight decomposition analysis, the authors reveal that LoRA and full fine-tuning exhibit distinct learning patterns, with LoRA showing proportional changes in magnitude and direction while full fine-tuning demonstrates more nuanced, independent adjustments between these components. Based on this insight, DoRA uses LoRA specifically for directional updates while allowing independent magnitude optimization, which simplifies the learning task compared to having LoRA learn both components simultaneously. The authors also provide theoretical analysis showing how this decomposition benefits optimization by aligning the gradient's covariance matrix more closely with the identity matrix and demonstrate mathematically why DoRA's learning pattern more closely resembles full fine-tuning. Link
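To make the magnitude/direction split concrete, here is a minimal NumPy sketch of a DoRA-style recombination for a single linear layer. The shapes, the column-wise norm, and the initialization are illustrative assumptions rather than the paper's exact implementation; B and A play the role of the LoRA factors that handle the directional update while m is the separately trainable magnitude.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 32, 4

W0 = rng.normal(size=(d_out, d_in))              # frozen pre-trained weight
m  = np.linalg.norm(W0, axis=0, keepdims=True)   # trainable magnitude (one scalar per column)
B  = np.zeros((d_out, r))                        # LoRA factors handle the *direction*
A  = rng.normal(size=(r, d_in)) * 0.01

def dora_weight(W0, m, B, A, eps=1e-8):
    """Recombine magnitude and direction: W' = m * (W0 + BA) / ||W0 + BA||_col."""
    V = W0 + B @ A                               # directional component
    V_norm = np.linalg.norm(V, axis=0, keepdims=True)
    return m * V / (V_norm + eps)

x = rng.normal(size=(d_in,))
y = dora_weight(W0, m, B, A) @ x                 # forward pass with the adapted weight
print(y.shape)                                   # (64,)
```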
SphereFed: Hyperspherical Federated Learning Xin Dong et al 2022 federated learning Arxiv This paper presents a novel approach to addressing the non-i.i.d. (non-independent and identically distributed) data challenge in federated learning by introducing hyperspherical federated learning (SphereFed). The key insight is that instead of letting clients independently learn their classifiers, which leads to inconsistent learning targets across clients, they should share a fixed classifier whose weights span a unit hypersphere, ensuring all clients work toward the same learning objectives. The approach normalizes features to project them onto this same hypersphere and uses mean squared error loss instead of cross-entropy to avoid scaling issues that arise when working with normalized features. Finally, after federated training is complete, they propose a computationally efficient way to calibrate the classifier using a closed-form solution that can be computed in a distributed manner without requiring direct access to private client data. Link
A deeper look at depth pruning of LLMs Shoaib Ahmed Siddiqui et al 2024 pruning, depth pruning, llm ICML This paper explored different approaches to pruning large language models, revealing that while static metrics like cosine similarity work well for maintaining MMLU performance, adaptive metrics like Shapley values show interesting trade-offs between different tasks. A key insight was that self-attention layers are significantly more amenable to pruning compared to feed-forward layers, suggesting that models can maintain performance even with substantial attention layer reduction. The paper also demonstrated that simple performance recovery techniques, like applying an average update in place of removed layers, can be as effective or better than more complex approaches like low-rank adapters. Finally, the work highlighted how pruning affects different tasks unequally - while some metrics preserve performance on certain tasks like MMLU, they may significantly degrade performance on others like mathematical reasoning tasks. Link
Editing Models with Task Arithmetic Gabriel Ilharco et al 2023 task arithmetic, finetuning, task ICLR This paper introduces a novel method for model editing called task arithmetic, where "task vectors" represent specific tasks by capturing the difference between pre-trained and fine-tuned model weights. Task vectors can be manipulated mathematically, such as being negated to unlearn tasks or added together to enable multi-tasking or improve performance in novel settings. A standout finding is the ability to create new task capabilities through analogies (e.g., "A is to B as C is to D"), which allows performance improvement on tasks with little or no data. This method is computationally efficient, leveraging linear operations on model weights without incurring extra inference costs, providing a flexible and modular framework for modifying models post-training. The approach highlights significant advantages in adapting existing models while bypassing costly re-training or data access constraints. Link
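A minimal sketch of task-vector arithmetic on flat parameter dictionaries, under the assumption that a pre-trained and a fine-tuned checkpoint share the same keys and shapes; the toy tensors and scaling coefficients below are placeholders, not anything from the paper.

```python
import numpy as np

def task_vector(pretrained, finetuned):
    """tau = theta_finetuned - theta_pretrained, per parameter tensor."""
    return {k: finetuned[k] - pretrained[k] for k in pretrained}

def apply(pretrained, vectors, coeffs):
    """theta_new = theta_pretrained + sum_i lambda_i * tau_i."""
    out = {k: v.copy() for k, v in pretrained.items()}
    for tau, lam in zip(vectors, coeffs):
        for k in out:
            out[k] += lam * tau[k]
    return out

rng = np.random.default_rng(0)
theta_pre = {"w": rng.normal(size=(4, 4))}
theta_a   = {"w": theta_pre["w"] + 0.1 * rng.normal(size=(4, 4))}  # "fine-tuned on task A"
tau_a = task_vector(theta_pre, theta_a)

theta_add    = apply(theta_pre, [tau_a], [+1.0])   # add the task
theta_forget = apply(theta_pre, [tau_a], [-1.0])   # negate the vector to unlearn it
```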
SaySelf: Teaching LLMs to Express Confidence with Self-Reflective Rationales Tianyang Xu et al 2024 confidence estimation, llm Arxiv The SaySelf framework trains large language models (LLMs) to produce fine-grained confidence estimates and self-reflective rationales by focusing on internal uncertainties. It consists of two stages: supervised fine-tuning and reinforcement learning (RL). In the first stage, multiple reasoning chains are sampled from the LLM, clustered for semantic similarity, and analyzed by an advanced LLM to generate rationales summarizing uncertainties. The model is fine-tuned on a dataset that pairs questions with reasoning chains, rationales, and confidence estimates, using a loss function that optimizes the generation of all three outputs. In the second stage, RL refines the confidence predictions using a reward function that encourages accurate, high-confidence outputs while penalizing overconfidence in incorrect responses. The framework ensures that LLMs not only generate confidence scores but also provide explanations for their uncertainty, making their outputs more interpretable and calibrated. Link
Deep Reinforcement Learning from Human Preferences Paul F Christiano et al 2017 rl, rlhf Arxiv This paper introduces a method to train reinforcement learning (RL) systems using human preferences over trajectory segments rather than traditional reward functions. The approach allows agents to learn tasks that are hard to define programmatically, enabling non-expert users to provide feedback on agent behavior through comparisons of short video clips. By learning a reward model from these preferences, the method dramatically reduces the need for human oversight while maintaining adaptability to large-scale and complex RL environments. This paradigm bridges the gap between human-defined objectives and scalable RL systems, addressing challenges in alignment and usability for real-world applications. Link
The Cost of Down-Scaling Language Models: Fact Recall Deteriorates before In-Context Learning Tian Jin et al 2023 pruning, icl Arxiv This paper explores the effects of scaling the parameter count of large language models (LLMs) on two distinct capabilities: fact recall from pre-training and in-context learning (ICL). By investigating both dense scaling (training models of varying sizes) and pruning (removing weights), the authors identify that these approaches disproportionately affect fact recall while preserving ICL abilities. They demonstrate that a model's ability to learn from in-context information remains robust under significant parameter reductions, whereas the ability to recall pre-trained facts degrades with even moderate scaling down. This dichotomy highlights a fundamental difference in how these capabilities rely on model size and opens avenues for more efficient model design and deployment, emphasizing trade-offs between memory augmentation and parameter efficiency. Link
Fine-Tuning Language Models with Just Forward Passes Sadhika Malladi et al 2024 finetuning, zo, optimization Arxiv The paper introduces MeZO, a memory-efficient zeroth-order optimization method, to fine-tune large language models using forward passes alone. Classical zeroth-order methods scale poorly with model size, but MeZO adapts these approaches to leverage structured pre-trained model landscapes, avoiding catastrophic slowdown even with billions of parameters. The authors theoretically show that MeZO’s convergence depends on the local effective rank of the Hessian, not the number of parameters, enabling efficient optimization despite prior bounds suggesting otherwise. Furthermore, MeZO’s flexibility allows optimization of non-differentiable objectives (e.g., accuracy or F1 score) and compatibility with parameter-efficient tuning methods like LoRA and prefix-tuning. Link
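A toy sketch of the two-forward-pass zeroth-order estimator that MeZO builds on; the quadratic loss stands in for a real model's forward pass, and resampling the perturbation from a stored seed is the memory-saving trick, shown here only schematically.

```python
import numpy as np

def loss(theta):
    # placeholder objective; in MeZO this would be a full forward pass of the LLM
    return np.sum((theta - 1.0) ** 2)

def mezo_step(theta, lr=0.01, eps=1e-3, seed=0):
    # MeZO resamples z from the stored seed instead of keeping it in memory
    z = np.random.default_rng(seed).normal(size=theta.shape)
    proj_grad = (loss(theta + eps * z) - loss(theta - eps * z)) / (2 * eps)
    return theta - lr * proj_grad * z

theta = np.zeros(10)
for step in range(200):
    theta = mezo_step(theta, seed=step)
print(theta.round(2))   # moves toward the optimum at 1.0
```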
ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference Hanshi Sun et al 2024 kv cache Arxiv The key insight of this paper lies in optimizing long-context large language model inference by addressing the memory and latency bottlenecks associated with managing the key-value (KV) cache. The authors observe that pre-Rotary Position Embedding (RoPE) keys exhibit a low-rank structure, allowing them to be compressed without accuracy loss, while value caches lack this property and are therefore offloaded to the CPU to reduce GPU memory usage. To minimize decoding latency, they leverage landmarks—compact representations of the low-rank key cache—and identify a small set of outliers to be retained on the GPU, enabling efficient reconstruction of sparse KV pairs on-the-fly. This approach allows the system to handle significantly longer contexts and larger batch sizes while maintaining inference throughput and accuracy. Link
LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning Rui Pan et al 2024 peft, finetuning, sampling Arxiv The key insight of this paper is the discovery of a skewed weight-norm distribution across layers during LoRA fine-tuning, where the majority of updates occur in the bottom (embedding) and top (language modeling head) layers, leaving middle layers underutilized. This highlights that different layers have varied importance and suggests that selectively updating layers could improve efficiency without sacrificing performance. Building on this, the authors propose Layerwise Importance Sampling AdamW (LISA), which randomly freezes most middle layers during training, using importance sampling to emulate LoRA’s fast learning pattern while avoiding its low-rank constraints. This approach achieves significant memory savings, faster convergence, and superior performance compared to LoRA and full-parameter fine-tuning, particularly in large-scale and domain-specific tasks. Link
SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models Muyang Li et al 2024 quantization, diffusion Arxiv SVDQuant introduces a novel approach to 4-bit quantization of diffusion models by using a low-rank branch to absorb outliers in both weights and activations, making quantization more feasible at such aggressive bit reduction. The method first consolidates outliers from activations to weights through smoothing, then decomposes the weights using Singular Value Decomposition (SVD) to separate the dominant components into a 16-bit low-rank branch while keeping the residual in 4 bits. To make this practical, they developed an inference engine called Nunchaku that fuses the low-rank and low-bit branch kernels together, eliminating redundant memory access that would otherwise negate the performance benefits. The approach is designed to work across different diffusion model architectures and can seamlessly integrate with existing low-rank adapters (LoRAs) without requiring re-quantization. Link
One Weight Bitwidth to Rule Them All Ting-Wu Chin et al 2020 quantization, bitwidth Arxiv This paper examines weight quantization in deep neural networks and challenges the common assumption that using the lowest possible bitwidth without accuracy loss is optimal. The key insight is that when considering model size as a constraint and allowing network width to vary, some bitwidths consistently outperform others - specifically, networks with standard convolutions work better with binary weights while networks with depthwise convolutions prefer higher bitwidths. The authors discover that this difference is related to the number of input channels (fan-in) per convolutional kernel, with higher fan-in making networks more resilient to aggressive quantization. Most surprisingly, they demonstrate that using a single well-chosen bitwidth throughout the network can outperform more complex mixed-precision quantization approaches when comparing networks of equal size, suggesting that the traditional focus on minimizing bitwidth without considering network width may be suboptimal. Link
Consistency Models Yang Song et al 2023 diffusion, ode, consistency ICML This paper introduces consistency models, a new family of generative models that can generate high-quality samples in a single step while preserving the ability to trade compute for quality through multi-step sampling. The key innovation is training models to map any point on a probability flow ODE trajectory to its origin point, enforcing consistency across different time steps through either distillation from pre-trained diffusion models or direct training. The models support zero-shot data editing capabilities like inpainting, colorization, and super-resolution without requiring explicit training on these tasks, similar to diffusion models. The authors provide two training approaches - consistency distillation which leverages existing diffusion models, and consistency training which allows training from scratch without any pre-trained models, establishing consistency models as an independent class of generative models. Link
One Step Diffusion via ShortCut Models Kevin Frans et al 2024 diffusion, ode, flow-matching Arxiv This paper introduces shortcut models, a new type of diffusion model that enables high-quality image generation in a single forward pass by conditioning the model not only on the timestep but also on the desired step size, allowing it to learn larger jumps during the denoising process. Unlike previous approaches that require multiple training phases or complex scheduling, shortcut models can be trained end-to-end in a single phase by leveraging a self-consistency property where one large step should equal two consecutive smaller steps, combined with flow-matching loss as a base case. The key insight is that by conditioning on step size, the model can account for future curvature in the denoising path and jump directly to the correct next point rather than following the curved path naively, which would lead to errors with large steps. The approach simplifies the training pipeline while maintaining flexibility in inference budget, as the same model can generate samples using either single or multiple steps after training. Link
Attention-Driven Training-Free Efficiency Enhancement of Diffusion Models Hongjie Wang et al 2024 diffusion, training-free, attention, token pruning CVPR This paper introduces AT-EDM, a training-free framework to accelerate diffusion models by pruning redundant tokens during inference without requiring model retraining. The key innovation is a Generalized Weighted PageRank (G-WPR) algorithm that uses attention maps to identify and prune less important tokens, along with a novel similarity-based token recovery method that fills in pruned tokens based on attention patterns to maintain compatibility with convolutional layers. The authors also propose a Denoising-Steps-Aware Pruning (DSAP) schedule that prunes fewer tokens in early denoising steps when attention maps are more chaotic and less informative, and more tokens in later steps when attention patterns are better established. The overall approach focuses on making diffusion models more efficient by leveraging the rich information contained in attention maps to guide token pruning decisions while maintaining image generation quality. Link
Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks Tim Salimans et al 2016 normalization, gradient descent Arxiv This paper introduces weight normalization, a simple reparameterization technique that decouples a neural network's weight vectors into their direction and magnitude by expressing w = (g/||v||)v, where g is a scalar and v is a vector. The key insight is that this decoupling improves optimization by making the conditioning of the gradient better - the direction and scale of weight updates can be learned somewhat independently, which helps avoid problems with pathological curvature in the optimization landscape. While inspired by batch normalization, weight normalization is deterministic and doesn't add noise to gradients or create dependencies between minibatch examples, making it well-suited for scenarios like reinforcement learning and RNNs where batch normalization is problematic. The authors also propose a data-dependent initialization scheme where g and bias terms are initialized to normalize the initial pre-activations of neurons, helping ensure good scaling of activations across layers at the start of training. Link
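A small NumPy sketch of the reparameterization w = (g/||v||)·v for one layer, applied row-wise so each neuron's scale is decoupled from its direction; the shapes and the data-dependent initialization below are simplified assumptions in the spirit of the paper, not its exact recipe.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in = 8, 16
v = rng.normal(size=(d_out, d_in))            # direction parameters (one row per neuron)
g = np.ones((d_out, 1))                       # magnitude parameters

def weight_norm(v, g, eps=1e-8):
    """w = g * v / ||v||, so magnitude and direction can be learned separately."""
    return g * v / (np.linalg.norm(v, axis=1, keepdims=True) + eps)

x = rng.normal(size=(d_in, 32))               # a batch of 32 inputs
pre_act = weight_norm(v, g) @ x

# data-dependent init: rescale g and set the bias so the initial
# pre-activations have roughly zero mean and unit variance
mu, sigma = pre_act.mean(axis=1, keepdims=True), pre_act.std(axis=1, keepdims=True)
g = g / sigma
b = -mu / sigma
```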
Applying Guidance in a Limited Interval Improves Sample and Distribution Quality in Diffusion Models Tuomas Kynkäänniemi et al 2024 diffusion, cfg, guidance Arxiv This paper's key insight is that classifier-free guidance (CFG) in diffusion models should only be applied during a specific interval of noise levels in the middle of the sampling process, rather than throughout the entire sampling chain as traditionally done. The intuition is that guidance is harmful at high noise levels (where it causes mode collapse and template-like outputs), largely unnecessary at low noise levels, and only truly beneficial in the middle range. They demonstrate this theoretically using a 1D synthetic example where they can visualize how guidance at high noise levels causes sampling trajectories to drift far from the smoothed data distribution, leading to mode dropping. Beyond this theoretical demonstration, they propose a simple solution of making the guidance weight a piecewise function that only applies guidance within a specific noise level interval. Link
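A sketch of the interval-gated guidance weight the paper proposes, dropped into a generic denoising call; the noise-level thresholds, the guidance scale, the denoiser signatures, and the toy stand-in models are all placeholder assumptions rather than the paper's tuned values.

```python
import numpy as np

def guidance_weight(sigma, w=3.0, sigma_lo=0.3, sigma_hi=5.0):
    """Apply guidance only for noise levels inside (sigma_lo, sigma_hi); elsewhere use w = 1."""
    return w if sigma_lo < sigma < sigma_hi else 1.0

def guided_denoise(denoise_cond, denoise_uncond, x, sigma):
    w = guidance_weight(sigma)
    d_c, d_u = denoise_cond(x, sigma), denoise_uncond(x, sigma)
    return d_u + w * (d_c - d_u)              # usual CFG combination, with the interval-gated weight

# toy stand-ins for the conditional / unconditional denoisers
denoise_c = lambda x, sigma: 0.9 * x
denoise_u = lambda x, sigma: 0.8 * x

x = np.ones(4)
for sigma in [10.0, 2.0, 0.1]:                # high, mid, and low noise levels
    print(sigma, guided_denoise(denoise_c, denoise_u, x, sigma))
```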
Cache Me if You Can: Accelerating Diffusion Models through Block Caching Felix Wimbauer et al 2024 diffusion, caching, distillation Arxiv This paper introduces "block caching" to accelerate diffusion models by reusing computations across denoising steps. The key insight is that many layer blocks (particularly attention blocks) in diffusion models change very gradually during the denoising process, making their repeated computation redundant. The authors propose automatically determining which blocks to cache and when to refresh them based on measuring the relative changes in block outputs across timesteps. They also introduce a lightweight scale-shift adjustment mechanism that uses a student-teacher setup, where the student (cached model) learns additional scale and shift parameters to better align its cached block outputs with those of the teacher (uncached model), while keeping the original model weights frozen. Link
DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads Guangxuan Xiao et al 2024 llm, kv cache, attention Arxiv The key insight of DuoAttention is the observation that attention heads in LLMs naturally fall into two distinct categories: retrieval heads that need to access the full context to make connections across long distances, and streaming heads that mainly focus on recent tokens and attention sinks. This dichotomy makes intuitive sense because not all parts of language processing require long-range dependencies - while some aspects like fact recall or logical reasoning need broad context, others like local grammar or immediate context processing can work with nearby tokens. The paper's approach of using optimization to identify these heads (rather than just looking at attention patterns) is clever because it directly measures the impact on model outputs, capturing the true functional role of each head rather than just its surface behavior. Finally, the insight to maintain two separate KV caches (full for retrieval heads, minimal for streaming heads) is an elegant way to preserve the model's capabilities while reducing memory usage, since it aligns the memory allocation with each head's actual needs. Link
Efficient Streaming Language Models with Attention Sinks Guangxuan Xiao et al 2024 llm, kv cache, attention ICLR This paper introduces StreamingLLM, a framework that enables large language models to process infinitely long text sequences efficiently without fine-tuning, based on a key insight about "attention sinks." The authors discover that LLMs allocate surprisingly high attention scores to initial tokens regardless of their semantic relevance, which they explain is due to the softmax operation requiring attention scores to sum to one - even when a token has no strong matches in context, the model must distribute attention somewhere, and initial tokens become natural "sinks" since they're visible to all subsequent tokens during autoregressive training. Building on this insight, StreamingLLM maintains just a few initial tokens (as attention sinks) along with a sliding window of recent tokens, achieving up to 22.2x speedup compared to baselines while maintaining performance on sequences up to 4 million tokens long. Additionally, they show that incorporating a dedicated learnable "sink token" during model pre-training can further improve streaming capabilities by providing an explicit token for collecting excess attention. Link
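A minimal sketch of the StreamingLLM-style cache policy (keep a few initial "sink" tokens plus a sliding window of recent tokens). The cache here is just a list of token ids, so positional re-indexing and the actual attention computation are omitted, and the sink/window sizes are illustrative.

```python
def evict(cache, n_sink=4, window=8):
    """Keep the first n_sink entries (attention sinks) plus the most recent `window` entries."""
    if len(cache) <= n_sink + window:
        return cache
    return cache[:n_sink] + cache[-window:]

cache = []
for token_id in range(20):                    # pretend we decode 20 tokens
    cache.append(token_id)
    cache = evict(cache)

print(cache)   # [0, 1, 2, 3, 12, 13, ..., 19] -> sinks + sliding window
```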
MagicPIG: LSH Sampling for Efficient LLM Generation Zhuoming Chen et al 2024 llm, kv cache Arxiv This paper challenges the common assumption that attention in LLMs is naturally sparse, showing that TopK attention (selecting only the highest attention scores) can significantly degrade performance on tasks that require aggregating information across the full context. The authors demonstrate that sampling-based approaches to attention can be more effective than TopK selection, leading them to develop MagicPIG, a system that uses Locality Sensitive Hashing (LSH) to efficiently sample attention keys and values. A key insight is that the geometry of attention in LLMs has specific patterns - notably that the initial attention sink token remains almost static regardless of input, and that query and key vectors typically lie in opposite directions - which helps explain why simple TopK selection is suboptimal. Their solution involves a heterogeneous system design that leverages both GPU and CPU resources, with hash computations on GPU and attention computation on CPU, allowing for efficient processing of longer contexts while maintaining accuracy. Link
Guiding a Diffusion Model with a Bad Version of Itself Tero Karras et al 2024 diffusion, guidance Arxiv The paper makes two key contributions: First, they show that Classifier-Free Guidance (CFG) improves image quality not just through better prompt alignment, but because the unconditional model D0 learns a more spread-out distribution than the conditional model D1, causing the guidance term ∇_x log(p₁/p₀) to push samples toward high-probability regions of the data manifold. Second, based on this insight, they introduce "autoguidance" - using a smaller, less-trained version of the model itself as the guiding model D0 rather than an unconditional model, which allows for quality improvements without reducing variation and works even for unconditional models. Link
LLM-Pruner: On the Structural Pruning of Large Language Models Xinyin Ma et al 2023 llm, structural pruning Arxiv The authors introduce LLM-Pruner, a novel approach for compressing large language models that operates in a task-agnostic manner while requiring minimal access to the original training data. Their key insight is to first automatically identify groups of interdependent neural structures within the LLM by analyzing dependency patterns, ensuring that coupled structures are pruned together to maintain model coherence. The method then estimates the importance of these structural groups using both first-order gradients and approximated Hessian information from a small set of calibration samples, allowing them to selectively remove less critical groups while preserving the model's core functionality. Finally, they employ a rapid recovery phase using low-rank adaptation (LoRA) to fine-tune the pruned model with a limited dataset in just a few hours, enabling efficient compression while maintaining the LLM's general-purpose capabilities. Link
SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models Guangxuan Xiao et al 2023 llm, quantization, activations ICML The key insight of SmoothQuant is that in large language models, while weights are relatively easy to quantize, activations are much harder due to outliers. They observed that these outliers persistently appear in specific channels across different tokens, suggesting that the difficulty could be redistributed. Their solution is to mathematically transform the model by scaling down problematic activation channels while scaling up the corresponding weight channels proportionally, which maintains mathematical equivalence while making both weights and activations easier to quantize. This "difficulty migration" approach allows them to balance the quantization challenges between weights and activations using a tunable parameter α, rather than having all the difficulty concentrated in the activation values. Link
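A NumPy sketch of the per-channel "difficulty migration": divide each activation channel by s_j and multiply the matching weight row by s_j, which leaves X·W unchanged while shrinking activation outliers. The α value, tensor shapes, and injected outlier are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(128, 64))                # activations: (tokens, in_channels)
X[:, 7] *= 50.0                               # inject an outlier channel
W = rng.normal(size=(64, 32))                 # weights: (in_channels, out_channels)

alpha = 0.5
s = (np.abs(X).max(axis=0) ** alpha) / (np.abs(W).max(axis=1) ** (1 - alpha))

X_smooth = X / s                              # activation channel j divided by s_j
W_smooth = W * s[:, None]                     # weight row j multiplied by s_j

assert np.allclose(X @ W, X_smooth @ W_smooth)        # mathematically equivalent
print(np.abs(X).max(), np.abs(X_smooth).max())        # outlier magnitude is reduced
```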
ESPACE: Dimensionality Reduction of Activations for Model Compression Charbel Sakr et al 2024 llm, dimensionality reduction, activations, compression NeurIPS Instead of decomposing weight matrices as done in previous work, ESPACE reduces the dimensionality of activation tensors by projecting them onto a pre-calibrated set of principal components using a static projection matrix P, where for an activation x, its projection is x̃ = PPᵀx. The projection matrix P is carefully constructed (using eigendecomposition of activation statistics) to preserve the most important components while reducing dimensionality, taking advantage of natural redundancies that exist in activation patterns due to properties like the Central Limit Theorem when stacking sequence/batch dimensions. During training, the weights remain uncompressed and fully trainable (maintaining model expressivity), while at inference time, the weight matrices can be folded with the projection matrix offline (storing WᵀP) to achieve compression through matrix multiplication associativity: Y = WᵀX ≈ Wᵀ(PPᵀX) = (WᵀP)(PᵀX). This activation-centric approach is fundamentally different from previous methods because it maintains full model expressivity during training while still achieving compression at inference time, and it takes advantage of natural statistical redundancies in activation patterns rather than trying to directly compress weights. Link
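A NumPy sketch of the activation-side projection: build P from the eigenvectors of calibration activation statistics, then fold WᵀP offline so inference multiplies two smaller matrices. The calibration set, the rank choice, and the synthetic low-rank structure are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_out, k = 64, 32, 16                      # activation dim, output dim, kept components

# calibration activations with hidden low-rank structure (so the projection has something to find)
basis = np.linalg.qr(rng.normal(size=(d, k)))[0]
X_cal = basis @ rng.normal(size=(k, 4096)) + 0.01 * rng.normal(size=(d, 4096))
W = rng.normal(size=(d, d_out))               # layer weight, used as y = W^T x

cov = X_cal @ X_cal.T / X_cal.shape[1]        # activation second-moment statistics
eigvals, eigvecs = np.linalg.eigh(cov)
P = eigvecs[:, -k:]                           # top-k eigenvectors -> projection matrix (d, k)

W_folded = W.T @ P                            # fold W^T P offline, shape (d_out, k)

x = basis @ rng.normal(size=(k,)) + 0.01 * rng.normal(size=(d,))
y_exact = W.T @ x
y_approx = W_folded @ (P.T @ x)               # (W^T P)(P^T x): two smaller matmuls at inference
print(np.linalg.norm(y_exact - y_approx) / np.linalg.norm(y_exact))   # small relative error
```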
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model Chunting Zhou et al 2024 diffusion, transformer, multi-modal Arxiv The key insight of this paper is that a single transformer model can effectively handle both discrete data (like text) and continuous data (like images) by using different training objectives for each modality within the same model. They introduce "Transfusion," which uses traditional language modeling (next token prediction) for text sequences while simultaneously applying diffusion modeling for image sequences, combining these distinct objectives into a unified training approach. The architecture employs a novel attention pattern that allows for causal attention across the entire sequence while enabling bidirectional attention within individual images, letting image patches attend to each other freely while maintaining proper causality for text generation. This unified approach avoids the need for separate specialized models or complex architectures while still allowing each modality to be processed according to its most effective paradigm. Link
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection Jiawei Zhao et al 2024 lora, low-rank projection ICML This paper introduces GaLore, a memory-efficient approach for training large language models that exploits the inherent low-rank structure of gradients rather than imposing low-rank constraints on the model weights themselves. The key insight is that while weight matrices may need to be full-rank for optimal performance, their gradients naturally become low-rank during training due to the specific structure of backpropagated gradients in neural networks, particularly in cases where the batch size is smaller than the matrix dimensions or when the gradients follow certain parametric forms. Building on this observation, GaLore projects gradients into low-rank spaces for memory-efficient optimization while still allowing full-parameter learning, contrasting with previous approaches like LoRA that restrict the weight updates to low-rank spaces. By periodically switching between different low-rank subspaces during training, GaLore maintains the flexibility of full-rank training while significantly reducing memory usage, particularly in storing optimizer states. Link
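A simplified NumPy sketch of gradient low-rank projection: project the gradient onto a subspace taken from its SVD, keep optimizer state only in that small space, and project the update back, refreshing the subspace periodically. The rank, refresh interval, toy objective, and plain momentum state are simplifications of GaLore's Adam-based version.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 256, 128, 4
W = rng.normal(size=(m, n))
M = np.zeros((r, n))                           # momentum lives in the r x n projected space
P = None

def grad(W):                                   # toy objective: pull W toward a rank-1 target
    target = np.outer(np.ones(m), np.ones(n))
    return W - target

for step in range(200):
    G = grad(W)
    if step % 50 == 0:                         # periodically refresh the low-rank subspace
        U, _, _ = np.linalg.svd(G, full_matrices=False)
        P = U[:, :r]                           # (m, r) projection
    R = P.T @ G                                # projected gradient, (r, n)
    M = 0.9 * M + R                            # optimizer state is only r x n, not m x n
    W = W - 0.1 * (P @ M)                      # project the update back to full size

print(np.linalg.norm(grad(W)))                 # residual shrinks within the chosen subspaces
```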
Neural Discrete Representation Learning Aaron van den Oord et al 2017 generative models, vae NeurIPS The key innovation of this paper is the introduction of the Vector Quantised-Variational AutoEncoder (VQ-VAE), which combines vector quantization with VAEs to learn discrete latent representations instead of continuous ones. Unlike previous approaches to discrete latent variables which struggled with high variance or optimization challenges, VQ-VAE uses a simple but effective nearest-neighbor lookup system in the latent space, along with a straight-through gradient estimator, to learn meaningful discrete codes. This approach allows the model to avoid the common posterior collapse problem where latents are ignored when paired with powerful decoders, while still maintaining good reconstruction quality comparable to continuous VAEs. The discrete nature of the latent space enables the model to focus on capturing important high-level features that span many dimensions in the input space (like objects in images or phonemes in speech) rather than local details, and these discrete latents can then be effectively modeled using powerful autoregressive priors for generation. Link
Improved Precision and Recall Metric for Assessing Generative Models Tuomas Kynkaanniemi et al 2019 generative models, precision, recall NeurIPS This paper introduces an improved metric for evaluating generative models by separately measuring precision (quality of generated samples) and recall (coverage/diversity of generated distribution) using k-nearest neighbors to construct non-parametric manifold approximations of real and generated data distributions. The authors demonstrate their metric's effectiveness using StyleGAN and BigGAN, showing how it provides more nuanced insights than existing metrics like FID, particularly in revealing tradeoffs between image quality and variation that other metrics obscure. They use their metric to analyze and improve StyleGAN's architecture and training configurations, identifying new variants that achieve state-of-the-art results, and perform the first principled analysis of truncation methods. Finally, they extend their metric to evaluate individual sample quality, enabling quality assessment of interpolations and providing insights into the shape of the latent space that produces realistic images. Link
Generative Pretraining from Pixels Mark Chen et al 2020 pretraining, gpt PMLR The paper demonstrates that transformer models can learn high-quality image representations by simply predicting pixels in a generative way, without incorporating any knowledge of the 2D structure of images. They show that as the generative models get better at predicting pixels (measured by log probability), they also learn better representations that can be used for downstream image classification tasks. The authors discover that, unlike in supervised learning where the best representations are in the final layers, their generative models learn the best representations in the middle layers - suggesting the model first builds up representations before using them to predict pixels. Finally, while their approach requires significant compute and works best at lower resolutions, it achieves competitive results with other self-supervised methods and shows that generative pre-training can be a promising direction for learning visual representations without labels. Link
Why Does Unsupervised Pre-Training Help Deep Learning? Dumitru Erhan et al 2010 pretraining, unsupervised JMLR This paper argues that standard training schemes place parameters in regions of the parameter space that generalize poorly, while greedy layer-wise unsupervised pre-training allows each layer to learn a nonlinear transformation of its input that captures the main variations in the input, which acts as a regularizer: minimizing variance and introducing bias towards good initializations for the parameters. They argue that defining particular initialization points implicitly imposes constraints on the parameters in that it specifies which minima (out of many possible minima) of the cost function are allowed. They further argue that small perturbations in the trajectory of the parameters have a larger effect early on, and hint that early examples have larger influence and may trap model parameters in particular regions of parameter space corresponding to the arbitrary ordering of training examples (similar to the "critical period" in developmental psychology). Link
Improving Language Understanding by Generative Pre-Training Alec Radford et al 2018 pretraining Arxiv The key insight of this paper is that language models can learn deep linguistic and world knowledge through unsupervised pre-training on large corpora of contiguous text, which can then be effectively transferred to downstream tasks. The authors demonstrate this by using a Transformer architecture that can capture long-range dependencies, pre-training it on a books dataset that contains extended narratives rather than shuffled sentences, making it particularly effective at understanding context. Their innovation extends to how they handle transfer learning - rather than creating complex task-specific architectures, they show that simple input transformations can adapt their pre-trained model to various tasks while preserving its learned capabilities. This elegant approach proves remarkably effective, with their single task-agnostic model outperforming specially-designed architectures across nine different natural language understanding tasks, suggesting that their pre-training method captures fundamental aspects of language understanding. Link
Learning Transferable Visual Models from Natural Language Supervision Alec Radford et al 2021 CLIP Arxiv CLIP (Contrastive Language-Image Pre-training) works by simultaneously training two neural networks - one that encodes images and another that encodes text - to project their inputs into a shared multi-dimensional space where similar concepts end up close together. During training, CLIP takes a batch of image-text pairs and learns to identify which text descriptions actually match which images, doing this by maximizing the cosine similarity between embeddings of genuine pairs while minimizing similarity between mismatched pairs. The training data consists of hundreds of millions of (image, text) pairs collected from the internet, which helps CLIP learn broad visual concepts and their relationships to language without requiring hand-labeled data. What makes CLIP particularly powerful is its zero-shot capability - after training, it can make predictions about images it has never seen before by comparing them against any arbitrary text descriptions, rather than being limited to a fixed set of predetermined labels. Link
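A compact sketch of the symmetric contrastive objective CLIP trains with, using random features in place of real image/text encoders; the batch size, embedding dimension, and temperature are illustrative assumptions.

```python
import numpy as np

def l2_normalize(z):
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE: matching (i, i) pairs are positives, everything else negative."""
    logits = l2_normalize(img_emb) @ l2_normalize(txt_emb).T / temperature   # (N, N) cosine sims
    labels = np.arange(len(logits))
    def ce(l):                                   # row-wise softmax cross-entropy on the diagonal
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()
    return 0.5 * (ce(logits) + ce(logits.T))     # image->text and text->image directions

rng = np.random.default_rng(0)
img, txt = rng.normal(size=(8, 512)), rng.normal(size=(8, 512))   # stand-in encoder outputs
print(clip_loss(img, txt))
```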
Adam: A Method for Stochastic Optimization Diederik Kingma et al 2015 optimizers ICLR Adam combines momentum (through an exponential moving average of gradients m_t) and adaptive learning rates (through an exponential moving average of squared gradients v_t) to create an efficient optimizer, where m_t captures the direction of updates while v_t adapts the step size for each parameter based on its gradient history. The optimizer corrects initialization bias in these moving averages by scaling them with factors 1/(1-β₁ᵗ) and 1/(1-β₂ᵗ) respectively, ensuring unbiased estimates even in early training. The parameter update θ_t ← θ_{t-1} - α·m̂_t/(√v̂_t + ε), where hats denote the bias-corrected estimates, is invariant to gradient scaling because it uses the ratio m̂_t/√v̂_t, while the adaptive learning rate 1/√v̂_t approximates the diagonal of the Fisher Information Matrix's square root, making it a more conservative version of natural gradient descent that works well with sparse gradients and non-stationary objectives. The hyperparameters β₁ = 0.9 and β₂ = 0.999 mean the momentum term considers roughly the last 10 steps while the variance term considers the last 1000 steps, allowing Adam to both move quickly in consistent directions and stay careful in directions with high historical variance. Link
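A direct NumPy transcription of the update described above, including the bias-corrected moments; the toy quadratic objective and step count are only there to show the loop running.

```python
import numpy as np

def adam(grad_fn, theta, steps=1000, lr=1e-2, b1=0.9, b2=0.999, eps=1e-8):
    m = np.zeros_like(theta)                  # first moment (EMA of gradients)
    v = np.zeros_like(theta)                  # second moment (EMA of squared gradients)
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g**2
        m_hat = m / (1 - b1**t)               # bias correction for early steps
        v_hat = v / (1 - b2**t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# minimize ||theta - 3||^2 starting from zero
grad_fn = lambda th: 2 * (th - 3.0)
print(adam(grad_fn, np.zeros(5)).round(3))    # approaches 3.0
```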
Simplifying Neural Networks by Soft Weight-Sharing Steven Nowlan et al 1992 soft weight sharing, mog Neural Computation This paper tackles the challenge of penalizing complexity and preventing overfitting in neural networks. Traditional methods, like L2 regularization, penalize the sum of squared weights but can favor multiple weak connections over a single strong one, leading to suboptimal weight configurations. To address this, the authors propose a mixture of Gaussians (MoG) prior: a narrow Gaussian encourages small weights to shrink to zero, while a broad Gaussian preserves large weights essential for modeling the data accurately. By clustering weights into near-zero and larger groups, this data-driven regularization avoids forcing all weights toward zero equally and demonstrates better generalization on 12 toy tasks compared to early stopping and traditional squared-weight penalties. Link
DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models Muyang Li et al 2024 diffusion, distributed inference Arxiv DistriFusion introduces *displaced patch parallelism*, where the input image is split into patches, each processed independently by different GPUs. To maintain fidelity and reduce communication costs, the method reuses activations from the previous timestep as context for the current step, ensuring interaction between patches without excessive synchronization. Synchronous communication is only used at the initial step, while subsequent steps leverage asynchronous communication, hiding communication overhead within computation. This technique allows each device to process only a portion of the workload efficiently, avoiding artifacts and achieving scalable parallelism tailored to the sequential nature of diffusion models. Link
Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching Xinyin Ma et al 2024 diffusion, caching Arxiv This paper proposes interpolating between computationally cheap but suboptimal solutions and optimal but expensive ones by training a router to learn which layers of the diffusion transformer can be cached. Link
Flash Attention Tri Dao et al 2022 attention, transformer Arxiv This introduces FlashAttention, which is an IO-aware exact attention algo that uses tiling. Basically, they use tiling to prevent needing to put the large NxN attention matrix on GPU HBM; FlashAttention goes through blocks of the K and V matrices, loads them to on-chip SRAM, which increases speed! Neat! Link
Token Merging for Fast Stable Diffusion Daniel Bolya et al 2023 diffusion, token merging Arxiv This paper seeks to apply ToMe (https://arxiv.org/pdf/2210.09461) to diffusion models, introducing techniques for token partitioning (changing how the src and dst token sets are chosen and merged) and a token unmerging operation (which simply copies each merged token's averaged value back to the positions of the original tokens). Remarkably, this works very well! Link
DeepCache: Accelerating Diffusion Models for Free Xinyin Ma et al 2023 diffusion, cache Arxiv Similarly to Faster Diffusion (Senma Li et al, 2024), this paper uses the temporal redundancy in the denoising stages. They then cache features across the UNet by skipping some of the skip branches / paths. Basically, for timesteps t and t+1 that are similar, we can cache some of the high level features between them and directly use them. Also smart! Link
Faster Diffusion: Rethinking the Role of UNet Encoder in Diffusion Models Senmao Li et al 2024 diffusion, encoder NeurIPS This paper notes that the UNet encoder in diffusion models produces very similar features across adjacent timesteps (whereas the decoder varies much more). Thus, they cyclically reuse encoder features from earlier timesteps for the decoder. Smart! Link
Improved Denoising Diffusion Probabilistic Models Alex Nichol et al 2021 diffusion, precision, recall Arxiv This paper is the first to show that DDPMs can get competitive log-likelihoods. They use a reparameterization and a hybrid learning objective to more tightly optimize the variational lower bound, and find that their objective has less gradient noise during training. They use learned variances and find that they can get convincing samples using fewer steps. They also use the improved precision and recall metrics (Kynkaanniemi et al 2019) to show that diffusion models have higher recall for similar FID, which suggests they cover a large portion of the target distribution. They focused on optimizing log-likelihood as it is believed that optimizing it forces the model to capture all modes of the data distribution (Razavi et al 2019). Henighan et al 2020 also showed that small improvements in log-likelihood can dramatically impact sample quality and learned feature representations. The authors argue that fixing \sigma_{t} (as Ho et al 2020 does) is reasonable in terms of sample quality, but does not explain much about the log-likelihood. Thus, to improve log-likelihood they look for a better choice of \Sigma_{\theta}(x_{t},t), and choose to learn it. They note that it is better to parameterize the variance as an interpolation between \beta_{t} and \tilde{\beta}_{t} in the log domain. Remember that \beta_{t} is the noise schedule, typically a small value that increases over time following some schedule; \tilde{\beta}_{t} is a reparameterization of \beta_{t} (the posterior variance) used to simplify calculations, and they are related via \alpha_{t} = 1-\beta_{t}. Finally, they note that a linear noise schedule destroys information faster than necessary, and they propose a different noise schedule. Lots of insights! Link
Improving Training Efficiency of Diffusion Models via Multi-Stage Framework and Tailored Multi-Decoder Architecture Huijie Zhang et al 2024 diffusion, multi-stage CVPR This paper proposes a multi-stage framework for diffusion models that uses a shared encoder and separate decoders for different timestep intervals, along with an optimal denoiser-based timestep clustering method, to improve training and sampling efficiency while maintaining or enhancing image generation quality. Link
Temporal Dynamic Quantization for Diffusion Models Junhyuk So et al 2023 diffusion, quantization NeurIPS Temporal Dynamic Quantization (TDQ) addresses the challenge of quantizing diffusion models by dynamically adjusting quantization parameters based on the denoising time step. TDQ employs a trainable module consisting of frequency encoding, a multi-layer perceptron (MLP), and a SoftPlus activation to predict optimal quantization intervals for each time step. This module maps the temporal information to appropriate quantization parameters, allowing the method to adapt to the varying activation distributions across different stages of the diffusion process. By pre-computing these quantization intervals, TDQ avoids the runtime overhead associated with traditional dynamic quantization methods while still providing the necessary flexibility to handle the temporal dynamics of diffusion models. Link
Learning Efficient Convolutional Networks through Network Slimming Zhuang Liu et al 2017 pruning, importance CVPR This paper introduces *network slimming*, a method to reduce the size, memory footprint, and computation of CNNs by enforcing channel-level sparsity without sacrificing accuracy. It works by identifying and pruning insignificant channels during training, leveraging the γ scaling factors in Batch Normalization (BN) layers to effectively determine channel importance. The approach introduces minimal training overhead and is compatible with modern CNN architectures, eliminating the need for specialized hardware or software. Using the BN layer’s built-in scaling properties makes this pruning efficient, avoiding redundant scaling layers or issues that arise from linear transformations in convolution layers. Link
Q-Diffusion: Quantizing Diffusion Models Xiuyu Li et al 2023 diffusion, sampling ICCV This paper tackles the inefficiencies of diffusion models, such as slow inference and high computational cost, by proposing a post-training quantization (PTQ) method designed specifically for their multi-timestep process. The key innovation includes a *time step-aware calibration data sampling* approach, which uniformly samples inputs across multiple time steps to better reflect real inference data, addressing quantization errors and varying activation distributions without the need for additional data. Additionally, the paper introduces *shortcut-splitting quantization* to handle the bimodal activation distributions caused by the concatenation of deep and shallow feature channels in shortcuts, quantizing them separately before concatenation for improved accuracy with minimal extra resources. Link
Mixture of Efficient Diffusion Experts Through Automatic Interval and Sub-Network Selection Alireza Ganjdanesh et al 2024 diffusion, sampling Arxiv This paper reduces the cost of sampling by pruning a pretrained diffusion model into a mixture of efficient experts, one per time interval, with a routing agent that predicts the sub-network architecture used to form each expert. Link
A Closer Look at Time Steps is Worthy of Triple Speed-Up for Diffusion Model Training Kai Wang et al 2024 diffusion, sampling Arxiv This paper introduces SpeeD, a novel approach for accelerating the training of diffusion models without compromising performance. The authors analyze the diffusion process and identify three distinct areas: acceleration, deceleration, and convergence, each with different characteristics and importance for model training. Based on these insights, SpeeD implements two key components: asymmetric sampling, which reduces the sampling of less informative time steps in the convergence area, and change-aware weighting, which gives more importance to the rapidly changing areas between acceleration and deceleration. The authors' key insight is that not all time steps in the diffusion process are equally valuable for training, with the convergence area providing limited benefits despite occupying a large proportion of time steps, while the rapidly changing area between acceleration and deceleration is crucial but often undersampled. To address this, SpeeD introduces an asymmetric sampling strategy using a two-step probability function: $P(t) = \begin{cases} \frac{k}{T + \tau(k-1)}, & 0 < t \leq \tau \\ \frac{1}{T + \tau(k-1)}, & \tau < t \leq T \end{cases}$, where τ is a carefully selected threshold marking the beginning of the convergence area, k is a suppression intensity factor, T is the total number of time steps, and t is the current time step. This function increases sampling probability before τ and suppresses it after. Additionally, SpeeD employs a change-aware weighting scheme based on the gradient of the process increment's variance, assigning higher weights to time steps with faster changes. By combining these strategies, SpeeD aims to focus computational resources on the most informative parts of the diffusion process, potentially leading to significant speedups in training time without sacrificing model quality. Link
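A small sketch of the asymmetric sampling distribution from the formula above, plus a draw of training timesteps from it; the values of τ and k here are placeholders, not the paper's settings.

```python
import numpy as np

def speed_probs(T=1000, tau=700, k=3.0):
    """P(t) = k/(T + tau*(k-1)) for t <= tau, else 1/(T + tau*(k-1))."""
    denom = T + tau * (k - 1)
    return np.where(np.arange(1, T + 1) <= tau, k / denom, 1.0 / denom)

p = speed_probs()
print(p.sum())                                           # 1.0: a valid distribution
rng = np.random.default_rng(0)
timesteps = rng.choice(np.arange(1, 1001), size=8, p=p)  # oversamples the pre-convergence area
print(timesteps)
```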
HyperGAN: A Generative Model for Diverse, Performant Neural Networks Neale Ratzlaff et al 2019 gan, ensemble ICML This paper introduces HyperGAN, a novel generative model designed to learn a distribution of neural network parameters, addressing the issue of overconfidence in standard neural networks when faced with out-of-distribution data. Unlike traditional approaches, HyperGAN doesn't require restrictive prior assumptions and can rapidly generate large, diverse ensembles of neural networks. The model employs a unique "mixer" component that projects prior samples into a correlated latent space, from which layer-specific generators create weights for a deep neural network. Experimental results show that HyperGAN can achieve competitive performance on datasets like MNIST and CIFAR-10 while providing improved uncertainty estimates for out-of-distribution and adversarial data compared to standard ensembles. NOTE: There has actually been a diffusion variant of this idea: https://arxiv.org/pdf/2402.13144 Link
Diffusion Models Already Have a Semantic Latent Space Mingi Kwon et al 2023 diffusion, latent space ICLR This paper introduces Asymmetric Reverse Process (Asyrp), a method that discovers a semantic latent space (h-space) in pretrained diffusion models, enabling controlled image manipulation with desirable properties such as homogeneity, linearity, and consistency across timesteps, while also proposing a principled design for versatile editing and quality enhancement in the generative process. The authors propose Asymmetric Reverse Process (Asyrp). It modifies only the P_{t} term while preserving the D_{t} term in the reverse process. This makes sense because it a) breaks the destructive interference seen in previous methods, b) allows for controlled modification of the generation process towards target attributes, and c) maintains the overall structure and quality of the diffusion process. Link
One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale Fan Bao et al 2023 diffusion, multi-modal ICML The authors present a method of sampling from joint and conditional distributions using a small modification on diffusion models. UniDiffuser’s proposed method involves handling multiple modalities (such as images and text) within a single diffusion model. Here is in general what they do: 1. Perturb data in all modalities: for a given data point (x_0, y_0), where x_0 is an image and y_0 is text, UniDiffuser adds noise to both simultaneously. The noisy versions are x_{t_x} and y_{t_y}, where t_x and t_y are the respective timesteps. 2. Use individual timesteps for different modalities: instead of a single timestep t for both modalities, UniDiffuser uses separate timesteps t_x and t_y. This allows for more flexibility in handling the different characteristics of each modality. 3. Predict noise for all modalities simultaneously: UniDiffuser uses a joint noise prediction network \epsilon_{\theta}(x_{t_x}, y_{t_y}, t_x, t_y) that takes in the noisy versions of both modalities and their respective timesteps, and outputs the predicted noise for both modalities in one forward pass. Link
Diffusion Models as a Representation Learner Xingyi Yang et al 2023 diffusion, representation learner ICCV This paper studies whether the representations learned by pre-trained diffusion models transfer to recognition tasks. They propose RepFusion, which distills intermediate features of an off-the-shelf diffusion model into a student network, using reinforcement learning to select which timestep's features to distill, and show gains on tasks such as image classification, semantic segmentation, and landmark detection. Link
Masked Diffusion Transformer is a Strong Image Synthesizer Shanghua Gao et al 2023 diffusion, masking, transformer ICCV This paper (smartly!) notices that one of the major reasons for long training and poor results of diffusion models is the slow learning of relationships among an image's parts. For instance, they remark on the model learning one eye of a dog before both eyes. They propose to mask tokens of the input image in the latent space and train the model to reconstruct the masked tokens alongside the usual denoising objective, forcing it to learn these contextual relations. Brilliant! Link
Generative Modeling by Estimating Gradients of the Data Distribution Yang Song et al 2019 diffusion, score matching NeurIPS This paper introduces Noise Conditional Score Networks (NCSNs), a novel approach to generative modeling that learns to estimate the score function of a data distribution at multiple noise levels. NCSNs are trained using score matching, avoiding the need to compute normalizing constants, and generate samples using annealed Langevin dynamics. The method addresses challenges in modeling complex, high-dimensional data distributions, particularly for data lying on or near low-dimensional manifolds. Link
LAPTOP-Diff: Layer Pruning and Normalized Distillation for Compressing Diffusion Models Dingkun Zhang et al 2024 diffusion, pruning Arxiv This paper proposes layer pruning and normalized distillation for pruning diffusion models. They use a surrogate function and show that their surrogate implies a property called "additivity", where the output distortion caused by many perturbations approximately equals the sum of the output distortion caused by each single perturbation. They then show that their layer selection can be formulated as a 0-1 Knapsack problem. They then analyze what the important objective for retraining is, and see that there is an imbalance in previous feature distillation approaches employed in the retraining phase. They note that the L2-norms of feature maps at the end of different stages and the values of different feature loss terms vary significantly; for instance, the highest loss term is ~10k times greater than the lowest one throughout the distillation process, and produces about 1k times larger gradients. This dilutes the gradients of the numerically insignificant feature loss terms. So, they opt to normalize the feature loss. Link
Classifier-Free Diffusion Guidance Jonathan Ho et al 2022 diffusion, guidance NeurIPS This paper introduces classifier-free guidance, a novel technique for improving sample quality in conditional diffusion models without using a separate classifier. Unlike traditional classifier guidance, which relies on gradients from an additional classifier model, classifier-free guidance achieves similar results by combining score estimates from jointly trained conditional and unconditional diffusion models. The method involves training a single neural network that can produce both conditional and unconditional score estimates, and then using a weighted combination of these estimates during the sampling process. This approach simplifies the training pipeline, avoids potential issues associated with training classifiers on noisy data, and eliminates the need for adversarial attacks on classifiers during sampling. The authors demonstrate that classifier-free guidance can achieve a similar trade-off between Fréchet Inception Distance (FID) and Inception Score (IS) as classifier guidance, effectively boosting sample quality while reducing diversity. The key difference is that classifier-free guidance operates purely within the generative model framework, without relying on external classifier gradients. This method provides an intuitive explanation for how guidance works: it increases conditional likelihood while decreasing unconditional likelihood, pushing generated samples towards more characteristic features of the desired condition. Link
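A sketch of the classifier-free guidance combination at sampling time, with a toy stand-in for the jointly trained network; the "null" condition token, the guidance scale, and the sign convention are assumptions here (papers differ on whether w = 0 or w = 1 corresponds to the plain conditional model).

```python
import numpy as np

NULL = None                                    # "empty" condition used for the unconditional branch

def eps_model(x, t, cond):                     # toy stand-in for the jointly trained network
    shift = 0.0 if cond is NULL else float(cond)
    return 0.1 * x + shift

def cfg_eps(x, t, cond, w=2.0):
    """eps_guided = eps(x, t, empty) + w * (eps(x, t, c) - eps(x, t, empty))."""
    e_uncond = eps_model(x, t, NULL)
    e_cond = eps_model(x, t, cond)
    return e_uncond + w * (e_cond - e_uncond)

x = np.zeros(4)
print(cfg_eps(x, t=10, cond=1, w=2.0))         # pushed beyond the plain conditional prediction
```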
LD-Pruner: Efficient Pruning of Latent Diffusion Models using Task-Agnostic Insights Thibault Castells et al 2024 pruning, diffusion, ldm CVPR This paper presents LD-Pruner. The main interesting part is how they frame the pruning problem. Basically, they define an "operator" (any fundamental building block of a net, like convolutional layers, activation functions, transformer blocks), and try to either 1) remove it or 2) replace it with a less demanding operation. As they operate on the latent space, this work can be applied to any generative task that uses latent diffusion (it is task agnostic). It is interesting to note their limitations: the approach does not extend to pruning the decoder, and it does not consider dependencies between operators (which is a big deal I think). Finally, their score function seems a bit arbitrary (maybe this could be learned?). Link
RoFormer: Enhanced Transformer with Rotary Position Embedding Jianlin Su et al 2021 attention, positional embedding Arxiv This paper introduces Rotary Position Embedding (RoPE), a method for integrating positional information into transformer models by using a rotation matrix to encode absolute positions and incorporating relative position dependencies. Link
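Below is a small sketch of one common variant of RoPE (the half-split layout used by many open implementations), assuming an input of shape (seq_len, dim) with an even dim; it is meant to show the rotation idea, not to mirror the paper's exact code.

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate pairs of feature dimensions by a position-dependent angle."""
    seq_len, dim = x.shape
    half = dim // 2
    # Per-pair frequencies: lower-indexed pairs rotate faster than higher-indexed ones.
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    # A 2D rotation per (x1, x2) pair encodes absolute position; dot products between
    # rotated queries and keys then depend only on relative position.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```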
GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models Alex Nichol et al 2022 text-conditioned diffusion, inpainting Arxiv This paper explores text-conditional image synthesis using diffusion models, comparing CLIP guidance and classifier-free guidance, and finds that classifier-free guidance produces more photorealistic and caption-aligned images. Link
LLM Inference Unveiled: Survey and Roofline Model Insights Zhihang Yuan et al 2024 llms, survey Arxiv This paper surveys recent advancements in LLM inference, like speculative decoding and operator fusion. They also analyze the findings using the Roofline model, and are likely the first to do so for LLM inference. Good for checking out other papers that have recently been published. Link
An Empirical Study of Mamba-based Language Models Roger Waleffe et al 2024 mamba, llms, transformer Arxiv This paper compares Mamba-based, Transformer-based, and hybrid language models in a controlled setting where model sizes and datasets are larger than in past comparisons (8B params / 3.5T tokens). They find that Mamba and Mamba-2 lag behind Transformer models on copying and in-context learning tasks. They then show that a hybrid architecture of 43% Mamba, 7% self-attention, and 50% MLP layers performs better than all others. Link
Diffusion Models Beat GANs on Image Synthesis Prafulla Dhariwal et al 2021 diffusion, gan Arxiv This work demonstrates that diffusion models surpass the current state-of-the-art generative models in image quality, achieved through architecture improvements and classifier guidance, which balances diversity and fidelity. The model attains FID scores of 2.97 on ImageNet 128×128 and 4.59 on ImageNet 256×256, matching BigGAN-deep with as few as 25 forward passes while maintaining better distribution coverage. Additionally, combining classifier guidance with upsampling diffusion models further enhances FID scores to 3.94 on ImageNet 256×256 and 3.85 on ImageNet 512×512. Link
Progressive Distillation for Fast Sampling of Diffusion Models Tim Salimans et al 2022 diffusion, distillation, sampling ICLR Diffusion models excel in generative modeling, surpassing GANs in perceptual quality and autoregressive models in density estimation, but they suffer from slow sampling times. This paper introduces two key contributions: new parameterizations that improve stability with fewer sampling steps and a distillation method that progressively reduces the number of required steps by half each time. Applied to benchmarks like CIFAR-10 and ImageNet, the approach distills models from 8192 steps down to as few as 4 steps, maintaining high image quality while offering a more efficient solution for both training and inference. Link
On Distillation of Guided Diffusion Models Chenlin Meng et al 2023 diffusion, classifier-free guidance Arxiv Classifier-free guided diffusion models are effective for high-resolution image generation but are computationally expensive during inference due to the need to evaluate both conditional and unconditional models many times. This paper proposes a method to distill these models into faster ones by learning a single model that approximates the combined outputs, then progressively reducing the number of sampling steps. The approach significantly accelerates inference, generating images with comparable quality to the original model using as few as 1-4 denoising steps, achieving up to 256× speedup on datasets like ImageNet and LAION. Link
Diffusion Probabilistic Models Made Slim Xingyi Yang et al 2022 diffusion, dpms, spectral diffusion Arxiv Diffusion Probabilistic Models (DPMs) produce impressive visual results but suffer from high computational costs, limiting their use on resource-limited platforms. This paper introduces Spectral Diffusion (SD), a lightweight model designed to address DPMs' bias against high-frequency generation, which smaller networks struggle to capture. SD incorporates wavelet gating for frequency dynamics and spectrum-aware distillation to enhance high-frequency recovery, achieving 8-18× computational efficiency while maintaining competitive image fidelity. Link
Structural Pruning for Diffusion Models Gongfan Fang et al 2023 diffusion, pruning NeurIPS Generative modeling has advanced significantly with Diffusion Probabilistic Models (DPMs), but these models often require substantial computational resources. To address this, Diff-Pruning is introduced as a compression method that prunes unimportant weights using a Taylor expansion over pruned diffusion timesteps, identifying which weights matter without extensive re-training. Empirical results show that Diff-Pruning can cut FLOPs by around 50% while maintaining consistent generative performance at only 10-20% of the original training cost. Link
Diffusion Models: A Comprehensive Survey of Methods and Applications Ling Yang et al 2024 diffusion, survey ACM Diffusion models are a powerful class of deep generative models known for their success in tasks like image synthesis, video generation, and molecule design. This survey categorizes diffusion model research into efficient sampling, improved likelihood estimation, and handling specialized data structures, while also discussing the potential for combining them with other generative models. The review highlights their broad applications across fields such as computer vision, NLP, temporal data modeling, and interdisciplinary sciences, suggesting areas for further exploration. Link
GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium Martin Heusel et al 2017 gan, equilibrium, fid, is NeurIPS This paper introduces a two time-scale update rule (TTUR) for GANs, and proves that this makes GANs converge to a local Nash equilibrium. Even more cited is the FID score introduced here. FID improves on IS by comparing the distributions of real and generated images directly: the Inception model extracts features from images, and these features are assumed to follow a multidimensional Gaussian distribution. FID then measures the difference between the two Gaussians (representing the real and generated images) using the Fréchet distance, which captures differences in the mean and covariance (the first two moments) of the distributions. The Gaussian is a sensible choice because it is the maximum entropy distribution for a given mean and covariance (proof: https://medium.com/mathematical-musings/how-gaussian-distribution-maximizes-entropy-the-proof-7f7dcb2caf4d) -- maximum entropy matters because it means the Gaussian makes the fewest additional assumptions about the data, keeping the model as non-committal as possible given the available information. In practice, we compute the mean and covariance of the real and generated image features, then compute the FID score using the Fréchet (AKA Wasserstein-2) distance between the two Gaussians. Link
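The final step, the Fréchet distance between the two feature Gaussians, is simple enough to sketch; the snippet below assumes you have already computed feature means and covariances (e.g. from Inception activations) and only illustrates the formula, it is not a reference FID implementation.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(mu_real, cov_real, mu_gen, cov_gen) -> float:
    """Wasserstein-2 distance between two Gaussians given their first two moments."""
    diff = mu_real - mu_gen
    covmean = sqrtm(cov_real @ cov_gen)   # matrix square root of the covariance product
    covmean = np.asarray(covmean).real    # drop tiny imaginary parts from numerical error
    return float(diff @ diff + np.trace(cov_real + cov_gen - 2.0 * covmean))
```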
Scalable Diffusion Models with Transformers William Peebles et al 2023 diffusion,ddpm, dit CVPR The authors explore using transformers in the latent space, rather than U-Nets. They find that their methods can lead to lower FID scores compared to prior SOTA. In this paper, their image generation pipeline is roughly: 1) Input high resolution image x 2) Encoder z = E(x), where E is a pre-trained frozen VAE encoder, and z is the latent representation 3) The DiT model operates on z 4) New latent representation z’ is sampled from the diffusion model 5) We then decode the z’ using the pre-trained frozen VAE decoder D, and x’ is now the generated high resolution image. Link
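The five-step pipeline above reduces to a few lines; here is a hedged sketch in which `vae_encode`, `vae_decode`, and `dit_sample` are stand-ins for the frozen VAE and the DiT-based reverse diffusion process (they are assumptions, not the authors' API).

```python
import torch

@torch.no_grad()
def generate(vae_encode, vae_decode, dit_sample, x: torch.Tensor) -> torch.Tensor:
    z = vae_encode(x)            # 2) project the high-resolution image into latent space
    z_new = dit_sample(z.shape)  # 3-4) run the diffusion transformer in latent space to sample z'
    return vae_decode(z_new)     # 5) decode z' back to a high-resolution image
```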
Max-Affine Spline Insights Into Deep Network Pruning Haoran You et al 2022 early-bird, lottery-hypothesis, pruning, low-precision TMLR The authors make connections between spline theory (AKA considering DNNs as continuous piecewise affine mappings) and pruning. Basically, they say that pruning removes redundant decision boundaries in the layers that are pruned, and that we can compare the decision boundaries of unpruned networks to their pruned counterparts to show this (they have some nice visualizations). They also note that the final decision boundary often does not depend on existing subdivision lines. Finally, they demonstrate another way of finding EB tickets using this spline formulation. Link
Drawing Early-Bird Tickets: Towards More Efficient Training of Deep Networks Haoran You et al 2020 early-bird, lottery-hypothesis, pruning, low-precision ICLR The authors show that there exist early-bird (EB) tickets: small but critical subnetworks of dense, randomly initialized networks that can be found using low-cost training schemes (low precision, early stopping). They also design a practical low-compute method for finding these, based on mask distance. Basically, for each pruning iteration, a binary mask is created. This mask represents which parts of the network are kept (the "ticket", or pruned subnet) and which parts are removed. They consider the scaling factor "r" in BN layers as an indicator of significance: r is learned during training and scales the normalized activations, and its magnitude indicates how important the channel is to the network's performance. After deciding which channels to prune based on r, the binary mask is created: if a channel is kept (not pruned), it is marked as 1 in the mask, else 0. For any two subnets, they then compute the "mask distance" (AKA Hamming distance) between the two ticket masks. They measure the mask distance between consecutive epochs and draw EB tickets when this distance is smaller than some threshold. Link
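A minimal sketch of the ticket-mask bookkeeping described above, assuming the per-channel BN scaling factors are available as a tensor; the function names and stopping threshold are illustrative.

```python
import torch

def channel_mask(bn_scales: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """1 for kept channels (largest |r|), 0 for pruned channels."""
    k = max(1, int(keep_ratio * bn_scales.numel()))
    threshold = bn_scales.abs().sort(descending=True).values[k - 1]
    return (bn_scales.abs() >= threshold).int()

def mask_distance(mask_a: torch.Tensor, mask_b: torch.Tensor) -> float:
    """Normalized Hamming distance between two ticket masks."""
    return (mask_a != mask_b).float().mean().item()

# EB criterion (sketch): draw the ticket once masks from consecutive epochs stop changing.
# if mask_distance(mask_epoch[t], mask_epoch[t - 1]) < 0.1: stop and prune with mask_epoch[t]
```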
Learning both Weights and Connections for Efficient Neural Networks Song Han et al 2015 pruning, compression, regularization NeurIPS The authors show a method of pruning neural networks in three steps: 1) train the network to learn what connections are important, 2) prune unimportant connections, 3) retrain and fine-tune. In order to train for learning what connections are important, they do not focus on learning the final weight values, but rather just focus on the importance of connections. They don't explicitly mention how this is done, but one could look at the Hessian of the loss or the magnitude of the weights. I'd imagine you could do this within only a few training iterations. In their "Regularization" section, it is interesting to note that L1 regularization (penalizes non-zero params resulting in more params near zero) gave better accuracy after pruning, but before retraining. But, these remaining connections are not as good as with using L2. The authors also present a discussion of what dropout rate to use. Link
Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference Jiaming Tang et al 2024 KV cache, sparsity, LLM ICML Long context LLM inference is slow, and the speed decreases significantly as sequence lengths grow. This is mainly due to needing to load a big KV cache during self-attention. Prior works have used methods to evict tokens in the attention maps to promote sparsity, but the Han lab (smartly!) found that the criticality of tokens strongly correlates with the current query token. Thus, they employ a method that retains the full KV cache (since past evicted tokens may be needed to handle future queries) while selecting only the top-K tokens relevant to a particular query. This allows for speedups in self-attention at low cost to accuracy. Link
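The core query-aware selection can be sketched as a top-K lookup over the cached keys; note this is a deliberately simplified, token-level illustration (Quest itself estimates criticality at a coarser granularity with cheap per-block metadata), and the names are mine.

```python
import torch

def query_aware_attention(query, keys, values, k: int):
    """Attend only to the k cached tokens judged most relevant to the current query."""
    scores = keys @ query                         # relevance of every cached token to this query
    top = scores.topk(min(k, scores.numel())).indices
    weights = torch.softmax(scores[top], dim=0)
    return weights @ values[top]                  # attention output over the selected subset
```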
BigNAS: Scaling Up Neural Architecture Search with Big Single-Stage Models Jiahui Yu et al 2020 NAS, one-shot Arxiv Most NAS frameworks train some one-shot model to rank the quality of different child architectures. However, these rankings often differ from reality, so frameworks typically finetune architectures after finding them. BigNAS proposes that this fine-tuning / post-processing is not necessary. They find some interesting points, such as that "big models converge faster while small child models converge slower". Thus, at some training step t when the performance of a big model peaks, the small child models are not yet fully trained, and at a later step t' where the child models are fully trained, the big model is overfitting. They therefore use an exponentially decaying learning rate schedule with a constant ending: the learning rate is held constant at the end of training once it reaches 5% of the initial learning rate. Another point they bring up is a "coarse-to-fine" strategy where one first finds a rough sketch of promising network candidates, and then samples multiple finer grained variations around the sketch of interest. Link
Meta-Learning of Neural Architectures for Few-Shot Learning Thomas Elsken et al 2021 NAS, meta-learning, few-shot, fsl Arxiv The authors propose MetaNAS, which is the first method that fully integrates NAS with gradient-based meta-learning. Basically, they learn a method of joint learning gradient-based NAS methods like DARTS and meta-learning the architecture itself. Their goal is thus: meta-learn an architecture \alpha_{meta} with corresponding meta-learned weights w_{meta}. When given a new task \mathcal{T}_{i}, both \alpha_{meta} and w_{meta} adapt quickly to \mathcal{T}_{i} based on a few samples. One interesting technique they do is add a temperature term that is annealed to 0 over the course of task training; this is to help with sparsity of the mixture weights of the operations when using the DARTS search. Link
MetAdapt: Meta-Learned Task-Adaptive Architecture for Few-Shot Classification Sivan Doveh et al 2020 NAS, meta-learning, few-shot, fsl Arxiv The authors propose a method using a DARTS-like search for FSL architectures. "Our goal is to learn a neural network where connections are controllable and adapt to the few-shot task with novel categories... However, unlike DARTS, our goal is not to learn a one time architecture to be used for all tasks... we need to make our architecture task adaptive so it would be able to quickly rewire for each new target task.". Basically, they design a thing called a MetAdapt Controller that changes the connection in the main network according to some given task. Link
Distilling the Knowledge in a Neural Network Geoffrey Hinton et al 2015 distillation, ensemble, MoE Arxiv The first proposal of knowledge distillation. The main interesting point I found was that they raise the temperature of the softmax to produce softer targets. This allows for understanding which 2's look like 3's (in an MNIST example). Basically, this adds a sort of regularization, since more information can be carried in these softer targets compared to a single 0 or 1. They also propose the idea of having an ensemble of models and then learning a smaller distilled model. The biological example of a clumsy larva that then becomes a more specialized insect was good. Link
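A small sketch of the temperature-softened distillation loss described above (the T² factor keeps soft-target gradients on the same scale as the hard-label term, as in the paper); the hyperparameter values here are arbitrary.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T: float = 4.0, alpha: float = 0.5):
    """Soft targets at temperature T plus the usual hard-label cross-entropy."""
    soft_targets = F.softmax(teacher_logits / T, dim=-1)   # softer than T = 1: which 2's look like 3's survives
    log_student = F.log_softmax(student_logits / T, dim=-1)
    soft_loss = F.kl_div(log_student, soft_targets, reduction="batchmean") * (T * T)
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```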
HyperTuning: Toward Adapting Large Language Models without Back-propagation Jason Phang et al 2023 hypernetworks, adaptation, tuning, LoRA, LLMs ICML The authors show that we can use a hypernetwork for model adaptation, generating task-specific parameters from few-shot examples. They try two approaches: generating soft prefixes and generating LoRA parameters for a frozen T5 model. They also note the importance of hyperpretraining, i.e., an additional stage to adapt the hypernet to generate parameters for the downstream model, and propose a scheme for this. NOTE! "We also observe a consistent trend where HyperT5-Prefix outperforms HyperT5-LoRA. We speculate that it is easier for hypermodels to learn to generate soft prefixes as compared to LoRA weights..." Link
Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning Armen Aghajanyan et al 2020 fine-tuning, intrinsic dimension, lora Arxiv Large models with billions of parameters can be fine-tuned using only a few hundred examples. Why is this? Furthermore, large models often allow for significant sparsification, which implies that there is much redundancy. This paper targets both of these ideas, by showing that many common models have an "intrinsic dimension" much less than the full parameterization. Link
LoRA: Low-Rank Adaptation of Large Language Models Edward Hu et al 2021 low rank adaptation, lora, llm, fine-tuning Arxiv Fine-tuning large models is expensive because we update all the original parameters. Taking inspiration from Aghajanyan et al., 2020 (pre-trained language models have a low "intrinsic dimension"), the authors hypothesized that the weight updates would also have low intrinsic rank. Thus, they decompose Delta W = BA, where B and A are low-rank matrices, and only A and B are trainable. They initialize A with a Gaussian and B as zero, so Delta W = BA is zero initially. They then optimize and find this method to be more efficient in terms of both time and space. Link
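A minimal sketch of a LoRA-augmented linear layer matching the description above: the pre-trained weight is frozen, A gets a Gaussian init, B a zero init, and only A and B train. The (alpha / r) scaling convention is common practice but treat it as an assumption.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = W x + (alpha / r) * B A x, with W frozen and only A, B trainable."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                                      # freeze pre-trained W
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)   # Gaussian init
        self.B = nn.Parameter(torch.zeros(base.out_features, r))         # zero init: Delta W = BA = 0 at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T
```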
Learning to Compress Prompts with Gist Tokens Jesse Mu et al 2023 llms, prompting, compression, tokens NeurIPS The authors describe a method of using a distilling function G (similar to a hypernet) that is able to compress LM prompts into a smaller set of "gist" tokens. These tokens can then be cached and reused. The neat trick is that they reuse the LM itself as G, so gisting itself incurs no additional training cost. Note that in their "Failure Cases" section, they mention "... While it is unclear why only the gist models exhibit this behavior (i.e. the fail example behavior), these issues can likely be mitigated with more careful sampling techniques." Link
Once-For-All: Train One Network and Specialize it For Efficient Deployment Han Cai et al 2020 nas, supernets ICLR The authors proposed training one large supernetwork and then sampling subnetworks as an approach for NAS. This method allows for the simultaneous generation of many different subnetworks that could satisfy different constraints (i.e. hardware, latency, accuracy, etc). The authors also propose a progressive shrinking method to train the net (start by training the big supernet, then progressively shrink down), which can be seen as a generalized pruning method. Furthermore, they introduce an idea of training a twin neural network to help estimate latency / accuracy given some architecture, which allows for fast feedback when conducting the search for subnetworks. Link
Dataless Knowledge Fusion by Merging Weights Xisen Jin et al 2023 knowledge fusion, weight merging ICLR The paper introduces RegMean, a method for merging pre-trained language models from different datasets by solving a linear optimization problem, which improves generalization across domains without requiring the original training data. Compared to existing methods like Simple Averaging and Fisher Averaging, RegMean offers higher computational efficiency and comparable memory overhead, while achieving better or equivalent performance across various natural language tasks, including out-of-domain generalization. The method is evaluated using GLUE datasets and demonstrates superior performance in most tasks, outperforming traditional model ensembling and multi-task learning approaches. Link
Superposition of Many Models into One Cheung et al 2019 superposition, online learning, tasks, continual learning NeurIPS A method of storing multiple models using only one set of parameters via parameter superposition is provided; it shares similarities with superposition in Fourier analysis for signal processing. Link
Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation Yoshua Bengio et al 2013 gradients, stochasticity, backpropagation Arxiv The authors introduce several methods for estimating or propagating gradients through networks that have stochastic neurons. This comes up often in quantization-aware networks, which can have decision boundaries in the neurons that are not differentiable in the usual sense. The paper also popularizes the "Straight Through Estimator", which was actually first introduced in one of Hinton's lectures. One interesting idea they present (that I think may have also been introduced in Kingma's VAE paper?) is that we can model the output h_{i} of some stochastic neuron as the application of a deterministic function that also depends on some noise source z_{i}: h_{i} = f(a_{i}, z_{i}). TLDR: straight through units are typically the go-to due to ease of use and good performance. Link
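The straight-through estimator is easiest to see as a custom autograd function; here is the simplest variant (identity backward, no gradient clipping), meant as a sketch rather than the paper's exact estimator.

```python
import torch

class SignSTE(torch.autograd.Function):
    """Hard sign() in the forward pass, identity gradient in the backward pass."""

    @staticmethod
    def forward(ctx, x):
        return torch.sign(x)      # non-differentiable hard decision stand-in

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output        # straight through: pretend the forward was the identity

x = torch.randn(4, requires_grad=True)
SignSTE.apply(x).sum().backward()
print(x.grad)                     # all ones, despite sign() having zero gradient almost everywhere
```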
DoReFaNet: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients Shuchang Zhou et al 2018 quantization, cnn, gradients Arxiv The authors introduce a method to train CNNs with low bitwidth weights and activations using low bitwidth param gradients. They use deterministic quantization for weights and activations, while stochastically quantizing gradients. Note that they do not quantize the weights of the first CNN layer for the most part, as they noted that it would often degrade performance (Han et al. 2015 also notes a similar thing). Another interesting thing they do is add noise to the gradient after quantization to increase performance. This paper also uses the straight through estimator (Bengio et al 2013) for propagating gradients when using their quantization scheme. Link
Training Deep Neural Networks with 8-bit Floating Point Numbers Naigang Wang et al 2018 quantization, floating-point, precision NeurIPS The authors show that it is possible to train DNNs with 8-bit fp values while maintaining decent accuracy. To do this, they make a new FP8 format, develop a technique "chunk-based computations" that allow matrix and convolution ops to be computed using 8-bit multiplications and 16 bit additions, and use fp stochastic rounding in weight updates. One interesting point they make is that swamping (the issue of truncation in large-to-small number addition) is a serious problem in DNN bit-precision reduction. Link
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference Benoit Jacob et al 2017 quantization, quantization schemes, efficient inference, floating-point Arxiv The authors propose a quantization scheme that allows us to only use integer arithmetic to approximate fp computations in a neural network. They also describe a training approach that simulates the effect of quantization in the forward pass. Backprop still occurs, but all weights and biases are stored in fp. The forward prop pass then simulates quantized inference by rounding off using the quantization scheme they describe that changes fp to int. Link
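The flavor of such a scheme is an affine map between floats and integers defined by a scale and zero-point; the sketch below shows the quantize/dequantize round trip that the simulated-quantization forward pass relies on (a generic illustration, not the paper's exact scheme).

```python
import numpy as np

def quantize(x: np.ndarray, num_bits: int = 8):
    """Map floats to unsigned integers via q = round(x / scale) + zero_point."""
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(round(qmin - x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Recover approximate floats; training simulates this round trip in the forward pass."""
    return scale * (q.astype(np.float32) - zero_point)
```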
PACT: Parameterized Clipping Activation for Quantized Neural Networks Jungwook Choi et al 2018 quantization, clipping, activations ICLR The authors present a method of quantization by clipping activations using a learnable parameter, alpha. They show that this can lead to lower decreases in accuracy compared to other quantization methods. They also note that activations have been hard to quantize compared to weights in the past. They also prove that PACT is as expressive as ReLU, by showing it can reach the same solution as ReLU if SGD is used. They also describe the hardware benefits that can be incurred. Link
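A sketch of a PACT-style activation: clip at a learnable alpha, quantize to k bits, and use a straight-through trick so gradients reach both the input and alpha. Details like the level count follow the common formulation but should be read as assumptions.

```python
import torch
import torch.nn as nn

class PACT(nn.Module):
    """Clipped activation with a learnable clipping level alpha, then k-bit quantization."""

    def __init__(self, alpha_init: float = 10.0, k: int = 4):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(alpha_init))
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Clamp to [0, alpha], written so the gradient w.r.t. alpha is nonzero where x > alpha.
        y = torch.clamp(x, min=0.0) - torch.clamp(x - self.alpha, min=0.0)
        levels = 2 ** self.k - 1
        y_q = torch.round(y * levels / self.alpha) * self.alpha / levels
        return y + (y_q - y).detach()   # straight-through: quantized forward, continuous backward
```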
SMASH: One-Shot Model Architecture Search through Hypernetworks Andrew Brock et al 2017 hypernetworks, nas, one-shot, few-shot Arxiv The authors propose a technique to speed up NAS by using a hypernet. Basically, they train a hypernet to generate the weights of a main model that has a variable architecture. The input to the hypernet is a binarized representation of the model architecture, and the output is the corresponding weights. They then train only for a few epochs and compare the validation scores obtained across different representations. Then, they fully train the model that had the best validation score. Link
Example-based Hypernetworks for Multi-source Adaptation to Unseen Domains Tomer Volk et al 2023 hypernetworks, multi-source adaptation, unseen domains, NLP EMNLP The authors apply hypernets to unsupervised domain adaptation in NLP. They use example-based adaptation. The main idea is that they use an encoder-decoder to initially create the unique signatures from an input example, and then they embed it within the source domain's semantic space. The signature is then used by a hypernet to generate the task classifier's weights. The paper focuses on improving generalization to unseen domains by explicitly modeling the shared and domain specific characteristics of the input. To allow for parameter sharing, they propose modeling based on hypernets, which allow soft weight sharing. Link
Meta-Learning via Hypernetworks Dominic Zhao et al 2020 hypernetworks, meta-learning NeurIPS The authors propose a soft weight-sharing hypernet architecture that performs well on meta-learning tasks. A good paper to show efforts in meta-learning with regards to hypernets, and comparing them to SOTA methods like Model-Agnostic Meta-Learning (MAML). Link
HyperDynamics: Meta-Learning Object and Agent Dynamics with Hypernetworks Zhou Xian et al 2021 hypernetworks, meta-learning, dynamics ICLR The authors present a dynamics meta-learning framework which conditions on an agent's interactions with its environment and (optionally) the visual input from it. From this, they can generate the parameters of a neural dynamics model. The three modules they use are 1) an encoding module that encodes a few agent-environment interactions / the agent's visual observations into a feature code, 2) a hypernet that conditions on the latent feature code to generate the parameters of a dynamics model dedicated to this observed system, and 3) a target dynamics model built from the generated parameters, which takes a low-dim system state / agent action as input and outputs the prediction of the next system state. Link
Principled Weight Initialization for Hypernetworks Oscar Chang et al 2020 hypernetworks, weight initialization ICLR Classical weight initialization techniques don't really work on hypernets, because they fail to produce weights for the mainnet in the correct scale. The authors derive formulas for hyperfan-out and hyperfan-in weight initialization, and show that it works well for the mainnet. Link
Continual Learning with Hypernetworks Johannes von Oswald et al 2020 hypernetworks, continual learning, meta learning ICLR The authors present a method of preventing catastrophic forgetting, by using task-conditioned hypernets (i.e., hypernets that generate weights of target model based on some task embedding). Thus, rather than memorizing many data characteristics, we can split the problem into just learning a single point per task, given the task embedding. Link
Stochastic Hyperparameter Optimization through Hypernetworks Jonathan Lorraine et al 2018 hypernetworks, hyperparameters ICLR Using hypernetworks to learn hyperparameters. They replace the training optimization loop in favor of a differentiable hypernetwork to allow for tuning of hyperparameters using grad descent. Link
Playing Atari with Deep Reinforcement Learning Volodymyr Mnih et al 2013 q-learning, reinforcement learning Arxiv The authors present the first deep learning model that can learn complex control policies, and they teach it to play Atari 2600 games using Q-learning. Their goal was to create one net that can play as many games as possible. Link
Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding Song Han et al 2016 quantization, encoding, pruning ICLR A three-pronged approach to compressing nets. They prune networks, then quantize and share weights, and then apply Huffman coding. Link
Binarized Neural Networks: Training Neural Networks with Weights and Activations Constrained to +1 or -1 Matthieu Courbariaux et al 2016 quantization, efficiency, binary Arxiv Introduction of training Binary Neural Networks, or nets with binary weights and activations. They also present experiments on deterministic vs stochastic binarization. They use the deterministic one for the most part, except for activations. Link
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks Mingxing Tan et al 2020 efficiency, scaling ICML A study of model scaling is presented. They propose a novel scaling method that uniformly scales all dimensions of depth/width/resolution using a compound coefficient: for instance, if you want to use 2^{N} more compute resources, you scale depth, width, and resolution by their respective coefficients raised to the power N. They also quantify the relationship between width, depth, and resolution. Link
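The compound scaling rule is just three coupled exponentials; the sketch below uses coefficients close to those reported for the EfficientNet baseline (treat the exact values as assumptions), chosen so that raising them to the power phi costs roughly 2^phi more FLOPs.

```python
def compound_scaling(phi: float, alpha: float = 1.2, beta: float = 1.1, gamma: float = 1.15):
    """Return (depth, width, resolution) multipliers for a given compound coefficient phi."""
    # The search constrains alpha * beta**2 * gamma**2 to be about 2, so each unit of phi
    # roughly doubles the FLOPs of the scaled network.
    return alpha ** phi, beta ** phi, gamma ** phi

print(compound_scaling(phi=1.0))  # modest, balanced growth in all three dimensions
```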
2-in-1 Accelerator: Enabling Random Precision Switch for Winning Both Adversarial Robustness and Efficiency Yonggan Fu et al 2021 precision, adversarial, efficiency ACM Introduction of a Random Precision Switch algorithm that has potential for defending against adversarial attacks while promoting efficiency. Link
The wake-sleep algorithm for unsupervised neural networks Geoffrey Hinton et al 1995 representation, generative Arxiv One of the first generative neural networks that kind of resembles diffusion. Link
ViTCoD: Vision Transformer Acceleration via Dedicated Algorithm and Accelerator Co-Design Haoran You et al 2022 vit, accelerator, attention Arxiv Algorithm-accelerator co-design for ViTs. Prunes and polarizes attention maps into denser/sparser patterns. Develops a dedicated hardware accelerator as well. Link
Evolving Neural Networks through Augmenting Topologies Kenneth O. Stanley et al 2002 nas, evolution Arxiv Evolution for NAS. Link
A Brief Review of Hypernetworks in Deep Learning Vinod Kumar Chauhan et al 2024 hypernetwork Arxiv Review of hypernets. Link
HyperNetworks David Ha et al 2016 hypernetwork Arxiv Looking at HyperNetworks: networks that generate weights for other networks. Link
Deep Learners Benefit More from Out-of-Distribution Examples Yoshua Bengio et al 2024 ood ICML Evidence that ood samples can help learning. They also argue that intermediate levels of representation can benefit the models in multi-task settings. Link
Balanced Data, Imbalanced Spectra: Unveiling Class Disparities with Spectral Imbalance Chiraag Kaushik et al 2024 spectra, class imbalance ICML Introduction of the idea of "spectral imbalance", which can affect classification accuracy even when classes are balanced. Basically, they look at how the distributions of eigenvalues in different classes affect classification accuracy. Link
DeepArchitect: Automatically Designing and Training Deep Architectures Renato Negrinho et al 2017 nas Arxiv Proposal of a language to describe neural networks architectures. Can then describe them as trees to search through. Show different search methods for going through the trees (Monte Carlo tree search, random, use of surrogate function, etc.). Link
Graph neural networks: A review of methods and applications Jie Zhou et al 2020 gnn AI Open What graph neural networks are, what they are made of, how to train them. And examples. They describe a general design pipeline (find the graph structure, specify graph type and scale, design the loss function) and explain the main modules in GNNs (a propagation module to propagate information between nodes, a sampling module to conduct the propagation on large graphs, and a pooling module to extract information from nodes). Link
1D convolution neural networks and applications: A survey Serkan Kiranyaz et al 2020 cnn, survey Mechanical Systems and Signal Processing A brief overview of applications of 1D CNNs is performed. It is largely focused on medicine (for instance, ECG) and fault detection (for instance, vibration based structural damage). Link
2 in 1 Accelerator: Enabling Random Precision Switch for Winning Both Adversarial Robustness and Efficiency Yonggan Fu et al 2021 quantization, accelerator ACM The most interesting point of this paper (among many things!) is the smart idea to use quantization as a way to boost DNN robustness. Cool! Link
Token Picker: Accelerating Attention in Text Generation with Minimized Memory Transfer via Probability Estimation Junyoung Park et al 2024 efficiency, hardware, accelerator, attention DAC In autoregressive models with attention, off chip memory accesses need to be minimized. The authors note that there have been efforts to prune unimportant tokens, but these do not do much for removing tokens with attention scores near zero. The authors (smartly!) notice this issue and provide a fast method of estimating whether a token is important, which drives the decision to prune it or not. An architecture for this is also provided. Link
Maxout Networks Ian Goodfellow et al 2013 dropout, maxout ICML The authors note that dropout is "most effective when taking relatively large steps in parameter space. In this regime, each update can be seen as making a significant update to a different model on a different subset of the training set". I really liked that quote. They then develop the maxout unit, which is essentially the maximum across some number of affine transformations, allowing for learning of piecewise linear approximations of nonlinear functions. Link
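A maxout unit in a few lines: compute k affine transformations and keep the elementwise maximum. This is a generic sketch (the grouping of the k pieces is arbitrary), not the paper's implementation.

```python
import torch
import torch.nn as nn

class Maxout(nn.Module):
    """Max over k affine pieces, giving a learned piecewise-linear activation."""

    def __init__(self, in_features: int, out_features: int, k: int = 4):
        super().__init__()
        self.k = k
        self.linear = nn.Linear(in_features, out_features * k)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.linear(x)                        # all k affine pieces computed at once
        z = z.view(*x.shape[:-1], -1, self.k)     # (..., out_features, k)
        return z.max(dim=-1).values
```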
Geometric deep learning: Going beyond Euclidean data Michael Bronstein et al 2017 geometric deep learning IEEE SIG Provides an overview of geometric deep learning, which are methods of generalizing DNNs to non Euclidean domains (graphs, manifolds, etc). Link
Sampling in Constrained Domains with Orthogonal Space Variational Gradient Descent Ruqi Zhang et al 2022 variational gradient descent, gradient flow NeurIPS The authors propose a new variational framework called O Gradient for sampling in implicitly defined constrained domains, using two orthogonal directions to drive the sampler towards the domain and explore it by decreasing a KL divergence. They prove the convergence of O Gradient and apply it to both Langevin dynamics and Stein Variational Gradient Descent (SVGD), demonstrating its effectiveness on various machine learning tasks. Link
Entropy MCMC: Sampling from Flat Basins with Ease Bolian Li et al 2024 sampling, bayesian, flat basins ICML The authors propose a practical MCMC algorithm for sampling from flat basins of DNN posterior distributions, using a guiding variable based on local entropy to steer the sampler. They prove the fast convergence rate of their method compared to existing flatness aware methods and demonstrate its superior performance on various tasks through comprehensive experiments. The method is mathematically simple and computationally efficient, making it suitable as a drop in replacement for standard sampling methods like SGLD. Link
AdderNet: Do We Really Need Multiplications in Deep Learning? Hanting Chen et al 2021 multiplication-less, efficiency CVPR The authors show that, at some cost in accuracy, you can replace multiplications with additions. They mainly tested CNNs. Link
Explaining and Harnessing Adversarial Examples Ian Goodfellow et al 2015 adversarial examples ICLR Adversarial examples (adding "small but intentionally worst case perturbations to examples from the dataset") prove to be an interesting method to train models. The authors also (smartly!) describe a linear method to generate adversarial examples (the fast gradient sign method). Link
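A minimal sketch of that linear (fast gradient sign) perturbation, assuming a differentiable `model` and `loss_fn`; the epsilon value is arbitrary.

```python
import torch

def fgsm_perturb(model, loss_fn, x, y, epsilon: float = 0.01):
    """Perturb x in the sign of the loss gradient, the direction that most increases the loss."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x_adv), y)
    loss.backward()
    return (x_adv + epsilon * x_adv.grad.sign()).detach()
```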
Identifying and attacking the saddle point problem in high dimensional non convex optimization Yann Dauphin et al 2014 saddle points, optimization NeurIPS The authors argue that saddle points, rather than local minima, are the primary challenge in minimizing non convex error functions in high dimensional spaces, based on insights from various scientific fields and empirical evidence. They explain that saddle points surrounded by high error plateaus can significantly slow down learning and create the illusion of local minima, particularly in high dimensional problems of practical interest. To address this challenge, the authors propose a new approach called the saddle free Newton method, designed to quickly escape high dimensional saddle points, unlike traditional gradient descent and quasi Newton methods. Link
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift Sergey Ioffe et al 2015 batch, normalization PMLR The authors identify internal covariate shift as a significant challenge in training deep neural networks, where the distribution of each layer's inputs changes during training due to parameter updates in previous layers. To address this issue, they propose Batch Normalization, a method that normalizes layer inputs as part of the model architecture, performing normalization for each training mini batch. Batch Normalization enables the use of much higher learning rates, reduces sensitivity to initialization, and acts as a regularizer, sometimes eliminating the need for Dropout. Link
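The training-time transform itself is short; below is a sketch for a (batch, features) tensor, omitting the running statistics used at inference time.

```python
import torch

def batch_norm(x: torch.Tensor, gamma: torch.Tensor, beta: torch.Tensor, eps: float = 1e-5):
    """Normalize each feature over the mini-batch, then apply a learned scale and shift."""
    mean = x.mean(dim=0)
    var = x.var(dim=0, unbiased=False)
    x_hat = (x - mean) / torch.sqrt(var + eps)   # zero-mean, unit-variance per feature
    return gamma * x_hat + beta                  # gamma/beta restore representational power
```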
Bayesian Deep Learning and a Probabilistic Perspective of Generalization Andrew Wilson et al 2020 bayesian, marginalization NeurIPS The authors emphasize that marginalization, rather than using a single set of weights, is the key distinguishing feature of a Bayesian approach, which can significantly improve the accuracy and calibration of modern deep neural networks. They demonstrate that deep ensembles provide an effective mechanism for approximate Bayesian marginalization and propose a related approach that further improves the predictive distribution by marginalizing within basins of attraction. The paper investigates the prior over functions implied by a vague distribution over neural network weights, explaining neural network generalization from a probabilistic perspective and showing that seemingly mysterious results (like fitting random labels) can be reproduced with Gaussian processes. The authors demonstrate that Bayesian model averaging mitigates the double descent phenomenon, leading to monotonic performance improvements as model flexibility increases. Link
A Practical Bayesian Framework for Backpropagation Networks David MacKay et al 1992 bayesian Neural Computation The authors present a quantitative and practical Bayesian framework for learning mappings in feedforward networks, enabling objective comparisons between different network architectures and providing stopping rules for network pruning or growing procedures. This framework allows for objective selection of weight decay terms or regularizers, measures the effective number of well determined parameters in a model, and provides quantified estimates of error bars on network parameters and outputs. The approach helps detect poor underlying assumptions in learning models and demonstrates a good correlation between generalization ability and Bayesian evidence for well matched learning models. Link
Bayesian Neural Network Priors Revisited Vincent Fortuin et al 2022 bayesian, priors ICLR Isotropic Gaussian priors are the standard for modern Bayesian neural network inference, but their accuracy and optimal performance are uncertain. By studying summary statistics of neural network weights trained with stochastic gradient descent (SGD), the authors find that CNN and ResNet weights exhibit strong spatial correlations, while FCNNs display heavy tailed weight distributions. Incorporating these observations into priors improves performance on various image classification datasets, mitigating the cold posterior effect in FCNNs but slightly increasing it in ResNets. Link
Hands on Bayesian Neural Networks A Tutorial for Deep Learning Users Laurent Jospin et al 2022 bayesian IEEE A good summary / tutorial for using Bayesian Nets. Also provides some good paper references within. Link
Position Paper: Bayesian Deep Learning in the Age of Large Scale AI Theodore Papamarkou et al 2024 bayesian, mcmc ICML A good summary of the strengths of BDL (Bayesian Deep Learning) with regards to modern deep learning, while also addressing some weaknesses. A good paper if you need an overview of modern challenges (as of 2024). Link
A Neural Probabilistic Language Model Bengio et al 2003 statistical language modeling JMLR One of the first papers about modern methods of using neural systems to estimate probability functions of word sequences. They show that MLPs can model word sequences better than the SOTA (at that time). A classic. Link
Bit Fusion: Bit Level Dynamically Composable Architecture for Accelerating Deep Neural Networks Hardik Sharma et al 2018 accelerator, quantization, bit fusion ISCA Hardware acceleration of Deep Neural Networks (DNNs) aims to address their high compute intensity, with the paper focusing on the potential of reducing bitwidth in operations without compromising classification accuracy. To prevent accuracy loss, the bitwidth varies significantly across DNNs, and a fixed bitwidth accelerator may lead to limited benefits or degraded accuracy. The authors introduce Bit Fusion, a bit flexible accelerator that dynamically adjusts bitwidth for individual DNN layers, resulting in significant speedup and energy savings compared to state of the art accelerators, Eyeriss and Stripes, and achieving performance close to a 250 Watt Titan Xp GPU while consuming much less power. Link
A Framework for the Cooperation of Learning Algorithms Leon Bottou et al 1990 learning algorithms, modules NeurIPS Cooperative training of modular systems offers a unified approach to many learning algorithms and hybrid systems, allowing the design and implementation of complex learning systems that incorporate structural a priori knowledge about tasks. The authors introduce a framework using a statistical formulation of learning systems to define and combine modules into cooperative systems, enabling the creation of hybrid systems that combine the advantages of connectionist and other learning algorithms. By decomposing complex tasks into simpler subtasks, modular architectures can be built, where each module corresponds to a subtask, facilitating easier achievement of the learning goal by introducing a modular decomposition of the global task. Link
CNP: An FPGA Based Processor for Convolutional Networks Clement Farabet et al 2009 fpga, cnn IEEE One of the first attempts (that I have found) at putting a CNN into an FPGA and showing it can be done to perform some task (face detection). Link
A Complete Recipe for Stochastic Gradient MCMC Yi An Ma et al 2015 hamiltonian, mcmc NeurIPS Recent Markov chain Monte Carlo (MCMC) samplers use continuous dynamics and scalable variants with stochastic gradients to efficiently explore target distributions, but proving convergence with stochastic gradient noise remains challenging. The authors provide a general framework for constructing MCMC samplers, including stochastic gradient versions, based on continuous Markov processes defined by two matrices, demonstrating that any such process can be represented within this framework. Using this framework, they propose a new state adaptive sampler, stochastic gradient Riemann Hamiltonian Monte Carlo (SGRHMC), which combines the benefits of Riemann HMC with the scalability of stochastic gradient methods, as shown in experiments with simulated data and a streaming Wikipedia analysis. Link
CPT: Efficient Deep Neural Network Training via Cyclic Precision Yonggan Fu et al 2021 precision, efficiency, wide minima ICLR Low precision deep neural network (DNN) training is an effective method for improving training time and energy efficiency, with this paper proposing a new perspective: that DNN precision may act similarly to the learning rate during training. The authors introduce Cyclic Precision Training (CPT), which cyclically varies precision between two boundary values identified through a simple precision range test in the initial training epochs, aiming to boost time and energy efficiency further. Link
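The "precision as a learning-rate-like knob" idea amounts to a cyclic bitwidth schedule; here is a hedged sketch using a cosine-shaped cycle between two boundary precisions (the exact schedule shape and bounds are assumptions; the paper finds the bounds via a precision range test).

```python
import math

def cyclic_precision(step: int, cycle_len: int, prec_min: int = 3, prec_max: int = 8) -> int:
    """Bitwidth for the current training step, cycling between prec_min and prec_max."""
    phase = (step % cycle_len) / cycle_len
    prec = prec_min + 0.5 * (prec_max - prec_min) * (1.0 - math.cos(2.0 * math.pi * phase))
    return int(round(prec))

# Example: over one cycle the precision rises from 3 bits to 8 bits and back down.
print([cyclic_precision(s, cycle_len=8) for s in range(8)])
```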
Approximation by Superpositions of a Sigmoidal Function G. Cybenko 1989 universal approximator, completeness Mathematics of Control, Signals, and Systems This paper demonstrates that finite linear combinations of compositions of a fixed univariate function and a set of affine functionals can uniformly approximate any continuous function of n real variables within the unit hypercube, under mild conditions on the univariate function. These findings resolve an open question about the representability of functions by single hidden layer neural networks, specifically showing that arbitrary decision regions can be well approximated by continuous feedforward neural networks with a single hidden layer and any continuous sigmoidal nonlinearity. Link
Cyclical Stochastic Gradient MCMC for Bayesian Deep Learning Ruqi Zhang et al 2020 mcmc, bayesian ICLR The posteriors over neural network weights are high dimensional and multimodal, with each mode representing a different meaningful interpretation of the data. The authors introduce Cyclical Stochastic Gradient MCMC (SG MCMC) with a cyclical stepsize schedule, where larger steps discover new modes and smaller steps characterize each mode, and they prove the non asymptotic convergence of this algorithm. Link
DaDianNao: A Machine Learning Supercomputer Yunji Chen et al 2014 accelerator, gpu IEEE/ACM This paper introduces a custom multi chip architecture optimized for Convolutional and Deep Neural Networks (CNNs and DNNs), addressing their computational and memory intensive nature by leveraging on chip storage to enhance internal bandwidth and reduce external communication bottlenecks. The authors demonstrate significant performance gains with their 64 chip system achieving up to a 450.65x speedup over GPUs and reducing energy consumption by up to 150.31x on large neural network layers, implemented with custom storage, computational units, and robust interconnects at 28nm scale. Link
DARTS: Differentiable Architecture Search Hanxiao Liu et al 2019 nas ICLR This paper introduces a differentiable approach to architecture search, tackling scalability challenges by reformulating the task to allow gradient based optimization over a continuous relaxation of architecture representations. Unlike traditional methods relying on evolutionary or reinforcement learning in discrete, non differentiable spaces, the proposed method efficiently discovers high performance convolutional architectures for image classification and recurrent architectures for language modeling. Link
Decoupled Contrastive Learning Chun Hsiao Yeh et al 2022 contrastive learning, self-supervised learning ACM This paper introduces decoupled contrastive learning (DCL), which removes the negative positive coupling (NPC) effect from the InfoNCE loss, significantly improving the efficiency of self supervised learning (SSL) tasks with smaller batch sizes. DCL achieves efficient and reliable performance enhancements across various benchmarks, outperforming the SimCLR baseline without requiring momentum encoding, large batch sizes, or extensive epochs. Link
Deep Image Prior Dmitry Ulyanov et al 2020 inpainting, super-resolution, denoising IEEE This paper challenges the conventional wisdom by demonstrating that the structure of a generator network, even when randomly initialized, can effectively capture low level image statistics without any specific training on example images. The authors show that this randomly initialized neural network can serve as a powerful handcrafted prior, yielding excellent results in standard image processing tasks such as denoising, super resolution, and inpainting. Furthermore, the same network structure can invert deep neural representations for diagnostic purposes and restore images based on input pairs like flash and no flash conditions, showcasing its versatility and effectiveness across various image restoration applications. Link
Deep Double Descent: Where Bigger Models and More Data Hurt Preetum Nakkiran et al 2019 capacity, double descent Arxiv This paper explores the "double descent" phenomenon in modern deep learning tasks, showing that as model size or training epochs increase, performance initially worsens before improving. The authors unify these observations by introducing a new complexity measure termed effective model complexity, conjecturing a generalized double descent across this measure. Link
DeepShift: Towards Multiplication Less Neural Networks Mostafa Elhoushi et al 2021 multiplication-less, efficiency Arxiv This paper addresses the computational challenges of deploying convolutional neural networks (CNNs) on edge computing platforms by introducing convolutional shifts and fully connected shifts, replacing multiplications with efficient bitwise operations during both training and inference. The proposed DeepShift models achieve competitive or higher accuracies compared to baseline models like ResNet18, ResNet50, VGG16, and GoogleNet, while significantly reducing the memory footprint by using only 5 bits or less for weight representation during inference. Link
DepthShrinker: A New Compression Paradigm Towards Boosting Real Hardware Efficiency of Compact Neural Networks Yonggan Fu et al 2022 compression, efficiency, pruning ICML This paper introduces DepthShrinker, a framework designed to enhance hardware efficiency of deep neural networks (DNNs) by transforming irregular computation patterns of compact operators into dense ones, thereby improving hardware utilization without sacrificing model accuracy. By leveraging insights that certain activation functions can be removed post training without loss of accuracy, DepthShrinker pioneers a compression paradigm that optimizes DNNs for real hardware efficiency, presenting a significant advancement in efficient model deployment. Link
Dimensionality Reduction by Learning an Invariant Mapping Raia Hadsell et al 2006 dimensionality reduction, mapping CVPR DrLIM, or Dimensionality Reduction by Learning an Invariant Mapping, addresses key limitations of existing dimensionality reduction techniques by learning a non linear function that maps high dimensional data to a low dimensional manifold based solely on neighborhood relationships, without requiring a predefined distance metric in input space. The method is distinguished by its ability to handle transformations and maintain invariant mappings, demonstrated through experiments that show its effectiveness in preserving neighborhood relationships and accurately mapping new, unseen samples to meaningful locations on the manifold. Unlike methods like LLE, which may struggle with variability and registration issues in input data, DrLIM's contrastive loss function ensures robustness by balancing similarity and dissimilarity in output space, offering a promising approach for applications requiring invariant mappings, such as learning positional information from image sequences in robotics. Link
Disentangling Trainability and Generalization in Deep Neural Networks Lechao Xiao et al 2020 neural tangent kernel, ntk ICML This study focuses on characterizing the trainability and generalization of deep neural networks, particularly under conditions of very wide and very deep architectures, leveraging insights from the Neural Tangent Kernel (NTK). By analyzing the NTK's spectrum, the study formulates necessary conditions for both memorization and generalization across architectures like Fully Connected Networks (FCNs) and Convolutional Neural Networks (CNNs). The research identifies key spectral quantities such as λmax, λbulk, κ, and P(Θ(l)) that critically influence the performance of deep networks, providing a precise theoretical framework validated by extensive experiments on CIFAR10. It highlights distinctions in generalization behavior between CNNs with and without global average pooling. Link
Finding Structure in Time Jeffrey Elman 1990 rnn Cognitive Science This is the original simple recurrent (Elman) network paper, an early demonstration of learning temporal structure with recurrent connections. Good insights on time dependent system learning. Link
E2 Train: Training State of the art CNNs with Over 80% Energy Savings Yue Wang et al 2019 cnn, batch, energy NeurIPS This paper introduces E2 Train, a framework for energy efficient CNN training on resource constrained platforms. It integrates three levels of energy saving techniques: data level stochastic mini batch dropping, model level selective layer updates, and algorithm level low cost, low precision back propagation. Experimental results on CIFAR 10 and CIFAR 100 demonstrate energy savings of over 90% and 84%, respectively, with minimal loss in accuracy, and real energy measurements on an FPGA validate its effectiveness for training ResNet models on the CIFAR datasets. Link
cuDNN: Efficient Primitives for Deep Learning Sharan Chetlur et al 2014 cuda, gpu Arxiv This paper introduces cuDNN, a library designed to optimize deep learning primitives akin to BLAS for HPC. cuDNN offers efficient implementations of key deep learning kernels tailored for GPUs, improving performance and reducing memory usage in frameworks like Caffe by up to 36%. Link
EIE: Efficient Inference Engine on Compressed Deep Neural Network Song Han et al 2016 compression, accelerator, co-design Arxiv This paper introduces EIE, an energy efficient inference engine designed for compressed deep neural networks, achieving significant energy savings by exploiting weight sharing, sparsity, and quantization. EIE performs sparse matrix vector multiplications directly on compressed models, enabling 189× and 13× faster inference speeds compared to CPU and GPU implementations of uncompressed DNNs. Link
An Empirical Analysis of Deep Network Loss Surfaces Daniel Jiwoong Im et al 2016 optimization, loss surface, saddle points Arxiv This paper empirically investigates the geometry of loss functions in state of the art neural networks, employing various stochastic optimization methods. Through visualizations in low dimensional subspaces, it explores how different optimization procedures lead to distinct local minima, even when algorithms are changed late in the optimization process. The study reveals that modifications to optimization procedures consistently yield different local minima, each affecting the network's performance on test examples differently. Interestingly, while different optimization algorithms find varied local minima from different initializations, the shape of the loss function around these minima remains characteristic to the algorithm used, with ADAM showing larger basins compared to vanilla SGD. Link
EyeCoD: Eye Tracking System Accelerator via FlatCam based Algorithm & Accelerator Co Design Haoran You et al 2023 accelerator, co-design, eye-tracking ACM This paper introduces EyeCoD, a lensless FlatCam based eye tracking system designed to overcome limitations of traditional systems, such as large form factor and high communication costs. By integrating a predict then focus algorithm pipeline and dedicated hardware accelerator, EyeCoD achieves significant reductions in computation and communication overhead while maintaining high tracking accuracy. Link
Eyeriss v2: A Flexible Accelerator for Emerging Deep Neural Networks on Mobile Devices Yu Hsin Chen et al 2019 efficiency, sparsity Arxiv This paper introduces Eyeriss v2, a specialized DNN accelerator architecture designed to efficiently handle compact and sparse neural networks. Unlike traditional DNN accelerators, Eyeriss v2 incorporates a hierarchical mesh network on chip to adapt to varying layer shapes and sizes, optimizing data reuse and bandwidth utilization. Eyeriss v2 excels in processing sparse data directly in the compressed domain, both for weights and activations, thereby enhancing processing speed and energy efficiency particularly suited for sparse models. Link
Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks Yu-Hsin Chen et al 2016 cnn, row-stationary, efficiency ACM/IEEE The paper addresses the high energy consumption of deep convolutional neural networks (CNNs) caused by extensive data movement, which persists despite parallel computing paradigms like SIMD/SIMT. It introduces a novel row-stationary (RS) dataflow designed for spatial architectures. RS maximizes local data reuse and minimizes data movement during convolutions by leveraging PE-local storage, inter-PE communication, and spatial parallelism. Link
Flat Minima Sepp Hochreiter et al 1997 flat minima, low complexity, gibbs Neural Computation The algorithm focuses on identifying "flat" minima of the error function in weight space. A flat minimum is characterized by a large connected region where the error remains approximately constant. This property suggests simplicity in the network structure and low expected overfitting, supported by an MDL-based Bayesian argument. Unlike traditional approaches that rely on Gaussian assumptions or specific weight priors, this algorithm uses a Bayesian framework with a prior over input-output functions. This approach considers both the network architecture and the training set, facilitating the identification of simpler and more generalizable models. Link
Fused-Layer CNN Accelerators Manoj Alwani et al 2016 cnn, accelerator, fusion IEEE This work introduces a novel approach to CNN accelerator design by fusing the computation of multiple convolutional layers. By rearranging the dataflow across layers, intermediate data can be cached on chip between adjacent layers, reducing the need for off-chip memory storage and minimizing data transfer. Specifically, the study demonstrates the effectiveness of this approach by implementing a fused-layer CNN accelerator for the first five convolutional layers of VGGNet-E. Using 362 KB of on-chip storage, the accelerator reduces off-chip feature-map traffic by 95%, from 77 MB to 3.6 MB per image processed. The strategy targets early convolutional layers, where data transfer typically dominates; by maximizing data reuse and minimizing off-chip memory usage, it improves the efficiency of CNN accelerators. Link
EnlightenGAN: Deep Light Enhancement Without Paired Supervision Yifan Jiang et al 2021 gan, enhancement, unsupervised IEEE Explores low-light to well-lit image generation using GANs without paired supervision. Also introduces an interesting global-local discriminator and a self-regularized perceptual loss, together with a simplified attention map that is essentially the inverse of the image's illumination. Link
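A tiny NumPy sketch of that simplified attention idea, assuming the illumination of an RGB image is approximated by its per-pixel max over channels (a stand-in, not necessarily the paper's exact definition):

```python
import numpy as np

def illumination_attention(rgb):
    """Invert a normalized illumination estimate so that darker regions
    receive larger attention weights."""
    illum = rgb.astype(np.float32).max(axis=-1) / 255.0   # H x W illumination proxy
    return 1.0 - illum                                     # H x W attention map

img = np.random.randint(0, 256, size=(4, 4, 3), dtype=np.uint8)
attn = illumination_attention(img)   # values near 1 where the image is dark
```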
Understanding the difficulty of training deep feedforward neural networks Xavier Glorot et al 2010 activation, saturation, initialization AISTATS The logistic sigmoid activation is problematic for deep networks because of its non-zero mean, which can drive the top hidden layer into saturation; this saturation slows learning and can cause training plateaus. The difficulty of training deep networks correlates with the singular values of each layer's Jacobian: when these deviate significantly from 1, activations and gradients flow poorly across layers, complicating training. The paper proposes a new initialization scheme (now known as Xavier/Glorot initialization) that keeps activation and gradient variances roughly constant across layers, yielding faster convergence. Link
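The proposed initialization is easy to write down; below is a short NumPy sketch of the uniform variant, with `fan_in`/`fan_out` denoting the layer's input and output widths:

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng=None):
    """Normalized ("Xavier") uniform initialization: the bound is chosen so
    that activation and gradient variances stay roughly constant across layers."""
    rng = rng or np.random.default_rng()
    bound = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-bound, bound, size=(fan_in, fan_out))

W = xavier_uniform(256, 128)   # weights for a 256 -> 128 fully connected layer
```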
Group Normalization Yuxin Wu et al 2018 normalization Arxiv Batch Normalization performs normalization along the batch dimension, which causes errors to increase rapidly as batch sizes decrease. This limitation makes BN less effective for training larger models and tasks that require smaller batches due to memory constraints. GN divides channels into groups and computes normalization statistics (mean and variance) independently within each group. Unlike BN, GN's computation is not dependent on batch sizes, leading to stable performance across a wide range of batch sizes. Link
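A minimal NumPy sketch of the normalization step for an (N, C, H, W) tensor, illustrating that the statistics never touch the batch dimension (learnable `gamma`/`beta` included for completeness):

```python
import numpy as np

def group_norm(x, num_groups, gamma, beta, eps=1e-5):
    """Normalize each sample within groups of C // num_groups channels;
    nothing depends on the batch size N."""
    n, c, h, w = x.shape
    xg = x.reshape(n, num_groups, c // num_groups, h, w)
    mean = xg.mean(axis=(2, 3, 4), keepdims=True)
    var = xg.var(axis=(2, 3, 4), keepdims=True)
    xg = (xg - mean) / np.sqrt(var + eps)
    return xg.reshape(n, c, h, w) * gamma.reshape(1, c, 1, 1) + beta.reshape(1, c, 1, 1)

x = np.random.randn(2, 8, 4, 4)
y = group_norm(x, num_groups=4, gamma=np.ones(8), beta=np.zeros(8))
```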
Singularity of the Hessian in Deep Learning Levent Sagun et al 2017 eigenvalues, hessian ICLR The bulk of the Hessian's eigenvalues concentrate around zero, indicating how overparametrized the model is; in deep learning, overparametrization often leads to better generalization despite higher computational cost. The edges of the eigenvalue distribution, scattered away from zero, reflect the complexity of the input data, which shapes the loss landscape and affects optimization difficulty. Second-order optimization methods, which leverage information from the Hessian, can potentially accelerate training and find better solutions by exposing the curvature of the loss landscape. The top discrete eigenvalues of the Hessian are driven by the data, suggesting that different datasets may require different optimization strategies or model architectures for optimal performance. Link
Long Short-Term Memory Sepp Hochreiter et al 1997 lstm, rnn Neural Computation The original paper on the LSTM. A classic, and demonstrated the power of gating. Link
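For reference, a compact NumPy sketch of one step of the now-standard LSTM cell; note that the forget gate shown here was added in later work (Gers et al.), so this is the modern formulation rather than the 1997 original:

```python
import numpy as np

def lstm_step(x, h, c, W, U, b):
    """One step of a (modern) LSTM cell.  W: (4H, D), U: (4H, H), b: (4H,);
    gate pre-activations are stacked as [input, forget, output, candidate]."""
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    H = h.shape[0]
    z = W @ x + U @ h + b                 # all four gates in one affine map
    i, f, o = sigmoid(z[:H]), sigmoid(z[H:2*H]), sigmoid(z[2*H:3*H])
    g = np.tanh(z[3*H:])                  # candidate cell update
    c_new = f * c + i * g                 # gated memory cell
    h_new = o * np.tanh(c_new)            # gated hidden output
    return h_new, c_new
```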
On the importance of initialization and momentum in deep learning Ilya Sutskever et al 2013 initialization, momentum ICML Traditionally, training DNNs and RNNs with stochastic gradient descent (SGD) with momentum was considered challenging due to poor gradient propagation and vanishing/exploding gradients, especially in networks with many layers or long-term dependencies. The paper demonstrates that a well-designed random initialization, combined with a carefully tuned momentum schedule, makes SGD with momentum surprisingly effective for training deep and recurrent networks. Link
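A small sketch contrasting the classical and Nesterov momentum updates, the two schemes analyzed in the paper; `grad` is assumed to be a user-supplied gradient function:

```python
def sgd_momentum_step(w, v, grad, lr=0.01, mu=0.9):
    """Classical momentum: the velocity accumulates an exponentially
    weighted sum of past gradients."""
    v = mu * v - lr * grad(w)
    return w + v, v

def nesterov_step(w, v, grad, lr=0.01, mu=0.9):
    """Nesterov momentum: the gradient is evaluated at the look-ahead
    point w + mu * v before the velocity is updated."""
    v = mu * v - lr * grad(w + mu * v)
    return w + v, v
```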
Algorithms for manifold learning Lawrence Cayton 2005 manifold learning, dimensionality reduction Arxiv Many datasets exhibit complex relationships that cannot be effectively captured by linear methods like Principal Component Analysis (PCA). Manifold hypothesis: despite high-dimensional appearances, data points often lie on or near a much lower-dimensional manifold embedded within the higher-dimensional space. Manifold learning aims to uncover this underlying low-dimensional structure to provide a more meaningful and compact representation of the data. Link
NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis Ben Mildenhall et al 2020 nerf, view synthesis, 3d, scene representation, volume rendering Arxiv The method uses a fully connected deep network to represent scenes as continuous volumetric functions. The network takes a 5D input (spatial location and viewing direction) and outputs volume density and view-dependent radiance. By querying these 5D coordinates along camera rays and employing differentiable volume rendering, the method synthesizes novel views of scenes. Link
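A short NumPy sketch of the compositing step only (the 5D MLP itself is omitted), assuming per-ray sample densities, colors, and inter-sample distances have already been computed:

```python
import numpy as np

def render_ray(sigmas, colors, deltas):
    """Discrete volume rendering along one ray: sigmas (S,) are densities,
    colors (S, 3) are radiance values, deltas (S,) are inter-sample distances."""
    alphas = 1.0 - np.exp(-sigmas * deltas)                           # per-sample opacity
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alphas)))[:-1]    # accumulated transmittance
    weights = trans * alphas
    return (weights[:, None] * colors).sum(axis=0)                    # composited RGB
```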
On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima Nitish Shirish Keskar et al 2017 sharp minima, large batch ICLR The study identifies a phenomenon where large-batch SGD methods tend to converge towards sharp minimizers of the training and testing functions. Sharp minima are associated with poorer generalization, meaning the model performs worse on unseen data. In contrast, small-batch methods consistently converge towards flat minimizers, a behavior attributed to the inherent noise in small-batch gradient estimates. Link
Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks Chen Zhang et al 2015 cnn, fpga, accelerator ACM The study employs quantitative analysis techniques, including loop tiling and loop transformation, to optimize the CNN accelerator design. These techniques aim to maximize computation throughput while minimizing resource utilization on the FPGA, in particular balancing logic resource usage against memory bandwidth. Link
Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation Kyunghyun Cho et al 2014 encoder decoder, machine translation Arxiv Introduces a novel neural network architecture called the RNN Encoder-Decoder, comprising two recurrent neural networks: one RNN serves as an encoder, converting a sequence of symbols into a fixed-length vector representation, and the other acts as a decoder, generating another sequence of symbols from the encoded representation. Link
Qualitatively Characterizing Neural Network Optimization Problems Ian Goodfellow et al 2015 optimization, visualization ICLR Demonstrates that contemporary neural networks can achieve minimal training error through direct training with stochastic gradient descent alone, without complex schemes like unsupervised pretraining. This finding challenges earlier beliefs about the difficulty of navigating non-convex optimization landscapes in neural network training. They also introduce a simple graphical tool, plotting the loss along a linear interpolation between two parameter settings (for example, the initial and final weights), to visualize the loss landscape. Link
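A minimal sketch of that visualization, assuming a user-supplied `loss_fn` over flattened parameter vectors: evaluate the loss along the straight line between two parameter settings.

```python
import numpy as np

def linear_path_losses(theta_a, theta_b, loss_fn, num_points=50):
    """Evaluate the loss along theta(alpha) = (1 - alpha) * theta_a + alpha * theta_b,
    the 1-D slice used to visualize the optimization problem."""
    alphas = np.linspace(0.0, 1.0, num_points)
    losses = [loss_fn((1.0 - a) * theta_a + a * theta_b) for a in alphas]
    return alphas, losses
```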
Language Models are Unsupervised Multitask Learners Alec Radford et al 2019 unsupervised, GPT Arxiv Demonstrates that language models, specifically GPT-2 trained on the WebText dataset, begin to learn various natural language processing tasks (question answering, machine translation, reading comprehension, summarization) without explicit task-specific supervision. For instance, when conditioned on a document and questions, the model achieves an F1 score of 55 on the CoQA dataset, matching or exceeding several baseline systems trained with over 127,000 examples. Link
On the difficulty of training Recurrent Neural Networks Razvan Pascanu et al 2013 exploding gradient, vanishing gradient, gradient clipping, normalization Arxiv Explanation of issues in RNNs (vanishing / exploding gradient) and proposal of gradient clipping. Link
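A minimal NumPy sketch of clipping a (flattened) gradient by its global norm, in the spirit of the proposed fix for exploding gradients:

```python
import numpy as np

def clip_by_norm(grad, threshold):
    """If the global gradient norm exceeds the threshold, rescale the gradient
    so its norm equals the threshold; otherwise leave it untouched."""
    norm = np.linalg.norm(grad)
    return grad * (threshold / norm) if norm > threshold else grad
```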
Learning representations by back propagating errors David Rumelhart et al 1986 backpropagation, learning procedure, convergence Nature The main paper for backprop. Link
The Shattered Gradients Problem: If resnets are the answer, then what is the question? David Balduzzi et al 2018 shattering, initialization ICML The paper identifies the "shattered gradients" problem in standard feedforward neural networks: gradients in these networks exhibit an exponential decay in correlation with depth, so they come to resemble white noise. In contrast, architectures with skip connections, such as highway networks and ResNets, show gradients whose correlation decays only sublinearly, indicating greater resilience against shattering. The paper also introduces a new initialization technique termed "Looks Linear" (LL) that addresses the shattered gradients issue. Preliminary experiments demonstrate that LL initialization enables the training of very deep networks without the skip connections used in ResNets or highway networks, offering a promising alternative for achieving stable gradient propagation in deep networks, potentially simplifying architectures and improving training efficiency. Link
A Simple Baseline for Bayesian Uncertainty in Deep Learning Wesley Maddox et al 2019 bayesian, uncertainty, gaussian NeurIPS SWAG combines Stochastic Weight Averaging (SWA) with Gaussian fitting to provide an approximate posterior distribution over neural network weights. SWA computes the first moment of the SGD iterates using a modified learning rate schedule; SWAG extends this by fitting a Gaussian whose mean is the SWA solution and whose covariance is a low-rank plus diagonal matrix derived from the SGD iterates. Link
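A rough NumPy sketch of the diagonal part of SWAG, assuming a list of flattened weight snapshots collected along the SGD trajectory; the low-rank covariance component from the paper is omitted here:

```python
import numpy as np

def swag_diag_moments(weight_snapshots):
    """First and diagonal second moments of SGD iterates: the mean is the SWA
    solution, and the clipped variance gives the diagonal covariance term."""
    W = np.stack(weight_snapshots)                 # (num_snapshots, num_params)
    mean = W.mean(axis=0)
    diag_var = np.clip((W ** 2).mean(axis=0) - mean ** 2, 0.0, None)
    return mean, diag_var

def sample_swag_diag(mean, diag_var, rng=None):
    """Draw one weight sample from the diagonal-Gaussian approximation."""
    rng = rng or np.random.default_rng()
    return mean + np.sqrt(diag_var) * rng.standard_normal(mean.shape)
```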
SmartExchange: Trading Higher-cost Memory Storage/Access for Lower-cost Computation Yang Zhao et al 2020 compression, accelerator, pruning, decomposition, quantization ACM/IEEE SmartExchange integrates sparsification/pruning, decomposition, and quantization into a unified algorithm. It enforces a structured DNN weight format in which each layer's weight matrix is represented as the product of a small basis matrix and a large sparse coefficient matrix whose nonzero elements are powers of 2. Link
On the Spectral Bias of Neural Networks Nasim Rahaman et al 2019 spectra, fourier analysis, manifold learning ICML Neural networks, particularly deep ReLU networks, exhibit a learning bias towards low-frequency functions. This bias means they tend to prioritize learning global variations over local fluctuations in data, a property that aligns with their ability to generalize well across different samples and datasets. Contrary to intuition, as the complexity of the data manifold increases, deep networks find it easier to learn higher-frequency functions. This suggests that while they naturally favor low-frequency patterns, they can also adapt to more complex data structures to capture higher-frequency variations. Link
Sequence to Sequence Learning with Neural Networks Ilya Sutskever et al 2014 seq2seq Arxiv The paper introduces an end-to-end approach for sequence learning using multilayered Long Short-Term Memory (LSTM) networks. The method requires minimal assumptions about sequence structure: one deep LSTM maps the input sequence to a fixed-dimensional vector, and another deep LSTM decodes the target sequence from that vector. Link
Tiled convolutional neural networks Quoc Le et al 2010 tiling, cnn NeurIPS Tiled CNNs introduce a novel approach to learning invariances by using a regular "tiled" pattern of tied weights. Unlike traditional CNNs where adjacent hidden units share identical weights, Tiled CNNs require only that hidden units at a certain distance from each other share tied weights. This allows the network to learn complex invariances such as scale and rotational invariance, in addition to translational invariance. Link
Unsupervised Learning of Image Manifolds by Semidefinite Programming Kilian Weinberger et al 2004 manifold learning, dimensionality reduction IEEE The paper proposes a new approach to detect low-dimensional structure in high-dimensional datasets using semidefinite programming (SDP). SDP is leveraged to analyze data that resides on or near a low-dimensional manifold, which is a common challenge in computer vision and pattern recognition. The algorithm introduced overcomes limitations observed in previous manifold learning techniques like Isomap and locally linear embedding (LLE). These traditional methods often struggle with certain types of data distributions or computational complexities, which the proposed SDP-based approach aims to address more effectively. Link