Background
Large Language Models (LLMs), such as GPT-3 with its 175B parameters, are revolutionizing a wide range of natural language processing tasks. However, their substantial size and computational demands pose significant challenges, especially in resource-constrained environments. To address these challenges, model compression has emerged as a critical area of research, focused on transforming resource-intensive models into compact, efficient versions.
Method
Pruning reduces a model's size or complexity by removing unnecessary or redundant components. It can be categorized into:
- Unstructured Pruning: Simplifies LLMs by removing specific parameters, leading to irregular sparse model composition. Techniques like SparseGPT offer one-shot pruning strategies without retraining, achieving significant sparsity with minimal performance loss.
- Structured Pruning: Focuses on removing entire structural components, like neurons or layers, maintaining the overall structure. Methods like GUM and LLM-Pruner optimize pruning while preserving the model's multi-task solving and language generation capabilities.
In the "Pruning" section of the survey on model compression for Large Language Models (LLMs), two primary strategies of pruning are discussed: Unstructured Pruning and Structured Pruning. Both strategies aim to reduce the size and complexity of LLMs, but they do so in different ways.
Unstructured Pruning
Unstructured pruning simplifies an LLM by removing individual parameters without regard to the model's internal structure. This approach targets specific weights, typically by applying a threshold and zeroing out parameters whose magnitude falls below it. However, it produces an irregular sparsity pattern that demands specialized storage formats and kernels for efficient storage and computation. Unstructured pruning also often requires substantial retraining to regain accuracy, a process that is especially resource-intensive for LLMs. Key advancements in this area include:
- SparseGPT: Introduces a one-shot pruning strategy that doesn't require retraining. It frames pruning as a sparse regression problem and uses an approximate solver to achieve significant unstructured sparsity with minimal impact on model performance.
- LoRAPrune: Combines parameter-efficient tuning with pruning. It introduces a parameter importance criterion based on the values and gradients of Low-Rank Adaptation (LoRA) weights to enhance performance on downstream tasks.
- Wanda: Proposes a new pruning metric evaluating each weight's importance based on its magnitude and the norm of corresponding input activations, allowing for the removal of lower-priority weights from LLMs.
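To make these criteria concrete, here is a minimal sketch (not the official implementation of any method above) of unstructured pruning in PyTorch: a plain magnitude threshold, and a Wanda-style score that multiplies each weight's magnitude by the norm of its corresponding input activation. The tensor shapes, the per-row pruning granularity, and the stand-in calibration statistics are illustrative assumptions.

```python
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero out the smallest-magnitude entries until `sparsity` fraction is removed."""
    k = int(weight.numel() * sparsity)            # number of weights to drop (assumes k >= 1)
    threshold = weight.abs().flatten().kthvalue(k).values
    return weight * (weight.abs() > threshold)

def wanda_prune(weight: torch.Tensor, act_norm: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Wanda-style score |W_ij| * ||x_j||, pruned per output row (illustrative)."""
    score = weight.abs() * act_norm               # act_norm: per-input-channel activation norm
    k = int(weight.shape[1] * sparsity)           # weights to drop in each row
    _, idx = torch.topk(score, k, dim=1, largest=False)   # lowest-scoring positions per row
    mask = torch.ones_like(weight)
    mask.scatter_(1, idx, 0.0)
    return weight * mask

# Example: prune a random 4096x4096 projection to 50% unstructured sparsity.
W = torch.randn(4096, 4096)
x_norm = torch.rand(4096)                         # stand-in for calibration activation norms
W_pruned = wanda_prune(W, x_norm, sparsity=0.5)
print((W_pruned == 0).float().mean())             # ~0.5
```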
Structured Pruning
Structured pruning, in contrast, simplifies an LLM by removing entire structural components such as neurons, channels, or layers. This strategy targets groups of weights at once, reducing model complexity and memory usage while preserving the overall structure of the LLM. Structured pruning is particularly useful for maintaining the integrity of network architecture. Notable techniques include:
- GUM: Analyzes several structured pruning methods for decoder-only LLMs on natural language generation tasks. It introduces a method that maximizes both sensitivity and uniqueness by pruning network components based on global movement and local uniqueness scores.
- LLM-Pruner: A versatile approach to compressing LLMs, safeguarding their multi-task solving and language generation capabilities. It incorporates a dependency detection algorithm to identify interdependent structures and an efficient importance estimation method for optimal pruning.
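As a toy illustration of structured pruning (not GUM or LLM-Pruner themselves), the sketch below removes whole hidden neurons from a two-layer MLP block and shrinks the adjacent layer to match. The norm-based importance score and the layer sizes are assumptions for demonstration; the real methods use movement- or gradient-based importance estimates.

```python
import torch
import torch.nn as nn

def prune_hidden_neurons(fc1: nn.Linear, fc2: nn.Linear, keep_ratio: float):
    """Drop entire hidden neurons from an MLP block (fc1 -> activation -> fc2)."""
    # Toy importance: product of each neuron's outgoing and incoming weight norms.
    importance = fc1.weight.norm(dim=1) * fc2.weight.norm(dim=0)
    n_keep = int(fc1.out_features * keep_ratio)
    keep = torch.topk(importance, n_keep).indices.sort().values

    new_fc1 = nn.Linear(fc1.in_features, n_keep, bias=fc1.bias is not None)
    new_fc2 = nn.Linear(n_keep, fc2.out_features, bias=fc2.bias is not None)
    with torch.no_grad():
        new_fc1.weight.copy_(fc1.weight[keep])        # keep selected rows
        if fc1.bias is not None:
            new_fc1.bias.copy_(fc1.bias[keep])
        new_fc2.weight.copy_(fc2.weight[:, keep])     # keep matching columns downstream
        if fc2.bias is not None:
            new_fc2.bias.copy_(fc2.bias)
    return new_fc1, new_fc2

# Example: shrink a 4096 -> 11008 -> 4096 MLP to 75% of its hidden width.
fc1, fc2 = nn.Linear(4096, 11008), nn.Linear(11008, 4096)
fc1_s, fc2_s = prune_hidden_neurons(fc1, fc2, keep_ratio=0.75)
print(fc1_s.out_features)  # 8256
```

Because whole rows and columns disappear, the pruned layers stay dense and run on standard hardware without sparse kernels, which is the practical appeal of structured pruning.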
In summary, pruning in LLMs is about reducing redundancy and complexity, either by targeting individual weights (unstructured pruning) or entire network components (structured pruning). These methods help in making LLMs more storage and computation efficient, albeit with different approaches and implications for the model's performance and structure.
Knowledge Distillation (KD) transfers knowledge from a complex teacher model to a simpler student model. It's divided into:
- White-box KD: The student model has access to the teacher's parameters, enhancing learning and performance. Examples include MINILLM and GKD, which address challenges like distribution mismatch and model under-specification.
- Black-box KD: Only the teacher’s predictions are accessible. This method has been effective in fine-tuning small models on prompt-response pairs generated by LLM APIs. Approaches like In-Context Learning distillation and Chain-of-Thought distillation leverage LLMs' emergent abilities for improved performance in tasks like reasoning and in-context learning.
The "Knowledge Distillation" section of the survey on model compression for Large Language Models (LLMs) delves into a technique aimed at enhancing model performance and generalization by transferring knowledge from a complex model (the teacher) to a simpler one (the student). Knowledge Distillation (KD) is divided into two main categories: White-box KD and Black-box KD.
White-box KD
In White-box Knowledge Distillation, the student model has access to not only the predictions but also the internal parameters of the teacher model. This access allows the student model to gain a deeper understanding of the teacher’s knowledge representations, often leading to improved performance. Key aspects and examples include:
- Challenges in White-box KD: A central challenge is that minimizing the standard forward Kullback-Leibler divergence (KLD) encourages the student to place high probability on regions the teacher's distribution considers unlikely, which leads to improbable samples during generation.
- MINILLM: Addresses the challenge in white-box KD by minimizing reverse KLD, preventing the student from overestimating low-probability regions within the teacher’s distribution, thereby refining the quality of generated samples.
- GKD: Explores distillation from auto-regressive models and addresses issues like distribution mismatch between training and deployment outputs and model under-specification by sampling output sequences from the student during training and optimizing alternative divergences like reverse KL.
- TF-LLMD: Initializes a truncated student with a subset of layers from the larger model and trains it on pretraining data with a language modeling objective, yielding task-agnostic distillation evaluated in zero-shot settings.
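The forward-versus-reverse KLD distinction above can be written directly on logits. The following is a minimal sketch of the two divergences as distillation losses; it omits the sampling and policy-optimization machinery that MINILLM and GKD actually use, and the temperature T and toy shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def forward_kl(student_logits, teacher_logits, T: float = 1.0):
    """Standard white-box KD loss: KL(teacher || student), averaged over tokens."""
    p_t = F.softmax(teacher_logits / T, dim=-1)
    log_p_s = F.log_softmax(student_logits / T, dim=-1)
    log_p_t = F.log_softmax(teacher_logits / T, dim=-1)
    return (p_t * (log_p_t - log_p_s)).sum(-1).mean()

def reverse_kl(student_logits, teacher_logits, T: float = 1.0):
    """MINILLM-style objective: KL(student || teacher), which discourages the
    student from placing mass where the teacher assigns low probability."""
    p_s = F.softmax(student_logits / T, dim=-1)
    log_p_s = F.log_softmax(student_logits / T, dim=-1)
    log_p_t = F.log_softmax(teacher_logits / T, dim=-1)
    return (p_s * (log_p_s - log_p_t)).sum(-1).mean()

# Example with toy logits of shape (batch, seq_len, vocab).
s = torch.randn(2, 8, 32000, requires_grad=True)
t = torch.randn(2, 8, 32000)
print(forward_kl(s, t).item(), reverse_kl(s, t).item())
```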
Black-box KD
In Black-box Knowledge Distillation, only the predictions made by the teacher LLM are accessible to the student model. This approach has shown promise, especially in fine-tuning small models. It leverages LLMs' emergent abilities, such as In-Context Learning, Chain-of-Thought, and Instruction Following. Key aspects and examples include:
- In-Context Learning Distillation: Transfers in-context few-shot learning and language modeling capabilities from LLMs to smaller language models (SLMs). It combines in-context learning objectives with traditional language modeling objectives.
- Chain-of-Thought Distillation: Incorporates intermediate reasoning steps into prompts, enhancing training of smaller models. Techniques like MT-COT and CoT Prompting leverage this approach for improved reasoning capabilities.
- Instruction Following Distillation: Focuses on generating responses based on specific instructions, where the student model learns to follow complex instructions as demonstrated by the teacher.
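Here is a minimal sketch of the black-box recipe, under the assumption that teacher responses have already been collected from an LLM API (the single chain-of-thought pair below is a hypothetical placeholder, not from the cited works): the student is simply fine-tuned on the prompt-response text with a standard causal language modeling loss. The `gpt2` student is likewise a stand-in for any small causal LM.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical teacher outputs: prompts paired with chain-of-thought responses
# collected from an LLM API (the black-box setting described above).
distill_pairs = [
    {"prompt": "Q: If a pen costs $2 and a book costs $9, what do 3 pens and 1 book cost?",
     "response": "3 pens cost 3 * $2 = $6. Adding the book, $6 + $9 = $15. The answer is $15."},
]

student_name = "gpt2"  # placeholder student; any small causal LM works
tok = AutoTokenizer.from_pretrained(student_name)
student = AutoModelForCausalLM.from_pretrained(student_name)
optimizer = torch.optim.AdamW(student.parameters(), lr=5e-5)

# One step of standard causal-LM fine-tuning on a teacher-generated pair.
for pair in distill_pairs:
    text = pair["prompt"] + "\n" + pair["response"] + tok.eos_token
    batch = tok(text, return_tensors="pt")
    loss = student(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```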
In both white-box and black-box KD, the student model learns from the teacher but in different ways. White-box KD allows for a deeper and more nuanced transfer of knowledge due to access to the teacher's internal workings, while black-box KD relies solely on the teacher's outputs. Both methods aim to create smaller, more efficient models that retain much of the performance and capabilities of their larger counterparts.
Quantization converts floating-point numbers to integers or other discrete forms, reducing storage and computational complexity. It includes:
- Quantization-Aware Training (QAT): Integrates quantization into the training process, allowing LLMs to adapt to low-precision representations.
- Post-Training Quantization (PTQ): Applied after training completion, focusing on reducing complexity without architecture modification or retraining. Techniques include LUT-GEMM and LLM.int8(), which emphasize efficient inference while maintaining performance.
Quantization-Aware Training (QAT)
In QAT, quantization is integrated into the model's training phase. This integration allows the model to adapt to lower-precision representations during training, which helps mitigate the precision loss caused by quantization and preserves model performance after compression. Key aspects include:
- LLM-QAT: Addresses the challenge of acquiring training data for LLMs by leveraging generations produced by a pretrained model for data-free distillation. This method quantizes not only weights and activations but also key-value caches to enhance throughput and support longer sequence dependencies.
- PEQA and QLORA: Both methods fall under quantization-aware Parameter-Efficient Fine-Tuning (PEFT) techniques, which focus on model compression and accelerating inference. PEQA involves a dual-stage process of quantizing each layer's parameter matrix and fine-tuning the scalar vector for specific tasks. QLORA introduces concepts like double quantization and paged optimizers for memory conservation.
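To illustrate what "integrating quantization into training" means mechanically, here is a sketch of fake quantization with a straight-through estimator applied to a linear layer's weights. It is a generic QAT building block, not the specific procedure of LLM-QAT, PEQA, or QLORA; the 8-bit symmetric scheme and layer sizes are assumptions.

```python
import torch

class FakeQuant(torch.autograd.Function):
    """Simulated int8 quantization with a straight-through estimator,
    so gradients pass through the (non-differentiable) rounding step."""
    @staticmethod
    def forward(ctx, w, n_bits):
        qmax = 2 ** (n_bits - 1) - 1
        scale = w.abs().max() / qmax
        return torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None   # straight-through: identity gradient for w

class QATLinear(torch.nn.Linear):
    """Linear layer whose weights are fake-quantized in the forward pass."""
    def forward(self, x):
        return torch.nn.functional.linear(x, FakeQuant.apply(self.weight, 8), self.bias)

# Example: the layer trains in full precision but "sees" int8 weights.
layer = QATLinear(512, 512)
out = layer(torch.randn(4, 512))
out.sum().backward()            # gradients reach layer.weight via the STE
print(layer.weight.grad.shape)  # torch.Size([512, 512])
```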
Post-Training Quantization (PTQ)
PTQ is applied after the model has completed its training phase, focusing on reducing the storage and computational complexity without modifying the model architecture or retraining. This method is simpler and more efficient in achieving model compression, but it may introduce precision loss due to the quantization process. Key approaches include:
- Weight-only Quantization: Techniques like LUT-GEMM and LLM.int8() optimize matrix multiplications within LLMs using weight-only quantization, improving computational efficiency and reducing latency without significant performance compromise.
- Layer-wise Quantization Techniques: Approaches like GPTQ, AWQ, and OWQ offer more refined quantization strategies. GPTQ proposes a novel layer-wise technique for higher compression rates, while AWQ focuses on protecting a small percentage of salient weights to reduce quantization error. OWQ analyzes how activation outliers can amplify error in weight quantization and introduces a mixed-precision scheme to address this.
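For orientation, the sketch below shows the simplest weight-only scheme these methods build on: round-to-nearest int8 quantization with one scale per output channel. GPTQ, AWQ, and OWQ each add more on top (error compensation, salient-weight scaling, mixed precision), so this is a baseline illustration rather than any of those algorithms, and the matrix size is arbitrary.

```python
import torch

def quantize_weights_per_channel(w: torch.Tensor, n_bits: int = 8):
    """Round-to-nearest weight-only quantization with one scale per output channel (row)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().amax(dim=1, keepdim=True) / qmax          # per-row scale
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

# Example: quantize a projection matrix to int8 and measure the error.
W = torch.randn(4096, 4096)
q, s = quantize_weights_per_channel(W, n_bits=8)
err = (dequantize(q, s) - W).abs().mean()
print(f"storage: {q.numel()} bytes (int8) vs {W.numel() * 4} bytes (fp32), "
      f"mean abs error {err.item():.5f}")
```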
Weight and Activation Quantization
Some PTQ methods aim to quantize both weights and activations of LLMs. This approach can be more complex due to the presence of outliers, especially in activations. Techniques like ZeroQuant integrate hardware-friendly schemes and knowledge distillation for effective reduction in weight and activation precision with minimal accuracy impact. SmoothQuant, ZeroQuant-V2, and RPTQ address the challenges of quantizing activations by employing strategies like per-channel scaling, Low Rank Compensation (LoRC), and strategic channel arrangement.
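The per-channel scaling idea can be sketched concretely: compute a per-input-channel scale from activation and weight ranges and fold it into the weights, so the matrix product is unchanged while the activation ranges become easier to quantize. This follows the spirit of SmoothQuant's smoothing step only; the α value, the toy activation statistics, and the matrix layout are assumptions.

```python
import torch

def smooth_scales(act_absmax: torch.Tensor, w: torch.Tensor, alpha: float = 0.5):
    """Compute per-input-channel scales s that migrate activation outliers into the
    weights, keeping Y = (X / s) @ (diag(s) @ W) mathematically identical to X @ W.
    `w` is assumed to have layout (in_features, out_features)."""
    w_absmax = w.abs().amax(dim=1)                       # per-input-channel weight range
    s = act_absmax.pow(alpha) / w_absmax.pow(1 - alpha)
    return s.clamp(min=1e-5)

# Example with toy activation statistics (collected on calibration data in practice).
W = torch.randn(4096, 11008)
act_absmax = torch.rand(4096) * 50                       # a few channels with large outliers
s = smooth_scales(act_absmax, W, alpha=0.5)

X = torch.randn(8, 4096)
Y_ref = X @ W
Y_smoothed = (X / s) @ (W * s.unsqueeze(1))              # same output, smoother act. ranges
print(torch.allclose(Y_ref, Y_smoothed, atol=1e-3))      # True
```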
Quantization, both QAT and PTQ, is a key method in reducing the computational and storage demands of LLMs, enabling their deployment in more resource-constrained environments. While it introduces some level of precision loss, careful application of quantization techniques can lead to substantial model compression with minimal impact on accuracy.
Low-Rank Factorization approximates weight matrices by decomposing them into smaller matrices with lower dimensions. TensorGPT, for example, reduces space complexity by storing embeddings in a low-rank tensor format, significantly compressing the embedding layer.
The "Low-Rank Factorization" section in the survey on model compression for Large Language Models (LLMs) discusses a specific technique aimed at reducing the size and computational demands of these models. Low-Rank Factorization is a method that approximates a large weight matrix by decomposing it into two or more smaller matrices with significantly lower dimensions. This approach essentially simplifies the model by reducing the number of parameters and, consequently, the computational overhead.
Concept of Low-Rank Factorization
The core idea behind low-rank factorization is to represent a large weight matrix W as the product of two smaller matrices U and V. Here, W ≈ UV, where U is an m×k matrix and V is a k×n matrix. The key is that k, the rank, is much smaller than m and n, the dimensions of the original matrix. This results in a substantial reduction in the number of parameters and the overall computational complexity.
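A minimal sketch of this factorization using truncated SVD: keep the top-k singular directions and fold the singular values into U. The rank and matrix sizes below are illustrative assumptions; a random matrix will not compress accurately, whereas trained weight matrices typically have faster-decaying spectra.

```python
import torch

def low_rank_factorize(W: torch.Tensor, rank: int):
    """Approximate W (m x n) as U_k @ V_k with U_k (m x k) and V_k (k x n) via truncated SVD."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    U_k = U[:, :rank] * S[:rank]          # fold singular values into U
    V_k = Vh[:rank, :]
    return U_k, V_k

# Example: a 4096 x 4096 matrix factorized at rank 256 stores ~8x fewer parameters.
W = torch.randn(4096, 4096)
U_k, V_k = low_rank_factorize(W, rank=256)
params_before = W.numel()
params_after = U_k.numel() + V_k.numel()
print(params_before / params_after)                # ~8.0
print((W - U_k @ V_k).norm() / W.norm())           # relative approximation error
```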
Application in LLMs
In the context of LLMs, low-rank factorization has been widely adopted for efficient fine-tuning. The technique is particularly useful for compressing models without significantly compromising their performance. An exemplary application in this domain is:
- TensorGPT: A notable implementation of low-rank factorization in LLMs. TensorGPT specifically targets the embedding layer of LLMs, which can be quite large and resource-intensive. By applying the Tensor-Train Decomposition (TTD), TensorGPT efficiently compresses the embedding layer, reducing the space complexity significantly. Each token embedding is treated as a Matrix Product State (MPS), allowing the embedding layer to be compressed by a factor of up to 38.40 times. Remarkably, this compression does not just maintain but can even improve the model’s performance compared to the original LLM.
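To make the Tensor-Train idea concrete, here is a toy TT-SVD sketch that reshapes a single embedding vector into a small tensor and factorizes it into a chain of low-rank cores (an MPS). The reshape dimensions and maximum rank are arbitrary illustrative choices, and TensorGPT's actual procedure and rank selection follow the paper; note that a random vector is nearly incompressible, so the printed error here is large, unlike for structured embedding layers.

```python
import torch

def tt_decompose(vec: torch.Tensor, dims, max_rank: int):
    """Tensor-Train (MPS) decomposition of a vector reshaped to `dims`,
    built by successive truncated SVDs (classical TT-SVD sketch)."""
    cores, t, r_prev = [], vec.reshape(dims), 1
    for i in range(len(dims) - 1):
        t = t.reshape(r_prev * dims[i], -1)
        U, S, Vh = torch.linalg.svd(t, full_matrices=False)
        r = min(max_rank, S.numel())
        cores.append(U[:, :r].reshape(r_prev, dims[i], r))   # TT core for mode i
        t = S[:r, None] * Vh[:r, :]                          # carry the remainder forward
        r_prev = r
    cores.append(t.reshape(r_prev, dims[-1], 1))             # last core
    return cores

def tt_reconstruct(cores):
    out = cores[0]
    for core in cores[1:]:
        # contract the trailing rank index of `out` with the leading rank index of `core`
        out = torch.tensordot(out, core, dims=([out.dim() - 1], [0]))
    return out.reshape(-1)

# Example: compress one 4096-d token embedding into a chain of small cores.
emb = torch.randn(4096)
cores = tt_decompose(emb, dims=(8, 8, 8, 8), max_rank=4)
approx = tt_reconstruct(cores)
print(sum(c.numel() for c in cores), (approx - emb).norm() / emb.norm())
```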
Significance
The approach of low-rank factorization is crucial in the realm of LLMs for several reasons:
- Reduction in Model Size: It enables the compression of LLMs, making them more feasible for deployment in environments with limited storage and processing capabilities.
- Preservation of Performance: Despite the reduction in size, low-rank factorization can maintain or even enhance the performance of the model, which is vital for practical applications of LLMs.
- Efficiency in Fine-Tuning: The technique is particularly beneficial for fine-tuning LLMs, as it allows for modifications to the model without the need for extensive retraining or additional resources.
In conclusion, low-rank factorization presents a promising avenue for model compression in LLMs, offering a balance between efficiency and performance. It exemplifies how advanced mathematical techniques can be leveraged to address the practical challenges of deploying large-scale AI models.
Experiment Results
The effectiveness of model compression techniques is evaluated using metrics like the number of parameters, model size, compression ratio, inference time, and FLOPs. Benchmarks and datasets are employed to compare the performance of compressed LLMs with their uncompressed counterparts. While significant advancements have been made, there remains a performance gap between compressed and uncompressed LLMs.
Conclusion
This survey presents a detailed exploration of model compression techniques for LLMs, covering methods, metrics, and benchmarks. It emphasizes the need for advanced research in this area to unlock the full potential of LLMs across various applications, providing valuable insights for ongoing exploration.