Parallelizing Training and Fine-Tuning for Large Language Models: Leveraging Distributed GPU Clusters for Llama 2 and PEFT QLoRA with DeepSpeed

YOUNESS-ELBRAG
10 min read · Mar 25, 2024


Accelerating NLP Tasks with Advanced Tools: Fine-Tuning Llama 2 on the Alpaca Datasets Using PEFT and QLoRA

Introduction

In this blog post, we will explore the remarkable process of fine-tuning massive models like Falcon 180B using a combination of cutting-edge technologies, including Hugging Face’s PEFT, DeepSpeed ZeRO-3, Flash Attention, and Gradient Checkpointing. By harnessing the power of just 16 A100 80GB GPUs, a mere fraction of the 1024 GPUs typically employed for such tasks, we will demonstrate how to achieve exceptional results with significantly reduced resources.

The most compelling aspect of this approach is that the resulting model not only consumes fewer resources but also outperforms the official Llama-7B base and chat models on the OpenLLM Leaderboard by an impressive 3%. This means that not only are we saving on computational power, but we're also delivering superior performance in the process.

As if that weren’t enough, the cost of training this model is estimated to be just $864 (36 hrs * $24/hr), a mere fraction of what it would typically take to fine-tune the chat version of Llama-7B. By following the steps outlined in this blog post, you’ll learn how to achieve this remarkable feat, unlocking new possibilities for large-scale model fine-tuning while keeping resource consumption and costs in check. Let’s dive in and discover how to make the most of these powerful tools and techniques.

In the sophisticated realm of natural language processing (NLP), fine-tuning question-answering models has emerged as a crucial pursuit for both enthusiasts and professionals. The journey commences with the use of datasets that serve as the foundation for training these language models.

In this tutorial, we will embark on an exploration to fine-tune Llama2, a state-of-the-art Foundational Large Language model developed by Meta. Llama2 distinguishes itself as an open-source solution, enabling users to leverage its capabilities locally. Moreover, Llama2 showcases remarkable question-answering abilities, making it a versatile tool in the NLP landscape.

In the dynamic field of natural language processing (NLP), fine-tuning large language models for specific tasks has become an essential endeavor for both researchers and practitioners. By leveraging high-quality datasets, the capabilities of these models can be honed to achieve remarkable performance in various applications. In this tutorial, we will focus on the successful fine-tuning of Llama2-7B, a powerful language model, using two distinct datasets: Alpaca and Alpaca Spanish.

The Alpaca dataset serves as a valuable resource for training language models to generate informative and coherent responses. By fine-tuning Llama2-7B on this dataset, we can enhance its ability to understand and process complex queries, making it more proficient in handling a wide range of NLP tasks.

Moreover, we will also explore the fine-tuning process using the Alpaca Spanish dataset. This dataset provides an opportunity to evaluate and improve Llama2-7B’s performance in a multilingual context, ensuring that the model can effectively process and generate responses in languages other than English.
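As a point of reference, here is a minimal sketch of how such datasets can be pulled from the Hugging Face Hub with the datasets library. The dataset IDs used here (tatsu-lab/alpaca for the English data and bertin-project/alpaca-spanish for the Spanish translation) are assumptions for illustration and should be replaced with the datasets you actually train on.

from datasets import load_dataset

# English Alpaca instruction data (assumed dataset ID)
alpaca_en = load_dataset("tatsu-lab/alpaca", split="train")

# Spanish Alpaca translation (assumed dataset ID; swap in the mirror you use)
alpaca_es = load_dataset("bertin-project/alpaca-spanish", split="train")

# Each record is an instruction/input/output triple used to build the prompt
print(alpaca_en[0])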

Throughout this tutorial, we will delve into the intricacies of fine-tuning Llama2-7B on both the Alpaca and Alpaca Spanish datasets. We will discuss the significance of these datasets in the context of NLP, the steps involved in the fine-tuning process, and the benefits of using Llama2-7B for various applications, including the importance of multilingual capabilities in language models.

By the end of this article, readers should not only gain a comprehensive understanding of fine-tuning Llama2 but also appreciate the broader landscape of NLP tools such as Ludwig, QLoRA, and DeepSpeed. This knowledge empowers them to apply these insights to their specific use cases, effectively leveraging the power of Llama2 for a myriad of applications in natural language processing.

Background

Llama2: A Groundbreaking Advancement in Large Language Models

Llama2 represents a significant breakthrough in the realm of large language models (LLMs), establishing a new benchmark for dialogue optimization. Developed through a collaborative effort led by Hugo Touvron, Louis Martin, and an extensive team of experts, Llama2 stands out as a collection of pretrained and fine-tuned LLMs, ranging from 7 billion to an impressive 70 billion parameters. The focus of this innovative model, aptly named Llama 2-Chat, lies in its unparalleled performance in dialogue use cases, surpassing open-source chat models across various benchmarks.

The paper’s abstract emphasizes the model’s excellence, showcasing its potential as a viable substitute for closed-source alternatives, particularly in terms of helpfulness and safety. Llama2’s superiority extends to its fine-tuning and safety improvements, detailed comprehensively in the release, inviting the community to build upon this work and contribute responsibly to the development of large language models. Arthur Zucker, with contributions from Lysandre Debut, has made this model accessible through Hugging Face, utilizing the GPT-NeoX framework. Llama2 marks a significant leap forward, empowering NLP enthusiasts to harness its capabilities for transformative applications in dialogue-based scenarios. Explore the Llama2 model checkpoints and delve into the future of chat models in the world of natural language processing.

Introduction to DeepSpeed Zero Redundancy Optimizer (ZeRO)

DeepSpeed Zero Redundancy Optimizer (ZeRO) is an innovative approach that enables efficient distribution of deep learning model training across multiple devices. By sharding optimizer states, gradients, and parameters, DeepSpeed ZeRO significantly reduces memory redundancy and allows for training large-scale models with unprecedented efficiency.

In traditional data-parallel training, each device maintains a full copy of the model parameters, gradients, and optimizer states, leading to substantial memory consumption. DeepSpeed ZeRO addresses this challenge by strategically partitioning these components across devices, resulting in substantial memory savings. This optimization enables scaling of training processes to accommodate large language models (LLMs) and facilitates the use of higher batch sizes, ultimately leading to faster convergence and improved model performance.

By leveraging DeepSpeed ZeRO, researchers and practitioners can push the boundaries of model size and complexity, unlocking new possibilities in various applications of natural language processing, computer vision, and other AI domains. The ZeRO stages and offload variants are summarized below, followed by a short configuration sketch.

Stage 1: Shards optimizer states across data-parallel workers/GPUs.

Stage 2: Shards optimizer states + gradients across data-parallel workers/GPUs.

Stage 3: Shards optimizer states + gradients + model parameters across data-parallel workers/GPUs.

Optimizer Offload: Offloads the gradients + optimizer states to CPU/disk, building on top of ZeRO Stage 2.

Param Offload: Offloads the model parameters to CPU/disk, building on top of ZeRO Stage 3.
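To make the stages above concrete, here is a minimal sketch of selecting a ZeRO stage and the offload options when driving the DeepSpeed engine directly. The checkpoint name, learning rate, and batch settings are placeholders for illustration, not the configuration used later in this post.

import deepspeed
from transformers import AutoModelForCausalLM

# Illustrative ZeRO config: Stage 3 shards optimizer states, gradients and
# parameters; the offload blocks correspond to the offload variants above.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "optimizer": {"type": "AdamW", "params": {"lr": 2e-5}},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu"},
        "offload_param": {"device": "cpu"},
    },
}

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # assumed checkpoint

# DeepSpeed wraps the model, builds the sharded optimizer, and handles
# gradient accumulation according to the config above.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)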

Parameter Efficient Fine-Tuning (PEFT) and Low Rank Adaptation (LoRA)

As models get larger and larger, full fine-tuning becomes infeasible to train on consumer hardware. In addition, storing and deploying fine-tuned models independently for each downstream task becomes very expensive, because fine-tuned models are the same size as the original pretrained model.

Parameter-Efficient Fine-Tuning (PEFT) is a technique that allows us to fine-tune a large pretrained model on a specific downstream task while requiring significantly fewer parameters than full fine-tuning. The goal is to achieve comparable or even better performance than full fine-tuning, while requiring less computation and memory resources.

PEFT approaches only fine-tune a small number of (extra) model parameters while freezing most parameters of the pretrained LLM, thereby greatly decreasing the computational and storage costs. This also mitigates catastrophic forgetting, a behaviour observed during full fine-tuning of LLMs. PEFT approaches have also been shown to be better than full fine-tuning in low-data regimes and to generalize better to out-of-domain scenarios. They can be applied to various modalities, e.g., image classification and Stable Diffusion DreamBooth.

PEFT also helps with portability: users can tune models with PEFT methods to get tiny checkpoints worth a few MBs, compared to the large checkpoints of full fine-tuning. For example, bigscience/mt0-xxl takes up 40GB of storage, so full fine-tuning would produce a 40GB checkpoint for each downstream dataset, whereas with PEFT methods each downstream dataset needs only a few MBs while achieving comparable performance. The small trained weights from PEFT approaches are added on top of the pretrained LLM, so the same base LLM can serve multiple tasks by swapping in small adapter weights instead of replacing the entire model.
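A hedged sketch of what this portability looks like in code, assuming a LoRA adapter has already been trained and saved; the adapter paths and the checkpoint name are placeholders.

from peft import PeftModel
from transformers import AutoModelForCausalLM

# One shared base model...
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # assumed checkpoint

# ...plus a few-MB adapter checkpoint per downstream task (hypothetical path).
qa_model = PeftModel.from_pretrained(base_model, "adapters/alpaca-lora")

# For another task, only the small adapter changes, not the full base weights:
# es_model = PeftModel.from_pretrained(base_model, "adapters/alpaca-spanish-lora")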

🤗 PEFT library provides the latest Parameter-Efficient Fine-tuning techniques seamlessly integrated with 🤗 Transformers and 🤗 Accelerate. This enables the use of the most popular and performant models from Transformers coupled with the simplicity and scalability of Accelerate.

LoRA is a PEFT method that employs a memory-efficient reparametrization trick: small additional trainable matrices are attached to target modules (usually the query and value projection layers in transformer attention blocks), drastically reducing the number of trainable parameters. One of the nice features of LoRA is that it adds no latency during inference, because the additional trained weights can be merged back into the original weights. The method achieves performance comparable to full fine-tuning, which has made its usage widespread across the community.
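Below is a minimal sketch of a LoRA setup with the 🤗 PEFT library, combined with 4-bit quantization via bitsandbytes (the QLoRA recipe referenced in the title). The rank, alpha, dropout, and target modules are illustrative defaults rather than tuned values, and the checkpoint name is an assumption.

import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# QLoRA: load the frozen base weights in 4-bit, keep the LoRA adapters in higher precision.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # assumed checkpoint
    quantization_config=bnb_config,
)
model = prepare_model_for_kbit_training(model)

# Small trainable adapters on the attention query/value projections.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters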

Flash Attention

Flash Attention and gradient checkpointing are used to speed up training and reduce VRAM usage, which makes fine-tuning feasible and saves compute costs. The codebase currently uses monkey patching; the implementation is at DHS-LLM-Workshop/chat_assistant/training/falcon_flash_attn_monkey_patch.py

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness introduces a way to compute exact attention while being faster and memory-efficient by leveraging the knowledge of the memory hierarchy of the underlying hardware/GPUs — The higher the bandwidth/speed of the memory, the smaller its capacity as it becomes more expensive.

If we follow the blog Making Deep Learning Go Brrrr From First Principles, we can figure out that the attention module on current hardware is memory-bound/bandwidth-bound. The reason is that attention mostly consists of elementwise operations: masking, softmax, and dropout take up the bulk of the time, even though the matrix multiplications account for the bulk of the FLOPs.

This is precisely the problem that Flash Attention addresses. The idea is to remove redundant HBM reads/writes: keep everything in SRAM, perform all the intermediate steps there, and only then write the final result back to HBM, a technique also known as kernel fusion. This is how Flash Attention overcomes the memory-bound bottleneck.

Tiling is used during the forward and backward passes to chunk the NxN softmax/score computation into blocks that fit within the limited SRAM; to enable tiling, the online softmax algorithm is used. Recomputation is used during the backward pass to avoid storing the entire NxN softmax/score matrix from the forward pass. Together these greatly reduce memory consumption.

For a simplified yet in-depth understanding of Flash Attention, please refer to the blog posts ELI5: FlashAttention and Making Deep Learning Go Brrrr From First Principles, along with the original paper FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.
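As an alternative to the monkey patch mentioned above, recent versions of 🤗 Transformers expose Flash Attention 2 and gradient checkpointing directly. A minimal sketch, assuming a flash-attn installation, an Ampere-or-newer GPU, and the checkpoint name as a placeholder:

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",               # assumed checkpoint
    torch_dtype=torch.bfloat16,               # Flash Attention requires fp16/bf16
    attn_implementation="flash_attention_2",  # fused kernel; no NxN score matrix in HBM
)

# Trade compute for memory: recompute activations during the backward pass.
model.gradient_checkpointing_enable()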

Hardware and Environment

Number of nodes: 2
Number of GPUs per node: 8
GPU type: A100
GPU memory: 80GB
CPU cores per node: 96

Fine-Tuning

Below is the command showcasing how to use the Ray launcher to run the training. Run the script with the desired command-line arguments:

python finetune.py --num-workers 4 --use-cpu --no-deepspeed --model tiiuae/falcon-7b

Replace finetune.py with the name of your Python script, set --num-workers to the desired number of workers for training, and use the --use-cpu flag to enable CPU training if needed. You can also specify other optional arguments (a hedged sketch of how these flags might be wired into the Ray launcher follows the list):

--num-workers: Sets the number of workers for training (default is 2).
--use-cpu: Enables CPU training.
--no-deepspeed: Disables DeepSpeed strategy.
--model: Specifies the model from Hugging Face to use (default is "meta-llama/Llama-2-7b").

Monitor the training progress and metrics logged to MLflow during the training process.
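Here is a minimal sketch of how the flags above might be wired into Ray Train's TorchTrainer. The body of train_loop_per_worker is elided; it stands in for the actual training logic in finetune.py (model loading, LoRA/DeepSpeed setup, and MLflow logging), and the config keys are illustrative.

import argparse
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker(config: dict):
    # Per-worker training logic from finetune.py goes here:
    # load config["model"], apply LoRA/DeepSpeed if enabled, train, log to MLflow.
    ...

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--num-workers", type=int, default=2)
    parser.add_argument("--use-cpu", action="store_true")
    parser.add_argument("--no-deepspeed", action="store_true")
    parser.add_argument("--model", type=str, default="meta-llama/Llama-2-7b")
    args = parser.parse_args()

    trainer = TorchTrainer(
        train_loop_per_worker,
        train_loop_config={"model": args.model, "use_deepspeed": not args.no_deepspeed},
        scaling_config=ScalingConfig(num_workers=args.num_workers, use_gpu=not args.use_cpu),
    )
    result = trainer.fit()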

The DeepSpeed config is available at DHS-LLM-Workshop/chat_assistant/training/configs/deepspeed_config.yaml and also given below:

{
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": "auto",
      "warmup_max_lr": "auto",
      "warmup_num_steps": "auto",
      "warmup_type": "linear"
    }
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": false
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": false
    },
    "overlap_comm": false,
    "contiguous_gradients": true,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "sub_group_size": 1e9,
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": "auto"
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "steps_per_print": 10,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}
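The many "auto" entries above are resolved by the 🤗 Trainer from its own TrainingArguments at launch time. A minimal sketch of that hand-off, where the file name ds_config.json and the hyperparameter values are placeholders:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="outputs",             # hypothetical output directory
    per_device_train_batch_size=1,    # fills train_micro_batch_size_per_gpu: "auto"
    gradient_accumulation_steps=8,    # fills gradient_accumulation_steps: "auto"
    learning_rate=2e-5,               # fills the optimizer lr: "auto"
    weight_decay=0.0,                 # fills weight_decay: "auto"
    fp16=True,                        # matches fp16.enabled: "auto"
    deepspeed="ds_config.json",       # path to the config above (placeholder name)
)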

The full code-repo implementation using DeepSpeed is available on GitHub:

GitHub Repo: https://github.com/younesselbrag/GPUs-DeepSpeed-Distrub-LLM-

Conclusion

We successfully fine-tuned the Llama2-7B model using LoRA and DeepSpeed in a multi-node, multi-GPU setting. We went over a brief overview of DeepSpeed, PEFT methods, and Flash Attention, followed by a description of the datasets used for fine-tuning, the fine-tuning codebase, and the script-launching command with the related hyperparameters. Fine-tuning LLMs using PEFT and DeepSpeed is therefore a good alternative to full fine-tuning in computationally resource-constrained scenarios.

References

DeepSpeed Multi-GPU Training for LLM Clusters

Hugging Face model: Llama-2-7b

LoRA: Low-Rank Adaptation of Large Language Models

LLaMA: Open and Efficient Foundation Language Models


YOUNESS-ELBRAG

Machine Learning Engineer || AI Architect @AIGOT. I explore advanced topics in AI, especially geometric deep learning.