DeepSpeed: a tuning tool for large language models

Large Language Models (LLMs) have the potential to automate and reduce the workloads of many types, including those of cybersecurity analysts and incident responders. But generic LLMs lack the domain-specific knowledge to handle these tasks well. While they may have been built with training data that included some cybersecurity-related resources, that is often insufficient for taking on more specialized tasks that require more up to date and, in some cases, proprietary knowledge to perform well—knowledge not available to the LLMs when they were trained.

There are several existing solutions for tuning “stock” (unmodified) LLMs for specific types of tasks. But unfortunately, these solutions were insufficient for the types of applications of LLMs that Sophos X-Ops is attempting to implement. For that reason, SophosAI has assembled a framework that utilizes DeepSpeed, a library developed by Microsoft that can be used to train and tune the inference of a model with (in theory) trillions of parameters by scaling up the compute power and number of graphics processing units (GPUs) used during training. The framework is open source licensed and can be found in our GitHub repository.

While many of the parts of the framework are not novel and leverage existing open-source libraries, SophosAI has synthesized several of the key components for ease of use. And we continue to work on improving the performance of the framework.

The (inadequate) alternatives

There are several existing approaches to adapting stock LLMs to domain-specific knowledge. Each of them has its own advantages and limitations.

Approach	Techniques applied	Limitations
Retrieval Augmented Generation	Knowledge base required for task is “chunked,” embedded, and stored in a vector database. The information chunk most relevant to task is passed to stock model along with the information to be analyzed.	Maintaining the infrastructure for model serving and the vector database is not trivial. Chunking is not perfect, text with the same logical idea may be chunked into separate pieces. The model will return an answer very similar to the information retrieved—it will not have a wider, domain specific context that might allow it to reason and connect between ideas and topics. It can only be applied in information-based tasks and not in knowledge-based tasks.
Continued Training	A stock LLM is trained to predict the next token on domain specific data. Data can be unformatted (continued pre-training) or formatted as a set of instructions, such as questions and answers (instruction fine-tuning).	Requires extensive GPU hardware
Parameter Efficient Fine-tuning	A subset of continued training that performs fine-tuning on only a subset of the model’s parameters. Tuning can be performed on several or even a single consumer-grade GPU.	“Superficial alignment hypothesis”: A model’s capabilities and knowledge are imbued almost entirely during pre-training and subsequent fine-tuning will at most align the model output format and style to the user’s preferences. This means that the farther away a domain is from the LLM’s pretraining data, the less of an effect fine-tuning, and especially parameter efficient fine-tuning, will have.

To be fully effective, a domain expert LLM requires pre-training of all its parameters to learn the proprietary knowledge of a company. That undertaking can be resource intensive and time consuming—which is why we turned to DeepSpeed for our training framework, which we implemented in Python. The version of the framework that we are releasing as open source can be run in the Amazon Web Services SageMaker machine learning service, but it could be adapted to other environments.

Training frameworks (including DeepSpeed) allow you to scale up large model training tasks through parallelism. There are three main types of parallelism: data, tensor, and pipeline.

Figure 1: an illustration of the three main types of model training parallelism.

In data parallelism, each process working on the training task (essentially each graphics processor unit, or GPU) receives a copy of the full model’s weights but only a subset of the data, called a minibatch. After the forward pass through the data (to calculate loss , or the amount of inaccuracy in the parameters of the model being used for training) and the backward pass (to calculate the gradient of the loss) are completed, the resulting gradients are synchronized.

In Tensor parallelism, each layer of the model being used for training is split across the available processes. Each process computes a portion of the layer ‘s operation using the full training data set. The partial outputs from each of these layers are synchronized across processes to create a single output matrix.

Pipeline parallelism splits up the model differently. Instead of parallelizing by splitting layers of the model, each layer of the model receives its own process. The minibatches of data are divided into micro-batches and that are sent down the “pipeline” sequentially. Once a process finishes a micro-batch, it receives a new one. This method may experience “bubbles” where a process is idling, waiting for the output of processes hosting earlier model layers.

These three parallelism techniques can also be combined in several ways—and are, in the DeepSpeed training library.

Doing it with DeepSpeed

DeepSpeed performs sharded data parallelism. Every model layer is split such that each process gets a slice, and each process is given a separate mini batch as input. During the forward pass, each process shares its slice of the layer with the other processes. At the end of this communication, each process now has a copy of the full model layer.

Each process computes the layer output for its mini batch. After the process finishes computation for the given layer and its mini batch, the process discards the parts of the layer it was not originally holding.

The backwards pass through the training data is done in a similar fashion. As with data parallelism, the gradients are accumulated at the end of the backwards pass and synchronized across processes.

Training processes are more constrained in their performance by memory than processing power—and bringing on more GPUs with additional memory to handle a batch that is too large for the GPU’s own memory can cause significant performance cost because of the communication speed between GPUs, as well as the cost of using more processors than would otherwise be required to run the process. One of the key elements of the DeepSpeed library is its Zero Redundancy Optimizer (ZeRO), a set of memory utilization techniques that can efficiently parallelize very large language model training. ZeRO can reduce the memory consumption of each GPU by partitioning the model states (optimizers, gradients, and parameters) across parallelized data processes instead of duplicating them across each process.

The trick is finding the right combination of training approaches and optimizations for your computational budget. There are three selectable levels of partitioning in ZeRO:

ZeRO Stage 1 shards the optimizer state across.

Stage 2 shards the optimizer + the gradients.

Stage 3 shards the optimizer + the gradients + the model weights.

Each stage has its own relative benefits. ZeRO Stage 1 will be faster, for example, but will require more memory than Stage 2 or 3. There are two separate inference approaches within the DeepSpeed toolkit:

DeepSpeed Inference: inference engine with optimizations such as kernel injection; this has lower latency but requires more memory.

ZeRO Inference: allows for offloading parameters into CPU or NVMe memory during inference; this has higher latency but consumes less GPU memory.

Our Contributions

The Sophos AI team has put together a toolkit based on DeepSpeed that helps take some of the pain out of utilizing it. While the parts of the toolkit itself are not novel, what is new is the convenience of having several key components synthesized for ease of use.

At the time of its creation, this tool repository was the first to combine training and both DeepSpeed inference types (DeepSpeed Inference and ZeRO Inference) into one configurable script. It was also the first repository to create a custom container for running the latest DeepSpeed version on Amazon Web Service’s SageMaker. And it was the first repository to perform distributed script based DeepSpeed inference that was not run as an endpoint on SageMaker. The training methods currently supported include continued pre-training, supervised fine-tuning, and finally preference optimization.

The repository and its documentation can be found here on Sophos’ GitHub.

DeepSpeed: a tuning tool for large language models

The (inadequate) alternatives

Doing it with DeepSpeed

Our Contributions

Sean Bergeron

Younghoo Lee

Sean Gallagher

Read Similar Articles

What to expect when you’ve been hit with Avaddon ransomware

What’s New in Sophos EDR 4.0

Sophos XDR: Driven by data

The (inadequate) alternatives

Doing it with DeepSpeed

Our Contributions

Share this:

Sean Bergeron

Younghoo Lee

Sean Gallagher

Read Similar Articles

What to expect when you’ve been hit with Avaddon ransomware

What’s New in Sophos EDR 4.0

Sophos XDR: Driven by data