Apple's Efficient Inference Of Large Language Models On Devices With Limited Memory Capacity
3 main points
✔️ Propose a method to perform inference on large language models that exceed the memory (DRAM) available
✔️ Propose windowing and row-column bundling to quickly transfer to DRAM only the model parameters, stored in flash memory, that are needed for the current inference step
✔️ When only half of the model parameters of a large language model are in DRAM, the proposed method is 4-5 times faster on CPU and 20-25 times faster on GPU than the naive method
LLM in a flash: Efficient Large Language Model Inference with Limited Memory
written by Keivan Alizadeh, Iman Mirzadeh, Dmitry Belenko, Karen Khatamifard, Minsik Cho, Carlo C Del Mundo, Mohammad Rastegari, Mehrdad Farajtabar
(Submitted on 12 Dec 2023 (v1), last revised 4 Jan 2024 (this version, v2))
Comments: Published on arxiv.
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
The images used in this article are from the paper, the introductory slides, or were created based on them.
In recent years, large-scale language models (LLMs), represented by ChatGPT, have attracted a great deal of public attention.
Because LLM performance has been shown to improve as the number of model parameters increases, ever larger LLMs are being developed.
As LLMs grow larger, they must store more model parameter values and perform larger computations on them. An LLM therefore cannot run without a computer that has enough memory to serve as a "workshop" for storing and computing on those parameter values.
Most people cannot run a state-of-the-art, large-scale LLM on their own computer and can only use one as a cloud web service such as ChatGPT. Running an LLM directly on a small mobile device such as a smartphone is harder still.
Unlike cloud services, AI that runs on mobile devices such as smartphones, or so-called edge devices, is called edge AI, and is expected to improve processing speed and reduce the risk of leaks of personal and confidential information by enabling processing without the need for a network.
However, as already mentioned, the trend toward larger LLMs has not stopped, and it has become difficult to put high-performance LLMs on edge devices and make them work.
In the paper covered in this article, researchers at Apple, famous for the iPhone, propose a technique that enables LLMs to run efficiently on devices with limited memory. *Note that this is not a technique for training on the device, but one for efficiently performing inference with already-learned model parameter values.
Google has announced that it will implement Gemini, an LLM, in the Pixel 8 Pro on December 7, 2023. Google appears to be putting an LLM with a reduced number of model parameters on the smartphone so that it fits in the device's memory to begin with, whereas the technique in Apple's paper aims to efficiently run LLMs that do not fit in the device's memory. An iPhone with an LLM implemented using this technique may be sold in the future.
Let's take a look at the effects and how they work.
Comparison of inference speed between naive and proposed methods
First, we show the effectiveness of the proposed method.
Figure 1 compares the inference speed of the naive and proposed methods. This is a comparison of the proposed method with a naive LLM model loaded from flash memory.
Here, the terms memory and flash memory may be confusing, but they refer to different concepts.
Memory is what is commonly referred to as PC memory. Metaphorically, the concept corresponds to a "workspace," which in this paper refers to a storage device called a DRAM.
On a PC, any data to be processed must first be loaded into memory, which serves as the work area. When the PC is turned off, the data in memory is lost.
Therefore, data that is to be retained must be stored on storage devices such as SSDs or SD cards. Metaphorically, this concept corresponds to a "warehouse."
Photos and other data taken with a phone are kept in this storage, and the core storage medium of the SSDs and SD cards commonly used for it is called flash memory. When the device starts up, the learned model parameter values are likewise sitting in this storage.
Thus, in this article "memory" means DRAM (the workspace) and "flash memory" means storage (the warehouse).
Returning to the comparison of inference speed, the paper compares the inference speed of LLMs under the constraint that the available memory is only half the size of the LLM. In other words, it considers the case where the entire LLM stored in flash memory cannot be loaded into memory.
The baseline (Naive in the figure) appears to assume that half of the LLM is already in DRAM and the other half is read from flash memory for computation. This is hard to follow because the assumptions do not line up exactly with those of the proposed method, but although the baseline itself seems to be set up so that the entire LLM could be placed in DRAM, half of the model apparently has to be loaded from flash memory in order to infer one token (the unit in which an LLM processes text; think of it as breaking text into words and processing them one by one).
As shown in the figure, the LLMs evaluated are Falcon 7B and OPT 6.7B, which have about 7 billion model parameters and are considered to exhibit the sparsity described below.
In the figure, Compute is the computation time, Load From Flash is the time required to load model parameters from flash memory to DRAM, and Memory Management is the time required for memory management. For both models, the time to load model parameters from flash memory is the inference bottleneck, as can be seen from the percentage of baseline inference time.
The proposed method achieves a speedup on both models by loading model parameters from flash memory efficiently. In particular, for OPT 6.7B on a GPU, the naive method takes more than 2 seconds to infer (generate) one token, while the proposed method brings this down to less than 0.1 second. In other words, it is more than 20 times faster.
Relationship between flash memory and DRAM storage capacity, transfer rate, and LLM model size
Earlier, we explained that memory (DRAM) is the workspace where the computation actually happens, while flash memory is the warehouse that holds the data, and that time is lost on the round trips between the two. So how exactly do the characteristics of DRAM and flash memory differ? The key differences are storage capacity and transfer speed. Figure 2 compares the transfer rate (bandwidth) of DRAM and flash memory.
Flash memory has a storage capacity of about 100 GB, while DRAM has a storage capacity of about 10 GB. Thus, in general, flash memory has a larger storage capacity than the memory (DRAM) available to CPUs and GPUs, and can store about 10 times more model parameters.
If so, why not just work directly from flash memory? The reason is that flash memory is slower than DRAM. The bandwidth (data transfer rate) is about 1 GB/s for flash memory and about 100 GB/s for DRAM. In other words, flash memory has about 10 times the storage capacity of DRAM, but only about 1/100th of the transfer rate. Compared to the round trip between the CPU and DRAM, where calculations are performed, the round trip between the CPU and flash memory has a significant impact on inference time.
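To get a feel for what this bandwidth gap means, here is a rough back-of-the-envelope calculation. It uses the approximate 1 GB/s and 100 GB/s figures quoted above; the parameter count of 7 billion and the 2 bytes per parameter are illustrative assumptions, and latency is ignored, so the numbers indicate orders of magnitude only and are not meant to reproduce the measurements in Figure 1.

```python
# Rough transfer-time estimate from the approximate bandwidths quoted in the text.
FLASH_BANDWIDTH_GB_S = 1.0    # approximate flash memory bandwidth
DRAM_BANDWIDTH_GB_S = 100.0   # approximate DRAM bandwidth

# Assumptions for illustration only: ~7 billion parameters, 2 bytes each.
NUM_PARAMS = 7e9
BYTES_PER_PARAM = 2


def transfer_seconds(size_gb: float, bandwidth_gb_s: float) -> float:
    """Time to move size_gb gigabytes at a sustained bandwidth, ignoring latency."""
    return size_gb / bandwidth_gb_s


model_gb = NUM_PARAMS * BYTES_PER_PARAM / 1e9   # total model size in GB
half_model_gb = model_gb / 2                    # the half that does not fit in DRAM

flash_s = transfer_seconds(half_model_gb, FLASH_BANDWIDTH_GB_S)
dram_s = transfer_seconds(half_model_gb, DRAM_BANDWIDTH_GB_S)

print(f"model size:                {model_gb:.1f} GB")
print(f"half the model from flash: {flash_s:.2f} s")
print(f"half the model from DRAM:  {dram_s:.2f} s")
```

Under these assumptions, moving the same amount of data takes about 100 times longer from flash memory than from DRAM, which is why reducing the volume read from flash matters so much.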
If you have a DRAM with storage capacity greater than the LLM model size, you can simply load the entire learned model parameters stored in flash memory into the DRAM only once for inference.
Although it takes time to load the entire model from flash memory, subsequent inference can be computed with the model already loaded in DRAM. Furthermore, since LLMs are computationally intensive, they are often run on GPUs, which are capable of massively parallel computation, rather than on CPUs, and the necessary model parameters are transferred to the GPU's memory for computation.
In contrast, what should be done if there is no DRAM with storage capacity greater than the LLM model size? This is the question this paper addresses.
Structure of the proposed method
The method proposed in this paper, LLM in a flash, rests on two main ideas.
Point 1 is to reduce the amount of data transferred. Specifically, the model parameters are not loaded from flash memory into DRAM all at once. To transfer only the model parameters that are truly necessary for inferring the current token, the paper proposes two things: exploiting sparsity, by predicting the small subset of model parameters that will actually contribute and transferring only those, and windowing, which slides a window over the input sequence and transfers only the difference between the model parameters required before and after the window slides.
Point 2 is to optimize the chunk size (the amount of data transferred at once) when reading from flash memory. Flash memory tends to deliver higher read throughput (data processed per unit time) as the chunk size grows. To increase throughput and relieve the bottleneck of reading model parameters from flash memory, the paper proposes row-column bundling to increase the chunk size.
Point 1: Reduce data transfer (leveraging sparsity and windowing)
The LLMs are based on the transformer, and each transformer layer has an attention layer and a feed-forward network. In this paper, the model parameters of the attention layers are always kept in memory. They account for about 1/3 of the total model parameters. To reduce data transfer, the paper focuses on the model parameters of the feed-forward networks, which make up the remaining 2/3.
The feed-forward networks of OPT and Falcon, the LLMs under study, are very sparse in the sense that most of their intermediate neuron outputs are zero: 97% for OPT and 95% for Falcon. A neuron whose output is zero does not contribute to the calculation, so the only model parameters that are really needed are those belonging to neurons with non-zero output. The paper therefore proposes windowing, which dynamically reads only these model parameters as needed. Checking directly which neurons are zero would require reading their parameters into DRAM first, so a predictive model is used to predict which neurons will be zero. This appears to be an existing idea: if the output of the attention layer of the current layer is known, it is possible to predict whether each output of the feed-forward network that follows it will be zero. Keeping the attention-layer model parameters in memory at all times seems to be a setting adopted partly to make this prediction possible.
One might worry whether the predictor can really tell in advance which values will be zero. The paper evaluates the accuracy of the OPT 6.7B model with and without the predictor on three zero-shot tasks (evaluation tasks without fine-tuning the LLM), and the results show that accuracy is almost the same in both cases.
A conceptual diagram of windowing is shown in Figure 3.
An LLM processes the sequence of words entered by the user, as shown in the figure. When processing such an input sequence, the paper sets up a sliding window and reads in the model parameters related to the words inside the window for inference. In the case of OPT 6.7B, the window size is 5, and only the model parameters of neurons predicted to be non-zero are read.
We now look more closely at active neurons. In this paper, an active neuron is one whose output is positive in a given layer of the feed-forward network for a given input token. Since the output of an inactive neuron is zero and therefore does not affect the calculation, only the active neurons need to be read.
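The claim that inactive neurons can be skipped can be checked on a toy example. The sketch below assumes a feed-forward block with a ReLU activation (an assumption for illustration; the weights and input are made up) and shows that computing only the active neurons gives the same output as computing all of them. Here the active set is found by an oracle that looks at every weight; in the actual method this is the job of the predictor, precisely so that the inactive weights never have to be read.

```python
def relu(x: float) -> float:
    return x if x > 0.0 else 0.0


def dot(a, b) -> float:
    return sum(x * y for x, y in zip(a, b))


def ffn_full(x, up, down):
    """Full pass: every intermediate neuron is computed.
    up[i] holds the input weights of neuron i, down[i] its output weights."""
    out = [0.0] * len(x)
    for up_i, down_i in zip(up, down):
        h = relu(dot(up_i, x))
        for j, w in enumerate(down_i):
            out[j] += h * w
    return out


def ffn_sparse(x, up, down, active):
    """Sparse pass: only the weights of the listed neurons are touched."""
    out = [0.0] * len(x)
    for i in active:
        h = relu(dot(up[i], x))
        for j, w in enumerate(down[i]):
            out[j] += h * w
    return out


# Toy data (hypothetical): model dimension 2, four intermediate neurons.
x = [1.0, -2.0]
up = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.5]]
down = [[0.5, 1.0], [3.0, -1.0], [2.0, 2.0], [-1.0, 0.25]]

# Oracle active set: neurons whose pre-activation is positive.
active = [i for i, up_i in enumerate(up) if dot(up_i, x) > 0.0]

print("active neurons:", active)
print("full  :", ffn_full(x, up, down))
print("sparse:", ffn_sparse(x, up, down, active))
```

In this toy case only two of the four neurons are active, so the sparse pass reads half of the feed-forward weights while producing the identical result.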
The point of windowing is that most of the active neurons for the word entering the next window are shared with the active neurons of the previous window. In Figure 3, blue marks neurons (model parameters) that need to be loaded and red marks neurons that do not. The slightly darker blue ones are New Neurons, which were not needed in the previous window and must be newly loaded. Windowing therefore keeps the active neurons related to the last 5 tokens in memory. This reduces the number of neurons that must be newly loaded for the next window, and thus reduces data transfer.
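The bookkeeping behind windowing can be sketched with plain sets. This is a minimal illustration, not the paper's implementation: the class name `NeuronWindow`, the window size of 2, and the per-token active sets are all hypothetical, and the active sets are simply given rather than predicted.

```python
from collections import deque


class NeuronWindow:
    """Tracks which neurons are kept in DRAM for the last `window_size` tokens."""

    def __init__(self, window_size: int):
        # One set of active neuron ids per token; the oldest is dropped automatically.
        self.recent = deque(maxlen=window_size)

    def resident(self) -> set:
        """Union of the active neurons of all tokens currently in the window."""
        result = set()
        for active in self.recent:
            result |= active
        return result

    def step(self, active_now: set):
        """Slide the window by one token and return (to_load, to_evict)."""
        before = self.resident()
        self.recent.append(set(active_now))
        after = self.resident()
        return after - before, before - after


# Hypothetical active sets for three consecutive tokens.
tokens = [{1, 2, 3}, {2, 3, 4}, {3, 4, 5}]
window = NeuronWindow(window_size=2)

history = []
for active in tokens:
    to_load, to_evict = window.step(active)
    history.append((to_load, to_evict))
    print(f"active={sorted(active)} load={sorted(to_load)} evict={sorted(to_evict)}")
```

Because consecutive tokens share most of their active neurons, only 5 neuron loads are needed in this example instead of the 9 that reloading every token's active set from scratch would require.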
The amount of neuron (model parameter) transfers required by window size in Falcon 7B is shown in Figure 4.
The x-axis shows the window size and the y-axis shows the percentage of model parameters loaded into DRAM. In the case of Incremental Transfer, the larger the window size, the more model parameters are shared with the previous window and the fewer differences need to be loaded.
Point 2: Optimize chunk size (row-column bundling)
The relationship between chunk size and flash memory read throughput is shown in Figure 5.
The horizontal axis is chunk size, the vertical axis is read throughput, and the difference between the lines is the number of threads. It is shown that the more the chunk size and the number of threads are increased when reading data from flash memory, the faster data can be read from flash memory. To take advantage of this characteristic of flash memory, row-column bundling is used to increase the chunk size.
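One common way to reason about why larger chunks read faster is to assume that every read request pays a fixed overhead before data starts flowing. The sketch below uses that simple model, which is an assumption of this example and not taken from the paper; the overhead and peak bandwidth values are hypothetical placeholders rather than measurements, and the effect of multiple threads is not modeled.

```python
def effective_throughput(chunk_bytes: float, overhead_s: float, peak_bytes_per_s: float) -> float:
    """Bytes per second achieved when each read costs a fixed overhead plus transfer time."""
    return chunk_bytes / (overhead_s + chunk_bytes / peak_bytes_per_s)


# Hypothetical device characteristics, for illustration only.
OVERHEAD_S = 1e-4          # fixed cost per read request
PEAK_BYTES_PER_S = 1e9     # bandwidth approached by very large reads

chunk_sizes = [4 * 1024, 16 * 1024, 64 * 1024, 256 * 1024]
throughputs = [effective_throughput(c, OVERHEAD_S, PEAK_BYTES_PER_S) for c in chunk_sizes]

for size, tput in zip(chunk_sizes, throughputs):
    print(f"chunk {size // 1024:4d} KiB -> {tput / 1e6:7.1f} MB/s")
```

In this model the fixed overhead is spread over more bytes as the chunk grows, so throughput rises toward the peak bandwidth without ever exceeding it, which matches the qualitative trend described for Figure 5.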
In the OPT and Falcon models, computing the i-th intermediate neuron requires the i-th column of the up projection and the i-th row of the down projection. These can be thought of as entries of the projection matrices (model parameters) in the feed-forward network. By storing and reading these corresponding pieces of matrix data together (row-column bundling), the chunk size can be increased.
Figure 6 shows a conceptual diagram of row-column bundling.
Figure 6 is not explained in much detail in the paper, but the Predictor's Output appears to be related to the windowing described earlier. In the figure, the purple, black, blue, and red neurons in Predictor's Output are the ones judged to be necessary. In other words, where all eight rows of neurons in Figure 6 would originally have to be read, only the four necessary ones are read. The figure shows the up projection columns (Up Proj Columns) and down projection rows (Down Proj Rows) for the necessary neurons, and its right side shows how they are read together from flash memory. If an up projection column and a down projection row each have size d_model, bundling them produces a single chunk of size 2d_model, that is, twice the chunk size. In this way, the proposed method reduces the amount of data that must be read from flash memory while raising the transfer speed by making each chunk as large as possible, thereby achieving efficient inference.
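The storage layout behind row-column bundling can be sketched as follows. This is a minimal illustration under stated assumptions: flash memory is simulated by a flat Python list, the function names `bundle` and `read_neuron` are hypothetical, and the weight values are placeholders chosen so that the result is easy to check by eye.

```python
def bundle(up_slices, down_slices):
    """Lay out each neuron's up projection and down projection weights side by side."""
    flat = []
    for up_i, down_i in zip(up_slices, down_slices):
        flat.extend(up_i)     # d_model values from the up projection
        flat.extend(down_i)   # d_model values from the down projection
    return flat


def read_neuron(flat, i: int, d_model: int):
    """A single contiguous read of 2 * d_model values returns both vectors of neuron i."""
    start = i * 2 * d_model
    chunk = flat[start:start + 2 * d_model]
    return chunk[:d_model], chunk[d_model:]


# Placeholder weights: model dimension 3, four intermediate neurons.
D_MODEL = 3
up_slices = [[10 * i + j for j in range(D_MODEL)] for i in range(4)]
down_slices = [[100 + 10 * i + j for j in range(D_MODEL)] for i in range(4)]

storage = bundle(up_slices, down_slices)   # stands in for the flash layout
up_2, down_2 = read_neuron(storage, 2, D_MODEL)

print("up projection weights of neuron 2:  ", up_2)
print("down projection weights of neuron 2:", down_2)
```

Without bundling, fetching neuron 2 would take two separate reads of d_model values from two different places; with bundling it takes one read of 2 * d_model values, which is the larger chunk size the text refers to.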
Conclusion
The paper describes the key elements of a technique that lets LLMs perform inference on devices with limited memory. It is positioned as a first attempt at democratizing LLMs and making LLM inference available to a wider range of people and devices. We anticipate that much more follow-up research and development will be undertaken to democratize LLMs and make them usable with less memory.