Architectural Exploration Method For Neural Nets Running On IoT Devices

NAS 31/08/2022

3 main points
✔️ Explore neural network architectures running on IoT devices
✔️ Add a term about computational resources to the evaluation function of the architecture
✔️ We were able to explore architectures that achieve high accuracy with fewer computational resources

μNAS: Constrained Neural Architecture Search for Microcontrollers
written by Edgar Liberis, Łukasz Dudziak, Nicholas D. Lane
(Submitted on 27 Oct 2020 (v1), last revised 8 Dec 2020 (this version, v3))
Comments: EuroMLSys '21
Subjects: Machine Learning (cs.LG); Hardware Architecture (cs.AR)

The images used in this article are from the paper, the introductory slides, or were created based on them.

outline

IoT devices are built around microcontroller units (MCUs) with very limited resources. To run a neural network on an IoT device, we need a lightweight network that fits on an MCU, and such networks are difficult to design by hand. This research presents μNAS, a system that searches for such networks automatically. μNAS explicitly targets the three main resource shortages of MCUs: RAM, storage, and processor speed.

prerequisite knowledge

In this paper, we consider a mid-tier MCU with 64KB SRAM and 64KB storage available.

Model compression is an active field that aims to design neural networks requiring less computation. Methods such as pruning and quantization compress large networks while minimizing the loss of classification accuracy. However, most compression methods are not suitable for creating MCU-sized networks. This is because many methods target platforms with significantly more computational resources than MCUs (e.g., mobile devices), or focus on reducing the number of parameters or floating-point operations in the layers of the network while keeping the overall architecture unchanged. Even resource-efficient models such as MobileNet exceed the expected resource budget by a factor of 10. This calls for methods specialized for deep learning on MCUs.

The limitations of running a neural network on an MCU are as follows.

Temporary data generated by the network must fit in the MCU's SRAM.
Static data such as neural network parameters and program code must fit in the ROM and flash memory of the MCU.
Inference runs on a low-power processor, so the network must be cheap enough to execute quickly.

To design neural networks that achieve high accuracy while satisfying these constraints, the authors turned to Neural Architecture Search (NAS). Given appropriate conditions, NAS can generate architectures that satisfy a particular set of constraints or achieve multiple objectives simultaneously. However, most current NAS systems are designed not for the kind of device targeted in this paper but for larger platforms such as GPUs, so they are difficult to use as they are. In this paper, we propose µNAS, a NAS system targeting MCUs. By accurately identifying resource requirements and combining the search with model compression, µNAS designs highly accurate models that run fast with low memory.

proposed method

The main features of the proposed method (μNAS) are as follows

Make the search space tailored to the MCU
Set limits on computing resources
1. Peak memory usage limits
2. Model size (storage capacity limit)
3. Inference execution time limit

μNAS incorporates these elements into a NAS system. We explain each of them in detail below.

Make the search space tailored to the MCU

The search space for neural network architectures targeting general GPUs and the search space for MCUs differ in granularity. For example, for architectures targeting general GPUs, choosing between a 172-channel and a 192-channel convolutional layer has almost no effect on the final performance. Such search spaces are therefore designed more coarsely, so that larger models can also be covered. For MCUs, however, these differences matter: memory constraints are very strict, so choosing a slightly larger layer can have a huge impact on the overall result. GPU-oriented architecture search typically keeps the search space coarse by composing networks from small architectural units called cells. For MCUs, a finer granularity than such cell-based methods is needed, so the search space must allow the number of channels and the layer connections to be chosen freely. The search space used in this paper is shown in the table below.

In the search, a randomly generated architecture is used as the parent network, and child networks are generated and explored by applying modification operations called morphisms to it.

computational resource limitation

In architecture search for neural networks targeting GPUs in general, it is not necessary to know the compute resource limitations exactly, because GPUs are far less resource-constrained than MCUs. MCUs, however, have very strict resource constraints, so we need to know whether a generated architecture is small enough to run on an MCU. In this section, we focus on the main compute resource constraints of MCUs: memory usage limit, storage usage limit, and execution time limit.

memory usage limit

At a minimum, the input and output of an operator must be held in memory while it executes. In addition, if there is a residual connection, the result of an earlier operator must stay in memory until it is consumed, even after that operator has finished executing. Taking these constraints into account, the authors developed an algorithm ( https://www.arxiv-vanity.com/papers/1910.05110/ ) to calculate the peak memory usage during the execution of a network, and they use it to calculate the peak memory usage of each explored architecture.
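
The bookkeeping described here can be illustrated with a small sketch. It assumes a single fixed execution order and hypothetical tensor names and sizes, and it is a simplified illustration rather than the authors' algorithm: at each step it sums the sizes of all tensors that have already been produced and are still needed by the current or a later operator.

```python
from typing import Dict, List, Tuple

# An operator is (name, input tensor names, output tensor name).
# All names and sizes below are hypothetical, chosen for illustration.
Op = Tuple[str, List[str], str]


def peak_memory(ops: List[Op], sizes: Dict[str, int]) -> int:
    """Return the peak number of bytes live under the given execution order."""
    produced_at: Dict[str, int] = {}  # step creating each tensor (-1 = network input)
    last_use: Dict[str, int] = {}     # last step that reads each tensor
    for step, (_, inputs, output) in enumerate(ops):
        for t in inputs:
            produced_at.setdefault(t, -1)
            last_use[t] = step
        produced_at[output] = step

    peak = 0
    for step in range(len(ops)):
        # A tensor is live from the step that creates it until its last reader.
        live = [
            t for t, born in produced_at.items()
            if born <= step <= last_use.get(t, born)
        ]
        peak = max(peak, sum(sizes[t] for t in live))
    return peak


sizes = {"x": 1000, "a": 2000, "b": 1000, "y": 1000}

# Plain chain: x is no longer needed once conv1 has run.
chain = [("conv1", ["x"], "a"), ("conv2", ["a"], "b"), ("relu", ["b"], "y")]

# Residual connection: x must stay in memory until the add consumes it.
residual = [("conv1", ["x"], "a"), ("conv2", ["a"], "b"), ("add", ["x", "b"], "y")]

chain_peak = peak_memory(chain, sizes)
residual_peak = peak_memory(residual, sizes)
```

With these sizes the residual variant peaks higher than the plain chain, because `x` is kept alive while `conv2` runs.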

Storage Capacity Limit

The neural network code and the weights of the network are kept in storage. Traditionally, each parameter is represented as a 32-bit floating-point number, but quantization can reduce this. The authors quantize each parameter to an 8-bit integer. Therefore, μNAS calculates the required storage capacity as the number of parameters × 1 byte.
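
As a worked example of the "number of parameters × 1 byte" rule, the sketch below counts parameters for a small, made-up network and compares the 8-bit and 32-bit sizes against the 64KB storage budget mentioned earlier. The layer shapes and helper names are assumptions for illustration only.

```python
def conv2d_params(in_ch: int, out_ch: int, kernel: int, bias: bool = True) -> int:
    """Weights of a kernel x kernel convolution, plus one bias per output channel."""
    return kernel * kernel * in_ch * out_ch + (out_ch if bias else 0)


def dense_params(in_features: int, out_features: int, bias: bool = True) -> int:
    """Weights of a fully connected layer, plus one bias per output unit."""
    return in_features * out_features + (out_features if bias else 0)


def model_size_bytes(param_counts, bytes_per_param: int = 1) -> int:
    """Storage estimate: number of parameters times bytes per parameter."""
    return sum(param_counts) * bytes_per_param


# Hypothetical three-layer network.
counts = [
    conv2d_params(in_ch=1, out_ch=16, kernel=3),   # 160
    conv2d_params(in_ch=16, out_ch=32, kernel=3),  # 4640
    dense_params(in_features=32, out_features=10), # 330
]

STORAGE_BUDGET = 64 * 1024  # 64KB, as assumed in the article

int8_size = model_size_bytes(counts, bytes_per_param=1)
float32_size = model_size_bytes(counts, bytes_per_param=4)
fits = int8_size <= STORAGE_BUDGET
```

The 8-bit estimate is exactly one quarter of the 32-bit one, which is the saving the quantization step provides.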

Inference execution time limit

As a metric for estimating how long inference takes, the authors use the number of multiply-accumulate (MAC) operations. To check whether this value is an appropriate proxy for inference time, the runtime and MAC count were measured for 1000 architectures randomly selected from the search space and plotted against each other. The result is shown in the figure below.

This figure shows that there is a strong correlation between the MAC count and the inference time (latency).
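
The MAC count can be computed from layer shapes alone, without running the model. The sketch below uses the standard per-layer formulas for convolutional and fully connected layers; the network itself is hypothetical and biases and activations are ignored.

```python
def conv2d_macs(out_h: int, out_w: int, in_ch: int, out_ch: int, kernel: int) -> int:
    """One MAC per kernel weight, per input channel, per output element."""
    return out_h * out_w * out_ch * kernel * kernel * in_ch


def dense_macs(in_features: int, out_features: int) -> int:
    """One MAC per weight of the fully connected layer."""
    return in_features * out_features


# Hypothetical network: two 3x3 convolutions followed by a classifier.
layer_macs = [
    conv2d_macs(out_h=28, out_w=28, in_ch=1, out_ch=16, kernel=3),
    conv2d_macs(out_h=14, out_w=14, in_ch=16, out_ch=32, kernel=3),
    dense_macs(in_features=32, out_features=10),
]
total_macs = sum(layer_macs)
```

In this example the second convolution dominates the total, which shows why channel counts are worth searching at fine granularity.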

search method

As mentioned earlier, the following factors need to be considered when designing a neural network to run on an MCU.

Model performance (validation accuracy)
Peak memory usage
Model size (storage limit)
MAC value (inference run time)

To include all of these in the evaluation function of the architecture, we set up the evaluation function using the coefficient λ as follows

The coefficient λ represents the relative importance of each term.
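
To make the role of λ concrete, here is a minimal sketch of a scalarized evaluation function over the four quantities listed above. The exact form of the function is not reproduced in this text, so the weighted sum used here is an assumption; other scalarizations (for example, taking the maximum of the weighted terms) would slot in the same way. The `Metrics` class, the λ values, and the numbers are all hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Metrics:
    val_accuracy: float  # validation accuracy in [0, 1]
    peak_memory: int     # bytes
    model_size: int      # bytes
    macs: int            # multiply-accumulate operations


def evaluate(m: Metrics, lambdas: Sequence[float]) -> float:
    """Lower is better. Assumed form: weighted sum of error and resource terms."""
    terms = (1.0 - m.val_accuracy, m.peak_memory, m.model_size, m.macs)
    return sum(lam * term for lam, term in zip(lambdas, terms))


# Hypothetical weights; each lambda also rescales its term to a comparable range.
lambdas = (1.0, 1e-5, 1e-5, 1e-7)

small = Metrics(val_accuracy=0.95, peak_memory=40_000, model_size=30_000, macs=1_000_000)
large = Metrics(val_accuracy=0.95, peak_memory=40_000, model_size=60_000, macs=1_000_000)

small_score = evaluate(small, lambdas)
large_score = evaluate(large, lambdas)
```

Two models with equal accuracy are separated by their resource terms, so the smaller model receives the lower (better) score.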

To optimize this evaluation function, we use the Aging Evolution (AE) algorithm. In each search round, AE samples architectures from the population and selects the one with the smallest evaluation function value. It then applies morphisms to the selected architecture to generate descendants, which are added to the population. When a new architecture is added, the oldest one is removed. In this way, the population being explored is continually updated using morphisms.
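
The loop can be sketched as follows. This is a toy illustration under stated assumptions, not the μNAS implementation: an "architecture" is reduced to a single channel count, the morphism is a small change to that count, one child is produced per round, and the score is a stand-in for the evaluation function. All function names and default values are hypothetical.

```python
import random
from collections import deque


def aging_evolution(random_arch, mutate, score,
                    population_size=20, sample_size=5, rounds=200, seed=0):
    """Return (best architecture, best score) seen during the search."""
    rng = random.Random(seed)
    population = deque()  # oldest entry sits at the left end
    for _ in range(population_size):
        arch = random_arch(rng)
        population.append((arch, score(arch)))
    best = min(population, key=lambda entry: entry[1])

    for _ in range(rounds):
        # Sample a few candidates and pick the one with the smallest score.
        sample = rng.sample(list(population), sample_size)
        parent_arch, _ = min(sample, key=lambda entry: entry[1])
        # Apply a morphism to the parent to obtain a child.
        child = mutate(parent_arch, rng)
        entry = (child, score(child))
        population.append(entry)
        population.popleft()  # aging: drop the oldest, not the worst
        if entry[1] < best[1]:
            best = entry
    return best


# Toy problem: find a channel count close to a hypothetical target of 24.
def random_arch(rng):
    return rng.randint(1, 64)


def mutate(arch, rng):
    return max(1, arch + rng.choice([-4, -1, 1, 4]))


def score(arch):
    return abs(arch - 24)


best_arch, best_score = aging_evolution(random_arch, mutate, score)
```

Removing the oldest member rather than the worst one keeps the population turning over, so an architecture survives only through its descendants.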

We also use model compression to reduce the size of the model: after determining the base architecture of the model by AE, we use pruning to discard channels and units that are deemed unimportant during training.
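
As an illustration of discarding unimportant channels, the sketch below ranks the output channels of one layer by the L1 norm of their weights and keeps the largest ones. This magnitude criterion applied after the fact is a deliberately simple stand-in chosen for the example; the pruning in μNAS happens during training and its importance criterion is not described in this text. The weights and the `keep_ratio` parameter are hypothetical.

```python
from typing import List, Tuple


def channel_l1_norms(weights: List[List[float]]) -> List[float]:
    """weights[c] holds the flattened weights of output channel c."""
    return [sum(abs(w) for w in channel) for channel in weights]


def prune_channels(weights: List[List[float]],
                   keep_ratio: float) -> Tuple[List[List[float]], List[int]]:
    """Keep the channels with the largest L1 norm; return them and their indices."""
    norms = channel_l1_norms(weights)
    n_keep = max(1, round(len(weights) * keep_ratio))
    ranked = sorted(range(len(weights)), key=lambda c: norms[c], reverse=True)
    kept = sorted(ranked[:n_keep])  # preserve the original channel order
    return [weights[c] for c in kept], kept


# Four hypothetical channels; channels 1 and 3 have near-zero weights.
weights = [
    [0.9, -0.8],
    [0.01, 0.02],
    [0.5, 0.4],
    [-0.03, 0.0],
]
pruned, kept = prune_channels(weights, keep_ratio=0.5)
```

Dropping whole channels, rather than individual weights, shrinks the layer itself, so it reduces model size, MAC count, and activation memory together.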

experiment

Effect of pruning

To determine whether pruning can find models with low resource usage, we run μNAS with and without pruning. The results are shown below.

This figure shows the model size and the error rate of the discovered models for both datasets (Chars74K, MNIST). The results show that the runs with pruning achieve a lower error rate at a smaller model size.

Performance of the explored architectures

The table above summarizes the performance of the architectures explored by µNAS for each dataset and compares them with other models. The table shows that the architectures explored by µNAS outperform the other models not only in terms of accuracy but also in terms of model size and RAM usage.

summary

In this paper, we proposed an architecture search method for neural networks that run under tight computational resources. The method explores architectures by running Aging Evolution with terms related to computational resources added to the evaluation function. The architectures found in this way perform well not only in accuracy but also in computational resource usage.