
Fast Machine Learning Applied To Science


Survey, Review

3 main points
✔️ Machine learning is increasingly used to accelerate scientific research, which demands high-throughput, low-latency algorithms
✔️ Reviews emerging ML algorithms as well as the latest hardware and software for deploying them
✔️ Fast ML techniques continue to be improved as they are applied to scientific problems

Applications and Techniques for Fast Machine Learning in Science
written by Allison McCarn Deiana (coordinator), Nhan Tran (coordinator), Joshua Agar, Michaela Blott, Giuseppe Di Guglielmo, Javier Duarte, Philip Harris, Scott Hauck, Mia Liu, Mark S. Neubauer, Jennifer Ngadiuba, Seda Ogrenci-Memik, Maurizio Pierini, Thea Aarrestad, Steffen Bahr, Jurgen Becker, Anne-Sophie Berthold, Richard J. Bonventre, Tomas E. Muller Bravo, Markus Diefenthaler, Zhen Dong, Nick Fritzsche, Amir Gholami, Ekaterina Govorkova, Kyle J Hazelwood, Christian Herwig, Babar Khan, Sehoon Kim, Thomas Klijnsma, Yaling Liu, Kin Ho Lo, Tri Nguyen, Gianantonio Pezzullo, Seyedramin Rasoulinezhad, Ryan A. Rivera, Kate Scholberg, Justin Selig, Sougata Sen, Dmitri Strukov, William Tang, Savannah Thais, Kai Lukas Unger, Ricardo Vilalta, Belina von Krosigk, Thomas K. Warburton, Maria Acosta Flechas, Anthony Aportela, Thomas Calvet, Leonardo Cristella, Daniel Diaz, Caterina Doglioni, Maria Domenica Galati, Elham E Khoda, Farah Fahim, Davide Giri, Benjamin Hawks, Duc Hoang, Burt Holzman, Shih-Chieh Hsu, Sergo Jindariani, Iris Johnson, Raghav Kansal, Ryan Kastner, Erik Katsavounidis, Jeffrey Krupa, Pan Li, Sandeep Madireddy, Ethan Marx, Patrick McCormack, Andres Meza, Jovan Mitrevski, Mohammed Attia Mohammed, Farouk Mokhtar, Eric Moreno, Srishti Nagu, Rohin Narayan, Noah Palladino, Zhiqiang Que, Sang Eon Park, Subramanian Ramamoorthy, Dylan Rankin, Simon Rothman, Ashish Sharma, Sioni Summers, Pietro Vischia, Jean-Roch Vlimant, Olivia Weng
(Submitted on 25 Oct 2021)
Comments: Published on arxiv.

Subjects: Machine Learning (cs.LG); Hardware Architecture (cs.AR); Data Analysis, Statistics and Probability (physics.data-an); Instrumentation and Detectors (physics.ins-det)


The images used in this article are from the paper, the introductory slides, or were created based on them.

Introduction

In the pursuit of scientific progress across many fields, experiments have become highly sophisticated in order to investigate physical systems at smaller spatial resolutions and shorter time scales. These orders-of-magnitude advances have led to an explosion in the quantity and quality of data, and scientists in every field need new methods to meet their growing data-processing needs. At the same time, the use of machine learning (ML), i.e., algorithms that can learn directly from data, has led to rapid advances in many scientific fields. Recent advances have shown that deep learning (DL) architectures based on structured deep neural networks are versatile and can solve a wide range of complex problems. The proliferation of large datasets, computing power, and DL software has led to the exploration of many different DL approaches, each with its own advantages.

This review paper focuses on the integration of ML and experimental design to solve important scientific problems by accelerating and improving data processing and real-time decision-making. We discuss a myriad of scientific problems that require fast ML and outline unifying themes across these domains that can lead to general solutions. In addition, we review the current technology needed to run ML algorithms fast, and present key technical issues that, if solved, will lead to significant scientific advances. A key requirement for such scientific progress is openness: experts from domains that otherwise have little interaction must come together to develop transferable, open-source solutions. Much of the progress in ML over the last few years has been driven by heterogeneous computing hardware. In particular, GPUs have made it possible to develop large-scale DL algorithms, and the ability to train large AI algorithms on large datasets has enabled algorithms that can perform sophisticated tasks. In parallel with these developments, new types of DL algorithms have emerged that aim to reduce the number of operations needed, so as to achieve fast and efficient AI.

This paper is a review based on material presented at the 2nd Fast Machine Learning conference. Figure 1 shows the spirit of the workshop series that inspired this paper and the topics covered in the following sections.

As ML tools have become more sophisticated, the focus has shifted to building very large algorithms that solve complex problems such as language translation and speech recognition. Furthermore, the applications are becoming more diverse, and it is important to understand how each scientific approach can be transformed to benefit from the AI revolution. This includes the ability of AI to classify events in real time, such as particle collisions or changes in gravitational waves, as well as to control systems, such as feedback-based control of plasmas and particle accelerators. However, constraints such as latency, bandwidth, and throughput, and the reasons behind them, vary from system to system. Designing a low-latency algorithm differs from other AI implementations in that it requires specific processing hardware to handle the task and improve overall algorithm performance. For example, scientific measurements may require ultra-low-latency inference, and in such cases the algorithm must be designed to take full advantage of the available hardware while still meeting the experimental requirements.

Domain Application Exemplars

As scientific ecosystems rapidly become faster and larger, new paradigms for data processing and reduction need to be incorporated into the system-level design. While the implementation of fast machine learning varies widely across domains and architectures, the needs for basic data representation and machine learning integration are similar. Here we enumerate a broad sampling of scientific domains for seemingly unrelated tasks, including existing technologies and future needs.

Large Hadron Collider

The Large Hadron Collider (LHC) at CERN is the world's largest and most energetic particle accelerator, with bunches of protons colliding every 25 nanoseconds. To study the products of these collisions, several detectors have been installed at interaction points along the ring. The purpose of these detectors is to measure the properties of the Higgs boson with high precision and to search for new physical phenomena beyond the Standard Model of particle physics. Due to the extremely high collision frequency of 40 MHz, the high multiplicity of secondary particles, and the large number of sensors, the detectors must process and store data at an enormous rate. The two general-purpose experiments, CMS and ATLAS, consist of tens of millions of readout channels, and their raw data rates are on the order of 100 Tb/s. Processing and storing these data are among the most important challenges for the LHC physics program. Detector data processing consists of an online stage, where events are selected from a buffer and analyzed in real time, and an offline stage, where the data are written to disk and analyzed in more detail using sophisticated algorithms. In the online processing system, called the trigger, the data rate is reduced to a manageable level of 10 Gb/s before being recorded for offline processing. Triggers are typically multi-tiered. Due to the limited size of the on-detector buffers, the first level (Level 1 or L1) uses FPGAs or ASICs that can perform filtering with latencies of at most about 1 µs. In the second stage, the High-Level Trigger (HLT), a CPU-based computing farm located at the experimental site processes the data with a latency of up to 100 ms. Finally, full offline event processing is performed on a globally distributed CPU-based computing grid. Maintaining the capability of this system will become even more challenging in the near future: in 2027, the LHC will be upgraded to the so-called High-Luminosity LHC (HL-LHC), which will produce five to seven times as many particles per collision, and the total amount of accumulated data will be an order of magnitude greater than that achieved by current accelerators. At the same time, the particle detectors will become larger and more finely grained and will need to process data ever faster. Thus, the amount of physics that can be extracted from the experiments will be limited by the accuracy of the algorithms and the available computational resources.

Machine learning techniques offer promising solutions and capabilities in both areas, due to their ability to extract the most relevant information from high-dimensional data and their high degree of parallelization to appropriate hardware. Once the new generation of algorithms is implemented in all stages of the data processing system of the LHC experiment, it is expected to play a key role in maintaining and hopefully improving the performance of the physics.

Examples of applications to physics tasks at the LHC include event reconstruction, event simulation, heterogeneous computing, real-time analysis at 40 MHz, and the application of ML to front-end detectors. (Details are omitted.)

(High-)intensity accelerator experiments

Machine-Learning-Based Trigger System in the Belle II Experiment

The Neural Network z-Vertex Trigger (NNT) used in Belle II is a dead-time-free Level 1 (L1) trigger that identifies particles by estimating their origin along the beampipe. The entire L1 trigger process, from data readout to decision, is given a real-time budget of 5 µs to avoid dead time. Because of the time required to preprocess and transmit the data, the NNT must reach its decision within about 300 ns of processing time. The task of the NNT is to estimate the origin of a particle track and determine whether it comes from the interaction point. For this purpose, a multilayer perceptron (MLP) implemented on a Xilinx Virtex 6 XC6VHX380T FPGA is used. The MLP consists of three layers: 27 input neurons, 81 hidden-layer neurons, and 2 output neurons. Data from the Central Drift Chamber (CDC) of Belle II are used for this task, since the CDC is specialized for particle trajectory detection. Before being processed by the network, the raw detector data are combined into 2D track segments, which are groups of adjacent active sense wires. The output of the NNT gives the origin of the track in the z-direction, along the beampipe, as well as the polar angle θ. The z-vertex is used by the downstream Global Decision Logic (GDL) to determine whether the track comes from the interaction point, while the polar angle θ can be used to determine the momentum of the particle. The networks used in the NNT are trained offline. The first networks were trained on plain simulation data because experimental data were not yet available; more recent networks use tracks reconstructed from experimental data. Training uses the iRPROP algorithm, an extension of the RPROP backpropagation algorithm. Current results show good agreement between NNT tracks and reconstructed tracks. Because the event rate and background noise are currently within acceptable limits, the z-cut (the window around the interaction point within which an estimated track origin is kept) is set to ±40 cm. However, this z-cut can be tightened as the luminosity, and with it the background, increases. Data preprocessing enhancements are planned now that the Virtex UltraScale-based Universal Trigger Board (UT4) is available for the NNT; these will use a 3D Hough transform to further improve efficiency. Simulations have already shown that better resolution and wider solid-angle coverage can be obtained.
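To make the architecture concrete, below is a minimal PyTorch sketch of a 27-81-2 MLP of the kind described above; the layer sizes follow the text, while the activation function and all other details are illustrative assumptions rather than the actual Belle II firmware.

```python
# Minimal sketch of an NNT-style MLP (27-81-2). Layer sizes follow the text;
# the tanh activation and usage below are illustrative assumptions, not the
# Belle II FPGA implementation.
import torch
import torch.nn as nn

class ZVertexMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(27, 81),   # 27 inputs derived from track segments
            nn.Tanh(),
            nn.Linear(81, 2),    # outputs: z-vertex estimate and polar angle theta
        )

    def forward(self, x):
        return self.net(x)

model = ZVertexMLP()
dummy_track = torch.randn(1, 27)   # one preprocessed track
z_and_theta = model(dummy_track)   # shape (1, 2)
```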

Materials synthesis

In recent years, the materials science community has begun to embrace machine learning to facilitate scientific discovery, but this has not been without problems. The ability to create highly over-parameterized models to solve problems with limited data does not provide the generalization required for science and can lead to a false sense of validity. Machine learning architectures designed for natural time series and images are ill-suited to physical processes governed by equations. In this regard, there is a growing body of work that incorporates physics into machine learning models to serve as the ultimate regularizer. For example, rotational and Euclidean equivariances are being incorporated into model architectures, and methods are being developed to learn sparse representations of the underlying governing equations. Another challenge is that real systems exhibit system-specific discrepancies that need to be compensated for; for example, the slightly different viscosities of different batches of precursors must be taken into account. There is an urgent need to develop these fundamental methods for materials synthesis. Complementing these fundamental studies, there is a growing body of literature applying "post-mortem" machine learning analysis to "in situ" spectroscopy. As these concepts mature, there will be an increasing emphasis on co-designing synthesis systems, machine learning methods, and hardware for on-the-fly analysis and control. Such automated-lab efforts are already underway in wet chemical synthesis, where dynamics are minimal and latency is not an issue. In the future, the focus will undoubtedly shift to controlling dynamic synthesis processes where millisecond to nanosecond latencies are required.

Scanning probe microscopy

In materials science, machine learning is rapidly being introduced into scanning probe microscopy. Linear and nonlinear spectral unmixing techniques can rapidly visualize and extract information from these datasets to discover and elucidate physical mechanisms. The ease with which these techniques can be applied has raised legitimate concerns about over-interpretation of results and over-extension of linear models to highly nonlinear systems. More recently, a long short-term memory (LSTM) autoencoder was constrained to have a non-negative, sparse latent space for spectral unmixing. By scanning the learned latent space, it is now possible to draw out complex structure-property relationships. There is a huge opportunity to speed up the computational pipeline so that microscopists can extract information on practical time scales. Sampling rates of 100,000 spectra per second yield fast data streams approaching GB/s, so extracting even the smallest amount of information requires data-driven models, physics-based machine learning, and AI hardware. For example, a band-excitation piezoresponse force microscope measures the frequency-dependent response of a cantilever at rates of up to 2,000 spectra per second. To extract parameters from these measurements, the response must be fit to an empirical model. While least-squares fitting yields a throughput of only about 50 fits per core-minute, neural networks can be used to speed up the analysis and to cope with noisy data. This pipeline can be approximated and accelerated by deploying neural networks on GPU and FPGA hardware accelerators.
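As a concrete illustration of replacing per-spectrum least-squares fitting with a neural network, here is a hedged sketch of a small regression network that maps a measured response spectrum directly to fit parameters; the spectrum length, parameter count, and architecture are assumptions for illustration only, not the published pipeline.

```python
# Hedged sketch: a small regression network that maps a measured response
# spectrum directly to fit parameters (e.g., amplitude, resonance frequency,
# quality factor, phase), standing in for a per-spectrum least-squares fit.
# The 256-bin spectrum and 4-parameter output are illustrative assumptions.
import torch
import torch.nn as nn

class SpectralFitNet(nn.Module):
    def __init__(self, n_bins=256, n_params=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_bins, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, n_params),   # predicted fit parameters
        )

    def forward(self, spectrum):
        return self.net(spectrum)

model = SpectralFitNet()
batch = torch.randn(2000, 256)   # a batch of noisy spectra
params = model(batch)            # thousands of "fits" in a single forward pass
```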

Fermilab Accelerator Control

Traditional accelerator control groups like elements together so that specific aspects of the beam can be tuned independently. However, many elements are not completely separable; for example, magnets have higher-order field components that can affect the beam in unintended ways. Machine learning is finally making it possible to combine readout and beam-control elements that were previously thought to be unrelated, creating new control and coordination schemes. One such control project is underway at the gradient magnet power supply (GMPS) of the Booster, which controls the main orbit of the beam in the Booster. This project aims to improve the regulation accuracy of the GMPS by a factor of ten. When completed, GMPS will be the first FPGA-based online ML control system at the Fermilab accelerator complex. The potential of ML for accelerator control was so evident to the Department of Energy that a call for ML-based accelerator control was issued to the national laboratories. One of the two proposals submitted by Fermilab and approved by DOE is the Real-time Edge AI for Distributed Systems (READS) project, which consists of two efforts. One aims to use ML to control the slow spill of beam in the Delivery Ring to Mu2e; the other addresses the long-standing problem of deblending beam losses in the Main Injector (MI) enclosure, which houses two accelerators, the MI and the Recycler, both carrying high-intensity beams during normal operation, using a real-time online model. Both READS projects use online ML models running on FPGAs for inference and collect data with low latency from distributed systems around the accelerator complex.

Neutrino and direct dark matter experiments

Accelerator Neutrino Experiment

DUNE will use machine learning in its triggering framework to process huge data rates and to identify interactions both for traditional neutrino oscillation measurements and for candidate solar and supernova events. Accelerator neutrino experiments have been successfully applying machine learning techniques for years; in the first such example, in 2017, a network increased the effective exposure of an analysis by 30%. Event-classification networks are common across experiments, and DUNE has recently published a network that can exceed design sensitivity on simulated data and that includes an output counting the number of final-state particles produced in the interaction. These experiments have become increasingly aware of the risk that a network learns features of the training data beyond those intended, so careful construction of the training dataset is essential to reduce this risk. However, it is not possible to correct or quantify biases that are not yet known. For this reason, the MINERvA experiment has investigated the use of a domain adversarial neural network to reduce unknown biases arising from differences between simulated and real data. This network has a gradient reversal layer in the domain network (trained on data), which prevents the classification network (trained on simulation) from learning from features that behave differently between the two domains.
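The gradient-reversal idea behind such a domain adversarial network can be sketched in a few lines; the code below is a generic illustration, not the MINERvA implementation, and the scaling factor lam is an assumed hyperparameter.

```python
# Hedged sketch of a gradient reversal layer: the forward pass is the
# identity, while the backward pass flips the sign of the gradient so the
# shared feature extractor cannot learn features that distinguish
# simulation from data. Network sizes and usage are illustrative.
import torch

class GradientReversal(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) the gradient flowing back into the features.
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradientReversal.apply(x, lam)

# Usage inside a model: domain_logits = domain_head(grad_reverse(features))
```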

In this area, there are further applications in neutrino astrophysics and the direct detection of dark matter.

Electron-Ion Collider

Accessing the physics of the Electron-Ion Collider (EIC) requires an unprecedented integration of interaction region (IR) and detector design. Seamless data processing from the DAQ through to analysis at the EIC would streamline workflows, for example by integrating the software for DAQ, online, and offline analysis, and would enable new software technologies, especially fast ML algorithms, at all levels of data processing. This is an opportunity to further optimize the physics reach of the EIC. The status and prospects of "AI for Nuclear Physics" were discussed at a 2020 workshop. Topics relevant to fast ML are intelligent decisions about data storage and (near) real-time analysis. Intelligent decisions about data storage are needed to reliably capture the relevant physics, and fast ML algorithms can improve the acquired data through data compactification, sophisticated triggers, and fast online analysis. At the EIC this includes automated data-quality monitoring as well as automated detector alignment and calibration. Near real-time analysis and feedback enable rapid diagnosis and optimization of the experimental setup and greatly improve access to physics results.

gravitational wave

In recent years, machine learning algorithms have been explored in various areas of gravitational-wave physics. CNNs have been applied to the detection and classification of gravitational waves from compact binary coalescences, burst gravitational waves from core-collapse supernovae, and continuous gravitational waves. Recurrent neural network (RNN)-based autoencoders are being investigated for detecting gravitational waves with unsupervised strategies, and FPGA-based RNNs are being explored to demonstrate the potential for low-latency detection. Applications of ML to searches for other types of gravitational waves, such as generic bursts and the stochastic background, are currently under investigation. In addition, probabilistic and generative ML models can be used for posterior sampling in gravitational-wave parameter estimation, achieving performance comparable to Bayesian samplers on simulated data while significantly reducing the time to completion. ML algorithms have also been used to improve the quality of gravitational-wave data and to subtract noise. Transient noise artifacts can be identified and classified by examining their time-frequency and constant-Q transforms, as well as LIGO's hundreds of thousands of auxiliary channels; these auxiliary channels can also be used to subtract quasi-periodic noise sources. Although ML algorithms hold great promise for gravitational-wave data analysis, many of them are still at the proof-of-concept stage and have not yet been successfully applied in real-time analysis. Current challenges include building the computational infrastructure for low-latency analysis, improving the quality of the training data (e.g., expanding the parameter space and using more realistic noise models), and quantifying the performance of the algorithms on longer data runs.

biomedical engineering

Many advances in ML algorithms bring improvements in both accuracy and inference speed, and some of the most advanced models are already very fast. For example, YOLOv3-tiny, a popular object-detection model that can be applied to medical imaging, can process images at more than 200 frames per second on a standard dataset while achieving reasonable accuracy. Currently, GPU- and FPGA-based systems, distributed networks of wireless sensors connected to cloud ML (edge computing), and ML models connected over 5G and high-speed WiFi are being deployed in medical AI applications. ML models for the fast diagnosis of stroke, thrombosis, colorectal polyps, cancer, and epilepsy have significantly reduced the time for lesion detection and clinical decision-making. Real-time AI-assisted surgery can improve perioperative workflow, video segmentation, detection of surgical instruments, and visualization of tissue deformation. Fast ML thus plays an important role in digital health, i.e., remote diagnosis, surgery, and monitoring.

health monitoring

Existing research has taken steps in various directions, but there is a growing need to develop ML approaches that can correctly sense health biomarkers and identify them quickly and accurately. Researchers have focused on developing novel sensing systems that can capture a variety of health behaviors and biomarkers. Historically, most of these new sensing techniques were tested in controlled environments, but recently researchers have worked to ensure that these systems also work seamlessly in free-living environments. To achieve this, multiple ML models need to be developed, each adapted to specific situations and environments. New trends in this field increasingly rely on models that can run on the device itself and detect these behaviors quickly and accurately. In addition to enabling real-time interventions, on-device monitoring of these behaviors can help alleviate privacy concerns. However, because wearable devices themselves may not be able to process all of the data, several researchers have recently explored coordinated machine learning approaches.

cosmology

CNNs are being applied to spherical surfaces to generate more accurate models for weak lensing maps and to remove noise from cosmic microwave background maps. A discovery and classification engine is also being developed to extract useful cosmological data from next-generation facilities. ML is also being used for space simulations, testing new analyses and methods, and laying the groundwork for the first operations of such a new facility.

plasma physics

The overarching goal here is to develop predictive models of realistic plasma disruptions and to integrate them with state-of-the-art plasma control systems, providing the ability to design experiments before they are run so that discharges in ITER and future burning plasmas can be carried out as efficiently and safely as possible. The associated verification, validation, and uncertainty quantification involves: (1) developing predictive neural-network models of plasmas and actuators that can be extrapolated to burning-plasma scales using advanced Bayesian reinforcement learning methods that incorporate prior information into efficient inference algorithms; and (2) validating these models on the world's major tokamak experiments (DIII-D in the US, KSTAR and EAST in Asia, JET in Europe, and Japan's large superconducting device JT-60SA, which precedes ITER) so that the components of an integrated plasma prediction model are systematically and thoroughly diagnosed and validated. Ideally, this will result in a mature, AI-enabled, comprehensive control system for ITER and future reactors that can be integrated with a full pilot-plant system model.

The key challenges now are to deliver significantly improved prediction with >95% accuracy and to provide warnings early enough for disruption avoidance/mitigation strategies to be applied before fatal damage is done to ITER. Significant progress has been made in adopting deep learning with recurrent networks and CNNs, as exemplified by the "FRNN" deep learning code at Princeton University, to enable rapid analysis of large, complex datasets on supercomputing systems. In this connection, tokamak disruptions have been predicted with unprecedented accuracy and speed. The paper (and the extensive references cited therein) describes the FES data representation of physics features (density, temperature, current, radiation, fluctuations, etc.), the frame-level (event-based) accuracy needed to account for the required "zero-D" (scalar) and higher-dimensional signals, and the nature of key plasma experiments featuring detectors/diagnostics with real-time resolution recorded at manageable data rates. Rough projections suggest that ITER will require the processing and interpretation of vast amounts of complex spatial and temporal data. Since simulation is another important aspect of ITER data analysis, advanced compression methods will need to be implemented to cope with the large computational costs involved. More generally, real-time predictions grounded in true first-principles simulations are important for gaining insight into the properties of instabilities and the dynamics of the particle phase space, which necessitates the development of AI-based "surrogate models". For example, the well-established HPC "gyrokinetic" particle-in-cell simulation code GTC [278] can accurately simulate plasma instabilities; data preparation and surrogate model training, e.g. "SGTC", is a clear example of the modern task of integrating high-performance computing (HPC) predictive simulations with AI-enabled deep learning/machine learning campaigns. These considerations further illustrate and motivate the need to integrate HPC and big-data ML approaches to facilitate the delivery of scientific discoveries. Finally, the cited paper describes the first adaptive predictive DL software trained on a leading supercomputing system to accurately predict disruptions across different tokamak devices (DIII-D in the US and JET in the UK). The software has the unique statistical capability of accurately predicting disruptions on an unseen device (JET) through efficient "transfer learning", having been trained on a large database from a single experiment (DIII-D). More recently, the FRNN inference engine was installed in the real-time plasma control system at the DIII-D tokamak facility in San Diego, California. This paves the way for an exciting transition from passive disruption prediction to active real-time control and the subsequent optimization of reactor scenarios.
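As a rough illustration of this kind of time-series disruption predictor (not the FRNN code itself), the sketch below shows a recurrent network that consumes multi-channel scalar plasma signals frame by frame and emits a per-frame disruption score; the channel count, hidden size, and alarm logic are assumptions for illustration only.

```python
# Hedged sketch (not FRNN): an LSTM consumes multi-channel scalar plasma
# signals frame by frame and emits a per-time-step disruption score.
# Channel count, hidden size, and any thresholding are illustrative.
import torch
import torch.nn as nn

class DisruptionPredictor(nn.Module):
    def __init__(self, n_signals=14, hidden=64):
        super().__init__()
        self.rnn = nn.LSTM(n_signals, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                    # x: (batch, time, n_signals)
        h, _ = self.rnn(x)
        return torch.sigmoid(self.head(h))   # per-frame disruption probability

model = DisruptionPredictor()
shot = torch.randn(1, 500, 14)               # one shot, 500 time frames
alarm = model(shot)                          # raise an alarm when the score crosses a threshold
```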

Machine Learning for Wireless Networking and Edge Computing

Researchers have proposed various learning algorithms to use artificial neural networks to perform specific wireless resource management tasks. Some of the first proposals to train NNs to perform transmit power control employed supervised learning. More recent proposals employ deep reinforcement learning approaches that better deal with channel and network uncertainty and require little prior training data. Much of the research has focused on the integration of edge computing and deep learning. Specifically, there is work on federated learning, where participants collaboratively train a model instead of sending all data to a central controller for learning purposes. These studies essentially end at the simulation stage, as there is no practical ML/AI solution that is fast and computationally efficient. Specifically, the research challenge is to develop a computing platform that can run complex ML models on very fast timescales (<10ms) and that can be implemented in small cell access points. One project that could have a very large impact is the mapping of intelligent radio resource management algorithms into FPGA devices suitable for deployment in large networks of connected and interfering access points. Another interesting project is to build a federated learning system that performs time-sensitive ML for IoT devices that experience delays when transferring data to a central computing facility. This opens up a whole new world of possibilities for low-cost closed-loop IoT devices in areas such as healthcare, smart buildings, agriculture, and transportation.

Main areas of overlap

Real-time, accelerated AI inference is expected to improve the discovery potential of current and planned scientific instruments across a variety of fields. Designing high-performance, specialized systems for real-time/accelerated AI applications requires paying particular attention to how ML algorithms benefit the domain of interest. This may be governed by latency per inference, computational cost (e.g., power consumption), reliability, security, and the ability to operate in harsh environments (e.g., under radiation). For example, at the Large Hadron Collider the trigger must capture rare events with latencies of around 100 ns, while the analysis of multichannel ambulatory health monitors sampling at kilohertz rates cannot rely on wireless data transfer because of power limitations (transferring the data would require roughly 50 iPhone batteries per day) and security requirements. Materials spectroscopy produces data streams on the order of terabits per second that must be supported. In addition, real-time analysis of advanced scientific instruments requires uninterrupted allocation of computing resources, and sensitive patient information processed by wireless health devices must be protected. These characteristics provide quantitative guidelines for understanding the distinctions and similarities between domains and applications, allowing efforts to be coordinated around basic design principles and tools that address the needs of seemingly different domains. Proper data representation is an important first step in the design process, as it determines the choice of NN architecture to implement in real-time systems that must meet the performance goals described above.

data representation

The data representation used in a particular domain affects both the computational system and the data storage. Data representations across domains can be broadly classified into raw and reconstructed data, and the representation often differs depending on the stage of reconstruction and the upstream steps of the data processing pipeline. Existing applications include fully connected NNs that take preprocessed expert feature variables as input, and CNNs when the data are image-like. Domain-knowledge-inspired NN algorithms currently under development can further exploit expert features to improve accuracy and efficiency, as detailed below. To fully exploit the capabilities of advanced NNs and approach data preparation with minimal loss of information, more suitable representations of the raw data, e.g., point clouds, must be employed. Typical representations of raw data produced by various experimental and measurement systems include the following:

- Spatial Data. Used to describe physical objects in geometric space. There are two main types: vector data and raster data. Vector data includes points, lines, and polygons. Raster data is a grid of pixels, as in an image, but the pixels can also represent other measurements such as intensity, charge, or field strength.

- Point Clouds. A kind of spatial data created by collating a set of points in 3D space, usually representing a set of objects in that space.

- Temporal Data. Used to describe the state of a system or experiment at particular times; data collected in a defined order over time fall into this category. Time series data, in which samples are taken at regular time intervals, are a subset of this representation. As an example of time series data, Fig. 4 shows a supernova classification task.

- Spatio-Temporal Data. Measurements or observations of a system can be collected in both spatial and temporal dimensions. In that case, the data can be considered Spatio-temporal.

- Multispectral Data. Used to represent the output of multiple sensors that measure several bands of the electromagnetic spectrum. Multispectral representations are common in imaging with sensors sensitive to light of different wavelengths, and typically involve a few to ten spectral bands.

- Hyperspectral Data. Used to represent measurements from on the order of 100 spectral bands. Images collected from these narrow bands are combined into a so-called hyperspectral cube with three main dimensions: the first two describe the spatial arrangement (e.g., positions on the Earth's surface) and the third represents the complete spectral content at each "pixel" location.

Table 1 summarizes these data representations and the scientific application fields in which they arise, highlighting the representations that are particularly important in each field. The cost of data communication (latency) and the cost of data storage (acquiring and managing physical storage resources) are important issues, and highly optimized data analysis solutions are required especially in application areas that need real-time analysis and feedback. In applications that rely on hyperspectral data, the data input keeps growing across the electromagnetic spectrum, so fast data reduction is required. Applications that generate large point clouds require efficient compression of spatial data. In application domains with multispectral data of limited spatial resolution, ultra-fast reconstruction is required to enable real-time control feedback. Applications that require accurate analysis of streaming time-series data are forced to operate under very limited storage and communication resources due to privacy and security concerns or the limitations of the associated edge devices. Some current efforts to develop ML solutions for the data-processing front end focus on autoencoder-based compression engines, and ML-based dimensionality reduction for hyperspectral data is another direction receiving attention. Deep-learning-based approaches are also being studied for image reconstruction, with materials science among the most active areas in this regard.
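Since several of these front-end efforts center on autoencoder-based compression, a minimal sketch of such a compression engine is given below; the layer sizes and the 128-to-16 bottleneck are illustrative assumptions, not a deployed design.

```python
# Hedged sketch of an autoencoder-based compression engine: the encoder
# would run near the front end to reduce the data volume, the decoder
# offline. The 128->16 bottleneck is an illustrative assumption.
import torch
import torch.nn as nn

class CompressionAutoencoder(nn.Module):
    def __init__(self, n_inputs=128, n_latent=16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_inputs, 64), nn.ReLU(),
            nn.Linear(64, n_latent),   # compressed representation sent off-detector
        )
        self.decoder = nn.Sequential(
            nn.Linear(n_latent, 64), nn.ReLU(),
            nn.Linear(64, n_inputs),   # reconstruction used offline
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

model = CompressionAutoencoder()
readout = torch.randn(32, 128)
reconstructed, compressed = model(readout)   # train by minimizing reconstruction error
```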

Expert Feature DNNs

One direct approach to building powerful domain-specific ML algorithms is to start with expert domain features and combine them in a neural network or other multivariate analysis technique. Incorporating expertise in this way has the inherent advantage that the input features are interpretable, and correlations between features can provide insight into a specific task while optimizing performance. Furthermore, depending on the computational complexity of the domain features, such a machine learning approach can be more computationally efficient than using raw features directly. The disadvantage of using expert features, however, is that it relies entirely on how informative those features are. Therefore, much attention has been paid to automating the construction of informative new features from raw features. In image classification tasks, for example, significant progress has been made in extracting high-level representations of data using deep neural networks (DNNs), which build layers of neurons on top of the original input signal so that each new layer captures a more abstract representation of the data. Each layer builds new features by forming nonlinear combinations of the features in the layer below. This hierarchical approach to feature construction is effective at isolating sources of variability in the data and helps to build informative and meaningful representations. In astronomical images, for example, DNNs can start from low-level pixel information and gradually capture edges, motifs, and eventually entire objects (e.g., galaxies) to build up a complete picture of the universe. The same holds in other scientific fields: detecting particles in a large accelerator requires transforming low-level signals into the dynamic patterns attributable to particular particles, and in medical imaging a gradual understanding of global tissue patterns is needed to quickly identify abnormal tissue from low-level pixel information. The importance of transforming the initial input data into a meaningful abstract representation cannot be overemphasized; it is one of the most powerful properties of modern neural network architectures.
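To illustrate the expert-feature approach described above, here is a minimal sketch of a fully connected classifier over precomputed expert features; the feature count, layer widths, and binary signal-vs-background task are illustrative assumptions.

```python
# Minimal sketch of an expert-feature classifier: a fully connected network
# that takes a vector of physics-motivated features (masses, angles,
# isolation variables, etc.) rather than raw detector data. Sizes are
# illustrative assumptions.
import torch
import torch.nn as nn

expert_feature_classifier = nn.Sequential(
    nn.Linear(16, 64), nn.ReLU(),   # 16 expert features in
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 1),               # signal-vs-background score
)

features = torch.randn(8, 16)       # a batch of events described by expert features
scores = torch.sigmoid(expert_feature_classifier(features))
```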

Building highly abstract representations with DNNs presents several challenges. One challenge is to incorporate domain knowledge (e.g., physical constraints) into the neural network model. This is important both to reduce the large amounts of data required to train DNNs and to narrow the representation-bias gap between the model and the target concept. In situations where data are scarce but domain expertise is abundant, adding domain knowledge can speed up training and also improve the generalization performance of the model. Another challenge is to develop tools for model interpretability that describe the semantics of the representations embedded in each layer, which is difficult because of the distributed representation of information in the network architecture. Despite the lack of formal mechanisms for seamlessly integrating statistical models with domain knowledge, current approaches point in interesting directions, e.g., using knowledge to augment the training data or to modify the loss function; such possibilities have been an active area of research over the last few years. More generally, much research examines individual units and their activation patterns to understand what is being learned across the layers of neurons.

Frame-based image

Frame-based images are an adequate representation of experimental data in multiple domains, such as neutrino detection with time-projection chambers in particle physics. An example of this data representation, the electron deposition in the ProtoDUNE neutrino detector, is shown in Fig. 5: the spatial frame is formed by plotting the time coordinate "Tick" against the position of the wires in space. Recently developed neural network architectures exploit the sparsity of such images to reduce the computational effort and enable real-time/fast ML applications. HEP data and experimental data in many other fields can also be processed into frame-based images, although in many cases some loss of information is unavoidable.

point cloud

In HEP, point cloud data representations are often used to combine multiple frames of event-based measurements collected by a large number of detectors into a single dataset. In many HEP applications, point clouds represent particle jets, with data rates exceeding Pb/s. More broadly, point clouds can capture events in 3D space and the interaction of moving parts in space; a point cloud visualization of the CMS detector at the LHC is shown in Fig. 6, where the remnants of a proton-proton collision produce sensor signals in a customized and optimized detector geometry and the points are rendered in space. Various types of scan-based image data can also be represented as point clouds: CT and PET scans in medical engineering, as well as virtual reality, use this representation, and point clouds are used in 3D scanners for product design, solid-object modeling, architecture, and infrastructure design. Many of these imaging tasks generate point clouds ranging in size from a few GB to the order of TB. Spatial properties are often exploited in domains that share the point cloud representation (e.g., HEP and biomedical imaging).

Multi-/Hyperspectral Data

Multispectral data is common to wireless health monitoring and wireless communication systems. In health monitoring and intervention systems, physiological sensors of different modalities are combined to create multispectral datasets; in wireless communications, multispectral data is used to understand signal interference and network traffic conditions. Both domains capture this data over time, so it also exhibits temporal characteristics, and the data volumes produced are relatively small compared to the other domains discussed in this article (in the range of hundreds of Mb/s to tens of Gb/s). Hyperspectral data is used in astronomy, medical imaging, and electron microscopy, and has been applied to materials design and discovery. An example of hyperspectral data in electron microscopy is shown in Fig. 7: an electron probe is rastered over the sample of interest and the diffraction pattern is captured on a pixelated detector, which records a large number of images as the probe is scanned across the sample. Applications in multi-messenger astronomy further emphasize the utility of combining observations from different detectors and telescopes into hyperspectral representations.

Time series data

Time series data are often found in experiments that observe dynamically evolving systems, such as synthesis processes in materials discovery or the temporal evolution of plasma states in fusion experiments. Examples include fast time-resolved measurements of materials science and physics features (density, temperature, current, radiation, fluctuations, etc.) and of the spatial profiles of evolving plasma states as a function of time.

In-situ diagnostics of time series data can provide warnings for early termination of experiments that show undesirable results in materials science, without the need for time-consuming and computationally expensive off-line analysis of the entire experiment.

Fig. 8 shows an example of accelerator control for the Fermilab Booster accelerator. In this application, the voltage of the magnet that guides the proton beam around the synchrotron is sampled at 15 Hz. In this study, a digital twin was built to simulate the Booster accelerator data. Real-time analysis of time series data is also important for reliably predicting and avoiding large-scale failures in fusion experiments.

system constraint



Software programmable coprocessor

Historically, the first attempts to address the computational needs of the problems discussed in this article used software-programmable systems. One concept being explored in the HEP community is the GPU-as-a-Service (GPUaaS) model, in which GPU- and TPU-based hardware accelerators are used as cloud computing resources for a variety of applications; this can be further extended to a Machine-Learning-as-a-Service concept. In these paradigms, a machine learning module is implemented to solve a physics problem and transferred to a GPU or TPU accelerator, which is then accessed by the local CPU "client" of the native experimental system.

One of the main system constraints is computational power, which, as far as neural network implementations are concerned, can be defined in terms of the number of floating-point operations. Real-time machine learning methods require ever-increasing computational power, since it directly affects the latency per task. Tasks may include triggering at the LHC, reconstructing events in accelerator experiments or astrophysics, materials synthesis, reconstructing images captured by an electron microscope, and so on. Extreme parallelism is desired, both to minimize latency and to maximize throughput. In processor-based systems, this can be addressed by increasing the size of the computing cluster.

Naturally, the cost of the facilities imposes a limit on the size of these clusters. Another limitation is the amount of available storage and the cost of moving data across the memory hierarchy. In most use cases, the latency associated with moving data from the front end (detectors, microscopes, sensors, etc.) dominates the total latency. One significant performance constraint is related to the utilization and subsequent latency of the network linking the front-end and the back-end. Current limitations on the speed of data movement make CPU/GPU cluster-based systems incapable of meeting real-time requirements.

Custom Embedded Computing

In addition to latency and throughput constraints, practical energy constraints are becoming more stringent. Specialized computing systems are being developed to meet challenging real-time needs. An increasingly attractive paradigm is to design components that are finely tuned and optimized for specific steps in the data capture workflow. These components can be mapped onto FPGA devices or designed and manufactured as application-specific integrated circuits (ASICs).

In the LHC and accelerator space, there have been many FPGA-based demonstrations of front-end data processing systems that achieve latencies in the microsecond range. These systems are responsible for tasks such as triggering, event reconstruction, and anomaly detection. Naive, direct implementations of the neural networks performing inference for these tasks cannot meet these latency requirements, since the maximum achievable FPGA clock frequency and the inference latency correlate with the resource utilization and occupancy of the device. The co-design techniques developed for these applications therefore focus on extreme quantization and pruning, which can keep resource usage as low as roughly 10% of the FPGA device in order to meet system constraints. With these optimizations, implementations with high inference accuracy have been achieved while satisfying the system constraints.

In other applications (e.g., accelerator control, biomedical, and health applications), the need for resource minimization is relaxed and the expectation of millisecond latency is less stringent. Thus, the focus of system design can shift from resource minimization to refinement of the algorithms being mapped. The inference models can now include deep learning models combined with advanced video and signal processing engines, as well as local privacy-preserving processing tasks.

If the system is even more constrained, a custom ASIC solution may be needed in addition to, or instead of, an FPGA device. ASICs can support extreme form factors and the tight integration of sensing and computation (e.g., smart photon detectors) in compact front-end devices, along with tight integration with mixed-signal and analog functions, radiation-tolerance requirements, and ultra-low energy budgets.

State of the Art in Technology

This section provides an overview of the technologies and methods used to construct fast ML algorithms. This requires co-design: developing algorithms with the hardware in mind and providing efficient platforms for programming that hardware.

A systematic method for efficient deployment of ML models

As mentioned before, most ML problems in science require low latency, often with constrained resources. However, most current state-of-the-art NN models have very high latency, large memory footprints, and high energy consumption. To avoid this, practitioners have had to fall back on suboptimal models with non-ideal accuracy (e.g., shallow NNs).

To solve this problem, there is a large body of literature that focuses on making NN models more efficient (in terms of latency, memory footprint, and energy consumption). These efforts can be broadly categorized as follows.

(i) design of new efficient NN architectures, (ii) co-design of NNs and hardware, (iii) quantization (low-precision inference), (iv) pruning and sparse inference, and (v) knowledge distillation.

Design of a new efficient NN architecture

One line of research focuses on finding NN models that are efficient by design. A notable early work is SqueezeNet, a compact NN model that avoids expensive fully connected layers.

Using a new lightweight Fire module, the model was 50 times smaller than AlexNet while matching its accuracy. Subsequently, several innovations were made in efficient NN architecture design. One focus has been finding efficient layers/operators; notable examples include group convolutions, depthwise convolutions, spatially separable convolutions, shuffle layers, and shift convolutions, to name a few.
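As an example of one of these efficient operators, the sketch below shows a depthwise-separable convolution, i.e., a per-channel (depthwise) convolution followed by a 1x1 (pointwise) convolution; the kernel size and channel counts are arbitrary choices for illustration.

```python
# Sketch of a depthwise-separable convolution: a per-channel (depthwise)
# convolution followed by a 1x1 (pointwise) convolution, which costs far
# fewer multiply-adds than a full convolution with the same channels.
import torch.nn as nn

def depthwise_separable_conv(in_ch, out_ch, kernel_size=3):
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, kernel_size, padding=kernel_size // 2,
                  groups=in_ch),        # depthwise: one filter per input channel
        nn.Conv2d(in_ch, out_ch, 1),    # pointwise: mixes information across channels
    )
```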

Another focus has been finding alternatives to the Fire module that are more efficient and yield better accuracy/generalization. Notable examples include residual networks (originally designed to address the vanishing gradient problem, but generally more efficient than non-residual architectures), densely connected networks, squeeze-and-excitation modules, and inverted residual blocks.

These classical methods mostly found new architectural modules through manual design. Since this is not scalable, recent approaches propose automated methods based on neural architecture search (NAS). NAS automatically finds an appropriate NN architecture for given constraints on model size, depth/width, and/or latency. The high-level approach is to train a probabilistic SuperNet that contains all possible combinations of NN architectures within the specified constraints, with learnable probabilities. After this SuperNet is trained, architectures can be sampled from the learned probability distribution. Notable work includes RL-based methods, efficient NAS, MNasNet, DARTS, and differentiable NAS.
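The core idea behind differentiable NAS can be sketched as follows: each candidate operation on an edge is weighted by a learnable architecture parameter, and the weighted sum is trained end to end. The candidate set and sizes below are illustrative assumptions, not any specific published search space.

```python
# Hedged sketch of the differentiable-NAS (DARTS-style) idea: each "edge"
# computes a softmax-weighted sum of candidate operations, and the
# architecture weights alpha are learned jointly with the model weights.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=1),   # candidate: 3x3 conv
            nn.Conv2d(channels, channels, 5, padding=2),   # candidate: 5x5 conv
            nn.Identity(),                                 # candidate: skip connection
        ])
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))  # architecture weights

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.ops))

# After training, the operation with the largest alpha is kept on each edge.
```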

NN and hardware co-design

Another promising direction is to tailor NN architectures to specific hardware platforms and/or to co-design the two together. This is particularly promising for configurable hardware such as FPGAs. Hardware-aware NN design matters because the cost of performing different types of operations varies from one hardware platform to another: for example, hardware with a dedicated cache hierarchy can execute bandwidth-bound operations much more efficiently than hardware without one. Notable work in this area includes SqueezeNext, where both the NN and the hardware accelerator were co-designed with a manual tuning approach. More recent work proposes automating this hardware-aware design via NAS; notable examples include ProxylessNAS, OnceForAll, FBNet, and MobileNetV3.

Quantization (low-precision inference)

A common solution is to compress the NN model with quantization, in which low bit precision is used for the weights/activations. A notable work here is Deep Compression, which used quantization to shrink the footprint of the SqueezeNet model mentioned above, making it 1/500th the size of AlexNet. Quantization reduces the model size without changing the original network architecture, and it potentially allows the use of low-precision matrix multiplication or convolution, so both the memory footprint and the latency can be improved.

Quantization methods can be broadly classified into two categories: post-training quantization (PTQ) and quantization-aware training (QAT). In PTQ, a pre-trained single-precision model is quantized to low precision without fine-tuning or retraining. These methods are therefore usually very fast and, in some cases, do not even require training data. However, PTQ often leads to a large loss of accuracy, especially for low-precision quantization.

To deal with this, some quantization methods employ QAT, which allows the model to be retrained after quantization to adjust the parameters. This approach often results in higher accuracy but requires more time to retrain the model.

Another distinction is between simulated quantization (also known as fake quantization) and integer-only quantization. In the former, the weights/activations are stored at low precision but cast back to high precision during inference; in the latter, no such cast occurs and the multiply-accumulate operations are also performed at low precision. Integer-only quantization has the advantage of not only reducing the memory footprint of the model but also speeding up inference by using low-precision logic for multiplication and addition.
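A hedged sketch of simulated ("fake") quantization is shown below: values are rounded to a low-precision grid and immediately dequantized, so downstream arithmetic still runs in floating point, whereas integer-only quantization would keep the integer values and use integer multiply-accumulate throughout. The 8-bit affine scheme here is just one common choice, not a prescription from the paper.

```python
# Hedged sketch of simulated ("fake") quantization: values are rounded to a
# low-precision grid and immediately dequantized, so downstream arithmetic
# still runs in floating point.
import torch

def fake_quantize(x, num_bits=8):
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (x.max() - x.min()).clamp(min=1e-8) / (qmax - qmin)
    zero_point = qmin - torch.round(x.min() / scale)
    q = torch.clamp(torch.round(x / scale + zero_point), qmin, qmax)  # integer grid
    return (q - zero_point) * scale                                    # back to float

w = torch.randn(64, 64)
w_q = fake_quantize(w)   # same shape/dtype, but only 256 distinct levels
```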

Another direction is hardware-aware quantization. As with NN architecture design, quantization can be tuned to a specific hardware platform. This is especially important for mixed-precision quantization, because certain operations in an NN model benefit more from low-precision quantization than others, depending on whether they are bandwidth-bound or compute-bound. The optimal precision settings therefore need to be determined based on the trade-off between the potential footprint/latency gain and the sensitivity to accuracy degradation, as outlined in Fig. 9.

Pruning and sparse inference

Another approach to reducing the memory footprint and computational cost of NNs is to apply pruning, which can be thought of as quantization to zero bits. In pruning, neurons with low saliency (sensitivity) are removed, resulting in a sparse computational graph; neurons with low saliency are those whose removal has a minimal effect on the model's output/loss function. Pruning methods can be broadly classified into unstructured and structured pruning. Unstructured pruning removes individual neurons/weights without imposing any structure. This approach can remove most of the NN parameters with little impact on the generalization performance of the model; however, it is difficult to accelerate and usually results in memory-bound sparse matrix operations. This can be addressed by structured pruning, where entire groups of parameters (such as output channels) are removed. The challenge here, however, is that aggressive structured pruning often leads to a significant loss of accuracy.

In both approaches, the key problem is deciding which parameters to prune. A simple and common approach is magnitude-based pruning, in which the magnitude of a parameter is used as the pruning metric; the assumption is that small parameters are unimportant and can be removed.
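A minimal sketch of unstructured magnitude-based pruning, assuming a simple per-tensor threshold chosen to reach a target sparsity:

```python
# Minimal sketch of unstructured magnitude-based pruning: zero out the
# fraction `sparsity` of weights with the smallest absolute value.
import numpy as np

def magnitude_prune(w, sparsity=0.9):
    threshold = np.quantile(np.abs(w), sparsity)
    mask = np.abs(w) > threshold
    return w * mask, mask                      # pruned weights and binary mask

w = np.random.randn(256, 256)
w_pruned, mask = magnitude_prune(w, sparsity=0.9)
print("remaining weights:", mask.mean())       # ~0.1
```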

An important problem with magnitude-based pruning is that parameters with small magnitudes can in practice be very sensitive. This is easily seen from a second-order Taylor series expansion, where the perturbation of the loss depends not only on the magnitude of the weight but also on the Hessian matrix. Hence, several studies use second-order-based pruning.
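To spell out the argument, for a trained loss L(w) and a weight perturbation δw (e.g., zeroing one weight), the second-order expansion reads:

```latex
\delta L \;\approx\; \nabla L(w)^{\top}\,\delta w \;+\; \tfrac{1}{2}\,\delta w^{\top} H\,\delta w,
\qquad H = \nabla^{2} L(w).
```

Near a well-trained minimum the gradient term nearly vanishes, so the loss change is governed by the Hessian term rather than by the weight magnitude alone, which is what motivates Hessian-aware (second-order) pruning criteria.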

Finally, it is worth noting that pruning and quantization can be combined to compress the NN model; indeed, pruning can be seen as quantization to zero bits. A quantization-aware pruning method has been proposed and applied to problems in high-energy physics, reporting results that are better than pruning or quantization alone.

Knowledge distillation

Model distillation first trains a large model, which is then used as a teacher to train a compact student model. The key idea is to leverage the soft probabilities produced by the teacher, instead of the class labels alone, while training the student; these soft targets guide and assist the student's training.
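A commonly used formulation of this idea (a generic sketch; the specific works surveyed here may differ in detail) combines the usual cross-entropy on hard labels with a KL-divergence term between temperature-softened teacher and student outputs:

```python
# Minimal sketch of a knowledge-distillation loss: cross-entropy on the hard
# labels plus KL divergence between temperature-softened teacher/student logits.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)                      # standard T^2 rescaling of the soft term
    return alpha * hard + (1.0 - alpha) * soft
```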

Knowledge distillation methods largely differ in their choice of knowledge source. Some use the logits (soft probabilities) as the source of knowledge, while others leverage knowledge from intermediate layers. The choice of teacher model has also been well studied, including the use of multiple teachers to jointly supervise the student, and self-distillation without any additional teacher. Other efforts have applied knowledge distillation with different settings in different applications, including data-free knowledge distillation and combinations of knowledge distillation with GANs.

The main challenge for knowledge distillation methods is to achieve high compression ratios. Compared to quantization and pruning, which can typically maintain accuracy at 4x compression, knowledge distillation methods tend to show a non-negligible loss of accuracy at these compression levels. However, these two approaches are orthogonal, and recent work has shown that their combination can lead to high accuracy/compression. Current distillation methods have been applied mainly to classical ML problems, and few studies have explored their application to scientific AI problems.

Systematic Neural Network Design and Training

Currently, there is no analytical approach to finding a suitable NN architecture for a particular task and training dataset. Originally, the design of NN architectures was a manual task with intuitions that were mostly ad hoc. However, in recent years, there have been many innovations in automating the NN architecture design process, called Neural Architecture Search (NAS).

NAS can be viewed as a hyperparameter tuning problem, where the hyperparameters are the design choices of the NN architecture, such as width, depth, and operation type. The main challenge is that the search space over operation types scales exponentially with the number of layers. Restricting the search space therefore requires a fair amount of intuition about the NN architecture.

After limiting the search space, the general NAS process is as follows. Candidate architectures are sampled from the set of all possible architectures and trained for a few epochs on the training dataset. The resulting accuracy is used as a metric to evaluate how good each candidate architecture is, and the probability distribution over sampled architectures is then updated based on this reward. This process must be repeated for many candidate architectures, in some cases hundreds of thousands of them.
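The loop below caricatures this process with uniform random sampling and a placeholder train_and_evaluate function (both are assumptions for illustration); real NAS methods replace the uniform sampling with a learned distribution, an RL controller, or a differentiable relaxation.

```python
# Caricature of the generic NAS loop: sample a candidate architecture from a
# restricted search space, train it briefly, and keep the best one seen so far.
# `train_and_evaluate` is a placeholder for a short training run returning
# validation accuracy.
import random

SEARCH_SPACE = {
    "depth": [4, 8, 12],
    "width": [32, 64, 128],
    "op":    ["conv3x3", "conv5x5", "depthwise"],
}

def sample_architecture():
    return {k: random.choice(v) for k, v in SEARCH_SPACE.items()}

def train_and_evaluate(arch):        # placeholder: train for a few epochs
    return random.random()           # and return validation accuracy

best_arch, best_acc = None, -1.0
for _ in range(100):                 # real searches may need many more samples
    arch = sample_architecture()
    acc = train_and_evaluate(arch)
    if acc > best_acc:
        best_arch, best_acc = arch, acc
print(best_arch, best_acc)
```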

Essentially, this leads to another problem: tuning the optimization hyperparameters for each candidate architecture. For example, if a good architecture is sampled by the NAS but trained with suboptimal hyperparameters, its error will be high, reducing the probability that the NAS algorithm will sample that architecture again even though it actually has the desired properties.

As a result, scalability has become an essential concern for any procedure where "big data" is present. One major class of procedures for which scalability has become essential is numerical optimization algorithms, which are at the core of training methods. There is a large body of literature on the design of efficient numerical optimization/training methods and efficient NAS algorithms for searching for suitable NN architectures.

The goal on the optimization side is to design new methods that require fewer iterations to converge and are more robust to hyperparameter tuning. One notable advance here is the ability to apply second-order methods without explicitly forming the second-order operators. The performance and robustness of these methods exceed those of first-order optimization methods for classical ML problems (such as computer vision and natural language processing). Interestingly, recent results on Physics-Informed Neural Networks (PINNs) show that first-order methods are significantly inferior to (quasi-)second-order methods. This potentially presents an opportunity to adapt or redesign some second-order algorithms for scientific problems.

A similar goal for NAS algorithms is to reduce the number of candidate architectures that must be evaluated, with fewer manual restrictions and less hand-tuning of the search space. Another goal is to design transferable NAS algorithms that can be trained on small problems and then transferred to larger, more expensive ones.

In summary, the core of NN architecture design is a fast way to sample architectures (via NAS) and fast training of the sampled architectures (via fast and robust optimization).

Hardware Architecture: Conventional CMOS

Due to the rapid growth in the popularity of and demand for machine learning, it is increasingly important to design machine learning algorithms efficiently and, at the same time, deploy them on complementary, powerful hardware platforms. The compute and memory demands of NN deployments are enormous and are growing beyond what standard silicon-based semiconductors can scale to. The reasons behind the scalability challenges in the semiconductor industry are as follows. First, as we approach the end of Moore's Law, the cost of transistors is increasing exponentially due to the rising cost of chip design as technology nodes shrink (as published by Xilinx and Gartner in 2011). In addition, with the end of Dennard scaling, power density no longer stays constant across node generations, creating considerable thermal challenges. To mitigate the increased thermal density, chips are designed to power groups of transistors only conditionally, effectively throttling or "turning off" parts of the chip; this technique has come to be known as dark silicon.

Several disruptive approaches have been proposed to overcome these challenges and provide sufficient computing capability. Cerebras Systems, for example, has brought to market the first computer system to employ wafer-scale integration, with chips built from complete wafers rather than individual dies. Such a technology poses significant engineering challenges in power delivery, packaging, and cooling. Exploring another direction, foundries are investigating true 3D chip stacking, as presented by TSMC at Hot Chips 2019. Analog computing, quantum computing, and in-memory computing are also being investigated.

A low-risk approach focuses on moving away from the traditional von Neumann architecture, using specialization of the computing architecture to provide the necessary performance scaling and energy efficiency. With specialization, devices become increasingly heterogeneous. A variety of devices have emerged that attempt to address this problem in different ways. A key challenge is how to loop-transform and deploy algorithms to maximize data reuse and computational efficiency, minimize memory bottlenecks, and limit power consumption while meeting real-time requirements.

The choice of hardware type and quantity is usually governed by a set of constraints imposed by the computing environment, the workload type, and the data type. For large data-center deployments that handle many kinds of workloads, it is often necessary to combine multiple platforms to reduce the total cost of ownership (TCO) across all hardware platforms. Hence, owners of heterogeneous platforms increasingly need to think of their systems as large multiprocessor computers. In the case of deep learning hardware accelerators, these new computers usually take the form of CPUs paired with co-processors.

Classification of computer architectures for deep learning

Currently, there is a wide range of hardware architectures for deploying machine learning algorithms. They can be broadly categorized according to the following criteria:

  1. Basic types of calculation operations
  2. Specific support for certain numeric representations
  3. External memory capacity (mainly relevant for training workloads)
  4. External Memory Access Bandwidth
  5. Power consumption in terms of thermal design power (TDP)
  6. Level of architectural parallelism and degree of specialization

As shown in Fig. 10, in practice, we classify computing architectures into scalar processors (CPUs), vector-based processors (GPUs), and so-called deep learning processing units (DPUs).

These categories blur into one another to some extent. DPUs are specialized for this application domain; within them, we further distinguish spatial processing approaches from more general matrix- or tensor-based processors. DPUs can be implemented as ASICs or on FPGAs.

CPU

CPUs are widely used in ML applications and are mainly serial or scalar compute engines. They are optimized for single-threaded performance, use an implicitly managed memory hierarchy (multi-level caches), and, in their latest variants, support floating-point arithmetic (FP64 and FP32) as well as 8- and 16-bit integer formats via dedicated vector units. Assuming boost clock speeds (Cascade Lake, 56 cores, 3.8 GHz), the theoretical peak performance reaches 6.8 TOPs for FP64. For external memory, CPUs currently rely primarily on large DDR4 memory banks; their memory bandwidth is low compared to GPUs and other HBM-enabled devices. In terms of power consumption, CPUs sit at the upper end of the spectrum, with high-end devices reaching up to 400 W. In the embedded space, ARM processors are a popular solution, especially when performance requirements are very low or when specialized device variants provide otherwise-unsupported features; in particular, the Ethos family of processing cores is specialized for CNN workloads and thus falls into the DPU category described below. The advantages of CPUs are the generality of the hardware and the ease of programming afforded by a design environment that has matured over decades. As expected, this comes at the cost of lower peak performance and efficiency compared to more specialized device families. Regarding quantization, CPUs can exploit this optimization only for INT8 and INT16, where supported.
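The quoted FP64 peak is consistent with a back-of-the-envelope estimate, assuming, for illustration, two AVX-512 FMA units per core (8 FP64 lanes × 2 units × 2 operations per FMA = 32 FP64 operations per cycle per core):

```latex
56~\text{cores} \times 3.8~\text{GHz} \times 32~\tfrac{\text{FP64 ops}}{\text{cycle}}
\;\approx\; 6.8 \times 10^{12}~\text{ops/s} \;=\; 6.8~\text{TOPs}.
```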

GPU

GPUs are SIMD (single instruction, multiple data) vector processors that support smaller floating-point formats (FP16) and, more recently, fixed-point 8- and 4-bit integer formats, with both implicit and explicit mixed precision. Their ability to exploit the high degree of parallelism inherent in this application via increasingly specialized architectural features makes them among the best-performing devices on the market for DNN acceleration. Currently, there are up to three different execution units, namely CUDA cores, tensor cores, and the DLA, which do not operate on the workload simultaneously (at least not easily or by default). Therefore, instead of summing the peak performance of the different execution units, only the maximum among them is quoted.

For memory, the GPU utilizes specialized, highly pipelined GDDR memory. This reduces capacity but provides much higher bandwidth (up to 732GBps). For the same reason, some DPUs will also deploy HBM2. In terms of power consumption, GPUs are higher, up to 345W.

A general challenge for GPUs is that high utilization of their large compute arrays requires parallelism across inputs, so inputs must be grouped into batches before execution, which negatively affects end-to-end latency. In addition, GPUs have relatively high power consumption. For quantization, support is limited to the native data types. Finally, the software environment for GPUs, while not quite at the level of CPUs, is much more mature and easier to use than that of more specialized devices.

FPGA and ASIC

FPGAs and ASICs tailor the hardware architecture to the specifications of a particular application and can be adapted in every respect to meet the specific requirements of the use case. This may include I/O functionality and custom features, or tailoring to specific performance or efficiency goals. Whereas ASICs are completely hardened, FPGAs can be reprogrammed. This flexibility allows the cost of circuit design to be amortized over many applications, but at the expense of hardware resource cost and performance.

FPGAs are a popular choice for accelerating CNNs. Traditionally, FPGA computing fabrics consist of a sea of look-up tables (LUTs) interconnected via a programmable interconnect.

DPU

The term DPU (short for deep learning processing unit) refers to a new class of computing architecture dedicated to CNN acceleration. DPUs are customized for these types of applications in a variety of ways: the types of operations supported, direct support for tensors or matrices, custom data types and numeric representations, the macro-architecture, explicitly managed and specialized memory hierarchies, and the level of parallelism.

Matrix of Processing Elements (MPE)

The first type, shown on the left side of Fig. 11, consists of an MPE operating on a matrix or a high-dimensional tensor. The processing engine can be a simple MAC, a vector processor, or a more complex VLIW (Very Long Instruction Word) core that can support the concurrent execution of various instructions.

These implementations minimize hardware cost, maximize performance, and optimize efficiency by specializing for particular operations and precisions, using a dedicated instruction set and a customized memory system. However, the algorithm must be adapted to exploit these features in order to realize the performance benefits.

Spatial DPUs

The second type of DPU leverages spatial acceleration and takes advantage of the parallel processing of layers and branches. Popular examples are hls4ml and FINN. In this respect, the hardware architecture is more specific to the details of a particular deep learning topology. This is visualized on the right side of Fig. 11. The hardware architecture mimics the specific deep learning topology, and the input is streamed through the architecture. All layers are instantiated in a dedicated computational datapath. Each layer has a dedicated weight buffer, and the activation buffers between layers are FIFOs of minimal size. They buffer enough data to supply the next set of convolutions in the next layer. This is significantly more efficient and reduces the latency compared to the first type of DPUs or GPUs.

MPE-style DPUs and GPUs typically compute layer by layer. In this case, a batch of images must be buffered to extract the maximum amount of parallel computation (over inputs, IFMs, and OFMs) from the platform: the device buffers a batch of images before computing the first layer for all of them, then buffers all the intermediate results and computes the next layer, so the latency depends strongly on the batch size. As a result, spatial DPUs have an advantage with respect to latency. This level of customization is only feasible with programmable hardware architectures such as FPGAs, because FPGAs can adapt the hardware architecture to different use cases; it generally does not make sense for ASIC accelerators, since it would produce an ASIC that can only accelerate one specific topology, which is very limited in scope. One limitation of spatial architectures is scalability in the number of layers: each layer carries a resource overhead, and there is a maximum number of layers that can be instantiated in a single device, so some very deep CNNs may not fit on a single device. Microsoft's Brainwave project overcomes this limitation with a distributed approach that leverages spatial computing.

When spatial DPUs are used and the architecture is specialized for one specific CNN, the design can be further customized with respect to minimum precision, supporting only the bits needed in each CNN layer to achieve higher performance and efficiency. In an MPE, by contrast, the hardware must support the highest precision required anywhere in the network.

Concerning customized precision and spatial architectures, FINN pioneered the first binarized neural network accelerators, providing many proof points for customized low-precision implementations. This flexibility comes at the cost of programming complexity, and performance is generally difficult to characterize because it depends on the details of the implemented hardware architecture.

Further DPU variations

Besides the aforementioned spatial DPUs and MPEs, there are many more variants. For example, some exploit sparse computing engines, such as EIE and its successor ESE, SCNN, Cnvlutin, Cambricon-S, and Cambricon-X; these are the only architectures that can benefit from irregular sparsity. Another way to customize precision is to optimize CNNs at runtime: instead of a statically fixed low precision where the hardware runs at one precision for all variables, some approaches make the bit precision configurable at runtime, exploiting bit-level parallelism in the arithmetic operations. On the hardware side, runtime-programmable precision is particularly effective in bit-serial implementations. For example, Umuroglu et al. demonstrate with BISMO that bit-serial designs can deliver very attractive performance with minimal overhead on FPGAs, and Judd et al. show the same for ASICs with a prototype called Stripes. This concept applies to both MPE and spatial architectures but makes the most sense for MPEs.

Summary of conventional CMOS hardware architecture

We analyzed three categories of hardware architectures used for CNN inference: general CPUs, SIMD-based vector processors such as GPUs, and DPUs, which are architectures specifically designed to accelerate deep learning workloads. An overview of the architectures is given in Table 4.

Note that "ease of use" includes programmability and general usability of the computation kernel. The degree of specialization includes customization for operators, precision support, and topology. In summary, in the case of DPUs, we distinguish between tensor processors, which take advantage of the matrix of the processing engine, and spatial architectures, which can be further specialized for specific topologies using FPGAs.

CPUs are the most common solution but have high power consumption. GPUs and DPUs offer the best performance, though GPUs are more costly in energy. Spatial DPU architectures excel at low latency and provide the highest computational efficiency through maximum customization. While CPUs, GPUs, and MPE-style DPUs use a sequential layer-by-layer computational model, spatial DPUs execute all layers of the network concurrently. Hardened architectures (ASICs, CPUs, and GPUs) provide a fixed set of native data types, whereas FPGAs offer maximum flexibility and can adopt arbitrary precisions and numeric representations, taking full advantage of optimization through quantization; hardened approaches must instead default to the next higher supported precision in order to embed low-precision variables.

However, the programmability of FPGA fabrics also comes at a cost in speed and energy. All architectures can benefit from coarse-grained pruning optimization techniques. Only sparse execution engines can benefit from irregular pruning, such as synapse pruning. We also discussed the different deployment options. Many devices offer different power and operating modes as different compromises between throughput and power consumption to adapt to potentially very different optimization targets in different application settings.

Similarly, the batch size, the number of threads, and the stream size provide another compromise between throughput and latency, again to accommodate a variety of use cases. Finally, the table shows that more speculative approaches, such as Cerebras, can provide further performance scaling. Overall, each approach has its strengths and weaknesses, and the optimal solution depends heavily on the details of the particular use case.

Hardware/Software Codesign Examples: FPGA-Based Systems

FPGA Programming

FPGAs are configurable integrated circuits that offer a good trade-off between performance, power consumption, and flexibility compared with other hardware paradigms. However, programming an FPGA is a difficult and time-consuming task. Traditionally, FPGA programming has been the domain of hardware designers familiar with digital design and computer architecture, which creates a steep learning curve for software developers and other domain experts. To lower the barrier to entry, there is an increasing focus on designing FPGA hardware at higher levels of abstraction. As a result, a variety of approaches have brought FPGA development closer to the mainstream, allowing developers to design for FPGAs at a higher level using familiar languages such as C, C++, OpenCL, and in some cases C#. An important question arises here: what are the additional benefits of designing hardware at a higher level of abstraction? High-level languages (HLLs) contain a variety of constructs and design patterns that are more functionally expressive. In addition, the time spent verifying the design is an important factor: hardware description languages such as Verilog and VHDL are more verbose because they focus on the details of the final implementation, and the functional correctness of larger code bases is harder to verify. HLLs, on the other hand, are more compact and faster to simulate, so designers can perform more verification in the same amount of time. Despite these advances, FPGA programming remains complex, which has pushed academia and industry to develop new compilers, frameworks, and libraries to ease hardware design.

High-level synthesis and language

High-level synthesis (HLS), also called behavioral or algorithmic synthesis, is an automated design process that takes a functional description of the design as input and outputs an RTL implementation. It converts an untimed (or partially timed) high-level specification into a fully timed implementation. The HLS process begins by analyzing the data dependencies between the various operations in the functional description; this analysis produces a Data Flow Graph (DFG) representation. After generating the DFG, during the allocation phase, HLS maps each operation onto a hardware resource with delay and area characteristics. Next, HLS adds the notion of time to the design during the scheduling phase: scheduling takes the DFG operations and resources and, taking the delay information into account, determines the clock cycle in which each operation executes. In this step, sequential logic is inferred by adding registers between operations and creating a finite state machine.

In the last three decades, many HLS tools have been proposed. The tools have a variety of input languages, and even in the same input language, they perform a variety of internal optimizations and produce results of varying quality. The results show that each HLS tool can significantly improve performance once the designer learns the benchmark-specific optimizations and constraints. However, academic HLS tools have a higher learning curve because they are less focused on ease of use. Commercial HLS tools have advantages because of their better documentation, robustness, and design verification integration.

As for the input language of the HLS, most HLLs are variants of the C language. However, there are some limitations in generating hardware from a pure C specification. First, C does not have the concepts of timing and concurrency. The designer must rely on HLS tools to create clock-based timing. Similarly, the designer must either specify a concurrency model or rely on HLS to extract the parallelism between operations or processes. Second, C does not have a bit-precise data type. It provides only "native" data types such as char, int, and long whose size is a multiple of a byte. Third, it lacks the concept of hardware interfaces and communication channels. SystemC was adopted as an HLS language to address all these limitations.

However, SystemC has not yet fully penetrated the FPGA community. Another problem common to all C-based languages, including SystemC, is memory access and modeling. These languages have a flat memory model, and memory is accessed via pointers; either the HLS tool must determine how to implement memory in hardware, or the designer must use additional HLS directives or libraries to model the memory subsystem properly. Finally, within the family of C-based specification languages for HLS, the SYCL language has emerged. SYCL (pronounced "sickle") is an industry-driven standard for designing heterogeneous systems that adds parallelism to C++. SYCL programs perform best when combined with a SYCL-aware C++ compiler, such as the open-source Data Parallel C++ (DPC++) compiler.

Apart from the C variants, Bluespec is an open-source language for hardware description and synthesis based on SystemVerilog. It provides a level of abstraction with clean semantics that emphasizes architectural aspects and can be viewed as a high-level functional HDL in which modules are implemented as rules using SystemVerilog syntax. These rules, called guarded atomic actions, express behavior as concurrently cooperating finite state machines (FSMs). Another language gaining traction among FPGA designers is Chisel. It is based on Scala and supports hardware definition using highly parameterized generators as well as object-oriented and functional programming. It compiles to RTL Verilog implementations and also supports HLS flows.

All of these languages have helped create efficient hardware and have greatly reduced development time, but they still require specific coding practices. The growth and diversification of application domains also expose the limitations of these programming languages, which has pushed the level of abstraction further toward domain-specific languages (DSLs). In recent years, a substantial corpus of DSLs and frameworks for FPGA design has emerged. DSL-based approaches allow users and tools to use domain knowledge to apply static and dynamic optimizations. Tables 5 and 6 show some of the DSLs and frameworks that have been developed over the years for FPGA computing, organized by application domain. While the approaches in the tables are diverse in terms of application, an interesting question is: what is the common denominator? To the best of our knowledge, most of them follow one of two routes: either the DSL specification is compiled directly into an RTL implementation, or a source-to-source compiler is used. In the latter case, the DSL compiler generates equivalent source code in a different programming language, such as C++, for a more standard HLS flow. In short, efforts to design better HLS compilers and languages are an important part of current FPGA research.

Integration of software and hardware

Running an application as software on a microprocessor is more accessible than designing and running dedicated hardware, but it may result in lower performance and higher power costs. On the other hand, partitioning an application into software and hardware components is itself a difficult task. This process, known as hardware/software codesign, divides the application between software running on the microprocessor and one or more custom hardware or co-processor components to achieve the desired performance goals.

Unsurprisingly, there is a great deal of research in this area. One book provides background on noteworthy aspects of older FPGA technologies while explaining the basic architectures and design methodologies of codesign. In addition, another comprehensive study evaluates and analyzes in detail the microarchitectural characteristics of state-of-the-art CPU-FPGA platforms, and a white paper describes most shared-memory platforms with detailed benchmarks.

The two major FPGA vendors, Xilinx and Intel, each have their own solutions. The Xilinx Runtime Library (XRT) is implemented as a combination of user-space and kernel driver components and supports both PCIe-based boards and MPSoC-based embedded platforms. Similarly, Xilinx SDSoC and SDAccel were released to the public in late 2015; the former works only with selected boards in the Zynq family of FPGAs, and the latter only with selected PCIe-based boards for OpenCL computing. Beginning in 2020, Xilinx introduced Vitis as a unified platform: the Vitis Unified Software Platform is a comprehensive development environment for building and seamlessly deploying accelerated applications on Xilinx platforms, including on-premise Alveo cards, FPGA instances in the cloud, and embedded platforms. In addition, Xilinx's recent work under the Versal flagship is another step toward codesigned applications. Intel offers the Open Programmable Acceleration Engine (OPAE), a library of APIs that lets programmers create host applications that take advantage of FPGA acceleration. Similarly, Intel oneAPI is an open, unified programming model built on standards to simplify the development and deployment of data-centric workloads across CPUs, GPUs, FPGAs, and other accelerators.

Apart from vendor solutions, academia and the open-source community have also tried to simplify the integration of applications, operating systems, and hardware acceleration. One book provides a historical review and summary of the ideas and key concepts for incorporating reconfigurable computing into operating systems, along with an overview of the operating systems published over the past 30 years that target reconfigurable computing. Similarly, the design and engineering of FPGA drivers that are portable across multiple physical interfaces (PCIe, Ethernet, optical links) remain an important part of HW/SW codesign research. The challenges arise from the variety of FPGA boards, the large number of interfaces, and diverse user requirements. Essentially, the FPGA driver must allow the designer to load or reconfigure the application bitstream and must support data transfer between the FPGA and the host.

A key engineering challenge is to investigate how to partition the driver functionality between hardware and software components. One growing research focus is the implementation of multiple queues in FPGA drivers. Benchmarks of various mainstream academic and vendor solutions for system-level drivers in the FPGA domain are also provided.

Despite the variety of existing OS and driver solutions, standardization remains an open problem. Industry-wide standardization would enable faster development, improved portability, and better (re)usability of FPGA applications. Work in this direction is already under way: standards efforts such as the CCIX Consortium and the Heterogeneous System Architecture (HSA) Foundation are making good progress.

The case for ML frameworks in FPGA design

Machine learning is one of the fastest-growing application domains, and FPGAs can meet its latency, throughput, and efficiency requirements by leveraging low-precision arithmetic, streaming dataflow implementations (introduced above as spatial architectures), fine-grained sparsity, and extensive customization of the hardware design; as a result, the demand for FPGA-based implementations is growing. To make these customizations accessible to a wide range of users and to reduce the substantial engineering effort involved, compilers and tools are needed that serve ML researchers and domain experts who want to use FPGAs. Two major ML frameworks, hls4ml and FINN, strive to fill this void. Building on the aforementioned tools, compilers, programming languages, and codesign solutions, both hls4ml and FINN have the potential to reach the broader scientific community. To better understand how such a tool flow works, we examine the FINN compiler in detail in the following paragraphs.

The FINN compiler is an open-source framework for generating spatial DPUs or streaming dataflow accelerators on FPGAs. The FINN compiler has a highly modular structure, as shown in Fig. 12.

It allows users to interactively generate architectures specific to a particular DNN. The framework provides a front-end, a transformation and analysis path, and multiple back-ends for exploring the design space in terms of resource and throughput constraints. Brevitas, a PyTorch library for quantization-aware training, is the front-end used in this work. It allows training DNNs using weights and activations that are quantized to a few bits and exporting the trained network to an intermediate representation (IR) used by the FINN compiler. The transformation and analysis paths help to generate an efficient representation of the DNN. Finally, the backend includes a code generator that creates synthesizable accelerator descriptions. This code generator can be implemented as a standalone Vivado IPI component or integrated into a variety of shells such as Xilinx Alveo boards or PYNQ embedded platforms.

To process it further, we first need to convert the DNN model into an IR for the FINN compiler. The front-end stage handles this by converting the PyTorch description into an IR called FINN-ONNX.

The IR is based on ONNX, an open-source exchange format that uses protobuf descriptions to represent DNNs. Several standard operators are included, and users can easily create their operators to customize the model. Nodes represent layers, and edges carry the output from one layer and are the input to another. The ability to customize the ONNX representation is used by the framework to add application-specific nodes and attributes. Each node is tagged with inputs, parameters (weights and activations), and output quantization, allowing quantization-aware optimization and mapping to optimized backend primitives for quantized computations. During the compiler flow, nodes are transformed into backend-specific variants via a series of transformation paths.
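To make the IR concrete, the sketch below uses the standard onnx Python package to walk a graph's nodes and edges; the file name is hypothetical, and FINN-ONNX layers custom node types and quantization annotations on top of this plain ONNX structure.

```python
# Minimal sketch of inspecting an ONNX graph: nodes are layers/operators and
# edges are the named tensors connecting them. The model path is hypothetical.
import onnx

model = onnx.load("model.onnx")
for node in model.graph.node:
    print(node.op_type, "inputs:", list(node.input), "outputs:", list(node.output))

# A FINN-style transformation pass would take such a graph, look for specific
# node patterns, rewrite them, and return the modified graph.
```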

The core principle of the FINN compiler is the graph transformation and analysis pass, which modifies or analyzes the IR of a model. A pass takes an IR graph as input and either (a) looks for specific patterns, modifies the graph accordingly, and outputs the modified graph, or (b) analyzes the graph to produce metadata about its properties. A sequence of transformations is applied to bring the model into a representation from which code, and finally a hardware accelerator, can be generated. The most relevant transformations are summarized below.

Although the PyTorch description of the network is mostly quantized, it still contains floating-point operations such as preprocessing, per-channel scaling, or batch-norm layers. To generate a hardware accelerator from the model, these floating-point operations must be absorbed into multi-threshold nodes so that the network becomes a functionally identical chain of integer operations. The transformation that achieves this is called streamlining, as described by Umuroglu and Jahre. During streamlining, floating-point operations are moved next to one another, collapsed into a single operation, and absorbed into the subsequent multi-threshold node.

Next, the high-level operations of the graph are lowered to simpler implementations that exist in FINN's HLS-based hardware library. For example, a convolution is lowered to a sliding-window node followed by a matrix-vector node, while pooling operations are implemented by a sliding window followed by an aggregation operator. The resulting graph consists of layers that can be mapped to the corresponding hardware building blocks. Each node corresponds to a Vivado HLS C++ function call from which an IP block can be generated.

Blocks for each layer can be generated using Vivado. The resources used by each hardware building block can be controlled via specific attributes passed from FINN to Vivado. For example, multiplications can be performed using LUTs or DSP blocks, and parameters can be stored in distributed RAM, block RAM, or ultra-RAM.

Finally, the folding process allocates compute resources to each layer and fine-tunes their parallelism to achieve the desired throughput with a balanced pipeline. To enable per-layer specialization without reconfiguration and to minimize latency, FINN creates dedicated hardware for each layer, interconnected with FIFO channels, so the outermost loop over the layers is always fully pipelined. Once the folding is specified, a resource estimate can be produced for each node. There are several ways to estimate the resources: even before IP blocks are generated from the HLS layers, an analytical model based on the concepts of the FINN-R paper can be used to estimate the resources per layer; it is also possible to extract estimates from Vivado HLS after IP generation, although these may differ from the resource usage of the final implementation due to synthesis optimizations.

The backend is responsible for creating the deployment package, using the IR graph and backend-specific information; this too is implemented with transformations. FIFOs, whose sizes can be determined automatically by the FINN compiler, are inserted between the layers to obtain the inference accelerator, and the individual IP blocks are then stitched together and synthesized. The stitched IP can be integrated manually into a system or inserted into an appropriate shell for the target platform. If the target platform is an Alveo card, the design is exported as a Vivado Design Checkpoint (DCP), followed by the generation and linking of a Xilinx Vitis object file.

Summary of hardware/software co-design and FPGA-based systems

In summary, CPUs are the most popular solution for CNN inference but have high power consumption. GPUs and DPUs offer the best performance, though GPUs are more expensive in terms of energy. FPGAs offer several trade-offs that may fit well in rapidly changing application domains: they can adopt arbitrary precisions and numerical representations, which provides maximum flexibility and makes the most of quantization, whereas hardened approaches must default to the next higher supported precision in order to embed low-precision variables. In addition, the spatial dataflow approach enables much lower latency. However, the complexity of FPGA programming limits deployment.

Tools such as hls4ml and FINN are frameworks created specifically for the ML domain that automate the end-user hardware generation process, hiding the associated design complexities of FPGAs and making them available for the aforementioned end applications.

Beyond-CMOS neuromorphic hardware

Efficient hardware implementations are urgently needed for rapidly growing machine learning applications. Most efforts focus on digital CMOS technology, such as implementations based on general-purpose TPUs/GPUs, FPGAs, and more specialized ML hardware accelerators. The steady improvement in the performance and energy efficiency of such hardware platforms over the last decade is due to the use of very advanced sub-10 nm CMOS processes and the overall optimization of circuits, architectures, and algorithms. The latter includes, for example, aggressive supply-voltage scaling, very deep pipelining, extensive data reuse in the architecture, and reduced precision for weights and activations. As a result, very compact state-of-the-art neural networks such as MobileNet, with 3.4M parameters and 300M multiply-accumulate operations per inference, can now be fully integrated on a single chip. However, progress on all of these fronts is saturating, and we can no longer rely on Moore's Law scaling.

On the other hand, ML algorithms are becoming increasingly complex, so further progress is essential. For example, transformer networks, the state-of-the-art approach for many ML tasks today, have hundreds of billions of parameters and can perform hundreds of trillions of operations per inference. Moreover, the functional performance of transformers typically improves with model size. Training such models requires massive data center scale resources (e.g., kilo TPU years), but performing inference on resource-constrained edge devices is very challenging.

The opportunity to build more efficient hardware may come from biological neural networks. Indeed, the human brain, with more than 1000 times as many synapses as there are weights in the largest transformer network, is believed to be extremely energy efficient and serves as a general motivation for developing neuromorphic hardware. CMOS neuromorphic circuits have a long history; however, to realize the full potential of neuromorphic computing, new device and circuit technologies beyond CMOS may be required to implement the various functions of biological neural systems more efficiently.

Analog Vector-by-Matrix Multiplication

The advent of high-density analog-grade nonvolatile memory over the past two decades has renewed interest in analog circuit implementations of vector-by-matrix multiplication (VMM), the most common and most frequently executed operation in neural network training and inference. In the simplest case, such a circuit consists of a matrix of memory cells acting as configurable resistors that encode the matrix (synaptic) weights, and peripheral sense amplifiers playing the role of neurons (Fig. 13). The input vector is encoded as voltages applied to the rows of the memory matrix, so the currents flowing into the virtually grounded columns correspond to the VMM result. Since addition and multiplication are performed at the physical level via Kirchhoff's and Ohm's laws, respectively, such an approach can be very fast and energy-efficient, provided the memory devices are dense and their conductances are tunable (i.e., multi-state). Part of the energy efficiency comes from performing the computation "in memory," which reduces the amount of data (the synaptic weights) moving across, or into and out of, the chip during the computation; such communication overhead can dominate the energy consumption of state-of-the-art digital CMOS implementations.
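Numerically, the physics reduces to a matrix-vector product: each column current is the sum of conductance-times-voltage contributions from its row inputs. The toy sketch below (with arbitrary values) just makes that correspondence explicit.

```python
# Toy illustration of an analog VMM crossbar: Ohm's law gives the per-device
# current I = G * V, and Kirchhoff's current law sums the currents of each
# (virtually grounded) column, so the column currents equal G^T @ V.
import numpy as np

G = np.random.uniform(1e-6, 1e-4, size=(128, 64))   # conductances (siemens), rows x columns
V = np.random.uniform(0.0, 0.2, size=128)           # input voltages applied to the rows

I_columns = G.T @ V                                  # one output current per column
print(I_columns.shape)                               # (64,)
```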

A general challenge for the practical application of such circuits, especially those based on the most promising emerging memory technologies, is the variation in I-V characteristics, such as the switching voltage that must be applied to change the memory state. In light of this challenge, the simplest application is an ex-situ-trained inference accelerator for firing-rate neural networks, i.e., so-called second-generation artificial neural networks (ANNs) with graded-response neurons. In such applications, the memory devices are updated infrequently, only when a new inference function needs to be programmed.

Thus, the conductance of the crosspoint devices can be tuned using slower write schemes that are more tolerant of device variations. For example, after the weights are determined in software, the memory cells are programmed one at a time with a feedback write-verify algorithm that can adapt to the unique I-V characteristics of each device. For the same reason, switching endurance, i.e., the number of times a memory device can be reliably reprogrammed, and write speed/energy are less critical. In addition, the VMM operations in many neural network inference tasks can be performed at moderate, sub-8-bit precision without loss of accuracy. This further relaxes the requirements on the analog properties and increases the tolerance to I-V non-idealities and noise.
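A feedback write-verify loop of the kind described here can be sketched as follows; read_conductance and apply_pulse are hypothetical stand-ins for actual device operations, and the tolerances and pulse amplitudes are illustrative only.

```python
# Sketch of a feedback write-verify loop for tuning one memory cell's
# conductance to a target value. `read_conductance` and `apply_pulse` are
# stand-ins for real device operations; numbers are illustrative only.
def tune_cell(target, read_conductance, apply_pulse,
              tolerance=0.02, max_iters=100):
    for _ in range(max_iters):
        g = read_conductance()
        error = target - g
        if abs(error) / target < tolerance:        # close enough: stop
            return g
        # Increase conductance with SET-like pulses, decrease with RESET-like
        # pulses; the pulse strength shrinks as the error shrinks.
        apply_pulse(polarity=1 if error > 0 else -1,
                    amplitude=min(1.0, abs(error) / target))
    return read_conductance()
```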

The most sophisticated neuromorphic inference circuits to date have been demonstrated with the more mature floating-gate transistor memories. Until recently, such circuits were implemented primarily with "synaptic transistors," which can be fabricated using standard CMOS technology, and several sophisticated, efficient systems have been demonstrated. However, the relatively large area of these devices (>10^3 F^2, where F is the minimum feature size) and the large interconnect capacitance lead to long time delays. Recent work has therefore focused on implementing mixed-signal networks with much denser (~40 F^2) commercial NOR flash memory arrays redesigned for analog computing applications.

For example, a prototype two-layer perceptron network with more than 100k cells, fabricated in a 180 nm process using modified NOR flash memory technology, is reported in Ref. It is highly reliable, shows negligible long-term drift and temperature sensitivity, and reproducibly classifies images from the MNIST benchmark set with approximately 95% fidelity, sub-1 µs latency, and less than 20 nJ of energy per pattern.

The energy-latency product was six orders of magnitude better than the best (at the time) 28nm digital implementation performing the same task with similar fidelity.

Recent theoretical work predicts that neuromorphic inference circuits could be implemented in much denser 3D-NAND flash memory, eventually scaling to densities of 10 terabits per square inch. In the long term, perhaps most promising are circuits based on metal-oxide resistive random access memory (ReRAM, also called metal-oxide memristors), especially passively integrated (0T1R) technologies. Indeed, thanks to their ionic switching mechanism, ReRAM devices with dimensions below 10 nm retain excellent analog characteristics and years-scale retention. In addition, the low-temperature fabrication budget allows monolithic vertical integration of multiple ReRAM crossbar circuits, further improving the effective density. The complexity of ReRAM-based neuromorphic circuit demonstrations is scaling up rapidly. However, ReRAM technology still needs improvement: in addition to high device variability, the remaining problems include high write currents and high operating conductances, which must be reduced by at least an order of magnitude to lower the large peripheral overhead.

The device requirements for training hardware accelerators are different and much more stringent. For example, weights are updated frequently but do not need to be retained for long periods. This permits the use of volatile memories in analog VMM circuits, such as interfacial memristors, solid-electrolyte memories based on electron trapping/de-trapping, or capacitor-based memories in which the stored charge controls the current through a crosspoint transistor. The most difficult challenges, however, are the much higher computational and weight precision required by training and the need for an efficient weight-update scheme, which places much tighter limits on acceptable device variation. An additional, related requirement is that the change in device conductance produced by a write pulse must not depend on the device's current state (the so-called linearity of the update); otherwise, accurate conductance adjustment would require sending a unique write pulse based on the current state of each device, which is hardly compatible with fast (parallel) weight updates.

Phase-change memory has also been investigated as a candidate for the adjustable resistors in analog VMM circuits; its main drawback, however, is the significant drift of the conductive state over time. 1T ferroelectric RAM devices have demonstrated high write endurance, high density (in integrated structures similar to vertical 3D-NAND), and long retention. Such devices hold much promise for training and inference accelerators, but their analog properties are probably inferior to those of ReRAM. A significant drawback of magnetic devices such as magnetic tunnel junction (MTJ) memories is their small on/off current ratio, which is insufficient for practical VMM circuits, together with the poor analog properties of scaled-down devices.

The possibility of using light to implement fast, large-fan-out interconnects and linear computations such as multiplication and addition has motivated research in photonic neuromorphic computing. Various implementation flavors, with both fixed and programmable functionality, have recently been proposed in the context of modern neural networks. Specifically, a system of multiple 3D-printed optical layers, each a mesh of regions (neurons) with specially chosen transmission/reflection properties, has been reported to perform pattern-classification inference similar to convolutional neural networks. By transmitting coherent light with the input encoded in its amplitude, useful computation is performed at the speed of light: the light is diffracted and interferes as it passes through the optical system and is ultimately steered to the region of the output layer corresponding to the pattern class. Optical neuromorphic systems with configurable weights have also been reported. In one, the input is encoded in the light's energy and the weights in the optical attenuation of PCM devices; the light passes through the PCM devices so that the products are computed. In another proposal, the input is encoded in the light's amplitude, and light from the inputs is coupled into and passed through a frequency-selective weight bank based on microring resonators (MRRs) with metal heaters to perform the multiplications; the MRR couplings (i.e., the weights) are controlled thermally by adjusting the current supplied to each MRR. In these reconfigurable implementations, the product accumulation (i.e., the summation in the VMM) is performed by integrating the photo-induced charge in photodetectors. A very aggressive time-division-multiplexing scheme has also been proposed to compute VMMs in which both the weights and the inputs are encoded in coherent light amplitudes: the input light is fanned out into n channels, combined with n optically encoded weights using beam splitters, and sent to n homodyne photodetectors, where the n products are computed in parallel. Finally, all-optical feedforward inference based on a Mach-Zehnder interferometer mesh uses a unitary decomposition of the weight matrix: the unitary transformations are implemented with optical beam splitters and phase shifters, and the diagonal matrix with optical attenuators.

In principle, sub-aJ energies and sub-ps latencies per multiply-and-add operation could be possible with optical computing. The main challenges, however, are the very large dimensions of optical components and the very high I/O overhead of converting signals into and out of the optical domain. Designs that rely on conversion to the electrical domain are particularly affected, because the low integration density of optical devices results in high electrical communication overhead; such designs have been shown to fall short of the system-level performance of (much denser) ReRAM-based circuits. Optical systems would ultimately benefit from very wide (>10,000-element) inner products and/or from deep time-division multiplexing to amortize the I/O overhead; however, possible nonlinearities in the charge integration and the practical usefulness of such wide inner-product computations remain unclear.

Stochastic Vector-by-Matrix Multiplication

Computations performed in the brain are stochastic; for example, substantially different neural responses are observed when the same stimulus is presented repeatedly. Such noisy operation is mimicked by stochastic neural networks, such as Boltzmann machines and deep belief networks. In the simplest case, such a network consists of binary neurons that compute a stochastic dot product. The stochastic functionality can be realized either on the synapse side or on the neuron side. In the latter case, the neuron first deterministically computes the dot product of its inputs and the corresponding weights; the result is then passed through a "stochastic" activation function, for example as the argument of a sigmoidal probability function that determines the probability of a high output. Because the ratio of synapses to neurons is large (more than 100), the efficient implementation of the deterministic dot product is critical for high-performance stochastic neural networks. However, previous work has shown that even the simplest neurons incur significant overhead when implemented with conventional circuits; in one neural network model, for example, they occupied 30% of the area and consumed 40% of the energy. Hence, there are clear benefits to neuromorphic hardware that realizes stochastic neurons efficiently.
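A minimal sketch of such a stochastic neuron, assuming a sigmoidal firing probability with an adjustable effective temperature:

```python
# Minimal sketch of a stochastic (Boltzmann-machine-style) neuron: a
# deterministic dot product followed by a sigmoidal probability of firing.
import numpy as np

rng = np.random.default_rng(0)

def stochastic_neuron(x, w, temperature=1.0):
    z = np.dot(w, x)                                  # deterministic dot product
    p = 1.0 / (1.0 + np.exp(-z / temperature))        # firing probability
    return rng.random() < p                           # stochastic binary output

x = rng.standard_normal(100)
w = rng.standard_normal(100)
print(stochastic_neuron(x, w))
```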

There are two main ways in which emerging devices can be used to realize stochastic functionality. The first is to exploit the dynamic or static I-V characteristics of the memory devices. In the dynamic case, the memory state itself switches in an inherently stochastic manner: in MTJ memories, thermal fluctuations cause probabilistic switching between the low-resistance parallel state and the high-resistance anti-parallel state, and the probability of the final state can be controlled by the spin-torque current; in phase-change memory (PCM), the reconfiguration of the atomic structure upon melt quenching is inherently stochastic. These phenomena have been proposed for realizing stochastic neurons with MTJs and PCM. The second approach exploits intrinsic and extrinsic current fluctuations in memory devices, such as random telegraph noise or thermal noise in ReRAM devices, or shot noise in nanoscale floating-gate transistors. In this approach, the noisy current flowing into the neuron is compared with a reference value, for example using a simple latch, thereby implementing a stochastic activation function.

The main problems with the first approach are the limited endurance of many memories and the drift of the stochastic switching properties upon repeated switching. Moreover, realizing scalable stochastic dot-product circuits this way requires integrating multiple memory device technologies, for example ReRAM-based artificial synapses with MTJ-based neurons. The second approach, in contrast, can be realized with an analog circuit using only ReRAM devices (Fig. 13) operated at a very low signal-to-noise ratio (SNR). Moreover, in such a circuit the SNR can be controlled by adjusting the readout voltage; by controlling the effective temperature (the slope of the sigmoidal probability function), stochastic annealing can thus be implemented in Boltzmann machines in a runtime-efficient manner. The disadvantage of the second approach is slower operation due to the low readout currents, which could be addressed by injecting external noise. Finally, the impact of the noise quality on functional performance is a common concern; although this issue has not yet been studied systematically, Gaussian thermal noise and shot noise should be more suitable for truly random operation.

Spiking Neurons and Synaptic Plasticity

Despite recent algorithmic advances, spiking neural networks (SNNs), which appear to be more biologically plausible, are still functionally inferior to simple ANNs. Even if simple ANNs remain superior, the development of efficient SNN hardware is justified by the need to interface with, and efficiently model, the brain. Another interesting feature of SNNs is their local weight-update rules, which require only information from the pre- and post-synaptic neurons and could therefore provide real-time training capabilities on large-scale neuromorphic hardware.

In the simplest SNN models, information is encoded in spike-time correlations, and the network function is defined by the synapses. In addition to VMMs, SNN hardware must implement a variety of neuronal and synaptic functions such as leaky integrate-and-fire (LIF) dynamics, short-term plasticity (STP), long-term potentiation (LTP), and spike-timing-dependent plasticity (STDP). LIF neurons mimic the dynamic processes of the neuronal cell membrane, while synaptic plasticity mimics the learning and memory mechanisms of biological networks. For example, STP is a transient change in synaptic strength that implements short-term memory: if the adjustment is not reinforced soon afterwards, the memory is lost and the synaptic weight returns to its original equilibrium state.
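A minimal software sketch of the LIF dynamics mentioned above is shown below (the membrane time constant, threshold, and drive current are illustrative assumptions, not values from the paper).

```python
# Minimal leaky integrate-and-fire (LIF) neuron sketch with illustrative parameters.
import numpy as np

def lif(input_current, dt=1e-3, tau_m=20e-3, v_rest=0.0, v_threshold=1.0, v_reset=0.0):
    """Integrate an input current trace; return the membrane trace and spike times."""
    v = v_rest
    spikes, trace = [], []
    for t, i_in in enumerate(input_current):
        v += dt / tau_m * (-(v - v_rest) + i_in)   # leaky integration of the membrane potential
        if v >= v_threshold:                        # threshold crossing -> emit a spike
            spikes.append(t * dt)
            v = v_reset                             # reset after the spike
        trace.append(v)
    return np.array(trace), spikes

current = np.full(200, 1.5)        # constant supra-threshold drive for 200 ms
_, spike_times = lif(current)
print("number of spikes:", len(spike_times))
```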

In contrast to this transient STP behavior, frequently repeated spiking stimuli lead to long-term memory, such as permanent potentiation through LTP mechanisms. STDP, in turn, strengthens synaptic efficacy when pre- and post-synaptic spikes occur in the expected causal temporal order and weakens it when they do not.
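The pair-based STDP window described above can be sketched as follows (the amplitudes and time constants are illustrative assumptions, not values from the paper).

```python
# Sketch of a pair-based STDP weight update as a function of spike timing.
import numpy as np

def stdp_dw(delta_t, a_plus=0.01, a_minus=0.012, tau_plus=20e-3, tau_minus=20e-3):
    """Weight change as a function of t_post - t_pre (in seconds)."""
    if delta_t > 0:   # pre before post: causal order -> potentiation (LTP)
        return a_plus * np.exp(-delta_t / tau_plus)
    else:             # post before pre: anti-causal order -> depression (LTD)
        return -a_minus * np.exp(delta_t / tau_minus)

for dt in (-40e-3, -10e-3, 10e-3, 40e-3):
    print(f"t_post - t_pre = {dt*1e3:+.0f} ms -> dw = {stdp_dw(dt):+.4f}")
```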

With conventional circuit technology, it is difficult to implement LIF neurons compactly with biological (millisecond-scale) integration times because of the large capacitors required. To solve this problem, circuits based on volatile memory have been proposed, in which the leaky integration is implemented with volatile switching mechanisms such as filamentary, interfacial, or Mott-insulator switching. In such implementations, the integrated current is encoded in the conductive state of the volatile memory device. Neuron spiking functionality has been demonstrated in threshold-switching (volatile) memory devices with S-type negative differential resistance (NDR) I-V characteristics. The general idea of this approach is similar to oscillator circuits based on S-type NDR devices connected to a resistor-capacitor circuit.

The transition from STP to LTP has been emulated in solid-electrolyte devices. Specifically, short, infrequent write pulses form thin filaments, which represent short-term memories because they are unstable and dissolve quickly. By repeating or lengthening the write pulses, thicker and more stable filaments can be formed, mimicking the transition to LTP. In addition, implementations of STDP windows using PCM and metal-oxide ReRAM have been proposed, in which the shapes of the pre- and post-synaptic write voltage pulses are carefully chosen.

Several small-scale spiking neuromorphic systems based on emerging device technologies have been demonstrated, including coincidence detection using the STDP mechanism with metal-oxide memories and temporal data classification using diffusive memristors. However, these advanced hardware demonstrations lag far behind the simpler ANN inference accelerators. The main reasons are that such applications demand more functionality from the emerging devices, and that the impact of device variability on SNN behavior and performance is more severe. For example, an SNN relies on spikes of fixed magnitude to update the conductances of many devices in parallel, so a small variation in a device's switching voltage can cause a large change in conductance and hence in the STDP characteristics. The implementation of simpler, ex-situ-trained ANNs is less challenging, as discussed earlier, because the write voltage amplitude can be adjusted independently for each device based on feedback obtained during conductance tuning.

Superconducting circuits based on rapid single flux quantum (RSFQ) logic are naturally suited for spiking circuits because information is encoded in SFQ voltage pulses. For example, Josephson junction (JJ) spiking neurons have been shown to operate at up to 50 GHz. However, historical challenges for such approaches include less mature fabrication technology, application constraints due to low-temperature operation, and the lack of efficient analog memory circuits. Photonic spiking neural networks and hybrid superconducting/optoelectronic neuromorphic circuits share some of the same challenges as the photonic neuromorphic inference approaches already discussed.

Reservoir Computing

Recurrent neural networks, such as the Google Neural Machine Translation model, are particularly well suited to processing sequential and temporal data because of their inherent memory. Reservoir computing (RC) networks are a special type of recurrent network, motivated by information processing in the cerebral cortex, that can be trained efficiently. Their main component is the reservoir, a nonlinear recurrent network that maps inputs into a high-dimensional spatio-temporal representation and has a fading memory of previous inputs and network states. The other component is a readout layer that maps the intermediate states to the outputs. All connections within the reservoir are fixed, and only the weights of the readout layer are learned. Because of this, and because of the sparse intermediate representations, fast online algorithms can be used to train such networks, which is the primary strength of this approach.
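To make this division of labor concrete, here is a minimal echo-state-network-style sketch (our own illustration of the RC idea, not code from the paper): the reservoir weights are random and fixed, and only a linear readout is fitted, here with ridge regression on a toy next-step prediction task.

```python
# Minimal echo state network sketch: fixed random reservoir, trained linear readout.
import numpy as np

rng = np.random.default_rng(2)
n_in, n_res = 1, 200

w_in = rng.uniform(-0.5, 0.5, size=(n_res, n_in))         # fixed input weights
w_res = rng.normal(size=(n_res, n_res))
w_res *= 0.9 / np.max(np.abs(np.linalg.eigvals(w_res)))    # spectral radius < 1 (fading memory)

def run_reservoir(u):
    """Map an input sequence u of shape (T, n_in) to reservoir states of shape (T, n_res)."""
    x = np.zeros(n_res)
    states = []
    for u_t in u:
        x = np.tanh(w_in @ u_t + w_res @ x)                 # nonlinear recurrent update
        states.append(x.copy())
    return np.array(states)

# Toy task: predict the next sample of a sine wave with a ridge-regression readout.
u = np.sin(np.linspace(0, 20 * np.pi, 2000))[:, None]
states, target = run_reservoir(u[:-1]), u[1:, 0]
w_out = np.linalg.solve(states.T @ states + 1e-6 * np.eye(n_res), states.T @ target)
print("train MSE:", np.mean((states @ w_out - target) ** 2))
```

Only w_out is learned here; the single linear solve in the last step is what makes online or incremental training of RC networks cheap compared with full backpropagation through time.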

Both the readout layer and the reservoir can be realized with the analog VMM circuits presented here, but there are also interesting cases where the reservoir is realized by nonlinear physics in superconducting, magnetic, or photonic devices. For example, spoken-vowel recognition has been demonstrated with a reservoir implemented in four MTJ-based spin-torque oscillators (STOs). In that demonstration, the temporal input corresponding to a spoken vowel is first transformed into the frequency domain and then mapped to DC bias currents applied to the MTJ devices. The reservoir exploits the nonlinear dependence of the STO frequency on the DC current and the history-dependent transient dynamics of the MTJ free-layer spins.

Various photonic reservoirs have also been proposed, for example exploiting the transient characteristics of optical systems with time-delayed feedback, or passively circulating light through waveguides, splitters, and combiners and obtaining a high-dimensional response through nonlinear conversion into the electronic domain. More recently, the dynamics of superconducting circuits have been studied for efficient and very fast reservoir implementations. Specifically, the proposed reservoir is based on a Josephson transmission line (JTL) formed by a chain of JJs: an input pulse at one end of the JTL causes a rapid cascade of junction phase slips that propagate SFQ pulses to the other end, and because the JJs modulate each other's currents, complex dynamical states are realized.

There are several general concerns with the RC approach. At the algorithmic level, RC performs poorly compared with state-of-the-art approaches, and without further algorithmic improvements it is unclear whether the benefits of online training can compensate for this handicap. The main concern with the various hardware implementations remains device variation, for example whether the hardware produces reproducible results when given the same input. In the case of magnetic devices, the coupling between devices is limited, which may reduce the effectiveness of the reservoir.

Hyper-Dimensional Computing / Associative Memory

Hyperdimensional computing circuits have recently been demonstrated in ReRAM and PCM devices. The low-level operation of hyperdimensional computing is closely related to that of associative and content-addressable memories. Specifically, at the heart of this approach is an associative memory array circuit that outputs the stored row that is closest, in Hamming distance, to the binary input vector serving as the search key. Assuming a symmetric binary representation with -1/+1 encoding, the Hamming distance is linearly related to the dot product: it equals half of the difference between the vector length and the dot product of the two vectors, i.e., d = (n − a·b)/2. Thus, the key operation of hyperdimensional computing is still a VMM. Once the VMM is complete, the result is passed to a winner-take-all circuit (a hard version of the softmax function), which selects the element with the minimum Hamming distance and discards all other outputs. As a further simplification, both the inputs and the weights of the VMM are binary.
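The relation between Hamming distance and dot product, and the VMM-plus-winner-take-all lookup, can be sketched as follows (the vector length, number of stored rows, and number of flipped bits are illustrative choices).

```python
# Sketch of associative-memory lookup for bipolar (+1/-1) hypervectors:
# Hamming distance d = (n - a.b) / 2, so nearest row = VMM + winner-take-all.
import numpy as np

rng = np.random.default_rng(3)
n, rows = 1024, 8
memory = rng.choice([-1, 1], size=(rows, n))      # stored hypervectors

key = memory[5].copy()
flip = rng.choice(n, size=100, replace=False)
key[flip] *= -1                                    # noisy search key (100 bits flipped)

dots = memory @ key                                # the VMM step
hamming = (n - dots) // 2                          # equivalent Hamming distances
winner = np.argmax(dots)                           # winner-take-all = minimum Hamming distance
print("recovered row:", winner, "Hamming distances:", hamming)
```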

In principle, a binary VMM can be implemented in hardware more efficiently than a fully analog one; as with binary neural networks, the apparent trade-off is a reduction in functional performance. Another feature of hyperdimensional computing is that it lends itself to fast "one-shot" or incremental learning, at the cost of much more redundant memory. Note that fast "one-shot" learning is not unique to hyperdimensional computing. For example, Hebbian learning and its many variants, used to train associative neural networks, are naturally incremental in that they have a recursive form and modify the weights based only on the current weight values and the new pattern being stored in the network.
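As a concrete example of such an incremental Hebbian rule, the following sketch stores bipolar patterns in a Hopfield-style associative network with a single outer-product update per pattern (the network size, number of patterns, and noise level are illustrative assumptions).

```python
# Sketch of incremental Hebbian storage: each new pattern updates the weights
# once, using only the current weights and the new pattern ("one-shot" per pattern).
import numpy as np

rng = np.random.default_rng(4)
n = 256
w = np.zeros((n, n))                                  # associative network weights

def store(w, pattern):
    """Hebbian one-shot update: outer product of the new bipolar pattern."""
    w += np.outer(pattern, pattern) / n
    np.fill_diagonal(w, 0.0)                          # no self-connections
    return w

patterns = rng.choice([-1, 1], size=(3, n))
for p in patterns:
    w = store(w, p)

noisy = patterns[0] * rng.choice([1, 1, 1, -1], size=n)   # roughly 25% of bits flipped
recalled = np.sign(w @ noisy)                              # one recall step
print("overlap with stored pattern:", int(recalled @ patterns[0]), "/", n)
```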

summary

Many new devices and circuit techniques are currently being explored for the realization of neuromorphic hardware. Considering the maturity of the technology, the practicality of the applications, and the competitive performance compared with conventional (digital CMOS) implementations, neuromorphic inference accelerators using analog in-memory computing with floating-gate memories are probably the closest to widespread adoption. Since most other proposals target functionally inferior algorithms, it is not easy to compare the performance of the different neuromorphic approaches; unless ML algorithms are significantly improved, or new applications emerge that can benefit from high-performance, low-precision neuromorphic hardware, this inferior functional performance may limit their utility. The main challenge to realizing the broader concept of neuromorphic computing continues to be the large variability in the behavior of the new devices.

outlook

In this review, we have presented some exciting applications of fast ML for enabling scientific discovery in a variety of fields. The field is developing rapidly, with many new studies and results being published frequently. It is still a relatively young field with a lot of potential, but also many unresolved issues. We hope that the discussion of scientific use cases and their overlaps, beyond those presented in this review, will inspire the reader to pursue further applications.

We then gave an overview of techniques for developing powerful ML algorithms that must operate in high-throughput, low-latency environments. This includes system design and training, as well as the efficient deployment and implementation of those ML models. The hardware implementations fall into two categories: current conventional CMOS technologies and more speculative beyond-CMOS technologies. For conventional CMOS, with the demise of Moore's law, the emphasis has recently shifted to advanced hardware architectures designed for ML. We gave an overview of common and emerging hardware architectures and their advantages and disadvantages. An important consideration for many of them is the co-design of a given ML algorithm for particular hardware, including the architecture and programmability of the algorithm; a particularly relevant and important hardware platform in this respect is the FPGA. Finally, we introduced beyond-CMOS technologies, which offer super-efficient and exciting approaches for implementing ML models. Although these techniques are speculative, they are expected to improve performance by orders of magnitude over conventional techniques.

ML training and deployment techniques and computer architectures are both very fast-moving fields, and new work is appearing at a pace that even this paper cannot keep up with. New techniques are being introduced in both areas, but of particular importance are the co-design of new algorithms for different hardware and the ease of use of the tool flows for deploying those algorithms. Innovation here will allow rapid and widespread adoption of powerful new ML hardware, and such practical considerations are equally important for beyond-CMOS technologies. In the near future, we will revisit these topics to see how quickly applications, ML technology, and hardware platforms advance and, most importantly, how their confluence can enable paradigm-shifting breakthroughs in science.
