Fast Machine Learning Applied To Science
3 main points
✔️ Accelerating the use of machine learning in scientific research that requires high-throughput, low-latency algorithms
✔️ Review emerging ML algorithms as well as the latest hardware/software
✔️ ML techniques themselves continue to be enhanced as they are applied to scientific problems
Applications and Techniques for Fast Machine Learning in Science
written by Allison McCarn Deiana (coordinator), Nhan Tran (coordinator), Joshua Agar, Michaela Blott, Giuseppe Di Guglielmo, Javier Duarte, Philip Harris, Scott Hauck, Mia Liu, Mark S. Neubauer, Jennifer Ngadiuba, Seda Ogrenci-Memik, Maurizio Pierini, Thea Aarrestad, Steffen Bahr, Jurgen Becker, Anne-Sophie Berthold, Richard J. Bonventre, Tomas E. Muller Bravo, Markus Diefenthaler, Zhen Dong, Nick Fritzsche, Amir Gholami, Ekaterina Govorkova, Kyle J Hazelwood, Christian Herwig, Babar Khan, Sehoon Kim, Thomas Klijnsma, Yaling Liu, Kin Ho Lo, Tri Nguyen, Gianantonio Pezzullo, Seyedramin Rasoulinezhad, Ryan A. Rivera, Kate Scholberg, Justin Selig, Sougata Sen, Dmitri Strukov, William Tang, Savannah Thais, Kai Lukas Unger, Ricardo Vilalta, Belinavon Krosigk, Thomas K. Warburton, Maria Acosta Flechas, Anthony Aportela, Thomas Calvet, Leonardo Cristella, Daniel Diaz, Caterina Doglioni, Maria Domenica Galati, Elham E Khoda, Farah Fahim, Davide Giri, Benjamin Hawks, Duc Hoang, Burt Holzman, Shih-Chieh Hsu, Sergo Jindariani, Iris Johnson, Raghav Kansal, Ryan Kastner, Erik Katsavounidis, Jeffrey Krupa, Pan Li, Sandeep Madireddy, Ethan Marx, Patrick McCormack, Andres Meza, Jovan Mitrevski, Mohammed Attia Mohammed, Farouk Mokhtar, Eric Moreno, Srishti Nagu, Rohin Narayan, Noah Palladino, Zhiqiang Que, Sang Eon Park, Subramanian Ramamoorthy, Dylan Rankin, Simon Rothman, Ashish Sharma, Sioni Summers, Pietro Vischia, Jean-Roch Vlimant, Olivia Weng
(Submitted on 25 Oct 2021)
Comments: Published on arxiv.
Subjects: Machine Learning (cs.LG); Hardware Architecture (cs.AR); Data Analysis, Statistics and Probability (physics.data-an); Instrumentation and Detectors (physics.ins-det)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Introduction
In the pursuit of scientific progress in many fields, experiments have become highly sophisticated to investigate physical systems at smaller spatial resolutions and shorter time scales. These orders of magnitude advances have led to an explosion in the quantity and quality of data, and scientists in every field need to develop new methods to meet their growing data processing needs. At the same time, the use of machine learning (ML), i.e., algorithms that can learn directly from data, has led to rapid advances in many scientific fields. Recent advances have shown that deep learning (DL) architectures based on structured deep neural networks are versatile and can solve a wide range of complex problems. The proliferation of large data sets, computers, and DL software has led to the exploration of different DL approaches, each with its advantages.
This review paper focuses on the integration of ML and experimental design to solve important scientific problems by accelerating and improving data processing and real-time decision-making. We discuss a myriad of scientific problems that require fast ML, and outline unifying themes between these domains that lead to general solutions. In addition, we review the current technology needed to run ML algorithms fast and present key technical issues that, if solved, will lead to significant scientific advances. A key requirement for such scientific progress is openness: experts from domains that historically have had little interaction must come together to develop transferable solutions and collaborate on open-source tools. Much of the progress in ML in recent years has been due to the use of heterogeneous computing hardware; in particular, the use of GPUs has made it possible to develop large-scale DL algorithms, and the ability to train large AI algorithms on large datasets has enabled algorithms that perform sophisticated tasks. In parallel with these developments, new types of DL algorithms have emerged that aim to reduce the number of operations needed, to achieve fast and efficient AI.
This paper is a review of the 2nd Fast Machine Learning conference and is based on material presented there. Figure 1 shows the spirit of the workshop series that inspired this paper and the topics covered in the following sections.
As ML tools have become more sophisticated, the focus has shifted to building very large algorithms that solve complex problems such as language translation and speech recognition. The applications are also becoming more diverse, and it is important to understand how each scientific field can be transformed to benefit from the AI revolution. This includes the ability of AI to classify events in real time, such as particle collisions or changes in gravitational waves, as well as the control of systems, such as feedback control of plasmas and particle accelerators. Constraints such as latency, bandwidth, and throughput, and the reasons behind them, vary from system to system. Designing a low-latency algorithm differs from other AI implementations in that it requires specific processing hardware to handle the task and improve overall algorithm performance. For example, scientific measurements may require ultra-low-latency inference, in which case the algorithm must be designed to take full advantage of the available hardware while meeting the experimental requirements.
Domain Application Exemplars
As scientific ecosystems rapidly become faster and larger, new paradigms for data processing and reduction need to be incorporated into the system-level design. While the implementation of fast machine learning varies widely across domains and architectures, the needs for basic data representation and machine learning integration are similar. Here we enumerate a broad sampling of scientific domains for seemingly unrelated tasks, including existing technologies and future needs.
Large Hadron Collider
The Large Hadron Collider (LHC) at CERN is the world's largest and most energetic particle accelerator, where bunches of protons collide every 25 nanoseconds. To study the products of these collisions, several detectors have been installed at interaction points along the ring. The purpose of these detectors is to measure the properties of the Higgs boson with high precision and to search for new physical phenomena beyond the Standard Model of particle physics. Due to the extremely high collision frequency of 40 MHz, the high multiplicity of secondary particles, and the large number of sensors, the detectors must process and store data at an enormous rate. The two multipurpose experiments, CMS and ATLAS, consist of tens of millions of readout channels, and their rates are on the order of 100 Tb/s. The processing and storage of these data is one of the most important challenges for the LHC physics program. The detector data processing consists of an online processing stage, where events are selected from a buffer and analyzed in real time, and an offline processing stage, where the data are written to disk and analyzed in more detail using sophisticated algorithms. In the online processing system, called triggering, the data rate is reduced to a manageable level of 10 Gb/s and recorded for offline processing. Triggers are typically multi-tiered. Due to the limited size of the on-detector buffer, the first level (Level 1 or L1) uses FPGAs or ASICs that can perform filtering operations with latencies of at most about 1 µs. In the second stage, the High-Level Trigger (HLT), a CPU-based computing farm located at the experimental site processes the data with a latency of up to 100 ms. Finally, full offline event processing is performed on a globally distributed CPU-based computing grid.
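The rate reduction described above can be sketched as simple arithmetic. The reduction factors below are illustrative placeholders chosen so that ~100 Tb/s at the front end ends up near the ~10 Gb/s recorded for offline processing; they are not official LHC figures.

```python
# Back-of-the-envelope sketch of the multi-tiered LHC trigger cascade.
# Rates and reduction factors are illustrative round numbers, not official figures.

def cascade(input_rate_tbps, reductions):
    """Apply successive trigger-level reduction factors to a data rate."""
    rates = [input_rate_tbps]
    for r in reductions:
        rates.append(rates[-1] / r)
    return rates

# Detector front-end: ~100 Tb/s. The L1 trigger (FPGA/ASIC, ~1 us latency)
# and the HLT (CPU farm, ~100 ms latency) each cut the rate by a large factor.
rates = cascade(100.0, [400, 25])  # assumed reduction factors
print([f"{r:.3g} Tb/s" for r in rates])  # 100 -> 0.25 -> 0.01 Tb/s = 10 Gb/s
```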
Maintaining the capacity of this system will become even more challenging in the near future: in 2027, the LHC will be upgraded to the so-called High-Luminosity LHC (HL-LHC), which will produce five to seven times as many particles per collision, and the total amount of data accumulated will be an order of magnitude greater than that achieved by current accelerators. At the same time, the particle detectors will become larger and finer-grained and will need to process data faster than ever. Thus, the physics that can be extracted from the experiments will be limited by the accuracy of the algorithms and the available computational resources.
Machine learning techniques offer promising solutions and capabilities in both areas, due to their ability to extract the most relevant information from high-dimensional data and their high degree of parallelization to appropriate hardware. Once the new generation of algorithms is implemented in all stages of the data processing system of the LHC experiment, it is expected to play a key role in maintaining and hopefully improving the performance of the physics.
Examples of applications to physics tasks at the LHC include event reconstruction, event simulation, heterogeneous computing, real-time analysis at 40 MHz, and the application of ML to front-end detectors. (Details are omitted.)
High-Intensity Accelerator Experiments
Machine Learning Base Trigger System in Belle II Experiment
The Neural Network z-Vertex Trigger (NNT) used in Belle II is a dead-time-free Level 1 (L1) trigger that identifies particles by estimating their origin along the beampipe. The entire L1 trigger process, from data readout to decision, has a real-time budget of 5 µs to avoid dead time. Because of the time required to pre-process and transmit the data, the NNT must produce its decision within 300 ns of processing time. The task of the NNT is to estimate the origin of a particle track so that it can be determined whether the track comes from the interaction point. For this purpose, a multilayer perceptron (MLP) implemented on a Xilinx Virtex 6 XC6VHX380T FPGA is used. The MLP consists of three layers: 27 input neurons, 81 hidden-layer neurons, and 2 output neurons. Data from the Central Drift Chamber (CDC) of Belle II is used for this task, since it is specialized for particle trajectory detection. The raw detector data are combined into 2D tracks called track segments, which are groups of adjacent active sense wires, before being processed by the network. The output of the NNT gives the origin of the track in the z-direction, along the beam pipe, as well as the polar angle θ. The z-vertex is used by the downstream Global Decision Logic (GDL) to determine whether the track comes from the interaction point, and the polar angle θ can also be used to estimate the momentum of particles. The networks used in the NNT are trained offline. The first networks were trained on plain simulation data because experimental data were not yet available; more recent networks use tracks reconstructed from experimental data. Training uses the iRPROP algorithm, an extension of the RPROP backpropagation algorithm. Current results show a good correlation between NNT tracks and reconstructed tracks. Because the event rate and background noise are currently within acceptable limits, the z-cut (the window on the estimated origin within which a track is kept) is set to ±40 cm.
However, this z-cut can be tightened as the luminosity increases and the associated background grows. Data preprocessing enhancements are planned for this year, now that the Virtex UltraScale-based Universal Trigger Board (UT4) is available for the NNT. This will use a 3D Hough transform to further improve efficiency; simulations have already shown that it yields more accurate resolution and wider solid-angle coverage.
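The 27-81-2 network shape described above can be sketched in a few lines. The weights below are random placeholders and the tanh activation is an assumption; the real NNT is trained offline (with iRPROP) and deployed on the FPGA.

```python
import numpy as np

# Minimal sketch of the NNT network shape described above: a 27-81-2 MLP
# mapping preprocessed CDC track-segment features to (z origin, polar angle).
# Weights are random placeholders; tanh is an assumed activation.

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(27, 81))   # input -> hidden
b1 = np.zeros(81)
W2 = rng.normal(scale=0.1, size=(81, 2))    # hidden -> output
b2 = np.zeros(2)

def nnt_forward(x):
    """One inference: x is a length-27 vector of preprocessed track features."""
    h = np.tanh(x @ W1 + b1)      # 81 hidden neurons
    return np.tanh(h @ W2 + b2)   # outputs: estimated z and theta

z_est, theta_est = nnt_forward(rng.normal(size=27))
print(z_est, theta_est)
```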
Material Discovery
In recent years, the materials science community has begun to embrace machine learning to facilitate scientific discovery. However, this has been problematic. The ability to create highly over-parameterized models to solve problems with limited data does not provide the generalization necessary for science and leads to false confidence in the results. Machine learning model architectures designed for natural time series and images are ill-suited to physical processes governed by equations. In this regard, there is a growing body of work that incorporates physics into machine learning models to serve as the ultimate regularizer. For example, rotational and Euclidean equivariances are being built into model architectures, and methods are being developed to learn sparse representations of the underlying governing equations. Another challenge is that real systems have system-specific discrepancies that need to be compensated for; for example, the slightly different viscosities of different batches of precursors must be taken into account. There is an urgent need to develop these fundamental methods for materials synthesis. Complementing these fundamental studies, there is a growing body of literature emphasizing machine learning for "in situ" spectroscopic analysis rather than "post-mortem" analysis. As these concepts mature, there will be an increasing emphasis on the codesign of synthesis systems, machine learning methods, and hardware for on-the-fly analysis and control. Such automated lab efforts are already underway in wet chemical synthesis, where dynamics are minimal and latency is not an issue. In the future, the focus will undoubtedly shift to controlling dynamic synthesis processes, where millisecond to nanosecond latencies are required.
Scanning probe microscope
In the field of materials science, machine learning is rapidly being introduced into scanning probe microscopy. Linear and nonlinear spectral unmixing techniques can rapidly visualize and extract information from these datasets to discover and elucidate physical mechanisms. The ease with which these techniques can be applied has raised legitimate concerns about the over-interpretation of results and the over-extension of linear models to highly nonlinear systems. More recently, a long short-term memory (LSTM) autoencoder was constrained to have a non-negative, sparse latent space for spectral unmixing. By scanning the learned latent space, it is now possible to draw out complex structure-property relationships. There is a huge opportunity to speed up the computational pipeline to enable microscopists to extract information on practical time scales: sampling rates of 100,000 spectra per second yield data streams of up to GB/s, so extracting even small amounts of information requires data-driven models, physics-based machine learning, and AI hardware. For example, a band-excitation piezoresponse force microscope measures the frequency-dependent response of a cantilever at rates of up to 2,000 spectra per second. To extract parameters from these measurements, the response must be fit to an empirical model. While fitting with least-squares methods yields a throughput of only about 50 fits per core per minute, neural networks can be used to speed up the analysis and process noisy data. This pipeline can be approximated and accelerated by deploying neural networks on GPU and FPGA hardware accelerators.
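The per-spectrum fitting bottleneck described above can be illustrated with a toy version of the empirical model fit. The simple-harmonic-oscillator amplitude model and all parameter ranges below are illustrative assumptions; a trained NN surrogate would replace this per-spectrum search with a single batched forward pass on a GPU or FPGA.

```python
import numpy as np

# Toy version of fitting a band-excitation cantilever response to a
# simple harmonic oscillator (SHO) amplitude model by brute-force least
# squares. Model and parameter ranges are illustrative assumptions.

def sho_amplitude(w, a0, w0, q):
    """SHO amplitude response: drive amplitude a0, resonance w0, quality factor q."""
    return a0 * w0**2 / np.sqrt((w0**2 - w**2)**2 + (w0 * w / q)**2)

w = np.linspace(250e3, 350e3, 128)          # drive frequencies (Hz)
spectrum = sho_amplitude(w, a0=1.0, w0=300e3, q=150.0)   # synthetic "measurement"

# Coarse grid search over (w0, Q); amplitude is solved in closed form per point.
best = (np.inf, None)
for w0 in np.linspace(280e3, 320e3, 81):
    for q in np.linspace(50.0, 300.0, 51):
        shape = sho_amplitude(w, 1.0, w0, q)
        a0 = spectrum @ shape / (shape @ shape)   # optimal linear scale
        err = np.sum((spectrum - a0 * shape)**2)
        if err < best[0]:
            best = (err, (a0, w0, q))

a0_fit, w0_fit, q_fit = best[1]
print(w0_fit, q_fit)   # recovers ~300 kHz, ~150
```

Repeating this search for every pixel is what limits least-squares throughput to tens of fits per core per minute; the surrogate amortizes it into one matrix multiply per batch of spectra.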
Fermilab Accelerator Control
Traditional accelerator control focuses on grouping like elements together so that specific aspects of the beam can be tuned independently. However, many elements are not completely separable; for example, magnets have higher-order magnetic fields that can affect the beam in unintended ways. Machine learning is finally making it possible to combine elements of readout and beam control that were previously thought to be unrelated, creating new control and coordination schemes. One such project is underway at the gradient magnet power supply (GMPS) of the Booster, which controls the main orbit of the beam in the Booster. This project aims to improve the regulation accuracy of the GMPS by a factor of 10. When completed, GMPS will be the first FPGA-based online ML control system at the Fermilab accelerator facility. The potential of ML for accelerator control was so evident to the Department of Energy that a call for ML-based accelerator control proposals was issued to the national laboratories. One of the two proposals submitted by Fermilab and approved by the DOE is the Real-time Edge AI for Distributed Systems (READS) project, which consists of two efforts. The first aims to use ML to control the slow spill in the delivery ring to Mu2e. The second addresses the long-standing problem of deblending beam losses in the Main Injector (MI) enclosure, which houses two accelerators, the MI and the Recycler, both of which carry high-intensity beams in normal operation; the goal is to deblend the losses generated by the two machines using real-time online models.
Both READS projects use online ML models on FPGAs for inference and collect low-latency data from distributed systems around the accelerator complex.
Neutrino and direct dark matter experiments
Accelerator Neutrino Experiment
DUNE uses machine learning in its triggering framework to process huge data rates and identify interactions, both for traditional neutrino oscillation measurements and for candidate solar and supernova events. Accelerator neutrino experiments have successfully applied machine learning techniques for years; in the first such example, in 2017, a network increased the effective exposure of the analysis by 30%. Networks intended to classify events are common in many experiments, and DUNE has recently published a network that can exceed design sensitivity on simulated data and includes an output that counts the number of particles in the final state of the interaction. These experiments have become increasingly aware of the risk that a network learns more features of the training data than intended, so careful construction of the training dataset is essential to reduce this risk. However, biases that are not yet known cannot be corrected or quantified. For this reason, the MINERvA experiment considered the use of a domain adversarial neural network to reduce unknown biases arising from differences between simulated and real data. This network has a gradient reversal layer in the domain network (trained on data), which prevents the classification network (trained on simulation) from learning features that behave differently between the two domains.
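The gradient reversal trick described above is simple to state in code. This is a conceptual sketch, not the MINERvA implementation: the forward pass is the identity, while the backward pass flips the sign of the incoming gradient (scaled by an assumed strength lambda), so shared features are penalized for helping the domain classifier separate simulation from data.

```python
import numpy as np

# Conceptual sketch of a gradient reversal layer for domain adversarial
# training: identity on the forward pass, sign-flipped gradient on the
# backward pass. lam is a hyperparameter controlling the reversal strength.

class GradientReversal:
    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        return x                       # identity on the forward pass

    def backward(self, grad_out):
        return -self.lam * grad_out    # reversed gradient flows upstream

layer = GradientReversal(lam=0.5)
x = np.array([1.0, -2.0, 3.0])
assert np.allclose(layer.forward(x), x)
print(layer.backward(np.ones(3)))      # [-0.5 -0.5 -0.5]
```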
In this area, there are further applications in neutrino astrophysics and the direct detection of dark matter.
Electron and Ion Collider
Accessing the physics of the Electron-Ion Collider (EIC) requires unprecedented integration of the interaction region (IR) and detector design. Seamless data processing from DAQ to analysis at the EIC would streamline workflows, for example by integrating the software for DAQ, online, and offline analysis, and would enable new software technologies, especially fast ML algorithms, at all levels of data processing. This is an opportunity to further optimize the physics reach of the EIC. The status and prospects of "AI for Nuclear Physics" were discussed at a 2020 workshop. Topics relevant to fast ML include intelligent decisions about data storage and (near) real-time analysis. Intelligent decisions about data storage are needed to reliably capture the relevant physics, and fast ML algorithms can improve the acquired data through data compactification, sophisticated triggers, and fast online analysis. At the EIC this includes automated data-quality monitoring as well as automatic detector alignment and calibration. Near-real-time analysis and feedback enable rapid diagnosis and optimization of experimental setups and greatly improve access to physics results.
Gravitational Waves
In recent years, machine learning algorithms have been explored in various areas of gravitational-wave physics. CNNs have been applied to the detection and classification of gravitational waves from compact binary coalescences, burst gravitational waves from core-collapse supernovae, and continuous gravitational waves. Recurrent neural network (RNN)-based autoencoders are being investigated to detect gravitational waves using unsupervised strategies, and FPGA-based RNNs are being explored to demonstrate the potential for low-latency detection. Applications of ML to searches for other types of gravitational waves, such as generic bursts and stochastic backgrounds, are also under investigation. In addition, probabilistic and generative ML models can be used for posterior sampling in gravitational-wave parameter estimation, achieving performance comparable to Bayesian samplers on simulated data while significantly reducing the time to completion. ML algorithms have also been used to improve the quality of gravitational-wave data and for noise subtraction: transient noise artifacts can be identified and classified by examining their time-frequency and constant-Q transforms, as well as LIGO's hundreds of thousands of auxiliary channels, and these auxiliary channels can also be used to subtract quasi-periodic noise sources. Although ML algorithms hold great promise for gravitational-wave data analysis, many of them are still at the proof-of-concept stage and have not yet been applied to real-time analysis. Current challenges include building a computational infrastructure for low-latency analysis, improving the quality of the training data (e.g., expanding the parameter space and using more realistic noise models), and quantifying the performance of the algorithms on longer data runs.
Biomedical Engineering
Many advances in ML algorithms bring performance improvements in both accuracy and inference speed, and some of the most advanced models are very fast. For example, YOLOv3-tiny, a popular object detection model that has been applied to medical imaging, can process images at more than 200 FPS on a standard dataset with reasonable accuracy. Currently, GPU- and FPGA-based systems, distributed networks of wireless sensors connected to cloud ML (edge computing), and ML models served over 5G and high-speed WiFi are being deployed in medical AI applications. ML models for the rapid diagnosis of stroke, thrombosis, colorectal polyps, cancer, and epilepsy have significantly reduced the time for lesion detection and clinical decision-making. Real-time AI-assisted surgery can improve perioperative workflows, video segmentation, detection of surgical instruments, and visualization of tissue deformation. Fast ML plays an important role in digital health, i.e., remote diagnosis, surgery, and monitoring.
Existing research has taken steps in various directions, but there is a growing need for ML approaches that can correctly sense health biomarkers and identify them quickly and accurately. Researchers have focused on developing novel sensing systems that can sense a variety of health behaviors and biomarkers. Historically, most of these new sensing techniques were tested in controlled environments, but more recently researchers have worked to ensure that these systems also work seamlessly in free-living environments. To achieve this, multiple ML models need to be developed, each adapted to specific situations and environments. New trends in this field rely on models that can run on-device and detect these behaviors quickly and accurately. In addition to enabling real-time interventions, on-device monitoring of these behaviors can help alleviate privacy concerns. However, because wearable devices themselves may not be able to process the data, several researchers have recently explored coordinated machine learning approaches.
Cosmology
CNNs are being applied on spherical surfaces to generate more accurate models for weak lensing maps and to remove noise from cosmic microwave background maps. A discovery and classification engine is also being developed to extract useful cosmological data from next-generation facilities. ML is also being used in cosmological simulations, to test new analyses and methods, and to lay the groundwork for the first operations of such new facilities.
Plasma Physics
The overarching goal here is to develop predictive plasma models of realistic disruptions and integrate them with state-of-the-art plasma control systems, providing the ability to design experiments before they are run so that discharges in ITER and future burning plasmas can be performed as efficiently and safely as possible. Verification, validation, and uncertainty quantification of the relevant components proceed as follows: (1) development of predictive neural-net models of plasmas and actuators that can be extrapolated to the scale of burning plasmas, using advanced Bayesian reinforcement learning methods that incorporate prior information into efficient inference algorithms; (2) validation of the models on the world's major tokamak experiments (DIII-D in the US, KSTAR in Korea, EAST in China, JET in Europe, and Japan's large superconducting device JT-60SA, which predates ITER) to systematically validate the components of an integrated plasma prediction model on well-diagnosed experiments. This should ideally result in a mature, AI-enabled, comprehensive control system for ITER and future reactors that can be integrated with a full pilot-plant system model.
The key challenges now are to provide significantly improved prediction methods with >95% prediction accuracy and to deliver warnings early enough to apply disruption avoidance/mitigation strategies before fatal damage is done to ITER. Significant progress has been made in the adoption of deep learning with recurrent networks and CNNs, as exemplified by the "FRNN" deep learning code at Princeton University, enabling rapid analysis of large and complex datasets on supercomputing systems; in this connection, tokamak disruptions have been predicted with unprecedented accuracy and speed. The paper (and the extensive references cited therein) describes the FES data representation of physical features (density, temperature, current, radiation, fluctuations, etc.), the frame-level (event-based) accuracy required to handle "zero-D" (scalar) and higher-dimensional signals, and the nature of key plasma experiments featuring detectors/diagnostics whose real-time outputs are recorded at manageable data rates. Rough projections suggest that ITER will require the processing and interpretation of vast amounts of complex spatial and temporal data. Since simulation is another important aspect of ITER data analysis, advanced compression methods will need to be implemented to cope with the large computational costs involved. More generally, real-time predictions based on true first-principles simulations are important for gaining insight into the properties of instabilities and the dynamics of the particle phase space, which necessitates the development of AI-based "surrogate models". For example, the well-established HPC "gyrokinetic" particle-in-cell simulation code GTC can accurately simulate plasma instabilities, and surrogates aim to do so in real time. Data preparation and surrogate model training, e.g.
"SGTC", are a clear example of the modern task of integrating high-performance computing (HPC) predictive simulations with AI-enabled deep learning/machine learning campaigns. These considerations further illustrate and motivate the need to integrate HPC and big-data ML approaches to facilitate scientific discovery. Finally, the cited paper describes the first adaptive predictive DL software trained on a leading supercomputing system to accurately predict disruptions across different tokamak devices (DIII-D in the US and JET in the UK). The software has the unique statistical capability to accurately predict the occurrence of disruptions on an unseen device (JET) through efficient "transfer learning", having been trained on a large database from a single experiment (DIII-D). More recently, the FRNN inference engine was installed in the real-time plasma control system at the DIII-D tokamak facility in San Diego, California. This paves the way for an exciting transition from passive disruption prediction to active real-time control and the subsequent optimization of reactor scenarios.
Machine Learning for Wireless Networking and Edge Computing
Researchers have proposed various learning algorithms that use artificial neural networks to perform specific wireless resource management tasks. Some of the first proposals trained NNs to perform transmit power control using supervised learning; more recent proposals employ deep reinforcement learning approaches, which better handle channel and network uncertainty and require little prior training data. Much of the research has focused on the integration of edge computing and deep learning. Specifically, there is work on federated learning, in which participants collaboratively train a model instead of sending all data to a central controller. These studies have essentially remained at the simulation stage, as there is no practical ML/AI solution that is both fast and computationally efficient enough. Specifically, the research challenge is to develop a computing platform that can run complex ML models on very fast timescales (<10 ms) and that can be implemented in small-cell access points. One project that could have a very large impact is the mapping of intelligent radio resource management algorithms onto FPGA devices suitable for deployment in large networks of connected, interfering access points. Another interesting project is to build a federated learning system that performs time-sensitive ML for IoT devices that experience delays when transferring data to a central computing facility. This opens up a whole new world of possibilities for low-cost, closed-loop IoT devices in areas such as healthcare, smart buildings, agriculture, and transportation.
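The federated learning idea above can be sketched as a FedAvg-style weighted average of client model weights; the client data sizes and parameter vectors below are toy values, not from any deployed system.

```python
import numpy as np

# Minimal sketch of federated averaging: each client trains locally and only
# model weights (not raw data) are sent to the aggregator, which forms a
# data-size-weighted average of the parameters.

def federated_average(client_weights, client_sizes):
    """Weighted average of per-client parameter vectors."""
    sizes = np.asarray(client_sizes, dtype=float)
    stacked = np.stack(client_weights)                  # (n_clients, n_params)
    return (sizes[:, None] * stacked).sum(axis=0) / sizes.sum()

# Three IoT clients holding different amounts of local data.
clients = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
global_weights = federated_average(clients, client_sizes=[10, 30, 60])
print(global_weights)   # -> [4. 5.]
```

Weighting by local data size means clients with more observations pull the global model harder, which is the usual FedAvg convention.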
Main areas of overlap
Real-time, accelerated AI inference is expected to improve the discovery potential of current and planned scientific instruments across a variety of fields. Designing high-performance, specialized systems for real-time/accelerated AI applications requires paying particular attention to how ML algorithms benefit the domain of interest, which may be governed by the latency per inference, the computational cost (e.g., power consumption), reliability, security, and the ability to operate in harsh environments (e.g., radiation). For example, at the Large Hadron Collider the trigger system must select rare events with a latency of about 100 ns. When analyzing multichannel ambulatory health monitors operating in the kilohertz band, wireless data transfer is not possible due to power limitations (transferring the data would require roughly 50 iPhone batteries per day) and security requirements. Material spectroscopy data streams on the order of terabits per second must also be supported. In addition, real-time analysis of advanced scientific instruments requires an uninterrupted allocation of computing resources, and sensitive patient information processed by wireless health devices must be protected. These characteristics and properties provide quantitative guidelines for understanding the distinctions and similarities between domains and applications, allowing us to coordinate efforts toward basic design principles and tools that address the needs of seemingly different domains. Proper data representation is an important first step in the design process, as it determines the choice of NN architecture to implement in real-time systems that must meet the performance goals described above.
The data representation used in a particular domain affects both the computational system and data storage. Broadly, representations across domains can be classified as raw data or reconstructed data, and the representation often differs depending on the stage of reconstruction and the upstream steps of the data processing pipeline. Existing applications include fully connected NNs that take preprocessed expert feature variables as input, and CNNs when the data are image-like. NN algorithms inspired by domain knowledge, currently under development, can further exploit expert features to gain accuracy and efficiency, as detailed below. To fully exploit the capabilities of advanced NNs and to prepare data with minimal loss of information, a more faithful representation of the raw data, e.g., point clouds, is needed. Typical representations of raw data obtained from various experimental and measurement systems include the following:
- Spatial Data. It is used to describe physical objects in geometric space. There are two main types: vector data and raster data. Vector data consists of points, lines, and polygons. Raster data is a grid of pixels, as in an image, but the pixels can also represent other measurements such as intensity, charge, or field strength.
- Point Clouds. A kind of spatial data, created by collating a set of points in 3D space that usually form a set of objects in that space.
- Temporal Data. It is used to describe the state of a system or experiment at particular times; data collected in a particular order over time fall into this class. Time-series data are a subset of this representation in which the data are sampled at regular time intervals. As an example of time-series data, Fig. 4 shows supernova classification.
- Spatio-Temporal Data. When measurements or observations of a system are collected in both spatial and temporal dimensions, the data can be considered spatio-temporal.
- Multispectral Data. It is used to represent the output of multiple sensors that take measurements in multiple bands of the electromagnetic spectrum. Multispectral representations are common in imaging with sensors sensitive to light of different wavelengths, and usually involve a few to ten spectral bands.
- Hyperspectral Data. It is used to represent measurements from on the order of 100 spectral bands. Images collected from these different narrowband spectra are combined into a so-called hyperspectral cube with three main dimensions: the first two describe the spatial arrangement (e.g., the earth's surface), and the third represents the complete spectral content at each "pixel" location.
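The hyperspectral-cube layout described above maps directly onto a 3D array. The sketch below uses illustrative dimensions (a 64×64 scene in 100 bands, not taken from any particular instrument) to show how a pixel spectrum and a single-band image are just slices along different axes:

```python
import numpy as np

# A hyperspectral cube: two spatial axes plus one spectral axis.
# Dimensions are illustrative: a 64x64 spatial scene sampled in
# 100 narrow spectral bands.
height, width, n_bands = 64, 64, 100
cube = np.random.default_rng(1).random((height, width, n_bands))

# The complete spectrum at one "pixel" location is a 1D slice
# along the spectral axis:
spectrum = cube[10, 20, :]       # shape: (100,)

# A single-band image is a 2D slice across the spatial axes:
band_image = cube[:, :, 42]      # shape: (64, 64)

print(cube.shape, spectrum.shape, band_image.shape)
```

This axis ordering (spatial, spatial, spectral) matches the cube description in the text; other conventions put the band axis first, so real file formats should be checked before slicing.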
Table 1 provides a brief description of these data representations and the scientific application fields to which they correspond, highlighting the representations that are particularly important in each field. The cost of data communication (latency) and the cost of data storage (acquiring and managing physical storage resources) are important concerns, so highly optimized data-analysis solutions are required, especially in application areas that demand real-time analysis and feedback. Applications relying on hyperspectral data are taking input from an ever-growing portion of the electromagnetic spectrum, so fast data reduction is required in these areas. Applications that generate large point clouds likewise require efficient compression of spatial data. In application domains with multispectral data of limited spatial resolution, ultra-fast reconstruction is required to enable real-time control feedback. Applications requiring accurate analysis of streaming time-series data must often run under very limited storage and communication resources, whether due to privacy and security concerns or the limitations of the associated edge devices. Some current efforts to develop ML solutions for the data-processing front end focus on autoencoder-based compression engines; ML-based dimensionality reduction for hyperspectral data is another direction receiving attention. Deep-learning-based approaches to image reconstruction are also being studied, with materials science among the most active areas in this regard.
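To make the dimensionality-reduction idea concrete: a linear autoencoder is mathematically equivalent to PCA, so a minimal baseline for compressing hyperspectral spectra can be sketched with an SVD in NumPy. The data here are synthetic (hypothetical pixel spectra built from a few underlying components), and real pipelines would use trained nonlinear autoencoders as the text describes; this is only the linear limiting case.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic "hyperspectral" pixels: 4096 pixels x 100 bands, built
# from only 8 underlying spectral components plus noise, so most of
# the variance lives in a low-dimensional subspace.
components = rng.normal(size=(8, 100))
abundances = rng.random(size=(4096, 8))
pixels = abundances @ components + 0.01 * rng.normal(size=(4096, 100))

# PCA via SVD: the optimal *linear* autoencoder. Keeping k principal
# directions compresses each 100-band spectrum to a k-value code.
k = 8
mean = pixels.mean(axis=0)
U, S, Vt = np.linalg.svd(pixels - mean, full_matrices=False)
codes = (pixels - mean) @ Vt[:k].T        # encoder: 100 -> k
reconstructed = codes @ Vt[:k] + mean     # decoder: k -> 100

rel_err = np.linalg.norm(pixels - reconstructed) / np.linalg.norm(pixels)
print(f"compression 100 -> {k} values/pixel, relative error {rel_err:.4f}")
```

Because the synthetic spectra truly occupy an 8-dimensional subspace, a 12.5x compression loses almost nothing here; real sensor data would trade compression ratio against reconstruction error less cleanly, which is exactly where learned autoencoders earn their keep.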