
Respond To Pandemic Emergencies! Propose Optimal Deployment Of Medical Supplies With Reinforcement Learning!


3 main points
✔️ Medical supplies, especially medical devices, must be allocated appropriately to meet demand.
✔️ The aim is to derive an optimal policy for how medical devices should be placed.
✔️ The results suggest higher performance than the other algorithms compared.

On collaborative reinforcement learning to optimize the redistribution of critical medical supplies throughout the COVID-19 pandemic
written by Bryan P Bednarski, Akash Deep Singh, William M Jones
(Submitted on 9 Dec 2020)
Comments: J Am Med Inform Assoc.

Subjects: Computer Vision and Pattern Recognition (cs.CV)

The images used in this article are from the paper, the introductory slides, or were created based on them.


Can medical supplies be deployed in a way that minimizes the damage of a pandemic?

This research aims to develop algorithms to derive the optimal relocation of medical devices using reinforcement learning and deep learning models in order to enhance the response to pandemics such as COVID-19.

During the coronavirus disease 2019 (COVID-19) pandemic, there have been reports of countries facing shortages of medical supplies, making it difficult to provide adequate healthcare. Redistribution of medical equipment is gaining attention as a solution to these shortages. In Northern Italy, for example, doctors have had to distribute limited equipment and decide which patients to save. In the United States, the lack of a unified system for distributing supplies has reportedly led to the use of rudimentary methods such as telephone calls and press releases. Thus, while the need to deploy medical supplies in response to a pandemic is recognized, there are few reports on optimal deployment, and how allocation policies are decided is unclear. This opaqueness needs to be resolved in order to provide appropriate medical care in emergency situations such as this pandemic.

In this study, we aim to develop a method to share resources more effectively in such public health emergencies. After preprocessing the datasets, we use them as input to a neural network inference model, a long short-term memory (LSTM) network, to predict the future demand for ventilators in each region. Based on the predicted demand, we then use reinforcement learning (value iteration (VI) and Q-learning) to derive the optimal allocation of medical supplies for each case, with the goal of realizing an appropriate medical system in emergency situations.

What is reinforcement learning?

This chapter outlines the reinforcement learning that we utilize for the redistribution of medical supplies.

Reinforcement learning (RL) is a type of machine learning that learns from two factors: the agent's behavior and the environment's feedback on that behavior. A key feature of RL is that it does not depend on a static dataset: the agent learns from the experience it collects through feedback from the environment. Unlike supervised and unsupervised learning, this removes the need to collect, preprocess, and label data before training.

A typical reinforcement learning workflow is as follows:

  1. Creating the environment: The first step is to define the environment in which the agent operates, including the interface between the agent and the environment. In many cases, a simulation is used as the environment for reasons of safety and ease of experimentation.
  2. Reward definition: Define the reward for achieving the goal and how the reward is calculated.
    The rewards guide the agent in choosing its actions.
  3. Agent creation: Create an agent, which consists of a policy and a reinforcement learning algorithm. Specifically, this involves:
    a) Choosing a way to represent the policy: neural networks, lookup tables, etc.
    b) Choosing the appropriate learning algorithm: neural networks are used in most cases because they are better suited to learning in large state and action spaces.
  4. Agent training and validation: Train the agent after setting the conditions for learning, such as stopping conditions. After training, validate the policy learned by the agent; if necessary, revisit the design of the reward signal, the policy, and so on, and train again. RL is sample inefficient, especially for model-free and on-policy algorithms, and training can take from several minutes to several days; training is therefore often parallelized across multiple CPUs, GPUs, and computer clusters.
  5. Deployment of the policy: The learned policy is examined. Depending on the results, it may be necessary to go back to an earlier stage of the workflow. Specifically, if training does not converge to a policy within the available computation time, the following items need to be updated before retraining: training settings; configuration of the reinforcement learning algorithm; policy representation; reward signal definition; action and observation signals; and environment dynamics.
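The workflow above can be illustrated with a minimal sketch of tabular Q-learning. Everything in it is an assumption made for illustration and is not taken from the paper: the toy five-state "walk to the goal" environment, the function names `step` and `train`, and the hyperparameter values. It shows the environment definition, the reward, a lookup-table policy representation, and the training loop.

```python
import random

N_STATES = 5          # hypothetical toy world: positions 0..4, position 4 is the goal
ACTIONS = (0, 1)      # 0 = move left, 1 = move right

def step(state, action):
    """Toy environment: returns (next_state, reward, done)."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning with an epsilon-greedy behavior policy."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]   # lookup table: q[state][action]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Explore with probability epsilon, or when both actions look equal.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                action = rng.choice(ACTIONS)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            next_state, reward, done = step(state, action)
            # Move the table entry toward the bootstrapped target.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q

q_table = train()
# Greedy policy read off the table for every non-terminal state.
greedy_policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
```

In this toy setting the greedy policy moves right in every state, which is the shortest route to the goal. Validation in step 4 would correspond to checking `greedy_policy` against the behavior that is actually wanted.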

Purpose of the research

In this work, we aim to propose redistribution algorithms that provide better-quality health care in the face of public health crises, such as the COVID-19 pandemic, by optimizing the allocation of medical supplies. We preprocess the dataset and implement a neural network inference model (LSTM) to predict the future demand for ventilators in each state. Based on these predictions, we apply five redistribution algorithms, three heuristic algorithms and two reinforcement learning algorithms, and compare their average performance with 5, 20, 35, and 50 participating states. We report that the redistribution algorithm based on Q-learning achieves the best performance, that is, the largest reduction in shortages of medical supplies. Furthermore, performance and reliability are expected to improve as the number of participating states increases, suggesting that the algorithm becomes more useful at larger scale.


In this chapter, we describe the proposed method for relocating medical supplies.

System overview

In this section, we describe the overall picture of the proposed method.

The proposed method (shown in the figure below) consists of a three-stage pipeline: (1) preprocessing the input data; (2) predicting future demand with a deep learning inference model; and (3) interpreting the demand prediction with a pre-selected redistribution algorithm and deciding on an action. The second and third stages are optimized independently for each day.

The system is optimized to minimize the total ventilator shortage accumulated during the training period, where a shortage occurs in any state whose supply of ventilators is less than its demand. The inputs are the dates of the simulation run and the number of states to be selected at random.
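As a minimal sketch of this objective, the cumulative shortage can be written as the sum of unmet demand over all days and states. The function name `total_shortage` and the nested-list layout (`demand[day][state]`) are assumptions for illustration; the article does not give the exact formulation.

```python
def total_shortage(demand, supply):
    """Sum of unmet ventilator demand over all days and states.

    demand[day][state] and supply[day][state] are ventilator counts.
    A state contributes only on days when its supply is below its demand.
    """
    return sum(
        max(d - s, 0)
        for day_demand, day_supply in zip(demand, supply)
        for d, s in zip(day_demand, day_supply)
    )

# Two days, two states (made-up numbers).
example_demand = [[10, 5], [12, 4]]
example_supply = [[8, 6], [8, 6]]
example_total = total_shortage(example_demand, example_supply)  # 2 + 0 + 4 + 0 = 6
```

Note that the surplus in the second state does not offset the shortage in the first; only redistribution can do that, which is what the third pipeline stage decides.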

Data Preprocessing and Imputation

In this section, we will discuss data preprocessing.

In preprocessing the dataset, we used COVID-19 tracking data taken from the Institute for Health Metrics and Evaluation at the University of Washington as an indicator of disease. We also added the biweekly excess deaths reported by the Centers for Disease Control and Prevention as an indicator, to offset bias arising from regional differences in the number of COVID-19 tests performed. In addition, we included fixed state-specific rates for several comorbidities (heart disease, asthma, chronic obstructive pulmonary disease, and diabetes) to account for the prevalence of these diseases.

Assumptions in statistical processing

In this section, we describe the prerequisites for statistical processing.

To make the system more robust, two assumptions are made: one about the number of ventilators and one about logistical downtime (delay). The first assumes that the number of ventilators available in each state equals the number of COVID-19 intensive care unit (ICU) beds. There is currently no system in place to track and report hospital ventilators on a state-by-state basis, so the simulations need a proxy variable. Because previous studies have shown that approximately half of ICU patients required ventilation in the early stages of the pandemic, we use ICU bed data as the proxy for ventilators. The second assumption concerns the logistical downtime (delay) of redistributed ventilators. Delays are randomly sampled from a Gaussian distribution (mean 3 days, standard deviation 0.5 days) and rounded to 2 days (about 16% of samples), 3 days (about 68%), or 4 days (about 16%). The lower bound of this distribution is based on reports from the Department of Health and Human Services that emergency stockpiled ventilators are available nationwide in 24 to 36 hours.
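The delay assumption can be checked with a short simulation. This is a sketch, not the authors' code: the function name `sample_delay_days`, the fixed seed, and the clipping of rare draws outside 2 to 4 days are assumptions. It reads "3 ± 0.5 days" as a mean of 3 and a standard deviation of 0.5, which is the reading that reproduces the roughly 16% / 68% / 16% split.

```python
import random
from collections import Counter

def sample_delay_days(rng, mean=3.0, sd=0.5):
    """Draw one logistics delay in whole days from a rounded Gaussian."""
    days = round(rng.gauss(mean, sd))
    # Assumption: clip the rare draws (well under 1%) that round outside 2..4 days.
    return min(4, max(2, days))

rng = random.Random(42)      # fixed seed, for reproducibility only
n = 20_000
counts = Counter(sample_delay_days(rng) for _ in range(n))
shares = {days: counts[days] / n for days in (2, 3, 4)}
```

With a standard deviation of 0.5, a rounded value of 3 corresponds to a draw within one standard deviation of the mean, so about 68% of samples land on 3 days and the remainder splits almost evenly between 2 and 4 days.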

Estimation of Demand

In this chapter, we discuss the estimation of demand in the second stage of the pipeline.

In this stage, we forecast future demand for ventilators over an interval based on the average redistribution delay. Because previous studies have reported time-series patterns around regional COVID-19 peaks, we use an LSTM, a type of recurrent neural network (RNN), as the demand model; it is suited to non-seasonal, multivariate time-series forecasting. We pre-train the LSTM on a small amount of data, to emulate training on past pandemics, and then train it daily on observed data. The primary simulation runs from March 1 to August 1, 2020, and uses 26 days of processed observations to pre-train the LSTM. The LSTM then forecasts demand for 14 consecutive days; the forecast interval is set with reference to the average logistics delay so that the redistribution algorithm can act at an optimal interval.
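Before a sequence model such as an LSTM can be trained, the daily series has to be cut into input and target windows. The sketch below shows one common way to do that. The function name `make_windows` and the 7-day lookback are hypothetical; the article states only the 26 days of pre-training data and the 14-day forecast, not how the windows are built. The multivariate features are omitted for brevity.

```python
def make_windows(series, lookback, horizon):
    """Split a daily series into (past `lookback` days, next `horizon` days) pairs."""
    if lookback < 1 or horizon < 1:
        raise ValueError("lookback and horizon must be positive")
    last_start = len(series) - lookback - horizon
    return [
        (series[i:i + lookback], series[i + lookback:i + lookback + horizon])
        for i in range(last_start + 1)
    ]

# Placeholder values standing in for 26 days of observed ventilator demand.
daily_demand = list(range(26))
# Hypothetical 7-day lookback, 14-day forecast horizon.
windows = make_windows(daily_demand, lookback=7, horizon=14)
```

With 26 days of data, a 7-day lookback, and a 14-day horizon, this yields six training pairs. A small count like this is one reason daily retraining on newly observed data matters in such a setup.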

Redistribution algorithm

In this chapter, we describe the third step - the redistribution algorithm.

In this phase, decisions are made to optimize the redistribution of medical devices. We use three heuristic algorithms (Maximum Needs First, Minimum Needs First, and Random Order) and two reinforcement learning algorithms (value iteration (VI) and Q-learning), and compare them with a baseline without ventilator exchange, in which each state starts and ends with its initial supply. The three approaches that do not use reinforcement learning allocate excess ventilators to states based on predicted demand. The two RL algorithms perform allocation based on the interaction between the agent and the environment. The two differ as follows: Q-learning evaluates actions using a lookup table that is defined in advance and continuously updated, whereas VI recursively explores all actions until convergence. Redistribution by Q-learning avoids shortages by buffering each state's supply against unexpected demand spikes (see figure below).
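The three heuristic orderings can be sketched as a single function that differs only in how the states in need are sorted. This is an illustrative sketch under stated assumptions, not the paper's implementation: the function name `redistribute`, the dictionary inputs, the rule that donors give away their entire predicted surplus, and the absence of delivery delays are all assumptions.

```python
import random

def redistribute(supply, predicted_demand, order="max_needs_first", seed=0):
    """Allocate pooled surplus ventilators to states with a predicted shortfall.

    Returns {state: ventilators received}. Assumes donors release their whole
    predicted surplus and that transfers arrive instantly.
    """
    pool = sum(max(supply[s] - predicted_demand[s], 0) for s in supply)
    needs = {s: predicted_demand[s] - supply[s]
             for s in supply if predicted_demand[s] > supply[s]}
    states = sorted(needs)                      # stable, deterministic starting order
    if order == "max_needs_first":
        states.sort(key=lambda s: needs[s], reverse=True)
    elif order == "min_needs_first":
        states.sort(key=lambda s: needs[s])
    elif order == "random":
        random.Random(seed).shuffle(states)
    else:
        raise ValueError(f"unknown order: {order}")
    allocation = {}
    for s in states:
        give = min(needs[s], pool)
        if give == 0:
            break                               # pool exhausted
        allocation[s] = give
        pool -= give
    return allocation

# Made-up example: state A has 30 spare units, B needs 20, C needs 25.
supply = {"A": 100, "B": 40, "C": 20, "D": 50}
demand = {"A": 70, "B": 60, "C": 45, "D": 50}
max_first = redistribute(supply, demand, "max_needs_first")   # C served first
min_first = redistribute(supply, demand, "min_needs_first")   # B served first
```

In this example the same 30 spare ventilators are split differently depending on the ordering, which is why the choice of heuristic affects the shortage that remains.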


In this section, we describe the results of the evaluation performed in this study. The evaluation compares how much five algorithms (three demand-driven algorithms and two that use reinforcement learning) reduce the shortage of medical supplies relative to the case in which each state simply maintains its initial supply.

Evaluation Environment

In this chapter, we describe the environment in which we conducted our evaluation.

For the evaluation, we use a time-series model, long short-term memory (LSTM), to derive demand forecasts for medical supplies, together with the redistribution algorithms, including those that use reinforcement learning. Specifically, based on the best demand prediction from the LSTM inference model, we randomly select 5, 20, 35, and 50 states and compare the performance of each algorithm. Shortage reduction is measured by comparing the shortage when an algorithm is applied with the shortage when no action is taken, that is, when each state maintains its initial supply throughout the simulation. Optimality is assessed by comparing the observed shortage with the shortage in the ideal case, in which ventilators are available wherever shortages occur, with no delays and no excess left at any location. To exclude bias, outliers beyond three standard deviations (SD) are excluded, and only the most representative indicators are evaluated. A simulation is treated as infeasible in the following cases: no shortage occurs when no action is taken; or no shortage is observed while the redistribution algorithm is applied.
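The metrics described above can be sketched as follows. The exact formulas are not given in the article, so these are assumptions: shortage reduction is taken as the percentage drop relative to the no-action baseline, the ideal shortage is taken as the unmet demand when supply is pooled across states with no delay, and outliers are filtered using the population standard deviation. All three function names are hypothetical. Only the first infeasibility condition is checked here.

```python
from statistics import mean, pstdev

def shortage_reduction_pct(baseline_shortage, observed_shortage):
    """Percentage reduction in shortage relative to taking no action."""
    if baseline_shortage <= 0:
        # The article treats runs with no baseline shortage as infeasible.
        raise ValueError("infeasible run: no shortage occurs without action")
    return 100.0 * (baseline_shortage - observed_shortage) / baseline_shortage

def ideal_shortage(demand, supply):
    """Shortage if supply could be pooled across states each day with no delay."""
    return sum(
        max(sum(day_demand) - sum(day_supply), 0)
        for day_demand, day_supply in zip(demand, supply)
    )

def drop_outliers(values, n_sd=3.0):
    """Keep only values within n_sd standard deviations of the mean."""
    mu, sd = mean(values), pstdev(values)
    return [v for v in values if abs(v - mu) <= n_sd * sd]

reduction = shortage_reduction_pct(200, 50)                     # 75.0
ideal = ideal_shortage([[10, 5], [12, 4]], [[8, 6], [8, 6]])    # 1 + 2 = 3
kept = drop_outliers([90.0] * 20 + [0.0])                       # the single 0.0 is dropped
```

Comparing an algorithm's observed shortage against `ideal_shortage` gives a sense of how much of the achievable improvement it captured, which is the role optimality plays in the evaluation.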

Evaluation results

In this chapter, we discuss the actual evaluation results.

The results in the evaluation environment described above (see figure below) show that Q-learning performed best on both shortage reduction and optimality when 20, 35, and 50 states participated. In the 5-state environment, however, Q-learning had lower shortage reduction than the Maximum Needs First method, which allocates to the states with the greatest need first.

For Q-learning, mean performance increased and the standard deviation decreased as the number of randomly selected participating states grew: performance improved from 78.74 ± 30.84% with 5 states to 93.46 ± 0.31% with 50 states. Q-learning also consistently achieved 93.33% to 95.56% average optimality, with the SD decreasing as the number of randomly participating states increased (see table below).


This research aims to develop an algorithm that derives the optimal allocation of medical supplies in emergency situations such as COVID-19. Allocating medical supplies optimally makes it possible to relieve shortages in an emergency and to provide appropriate medical care to as many people as possible. To this end, the study proposes a method that uses reinforcement learning to derive the optimal allocation of medical supplies. In the evaluation, the redistribution algorithms are compared in simulations in which 5, 20, 35, and 50 states are randomly selected. The results show that the redistribution algorithm using reinforcement learning resolves shortages of medical supplies at a level of 93% to 95%. Allocation performance also improves as the number of randomly selected participating states increases, with shortage reductions of 78.74 ± 30.8% and 93.50 ± 0.003% in the 5-state and 50-state simulations, respectively. The improvement in system performance was most pronounced at lower numbers of participating states: the baseline methods peaked at 5 states and degraded with increasing complexity, whereas Q-learning improved.

One possible explanation for the high standard deviation in shortage reduction for Q-learning with a small number of states is that, when few states participate, there are cases in which demand cannot be met by the supply available among them. As the number of participating states increases, such cases become rarer and the SD decreases accordingly. This result suggests that Q-learning can consistently select near-optimal actions. In addition, further performance improvement can be expected by using data collected during the current pandemic for training. Based on these findings, we expect that reinforcement learning will be able to achieve near-optimal public health resource allocation in the future.

One of the challenges of this work is the poor performance of the other reinforcement learning algorithm, value iteration (VI): its shortage reduction fell from 73.42 ± 31.59% with 5 states to 23.40 ± 7.72% with 50 states. Since VI derives a policy based on the current policy at each iteration, it has to converge on the supply and demand of every state in real time, which raises the convergence threshold and lowers accuracy. The Q-learning algorithm, a type of model-free learning algorithm, maintains high accuracy by adjusting and learning only the values in the Q-table. In addition, since VI is generally computationally expensive, Q-learning makes it possible to learn at a lower cost.
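The difference in cost can be seen from the textbook update rules for the two algorithms. The notation below is the standard one and is an assumption here, since the article does not state the equations: $V_k$ is the value estimate at iteration $k$, $P$ the transition model, $R$ the reward, $\gamma$ the discount factor, and $\alpha$ the learning rate.

```latex
% Value iteration: one sweep updates every state and needs the model P
V_{k+1}(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\,\bigl[ R(s, a, s') + \gamma V_k(s') \bigr]

% Q-learning: one table entry is updated from a single sampled transition (s, a, r, s')
Q(s, a) \leftarrow Q(s, a) + \alpha \bigl[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \bigr]
```

Each value iteration sweep ranges over all states, actions, and successor states, so its cost grows quickly as more states participate. A Q-learning step touches a single entry of the table.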
