Replicated In A Leukemia Treatment Strategy Model! A Proposed Decision Framework For Leukemia Treatment Policy Using Deep Reinforcement Learning


3 main points
✔️ A dynamic treatment regime (DTR) is a treatment strategy in which treatment is adjusted dynamically based on disease progression, side effects, and laboratory values.
✔️ The study develops a framework for estimating the optimal dynamic treatment regime from observed medical data using deep reinforcement learning.
✔️ The framework is expected to lead to models that can support complex treatment decisions in DTR and determine the optimal policy for each case.

Deep Reinforcement Learning for Dynamic Treatment Regimes on Medical Registry Data
written by Ning Liu, Ying Liu, Brent Logan, Zhiyuan Xu, Jian Tang, Yanzhi Wang
(Submitted on 28 Jan 2018)
Comments: Published in final edited form as: Healthc Inform.
Subjects: Artificial Intelligence (cs.AI); Machine Learning (stat.ML)



Can reinforcement learning replicate dynamic treatments by experts, tailored to individual conditions and symptoms?

This study proposes a deep reinforcement learning-based treatment decision framework for estimating the optimal dynamic treatment regime (DTR) from observed medical data. DTR refers to a treatment strategy that dynamically modifies the treatment method based on the progress of the disease, side effects, and laboratory values. In recent years, there has been a growing interest in personalized medicine and the introduction of DTR, but in some cases, such as leukemia, it is difficult for physicians to judge whether excessive treatment will worsen the final endpoint in the long term. For such cases, proposing the optimal treatment strategy may lead to an improvement in prognosis and symptoms.

In this study, we propose to construct a decision support model for bone marrow transplantation from a data set of acute and chronic complications, using deep reinforcement learning. Because the Q-learning used in conventional reinforcement learning struggles as the number of states and actions grows, our method uses a deep learning model to handle this very large number of patterns.
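To make the scaling problem concrete, here is a minimal sketch of a single tabular Q-learning update in Python. It is a generic textbook illustration, not the paper's implementation; the function name, the toy table size, and the values of `alpha` and `gamma` are all assumptions chosen for the example.

```python
import numpy as np

def q_learning_update(q_table, state, action, reward, next_state,
                      alpha=0.1, gamma=0.9, terminal=False):
    """Apply one tabular Q-learning step in place and return the table.

    Update rule: Q(s, a) += alpha * (r + gamma * max_a' Q(s', a') - Q(s, a)).
    For a terminal transition the bootstrap term is dropped.
    """
    best_next = 0.0 if terminal else float(np.max(q_table[next_state]))
    target = reward + gamma * best_next
    q_table[state, action] += alpha * (target - q_table[state, action])
    return q_table

# Toy sizes (assumed). The table needs one cell per (state, action) pair,
# so it grows as n_states * n_actions.
n_states, n_actions = 4, 3
q = np.zeros((n_states, n_actions))
q = q_learning_update(q, state=0, action=1, reward=1.0,
                      next_state=2, terminal=True)
```

Because the table holds one entry per state-action pair, it quickly becomes impractical when patient states are high-dimensional, which is the motivation for replacing the table with a neural network.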

What is a dynamic treatment regime (DTR)?

First, a brief description of the dynamic treatment regime (DTR), which is the subject of analysis in this study.

DTR refers to a treatment strategy in which treatment decisions are made dynamically based on disease progression, side effects, and laboratory values. For example, it is used to evaluate, from a prognostic perspective, the appropriate laboratory-value criteria for initiating treatment. This approach is attracting attention as an important step toward personalized medicine, because it takes into account the characteristics and genetic information of each patient when selecting treatment. In particular, it is sometimes difficult for doctors to make judgments in cases where excessive treatment, such as radiotherapy in cancer, worsens the outcome and quality of life (QOL). In addition, there are cases in which the continuous use of drugs with strong side effects, such as steroids, leads to the development of drug resistance and narrows the options for alternative drugs. Suggesting the best treatment plan for such cases may lead to improvement of the prognosis and symptoms.

Previous Studies and Issues on DTR

Much of the work on deriving optimal patterns for these DTRs has been reported in the field of statistics. Most of these methods use data from randomized clinical trials to perform dynamic programming analyses of multiple stages of decision making. In other words, the optimal sequential decision rule (strategy) is estimated based on the state transitions derived by retrospectively analyzing the previously identified decision stages. At each stage, two kinds of model have been proposed: a parametric prediction model for the value function using Q-learning, and a classification model that directly models the decision policy using Outcome Weighted Learning (OWL).

On the other hand, these studies are based on randomized controlled trials and are constructed in a low-dimensional space - essentially a two-dimensional space - which may be inappropriate for models dealing with high-dimensional spaces, such as dynamic treatment regimes. In DTR, the high variability of individual cases implies high heterogeneity in the decision-making process among patients, and states and actions represented in a low-dimensional space are difficult to apply to high-dimensional data on treatment options (i.e., electronic medical records and registry data). To cope with the high dimensionality of such data, statistical methods need to simplify the explanatory variables to some extent, which makes it difficult to analyze cases in which multiple factors are related, such as interactions between data elements. In a complex process such as decision making, multiple factors are expected to be intertwined, so such simplification is highly likely to fail to produce an optimal DTR. Against this background, reinforcement learning may be introduced to the decision-making task, but simple models such as the Markov decision process (MDP) may not be able to handle many DTR problems. In this research, we focus on deep reinforcement learning, which combines deep learning and reinforcement learning, as in the Deep Q-Network (DQN). We aim to build a support system that is close to the decision-making of experts.

Purpose of the Study

In this study, we aim to develop an individualized sequential decision-making framework for DTR by introducing deep reinforcement learning to resolve the main issue of previous studies, namely the low dimensionality of the state and action spaces in decision-making models. As mentioned earlier, previous studies have proposed models mainly for randomized controlled trials, and these are unlikely to suit decision-making models that deal with high-dimensional and complex spaces such as individualized treatment decisions. Therefore, in this paper, we propose a framework based on deep reinforcement learning to provide data-driven sequential decision support based on medical registry data. More specifically, to model the high-dimensional action and state spaces, we build a discrete-time model that follows the design of the registry data collection.


Data Set

We analyze a dataset of outcomes for patients who have undergone hematopoietic cell transplantation (HCT), collected since 1972. The data cover the prevention and treatment of GVHD (graft-versus-host disease), a common complication after HCT in which donor immune cells cause immunological damage. GVHD may be acute, occurring within six months of transplantation and often resolving relatively quickly, or chronic, occurring anywhere from shortly after transplantation to several years later and causing long-term complications and illness, so treatment decisions need to reflect these characteristics. The dataset includes 6021 patients diagnosed with acute myeloid leukemia (AML) who underwent HCT between 1995 and 2007, using standard follow-up data at 100 days, 6 months, 12 months, 2 years, and 4 years post-transplant.

At the time each follow-up form is recorded, we define the states and actions of the state transition used in reinforcement learning. Specifically, we define relapse and death as terminal states and the occurrence of acute GVHD and chronic GVHD as transient states. The actions are of three types related to treatment policy: initial treatment applied at the time of transplantation (chemotherapy), GVHD prophylaxis (immunosuppression of donor cells to prevent GVHD), and therapeutic agents for acute and chronic GVHD.

Construction of State Transition in DTR

The state transition model in this study is defined as follows: t = 0 at the time of transplantation, t = 1 at 100 days, t = 2 at 6 months, t = 3 at 1 year, t = 4 at 2 years, and t = 5 at 4 years. We apply deep reinforcement learning to three tasks in the DTR: the initial conditioning (chemotherapy to prevent relapse), initial treatment after transplantation including GVHD prophylaxis, and treatment of acute and chronic GVHD. Initial prophylactic treatment takes place at t = 0 at the time of transplantation, treatment of acute GVHD at t = 1 (100 days) and t = 2 (6 months), and treatment of chronic GVHD from t = 2 (6 months) to t = 5 (4 years).
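The discrete timeline above can be written down as a small lookup, shown here as a hedged Python sketch. The dictionary, the function name `tasks_at`, and the task labels are hypothetical names introduced for illustration; only the time points and the assignment of tasks to them come from the text.

```python
# Follow-up time points, indexed by the discrete step t used in the article.
FOLLOW_UP = {
    0: "transplantation",
    1: "100 days",
    2: "6 months",
    3: "1 year",
    4: "2 years",
    5: "4 years",
}

def tasks_at(t):
    """Return the treatment decisions made at time step t."""
    if t not in FOLLOW_UP:
        raise ValueError(f"unknown time step: {t}")
    tasks = []
    if t == 0:
        tasks.append("initial treatment and GVHD prophylaxis")
    if t in (1, 2):
        tasks.append("acute GVHD treatment")
    if 2 <= t <= 5:
        tasks.append("chronic GVHD treatment")
    return tasks

for t, label in FOLLOW_UP.items():
    print(t, label, tasks_at(t))
```

Note that t = 2 (6 months) is the only step at which both acute and chronic GVHD treatment decisions are made.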

First, we build a supervised learning network to predict the distribution of the experts' treatment choices at each stage. The proposed method predicts the distribution of initial treatment and GVHD prophylaxis at the time of transplantation based on baseline information and, taking time variation into account, predicts the distribution of treatment for acute GVHD at 100 days and 6 months, and for chronic GVHD up to 2 years after transplantation.

Immediately after transplantation, the input (state) is the patient's basic information (i.e., age, gender, and presence of comorbidities) together with the patient-donor genetic matching information, and the output (action) is the drug combination used in the initial treatment to prevent disease recurrence and in GVHD prophylaxis. For the treatment of acute GVHD at t = 1 and t = 2, the input (state) is the basic information, the matching conditions, and the presence of acute GVHD, and the output (action) is the combination of medications used to treat acute GVHD. The same definition of state and action applies to the treatment of chronic GVHD from t = 2 to t = 5.

To reduce the high dimensionality of the action space, we encode actions by the drug combinations actually used, which reduces the number of selectable actions to approximately 270. We also use autoencoders to extract features that reduce the number of dimensions of the state space, which accelerates convergence and mitigates overfitting. Next, we estimate the value function only for the treatment options that experts choose with the highest probability. Actions with low probability have fewer samples and generalize poorly, and narrowing the target also reduces computational complexity. As the learning target, we use the Q-function, the expected future reward when the optimal treatment is received, and estimate it by Q-learning.
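One plausible reading of the drug-combination encoding is sketched below in Python: each record is a binary vector marking which drugs were given, and every distinct combination that appears in the data receives its own discrete action index. This is an assumption about the mechanism, not the authors' code; the function name `encode_actions` and the toy three-drug data are invented for illustration.

```python
import numpy as np

def encode_actions(drug_matrix):
    """Map each binary drug vector to the index of its unique combination.

    drug_matrix: (n_records, n_drugs) array of 0/1 flags.
    Returns (combos, action_ids) where combos[action_ids[i]] equals row i.
    """
    combos, action_ids = np.unique(drug_matrix, axis=0, return_inverse=True)
    # Flatten the inverse index so the result is 1-D across NumPy versions.
    return combos, action_ids.reshape(-1)

# Toy data with 3 drugs (assumed); the article's vectors have 17 entries.
records = np.array([
    [1, 0, 0],
    [0, 1, 1],
    [1, 0, 0],
    [0, 0, 0],
])
combos, action_ids = encode_actions(records)
print(len(combos), "distinct combinations observed")
```

Only combinations that occur in the data become actions, so the action count is bounded by the number of distinct rows rather than by every possible subset of drugs.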

In this paper, we simulate a preliminary implementation of the proposed method with a simplified heuristic reward. The setup, including the simplified rewards, is as follows. For each patient i, the delayed reward at the terminal state (death, relapse, or relapse-free survival after 4 years), or at the time t_i when data are lost, is assigned according to one of the following categories: relapse-free survival without GVHD; survival with acute GVHD or chronic GVHD; relapse of leukemic disease; death; data loss. Different delayed rewards are assigned to these five cases. 4-year survival without relapse and GVHD: reward of 1.0; acute and chronic GVHD: reward of 0.8; relapse: reward of 0.2; death: reward of zero. We also set up and trained three separate Deep Neural Networks (DNNs), for the initial conditions (chemotherapy and GVHD prophylaxis) and for the DTRs of acute and chronic GVHD treatment. For each DNN at timestamp t, we define the state as the input and the expert's decision as the action. An autoencoder reduces the high dimensionality of the input state space, and the output prediction is the expected return of the action.
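The simplified reward can be expressed as a lookup table, as in the following Python sketch. The reward values are those quoted above; the category keys and the function name are hypothetical. The text does not state a value for the data-loss case, so the sketch returns `None` there rather than guessing.

```python
# Delayed rewards quoted in the article, keyed by hypothetical category names.
REWARDS = {
    "relapse_free_no_gvhd": 1.0,  # 4-year survival without relapse or GVHD
    "survival_with_gvhd": 0.8,    # survival with acute or chronic GVHD
    "relapse": 0.2,
    "death": 0.0,
}

def delayed_reward(outcome):
    """Return the delayed reward for a patient's final outcome category.

    Returns None for "data_loss", whose reward is not given in the text.
    """
    if outcome == "data_loss":
        return None
    if outcome not in REWARDS:
        raise ValueError(f"unknown outcome: {outcome}")
    return REWARDS[outcome]

print(delayed_reward("survival_with_gvhd"))
```

Because the reward is only assigned at the end of follow-up, every intermediate treatment decision is credited through the discounted future return rather than through an immediate reward.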


Prediction accuracy for expert behavior

The results of predicting expert actions for chronic GVHD (Fig. 2) confirmed that both the top-5 prediction accuracies and the individual prediction accuracies at times t = 2-5 were sufficiently high and increased over time. It was also confirmed that the number of dimensions in the state and action spaces was reduced by the autoencoders and by clustering of the actions. The dimensionality of the state space was reduced from dozens to six, and the action space was reduced from a 17-dimensional binary vector to 270 drug combinations.
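For readers unfamiliar with the metric, top-k accuracy counts a prediction as correct when the expert's actual choice is among the k actions to which the model assigns the highest probability. The Python sketch below shows one common way to compute it; the function name and the toy probabilities are assumptions made for illustration and are unrelated to the paper's results.

```python
import numpy as np

def top_k_accuracy(probs, expert_actions, k=5):
    """Fraction of records whose expert action is among the k most probable.

    probs: (n_records, n_actions) predicted action probabilities.
    expert_actions: length n_records array of the actions experts chose.
    """
    probs = np.asarray(probs)
    expert_actions = np.asarray(expert_actions)
    # Indices of the k largest probabilities in each row (order irrelevant).
    top_k = np.argsort(probs, axis=1)[:, -k:]
    hits = (top_k == expert_actions[:, None]).any(axis=1)
    return float(hits.mean())

# Toy example with 3 actions (made-up numbers).
toy_probs = [[0.1, 0.6, 0.3],
             [0.5, 0.2, 0.3],
             [0.2, 0.3, 0.5]]
toy_expert = [1, 2, 0]
print(top_k_accuracy(toy_probs, toy_expert, k=1))
print(top_k_accuracy(toy_probs, toy_expert, k=2))
```

With around 270 possible drug combinations, top-5 accuracy is a more forgiving measure than exact-match accuracy, since several combinations can be clinically similar.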

Efficacy of a DTR framework with deep reinforcement learning for the treatment of chronic GVHD

In this evaluation, we compare the proposed method with random action selection in terms of the value function.

The evaluation results (Figure 3) confirm that the proposed framework with deep reinforcement learning has improved value over random action selection at multiple time steps: up to 21.4% value improvement.
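The article does not say how the percentage improvement is computed. A natural assumption is the relative gain of the learned policy's value over the random baseline at each time step, with the reported figure being the largest gain. The Python sketch below illustrates that assumed calculation; the function name and all numbers are made up and do not reproduce the paper's results.

```python
def relative_improvement(value_policy, value_baseline):
    """Relative gain of a policy's value over a baseline value."""
    if value_baseline == 0:
        raise ValueError("baseline value must be non-zero")
    return (value_policy - value_baseline) / value_baseline

# Made-up per-time-step values for illustration only.
learned_value = {2: 0.60, 3: 0.55}
random_value = {2: 0.50, 3: 0.50}

gains = {t: relative_improvement(learned_value[t], random_value[t])
         for t in learned_value}
best_t = max(gains, key=gains.get)
print(f"largest gain at t = {best_t}: {gains[best_t]:.1%}")
```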


This study proposes a systematic framework using deep reinforcement learning based on medical observational data from long-term follow-up of subjects with acute and chronic GVHD. Building decision models for treatment selection in these diseases has required expert input, and such models must deal with a complex, high-dimensional space with a huge number of states and actions, which is difficult to handle with conventional Q-functions. The proposed method aims to handle such high dimensionality properly by introducing deep reinforcement learning. The results show that the proposed method predicts expert treatment decisions with high accuracy and also provides improved value over random action selection. The proposed method is expected to have the potential to improve on expert practice by supporting decision-making through optimization for long-term patient outcomes.

Leukemia, the disease of interest in this study, presents other problems related to sequential decision making - the choice between transplant and no transplant, and the optimal timing of transplantation - and the application of the framework to these problems will be explored. Leukemia has characteristics that make new data hard to collect, such as the practical difficulty of enrolling patients with a high mortality rate in randomized trials, the high cost of treatment, and the difficulty of recruiting enough samples to achieve sufficient statistical power. Because the proposed method works from existing registry data, its implementation is expected to reduce the cost of collecting such new data.

On the other hand, issues remain, such as the computational cost required to build the model (amount and time of computation) and issues in actual implementation, since the actual decision is left to the patient or doctor. In particular, although methods such as Q-learning can take into account high dimensionality, which is difficult to process, the amount of computation required tends to increase. To deal with these issues, the computational complexity could be reduced through convolutional processing using CNNs, inverse reinforcement learning to estimate the reward function from the expert's behavior, and the introduction of Markov decision processes (MDPs).
