A Model To Support Optimal Treatment Decisions In Blood Glucose Management! Constructing A Reinforcement Learning Model That Estimates The Reward Function By Inverse Reinforcement Learning!
3 main points
✔️ The importance of blood glucose management has received increasing attention in recent years due to the rapidly growing number of diabetic patients.
✔️ To address this challenge, this study proposes a case-specific decision support model for treatment strategies using reinforcement learning (RL).
✔️ Among the three states related to blood glucose management (normal, medium, severe), the medium state was estimated to have the highest reward. Although this result is counterintuitive given that specialists aim to bring patients to the normal state, it highlights the significant promise of inverse reinforcement learning, which can derive reward functions that are otherwise difficult to estimate.
An Application of Inverse Reinforcement Learning to Medical Records of Diabetes Treatment
(Submitted on 23 Sept 2013)
Comments: Accepted at ECMLPKDD2013 Workshop
Subjects: Reinforcement Learning (cs.RL); Machine Learning (stat.ML)
Is it possible to build a decision support model that replicates experts' treatment strategy when the reward behind their actions is not present in the data?
This study analyzes the process of managing blood glucose levels, a risk factor for serious cardiovascular diseases including ischemic heart disease, using a Markov decision process (MDP). In recent years, the number of diabetic patients with elevated blood glucose levels has increased rapidly due to changes in lifestyle and eating habits, and attention has turned to treatment strategies that appropriately manage blood glucose, the cause of these diseases. While it is desirable to optimize such a management process for each case, treatment policies tailored to patient characteristics are difficult to realize once cost and other factors are considered. Therefore, attention has focused on building models that derive the optimal treatment policy for each case by introducing reinforcement learning (RL), a technique widely used in decision-making problems.
The purpose of this study is to develop a decision support system for the blood glucose management process using reinforcement learning. The modeling is based on state transitions using Markov Decision Processes (MDP), and inverse reinforcement learning is used to estimate a reward function that does not exist in the data.
What are blood glucose levels and blood glucose management?
First, I will briefly explain the subjects of this study's analysis: blood glucose levels and the associated disease, diabetes.
The blood glucose level refers to the concentration of glucose (sugar) in the blood, i.e., how much sugar is in the blood. Glucose is used as energy for daily activities; it spikes after a meal and then slowly returns to normal levels. On the other hand, if blood glucose levels remain high (a state in which there is too much sugar in the blood) due to factors such as glucose intolerance, vascular damage can occur: the walls of blood vessels are destroyed, and blood clots form or vessels burst. The probability of serious damage rises especially rapidly in organs with many capillaries (kidneys, brain, liver) and organs with large blood vessels (heart), through effects on internal organs, brain function, and blood pressure. This condition of chronically high blood glucose is called abnormal blood glucose (diabetes).
Diabetes has two causes, which are named differently: decreased secretion of insulin, the hormone that takes sugar into cells, due to reduced pancreatic function (insulin hyposecretion, type I diabetes), and failure of the "door" that takes sugar into cells to open properly (insulin resistance, type II diabetes). Insulin acts like a key that lets sugar into cells. In the former case, production of the key decreases, so the sugar concentration in the blood vessels rises; this is attributed to reduced insulin secretion in the pancreas, with heredity pointed out as a cause. In the latter case, the key that opens the cell's door no longer functions properly because of chronically excessive blood sugar. This is often caused by lifestyle factors such as overeating and obesity, and type II diabetes is what is generally referred to as diabetes mellitus.
Glycemic control is a treatment to prevent vascular damage caused by such elevated blood glucose levels. This treatment is based primarily on two measures, fasting glucose and HbA1c: the former refers to the blood glucose level before meals, and the latter to the percentage of hemoglobin (a blood component) that is bound to sugar. While proper glycemic control can help prevent the serious diseases mentioned above, it should be adapted to each individual case. The purpose of this study is to construct a model that implements optimal blood glucose management guidelines tailored to each case using reinforcement learning.
Examination of models in glycemic control
As mentioned above, elevated blood glucose levels are a risk factor not only for diabetes but also for vascular diseases (cardiovascular and renal diseases) and affect quality of life, so appropriate blood glucose management is necessary. In particular, since lifestyle habits, including diet, differ completely from person to person, it is desirable to optimize blood glucose management for each case. Medical treatment, including blood glucose management, is also an interaction between physician and patient: the physician selects an appropriate treatment according to the patient's condition as revealed by tests, and that treatment in turn changes the patient's condition. Such factors are difficult to capture with models that evaluate the impact of a single treatment method or factor, the typical target of conventional statistical analysis of medical data. Moreover, since glycemic control involves lifestyle improvement over a long period, evaluating its costs and benefits for quality of life requires analyzing long-term treatment records, yet few studies have examined such records.
Purpose of this research
The purpose of this study is to construct a model of the blood glucose management process using a Markov decision process (MDP), a kind of model-based reinforcement learning. As mentioned above, blood glucose levels must be managed according to the individual case in order to reduce the risk of cardiovascular and other serious diseases caused by elevated blood glucose. Conventional studies of blood glucose management, however, have relied on statistical methods, which can make it difficult to accurately reflect the factors of individual cases. To reflect case-specific characteristics appropriately, this study introduces reinforcement learning and aims to construct a model that proposes optimal blood glucose management tailored to each case. Specifically, we estimate the parameters of the MDP and the patient's state progression from medical records and evaluate the value of states and actions (treatments). Based on these evaluations, we estimate the optimal action selection rule (policy) according to the patient's condition. A simple reward function based on physicians' opinions could be assumed for this kind of evaluation, but its validity would be unclear. In this study, we aim to solve this problem by using inverse reinforcement learning (IRL), which estimates a reward function from the behavioral data of experts.
In this study, we modeled the process of glycemic control using medical records of hospitalized patients, including subsequent hospital visits, accumulated in a database. Specifically, we used hospital visit data from approximately 3,000 patients who had been hospitalized for percutaneous coronary intervention (PCI), a treatment for ischemic heart disease, for which diabetes is one of the risk factors. The dataset was extracted from the hospital's test and prescription ordering system and anonymized; it reportedly does not include personal information, pain complaints, physician findings, or other free-text information from electronic medical records. The data collected for each patient can be viewed as a single episode following a certain treatment strategy. The visit records were divided into episodes using 75-day intervals, and episodes with more than 24 visits (approximately 2 years) of outpatient treatment were selected, yielding 801 episodes. The shortest episode length (number of visits) was 25, and the longest was 124. To define blood glucose states, hemoglobin A1c (HbA1c) values are classified into three levels (normal, medium, severe) according to two thresholds (6.0 and 8.0). Drug therapies are grouped by drug efficacy, and the patterns of drug groups prescribed at the same time are identified; 38 combination patterns were found in the data.
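The discretization of HbA1c values into the three blood glucose states can be sketched as follows. The thresholds 6.0 and 8.0 come from the article; the function name and the sample values are hypothetical illustrations, not data from the paper:

```python
def hba1c_to_state(hba1c: float) -> str:
    """Map an HbA1c value to one of the three blood glucose states
    using the two thresholds (6.0 and 8.0) given in the article."""
    if hba1c < 6.0:
        return "normal"
    elif hba1c < 8.0:
        return "medium"
    else:
        return "severe"

# Example: a short, invented sequence of visits for one patient
visits = [5.8, 6.4, 7.9, 8.3, 7.5]
states = [hba1c_to_state(v) for v in visits]
print(states)  # ['normal', 'medium', 'medium', 'severe', 'medium']
```

Each visit in an episode is thus reduced to one of three discrete states, which is what makes the MDP formulation in the next section tractable.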
Setting up a model for the blood glucose control process
In this study, we model the long-term treatment process with a Markov decision process (MDP) to address the problem described above: deriving the optimal glycemic control process for each individual case. An MDP is a stochastic model of a dynamic system in which state transitions occur probabilistically; it models the relation between an agent's action and the next state and reward returned by the environment, and is determined by six components: states, actions, a transition probability function, a reward function, an initial state probability, and a policy. The basis of reinforcement learning is learning a policy to control an agent acting within this MDP state transition model. To construct such a model, we first estimate the state transition probabilities of the MDP and the physician's average behavioral policy from the extracted episodes. As the reward function, we assume that a reward of 1 is obtained when the test value is normal, derive the state values and action values from the Bellman equation, and use a discount rate γ of 0.9.
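As a minimal sketch of this setup, the state values under a fixed policy can be computed from the Bellman equation with reward 1 in the normal state and γ = 0.9. The transition matrix below is an invented placeholder; the real one is estimated from the medical records:

```python
import numpy as np

states = ["normal", "medium", "severe"]
# Hypothetical policy-induced transition matrix P[s, s'] (rows sum to 1);
# the paper estimates the actual probabilities from the episodes.
P = np.array([
    [0.70, 0.25, 0.05],
    [0.20, 0.60, 0.20],
    [0.05, 0.35, 0.60],
])
R = np.array([1.0, 0.0, 0.0])  # reward 1 only in the "normal" state
gamma = 0.9                    # discount rate from the article

# The Bellman equation V = R + gamma * P @ V is linear, so it can be
# solved directly instead of iterating the backup to convergence:
V = np.linalg.solve(np.eye(3) - gamma * P, R)
for s, v in zip(states, V):
    print(f"V({s}) = {v:.3f}")
```

With this reward, states that reach "normal" quickly get higher values, which is the quantity the optimal policy maximizes.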
Estimation of Reward Functions by Inverse Reinforcement Learning
In this study, we introduce inverse reinforcement learning to supplement the reward information that is not explicitly recorded in the data being analyzed, estimating the reward function from the physician's behavior.
Usually in reinforcement learning, the reward function is given in advance, since setting a reward function is the core of learning a policy; there is no need to estimate it from the data. When the data contain no information for defining rewards, as in this case, the reward function to adopt depends on the purpose of the analysis, and the selection criteria are unclear. Therefore, in this study, we use inverse reinforcement learning to estimate the reward function from the physician's behavior and then learn the policy. Specifically, we apply a Bayesian inverse reinforcement learning algorithm called PolicyWalk. PolicyWalk here assumes that the reward value depends only on the blood glucose state, so the reward function R is a three-dimensional vector R = (Rnormal, Rmedium, Rsevere). Since only the relative sizes of these rewards matter, they are normalized so that Rnormal + Rmedium + Rsevere = 1. The algorithm treats the reward as a vector and employs policy iteration, an algorithm that learns by alternately evaluating and improving a candidate policy.
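A heavily simplified sketch of Bayesian IRL in the spirit of PolicyWalk is shown below: a Metropolis random walk over normalized reward vectors, scoring each candidate by the likelihood of the expert's (state, action) pairs under a softmax policy. The transition tensor, demonstrations, step size, and confidence parameter α are all invented placeholders, and value iteration stands in for the policy iteration with policy reuse that PolicyWalk proper performs:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, gamma, alpha = 3, 0.9, 5.0  # discount and confidence (assumed values)

# Hypothetical transition tensor P[a, s, s'] for 2 actions over the states
# (normal, medium, severe); the real model estimates this from the records.
P = np.array([
    [[0.60, 0.30, 0.10], [0.30, 0.50, 0.20], [0.10, 0.40, 0.50]],  # action 0
    [[0.80, 0.15, 0.05], [0.50, 0.40, 0.10], [0.20, 0.50, 0.30]],  # action 1
])

def q_values(R):
    """Q(s, a) under the optimal policy for a state-only reward vector R,
    computed by simple value iteration."""
    V = np.zeros(n_states)
    for _ in range(200):
        Q = R[:, None] + gamma * np.einsum("ast,t->sa", P, V)
        V = Q.max(axis=1)
    return Q

def log_likelihood(R, demos):
    """Log-probability of expert (state, action) pairs under a softmax
    (Boltzmann) policy with confidence alpha."""
    Q = q_values(R)
    return sum(alpha * Q[s, a] - np.log(np.exp(alpha * Q[s]).sum())
               for s, a in demos)

demos = [(0, 1), (1, 1), (1, 1), (2, 0)]  # invented expert demonstrations
R = np.full(n_states, 1 / 3)              # start from a uniform reward
ll = log_likelihood(R, demos)
samples = []
for _ in range(2000):
    prop = np.clip(R + rng.normal(0, 0.05, n_states), 1e-3, None)
    prop /= prop.sum()                    # keep Rnormal + Rmedium + Rsevere = 1
    ll_prop = log_likelihood(prop, demos)
    if np.log(rng.random()) < ll_prop - ll:  # Metropolis step, flat prior
        R, ll = prop, ll_prop
    samples.append(R)
print("posterior mean reward:", np.mean(samples, axis=0).round(3))
```

The chain of accepted reward vectors approximates the posterior over rewards given the demonstrations, which is how the paper arrives at a distribution concentrated on one reward vector.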
Using the episodes extracted from the data, we first estimated the MDP state transition probabilities and the physician's policy π. The discretized HbA1c values comprise the state set S, and the medication combinations correspond to the action set A. In estimating the probabilities, we use Laplace smoothing to avoid zero probabilities for transitions unobserved in the small amount of data. As mentioned above, the observed states are discretized into three levels, and the reward function is defined by the three-dimensional vector r = (Rnormal, Rmedium, Rsevere).
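Laplace (add-α) smoothing of the transition counts can be sketched as follows. For simplicity this estimates P(s' | s) only; the actual model also conditions on the action (medication combination), and the episodes here are toy data:

```python
import numpy as np

n_states = 3  # 0=normal, 1=medium, 2=severe

def transition_probs(episodes, n_states, alpha=1.0):
    """Estimate P(s' | s) from observed state sequences with Laplace
    (add-alpha) smoothing so unseen transitions get nonzero mass."""
    counts = np.zeros((n_states, n_states))
    for ep in episodes:
        for s, s_next in zip(ep, ep[1:]):
            counts[s, s_next] += 1
    smoothed = counts + alpha
    return smoothed / smoothed.sum(axis=1, keepdims=True)

# Toy episodes of discretized HbA1c states
episodes = [[0, 1, 1, 0], [1, 2, 1, 1]]
P = transition_probs(episodes, n_states)
print(P.round(3))  # every row sums to 1; no transition has probability 0
```

Without the smoothing term, any transition absent from the records would get probability 0, distorting both policy evaluation and the IRL likelihood.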
Results of the reward function estimated by inverse reinforcement learning.
This evaluation presents the reward function estimated by inverse reinforcement learning from the specialists' blood glucose management behavior in the setting described above.
The MCMC sampling results concentrate on r = (0.01, 0.98, 0.01) with probability almost equal to 1 (Figure 1: a typical sampling sequence for Rmedium), suggesting that the reward value of the medium state is the highest. We also compared the log-likelihoods of the observations under the reward vectors (0.98, 0.01, 0.01), (0.01, 0.98, 0.01), and (0.01, 0.01, 0.98), obtaining -159878, -143568, and -162928 respectively; the likelihood was again highest for medium.
The purpose of this study was to develop a decision support system for the blood glucose management process using reinforcement learning. We modeled state transitions with a Markov Decision Process (MDP) and used inverse reinforcement learning to estimate a reward function that does not exist in the data. The evaluation confirmed, via sampling, that the estimated reward was highest for the medium state.
The evaluation suggests that the reward of the medium state is 0.98, higher than that of the other states. This result is counterintuitive, considering that the purpose of treatment is to move patients toward the normal state. Possible reasons include: the MDP model does not adequately capture the complexity of the physician's decision-making process, and the number of patients in the medium state is larger than in the other states in the data used. The paper also notes that under the reward vector r = (0.01, 0.98, 0.01), the optimal behavior is not very similar to the specialists' behavior, suggesting that the assumption that the reward depends only on the patient's current state is too simplistic. Thus, although inverse reinforcement learning still faces implementation challenges in the medical field, it can estimate reward functions that cannot be computed directly from the data, and further development is expected.