AI Drives Behavior Change To Improve Prognosis! A Proposed Model For Deriving Optimal Intervention Policies Using Reinforcement Learning!
3 main points
✔️ Inappropriate health-related behaviors and habits are considered to be deeply involved in the onset and severity of chronic diseases, including diabetes and cancer
✔️ In this study, we propose a method for deriving intervention policies for behavior change to improve prognosis, based on efficient learning and taking individual characteristics into account.
✔️ The results confirm that the proposed model shows better performance than standard reinforcement learning algorithms
A reinforcement learning based algorithm for personalization of digital, just-in-time, adaptive interventions
written by
(Submitted on May 2021)
Comments: Artif Intell Med.
The images used in this article are from the paper, the introductory slides, or were created based on them.
background
Can reinforcement learning promote behavior change that improves prognosis?
In this study, we aim to construct a reinforcement-learning-based model that derives appropriate interventions to encourage behaviors supporting the prevention and treatment of diabetes and other chronic diseases.
In recent years, the association between unhealthy behaviors and chronic diseases has drawn attention, and there is a need to take into account the individual's unique lifestyle and priorities, psychological and psychosocial background, and environmental factors. In chronic diseases in particular, where pathological changes develop over the long term and improvement after onset is difficult, it is important to introduce treatment and prevention methods tailored to individual characteristics. In this context, interventions based on digital devices have attracted attention for their effectiveness in supporting people's self-management activities: with the recent development of mobile and health sensors (worn, implanted, or ingested), the means to deliver instantaneous, targeted interventions when needed, across a wide range of health care and community settings, are becoming more prevalent. Against this background, the concept of JITAI (just-in-time adaptive intervention) has been proposed, and research is underway on instantaneous intervention methods built from several components, including decision points and intervention options. The following provides a brief overview of this study.
This study aims to build an algorithm that learns personalized intervention strategies accounting for both long-term and moment-to-moment changes; building such a model is expected to maximize adherence and reduce the burden of interventions in real-world care programs, with the ultimate aim of achieving better clinical outcomes. To this end, we introduce a reinforcement learning (RL) algorithm to personalize JITAIs, assuming a chronic disease setting. On the other hand, in environments where no initial data set is available at all, learning is often very costly in both time and computation; therefore, we propose a model that rapidly learns acceptable personalized policies and dynamically and systematically adapts intervention strategies based on momentary and long-term changes in the patient.
technique
In this section, we provide an overview of the proposed model used in this study. The proposed model uses two RL models: an opportune-moment-identification model and an intervention-selection model.
The entire algorithm
The approach consists of two major steps - see figure below.
The first step, the training phase, trains the state classifier used to reduce the number of random actions in unknown states - states that have not been visited before. The proposed model runs two RL models concurrently - see figure below: the opportune-moment-identification model monitors instantaneous changes and takes action accordingly; at this stage only the selective eligibility trace method is used, because no state classifier is available yet. The intervention-selection model, in turn, monitors habitual changes in the individual patient. After the training phase, the state classifier becomes available for real experiments, and the opportune-moment-identification model can use it to further improve the learning process.
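As a rough illustration of this training-phase idea, the following minimal Python sketch (hypothetical feature names and classifier choice; the paper does not specify these details) trains a state classifier from the states visited so far, labelled with the best-known action in each, so that a state never visited before can later receive a predicted action instead of a purely random one.

```python
# Hypothetical sketch: train a state classifier (SCL) from states visited
# during the training phase, labelled with the best-known action in each,
# so that unknown states later need fewer random actions. Not the authors' code.
from sklearn.tree import DecisionTreeClassifier

def train_state_classifier(common_policy):
    """common_policy: dict mapping state feature tuples to per-action Q-values
    accumulated while learning the common policy (illustrative structure)."""
    X = [list(state) for state in common_policy]
    y = [max(q_values, key=q_values.get) for q_values in common_policy.values()]
    scl = DecisionTreeClassifier(max_depth=5)
    scl.fit(X, y)
    return scl

# Example with made-up numeric state features (hour of day, habit-strength bucket):
cp = {(8, 1): {"Deliver_Nothing": 0.2, "Deliver_Intervention": 0.6},
      (21, 3): {"Deliver_Nothing": 0.5, "Deliver_Intervention": 0.1}}
scl = train_state_classifier(cp)
print(scl.predict([[12, 2]])[0])   # predicted action for a state never visited before
```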
The overall algorithm flow is as follows:
1. Take the inputs from the specific data elements used - the first four inputs are the main components of the RL environment, i.e., inputs about the environment and the agent. The environment element records the current state and the history of transitions, and the action plan records the person's planned daily activities. The Common Policy (CP) accumulates how many times each action has been selected together with its state, and the State Classifier (SCL) is a trained model used to predict which action to take in an unknown state.
2. The opportune-moment-identification model is executed only when a certain intervention type has been selected. Based on the greedy algorithm, the model examines the current state - omi_st: when the state is unknown, the state classifier is used in place of a purely random action; otherwise, the greedy algorithm determines the action - omi_at - by selecting the action with the highest Q-value for that state. After the action is selected, the environment transitions from the current state to the next state according to the selected action.
3. Two simulations are then performed: first, if an intervention has been delivered, the response to it is simulated - the outcome is either abandoning the intervention or engaging with it; second, the performance of the target behavior is simulated.
4. After obtaining the reward - omi_rt - for the performed action, the transition is recorded in the episode analysis object. For each selected intervention, the opportune-moment-identification model runs over all time frames associated with the planned activity in the action plan.
5. Advance the habit formation model - a mathematical model for simulating habits in patients, whose details we omit here - by one step and return to the intervention-selection model. The next step of the intervention-selection model is obtained using the updated parameters of the habit formation model, and a reward is generated. This loop is repeated for all time frames generated by the action plan. When an episode ends, the model is updated with the data collected for that episode. A compact sketch of this loop is shown below.
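To make steps 2-5 concrete, here is a compressed, self-contained Python sketch of one episode under simplified assumptions (all names, reward values, and the habit-update rule are illustrative inventions, not the authors' code):

```python
import random

ACTIONS = ["Deliver_Nothing", "Deliver_Intervention"]   # hypothetical action names

def run_episode(q_table, action_plan, engage_prob, habit_strength,
                alpha=0.1, epsilon=0.1):
    """Hypothetical one-episode sketch of steps 2-5: for each time frame in the
    action plan, an action is chosen greedily (with occasional exploration),
    the response to a delivered intervention and the target behavior are
    simulated, a reward is computed, and the habit model advances one step."""
    total_reward = 0.0
    for timeframe in action_plan:                         # e.g. morning / noon / evening
        state = (timeframe, round(habit_strength, 1))
        q_table.setdefault(state, {a: 0.0 for a in ACTIONS})
        # step 2: action selection (greedy with a small exploration rate)
        if random.random() < epsilon:
            action = random.choice(ACTIONS)
        else:
            action = max(q_table[state], key=q_table[state].get)
        # step 3: simulate the intervention response and the target behavior
        engaged = action == "Deliver_Intervention" and random.random() < engage_prob
        performed = engaged or random.random() < habit_strength
        # step 4: reward for the performed action
        reward = 1.0 if performed else (-0.1 if action == "Deliver_Intervention" else 0.0)
        total_reward += reward
        # one-step Q-learning update (next-state bootstrapping omitted for brevity)
        q_table[state][action] += alpha * (reward - q_table[state][action])
        # step 5: advance the habit formation model by one step (crude proxy)
        habit_strength = max(0.0, min(1.0, habit_strength + (0.02 if performed else -0.01)))
    return total_reward, habit_strength

# Example usage with a three-decision-point action plan:
q = {}
r, h = run_episode(q, ["morning", "noon", "evening"], engage_prob=0.6, habit_strength=0.2)
print(round(r, 2), round(h, 2))
```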
Opportune-moment-identification model
In this section, we give an overview of the opportune-moment-identification model, one of the RL models adopted in the proposed model - see the figure below.
The above figure shows the interaction between the environment and the agent in this model: ai is the action taken when the environment is in state si, and ri is the reward received for ai. At each decision point, the opportune-moment-identification model decides whether or not to intervene, step by step, until the action is performed or the person engages with the intervention. Because the response to an intervention delivered from the mobile device can be delayed, actions taken in the past need to be rewarded when the timing of the intervention turns out to have been appropriate. In this study, past states of the agent are credited selectively: only positive rewards from engaged interventions are propagated back, and only to past steps where the action was Deliver_Nothing even though the state was a good opportunity to intervene, i.e., the behavior was not practiced without an intervention. For example, in the above figure, a7 is a Deliver_Nothing action, but s7 is a good opportunity to intervene; crediting it makes the policy of taking the Deliver_Intervention action when visiting s7 effective, so the intervention action will be taken there.
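The retroactive crediting described above might look roughly like the following sketch (hypothetical; in the paper the selective eligibility traces credit only states judged suitable for intervention, whereas here every earlier Deliver_Nothing step simply receives decayed credit for simplicity):

```python
def apply_selective_traces(q_table, episode, final_reward, alpha=0.1, decay=0.8):
    """Hypothetical sketch of the selective-crediting idea: only a positive
    reward obtained after an engaged intervention is propagated back to
    earlier Deliver_Nothing steps, so that a state like s7 in the figure
    learns a higher value for Deliver_Intervention on later visits."""
    if final_reward <= 0:
        return                                   # only positive outcomes are credited
    credit = final_reward
    for state, action in reversed(episode):      # walk back through the episode
        if action == "Deliver_Nothing":
            # credit the intervention action that *should* have been taken here
            q = q_table.setdefault(state, {"Deliver_Nothing": 0.0,
                                           "Deliver_Intervention": 0.0})
            q["Deliver_Intervention"] += alpha * credit
        credit *= decay                          # older steps get less credit

# Example: an episode where intervening at s7 would have been opportune
episode = [(("s5",), "Deliver_Nothing"), (("s7",), "Deliver_Nothing"),
           (("s9",), "Deliver_Intervention")]
q = {}
apply_selective_traces(q, episode, final_reward=1.0)
print(q[("s7",)])   # Deliver_Intervention now has a small positive value
```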
result
This section describes the evaluation environment and results.
Assessment Environment
In this section, we describe the verification method by simulation.
In the evaluation, the environment is set up from two perspectives: the action plan and the persona - an archetypal example of behavior change. For the action plan, we assume a simple plan with three predefined decision points (morning, noon, evening); for the personas, we assume four personas and consider the following four characteristics (a toy sketch combining them follows the list):
1. habituation
To realistically simulate the concepts associated with habits - the strength to perform a target behavior automatically, without any external signal - we utilize the habit formation model. One of the features of this model is the commitment intensity: an indicator of the time required for a behavior to become a habit. This parameter takes a value between 0 and 1, with a higher value indicating a greater appreciation of the target behavior and a stronger desire to perform it.
2. everyday activities
Daily activities vary from individual to individual. Therefore, we introduce an activity timeline that represents all daily activities from waking up to going to bed - the aim is to simulate a state suitable for intervention and action planning. The timeline will be populated with predefined activities and assigned to each person semi-randomly per learning episode - per simulated day.
3. simulation of response to the intervention provided
Two assumptions are made about the response to an intervention: the subject prefers a specific type of intervention, and the current everyday activity must be suitable for practicing it. The individual's preference is therefore expressed as a probability of responding to each intervention type - the preferences over intervention types are discrete, and the probabilities do not necessarily sum to 1.
4. simulation of the actual behavioral record
Behavioral records are determined by the predictions made by the habit formation model about the memory of the behavior and the suitability of the daily activity to perform the behavior: if the prediction is positive, the behavior is assumed to be performed within the corresponding activity time.
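A toy version of such a simulated persona, combining the four characteristics above, might look like the following (all field names, probabilities, and update rules are illustrative assumptions, not values from the paper):

```python
import random
from dataclasses import dataclass

@dataclass
class Persona:
    """Toy persona for the simulated environment (illustrative values only)."""
    commitment_intensity: float     # 0..1, higher = behavior becomes habitual faster
    intervention_preference: dict   # probability of responding, per intervention type
    activities: list                # daily activity timeline (semi-random per episode)
    habit_strength: float = 0.1

    def respond(self, intervention_type, current_activity):
        """Simulate the response to a delivered intervention: the persona engages
        with a probability given by its preference, and only if the current
        activity is suitable for practicing the intervention."""
        suitable = current_activity in ("free_time", "commute")
        p = self.intervention_preference.get(intervention_type, 0.0)
        return suitable and random.random() < p

    def perform_behavior(self):
        """Simulate the actual behavioral record via a crude habit proxy."""
        performed = random.random() < self.habit_strength
        # habit grows faster for personas with higher commitment intensity
        delta = 0.05 * self.commitment_intensity if performed else -0.01
        self.habit_strength = max(0.0, min(1.0, self.habit_strength + delta))
        return performed

p = Persona(commitment_intensity=0.7,
            intervention_preference={"reminder": 0.6, "motivational": 0.3},
            activities=["commute", "work", "free_time"])
print(p.respond("reminder", "free_time"), p.perform_behavior())
```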
RL Model Comparison
Here, we compare the RL algorithms in terms of rewards aggregated per episode. The three algorithms compared are Q-Learning (QL), QL extended with selective eligibility traces, which direct action selection toward the intervention (QL-SET), and the proposed QL-SET-TL, which additionally makes use of the trained state classifier. From the evaluation results, QL-SET-TL collected more rewards than the other two algorithms. Considering the ratio of the number of interventions engaged with to the total number of interventions sent, this suggests that QL-SET-TL is more effective. Regarding asymptotic performance as well, QL-SET-TL outperforms the other two. These results confirm that the proposed model, QL-SET-TL, has the best performance.
consideration
The present study aims to build a model based on RL algorithms that derives an optimal policy, specific to individual characteristics, for the timing, frequency, and type of interventions - the algorithm optimizes these parameters using two RL models. The selective eligibility traces manipulate credit along the behavioral trajectory, taking into account selective rewards for past behavior and the suitability of each state for engaging in the intervention. The evaluation simulated four individuals differing in daily activities, preferences for specific intervention types, and attitudes toward the targeted health-related behavior, and showed better performance compared with standard RL algorithms. The model is expected to accelerate the development of future self-management support systems that assist diabetic patients in their daily lives, and the proposed approach is likely to improve care programs in the healthcare domain more broadly. Healthy lifestyle habits such as physical activity and diet have been reported to reduce the risk of developing many chronic diseases and even to improve existing ones; the proposed model can therefore improve the effectiveness of behavior change programs and people's health through personalized intervention delivery strategies.
One of the strengths of this research is a learning mechanism that models both the instantaneous and the long-term course of a person's behavior using RL methods and personalizes the type, frequency, and timing of interventions. To date, no method that combines multiple RL models for behavior change has been reported. In addition, since medical data are characteristically long-term time-series data, building a model that considers not only a short-term but also a long-term perspective is a strength of this method.
There are also issues to be addressed, such as improving the choice of intervention type: the current model follows a model-free approach, but the intervention-selection model could be designed as a model-based system, which would allow the value function to be learned through intermediate simulations before an action is taken and could potentially achieve higher accuracy. To realize such a structure, the intervention-selection model could be split into two models, with the type and frequency factors adjusted separately. It is also possible that the parameters considered in the present model are not sufficient; additional parameters related to the environment, the mobile phone, and the patient could be used to build a more accurate model. For such improvements, further optimization of the algorithm - e.g., better generalization over states - is needed and is considered future work.