Can Side Effects Be Prevented By Considering Uncertain Factors? Propose A System That Combines Bayesian And Reinforcement Learning!

Reinforcement Learning 21/01/2022

3 main points
✔️ Focus on model-informed precision dosing (MIPD), which uses therapeutic drug and biomarker monitoring to improve the efficacy and safety of pharmacotherapy
✔️ A new MIPD approach combining Bayesian data assimilation (DA) and reinforcement learning (RL) is proposed
✔️ We also showed that the reward function in RL can be used to identify patient factors in dosage decisions.

Reinforcement learning and Bayesian data assimilation for model‐informed precision dosing in oncology
written by Corinna Maier, Niklas Hartung, Charlotte Kloft, Wilhelm Huisinga, Jana de Wiljes
(Submitted on 7 Mar 2021)
Comments: CPT Pharmacometrics Syst Pharmacol.
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

The images used in this article are from the paper, the introductory slides, or were created based on them.

Background

Is it possible to increase the effectiveness of treatment while preventing side effects according to individual characteristics with uncertainty?

This study aims to derive the optimal dosage and treatment strategy for a drug while accounting for uncertain individual characteristics, using Bayesian models and reinforcement learning for model-informed precision dosing (MIPD).

Personalized dosing, which provides the optimal dose of a drug based on an individual's condition, is expected to improve the safety and efficacy of medicines by reducing unwanted side effects. However, both the occurrence of side effects and the effectiveness of treatment vary from person to person and are subject to uncertainty, including the extent of drug efficacy. One such side effect is neutropenia in cancer treatment: a condition in which medication suppresses immune cells, leading to increased susceptibility to infection. This side effect increases the incidence of diseases such as pneumonia and influenza and increases the possibility of a worse prognosis; there is therefore a need to derive an optimal dose that prevents the side effects of medication without interfering with the effectiveness of the drug.

To address these issues, we propose three approaches to control neutropenia using Bayesian data assimilation (DA) and reinforcement learning (RL): a Bayesian data assimilation approach (DA-guided dosing), a reinforcement learning approach (RL-guided dosing), and a combined Bayesian and reinforcement learning approach (DA-RL-guided dosing). These approaches aim to derive a therapeutically effective medication policy while taking into account uncertain individual characteristics and reducing the side effect, neutropenia, in cancer treatment.

What is Model-informed precision dosing - MIPD?

First of all, we will briefly describe the model-informed precision dosing - MIPD - that is utilized in this study.

MIPD takes into account the drug-disease-patient system, prior knowledge of the relevant variability (e.g., from non-linear mixed-effects analysis), and patient-specific therapeutic drug/biomarker monitoring (TDM) data to individualize the dosage. In MAP-guided dosing, the results derived from maximum a posteriori (MAP) estimation are evaluated against a utility function or target concentration to determine the next dose. However, many therapies are characterized by subtherapeutic and toxic ranges, which makes it difficult to define target concentrations or utility functions that account for these uncertainties. For therapeutic ranges, previous studies have pointed out that MAP predictions are inappropriate because they use biased point estimates and ignore the uncertainty about falling outside the range. In this study, we aim to develop a method that takes these uncertain individual characteristics into account by combining Bayesian models and reinforcement learning.

What is neutropenia?

Here, we will discuss neutropenia, which is the subject of evaluation in this study.

Neutropenia is a side effect of anticancer chemotherapy in which the number of neutrophils, a type of immune cell, decreases. In severe neutropenia, the neutrophil granulocytes are depleted, so the immune system functions improperly and susceptibility to life-threatening infections increases. Depending on the minimum neutrophil concentration (the nadir), neutropenia is classified by grade g, from no neutropenia (g=0) to life-threatening (g=4). Neutropenia can also serve as a surrogate indicator of the effectiveness of the medication dose (median overall survival); neutrophil concentrations can therefore be used as a biomarker to derive doses and treatment strategies for chemotherapeutic agents that cause neutropenia.
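As a minimal sketch of how a nadir concentration maps to a grade, the function below uses the commonly cited CTCAE-style cut-offs for grades 2 to 4 (1.5, 1.0, and 0.5 x 10^9 cells/L). The lower limit of normal (2.0 x 10^9 cells/L) and all names are assumptions made for illustration; the article itself does not list the thresholds.

```python
# Thresholds in 10^9 cells/L. The cut-offs follow the usual CTCAE-style
# convention; the lower limit of normal (2.0) is an assumed value.
LOWER_LIMIT_OF_NORMAL = 2.0
GRADE_THRESHOLDS = [(LOWER_LIMIT_OF_NORMAL, 0), (1.5, 1), (1.0, 2), (0.5, 3)]


def neutropenia_grade(nadir: float) -> int:
    """Map a nadir neutrophil concentration (10^9 cells/L) to a grade 0-4."""
    if nadir < 0:
        raise ValueError("concentration must be non-negative")
    for lower_bound, grade in GRADE_THRESHOLDS:
        if nadir >= lower_bound:
            return grade
    return 4  # below 0.5: life-threatening


def in_target_range(nadir: float) -> bool:
    """Grades 1-3 form the target range used later in the article."""
    return neutropenia_grade(nadir) in (1, 2, 3)
```

Under these assumed thresholds, a nadir of 1.2 falls in grade 2 (inside the target range), while 0.3 falls in grade 4.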

Purpose of the research

In this study, we aim to develop a system that prevents neutropenia, a side effect of cancer treatment, and derives appropriate doses of medication according to the patient's condition. Specifically, we propose three methods based on Bayesian data assimilation (DA) and reinforcement learning (RL). The first method, DA-guided dosing, uses a Bayesian model to derive a more accurate dosing plan by considering uncertain parameters, improving on existing online MIPD. The second method, RL-guided dosing, uses Monte Carlo tree search (MCTS) with the upper confidence bound applied to trees (UCT) and aims to improve the learning strategy. The third method, DA-RL-guided dosing, combines DA and RL with therapeutic drug/biomarker monitoring (TDM) data to account for uncertain individual characteristics and to improve interpretability, for example through the reward function. The evaluation compares these methods with existing methods in terms of dosing performance and examines the factors behind dose selection.

Method

In this section, we describe the proposed methods: DA-guided dosing, RL-guided dosing, and DA-RL-guided dosing.

Assumed environment

In this study, as a hypothetical environment, we consider a chemotherapy dosing schedule with paclitaxel, a type of anticancer drug: a single dose per cycle 𝑐=1,⋯,𝐶, given every 3 weeks for a total of 6 cycles (𝐶=6). To select the dose, the physician uses various sources of information about the patient: covariates cov (gender, age, etc.); treatment history (drug, dosing regimen, etc.); and TDM data on PK/PD (drug concentration, response, toxicity, etc.). Despite these multiple sources, the information acquired is partial and incomplete, as only a few noisy measurements are available at each time point; MIPD therefore combines prior information about the drug-patient-disease system with patient-specific TDM data.

The patient state consists of the covariates sex and age, which are important predictors of exposure, the baseline absolute neutrophil count ANC0, which is a parameter of the drug-effect model, and the neutropenia grades 𝑔 of previous cycles.
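The patient state described above can be sketched as a small immutable record. The class and field names below are hypothetical and chosen only for illustration; making the record hashable lets it serve as a key in the lookup tables used by tabular RL.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class PatientState:
    """Patient state after some treatment cycles (illustrative field names)."""
    sex: str                       # covariate
    age: int                       # covariate, in years
    anc0: float                    # baseline absolute neutrophil count
    grades: Tuple[int, ...] = ()   # neutropenia grades of completed cycles

    @property
    def cycle(self) -> int:
        """Number of completed treatment cycles."""
        return len(self.grades)

    def after_cycle(self, grade: int) -> "PatientState":
        """Return the state reached after one more cycle with the given grade."""
        if grade not in range(5):
            raise ValueError("grade must be between 0 and 4")
        return PatientState(self.sex, self.age, self.anc0, self.grades + (grade,))
```

A new patient starts with an empty grade history, and each completed cycle appends the observed grade, so one episode corresponds to one path of states.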

The MIPD Framework

In this section, we will discuss the MIPD to be analyzed.

In this study, MIPD is built on prior knowledge obtained from nonlinear mixed-effects (NLME) analysis of clinical trials, which provides the structural and observational models.
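For orientation, a generic textbook form of an NLME model is sketched below. It is not the exact set of equations used in the paper: the symbols f, h, x, η, Ω, ε, and Σ are assumptions introduced here, while θ (individual parameters), cov (covariates), d (dose), and y (observations) follow the notation used elsewhere in this article.

```latex
\begin{aligned}
\frac{dx}{dt}(t) &= f\bigl(x(t); \theta, d\bigr), \qquad x(0) = x_0(\theta)
  && \text{(structural model)} \\
\theta &= \theta^{\mathrm{TV}}(\mathrm{cov}) \cdot e^{\eta}, \qquad \eta \sim \mathcal{N}(0, \Omega)
  && \text{(inter-individual variability)} \\
y_j &= h\bigl(x(t_j), \theta\bigr) + \epsilon_j, \qquad \epsilon_j \sim \mathcal{N}(0, \Sigma)
  && \text{(observation model)}
\end{aligned}
```

In this form, the typical values θ^TV depend on the covariates, the random effects η describe how an individual deviates from the typical patient, and ε captures measurement noise in the TDM data.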

The proposals in this study are summarized as the following three methods.

(i) The offline approach supports dose individualization based on pre-computed model-informed dosing tables (MIDTs) or dosing decision trees. At the start of treatment, a dose is recommended based on the patient's covariates and baseline measurements; during treatment, the observed TDM data are used to determine the path through the table or tree. The treatment is individualized to the patient and uncertainty is taken into account, but the dose individualization procedure itself does not change; i.e., the tree and tables are static.

(ii) The online approach determines the recommended dose based on the patient's model state and simulation results: individual TDM data are assimilated by Bayesian or maximum a posteriori (MAP) estimation, yielding the posterior distribution or a MAP point estimate, respectively. While this approach tailors the model parameters to the patient, it is computationally demanding and requires additional information technology infrastructure and software to be implemented in clinical practice.

(iii) The offline-online approach combines the advantages of dosing decision trees and individualized models. It aims at a more accurate model that takes into account individual characteristics by adding uncertainty and information about prior states through data assimilation - DA - to the reinforcement learning approach. Personalized models are used for two main purposes: accurate state inference from sparsely observed TDM data - sampling - and individualization of the dosing decision tree.

Reward function

Ideally, the reward function in reinforcement learning should correspond to the utility of beneficial and harmful effects on patients. In this study, a greater penalty is imposed on missing the short-term goal (avoidance of life-threatening grade 4 neutropenia) than on missing the long-term goal (the increase in median overall survival associated with grade 1-4 neutropenia). DA additionally allows quantified individual uncertainty (the probability of being inside or outside the target range) to be considered, resulting in a model that is closer to clinical practice.
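A minimal sketch of such a reward function follows. The numeric penalties are hypothetical; only their ordering reflects the text (grade 4 is penalized more strongly than grade 0, and grades 1-3 are treated as the target range). The second function illustrates how posterior grade probabilities from DA could be folded in as an expected reward.

```python
from typing import Sequence

# Hypothetical penalty values: only the ordering is taken from the article.
PENALTY_GRADE_0 = -1.0   # no neutropenia: long-term goal missed
PENALTY_GRADE_4 = -2.0   # life-threatening: short-term goal missed
REWARD_TARGET = 0.0      # grades 1-3: target range


def reward(grade: int) -> float:
    """Reward for the neutropenia grade observed in one cycle."""
    if grade == 0:
        return PENALTY_GRADE_0
    if grade == 4:
        return PENALTY_GRADE_4
    if grade in (1, 2, 3):
        return REWARD_TARGET
    raise ValueError("grade must be between 0 and 4")


def expected_reward(grade_probabilities: Sequence[float]) -> float:
    """Expected reward given probabilities for grades 0..4 (in that order)."""
    if len(grade_probabilities) != 5:
        raise ValueError("need one probability per grade 0..4")
    return sum(p * reward(g) for g, p in enumerate(grade_probabilities))
```

With these assumed values, a patient with a 10% probability of grade 0 and a 10% probability of grade 4 has an expected reward of -0.3.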

RL-guided dosing

In this section, we describe a method for deriving a medication policy using reinforcement learning (RL).

In RL, the target task is formulated as a Markov decision process (MDP), which models sequential decision making under uncertainty and is treated as a stochastic optimal control problem. The objective of the RL agent (a virtual physician) is to optimize a specific long-term expected return (the response) in an uncertain feedback environment (a virtual patient) by learning which actions (medication doses) are best, i.e., by deriving a strategy.

The MDP consists of the state 𝑆𝑐, the action (dose) 𝐷𝑐, and the reward 𝑅𝑐, where the subscript 𝑐 denotes time (the treatment cycle), and an episode corresponds to a path in the tree of possibilities. The transition of the patient state is defined by the transition probability ℙ[𝑆𝑐+1=𝑠𝑐+1|𝑆𝑐=𝑠𝑐,𝐷𝑐+1=𝑑𝑐+1], which allows uncertainty to be taken into account. The reward is determined by the reward function (i.e., 𝑅𝑐=𝑅(𝑆𝑐)), and the dosing policy 𝜋 models how the next dose is chosen given the current state.

Thus, the policy defines the behavior and strategy of the virtual physician agent. The dosing policy is evaluated based on the return 𝐺𝑐 at time step 𝑐, defined as a discounted sum of rewards over the remaining treatment period. The discount factor 𝛾 ∈ [0, 1] balances short-term treatment goals (𝛾 → 0) against long-term treatment goals (𝛾 → 1), and the aim is to find the policy that maximizes the expected long-term return 𝑞𝜋.
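The return can be illustrated with a short sketch that computes the discounted sum 𝑟𝑐+1 + 𝛾𝑟𝑐+2 + ⋯ over the remaining cycles. The function name and the example rewards are assumptions made for illustration.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Compute r_{c+1} + gamma * r_{c+2} + gamma^2 * r_{c+3} + ...

    `rewards` lists the rewards of the remaining cycles in chronological order.
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    total = 0.0
    # Accumulate from the last cycle backwards (Horner's scheme).
    for r in reversed(rewards):
        total = r + gamma * total
    return total
```

For example, with remaining rewards [-1, 0, -2], 𝛾 = 0 counts only the immediate reward (-1), 𝛾 = 1 sums everything (-3), and 𝛾 = 0.5 gives -1.5.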

In addition, model-based RL relies on sampling and estimates expected values by sample approximation. To facilitate the computation, continuous variables such as age and ANC0 are discretized into covariate classes ℭ𝔒𝔙𝑙, 𝑙=1, ⋯, 𝐿. The estimate associated with the policy 𝜋𝑘 is defined in terms of 𝑁𝑘(𝑠,𝑑), the number of times the dose 𝑑 was chosen in the patient state 𝑠 among the first 𝑘 episodes, and the sample return 𝐺(𝑘)𝑐=𝑟(𝑘)𝑐+1+𝛾𝑟(𝑘)𝑐+2+⋯.

To handle the trade-off between exploitation (selecting doses with known high returns) and exploration (selecting new doses with potentially higher returns), Monte Carlo tree search (MCTS) is used in conjunction with the upper confidence bound applied to trees (UCT). Once the policies have converged, the final policy is 𝜋∗=argmax 𝑞̂𝜋UCT with 𝜀𝑐=0, i.e., no exploration.
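One plausible reading of this sample approximation is a Monte Carlo average: the value of choosing dose 𝑑 in state 𝑠 is estimated as the mean of the returns observed over the 𝑁𝑘(𝑠,𝑑) episodes in which that choice was made. The sketch below implements this with an incremental mean; the class and method names are hypothetical and the paper's exact estimator may differ.

```python
from collections import defaultdict
from typing import Hashable


class ActionValueEstimate:
    """Sample-average estimate of the return for each (state, dose) pair."""

    def __init__(self) -> None:
        self.counts = defaultdict(int)     # N_k(s, d)
        self.values = defaultdict(float)   # running mean of returns

    def update(self, state: Hashable, dose: float, episode_return: float) -> None:
        """Fold the return of one more episode into the running mean."""
        key = (state, dose)
        self.counts[key] += 1
        n = self.counts[key]
        self.values[key] += (episode_return - self.values[key]) / n

    def value(self, state: Hashable, dose: float) -> float:
        """Current estimate; 0.0 for pairs that were never visited."""
        return self.values.get((state, dose), 0.0)
```

Because states are used as dictionary keys, discretizing continuous covariates into classes keeps the table small enough for each entry to be visited repeatedly.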
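The exploration-exploitation rule can be sketched with the standard UCB1-style score, in which each dose's estimated value is increased by a bonus that shrinks as the dose is tried more often. The exact bonus term used in the paper is not given in this article, so the formula, the function name, and the tie-breaking behavior below are assumptions.

```python
import math
from typing import Dict, Sequence


def uct_select(doses: Sequence[float], values: Dict[float, float],
               counts: Dict[float, int], epsilon: float) -> float:
    """Pick a dose for one patient state.

    Score (assumed UCB1 form): q_hat(d) + epsilon * sqrt(ln N(s) / N(s, d)).
    """
    if epsilon == 0.0:
        # No exploration: greedy with respect to the current estimates.
        return max(doses, key=lambda d: values.get(d, float("-inf")))
    untried = [d for d in doses if counts.get(d, 0) == 0]
    if untried:
        return untried[0]  # try every dose at least once
    total = sum(counts[d] for d in doses)  # N(s): visits to this state

    def score(d: float) -> float:
        return values[d] + epsilon * math.sqrt(math.log(total) / counts[d])

    return max(doses, key=score)
```

With a positive 𝜀, rarely tried doses receive a large bonus and are sampled again; setting 𝜀 = 0 recovers the purely greedy final policy.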

DA-guided dosing

In this section, we describe the Bayesian data assimilation (DA) method for deriving dose plans. This method aims at unbiased prediction of treatment outcomes and comprehensive quantification of uncertainty in the quantity of interest (the grade of neutropenia), taking into account more information than MAP-based approaches. The probability of an outcome can be predicted by inferring the patient-specific uncertainty and propagating it over the predicted treatment time course; to do this, the uncertainty in the individual model parameters is updated sequentially using a Bayesian model.

The posterior distribution is approximated by sampling: each sample represents a possible patient state, and its weight factor ω defines how frequently it occurs. This has the advantage that the posterior probability of the subtherapeutic and toxic ranges (very low or high drug/biomarker concentrations) can be computed and these uncertainties taken into account. In this study, the optimal dose is derived as the dose that minimizes the weighted risk of missing the target range, i.e., the weighted posterior probabilities of 𝑔𝑐 = 0 and 𝑔𝑐 = 4. Since grade 4 is the range in which severe adverse effects occur, it is given a larger penalty.
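One common way to carry out such a sequential Bayesian update is with a weighted ensemble of parameter samples (particles): each new measurement rescales every weight by the likelihood of that measurement under the particle's prediction. The sketch below assumes Gaussian measurement noise and uses hypothetical function names; it is an illustration of the idea, not the paper's algorithm.

```python
import math
from typing import List, Sequence


def gaussian_likelihood(observed: float, predicted: float, sigma: float) -> float:
    """Density of the observation under additive Gaussian noise (assumed)."""
    z = (observed - predicted) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def update_weights(weights: Sequence[float], predictions: Sequence[float],
                   observation: float, sigma: float) -> List[float]:
    """Bayes update: w_i is proportional to w_i * p(y | theta_i), then normalized."""
    unnormalized = [w * gaussian_likelihood(observation, p, sigma)
                    for w, p in zip(weights, predictions)]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("all particles have zero likelihood")
    return [u / total for u in unnormalized]
```

Calling `update_weights` once per new TDM measurement gradually concentrates the weights on parameter values that are consistent with the patient's data.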
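The dose selection rule can be sketched as follows: for each candidate dose, sum the weights of the samples predicted to end in grade 0 or grade 4, apply the penalties, and pick the dose with the smallest weighted risk. The exponential toy model, the two-parameter samples, the grade thresholds (2.0 and 0.5), and the penalty values are all assumptions made so that the example is self-contained; the real method would use the PK/PD model.

```python
import math
from typing import Sequence, Tuple

Particle = Tuple[float, float]  # (anc0, drug sensitivity): hypothetical


def toy_nadir(particle: Particle, dose: float) -> float:
    """Toy stand-in for the PK/PD model: nadir falls exponentially with dose."""
    anc0, sensitivity = particle
    return anc0 * math.exp(-sensitivity * dose)


def weighted_risk(particles: Sequence[Particle], weights: Sequence[float],
                  dose: float, penalty_grade_0: float = 1.0,
                  penalty_grade_4: float = 2.0) -> float:
    """Penalized posterior probability of missing the target range."""
    nadirs = [toy_nadir(p, dose) for p in particles]
    p_grade_0 = sum(w for n, w in zip(nadirs, weights) if n >= 2.0)
    p_grade_4 = sum(w for n, w in zip(nadirs, weights) if n < 0.5)
    return penalty_grade_0 * p_grade_0 + penalty_grade_4 * p_grade_4


def best_dose(particles: Sequence[Particle], weights: Sequence[float],
              doses: Sequence[float]) -> float:
    """Candidate dose with the smallest weighted risk."""
    return min(doses, key=lambda d: weighted_risk(particles, weights, d))
```

Because the risk is computed over the whole weighted ensemble rather than a single point estimate, a dose that looks safe for the most likely parameters can still be rejected if a sizeable share of the samples predicts grade 4.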

DA-RL-guided dosing

In this section, we describe a method that combines DA-guided and RL-guided dosing - DA-RL-guided dosing.

This approach integrates the individualized uncertainty quantified by DA into RL and offers two advantages: the use of smoothed expectations and the consideration of individual-level uncertainty. For the former, smoothed posterior expectations of the target quantity (the predicted nadir concentration) can be used instead of the observed grade (the neutrophil concentration measured on a particular day), which reduces the effects of measurement noise and the dependence on the sampling date. For the latter, model simulations in the RL scheme can sample from the posterior 𝑝(𝜃|𝑦1:𝑐), i.e., from individual-level rather than population-level uncertainty.

In addition, because the data assimilation is performed in real time (online), the analysis is narrowed down to reduce computational complexity: not all state combinations are included, only those that are relevant for the rest of the treatment. The action-value function is also not estimated from scratch but starts from a prior, 𝑞𝜋0:=𝑞̂𝜋UCT, determined by the RL method before any TDM data are available. The adjustment parameter 𝜀𝑐 for the exploitation-exploration trade-off is set to prioritize doses with higher a priori expected long-term returns.
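Both advantages can be illustrated with a weighted sample ensemble: the posterior mean of a predicted quantity gives the smoothed expectation, and drawing parameter sets in proportion to their weights gives individual-level samples for the RL simulations. The function names below are hypothetical and the ensemble representation is an assumption.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def posterior_mean(values: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted mean, e.g. of the nadir predicted by each posterior sample."""
    total = sum(weights)
    if total <= 0.0:
        raise ValueError("weights must have a positive sum")
    return sum(v * w for v, w in zip(values, weights)) / total


def sample_parameters(particles: Sequence[T], weights: Sequence[float],
                      n: int, rng: random.Random) -> List[T]:
    """Draw n parameter sets with probability proportional to their weights."""
    return rng.choices(particles, weights=weights, k=n)
```

Passing an explicitly seeded `random.Random` instance keeps the simulations reproducible, which is useful when comparing dosing policies on the same virtual patients.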

Results

In this section, we compare the three proposed methods (DA-guided, RL-guided, and DA-RL-guided dosing) with previous studies to derive medication regimens that take individual characteristics into account, and we evaluate their impact on prognosis.

Grade 4 and Grade 0 neutropenia

Here, we evaluate the previous and proposed dosing strategies with respect to neutropenia, a side effect of cancer treatment. Specifically, we compare the proposed and existing MIPD methods based on TDM data from paclitaxel-based chemotherapy.

The design of this evaluation follows a previous study, the CEPAC-TDM study. The neutrophil counts on days 0 and 15 of each cycle were calculated using a pharmacokinetic/pharmacodynamic (PK/PD) model of the cumulative neutrophil reduction caused by paclitaxel, a therapeutic agent in anticancer therapy. The predicted neutrophil concentrations over 6 cycles of 3 weeks each (figure below; median and 90% confidence interval (CI)) indicate that, when neutrophils rise, concentrations are within the target range (grades 1-3, between the black horizontal lines). This result indicates that PK-guided dosing prevents a decrease in nadir concentration (the lowest neutrophil concentration) to the same extent as standard dosing.

In RL-guided dosing, neutrophil concentrations were well controlled across cycles, and the distribution of nadir concentrations in the whole population was concentrated within the target range (see figure below). In DA-guided dosing, nadir concentrations moved steadily into the target range, reducing the variance (the variability of results). DA-RL-guided dosing likewise moved nadir concentrations into the target range and reduced variability.

On the other hand, the incidence of grade 0 and grade 4 neutropenia differed between the methods. In PK-guided dosing, the incidence of grade 0 increased (see below). In DA-guided dosing, the incidence of grades 0 and 4 decreased in the later cycles, with the quantification of individual uncertainty contributing to the reduced variability of results. RL-guided dosing reduced the incidence of grade 0 and grade 4 neutropenia compared to standard and DA-guided dosing. In MAP-guided dosing, the incidence of grade 4 neutropenia increased over the cycles. DA-RL-guided dosing produced results similar to DA-guided dosing.

These results show that the uncertainty-aware methods, DA-guided and DA-RL-guided dosing, kept the nadir concentrations within the target range and reduced variability, whereas these trends were not confirmed for the other methods.

An investigation of long-term expected returns in RL

In this evaluation, we investigate the long-term expected returns in RL: we examine the action-value function (the objective function) in RL to see whether covariates relevant to dose individualization can be identified.

The evaluation results (below) show the estimated action-value function for RL-guided dosing, stratified by the covariates gender, age, and baseline neutrophil count (ANC0), for dose selection in the first cycle. The steepness of the curves, which reflects the robustness of the dose selection, shows that ANC0 is an important characteristic at the start of treatment; for comparison, first-cycle dose selection in the PK-guided algorithm uses only gender and age.

In addition, the grade of neutropenia in the first cycle (g1) had the greatest impact on the choice of the second dose, while a larger ANC0 resulted in a higher optimal dose.

Discussion

In this study, we propose three methods using DA and/or RL to derive a medication policy that takes into account individual characteristics with uncertainty in MIPD.

Specifically, action-dependent rewards (penalties for high doses) are introduced and, in the case of neutrophil-guided dosing, toxicity and efficacy (in terms of median survival) are considered simultaneously. The formulation can also cover side effects other than the primary drug effect (such as peripheral neuropathy), tumor response, long-term outcomes (such as overall survival or progression-free survival), and concomitant medications (such as anticancer drug combinations). RL thus allows multiple adverse and beneficial effects and multiple medications to be included, and it is well suited to accounting for time delays and uncertain patient characteristics.

In addition, we newly use Monte Carlo tree search (MCTS) in reinforcement learning to derive policies. Most previous research in the medical field has used algorithms that perform only a simple search, such as epsilon-greedy strategies combined with a one-step approximation of the lookup table. In this study, by contrast, MCTS with the upper confidence bound applied to trees (UCT) is used to evaluate the returns; this avoids the approximations based on the Bellman decomposition that algorithms such as Q-learning require, thus reducing computational complexity. In addition, the UCT search can include additional information (individual patient uncertainty and prior information) by systematic sampling from the medication dose range. These features allow potential model biases to be taken into account when performing analyses based on real patient data: for example, the system can still learn if a patient does not follow a dosing recommendation (off-policy learning), and it can learn without exchanging patient data. This makes it feasible to implement the system in clinics.

Therefore, we believe that the proposed method makes it possible to

(1) improve the response rate in clinical trials;

(2) facilitate recruitment by relaxing the exclusion criteria; and

(3) enable continuous learning after approval and improve treatment outcomes in the long term.

On the other hand, there are three main challenges: the complexity of RL processing, differences from clinical practice, and the processing power required by DA-guided dosing. First, complex models such as RL decision trees can be difficult to navigate and remember; software and dashboards that are easy to use in clinical practice, such as those available for infliximab, therefore need to be developed. Second, the study focuses only on the dose of paclitaxel, an anticancer drug, without accounting for dropout, dose reductions due to non-hematologic toxicities, adherence, or comorbidities; the reduction in the incidence of grade 4 neutropenia may therefore differ between simulations and clinical trials. A possible solution could be to evaluate the approach in a clinical setting or to run additional simulations that take these factors into account. Third, DA-guided dosing can be expected to require significant computational time and effort; if time or computational power is limited, approximations are necessary, e.g., solving only for the next cycle's dose rather than for all remaining cycles.