# A Proposal To Use Reinforcement Learning To Prevent HIV Infection!

*3 main points* ✔️ Human Immunodeficiency Virus (HIV), which compromises the body's immune function, has been reported to be on the rise not only in developing countries but also in developed countries

✔️ Using reinforcement learning, we aim to develop a method for deriving optimal patterns of testing and care retention rates that accounts for how decisions change over time.

✔️ The derived pattern is robust to uncertainty in the cost of care. However, when only testing and care retention rates are optimized, the 2030 EHE target cannot be achieved, suggesting that additional interventions are needed.

A reinforcement learning model to inform optimal decision paths for HIV elimination

written by

(Submitted on 6 Sep 2021)

Comments: Math Biosci Eng

The images used in this article are from the paper, the introductory slides, or were created based on them.

## Background

Can we identify the testing and care retention rates that minimize HIV transmission?

In this study, we use reinforcement learning to derive optimal patterns of testing and care retention for the human immunodeficiency virus (HIV), based on the reduction targets of the Ending the HIV Epidemic (EHE) plan.

HIV reduces the number of immune cells (T lymphocytes and macrophages) that protect the body from disease, weakening immune function and increasing the severity and morbidity of other diseases. It remains a major public health problem in the developed as well as the developing world: in the United States, as of 2015, approximately 1.2 million people were living with HIV (people with HIV, PWH), and there were approximately 38,000 new infections. The U.S. national strategic plan, Ending the HIV Epidemic (EHE), calls for a four-pronged strategy of diagnosis, treatment, prevention, and response to reduce new infections by approximately 75% (to 9,300 cases) by 2025 and by about 90% (to 3,000 cases) by 2030. Current guidelines recommend at least annual testing and immediate initiation of treatment for high-risk populations. However, national surveillance shows that actual testing is less frequent than recommended: according to the report, the testing interval among people with HIV in 2015 was about three to five years. Furthermore, only 48% of those diagnosed with HIV received treatment, suggesting a high dropout rate from care.

In this study, we aim to reduce HIV incidence by developing a model that uses reinforcement learning to derive the optimal combination of testing rate and care retention rate. Deriving the optimal testing rate (the inverse of the interval between tests) can inform testing guidelines, help people learn of their infection, and facilitate the provision of treatment. Deriving the optimal retention-in-care rate (the proportion of people in care at the beginning of the year who are still in care at the end of the year) indicates where social services and support programs may be needed to reduce dropout. By deriving these optimal patterns with reinforcement learning, the study also evaluates the feasibility of the EHE strategy.

## What is the human immunodeficiency virus (HIV)?

First, we explain the human immunodeficiency virus (HIV), which is the target of this study.

HIV reduces the number of immune cells (T lymphocytes and macrophages) that protect the body from disease, weakening the immune system and increasing the severity and morbidity of other diseases. When the immune system is weakened, the body can be infected by bacteria and viruses that would cause no problems in a healthy state (opportunistic infections) and can develop a variety of diseases that would not normally occur. Acquired immunodeficiency syndrome (AIDS) is the condition in which people infected with HIV develop such complications because of decreased immunity. Influenza-like symptoms may appear in the early stage of infection; these initial symptoms last 2 to 4 weeks and then disappear because of the immune response in the infected person's body, after which the patient enters an asymptomatic period. During this period, HIV produces about 10 billion virus particles every day, infecting and killing T lymphocytes and lowering immune function: the T lymphocyte count, which is 700 to 1,500 cells per microliter of blood in a healthy state, falls below 200 over 5 to 10 years, resulting in immunodeficiency.

The treatment for HIV is antiretroviral therapy (ART), which suppresses the replication of HIV in the body and allows immune function to recover. In 2015, the WHO published guidelines on when to start antiretroviral therapy and on pre-exposure prophylaxis for HIV, which recommend that people infected with HIV start antiretroviral therapy promptly.

## Purpose of the research

In this study, we aim to use reinforcement learning to derive optimal patterns of testing and care retention rates based on the reduction targets of the EHE plan.

Diagnosis and treatment are considered the most effective interventions for reducing HIV; deriving optimal testing and care retention rates could therefore help in designing effective support programs against HIV infection. This study builds a stochastic, dynamic model that combines reinforcement learning (RL) with a Markov decision process (MDP), so that sequences of decisions can be evaluated as the epidemic, including new infections, changes over time. Previous studies have focused on patient-level decision-making, such as optimal treatment protocols, and few have addressed infection at the population level. In addition, the number of RL learning iterations increases exponentially with the number of available actions, which makes the computational cost very high. We therefore reduce the computational complexity by reformulating the decision variables in terms of the proportion undetected and the proportion on HIV treatment, which reduces the number of options. These models allow us to evaluate the EHE goals as a sequential decision problem in a stochastic dynamic environment and provide useful information for setting future goals.

## Method

In this section, we provide an overview of the proposed method, which derives optimal patterns of testing rate and treatment retention rate by applying RL to an MDP. We describe the MDP and the RL algorithm in turn.

### Environment setting based on MDP

In this section, we describe the MDP-based model used in the proposed method.

An MDP is a stochastic formulation of a decision problem; here we outline the formulated environment. We define the epidemic state at time $t$ as the multivariate parameter $X_t = [p_i, \mu_{u,i}, \mu_{a,i}, \mu_{ART,i};\ \forall i]$, where $p_i$ is the HIV prevalence in risk group $i$ (the number of PWH divided by the total number of people in the group), $\mu_{u,i}$ is the proportion of PWH in risk group $i$ who are unaware of their infection, $\mu_{a,i}$ is the proportion who are aware of their infection but not on ART, and $\mu_{ART,i}$ is the proportion who are aware of their infection and on ART. These three proportions cover all stages of care, so $\mu_{u,i} + \mu_{a,i} + \mu_{ART,i} = 1$. The intervention decision at time $t$ is $D_t = [\delta_i, (1-\rho_i);\ \forall i]$, where $\delta_i$ is the diagnosis rate and $1-\rho_i$ is the retention-in-care rate in risk group $i$. The MDP is then defined by the following four elements.

(1) $\Omega$

The state space, which is the set of all possible states of the epidemic. State values are categorized by risk group: heterosexuals (HETs) and men who have sex with men (MSM).

(2) $A$

The action space, which is the set of all possible decisions (actions). Instead of combinations of the diagnosis rate ($\delta_i$) and the retention-in-care rate ($1-\rho_i$), we formulate the actions as changes in the proportion unaware of infection and the proportion on ART. These proxy variables limit the number of action choices and improve the convergence rate of learning.

(3) $P_a$

The one-step transition probability matrix under action $a$. The element $P_a(x, x')$ is the probability that the epidemic transitions from state $X_t = x$ to state $X_{t+1} = x'$. Rather than computing this matrix explicitly, we simulate the action and the stochastic transition, track the resulting state, and estimate the immediate reward, which reduces the computational complexity.

(4) $R_a$

The immediate reward matrix under action $a$. The element $R_a(x, x')$ is the immediate reward (total benefit minus total cost) obtained when the epidemic is in state $x$, action $a$ is taken, and the epidemic transitions to state $x'$. The benefit is the quality-adjusted life years (QALYs) of the total population converted into monetary value by multiplying by the GDP per capita ($54,000), and the cost is the total population cost of HIV testing, care, and treatment.
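To make the reward definition concrete, the following is a minimal sketch of "total benefit minus total cost" as described above. The function name, the split of cost into three arguments, and the example inputs are illustrative assumptions and are not taken from the paper; only the $54,000 conversion factor comes from the text.

```python
GDP_PER_CAPITA = 54_000  # USD per QALY, the conversion factor quoted in the text


def immediate_reward(total_qalys: float, testing_cost: float,
                     care_cost: float, treatment_cost: float) -> float:
    """Return total benefit minus total cost, in USD.

    Benefit = population QALYs converted to money at GDP per capita.
    Cost    = population-level spending on testing, care, and treatment.
    """
    benefit = total_qalys * GDP_PER_CAPITA
    cost = testing_cost + care_cost + treatment_cost
    return benefit - cost


# Hypothetical inputs for one decision step (illustrative only, not from the paper).
example = immediate_reward(total_qalys=1_000.0, testing_cost=2_000_000.0,
                           care_cost=5_000_000.0, treatment_cost=20_000_000.0)
print(example)  # 27000000.0
```

In the study itself, these quantities come out of the epidemic simulation for each five-year step rather than being supplied by hand.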

We assume that the epidemic at any time $t$ is represented by exactly one state, and that the probability of moving to state $x'$ at time $t+1$ depends only on the state $x$ at time $t$. Under this assumption,

$$\Pr\{X_{t+1} \mid X_t, X_{t-1}, X_{t-2}, \ldots\} = \Pr\{X_{t+1} \mid X_t\},$$

so the model satisfies the Markov property required of an MDP.

The objective is to derive the optimal decision sequence (the optimal policy) that maximizes the expected reward. Here $d$ denotes the action chosen by the agent at each five-year decision epoch from 2016 to 2070. Each decision is evaluated on its costs and impacts not only in the current epoch but also in all future decision epochs. Future costs and benefits are not discounted: we set $\gamma = 1$ so that future avoided infections and avoided costs are not given less weight, which allows strategies that lead to HIV elimination to be identified accurately.
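The objective described in this paragraph can be written compactly. The following is a sketch consistent with the definitions in this section, not necessarily the paper's exact equation; the policy symbol $\pi$, the set of decision epochs $T$, and the per-epoch action $d_t$ are notational assumptions introduced here.

$$
\pi^{*} = \arg\max_{\pi}\ \mathbb{E}\left[\sum_{t \in T} \gamma^{t}\, R_{d_t}(X_t, X_{t+1})\right], \qquad d_t = \pi(X_t), \quad \gamma = 1 .
$$

With $\gamma = 1$ the factor $\gamma^{t}$ is always 1, so every decision epoch contributes to the sum with equal weight.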

### A proposed algorithm using reinforcement learning

In this section, we describe the algorithm used to derive the optimal pattern, which is based on Q-learning, an RL method. RL is a machine learning method that derives optimal decisions using (1) a simulation model that evaluates policies (decision sequences) and (2) an optimization algorithm that controls which policies are selected for evaluation.

Candidate methods for solving the MDP model for HIV include dynamic programming (DP), such as value iteration and policy iteration, and algorithms such as SARSA and Q-learning. Because of the size of the problem, we do not use DP, which requires estimating the transition probability matrices for all states and actions. Instead we use Q-learning, which is less computationally intensive and can derive near-optimal solutions without prior knowledge of the transition probability matrices. In the simulation, Q-learning receives from the environment an immediate reward for its action and the epidemic state reached 5 years later. For the optimization, we sum the immediate rewards of each action over the five years, observe the total rewards of previous actions, and choose the next action. This process is repeated many times until it arrives at the optimal decision. We also set the exploration probability $\epsilon$ to decrease as the iteration count $k$ increases: initially the agent explores more actions, and over time it relies more on accumulated experience (the exploration-exploitation trade-off).
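The loop described above can be illustrated with a minimal tabular Q-learning sketch. This is not the authors' implementation: the toy `step` function stands in for the epidemic simulator, and the state and action sets, the learning rate `ALPHA`, the number of epochs, the $\epsilon$ schedule, and the choice to index Q-values by epoch are all assumptions made for illustration. Only the lack of discounting and the decreasing $\epsilon$ follow the text.

```python
import random

N_STATES = 5          # hypothetical discretised epidemic states (0 = best)
ACTIONS = (0, 1)      # hypothetical: 0 = keep effort, 1 = scale up effort
HORIZON = 11          # number of five-year decision epochs (assumption)
GAMMA = 1.0           # no discounting, as stated in the text
ALPHA = 0.1           # learning rate (assumption)


def step(state, action, rng):
    """Toy stand-in for the epidemic simulator: returns (next_state, reward)."""
    if action == 1 and rng.random() < 0.8:
        nxt = max(state - 1, 0)             # scaling up usually improves the state
    elif action == 0 and rng.random() < 0.3:
        nxt = min(state + 1, N_STATES - 1)  # keeping effort may let it worsen
    else:
        nxt = state
    reward = -float(nxt) - 0.5 * action     # epidemic burden plus programme cost
    return nxt, reward


def q_learning(n_iterations, seed=0):
    rng = random.Random(seed)
    # Q is indexed by (epoch, state, action) because the horizon is finite;
    # the extra final epoch holds terminal values of zero.
    q = [[[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
         for _ in range(HORIZON + 1)]
    for k in range(1, n_iterations + 1):
        epsilon = 1.0 / (1.0 + 0.01 * k)    # exploration shrinks as k grows
        state = N_STATES - 1                # every episode starts in the worst state
        for t in range(HORIZON):
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)                         # explore
            else:
                action = max(ACTIONS, key=lambda a: q[t][state][a])  # exploit
            nxt, reward = step(state, action, rng)
            target = reward + GAMMA * max(q[t + 1][nxt])
            q[t][state][action] += ALPHA * (target - q[t][state][action])
            state = nxt
    return q


q = q_learning(2000)
# Greedy action for each state at the first decision epoch.
policy = [max(ACTIONS, key=lambda a: q[0][s][a]) for s in range(N_STATES)]
print(policy)
```

Indexing Q-values by epoch is one way to keep the undiscounted sum well defined over a finite horizon; a real application would replace `step` with calls to the epidemic simulation.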

The transmission simulation uses a tool called PATH 2.0, an agent-based stochastic simulation that tracks each person with HIV individually and simulates HIV disease progression and sexual transmission. It models the HIV epidemic in the United States and accurately simulates the epidemic from 2010 to 2015. Diagnosis and retention rates are estimated within this environment. Fixed and variable costs are derived from intervention program data and modeled as nonlinear functions of the number of people reached. In each Q-learning iteration, the Q-values are updated at every five-year step from 2015 to 2070: an action is chosen, the epidemic is simulated, and the result is fed back, forming a feedback and control loop. This process is repeated until it converges to the optimal policy.

To evaluate Q-learning, we derive the range of uncertainty of the optimal policy by running the algorithm with different numbers of iterations (2000, 3000, 4000, 5000) and comparing the corresponding total rewards. This checks whether the algorithm terminates before converging when the number of iterations is not large enough.

### Uncertainty Analysis

In this section, we describe three cost functions that are set to account for uncertainty.

Here we deal with two uncertainties: uncertainty in HIV transmission; and uncertainty in intervention costs. In the former, we replicate the uncertain events in HIV transmission. Specifically, the steps are as follows.

a) Draw input parameters from probability distributions and simulate events using probability functions

b) Apply Q-learning to the MDP

c) Train with 2000 to 5000 iterations, simulate the resulting optimal policy 100 times, and take the mean of the output metrics

For the latter, four types of cost (fixed cost per clinic for care programs; variable cost per person for care outreach programs; marginal increase in variable cost for care outreach programs; and variable cost for testing outreach programs) are combined into the following three cost functions.

(a) Median (median testing and retention-in-care costs): uses the median of all four parameters

(b) LTHR (Low Testing High Retention in Care Costs): uses the lowest testing costs and the highest retention-in-care costs

(c) HTLR (High Testing Low Retention in Care Costs): uses the highest testing costs and the lowest retention-in-care costs

For each cost function, we train with several numbers of iterations (2000, 3000, 4000, and 5000). For each pair of cost function and stopping condition, we run 100 simulations and extract the optimal policy and the corresponding impact values, averaged over the 100 runs.
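The experimental grid described above (three cost functions, four stopping conditions, 100 simulations per pair) can be sketched as nested loops. The two helper functions are hypothetical placeholders for training and simulation; they return dummy values and are not part of the paper or of any real library.

```python
import random
import statistics

COST_SCENARIOS = ("Median", "LTHR", "HTLR")
ITERATION_COUNTS = (2000, 3000, 4000, 5000)
N_SIMULATIONS = 100


def train_policy(scenario, n_iterations):
    """Hypothetical placeholder for Q-learning training under one cost scenario."""
    return {"scenario": scenario, "n_iterations": n_iterations}


def simulate_policy(policy, rng):
    """Hypothetical placeholder for one stochastic run; returns a dummy metric."""
    return rng.gauss(0.0, 1.0)


def uncertainty_analysis(seed=0):
    rng = random.Random(seed)
    results = {}
    for scenario in COST_SCENARIOS:
        for n_iter in ITERATION_COUNTS:
            policy = train_policy(scenario, n_iter)
            runs = [simulate_policy(policy, rng) for _ in range(N_SIMULATIONS)]
            # Average the output metric over the 100 stochastic runs.
            results[(scenario, n_iter)] = statistics.mean(runs)
    return results


results = uncertainty_analysis()
print(len(results))  # 12 pairs: 3 cost functions x 4 stopping conditions
```

Comparing the averaged metrics across the four stopping conditions within one cost function is what gives the range of uncertainty reported in the results.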

## Results

In this section, we discuss the results of the evaluation, which uses reinforcement learning to derive optimal testing and retention rates and examines their impact on the reduction targets set by the EHE plan for HIV.

### Evaluation Environment

In this section, we describe the environment in which we performed the evaluation.

For the evaluation, we follow the epidemic from 2015 to 2070 at five-year intervals. The baseline is set as follows: at the end of 2015, the annual testing rate was 0.26 for high-risk heterosexuals and 0.4 for MSM; that is, the average time from infection to diagnosis was 3.8 years for heterosexuals and 2.5 years for MSM. In the same baseline, the annual retention rate is 86% for heterosexuals and 91% for MSM.

The optimal policies from 2016 to 2070, that is, the optimal combinations of testing rates (lower panel) and retention rates (upper panel) for heterosexuals and MSM, are shown as time series in the figure.

For the proportion of heterosexual and MSM PWH who are aware of their infection (top) and on ART (bottom), the range of uncertainty under each of the three cost function assumptions is shown as a shaded band (Median: blue, LTHR: red, HTLR: green).
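The link between the testing rates and the times to diagnosis quoted above is a simple reciprocal. The sketch below assumes that testing occurs at a constant annual rate, so that the mean waiting time is one over the rate; the function name is illustrative.

```python
def mean_years_to_diagnosis(annual_testing_rate: float) -> float:
    """Mean time from infection to diagnosis, assuming a constant annual rate."""
    if annual_testing_rate <= 0:
        raise ValueError("annual testing rate must be positive")
    return 1.0 / annual_testing_rate


for group, rate in (("heterosexuals", 0.26), ("MSM", 0.4)):
    print(group, round(mean_years_to_diagnosis(rate), 1))
# heterosexuals 3.8
# MSM 2.5
```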

From 2016 to 2020, the proposed model sets the testing rate for high-risk heterosexuals and MSM at 0.2 and 0.3, corresponding to roughly one test every 5 and 3.5 years, respectively, under all three cost functions (see above). The algorithm also derives a policy of gradually increasing the annual retention rate from 86% to 94% for HETs and from 91% to 96% for MSM. During this period, the range of uncertainty in testing rates and treatment retention rates is narrow for all three cost functions, which indicates that the Q-values converge under the proposed algorithm. With these testing and retention-in-care rates, approximately 85% of heterosexual people with HIV (PWH) and 82% of MSM PWH will be aware of their infection by the end of 2020, and about 70% of PWH in both groups are expected to be on ART. In addition, between 2016 and 2020, this combination of testing and care retention rates reduces the number of newly infected heterosexuals by 50% (from 9,000 in 2016 to 4,500 by the end of 2020) and the number of newly infected MSM by 42% (from 26,000 in 2016 to 15,000 by the end of 2020), a significant decrease compared with the trend over the preceding five years (see figure below).

The number of heterosexual PWH decreases gradually, whereas the number of MSM PWH continues to increase for a short period before decreasing (see figure below).

The annual cost of HIV increases by 22% over this period, suggesting that a high initial investment is required to achieve the reduction in new infections described above (see figure below).
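The reductions in new infections quoted above can be checked directly from the 2016 and 2020 counts. This is plain arithmetic on the figures in the text; the function name is illustrative.

```python
def percent_reduction(before: float, after: float) -> float:
    """Percentage decrease from `before` to `after`."""
    return 100.0 * (before - after) / before


het = percent_reduction(9_000, 4_500)     # heterosexuals, 2016 to end of 2020
msm = percent_reduction(26_000, 15_000)   # MSM, 2016 to end of 2020
print(het, round(msm, 1))  # 50.0 42.3
```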

From 2021 to 2025, the model suggests a modest increase in testing frequency and the maintenance of high retention rates for both heterosexuals and MSM. It also calls for scaling up retention-in-care programs to increase the annual retention rate from 94% to 96% for heterosexuals and from 96% to 98% for MSM. Over this period the decline in new infections is modest for both groups, while the number of PWH declines.

From 2026 to 2030, the testing rate falls to 0.1 (about one test every 10 years) for heterosexuals under the Median and HTLR cost functions and for MSM under all cost functions, and it remains at that value for the rest of the horizon. Among heterosexuals, new infections fall to about 3,200-4,000 by 2030 and 750-1,200 by 2070; among MSM, they fall to about 11,000-14,000 by 2030 and 3,500-6,000 by 2070.
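These ranges can be compared with the EHE target of about 3,000 new infections by 2030 mentioned in the background. The sketch below simply adds the two modelled groups; treating the lower and upper ends of the two ranges as if they occurred together is a simplifying assumption.

```python
EHE_TARGET_2030 = 3_000        # target number of new infections by 2030

het_2030 = (3_200, 4_000)      # (low, high) new infections, heterosexuals
msm_2030 = (11_000, 14_000)    # (low, high) new infections, MSM

combined = (het_2030[0] + msm_2030[0], het_2030[1] + msm_2030[1])
meets_target_best_case = combined[0] <= EHE_TARGET_2030

print(combined)                # (14200, 18000)
print(meets_target_best_case)  # False
```

Even the lower end of the combined range is several times the target, which is consistent with the conclusion that testing and retention alone do not achieve the 2030 EHE goal.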

## Discussion

This research investigates and proposes a methodology for decision-making in public health epidemic control aimed at HIV elimination. Specifically, we use reinforcement learning to optimize testing and retention rates against the reduction targets of the EHE plan, formulating the problem as an MDP and solving it as a sequential decision problem with Q-learning. The evaluation shows that, in contrast to approaches that compare pre-selected scenarios, the proposed method selects the optimal choice among the available alternatives (3611 alternatives) based on stochastic predictions of decisions and of the epidemic. Costs and QALYs were evaluated to derive the optimal combination of testing and retention rates. Such a decision-making model requires a large amount of computation because of the size of the state and action spaces, which makes convergence difficult; to address this, we reduced the size of the action space by introducing proxy variables and reformulating the actions. The study presents a case for using testing and treatment to reduce the spread of the disease, and the approach may apply to other infectious diseases.

The reinforcement learning results suggest that the optimal policy is to test more frequently in the first 10 years and then less frequently as the number of new infections declines. Specifically, they suggest gradually increasing the annual retention-in-care rate to about 95% in the first 10 years and then running a maintenance program to sustain it. The model derives a policy that aims for a higher retention rate than testing rate, indicating that spending on retention should be a priority. The optimal policy was also robust to cost uncertainty within the assumed range; the main difference is that when testing costs are lower than in the Median and HTLR cost functions (that is, under LTHR), the model suggests maintaining high testing rates for longer.

The following limitation should be considered. The proposed method restricts its evaluation to currently existing testing and treatment technologies: it excludes the possibility of a cure and of significant improvements in testing and treatment costs. The availability of a cure could therefore change the optimal decisions, for example by changing the time to, and probability of, achieving HIV elimination through reduced transmission, and by improving the trade-off with GDP through reduced costs. On the other hand, the results show that even when treatment costs are high, the model favors allocating resources to treatment over testing, suggesting that the results remain applicable when treatment costs fall relative to testing costs.
