AI Has Surpassed Medical Specialists! A Reinforcement Learning Agent Proposes An Exploratory Approach To Cancer Treatment!

Reinforcement Learning

3 main points
✔️ Among the rapidly increasing number of patients with cancer, epithelial ovarian cancer still has a low survival rate, and determining an appropriate treatment strategy is a challenge
✔️ Uses model-free reinforcement learning (DQN) in an environment built as a Markov decision process (MDP)
✔️ Confirmed that the agent derives a treatment plan that improves mean survival over the regimens chosen by specialists

Patient level simulation and reinforcement learning to discover novel strategies for treating ovarian cancer
written by Brian Murphy, Mustafa Nasir-Moin, Grace von Oiste, Viola Chen, Howard A Riina, Douglas Kondziolka, Eric K Oermann
(Submitted on 22 Oct 2021)
Comments: Published on arxiv.

Subjects: Machine Learning (cs.LG)


The images used in this article are from the paper, the introductory slides, or were created based on them.


Can real-world data-driven reinforcement learning be used to improve survival rates?

In this study, we propose a dynamic regime for epithelial ovarian cancer that uses reinforcement learning (MDP and DQN) to take individual characteristics into account. Survival in epithelial ovarian cancer has improved less than in other cancers and remains a challenge. In addition, because chemotherapy with multiple drugs is the mainstay of treatment for this cancer, uniform treatment rarely achieves effective results; a flexible treatment strategy that takes individual characteristics into account, a dynamic regime, is therefore required.

In this study, we aim to address these issues by using reinforcement learning on real-world data (MDP and DQN) to derive a dynamic regime, that is, a treatment strategy that takes patient characteristics into account and improves survival. To apply reinforcement learning, we designed an environment that models the treatment history of epithelial ovarian cancer; a dynamic regime tailored to the individual is derived through the agents' interaction with this environment. This study has three main features: we built a simulation environment from real-world data on individual responses to treatment in metastatic ovarian cancer; within it, agents choose a treatment strategy with the aim of maximizing the reward, overall survival; and we introduced model-free reinforcement learning (DQN), which learns to find the optimal solution, and confirmed its effectiveness.

What is epithelial ovarian cancer?

First of all, we will briefly explain epithelial ovarian cancer, which is the subject of analysis in this study.

Epithelial ovarian cancer is a disease in which malignant (cancerous) cells develop in the tissue covering the surface of the ovary, and the incidence of the disease is particularly high in middle-aged women (40-60 years old). There are no symptoms in the early stages of the disease, and it is often discovered in an advanced stage. The most common symptoms reported are: abdominal pain and swelling; pelvic pain; and gastrointestinal symptoms such as gas, bloating, and constipation. It has also been reported that approximately 25% of ovarian cancers are positive on endometrial cytology, which increases the likelihood of prevention through stereotactic health screening.

There are four stages of the disease (stage I, II, III, and IV), and each stage is treated accordingly. At present, three main types of treatment are frequently used: surgery, the surgical removal of the tumor; radiation therapy, the use of high-energy X-rays or other radiation to kill cancer cells or inhibit their growth; and chemotherapy, the use of drugs to kill cancer cells or stop them from dividing. Epithelial ovarian cancer is the most sensitive of all gynecologic cancers to chemotherapy and is most commonly treated with a combination of surgery and chemotherapy: in stage I, surgery (ovarian portion) and chemotherapy; in stages II, III, and IV, surgery and chemotherapy.

In addition, multiple types of drugs are used in drug therapy, and treatment strategies vary widely: chemotherapy may be combined with immunotherapy or radiotherapy, the administration schedule may be adjusted, and different routes of drug administration may be used. As a result, some treatment strategies may worsen the prognosis. It is therefore necessary to derive an optimal treatment strategy that takes the characteristics of the individual into account and is appropriate to the patient's condition.

What is a dynamic regimen (DTR)?

A dynamic regimen (DTR) is a regimen (a treatment plan, treatment protocol, or medication schedule) in which treatment is determined from the patient's disease progression, side effects, laboratory values, and so on. Usually, chemotherapy is administered by following a fixed regimen, which helps prevent medical accidents such as drug overdose. Dynamic regimens are more flexible: the regimen can be altered according to the patient's condition, so the best course of treatment can be chosen for each individual patient. DTRs are particularly useful when side effects can worsen the final endpoint (for example, through drug overprescribing) and when alternative drug options are limited (for example, when drug resistance develops through continued use). They also bring challenges, including higher costs from more frequent treatment decisions and the need for greater expertise. This research aims to address these issues by using reinforcement learning (DQN) to derive more accurate dynamic regimes.

Purpose of the study

In this study, we propose a method for deriving a dynamic regime for epithelial ovarian cancer using model-free reinforcement learning. Specifically, we use DQN, a model-free reinforcement learning method, in an environment built on a Markov decision process, with the aim of improving survival rates. Because the treatment of epithelial ovarian cancer usually involves multiple drugs and multiple treatment methods, the treatment strategy has to be changed and decisions made as the treatment progresses. Reinforcement learning offers effective algorithms for this kind of sequential decision making and can be applied to dynamic regimes to derive an optimal treatment strategy. The method formulates cancer clinical trial data, such as TCGA, as a simulation environment and aims to derive optimal treatment decisions from it.


Data Sources and Preprocessing

In this section, we describe the data used in this study and its preprocessing. The data come from a cancer database, The Cancer Genome Atlas (TCGA), from which we obtained comprehensive treatment plans and outcomes for 609 patients with epithelial ovarian cancer, following previous studies. We also used several resources to preprocess the dataset (NCI Drug Dictionary, Broad GDAC Firehose), with a standardized drug index to convert all drug names to their common equivalents. In addition, we removed treatment plan records with no drug name for the treatment line, identical start and end dates, or unclear treatment line timing, and excluded patients who did not reach the overall survival endpoint; ultimately, 225 of 460 patients were included (see table below).

We then reorganize the data into 30-day treatment periods. The reorganized dataset consists of 9,296 one-month treatment period samples, each containing the patient ID, the number of months since treatment initiation, and the current drug combination. These data include 127 drug combinations and a "no active treatment" option. We build the reinforcement learning environment from the subset of patients whose final survival measure is a death event: 5,931 one-month treatment period samples, containing 107 unique drug combinations and "no active treatment".
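The reorganization into 30-day periods can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the record layout (day offsets from treatment initiation), the function name `to_monthly_samples`, and the choice to take the union of all drugs whose treatment line overlaps a period are assumptions made for the example.

```python
PERIOD_DAYS = 30                      # one treatment period, as in the article
NO_TREATMENT = "no active treatment"  # label for periods without any drug


def to_monthly_samples(patient_id, lines, follow_up_days):
    """Expand one patient's treatment lines into 30-day period samples.

    lines: list of (start_day, end_day, drugs) tuples, where the days are
    offsets from treatment initiation and end_day is exclusive (assumed schema).
    Returns a list of (patient_id, month_index, drug_combination) tuples.
    """
    n_periods = -(-follow_up_days // PERIOD_DAYS)  # ceiling division
    samples = []
    for month in range(n_periods):
        lo, hi = month * PERIOD_DAYS, (month + 1) * PERIOD_DAYS
        active = set()
        for start, end, drugs in lines:
            if start < hi and end > lo:  # the line overlaps this period
                active |= set(drugs)
        combo = "+".join(sorted(active)) if active else NO_TREATMENT
        samples.append((patient_id, month, combo))
    return samples
```

Each returned tuple corresponds to one sample of the kind described above: patient ID, months since treatment initiation, and the current drug combination.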

Setting up the environment in reinforcement learning

In this section, we describe the environment in which reinforcement learning is performed.

Based on the above data, we construct an environment based on a Markov decision process (MDP) to simulate dynamic regimes for patients with epithelial ovarian cancer. Each state consists of the patient's condition, response to current treatment, time since the start of treatment, total duration of treatment, age, race, and tumor-specific information (tumor grade and stage). The drug-related actions consist of all unique treatment combinations; drug combinations not present in the TCGA ovarian cancer dataset are excluded.
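As a rough illustration of this state and action design, the sketch below represents a state as a small record and the action space as an index over the drug combinations seen in the data. The field names, types, and the helper `build_action_space` are hypothetical; the article lists the state components but not their encoding.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatientState:
    """One MDP state; fields follow the components listed in the article."""
    condition: str                # e.g. "treatment", "remission", "dead" (assumed labels)
    treatment_response: str       # response to the current treatment
    months_since_start: int       # time since the start of treatment
    total_treatment_months: int   # total duration of treatment
    age: int
    race: str
    tumor_grade: str
    tumor_stage: str


def build_action_space(observed_combinations):
    """Map each unique drug combination seen in the data to an action index.

    Combinations that never occur in the dataset get no index, which mirrors
    the exclusion of combinations absent from the TCGA ovarian cancer data.
    """
    unique = sorted(set(observed_combinations))
    return {combo: index for index, combo in enumerate(unique)}
```

A frozen dataclass is used here so that states are hashable and cannot be modified by accident between simulation steps.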

Survival modelling

In this section, we describe the survival model that we used to set up the environment.

The survival model described here is a set of transition probabilities from one state to the next for each patient who receives a particular treatment at a particular time. Each state transition involves two probabilities that stochastically determine the next state. The first determines whether the patient dies, with probability P(D), or survives, with probability P(S) = 1 - P(D). If the patient dies, the patient enters the final state and the simulation proceeds to the next process; if the patient survives, the second probability applies: it determines whether the patient is in remission in the next state, with probability P(R), or needs further treatment, with probability P(T) = 1 - P(R) (see below).
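The two-stage transition can be sketched in a few lines. This is an illustrative reading of the description above, not the paper's code; the state labels and the function names `sample_next_condition` and `next_condition_distribution` are assumptions.

```python
import random


def sample_next_condition(p_death, p_remission, rng):
    """Sample the next condition with the two-stage rule.

    Stage 1: the patient dies with probability P(D) = p_death.
    Stage 2 (survivors only): remission with probability P(R) = p_remission,
    otherwise further treatment with probability P(T) = 1 - P(R).
    """
    if rng.random() < p_death:
        return "dead"  # final state
    if rng.random() < p_remission:
        return "remission"
    return "treatment"


def next_condition_distribution(p_death, p_remission):
    """Overall probabilities of the three outcomes implied by the rule."""
    p_survive = 1.0 - p_death
    return {
        "dead": p_death,
        "remission": p_survive * p_remission,
        "treatment": p_survive * (1.0 - p_remission),
    }
```

For example, `sample_next_condition(0.05, 0.3, random.Random(0))` draws one outcome, and `next_condition_distribution(0.05, 0.3)` gives the outcome probabilities, which always sum to 1.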

To calculate these probabilities, we use two multivariate Cox proportional hazards regressions: for the probability of death or survival, the baseline hazard is calculated from terminal death events and the number of months since treatment initiation; for the probability of remission or further treatment, the baseline hazard is calculated from relapse/remission events and the number of months on the current treatment regimen (see figure below).
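For reference, the standard form of a Cox proportional hazards model is written out below. The notation is ours, not the paper's: h_0 is the baseline hazard, x the covariate vector, beta the regression coefficients, and S_0 the baseline survival function. Which covariates enter each of the two regressions is known only at the level of detail given above.

```latex
\begin{align}
h(t \mid x) &= h_0(t)\,\exp\!\left(\beta^{\top} x\right), \\
H_0(t) &= \int_0^t h_0(u)\,du, \qquad S_0(t) = \exp\!\left(-H_0(t)\right), \\
S(t \mid x) &= \exp\!\left(-H_0(t)\,e^{\beta^{\top} x}\right)
             = S_0(t)^{\exp\left(\beta^{\top} x\right)} .
\end{align}
```

The last line shows why only the baseline hazard and the linear predictor are needed to obtain a patient-specific survival curve.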

We then sample from the survival function of each regression based on the (current patient state, action) pair to obtain P(D), P(S), P(R), and P(T). The reward is the total number of months survived, accumulated over the actions a that do not lead to death.
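One common way to turn a Cox survival curve into a per-month event probability is the conditional probability 1 - S(t+1 | x) / S(t | x). The sketch below assumes that reading; the article says only that the survival function is sampled for each state-action pair, so the exact procedure, the baseline curve values, and the function names are illustrative assumptions.

```python
import math


def survival(baseline_survival, month, linear_predictor):
    """Cox model survival: S(t | x) = S0(t) ** exp(beta . x).

    baseline_survival: list where entry t is S0 at month t (assumed input).
    linear_predictor: the value of beta . x for this state-action pair.
    """
    return baseline_survival[month] ** math.exp(linear_predictor)


def monthly_event_probability(baseline_survival, month, linear_predictor):
    """P(event during the next month | event-free up to `month`)."""
    s_now = survival(baseline_survival, month, linear_predictor)
    s_next = survival(baseline_survival, month + 1, linear_predictor)
    if s_now <= 0.0:
        return 1.0  # nobody is event-free any more
    return 1.0 - s_next / s_now


def monthly_reward(died):
    """One month of reward for every simulated month the patient survives."""
    return 0.0 if died else 1.0
```

Applied to the death regression this yields P(D), and applied to the remission regression it yields P(R); P(S) and P(T) follow as the complements.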

Reinforcement Learning Model

In this section, we describe the reinforcement learning used in our evaluation.

This study uses a deep Q-network (DQN), a model-free reinforcement learning method: the agent selects an action (a drug combination) based on the observed state, and the state-action pair is fed into the MDP, which stochastically determines the next state. The agents were trained for 200,000 rounds (1 round = 1 simulated patient), and past patient trajectories became the training dataset for the DQN (see figure below).
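The core pieces of a DQN training loop can be sketched without the neural network itself. The code below is a generic illustration under stated assumptions: the discount factor, the epsilon-greedy exploration rule, and the replay buffer are standard DQN ingredients, but the article does not report the hyperparameters or architecture the authors used, and the lists of Q-values here stand in for the outputs of a network.

```python
import random
from collections import deque

GAMMA = 0.99  # discount factor (assumed; not given in the article)


def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the best one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


def td_target(reward, next_q_values, done, gamma=GAMMA):
    """Q-learning target: r if the episode ended, else r + gamma * max Q(s', a')."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


class ReplayBuffer:
    """Fixed-size store of past (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity):
        self._items = deque(maxlen=capacity)

    def add(self, transition):
        self._items.append(transition)

    def sample(self, batch_size, rng):
        return rng.sample(list(self._items), min(batch_size, len(self._items)))

    def __len__(self):
        return len(self._items)
```

In a full agent, the network would be trained to move its prediction for the chosen action toward `td_target` on mini-batches drawn from the buffer, which is how past patient trajectories become training data.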

The final performance of the DQN agent is evaluated with two metrics: the baseline mean survival calculated from the first 1,000 simulated patients, and the mean simulated survival of the last 1,000 patients in training. Both are compared with the mean survival of patients treated by clinicians. The dataset used to build the MDP is also used for evaluation, with the actions restricted to drug combinations that occur more than five times; this prevents learning under special circumstances that do not reflect common treatments.
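Both evaluation steps are simple to express in code. The sketch below is illustrative only: the function names and input layout are assumptions, with per-episode survival times given as a list in training order and drug combinations as one label per treatment period sample.

```python
from collections import Counter
from statistics import mean


def first_last_mean(survival_months, window=1000):
    """Mean survival over the first and the last `window` simulated patients."""
    if len(survival_months) < window:
        raise ValueError("need at least `window` simulated patients")
    return mean(survival_months[:window]), mean(survival_months[-window:])


def common_actions(observed_combinations, min_count=5):
    """Drug combinations that occur more than `min_count` times in the data."""
    counts = Counter(observed_combinations)
    return sorted(combo for combo, n in counts.items() if n > min_count)
```

The first value returned by `first_last_mean` plays the role of the untrained baseline and the second that of the trained agent; `common_actions` yields the restricted action set.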


To validate the effectiveness of the dynamic regime derived by the proposed method, we compare it against the survival time achieved under treatment by specialist physicians.

Comparison in survival time

The aim of this evaluation is to compare the proposed method with specialist treatment strategies using survival time as a measure.

The proposed approach resulted in a mean survival of 32.3 months for the first 1,000 patients and 42.9 months for the last 1,000 patients; it therefore achieved longer mean survival than the oncologist-led treatment strategy, which had a mean survival of 26.4 months (see figure below).

Specialists most often prescribed carboplatin and paclitaxel as first-line therapy and then switched to topotecan, doxorubicin, carboplatin, and paclitaxel monotherapy over time; while the proposed approach led to a strategy of almost continuous aldesleukin administration (see below).

We also evaluated a simulation with a restricted set of actions to see whether the proposed method would produce an alternative strategy when limited to more common treatments. After one million simulations, the proposed method derived a combination of gemcitabine and tamoxifen and, over time, shifted to other regimens, such as the combination of cisplatin and tamoxifen (see figure below).

In these simulations, the proposed method again showed improved mean survival over the specialists: mean survival was 43.4 months at baseline (first 1,000 patients) and 45.5 months for the last 1,000 patients (see figure below).
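The size of these differences can be checked with a few lines of arithmetic on the reported means. The helper `gain` is ours; only the comparisons stated in the article are computed, and the figures are simulated survival, not clinical outcomes.

```python
def gain(new_months, reference_months):
    """Absolute gain in months and relative gain in percent."""
    absolute = new_months - reference_months
    relative = 100.0 * absolute / reference_months
    return absolute, relative


CLINICIAN = 26.4                                   # mean survival under specialists
UNRESTRICTED_FIRST, UNRESTRICTED_LAST = 32.3, 42.9
RESTRICTED_FIRST, RESTRICTED_LAST = 43.4, 45.5

comparisons = {
    "first 1,000 vs clinicians": gain(UNRESTRICTED_FIRST, CLINICIAN),
    "last 1,000 vs clinicians": gain(UNRESTRICTED_LAST, CLINICIAN),
    "restricted, last vs first 1,000": gain(RESTRICTED_LAST, RESTRICTED_FIRST),
}

for label, (months, percent) in comparisons.items():
    print(f"{label}: +{months:.1f} months ({percent:.1f}%)")
```

The gain within the restricted simulation is much smaller than the gain in the unrestricted one, which is worth keeping in mind when reading the headline result.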


In this study, we proposed an algorithm for epithelial ovarian cancer that uses reinforcement learning to create new dynamic regimes based on real-world data. Because treatment strategies are so diverse, the need for dynamic regimes that take the patient's condition into account has been recognized, but challenges, including cost, have been pointed out. The proposed method develops a patient-level simulation of the treatment and outcomes of epithelial ovarian cancer, builds an MDP-based environment, trains a DQN agent in it, and aims to derive an optimal dynamic regime that takes individual characteristics into account. The evaluation results show that agents trained by the proposed method improve simulated overall survival compared to specialists.

A further evaluation that may be required in the future is to assess the dynamic regimes of pre-trained agents on test data: for clinical applications, the training of reinforcement learning agents needs to be validated and evaluated on such broad datasets. One of our future plans is therefore to make treatment recommendations at each stage of treatment based on the patient's medical records and test data, and to compare them with the oncologist's decisions for the actual patient.

Challenges related to this study include validating the quality of the agents and examining survival models other than the one evaluated. First, the quality of the learning agent is limited by the fidelity of the simulator, and the overall sample size is limited because we used only patients who reached the overall survival endpoint in the TCGA dataset. The solution to this sample size problem is to collect a sufficient amount of data from multiple comprehensive cancer centers and clinical trial databases: based on prior studies, at least one order of magnitude more patients than the 225 used here, and ideally two orders of magnitude more. Second, the mathematical model of survival needs additional validation. Because the simulation results showed that even untrained DQN agents outperformed clinicians in mean survival, the survival model may need further refinement: survival models other than the multivariate Cox proportional hazards regression used in this study should be considered and their impact on survival time tested.
