AI-in-Oncology

Problem statement:

Current AI systems in oncology lack the ability to generate explainable, clinically grounded summaries from longitudinal data. In prostate cancer care, clinicians must interpret diverse sources collected over time, such as prostate-specific antigen (PSA) levels, Prostate Imaging-Reporting and Data System (PI-RADS) scores, weight changes, and reports of bone pain. Existing tools struggle with temporal complexity, hallucinations, and limited explainability, leading to fragmented care and increased clinician burden.

Purpose:

We aim to develop a multi-agent, retrieval-augmented generation (RAG)-enhanced large language model (LLM) pipeline, driven by the Model Context Protocol (MCP), for generating temporally structured, evidence-grounded summaries of prostate cancer patient records. The system also produces treatment recommendations and lifespan predictions, with the goals of reducing clinician burden, enhancing model explainability, and supporting real-world oncology decision-making.

Materials and Methods:

The architecture integrates structured and unstructured data via the Model Context Protocol (MCP), orchestrated by CrewAI agents. A synthetic dataset of 500 longitudinal prostate cancer cases, generated via ChatGPT and refined with clinician feedback, was used in place of real EHRs. Model evaluation focuses on accuracy, hallucination resistance, and clinical relevance.

Results:

Preliminary tests on synthetic cases confirmed functional integration of pipeline components, accurate temporal sequencing, and reliable exclusion of irrelevant data. The validator agent effectively flags hallucinations, and outputs are structured for human review. Final validation using real prostate cancer data is scheduled for Summer 2025. Key metrics will include clinical alignment, information completeness, and blinded user feedback.

Clinical Relevance Statement:

Timeline-based, evidence-supported patient reports generated by AI can streamline oncology workflows, reduce documentation burden, and enable explainable treatment planning with minimal hallucination risk.

This is a work in progress. We have a proof of concept (POC) and are now evaluating it against real data and adjusting the pipeline to arrive at an effective and sustainable solution to the existing problem. The problem has also been raised by hospital radiologists.

Data and Sampling

Temporal Summarization:

For the temporal summarization component of our multi-agent clinical reasoning system, we developed a synthetic dataset of longitudinal prostate cancer records. Using ChatGPT and curated prompts, we generated structured timelines for 500 patients, simulating realistic disease trajectories based on PSA levels, PI-RADS scores, weight changes, bone pain, and treatment history. Each record was formatted to reflect the way oncologists chart disease progression over time, with dated entries following a consistent schema for easy parsing and modeling. To ensure compatibility with our MCP architecture, we designed the dataset to output structured temporal context via an internal API, enabling downstream agents to consume, reason over, and summarize the patient's clinical history. This dataset was critical in testing our system's ability to produce coherent, evidence-grounded summaries and accurate timelines while minimizing hallucinations. The data was repeatedly reviewed by a Radiologist and Assistant Professor of Radiology.

Treatment Prediction:

For the treatment recommendation component, we derived a supervised classification dataset from the same synthetic timelines. To train the model, we used the full longitudinal record of each patient, which consisted of multiple dated clinical observations parsed from structured narrative notes. These records included features such as PSA, PI-RADS, weight, bone pain severity, and treatment history. Instead of collapsing multiple records into one, we treated each dated entry as a separate training instance. To capture progression, we engineered temporal delta features, PSA_Delta and weight_Delta, by calculating the change in each value relative to the previous visit. We also encoded bone pain severity and tracked VisitOrder to indicate the sequence of visits over time. The target variable for each instance was the treatment administered at that point in the timeline, selected from a predefined set of classes: ['None', 'ADT', 'Surgery', 'Surgery + Radiation + ADT', 'Surgery + ADT', 'Radiation', 'Radiation + ADT', 'Surgery + Radiation'].
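The per-visit feature construction described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the Visit class, the build_training_rows helper, the intermediate bone pain levels ("Mild", "Moderate"), and the choice of 0.0 as the delta for a patient's first visit are all assumptions made for the example. Only PSA_Delta, weight_Delta, and VisitOrder are names taken from the text.

```python
from dataclasses import dataclass
from datetime import date
from itertools import groupby

# Assumed ordinal encoding; intermediate levels are hypothetical.
BONE_PAIN_CODES = {"None": 0, "Mild": 1, "Moderate": 2, "Severe": 3}


@dataclass
class Visit:
    """One dated entry in a patient's timeline (hypothetical schema)."""
    patient_id: str
    visit_date: date
    psa: float
    pirads: int
    weight: float
    bone_pain: str
    treatment: str  # label: treatment administered at this visit


def build_training_rows(visits):
    """Return one training row per visit, with deltas relative to the previous visit."""
    rows = []
    ordered = sorted(visits, key=lambda v: (v.patient_id, v.visit_date))
    for _, patient_visits in groupby(ordered, key=lambda v: v.patient_id):
        previous = None
        for order, visit in enumerate(patient_visits, start=1):
            rows.append({
                "PatientID": visit.patient_id,
                "VisitOrder": order,
                "PSA": visit.psa,
                "PIRADS": visit.pirads,
                "Weight": visit.weight,
                "BonePain": BONE_PAIN_CODES[visit.bone_pain],
                # A first visit has no predecessor; 0.0 is an assumed fill value.
                "PSA_Delta": visit.psa - previous.psa if previous else 0.0,
                "weight_Delta": visit.weight - previous.weight if previous else 0.0,
                "Treatment": visit.treatment,
            })
            previous = visit
    return rows


example_visits = [
    Visit("P001", date(2023, 6, 1), 9.5, 4, 80.0, "Mild", "ADT"),
    Visit("P001", date(2023, 1, 1), 6.0, 3, 82.0, "None", "None"),
    Visit("P002", date(2023, 3, 1), 4.0, 2, 90.0, "None", "None"),
]
example_rows = build_training_rows(example_visits)
```

Sorting by patient and date before grouping keeps each delta tied to the chronologically preceding visit, even when the input entries arrive out of order.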

Lifespan Prediction:

To simulate lifespan prediction in prostate cancer patients, we constructed a supervised regression task using structured synthetic clinical records. Beginning with temporally ordered narrative notes, we parsed each patient's record into dated visits containing PSA levels, PI-RADS scores, body weight, bone pain severity, and administered treatment. These were transformed into a structured, visit-level dataset.
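As an illustration of the parsing step, the sketch below converts a narrative record into dated visit dictionaries. The line format shown in the comment is purely hypothetical, since the actual note schema is not specified here; the parse_record function and the field names are likewise assumptions.

```python
import re
from datetime import date

# Hypothetical note line (the real schema may differ):
# "2023-01-15 | PSA: 6.2 | PI-RADS: 4 | Weight: 82.5 | BonePain: Mild | Treatment: ADT"
LINE_PATTERN = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2})\s*\|\s*"
    r"PSA:\s*(?P<psa>\d+(?:\.\d+)?)\s*\|\s*"
    r"PI-RADS:\s*(?P<pirads>[1-5])\s*\|\s*"
    r"Weight:\s*(?P<weight>\d+(?:\.\d+)?)\s*\|\s*"
    r"BonePain:\s*(?P<bone_pain>\w+)\s*\|\s*"
    r"Treatment:\s*(?P<treatment>.+)"
)


def parse_record(note_text):
    """Parse one patient's note into a date-sorted list of visit dicts."""
    visits = []
    for line in note_text.splitlines():
        match = LINE_PATTERN.fullmatch(line.strip())
        if match is None:
            continue  # skip blank lines and free text that does not fit the schema
        visits.append({
            "Date": date.fromisoformat(match["date"]),
            "PSA": float(match["psa"]),
            "PIRADS": int(match["pirads"]),
            "Weight": float(match["weight"]),
            "BonePain": match["bone_pain"],
            "Treatment": match["treatment"].strip(),
        })
    visits.sort(key=lambda v: v["Date"])
    return visits
```

Lines that do not match the assumed schema are skipped silently in this sketch; a production parser would more likely log or reject them so that malformed entries are not lost unnoticed.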

Recognizing that bone pain progression is a clinically significant indicator of patient deterioration, as confirmed by oncologist consultation, we engineered a numerical severity score for the BonePain variable, mapping "None" to 0 and "Severe" to 3. This allowed us to quantify its effect in the modeling process. To reflect temporal progression, we computed initial and final values of PSA and weight and introduced delta features (PSA_Change and Weight_Change) representing the net clinical drift over time. Visits were then aggregated into one record per patient, capturing summary statistics such as average PSA, maximum PI-RADS, and maximum bone pain severity. We further encoded treatment history by creating binary flags (Has_Surgery, Has_Radiation, and Has_ADT) to indicate whether a patient had received each treatment modality at any point in their timeline. Lifespan values were synthetically generated to simulate realistic variation in survival, incorporating the influence of baseline PSA, PI-RADS, bone pain score, and treatments. The generation function penalized higher initial PSA, higher PI-RADS scores, and greater bone pain severity, while rewarding the presence of aggressive or combined treatments (e.g., surgery or radiation). Gaussian noise was added to ensure plausible heterogeneity and prevent reverse-engineering of the ground truth.
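A rough sketch of the per-patient aggregation and the synthetic lifespan generator is given below. It is illustrative only: the function names, the base value, every coefficient, the noise scale, and the lower clipping bound are placeholders, not the values used in the study. The feature set is also illustrative and does not claim to reproduce the study's final feature list; only PSA_Change, Weight_Change, Has_Surgery, Has_Radiation, and Has_ADT are names taken from the text. The input is assumed to be a date-sorted list of visit dictionaries with bone pain already encoded as 0 to 3.

```python
import random
from statistics import mean


def aggregate_patient(visits):
    """Collapse one patient's date-sorted visits into a single feature record."""
    first, last = visits[0], visits[-1]
    all_treatments = " ".join(v["Treatment"] for v in visits)
    return {
        "Initial_PSA": first["PSA"],
        "Final_PSA": last["PSA"],
        "PSA_Change": last["PSA"] - first["PSA"],
        "Avg_PSA": mean(v["PSA"] for v in visits),
        "Initial_Weight": first["Weight"],
        "Final_Weight": last["Weight"],
        "Weight_Change": last["Weight"] - first["Weight"],
        "Max_PIRADS": max(v["PIRADS"] for v in visits),
        "Max_BonePain": max(v["BonePain"] for v in visits),
        # Flags are 1 if the modality appears anywhere in the timeline.
        "Has_Surgery": int("Surgery" in all_treatments),
        "Has_Radiation": int("Radiation" in all_treatments),
        "Has_ADT": int("ADT" in all_treatments),
    }


def simulate_lifespan(features, rng):
    """Generate a synthetic lifespan in years (all constants are placeholders)."""
    years = 12.0
    years -= 0.15 * features["Initial_PSA"]   # penalize higher initial PSA
    years -= 0.60 * features["Max_PIRADS"]    # penalize higher PI-RADS
    years -= 0.80 * features["Max_BonePain"]  # penalize greater bone pain
    years += 1.0 * features["Has_Surgery"]    # reward treatment
    years += 1.0 * features["Has_Radiation"]
    years += 0.5 * features["Has_ADT"]
    years += rng.gauss(0.0, 1.0)              # Gaussian noise for heterogeneity
    return max(years, 0.5)                    # assumed floor to keep values positive


example_visits = [
    {"PSA": 6.0, "PIRADS": 3, "Weight": 82.0, "BonePain": 0, "Treatment": "None"},
    {"PSA": 9.5, "PIRADS": 4, "Weight": 80.0, "BonePain": 1, "Treatment": "Radiation + ADT"},
]
example_features = aggregate_patient(example_visits)
example_lifespan = simulate_lifespan(example_features, random.Random(42))
```

Passing an explicit random.Random instance keeps the generated targets reproducible across runs, which makes it easier to compare models on the same synthetic dataset.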

The final dataset included 11 engineered features capturing static and temporal dynamics of disease, with lifespan (in years) as the continuous target variable. We evaluated a diverse set of regression models, including linear (Ridge), kernel-based (SVR), single-tree (Decision Tree), and tree-ensemble (Random Forest, Gradient Boosting, XGBoost) approaches.
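One way such a model comparison could be set up with scikit-learn is sketched below. The evaluation protocol is an assumption: the use of k-fold cross-validation, mean absolute error as the metric, the hyperparameters, and the compare_regressors helper are all illustrative choices rather than the study's reported setup. XGBoost is left out to keep the sketch to a single dependency; its scikit-learn-compatible xgboost.XGBRegressor could be added to the dictionary in the same way.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor


def compare_regressors(X, y, cv=5, seed=0):
    """Return the cross-validated mean absolute error (in years) for each model."""
    models = {
        # Ridge and SVR are scale-sensitive, so they are wrapped with a scaler.
        "Ridge": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
        "SVR": make_pipeline(StandardScaler(), SVR(kernel="rbf")),
        "DecisionTree": DecisionTreeRegressor(random_state=seed),
        "RandomForest": RandomForestRegressor(n_estimators=50, random_state=seed),
        "GradientBoosting": GradientBoostingRegressor(random_state=seed),
    }
    results = {}
    for name, model in models.items():
        scores = cross_val_score(
            model, X, y, cv=cv, scoring="neg_mean_absolute_error"
        )
        results[name] = -scores.mean()  # flip sign: scikit-learn maximizes scores
    return results


# Placeholder data standing in for the 11-feature patient table.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 11))
y_demo = 8.0 - 1.0 * X_demo[:, 0] - 0.5 * X_demo[:, 1] + rng.normal(scale=0.1, size=120)
demo_results = compare_regressors(X_demo, y_demo, cv=3)
```

Because the lifespan targets are themselves produced by a generation function plus noise, the errors from a comparison like this indicate how well each model recovers that function, not how well it would predict real patient survival.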

MCP Servers