Machine Learning for Healthcare 2025 Abstract Guide
Poster Session A
Poster ID: 49
Title: Classifying Copy Number Variations Using State Space Modeling of Targeted Sequencing Data: A Case Study in Thalassemia
Authors: Austin Talbot, Alex V. Kotlar, Lavanya Rishishwar, Yue Ke
Poster Session A
Abstract:
Thalassemia, a blood disorder and one of the most prevalent hereditary genetic disorders worldwide, is often caused by copy number variations (CNVs) in the hemoglobin genes. This disorder has incredible diversity, with a large number of distinct profiles corresponding to alterations of different regions in the genes. Correctly classifying an individual's profile is critical as it impacts treatment, prognosis, and genetic counseling. However, genetic classification is challenging due to the large number of profiles worldwide, and often requires a large number of sequential tests. Targeted next generation sequencing (NGS), which characterizes segments of an individual's genome, has the potential to dramatically reduce the cost of testing and increase accuracy. In this work, we introduce a probabilistic state space model for profiling thalassemia from targeted NGS data, which naturally characterize the spatial ordering of the genes along the chromosome. We then use decision theory to choose the best profile among the different options. Due to our use of Bayesian methodology, we are also able to detect low-quality samples to be excluded from consideration, an important component of clinical screening. We evaluate our model on a dataset of 57 individuals, including both controls and cases with a variety of thalassemia profiles. Our model has a sensitivity of $0.99$ and specificity of $0.93$ for thalassemia detection, and accuracy of $91.5\%$ for characterizing subtypes. Furthermore, the specificity and accuracy rise to $0.96$ and $93.9\%$ when low-quality samples are excluded using our automated quality control method. This approach outperforms alternative methods, particularly in specificity, and is broadly applicable to other disorders.
Poster ID: 88
Title: Rare Disease Differential Diagnosis with Large Language Models at Scale: From Abdominal Actinomycosis to Wilson's Disease
Authors: Elliot Schumacher, Dhruv Naik, Anitha Kannan
Poster Session A
Abstract:
Large language models (LLMs) have demonstrated impressive capabilities in disease diagnosis. However, their effectiveness in identifying rarer diseases, which are inherently more challenging to diagnose, remains an open question. Rare disease performance is critical with the increasing use of LLMs in healthcare settings. This is especially true if a primary care physician needs to make a rarer prognosis from only a patient conversation so that they can take the appropriate next step. To that end, several clinical decision support systems are designed to support providers in rare disease identification. Yet their utility is limited due to their lack of knowledge of common disorders and difficulty of use.
In this paper, we propose RareScale to combine the knowledge of LLMs with expert systems. We jointly use an expert system and an LLM to simulate rare disease chats. This data is used to train a rare disease candidate predictor model. Candidates from this smaller model are then used as additional inputs to a black-box LLM to make the final differential diagnosis. Thus, RareScale allows for a balance between rare and common diagnoses. We present results on over 575 rare diseases, beginning with Abdominal Actinomycosis and ending with Wilson's Disease. Our approach significantly improves the baseline performance of black-box LLMs by over 17% in Top-5 accuracy. We also find that our candidate generation performance is high (e.g. 88.8% on gpt-4o generated chats).
Poster ID: 128
Title: Impact of Food on A Transformer Based Glucose Prediction Model to Predict Glucose Trajectories at Different Time Horizons
Authors: Mansur E. Shomali, Junjie Luo, Abhimanyu Kumbara, Anand K. Iyer, Guodong Gordon Gao
Poster Session A
Abstract:
Introduction
Automated coaching based on accurate glucose predictions is important to, and can help improve, the self-management of diabetes [1]. Just as LEDs, digital temperature sensors, and accelerometers in activity trackers enable dense data collection of biometric signals such as heart rate, skin temperature, and movement/activity, continuous glucose monitoring (CGM) sensors enable collection of dense glucose data. There are many covariates (such as food, activity, and medications) that can impact our glucose levels, and dense data from CGM allows us to quantitatively measure such impacts, which can then unlock glucose prediction capabilities. In Shomali et al. (2024), the authors previously showed that a transformer-based large sensor model (LSM) can accurately predict glucose levels at 30-minute, 60-minute, and 2-hour intervals [2]. In this study, we built a transformer-based large health model (LHM) to study the impact of food on glucose prediction at 30-minute, 60-minute, and 2-hour intervals. The LHM takes glucose and other covariates such as food as inputs to predict future glucose levels.
Methods
We constructed two different GPT models: The first model, LSM-GPT, only used glucose data from both T1D and T2D populations in the training set to predict future glucose trajectories. The second model, LHM-GPT, used both glucose and food entry data from the same population to predict glucose trajectories at 30-minute, 60-minute, and 2-hour time horizons.
We evaluated real-world CGM from a digital health platform for 784 individuals with type 1 (T1D) and 1187 individuals with type 2 (T2D) diabetes for the LSM-GPT model. This dataset accounted for over 38 million CGM entries, covering approximately 134,770 patient-days (equivalent to 369 patient-years).
For the LHM-GPT model, we evaluated CGM and food data for 805 individuals with T1D and 1771 individuals with T2D. In total, there were 8,809 data points, with each data point encompassing 24-hour CGM data and one food record. We used a 7:3 training-to-test split ratio. Model accuracy was evaluated by calculating the root mean square error (RMSE) (mg/dL) at each of these time intervals.
Results
For the LSM-GPT model, the held-out sample RMSE (mg/dL) for predicting T1D-only glucose trajectories at 30 minutes, 60 minutes, and 2 hours was 7.0, 16.0, and 29.7, respectively. Similarly, using the same model, the held-out sample RMSE for predicting T2D-only glucose trajectories at the same time intervals was 13.3, 22.4, and 33.8, respectively.
For the LHM-GPT model, which included food and glucose data, the held-out sample RMSE for predicting T1D-only glucose trajectories at 30 minutes, 60 minutes, and 2 hours was 7.5, 15.0, and 25.9, respectively. The held-out sample RMSE for predicting T2D-only glucose trajectories at the same time intervals was 13.4, 22.1, and 31.8, respectively.
When comparing the LSM-GPT and LHM-GPT results, we observed no improvement in RMSE scores at the 30-minute interval. However, including food entries led to improved prediction accuracies at the 60-minute and 2-hour intervals for both populations. For the T1D population, prediction accuracy improved by approximately 6% and 13% at 60 minutes and 2 hours, respectively. For the T2D population, accuracy improved by approximately 1% and 6% at the same time intervals.
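The percentages quoted in the Results are consistent with a relative reduction in RMSE, (RMSE_LSM - RMSE_LHM) / RMSE_LSM. The following Python sketch reproduces that arithmetic from the reported values only; the function and variable names are illustrative and do not come from the study.

```python
def relative_improvement(baseline_rmse: float, new_rmse: float) -> float:
    """Percent reduction in RMSE relative to the baseline model."""
    return 100.0 * (baseline_rmse - new_rmse) / baseline_rmse


# Held-out RMSE (mg/dL) at 30 min, 60 min, and 2 h, as reported in the Results.
rmse = {
    "T1D": {"LSM-GPT": [7.0, 16.0, 29.7], "LHM-GPT": [7.5, 15.0, 25.9]},
    "T2D": {"LSM-GPT": [13.3, 22.4, 33.8], "LHM-GPT": [13.4, 22.1, 31.8]},
}
horizons = ["30 min", "60 min", "2 h"]

# Positive values mean the food-aware model (LHM-GPT) had lower error.
improvements = {
    population: [
        relative_improvement(base, new)
        for base, new in zip(models["LSM-GPT"], models["LHM-GPT"])
    ]
    for population, models in rmse.items()
}

for population, values in improvements.items():
    for horizon, value in zip(horizons, values):
        print(f"{population} {horizon}: {value:+.1f}%")
```

Negative values at the 30-minute horizon correspond to the "no improvement" statement above.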
Conclusion
In this study, we show that combining glucose data with food covariates can improve glucose prediction accuracy at the 60-minute and 2-hour time intervals in both T1D and T2D populations, with the T1D population showing greater accuracy improvements than the T2D population. This suggests that food may play a larger role in influencing glucose levels in individuals with T1D than in those with T2D.
Our results also indicate that the improvement in prediction accuracy is greater for a 2-hour time horizon than for a 60-minute horizon. Since postprandial glucose typically peaks around 90 minutes [3], it is logical that including food as a covariate improves prediction accuracy at the 2-hour interval, when the impact of food on glucose levels is at its highest. In the future we may also want to study the impact of exercise and food intake behavior on glucose predictions.
References:
1. Munoz-Organero M. Deep physiological model for blood glucose prediction in T1D patients. Sensors (Basel). 2020 Jul 13;20(14):3896. doi: 10.3390/s20143896. PMID: 32668724; PMCID: PMC7412558.
2. Shomali ME, Luo J, Kumbara A, Iyer AK, Gao G. CGM-GPT: A Transformer Based Glucose Prediction Model to Predict Glucose Trajectories at Different Time Horizons. Machine Learning Healthcare Conference 2024, Toronto, ON, Canada.
3. Daenen S, Sola-Gazagnes A, M’Bemba J, Dorange-Breillard C, Defer F, Elgrably F, Larger É, Slama G. Peak-time determination of post-meal glucose excursions in insulin-treated diabetic patients. Diabetes & Metabolism. 2010;36(2):165-169. doi: 10.1016/j.diabet.2009.12.002.
Poster ID: 189
Title: Bidirectional Hierarchical Protein Multi-Modal Representation Learning
Authors: Xuefeng Liu, Songhao Jiang, Chih-chan Tien, Jinbo Xu, Rick L. Stevens
Poster Session A
Abstract:
Protein representation learning is critical for numerous biological tasks. Recently, large transformer-based protein language models (pLMs) pretrained on large scale protein sequences have demonstrated significant success in sequence-based tasks. However, pLMs lack structural context, and adapting them to structure-dependent tasks like binding affinity prediction remains a challenge. Conversely, graph neural networks (GNNs) designed to leverage 3D structural information have shown promising generalization in protein-related prediction tasks, but their effectiveness is often constrained by the scarcity of labeled structural data. Recognizing that sequence and structural representations are complementary perspectives of the same protein entity, we propose a multimodal bidirectional hierarchical fusion framework to effectively merge these modalities. Our framework employs attention and gating mechanisms to enable effective interaction between pLMs-generated sequential representations and GNN-extracted structural features, improving information exchange and enhancement across layers of the neural network. This bidirectional and hierarchical (Bi-Hierarchical) fusion approach leverages the strengths of both modalities to capture richer and more comprehensive protein representations. Based on the framework, we further introduce local Bi-Hierarchical Fusion with gating and global Bi-Hierarchical Fusion with multihead self-attention approaches. Through extensive experiments on a diverse set of protein-related tasks, our method demonstrates consistent improvements over strong baselines and existing fusion techniques in a variety of protein representation learning benchmarks, including react (enzyme/EC classification), model quality assessment (MQA), protein-ligand binding affinity prediction (LBA), protein-protein binding site prediction (PPBS), and B cell epitopes prediction (BCEs). Our method establishes a new state-of-the-art for multimodal protein representation learning, emphasizing the efficacy of Bi-Hierarchical Fusion in bridging sequence and structural modalities.
Poster ID: 75
Title: Benchmarking Waitlist Mortality Prediction Through Time-to-Event Modeling using New UNOS Dataset
Authors: Yingtao Luo, Reza Skandari, Carlos Martinez, Rema Padman, Arman Kilic
Poster Session A
Abstract:
Background.
Heart transplantation remains the definitive therapy for end-stage heart failure, a global health crisis impacting approximately 65 million adults worldwide. In the United States, heart allocation decisions are guided primarily by patient urgency determined by clinical status categories established by the United Network for Organ Sharing (UNOS). However, these static categorization systems often fail to adequately capture dynamic changes in patient health status, resulting in suboptimal risk stratification and potentially preventable deaths on the waitlist. Recent UNOS policy revisions in 2018 introduced detailed longitudinal patient data collection, enabling continuous assessment of patient condition over time. Despite the availability of this new comprehensive UNOS dataset, existing predictive models have predominantly relied on static baseline variables, limiting their effectiveness in dynamically evolving clinical scenarios. The objective of this study is to establish a comprehensive benchmark for predictive models using longitudinal data from UNOS, aiming to enhance precision in predicting heart transplant waitlist mortality. The motivation lies in providing clinicians with advanced tools capable of updating survival predictions continuously, thus informing timely clinical decisions, enhancing transplant allocation efficiency, and ultimately improving patient outcomes.
Methods.
We analyzed a large, national dataset comprising 23,807 heart transplant candidate records from the UNOS thoracic registry, collected since October 18, 2018. The dataset contained rich longitudinal clinical data covering 71 dynamic variables, allowing real-time assessment of patient health trajectories. The primary endpoint of our analysis was waitlist mortality, with patient outcomes tracked through December 31, 2023. The dataset was randomly partitioned into training, validation, and test datasets with an 80%-10%-10% split. Variables with less than 30% missing data underwent imputation via Markov Chain Monte Carlo techniques. Three distinct modeling approaches, namely a time-dependent Cox proportional hazards model, Random Survival Forest (RSF), and a deep learning-based DeepHit model, were benchmarked against four existing baselines. Model performance was assessed using the concordance index (C-Index) and one-year discrimination measured by area under the ROC curve (AUC), accuracy, precision, recall, and F1 score.
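For readers unfamiliar with the concordance index, the sketch below computes Harrell's C-index for a single static risk score per patient. It is a simplified illustration: the abstract does not state which C-index variant was used for the time-dependent models, pairs with tied observation times are skipped here, and all names and toy values are hypothetical.

```python
from itertools import combinations


def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored data (simplified sketch).

    times:  observed follow-up time per patient
    events: 1 if death was observed, 0 if censored
    risks:  predicted risk score (higher means earlier expected death)
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Let a be the patient with the shorter observed time.
        a, b = (i, j) if times[i] <= times[j] else (j, i)
        # A pair is comparable only if the earlier time is an observed event.
        # Tied times are skipped in this sketch.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risks[a] > risks[b]:
            concordant += 1.0
        elif risks[a] == risks[b]:
            concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data, invented for illustration only.
toy_times = [2, 4, 6, 8]
toy_events = [1, 0, 1, 1]
toy_risks = [0.9, 0.3, 0.1, 0.6]
print(concordance_index(toy_times, toy_events, toy_risks))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking of patients by risk.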
Results.
As shown in Table 1, the DeepHit model emerged as the best-performing algorithm, achieving a C-Index of 0.94 and a one-year AUC of 0.89, significantly outperforming traditional static models and previously reported benchmarks. The dynamic models consistently demonstrated superior predictive capabilities by integrating longitudinal patient data. The recall rate was limited by the class imbalance (with only 8% waitlist mortality). Key predictors included established risk factors like renal dysfunction (BUN, dialysis), serum albumin, ECMO, age, and hospitalization frequency. Notably, underexplored variables such as AST, SvO₂, and oral anticoagulant use also emerged as significant, offering promising directions for future clinical validation. These results underscore the value of dynamic modeling for real-time, data-driven transplant decision support.
Conclusion.
Our study establishes the first comprehensive benchmark for dynamic predictive modeling of heart transplant waitlist mortality utilizing longitudinal UNOS data. The significant improvements in predictive accuracy demonstrated by dynamic models highlight their clinical relevance for real-time urgency assessment and transplant allocation decision-making. This advancement holds potential for reducing preventable deaths and optimizing resource allocation. Future research should prioritize external validation, integration of additional clinical data streams, and the translation of these advanced predictive tools into actionable clinical decision support systems.
Poster ID: 131
Title: Implementing A Notification System for Patients With High-Risk Conditions Presenting With Fever In The Pediatric Emergency Department
Authors: Noah Prizant, William Ratliff, Shems Saleh, Marshall Nichols, Mike Revoir, Matt Gardner, Michael Gao, Mark Sendak, Suresh Balu, Emily Greenwald, Emily C Sterrett
Poster Session A
Abstract:
Background. Patients with high-risk conditions (HRC) presenting to the pediatric emergency department (ED) with fever are at greatly increased risk for developing systemic infection or sepsis. Consensus guidelines for these patients emphasize timely evaluation and administration of antibiotics, ideally within 1 hour of presentation. Longer time-to-antibiotics (TTA) has been associated with poorer outcomes. From 2016-2023, 42% of patients admitted to Duke Children’s Hospital who met HRC+Fever received antibiotics within 1 hour of presentation to triage. In this study, our primary objective was to implement an informatics-driven system to immediately identify patients with high-risk conditions who presented to the Duke Pediatric ED with a fever and reduce TTA.
Methods. The prospective study cohort included all patients with a high-risk condition who presented with a fever to the Duke Pediatric ED from 1/1/24-3/1/25. High-risk conditions included active chemotherapy (1+ dose of chemotherapy in past 6 months), being followed by the transplant team (on solid organ transplant “list”), and sickle cell disease (prior encounter containing sickle cell diagnosis). Fever was defined as a temperature of >38°C measured in the ED or a chief complaint of fever. For each patient who met this HRC+Fever phenotype, an automated notification page was sent to the ED charge nurse and ED pharmacy to coordinate prompt evaluation and treatment. Pages were not sent if a patient had already received antibiotics. Silent validation was performed from 1/1/24-2/25/24, during which notifications were generated but not sent to clinicians. HRC+Fever notifications went live to clinicians in the Duke Pediatric ED on 2/26/24. We tracked the actionability of notifications (non-actionable notifications occurred when the clinical team determined antibiotics were not indicated) as well as TTA via summary statistics and run charts.
Results. From 1/1/24-3/1/25, 299 patients met the HRC+Fever phenotype, 40 of whom presented during the silent trial. After implementation, median ED arrival to antibiotic administration time decreased from 75.6 min (SD 50.1) to 60.0 min (SD 88.3). TTA has continued to decrease in recent months, with a median of 43.8 min in Feb 2025. Run charts (Figure 1) showed special cause variation only with astronomical points exceeding the upper control limit. After investigation, these were primarily instances in which the transplant team was consulted for antibiotic decision-making, causing delay in TTA. When excluding the transplant patient group, median ED arrival to antibiotic administration time decreased from 78.0 min (SD 51.2) to 55.2 min (SD 51.4) and special cause variation was noted (6 points below the median) in the run chart. Among all patients, arrival-to-antibiotic compliance (<1 hour ED arrival to antibiotic time) increased from 36.1% to 49.3% between the silent trial and the implementation period.
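The HRC+Fever phenotype described in the Methods can be read as a simple rule. The Python sketch below encodes that rule as stated in the abstract; the record layout, field names, and function names are hypothetical and are not taken from the deployed system.

```python
from dataclasses import dataclass
from typing import Optional

FEVER_THRESHOLD_C = 38.0  # fever is a temperature above 38 degrees C, per the Methods


@dataclass
class EdPatient:
    """Hypothetical patient record; field names are assumptions."""
    chemo_dose_in_past_6_months: bool
    on_solid_organ_transplant_list: bool
    prior_sickle_cell_diagnosis: bool
    max_ed_temperature_c: Optional[float]  # None if not yet measured
    chief_complaint_fever: bool
    antibiotics_already_given: bool


def has_high_risk_condition(p: EdPatient) -> bool:
    return (
        p.chemo_dose_in_past_6_months
        or p.on_solid_organ_transplant_list
        or p.prior_sickle_cell_diagnosis
    )


def has_fever(p: EdPatient) -> bool:
    measured = (
        p.max_ed_temperature_c is not None
        and p.max_ed_temperature_c > FEVER_THRESHOLD_C
    )
    return measured or p.chief_complaint_fever


def should_send_page(p: EdPatient) -> bool:
    """Page only if the phenotype is met and antibiotics have not been given."""
    return has_high_risk_condition(p) and has_fever(p) and not p.antibiotics_already_given


example = EdPatient(True, False, False, 38.6, False, False)
print(should_send_page(example))
```

Keeping the antibiotic check separate from the phenotype check mirrors the statement that pages were suppressed for patients who had already received antibiotics.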
Conclusion. We successfully implemented an informatics-based notification system to identify high-risk patients presenting with fever in the Pediatric ED, which has been live for over one year. After implementation, median TTA decreased and antibiotic compliance improved, with most significant improvements occurring in patients on chemotherapy or with sickle cell disease. Transplant patients were identified as a group at higher risk for delays in TTA. We will continue to track impact metrics and investigate additional workflow improvements as HRC+Fever notifications continue.
Poster ID: 162
Title: Augmenting Diagnosis of Primary Hyperparathyroidism (PHPT) Through AI-Based Prediction of Surgical Candidates Without PTH Values
Authors: Hadiza Kazaure, Sai Samyukta Palle, Will Knechtle, Matt Gardner, Marshall Nichols, Mike Revoir, Suresh Balu
Poster Session A
Abstract:
**Background:** Primary Hyperparathyroidism (PHPT), a condition curable by intervention, often suffers from delayed diagnosis and referral for surgery, with national delays averaging up to six years. A key contributor to this problem is that parathyroid hormone (PTH) levels, a primary indicator in the diagnosis of PHPT, are not routinely assessed in most primary care settings. When treatment is delayed or the condition is left untreated, PHPT can result in serious conditions like kidney stones, fractures, neurocognitive issues, cardiovascular complications, and renal failure. Our objective was to develop a predictive model that can identify surgical candidates for PHPT using electronic health record (EHR) data, independent of preoperative PTH values.
**Method:** We conducted a retrospective analysis using EHR data from 2017 through 2022. The study cohort included outpatients seen in Endocrinology Clinics for Endocrine neck diseases. The case group (n=1,472) comprised patients diagnosed with PHPT, identified using ICD-10 codes, who underwent parathyroidectomy. The control group (n=4,934) included patients diagnosed with thyroid disease excluding PHPT diagnoses. We excluded patients with concurrent PHPT and thyroid disease, and conditions like dementia, lithium usage, renal hyperparathyroidism, parathyroid carcinoma, or multiple endocrine neoplasia. Predictive features were derived from both structured and unstructured data. Structured data included comorbidities (e.g., hypertension, chronic kidney disease, osteoporosis, kidney stones, fractures, stroke, depression, and anxiety) and laboratory results (serum calcium, albumin, creatinine, estimated glomerular filtration rate [eGFR], 25-hydroxyvitamin D, and phosphorus). Unstructured data consisted of neurocognitive symptoms (e.g., brain fog, memory issues) extracted from clinical notes using a two-step large language model (LLM)-based pipeline (Llama-Nemotron): first identifying relevant sentences, then assigning a complexity score. To our knowledge, this is the first PHPT prediction model to incorporate neurocognitive symptom complexity derived from clinical documentation. An XGBoost classifier was trained using all the above-mentioned features to enable prediction of surgical candidates in the absence of diagnostic testing.
**Results:** In a cohort of 6,406 patients, our model achieved an AUROC of 0.88, a PRC of 0.75, a sensitivity of 0.90, and a precision of 0.39. Despite excluding PTH values, the model demonstrated strong sensitivity in identifying true surgical candidates. While precision remained moderate, the model is intended as a clinical decision-support tool to flag high-risk patients for follow-up testing, specifically by prompting PTH evaluation. A subsequently elevated PTH can then support the diagnosis of PHPT and expedite surgical referral. Notably, incorporating neurocognitive symptom features led to an increase in PRC from 0.66 to 0.75, indicating greater precision in identifying appropriate surgical candidates among those flagged as high risk.
**Conclusion:** Although our analysis was conducted within endocrinology clinics, where patients are more likely to have undergone PTH testing, our objective was to develop a model that can predict PHPT surgical candidates using structured and unstructured EHR data without relying on PTH values. This approach aims to compensate for the conditions in primary care settings, where PTH is not routinely assessed. Our results demonstrate that accurate prediction is feasible even in the absence of PTH data, supporting future efforts to validate and apply this model in broader primary care populations to enable earlier detection and intervention.
Poster ID: 83
Title: Fair and Accurate Prediction of Acute Kidney Injury in General Ward Patients
Authors: Muhan Yeo, Bokeum Cho, Sojung Kim, Hyesung Kim, Jeongyoon Chang, Jin Byeong Park, Jinyeong Yi, Myeongju Kim, Seunggeun Lee, Sejoong Kim
Poster Session A
Abstract:
Background.
Acute kidney injury (AKI) is a severe yet often preventable complication in hospitalized patients. Early detection enables timely intervention, reducing adverse outcomes. Existing machine learning (ML)-based AKI prediction models primarily target high-risk populations in early hospitalization periods. This study aims to develop a generalizable, fairness-aware ML model for AKI prediction in general ward inpatients.
Methods.
We conducted a retrospective cohort study at a tertiary medical center in Korea (2013–2023), including all general ward inpatients aged ≥19 years who were hospitalized for ≥3 days without a history of dialysis. Patients were excluded if they had pre-existing severe renal impairment, no serum creatinine measurement within six months prior to admission, or AKI at the time of admission. Patients were followed for up to 30 days. Static features (e.g., sex, age, department) and dynamic features (e.g., vital signs, lab results, medications) were extracted from patient data, with missing values imputed using forward filling and median imputation. Binary indicators of missingness for each dynamic feature were also included. A Long Short-Term Memory (LSTM) model was developed (Figure 1) and compared with transformer-based, gradient-boosting, and logistic regression models. The model incorporated an auxiliary loss for 48-hour creatinine prediction and a fairness loss to penalize equal opportunity difference, defined as the difference in true positive rates among subgroups. Sensitive features included admission department, sex, age, obesity, occupation, religion, and marital status.
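The equal opportunity difference defined above can be illustrated with a small Python sketch. It computes the true positive rate per subgroup and reports the largest gap; using the max-minus-min gap for more than two subgroups is an assumption, the function names and toy data are hypothetical, and the differentiable surrogate needed to use this quantity as a training loss is not shown.

```python
from collections import defaultdict


def true_positive_rates(y_true, y_pred, groups):
    """True positive rate (recall) per subgroup, over patients with y_true == 1."""
    positives = defaultdict(int)
    true_positives = defaultdict(int)
    for label, prediction, group in zip(y_true, y_pred, groups):
        if label == 1:
            positives[group] += 1
            true_positives[group] += int(prediction == 1)
    # Subgroups with no positive cases are omitted because TPR is undefined.
    return {g: true_positives[g] / positives[g] for g in positives}


def equal_opportunity_difference(y_true, y_pred, groups):
    """Largest gap in true positive rate between any two subgroups."""
    rates = true_positive_rates(y_true, y_pred, groups)
    return max(rates.values()) - min(rates.values())


# Toy labels and predictions, invented for illustration only.
y_true = [1, 1, 1, 1, 0, 1, 1, 0]
y_pred = [1, 1, 1, 0, 0, 1, 0, 1]
groups = ["M", "M", "M", "M", "M", "F", "F", "F"]

print(true_positive_rates(y_true, y_pred, groups))
print(equal_opportunity_difference(y_true, y_pred, groups))
```

In this toy example the TPR is 0.75 for group M and 0.5 for group F, so the gap is 0.25; a fairness penalty of the kind described would push that gap toward zero.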
Results.
The study population included 141,631 patients (mean age 61±16, 53% male, median hospitalization 7 days), of whom 10.1% (14,296) developed AKI. The LSTM-based model achieved an AUROC of 0.889 and an AUPRC of 0.280 for AKI prediction within 48 hours (Table 1). Performance varied across hospital days, with the highest in the first week (AUPRC: 0.313, recall: 0.615) and the lowest after the third week (AUPRC: 0.106, recall: 0.448), resulting in a maximum equal opportunity difference of 0.17 (Table 2). Prediction performance was lower in female patients and the obstetrics & gynecology department compared to male patients and other departments. After fairness regularization, disparities improved across hospital days (0.17→0.06), sex (0.14→0.09), and departments (0.41→0.27), while remaining below 0.1 for other sensitive variables, with overall AUROC and AUPRC maintained.
Conclusion.
The LSTM model outperformed non-sequential models, demonstrating high predictive performance. Fairness regularization effectively reduced equal opportunity differences across key sensitive features—hospital days, sex, and department—while maintaining overall model performance, consistent with the model’s primary role in risk screening within hospital settings. This model supports real-time AKI risk assessment, enhancing clinical decision-making while mitigating sociodemographic and temporal biases.
Poster ID: 207
Title: ADHAM: Additive Deep Hazard Analysis Mixtures for Interpretable Survival Regression
Authors: Mert Ketenci, Vincent Jeanselme, Harry Reyes Nieva, Shalmali Joshi, Noémie Elhadad
Poster Session A
Abstract:
Survival analysis is a fundamental tool for modeling time-to-event outcomes in healthcare. Recent advances have introduced flexible neural network approaches for improved predictive performance. However, these models do not provide interpretable insights into the association between exposures and the modeled outcomes, a critical requirement for decision-making in clinical practice. To address this limitation, we propose Additive Deep Hazard Analysis Mixtures (ADHAM), an interpretable additive survival model. ADHAM assumes a conditional latent subpopulation structure that characterizes an individual, combined with covariate-specific hazard functions. To select the number of subpopulations, we introduce a post-training group refinement-based model-selection procedure, i.e., an efficient approach to merge similar clusters to reduce the number of repetitive latent subpopulations identified by the model. We perform comprehensive studies to demonstrate ADHAM's interpretability on population, subpopulation, and individual levels. Extensive experiments on real-world datasets show that ADHAM provides novel insights into the association between exposures and outcomes. Further, ADHAM remains on par with existing state-of-the-art survival baselines, offering a scalable and interpretable approach to time-to-event prediction in healthcare.
Poster ID: 11
Title: ECG-Byte: A Tokenizer for End-to-End Generative Electrocardiogram Language Modeling
Authors: William Han, Chaojing Duan, Michael Rosenberg, Emerson Liu, Ding Zhao
Poster Session A
Abstract:
Large Language Models (LLMs) have demonstrated exceptional versatility across domains, including applications to electrocardiograms (ECGs). A growing body of work focuses on generating text from multi-channeled ECG signals and corresponding textual prompts. Existing approaches often involve a two-stage process: pretraining an ECG-specific encoder with a self-supervised learning (SSL) objective, followed by finetuning an LLM for natural language generation (NLG) using encoder-derived features. However, these methods face two key limitations: inefficiency due to multi-stage training and challenges in interpreting encoder-generated features.
To overcome these issues, we propose ECG-Byte, an adapted byte pair encoding (BPE) tokenizer pipeline for autoregressive language modeling of ECGs. ECG-Byte compresses and encodes ECG signals into tokens, enabling direct end-to-end LLM training by combining ECG and text tokens. This approach enhances interpretability, as ECG tokens can be directly mapped back to the original signals. Leveraging ECG-Byte, we achieve competitive NLG performance while training 3 times faster and using just 48% of the data required by traditional two-stage methods.
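To make the idea of byte pair encoding over a signal concrete, the sketch below applies generic BPE to a quantized one-dimensional signal and decodes the tokens back to the quantized symbols. This is not the ECG-Byte implementation: the abstract does not describe the quantization or merge procedure, so the uniform quantizer, the function names, and the toy signal are all assumptions.

```python
from collections import Counter


def quantize(signal, n_levels=8):
    """Map real-valued samples to integer symbols 0..n_levels-1 (assumed scheme)."""
    lo, hi = min(signal), max(signal)
    span = (hi - lo) or 1.0
    return [min(int((x - lo) / span * n_levels), n_levels - 1) for x in signal]


def merge_pair(seq, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_bpe(symbols, num_merges):
    """Greedily merge the most frequent adjacent pair, up to num_merges times."""
    seq = list(symbols)
    merges = []
    next_id = max(seq) + 1
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing repeats, so merging no longer compresses
        merges.append((pair, next_id))
        seq = merge_pair(seq, pair, next_id)
        next_id += 1
    return seq, merges


def decode(tokens, merges):
    """Expand merged tokens back into the original quantized symbols."""
    expand = {new_id: pair for pair, new_id in merges}
    out, stack = [], list(reversed(tokens))
    while stack:
        token = stack.pop()
        if token in expand:
            left, right = expand[token]
            stack.append(right)
            stack.append(left)
        else:
            out.append(token)
    return out


# Toy periodic signal, invented for illustration only.
signal = [0.0, 0.5, 1.0, 0.5] * 6
symbols = quantize(signal)
tokens, merges = learn_bpe(symbols, num_merges=10)
print(len(symbols), "symbols ->", len(tokens), "tokens")
```

Because every merged token expands deterministically to a run of quantized symbols, tokens can be traced back to the signal (up to quantization error), which is the interpretability property the abstract highlights.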
Poster ID: 150
Title: Borrowing From the Future: Enhancing Early Risk Assessment through Contrastive Learning
Authors: Minghui Sun, Matthew M. Engelhard, Benjamin Goldstein
Poster Session A
Abstract:
Risk assessment in pediatric populations often requires analysis across multiple developmental stages. For example, clinicians may evaluate risks prenatally, at birth, and during WellChild visits. While predictions at later stages typically achieve higher accuracy, it is clinically desirable to make reliable risk assessments as early as possible. Therefore, this study focuses on enhancing prediction performance in early-stage risk assessments. Our solution, **Borrowing From the Future (BFF)**, is a contrastive multi-modal framework that treats each time window as a distinct modality. In BFF, a model is trained on all available data across time while conducting risk assessment using only the information available up to that point in time. This contrastive framework allows the model to "borrow" informative signals from later stages (e.g., WellChild visits) to implicitly supervise the learning at earlier stages (e.g., prenatal/birth stages). We validate BFF on two real-world pediatric outcome prediction tasks, demonstrating consistent improvements in early risk assessment.
Poster ID: 31
Title: Towards Scalable Newborn Screening: Automated General Movement Assessment in Uncontrolled Settings
Authors: Daphné Chopard, Sonia Laguna, Kieran Chin-Cheong, Annika Dietz, Anna Badura, Sven Wellmann, Julia E Vogt
Poster Session A
Abstract:
General movements (GMs) are spontaneous, coordinated body movements in infants that offer valuable insights into the developing nervous system. Assessed through the Prechtl GM Assessment (GMA), GMs are reliable predictors for neurodevelopmental disorders. However, GMA requires specifically trained clinicians, who are limited in number. To scale up newborn screening, there is a need for an algorithm that can automatically classify GMs from infant video recordings. This data poses challenges, including variability in recording length, device type, and setting, with each video coarsely annotated for overall movement quality. In this work, we introduce a tool for extracting features from these recordings and explore various machine learning techniques for automated GM classification.
Poster ID: 34
Title: ConTextual: Improving Clinical Text Summarization in LLMs with Context-preserving Token Filtering and Knowledge Graphs
Authors: Fahmida Liza Piya, Rahmatollah Beheshti
Poster Session A
Abstract:
Unstructured clinical data can serve as a unique and rich source of information that can meaningfully inform clinical practice. Extracting the most pertinent context from such data is critical for exploiting its true potential toward optimal and timely decision-making in patient care. While prior research has explored various methods for clinical text summarization, most prior studies either process all input tokens uniformly or rely on heuristic-based filters, which can overlook nuanced clinical cues and fail to prioritize information critical for decision-making. In this study, we propose ConTextual, a novel framework that integrates a Context-Preserving Token Filtering method with a Domain-Specific Knowledge Graph (KG) for contextual augmentation. By preserving context-specific important tokens and enriching them with structured knowledge, ConTextual improves both linguistic coherence and clinical fidelity. Our extensive empirical evaluations on two public benchmark datasets demonstrate that ConTextual consistently outperforms other baselines. Our proposed approach highlights the complementary role of token-level filtering and structured retrieval in enhancing both linguistic and clinical integrity, as well as offering a scalable solution for improving precision in clinical text generation.
Poster ID: 58
Title: Hypothesis Generation Context Refiner
Authors: Ilya Tyagin, Saeideh Valipour, Aliaksandra Sikirzhytskaya, Michael Shtutman, Ilya Safro
Poster Session A
Abstract:
We introduce an explainability method for biomedical hypothesis generation systems, built on the novel Hypothesis Generation Context Refiner framework. Our approach combines semantic graph-based retrieval and relevant data-restrictive training to simulate real-world discovery constraints. Integrated with large language models (LLMs) via retrieval-augmented generation, the system explains hypotheses using contextual evidence from published scientific literature. We propose a novel feedback loop approach, which iteratively identifies and corrects flawed parts of LLM-generated explanations, refining both the evidence paths and supporting papers. We demonstrate the performance of our method with multiple large language models and evaluate explanation and context retrieval quality through both expert-curated assessment and large-scale automated analysis.
Reproducibility: our code and data are available at [link will be added upon acceptance]
Poster ID: 204
Title: Switching State Space Modeling via Constrained Inference for Clinical Outcome Prediction
Authors: Arnold Su, Anna Wong, Fareed Sheriff, Ardavan Saeedi, Li-wei H. Lehman
Poster Session A
Abstract:
In clinical settings, timely and accurate prediction of adverse patient outcomes can help guide treatment decisions. While deep learning models such as LSTMs have demonstrated strong predictive performance on multivariate clinical time series, they often lack interpretability. To address this gap, we propose a framework that combines the predictive strength of neural networks with the interpretability of latent variable models. Specifically, we develop a constrained inference approach to train a switching state space model, an autoregressive hidden Markov model (AR-HMM), for outcome prediction. Our method leverages knowledge distillation: a high-capacity LSTM "teacher" model is first trained to predict a target clinical outcome of interest, and its predictive behavior is then transferred to an interpretable AR-HMM "student" model through a similarity constraint during inference. We implement a constrained variational inference approach to estimate the parameters of the student model while aligning its latent representations with those of the teacher model. We evaluated our approach using two real-world clinical datasets. Our approach demonstrates predictive performance comparable to state-of-the-art deep learning models, while producing interpretable latent trajectories that reflect clinically meaningful patient states.
Poster ID: 194
Title: Balancing Interpretability and Flexibility in Modeling Diagnostic Trajectories with an Embedded Neural Hawkes Process Model
Authors: Yuankang Zhao, Matthew M. Engelhard
Poster Session A
Abstract:
The Hawkes process (HP) is commonly used to model event sequences with self-reinforcing dynamics, including electronic health records (EHRs). Traditional HPs capture self-reinforcement via parametric impact functions that can be inspected to understand how each event modulates the intensity of others. Neural network-based HPs offer greater flexibility, resulting in improved fit and prediction performance, but at the cost of interpretability, which is often critical in healthcare. In this work, we aim to understand and improve upon this tradeoff. We propose a novel HP formulation in which impact functions are modeled by defining a flexible impact kernel, instantiated as a neural network, in event embedding space, which allows us to model large-scale event sequences with many event types. This approach is more flexible than traditional HPs yet more interpretable than other neural network approaches, and allows us to explicitly trade flexibility for interpretability by adding transformer encoder layers to further contextualize the event embeddings. Results show that our method accurately recovers impact functions in simulations, achieves competitive performance on the MIMIC-IV procedure dataset, and yields clinically meaningful interpretations on the XX-EHR children's diagnosis dataset, even without transformer layers. This suggests that our flexible impact kernel is often sufficient to capture self-reinforcing dynamics in EHRs and other data effectively, implying that interpretability can be maintained without loss of performance.
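For readers unfamiliar with the notation, the conditional intensity of a classical multivariate Hawkes process can be written as below, where $\mu_k$ is the base rate of event type $k$, $(t_i, k_i)$ are the past events, and $\phi_{k,k_i}$ is the impact function describing how an event of type $k_i$ modulates the intensity of type $k$. The second line is only an illustrative reading of "a flexible impact kernel in event embedding space": the embeddings $e_k$ and the network $g_\theta$ are assumed symbols, not notation taken from the abstract.

$$
\lambda_k(t) = \mu_k + \sum_{i:\, t_i < t} \phi_{k,k_i}(t - t_i)
$$

$$
\phi_{k,k_i}(\tau) \approx g_\theta\left(e_k, e_{k_i}, \tau\right) \quad \text{(assumed form, for illustration only)}
$$

Under this reading, the number of parameters grows with the embedding dimension rather than with the square of the number of event types, which is one plausible reason such a kernel would scale to many event types.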
Poster ID: 61
Title: State-of-the-Art Text-Prompted Medical Segmentation Models Struggle to Ground Chest CT Findings
Authors: Mohammed Baharoon, Luyang Luo, Michael Moritz, Abhinav Kumar, Sung Eun Kim, Xiaoman Zhang, Miao Zhu, Kent Kleinschmidt, Sri Sai Dinesh Jaliparthi, Sathvik Suryadevara, Rithvik Akula, Mark Marino, Wenhui Lei, Ibrahim Ethem Hamamci, Pranav Rajpurkar
Poster Session A
Abstract:
This study presents a comprehensive evaluation of state-of-the-art text-prompted segmentation models, including SAM2, MedSAM2, SegVol, SAT, and BiomedParse, on ReXGrounding, a novel dataset that pairs chest CT findings with corresponding segmentation masks. Our results demonstrate that despite recent advances, current models struggle to accurately segment diverse findings from chest CTs, particularly when dealing with non-focal abnormalities described in natural language reports. While existing models are primarily optimized for fixed categorical labels rather than nuanced clinical descriptions, even fine-tuning these models with free-text descriptions yields limited improvement in segmentation accuracy. These insights highlight that report grounding on 3D medical volumes through segmentation remains an open challenge, necessitating future models that better comprehend complex clinical language and irregular object patterns across volumetric data.
Poster ID: 36
Title: Why Don’t Patients Take Their Medications: A Natural Language Processing Study of Counseling Intervention Notes from MDR-TB Patients in South Africa
Authors: Xuan Lu, Meng Zhao, Allison Wolf, Jennifer R Zelnick, Max O'Donnell
Poster Session A
Abstract:
Background. Tuberculosis (TB) is a severe infectious pulmonary disease that affects large populations worldwide. The rise of Multi-Drug-Resistant Tuberculosis (MDR-TB) complicates treatment. Despite treatment advances, non-adherence to TB regimens, influenced by various socioeconomic and behavioral factors, remains a key challenge to treatment outcomes. Qualitative studies can formally discover these factors but are resource-intensive. In this study, we leveraged natural language processing (NLP) and topic modeling techniques to discover potential factors of adherence from free-text counseling session notes collected as part of a larger randomized controlled trial.
Methods. We analyzed 327 unstructured counseling intervention session notes (“notes”) from the *Anonymized* study. Each session note was treated as an individual unit of analysis. The notes were preprocessed before topic modeling, including lowercasing, removal of non-alphabetic characters and customized stopwords, lemmatization, and tokenization (up to 3-grams). We then applied BERTopic to identify latent topics and explore reasons for treatment non-adherence. The BERTopic model is composed of four modules: SBERT document embedding, UMAP dimension reduction, HDBSCAN document clustering, and c-TF-IDF topic representation. We tuned the dimensionality parameter of UMAP to optimize the balance between topic granularity and coherence. Due to the small corpus size, we performed manual semantic search to support interpretability.
Results. We selected the top six topics from BERTopic for their clarity and representativeness, removed redundant or low-utility words, and finalized topics with human-in-the-loop supervision. Topics were ranked by “count”, the number of notes assigned [Table 1]. We created a corpus of semantic themes based on preliminary knowledge and BERTopic’s top 20 topics. We evaluated SBERT embeddings and this corpus using K-Means clustering (k=6), which yielded well-separated clusters, suggesting effective semantic distinctions among the counseling notes [Figure 1].
Conclusion. Our fine-tuned BERTopic model was able to capture meaningful latent factors that potentially impact treatment adherence. Nonetheless, many topic clusters still contained non-informative keywords, highlighting the necessity of human-in-the-loop supervision from domain experts.
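The last of the four modules listed in the methods, the c-TF-IDF topic representation, can be illustrated with a small standard-library sketch. It assumes the commonly described class-based weighting, term frequency within a topic multiplied by log(1 + A / f_t), where A is the average number of tokens per topic and f_t is the term's frequency across all topics. The function names and the toy tokens are hypothetical and are not taken from the study's notes or code.

```python
import math
from collections import Counter


def c_tf_idf(topic_docs):
    """Score terms per topic; topic_docs maps topic id -> list of tokenized notes."""
    # All notes assigned to a topic are pooled into one "class document".
    tf = {
        topic: Counter(tok for doc in docs for tok in doc)
        for topic, docs in topic_docs.items()
    }
    # Frequency of each term across all topics.
    total = Counter()
    for counts in tf.values():
        total.update(counts)
    # Average number of tokens per topic (the "A" in the weighting above).
    avg_tokens = sum(total.values()) / len(tf)
    return {
        topic: {
            term: count * math.log(1 + avg_tokens / total[term])
            for term, count in counts.items()
        }
        for topic, counts in tf.items()
    }


def top_terms(scores, n=3):
    """Return the n highest-scoring terms per topic (ties broken alphabetically)."""
    return {
        topic: [t for t, _ in sorted(s.items(), key=lambda kv: (-kv[1], kv[0]))[:n]]
        for topic, s in scores.items()
    }


# Made-up toy tokens, used only to exercise the functions.
toy = {
    0: [["side", "effect", "nausea"], ["nausea", "pill", "burden"]],
    1: [["transport", "clinic", "cost"], ["clinic", "distance", "cost"]],
}
scores = c_tf_idf(toy)
print(top_terms(scores, n=2))
```

Terms that are frequent within one topic but rare elsewhere receive the highest scores, which is why non-informative words can still surface when they happen to concentrate in one cluster.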
Poster ID: 213
Title: A Machine Learning Framework to Identify Ecological Risk Pathways in Cardiovascular Stress: Insights for Health Equity Using Decision Trees and SHAP
Authors: Marcia E. I. Uddoh
Poster Session A
Abstract:
**Please refer to PDF for complete version of the abstract**
Cardiovascular disease (CVD) resulting from chronic stress has been consistently linked to increased morbidity and mortality. Recognized by the World Health Organization as a critical intermediate social determinant of health (Solar, 2013), chronic stress demands targeted attention. Biomarkers play an essential role in the early detection of CVD, which disproportionately affects Black populations. However, stress is frequently assessed through subjective measures, offering limited actionable insights for clinicians and public health officials aiming for precision in understanding disease progression. As noted by Gaffey et al. (2022), this is especially problematic in cardiovascular health, where many remain hesitant to integrate psychological stressors into formal clinical guidelines. In contrast, biological and physiological markers can provide objective assessments of stress, enabling earlier identification of risk and the development of timely interventions to improve health outcomes. Particularly within Black communities, stress emerges from the dynamic interplay of complex environmental factors operating across multiple levels of the social ecological model. As highlighted by Golden & Earp (2012), the health promotion field has often focused narrowly on lifestyle changes while overlooking the broader contextual forces shaping health outcomes. This model offers an effective framework by mapping stressors at the individual, patient-physician, institutional, community, and policy levels, each potentially linked to specific biomarkers indicative of cardiovascular disease risk patterns. Once identified, these biomarkers can be applied in both clinical and public health settings to enable early detection of disease, whether at the level of an individual patient profile or across broader populations. This is especially critical, as existing ecological studies on cardiovascular disease often emphasize behavior modification (Savage et al., 2015) while overlooking the systemic factors contributing to the disproportionate burden of CVD among Black communities. Rather than focusing solely on individual behaviors, our approach examines the successive layers beyond the individual, exploring systemic, multi-level environmental determinants and their associated biological markers that collectively drive cardiovascular disease risk. However, manually identifying these complex patterns of stress-related biomarkers at both individual and population levels is labor-intensive and impractical. Compounding this challenge is the current lack of machine learning (ML) models specifically designed to integrate chronic stress, social-ecological indicators, and physiological outcomes. Given the multitude of biomarkers and the multiple layers of the social ecological model where cardiovascular health is impacted, manually analyzing these factors would be an arduous task. To address this complexity, we introduce the Black Cardiovascular Ecological (BaCE) Pathway, a structured process for mapping the ecological determinants of cardiovascular disease. In this study, we focus specifically on the biomarkers that emerge across the various ecological levels. Our approach demonstrates how a machine learning model can classify individual patient biomarkers or population-level profiles according to their corresponding position within the BaCE pathway. This facilitates early pattern recognition, detection, and the implementation of timely interventions. The model leverages a combination of Decision Tree algorithms and Shapley Additive Explanations (SHAP) to illuminate the pathways through which specific biomarker patterns inform cardiovascular risk across ecological levels. Similar approaches have successfully applied clinical biomarkers from electronic health records to predict atherosclerotic heart disease (Miranda et al., 2023) and demonstrated the accuracy and efficiency of tree-based algorithms in supporting decision-making for coronary artery disease diagnosis (Ghiasi et al., 2020). This interpretability enables health professionals to better understand the relationships between biomarkers and ecological risk factors, supporting informed clinical and public health decision-making.
Poster ID: 98
Title: FIVA: Federated Inverse Variance Averaging for Universal CT Segmentation with Uncertainty Estimation
Authors: Asim Ukaye, Numan Saeed, Karthik Nandakumar
Poster Session A
Abstract:
Different CT segmentation datasets are typically obtained from different scanners under different capture settings and often provide segmentation labels for a limited and often disjoint set of organs.
Using these heterogeneous data effectively while preserving patient privacy can be challenging. This work presents a novel federated learning approach to achieve universal segmentation across diverse abdominal CT datasets by utilizing model uncertainty for aggregation and predictive uncertainty for inference.
Our approach leverages the inherent noise in stochastic mini-batch gradient descent to estimate a distribution over the model weights, providing an on-the-fly uncertainty estimate for the model parameters at the client level. The parameters are then aggregated at the server with a Bayesian-inspired inverse-variance aggregation scheme that uses this additional uncertainty information.
Furthermore, the proposed method quantifies prediction uncertainty by propagating the uncertainty from the model weights, providing confidence measures essential for clinical decision-making. Consistent with recent work, predictive uncertainty is also used at the inference stage to improve predictive performance.
Experimental evaluations demonstrate the effectiveness of this approach in improving both the quality of federated aggregation and the uncertainty-weighted inference as compared to the previously established baselines.
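The abstract does not give the exact aggregation rule, so the following is only a minimal sketch of textbook inverse-variance weighting applied per parameter: each client's estimate is weighted by the reciprocal of its variance, and the aggregate variance is the reciprocal of the summed precisions. The function name, the flat-list parameter layout, and the `eps` guard are assumptions made for illustration.

```python
def inverse_variance_average(client_means, client_vars, eps=1e-12):
    """Combine per-parameter client estimates, weighting each by 1 / variance.

    client_means and client_vars are lists (one entry per client) of
    equal-length lists (one entry per model parameter).
    """
    n_params = len(client_means[0])
    agg_mean, agg_var = [], []
    for j in range(n_params):
        # Precision = 1 / variance; eps guards against division by zero.
        precisions = [1.0 / (v[j] + eps) for v in client_vars]
        total_precision = sum(precisions)
        weighted_sum = sum(p * m[j] for p, m in zip(precisions, client_means))
        agg_mean.append(weighted_sum / total_precision)
        agg_var.append(1.0 / total_precision)
    return agg_mean, agg_var


# Two hypothetical clients, two parameters each.
means = [[1.0, 0.0], [3.0, 0.0]]
variances = [[1.0, 1.0], [3.0, 1.0]]
mean, var = inverse_variance_average(means, variances)
print(mean, var)
```

Compared with plain federated averaging, a client that is uncertain about a parameter (for example, because its dataset lacks labels for the relevant organ) contributes less to that parameter's aggregate.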
Poster ID: 110
Title: From radiology reports to early prognostic markers: benchmarking LLMs in chronic liver disease
Authors: Hania Paverd, Zeyu Gao, Golnar K. Mahani, Sarah W Burge, Matthew Hoare, Mireia Crispin-Ortuzar
Poster Session A
Abstract:
Liver cancer almost exclusively affects patients with chronic liver disease (CLD), yet most are diagnosed at advanced stages with poor prognosis. Routine surveillance of these at-risk individuals generates a wealth of information within unstructured radiology reports, a currently underutilized resource that could hold crucial clues for predicting liver cancer development. This project aims to leverage large language models (LLMs) to automatically extract structured data from these reports, aiming to identify early prognostic markers. Using the open-source Meta-Llama 3.1 model, we extracted data from 282 reports and benchmarked its performance against regular expression search. While the LLM showed comparable performance to regex in identifying simpler features like ascites, it significantly outperformed regex in more complex tasks requiring inference, such as identifying liver imaging, diagnosing cirrhosis, extracting spleen size and treatment dates, and detecting liver lesions, particularly when employing specific questions and hierarchical prompting strategies. These findings suggest that LLMs, particularly with optimized prompting, hold significant potential for identifying early prognostic markers for liver cancer development in CLD patients, paving the way for earlier interventions and improved patient outcomes.
Poster ID: 118
Title: Prospective Validation of a Machine Learning Model to Predict Risk of Adult Inpatient Deterioration
Authors: Jay Swayambunathan, William Ratliff, Mark Sendak, Michael Gao, Kartik Pejavara, Suresh Balu, Srijan Bhasin, Dustin Tart, Cara O'Brien
Poster Session A
Abstract:
Background
Early identification of patient deterioration remains a significant clinical challenge, and delays in recognizing deterioration can have significant detrimental consequences to patient outcomes. To address this, a Light Gradient Boosting Machine (lgbm) model was developed at an academic medical center (AMC) to proactively predict deterioration among adult inpatients. This model demonstrated superior predictive performance (AUROC 0.816, AUPRC 0.055) compared to traditional scoring methods, such as the National Early Warning Score (AUROC 0.724, AUPRC 0.047). Silent validation was conducted to assess the real-time predictive accuracy of this model and estimate its potential impact on patient care.
Methods
In calendar year (CY) 2024, the model was deployed silently to generate predictions for 5,998 adult patients admitted to intermediate and stepdown inpatient units at AMC. Predictions were stratified into Medium Risk, High Risk, and Critical Risk categories based on previously defined thresholds for model output. The system was designed to silence alarms for four hours after an initial alert to simulate measures to reduce alarm fatigue among providers. A total of 4,963 unique patient deterioration events were identified in CY 2024, including rapid response team (RRT) engagement (2,592 events), intensive care unit (ICU) transfers (1,088 events), and in-hospital mortality (1,283 events). A total of 3,101 model alerts were recorded during CY 2024, with 1,963 identified as Medium Risk, 795 identified as High Risk, and 343 identified as Critical Risk. Positive predictive value (PPV) and sensitivity of alerts in predicting patient deterioration (defined as an RRT engagement, ICU transfer, or mortality) were calculated within 12 hours of an alert, as well as within outcome-specific windows (12 hours for RRT engagement, 24 hours for ICU transfers, and 48 hours for mortality), to assess solution performance for actionability of model results.
Results
When looking at patient deterioration within 12 hours, the model achieved an overall PPV of 0.1912 (Medium Risk: 0.1406, High Risk: 0.2528, Critical Risk: 0.3382). In terms of RRT engagement within 12 hours, ICU transfers within 24 hours, and mortality within 48 hours of an alert, the model achieved an overall PPV of 0.3628 (Medium Risk: 0.2929, High Risk: 0.4364, Critical Risk: 0.5918). In terms of sensitivity in predicting RRT engagement, ICU transfers, and patient mortality within 12 hours, the model achieved an overall sensitivity of 0.1005 (RRT engagement: 0.0515, ICU transfers: 0.1363, mortality: 0.1593). Similar results were seen for sensitivity for RRT engagement within 12 hours, ICU transfers within 24 hours, and mortality within 48 hours of an alert with an overall sensitivity of 0.1050 (RRT engagement: 0.0515, ICU transfers: 0.1381, and mortality: 0.1761).
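As a quick consistency check on the figures above, the overall PPV should equal the alert-count-weighted average of the per-category PPVs. The sketch below uses the alert counts and PPVs reported in this abstract; the function and variable names are hypothetical, and because the per-category PPVs are rounded, the pooled value is only expected to match approximately.

```python
# Alert counts per risk category, as reported in the Methods.
alerts = {"Medium": 1963, "High": 795, "Critical": 343}

# Per-category PPVs reported in the Results.
ppv_12h = {"Medium": 0.1406, "High": 0.2528, "Critical": 0.3382}
ppv_outcome_windows = {"Medium": 0.2929, "High": 0.4364, "Critical": 0.5918}


def pooled_ppv(alert_counts, category_ppv):
    """Overall PPV as estimated true-positive alerts divided by all alerts."""
    total_alerts = sum(alert_counts.values())
    true_positive_alerts = sum(
        alert_counts[cat] * category_ppv[cat] for cat in alert_counts
    )
    return true_positive_alerts / total_alerts


print(round(pooled_ppv(alerts, ppv_12h), 4))
print(round(pooled_ppv(alerts, ppv_outcome_windows), 4))
```

The same pooling does not apply to the sensitivities, whose denominators are the event counts per outcome type rather than the alert counts.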
Conclusions
Silent prospective validation demonstrated that the lgbm model provided meaningful predictive performance in identifying adult inpatient deterioration events. Improved PPV was observed at higher risk thresholds. Despite lower sensitivity, the robust PPV results especially at higher risk levels support the potential clinical utility of this model in proactively guiding supportive interventions prior to significant deterioration events. Full retrospective analysis of data in CY 2024 and analysis of “false positive” events in which an alert was fired without subsequent RRT engagement/ICU transfer/mortality are underway to further refine understanding of model performance.
Poster ID: 86
Title: Designing a RAG-enabled physician messaging platform to enable asynchronous healthcare delivery
Authors: David C Whitehead, Tony Yue Sun, Muthuraman Alagappan, Rishi Khakhkhar, Sebastian Wakefield, Justin Yu, Jaisal Friedman, Jessica Fan, Angel Samsuddin Maredia, Uzair Khan
Poster Session A
Abstract:
While American patients have more digital care options than ever before, patients still face fundamental access-to-care disparities. The fastest-growing care options (e.g., automated symptom checkers, telemedicine consults, urgent care clinics) provide immediate answers, but lead to suboptimal care, and contribute to America's growing dependence on low-acuity emergency department utilization. In this abstract, we highlight how we built an asynchronous AI-enabled physician messaging platform to efficiently resolve medical issues with lower cost asynchronous care, and how surfacing contextualized information improves our physician efficiency.
We designed a clinical AI messaging platform that enables our physicians to efficiently provide asynchronous medical advice. Using the Fast Healthcare Interoperability Resources (FHIR) standard, our platform automatically aggregates patient health records from health information exchange (HIE) partners after patients onboard to our platform. When patients start a new messaging thread on their mobile application, we leverage a series of large language model (LLM) calls to engage the patient and collect relevant diagnostic histories in a back-and-forth chat sequence (improving physician efficiency). We collate and summarize these collected histories along with patient-uploaded multimodal inputs (e.g., voice recordings, uploaded images) for our physicians, while simultaneously generating tailored responses using our custom-built retrieval-augmented generation (RAG) pipeline (which integrates records from the HIE).
We found that supplementing our pre-drafted recommendations with HIE records, as well as automating history-taking led to significant improvements in clinical efficiency. Compared to a simulated electronic health record inbox environment, our AI-powered physician messaging platform (i.e., RAG-based clinical response pre-generation, contextualized patient onboarding and history-taking process) has improved clinician efficiency by 79% on a threads resolved per hour basis. Moreover, average time to clinical resolution for a thread was reduced by 9.3 minutes (41% decrease) compared to a baseline, unaided asynchronous messaging platform. In our pilot studies where we onboarded real users from a Medicare Advantage plan and beta users, we have been able to resolve roughly 72% of patient queries on our platform. For acute issues, 100% of patients received a post-query follow-up from a physician on our platform to ensure resolution, with an additional 22% of patients having been referred for non-urgent follow-up care. Preliminary data shows an annual estimated acute care cost saving of $417 per engaged member. By leveraging LLMs and RAG to contextualize and reduce physician information gathering needs, our clinician messaging platform empowers physicians to quickly answer patient questions and prevent unnecessary acute care utilization.
Poster ID: 77
Title: Development, external validation, and deployment of RFAN-ML: a machine learning model to estimate renal function after nephrectomy
Authors: Jesse Persily, Yassamin Neshatvar, Rajesh Ranganath, Katie S Murray, Madhur Nayan
Poster Session A
Abstract:
Background.
Surgery, either partial or radical nephrectomy, is the mainstay of treatment for renal tumors. While oncologic outcomes are generally favorable after surgical treatment for renal tumors, concerns persist regarding the potential long-term impact on renal function, which is a critical determinant of cardiovascular health and overall survival. This concern underscores the preference in several clinical guidelines for the use of partial nephrectomy, in which the renal tumor is removed while the remaining normal kidney is preserved, as the surgical approach over radical (or total) nephrectomy, when feasible. However, partial nephrectomy can be associated with increased perioperative risks, including higher rates of urinary leak and secondary interventions. To balance these risks against the potential benefits, it is essential to estimate the potential impact of each surgical approach on long-term renal function. Furthermore, a pre-operative assessment of estimated renal function after surgery can facilitate earlier identification of patients at high risk of developing a significant decline, enabling personalized perioperative management to optimize outcomes. In this study, we develop, externally validate, and deploy a machine learning (ML) model predicting renal function after nephrectomy (RFAN-ML).
Methods.
We used electronic health record data to identify patients undergoing a partial or radical nephrectomy at a multi-hospital, tertiary academic center (Institution A). As candidate features for the prediction model, we extracted age at nephrectomy, sex, race, pre-operative body mass index, history of diabetes mellitus, history of hypertension, pre-operative glomerular filtration rate (GFR), nephrectomy laterality, and nephrectomy type (partial vs. radical). We split the data into training and test samples, based on the hospital site. From the candidate features, we used the Boruta algorithm to select the final set of input features for the model. We used the selected features to train and compare various supervised ML regression models to estimate the new baseline estimated glomerular filtration rate (NB-GFR), measured as the average of all GFR values between 3 and 12 months post-operatively. The primary performance metric was root mean squared error (RMSE). Secondary performance metrics included R squared and mean absolute error (MAE). We externally validated the model at a separate multi-hospital, tertiary academic center (Institution B) and compared our model to previous benchmarks. We deployed our final model online as a web-based application.
Results.
The training sample from Institution A comprised 1,932 patients and the final input features selected were age at nephrectomy, nephrectomy type, pre-operative GFR, and body mass index. The best ML model predicting NB-GFR was Random Forest. In the test sample (n=1,149) from Institution A, this model (RFAN-ML) demonstrated an RMSE of 12.5 (95% confidence interval (CI) 11.7 - 13.2), R squared of 0.61 (95% CI 0.57 - 0.66), and MAE of 9.2 (95% CI 8.6 - 9.7). In the external validation sample (n=891) from Institution B, RFAN-ML (RMSE 16.6 (95% CI 15.6 - 17.5)) outperformed benchmark 1 (requires 3 input features, RMSE 19.1 (95% CI 18.0 - 20.3)) and had similar performance to benchmark 2 (requires 5 input features, RMSE 16.2 (95% CI 15.3 - 17.0)) (Figure 1). Through our deployed application (link hidden during review), the user inputs 3 features (age at nephrectomy, pre-operative eGFR, and body mass index) and estimates for NB-GFR after partial and radical nephrectomy are generated.
Conclusion.
We developed, externally validated, and deployed RFAN-ML, an ML model that provides individualized estimates of renal function after nephrectomy based on routinely available information from electronic health record data. RFAN-ML has the potential to improve the care and outcomes in patients with renal tumors by informing personalized patient counseling, guiding surgical planning, and facilitating earlier identification of patients at risk for significant renal impairment after surgery.
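To make the outcome definition and the reported metrics concrete, here is a minimal standard-library sketch of (a) the new baseline GFR as the mean of GFR values measured 3 to 12 months after surgery and (b) RMSE, MAE, and R squared. Treating the window bounds as inclusive is an assumption, as are all function names and the toy numbers, which are not study data.

```python
import math


def new_baseline_gfr(measurements, start_month=3.0, end_month=12.0):
    """Mean GFR over (months_after_surgery, gfr) pairs inside the window.

    Returns None when no measurement falls in the window.
    Inclusive bounds are an assumption of this sketch.
    """
    window = [gfr for month, gfr in measurements if start_month <= month <= end_month]
    if not window:
        return None
    return sum(window) / len(window)


def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE, and R squared for paired observed/predicted values."""
    n = len(y_true)
    errors = [pred - true for true, pred in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1.0 - ss_res / ss_tot,
    }


# Toy values only: one patient's post-operative GFR series, then three predictions.
nb_gfr = new_baseline_gfr([(1, 40.0), (4, 50.0), (10, 60.0), (14, 70.0)])
metrics = regression_metrics([50.0, 60.0, 70.0], [52.0, 58.0, 73.0])
print(nb_gfr, metrics)
```

RMSE penalizes large misses more heavily than MAE, which may be why the abstract reports both alongside R squared.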
Poster Session B
Poster ID: 147
Title: Iterative Learning of Computable Phenotypes for Treatment Resistant Hypertension using Large Language Models
Authors: Guilherme Seidyo Imai Aldeia, Daniel S Herman, William La Cava
Poster Session B
Abstract:
Large language models (LLMs) have demonstrated remarkable capabilities for medical question answering and programming, but their potential for generating interpretable computable phenotypes (CPs) is under-explored.
In this work, we investigate whether LLMs can generate accurate and concise CPs for six clinical phenotypes of varying complexity, which could be leveraged to enable scalable clinical decision support to improve care for patients with hypertension.
In addition to evaluating zero-shot performance, we propose and test a synthesize, execute, debug, instruct strategy that uses LLMs to generate and iteratively refine CPs using data-driven feedback.
Our results show that LLMs, coupled with iterative learning, can generate interpretable and reasonably accurate programs that approach the performance of state-of-the-art ML methods while requiring significantly fewer training examples.
Poster ID: 106
Title: Generating Accurate Synthetic Survival Data by Conditioning on Outcomes
Authors: Mohd Ashhad, Ricardo Henao
Poster Session B
Abstract:
Synthetically generated data can improve privacy, fairness, and data accessibility; however, generating it can be challenging in specialized scenarios such as survival analysis. One key challenge in this setting is censoring, i.e., the timing of an event is unknown in some cases.
Existing methods struggle to accurately reproduce the distributions of both observed and censored event times when generating synthetic data. We propose a conceptually simple approach that generates covariates conditioned on event times and censoring indicators by leveraging existing tabular data generation models without making assumptions about the mechanism underlying censoring.
Experiments on real-world datasets demonstrate that our method consistently outperforms baselines and improves downstream survival model performance.
Poster ID: 195
Title: ReXscore: A Comprehensive Framework For Evaluating AI-generated Radiology Reports
Authors: Xiaoli Yang, Subathra Adithan, Julian Nicolas Acosta, Suvrankar Datta, Oishi Banerjee, Xiaoman Zhang, Osvaldo Landi Junior, Siddhant Dogra, Jung Oh Lee, Rohit Reddy, Joe Vimal Raj, Selvaganesan Muthu Purushothaman, Gouthame Sourya V S, Vinith M V, Divya B
Poster Session B
Abstract:
Automated radiology report generation (RRG) has the potential to improve efficiency and accuracy in radiology workflows. While classification or localization models can be straightforwardly evaluated using exact metrics, automatic report evaluation requires metrics to compare AI generations against radiologist-written reference reports and produce nuanced judgments of complex medical text. An ideal metric would focus on capturing meaningful differences in content, yet report evaluation metrics have been overly sensitive to low-impact differences in reporting style, i.e. subjective variations in what radiologists choose to describe and how they phrase it. We propose ReXscore, a large language model-based metric designed to address these issues. Developed through a collaboration between radiologists and AI researchers, ReXscore’s novel framework identifies and categorizes clinically significant discrepancies in content between AI and reference reports while remaining robust against variations in reporting style. We find that ReXscore attains a high level of agreement (Kendall’s tau = 0.42) with radiologists, surpassing existing metrics and showing that our metric reliably captures differences between reference reports and AI generations.
Poster ID: 43
Title: Error Profiling of Machine Learning Models: An Exploratory Visualization
Authors: Al Rahrooh, Jeffrey Feng, Alex Bui
Poster Session B
Abstract:
While data-driven predictive models are increasingly used in healthcare, their clinical translation remains limited—partly due to challenges in evaluating model performance across design choices. Existing explainability methods often focus on intra-model interpretability but fall short in supporting inter-model comparisons. We present a visualization-based error profiling method that facilitates comparative evaluation by highlighting overlaps and differences in model predictions. Our matrix-based visualization maps which models incorrectly classify which patient subgroups, with color intensity indicating the number of misclassified patients. This approach enables deeper insight into which (sub)populations are consistently (in)correctly classified across models, helping uncover patterns of model (dis)agreement and assess the impact of modeling decisions. We demonstrate our visualization method in four healthcare use cases: 1) missing data imputation in a longitudinal nutritional dataset; 2) feature set analysis using randomized controlled trial data; 3) end-model technical performance in cardiac morbidity prediction; and 4) data modality comparison using a dual-source lung cancer dataset with longitudinal and radiomic features. To evaluate the visualization, we obtained expert feedback and qualitative assessments of decision-making insights. Survey results—across clinicians, computer scientists, and medical informaticians—indicated that our method provides an interpretable and intuitive way to compare model error distributions by highlighting patterns within correctly and incorrectly classified subpopulations across different models. Our comprehensible error profiling approach represents an initial step toward a systematic framework for improving model assessment in clinical tasks. Through this framework, both model developers and end users can better understand when and where a given model is appropriate for real-world clinical deployment.
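The matrix underlying such a visualization can be sketched in a few lines of Python. This is only a minimal illustration of counting misclassified patients per model and subgroup; the function name, the subgroup labels, and the toy predictions are hypothetical, and the authors' actual implementation is not described in the abstract.

```python
def error_profile(records, predictions):
    """Count misclassified patients per (model, subgroup).

    records: list of (patient_id, subgroup, true_label)
    predictions: {model_name: {patient_id: predicted_label}}
    Returns {model_name: {subgroup: n_misclassified}}.
    """
    subgroups = sorted({subgroup for _, subgroup, _ in records})
    # Start every cell at zero so subgroups with no errors still appear.
    matrix = {model: {sg: 0 for sg in subgroups} for model in predictions}
    for patient_id, subgroup, truth in records:
        for model, preds in predictions.items():
            if preds[patient_id] != truth:
                matrix[model][subgroup] += 1
    return matrix


# Toy data (illustrative only).
records = [
    (1, "age<65", 1),
    (2, "age<65", 0),
    (3, "age>=65", 1),
    (4, "age>=65", 1),
]
predictions = {
    "logreg": {1: 1, 2: 1, 3: 0, 4: 0},
    "xgboost": {1: 1, 2: 0, 3: 0, 4: 1},
}
profile = error_profile(records, predictions)
for model, row in profile.items():
    print(model, row)
```

Each cell of the resulting matrix could then be mapped to a color intensity in a heatmap, with rows for models and columns for subgroups.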
Poster ID: 202
Title: Leveraging Large Language Model for Predicting Rheumatoid Arthritis Treatment Response
Authors: Ruilin Wang, Yue Li, Marie Hudson
Poster Session B
Abstract:
Anti-tumor necrosis factor (anti-TNF) therapies, commonly prescribed as first-line treatments for rheumatoid arthritis (RA), exhibit variable patient responses, highlighting the need for accurate predictive modeling. This study investigated the performance of fine-tuned large language models (LLMs), specifically the DeepSeek-R1 distilled Qwen model, compared to traditional machine learning methods (e.g., XGBoost, logistic regression) in predicting patient response to anti-TNF therapy. Using data from 105 RA patients receiving anti-TNF treatments, we evaluated model performance both with and without strong clinical biomarkers. Fine-tuned LLMs demonstrated superior accuracy (up to 68%) and ROC-AUC performance (up to 0.70) compared to conventional methods in scenarios lacking robust biomarkers. However, traditional methods like XGBoost achieved higher accuracy (97%) when strong biomarkers were present. Results underline the significant potential of LLMs for advancing personalized RA treatment, particularly in clinical settings where strong predictive biomarkers are unavailable.
Poster ID: 159
Title: Bayesian Neural Network (BNN) for Personalised Empirical Antibiotic Treatment and Escalation Guidance with Comparison to Clinician Prescribing
Authors: Augustine Yui Hei Luk, Kevin Yuan, Jia Wei
Poster Session B
Abstract:
Background.
Antimicrobial resistance (AMR) presents challenges to timely and effective treatment of sepsis. Empirical antibiotic selection often relies on population-level susceptibility patterns without accounting for patient-specific factors and provides limited guidance for escalating therapy if initial treatment fails. To address this gap, we propose using a Bayesian neural network (BNN) trained on rich electronic health record (EHR) data to personalise antibiotic selection and guide escalation decisions, aiming to reduce delays in effective treatment.
Unlike standard neural networks, BNNs treat weights and outputs as probability distributions that allow explicit uncertainty quantification. This is increasingly recognised as essential for building safer, more reliable clinical decision-support tools. We evaluated the BNN’s performance against real-world clinician prescribing practices, with considerations for model uncertainty.
Methods.
Data were collected from a university hospital serving a population of 750,000, representing ~1% of the UK population. We included patients aged ≥16 with a positive blood culture for Enterobacterales between 01 January 2017 and 30 September 2024. This resulted in 5,083 eligible cultures with antimicrobial susceptibility testing (AST) results. Our study investigated nine common antibiotics: amoxicillin, ceftriaxone, ciprofloxacin, co-amoxiclav, co-trimoxazole, ertapenem, gentamicin, meropenem, and piperacillin-tazobactam. Model input included demographics, comorbidities, prior hospital visits, antibiotic history, vital signs, labs, and local population-level AMR rates.
Two predictive models were developed: (1) an initial empirical selection model using information available on hospital admission and (2) escalation models predicting susceptibility to alternative antibiotics given initial treatment failure. We trained escalation models with and without bacterial species information, as that may become available before full AST results. The BNN has multiple tapering hidden layers, each applying a Bayesian linear transformation, optimised by binary cross-entropy with logits.
We compared our BNN recommendations against real-world clinician prescription patterns at two time intervals: initial empirical selection (within 6 hours post-culture) and escalation decisions (12–30 hours post-culture). Antibiotic intensity was ranked from narrow (amoxicillin=1) to broad-spectrum (carbapenems=5) for comparison. We further examined the effect of uncertainty thresholding to reject high-variance predictions, aligning with clinical caution.
Results.
Escherichia coli (65.8%), Klebsiella pneumoniae (11.5%) and Proteus mirabilis (4.7%) were the most common organisms isolated. Resistance rates were highest for amoxicillin (66.1%) and co-amoxiclav (40.2%), and lowest for piperacillin-tazobactam (7.4%) and carbapenems (<0.6%).
The initial BNN model achieved a mean AUC of 0.72 [0.66–0.81] across nine antibiotics (Table S1). Escalation BNN models, incorporating initial treatment resistance information, performed better (mean AUC 0.77 [0.72–0.82]), and improved further with bacterial species information (mean AUC 0.80 [0.76–0.83]) (Table S2). Ertapenem and meropenem were excluded from subsequent analysis due to their low resistance rates.
Compared with clinician empirical prescriptions (Table S3), the initial BNN showed similar coverage (69% vs clinician baseline: 68%) but lower average antibiotic intensity (1.9 vs 2.4). Applying an uncertainty threshold using a 95% confidence interval improved coverage (80%) with only a modest increase in antibiotic intensity (2.1). Escalation BNNs without uncertainty thresholding achieved 76% and 71% coverage with and without species information (clinician baseline: 74%) but at lower antibiotic intensity (1.8 and 1.9 vs 2.6). Applying stringent variance thresholds (<0.0005) significantly increased coverage (83% and 85%), while matching clinician antibiotic intensity (2.6 and 2.7). Our results suggest that leveraging uncertainty thresholds can ensure effective coverage from BNN antibiotic selection while mitigating overprescription.
Conclusion.
To our knowledge, no research to date has attempted to develop machine learning-based personalised escalating antibiograms for guiding empirical antibiotic escalation therapy. We demonstrated that BNNs can effectively integrate patient-specific EHR data to individualise both initial and escalation empirical antibiotic therapy, achieving comparable or superior coverage to clinician practices while reducing reliance on broad-spectrum antibiotics.
Incorporating uncertainty facilitates better-informed decisions and safer AI integration into clinical workflows. Future work focusing on hyperparameter optimisation, addressing class imbalance and external validation is expected to further enhance clinical applicability and performance.
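The uncertainty thresholding described in this abstract can be illustrated with a minimal Python sketch. It assumes that repeated stochastic forward passes of a BNN yield several susceptibility probabilities per antibiotic, and it picks the narrowest-spectrum drug whose mean probability is high and whose predictive variance is below a threshold. The function name, the `min_prob` cutoff, the sample values, and the intermediate intensity ranks are hypothetical (only amoxicillin=1 and carbapenems=5 are stated above); the abstract does not specify the exact selection rule.

```python
from statistics import mean, pvariance


def recommend(samples, intensity, min_prob=0.8, max_var=0.0005):
    """Pick the lowest-intensity antibiotic with a confident prediction.

    samples: {drug: [susceptibility probabilities from repeated BNN passes]}
    intensity: {drug: spectrum rank, 1 = narrowest}
    Returns the drug name, or None if every prediction is rejected.
    """
    candidates = []
    for drug, probs in samples.items():
        mu = mean(probs)
        var = pvariance(probs)  # spread across passes as an uncertainty proxy
        if mu >= min_prob and var < max_var:
            candidates.append((intensity[drug], drug))
    if not candidates:
        return None  # assumption: defer to the clinician when nothing passes
    return min(candidates)[1]


# Illustrative values only; not data from the study.
samples = {
    "amoxicillin": [0.30, 0.35, 0.32],
    "co-amoxiclav": [0.90, 0.60, 0.96],  # high mean, but passes disagree
    "ceftriaxone": [0.90, 0.91, 0.89],
    "meropenem": [0.99, 0.99, 0.98],
}
intensity = {"amoxicillin": 1, "co-amoxiclav": 2, "ceftriaxone": 3, "meropenem": 5}

print(recommend(samples, intensity))               # with variance thresholding
print(recommend(samples, intensity, max_var=1.0))  # thresholding effectively off
```

In this toy case, thresholding rejects the high-variance co-amoxiclav prediction and moves the recommendation one step broader, which mirrors the trade-off between coverage and antibiotic intensity reported in the Results.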
Poster ID: 191
Title: Improving ARDS Diagnosis Through Context-Aware Concept Bottleneck Models
Authors: Anish Narain, Ritam Majumdar, Nikita Narayanan, Dominic C Marshall, Sonali Parbhoo
Poster Session B
Abstract:
The digitization of medical data has opened the door for AI to improve healthcare delivery, but the opaque nature of AI technologies presents challenges for interpretability, which is crucial in clinical settings. Previous work has attempted to explain predictions using Concept Bottleneck Models (CBMs), which learn interpretable concepts or feature groupings that map to higher-level clinical ideas, such as disease severity, facilitating human evaluation. However, these models often experience performance limitations when the concepts fail to adequately explain or characterize the task. In our study, we demonstrate the importance of incorporating contextual information from clinical notes to improve CBM performance, particularly in characterizing Acute Respiratory Distress Syndrome (ARDS), using data from MIMIC-IV. Our approach leverages a Large Language Model (LLM) to process clinical notes and generate additional concepts, boosting accuracy by up to 10% compared to existing methods. This method also enables learning more comprehensive concepts, reducing the risk of information leakage and reliance on spurious shortcuts, thus improving the characterization of ARDS.
Poster ID: 69
Title: Monte Carlo ExtremalMask: Uncertainty Aware Time Series Model Interpretability For Critical Care Applications
Authors: Shashank Yadav, Vignesh Subbian
Poster Session B
Abstract:
Model interpretability for biomedical time-series contexts (e.g., critical care medicine) remains a significant challenge where interactions between pathophysiological signals obscure clinical interpretations. Traditional feature-time attribution methods for time series generate static, deterministic saliency masks, which fail to account for the temporal uncertainty and probabilistic nature of model-inferred feature importance in dynamic physiological systems such as acute organ failure. We address this limitation by proposing a probabilistic framework leveraging Monte Carlo Dropout to quantify model-centric epistemic uncertainty in attribution masks. We capture the stochastic variability through iterative sampling, though the inherent randomness introduces inconsistency in mask outputs across sampling iterations. We implement a dual optimization strategy incorporating entropy minimization and spatiotemporal variance regularization during training to ensure the convergence of attribution masks toward higher informativeness and lower entropy while preserving uncertainty quantification. This approach provides a systematic way to prioritize feature-time pairs by balancing high attribution scores with low uncertainty estimates, enabling end users to discover clinical biomarkers for time-dependent pathophysiological deterioration of patient state. Our work advances the field of healthcare machine learning by formalizing uncertainty-aware interpretability for temporal models while bridging the gap between probabilistic attributions and clinically actionable interpretations for problems in critical care.
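One way to picture the prioritization of feature-time pairs is the Python sketch below. It assumes several attribution masks have already been sampled (for example, one per Monte Carlo Dropout pass) and ranks each (time, feature) cell by its mean attribution minus a penalty on its standard deviation. The scoring rule, the `penalty` parameter, and the toy masks are assumptions made for illustration; the abstract does not give the authors' exact ranking criterion.

```python
from statistics import mean, pstdev


def rank_feature_time_pairs(mask_samples, penalty=1.0, top_k=3):
    """Rank (time, feature) cells by mean attribution minus penalty * std.

    mask_samples: list of masks, each a list of rows indexed [time][feature].
    Returns the top_k (time_index, feature_index) pairs, best first.
    """
    n_time = len(mask_samples[0])
    n_feat = len(mask_samples[0][0])
    scored = []
    for t in range(n_time):
        for f in range(n_feat):
            values = [mask[t][f] for mask in mask_samples]
            score = mean(values) - penalty * pstdev(values)
            scored.append((score, t, f))
    scored.sort(reverse=True)
    return [(t, f) for _, t, f in scored[:top_k]]


# Three hypothetical sampled masks over 2 time steps and 2 features.
mask_samples = [
    [[0.9, 0.1], [1.0, 0.75]],
    [[0.9, 0.2], [1.0, 0.75]],
    [[0.9, 0.0], [0.4, 0.75]],
]
print(rank_feature_time_pairs(mask_samples, penalty=1.0, top_k=2))
print(rank_feature_time_pairs(mask_samples, penalty=0.0, top_k=2))
```

With the penalty switched off, the cell at time 1, feature 0 ranks second on mean attribution alone; with the penalty on, its disagreement across samples pushes it below the stable cell at time 1, feature 1.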
Poster ID: 25
Title: Enabling Accessible Biomarker Prediction Through 3D Body Mesh Estimation from Surface Scans
Authors: Ayis Pyrros, Muhammad Ahmed Chaudhry, Suhana Bedi, Pola Lydia Lagari, Brian T Layden, William Galanter, Sanmi Koyejo
Poster Session B
Abstract:
Body composition metrics—such as visceral fat, muscle density, and organ attenuation—are powerful imaging biomarkers known to correlate with cardiometabolic conditions including diabetes and heart disease. Traditionally, these biomarkers require advanced imaging modalities such as computed tomography (CT) or magnetic resonance imaging (MRI), restricting their availability to clinical settings and limiting their utility for population-level or opportunistic screening.
In this study, we propose a novel approach to democratize access to body composition analysis by adapting these imaging biomarkers to 3D body surface meshes, which can be obtained using consumer-grade technologies such as LiDAR-enabled smartphones or depth cameras. By leveraging large annotated CT datasets and learning mappings from internal body composition to external surface geometry, we develop machine learning models capable of predicting key biomarkers from non-invasive, radiation-free surface scans.
This approach has the potential to transform biomarker accessibility by providing individuals, clinicians, and researchers with actionable physiological insights without the need for clinical imaging infrastructure. Our results suggest that surface-derived 3D meshes can approximate traditional imaging-based biomarkers with promising accuracy, opening the door to broad, low-cost, and scalable health screening tools for early disease risk detection.
Poster ID: 169
Title: Artificial Intelligence-Based Diagnostic System for Early Gastric Cancer and Precancerous Lesions: A Multicenter Retrospective Study
Authors: Yingyun Yang, Wanying Liao, Qilei Chen, Benyuan Liu
Poster Session B
Abstract:
Background/Purpose:
Accurate and objective prediction of the severity of gastrointestinal mucosal morphology, pathological subtypes, and long-term progression risk remains a critical challenge. This study aims to develop an AI-assisted endoscopic system for identifying gastric mucosal lesions, classifying pathological types, and predicting progression risk in low-grade intraepithelial neoplasia (LGIN) patients.
Methods:
A multicenter retrospective study was conducted in China. Two deep learning models, ResNet-101 and Swin Transformer, were trained to classify endoscopic images into four categories (inflammation, LGIN, high-grade intraepithelial neoplasia [HGIN], and cancer). Performance was evaluated using accuracy, precision, recall, F1-score, and confusion matrices. Grad-CAM visualizations highlighted model decision-making regions. Additionally, a 15-year cohort of LGIN patients from Peking Union Medical College Hospital was analyzed to develop a progression prediction model.
Results:
Lesion Classification
The four-class model achieved state-of-the-art diagnostic performance, surpassing previous binary (cancer vs. non-cancer) systems. ResNet-101 and Swin Transformer demonstrated comparable accuracy (ResNet: 92.3% vs. Swin: 93.7%) in validation sets (Table 3). Confusion matrices revealed minimal misclassification between adjacent pathological categories (Figures 1–2).
Progression Prediction
The preliminary model predicted LGIN progression with 70–75% accuracy (Figure 2), outperforming junior endoscopists.
Conclusion:
This AI system enables real-time, accurate diagnosis and risk stratification for early gastric cancer and precancerous lesions. Prospective multicenter validation is ongoing to refine clinical applicability.
Poster ID: 125
Title: Interaction between distinct but related AI clinical prediction models after deployment: A real-world analysis
Authors: Michael Colacci, Chloé Pou-Prom, Derek Beaton, Lauren Erdman, Michael Fralick, Muhammad Mamdani
Poster Session B
Abstract:
Introduction
Simulation studies suggest that sequential or concurrent deployment of related artificial intelligence (AI) clinical prediction models may influence model predictive performance and the frequency of obsolete predictions.1 However, real-world data on the impact of deploying multiple overlapping AI models remains scarce.2,3
Methods
We evaluated the effect of consecutively deploying, at St. Michael’s Hospital, two distinct AI-based clinical prediction models that overlap in both input features and outcome events. The Deterioration model predicted inpatient deterioration (ICU transfer or death), was developed on data from 2011 to 2019, and was deployed in 2020.4 The Bleeding model predicted inpatient bleeding, was developed on data from 2019 to 2023, and was deployed in September 2023. We assessed changes in Deterioration model alert frequency, predictive performance, clinician utilization, and patient outcomes following introduction of the Bleeding model. Clinician utilization was defined as the proportion of alerted patients with vital signs measured every 4 hours, as this was a recommended action following a Deterioration model alert. We compared the co-deployment period when both models were active (Sep 2023 - Sep 2024) to a control period in which the Bleeding model was in silent deployment (Apr 2023 - Aug 2023), during which its predictions were not communicated to clinicians. Additionally, we evaluated clinician utilization among patients receiving alerts from both models versus those receiving alerts from the Deterioration model alone.
Results
A total of 4,337 hospitalizations were included in the study (Table 1). Of alerted hospitalizations (n=1,268), 16.6% (n=210) received alerts from both models. For patients with both alerts, the model that alerted first was roughly evenly split (Deterioration 54%, Bleeding 46%). The median interval between alerts was 2 days (Interquartile Range [IQR] 1-6). The median number of daily alerts was 3 for the Deterioration model (IQR 2-4) and 2 for the Bleeding model (IQR 1-3).
The introduction of the Bleeding model was not associated with a change in the frequency of Deterioration model alerts (Table 1). Deployment of the Bleeding model did not significantly affect predictive performance of the Deterioration model. Clinician utilization of the Deterioration model’s recommendations was unaffected, including among patients receiving alerts from both models. Additionally, there was no difference in downstream clinical outcomes after a Deterioration model alert (outcome rate 3.9% during control vs. 3.4% during intervention period, p=0.37).
Conclusion
In this real-world evaluation, we observed moderate patient overlap between AI models predicting inpatient deterioration and bleeding. The introduction of the Bleeding model did not adversely impact Deterioration model alert frequency, discriminatory performance, clinician utilization or downstream clinical outcomes. While concerns exist that co-deployed models may interfere with one another, our findings suggest that such effects are likely context specific. The decision to implement multiple related AI models must carefully weigh their individual benefits against their combined impact on clinical workflows and clinician engagement. Further studies across diverse settings are needed.
Poster ID: 60
Title: Transparent AI for Retinal Diagnostics: Interpretable Machine Learning Applied to Spectral Domain Optical Coherence Tomography (SD-OCT) Imaging
Authors: Kyoung A Viola Lee, Arpan Sahoo, David Samvelian, Utsav Kapoor, Bharath Subramanian, Corey Dylan Tesdahl, Radouil Tzekov
Poster Session B
Abstract:
Background: Optical coherence tomography (OCT) is an imaging modality widely used in ophthalmology to obtain high-resolution cross-sectional images of the retina. It plays a crucial role in diagnosing and monitoring a broad spectrum of retinal pathologies, including glaucoma, diabetic retinopathy, and age-related macular degeneration (AMD). Although machine learning (ML) has enhanced OCT interpretation, many models lack transparency, limiting clinical trust. Explainable AI (XAI) techniques such as Gradient-weighted Class Activation Mapping (Grad-CAM) aim to address this by offering visual insights into model decisions. This study evaluates interpretable ML applied to OCT images, aiming to balance accuracy with interpretability.
Methods: Spectral-domain OCT (SD-OCT) images were collected and categorized into retinal vein occlusion (RVO, n=101), retinal artery occlusion (RAO, n=60), AMD (n=1231), and normal (n=332). Data were split into training and validation sets. All images were resized to 224×224 pixels, converted to three-channel RGB format, and normalized using standard ImageNet parameters. Image transformations were implemented using PyTorch. A pretrained ResNet-18 convolutional neural network was used as the base model. The final fully connected layer was replaced with a new linear layer to match the number of output classes (binary classification for AMD vs. normal, and for RVO vs. RAO, respectively). Each model was trained with the cross-entropy loss function and the Adam optimizer at a learning rate of 1e-4 for 5 epochs. Training and validation were performed using PyTorch with a batch size of 16. Computations were accelerated using a GPU. Model performance was evaluated on the validation set using standard classification metrics: accuracy, sensitivity, specificity, and the area under the receiver operating characteristic curve (AUROC). Grad-CAM heatmaps were generated by backpropagating gradients from the predicted class score through the final convolutional layer of the ResNet-18 model. Using OpenCV and Matplotlib, heatmaps were overlaid on the original OCT images to highlight the retinal regions contributing most to the model’s classification decisions.
Results: The AMD vs. Normal classifier achieved an overall accuracy of 90%, sensitivity of 95%, and specificity of 85% (Fig 1A). AUROC was 0.98 (Fig 1B). These results suggest that the convolutional neural network, when trained on SD-OCT images, can effectively differentiate AMD from normal retinas with high accuracy and clinically relevant sensitivity. The Grad-CAM visualization for an AMD image (Fig 1C) demonstrated strong activation centered around the macular region within the outer retinal layers, aligning with the known pathophysiology of AMD which often involves the retinal pigment epithelium and photoreceptor complex. The model appears to have correctly focused on clinically relevant areas when making its predictions. The Grad-CAM map for a normal retina (Fig 1D) also highlights the foveal depression and surrounding architecture, but with less intense and more uniformly distributed activation, emphasizing structural preservation.
The RVO vs. RAO classifier exhibited moderate performance on the validation dataset, with an overall accuracy of 62%, sensitivity of 100% for identifying RVO and a specificity of 44% for RAO. The AUROC was 0.92. While the model was able to correctly classify all RVO cases, it frequently misclassified RAO cases as RVO, indicating a potential bias toward predicting venous occlusions. The Grad-CAM visualization for an RVO image showed strong activation over regions of retinal swelling and intraretinal fluid, which are hallmark features of venous occlusive disease. The highlighted area corresponds with areas of neurosensory thickening and signal shadowing, suggesting that the model appropriately focused on pathologically relevant features. In contrast, the Grad-CAM result for an RAO image revealed a more diffuse and ambiguous heatmap, with less concentrated activation in the central retina. This likely reflects the subtler and more variable imaging presentation of arterial occlusions, which can be challenging to distinguish from normal or other pathologic findings. Together, these results indicate that while the model can sensitively detect RVO pathology, further refinement is needed to improve its specificity for RAO detection. Nevertheless, the Grad-CAM outputs demonstrate that the model is attending to anatomical regions consistent with clinical reasoning, supporting its potential for guided interpretation in future iterations.
Conclusion: Interpretable deep learning models can accurately classify common retinal pathologies on SD-OCT while providing clinically meaningful visual explanations, offering promise for transparent AI integration in ophthalmic diagnostics.
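The evaluation metrics named in the Methods (accuracy, sensitivity, specificity, AUROC) can be computed as in the Python sketch below. The function names and the toy labels and scores are hypothetical and are not the study's data. The example is built so that a high AUROC coexists with low specificity at a fixed 0.5 threshold, a pattern similar in kind to the RVO vs. RAO result; whether the same mechanism explains the reported numbers is not stated in the abstract.

```python
def binary_metrics(y_true, y_score, threshold=0.5):
    """Accuracy, sensitivity, and specificity at a fixed score threshold.

    Assumes labels are 0/1 and that both classes are present.
    """
    tp = fp = tn = fn = 0
    for truth, score in zip(y_true, y_score):
        predicted = 1 if score >= threshold else 0
        if truth == 1 and predicted == 1:
            tp += 1
        elif truth == 1:
            fn += 1
        elif predicted == 1:
            fp += 1
        else:
            tn += 1
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


def auroc(y_true, y_score):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy validation set (illustrative only): 4 positives, 5 negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.65, 0.55, 0.5, 0.3, 0.2]
metrics = binary_metrics(y_true, y_score)
area = auroc(y_true, y_score)
print(metrics, area)
```

Because AUROC summarizes ranking quality over all thresholds, a classifier can rank cases well and still have poor specificity at one particular operating point.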
Poster ID: 146
Title: Detection of Pancreatic Cancer Using Multimodal Learning on Electronic Health Records
Authors: Mosbah Aouad, Anirudh Choudhary, Awais Farooq, Steven W Nevers, Lusine Demirkhanyan, Bhrandon Harris, Suguna Pappu, Christopher S Gondi, Ravi Iyer
Poster Session B
Abstract:
Pancreatic ductal adenocarcinoma (PDAC) is one of the deadliest cancers, and early detection remains a major clinical challenge due to the absence of specific symptoms and reliable biomarkers. In this work, we propose a new multimodal approach that integrates longitudinal diagnosis code histories and routinely collected laboratory measurements from electronic health records to detect PDAC up to one year prior to clinical diagnosis. Our method combines Neural Controlled Differential Equations to model irregular lab time series, pretrained language models and recurrent networks to learn diagnosis code trajectory representations, and cross-attention mechanisms to capture interactions between the two modalities. We develop and evaluate our approach on a real-world dataset of nearly 4,700 patients and achieve significant improvements in AUC ranging from 6.5% to 15.5% over state-of-the-art methods. Furthermore, our model identifies diagnosis codes and laboratory panels associated with elevated PDAC risk, including both established and new biomarkers.
Poster ID: 101
Title: Clinicians' Voice: Fundamental Considerations for XAI in Healthcare
Authors: Tabea Elina Röber, Rob Goedhart, Ilker Birbil
Poster Session B
Abstract:
Explainable AI (XAI) holds the promise of advancing the implementation and adoption of AI-based tools in practice, especially in high-stakes environments like healthcare. However, most of the current research lacks input from end users, and therefore its practical value is limited. To address this, we conducted semi-structured interviews with clinicians to discuss their thoughts, hopes, and concerns. Clinicians from our sample generally think positively about developing AI-based tools for clinical practice, but they have concerns about how these tools will fit into their workflow and how they will impact clinician-patient relations. We further identify training of clinicians on AI as a crucial factor for the success of AI in healthcare and highlight aspects clinicians are looking for in (X)AI-based tools. In contrast to other studies, we take on a holistic and exploratory perspective to identify general requirements for (X)AI products for healthcare before moving on to testing specific tools.
Poster ID: 39
Title: Uncertainty-Aware Prediction of Parkinson's Disease Medication Needs: A Two-Stage Conformal Prediction Approach
Authors: Ricardo Diaz-Rincon, Muxuan Liang, Adolfo Ramirez-Zamora, Benjamin Shickel
Poster Session B
Abstract:
Parkinson's Disease (PD) medication management presents unique challenges due to heterogeneous disease progression, symptoms, and treatment response. Neurologists must balance symptom control with optimal dopaminergic medication dosing based on functional disability while minimizing risks of side effects. This balance is crucial as inadequate or abrupt changes can cause levodopa-induced dyskinesia (LID), wearing off, and neuropsychiatric side effects, significantly reducing quality of life. Current approaches rely on trial-and-error decision-making without systematic predictive methods. Despite machine learning advances in medication forecasting, clinical adoption remains limited due to reliance on point predictions that do not account for prediction uncertainty, undermining clinical trust and utility. To facilitate trust, clinicians require not only predictions of future medication needs but also reliable confidence measures. Without quantified uncertainty, medication adjustments risk premature escalation to maximum doses or prolonged periods of inadequate symptom control. To address this challenge, we developed a conformal prediction framework anticipating medication needs up to two years in advance with reliable prediction intervals and statistical guarantees. Our approach addresses zero-inflation in PD inpatient data, where patients maintain stable medication regimens between visits. Using electronic health records data from 631 inpatient admissions at XYZ (2011-2021), our novel two-stage approach identifies patients likely to need medication changes, then predicts required levodopa equivalent daily dose adjustments. Our framework achieved marginal coverage while significantly reducing prediction interval lengths compared to traditional approaches, providing precise predictions for short-term planning and appropriately wider ranges for long-term forecasting, matching the increasing uncertainty in extended projections. 
By quantifying uncertainty in medication needs, our approach enables evidence-based decisions about levodopa dosing and medication adjustments, potentially optimizing symptom control while minimizing side effects and improving patients' quality of life.
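A generic two-stage interval construction in the spirit of this abstract can be sketched in Python as follows. Stage 1 uses a probability that the patient needs any medication change; stage 2 applies standard split conformal prediction to the predicted dose adjustment. The function names, the 0.5 gating threshold, the zero-width interval for predicted non-changers, and the residual values are all assumptions for illustration; the abstract does not describe how the authors combine the two stages, and the usual coverage guarantee here applies only to the stage 2 interval under exchangeable calibration data.

```python
from math import ceil


def conformal_quantile(residuals, alpha=0.1):
    """Finite-sample (1 - alpha) quantile of absolute calibration residuals."""
    n = len(residuals)
    rank = ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")  # too few calibration points for this alpha
    return sorted(abs(r) for r in residuals)[rank - 1]


def two_stage_interval(p_change, point_pred, q, change_threshold=0.5):
    """Return (lower, upper) for the dose adjustment.

    p_change: stage 1 probability that any medication change is needed
    point_pred: stage 2 predicted dose adjustment
    q: conformal quantile from held-out calibration residuals
    """
    if p_change < change_threshold:
        return (0.0, 0.0)  # assumption: predicted non-changers get "no change"
    return (point_pred - q, point_pred + q)


# Hypothetical calibration residuals (true minus predicted adjustment, in mg).
calibration_residuals = [10, -20, 30, -40, 50, 60, -70, 80, 90]
q = conformal_quantile(calibration_residuals, alpha=0.2)
print(q)
print(two_stage_interval(0.9, 100.0, q))
print(two_stage_interval(0.1, 100.0, q))
```

Handling the many "no change" visits in a separate first stage is one way to keep zero-inflated data from widening the intervals for patients who do need an adjustment.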
Poster ID: 16
Title: Stage-Aware Event-Based Modeling (SA-EBM) for Disease Progression
Authors: Hongtao Hao, Vivek Prabhakaran, Veena A Nair, Nagesh Adluru, Joseph Austerweil
Poster Session B
Abstract:
As diseases progress, the number of cognitive and biological biomarkers they impact increases. By formulating probabilistic models with this basic assumption, Event-Based Models (EBMs) enable researchers to discover the progression of a disease, which makes earlier diagnosis and effective clinical interventions possible. We build on prior EBMs with two major improvements: (1) dynamic estimation of healthy and pathological biomarker distributions, and (2) explicit modeling of the distribution of disease stages. We tested existing approaches and our novel approach on a benchmark of 9,000 synthetic datasets inspired by real-world data. We found that our stage-aware EBM (SA-EBM) significantly outperforms prior methods, such as Gaussian Mixture Model (GMM) EBM, Kernel Density Estimation EBM and Discriminative EBM, on ordering and staging tasks.
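The basic EBM assumption can be made concrete with a small Python sketch. Given an ordering of biomarkers, a participant at stage k is assumed to have the first k biomarkers drawn from a pathological distribution and the rest from a healthy one; combining that likelihood with an explicit prior over stages gives a posterior over the participant's stage. The Gaussian distributions, parameter values, and function names are hypothetical, and this illustrates the general idea only, not the SA-EBM estimation procedure.

```python
from math import exp, pi, sqrt


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std sigma."""
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))


def stage_posterior(measurements, ordering, healthy, diseased, stage_prior):
    """Posterior over stages 0..N for one participant.

    measurements: {biomarker: value}
    ordering: biomarkers in their assumed order of becoming abnormal
    healthy, diseased: {biomarker: (mu, sigma)}
    stage_prior: prior probabilities for stages 0..len(ordering)
    """
    joint = []
    for k, prior in enumerate(stage_prior):
        likelihood = 1.0
        for position, biomarker in enumerate(ordering):
            # The first k biomarkers in the ordering are affected at stage k.
            mu, sigma = diseased[biomarker] if position < k else healthy[biomarker]
            likelihood *= normal_pdf(measurements[biomarker], mu, sigma)
        joint.append(prior * likelihood)
    total = sum(joint)
    return [j / total for j in joint]


# Illustrative parameters only.
ordering = ["A", "B", "C"]
healthy = {b: (0.0, 1.0) for b in ordering}
diseased = {b: (3.0, 1.0) for b in ordering}
stage_prior = [0.25, 0.25, 0.25, 0.25]  # a stage-aware model would learn this
measurements = {"A": 3.1, "B": 2.8, "C": 0.2}

posterior = stage_posterior(measurements, ordering, healthy, diseased, stage_prior)
print(posterior)
```

In this toy example biomarkers A and B look pathological while C looks healthy, so most of the posterior mass falls on stage 2.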
Poster ID: 6
Title: TrajSurv: Learning Continuous Latent Trajectories from Electronic Health Records for Trustworthy Survival Prediction
Authors: Sihang Zeng, Lucas Jing Liu, Jun Wen, Meliha Yetisgen, Ruth Etzioni, Gang Luo
Poster Session B
Abstract:
Trustworthy survival prediction is essential for clinical decision making. Longitudinal electronic health records (EHRs) provide a uniquely powerful opportunity for the prediction. However, it is challenging to accurately model the continuous clinical progression of patients underlying the irregularly sampled clinical features and to transparently link the progression to survival outcomes. To address these challenges, we develop TrajSurv, a model that learns continuous latent trajectories from longitudinal EHR data for trustworthy survival prediction. TrajSurv employs a neural controlled differential equation (NCDE) to extract continuous-time latent states from the irregularly sampled data, forming continuous latent trajectories. To ensure the latent trajectories reflect the clinical progression, TrajSurv aligns the latent state space with patient state space through a time-aware contrastive learning approach. To transparently link clinical progression to the survival outcome, TrajSurv uses latent trajectories in a two-step divide-and-conquer interpretation process. First, it explains how the changes in clinical features translate into the latent trajectory's evolution using a learned vector field. Second, it clusters these latent trajectories to identify key clinical progression patterns associated with different survival outcomes. Evaluations on two real-world medical datasets, MIMIC-III and eICU, show TrajSurv's competitive accuracy and superior transparency over existing deep learning methods.
Poster ID: 26
Title: Enhancing Adaptive Behavioral Interventions with LLM Inference from Participant Described States
Authors: Karine Karine, Benjamin M. Marlin
Poster Session B
Abstract:
The use of reinforcement learning (RL) methods to support health behavior change via personalized and just-in-time adaptive interventions is of significant interest to health and behavioral science researchers focused on problems such as smoking cessation support and physical activity promotion. However, RL methods are often applied to these domains using a small collection of context variables to mitigate the significant data scarcity issues that arise from practical limitations on the design of adaptive intervention trials.
In this paper, we explore an approach to significantly expanding the state space of an adaptive intervention without impacting data efficiency. The proposed approach enables intervention participants to provide natural language descriptions of aspects of their current state. It then leverages inference with pre-trained large language models (LLMs) to better align the policy of a base RL method with these state descriptions. To evaluate our method, we develop a novel physical activity intervention simulation environment that generates text-based state descriptions conditioned on latent state variables using an auxiliary LLM.
We show that this approach has the potential to significantly improve the performance of online policy learning methods.
Poster ID: 76
Title: Development & Validation of a Machine Learning Model That Uses Voice to Detect Aspiration Risk
Authors: Cyril Varghese, Visar Berisha
Poster Session B
Abstract:
Background: Aspiration causes or aggravates a variety of respiratory diseases, including but not limited to pneumonia, acute respiratory distress syndrome, bronchiectasis, and pulmonary fibrosis. Subjective bedside evaluations of aspiration by nursing staff, the most widely used screening method to rule out aspiration, are limited by poor sensitivity and poor inter- and intra-rater reliability. Furthermore, current gold-standard diagnostic tests to detect aspiration (namely Videofluoroscopic Swallow Studies (VFSS) and Fiber Endoscopic Evaluation of Swallow (FEES)) expose the patient to radiation and are invasive and uncomfortable. They are also healthcare resource intensive, needing specialized equipment and expertise from speech and language pathologists (SLPs), otolaryngologists, radiologists, and others for proper interpretation.
Research Question: To develop and validate a novel machine learning algorithm that can analyze voice features to predict aspiration risk. The hypothesis is that aspiration over time causes changes to the vocal folds that can alter voice in subtle ways that can be detected by quantitative voice analytics.
Methods: Recorded ['i'] phonations during routine nasal endoscopy from 163 unique patients were extracted, tagged and retrospectively analyzed for acoustic features including pitch, jitter, shimmer, harmonic to noise ratio (HNR), and others. Supervised machine learning (ML) through a neural additive model was performed on the sustained ['i'] phonations of those with high risk of aspiration versus those with low risk of aspiration. The ML model that was developed was then independently tested by analyzing voice clips collected at an external medical center. Video fluoroscopic swallow study (VFSS) test was the gold standard for true aspiration risk classification.
Results: The mean ML risk score was 0.528 ± 0.248 for subjects with high aspiration risk and 0.252 ± 0.241 for those with low aspiration risk. This difference was significant (95% CI: 0.2122-0.3408; p<0.001). In the development cohort, the model showed an area under the ROC curve (AUC) of 0.76 (0.67-0.84), with a specificity of 0.76 and an F1 score of 0.63. The performance of the model in an external testing cohort was comparable, with an AUC of 0.70 (0.52-0.89), a specificity of 0.81, and an F1 score of 0.67.
Conclusion: Subjects with high aspiration risk have quantifiable differences in voice characteristics compared with those with low aspiration risk. These differences are detected by an ML model trained to analyze sustained phonation and tested on an independent cohort. The continued development of such a technology will impact the early identification of aspiration risk in a variety of clinical settings including ICUs, hospital wards, ambulatory clinics, and remote monitoring. Early identification of aspiration risk could potentially impact the mitigation and treatment of acute and chronic pulmonary diseases.
Poster ID: 177
Title: Does Domain-Specific Retrieval Augmented Generation Help LLMs Answer Consumer Health Questions?
Authors: Chase M Fensore, Rodrigo M Carrillo-Larco, Megha Shah, Joyce C. Ho
Poster Session B
Abstract:
While large language models (LLMs) have shown impressive performance on medical benchmarks, there remains uncertainty about whether retrieval-augmented generation (RAG) meaningfully improves their ability to answer consumer health questions. In this study, we systematically evaluate vanilla LLMs against RAG-enhanced approaches using the NIDDK portion of the MedQuAD dataset. We compare four open-source LLMs in both vanilla and RAG configurations, assessing performance through automated metrics, LLM-based evaluation, and clinical validation. Surprisingly, we find that vanilla LLM approaches consistently outperform RAG variants across both quantitative metrics (BLEU, ROUGE, BERTScore) and qualitative assessments. The relatively low retrieval performance (Precision@5 = 0.15) highlights fundamental challenges in implementing effective RAG systems for medical question-answering, even with carefully curated questions. While RAG showed competitive performance in specific areas like scientific consensus and harm reduction, our findings suggest that successful implementation of RAG for consumer health question-answering requires more sophisticated approaches than simple retrieval and prompt engineering. These results contribute to the ongoing discussion about the role of retrieval augmentation in medical AI systems and highlight the need for medical-specific RAG infrastructure to enhance medical question-answering systems.
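The retrieval metric quoted above, Precision@5, can be illustrated with a minimal sketch. The document IDs and relevance judgments below are hypothetical and are not taken from the study; the function names are illustrative only.

```python
def precision_at_k(retrieved, relevant, k=5):
    """Fraction of the top-k retrieved items that are judged relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = retrieved[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    return hits / k


def mean_precision_at_k(runs, k=5):
    """Average Precision@k over (retrieved, relevant) pairs, one per query."""
    scores = [precision_at_k(retrieved, relevant, k) for retrieved, relevant in runs]
    return sum(scores) / len(scores)


# Hypothetical single query: only one of the top five passages is relevant.
retrieved = ["d7", "d2", "d9", "d4", "d1", "d3"]
relevant = {"d2", "d8"}
score = precision_at_k(retrieved, relevant, k=5)
print(score)  # 0.2
```

Under this definition, a Precision@5 of 0.15 means that, on average, fewer than one of the five retrieved passages per question was relevant.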
Poster ID: 29
Title: INSIGHT: Explainable Weakly-Supervised Medical Image Analysis
Authors: Wenbo Zhang, Junyu Chen, Christopher Kanan
Poster Session B
Abstract:
Due to their large sizes, volumetric scans and whole-slide pathology images (WSIs) are often processed by extracting embeddings from local regions, after which an aggregator makes predictions from this set of embeddings. However, current methods require post-hoc visualization techniques (e.g., Grad-CAM) and often fail to localize small yet clinically crucial details. To address these limitations, we introduce INSIGHT, a novel weakly-supervised aggregator that integrates heatmap generation as an inductive bias. Starting from pre-trained feature maps, INSIGHT employs a detection module with small convolutional kernels to capture fine details and a context module with a broader receptive field to suppress local false positives. The resulting internal heatmap highlights diagnostically relevant regions. On CT and WSI benchmarks, INSIGHT achieves state-of-the-art classification results and high weakly-labeled semantic segmentation performance.
Poster ID: 197
Title: MedArabiQ: Benchmarking Large Language Models on Arabic Medical Tasks
Authors: Mouath Abu Daoud, Chaimae Abouzahir, Leen Kharouf, Walid Al-Eisawi, Farah E. Shamout, Nizar Habash
Poster Session B
Abstract:
Large Language Models (LLMs) have demonstrated significant promise for various applications in healthcare. However, their effectiveness in the Arabic medical domain remains unexplored due to the lack of high-quality domain-specific datasets and benchmarks. This study introduces MedArabiQ, a new benchmark dataset consisting of seven Arabic medical tasks, covering multiple specialties and including multiple choice questions, fill-in-the-blank, and patient-doctor question answering. We first constructed the dataset using past medical exams and publicly available datasets. We then introduced different modifications to evaluate various LLM capabilities, including bias mitigation.
We conducted an extensive evaluation with seven state-of-the-art open-source and proprietary LLMs, including GPT-4o, Claude 3.5-Sonnet, and Gemini 1.5. Our findings highlight the need for the creation of new high-quality benchmarks that span different languages to ensure fair deployment and scalability of LLMs in healthcare. By establishing this benchmark and releasing the dataset, we provide a foundation for future research aimed at evaluating and enhancing the multilingual capabilities of LLMs for the equitable use of generative AI in healthcare.
Poster ID: 53
Title: Identifying Clinical Subtypes of Kidney Transplant Recipients with Associated Outcomes Using Non-Negative Tensor Decomposition Approach
Authors: Meng Zhao, Yilu Fang, Jordan Gabriela Nestor
Poster Session B
Abstract:
Background. Kidney transplantation (KT) remains the preferred treatment for patients with kidney failure, offering significant survival benefits and improved quality of life compared to long-term dialysis. However, KT recipients are clinically heterogeneous, exhibiting varied short- and long-term outcomes. While previous studies have incorporated donor characteristics, recipient profiles, and/or procedural variables to predict post-transplant outcomes, such detailed donor and surgical documentation may not always be available in real-world clinical datasets. This study focuses solely on recipient-level predictors derived from structured electronic health record (EHR) data. Leveraging a non-negative tensor decomposition approach, we model the complex interactions within high-dimensional structured EHR data to identify clinically meaningful subtypes of KT recipients that are associated with distinct outcomes.
Methods. We used the A Medical Center Observational Medical Outcomes Partnership (OMOP) EHR database. From this database, we identified 3,214 adult KT recipients who underwent their first kidney transplant between 2004 and 2019 and had documented condition and medication records either prior to or during their transplant hospitalization. To improve data quality and reduce noise, we excluded extremely common medications and drug ingredients (e.g., sodium chloride and glucose water) due to their ubiquity across patients. Additionally, conditions and medications with occurrence rates below 0.1% and 0.05%, respectively, were excluded. We also removed genitourinary-related conditions (e.g., renal failure syndrome and end-stage renal disease), as all recipients are assumed to have underlying genitourinary disease. After these exclusions, 294 condition codes and 178 medication codes remained and were used to construct a three-dimensional tensor of size 3,214×294×178, representing the co-occurrences of patients, conditions, and medications. We then applied a non-negative tensor decomposition model to factorize the tensor data into three mode-specific matrices: the patient mode matrix, the condition mode matrix, and the medication mode matrix. The decomposition was performed using a predefined tensor rank, corresponding to the number of subtypes. Each subtype was characterized by a specific set of conditions and medications (derived from the respective mode matrices), with each patient associated with one or more subtypes. The resulting patient mode matrix was subsequently used to train a classification model for post-transplant outcome prediction. Model performance was evaluated using accuracy, F1-score, precision, and recall. The outcomes of interest were graft survival, graft failure (defined as the need for dialysis and/or kidney re-transplantation), and all-cause mortality.
These outcomes were assessed over two short-term (30-day and 90-day) and five long-term (1-year, 3-year, 5-year, 7-year and 10-year) periods. Only patients with at least one follow-up visit were included. This study was approved by the A Institutional Review Board (IRB protocol number: IRB-AAAU5631).
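The factorization step described in the Methods can be sketched as follows. This is a minimal illustration, assuming a CP-style (rank-R) non-negative decomposition fitted with multiplicative updates; the abstract does not state which algorithm or library was used. The toy tensor, the function name `nonnegative_cp`, and all parameter values are hypothetical.

```python
import numpy as np


def nonnegative_cp(X, rank, n_iter=200, eps=1e-9, seed=0):
    """Factorize a non-negative 3-way tensor X (I x J x K) into non-negative
    matrices A (I x rank), B (J x rank), C (K x rank) using multiplicative
    updates that reduce the squared reconstruction error."""
    rng = np.random.default_rng(seed)
    I, J, K = X.shape
    A = rng.random((I, rank))
    B = rng.random((J, rank))
    C = rng.random((K, rank))
    for _ in range(n_iter):
        # Each update multiplies a factor by (numerator / denominator),
        # which keeps every entry non-negative.
        A *= np.einsum("ijk,jr,kr->ir", X, B, C) / (A @ ((B.T @ B) * (C.T @ C)) + eps)
        B *= np.einsum("ijk,ir,kr->jr", X, A, C) / (B @ ((A.T @ A) * (C.T @ C)) + eps)
        C *= np.einsum("ijk,ir,jr->kr", X, A, B) / (C @ ((A.T @ A) * (B.T @ B)) + eps)
    return A, B, C


# Toy stand-in for a patients x conditions x medications tensor (sizes are made up).
rng = np.random.default_rng(1)
A0, B0, C0 = rng.random((20, 4)), rng.random((8, 4)), rng.random((6, 4))
X = np.einsum("ir,jr,kr->ijk", A0, B0, C0)

A, B, C = nonnegative_cp(X, rank=4)  # rank plays the role of the number of subtypes
X_hat = np.einsum("ir,jr,kr->ijk", A, B, C)
rel_error = np.linalg.norm(X - X_hat) / np.linalg.norm(X)

# Rows of A give each patient's loading on every subtype; the largest loading
# is one simple way to pick a dominant subtype per patient.
dominant_subtype = A.argmax(axis=1)
```

In this sketch the patient-mode matrix `A` is what would be passed on to a downstream classifier, while the columns of `B` and `C` indicate which conditions and medications characterize each subtype.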
Results. Among the 3,214 adult KT recipients included in the analysis, the mean age was 52.1 years (±14.0), and 2,012 individuals (62.6%) were male. The racial distribution was as follows: 1,610 (50.1%) White, 423 (13.2%) Black, 123 (3.8%) Asian, 22 (0.7%) Other, and 1,036 (32.2%) with unknown race. Four distinct subtypes were identified within the cohort based on condition-medication patterns. Subtype 1 (Gastrointestinal Diseases) included 877 patients (27.3%) and had the highest graft survival and lowest graft failure rates across all follow-up time points. Subtype 2 (Cardiovascular Diseases) comprised 340 patients (10.6%) and had the highest proportion of male recipients (69.6%). Subtype 3 (Hypertension and Anemia) was the largest group, consisting of 1,617 patients (50.3%). This group exhibited the highest rates of short-term adverse outcomes as well as the highest graft failure rate (e.g., 31% at the 10-year point). Subtype 4 (Respiratory Diseases) comprised 380 patients (11.8%) and was associated with the highest long-term all-cause mortality. The classification results demonstrated that overall accuracy, F1-score, precision, and recall exceeded 70% across most time points, indicating a strong association between condition-medication sets and transplant outcomes.
Conclusion. This study seeks to advance personalized transplant care by identifying clinically distinct subpopulations of KT recipients that may benefit from tailored interventions and precision management strategies. We employ an unsupervised machine learning method, effectively addressing the challenge of high dimensionality in structured EHR data. Our method shows how complex clinical datasets can be transformed into practical insights to better identify at-risk patients.
Poster ID: 79
Title: Equitable Electronic Health Record Prediction with FAME: Fairness-Aware Multimodal Embedding
Authors: Nikkie Hooman, Zhongjie Wu, Eric C. Larson, Mehak Gupta
Poster Session B
Abstract:
Electronic Health Record (EHR) data encompasses diverse modalities—text, images, and medical codes—that are vital for clinical decision-making. To process these complex data, multimodal AI (MAI) has emerged as a powerful approach for fusing such information. However, most existing MAI models optimize for better prediction performance, potentially reinforcing biases across patient subgroups. Although bias reduction techniques for multimodal models have been proposed, the individual strengths of each modality and their interplay in both reducing bias and optimizing performance remain underexplored. In this work, we introduce FAME (Fairness-Aware Multimodal Embeddings), a framework that explicitly weights each modality according to its fairness contribution. FAME optimizes both performance and fairness by incorporating a combined loss function. We leverage the Error Distribution Disparity Index (EDDI) to measure fairness across subgroups and propose an RMS-based (root mean square) aggregation method to balance fairness across subgroups, ensuring equitable model outcomes. We evaluate FAME with BEHRT and BioClinicalBERT, combining structured and unstructured EHR data, and demonstrate its effectiveness in terms of performance and fairness compared to other baselines across multiple EHR prediction tasks.
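The RMS-based aggregation mentioned above can be sketched in a few lines. The abstract does not give the EDDI formula, so the per-subgroup disparity used here (subgroup error rate minus overall error rate, normalized by the larger of the overall error rate and its complement) is an assumption made for illustration, as are the function names and the toy labels.

```python
import math


def subgroup_disparities(y_true, y_pred, groups):
    """Per-subgroup error disparity (assumed form, see lead-in):
    (subgroup error rate - overall error rate) / max(overall, 1 - overall)."""
    errors = [int(t != p) for t, p in zip(y_true, y_pred)]
    overall = sum(errors) / len(errors)
    norm = max(overall, 1 - overall)
    disparities = {}
    for g in sorted(set(groups)):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        rate = sum(errors[i] for i in idx) / len(idx)
        disparities[g] = (rate - overall) / norm
    return disparities


def rms(values):
    """Root mean square: large deviations in any one subgroup are penalized
    more than they would be under a plain average of absolute values."""
    values = list(values)
    return math.sqrt(sum(v * v for v in values) / len(values))


# Hypothetical predictions for two subgroups, "a" and "b".
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

per_group = subgroup_disparities(y_true, y_pred, groups)
fairness_score = rms(per_group.values())
```

A score of zero would mean every subgroup has the same error rate as the whole cohort; in a combined loss, a term like `fairness_score` would be added to the prediction loss with a weighting factor.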
Poster Session C
Poster ID: 138
Title: Large Language Models for Patient Portal Message Classification
Authors: Henry P Foote, Kartik Pejavara, Michael Gao, Srijan Bhasin, Jay Swayambunathan, Suresh Balu
Poster Session C
Abstract:
Background
Increased EHR message volume has put an additional burden on providers and has been linked to burnout. At Hospital A endocrine clinics, all patient portal messages are reviewed by a single nurse triage team that routes them to their appropriate locations. However, routing messages takes time away from providing direct patient care. Thus, automating message classification would reduce the burden on these nurses and allow for additional time with patients. Large language models (LLMs) exhibit an exceptional ability to understand natural language, combining attention mechanisms and neural networks. Models in the range of one million to 13 billion parameters are often small enough to run on commercially available computing resources and achieve suitable performance on classification tasks. In this study, we fine-tune four locally hosted LLMs for automatic classification of endocrine patient portal messages. Local fine-tuning and inference ensure message privacy.
Methods
We collected 31,044 patient portal messages from August 2020 to August 2023 that were routed by the Hospital A endocrine triage team. Each message was labeled as 1 of 10 classes, each corresponding to the group or person to which the message was eventually routed. Due to the nature of the messages the clinics receive, two classes constitute the majority of the messages: Provider Message (73.0%) and Triage (14.2%). The eight remaining classes each constitute between 0.4% and 3.6% of the data. Four open-source LLMs were fine-tuned for this classification task: BERT, LLaMA-2-7B, LLaMA-2-13B, and Meditron-7B, a variant of LLaMA-2-7B fine-tuned on a select corpus of medical texts. We used quantized low-rank adaptation (QLoRA) to reduce the amount of GPU memory needed to train the LLaMA-2 and Meditron models.
Results
Table 1: Fine-tuning results for each model by class.
Model | Overall accuracy | Macro-averaged precision | Macro-averaged recall | Macro-averaged F1
BERT | 0.761 | 0.521 | 0.355 | 0.339
LLaMA-2-7B | 0.748 | 0.544 | 0.380 | 0.389
Meditron-7B | 0.748 | 0.451 | 0.384 | 0.393
LLaMA-2-13B | 0.775 | 0.525 | 0.400 | 0.435
Despite having a significantly smaller number of parameters, BERT achieved a similar overall accuracy to the other models. However, since a large proportion of the message dataset belongs to the Provider Message class (73.0%), a model that simply labels every message as a Provider Message could achieve accuracy close to that of these models. The macro-averaged statistics show that when class-wise performance is weighted equally, LLaMA-2-7B achieves the best precision and LLaMA-2-13B achieves the best recall and F1, indicating that the 7B and 13B models might be better at predicting the classes that make up a smaller proportion of the data. Class-wise results support this: while BERT achieves better precision on Provider Messages (91.5%) than LLaMA-2-7B (87.1%) and Meditron-7B (88.0%), LLaMA-2-7B, Meditron-7B, and LLaMA-2-13B achieve higher precision than BERT in all of the classes that make up <2% of the data, including Clinical Staff (13.0%, 8.7%, and 11.7% vs. 5.9%, respectively), Patient Assistance (0.0%, 15.4%, and 52.9% vs. 0.0%, respectively), and Financial Counseling (31.3%, 37.5%, and 28.6% vs. 0.0%, respectively).
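The point about the majority class can be checked with a small worked example. The sketch below builds a hypothetical 1,000-message label set that mirrors the reported proportions (73.0% Provider Message, 14.2% Triage, and eight small classes, here set to 1.6% each as an assumption) and scores a baseline that always predicts the majority class. The class names other than the two reported ones are placeholders.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_metrics(y_true, y_pred):
    """Macro-averaged precision, recall, and F1: compute each metric per class,
    then take the unweighted mean so every class counts equally."""
    classes = sorted(set(y_true))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0  # undefined cases scored as 0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Hypothetical label distribution mirroring the reported class proportions.
y_true = ["Provider Message"] * 730 + ["Triage"] * 142
for i in range(8):
    y_true += [f"small_class_{i}"] * 16  # placeholder names, 1.6% each

# Baseline that ignores the message text entirely.
y_pred = ["Provider Message"] * len(y_true)

baseline_accuracy = accuracy(y_true, y_pred)
macro_precision, macro_recall, macro_f1 = macro_metrics(y_true, y_pred)
```

On this toy data the baseline reaches 0.73 accuracy while its macro-averaged recall is only 0.10, which is why the macro-averaged columns are the more informative comparison for the small classes.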
Conclusion
Model selection did not significantly impact the overall accuracy on the test set. However, overall accuracy is dominated by performance in the Provider Message class. When each class is considered separately, 7B+-parameter models achieve better precision in more classes than BERT, especially in the smallest classes. Because of their large number of parameters, the 7B+-parameter models might be able to retain more information about the small classes, despite reduced parameter precision using QLoRA. Further, the fine-tuning of 7B and 13B parameter LLMs could be performed on the same compute resources as the full-parameter fine-tuning of BERT, indicating the effectiveness of QLoRA. The imbalanced data presents a limitation, as the models may not have had enough data from the smaller classes to train on. Problems in healthcare are constantly evolving, and systems adjust their administrative processes to address them. Ultimately, the right choice of model depends on the classes in which the user prefers better performance. We demonstrate that LLMs are able to route messages accurately, potentially saving time. Future work will involve integrating these models into practice and measuring their effects on reducing burnout and load on providers.
Poster ID: 90
Title: Stepwise Fine and Gray: Subject-Specific Variable Selection Shows When Hemodynamic Data Improves Prognostication of Comatose Post-Cardiac Arrest Patients
Authors: Xiaobin Shen, Jonathan Elmer, George H. Chen
Poster Session C
Abstract:
Prognostication for comatose post‐cardiac arrest patients is a critical challenge that directly impacts clinical decision-making in the ICU. Clinical information that informs prognostication is collected serially over time. Shortly after cardiac arrest, various time-invariant baseline features are collected (e.g., demographics, cardiac arrest characteristics). After ICU admission, additional features are gathered, including time-varying hemodynamic data (e.g., blood pressure, doses of vasopressor medications). We view these as two phases in which we collect new features. In this study, we propose a novel stepwise dynamic competing risks model that improves the prediction of neurological outcomes by automatically determining when to take advantage of time-invariant features (first phase) and time-varying features (second phase). A key finding is that it is not always beneficial to use all features (first and second phase) for prediction. Notably, our model finds patients for whom this second phase (time-varying hemodynamic) information is beneficial for prognostication and also *when* this information is beneficial (as we collect more hemodynamic data for a patient over time, how important these data are for prognostication varies). Our approach extends the standard Fine and Gray model to explicitly model the two phases and to incorporate neural networks to flexibly capture complex nonlinear feature relationships. Evaluated on a retrospective cohort of 2,278 comatose post-arrest patients, our model demonstrates robust discriminative performance for the competing outcomes of awakening, withdrawal of life-sustaining therapy, and death despite maximal support. Subgroup analyses based on the motor component of the FOUR score reveal that patients with severe neurological dysfunction receive minimal additional prognostic benefit from hemodynamic data, whereas those with moderate-to-mild impairment derive significant incremental risk information. 
These findings underscore the potential of dynamic risk modeling for enhancing prognostication. Our approach generalizes to more than two phases in which new features are collected and could be used in other dynamic prediction tasks, where it may be helpful to know when and for whom newly collected features significantly improve prediction.
Poster ID: 157
Title: An Open-Source Approach for Democratizing Local Validation of AI Solutions
Authors: Mark Sendak, Anusha Prakash, Sena Kpodzro, Marshall Nichols, Shems Saleh, Suresh Balu
Poster Session C
Abstract:
Background:
Local validation of AI models in healthcare is essential for ensuring appropriate performance across local and diverse patient populations, as compared to the originally trained patient cohort. However, this requires substantial technical expertise and resources that many health delivery organizations (HDOs) lack. Prior to the release of the first open-source local validation tool from a leading EHR vendor, most validation tools—such as Evidently AI, PyCaret, and MLflow—were general-purpose frameworks. These lacked critical healthcare-specific features like demographic fairness, lead-time analysis, clinical relevance, and alignment with emerging health AI regulations. By democratizing access to standardized validation methods, the tool provides an avenue for widespread adoption of best practices in healthcare AI evaluation using the local patient population.
Objectives:
This study has two objectives: 1) to examine the open-source tool’s core functionalities for local validation, using an unplanned all-cause hospital readmissions model as a use case, and 2) to benchmark its capabilities against existing validation tools to assess its comparative advantages for local validation.
Methods:
We used the open-source tool to validate a hospital readmissions model through an eight-step process (see Figure 1) reflecting the standard, resource-intensive workflow for local model validation in healthcare. For each step, we documented how the tool supported or automated it and the resulting outputs. The validation cohort included 156,344 inpatient admissions at a large academic medical center (AMC) from 08/01/2020 through 01/01/2025 with a 13.4% readmission rate. The model, trained on more than 275,000 inpatient encounters, predicts unplanned 30-day readmissions using clinical predictors like diagnoses, demographics, comorbidities, labs, and medications. We also included demographics, comorbidities, social determinants (e.g., insurance coverage), and outcomes like 30-day mortality. Model performance was evaluated with standard metrics (e.g., AUROC, AUPRC, sensitivity, specificity, PPV, NPV) and fairness was assessed by comparing performance across different population subgroups.
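The subgroup comparison described in the Methods can be sketched generically. The code below is not the open-source tool's API; it is a plain-Python illustration, with hypothetical labels and subgroup names, of how threshold-based metrics (sensitivity, specificity, PPV, NPV) can be stratified by a demographic attribute.

```python
def confusion_metrics(y_true, y_pred):
    """Sensitivity, specificity, PPV, and NPV for binary labels (1 = readmitted)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    def ratio(num, den):
        return num / den if den else float("nan")  # undefined when a cell is empty

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


def metrics_by_subgroup(y_true, y_pred, groups):
    """Compute the same metrics separately within each subgroup."""
    results = {}
    for g in sorted(set(groups)):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        results[g] = confusion_metrics([y_true[i] for i in idx], [y_pred[i] for i in idx])
    return results


# Hypothetical outcomes and thresholded predictions for two subgroups.
y_true = [1, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0, 0, 1]
groups = ["x", "x", "x", "x", "y", "y", "y", "y"]

by_group = metrics_by_subgroup(y_true, y_pred, groups)
```

Large gaps between subgroups in any of these metrics would be the kind of disparity a fairness review looks for; threshold-free metrics such as AUROC would be stratified in the same way.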
Results:
At AMCs, validating an AI model, whether developed in-house or from a trusted vendor, typically takes several weeks to 12 months and involves a skilled interdisciplinary team of clinicians, data scientists, engineers, and health system leadership. In this study, we assessed an open-source tool, integrated into existing systems, which reduced the validation time for a hospital readmission model by 40%. This efficiency was enabled by the tool’s robust features and alignment with standardized protocols, allowing thorough evaluation across fairness, appropriateness, validity, effectiveness, and safety. Data cleaning and feature assessments, typically done manually, were accelerated through automated reports highlighting missingness and class imbalance. Model performance and fairness evaluations were simplified through built-in visualizations and metrics stratified by subgroups, enabling quicker identification of disparities. The model achieved an AUPR of 0.22 and an AUROC of 0.67 on the local patient population, comparable to the developed model’s 0.22 and 0.73, respectively. The analysis was repeated to check for fairness and safety associated with biases in the data related to social determinants such as age, race, and gender but did not reveal any concerning discrepancies. Additional calibration tools helped select local thresholds aligned with clinical priorities. Across steps, the tool reduced time and technical burden, illustrating both the demands of local validation and the value of open-source tools in democratizing access to best practices.
Conclusion:
While some HDOs are making progress establishing AI product lifecycle management programs to ensure the safe, effective, and equitable use of AI, many lack resources and expertise to validate any given AI model for their local population. The outcomes of this study contribute to improved validation practices, ensuring that AI models are accurately evaluated and optimized for local healthcare settings, while helping HDOs comply with emerging healthcare AI regulations. As an institution with over a decade of experience developing, validating, and integrating AI into clinical practice, we find that this tool highlights that validation depends not just on algorithms, but also on aligning people, processes, and technology. Open-source tools like these accelerate AI model validation in clinical settings by simplifying data sharing and fostering a global community committed to safe, effective AI use in healthcare. However, with an increasingly complex AI landscape that now includes generative models like LLMs, it becomes essential to continually update the scope of such tools, leaving no model and health system behind.
Poster ID: 45
Title: The CLASS Project: A Proof-of-Concept Machine Learning-Driven Complexity Level Algorithm for Surgical Scheduling in Mohs Micrographic Surgery
Authors: Jorge A. Rios-Duarte, Heather D. Hardway, Nahid Y. Vidal
Poster Session C
Abstract:
Background.
Mohs micrographic surgery (MMS) is the standard of care for high-risk skin cancers. However, it is a time-consuming procedure for which optimizing workflows is a priority. Current systems to assist MMS scheduling range from random scheduling to more structured machine learning (ML) models. However, limited generalizability across institutions is a central concern. The aims of our research were 1) to train ML models using information available in pathology reports at time of scheduling for the prediction of number of stages and reconstruction complexity for MMS, and 2) to generate sets of fixed rules based on ML modeling and human-derived knowledge to grade MMS case complexity and to generate case scheduling recommendations.
Methods.
MMS case notes from 2018 to 2023 were retrieved. As it was not possible to perfectly match the case notes with their respective pathology report, a fuzzy algorithm was used to find the best match. Data were extracted using fixed programming rules. A random sample of 200 cases was used to evaluate our MMS notes-pathology reports matching and data extraction by comparing their results to manual human matching and extraction. The final dataset was divided into a training/validation (18,473 cases) and a test set (2,053 cases). Two variables were selected for prediction. The first variable was reconstruction complexity as binary (complex vs. non-complex) and the second was number of stages as categorical (one vs. two vs. three or more, multiclass). Four variables available in pathology reports were used as predictors: sex, age, anatomical location, and histopathological diagnosis. Models were trained in Python 3, using the package “H2O” and the “automl” function. A maximum of 200 models were trained, and early stopping was used for modeling and hyperparameter search. Neural networks and stacked ensembles were excluded. Training weights were used to manage dataset imbalance and the “logloss” was used for model training and selection. 10-CV was used for training, hyperparameter search, and model selection. The explainability module of the “H2O” package was used to estimate feature importance and individual conditional expectation (ICE) plots.
Parallel to model training, a human-level analysis was performed by our health systems engineering team to evaluate the overall MMS workflow at our institution and identify potential areas of improvement. We combined the input of our human-level analysis with the insights provided by our ML models to develop two sets of fixed rules to optimize our MMS workflow: one set grades MMS case complexity based on case characteristics, and the other provides scheduling recommendations based on case complexity.
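The note-to-report matching step described in the Methods can be illustrated with a minimal sketch. The abstract does not say which fuzzy algorithm was used; the example below assumes a character-level similarity ratio from Python's standard-library `difflib`, and the example strings, the `best_match` helper, and the 0.6 threshold are all hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity in [0, 1] based on matching character blocks."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(note, reports, threshold=0.6):
    """Return (report, score) for the most similar report, or (None, score)
    when even the best candidate falls below the threshold."""
    if not reports:
        return None, 0.0
    score, report = max((similarity(note, r), r) for r in reports)
    return (report, score) if score >= threshold else (None, score)


# Hypothetical case-note text and candidate pathology report descriptions.
reports = [
    "Basal cell carcinoma, left nasal ala",
    "Squamous cell carcinoma, right forearm",
    "Melanoma in situ, scalp",
]
matched, score = best_match("basal cell carcinoma left nasal ala", reports)
unmatched, low_score = best_match("zzzz", reports)
```

A threshold like this leaves poorly matching notes unlinked rather than forcing a pairing, which is the kind of behavior a manual audit sample can then check.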
Results.
Our sample included 20,523 MMS notes, belonging to 17,061 patients that underwent MMS at six practices (14 surgeons were included). From the total number of patients, 37.3% (n=6,366) were female and 98.8% (n=16,862) were white. Regarding MMS stages and reconstruction, 6.1% (n=1,247) of the cases had three or more stages, while 18.2% (n=3,735) needed a complex reconstruction (defined as graft, flap, and/or referred closure). Our data preprocessing pipeline achieved high accuracy rates when compared to a ground truth of human manual preprocessing (accuracy > 90%) on a random sample of 200 cases.
The model with the highest performance for prediction of number of MMS stages was a gradient boosting machine (GBM), with an AUCROC (OVO, macro-average) of 0.62 ± 0.01 on 10-CV. When evaluated on the test set, the model displayed an AUCROC of 0.63. Similarly, the best model for prediction of complex reconstruction was a GBM with an AUCROC of 0.84 ± on 10-CV. On the test set, the model achieved an AUCROC of 0.83. The feature with the highest importance for both models was anatomical location. ICE plots showed that certain anatomical locations (e.g., lip, nose, eyelid) influence models’ predictions towards higher number of stages and complex reconstruction.
After analyzing the insights provided by the health systems engineering team and our ML modeling process, MMS case complexity was divided into four grades, with grade I displaying the lowest complexity and grade IV the highest (e.g., MMS for extramammary Paget’s disease). Similarly, a scheduling recommendation system was generated based on MMS case complexity, ranging from one block of time at any time of the day for grade I to three blocks of time for grade IV.
Conclusion.
Our models trained on data from various practices/surgeons hold potential for generalization. However, as our human analysis was specific to our practice, grading and scheduling recommendations might not generalize well to other MMS centers. Despite that, our project demonstrated the power of a mixed approach using ML and human-derived knowledge to develop systems for optimization of MMS workflow and will serve as a base for other practices willing to implement a similar approach.
Poster ID: 152
Title: Optimizing Segmentation of Neonatal Brain MRIs with Partially Annotated Multi-Label Data
Authors: Dariia Kucheruk, Sam Osia, Pouria Mashouri, Elizaveta Rybnikova, Sergey Protserov, Jaryd Hunter, Maksym Muzychenko, Jessie Ting Guo, Michael Brudno
Poster Session C
Abstract:
Accurate assessment of the developing brain is important for research and clinical applications, and manual segmentation of brain MRIs is a painstaking and expensive process. We introduce the first method for neonatal brain MRI segmentation that simultaneously leverages fully and partially labeled data within a multi-label segmentation framework. Our method improves accuracy and efficiency by utilizing all available supervision—even when only coarse or incomplete annotations are present—enabling the model to learn both detailed and high-level brain structures from heterogeneous data. We validate our method on scans from the Developing Human Connectome Project (dHCP) acquired at both preterm and term gestational ages. Our approach demonstrates more accurate and robust segmentation compared to standard supervised and semi-supervised models trained with equivalent data. The results showed an improvement in predictions of predominantly unannotated labels in the training set when combined with labels of relevant "super-classes". Further experiments with semi-supervised loss functions demonstrated that limited but reliable supervision is more effective than using noisy labels. Our work presents evidence that it is possible to build robust medical image segmentation models with only a small amount of fully labeled training data.
Poster ID: 115
Title: The Geometry of Queries: Query-Based Innovations in Retrieval-Augmented Generation for Healthcare QA
Authors: Eric Yang, Jonathan Amar, Jong Ha Lee, Bhawesh Kumar, Yugang Jia
Poster Session C
Abstract:
Deploying Large Language Models (LLMs) for healthcare question answering requires robust methods to ensure accuracy and reliability. This work introduces Query-Based Retrieval Augmented Generation (QB-RAG), a framework for enhancing Retrieval-Augmented Generation (RAG) systems in healthcare question-answering by pre-aligning user queries with a database of curated, answerable questions derived from healthcare content. A key component of QB-RAG is an LLM-based filtering mechanism that ensures that only relevant and answerable questions are included in the database, enabling reliable reference query generation at scale. We establish a theoretical foundation for QB-RAG, provide a comparative analysis of existing retrieval enhancement techniques, and introduce a generalizable, comprehensive evaluation framework that assesses both the retrieval effectiveness and the quality of the generated response based on faithfulness, relevance, and adherence to the guideline. Our empirical evaluation on a healthcare data set demonstrates the superior performance of QB-RAG compared to existing retrieval methods, highlighting its practical value in building trustworthy digital health applications for health question-answering.
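The core retrieval idea, matching an incoming user query against stored answerable questions rather than against raw content, can be illustrated with a small sketch. It makes strong simplifying assumptions: a bag-of-words cosine similarity stands in for whatever embedding model QB-RAG uses, the LLM-based filtering step is omitted, and the question bank, document IDs, and function names are hypothetical.

```python
import math
import re
from collections import Counter


def bow(text):
    """Lowercased bag-of-words vector; a stand-in for a learned embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two Counter vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical bank: each curated, answerable question points at the
# content chunk it was generated from.
QUESTION_BANK = [
    ("What foods should I avoid with high blood pressure?", "doc_hypertension_diet"),
    ("How often should I check my blood sugar?", "doc_glucose_monitoring"),
    ("What are common side effects of metformin?", "doc_metformin"),
]


def retrieve(user_query, bank=QUESTION_BANK, top_k=1):
    """Return IDs of the content linked to the most similar stored questions."""
    q = bow(user_query)
    scored = sorted(
        ((cosine(q, bow(question)), doc_id) for question, doc_id in bank),
        reverse=True,
    )
    return [doc_id for score, doc_id in scored[:top_k] if score > 0]


print(retrieve("how often do I need to check blood sugar"))
```

Because queries are compared with questions, both sides of the similarity are phrased the same way. This is the intuition behind pre-aligning queries with a question database, though the sketch does not reproduce the paper's actual pipeline.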
Poster ID: 100
Title: ProtoECGNet: Case-Based Interpretable Deep Learning for Multi-Label ECG Classification with Contrastive Learning
Authors: Sahil Sethi, David Chen, Thomas Statchen, Michael C. Burkhart, Nipun Bhandari, Bashar Ramadan, Brett Beaulieu-Jones
Poster Session C
Abstract:
Deep learning-based electrocardiogram (ECG) classification has shown impressive performance but clinical adoption has been slowed by the lack of transparent and faithful explanations. Post hoc methods such as saliency maps may fail to reflect a model’s true decision process. Prototype-based reasoning offers a more transparent alternative by grounding decisions in similarity to learned representations of real ECG segments—enabling faithful, case-based explanations. We introduce ProtoECGNet, a prototype-based deep learning model for interpretable, multi-label ECG classification. ProtoECGNet employs a structured, multi-branch architecture that reflects clinical interpretation workflows: it integrates a 1D CNN with global prototypes for rhythm classification, a 2D CNN with time-localized prototypes for morphology-based reasoning, and a 2D CNN with global prototypes for diffuse abnormalities. Each branch is trained with a prototype loss designed for multi-label learning, combining clustering, separation, diversity, and a novel contrastive loss that encourages appropriate separation between prototypes of unrelated classes while allowing clustering for frequently co-occurring diagnoses. We evaluate ProtoECGNet on all 71 diagnostic labels from the PTB-XL dataset, demonstrating competitive performance relative to state-of-the-art black-box models while providing structured, case-based explanations. To assess prototype quality, we conduct a structured clinician review of the final model’s projected prototypes, finding that they are rated as representative and clear. ProtoECGNet shows that prototype learning can be effectively scaled to complex, multi-label time-series classification, offering a practical path toward transparent and trustworthy deep learning models for clinical decision support.
Poster ID: 182
Title: Improving Out-of-distribution Human Activity Recognition via IMU-Video Cross-modal Representation Learning
Authors: Seyyed Saeid Cheshmi, Buyao Lyu, Thomas Lisko, Rajesh Rajamani, Robert A. McGovern, Yogatheesan Varatharajah
Poster Session C
Abstract:
Human Activity Recognition (HAR) based on wearable inertial sensors plays a critical role in remote health monitoring. In patients with movement disorders, the ability to detect abnormal patient movements in their home environments can enable continuous optimization of treatments and help alert caretakers as needed. Machine learning approaches have been proposed for HAR tasks using Inertial Measurement Unit (IMU) data; however, most rely on application-specific labels and lack generalizability to data collected in different environments or populations. To address this limitation, we propose a new cross-modal self-supervised pretraining approach to learn representations from large-scale unlabeled IMU-video data and demonstrate improved generalizability in HAR tasks on out-of-distribution (OOD) IMU datasets, including a dataset collected from patients with Parkinson's disease. Specifically, our results indicate that the proposed cross-modal pretraining approach outperforms the current state-of-the-art IMU-video pretraining approach and IMU-only pretraining under zero-shot and few-shot evaluations. Broadly, our study provides evidence that in highly dynamic data modalities, such as IMU signals, cross-modal pretraining may be a useful tool to learn generalizable data representations.
Poster ID: 188
Title: ScaffoldGPT: A Scaffold-based GPT Model for Drug Optimization
Authors: Xuefeng Liu, Songhao Jiang, Ian Foster, Jinbo Xu, Rick L. Stevens
Poster Session C
Abstract:
Drug optimization has become increasingly crucial in light of fast-mutating virus strains and drug-resistant cancer cells. Nevertheless, it remains challenging as it necessitates retaining the beneficial properties of the original drug while simultaneously enhancing desired attributes beyond its scope. In this work, we aim to tackle this challenge by introducing ScaffoldGPT, a novel Generative Pretrained Transformer (GPT) designed for drug optimization based on molecular scaffolds. Our work comprises three key components: (1) A three-stage drug optimization approach that integrates pretraining, finetuning, and decoding optimization. (2) A uniquely designed two-phase incremental training approach for pre-training the drug optimization GPT on molecule scaffold with enhanced performance. (3) A token-level decoding optimization strategy, Top-N, that enables controlled, reward-guided generation using pretrained/finetuned GPT. We demonstrate via a comprehensive evaluation on COVID and cancer benchmarks that ScaffoldGPT outperforms the competing baselines in drug optimization benchmarks, while excelling in preserving original functional scaffold and enhancing desired properties.
Poster ID: 201
Title: Learning to Call: A Field Trial of a Collaborative Bandit Algorithm for Improved Message Delivery in Mobile Maternal Health
Authors: Mizhaan Prajit Maniyar, Karthikeyan Shanmugam, Arun Suggala, Arpan Dasgupta, Aparna Taneja, Milind Tambe
Poster Session C
Abstract:
Mobile health (mHealth) programs use automated voice messages to deliver health information, particularly to underserved communities, and have demonstrated that mobile technology can improve health outcomes through increased awareness and behavioral change. India's Kilkari program delivers vital maternal health information via weekly voice calls to millions of mothers. However, the current random call scheduling often results in missed calls and reduced message delivery. This study presents a field trial of a collaborative bandit algorithm designed to optimize call timing by learning individual mothers' preferred call times. We deployed the algorithm with ~6500 Kilkari participants as a pilot study, comparing its performance to the baseline random calling approach. Our results demonstrate a statistically significant improvement in call pickup rates with the bandit algorithm, indicating its potential to enhance message delivery and impact millions of mothers across India. This research highlights the efficacy of personalized scheduling in mobile health interventions and underscores the potential of machine learning to improve maternal health outreach at scale.
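To make the notion of learning a preferred call time concrete, the sketch below implements a plain per-participant epsilon-greedy bandit over candidate time slots. This is a deliberately simpler technique than the collaborative bandit used in the trial, which shares information across mothers. The class name `SlotBandit`, the number of slots, and the smoothing choice are all assumptions made for illustration.

```python
import random


class SlotBandit:
    """Per-participant epsilon-greedy bandit over candidate call slots.

    Not the collaborative algorithm from the study; a simplified stand-in.
    """

    def __init__(self, n_slots, epsilon=0.1, seed=0):
        self.calls = [0] * n_slots    # calls placed in each slot
        self.pickups = [0] * n_slots  # calls answered in each slot
        self.epsilon = epsilon
        self.rng = random.Random(seed)

    def pickup_rate(self, slot):
        # Laplace smoothing: an untried slot starts at an estimate of 0.5.
        return (self.pickups[slot] + 1) / (self.calls[slot] + 2)

    def best_slot(self):
        return max(range(len(self.calls)), key=self.pickup_rate)

    def choose(self):
        # Explore a random slot with probability epsilon, else exploit.
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.calls))
        return self.best_slot()

    def update(self, slot, answered):
        self.calls[slot] += 1
        self.pickups[slot] += int(answered)


# Hypothetical history: three missed morning calls, three answered afternoon calls.
bandit = SlotBandit(n_slots=3)
for _ in range(3):
    bandit.update(0, answered=False)
    bandit.update(1, answered=True)
print(bandit.best_slot(), bandit.pickup_rate(1))
```

A per-participant bandit like this one needs several calls per slot before its estimates are useful. A collaborative approach addresses that by pooling evidence across participants with similar behavior.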
Poster ID: 107
Title: MedPatch: Confidence-Guided Multi-Stage Fusion for Multimodal Clinical Data
Authors: Baraa Al Jorf, Farah E. Shamout
Poster Session C
Abstract:
Clinical decision-making relies on the integration of information across various data modalities, such as clinical time-series, medical images and textual reports. Compared to other domains, real-world medical data is heterogeneous in nature, limited in size, and sparse due to missing modalities. This significantly limits model performance in clinical prediction tasks. Inspired by clinical workflows, we introduce MedPatch, a multi-stage multimodal fusion architecture, which seamlessly integrates multiple modalities via confidence-guided patching. MedPatch comprises three main components: (i) a multi-stage fusion strategy that leverages joint and late fusion simultaneously, (ii) a missingness-aware module that handles sparse samples with missing modalities, (iii) a joint fusion module that clusters latent token patches based on calibrated unimodal token-level confidence. We evaluated MedPatch using real-world data consisting of clinical time-series data, chest X-ray images, radiology reports, and discharge notes extracted from the MIMIC-IV, MIMIC-CXR, and MIMIC-Notes datasets on two benchmark tasks, namely in-hospital mortality prediction and clinical condition classification. Compared to existing baselines, MedPatch achieves state-of-the-art performance. Our work highlights the effectiveness of confidence-guided multi-stage fusion in addressing the heterogeneity of multimodal data, and establishes new state-of-the-art benchmark results for clinical prediction tasks.
Poster ID: 173
Title: Practice Patient: Improving Clinical Skills with LLMs
Authors: Srijan Bhasin, Jay Swayambunathan, Noah Prizant, Kartik Pejavara, Gabriel Yapuncich, Nancy Weigle, James Fox
Poster Session C
Abstract:
Background.
Health professions students often lack consistent, low-stakes opportunities to practice patient interviews in a structured way. Training this core clinical skill in the pre-clinical curriculum frequently involves simulated patient (SP) actors which can be constrained by time, access to and cost of SPs, and variability in faculty feedback. This is especially true when it comes to practicing sensitive or emotionally charged conversations: discussions about terminal diagnoses, mental and sexual health, and cultural or personal beliefs.
Methods.
We developed a generative AI-based web application termed Practice Patient designed to help students practice (1) history-taking and patient-centered communication skills and (2) real-time formation of differential diagnoses. By implementing audio-to-audio interactions using LiveKit, a platform integrating with OpenAI’s GPT-4o Realtime model, we developed a tool that allows students to engage in conversations with an AI-powered SP. Clinical scenarios drawn from a repository of SP scripts created by faculty leaders of a pre-clinical longitudinal physician skills course at the University A School of Medicine were integrated into a larger prompt designed to guide the LLM through a realistic conversation. Students were provided with real-time written feedback by routing student-patient conversation transcripts through a separate LLM designed to assess the interaction. The feedback rubric was adapted from that used by University A’s SP actors. Individual student feedback reports can be stored to track student progress over several practice sessions. Multiple users can interact with the application simultaneously.
We piloted the web application with first-year medical students as part of an exercise to practice writing three different progress note types. In order to obtain the necessary historical information, students interviewed small group faculty instructors for two of the scenarios and the AI SP for the third. After students were provided with details of the physical examination, they wrote progress notes to document the clinical encounters, including their assessments and plans. Students were asked to voluntarily complete a Qualtrics survey for feedback on the tool, its use compared to interviewing faculty instructors, and their willingness to continue to use similar tools within their medical school curriculum.
Results.
Sixty-one students completed the anonymous survey (51% response rate). Initial responses indicate the following:
(1) 94% of students strongly agree or agree that virtual patient experiences may help improve their history-taking abilities
(2) 94% of students strongly agree or agree that virtual patient experiences may help improve their diagnosis abilities
(3) 79% of students strongly agree or agree that they would recommend the tool to other health professions students
(4) 75% of students strongly agree or agree that the tool answered questions in a realistic manner
(5) 58% of students strongly agree or agree that the tool helped prepare them for real-world patient encounters
Qualitative survey analyses and analyses of the automated feedback compared with feedback given to students during their end-of-year Objective Structured Clinical Examinations (OSCEs) are ongoing.
Conclusion.
Preliminary findings suggest Practice Patient and similar AI tools are promising avenues to address the need for greater numbers of simulated patient interactions for health professional students. Our tool’s scalable architecture and adaptable case design and feedback structure potentially enable implementation across institutions and educational tracks.
Poster ID: 73
Title: Empowering Healthcare Safety Net Organizations with AI: The Hub-and-Spoke Approach
Authors: Mark Sendak, Ciera Thomas, Jee Young Kim, Alifia Hasan, Sena Kpodzro, Freya Gulamali, William Ratliff, Varoon Mathur, Suresh Balu
Poster Session C
Abstract:
Background
AI adoption and lifecycle management have been the focus of growing research in academic medical centers (AMCs) and centers of excellence (COEs) in the U.S.1-3 However, few studies have investigated AI uptake in safety net organizations (SNOs) (i.e., Federally Qualified Health Centers (FQHCs) and community hospitals), even though SNOs serve over 31 million patients annually.4-6 Understanding the opportunities and challenges of AI adoption in these settings is essential since SNOs may have the most to gain from this technology.7
To address these gaps, we launched a 12-month technical assistance program, the Practice Network, to support SNOs with AI adoption and lifecycle management. Through this program, we aimed to (1) guide SNOs through AI adoption and lifecycle management using best practices, (2) understand high-impact AI use cases, and (3) identify and address challenges in AI implementation.
Methods
The Practice Network program uses a hub-and-spoke model, engaging various stakeholders in the AI healthcare ecosystem.8-9 SNOs function as ‘spokes,’ receiving technical assistance, while AMCs and COEs with experience in AI adoption serve as ‘hubs,’ providing guidance. The Coordinating Center selected four FQHCs and one community hospital from across the U.S. as spokes. Several hub organizations were recruited from the Coordinating Center’s consortium to share AI expertise and insights with spoke sites (see Table 1).
The program commenced with structured focus groups led by the Coordinating Center to understand each spoke organization’s problem, proposed AI solution, and context. Detailed AI adoption and implementation plans for each spoke were created from data gathered in these focus groups. Spoke and hub organizations then began engaging in key program activities, facilitated through connections made by the Coordinating Center and following these plans (see Table 1).
Results
Initial findings show that SNO spokes share multiple challenges with AMCs and COEs, including limited vendor transparency, complex decision-making on communication of AI use, and mitigating predictive biases to ensure equitable care delivery.
Spokes also encounter challenges distinct from those experienced by AMCs and COEs, including securing sustainable funding for electronic health record (EHR) upgrades and AI tools, limited leverage in vendor contract negotiations, and reliance on health center-controlled networks for tool integration in the EHR. Lack of specialized internal expertise is a recurring barrier that limits the capacity of spokes to create AI governance systems, tailor AI implementation to their specific contexts, build systems for local monitoring, and track the evolving legal landscape. These constraints put SNOs at a significant disadvantage and limit their ability to safely recognize the benefits AI technologies can offer.
The Practice Network program addresses many of these challenges. Project plans developed from focus group data provide a tailored plan for each site’s specific challenge and contextual considerations, including non-English speaking patient populations, funding constraints, lack of formalized AI governance or change management processes, and limited expertise for evaluating vendor-developed solutions. Office hours and mentor touch points address expertise gaps in spoke sites, providing strategic guidance from real-world efforts to utilize similar AI solutions. Monthly cohort discussions allow for facilitated conversations on AI lifecycle challenges and enable spokes to learn about innovative approaches others have taken to overcome such barriers.
Discussion
The hub-and-spoke model utilized by the Practice Network program offers a scalable approach to providing technical assistance and expertise needed to safely, effectively, and equitably implement AI solutions in SNOs given their resource constraints. By centralizing specialized knowledge and distributing it through structured mentorship and technical assistance, this model allows SNOs seeking to implement AI tools to access resources and expertise they would otherwise struggle to acquire.
While the program provides initial insights into AI adoption and implementation processes in SNOs, further research on AI adoption in these organizational contexts is essential. The resource constraints, unique patient populations, and organizational structures of SNOs necessitate a tailored approach to AI adoption and lifecycle management that differs from that utilized by AMCs and COEs. The inaugural cohort of the Practice Network provides technical assistance to five spoke sites; given that there are thousands of SNOs across the country, scaling this model of hub-and-spoke technical assistance is critical to provide adequate support.10 Intentional investments in these areas are essential to prevent a widening digital divide in health technology and ensure equitable outcomes for the diverse patient populations served by these organizations.
Poster ID: 145
Title: The Impact of Image Resolution on Biomedical Multimodal Large Language Models
Authors: Liangyu Chen, James Burgess, Jeffrey J Nirschl, Orr Zohar, Serena Yeung-Levy
Poster Session C
Abstract:
Imaging technologies are fundamental to biomedical research and modern medicine, requiring analysis of high-resolution images across various modalities. While multimodal large language models (MLLMs) show promise for biomedical image analysis, most are designed for low-resolution images from general-purpose datasets, risking critical information loss. We investigate how image resolution affects MLLM performance in biomedical applications and demonstrate that: (1) native-resolution training and inference significantly improve performance across multiple tasks, (2) misalignment between training and inference resolutions severely degrades performance, and (3) mixed-resolution training effectively mitigates misalignment and balances computational constraints with performance requirements. Based on these findings, we recommend prioritizing native-resolution inference and mixed-resolution datasets to optimize biomedical MLLMs for transformative impact in scientific research and clinical applications.
Poster ID: 181
Title: LEAVES: Learning Views for Time-Series Biobehavioral Data in Contrastive Learning
Authors: Han Yu, Huiyuan Yang, Akane Sano
Poster Session C
Abstract:
Contrastive learning has been utilized as a promising self-supervised learning approach to extract meaningful representations from unlabeled data. The majority of these methods take advantage of data-augmentation techniques to create diverse views from the original input. However, optimizing augmentations and their parameters for generating more effective views in contrastive learning frameworks is often resource-intensive and time-consuming. While several strategies have been proposed for automatically generating new views in computer vision, research in other domains, such as time-series biobehavioral data, remains limited. In this paper, we introduce a simple yet powerful module for automatic view generation in contrastive learning frameworks applied to time-series biobehavioral data, which is essential for modern health care, termed **lea**rning **v**i**e**ws for time-**s**eries data (LEAVES). This proposed module employs adversarial training to learn augmentation hyperparameters within contrastive learning frameworks. We assess the efficacy of our method on multiple time-series datasets using two well-known contrastive learning frameworks, namely SimCLR and BYOL. Across four diverse biobehavioral datasets, LEAVES requires only ~20 learnable parameters, dramatically fewer than the ~580,000 parameters demanded by frameworks like ViewMaker (a previously proposed adversarially trained convolutional module for contrastive learning), while achieving competitive and often superior performance to existing baseline methods. Crucially, these efficiency gains are obtained without extensive manual hyperparameter tuning, which makes LEAVES particularly suitable for large-scale or real-time healthcare applications that demand both accuracy and practicality.
Poster ID: 13
Title: Predicting the Predictable in the Psychiatric High Risk
Authors: Eric Strobl
Poster Session C
Abstract:
Most investigators in precision psychiatry force models to predict clinically meaningful but ultimately predefined outcomes in a high-risk population. We instead advocate for an alternative approach: let the data reveal which symptoms are predictable with high accuracy and then assess whether those predictable symptoms warrant early intervention. We correspondingly introduce the Sparse Canonical Outcome REgression (SCORE) algorithm, which combines items from clinical rating scales into severity scores that maximize predictability across time. Our findings show that this simple shift in perspective significantly boosts prognostic accuracy, uncovering predictable symptom profiles such as social difficulties and stress-paranoia from those at clinical high risk for psychosis, and social passivity from infants at genetic high risk for autism. The predictable scores differ markedly from conventional clinical metrics and offer clinicians memorable, actionable insights even when full diagnostic criteria are unmet. An R implementation is available at https://anonymous.4open.science/r/SCORE-B06C.
Poster ID: 114
Title: Can interpretability and accuracy coexist in cancer survival analysis?
Authors: Piyush Borole, Tongjie Wang, Antonio Vergari, Ajitha Rajan
Poster Session C
Abstract:
Survival analysis refers to statistical procedures for analyzing data on the time until an event occurs, such as death in cancer patients. Traditionally, the linear Cox Proportional Hazards (CPH) model has been widely used due to its inherent interpretability. CPH models help identify key disease-associated factors (through feature weights), providing insights into patient risk of death. However, their reliance on linear assumptions limits their ability to capture the complex, non-linear relationships present in real-world data. To overcome this, more advanced models, such as neural networks, have been introduced, offering significantly improved predictive accuracy. However, these gains come at the expense of interpretability, which is essential for clinical trust and practical application. To address the trade-off between predictive accuracy and interpretability in survival analysis, we propose ConSurv, a concept bottleneck model that maintains state-of-the-art performance while providing transparent and interpretable insights. Using gene expression and clinical data from breast cancer patients, ConSurv captures complex feature interactions and predicts patient risk. By offering clear, biologically meaningful explanations for each prediction, ConSurv attempts to build trust among clinicians and researchers in using the model for informed decision-making.
Poster ID: 32
Title: PhenoRAG: Retrieval-Augmented Generation for Efficient Zero-Shot Phenotype Identification in Clinical Reports
Authors: Marc Berndt, Andrea Agostini, Beatrice Stocker, Maria Padrutt, Silvio Daniel Brugger, D Sean Froese, Daphné Chopard, Julia E Vogt
Poster Session C
Abstract:
Accurate extraction of phenotypic information from clinical narratives is essential in diagnostic medicine, yet mapping free-text reports to structured Human Phenotype Ontology (HPO) terms remains challenging. While encoder-only transformer models and small decoder-only generative models are attractive for clinical deployment due to their efficiency and low resource requirements, the former often fail to capture the rich context of clinical texts, and the latter struggle to process lengthy reports effectively. In contrast, larger language models excel at contextual understanding but are impractical for clinical use due to their size, propensity to hallucinate, and privacy concerns associated with non-local inference.
To overcome these challenges, we introduce PhenoRAG, a novel retrieval-augmented generation framework that leverages a synthetic database of contextually enriched sentences to augment a lightweight decoder-only model for accurate zero-shot phenotype identification. We demonstrate the capacity of PhenoRAG to capture nuanced contextual clues by 1) evaluating its ability to perform two clinically relevant tasks—guide rare disease diagnosis and facilitate urinary tract infection detection—and 2) validating its performance on a synthetic dataset designed to mimic the challenges of real clinical narratives. Experimental results demonstrate that our lightweight PhenoRAG framework achieves a higher F1-score than both encoder-only transformers and standalone small language models, driven primarily by its high recall. These findings underscore the potential of PhenoRAG as a ready-to-use clinical tool for phenotype identification.
Poster ID: 70
Title: Early Prediction of Postpartum Mood Disorders from Longitudinal Wearable Biometrics using deep learning and times series generative adversarial networks
Authors: Bonaventure F. P. Dossou, Mercy Nyamewaa Asiedu, Maja Mataric, Katherine A Heller, Belen Lafon, Nichole Young-Lin
Poster Session C
Abstract:
**Background.**
In the United States, 1 in 8 women experience postpartum depression, and most women (up to 80%) experience “baby blues”, identified by feelings of sadness, mood swings and anxiety. Postpartum depression and mood disorders have been linked to increased risk of suicide, which accounts for up to 20% of maternal deaths in high-income countries; mothers who experience perinatal depression are 3x more likely to exhibit suicidal behaviour than those who do not. These disorders can be treated after diagnosis through interventions such as support groups, psychotherapy and medication. However, they remain under-diagnosed, with a low treatment rate of 15%. While screening through surveys increases early detection, surveys tend to have high false positive and false negative rates, can be subjective, may not generalize across cultural differences in interpretation, focus on major depressive disorders (which may overlook minor mood disorders), can be time consuming, and can be problematic for individuals with limited reading skills. Additionally, there may not be sufficient postnatal visits to capture the day-to-day nuances or acute incidences of depression and mood disorders. To our knowledge, our work presents the first study that forecasts the occurrence of postpartum mood disorders from passively collected biometric wearable data recorded during pregnancy, enabling objective and passive early detection and triage of postpartum mood disorders.
**Methods.**
Data was obtained in a retrospective study from consented wearable activity tracker users, 21 years or older from June 2020-May 2022. Participants were located in the United States and Canada, and were new parents with a child under one year. Additionally participants filled out surveys where they indicated outcomes for postpartum mood disorder. We filtered the dataset to 1822 participants who had 80% wear time and daily use during pregnancy and 30 days before pregnancy. Using this data we generated features that take into consideration the pre-pregnancy baseline (difference and z-score between 30 day pre-pregnancy average and each daily data measured during pregnancy), resulting in 24 features - step count, heart rate variability, deep sleep minutes, light sleep minutes, REM sleep minutes, minutes awake while in bed, active zone minutes, resting heart rate and their respective difference and z-score outputs. There were 1822 participants with 224 who had postpartum depression, 359 with postpartum sadness, 717 with postpartum anxiety and 1002 with no mood disorders. We developed a time series generative adversarial network to create synthetic data for the under-represented positive samples enabling a balanced dataset and an end-to-end LSTM model with batch normalization, and hybrid loss (combining the focal and binary cross entropy losses) for predicting the mood disorders with a 60/20/20 split in train, hold out validation and hold out test set. Synthetic data generation was performed after the train-test split on only the training set to ensure the model was not exposed to data in the hold out test set. Finally we applied integrated gradient analysis on the LSTM to determine which features most contributed to model prediction.
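The baseline-relative feature construction described above (eight daily signals, each kept as a raw value plus a difference and a z-score against the 30-day pre-pregnancy average, giving 8 × 3 = 24 features) can be sketched as follows. The metric identifiers and the function name are hypothetical. Using the baseline window's standard deviation as the z-score denominator is also an assumption, since the abstract does not state which standard deviation was used.

```python
from statistics import mean, pstdev

# Hypothetical identifiers for the eight signals listed in the abstract.
METRICS = [
    "step_count", "hrv", "deep_sleep_min", "light_sleep_min",
    "rem_sleep_min", "awake_in_bed_min", "active_zone_min", "resting_hr",
]


def baseline_features(baseline_days, pregnancy_day):
    """Build raw, difference, and z-score features for one pregnancy day.

    baseline_days: list of dicts, one per pre-pregnancy day (30 in the study).
    pregnancy_day: dict of the same metrics for a single day in pregnancy.
    """
    features = {}
    for m in METRICS:
        values = [day[m] for day in baseline_days]
        mu, sigma = mean(values), pstdev(values)
        raw = pregnancy_day[m]
        features[m] = raw
        features[m + "_diff"] = raw - mu
        # Assumed denominator: spread of the baseline window itself.
        features[m + "_z"] = (raw - mu) / sigma if sigma > 0 else 0.0
    return features


# Toy data: baseline alternates between 8 and 12 (mean 10, std 2).
baseline = [{m: (8 if i % 2 == 0 else 12) for m in METRICS} for i in range(30)]
today = {m: 14 for m in METRICS}
feats = baseline_features(baseline, today)
print(len(feats), feats["resting_hr_diff"], feats["resting_hr_z"])
```

Expressing each day relative to the participant's own baseline lets a model compare participants whose absolute step counts or heart rates differ widely.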
**Results.**
With our best-performing model we achieve a 0.85 AUC and 0.70 F1 score for predicting anxiety, a 0.92 AUC and 0.75 F1 score for predicting sadness, and a 0.96 AUC and 0.84 F1 score for predicting postpartum depression. Feature importance from integrated gradients applied to the LSTM shows that sleep and active zone minutes are the most important predictors, in line with literature associating postpartum depression and other mood disorders with decreased activity and sleep. While our dataset is mostly Caucasian, subgroup analysis shows that accurate performance extends to other racial demographics. Model performance shows age-related disparities, with higher accuracy for ages 30-44 and lower accuracy for ages 18-29.
**Conclusion.**
Our study demonstrates the feasibility of using time series machine learning models to predict postpartum mood disorders from passively collected wearable data obtained during pregnancy. This enables early action on treatment and support while the patient is still consistently attending clinical visits, usually until 6 weeks postpartum. While our dataset comes from a mostly white population, we demonstrate on-par performance across racial subgroups. However, we find variability in performance across age subgroups, suggesting that the accuracy of predictive models based on wearables may be age dependent. Overall, this presents a useful tool for objective screening for mood disorders; however, translating this approach into a clinical tool would require further validation and a standard regulatory review process.
Poster ID: 154
Title: FactEHR: A Dataset for Evaluating Factuality in Clinical Notes Using LLMs
Authors: Monica Munnangi, Akshay Swaminathan, Jason Alan Fries, Jenelle A Jindal, Sanjana Narayanan, Ivan Lopez, Lucia Tu, Philip Chung, Jesutofunmi Omiye, Mehr Kashyap, Nigam Shah
Poster Session C
Abstract:
Verifying and attributing factual claims is essential for the safe and effective use of large language models (LLMs) in healthcare. A core component of factuality evaluation is fact decomposition, the process of breaking down complex clinical statements into fine-grained, atomic facts for verification. Recent work in the general domain has proposed fact decomposition, which uses LLMs to rewrite source text into concise sentences each conveying a single piece of information, as an approach for fine-grained fact verification. However, clinical documentation poses unique challenges for fact decomposition due to dense terminology and diverse note types, and it remains understudied. To address this gap and explore these challenges, we present FactEHR, an NLI dataset consisting of full-document fact decompositions for 2,168 clinical notes spanning four note types from three hospital systems, resulting in 987,266 entailment pairs. We assess the generated facts along different axes, from entailment evaluation of LLMs to a qualitative analysis. Our evaluation, including review by clinicians, highlights significant variability in the performance of LLMs for fact decomposition, from Gemini generating highly relevant and factually correct facts to Llama-3 generating fewer and inconsistent facts. The results underscore the need for better LLM capabilities to support factual verification in clinical text. To facilitate further research, we release anonymized code and plan to make the dataset available upon acceptance.
Poster ID: 97
Title: Evaluation of Multi-Agent LLMs in Multidisciplinary Team Decision-Making for Challenging Cancer Cases
Authors: Jaesik Kim, Byounghan Lee, Kyung-Ah Sohn, Dokyoon Kim, Young Chan Lee
Poster Session C
Abstract:
This study explores the potential of large language model (LLM) agents in real-world clinical decision-making, focusing on their alignment with human experts in cancer multidisciplinary team (MDT) meetings. While LLMs perform well on benchmark medical question-answering tasks, these evaluations often oversimplify the open-ended, multifaceted nature of actual clinical decisions. In practice, MDTs require balancing diverse expert opinions and multiple valid treatment options. Using real MDT meeting data, we compare different LLM approaches including single-agent and multi-agent systems to assess their ability to replicate consensus-based decisions. Our findings indicate that multi-agent, conversation-based systems, which assign specialized roles and facilitate dynamic inter-agent conversation, better approximate human expert decisions in our data. Overall, this work highlights the potential practical utility of LLM agents in complex clinical settings and lays the groundwork for their future integration as decision support tools in multidisciplinary medical contexts.
Poster ID: 166
Title: Patient-Specific Deep Reinforcement Learning for Automatic Replanning in Head-and-Neck Cancer Proton Therapy
Authors: Malvern Madondo, Yuan Shao, Yingzi Liu, Jun Zhou, Xiaofeng Yang, Zhen Tian
Poster Session C
Abstract:
Anatomical changes in head-and-neck cancer (HNC) patients during intensity-modulated proton therapy (IMPT) can shift the Bragg Peak of proton beams, risking tumor underdosing and organ-at-risk (OAR) overdosing. As a result, treatment replanning is often required to maintain clinically acceptable treatment quality. However, current manual replanning processes are often resource-intensive and time-consuming. In this work, we propose a patient-specific deep reinforcement learning (DRL) framework for automated IMPT replanning, with a reward-shaping mechanism based on a $150$-point plan quality score designed to handle the competing clinical objectives in radiotherapy planning. We formulate the planning process as an RL problem where agents learn high-dimensional control policies to adjust plan optimization priorities to maximize plan quality. Unlike population-based approaches, our framework trains personalized agents for each patient using their planning CT and augmented anatomies simulating anatomical changes (tumor progression and regression). This patient-specific approach leverages anatomical similarities along the treatment course, enabling effective plan adaptation. We implemented and compared two DRL algorithms, Deep Q-Network (DQN) and Proximal Policy Optimization (PPO), using dose-volume histograms (DVHs) as state representations and a $22$-dimensional action space of priority adjustments. Evaluation on five HNC patients using actual replanning CT data showed that both DRL agents improved initial plan scores from $120.63 \pm 21.40$ to $139.78 \pm 6.84$ (DQN) and $142.74 \pm 5.16$ (PPO), surpassing the replans manually generated by a human planner ($137.20 \pm 5.58$). Clinical validation confirms these improvements translate to better tumor coverage and OAR sparing across diverse anatomical changes. 
This work highlights the potential of DRL in addressing the geometric and dosimetric complexities of adaptive proton therapy, offering a promising solution for efficient offline adaptation and paving the way for online adaptive proton therapy.
Poster ID: 102
Title: The Effect of Multi-site Data Pooling on Critical Care Early Warning Systems
Authors: Allan Pang, Owen Ashby Johnson, Marc de Kamps, Dr Alwyn Kotzé
Poster Session C
Abstract:
Introduction
Open-source healthcare datasets, such as MIMIC-III (2001–2012) and MIMIC-IV (2008–2019), have long collection periods to amass large volumes of data for model development. Extended data collection periods can introduce temporal drift, as shifts in population characteristics, healthcare demands, and clinical practices over time may negatively impact model development. Constraining model training to contemporaneous data inherently ties usable data volume to a hospital’s size and clinical throughput. This inequity in data availability may exacerbate disparities in model performance between high-volume, resource-rich centres and smaller or lower-throughput institutions. Data sharing between hospitals can augment training datasets; however, pooling data across institutions may introduce downstream effects on model predictions, including shifts in output distributions and altered decision boundaries. Inpatient physiological early warning systems, such as NEWS2, have traditionally relied on rules-based frameworks that apply static thresholds to physiological values. More recent approaches have shifted towards data-driven methods that incorporate temporal dynamics through handcrafted temporal features or deep learning architectures such as recurrent neural networks. As these models increase in complexity and sensitivity to input patterns, they may become more susceptible to variability introduced by hospital-specific data distributions.
Methods
We develop and evaluate an early warning system using a Long Short-Term Memory architecture adapted to handle irregularly sampled time-series data. We train models to use temporal signals within 12 hours of death as a proxy for clinical deterioration. Training and validation are conducted using the eICU Collaborative Research Database, a multi-centre critical care dataset comprising 200,859 patient episodes from 208 hospitals collected over two years (2014–2015). We assess model performance based on its ability to predict death within the last 12 hours and to identify the presence of a major critical care intervention. These interventions include intubation, cardiopulmonary resuscitation, initiation of, or additional, inotropic or vasopressor support, significant volume resuscitation, and renal replacement therapy. To characterise model behaviour, we analyse the distribution of model output scores and quantify temporal volatility using the rolling standard deviation. To evaluate the impact of data pooling, we compare the performance of models trained on data from individual hospitals with that of a model trained on data pooled across all hospital sites. Additionally, we explore an intermediate approach by clustering hospitals based on similarities in patient population characteristics using hierarchical clustering, offering a less invasive alternative to full data pooling. All experiments are conducted using five-fold cross-validation.
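The volatility measure mentioned above can be sketched as follows. This is an illustrative example rather than the authors' code: it assumes evenly spaced model output scores for one patient episode, a hypothetical window of three time steps, and the population standard deviation; the function names and score values are invented for the example.

```python
from statistics import pstdev


def rolling_std(scores, window=3):
    """Population standard deviation over each run of `window` consecutive scores."""
    if window < 2 or window > len(scores):
        raise ValueError("window must be between 2 and len(scores)")
    return [
        pstdev(scores[i:i + window])
        for i in range(len(scores) - window + 1)
    ]


def mean_volatility(scores, window=3):
    """Average rolling standard deviation, a single volatility summary per episode."""
    values = rolling_std(scores, window)
    return sum(values) / len(values)


# Hypothetical risk scores from two models for the same episode.
stable_scores = [0.10, 0.11, 0.10, 0.12, 0.11]
jumpy_scores = [0.10, 0.30, 0.05, 0.40, 0.10]

stable_volatility = mean_volatility(stable_scores)
jumpy_volatility = mean_volatility(jumpy_scores)
```

A model whose outputs swing between consecutive time steps yields a larger value than one whose outputs drift slowly, which is the sense in which the measure captures output instability.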
Results
Pooling data from multiple sites leads to statistically significant improvements in model performance. Notable gains are observed in AUROC (0.8636 ± 0.0065 vs. 0.8294 ± 0.0099), Recall (0.5106 ± 0.0198 vs. 0.3105 ± 0.0214), Balanced Accuracy (0.7350 ± 0.0086 vs. 0.6422 ± 0.0094), and F1-Score (0.4442 ± 0.0183 vs. 0.3559 ± 0.0218). Additionally, pooled models demonstrated enhanced clinical utility, detecting a higher proportion of critical care interventions (23.39% ± 1.02% vs. 15.72% ± 1.32%). Similarly, models trained on clustered data achieved statistically significant improvements, with performance metrics closely matching those of fully pooled models: AUROC (0.8507 ± 0.0069), Recall (0.5027 ± 0.0213), Balanced Accuracy (0.7275 ± 0.0095), F1-Score (0.4145 ± 0.0199), and detection of critical care interventions (23.20% ± 0.66%). These findings suggest that clustering offers a viable, less intrusive alternative to full data pooling without substantially compromising performance. Temporal output variability was lowest in hospital-specific models (0.0043 ± 0.0002), followed by clustered models (0.0089 ± 0.0007), and highest in fully pooled models (0.0114 ± 0.0020). These findings suggest that increased data volume enhances predictive performance but also introduces greater instability in model outputs.
Conclusion
Pooling training data from multiple hospital sites can significantly enhance the performance of early warning systems. Importantly, similar performance can be achieved through patient population–based clustering, offering a pragmatic compromise that reduces the need for full data sharing. While pooled data improves predictive accuracy and the identification of clinical deterioration, it also introduces changes in model behaviour, most notably, increased output volatility, which may have clinically relevant implications. These findings underscore the importance of optimising model performance and understanding how available training data influence model behaviour.
Poster ID: 1
Title: Eye gaze analyses in everyday life settings for assessing cognitive development
Authors: Monami Nishio, Shoi Shi, Hiromu Yakura, Junichiro Shibata
Poster Session C
Abstract:
Assessing cognitive development in early childhood is crucial for early detection of developmental disorders. This study explores video-based eye tracking as a tool for evaluating children's cognitive development in everyday settings. Using the MPIIGaze model, we analyzed gaze stability in 26 children (40–65 months old) performing a hand pose imitation task. Concentration scores, derived from gaze variability, were compared with teacher-assessed developmental scores on the Enjoji Scale. A significant correlation was found with the “Life Skills” category (r = 0.421, P = 0.032), suggesting that gaze stability reflects higher-order cognitive functions. This accessible method offers potential for early screening and intervention.
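For readers unfamiliar with the statistic reported above, the following minimal sketch computes a Pearson correlation coefficient between two score lists. The function name and the toy values are hypothetical and are not the study's data; how the concentration scores themselves were derived from gaze variability is not specified in the abstract and is not reproduced here.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and the two sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(ss_x * ss_y)


# Toy values only: hypothetical concentration and developmental scores.
concentration = [0.2, 0.5, 0.4, 0.9, 0.7]
development = [48, 55, 50, 66, 58]
r = pearson_r(concentration, development)
```

The coefficient lies between -1 and 1; a significance test such as the reported P value would require an additional step not shown here.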