Synthetic Electronic Health Record (EHR) time-series generation is crucial
for advancing clinical machine learning models, as it helps address data
scarcity by providing more training data. However, most existing approaches
focus primarily on replicating statistical distributions and temporal
dependencies of real-world data. We argue that fidelity to observed data alone
does not guarantee better model performance, as common patterns may dominate,
limiting the representation of rare but important conditions. This highlights
the need to generate synthetic samples that improve the performance of specific
clinical models on their target outcomes. To address this, we propose
TarDiff, a novel target-oriented diffusion framework that integrates
task-specific influence guidance into the synthetic data generation process.
Unlike conventional approaches that mimic training data distributions, TarDiff
optimizes synthetic samples by quantifying their expected contribution to
improving downstream model performance through influence functions.
Specifically, we measure the reduction in task-specific loss induced by
synthetic samples and embed this influence gradient into the reverse diffusion
process, thereby steering the generation towards utility-optimized data.
Evaluated on six publicly available EHR datasets, TarDiff achieves
state-of-the-art performance, outperforming existing methods by up to 20.4% in
AUPRC and 18.4% in AUROC. Our results demonstrate that TarDiff not only
preserves temporal fidelity but also enhances downstream model performance,
offering a robust solution to data scarcity and class imbalance in healthcare
analytics.
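
The guidance idea described above can be illustrated with a small sketch. This is not the TarDiff implementation; it is a toy under stated assumptions: the downstream model is a logistic regression with weights `w`, each sample is a flat feature vector rather than a full time series, the influence function is approximated to first order (the inverse Hessian is replaced by the identity), and `toy_denoise` is a hypothetical stand-in for a learned reverse-diffusion step. All function names are illustrative.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def influence_score(x, y, w, g_val):
    """First-order influence proxy for one sample (x, y).

    A gradient step on (x, y) changes the validation loss by roughly
    -lr * (g_val . g_train), so a larger dot product suggests a more
    helpful sample. Assumption: the inverse Hessian of the full
    influence function is replaced by the identity.
    """
    g_train = (sigmoid(x @ w) - y) * x  # logistic-loss gradient w.r.t. w
    return float(g_train @ g_val)


def influence_grad(x, y, w, g_val):
    """Analytic gradient of influence_score with respect to the sample x."""
    p = sigmoid(x @ w)
    return p * (1.0 - p) * (x @ g_val) * w + (p - y) * g_val


def toy_denoise(x):
    # Hypothetical placeholder for a learned reverse-diffusion mean:
    # it simply shrinks the sample toward the origin.
    return 0.9 * x


def guided_sample(x_T, y, w, g_val, steps=100, guidance=0.5, noise=0.0, seed=0):
    """Run the toy reverse process, nudging each step along the influence gradient."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x_T, dtype=float)
    for _ in range(steps):
        # Denoising step plus a guidance term evaluated at the current sample.
        x = toy_denoise(x) + guidance * influence_grad(x, y, w, g_val)
        # Optional stochastic term, as in ancestral sampling.
        x = x + noise * rng.standard_normal(x.shape)
    return x


if __name__ == "__main__":
    w = np.array([0.5, -1.0, 0.3])        # assumed downstream model weights
    g_val = np.array([0.1, -0.2, 0.05])   # assumed validation-loss gradient
    x_T = np.array([1.0, -1.0, 0.5])      # starting noise sample
    plain = guided_sample(x_T, 1.0, w, g_val, guidance=0.0)
    guided = guided_sample(x_T, 1.0, w, g_val, guidance=0.5)
    print("unguided score:", influence_score(plain, 1.0, w, g_val))
    print("guided score:  ", influence_score(guided, 1.0, w, g_val))
```

Setting `guidance=0.0` recovers the unguided toy sampler, which makes it easy to compare the influence proxy of guided and unguided samples for the same starting noise.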
Download PDF:
2504.17613v1