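Imputation and Missing Indicators for handling missing data in the development and implementation of clinical prediction models: a simulation study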

Paper title

Imputation and Missing Indicators for handling missing data in the development and implementation of clinical prediction models: a simulation study

Authors

Sisk, Rose, Sperrin, Matthew, Peek, Niels, van Smeden, Maarten, Martin, Glen P.

Abstract

Background: Existing guidelines for handling missing data are generally not consistent with the goals of prediction modelling, where missing data can occur at any stage of the model pipeline. Multiple imputation (MI), often heralded as the gold standard approach, can be challenging to apply in the clinic. Clearly, the outcome cannot be used to impute data at prediction time. Regression imputation (RI) may offer a pragmatic alternative in the prediction context that is simpler to apply in the clinic. Moreover, missing indicators can handle informative missingness, but it is currently unknown how well they perform within clinical prediction models (CPMs). Methods: We performed a simulation study in which data were generated under various missing data mechanisms to compare the predictive performance of CPMs developed using both imputation methods. We considered deployment scenarios where missing data are permitted/prohibited, and developed models that use/omit the outcome during imputation and include/omit missing indicators. Results: When complete data must be available at deployment, our findings were in line with widely used recommendations: the outcome should be used to impute development data under MI, yet omitted under RI. When imputation is applied at deployment, omitting the outcome from the imputation at development was preferred. Missing indicators improved model performance in some specific cases, but can be harmful when missingness is dependent on the outcome. Conclusion: We provide evidence that commonly taught principles of handling missing data via MI may not apply to CPMs, particularly when data can be missing at deployment. In such settings, RI and missing indicator methods can (marginally) outperform MI. As shown, the performance of the missing data handling method must be evaluated on a study-by-study basis, and should be based on whether missing data are allowed at deployment.
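
To make the two simpler techniques named in the abstract concrete, the following is a minimal sketch of regression imputation combined with a missing indicator. It is not the authors' simulation code. The predictor names `x1` and `x2`, the toy development data, and both helper functions are hypothetical and chosen only for illustration. The imputation model is fitted without the outcome, so the same model can be reused at deployment, when the outcome is unknown.

```python
from statistics import mean


def fit_imputation_model(rows):
    """Fit x2 ~ x1 by ordinary least squares on complete cases only.

    The outcome is deliberately not used, so the fitted model can also
    be applied at prediction time, when the outcome is not available.
    """
    complete = [(r["x1"], r["x2"]) for r in rows if r["x2"] is not None]
    mx = mean(x for x, _ in complete)
    my = mean(y for _, y in complete)
    sxx = sum((x - mx) ** 2 for x, _ in complete)
    sxy = sum((x - mx) * (y - my) for x, y in complete)
    slope = sxy / sxx
    intercept = my - slope * mx
    return intercept, slope


def impute_with_indicator(row, model):
    """Return a copy of row with x2 filled in and a missing indicator added."""
    intercept, slope = model
    missing = row["x2"] is None
    filled = dict(row)
    # The indicator records that x2 was not observed for this record.
    filled["x2_missing"] = 1 if missing else 0
    if missing:
        # Regression imputation: a single predicted value, no random draw.
        filled["x2"] = intercept + slope * row["x1"]
    return filled


# Hypothetical development data; x2 equals 2 * x1 + 1 where observed.
development = [
    {"x1": 1.0, "x2": 3.0},
    {"x1": 2.0, "x2": 5.0},
    {"x1": 3.0, "x2": None},
    {"x1": 4.0, "x2": 9.0},
]
model = fit_imputation_model(development)

# Deployment: a new patient arrives with x2 missing.
new_patient = impute_with_indicator({"x1": 5.0, "x2": None}, model)
print(new_patient)
```

In this sketch the imputed `x2` and the `x2_missing` flag would both be passed to the prediction model as predictors. Whether the indicator helps or harms depends on the missingness mechanism, as the abstract notes.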
