Holistic Robust Data-Driven Decisions

Paper Title

Holistic Robust Data-Driven Decisions

Authors

Bennouna, Amine; Van Parys, Bart; Lucas, Ryan

Abstract

The design of data-driven formulations for machine learning and decision-making with good out-of-sample performance is a key challenge. The observation that good in-sample performance does not guarantee good out-of-sample performance is generally known as overfitting. Practical overfitting typically cannot be attributed to a single cause but is caused by several factors simultaneously. We consider here three overfitting sources: (i) statistical error as a result of working with finite sample data, (ii) data noise, which occurs when the data points are measured only with finite precision, and (iii) data misspecification, in which a small fraction of all data may be wholly corrupted. Although existing data-driven formulations may be robust against one of these three sources in isolation, they do not provide holistic protection against all overfitting sources simultaneously. We design a novel data-driven formulation that guarantees such holistic protection and is computationally viable. Our distributionally robust optimization formulation can be interpreted as a novel combination of a Kullback-Leibler and Lévy-Prokhorov robust optimization formulation. In the context of classification and regression problems, we show that several popular regularized and robust formulations naturally reduce to a particular case of our proposed novel formulation. Finally, we apply the proposed HR formulation to two real-life applications and study it alongside several benchmarks: (1) training neural networks on healthcare data, where we analyze various robustness and generalization properties in the presence of noise, labeling errors, and scarce data, and (2) a portfolio selection problem with real stock data, where we analyze the risk/return tradeoff under the natural severe distribution shift of the application.
