Paper Title

On regression and classification with possibly missing response variables in the data

Paper Authors

Majid Mojirsheibani, William Pouliot, Andre Shakhbandaryan

Paper Abstract

This paper considers the problem of kernel regression and classification with possibly unobservable response variables in the data, where the mechanism that causes the absence of information is unknown and can depend on both the predictors and the response variables. Our proposed approach involves two steps: in the first step, we construct a family of models (possibly infinite dimensional) indexed by the unknown parameter of the missing probability mechanism. In the second step, a search is carried out to find the empirically optimal member of an appropriate cover (or subclass) of the underlying family, in the sense of minimizing the mean squared prediction error. The main focus of the paper is to study the theoretical properties of these estimators. The issue of identifiability is also addressed. Our methods use a data-splitting approach that is quite easy to implement. We also derive exponential bounds on the performance of the resulting estimators in terms of their deviations from the true regression curve in general Lp norms, where we also allow the size of the cover or subclass to diverge as the sample size n increases. These bounds immediately yield various strong convergence results for the proposed estimators. As an application of our findings, we consider the problem of statistical classification based on the proposed regression estimators and also look into their rates of convergence under different settings. Although this work is mainly stated for kernel-type estimators, the results can also be extended to other popular local-averaging methods such as nearest-neighbor estimators and histogram estimators.
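
To make the two-step procedure concrete, the following is a minimal Python sketch, not the authors' exact estimator: it assumes a hypothetical logistic form pi_theta for the missingness mechanism, uses an inverse-probability-weighted Nadaraya-Watson kernel estimator as the family indexed by theta, takes a small finite grid of theta values as the "cover," and selects the empirically best member via a complete-case weighted squared prediction error on a held-out split. All function names, the simulated data, and the parametric form are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_kernel(u):
    """Standard Gaussian kernel."""
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def pi_theta(x, y, theta):
    """Hypothetical parametric missingness model: P(response observed | X=x, Y=y),
    taken to be logistic in (x, y) purely for illustration."""
    return 1.0 / (1.0 + np.exp(-(theta[0] + theta[1] * x + theta[2] * y)))

def kernel_regression_ipw(x0, X, Y, delta, theta, h):
    """Inverse-probability-weighted Nadaraya-Watson estimate at x0,
    using only the complete cases (delta == 1)."""
    w = gaussian_kernel((x0 - X) / h) * delta / pi_theta(X, Y, theta)
    denom = w.sum()
    return w @ Y / denom if denom > 0 else Y[delta == 1].mean()

# --- Simulated data (purely illustrative) --------------------------------
n = 600
X = rng.uniform(-2, 2, n)
Y = np.sin(np.pi * X) + 0.3 * rng.standard_normal(n)
theta_true = np.array([1.0, 0.5, -0.8])              # unknown in practice
delta = rng.binomial(1, pi_theta(X, Y, theta_true))  # 1 = response observed

# --- Data splitting: training half and validation half -------------------
idx = rng.permutation(n)
tr, va = idx[: n // 2], idx[n // 2:]

# --- Step 1: family of estimators indexed by candidate theta values ------
theta_grid = [np.array([a, b, c])
              for a in (0.5, 1.0, 1.5)
              for b in (0.0, 0.5)
              for c in (-1.0, -0.5, 0.0)]
h = 0.3  # bandwidth, fixed here for simplicity

# --- Step 2: pick the empirically best member of the finite cover --------
def validation_error(theta):
    """Weighted squared prediction error on the validation split,
    computed from complete cases only (Horvitz-Thompson style)."""
    err = 0.0
    for j in va:
        if delta[j] == 1:
            pred = kernel_regression_ipw(X[j], X[tr], Y[tr], delta[tr], theta, h)
            err += (Y[j] - pred) ** 2 / pi_theta(X[j], Y[j], theta)
    return err / len(va)

best_theta = min(theta_grid, key=validation_error)
print("selected theta:", best_theta)
print("estimate at x=0.5:",
      kernel_regression_ipw(0.5, X[tr], Y[tr], delta[tr], best_theta, h))
```

For the classification application mentioned in the abstract, the selected regression estimate could be turned into a plug-in classifier in the usual way, e.g. predicting class 1 at x whenever the estimated regression function exceeds 1/2 for a 0/1-valued response; the sketch above only illustrates the regression and model-selection steps.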
