Paper Title

Near-Optimal Procedures for Model Discrimination with Non-Disclosure Properties

Paper Authors

Ostrovskii, Dmitrii M., Ndaoud, Mohamed, Javanmard, Adel, Razaviyayn, Meisam

Paper Abstract

Let $θ_0,θ_1 \in \mathbb{R}^d$ be the population risk minimizers associated with some loss $\ell:\mathbb{R}^d\times \mathcal{Z}\to\mathbb{R}$ and two distributions $\mathbb{P}_0,\mathbb{P}_1$ on $\mathcal{Z}$. The models $θ_0,θ_1$ are unknown, and $\mathbb{P}_0,\mathbb{P}_1$ can be accessed by drawing i.i.d. samples from them. Our work is motivated by the following model discrimination question: "What sizes of the samples from $\mathbb{P}_0$ and $\mathbb{P}_1$ allow one to distinguish between the two hypotheses $θ^*=θ_0$ and $θ^*=θ_1$ for a given $θ^*\in\{θ_0,θ_1\}$?" Taking the first steps towards answering this question in full generality, we first consider the case of a well-specified linear model with squared loss. Here we provide matching upper and lower bounds on the sample complexity of the form $\min\{1/Δ^2,\sqrt{r}/Δ\}$, up to a constant factor; here $Δ$ is a measure of separation between $\mathbb{P}_0$ and $\mathbb{P}_1$, and $r$ is the rank of the design covariance matrix. We then extend this result in two directions: (i) to general parametric models in the asymptotic regime; (ii) to generalized linear models in the small-sample regime ($n\le r$) under weak moment assumptions. In both cases we derive sample complexity bounds of a similar form while allowing for model misspecification. In fact, our testing procedures only access $θ^*$ via a certain functional of the empirical risk. In addition, the number of observations that allows us to reach statistical confidence does not allow one to "resolve" the two models; that is, to recover $θ_0,θ_1$ up to $O(Δ)$ prediction accuracy. These two properties allow our framework to be used in applied tasks where one would like to $\textit{identify}$ a prediction model, which may be proprietary, while guaranteeing that the model cannot actually be $\textit{inferred}$ by the identifying agent.
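To make the discrimination question concrete, here is a toy numerical sketch (not the paper's actual near-optimal procedure) for the well-specified linear model with squared loss. It illustrates the idea that a test can access $θ^*$ only through a functional of the empirical risk: the excess empirical risk of $θ^*$ on a sample from $\mathbb{P}_0$ is small when $θ^*=θ_0$ and large when $θ^*=θ_1$. All dimensions, separations, and the thresholding rule below are illustrative choices, not quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 20, 500  # illustrative dimension and sample size

# two separated linear models (stand-ins for the unknown theta_0, theta_1)
theta0 = rng.standard_normal(d)
theta1 = theta0 + 0.5 * rng.standard_normal(d)

def sample(theta, n):
    """Draw n i.i.d. points from a well-specified linear model y = <theta, x> + noise."""
    X = rng.standard_normal((n, d))
    y = X @ theta + rng.standard_normal(n)
    return X, y

def excess_empirical_risk(theta_star, X, y):
    """Empirical squared-loss risk of theta_star minus the minimal empirical risk.

    This is a functional of the empirical risk only: it never reveals theta_star
    beyond how well it fits the sample.
    """
    theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    r_star = np.mean((y - X @ theta_star) ** 2)
    r_min = np.mean((y - X @ theta_hat) ** 2)
    return r_star - r_min

# sample from P_0; under theta* = theta0 the excess risk concentrates near O(d/n),
# while under theta* = theta1 it is of order ||theta1 - theta0||^2
X, y = sample(theta0, n)
gap0 = excess_empirical_risk(theta0, X, y)
gap1 = excess_empirical_risk(theta1, X, y)
print(gap0 < gap1)  # the toy test correctly prefers the hypothesis theta* = theta0
```

With these settings the two excess risks differ by orders of magnitude, so even a crude threshold separates the hypotheses; the paper's contribution is characterizing the minimal sample size at which such separation is statistically possible.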
