Paper Title

Near-Optimal Procedures for Model Discrimination with Non-Disclosure Properties

Paper Authors

Ostrovskii, Dmitrii M., Ndaoud, Mohamed, Javanmard, Adel, Razaviyayn, Meisam

Paper Abstract

Let $θ_0,θ_1 \in \mathbb{R}^d$ be the population risk minimizers associated with some loss $\ell:\mathbb{R}^d\times \mathcal{Z}\to\mathbb{R}$ and two distributions $\mathbb{P}_0,\mathbb{P}_1$ on $\mathcal{Z}$. The models $θ_0,θ_1$ are unknown, and $\mathbb{P}_0,\mathbb{P}_1$ can be accessed by drawing i.i.d. samples from them. Our work is motivated by the following model discrimination question: "What sizes of the samples from $\mathbb{P}_0$ and $\mathbb{P}_1$ allow one to distinguish between the two hypotheses $θ^*=θ_0$ and $θ^*=θ_1$ for a given $θ^*\in\{θ_0,θ_1\}$?" Taking the first steps towards answering this question in full generality, we first consider the case of a well-specified linear model with squared loss. Here we provide matching upper and lower bounds on the sample complexity of the form $\min\{1/Δ^2,\sqrt{r}/Δ\}$, up to a constant factor; here $Δ$ is a measure of separation between $\mathbb{P}_0$ and $\mathbb{P}_1$, and $r$ is the rank of the design covariance matrix. We then extend this result in two directions: (i) to general parametric models in the asymptotic regime; (ii) to generalized linear models in the small-sample regime ($n\le r$) under weak moment assumptions. In both cases we derive sample complexity bounds of a similar form while allowing for model misspecification. In fact, our testing procedures only access $θ^*$ via a certain functional of the empirical risk. In addition, the number of observations that allows us to reach statistical confidence does not allow one to "resolve" the two models; that is, to recover $θ_0,θ_1$ up to $O(Δ)$ prediction accuracy. These two properties allow our framework to be used in applied tasks where one would like to $\textit{identify}$ a prediction model, which may be proprietary, while guaranteeing that the model cannot actually be $\textit{inferred}$ by the identifying agent.
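To make the discrimination question concrete, here is a toy numerical sketch (not the paper's actual near-optimal procedure) for the well-specified linear model with squared loss. It illustrates the idea that a test can access $θ^*$ only through a functional of the empirical risk: the excess empirical risk of $θ^*$ on a sample from $\mathbb{P}_0$ is small when $θ^*=θ_0$ and large when $θ^*=θ_1$. All dimensions, separations, and the thresholding rule below are illustrative choices, not quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 20, 500  # illustrative dimension and sample size

# two separated linear models (stand-ins for the unknown theta_0, theta_1)
theta0 = rng.standard_normal(d)
theta1 = theta0 + 0.5 * rng.standard_normal(d)

def sample(theta, n):
    """Draw n i.i.d. points from a well-specified linear model y = <theta, x> + noise."""
    X = rng.standard_normal((n, d))
    y = X @ theta + rng.standard_normal(n)
    return X, y

def excess_empirical_risk(theta_star, X, y):
    """Empirical squared-loss risk of theta_star minus the minimal empirical risk.

    This is a functional of the empirical risk only: it never reveals theta_star
    beyond how well it fits the sample.
    """
    theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    r_star = np.mean((y - X @ theta_star) ** 2)
    r_min = np.mean((y - X @ theta_hat) ** 2)
    return r_star - r_min

# sample from P_0; under theta* = theta0 the excess risk concentrates near O(d/n),
# while under theta* = theta1 it is of order ||theta1 - theta0||^2
X, y = sample(theta0, n)
gap0 = excess_empirical_risk(theta0, X, y)
gap1 = excess_empirical_risk(theta1, X, y)
print(gap0 < gap1)  # the toy test correctly prefers the hypothesis theta* = theta0
```

With these settings the two excess risks differ by orders of magnitude, so even a crude threshold separates the hypotheses; the paper's contribution is characterizing the minimal sample size at which such separation is statistically possible.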
