Neyman-Pearson and some other Uniformly Most Powerful Tests
Introduction
Suppose data consisting of i.i.d. observations \(X^n=(X_1,X_2,\cdots,X_n)\) are available from a distribution \(F(x,\theta),\,\theta\in\Theta\subset\mathbf{R}.\) The exact value of \(\theta\) corresponding to the distribution that generated the observations is unknown. The problem is to construct, using the available data \(X^n,\) tests for making decisions about the possible value of the unknown parameter \(\theta\). Unlike estimation problems, where an estimator constructed from the data serves as an approximate value of the unknown parameter \(\theta\), hypothesis testing deals with decisions, for example, whether the unknown parameter is in a given subset (the null hypothesis) \[{\mathcal H}_0:\ \ \theta\in\Theta_0\subset\Theta,\]
or, alternatively, in its complement (the alternative hypothesis) \[{\mathcal H}_a:\ \ \theta\in\Theta\setminus\Theta_0.\] Therefore, hypothesis testing aims to determine whether the unknown value is in a given set \(\Theta_0\). This set may contain only one value, \(\Theta_0=\{\theta_0\},\) in which case the test decides whether the unknown value is equal to the given known value \(\theta_0\). Statistical tests make these decisions based on the data, and their construction can be formalized as follows.
Suppose \(\psi:\mathbf{R}^n\rightarrow\{0,1\}\) is a measurable function defined for all observations \(X^n\) that takes only the values 0 and 1. Any such function will be called a statistical test. We will use the convention that the value 1 corresponds to the decision to reject the null hypothesis (hence the alternative hypothesis is accepted), while the value 0 corresponds to the decision to accept the null hypothesis. Hence, using the available observations \(X^n,\) we can make a decision based on the value of \(\psi(X^n).\)
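As a small illustration of this definition, a test is simply a function that maps a sample to 0 or 1. The sketch below assumes, purely for illustration, normally distributed data and a hypothetical threshold `crit`; neither is taken from the text.

```r
# A statistical test is any function of the data returning 0 or 1.
# Illustrative assumption: reject H0 when the sample mean exceeds `crit`.
psi <- function(x, crit = 0.5) as.integer(mean(x) > crit)

set.seed(1)                        # reproducible illustrative sample
x <- rnorm(10, mean = 0, sd = 1)   # hypothetical data X^n with n = 10
decision <- psi(x)                 # 1 = reject H0, 0 = accept H0

# A data-independent test, such as psi identically equal to 1, ignores the sample.
psi_const <- function(x) 1L
```

The function `psi_const` is a valid test in the formal sense, which is why additional quality criteria are needed.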
As we have seen from the definition of a statistical test, any measurable function with values in \(\{0,1\}\) is a test, including the constant functions such as \(\psi\equiv1\) (data-independent tests). These are not good tests at all, since they always give the same answer regardless of the data and hence will be wrong whenever the true parameter lies in the other set. Therefore, we need to define tests that have good properties (give reliable answers), and before this we need to define formally what a good test should be. We will be dealing only with small sample statistical tests, meaning that the sample size \(n\) is fixed and the properties of a statistical test are considered under this condition only (unlike the asymptotic theory, where large sample inference is done as \(n\rightarrow+\infty\)).
Type I and II errors
For each statistical test \(\psi\) we may either make a correct decision (correctly identify the set to which the unknown value \(\theta\) belongs) or commit one of two errors: reject the null hypothesis when it is true (type I error) or accept it when it is false (type II error). If the sample size \(n\) is fixed, it is in general impossible to make the probabilities of both types of errors arbitrarily small at the same time. Hence the strategy is to fix a bound for the type I error probability (the level of significance) and, among the tests satisfying this bound, find a test with the lowest type II error probability.
Indeed, consider the type I error of a given statistical test \(\psi.\) Its probability, denoted by \(\alpha(\psi,\theta),\) is \[\alpha(\psi,\theta) = P_\theta(\psi=1)=E_\theta\psi,\ \ \theta\in\Theta_0.\] That is, it is the probability of rejecting the hypothesis that \(\theta\) is in \(\Theta_0\) (the decision is \(\psi=1\)) while \(\theta\) is indeed in \(\Theta_0.\) For a given significance level \(\alpha\in(0,1)\), we consider only tests \(\psi\) such that \[\alpha(\psi,\theta)\leq\alpha,\ \ \theta\in\Theta_0.\] Among these tests we will try to find the one with the lowest type II error probability. Equivalently, if we denote by \(\pi(\psi,\theta)=P_\theta(\psi=1)=E_\theta\psi,\ \ \theta\in\Theta\setminus\Theta_0,\) the power function of the test \(\psi,\) then the problem above can be formulated as finding a test with the highest power in the region \(\Theta\setminus\Theta_0\) among the tests with the given significance level \(\alpha\) in the region \(\Theta_0.\)
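The quantities \(\alpha(\psi,\theta)\) and \(\pi(\psi,\theta)\) can be computed explicitly in simple models. The following sketch assumes a setting that is not part of the text: i.i.d. \(N(\theta,1)\) observations, \(\Theta_0=(-\infty,0]\), and a test that rejects when the sample mean exceeds a threshold `crit`.

```r
# Illustrative assumptions: X_1, ..., X_n i.i.d. N(theta, 1),
# H0: theta <= 0 against Ha: theta > 0, test psi = 1{mean(X) > crit}.
n <- 10
alpha <- 0.05
crit <- qnorm(1 - alpha) / sqrt(n)   # gives rejection probability alpha at theta = 0

# E_theta(psi) = P_theta(mean(X) > crit); the sample mean is N(theta, 1/n).
reject_prob <- function(theta) {
  pnorm(sqrt(n) * (crit - theta), lower.tail = FALSE)
}

type1 <- reject_prob(c(-0.5, 0))   # type I error probabilities on Theta_0
power <- reject_prob(c(0.5, 1))    # power at two points of the alternative
type2 <- 1 - power                 # type II error probabilities
```

In this assumed model the rejection probability is increasing in \(\theta\), so the type I error probability is largest at the boundary point \(\theta=0\).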
The hypothesis testing problem is called simple if both \(\Theta_0\) and its complement consist of a single value each.
Neyman-Pearson test
Consider the case of simple hypothesis testing. We observe a random variable \(X\) with distribution function \(F(x),\) \(X\sim F(x)\). The simple hypothesis to be tested is the following:
\[{\mathcal H}_0:\ \ F(x)=F_0(x),\]
and the alternative hypothesis is \[{\mathcal H}_a:\ \ F(x)=F_1(x).\] Here \(F_0(x)\) and \(F_1(x)\) are given distribution functions.
Suppose the distribution function \(F_0(x)\) has a density \(f_0(x)\) with respect to some measure \(\mu\), while \(F_1(x)\) has a density \(f_1(x),\) with respect to the same measure. Such a measure always exists since we can take the measure generated by the distribution function \(\tilde F(x)=\frac{F_0(x)+F_1(x)}{2}\). The Neyman-Pearson fundamental lemma (Lehmann and Romano 2005) says that:
For a given significance level \(\alpha\in(0,1)\) there exists a value \(c_0\in\mathbf{R}\) such that the following Neyman-Pearson (NP) test \[\tilde\psi_{c_0}(x)=\left\{ \begin{matrix} 1, & x\in\{x:\,f_1(x)>c_0f_0(x)\},\\ \frac{\alpha-\alpha(c_0)}{\alpha(c_0-0)-\alpha(c_0)}, & x\in\{x:\,f_1(x)=c_0f_0(x)\},\\ 0, & x\in\{x:\,f_1(x)<c_0f_0(x)\}, \end{matrix}\right. \] satisfies the equality \(E_{0}\tilde\psi_{c_0}(X)=\alpha.\) Here \[\alpha(c)=P_0(f_1(X)>cf_0(X)),\] and \(c_0\) is such that \(\alpha(c_0)\leq \alpha\leq\alpha(c_0-0).\)
The test \(\tilde\psi_{c_0}\) is most powerful at the significance level \(\alpha,\) meaning that for any test \(\psi\) of level \(\alpha,\) that is, \(E_{0}\psi(X)\leq \alpha,\) the power of that test does not exceed the power of the test \(\tilde\psi_{c_0}\): \[E_{1}\tilde\psi_{c_0}(X)-E_{1}\psi(X)= \int\left[\tilde\psi_{c_0}(x)-\psi(x)\right]f_1(x)d \mu\geq 0.\] Indeed, if \(\tilde \psi_{c_0}(x)>\psi(x)\geq 0,\) then necessarily \(\tilde \psi_{c_0}(x)\neq 0,\) hence \(f_1(x)\geq c_0f_0(x).\) In the same way, if \(\tilde \psi_{c_0}(x)<\psi(x)\leq 1,\) then necessarily \(\tilde \psi_{c_0}(x)\neq 1,\) hence \(f_1(x)\leq c_0f_0(x).\) In both cases the product of the two differences is non-negative, therefore \[\int\left(\tilde\psi_{c_0}(x)-\psi(x)\right)(f_1(x)-c_0f_0(x))d\mu\geq 0,\] which entails that \[\int\left(\tilde\psi_{c_0}(x)-\psi(x)\right)f_1(x)d\mu\geq c_0\int\left(\tilde\psi_{c_0}(x)-\psi(x)\right)f_0(x)d\mu=c_0[\alpha-E_0\psi(X)]\geq 0,\] where the last inequality uses \(c_0\geq 0\) and \(E_0\psi(X)\leq\alpha.\)
If a test \(\psi\) is most powerful at level \(\alpha\) for testing \(f_0(x)\) against \(f_1(x)\), then for some \(c_0\) it can be written as \(\psi=\tilde\psi_{c_0}\) almost everywhere on the set \(\{f_1(x)\neq c_0 f_0(x)\}\). Furthermore, for the most powerful test \(\psi\) the equality \(E_{0}\psi(X)=\alpha\) will be satisfied unless there exists a test of size \(<\alpha\) with power 1. Since the NP test always exists and is most powerful, this last statement essentially means the uniqueness (almost everywhere) of most powerful tests, except possibly on the set \(\{f_1(x)= c_0 f_0(x)\}\). Indeed, suppose that \(\psi\) is most powerful and \(\tilde\psi_{c_0}\) is the NP test. Denote \(S=\{\tilde\psi_{c_0}\neq\psi\}\cap\{f_1(x)\neq c_0f_0(x)\}\). As shown above, on this set \(\left(\tilde\psi_{c_0}(x)-\psi(x)\right)(f_1(x)-c_0f_0(x))> 0.\) Hence if \(\mu(S)>0\) then
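\[\begin{align*} &\int\left(\tilde\psi_{c_0}(x)-\psi(x)\right)(f_1(x)-c_0f_0(x))d\mu=\\ =&\int_S\left(\tilde\psi_{c_0}(x)-\psi(x)\right)(f_1(x)-c_0f_0(x))d\mu> 0. \end{align*}\] It follows that \(E_1\tilde\psi_{c_0}(X)-E_1\psi(X)>c_0[\alpha-E_0\psi(X)]\geq 0,\) so the NP test would have strictly larger power than \(\psi.\) This contradicts the fact that \(\psi\) is most powerful. Hence \(\mu(S)=0.\)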
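The construction of \(c_0\) and the optimality property can be checked numerically in a discrete model. The following is a minimal sketch under assumptions that are not part of the text: a single observation from a Binomial(10, p) distribution, \(p=0.5\) under the null and \(p=0.7\) under the alternative, and randomly generated competing tests of level \(\alpha\).

```r
# Illustrative assumption: X ~ Binomial(n, p), H0: p = p0 against Ha: p = p1.
n <- 10; p0 <- 0.5; p1 <- 0.7; alpha <- 0.05
x  <- 0:n
f0 <- dbinom(x, n, p0)   # density under H0 (w.r.t. counting measure)
f1 <- dbinom(x, n, p1)   # density under Ha
lr <- f1 / f0            # likelihood ratio at each support point

# alpha(c) = P_0(f1(X) > c * f0(X))
alpha_fun <- function(thr) sum(f0[lr > thr])

# c0 is the smallest attained ratio with alpha(c0) <= alpha,
# so that alpha(c0) <= alpha <= alpha(c0 - 0).
cands   <- sort(unique(lr))
c0      <- cands[which(sapply(cands, alpha_fun) <= alpha)[1]]
a_right <- alpha_fun(c0)                  # alpha(c0)
a_left  <- a_right + sum(f0[lr == c0])    # alpha(c0 - 0)
g0      <- (alpha - a_right) / (a_left - a_right)

# NP test function evaluated on the support, with its size and power.
psi_np   <- ifelse(lr > c0, 1, ifelse(lr == c0, g0, 0))
size_np  <- sum(psi_np * f0)
power_np <- sum(psi_np * f1)

# Competing tests with E_0(psi) <= alpha: random values in [0, 1], scaled down.
set.seed(123)
power_comp <- replicate(1000, {
  raw <- runif(length(x))
  psi <- raw * min(1, alpha / sum(raw * f0))
  sum(psi * f1)
})
max(power_comp) <= power_np   # expected to be TRUE by the lemma
```

The random competitors are only a spot check; the lemma itself covers every test of level \(\alpha\).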
Remark.
For a given \(\alpha\in(0,1)\) the value \(c_0\) always exists since \(1-\alpha(c)\) is a distribution function, namely that of the likelihood ratio \(f_1(X)/f_0(X)\) under \(P_0.\)
The constructed test is randomized, meaning that it takes not only the values in \(\{0,1\}\) but possibly also a value between 0 and 1, which is interpreted as the probability of rejecting the null hypothesis. Hence, for some observations the test does not deliver a definite decision to reject or accept the null hypothesis, but only assigns a probability to rejecting it.
If the set \(\{x:\,f_1(x)=c_0f_0(x)\}\) has \(\mu\)-measure zero, then the most powerful test is determined uniquely (up to sets of measure zero) by the Neyman-Pearson lemma. This will happen if, for example, both \(f_1(x)\) and \(f_0(x)\) are continuous and \(f_0(x)>0,\) almost everywhere.
In practice randomization is not considered acceptable and hence an \(\alpha\) value is selected so that a non-randomized test exists.
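When a randomized test is nevertheless used, the usual way to turn the value of the test function into a decision is an auxiliary random draw. The sketch below illustrates this; the helper `randomized_decision` and the value 0.3 are hypothetical and chosen only for illustration.

```r
# Sketch: carrying out a randomized test. `psi_value` is the value of the
# test function at the observed data, a number in [0, 1].
randomized_decision <- function(psi_value) {
  stopifnot(psi_value >= 0, psi_value <= 1)
  # Reject H0 (return 1) with probability psi_value via a Bernoulli draw.
  rbinom(1, size = 1, prob = psi_value)
}

set.seed(42)
# 0.3 is an arbitrary illustrative value of the test function.
decisions <- replicate(10000, randomized_decision(0.3))
mean(decisions)   # relative frequency of rejection, close to 0.3
```

The decision then depends on the auxiliary draw as well as on the data, which is the reason randomization is often avoided in practice.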
Example.
Consider a single observation \(X\) from a Poisson distribution, that is
\[f(k,\theta)=P(X=k)=\frac{\theta^k}{k!}e^{-\theta},\ \ k=0,1,2,\cdots.\] We are testing the simple hypothesis
\[{\mathcal H}_0:\ \ \theta=\theta_0,\]
against the alternative hypothesis \[{\mathcal H}_a:\ \ \theta=\theta_1>\theta_0.\]
In this case \[\frac{f(X,\theta_1)}{f(X,\theta_0)}=\left(\frac{\theta_1}{\theta_0}\right)^Xe^{-(\theta_1-\theta_0)}>\tilde c\] is equivalent to \(X>c,\) because of the fact that \(\theta_1>\theta_0.\) Hence the most powerful test will be
\[\tilde\psi_{c_0}(X)=\left\{ \begin{matrix} 1, & X>c_0,\\ \frac{F(c_0,\theta_0)-(1-\alpha)}{F(c_0,\theta_0)-F(c_0-0,\theta_0)}, & X=c_0,\\ 0, & X<c_0. \end{matrix}\right. \] Here \(c_0\) is such that \(F(c_0-0,\theta_0)\leq 1-\alpha\leq F(c_0,\theta_0).\)
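The equivalence between the likelihood-ratio inequality and \(X>c\) used above can be spelled out by taking logarithms. Assuming \(\tilde c>0\) (the explicit expression for \(c\) is written out here only as a worked step), \[\left(\frac{\theta_1}{\theta_0}\right)^Xe^{-(\theta_1-\theta_0)}>\tilde c \iff X\log\frac{\theta_1}{\theta_0}>\log\tilde c+(\theta_1-\theta_0) \iff X>c=\frac{\log\tilde c+\theta_1-\theta_0}{\log(\theta_1/\theta_0)}.\] The last step divides by \(\log(\theta_1/\theta_0),\) which is positive because \(\theta_1>\theta_0,\) so the direction of the inequality is preserved.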
As noted above, to avoid randomized tests we can select the significance level so that no randomization is needed on the set \(\{X=c_0\}.\) This can be achieved by replacing the given significance level \(\alpha\) by a more conservative (lower) significance level \(\alpha_0\) such that \(1-\alpha_0=F(c_0,\theta_0).\) With this choice the rejection probability on \(\{X=c_0\}\) equals zero and the test rejects only when \(X>c_0.\)
Take the case of \(\theta_0=1,\,\theta_1=2,\,\alpha=0.05.\) In this case,
\[F(3-0, 1)\leq 1-\alpha\leq F(3,1),\] therefore, \(c_0=3.\) This value can be found as follows
alpha <- 0.05
theta0 <- 1
theta1 <- 2
k <- 0:100                          # candidate values, including k = 0
Y <- ppois(k, theta0)               # F(k, theta0)
c0 <- k[which(Y >= 1 - alpha)[1]]   # smallest k with F(k, theta0) >= 1 - alpha
c0
## [1] 3
If we test at the given significance level \(\alpha=0.05,\) then the Neyman-Pearson test will be randomized and on the set \(\{X=3\}\) it will have the following value
(ppois(c0,theta0)-(1-alpha))/(ppois(c0,theta0)-ppois(c0-1,theta0))
## [1] 0.5057936
Hence, on this set the probability of rejecting the null hypothesis is around 0.5, so observing \(X=3\) does not lead to a definite decision. On the other hand, we can take a more conservative significance level \(\alpha_0\) as follows:
alpha0 <- 1-ppois(c0,theta0)
alpha0
## [1] 0.01898816
In this case we get a non-randomized test, which rejects the null hypothesis when \(X>c_0.\) Its power is \(P_{\theta_1}(X>c_0)=1-F(c_0,\theta_1),\) which equals
1 - ppois(c0, theta1)
## [1] 0.1428765
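To compare the two options side by side, the size and power of the randomized test at level \(\alpha=0.05\) and of the conservative non-randomized test can be computed with one helper. This is a sketch for the example above; the helper `reject_prob` and the name `g0` for the randomization value are introduced here for illustration.

```r
theta0 <- 1; theta1 <- 2; alpha <- 0.05
c0 <- qpois(1 - alpha, theta0)   # smallest k with F(k, theta0) >= 1 - alpha
g0 <- (ppois(c0, theta0) - (1 - alpha)) / dpois(c0, theta0)

# E_theta(psi) for the test that rejects if X > c0 and
# rejects with probability g if X = c0.
reject_prob <- function(theta, g) {
  ppois(c0, theta, lower.tail = FALSE) + g * dpois(c0, theta)
}

size_rand  <- reject_prob(theta0, g0)   # size of the randomized test
power_rand <- reject_prob(theta1, g0)   # its power at theta1
size_cons  <- reject_prob(theta0, 0)    # size alpha0 of the non-randomized test
power_cons <- reject_prob(theta1, 0)    # its power at theta1
```

The randomized test attains the level \(\alpha\) exactly and has higher power, while the conservative test gives up some power in exchange for a definite decision at every observation.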
Monotone likelihood ratios
In this section we will consider some generalizations of the NP test for composite hypotheses, to obtain Uniformly Most Powerful (UMP) tests.