Bayesian Estimation

CEO Salary Estimation Problem

Consider the following problem. An investigative reporter wants to find out how much the CEO of an investment bank X earns. To this end he interviews some of the employees of that bank and writes down their salaries, which form the following sample

\[X^n=(X_1,\cdots,X_n).\] The reporter knows only that the salaries in that bank can range from 0 (unpaid interns) to \(\theta,\) the salary of the CEO, which is what the reporter wants to estimate. Since he has no information about the structure of salaries in bank X, he assumes that the salaries are independent and uniformly distributed on the interval \([0,\theta],\) \[X^n=(X_1,\cdots,X_n),\ \ X_i\sim {\mathbb U}(0,\theta),\,\theta>0.\]

This choice can be justified by the principle of maximum entropy: among all distributions supported on a given interval, with no other constraints imposed, the uniform distribution has the largest entropy. According to this principle, if nothing is known about a distribution except that it belongs to a certain class (usually defined in terms of specified properties or measures), then the distribution with the largest entropy should be chosen as the least-informative default.

Frequentist and Bayesian Estimation

Since no other information is known about the possible values of the CEO’s salary, the reporter needs to estimate the unknown \(\theta\) using the sample \(X^n.\) For the uniform distribution the maximum likelihood estimator (MLE) of \(\theta\) is \[\hat\theta_n=X_{(n)}=\max(X_1,\cdots,X_n).\] Therefore, the reporter needs to ask as many employees of bank X as possible for their salaries and take the maximum of the reported values, which serves as an estimate of the CEO’s salary.
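A minimal simulation sketch of this estimator (the true salary, sample size, and variable names below are illustrative assumptions, not part of the original problem):

```python
import numpy as np

rng = np.random.default_rng(seed=1)

theta_true = 2_000_000   # hypothetical CEO salary, unknown to the reporter
n = 50                   # number of interviewed employees

# Salaries assumed i.i.d. uniform on [0, theta_true]
salaries = rng.uniform(0, theta_true, size=n)

# MLE of the uniform upper bound: the sample maximum
theta_mle = salaries.max()
print(f"MLE estimate of the CEO salary: {theta_mle:,.0f}")
```

Since the sample maximum is always below \(\theta,\) this estimator is biased downward; interviewing more employees shrinks the gap.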

Now suppose that the investigative reporter wants to win a Pulitzer prize for his reporting and remembers that he has a minor in Statistics. He reads the economics literature and finds that, nationally, the salaries of bank CEOs follow a Pareto distribution, \(\theta\sim Pa(\theta_0, a)\), in line with the Pareto principle: a large portion of the total wealth of CEOs is held by a small fraction of them. Using this prior distribution, the Bayesian estimator for the salary of the CEO of bank X is

\[\hat\theta^B_n=\frac{a+n}{a+n-1}\max(\theta_0,X_1,\cdots,X_n).\] Here \(a\) and \(\theta_0\) are unknown as well and can be estimated from the available data on CEO salaries at other banks. The investigative reporter therefore decides to use the work of his colleagues, other investigative reporters, who conducted similar studies in other banks and reported estimates of their CEOs’ salaries. Denote this new sample of salaries of other CEOs by \[\vartheta^m=(\vartheta_1,\cdots,\vartheta_m).\] Since \(\theta\) follows the Pareto distribution, the parameters \(\theta_0\) and \(a\) can be estimated as follows: \[\hat\theta_0=\vartheta_{(1)}=\min(\vartheta_1,\cdots,\vartheta_m),\ \ \hat a=\frac{m}{\sum_{i=1}^m\ln\frac{\vartheta_i}{\vartheta_{(1)}}}.\] Therefore, combining the reports from other investigations with his own survey, the investigative reporter obtains the following estimator of the CEO’s salary

\[\hat\theta^B_n=\frac{\hat a+n}{\hat a+n-1}\max(\vartheta_{(1)},X_1,\cdots,X_n).\] Each time a new CEO salary \((\vartheta_{m+1})\) is reported, this new data point can be used by the investigative reporter to update the estimate of the salary of the CEO of bank X.
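A short end-to-end sketch of this procedure (the function names and the numerical inputs below are made-up assumptions for illustration only):

```python
import numpy as np

def pareto_hyperparameters(other_ceo_salaries):
    """Estimate the Pareto prior parameters (theta_0, a) from other CEOs' salaries."""
    v = np.asarray(other_ceo_salaries, dtype=float)
    theta0_hat = v.min()                            # hat(theta_0) = sample minimum
    a_hat = len(v) / np.log(v / theta0_hat).sum()   # hat(a) = m / sum of log(v_i / v_(1))
    return theta0_hat, a_hat

def bayes_estimate(employee_salaries, theta0_hat, a_hat):
    """Bayes estimate of the CEO salary given employee salaries and the fitted prior."""
    x = np.asarray(employee_salaries, dtype=float)
    n = len(x)
    return (a_hat + n) / (a_hat + n - 1) * max(theta0_hat, x.max())

# Illustrative inputs (placeholder numbers)
other_ceos = [1.2e6, 3.5e6, 2.1e6, 5.0e6, 1.8e6]   # reported salaries of other banks' CEOs
employees = [4.0e4, 1.2e5, 9.0e4, 2.5e5, 6.0e4]    # interviewed employees of bank X

theta0_hat, a_hat = pareto_hyperparameters(other_ceos)
print(bayes_estimate(employees, theta0_hat, a_hat))
```

Whenever a new CEO salary \(\vartheta_{m+1}\) becomes available, it is simply appended to the sample of other CEOs’ salaries and the two steps are rerun.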

Conjugate priors

In the example above, the prior distribution of the unknown parameter \(\theta\) was chosen on theoretical grounds, based on an economic law (the Pareto principle). In most cases, however, it is not possible to prefer one prior over the others based on such a principle; instead, conjugate priors are selected for computational simplicity.

  • The prior distribution \(\theta\sim\pi(\theta),\,\theta\in\Theta\) (with density \(p(\theta)\)) is called a conjugate prior for the likelihood function \(f(x|\theta)\) if the posterior density \(f(\theta|x)\) belongs to the same family of distributions as the prior.

Bayes’ theorem specifies the following relationship between the likelihood function \(f(x|\theta)\), the prior density \(p(\theta)\), and the posterior density \(f(\theta|x)\):

\[f(\theta|x)=\frac{f(x|\theta) p(\theta)}{\int_\Theta f(x|\vartheta) p(\vartheta) d\vartheta}.\] Therefore, the density \(p(\theta)\) is a conjugate prior for \(f(x|\theta)\) if \(p(\theta)\) and \(f(\theta|x)\) are from the same family.
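In the CEO example this is exactly what happens. As a short sketch (keeping only the factors that depend on \(\theta\)), the uniform likelihood combined with the Pareto prior gives

\[f(\theta|x^n)\propto \frac{1}{\theta^{n}}{\mathbf 1}\{\theta\geq X_{(n)}\}\cdot\frac{a\,\theta_0^a}{\theta^{a+1}}{\mathbf 1}\{\theta\geq\theta_0\}\propto \frac{1}{\theta^{a+n+1}}{\mathbf 1}\{\theta\geq\max(\theta_0,X_{(n)})\},\] which, up to a normalizing constant, is the density of \(Pa(\max(\theta_0,X_{(n)}),a+n).\) Hence the Pareto prior is conjugate for the uniform likelihood, and the posterior mean \(\frac{a+n}{a+n-1}\max(\theta_0,X_{(n)})\) is exactly the Bayes estimator used above.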

Appendix

Pareto distributions

The random variable \(\xi\) has a Pareto distribution with parameters \(\theta>0\) and \(a>0\), \(\xi\sim Pa(\theta,a),\) if its distribution function is given by the formula \[F(x)=1-\left(\frac{\theta}{x}\right)^a,\,x>\theta,\] and \(F(x)=0,\,x\leq \theta.\) The density function then has the form \[f(x)=a\frac{\theta^a}{x^{a+1}},\,x>\theta.\] In the case of \(a>1\) the expectation of a Pareto random variable is \[E\xi=\frac{a}{a-1}\theta.\] If \(a>2,\) then the variance exists as well and equals \[Var(\xi)=\frac{a}{(a-1)^2(a-2)}\theta^2.\]
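A quick numerical check of these formulas (the parameter values and sample size below are arbitrary assumptions), sampling from \(Pa(\theta,a)\) by inverting the distribution function, \(x=\theta(1-u)^{-1/a}\) for \(u\sim{\mathbb U}(0,1)\):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
theta, a = 2.0, 3.0                 # arbitrary Pareto parameters with a > 2
u = rng.uniform(size=1_000_000)

# Inverse-CDF sampling: solve 1 - (theta/x)^a = u for x
xi = theta * (1 - u) ** (-1 / a)

print(xi.mean(), a / (a - 1) * theta)                        # empirical vs. theoretical mean
print(xi.var(), a / ((a - 1) ** 2 * (a - 2)) * theta ** 2)   # empirical vs. theoretical variance
```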

Problems

  1. Show that if \(\theta\) follows the Pareto distribution \(\theta\sim Pa(\theta_0, a),\) then the parameters \(\theta_0\) and \(a\) can be estimated as follows: \[\hat\theta_0=\vartheta_{(1)}=\min(\vartheta_1,\cdots,\vartheta_m),\ \ \hat a=\frac{m}{\sum_{i=1}^m\ln\frac{\vartheta_i}{\vartheta_{(1)}}},\] where \(\vartheta^m=(\vartheta_1,\cdots,\vartheta_m)\) is an i.i.d. sample from \(Pa(\theta_0,a).\)

  2. Suppose that an i.i.d. sample is observed \[X^n=(X_1,\cdots,X_n),\,X_i\sim F(x,\theta),\,\theta\in\Theta,\,x\in R.\] Consider an estimator for \(\theta\) based on the sample \(X^n\) \[\hat\theta_n(X^n)=\hat\theta_n(X_1,\cdots,X_n).\] The mean squared error (MSE) loss \(L(\theta,\hat\theta_n)\) of this estimator is defined as \[E_\theta(\hat\theta_n-\theta)^2=\int_{R^n}(\hat\theta_n(x^n)-\theta)^2dF(x_1,\theta)\cdots dF(x_n,\theta).\] If the prior distribution of the unknown parameter \(\theta\) is given, \(\theta\sim \pi(\theta),\,\theta\in\Theta,\) then the Bayesian risk of the estimator \(\hat\theta_n\) is defined as \[E_\pi L(\theta,\hat\theta_n)=\int_\Theta E_\theta(\hat\theta_n-\theta)^2d \pi(\theta).\] Show that the Bayes estimator, defined as the minimizer of the Bayesian risk, has the form \[\hat\theta_n^B=\arg\min_{\hat\theta_n} E_\pi L(\theta,\hat\theta_n)=\int_{\Theta}\theta f(\theta|x)d\theta,\] where \(f(\theta|x)\) is the posterior distribution of \(\theta.\)

  3. Suppose that \(X_i\sim {\mathbb U}(0,\theta)\) and \(\theta\sim Pa(\theta_0, a)\). Show that the Bayes estimator has the form

\[\hat\theta^B_n=\frac{a+n}{a+n-1}\max(\theta_0,X_1,\cdots,X_n).\]

Samvel B. Gasparyan
Biostatistician in cardiovascular trials.