Bias in the Estimate
The main purpose of inferential statistics is to infer population parameters such as μ and σ from sample statistics such as X̄ and s. You will sometimes see X̄ and s and other statistics referred to as estimators, particularly in the context of inferring population values.
Estimators have several desirable characteristics, and one of them is unbiasedness: averaged across repeated samples, an unbiased estimator equals the parameter it estimates. The sample mean is an unbiased estimator. No special adjustment is needed for X̄ to estimate μ accurately.
But when you divide by N, instead of by the N − 1 degrees of freedom, in the calculation of the variance, you bias the statistic as an estimator. It is then biased negatively: on average, it underestimates the variance in the population.
As discussed in the prior section, that’s the reason to use the degrees of freedom instead of the actual sample size when you infer the population variance from the sample variance. Doing so removes the bias from the estimator.
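The effect of the denominator can be checked exactly on a tiny example. The Python sketch below is an illustration only, not something taken from the text: it assumes a toy population of three values (6, 8, and 10), lists every possible ordered sample of two values drawn with replacement, and averages each version of the variance across those samples. The function and variable names are hypothetical.

```python
from fractions import Fraction
from itertools import product

population = [6, 8, 10]  # assumed toy population, chosen for illustration
n = 2                    # size of each sample

def sum_of_squares(values):
    """Sum of squared deviations from the mean of `values`."""
    mean = Fraction(sum(values), len(values))
    return sum((v - mean) ** 2 for v in values)

# Population variance: divide by the number of values in the population.
population_variance = sum_of_squares(population) / len(population)

# Every ordered sample of size n, drawn with replacement.
samples = list(product(population, repeat=n))

# Average each estimator across all of the possible samples.
mean_var_n = sum(sum_of_squares(s) / n for s in samples) / len(samples)
mean_var_df = sum(sum_of_squares(s) / (n - 1) for s in samples) / len(samples)

print("Population variance:           ", population_variance)  # 8/3
print("Average when dividing by N:    ", mean_var_n)            # 4/3
print("Average when dividing by N - 1:", mean_var_df)           # 8/3
```

Across all nine samples, dividing by N − 1 averages out to the population variance exactly, and dividing by N falls short of it.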
It’s easy to conclude, then, that using N − 1 in the
denominator of the standard deviation also removes its bias as an
estimator of the population standard deviation. But it doesn’t. The
square root of an unbiased estimator is not itself necessarily unbiased.
Much of the bias in the standard deviation is in
fact removed by the use of the degrees of freedom instead of N in the
denominator. But a little is left, and it’s usually regarded as
negligible.
The larger the sample size, of course, the smaller
the correction involved in using the degrees of freedom. With a sample
of 100 values, the difference between dividing by 100 and dividing by
99 is quite small. With a sample of ten values, the difference between
dividing by 10 and dividing by 9 can be meaningful.
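The arithmetic behind that comparison is short enough to sketch in Python. The function name is hypothetical; the ratio n / (n − 1) is simply the factor by which the variance grows when the denominator changes from n to n − 1.

```python
def inflation_from_df(n):
    """Ratio of the (n - 1)-denominator variance to the n-denominator variance."""
    return n / (n - 1)

for n in (100, 10):
    pct = (inflation_from_df(n) - 1) * 100
    print(f"n = {n:3d}: dividing by n - 1 raises the variance by {pct:.1f}%")
```

With 100 values the adjustment is about 1%, and with 10 values it is about 11%.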
Similarly, the degree of bias that remains in the
standard deviation is very small when the degrees of freedom instead of
the sample size is used in the denominator. The standard deviation
remains a biased estimator, but the bias is only about 1% when the
sample size is as small as 20, and the remaining bias becomes smaller
yet as the sample size increases.
Note
You can estimate the bias in the standard deviation
as an estimator of the population standard deviation that remains after
the degrees of freedom has replaced the sample size in the denominator.
When the sample comes from a normally distributed population, this expression is a very nearly unbiased estimator of
the population standard deviation:
(1 + 1 / [4 * {n - 1}]) * s
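To see how much bias remains in s, and how well the multiplier in the Note offsets it, the expected value of s can be computed directly. The Python sketch below relies on a standard result that the text does not state: for samples from a normal population, the long-run average of s equals σ times a constant, often written c4, that depends only on n. The function names are hypothetical.

```python
import math

def c4(n):
    """Expected value of s divided by sigma, for normal samples of size n."""
    return math.sqrt(2 / (n - 1)) * math.exp(
        math.lgamma(n / 2) - math.lgamma((n - 1) / 2)
    )

def note_correction(n):
    """The multiplier from the Note: 1 + 1 / (4 * (n - 1))."""
    return 1 + 1 / (4 * (n - 1))

for n in (5, 10, 20, 100):
    bias_pct = (c4(n) - 1) * 100
    corrected_pct = (c4(n) * note_correction(n) - 1) * 100
    print(f"n = {n:3d}: bias in s = {bias_pct:6.2f}%, "
          f"after correction = {corrected_pct:6.3f}%")
```

With n = 20 the uncorrected bias comes to a little over 1%, and the corrected value is within a small fraction of a percent of σ.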
Degrees of Freedom
The concept of degrees of freedom is important to calculating variances and standard deviations, as the prior section discussed. And as you move from descriptive statistics to inferential statistics, you encounter the concept more and more often. Any inferential analysis, from a simple t-test to a complicated multivariate linear regression, uses degrees of freedom (df) as part of the math and to help evaluate how reliable a result might be.
Unfortunately, degrees of freedom is not a
straightforward concept. It’s usual for people to take longer than they
expect to become comfortable with it.
Fundamentally, degrees of freedom refers to the number of values that are free to vary. It is often true that one or more values in a set are constrained. The number of remaining values in that set, the ones that are unconstrained, is the degrees of freedom.
Consider the mean of three values. Once you have calculated the mean and committed to it, it acts as a constraint. You can then set two of the three values to any two numbers you want, but the third value is constrained by the calculated mean.
Take 6, 8, and 10. Their mean is 8. Two of them are
free to vary, and you could change 6 to 2 and 8 to 24. But because the
mean acts as a constraint, the original 10 is constrained to become −2
if the mean of 8 is to be maintained.
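The same arithmetic can be written as a short Python sketch. The helper name is hypothetical; it just solves for the one value that a fixed mean and N − 1 freely chosen values leave over.

```python
def constrained_value(free_values, mean, n):
    """Return the one value that n - 1 free values and a fixed mean leave over."""
    return n * mean - sum(free_values)

original = [6, 8, 10]
mean = sum(original) / len(original)   # 8.0

# Change the first two values freely; the third is then forced.
third = constrained_value([2, 24], mean, n=3)
print(third)                           # -2.0
```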
When you calculate the deviation of each observation
from the mean, you are imposing a constraint—the calculated mean—on the
values in the sample. All of the observations but one (that is, N − 1
of the values) are free to vary, and with them the sum of the squared
deviations. One of the observations is forced to take on a particular
value, in order to retain the value of the mean.
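One way to see the constraint at work is to note that deviations from the sample mean always sum to zero, so any one deviation can be recovered from the others. The following Python sketch illustrates this with the same three values; the variable names are hypothetical.

```python
values = [6, 8, 10]
mean = sum(values) / len(values)
deviations = [v - mean for v in values]

# Deviations from the sample mean sum to zero.
total = sum(deviations)

# So the last deviation is fixed once the other N - 1 are known.
last_from_others = -sum(deviations[:-1])

print(deviations)         # [-2.0, 0.0, 2.0]
print(total)              # 0.0
print(last_from_others)   # 2.0
```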
Excel’s Variability Functions
The 2010 version of Excel reorganizes and renames
several statistical functions. The aim is to name the functions
according to a more consistent pattern, and to make a function’s
purpose more apparent from its name.
Standard Deviation Functions
For example, Excel has since 1995 offered two functions that return the standard deviation:
STDEV(): This function assumes that its argument list is a sample from a population, and therefore uses N − 1 in the denominator.
STDEVP(): This function assumes that its argument list is the population, and therefore uses N in the denominator.
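For readers who want to check the two denominators outside Excel, Python’s standard statistics module draws the same distinction. This is an analogy offered for illustration, not a description of how Excel computes its results: statistics.stdev() divides by N − 1, and statistics.pstdev() divides by N.

```python
import statistics

values = [6, 8, 10]

sample_sd = statistics.stdev(values)       # N - 1 denominator, as with STDEV()
population_sd = statistics.pstdev(values)  # N denominator, as with STDEVP()

print(sample_sd)                 # 2.0
print(round(population_sd, 4))   # 1.633
```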
In its 2003 version, Excel added two more functions that return the standard deviation:
STDEVA(): This function works like STDEV() except that it also accepts text values and Boolean (TRUE or FALSE) values in its argument list. Text values and FALSE values are treated as zeroes, and TRUE values are treated as ones.
STDEVPA(): This function accepts text and Boolean values, just as STDEVA() does, but it assumes that the argument list constitutes a population.
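The treatment of nonnumeric values described above can be mimicked in a few lines of Python. This is a sketch of the rule as stated here (text and FALSE count as 0, TRUE counts as 1), not Excel’s actual implementation; the helper name and the sample cell values are hypothetical.

```python
import statistics

def coerce_like_stdeva(value):
    """Hypothetical helper: map a cell value to a number by the rule above."""
    if isinstance(value, bool):   # test bool first; bool is a subclass of int
        return 1 if value else 0
    if isinstance(value, str):    # any text counts as zero
        return 0
    return value

cells = [6, 8, 10, "weasel", True]
numbers = [coerce_like_stdeva(c) for c in cells]   # [6, 8, 10, 0, 1]

print(statistics.stdev(numbers))    # sample version, N - 1 denominator
print(statistics.pstdev(numbers))   # population version, N denominator
```

The one text entry and the one TRUE pull the mean from 8 down to 5 and more than double the sample standard deviation, which shows how strongly nonnumeric entries can distort the result.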
Microsoft decided that using P, for population, at the end of the function name STDEVP() was inconsistent because there was no STDEVS(). That would never do, and to remedy the situation, Excel 2010 includes two new standard deviation functions that append a letter to the function name in order to tell you whether it’s intended for use with a sample or with a population: STDEV.S() and STDEV.P().
STDEV.S() and STDEV.P() are termed consistency functions because they introduce a new naming convention that is more consistent than the earlier one. Microsoft also states that their computation algorithms return more accurate results than those of STDEV() and STDEVP().
Excel 2010 continues to support the old STDEV() and
STDEVP() functions, although it is not at present clear how long they
will continue to be supported. In recognition of their deprecated
status, STDEV() and STDEVP() occupy the bottom of the list of functions
that appears in a pop-up window when you begin to type =STD in a worksheet cell. Excel 2010 refers to them as compatibility functions.
Variance Functions
Similar considerations apply to the worksheet
functions that return the variance. The function’s name is used to
indicate whether it is intended for a population or to infer a
population value from a sample, and whether it can deal with nonnumeric
values in its arguments.
VAR() has been available in Excel since its earliest versions. It returns an unbiased estimate of a population variance based on values from a sample and uses degrees of freedom in the denominator. It is the square of STDEV().
VARP() has been available in Excel for as long as VAR(). It returns the variance of a population and uses the number of records, not the degrees of freedom, in the denominator. It is the square of STDEVP().
VARA() made its first appearance in Excel 2003. See the discussion of STDEVA() for the difference between VAR() and VARA().
The documentation for Excel 2010 stresses the notion of consistency in the naming of functions: If a function shows that it’s intended for use with a population by means of an appended letter P, then the name of a function intended for use with a sample should behave the same way. It should have the letter S appended to it.
That’s fair enough, so Excel 2010 offers its users
STDEV.P for use with a population and STDEV.S for use with a sample.
However, what if we want to include text and/or Boolean values in the
argument to the function? In that case, we must resort to the 2003
functions STDEVA() and STDEVPA(). Notice, though, these points:
One, there is no STDEVSA(), as consistency with STDEVPA() would imply.
Two, there is no period separating STDEV from the rest of the function name in STDEVPA(), as there is with STDEV.P and STDEV.S.
Three, neither STDEVA() nor STDEVPA() is flagged as
deprecated in the function pop-up window, so there is apparently no
intent to supplant them with something such as STDEV.S.A() or
STDEV.P.A().
As to the enhancement of STDEV() with STDEVA(), and
STDEVP() with STDEVPA(), Microsoft documentation suggests that they
were supplied for consistency with 2003’s VARA() and COUNTA(), which
also allow for text and Boolean values. If so, it is what Emerson
referred to as “a foolish consistency.” When a user finds that he or
she needs to calculate the standard deviation of a set of values that
might include the word weasel or the
logical value FALSE, then that user has done a poor job of planning
either the layout of the worksheet or the course of the analysis.
I do not put these complaints here in order to
assert my right to rant. I put them here so that, if they have also
occurred to you, you’ll know that you’re not alone in your thoughts.
VARPA() also first appeared in Excel 2003 and takes the same approach to its nonnumeric arguments as does STDEVPA().
VAR.S() is new in Excel 2010. Microsoft states that its computations are more accurate than those used by VAR(). Its use and intent are the same as those of VAR().
VAR.P() is new in Excel 2010. Its similarities to VARP() are analogous to those between VAR.S() and VAR().
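The relationship between the variance functions and the standard deviation functions can be confirmed with Python’s standard statistics module, used here only as an analogy to the Excel functions: statistics.variance() divides by N − 1, and statistics.pvariance() divides by N. The sample values are chosen for illustration.

```python
import math
import statistics

values = [6.0, 8.0, 10.0]

sample_var = statistics.variance(values)       # N - 1 denominator, as with VAR() and VAR.S()
population_var = statistics.pvariance(values)  # N denominator, as with VARP() and VAR.P()

# Each variance is the square of the matching standard deviation.
print(math.isclose(sample_var, statistics.stdev(values) ** 2))        # True
print(math.isclose(population_var, statistics.pstdev(values) ** 2))   # True
```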