Dr. Arsham's Statistics Site

Fact No.	The Mode	The Median	The Mean
1	It is the most frequent value in the distribution; it is the point of greatest density.	It is the value of the middle point of the array (not the midpoint of the range), such that half the items are above it and half below it.	It is the value in a given aggregate which would obtain if all the values were equal.
2	The value of the mode is established by the predominant frequency, not by the values in the distribution.	The value of the median is fixed by its position in the array and does not reflect the individual values.	The sums of the deviations on either side of the mean are equal; hence, the algebraic sum of the deviations is equal to zero.
3	It is the most probable value, hence the most typical.	The aggregate distance between the median point and all the values in the array is less than from any other point.	It reflects the magnitude of every value.
4	A distribution may have 2 or more modes. On the other hand, there is no mode in a rectangular distribution.	Each array has one and only one median.	An array has one and only one mean.
5	The mode does not reflect the degree of modality.	It cannot be manipulated algebraically: medians of subgroups cannot be weighted and combined.	Means may be manipulated algebraically: means of subgroups may be combined when properly weighted.
6	It cannot be manipulated algebraically: modes of subgroups cannot be combined.	It is stable in that grouping procedures do not affect it appreciably.	It may be calculated even when individual values are unknown, provided the sum of the values and the sample size n are known.
7	It is unstable in that it is influenced by grouping procedures.	Values must be ordered, and may be grouped, for computation.	Values need not be ordered or grouped for this calculation.
8	Values must be ordered and grouped for its computation.	It can be computed when ends are open.	It cannot be calculated from a frequency table when ends are open.
9	It can be calculated when table ends are open.	It is not applicable to qualitative data.	It is stable in that grouping procedures do not seriously affect it.
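Fact 5 in the table can be checked numerically. The following is a minimal sketch in Python; the two subgroups and the helper name `combined_mean` are made up for illustration and are not part of the original material.

```python
from statistics import mean, median

def combined_mean(group_means, group_sizes):
    """Combine subgroup means, weighting each by its subgroup size."""
    total = sum(group_sizes)
    return sum(m * n for m, n in zip(group_means, group_sizes)) / total

# Hypothetical subgroups, chosen only for illustration.
group_a = [2, 4, 6]
group_b = [10, 20, 30, 40, 100]
pooled = group_a + group_b

from_groups = combined_mean(
    [mean(group_a), mean(group_b)],
    [len(group_a), len(group_b)],
)

# The properly weighted combination reproduces the mean of the pooled data.
print(from_groups, mean(pooled))

# The pooled median is not a weighted combination of the subgroup medians.
print(median(group_a), median(group_b), median(pooled))
```

The subgroup medians here are 4 and 30, while the pooled median is 15, which no size-based weighting of 4 and 30 reproduces.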

Fact No.	The Quartile Deviation	The Mean Absolute Deviation	The Standard Deviation
1	The quartile deviation is also easy to calculate and to understand. However, it is unreliable if there are gaps in the data around the quartiles.	The mean absolute deviation has the advantage of giving equal weight to the deviation of every value from the mean or median.	The standard deviation is usually more useful and better adapted to further analysis than the mean absolute deviation.
2	It depends on only 2 values, which include the middle half of the items.	Therefore, it is a more sensitive measure of dispersion than those described above and ordinarily has a smaller sampling error.	It is more reliable as an estimator of the population dispersion than other measures, provided the distribution is normal.
3	It is usually superior to the range as a rough measure of dispersion.	It is also easier to compute and to understand and is less affected by extreme values than the standard deviation.	It is the most widely used measure of dispersion and the easiest to handle algebraically.
4	It may be determined in an open-end distribution, or one in which the data may be ranked but not measured quantitatively.	Unfortunately, it is difficult to handle algebraically, since minus signs must be ignored in its computation.	Compared with the others, it is harder to compute and more difficult to understand.
5	It is also useful in badly skewed distributions or those in which other measures of dispersion would be warped by extreme values.	Its main application is in modeling accuracy for comparative forecasting techniques.	It is generally affected by extreme values that may be due to skewness of data.
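The three measures can be compared on a small data set. This is a sketch only: the data are hypothetical, the helper names are my own, and the quartiles follow the default ("exclusive") convention of Python's `statistics.quantiles`, which is one of several conventions in use.

```python
from statistics import mean, median, pstdev, quantiles

def quartile_deviation(data):
    """Half the interquartile range, (Q3 - Q1) / 2."""
    q1, _, q3 = quantiles(data, n=4)
    return (q3 - q1) / 2

def mean_absolute_deviation(data, center=mean):
    """Average absolute deviation from the chosen center (mean or median)."""
    c = center(data)
    return sum(abs(x - c) for x in data) / len(data)

# Hypothetical sample with one extreme value (40).
data = [2, 4, 4, 5, 7, 9, 40]

print("quartile deviation:", quartile_deviation(data))
print("mean abs. deviation (about mean):", mean_absolute_deviation(data))
print("mean abs. deviation (about median):", mean_absolute_deviation(data, median))
print("standard deviation:", pstdev(data))
```

On data like these the extreme value inflates the standard deviation most and the quartile deviation least, in line with the table.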

An important class of decision problems under uncertainty is characterized by the small chance of the occurrence of a particular event, such as an accident. The Poisson probability function gives the probability of exactly x independent occurrences during a given period of time, if events take place independently and at a constant rate. The Poisson probability function can also represent the number of occurrences over constant areas or volumes.

Poisson probabilities are often used, for example, in quality control, software and hardware reliability, insurance claims, the number of incoming telephone calls, and queuing theory.

An Application: One of the most useful applications of the Poisson distribution is in the field of queuing theory. In many situations where queues occur it has been shown that the number of people joining the queue in a given time period follows the Poisson model. For example, if the rate of arrivals to an emergency room is λ per unit of time (say 1 hr), then:

P (n arrivals) = λⁿ e^-λ / n!

The mean and variance of the random variable n are both λ. However, a random variable whose mean and variance have equal numerical values is not necessarily Poisson distributed. The mode of the Poisson distribution is within the interval [λ - 1, λ].

Applications:

P (0 arrivals) = e^-λ
P (1 arrival) = λ e^-λ / 1!
P (2 arrivals) = λ² e^-λ / 2!

and so on. In general:

P (n+1 arrivals) = λ P (n arrivals) / (n+1).
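As an illustration, the probabilities can be computed either directly from the formula or step by step, using the fact that each probability is the previous one multiplied by λ and divided by the new number of arrivals. The Python function names below are hypothetical, and the rate of 1.5 is an arbitrary example value.

```python
import math

def poisson_pmf(n, lam):
    """P(n arrivals) = lam**n * exp(-lam) / n!"""
    return lam ** n * math.exp(-lam) / math.factorial(n)

def poisson_pmf_table(n_max, lam):
    """Probabilities for 0..n_max arrivals, built recursively:
    P(n + 1) = lam * P(n) / (n + 1), starting from P(0) = exp(-lam)."""
    probs = [math.exp(-lam)]
    for n in range(n_max):
        probs.append(probs[-1] * lam / (n + 1))
    return probs

lam = 1.5  # arbitrary example rate per unit of time
table = poisson_pmf_table(10, lam)
for n, p in enumerate(table[:4]):
    print(n, round(p, 4), round(poisson_pmf(n, lam), 4))
```

The recursive form avoids computing large powers and factorials, which is convenient when many consecutive probabilities are needed.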

Normal approximation for Poisson: All Poisson tables are limited in their scope; therefore, it is sometimes necessary to use the standard normal distribution in computing Poisson probabilities. The following numerical example illustrates how good the approximation could be.

Numerical Example: Emergency patients arrive at a large hospital at the rate of 0.033 per minute. What is the probability of exactly two arrivals during the next 30 minutes?

The arrival rate during 30 minutes is λ = (30)(0.033) ≈ 1. Therefore,

P (2 arrivals) = [1² /(2!)] e^-1 = 18%

The mean and standard deviation of the distribution are:

μ = λ = 1, and σ = λ^½ = 1,

respectively; therefore, the standardized bounds for r = 2 arrivals, using the continuity correction (which always widens the interval), are:

z₁ = [(r - 1/2) - μ] / σ = (1.5 - 1)/1 = 0.5, and

z₂ = [(r + 1/2) - μ] / σ = (2.5 - 1)/1 = 1.5.

Therefore, the approximated P (2 arrivals) is P (z being within the interval 0.5, 1.5). Now, by using the standard normal table, we obtain:

P (2 arrivals) = 0.43319 - 0.19146 = 24%

As you can see, the approximation overestimates the exact probability (24% versus 18%); therefore the error is on the safe side. For large values of λ, say over 20, one may use the Normal approximation to calculate Poisson probabilities.
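The numerical example can be reproduced with a short Python sketch. The function names are my own, and the standard normal distribution function is obtained from `math.erf` instead of a table; the second comparison, at a rate of 25, is an added illustration of the "large rate" remark.

```python
import math

def std_normal_cdf(z):
    """Standard normal distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def poisson_exact(r, lam):
    return lam ** r * math.exp(-lam) / math.factorial(r)

def poisson_normal_approx(r, lam):
    """Normal approximation of P(r arrivals) with the continuity correction."""
    mu, sigma = lam, math.sqrt(lam)
    z1 = (r - 0.5 - mu) / sigma
    z2 = (r + 0.5 - mu) / sigma
    return std_normal_cdf(z2) - std_normal_cdf(z1)

# The example in the text: rate 1 over 30 minutes, exactly 2 arrivals.
exact_small = poisson_exact(2, 1.0)
approx_small = poisson_normal_approx(2, 1.0)
print(round(exact_small, 4), round(approx_small, 4))

# A larger rate, where the two values are much closer.
exact_large = poisson_exact(25, 25.0)
approx_large = poisson_normal_approx(25, 25.0)
print(round(exact_large, 4), round(approx_large, 4))
```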

Notice that by taking the square root of a Poisson random variable, the transformed variable is more symmetric. This is a useful transformation in regression analysis of Poisson observations.

You might like to use Poisson Probability Function JavaScript to perform your computation, and Testing Poisson to perform the goodness-of-fit test.

Further Reading:
Barbour et al., Poisson Approximation, Oxford University Press, 1992.

Student T-Density Function

The t distributions were discovered in 1908 by William Gosset, who was a chemist and a statistician employed by the Guinness brewing company. He considered himself a student still learning statistics, so he signed his papers with the pseudonym "Student". Or perhaps he used a pseudonym because of "trade secret" restrictions imposed by Guinness.

Note that there are different t-distributions; it is a class of distributions. When we speak of a specific t distribution, we have to specify the degrees of freedom. The t density curves are symmetric and bell-shaped like the normal distribution and have their peak at 0. However, the spread is more than that of the standard normal distribution. The larger the degrees of freedom, the closer the t-density is to the normal density.

The shape of a t-distribution depends on a parameter called "degree-of-freedom". As the degree-of-freedom gets larger, the t-distribution gets closer and closer to the standard normal distribution. For practical purposes, the t-distribution is treated as the standard normal distribution when degree-of-freedom is greater than 30.

Suppose we have two independent random variables: Z, which has the standard normal distribution, and χ², which has a Chi-square distribution with (n-1) d.f.; then the random variable:

Z / [χ² / (n-1)]^½

has a t-distribution with (n-1) d.f. For large sample size (say, n over 30), the new random variable has an expected value equal to zero, and its variance is (n-1)/(n-3), which is close to one.

Notice that the t-statistic is related to the F-statistic as follows: F = t², where F has d.f.₁ = 1 and d.f.₂ equal to the d.f. of the t-statistic.

You might like to use Student t-Density to obtain its P-values.

Triangular Density Function

The triangular distribution shows the number of successes when you know the minimum, maximum, and most likely values. For example, you could describe the number of intakes seen per week when past intake data show the minimum, maximum, and most likely number of cases seen. It has a continuous probability distribution.

The parameters for the triangular distribution are Minimum, Maximum, and Likeliest. There are three conditions underlying triangular distribution:

The minimum number of items is fixed.
The maximum number of items is fixed.
The most likely number of items falls between the minimum and maximum values.

These three parameters form a triangular-shaped distribution, which shows that values near the minimum and maximum are less apt to occur than those near the most likely value.
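A minimal Python sketch of the triangular density and of sampling from it follows. The intake figures (5, 12, 30) are hypothetical, the function name `triangular_pdf` is my own, and the density assumes the most likely value lies strictly between the minimum and the maximum.

```python
import random

def triangular_pdf(x, minimum, likeliest, maximum):
    """Triangular density; assumes minimum < likeliest < maximum."""
    if x < minimum or x > maximum:
        return 0.0
    span = maximum - minimum
    if x <= likeliest:
        return 2 * (x - minimum) / (span * (likeliest - minimum))
    return 2 * (maximum - x) / (span * (maximum - likeliest))

# Hypothetical weekly intake figures, for illustration only.
minimum, likeliest, maximum = 5, 12, 30

rng = random.Random(0)  # seeded so the run is repeatable
# Note the argument order of random.triangular: low, high, mode.
draws = [rng.triangular(minimum, maximum, likeliest) for _ in range(10_000)]

print(triangular_pdf(likeliest, minimum, likeliest, maximum))  # peak: 2 / (maximum - minimum)
print(min(draws), max(draws))
```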

Further Reading:
Evans M., N. Hastings, and B. Peacock, Triangular Distribution, Ch. 40 in Statistical Distributions, Wiley, pp. 187-188, 2000.

Uniform Density Function

The uniform density function gives the probability that an observation will occur within a particular interval [a, b] when the probability of occurrence within that interval is directly proportional to the interval length. Its mean and variance are:

μ = (a+b)/2, σ² = (b-a)²/12.
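The mean and variance formulas can be checked against simulated data. This Python sketch uses an arbitrary interval [2, 10] and a seeded generator; the helper name is hypothetical, and the simulated values will only be close to, not equal to, the theoretical ones.

```python
import random
from statistics import fmean, pvariance

def uniform_mean_variance(a, b):
    """Theoretical mean and variance of the uniform distribution on [a, b]."""
    return (a + b) / 2, (b - a) ** 2 / 12

a, b = 2.0, 10.0  # arbitrary example interval
theory_mean, theory_var = uniform_mean_variance(a, b)

rng = random.Random(12345)  # seeded so the run is repeatable
sample = [rng.uniform(a, b) for _ in range(100_000)]

print(theory_mean, fmean(sample))
print(theory_var, pvariance(sample))
```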

Applications: Used to generate random numbers in sampling and Monte Carlo simulation.

Comments: Special case of beta distribution.

You might like to use Goodness-of-Fit Test for Uniform and to perform some numerical experimentation for a deeper understanding of the concepts.

Notice that any Uniform distribution has an uncountable number of modes with equal density value; therefore it is considered a homogeneous population.

Further Reading:
Balakrishnan N., and V. Nevzorov, A Primer on Statistical Distributions, Wiley, 2003.

Necessary Conditions for Statistical Decision Making

Introduction to Inferential Data Analysis Necessary Conditions: Do not just learn formulas and number-crunching. Learn about the conditions under which statistical testing procedures apply. The following conditions are common to almost all statistical tests:

That is, all three population proportions are almost identical. The sample data from each of the three populations are given in the following table:

The Chi-square statistic is 8.95 with d.f. = (3-1)(3-1) = 4. The p-value is equal to 0.062, indicating that there is moderate evidence against the null hypothesis that the three populations are statistically identical.

Distribution-free Equality of Two Populations

Prior to applying the K-S test it is necessary to arrange each of the two sample observations in a frequency table. The frequency table must have a common classification. Therefore the test is based on the frequency table, which belongs to the family of distribution-free tests.

An Application: The daily sales of the two subsidiaries of The PC & Accessories Company are shown in the following table, with n1 = 44, and n2 = 54:

Daily Sales at Two Branches Over 6 Months
Sales ($1000)	Frequency I	Frequency II
0 - 2	11	1
3 - 5	7	3
6 - 8	8	6
9 - 11	3	12
12 - 14	5	12
15 - 17	5	14
18 - 20	5	6
Sums	44	54

The manager of the first branch is claiming that "since the daily sales are random phenomena, my overall performance is as good as the other manager's performance." In other words:

H₀: The daily sales at the two stores are almost the same.
H_a: The performance of the managers is significantly different.

Following the above process for this test, the K-S statistic is 0.421 with a p-value of 0.0009, indicating strong evidence against the null hypothesis. There is enough evidence that the performance of the manager of the second branch is better.
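The mechanics of the statistic can be sketched in Python as the largest absolute gap between the two cumulative relative frequency columns over the common classes. This is an assumption about the procedure referred to above, not a reproduction of it: on the tabulated frequencies the sketch gives a value near 0.41, not exactly the 0.421 quoted. The function name is hypothetical.

```python
def ks_statistic_from_frequencies(freq_1, freq_2):
    """Largest absolute difference between two cumulative relative
    frequency distributions defined on the same classes."""
    n1, n2 = sum(freq_1), sum(freq_2)
    cum_1 = cum_2 = 0
    d_max = 0.0
    for f1, f2 in zip(freq_1, freq_2):
        cum_1 += f1
        cum_2 += f2
        d_max = max(d_max, abs(cum_1 / n1 - cum_2 / n2))
    return d_max

# Frequencies from the daily sales table (classes 0-2, 3-5, ..., 18-20).
branch_1 = [11, 7, 8, 3, 5, 5, 5]
branch_2 = [1, 3, 6, 12, 12, 14, 6]

d = ks_statistic_from_frequencies(branch_1, branch_2)
print(round(d, 3))
```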

Introduction to Applications of the Chi-square Statistic

The variance is not the only thing for which you can use a Chi-square test.

The most widely used applications of the Chi-square distribution are:

The Chi-square Test for Association, which is a non-parametric test; therefore, it can be used for nominal data too. It is a test of statistical significance widely used in bivariate tabular association analysis. Typically, the hypothesis is whether or not two populations are different in some characteristic or aspect of their behavior based on two random samples. This test procedure is also known as the Pearson Chi-square test.

The Chi-square Goodness-of-Fit Test is used to test if an observed distribution conforms to any particular distribution. Calculation of this goodness-of-fit test is by comparison of observed data with data expected based on a particular distribution.

One of the disadvantages of some of the Chi-square tests is that they do not permit the calculation of confidence intervals; therefore, determination of the sample size is not readily available.

Treatment of Cases with Many Categories: Notice that, although in the following section most of the crosstables have only two categories, it is always possible to convert cases with many categories into similar crosstables. To do so, one must consider all possible pairs of categories and their numerical values while constructing the equivalent "two-categories" crosstable.

Test for Crosstable Relationship

Crosstables: Crosstables are often used to test the relationship between two categorical variables, or their independence, such as cigarette smoking and drug use. If you were to survey 1000 people on whether or not they smoke and whether or not they use drugs, you would get one of four answers from each person: (no, no), (no, yes), (yes, no), (yes, yes).

By compiling the number of people in each category, you can ultimately test whether drug usage is independent of cigarette smoking by using the Chi-square distribution (this is approximate, but works well). Again, the methodology for this is in your textbook. The degrees of freedom are equal to (number of rows - 1)(number of columns - 1). That is, this many numbers are needed to fill in the entire body of the crosstable; the rest are determined by the given row sums and column sums.

Do not forget the conditions for the validity of the Chi-square test, in particular expected values greater than 5 in 80% or more of the cells. Otherwise, one could use an "exact" test, using either a permutation or resampling approach.

Using Chi-square in a 2x2 table requires Yates's correction. One first subtracts 0.5 from the absolute difference between the observed and expected frequencies in each cell before squaring, dividing by the expected frequency, and summing. The formula for the Chi-square value in a 2x2 table can be derived from the Normal Theory comparison of the two proportions in the table using the total incidence to produce the standard errors. The rationale of the correction is a better equivalence of the area under the normal curve and the probabilities obtained from the discrete frequencies. In other words, the simplest correction is to move the cut-off point for the continuous distribution from the observed value of the discrete distribution to midway between that and the next value in the direction of the null hypothesis expectation. Therefore, the correction essentially applies only to one d.f. tests where the "square root" of the Chi-square looks like a "normal/t-test" and where a direction can be attached to the 0.5 addition.

Chi-square distribution is used as an approximation of the binomial distribution. By applying a continuity correction, we get a better approximation of the binomial distribution for the purposes of calculating tail probabilities.

Given the following 2x2 table, one may compute some relative risk measures:

a	b
c	d

The most usual measures are:

Rate-difference: a/(a+c) - b/(b+d)
Rate-ratio: (a/(a+c))/(b/(b+d))
Odds-ratio: ad/bc

The rate difference and rate ratio are appropriate when you are contrasting two groups whose sizes (a+c and b+d) are given. The odds ratio is for when the issue is association rather than difference.

The risk-ratio (RR) is the ratio of the proportion (a/(a+b)) to the proportion (c/(c+d)):

RR = (a / (a + b)) / (c / (c + d))

RR is thus a measure of how much larger the proportion in the first row is compared to the second. An RR value of < 1.00 indicates a 'negative' association [a/(a+b) < c/(c+d)], 1.00 indicates no association [a/(a+b) = c/(c+d)], and > 1.00 indicates a 'positive' association [a/(a+b) > c/(c+d)]. The further from 1.00 the RR is, the stronger the association.
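The four measures defined above can be collected in one small Python function. The cell counts in the example are hypothetical and the function name is my own; zero cells would need separate handling, which this sketch omits.

```python
def risk_measures(a, b, c, d):
    """Measures for the 2x2 table with rows (a, b) and (c, d).

    Rate difference and rate ratio compare the two columns;
    the risk ratio compares the two rows.
    """
    return {
        "rate_difference": a / (a + c) - b / (b + d),
        "rate_ratio": (a / (a + c)) / (b / (b + d)),
        "odds_ratio": (a * d) / (b * c),
        "risk_ratio": (a / (a + b)) / (c / (c + d)),
    }

# Hypothetical counts, for illustration only.
measures = risk_measures(a=30, b=10, c=20, d=40)
for name, value in measures.items():
    print(name, round(value, 3))
```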

An Application: Suppose a counselor of a school in a small town is interested in whether the curriculum chosen by students is related to the occupation of their parents. It is necessary to record the data as shown in the following contingency table with two rows (r1, r2) and three columns (c1, c2, c3):

Relationship between occupation of parents and curriculum chosen by high school students
Parental Occupation	College prep	Vocational	General	Totals
Professional	12	2	6	20
Blue collar	6	6	8	20
Totals	18	8	14	40

Under the hypothesis that there is no relation, the expected (E) frequency would be:

E_{i, j} = (Σr_i)(Σc_j)/N

The Observed (O) and Expected (E) frequencies are recorded in the following table:

Expected frequencies for the data
	College prep	Vocational	General	Totals
Professional	O = 12, E = 9	O = 2, E = 4	O = 6, E = 7	ΣO = 20, ΣE = 20
Blue collar	O = 6, E = 9	O = 6, E = 4	O = 8, E = 7	ΣO = 20, ΣE = 20
Totals	ΣO = 18, ΣE = 18	ΣO = 8, ΣE = 8	ΣO = 14, ΣE = 14	

The quantity

χ² = Σ [(O - E)² / E]

is a measure of the degree of deviation between the Observed and Expected frequencies. If there is no relationship between the row variable and the column variable this measure will be very close to zero. Under the hypothesis that there is no relationship between the rows and the columns, this quantity has a Chi-square distribution with parameter equal to number of rows minus 1, multiplied by number of columns minus 1.

For this numerical example we have:

χ² = Σ [(O - E)² / E] = 30/7 = 4.3

with d.f. = (2-1)(3-1) = 2, which has a p-value of 0.12, suggesting little or no real evidence against the null hypothesis.
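The whole calculation for this crosstable can be sketched in Python. The function names are hypothetical. The p-value helper uses the closed form of the Chi-square upper tail that holds only for an even number of degrees of freedom, which is an implementation shortcut of this sketch and covers the 2 d.f. case here.

```python
import math

def chi_square_crosstable(observed):
    """Return (expected, chi2, df) for an r x c table of observed counts."""
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(col) for col in zip(*observed)]
    n = sum(row_sums)
    # Expected frequency of a cell: (row total)(column total) / N.
    expected = [[r * c / n for c in col_sums] for r in row_sums]
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_sums) - 1) * (len(col_sums) - 1)
    return expected, chi2, df

def chi2_upper_tail_even_df(x, df):
    """Upper-tail probability of a Chi-square variable; even d.f. only."""
    if df <= 0 or df % 2:
        raise ValueError("this closed form needs a positive even d.f.")
    half = x / 2
    return math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(df // 2))

observed = [[12, 2, 6],   # Professional
            [6, 6, 8]]    # Blue collar
expected, chi2, df = chi_square_crosstable(observed)
p_value = chi2_upper_tail_even_df(chi2, df)
print(expected, round(chi2, 2), df, round(p_value, 3))
```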

The main question is how large is this measure. The maximum value of this measure is:

χ²_max = N(A-1),

where A is the number of rows or columns, whichever is smaller. For our numerical example it is 40(2-1) = 40.

The coefficient of determination, which has a range of [0, 1], provides the relative strength of the relationship, computed as

χ²/χ²_max = 4.3/40 = 0.11

Therefore we conclude that the degree of association is only 11% which is fairly weak.

Alternatively, you could also look at the contingency coefficient φ statistic, which is:

φ = [χ²/(N + χ²)]^½ = 0.31

This statistic ranges between 0 and 1 and can be interpreted like the correlation coefficient. This measure also indicates that the curriculum chosen by students is related to the occupation of their parents.
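Both strength measures follow directly from the Chi-square value, the total count N, and the table dimensions. A minimal Python sketch, with a hypothetical function name, applied to the example's values (Chi-square 30/7, N = 40, a 2 by 3 table):

```python
import math

def association_strength(chi2, n, n_rows, n_cols):
    """Return (chi2_max, chi2 / chi2_max, contingency coefficient)."""
    chi2_max = n * (min(n_rows, n_cols) - 1)
    ratio = chi2 / chi2_max
    contingency = math.sqrt(chi2 / (n + chi2))
    return chi2_max, ratio, contingency

chi2_max, ratio, contingency = association_strength(30 / 7, 40, 2, 3)
print(chi2_max, round(ratio, 2), round(contingency, 2))
```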

You might like to use Chi-square Test for Crosstable Relationship in performing this test, and the P-values for the Popular Distributions JavaScript to find out the p-values of the Chi-square statistic.

Further Readings:
Agresti A., Categorical Data Analysis, Wiley, 2002.
Fleiss J., Statistical Methods for Rates and Proportions, Wiley, 1981.

Identical Populations Test for Crosstable Data

The test of homogeneity is much like the Test for Crosstable Relationship in that both deal with the cross-classification of nominal data; that is, r × c tables. The method of computing the Chi-square statistic is the same for both tests, with the same d.f.

The two tests differ, however, in the following respect. The Test for Crosstable Relationship is made on data drawn from a single population (with fixed total) where one is concerned with whether one set of attributes is independent of another set. The test for homogeneity, on the other hand, is designed to test the null hypothesis that two or more random samples are drawn from the same population, rather than from different populations, according to some criterion of classification applied to the samples.

The homogeneity test is concerned with the question: Are the samples drawn from populations that are homogeneous (i.e., the same) with respect to some criterion of classification?

In the crosstable for this test, either the row or the column categories may represent the populations from which the samples are drawn.

An Application: Suppose a board of directors of a labor union wishes to survey the opinion of its members regarding a change in its constitution. The following table shows the result of the survey sent to three union locals:

Reactions of a Sample of Three Locals Group Members
Reaction	Local A	Local B	Local C
In Favor	18	22	10
Against	7	14	9
No Response	5	4	11

The problem is not to determine whether or not the union members are in favor of the change. The question is to test if there is a significant difference in the proportions of opinion of the three populations' members concerning the proposed change.

The Chi-square statistic is 9.58 with d.f. = (3-1)(3-1) = 4. The p-value is equal to 0.048, indicating that there is moderate evidence against the null hypothesis that the three union locals are the same.

You might like to use Populations Homogeneity Test to perform this test.

Further Readings:
Agresti A., Categorical Data Analysis, Wiley, 2002.
Clark Ch., and L. Schkade, Statistical Analysis for Administrative Decisions, South-Western Pub., 1979.

Test for Equality of Several Population Medians

Generally, the median provides a better measure of location than the mean when there are some extremely large or small observations; i.e., when the data are skewed to the right or to the left. For this reason, median income is used as the measure of location for the U.S. household income.

Suppose we are interested in testing the equality of the medians of k number of populations with respect to the same continuous random variable.

The first step in calculating the test statistic is to compute the common median of the k samples combined. Then, determine for each group the number of observations falling above and below the common median. The resulting frequencies are arranged in a 2 by k crosstable. If the k samples are, in fact, from populations with the same median, one expects about one half of the scores in each sample to be above the combined median and about one half to be below. In the case that some observations are equal to the combined median, one may drop those few observations in constructing the 2 by k crosstable. Under this condition, the Chi-square statistic may be computed and its p-value obtained from the Chi-square distribution with d.f. = k-1.

An illustrative application: Do public and private primary school teachers differ with respect to their salary? The data from a random sample are given in the following table (in thousands of dollars per year).

Public	Private	Public	Private
35	29	25	50
26	50	27	37
27	43	45	34
21	22	46	31
27	42	33
38	47	26
23	42	46
25	32	41

The test of hypothesis is:

H₀: The public and private school teachers' salaries are almost the same.

The median of all data (i.e., combined) is 33.5. Now determine in each group the number of observations falling above and below the common median of 33.5. The resulting frequencies are shown in the following table:

Crosstable for the public and private school teachers' salaries
	Public	Private	Total
Above median	6	8	14
Below median	10	4	14
Total	16	12	28

The Chi-square statistic based on this table is 2.33. The p-value for the computed test statistic with d.f. = (2-1)(2-1) = 1 is 0.127; therefore, we are unable to reject the null hypothesis.
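The steps of the median test can be traced on the salary data with a short Python sketch. The function names are hypothetical. The p-value helper is valid only for 1 d.f. (two groups), where the Chi-square upper tail equals `erfc` of the square root of half the statistic.

```python
import math
from statistics import median

def median_test(groups):
    """Counts above/below the combined median and the Chi-square statistic."""
    combined = median([x for g in groups for x in g])
    above = [sum(1 for x in g if x > combined) for g in groups]
    below = [sum(1 for x in g if x < combined) for g in groups]  # ties are dropped
    col_sums = [a + b for a, b in zip(above, below)]
    n = sum(col_sums)
    chi2 = 0.0
    for row in (above, below):
        row_sum = sum(row)
        for observed, col_sum in zip(row, col_sums):
            expected = row_sum * col_sum / n
            chi2 += (observed - expected) ** 2 / expected
    return combined, above, below, chi2

def chi2_upper_tail_1df(x):
    """Upper-tail probability of a Chi-square variable with 1 d.f."""
    return math.erfc(math.sqrt(x / 2))

public = [35, 26, 27, 21, 27, 38, 23, 25, 25, 27, 45, 46, 33, 26, 46, 41]
private = [29, 50, 43, 22, 42, 47, 42, 32, 50, 37, 34, 31]

combined, above, below, chi2 = median_test([public, private])
p_value = chi2_upper_tail_1df(chi2)
print(combined, above, below, round(chi2, 2), round(p_value, 3))
```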

You might like to use Testing Medians to perform this test.

Goodness-of-Fit Test for Probability Mass Functions

There are other tests that might use the Chi-square, such as goodness-of-fit test for discrete random variables. Therefore, Chi-square is a statistical test that measures "goodness-of-fit". In other words, it measures how much the observed or actual frequencies differ from the expected or predicted frequencies. Using a Chi-square table will enable you to discover how significant the difference is. A null hypothesis in the context of the Chi-square test is the model that you use to calculate your expected or predicted values. If the value you get from calculating the Chi-square statistic is sufficiently high (as compared to the values in the Chi-square table), it tells you that your null hypothesis is probably wrong.

Let Y₁, Y₂, . . ., Y_n be a set of independent and identically distributed discrete random variables. Assume that the probability distribution of the Y_i's has the probability mass function f_o(y). We can divide the set of all possible values of Y_i, i = 1, 2, ..., n, into m non-overlapping intervals D₁, D₂, ...., D_m. Define the probability values p₁, p₂, ..., p_m as:

p₁ = P(Y_i ∈ D₁)
p₂ = P(Y_i ∈ D₂)
:
p_m = P(Y_i ∈ D_m)

where the symbol ∈ means "an element of".

Since the union of the mutually exclusive intervals D₁, D₂,...., D_m is the set of all possible values for the Y_i's, (p₁ + p₂ + .... + p_m) = 1. Define the set of discrete random variables X₁, X₂, ...., X_m, where

X₁ = number of Y_i's whose value ∈ D₁
X₂ = number of Y_i's whose value ∈ D₂
:
X_m = number of Y_i's whose value ∈ D_m

and (X₁ + X₂ + .... + X_m) = n. Then the set of discrete random variables X₁, X₂, ...., X_m will have a multinomial probability distribution with parameters n and the set of probabilities {p₁, p₂, ..., p_m}. If the intervals D₁, D₂, ...., D_m are chosen such that np_i ≥ 5 for i = 1, 2, ..., m, then:

C = Σ (X_i - np_i)² / np_i.

The sum is over i = 1, 2,..., m. The result is distributed as χ²_{m-1}.

For the goodness-of-fit sample test, we formulate the null and alternative hypothesis as

H₀: f_Y(y) = f_o(y)
H_a: f_Y(y) ≠ f_o(y)

At the α level of significance, H₀ will be rejected in favor of H_a if

C = Σ (X_i - np_i)² / np_i

is greater than the critical value of χ²_{m-1}.

However, it is possible that in a goodness-of-fit test, one or more of the parameters of f_o(y) are unknown. Then the probability values p₁, p₂, ..., p_m will have to be estimated by assuming that H₀ is true and calculating their estimated values from the sample data. That is, another set of probability values p'₁, p'₂, ..., p'_m will need to be computed so that the values (np'₁, np'₂, ..., np'_m) are the estimated expected values of the multinomial random variable (X₁, X₂, ...., X_m). In this case, the random variable C will still have a Chi-square distribution, but its degrees of freedom will be reduced. In particular, if the probability function f_o(y) has r unknown parameters,

C = Σ (X_i - np_i)² / np_i

is distributed as χ²_{m-1-r}.

For this goodness-of-fit test, we formulate the null and alternative hypothesis as

H₀: f_Y(y) = f_o(y)
H_a: f_Y(y) ≠ f_o(y)

At the α level of significance, H₀ will be rejected in favor of H_a if C is greater than the critical value of χ²_{m-1-r}.

An Application: A die is thrown 300 times and the following frequencies are observed. Test the hypothesis that the die is fair at level 0.05. Under the null hypothesis that the die is fair, the expected frequencies are all equal to 300/6 = 50. Both the Observed (O) and Expected (E) frequencies are recorded in the following table together with the random variable Y that represents the number on each side of the die:

Goodness-of-fit Test For Discrete Variables
Y	1	2	3	4	5	6
O	57	43	59	55	63	23
E	50	50	50	50	50	50

The quantity

χ² = Σ [(O - E)² / E] = 22.04

is a measure of the goodness-of-fit. If there is a reasonably good fit to the hypothetical distribution, this measure will be very close to zero. Since the statistic exceeds the critical value χ²_{m-1, 0.95} = χ²_{5, 0.95} = 11.07, we reject the null hypothesis that the die is a fair one.
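The die example can be reproduced with a few lines of Python. The function name is hypothetical, and the critical value 11.07 is taken from the text and not computed by the code.

```python
def goodness_of_fit_statistic(observed, probabilities):
    """Sum of (X_i - n p_i)^2 / (n p_i) over all categories."""
    n = sum(observed)
    return sum(
        (x - n * p) ** 2 / (n * p)
        for x, p in zip(observed, probabilities)
    )

observed = [57, 43, 59, 55, 63, 23]   # faces 1..6, 300 throws in total
fair_die = [1 / 6] * 6                # null hypothesis: all faces equally likely

statistic = goodness_of_fit_statistic(observed, fair_die)
CRITICAL_5DF_95 = 11.07               # Chi-square, 5 d.f., level 0.05 (from the text)
reject_fairness = statistic > CRITICAL_5DF_95
print(round(statistic, 2), reject_fairness)
```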

You might like to use this JavaScript to perform this test.

For statistical equality of two random variables characterizing two populations, you might like to use the Kolmogorov-Smirnov Test if you have two independent sets of random observations, one from each population.

Compatibility of Multi-Counts Test

In some applications, such as quality control, it is necessary to check whether the process is under control. This can be done by testing whether there are significant differences between the numbers of "counts" taken over k equal periods of time. The counts are supposed to have been obtained under comparable conditions.

The null hypothesis is:

H₀: There is no significant difference between the numbers of "counts" taken over k equal periods of time.

Under the null hypothesis, the statistic:

Σ (N_i - N)² / N

has a Chi-square distribution with d.f. = k - 1, where i is the count's number, N_i is its count, and N = ΣN_i / k.

One may extend this useful test to the case where the duration of obtaining the i^th count is t_i. Then the above test statistic becomes:

Σ [(N_i - t_i N)² / (t_i N)]

and has a Chi-square distribution with d.f. = k - 1, where i is the count's number, N_i is its count, and N = ΣN_i / Σt_i.
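As a rough illustration, both forms of the statistic can be computed with one small function, since the equal-period case is the special case t_i = 1. The function name and the sample counts below are hypothetical and are not taken from the text.

```python
def multi_count_statistic(counts, durations=None):
    """Return (statistic, degrees of freedom) for the compatibility-of-counts test."""
    k = len(counts)
    if durations is None:
        durations = [1.0] * k               # equal periods: t_i = 1 for every count
    rate = sum(counts) / sum(durations)     # N = sum(N_i) / sum(t_i)
    stat = sum((n_i - t_i * rate) ** 2 / (t_i * rate)
               for n_i, t_i in zip(counts, durations))
    return stat, k - 1

# Hypothetical counts over four equal periods (not from the text).
stat, df = multi_count_statistic([12, 15, 9, 14])
print(round(stat, 2), df)
```

The resulting statistic would then be compared with a tabulated Chi-square critical value for k - 1 degrees of freedom.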

You might like to use the Compatibility of Multi-Counts JavaScript to check your computations, and to perform some numerical experimentation for a deeper understanding of the concepts.

Necessary Conditions for the Above Chi-square Based Testing

Like any statistical test procedure, Chi-square based testing must meet certain necessary conditions to apply; otherwise, any conclusion obtained might be wrong or misleading. This is true in particular when using the Chi-square based test for cross-tabulated data.

Necessary conditions for the Chi-square based tests for cross-tabulated data are:

Expected values greater than 5 in 80% or more of the cells.
Moreover, if the number of cells is fewer than 5, then all expected values must be greater than 5.

An Example: Suppose the monthly number of accidents reported in a factory in three eight-hour shifts is 1, 7, and 7, respectively. Are the working conditions and the exposure to risk similar for all shifts? Clearly, the answer must be, No they are not. However, applying the goodness-of-fit test at the 0.05 level, under the null hypothesis that there are no differences in the number of accidents in the three shifts, one expects 5 accidents in each shift. The Chi-square test statistic is:

χ² = Σ [(O - E)² / E] = 4.8

However, since χ²_{n-1, 0.95} = 5.99, there is no reason to reject the hypothesis that there is no difference, which is a very strange conclusion. What is wrong with this application?
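One plausible reading of the question, taking the necessary conditions listed above literally, is that the expected counts here equal 5 rather than exceed 5. The sketch below recomputes the statistic and checks those conditions; the helper name chi_square_conditions_met is hypothetical.

```python
def chi_square_conditions_met(expected):
    """Check the expected-count conditions listed above for a chi-square test."""
    cells = len(expected)
    above_five = sum(1 for e in expected if e > 5)
    if cells < 5:
        return above_five == cells          # every expected value must exceed 5
    return above_five / cells >= 0.80       # at least 80% of cells must exceed 5

observed = [1, 7, 7]                         # accidents per shift
expected = [sum(observed) / 3] * 3           # 5 per shift under H0
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

print(f"chi-square = {chi_square:.1f}")
print("conditions met:", chi_square_conditions_met(expected))
```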

You might like to use this JavaScript to verify your computation.

Testing the Variance: Is the Quality that Good?

Suppose a population has a normal distribution. The manager is to test a specific claim made about the quality of the population by way of testing its variance σ². Among the three possible scenarios, the interesting case is testing the following null hypothesis based on a set of n random sample observations:

H₀: The variation is about the claimed value.
H_a: The variation is more than what is claimed, indicating the quality is much lower than expected.

Upon computing the estimated variance S² based on n observations, the statistic:

χ² = [(n - 1) S²] / σ²

where σ² is the claimed variance, has a Chi-square distribution with degrees of freedom ν = n - 1. This statistic is then used for testing the above null hypothesis.
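A minimal sketch of this computation follows. The sample values and the claimed variance are hypothetical (they do not come from the text), and the critical value 9.488 is the standard tabulated upper 5% point of the Chi-square distribution with 4 degrees of freedom.

```python
from statistics import variance

def variance_test_statistic(sample, claimed_variance):
    """Return ((n - 1) * S^2 / sigma^2, degrees of freedom)."""
    n = len(sample)
    s_squared = variance(sample)            # sample variance with divisor n - 1
    return (n - 1) * s_squared / claimed_variance, n - 1

# Hypothetical measurements and claimed variance (not from the text).
sample = [4, 6, 8, 10, 12]
stat, df = variance_test_statistic(sample, claimed_variance=5)

# One-sided test: reject H0 when the statistic exceeds the upper critical value.
reject_h0 = stat > 9.488
print(stat, df, reject_h0)
```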

You might like to use Testing the Variance JavaScript to check your computations.

Testing the Equality of Multi-Variances

The equality of variances across populations is called homogeneity of variances or homoscedasticity. Some statistical tests, such as testing equality of the means by the t-test and ANOVA, assume that the data come from populations that have the same variance, even if the test rejects the null hypothesis of equality of population means. If this condition of homogeneity of variance is not met, the statistical test results may not be valid. Heteroscedasticity refers to lack of homogeneity of variances.

Bartlett's Test is used to test whether k samples have equal variances. It compares the geometric mean of the group variances to their arithmetic mean; the resulting statistic has a Chi-square distribution with (k-1) degrees of freedom, where k is the number of categories in the independent variable. The test is sensitive to departures from normality. The sample sizes do not have to be equal, but each must be at least 6. Just like the two-population t-test, ANOVA can go wrong when the equality of variances condition is not met.

The Bartlett test statistic is designed to test for equality of variances across groups against the alternative that variances are unequal for at least two groups. Formally,

H₀: All variances are almost equal.

The test statistic:

B = {Σ [(n_i - 1) ln S²] - Σ [(n_i - 1) ln S_i²]} / C

In the above, S_i² is the variance of the i^th group, n_i is the sample size of the i^th group, k is the number of groups, and S² is the pooled variance. The pooled variance is a weighted average of the group variances and is defined as:

S² = {Σ [(n_i - 1) S_i²]} / Σ [(n_i - 1)], over all i = 1, 2, ..., k

and

C = 1 + {Σ [1/(n_i - 1)] - 1/Σ [(n_i - 1)]} / [3(k - 1)].
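The statistic can be sketched in Python as below. The sketch assumes the standard form of Bartlett's correction factor, C = 1 + {Σ [1/(n_i - 1)] - 1/Σ (n_i - 1)} / [3(k - 1)]; the function name and the three samples are hypothetical.

```python
from math import log
from statistics import variance

def bartlett_statistic(groups):
    """Return (B, degrees of freedom) for Bartlett's test on a list of samples."""
    k = len(groups)
    dfs = [len(g) - 1 for g in groups]              # n_i - 1
    variances = [variance(g) for g in groups]       # S_i^2
    # Pooled variance: weighted average of the group variances.
    pooled = sum(d * v for d, v in zip(dfs, variances)) / sum(dfs)
    numerator = (sum(dfs) * log(pooled)
                 - sum(d * log(v) for d, v in zip(dfs, variances)))
    correction = 1 + (sum(1 / d for d in dfs) - 1 / sum(dfs)) / (3 * (k - 1))
    return numerator / correction, k - 1

# Hypothetical samples (not from the text); each has at least 6 observations.
groups = [[12, 15, 11, 14, 13, 16], [22, 9, 30, 17, 5, 25], [8, 9, 10, 9, 8, 10]]
b_stat, df = bartlett_statistic(groups)
print(round(b_stat, 3), df)
```

When all group variances are equal the numerator vanishes and B is zero; larger values of B indicate stronger evidence of unequal variances.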

You might like to use the Equality of Multi-Variances JavaScript to check your computations, and to perform some numerical experimentation for a deeper understanding of the concepts.

Rule of 2: For 3 or more populations, there is a practical rule known as the "Rule of 2". According to this rule, one divides the largest sample variance by the smallest sample variance. Provided that the sample sizes are almost the same and the value of this ratio is less than 2, the variations of the populations are almost the same.

Example: Consider the following three random samples from three populations, P1, P2, P3:

	Sample P1	Sample P2	Sample P3
	25	17	8
	25	21	10
	20	17	14
	18	25	16
	13	19	12
	6	21	14
	5	15	6
	22	16	16
	25	24	13
	10	23	6
N	10	10	10
Mean	16.90	19.80	11.50
Std.Dev.	7.87	3.52	3.81
SE Mean	2.49	1.11	1.20

The ANOVA Table
Sources of Variation	Sum of Squares	Degrees of Freedom	Mean Squares	F-Statistic
Between Samples	79.40	2	39.70	4.38
Within Samples	244.90	27	9.07
Total	324.30	29

With an F = 4.38 and a p-value of 0.023, we reject the null at α = 0.05. This is not good news, since ANOVA, like the two-sample t-test, can go wrong when the equality of variances condition is not met.
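The Rule of 2 described earlier can be applied directly to the three samples in the table. The following is an illustrative sketch only; the dictionary layout and variable names are assumptions made for the example.

```python
from statistics import variance

# The three samples from the table above.
samples = {
    "P1": [25, 25, 20, 18, 13, 6, 5, 22, 25, 10],
    "P2": [17, 21, 17, 25, 19, 21, 15, 16, 24, 23],
    "P3": [8, 10, 14, 16, 12, 14, 6, 16, 13, 6],
}

# Sample variances (divisor n - 1); their square roots are the Std.Dev. row.
variances = {name: variance(values) for name, values in samples.items()}

# Rule of 2: largest sample variance divided by the smallest.
ratio = max(variances.values()) / min(variances.values())
passes_rule_of_2 = ratio < 2

print({name: round(v, 2) for name, v in variances.items()})
print(round(ratio, 2), passes_rule_of_2)
```

The ratio is well above 2 for these data, which is consistent with the warning that the equal-variance condition behind ANOVA is questionable here.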

Further Readings:
Hand D., and C. Taylor, Multivariate Analysis of Variance and Repeated Measures, Chapman and Hall, 1987.
Miller R. Jr, Beyond ANOVA: Basics of Applied Statistics, Wiley, 1986.

Correlation Coefficients Testing

Fisher's Z-transformation is a useful tool in circumstances in which two or more independent correlation coefficients are to be compared simultaneously. To perform such a test one may evaluate the Chi-square statistic:

χ² = Σ[(n_i - 3) Z_i²] - [Σ(n_i - 3) Z_i]² / [Σ(n_i - 3)], where the sums are over all i = 1, 2, ..., k.

Here the Fisher Z-transformation is

Z_i = 0.5[ln(1 + r_i) - ln(1 - r_i)], provided |r_i| ≠ 1.

Under the null hypothesis:

H₀: All correlation coefficients are almost equal.

The test statistic χ² has (k-1) degrees of freedom, where k is the number of populations.

An Application: Consider the following correlation coefficients obtained by random sampling from ten independent populations.

Population P_i	Correlation r_i	Sample Size n_i
1	0.72	67
2	0.41	93
3	0.57	73
4	0.53	98
5	0.62	82
6	0.21	39
7	0.68	91
8	0.53	27
9	0.49	75
10	0.50	49

Using the above formula, the χ²-statistic = 19.916, which has a p-value of 0.02. Therefore, there is moderate evidence against the null hypothesis.
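The statistic for the ten correlation coefficients can be checked with a short script. This is a sketch of the formula above using only the standard library; math.atanh is used because atanh(r) equals 0.5[ln(1 + r) - ln(1 - r)], and the variable names are illustrative.

```python
from math import atanh

# (r_i, n_i) pairs from the table above.
data = [(0.72, 67), (0.41, 93), (0.57, 73), (0.53, 98), (0.62, 82),
        (0.21, 39), (0.68, 91), (0.53, 27), (0.49, 75), (0.50, 49)]

weights = [n - 3 for _, n in data]            # n_i - 3
z_values = [atanh(r) for r, _ in data]        # Fisher Z-transformation of r_i

weighted_sum = sum(w * z for w, z in zip(weights, z_values))
weighted_sq = sum(w * z * z for w, z in zip(weights, z_values))

# Chi-square statistic with k - 1 degrees of freedom.
chi_square = weighted_sq - weighted_sum ** 2 / sum(weights)
df = len(data) - 1

print(round(chi_square, 3), df)
```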

In such a case, one may omit a few outliers from the group, then use the Test for Equality of Several Correlation Coefficients JavaScript. Repeat this process until a possible homogeneous sub-group may emerge.

You might need to use Sample Size Determination JavaScript at the design stage of your statistical investigation in decision making with specific subjective requirements.

Simple Linear Regression: Computational Aspects

Regression analysis has three goals: predicting, modeling, and characterization. What would be the logical order in which to tackle these three goals such that one task leads to and/or justifies the other tasks? Clearly, it depends on what the prime objective is. Sometimes you wish to model in order to get better predictions. Then the order is obvious. Sometimes, you just want to understand and explain what is going on. Then modeling is again the key, though out-of-sample prediction may be used to test any model. Often modeling and predicting proceed in an iterative way and there is no 'logical order' in the broadest sense. You may model to get predictions, which enable better control, but iteration is again likely to be present and there are sometimes special approaches to control problems.

The following presents the main steps in building and analyzing a regression model, in the context of an applied numerical example.

Formulas and Notations:

x̄ = Σx / n
This is just the mean of the x values.
ȳ = Σy / n
This is just the mean of the y values.
S_xx = SS_xx = Σ(x(i) - x̄)² = Σx² - (Σx)² / n
S_yy = SS_yy = Σ(y(i) - ȳ)² = Σy² - (Σy)² / n
S_xy = SS_xy = Σ(x(i) - x̄)(y(i) - ȳ) = Σx×y - (Σx) × (Σy) / n
Slope m = SS_xy / SS_xx
Intercept, b = ȳ - m x̄.
y-predicted = yhat(i) = m×x(i) + b.
Residual(i) = Error(i) = y(i) - yhat(i).
SSE = SS_res = SS_errors = Σ[y(i) - yhat(i)]².
Standard deviation of residuals = s = S_res = S_errors = [SS_res / (n - 2)]^(1/2).
Standard error of the slope (m) = S_res / (SS_xx)^(1/2).
Standard error of the intercept (b) = S_res [(SS_xx + n x̄²) / (n × SS_xx)]^(1/2).

A Computational Example: A taxicab company manager believes that the monthly repair costs (Y) of cabs are related to the age (X) of the cabs. Five cabs are selected randomly and from their records we obtained the following data: (x, y) = {(2, 2), (3, 5), (4, 7), (5, 10), (6, 11)}. Based on our practical knowledge and the scatter diagram of the data, we hypothesize a linear relationship between the predictor X and the cost Y.

Now the question is how we can best (i.e., in the least squares sense) use the sample information to estimate the unknown slope (m) and the intercept (b). The first step in finding the least squares line is to construct a sum of squares table to find the sums of the x values (Σx), the y values (Σy), the squares of the x values (Σx²), the squares of the y values (Σy²), and the cross-products of the corresponding x and y values (Σxy), as shown in the following table:

	x	y	x²	xy	y²
	2	2	4	4	4
	3	5	9	15	25
	4	7	16	28	49
	5	10	25	50	100
	6	11	36	66	121
SUM	20	35	90	163	299

The second step is to substitute the values of Σx, Σy, Σx², Σxy, and Σy² into the following formulas:

SS_xy = Σxy - (Σx)(Σy)/n = 163 - (20)(35)/5 = 163 - 140 = 23

SS_xx = Σx² - (Σx)²/n = 90 - (20)²/5 = 90 - 80 = 10

SS_yy = Σy² - (Σy)²/n = 299 - (35)²/5 = 299 - 245 = 54

Use the first two values to compute the estimated slope:

Slope = m = SS_xy / SS_xx = 23 / 10 = 2.3

To estimate the intercept of the least squares line, use the fact that the graph of the least squares line always passes through the point (x̄, ȳ); therefore,

The intercept = b = ȳ - (m)(x̄) = (Σy)/5 - (2.3)(Σx/5) = 35/5 - (2.3)(20/5) = -2.2

Therefore the least square line is:

y-predicted = yhat = mx + b = -2.2 + 2.3x.
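The two steps above can be retraced in Python from the five data points. This is a minimal sketch of the hand computation, not a general regression routine; the variable names are chosen for the example.

```python
xs = [2, 3, 4, 5, 6]       # age of cab (X)
ys = [2, 5, 7, 10, 11]     # monthly repair cost (Y)
n = len(xs)

sum_x, sum_y = sum(xs), sum(ys)

# Corrected sums of squares and cross-products.
ss_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n   # 163 - 140
ss_xx = sum(x * x for x in xs) - sum_x ** 2 / n                  # 90 - 80
ss_yy = sum(y * y for y in ys) - sum_y ** 2 / n                  # 299 - 245

slope = ss_xy / ss_xx
intercept = sum_y / n - slope * sum_x / n    # line passes through the point of means

print(ss_xy, ss_xx, ss_yy)
print(round(slope, 4), round(intercept, 4))
```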

After estimating the slope and the intercept, the question is how we determine statistically whether the model is good enough, say for prediction. The standard error of the slope is:

Standard error of the slope (m) = S_m = S_res / (SS_xx)^(1/2),

and its relative precision is measured by the statistic

t_slope = m / S_m.

For our numerical example, with the residual standard deviation S_res = [SSE / (n - 2)]^(1/2) = 0.6055, it is:

t_slope = 2.3 / [(0.6055) / (10^(1/2))] = 12.01

which is large enough, indicating that the fitted model is a "good" one.
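A short sketch of the t_slope computation, starting from the sums of squares already obtained in the text (SS_xx = 10, SS_xy = 23, SS_yy = 54, n = 5), may help in checking the arithmetic. The variable names are illustrative.

```python
from math import sqrt

n, ss_xx, ss_xy, ss_yy = 5, 10.0, 23.0, 54.0   # values computed in the text

slope = ss_xy / ss_xx
sse = ss_yy - slope * ss_xy          # residual sum of squares
s_res = sqrt(sse / (n - 2))          # standard deviation of residuals
se_slope = s_res / sqrt(ss_xx)       # standard error of the slope
t_slope = slope / se_slope

print(round(s_res, 4), round(se_slope, 4), round(t_slope, 2))
```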

You may ask in what sense the least squares line is the "best-fitting" straight line to the 5 data points. The least squares criterion chooses the line that minimizes the sum of squared vertical deviations, i.e., residual = error = y - yhat:

SSE = Σ (y - yhat)² = Σ(error)² = 1.1

The numerical value of SSE is obtained from the following computational table for our numerical example.

x (predictor)	y-predicted = -2.2 + 2.3x	y observed	error = y - yhat	squared error
2	2.4	2	-0.4	0.16
3	4.7	5	0.3	0.09
4	7	7	0	0
5	9.3	10	0.7	0.49
6	11.6	11	-0.6	0.36
			Sum=0	Sum=1.1

Alternatively, one may compute SSE by:

SSE = SS_yy - m SS_xy = 54 - (2.3)(23) = 54 - 52.9 = 1.1,

which agrees with the value computed directly from the above table. The numerical value of SSE gives the estimate of the variance of the errors, s²:

s² = SSE / (n - 2) = 1.1 / (5 - 2) = 0.36667

The estimated error variance is a measure of the variability of the y values about the estimated line. Clearly, we could also compute the estimated standard deviation s of the residuals by taking the square root of the variance s².
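Both routes to SSE, and the resulting error variance, can be verified with the sketch below. It takes the fitted line yhat = -2.2 + 2.3x and the values SS_yy = 54 and SS_xy = 23 from the text; the variable names are illustrative.

```python
xs = [2, 3, 4, 5, 6]
ys = [2, 5, 7, 10, 11]
slope, intercept = 2.3, -2.2                 # fitted line from the text

# Direct route: residual = observed y minus predicted y.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
sse_direct = sum(e * e for e in residuals)

# Shortcut route: SSE = SS_yy - m * SS_xy.
sse_shortcut = 54 - slope * 23

error_variance = sse_direct / (len(xs) - 2)  # s^2 = SSE / (n - 2)

print(round(sse_direct, 4), round(sse_shortcut, 4), round(error_variance, 5))
```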

As the last step in the model building, the following Analysis of Variance (ANOVA) table is then constructed to assess the overall goodness-of-fit using the F-statistic:

Analysis of Variance Components
Source	DF	Sum of Squares	Mean Square	F Value	Prob > F
Model	1	52.90000	52.90000	144.273	0.0012
Error	3	SSE = 1.1	0.36667
Total	4	SS_yy = 54
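The Mean Square and F Value entries of the table follow from SS_yy and SSE, as the sketch below illustrates. The p-value column is not reproduced, since it requires the F distribution, which is outside the Python standard library.

```python
n = 5
ss_total, sse = 54.0, 1.1            # SS_yy and SSE from the text
ss_model = ss_total - sse            # sum of squares explained by the line

ms_model = ss_model / 1              # one predictor, so 1 degree of freedom
ms_error = sse / (n - 2)             # n - 2 = 3 degrees of freedom
f_value = ms_model / ms_error

print(round(ss_model, 1), round(ms_error, 5), round(f_value, 3))
```

In simple linear regression the F value equals the square of t_slope, so this figure is consistent with t_slope = 12.01.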

Greek Letters Commonly Used as Statistical Notations
alpha	beta	chi-square	delta	mu	nu	pi	rho	sigma	tau	theta
α	β	χ²	δ	μ	ν	π	ρ	σ	τ	θ

	Levels of Measurements
	_________________________________________
	Nominal	Ordinal	Interval/Ratio
Ranking?	no	yes	yes
Numerical difference	no	no	yes

- Two Investments -
Investment I		Investment II
Payoff %	Prob.	Payoff %	Prob.
1	0.25	3	0.33
7	0.50	5	0.33
12	0.25	8	0.34

Revising the Expected Value and the Variance
Estimate Source	Expected value	Variance
Sales manager	μ₁ = 110	σ₁² = 100
Market survey	μ₂ = 70	σ₂² = 49

P-value	Interpretation
P < 0.01	very strong evidence against H₀
0.01 ≤ P < 0.05	moderate evidence against H₀
0.05 ≤ P < 0.10	suggestive evidence against H₀
0.10 ≤ P	little or no real evidence against H₀

i	x_i	(x_i - x̄)	(x_i - x̄)²	(x_i - x̄)³	(x_i - x̄)⁴
1	1	-2	4	-8	16
2	2	-1	1	-1	1
3	3	0	0	0	0
4	6	3	9	27	81
Sum	12	0	14	18	98

											Mean
0 oz	2	3	1	3	1	4	1	3	2	1	2.1
2 oz	3	2	1	4	2	3	1	5	1	2	2.4
4 oz	3	1	2	4	2	5	2	4	3	2	3.1

Test for homogeneity of Several Population Proportions
Populations	Yes	No	Total
Sample I	60	40	100
Sample II	57	53	110
Sample III	48	72	120
Total	165	165	330

Value of \|r\|	Interpretation
0.00 - 0.40	Poor
0.41 - 0.75	Fair
0.76 - 0.85	Good
0.86 - 1.00	Excellent

Values of Covariate X and a Dependent Variable Y
Treatment-I		Treatment-II
X	Y	X	Y
5	11	2	1
3	9	6	7
1	5	4	3
4	8	7	8
6	12	3	2

- Sizes, Ages, and Prices of Twenty Houses -
X1 = Size	X2 = Age	Y = Price	X1 = Size	X2 = Age	Y = Price
1.8	30	32	2.3	30	44
1.0	33	24	1.4	17	27
1.7	25	27	3.3	16	50
1.2	12	25	2.2	22	37
2.8	12	47	1.5	29	28
1.7	1	30	1.1	29	20
2.5	12	43	2.0	25	38
3.6	28	52	2.6	2	45

- Relation between Age and Income($1000) -
Age	Income	Age	Income	Age	Income
20	15	42	19	61	13
22	13	47	17	62	14
23	17	53	13	65	9
28	19	55	18	67	7
35	15	41	21	72	7
24	21	53	39	65	22
26	26	57	28	65	24
29	27	58	22	69	27
39	31	58	29	71	22
31	16	46	27	69	9
37	19	44	35	62	21

- Relation between Age and Income($1000) -
Age (20 - 39)	Age (40 - 59)	Age (60 & Over)
15	19	13
13	17	14
17	13	9
21	21	7
15	39	21
26	28	24
27	22	27
31	26	22
16	27	9
19	35	22
19	18	7

	Plant Type - A		Plant Type - B
Months	Unit Output	Man Hours	Unit Output	Man Hours
1	0283	200000	11315	680000
2	0760	300000	12470	720000
3	1195	530000	13395	750000
Standard	4000	600000	16000	800000

		Year 2000		Year 2001
	Unit Needed	Unit Cost	Total	Unit Cost	Total
Labor	20	10	200	11	220
Aluminum	02	100	200	110	220
Electricity	02	50	100	60	120
Total			500		560
