Alternate hypothesis: The decision accepted when the value of a test statistic leads to rejection of the null hypothesis. See hypothesis testing.
Analysis of Variance: Techniques used to test hypotheses about differences among population means when there are more than the two populations that can be handled by the two sample normal or Student's-t tests. Tests of the effects of one, two or more experimental treatments are available.
ANOVA: See Analysis of Variance (ANOVA).
Central location, measures of: See location, measures of.
Chi-square variance test: The value of a chi-square statistic calculated from sample data may be used to test hypotheses about the variance of the sampled population. The possible left-tail, two-tail and right-tail tests are:
| H0: σ² ≥ σ₀² | H0: σ² = σ₀² | H0: σ² ≤ σ₀² |
| H1: σ² < σ₀² | H1: σ² ≠ σ₀² | H1: σ² > σ₀² |
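As a rough sketch of how such a test might be carried out, the statistic usually used for these hypotheses is (n - 1)s²/σ₀², compared against a chi-square distribution with n - 1 degrees of freedom. The function name, the mileage figures and the hypothesized variance below are illustrative assumptions, and the critical value is the 5% right-tail value for 9 degrees of freedom read from a standard chi-square table.

```python
from statistics import variance


def chi_square_variance_statistic(sample, sigma0_squared):
    """Return (n - 1) * s^2 / sigma0^2 for a sample assumed drawn from a normal population."""
    n = len(sample)
    s_squared = variance(sample)  # sample variance, (n - 1) divisor
    return (n - 1) * s_squared / sigma0_squared


# Hypothetical mileage figures (illustrative data only).
mileage = [24.1, 25.3, 26.0, 23.8, 25.9, 24.7, 26.4, 25.1, 24.5, 25.2]

# Right-tail test of H0: variance <= 1.0 against H1: variance > 1.0.
statistic = chi_square_variance_statistic(mileage, sigma0_squared=1.0)
degrees_of_freedom = len(mileage) - 1

# 5% right-tail critical value for 9 degrees of freedom (from a printed table).
CRITICAL_VALUE_DF9_ALPHA_05 = 16.919
reject_null = statistic > CRITICAL_VALUE_DF9_ALPHA_05
```

For a left-tail or two-tail test the same statistic would be compared against the lower, or both lower and upper, table values instead.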
Class and class interval: Counts of the number of sample values falling in ranges of possible values are often prepared to facilitate construction of frequency tables and histograms. Each category is called a class and the length of each class is the class interval.
Confidence: See statistical significance.
Confidence interval: Confidence intervals extend the notion of a point estimate (e.g. "the average mileage is 25 mpg") by using information about the sample variance to permit statements like "the 90% confidence interval on the mileage is 23.5 to 26.5 mpg." The precise interpretation of this statement is that, if the sampling procedure used to calculate the confidence interval were infinitely repeated, 90% of the resulting intervals would contain the true value of the parameter.
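One simple way to compute such an interval, sketched here under the assumption that the population is normal with a known standard deviation, is to take the sample mean plus or minus a normal quantile times σ/√n. The function name, the sample values and the σ of 2.0 are hypothetical; with an unknown variance a Student's-t quantile would be used in place of the normal one.

```python
from math import sqrt
from statistics import NormalDist, mean


def normal_confidence_interval(sample, sigma, confidence=0.90):
    """Two-sided confidence interval for the mean, assuming a known population sigma."""
    n = len(sample)
    # Quantile leaving (1 - confidence) / 2 in each tail of the standard normal.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * sigma / sqrt(n)
    centre = mean(sample)
    return centre - half_width, centre + half_width


# Hypothetical mileage sample and an assumed known sigma of 2.0 mpg.
mileage = [24.1, 25.3, 26.0, 23.8, 25.9, 24.7, 26.4, 25.1, 24.5, 25.2]
low, high = normal_confidence_interval(mileage, sigma=2.0, confidence=0.90)
```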
Contingency table: A technique for measuring the degree of association between two variables that may be measured on any measurement scale. A chi-square statistic is computed from a cross-tabulation of sample data in a two-way table. A large value of this statistic provides evidence that the row and column variables are associated, i.e. that the counts in the cells of the two-way table depend on the relationship between the row and column variables.
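The chi-square statistic for a two-way table is commonly computed by comparing each observed count with the count expected if rows and columns were independent (row total times column total divided by the grand total). The sketch below assumes that standard formula; the function name and the example counts are illustrative only.

```python
def contingency_chi_square(table):
    """Return (chi-square statistic, degrees of freedom) for a two-way table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected

    degrees_of_freedom = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, degrees_of_freedom


# Hypothetical cross-tabulation: rows and columns are two categorical variables.
observed_counts = [[10, 20], [20, 10]]
chi_square, dof = contingency_chi_square(observed_counts)
```

A table whose rows are exact multiples of each other, such as [[10, 20], [20, 40]], gives a statistic of zero because every observed count equals its expected count.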
Correlation analysis: A measure of the strength of relationship between two interval scaled variables -- the degree to which a change in one variable is associated with a change in the other. Also see regression analysis and principal component analysis.
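A common way to quantify this strength is the Pearson product-moment correlation coefficient; the entry above does not name a specific coefficient, so treating it as Pearson's r is an assumption. The function name and data are illustrative.

```python
from math import sqrt


def pearson_correlation(x, y):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sums of cross-products and squared deviations about the means.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)


# A perfectly linear increasing relationship gives +1, a decreasing one gives -1.
r_up = pearson_correlation([1, 2, 3], [2, 4, 6])
r_down = pearson_correlation([1, 2, 3], [6, 4, 2])
```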
Criterion variable: See dependent variable.
Dependent variable: In a regression analysis the dependent variable represents the value to be predicted from values of the independent variable(s).
Descriptive statistics: These methods include tabular, mathematical and graphical techniques that are used to summarize and communicate the essential meaning of data sets. Examples include frequency tables, histograms, scatterplots and numerical measures such as the sample mean, median, mode, variance and range.
Dummy variables: It is possible to incorporate nominal scaled variables in multiple regression models that require interval or ratio scaled variables by recoding them as a series of dichotomous (two-valued) independent variables called dummy or indicator variables. There will be n-1 dummy variables to represent n levels of the nominal variable. Here is an example of recoding a nominal variable with three levels into two dichotomous variables X1 and X2:
| Question | X1 | X2 |
| My job category is: | | |
| A. Butcher | 1 | 0 |
| B. Baker | 0 | 1 |
| C. Candlestick maker | 0 | 0 |
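The recoding in the table above can be sketched as a small function that emits n-1 indicator values and treats the last level as the all-zeros reference category. The function name is hypothetical; the levels and the resulting codes follow the table.

```python
def dummy_code(value, levels):
    """Return n-1 indicator values for `value`; the last level is coded as all zeros."""
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    # One indicator per level except the final (reference) level.
    return [1 if value == level else 0 for level in levels[:-1]]


job_levels = ["Butcher", "Baker", "Candlestick maker"]
coded = {level: dummy_code(level, job_levels) for level in job_levels}
```

Choosing a different reference level only changes which category is represented by the row of zeros.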
Estimating probability distribution parameters: Numerical estimates of the parameters of a probability distribution provide insight into the shape of the underlying distribution from which a data sample has been drawn. For example, the normal distribution is completely defined by its mean and variance, so calculating the sample mean and variance from data assumed drawn from a normal population gives the analyst an idea of the bell shaped curve's location and spread or dispersion.
Estimating model parameters: Models are used to define relationships among variables to understand how they influence each other or to support prediction. The parameters of these models may be estimated from data and evaluated for statistical significance. See regression analysis.
F-test of variance equality: Hypothesis tests used to draw inferences about the variances of two normal populations are based on the F distribution. A test statistic representing the ratio of the sample variances calculated from samples drawn from the two populations may be used to test hypotheses of the following form:
| H0: σ₁² ≥ σ₂² | H0: σ₁² = σ₂² | H0: σ₁² ≤ σ₂² |
| H1: σ₁² < σ₂² | H1: σ₁² ≠ σ₂² | H1: σ₁² > σ₂² |
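The test statistic described above, the ratio of the two sample variances, can be sketched as follows. The function name and sample values are illustrative assumptions; the resulting ratio would then be compared against an F distribution with the returned numerator and denominator degrees of freedom.

```python
from statistics import variance


def f_ratio(sample_1, sample_2):
    """Return (s1^2 / s2^2, numerator df, denominator df) for two samples."""
    ratio = variance(sample_1) / variance(sample_2)  # (n - 1) divisor in each variance
    return ratio, len(sample_1) - 1, len(sample_2) - 1


# Hypothetical samples from the two populations.
f_value, df_1, df_2 = f_ratio([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
```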
Frequency table: A tabular summary indicating the number and/or percentage of sample points that take on specific values or fall into value ranges called classes:
| Satisfaction | N | % |
| High | 100 | 50% |
| Medium | 20 | 10% |
| Low | 80 | 40% |
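A table like the one above can be tallied directly from raw responses. This is a minimal sketch using Python's `collections.Counter`; the function name is hypothetical and the response list is constructed to reproduce the counts shown in the table.

```python
from collections import Counter


def frequency_table(values):
    """Map each distinct value to a (count, percentage of sample) pair."""
    counts = Counter(values)
    total = len(values)
    return {category: (count, 100 * count / total) for category, count in counts.items()}


# Responses built to match the satisfaction table: 100 High, 20 Medium, 80 Low.
responses = ["High"] * 100 + ["Medium"] * 20 + ["Low"] * 80
satisfaction = frequency_table(responses)
```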
Histogram: A graphical depiction of the data in a frequency table.

Hypothesis testing: Statistical methods may be used to decide whether a data sample subject to random variation provides sufficient evidence to conclude that an hypothesized effect has occurred. Suppose we are testing a fuel additive claimed to improve the mileage for an automobile that should average 25 miles-per-gallon in highway driving. We have a sample of mileage figures derived from highway tests and believe the sample values will be normally distributed around the population mean. Here is a summary of the hypothesis testing process:
Step 1: The research question must be formulated in terms of a parameter of the probability distribution from which the sample is assumed to be drawn. In our mileage example, the appropriate parameter would be the population mean. The research question would probably be: "is there evidence that this fuel additive increases the mean of the mileage distribution to a value above 25 miles-per-gallon?"
Step 2: A test statistic with values that may be calculated from the sample data is selected. The probability distribution describing the test statistic is called a sampling distribution. The test statistic's value will be influenced by the true value of the parameter used to formulate the hypothesis. Tests about means based on samples from normal distributions typically use test statistics that follow either the normal or Student's-t distribution.
Step 3: At this point, the formal null and alternate hypotheses for the statistical test are defined. The null hypothesis represents a statement that nothing has happened: for our example problem this would be an assertion that the data is from a population where the mean has not increased from 25 mpg. Notationally, this null hypothesis would be represented as follows: H0: µ<=25 where µ is the Greek letter "mu" typically used to denote the population mean.
Step 4: The alternate hypothesis represents the decision about the population parameter implied by rejecting the null hypothesis. In statistical testing, statistical evidence of the effect of a treatment is obtained by rejecting the null hypothesis. This puts the burden of proof on the experiment by requiring strong evidence of the experimental effect. The type of effect the test hopes to support must be identified in advance of the analysis by selecting a left-tailed, right-tailed, or two-tailed test. These imply, respectively, that the effect decreases the parameter, increases the parameter, or changes it in an unspecified direction. For our mileage example, we would probably pick the right-tailed option: H1: µ>25 as our alternate hypothesis. If the statistical test leads us to reject the null hypothesis, our conclusion would be that the sample data is from a population with a mean greater than 25 mpg -- implying that the mileage has been increased by the fuel additive.
Step 5: Choosing a critical region or rejection region for the test defines the range of values of the test statistic that will cause the null hypothesis to be rejected. Its form is determined by the nature of the alternate hypothesis. The shaded areas in the above diagram indicate the nature of the critical region for three possible alternate hypotheses for tests on population means. When the alternate hypothesis is of the less-than form, small values of the test statistic will lead to rejection of the null hypothesis in favor of this alternate. For the two-tailed test, very small or very large values of the test statistic will lead to rejection. For the right-tailed situation, large test statistic values result in rejection. The significance level of the test, stated as a probability, is the probability of a type-I error for the test and determines the actual numerical range of test statistic values that define the critical region. The diagram below summarizes the statistical nature of the hypothesis testing process and the possible decision outcomes. A small probability (often .01 or .05) is selected by the analyst as the size of type-I error to be accepted for the test -- this is the probability you are willing to accept of concluding the treatment has a meaningful effect when it does not. In the mileage example, this would be the probability of concluding that the mileage has increased when it is actually still 25 mpg or less. The probability of a type-II error depends on the true value of the parameter being tested and is sometimes examined for a range of possible values.
Step 6: The final step involves gathering the sample data, calculating the value of the test statistic and determining whether it falls in the critical region. If it does, the null hypothesis is rejected and the result is determined to be statistically significant at the level of the type-I error selected for the test.
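The six steps can be sketched for the mileage example as a right-tailed test of H0: µ<=25 against H1: µ>25. This sketch assumes the population standard deviation is known, so that the test statistic follows the normal distribution; the function name, the sample values and the σ of 2.0 mpg are hypothetical. With an unknown variance, a Student's-t statistic and critical value would be used instead.

```python
from math import sqrt
from statistics import NormalDist, mean


def right_tailed_z_test(sample, mu0, sigma, alpha=0.05):
    """Return (z statistic, critical value, reject H0?) for H0: mu <= mu0 vs H1: mu > mu0."""
    n = len(sample)
    # Step 2: test statistic, standard normal when mu = mu0 and sigma is known.
    z = (mean(sample) - mu0) / (sigma / sqrt(n))
    # Step 5: the critical region is z above the (1 - alpha) normal quantile.
    critical_value = NormalDist().inv_cdf(1 - alpha)
    # Step 6: reject H0 only if the statistic falls in the critical region.
    return z, critical_value, z > critical_value


# Hypothetical highway mileage figures measured with the additive.
mileage = [26.1, 25.4, 27.0, 24.8, 26.3, 25.9, 26.7, 25.2, 26.0]
z, critical_value, reject_null = right_tailed_z_test(mileage, mu0=25.0, sigma=2.0, alpha=0.05)
```

A smaller alpha pushes the critical value further into the right tail, so stronger evidence is needed before the null hypothesis is rejected.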
Independent variable: Variables in a regression model used to predict the value of the dependent variable are called independent or predictor variables.
Indicator variables: See dummy variables.
Inference: See statistical inference.
Interval scale: This measurement scale adds a constant-distance property to the rank ordering capability of the ordinal scale. A one-unit difference has the same meaning wherever it appears on the scale. Temperature measures are interval scaled: one degree represents the same shift in temperature wherever it appears on the thermometer.
Kruskal-Wallis one-way ANOVA: A nonparametric test that allows a single-treatment ANOVA to be performed without assuming that the samples receiving each treatment were drawn from a specific probability distribution.
Location, measures of: Location measures (sometimes called "central" location measures) attempt to characterize a probability distribution with a single number that conveys information about the magnitude of a typical value that might be drawn from that distribution. Location measures include the mean, median and mode.
Mean (arithmetic mean): A measure of sample location calculated as the sum of the values divided by the number of values.
Measurement scale: The analyses that can be legitimately performed on data depend on the levels-of-measurement hierarchy that identifies variables as nominal scaled, ordinal scaled, interval scaled, or ratio scaled. Nominal and ordinal scales are non-metric (non-numeric) while interval and ratio scales are intrinsically numeric.
Median: A measure of sample location. To calculate the median, the values are arranged in order of magnitude. If the number of values is odd, the median is the middle value. If the number of values is even, the median is defined as a value halfway between the two middle values.
Mode: A measure of sample location represented by the value that occurs most frequently in the sample.
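The three location measures defined above (mean, median and mode) can be illustrated with a short sketch. The hand-written median follows the odd/even rule given in the Median entry; the function name and the sample are illustrative, and Python's standard `statistics` module is used for the mean and mode.

```python
from statistics import mean, mode


def median_of(values):
    """Median by the odd/even rule: middle value, or halfway between the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


sample = [2, 3, 3, 5, 7, 10]
sample_mean = mean(sample)         # sum of values divided by number of values
sample_median = median_of(sample)  # even count: halfway between 3 and 5
sample_mode = mode(sample)         # most frequent value
```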
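Multiple regression analysis: Regression analysis with more than one independent variable. The model has the following form: Y = β0 + β1X1 + β2X2 + ... + βkXk + ε, where Y is the dependent variable, X1 through Xk are the independent variables, β0 through βk are the coefficients to be estimated, and ε is a random error term.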
Nominal (naming) scale: Each value of an item measured with a nominal measurement scale represents a category, but no relationship among the categories is implied by their order. For example:
My job category is:
A. Butcher
B. Baker
C. Candlestick maker
Because the order of the values is arbitrary, it is never legitimate to assign numeric values to the responses to this type of question (A=1, B=2, etc.) and then analyze these responses as if they were interval scaled. If you must analyze nominal scaled data with techniques designed for interval scaled data, you may be able to recode the variable first as a series of dummy variables.
Nonparametric statistics: Analysis procedures that have limited or no dependence on assumptions about the probability distribution from which the data were drawn are called nonparametric or distribution free.
Normal distribution: Symmetric, bell-shaped distributions of values often appear in nature. When a population exhibits normal behavior, most of the values fall close to the population mean value, with values far from the mean decreasing in frequency. The distribution is entirely defined by its population mean (measure of location: the random variable value corresponding to the highest point on the curve) and variance (measure of spread or dispersion).
Null hypothesis: The premise that an experimental treatment has had no effect on a population parameter. See hypothesis testing.
Ordinal or ranking scale: Values measured on an ordinal measurement scale have a sequential relationship to each other that provides information beyond the simple categories of the nominal scale. Here is an example:
I enjoy going to work each day:
A. Strongly disagree
B. Disagree
C. Neutral
D. Agree
E. Strongly agree
The order of the values is not arbitrary. When the analyst is willing to assume that values on the scale are approximately evenly spaced, it is a common practice to assign sequential numeric values (A=1, B=2, etc.) and analyze the data as if it has been gathered on an interval scale.
Paired sample normal test: This hypothesis test is used to draw conclusions about the before and after (matched pairs) effect of a treatment on the population mean. The sample data are assumed to be drawn from a normal distribution with known variance.
Paired sample Student's-t test: This hypothesis test is used to draw conclusions about the before and after (matched pairs) effect of a treatment on the population mean. The sample data are assumed to be drawn from a normal distribution with unknown variance so the variance must be estimated from the sample data.
Pie Chart: Circular representation of frequency data with pie segment sizes proportional to frequencies.

Principal Component Analysis: A technique used to explore the underlying structure of a set of variables. The method's motivating assumption is that the measured or "manifestation" variables result from linear combinations of a smaller set of variables not directly measured. The technique performs orthogonal rotations of the sample correlation matrix to extract components as linear combinations of the measured variables so that each successive extracted component explains as much of the original variance as possible. If a small set of components explains most of the variance, and these components are interpretable, there may be justification for using scores on the components as measures of basic dimensions of the problem under analysis.
Probability density function: The probability associated with a single value of a random variable with continuous values is zero. The area under a probability density function between any two values of the continuous random variable represents the probability that the variable's value will lie in that interval. In the following diagram, the shaded area under a normal curve represents the probability that the value of the normal random variable denoted by x will have a value between x0 and x1.
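For a normal random variable, the area between x0 and x1 can be computed as the difference of the cumulative distribution function at the two endpoints. This sketch uses `statistics.NormalDist` from the Python standard library; the function name and the default mean and standard deviation are illustrative assumptions.

```python
from statistics import NormalDist


def normal_interval_probability(x0, x1, mu=0.0, sigma=1.0):
    """Probability that a normal(mu, sigma) variable falls between x0 and x1."""
    dist = NormalDist(mu, sigma)
    # Area under the density between x0 and x1 = CDF(x1) - CDF(x0).
    return dist.cdf(x1) - dist.cdf(x0)


# About 95% of a standard normal lies within 1.96 standard deviations of the mean.
central_area = normal_interval_probability(-1.96, 1.96)

# An interval of zero width has zero probability, as the definition states.
single_point = normal_interval_probability(1.0, 1.0)
```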

Probability distribution: These distributions associate a probability with each value that a discrete random variable takes on.
Random variable: A variable that can take on a range of either discrete or continuous values such that a probability can be assigned to each value in the discrete case or to each value interval in the continuous case. A probability distribution describes this association between values and probabilities for discrete random variables and a probability density function defines the probability that a continuous random variable will fall between two values.
Range: The difference between the largest and smallest value in a sample that provides a measure of the spread or dispersion of the data.
Ratio scale: The ratio scale extends the interval scale's definition to include an absolute zero that makes ratio comparisons meaningful for this measurement scale. A ratio scale value of zero means there is none of the property measured by the variable. Examples include units of weight and measure such as pounds or inches. A ten pound object is twice as heavy as a five pound object because an object with a weight of zero is weightless. A temperature of 80 degrees F (an interval scaled measurement) is not twice as warm as a temperature of 40 degrees F because zero degrees does not imply the complete absence of warmth.
Rank correlation: A nonparametric technique for measuring the degree to which two variables move together is provided by the rank correlation coefficient. To use the technique, data must be measured on at least an ordinal scale. No assumptions about the underlying probability distribution are required.
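One widely used rank correlation coefficient is Spearman's; the entry above does not name a specific coefficient, so using Spearman's formula here is an assumption. The sketch also assumes there are no tied values, which is what allows the shortcut formula 1 - 6Σd²/(n(n²-1)); the function names and data are illustrative.

```python
def ranks(values):
    """Rank positions (1 = smallest) of each value; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rank_correlation(x, y):
    """Spearman rank correlation via the no-ties shortcut formula."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Only the ordering matters: an extreme value such as 100 does not change the result.
rho_same_order = spearman_rank_correlation([1, 2, 3, 4, 5], [2, 4, 6, 8, 100])
rho_reversed = spearman_rank_correlation([1, 2, 3, 4, 5], [100, 8, 6, 4, 2])
```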
Regression analysis (linear): In a linear regression analysis, the coefficients of the independent variable(s) in the model are estimated from sample data. The model can then be used to predict values of the dependent variable. Testing the statistical significance of the coefficients allows conclusions to be drawn about whether each independent variable influences the value of the dependent variable. See simple regression and multiple regression.
Sample variance: An estimate of the population variance calculated from sample data.
Sample description: See descriptive statistics.
Sampling distribution: The probability distribution for a test statistic. Used in hypothesis testing.
Scatterplot: Plotting the values of pairs of variables in a scatterplot like the one shown below may reveal the presence or absence of a relationship between the two variables. Insights into the nature of the relationship such as whether or not it is linear might also be obtained.

Sign test: This test uses the sample median to provide a nonparametric alternative to normal and Student's-t tests about population means. Assumptions about the underlying probability distribution are not required.
Significance tests: See hypothesis testing.
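Simple regression analysis: Regression analysis with a single independent variable. The model has the following form: Y = β0 + β1X + ε, where Y is the dependent variable, X is the independent variable, β0 and β1 are the intercept and slope coefficients to be estimated, and ε is a random error term.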
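As an illustrative sketch, the intercept and slope of a straight-line model with one independent variable can be estimated by ordinary least squares. Least squares is assumed here as the estimation method since the glossary does not specify one; the function name and data are hypothetical.

```python
from statistics import mean


def fit_simple_regression(x, y):
    """Least-squares (intercept, slope) for a straight line through paired data."""
    x_bar, y_bar = mean(x), mean(y)
    # Slope = sum of cross-products / sum of squared x deviations.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Points lying exactly on y = 1 + 2x recover those coefficients.
intercept, slope = fit_simple_regression([1, 2, 3, 4], [3, 5, 7, 9])
predicted_at_5 = intercept + slope * 5
```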
Single sample normal test: This hypothesis test is used to draw conclusions about the effect of a treatment on the population mean. The sample data are assumed to be drawn from a normal distribution with known variance.
Single sample t-test: This hypothesis test is used to draw conclusions about the effect of a treatment on the population mean. The sample data are assumed to be drawn from a normal distribution with unknown variance so the variance must be estimated from the sample data.
Spread: The dispersion of a set of values. Example measures of spread or dispersion include the range and variance.
Statistical inference: Using evidence obtained from sample data to draw conclusions about the sampled population. See hypothesis testing.
Statistical significance: The value of a test statistic is considered statistically significant if its value falls in the critical region for the hypothesis test resulting in rejection of the null hypothesis.
Student's-t distribution: A probability distribution followed by certain test statistics that are computed from the sample mean and sample variance of a sample assumed to be drawn from a normal population. These statistics may be used to perform hypothesis tests with samples assumed drawn from normal populations with unknown variances.
Test statistic: A random variable with values influenced by the true value of a parameter of a probability distribution that is the subject of a statistical hypothesis. See hypothesis testing.
Treatment: A distinguishing characteristic of a population that a statistical hypothesis test will evaluate for significance. For example, if mileage data is collected from a series of trial runs for automobiles operated with and without a fuel additive claimed to increase mileage, the fuel additive represents the treatment.
Two sample normal test: This hypothesis test is used to draw conclusions about two population means. The test is based on sample data assumed to be drawn from two normal distributions each with known variance.
Two sample t-test: This hypothesis test is used to draw conclusions about two population means. The test is based on samples assumed to be drawn from two normal distributions each with unknown variance so the variances must be estimated from the sample data.
Type I error: This error occurs when a true null hypothesis is rejected. See hypothesis testing.
Type II error: This error occurs when a false null hypothesis is not rejected. See hypothesis testing.
Variance: A measure of sample spread or dispersion, defined as the sum of the squared differences between the values and the sample mean divided by (n-1) where n is the sample size.
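The definition above translates directly into code. The hand-written function below is a sketch whose name is hypothetical; `statistics.variance` from the Python standard library uses the same (n-1) divisor and is shown for comparison.

```python
from statistics import variance


def sample_variance(values):
    """Sum of squared differences from the sample mean, divided by (n - 1)."""
    n = len(values)
    sample_mean = sum(values) / n
    return sum((v - sample_mean) ** 2 for v in values) / (n - 1)


data = [1, 2, 3, 4, 5]
by_hand = sample_variance(data)    # squared deviations 4 + 1 + 0 + 1 + 4 = 10, divided by 4
from_library = variance(data)      # standard-library equivalent
```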