An Introduction to Data Analysis & Presentation

Prof. Timothy Shortell, Sociology, Brooklyn College

The Sampling Distribution

We have been discussing some concepts that imply a random sample, a random selection from a distribution, and so forth. We now need to be more precise about what we mean by random in these cases.

We define a random sample as one in which every element in the population has an equal, nonzero chance of being selected into the sample.

A random sample is the best way of generating a sample that is representative of its population. It does not guarantee such an outcome -- we will discuss sampling error later. But it is usually the best way to draw a sample.

It may seem counter-intuitive, at first, that random sampling is a better strategy than purposive sampling -- where one goes out and looks for people representing the larger population. The key to its success is that every element has an equal chance of being selected, so the researcher's own judgment cannot systematically favor some elements over others.
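As a minimal sketch of the definition above, the following Python fragment draws a simple random sample. The population of 1,000 household ID numbers, the sample size of 50, and the seed are all made up for illustration.

```python
import random

# Hypothetical population: ID numbers for 1,000 households.
population = list(range(1, 1001))

rng = random.Random(42)  # fixed seed so the draw can be reproduced

# random.sample draws without replacement, and every element has the
# same chance (here 50 in 1,000) of ending up in the sample.
sample = rng.sample(population, k=50)

print(len(sample), "households selected, for example:", sorted(sample)[:5])
```

No household can be selected twice here, and no household is favored over another.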

In practice, we sometimes modify random sampling, combining it with other strategies designed to target specific groups in the population. When a population includes a small minority group, we often have to oversample that group. Sometimes we identify important characteristics that define the population and stratify the sample along them. Both of these techniques, though, still rely on random sampling to select elements within each subgroup.

Random sampling allows the researcher to generalize characteristics of the sample to the population. This kind of inference is based on the well-studied sampling distribution.
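One way to picture stratifying and oversampling is the sketch below. The group labels, the group sizes (950 and 50), and the quotas (80 and 20) are hypothetical choices, not a recommended design. Elements are still chosen at random inside each stratum.

```python
import random
from collections import Counter

rng = random.Random(7)  # fixed seed for a reproducible draw

# Hypothetical population: 950 people in a majority group, 50 in a minority group.
population = [("majority", i) for i in range(950)] + [("minority", i) for i in range(50)]

# Split the population into strata by group.
strata = {}
for group, person_id in population:
    strata.setdefault(group, []).append((group, person_id))

# Assumed quotas: a proportionate sample of 100 would include only about
# 5 minority members, so we oversample that group and take 20.
quota = {"majority": 80, "minority": 20}

# Within each stratum, selection is still a simple random sample.
sample = []
for group, members in strata.items():
    sample.extend(rng.sample(members, k=quota[group]))

counts = Counter(group for group, _ in sample)
print(counts)
```

Because the minority group is overrepresented by design, estimates for the whole population would normally weight each stratum back to its true share.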

Properties of the Sampling Distribution
The most important topic in the introductory statistics course is the logic of inference, and this begins with the sampling distribution.

Imagine some population with a particular mean, mu, and standard deviation, sigma, on some variable. We know that scores vary around the mean, some larger and some smaller. On average, scores differ from mu by sigma.

If we take a random sample, we can calculate its mean, x-bar.

Now, imagine that we take repeated random samples from this population, and calculate the mean for each.

We can take these sample means and generate a frequency distribution. We can define the mean of this distribution, X-bar sub-x-bar, and its standard deviation, sigma sub-x-bar.

Most of the sample means will be relatively close to the population mean. Some will be larger and some smaller. On average, sample means will differ from X-bar sub-x-bar by sigma sub-x-bar.

If we were to draw a large number of samples, the frequency distribution of the sample means would approximate the normal curve. This allows us to take advantage of the properties of the normal curve.

These are the things you want to remember about the sampling distribution:

  1. when each sample is relatively large and we draw a sufficiently large number of samples, the sampling distribution approximates the normal curve;

  2. the mean of the sampling distribution equals the population mean;

  3. the standard deviation of the sampling distribution is less than the standard deviation of the population. We call the standard deviation of the sampling distribution the standard error.
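These three properties can be checked with a small simulation. The sketch below is illustrative only: the population (20,000 scores spread evenly between 0 and 100), the sample sizes, the number of samples, and the seed are all assumptions. Note that the population itself is not normal.

```python
import random
import statistics

rng = random.Random(11)  # fixed seed for a reproducible run

# Hypothetical population: 20,000 scores spread evenly between 0 and 100.
population = [rng.uniform(0, 100) for _ in range(20_000)]
mu = statistics.fmean(population)
sigma = statistics.pstdev(population)

def sampling_distribution(n, num_samples=1000):
    """Return the means of num_samples random samples, each of size n."""
    return [statistics.fmean(rng.sample(population, k=n)) for _ in range(num_samples)]

print(f"population: mean={mu:.2f}  sd={sigma:.2f}")

results = {}
for n in (25, 100, 400):
    means = sampling_distribution(n)
    center = statistics.fmean(means)       # mean of the sample means
    std_error = statistics.pstdev(means)   # standard deviation of the sample means
    # Share of sample means within two standard errors of the center;
    # the normal curve puts roughly 95% of its area in that band.
    within_two = sum(abs(m - center) <= 2 * std_error for m in means) / len(means)
    results[n] = (center, std_error, within_two)
    print(f"n={n:3d}  mean={center:6.2f}  std error={std_error:5.2f}  within 2 SE={within_two:.1%}")
```

In a run like this, the mean of the sample means should sit close to the population mean. The standard error should be well below the population standard deviation, and it should shrink as the sample size grows.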

The Z-Test
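Since the sampling distribution has the characteristics of the normal curve, we can generalize the notion of a standard score. We can calculate a z-score for a sample mean: subtract the population mean, mu, from the sample mean, x-bar, and divide the difference by the standard error, sigma sub-x-bar.

This z-score tells us the distance between the sample mean and the population mean, in standard error units.

The denominator of this formula is defined as the standard error. It is calculated as the population standard deviation, sigma, divided by the square root of the sample size, n.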
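In symbols, the two formulas can be sketched as follows. This is the standard textbook form, and it reproduces the z-scores in the worked examples later in this section. Writing n for the sample size is a notational assumption.

```latex
z = \frac{\bar{x} - \mu}{\sigma_{\bar{x}}},
\qquad
\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}
```

Because n sits under a square root, quadrupling the sample size only halves the standard error.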

We can mark off the area under the curve in standard error units. We can think of the area under the curve as having a range of about six standard errors -- just as the normal curve has a range of about six standard deviations.

We can use this knowledge to estimate the probability of drawing, from a given population, a sample whose mean is a specific value or larger, for example.

Let's look at the relationship between the population standard deviation and the standard error, the standard deviation of the sampling distribution, in two worked examples.
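The "about six standard errors" rule of thumb can be checked against the standard normal curve. This short sketch uses Python's built-in NormalDist and computes the area within one, two, and three standard errors of the mean.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Area under the curve within k standard errors on either side of the mean.
areas = {}
for k in (1, 2, 3):
    areas[k] = standard_normal.cdf(k) - standard_normal.cdf(-k)
    print(f"within {k} standard error(s) of the mean: {areas[k]:.2%}")
```

Three standard errors on either side of the mean, a range of six in all, cover about 99.7% of the area. That is why we can treat the curve as spanning about six standard errors.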

We have a distribution of household incomes for Kings County, with a mean of $22,500 and a standard deviation of $1,725. What is the probability that a random sample of 100 households will yield a mean greater than $23,000?

First, we need to calculate the z-score for a sample mean of $23,000. The standard error is $1,725 divided by the square root of 100, or $172.50, so z = ($23,000 - $22,500) / $172.50 = 2.90. Next, we look up this z-score in the table, using column (c). We find that this yields a value of 0.19%. Thus, the probability of selecting a sample with a mean household income of $23,000 or more is 0.0019. This is a rare event!
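The same calculation can be sketched in Python, with NormalDist standing in for the printed table. The variable names are our own. The z-score is rounded to two decimals only to mimic a table lookup.

```python
from math import sqrt
from statistics import NormalDist

mu = 22_500     # population mean household income, in dollars
sigma = 1_725   # population standard deviation, in dollars
n = 100         # sample size
x_bar = 23_000  # sample mean of interest

standard_error = sigma / sqrt(n)    # 1725 / 10 = 172.5
z = (x_bar - mu) / standard_error   # 500 / 172.5, about 2.90

# Area in the upper tail beyond z, the quantity read from column (c).
p_upper = 1 - NormalDist().cdf(round(z, 2))

print(f"standard error = {standard_error:.2f}")
print(f"z = {z:.2f}")
print(f"P(sample mean > {x_bar}) = {p_upper:.4f}")
```

The printed probability rounds to 0.0019, matching the table value.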

We have educational data from every county in New York State. For each county, we know the high school dropout rate. The mean rate is 6.25 per 1000 students. The standard deviation is 11.9 per 1000 students. What is the probability of drawing a random sample of 75 counties with a mean dropout rate below 5 per 1000 students?

First, calculate the z-score. The standard error is 11.9 divided by the square root of 75, or about 1.37, so z = (5 - 6.25) / 1.37 = -0.91. Look the z-score up in the table. This yields a value of 18.14%. There is, then, an 18.14 percent chance of drawing a sample with a mean dropout rate below 5 per 1000 students from this population.
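Here the sample mean falls below the population mean, so we want the area in the lower tail. A sketch of the calculation follows. The variable names are our own, and z is rounded to two decimals only to mimic a table lookup.

```python
from math import sqrt
from statistics import NormalDist

mu = 6.25     # mean dropout rate per 1000 students
sigma = 11.9  # standard deviation per 1000 students
n = 75        # number of counties in the sample
x_bar = 5.0   # sample mean of interest

standard_error = sigma / sqrt(n)   # about 1.37
z = (x_bar - mu) / standard_error  # about -0.91

# Area in the lower tail, below z.
p_lower = NormalDist().cdf(round(z, 2))

print(f"standard error = {standard_error:.2f}")
print(f"z = {z:.2f}")
print(f"P(sample mean < {x_bar}) = {p_lower:.4f}")
```

The result rounds to 0.1814, in line with the table. Using the unrounded z-score changes the answer only slightly, in the fourth decimal place.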

All materials on this site are copyright © 2001, by Professor Timothy Shortell, except those retained by their original owner. No infringement is intended or implied. All rights reserved. Please let me know if you link to this site or use these materials.