An Introduction to Data Analysis & Presentation

Prof. Timothy Shortell, Sociology, Brooklyn College

The Normal Curve

Earlier, we saw a frequency distribution based on empirical results. We could build a line graph to reflect the relative frequencies. This would show us the empirical probabilities in the distribution. We can extend this notion by constructing a frequency distribution based on theoretical probabilities instead of empirical results.




For example, if we have ten marbles, of which six are black, three are red and one is green, we can build a corresponding probability distribution. There is a 60% chance that a marble selected at random would be black, a 30% chance that it would be red and a 10% chance that it would be green.
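The marble example can be sketched in a few lines of Python. This is only an illustration; the variable names are my own, not part of the course materials.

```python
from collections import Counter
from fractions import Fraction

# The ten marbles described in the text: 6 black, 3 red, 1 green.
marbles = ["black"] * 6 + ["red"] * 3 + ["green"] * 1

counts = Counter(marbles)
total = len(marbles)

# Theoretical probability of each color = number of that color / total marbles.
distribution = {color: Fraction(n, total) for color, n in counts.items()}

for color, p in distribution.items():
    print(f"{color:<6} {float(p):.0%}")
```

The probabilities sum to 1, as they must for any probability distribution.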




Now, imagine that we toss four coins. We can build a probability distribution that reflects the likelihood of each possible number of tails, from zero to four.
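One way to build the four-coin distribution is to list every equally likely outcome and count the tails in each. The sketch below assumes fair coins; the names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Every equally likely sequence of four fair coin tosses: 2**4 = 16 in all.
outcomes = list(product("HT", repeat=4))

# How many of the 16 sequences contain 0, 1, 2, 3 or 4 tails?
tails_counts = Counter(seq.count("T") for seq in outcomes)

# Probability of k tails = sequences with k tails / all sequences.
coin_distribution = {
    k: Fraction(tails_counts[k], len(outcomes)) for k in range(5)
}

for k, p in coin_distribution.items():
    print(f"{k} tails: {tails_counts[k]}/16 = {float(p):.4f}")
```

The counts come out as 1, 4, 6, 4 and 1 out of 16, so two tails is the most likely result and the distribution is symmetrical around it.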




Probability distributions can be viewed as empirical frequency distributions for an infinite number of cases. Practically speaking, this means that empirical frequency distributions for very large data files will tend to approximate the theoretical distribution more closely than frequency distributions for small data files.
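A small simulation can illustrate the point. The sketch below tosses four simulated fair coins a small and a large number of times and reports how far the empirical relative frequencies fall from the theoretical probabilities. The seed, the sample sizes and the function names are arbitrary choices of mine.

```python
import random
from collections import Counter

# Theoretical probabilities for the number of tails in four fair coin tosses.
THEORETICAL = {0: 1 / 16, 1: 4 / 16, 2: 6 / 16, 3: 4 / 16, 4: 1 / 16}


def empirical_distribution(n_cases, rng):
    """Toss four fair coins n_cases times; return the relative frequency of each tails count."""
    counts = Counter(
        sum(rng.random() < 0.5 for _ in range(4)) for _ in range(n_cases)
    )
    return {k: counts[k] / n_cases for k in range(5)}


def max_gap(empirical):
    """Largest absolute difference between empirical and theoretical probabilities."""
    return max(abs(empirical[k] - THEORETICAL[k]) for k in THEORETICAL)


rng = random.Random(1999)  # fixed seed so the run is repeatable
for n in (20, 100_000):
    gap = max_gap(empirical_distribution(n, rng))
    print(f"n = {n:>7}: largest gap from theory = {gap:.4f}")
```

With only 20 cases the gap is usually several percentage points; with 100,000 cases it is typically a fraction of one point.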




The Normal Curve
One very important probability distribution is the normal curve, sometimes called the bell-shaped curve. It plays a central role in the statistical decision making process.




The normal curve has a number of important properties.

  1. It is symmetrical.
  2. It is unimodal.
  3. The area under the curve represents proportion, or probability.




Since the area under the curve represents proportion, we can calculate the percent of cases to be expected between some given point and the mean, for example.




We can, in effect, mark off the proportions on a scale of standard deviations. About 34.13% of cases fall between the mean and one standard deviation above the mean.













Likewise, about 47.72% of cases fall between the mean and two standard deviations above the mean.















The area between the mean and three standard deviations above it is just about 50% (49.87%).
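The proportions marked off on the standard deviation scale can be checked with Python's standard library. This is a minimal sketch using `statistics.NormalDist`; the helper name `area_mean_to` is my own.

```python
from statistics import NormalDist

# The standard normal curve: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)


def area_mean_to(k):
    """Proportion of cases between the mean and k standard deviations above it."""
    # cdf(k) is the area to the left of k; half of the curve lies below the mean.
    return standard_normal.cdf(k) - 0.5


for k in (1, 2, 3):
    print(f"mean to +{k} SD: {area_mean_to(k):.2%}")
# mean to +1 SD: 34.13%
# mean to +2 SD: 47.72%
# mean to +3 SD: 49.87%
```

Because the curve is symmetrical, doubling each figure gives the proportion within that many standard deviations on either side of the mean.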















Since the curve is symmetrical, we know that nearly all cases (about 99.7%) fall within a span of about six standard deviations: three above the mean and three below. (Remember our use of R/6 to judge the relative size of the standard deviation? This is the explanation for the denominator.)







The proportions can be expressed as probabilities. For example, the area under the curve to the left of a given point represents the probability that a score drawn at random from the distribution falls at or below that point.















Standard Scores
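We can formalize this knowledge by expressing any score in standard deviation units and then calculating the area under the curve. We call these standard scores, or z-scores. The z-score for a score x is z = (x - mean) / standard deviation; that is, the distance of the score from the mean, measured in standard deviations.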
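As a quick illustration, a z-score is the score's distance from the mean divided by the standard deviation. The function below is a minimal sketch; the distribution in the usage lines (mean 100, standard deviation 15) is hypothetical and not taken from the course data.

```python
def z_score(x, mean, sd):
    """Express the score x in standard deviation units."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mean) / sd


# Hypothetical distribution: mean 100, standard deviation 15.
print(z_score(130, 100, 15))  # 2.0  (two standard deviations above the mean)
print(z_score(85, 100, 15))   # -1.0 (one standard deviation below the mean)
```

A positive z-score lies above the mean and a negative one lies below it.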




We don't have to calculate the area under the curve ourselves. (whew!) Statistics textbooks have a table in which probabilities are listed for each standard score. There are also several places to find this information on the web. The best site I've seen is Professor P. B. Stark's online statistics text, SticiGui.
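If no table is at hand, the same values can be computed. The sketch below reproduces two columns commonly found in a z table, the area between the mean and z and the area beyond z, using `statistics.NormalDist` from Python's standard library. Column layout varies from one textbook to another, so the function name and the choice of columns are assumptions of mine.

```python
from statistics import NormalDist

_standard_normal = NormalDist()  # mean 0, standard deviation 1


def table_row(z):
    """Return (area between the mean and z, area beyond z) for a z-score."""
    z = abs(z)  # the curve is symmetrical, so tables list positive z only
    between = _standard_normal.cdf(z) - 0.5
    beyond = 1.0 - _standard_normal.cdf(z)
    return between, beyond


for z in (0.50, 1.00, 1.96, 2.58):
    between, beyond = table_row(z)
    print(f"z = {z:.2f}   mean to z: {between:.4f}   beyond z: {beyond:.4f}")
```

For any z, the two areas add up to 0.5, which is the half of the curve on one side of the mean.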




Some Examples
We have a distribution of household incomes for Kings County, with a mean of $22,500 and a standard deviation of $1,725. What is the probability that a randomly selected household will have an income greater than $25,000?




First, we need to calculate the z-score for an x of $25,000: z = (25,000 - 22,500) / 1,725 = 1.45. Next, we look up this z-score in the table. We find that the area beyond this z-score is 7.35%. Thus, the probability of selecting a household with an income of more than $25,000 at random from this population is 0.0735.
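The same steps can be followed in code. This sketch rounds the z-score to two decimals, as one would before looking it up in a printed table, and takes the area in the upper tail; the variable names are my own.

```python
from statistics import NormalDist

mean_income = 22_500  # mean household income, from the example
sd_income = 1_725     # standard deviation, from the example

# Step 1: the z-score for x = $25,000, rounded as for a printed table.
z = round((25_000 - mean_income) / sd_income, 2)

# Step 2: the area beyond z (the upper tail of the standard normal curve).
p_above = 1 - NormalDist().cdf(z)

print(f"z = {z:.2f}, P(income > $25,000) = {p_above:.4f}")
# z = 1.45, P(income > $25,000) = 0.0735
```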







On the normal curve, this would appear thus.















What is the probability of selecting a household at random with an income of $19,000 or less? We start with the z-score: z = (19,000 - 22,500) / 1,725 = -2.03. Next, we look up this score in the table and find that the area beyond it is 2.12%. Thus, the probability of selecting a household with an income of $19,000 or less at random from this population is 0.0212.
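As a cross-check, the lower-tail probability can also be computed directly in dollars, without converting to a z-score first. This is a sketch using `statistics.NormalDist` with the example's mean and standard deviation; because z is not rounded to two decimals here, the result can differ from the table value in the later decimal places.

```python
from statistics import NormalDist

# Household income modeled as a normal curve with the example's mean and SD.
incomes = NormalDist(mu=22_500, sigma=1_725)

# cdf(x) is the area under the curve at or below x.
p_below = incomes.cdf(19_000)

print(f"P(income <= $19,000) = {p_below:.4f}")
```

To four decimal places this agrees with the table lookup.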







We can depict this in a graph.















Another example: We have a distribution of contributions from corporate PACs in 1996, in thousands of dollars. The mean is 68.5 with a standard deviation of 12.9. What is the probability of selecting a PAC, at random from this population, that contributed between 60 and 70 thousand dollars in 1996? Start by calculating the z-score for the x of 60.







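Now, for x=70. Look each of these up in the table, using column (b), which gives the area between the mean and the z-score. This yields 24.54% for x=60 (z = -0.66) and 4.78% for x=70 (z = 0.12). Because the two scores fall on opposite sides of the mean, add these areas together to find the probability. Thus, the probability of selecting a PAC that contributed between 60 and 70 thousand dollars in 1996 is 0.2932.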
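The two lookups and the addition can be sketched as follows. The z-scores are rounded to two decimals to mirror a printed table, and the variable names are illustrative. The computed sum may differ from the table result in the fourth decimal place, because the table rounds each area before adding.

```python
from statistics import NormalDist

mean_k, sd_k = 68.5, 12.9  # contributions in thousands of dollars
std = NormalDist()         # standard normal curve

# z-scores for the two boundaries, rounded as for a printed table.
z_low = round((60 - mean_k) / sd_k, 2)   # below the mean, so negative
z_high = round((70 - mean_k) / sd_k, 2)  # above the mean, so positive

# Area between the mean and each z-score.
area_low = std.cdf(abs(z_low)) - 0.5
area_high = std.cdf(abs(z_high)) - 0.5

# The boundaries straddle the mean, so the two areas are added.
p_between = area_low + area_high

print(f"z = {z_low:.2f} and {z_high:.2f}")
print(f"areas = {area_low:.4f} + {area_high:.4f} = {p_between:.4f}")
```

Had both boundaries been on the same side of the mean, the smaller area would be subtracted from the larger instead.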




On the graph, this looks like this.















All materials on this site are © 1999, Professor Timothy Shortell, unless otherwise indicated. All rights reserved. Please let me know if you link to this site or use these materials.




All materials on this site are copyright © 2001, by Professor Timothy Shortell, except those retained by their original owner. No infringement is intended or implied. All rights reserved. Please let me know if you link to this site or use these materials.