# Stats: Data and Analytics

(c) 2013 Peter C. Bruce

Advisory Board

Jeff Witmer

William Peterson

Chris Malone

This text is based on earlier material developed for statistics.com by Dr. Robert Hayden

Table of Contents

Advisory Board

Preface

Acknowledgments

Introduction

If You Can't Measure It, You Can't Manage It

Phantom Protection From Vitamin E

Statistician, Heal Thyself

Identifying Terrorists in Airports

Looking Ahead in the Course

1 Designing and Carrying Out a Statistical Study

1.1 A Small Example

1.2 Is Chance Responsible? The Foundation of Hypothesis Testing

Interpreting This Result

Increasing the Sample Size

1.3 A Major Example

1.4 Designing an Experiment

Randomizing

Planning

Blinding

Before-After Pairing

1.5 What to Measure—Central Location

Mean

Median

Mode

Expected Value

Percents

Proportions for Binary Data

1.6 What to Measure—Variability

Range

Percentiles

Interquartile Range

Deviations and Residuals

Mean Absolute Deviation

Variance and Standard Deviation

Variance and Standard Deviation for a Sample

1.7 What to Measure—Distance (Nearness)

1.8 Test Statistic

Test Statistic for This Study

1.9 The Data

Database Format

1.10 Variables and Their Flavors

Table Formats

1.11 Examining and Displaying the Data

Errors and Outliers Are Not the Same Thing!

Frequency Tables

Histograms

Stem and Leaf Plots

Box Plots

Tails and Skew

1.12 Are We Sure We Made a Difference?

Appendix: Historical Note

2 Statistical Inference

The Null Hypothesis

2.1 Repeating the Experiment

Shuffling and Picking Numbers From a Hat or Box

2.2 How Many Reshuffles?

Boxplots

Conclusion

The Normal Distribution

The Exact Test

2.3 How Odd is Odd?

2.4 Statistical and Practical Significance

2.5 When to use Hypothesis Tests

3 Categorical Data

3.1 Other Kinds of Studies

3.2 A Single Categorical Variable

3.3 Exploring Data Graphically

Choice of Baseline and Time Period

Indexing

Per Capita Adjustment

3.4 Mendel's Peas

3.5 Simple Probability

Venn Diagrams

3.6 Random Variables and Their Probability Distributions

Weighted Mean

Expected Value

3.7 The Normal Distribution

Standardization (Normalization)

Standard Normal Distribution

Z-Tables

The 95 Percent Rule

4 Relationship Between Two Categorical Variables

4.1 Two-Way Tables

Could Chance be Responsible?

A More Complex Example

4.2 More Probability

Conditional Probability

From Numbers to Percentages to Conditional Probabilities

4.3 From Conditional Probabilities to Bayesian Estimates

Let's Review the Different Probabilities

Bayesian Calculations

4.4 Independence

Multiplication Rules

Simpson's Paradox

4.5 Exploratory Data Analysis (EDA)

5 Surveys and Sampling

5.1 Simple Random Samples

5.2 Margin of Error: Sampling Distribution for a Proportion

The Uncertainty Interval

Summing Up

5.3 Sampling Distribution for a Mean

Simulating the Behavior of Samples from a Hypothetical Population

5.4 A Shortcut—the Bootstrap

Let's Recap

A Bit of History—1906 at Guinness Brewery

6 Confidence intervals

6.1 Point Estimates

6.2 Confidence Intervals as Resample Results

Confidence Interval vs. Margin of Error

Resampling Procedure (Bootstrap)

6.3 Formula-Based Counterparts to the Bootstrap

Normal Distribution

Central Limit Theorem

FORMULA: Confidence Intervals for a Mean—Z-Interval

Example

For a Mean: T-Interval

Example—Manual Calculations

Example—Software

6.4 Standard Error

Standard Error via Formula

6.5 Beyond Simple Random Sampling

Stratified Sampling

Cluster Sampling

Systematic Sampling

Multistage Sampling

Convenience Sampling

Self Selection

Nonresponse Bias

6.6 Absolute vs. Relative Sample Size

6.7 Appendix A—Alternative Populations

6.8 Appendix B—The Parametric Bootstrap (OPTIONAL)

Resampling Procedure—Parametric Bootstrap

Formulas and the Parametric Bootstrap

7 Concepts in Inference

Confidence Intervals and Hypothesis Tests

7.1 Confidence Intervals for a Single Proportion

Resampling Steps

Binomial Distribution

Multiplication Rule (An Aside)

Normal Approximation

These Are Alternate Approaches

7.2 Confidence Interval for a Single Mean

7.3 Confidence Interval for a Difference in Means

Resampling Procedure—Bootstrap Percentile Interval

FORMULA–Confidence Interval for a Difference in Means

7.4 Confidence Interval for a Difference in Proportions

Resampling Procedure

Appendix A: Formula Procedure

The Binomial Formula (For Those Interested)

Binomial Formula Example

Normal Approximation to the Binomial

7.5 Appendix B: Resampling Procedure - Parametric Bootstrap (OPTIONAL)

7.6 Review

8 Hypothesis Tests—Introduction

P-Value

Significance or Alpha Level

Critical Value

8.1 Confidence Intervals vs. Hypothesis Tests

Confidence Interval

Relationship Between the Hypothesis Test and the Confidence Interval

Comment

8.2 Review

9 Hypothesis Testing—Two Sample Comparison

Review—Basic Two-Sample Hypothesis Test Concept

Review—Basic Two-Sample Hypothesis Test Details

Formula-Based Approaches

Practice

9.1 Comparing Two Means

Resampling Procedure

9.2 Comparing Two Proportions

Resampling Procedure

9.3 Formula-Based Alternative—T-Test for Means

9.4 The Null and Alternative Hypotheses

Formulating the Null Hypothesis

Corresponding Alternative Hypotheses

One-Way or Two-Way Hypothesis Tests

The Rule

The Why

Practice

9.5 Paired Comparisons

Paired Comparisons: Resampling

Paired Comparisons: T-Test

9.6 Appendix

Formula-Based Variations of Two-Sample Tests

Z-Test With Known Population Variance

Pooled vs. Separate Variances

Formula-Based Alternative: Z-Test for Proportions

9.7 Review

10 Additional Inference Procedures

10.1 A Single Sample Against a Benchmark

Resampling Procedure

Formula Procedure

10.2 A Single Mean

Resampling Procedure for the Confidence Interval

Formula Approach for the Confidence Interval

10.3 More than Two Samples

Count Data

The Key Question

Answer

Chi-Square Test

10.4 Continuous Data

Resampling Procedure

10.5 Appendix

Normal Approximation; Hypothesis Test of a Single Proportion

Confidence Interval for a Mean

11 Correlation

11.1 Example: Delta Wire

11.2 Example: Cotton Dust and Lung Disease

11.3 The Vector Product and Sum Test

Example: Baseball Payroll

11.4 Correlation Coefficient

Inference for the Correlation Coefficient—Resampling

Inference for the Correlation Coefficient: Formulas

11.5 Other Forms of Association

11.6 Correlation is not Causation

A Lurking External Cause

Coincidence

12 Regression

12.1 Finding the regression line by eye

Making predictions based on the regression line

12.2 Finding the regression line by minimizing residuals

12.3 Linear Relationships

Example: Workplace Exposure and PEFR

Residual Plots

12.4 Inference for Regression

Resampling Procedure for a Confidence Interval (the pulmonary data)

Using Resampling Stats with Excel (the pulmonary data, cont.)

Formula-based inference

Interpreting Software Output

13 Analysis of Variance—ANOVA

13.1 Comparing more than two groups: ANOVA

13.2 The Problem of Multiple Inference

13.3 A Single Test

13.4 Components of Variance

Decomposition: The Factor Diagram

Constructing the ANOVA Table

Resampling Procedure

Inference Using the ANOVA Table

The F-Distribution

Different Sized Groups

Caveats and Assumptions

13.5 Two-Way ANOVA

Resampling Approach

Formula Approach

13.6 Factorial Design

Stratification

Blocking

13.7 Interaction

Additivity

Checking for Interaction

14 Multiple Regression

14.1 Regression as Explanation

14.2 Simple Linear Regression -- Explore the Data First

Antimony is negatively correlated with strength

Is there a linear relationship?

14.3 More Independent Variables

Multiple Linear Regression

14.4 Model Assessment and Inference

R²

Inference for Regression—Holdout Sample

Confidence Intervals for Regression Coefficients

Bootstrapping a Regression

Inference for Regression—Hypothesis Tests

14.5 Assumptions

Violation of Assumptions—Is the Model Useless?

14.6 Interaction, Again

Original Regression With No Interaction Term

The Regression With an Interaction Term

Does Crime Pay?

14.7 Regression for Prediction

Tayko

Binary And Categorical Variables in Regression

Multicollinearity

Tayko—Building the Model

Reviewing the output

Predicting New Data

Summary

Index

Preface

This text was developed by Statistics.com to meet the needs of its introductory students, based on experience in teaching introductory statistics online since 2003. The field of statistics education has been in ferment for several decades. With this text, which continues to evolve, we attempt to capture two important strands of recent thinking:

1. Guidelines for the introductory statistics course, developed in 2005 by a group of noted statistics educators, with funding from the American Statistical Association. These Guidelines for Assessment and Instruction in Statistics Education (GAISE) call for the use of real data with active learning, stress statistical literacy and understanding over memorization of formulas, and encourage the use of software to develop concepts and analyze data.

2. The use of resampling/simulation methods to develop the underpinnings of statistical inference (the most difficult topic in an introductory course) in a transparent and understandable fashion.

We start off with some examples of statistics in action (including two of statistics gone wrong), then dive right in to look at the proper design of studies and account for the possible role of chance. All the standard topics of introductory statistics are here (probability, descriptive statistics, inference, sampling, correlation, etc.), but sometimes they are introduced not as separate standalone topics but rather in the context of the situation in which they are needed.