# Describe a foolproof way to determine if a variable is numerical or categorical. Give an example of each. As part of your answer you should consider and explain whether categorical variables can be numbers or not.

Math 131 – Probability and Statistics – Project 1

Northwestern Michigan College

Due: Week 4, Tuesday, 11:55pm Eastern Time

Name: Benjamin J Cooper

How do I submit this Project? (If submitting electronically, ONLY a .pdf file will be accepted.) You could…

1. (Preferred) Download this assignment to your Google Drive (every NMC student has this). You can type up your answers, and include any pictures / graphs / distributions that are required to solve the problem. Save the file as a .pdf and upload the file to Moodle under this assignment name. (You may need to save the .pdf to your computer, then upload the file from your computer.)
2. Print the assignment, fill it out by hand, scan it as a .pdf, and upload the file to Moodle under this assignment name.
3. Bring this sheet to my office (LB 33C, main campus). Please slide the sheet under the door if I’m not there.

Instructions: Please be explicit and show how you solved each of the problems. You are encouraged to use your calculator, textbook, and StatCrunch. If you are having difficulties with this project please come to my office hours and I will do my best to help you along. You should gather your thoughts on a separate sheet of paper before writing your final solutions here. I don’t want to see any scribbles or obvious erasure marks on this final copy. You should not expect to be able to hand in this project late.

Grading: Following the instructions above is worth 5 points. Each part of each problem below is worth 4 points. Be sure to show all necessary work, drawings, and diagrams, and answer in complete sentences.

1) Describe a foolproof way to determine if a variable is numerical or categorical. Give an example of each. As part of your answer you should consider and explain whether categorical variables can be numbers or not.

Numerical variables are quantitative: they record an amount of something, for example gallons of water. Categorical variables record a “quality” or characteristic, so a “type” or “category” is the best way to think of them. Good examples are the color of a car or the make of a car. The key is to pay attention to what the variable describes, using the title, legend, or units. A reliable test is to ask whether arithmetic on the values is meaningful: if adding or averaging them gives a sensible answer, the variable is numerical; if not, it is categorical. Categorical variables can still be recorded as numbers. For example, a study of men and women can code gender with numbers, with men labeled 1 and women labeled 2. Those numbers are only labels, so they cannot be combined in a meaningful way (an “average gender” of 1.5 means nothing).
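The "can the numbers be combined in a meaningful way" test can be sketched in a few lines of Python. This is only an illustration: both data sets below are made up for the example, and the coding 1 = man, 2 = woman follows the labeling described in the answer.

```python
from collections import Counter
from statistics import mean

# Hypothetical data: gender coded as 1 = man, 2 = woman (labels, not amounts).
gender_codes = [1, 2, 2, 1, 2]
# Hypothetical data: gallons of water (a genuine amount).
gallons = [12.5, 8.0, 15.25, 9.75, 11.0]

# Averaging a numerical variable has a real-world meaning.
average_gallons = mean(gallons)

# Averaging category codes runs without error, but the result means nothing.
average_code = mean(gender_codes)

# The sensible summary for a categorical variable is a count per category.
labels = {1: "man", 2: "woman"}
counts = Counter(labels[code] for code in gender_codes)

print(f"Average gallons: {average_gallons}")
print(f"'Average gender code' (not meaningful): {average_code}")
print(f"Counts per category: {dict(counts)}")
```

The computer will happily average the codes, so the decision about whether a variable is categorical has to come from what the values represent, not from whether they look like numbers.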

2) The mean weight of adult women is 140 pounds with a standard deviation of 8 pounds. The mean height of adult women is 63 inches with a standard deviation of 2.5 inches. Assume both distributions are unimodal and symmetric.

a. What does it mean for the distribution of a numerical variable to be unimodal? Also, give an example (a picture would work well as an example).

Distributions of data can have one peak or several. A distribution with one clear peak is called unimodal. For comparison, a distribution with two clear peaks is called bimodal. Below is a pictured example with graphs.

b. Which would be more unusual: to find an adult woman that weighs less than 120 pounds, or to find an adult woman who is taller than 68.5 inches? Support your answer with z-scores.

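A first guess might be that finding an adult woman taller than 68.5 inches is more unusual, because height cannot be controlled while weight can (to some extent) be controlled through diet and exercise. That is only intuition, though; the z-scores below give the statistical answer.

Weight: z = (120 – 140)/8 = -2.5, so 120 pounds is 2.5 standard deviations below the mean.

Height: z = (68.5 – 63)/2.5 = 2.2, so 68.5 inches is 2.2 standard deviations above the mean.

On the normal probability table, the area to the left of z = -2.5 is 0.0062, so about 0.62% of adult women weigh less than 120 pounds. The area to the left of z = 2.2 is 0.9861, so the area to the right is 1 – 0.9861 = 0.0139, and about 1.39% of adult women are taller than 68.5 inches. Because the weight z-score is farther from 0 and its tail area is smaller, finding an adult woman who weighs less than 120 pounds is the more unusual of the two.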
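The z-score arithmetic and the table lookups can be double-checked with a short Python sketch. It assumes a normal model for both variables (the problem only states "unimodal and symmetric", but using the normal probability table implies this), and the helper name `z_score` is made up for the illustration.

```python
from statistics import NormalDist

def z_score(value, mean, sd):
    """Number of standard deviations that value lies from the mean."""
    return (value - mean) / sd

# Means and standard deviations are taken from the problem statement.
z_weight = z_score(120, 140, 8)
z_height = z_score(68.5, 63, 2.5)

# Standard normal distribution (mean 0, standard deviation 1).
std_normal = NormalDist()

# Weight question asks about the lower tail; height question about the upper tail.
p_weight_below = std_normal.cdf(z_weight)
p_height_above = 1 - std_normal.cdf(z_height)

print(f"z (weight) = {z_weight}, P(weight < 120) = {p_weight_below:.4f}")
print(f"z (height) = {z_height:.1f}, P(height > 68.5) = {p_height_above:.4f}")
```

Note that the height question needs the upper-tail area, which is one minus the value read from a left-tail table.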

| Title | Author | Number of Pages | Publication Date | Genre | Number of Stars on Amazon.com |
| --- | --- | --- | --- | --- | --- |
| Under the Dome | Stephen King | 1074 | 2009 | Sci-Fi | 4 |
| The Circle Trilogy | Ted Dekker | 1232 | 2008 | Sci-Fi | 4.5 |
| The Da Vinci Code | Dan Brown | 454 | 2003 | Mystery | 3.5 |
| The Battle of the Labyrinth | Rick Riordan | 361 | 2008 | Fantasy | 5 |
| Marooned in Real-time | Vernor Vinge | 270 | 1986 | Sci-Fi | 4.5 |
| Empire | Orson Scott Card | 352 | 2006 | Fiction | 2.5 |
| A Game of Thrones | George R. R. Martin | 835 | 1996 | Fantasy | 4.5 |
| Hyperion | Dan Simmons | 482 | 1989 | Sci-Fi | 4.5 |
| Robopocalypse | Daniel H. Wilson | 368 | 2011 | Sci-Fi | 3.5 |

3) Refer to the table above to answer these questions. We wish to examine what relationship, if any, exists between the number of pages and the book’s rating on Amazon.

a. Make a scatterplot of the data. Clearly label the axes. Use Number of Pages as the explanatory variable.

[Figure: scatterplot of Amazon rating (vertical axis) against Number of Pages (horizontal axis)]

b. Use StatCrunch to find the equation of the least-squares regression line. Write this equation below using descriptive variables. Include a sketch of this line on the scatterplot above. Also, give the value of the correlation coefficient.

This one was tough at first using StatCrunch, but I was able to figure it out. Below are the dependent variable, the independent variable, and the equation of the least-squares regression line, which relates Amazon rating to number of pages.

Dependent Variable: Amazon Rating

Independent Variable: Pages

Amazon Rating = 3.7346008 + 0.00053216523 × Pages

Here is the line plotted on the graph itself.

[Figure: the same scatterplot with the least-squares regression line drawn in]

The correlation coefficient is:

r (correlation coefficient) = 0.24437975
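The StatCrunch output can be cross-checked by hand-coding the least-squares formulas. The sketch below is not how StatCrunch works internally; it simply applies the textbook formulas (slope = Sxy / Sxx, intercept = mean of y minus slope times mean of x, r = Sxy / sqrt(Sxx · Syy)) to the Number of Pages and Number of Stars columns of the table.

```python
from math import sqrt

# Columns copied from the book table, in table order.
pages = [1074, 1232, 454, 361, 270, 352, 835, 482, 368]
stars = [4, 4.5, 3.5, 5, 4.5, 2.5, 4.5, 4.5, 3.5]

n = len(pages)
mean_x = sum(pages) / n
mean_y = sum(stars) / n

# Sums of squared deviations and cross-products.
sxx = sum((x - mean_x) ** 2 for x in pages)
syy = sum((y - mean_y) ** 2 for y in stars)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(pages, stars))

slope = sxy / sxx                     # change in rating per extra page
intercept = mean_y - slope * mean_x   # predicted rating at 0 pages
r = sxy / sqrt(sxx * syy)             # correlation coefficient

print(f"Amazon Rating = {intercept:.7f} + {slope:.11f} * Pages")
print(f"r = {r:.8f}")
```

The results should agree with the equation and correlation coefficient reported above, up to rounding.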

c. Use the model found above to predict the Amazon rating of a book with 3450 pages. Does this answer make sense? Why or why not?

Plugging 3450 pages into the model gives Amazon Rating = 3.7346008 + 0.00053216523 × 3450 ≈ 5.57 stars. This answer does not make sense, because Amazon ratings cannot be higher than 5 stars. The problem is extrapolation: the books in the table run from 270 to 1232 pages, so 3450 pages is far outside the range of data used to build the model. The correlation is also weak (r ≈ 0.24), so the number of pages is a poor predictor of rating even inside that range. A rating is not determined by the number of pages, even if our scatterplot shows a slight relationship for these specific books.
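A quick check of the arithmetic for part c: the sketch below plugs 3450 pages into the regression equation quoted in part b and tests two things, whether the input lies outside the page range in the table (270 to 1232) and whether the output is above Amazon's 5-star maximum. The function and constant names are made up for the illustration.

```python
# Coefficients copied from the least-squares line in part b.
INTERCEPT = 3.7346008
SLOPE = 0.00053216523

# Smallest and largest page counts in the book table.
MIN_PAGES, MAX_PAGES = 270, 1232
# Amazon ratings top out at 5 stars.
MAX_STARS = 5

def predict_rating(pages):
    """Predicted Amazon rating from the fitted line."""
    return INTERCEPT + SLOPE * pages

pages = 3450
predicted = predict_rating(pages)
is_extrapolation = not (MIN_PAGES <= pages <= MAX_PAGES)
exceeds_scale = predicted > MAX_STARS

print(f"Predicted rating for {pages} pages: {predicted:.2f}")
print(f"Outside the range of the data? {is_extrapolation}")
print(f"Above the 5-star maximum? {exceeds_scale}")
```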

d. Give a well-written description of the association. At a minimum, talk about trend, strength, and shape using good statistics vocabulary.

The scatterplot above shows the number of pages of each book against its rating on Amazon. Rating is the dependent (response) variable and number of pages is the independent (explanatory) variable. The least-squares regression line (Amazon Rating = 3.7346008 + 0.00053216523 × Pages) has a positive slope, so among these books the predicted rating rises slightly as the number of pages increases. The association is weak: the correlation coefficient is only about 0.24, and the highest-rated book is one of the shorter ones (361 pages). The trend is positive and is described here with a straight line, but with only nine books and this much scatter it should not be over-interpreted.

Each part of question 4 below refers to the Publication Date variable from the table above. Unless otherwise noted, use StatCrunch.

4) a. Find the mean. 2001.8

Find the median. 2006

Find the standard deviation. 9.22

I found these using the calculator on my computer. The median is the middle year once all of the years are sorted in order. The mean is all of the years added together, divided by the number of years.
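These three summaries can be recomputed from the Publication Date column with Python's standard library. One assumption to flag: `stdev` divides by n - 1 (the sample standard deviation), which is the usual choice in an introductory statistics course; `pstdev` would give the population version instead.

```python
from statistics import mean, median, stdev

# Publication Date column copied from the book table, in table order.
years = [2009, 2008, 2003, 2008, 1986, 2006, 1996, 1989, 2011]

year_mean = mean(years)      # sum of the years divided by how many there are
year_median = median(years)  # middle value after sorting
year_sd = stdev(years)       # sample standard deviation (divides by n - 1)

print(f"Sorted years: {sorted(years)}")
print(f"Mean: {year_mean:.1f}")
print(f"Median: {year_median}")
print(f"Standard deviation: {year_sd:.2f}")
```

Printing the sorted list makes it easy to confirm the median by eye, since with nine values it is simply the fifth one.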

b. Show how to find the 5 number summary by hand, using the method discussed in class.

c. Explain clearly how can have a decimal in it even though all of the publication dates are whole numbers.

d. Find the 5 number summary using StatCrunch.

e. Why do you think your answers to “b” and “d” are different? Give a reasonable explanation of how StatCrunch calculates the 5 number summary vs how the author of your book calculates the 5 number summary.

f. Find the IQR and Range using your results from part “d”.

g. Create a boxplot, by hand, based on the 5 number summary in part “d”.

| Stem | Leaf |
| --- | --- |
| 0 | 1 2 2 4 5 9 |
| 1 | 0 0 1 1 2 2 3 5 6 |
| 2 | 2 6 9 |
| 3 | 8 |
| 4 | |
| 5 | 4 |

5) Given the stem-and-leaf plot, answer the following questions:

On this page you are expected to use stats vocabulary really well. For example, if you were asked about “center” you should return to chapters 2 and 3 to talk about “center” using some of the vocabulary and ideas described there.

a. Decompose this stem-and-leaf plot into raw data. That is, instead of creating a stem-and-leaf plot from data I want you to create the data set from this stem-and-leaf plot.

b. What shape is this distribution?

c. Should you use the five number summary or should you use the mean and standard deviation to describe this distribution? How did you know which one to choose?

d. Describe this distribution using your choice from part “c”. Be sure to use good stats words to describe the shape, center, and variation.

e. In class (in your book) we discussed a very specific method for deciding if a value is an outlier or not. Use this method to comment on any outliers. Show your work and how you determined which value(s) were potential outliers. Were any values flagged as outliers that you think should not have been? Use your best judgment to address these issues.

f. Create a box plot of this data. Be sure to label potential outliers on your boxplot.