Chapter 2 Data Exploration
2.1 Plots
score | grade | texting | questions | participation |
---|---|---|---|---|
26.89 | F | never | never | 0.41 |
71.57 | B | always | rarely | 0.00 |
90.11 | A | always | never | 0.27 |
31.52 | D | sometimes | rarely | 0.68 |
95.94 | A | always | rarely | 0.09 |
45.72 | D | always | rarely | 0.19 |
90.82 | A | always | always | 0.25 |
75.52 | B | sometimes | never | 0.28 |
52.31 | C | never | never | 0.67 |
39.57 | D | always | always | 0.40 |
2.1.1 Scatter Plot
- Scatter Plot are used to plot points on the Cartesian plane (X-Y Plane)
- Hence it is used when both the labels are numerical values.
2.1.2 Bar Plot
- A bar plot is used to plot rectangular bars proportional to the values present in a numerical vector.
- This rectangle height is proportional to the value of the variable in the vector.
- Barplots are also used to graphically represent the distribution of a categorical variable, after converting the categorical vector into a table(i.e. frequency distribution table)
- In a bar plot, you can also give different colors to each bar.
2.1.3 Box Plot
- A boxplot shows the distribution of data in a dataset.
- A boxplot shows the following things:
- Minimum
- Maximum
- Median
- First quartile
- Third quartile
- Outliers
- You can create a single boxplot using just a vector or a multiple boxplot using a formula.
- When you write a formula, you should use the Tilde (~) operator. This column name on the left side of this operator goes on the y axis and the column name on the right side of this operator goes on the x axis.
2.1.4 Mosiac Plot
- Mosaic plot is a graphical method for visualizing data from two or more qualitative variables.
- The length of the rectangles in the mosaic plot represents the frequency of that particular value.
- The width and length of the mosaic plot can be used to interpret the frequencies of the elements. -For example, if you want to plot the number of individuals per letter grade using a smartphone, you want to look at mosiac plot.
2.2 Free Style data exploration with just seven R commands " R.7 "
Now as you know to make simple, but quite colorful, basic graphs it is time to go one step further and use the plots to explore your data. This is the subject of the process known as data exploration.
We will begin with what we call, the free style data exploration. We call it free style, since we are not going to use any sophisticated libraries, but rather just seven basic of commands of R. We call this set of commands - R.7. And with these seven commands of R.7 we will be able to do quite a bit of data exploration. We will be able to slice and dice data any way we want to.
No statistics yet, and no more sophisticated R programs.These will come later. For now, we are just feeling the data with four plots and three R instructions: subset(), table() and tapply()
Through this section we will use our usual initial example, of synthetically generate data describing mysterious methods of grading used by an eccentric Professor Moody. We have been using different versions of this data puzzle over the 6 years of teaching data 101. Data is different but narrative is always the same:
2.2.0.1 Professor Moody Puzzle
Professor Moody has been teaching statistics 101 class for many years. His teaching evaluations went considerably south with the chief complaint: he DOES NOT seem to assign grades fairly. Students compared their scores among themselves and found quite a bit of discrepancies! But their complaints went nowhere since Professor promptly disappeared after posting the final grades and scores.
A new brave TA, managed to get hold of the carefully maintained grading table (spanning multiple years) of professor Moody by ….messing a bit with Moody’s computer….well, let’s not explain the details because he would get in trouble. What he found out was a remarkably structured account of how professor Moody assigns his grades.
Looks like Professor Moody is in fact very alert in class. He is aware of what students do, detecting texting during class and remembering exactly who asked many questions in class. He also keeps the mysterious “participation index” which is a numerical score from 0 to 1. This is probably related to questions asked and answered by students as well as their general attentiveness in class. Remarkable but a little creepy, isn’t it?
What is the best advice the new TA, can give future students how to get a good grade in Professor Moody’s class? What factors influence the grade besides the score? Back your recommendation up with plots and evidence from the attached data.
score | grade | texting | questions | participation |
---|---|---|---|---|
26.89 | F | never | never | 0.41 |
71.57 | B | always | rarely | 0.00 |
90.11 | A | always | never | 0.27 |
31.52 | D | sometimes | rarely | 0.68 |
95.94 | A | always | rarely | 0.09 |
45.72 | D | always | rarely | 0.19 |
90.82 | A | always | always | 0.25 |
75.52 | B | sometimes | never | 0.28 |
52.31 | C | never | never | 0.67 |
39.57 | D | always | always | 0.40 |
Moody[score, grade, participation, questions, texting] Score and grade are self explanatory. Participation is supposedly measuring students participation in class. We do not know whether higher participation would necessarily be positive, since Professor Moody’s mood changes from year to year and he may be annoyed by students who are too active and bother him too much. Who knows? We have to find out.
Attribute “questions” has several values “always”, “frequently”, “sometimes” and “never”. So does the attribute “texting”. In our data set there are students who are are always texting and who never ask any questions. Oh, yes, and some students participation index is almost zero. Guess what grade are they getting? F, you probably guess. Well, but about their score.It should matter probably. At least to some degree!
Grading rules, which Professor Moody applies each year are different. The objective of our free style data exploration is to find some leads/hypotheses which would help us direct students what they should do to get a good grade in his class.
This is a good illustration of what data exploration is and can achieve. It is just an example, but one can of course easily see that, things we discuss here, applies to any data sets.
Data exploration can be viewed as an indefinite loop:
REPEAT{
Plot,one or many plots.
Transform Data. } UNTIL GRATIFICATION
Put yourself in the position of a student in Moody’s class. What does s/he want to know?
What should I do in order to pass his class aside from getting the best score possible?
Ask many questions?
Do not text?
Come to class as often as you can? Presumably improving participation index?
What is the approach?
First you need “kick the tires”, make some plots, feel the data and perhaps rule out the obvious. In case of Professor Moody data it may mean the following: - Test if straightforward mapping of scores into grades work in Professor Moody’s class. Admittedly it is a long shot. We expect more from professor Moody than just merely following the scoring intervals with A above, say 85, B between 70 and 85 etc!
But we need to establish that it is not the case quickly. Since it would be embarrassing to miss the obvious and simplest recommendation. Just score as high as you can. Otherwise no need to come to class, and in class you can text as much as you want to and ask no questions. Does not matter what else you do. You may never ask any questions, always text in class or simply…never even show up. All it matters is score!
There is just one plot which can quickly establish whether this simple rule works. And it is the boxplot.
As illustrated by the boxplot, there are significant overlaps between successive grades. This disproves that there is a deterministic function between score and grade. At least it is not always a function. If your score falls in certain “gray” areas you may get either one of two grades (A or B, B or C, C or D, D or F). And we do not know what is this additional “decider” in such case when score falls into this gray area.
Here is how we can check which factors may impact the grade. One way of doing this analysis is to make barplots for all possible slices of Moody data frame by a given categorical variable
For example,we want to know if asking questions “matters” for the grade? This can be validated by comparing barplots of grade distribution for different values of attribute “questions”.You can either do it by applying the mosaic plot which allows for two-dimensional representation of data and allows to create multicolored table for grade x questions to eyeball if values of attribute “questions” matter for values of attribute “grade”.
To dig deeper into the relationship of categorical variables “questions” and “texting” with “grade” we will use sequence of bar plots over subsets of the Moody data frame. Then we will follow with the mosaic plots.
The following slices represent subsets of the Moody data frame for each of the values of the attribute “questions”
The command \(\color{violet}{\text{table}}\) (one of the 7 commands) will provide us grade distribution for each of these slices.
And the barplot, will visualize this table.
Let’s look at the example of the above process for students who always ask question.
We can also run two mosaic plots of GRADE vs “questions” or “texting” respectively - and be able to asses the same - do these attributes matter for the grade?
In the following command we can combine attribute grade with anyone of the behavioral attributesThis can be concluded by comparing different columns and rows of the mosaic table. If grade distribution is similar for different values of behavioral attributes, this would indicate that these attributes do not matter in the establishing the grade. On the other hand we may “catch professor Moody” and find out that for some value of some attribute, grade distribution is significantly affected. This was the case several years ago when students sitting in the first row, got a grade bump up, even if they get similar scores to students sitting in the back row. In that case one of the extra attributes included the row where students were sitting during class.
We can see that asking many questions (frequently and always) really matters for the grade, there is more A’s and more B’s for these slices than in general.
But this may have nothing to do with Professor Moody rewarding students with the bonus for asking questions. It may be simply the case that such students are more involved and study harder (or are more interested in the topic) and simply get higher scores. We need to dig deeper and see which of the two is the case. Moody’s just gives his personal bonus to students who ask a lot of questions or no such bonus is given – such students simply score higher.
We can accomplish this using again one of the seven R commands – the tapply.
## always never rarely
## 51.08277 56.32474 53.69217
will return average score for each of the values of the attribute moody$questions.
If this values are more or less uniform than it will informally (not statistically yet, for this we have to wait for the next sections) show that questions matter in professor moody grading method and are not just correlated with student’s score. Take a look atWhat is the conclusion?
Does asking questions often imply higher score? Or it does not affect the score but affects the grade through the grading rules of Professor Moody.
Similarly, does asking questions often imply higher score? Or it does not affect the score but affects the grade through the grading rules of Professor Moody.
shows that mean scores are the same across different values of the “texting”attribute. Same is true for the mean scores of “questions” attribute.
Neither do these two attributes “texing” and “questions” seem to have impact on the grade.
Therefore it seems that the “behavioral” attributes: questions and texting do not seem to have an impact on the grade.
Lets examine participation attribute. This is the only culprit left - to explain different grades in the “grey” intervals of score.
We define intervals of score as clear, if there is only one grade associated with scores from such interval. The remaining intervals are defined as grey - scores where grade can be either A or B, B or C, C or D and D or F respectively.
Then we can examine how participation influences grades in these gray areas of score. Our hypothesis is that higher participation would probably offer better odds for higher grade. We can run the following command for different values of q.
Let q be a threshold of participation which we want to test. May be if participation is higher than q, higher grade (from the two possible grades in the gray area of score) is given, while if participation is lower than q, it works against a student, who then gets lower grade?
Lets check how the grade distribution changes for different values of q from the lower values of q to higher values of q. We can just change q directly in the code below, and see results immediately.
Run the following command for different values of q. We will only show it for the grey interval between A and B. The same process can be repeated for other gray areas between B and C, C and D and D and F. In fact one can modify the code below by just replacing grade A and B with B and C respectively as well as replacing the variables LowestA with Lowest B and Highest B with Highest C respectively.
Please verify that for higher values of 0<q<1, the table operation show higher percentages of better grades.
Is there a critical value of q which clearly separates, say A’s from B’s? It seems to be q=0.6 - but it is not a clear cut deterministic. Rather, higher value of participation threshold, q increases probability (frequency) of getting As.
We come to conclusion that participation matter in grey area of score, in having higher chance for better grade, if participation is higher. Thus, just in case (since no one can predict if they will end up in border line score) it is better to earn high participation index – by (probably) coming to class more often and participating in discussions, and answering professor Moody’s questions.
Simply put “come to as many classes as you can”. But while in class, do not worry about texting or asking questions. These two attributes do not seem to matter.
Now we can reveal how data was generated? What was the real rule embedded in the data.
Now it is time to reveal how we generated our data?
This what we embedded in our data generation method: participation increases the chances of higher grade in the gray areas of score (border areas).
We have indeed defined border areas in score value. In this border areas of score, participation plays a role. Student whose’s score falls into the grey area may get one of two grades, A or B, B or C, C or D and D or F, depending on the score.For example score of 72 may result in A or B. It is more likely to be A if student’s participation is high (higher the better the odds of getting A). If student’s participation is low, it is much more likely to result in lower grade, for the score of 72, it would be B.
Therefore we have discovered more or less the rule which guided generation of the Moody’s data set and provided students with actionable intelligence how to increase chances of getting higher grade.
Relationship between participation, score and grade
In the process of slicing, dicing and plotting the data we would also discover other interesting relationships still using just 7 commands.
Does higher participation mean higher score, in general? Meaning that coming to class is positively correlated with higher score?
We can run scatter plotTo our surprise it looks like the higher the participation, the lower the score! The distinct linear patterns in scatter graph seem to be sloping down with participation. We are tempted to infer that participation is bad for score, that somehow Moody’s lectures have negative impact on the score - hence do not carry pedagogical value. Such conclusion is typical confusion between correlation and causation. What is true in our simulated data set - that students had some prior background in the subject matter of Professor Moody just do not show up in class that often. They already know the material. Students who do show up are the ones who are not confident in their knowledge of the subject matter - in general “weak A’s” and below. Then of course lower grade students (D’s and F’s) just simply do not apply themselves that much - are not invested into the class and just show up in class even less. Thus, the explanation probably has more to do with the students attribute about the class than with the pedagogical value of Professor Moody’s lectures.
We can change values of parameters q and s and examine in more detail the relationships between scores and participation.
But lets ask for associations between behaviors
Do students who ask a lot of questions also spend little time texting?
Do students who participate more, generally texting less?
These questions have nothing to do with students performance.
But can all be answered using simple R.7 commands.
Same data may serve different purposes. We started with predicting what behaviors help getting higher grade in professor Moody’s class. But we can imagine a different study – which is addressing student behavior in professor Moody’s class. Yet another study could address the impact of behavioral attributes on students scores (not grades). All these analysis can be done using or free style exploration and R.7.
What we discover, and lets be very clear about it is not yet guaranteed to be statistically valid.
For this we need statistical evaluation, The p-values, the z-tests etc. Later we will also find statistical functions which can greatly help in data exploration.
Free style exploration role is to generate leads known otherwise as conjectures or hypotheses.
Our professor Moody’s data puzzle has been traditional the first data puzzle we ask students to solve in data 101 class. Here are some examples of conclusions reached on different instances of professor Moody’s data set in the past semesters. Notice that attributes of Moody’s data set in past may have been different (like “sleeping in class”)
Here is the set of recommendations from former student who cracked that year’s professor Moody’s puzzle (or did she?)
“Judging by plots and means calculated earlier, there are several factors, besides score, that affect students’ grades:
• Sleeping in class increases grade
• Texting in class decreases grades a little
• Being active(participating) in class all the time significantly increases the grade, BUT:
• Being active(participating) in class just occasionally decreases the grade even more, than not participating at all.
• Being active only occasionally significantly decreases the grades.
• Texting does not significantly affect grades .
So for students in order to succeed in professor Moody’s class, my advice will be(besides getting high score):
• VERY IMPORTANT: Participate all the time., or do not participate at all!!!
• Sleep in class(especially if you do not participate anyway)
• While texting might bring down your grade a little bit, the difference is very small”