Chapter 5 Data Frames & Transformation.
We now introduce the core data structure of R – the data frame – and show how to extend it with additional attributes.
Defining new attributes is often critical in data exploration: it can reveal patterns and relationships that would otherwise remain invisible.
For example, maybe participation matters, but only for Pass/Fail outcomes? In other words, do students who pass (A, B, or C) always have participation above a certain threshold? Perhaps students who always text never pass the class, and students who always ask questions never fail? Such rules can only be discovered if we define a new Pass/Fail attribute in addition to the grade attribute.
Similarly, intervals of participation or score may reveal important relationships that would not emerge from the raw numerical values alone. Maybe high scores correlate with high participation? To establish this, one would first have to define categorical attributes with named intervals of their numerical counterparts.
5.1 Create Column
- Let's add a new column derived from the existing data.
- Suppose we create a new column pf, with the same number of rows as the data frame, containing the categories ("P", "F").
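A minimal sketch, assuming the moody2020b.csv dataset used later in this chapter and the Pass/Fail definition from the introduction, where grades A, B, and C count as Pass:

```r
# Load the Moody dataset (same URL used later in this chapter).
moody <- read.csv("https://raw.githubusercontent.com/deeplokhande/data101demobook/main/files/dataset/moody2020b.csv")

# Build a Pass/Fail vector, one entry per row: "P" for grades A, B, C; "F" otherwise.
pf <- ifelse(moody$grade %in% c("A", "B", "C"), "P", "F")

moody$pf <- pf   # attach the new column to the data frame
head(moody)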
5.2 Factor Function: factor()
Factors are data objects used to categorize data and store it as levels.
They can store both strings and numbers.
They are useful for columns that have a limited number of unique values, like "Male"/"Female" or TRUE/FALSE.
Factor data objects are useful in data analysis and statistical modeling.
The factor function is used to encode a vector as a factor.
Let's look at a first example: checking whether a data object is of factor type using the function is.factor(x), and then converting a character vector to a factor data object using the function factor(x).
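A minimal sketch with a hypothetical character vector named directions:

```r
directions <- c("north", "south", "east", "north")   # plain character vector
is.factor(directions)     # FALSE

directions.f <- factor(directions)   # encode the vector as a factor
is.factor(directions.f)   # TRUE
directions.f              # printing shows the Levels attribute: east north south
```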
Notice that for the factor data objects, the attribute Levels is also created. This is an extremely important feature of the factor data object.
Let's look at how a factor data object behaves when included in a data frame.
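Continuing the sketch above (the id column is made up for illustration):

```r
df <- data.frame(id = 1:4,
                 direction = factor(c("north", "south", "east", "north")))
str(df)   # the direction column is stored as a factor with 3 levels
```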
Note: Depending on your version of R and your packages, a categorical vector inserted into a data frame with data.frame() may be converted into a factor column automatically. To avoid confusion, it is better practice to convert the categorical vector explicitly with factor() before inserting it into the data frame.
- Let's look at an example where the factor data object turns out to be useful.
- We have a categorical vector that we want to coerce to numeric for use in some model or application.
- Let's first look at what happens when we try to coerce a plain categorical vector to a numeric vector.
- We see that the outcome of as.numeric() on a plain categorical vector is coercion of all elements to NA.
- But when we first convert the same categorical vector to a factor, coercion to numeric yields a numeric vector whose elements correspond to the indices of the factor's levels.
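A sketch with a hypothetical categorical vector; note that the numeric codes follow the alphabetical order of the levels:

```r
v <- c("low", "medium", "high")
as.numeric(v)    # NA NA NA, with a warning: NAs introduced by coercion

f <- factor(v)   # levels are sorted alphabetically: high, low, medium
as.numeric(f)    # 2 3 1 -- each element replaced by the index of its level
```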
Let's look at another example where a factor is useful.
- We want to see the distribution of price for each quality level in the wine dataset.
- Plotting the raw numeric columns gives a scatter plot, which makes it hard to see the distribution.
- So we convert the quality vector, which is initially numeric, to a factor and plot again.
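A sketch with simulated data, since the exact wine dataset is not reproduced here; the quality and price column names follow the description above:

```r
set.seed(1)
wine <- data.frame(quality = sample(3:8, 200, replace = TRUE),
                   price   = runif(200, min = 5, max = 50))

plot(wine$quality, wine$price)           # scatter plot: distributions are hard to see
plot(factor(wine$quality), wine$price)   # factor x-axis: R draws a box plot per quality
```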
5.3 Coercing Values in Data Frames
Before coercing data inside data frames, let's look at a few small examples.
- Let's look at a coerced vector.
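A small sketch mixing numeric and character elements:

```r
x <- c(1, 2, "three")   # mixed numeric and character elements
x                       # "1" "2" "three" -- everything was coerced to character
```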
We see that when a vector has elements of mixed data types, it gets coerced into the type with the highest precedence. In the example above, the vector contained both character and numeric elements; character has precedence over numeric, so the whole vector is coerced to character type.
We can check the type of a vector using the appropriate is.* function: is.character(), is.double(), is.integer(), is.logical(), etc. There are many others, for checking whether a given data object is a data frame, a factor, and so on.
- Let's look at some examples.
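A few small sketches of the is.* family:

```r
is.character(c("a", "b"))   # TRUE
is.double(3.14)             # TRUE
is.integer(5L)              # TRUE
is.logical(TRUE)            # TRUE
is.data.frame(c(1, 2))      # FALSE -- a plain vector, not a data frame
```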
We saw how to check the type of data. If you want to convert a column into a data type of your choice, use the corresponding as.* function: as.character(), as.double(), as.integer(), as.logical(), etc. As with the is.* functions, there are many more.
- Let's look at an example of coercing a vector to character type.
- Let's also look at an example of coercing a mixed-type vector to numeric type.
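Small sketches of both coercions described above:

```r
as.character(c(1, 2, 3))            # "1" "2" "3"
as.numeric(c("10", "abc", "3.5"))   # 10 NA 3.5, warning: NAs introduced by coercion
as.numeric(c(TRUE, FALSE, T, F))    # 1 0 1 0
```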
We can see in the example above that, while converting a character vector to numeric, numbers stored as characters are converted to numeric type, but non-numeric characters are not converted: instead we get the warning "NAs introduced by coercion". Also, when converting a logical vector to numeric, TRUE (or T) is coerced to 1 and FALSE (or F) is coerced to 0.
Now let's look at how to coerce a data column and write it back into the data frame.
Suppose that in the Moody dataset you want to change the categorical letter grades to numeric grades from 1 to 5, where A=1, B=2, …, F=5.
- First, convert the grade column to a factor using the factor() function.
- Then, convert the factor to a numeric column with as.numeric().
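A sketch using the moody2020b.csv dataset; the alphabetical levels A, B, C, D, F map to codes 1 through 5:

```r
moody <- read.csv("https://raw.githubusercontent.com/deeplokhande/data101demobook/main/files/dataset/moody2020b.csv")

# factor() encodes the grades; as.numeric() replaces each grade by its level index.
moody$grade <- as.numeric(factor(moody$grade))   # A=1, B=2, C=3, D=4, F=5
head(moody$grade)
```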
We can see that the outcome of the code above is a moody data frame whose grade column is now numeric, converted from the previous categorical column. Note that we wrapped the factor() call inside as.numeric(), so that the levels information of the grade column determines the numeric codes.
Now, suppose you also want to change the labels of the grade column.
- Let's change the grades from capital letters to small letters, i.e. A -> a, B -> b, and so on.
- To do this, we can provide our own vector of labels to the labels argument of the factor() function.
```r
moody <- read.csv("https://raw.githubusercontent.com/deeplokhande/data101demobook/main/files/dataset/moody2020b.csv") # web load dataset

# Convert the categorical column grade to a factor column with user-defined labels.
moody$grade <- factor(moody$grade, labels = c("a", "b", "c", "d", "f"))
head(moody$grade)
```
We can see that the capital letters are now transformed to small letters.
5.4 Merging Two Relational Data Frames.
Often we have data from multiple sources: different databases, files, etc. To perform analysis, we need to merge these data frames together on one or more common key variables.
In R, the merge() function merges two data frames by common columns or row names. It allows you to perform the different SQL joins: left join, inner join, right join, or full join, among others.
We will look at merging datasets in R with this function, along with examples.
Consider the following 2 datasets.
First is a smaller, 4-record subset of the Moody dataset.

STUDENTID | SCORE | GRADE | ON_SMARTPHONE | ASKS_QUESTIONS | FINALEXAM |
---|---|---|---|---|---|
65446 | 23.67 | D | never | always | 12.874804 |
79686 | 8.41 | F | never | never | 5.044093 |
56400 | 69.76 | C | never | always | 23.585730 |
16792 | 95.51 | A | never | always | 23.476748 |
Second is another dataset of students with their respective GPAs and majors.
STUDENTID | GPA | Major |
---|---|---|
65446 | 1.559626 | computer science |
79686 | 3.813033 | economics |
56400 | 2.840912 | political science |
10001 | 2.664000 | economics |
NOTE: From the snippets above, the top 3 records in both datasets share the same STUDENTID, but the 4th records belong to different students. When discussing the examples below, the key thing to watch is what happens to these 4th records under the various merge options and arguments.
5.4.1 Inner Join
This is the most common type of join you can perform. It merges two data frames into one that contains only the elements common to both.
To merge the two datasets, you just pass them to the merge() function without changing any other arguments: the inner join is the default merge of the merge() function.
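A sketch using two small data frames standing in for the datasets shown above (only a few columns are reproduced, for brevity):

```r
students <- data.frame(STUDENTID = c(65446, 79686, 56400, 16792),
                       SCORE     = c(23.67, 8.41, 69.76, 95.51),
                       GRADE     = c("D", "F", "C", "A"))
gpa      <- data.frame(STUDENTID = c(65446, 79686, 56400, 10001),
                       GPA       = c(1.559626, 3.813033, 2.840912, 2.664000),
                       Major     = c("computer science", "economics",
                                     "political science", "economics"))

merge(students, gpa)   # inner join on the common STUDENTID column: 3 rows
```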
We can see that there are only 3 records in the output. The reason is that the STUDENTID of the fourth record did not match between the two datasets, so the merge function dropped both unmatched records.
Also note how the merge function matched the two datasets: it used the column common to both, which in this case was the STUDENTID column.
We can also do the same process and get the same outcome by specifying the key column ourselves. Let's look at this in the example below.
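Using the same students and gpa data frames as in the previous sketch:

```r
merge(students, gpa, by = "STUDENTID")   # same inner join, key column made explicit
```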
As we can see, the output remains the same. But now we know that we can designate any common column as the key column on which the merge occurs.
IMPORTANT NOTE: There are also arguments by.x and by.y, which specify the key column in the left (first) and right (second) dataset respectively.
These come in extremely handy when the two datasets you want to merge use different names for the key column.
For example, suppose that in the two datasets above, the first dataset indexed student records by a column named studentid, while the second dataset used the same student ids but under the column name stu-id.
Merging would then fail, since the merge() function would have trouble finding the two key columns to match, as they are named differently in the two datasets.
In this case, you can pass the arguments by.x = "studentid", by.y = "stu-id" to the merge() function.
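A hypothetical sketch of this situation; a syntactic name stu_id is used here, since a literal stu-id would be awkward to create in a data frame:

```r
left  <- data.frame(studentid = c(65446, 79686, 56400),
                    score     = c(23.67, 8.41, 69.76))
right <- data.frame(stu_id    = c(65446, 79686, 10001),
                    gpa       = c(1.56, 3.81, 2.66))

merge(left, right, by.x = "studentid", by.y = "stu_id")   # matches on the two ids
```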
5.4.1.1 Another example
Suppose you have the happiness index dataset,
IDN | AGE | COUNTRY | GENDER | IMMIGRANT | INCOME | HAPPINESS |
---|---|---|---|---|---|---|
88364 | 29 | Kyrgyzstan | Male | 0 | 103305 | 8.35 |
37692 | 41 | Afghanistan | Male | 0 | 51682 | 4.44 |
57856 | 20 | Azerbaijan | Female | 0 | 72381 | 6.24 |
49453 | 62 | South Korea | Female | 0 | 65658 | 5.66 |
93485 | 63 | Jordan | Female | 1 | 109581 | 3.17 |
97976 | 36 | Congo-Kinshasa | Female | 0 | 112432 | 9.43 |
where you have survey data of people from various countries, with records of AGE, COUNTRY, GENDER, IMMIGRANT, INCOME, and HAPPINESS.
You can do analysis on the above dataset per country, per age group, etc.
But if you want to do analysis per continent, you would have to build lists of all the countries in each continent and then subset using the appropriate methods from section 5.5 below.
An alternative method is to acquire another dataset mapping each country to its respective continent and merge it in, after which we can subset easily.
Let's look at an example of this process below.
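A sketch, assuming the happiness data has been read into a data frame happy as in the snippet above, with a hypothetical country-continents lookup table:

```r
# Hypothetical lookup table; a real one would list every country.
continents <- data.frame(COUNTRY   = c("Kyrgyzstan", "Afghanistan", "Jordan"),
                         CONTINENT = c("Asia", "Asia", "Asia"))

happy.c <- merge(happy, continents, by = "COUNTRY")   # adds the CONTINENT column
head(happy.c)
```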
We can see from the output that the new data frame happy.c, created by applying merge() to the happiness index dataset and the country-continents dataset, has the CONTINENT column added from the country-continents dataset. Each country in happy.c now carries the value of its respective continent in the CONTINENT column.
5.4.2 Full Join
A full join is also known as an outer join or a full outer join.
It merges all the columns of both datasets into one.
For records whose key values do not match, the full join keeps both records and fills the missing values with NA, i.e. the Not Available keyword.
To create this type of full join of the two data frames in R, we need to set the argument all to TRUE (or T).
Let's look at this in the example below.
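Using the students and gpa data frames from the inner-join sketch:

```r
merge(students, gpa, all = TRUE)   # full join: 5 rows, unmatched cells filled with NA
```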
We can see that the first record of the output, with studentid = 10001, was present only in the second dataset, so the values corresponding to the columns of the first dataset are set to NA.
Similarly, the record with studentid = 16792 was present only in the first dataset, and thus has NA in place of the columns from the second dataset.
5.4.3 Left Join
The left join in R matches all the rows of the first data frame with the corresponding records in the second data frame.
To create this left join, you just have to set the argument all.x to TRUE (or T).
Recall that for the full join we set the argument all to TRUE. Similarly, since we treat x as the first (left) dataset, we set the argument all.x, where the .x suffix selects the first dataset.
We have seen in the snippets above that the student with studentid = 16792 is present only in the first dataset, not the second.
So let's see the result of a left-join merge in the example below.
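Again using the students and gpa data frames from the inner-join sketch:

```r
merge(students, gpa, all.x = TRUE)   # left join: all 4 rows of students are kept
```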
We can see that the record of the student with studentid = 16792 has NA in the columns merged from the right dataset. Also, the record of the student with studentid = 10001 is excluded entirely, since it belongs only to the second dataset.
5.4.4 Right Join
The right join merges all the rows of the second data frame with the corresponding records in the first data frame.
The right join is the opposite of the left join.
Consequently, here you need to set the argument all.y to TRUE (or T), since we treat the right (second) dataset as y.
We have seen in the snippets above that the student with studentid = 10001 is present only in the second dataset, not the first.
So let's see the result of a right-join merge in the example below.
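Once more with the students and gpa data frames:

```r
merge(students, gpa, all.y = TRUE)   # right join: all 4 rows of gpa are kept
```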
We can see that the record of the student with studentid = 10001 has NA in the columns merged from the left dataset. Also, the record of the student with studentid = 16792 is excluded entirely, since it belongs only to the first dataset.
5.5 Slicing and Dicing.
R was made especially for data analysis and graphics; SQL was made especially for databases. They are allies in the field of data science.
The data structure in R that most closely matches a SQL table is a data frame. The terms rows and columns are used in both.
There is an R package called sqldf that allows you to use SQL commands to extract data from an R data frame. We will not use this package in the examples; instead we will look at how the operations of SQL translate to the basic R commands we studied in chapter 4.
We have already seen how subsetting of rows and columns works using the subset function, in section 4.4. Please review that section before proceeding.
5.5.1 Subsetting on Columns ( DICING )
So let's start with dicing the data frame: in other words, subsetting operations on columns.
Columns in SQL are also called "fields"; in R they are commonly called "variables".
In SQL, the subset of columns is determined by the SELECT statement.
We can perform this type of SQL operation in R using the normal subsetting methods: either the subset() function or the square brackets [ ].
NOTE: In most of the examples below, to avoid printing the complete dataset after an operation, we use the head() function to truncate the output to the top 6 rows. You can always remove this function, or change the number of records shown, by passing the additional argument n = user_defined_limit to head().
Just to recap subsetting on columns:
5.5.1.1 Subset single column.
Remember: You can either use the column names or the column location index, to dice the dataframe.
Suppose we want to subset only the grade column of the moody dataset; see the sketch below. Only one column is selected from the data frame. The drop = F argument is provided to keep the data frame structure; the sketch also shows the effect of leaving it out.
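A sketch, assuming the moody data frame loaded earlier:

```r
head(moody[, "grade", drop = F])     # stays a one-column data frame
head(moody[, "grade"])               # without drop = F: collapses to a plain vector
head(subset(moody, select = grade))  # subset() equivalent, always a data frame
```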
Note: In cases where you want to pass the subsetted column to another function, e.g. mean(subsetted_column), you must not use the drop = F argument, since the function expects a plain vector and a one-column data frame may result in an error.
5.5.1.2 Subset multiple columns.
Suppose you want to subset multiple columns by name. You can create a vector of the column names you want and then use it while subsetting, as in the sketch below, where only the two columns grade and score are kept. In the same way, we can list any number of column names and get the corresponding subset.
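A sketch using a vector of column names:

```r
cols <- c("grade", "score")   # names of the columns to keep
head(moody[, cols])
```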
5.5.1.3 Subset on all columns
You can get all the columns in the subset by leaving the space after the comma blank; this gives the complete set of columns. For instance, if you slice rows of the data frame and want to keep all the columns in the output, just leave the column position after the comma empty, as in the sketch below.
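A sketch with a hypothetical row slice, leaving the column position blank:

```r
head(moody[moody$score > 50, ])   # all columns, only rows with score > 50
```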
5.5.2 Subsetting on Rows ( SLICING )
Now that we have seen dicing, or subsetting on columns, which is similar to the SELECT statement of SQL, we will look at slicing the data frame: in other words, subsetting on rows.
Many SQL clauses perform subsetting on rows, e.g. WHERE, AND, OR, IN, LIKE, LIMIT, and more.
We will look at a few of them, implementing them with basic R functions.
5.5.2.1 Subsetting based on single condition.
We will look at a subsetting condition based on a value.
For subsetting based on a value, you can use the relational operators, e.g. >, <, >=, <=, ==, etc., between the attribute name and the value.
Let's look at this in the following example.
Suppose you want to keep all the observations where a student's score is greater than 80, as in the sketch below.
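A sketch of both the bracket and subset() forms:

```r
head(moody[moody$score > 80, ])   # bracket form
head(subset(moody, score > 80))   # subset() form, same result
```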
We can see from the result above that the subset contains only the records of students with score greater than 80.
This example is similar to using the WHERE clause in SQL.
5.5.2.2 Subsetting based on multiple conditions.
Similar to the example above, suppose you want to subset based on multiple conditions.
To do this, we use the logical operators AND (&), OR (|), and NOT (!) between the conditions.
Let's look at an example of this type of subsetting.
Suppose you want to slice the records of the moody dataset based on two conditions: students with grade equal to "A", AND students with score greater than 90, as in the sketch below.
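A sketch combining the two conditions with &:

```r
head(moody[moody$grade == "A" & moody$score > 90, ])
```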
We can see that the records of students with score greater than 90 and grade equal to A are kept; all other records are removed.
This example is similar to using the AND, OR, and NOT operators in SQL.
5.5.2.3 Subset based on multiple values.
We will look at subsetting the data frame based on one condition with multiple values.
Suppose you want to subset the moody dataset based on the students' grades, keeping records with grade equal to either "B" or "C".
You could use multiple conditions as seen above, with an OR clause between two conditions on the same variable/column, but there is a simpler and more useful way to do this with just one conditional statement.
We make a vector of all the values we want to match, and use that vector in the condition.
Let's look at this in the following example.
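A sketch using the %in% operator with a vector of values, next to the equivalent OR form:

```r
head(moody[moody$grade %in% c("B", "C"), ])               # vector of values
head(moody[moody$grade == "B" | moody$grade == "C", ])    # equivalent OR form
```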
We can see that the output has only records of students with grade B or C.
Both methods produce the same output.
This example is similar to the IN operator of SQL.
5.5.2.4 Subset based on a partial/complete text/character.
We will look at subsetting the dataset based on a specific pattern of text/characters.
This type of subsetting is useful for text columns where each record contains one or more sentences and you want to search for a particular keyword or pattern.
The simplest example would be a survey dataset, where each record is a text paragraph answering the survey questions, and you want to count particular keywords in each response.
Let's look at an example based on the Happiness dataset: we would like to find the subset of countries with the letters "and" in their name, e.g. Iceland, Uganda, Poland.
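A sketch, assuming the happiness data is in a data frame happy with a COUNTRY column as in the snippet earlier:

```r
head(happy[grep("and", happy$COUNTRY), ])   # rows whose country name contains "and"
```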
We can see that the output is the subset of the happy dataset containing only those countries with the pattern "and" in their name.
To do this we used the grep() function, an important function for finding patterns in text and data.
We do not need to study grep() in detail here; very good resources explaining it can be found online.
This example is similar to the LIKE operator in SQL.
5.6 Group By
Now that we have done slicing and dicing, we would like to apply some functions to the subsets and extract summary information from them.
Although R has no single-step function equivalent to SQL's GROUP BY, we can get the required functionality by combining, step by step, the functions from the R.7 commands list (section 2.2) with what we have learned in this chapter 5 and the review chapter 4.
This section will use the table() and tapply() functions to apply functions like mean, count, and sum to the categorical or numerical columns of the subsets. More importantly, we will look at a very useful example below that ties together everything we have learned so far.
Suppose you want to compute the average score per grade and the frequency of students per grade, and then use the resulting table afterwards.
The SQL query would look something like SELECT grade, avg(score) as averagescore, count(*) as student_number FROM moody GROUP BY grade.
To implement this query in R, we first use the tapply() function to get the average score per grade and the frequency per grade, and then combine the two. Let's look at this process in the code below.
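A sketch of the two tapply() calls and their combination; the column names mirror the SQL query above:

```r
avgscore <- tapply(moody$score, moody$grade, mean)     # average score per grade
freq     <- tapply(moody$score, moody$grade, length)   # number of students per grade

out <- data.frame(grade          = names(avgscore),
                  averagescore   = as.numeric(avgscore),
                  student_number = as.numeric(freq))
out
```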
We can see the combined table of the average scores and frequencies per grade.
Now suppose we want to go one step further and order the out table from the example above by decreasing frequency of students per grade.
To do this, we introduce the order() function.
- The order() Function
- The order() function returns a permutation of indices that sorts the elements of a vector.
- You can pass the decreasing argument to sort the elements in ascending or descending order.
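A sketch ordering the out table from the previous example:

```r
out[order(out$student_number, decreasing = TRUE), ]   # most frequent grades first
```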
We have seen how to implement ordering in R. This is similar to using the ORDER BY statement of SQL.
Another thing we can do is subsetting on the output.
Suppose you want to keep only those grade records in the out data with a frequency of more than 150 students for the grade.
To do this we use the technique studied in the slicing section 5.5.2. Let's look at this in the example below.
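A sketch, reusing the out table:

```r
out <- out[out$student_number > 150, ]   # keep grades with more than 150 students
out
```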
We see that since grade B had only 108 students on record, it is removed from the out data frame.
This is similar to using the HAVING clause of SQL.
5.7 Handling Date and Time in Data Frames.
One of the most common issues that a novice, or even an experienced R user, can face is handling the date and time information in a dataset, and importing it as a variable that is appropriate and usable during analysis.
Getting R to agree that your data contains dates and times can be tricky. We will see an example where the usual R data import fails to read dates and times as actual dates and times.
To simplify this issue, we use a package called lubridate, which makes it easier to work with dates and times by parsing them into R's date and time classes. POSIXct is one such class, recognized by R as holding a date or a date and time. Lubridate's functions handle a wide variety of formats and separators, which simplifies the parsing process.
Let's look at how easy it is to convert date and time input for use in analysis. First we will convert dates in character format, as in the sketch after the following list.
- First we will convert "20200317", which is in year-month-date format, using the ymd() function of the lubridate package.
- Second we will convert "03-17-2020", which is in month-date-year format, using the mdy() function of the lubridate package.
- Third we will convert "17/03/2020", which is in date-month-year format, using the dmy() function of the lubridate package.
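A sketch of the three conversions (the lubridate package must be installed):

```r
library(lubridate)

ymd("20200317")     # year-month-day
mdy("03-17-2020")   # month-day-year
dmy("17/03/2020")   # day-month-year
# all three print the same standardized date: "2020-03-17"
```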
We can see that the output of all three functions is the same, meaning each has successfully converted its character-format date into the same standardized date object.
Now let's look at converting times in character format, as in the sketch after the following list.
- First we will convert "18:20", which is in hour-minute format, using the hm() function of the lubridate package.
- Second we will convert "18:20:30", which is in hour-minute-second format, using the hms() function of the lubridate package.
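A sketch of the two time conversions:

```r
library(lubridate)

hm("18:20")       # "18H 20M 0S"
hms("18:20:30")   # "18H 20M 30S"
```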
We can see that the outputs of the two functions above have the hours, minutes, and seconds annotated properly.
There are various other functions in the lubridate package for other use cases, but we will not cover them here. To learn more, you can visit the official lubridate package vignette linked here: lubridate
Now, coming back to the main example of avoiding issues/errors while importing date and time attributes present in a dataset: we will look at the AirQualityUCI dataset. A snippet of the dataset is shown below.
Date | Time | CO | Tin.Oxide | Non.Metanic.HydroCarbons | Benzene |
---|---|---|---|---|---|
3/10/2004 | 18:00:00 | 2.6 | 1360 | 150 | 11.9 |
3/10/2004 | 19:00:00 | 2.0 | 1292 | 112 | 9.4 |
3/10/2004 | 20:00:00 | 2.2 | 1402 | 88 | 9.0 |
3/10/2004 | 21:00:00 | 2.2 | 1376 | 80 | 9.2 |
3/10/2004 | 22:00:00 | 1.6 | 1272 | 51 | 6.5 |
3/10/2004 | 23:00:00 | 1.2 | 1197 | 38 | 4.7 |
From the snippet, the date and time columns appear to be imported correctly. But in fact, as we will see in the code below, both the Date column and the Time column are of type character.
Now, to convert these columns into proper date and time columns, we will use the lubridate functions, as in the sketch below.
Then we will count the number of records in the dataset per year using the year() function.
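A sketch, assuming the AirQualityUCI data has been read into a data frame named air (the name is an assumption) with character Date and Time columns as in the snippet above:

```r
library(lubridate)

class(air$Date)             # "character"
air$Date <- mdy(air$Date)   # parse month/day/year dates
class(air$Date)             # "Date"
table(year(air$Date))       # frequency of records per year: 2004, 2005

class(air$Time)             # "character"
air$Time <- hms(air$Time)   # parse hour:minute:second times
class(air$Time)             # "Period"
```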
We can see that the original type of the Date column was "character", but after using lubridate's function we converted it to the suitable "Date" class. We were then easily able to subset the dataset by year and get the frequency count of records per year, as seen in the table for the years 2004 and 2005.
Similarly, we can also convert the Time column for possible later use in the analysis process.
We can see that the original type of the Time column was "character", but after using lubridate's function we converted it to the "Period" class, which is used to represent time information.
EOC