Chapter 6 Data Modeling and Prediction Techniques for Classification.
6.1 Decision Tree.
The decision tree is one of the most powerful and popular tools for classification and prediction. It is a supervised learning predictive model that uses a set of binary rules to calculate a target value. It has a flowchart-like tree structure, where each internal node tests a particular attribute, each branch denotes an outcome of the test, and each leaf node holds a class label or numeric value.
Decision trees are very popular because:
- They generate rules that are easier to understand than those of other models.
- They require much less computation for modeling and prediction.
- They handle both continuous/numerical and categorical variables easily while creating the tree.
Decision trees also have a few drawbacks for certain kinds of inputs and tasks, but those are beyond the scope of this course. We also won’t go into the internals of how a decision tree is constructed or its underlying algorithm; instead, we will look at the decision tree as a way to create prediction models for both classification and regression.
6.2 Use of Rpart
The Recursive Partitioning and Regression Trees (RPART) library is a collection of routines which implement Classification and Regression Trees (CART), a type of decision tree. The resulting model can be represented as a binary tree. The library associated with RPART is called rpart. Install it using install.packages("rpart").
Syntax for building the decision tree using rpart():
rpart(formula, method, data, control, ...)
- formula: here we specify the column to predict and the related columns (predictors) on which the prediction will be based:
prediction ~ predictor1 + predictor2 + predictor3 + ...
- method: here we describe the type of decision tree we want. If nothing is provided, the function makes an intelligent guess. We can use “anova” for regression, “class” for classification, etc.
- data: here we provide the dataset on which we want to fit the decision tree.
- control: here we provide the control parameters for the decision tree. These are explained in more detail later in this chapter.
For more info on the rpart function, visit the rpart documentation.
Let’s look at an example on the Moody 2019 dataset.
- We will use the rpart() function with the following inputs:
- prediction -> GRADE
- predictors -> SCORE, ON_SMARTPHONE, ASKS_QUESTIONS, LEAVES_EARLY, LATE_IN_CLASS
- data -> moody dataset
- method -> “class” for classification.
We can see that the output of the rpart() function is the decision tree, with details of:
- node -> node number
- split -> split conditions/tests
- n -> number of records in either branch i.e. subset
- yval -> output value, i.e. the predicted target value.
- yprob -> probability of obtaining a particular category as the predicted output.
Using the output tree, we can use the predict() function to predict the grades of the test data. We will look at this process later, in section 6.5.
But coming back to the output of the rpart() function: the text output is useful but difficult to read and understand. We will look at visualizing the decision tree in the next section.
6.3 Visualize the Decision tree
To visualize and understand the rpart() tree output in the easiest way possible, we use a library called rpart.plot. The function rpart.plot() of the rpart.plot library is the function used to visualize decision trees.
The rpart.plot() function is a front-end wrapper around prp(), the most basic function for plotting decision trees. prp() allows various aesthetic modifications for visualizing the decision tree. We will look at a few examples of using prp() below.
But first, let’s look at an example of visualizing the output decision tree from the previous example on the Moody dataset using rpart.plot().
NOTE: The online runnable code block does not support the rpart.plot and prp libraries and functions, so the outputs of the following code examples are provided directly.
# First let's import the rpart library
library(rpart)
# Import dataset
moody <- read.csv("https://raw.githubusercontent.com/deeplokhande/data101demobook/main/files/dataset/MOODY-2019.csv")
# Use of the rpart() function.
tree <- rpart(GRADE ~ SCORE+ON_SMARTPHONE+ASKS_QUESTIONS+LEAVES_EARLY+LATE_IN_CLASS, data = moody, method = "class")
# Now let's import the rpart.plot library to use the rpart.plot() function.
library(rpart.plot)
# Use of the rpart.plot() function to visualize the decision tree.
rpart.plot(tree)
We can see that after plotting the tree using the rpart.plot() function, the tree is more readable and provides better information about the splitting conditions and the probabilities of outcomes. Each leaf node has information about:
- the grade category.
- the outcome probability of each grade category.
- the percentage of records out of the total records.
To study the arguments that can be passed to the rpart.plot() function in more detail, please look at these guides: rpart.plot and Plotting with rpart.plot (PDF)
Note that for a beginner, the rpart.plot() function is the easiest way. But if you want to learn another way of plotting rpart trees, the following function can be used.
Another, very minimalistic form of plotting rpart trees is the plot-rpart function, i.e. prp(), which is actually the working function behind rpart.plot().
Let’s look at the same example as above, but using prp().
# First let's import the rpart library
library(rpart)
# Import dataset
moody <- read.csv("https://raw.githubusercontent.com/deeplokhande/data101demobook/main/files/dataset/MOODY-2019.csv")
# Use of the rpart() function.
tree <- rpart(GRADE ~ SCORE+ON_SMARTPHONE+ASKS_QUESTIONS+LEAVES_EARLY+LATE_IN_CLASS, data = moody[,-c(1)], method = "class")
# Now let's import the rpart.plot library to use the prp() function.
library(rpart.plot)
# Use of the prp function to visualize the decision tree.
prp(tree)
We can see that the output of the prp() function is a very minimalist tree, without any colors and with only the minimum required information. There are other arguments that can be passed to the prp() function to improve the aesthetics and the information provided. To learn about those extra arguments, visit this guide: prp()
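For instance, here is a minimal sketch of a prp() call with a few of its aesthetic arguments (the particular values chosen here are just illustrative):
# Assuming `tree` is the rpart object built in the example above.
library(rpart.plot)
# type = 4 labels all nodes, extra = 2 adds the classification rate,
# and under = TRUE prints that extra text under the node boxes.
prp(tree, type = 4, extra = 2, under = TRUE)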
NOTE: In this chapter, from this point forward, the rpart.plot() outputs generated in the examples below will be shown as images, and the code to generate those plots will be commented out in the interactive code blocks. If you want to generate these plots yourself, please use a local RStudio or R environment.
6.4 Rpart Control
We will now look at the control argument of the rpart() function, which is one of its important arguments. The control argument of rpart() is used to manually set the control parameters of the decision tree.
The advantages of using the control argument:
- It restricts the height of the decision tree.
- It avoids overfitting on the training dataset.
- It can be used to eliminate attributes that contribute less significantly to the splitting conditions.
- It helps terminate the tree-creation process earlier, reducing the required computation time.
The disadvantages of using the control argument:
- It creates the risk of generating trees with lower accuracy compared to an uncontrolled tree.
- It could hamper the selection of splitting conditions in a negative way.
- It could result in underfitting if the control parameters are not chosen carefully.
As we can see, while controlling the decision tree provides a lot of advantages in certain conditions, we also risk reducing the accuracy of the predictions made using the tree.
Thus these control methods should be applied mainly in cases where the uncontrolled method takes a large amount of time to create the tree or overfits the training data. When a dataset has a significantly higher count of columns but fewer records, using control methods can be suitable.
Now let’s look at the rpart.control() function used to pass the control parameters to the control argument of the rpart() function.
rpart.control(minsplit, minbucket, cp, ...)
- minsplit: the minimum number of observations that must exist in a node in order for a split to be attempted. For example, minsplit=500 -> a node must contain 500 or more observations for a split to be performed at the test condition.
- minbucket: the minimum number of observations in any terminal (leaf) node. For example, minbucket=500 -> every terminal/leaf node of the tree must contain 500 or more observations.
- cp: complexity parameter. This informs the program that any split which does not improve the fit by at least a factor of cp will not be made in the tree.
For more information on the other arguments of the rpart.control() function, visit rpart.control
Note: The ratio of minsplit to minbucket is 3:1. Thus if only one of minsplit/minbucket is provided, the other value is set using this ratio; if both values are provided, they are used as given. Also note, the default value of cp is 0.01.
Let’s look at a few examples.
Suppose you want to set the control parameter minsplit=200, as in the sketch below.
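A minimal sketch of that call, reusing the moody dataset and formula from the earlier examples:
library(rpart)
library(rpart.plot)
# Require at least 200 observations in a node before a split is attempted.
tree <- rpart(GRADE ~ SCORE+ON_SMARTPHONE+ASKS_QUESTIONS+LEAVES_EARLY+LATE_IN_CLASS, data = moody, method = "class", control = rpart.control(minsplit = 200))
# Inspect the split details and plot the controlled tree.
tree$splits
rpart.plot(tree)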
We can see from the output of tree$splits and the tree plot that at each split the total number of observations is above 200. Also, in comparison to the tree without control, the controlled tree has a lower height and fewer splits.
Now, let’s set the minbucket parameter to 100, as sketched below, and see how that affects the tree parameters.
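Again as a sketch, only the control argument changes:
# Require every terminal (leaf) node to contain at least 100 observations.
tree <- rpart(GRADE ~ SCORE+ON_SMARTPHONE+ASKS_QUESTIONS+LEAVES_EARLY+LATE_IN_CLASS, data = moody, method = "class", control = rpart.control(minbucket = 100))
rpart.plot(tree)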
We can see from the output and the tree plot that the count of observations in each leaf node is greater than 100. Also, the tree height has shortened, suggesting that the control method was able to reduce the tree size.
Let’s now use the cp parameter, as sketched below, and see its effect on the tree.
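A sketch of the call, lowering cp from its default of 0.01 to the 0.005 seen in the output discussed below:
# Allow splits that improve the fit by as little as a factor of 0.005.
tree <- rpart(GRADE ~ SCORE+ON_SMARTPHONE+ASKS_QUESTIONS+LEAVES_EARLY+LATE_IN_CLASS, data = moody, method = "class", control = rpart.control(cp = 0.005))
# printcp() lists the CP value at each split of the fitted tree.
printcp(tree)
rpart.plot(tree)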
We can see from the output and the tree plot that the tree size has increased, with more splits and leaf nodes. Also, we can see that the minimum CP value in the output is 0.005.
We have now seen the most important control parameters of the rpart.control() function. Remember, there are other parameters too, which you can study if you wish, but the three discussed above are sufficient for the scope of this course.
Also, note that we have not checked the effects of the control parameters on the prediction accuracy of the decision tree. Using the control parameters you could increase the accuracy, but you also risk decreasing it. So choosing the control parameters carefully is very important to push the accuracy in the right direction.
6.5 Prediction using rpart.
Now that we have seen the process of creating a decision tree and plotting it, we would like to use the output tree to predict the required attribute.
From the Moody example, we are trying to predict the grades of students. Let’s look at the predict() function to predict the outcomes.
predict(object, data, type, ...)
- object: the generated tree from the rpart function.
- data: the data on which the prediction is to be performed.
- type: the type of prediction required. One of “vector”, “prob”, “class” or “matrix”.
Now let’s use the predict function, as sketched below, to predict the grades of students using the tree generated on the Moody dataset.
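A minimal sketch, assuming `tree` is the classification tree built earlier on the moody dataset:
# Predict a grade for every record; type = "class" returns the class labels.
pred <- predict(tree, moody, type = "class")
head(pred)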
We see that the output of the predict function is a vector of grades corresponding to each record of the Moody dataframe. Each index has a grade among “A”, “B”, “C”, “D”, “F”.
Although we see the output, how do we check the accuracy and correctness of the predictions?
One of the basic tests we can perform is a record-by-record comparison of the grade already in the dataset and the predicted grade, as in the sketch below.
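A sketch of that comparison, assuming `pred` holds the predicted grades from the previous step:
# Fraction of records where the predicted grade matches the original grade.
accuracy <- mean(as.character(pred) == as.character(moody$GRADE))
accuracy
# The corresponding error rate.
1 - accuracy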
Using just a row-by-row comparison, we can see that the predicted grades and the original grades from the Moody dataset match 93.73% of the time. Thus our prediction accuracy is 93.73% and the error rate is 6.27%, which is very good.
This prediction accuracy calculated on the training dataset is called training accuracy.
Notice that we just compared the predicted grades with the grades already present in the Moody dataset, the same dataset on which the tree was built.
In such a scenario, one can say that you are predicting grades on data already seen during training. So in some sense, you just output the known grades and did not do any useful prediction.
But to really see the usefulness of our tree, we must predict on data which has never been used in training the tree, OR for which Prof. Moody has not assigned grades.
In the latter case it is difficult to prove the accuracy without actually checking with Prof. Moody whether he/she would have assigned the same grade as the one predicted by our model.
But we have a solution for the former case, where we can predict grades on a subset of data which we have not used during training. For that, we need to split the provided data into two parts, training and testing, and then repeat the training process on the training dataset and the prediction on the testing dataset.
6.6 Split the data yourself.
We introduced the first and basic model for data analysis, Decision tree, in the section above.
But we found that we need to perform a basic routine before delving deeper into the creation of decision trees.
The routine is to split the dataset into multiple parts, to check the accuracy of our tree’s predictions.
This routine is the first step you perform after acquiring a cleaned dataset.
The most useful splitting of the dataset is into two parts: training and testing.
- While splitting, the ratio between training and testing should be decided carefully. Usually, the training data is kept bigger, and testing is done on a relatively smaller subset. But the ratio should not be too biased, with only a few observations in the test data compared to the training data.
- The idea behind splitting is that, even after the split, the smaller subset, i.e. the test subset, should represent the distribution of the complete dataset. This means the test data should contain at least a few records of each possible combination of attribute categories for categorical data, or, for numerical data, follow the same numerical distribution as the complete dataset.
- Thus, the typical train-test ratio is 80-20 or 70-30, or in some cases even 60-40.
Also, the records assigned to the training/testing data should be picked randomly from the original data, so as to avoid an unbalanced distribution.
We will look at a small example, sketched below, of splitting the complete dataset into training and testing datasets with a 70-30 ratio.
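A minimal sketch of such a split (the names train/test and the use of set.seed() are just illustrative choices):
# Fix the random seed so the split is reproducible.
set.seed(1)
n <- nrow(moody)
# Randomly pick 70% of the row indices for training.
train_idx <- sample(1:n, floor(0.7 * n))
train <- moody[train_idx, ]   # ~70% of the rows
test  <- moody[-train_idx, ]  # the remaining ~30%
nrow(train)
nrow(test)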
As we can see, we split the original data with 1580 rows into two datasets: training data with almost 70% of the original rows, and testing data with almost 30%. Notice that we used random sampling of the data, not just a sequential split, to avoid any unbalanced distribution of attributes.
Now, we have looked at a method to split the dataset into training and testing data. But there is another type of splitting which divides the data into three parts, namely training, cross-validation, and testing. We will look at the use of cross-validation and the process in the next section, 6.7.
Typically, the ratio of train-validation-test is 60-20-20 or 50-25-25.
Before that, let’s look at a simple method, sketched below, to perform a three-way split with ratio 60-20-20.
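A sketch of a 60-20-20 split, extending the same random-sampling idea (the names are again illustrative):
set.seed(1)
n <- nrow(moody)
# Shuffle all row indices once, then cut them into 60/20/20 chunks.
idx <- sample(1:n)
n_train <- floor(0.6 * n)
n_val   <- floor(0.2 * n)
train      <- moody[idx[1:n_train], ]
validation <- moody[idx[(n_train + 1):(n_train + n_val)], ]
test       <- moody[idx[(n_train + n_val + 1):n], ]
nrow(train); nrow(validation); nrow(test)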
We can see that the dataset is split into 3 parts, with 60% in training data, 20% in validation data, and 20% in testing data.
6.7 Cross Validation
Cross-validation is a model validation technique for assessing how the results of a statistical analysis generalize to an independent dataset. In other words, it is a technique to estimate how accurately a predictive model will perform in practice.
The goal of cross-validation is to test the model’s ability to predict new data that was not used in estimating/training it, in order to avoid problems like over-fitting and selection bias, and to give insight into how the model will generalize to an independent dataset (i.e., an unknown dataset).
Cross-validation also helps in selecting and fine-tuning the hyper-parameters of the model. In our case of a decision tree, the hyper-parameters could be the control parameters that determine the size of the decision tree, which in turn determines the accuracy of the tree.
One round of cross-validation involves partitioning data into complementary subsets, and then performing model training on one subset, and validating the results on the other subset. In most methods, multiple rounds of cross-validation are performed using different partitions in each round, and the validation results are combined (e.g. averaged) over the rounds to give an estimate of the model’s predictive performance.
Another use of cross-validation is when you don’t have the test data, and hence you don’t have a way to determine the true accuracy of the model. Because we cannot determine accuracy on a test dataset, we partition our training dataset into train and validation (testing) parts. We train our model (rpart or lm) on the train partition and test on the validation partition. The accuracy on the validation data is called cross-validation accuracy, while that on the train data is called training accuracy.
Let’s not dive too deep into the theory of this cross-validation technique; instead, let’s learn about the cross_validate() function that helps us achieve this.
cross_validate(data, tree, n_iter, split_ratio, method)
- data: The dataset on which cross validation is to be performed.
- tree: The decision tree generated using rpart.
- n_iter: Number of iterations.
- split_ratio: The splitting ratio of the data into train data and validation data.
- method: Method of the prediction. “class” for classification.
The way the function works is as follows:
- It randomly partitions your data into training and validation.
- It then constructs the following two decision trees on the training partition:
- The tree that you pass to the function.
- The tree constructed on all attributes as predictors and with no control parameters.
- It then determines the accuracy of the two trees on the validation partition and returns the accuracy values for both trees.
The first column corresponds to the cross-validation accuracy of the tree that you pass; the second is the cross-validation accuracy of the tree with no control parameters and all attributes.
The values in the first column (accuracy_subset) returned by the cross-validation function are the more important ones when it comes to detecting overfitting. If these values are much lower than the training accuracy you get, that means you are overfitting.
We would also want the values in accuracy_subset to be close to each other (in other words, have low variance). If the values are quite different from each other, that means your model (or tree) has a high variance which is not desired.
The second column (accuracy_all) tells you what happens if you construct a tree based on all attributes. If these values are larger than accuracy_subset, that means you are probably leaving out attributes from your tree that are relevant.
Each iteration of cross-validation creates a different random partition of train and validation, and so you have possibly different accuracy values for every iteration.
Let’s look at the cross_validate() function in action in the example below.
We will pass the tree with formula GRADE ~ SCORE+ON_SMARTPHONE+LEAVES_EARLY and the control parameter minsplit=100.
For the cross_validate() function, we will use n_iter=5 and split_ratio=0.7.
NOTE: The CrossValidation repository is already preloaded for the following interactive code block, so you can directly use the cross_validate() function there. But if you wish to use the cross_validate() function locally, please run:
install.packages("devtools")
devtools::install_github("devanshagr/CrossValidation")
CrossValidation::cross_validate()
NOTE: If you encounter an error while running the cross-validation function that says “new levels encountered” in test, make sure the dataset is imported again with the read.csv() argument stringsAsFactors set to TRUE (or T). For more information about the inner workings of the cross_validate() function, visit cross_validate()
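A sketch of the example described above, with the arguments passed in the order of the signature shown earlier (whether the method argument must be passed explicitly may depend on the CrossValidation package version):
library(rpart)
# The tree with a subset of predictors and minsplit = 100, as described above.
tree <- rpart(GRADE ~ SCORE+ON_SMARTPHONE+LEAVES_EARLY, data = moody, method = "class", control = rpart.control(minsplit = 100))
# Five rounds of cross-validation with a 70/30 train-validation split.
CrossValidation::cross_validate(moody, tree, 5, 0.7)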
We can see in the output the training accuracy, the table of cross-validation accuracies at each iteration for both the passed tree and the tree on all attributes, and also their averages and variances.
A few observations from the selected example above:
- For the tree passed with selected attributes and some control parameters, the cross-validation accuracies (i.e. the accuracy values in the accuracy_subset column) are fairly high for all iterations and have very low variance.
- They are close to the training accuracy, which indicates we are not overfitting.
- Observe that the accuracies at each iteration of the accuracy_subset and accuracy_all columns are relatively close but not equal, suggesting that there are more attributes or other control parameters that could be included in the passed tree to further increase the accuracy, thus closing the gap.
Thus, using cross-validation, we were able to figure out that the passed tree is not the best tree that can be created from the training data. We also saw whether the generated tree overfits the training data or not.
EOC