Chapter 7 Data Modelling and Prediction Techniques for Regression

In Chapter 6 we studied modeling and prediction techniques for classification, where we predicted the class of the output variable using decision-tree-based algorithms.

In this chapter we will study techniques for regression-based prediction and create models for it. As opposed to classification, where we predict a particular class of the output variable, here we will predict a numerical/continuous value.

We will look at two methods of performing this regression-based modeling and prediction: first, simple linear regression, and second, regression using decision trees.


7.1 Linear Regression

Linear regression is a linear approach to modeling the relationship between a scalar response (\(Y\)) and one or more explanatory variables (\(X_i\), where \(i\) indexes the explanatory variables).

  • Scalar response means the predicted output. These variables are also known as dependent variables, i.e. they are derived by applying some law/rule or function to some other variable(s).
    • Usually in linear regression, models are used to predict only one scalar variable, but there are two subtypes of these models:
      • First, when there is only one explanatory variable and one output variable. This type of linear regression model is known as simple linear regression.
      • Second, when there are multiple predictors, i.e. explanatory/independent variables, for the output variable. This type of linear regression model is known as multiple linear regression.
    • In the case of prediction of multiple correlated output variables, the model is instead called a multivariate linear regression model.
  • Explanatory variables are the predictors on which the output predictions are based. These variables are also known as independent variables.

Figure: Linear models fitted to various different types of data spread, illustrating the pitfalls of relying solely on a fitted model to understand the relationship between variables. Credit: Wikipedia.

Since the relationship between the explanatory variables and the output variable is modeled linearly, these models are called linear models. To do this, we need to find a linear regression equation relating the set of input predictors to the output variable.
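Concretely, for a single output variable \(Y\) and \(p\) explanatory variables, the linear regression equation takes the form:

\[\begin{equation} Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \epsilon \end{equation}\]

where the \(\beta_i\) are coefficients estimated from the data and \(\epsilon\) is an error term.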

But rather than going into the mathematics of finding this linear regression equation, we will use a function provided in R to model and predict the output variable.


7.1.1 Linear regression using lm() function

The syntax for building a regression model using the lm() function is as follows:

  • lm(formula, data, ...)
    • formula: here we specify the prediction column and the related columns (predictors) on which the prediction will be based (a minimal sketch follows this list).
      • prediction ~ predictor1 + predictor2 + predictor3 + ...
    • data: here we provide the dataset on which the linear regression model is to be trained.
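As a minimal sketch of this syntax, assuming a data frame df with hypothetical columns y, x1, and x2, a call to lm() looks like this:

```r
# Hypothetical example: fit y as a linear function of x1 and x2.
# 'df' is assumed to be a data frame containing columns y, x1, and x2.
fit <- lm(y ~ x1 + x2, data = df)
summary(fit)  # inspect coefficients, residuals, etc.
```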

For more information on the lm() function, see its documentation (e.g. by running ?lm in R).

Let's look at an example using the RealEstate dataset.

A snippet of the RealEstate dataset is given below.

Table 7.1: Snippet of Real Estate Dataset

 Price   Bedrooms   Bathrooms   Size
795000      3          3        2371
399000      4          3        2818
545000      4          3        3032
909000      4          4        3540
109900      3          1        1249
324900      3          3        1800
192900      4          2        1603
215000      3          2        1450
999000      4          3        3360
319000      3          2        1323

Now we can build linear regression models to predict the Price attribute based on the other attributes present in the dataset, shown above.

Since we will be using only one predictor to predict a single attribute's values, this model is called a simple linear regression model.

For the first example, we will predict the price of a house using only the Size attribute as the predictor.

```r
library(ModelMetrics)

# Load the dataset.
realestate <- read.csv("https://raw.githubusercontent.com/deeplokhande/data101demobook/main/files/dataset/RealEstate.csv")

# Splitting the dataset into training and testing.
idx <- sample(1:2, size = nrow(realestate), replace = TRUE, prob = c(.7, .3))
train <- realestate[idx == 1, ]
test <- realestate[idx == 2, ]

# Use the lm() function to predict the price based on the size of the house.
# This is an example of simple linear regression, since only one predictor
# and one output value are used.
simple.fit <- lm(Price ~ Size, data = train)

# Summary of the model.
summary(simple.fit)

# Linear relation between the Price and Size attributes.
plot(train$Size, train$Price)
abline(simple.fit, col = "red")

# Predicting values on the test dataset.
PredictedPrice.simple <- predict(simple.fit, test)
# Predicted values
head(as.integer(unname(PredictedPrice.simple)))
# Actual values
head(test$Price)
```

We can see that,

  • The summary of the lm model gives us information about the parameters of the model, the residuals, the coefficients, etc. (a small sketch of accessor functions follows this list).
  • The plot of Size vs. Price shows the training data, and the red line represents the fitted line of the linear model, which will be used for prediction.
  • The predicted values are obtained from the predict() function using the trained model and the test data. Compared to the actual values, the predicted values are sometimes close, sometimes far, and a few are very far off.
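If you want to work with individual pieces of the fitted model rather than reading them off the printed summary, base R provides accessor functions for this. A small sketch, reusing the simple.fit model from the example above:

```r
# Coefficients of the fitted model (intercept and slope).
coef(simple.fit)

# Residuals and fitted values on the training data.
head(residuals(simple.fit))
head(fitted(simple.fit))
```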

Above we saw an example of a simple linear regression model, where only one predictor was used to predict a single output attribute.

Now we will see an example of a multiple linear regression model, where multiple predictors are used to predict a single output attribute. (Note: please do not confuse this with multivariate linear regression.)

Let's look at an example of predicting the Price of the real estate based on three attributes: Size, Number of Bedrooms, and Number of Bathrooms.

```r
library(ModelMetrics)

# Load the dataset.
realestate <- read.csv("https://raw.githubusercontent.com/deeplokhande/data101demobook/main/files/dataset/RealEstate.csv")

# Splitting the dataset into training and testing.
idx <- sample(1:2, size = nrow(realestate), replace = TRUE, prob = c(.7, .3))
train <- realestate[idx == 1, ]
test <- realestate[idx == 2, ]

# Use the lm() function to predict the price based on the size, bathrooms,
# and bedrooms of the house. This is an example of multiple linear regression,
# since multiple predictors and one output value are used.
multiple.fit <- lm(Price ~ Size + Bathrooms + Bedrooms, data = train)

# Summary of the model.
summary(multiple.fit)

# Predicting values on the test dataset.
PredictedPrice.multiple <- predict(multiple.fit, test)
# Predicted values
head(as.integer(unname(PredictedPrice.multiple)))
# Actual values
head(test$Price)
```

We can see that,

  • The summary of the lm model gives us information about the parameters of the model, the residuals, the coefficients, etc.
  • The predicted values are obtained from the predict() function using the trained model and the test data. Compared to the previous model, which used just Size as the predictor, using three predictors generally gives predictions closer to the actual values, improving the overall accuracy of the model.

7.1.2 Calculating the Error using mse()

In the categorical predictions of the classification models, checking accuracy was simple: we could directly compare the predicted categories with the actual categories. That kind of direct comparison is not useful in our numerical prediction scenario.

We also don't want to eyeball the predictions row by row every time to judge their accuracy, so let's look at a statistical technique for quantifying the quality of our predictions.

To do this we will use the Mean Squared Error (MSE).

  • The MSE is a measure of the quality of a predictor/estimator.
  • It is always non-negative.
  • Values closer to zero are better.

The equation to calculate the MSE is as follows:

\[\begin{equation} MSE=\frac{1}{n} \sum_{i=1}^{n}{(Y_i - \hat{Y_i})^2} \end{equation}\] where \(n\) is the number of data points, \(Y_i\) are the observed values, and \(\hat{Y_i}\) are the predicted values.
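Note that this formula is a one-liner in base R. A minimal sketch, where mse_by_hand is a hypothetical helper name and actual and predicted are assumed to be numeric vectors:

```r
# Mean squared error computed directly from its definition.
mse_by_hand <- function(actual, predicted) {
  mean((actual - predicted)^2)
}
```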

Rather than writing this ourselves, we will use the mse() function from the ModelMetrics package (the package loaded in the code examples above), so remember to install ModelMetrics and load it with library(ModelMetrics) for local use.

The syntax for the mse() function is very simple:

  • mse(actual, predicted)
    • actual: vector of the actual values of the attribute we want to predict.
    • predicted: vector of the predicted values obtained using our model.

Now let's look at the MSE of the previous simple regression example.

```r
library(ModelMetrics)

# Load the dataset.
realestate <- read.csv("https://raw.githubusercontent.com/deeplokhande/data101demobook/main/files/dataset/RealEstate.csv")

# Splitting the dataset into training and testing.
idx <- sample(1:2, size = nrow(realestate), replace = TRUE, prob = c(.7, .3))
train <- realestate[idx == 1, ]
test <- realestate[idx == 2, ]

# Use the lm() function to predict the price based on the size of the house.
simple.fit <- lm(Price ~ Size, data = train)
# Predicting values on the test dataset.
PredictedPrice.simple <- predict(simple.fit, test)
# Predicted values
head(as.integer(unname(PredictedPrice.simple)))
# Actual values
head(test$Price)

# Use the mse() function to compute the mean squared error on the test set.
mse(test$Price, PredictedPrice.simple)

We can see that the MSE is very large, above 200 billion. Such a huge value is understandable: we are averaging the squared differences over all the records we predicted, and squaring errors that are themselves on the scale of house prices produces very large numbers.

The main goal is to get this value as low as possible, ideally near zero. That can be difficult in practice, but it can be reduced substantially by using a better model and better training data.
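One way to make such a large number easier to interpret is to take its square root, the Root Mean Squared Error (RMSE), which is in the same units as the predicted attribute (here, Price). A small sketch, reusing the objects from the example above:

```r
# RMSE: the square root of the MSE, in the same units as Price.
sqrt(mse(test$Price, PredictedPrice.simple))
```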


7.2 Regression using RPART

We have already used the rpart library for decision tree algorithms, also referred to as CART (Classification And Regression Tree) algorithms, so we will now look at this type of algorithm for regression-based prediction.

Remember that we discussed the usage of rpart in Section 6.2 in great detail. To use rpart for regression-based prediction, we need to set the method argument of the rpart() function to the keyword "anova".

For more details on the use of rpart for prediction, please refer to Section 6.2.

Let's look at an example of regression-based prediction using rpart for the Price attribute of the RealEstate dataset, with Size, Number of Bedrooms, and Number of Bathrooms as predictors.

```r
library(rpart)
library(ModelMetrics)

# Load the dataset.
realestate <- read.csv("https://raw.githubusercontent.com/deeplokhande/data101demobook/main/files/dataset/RealEstate.csv")

# Splitting the dataset into training and testing.
idx <- sample(1:2, size = nrow(realestate), replace = TRUE, prob = c(.7, .3))
train <- realestate[idx == 1, ]
test <- realestate[idx == 2, ]

# Use the rpart() function with method = "anova" to predict the price
# based on the size, bathrooms, and bedrooms of the house.
rpart.fit <- rpart(Price ~ Size + Bathrooms + Bedrooms, data = train, method = "anova")

# Predicting values on the test dataset.
PredictedPrice.rpart <- predict(rpart.fit, test)
# Predicted values
head(as.integer(unname(PredictedPrice.rpart)))
# Actual values
head(test$Price)

# Use the mse() function to compute the mean squared error on the test set.
mse(test$Price, PredictedPrice.rpart)
```

Figure: Output tree plot of the rpart() model for regression using the "anova" method.

We can see,

  • The output decision tree of the rpart() function
  • The predicted values obtained using the model created by the rpart() function.
  • The MSE of the model on the testing dataset.

An important point to note when using decision trees for regression is that, since the underlying model is still a decision tree, the output still represents a finite set of distinct values (one per leaf of the tree), even though those values are numeric. Thus the predicted values repeat, even for varying inputs.
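You can check this directly by counting the distinct predicted values. A small sketch, reusing the objects from the example above:

```r
# A regression tree outputs one value per leaf, so the number of
# distinct predictions is at most the number of leaves in the tree.
length(unique(PredictedPrice.rpart))  # a handful of distinct values
nrow(test)                            # many more test rows
```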

Hence decision trees must be used carefully when building regression-based prediction models.


EOC