Evaluating Your Models

Now that you have several linear regression models, it is time to evaluate which of your models is best. Defining "best" can be tricky: how should goodness of fit be measured? You have already encountered one approach: R-squared and adjusted R-squared, which are conveniently reported by most libraries and serve as relative measures of fit. If you have not yet reviewed these two measures in the previous exercise, go check them out now. As you have probably read, R-squared comes with the drawback that it never decreases as more regressors are added to a model. Adjusted R-squared penalizes additional regressors, but we should still keep this issue in the back of our minds.

Additionally, we should consider the Root Mean Squared Error (RMSE), which we will calculate ourselves using the following formula:

$$\text{RMSE}(y, \hat{y}) = \sqrt{\frac{\sum_{i=0}^{N-1} (y_i - \hat{y}_i)^2}{N}}$$
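The formula translates almost directly into code. Here is a minimal sketch with NumPy (the function name `rmse` is our own choice, not from any library):

```python
import numpy as np

def rmse(y, y_hat):
    """Root Mean Squared Error: sqrt of the mean squared residual."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return np.sqrt(np.mean((y - y_hat) ** 2))

# One prediction off by 2 over three observations: sqrt(4/3) ≈ 1.1547
print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```

Note that `np.mean` divides by `N` for us, so the code mirrors the formula term by term.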

Furthermore, remember that we split our data into two parts? We will consider not only the RMSE of the model fitted on our training data (in sample, so to speak), but also how well the model predicts values on our testing data (out of sample). Confused about what to do now? Follow the steps below:

If you previously used sklearn and still have your model object, you can use its `predict()` method to make predictions. If you used scipy, you can write a simple prediction function yourself and apply it to your test data with pandas' `DataFrame.apply()` method.
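Both routes can be sketched end to end. The toy data, the split, and the helper `rmse` below are illustrative stand-ins, assuming your own training and testing sets take their place:

```python
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def rmse(y_true, y_pred):
    """Root Mean Squared Error, as defined in the formula above."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Toy single-feature data so the sketch runs on its own;
# replace with your actual train/test split.
rng = np.random.default_rng(42)
x = rng.normal(size=200)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=200)
train, test = train_test_split(pd.DataFrame({"x": x, "y": y}), random_state=0)

# Option 1: sklearn -- reuse the fitted model's predict() method.
model = LinearRegression().fit(train[["x"]], train["y"])
print("in-sample RMSE:    ", rmse(train["y"], model.predict(train[["x"]])))
print("out-of-sample RMSE:", rmse(test["y"], model.predict(test[["x"]])))

# Option 2: scipy -- fit with linregress, then predict via apply().
fit = stats.linregress(train["x"], train["y"])
preds = test["x"].apply(lambda v: fit.intercept + fit.slope * v)
print("scipy out-of-sample RMSE:", rmse(test["y"], preds))
```

Since both options fit ordinary least squares on the same data, their out-of-sample RMSEs should agree up to floating-point error; the interesting comparison is in-sample versus out-of-sample.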
