Finding the Best Model For You - A Guide to Grid Search CV
Exploring Scikit-Learn's tool GridSearchCV
I remember first starting out as a data scientist and manually changing hyper-parameters for hours. It was frustrating. It was time-consuming. Sometimes I would change the model altogether. That was until I started using GridSearchCV.
What is GridSearchCV?
GridSearchCV is a tool from the scikit-learn library made for hyper-parameter tuning. I was first introduced to it by Aurélien Géron's Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow.
Implementing GridSearchCV
Suppose you have a Random Forest Classifier and want to tune its hyper-parameters. Implementing GridSearchCV would look something like this.
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
# Hyper-parameter combinations to try
param_grid = {'n_estimators': [50, 100, 150],
              'max_features': [2, 4, 6, 8]}
forest_clf = RandomForestClassifier()
# 3-fold cross-validation over every combination in param_grid
grid_search = GridSearchCV(forest_clf, param_grid, cv=3,
                           scoring='accuracy',
                           return_train_score=True)
grid_search.fit(X_train, y_train)
In the code above, param_grid holds the hyper-parameter combinations you want to try out with cross-validation, and the number of cross-validation folds is set with cv.
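If you want a feel for how big the search is before committing to a long run, scikit-learn's ParameterGrid can enumerate the candidates for you. A minimal sketch, reusing the param_grid from above:
from sklearn.model_selection import ParameterGrid
# 3 values of n_estimators x 4 values of max_features = 12 candidates;
# with cv=3, GridSearchCV trains 12 * 3 = 36 models in total.
print(len(ParameterGrid(param_grid)))  # 12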
Implementing GridSearchCV for Multiple Models and Scoring
GridSearchCV also supports multiple scoring metrics at the same time, and with a simple loop you can tune several models in one go.
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
sgd_clf = SGDClassifier()
rf_clf = RandomForestClassifier()
# Each model is paired with its own parameter grid
models = {
    'SGD': (sgd_clf,
            {'loss': ['hinge', 'log_loss'],
             'penalty': ['l2', None]}),
    'RandomForest': (rf_clf,
                     {'n_estimators': [100, 150, 200, 250],
                      'max_features': ['sqrt', 'log2']})
}
# Evaluate several metrics for every candidate
scoring = {
    'accuracy': 'accuracy',
    'roc_auc': 'roc_auc'
}
# Run a separate grid search for each model
for model_name, (model, param_grid) in models.items():
    grid_search = GridSearchCV(model, param_grid, cv=5,
                               scoring=scoring,
                               return_train_score=True,
                               refit='accuracy')
    grid_search.fit(X_train, y_train)
Building a dictionary that maps each model to its own parameter grid, together with a dictionary of scoring metrics, lets us grid search several models and metrics in a single loop.
To find the best parameters for each model, we can include the following code snippet inside the for loop:
print(f"Best parameters for {model_name}: {grid_search.best_params_}")
print(f"Best cross-validation score for {model_name}: {grid_search.best_score_}")
print(f"Best cross-validation ROC AUC score for {model_name}: {grid_search.cv_results_['mean_test_roc_auc'][grid_search.best_index_]}")
Applying this to the Titanic dataset, we get the best parameters, the best cross-validation accuracy and the corresponding ROC AUC score for each model.
To get the best model with its optimal hyper-parameters, you can use the following code:
final_model = grid_search.best_estimator_
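As a quick sketch of how that final model might be used, assuming a held-out X_test and y_test split (these names are not defined in the snippets above):
from sklearn.metrics import accuracy_score
# best_estimator_ is already refit on the full training set (refit='accuracy'),
# so it can be used for predictions directly.
y_pred = final_model.predict(X_test)
print(accuracy_score(y_test, y_pred))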
For further insight into how each feature affects the best model, you can pull out its feature importances (note that this attribute exists for tree-based estimators such as the Random Forest, not for the SGD classifier):
feature_importances = grid_search.best_estimator_.feature_importances_
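On its own, feature_importances_ is just an array of numbers. One way to make it readable, assuming X_train is a pandas DataFrame with named columns (the earlier snippets don't show how it was built), is to pair each score with its column:
# Sort feature importances from highest to lowest, labelled by column name.
for importance, name in sorted(zip(feature_importances, X_train.columns), reverse=True):
    print(f"{name}: {importance:.3f}")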
For a more comprehensive analysis, cv_results_ holds the scores of every candidate for every metric and cross-validation fold:
cv_result = grid_search.cv_results_
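Since cv_results_ is a plain dictionary, a common way to inspect it (assuming pandas is available) is to load it into a DataFrame and sort by the metric you care about:
import pandas as pd
# With multiple scorers, the keys follow the pattern mean_test_<scorer name>.
results_df = pd.DataFrame(grid_search.cv_results_)
cols = ['params', 'mean_test_accuracy', 'mean_test_roc_auc']
print(results_df.sort_values('mean_test_accuracy', ascending=False)[cols].head())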
For a more detailed look at how GridSearchCV can be implemented, visit my Kaggle Notebook where I apply it to the Titanic dataset.
print("Happy Coding!)