
I want to optimize the parameters of a RandomForest regression model in order to find the best trade-off between accuracy and prediction speed. My idea was to use a randomized grid search, and to evaluate the speed and accuracy of each tested parameter configuration.

So I prepared a parameter grid, and I can run k-fold CV on the training data:

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import RandomizedSearchCV

    ## parameter grid for random search
    n_estimators = [1, 40, 80, 100, 120]
    max_features = ['auto', 'sqrt']
    max_depth = [int(x) for x in np.linspace(10, 110, num=11)]
    max_depth.append(None)
    min_samples_split = [2, 5, 10]
    min_samples_leaf = [1, 2, 4]
    bootstrap = [True, False]
    random_grid = {'n_estimators': n_estimators,
                   'max_features': max_features,
                   'max_depth': max_depth,
                   'min_samples_split': min_samples_split,
                   'min_samples_leaf': min_samples_leaf,
                   'bootstrap': bootstrap}

    rf = RandomForestRegressor()
    rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid,
                                   n_iter=100, cv=3, verbose=2, n_jobs=-1)
    rf_random.fit(X_train, y_train)


I found the way to get the parameters of the best model, by using:

    rf_random.best_params_

However, I wanted to iterate through all the random models, check their parameter values, evaluate them on the test set, and write the parameter values, accuracy and speed to an output dataframe, so something like:

    for model in rf_random:  # this iteration is the part I don't know how to do
        start_time_base = time.time()
        y_pred = model.predict(X_test)  # evaluate the current random model on the test data
        pred_time = (time.time() - start_time_base) / X_test.shape[0]
        rmse = mean_squared_error(y_test, y_pred, squared=False)
        params = ...  # something to get the values of the parameters for this model

        # write to dataframe...

Is there a way to do that? Just to be clear, I'm asking about the iteration over models and parameters, not the writing to the dataframe part :) Should I go for a different approach altogether instead?

1 Answer


You get the df you're looking to create, with model parameters and CV results, by calling `rf_random.cv_results_`, which you can put straight into a df: `all_results = pd.DataFrame(rf_random.cv_results_)`.
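A minimal self-contained sketch of this (synthetic data and a deliberately tiny grid, purely for illustration; the variable names are mine, not from your code):

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Toy regression problem standing in for X_train / y_train
X, y = make_regression(n_samples=200, n_features=5, random_state=0)

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions={"n_estimators": [5, 10], "max_depth": [2, None]},
    n_iter=3, cv=3, random_state=0)
search.fit(X, y)

all_results = pd.DataFrame(search.cv_results_)
# One row per sampled configuration: 'params' holds the full parameter dict,
# and the timing columns already measure fitting and scoring speed per fold.
print(all_results[["params", "mean_fit_time", "mean_score_time", "mean_test_score"]])
```

Note that `mean_score_time` is the time to score a whole fold, so it already gives you a relative speed comparison between configurations without touching the test set.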

In practice this is generally considered a good measure of all the metrics you're looking for, so what you describe in the question is unnecessary. However, if you do want to go through with it (i.e. evaluate against a held-out test set rather than cross-validate), you can go through this df and define a model with each parameter combination in a loop:

    import time

    for i in range(len(all_results)):
        # cv_results_ stores each sampled configuration as a dict
        # in the 'params' column, which can be unpacked directly
        model = RandomForestRegressor(**all_results['params'][i])
        model.fit(X_train, y_train)

        start_time_base = time.time()
        y_pred = model.predict(X_test)  # evaluate the current random model on the test data
        pred_time = (time.time() - start_time_base) / X_test.shape[0]

        # Evaluate predictions however you see fit

As RandomizedSearchCV keeps a trained model only for the best parameter combination, you'll need to retrain the models in this loop.


5 Comments

That sounds great, but I specifically need to evaluate the inference speed of each model, that's why I was planning to do it on the test set afterwards. Do I get any information about it in the rf_random.cv_results_?
You can do that in the loop I provided too, just start the timer after you trained the model again. I've edited the code accordingly.
Yes, I was wondering if I could avoid the loop at this point, considering that - as you pointed out - I would need to re-train each model again. Thanks!
Right. cv_results_ does include mean and std of fit and score times for each model. However if you want to use a holdout test set you'll need to retrain, as the model objects aren't all saved.
Just as a follow-up, I did use the results from cv_results to select a bunch of more "promising" models, in terms of the speed/accuracy trade-off, and I re-trained and evaluated only those on the test set. Thanks again for the great help!
