
I want to optimize the parameters of a RandomForest regression model in order to find the best trade-off between accuracy and prediction speed. My idea was to use a randomized grid search, and to evaluate the speed and accuracy of each tested parameter configuration.

So I prepared a parameter grid, and I can run k-fold CV on the training data:

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import RandomizedSearchCV

    ## parameter grid for random search
    n_estimators = [1, 40, 80, 100, 120]
    max_features = ['auto', 'sqrt']
    max_depth = [int(x) for x in np.linspace(10, 110, num=11)]
    max_depth.append(None)
    min_samples_split = [2, 5, 10]
    min_samples_leaf = [1, 2, 4]
    bootstrap = [True, False]
    random_grid = {'n_estimators': n_estimators,
                   'max_features': max_features,
                   'max_depth': max_depth,
                   'min_samples_split': min_samples_split,
                   'min_samples_leaf': min_samples_leaf,
                   'bootstrap': bootstrap}

    rf = RandomForestRegressor()
    rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid,
                                   n_iter=100, cv=3, verbose=2, n_jobs=-1)
    rf_random.fit(X_train, y_train)


I found the way to get the parameters of the best model, by using:

    rf_random.best_params_

However, I wanted to iterate through all the random models, check their parameter values, evaluate them on the test set, and write the parameter values, accuracy and speed to an output dataframe, so something like:

    for model in rf_random:  # this iteration is the part I don't know how to do
        start_time_base = time.time()
        y_pred = model.predict(X_test)  # evaluate the current random model on the test data
        pred_time = (time.time() - start_time_base) / X_test.shape[0]
        rmse = mean_squared_error(y_test, y_pred, squared=False)
        params = ...  # something to get the values of the parameters for this model

        # write to dataframe...

Is there a way to do that? Just to be clear, I'm asking about the iteration over models and parameters, not the writing to the dataframe part :) Should I go for a different approach altogether instead?

1 Answer


You get the df you're looking to create, with model parameters and CV results, by calling `rf_random.cv_results_`, which you can put straight into a df: `all_results = pd.DataFrame(rf_random.cv_results_)`.
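A minimal self-contained sketch of this (synthetic data and a deliberately tiny grid, purely for illustration; the variable names are mine, not from your code):

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Toy regression problem standing in for X_train / y_train
X, y = make_regression(n_samples=200, n_features=5, random_state=0)

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions={"n_estimators": [5, 10], "max_depth": [2, None]},
    n_iter=3, cv=3, random_state=0)
search.fit(X, y)

all_results = pd.DataFrame(search.cv_results_)
# One row per sampled configuration: 'params' holds the full parameter dict,
# and the timing columns already measure fitting and scoring speed per fold.
print(all_results[["params", "mean_fit_time", "mean_score_time", "mean_test_score"]])
```

Note that `mean_score_time` is the time to score a whole fold, so it already gives you a relative speed comparison between configurations without touching the test set.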

In practice this is generally considered a good measure of all the metrics you're looking for, so what you describe in the question is unnecessary. However, if you do want to go through with it (i.e. evaluate against a held-out test set rather than cross-validate), you can go through this df and define a model with each parameter combination in a loop:

    import time

    for i in range(len(all_results)):
        # cv_results_ stores each sampled configuration as a dict
        # in the 'params' column, which can be unpacked directly
        model = RandomForestRegressor(**all_results['params'][i])
        model.fit(X_train, y_train)

        start_time_base = time.time()
        y_pred = model.predict(X_test)  # evaluate the current random model on the test data
        pred_time = (time.time() - start_time_base) / X_test.shape[0]

        # Evaluate predictions however you see fit

As RandomizedSearchCV keeps a trained model only for the best parameter combination, you'll need to retrain the models in this loop.


5 Comments

That sounds great, but I specifically need to evaluate the inference speed of each model, that's why I was planning to do it on the test set afterwards. Do I get any information about it in the rf_random.cv_results_?
You can do that in the loop I provided too, just start the timer after you trained the model again. I've edited the code accordingly.
Yes, I was wondering if I could avoid the loop at this point, considering that - as you pointed out - I would need to re-train each model again. Thanks!
Right. cv_results_ does include mean and std of fit and score times for each model. However if you want to use a holdout test set you'll need to retrain, as the model objects aren't all saved.
Just as a follow-up, I did use the results from cv_results to select a bunch of more "promising" models, in terms of the speed/accuracy trade-off, and I re-trained and evaluated only those on the test set. Thanks again for the great help!
