{"id":8788,"date":"2020-09-21T21:19:19","date_gmt":"2020-09-21T21:19:19","guid":{"rendered":"https:\/\/www.askpython.com\/?p=8788"},"modified":"2020-09-21T21:19:22","modified_gmt":"2020-09-21T21:19:22","slug":"random-forest-regression","status":"publish","type":"post","link":"https:\/\/www.askpython.com\/python\/examples\/random-forest-regression","title":{"rendered":"Random Forest Regression: A Complete Reference"},"content":{"rendered":"\n<p>Welcome to this article on Random Forest Regression. Let me quickly walk you through the meaning of regression first.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">What is Regression in Machine Learning?<\/h2>\n\n\n\n<p>Regression is a machine learning technique that is used to predict continuous values across a certain range. Let us understand this concept with an example: consider the salaries of employees and their experience in years. <\/p>\n\n\n\n<p>A regression model trained on this data can predict the salary of an employee even for a number of years of experience that has no corresponding salary in the dataset.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">What is Random Forest Regression?<\/h2>\n\n\n\n<p>Random forest regression is an ensemble learning technique. But what is ensemble learning?<\/p>\n\n\n\n<p>In ensemble learning, you take multiple algorithms, or the same algorithm multiple times, and put together a model that\u2019s more powerful than the original.<\/p>\n\n\n\n<p>A prediction based on many trees is usually more accurate than one from a single tree, because the forest averages the predictions of all its trees. These algorithms are also more stable, because a change in the dataset may affect one tree but is unlikely to affect the whole forest of trees. 
<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Steps to perform the random forest regression<\/h2>\n\n\n\n<p>This is a four-step process:<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li>Pick K random data points from the training set.<\/li><li>Build the decision tree associated with these K data points.<\/li><li>Choose the number Ntree of trees you want to build and repeat steps 1 and 2.<\/li><li>For a new data point, make each of your Ntree trees predict the value of Y for the data point in question, and assign the new data point the average of all the predicted Y values.<\/li><\/ol>\n\n\n\n<h2 class=\"wp-block-heading\">Implementing Random Forest Regression in Python<\/h2>\n\n\n\n<p>Our goal here is to build a team of decision trees, each making a prediction about the dependent variable. The final prediction of the random forest is the average of the predictions of all the trees. <\/p>\n\n\n\n<p>For our example, we will be using the Salary &#8211; positions dataset to predict the salary based on the position level. <\/p>\n\n\n\n<p>The dataset used can be found at <a href=\"https:\/\/github.com\/content-anu\/dataset-polynomial-regression\" class=\"rank-math-link\" target=\"_blank\" rel=\"noopener\">https:\/\/github.com\/content-anu\/dataset-polynomial-regression<\/a><\/p>\n\n\n\n<h3 class=\"wp-block-heading\">1. 
Importing the dataset<\/h3>\n\n\n\n<p>We&#8217;ll use the <a href=\"https:\/\/www.askpython.com\/python-modules\/numpy\/python-numpy-module\" class=\"rank-math-link\">numpy<\/a>, <a href=\"https:\/\/www.askpython.com\/python-modules\/pandas\/python-pandas-module-tutorial\" class=\"rank-math-link\">pandas<\/a>, and <a href=\"https:\/\/www.askpython.com\/python-modules\/matplotlib\/python-matplotlib\" class=\"rank-math-link\">matplotlib<\/a> libraries to implement our model.<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nimport pandas as pd\nimport numpy as np\nimport matplotlib.pyplot as plt\n\ndataset = pd.read_csv(&#039;Position_Salaries.csv&#039;)\ndataset.head()\n<\/pre><\/div>\n\n\n<p>The dataset snapshot is as follows: <\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"296\" height=\"192\" src=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/09\/image-69.png\" alt=\"Image 69\" class=\"wp-image-8790\"\/><figcaption>Output snapshot of dataset<\/figcaption><\/figure><\/div>\n\n\n\n<h3 class=\"wp-block-heading\">2. Data preprocessing<\/h3>\n\n\n\n<p>This dataset needs very little preprocessing. We only have to separate the matrix of features (X) from the target vector (y).<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\n# matrix of features: column 1, sliced as 1:2 so that X stays two-dimensional\nX = dataset.iloc&#x5B;:,1:2].values\n# target vector: column 2\ny = dataset.iloc&#x5B;:,2].values\n<\/pre><\/div>\n\n\n<h3 class=\"wp-block-heading\" id=\"Fitting-the-Random-forest-regression-to-dataset\">3. Fitting the Random forest regression to dataset<\/h3>\n\n\n\n<p>We will import the RandomForestRegressor from the ensemble module of sklearn. We create a regressor object using the RandomForestRegressor class constructor. The parameters include:<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li>n_estimators : number of trees in the forest. 
(default = 100 since scikit-learn 0.22; it was 10 in earlier versions)<\/li><li>criterion : the function used to measure the quality of a split. The default is the mean squared error, named mse in older scikit-learn releases and squared_error since version 1.0. Decision trees use the same criterion.<\/li><li>random_state : the seed that controls the randomness of the forest, so that the results are reproducible.<\/li><\/ol>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nfrom sklearn.ensemble import RandomForestRegressor\nregressor = RandomForestRegressor(n_estimators = 10, random_state = 0)\nregressor.fit(X,y)\n<\/pre><\/div>\n\n\n<p><strong>The fitted regressor is displayed as follows:<\/strong><\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"716\" height=\"139\" src=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/09\/image-70.png\" alt=\"Image 70\" class=\"wp-image-8791\" srcset=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/09\/image-70.png 716w, https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/09\/image-70-300x58.png 300w\" sizes=\"auto, (max-width: 716px) 100vw, 716px\" \/><figcaption>Fitted regressor<\/figcaption><\/figure><\/div>\n\n\n\n<p>Let us make a test prediction for position level 6.5:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\ny_pred=regressor.predict(&#x5B;&#x5B;6.5]])\ny_pred\n<\/pre><\/div>\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"177\" height=\"41\" src=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/09\/image-71.png\" alt=\"Image 71\" class=\"wp-image-8792\"\/><figcaption>Output of the prediction<\/figcaption><\/figure><\/div>\n\n\n\n<h3 class=\"wp-block-heading\">4. 
Visualizing the result<\/h3>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\n# higher resolution grid for a smoother curve\nX_grid = np.arange(X.min(), X.max(), 0.01)\nX_grid = X_grid.reshape(-1, 1)\n\nplt.scatter(X, y, color=&#039;red&#039;) # plot the real data points\nplt.plot(X_grid, regressor.predict(X_grid), color=&#039;blue&#039;) # plot the predicted values\n\nplt.title(&quot;Truth or Bluff(Random Forest - Smooth)&quot;)\nplt.xlabel(&#039;Position level&#039;)\nplt.ylabel(&#039;Salary&#039;)\nplt.show()\n<\/pre><\/div>\n\n\n<p>The graph produced is shown below:<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"557\" height=\"362\" src=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/09\/image-72.png\" alt=\"Image 72\" class=\"wp-image-8793\" srcset=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/09\/image-72.png 557w, https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/09\/image-72-300x195.png 300w\" sizes=\"auto, (max-width: 557px) 100vw, 557px\" \/><figcaption>Graph for 10 trees<\/figcaption><\/figure><\/div>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"Interpretation-of-the-above-graph\">5. Interpretation of the above graph<\/h3>\n\n\n\n<p>We get many more steps in this graph than with a single decision tree. There are many more intervals and splits, so the staircase has more steps. <\/p>\n\n\n\n<p>Every prediction is the average of 10 votes (we have taken 10 decision trees). The random forest calculates such an average for each of these intervals. <\/p>\n\n\n\n<p>Including more trees generally makes the prediction more stable, because the average over many trees converges to the same final value.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">6. 
Rebuilding the model for 100 trees<\/h3>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nfrom sklearn.ensemble import RandomForestRegressor\nregressor = RandomForestRegressor(n_estimators = 100, random_state = 0)\nregressor.fit(X,y)\n<\/pre><\/div>\n\n\n<p>The regressor created for the above 100 trees is displayed as follows:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/09\/image-73.png\" alt=\"Image 73\" class=\"wp-image-8794\" width=\"580\" height=\"112\" srcset=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/09\/image-73.png 718w, https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/09\/image-73-300x58.png 300w\" sizes=\"auto, (max-width: 580px) 100vw, 580px\" \/><figcaption>Regressor for 100 trees<\/figcaption><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">7. Creating the graph for 100 trees<\/h3>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\n# higher resolution grid for a smoother curve\nX_grid = np.arange(X.min(), X.max(), 0.01)\nX_grid = X_grid.reshape(-1, 1)\n\nplt.scatter(X, y, color=&#039;red&#039;) # plot the real data points\nplt.plot(X_grid, regressor.predict(X_grid), color=&#039;blue&#039;) # plot the predicted values\n\nplt.title(&quot;Truth or Bluff(Random Forest - Smooth)&quot;)\nplt.xlabel(&#039;Position level&#039;)\nplt.ylabel(&#039;Salary&#039;)\nplt.show()\n<\/pre><\/div>\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"535\" height=\"357\" src=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/09\/image-77.png\" alt=\"Image 77\" class=\"wp-image-8798\" srcset=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/09\/image-77.png 535w, https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/09\/image-77-300x200.png 300w\" sizes=\"auto, (max-width: 535px) 100vw, 
535px\" \/><figcaption>Graph with 100 trees<\/figcaption><\/figure><\/div>\n\n\n\n<p>The number of steps in the graph does not grow tenfold just because the forest has ten times as many trees, but the prediction should be more stable. Let\u2019s predict the salary for the same position level.<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\ny_pred=regressor.predict(&#x5B;&#x5B;6.5]])\ny_pred\n<\/pre><\/div>\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"171\" height=\"35\" src=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/09\/image-75.png\" alt=\"Image 75\" class=\"wp-image-8796\"\/><figcaption>Output prediction<\/figcaption><\/figure><\/div>\n\n\n\n<h3 class=\"wp-block-heading\">8. Rebuilding the model for 300 trees<\/h3>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nfrom sklearn.ensemble import RandomForestRegressor\nregressor = RandomForestRegressor(n_estimators = 300, random_state = 0)\nregressor.fit(X,y)\n<\/pre><\/div>\n\n\n<p>The above code snippet produces the following regressor:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"716\" height=\"143\" src=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/09\/image-76.png\" alt=\"Image 76\" class=\"wp-image-8797\" srcset=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/09\/image-76.png 716w, https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/09\/image-76-300x60.png 300w\" sizes=\"auto, (max-width: 716px) 100vw, 716px\" \/><figcaption>Regressor for 300 trees<\/figcaption><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">9. 
Graph for 300 trees<\/h3>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\n# higher resolution grid for a smoother curve\nX_grid = np.arange(X.min(), X.max(), 0.01)\nX_grid = X_grid.reshape(-1, 1)\n\nplt.scatter(X, y, color=&#039;red&#039;) # plot the real data points\nplt.plot(X_grid, regressor.predict(X_grid), color=&#039;blue&#039;) # plot the predicted values\n\nplt.title(&quot;Truth or Bluff(Random Forest - Smooth)&quot;)\nplt.xlabel(&#039;Position level&#039;)\nplt.ylabel(&#039;Salary&#039;)\nplt.show()\n<\/pre><\/div>\n\n\n<p>The above code produces the following graph:<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"535\" height=\"357\" src=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/09\/image-77.png\" alt=\"Image 77\" class=\"wp-image-8798\" srcset=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/09\/image-77.png 535w, https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/09\/image-77-300x200.png 300w\" sizes=\"auto, (max-width: 535px) 100vw, 535px\" \/><figcaption>Graph for 300 trees<\/figcaption><\/figure><\/div>\n\n\n\n<p>Now, let us make a prediction. 
<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\ny_pred=regressor.predict(&#x5B;&#x5B;6.5]])\ny_pred\n<\/pre><\/div>\n\n\n<p>The output for the above code is as follows:<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"269\" height=\"34\" src=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/09\/image-78.png\" alt=\"Image 78\" class=\"wp-image-8799\"\/><figcaption>Prediction using 300 trees<\/figcaption><\/figure><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">Complete Python Code for Implementing Random Forest Regression<\/h2>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nimport pandas as pd\nimport numpy as np\nimport matplotlib.pyplot as plt\n\ndataset = pd.read_csv(&#039;Position_Salaries.csv&#039;)\nprint(dataset.head())\n\nX = dataset.iloc&#x5B;:,1:2].values\ny = dataset.iloc&#x5B;:,2].values\n\n# for 10 trees\nfrom sklearn.ensemble import RandomForestRegressor\nregressor = RandomForestRegressor(n_estimators = 10, random_state = 0)\nregressor.fit(X,y)\n\ny_pred=regressor.predict(&#x5B;&#x5B;6.5]])\nprint(y_pred)\n\n# higher resolution grid for a smoother curve\nX_grid = np.arange(X.min(), X.max(), 0.01)\nX_grid = X_grid.reshape(-1, 1)\n\nplt.scatter(X, y, color=&#039;red&#039;) # plot the real data points\nplt.plot(X_grid, regressor.predict(X_grid), color=&#039;blue&#039;) # plot the predicted values\n\nplt.title(&quot;Truth or Bluff(Random Forest - Smooth)&quot;)\nplt.xlabel(&#039;Position level&#039;)\nplt.ylabel(&#039;Salary&#039;)\nplt.show()\n\n\n# for 100 trees\nfrom sklearn.ensemble import RandomForestRegressor\nregressor = RandomForestRegressor(n_estimators = 100, random_state = 0)\nregressor.fit(X,y)\n\n# higher resolution grid for a smoother curve\nX_grid = np.arange(X.min(), X.max(), 0.01)\nX_grid = X_grid.reshape(-1, 1)\nplt.scatter(X, y, color=&#039;red&#039;) \n 
\nplt.plot(X_grid, regressor.predict(X_grid), color=&#039;blue&#039;) \nplt.title(&quot;Truth or Bluff(Random Forest - Smooth)&quot;)\nplt.xlabel(&#039;Position level&#039;)\nplt.ylabel(&#039;Salary&#039;)\nplt.show()\n\ny_pred=regressor.predict(&#x5B;&#x5B;6.5]])\nprint(y_pred)\n\n# for 300 trees\nfrom sklearn.ensemble import RandomForestRegressor\nregressor = RandomForestRegressor(n_estimators = 300, random_state = 0)\nregressor.fit(X,y)\n\n# higher resolution grid for a smoother curve\nX_grid = np.arange(X.min(), X.max(), 0.01)\nX_grid = X_grid.reshape(-1, 1)\n\nplt.scatter(X, y, color=&#039;red&#039;) # plot the real data points\nplt.plot(X_grid, regressor.predict(X_grid), color=&#039;blue&#039;) # plot the predicted values\n\nplt.title(&quot;Truth or Bluff(Random Forest - Smooth)&quot;)\nplt.xlabel(&#039;Position level&#039;)\nplt.ylabel(&#039;Salary&#039;)\nplt.show()\n\ny_pred=regressor.predict(&#x5B;&#x5B;6.5]])\nprint(y_pred)\n\n<\/pre><\/div>\n\n\n<p>The above code produces the graphs and the predicted values. The graphs are shown below: <\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"399\" src=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/09\/image-79-1024x399.png\" alt=\"Image 79\" class=\"wp-image-8801\" srcset=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/09\/image-79-1024x399.png 1024w, https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/09\/image-79-300x117.png 300w, https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/09\/image-79-768x299.png 768w, https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/09\/image-79.png 1318w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption>Output graphs<\/figcaption><\/figure><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">Conclusion<\/h2>\n\n\n\n<p>As you have observed, the 10-tree model predicted a salary of 167,000 for position level 6.5. 
The 100-tree model predicted 158,300 and the 300-tree model predicted 160,333.33. In general, adding more trees makes the averaged prediction more stable, although the improvement levels off as the forest grows. <\/p>\n","protected":false},"excerpt":{"rendered":"<p>Welcome to this article on Random Forest Regression. Let me quickly walk you through the meaning of regression first. What is Regression in Machine Learning? Regression is a machine learning technique that is used to predict values across a certain range. Let us understand this concept with an example: consider the salaries of employees [&hellip;]<\/p>\n","protected":false},"author":12,"featured_media":8803,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[9],"tags":[],"class_list":["post-8788","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-examples"],"blocksy_meta":[],"_links":{"self":[{"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/posts\/8788","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/users\/12"}],"replies":[{"embeddable":true,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/comments?post=8788"}],"version-history":[{"count":0,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/posts\/8788\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/media\/8803"}],"wp:attachment":[{"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/media?parent=8788"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/categories?post=8788"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/tags?post=8788"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/
{rel}","templated":true}]}}