{"id":8761,"date":"2020-09-21T20:28:48","date_gmt":"2020-09-21T20:28:48","guid":{"rendered":"https:\/\/www.askpython.com\/?p=8761"},"modified":"2021-01-06T07:44:39","modified_gmt":"2021-01-06T07:44:39","slug":"linear-regression-in-python","status":"publish","type":"post","link":"https:\/\/www.askpython.com\/python\/examples\/linear-regression-in-python","title":{"rendered":"Simple Linear Regression: A Practical Implementation in Python"},"content":{"rendered":"\n<p>Welcome to this article on simple linear regression. Today we will look at how to build a simple linear regression model for a given dataset. If the concept is new to you, you may want to read our article on the theory of simple linear regression before working through the coding example here.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">6 Steps to build a Linear Regression model<\/h2>\n\n\n\n<p>Step 1: Importing the dataset<br>Step 2: Data preprocessing<br>Step 3: Splitting the dataset into training and test sets<br>Step 4: Fitting the linear regression model to the training set<br>Step 5: Predicting the test set results<br>Step 6: Visualizing the results<\/p>\n\n\n\n<p>Now that we have seen the steps, let us begin coding.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Implementing a Linear Regression Model in Python<\/h2>\n\n\n\n<p>In this article, we will be using a salary dataset. It has two columns: Years of Experience and Salary.<\/p>\n\n\n\n<p>The dataset is available at <a href=\"https:\/\/github.com\/content-anu\/dataset-simple-linear\" class=\"rank-math-link\" target=\"_blank\" rel=\"noopener\">https:\/\/github.com\/content-anu\/dataset-simple-linear<\/a><\/p>\n\n\n\n<h3 class=\"wp-block-heading\">1. 
Importing the dataset<\/h3>\n\n\n\n<p>We begin by importing the dataset using <a href=\"https:\/\/www.askpython.com\/python-modules\/pandas\/python-pandas-module-tutorial\" class=\"rank-math-link\">pandas<\/a>, along with other libraries such as <a href=\"https:\/\/www.askpython.com\/python-modules\/numpy\/python-numpy-module\" class=\"rank-math-link\">numpy<\/a> and <a href=\"https:\/\/www.askpython.com\/python-modules\/matplotlib\/python-matplotlib\" class=\"rank-math-link\">matplotlib<\/a>.<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nimport numpy as np\nimport pandas as pd\nimport matplotlib.pyplot as plt\n\ndataset = pd.read_csv(&#039;Salary_Data.csv&#039;)\ndataset.head()\n<\/pre><\/div>\n\n\n<p><code>dataset.head()<\/code> shows the first five rows of our dataset. The output of the above snippet is as follows:<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"255\" height=\"190\" src=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/09\/image-59.png\" alt=\"Image 59\" class=\"wp-image-8762\" srcset=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/09\/image-59.png 255w, https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/09\/image-59-160x120.png 160w\" sizes=\"auto, (max-width: 255px) 100vw, 255px\" \/><figcaption>Dataset<\/figcaption><\/figure><\/div>\n\n\n\n<h3 class=\"wp-block-heading\">2. Data Preprocessing<\/h3>\n\n\n\n<p>Now that we have imported the dataset, we will perform data preprocessing. 
<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nX = dataset.iloc&#x5B;:,:-1].values  #independent variable array\ny = dataset.iloc&#x5B;:,1].values  #dependent variable vector\n<\/pre><\/div>\n\n\n<p><code>X<\/code> is the independent variable array and <code>y<\/code> is the dependent variable vector. Note the difference between the two: <code>X<\/code> is a 2D array (one row per observation, one column per feature), which is the shape scikit-learn expects for features, while <code>y<\/code> is a 1D vector of target values.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">3. Splitting the dataset<\/h3>\n\n\n\n<p>We need to split our dataset into a training set and a test set. Common choices are an 80-20 or a 70-30 split between training and test data; here we hold out one third of the data for testing.<\/p>\n\n\n\n<p><strong>Why is it necessary to perform splitting? <\/strong>We train our model on the training set, using the years of experience and the salaries. We then evaluate the model on the test set, which it has not seen during training.<\/p>\n\n\n\n<p>We check whether the predictions made by the model on the test set are close to the actual salaries given in the dataset.<\/p>\n\n\n\n<p>If they are close, it suggests that our model generalizes well to unseen data.<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nfrom sklearn.model_selection import train_test_split\nX_train, X_test, y_train, y_test = train_test_split(X,y,test_size=1\/3,random_state=0)\n<\/pre><\/div>\n\n\n<p>We don&#8217;t need to apply feature scaling here: ordinary least squares regression gives the same predictions regardless of how the feature is scaled.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">4. Fitting the linear regression model to the training set<\/h3>\n\n\n\n<p>From sklearn\u2019s <code>linear_model<\/code> module, import the <code>LinearRegression<\/code> class. 
Create an instance of this class called <code>regressor<\/code>.<\/p>\n\n\n\n<p>To fit the regressor to the training set, we call its <code>fit<\/code> method.<\/p>\n\n\n\n<p>We pass <code>X_train<\/code> (the training matrix of features) together with the target values <code>y_train<\/code>. The model learns the linear relationship between the two, and can then predict the dependent variable from the independent variable.<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nfrom sklearn.linear_model import LinearRegression\nregressor = LinearRegression()\nregressor.fit(X_train,y_train) # estimates the slope and intercept of the regression line\n<\/pre><\/div>\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"775\" height=\"32\" src=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/09\/image-60.png\" alt=\"Image 60\" class=\"wp-image-8763\" srcset=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/09\/image-60.png 775w, https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/09\/image-60-300x12.png 300w, https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/09\/image-60-768x32.png 768w\" sizes=\"auto, (max-width: 775px) 100vw, 775px\" \/><figcaption>Output equation<\/figcaption><\/figure><\/div>\n\n\n\n<h3 class=\"wp-block-heading\">5. Predicting the test set results<\/h3>\n\n\n\n<p>We create a vector containing the predicted salaries for the test set. This vector, called <code>y_pred<\/code>, holds one prediction for each observation in the test set.<\/p>\n\n\n\n<p>The <code>predict<\/code> method makes the predictions for the test set. Hence, the input is the test set. 
The argument to <code>predict<\/code> must be an array or sparse matrix, so we pass <code>X_test<\/code>.<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\ny_pred = regressor.predict(X_test) \ny_pred\n<\/pre><\/div>\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"749\" height=\"84\" src=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/09\/image-61.png\" alt=\"Image 61\" class=\"wp-image-8764\" srcset=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/09\/image-61.png 749w, https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/09\/image-61-300x34.png 300w\" sizes=\"auto, (max-width: 749px) 100vw, 749px\" \/><figcaption>y-pred output<\/figcaption><\/figure><\/div>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\ny_test\n<\/pre><\/div>\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"679\" height=\"64\" src=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/09\/image-62.png\" alt=\"Image 62\" class=\"wp-image-8765\" srcset=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/09\/image-62.png 679w, https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/09\/image-62-300x28.png 300w\" sizes=\"auto, (max-width: 679px) 100vw, 679px\" \/><figcaption>y-test output<\/figcaption><\/figure><\/div>\n\n\n\n<p><code>y_test<\/code> contains the real salaries of the test set.<br><code>y_pred<\/code> contains the predicted salaries.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">6. Visualizing the results<\/h3>\n\n\n\n<p>Let&#8217;s see what the results of our code look like when we visualize them.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"Plotting-the-points-ie-observations\">1. Plotting the points (observations)<\/h4>\n\n\n\n<p>To visualize the data, we plot graphs using matplotlib. 
First, we plot the real observation points, i.e. the actual values given in the dataset.<\/p>\n\n\n\n<p>The X-axis shows the years of experience and the Y-axis shows the salaries.<\/p>\n\n\n\n<p><code>plt.scatter<\/code> draws a scatter plot of the data. Its parameters include:<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li>X coordinate (X_train: number of years)<\/li><li>Y coordinate (y_train: real salaries of the employees)<\/li><li>Color (red for the observation points; the regression line is drawn in blue)<\/li><\/ol>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"Plotting-the-regression-line\">2. Plotting the regression line<\/h4>\n\n\n\n<p><code>plt.plot<\/code> takes the following parameters:<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li>X coordinates (X_train): number of years<\/li><li>Y coordinates (predictions on X_train): the salaries predicted for the training set, based on the number of years.<\/li><\/ol>\n\n\n\n<p>Note: The y-coordinate is not <code>y_pred<\/code>, because <code>y_pred<\/code> holds the predicted salaries for the test set observations, whereas here we plot the predictions for the training set.<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\n#plot for the TRAIN\n\nplt.scatter(X_train, y_train, color=&#039;red&#039;) # plotting the observation points\n\nplt.plot(X_train, regressor.predict(X_train), color=&#039;blue&#039;) # plotting the regression line\n\nplt.title(&quot;Salary vs Experience (Training set)&quot;) # stating the title of the graph\n\nplt.xlabel(&quot;Years of experience&quot;) # adding the name of x-axis\nplt.ylabel(&quot;Salaries&quot;) # adding the name of y-axis\nplt.show() # displays the graph\n<\/pre><\/div>\n\n\n<p><strong>The above code generates the plot for the training set shown below:<\/strong><\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"517\" height=\"342\" src=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/09\/image-63.png\" alt=\"Image 63\" class=\"wp-image-8766\" 
srcset=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/09\/image-63.png 517w, https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/09\/image-63-300x198.png 300w\" sizes=\"auto, (max-width: 517px) 100vw, 517px\" \/><figcaption>Output graph for training set<\/figcaption><\/figure><\/div>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\n#plot for the TEST\n\nplt.scatter(X_test, y_test, color=&#039;red&#039;) \nplt.plot(X_train, regressor.predict(X_train), color=&#039;blue&#039;) # plotting the regression line\n\nplt.title(&quot;Salary vs Experience (Testing set)&quot;)\n\nplt.xlabel(&quot;Years of experience&quot;) \nplt.ylabel(&quot;Salaries&quot;) \nplt.show() \n<\/pre><\/div>\n\n\n<p><strong>The above code snippet generates a plot as shown below:<\/strong><\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"536\" height=\"344\" src=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/09\/image-64.png\" alt=\"Image 64\" class=\"wp-image-8767\" srcset=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/09\/image-64.png 536w, https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/09\/image-64-300x193.png 300w\" sizes=\"auto, (max-width: 536px) 100vw, 536px\" \/><figcaption>Output graph for test set<\/figcaption><\/figure><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">Complete Python Code for Implementing Linear Regression<\/h2>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\n# importing the dataset\nimport numpy as np\nimport pandas as pd\nimport matplotlib.pyplot as plt\n \ndataset = pd.read_csv(&#039;Salary_Data.csv&#039;)\ndataset.head()\n\n# data preprocessing\nX = dataset.iloc&#x5B;:, :-1].values  #independent variable array\ny = dataset.iloc&#x5B;:,1].values  #dependent variable vector\n\n# splitting the dataset\nfrom 
sklearn.model_selection import train_test_split\nX_train, X_test, y_train, y_test = train_test_split(X,y,test_size=1\/3,random_state=0)\n\n# fitting the regression model\nfrom sklearn.linear_model import LinearRegression\nregressor = LinearRegression()\nregressor.fit(X_train,y_train) # estimates the slope and intercept of the regression line\n\n# predicting the test set results\ny_pred = regressor.predict(X_test) \ny_pred\n\ny_test\n\n# visualizing the results\n#plot for the TRAIN\n \nplt.scatter(X_train, y_train, color=&#039;red&#039;) # plotting the observation points\nplt.plot(X_train, regressor.predict(X_train), color=&#039;blue&#039;) # plotting the regression line\nplt.title(&quot;Salary vs Experience (Training set)&quot;) # stating the title of the graph\n \nplt.xlabel(&quot;Years of experience&quot;) # adding the name of x-axis\nplt.ylabel(&quot;Salaries&quot;) # adding the name of y-axis\nplt.show() # displays the graph\n\n#plot for the TEST\n \nplt.scatter(X_test, y_test, color=&#039;red&#039;) # plotting the test set observations\nplt.plot(X_train, regressor.predict(X_train), color=&#039;blue&#039;) # plotting the regression line\nplt.title(&quot;Salary vs Experience (Testing set)&quot;)\n \nplt.xlabel(&quot;Years of experience&quot;) \nplt.ylabel(&quot;Salaries&quot;) \nplt.show() \n<\/pre><\/div>\n\n\n<p>The output of the above code snippet is as shown below:<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"373\" src=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/09\/image-65-1024x373.png\" alt=\"Image 65\" class=\"wp-image-8770\" srcset=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/09\/image-65-1024x373.png 1024w, https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/09\/image-65-300x109.png 300w, https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/09\/image-65-768x280.png 768w, https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/09\/image-65.png 1057w\" 
sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption>Output graphs<\/figcaption><\/figure><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">Conclusion<\/h2>\n\n\n\n<p>We have come to the end of this article on Simple Linear Regression. Hope you liked our example and have tried coding the model as well. Do let us know your feedback in the comment section below.<\/p>\n\n\n\n<p>If you&#8217;re interested in more regression models, do read through <a href=\"https:\/\/www.askpython.com\/python\/examples\/multiple-linear-regression\" class=\"rank-math-link\">multiple linear regression model<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Welcome to this article on simple linear regression. Today we will look at how to build a simple linear regression model given a dataset. You can go through our article detailing the concept of simple linear regression prior to the coding example in this article. 6 Steps to build a Linear Regression model Step 1: [&hellip;]<\/p>\n","protected":false},"author":12,"featured_media":8772,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[9],"tags":[],"class_list":["post-8761","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-examples"],"blocksy_meta":[],"_links":{"self":[{"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/posts\/8761","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/users\/12"}],"replies":[{"embeddable":true,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/comments?post=8761"}],"version-history":[{"count":0,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/posts\/8761\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.askpyt
hon.com\/wp-json\/wp\/v2\/media\/8772"}],"wp:attachment":[{"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/media?parent=8761"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/categories?post=8761"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/tags?post=8761"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}