{"id":12164,"date":"2021-01-30T06:10:00","date_gmt":"2021-01-30T06:10:00","guid":{"rendered":"https:\/\/www.askpython.com\/?p=12164"},"modified":"2021-02-08T15:49:36","modified_gmt":"2021-02-08T15:49:36","slug":"anova-test-in-python","status":"publish","type":"post","link":"https:\/\/www.askpython.com\/python\/examples\/anova-test-in-python","title":{"rendered":"ANOVA test in Python"},"content":{"rendered":"\n<p>Hello readers! Today we will be focusing in detail on an important statistical test in data science, the <strong>ANOVA test<\/strong>, and on how to run it in Python.<\/p>\n\n\n\n<p>So, let us get started!!<\/p>\n\n\n\n<hr class=\"wp-block-separator\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">Emergence of ANOVA test<\/h2>\n\n\n\n<p>In the domain of data science and machine learning, the data needs to be understood and processed prior to modelling. That is, we need to analyze every variable of the dataset and how much it contributes to the target value.<\/p>\n\n\n\n<p>Usually, there are two kinds of variables:<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li><strong>Continuous variables<\/strong><\/li><li><strong>Categorical variables<\/strong><\/li><\/ol>\n\n\n\n<p>The most commonly used statistical tests for analyzing numeric variables are:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li><strong>T-test<\/strong><\/li><li><a href=\"https:\/\/www.askpython.com\/python\/examples\/correlation-matrix-in-python\" target=\"_blank\" aria-label=\"Correlation regression analysis (opens in a new tab)\" rel=\"noreferrer noopener\" class=\"rank-math-link\">Correlation regression analysis<\/a>, etc.<\/li><\/ul>\n\n\n\n<p>The ANOVA test is a statistical test for categorical variables, i.e. 
it analyzes how a categorical variable relates to a numeric target.<\/p>\n\n\n\n<hr class=\"wp-block-separator\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">What is ANOVA test all about?<\/h2>\n\n\n\n<p>The <strong>ANOVA test<\/strong> (analysis of variance) is a statistical test for analyzing categorical variables. It estimates the extent to which a continuous dependent variable is affected by one or more independent categorical variables.<\/p>\n\n\n\n<p>With the ANOVA test, we compare the mean of the dependent variable across the groups (levels) of the independent categorical variable.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Hypothesis for ANOVA testing<\/h4>\n\n\n\n<p>As we all know, a hypothesis test is framed in terms of two claims: the Null Hypothesis and the Alternate Hypothesis.<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>In the case of the ANOVA test, our <strong>Null hypothesis<\/strong> would claim the following: &#8220;The statistical mean of all the groups\/categories of the variable is the same.&#8221;<\/li><li>On the other hand, the <strong>Alternate Hypothesis<\/strong> would claim as follows: &#8220;The statistical mean of at least one group\/category of the variable differs from the others.&#8221;<\/li><\/ul>\n\n\n\n<p>Having said this, let us now focus on the assumptions of ANOVA testing.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Assumptions of ANOVA testing<\/h4>\n\n\n\n<ul class=\"wp-block-list\"><li>The observations within each group follow a normal distribution.<\/li><li>The groups share a common variance.<\/li><\/ul>\n\n\n\n<hr class=\"wp-block-separator\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">ANOVA test in Python &#8211; Simple Practical Approach!<\/h2>\n\n\n\n<p>In this example, we will use the Bike Rental Count Prediction dataset, in which the task is to predict the number of customers who would rent a bike under the given conditions.<\/p>\n\n\n\n<p>You can find the dataset 
<a href=\"https:\/\/github.com\/Safa1615\/BIKE-RENTAL-COUNT\/blob\/master\/day.csv\" target=\"_blank\" aria-label=\"here (opens in a new tab)\" rel=\"noreferrer noopener\" class=\"rank-math-link\">here<\/a>!<\/p>\n\n\n\n<p>First, we load the dataset into the Python environment using the <a href=\"https:\/\/www.askpython.com\/python-modules\/python-csv-module\" class=\"rank-math-link\"><code>read_csv()<\/code> function<\/a>. Then, based on exploratory data analysis (EDA), we convert the categorical columns to strings and parse the date column as a datetime. We use the <a href=\"https:\/\/www.askpython.com\/python-modules\/python-os-module-10-must-know-functions\" class=\"rank-math-link\">os module<\/a> to change the working directory and the <a href=\"https:\/\/www.askpython.com\/python-modules\/pandas\/python-pandas-module-tutorial\" class=\"rank-math-link\">Pandas library<\/a> to parse the CSV data.<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nimport os\nimport pandas\n# Change the current working directory to the folder that contains day.csv\nos.chdir(&quot;D:\/Ediwsor_Project - Bike_Rental_Count&quot;)\nBIKE = pandas.read_csv(&quot;day.csv&quot;)\n# Store the categorical columns as strings so they are treated as categories\nBIKE&#x5B;&#039;holiday&#039;]=BIKE&#x5B;&#039;holiday&#039;].astype(str)\nBIKE&#x5B;&#039;weekday&#039;]=BIKE&#x5B;&#039;weekday&#039;].astype(str)\nBIKE&#x5B;&#039;workingday&#039;]=BIKE&#x5B;&#039;workingday&#039;].astype(str)\nBIKE&#x5B;&#039;weathersit&#039;]=BIKE&#x5B;&#039;weathersit&#039;].astype(str)\n# Parse the date column as a datetime\nBIKE&#x5B;&#039;dteday&#039;]=pandas.to_datetime(BIKE&#x5B;&#039;dteday&#039;])\nBIKE&#x5B;&#039;season&#039;]=BIKE&#x5B;&#039;season&#039;].astype(str)\nBIKE&#x5B;&#039;yr&#039;]=BIKE&#x5B;&#039;yr&#039;].astype(str)\nBIKE&#x5B;&#039;mnth&#039;]=BIKE&#x5B;&#039;mnth&#039;].astype(str)\nprint(BIKE.dtypes)\n<\/pre><\/div>\n\n\n<p><strong>Output:<\/strong><\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\ninstant                
int64\ndteday        datetime64&#x5B;ns]\nseason                object\nyr                    object\nmnth                  object\nholiday               object\nweekday               object\nworkingday            object\nweathersit            object\ntemp                 float64\natemp                float64\nhum                  float64\nwindspeed            float64\ncasual                 int64\nregistered             int64\ncnt                    int64\ndtype: object\n<\/pre><\/div>\n\n\n<p>Now it is time to apply the ANOVA test. The <code>statsmodels<\/code> library provides the <code>anova_lm()<\/code> function for this.<\/p>\n\n\n\n<p>First, we fit an <strong>Ordinary Least Squares<\/strong> (OLS) model of <code>cnt<\/code> on each categorical variable, and then apply the ANOVA test to the fitted model.<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nimport statsmodels.api as sm\nfrom statsmodels.formula.api import ols\n\n# Categorical columns to test (the columns converted to str above)\ncategorical_col = &#x5B;&#039;season&#039;, &#039;yr&#039;, &#039;mnth&#039;, &#039;holiday&#039;, &#039;weekday&#039;, &#039;workingday&#039;, &#039;weathersit&#039;]\n\nfor x in categorical_col:\n    model = ols(&#039;cnt&#039; + &#039;~&#039; + x, data = BIKE).fit() # Ordinary least squares fit of cnt on x\n    result_anova = sm.stats.anova_lm(model) # ANOVA test on the fitted model\n    print(result_anova)\n<\/pre><\/div>\n\n\n<p><strong>Output:<\/strong><\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\n             df        sum_sq       mean_sq           F        PR(&gt;F)\nseason      3.0  9.218466e+08  3.072822e+08  124.840203  5.433284e-65\nResidual  713.0  1.754981e+09  2.461404e+06         NaN           NaN\n             df        sum_sq       mean_sq           F        PR(&gt;F)\nyr          1.0  8.813271e+08  8.813271e+08  350.959951  5.148657e-64\nResidual  715.0  1.795501e+09  2.511190e+06         NaN           NaN\n             df        sum_sq       mean_sq          F        PR(&gt;F)\nmnth       11.0  1.042307e+09  9.475520e+07  40.869727  2.557743e-68\nResidual  705.0  
1.634521e+09  2.318469e+06        NaN           NaN\n             df        sum_sq       mean_sq        F    PR(&gt;F)\nholiday     1.0  1.377098e+07  1.377098e+07  3.69735  0.054896\nResidual  715.0  2.663057e+09  3.724555e+06      NaN       NaN\n             df        sum_sq       mean_sq         F    PR(&gt;F)\nweekday     6.0  1.757122e+07  2.928537e+06  0.781896  0.584261\nResidual  710.0  2.659257e+09  3.745432e+06       NaN       NaN\n               df        sum_sq       mean_sq         F    PR(&gt;F)\nworkingday    1.0  8.494340e+06  8.494340e+06  2.276122  0.131822\nResidual    715.0  2.668333e+09  3.731935e+06       NaN       NaN\n               df        sum_sq       mean_sq          F        PR(&gt;F)\nweathersit    2.0  2.679982e+08  1.339991e+08  39.718604  4.408358e-17\nResidual    714.0  2.408830e+09  3.373711e+06        NaN           NaN\n<\/pre><\/div>\n\n\n<p>We take the significance level to be 0.05. If the p-value (the <code>PR(&gt;F)<\/code> column) is less than 0.05, we reject the null hypothesis and conclude that the mean of <code>cnt<\/code> differs significantly across the groups formed by the levels of that categorical variable. In the output above, <code>season<\/code>, <code>yr<\/code>, <code>mnth<\/code> and <code>weathersit<\/code> have p-values far below 0.05, so we reject the null hypothesis for them. For <code>holiday<\/code>, <code>weekday<\/code> and <code>workingday<\/code>, the p-values are above 0.05, so we fail to reject the null hypothesis.<\/p>\n\n\n\n<hr class=\"wp-block-separator\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">Conclusion<\/h2>\n\n\n\n<p>With this, we have reached the end of this topic. Feel free to comment below in case you have any questions.<\/p>\n\n\n\n<p><strong><em>Recommended read: <a href=\"https:\/\/www.askpython.com\/python\/examples\/chi-square-test\" class=\"rank-math-link\">Chi-square test in Python<\/a><\/em><\/strong><\/p>\n\n\n\n<p>Happy Analyzing!! \ud83d\ude42<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Hello readers! Today we will be focusing on an important statistical test in Data science &#8212; ANOVA test in Python programming, in detail. So, let us get started!! Emergence of ANOVA test In the domain of data science and machine learning, the data needs to be understood and processed prior to modelling. 
That is, we [&hellip;]<\/p>\n","protected":false},"author":4,"featured_media":12198,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[9],"tags":[],"class_list":["post-12164","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-examples"],"blocksy_meta":[],"_links":{"self":[{"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/posts\/12164","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/users\/4"}],"replies":[{"embeddable":true,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/comments?post=12164"}],"version-history":[{"count":0,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/posts\/12164\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/media\/12198"}],"wp:attachment":[{"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/media?parent=12164"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/categories?post=12164"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/tags?post=12164"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}