{"id":48581,"date":"2023-04-28T13:46:30","date_gmt":"2023-04-28T13:46:30","guid":{"rendered":"https:\/\/www.askpython.com\/?p=48581"},"modified":"2023-04-28T13:46:32","modified_gmt":"2023-04-28T13:46:32","slug":"bootstrap-sampling-introduction","status":"publish","type":"post","link":"https:\/\/www.askpython.com\/python\/examples\/bootstrap-sampling-introduction","title":{"rendered":"Introduction to Bootstrap Sampling in Python"},"content":{"rendered":"\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p>In statistics, bootstrap sampling is a method that repeatedly draws samples with replacement from a data source in order to estimate a population parameter.<\/p>\n<\/blockquote>\n\n\n\n<p>Sampling is the process of selecting a subset of a large collection of data in order to estimate a characteristic of the entire data set. Sampling with replacement means that a data point selected for a sample is returned to the pool, so it can be drawn again, either in the same sample or in a later one. Parameter estimation is the process of estimating the parameters of the entire population on the basis of samples.<\/p>\n\n\n\n<p>To understand the need for bootstrap sampling, let&#8217;s consider an example. Suppose we want to calculate the average age of 1000 employees working for a particular company. The first approach could be to ask all 1000 employees their age and then calculate the mean age.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"512\" src=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2023\/04\/bootstrap-sampling_1-1024x512.png\" alt=\"Bootstrap Sampling 1\" class=\"wp-image-48984\" srcset=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2023\/04\/bootstrap-sampling_1-1024x512.png 1024w, https:\/\/www.askpython.com\/wp-content\/uploads\/2023\/04\/bootstrap-sampling_1-300x150.png 300w, 
https:\/\/www.askpython.com\/wp-content\/uploads\/2023\/04\/bootstrap-sampling_1-768x384.png 768w, https:\/\/www.askpython.com\/wp-content\/uploads\/2023\/04\/bootstrap-sampling_1.png 1200w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>This can be tedious and time-consuming. Another method is to take a sample of 5 employees and collect their ages, repeat this process 20 times, and then average the collected ages of the 100 employees. This average would be the estimate of the mean age of all 1000 employees. Bootstrap sampling builds on this idea: a single sample is collected once and then resampled with replacement many times, as the examples below show.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"512\" src=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2023\/04\/bootstrap-sampling_2-1024x512.png\" alt=\"Bootstrap Sampling 2\" class=\"wp-image-48985\" srcset=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2023\/04\/bootstrap-sampling_2-1024x512.png 1024w, https:\/\/www.askpython.com\/wp-content\/uploads\/2023\/04\/bootstrap-sampling_2-300x150.png 300w, https:\/\/www.askpython.com\/wp-content\/uploads\/2023\/04\/bootstrap-sampling_2-768x384.png 768w, https:\/\/www.askpython.com\/wp-content\/uploads\/2023\/04\/bootstrap-sampling_2.png 1200w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">Implementation of Bootstrap Sampling in Python<\/h2>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p>Bootstrap sampling is a statistical method used to analyze data by repeatedly drawing subsets from a larger dataset and estimating population parameters. In Python, you can use the NumPy library to implement bootstrap sampling. 
Use np.random.choice() to generate bootstrap samples with replacement, then calculate the mean, standard deviation, or confidence intervals as required.<\/p>\n<\/blockquote>\n\n\n\n<h3 class=\"wp-block-heading\">Example 1: Basic Bootstrap Sampling<\/h3>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; gutter: true; title: ; notranslate\" title=\"\">\nimport numpy as np\n\n# Original sample of ages\nages = &#x5B;25, 30, 35, 40, 45, 50, 55, 60, 65, 70]\n\n# Number of bootstrap samples to draw\nnum_samples = 1000\n\n# Array that will hold the mean of each bootstrap sample\nbootstrap_means = np.zeros(num_samples)\n\n# Perform bootstrap sampling\nfor i in range(num_samples):\n    # Resample with replacement, keeping the original sample size\n    bootstrap_sample = np.random.choice(ages, size=len(ages), replace=True)\n    bootstrap_mean = np.mean(bootstrap_sample)\n    bootstrap_means&#x5B;i] = bootstrap_mean\n\n# Mean of the bootstrap means and their standard deviation (the standard error)\nestimated_mean = np.mean(bootstrap_means)\nestimated_std = np.std(bootstrap_means, ddof=1)\n\nprint(&quot;Estimated population mean age:&quot;, estimated_mean)\nprint(&quot;Standard error of the estimate:&quot;, estimated_std)\n<\/pre><\/div>\n\n\n<p>We import the <code>numpy<\/code> library under its alias <code>np<\/code>. We define a sample of ages in the variable <code>ages<\/code>. We set <code>num_samples = 1000<\/code> as the number of bootstrap samples to generate. Then we initialize an array, <code>bootstrap_means<\/code>, to store the bootstrap means. <code>np.random.choice(ages, size=len(ages), replace=True)<\/code> resamples with replacement from the original sample to generate a bootstrap sample. 
<\/p>\n\n\n\n<p>To calculate the mean age of the bootstrap sample we call <code>np.mean(bootstrap_sample)<\/code>, and the result is stored with <code>bootstrap_means[i] = bootstrap_mean<\/code>. Finally, to calculate the estimated population mean and its standard error we call <code>np.mean(bootstrap_means)<\/code> and <code>np.std(bootstrap_means, ddof=1)<\/code>.<\/p>\n\n\n\n<p>The output will display the estimated population mean age and the standard error of the estimate. The exact numbers vary slightly from run to run because the resampling is random.<\/p>\n\n\n\n<p><strong>Output:<\/strong><\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"622\" height=\"50\" src=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2023\/04\/sampling1.png\" alt=\"Sampling1\" class=\"wp-image-48737\" srcset=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2023\/04\/sampling1.png 622w, https:\/\/www.askpython.com\/wp-content\/uploads\/2023\/04\/sampling1-300x24.png 300w\" sizes=\"auto, (max-width: 622px) 100vw, 622px\" \/><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">Example 2: Bootstrap Sampling for Confidence Intervals<\/h3>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; gutter: true; title: ; notranslate\" title=\"\">\n# Bootstrap sampling for confidence intervals\nimport numpy as np\n\n# Original sample\ndata = &#x5B;10, 20, 30, 40, 50, 60, 70, 80, 90, 100]\nnum_samples = 1000\n\n# Array that will hold the mean of each bootstrap sample\nbootstrap_means = np.zeros(num_samples)\n\n# Perform bootstrap sampling\nfor i in range(num_samples):\n    # Resample with replacement, keeping the original sample size\n    bootstrap_sample = np.random.choice(data, size=len(data), replace=True)\n    bootstrap_mean = np.mean(bootstrap_sample)\n    bootstrap_means&#x5B;i] = bootstrap_mean\n\n# The 2.5th and 97.5th percentiles bound the central 95% of the bootstrap means\nconfidence_interval = np.percentile(bootstrap_means, &#x5B;2.5, 97.5])\n\nprint(&quot;95% Confidence interval:&quot;, confidence_interval)\n<\/pre><\/div>\n\n\n<p>The above code performs bootstrap sampling to estimate a 95% confidence interval for the population mean from the original sample. 
We define an original sample, <code>data<\/code>, and set the number of bootstrap samples to generate in <code>num_samples<\/code>. <code>bootstrap_means<\/code> is an array initialized to store the mean of each bootstrap sample. Inside the loop, <code>bootstrap_sample<\/code> holds a sample drawn with replacement from the original sample, and <code>bootstrap_mean<\/code> holds the mean of that bootstrap sample. Finally, we calculate a 95% confidence interval by taking the 2.5th and 97.5th percentiles of the means of the bootstrap samples.<\/p>\n\n\n\n<p><strong>Output:<\/strong><\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2023\/04\/sampling2.png\" alt=\"Sampling2\" class=\"wp-image-48940\" width=\"498\" height=\"24\" srcset=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2023\/04\/sampling2.png 380w, https:\/\/www.askpython.com\/wp-content\/uploads\/2023\/04\/sampling2-300x14.png 300w\" sizes=\"auto, (max-width: 498px) 100vw, 498px\" \/><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">Example 3: Two-Sample Bootstrap Hypothesis Test<\/h3>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; gutter: true; title: ; notranslate\" title=\"\">\nimport numpy as np\n\ngroup1 = &#x5B;10, 12, 15, 18, 20]\ngroup2 = &#x5B;8, 11, 13, 16, 19]\n\nnum_samples = 1000\n\nbootstrap_diffs = np.zeros(num_samples)\n\n# Observed difference in means between the two groups\nobserved_diff = np.mean(group1) - np.mean(group2)\n\n# Shift both groups to the pooled mean so that the resampling reflects the\n# null hypothesis of equal means. Resampling the raw groups would center the\n# differences on the observed value and give a p-value near 0.5 regardless.\npooled_mean = np.mean(np.concatenate(&#x5B;group1, group2]))\nnull_group1 = np.array(group1) - np.mean(group1) + pooled_mean\nnull_group2 = np.array(group2) - np.mean(group2) + pooled_mean\n\n# Perform bootstrap sampling\nfor i in range(num_samples):\n    bootstrap_group1 = np.random.choice(null_group1, size=len(null_group1), replace=True)\n    bootstrap_group2 = np.random.choice(null_group2, size=len(null_group2), replace=True)\n\n    bootstrap_diff = np.mean(bootstrap_group1) - np.mean(bootstrap_group2)\n    bootstrap_diffs&#x5B;i] = bootstrap_diff\n\n# One-sided p-value: share of resampled differences at least as large as the observed one\np_value = np.mean(bootstrap_diffs &gt;= observed_diff)\n\nprint(&quot;Bootstrap p-value:&quot;, 
p_value)\n\n<\/pre><\/div>\n\n\n<p>In the third example, we perform a two-sample bootstrap hypothesis test to determine whether there is a significant difference between the means of two independent groups. We define two groups, <code>group1<\/code> and <code>group2<\/code>. We set the number of bootstrap samples to generate to 1000 in the variable <code>num_samples<\/code>. We call <code>np.zeros(num_samples)<\/code> to initialize an array that stores the difference in means of each bootstrap sample. <code>bootstrap_group1<\/code> and <code>bootstrap_group2<\/code> hold the samples drawn with replacement for the two groups. <code>p_value<\/code> is the proportion of bootstrap samples whose difference in means is greater than or equal to the difference in means of the original samples.<\/p>\n\n\n\n<p><strong>Output:<\/strong><\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2023\/04\/sampling3.png\" alt=\"Sampling3\" class=\"wp-image-48962\" width=\"548\" height=\"41\"\/><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">Summary<\/h2>\n\n\n\n<p>Bootstrap sampling is a powerful technique for statistical analysis in Python. It allows you to estimate population parameters from a smaller dataset, which saves the effort of collecting data on the entire population. 
What are some other applications of bootstrap sampling in your field?<\/p>\n\n\n\n<p><strong>You can browse more interesting articles:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Also read, <a href=\"https:\/\/www.askpython.com\/python\/examples\/supervised-machine-learning\">https:\/\/www.askpython.com\/python\/examples\/supervised-machine-learning<\/a><\/li>\n\n\n\n<li>Also read, <a href=\"https:\/\/www.askpython.com\/python\/examples\/get-week-numbers\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/www.askpython.com\/python\/examples\/get-week-numbers<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/towardsdatascience.com\/what-is-bootstrap-sampling-in-machine-learning-and-why-is-it-important-a5bb90cbd89a\" target=\"_blank\" rel=\"noopener\">What is Bootstrap Sampling in Machine Learning and Why is it Important?<\/a><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>In statistics, bootstrap sampling is a method that repeatedly draws samples with replacement from a data source in order to estimate a population parameter. Sampling is the process of selecting a subset of a large collection of data in order to estimate a characteristic of the entire data set. 
Sampling with [&hellip;]<\/p>\n","protected":false},"author":56,"featured_media":48987,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[9],"tags":[],"class_list":["post-48581","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-examples"],"blocksy_meta":[],"_links":{"self":[{"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/posts\/48581","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/users\/56"}],"replies":[{"embeddable":true,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/comments?post=48581"}],"version-history":[{"count":0,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/posts\/48581\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/media\/48987"}],"wp:attachment":[{"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/media?parent=48581"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/categories?post=48581"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/tags?post=48581"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}