{"id":41310,"date":"2023-02-27T18:43:54","date_gmt":"2023-02-27T18:43:54","guid":{"rendered":"https:\/\/www.askpython.com\/?p=41310"},"modified":"2023-02-27T18:43:56","modified_gmt":"2023-02-27T18:43:56","slug":"adam-optimizer","status":"publish","type":"post","link":"https:\/\/www.askpython.com\/python\/examples\/adam-optimizer","title":{"rendered":"Adam optimizer: A Quick Introduction"},"content":{"rendered":"\n<p>Optimization is one of the critical processes in deep learning that helps in tuning the parameters of a model to minimize the loss function. Adam optimizer is one of the widely used optimization algorithms in deep learning that combines the benefits of <a href=\"https:\/\/optimization.cbe.cornell.edu\/index.php?title=AdaGrad\" data-type=\"URL\" data-id=\"https:\/\/optimization.cbe.cornell.edu\/index.php?title=AdaGrad\" target=\"_blank\" rel=\"noreferrer noopener\">Adagrad<\/a> and <a href=\"https:\/\/towardsdatascience.com\/understanding-rmsprop-faster-neural-network-learning-62e116fcf29a\" data-type=\"URL\" data-id=\"https:\/\/towardsdatascience.com\/understanding-rmsprop-faster-neural-network-learning-62e116fcf29a\" target=\"_blank\" rel=\"noreferrer noopener\">RMSprop<\/a> optimizers. <\/p>\n\n\n\n<p>In this article, we will discuss the Adam optimizer, its features, and an easy-to-understand example of its implementation in Python using the <a href=\"https:\/\/www.askpython.com\/python\/examples\/keras-deep-learning\" data-type=\"post\" data-id=\"36142\">Keras library.<\/a><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">What Is the Adam Optimizer and How Does It Work?<\/h2>\n\n\n\n<p>Adam stands for Adaptive Moment Estimation. It is an optimization algorithm that was introduced by <a href=\"https:\/\/arxiv.org\/pdf\/1412.6980.pdf\" target=\"_blank\" data-type=\"URL\" data-id=\"https:\/\/arxiv.org\/pdf\/1412.6980.pdf\" rel=\"noreferrer noopener\">Kingma and Ba<\/a> in their 2014 paper. 
The algorithm computes adaptive learning rates for each parameter and stores the first and second moments of the gradients.<\/p>\n\n\n\n<p>The Adam optimizer is an extension of the stochastic gradient descent (SGD) algorithm that updates the learning rate adaptively. It updates the parameters of the model using the first and second moments of the gradients: the first moment is the mean of the gradients, and the second moment is the uncentered variance of the gradients. <\/p>\n\n\n\n<p>Using these moment estimates, the algorithm adapts the learning rate for each parameter individually, which allows more precise parameter updates.<\/p>\n\n\n\n<p>The working of the Adam optimizer can be summarized in the following steps:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Initialize the learning rate and the model weights.<\/li>\n\n\n\n<li>Compute the gradients of the model with respect to the loss function using <a href=\"https:\/\/www.askpython.com\/python\/examples\/backpropagation-in-python\" target=\"_blank\" data-type=\"URL\" data-id=\"https:\/\/www.askpython.com\/python\/examples\/backpropagation-in-python\" rel=\"noreferrer noopener\">backpropagation<\/a>.<\/li>\n\n\n\n<li>Compute the moving average of the gradient and the squared gradient.<\/li>\n\n\n\n<li>Compute the bias-corrected moving averages.<\/li>\n\n\n\n<li>Update the model weights using the bias-corrected moving averages.<\/li>\n<\/ol>\n\n\n\n<p>The Adam optimizer updates the learning rate adaptively, depending on the moving averages of the gradient and the squared gradient. The moving averages are computed for each parameter, and the learning rate is updated accordingly. 
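The steps above can be sketched in a few lines of plain Python (a minimal illustration, not library code; the function <code>adam_step<\/code> and its argument names are hypothetical, and the defaults follow the paper&#8217;s recommended hyperparameters):<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nimport math\n\ndef adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):\n    # Moving averages of the gradient and the squared gradient\n    m = beta1 * m + (1 - beta1) * grad\n    v = beta2 * v + (1 - beta2) * grad**2\n    # Bias correction for the early steps (t starts at 1)\n    m_hat = m \/ (1 - beta1**t)\n    v_hat = v \/ (1 - beta2**t)\n    # Per-parameter update scaled by the second-moment estimate\n    param = param - lr * m_hat \/ (math.sqrt(v_hat) + eps)\n    return param, m, v\n<\/pre><\/div>\n\n\n<p>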
This helps in providing a different learning rate for each parameter, which is useful in case some parameters are more sensitive than others.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Example of Adam Optimizer<\/h2>\n\n\n\n<p>Let&#8217;s understand this with a simple example:<\/p>\n\n\n\n<p>Minimize the value of the function <strong>x^3 &#8211; 2x^2 + 2<\/strong>. The manual computation looks something like this:<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"410\" height=\"446\" src=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2023\/02\/Calculating-minimum-value-manually.png\" alt=\"Calculating Minimum Value Manually\" class=\"wp-image-45038\" srcset=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2023\/02\/Calculating-minimum-value-manually.png 410w, https:\/\/www.askpython.com\/wp-content\/uploads\/2023\/02\/Calculating-minimum-value-manually-276x300.png 276w\" sizes=\"auto, (max-width: 410px) 100vw, 410px\" \/><figcaption class=\"wp-element-caption\">Calculating the Minimum Manually<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">Adam Optimizer Implementation in Python<\/h2>\n\n\n\n<p>Now let&#8217;s see how to use the Adam optimizer to compute the same minimum:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nimport tensorflow as tf\nimport numpy as np\nimport matplotlib.pyplot as plt\n\n# Define the function\ndef f(x):\n    return x**3 - 2*x**2 + 2\n\n# Define the Adam optimizer\noptimizer = tf.keras.optimizers.Adam(learning_rate=0.05, epsilon=1e-07)\n# Define the starting point for optimization\nx = tf.Variable(0.001)\n\n# Define a list to store the history of x values\nx_values = &#x5B;]\n\n# Define the number of optimization steps\nnum_steps = 100\n\n# Perform the optimization\nfor i in range(num_steps):\n    with tf.GradientTape() as tape:\n        # Calculate the value of the function 
and record the gradient\n        y = f(x)\n    # Compute the gradient outside the tape context\n    gradient = tape.gradient(y, x)\n    # Use the Adam optimizer to update the value of x\n    optimizer.apply_gradients(&#x5B;(gradient, x)])\n    # Record the current value of x\n    x_values.append(x.numpy())\n\n# Print the optimized value of x and the value of the function at that point\nprint(&quot;Optimized value of x:&quot;, x.numpy())\nprint(&quot;Value of the function at the optimized point:&quot;, f(x.numpy()))\n# Plot the function and the optimization path\nx_plot = np.linspace(-2, 2, 500)\ny_plot = f(x_plot)\nplt.plot(x_plot, y_plot, label=&#039;Function&#039;)\nplt.plot(x_values, &#x5B;f(x) for x in x_values], label=&#039;Optimization path&#039;, marker=&#039;o&#039;)\nplt.xlabel(&#039;x&#039;)\nplt.ylabel(&#039;y&#039;)\nplt.title(&#039;Function and Optimization Path&#039;)\nplt.legend()\nplt.show()\n<\/pre><\/div>\n\n\n<p><strong>The output looks like this:<\/strong><\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"631\" height=\"410\" src=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2023\/02\/adam-optimizer-example-1.png\" alt=\"Adam Optimizer Example 1\" class=\"wp-image-45034\" srcset=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2023\/02\/adam-optimizer-example-1.png 631w, https:\/\/www.askpython.com\/wp-content\/uploads\/2023\/02\/adam-optimizer-example-1-300x195.png 300w\" sizes=\"auto, (max-width: 631px) 100vw, 631px\" \/><figcaption class=\"wp-element-caption\">Adam Optimizer Example 1<\/figcaption><\/figure>\n\n\n\n<p>Now that we&#8217;ve understood the working of Adam with an example, let&#8217;s also look at how it differs from other optimizers.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Advantages of Adam over other optimizers<\/h2>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"563\" height=\"536\" 
src=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2023\/02\/Comparsion-of-Adam-over-other-activations.png\" alt=\"Comparison of Adam to Other Optimization Algorithms\" class=\"wp-image-45149\" srcset=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2023\/02\/Comparsion-of-Adam-over-other-activations.png 563w, https:\/\/www.askpython.com\/wp-content\/uploads\/2023\/02\/Comparsion-of-Adam-over-other-activations-300x286.png 300w\" sizes=\"auto, (max-width: 563px) 100vw, 563px\" \/><figcaption class=\"wp-element-caption\">Comparison of Adam to Other Optimization Algorithms<br>Taken from the <a href=\"https:\/\/arxiv.org\/abs\/1412.6980\" target=\"_blank\" data-type=\"URL\" data-id=\"https:\/\/arxiv.org\/abs\/1412.6980\" rel=\"noreferrer noopener\">Adam paper<\/a><\/figcaption><\/figure>\n\n\n\n<p>Let&#8217;s see how Adam differs from other optimizers:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Adam optimizer computes<\/strong> adaptive learning rates for each parameter, which aids in quicker convergence and better generalization. The learning rate adjusts during training based on historical gradient information. This differs from stochastic gradient descent, which uses a fixed learning rate.<\/li>\n\n\n\n<li><strong>Adam stores the first and second moments of the gradients<\/strong>, reducing gradient noise and enhancing the stability of the optimization. This is distinct from stochastic gradient descent, which stores no historical gradient information.<\/li>\n\n\n\n<li><strong>Adam optimizer is robust to noisy gradients<\/strong>, handles non-stationary objectives, and escapes saddle points more easily. In contrast, stochastic gradient descent may get stuck in local minima or stall in flat regions.<\/li>\n\n\n\n<li><strong>Adam optimizer is memory-efficient<\/strong>, storing only the first- and second-moment estimates for each parameter rather than a full history of past gradients, so its memory cost grows linearly with the model size.<\/li>\n\n\n\n<li><strong>Adam optimizer tends to converge faster<\/strong> than other optimizers in many cases, owing to adaptive learning rates and moment estimation, enabling it to move quickly towards the minimum.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">Conclusion<\/h2>\n\n\n\n<p>In this article, we provided an overview of the Adam optimizer, a frequently used optimization algorithm in the training of deep learning models, and its advantages, including adaptive learning rates, memory efficiency, and resilience to noisy gradients. To illustrate its functionality, we minimized a cubic function and plotted the optimization path.<\/p>\n\n\n\n<p>Also read:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><a href=\"https:\/\/www.askpython.com\/python\/examples\/activation-functions-python\" target=\"_blank\" data-type=\"URL\" data-id=\"https:\/\/www.askpython.com\/python\/examples\/activation-functions-python\" rel=\"noreferrer noopener\">Activation functions<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.askpython.com\/python\/examples\/backpropagation-in-python\" target=\"_blank\" data-type=\"URL\" data-id=\"https:\/\/www.askpython.com\/python\/examples\/backpropagation-in-python\" rel=\"noreferrer noopener\">Backpropagation<\/a><\/li>\n<\/ol>\n","protected":false},"excerpt":{"rendered":"<p>Optimization is one of the critical processes in deep learning that helps in tuning the parameters of a model to minimize the loss function. Adam optimizer is one of the widely used optimization algorithms in deep learning that combines the benefits of Adagrad and RMSprop optimizers. 
In this article, we will discuss the Adam optimizer, [&hellip;]<\/p>\n","protected":false},"author":58,"featured_media":45154,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[9],"tags":[],"class_list":["post-41310","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-examples"],"blocksy_meta":[],"_links":{"self":[{"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/posts\/41310","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/users\/58"}],"replies":[{"embeddable":true,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/comments?post=41310"}],"version-history":[{"count":0,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/posts\/41310\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/media\/45154"}],"wp:attachment":[{"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/media?parent=41310"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/categories?post=41310"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/tags?post=41310"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}