<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[SerpApi - Medium]]></title>
        <description><![CDATA[Fast, complete, and easy API to scrape and extract search results - Medium]]></description>
        <link>https://medium.com/serpapi?source=rss----76bc81ac44eb---4</link>
        <image>
            <url>https://cdn-images-1.medium.com/proxy/1*TGH72Nnw24QL3iV9IOm4VA.png</url>
            <title>SerpApi - Medium</title>
            <link>https://medium.com/serpapi?source=rss----76bc81ac44eb---4</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Sat, 16 May 2026 16:23:18 GMT</lastBuildDate>
        <atom:link href="https://medium.com/feed/serpapi" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Playwright’s getByRole locator is 1.5x slower than CSS selectors]]></title>
            <link>https://medium.com/serpapi/playwrights-getbyrole-locator-is-1-5x-slower-than-css-selectors-9262e766c5f8?source=rss----76bc81ac44eb---4</link>
            <guid isPermaLink="false">https://medium.com/p/9262e766c5f8</guid>
            <category><![CDATA[software-development]]></category>
            <category><![CDATA[automated-testing]]></category>
            <category><![CDATA[web-scraping]]></category>
            <dc:creator><![CDATA[Illia Zub]]></dc:creator>
            <pubDate>Tue, 14 Nov 2023 19:02:07 GMT</pubDate>
            <atom:updated>2023-11-14T19:03:58.236Z</atom:updated>
<content:encoded><![CDATA[<p>Playwright’s getByRole uses querySelectorAll(&#39;*&#39;) and matches elements by their accessible names.</p><h3>Introduction</h3><p>In the realm of web automation and testing with Playwright, understanding the performance of various locator strategies is key. This article delves into the getByRole locator&#39;s efficiency compared to CSS selectors, offering insights into the technical workings and practical implications of these choices.</p><p>I wanted to use Page#getByRole&#39;s underlying CSS selectors in the <a href="https://serpapi.com/">SerpApi</a> code base. But the getByRole locator was 1.5 times slower than the <a href="https://serpapi.com/blog/web-scraping-with-css-selectors-using-python/#selectors_types">standard CSS selectors</a>, prompting an investigation into the root cause. This performance discrepancy, likely stemming from Playwright’s use of <a href="https://github.com/microsoft/playwright/blob/main/packages/playwright-core/src/server/injected/roleSelectorEngine.ts#L163-L173">querySelectorAll(&#39;*&#39;)</a> and <a href="https://github.com/microsoft/playwright/blob/main/packages/playwright-core/src/server/injected/roleUtils.ts#L402-L711">matching elements by the accessible name</a>, raises essential considerations for developers prioritizing speed in their automation scripts.</p><h3>Deep Dive: How getByRole Works</h3><p>The getByRole function in Playwright is more than just a method to locate web elements; it&#39;s a complex mechanism with multiple layers of interaction within the Playwright architecture. Let&#39;s demystify this process with an example.
Consider this code:</p><pre>await page.getByRole(&#39;button&#39;, { name: &#39;Enter address manually&#39; }).click();</pre><p>This command sets off a cascade of actions within Playwright:</p><ul><li>Page#getByRole <a href="https://github.com/microsoft/playwright/blob/main/packages/playwright-core/src/client/locator.ts#L175-L177">creates a </a><a href="https://github.com/microsoft/playwright/blob/main/packages/playwright-core/src/client/locator.ts#L175-L177">Locator</a></li><li>Locator#click delegates the call to <a href="https://github.com/microsoft/playwright/blob/main/packages/playwright-core/src/client/locator.ts#L95-L97">Frame#click, passing the </a><a href="https://github.com/microsoft/playwright/blob/main/packages/playwright-core/src/client/locator.ts#L95-L97">Locator#_selector</a></li><li>Frame#click delegates to <a href="https://github.com/microsoft/playwright/blob/main/packages/playwright-core/src/client/frame.ts#L284-L286">Channel#click</a>. Frame inherits _channel from ChannelOwner. <a href="https://github.com/microsoft/playwright/blob/main/packages/playwright-core/src/client/channelOwner.ts#L138-L160">ChannelOwner#_channel</a> is a JS Proxy object based on the EventEmitter</li><li>The client Frame dispatches an event to the server <a href="https://github.com/microsoft/playwright/blob/main/packages/playwright-core/src/server/frames.ts#L1144-L1149">Frame#click</a></li><li>FrameSelector#resolveInjectedForSelector injects the <a href="https://github.com/microsoft/playwright/blob/2afd857642c26980e56f269d05df72d4d69f57e7/packages/playwright-core/src/server/dom.ts#L83-L109">FrameExecutionContext#injectedScript</a> script into the page controlled by Playwright.
The InjectedScript#constructor adds the <a href="https://github.com/microsoft/playwright/blob/2afd857642c26980e56f269d05df72d4d69f57e7/packages/playwright-core/src/server/injected/injectedScript.ts#L125">engine for the </a><a href="https://github.com/microsoft/playwright/blob/2afd857642c26980e56f269d05df72d4d69f57e7/packages/playwright-core/src/server/injected/injectedScript.ts#L125">getByRole locator</a>.</li><li>createRoleEngine calls <a href="https://github.com/microsoft/playwright/blob/main/packages/playwright-core/src/server/injected/roleSelectorEngine.ts#L179-L195">parseAttributeSelector and </a><a href="https://github.com/microsoft/playwright/blob/main/packages/playwright-core/src/server/injected/roleSelectorEngine.ts#L179-L195">queryRole</a></li><li>queryRole calls <a href="https://github.com/microsoft/playwright/blob/main/packages/playwright-core/src/server/injected/roleSelectorEngine.ts#L129-L161">querySelectorAll(&#39;*&#39;) and matches the element</a></li></ul><p>Compared to getByRole, a locator with a regular CSS selector simply traverses the DOM and is 1.5 times faster.</p><h3>Performance</h3><p>Comparative tests reveal that a regular locator using CSS selectors outperforms getByRole by 1.5 times.
Interestingly, the $.then method trailed, being 2x slower in our tests.</p><pre>console.time(&quot;getByRole&quot;);<br>for (let i = 0; i &lt; 100; i++) {<br>  const textContent = await page.getByRole(&#39;button&#39;, { name: &#39;Enter address manually&#39; }).textContent()<br>}<br>console.timeEnd(&quot;getByRole&quot;);<br><br>console.time(&quot;locator&quot;);<br>for (let i = 0; i &lt; 100; i++) {<br>  const textContent = await page.locator(&#39;.ektjNL&#39;).textContent()<br>}<br>console.timeEnd(&quot;locator&quot;);<br><br>console.time(&quot;$.then&quot;);<br>for (let i = 0; i &lt; 100; i++) {<br>  const textContent = await page.$(&#39;.ektjNL&#39;).then(e =&gt; e.textContent())<br>}<br>console.timeEnd(&quot;$.then&quot;);<br><br>// Output:<br>// getByRole: 677.5ms<br>// locator: 497.306ms<br>// $.then: 1.135s</pre><h3>Conclusion</h3><p>In web automation, understanding Playwright&#39;s locators — getByRole and CSS selectors — is key. getByRole excels in clarity, while CSS selectors win in speed. This matters in large test suites where every second counts. Choose wisely: getByRole for readability, CSS selectors for efficiency.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=9262e766c5f8" width="1" height="1" alt=""><hr><p><a href="https://medium.com/serpapi/playwrights-getbyrole-locator-is-1-5x-slower-than-css-selectors-9262e766c5f8">Playwright’s getByRole locator is 1.5x slower than CSS selectors</a> was originally published in <a href="https://medium.com/serpapi">SerpApi</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
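The cost difference described in the article above can be sketched in plain JavaScript. This is a deliberately simplified, hypothetical model, not Playwright's actual implementation: it only illustrates why a role-based lookup does more work per element (scanning everything and computing a role plus an accessible name) than a direct selector match. The element list and class names here are toy stand-ins.

```javascript
// Toy stand-in for document.querySelectorAll('*').
const elements = [
  { tag: 'div', className: 'nav', text: '' },
  { tag: 'button', className: 'ektjNL', text: 'Enter address manually' },
  { tag: 'a', className: 'link', text: 'Home' },
];

// getByRole-style matching: visit every element, derive an implicit role
// and an accessible name for each, then filter on both.
function getByRole(role, name) {
  const implicitRoles = { button: 'button', a: 'link' };
  return elements.filter(
    (el) => implicitRoles[el.tag] === role && el.text === name
  );
}

// CSS-style matching: a direct attribute check, no name computation.
function byClass(className) {
  return elements.filter((el) => el.className === className);
}

console.log(getByRole('button', 'Enter address manually').length); // 1
console.log(byClass('ektjNL').length); // 1
```

Real accessible-name computation is far more involved (aria-label, labels, text alternatives), which is where the extra 1.5x cost measured in the article plausibly comes from.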
        <item>
            <title><![CDATA[When counting lines in Ruby randomly failed SerpApi deployments]]></title>
            <link>https://medium.com/serpapi/when-counting-lines-in-ruby-randomly-failed-serpapi-deployments-67b18fc75f0e?source=rss----76bc81ac44eb---4</link>
            <guid isPermaLink="false">https://medium.com/p/67b18fc75f0e</guid>
            <category><![CDATA[serpapi]]></category>
            <category><![CDATA[ruby]]></category>
            <category><![CDATA[software-engineering]]></category>
            <category><![CDATA[performance]]></category>
            <dc:creator><![CDATA[Illia Zub]]></dc:creator>
            <pubDate>Thu, 07 Sep 2023 10:23:25 GMT</pubDate>
            <atom:updated>2023-09-21T15:04:12.409Z</atom:updated>
<content:encoded><![CDATA[<p>Recently, we observed occasional deployment failures on only two of our servers. The failing servers were even closing my regular SSH connections. In this story, you’ll learn how we reduced memory usage and made one piece of SerpApi code 1.5x faster.</p><h3>TL;DR</h3><p>str.count($/) was 1.5x faster than str.lines.count and didn&#39;t allocate additional memory.</p><h3>Investigation</h3><p>Only two servers faced the random deployment failures.</p><pre>#&lt;Thread:0x000055a170560e70 digital_ocean.rb:80 run&gt; terminated with exception (report_on_exception is true):<br>SerpApi/vendor/bundle/ruby/2.7.0/gems/net-ssh-5.2.0/lib/net/ssh/transport/server_version.rb:54:in `readpartial&#39;: Connection reset by peer (Errno::ECONNRESET)</pre><p>These servers also randomly closed my SSH connections.</p><pre>$ ssh server-2<br>Last login: Fri Feb 24 14:23:29 2023 from {remote_ip}<br>client_loop: send disconnect: Broken pipe</pre><p>DigitalOcean’s server graphs showed that memory usage was near 95% on these two servers. Load average occasionally peaked at 12, compared to 2 on other servers.</p><p>We checked the Puma server flamegraph.
Most of the wall time in production was spent in SearchSplitter#do_one_request and the Puma thread pool.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*HOfKFid7TIEAispd.png" /></figure><p>We used <a href="https://github.com/rbspy/rbspy?ref=serpapi.com">rbspy</a> to generate the flamegraph:</p><pre>$ rbspy record -p $PID_OF_PUMA_PROCESS</pre><p>The flamegraph didn’t reveal anything actionable, so we moved on to memory profiling.</p><h3>Memory profiling</h3><p>Here’s the script we used:</p><pre>require &#39;memory_profiler&#39;<br><br>user = User.find_by(email: &quot;me&quot;)<br><br>report = MemoryProfiler.report do<br>  threads = []<br>  (1..5).map do |i|<br>    (1..5).map do |j|<br>      threads &lt;&lt; Thread.new {<br>        search = Search.new(engine: &quot;google&quot;, q: &quot;#{i} * #{j}&quot;, user: user)<br>        SearchProcessor.process(search)<br>      }<br>    end<br>  end<br>  threads.each(&amp;:join)<br>end<br><br>report.pretty_print</pre><p>It turned out that the top allocator was line counting in the response validator.</p><pre>elsif response[:html].lines.count &lt; 50 &amp;&amp; response[:html].match?(/&lt;div id=\&quot;main\&quot;&gt;&lt;div id=\&quot;cnt\&quot;&gt;&lt;div class=\&quot;dodTBe\&quot; id=\&quot;sfcnt\&quot;&gt;&lt;\/div&gt;&lt;H1&gt;.+&lt;\/H1&gt;.+&lt;p&gt;.+&lt;\/p&gt;$/)</pre><p>Why were String#lines and Array#count the top allocators in the entire app?</p><h3><a href="https://ruby-doc.org/core-2.6/String.html?ref=serpapi.com#method-i-lines">String#lines</a></h3><p>The HTML file size varies from 180 KB (regular organic results) to 1.3 MB (Google Shopping with num=100).
<a href="https://ruby-doc.org/core-2.6/String.html?ref=serpapi.com#method-i-lines">String#lines</a> allocated an array multiple times per search because we send <a href="https://serpapi.com/ludicrous-speed?ref=serpapi.com">multiple concurrent requests per search</a>.</p><p>Thanks to <a href="https://gist.github.com/guilhermesimoes/d69e547884e556c3dc95?permalink_comment_id=4502636&amp;ref=serpapi.com">@guilhermesimoes’s gist</a>, we found that str.each_line.count should be faster. But it was still not optimal, and we found a way to improve on it.</p><h3>Solution</h3><p>The solution was super simple — str.count($/). Here&#39;s the diff:</p><pre>- elsif response[:html].lines.count &lt; 50 &amp;&amp; response[:html].match?(/&lt;div id=\&quot;main\&quot;&gt;&lt;div id=\&quot;cnt\&quot;&gt;&lt;div class=\&quot;dodTBe\&quot; id=\&quot;sfcnt\&quot;&gt;&lt;\/div&gt;&lt;H1&gt;.+&lt;\/H1&gt;.+&lt;p&gt;.+&lt;\/p&gt;$/)<br>+ elsif response[:html].count($/) &lt; 50 &amp;&amp; response[:html].match?(/&lt;div id=\&quot;main\&quot;&gt;&lt;div id=\&quot;cnt\&quot;&gt;&lt;div class=\&quot;dodTBe\&quot; id=\&quot;sfcnt\&quot;&gt;&lt;\/div&gt;&lt;H1&gt;.+&lt;\/H1&gt;.+&lt;p&gt;.+&lt;\/p&gt;$/)</pre><p>To make sure the problem was solved, we benchmarked multiple ways of counting string lines in Ruby.
We reused and adapted the gist above to exclude File#read from the benchmark and added <a href="https://ruby-doc.org/core-2.6/String.html?ref=serpapi.com#method-i-count">String#count</a>.</p><h3>Benchmark</h3><p><a href="https://ruby-doc.org/core-2.6/String.html?ref=serpapi.com#method-i-count">String#count</a> was 1.5 times faster than the other options:</p><pre>Warming up --------------------------------------<br>                size    31.000  i/100ms<br>              length    75.000  i/100ms<br>               count    77.000  i/100ms<br>   each_line + count    81.000  i/100ms<br>           count($/)   196.000  i/100ms<br>Calculating -------------------------------------<br>                size      1.529k (±33.9%) i/s -      4.774k in   5.015361s<br>              length      1.434k (±38.8%) i/s -      5.025k in   5.139834s<br>               count      1.335k (±40.7%) i/s -      4.697k in   5.079353s<br>   each_line + count      1.411k (±39.5%) i/s -      5.022k in   5.110146s<br>           count($/)      2.231k (± 2.6%) i/s -     11.172k in   5.012323s<br><br>Comparison:<br>           count($/):     2230.5 i/s<br>                size:     1529.0 i/s - 1.46x  (± 0.00) slower<br>              length:     1434.2 i/s - 1.56x  (± 0.00) slower<br>   each_line + count:     1411.0 i/s - 1.58x  (± 0.00) slower<br>               count:     1334.9 i/s - 1.67x  (± 0.00) slower</pre><p>Here’s the script:</p><pre>require &quot;benchmark/ips&quot;<br><br>html = File.read(Rails.root.join(&quot;spec/data/google/superhero-movies-mobile-63f582a0defa1345501c6b50-2023-02-23.html&quot;))<br><br>Benchmark.ips do |x|<br>  x.report(&quot;size&quot;) { html.lines.size &lt; 50 }<br>  x.report(&quot;length&quot;) { html.lines.length &lt; 50 }<br>  x.report(&quot;count&quot;) { html.lines.count &lt; 50 }<br>  x.report(&quot;each_line + count&quot;) { html.each_line.count &lt; 50 }<br>  x.report(&quot;count($/)&quot;) { html.count($/) &lt; 50 }
x.compare!<br>end</pre><h3>Memory usage</h3><p>count($/) doesn&#39;t allocate a new array compared to lines/each_line/etc.</p><p>We used the awesome <a href="https://github.com/Shopify/heap-profiler?ref=serpapi.com">heap-profiler</a> and <a href="https://github.com/zombocom/heapy?ref=serpapi.com">heapy</a> Ruby gems to profile memory usage.</p><h4>lines/readlines/each_line/etc.</h4><p>html.lines.count allocated a new array and referenced the original string on each iteration of the benchmark.</p><pre>$ bundle exec heapy read ./tmp/lines_count/allocated.heap 49 --lines=1<br><br>Analyzing Heap (Generation: 49)<br>-------------------------------<br>allocated by memory (204879705) (in bytes)<br>==============================<br>  204872652  tmp/html_length_vs_count_vs_size_bench.rb:6<br>object count (5406)<br>==============================<br>  5301  tmp/html_length_vs_count_vs_size_bench.rb:6<br>High Ref Counts<br>==============================<br>  5300  tmp/html_length_vs_count_vs_size_bench.rb:6</pre><p>We also used the predefined <a href="https://ruby-doc.org/docs/ruby-doc-bundle/Manual/man-1.4/variable.html?ref=serpapi.com#slash">$/ line separator</a> to allocate even less memory.</p><h4>count($/)</h4><p>Most of these memory allocations and all of the object reference counts were gone when we used String#count($/).</p><pre>$ bundle exec heapy read ./tmp/count_nl/allocated.heap 48 --lines=1<br><br>Analyzing Heap (Generation: 48)<br>-------------------------------<br>allocated by memory (2547465) (in bytes)<br>==============================<br>  2540804  tmp/html_length_vs_count_vs_size_bench.rb:4<br>object count (105)<br>==============================<br>  27  /usr/local/lib/ruby/gems/2.7.0/gems/activesupport/lib/active_support/deprecation/proxy_wrappers.rb:172<br>High Ref Counts<br>==============================<br>  73  /usr/local/lib/ruby/gems/2.7.0/gems/activesupport/lib/active_support/deprecation/proxy_wrappers.rb:172</pre><h4>Code</h4><pre>require &quot;heap-profiler&quot;<br><br>HeapProfiler.report(Rails.root.join(&#39;tmp/lines_count&#39;)) do<br>  html = File.read(Rails.root.join(&quot;spec/data/google/superhero-movies-mobile-63f582a0defa1345501c6b50-2023-02-23.html&quot;))<br><br>  100.times { html.lines.count &lt; 50 }<br>  # 100.times { html.count($/) &lt; 50 }<br>end</pre><h4>Comparison process</h4><p>The heap diff comparison was a bit manual because the <a href="https://github.com/zombocom/heapy?ref=serpapi.com#diff-2-heap-dumps">heapy diff</a> didn&#39;t provide a <em>diff</em>. We commented and uncommented 100.times { html.lines.count &lt; 50 } and replaced paths in the command above.</p><pre># Profile heap<br>$ bundle exec rails r tmp/html_length_vs_count_vs_size_bench.rb<br><br># Read summary of heap allocations<br>$ bundle exec heapy read ./tmp/count_nl/allocated.heap<br><br># Read a specific generation (48) limiting number of lines to output (1)<br>$ bundle exec heapy read ./tmp/count_nl/allocated.heap 48 --lines=1</pre><h3>Results</h3><p>Immediately after the fix was deployed, memory usage on the affected servers decreased and stabilized. Then memory usage fluctuated again, but deployments and SSH connections stabilized.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*nDFIBPcWmMTE8di8.png" /></figure><h3>Observations and final thoughts</h3><p>The initial assumption was the Puma graceful restart. During the phased-restart, Puma spawned additional workers to switch to the new code version (which was expected). It wasn&#39;t clear why SSH connections were dropping on only two DigitalOcean droplets.</p><p>Doubling the amount of RAM would also solve the problem, but it wouldn’t be as efficient at this point.
The fix was deployed half a year ago, and the issue is definitely solved.</p><h3>Update Sep 20th, 2023</h3><p>Thanks to <a href="https://www.reddit.com/r/ruby/comments/16d4ha6/comment/k1f5mc6/?utm_source=share&amp;utm_medium=web2x&amp;context=3">@Freaky from Reddit</a> for the wonderful feedback and cooperation:</p><ul><li>@Freaky tracked down a <a href="https://bugs.ruby-lang.org/issues/19875?ref=serpapi.com">performance regression</a> in String#count in Ruby 3.1+</li><li>@Freaky brought his old <a href="https://github.com/Freaky/fast-bytecount/?ref=serpapi.com">SIMD bytecount C port</a> out of mothballs</li><li>MRI maintainer nobu is <a href="https://github.com/nobu/ruby/tree/mm_bytecount?ref=serpapi.com">evaluating it</a> for inclusion in MRI</li><li>@Freaky noted that the bytecount Rust crate <a href="https://github.com/llogiq/bytecount/issues/85?ref=serpapi.com">uses an SSE4.1 intrinsic in SSE2 code</a> and submitted a fix</li><li>A similar <a href="https://github.com/BurntSushi/aho-corasick/pull/130?ref=serpapi.com">fix for the aho-corasick Rust crate</a> was made in response</li></ul><blockquote>I love it when you get this ripple effect from something that initially seems pretty innocuous.</blockquote><p><em>If you enjoy working on such challenges, come work here with us: </em><a href="https://serpapi.com/careers?ref=serpapi.com#open-roles"><em>https://serpapi.com/careers#open-roles</em></a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=67b18fc75f0e" width="1" height="1" alt=""><hr><p><a href="https://medium.com/serpapi/when-counting-lines-in-ruby-randomly-failed-serpapi-deployments-67b18fc75f0e">When counting lines in Ruby randomly failed SerpApi deployments</a> was originally published in <a href="https://medium.com/serpapi">SerpApi</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
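The trade-off from the article above can be shown with a minimal, self-contained Ruby sketch (illustrative strings, not SerpApi's actual HTML). String#lines materializes an Array with one String per line before counting, while String#count($/) only counts newline characters ($/ defaults to "\n") and allocates nothing.

```ruby
html = "line one\nline two\nline three\n"

html.lines.count  # => 3, allocates ["line one\n", "line two\n", "line three\n"] first
html.count($/)    # => 3, counts "\n" occurrences, no intermediate Array

# Caveat: the two differ when the string lacks a trailing newline, because
# #lines counts the final partial line while #count only counts separators.
"a\nb".lines.count # => 2
"a\nb".count($/)   # => 1
```

For a threshold check like the validator's `lines.count < 50`, this off-by-one for non-newline-terminated documents is immaterial, which is why the swap was safe.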
        <item>
            <title><![CDATA[Software Engineer’s Life Balance with Dmytro Vasyliev | #SerpApiPodcast, Episode 10]]></title>
            <link>https://medium.com/serpapi/software-engineers-life-balance-with-dmytro-vasyliev-serpapipodcast-episode-10-bb28177314b6?source=rss----76bc81ac44eb---4</link>
            <guid isPermaLink="false">https://medium.com/p/bb28177314b6</guid>
            <category><![CDATA[serpapipodcast]]></category>
            <category><![CDATA[software-engineering]]></category>
            <category><![CDATA[work-life-balance]]></category>
            <dc:creator><![CDATA[Illia Zub]]></dc:creator>
            <pubDate>Sat, 24 Jun 2023 19:57:36 GMT</pubDate>
            <atom:updated>2023-06-24T19:57:36.584Z</atom:updated>
<content:encoded><![CDATA[<p>In the 10th episode of the <a href="https://youtube.com/playlist?list=PLGt0Yb3JKV12RPWTuDGXIS6N5Ex1aYJBx">SerpApi Podcast</a>, we discuss work-life balance with Dmytro Vasyliev: “How can you find a balance between your professional activity and personal life?”, “How can you maintain your health and avoid burnout?”, etc.</p><p><a href="https://youtu.be/YITDzpvFiU4">Watch this episode</a> · <a href="https://podcasters.spotify.com/pod/show/ilyazub/episodes/Software-Engineers-Life-Balance-with-Dmytro-Vasyliev--SerpApiPodcast--Episode-10-e25an6r">Listen to this episode</a></p><h3>Chapters</h3><p>[00:00:00] Introduction<br>[00:02:03] What is life balance for Dmytro?<br>[00:03:00] How does Dmytro organize his day?<br>[00:05:10] Were you always organized, or did you have to do something in order to feel better?<br>[00:06:35] About life spheres<br>[00:12:10] Burnout and how to handle it<br>[00:14:51] Work-life balance: what helps to be more focused?<br>[00:29:55] Work from home or from the office?<br>[00:33:08] Working with a coach: when, why, and what the results are<br>[00:38:35] The first time with a psychologist<br>[00:42:20] Training with a personal trainer — how does it work?<br>[00:46:26] A few exercises to keep an eye on life balance<br>[00:50:32] Organizing a to-do list<br>[00:58:08] Epilogue</p><p>The SerpApi Podcast is about SERP (search engine results pages) and e-commerce data scraping: parsing, circumvention of blocking, web automation, proxies, the legal side of scraping, performance, data extraction and validation.</p><p>Use cases for SERP data scraping: programmatic search engine optimization (SEO) and local SEO, machine learning (ML), artificial intelligence (AI), large language models (LLMs), news monitoring, open-source intelligence (OSINT), voice assistants, e-commerce competitor research.</p><p>The podcast is brought to you by the SerpApi team: <a href="https://serpapi.com">https://serpapi.com</a></p><img
src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=bb28177314b6" width="1" height="1" alt=""><hr><p><a href="https://medium.com/serpapi/software-engineers-life-balance-with-dmytro-vasyliev-serpapipodcast-episode-10-bb28177314b6">Software Engineer’s Life Balance with Dmytro Vasyliev | #SerpApiPodcast, Episode 10</a> was originally published in <a href="https://medium.com/serpapi">SerpApi</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[From Software Engineer to Manager with Oleksii Gyturo | #SerpApiPodcast, Episode 9]]></title>
            <link>https://medium.com/serpapi/from-software-engineer-to-manager-with-oleksii-gyturo-serpapipodcast-episode-9-eb1b28e4a4f0?source=rss----76bc81ac44eb---4</link>
            <guid isPermaLink="false">https://medium.com/p/eb1b28e4a4f0</guid>
            <category><![CDATA[software-engineering]]></category>
            <category><![CDATA[serpapipodcast]]></category>
            <category><![CDATA[podcast]]></category>
            <category><![CDATA[careers]]></category>
            <dc:creator><![CDATA[Illia Zub]]></dc:creator>
            <pubDate>Fri, 23 Jun 2023 16:45:53 GMT</pubDate>
            <atom:updated>2023-06-23T16:45:53.875Z</atom:updated>
<content:encoded><![CDATA[<p>In this episode, we discuss career growth in management. Oleksii, as an aspiring manager, faced a multitude of questions and challenges. Career growth depends not only on how well a manager copes with their tasks, but also on how open they are to new knowledge and experience.</p><p><a href="https://youtu.be/U7CNEdoTWRc">Watch this episode</a> · <a href="https://podcasters.spotify.com/pod/show/ilyazub/episodes/From-Software-Engineer-to-Manager--SerpApiPodcast--Episode-9-e23ogmv">Listen to this episode</a></p><h3>Show notes</h3><p>[00:00:00] Introduction<br>[00:02:11] Did you grow from another position, or were you hired as the head of mobile?<br>[00:03:29] You started as an individual contributor, but then you took on more and more. Why?<br>[00:06:55] What was the result of involving you in leadership strategy?<br>[00:10:32] Experience exchange (from respondent to interviewer)<br>[00:20:14] How do you deal with multiple sources of information about the business?<br>[00:36:04] What is Illia’s mission at SerpApi?<br>[00:56:02] Long-term planning or being spontaneous?
How did you manage that?<br>[01:06:38] Who is an engineering manager?<br>[01:10:09] Why have you decided to consult businesses?<br>[01:12:29] Do you treat consulting as a business?<br>[01:14:34] Epilogue (few words from Oleksii)</p><p>SerpApi Podcast is about SERP (search engine results pages) and e-commerce data scraping: parsing, circumvention of blocking, web automation, proxies, legal part of scraping, performance, data extraction and validation.</p><p>Use cases for SERP data scraping: programmatic search engine optimization (SEO) and local SEO, machine learning (ML), artificial intelligence (AI), large language models (LLMs), news monitoring, open-source intelligence (OSINT), voice assistants, e-commerce competitor research.</p><p>The podcast is brought to you by SerpApi team: <a href="https://serpapi.com">https://serpapi.com</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=eb1b28e4a4f0" width="1" height="1" alt=""><hr><p><a href="https://medium.com/serpapi/from-software-engineer-to-manager-with-oleksii-gyturo-serpapipodcast-episode-9-eb1b28e4a4f0">From Software Engineer to Manager with Oleksii Gyturo | #SerpApiPodcast, Episode 9</a> was originally published in <a href="https://medium.com/serpapi">SerpApi</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Insights from an Engineering Director (#SerpApiPodcast, Ep. 8)]]></title>
            <link>https://medium.com/serpapi/insights-from-an-engineering-director-serpapipodcast-ep-8-b79f86d85530?source=rss----76bc81ac44eb---4</link>
            <guid isPermaLink="false">https://medium.com/p/b79f86d85530</guid>
            <category><![CDATA[web-scraping]]></category>
            <category><![CDATA[software-engineering]]></category>
            <category><![CDATA[serpapipodcast]]></category>
            <category><![CDATA[podcast]]></category>
            <dc:creator><![CDATA[Illia Zub]]></dc:creator>
            <pubDate>Wed, 17 May 2023 14:20:58 GMT</pubDate>
            <atom:updated>2023-05-17T14:20:58.810Z</atom:updated>
<content:encoded><![CDATA[<p>The success of a company primarily depends on the effective actions and efforts of talented people. Today we would like to introduce you to Miloš, one of the Engineering Directors at SerpApi.</p><p><a href="https://youtu.be/U7CNEdoTWRc">Watch this episode</a> · <a href="https://podcasters.spotify.com/pod/show/ilyazub/episodes/Insights-from-an-Engineering-Director-with-Milos--SerpApiPodcast--Episode-8-e23og6j">Listen to this episode</a></p><h4>Show Notes</h4><p>[00:00] — Intro<br>[01:26] — What are Miloš’ responsibilities at SerpApi?<br>[02:19] — How did you become a manager?<br>[03:25] — How do you feel about switching from individual contributor to manager?<br>[05:16] — Where did you gain that knowledge, internally or externally?<br>[07:41] — Do you prioritize individual tasks?<br>[09:46] — How do you feel about contributing through others instead of contributing personally?<br>[11:18] — What is the upper limit for you (number of subordinates)?<br>[13:22] — What’s the goal of a weekly meeting?<br>[15:13] — How does your team interact with you?<br>[16:36] — How do you understand and know what is valuable for the business?<br>[18:19] — Hobbies<br>[20:29] — How do you organize your time?<br>[22:23] — Do you work without interruption?<br>[23:12] — What do you put in the calendar besides meetings?<br>[24:26] — What does Miloš do when he can’t follow the plan?<br>[25:00] — What do you do when you finish your work early?<br>[26:12] — Long-term plans or small steps?<br>[30:40] — How have you changed since you started working at SerpApi?<br>[32:41] — What are your plans for the future?<br>[34:20] — What is the feedback about you from the people you manage?<br>[36:36] — How do you manage disagreements with the people you manage?<br>[38:15] — How do you decide when to say “no”?<br>[40:12] — How do you manage to deal with a large amount of work?<br>[45:48] — What would you like to change in your work?<br>[50:59] — What are your suggestions
for yourself in the past?<br>[53:26] — How do you deal with anxiety?<br>[56:16] — Do you have a role model?<br>[57:13] — Epilogue (few words from Miloš)</p><h4>Links and resources</h4><p><a href="https://www.youtube.com/playlist?list=PLGt0Yb3JKV12RPWTuDGXIS6N5Ex1aYJBx">#SerpApiPodcast</a> is about SERP (search engine results pages) and e-commerce data scraping: parsing, circumvention of blocking, web automation, proxies, legal part of scraping, performance, data extraction and validation.</p><p>Use cases for SERP data scraping: SEO, local SEO, ML models, news monitoring, OSINT, voice assistant, ecommerce competitor research.</p><p>The podcast is brought to you by SerpApi team: <a href="https://serpapi.com">https://serpapi.com</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=b79f86d85530" width="1" height="1" alt=""><hr><p><a href="https://medium.com/serpapi/insights-from-an-engineering-director-serpapipodcast-ep-8-b79f86d85530">Insights from an Engineering Director (#SerpApiPodcast, Ep. 8)</a> was originally published in <a href="https://medium.com/serpapi">SerpApi</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Safeguarding Web Scraping Activities with SerpApi: Protecting Privacy and Security]]></title>
            <link>https://medium.com/serpapi/safeguarding-web-scraping-activities-with-serpapi-protecting-privacy-and-security-c5b782cae620?source=rss----76bc81ac44eb---4</link>
            <guid isPermaLink="false">https://medium.com/p/c5b782cae620</guid>
            <category><![CDATA[web-scraping]]></category>
            <category><![CDATA[erp]]></category>
            <category><![CDATA[legal-us-shield]]></category>
            <category><![CDATA[serpapi]]></category>
            <category><![CDATA[security]]></category>
            <dc:creator><![CDATA[Alaa Abdulridha]]></dc:creator>
            <pubDate>Tue, 16 May 2023 20:16:56 GMT</pubDate>
            <atom:updated>2023-05-16T20:58:32.610Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*YrCZuXDt1nbd3-07s6FLig.png" /></figure><p>In this article, we will explore how <a href="https://serpapi.com/?ref=serpapi.com">SerpApi</a> safeguards web scraping activities, ensuring the privacy and security of users’ data.</p><h3>Introduction</h3><p>Web scraping has become an integral part of data acquisition in various domains, enabling businesses, researchers, and developers to gather valuable information from the web. However, conducting web scraping activities while maintaining privacy and security can be challenging. That’s where <a href="https://serpapi.com/?ref=serpapi.com">SerpApi</a> (Search Engine Results Page API Service) steps in as a powerful solution.</p><h3>Features</h3><ol><li><strong>Anonymity and Privacy Protection:</strong> When conducting web scraping, it’s crucial to protect your identity and sensitive information. <a href="https://serpapi.com/?ref=serpapi.com">SerpApi</a>’s service acts as an intermediary layer, ensuring anonymity by routing requests through its infrastructure. By shielding the user’s IP address and personal details, <a href="https://serpapi.com/?ref=serpapi.com">SerpApi</a> minimizes the risk of being blocked or flagged by websites, thereby preserving privacy during the scraping process.</li><li><strong>Anti-Bot Mechanism Bypass:</strong> Websites often implement anti-bot mechanisms, such as <a href="https://en.wikipedia.org/wiki/CAPTCHA?ref=serpapi.com">CAPTCHAs</a> and IP blocking, to prevent automated scraping activities. These mechanisms can hinder the scraping process and compromise security. 
<a href="https://serpapi.com/?ref=serpapi.com">SerpApi</a> employs advanced algorithms to bypass CAPTCHAs and other anti-bot mechanisms, ensuring uninterrupted and efficient data extraction. By automatically handling these challenges, <a href="https://serpapi.com/?ref=serpapi.com">SerpApi</a> enhances security by reducing exposure to potential vulnerabilities.</li><li><strong>Encryption and Secure Communication:</strong> Data security is a paramount concern when performing web scraping operations. <a href="https://serpapi.com/?ref=serpapi.com">SerpApi</a> employs industry-standard encryption protocols, such as HTTPS, to establish secure communication channels between the user’s application and the API. This ensures that data transmitted during the scraping process remains encrypted and protected from unauthorized access or interception by malicious actors.</li><li><strong>Rate Limit Management:</strong> Websites may enforce rate limits to prevent excessive scraping and ensure fair usage. <a href="https://serpapi.com/?ref=serpapi.com">SerpApi</a> simplifies rate limit management by monitoring and managing requests on behalf of the user. By intelligently pacing the requests and adhering to rate limits, <a href="https://serpapi.com/?ref=serpapi.com">SerpApi</a> reduces the risk of triggering website defenses, such as temporary IP bans, while maintaining optimal performance.</li><li><strong>Compliance with Legal and Ethical Standards:</strong> Web scraping activities must comply with legal and ethical guidelines to maintain integrity and respect for the data source. <a href="https://serpapi.com/?ref=serpapi.com">SerpApi, LLC</a> promotes ethical scraping practices by enforcing compliance with the terms of service of search engines and websites. 
By handling scraping operations responsibly and abiding by the rules, <a href="https://serpapi.com/?ref=serpapi.com">SerpApi</a> helps users avoid legal repercussions and fosters a sustainable web scraping ecosystem.</li><li><strong>Continuous Maintenance and Updates:</strong> Websites often undergo structural changes, which can affect scraping scripts and introduce potential security risks. <a href="https://serpapi.com/?ref=serpapi.com">SerpApi</a> alleviates this concern by actively monitoring and adapting to evolving web structures. By handling the complexity of website changes, SerpApi ensures that the scraping process remains accurate and secure, reducing the need for constant maintenance and safeguarding against vulnerabilities.</li><li><strong>Legal US Shield:</strong> The crawling and parsing of public data is protected by the First Amendment of the United States Constitution. We tremendously value freedom of speech. We assume scraping and parsing liabilities for both domestic and foreign companies unless your usage is otherwise illegal. 
(Including but not limited to: acts of cybercrime, terrorism, child pornography, denial-of-service attacks, and war crimes.)</li></ol><h3>Conclusion</h3><p><a href="https://serpapi.com/?ref=serpapi.com">SerpApi, LLC</a> plays a vital role in safeguarding web scraping activities by prioritizing privacy and security.</p><p>Through its anonymization, anti-bot mechanism bypassing, secure communication, rate limit management, and compliance with legal and ethical standards, <a href="https://serpapi.com/?ref=serpapi.com">SerpApi</a> empowers users to conduct web scraping operations with confidence.<br>By utilizing <a href="https://serpapi.com/?ref=serpapi.com">SerpApi</a>, businesses, researchers, and developers can extract valuable data while protecting their privacy, maintaining security, and adhering to ethical practices.</p><p>In addition to that, at SerpApi we take reasonable precautions and follow industry best practices to make sure customer data is not inappropriately lost, misused, accessed, disclosed, altered, or destroyed. Customers’ information is encrypted using <a href="https://www.cloudflare.com/learning/ssl/what-is-ssl/?ref=serpapi.com">secure socket layer (SSL) technology</a> and stored with <a href="https://en.wikipedia.org/wiki/Advanced_Encryption_Standard?ref=serpapi.com">AES-256 encryption</a>. 
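</p><p>All of the above boils down to a plain HTTPS request. As a minimal sketch (the engine, q, and api_key parameter names follow SerpApi’s public documentation; YOUR_API_KEY is a placeholder, not a real key), composing a request URL looks like this:</p>

```ruby
require "uri"

# Minimal sketch: a SerpApi call is a single HTTPS GET, so the whole query
# (including the api_key) travels over an encrypted channel.
# "YOUR_API_KEY" is a placeholder.
def serpapi_url(query, api_key, engine: "google")
  "https://serpapi.com/search.json?" +
    URI.encode_www_form(engine: engine, q: query, api_key: api_key)
end

url = serpapi_url("coffee", "YOUR_API_KEY")
puts URI(url).scheme # => "https"
```

<p>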
Although no method of transmission over the Internet or electronic storage is 100% secure, we follow all <a href="https://en.wikipedia.org/wiki/Payment_Card_Industry_Data_Security_Standard?ref=serpapi.com">PCI-DSS</a> requirements and implement additional generally accepted industry standards.</p><h3>Links</h3><ul><li><a href="https://serpapi.com/status?ref=serpapi.com">SerpApi Status</a> (check the performance of our APIs)</li><li><a href="https://serpapi.com/search-api?ref=serpapi.com">SerpApi Documentation</a> (browse all of our APIs)</li><li><a href="https://serpapi.com/playground?ref=serpapi.com">SerpApi Playground</a> (try out some searches)</li></ul><p>If you have any further questions regarding SerpApi please contact us: contact@serpapi.com</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=c5b782cae620" width="1" height="1" alt=""><hr><p><a href="https://medium.com/serpapi/safeguarding-web-scraping-activities-with-serpapi-protecting-privacy-and-security-c5b782cae620">Safeguarding Web Scraping Activities with SerpApi: Protecting Privacy and Security</a> was originally published in <a href="https://medium.com/serpapi">SerpApi</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to detect memory leak in Ruby C extension]]></title>
            <link>https://medium.com/serpapi/how-to-detect-memory-leak-in-ruby-c-extension-5fcc3ce10d63?source=rss----76bc81ac44eb---4</link>
            <guid isPermaLink="false">https://medium.com/p/5fcc3ce10d63</guid>
            <category><![CDATA[ruby]]></category>
            <category><![CDATA[debugging]]></category>
            <category><![CDATA[c-extension]]></category>
            <category><![CDATA[memory-leak]]></category>
            <dc:creator><![CDATA[Yicheng Zhou]]></dc:creator>
            <pubDate>Thu, 23 Feb 2023 07:57:11 GMT</pubDate>
            <atom:updated>2023-02-23T07:57:11.592Z</atom:updated>
            <content:encoded><![CDATA[<p>Memory leaks are among the most difficult problems to track down in the C/C++ world, and the same is true for Ruby C extensions. We don’t expect memory bloat in a long-running Ruby service. Fortunately, the community has developed great tools to help detect memory leaks. Two of the most popular ones are <a href="https://github.com/google/sanitizers/wiki/AddressSanitizer">AddressSanitizer</a> (or ASAN) and <a href="http://valgrind.org/">Valgrind</a>. Here is <a href="https://github.com/google/sanitizers/wiki/AddressSanitizerComparisonOfMemoryTools">a comparison</a> between these tools. I personally favor ASAN, as it does not require additional tools to be installed and runs fast.</p><p>It’s better to run memory leak detection in a Linux environment. The tests in this post were run on Ubuntu 20.04 with GCC 9.4.0.</p><h3>Configuration</h3><p>Let’s take <a href="https://github.com/serpapi/nokolexbor">Nokolexbor</a> as an example. Enabling ASAN is as easy as adding -fsanitize=address to the CFLAGS and LDFLAGS. In extconf.rb, add:</p><pre>if ENV[&#39;NOKOLEXBOR_DEBUG&#39;] || ENV[&#39;NOKOLEXBOR_ASAN&#39;]<br>  CONFIG[&quot;optflags&quot;] = &quot;-O0&quot;<br>  CONFIG[&quot;debugflags&quot;] = &quot;-ggdb3&quot;<br>end<br><br>if ENV[&#39;NOKOLEXBOR_ASAN&#39;]<br>  $LDFLAGS &lt;&lt; &quot; -fsanitize=address&quot;<br>  $CFLAGS &lt;&lt; &quot; -fsanitize=address -DNOKOLEXBOR_ASAN&quot;<br>end</pre><p>It is recommended to compile with -O0 -ggdb3 when enabling ASAN to reveal as much information as possible when a memory leak is detected.</p><p>Note that we should use the standard memory functions malloc, realloc, calloc and free instead of ruby_xmalloc, ruby_xrealloc, ruby_xcalloc and ruby_xfree, because with the latter the call stack cannot be shown correctly in the memory leak reports, making them useless for analysis. 
Here we are defining NOKOLEXBOR_ASAN so that we can control which memory functions to use:</p><pre>#ifndef NOKOLEXBOR_ASAN<br>  lexbor_memory_setup(ruby_xmalloc, ruby_xrealloc, ruby_xcalloc, ruby_xfree);<br>#else<br>  lexbor_memory_setup(malloc, realloc, calloc, free);<br>#endif</pre><p>Now, just compile with:</p><pre>NOKOLEXBOR_ASAN=1 rake compile</pre><h3>Detection</h3><p>ASAN starts to detect memory leaks when the target program is shutting down, checking whether any allocated memory blocks were not freed. Therefore, your test program should cover as many code paths as possible. Here, we utilize our existing tests that are run by rake test.</p><p>The ASAN runtime has to be manually loaded into the ruby process before running our code. This can be done by setting the environment variable LD_PRELOAD=/path/to/libasan.so, where the path of libasan.so can be retrieved by gcc -print-file-name=libasan.so. The launch command looks like this:</p><pre>LD_PRELOAD=/path/to/libasan.so /path/to/ruby -Ilib -rnokolexbor /path/to/some_spec.rb</pre><p>Note that /path/to/ruby should refer to the binary program of ruby, not a script created by rvm or rbenv.</p><p>When running the above command, you will find that ASAN reports memory leaks even if you run an empty ruby file. This is because Ruby itself does not free all of its memory during shutdown, resulting in false positive reports. 
You will probably see the output like this:</p><pre>==3930==ERROR: LeakSanitizer: detected memory leaks<br><br>Direct leak of 956376 byte(s) in 7481 object(s) allocated from:<br>    #0 0x7fe00f23aa06 in __interceptor_calloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cc:153<br>    #1 0x7fe00ee54929 in calloc1 /tmp/ruby-build.20220826185459.18027.EBv81g/ruby-2.7.2/gc.c:1583<br>    #2 0x7fe00ee54929 in objspace_xcalloc /tmp/ruby-build.20220826185459.18027.EBv81g/ruby-2.7.2/gc.c:10113<br>    #3 0x7fe00ee54929 in ruby_xcalloc_body /tmp/ruby-build.20220826185459.18027.EBv81g/ruby-2.7.2/gc.c:10120<br>    #4 0x7fe00ee54929 in ruby_xcalloc /tmp/ruby-build.20220826185459.18027.EBv81g/ruby-2.7.2/gc.c:12004<br><br>Direct leak of 247663 byte(s) in 2055 object(s) allocated from:<br>    #0 0x7fe00f23a808 in __interceptor_malloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cc:144<br>    #1 0x7fe00ee54764 in objspace_xmalloc0 /tmp/ruby-build.20220826185459.18027.EBv81g/ruby-2.7.2/gc.c:9861<br>    #2 0x7fe00ee54764 in ruby_xmalloc2_body /tmp/ruby-build.20220826185459.18027.EBv81g/ruby-2.7.2/gc.c:10104<br>    #3 0x7fe00ee54764 in ruby_xmalloc2 /tmp/ruby-build.20220826185459.18027.EBv81g/ruby-2.7.2/gc.c:11994<br><br>Direct leak of 108437 byte(s) in 1150 object(s) allocated from:<br>    #0 0x7fe00f23ac3e in __interceptor_realloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cc:163<br>    #1 0x7fe00ee54ff1 in objspace_xrealloc /tmp/ruby-build.20220826185459.18027.EBv81g/ruby-2.7.2/gc.c:9932<br>    #2 0x7fe00ee54ff1 in ruby_sized_xrealloc2 /tmp/ruby-build.20220826185459.18027.EBv81g/ruby-2.7.2/gc.c:10149<br>    #3 0x7fe00ee54ff1 in ruby_xrealloc2_body /tmp/ruby-build.20220826185459.18027.EBv81g/ruby-2.7.2/gc.c:10155<br>    #4 0x7fe00ee54ff1 in ruby_xrealloc2 /tmp/ruby-build.20220826185459.18027.EBv81g/ruby-2.7.2/gc.c:12024<br><br>Direct leak of 73920 byte(s) in 165 object(s) allocated from:<br>    #0 0x7fe00f23a808 in __interceptor_malloc 
../../../../src/libsanitizer/asan/asan_malloc_linux.cc:144<br>    #1 0x7fe00ef36d43 in onig_new_with_source /tmp/ruby-build.20220826185459.18027.EBv81g/ruby-2.7.2/re.c:841<br>    #2 0x7fe00ef36d43 in make_regexp /tmp/ruby-build.20220826185459.18027.EBv81g/ruby-2.7.2/re.c:871<br>    #3 0x7fe00ef36d43 in rb_reg_initialize /tmp/ruby-build.20220826185459.18027.EBv81g/ruby-2.7.2/re.c:2836</pre><p>To wipe out those false positive messages, we can create a rake task for testing that takes care of all the settings mentioned above, and filter the final output by excluding blocks that do not contain paths of your project (in our case lexbor):</p><pre>class ASanTestTask &lt; Rake::TestTask<br>  def filter_leak_message(output)<br>    # Discard ruby only leaks (false positives)<br>    results = output.scan(/(?:Direct|Indirect).*?\n\n/m).select { |r| r.include? &#39;lexbor&#39; }<br>    results.join<br>  end<br><br>  def ruby(*args, **options, &amp;block)<br>    asan_so = `gcc -print-file-name=libasan.so`.strip<br>    env = {&quot;LD_PRELOAD&quot; =&gt; asan_so}<br>    if args.length &gt; 1<br>      stdout, stderr, status = Open3.capture3(env, FileUtils::RUBY, *args, **options, &amp;block)<br>    else<br>      stdout, stderr, status = Open3.capture3(env, &quot;#{FileUtils::RUBY} #{args.first}&quot;, **options, &amp;block)<br>    end<br><br>    puts stdout<br>    unless (leaks = filter_leak_message(stderr)).empty?<br>      puts<br>      puts leaks<br>      yield false, status<br>    end<br>  end<br>end<br><br>namespace :test do<br>  ASanTestTask.new(&#39;asan&#39;) do |t|<br>    t.libs &lt;&lt; &#39;spec&#39;<br>    t.pattern = &#39;spec/**/*_spec.rb&#39;<br>  end<br>end</pre><p>Now, simply run rake test:asan. 
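</p><p>The filtering step itself is easy to sanity-check in isolation. Here is a toy report (fabricated strings, not real ASAN output) run through the same scan-and-select logic as filter_leak_message:</p>

```ruby
# Toy stderr: one leak block from Ruby internals (a false positive)
# and one from lexbor code.
report = "Direct leak of 10 byte(s) in 1 object(s) allocated from:\n" \
         "    #1 0x1 in ruby_xcalloc gc.c:12004\n\n" \
         "Direct leak of 20 byte(s) in 2 object(s) allocated from:\n" \
         "    #1 0x2 in lexbor_calloc memory.c:29\n\n"

# Keep only the blocks that mention paths of our project.
leaks = report.scan(/(?:Direct|Indirect).*?\n\n/m).select { |r| r.include?("lexbor") }
puts leaks.size # => 1
```

<p>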
It will run all the tests just like rake test, and the output will only include the leak information related to your code.</p><pre># Running:<br><br>.........................................................................................................................................................<br><br>Finished in 0.124536s, 1228.5576 runs/s, 2505.2939 assertions/s.<br>153 runs, 312 assertions, 0 failures, 0 errors, 0 skips<br><br>Direct leak of 3168 byte(s) in 132 object(s) allocated from:<br>    #0 0x7f5d01c86a06 in __interceptor_calloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cc:153<br>    #1 0x7f5cfa550955 in lexbor_calloc ../../../../vendor/lexbor/source/lexbor/ports/posix/lexbor/core/memory.c:29    #2 0x7f5cfa54a1b8 in lexbor_array_create ../../../../vendor/lexbor/source/lexbor/core/array.c:13<br>    #3 0x7f5cfa4d53b8 in nl_node_at_css ../../../../ext/nokolexbor/nl_node.c:336<br>    #4 0x7f5d01a29ea0 in vm_call_cfunc_with_frame /tmp/ruby-build.20220826185459.18027.EBv81g/ruby-2.7.2/vm_insnhelper.c:2514<br>    #5 0x7f5d01a29ea0 in vm_call_cfunc /tmp/ruby-build.20220826185459.18027.EBv81g/ruby-2.7.2/vm_insnhelper.c:2539</pre><p>In this case, ext/nokolexbor/nl_node.c:336 is causing the memory leak.</p><p>That’s all for this tutorial. Thanks for reading.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=5fcc3ce10d63" width="1" height="1" alt=""><hr><p><a href="https://medium.com/serpapi/how-to-detect-memory-leak-in-ruby-c-extension-5fcc3ce10d63">How to detect memory leak in Ruby C extension</a> was originally published in <a href="https://medium.com/serpapi">SerpApi</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to create Ruby C extension and debug in VS Code]]></title>
            <link>https://medium.com/serpapi/how-to-create-ruby-c-extension-and-debug-in-vs-code-7e1b3efd6f02?source=rss----76bc81ac44eb---4</link>
            <guid isPermaLink="false">https://medium.com/p/7e1b3efd6f02</guid>
            <category><![CDATA[vscode]]></category>
            <category><![CDATA[c-extension]]></category>
            <category><![CDATA[ruby]]></category>
            <category><![CDATA[debugging]]></category>
            <dc:creator><![CDATA[Yicheng Zhou]]></dc:creator>
            <pubDate>Thu, 23 Feb 2023 07:48:03 GMT</pubDate>
            <atom:updated>2023-02-23T07:48:03.347Z</atom:updated>
            <content:encoded><![CDATA[<p>When developing <a href="https://github.com/serpapi/nokolexbor">Nokolexbor</a>, I found debugging with gdb or lldb troublesome because I had to do everything with commands. For me, it is more pleasant to debug with an interactive GUI, which makes it convenient to control breakpoints and steps, inspect values, and navigate through the code. This post shows how to use VS Code debugging tools to debug a Ruby C extension from scratch.</p><h3>Create the gem project</h3><p>We demonstrate this on an empty gem project with extensions. The project can be easily created by:</p><pre>bundle gem example_gem --ext</pre><p>If you already have a working gem project, just jump to the next section.</p><p>Let’s modify ext/example_gem/example_gem.c to add a native method example_plus under ExampleClass that can be called by ruby code.</p><pre>#include &quot;example_gem.h&quot;<br><br>VALUE rb_mExampleGem;<br>VALUE rb_cExampleClass;<br><br>static VALUE<br>example_plus(VALUE self, VALUE rb_a, VALUE rb_b)<br>{<br>  int a, b;<br>  a = NUM2INT(rb_a);<br>  b = NUM2INT(rb_b);<br>  return INT2NUM(a + b);<br>}<br><br>void Init_example_gem(void)<br>{<br>  rb_mExampleGem = rb_define_module(&quot;ExampleGem&quot;);<br>  rb_cExampleClass = rb_define_class_under(rb_mExampleGem, &quot;ExampleClass&quot;, rb_cObject);<br>  rb_define_method(rb_cExampleClass, &quot;example_plus&quot;, example_plus, 2);<br>}</pre><p>Now, compile the extension: rake compile. If everything goes well, a shared library named example_gem.so or example_gem.bundle will be created under lib/example_gem.</p><p>Let’s test it in irb and make sure it works:</p><pre>❯ irb -Ilib -rexample_gem<br>irb(main):001:0&gt; obj = ExampleGem::ExampleClass.new<br># =&gt; #&lt;ExampleGem::ExampleClass:0x00007f902022ac88&gt;<br>irb(main):002:0&gt; obj.example_plus(2, 3)<br># =&gt; 5</pre><h3>Debug in VS Code</h3><p>By default, ruby compiles the C extension with -O3, which discards debug information. 
We should modify ext/example_gem/extconf.rb to add debug options.</p><pre>require &quot;mkmf&quot;<br><br>if ENV[&#39;EXAMPLE_DEBUG&#39;]<br>  CONFIG[&quot;optflags&quot;] = &quot;-O0&quot;<br>  CONFIG[&quot;debugflags&quot;] = &quot;-ggdb3&quot;<br>end<br><br>create_makefile(&quot;example_gem/example_gem&quot;)</pre><p>Here we are using an environment variable EXAMPLE_DEBUG to control whether we want to compile with debug information.</p><p>Just compile with EXAMPLE_DEBUG=1 rake compile.</p><p>Some posts would suggest that you compile ruby itself with the same debug flags as well, but as long as you don’t need to debug ruby-related functions, just use the ruby you’ve installed.</p><p>Now, let’s open VS Code. First, install the official C/C++ extension.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/952/0*AI_-tTIO3mPrtq62.png" /></figure><p>Switch to the Run and Debug tab and click create a launch.json file.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/892/0*WWY9s8zyvYr-VvA7.png" /></figure><p>On the opened launch.json tab, click the Add Configuration... button at the bottom right. Select C/C++: (lldb) Launch (on Linux, this option may be C/C++: (gdb) Launch).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*sQ7Quq7bCtvJt2Wi.png" /></figure><p>Change program and args to tell VS Code how to run the program. Note that you should set program to the absolute path of the actual ruby program. A single ruby will probably not work if you are using a ruby version manager such as rvm or rbenv, because the ruby command will refer to a script instead of the ruby binary. You might also want to set cwd to ${workspaceFolder} to let ruby find the lib path correctly. 
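</p><p>If you are unsure what to put in program, Ruby can report the location of its own binary. RbConfig is part of the standard library; the printed path is machine-dependent:</p>

```ruby
require "rbconfig"

# Absolute path of the real ruby binary -- what the "program" field expects --
# rather than an rvm/rbenv shim script.
ruby_bin = File.join(RbConfig::CONFIG["bindir"],
                     RbConfig::CONFIG["ruby_install_name"])
puts ruby_bin
```

<p>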
Here is my final setting:</p><pre>{<br>  &quot;version&quot;: &quot;0.2.0&quot;,<br>  &quot;configurations&quot;: [<br>    {<br>      &quot;name&quot;: &quot;(lldb) Launch&quot;,<br>      &quot;type&quot;: &quot;cppdbg&quot;,<br>      &quot;request&quot;: &quot;launch&quot;,<br>      &quot;program&quot;: &quot;/Users/zyc/.rbenv/versions/2.7.2/bin/ruby&quot;,<br>      &quot;args&quot;: [&quot;-Ilib&quot;, &quot;-rexample_gem&quot;, &quot;-e&quot;, &quot;puts ExampleGem::ExampleClass.new.example_plus(2, 3)&quot;],<br>      &quot;stopAtEntry&quot;: false,<br>      &quot;cwd&quot;: &quot;${workspaceFolder}&quot;,<br>      &quot;environment&quot;: [],<br>      &quot;externalConsole&quot;: false,<br>      &quot;MIMode&quot;: &quot;lldb&quot;<br>    }<br>  ]<br>}</pre><p>Everything is set! Let’s add a breakpoint to the C code and start debugging.</p><p>Click in the space to the left of the line number to add a breakpoint (red dot), then click the Start Debugging button.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*8qSNe1OAA4xIVF07.png" /></figure><p>We can see that the code stopped at the breakpoint. The debugging toolbar shows up, and you can perform all the debugging actions you are familiar with.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*u0iGI4J-pQUf2tsd.png" /></figure><h3>Attach to a running process</h3><p>Sometimes it’s not convenient to run the code directly from the raw ruby process. For example, your code may need to run under a Ruby on Rails environment, which is not easy to set up manually. Instead of launching ruby, we can attach to a running ruby process and debug from there.</p><p>Let’s open launch.json again, click the Add Configuration... 
button on the bottom right, and select C/C++: (lldb) Attach (on Linux, this option may be C/C++: (gdb) Attach), edit the newly inserted JSON configuration:</p><pre>  &quot;configurations&quot;: [<br>    {<br>      &quot;name&quot;: &quot;(lldb) Attach&quot;,<br>      &quot;type&quot;: &quot;cppdbg&quot;,<br>      &quot;request&quot;: &quot;attach&quot;,<br>      &quot;program&quot;: &quot;/Users/zyc/.rbenv/versions/2.7.2/bin/ruby&quot;,<br>      &quot;MIMode&quot;: &quot;lldb&quot;<br>    },<br>    ...<br>  ]</pre><p>Now, switch to Attach, start your rails c somewhere, and click Start Debugging</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/884/0*yiM8oNDZtyfFBJ6_.png" /></figure><p>A list will pop up for you to select the target process to attach. Select ruby</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*niqPd2lyVL8daWbO.png" /></figure><p>Finally, run the code in the Rails console that will trigger the breakpoint, and the rest will be the same as debugging with Launch.</p><p>That’s all for this tutorial. I hope you find it useful. Happy debugging!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=7e1b3efd6f02" width="1" height="1" alt=""><hr><p><a href="https://medium.com/serpapi/how-to-create-ruby-c-extension-and-debug-in-vs-code-7e1b3efd6f02">How to create Ruby C extension and debug in VS Code</a> was originally published in <a href="https://medium.com/serpapi">SerpApi</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Benchmarking Puma 4 vs. Puma 5 vs. Puma 6]]></title>
            <link>https://medium.com/serpapi/benchmarking-puma-4-vs-puma-5-vs-puma-6-e1aedd09ac50?source=rss----76bc81ac44eb---4</link>
            <guid isPermaLink="false">https://medium.com/p/e1aedd09ac50</guid>
            <category><![CDATA[ruby-on-rails]]></category>
            <category><![CDATA[ruby]]></category>
            <category><![CDATA[puma]]></category>
            <category><![CDATA[benchmark]]></category>
            <dc:creator><![CDATA[Yicheng Zhou]]></dc:creator>
            <pubDate>Thu, 23 Feb 2023 07:39:56 GMT</pubDate>
            <atom:updated>2023-02-23T07:39:56.248Z</atom:updated>
            <content:encoded><![CDATA[<p>Puma 6.0.0 was released on Oct 19, 2022. We are two major versions behind. Let’s benchmark Puma 4 vs. Puma 5 vs. Puma 6 and see how they perform.</p><p>We focused on testing the throughput of these versions. The excellent HTTP load generator <a href="https://github.com/rakyll/hey">hey</a> made it easy to stress test Puma and get detailed reports.</p><p>We benchmarked SerpApi’s home page, which returns 75 KB of HTML, with 1000 requests at a concurrency of 50. The tests were run on DigitalOcean CPU-Optimized, 2 vCPUs, 4 GB, Ruby 2.7.2, Rails 6.0.3.7 in production mode. Here are the results.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*LA3ZsAILhbS_WkRc_9AtSw.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*lfRd0oGL9kOV8Yj97IwlOQ.png" /></figure><p>The results showed that the developers are doing a great job of improving Puma’s performance. We can see an increasing number of “Requests/sec” on newer versions.</p><p>During the tests, we noticed that compared to Puma 6.0.0, on Puma 4.3.6 and Puma 5.6.5, the two workers were sometimes not evenly loaded, which means they didn’t make the most of the CPU resources. Looking at the <a href="https://github.com/puma/puma/blob/master/History.md">release notes</a>, wait_for_less_busy_worker, first introduced in Puma 5, became the default in Puma 6. This option resolved the unbalanced workload among workers.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*0PEfiBiTb1mIgEW8.png" /><figcaption>Unbalanced CPU load on Puma 4.3.6 and Puma 5.6.5</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*Iecy21EQDx8KOTqu.png" /><figcaption>Balanced CPU load on Puma 6.0.0</figcaption></figure><p>We tried to set the option on Puma 5. As a result, “Requests/sec” increased a bit. However, its performance was still lower than Puma 6’s.</p><p>Migrating to Puma 6 is a no-brainer. 
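</p><p>For reference, enabling the option on a Puma 5 deployment is a one-line change in the Puma config. A sketch (the worker count and the 0.005 s delay are the values discussed in this post):</p>

```ruby
# config/puma.rb -- sketch for Puma 5; Puma 6 enables this behavior by default.
workers 2
wait_for_less_busy_worker 0.005 # seconds a busier worker waits before accepting
```

<p>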
But as 6.0.0 has just been released, it would be better to look at its bug reports first. For those who are cautious, upgrading to the latest version of Puma 5 with wait_for_less_busy_worker = 0.005 set is recommended.</p><p>That’s all for the benchmarks on Puma. Thanks for reading. See you in the following benchmark report.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=e1aedd09ac50" width="1" height="1" alt=""><hr><p><a href="https://medium.com/serpapi/benchmarking-puma-4-vs-puma-5-vs-puma-6-e1aedd09ac50">Benchmarking Puma 4 vs. Puma 5 vs. Puma 6</a> was originally published in <a href="https://medium.com/serpapi">SerpApi</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Nokolexbor — a performance-focused HTML parser for Ruby]]></title>
            <link>https://medium.com/serpapi/nokolexbor-a-performance-focused-html-parser-for-ruby-21d5b31b05a3?source=rss----76bc81ac44eb---4</link>
            <guid isPermaLink="false">https://medium.com/p/21d5b31b05a3</guid>
            <category><![CDATA[ruby]]></category>
            <category><![CDATA[nokogiri]]></category>
            <category><![CDATA[nokolexbor]]></category>
            <category><![CDATA[html-parsing]]></category>
            <category><![CDATA[performance]]></category>
            <dc:creator><![CDATA[Yicheng Zhou]]></dc:creator>
            <pubDate>Thu, 23 Feb 2023 07:11:39 GMT</pubDate>
            <atom:updated>2023-02-23T07:11:39.545Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*hTS3uydzkACpPflepKwLkQ.png" /></figure><h3>Nokolexbor — a performance-focused HTML parser for Ruby</h3><p>There aren’t many choices of HTML parsers in the Ruby ecosystem. The most obvious one would be <a href="https://nokogiri.org/">Nokogiri</a>, which we used heavily at SerpApi. As time passed, we gradually became unsatisfied with Nokogiri’s performance. Though it mainly relies on libxml2, an XML processor written in C, it’s not optimized for HTML-specific tasks. We once <a href="https://serpapi.com/blog/data-extraction-speed-up-3s-to-800ms/">contributed to Nokogiri</a> to significantly improve its performance. But as <a href="https://serpapi.com/blog/author/ilyazub/">Illia</a> (the author) said:</p><blockquote><em>800 ms to extract data from HTML is still too much.</em></blockquote><p>He also created an experimental library <a href="https://github.com/ilyazub/nokogiri-rust">NokogiriRust</a> that tried to use <a href="https://crates.io/crates/scraper">scraper</a> while staying API-compatible with Nokogiri. The benchmark showed a 60x speedup on at_css.text. This proves that it is possible to replace libxml2 with a high-performance and production-ready HTML parser. Sadly, the project didn&#39;t continue.</p><p>Now, we are back to the task. Our goal is to:</p><ul><li>Create a new Ruby HTML parser with a superfast underlying parser engine.</li><li>Make it API-compatible with Nokogiri.</li><li>Make it capable of searching nodes with both CSS selectors and XPath.</li></ul><h3>Development of Nokolexbor</h3><p>We investigated recent HTML parsers in the C and Rust world, and picked <a href="https://github.com/lexbor/lexbor">Lexbor</a> as the core of our new library. It has very similar APIs to Nokogiri, and its performance is much higher than libxml2’s. Also, a C library is easier to compile than Rust when installed by users. 
Lexbor already had a bunch of bindings such as Python, Erlang and Crystal, which made us more confident that we could do it for Ruby as well. As a result, <a href="https://github.com/serpapi/nokolexbor">Nokolexbor</a> was born.</p><p>During the development, we soon encountered two problems that we had to address at an early stage:</p><ol><li>CSS selectors don’t support selecting text nodes, but we select text nodes extensively with Nokogiri. We have to patch Lexbor to support it somehow.</li><li>Lexbor doesn’t support XPath, which we used with Nokogiri. XPath can be converted to CSS selectors, but not in all cases. To be maximally compatible with Nokogiri, it was better to implement the XPath algorithm using Lexbor’s data structures.</li></ol><p>Solving the first problem turned out to be easier than we thought. Thanks to Lexbor’s great code generators, we soon <a href="https://github.com/serpapi/nokolexbor/blob/master/patches/0001-lexbor-support-text-pseudo-element.patch">patched Lexbor</a> and added a ::text pseudo element to represent text nodes. Selecting all text nodes under a div can be as easy as node.css(&#39;div ::text&#39;).</p><p>The second problem was a monster. The only C implementation of XPath we found was libxml2. It has a very large and messy code base. The algorithm <a href="https://github.com/GNOME/libxml2/blob/master/xpath.c">xpath.c</a> itself has over 14k lines of code, plus a number of references to other modules. We had a hard time porting the code. Fortunately, we conquered it. The algorithm was nicely integrated with Lexbor. We were able to select nodes using XPath the same way as Nokogiri: node.xpath(&#39;.//div//text()&#39;).</p><p>The rest of the development was to make Nokolexbor behave the same way as Nokogiri. 
Some notable updates were:</p><ul><li>Patching Lexbor’s case matching on tag names, class names, and ids.</li><li>Patching Lexbor to be able to select nodes inside &lt;template&gt; tags.</li><li>Making the resulting node set unique and ordered by document traversal.</li><li>Enriching the APIs and making them compatible with Nokogiri’s.</li></ul><p>Apart from functionality, we’ve also ensured that the library has no memory leaks, one of the biggest concerns when using a C library in production. We’ve written <a href="https://serpapi.com/blog/how-to-detect-memory-leak-in-ruby-c-extension/">a separate blog post</a> on this topic.</p><h3>Benchmarks</h3><p>We benchmarked parsing a Google result page (368 KB) and searching nodes using CSS and XPath, running on a MacBook Pro (2019) with a 2.3 GHz 8-core Intel Core i9.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/538/1*pfhAaiNh9levKKJLTuamZQ.png" /></figure><p>Parsing and searching with CSS selectors were significantly faster thanks to Lexbor. XPath performed the same, since both libraries use the libxml2 algorithm.</p><p>You might be astonished by the 997x improvement on at_css. That is because Nokolexbor returns the matched node as soon as it finds one and stops searching further. Nokogiri&#39;s at_css implementation, however, was</p><pre>def at_css(*args)<br>  css(*args).first<br>end</pre><p>which has no such optimization: it collects every match and then discards all but the first, so any searching past the first occurrence is simply wasted time.</p><h3>What’s Next</h3><p>We are happy to open-source <a href="https://github.com/serpapi/nokolexbor">Nokolexbor</a>. Feel free to try it out. And of course, contributions are welcome!</p><p>We’ll keep developing Nokolexbor to make it as close to 1:1 compatible with Nokogiri as possible. 
We hope it can be an alternative to Nokogiri when performance is a concern.</p><hr><p><a href="https://medium.com/serpapi/nokolexbor-a-performance-focused-html-parser-for-ruby-21d5b31b05a3">Nokolexbor — a performance-focused HTML parser for Ruby</a> was originally published in <a href="https://medium.com/serpapi">SerpApi</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>