<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[SerpApi - Medium]]></title>
        <description><![CDATA[Fast, complete, and easy API to scrape and extract search results - Medium]]></description>
        <link>https://medium.com/serpapi?source=rss----76bc81ac44eb---4</link>
        <image>
            <url>https://cdn-images-1.medium.com/proxy/1*TGH72Nnw24QL3iV9IOm4VA.png</url>
            <title>SerpApi - Medium</title>
            <link>https://medium.com/serpapi?source=rss----76bc81ac44eb---4</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Sat, 16 May 2026 16:23:18 GMT</lastBuildDate>
        <atom:link href="https://medium.com/feed/serpapi" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Playwright’s getByRole locator is 1.5x slower than CSS selectors]]></title>
            <link>https://medium.com/serpapi/playwrights-getbyrole-locator-is-1-5x-slower-than-css-selectors-9262e766c5f8?source=rss----76bc81ac44eb---4</link>
            <guid isPermaLink="false">https://medium.com/p/9262e766c5f8</guid>
            <category><![CDATA[software-development]]></category>
            <category><![CDATA[automated-testing]]></category>
            <category><![CDATA[web-scraping]]></category>
            <dc:creator><![CDATA[Illia Zub]]></dc:creator>
            <pubDate>Tue, 14 Nov 2023 19:02:07 GMT</pubDate>
            <atom:updated>2023-11-14T19:03:58.236Z</atom:updated>
<content:encoded><![CDATA[<p>Playwright’s getByRole uses querySelectorAll(&#39;*&#39;) and matches elements by their accessible names.</p><h3>Introduction</h3><p>In the realm of web automation and testing with Playwright, understanding the performance of various locator strategies is key. This article delves into the getByRole locator&#39;s efficiency compared to CSS selectors, offering insights into the technical workings and practical implications of these choices.</p><p>I wanted to use Page#getByRole&#39;s underlying CSS selectors in the <a href="https://serpapi.com/">SerpApi</a> code base. But the getByRole locator was 1.5 times slower than the <a href="https://serpapi.com/blog/web-scraping-with-css-selectors-using-python/#selectors_types">standard CSS selectors</a>, prompting an investigation into the root cause. This performance discrepancy, likely stemming from Playwright’s use of <a href="https://github.com/microsoft/playwright/blob/main/packages/playwright-core/src/server/injected/roleSelectorEngine.ts#L163-L173">querySelectorAll(&#39;*&#39;)</a> and <a href="https://github.com/microsoft/playwright/blob/main/packages/playwright-core/src/server/injected/roleUtils.ts#L402-L711">matching elements by the accessible name</a>, raises essential considerations for developers prioritizing speed in their automation scripts.</p><h3>Deep Dive: How getByRole Works</h3><p>The getByRole function in Playwright is more than just a method to locate web elements; it&#39;s a complex mechanism with multiple layers of interaction within the Playwright architecture. Let&#39;s demystify this process with an example.
Consider this code:</p><pre>await page.getByRole(&#39;button&#39;, { name: &#39;Enter address manually&#39; }).click();</pre><p>This command sets off a cascade of actions within Playwright:</p><ul><li>Page#getByRole <a href="https://github.com/microsoft/playwright/blob/main/packages/playwright-core/src/client/locator.ts#L175-L177">creates a </a><a href="https://github.com/microsoft/playwright/blob/main/packages/playwright-core/src/client/locator.ts#L175-L177">Locator</a></li><li>Locator#click delegates the call to <a href="https://github.com/microsoft/playwright/blob/main/packages/playwright-core/src/client/locator.ts#L95-L97">Frame#click, passing the </a><a href="https://github.com/microsoft/playwright/blob/main/packages/playwright-core/src/client/locator.ts#L95-L97">Locator#_selector</a></li><li>Frame#click delegates to <a href="https://github.com/microsoft/playwright/blob/main/packages/playwright-core/src/client/frame.ts#L284-L286">Channel#click</a>. Frame inherits _channel from ChannelOwner. <a href="https://github.com/microsoft/playwright/blob/main/packages/playwright-core/src/client/channelOwner.ts#L138-L160">ChannelOwner#_channel</a> is a JS Proxy object based on the EventEmitter</li><li>The client Frame dispatches an event to the server <a href="https://github.com/microsoft/playwright/blob/main/packages/playwright-core/src/server/frames.ts#L1144-L1149">Frame#click</a></li><li>FrameSelector#resolveInjectedForSelector injects the <a href="https://github.com/microsoft/playwright/blob/2afd857642c26980e56f269d05df72d4d69f57e7/packages/playwright-core/src/server/dom.ts#L83-L109">FrameExecutionContext#injectedScript</a> script into the page controlled by Playwright.
The InjectedScript#constructor adds the <a href="https://github.com/microsoft/playwright/blob/2afd857642c26980e56f269d05df72d4d69f57e7/packages/playwright-core/src/server/injected/injectedScript.ts#L125">engine for the </a><a href="https://github.com/microsoft/playwright/blob/2afd857642c26980e56f269d05df72d4d69f57e7/packages/playwright-core/src/server/injected/injectedScript.ts#L125">getByRole locator</a>.</li><li>createRoleEngine calls <a href="https://github.com/microsoft/playwright/blob/main/packages/playwright-core/src/server/injected/roleSelectorEngine.ts#L179-L195">parseAttributeSelector and </a><a href="https://github.com/microsoft/playwright/blob/main/packages/playwright-core/src/server/injected/roleSelectorEngine.ts#L179-L195">queryRole</a></li><li>queryRole calls <a href="https://github.com/microsoft/playwright/blob/main/packages/playwright-core/src/server/injected/roleSelectorEngine.ts#L129-L161">querySelectorAll(&#39;*&#39;) and matches the element</a></li></ul><p>Compared to getByRole, a locator with a regular CSS selector simply traverses the DOM and is 1.5 times faster.</p><h3>Performance</h3><p>Comparative tests reveal that a regular locator using CSS selectors outperforms getByRole by 1.5 times.
Interestingly, the $.then method trailed, being 2x slower in our tests.</p><pre>console.time(&quot;getByRole&quot;);<br>for (let i = 0; i &lt; 100; i++) {<br>  const textContent = await page.getByRole(&#39;button&#39;, { name: &#39;Enter address manually&#39; }).textContent()<br>}<br>console.timeEnd(&quot;getByRole&quot;);<br><br>console.time(&quot;locator&quot;);<br>for (let i = 0; i &lt; 100; i++) {<br>  const textContent = await page.locator(&#39;.ektjNL&#39;).textContent()<br>}<br>console.timeEnd(&quot;locator&quot;);<br><br>console.time(&quot;$.then&quot;);<br>for (let i = 0; i &lt; 100; i++) {<br>  const textContent = await page.$(&#39;.ektjNL&#39;).then(e =&gt; e.textContent())<br>}<br>console.timeEnd(&quot;$.then&quot;);<br><br>// Output:<br>// getByRole: 677.5ms<br>// locator: 497.306ms<br>// $.then: 1.135s</pre><h3>Conclusion</h3><p>In web automation, understanding Playwright&#39;s locators — getByRole and CSS selectors — is key. getByRole excels in clarity, while CSS selectors win in speed. This matters in large test suites where every second counts. Choose wisely: getByRole for readability, CSS selectors for efficiency.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=9262e766c5f8" width="1" height="1" alt=""><hr><p><a href="https://medium.com/serpapi/playwrights-getbyrole-locator-is-1-5x-slower-than-css-selectors-9262e766c5f8">Playwright’s getByRole locator is 1.5x slower than CSS selectors</a> was originally published in <a href="https://medium.com/serpapi">SerpApi</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
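The cost difference described in the article above can be sketched in plain JavaScript. This is a deliberately simplified, hypothetical model, not Playwright's actual implementation: it only illustrates why a role-based lookup does more work per element (scanning everything and computing a role plus an accessible name) than a direct selector match. The element list and class names here are toy stand-ins.

```javascript
// Toy stand-in for document.querySelectorAll('*').
const elements = [
  { tag: 'div', className: 'nav', text: '' },
  { tag: 'button', className: 'ektjNL', text: 'Enter address manually' },
  { tag: 'a', className: 'link', text: 'Home' },
];

// getByRole-style matching: visit every element, derive an implicit role
// and an accessible name for each, then filter on both.
function getByRole(role, name) {
  const implicitRoles = { button: 'button', a: 'link' };
  return elements.filter(
    (el) => implicitRoles[el.tag] === role && el.text === name
  );
}

// CSS-style matching: a direct attribute check, no name computation.
function byClass(className) {
  return elements.filter((el) => el.className === className);
}

console.log(getByRole('button', 'Enter address manually').length); // 1
console.log(byClass('ektjNL').length); // 1
```

Real accessible-name computation is far more involved (aria-label, labels, text alternatives), which is where the extra 1.5x cost measured in the article plausibly comes from.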
        <item>
            <title><![CDATA[When counting lines in Ruby randomly failed SerpApi deployments]]></title>
            <link>https://medium.com/serpapi/when-counting-lines-in-ruby-randomly-failed-serpapi-deployments-67b18fc75f0e?source=rss----76bc81ac44eb---4</link>
            <guid isPermaLink="false">https://medium.com/p/67b18fc75f0e</guid>
            <category><![CDATA[serpapi]]></category>
            <category><![CDATA[ruby]]></category>
            <category><![CDATA[software-engineering]]></category>
            <category><![CDATA[performance]]></category>
            <dc:creator><![CDATA[Illia Zub]]></dc:creator>
            <pubDate>Thu, 07 Sep 2023 10:23:25 GMT</pubDate>
            <atom:updated>2023-09-21T15:04:12.409Z</atom:updated>
<content:encoded><![CDATA[<p>Recently, we observed occasional deployment failures on only two of our servers. The failing servers were even closing my regular SSH connections. In this story, you’ll learn how we reduced memory usage and made one piece of SerpApi code 1.5x faster.</p><h3>TL;DR</h3><p>str.count($/) was 1.5x faster than str.lines.count and didn&#39;t allocate additional memory.</p><h3>Investigation</h3><p>Only two servers faced the random deployment failures.</p><pre>#&lt;Thread:0x000055a170560e70 digital_ocean.rb:80 run&gt; terminated with exception (report_on_exception is true):<br>SerpApi/vendor/bundle/ruby/2.7.0/gems/net-ssh-5.2.0/lib/net/ssh/transport/server_version.rb:54:in `readpartial&#39;: Connection reset by peer (Errno::ECONNRESET)</pre><p>These servers also randomly closed my SSH connections.</p><pre>$ ssh server-2<br>Last login: Fri Feb 24 14:23:29 2023 from {remote_ip}<br>client_loop: send disconnect: Broken pipe</pre><p>DigitalOcean’s server graphs showed that memory usage was near 95% on these two servers. Load average occasionally peaked at 12, compared to 2 on other servers.</p><p>We checked the Puma server flamegraph.
Most of the wall time in production was spent in SearchSplitter#do_one_request and the Puma thread pool.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*HOfKFid7TIEAispd.png" /></figure><p>We used <a href="https://github.com/rbspy/rbspy?ref=serpapi.com">rbspy</a> to generate the flamegraph:</p><pre>$ rbspy record -p $PID_OF_PUMA_PROCESS</pre><p>The flamegraph didn’t reveal anything actionable, so we moved on to memory profiling.</p><h3>Memory profiling</h3><p>Here’s the script we used:</p><pre>require &#39;memory_profiler&#39;<br><br>user = User.find_by(email: &quot;me&quot;)<br><br>report = MemoryProfiler.report do<br>  threads = []<br>  (1..5).map do |i|<br>    (1..5).map do |j|<br>      threads &lt;&lt; Thread.new {<br>        search = Search.new(engine: &quot;google&quot;, q: &quot;#{i} * #{j}&quot;, user: user)<br>        SearchProcessor.process(search)<br>      }<br>    end<br>  end<br>  threads.each(&amp;:join)<br>end<br><br>report.pretty_print</pre><p>It turned out that the top allocator was line counting in the response validator.</p><pre>elsif response[:html].lines.count &lt; 50 &amp;&amp; response[:html].match?(/&lt;div id=\&quot;main\&quot;&gt;&lt;div id=\&quot;cnt\&quot;&gt;&lt;div class=\&quot;dodTBe\&quot; id=\&quot;sfcnt\&quot;&gt;&lt;\/div&gt;&lt;H1&gt;.+&lt;\/H1&gt;.+&lt;p&gt;.+&lt;\/p&gt;$/)</pre><p>Why were String#lines and Array#count the top allocators in the entire app?</p><h3><a href="https://ruby-doc.org/core-2.6/String.html?ref=serpapi.com#method-i-lines">String#lines</a></h3><p>The HTML file size varies from 180 KB (regular organic results) to 1.3 MB (Google Shopping with num=100).
<a href="https://ruby-doc.org/core-2.6/String.html?ref=serpapi.com#method-i-lines">String#lines</a> allocated an array multiple times per search because we send <a href="https://serpapi.com/ludicrous-speed?ref=serpapi.com">multiple concurrent requests per search</a>.</p><p>Thanks to <a href="https://gist.github.com/guilhermesimoes/d69e547884e556c3dc95?permalink_comment_id=4502636&amp;ref=serpapi.com">@guilhermesimoes’s gist</a>, we found that str.each_line.count should be faster. But it was still not optimal, and we found a way to improve on it.</p><h3>Solution</h3><p>The solution was super simple — str.count($/). Here&#39;s the diff:</p><pre>- elsif response[:html].lines.count &lt; 50 &amp;&amp; response[:html].match?(/&lt;div id=\&quot;main\&quot;&gt;&lt;div id=\&quot;cnt\&quot;&gt;&lt;div class=\&quot;dodTBe\&quot; id=\&quot;sfcnt\&quot;&gt;&lt;\/div&gt;&lt;H1&gt;.+&lt;\/H1&gt;.+&lt;p&gt;.+&lt;\/p&gt;$/)<br>+ elsif response[:html].count($/) &lt; 50 &amp;&amp; response[:html].match?(/&lt;div id=\&quot;main\&quot;&gt;&lt;div id=\&quot;cnt\&quot;&gt;&lt;div class=\&quot;dodTBe\&quot; id=\&quot;sfcnt\&quot;&gt;&lt;\/div&gt;&lt;H1&gt;.+&lt;\/H1&gt;.+&lt;p&gt;.+&lt;\/p&gt;$/)</pre><p>To make sure the problem was solved, we benchmarked multiple ways of counting string lines in Ruby.
We reused and adapted the gist above to exclude File#read from the benchmark and added <a href="https://ruby-doc.org/core-2.6/String.html?ref=serpapi.com#method-i-count">String#count</a>.</p><h3>Benchmark</h3><p><a href="https://ruby-doc.org/core-2.6/String.html?ref=serpapi.com#method-i-count">String#count</a> was 1.5 times faster than the other options:</p><pre>Warming up --------------------------------------<br>                size    31.000  i/100ms<br>              length    75.000  i/100ms<br>               count    77.000  i/100ms<br>   each_line + count    81.000  i/100ms<br>           count($/)   196.000  i/100ms<br>Calculating -------------------------------------<br>                size      1.529k (±33.9%) i/s -      4.774k in   5.015361s<br>              length      1.434k (±38.8%) i/s -      5.025k in   5.139834s<br>               count      1.335k (±40.7%) i/s -      4.697k in   5.079353s<br>   each_line + count      1.411k (±39.5%) i/s -      5.022k in   5.110146s<br>           count($/)      2.231k (± 2.6%) i/s -     11.172k in   5.012323s<br><br>Comparison:<br>           count($/):     2230.5 i/s<br>                size:     1529.0 i/s - 1.46x  (± 0.00) slower<br>              length:     1434.2 i/s - 1.56x  (± 0.00) slower<br>   each_line + count:     1411.0 i/s - 1.58x  (± 0.00) slower<br>               count:     1334.9 i/s - 1.67x  (± 0.00) slower</pre><p>Here’s the script:</p><pre>require &quot;benchmark/ips&quot;<br><br>html = File.read(Rails.root.join(&quot;spec/data/google/superhero-movies-mobile-63f582a0defa1345501c6b50-2023-02-23.html&quot;))<br><br>Benchmark.ips do |x|<br>  x.report(&quot;size&quot;) { html.lines.size &lt; 50 }<br>  x.report(&quot;length&quot;) { html.lines.length &lt; 50 }<br>  x.report(&quot;count&quot;) { html.lines.count &lt; 50 }<br>  x.report(&quot;each_line + count&quot;) { html.each_line.count &lt; 50 }<br>  x.report(&quot;count($/)&quot;) { html.count($/) &lt; 50 }
x.compare!<br>end</pre><h3>Memory usage</h3><p>count($/) doesn&#39;t allocate a new array compared to lines/each_line/etc.</p><p>We used the awesome <a href="https://github.com/Shopify/heap-profiler?ref=serpapi.com">heap-profiler</a> and <a href="https://github.com/zombocom/heapy?ref=serpapi.com">heapy</a> Ruby gems to profile memory usage.</p><h4>lines/readlines/each_line/etc.</h4><p>html.lines.count allocated a new array and referenced the original string on each iteration of the benchmark.</p><pre>$ bundle exec heapy read ./tmp/lines_count/allocated.heap 49 --lines=1<br><br>Analyzing Heap (Generation: 49)<br>-------------------------------<br>allocated by memory (204879705) (in bytes)<br>==============================<br>  204872652  tmp/html_length_vs_count_vs_size_bench.rb:6<br>object count (5406)<br>==============================<br>  5301  tmp/html_length_vs_count_vs_size_bench.rb:6<br>High Ref Counts<br>==============================<br>  5300  tmp/html_length_vs_count_vs_size_bench.rb:6</pre><p>We also used the predefined <a href="https://ruby-doc.org/docs/ruby-doc-bundle/Manual/man-1.4/variable.html?ref=serpapi.com#slash">$/ line separator</a> to allocate even less memory.</p><h4>count($/)</h4><p>Most of these memory allocations and all of the object reference counts were gone when we used String#count($/).</p><pre>$ bundle exec heapy read ./tmp/count_nl/allocated.heap 48 --lines=1<br><br>Analyzing Heap (Generation: 48)<br>-------------------------------<br>allocated by memory (2547465) (in bytes)<br>==============================<br>  2540804  tmp/html_length_vs_count_vs_size_bench.rb:4<br>object count (105)<br>==============================<br>  27  /usr/local/lib/ruby/gems/2.7.0/gems/activesupport/lib/active_support/deprecation/proxy_wrappers.rb:172<br>High Ref Counts<br>==============================<br>  73  /usr/local/lib/ruby/gems/2.7.0/gems/activesupport/lib/active_support/deprecation/proxy_wrappers.rb:172</pre><h4>Code</h4><pre>require &quot;heap-profiler&quot;<br><br>HeapProfiler.report(Rails.root.join(&#39;tmp/lines_count&#39;)) do<br>  html = File.read(Rails.root.join(&quot;spec/data/google/superhero-movies-mobile-63f582a0defa1345501c6b50-2023-02-23.html&quot;))<br><br>  100.times { html.lines.count &lt; 50 }<br>  # 100.times { html.count($/) &lt; 50 }<br>end</pre><h4>Comparison process</h4><p>The heap diff comparison was a bit manual because the <a href="https://github.com/zombocom/heapy?ref=serpapi.com#diff-2-heap-dumps">heapy diff</a> didn&#39;t provide a <em>diff</em>. We commented and uncommented 100.times { html.lines.count &lt; 50 } and replaced paths in the command above.</p><pre># Profile heap<br>$ bundle exec rails r tmp/html_length_vs_count_vs_size_bench.rb<br><br># Read summary of heap allocations<br>$ bundle exec heapy read ./tmp/count_nl/allocated.heap<br><br># Read a specific generation (48) limiting number of lines to output (1)<br>$ bundle exec heapy read ./tmp/count_nl/allocated.heap 48 --lines=1</pre><h3>Results</h3><p>Immediately after the fix was deployed, memory usage on the affected servers decreased and stabilized. Then memory usage fluctuated again, but deployments and SSH connections stabilized.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*nDFIBPcWmMTE8di8.png" /></figure><h3>Observations and final thoughts</h3><p>The initial assumption was the Puma graceful restart. During the phased-restart, Puma spawned additional workers to switch to the new code version (which was expected). It wasn&#39;t clear why SSH connections were dropping on only two DigitalOcean droplets.</p><p>Doubling the amount of RAM would also solve the problem, but it wouldn’t be as efficient at this point.
The fix was deployed half a year ago, and the issue is definitely solved.</p><h3>Update Sep 20th, 2023</h3><p>Thanks to <a href="https://www.reddit.com/r/ruby/comments/16d4ha6/comment/k1f5mc6/?utm_source=share&amp;utm_medium=web2x&amp;context=3">@Freaky from Reddit</a> for the wonderful feedback and cooperation:</p><ul><li>@Freaky tracked down a <a href="https://bugs.ruby-lang.org/issues/19875?ref=serpapi.com">performance regression</a> in String#count in Ruby 3.1+</li><li>@Freaky brought his old <a href="https://github.com/Freaky/fast-bytecount/?ref=serpapi.com">SIMD bytecount C port</a> out of mothballs</li><li>MRI maintainer nobu is <a href="https://github.com/nobu/ruby/tree/mm_bytecount?ref=serpapi.com">evaluating it</a> for inclusion in MRI</li><li>@Freaky noted that the bytecount Rust crate <a href="https://github.com/llogiq/bytecount/issues/85?ref=serpapi.com">uses an SSE4.1 intrinsic in SSE2 code</a> and submitted a fix</li><li>A similar <a href="https://github.com/BurntSushi/aho-corasick/pull/130?ref=serpapi.com">fix for the aho-corasick Rust crate</a> was made in response</li></ul><blockquote>I love it when you get this ripple effect from something that initially seems pretty innocuous.</blockquote><p><em>If you enjoy working on such challenges, come work here with us: </em><a href="https://serpapi.com/careers?ref=serpapi.com#open-roles"><em>https://serpapi.com/careers#open-roles</em></a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=67b18fc75f0e" width="1" height="1" alt=""><hr><p><a href="https://medium.com/serpapi/when-counting-lines-in-ruby-randomly-failed-serpapi-deployments-67b18fc75f0e">When counting lines in Ruby randomly failed SerpApi deployments</a> was originally published in <a href="https://medium.com/serpapi">SerpApi</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
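The trade-off from the article above can be shown with a minimal, self-contained Ruby sketch (illustrative strings, not SerpApi's actual HTML). String#lines materializes an Array with one String per line before counting, while String#count($/) only counts newline characters ($/ defaults to "\n") and allocates nothing.

```ruby
html = "line one\nline two\nline three\n"

html.lines.count  # => 3, allocates ["line one\n", "line two\n", "line three\n"] first
html.count($/)    # => 3, counts "\n" occurrences, no intermediate Array

# Caveat: the two differ when the string lacks a trailing newline, because
# #lines counts the final partial line while #count only counts separators.
"a\nb".lines.count # => 2
"a\nb".count($/)   # => 1
```

For a threshold check like the validator's `lines.count < 50`, this off-by-one for non-newline-terminated documents is immaterial, which is why the swap was safe.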
        <item>
            <title><![CDATA[Software Engineer’s Life Balance with Dmytro Vasyliev | #SerpApiPodcast, Episode 10]]></title>
            <link>https://medium.com/serpapi/software-engineers-life-balance-with-dmytro-vasyliev-serpapipodcast-episode-10-bb28177314b6?source=rss----76bc81ac44eb---4</link>
            <guid isPermaLink="false">https://medium.com/p/bb28177314b6</guid>
            <category><![CDATA[serpapipodcast]]></category>
            <category><![CDATA[software-engineering]]></category>
            <category><![CDATA[work-life-balance]]></category>
            <dc:creator><![CDATA[Illia Zub]]></dc:creator>
            <pubDate>Sat, 24 Jun 2023 19:57:36 GMT</pubDate>
            <atom:updated>2023-06-24T19:57:36.584Z</atom:updated>
<content:encoded><![CDATA[<p>In the 10th episode of the <a href="https://youtube.com/playlist?list=PLGt0Yb3JKV12RPWTuDGXIS6N5Ex1aYJBx">SerpApi Podcast</a>, we discuss work-life balance with Dmytro Vasyliev: “How can you find a balance between your professional activity and personal life?”, “How can you maintain your health and avoid burnout?”, etc.</p><p><a href="https://youtu.be/YITDzpvFiU4">Watch this episode</a> · <a href="https://podcasters.spotify.com/pod/show/ilyazub/episodes/Software-Engineers-Life-Balance-with-Dmytro-Vasyliev--SerpApiPodcast--Episode-10-e25an6r">Listen to this episode</a></p><h3>Chapters</h3><p>[00:00:00] Introduction<br>[00:02:03] What is life balance for Dmytro?<br>[00:03:00] How does Dmytro organize his day?<br>[00:05:10] Were you always organized, or did you have to do something in order to feel better?<br>[00:06:35] About life spheres<br>[00:12:10] Burnout and how to handle it<br>[00:14:51] Work-life balance: what helps to be more focused?<br>[00:29:55] Work from home or from the office?<br>[00:33:08] Working with a coach: when, why, and what the results are<br>[00:38:35] The first time with a psychologist<br>[00:42:20] Training with a personal trainer — how does it work?<br>[00:46:26] A few exercises to keep an eye on life balance<br>[00:50:32] Organizing a to-do list<br>[00:58:08] Epilogue</p><p>The SerpApi Podcast is about SERP (search engine results pages) and e-commerce data scraping: parsing, circumvention of blocking, web automation, proxies, the legal side of scraping, performance, data extraction and validation.</p><p>Use cases for SERP data scraping: programmatic search engine optimization (SEO) and local SEO, machine learning (ML), artificial intelligence (AI), large language models (LLMs), news monitoring, open-source intelligence (OSINT), voice assistants, e-commerce competitor research.</p><p>The podcast is brought to you by the SerpApi team: <a href="https://serpapi.com">https://serpapi.com</a></p><img
src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=bb28177314b6" width="1" height="1" alt=""><hr><p><a href="https://medium.com/serpapi/software-engineers-life-balance-with-dmytro-vasyliev-serpapipodcast-episode-10-bb28177314b6">Software Engineer’s Life Balance with Dmytro Vasyliev | #SerpApiPodcast, Episode 10</a> was originally published in <a href="https://medium.com/serpapi">SerpApi</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[From Software Engineer to Manager with Oleksii Gyturo | #SerpApiPodcast, Episode 9]]></title>
            <link>https://medium.com/serpapi/from-software-engineer-to-manager-with-oleksii-gyturo-serpapipodcast-episode-9-eb1b28e4a4f0?source=rss----76bc81ac44eb---4</link>
            <guid isPermaLink="false">https://medium.com/p/eb1b28e4a4f0</guid>
            <category><![CDATA[software-engineering]]></category>
            <category><![CDATA[serpapipodcast]]></category>
            <category><![CDATA[podcast]]></category>
            <category><![CDATA[careers]]></category>
            <dc:creator><![CDATA[Illia Zub]]></dc:creator>
            <pubDate>Fri, 23 Jun 2023 16:45:53 GMT</pubDate>
            <atom:updated>2023-06-23T16:45:53.875Z</atom:updated>
<content:encoded><![CDATA[<p>In this episode, we discuss career growth in management. Oleksii, as an aspiring manager, faced a multitude of questions and challenges. Career growth depends not only on how well a manager copes with their tasks, but also on how open they are to new knowledge and experience.</p><p><a href="https://youtu.be/U7CNEdoTWRc">Watch this episode</a> · <a href="https://podcasters.spotify.com/pod/show/ilyazub/episodes/From-Software-Engineer-to-Manager--SerpApiPodcast--Episode-9-e23ogmv">Listen to this episode</a></p><h3>Show notes</h3><p>[00:00:00] Introduction<br>[00:02:11] Did you grow from another position, or were you hired as the head of mobile?<br>[00:03:29] You started as an individual contributor, but then you took on more and more. Why?<br>[00:06:55] What was the result of involving you in leadership strategy?<br>[00:10:32] Experience exchange (from respondent to interviewer)<br>[00:20:14] How do you deal with multiple sources of information about the business?<br>[00:36:04] What is Illia’s mission at SerpApi?<br>[00:56:02] Long-term planning or being spontaneous?
How did you manage that?<br>[01:06:38] Who is an engineering manager?<br>[01:10:09] Why have you decided to consult businesses?<br>[01:12:29] Do you treat consulting as a business?<br>[01:14:34] Epilogue (few words from Oleksii)</p><p>SerpApi Podcast is about SERP (search engine results pages) and e-commerce data scraping: parsing, circumvention of blocking, web automation, proxies, legal part of scraping, performance, data extraction and validation.</p><p>Use cases for SERP data scraping: programmatic search engine optimization (SEO) and local SEO, machine learning (ML), artificial intelligence (AI), large language models (LLMs), news monitoring, open-source intelligence (OSINT), voice assistants, e-commerce competitor research.</p><p>The podcast is brought to you by SerpApi team: <a href="https://serpapi.com">https://serpapi.com</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=eb1b28e4a4f0" width="1" height="1" alt=""><hr><p><a href="https://medium.com/serpapi/from-software-engineer-to-manager-with-oleksii-gyturo-serpapipodcast-episode-9-eb1b28e4a4f0">From Software Engineer to Manager with Oleksii Gyturo | #SerpApiPodcast, Episode 9</a> was originally published in <a href="https://medium.com/serpapi">SerpApi</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Insights from an Engineering Director (#SerpApiPodcast, Ep. 8)]]></title>
            <link>https://medium.com/serpapi/insights-from-an-engineering-director-serpapipodcast-ep-8-b79f86d85530?source=rss----76bc81ac44eb---4</link>
            <guid isPermaLink="false">https://medium.com/p/b79f86d85530</guid>
            <category><![CDATA[web-scraping]]></category>
            <category><![CDATA[software-engineering]]></category>
            <category><![CDATA[serpapipodcast]]></category>
            <category><![CDATA[podcast]]></category>
            <dc:creator><![CDATA[Illia Zub]]></dc:creator>
            <pubDate>Wed, 17 May 2023 14:20:58 GMT</pubDate>
            <atom:updated>2023-05-17T14:20:58.810Z</atom:updated>
<content:encoded><![CDATA[<p>The success of a company primarily depends on the effective actions and efforts of talented people. Today we would like to introduce you to Miloš, one of the Engineering Directors at SerpApi.</p><p><a href="https://youtu.be/U7CNEdoTWRc">Watch this episode</a> · <a href="https://podcasters.spotify.com/pod/show/ilyazub/episodes/Insights-from-an-Engineering-Director-with-Milos--SerpApiPodcast--Episode-8-e23og6j">Listen to this episode</a></p><h4>Show Notes</h4><p>[00:00] — Intro<br>[01:26] — What are Miloš’ responsibilities at SerpApi?<br>[02:19] — How did you become a manager?<br>[03:25] — How do you feel about switching from individual contributor to manager?<br>[05:16] — Where did you gain that knowledge, internally or externally?<br>[07:41] — Do you prioritize individual tasks?<br>[09:46] — How do you feel about contributing through others instead of contributing personally?<br>[11:18] — What is the upper limit for you (number of subordinates)?<br>[13:22] — What’s the goal of a weekly meeting?<br>[15:13] — How does your team interact with you?<br>[16:36] — How do you understand and know what is valuable for the business?<br>[18:19] — Hobbies<br>[20:29] — How do you organize your time?<br>[22:23] — Do you work without interruption?<br>[23:12] — What do you put in the calendar besides meetings?<br>[24:26] — What does Miloš do when he can’t follow the plan?<br>[25:00] — What do you do when you finish your work early?<br>[26:12] — Long-term plans or small steps?<br>[30:40] — How have you changed since you started working at SerpApi?<br>[32:41] — What are your plans for the future?<br>[34:20] — What is the feedback about you from the people you manage?<br>[36:36] — How do you manage disagreements with the people you manage?<br>[38:15] — How do you decide when to say “no”?<br>[40:12] — How do you manage to deal with a large amount of work?<br>[45:48] — What would you like to change in your work?<br>[50:59] — What are your suggestions
for yourself in the past?<br>[53:26] — How do you deal with anxiety?<br>[56:16] — Do you have a role model?<br>[57:13] — Epilogue (few words from Miloš)</p><h4>Links and resources</h4><p><a href="https://www.youtube.com/playlist?list=PLGt0Yb3JKV12RPWTuDGXIS6N5Ex1aYJBx">#SerpApiPodcast</a> is about SERP (search engine results pages) and e-commerce data scraping: parsing, circumvention of blocking, web automation, proxies, legal part of scraping, performance, data extraction and validation.</p><p>Use cases for SERP data scraping: SEO, local SEO, ML models, news monitoring, OSINT, voice assistant, ecommerce competitor research.</p><p>The podcast is brought to you by SerpApi team: <a href="https://serpapi.com">https://serpapi.com</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=b79f86d85530" width="1" height="1" alt=""><hr><p><a href="https://medium.com/serpapi/insights-from-an-engineering-director-serpapipodcast-ep-8-b79f86d85530">Insights from an Engineering Director (#SerpApiPodcast, Ep. 8)</a> was originally published in <a href="https://medium.com/serpapi">SerpApi</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Safeguarding Web Scraping Activities with SerpApi: Protecting Privacy and Security]]></title>
            <link>https://medium.com/serpapi/safeguarding-web-scraping-activities-with-serpapi-protecting-privacy-and-security-c5b782cae620?source=rss----76bc81ac44eb---4</link>
            <guid isPermaLink="false">https://medium.com/p/c5b782cae620</guid>
            <category><![CDATA[web-scraping]]></category>
            <category><![CDATA[erp]]></category>
            <category><![CDATA[legal-us-shield]]></category>
            <category><![CDATA[serpapi]]></category>
            <category><![CDATA[security]]></category>
            <dc:creator><![CDATA[Alaa Abdulridha]]></dc:creator>
            <pubDate>Tue, 16 May 2023 20:16:56 GMT</pubDate>
            <atom:updated>2023-05-16T20:58:32.610Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*YrCZuXDt1nbd3-07s6FLig.png" /></figure><p>In this article, we will explore how <a href="https://serpapi.com/?ref=serpapi.com">SerpApi</a> safeguards web scraping activities, ensuring the privacy and security of users’ data.</p><h3>Introduction</h3><p>Web scraping has become an integral part of data acquisition in various domains, enabling businesses, researchers, and developers to gather valuable information from the web. However, conducting web scraping activities while maintaining privacy and security can be challenging. That’s where <a href="https://serpapi.com/?ref=serpapi.com">SerpApi</a> (Search Engine Results Page API Service) steps in as a powerful solution.</p><h3>Features</h3><ol><li><strong>Anonymity and Privacy Protection:</strong> When conducting web scraping, it’s crucial to protect your identity and sensitive information. <a href="https://serpapi.com/?ref=serpapi.com">SerpApi</a>’s service acts as an intermediary layer, ensuring anonymity by routing requests through its infrastructure. By shielding the user’s IP address and personal details, <a href="https://serpapi.com/?ref=serpapi.com">SerpApi</a> minimizes the risk of being blocked or flagged by websites, thereby preserving privacy during the scraping process.</li><li><strong>Anti-Bot Mechanism Bypass:</strong> Websites often implement anti-bot mechanisms, such as <a href="https://en.wikipedia.org/wiki/CAPTCHA?ref=serpapi.com">CAPTCHAs</a> and IP blocking, to prevent automated scraping activities. These mechanisms can hinder the scraping process and compromise security. 
<a href="https://serpapi.com/?ref=serpapi.com">SerpApi</a> employs advanced algorithms to bypass CAPTCHAs and other anti-bot mechanisms, ensuring uninterrupted and efficient data extraction. By automatically handling these challenges, <a href="https://serpapi.com/?ref=serpapi.com">SerpApi</a> enhances security by reducing exposure to potential vulnerabilities.</li><li><strong>Encryption and Secure Communication:</strong> Data security is a paramount concern when performing web scraping operations. <a href="https://serpapi.com/?ref=serpapi.com">SerpApi</a> employs industry-standard encryption protocols, such as HTTPS, to establish secure communication channels between the user’s application and the API. This ensures that data transmitted during the scraping process remains encrypted and protected from unauthorized access or interception by malicious actors.</li><li><strong>Rate Limit Management:</strong> Websites may enforce rate limits to prevent excessive scraping and ensure fair usage. <a href="https://serpapi.com/?ref=serpapi.com">SerpApi</a> simplifies rate limit management by monitoring and managing requests on behalf of the user. By intelligently pacing the requests and adhering to rate limits, <a href="https://serpapi.com/?ref=serpapi.com">SerpApi</a> reduces the risk of triggering website defenses, such as temporary IP bans, while maintaining optimal performance.</li><li><strong>Compliance with Legal and Ethical Standards:</strong> Web scraping activities must comply with legal and ethical guidelines to maintain integrity and respect for the data source. <a href="https://serpapi.com/?ref=serpapi.com">SerpApi, LLC</a> promotes ethical scraping practices by enforcing compliance with the terms of service of search engines and websites. 
By handling scraping operations responsibly and abiding by the rules, <a href="https://serpapi.com/?ref=serpapi.com">SerpApi</a> helps users avoid legal repercussions and fosters a sustainable web scraping ecosystem.</li><li><strong>Continuous Maintenance and Updates:</strong> Websites often undergo structural changes, which can affect scraping scripts and introduce potential security risks. <a href="https://serpapi.com/?ref=serpapi.com">SerpApi</a> alleviates this concern by actively monitoring and adapting to evolving web structures. By handling the complexity of website changes, SerpApi ensures that the scraping process remains accurate and secure, reducing the need for constant maintenance and safeguarding against vulnerabilities.</li><li><strong>Legal US Shield:</strong> The crawling and parsing of public data is protected by the First Amendment of the United States Constitution. We tremendously value freedom of speech. We assume scraping and parsing liabilities for both domestic and foreign companies unless your usage is otherwise illegal. 
(Including but not limited to: acts of cybercrime, terrorism, child pornography, denial-of-service attacks, and war crimes.)</li></ol><h3>Conclusion</h3><p><a href="https://serpapi.com/?ref=serpapi.com">SerpApi, LLC</a> plays a vital role in safeguarding web scraping activities by prioritizing privacy and security.</p><p>Through its anonymization, anti-bot mechanism bypassing, secure communication, rate limit management, and compliance with legal and ethical standards, <a href="https://serpapi.com/?ref=serpapi.com">SerpApi</a> empowers users to conduct web scraping operations with confidence.<br>By utilizing <a href="https://serpapi.com/?ref=serpapi.com">SerpApi</a>, businesses, researchers, and developers can extract valuable data while protecting their privacy, maintaining security, and adhering to ethical practices.</p><p>In addition to that, at SerpApi we take reasonable precautions and follow industry best practices to make sure customer data is not inappropriately lost, misused, accessed, disclosed, altered, or destroyed. Customers’ information is encrypted using <a href="https://www.cloudflare.com/learning/ssl/what-is-ssl/?ref=serpapi.com">secure socket layer (SSL) technology</a> and stored with <a href="https://en.wikipedia.org/wiki/Advanced_Encryption_Standard?ref=serpapi.com">AES-256 encryption</a>. 
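</p><p>All of the above boils down to a plain HTTPS request. As a minimal sketch (the engine, q, and api_key parameter names follow SerpApi’s public documentation; YOUR_API_KEY is a placeholder, not a real key), composing a request URL looks like this:</p>

```ruby
require "uri"

# Minimal sketch: a SerpApi call is a single HTTPS GET, so the whole query
# (including the api_key) travels over an encrypted channel.
# "YOUR_API_KEY" is a placeholder.
def serpapi_url(query, api_key, engine: "google")
  "https://serpapi.com/search.json?" +
    URI.encode_www_form(engine: engine, q: query, api_key: api_key)
end

url = serpapi_url("coffee", "YOUR_API_KEY")
puts URI(url).scheme # => "https"
```

<p>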
Although no method of transmission over the Internet or electronic storage is 100% secure, we follow all <a href="https://en.wikipedia.org/wiki/Payment_Card_Industry_Data_Security_Standard?ref=serpapi.com">PCI-DSS</a> requirements and implement additional generally accepted industry standards.</p><h3>Links</h3><ul><li><a href="https://serpapi.com/status?ref=serpapi.com">SerpApi Status</a> (check the performance of our APIs)</li><li><a href="https://serpapi.com/search-api?ref=serpapi.com">SerpApi Documentation</a> (browse all of our APIs)</li><li><a href="https://serpapi.com/playground?ref=serpapi.com">SerpApi Playground</a> (try out some searches)</li></ul><p>If you have any further questions regarding SerpApi please contact us: contact@serpapi.com</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=c5b782cae620" width="1" height="1" alt=""><hr><p><a href="https://medium.com/serpapi/safeguarding-web-scraping-activities-with-serpapi-protecting-privacy-and-security-c5b782cae620">Safeguarding Web Scraping Activities with SerpApi: Protecting Privacy and Security</a> was originally published in <a href="https://medium.com/serpapi">SerpApi</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to detect memory leak in Ruby C extension]]></title>
            <link>https://medium.com/serpapi/how-to-detect-memory-leak-in-ruby-c-extension-5fcc3ce10d63?source=rss----76bc81ac44eb---4</link>
            <guid isPermaLink="false">https://medium.com/p/5fcc3ce10d63</guid>
            <category><![CDATA[ruby]]></category>
            <category><![CDATA[debugging]]></category>
            <category><![CDATA[c-extension]]></category>
            <category><![CDATA[memory-leak]]></category>
            <dc:creator><![CDATA[Yicheng Zhou]]></dc:creator>
            <pubDate>Thu, 23 Feb 2023 07:57:11 GMT</pubDate>
            <atom:updated>2023-02-23T07:57:11.592Z</atom:updated>
            <content:encoded><![CDATA[<p>Memory leaks are among the most difficult problems to track down in the C/C++ world, and the same is true for Ruby C extensions. We don’t expect memory bloat in a long-running Ruby service. Fortunately, the community has developed great tools to help detect memory leaks. Two of the most popular ones are <a href="https://github.com/google/sanitizers/wiki/AddressSanitizer">AddressSanitizer</a> (or ASAN) and <a href="http://valgrind.org/">Valgrind</a>. Here is <a href="https://github.com/google/sanitizers/wiki/AddressSanitizerComparisonOfMemoryTools">a comparison</a> between these tools. I personally favor ASAN, as it does not require additional tools to be installed and runs fast.</p><p>It’s better to run memory leak detection in a Linux environment. The tests in this post were run on Ubuntu 20.04 with GCC 9.4.0.</p><h3>Configuration</h3><p>Let’s take <a href="https://github.com/serpapi/nokolexbor">Nokolexbor</a> as an example. Enabling ASAN is as easy as adding -fsanitize=address to the CFLAGS and LDFLAGS. In extconf.rb, add:</p><pre>if ENV[&#39;NOKOLEXBOR_DEBUG&#39;] || ENV[&#39;NOKOLEXBOR_ASAN&#39;]<br>  CONFIG[&quot;optflags&quot;] = &quot;-O0&quot;<br>  CONFIG[&quot;debugflags&quot;] = &quot;-ggdb3&quot;<br>end<br><br>if ENV[&#39;NOKOLEXBOR_ASAN&#39;]<br>  $LDFLAGS &lt;&lt; &quot; -fsanitize=address&quot;<br>  $CFLAGS &lt;&lt; &quot; -fsanitize=address -DNOKOLEXBOR_ASAN&quot;<br>end</pre><p>It is recommended to compile with -O0 -ggdb3 when enabling ASAN to reveal as much information as possible when a memory leak is detected.</p><p>Note that we should use the standard memory functions malloc, realloc, calloc and free instead of ruby_xmalloc, ruby_xrealloc, ruby_xcalloc and ruby_xfree, because with the latter the call stack cannot be shown correctly in the memory leak reports, making them useless for analysis. 
Here we are defining NOKOLEXBOR_ASAN so that we can control which memory functions to use:</p><pre>#ifndef NOKOLEXBOR_ASAN<br>  lexbor_memory_setup(ruby_xmalloc, ruby_xrealloc, ruby_xcalloc, ruby_xfree);<br>#else<br>  lexbor_memory_setup(malloc, realloc, calloc, free);<br>#endif</pre><p>Now, just compile with:</p><pre>NOKOLEXBOR_ASAN=1 rake compile</pre><h3>Detection</h3><p>ASAN starts to detect memory leaks when the target program is shutting down, checking whether any allocated memory blocks were not freed. Therefore, your test program should cover as many code paths as possible. Here, we utilize our existing tests that are run by rake test.</p><p>The ASAN runtime has to be manually loaded into the ruby process before running our code. This can be done by setting the environment variable LD_PRELOAD=/path/to/libasan.so, where the path of libasan.so can be retrieved by gcc -print-file-name=libasan.so. The launch command looks like this:</p><pre>LD_PRELOAD=/path/to/libasan.so /path/to/ruby -Ilib -rnokolexbor /path/to/some_spec.rb</pre><p>Note that /path/to/ruby should refer to the binary program of ruby, not a script created by rvm or rbenv.</p><p>When running the above command, you will find that ASAN reports memory leaks even if you run an empty ruby file. This is because Ruby itself does not free all of its memory during shutdown, resulting in false positive reports. 
You will probably see the output like this:</p><pre>==3930==ERROR: LeakSanitizer: detected memory leaks<br><br>Direct leak of 956376 byte(s) in 7481 object(s) allocated from:<br>    #0 0x7fe00f23aa06 in __interceptor_calloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cc:153<br>    #1 0x7fe00ee54929 in calloc1 /tmp/ruby-build.20220826185459.18027.EBv81g/ruby-2.7.2/gc.c:1583<br>    #2 0x7fe00ee54929 in objspace_xcalloc /tmp/ruby-build.20220826185459.18027.EBv81g/ruby-2.7.2/gc.c:10113<br>    #3 0x7fe00ee54929 in ruby_xcalloc_body /tmp/ruby-build.20220826185459.18027.EBv81g/ruby-2.7.2/gc.c:10120<br>    #4 0x7fe00ee54929 in ruby_xcalloc /tmp/ruby-build.20220826185459.18027.EBv81g/ruby-2.7.2/gc.c:12004<br><br>Direct leak of 247663 byte(s) in 2055 object(s) allocated from:<br>    #0 0x7fe00f23a808 in __interceptor_malloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cc:144<br>    #1 0x7fe00ee54764 in objspace_xmalloc0 /tmp/ruby-build.20220826185459.18027.EBv81g/ruby-2.7.2/gc.c:9861<br>    #2 0x7fe00ee54764 in ruby_xmalloc2_body /tmp/ruby-build.20220826185459.18027.EBv81g/ruby-2.7.2/gc.c:10104<br>    #3 0x7fe00ee54764 in ruby_xmalloc2 /tmp/ruby-build.20220826185459.18027.EBv81g/ruby-2.7.2/gc.c:11994<br><br>Direct leak of 108437 byte(s) in 1150 object(s) allocated from:<br>    #0 0x7fe00f23ac3e in __interceptor_realloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cc:163<br>    #1 0x7fe00ee54ff1 in objspace_xrealloc /tmp/ruby-build.20220826185459.18027.EBv81g/ruby-2.7.2/gc.c:9932<br>    #2 0x7fe00ee54ff1 in ruby_sized_xrealloc2 /tmp/ruby-build.20220826185459.18027.EBv81g/ruby-2.7.2/gc.c:10149<br>    #3 0x7fe00ee54ff1 in ruby_xrealloc2_body /tmp/ruby-build.20220826185459.18027.EBv81g/ruby-2.7.2/gc.c:10155<br>    #4 0x7fe00ee54ff1 in ruby_xrealloc2 /tmp/ruby-build.20220826185459.18027.EBv81g/ruby-2.7.2/gc.c:12024<br><br>Direct leak of 73920 byte(s) in 165 object(s) allocated from:<br>    #0 0x7fe00f23a808 in __interceptor_malloc 
../../../../src/libsanitizer/asan/asan_malloc_linux.cc:144<br>    #1 0x7fe00ef36d43 in onig_new_with_source /tmp/ruby-build.20220826185459.18027.EBv81g/ruby-2.7.2/re.c:841<br>    #2 0x7fe00ef36d43 in make_regexp /tmp/ruby-build.20220826185459.18027.EBv81g/ruby-2.7.2/re.c:871<br>    #3 0x7fe00ef36d43 in rb_reg_initialize /tmp/ruby-build.20220826185459.18027.EBv81g/ruby-2.7.2/re.c:2836</pre><p>To wipe out those false positive messages, we can create a rake task for testing that takes care of all the settings mentioned above, and filter the final output by excluding blocks that do not contain paths of your project (in our case lexbor):</p><pre>class ASanTestTask &lt; Rake::TestTask<br>  def filter_leak_message(output)<br>    # Discard ruby only leaks (false positives)<br>    results = output.scan(/(?:Direct|Indirect).*?\n\n/m).select { |r| r.include? &#39;lexbor&#39; }<br>    results.join<br>  end<br><br>  def ruby(*args, **options, &amp;block)<br>    asan_so = `gcc -print-file-name=libasan.so`.strip<br>    env = {&quot;LD_PRELOAD&quot; =&gt; asan_so}<br>    if args.length &gt; 1<br>      stdout, stderr, status = Open3.capture3(env, FileUtils::RUBY, *args, **options, &amp;block)<br>    else<br>      stdout, stderr, status = Open3.capture3(env, &quot;#{FileUtils::RUBY} #{args.first}&quot;, **options, &amp;block)<br>    end<br><br>    puts stdout<br>    unless (leaks = filter_leak_message(stderr)).empty?<br>      puts<br>      puts leaks<br>      yield false, status<br>    end<br>  end<br>end<br><br>namespace :test do<br>  ASanTestTask.new(&#39;asan&#39;) do |t|<br>    t.libs &lt;&lt; &#39;spec&#39;<br>    t.pattern = &#39;spec/**/*_spec.rb&#39;<br>  end<br>end</pre><p>Now, simply run rake test:asan. 
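</p><p>The filtering step itself is easy to sanity-check in isolation. Here is a toy report (fabricated strings, not real ASAN output) run through the same scan-and-select logic as filter_leak_message:</p>

```ruby
# Toy stderr: one leak block from Ruby internals (a false positive)
# and one from lexbor code.
report = "Direct leak of 10 byte(s) in 1 object(s) allocated from:\n" \
         "    #1 0x1 in ruby_xcalloc gc.c:12004\n\n" \
         "Direct leak of 20 byte(s) in 2 object(s) allocated from:\n" \
         "    #1 0x2 in lexbor_calloc memory.c:29\n\n"

# Keep only the blocks that mention paths of our project.
leaks = report.scan(/(?:Direct|Indirect).*?\n\n/m).select { |r| r.include?("lexbor") }
puts leaks.size # => 1
```

<p>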
It will run all the tests just like rake test, and the output will only include the leak information related to your code.</p><pre># Running:<br><br>.........................................................................................................................................................<br><br>Finished in 0.124536s, 1228.5576 runs/s, 2505.2939 assertions/s.<br>153 runs, 312 assertions, 0 failures, 0 errors, 0 skips<br><br>Direct leak of 3168 byte(s) in 132 object(s) allocated from:<br>    #0 0x7f5d01c86a06 in __interceptor_calloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cc:153<br>    #1 0x7f5cfa550955 in lexbor_calloc ../../../../vendor/lexbor/source/lexbor/ports/posix/lexbor/core/memory.c:29    #2 0x7f5cfa54a1b8 in lexbor_array_create ../../../../vendor/lexbor/source/lexbor/core/array.c:13<br>    #3 0x7f5cfa4d53b8 in nl_node_at_css ../../../../ext/nokolexbor/nl_node.c:336<br>    #4 0x7f5d01a29ea0 in vm_call_cfunc_with_frame /tmp/ruby-build.20220826185459.18027.EBv81g/ruby-2.7.2/vm_insnhelper.c:2514<br>    #5 0x7f5d01a29ea0 in vm_call_cfunc /tmp/ruby-build.20220826185459.18027.EBv81g/ruby-2.7.2/vm_insnhelper.c:2539</pre><p>In this case, ext/nokolexbor/nl_node.c:336 is causing the memory leak.</p><p>That’s all for this tutorial. Thanks for reading.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=5fcc3ce10d63" width="1" height="1" alt=""><hr><p><a href="https://medium.com/serpapi/how-to-detect-memory-leak-in-ruby-c-extension-5fcc3ce10d63">How to detect memory leak in Ruby C extension</a> was originally published in <a href="https://medium.com/serpapi">SerpApi</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to create Ruby C extension and debug in VS Code]]></title>
            <link>https://medium.com/serpapi/how-to-create-ruby-c-extension-and-debug-in-vs-code-7e1b3efd6f02?source=rss----76bc81ac44eb---4</link>
            <guid isPermaLink="false">https://medium.com/p/7e1b3efd6f02</guid>
            <category><![CDATA[vscode]]></category>
            <category><![CDATA[c-extension]]></category>
            <category><![CDATA[ruby]]></category>
            <category><![CDATA[debugging]]></category>
            <dc:creator><![CDATA[Yicheng Zhou]]></dc:creator>
            <pubDate>Thu, 23 Feb 2023 07:48:03 GMT</pubDate>
            <atom:updated>2023-02-23T07:48:03.347Z</atom:updated>
            <content:encoded><![CDATA[<p>When developing <a href="https://github.com/serpapi/nokolexbor">Nokolexbor</a>, I found debugging with gdb or lldb troublesome because I had to do everything with commands. For me, it is more pleasant to debug with an interactive GUI, which makes it convenient to control breakpoints and steps, inspect values, and navigate through the code. This post shows how to use VS Code debugging tools to debug a Ruby C extension from scratch.</p><h3>Create the gem project</h3><p>We demonstrate this on an empty gem project with extensions. The project can be easily created by:</p><pre>bundle gem example_gem --ext</pre><p>If you already have a working gem project, just jump to the next section.</p><p>Let’s modify ext/example_gem/example_gem.c to add a native method example_plus under ExampleClass that can be called by ruby code.</p><pre>#include &quot;example_gem.h&quot;<br><br>VALUE rb_mExampleGem;<br>VALUE rb_cExampleClass;<br><br>static VALUE<br>example_plus(VALUE self, VALUE rb_a, VALUE rb_b)<br>{<br>  int a, b;<br>  a = NUM2INT(rb_a);<br>  b = NUM2INT(rb_b);<br>  return INT2NUM(a + b);<br>}<br><br>void Init_example_gem(void)<br>{<br>  rb_mExampleGem = rb_define_module(&quot;ExampleGem&quot;);<br>  rb_cExampleClass = rb_define_class_under(rb_mExampleGem, &quot;ExampleClass&quot;, rb_cObject);<br>  rb_define_method(rb_cExampleClass, &quot;example_plus&quot;, example_plus, 2);<br>}</pre><p>Now, compile the extension: rake compile. If everything goes well, a shared library named example_gem.so or example_gem.bundle will be created under lib/example_gem.</p><p>Let’s test it in irb and make sure it works:</p><pre>❯ irb -Ilib -rexample_gem<br>irb(main):001:0&gt; obj = ExampleGem::ExampleClass.new<br># =&gt; #&lt;ExampleGem::ExampleClass:0x00007f902022ac88&gt;<br>irb(main):002:0&gt; obj.example_plus(2, 3)<br># =&gt; 5</pre><h3>Debug in VS Code</h3><p>By default, ruby compiles the C extension with -O3, which discards debug information. 
We should modify ext/example_gem/extconf.rb to add debug options.</p><pre>require &quot;mkmf&quot;<br><br>if ENV[&#39;EXAMPLE_DEBUG&#39;]<br>  CONFIG[&quot;optflags&quot;] = &quot;-O0&quot;<br>  CONFIG[&quot;debugflags&quot;] = &quot;-ggdb3&quot;<br>end<br><br>create_makefile(&quot;example_gem/example_gem&quot;)</pre><p>Here we are using an environment variable EXAMPLE_DEBUG to control whether we want to compile with debug information.</p><p>Just compile with EXAMPLE_DEBUG=1 rake compile.</p><p>Some posts would suggest that you compile ruby itself with the same debug flags as well, but as long as you don’t need to debug ruby-related functions, just use the ruby you’ve installed.</p><p>Now, let’s open VS Code. First, install the official C/C++ extension.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/952/0*AI_-tTIO3mPrtq62.png" /></figure><p>Switch to the Run and Debug tab and click create a launch.json file.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/892/0*WWY9s8zyvYr-VvA7.png" /></figure><p>On the opened launch.json tab, click the Add Configuration... button at the bottom right. Select C/C++: (lldb) Launch (on Linux, this option may be C/C++: (gdb) Launch).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*sQ7Quq7bCtvJt2Wi.png" /></figure><p>Change program and args to tell VS Code how to run the program. Note that you should set program to the absolute path of the actual ruby program. A single ruby will probably not work if you are using a ruby version manager such as rvm or rbenv, because the ruby command will refer to a script instead of the ruby binary. You might also want to set cwd to ${workspaceFolder} to let ruby find the lib path correctly. 
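</p><p>If you are unsure what to put in program, Ruby can report the location of its own binary. RbConfig is part of the standard library; the printed path is machine-dependent:</p>

```ruby
require "rbconfig"

# Absolute path of the real ruby binary -- what the "program" field expects --
# rather than an rvm/rbenv shim script.
ruby_bin = File.join(RbConfig::CONFIG["bindir"],
                     RbConfig::CONFIG["ruby_install_name"])
puts ruby_bin
```

<p>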
Here is my final setting:</p><pre>{<br>  &quot;version&quot;: &quot;0.2.0&quot;,<br>  &quot;configurations&quot;: [<br>    {<br>      &quot;name&quot;: &quot;(lldb) Launch&quot;,<br>      &quot;type&quot;: &quot;cppdbg&quot;,<br>      &quot;request&quot;: &quot;launch&quot;,<br>      &quot;program&quot;: &quot;/Users/zyc/.rbenv/versions/2.7.2/bin/ruby&quot;,<br>      &quot;args&quot;: [&quot;-Ilib&quot;, &quot;-rexample_gem&quot;, &quot;-e&quot;, &quot;puts ExampleGem::ExampleClass.new.example_plus(2, 3)&quot;],<br>      &quot;stopAtEntry&quot;: false,<br>      &quot;cwd&quot;: &quot;${workspaceFolder}&quot;,<br>      &quot;environment&quot;: [],<br>      &quot;externalConsole&quot;: false,<br>      &quot;MIMode&quot;: &quot;lldb&quot;<br>    }<br>  ]<br>}</pre><p>Everything is set! Let’s add a breakpoint to the C code and start debugging.</p><p>Click in the space to the left of the line number to add a breakpoint (red dot), then click the Start Debugging button.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*8qSNe1OAA4xIVF07.png" /></figure><p>We can see that the code stopped at the breakpoint. The debugging toolbar shows up, and you can perform all the debugging actions you are familiar with.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*u0iGI4J-pQUf2tsd.png" /></figure><h3>Attach to a running process</h3><p>Sometimes it’s not convenient to run the code directly from the raw ruby process. For example, your code may need to run under a Ruby on Rails environment, which is not easy to set up manually. Instead of launching ruby, we can attach to a running ruby process and debug from there.</p><p>Let’s open launch.json again, click the Add Configuration... 
button on the bottom right, and select C/C++: (lldb) Attach (on Linux, this option may be C/C++: (gdb) Attach), edit the newly inserted JSON configuration:</p><pre>  &quot;configurations&quot;: [<br>    {<br>      &quot;name&quot;: &quot;(lldb) Attach&quot;,<br>      &quot;type&quot;: &quot;cppdbg&quot;,<br>      &quot;request&quot;: &quot;attach&quot;,<br>      &quot;program&quot;: &quot;/Users/zyc/.rbenv/versions/2.7.2/bin/ruby&quot;,<br>      &quot;MIMode&quot;: &quot;lldb&quot;<br>    },<br>    ...<br>  ]</pre><p>Now, switch to Attach, start your rails c somewhere, and click Start Debugging</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/884/0*yiM8oNDZtyfFBJ6_.png" /></figure><p>A list will pop up for you to select the target process to attach. Select ruby</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*niqPd2lyVL8daWbO.png" /></figure><p>Finally, run the code in the Rails console that will trigger the breakpoint, and the rest will be the same as debugging with Launch.</p><p>That’s all for this tutorial. I hope you find it useful. Happy debugging!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=7e1b3efd6f02" width="1" height="1" alt=""><hr><p><a href="https://medium.com/serpapi/how-to-create-ruby-c-extension-and-debug-in-vs-code-7e1b3efd6f02">How to create Ruby C extension and debug in VS Code</a> was originally published in <a href="https://medium.com/serpapi">SerpApi</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Benchmarking Puma 4 vs. Puma 5 vs. Puma 6]]></title>
            <link>https://medium.com/serpapi/benchmarking-puma-4-vs-puma-5-vs-puma-6-e1aedd09ac50?source=rss----76bc81ac44eb---4</link>
            <guid isPermaLink="false">https://medium.com/p/e1aedd09ac50</guid>
            <category><![CDATA[ruby-on-rails]]></category>
            <category><![CDATA[ruby]]></category>
            <category><![CDATA[puma]]></category>
            <category><![CDATA[benchmark]]></category>
            <dc:creator><![CDATA[Yicheng Zhou]]></dc:creator>
            <pubDate>Thu, 23 Feb 2023 07:39:56 GMT</pubDate>
            <atom:updated>2023-02-23T07:39:56.248Z</atom:updated>
            <content:encoded><![CDATA[<p>Puma 6.0.0 was released on Oct 19, 2022. We are two major versions behind. Let’s benchmark Puma 4 vs. Puma 5 vs. Puma 6 and see how they perform.</p><p>We focused on testing the throughput of these versions. The excellent HTTP load generator <a href="https://github.com/rakyll/hey">hey</a> made it easy to stress test Puma and get detailed reports.</p><p>We benchmarked SerpApi’s home page, which returns 75 KB of HTML, with 1000 requests at a concurrency of 50. The tests were run on DigitalOcean CPU-Optimized, 2 vCPUs, 4 GB, Ruby 2.7.2, Rails 6.0.3.7 in production mode. Here are the results.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*LA3ZsAILhbS_WkRc_9AtSw.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*lfRd0oGL9kOV8Yj97IwlOQ.png" /></figure><p>The results showed that the developers are doing a great job of improving Puma’s performance. We can see an increasing number of “Requests/sec” on newer versions.</p><p>During the tests, we noticed that compared to Puma 6.0.0, on Puma 4.3.6 and Puma 5.6.5, the two workers were sometimes not evenly loaded, which means they didn’t make the most of the CPU resources. Looking at the <a href="https://github.com/puma/puma/blob/master/History.md">release notes</a>, wait_for_less_busy_worker, first introduced in Puma 5, became the default in Puma 6. This option resolved the unbalanced workload among workers.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*0PEfiBiTb1mIgEW8.png" /><figcaption>Unbalanced CPU load on Puma 4.3.6 and Puma 5.6.5</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*Iecy21EQDx8KOTqu.png" /><figcaption>Balanced CPU load on Puma 6.0.0</figcaption></figure><p>We tried to set the option on Puma 5. As a result, “Requests/sec” increased a bit. However, its performance was still lower than Puma 6’s.</p><p>Migrating to Puma 6 is a no-brainer. 
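</p><p>For reference, enabling the option on a Puma 5 deployment is a one-line change in the Puma config. A sketch (the worker count and the 0.005 s delay are the values discussed in this post):</p>

```ruby
# config/puma.rb -- sketch for Puma 5; Puma 6 enables this behavior by default.
workers 2
wait_for_less_busy_worker 0.005 # seconds a busier worker waits before accepting
```

<p>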
But as 6.0.0 has just been released, it would be better to look at its bug reports first. For those who are cautious, upgrading to the latest version of Puma 5 with wait_for_less_busy_worker = 0.005 set is recommended.</p><p>That’s all for the benchmarks on Puma. Thanks for reading. See you in the following benchmark report.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=e1aedd09ac50" width="1" height="1" alt=""><hr><p><a href="https://medium.com/serpapi/benchmarking-puma-4-vs-puma-5-vs-puma-6-e1aedd09ac50">Benchmarking Puma 4 vs. Puma 5 vs. Puma 6</a> was originally published in <a href="https://medium.com/serpapi">SerpApi</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Nokolexbor — a performance-focused HTML parser for Ruby]]></title>
            <link>https://medium.com/serpapi/nokolexbor-a-performance-focused-html-parser-for-ruby-21d5b31b05a3?source=rss----76bc81ac44eb---4</link>
            <guid isPermaLink="false">https://medium.com/p/21d5b31b05a3</guid>
            <category><![CDATA[ruby]]></category>
            <category><![CDATA[nokogiri]]></category>
            <category><![CDATA[nokolexbor]]></category>
            <category><![CDATA[html-parsing]]></category>
            <category><![CDATA[performance]]></category>
            <dc:creator><![CDATA[Yicheng Zhou]]></dc:creator>
            <pubDate>Thu, 23 Feb 2023 07:11:39 GMT</pubDate>
            <atom:updated>2023-02-23T07:11:39.545Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*hTS3uydzkACpPflepKwLkQ.png" /></figure><h3>Nokolexbor — a performance-focused HTML parser for Ruby</h3><p>There aren’t many choices of HTML parsers in the Ruby ecosystem. The most obvious one would be <a href="https://nokogiri.org/">Nokogiri</a>, which we used heavily at SerpApi. As time passed, we gradually became unsatisfied with Nokogiri’s performance. Though it mainly relies on libxml2, an XML processor written in C, it’s not optimized for HTML-specific tasks. We once <a href="https://serpapi.com/blog/data-extraction-speed-up-3s-to-800ms/">contributed to Nokogiri</a> to significantly improve its performance. But as <a href="https://serpapi.com/blog/author/ilyazub/">Illia</a> (the author) said:</p><blockquote><em>800 ms to extract data from HTML is still too much.</em></blockquote><p>He also created an experimental library <a href="https://github.com/ilyazub/nokogiri-rust">NokogiriRust</a> that tried to use <a href="https://crates.io/crates/scraper">scraper</a> while staying API-compatible with Nokogiri. The benchmark showed a 60x speedup on at_css.text. This proves that it is possible to replace libxml2 with a high-performance and production-ready HTML parser. Sadly, the project didn&#39;t continue.</p><p>Now, we are back to the task. Our goal is to:</p><ul><li>Create a new Ruby HTML parser with a superfast underlying parser engine.</li><li>Make it API-compatible with Nokogiri.</li><li>Make it capable of searching nodes with both CSS selectors and XPath.</li></ul><h3>Development of Nokolexbor</h3><p>We investigated recent HTML parsers in the C and Rust world, and picked <a href="https://github.com/lexbor/lexbor">Lexbor</a> as the core of our new library. It has very similar APIs to Nokogiri, and its performance is much higher than libxml2’s. Also, a C library is easier to compile than Rust when installed by users. 
Lexbor already had a bunch of bindings such as Python, Erlang and Crystal, which made us more confident that we could do it for Ruby as well. As a result, <a href="https://github.com/serpapi/nokolexbor">Nokolexbor</a> was born.</p><p>During the development, we soon encountered two problems that we had to address at an early stage:</p><ol><li>CSS selectors don’t support selecting text nodes, but we select text nodes extensively with Nokogiri. We have to patch Lexbor to support it somehow.</li><li>Lexbor doesn’t support XPath, which we used with Nokogiri. XPath can be converted to CSS selectors, but not in all cases. To be maximally compatible with Nokogiri, it was better to implement the XPath algorithm using Lexbor’s data structures.</li></ol><p>Solving the first problem turned out to be easier than we thought. Thanks to Lexbor’s great code generators, we soon <a href="https://github.com/serpapi/nokolexbor/blob/master/patches/0001-lexbor-support-text-pseudo-element.patch">patched Lexbor</a> and added a ::text pseudo element to represent text nodes. Selecting all text nodes under a div can be as easy as node.css(&#39;div ::text&#39;).</p><p>The second problem was a monster. The only C implementation of XPath we found was libxml2. It has a very large and messy code base. The algorithm <a href="https://github.com/GNOME/libxml2/blob/master/xpath.c">xpath.c</a> itself has over 14k lines of code, plus a number of references to other modules. We had a hard time porting the code. Fortunately, we conquered it. The algorithm was nicely integrated with Lexbor. We were able to select nodes using XPath the same way as Nokogiri: node.xpath(&#39;.//div//text()&#39;).</p><p>The rest of the development was to make Nokolexbor behave the same way as Nokogiri. 
Some notable updates were:</p><ul><li>Patching Lexbor’s case matching on tag names, class names, and ids.</li><li>Patching Lexbor to be able to select nodes inside &lt;template&gt; tags.</li><li>Making the resulting node set unique and ordered by document traversal.</li><li>Enriching the APIs and making them compatible with Nokogiri’s.</li></ul><p>Apart from functionality, we’ve also ensured that the library has no memory leaks, one of the biggest concerns when using a C library in production. We’ve written <a href="https://serpapi.com/blog/how-to-detect-memory-leak-in-ruby-c-extension/">a separate blog post</a> on this topic.</p><h3>Benchmarks</h3><p>We benchmarked parsing a Google result page (368 KB) and searching nodes using CSS and XPath, running on a MacBook Pro (2019) with a 2.3 GHz 8-core Intel Core i9.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/538/1*pfhAaiNh9levKKJLTuamZQ.png" /></figure><p>Parsing and searching with CSS selectors were significantly faster thanks to Lexbor. XPath performed the same, since both libraries use the libxml2 algorithm.</p><p>You might be astonished by the 997x improvement on at_css. That is because Nokolexbor returns the matched node as soon as it finds one and stops searching further. Nokogiri&#39;s at_css implementation, however, was</p><pre>def at_css(*args)<br>  css(*args).first<br>end</pre><p>which has no such optimization: it collects every match and then discards all but the first, so any searching past the first occurrence is simply wasted time.</p><h3>What’s Next</h3><p>We are happy to open-source <a href="https://github.com/serpapi/nokolexbor">Nokolexbor</a>. Feel free to try it out. And of course, contributions are welcome!</p><p>We’ll keep developing Nokolexbor to make it as close to 1:1 compatible with Nokogiri as possible. 
We hope it can be an alternative to Nokogiri when performance is a concern.</p><hr><p><a href="https://medium.com/serpapi/nokolexbor-a-performance-focused-html-parser-for-ruby-21d5b31b05a3">Nokolexbor — a performance-focused HTML parser for Ruby</a> was originally published in <a href="https://medium.com/serpapi">SerpApi</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>