Stories by BoilingData on Medium

Parquet is a Streaming Data Format

BoilingData — Thu, 08 May 2025 12:34:40 GMT

By Dan Forsberg, CEO & Founder @ BoilingData

In the world of high-performance data processing, two critical challenges stand out: how to efficiently ingest streaming data and how to optimize network throughput without wasting precious CPU resources. At BoilingData, we’ve been tackling these challenges head-on with some remarkable results that I’d like to share.

Parquet as a Streaming Data Format

When people think of Parquet, they often think of static files optimized for columnar data storage. However, we’ve been leveraging Parquet as a streaming data format with remarkable efficiency.

I’ve recently built a no-compromise streaming data ingestion agent using Rust that transforms how we think about data streaming:

It accepts data via Apache Arrow IPC and stream-writes topic-specific Parquet files directly to S3 using multipart uploads. Parts are uploaded concurrently, and each part includes Parquet row groups for the same Parquet file.
Achieves a true zero-copy architecture with pre-allocated memory and no locking. The network interface card (NIC) copies incoming data directly to CPU memory (zero-copy).
Implements rate limiting and true horizontal scaling by adding more agents.
Absolutely minimizes data copying and (de)serialization: the only conversion is from Arrow IPC to compressed Parquet.

What makes this approach particularly powerful is how Parquet files and S3 multipart uploads complement each other. The Parquet row groups can be written concurrently into S3, eliminating the need for background compaction since the Parquet files are already optimized to your desired size.
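To make the fit between row groups and multipart parts concrete, here is a minimal Python sketch (not the agent's actual Rust code). It packs consecutive row groups into upload parts, assuming the documented S3 limits of a 5 MiB minimum part size (except for the last part) and at most 10,000 parts per upload. The function name `plan_parts` and the example sizes are hypothetical.

```python
from typing import Iterable, List

MIN_PART_BYTES = 5 * 1024 * 1024  # S3 minimum size for every part except the last
MAX_PARTS = 10_000                # S3 maximum number of parts per multipart upload


def plan_parts(row_group_sizes: Iterable[int],
               min_part_bytes: int = MIN_PART_BYTES) -> List[List[int]]:
    """Group consecutive row-group indexes into parts of at least min_part_bytes."""
    parts: List[List[int]] = []
    current: List[int] = []
    current_bytes = 0
    for index, size in enumerate(row_group_sizes):
        current.append(index)
        current_bytes += size
        if current_bytes >= min_part_bytes:
            # This part is large enough to be uploaded on its own.
            parts.append(current)
            current, current_bytes = [], 0
    if current:
        # The final part is allowed to be smaller than the minimum.
        parts.append(current)
    if len(parts) > MAX_PARTS:
        raise ValueError("too many parts for a single multipart upload")
    return parts


MIB = 1024 * 1024
print(plan_parts([3 * MIB, 3 * MIB, 6 * MIB, 1 * MIB]))
```

Each inner list is one part that could be uploaded independently of the others. The last part would also carry the Parquet footer, since the footer sits at the end of the file.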

The S3 multipart uploads provide an excellent fit for disaster recovery scenarios while allowing you to tune the S3 persistence latency window to your requirements.

Reliability Features

Our agent doesn’t just focus on performance. We’ve built in robust reliability mechanisms:

Deduplication with SIMD-optimized bit vectors for the past 128 messages
Acknowledgment of sequence ranges for S3 part uploads (durability acks)
Kafka-like behavior where client retransmission doesn’t slow down the system
Exactly-once delivery with a much larger dedup window than Kafka
Guaranteed ordering via sequence numbers when reading
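The deduplication window can be illustrated with a small sliding-window bitmap over sequence numbers, similar in spirit to an anti-replay window. This is a sketch under assumptions: it uses a plain Python integer as the bit vector rather than SIMD instructions, and the class name `DedupWindow` and the policy of rejecting anything older than the window are hypothetical choices, not the agent's confirmed behaviour.

```python
class DedupWindow:
    """Sliding-window duplicate detector over message sequence numbers."""

    def __init__(self, size: int = 128) -> None:
        self.size = size
        self.mask = (1 << size) - 1
        self.highest = -1  # highest sequence number accepted so far
        self.bits = 0      # bit i set means sequence (highest - i) was seen

    def accept(self, seq: int) -> bool:
        """Return True if seq is new; False if it is a duplicate or too old."""
        if seq > self.highest:
            # Slide the window forward and mark the new highest sequence.
            shift = seq - self.highest
            self.bits = ((self.bits << shift) | 1) & self.mask
            self.highest = seq
            return True
        offset = self.highest - seq
        if offset >= self.size:
            return False  # outside the window: cannot prove it is new
        if self.bits & (1 << offset):
            return False  # already seen inside the window
        self.bits |= 1 << offset
        return True


window = DedupWindow()
print([window.accept(s) for s in (0, 0, 2, 1, 1)])
```

A retransmitted message hits the "already seen" branch and is dropped with a couple of bit operations, which is why client retransmissions need not slow the ingest path.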

Most impressively, this is achieved with no disk usage — we write directly to S3 with tunable thresholds for time and size.

Ultra-Efficient Network Data Processing

The second breakthrough we’ve achieved is in network data transfer efficiency.

Using AWS EC2 c7gn.2xlarge instances with eXpress Data Path (XDP) sockets, we’ve demonstrated remarkable efficiency:

~1 million packets per second while saturating a 25 Gbps link
Only 4% (eBPF) and 6% (server) CPU core utilization — leaving abundant CPU resources for actual data processing, like DuckDB based stream processing.

This efficiency comes from having network cards directly memory-copy (DMA) the received packets into CPU memory while bypassing the Linux kernel network stack processing using XDP sockets with zero-copy mode (XDP_ZEROCOPY).

Specific Implementation

Our setup involves:

One instance (AWS EC2 c7gn.4xlarge) as server and another as client sending packets with multiple threads over UDP
Client using the normal Linux networking stack
Server using the XDP zero-copy path and custom eBPF program
4 NIC queue utilization and single CPU pinning

The results speak for themselves, with consistent performance at about 70% single CPU core utilisation (BPF):

Packets per second:    1,629,555 PPS  |    5230.98 MB/s  |   43.881 Gbps  |  Avg size:  3366 bytes
Packets per second:    1,624,624 PPS  |    5215.15 MB/s  |   43.748 Gbps  |  Avg size:  3366 bytes
Packets per second:    1,623,462 PPS  |    5211.42 MB/s  |   43.717 Gbps  |  Avg size:  3366 bytes
Packets per second:    1,624,201 PPS  |    5213.80 MB/s  |   43.736 Gbps  |  Avg size:  3366 bytes
Packets per second:    1,622,978 PPS  |    5209.87 MB/s  |   43.704 Gbps  |  Avg size:  3366 bytes
Packets per second:    1,624,368 PPS  |    5214.33 MB/s  |   43.741 Gbps  |  Avg size:  3366 bytes

Without zero-copy, the same setup drives the single CPU core to 100% utilisation and still does not reach the same throughput:

Packets per second:    1,071,528 PPS  |    3439.68 MB/s  |   28.854 Gbps  |  Avg size:  3366 bytes
Packets per second:    1,069,480 PPS  |    3433.10 MB/s  |   28.799 Gbps  |  Avg size:  3366 bytes
Packets per second:    1,065,762 PPS  |    3421.17 MB/s  |   28.699 Gbps  |  Avg size:  3366 bytes
Packets per second:    1,069,274 PPS  |    3432.44 MB/s  |   28.793 Gbps  |  Avg size:  3366 bytes
Packets per second:    1,066,823 PPS  |    3424.57 MB/s  |   28.727 Gbps  |  Avg size:  3366 bytes
Packets per second:    1,067,504 PPS  |    3426.76 MB/s  |   28.746 Gbps  |  Avg size:  3366 bytes

With zero-copy, data ingestion is about 2x (100% more) efficient than without it: roughly 1.5x the throughput at about 70% of the CPU.
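The figures in the two logs can be reproduced from the packet rate and average packet size alone. The short sketch below does the arithmetic; it suggests the "MB/s" column is in units of 2^20 bytes, and that the "about 2x" figure corresponds to throughput per fully used CPU core. The 70% and 100% utilisation values are taken from the text; the function name is hypothetical.

```python
from typing import Tuple


def throughput(pps: int, avg_packet_bytes: int) -> Tuple[float, float]:
    """Return (MiB/s, Gbit/s) for a packet rate and an average packet size."""
    bytes_per_second = pps * avg_packet_bytes
    return bytes_per_second / 2**20, bytes_per_second * 8 / 1e9


zero_copy = throughput(1_629_555, 3366)   # first line of the zero-copy log
copy_mode = throughput(1_071_528, 3366)   # first line of the non-zero-copy log

# Throughput per fully used core: ~70% utilisation with zero-copy, 100% without.
per_core_ratio = (zero_copy[1] / 0.70) / (copy_mode[1] / 1.00)

print(f"zero-copy: {zero_copy[0]:.2f} MiB/s, {zero_copy[1]:.3f} Gbps")
print(f"copy mode: {copy_mode[0]:.2f} MiB/s, {copy_mode[1]:.3f} Gbps")
print(f"per-core efficiency ratio: {per_core_ratio:.2f}x")
```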

The Future of High-Performance Data Processing

By combining these approaches — treating Parquet as a streaming format and leveraging zero-copy network processing — we’re redefining what’s possible in high-throughput, low-latency data processing systems.

Our implementations deliver:

Exceptional throughput
Minimal resource utilization
Direct S3 storage without intermediate persistence
Strong reliability guarantees

For organizations dealing with massive data volumes that need real-time processing, these techniques open new possibilities for building simpler, more efficient, and more cost-effective data pipelines.

If you’d like to learn more about our work at BoilingData or discuss how these approaches might apply to your data challenges, I’d love to hear from you.

Dan Forsberg is the CEO and Founder of BoilingData, a company focused on delivering high-performance data processing solutions with serverless distributed SQL over cold data on S3 with in-memory database speed.

Seriously, can AWS Lambda take streaming data?

BoilingData — Mon, 01 Apr 2024 10:55:24 GMT

This tiny little AWS Lambda can do it! 🚀

Streaming Data to S3 with Scale

TL;DR Use HTTP POST to send newline delimited JSON to a URL (Data Tap), and land your data on S3 in optimised, compressed Parquet files more cost efficiently and at higher scale than instance-, cluster-, or serverless-based counterparts.

What’s the simplest, most scalable, and most cost efficient solution to stream data into S3? With any scale from hundreds and thousands of data sources? Production ready as a first class cloud native building block? Highly Available (HA) and reacting to varying peaks and lows efficiently. Globally available, on the nearest region to the data source.

Surprisingly, it is AWS Lambda paired with an AWS Lambda Function URL. But what kind of Lambda function?

You can set up Kafka or a multitude of other streaming ingestion stacks out there, or if you have known data sources, any ELT/ETL combo that extracts data from known sources and either transforms/loads or loads/transforms it into a destination like S3. But most, if not all, of these solutions require you to do VPC peering or vendor-specific (or your own in-house) deployments into your VPC while also considering the scalability aspects.

With high requests per second (RPS), the CPU is kept busy. The bigger the scale, the more memory and CPU you need to handle connections, SSL/TLS decryption, parsing, and processing. Furthermore, with continuous load it sounds like AWS Lambda is hardly the best compute alternative, right? The engineer inside says that AWS Lambda as a compute is expensive at scale, but in this case the intuition is wrong!

I started streaming data into S3 with an AWS Lambda Function URL, Node.js, and an embedded DuckDB database engine for in-transit SQL transformations: like ETL where the E is an open URL to HTTP POST data to, e.g. from PostgreSQL CDC, or logs, events, metrics, whatever you wanted to collect in newline delimited JSON format. This was already great, and I could even push realtime metrics directly from the AWS Lambda to embedded Apache ECharts and Plotly.js charts on a web page with WebSockets (iframes that communicate together through a main-page JS “router” for session management). But intrigued by the AWS LLRT runtime, my hunger grew.

AWS Lambda Function URLs

It struck me how perfect the Lambda Function URL is for streaming ingestion!

Lambda Function URLs are AWS-managed public HTTP endpoints for AWS Lambda functions. AWS manages the infrastructure that takes care of TCP connections, SSL/TLS offloading, throttling, and buffering. What is left for you in the Lambda function is processing an HTTP payload blob, handed to you nicely in packets of at most 6 MB. You’re given horizontal scaling, dedicated vCPUs, and memory. Now, your job is to process the hot potato as fast as possible! How to efficiently process the data then? And how fast can we do it?
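From the sender's side, feeding such an endpoint is a plain HTTP POST of newline delimited JSON. The sketch below builds such a request with the Python standard library. The URL, the `Bearer` authorization scheme, the content type, and the helper name `build_tap_request` are assumptions for illustration; only the POST method, the newline delimited JSON body, and the roughly 6 MB limit come from the text.

```python
import json
import urllib.request
from typing import Iterable, Mapping

MAX_PAYLOAD_BYTES = 6 * 1024 * 1024  # approximate per-request limit noted above


def build_tap_request(url: str, token: str,
                      records: Iterable[Mapping]) -> urllib.request.Request:
    """Serialise records as newline delimited JSON and wrap them in a POST."""
    body = "".join(
        json.dumps(record, separators=(",", ":")) + "\n" for record in records
    ).encode("utf-8")
    if len(body) > MAX_PAYLOAD_BYTES:
        raise ValueError("payload exceeds the limit; split it into batches")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/x-ndjson",  # assumed content type
            "Authorization": f"Bearer {token}",      # assumed auth scheme
        },
    )


# Hypothetical URL and token; nothing is sent until urlopen() is called.
request = build_tap_request(
    "https://example.invalid/", "my-token", [{"event": "click"}, {"event": "view"}]
)
print(request.get_method(), len(request.data), "bytes")
```

Passing the request to `urllib.request.urlopen(request)` would actually send it; that call is left out so the sketch runs without network access.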

I ended up going to the bottom of it, implementing a custom AWS Lambda C/C++ runtime and handler, accompanied by a Lambda extension. This is the best that can be done, unless you start inlining assembly 😅. Of course the code can be optimised further and further 😁, and there are various things you can do to make the latency very low and steady for the clients.

DuckDB itself is C++ code, so it has a C/C++ API as well, and it embeds easily because you can get both a dynamic and a static library with a low number of dependencies. It is also reasonably small and fits nicely in Lambda.

Custom AWS Lambda C++ Runtime, Extension, and Handler

Here we describe how we built Data Taps.

The results are staggering and unbelievable. The smallest arm64 AWS Lambda with 128MB memory and arguably about 2 vCPU can handle the load smoothly and with steady latency.

AWS Lambda functions are Firecracker containers, created very fast, and once warm they keep the memory and disk contents intact between subsequent invocations (at least for some time), and you pay only for the milliseconds the code is processing the invocation. While the warm Lambda container holds your code and data in memory and on the local SSD disk, you don’t pay for it.

You have an AWS Lambda extension that hooks into the container lifecycle events, the SHUTDOWN event in our case. So, you have all the building blocks for using AWS Lambda as a stateful independent entity for data processing, without losing data. This is how logging extension libraries work too.

At the C/C++ level, an atomic append call to a file on the local filesystem is very fast (one of the building blocks for streaming systems like Kafka). This warm Lambda then gets called again and again, in our case to process the incoming data. Append the data to the file and report back: fast and efficient.
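The append step can be sketched in a few lines. The implementation described here is C/C++; this Python version only illustrates the same POSIX primitives, an `open` with `O_APPEND` followed by a single `write`. The function name, file name, and the choice to reopen the file on every call are assumptions; a real handler would more likely keep the descriptor open across invocations.

```python
import os
import tempfile


def append_record(path: str, payload: bytes) -> int:
    """Append payload with one write() on an O_APPEND descriptor.

    With O_APPEND the kernel positions every write at the current end of file,
    so concurrent appenders do not overwrite each other's data.
    """
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        return os.write(fd, payload)
    finally:
        os.close(fd)


# Hypothetical buffer file in a temporary directory.
buffer_path = os.path.join(tempfile.mkdtemp(), "buffer.ndjson")
append_record(buffer_path, b'{"event":"click"}\n')
append_record(buffer_path, b'{"event":"view"}\n')
with open(buffer_path, "rb") as handle:
    print(handle.read())
```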

At this point, even for this tiniest ARM64 128MB AWS Lambda function, the processing time is well below 2ms with small packet sizes.

Enter DuckDB. DuckDB has become a very resilient data processor, i.e. suitable for data engineering problems in addition to analytics. Its fast, multi-threaded, vector-based (and what-not) C++ engine is top class.

DuckDB has a fairly small binary size and fits nicely into AWS Lambda!

So, you need the bootstrap binary (runtime), the extension for hooking into the lifecycle events, and of course the handler code itself. Together you achieve lossless data ingestion to S3, where the left-over data that has been stacking up since the last sync is flushed to S3 in the shutdown hook or in the next sync (assuming that our code or DuckDB does not crash and that S3 is available and allows uploading the small data within the 2s window).

This buffering into AWS Lambda memory and disk, together with the scale-out capability of AWS Lambda itself, makes it a monster at sucking data in at scale! And then you have tens and hundreds of concurrent vCPUs crunching your data efficiently.

One of the remaining things is to get the data out of AWS Lambda to S3.

DuckDB can stream from source to destination with a single SQL line while doing any transformation or aggregations, compression, and data format change — from newline delimited JSON to ZSTD compressed Parquet files. The SQL clause is taken from the Lambda environment variable.
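As an illustration of what such a single SQL statement might look like, the sketch below assembles a DuckDB `COPY` statement that reads newline delimited JSON and writes ZSTD-compressed Parquet. The environment variable name `TAP_SQL_TEMPLATE`, the `$src`/`$dst` placeholders, and the helper name are hypothetical; the actual Data Taps configuration may differ. Writing to an `s3://` destination additionally requires DuckDB's httpfs extension and credentials.

```python
import os
from string import Template

# Default statement: read NDJSON, write ZSTD-compressed Parquet.
DEFAULT_SQL = Template(
    "COPY (SELECT * FROM read_json_auto('$src')) "
    "TO '$dst' (FORMAT PARQUET, COMPRESSION ZSTD)"
)


def build_flush_sql(src: str, dst: str) -> str:
    """Fill source and destination into the SQL taken from the environment."""
    raw = os.environ.get("TAP_SQL_TEMPLATE")  # hypothetical variable name
    template = Template(raw) if raw else DEFAULT_SQL
    # safe_substitute leaves any other '$' tokens in the SQL untouched.
    return template.safe_substitute(src=src, dst=dst)


print(build_flush_sql("/tmp/buffer.ndjson", "s3://my-bucket/data/part-0001.parquet"))
```

Any filtering or aggregation would go into the inner `SELECT`, so changing the transformation only means changing the environment variable, not the handler code.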

Data format conversion to columnar format together with compression reduces the data size considerably and not only saves on S3 costs, but also on further data processing.

With newline delimited JSON also comes error resiliency, as DuckDB can skip erroneous JSON without dropping the rest of the data or throwing exceptions: no poison pills.
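The reason newline delimited JSON is so forgiving is that every line is an independent document, so one bad line cannot corrupt its neighbours. DuckDB's JSON reader exposes this through its `ignore_errors` option; the standard-library sketch below is only a stand-in that shows the same idea, with a hypothetical function name.

```python
import json
from typing import Iterable, List, Tuple


def parse_ndjson(lines: Iterable[str]) -> Tuple[List[dict], int]:
    """Parse NDJSON, skipping broken lines instead of failing the whole batch."""
    records: List[dict] = []
    skipped = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # blank lines are not errors
        try:
            value = json.loads(line)
        except json.JSONDecodeError:
            skipped += 1  # a poison pill only costs us this one line
            continue
        if isinstance(value, dict):
            records.append(value)
        else:
            skipped += 1  # valid JSON, but not a record
    return records, skipped


print(parse_ndjson(['{"a": 1}', '{broken', '', '{"b": 2}']))
```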

Simple Load Test

I started AWS Load Tester and pointed it at the Data Tap Function URL (i.e. at the Lambda function described above). It’s a bit unclear to me at what frequency a single “user” sends these packets, but the test had 10 containers with 20 users each, a 2 min ramp-up period, and a 2 min hold period.

The live scatter plot ramped steadily from 0 to 200 users and varied there, peaking close to 300. One could do many more of these tests, especially with a steeper ramp-up period to see how fast AWS Lambda really responds, plus of course adding many more users.

The Lambda that holds the Function URL is multi-AZ and runs inside a VPC. The per request payload size was about 244kB.

AWS Load Tester sample run with 2 min ramp-up and 2 min hold time, 244kB request size, 10 containers, 20 users per container

Here is an example of Data Taps realtime data metrics snapshot.

Stable and steady SLA!

The solution is stable with excellent SLA and scale. You don’t have to over-provision your auto-scaling cluster to be able to serve steep traffic peaks because the container environment scale in/out is a bit slow. There are no clusters to worry about, no alarms to create, no DevOps people to keep around 24/7. It’s a single tiniest AWS Lambda function that outperforms its big old brothers, and you can see its metrics on the AWS Console Dashboard if you like.

You can also control the scale-out by using the reserved concurrency configuration option, if you like. One thing to keep in mind is that if you don’t set the reserved concurrency, the stream intake Lambda will eat the shared concurrency pool from the AWS Account Region. By default this soft limit is 1000 concurrent executions.

Cost Calculations

The most costly thing is uploading this data, as it takes more time the smaller the Lambda is. This is also the tricky part, as you need to tune the max payload size (threshold) to send to S3 to match the size of the AWS Lambda function, which determines the network bandwidth available. With the 2s budget the extension has, you still have plenty of room to process and flush the data.

The recipe to calculate costs is to e.g. use awslogs to download all the Data Tap AWS Lambda CW Logs and sum all billedDurationMs entries (e.g. with DuckDB) to see how much cost the Lambda actually incurred. Plus, not to forget, in this case especially, the number of AWS Lambda invocations made ($0.20 per million requests).
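The recipe can be sketched with the standard library as follows. The text suggests DuckDB for the summing; this version just walks JSON log lines. It is a sketch under assumptions: the per-GB-second price is the arm64 list price as I understand it and should be checked for your region, the log layout is not assumed (the helper searches for `billedDurationMs` at any nesting depth), and the function names are hypothetical. The $0.20 per million requests comes from the text.

```python
import json
from typing import Iterable, Optional

PRICE_PER_MILLION_REQUESTS = 0.20   # USD, from the text above
PRICE_PER_GB_SECOND = 0.0000133334  # USD, assumed arm64 price; verify per region


def find_billed_ms(node: object) -> Optional[float]:
    """Return the first billedDurationMs value found anywhere in a log entry."""
    if isinstance(node, dict):
        if "billedDurationMs" in node:
            return node["billedDurationMs"]
        for value in node.values():
            found = find_billed_ms(value)
            if found is not None:
                return found
    return None


def lambda_cost(log_lines: Iterable[str], memory_mb: int = 128) -> float:
    """Sum billed duration and invocation count, then apply the two prices."""
    billed_ms = 0.0
    invocations = 0
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore non-JSON log lines
        billed = find_billed_ms(entry)
        if billed is not None:
            billed_ms += billed
            invocations += 1
    gb_seconds = (billed_ms / 1000) * (memory_mb / 1024)
    request_cost = invocations / 1_000_000 * PRICE_PER_MILLION_REQUESTS
    return gb_seconds * PRICE_PER_GB_SECOND + request_cost


sample = ['{"billedDurationMs": 2}', '{"record": {"metrics": {"billedDurationMs": 3}}}']
print(f"{lambda_cost(sample):.12f} USD")
```

With millisecond-level billed durations on a 128MB function, the per-request charge tends to dominate the compute charge, which is why the invocation count should not be forgotten.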

Finally, add also all S3 API calls, e.g. with S3 Express and the stored bytes of CW Logs as well (remember CW Logs retention).

==> You can deploy Data Taps yourself and run tests and then calculate the total costs. You may not believe your eyes 👀. All the CW Logs are in JSON format and easy to process (e.g. with DuckDB). The Lambda function binary is included in the GitHub repository, along with the extension.

Achieving more than 10x cost efficiency improvement compared to the already cost-efficient AWS Firehose is quite incredible! Not to mention the simplicity of a single AWS Lambda vs. the complexity of setting up Firehose to land data as Parquet on S3, let alone doing that cross-account.

With Data Taps you can share the data ingestion rights to any other BoilingData user.

How about Security?

Data Taps is a BoilingData product. It implements and uses our C++ AWS Lambda runtime and handler. They use the AWS Lambda Function URLs and de-facto standard JWT token based authentication and ACL (Access Control List) based access control.

Every incoming packet must contain an authorization header which identifies and authenticates the user. An environment variable holds the list of Data Taps users that are allowed to send data to the Tap. By default the owner of the Tap is the only user allowed to ingest data.
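The combination of token verification and an ACL lookup can be sketched as below. This is not the Data Taps implementation: the HS256 (shared secret) signature scheme, the `email` claim name, and all function names are assumptions chosen to keep the example within the standard library. A production service would more likely verify asymmetric signatures with a maintained JWT library.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Iterable, Optional


def _b64url_decode(segment: str) -> bytes:
    """Decode base64url text whose '=' padding has been stripped."""
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def _b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def make_token(claims: dict, secret: bytes) -> str:
    """Demo-only helper that signs claims as an HS256 JWT."""
    header = _b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url_encode(json.dumps(claims).encode())
    signature = hmac.new(secret, f"{header}.{payload}".encode("ascii"),
                         hashlib.sha256).digest()
    return f"{header}.{payload}.{_b64url_encode(signature)}"


def authorize(token: str, secret: bytes, acl: Iterable[str],
              now: Optional[float] = None) -> bool:
    """True only for a validly signed, unexpired token whose user is in the ACL."""
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
        claims = json.loads(_b64url_decode(payload_b64))
        signature = _b64url_decode(signature_b64)
        signed = f"{header_b64}.{payload_b64}".encode("ascii")
    except ValueError:
        return False  # malformed token
    if not isinstance(header, dict) or not isinstance(claims, dict):
        return False
    if header.get("alg") != "HS256":
        return False  # never accept an algorithm we did not expect
    expected = hmac.new(secret, signed, hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        return False
    exp = claims.get("exp")
    current = time.time() if now is None else now
    if isinstance(exp, (int, float)) and exp < current:
        return False  # token has expired
    email = claims.get("email")  # assumed claim carrying the account username
    return isinstance(email, str) and email in set(acl)


secret = b"demo-secret"
token = make_token({"email": "alice@example.com", "exp": 2000}, secret)
print(authorize(token, secret, ["alice@example.com"], now=1000))
```

In a deployment along the lines described here, the `acl` argument would be parsed from the environment variable that holds the allowed users, with the Tap owner as the default sole entry.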

Other Data Taps users in other AWS regions or AWS accounts can send data to you if you put their BoilingData account username (email address) into the ACL. It’s an easy way to stream data between borders and the senders do not need to have an AWS Account or know anything about AWS.

Conclusion

Data Taps, our tailored C++ AWS Lambda runtime, handler, and extension accompanied by DuckDB, provides a simple and very powerful construct for data ingestion to S3. The implementation pattern and C++ code are so efficient that you don’t have to worry about the cost aspects.

AWS Lambda Function URL characteristics like TCP connection pooling and throttling, SSL/TLS offloading, and buffering make it possible to achieve stable and very low latency, highly cost-efficient streaming data processing with AWS Lambda.

AWS Lambda scalability is unparalleled and together with Function URLs you can also control the max scale out limits.

A single function can burst to 1,000 concurrent executions instantly. It can then scale by a further 1,000 concurrent executions every 10 seconds.
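The scaling rule quoted above, combined with a concurrency cap, can be written as a small function. This is a simplified model for illustration only: the parameter names are hypothetical, the default cap of 1,000 is the default account soft limit mentioned earlier, and real scaling behaviour depends on AWS-side details not covered here.

```python
def max_concurrency(seconds: float, limit: int = 1000, burst: int = 1000,
                    step: int = 1000, step_seconds: float = 10.0) -> int:
    """Upper bound on concurrent executions `seconds` after a traffic spike."""
    if seconds < 0:
        raise ValueError("seconds must be non-negative")
    # Instant burst, then one further step for every completed interval.
    ramp = burst + step * int(seconds // step_seconds)
    # Reserved concurrency or the account limit caps the ramp.
    return min(limit, ramp)


for t in (0, 25, 60):
    print(t, max_concurrency(t, limit=5000))
```

Setting `limit` to a function's reserved concurrency shows how that setting bounds the scale-out regardless of how long the spike lasts.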

Data Taps is much simpler and much more cost efficient than e.g. its counterpart AWS Firehose, while it also allows you to define data transformation and filtering with SQL.

You can Deploy Data Taps into your own AWS Account!

Data Taps is BYOC (Bring Your Own (AWS) Cloud) ready. You can deploy it into your AWS Account and the data plane stays with you. You don’t need to worry about VPC peerings or data plane routing across AZs, not to mention load balancers.

We have an AWS SAM template which you can use to deploy Data Taps to your AWS account and start ingesting data to S3 right away.

For more details, please see Data Taps homepage: https://www.taps.boilingdata.com/

As a follow-up, there is bdcli, our command line tool that utilises a write-through cache on deployments into AWS to achieve deployment speeds that you’re not used to! It can deploy Data Taps in a couple of seconds, so you don’t have to wait for minutes or half an hour when your deployment stack gets stuck for some reason and starts snoring.

You can use bdcli to fetch Data Tap tokens when you want to send data to Taps, either your own or some other Taps shared to you (i.e. the Data Tap ACL includes your account email address).

You can use our JS SDK in the browser or in Node.js to acquire authorization tokens for sending data to your own and others’ Data Taps.

BoilingData Inc.!

BoilingData — Mon, 11 Sep 2023 07:33:19 GMT

I created BoilingData Inc. this summer and switched full-time on it less than a month ago. I’m ..

Happy as an engineer that has the possibility to follow his passion this way as well!

Excited with never ending hunger for learning new, building, and innovating for greatly improving lives for people working with data — for secure, distributed, and networked data processing.

Puzzled and humbled as an entrepreneur, realising technology is only one part of a company or a business (— while I’ve had my pet company isecure.fi for secure bank connections to multiple banks with single API for many years already, this one is different.)

Scared like a duck in an almost frozen pond who forgot to fly south and can’t read a compass: I’m currently based in Helsinki and autumn is coming.

Let data flow to the bottoms of lakes, bubble up, stream with energy and delight when Boiled.. eh, needed!

BoilingData

Ancient history — foundation

I was working in London for an innovative big data company about 4 years ago when I found out about DuckDB while searching for an embedded SQL database to run analytics in AWS Lambda. That path has been a side story until now. You can read some of the steps in this blog as well. In that company we ran a constant load of thousands of lines of SQL query graphs with tens of nodes on in-memory-catalog EMR Presto/Trino (micro to small), Athena (medium), and yet bigger EMR clusters (large to XL workloads). This serverless system survived an AWS outage without problems thanks to proper queueing, a pull model, SRE, etc., and it allowed adding/removing EMR clusters ad hoc. We even used a Presto module for switching from a query result polling model to callback events. But still there was much room for better efficiency and simplicity..

My background is a bit broader than traditional data or SW engineering. I spent my early career years at Nokia Research Center developing and leading research projects, mostly related to mobility and security protocols in the area of datacom (IETF) and telecom (3GPP). For example, the 4G security key management protocols supporting mobility are my design, which I’m quite happy about.

After Nokia, I worked in a smaller startup that was all-in on AWS, containers, etc. long before the cloud security properties were as good as they are now. We achieved and ran PCI DSS Level 1 for a managed payment gateway and payment terminal software service with tens of thousands of payment terminals, long before the term SRE was born. The team had excellent competencies, the owners were coders, and PCI DSS taught us holistic security practices throughout the SDLC (we were lucky with competent auditors), forming a foundation that lasts long.

I remember reworking a main EMV kernel payment loop of a client payment terminal SW written in C at 4am through the night with one of the founders.. I was crazy, but you go with the flow.. and the app became more stable and grew older.

Back at the university we coded mobile IPv4 with hierarchical routing agents (in C), which is still close to my heart due to its distributed nature and no-packet-loss implementation, our tribute to Linux. We demoed music streaming without buffering with tens of hierarchical handoffs per second without glitches with multiple WLAN APs. Back then in a demo, we had almost 10 people from Nokia around the table staring at us and at my low-end stereos and a pile of white plastic boxes.

My most recent job was in a security company that processes half of the US household’s worth of broadband data and more.. — true real-time enterprise big data streaming along with lots of various batch processing data pipelines. A scale where the architectures and tools become different to what you start with.

Each of these experiences (and much more) warms my heart, sparks my interest and enthusiasm. You work with something that tingles your fingers, makes you smile, and you create something new on top hoping to create valuable assets for yourself and for others. In essence, you automate so you can move to other stuff.

For me DuckDB is something like this too, it has exceptional people behind it, good traction and direction, and they have already innovated a lot. They challenge many areas.

This is the foundation of Boiling

I’m bringing these aspects together — and it has progressed quite well already even as a side track. Now with full focus, I can work with You and we can shape Boiling to become.. ..in the future!

Yesterday

I’ve felt the pain of Data Engineering/Science/ML Teams a few times: complexity, non-productisation, mixed-up layers, scalability and reliability issues, wrong or bad tooling, insufficient practices and processes, missing metadata/semantics, low or negative ROI, no goals or directions, etc. All of you working in this domain can relate to this and more, and.. it is just normal. It’s a somewhat challenging domain and typically lacks the experience that is present in “proper” SW Dev projects. Too fragmented and complex, too big a working set for human minds, like many times in this wider industry.

Today

Today Boiling runs SQL queries on AWS Lambda functions in a distributed and stateful manner, and supports an httpfs+parquet interface for query result transport, e.g. for querying directly from within DuckDB with JOINs, or with plain curl, if you like. It’s fully serverless in the true sense, as we run on top of AWS Lambda functions. However, AWS Lambda is not an obsession for Boiling, but a fairly challenging compute layer with nice properties and fulfilled promises.

Tomorrow .. with You!

I’ve jokingly said that “Tomorrow’s routers will speak SQL”, but with engines like DuckDB and SQLite etc. it’s not that far fetched. It’s anyway about transforming and moving bits around 😉

I believe the fragmented world of big data and analytics can be made simpler, leaner, easier, and as a result, much more cost efficient — much closer to real time data processing than what it currently is, less data gravity defying, and more shared, reliable, and at the same time with improved security.

I feel the big data industry has stagnated for “ten” years, if not more, and hundreds if not thousands of data startups try/tried to find yet another niche while increasing the complexity of the overall puzzle. But maybe I’m just puzzled myself and as a result trying to create yet another puzzle piece 😆. However, as an architect I want to simplify, automate, improve, package, productise, deliver and liberate — and have fun while kicking the tires of big neighbouring trucks.

Perhaps, we’ll have a world where data mesh and data fabric are closer to reality (practical). Where data flows and query distributions have a new meaning. A world where operational and analytical data processing can co-exist without friction and within the same service team. A world with less complexity. A world where data and internet protocols are not that far away from each other — where security is easier to understand and control. A world, where Boiling gives back to businesses, companies, engineers, analysts, you.

When I joined uni back a “few years ago”, we had to create home pages. Maybe tomorrow, students will write Data Taps. Taps that connect with each other, with AI, and with people (hmm, …maybe I was brainwashed at Nokia).

Please join me and let’s make it reality together!

I’m looking for design partners, customers, and people who share the passion and joy for making data and compute serve us and not the other way around 😉. You need to be in control and Boiling is there to help!

Beat (and blame) the machines, love and respect people!

BoilingData Founder, Dan Forsberg, Ph.D,
Proud father of 3 sons (and one extraordinary Parson Russell Terrier), husband (not a father of a husband yet), and researcher-builder.

You can register here and use the SDK and bdcli client:
- https://app.boilingdata.com/
- https://github.com/boilingdata/node-boilingdata
- https://github.com/boilingdata/boilingdata-bdcli

BoilingData is a member of DuckDB Foundation

Redefining SaaS Analytics: Dynamic, Real-time Insights at a Fraction of the Cost

BoilingData — Mon, 27 Mar 2023 16:48:47 GMT

We are excited to announce a major milestone for BoilingData: the successful validation of our system on the very dataset that prompted us to explore the question, “How can we decrease hot data querying costs while delivering real-time, dynamic analytics?” We are pleased to report that our system has demonstrated its effectiveness in achieving this objective.

The Data Challenge

In collaboration with a MarTech SaaS company that provides marketing insights from comprehensive web crawls, we faced a data challenge. The company crawls many of the largest websites on the internet, and the data is evaluated against hundreds of rules to identify issues or suggest optimizations. Over time, this company has accumulated a data warehouse that contains several hundred terabytes of crawl data.

A primary challenge for this company is finding a balance between ensuring data availability for arbitrary client analysis while managing the associated costs. This usage pattern is common in B2B SaaS: clients expect complete, flexible data access at all times, yet predicting their login times, data interests, and required insights is difficult. Moreover, client users typically analyze only a small portion of their data at any given time. This unpredictability of specific data needs adds complexity to finding a storage solution that offers both flexibility and cost-efficiency.

To facilitate flexibility, the company stored all raw data from the past 90 days in a large data cluster and pre-calculated over 300 insights for each client-defined data segment to enable low latency upon initial login. However, this expensive and cumbersome approach did not adequately provide clients with the dynamic data filtering and segmented insights they required.

This architecture poses several challenges:

Storing vast amounts of historical data in an always-on analytics cluster, such as ElasticSearch, for immediate access is prohibitively expensive.
Despite retaining all data from the past 90 days in memory, we estimated that over 85% of it was never accessed or aggregated.
Responsive dashboarding necessitates calculating aggregates at the end of each crawl (using an ETL paradigm), but this method does not offer dynamic filtering and segmentation, resulting in end-user frustration.
To reduce cost overhead, data is archived after 3 months. Users can trigger an unarchive function to rehydrate it into the live database; however, this process can potentially take hours or days, depending on the dataset, leading to a subpar user experience.
If an end-user modifies their data segment definitions, constructing long-term trends of insights is impossible without rehydrating several terabytes of historical data into the live database.

These data-related challenges hindered product development and resulted in sub-optimal user experiences. Various solutions, including Athena, Presto, and Redshift, were explored to increase flexibility and decrease costs, but they all had limitations.

BoilingData Emerges

Our founding team was captivated by DuckDB’s 2019 launch, recognizing its potential to address the challenges they faced. DuckDB’s performance in delivering analytics from columnar data using an embeddable binary made it an ideal foundation for a highly scalable analytics platform, particularly when combined with our expertise in scaling systems using AWS Lambda.

To achieve this, we addressed several obstacles:

Lambda is stateless, making it unsuitable for analytics sessions where data loading and network speed are bottlenecks. We designed a method for keeping data warm and in-memory for extended periods, effectively addressing this issue.
Even if data remains active in a Lambda instance’s memory, subsequent queries are not guaranteed to be routed to that specific instance, resulting in significant cache misses. Our query planning and routing layer addresses this issue.
Lambdas have limited power and network capabilities. To create an effective querying system, a query must utilize the power of hundreds or thousands of individual Lambda instances. We developed a network optimized distributed query system to enable distributed querying.

Addressing the Original Problem

With all components in place, we can now address the initial problem.

Rather than focusing on optimizing data lifecycle management within a data warehouse to facilitate interactive analytics, we recognized the potential of creating a new serverless data warehouse paradigm where each client’s data would have its own transient and highly elastic data warehouse. The data boundaries and access rules can be defined and modified at any time, but we only initiate compute resources at the moment of query. This approach allows us to transfer hundreds of GBs of data to hundreds of Lambdas in under 7 seconds and promptly put the warehouse cluster to sleep the instant a query has completed execution.

We use a sample web crawl containing 192GB of ZSTD Parquet compressed data about millions of webpages and hundreds of millions of links. This is a large dataset by most standards, but because an end-user may only analyze a given crawl a few times in a 3-month period, it becomes incredibly wasteful to keep it hot in a live database for that entire time: If this single crawl were kept live in a “Hot Tier” cluster from Elastic.co, it would cost $5,000 to keep live for that period (ES expands the dataset to 850GB per copy, or 2560GB in a standard HA configuration).

Using BoilingData, we can generate this crawl’s full insight set from cold data in S3 40% faster than with AWS Athena. Once the data has been loaded into a cluster of AWS Lambda instances, subsequent generations are approximately 96% faster due to BoilingData’s warm, in-memory data storage.

For this crawl, we can produce all 300 insights for 20 user-defined segments in under two seconds. This allows end-users to dynamically modify their segment definitions on the fly and receive long-term trends from all insights immediately. For B2B SaaS businesses, this means they can offer more flexible insights and fast analysis of more data at lower costs than keeping everything in an always-on cluster.

What’s Next?

Although we are excited about solving this initial problem, it is just the beginning. We have an ambitious roadmap ahead, featuring improvements and new features such as:

Enabling BoilingData as a query acceleration layer for businesses that only need faster analytics without replacing their entire data stack.
Allowing BoilingData to be used as a data source in popular BI tools such as Tableau and Power BI.
Deploying our innovative distributed SQL JOIN model to allow analytics queries across relational datasets.
Lifting data out of more data sources and streams, such as Kafka topics, to undertake real-time aggregations.

As we continue to refine BoilingData, our goal remains the same: to provide an innovative, cost-effective, and efficient solution for real-time analytics that not only meets but exceeds the expectations of clients of SaaS businesses. Stay tuned for future updates on our progress as we work to revolutionize the way businesses access and analyze their data.

If you would like to hear more about BoilingData or to run a quick demo on your data, contact us at info@boilingdata.com.

Lightning fast aggregations by distributing DuckDB across AWS Lambda functions

BoilingData — Tue, 03 Jan 2023 19:55:00 GMT

DuckDB is rapidly changing the way data scientists and engineers work. Its efficient and internally parallelised architecture means that a single querying node often out-competes entire clusters of more traditional query engines.

But out of the box, DuckDB needs to be run on a single node, meaning the hardware naturally limits performance. The typical way of solving this in a production environment involves scaling out or scaling up your DuckDB infrastructure: increasing the size of your querying instance or distributing concurrent query jobs to different hardware. Traditionally, scaling out or up is slow and expensive, and involves waiting for an upstream provider to allocate resources and start new servers.

The time and cost of scaling DuckDB infrastructure is a good trade-off for many analytical workloads — particularly if you are generating and pre-caching analytics for later consumption. This only works for some businesses in 2023: these days, our users expect more than rigid and overly optimised analytics — they want to dynamically filter, generate trends over long periods, and analyse data in real-time.

Distributing queries across an AWS Lambda DuckDB cluster

Serverless computing presents an opportunity to solve both the cost and the cold start problems. AWS Lambda instances are relatively small and underpowered (6 CPU cores, 10GB RAM, max 0.5Gbps network throughput), but have a core benefit of being charged per millisecond and having almost no time or cost overhead when starting, pausing, and resuming instances.

If we can distribute our dataset across 3,000 Lambda instances, data will be downloaded at 1.5Tbps, and we’ll utilise 18,000 CPUs/30TB RAM (10GB Lambdas) to execute queries.
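The aggregate figures above follow directly from the per-instance limits quoted earlier. A minimal sketch of that arithmetic, where `fleet_capacity` is a hypothetical helper name and the per-Lambda numbers are the ones given in this article:

```python
# Back-of-the-envelope aggregate capacity of a Lambda fleet.
# Per-instance figures are the ones quoted in the article.
LAMBDA_CORES = 6
LAMBDA_RAM_GB = 10
LAMBDA_NET_GBPS = 0.5


def fleet_capacity(instances: int) -> dict:
    """Return aggregate cores, RAM (TB) and network throughput (Tbps)."""
    return {
        "cores": instances * LAMBDA_CORES,
        "ram_tb": instances * LAMBDA_RAM_GB / 1000,
        "net_tbps": instances * LAMBDA_NET_GBPS / 1000,
    }


capacity = fleet_capacity(3000)
print(capacity)  # {'cores': 18000, 'ram_tb': 30.0, 'net_tbps': 1.5}
```

These are theoretical ceilings: they assume every instance reaches its maximum network throughput at the same time.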

Suddenly, it becomes possible to keep all our data in cold storage and rapidly load it into a serverless DuckDB cluster at query time for a real-time interactive analysis session.

Distributed DuckDB on Lambda vs. single instance DuckDB benchmarks — NYC Yellow Taxi Dataset

To create a benchmark, we used the popular NYC Yellow taxi dataset GZip parquet files from the years 2011–2022 (as they have a common schema). The dataset has about 1.34b rows and is a challenging dataset to run queries over: it is compressed with Gzip, which is not the best algorithm from CPU and network perspective, and the Parquet files have many row groups, which slows down OLAP queries. A more optimised version of the dataset would run faster queries, for instance, with larger row groups and ZSTD compression.

The compressed Parquet files are 18GB across 141 files. This is a total of 118GB uncompressed CSV data (without CSV escapes or hyphens over columns — a 6.56x compression ratio). Downloading the files to the laptop from S3 with an average speed of 30MB/s takes 10 minutes, and to an EC2 instance in the same region as the data takes 50 seconds.
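The ratio and the download times can be sanity-checked with a few lines of arithmetic. This sketch assumes decimal units for the laptop case (18GB = 18,000MB) and binary units for the EC2 case (18GiB at the 368MiB/s measured in the EC2 section later in this article); the article does not state which convention it uses.

```python
COMPRESSED_GB = 18
UNCOMPRESSED_CSV_GB = 118


def download_seconds(size_mb: float, rate_mb_per_s: float) -> float:
    """Time to transfer size_mb at a constant rate."""
    return size_mb / rate_mb_per_s


ratio = UNCOMPRESSED_CSV_GB / COMPRESSED_GB
laptop_s = download_seconds(COMPRESSED_GB * 1000, 30)   # assumed decimal MB
ec2_s = download_seconds(COMPRESSED_GB * 1024, 368)     # assumed MiB and MiB/s

print(f"compression ratio: {ratio:.2f}x")
print(f"laptop download:   {laptop_s / 60:.0f} min")
print(f"EC2 download:      {ec2_s:.0f} s")
```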

We ran two queries on each system: first, a query that can be easily distributed while combining the results at the end (i.e. COUNT and SUM functions). The second query uses average (AVG), for which we calculate accurate results by running the query over all the contributing rows in one pass. This requires shuffling on the Boiling side, which we have implemented in a network and Lambda-optimised manner. The shuffling is implemented as a generic mechanism to address distributed queries that are not easily combinable.
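To illustrate the difference between the two query types, here is a minimal sketch. It is not BoilingData's implementation: the two "worker" partitions and the names `Partial`, `aggregate_partition`, and `combine` are hypothetical. It shows that per-worker COUNT and SUM results can simply be added together, while averaging the per-worker averages gives the wrong answer unless the rows are brought together first.

```python
from dataclasses import dataclass


@dataclass
class Partial:
    count: int
    total: float


def aggregate_partition(values):
    """What one worker computes over its own slice of the data."""
    return Partial(count=len(values), total=sum(values))


def combine(partials):
    """COUNT and SUM are combinable: partial results are just added up."""
    return Partial(
        count=sum(p.count for p in partials),
        total=sum(p.total for p in partials),
    )


# Two workers holding unevenly sized slices of the same group.
partitions = [[1.0, 2.0, 3.0], [10.0]]
partials = [aggregate_partition(p) for p in partitions]
merged = combine(partials)

# Naively averaging the per-worker averages weights both workers equally.
naive_avg = sum(p.total / p.count for p in partials) / len(partials)

# Running over all contributing rows in one pass gives the exact result.
all_rows = [v for part in partitions for v in part]
exact_avg = sum(all_rows) / len(all_rows)

print(merged)     # Partial(count=4, total=16.0)
print(naive_avg)  # 6.0
print(exact_avg)  # 4.0
```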

We wanted these tests to reflect “real world” conditions, so the “Laptop” tests started with the entire dataset already downloaded locally onto the SSD. In contrast, the Boiling and EC2 tests started completely cold and began by downloading data from the object store (S3).

1. Simple combinable query

Laptop with SSD

We ran DuckDB v0.6.1 on OSX (MacBook Pro 2019, 2.3GHz 8-core i9, 16GB DDR4, 1TB SSD).

v0.6.1 919cad22e8
D SELECT COUNT(*) AS total5, 
         strftime(tpep_pickup_datetime, '%Y-%m-%d') AS pickup4 
    FROM parquet_scan('*.parquet') 
   WHERE CAST(strftime(tpep_pickup_datetime, '%Y') AS INTEGER) 
         BETWEEN 2011 AND 2022 
GROUP BY pickup4 ORDER BY pickup4 LIMIT 20;
┌────────┬────────────┐
│ total5 │  pickup4   │
│ int64  │  varchar   │
├────────┼────────────┤
│ 183031 │ 2011-01-01 │
│ 158706 │ 2011-01-02 │
│ 374927 │ 2011-01-03 │
│ 407012 │ 2011-01-04 │
│ 439070 │ 2011-01-05 │
│ 476390 │ 2011-01-06 │
│ 486988 │ 2011-01-07 │
│ 500272 │ 2011-01-08 │
│ 451117 │ 2011-01-09 │
│ 449946 │ 2011-01-10 │
│ 471811 │ 2011-01-11 │
│ 361127 │ 2011-01-12 │
│ 488565 │ 2011-01-13 │
│ 512868 │ 2011-01-14 │
│ 485503 │ 2011-01-15 │
│ 446699 │ 2011-01-16 │
│ 386391 │ 2011-01-17 │
│ 442217 │ 2011-01-18 │
│ 477235 │ 2011-01-19 │
│ 501964 │ 2011-01-20 │
├────────┴────────────┤
│ 20 rows   2 columns │
└─────────────────────┘
Run Time (s): real 35.152 user 507.912590 sys 3.608308

DuckDB took 35 seconds to run this query.

EC2 instance (c6a.metal)

The query was run on a c6a.metal instance in the same region as the data. This machine has 192 CPU cores and 384GB RAM, and is the largest instance in the c6 range, coming in at $7.34 per hour.

The initial data download from S3 ran at 368MiB/s and took 50 seconds.

v0.6.1 919cad22e8
D SELECT COUNT(*) AS total5, 
         strftime(tpep_pickup_datetime, '%Y-%m-%d') AS pickup4 
    FROM parquet_scan('*.parquet') 
   WHERE CAST(strftime(tpep_pickup_datetime, '%Y') AS INTEGER) 
         BETWEEN 2011 AND 2022 
GROUP BY pickup4 ORDER BY pickup4 LIMIT 20;
┌────────┬────────────┐
│ total5 │  pickup4   │
│ int64  │  varchar   │
├────────┼────────────┤
│ 183031 │ 2011-01-01 │
│ 158706 │ 2011-01-02 │
│ 374927 │ 2011-01-03 │
│ 407012 │ 2011-01-04 │
│ 439070 │ 2011-01-05 │
│ 476390 │ 2011-01-06 │
│ 486988 │ 2011-01-07 │
│ 500272 │ 2011-01-08 │
│ 451117 │ 2011-01-09 │
│ 449946 │ 2011-01-10 │
│ 471811 │ 2011-01-11 │
│ 361127 │ 2011-01-12 │
│ 488565 │ 2011-01-13 │
│ 512868 │ 2011-01-14 │
│ 485503 │ 2011-01-15 │
│ 446699 │ 2011-01-16 │
│ 386391 │ 2011-01-17 │
│ 442217 │ 2011-01-18 │
│ 477235 │ 2011-01-19 │
│ 501964 │ 2011-01-20 │
├────────┴────────────┤
│ 20 rows   2 columns │
└─────────────────────┘
Run Time (s): real 4.065 user 302.831 sys 1.822

The initial query from cold S3 data took 50s + 4s = 54s

BoilingData with 61 Lambdas

The query was run on Boiling, and the query planner chose to run it over 61 Lambdas concurrently. The initial query from all-cold data on S3 took 0.8s + 6.7s = 7.5s. In other words, we didn’t import the existing data in any way or pre-warm any Lambdas before we hit them with the query.

We reran the query, using the Lambda cluster that was warmed in the last test. The results were much better: 1.3s — 1.7s. In this query, the column was changed to bypass in-memory result caches.

2. Query with an additional column (avg)

Laptop with SSD

v0.6.1 919cad22e8
D SELECT COUNT(*) AS total5, round(AVG(trip_distance),4) AS adist, 
         strftime(tpep_pickup_datetime, '%Y-%m-%d') AS pickup4 
    FROM parquet_scan('*.parquet') 
   WHERE CAST(strftime(tpep_pickup_datetime, '%Y') AS INTEGER) 
         BETWEEN 2011 AND 2022 
GROUP BY pickup4 ORDER BY pickup4 LIMIT 20;
┌────────┬────────┬────────────┐
│ total5 │ adist  │  pickup4   │
│ int64  │ double │  varchar   │
├────────┼────────┼────────────┤
│ 183031 │ 3.0448 │ 2011-01-01 │
│ 158706 │  3.169 │ 2011-01-02 │
│ 374927 │ 2.8845 │ 2011-01-03 │
│ 407012 │  2.716 │ 2011-01-04 │
│ 439070 │ 2.6061 │ 2011-01-05 │
│ 476390 │ 2.5925 │ 2011-01-06 │
│ 486988 │ 2.5055 │ 2011-01-07 │
│ 500272 │ 2.5656 │ 2011-01-08 │
│ 451117 │ 2.8534 │ 2011-01-09 │
│ 449946 │ 2.6484 │ 2011-01-10 │
│ 471811 │ 2.5189 │ 2011-01-11 │
│ 361127 │  2.398 │ 2011-01-12 │
│ 488565 │  2.647 │ 2011-01-13 │
│ 512868 │ 2.6572 │ 2011-01-14 │
│ 485503 │ 2.6565 │ 2011-01-15 │
│ 446699 │ 2.8195 │ 2011-01-16 │
│ 386391 │ 2.8472 │ 2011-01-17 │
│ 442217 │ 2.5776 │ 2011-01-18 │
│ 477235 │ 2.6019 │ 2011-01-19 │
│ 501964 │ 2.6306 │ 2011-01-20 │
├────────┴────────┴────────────┤
│ 20 rows            3 columns │
└──────────────────────────────┘
Run Time (s): real 39.741 user 561.296748 sys 5.281544

DuckDB took just under 40 seconds to run this query.

EC2 Instance

v0.6.1 919cad22e8
D SELECT COUNT(*) AS total5, round(AVG(trip_distance),4) AS adist, 
         strftime(tpep_pickup_datetime, '%Y-%m-%d') AS pickup4 
    FROM parquet_scan('*.parquet') 
   WHERE CAST(strftime(tpep_pickup_datetime, '%Y') AS INTEGER) 
         BETWEEN 2011 AND 2022 
GROUP BY pickup4 ORDER BY pickup4 LIMIT 20;
┌────────┬────────┬────────────┐
│ total5 │ adist  │  pickup4   │
│ int64  │ double │  varchar   │
├────────┼────────┼────────────┤
│ 183031 │ 3.0448 │ 2011-01-01 │
│ 158706 │  3.169 │ 2011-01-02 │
│ 374927 │ 2.8845 │ 2011-01-03 │
│ 407012 │  2.716 │ 2011-01-04 │
│ 439070 │ 2.6061 │ 2011-01-05 │
│ 476390 │ 2.5925 │ 2011-01-06 │
│ 486988 │ 2.5055 │ 2011-01-07 │
│ 500272 │ 2.5656 │ 2011-01-08 │
│ 451117 │ 2.8534 │ 2011-01-09 │
│ 449946 │ 2.6484 │ 2011-01-10 │
│ 471811 │ 2.5189 │ 2011-01-11 │
│ 361127 │  2.398 │ 2011-01-12 │
│ 488565 │  2.647 │ 2011-01-13 │
│ 512868 │ 2.6572 │ 2011-01-14 │
│ 485503 │ 2.6565 │ 2011-01-15 │
│ 446699 │ 2.8195 │ 2011-01-16 │
│ 386391 │ 2.8472 │ 2011-01-17 │
│ 442217 │ 2.5776 │ 2011-01-18 │
│ 477235 │ 2.6019 │ 2011-01-19 │
│ 501964 │ 2.6306 │ 2011-01-20 │
├────────┴────────┴────────────┤
│ 20 rows            3 columns │
└──────────────────────────────┘
Run Time (s): real 4.621 user 325.146 sys 2.284

Initial query from cold S3 data: 50s + 4.6s = 54.6s

BoilingData

Initial query from all cold S3 data: 3.6s + 14s = 17.6s.

BoilingData cold queries incur warming up time when data is loaded in parallel from S3. This query ran shuffling due to the AVG aggregation function.

Further queries on the warmed Lambda cluster took 6.5s — 8s. Boiling routes subsequent queries to Lambdas which already have the data in memory, so the query can be changed to run different aggregations.

BoilingData warm query times are much faster as the data is in an on-demand, dedicated, and distributed in-memory database layer

Cloud managed Presto query times

  SELECT COUNT(*) AS total5, round(AVG(trip_distance),4) AS adist, 
         date_format(tpep_pickup_datetime, '%Y-%m-%d') AS pickup4 
    FROM "boilingdata-benchmark"."nyc6trip_data" 
   WHERE CAST(date_format(tpep_pickup_datetime, '%Y') AS INTEGER) 
         BETWEEN 2011 AND 2022 
GROUP BY date_format(tpep_pickup_datetime, '%Y-%m-%d') 
ORDER BY date_format(tpep_pickup_datetime, '%Y-%m-%d') LIMIT 20;

For comparison, we ran the queries on a popular cloud-managed Presto service. The complex query queued for 135ms and had a total runtime of 77s to 124s (data scanned: 5.54GB). The more straightforward (combinable) query queued for 260ms and had a total runtime of 32s to 40s (data scanned: 3.85GB).

Summary of test results

Using an instantly scalable serverless DuckDB cluster, it is easy to beat a Laptop as it is bound by SSD, memory, and available cores. With any sized dataset, Boiling will allocate many Lambdas to provide more pure raw CPU power for decompressing the GZip parquet files whilst overcoming the incurred latency from the shuffling and coordination.

The challenging parts are mastering the distributed Lambda compute layer, implementing (and running) a performant shuffle on top of it that actually works, and running subsequent queries over the warmed in-memory data set so that S3 is not touched every single query.

Downloading 18GB from S3 to a laptop takes a long time, and S3 egress costs come in at $0.50 to $0.90. Even though downloading is faster on the EC2 instance, 50 seconds is a slow cold start for an interactive dashboarding session, and costly at roughly $0.10 using this instance.
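The EC2 figure can be reproduced from the hourly price quoted earlier for the c6a.metal instance. A small sketch, assuming simple pro-rata billing by the second (`ec2_cost_usd` is a hypothetical helper, not an AWS API):

```python
def ec2_cost_usd(hourly_rate_usd: float, seconds: float) -> float:
    """Pro-rata cost of keeping an instance running for a number of seconds."""
    return hourly_rate_usd * seconds / 3600


# $7.34 per hour for c6a.metal, 50 seconds spent downloading from S3.
download_cost = ec2_cost_usd(7.34, 50)
print(f"${download_cost:.3f}")
```

This covers only the download window; any minimum billing period or idle time before and after the session would add to it.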

Our queries didn’t touch all the columns, so downloading the entire dataset to the machine is wasteful (if all the data is not consumed). However, downloading that data from S3 in an optimal manner with distributed Lambdas achieves high aggregated throughput without AWS egress fees.

Distributed combinable queries are much faster as they don’t require shuffling

From cold, BoilingData is between 3x and 7x faster than querying using an EC2 instance. Once data has been warmed, BoilingData excels in combinable queries (3x faster), whereas EC2 wins the race for more complex queries that require us to shuffle data between Lambdas (1.4x slower).
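These ratios can be re-derived from the timings reported in the benchmark sections. The sketch below assumes that the warm EC2 time is the DuckDB run time on the already-downloaded files, and that the cold EC2 time adds the 50s download on top of it.

```python
# Timings in seconds, copied from the benchmark sections of this article.
EC2_DOWNLOAD = 50.0
EC2_QUERY = {"combinable": 4.065, "avg": 4.621}
BOILING_COLD = {"combinable": 7.5, "avg": 17.6}
BOILING_WARM = {"combinable": (1.3, 1.7), "avg": (6.5, 8.0)}  # (best, worst)

# Cold: EC2 pays for the download plus the query.
cold_speedup = {
    name: (EC2_DOWNLOAD + EC2_QUERY[name]) / BOILING_COLD[name]
    for name in EC2_QUERY
}

# Warm: ratio of EC2 time to Boiling time; values below 1 mean EC2 is faster.
warm_speedup = {
    name: tuple(EC2_QUERY[name] / t for t in BOILING_WARM[name])
    for name in EC2_QUERY
}

for name in EC2_QUERY:
    best, worst = warm_speedup[name]
    print(f"{name}: cold {cold_speedup[name]:.1f}x, warm {worst:.2f}x to {best:.2f}x")
```

The cold ratios come out at about 7.2x and 3.1x. For the warm AVG query the best-case ratio is about 0.71, which is the same as saying Boiling is roughly 1.4x slower than EC2.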

We wanted to test whether BoilingData can compete with DuckDB running on a laptop and on a top-of-the-line EC2 instance, and even with this small dataset, it can.

Want to learn more about how we distribute queries across Lambda? Check out our latest post:

Redefining SaaS Analytics: Dynamic, Real-time Insights at a Fraction of the Cost

Postscript: Data we didn’t include

In this test, we used DuckDB’s parquet_scan function to query the data files — this function scans through the source data without materialising it into a table, creating indexes, etc. One argument against using parquet_scan is that (in theory) allowing DuckDB to create an optimised table will mean that subsequent queries will run faster.

We tested this on the c6a.metal instance by creating an in-memory DuckDB table and running these queries against it, but in the end, it added a full minute for table creation and didn’t improve the query speed at all:

D CREATE TABLE test AS SELECT * FROM '*.parquet';
100% ▕████████████████████████████████████████████████████████████▏
Run Time (s): real 63.265 user 538.439722 sys 1380.067524

D SELECT COUNT(*) AS total5,
>          strftime(tpep_pickup_datetime, '%Y-%m-%d') AS pickup4
>     FROM test
>    WHERE CAST(strftime(tpep_pickup_datetime, '%Y') AS INTEGER)
>          BETWEEN 2011 AND 2022
> GROUP BY pickup4 ORDER BY pickup4 LIMIT 20;
┌────────┬────────────┐
│ total5 │  pickup4   │
│ int64  │  varchar   │
├────────┼────────────┤
│ 183031 │ 2011-01-01 │
│ 158706 │ 2011-01-02 │
│ 374927 │ 2011-01-03 │
│ 407012 │ 2011-01-04 │
│ 439070 │ 2011-01-05 │
│ 476390 │ 2011-01-06 │
│ 486988 │ 2011-01-07 │
│ 500272 │ 2011-01-08 │
│ 451117 │ 2011-01-09 │
│ 449946 │ 2011-01-10 │
│ 471811 │ 2011-01-11 │
│ 361127 │ 2011-01-12 │
│ 488565 │ 2011-01-13 │
│ 512868 │ 2011-01-14 │
│ 485503 │ 2011-01-15 │
│ 446699 │ 2011-01-16 │
│ 386391 │ 2011-01-17 │
│ 442217 │ 2011-01-18 │
│ 477235 │ 2011-01-19 │
│ 501964 │ 2011-01-20 │
├────────┴────────────┤
│ 20 rows   2 columns │
└─────────────────────┘
Run Time (s): real 3.936 user 417.322142 sys 8.984839

D SELECT COUNT(*) AS total5, round(AVG(trip_distance),4) AS adist,
>          strftime(tpep_pickup_datetime, '%Y-%m-%d') AS pickup4
>     FROM test
>    WHERE CAST(strftime(tpep_pickup_datetime, '%Y') AS INTEGER)
>          BETWEEN 2011 AND 2022
> GROUP BY pickup4 ORDER BY pickup4 LIMIT 20;
100% ▕████████████████████████████████████████████████████████████▏
┌────────┬────────┬────────────┐
│ total5 │ adist  │  pickup4   │
│ int64  │ double │  varchar   │
├────────┼────────┼────────────┤
│ 183031 │ 3.0448 │ 2011-01-01 │
│ 158706 │  3.169 │ 2011-01-02 │
│ 374927 │ 2.8845 │ 2011-01-03 │
│ 407012 │  2.716 │ 2011-01-04 │
│ 439070 │ 2.6061 │ 2011-01-05 │
│ 476390 │ 2.5925 │ 2011-01-06 │
│ 486988 │ 2.5055 │ 2011-01-07 │
│ 500272 │ 2.5656 │ 2011-01-08 │
│ 451117 │ 2.8534 │ 2011-01-09 │
│ 449946 │ 2.6484 │ 2011-01-10 │
│ 471811 │ 2.5189 │ 2011-01-11 │
│ 361127 │  2.398 │ 2011-01-12 │
│ 488565 │  2.647 │ 2011-01-13 │
│ 512868 │ 2.6572 │ 2011-01-14 │
│ 485503 │ 2.6565 │ 2011-01-15 │
│ 446699 │ 2.8195 │ 2011-01-16 │
│ 386391 │ 2.8472 │ 2011-01-17 │
│ 442217 │ 2.5776 │ 2011-01-18 │
│ 477235 │ 2.6019 │ 2011-01-19 │
│ 501964 │ 2.6306 │ 2011-01-20 │
├────────┴────────┴────────────┤
│ 20 rows            3 columns │
└──────────────────────────────┘
Run Time (s): real 6.451 user 442.361772 sys 544.996610

In this article we downloaded the entire dataset to the EC2, and queried the local copy — this added 50 seconds to the warm up time. Another potential way that DuckDB on EC2 could have been faster is by parquet_scanning S3 directly (because parquet_scan only accesses the relevant byte ranges of the source data, not the whole dataset).

We tested this, and while the initial query would have been ~30 seconds faster, it meant that the subsequent query would have been ~25 seconds slower. On balance, no massive performance improvement (and in fact, it is much slower if you wanted to run three or more queries on this dataset):

D SELECT COUNT(*) AS total5,
>          strftime(tpep_pickup_datetime, '%Y-%m-%d') AS pickup4
>     FROM parquet_scan('s3://isecurefi-dev-test/nyc-tlc/trip_data/*')
>    WHERE CAST(strftime(tpep_pickup_datetime, '%Y') AS INTEGER)
>          BETWEEN 2011 AND 2022
> GROUP BY pickup4 ORDER BY pickup4 LIMIT 20;
100% ▕████████████████████████████████████████████████████████████▏
┌────────┬────────────┐
│ total5 │  pickup4   │
│ int64  │  varchar   │
├────────┼────────────┤
│ 183031 │ 2011-01-01 │
│ 158706 │ 2011-01-02 │
│ 374927 │ 2011-01-03 │
│ 407012 │ 2011-01-04 │
│ 439070 │ 2011-01-05 │
│ 476390 │ 2011-01-06 │
│ 486988 │ 2011-01-07 │
│ 500272 │ 2011-01-08 │
│ 451117 │ 2011-01-09 │
│ 449946 │ 2011-01-10 │
│ 471811 │ 2011-01-11 │
│ 361127 │ 2011-01-12 │
│ 488565 │ 2011-01-13 │
│ 512868 │ 2011-01-14 │
│ 485503 │ 2011-01-15 │
│ 446699 │ 2011-01-16 │
│ 386391 │ 2011-01-17 │
│ 442217 │ 2011-01-18 │
│ 477235 │ 2011-01-19 │
│ 501964 │ 2011-01-20 │
├────────┴────────────┤
│ 20 rows   2 columns │
└─────────────────────┘
Run Time (s): real 29.428 user 244.316008 sys 8.996392

D SELECT COUNT(*) AS total5, round(AVG(trip_distance),4) AS adist,
>          strftime(tpep_pickup_datetime, '%Y-%m-%d') AS pickup4
>     FROM parquet_scan('s3://isecurefi-dev-test/nyc-tlc/trip_data/*')
>    WHERE CAST(strftime(tpep_pickup_datetime, '%Y') AS INTEGER)
>          BETWEEN 2011 AND 2022
> GROUP BY pickup4 ORDER BY pickup4 LIMIT 20;
100% ▕████████████████████████████████████████████████████████████▏
┌────────┬────────┬────────────┐
│ total5 │ adist  │  pickup4   │
│ int64  │ double │  varchar   │
├────────┼────────┼────────────┤
│ 183031 │ 3.0448 │ 2011-01-01 │
│ 158706 │  3.169 │ 2011-01-02 │
│ 374927 │ 2.8845 │ 2011-01-03 │
│ 407012 │  2.716 │ 2011-01-04 │
│ 439070 │ 2.6061 │ 2011-01-05 │
│ 476390 │ 2.5925 │ 2011-01-06 │
│ 486988 │ 2.5055 │ 2011-01-07 │
│ 500272 │ 2.5656 │ 2011-01-08 │
│ 451117 │ 2.8534 │ 2011-01-09 │
│ 449946 │ 2.6484 │ 2011-01-10 │
│ 471811 │ 2.5189 │ 2011-01-11 │
│ 361127 │  2.398 │ 2011-01-12 │
│ 488565 │  2.647 │ 2011-01-13 │
│ 512868 │ 2.6572 │ 2011-01-14 │
│ 485503 │ 2.6565 │ 2011-01-15 │
│ 446699 │ 2.8195 │ 2011-01-16 │
│ 386391 │ 2.8472 │ 2011-01-17 │
│ 442217 │ 2.5776 │ 2011-01-18 │
│ 477235 │ 2.6019 │ 2011-01-19 │
│ 501964 │ 2.6306 │ 2011-01-20 │
├────────┴────────┴────────────┤
│ 20 rows            3 columns │
└──────────────────────────────┘
Run Time (s): real 28.016 user 265.048058 sys 10.144067
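The trade-off between downloading first and scanning S3 directly can be expressed as a break-even calculation. This is a rough sketch that assumes every query costs about the same as the mean of the two measured queries in each mode; the function names are hypothetical.

```python
# Timings in seconds, taken from the EC2 runs reported in this article.
DOWNLOAD_S = 50.0
LOCAL_QUERY_S = (4.065 + 4.621) / 2   # mean of the two queries on local files
S3_QUERY_S = (29.428 + 28.016) / 2    # mean of the two queries scanning S3


def total_download_first(n: int) -> float:
    """Download the dataset once, then run n queries on the local copy."""
    return DOWNLOAD_S + n * LOCAL_QUERY_S


def total_direct_s3(n: int) -> float:
    """Run n queries that each scan S3 directly."""
    return n * S3_QUERY_S


# Smallest number of queries at which downloading first is the faster option.
break_even = next(
    n for n in range(1, 100) if total_download_first(n) < total_direct_s3(n)
)

for n in range(1, 5):
    print(n, round(total_download_first(n), 1), round(total_direct_s3(n), 1))
print("download-first wins from", break_even, "queries")
```

Under these assumptions the two approaches are close at two queries, and downloading first is clearly ahead from the third query onwards.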

Boiling Data Source Apps

BoilingData — Tue, 18 Oct 2022 19:05:25 GMT

What can you do with NodeJS runtime on single tenant computing resources where customers bring their own access roles?

We have launched support for Data Source Apps (DSA) for Boiling! They are templated JS functions, callable as SQL Table Functions.

Listing S3 Bucket using Boiling DSA for AWS SDK S3

DSAs integrate JavaScript functions into BoilingData on the fly, so that they can be queried with SQL like any other SQL Compute Cached data source. There is no need to install plugins or JAR files, or to compile, transpile, package, upload, or release anything. Just write a JavaScript function template as a JSON-formatted string and INSERT it into the Boiling apps catalog, and it’s ready for use immediately.

And you get the results like any other SQL Query

We have released an SDK for Boiling Data Source Apps (DSAs), which can be found in our GitHub repository at https://github.com/boilingdata/boilingdata-dsa-sdk. This repository includes example applications, such as those utilizing the Amazon Web Services (AWS) SDK and a basic random number generator. The AWS SDK example demonstrates full paging capabilities. Additionally, the repository provides an in-depth analysis and discussion of the SDK’s features as well as any related security implications.

Aliases and more aliases..

Interestingly, you can create aliases for SQL Table functions with all parameters predefined, only a few parameters defined, or even apply transformations to them. By doing so, you can assign meaningful names to your virtual table aliases that closely reflect their semantic purpose. For example:

SELECT * FROM apps.awssdk.allBuckets WHERE name LIKE '%boiling%';
SELECT * FROM apps.awssdk.demoBucketRootListing;
SELECT * FROM apps.awssdk.gluePartitions('nyctaxis');
SELECT * FROM apps.awssdk.glueTables('default');

All the examples above are using the same Boiling DSA, but instead of calling the “main” application, we used the defined aliases.

So, how does it work really?

Boiling processes the function template using json-template, taking into account any pre-defined SQL Table function parameters from an alias. It then creates a Function and calls it with a parameters object containing S3, Glue, and Lambda AWS SDK instances, which are based on the assumed IAM Role provided by the Boiling user.

The output should be an array of objects, so one of the SQL Table Function parameters needs to specify the path to the list of objects in the response, such as from a REST API. Boiling extracts all keys from the objects and generates a table schema based on their JSON data types (loosely). It then stores the results in an in-memory table. Objects with varying schemas are permitted, and null values are used if some keys are missing in certain rows.
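The steps in this paragraph can be sketched roughly as follows. This is an illustration under stated assumptions, not Boiling's actual code: the dotted path syntax, the SQL type names, and the helpers `extract_rows`, `infer_schema`, and `to_table` are all hypothetical.

```python
# Loose mapping from JSON value types to SQL column types (assumed names).
JSON_TO_SQL = {bool: "BOOLEAN", int: "BIGINT", float: "DOUBLE", str: "VARCHAR"}


def extract_rows(response, path):
    """Follow a dotted path (hypothetical syntax) to the list of objects."""
    node = response
    for part in path.split("."):
        node = node[part]
    return node


def infer_schema(rows):
    """Union of all keys; type taken from the first non-null value seen."""
    schema = {}
    for row in rows:
        for key, value in row.items():
            if schema.get(key) is None:
                schema[key] = (
                    None if value is None
                    else JSON_TO_SQL.get(type(value), "VARCHAR")
                )
    # Keys that only ever held nulls fall back to VARCHAR.
    return {key: sql_type or "VARCHAR" for key, sql_type in schema.items()}


def to_table(rows, schema):
    """Materialise rows in column order, using None (NULL) for missing keys."""
    return [tuple(row.get(col) for col in schema) for row in rows]


# Made-up REST-style response with rows of varying shape.
response = {"data": {"items": [
    {"name": "a.parquet", "size": 10},
    {"name": "b.parquet", "etag": "xyz"},
]}}

rows = extract_rows(response, "data.items")
schema = infer_schema(rows)
table = to_table(rows, schema)
print(schema)  # {'name': 'VARCHAR', 'size': 'BIGINT', 'etag': 'VARCHAR'}
print(table)   # [('a.parquet', 10, None), ('b.parquet', None, 'xyz')]
```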

The original SQL query referencing the Boiling DSA is replaced with a temporary table name, which represents the table containing the in-memory DSA results. The SQL is then executed within an embedded database, such as DuckDB. When a query requests the same Boiling DSA with identical parameters again, the data is readily available in the in-memory table for instant access. In this way, Boiling efficiently handles REST API calls.
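A minimal sketch of the caching and rewriting idea described here, assuming results are keyed by app name plus parameters. The class `DsaCache`, the `dsa_tmp_N` table naming, the `database` parameter name, and the string-based `rewrite` helper are hypothetical and only illustrate the flow.

```python
import json


class DsaCache:
    """Keeps DSA results in memory, keyed by app name and parameters."""

    def __init__(self, fetch):
        self._fetch = fetch   # callable(app, params) -> list of row objects
        self._tables = {}     # cache key -> (temp table name, rows)
        self.fetch_calls = 0

    def resolve(self, app, params):
        """Return the temp table name, fetching only on a cache miss."""
        key = (app, json.dumps(params, sort_keys=True))
        if key not in self._tables:
            self.fetch_calls += 1
            name = f"dsa_tmp_{len(self._tables)}"
            self._tables[key] = (name, self._fetch(app, params))
        return self._tables[key][0]


def rewrite(sql, call_text, table_name):
    """Replace the DSA call in the SQL text with the temp table name."""
    return sql.replace(call_text, table_name)


# Stand-in for the real function call (e.g. a REST API or AWS SDK request).
cache = DsaCache(lambda app, params: [{"name": "demo"}])

call = "apps.awssdk.glueTables('default')"
sql = f"SELECT * FROM {call};"
params = {"database": "default"}

first = rewrite(sql, call, cache.resolve("awssdk.glueTables", params))
second = rewrite(sql, call, cache.resolve("awssdk.glueTables", params))
print(first)              # SELECT * FROM dsa_tmp_0;
print(cache.fetch_calls)  # 1
```

The second call with identical parameters reuses the in-memory table, so the underlying function is invoked only once.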

You can start using BoilingData by signing up to our application here https://app.boilingdata.com/ and play with the demo datasets, set your own IAM role, and access your S3 Buckets the way you like.

Boiling Data

BoilingData — Fri, 02 Sep 2022 04:44:37 GMT

Enjoyable Serverless Analytics

It’s real: high-performance, on-demand serverless analytics is more than just an option or add-on to your existing toolbox. You don’t have to suffer from random cold starts and unpredictable query performance; instead, you can enjoy interacting with your data on S3.

Serverless has been around for years, but it has recently become a viable option for data processing too, thanks to increased resources (more CPU and more memory), access to network storage, and increased local SSD size, at least with AWS Lambda.

We started to work with DuckDB three years ago, when it wasn’t at the top of the hype curve. It’s a highly performant, state-of-the-art OLAP database that you can embed into your applications: a missing piece for serverless. It is a perfect fit for AWS Lambda and S3.

We added support for a “RAM” disk, so that DuckDB can work with in-memory data, query after query, without having to worry about S3 latency, Lambda network bandwidth, or even SSD bandwidth and latency. This gives you a hot in-memory database overlay: a SQL compute cache, with hot pools of boiling data that cool down when no longer used.

Simple GUI with QueryLog and API logging

Single tenant dedicated resources

The core of Boiling’s service is the ability to “boil” data, keep it hot, and always route your queries to the hot Lambda instances with your data already in memory. The routing happens globally, and queries run where your S3 Buckets are located to avoid sending data cross-region and incurring egress costs.

With Lambdas, every query always gets dedicated resources in a single-tenant environment. Lambda scales out to thousands of instances in seconds. Boiling can run the same query over hundreds of S3 Objects at the same time, again and again.

NodeJS SDK — https://github.com/boilingdata/node-boilingdata

Asynchronous API

We started right away with an asynchronous API over WebSockets so that query response data can be streamed as soon as it is ready. This helps with query response time and, for example, with analytics metrics dashboards, where trend data and aggregations over partitions are calculated live and concurrently, and results arrive as soon as they are ready.

What’s Boiling bro?

People ask what BoilingData is, how it compares to Presto, and whether it can do true distributed queries to replace a Data Warehouse.

Boiling is an in-memory SQL compute overlay. Data is brought from S3 into Lambda memory and kept there, and SQL queries run against it with the embedded databases. Once the data is in memory and queried with DuckDB, the response time is on the order of milliseconds, whereas with e.g. Presto, which loads data from S3 with every query, response times are in seconds or tens of seconds. The two are different: Presto is a large cluster running JVMs and is good for ETL jobs, while DuckDB inside Lambda is much faster for data sets that fit in memory but incurs an initial load latency when data is brought from S3 into Lambda memory.

Spark offers yet another perspective. They say that DuckDB needs one instance, whilst Spark needs 32 for the same query performance. The two are different: Spark has a large user base and lots of legacy code and scales horizontally, whereas DuckDB runs queries in memory and scales vertically (for a single working set; horizontally with multiple independent working sets).

Boiling is a caching layer. Any query you run will be cached in-memory and optionally persisted. In data processing, there are numerous caching layers and Boiling brings the selected data closer to the CPU from S3.

Boiling combines the best of two worlds. Try to fetch hundreds of columns from a columnar data store like Parquet: it’s painfully slow, as it’s not meant for horizontal (row-wise) data retrieval, but for vertical (column-wise) retrieval. However, if you take SQLite and fetch all columns by row number (or index), it is very fast, especially if you have the storage page hierarchy already cached in memory. With BoilingData you can use DuckDB for the WHERE part and SQLite for the SELECT part.

But we are just beginning, early in the journey. The Data Engineering world has lots to learn from Software Engineering when it comes to Semantic Layers, data contracts, etc., and we know that data is becoming more real-time and streaming, globally. We love query graphs and want to make them more interactive; we love coding and APIs and want to bring them closer to data. What’s more, running our service in AWS, with Lambda and other services, gives us fast iterations and modularity. Our world is not restricted to a single low-level programming language codebase. We want to take you into our enjoyable serverless analytics land!

You can start using BoilingData by signing up to our application here https://app.boilingdata.com/ and play with the demo datasets, set your own IAM role, and access your S3 Buckets the way you like.