Posted on: Mon 12 January 2026 by Geoffrey Claude (Datadog)
If you embed DataFusion in your product, your users will eventually run SQL that DataFusion does not recognize. Not because the query is unreasonable, but because SQL in practice includes many dialects and system-specific statements.
Suppose you store data as Parquet files on S3 and want users to attach an …
Posted on: Mon 15 December 2025 by Gene Bordegaray
Databases are some of the most complex yet interesting pieces of software. They are amazing pieces of abstraction: query engines optimize and execute complex plans, storage engines provide sophisticated infrastructure as the backbone of the system, while intricate file formats lay the groundwork for particular workloads. All of this is …
The Apache DataFusion PMC is pleased to announce version 0.12.0 of the Comet subproject.
Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for
improved performance and efficiency without requiring any code changes.
This release covers approximately four weeks of development …
We are proud to announce the release of DataFusion 51.0.0. This post highlights
some of the major improvements since DataFusion 50.0.0. The complete list of
changes is available in the changelog. Thanks to the 128 contributors for
making this release possible.
The Apache DataFusion PMC is pleased to announce version 0.11.0 of the Comet subproject.
Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for
improved performance and efficiency without requiring any code changes.
This release covers approximately five weeks of development …
We are proud to announce the release of DataFusion 50.0.0. This blog post
highlights some of the major improvements since the release of DataFusion
49.0.0. The complete list of changes is available in the changelog.
Thanks to numerous contributors for making this release possible!
Posted on: Sun 21 September 2025 by Tim Saucer (rerun.io), Dewey Dunnington (Wherobots), Andrew Lamb (InfluxData)
Apache DataFusion significantly improves support for user-defined
types and metadata. The user-defined function APIs let users access
metadata on the input columns to functions and produce metadata in the output.
The Apache DataFusion PMC is pleased to announce version 0.10.0 of the Comet subproject.
Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for
improved performance and efficiency without requiring any code changes.
This release covers approximately ten weeks of development …
Posted on: Wed 10 September 2025 by Adrian Garcia Badaracco (Pydantic), Andrew Lamb (InfluxData)
This blog post introduces the query engine optimization techniques called TopK
and dynamic filters. We describe the motivating use case, how these
optimizations work, and how we implemented them with the Apache DataFusion
community to improve performance by an order of magnitude for some query
patterns.
Posted on: Fri 15 August 2025 by Andrew Lamb (InfluxData)
It is a common misconception that Apache Parquet requires (slow) reparsing of
metadata and is limited to indexing structures provided by the format. In fact,
caching parsed metadata and using custom external indexes along with
Parquet's hierarchical data organization can significantly speed up query
processing.
We are proud to announce the release of DataFusion 49.0.0. This blog post highlights some of
the major improvements since the release of DataFusion 48.0.0. The complete list of changes is available in the changelog.
We’re excited to announce the release of Apache DataFusion 48.0.0! As always, this version packs in a wide range of
improvements and fixes. You can find the complete details in the full
changelog. We’ll highlight the most
important changes below and guide you through upgrading.
Posted on: Mon 14 July 2025 by Qi Zhu (Cloudera), Jigao Luo (Systems Group at TU Darmstadt), and Andrew Lamb (InfluxData)
It’s a common misconception that Apache Parquet files are limited to basic Min/Max/Null Count statistics and Bloom filters, and that adding more advanced indexes requires changing the specification or creating a new file format. In fact, footer metadata and offset-based addressing already provide everything needed to embed …
We’re excited to announce the release of Apache DataFusion 47.0.0! This new version represents a significant
milestone for the project, packing in a wide range of improvements and fixes. You can find the complete details in the
full changelog. We’ll highlight the most
important changes below …
The Apache DataFusion PMC is pleased to announce version 0.9.0 of the Comet subproject.
Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for
improved performance and efficiency without requiring any code changes.
This release covers approximately ten weeks of development …
Sometimes Query Optimizers are seen as a sort of black magic, “the most
challenging problem in computer
science,” according to Father
Pavlo, or some behind-the-scenes player. We believe this perception is because:
In the first part of this post, we discussed what a Query Optimizer is, what
role it plays, and described how industrial optimizers are organized. In this
second post, we describe various optimizations that are found in Apache
DataFusion and …
The Apache DataFusion PMC is pleased to announce version 0.8.0 of the Comet subproject.
Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for
improved performance and efficiency without requiring any code changes.
This release covers approximately six weeks of development …
Posted on: Sat 19 April 2025 by Aditya Singh Rathore, Andrew Lamb
Window functions are a powerful feature in SQL, allowing for complex analytical computations over a subset of data. However, efficiently implementing them, especially sliding windows, can be quite challenging. With Apache DataFusion's user-defined window functions, developers can easily take advantage of all the effort put into DataFusion's implementation.
We are happy to announce that datafusion-python 46.0.0 has been released. This release
brings in all of the new features of the core DataFusion 46.0.0 library. Since the last
blog post for datafusion-python 43.1.0, a large number of improvements have been made
that can …
Posted on: Mon 24 March 2025 by Oznur Hanci and Berkay Sahin on behalf of the PMC
We’re excited to announce the release of Apache DataFusion 46.0.0! This new version represents a significant milestone for the project, packing in a wide range of improvements and fixes. You can find the complete details in the full changelog. We’ll highlight the most important changes below …
The Apache DataFusion PMC is pleased to announce version 0.7.0 of the Comet subproject.
Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for
improved performance and efficiency without requiring any code changes.
Editor's Note: This blog was first published on Xiangpeng Hao's blog. Thanks to InfluxData for sponsoring this work as part of his PhD funding.
Apache Parquet has become the industry standard for storing columnar data, and reading Parquet efficiently -- especially from remote storage -- is crucial for query performance.
In this blog post, we explain when an ordering requirement of an operator is satisfied by its input data. This analysis is essential for order-based optimizations and is often more complex than one might initially think.
Ordering Requirement for an operator describes how the input data to that operator …
We are very proud to announce DataFusion 45.0.0. This blog highlights some of the
many major improvements since we released DataFusion 40.0.0 and a preview of
what the community is thinking about in the next 6 months. It has been an exciting
period of development …
The Apache DataFusion PMC is pleased to announce version 0.6.0 of the Comet subproject.
Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for
improved performance and efficiency without requiring any code changes.
We are pleased to announce version 43.0.0 of DataFusion Ballista. Ballista allows existing DataFusion applications to be scaled out on a cluster for use cases that are not practical to run on a single node.
The Apache DataFusion PMC is pleased to announce version 0.5.0 of the Comet subproject.
Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for
improved performance and efficiency without requiring any code changes.
We are happy to announce that datafusion-python 43.1.0 has been released. This release
brings in all of the new features of the core DataFusion 43.0.0 library. Since the last
blog post for datafusion-python 40.1.0, a large number of improvements have been made
that can …
The Apache DataFusion PMC is pleased to announce version 0.4.0 of the Comet subproject.
Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for
improved performance and efficiency without requiring any code changes.
For a few months now I’ve been working with Apache DataFusion, a
fast query engine written in Rust. In my experience, nearly all data scientists
work in Python. In general, data scientists often use Pandas
for in-memory tasks and PySpark for larger …
Posted on: Mon 18 November 2024 by Andrew Lamb, Staff Engineer at InfluxData
I am extremely excited to announce that Apache DataFusion is the
fastest engine for querying Apache Parquet files in ClickBench. It is faster
than DuckDB, chDB and ClickHouse using the same hardware. It also marks
the first time a Rust-based engine holds the top spot, which has previously
been …
The Apache DataFusion PMC is pleased to announce version 0.3.0 of the Comet subproject.
Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for
improved performance and efficiency without requiring any code changes.
Posted on: Fri 13 September 2024 by Xiangpeng Hao, Andrew Lamb
Editor's Note: This is the first of a two-part blog series that was first published on the InfluxData blog. Thanks to InfluxData for sponsoring this work as Xiangpeng Hao's summer intern project.
Posted on: Fri 13 September 2024 by Xiangpeng Hao, Andrew Lamb
Editor's Note: This blog series was first published on the InfluxData blog. Thanks to InfluxData for sponsoring this work as Xiangpeng Hao's summer intern project.
In the first post, we discussed the nuances required to accelerate Parquet loading using StringViewArray by reusing buffers and reducing copies.
In this second …
The Apache DataFusion PMC is pleased to announce version 0.2.0 of the Comet subproject.
Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for
improved performance and efficiency without requiring any code changes.
We are happy to announce that DataFusion in Python 40.1.0 has been released. In addition to
bringing in all of the new features of the core DataFusion 40.0.0 package, this release
contains significant updates to the user interface and documentation. We listened to the Python …
We are proud to announce DataFusion 40.0.0. This blog highlights some of the
many major improvements since we released DataFusion 34.0.0 and a preview of
what the community is thinking about in the next 6 months. We are hoping to make
more regular blog posts …
The Apache DataFusion PMC is pleased to announce the first official source release of the Comet subproject.
Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for
improved performance and efficiency without requiring any code changes.
The Arrow PMC and newly created DataFusion PMC are happy to announce that as of
April 16, 2024 the Apache Arrow DataFusion subproject is now a top level
Apache Software Foundation project.
Comet is an Apache Spark plugin that uses Apache Arrow DataFusion to
accelerate Spark workloads. It is designed as a drop-in
replacement for Spark's JVM …
We recently released DataFusion 34.0.0. This blog highlights some of the major
improvements since we released DataFusion 26.0.0 (spoiler alert: there are many)
and a preview of where the community plans to focus in the next 6 months.
Grouped aggregations are a core part of any analytic tool, creating understandable summaries of huge data volumes. Apache Arrow DataFusion’s parallel aggregation capability …
It has been a whirlwind 6 months of DataFusion development since our
last update: the community has grown, many features have been added,
performance improved and we are discussing branching out to our own
top level Apache Project.
DataFusion is an extensible
query execution framework, written in Rust,
that uses Apache Arrow as its
in-memory format. It is targeted primarily at developers creating data
intensive analytics, and offers mature
SQL support,
a DataFrame API, and many extension points.
Systems based on DataFusion perform very well in benchmarks …
DataFusion is an extensible query execution framework, written in Rust, that
uses Apache Arrow as its in-memory format.
When you want to extend your Rust project with SQL support,
a DataFrame API, or the ability to read and process Parquet, JSON, Avro or CSV data, DataFusion is definitely worth …
Apache Arrow DataFusion is an extensible query execution framework, written in Rust, that uses Apache Arrow as its in-memory format.
When you want to extend your Rust project with SQL support, a DataFrame API, or the ability to read and process Parquet, JSON, Avro or CSV data, DataFusion is …
DataFusion is an extensible query execution framework, written in Rust, that uses Apache Arrow as its in-memory format.
When you want to extend your Rust project with SQL support, a DataFrame API, or the ability to read and process Parquet, JSON, Avro or CSV data, DataFusion is definitely worth …
DataFusion is an embedded
query engine which leverages the unique features of
Rust and Apache
Arrow to provide a system that is high
performance, easy to connect, easy to embed, and high quality.
The Apache Arrow team is pleased to announce the DataFusion 6.0.0 release. This covers …
Ballista extends DataFusion to provide support for distributed queries. This is the first release of Ballista since
the project was donated to the Apache Arrow project
and includes 80 commits from 11 contributors.
The Apache Arrow team is pleased to announce the DataFusion 5.0.0 release. This covers 4 months of development work
and includes 211 commits from the following 31 distinct contributors.
$ git shortlog -sn 4.0.0..5.0.0 datafusion datafusion-cli datafusion-examples
61 Jiayu Liu
47 Andrew Lamb
27 …
We are excited to announce that Ballista has been donated
to the Apache Arrow project.
Ballista is a distributed compute platform primarily implemented in Rust, and powered by Apache Arrow. It is built
on an architecture that allows other programming languages (such as Python, C++, and Java) to be supported …
We are excited to announce that DataFusion has been donated to the Apache Arrow project. DataFusion is an in-memory query engine for the Rust implementation of Apache Arrow.
Although DataFusion was started two years ago, it was recently re-implemented to be Arrow-native and currently has limited capabilities but does support …