
Posts

  • Column Storage for the AI Era

    In the past few years, we’ve seen a Cambrian explosion of new columnar formats challenging the hegemony of Parquet: Lance, FastLanes, Nimble, Vortex, AnyBlox, F3 (File Format for the Future). The thinking is that the context has changed so much that the design of yore (the previous decade) is not going to cut it moving forward. This intrigued me, especially since the main contribution of Parquet has been to provide a standard for columnar storage. Parquet is not simply a file format: as an open source project hosted by the ASF, it acts as a consensus-building machine for the industry, and creating six new formats is not going to help with interoperability. I spent some time understanding how things have actually changed and how Parquet needs to adapt to meet the demands of this new era. In this post I’ll discuss my findings.

  • Trials and tribulations in AI coding

    This is just me sharing my experience trying out AI coding. Don’t be fooled by the curmudgeonly engineer having to change his ways. If you get past my stumbling through what might be obvious beginner mistakes, and the (sometimes humorous) grumbling that accompanies it, you’ll get to the happy ending: the tool I was able to create fairly quickly, using languages and technologies I am totally unfamiliar with. I learned some things along the way, not least how to better collaborate with a coding agent.

  • Bottom up Architecture

    The topic of software architecture has become a bit cringe. Some people will roll their eyes at the mere mention of it. My impression is that this is because it has often been a very top-down practice: approval from an architecture committee must be secured before starting anything new, and nothing gets done unless it has been vetted by “the architect”. The people who are closest to the problem being solved must seek permission from, and convince, the people who are furthest removed from it that their solution is best. There are many flaws to this approach, and you’ll find many opinions online on how having architects is an anti-pattern. Not least that there is no such thing as a best solution; it’s trade-offs all the way down.

  • The advent of the Open Data Lake

    As I looked back at my experience working in data engineering for this post, I realized I never really consciously decided to specialize in data. It just kind of happened. The company I was working for was acquired by Yahoo!, where Hadoop was emerging as the next industry leap in data processing. As I dug deeper into the platforms I was using and became interested in open source software, I inadvertently specialized, and I found there a career and a community.
    Now I’m going to talk about the evolution of the industry towards what I like to call the open data lake, also referred to as the Lakehouse pattern. To get there, we’ll take a little trip down memory lane to understand better how we got here.

  • The Future of Lineage

    Lineage has long been a requirement for anyone processing data - whether for complying with regulations, ensuring data reliability or, to quote Marvin Gaye, plainly just knowing what’s going on from provenance to impact analysis.
    However, our industry has historically had difficulties collecting data lineage reliably.
    From the early days of lineage powered by spreadsheets, we’ve come a long way towards standardizing lineage. We have evolved from painful, manual approaches to automated operational lineage extraction across batch and stream processing.
    Now, we’re on the brink of a new era in which lineage will be built into every data processing layer - whether ETL, data warehouse, or AI - and will no longer be an afterthought.

  • The Deconstructed Database

    In 2018, I wanted to describe how the components of databases, distributed or not, were being commoditized as individual parts that anyone could recombine into use-case specific engines. Given your own constraints, you can leverage those components to build a query engine that solves your problem much faster than building everything from the ground up. I liked the idea of calling it “the Deconstructed Database” and gave a few talks about it.

  • Improv for Engineers

    Recently I signed up for an improv class and it’s been a lot of fun. It had been way too long since the last time I had taken classes, back when I was at Twitter, and I wish I had done this earlier. That class I took ten years ago was part of “Twitter University”, a program designed to help employees develop their skills. There, you could learn about many topics, from programming in Scala to improv. Employees were also encouraged to teach classes; for example, I taught the analytics onboarding class for a while, along with a few other one-off classes.

  • Chapter III: Onwards, OpenLineage

    Just as there was a common need for a columnar file format and for a columnar in-memory representation, there's a common need for lineage across the data ecosystem. In this chapter, I'm telling the story of how OpenLineage came to be and filled that need.

  • Chapter II: From Parquet to Arrow

    In 2015, a discussion started in the Parquet community around the need for an in-memory columnar format. The goal was to enable vectorization of query engines and interoperability of data exchange. The requirements were different enough from Parquet to warrant the creation of a different format, one focused on in-memory processing.

  • Chapter I: The birth of Parquet

    Fifteen years ago (2007-2011) I was at Yahoo! working with Map/Reduce and Apache Pig, which was the better Map/Reduce at the time. The Dremel paper had just come out and, since everything I worked with seemed to be inspired by Google papers, I read it. I could see it applying to what we were doing at Yahoo!, and it was to become a big inspiration for my future work.

  • Ten years of Building Open Source Standards: From Parquet to Arrow to OpenLineage

    Over the last decade, I have been lucky enough to contribute to a few successful open source projects in the data ecosystem.

  • Dremel made simple with Parquet

    Columnar storage is a popular technique to optimize analytical workloads in parallel RDBMSs. The performance and compression benefits of storing and processing large amounts of data this way are well documented in academic literature as well as in several commercial analytical databases.

  • Java JIT compiler inlining

    As you know, the Java Virtual Machine (JVM) optimizes Java bytecode at runtime using a just-in-time compiler (JIT). However, the exact behavior of the JIT is hard to predict and documentation is scarce. You probably know that the JIT will try to inline frequently called methods in order to avoid the overhead of method invocation. But you may not realize that the heuristic it uses depends both on how often a method is invoked and on how big it is. Methods that are too big cannot be inlined without bloating the call sites.
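
    As a rough illustration of that heuristic (a sketch, not the post's code; the flags mentioned in the comments are HotSpot diagnostics whose defaults vary by JVM version), a tiny method called from a hot loop is a prime inlining candidate:

    ```java
    // Run with: java -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining InlineDemo
    // to watch the JIT's inlining decisions. The relevant size thresholds are
    // -XX:MaxInlineSize (for cold methods) and -XX:FreqInlineSize (for hot ones).
    public class InlineDemo {
        // A few bytecodes long: well under any inlining size threshold.
        private static int add(int a, int b) {
            return a + b;
        }

        public static void main(String[] args) {
            long sum = 0;
            // Enough calls to make add() "hot" and trigger JIT compilation.
            for (int i = 0; i < 1_000_000; i++) {
                sum += add(i, 1);
            }
            System.out.println(sum); // 500000500000
        }
    }
    ```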

  • Java Classloader tips

    This post is not trying to describe classloaders exhaustively. I merely intend to give some hints about using classloaders and some traps to avoid.

  • Exception handling, Checked vs Unchecked exceptions, ...

    Some thoughts in random order about exceptions in Java, as they often get overlooked. The following points are often related. Mostly I’m talking to myself here, don’t take it too personally when I say “you should do this” 🙂 just imagine I’m the guy from Memento and that I tattoo myself with my blog posts. Feel free to disagree/add your own views in the comments (though I may not get your opinion tattooed on my body). Here I’m assuming we are building a library that others will use.

  • Transitive Closure in Pig

    This is a follow-up to my previous post about implementing PageRank in Pig using embedding. I also talked about this in a presentation to the Pig user group. One of the best features of embedding is how it simplifies writing UDFs and using them right away in the same script without superfluous declarations. Computing a transitive closure is a good example of an algorithm requiring an iteration, a few simple UDFs, and an end condition to decide when to stop iterating. Embedding is available in Pig 0.9. Knowledge of both Pig and Python is required to follow along. Examples are available on GitHub.
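
    The post itself uses Pig embedded in Python; purely to illustrate the shape of the iteration (the graph, the pair encoding, and the class below are my own invention, not from the post), here is a minimal fixpoint loop in plain Java:

    ```java
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class TransitiveClosure {
        // One iteration: extend each reachable pair (a, b) with every edge (b, c).
        static Set<List<Integer>> step(Set<List<Integer>> reach, Set<List<Integer>> edges) {
            Set<List<Integer>> next = new HashSet<>(reach);
            for (List<Integer> p : reach)
                for (List<Integer> e : edges)
                    if (p.get(1).equals(e.get(0)))
                        next.add(List.of(p.get(0), e.get(1)));
            return next;
        }

        public static void main(String[] args) {
            // A small chain: 1 -> 2 -> 3 -> 4.
            Set<List<Integer>> edges = Set.of(List.of(1, 2), List.of(2, 3), List.of(3, 4));
            Set<List<Integer>> reach = new HashSet<>(edges);
            // The end condition: stop when an iteration adds no new pairs.
            while (true) {
                Set<List<Integer>> next = step(reach, edges);
                if (next.size() == reach.size()) break;
                reach = next;
            }
            System.out.println(reach.size()); // 6 pairs: the full closure of the chain
        }
    }
    ```

    In the embedded version, roughly speaking, the join in step() corresponds to a Pig script and the while loop to the controlling Python code.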

  • PageRank implementation in Pig

    In this post I’m going to give a very simple example of how to use Pig embedded in Python to implement the PageRank algorithm. It goes into a little more detail on the same example given in the presentation I gave at the Pig user meetup. If you are interested, Daniel just published a nice K-Means implementation on the Hortonworks blog.
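
    For readers who just want the gist of the algorithm without the Pig setup, here is a self-contained sketch of the PageRank power iteration in plain Java (the three-page graph and the 0.85 damping factor are illustrative choices, not taken from the post):

    ```java
    import java.util.Arrays;

    public class PageRankDemo {
        public static void main(String[] args) {
            // Tiny link graph: page 0 links to 1 and 2, page 1 to 2, page 2 to 0.
            int[][] out = { {1, 2}, {2}, {0} };
            int n = out.length;
            double d = 0.85; // damping factor
            double[] rank = new double[n];
            Arrays.fill(rank, 1.0 / n); // start from a uniform distribution

            for (int iter = 0; iter < 50; iter++) {
                double[] next = new double[n];
                Arrays.fill(next, (1 - d) / n); // the "random jump" share
                // Each page distributes its rank evenly over its outlinks.
                for (int p = 0; p < n; p++)
                    for (int q : out[p])
                        next[q] += d * rank[p] / out[p].length;
                rank = next;
            }
            System.out.println(Arrays.toString(rank)); // ranks sum to ~1.0
        }
    }
    ```

    The embedded-Pig version expresses the inner distribution step as a Pig script and drives the iterations from Python, but the math is the same.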

  • Detecting low memory in Java Part 2

    This is a follow-up to my previous post; the rationale is explained there.

  • Detecting low memory in Java

    This seems to be a common need and a difficult thing to do in Java. Here is my solution; let me know what you think.
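
    Since the excerpt only teases the solution, here is a sketch of one common technique for this problem (it may differ from the post's actual approach): ask the JMX memory beans to notify you when a heap pool is still mostly full right after a garbage collection.

    ```java
    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryNotificationInfo;
    import java.lang.management.MemoryPoolMXBean;
    import java.lang.management.MemoryType;
    import javax.management.NotificationEmitter;

    public class LowMemoryDetector {
        // Warn when, after a GC, a heap pool stays above the given fraction of its max.
        public static void install(double fraction) {
            for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
                // Only heap pools with a defined max and collection-usage-threshold
                // support (typically the old/tenured generation) are useful here.
                if (pool.getType() == MemoryType.HEAP
                        && pool.isCollectionUsageThresholdSupported()) {
                    long max = pool.getUsage().getMax();
                    if (max > 0) {
                        pool.setCollectionUsageThreshold((long) (max * fraction));
                    }
                }
            }
            NotificationEmitter emitter =
                    (NotificationEmitter) ManagementFactory.getMemoryMXBean();
            emitter.addNotificationListener((notification, handback) -> {
                if (MemoryNotificationInfo.MEMORY_COLLECTION_THRESHOLD_EXCEEDED
                        .equals(notification.getType())) {
                    System.err.println("Low memory: heap still above threshold after GC");
                }
            }, null, null);
        }

        public static void main(String[] args) {
            install(0.8); // alert when more than 80% of a pool survives a GC
            System.out.println("listener installed");
        }
    }
    ```

    Checking usage after a collection (rather than at an arbitrary moment) avoids false alarms from garbage that the next GC would have reclaimed anyway.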

subscribe via RSS