Consider using JSON arrays instead of JSON objects for serialisation

When we implemented the awesome MULTISET operator in jOOQ, the implementation mostly relied on the SQL/JSON support of various RDBMS. In short, standard SQL supports nested collections via ARRAY or MULTISET operators like this:

SELECT
  f.title,
  MULTISET(
    SELECT a.first_name, a.last_name
    FROM actor AS a
    JOIN film_actor AS fa ON a.actor_id = fa.actor_id
    WHERE fa.film_id = f.film_id
  )
FROM film AS f;

But this standard syntax is poorly supported in most RDBMS, so jOOQ emulates it using SQL/JSON as follows (or similar):

SELECT
  f.title,
  (
    SELECT json_arrayagg(json_array(a.first_name, a.last_name))
    FROM actor AS a
    JOIN film_actor AS fa ON a.actor_id = fa.actor_id
    WHERE fa.film_id = f.film_id
  )
FROM film AS f;

Wait a second. A JSON array of arrays??

Normally, you’d expect a nested collection in JSON to be of this form, for debuggability and good measure, etc.:

[
  {
    "first_name": "John",
    "last_name": "Doe"
  },
  {
    "first_name": "Jane",
    "last_name": "Smith"
  },
  ...
]

But jOOQ is serialising this, instead??

[
  [ "John", "Doe" ],
  [ "Jane", "Smith" ],
  ...
]

Why, yes of course! jOOQ is in full control of your SQL statement and knows exactly which column (and data type) is at which position, because you helped jOOQ construct not only the query object model, but also the result structure. So, much faster access by index is possible, compared to the much slower access by column name.
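
To illustrate the difference on the consuming side, here’s a minimal sketch (not jOOQ’s internal deserialisation code) using Jackson’s tree API. The array-of-arrays form binds each field by position, whereas the object form ships the keys with every row and has to look up every field by name:

import java.util.ArrayList;
import java.util.List;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class PositionalVsNamedAccess {

    record Name(String firstName, String lastName) {}

    static final ObjectMapper MAPPER = new ObjectMapper();

    // Array-of-arrays payload: the client already knows that index 0 is
    // FIRST_NAME and index 1 is LAST_NAME, so no key lookup is needed
    static List<Name> fromArrays(String json) throws Exception {
        List<Name> result = new ArrayList<>();
        for (JsonNode row : MAPPER.readTree(json))
            result.add(new Name(row.get(0).asText(), row.get(1).asText()));
        return result;
    }

    // Array-of-objects payload: every row repeats the keys, and every field
    // access is a lookup by name
    static List<Name> fromObjects(String json) throws Exception {
        List<Name> result = new ArrayList<>();
        for (JsonNode row : MAPPER.readTree(json))
            result.add(new Name(
                row.get("first_name").asText(),
                row.get("last_name").asText()));
        return result;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(fromArrays("[[\"John\",\"Doe\"],[\"Jane\",\"Smith\"]]"));
        System.out.println(fromObjects("[{\"first_name\":\"John\",\"last_name\":\"Doe\"}]"));
    }
}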

The same is true for ordinary result sets, by the way, where jOOQ always calls JDBC’s ResultSet.getString(int), for example, over ResultSet.getString(String). Not only is it faster, it is also more reliable. Think of duplicate column names, e.g. when joining two tables that both contain an ID column. While JSON is not opinionated about duplicate object keys, not all JSON parsers support them, let alone Java Map types.
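
Here’s a plain JDBC sketch of that duplicate column name problem (the table and column names are made up for illustration):

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

public class DuplicateColumnNames {

    // Both joined tables expose a column called ID
    static void readJoin(Connection connection) throws Exception {
        try (Statement stmt = connection.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "select c.id, o.id, o.amount "
               + "from customer c join orders o on o.customer_id = c.id")) {

            while (rs.next()) {
                // Positional access is unambiguous, and skips the per-call
                // column name lookup
                long customerId = rs.getLong(1);
                long orderId    = rs.getLong(2);

                // Name-based access can't distinguish the two ID columns.
                // Most drivers silently return the first match here:
                long ambiguous = rs.getLong("id");
            }
        }
    }
}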

Fun fact: while the Map API specification forbids duplicate keys, an implementation could in principle ignore the specification and define non-standard behaviour that allows for accessing duplicate keys via iteration methods such as Map::entrySet.
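
For illustration only (this is not what jOOQ does), such a deliberately contract-violating Map could be sketched like this:

import java.util.AbstractMap;
import java.util.AbstractSet;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

// Knowingly violates the Map contract: put() always appends, so duplicate
// keys are kept, and all of them remain reachable via entrySet()
public class DuplicateKeyMap<K, V> extends AbstractMap<K, V> {

    private final List<Entry<K, V>> entries = new ArrayList<>();

    @Override
    public V put(K key, V value) {
        entries.add(new SimpleEntry<>(key, value));
        return null;
    }

    @Override
    public Set<Entry<K, V>> entrySet() {
        return new AbstractSet<>() {
            @Override
            public Iterator<Entry<K, V>> iterator() {
                return entries.iterator();
            }

            @Override
            public int size() {
                return entries.size();
            }
        };
    }
}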

See some experiments in this issue here: https://github.com/jOOQ/jOOQ/issues/11889

In any case, once access-by-name is replaced by positional access, duplicate column names are no longer an issue.

Not related to SQL

While jOOQ does this in a SQL context, note that you could apply the same technique between any client and server application, with any programming language on either side. This is in no way related to SQL itself!

When to apply this advice

Performance advice is always a difficult thing. There’s no such thing as “always” (use JSON arrays) or “never” (use JSON objects). It’s a tradeoff. In the case of jOOQ, JSON arrays of arrays are being used to serialise data that users don’t see at all. Users don’t even think in terms of JSON, when using the MULTISET operator, and because it’s mature enough now, they hardly have to debug the serialisation. If we could, we’d use a binary serialisation format between the server and the client in jOOQ for even faster results. Regrettably, there’s no such thing in SQL as binary data aggregation.

When you design an API (especially one that is consumed by others), clarity may be much more important than gaining a few percent of speed. JSON objects are much clearer when modelling a data type than JSON arrays, which only support positional field access.

I think this advice applies to many technical JSON serialisation use-cases where JSON has to be used, but where a binary format would be preferred if that were possible. If in doubt, always measure first!

Compression

People may be lured into thinking that this is mostly about data transfer, in which case compression could mitigate some of the repetitive object key overhead. This is certainly true, especially in an HTTP context.

But compression and decompression still cost CPU time at both ends of the transfer. So, while compression reduces the transferred data size, it might be much simpler to avoid most of the overhead in the first place with a different data layout.
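
If you want to get a feeling for how much of the key overhead gzip reclaims on the wire, here’s a quick back-of-the-envelope sketch. The payloads are fabricated for illustration, so don’t read too much into the exact sizes:

import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class GzipOverheadSketch {

    static int gzipSize(String s) throws Exception {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(baos)) {
            gz.write(s.getBytes(StandardCharsets.UTF_8));
        }
        return baos.size();
    }

    public static void main(String[] args) throws Exception {
        StringBuilder objects = new StringBuilder("[");
        StringBuilder arrays  = new StringBuilder("[");

        for (int i = 0; i < 10_000; i++) {
            if (i > 0) {
                objects.append(",");
                arrays.append(",");
            }

            objects.append("{\"first_name\":\"John").append(i)
                   .append("\",\"last_name\":\"Doe").append(i).append("\"}");
            arrays.append("[\"John").append(i)
                  .append("\",\"Doe").append(i).append("\"]");
        }

        objects.append("]");
        arrays.append("]");

        // The repeated keys compress well, but the larger payload still has
        // to be compressed on one end and decompressed on the other
        System.out.println("objects: " + objects.length() + " chars raw, "
            + gzipSize(objects.toString()) + " bytes gzipped");
        System.out.println("arrays:  " + arrays.length() + " chars raw, "
            + gzipSize(arrays.toString()) + " bytes gzipped");
    }
}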

Benchmark

What would a claim about things being faster be without a reproducible benchmark? The following benchmark uses:

  • H2 in-memory, but you could certainly reproduce it anywhere else, too, e.g. on Oracle or PostgreSQL
  • JMH, despite this not being a micro benchmark. JMH offers great reproducibility and statistics tools (average, standard deviation, etc.)

By using these two tools, the benchmark can measure all of:

  • The server side implications of serialising data
  • The client side implications of parsing data
  • The network transfer overhead (not measured in the case of H2, so compression wouldn’t help here anyway)

It uses JDBC directly, not jOOQ, so mapping overheads are omitted. It does use jOOQ’s built-in JSON parser, but you’ll get similar results with any other JSON parser, e.g. Jackson.

The results on my machine are as follows (higher is better).

For H2 (in memory):

Benchmark                                   Mode  Cnt        Score       Error  Units
JSONObjectVsArrayBenchmark.testJsonArray   thrpt    7  1005894.475 ± 57611.223  ops/s
JSONObjectVsArrayBenchmark.testJsonObject  thrpt    7   809580.238 ± 19848.664  ops/s

For PostgreSQL:

Benchmark                                   Mode  Cnt     Score      Error  Units
JSONObjectVsArrayBenchmark.testJsonArray   thrpt    7  1588.895 ±  433.826  ops/s
JSONObjectVsArrayBenchmark.testJsonObject  thrpt    7  1387.053 ± 1132.281  ops/s

In each case, there’s roughly a 15% – 25% improvement for a small data set (10 rows with 2 columns). I would expect the effect to be more significant for larger results, so here are the numbers for 10000 rows:

For H2:

Benchmark                                   Mode  Cnt     Score    Error  Units
JSONObjectVsArrayBenchmark.testJsonArray   thrpt    7  2932.684 ± 41.095  ops/s
JSONObjectVsArrayBenchmark.testJsonObject  thrpt    7  1643.838 ± 31.943  ops/s

For PostgreSQL:

Benchmark                                   Mode  Cnt    Score   Error  Units
JSONObjectVsArrayBenchmark.testJsonArray   thrpt    7  122.875 ± 7.133  ops/s
JSONObjectVsArrayBenchmark.testJsonObject  thrpt    7   71.916 ± 3.232  ops/s

These results are significant enough to justify the reduced readability, given that every jOOQ query using MULTISET will profit from the speed-up.

The benchmark code is shown below. It uses a very simple JSON parser handler, which doesn’t really keep track of object keys or array indexes, but simply assumes that there’s exactly 1 number column and 1 string column. This shouldn’t really matter for more complex results, though.

package org.jooq.impl;

import java.io.InputStream;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

import org.jooq.DSLContext;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Level;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.TearDown;
import org.openjdk.jmh.annotations.Warmup;

@Fork(value = 1, jvmArgsAppend = "-Dorg.jooq.no-logo=true")
@Warmup(iterations = 3, time = 3)
@Measurement(iterations = 7, time = 3)
public class JSONObjectVsArrayBenchmark {

    @State(Scope.Benchmark)
    public static class BenchmarkState {

        DSLContext ctx;
        Connection connection;
        Statement  statement;

        @Setup(Level.Trial)
        public void setup() throws Exception {
            // H2 setup
            connection = DriverManager.getConnection("jdbc:h2:mem:json-benchmark", "sa", "");
            ctx = DSL.using(connection);

            statement = connection.createStatement();
            statement.executeUpdate("drop table if exists t");
            statement.executeUpdate("create table t (number_column int, string_column varchar(100))");
            statement.executeUpdate("insert into t select i, i from system_range(1, 10) as t(i)");

            // PostgreSQL setup
//            try (InputStream is = BenchmarkState.class.getResourceAsStream("/config.postgres.properties")) {
//                Properties p = new Properties();
//                p.load(is);
//                Class.forName(p.getProperty("db.postgres.driver"));
//                connection = DriverManager.getConnection(
//                    p.getProperty("db.postgres.url"),
//                    p.getProperty("db.postgres.username"),
//                    p.getProperty("db.postgres.password")
//                );
//                ctx = DSL.using(connection);
//
//                statement = connection.createStatement();
//                statement.executeUpdate("drop table if exists t");
//                statement.executeUpdate("create table t (number_column int, string_column varchar(100))");
//                statement.executeUpdate("insert into t select i, i from generate_series(1, 10) as t(i)");
//            }
        }

        @TearDown(Level.Trial)
        public void teardown() throws Exception {
            statement.executeUpdate("drop table if exists t");
            statement.close();
            connection.close();
        }
    }

    // Just show that both methods really do the same thing
    public static void main(String[] args) throws Exception {
        JSONObjectVsArrayBenchmark f = new JSONObjectVsArrayBenchmark();
        BenchmarkState state = new BenchmarkState();
        state.setup();

        System.out.println(f.testJsonObject(state));
        System.out.println(f.testJsonArray(state));
    }

    record R(int n, String s) {}

    @Benchmark
    public List<R> testJsonObject(BenchmarkState state) throws SQLException {
        List<R> result = new ArrayList<>();

        try (ResultSet rs = state.statement.executeQuery(

            // H2 syntax (if only there were a way to abstract over syntax ;)
            """
            select
              json_arrayagg(json_object(
                key number_column value number_column,
                key string_column value string_column
              ))
            from t
            """

            // PostgreSQL syntax
//            """
//            select
//              json_agg(json_build_object(
//                number_column, number_column,
//                string_column, string_column
//              ))
//            from t
//            """
        )) {
            while (rs.next()) {
                new JSONParser(state.ctx, rs.getString(1), ch(result)).parse();
            }
        }

        return result;
    }

    @Benchmark
    public List<R> testJsonArray(BenchmarkState state) throws SQLException {
        List<R> result = new ArrayList<>();

        try (ResultSet rs = state.statement.executeQuery(

            // H2 syntax
            """
            select
              json_arrayagg(json_array(number_column, string_column))
            from t
            """

            // PostgreSQL syntax
//            """
//            select
//              json_agg(json_build_array(number_column, string_column))
//            from t
//            """
        )) {
            while (rs.next()) {
                new JSONParser(state.ctx, rs.getString(1), ch(result)).parse();
            }
        }

        return result;
    }

    private JSONContentHandler ch(List<R> result) {
        return new JSONContentHandler() {

            int level;
            int n;
            String s;

            @Override
            public void valueTrue() {
            }

            @Override
            public void valueString(String string) {
                s = string;
            }

            @Override
            public void valueNumber(String string) {
                n = Integer.parseInt(string);
            }

            @Override
            public void valueNull() {
            }

            @Override
            public void valueFalse() {
            }

            @Override
            public void startObject() {
                if (level >= 1) {
                    n = 0;
                    s = null;
                }
            }

            @Override
            public void startArray() {
                if (level++ >= 1) {
                    n = 0;
                    s = null;
                }
            }

            @Override
            public void property(String key) {
            }

            @Override
            public void endObject() {
                result.add(new R(n, s));
            }

            @Override
            public void endArray() {
                // Only the inner arrays produce a record; skip the outer one
                if (level-- > 1) {
                    result.add(new R(n, s));
                }
            }
        };
    }
}
