
GH-46522: [C++][FlightRPC] Add Arrow Flight SQL ODBC driver#40939

Merged
lidavidm merged 19 commits into apache:main from jbonofre:FLIGHT_SQL_ODBC
May 21, 2025
Conversation

@jbonofre
Member

@jbonofre jbonofre commented Apr 2, 2024

Rationale for this change

An ODBC driver implements the Open Database Connectivity (ODBC) interface, which allows applications to access data in DBMS-like environments using SQL.
Since Arrow Flight SQL already provides JDBC and ADBC drivers, it can similarly provide an ODBC driver.

What changes are included in this PR?

This PR adds an ODBC driver implementation.

Are these changes tested?

This ODBC driver comes from a production system (SGA) that has been tested and run for a while.

Are there any user-facing changes?

No changes to existing behavior, but a new user option to access Arrow Flight SQL via ODBC.

@jbonofre jbonofre requested a review from lidavidm as a code owner April 2, 2024 05:59
@jbonofre jbonofre marked this pull request as draft April 2, 2024 05:59
@github-actions

github-actions bot commented Apr 2, 2024

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

In the case of PARQUET issues on JIRA the title also supports:

PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}


@jbonofre
Member Author

jbonofre commented Apr 2, 2024

@ianmcook @lidavidm @assignUser as discussed during the community meetings, here's the ODBC driver PR draft.

This PR includes:

  • source files with legal fixes (I'm checking the whereami source dual license/MIT, possibly alternatives)
  • convenient build files (like a Brewfile)

I'm still working on:

  • finalize Arrow update (the original code was using Arrow 9.x)
  • finalize the build and clean up the CMakeLists.txt files

I will also create a GH Issue to track this addition.

Member

@jduo jduo left a comment

Thanks for this @jbonofre !

Some of the code in odbc_impl should be updated to use the same naming conventions as the arrow project (eg ODBCConnection -> odbc_connection).

This covers the flightsql-odbc repo, but there is additional code needed for the ODBC entry points to build a functional driver. That was in warpdrive, but that code isn't Apache-compatible.

Need to add the license for whereami to the Arrow project itself.

Are the separate brewfile and vcpkg.json necessary or can this utilize the ones already at the top-level?

Should the top-level CMakeLists now include the driver build? Or is that on hold as part of the donation process?

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Apr 4, 2024

@assignUser
Member

finalize Arrow update (the original code was using Arrow 9.x)

I fear I can't really help on that front but once that's done I'd be happy to continue working on the CMake!

@jbonofre
Member Author

jbonofre commented Apr 5, 2024

Hi @jduo !

Some of the code in odbc_impl should be updated to use the same naming conventions as the arrow project (eg ODBCConnection -> odbc_connection).

Yes, I'm doing that (that's why the PR is still a draft).

This covers the flightsql-odbc repo, but there is additional code needed for the ODBC entry points to build a functional driver. That was in warpdrive, but that code isn't Apache-compatible.

Yes, locally I fixed/updated it, but I'm looking for an alternative.

Need to add the license for whereami to the Arrow project itself.

I'm checking if we can find an alternative; otherwise I will update the NOTICE file.

Are the separate brewfile and vcpkg.json necessary or can this utilize the ones already at the top-level?

Yes, that's the plan. I'm doing it in two steps: fixing the build, then integrating into the Arrow ecosystem.

Should the top-level CMakeLists now include the driver build? Or is that on hold as part of the donation process?

Agreed, same as for the Brewfile/vcpkg.json: I'm doing that in two steps.

I will push new changes to this PR and remove the draft state as soon as it's ready for review (build OK, etc.).

@jbonofre
Member Author

jbonofre commented Apr 5, 2024

finalize Arrow update (the original code was using Arrow 9.x)

I fear I can't really help on that front but once that's done I'd be happy to continue working on the CMake!

No worries, I'm working on it :) I will keep you posted when it's ready for review (when I remove the draft flag).

@jduo
Member

jduo commented Apr 10, 2024

This covers the flightsql-odbc repo, but there is additional code needed for the ODBC entry points to build a functional driver. That was in warpdrive, but that code isn't Apache-compatible.

Yes, locally I fixed/updated it, but I'm looking for alternative.

There was another group that did some work to build a full driver from flightsql-odbc. I haven't seen it myself, but they've mentioned it here:
#30622 (comment)

@lidavidm
Member

lidavidm commented Jul 2, 2024

Following up here - do you want any help?

@jduo
Member

jduo commented Jul 16, 2024

@jbonofre checking in on this. Were you still continuing work on this? Did you need help?
Thanks

@jbonofre
Member Author

@jduo yes, I'm back on it (sorry, I was busy on Iceberg and ASF stuff). Let me do a new rebase/build/update and I will ping you for review/help.

Thanks! And sorry again for the delay :/

@jbonofre
Member Author

jbonofre commented Sep 3, 2024

@lidavidm @jduo hey guys. I'm very sorry to have been quiet for so long. Between vacation and ASF tasks, I was not able to be active enough on Arrow. I'm now back and resuming my work on Arrow, including ODBC.

@lidavidm
Member

lidavidm commented Sep 3, 2024

No worries! Glad to see you back around :)

@rpourzand

Hey @jbonofre we're excited about the AFS ODBC driver! We want to start using it as soon as we can for a project we're working on, and we were wondering whether you have a sense of when it might be ready to be in someone's hands for development. We would be happy to be early adopters :)

@jduo
Member

jduo commented Sep 4, 2024

Thanks @jbonofre . We're around if you need assistance moving this forward.

@jduo
Member

jduo commented Oct 1, 2024

Hi @jbonofre , just checking in on if we can help to move this forward.

@jbonofre
Member Author

jbonofre commented Oct 2, 2024

Hey @jduo
I started the rebase. I will push it.
Would you have time for a quick sync?
Thanks!

@aiguofer
Contributor

aiguofer commented Oct 2, 2024

I'm curious, does this version of the driver allow passing arbitrary parameters? I tried using the Dremio ODBC driver, but it doesn't seem to be able to set arbitrary parameters. I have the following in a custom PowerBI connector:

        // This record contains all of the connection string properties we
        // will set for this ODBC driver. The 'Driver' field is required for
        // all ODBC connections. Other properties will vary between ODBC drivers,
        // but generally take Server and Database properties. Note that
        // credential related properties will be set separately.
        ConnectionString = [
            Driver = "Arrow Flight SQL ODBC Driver",
            host = server,
            port = 443,
            environmentId = "200532"
        ],

But I get the following error: ODBC: ERROR [HY000] [Apache Arrow][Flight SQL] (100) Flight returned invalid argument error, with message: environmentId is a required connection parameter

For our use-case we need to be able to pass various parameters as extra headers, which is supported by both ADBC and JDBC.

@jduo
Member

jduo commented Oct 2, 2024

I'm curious, does this version of the driver allow passing arbitrary parameters? [...] But I get the following error: ODBC: ERROR [HY000] [Apache Arrow][Flight SQL] (100) Flight returned invalid argument error, with message: environmentId is a required connection parameter

For our use-case we need to be able to pass various parameters as extra headers, which is supported by both ADBC and JDBC.

Last I tried, it supported this. I believe it lower-cases headers though (the C++ grpc client itself requires this) -- is your server looking for environmentId case-insensitively?

@jduo
Member

jduo commented Oct 2, 2024

Hey @jduo I started the rebase. I will push it. Would you have time for a quick sync? Thanks!

Hi @jbonofre , yes we can have a chat. Will send meeting info offline.

@aiguofer
Contributor

aiguofer commented Oct 2, 2024

Last I tried, it supported this. I believe it lower-cases headers though (the C++ grpc client itself requires this) -- is your server looking for environmentId case-insensitively?

Yeah, it's case-insensitive. I also tried setting environmentid in all lower-case and still get the same error.

@aiguofer
Contributor

aiguofer commented Oct 2, 2024

Interesting, I do see:

TEST(PopulateCallOptionsTest, GenericOption) {
  FlightSqlConnection connection(odbcabstraction::V_3);
  connection.SetClosed(false);

  Connection::ConnPropertyMap properties;
  properties["Foo"] = "Bar";
  auto options = connection.PopulateCallOptions(properties);
  auto headers = options.headers;
  ASSERT_EQ(1, headers.size());

  // Header name must be lower-case because gRPC will crash if it is not lower-case.
  ASSERT_EQ("foo", headers[0].first);

  // Header value should preserve case.
  ASSERT_EQ("Bar", headers[0].second);
}

Maybe there's a bug with call options not getting set correctly for the HANDSHAKE request? I remember that being an issue with the JDBC driver: #33946

@CurtHagenlocher
Contributor

Last I tried, it supported this. I believe it lower-cases headers though (the C++ grpc client itself requires this)

This is actually part of the HTTP/2 spec.
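To illustrate the constraint being discussed: RFC 7540 (HTTP/2) section 8.1.2 requires header field *names* to be lowercase, while header *values* keep their case. A minimal sketch of the normalization a client could apply before sending (the function name is illustrative, not from the driver):

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Normalize an ASCII header name to lowercase, as HTTP/2 requires
// (RFC 7540 section 8.1.2). Header values are left untouched.
std::string LowerHeaderName(std::string name) {
  std::transform(name.begin(), name.end(), name.begin(), [](unsigned char c) {
    return static_cast<char>(std::tolower(c));
  });
  return name;
}
```

This matches the behavior asserted in the driver's `PopulateCallOptionsTest` above: `"Foo"` becomes `"foo"` while the value `"Bar"` is preserved.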

@lidavidm lidavidm merged commit ed36107 into apache:main May 21, 2025
40 of 41 checks passed
@lidavidm lidavidm removed the awaiting changes Awaiting changes label May 21, 2025
Member

@lidavidm lidavidm left a comment

(Review in progress...)

Also to double check, none of these headers get installed, right? If they do we should block them from getting installed

Longer term...I wonder if the ODBC driver should be in the arrow/ tree, or if we should consider putting it at the top level (next to arrow/, parquet/), or even in its own repo potentially.

Comment on lines +28 to +29
namespace driver {
namespace flight_sql {
Member

Let's fix the namespace: this should be something like arrow::flight::sql::odbc
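For illustration, the suggested layout can be written as a C++17 nested namespace definition (the type inside is a hypothetical placeholder, not from the PR):

```cpp
#include <string>

// C++17 nested namespace definition for the suggested layout.
namespace arrow::flight::sql::odbc {

// Hypothetical placeholder; the real driver classes would live here.
struct DriverInfo {
  std::string name = "Arrow Flight SQL ODBC";
};

}  // namespace arrow::flight::sql::odbc
```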


namespace {
template <typename T>
int64_t convertDate(typename T::value_type value) {
Member

Please update functions to Arrow naming conventions (ConvertDate, not convertDate)

Member

By codebase convention this should be called "api.h".

Member

In general, we don't have a separate "include" subdirectory. Headers go next to source files.

Member

What is the purpose of this application? Should it be converted to tests instead?

odbcabstraction::Diagnostics& diagnostic) {
auto* buffer = static_cast<TIME_STRUCT*>(binding->buffer);

tm time{};
Member

In general, why are we using this instead of C++ stdlib chrono/date?


using odbcabstraction::GetTimeForSecondsSinceEpoch;

TEST(TEST_TIMESTAMP, TIMESTAMP_WITH_MILLI) {
Member

FWIW, the test suite/case name are effectively class names, so you'd normally expect to see TestTimestamp, TimestampWithMilli

namespace odbcabstraction {

/// \brief Supported ODBC versions.
enum OdbcVersion { V_2, V_3, V_4 };
Member

Can we use enum class?
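A sketch of the requested change: with a scoped enum the enumerators no longer leak into the enclosing namespace, and conversions to integers must be explicit.

```cpp
// Scoped version of the OdbcVersion enum from the snippet above.
enum class OdbcVersion { V_2, V_3, V_4 };

// Unlike a plain enum, an explicit cast is needed for the integer value.
constexpr int ToInt(OdbcVersion v) { return static_cast<int>(v); }
```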

constexpr ssize_t MICRO_TO_SECONDS_DIVISOR = 1000000;
constexpr ssize_t NANO_TO_SECONDS_DIVISOR = 1000000000;

typedef struct tagDATE_STRUCT {
Member

Why are these typedef struct? Just use normal struct?
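In C++ the C-style `typedef struct tagX { ... } X;` is redundant; the struct declaration already introduces the type name. A sketch of the equivalent plain declaration (field types assumed from the ODBC `DATE_STRUCT` layout):

```cpp
#include <cstdint>

// Plain C++ struct; no typedef needed to use the name directly.
struct DATE_STRUCT {
  int16_t year;
  uint16_t month;
  uint16_t day;
};
```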

};

struct MetadataSettings {
boost::optional<int32_t> string_column_length_{boost::none};
Member

Use stdlib

@github-actions github-actions bot added the awaiting changes Awaiting changes label May 21, 2025
Member

@lidavidm lidavidm left a comment

(in progress...)

namespace {

#if defined _WIN32 || defined _WIN64
std::string utf8_to_clocale(const char* utf8str, int len) {
Member

Naming conventions: Utf8ToCLocale, utf8_str

std::string utf8_to_clocale(const char* utf8str, int len) {
thread_local boost::locale::generator g;
g.locale_cache_enabled(true);
std::locale loc = g(boost::locale::util::get_system_locale());
Member

Aren't locales thread safe? Can't we initialize this once in a static instead of using a thread local?
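A sketch of the suggestion, assuming the generated locale is only read after construction: a function-local static is initialized exactly once and thread-safely under C++11 "magic statics", instead of once per thread with `thread_local`.

```cpp
#include <locale>

// One-time, thread-safe initialization (guaranteed for function-local
// statics since C++11). Stand-in for the boost::locale generator call;
// assumes the locale object is immutable after creation.
const std::locale& CachedSystemLocale() {
  static const std::locale loc{};
  return loc;
}
```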

namespace flight_sql {
namespace config {

static const char DEFAULT_DSN[] = "Apache Arrow Flight SQL";
Member

Is there a reason why we sometimes use static const char[] and other times use constexpr std::string_view? IMO the latter is preferable unless there is a reason
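A sketch of the preferred form: `constexpr std::string_view` carries the length with the data and supports compile-time comparison, unlike a `static const char[]`.

```cpp
#include <string_view>

// Replaces: static const char DEFAULT_DSN[] = "Apache Arrow Flight SQL";
constexpr std::string_view kDefaultDsn = "Apache Arrow Flight SQL";

// The length is available at compile time.
static_assert(kDefaultDsn.size() == 23);
```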

Comment on lines +100 to +106
Configuration::Configuration() {
// No-op.
}

Configuration::~Configuration() {
// No-op.
}
Member

If these are no-ops, use =default
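The defaulted form of the snippet above; `= default` lets the compiler generate the no-op members and, for this empty sketch, keeps the type trivially destructible (the real Configuration class has members, so that last property may not hold there):

```cpp
#include <type_traits>

class Configuration {
 public:
  Configuration() = default;   // replaces the empty-bodied constructor
  ~Configuration() = default;  // replaces the empty-bodied destructor
};

static_assert(std::is_trivially_destructible<Configuration>::value);
```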

Comment on lines +117 to +139
Set(FlightSqlConnection::DSN, dsn);
Set(FlightSqlConnection::HOST, ReadDsnString(dsn, FlightSqlConnection::HOST));
Set(FlightSqlConnection::PORT, ReadDsnString(dsn, FlightSqlConnection::PORT));
Set(FlightSqlConnection::TOKEN, ReadDsnString(dsn, FlightSqlConnection::TOKEN));
Set(FlightSqlConnection::UID, ReadDsnString(dsn, FlightSqlConnection::UID));
Set(FlightSqlConnection::PWD, ReadDsnString(dsn, FlightSqlConnection::PWD));
Set(FlightSqlConnection::USE_ENCRYPTION,
ReadDsnString(dsn, FlightSqlConnection::USE_ENCRYPTION, DEFAULT_ENABLE_ENCRYPTION));
Set(FlightSqlConnection::TRUSTED_CERTS,
ReadDsnString(dsn, FlightSqlConnection::TRUSTED_CERTS));
Set(FlightSqlConnection::USE_SYSTEM_TRUST_STORE,
ReadDsnString(dsn, FlightSqlConnection::USE_SYSTEM_TRUST_STORE,
DEFAULT_USE_CERT_STORE));
Set(FlightSqlConnection::DISABLE_CERTIFICATE_VERIFICATION,
ReadDsnString(dsn, FlightSqlConnection::DISABLE_CERTIFICATE_VERIFICATION,
DEFAULT_DISABLE_CERT_VERIFICATION));

auto customKeys = ReadAllKeys(dsn);
RemoveAllKnownKeys(customKeys);
for (auto key : customKeys) {
std::string_view key_sv(key);
Set(key, ReadDsnString(dsn, key_sv));
}
Member

Are we effectively re-parsing the DSN from scratch for each key? Can't we parse it up front into keys/values then assign?

Collaborator

ReadDsnString grabs the DSN value directly from the DSN config file using Windows Driver Manager's implementation of SQLGetPrivateProfileString, which gets key/value pairs from the DSN. SQLGetPrivateProfileString is the only Driver Manager ODBC API that reads from ODBC.INI file, so it is used here.
I think the implementation of SQLGetPrivateProfileString is closed-source, so I don't know whether it parses the DSN upfront under the hood.

while (!connect_str.empty()) {
size_t attr_begin = connect_str.rfind(delimiter);

if (attr_begin == std::string::npos)
Member

Don't omit braces

/// \param connPropertyMap the map with the Connection properties.
/// \return An instance of the FlightSqlSslConfig.
std::shared_ptr<FlightSqlSslConfig> LoadFlightSslConfigs(
const odbcabstraction::Connection::ConnPropertyMap& connPropertyMap);
Member

conn_property_map

Comment on lines +208 to +211
int main(int argc, char** argv) {
::testing::InitGoogleTest(&argc, argv);
return RUN_ALL_TESTS();
}
Member

You shouldn't need this


using arrow::RecordBatch;

using std::optional;
Member

Don't using std types, just write std::optional


using std::optional;

class GetTablesReader {
Member

docstring?

Member

@lidavidm lidavidm left a comment

(in progress)

Is there anything explaining how the different sublibraries (odbcabstraction, etc.) fit together here?

constexpr auto SYSTEM_TRUST_STORE_DEFAULT = true;
constexpr auto STORES = {"CA", "MY", "ROOT", "SPC"};

inline std::string GetCerts() {
Member

Why is this inline?

using odbcabstraction::MetadataSettings;
using odbcabstraction::ResultSet;

typedef struct {
Member

Why the C-ism? Just struct ColumnNames.

std::string curr_parse; // the current string

for (char temp : table_type) { // while still in the string
switch (temp) { // switch depending on the character
Member

Many of these comments are just noise and aren't necessary. I would rather we have docstrings, since the purpose of many functions is not necessarily clear. Also, for functions that parse a value, it would be more helpful to document the format being parsed and give examples of valid values.

if (config_file.fail()) {
auto error_msg = "Arrow Flight SQL ODBC driver config file not found on \"" +
config_file_path + "\"";
std::cerr << error_msg << std::endl;
Member

Let's not spam console. If we're going to do this, at least use the logger.

ODBCStatement::ODBCStatement(
ODBCConnection& connection,
std::shared_ptr<driver::odbcabstraction::Statement> spiStatement)
: m_connection(connection),
Member

We generally don't use the m_ prefix in this codebase. It should be connection_ if it's private.

return;
case SQL_ROWSET_SIZE:
SetAttribute(value, m_rowsetSize);
return;
Member

Why do some cases in this switch return, and others break?

Collaborator

The cases that break set the variable successfully_written, which is checked after the switch. The driver then returns a warning if the attribute was not written successfully.
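A simplified sketch of that control flow (the attribute names and the clamp limit are illustrative, not from the driver): fully-applied attributes return directly, while attributes whose value may be adjusted break out of the switch so a shared check can emit the SQLSTATE 01S02 ("option value changed") warning.

```cpp
enum class Attr { kQueryTimeout, kRowsetSize };

// Returns false when the stored value differs from the requested one,
// in which case the caller emits the "01S02" warning diagnostic record
// (option value changed), as the ODBC spec requires.
bool SetAttribute(Attr attr, int requested, int& out, bool& warned) {
  bool successfully_written = true;
  switch (attr) {
    case Attr::kQueryTimeout:
      out = requested;
      return true;  // fully applied: skip the warning check entirely
    case Attr::kRowsetSize:
      if (requested > 1000) {  // illustrative upper bound
        out = 1000;            // clamp the requested value
        successfully_written = false;
      } else {
        out = requested;
      }
      break;  // fall through to the shared warning check
  }
  if (!successfully_written) warned = true;  // AddWarning("...", "01S02", ...)
  return successfully_written;
}
```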

"HY092");
}
if (!successfully_written) {
GetDiagnostics().AddWarning("Optional value changed.", "01S02",
Member

Why is it a warning that an optional value changed?

Collaborator

The ODBC driver is required to log a warning diagnostic record for "option value changed" (SQLSTATE 01S02).


SQLSMALLINT evaluatedCType = cType;

// TODO: Get proper default precision and scale from abstraction.
Member

Can we file issues for these TODOs? Is there any plan to tackle them?


inline void ThrowIfNotOK(const arrow::Status& status) {
if (!status.ok()) {
throw odbcabstraction::DriverException(status.message());
Member

message will throw away other metadata like the status code and status details, perhaps use ToString
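To illustrate the point with a minimal stand-in (this is not arrow::Status itself): a message()-style accessor carries only the text, while a ToString()-style rendering keeps the status code as well, which is what you want when converting the status into an exception.

```cpp
#include <string>

// Minimal stand-in for a status type; arrow::Status::ToString() likewise
// includes the status code, whereas message() is only the bare text.
struct FakeStatus {
  std::string code;
  std::string msg;
  const std::string& message() const { return msg; }          // drops the code
  std::string ToString() const { return code + ": " + msg; }  // keeps the code
};
```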

return builder.Append(value);
}

odbcabstraction::SqlDataType GetDataTypeFromArrowField_V3(
Member

What is the '_V3' supposed to denote?

Collaborator

_V3 refers to ODBC Version 3 API backend function, whereas _V2 refers to ODBC version 2 API. Version 3 and Version 2 have slightly different behaviors.

}

odbcabstraction::SqlDataType GetDataTypeFromArrowField_V3(
const std::shared_ptr<arrow::Field>& field, bool useWideChar);
Member

use_wide_char

odbcabstraction::SqlDataType EnsureRightSqlCharType(
odbcabstraction::SqlDataType data_type, bool useWideChar);

int16_t ConvertSqlDataTypeFromV3ToV2(int16_t data_type_v3);
Member

Why untyped int16_t? Aren't there enums?

Collaborator

The function is not using an enum because, IMO, enums aren't necessary in this case. The ODBC data type returned here serves as an output value only (to the BI tools etc.), not for working with other parts of the driver.
The ODBC data type values are fixed macros defined inside the system ODBC library (e.g., sqlext.h):

https://github.com/microsoft/ODBCTest/blob/0d629c7e4ff7b01398a5ac71d20c43362d0f43bf/ODBC%204.0/inc/sqlext.h#L463-L464

const int32_t decimal_precison = 0);

optional<int32_t> GetBufferLength(odbcabstraction::SqlDataType data_type,
const optional<int32_t>& column_size);
Member

IMO, it's better to just pass optional<int32_t> than pass by const& (optional<int32_t> is essentially a value type already so avoid the extra reference)

Member

These utils feel like a bit of a grab bag and could probably be organized better

namespace flight_sql {

namespace {
bool IsComplexType(arrow::Type::type type_id) {
Member

The codebase usually calls these 'nested' types. Also, there's already is_nested in type_traits.h

case odbcabstraction::SqlDataType_LONGVARBINARY:
return "LONGVARBINARY";
case odbcabstraction::SqlDataType_TYPE_DATE:
case 9:
Member

Why isn't this part of the enum?

}
}

ArrayConvertTask GetConverter(arrow::Type::type original_type_id,
Member

It seems this should take a full DataType and not just the type ID? In the source it is called in a context where it has the full type. That way you don't need GetDefaultDataTypeForTypeId (Also, maybe it should be moved to flight_sql_result_set_column.cc since it is only used there?)

}

int32_t GetDecimalTypeScale(const std::shared_ptr<arrow::DataType>& decimalType) {
auto decimal128Type = std::dynamic_pointer_cast<arrow::Decimal128Type>(decimalType);
Member

Use checked_pointer_cast?

Member

Though frankly you probably just want checked_cast. You don't need a new shared_ptr instance here.

Member

@lidavidm lidavidm left a comment

In general, this codebase needs to be brought in line with Arrow conventions over time


/// \brief Tells if ssl is enabled. By default it will be true.
/// \return Whether ssl is enabled.
bool useEncryption() const;
Member

Should be either UseEncryption or use_encryption (because this is a trivial getter), ditto for below

bool use_wide_char_;
bool is_bound_;

inline Accessor* GetAccessorForBinding() { return cached_accessor_.get(); }
Member

Why are these declared inline?

}

Status Visit(const arrow::Int64Scalar& scalar) override {
writer_.Int64(scalar.value);
Member

What is rapidJSON's behavior for int64s that are not representable as JSON numbers?

Collaborator

I see from tests that rapidJSON can still convert the max and min int64 values to strings, so rapidJSON seems to support all numbers within the range [-9223372036854775808, 9223372036854775807]. For example, the following test works:

TEST(ConvertToJson, Int64) {
  ASSERT_EQ("9223372036854775807",
            ConvertToJson(arrow::Int64Scalar(9223372036854775807LL)));
  // 9223372036854775808ULL is not valid as a signed int64, using workaround
  ASSERT_EQ("-9223372036854775808",
            ConvertToJson(arrow::Int64Scalar(
                static_cast<int64_t>(-9223372036854775807LL - 1))));
}

Status ConvertBinaryToBase64StringAndWrite(
const BinaryScalarT& scalar, rapidjson::Writer<rapidjson::StringBuffer>& writer) {
const auto& view = scalar.view();
size_t encoded_size = base64::encoded_size(view.length());
Member

Arrow has base64 utilities already, can we use them instead of Boost?

Comment on lines +27 to +37
namespace driver {
namespace odbcabstraction {
class Statement;
class ResultSet;
} // namespace odbcabstraction
} // namespace driver

namespace ODBC {
class ODBCConnection;
class ODBCDescriptor;
} // namespace ODBC
Member

You may want to create type_fwd.h headers like the rest of the Arrow codebase

#include <mutex>

/**
* @brief An abstraction over a generic ODBC handle.
Member

N.B. this codebase typically uses \brief, not @brief, for Doxygen.

#include <memory>
#include <string>

#define _SILENCE_CXX17_CODECVT_HEADER_DEPRECATION_WARNING
Member

Can we get rid of codecvt? It's been deprecated/removed

Member

Do we really need multiple layers of logging abstractions? especially as this header still requires spdlog...

#include <functional>
#include <string>

#include <spdlog/fmt/bundled/format.h>
Member

Once we move to C++20 (soon), we need to use stdlib format, not poke into spdlog's internals

namespace driver {
namespace odbcabstraction {

enum ODBCErrorCodes : int32_t {
Member

(1) enum class
(2) Given the very specific codes, is there a reference/source for these values?

}

bool FlightSqlStatement::ExecutePrepared() {
assert(prepared_statement_.get() != nullptr);
Member

use Arrow DCHECK macros?

using arrow::Result;

namespace {
Result<std::shared_ptr<Array>> MakeEmptyArray(std::shared_ptr<DataType> type,
Member

Use MakeArrayOfNull


void ReportSystemFunction(const std::string& function, uint32_t& current_sys_functions,
uint32_t& current_convert_functions);
void ReportNumericFunction(const std::string& function, uint32_t& current_functions);
Member

Can the purpose of these functions be documented? Even looking at the implementation it's not very clear

if (result != system_functions.end()) {
current_sys_functions |= result->second;
} else if (function == "CONVERT") {
// CAST and CONVERT are system functions from FlightSql/Calcite, but are
Member

AFAIK Flight SQL doesn't say anything about the actual SQL dialect?

Collaborator

    // CAST and CONVERT are system functions from FlightSql/Calcite, but are
    // CONVERT functions in ODBC. Assume that if CONVERT is reported as a system
    // function, then CAST and CONVERT are both supported.
    current_convert_functions |= SQL_FN_CVT_CONVERT | SQL_FN_CVT_CAST;

Currently, if the server returns CONVERT function support via GetSqlInfo, SQLGetInfo (the ODBC API) includes SQL_FN_CVT_CONVERT | SQL_FN_CVT_CAST in its return value to indicate that both CONVERT and CAST are supported.

I have reached out to @jduo for more context on this code comment. The assumption in the code comment applies to all databases. I think it is not specific to any SQL dialect.

{"SPACE", SQL_FN_STR_SPACE},
{"SUBSTRING", SQL_FN_STR_SUBSTRING},
{"UCASE", SQL_FN_STR_UCASE},
// Additional functions in ODBC but not Calcite:
Member

What does Calcite have to do with this? (Or is this perhaps a Dremio implementation detail leaking out?)

Collaborator

I think it isn't Dremio-specific. Calcite's SqlJdbcFunctionCall class was referenced for this code. At the top of the file, it says

// The list of functions that can be converted from string to ODBC bitmasks is
// based on Calcite's SqlJdbcFunctionCall class.

And I think this Additional functions ... comment is related to that: there are a few functions supported in ODBC that had not yet been implemented in Calcite at the time the code was written.

class Diagnostics {
public:
struct DiagnosticsRecord {
std::string msg_text_;
Member

This is a struct, so the fields are public - no need to suffix with _

static const std::unique_ptr<DiagnosticsRecord> TRUNCATION_WARNING(
new DiagnosticsRecord{"String or binary data, right-truncated.", "01004",
ODBCErrorCodes_TRUNCATION_WARNING});
warning_records_.push_back(TRUNCATION_WARNING.get());
Member

I think this one doesn't need to be a unique_ptr? It shouldn't ever get moved.

// Block while queue is full
std::unique_lock<std::mutex> unique_lock(mtx_);
if (!WaitUntilCanPushOrClosed(unique_lock)) break;
unique_lock.unlock();
Member

Why unlock here, when Push has to lock again below? (Pass push the guard instead?)

Member

Shouldn't we check if closed_ and break?

if (!WaitUntilCanPushOrClosed(unique_lock)) break;
unique_lock.unlock();

// Only one thread at a time be notified and call supplier
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the basis for this is the use of notify_one, that doesn't actually work AFAIK

@lidavidm
Copy link
Member

I'll continue reviewing tomorrow.

@conbench-apache-arrow

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit ed36107.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 4 possible false positives for unstable benchmarks that are known to sometimes produce them.

@chrisfw

chrisfw commented May 21, 2025

Hi @jbonofre , @rscales, @alinaliBQ,
This is really exciting news! I tried using the Dremio Flight SQL ODBC Driver against a ROAPI dataset, but I ran into an issue with the underlying use of the (relatively) new Utf8View data type by DataFusion (used by ROAPI). Can you tell me if that data type is supported by this driver?

Thank you!
Regards,
Chris

@jduo
Member

jduo commented May 21, 2025

Hi @jbonofre , @rscales, @alinaliBQ, This is really exciting news! I tried using the Dremio Flight SQL ODBC Driver against a ROAPI dataset, but I ran into an issue with the underlying use of the (relatively) new Utf8View data type by DataFusion (used by ROAPI). Can you tell me if that data type is supported by this driver?

Thank you! Regards, Chris

Hi @chrisfw .

This driver currently doesn't support Utf8View (it supports the subset of Arrow types that Dremio uses) but it'd be much easier to support now since it is updated with the Arrow codebase. (Likely just a matter of updating switch statements and adding test cases). It'd be great if you added an issue for any additional types that you'd need.

Note that this isn't a complete ODBC driver. It has most of the ODBC implementation logic and all the Flight SQL communication code. -- @alinaliBQ and @rscales are completing the last bits required to implement the ODBC interface.

@chrisfw

chrisfw commented May 21, 2025

Thanks for your reply @jduo. I will definitely add an issue to request an enhancement adding the currently unsupported data types I have encountered.

Regards,
Chris

Member

@lidavidm lidavidm left a comment

Ok, that should be the last of it.

PACKET_SIZE, // uint32_t - The Packet Size
};

typedef boost::variant<std::string, void*, uint64_t, uint32_t> Attribute;
Member

Use std::variant


/// \brief Statement attributes that can be called at anytime.
////TODO: Document attributes
enum StatementAttributeId {
Member

enum class

TimeoutDuration{static_cast<double>(boost::get<size_t>(value))};
} else {
call_options_.timeout = TimeoutDuration{-1};
// Intentional fall-through.
Member

Comment on lines +125 to +126
// Arbitrarily return a negative value
return -1;
Member

Is this correct? Should we error instead?

Collaborator

The GetInfoTypeForArrowConvertEntry function returns the corresponding ODBC SQLGetInfo field (one of the SQL_CONVERT_* field types) for the given Arrow SqlConvert enum value, and returns -1 if there is no corresponding SQLGetInfo field. The caller of GetInfoTypeForArrowConvertEntry checks that the return value is positive, so if -1 is returned, the caller carries on without erroring out.

In other words, GetInfoTypeForArrowConvertEntry tries to get the server's convert-support info, and if that information is not useful to ODBC (there is no corresponding ODBC SQLGetInfo field), it is ignored rather than treated as an error. This approach makes the driver more robust and puts fewer blockers in the user's way.

}

inline int32_t ScalarToInt32(arrow::UnionScalar* scalar) {
return reinterpret_cast<arrow::Int32Scalar*>(scalar->child_value().get())->value;
Member

checked_cast?

Also, why inline?

namespace flight_sql {
namespace config {

#define TRUE_STR "true"
Member

...is there any actual value to having these constants?

const arrow::flight::FlightCallOptions& call_options,
const std::shared_ptr<FlightInfo>& flight_info, size_t queue_capacity)
: queue_(queue_capacity) {
// FIXME: Endpoint iteration should consider endpoints may be at different hosts
Member

File an issue for this?

Comment on lines +51 to +52
const boost::xpressive::sregex CONNECTION_STR_REGEX(
boost::xpressive::sregex::compile("([^=;]+)=({.+}|[^=;]+|[^;])"));
Member

Can we use stdlib regex?

@jmao-denver
Contributor

Congratulations on the merge!

Got an error when trying the build it on Mac M2.

~/git/arrow/cpp/build main *1 ?1 ❯ cmake .. --preset ninja-debug-maximal
...
Vcpkg integrate step - DONE.
-- Retrieving version from /Users/jianfengmao/git/arrow/cpp/build/_deps/azure_sdk-src/sdk/tables/azure-data-tables/src/private/package_version.hpp
-- VERSION_MAJOR 1
-- VERSION_MINOR 0
-- VERSION_PATCH 0
-- VERSION_PRERELEASE beta.4
-- AZ_LIBRARY_VERSION
CMake Error at /opt/homebrew/share/cmake/Modules/FindPackageHandleStandardArgs.cmake:227 (message):
  Could NOT find ODBC (missing: ODBC_INCLUDE_DIR)
Call Stack (most recent call first):
  /opt/homebrew/share/cmake/Modules/FindPackageHandleStandardArgs.cmake:591 (_FPHSA_FAILURE_MESSAGE)
  /opt/homebrew/share/cmake/Modules/FindODBC.cmake:200 (find_package_handle_standard_args)
  cmake_modules/ThirdpartyToolchain.cmake:5584 (find_package)
  CMakeLists.txt:618 (include)

@alinaliBQ
Collaborator

Hi @lidavidm, thanks for reviewing, we (Rob and I) will begin looking into those comments after the development in the ODBC layer (PR #46099) has finished.

@alinaliBQ
Collaborator

Congratulations on the merge!

Got an error when trying the build it on Mac M2.

~/git/arrow/cpp/build main *1 ?1 ❯ cmake .. --preset ninja-debug-maximal
...
Vcpkg integrate step - DONE.
-- Retrieving version from /Users/jianfengmao/git/arrow/cpp/build/_deps/azure_sdk-src/sdk/tables/azure-data-tables/src/private/package_version.hpp
-- VERSION_MAJOR 1
-- VERSION_MINOR 0
-- VERSION_PATCH 0
-- VERSION_PRERELEASE beta.4
-- AZ_LIBRARY_VERSION
CMake Error at /opt/homebrew/share/cmake/Modules/FindPackageHandleStandardArgs.cmake:227 (message):
  Could NOT find ODBC (missing: ODBC_INCLUDE_DIR)
Call Stack (most recent call first):
  /opt/homebrew/share/cmake/Modules/FindPackageHandleStandardArgs.cmake:591 (_FPHSA_FAILURE_MESSAGE)
  /opt/homebrew/share/cmake/Modules/FindODBC.cmake:200 (find_package_handle_standard_args)
  cmake_modules/ThirdpartyToolchain.cmake:5584 (find_package)
  CMakeLists.txt:618 (include)

Hi Jianfeng, installing an ODBC driver manager (such as unixODBC or iODBC) on your machine might help with the issue.

@hiroyuki-sato
Collaborator

Hello, @jmao-denver

I saw this part in my Gmail. but I can't find the comment in this PR.

-- Looking for backtrace
-- Looking for backtrace - found
-- backtrace facility detected in default set of libraries
-- Found Backtrace: /usr/include  
CMake Error at /usr/share/cmake-3.28/Modules/FindCUDAToolkit.cmake:855 (message):
  Could not find nvcc, please set CUDAToolkit_ROOT.
Call Stack (most recent call first):
  src/arrow/gpu/CMakeLists.txt:44 (find_package)

I'm using the following steps on my macOS.

git clone https://github.com/apache/arrow/
cd arrow/cpp
mkdir build
cd build
cmake .. --preset ninja-debug-maximal \
  -DCMAKE_INSTALL_PREFIX=/tmp/local \
  -DARROW_CUDA=OFF \
  -DARROW_SKYHOOK=OFF \
  -DCMAKE_OSX_SYSROOT=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk \
  -DARROW_EXTRA_ERROR_CONTEXT=OFF
cmake --build .

You need to wait for this PR to be merged: #46622

@jmao-denver
Contributor

Hi @hiroyuki-sato, yeah I deleted it because I realized that I just needed to turn CUDA off. The build on my MBP M2 still failed with what seemed to be compatibility issues caused by the version of the compiler I was using. I then went to try the build on Windows with the latest Visual Studio Community Edition (2022). It built but running the produced arrow_odbc_spi_imp_cli.exe generated a crash. I also briefly tried to build it on ubuntu and it failed there as well.

I was basically doing something like this on macOS and Ubuntu:

cmake .. --preset ninja-debug-minimal -DARROW_FLIGHT_SQL_ODBC=ON
cmake --build .

and some additional options on Windows.

@hiroyuki-sato
Collaborator

Hi, @jmao-denver
In my opinion, the ODBC driver has just been merged and there are open issues like #46622, so it's better to wait a bit.
In any case, this PR is already merged, so if you run into any problems, it would be better to open a new discussion or issue.

@lidavidm
Member

To be clear: what was donated and merged is not a complete ODBC driver. The contributors are still working on the rest of the necessary pieces. (Dremio's original driver was a mix of GPL and Apache licensed code, and of course the GPL licensed code can't be used here.)

@alinaliBQ
Collaborator

@lidavidm Yes, that's correct. What was merged is not a complete driver.
Folks are welcome to look at my team's in-progress ODBC driver at #46099 for Windows (it is not tested on macOS/Linux); the PR #46099 code is a replacement for the GPL-licensed code in the original Dremio ODBC driver. PR #46099 will be continuously updated with further ODBC development commits.
