<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by OpenMLDB Blogs on Medium]]></title>
        <description><![CDATA[Stories by OpenMLDB Blogs on Medium]]></description>
        <link>https://medium.com/@openmldb?source=rss-5e0ba83dc987------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*EMd2bjV8-8fzIAfj3ygQfQ.png</url>
            <title>Stories by OpenMLDB Blogs on Medium</title>
            <link>https://medium.com/@openmldb?source=rss-5e0ba83dc987------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Fri, 03 Apr 2026 21:36:32 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@openmldb/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Feature Signatures — Enabling Complete Feature Engineering with SQL]]></title>
            <link>https://openmldb.medium.com/feature-signatures-enabling-complete-feature-engineering-with-sql-7d3c147a278c?source=rss-5e0ba83dc987------2</link>
            <guid isPermaLink="false">https://medium.com/p/7d3c147a278c</guid>
            <category><![CDATA[software-tools]]></category>
            <category><![CDATA[sql]]></category>
            <category><![CDATA[feature-engineering]]></category>
            <category><![CDATA[openmldb]]></category>
            <category><![CDATA[feature-signature]]></category>
            <dc:creator><![CDATA[OpenMLDB Blogs]]></dc:creator>
            <pubDate>Thu, 23 May 2024 07:24:28 GMT</pubDate>
            <atom:updated>2024-05-23T07:26:31.081Z</atom:updated>
            <content:encoded><![CDATA[<h3>Introducing OpenMLDB’s New Feature: Feature Signatures — Enabling Complete Feature Engineering with SQL</h3><h3>Background</h3><p>Rewinding to 2020, the Feature Engine team of Fourth Paradigm filed and was granted an invention patent titled “<a href="https://patents.google.com/patent/CN111752967A">Data Processing Method, Device, Electronic Equipment, and Storage Medium Based on SQL</a>”. This patent innovatively combines the SQL data processing language with machine learning feature signatures, greatly expanding the functional boundaries of SQL statements.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*V5fQ3koN8HFikmZWJPtykA.png" /><figcaption>Screenshot of the patent (in Chinese)</figcaption></figure><p>At that time, no SQL database or OLAP engine on the market supported this syntax, and even on Fourth Paradigm’s machine learning platform, the feature signature function could only be implemented using a custom DSL (Domain-Specific Language).</p><p>Finally, in version v0.9.0, OpenMLDB introduced the feature signature function, supporting sample output in formats such as CSV and LIBSVM. This allows direct integration with machine learning training or prediction while ensuring consistency between offline and online environments.</p><h3>Feature Signatures and Label Signatures</h3><p>The feature signature function in OpenMLDB is implemented as a series of OpenMLDB-customized UDFs (User-Defined Functions) on top of standard SQL. Currently, OpenMLDB supports the following signature functions:</p><ul><li>continuous(column): Indicates that the column is a continuous feature; the column can be of any numerical type.</li><li>discrete(column[, bucket_size]): Indicates that the column is a discrete feature; the column can be of boolean type, integer type, or date and time type. The optional parameter bucket_size sets the number of buckets.
If bucket_size is not specified, the range of values is the entire range of the int64 type.</li><li>binary_label(column): Indicates that the column is a binary classification label; the column must be of boolean type.</li><li>multiclass_label(column): Indicates that the column is a multiclass classification label; the column can be of boolean type or integer type.</li><li>regression_label(column): Indicates that the column is a regression label; the column can be of any numerical type.</li></ul><p>These functions must be used in conjunction with the sample format functions csv or libsvm and cannot be used independently. csv and libsvm can accept any number of parameters, and each parameter needs to be specified using functions like continuous to determine how to sign it. OpenMLDB handles null and erroneous data appropriately, retaining the maximum amount of sample information.</p><h3>Usage Example</h3><p>First, follow the <a href="https://openmldb.ai/docs/en/main/tutorial/standalone_use.html">quick start</a> guide to get the image and start the OpenMLDB server and client.</p><pre>docker run -it 4pdosc/openmldb:0.9.0 bash<br>/work/init.sh<br>/work/openmldb/sbin/openmldb-cli.sh</pre><p>Create a database and import data in the OpenMLDB client.</p><pre>--OpenMLDB CLI<br>CREATE DATABASE demo_db;<br>USE demo_db;<br>CREATE TABLE t1(id string, vendor_id int, pickup_datetime timestamp, dropoff_datetime timestamp, passenger_count int, pickup_longitude double, pickup_latitude double, dropoff_longitude double, dropoff_latitude double, store_and_fwd_flag string, trip_duration int);<br>SET @@execute_mode=&#39;offline&#39;;<br>LOAD DATA INFILE &#39;/work/taxi-trip/data/taxi_tour_table_train_simple.snappy.parquet&#39; INTO TABLE t1 options(format=&#39;parquet&#39;, header=true, mode=&#39;append&#39;);</pre><p>Use the SHOW JOBS command to check the task running status. 
After the task is successfully executed, perform feature engineering and export the training data in CSV format.</p><p>Currently, OpenMLDB does not support overly long column names, so the sample column must be given a short alias such as instance via SELECT csv(...) AS instance.</p><pre>--OpenMLDB CLI<br>USE demo_db;<br>SET @@execute_mode=&#39;offline&#39;;<br>WITH t1 as (SELECT trip_duration,<br>        passenger_count,<br>        sum(pickup_latitude) OVER w AS vendor_sum_pl,<br>        count(vendor_id) OVER w AS vendor_cnt<br>    FROM t1<br>    WINDOW w AS (PARTITION BY vendor_id ORDER BY pickup_datetime ROWS_RANGE BETWEEN 1d PRECEDING AND CURRENT ROW))<br>SELECT csv(<br>    regression_label(trip_duration),<br>    continuous(passenger_count),<br>    continuous(vendor_sum_pl),<br>    continuous(vendor_cnt),<br>    discrete(vendor_cnt DIV 10)) AS instance<br>FROM t1 INTO OUTFILE &#39;/tmp/feature_data_csv&#39; OPTIONS(format=&#39;csv&#39;, header=false, quote=&#39;&#39;);</pre><p>If LIBSVM format training data is needed, simply change SELECT csv(...) to SELECT libsvm(...). Note that the OPTIONS should still use the CSV format because the exported data only has one column, which already contains the complete LIBSVM format sample.</p><p>Moreover, the libsvm function will start numbering continuous features and discrete features with a known number of buckets from 1. Therefore, specifying the number of buckets ensures that the feature encoding ranges of different columns do not conflict.
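To make that numbering concrete, here is a small hypothetical sketch (not OpenMLDB's actual implementation) of how contiguous libsvm index ranges could be assigned so that continuous features and bucketed discrete features never collide:

```python
# Hypothetical sketch of libsvm-style index assignment: continuous
# features take one slot each, and discrete features with a known
# bucket_size reserve a contiguous range, starting from index 1.
def assign_index_ranges(features):
    """features: list of ("continuous",) or ("discrete", bucket_size)."""
    ranges, next_index = [], 1
    for spec in features:
        if spec[0] == "continuous":
            ranges.append((next_index, next_index))  # one slot
            next_index += 1
        else:  # discrete feature with a known bucket count
            _, buckets = spec
            ranges.append((next_index, next_index + buckets - 1))
            next_index += buckets
    return ranges

# Three continuous features followed by a 100-bucket discrete feature,
# mirroring the query in this article.
print(assign_index_ranges([("continuous",), ("continuous",),
                           ("continuous",), ("discrete", 100)]))
# -> [(1, 1), (2, 2), (3, 3), (4, 103)]
```

A discrete feature with no declared bucket count has no reserved range in this scheme, which is why its hashed encoding can occasionally collide with another column's indices.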
If the number of buckets for discrete features is not specified, there is a small probability of feature signature conflict in some samples.</p><pre>--OpenMLDB CLI<br>USE demo_db;<br>SET @@execute_mode=&#39;offline&#39;;<br>WITH t1 as (SELECT trip_duration,<br>        passenger_count,<br>        sum(pickup_latitude) OVER w AS vendor_sum_pl,<br>        count(vendor_id) OVER w AS vendor_cnt<br>    FROM t1<br>    WINDOW w AS (PARTITION BY vendor_id ORDER BY pickup_datetime ROWS_RANGE BETWEEN 1d PRECEDING AND CURRENT ROW))<br>SELECT libsvm(<br>    regression_label(trip_duration),<br>    continuous(passenger_count),<br>    continuous(vendor_sum_pl),<br>    continuous(vendor_cnt),<br>    discrete(vendor_cnt DIV 10, 100)) AS instance<br>FROM t1 INTO OUTFILE &#39;/tmp/feature_data_libsvm&#39; OPTIONS(format=&#39;csv&#39;, header=false, quote=&#39;&#39;);</pre><h3>Summary</h3><p>By combining SQL with machine learning, feature signatures simplify the data processing workflow, making feature engineering more efficient and consistent. This innovation extends the functional boundaries of SQL, supports outputting data samples in various formats that connect directly to machine learning training and prediction, and improves the flexibility and accuracy of data processing, with significant implications for data science and engineering practices.</p><p>OpenMLDB introduces signature functions to further bridge the gap between feature engineering and machine learning frameworks. By uniformly signing samples with OpenMLDB, offline and online consistency can be improved throughout the entire process, reducing maintenance and change costs.
In the future, OpenMLDB will add more signature functions, including one-hot encoding and feature crossing, to make the information in sample feature data more easily utilized by machine learning frameworks.</p><p><strong>For more information on OpenMLDB:</strong></p><ul><li>Official website: <a href="https://openmldb.ai/">https://openmldb.ai/</a></li><li>GitHub: <a href="https://github.com/4paradigm/OpenMLDB">https://github.com/4paradigm/OpenMLDB</a></li><li>Documentation: <a href="https://openmldb.ai/docs/en/">https://openmldb.ai/docs/en/</a></li><li>Join us on <a href="https://join.slack.com/t/openmldb/shared_invite/zt-ozu3llie-K~hn9Ss1GZcFW2~K_L5sMg"><strong>Slack</strong></a><strong> </strong>!</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*8AtcErgR4aDgzpyKaYdqKQ.png" /></figure><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=7d3c147a278c" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[OpenMLDB v0.9.0]]></title>
            <link>https://openmldb.medium.com/openmldb-v0-9-0-f8c083add4b0?source=rss-5e0ba83dc987------2</link>
            <guid isPermaLink="false">https://medium.com/p/f8c083add4b0</guid>
            <category><![CDATA[openmldb]]></category>
            <category><![CDATA[sql]]></category>
            <dc:creator><![CDATA[OpenMLDB Blogs]]></dc:creator>
            <pubDate>Thu, 02 May 2024 09:13:07 GMT</pubDate>
            <atom:updated>2024-05-03T02:02:47.271Z</atom:updated>
            <content:encoded><![CDATA[<h3>OpenMLDB v0.9.0 Release: Major Upgrade in SQL Capabilities Covering the Entire Feature Serving Process</h3><p>OpenMLDB has just released a new version, v0.9.0, including SQL syntax extensions, MySQL protocol compatibility, TiDB storage support, online feature computation, feature signatures, and more. Among these, the most noteworthy features are the MySQL protocol and ANSI SQL compatibility, along with the extended SQL syntax capabilities.</p><p>Firstly, MySQL protocol compatibility allows OpenMLDB users to access OpenMLDB clusters with any MySQL client: not only GUI applications like Navicat and Sequel Ace, but also the Java JDBC MySQL Driver, Python SQLAlchemy, the Go MySQL Driver, and MySQL SDKs for various other programming languages. For more information, you can refer to “<a href="https://openmldb.medium.com/ultra-high-performance-database-openm-ysq-ldb-seamless-compatibility-with-mysql-protocol-and-d3f60210feea"><strong>Ultra High-Performance Database OpenM(ysq)LDB: Seamless Compatibility with MySQL Protocol and Multi-Language MySQL Client</strong></a>”.</p><p>Secondly, the new version significantly expands SQL capabilities, especially by implementing OpenMLDB’s unique request mode and stored procedure execution within standard SQL syntax. Compared to traditional SQL databases, OpenMLDB covers the entire machine learning process, including offline and online modes. In online mode, users can supply sample data and obtain feature results through SQL feature extraction. Previously, by contrast, users had to deploy SQL as a stored procedure through the DEPLOY command and then perform online feature computation through SDKs or HTTP interfaces.
The new version adds SELECT CONFIG and CALL statements, allowing users to directly specify request mode and sample data in SQL to compute feature results, as shown below:</p><pre>-- Execute an online request mode query for the request row (10, &quot;foo&quot;, timestamp(4000))<br>SELECT id, count(val) over (partition by id order by ts rows between 10 preceding and current row)<br>FROM t1<br>CONFIG (execute_mode = &#39;online&#39;, values = (10, &quot;foo&quot;, timestamp(4000)))</pre><p>You can also use the ANSI SQL CALL statement to invoke stored procedures with sample rows as parameters, as shown below:</p><pre>-- Deploy the query, then execute it online for the request row (10, &quot;foo&quot;, timestamp(4000))<br>DEPLOY window_features SELECT id, count(val) over (partition by id order by ts rows between 10 preceding and current row)<br>FROM t1;<br><br>CALL window_features(10, &quot;foo&quot;, timestamp(4000))</pre><p>For detailed release notes, please refer to: <a href="https://github.com/4paradigm/OpenMLDB/releases/tag/v0.9.0">https://github.com/4paradigm/OpenMLDB/releases/tag/v0.9.0</a></p><p>Please feel free to download and explore the latest release. Your feedback is highly valued and appreciated. We encourage you to share your thoughts and suggestions to help us improve and enhance the platform.
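For intuition, the window feature in the queries above (count over rows between 10 preceding and current row, per id, ordered by ts) can be mimicked in plain Python. This sketch only illustrates the semantics that OpenMLDB evaluates server-side; it is not how the engine is implemented:

```python
# Illustrative semantics of request-mode execution: combine the stored
# history rows for the request's id with the request row itself, order
# by ts, and count the rows inside the "10 preceding" window.
def request_mode_count(history, request, window=10):
    rows = [r for r in history if r["id"] == request["id"]] + [request]
    rows.sort(key=lambda r: r["ts"])
    pos = rows.index(request)
    return len(rows[max(0, pos - window):pos + 1])

# Three stored rows for id 10, plus the request row (10, "foo", ts=4000).
history = [{"id": 10, "val": "a", "ts": t} for t in (1000, 2000, 3000)]
request = {"id": 10, "val": "foo", "ts": 4000}
print(request_mode_count(history, request))  # -> 4
```

The request row itself participates in the window ("and current row"), which is why the count here is 4 rather than 3.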
Thank you for your support!</p><h3>Release Date</h3><p>April 25, 2024</p><h3>Release Note</h3><p><a href="https://github.com/4paradigm/OpenMLDB/releases/tag/v0.9.0">https://github.com/4paradigm/OpenMLDB/releases/tag/v0.9.0</a></p><h3>Highlighted Features</h3><ul><li>Added support for the latest version of SQLAlchemy 2, seamlessly integrating with popular Python frameworks such as Pandas and NumPy.</li><li>Expanded support for more data backends, integrating TiDB’s distributed file storage capability with OpenMLDB’s high-performance in-memory feature computation capability.</li><li>Enhanced ANSI SQL support, fixed first_value semantics, supported the MAP type and feature signatures, and added offline mode support for INSERT statements.</li><li>Added support for the MySQL protocol, allowing access to OpenMLDB clusters using MySQL clients like Navicat and Sequel Ace, as well as MySQL SDKs for various programming languages.</li><li>Extended SQL syntax support, enabling online feature computation directly through SELECT CONFIG or CALL statements.</li></ul><p><strong>For more information on OpenMLDB:</strong></p><ul><li>Official website: <a href="https://openmldb.ai/">https://openmldb.ai/</a></li><li>GitHub: <a href="https://github.com/4paradigm/OpenMLDB">https://github.com/4paradigm/OpenMLDB</a></li><li>Documentation: <a href="https://openmldb.ai/docs/en/">https://openmldb.ai/docs/en/</a></li><li>Join us on <a href="https://join.slack.com/t/openmldb/shared_invite/zt-ozu3llie-K~hn9Ss1GZcFW2~K_L5sMg"><strong>Slack</strong></a>!</li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=f8c083add4b0" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Comparative Analysis of Memory Consumption: OpenMLDB vs Redis Test Report]]></title>
            <link>https://openmldb.medium.com/comparative-analysis-of-memory-consumption-openmldb-vs-redis-test-report-b7b4022b9583?source=rss-5e0ba83dc987------2</link>
            <guid isPermaLink="false">https://medium.com/p/b7b4022b9583</guid>
            <category><![CDATA[openmldb]]></category>
            <category><![CDATA[memory-consumption]]></category>
            <category><![CDATA[redis]]></category>
            <category><![CDATA[comparative-analysis]]></category>
            <dc:creator><![CDATA[OpenMLDB Blogs]]></dc:creator>
            <pubDate>Tue, 02 Apr 2024 09:07:46 GMT</pubDate>
            <atom:updated>2024-04-03T04:01:57.142Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*rv9O4omnzKmmrktSBT0DHQ.png" /></figure><h3>Background</h3><p>OpenMLDB is an open-source high-performance in-memory SQL database with numerous innovations and optimizations particularly tailored for time-series data storage, real-time feature computation, and other advanced functionalities. On the other hand, Redis is the most popular in-memory storage database widely used in high-performance online scenarios such as caching. While their respective application landscapes differ, both databases share a common trait of utilizing memory as their storage medium.</p><p>The objective of this article is to perform a comparative analysis of memory consumption under identical data row counts for both databases. Our goal is to provide users with a clear and intuitive understanding of the respective memory resource consumptions of each database.</p><h3>Test Environment</h3><p>This test is based on physical machine deployment (40C250G * 3) with the following hardware specifications:</p><p>- CPU: Intel(R) Xeon(R) CPU E5–2630 v4 @ 2.20GHz<br>- Processor: 40 Cores<br>- Memory: 250 G<br>- Storage: HDD 7.3T * 4</p><p>The software versions are as follows:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/596/1*CMmxhQ8_Tny81f6_vDkKBw.png" /></figure><h3>Test Methods</h3><p>We have developed a Java-based testing tool using the OpenMLDB Java SDK and Jedis to compare memory usage between OpenMLDB and Redis. The objective is to insert identical data into both databases and analyze their respective memory usage. Due to variations in supported data types and storage methods, the data insertion process differs slightly between the two platforms. 
Since the data being tested consists of timestamped feature data, we have devised the following two distinct testing approaches to closely mimic real-world usage scenarios.</p><h4><strong>Method One: Random Data Generation</strong></h4><p>In this method, each test dataset comprises m keys serving as primary identifiers, with each key potentially having n different values (simulating time series data). For simplicity, each value is represented by a single field, and the lengths of the key and value fields can be controlled via configuration parameters. For OpenMLDB, we create a test table with two columns (key, value) and insert each key-value pair as a data entry. In the case of Redis, we use each key as an identifier and store multiple values corresponding to that key as a sorted set (zset) within Redis.</p><h4><strong>- Example</strong></h4><p>We plan to test with 1 million (referred to as 1M) keys, each corresponding to 100 time-series data entries. Therefore, the actual data stored in OpenMLDB would be 1M * 100 = 100M, which is equivalent to 100 million data entries. In Redis, we store 1M keys, each key corresponding to a sorted set (zset) containing 100 members.</p><h4><strong>- Configurable Parameters</strong></h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/665/1*7unY38PGdSYuRiZNNKopqw.png" /></figure><h4><strong>- Operations Steps (Reproducible Steps)</strong></h4><h4>a. Deploy OpenMLDB and Redis</h4><p>Deployment can be done through containerization or directly on physical machines using software packages. There is no significant difference between the two methods. 
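Method One's data layout can be sketched in a few lines of Python (parameter names here are illustrative, not the benchmark's actual configuration keys). Each of the m keys carries n timestamped values, so both stores end up holding m * n entries, one OpenMLDB row or one Redis zset member per pair:

```python
# Hedged sketch of Method One's random data generation: m keys, each
# with n random values. OpenMLDB stores one (key, value) row per pair;
# Redis stores one zset member per pair under the same key.
import random
import string

def gen_dataset(num_keys, values_per_key, key_len=16, val_len=32, seed=0):
    rnd = random.Random(seed)  # fixed seed for reproducibility
    rand_str = lambda n: "".join(rnd.choices(string.ascii_letters, k=n))
    return {rand_str(key_len): [rand_str(val_len) for _ in range(values_per_key)]
            for _ in range(num_keys)}

# Scaled down from the article's 1M keys * 100 values = 100M entries.
data = gen_dataset(num_keys=1000, values_per_key=100)
total_entries = sum(len(v) for v in data.values())
print(total_entries)  # -> 100000
```

With the article's full parameters (1M keys, 100 values each), the same arithmetic gives the 100 million entries quoted in the example.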
Below is an example of using containerization for deployment:</p><ul><li>OpenMLDB<br>- Docker image: docker pull 4pdosc/openmldb:0.8.5<br>- Documentation: <a href="https://openmldb.ai/docs/zh/main/quickstart/openmldb_quickstart.html">https://openmldb.ai/docs/zh/main/quickstart/openmldb_quickstart.html</a></li><li>Redis:<br>- Docker image: docker pull redis:7.2.4<br>- Documentation: <a href="https://hub.docker.com/_/redis">https://hub.docker.com/_/redis</a></li></ul><h4>b. Pull the <a href="https://github.com/4paradigm/OpenMLDB/tree/main/benchmark">testing code</a></h4><h4>c. Modify configuration</h4><ul><li>Configuration file: src/main/resources/memory.properties [<a href="https://github.com/4paradigm/OpenMLDB/blob/main/benchmark/src/main/resources/memory.properties">link</a>]</li><li>Note: Ensure that the REDIS_HOST_PORT and ZK_CLUSTER configurations match the actual testing environment. Other configurations control the amount of test data and should be adjusted as needed. If the data volume is large, the testing process may take longer.</li></ul><h4>d. Run the tests</h4><p>[Related paths in the GitHub benchmark Readme]</p><h4>e. Check the output results</h4><h4><strong>Method Two: Using the Open Source Dataset TalkingData</strong></h4><p>To enhance the credibility of the results, cover a broader range of data types, and facilitate result reproduction and comparison, we have designed a test using an open-source dataset — the TalkingData dataset. This dataset is used as a typical case in <a href="https://openmldb.ai/docs/en/main/use_case/talkingdata_demo.html">OpenMLDB for ad fraud detection</a>.
Here, we utilize the TalkingData train dataset, which can be obtained as follows:</p><ul><li>Sample data: <a href="https://github.com/4paradigm/OpenMLDB/blob/main/demo/talkingdata-adtracking-fraud-detection/train_sample.csv">sample data used in OpenMLDB</a></li><li>Full data: Available on <a href="https://www.kaggle.com/c/talkingdata-adtracking-fraud-detection/data">Kaggle</a></li></ul><p>Differing from the first method, the TalkingData dataset includes multiple columns with strings, numbers, and time types. To align storage and usage more closely with real-world scenarios, we use the “ip” column from TalkingData as the key for storage. In OpenMLDB, this involves creating a table corresponding to the TalkingData dataset and creating an index for the “ip” column (OpenMLDB defaults to creating an index for the first column). In Redis, we use “ip” as the key and store a JSON string composed of other column data in a zset (as TalkingData is time-series data, there can be multiple rows with the same “ip”).</p><h4><strong>- Example</strong></h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/662/1*XPXdzaRIq68z6LdBwEaY8Q.png" /></figure><h4><strong>- Configurable Parameters</strong></h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/663/1*ovUBAeCcDeADBcedmJapeQ.png" /></figure><h4>- <strong>Operation Steps (Reproducible Steps)</strong></h4><h4>a. Deploy OpenMLDB and Redis</h4><p>Same as in method one.</p><h4>b. Pull the <a href="https://github.com/4paradigm/OpenMLDB/tree/main/benchmark">testing code</a></h4><h4>c. Modify configuration</h4><ul><li>Configuration file: src/main/resources/memory.properties [<a href="https://github.com/4paradigm/OpenMLDB/blob/main/benchmark/src/main/resources/memory.properties">link</a>]</li><li>Note:<br>- Ensure that REDIS_HOST_PORT and ZK_CLUSTER configurations match the actual testing environment.<br>- Modify TALKING_DATASET_PATH (defaults to resources/data/talking_data_sample.csv).</li></ul><h4>d. 
Obtain the test data file</h4><p>Place the test data file in the resources/data directory, consistent with the TALKING_DATASET_PATH configuration path.</p><h4>e. Run the tests</h4><p>[Related paths in the GitHub benchmark Readme]</p><h4>f. Check the output results</h4><h3>Results</h3><h4>Random Data Test</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/659/1*B0ZdlXLfn9ENGdGKjnz1Qg.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*OVmEh9P1vfJHvXay" /></figure><p>Under the experimental conditions mentioned above, when storing the same amount of data, OpenMLDB (in memory storage mode) consumes over 30% less memory than Redis.</p><h4>TalkingData Test</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/715/1*gzgkhGue0sXvwD-zhY4H1A.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*r5EJqYOtgistOp3D" /></figure><p>Thanks to OpenMLDB’s data compression capabilities, when sampling small batches of data from the TalkingData train dataset, OpenMLDB’s memory usage is 74.77% lower than Redis’s. As the volume of test data increases, the TalkingData train dataset produces a high proportion of duplicate keys, which narrows OpenMLDB’s storage advantage relative to Redis. This trend continues until the entire train dataset is stored, at which point OpenMLDB uses 45.66% less memory than Redis.</p><h3>Summary</h3><p>For the open-source TalkingData dataset, when storing data of similar magnitude, OpenMLDB reduces memory usage by 45.66% compared to Redis. Even on datasets consisting purely of string data, OpenMLDB can still reduce memory usage by over 30% compared to Redis.</p><p>This is because of OpenMLDB’s compact row encoding format, which optimizes the storage of various data types when holding the same amount of data.
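The reduction percentages quoted above follow the usual relative-savings formula; the byte counts in this sketch are invented for illustration, only the formula and the 45.66% figure come from the report:

```python
# Relative memory savings of OpenMLDB versus Redis, as a percentage:
# (redis_bytes - openmldb_bytes) / redis_bytes * 100.
def memory_reduction(redis_bytes, openmldb_bytes):
    return round(100 * (redis_bytes - openmldb_bytes) / redis_bytes, 2)

# Hypothetical byte counts chosen so the result matches the report's
# full-dataset figure of 45.66%.
print(memory_reduction(1000, 543.4))  # -> 45.66
```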
The optimization reduces memory usage in in-memory databases and lowers serving costs. Comparisons with mainstream in-memory databases like Redis further demonstrate OpenMLDB’s superior performance in terms of memory usage and Total Cost of Ownership (TCO).</p><p><strong>For more information on OpenMLDB:</strong></p><ul><li>Official website: <a href="https://openmldb.ai/">https://openmldb.ai/</a></li><li>GitHub: <a href="https://github.com/4paradigm/OpenMLDB">https://github.com/4paradigm/OpenMLDB</a></li><li>Documentation: <a href="https://openmldb.ai/docs/en/">https://openmldb.ai/docs/en/</a></li><li>Join us on <a href="https://join.slack.com/t/openmldb/shared_invite/zt-ozu3llie-K~hn9Ss1GZcFW2~K_L5sMg"><strong>Slack</strong></a>!</li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=b7b4022b9583" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Ultra High-Performance Database OpenM(ysq)LDB: Seamless Compatibility with MySQL Protocol and…]]></title>
            <link>https://openmldb.medium.com/ultra-high-performance-database-openm-ysq-ldb-seamless-compatibility-with-mysql-protocol-and-d3f60210feea?source=rss-5e0ba83dc987------2</link>
            <guid isPermaLink="false">https://medium.com/p/d3f60210feea</guid>
            <category><![CDATA[high-performance]]></category>
            <category><![CDATA[mysql]]></category>
            <dc:creator><![CDATA[OpenMLDB Blogs]]></dc:creator>
            <pubDate>Fri, 22 Mar 2024 05:02:59 GMT</pubDate>
            <atom:updated>2024-03-25T06:57:09.752Z</atom:updated>
            <content:encoded><![CDATA[<h3>Ultra High-Performance Database OpenM(ysq)LDB: Seamless Compatibility with MySQL Protocol and Multi-Language MySQL Client</h3><h3>What’s OpenM(ysq)LDB?</h3><p><a href="https://github.com/4paradigm/OpenMLDB">OpenMLDB</a> has introduced a new service module called OpenM(ysq)LDB, expanding its capabilities to integrate with MySQL infrastructure. This extension redefines the “ML” in OpenMLDB to signify both Machine Learning and MySQL compatibility. Through OpenM(ysq)LDB, users gain the ability to utilize MySQL command-line clients or MySQL SDKs in various programming languages, enabling seamless access to OpenMLDB’s unique online and offline feature calculation capabilities.</p><p>OpenMLDB itself is a distributed, high-performance in-memory time-series database built on C++ and LLVM technologies. Its architectural design and implementation logic significantly differ from traditional standalone relational databases like MySQL. OpenMLDB has garnered widespread adoption, particularly in hard real-time online feature calculation scenarios such as financial risk control and recommendation systems. While OpenMLDB’s capabilities are robust, its adoption was initially hindered by perceived high adaptation costs.</p><p>However, the introduction of OpenM(ysq)LDB addresses this barrier by facilitating direct integration with MySQL clients and SDKs. Through standard ANSI SQL interfaces, OpenMLDB is now compatible with the MySQL protocol, allowing customers to use familiar MySQL clients to access OpenMLDB data and execute OpenMLDB’s specialized SQL feature-extraction syntax.
This enhancement streamlines the transition for users familiar with MySQL environments, making OpenMLDB’s advanced features more accessible and user-friendly.</p><p>For more details, check the official documentation at <a href="https://openmldb.ai/docs/en/main/app_ecosystem/open_mysql_db/index.html">https://openmldb.ai/docs/en/main/app_ecosystem/open_mysql_db/index.html</a>.</p><h3>Usage</h3><h4>Use a Compatible MySQL Command Line</h4><p>After deploying the OpenMLDB distributed cluster, developers do not need to install additional OpenMLDB command-line tools. Using the pre-installed MySQL command-line tool, developers can directly connect to the OpenMLDB cluster for testing (note that the following SQL connections and execution results are all returned by the OpenMLDB cluster, not by a remote MySQL service).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*5QIIjVAzsut4WoQ-" /></figure><p>By executing customized OpenMLDB SQL, developers can not only view the status of the OpenMLDB cluster but also switch between offline mode and online mode to realize the offline and online feature extraction functions of MLOps.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*ptCWEInPe6Lc-BJe" /></figure><h4>Use a Compatible JDBC Driver</h4><p>Java developers generally use the MySQL JDBC driver to connect to MySQL. The same code can directly connect to the OpenMLDB cluster without any modification.</p><p>Write the Java application code as follows.
Pay attention to modifying the IP, port, username, and password information according to the actual cluster situation.</p><pre>public class Main {<br>    public static void main(String[] args) {<br>        String url = &quot;jdbc:mysql://localhost:3307/db1&quot;;<br>        String user = &quot;root&quot;;<br>        String password = &quot;root&quot;;<br>        Connection connection = null;<br>        Statement statement = null;<br>        ResultSet resultSet = null;<br>        try {<br>            connection = DriverManager.getConnection(url, user, password);<br>            statement = connection.createStatement();<br>            resultSet = statement.executeQuery(&quot;SELECT * FROM db1.t1&quot;);<br>            while (resultSet.next()) {<br>                int id = resultSet.getInt(&quot;id&quot;);<br>                String name = resultSet.getString(&quot;name&quot;);<br>                System.out.println(&quot;ID: &quot; + id + &quot;, Name: &quot; + name);<br>            }<br>        } catch (SQLException e) {<br>            e.printStackTrace();<br>        } finally {<br>            // Close the result set, statement, and connection<br>            try {<br>                if (resultSet != null) {<br>                    resultSet.close();<br>                }<br>                if (statement != null) {<br>                    statement.close();<br>                }<br>                if (connection != null) {<br>                    connection.close();<br>                }<br>            } catch (SQLException e) {<br>                e.printStackTrace();<br>            }<br>        }<br>    }<br>}</pre><p>Then compile and execute, and you can see the queried data for the OpenMLDB database in the command line output.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*aivjJLsx6yA2yshj" /></figure><h4>Use a Compatible SQLAlchemy Driver</h4><p>Python developers often use SQLAlchemy and MySQL drivers, and the same code can also be directly applied 
to query OpenMLDB’s online data.</p><p>Write the Python application code as follows:</p><pre>from sqlalchemy import create_engine, text<br><br>def main():<br>    engine = create_engine(&quot;mysql+pymysql://root:root@127.0.0.1:3307/db1&quot;, echo=True)<br>    with engine.connect() as conn:<br>        result = conn.execute(text(&quot;SELECT * FROM db1.t1&quot;))<br>        for row in result:<br>            print(row)<br><br>if __name__ == &quot;__main__&quot;:<br>  main()</pre><p>Then execute it directly, and you can see the corresponding OpenMLDB database output in the command line output.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*EyMXQBhAIXYsleC_" /></figure><h4>Use a Compatible Go MySQL Driver</h4><p>Golang developers generally use the officially recommended github.com/go-sql-driver/mysql driver to access MySQL. They can also directly access the OpenMLDB cluster without modifying the application code.</p><p>Write the Golang application code as follows:</p><pre>package main<br><br>import (<br>        &quot;database/sql&quot;<br>        &quot;fmt&quot;<br>        &quot;log&quot;<br><br>        _ &quot;github.com/go-sql-driver/mysql&quot;<br>)<br><br>func main() {<br>        // MySQL database connection parameters<br>        dbUser := &quot;root&quot;         // Replace with your MySQL username<br>        dbPass := &quot;root&quot;         // Replace with your MySQL password<br>        dbName := &quot;db1&quot;    // Replace with your MySQL database name<br>        dbHost := &quot;localhost:3307&quot;        // Replace with your MySQL host address<br>        dbCharset := &quot;utf8mb4&quot;            // Replace with your MySQL charset<br><br>        // Create a database connection<br>        db, err := sql.Open(&quot;mysql&quot;, fmt.Sprintf(&quot;%s:%s@tcp(%s)/%s?charset=%s&quot;, dbUser, dbPass, dbHost, dbName, dbCharset))<br>        if err != nil {<br>                log.Fatalf(&quot;Error connecting to the database: %v&quot;, 
err)<br>        }<br>        defer db.Close()<br><br>        // Perform a simple query<br>        rows, err := db.Query(&quot;SELECT id, name FROM db1.t1&quot;)<br>        if err != nil {<br>                log.Fatalf(&quot;Error executing query: %v&quot;, err)<br>        }<br>        defer rows.Close()<br><br>        // Iterate over the result set<br>        for rows.Next() {<br>                var id int<br>                var name string<br>                if err := rows.Scan(&amp;id, &amp;name); err != nil {<br>                        log.Fatalf(&quot;Error scanning row: %v&quot;, err)<br>                }<br>                fmt.Printf(&quot;ID: %d, Name: %s\n&quot;, id, name)<br>        }<br>        if err := rows.Err(); err != nil {<br>                log.Fatalf(&quot;Error iterating over result set: %v&quot;, err)<br>        }<br>}</pre><p>Compile and run it, and the query results appear in the command-line output.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/820/0*APMul6cHPvXo1g6y" /></figure><h4>Use a Compatible Sequel Ace Client</h4><p>MySQL developers usually use GUI applications to simplify database management. If developers want to connect to an OpenMLDB cluster, they can also use such open-source GUI tools.</p><p>Taking Sequel Ace as an example, developers do not need to modify any project code. They only need to fill in the address and port of the OpenM(ysq)LDB service when connecting, together with the username and password of the OpenMLDB service. Developers can then work with the OpenMLDB service exactly as they would with MySQL.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*R8vfgFm3GN4nG8Is" /></figure><h4>Use a Compatible Navicat Client</h4><p>In addition to Sequel Ace, Navicat is also a popular MySQL client. Developers do not need to modify any project code. 
They only need to fill in the address and port of the OpenM(ysq)LDB service when creating a new connection (MySQL), together with the username and password of the OpenMLDB service. They can then access the OpenMLDB service exactly as they would operate MySQL.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*PYBRNTVd2A2Br4NB" /></figure><h3>Compatibility Principle of MySQL Protocol</h3><p>The MySQL wire protocol (also used by derivatives such as MariaDB) is publicly documented. On the server side, OpenM(ysq)LDB fully implements and is compatible with the MySQL protocol, while at the backend it manages connections to the distributed OpenMLDB cluster through the OpenMLDB SDK, enabling compatible access from various MySQL clients.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*MzGK9PxpjgLgBAw4" /></figure><p>Currently, OpenM(ysq)LDB maintains client interaction with OpenMLDB through long-lived connections. This ensures that each connection has a unique client object accessing the OpenMLDB cluster. All SQL queries from the same connection do not require additional initialization, and resources are automatically released after the connection is closed. The overhead of the service itself is almost negligible, and performance can be consistent with directly connecting to OpenMLDB.</p><p>For more usage documentation, please refer to the official documentation at <a href="https://openmldb.ai/docs/en/main/app_ecosystem/open_mysql_db/index.html">https://openmldb.ai/docs/en/main/app_ecosystem/open_mysql_db/index.html</a>.</p><h3>Summary</h3><p>OpenM(ysq)LDB is a bold attempt within the OpenMLDB project. After a total of 39 versions released from 0.1.5 to 0.8.5, and continuous improvement in functionality and SQL syntax compatibility, it has finally achieved full compatibility with the MySQL protocol. 
It not only delivers basic SQL query functionality but also provides an underlying storage implementation and AI capabilities that outperform MySQL. From now on, MySQL/MariaDB users can seamlessly switch their database storage engines. Developers using different programming languages can also directly utilize mature MySQL SDKs. The barrier to entry for using OpenMLDB services has been significantly lowered, providing a “shortcut” for all DBAs or data developers to transition to AI.</p><p>Please note that as of now, MySQL Workbench testing with OpenM(ysq)LDB is not yet supported. Relevant testing work is still ongoing, and interested developers can stay updated on the development progress of this project on GitHub.</p><p><strong>For more information on OpenMLDB:</strong></p><ul><li>Official website: <a href="https://openmldb.ai/">https://openmldb.ai/</a></li><li>GitHub: <a href="https://github.com/4paradigm/OpenMLDB">https://github.com/4paradigm/OpenMLDB</a></li><li>Documentation: <a href="https://openmldb.ai/docs/en/">https://openmldb.ai/docs/en/</a></li><li>Join us on <a href="https://join.slack.com/t/openmldb/shared_invite/zt-ozu3llie-K~hn9Ss1GZcFW2~K_L5sMg"><strong>Slack</strong></a><strong> </strong>!</li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=d3f60210feea" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Mastering Distributed Database Development in 10 Minutes with OpenMLDB Developer Docker Image]]></title>
            <link>https://openmldb.medium.com/mastering-distributed-database-development-in-10-minutes-with-openmldb-developer-docker-image-0dc4ec84dd49?source=rss-5e0ba83dc987------2</link>
            <guid isPermaLink="false">https://medium.com/p/0dc4ec84dd49</guid>
            <category><![CDATA[openmldb]]></category>
            <category><![CDATA[distributed-database]]></category>
            <category><![CDATA[docker-image]]></category>
            <dc:creator><![CDATA[OpenMLDB Blogs]]></dc:creator>
            <pubDate>Wed, 13 Mar 2024 04:49:16 GMT</pubDate>
            <atom:updated>2024-03-13T05:17:52.583Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*qiSmedKZYmkNz45_pcms3w.png" /></figure><p><a href="https://github.com/4paradigm/OpenMLDB">OpenMLDB</a> is an open-source, distributed in-memory database system designed for time-series data. It focuses on high performance, reliability, and scalability, making it suitable for handling massive time-series data and real-time computation of online features. In the wave of big data and machine learning, OpenMLDB has emerged as a promising player in the open-source database field, thanks to its powerful data processing capabilities and efficient support for machine learning.</p><p>The core storage and SQL engine of OpenMLDB consist of over 360,000 lines of C++ code and a large number of C/C++ header files. To further lower the barrier to compiling the project and improve developer efficiency, we have introduced a newly designed OpenMLDB Docker image. This allows developers to quickly compile the database source code from scratch on any operating system platform, including Linux, macOS, and Windows. In just ten minutes, developers can join as contributors to distributed database development.</p><h3>Usage</h3><p>The image is currently hosted in the Alibaba Cloud image registry. The process for using it is as follows:</p><p>1. Start the container: Use Docker commands to start the container. This will initiate an environment containing the OpenMLDB source code and all dependencies.</p><pre>docker run -it registry.cn-beijing.aliyuncs.com/openmldb/openmldb-build bash</pre><p>2. Compile OpenMLDB: Inside the container, you can directly navigate to the OpenMLDB source code directory and execute the compilation script.</p><pre>cd OpenMLDB<br>make</pre><p>3. Install OpenMLDB; the default installation path is ${PROJECT_ROOT}/openmldb</p><pre>make install</pre><p>4. 
Deployment and Testing: After the compilation is complete, you can proceed with deployment and testing accordingly. All necessary tools and dependencies are already prepared and ready to use.</p><h3>Concurrent Compilation Time</h3><p>OpenMLDB disables concurrent compilation by default. However, if the resources on the compilation machine are sufficient, you can enable concurrent compilation using the compilation parameter NPROC. Here we list the time required for concurrent compilation.</p><p>1. 4-core Compilation</p><pre>make NPROC=4</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/853/0*PxiBv02dcLlqT6v4" /></figure><p>2. 8-core Compilation</p><pre>make NPROC=8</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/859/0*N1-PWLtPlGHTkXBl" /></figure><p><strong>3. 16-core Compilation</strong></p><pre>make NPROC=16</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/867/0*nj1LbYr7AJQIDUIE" /></figure><h3>Highlights</h3><ol><li><strong>Quick Start</strong>: Eliminates complex setup steps, allowing developers to quickly enter development mode on different operating system platforms.</li><li><strong>Unified Environment</strong>: Whether for individual development or team collaboration, the Docker image ensures that each member develops in a consistent environment, effectively avoiding the “it works on my machine” problem.</li><li><strong>Easy Sharing</strong>: The image can be easily shared with other team members or distributed in the community, accelerating the adoption and application of OpenMLDB.</li><li><strong>Complete OpenMLDB Environment</strong>: The image comes pre-installed with the complete source code of OpenMLDB, enabling developers to easily explore and modify the OpenMLDB source code and contribute to the OpenMLDB community.</li><li><strong>Offline Compilation and Deployment Capabilities</strong>: By pre-downloading the third-party libraries required by OpenMLDB, the image can compile and deploy OpenMLDB in a 
completely offline environment. This greatly improves work efficiency in network-restricted environments, enhancing the flexibility and feasibility of development.</li><li><strong>Compilation Efficiency</strong>: Since all dependencies are already built into the image, lengthy dependency download and installation steps are avoided, making compilation much faster.</li></ol><p>This custom Docker image tailored for offline building of OpenMLDB not only simplifies the onboarding process for developers but also provides robust support for project compilation, deployment, and testing. We anticipate that this tool will help more developers and enterprises leverage OpenMLDB more efficiently, enabling them to master OpenMLDB compilation and development at the source-code level. Moreover, with these enhanced development and application capabilities, we look forward to seeing OpenMLDB further developed and applied in industry ecosystems such as financial risk control, recommendation systems, and quantitative trading.</p><p><strong>For more information on OpenMLDB:</strong></p><ul><li>Official website: <a href="https://openmldb.ai/">https://openmldb.ai/</a></li><li>GitHub: <a href="https://github.com/4paradigm/OpenMLDB">https://github.com/4paradigm/OpenMLDB</a></li><li>Documentation: <a href="https://openmldb.ai/docs/en/">https://openmldb.ai/docs/en/</a></li><li>Join us on <a href="https://join.slack.com/t/openmldb/shared_invite/zt-ozu3llie-K~hn9Ss1GZcFW2~K_L5sMg"><strong>Slack</strong></a><strong> </strong>!</li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=0dc4ec84dd49" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Apache Hive — Offline Data for OpenMLDB]]></title>
            <link>https://openmldb.medium.com/apache-hive-offline-data-for-openmldb-f2e0ea3fd6a7?source=rss-5e0ba83dc987------2</link>
            <guid isPermaLink="false">https://medium.com/p/f2e0ea3fd6a7</guid>
            <category><![CDATA[apache-hive]]></category>
            <category><![CDATA[data-offline]]></category>
            <category><![CDATA[openmldb]]></category>
            <dc:creator><![CDATA[OpenMLDB Blogs]]></dc:creator>
            <pubDate>Fri, 08 Mar 2024 06:09:27 GMT</pubDate>
            <atom:updated>2024-03-08T06:42:48.476Z</atom:updated>
            <content:encoded><![CDATA[<h3>Apache Hive — Offline Data for OpenMLDB</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*1JN5tOgSx1-mKihnRxlXXA.png" /></figure><p>The <a href="https://hive.apache.org/">Apache Hive</a>™ is a distributed, fault-tolerant data warehouse system that enables analytics at a massive scale and facilitates reading, writing, and managing petabytes of data residing in distributed storage using SQL. OpenMLDB extends its capabilities by offering seamless import and export functionalities for Hive as a data warehousing solution. While Hive is primarily used as an offline data source, it can also function as a data source for online data ingestion during the initialization phase of online engines.</p><p>Note that currently, only reading and writing to non-ACID tables (EXTERNAL tables) in Hive is supported. ACID tables (Full ACID or insert-only tables, i.e., MANAGED tables) are not supported at the moment.</p><h3>OpenMLDB Deployment</h3><p>You can refer to the official documentation for <a href="https://openmldb.ai/docs/en/main/deploy/install_deploy.html">deployment</a>. 
An easier way is to deploy with an official docker image, as described in <a href="https://openmldb.ai/docs/en/main/quickstart/openmldb_quickstart.html">Quickstart</a>.</p><p>In addition, you will also need Spark; please refer to <a href="https://openmldb.ai/docs/en/main/tutorial/openmldbspark_distribution.html">OpenMLDB Spark Distribution</a>.</p><h3>Hive-OpenMLDB Integration</h3><h4>Installation</h4><p>For users of the <a href="https://openmldb.ai/docs/en/main/tutorial/openmldbspark_distribution.html">OpenMLDB Spark Distribution</a>, v0.6.7 and newer, the essential Hive dependencies are already bundled.</p><p>However, if you are working with an alternative Spark distribution, you can follow these steps for installation.</p><ul><li>Execute the following command in Spark to compile the Hive dependencies</li></ul><pre>./build/mvn -Pyarn -Phive -Phive-thriftserver -DskipTests clean package</pre><ul><li>After the build succeeds, the dependency packages are located in the directory assembly/target/scala-xx/jars</li><li>Add all dependency packages to Spark’s classpath.</li></ul><h4>Configuration</h4><p>At present, OpenMLDB supports connecting to Hive only through the metastore service. You can adopt either of the two configuration methods below to access the Hive data source. For a simple Hive environment, configuring hive.metastore.uris suffices; however, in production environments where full Hive configuration is required, configuration through hive-site.xml is recommended.</p><ul><li>Using the spark.conf Approach: You can set up spark.hadoop.hive.metastore.uris within the Spark configuration. This can be accomplished in two ways:</li></ul><p>a. taskmanager.properties: Include spark.hadoop.hive.metastore.uris=thrift://... within the spark.default.conf configuration item, followed by restarting the taskmanager.</p><p>b. 
CLI: Add this configuration to the ini configuration file and use --spark_conf when starting the CLI. Please refer to <a href="https://openmldb.ai/docs/en/main/reference/client_config/client_spark_config.html">Client Spark Configuration</a>.</p><ul><li>hive-site.xml: You can configure hive.metastore.uris within the hive-site.xml file. Place this configuration file within the conf/ directory of the Spark home. If the HADOOP_CONF_DIR environment variable is already set, you can also place the configuration file there instead. For instance:</li></ul><pre>&lt;configuration&gt;<br>  &lt;property&gt;<br>    &lt;name&gt;hive.metastore.uris&lt;/name&gt;<br>    &lt;!-- Make sure that &lt;value&gt; points to the Hive Metastore URI in your cluster --&gt;<br>    &lt;value&gt;thrift://localhost:9083&lt;/value&gt;<br>    &lt;description&gt;URI for client to contact metastore server&lt;/description&gt;<br>  &lt;/property&gt;<br>&lt;/configuration&gt;</pre><p>Apart from configuring the Hive connection, it is crucial to grant the TaskManager’s startup users (both OS users and groups) the necessary Read/Write permissions within Hive. Additionally, Read/Write/Execute permissions should be granted to the HDFS path associated with the Hive table.</p><h4>Check</h4><p>Verify whether the task is connected to the appropriate Hive cluster by examining the task log. Here’s how you can proceed:</p><ul><li>INFO HiveConf: indicates the Hive configuration file that was utilized. If you require further information about the loading process, you can review the Spark logs.</li><li>When connecting to the Hive metastore, there should be a log entry similar to INFO metastore: Trying to connect to metastore with URI. 
A successful connection will be denoted by a log entry reading INFO metastore: Connected to metastore.</li></ul><h3>Usage</h3><h4>Table Creation with LIKE</h4><p>You can use the LIKE syntax to create tables in OpenMLDB with schemas identical to those of existing Hive tables.</p><pre>CREATE TABLE db1.t1 LIKE HIVE &#39;hive://hive_db.t1&#39;;<br>-- SUCCEED</pre><h4>Import Hive Data to OpenMLDB</h4><p>Importing data from Hive sources is done through the API <a href="https://openmldb.ai/docs/en/main/openmldb_sql/dml/LOAD_DATA_STATEMENT.html">LOAD DATA INFILE</a>. This operation employs a specialized URI format, hive://[db].table, to import data from Hive seamlessly.</p><pre>LOAD DATA INFILE &#39;hive://db1.t1&#39; INTO TABLE t1 OPTIONS(deep_copy=false);</pre><p>The data loading process also supports using SQL queries to filter specific data from Hive tables. The table name used in the query should be the registered name without the hive:// prefix.</p><pre>LOAD DATA INFILE &#39;hive://db1.t1&#39; INTO TABLE db1.t1 OPTIONS(deep_copy=true, sql=&#39;SELECT * FROM db1.t1 where key=\&quot;foo\&quot;&#39;);</pre><h4>Export OpenMLDB Data to Hive</h4><p>Exporting data to Hive sources is done through the API <a href="https://openmldb.ai/docs/en/main/openmldb_sql/dql/SELECT_INTO_STATEMENT.html">SELECT INTO</a>, which employs the same URI format, hive://[db].table, to transfer data to Hive seamlessly.</p><pre>SELECT col1, col2, col3 FROM t1 INTO OUTFILE &#39;hive://db1.t1&#39;;</pre><h3>Summary</h3><p>This is a brief guide to integrating Hive as an offline data source for OpenMLDB to best facilitate your application needs. 
For more details, you can check the official documentation on <a href="https://openmldb.ai/docs/en/main/integration/offline_data_sources/hive.html">Hive integration</a>.</p><p>OpenMLDB community has recently released <a href="https://github.com/4paradigm/FeatInsight">FeatInsight</a>, a sophisticated feature store service, leveraging OpenMLDB for efficient feature computation, management, and orchestration. The service is available for trial at <a href="http://152.136.144.33/">http://152.136.144.33/</a>. Contact us for a user ID and password to gain access!</p><p><strong>For more information on OpenMLDB:</strong></p><ul><li>Official website: <a href="https://openmldb.ai/">https://openmldb.ai/</a></li><li>GitHub: <a href="https://github.com/4paradigm/OpenMLDB">https://github.com/4paradigm/OpenMLDB</a></li><li>Documentation: <a href="https://openmldb.ai/docs/en/">https://openmldb.ai/docs/en/</a></li><li>Join us on <a href="https://join.slack.com/t/openmldb/shared_invite/zt-ozu3llie-K~hn9Ss1GZcFW2~K_L5sMg"><strong>Slack</strong></a><strong> </strong>!</li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=f2e0ea3fd6a7" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[OpenMLDB v0.8.5 Release: Enhanced Authentication Feature, Comprehensive Security Upgrade]]></title>
            <link>https://openmldb.medium.com/openmldb-v0-8-5-b70b45e51733?source=rss-5e0ba83dc987------2</link>
            <guid isPermaLink="false">https://medium.com/p/b70b45e51733</guid>
            <category><![CDATA[new-releases]]></category>
            <category><![CDATA[openmldb]]></category>
            <dc:creator><![CDATA[OpenMLDB Blogs]]></dc:creator>
            <pubDate>Thu, 29 Feb 2024 05:12:42 GMT</pubDate>
            <atom:updated>2024-05-17T02:13:22.776Z</atom:updated>
            <content:encoded><![CDATA[<h3>Release Date</h3><p>28 February 2024</p><h3>Release Notes</h3><p><a href="https://github.com/4paradigm/OpenMLDB/releases/tag/v0.8.5">https://github.com/4paradigm/OpenMLDB/releases/tag/v0.8.5</a></p><p>OpenMLDB has released a new version, v0.8.5, which includes SQL syntax extensions, Iceberg data lake support, TTL type extensions, and improved user authentication. The most noteworthy features are the integration of the Iceberg engine and the enhanced user authentication.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*EPSO-1u5BuGYR_t-" /></figure><p>Iceberg is an open-source table format for data lake management, focused on providing a highly reliable, scalable data lake solution. Its core features include atomic write mechanisms, multi-version data management, and metadata management, aiming to provide comprehensive data lake management functionality for enterprises. OpenMLDB integrates Iceberg into its platform, allowing users to directly read and write Iceberg data lakes while using OpenMLDB’s features. This results in higher data reliability and consistency, more flexible data operations and management, and more efficient data query performance, providing enterprises with a comprehensive and reliable data lake management solution.</p><p>OpenMLDB has also introduced user authentication functionality, allowing users to more flexibly manage and control database access permissions through SQL statements such as CREATE / ALTER / DROP USER. This feature not only ensures the security of the data but also enhances the convenience and flexibility of database management. 
Users can autonomously manage the creation, modification, and deletion of user accounts according to their actual needs, better meeting the enterprise&#39;s requirements for data access permission management, and improving the overall system&#39;s security and manageability.</p><p>For detailed release notes, please refer to: <a href="https://github.com/4paradigm/OpenMLDB/releases/tag/v0.8.5">https://github.com/4paradigm/OpenMLDB/releases/tag/v0.8.5</a></p><p>Please feel free to download and explore the latest release. Your feedback is highly valued and appreciated. We encourage you to share your thoughts and suggestions to help us improve and enhance the platform. Thank you for your support!</p><h3>Highlights</h3><ul><li>Added integration with Apache Iceberg offline storage engine, supporting data import and feature data export functionalities, further strengthening the integration of OpenMLDB with the ecosystem.</li><li>Added standard SQL syntax UNION ALL, expanding WINDOW UNION and LAST JOIN to achieve multi-table join.</li><li>Support for SELECT INTO OUTFILE to configure OpenMLDB online tables, achieving synchronization between online and offline storage.</li><li>In offline mode, LAST JOIN and WINDOW operations support not specifying ORDER BY parameters, making usage more flexible.</li><li>Added user management functionality, enabling user addition, modification, and deletion through standard SQL statements CREATE / ALTER / DROP USER.</li><li>Support for configuring Spark task parameters through SDK, providing more flexible offline task resource configuration.</li><li>INSERT statement supports configuring server-side memory limits, providing more user-friendly error messages for insertion failures.</li><li>LEFT JOIN statement deployment supports automatic index creation, eliminating the need for manual index creation and data re-importation.</li><li>File storage engine supports TTL types absandlat / absorlat, aligning with in-memory storage engine 
functionality.</li></ul><p><strong>For more information on OpenMLDB:</strong></p><ul><li>Official website: <a href="https://openmldb.ai/">https://openmldb.ai/</a></li><li>GitHub: <a href="https://github.com/4paradigm/OpenMLDB">https://github.com/4paradigm/OpenMLDB</a></li><li>Documentation: <a href="https://openmldb.ai/docs/en/">https://openmldb.ai/docs/en/</a></li><li>Join us on <a href="https://join.slack.com/t/openmldb/shared_invite/zt-ozu3llie-K~hn9Ss1GZcFW2~K_L5sMg"><strong>Slack</strong></a><strong> </strong>!</li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=b70b45e51733" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[FeatInsight: Leveraging OpenMLDB for Highly Efficient Feature Management and Orchestration]]></title>
            <link>https://openmldb.medium.com/featinsight-leveraging-openmldb-for-highly-efficient-feature-management-and-orchestration-bc01b6f2907d?source=rss-5e0ba83dc987------2</link>
            <guid isPermaLink="false">https://medium.com/p/bc01b6f2907d</guid>
            <category><![CDATA[featinsight]]></category>
            <category><![CDATA[orchestration]]></category>
            <category><![CDATA[feature-management]]></category>
            <category><![CDATA[openmldb]]></category>
            <dc:creator><![CDATA[OpenMLDB Blogs]]></dc:creator>
            <pubDate>Fri, 23 Feb 2024 01:58:45 GMT</pubDate>
            <atom:updated>2024-03-08T05:38:01.583Z</atom:updated>
            <content:encoded><![CDATA[<blockquote>Try out our FeatInsight feature platform at <a href="http://152.136.144.33/#/">http://152.136.144.33/#/</a>. Contact us for a user ID and password to gain access!</blockquote><p>The OpenMLDB community has recently released a new open-source feature platform product: <a href="https://github.com/4paradigm/FeatInsight">FeatInsight </a>(https://github.com/4paradigm/FeatInsight).</p><p>FeatInsight is a sophisticated feature store service, leveraging <a href="https://github.com/4paradigm/OpenMLDB">OpenMLDB</a> for efficient feature computation, management, and orchestration.</p><p>FeatInsight provides a user-friendly interface that supports the entire feature engineering process for machine learning, including data import, viewing, and updating; feature generation and storage; and online deployment. For offline scenarios, users can select features to generate training samples for model training; for online scenarios, users can deploy online feature services for real-time feature computation.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*vAJ9X3jRWmQnRbRLEAah1w.png" /></figure><h3>Key Features</h3><p>The main objective of FeatInsight is to address common challenges in machine learning development: facilitating quick and easy feature extraction, transformation, combination, and selection; managing feature lineage; enabling feature reuse and sharing; versioning feature services; and ensuring the consistency and reliability of feature data used in both training and inference. 
Application scenarios include the following:</p><ul><li>Online Feature Service Deployment: Provides high-performance feature storage and online feature computation functions for localized deployment.</li><li>MLOps Platform: Establishes MLOps workflows with OpenMLDB’s consistent online-offline computations.</li><li>FeatureStore Platform: Provides comprehensive feature extraction, deletion, online deployment, and lineage management functionality to achieve low-cost local FeatureStore services.</li><li>Open-Source Feature Solution Reuse: Supports solution reuse locally for feature reuse and sharing.</li><li>Business Component for Machine Learning: Provides a one-stop feature engineering solution for machine learning models in recommendation systems, natural language processing, finance, healthcare, and other areas of machine learning implementation.</li></ul><p>For more content, please refer to FeatInsight <a href="https://openmldb.ai/docs/en/main/app_ecosystem/feat_insight/index.html">Documentation</a>.</p><h3>QuickStart</h3><p>We will use a simple example to show how to use FeatInsight to perform feature engineering. The usage process includes the following four steps: data import, feature creation, offline scenarios, and online scenarios.</p><h4>1. Data Import</h4><p>First, create the database test_db and the data table test_table. You can use SQL:</p><pre>CREATE DATABASE test_db;<br>CREATE TABLE test_db.test_table (id STRING, trx_time DATE);</pre><p>Alternatively, you can use the UI and create them under “Data Import”.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1008/0*fAsNuEcuAuKxPxHG" /></figure><p>For easier testing, we prepare a CSV file and save it to /tmp/test_table.csv. Note that this path is local to the machine that runs the OpenMLDB TaskManager, which is usually also the machine running FeatInsight. 
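</p><p>The CSV file shown below can also be generated with a short Python script. This is a minimal sketch using only the standard library; the path /tmp/test_table.csv is taken from this tutorial and should be adjusted to a location that is readable on the TaskManager machine.</p>

```python
import csv

# Seven sample rows matching the tutorial data: user1..user7, one day apart.
rows = [("user%d" % i, "2024-01-%02d" % i) for i in range(1, 8)]

# Path taken from the tutorial; it must be local to the machine that runs
# the OpenMLDB TaskManager.
path = "/tmp/test_table.csv"

with open(path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "trx_time"])  # header row expected by test_table
    writer.writerows(rows)
```

<p>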
You will need access to that machine to create the file.</p><pre>id,trx_time<br>user1,2024-01-01<br>user2,2024-01-02<br>user3,2024-01-03<br>user4,2024-01-04<br>user5,2024-01-05<br>user6,2024-01-06<br>user7,2024-01-07</pre><p>For online scenarios, you can use the command LOAD DATA or INSERT. Here we use &quot;Import from CSV&quot;.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1006/0*SA_ahc-gr-KdxQoW" /></figure><p>The imported data can be previewed.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/756/0*zslCR2m4JDLRX2F9" /></figure><p>For offline scenarios, you can also use LOAD DATA or &quot;Import from CSV&quot;.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*nJXy3ZVC6_wDGMjZ" /></figure><p>Wait for about half a minute for the task to finish. You can also check the status and log.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*uRfWllebyLSTIPlK" /></figure><h4>2. Feature Creation</h4><p>After the data is imported, we can create features. Here we use SQL to create two basic features.</p><pre>SELECT id, dayofweek(trx_time) as trx_day FROM test_table</pre><p>In “Features”, use the button beside “All Features” to create new features. Fill in the form accordingly.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/800/0*W_yPog3z6G6d1QP8" /></figure><p>After successful creation, you can view the features. Click on a name to see its details, where you can check basic information as well as preview feature values.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*HZbp68_k0Pr7EptI" /></figure><h4>3. Offline Samples Export</h4><p>In “Offline Scenario”, you can choose to export offline samples. You can choose the features to export and specify the export path. 
There are “More Options” for you to specify the file format and other advanced parameters.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*9JIclPEPFGQlb3m6" /></figure><p>Wait for about half a minute and you can check the status at “Offline Samples”.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*H7F-Dny4Mydo36ft" /></figure><p>You can check the content of the exported samples. To verify online-offline consistency provided by FeatInsight, you can record the result and compare it with online feature computation results.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/326/0*XTsFasaMSdEa6mJu" /></figure><h4>4. Online Feature Service</h4><p>In “Feature Services”, the button beside “All Feature Services” is to create a new feature service. You can choose the features to deploy, and fill in the service name and version accordingly.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*PR26Hq2qZb09j0iz" /></figure><p>After successful creation, you can check service details, including the feature list, dependent tables, and lineage.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*gqIk-leJBHbIwFBz" /></figure><p>Lastly, on the “Request Feature Service” page, we can key in test data to perform online feature calculation, and compare it with offline computation results.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*C-2aUiA8wi9q36Ww" /></figure><h3>Summary</h3><p>This example demonstrates the complete process of using FeatInsight. By writing simple SQL statements, users can define features for both online and offline scenarios. By selecting different features or combining feature sets, users can quickly reuse and deploy feature services. 
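</p><p>As a quick sanity check, the trx_day feature from this walkthrough can be recomputed locally and compared with the exported samples. Below is a minimal Python sketch, assuming OpenMLDB’s dayofweek follows the MySQL-style convention (1 = Sunday ... 7 = Saturday); verify the convention against your own exported values.</p>

```python
from datetime import date

def dayofweek(d: date) -> int:
    # MySQL-style DAYOFWEEK: 1 = Sunday ... 7 = Saturday. Assumed to match
    # OpenMLDB's dayofweek(); confirm against your exported samples.
    return (d.weekday() + 1) % 7 + 1

# The tutorial rows: user1..user7 dated 2024-01-01 .. 2024-01-07.
features = {"user%d" % i: dayofweek(date(2024, 1, i)) for i in range(1, 8)}
```

<p>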
Lastly, the consistency of feature computation can be validated by comparing offline and online calculation results.</p><p>If you want to have a further understanding of how to use FeatInsight and its application scenarios, please refer to <a href="https://openmldb.ai/docs/en/main/app_ecosystem/feat_insight/use_cases/index.html">Application Scenarios</a>.</p><h3>Appendix: Advanced Functions</h3><p>In addition to the basic functionalities of feature engineering, FeatInsight also provides advanced functionalities to facilitate feature development for users:</p><ul><li>SQL Playground: Offers debugging and execution capabilities for OpenMLDB SQL statements, allowing users to execute arbitrary SQL operations and debug SQL statements for feature extraction.</li><li>Computed Features: Enables the direct storage of feature values obtained through external batch computation or stream processing into OpenMLDB online tables. Users can then access and manipulate feature data in online tables.</li></ul><h3>Read More:</h3><ul><li>FeatInsight Github: <a href="https://github.com/4paradigm/FeatInsight">https://github.com/4paradigm/FeatInsight</a></li><li>FeatInsight documentation: <a href="https://openmldb.ai/docs/en/main/app_ecosystem/feat_insight/index.html">https://openmldb.ai/docs/en/main/app_ecosystem/feat_insight/index.html</a></li></ul><p><strong>For more information on OpenMLDB:</strong></p><ul><li>Official website: <a href="https://openmldb.ai/">https://openmldb.ai/</a></li><li>GitHub: <a href="https://github.com/4paradigm/OpenMLDB">https://github.com/4paradigm/OpenMLDB</a></li><li>Documentation: <a href="https://openmldb.ai/docs/en/">https://openmldb.ai/docs/en/</a></li><li>Join us on <a href="https://join.slack.com/t/openmldb/shared_invite/zt-ozu3llie-K~hn9Ss1GZcFW2~K_L5sMg"><strong>Slack</strong></a><strong> </strong>!</li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=bc01b6f2907d" width="1" height="1" 
alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[OpenMLDB Selected as the Sole Feature Store Vendor from China in the 2023 Gartner Report]]></title>
            <link>https://openmldb.medium.com/openmldb-selected-as-the-sole-feature-store-vendor-from-china-in-the-2023-gartner-report-f91100380daa?source=rss-5e0ba83dc987------2</link>
            <guid isPermaLink="false">https://medium.com/p/f91100380daa</guid>
            <category><![CDATA[gartner]]></category>
            <category><![CDATA[openmldb]]></category>
            <category><![CDATA[feature-store]]></category>
            <dc:creator><![CDATA[OpenMLDB Blogs]]></dc:creator>
            <pubDate>Sun, 11 Feb 2024 07:10:47 GMT</pubDate>
            <atom:updated>2024-02-28T09:47:24.476Z</atom:updated>
            <content:encoded><![CDATA[<p>In the report “The Logical Feature Store: Data Management for Machine Learning” published by the International Authoritative Consulting and Research firm, Gartner, OpenMLDB is honored to be selected as the sole representative feature store vendor from China.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*kEqB4-TF4xCGr6wNp5krxg.png" /></figure><p>The report thoroughly analyzes the three major challenges faced by current machine learning applications in the practical implementation process: Low End-to-end Efficiency, Lack of Reusability, and Inconsistency between Training and Production Environments. This explains the urgent necessity of a feature store. Considering the challenges posed by the high complexity and resource allocation involved in developing feature stores, Gartner firmly believes that, compared to in-house development, seeking external procurement, especially purchasing from MLOps vendors with built-in feature stores, is a more cost-effective choice. In this regard, OpenMLDB has successfully been included in Gartner’s recommended list of vendors for its outstanding performance, becoming the only Chinese ML vendor with a built-in feature store. This report provides valuable professional guidance for enterprises eager to expand the scale of their AI implementation in business.</p><h3>OpenMLDB: Providing a Consistent Production-level Feature Store Online and Offline, Achieving a 500% Efficiency Improvement per Unit Cost</h3><p>Gartner emphasizes the challenges of machine learning in practical applications in its report. Typically, machine learning teams in enterprises find themselves investing significant time in addressing data issues, leaving little room for focusing on actual model development and optimization. During this process, there is a notable prevalence of inconsistent feature definitions and frequent repetitive rework. 
Similar observations are revealed in OpenMLDB’s research, “In the Realm of Artificial Intelligence Engineering Practices, Enterprises Often Allocate a Staggering 95% of their Overall Time and Effort to Tasks such as Data Processing and Feature Validation”.</p><p>In the traditional approach without OpenMLDB, the deployment of real-time feature computations typically involves the following three steps: (1) Data scientists develop feature scripts offline using SparkSQL or Python; (2) As the developed offline scripts cannot meet the requirements of the production environment, the engineering team needs to reoptimize them based on a different tool stack; (3) Finally, there is a need for consistency validation of the offline feature scripts developed by data scientists and the online services developed by the engineering team. The entire process involves two groups of developers and two sets of tool stacks, resulting in significant deployment costs.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/853/1*rpGKEuu6O_QNOxGqQXFpvw.png" /></figure><p>OpenMLDB aims for a seamless transition from development to deployment, allowing feature scripts developed by data scientists to be directly deployed in the production environment. The platform is equipped with both offline and online processing engines, with the online engine being deeply optimized to meet both production-level online requirements and ensure consistency between online and offline through an automatic consistency execution plan generator. Utilizing OpenMLDB, the implementation of machine learning applications in the feature phase involves only two steps: (1) Data scientists develop offline feature scripts using SQL, and (2) deploying the feature script to the online engine with a single deployment command. 
This approach ensures consistency between online and offline while achieving millisecond-level latency, high concurrency, and high availability for online services.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/857/1*em_Qs0k2LW0SYqdTtQuVvA.png" /></figure><p>Therefore, the greatest value of OpenMLDB lies in significantly reducing the engineering deployment costs of artificial intelligence. In a large business scenario, OpenMLDB can cut the effort from the 6 person-months required by the traditional approach to just 1 person-month. This amounts to a 500% efficiency improvement per unit cost, achieved by eliminating the need for the engineering team to develop online services and conduct online-offline consistency checks.</p><h3>OpenMLDB X Akulaku: With a Scenario-driven Approach, Windowed Feature Computation over One Billion Orders Achieves 4-Millisecond Latency, Saving over 4 Million in Resources</h3>
Meeting all three requirements simultaneously is a challenging task.</p><p>To address this challenge, OpenMLDB has helped Akulaku build an intelligent computing architecture, embedding OpenMLDB’s online engine into the model computation layer and its offline engine into the feature computation layer. Through a scenario-driven approach, real-time computation results are invoked directly in the business calling process. This architecture performs windowed feature computation over one billion orders with 4-millisecond latency and conservatively estimated resource savings of over 4 million.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/854/1*L4ppIB9S6ctPoDV2-MZV1A.png" /></figure><p>In addition, OpenMLDB has helped numerous enterprises optimize their database architecture, facilitating more effective implementation of AI scenarios. For example, it helped Vipshop reduce the feature development and iteration time for personalized product recommendations from 5 person-days to 2 person-days, a 60% improvement in iteration speed. A leading bank’s anti-fraud system used OpenMLDB for feature computation and management across its offline development, online inference, and self-learning stages, resolving long-standing issues of data traversal and inconsistent results and eliminating expensive consistency verification costs. Huawei, after adopting OpenMLDB for real-time personalized product recommendations, achieved minute-level data updates and hour-level feature deployment.
Looking ahead, OpenMLDB aims to assist more enterprises in addressing real-world challenges in data and feature processing for successful business implementation.</p><p>As the sole representative of a database feature store from China selected in the Gartner report “The Logical Feature Store: Data Management for Machine Learning,” OpenMLDB will continue refining its product, optimizing performance, and leveraging its strengths in the field of database feature platforms. The aim is to liberate AI practitioners from tedious and inefficient data processing, assisting enterprises in achieving simpler and more efficient implementations of machine learning applications.</p><p><strong>For more information on OpenMLDB:</strong></p><ul><li>Official website: <a href="https://openmldb.ai/">https://openmldb.ai/</a></li><li>GitHub: <a href="https://github.com/4paradigm/OpenMLDB">https://github.com/4paradigm/OpenMLDB</a></li><li>Documentation: <a href="https://openmldb.ai/docs/en/">https://openmldb.ai/docs/en/</a></li><li>Join us on <a href="https://join.slack.com/t/openmldb/shared_invite/zt-ozu3llie-K~hn9Ss1GZcFW2~K_L5sMg"><strong>Slack</strong></a><strong> </strong>!</li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=f91100380daa" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Kubernetes Deployment Guide for OpenMLDB]]></title>
            <link>https://openmldb.medium.com/kubernetes-deployment-guide-for-openmldb-4796a66f5b3f?source=rss-5e0ba83dc987------2</link>
            <guid isPermaLink="false">https://medium.com/p/4796a66f5b3f</guid>
            <category><![CDATA[kubernetes]]></category>
            <category><![CDATA[openmldb]]></category>
            <dc:creator><![CDATA[OpenMLDB Blogs]]></dc:creator>
            <pubDate>Fri, 12 Jan 2024 06:38:04 GMT</pubDate>
            <atom:updated>2024-01-16T04:14:56.503Z</atom:updated>
<content:encoded><![CDATA[<p>Kubernetes is a widely adopted cloud-native container orchestration and management tool that is extensively used in production deployments. Both the offline and online engines of OpenMLDB fully support Kubernetes-based deployment, enabling more convenient management. This article introduces the Kubernetes-based deployment strategies for the offline engine and the online engine in turn.</p><p>It’s important to note that the Kubernetes-based deployments of the offline engine and the online engine are entirely decoupled. Users have the flexibility to deploy either engine based on their specific requirements.</p><p>Besides Kubernetes-based deployment, the offline engine also supports deployment in local mode and YARN mode. Similarly, the online engine supports a native deployment method that doesn&#39;t rely on containers. These deployment strategies can be flexibly mixed and matched in practical scenarios to meet the demands of production environments.</p><h3>Offline Engine with Kubernetes Backend</h3><h4>Deployment of Kubernetes Operator for Apache Spark</h4><p>Please refer to the <a href="https://github.com/GoogleCloudPlatform/spark-on-k8s-operator">spark-on-k8s-operator official documentation</a>. The following is the command to deploy to the default namespace using Helm.
Modify the namespace and permissions as required.</p><pre>helm install my-release spark-operator/spark-operator --namespace default --create-namespace --set webhook.enable=true<br>kubectl create serviceaccount spark --namespace default<br>kubectl create clusterrolebinding binding --clusterrole=edit --serviceaccount=default:spark</pre><p>After successful deployment, you can use the code examples provided by spark-operator to test whether Spark tasks can be submitted normally.</p><h4>HDFS Support</h4><p>If Kubernetes tasks need to read and write HDFS data, prepare a Hadoop configuration file in advance and create a ConfigMap from it. You can modify the ConfigMap name and file path as needed. An example creation command:</p><pre>kubectl create configmap hadoop-config --from-file=/tmp/hadoop/etc/</pre><h4>Offline Engine Configurations for Kubernetes Support</h4><p>Kubernetes support is enabled in the configuration file for TaskManager in the offline engine; the relevant settings are shown in the following snippet:</p><p><a href="https://medium.com/media/46698a47473ca417b96acb7350210adb/href">https://medium.com/media/46698a47473ca417b96acb7350210adb/href</a></p><p>If Kubernetes is used to run the offline engine, the user’s computation tasks will run on the cluster. It’s therefore recommended to configure the offline storage path as an HDFS path; otherwise, tasks may fail to read or write data.
Example configuration for this item:</p><pre>offline.data.prefix=hdfs:///foo/bar/</pre><p>💡<em>For a complete configuration file example for TaskManager in the OpenMLDB offline engine, visit: </em><a href="https://openmldb.ai/docs/en/main/deploy/conf.html#he-configuration-file-for-taskmanager-conf-taskmanager-properties"><em>https://openmldb.ai/docs/en/main/deploy/conf.html#he-configuration-file-for-taskmanager-conf-taskmanager-properties</em></a></p><h4>Task Submission and Management</h4><p>After configuring TaskManager and Kubernetes, you can submit offline tasks via the command line. The usage is similar to the local or YARN mode, and tasks can be submitted not only from the SQL command-line client but also via SDKs in various programming languages.</p><p>For instance, to submit a data import task:</p><pre>LOAD DATA INFILE &#39;hdfs:///hosts&#39; INTO TABLE db1.t1 OPTIONS(delimiter = &#39;,&#39;, mode=&#39;overwrite&#39;);</pre><p>Check the Hadoop ConfigMap created earlier:</p><pre>kubectl get configmap hadoop-config -o yaml</pre><p>Check the Spark job and pod logs:</p><pre>kubectl get SparkApplication<br>kubectl get pods</pre><h3>Online Engine Deployment with Kubernetes</h3><h4>GitHub</h4><p>Kubernetes-based deployment of the online engine is provided as a separate OpenMLDB tool. Its source code repository is located at: <a href="https://github.com/4paradigm/openmldb-k8s">https://github.com/4paradigm/openmldb-k8s</a></p><h4>Requirements</h4><p>This deployment tool offers a Kubernetes-based deployment solution for the OpenMLDB online engine, implemented using Helm Charts. The tool has been tested and verified with the following versions:</p><ul><li>Kubernetes 1.19+</li><li>Helm 3.2.0+</li></ul><p>Additionally, for users who utilize pre-compiled OpenMLDB images from Docker Hub, only OpenMLDB versions &gt;= 0.8.2 are supported.
Users also have the option to create other versions of OpenMLDB images using the tool described in the last section of this article.</p><h4>Preparation: Deploy ZooKeeper</h4><p>If there is an available ZooKeeper instance, you can skip this step. Otherwise, proceed with the installation process:</p><pre>helm install zookeeper oci://registry-1.docker.io/bitnamicharts/zookeeper --set persistence.enabled=false</pre><p>You can specify a previously created storage class for persistent storage:</p><pre>helm install zookeeper oci://registry-1.docker.io/bitnamicharts/zookeeper --set persistence.storageClass=local-storage</pre><p>For more parameter settings, refer to <a href="https://github.com/bitnami/charts/tree/main/bitnami/zookeeper">here</a></p><h4>OpenMLDB Deployment</h4><p><strong>Download Source Code</strong></p><p>Download the source code and set the working directory to the root directory of the repository.</p><pre>git clone https://github.com/4paradigm/openmldb-k8s.git<br>cd openmldb-k8s</pre><p><strong>Configure ZooKeeper Address</strong></p><p>Modify the zk_cluster in the charts/openmldb/conf/tablet.flags and charts/openmldb/conf/nameserver.flags files to the actual ZooKeeper address, with the default zk_root_path set to /openmldb.</p><p><strong>Deploy OpenMLDB</strong></p><p>You can achieve one-click deployment using Helm with the following commands:</p><pre>helm install openmldb ./charts/openmldb</pre><p>Users have the flexibility to configure additional deployment options using the --set command. Detailed information about supported options can be found in the <a href="https://github.com/4paradigm/openmldb-k8s/blob/main/charts/openmldb/README.md">OpenMLDB Chart Configuration</a>.</p><p>Important configuration considerations include:</p><ul><li>By default, temporary files are used for data storage, which means that data may be lost if the pod restarts. 
It is recommended to associate a Persistent Volume Claim (PVC) with a specific storage class using the following method:</li></ul><pre>helm install openmldb ./charts/openmldb --set persistence.dataDir.enabled=true --set  persistence.dataDir.storageClass=local-storage</pre><ul><li>By default, the 4pdosc/openmldb-online image from Docker Hub is utilized (supporting OpenMLDB &gt;= 0.8.2). If you prefer to use a custom image, you can specify the image name during installation with --set image.openmldbImage. For information on creating custom images, refer to the last section of this article.</li></ul><pre>helm install openmldb ./charts/openmldb --set image.openmldbImage=openmldb-online:0.8.4</pre><p><strong>Note</strong></p><ul><li>Deployed OpenMLDB services can only be accessed within the same namespace within Kubernetes.</li><li>The OpenMLDB cluster deployed using this method does not include a TaskManager module. Consequently, statements such as <a href="https://openmldb.ai/docs/en/main/openmldb_sql/dml/LOAD_DATA_STATEMENT.html">LOAD DATA</a> and <a href="https://openmldb.ai/docs/en/main/openmldb_sql/dql/SELECT_INTO_STATEMENT.html">SELECT INTO</a>, and offline-related functions are not supported. If you need to import data into OpenMLDB, you can use OpenMLDB’s <a href="https://openmldb.ai/docs/en/main/tutorial/data_import.html">Online Import Tool</a>, <a href="https://openmldb.ai/docs/en/main/integration/online_datasources/index.html">OpenMLDB Connector</a>, or SDK. For exporting table data, the <a href="https://openmldb.ai/docs/en/main/tutorial/data_export.html">Online Data Export Tool</a> can be utilized.</li><li>For production, it’s necessary to disable Transparent Huge Pages (THP) on the physical node where Kubernetes deploys the tablet. Failure to do so may result in issues where deleted tables cannot be fully released. 
For instructions on disabling THP, please refer to <a href="https://openmldb.ai/docs/en/main/deploy/install_deploy.html#disable-thp-transparent-huge-pages">this link</a>.</li></ul><h3>Create Docker Image</h3><p>The default deployment uses the OpenMLDB docker image from Docker Hub. Users can also create their local docker image. The creation tool is located in the repository (<a href="https://github.com/4paradigm/openmldb-k8s">https://github.com/4paradigm/openmldb-k8s</a>) as docker/build.sh.</p><p>This script supports two parameters:</p><ul><li>OpenMLDB version number.</li><li>Source of the OpenMLDB package. By default, it pulls the package from a mirror in mainland China. If you want to pull it from GitHub, you can set the second parameter to github.</li></ul><pre>cd docker<br>sh build.sh 0.8.4</pre><p>For more information on OpenMLDB:</p><ul><li>Official website: <a href="https://openmldb.ai/">https://openmldb.ai/</a></li><li>GitHub: <a href="https://github.com/4paradigm/OpenMLDB">https://github.com/4paradigm/OpenMLDB</a></li><li>Documentation: <a href="https://openmldb.ai/docs/en/">https://openmldb.ai/docs/en/</a></li><li>Join us on <a href="https://join.slack.com/t/openmldb/shared_invite/zt-ozu3llie-K~hn9Ss1GZcFW2~K_L5sMg"><strong>Slack</strong></a><strong> </strong>!</li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=4796a66f5b3f" width="1" height="1" alt="">]]></content:encoded>
        </item>
    </channel>
</rss>