Apache Amoro is a Lakehouse management system built on open data lake formats. It provides self-optimizing tables, unified catalog services, and automated maintenance operations for data lakes stored on HDFS, S3, or other distributed filesystems.
Key features:
For detailed information about specific subsystems, see:
The following diagram illustrates Amoro's high-level architecture with actual component and class names from the codebase:
System Architecture with Code Entities
This diagram shows Amoro's architecture mapping natural language concepts to concrete code entities. The AMS (implemented in AmoroServiceContainer) serves as the central orchestration point, exposing three service interfaces. Optimizer containers implement the AbstractOptimizerContainer interface to provide pluggable execution strategies. All metadata operations flow through MyBatis mappers to a relational database, while data operations interact directly with the Hadoop FileSystem API.
Sources: pom.xml47-58 amoro-ams/pom.xml29-32 README.md40-61 docs/admin-guides/deployment.md71-97
The AMS is the central management service, implemented by AmoroServiceContainer in the amoro-ams module. It exposes three primary service endpoints configured via AmoroManagementConf (a configuration sketch follows the table):
| Service | Port Config Property | Default | Protocol | Purpose |
|---|---|---|---|---|
| Dashboard Server | ams.http-server.bind-port | 1630 | REST/HTTP (Javalin) | Web UI and REST API for management operations |
| Table Service | ams.thrift-server.table-service.bind-port | 1260 | Thrift | Table metadata access for compute engines |
| Optimizing Service | ams.thrift-server.optimizing-service.bind-port | 1261 | Thrift | Task distribution and optimizer coordination |
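As a rough illustration, these port settings might appear in conf/config.yaml as shown below; the nesting mirrors the dotted property names and the values are illustrative defaults, not authoritative configuration:

```yaml
# conf/config.yaml (sketch) -- AMS service endpoint ports
ams:
  server-bind-host: "0.0.0.0"       # interface the services bind to
  http-server:
    bind-port: 1630                 # Dashboard web UI and REST API
  thrift-server:
    table-service:
      bind-port: 1260               # Thrift table metadata service for compute engines
    optimizing-service:
      bind-port: 1261               # Thrift optimizer coordination service
```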
The AMS supports high availability through ZooKeeper-based leader election configured via ams.ha.enabled and ams.ha.zookeeper-address properties:
- DashboardServer and REST Catalog endpoints run on all instances
- OptimizingService and table optimization orchestration run only on the leader node
- ZooKeeper connectivity goes through the shaded amoro-shade-zookeeper-3 client, with automatic failover when the leader fails

The service container lifecycle progresses through the states CREATED → STARTING → STARTED → STOPPING → STOPPED.
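A minimal sketch of the corresponding HA block in conf/config.yaml, assuming the same nested layout as the other ams.* properties (addresses are placeholders):

```yaml
# conf/config.yaml (sketch) -- ZooKeeper-based leader election for AMS
ams:
  ha:
    enabled: true                              # ams.ha.enabled
    zookeeper-address: "zk1:2181,zk2:2181"     # ams.ha.zookeeper-address
```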
Sources: amoro-ams/pom.xml29-32 docs/admin-guides/deployment.md71-97 README.md54-60 docs/admin-guides/deployment.md128-139
Optimizers execute self-optimizing operations on tables through a pluggable container abstraction. The framework is organized in the amoro-optimizer module hierarchy:
Optimizer Container Architecture
Container configuration in config.yaml:
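A hedged sketch of what this containers section can look like; the container name and the property key under properties are illustrative examples rather than required values:

```yaml
# conf/config.yaml (sketch) -- optimizer container definitions
containers:
  - name: localContainer              # illustrative name
    container-impl: org.apache.amoro.server.manager.LocalOptimizerContainer
    properties:
      export.JAVA_HOME: /opt/java     # illustrative: environment exported to launched optimizer processes
```

Additional container types would be declared as further list entries, each pointing at its own container-impl class.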
The container-impl property specifies the implementation class by fully qualified name (for example, org.apache.amoro.server.manager.LocalOptimizerContainer). Each container manages:
- OptimizerGroup
- OptimizingQueue

Sources: docs/admin-guides/deployment.md142-176 README.md57-60 pom.xml51-52 amoro-optimizer/amoro-optimizer-standalone/src/main/java/org/apache/amoro/optimizer/standalone/StandaloneOptimizer.java1-100
Amoro operates on top of existing data lake storage and metadata systems:
Sources: README.md62-71 docs/admin-guides/deployment.md104-126
Amoro manages tables across five distinct formats, each optimized for different use cases:
| Format | Module | Use Case |
|---|---|---|
| Iceberg | amoro-format-iceberg | Direct management of Apache Iceberg tables with all native Iceberg capabilities |
| Mixed-Iceberg | amoro-format-mixed | Enhanced format for streaming updates with optimized change data capture (CDC) |
| Mixed-Hive | amoro-mixed-hive | Upgrade path for existing Hive tables to lakehouse format via metadata migration |
| Paimon | amoro-format-paimon | Integration with Apache Paimon tables for metadata display and management |
| Hudi | amoro-format-hudi | Integration with Apache Hudi tables |
The Mixed-Iceberg format introduces a three-tier storage architecture:
Sources: README.md62-71 pom.xml53-56 amoro-format-iceberg/pom.xml28-30 amoro-format-paimon/pom.xml29-30
The following table shows compute engine support for different table formats:
| Engine | Versions | Batch Read | Batch Write | Streaming Read | Streaming Write | Module |
|---|---|---|---|---|---|---|
| Flink | 1.16.x - 1.18.x | ✓ | ✓ | ✓ | ✓ | amoro-mixed-flink |
| Spark | 3.3, 3.4, 3.5 | ✓ | ✓ | ✗ | ✗ | amoro-mixed-spark-{version} |
| Trino | 406+ | ✓ | ✗ | ✗ | ✗ | amoro-mixed-trino |
| Hive | 2.x, 3.x | ✓ | ✗ | ✗ | ✗ | amoro-mixed-hive |
Each Spark version has a dedicated integration module (amoro-mixed-spark-3.3, amoro-mixed-spark-3.4, amoro-mixed-spark-3.5) built on a common base (amoro-mixed-spark-3-common).
Sources: README.md79-89 pom.xml47-58 amoro-format-mixed/amoro-mixed-spark/pom.xml1-54
Maven Module Dependency Graph
The module structure follows these design principles:
Shaded Dependencies: Critical libraries are relocated to org.apache.amoro.shade.* packages to prevent classpath conflicts:
- amoro-shade-thrift (0.20.0) - Thrift RPC framework
- amoro-shade-guava-32 (32.1.1-jre) - Google Guava utilities
- amoro-shade-jackson-2 (2.14.2) - JSON serialization
- amoro-shade-zookeeper-3 (3.9.1) - ZooKeeper client

Format Abstraction: Each format implements the TableDescriptor interface defined in amoro-common, enabling pluggable format support.
Multi-Version Support: Engine connectors use Maven profiles to build for multiple versions:
- amoro-mixed-spark-3-common
- amoro-mixed-flink-common

Binary Distribution: The dist module uses maven-assembly-plugin to package:
- lib/
- plugin/optimizer/
- conf/
- bin/

Sources: pom.xml47-58 README.md99-115 amoro-common/pom.xml30-64 amoro-ams/pom.xml39-61 amoro-format-mixed/amoro-mixed-spark/pom.xml1-54 amoro-format-mixed/amoro-mixed-spark/v3.3/amoro-mixed-spark-3.3/pom.xml29-32
Amoro continuously optimizes tables through background processes that plan, schedule, and commit optimizing tasks.
The self-optimizing process is managed through the TableRuntime state machine, which tracks optimization status and schedules tasks to optimizer instances.
Sources: README.md92-93 docs/admin-guides/deployment.md142-176
The CatalogManager provides a unified catalog abstraction spanning AMS-managed internal catalogs and external metastores such as the Hive Metastore.
Compute engines access tables through the Table Service (port 1260), which provides consistent metadata access regardless of the underlying catalog implementation.
Sources: README.md94 docs/admin-guides/deployment.md75-76
Through its modular architecture, Amoro provides unified management capabilities across different table formats while preserving format-specific features. The TableManager coordinates operations across formats through a pluggable provider system.
Sources: README.md93 pom.xml53-56
Amoro supports multiple deployment models with corresponding build artifacts:
Deployment Artifact and Build Pipeline
Key configuration files and their primary settings:
| File | Purpose | Key Properties |
|---|---|---|
| conf/config.yaml | AMS service configuration | ams.server-bind-host, ams.thrift-server.table-service.bind-port, ams.database.*, ams.ha.* |
| conf/jvm.properties | JVM tuning | xms, xmx, jmx.remote.port, extra.options |
| containers section | Optimizer deployment | container-impl (fully qualified class name), properties (env vars and config) |
The AmoroManagementConf class in amoro-ams loads configuration from config.yaml with environment variable overrides via the ConfigurationHelper.
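For example, the ams.database.* settings referenced in the table above might be sketched as follows; the sub-keys shown are assumptions based on a typical JDBC configuration, so consult the deployment guide for the exact schema:

```yaml
# conf/config.yaml (sketch) -- relational database backing AMS metadata
ams:
  database:
    type: mysql                                   # illustrative; key names may differ between releases
    jdbc-driver-class: com.mysql.cj.jdbc.Driver
    url: "jdbc:mysql://127.0.0.1:3306/amoro"
    username: amoro
    password: amoro-password
```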
Maven build profiles control which components are included:
| Profile | Effect |
|---|---|
| -Phadoop2 | Use Hadoop 2.x dependencies instead of default Hadoop 3.x |
| -Pspark-3.3 / -Pspark-3.4 / -Pspark-3.5 | Build for specific Spark version (3.5 is default) |
| -Psupport-all-formats | Include Paimon and Hudi format modules in distribution |
| -Pflink-optimizer-pre-1.15 | Build Flink optimizer for versions before 1.15 |
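For example, a distribution build that bundles all format modules for the default Spark version might be invoked as `mvn clean package -DskipTests -Psupport-all-formats`; exact goals and flags can vary between releases, so treat this as a sketch rather than the canonical build command.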
Three Docker images are provided:
- AMS image based on eclipse-temurin:11-jdk-jammy (docker/amoro/Dockerfile)
- Flink optimizer image based on flink:${FLINK_VERSION}-java11 (docker/optimizer-flink/Dockerfile)

Build scripts: docker/build.sh1-243
GitHub Actions workflow: .github/workflows/docker-images.yml1-267
Sources: README.md117-135 docs/admin-guides/deployment.md68-294 docker/build.sh1-243 .github/workflows/docker-images.yml1-267 docker/amoro/Dockerfile1-64 docker/optimizer-flink/Dockerfile1-34
Amoro requires the following runtime dependencies:
Sources: docs/admin-guides/deployment.md31-35 pom.xml73-77
The project is built with Maven, using the profiles above to control which components are included.
The build produces:
- dist/target/apache-amoro-x.y.z-bin.tar.gz - Complete distribution package
- Per-module artifacts under each module's target/ directory
- Docker images via the docker/build.sh script

Sources: README.md117-135 CONTRIBUTING.md86-88 docker/build.sh1-243