<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[StreamNative - Medium]]></title>
        <description><![CDATA[Cloud-Native Event Streaming powered by Apache Pulsar for enterprises - Medium]]></description>
        <link>https://medium.com/streamnative?source=rss----ab76d1bbc527---4</link>
        <image>
            <url>https://cdn-images-1.medium.com/proxy/1*TGH72Nnw24QL3iV9IOm4VA.png</url>
            <title>StreamNative - Medium</title>
            <link>https://medium.com/streamnative?source=rss----ab76d1bbc527---4</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Fri, 10 Apr 2026 13:23:39 GMT</lastBuildDate>
        <atom:link href="https://medium.com/feed/streamnative" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[StreamNative: Enabling Real-time Messaging and Streaming for the Cloud]]></title>
            <link>https://medium.com/streamnative/streamnative-enabling-real-time-messaging-and-streaming-for-the-cloud-7e7ef46fe7c2?source=rss----ab76d1bbc527---4</link>
            <guid isPermaLink="false">https://medium.com/p/7e7ef46fe7c2</guid>
            <category><![CDATA[cloud]]></category>
            <category><![CDATA[pulsar]]></category>
            <category><![CDATA[streamnative]]></category>
            <dc:creator><![CDATA[Xiaofeng Liu]]></dc:creator>
            <pubDate>Fri, 24 Sep 2021 07:43:20 GMT</pubDate>
            <atom:updated>2021-09-24T07:43:19.925Z</atom:updated>
<content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*W--ObvGE_SLfxgCI.png" /></figure><p>As the CEO of StreamNative, it’s been an exciting ride so far, and today we are proud to share that we’ve <a href="https://www.prnewswire.com/news-releases/original-creators-of-apache-pulsar-raise-23m-series-a-for-streamnative-round-led-by-prosperity7-ventures-301375962.html">raised a $23M series A round</a>. For us, this funding underscores the increased adoption of StreamNative and Apache Pulsar that we are seeing in the market and a bright future ahead.</p><p>In celebrating this milestone, we’d like to look back at how our Pulsar journey began. More than 10 years ago, Matteo and I were at Yahoo! working to develop a consolidated messaging platform that connected all the popular Yahoo! applications, including Yahoo! Finance, Yahoo! Mail, Yahoo! Sports, Flickr, and more, to data. At the time we looked at the existing messaging and streaming technologies, but they were not able to provide the scalability, reliability, and features needed to meet modern architecture and application requirements.</p><p>The team at Yahoo! set out to build a cloud-native messaging service that would work for the global enterprise. We built Pulsar from the ground up to handle millions of topics and partitions with full support for geo-replication and multi-tenancy. Pulsar was open sourced by Yahoo! in 2016 and became a top-level Apache Software Foundation project in 2018.</p><p>Over the past several years there has been a huge market shift from applications and traditional services using monolithic messaging services — either running on-premises, or simply ported to the cloud — to truly cloud-native applications designed to leverage the cloud and Kubernetes. This shift to the cloud and containers has amplified the spotlight on Apache Pulsar.</p><p>Apache Pulsar is unique in that it provides an all-in-one platform with unified messaging and streaming capabilities built for the cloud. Think of it as the combination of Kafka (streaming only) and RabbitMQ (messaging only), designed for multi-tenancy and containers.</p><p>At StreamNative, we work to help organizations around the globe successfully adopt Pulsar. StreamNative builds upon the powerful Apache Pulsar platform with two product offerings, StreamNative Cloud and StreamNative Platform, described below:</p><ol><li>StreamNative Cloud provides Apache Pulsar-as-a-service and delivers a resilient and scalable messaging and event streaming managed service deployable in minutes (alleviating the need to spend time or resources to deploy, upgrade, or maintain clusters).</li><li>StreamNative Platform is a self-managed cloud-native offering that completes Apache Pulsar, providing a distribution of Pulsar with advanced capabilities to help accelerate real-time application development and to simplify enterprise operations at scale.</li></ol><p>We’re excited to see the growth in the Apache Pulsar and StreamNative communities. When we started this journey Kafka was the dominant player in the space, but Pulsar’s rapid adoption since it became a top-level Apache Software Foundation project in 2018 has been remarkable.</p><p>In fact, the number of monthly active Apache Pulsar contributors surpassed Apache Kafka recently (see graph below)!
Many have adopted Apache Pulsar because it offers the potential for faster throughput and lower latency than Apache Kafka, along with a compatible API that allows developers to switch from Kafka to Pulsar with relative ease.</p><p>We are proud to continue to support the Apache Pulsar community through events, training, project updates, and project contributions. In fact, members of the StreamNative team often represent more than half of the monthly Pulsar contributors. We also play a key role sponsoring and hosting the global Pulsar Summits (the next being <a href="https://pulsar-summit.org/">Pulsar Summit Europe 2021 in October</a>).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*lZKTRjX718flUmgS.png" /></figure><p>What’s next? StreamNative will continue to focus on advancing the state of the art in streaming, stream storage, and messaging technologies. From real-time microservices that use Pulsar’s pub/sub features and streaming storage for real-time analytics to infinite storage for deep analysis, we’re all about innovating on Pulsar’s flexible architecture and industry-leading feature set to deliver new capabilities.</p><p>And, we’re hiring! We’re growing our global staff across all departments to accelerate product development, ecosystem expansion, and customer acquisition. If you’re interested in joining the StreamNative team and building a platform based on Apache Pulsar to enable companies to manage the entire lifecycle of data, <a href="https://streamnative.io/en/contact/">contact us</a>.</p><h3>About the CEO</h3><p>Sijie’s journey with Apache Pulsar began at Yahoo!, where he was part of the team working to develop a global messaging platform for the company. He then went to Twitter, where he led the messaging infrastructure group and co-created DistributedLog and Twitter EventBus. In 2017, he co-founded Streamlio, which was acquired by Splunk, and in 2019 he founded StreamNative. He is one of the original creators of Apache Pulsar and Apache BookKeeper, and remains VP of Apache BookKeeper and a PMC member of Apache Pulsar. Sijie lives in the San Francisco Bay Area of California. You can follow him on <a href="https://twitter.com/sijieg">Twitter</a>.</p><h3>About the CTO</h3><p>Matteo is the CTO at StreamNative, where he brings rich experience in distributed pub-sub messaging platforms. Matteo was one of the co-creators of Apache Pulsar during his time at Yahoo!. Matteo and Sijie worked to create a global, distributed messaging system for Yahoo!, which would later become Apache Pulsar. Matteo then co-founded Streamlio with Sijie, and later served as Senior Principal Software Engineer at Splunk post-acquisition. Matteo is the PMC Chair of Apache Pulsar, where he helps to guide the community and ensure the success of the Pulsar project. He is also a PMC member of Apache BookKeeper. Matteo lives in Menlo Park, California. You can follow him on <a href="https://twitter.com/merlimat">Twitter</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Speakers Announced for Pulsar Virtual Summit Europe 2021]]></title>
            <link>https://medium.com/streamnative/speakers-announced-for-pulsar-virtual-summit-europe-2021-3e24b52bac7d?source=rss----ab76d1bbc527---4</link>
            <guid isPermaLink="false">https://medium.com/p/3e24b52bac7d</guid>
            <category><![CDATA[messaging]]></category>
            <category><![CDATA[open-source]]></category>
            <category><![CDATA[pulsar-summit]]></category>
            <category><![CDATA[apache-pulsar]]></category>
            <category><![CDATA[events]]></category>
            <dc:creator><![CDATA[Xiaofeng Liu]]></dc:creator>
            <pubDate>Wed, 15 Sep 2021 04:57:07 GMT</pubDate>
            <atom:updated>2021-09-15T04:57:07.837Z</atom:updated>
<content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*VcavLrqkdQT6ZH0R.png" /></figure><p>The first-ever <a href="https://pulsar-summit.org/en/event/europe-2021">Pulsar Virtual Summit Europe</a> is just one month away! Co-hosted by StreamNative and Clever Cloud, this event will be held online on October 6th at 12:00 PM CEST.</p><p>The Pulsar Summit offers a unique opportunity for engineers, architects, data scientists, and technical leaders interested in Pulsar and the messaging and streaming ecosystem to learn and network. Since 2020, the Pulsar Summits have drawn more than 100 speakers, thousands of attendees, and hundreds of companies globally.</p><p>The speaker committee for the Pulsar Summit Europe 2021 includes Apache Pulsar PMC members Matteo Merli from StreamNative, Jerry Peng from Splunk, and Rajan Dhabalia from Verizon Media. Additionally, Till Rohrmann from Ververica, Karthik Ramasamy from Splunk, Addison Higham from StreamNative, and Ricardo Ferreira from Elastic will be participating.</p><p>Featured speakers include engineers, developer advocates, and technical leaders from the Apache Pulsar PMC, Clever Cloud, Databricks, StreamNative, Elastic, DataStax, Flipkart, Zilliz, Tencent, JAMPP, and Softtech.</p><p><a href="https://hopin.com/events/pulsar-summit-europe-2021">Register today</a> and learn about the latest Pulsar project updates, technology deep dives, use cases, and ecosystem developments!</p><h3>Featured Sessions</h3><p>The Pulsar Virtual Summit Europe 2021 will feature 3 keynotes and 12 breakout sessions. Below is a sneak peek into some of the featured breakout sessions.</p><h4>1. Tracking Apache Pulsar Messages with Apache SkyWalking</h4><p>Presented by Penghui Li, Apache Pulsar PMC Member and Software Engineer at StreamNative</p><p>Apache SkyWalking is a popular application performance monitoring tool for distributed systems, specially designed for microservices, cloud-native, and container-based (Docker, K8s) architectures. In this talk, the speaker will walk you through the features of Apache SkyWalking and Pulsar, and demo how to track Pulsar messages with SkyWalking to troubleshoot issues related to message publishing and receiving.</p><h4>2. Log System as Backbone–How We Built the World’s Most Advanced Vector Database on Pulsar</h4><p>Presented by Xiaofan Luan, Partner and Director of Engineering at Zilliz</p><p>Milvus is an open-source vector database for building and managing vector similarity search applications. It has been adopted in production by thousands of companies, including Lucidworks, Shutterstock, and Cloudinary. In this talk, Xiaofan Luan will share with you how the community built Milvus 2.0, a cloud-native, highly scalable, and extensible vector similarity solution, on Pulsar.</p><h4>3. Writing Custom Sink Connectors for Pulsar I/O</h4><p>Presented by Ricardo Ferreira, Principal Developer Advocate at Elastic</p><p>In this talk, Ricardo Ferreira will show you how to write and deploy custom sink connectors for Pulsar I/O that work just like the built-in ones. He will also discuss some of the design decisions that your custom connectors may need to address.</p><h4>4. Pulsar Watermarking</h4><p>Presented by Eron Wright, Cloud Engineering Lead at StreamNative</p><p>The goal of the Pulsar Watermarking project is to simplify and improve the correctness of stream processing applications.
In this session, Eron Wright will do a technical deep dive into the Apache Pulsar community’s plan to support event-time watermarking in a Pulsar topic.</p><h4>5. Application of Apache Pulsar in Tencent Billing and Tencent Advertising</h4><p>Presented by Mingyu Bao, Senior Engineer at Tencent</p><p>Mingyu Bao will provide a behind-the-scenes look at Tencent’s adoption of Pulsar for its billing and advertising use cases and share some of the challenges involved. He will also discuss the adaptations and improvements Tencent made with Pulsar in order to meet their performance and operations requirements.</p><h3>Register Now</h3><p>Don’t miss this opportunity to learn from top Pulsar thought leaders. <a href="https://hopin.com/events/pulsar-summit-europe-2021">Register now</a> to participate and connect with the Pulsar community at the summit. Check out <a href="https://pulsar-summit.org/en/event/europe-2021/schedule/first-day">the full schedule</a> for more details.</p><h3>About the Author</h3><p>Alice Bi is a content strategist at StreamNative. She has experience with digital marketing, UX design, and communication research. Alice is based in Los Angeles, California.</p><p>This post was originally published on the <a href="https://streamnative.io/blog">StreamNative blog</a>.</p><p><em>Like this post? Please recommend and/or share.</em></p><p><em>Want to learn more? See </em><a href="https://streamnative.io/blog/"><em>https://streamnative.io/blog</em></a><em>.</em> <em>Follow us </em><a href="https://medium.com/streamnative"><em>here</em></a><em> on Medium and check out our </em><a href="https://github.com/streamnative"><em>GitHub</em></a><em>.</em></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Exactly-Once Semantics with Transactions in Pulsar]]></title>
            <link>https://medium.com/streamnative/exactly-once-semantics-with-transactions-in-pulsar-40eb851db93c?source=rss----ab76d1bbc527---4</link>
            <guid isPermaLink="false">https://medium.com/p/40eb851db93c</guid>
            <category><![CDATA[transactions]]></category>
            <category><![CDATA[tech-blog]]></category>
            <category><![CDATA[apache-pulsar]]></category>
            <category><![CDATA[exactlyonce]]></category>
            <dc:creator><![CDATA[Sijia-w]]></dc:creator>
            <pubDate>Thu, 29 Jul 2021 16:02:38 GMT</pubDate>
            <atom:updated>2021-07-29T16:02:37.498Z</atom:updated>
<content:encoded><![CDATA[<p>We have hit an exciting milestone for the Apache Pulsar community: exactly-once semantics. As part of the Pulsar 2.8 release, we have evolved exactly-once semantics from <a href="https://github.com/apache/pulsar/wiki/PIP-6:-Guaranteed-Message-Deduplication">guaranteed message deduplication</a> on a single topic to atomic produce and acknowledgement over multiple topics via the Transaction API. In this post, I’ll explain what this means, how we made this evolution, and how the transaction features in Pulsar simplify exactly-once semantics for building messaging and streaming applications.</p><p>Before diving into the transaction features, let’s get started with an overview of messaging semantics.</p><h3>What is exactly-once semantics?</h3><p>In any distributed system, the machines that form the system can always fail independently of one another. In Apache Pulsar, an individual broker or bookie can crash, or a network failure can happen while the producer is producing a message to a topic. Depending on how the producer handles such a failure, the application can get one of three different semantics.</p><h4>At-least-once Semantics</h4><p>If the producer receives an acknowledgement (ACK) from the Pulsar broker, it means that the message has been written to the Pulsar topic. However, if a producer times out on receiving an acknowledgement or receives an error from the Pulsar broker, it might retry sending the message to the Pulsar topic. If the broker had failed right before it sent the ACK but after the message was successfully written to the Pulsar topic, this reattempt leads to the message being written twice and delivered more than once to the consumers.</p><h4>At-most-once Semantics</h4><p>If the producer does not attempt to produce the message again when it times out on receiving an acknowledgement or receives an error, then the message might end up not being written to the Pulsar topic, and never delivered to the consumers. In some cases, in order to avoid the possibility of duplication, we accept that messages may not be written.</p><h4>Exactly-once Semantics</h4><p>Exactly-once semantics guarantees that even if a producer retries sending a message multiple times, the message will be written exactly once to the Pulsar topic. Exactly-once semantics is the most desirable guarantee, but also one that is not well understood. It requires coordination between the messaging system itself and the application producing and consuming the messages. For example, if after consuming and acknowledging a message successfully, your application rewinds the subscription to a previous message ID, your application will receive all the messages from that message ID to the latest one, all over again.</p><h3>Challenges in supporting exactly-once semantics</h3><p>Supporting exactly-once delivery semantics in messaging systems presents some challenges. To describe them, I’ll start with a simple example.</p><p>Suppose there is a producer that sends a message “Hello StreamNative” to a Pulsar topic called “Greetings”. Further suppose a consumer on the other end receives messages from the topic and prints them. On the happy path where there are no failures, this works well, and the message “Hello StreamNative” is written to the “Greetings” topic only once. The consumer receives the message, processes it, and acknowledges it to indicate that it has completed its processing.
The consumer will not receive the message again, even if the consumer application crashes and restarts.</p><p>However, at scale, failure scenarios can happen all the time.</p><h4>A bookie can fail</h4><p>Pulsar stores messages in BookKeeper. BookKeeper is a highly available, durable log storage service where data written to a ledger (a segment of a Pulsar topic) is persisted and replicated multiple times (n copies). As a result, BookKeeper can tolerate n-1 bookie failures, meaning that a ledger is available as long as there is at least one bookie available. Inherited from Zab/Paxos, BookKeeper’s replication protocol guarantees that once the data has been successfully written to a quorum of bookies, the data is permanently stored and will be replicated to all bookies within the same ensemble.</p><h4>A broker can fail or the producer-to-broker connection can fail</h4><p>Durability in Pulsar depends on the producer receiving an ACK from the Pulsar broker. Failure to receive that ACK does not necessarily mean that the produce request itself failed. The broker can crash after writing a message but before it sends an ACK back to the producer. It can also crash before even writing the message to the topic. Since there is no way for the producer to know the nature of the failure, it is forced to assume that the message was not written successfully and to retry it. In some cases, the same message is duplicated in the Pulsar topic, causing the consumers to receive it more than once.</p><h4>The Pulsar client can fail</h4><p>Exactly-once delivery must account for client failures as well. But it is also hard to tell whether a client has actually failed or is just temporarily partitioned from the Pulsar brokers or undergoing an application pause. Having the ability to distinguish between a permanent failure and a soft one is important. The Pulsar broker should discard messages sent by a zombie producer, and likewise for the consumer. Once a new client has been restarted, it must be able to recover from whatever state the previous failed client left behind and begin processing from a safe point.</p><p>The Pulsar community completed support for exactly-once semantics in steps. We first introduced the Idempotent Producer to support exactly-once semantics on a single topic in the Pulsar 1.20.0-incubating release, and then completed the vision by introducing the Transaction API to provide atomicity across multiple topics in the recent 2.8.0 release.</p><h3>Idempotent producer: exactly-once semantics on a single topic</h3><p>We started the journey of supporting exactly-once semantics in Pulsar by introducing the Idempotent Producer in the 1.20.0-incubating release.</p><p>What does Idempotent Producer mean? An idempotent operation can be performed once or many times without causing a different result. If Guaranteed Message Deduplication is enabled at the cluster level or the namespace level and a producer is configured to be an Idempotent Producer, the produce requests are idempotent. In the event of an error that causes a producer to retry, a message sent by the producer multiple times is guaranteed to be written to the Pulsar topic only once by the broker.</p><p>To turn on this feature and get exactly-once semantics per partition — meaning no duplicates, no data loss, and in-order semantics — configure the following:</p><ul><li>Enable message deduplication for all namespaces/topics at the cluster level, or for a specific namespace at the namespace policy level, or for a specific topic at the topic policy level</li><li>Specify a name for the producer and set the message timeout to 0</li></ul>
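<p>For illustration, here is a minimal sketch of that configuration; the namespace, topic, and producer names are assumptions, not from the original post:</p><pre>$ bin/pulsar-admin namespaces set-deduplication public/default --enable</pre><pre>Producer&lt;byte[]&gt; producer = pulsarClient.newProducer()<br>        .topic(&quot;my-topic&quot;)<br>        .producerName(&quot;my-idempotent-producer&quot;) // a stable name lets the broker track sequence IDs<br>        .sendTimeout(0, TimeUnit.SECONDS)        // disable the send timeout<br>        .create();</pre>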
<p>How does this feature work? Under the hood, it works in a way very similar to TCP: each message produced to Pulsar contains a sequence ID that the Pulsar broker uses to dedupe any duplicated message. However, unlike TCP, which provides guarantees only within a transient connection, this sequence ID is persisted to the Pulsar topic along with the message, and the Pulsar broker keeps track of the last received sequence ID. So even if the Pulsar broker fails, any broker that takes over the topic ownership will also know whether a message is duplicated. The overhead of this mechanism is very low, adding negligible performance overhead over the non-idempotent producer.</p><p>You can try out this feature in any Pulsar version newer than 1.20.0-incubating by following the tutorial <a href="http://pulsar.apache.org/docs/en/cookbooks-deduplication/">here</a>.</p><p>While powerful, the Idempotent Producer only solves a narrow set of challenges for exactly-once semantics. There are still many other challenges it doesn’t resolve. For example, there is no atomicity when a producer attempts to produce messages to multiple topics. A publish error can occur when the broker serving one of the topics crashes. If the producer doesn’t retry publishing the message, it results in some messages being persisted once and others being lost. If the producer retries, it results in some messages being persisted multiple times.</p><p>On the consumer side, message acknowledgement was a best-effort operation. Message ACKs could be lost because the consumer has no way of knowing whether the broker has received them and will not retry sending them. This then results in consumers receiving duplicate messages.</p><h3>Transactions: atomic writes and acknowledgments across multiple topics</h3><p>To address the remaining challenges described above, we’ve strengthened Pulsar’s delivery semantics by introducing a Pulsar Transaction API to support atomic writes and acknowledgments across multiple topics. This allows a producer to send a batch of messages to multiple topics such that either all messages in the batch are eventually visible to any consumer or none are ever visible to consumers. It also allows you to acknowledge the messages you have processed across multiple topics in the same transaction, thereby enabling end-to-end exactly-once semantics.</p><p>Here is an example code snippet to demonstrate the use of the Transaction API:</p><pre>PulsarClient pulsarClient = PulsarClient.builder()<br>        .serviceUrl(&quot;pulsar://localhost:6650&quot;)<br>        .enableTransaction(true)<br>        .build();</pre><pre>Transaction txn = pulsarClient<br>        .newTransaction()<br>        .withTransactionTimeout(1, TimeUnit.MINUTES)<br>        .build()<br>        .get();</pre><pre>producer.newMessage(txn).value(&quot;Hello Pulsar Transaction&quot;.getBytes()).send();</pre><pre>Message&lt;byte[]&gt; message = consumer.receive();<br>consumer.acknowledge(message.getMessageId(), txn);</pre><pre>txn.commit().get();</pre><p>The code example above describes how you can use the new producer API with the Transaction API to send messages atomically to a set of topics, and how to use the new consumer API with transactions to acknowledge the processed messages in the same transaction.</p>
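<p>For completeness, the snippet assumes that the producer and consumer were created from the same transaction-enabled client. A minimal sketch of that setup (the topic and subscription names here are illustrative) might look like the following; note that a producer used with transactions is expected to have its send timeout disabled:</p><pre>Producer&lt;byte[]&gt; producer = pulsarClient.newProducer()<br>        .topic(&quot;my-output-topic&quot;)<br>        .sendTimeout(0, TimeUnit.SECONDS)<br>        .create();</pre><pre>Consumer&lt;byte[]&gt; consumer = pulsarClient.newConsumer()<br>        .topic(&quot;my-input-topic&quot;)<br>        .subscriptionName(&quot;my-subscription&quot;)<br>        .subscribe();</pre>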
<p>It is worth noting that:</p><ul><li>A Pulsar topic might have some messages that are part of a transaction while others are not.</li><li>A Pulsar client can have multiple concurrent transactions outstanding. This design is fundamentally different from the transaction implementations in older messaging systems, and results in much higher throughput.</li><li>The current Pulsar Transaction API only supports the READ_COMMITTED isolation level. The consumer can only read messages that are not part of a transaction and messages that are part of a committed transaction. Messages produced in an aborted transaction are not delivered to any consumers.</li></ul><p>Beyond enabling transactions on the client, you don’t need any additional settings to use the Transaction API.</p><h3>End-to-end exactly-once stream processing made simple: a Pulsar+Flink example</h3><p>Exactly-once stream processing is now possible through the Pulsar Transaction API.</p><p>One of the most critical questions for a stream processing system is, “Does my stream processing application get the right answer, even if one of the instances crashes in the middle of processing?” The key, when recovering a failed instance, is to resume processing in exactly the same state as before the crash.</p><p>Stream processing on Apache Pulsar is a read-process-write operation on Pulsar topics. A source operator that runs a Pulsar consumer reads messages from one or multiple Pulsar topics, some processing operators transform the messages or modify the state maintained by them, and a sink operator that runs a Pulsar producer writes the resulting messages to another Pulsar topic. Exactly-once stream processing is simply the ability to execute a read-process-write operation exactly once. In such a context, “getting the right answer” means not missing any input messages from the source operator or producing any duplicates to the sink operator. This is the behavior users expect from an exactly-once stream processor.</p><p>Let’s take the Pulsar and Flink integration as an example.</p><p>Prior to Pulsar 2.8.0, the Pulsar and Flink integration only supported an exactly-once source connector and an at-least-once sink connector.
That means if you want to use Flink to build stream applications with Apache Pulsar, the highest processing guarantee you can get end-to-end is at-least-once — the resulting messages from these streaming applications may be produced multiple times to the resulting topic in Pulsar.</p><p>With the introduction of Pulsar Transaction in 2.8.0, the Pulsar-Flink sink connector can be easily enhanced to support exactly-once semantics. Because Flink uses a two-phase commit protocol to ensure end-to-end exactly-once semantics, we can implement the designated TwoPhaseCommitSinkFunction and hook the Flink sink message lifecycle up to the Pulsar Transaction API. When the Pulsar-Flink sink connector calls beginTransaction, it starts a Pulsar transaction and obtains the transaction ID. All the subsequent messages written to the sink connector will be associated with this transaction ID. They will be flushed to Pulsar when the connector calls preCommit. The Pulsar transaction will then be committed or aborted when the connector calls recoverAndCommit or recoverAndAbort, respectively. The integration is very straightforward: the connector just has to persist the transaction ID together with Flink checkpoints so that the transaction ID can be retrieved for committing or aborting.</p>
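<p>As a rough illustration of that hookup — a sketch under simplifying assumptions, not the actual connector code — the skeleton below maps Flink’s two-phase commit callbacks onto the Pulsar Transaction API. Serializer setup, client and producer creation in open(), and recovering a transaction handle from a checkpointed transaction ID are all omitted:</p><pre>class PulsarTransactionalSink extends TwoPhaseCommitSinkFunction&lt;byte[], Transaction, Void&gt; {<br>    private transient PulsarClient client;   // created in open(), omitted here<br>    private transient Producer&lt;byte[]&gt; producer;<br><br>    PulsarTransactionalSink(TypeSerializer&lt;Transaction&gt; txnSerializer,<br>                            TypeSerializer&lt;Void&gt; ctxSerializer) {<br>        super(txnSerializer, ctxSerializer);<br>    }<br><br>    @Override<br>    protected Transaction beginTransaction() throws Exception {<br>        // One Pulsar transaction per checkpoint interval<br>        return client.newTransaction()<br>                .withTransactionTimeout(5, TimeUnit.MINUTES)<br>                .build().get();<br>    }<br><br>    @Override<br>    protected void invoke(Transaction txn, byte[] value, Context ctx) throws Exception {<br>        // Every record written between checkpoints joins the open transaction<br>        producer.newMessage(txn).value(value).sendAsync();<br>    }<br><br>    @Override<br>    protected void preCommit(Transaction txn) throws Exception {<br>        producer.flush(); // flush pending messages before the checkpoint completes<br>    }<br><br>    @Override<br>    protected void commit(Transaction txn) {<br>        txn.commit().join(); // make the messages visible to consumers<br>    }<br><br>    @Override<br>    protected void abort(Transaction txn) {<br>        txn.abort().join(); // discard the messages of a failed checkpoint<br>    }<br>}</pre>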
<p>Based on the idempotency and atomicity provided by Pulsar Transactions and the globally consistent checkpoint algorithm offered by Apache Flink, streaming applications built on Pulsar and Flink can easily achieve end-to-end exactly-once semantics.</p><h3>Where to go from here</h3><p>Exactly-once semantics via the Transaction API is now supported in <a href="https://console.streamnative.cloud">StreamNative Cloud</a> as well as in <a href="https://streamnative.io/en/platform">StreamNative Platform</a> v1.0 and later. If you’d like to understand the exactly-once guarantees in more detail, I’d recommend checking out <a href="https://github.com/apache/pulsar/wiki/PIP-31%3A-Transaction-Support">PIP-31</a> for the transaction feature. If you’d like to dive deeper into the detailed design, this <a href="https://docs.google.com/document/d/145VYp09JKTw9jAT-7yNyFU255FptB2_B2Fye100ZXDI/edit#heading=h.bm5ainqxosrx">design document</a> is worth reading.</p><p>This post primarily focuses on describing the nature of the user-facing guarantees supported by the Transaction API introduced in Apache Pulsar 2.8.0, and how you can use this feature. In our next post, we will go into more detail about the API and design.</p><p>If you want to put the new Transaction API to practical use, check out <a href="https://console.streamnative.cloud">StreamNative Cloud</a> or download <a href="https://streamnative.io/en/platform">StreamNative Platform</a> 1.0 to create your own applications with Pulsar Java clients.</p><p>My colleagues Sijie Guo and Addison Higham are going to give a presentation, “<a href="https://www.na2021.pulsar-summit.org/exactly-once-made-easy-transactional-messaging-in-apache-pulsar">Exactly-Once Made Easy: Transactional Messaging in Apache Pulsar</a>”, at the upcoming <a href="https://www.na2021.pulsar-summit.org/">Pulsar Virtual Summit North America 2021</a> on June 16–17. If you are interested in this topic, <a href="https://hopin.com/events/pulsar-summit-north-america-2021">reserve your spot</a> today and listen to them dive into every detail of Pulsar Transactions.</p><h3>Credits</h3><p>An amazing team of Pulsar committers and contributors worked for over a year to bring this exactly-once work to Pulsar. Thanks to everyone who has been involved in this feature development: Penghui Li, Ran Gao, Bo Cong, Addison Higham, Jia Zhai, Yong Zhang, Xiaolong Ran, Matteo Merli, and Sijie Guo.</p><h3>About the Author</h3><p><strong>Penghui Li</strong> is a PMC member of Apache Pulsar and a tech lead at StreamNative. Previously, he worked at Zhaopin.com, where he led the adoption of Pulsar. His career has centered on messaging, from messaging systems through microservices to his current work with Pulsar. You can follow him on <a href="https://twitter.com/lipenghui6">Twitter</a>.</p><p>This post was originally published on the <a href="https://streamnative.io/blog">StreamNative blog</a>.</p><p><em>Like this post? Please recommend and/or share.</em></p><p><em>Want to learn more? See </em><a href="https://streamnative.io/blog/"><em>https://streamnative.io/blog</em></a><em>.</em> <em>Follow us </em><a href="https://medium.com/streamnative"><em>here</em></a><em> on Medium and check out our </em><a href="https://github.com/streamnative"><em>GitHub</em></a><em>.</em></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Apache Pulsar Launches 2.8: Unified Messaging and Streaming With Transactions]]></title>
            <link>https://medium.com/streamnative/apache-pulsar-launches-2-8-unified-messaging-and-streaming-with-transactions-37dad479cba1?source=rss----ab76d1bbc527---4</link>
            <guid isPermaLink="false">https://medium.com/p/37dad479cba1</guid>
            <category><![CDATA[transactions]]></category>
            <category><![CDATA[apache-pulsar]]></category>
            <category><![CDATA[release-notes]]></category>
            <category><![CDATA[tech-blog]]></category>
            <dc:creator><![CDATA[Sijia-w]]></dc:creator>
            <pubDate>Fri, 16 Jul 2021 16:02:37 GMT</pubDate>
            <atom:updated>2021-07-16T16:02:36.151Z</atom:updated>
<content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*90me7QXXbP8TxPkZ.png" /></figure><h3>An Overview of the 2.8.0 Release</h3><p>Today, the Apache Pulsar Project Management Committee announced the release of Apache Pulsar 2.8.0, which includes a number of exciting upgrades and enhancements. This blog provides a deep dive into the updates from the 2.8.0 release as well as a detailed look at the major Pulsar developments that have helped it evolve into the unified messaging and streaming platform it is today.</p><p>Note: The Pulsar community typically puts out a major release every 3 months, but it has been 6 months since the release of 2.7.0. We spent more time on 2.8.0 in order to make the transaction API generally available to the Pulsar community.</p><h3>Release 2.8 Overview</h3><p>The key features and updates in this release are:</p><ul><li>Exclusive Producer</li><li>Package Management API</li><li>Simplified Client Memory Limit Settings</li><li>Broker Entry Metadata</li><li>New Protobuf Code Generator</li><li>Transactions</li></ul><h4>Exclusive Producer</h4><p>By default, the Pulsar producer API provides a “multi-writer” semantic to append messages to a topic. However, there are several use cases that require exclusive access for a single writer, such as ensuring a linear non-interleaved history of messages or providing a mechanism for leader election.</p><p>This new feature allows applications to require exclusive producer access in order to achieve a “single-writer” situation. It guarantees that there is a single writer under any combination of failures. If the producer loses its exclusive access, no more messages from it can be published on the topic.</p><p>One use case for this feature is the metadata controller in Pulsar Functions. In order to write a single linear history of all the functions metadata updates, the metadata controller requires that one leader be elected and that all the “decisions” made by this leader be written to the metadata topic. By leveraging the exclusive producer feature, Pulsar guarantees that the metadata topic contains different segments of updates, one for each successive leader, and there is no interleaving across different leaders. See “<a href="https://github.com/apache/pulsar/wiki/PIP-68%3A-Exclusive-Producer">PIP-68: Exclusive Producer</a>” for more details.</p>
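<p>A minimal sketch of requesting exclusive access (the topic name here is illustrative) might look like this:</p><pre>// Creation fails with a ProducerFencedException if another producer<br>// already holds exclusive access to the topic<br>Producer&lt;byte[]&gt; producer = client.newProducer()<br>        .topic(&quot;functions-metadata-topic&quot;)<br>        .accessMode(ProducerAccessMode.Exclusive)<br>        .create();</pre>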
<h4>Package Management API</h4><p>Since its introduction in version 2.0, the Functions API has become hugely popular among Pulsar users. While it offers many benefits, there are a number of ways to improve the user experience. For example, today, if a function is deployed multiple times, the function package ends up being uploaded multiple times. Also, there is no version management in Pulsar for Functions and IO connectors. The newly introduced package management API provides an easier way to manage the packages for Functions and IO connectors and significantly simplifies the upgrade and rollback processes. Read “<a href="http://pulsar.apache.org/docs/en/admin-api-packages/">Package Management API</a>” for more details.</p><h4>Simplified Client Memory Limit Settings</h4><p>Prior to 2.8, there were multiple settings in producers and consumers that allowed controlling the sizes of the internal message queues. These settings ultimately controlled the amount of memory the Pulsar client used. However, there were a few issues with this approach that made it complicated to select an overall configuration controlling the total usage of memory.</p><p>For example, the settings were based on the “number of messages”, so the expected message size had to be adjusted per producer or consumer. If an application had a large (or unknown) number of producers or consumers, it was very difficult to select an appropriate value for the queue sizes. The same was true for topics with many partitions.</p><p>In 2.8, we introduced a new API to set the memory limit. This single memoryLimit setting specifies a maximum amount of memory for a given Pulsar client. The producers and consumers compete for the memory assigned, and the memory used by the Pulsar client is guaranteed not to go beyond the set limit. Read “<a href="https://github.com/apache/pulsar/wiki/PIP-74%3A-Pulsar-client-memory-limits">PIP-74: Pulsar client memory limits</a>” for more details.</p>
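<p>A minimal sketch of the new setting (the limit value here is illustrative) might look like this:</p><pre>// One overall cap shared by all producers and consumers of this client<br>PulsarClient client = PulsarClient.builder()<br>        .serviceUrl(&quot;pulsar://localhost:6650&quot;)<br>        .memoryLimit(64, SizeUnit.MEGA_BYTES)<br>        .build();</pre>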
<h4>Broker Entry Metadata</h4><p>Pulsar messages define a very comprehensive set of metadata properties. However, to add a new property, the MessageMetadata definition in the Pulsar protocol must change to inform both the broker and the client of the newly introduced property.</p><p>In certain cases, however, a metadata property might need to be added on the broker side, or to be retrieved by the broker at a very low cost. To avoid deserializing such properties from the message metadata, we introduced “Broker Entry Metadata” in 2.8.0: a lightweight approach to adding metadata properties without serializing and deserializing the protobuf-encoded MessageMetadata.</p><p>This feature unblocks a new set of capabilities for Pulsar. For example, we can leverage broker entry metadata to generate a broker publish time for the messages appended to a Pulsar topic. Another example is generating a monotonically increasing sequence ID for messages produced to a Pulsar topic. We use this feature in Kafka-on-Pulsar to implement the Kafka offset.</p><h4>New Protobuf Code Generator</h4><p>Pulsar uses Google Protobuf to perform serialization and deserialization of the commands that are exchanged between clients and brokers. Because of the overhead involved with the regular Protobuf implementation, we have been using a modified version of Protobuf 2.4.1. The modifications were made to produce more efficient serialization code that uses a thread-local cache for the objects involved.</p><p>This approach introduced a few issues. For example, the patch to the Protobuf code generator is based only on Protobuf version 2.4.1 and cannot be carried forward to newer Protobuf versions. In 2.8, we switched from the patched Protobuf 2.4.1 to Splunk’s LightProto as the code generator. The new code generator generates the fastest possible Java code for Protobuf SerDe, is 100% compatible with the proto2 definition and wire protocol, and provides zero-copy deserialization using Netty ByteBuf.</p><h4>Transactions</h4><p>Prior to Pulsar 2.8, Pulsar only supported exactly-once semantics on a single topic through the Idempotent Producer. While powerful, the Idempotent Producer only solves a narrow set of challenges for exactly-once semantics. For example, there is no atomicity when a producer attempts to produce messages to multiple topics. A publish error can occur when the broker serving one of the topics crashes. If the producer doesn’t retry publishing the message, it results in some messages being persisted once and others being lost. If the producer retries, it results in some messages being persisted multiple times.</p><p>In order to address the remaining challenges described above, we’ve strengthened Pulsar’s delivery semantics by introducing a Pulsar Transaction API to support atomic writes and acknowledgements across multiple topics. The addition of the Transaction API to Apache Pulsar completes our vision of making Pulsar a complete unified messaging and streaming platform.</p><p>Pulsar PMC member and StreamNative Engineering Lead Penghui Li goes over this functionality in great detail in his recent blog, Exactly-once Semantics with Transactions in Pulsar. You can read it to learn more about the <a href="https://streamnative.io/en/blog/release/2021-06-14-exactly-once-semantics-with-transactions-in-pulsar">exactly-once semantics support in Pulsar</a>.</p><h3>Building a Unified Messaging and Streaming Platform with Apache Pulsar</h3><h4>The Evolution of Apache Pulsar</h4><p>Apache Pulsar is widely adopted by hundreds of companies across the globe, including Splunk, Tencent, Verizon, and Yahoo! JAPAN, just to name a few. Born as a cloud-native distributed messaging system, Apache Pulsar has evolved into a complete messaging and streaming platform for publishing and subscribing, storing, and processing streams of data at scale and in real time.</p><p>Back in 2012, the Yahoo! team was looking for a global, geo-replicated infrastructure that could manage all of Yahoo!’s messaging data. After vetting the messaging and streaming landscape, it became clear that existing technologies could not serve the needs of an event-driven organization. As a result, the team at Yahoo! set out to build its own.</p><p>At the time, there were generally two types of systems to handle in-motion data: message queues that handled mission-critical business events in real time, and streaming systems that handled data pipelines at scale. Companies had to limit their capabilities to one or the other, or they had to adopt multiple different technologies. If they chose multiple technologies, they would end up with a complex infrastructure that often resulted in data segregation and data silos, with one silo for message queues used to build application services and the other silo for streaming systems used to build data services. The figure below illustrates what this can look like.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*k3MjAgSK51q-SbVV.png" /></figure><p>However, with the diversity of data that companies need to process beyond operational data (like log data, click events, etc.), coupled with the increase in the number of downstream systems that need access to combined business data and operational data, the system would need to support both message queuing and streaming.</p><p>Beyond that, companies needed an infrastructure platform that would allow them to build all of their applications on top of it, and then have those applications handle in-motion data (messaging and streaming data) by default. This way, real-time data infrastructure could be significantly simplified, as illustrated in the diagram below.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*xmrtQII5uOEAEFTG.png" /></figure><p>With that vision, the Yahoo! team started working on building a unified messaging and streaming platform for in-motion data.
Below is an overview of the key milestones on the Pulsar journey, from inception to today.</p><h4>Step 1: Scalable storage for streams of data</h4><p>The journey of Pulsar began with Apache BookKeeper. Apache BookKeeper implements a log-like abstraction for continuous streams and provides the ability to run it at internet scale with simple write-read log APIs. A log provides a great abstraction for building distributed systems, such as distributed databases and pub-sub messaging. The write APIs take the form of appends to the log, and the read APIs take the form of continuous reads from a starting offset defined by the readers. The implementation of BookKeeper created the foundation — a scalable log-backed messaging and streaming system.</p><h4>Step 2: A multi-layered architecture that separates compute from storage</h4><p>On top of the scalable log storage, a serving layer was introduced that runs stateless brokers for publishing and consuming messages. This multi-layered architecture separates serving/compute from storage, allowing Pulsar to manage serving and storage in separate layers.</p><p>This architecture also ensures instant scalability and higher availability. Both of these factors are extremely important and make Pulsar well-suited for building mission-critical services, such as billing platforms for financial use cases, transaction processing systems for e-commerce and retailers, and real-time risk control systems for financial institutions.</p><h4>Step 3: Unified messaging model and API</h4><p>In a modern data architecture, real-time use cases typically fall into two categories: queuing and streaming. Queuing is typically used for building core business application services, while streaming is typically used for building real-time data services such as data pipelines.</p><p>Providing one platform able to serve both application and data services required a unified messaging model that integrates queuing and streaming semantics. Pulsar topics become the source of truth for consumption. Messages can be stored only once on topics, but can be consumed in different ways via different subscriptions. Such unification significantly reduces the complexity of managing and developing messaging and streaming applications.</p><h4>Step 4: Schema API</h4><p>Next, a new Pulsar schema registry and a new type-safe producer &amp; consumer API were added. The built-in schema registry enables message producers and consumers on Pulsar topics to coordinate on the structure of the topic’s data through the Pulsar broker itself, without needing an external coordination mechanism. With data schemas, every single piece of data traveling through Pulsar is completely discoverable, enabling you to build systems that can easily adapt as the data changes.</p><p>Furthermore, the schema registry keeps track of data compatibility between versions of the schema. As new schemas are uploaded, the registry ensures that new schema versions are able to be read by old consumers. This ensures that producers cannot break consumers.</p>
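<p>As an illustration (the User class and topic name here are assumptions, not from the original post), a type-safe producer and consumer built against the schema registry might look like this:</p><pre>// The client uploads the Avro schema for User; the broker checks compatibility<br>Producer&lt;User&gt; producer = client.newProducer(Schema.AVRO(User.class))<br>        .topic(&quot;user-events&quot;)<br>        .create();</pre><pre>Consumer&lt;User&gt; consumer = client.newConsumer(Schema.AVRO(User.class))<br>        .topic(&quot;user-events&quot;)<br>        .subscriptionName(&quot;user-sub&quot;)<br>        .subscribe();<br>User user = consumer.receive().getValue(); // already deserialized into the typed object</pre>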
<h4>Step 5: Functions and IO API</h4><p>The next step was to build APIs that made it easy to get data in and out of Pulsar and process it. The goal was to make it easy to build event-driven applications and real-time data pipelines with Apache Pulsar, so you can then process those events when they arrive, no matter where they originated.</p><p>The Pulsar IO API allows you to build real-time streaming data pipelines by plugging in various source connectors to get data from external systems into Pulsar and sink connectors to get data from Pulsar into external systems. Today, Pulsar provides several built-in connectors that you can use.</p><p>Additionally, StreamNative hosts StreamNative Hub (a registry of Pulsar connectors) that provides dozens of connectors integrated with popular data systems. If the IO API is for building streaming data pipelines, the Functions API is for building event-driven applications and real-time stream processors.</p><p>The team adopted serverless function concepts into stream processing and built the Functions API as a lightweight serverless library that lets you write event processing logic in any language you like. The underlying motivation was to enable your engineering team to write stream processing logic without the operational complexity of running and maintaining yet another cluster.</p>
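<p>To give a flavor of the programming model (this echo-style function is an illustrative sketch, not from the original post), a Pulsar Function in Java is just a class implementing a simple interface:</p><pre>import org.apache.pulsar.functions.api.Context;<br>import org.apache.pulsar.functions.api.Function;</pre><pre>// Receives each message from the input topics; the return value is<br>// published to the configured output topic<br>public class ExclamationFunction implements Function&lt;String, String&gt; {<br>    @Override<br>    public String process(String input, Context context) {<br>        return input + &quot;!&quot;;<br>    }<br>}</pre>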
<h4>Step 6: Infinite storage for Pulsar via Tiered Storage</h4><p>As adoption of Apache Pulsar continued and the amount of data stored in Pulsar increased, users eventually hit a “retention cliff”, at which point it became significantly more expensive to store, manage, and retrieve data in Apache BookKeeper. To work around this, operators and application developers typically use an external store like AWS S3 as a sink for long-term storage. This means you lose most of the benefits of Pulsar’s immutable stream and ordering semantics, and instead end up having to manage two different systems with different access patterns.</p><p>The introduction of Tiered Storage allows Pulsar to offload the majority of the data to remote cloud-native storage. This cheaper form of storage readily scales with the volume of data. More importantly, with the addition of Tiered Storage, Pulsar provides the batch storage capabilities needed to support batch processing when integrating with a unified batch and stream processor like Flink. The unified batch and stream processing capabilities integrated with Pulsar enable companies to query real-time streams with historical context quickly and easily, unlocking a unique competitive advantage.</p><h4>Step 7: Protocol Handler</h4><p>After introducing tiered storage, Pulsar evolved from a pub-sub messaging system into a scalable stream data system that can ingest, store, and process streams of data. However, existing applications written using other messaging protocols such as Kafka, AMQP, and MQTT had to be rewritten to adopt Pulsar’s messaging protocol.</p><p>The Protocol Handler API further reduces the overhead of adopting Pulsar for building messaging and streaming applications, and allows developers to extend Pulsar’s capabilities to other messaging domains by leveraging all the benefits provided by the Pulsar architecture. This resulted in major collaborations between StreamNative and other industry leaders to develop popular protocol handlers, including:</p><ul><li><a href="https://hub.streamnative.io/protocol-handlers/kop/0.2.0">Kafka-on-Pulsar (KoP)</a>, which was <a href="https://streamnative.io/en/blog/tech/2020-03-24-bring-native-kafka-protocol-support-to-apache-pulsar">launched in March 2020</a> by OVHCloud and StreamNative.</li><li><a href="https://hub.streamnative.io/protocol-handlers/aop/0.1.0">AMQP-on-Pulsar (AoP)</a>, which was <a href="https://streamnative.io/en/blog/tech/2020-06-15-announcing-aop-on-pulsar">announced in June 2020</a> by China Mobile and StreamNative.</li><li><a href="https://hub.streamnative.io/protocol-handlers/mop/0.2.0">MQTT-on-Pulsar (MoP)</a>, which was <a href="https://streamnative.io/en/blog/tech/2020-09-28-announcing-mqtt-on-pulsar">announced in August 2020</a> by StreamNative.</li><li><a href="https://github.com/streamnative/rop">RocketMQ-on-Pulsar (RoP)</a>, which was launched in May 2021 by Tencent Cloud and StreamNative.</li></ul><h4>Step 8: Transaction API for exactly-once stream processing</h4><p>More recently, transactions were added to Apache Pulsar to enable exactly-once semantics for stream processing. This is a fundamental feature that provides a strong guarantee for streaming data transformations, making it easy to build scalable, fault-tolerant, stateful messaging and streaming applications that process streams of data.</p><p>Furthermore, the Transaction API capabilities are not limited to a given language client. Pulsar’s support for transactional messaging and streaming is primarily a protocol-level capability that can be surfaced in any language, and such a protocol-level capability can be leveraged in all kinds of applications.</p><h3>Building an ecosystem for unified messaging and streaming</h3><p>In addition to contributing to the Pulsar technology, the community is also working to build a robust ecosystem to support it. Pulsar’s ability to support a rich ecosystem of pub-sub libraries, connectors, functions, protocol handlers, and integrations with popular query engines will enable Pulsar adopters to streamline workflows and achieve new use cases.</p><h3>What is Next?</h3><p>If you are interested in learning more about Pulsar 2.8.0, you can <a href="https://pulsar.apache.org/en/versions/">download 2.8.0</a> and try it out today!</p><p>If you want to learn more about how companies have adopted Pulsar, you can <a href="https://hopin.com/events/pulsar-summit-north-america-2021">sign up</a> for Pulsar Summit NA 2021!</p><p>For more information about the Apache Pulsar project and its progress, please visit the official website at <a href="https://pulsar.apache.org/">https://pulsar.apache.org</a> and follow the project on Twitter <a href="https://twitter.com/apache_pulsar">@apache_pulsar</a> or <a href="https://twitter.com/streamnativeio">@streamnativeio</a>.</p><h3>About the Author</h3><p><strong>Matteo Merli</strong> is the PMC Chair of Apache Pulsar and CTO at StreamNative. He co-created Pulsar while at Yahoo!, co-founded Streamlio, and is a committer and PMC member of Apache Pulsar and Apache BookKeeper. He has worked with The Apache Software Foundation for 4+ years and has rich experience in distributed pub-sub messaging platforms. You can follow him on <a href="https://twitter.com/merlimat">Twitter</a> and
<a href="https://www.linkedin.com/in/matteomerli/">LinkedIn</a>.</p><p><strong>Sijie Guo</strong> is the co-founder and CEO of StreamNative, which provides a cloud-native event streaming platform powered by Apache Pulsar. Sijie has worked on messaging and streaming data technologies for more than a decade. Prior to StreamNative, Sijie co-founded Streamlio, a company focused on real-time solutions. At Twitter, Sijie was the tech lead for the messaging infrastructure group, where he co-created DistributedLog and Twitter EventBus. Prior to that, he worked on the push notification infrastructure at Yahoo!, where he was one of the original developers of BookKeeper and Pulsar. He is also the VP of Apache BookKeeper and a PMC member of Apache Pulsar. You can follow him on <a href="https://twitter.com/sijieg">Twitter</a>.</p><p>This post was originally published on the <a href="http://pulsar.apache.org/blog/">Apache Pulsar blog</a>.</p><p><em>Like this post? Please recommend and/or share.</em></p><p><em>Want to learn more? See </em><a href="https://streamnative.io/blog/"><em>https://streamnative.io/blog</em></a><em>.</em> <em>Follow us </em><a href="https://medium.com/streamnative"><em>here</em></a><em> on Medium and check out our </em><a href="https://github.com/streamnative"><em>GitHub</em></a><em>.</em></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Pulsar Isolation for Dummies: Separate Pulsar Clusters]]></title>
            <link>https://medium.com/streamnative/pulsar-isolation-for-dummies-separate-pulsar-clusters-ffa790d36ea5?source=rss----ab76d1bbc527---4</link>
            <guid isPermaLink="false">https://medium.com/p/ffa790d36ea5</guid>
            <category><![CDATA[apache-pulsar]]></category>
            <category><![CDATA[dummies]]></category>
            <category><![CDATA[separate]]></category>
            <category><![CDATA[tech-blog]]></category>
            <category><![CDATA[isolation]]></category>
            <dc:creator><![CDATA[Sijia-w]]></dc:creator>
            <pubDate>Thu, 10 Jun 2021 16:02:24 GMT</pubDate>
            <atom:updated>2021-06-10T16:02:23.491Z</atom:updated>
<content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/855/0*MOkKmrptRcmcRjqK.jpg" /></figure><p>This blog is for Pulsar users of all levels. If you follow the instructions in this blog, you will successfully configure isolation in Pulsar.</p><p>This is the second blog in the series on configuring isolation in <a href="https://pulsar.apache.org/docs/en/next/standalone/">Apache Pulsar</a>. The first blog, <a href="https://streamnative.io/en/blog/tech/2021-03-02-taking-an-in-depth-look-at-how-to-achieve-isolation-in-pulsar">Taking an In-Depth Look at How to Achieve Isolation in Pulsar</a>, explains how to use the following approaches to achieve isolation in Pulsar:</p><ul><li>Separate Pulsar clusters</li><li>Shared BookKeeper cluster</li><li>Single Pulsar cluster</li></ul><p>This blog details how to create multiple, separate Pulsar clusters to isolate resources. Because this approach segregates resources and does not share storage or local ZooKeeper with other clusters, it provides the highest level of isolation. You should use this approach if you want to isolate critical workloads (such as billing and ads). You can create multiple, separate clusters dedicated to each workload.</p><p>To help you get started quickly, this blog walks you through every step for the following parts:</p><ol><li>Deploy two separate Pulsar clusters</li><li>Verify data isolation of clusters</li><li>Synchronize and migrate data between clusters (optional)</li><li>Scale up and down nodes (optional)</li></ol><h3>Deploy environment</h3><p>The examples in this blog are developed on macOS (version 11.2.3, 8 GB of memory).</p><p><strong>Software requirement</strong></p><ul><li>Java 8</li></ul><p>You will deploy two clusters, each of which runs the following services:</p><ul><li>1 ZooKeeper</li><li>1 bookie</li><li>1 broker</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/900/0*EhBjTdbzVgQ_Z5wi.png" /><figcaption>Figure 1 — Two Separate Pulsar Clusters</figcaption></figure><p>The following are the details of the two Pulsar clusters you will deploy.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/786/1*E7CRcjBSnaPj_t4_eiXZNg.png" /></figure><h4>Prepare deployment</h4><ol><li><a href="https://pulsar.apache.org/docs/en/next/standalone/#install-pulsar-using-binary-release">Download Pulsar</a> and untar the tarball.</li></ol><p>In this example, Pulsar 2.7.0 is installed.</p><p>2. Create empty directories using the following structure, naming them as shown.</p><p>You can create the directories anywhere in your local environment.</p><p><strong>Input</strong></p><pre>|-separate-clusters<br>    |-configuration-store<br>        |-zk1<br>    |-cluster1<br>        |-zk1<br>        |-bk1<br>        |-broker1<br>    |-cluster2<br>        |-zk1<br>        |-bk1<br>        |-broker1</pre><p>3. Copy the untarred Pulsar files into each directory you created in step 2.</p><p>4. Start the <a href="https://pulsar.apache.org/docs/en/next/deploy-bare-metal-multi-cluster/#deploy-the-configuration-store">configuration store</a>.</p><p>The configuration store operates at the instance level and provides configuration management and task coordination across clusters.
In this example, cluster1 and cluster2 share one configuration store.</p><p><strong>Input</strong></p><pre>cd configuration-store/zk1<br><br>bin/pulsar-daemon start configuration-store</pre><h4>Deploy Pulsar cluster1</h4><ol><li>Start a <a href="https://pulsar.apache.org/docs/en/next/deploy-bare-metal-multi-cluster/#deploy-local-zookeeper">local ZooKeeper</a>.</li></ol><p>For each Pulsar cluster, you need to deploy 1 local ZooKeeper to manage configurations and coordinate tasks.</p><p><strong>Input</strong></p><pre>cd cluster1/zk1<br><br>bin/pulsar-daemon start zookeeper</pre><p>2. <a href="https://pulsar.apache.org/docs/en/next/deploy-bare-metal-multi-cluster/#cluster-metadata-initialization">Initialize metadata</a>.</p><p>Write metadata to ZooKeeper.</p><p><strong>Input</strong></p><pre>cd cluster1/zk1<br><br>bin/pulsar initialize-cluster-metadata \<br>  --cluster cluster1 \<br>  --zookeeper localhost:2181 \<br>  --configuration-store localhost:2184 \<br>  --web-service-url http://localhost:8080/ \<br>  --web-service-url-tls https://localhost:8443/ \<br>  --broker-service-url pulsar://localhost:6650/ \<br>  --broker-service-url-tls pulsar+ssl://localhost:6651/</pre><p>3. <a href="https://pulsar.apache.org/docs/en/next/deploy-bare-metal-multi-cluster/#deploy-bookkeeper">Deploy BookKeeper</a>.</p><p>BookKeeper provides <a href="https://pulsar.apache.org/docs/en/next/concepts-architecture-overview#persistent-storage">persistent storage</a> for messages on Pulsar. In this deployment, each cluster runs its own bookie, and the BookKeeper cluster shares the local ZooKeeper with the Pulsar cluster.</p><p>(1) <a href="https://pulsar.apache.org/docs/en/next/concepts-architecture-overview#persistent-storage">Configure bookies</a>.</p><p>Change the value of the following configurations in the cluster1/bk1/conf/bookkeeper.conf file.</p><pre>allowLoopback=true<br>prometheusStatsHttpPort=8002<br>httpServerPort=8002</pre><p>(2) <a href="https://pulsar.apache.org/docs/en/next/deploy-bare-metal-multi-cluster/#start-bookies">Start bookies</a>.</p><p><strong>Input</strong></p><pre>cd cluster1/bk1<br><br>bin/pulsar-daemon start bookie</pre><p>Check whether the bookie is started successfully.</p><p><strong>Input</strong></p><pre>bin/bookkeeper shell bookiesanity</pre><p><strong>Output</strong></p><pre>Bookie sanity test succeeded</pre><p>4. 
<a href="https://pulsar.apache.org/docs/en/next/deploy-bare-metal-multi-cluster/#deploy-brokers">Deploy brokers</a>.</p><p>(1) <a href="https://pulsar.apache.org/docs/en/next/deploy-bare-metal-multi-cluster/#broker-configuration">Configure brokers</a>.</p><p>Change the value of the following configurations in the cluster1/broker1/conf/broker.conf file.</p><pre>zookeeperServers=127.0.0.1:2181<br>configurationStoreServers=127.0.0.1:2184<br>clusterName=cluster1<br>managedLedgerDefaultEnsembleSize=1<br>managedLedgerDefaultWriteQuorum=1<br>managedLedgerDefaultAckQuorum=1</pre><p>(2) <a href="https://pulsar.apache.org/docs/en/next/deploy-bare-metal-multi-cluster/#start-the-broker-service">Start brokers</a>.</p><p><strong>Input</strong></p><pre>cd cluster1/broker1<br><br>bin/pulsar-daemon start broker</pre><h4>Deploy Pulsar cluster2</h4><ol><li>Deploy a <a href="https://pulsar.apache.org/docs/en/next/deploy-bare-metal-multi-cluster/#deploy-local-zookeeper">local ZooKeeper</a>.</li></ol><p>(1) Configure a local ZooKeeper.</p><ul><li>Change the value of the following configurations in the cluster2/zk1/conf/zookeeper.conf file.</li></ul><pre>clientPort=2186<br>admin.serverPort=9992</pre><ul><li>Add the following configurations to the cluster2/zk1/conf/pulsar_env.sh file.</li></ul><pre>OPTS=&quot;-Dstats_server_port=8011&quot;</pre><p>(2) Start a <a href="https://pulsar.apache.org/docs/en/next/deploy-bare-metal-multi-cluster/#deploy-local-zookeeper">local ZooKeeper</a>.</p><p><strong>Input</strong></p><pre>cd cluster2/zk1<br><br>bin/pulsar-daemon start zookeeper</pre><p>2. <a href="https://pulsar.apache.org/docs/en/next/deploy-bare-metal-multi-cluster/#cluster-metadata-initialization">Initialize metadata</a>.</p><p><strong>Input</strong></p><pre>bin/pulsar initialize-cluster-metadata \<br>  --cluster cluster2 \<br>  --zookeeper localhost:2186 \<br>  --configuration-store localhost:2184 \<br>  --web-service-url http://localhost:8081/ \<br>  --web-service-url-tls https://localhost:8444/ \<br>  --broker-service-url pulsar://localhost:6660/ \<br>  --broker-service-url-tls pulsar+ssl://localhost:6661/</pre><p>3. <a href="https://pulsar.apache.org/docs/en/next/deploy-bare-metal-multi-cluster/#deploy-bookkeeper">Deploy BookKeeper</a>.</p><p>(1) <a href="https://pulsar.apache.org/docs/en/next/concepts-architecture-overview#persistent-storage">Configure bookies</a>.</p><p>Change the value of the following configurations in the cluster2/bk1/conf/bookkeeper.conf file.</p><pre>bookiePort=3182<br>zkServers=localhost:2186<br>allowLoopback=true<br>prometheusStatsHttpPort=8003<br>httpServerPort=8003</pre><p>(2) <a href="https://pulsar.apache.org/docs/en/next/deploy-bare-metal-multi-cluster/#start-bookies">Start bookies</a>.</p><p><strong>Input</strong></p><pre>cd cluster2/bk1<br><br>bin/pulsar-daemon start bookie</pre><p>Check whether the bookie is started successfully.</p><p><strong>Input</strong></p><pre>bin/bookkeeper shell bookiesanity</pre><p><strong>Output</strong></p><pre>Bookie sanity test succeeded</pre><p>4. 
<a href="https://pulsar.apache.org/docs/en/next/deploy-bare-metal-multi-cluster/#deploy-brokers">Deploy brokers</a>.</p><p>(1) <a href="https://pulsar.apache.org/docs/en/next/deploy-bare-metal-multi-cluster/#broker-configuration">Configure brokers</a>.</p><ul><li>Change the value of the following configurations in the cluster2/broker1/conf/broker.conf file.</li></ul><pre>clusterName=cluster2<br>zookeeperServers=127.0.0.1:2186<br>configurationStoreServers=127.0.0.1:2184<br>brokerServicePort=6660<br>webServicePort=8081<br>managedLedgerDefaultEnsembleSize=1<br>managedLedgerDefaultWriteQuorum=1<br>managedLedgerDefaultAckQuorum=1</pre><ul><li>Change the value of the following configurations in the cluster2/broker1/conf/client.conf file.</li></ul><pre>webServiceUrl=http://localhost:8081/<br>brokerServiceUrl=pulsar://localhost:6660/</pre><p>(2) <a href="https://pulsar.apache.org/docs/en/next/deploy-bare-metal-multi-cluster/#start-the-broker-service">Start brokers</a>.</p><p><strong>Input</strong></p><pre>cd cluster2/broker1<br><br>bin/pulsar-daemon start broker</pre><h3>Verify data isolation of clusters</h3><p>This section verifies whether the data in the two Pulsar clusters is isolated.</p><ol><li>Create namespace1 and assign it to cluster1.</li></ol><blockquote><strong><em>Tip</em></strong><em><br>The format of a namespace name is </em><em>&lt;tenant-name&gt;/&lt;namespace-name&gt;. For more information, see Namespaces.</em></blockquote><p><strong>Input</strong></p><pre>cd cluster1/broker1<br><br>bin/pulsar-admin namespaces create -c cluster1 public/namespace1</pre><p>Check the result.</p><p><strong>Input</strong></p><pre>bin/pulsar-admin namespaces list public</pre><p><strong>Output</strong></p><pre>&quot;public/default&quot;<br>&quot;public/namespace1&quot;</pre><p>2. Set the retention policy for namespace1.</p><blockquote><strong><em>Note</em></strong><em><br>If the retention policy is not set and the topic is not subscribed, the data stored on the topic is deleted automatically after a while.</em></blockquote><p><strong>Input</strong></p><pre>bin/pulsar-admin namespaces set-retention -s 100M -t 3d public/namespace1</pre><p>3. Create topic1 in namespace1 and write 1000 messages to this topic.</p><blockquote><strong><em>Tip</em></strong><em><br>The pulsar-client is a command line tool to send and consume data. For more information, see </em><a href="https://pulsar.apache.org/docs/en/next/reference-cli-tools/"><em>Pulsar command line tools</em></a><em>.</em></blockquote>
<p><strong>Input</strong></p><pre>bin/pulsar-client produce -m &#39;hello c1 to c2&#39; -n 1000 public/namespace1/topic1<br><br>09:56:34.504 [main] INFO  org.apache.pulsar.client.cli.PulsarClientTool - 1000 messages successfully produced</pre><p>Check the result.</p><p><strong>Input</strong></p><pre>bin/pulsar-admin --admin-url http://localhost:8080 topics stats-internal public/namespace1/topic1</pre><p><strong>Output</strong></p><p>The entriesAddedCounter parameter shows that 1000 messages are added.</p><pre>{<br>  &quot;entriesAddedCounter&quot; : 1000,<br>  &quot;numberOfEntries&quot; : 1000,<br>  &quot;totalSize&quot; : 65616,<br>  &quot;currentLedgerEntries&quot; : 1000,<br>  &quot;currentLedgerSize&quot; : 65616,<br>  &quot;lastLedgerCreatedTimestamp&quot; : &quot;2021-04-22T10:24:00.582+08:00&quot;,<br>  &quot;waitingCursorsCount&quot; : 0,<br>  &quot;pendingAddEntriesCount&quot; : 0,<br>  &quot;lastConfirmedEntry&quot; : &quot;4:999&quot;,<br>  &quot;state&quot; : &quot;LedgerOpened&quot;,<br>  &quot;ledgers&quot; : [ {<br>    &quot;ledgerId&quot; : 4,<br>    &quot;entries&quot; : 0,<br>    &quot;size&quot; : 0,<br>    &quot;offloaded&quot; : false<br>  } ],<br>  &quot;cursors&quot; : { },<br>  &quot;compactedLedger&quot; : {<br>    &quot;ledgerId&quot; : -1,<br>    &quot;entries&quot; : -1,<br>    &quot;size&quot; : -1,<br>    &quot;offloaded&quot; : false<br>  }<br>}</pre><p>4. Check the data stored on public/namespace1/topic1 from cluster2 (localhost:8081).</p><p><strong>Input</strong></p><pre>bin/pulsar-admin --admin-url http://localhost:8081 topics stats-internal public/namespace1/topic1</pre><p><strong>Output</strong></p><p>The attempt failed. The error message shows that public/namespace1 is assigned only to cluster1. This proves that the data is isolated.</p><pre>Namespace missing local cluster name in clusters list: local_cluster=cluster2 ns=public/namespace1 clusters=[cluster1]<br><br>Reason: Namespace missing local cluster name in clusters list: local_cluster=cluster2 ns=public/namespace1 clusters=[cluster1]</pre><p>5. Write data to public/namespace1/topic1 in cluster2.</p><p><strong>Input</strong></p><pre>cd cluster2/broker1<br><br>bin/pulsar-client produce -m &#39;hello c1 to c2&#39; -n 1000 public/namespace1/topic1</pre><p><strong>Output</strong></p><p>The output shows that 0 messages were written. The attempt failed because namespace1 is assigned only to cluster1. This proves that the data is isolated.</p><pre>12:09:50.005 [main] INFO  org.apache.pulsar.client.cli.PulsarClientTool - 0 messages successfully produced</pre><h3>Synchronize and migrate data between clusters</h3><p>After verifying that the data is isolated, you can synchronize (using geo-replication) and migrate data between clusters.</p><ol><li>Assign namespace1 to cluster2, that is, add cluster2 to the cluster list of namespace1.</li></ol><p>This enables geo-replication to synchronize the data between cluster1 and cluster2.</p><p><strong>Input</strong></p><pre>bin/pulsar-admin namespaces set-clusters --clusters cluster1,cluster2 public/namespace1</pre><p>Check the result.</p><p><strong>Input</strong></p><pre>bin/pulsar-admin namespaces get-clusters public/namespace1</pre><p><strong>Output</strong></p><pre>&quot;cluster1&quot;<br>&quot;cluster2&quot;</pre>
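<p>If you manage replication programmatically, the same assignment can be made through the Java admin API. The following is a minimal sketch (the service URL matches this deployment; error handling is omitted):</p><pre>PulsarAdmin admin = PulsarAdmin.builder()<br>        .serviceHttpUrl(&quot;http://localhost:8080&quot;)<br>        .build();<br><br>// equivalent to: bin/pulsar-admin namespaces set-clusters --clusters cluster1,cluster2 public/namespace1<br>admin.namespaces().setNamespaceReplicationClusters(<br>        &quot;public/namespace1&quot;, new HashSet&lt;&gt;(Arrays.asList(&quot;cluster1&quot;, &quot;cluster2&quot;)));<br><br>admin.close();</pre>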
<p>2. Check whether topic1 is in cluster2.</p><p><strong>Input</strong></p><pre>bin/pulsar-admin --admin-url http://localhost:8081 topics stats-internal public/namespace1/topic1</pre><p><strong>Output</strong></p><p>The output shows that there are 1000 messages on cluster2/topic1. This proves that the data stored on cluster1/topic1 is replicated to cluster2 successfully.</p><pre>{<br>  &quot;entriesAddedCounter&quot; : 1000,<br>  &quot;numberOfEntries&quot; : 1000,<br>  &quot;totalSize&quot; : 75616,<br>  &quot;currentLedgerEntries&quot; : 1000,<br>  &quot;currentLedgerSize&quot; : 75616,<br>  &quot;lastLedgerCreatedTimestamp&quot; : &quot;2021-04-23T12:02:52.929+08:00&quot;,<br>  &quot;waitingCursorsCount&quot; : 1,<br>  &quot;pendingAddEntriesCount&quot; : 0,<br>  &quot;lastConfirmedEntry&quot; : &quot;1:999&quot;,<br>  &quot;state&quot; : &quot;LedgerOpened&quot;,<br>  &quot;ledgers&quot; : [ {<br>    &quot;ledgerId&quot; : 1,<br>    &quot;entries&quot; : 0,<br>    &quot;size&quot; : 0,<br>    &quot;offloaded&quot; : false<br>  } ],<br>  &quot;cursors&quot; : {<br>    &quot;pulsar.repl.cluster1&quot; : {<br>      &quot;markDeletePosition&quot; : &quot;1:999&quot;,<br>      &quot;readPosition&quot; : &quot;1:1000&quot;,<br>      &quot;waitingReadOp&quot; : true,<br>      &quot;pendingReadOps&quot; : 0,<br>      &quot;messagesConsumedCounter&quot; : 1000,<br>      &quot;cursorLedger&quot; : 2,<br>      &quot;cursorLedgerLastEntry&quot; : 2,<br>      &quot;individuallyDeletedMessages&quot; : &quot;[]&quot;,<br>      &quot;lastLedgerSwitchTimestamp&quot; : &quot;2021-04-23T12:02:53.248+08:00&quot;,<br>      &quot;state&quot; : &quot;Open&quot;,<br>      &quot;numberOfEntriesSinceFirstNotAckedMessage&quot; : 1,<br>      &quot;totalNonContiguousDeletedMessagesRange&quot; : 0,<br>      &quot;properties&quot; : { }<br>    }<br>  },<br>  &quot;compactedLedger&quot; : {<br>    &quot;ledgerId&quot; : -1,<br>    &quot;entries&quot; : -1,<br>    &quot;size&quot; : -1,<br>    &quot;offloaded&quot; : false<br>  }<br>}</pre><p>3. Migrate the producer and consumer from cluster1 to cluster2.</p><pre>PulsarClient pulsarClient1 = PulsarClient.builder().serviceUrl(&quot;pulsar://localhost:6650&quot;).build();<br>// migrate the client to cluster2 at pulsar://localhost:6660<br>PulsarClient pulsarClient2 = PulsarClient.builder().serviceUrl(&quot;pulsar://localhost:6660&quot;).build();</pre>
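<p>In practice, migrating means closing the producers and consumers created from pulsarClient1 and recreating them from pulsarClient2. A minimal sketch (the topic matches this tutorial; error handling is omitted):</p><pre>// the producer originally created from pulsarClient1 (cluster1)<br>Producer&lt;byte[]&gt; producer = pulsarClient1.newProducer()<br>        .topic(&quot;public/namespace1/topic1&quot;).create();<br><br>// to migrate: close it, then recreate it from pulsarClient2 (cluster2)<br>producer.close();<br>producer = pulsarClient2.newProducer()<br>        .topic(&quot;public/namespace1/topic1&quot;).create();<br>producer.send(&quot;hello from cluster2&quot;.getBytes());</pre>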
<p>4. Remove cluster1 from the cluster list of namespace1.</p><p><strong>Input</strong></p><pre>bin/pulsar-admin namespaces set-clusters --clusters cluster2 public/namespace1</pre><p>Check if the data is stored on cluster1/topic1.</p><p><strong>Input</strong></p><pre>cd cluster1/broker1<br><br>bin/pulsar-admin --admin-url http://localhost:8080 topics stats-internal public/namespace1/topic1</pre><p><strong>Output</strong></p><p>The data is removed from cluster1/topic1 successfully since the value of the numberOfEntries parameter is 0.</p><pre>{<br>  &quot;entriesAddedCounter&quot; : 0,<br>  &quot;numberOfEntries&quot; : 0,<br>  &quot;totalSize&quot; : 0,<br>  &quot;currentLedgerEntries&quot; : 0,<br>  &quot;currentLedgerSize&quot; : 0,<br>  &quot;lastLedgerCreatedTimestamp&quot; : &quot;2021-04-23T15:20:08.1+08:00&quot;,<br>  &quot;waitingCursorsCount&quot; : 1,<br>  &quot;pendingAddEntriesCount&quot; : 0,<br>  &quot;lastConfirmedEntry&quot; : &quot;3:-1&quot;,<br>  &quot;state&quot; : &quot;LedgerOpened&quot;,<br>  &quot;ledgers&quot; : [ {<br>    &quot;ledgerId&quot; : 3,<br>    &quot;entries&quot; : 0,<br>    &quot;size&quot; : 0,<br>    &quot;offloaded&quot; : false<br>  } ],<br>  &quot;cursors&quot; : {<br>    &quot;pulsar.repl.cluster2&quot; : {<br>      &quot;markDeletePosition&quot; : &quot;3:-1&quot;,<br>      &quot;readPosition&quot; : &quot;3:0&quot;,<br>      &quot;waitingReadOp&quot; : true,<br>      &quot;pendingReadOps&quot; : 0,<br>      &quot;messagesConsumedCounter&quot; : 0,<br>      &quot;cursorLedger&quot; : 4,<br>      &quot;cursorLedgerLastEntry&quot; : 0,<br>      &quot;individuallyDeletedMessages&quot; : &quot;[]&quot;,<br>      &quot;lastLedgerSwitchTimestamp&quot; : &quot;2021-04-23T15:20:08.122+08:00&quot;,<br>      &quot;state&quot; : &quot;Open&quot;,<br>      &quot;numberOfEntriesSinceFirstNotAckedMessage&quot; : 1,<br>      &quot;totalNonContiguousDeletedMessagesRange&quot; : 0,<br>      &quot;properties&quot; : { }<br>    }<br>  },<br>  &quot;compactedLedger&quot; : {<br>    &quot;ledgerId&quot; : -1,<br>    &quot;entries&quot; : -1,<br>    &quot;size&quot; : -1,<br>    &quot;offloaded&quot; : false<br>  }<br>}</pre><p>At this point, you replicated data from cluster1/topic1 to cluster2 and then removed the data from cluster1/topic1.</p><h3>Scale up and down nodes</h3><p>If you need to handle increasing or decreasing workloads, you can scale nodes up or down. This section demonstrates how to scale up and scale down nodes (brokers and bookies).</p><h4>Broker</h4><p><strong>Scale up brokers</strong></p><p>In this procedure, you’ll create 2 partitioned topics on cluster1/broker1 and add 2 brokers. Then, you’ll unload the partitioned topics and check the data distribution among the 3 brokers.</p><ol><li>Check the information about brokers in cluster1.</li></ol><p><strong>Input</strong></p><pre>cd cluster1/broker1<br><br>bin/pulsar-admin brokers list cluster1</pre><p><strong>Output</strong></p><p>The output shows that broker1 is the only broker in cluster1.</p><pre>&quot;192.168.0.105:8080&quot;</pre><p>2. 
Create 2 partitioned topics on cluster1/broker1.</p><p>Create 6 partitions for partitioned-topic1 and 7 partitions for partitioned-topic2.</p><p><strong>Input</strong></p><pre>bin/pulsar-admin topics create-partitioned-topic -p 6 public/namespace1/partitioned-topic1<br><br>bin/pulsar-admin topics create-partitioned-topic -p 7 public/namespace1/partitioned-topic2</pre><p>Check the result.</p><p><strong>Input</strong></p><pre>bin/pulsar-admin topics partitioned-lookup public/namespace1/partitioned-topic1</pre><p><strong>Output</strong></p><p>All partitions of partitioned-topic1 are served by broker1.</p><pre>&quot;persistent://public/namespace1/partitioned-topic1-partition-0    pulsar://192.168.0.105:6650&quot;<br>&quot;persistent://public/namespace1/partitioned-topic1-partition-1    pulsar://192.168.0.105:6650&quot;<br>&quot;persistent://public/namespace1/partitioned-topic1-partition-2    pulsar://192.168.0.105:6650&quot;<br>&quot;persistent://public/namespace1/partitioned-topic1-partition-3    pulsar://192.168.0.105:6650&quot;<br>&quot;persistent://public/namespace1/partitioned-topic1-partition-4    pulsar://192.168.0.105:6650&quot;<br>&quot;persistent://public/namespace1/partitioned-topic1-partition-5    pulsar://192.168.0.105:6650&quot;</pre><p><strong>Input</strong></p><pre>bin/pulsar-admin topics partitioned-lookup public/namespace1/partitioned-topic2</pre><p><strong>Output</strong></p><p>All partitions of partitioned-topic2 are served by broker1.</p><pre>&quot;persistent://public/namespace1/partitioned-topic2-partition-0    pulsar://192.168.0.105:6650&quot;<br>&quot;persistent://public/namespace1/partitioned-topic2-partition-1    pulsar://192.168.0.105:6650&quot;<br>&quot;persistent://public/namespace1/partitioned-topic2-partition-2    pulsar://192.168.0.105:6650&quot;<br>&quot;persistent://public/namespace1/partitioned-topic2-partition-3    pulsar://192.168.0.105:6650&quot;<br>&quot;persistent://public/namespace1/partitioned-topic2-partition-4    pulsar://192.168.0.105:6650&quot;<br>&quot;persistent://public/namespace1/partitioned-topic2-partition-5    pulsar://192.168.0.105:6650&quot;<br>&quot;persistent://public/namespace1/partitioned-topic2-partition-6    pulsar://192.168.0.105:6650&quot;</pre><p>3. Add broker2 and broker3.</p><p>(1) Prepare for deployment.</p><p>Create two empty directories (broker2 and broker3) under the cluster1 directory.
Copy the untarred Pulsar files into these two directories.</p><pre>|-separate-clusters<br>    |-configuration-store<br>        |-zk1<br>    |-cluster1<br>        |-zk1<br>        |-bk1<br>        |-broker1<br>        |-broker2<br>        |-broker3<br>    |-cluster2<br>        |-zk1<br>        |-bk1<br>        |-broker1</pre><p>(2) <a href="https://pulsar.apache.org/docs/en/next/deploy-bare-metal-multi-cluster/#deploy-brokers">Deploy brokers</a>.</p><p>(2.a) <a href="https://pulsar.apache.org/docs/en/next/deploy-bare-metal-multi-cluster/#broker-configuration">Configure brokers</a>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/756/1*alL8_87UeeLblbW6qMLn9g.png" /></figure><p>(2.b) <a href="https://pulsar.apache.org/docs/en/next/deploy-bare-metal-multi-cluster/#start-the-broker-service">Start brokers</a>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/475/1*HzJF2-pMjt59xeR7evhVHQ.png" /></figure><p>(2.c) Check the information about the running brokers in cluster1.</p><p><strong>Input</strong></p><pre>bin/pulsar-admin brokers list cluster1</pre><p><strong>Output</strong></p><pre>&quot;192.168.0.105:8080&quot; // broker1<br>&quot;192.168.0.105:8082&quot; // broker2<br>&quot;192.168.0.105:8083&quot; // broker3</pre><p>4. Unload public/namespace1 so that its topics are redistributed across the brokers.</p><p><strong>Input</strong></p><pre>bin/pulsar-admin namespaces unload public/namespace1</pre><p>Check the result.</p><p>(1) Check the distribution of data stored on partitioned-topic1.</p><p><strong>Input</strong></p><pre>bin/pulsar-admin topics partitioned-lookup public/namespace1/partitioned-topic1</pre><p><strong>Output</strong></p><p>The output shows that the partitions of partitioned-topic1 are now distributed across broker1, broker2, and broker3.</p><pre>&quot;persistent://public/namespace1/partitioned-topic1-partition-0    pulsar://192.168.0.105:6650&quot;<br>&quot;persistent://public/namespace1/partitioned-topic1-partition-1    pulsar://192.168.0.105:6653&quot;<br>&quot;persistent://public/namespace1/partitioned-topic1-partition-2    pulsar://192.168.0.105:6652&quot;<br>&quot;persistent://public/namespace1/partitioned-topic1-partition-3    pulsar://192.168.0.105:6653&quot;<br>&quot;persistent://public/namespace1/partitioned-topic1-partition-4    pulsar://192.168.0.105:6650&quot;<br>&quot;persistent://public/namespace1/partitioned-topic1-partition-5    pulsar://192.168.0.105:6653&quot;</pre><p>(2) Check the distribution of data stored on partitioned-topic2.</p><p><strong>Input</strong></p><pre>bin/pulsar-admin topics partitioned-lookup public/namespace1/partitioned-topic2</pre><p>The output shows that the partitions of partitioned-topic2 are now distributed across broker1, broker2, and broker3.</p><p><strong>Output</strong></p><pre>&quot;persistent://public/namespace1/partitioned-topic2-partition-0    pulsar://192.168.0.105:6653&quot;<br>&quot;persistent://public/namespace1/partitioned-topic2-partition-1    pulsar://192.168.0.105:6650&quot;<br>&quot;persistent://public/namespace1/partitioned-topic2-partition-2    pulsar://192.168.0.105:6653&quot;<br>&quot;persistent://public/namespace1/partitioned-topic2-partition-3    pulsar://192.168.0.105:6652&quot;<br>&quot;persistent://public/namespace1/partitioned-topic2-partition-4    pulsar://192.168.0.105:6653&quot;<br>&quot;persistent://public/namespace1/partitioned-topic2-partition-5    pulsar://192.168.0.105:6650&quot;<br>&quot;persistent://public/namespace1/partitioned-topic2-partition-6    pulsar://192.168.0.105:6653&quot;</pre>
<p><strong>Scale down brokers</strong></p><blockquote><strong><em>Tip</em></strong><em><br>The following steps continue from the previous section “Scale up brokers”.</em></blockquote><p>In this procedure, you’ll stop 1 broker in cluster1 and check how the data stored on the partitioned topics is distributed among the other brokers.</p><ol><li>Stop broker3.</li></ol><p><strong>Input</strong></p><pre>cd cluster1/broker3<br><br>bin/pulsar-daemon stop broker</pre><p>Check the result.</p><p><strong>Input</strong></p><pre>bin/pulsar-admin brokers list cluster1</pre><p><strong>Output</strong></p><p>The output shows that only broker1 and broker2 are running in cluster1.</p><pre>&quot;192.168.0.105:8080&quot; // broker1<br>&quot;192.168.0.105:8082&quot; // broker2</pre><p>2. Check the distribution of data stored on partitioned-topic1.</p><p><strong>Input</strong></p><pre>bin/pulsar-admin topics partitioned-lookup public/namespace1/partitioned-topic1</pre><p><strong>Output</strong></p><p>The output shows that the partitions of partitioned-topic1 are now distributed between broker1 and broker2, which means that the partitions previously served by broker3 have been redistributed.</p><pre>&quot;persistent://public/namespace1/partitioned-topic1-partition-0    pulsar://192.168.0.105:6650&quot;<br>&quot;persistent://public/namespace1/partitioned-topic1-partition-1    pulsar://192.168.0.105:6650&quot;<br>&quot;persistent://public/namespace1/partitioned-topic1-partition-2    pulsar://192.168.0.105:6652&quot;<br>&quot;persistent://public/namespace1/partitioned-topic1-partition-3    pulsar://192.168.0.105:6652&quot;<br>&quot;persistent://public/namespace1/partitioned-topic1-partition-4    pulsar://192.168.0.105:6650&quot;<br>&quot;persistent://public/namespace1/partitioned-topic1-partition-5    pulsar://192.168.0.105:6650&quot;</pre><p>Similarly, the partitions of partitioned-topic2 are now distributed between broker1 and broker2.</p><p><strong>Input</strong></p><pre>bin/pulsar-admin topics partitioned-lookup public/namespace1/partitioned-topic2</pre><p><strong>Output</strong></p><pre>&quot;persistent://public/namespace1/partitioned-topic2-partition-0    pulsar://192.168.0.105:6650&quot;<br>&quot;persistent://public/namespace1/partitioned-topic2-partition-1    pulsar://192.168.0.105:6650&quot;<br>&quot;persistent://public/namespace1/partitioned-topic2-partition-2    pulsar://192.168.0.105:6652&quot;<br>&quot;persistent://public/namespace1/partitioned-topic2-partition-3    pulsar://192.168.0.105:6652&quot;<br>&quot;persistent://public/namespace1/partitioned-topic2-partition-4    pulsar://192.168.0.105:6650&quot;<br>&quot;persistent://public/namespace1/partitioned-topic2-partition-5    pulsar://192.168.0.105:6650&quot;<br>&quot;persistent://public/namespace1/partitioned-topic2-partition-6    pulsar://192.168.0.105:6652&quot;</pre><h4>Bookie</h4><p><strong>Scale up bookies</strong></p><p>In this procedure, you’ll add 2 bookies (bk2 and bk3) to cluster1.
Then, you’ll write data to topic1 and check whether the replicas are saved.</p><ol><li>Check the information about bookies in cluster1.</li></ol><p><strong>Input</strong></p><pre>cd cluster1/bk1<br><br>bin/bookkeeper shell listbookies -rw -h</pre><p><strong>Output</strong></p><p>The output shows that bookie1 is the only bookie in cluster1.</p><pre>12:31:34.933 [main] INFO  org.apache.bookkeeper.tools.cli.commands.bookies.ListBookiesCommand - ReadWrite Bookies :<br>12:31:34.946 [main] INFO  org.apache.bookkeeper.tools.cli.commands.bookies.ListBookiesCommand - BookieID:192.168.0.105:3181, IP:192.168.0.105, Port:3181, Hostname:192.168.0.105</pre><p>2. Configure the broker to write to 3 bookies.</p><p>Change the values of the following configurations in the cluster1/broker1/conf/broker.conf file.</p><pre>managedLedgerDefaultEnsembleSize=3 // specify the number of bookies to use when creating a ledger<br>managedLedgerDefaultWriteQuorum=3 // specify the number of copies to store for each message<br>managedLedgerDefaultAckQuorum=2  // specify the number of guaranteed copies (acks to wait before writing is completed)</pre><p>3. Restart broker1 to enable the configurations.</p><p><strong>Input</strong></p><pre>cd cluster1/broker1<br><br>bin/pulsar-daemon stop broker<br><br>bin/pulsar-daemon start broker</pre><p>4. Set the retention policy for the messages in public/default.</p><blockquote><strong><em>Note</em></strong><em><br>If the retention policy is not set and the topic is not subscribed, the data of the topic is deleted automatically after a while.</em></blockquote><p><strong>Input</strong></p><pre>cd cluster1/broker1<br><br>bin/pulsar-admin namespaces set-retention -s 100M -t 3d public/default</pre><p>5. Create topic1 in public/default and write 100 messages to this topic.</p><p><strong>Input</strong></p><pre>bin/pulsar-client produce -m &#39;hello&#39; -n 100 topic1</pre><p><strong>Output</strong></p><p>The data is not written successfully because there are not enough bookies: the broker now requires an ensemble of 3 bookies, but only bookie1 is running.</p><pre>···<br><br>12:40:38.886 [pulsar-client-io-1-1] WARN  org.apache.pulsar.client.impl.ClientCnx - [id: 0x56f92aff, L:/192.168.0.105:53069 - R:/192.168.0.105:6650] Received error from server: org.apache.bookkeeper.mledger.ManagedLedgerException: Not enough non-faulty bookies available<br><br>...<br><br>12:40:38.886 [main] ERROR org.apache.pulsar.client.cli.PulsarClientTool - Error while producing messages<br><br>...<br><br>12:40:38.890 [main] INFO  org.apache.pulsar.client.cli.PulsarClientTool - 0 messages successfully produced</pre><p>6. Add bookie2 and bookie3.</p><p>(1) Prepare for deployment.</p><p>Create two empty directories (bk2 and bk3) under the cluster1 directory.
Copy the untarred Pulsar files into these two directories.</p><pre>|-separate-clusters<br>    |-configuration-store<br>        |-zk1<br>    |-cluster1<br>        |-zk1<br>        |-bk1<br>        |-bk2<br>        |-bk3<br>        |-broker1<br>    |-cluster2<br>        |-zk1<br>        |-bk1<br>        |-broker1</pre><p>(2) Deploy bookies.</p><p>(2.a) Configure bookies.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/757/1*TPRh7JebfdEWkTuDgTxDyg.png" /></figure><p>(2.b) Start bookies.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/516/1*0OWi6Z6n1ScpvOQRmwEFKA.png" /></figure><p>(2.c) Check the running bookies in cluster1.</p><p><strong>Input</strong></p><pre>bin/bookkeeper shell listbookies -rw -h</pre><p><strong>Output</strong></p><p>All three bookies are running in cluster1: bookie1 (192.168.0.105:3181), bookie2 (192.168.0.105:3183), and bookie3 (192.168.0.105:3184).</p><pre>12:12:47.574 [main] INFO  org.apache.bookkeeper.tools.cli.commands.bookies.ListBookiesCommand - BookieID:192.168.0.105:3183, IP:192.168.0.105, Port:3183, Hostname:192.168.0.105<br>12:12:47.575 [main] INFO  org.apache.bookkeeper.tools.cli.commands.bookies.ListBookiesCommand - BookieID:192.168.0.105:3184, IP:192.168.0.105, Port:3184, Hostname:192.168.0.105<br>12:12:47.576 [main] INFO  org.apache.bookkeeper.tools.cli.commands.bookies.ListBookiesCommand - BookieID:192.168.0.105:3181, IP:192.168.0.105, Port:3181, Hostname:192.168.0.105</pre><p>7. Set the retention policy for messages in public/default.</p><blockquote><strong><em>Note</em></strong><em><br>If the retention policy is not set and the topic is not subscribed, the data stored on the topic is deleted automatically after a while.</em></blockquote><p><strong>Input</strong></p><pre>cd cluster1/broker1<br><br>bin/pulsar-admin namespaces set-retention -s 100M -t 3d public/default</pre><p>8. Create topic1 in public/default and write 100 messages to this topic.</p><p><strong>Input</strong></p><pre>bin/pulsar-client produce -m &#39;hello&#39; -n 100 topic1</pre><p><strong>Output</strong></p><p>The messages are written successfully.</p><pre>...<br>12:17:40.222 [main] INFO  org.apache.pulsar.client.cli.PulsarClientTool - 100 messages successfully produced</pre><p>9. Check the information about topic1.</p><p><strong>Input</strong></p><pre>bin/pulsar-admin topics stats-internal topic1</pre><p><strong>Output</strong></p><p>The output shows that the data stored on topic1 is saved in the ledger with ledgerId 5.</p><pre>{<br>  &quot;entriesAddedCounter&quot; : 100,<br>  &quot;numberOfEntries&quot; : 100,<br>  &quot;totalSize&quot; : 5500,<br>  &quot;currentLedgerEntries&quot; : 100,<br>  &quot;currentLedgerSize&quot; : 5500,<br>  &quot;lastLedgerCreatedTimestamp&quot; : &quot;2021-05-11T12:17:38.881+08:00&quot;,<br>  &quot;waitingCursorsCount&quot; : 0,<br>  &quot;pendingAddEntriesCount&quot; : 0,<br>  &quot;lastConfirmedEntry&quot; : &quot;5:99&quot;,<br>  &quot;state&quot; : &quot;LedgerOpened&quot;,<br>  &quot;ledgers&quot; : [ {<br>    &quot;ledgerId&quot; : 5,<br>    &quot;entries&quot; : 0,<br>    &quot;size&quot; : 0,<br>    &quot;offloaded&quot; : false<br>  } ],<br>  &quot;cursors&quot; : { },<br>  &quot;compactedLedger&quot; : {<br>    &quot;ledgerId&quot; : -1,<br>    &quot;entries&quot; : -1,<br>    &quot;size&quot; : -1,<br>    &quot;offloaded&quot; : false<br>  }<br>}</pre><p>10. 
Check in which bookies the ledger with ledgerId 5 is saved.</p><p><strong>Input</strong></p><pre>bin/bookkeeper shell ledgermetadata -ledgerid 5</pre><p><strong>Output</strong></p><p>As configured previously, the ledger with ledgerId 5 is saved on bookie1 (3181), bookie2 (3183), and bookie3 (3184).</p><pre>...<br>12:23:17.705 [main] INFO  org.apache.bookkeeper.tools.cli.commands.client.LedgerMetaDataCommand - ledgerID: 5<br>12:23:17.714 [main] INFO  org.apache.bookkeeper.tools.cli.commands.client.LedgerMetaDataCommand - LedgerMetadata{formatVersion=3, ensembleSize=3, writeQuorumSize=3, ackQuorumSize=2, state=OPEN, digestType=CRC32C, password=base64:, ensembles={0=[192.168.0.105:3184, 192.168.0.105:3181, 192.168.0.105:3183]}, customMetadata={component=base64:bWFuYWdlZC1sZWRnZXI=, pulsar/managed-ledger=base64:cHVibGljL2RlZmF1bHQvcGVyc2lzdGVudC90b3BpYzE=, application=base64:cHVsc2Fy}}<br>...</pre><p><strong>Scale down bookies</strong></p><blockquote><strong><em>Tip</em></strong><em><br>The following steps continue from the previous section “Scale up bookies”.</em></blockquote><p>In this procedure, you’ll remove 2 bookies. Then, you’ll write data to topic2 and check where the data is saved.</p><ol><li>Configure the broker to write to 1 bookie.</li></ol><p>Change the values of the following configurations in the cluster1/broker1/conf/broker.conf file.</p><pre>managedLedgerDefaultEnsembleSize=1 // specify the number of bookies to use when creating a ledger<br>managedLedgerDefaultWriteQuorum=1 // specify the number of copies to store for each message<br>managedLedgerDefaultAckQuorum=1  // specify the number of guaranteed copies (acks to wait before writing is completed)</pre><p>2. Restart broker1 to enable the configurations.</p><p><strong>Input</strong></p><pre>cd cluster1/broker1<br><br>bin/pulsar-daemon stop broker<br><br>bin/pulsar-daemon start broker</pre><p>3. Check the information about bookies in cluster1.</p><p><strong>Input</strong></p><pre>cd cluster1/bk1<br><br>bin/bookkeeper shell listbookies -rw -h</pre><p><strong>Output</strong></p><p>All three bookies are running in cluster1, including bookie1 (3181), bookie2 (3183), and bookie3 (3184).</p><pre>...<br>15:47:41.370 [main] INFO  org.apache.bookkeeper.tools.cli.commands.bookies.ListBookiesCommand - ReadWrite Bookies :<br>15:47:41.382 [main] INFO  org.apache.bookkeeper.tools.cli.commands.bookies.ListBookiesCommand - BookieID:192.168.0.105:3183, IP:192.168.0.105, Port:3183, Hostname:192.168.0.105<br>15:47:41.383 [main] INFO  org.apache.bookkeeper.tools.cli.commands.bookies.ListBookiesCommand - BookieID:192.168.0.105:3184, IP:192.168.0.105, Port:3184, Hostname:192.168.0.105<br>15:47:41.384 [main] INFO  org.apache.bookkeeper.tools.cli.commands.bookies.ListBookiesCommand - BookieID:192.168.0.105:3181, IP:192.168.0.105, Port:3181, Hostname:192.168.0.105<br>...</pre><p>4. Stop bookie2 and bookie3.</p><blockquote><strong><em>Tip</em></strong><em><br>For more information about how to stop bookies, see </em><a href="https://bookkeeper.apache.org/docs/4.13.0/admin/decomission/"><em>Decommission Bookies</em></a><em>.</em></blockquote><p><strong>Input</strong></p><pre>cd cluster1/bk2<br><br>bin/bookkeeper shell listunderreplicated<br><br>bin/pulsar-daemon stop bookie<br><br>bin/bookkeeper shell decommissionbookie</pre><p><strong>Input</strong></p><pre>cd cluster1/bk3<br><br>bin/bookkeeper shell listunderreplicated<br><br>bin/pulsar-daemon stop bookie<br><br>bin/bookkeeper shell decommissionbookie</pre><p>5. 
Check the information about bookies in cluster1.</p><p><strong>Input</strong></p><pre>cd cluster1/bk1<br><br>bin/bookkeeper shell listbookies -rw -h</pre><p><strong>Output</strong></p><p>The output shows that bookie1 (3181) is the only running bookie in cluster1.</p><pre>...<br>16:05:28.690 [main] INFO  org.apache.bookkeeper.tools.cli.commands.bookies.ListBookiesCommand - ReadWrite Bookies :<br>16:05:28.700 [main] INFO  org.apache.bookkeeper.tools.cli.commands.bookies.ListBookiesCommand - BookieID:192.168.0.105:3181, IP:192.168.0.105, Port:3181, Hostname:192.168.0.105<br>...</pre><p>6. Set the retention policy for the messages in public/default.</p><blockquote><strong><em>Note</em></strong><em><br>If the retention policy is not set and the topic is not subscribed, the data stored on the topic is deleted automatically after a while.</em></blockquote><p><strong>Input</strong></p><pre>cd cluster1/broker1<br><br>bin/pulsar-admin namespaces set-retention -s 100M -t 3d public/default</pre><p>7. Create topic2 in public/default and write 100 messages to this topic.</p><p><strong>Input</strong></p><pre>bin/pulsar-client produce -m &#39;hello&#39; -n 100 topic2</pre><p><strong>Output</strong></p><p>The data is written successfully.</p><pre>...<br>16:06:59.448 [main] INFO  org.apache.pulsar.client.cli.PulsarClientTool - 100 messages successfully produced</pre><p>8. Check the information about topic2.</p><p><strong>Input</strong></p><pre>bin/pulsar-admin topics stats-internal topic2</pre><p><strong>Output</strong></p><p>The data stored on topic2 is saved in the ledger with ledgerId 7.</p><pre>{<br>  &quot;entriesAddedCounter&quot; : 100,<br>  &quot;numberOfEntries&quot; : 100,<br>  &quot;totalSize&quot; : 5400,<br>  &quot;currentLedgerEntries&quot; : 100,<br>  &quot;currentLedgerSize&quot; : 5400,<br>  &quot;lastLedgerCreatedTimestamp&quot; : &quot;2021-05-11T16:06:59.058+08:00&quot;,<br>  &quot;waitingCursorsCount&quot; : 0,<br>  &quot;pendingAddEntriesCount&quot; : 0,<br>  &quot;lastConfirmedEntry&quot; : &quot;7:99&quot;,<br>  &quot;state&quot; : &quot;LedgerOpened&quot;,<br>  &quot;ledgers&quot; : [ {<br>    &quot;ledgerId&quot; : 7,<br>    &quot;entries&quot; : 0,<br>    &quot;size&quot; : 0,<br>    &quot;offloaded&quot; : false<br>  } ],<br>  &quot;cursors&quot; : { },<br>  &quot;compactedLedger&quot; : {<br>    &quot;ledgerId&quot; : -1,<br>    &quot;entries&quot; : -1,<br>    &quot;size&quot; : -1,<br>    &quot;offloaded&quot; : false<br>  }<br>}</pre><p>9. Check where the ledger with ledgerId 7 is saved.</p><p><strong>Input</strong></p><pre>bin/bookkeeper shell ledgermetadata -ledgerid 7</pre><p><strong>Output</strong></p><p>The ledger with ledgerId 7 is saved on bookie1 (3181).</p><pre>...<br>16:11:28.843 [main] INFO  org.apache.bookkeeper.tools.cli.commands.client.LedgerMetaDataCommand - ledgerID: 7<br>16:11:28.846 [main] INFO  org.apache.bookkeeper.tools.cli.commands.client.LedgerMetaDataCommand - LedgerMetadata{formatVersion=3, ensembleSize=1, writeQuorumSize=1, ackQuorumSize=1, state=OPEN, digestType=CRC32C, password=base64:, ensembles={0=[192.168.0.105:3181]}, customMetadata={component=base64:bWFuYWdlZC1sZWRnZXI=, pulsar/managed-ledger=base64:cHVibGljL2RlZmF1bHQvcGVyc2lzdGVudC90b3BpYzM=, application=base64:cHVsc2Fy}}<br>...</pre><h3>Conclusion</h3><p>This is the second blog in the series on configuring isolation in Apache Pulsar. 
Now you should know how to:</p><ol><li>Deploy two separate Pulsar clusters</li><li>Verify data isolation of clusters</li><li>Synchronize and migrate data between clusters</li><li>Scale up and down nodes (brokers and bookies)</li></ol><p>The next blog will discuss how to configure Pulsar isolation in a shared BookKeeper cluster. Coming soon!</p><h3>Further reading</h3><p>If you’re interested in Pulsar isolation policy, feel free to check out the following resources!</p><ul><li>For beginners: <a href="https://pulsar.apache.org/docs/en/next/administration-isolation/">Pulsar Isolation Policy — User Guide</a></li><li>For advanced users: <a href="https://streamnative.io/en/blog/tech/2021-03-02-taking-an-in-depth-look-at-how-to-achieve-isolation-in-pulsar">Taking an In-Depth Look at How to Achieve Isolation in Pulsar</a></li></ul><h3>About the Author</h3><p><strong>Ran Gao</strong> is a software engineer at StreamNative. Before that, he was responsible for the development of the search service at Zhaopin.com. Prior to that, he worked on the development of the logistics system at JD Logistics. Being interested in open source and messaging systems, Ran is an Apache Pulsar committer. You can follow him on <a href="https://twitter.com/leon_ran_10">twitter</a>.</p><p><strong>Yu Liu</strong> is an Apache Pulsar committer and a content strategist from StreamNative. You can follow her on <a href="https://twitter.com/Anonymitaet1">twitter</a>.</p><p>This post was originally published on <a href="https://streamnative.io/blog">StreamNative blog</a>.</p><p><em>Like this post? Please recommend and/or share.</em></p><p><em>Want to learn more? See </em><a href="https://streamnative.io/blog/"><em>https://streamnative.io/blog</em></a><em>.</em> <em>Follow us </em><a href="https://medium.com/streamnative"><em>here</em></a><em> on Medium and check out our </em><a href="https://github.com/streamnative"><em>GitHub</em></a><em>.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=ffa790d36ea5" width="1" height="1" alt=""><hr><p><a href="https://medium.com/streamnative/pulsar-isolation-for-dummies-separate-pulsar-clusters-ffa790d36ea5">Pulsar Isolation for Dummies: Separate Pulsar Clusters</a> was originally published in <a href="https://medium.com/streamnative">StreamNative</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[What’s New in Apache Pulsar 2.6.4]]></title>
            <link>https://medium.com/streamnative/whats-new-in-apache-pulsar-2-6-4-af3ce95119ba?source=rss----ab76d1bbc527---4</link>
            <guid isPermaLink="false">https://medium.com/p/af3ce95119ba</guid>
            <category><![CDATA[release-notes]]></category>
            <category><![CDATA[tech-blog]]></category>
            <category><![CDATA[apache-pulsar]]></category>
            <category><![CDATA[bug-fixes]]></category>
            <category><![CDATA[enhancement]]></category>
            <dc:creator><![CDATA[Sijia-w]]></dc:creator>
            <pubDate>Thu, 10 Jun 2021 16:02:24 GMT</pubDate>
            <atom:updated>2021-06-10T16:02:23.089Z</atom:updated>
<content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/855/0*a9JpmUjwdRZJvlt-.jpg" /></figure><h3>What’s New in Apache Pulsar 2.6.4</h3><p>We are excited to see that the Apache Pulsar community has successfully released version 2.6.4! 10 contributors provided improvements and bug fixes across 16 PRs.</p><p>Highlights:</p><ul><li>Broker no longer delivers old messages after a topic is closed (<a href="https://github.com/apache/pulsar/pull/8634">#8634</a>)</li><li>AWS credentials are refreshed after expiry (<a href="https://github.com/apache/pulsar/pull/9387">#9387</a>)</li><li>Pulsar identifies when individual message deletes cause an unsynced cursor (<a href="https://github.com/apache/pulsar/pull/9732">#9732</a>)</li></ul><p>This blog walks through the most noteworthy changes. For the complete list including all enhancements and bug fixes, check out the <a href="http://pulsar.apache.org/release-notes/#264-mdash-2021-06-02-a-id264a">Pulsar 2.6.4 Release Notes</a>.</p><h3>Notable enhancement</h3><h4>C++ client</h4><p><strong>C++ client supports multiple topic subscriptions across multiple namespaces (</strong><a href="https://github.com/apache/pulsar/pull/9520"><strong>#9520</strong></a><strong>)</strong></p><p><strong>Issue</strong></p><p>Previously, you could not subscribe to different topics on different namespaces.</p><p><strong>Resolution</strong></p><ul><li>Moved the namespace check from MultiTopicsConsumerImpl to PatternMultiTopicsConsumerImpl, which uses a regex subscription.</li><li>Fixed the existing tests for subscriptions on topics across different namespaces.</li></ul><h3>Notable bug fixes</h3><h4>Broker</h4><p><strong>Pulsar guarantees security for clients using JWT (</strong><a href="https://github.com/apache/pulsar/pull/9172"><strong>#9172</strong></a><strong>)</strong></p><p><strong>Issue</strong></p><p>Previously, it was possible for attackers to connect to Pulsar instances because the signature of the JWT was not validated when the token algorithm was set to none.</p><p><strong>Resolution</strong></p><p>Modified the JWT handling to use parseClaimsJws instead of parse to get the token objects. Now, parseClaimsJws guarantees the correct security model for parsing signed JWTs.</p>
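<p>The idea behind the fix can be illustrated with a short sketch in the style of the jjwt (io.jsonwebtoken) library that Pulsar uses for token parsing. This is a simplified illustration rather than Pulsar’s actual code, and validationKey stands in for the broker’s configured token key:</p><pre>boolean isTokenValid(String token, java.security.Key validationKey) {<br>    try {<br>        // parseClaimsJws() accepts only signed JWTs and verifies the signature;<br>        // a plain parse() would also accept unsigned (&quot;alg&quot;: &quot;none&quot;) tokens<br>        Jwts.parserBuilder().setSigningKey(validationKey).build().parseClaimsJws(token);<br>        return true;<br>    } catch (JwtException e) {<br>        return false; // reject the connection<br>    }<br>}</pre>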
<p><strong>Pulsar identifies when individual message deletes cause an unsynced cursor (</strong><a href="https://github.com/apache/pulsar/pull/9732"><strong>#9732</strong></a><strong>)</strong></p><p><strong>Issue</strong></p><p>Previously, cursors were not being flushed when acknowledgements caused a dirty cursor. Instead of deleting the acknowledged messages, messages were redelivered.</p><p><strong>Resolution</strong></p><p>Fixed code to mark the individual acknowledgements and automatically trigger the flush of dirty cursors.</p><p><strong>Pulsar can expire a range of messages (</strong><a href="https://github.com/apache/pulsar/pull/9083"><strong>#9083</strong></a><strong>)</strong></p><p><strong>Issue</strong></p><p>Previously, only a single message expired after an expiry check. As a result, many expired messages remained in a subscription and were delivered to consumers after the expiry time.</p><p><strong>Resolution</strong></p><p>Modified OpFindNewest to jump to a valid position, which allows PersistentMessageExpiryMonitor to find the best range of messages to expire.</p><p><strong>Pulsar allows manual (forced) topic deletion after removing non-durable subscriptions (</strong><a href="https://github.com/apache/pulsar/pull/7356"><strong>#7356</strong></a><strong>)</strong></p><p><strong>Issue</strong></p><p>Previously, during the removal of non-durable subscriptions, there was a race condition that left a topic in a state where you could not delete it until it was unloaded or reloaded.</p><p><strong>Resolution</strong></p><p>Fixed the race condition by setting the topic fence before performing any delete operations and reverting the topic state after the delete operations.</p><p><strong>Broker no longer delivers old messages after a topic is closed (</strong><a href="https://github.com/apache/pulsar/pull/8634"><strong>#8634</strong></a><strong>)</strong></p><p><strong>Issue</strong></p><p>Previously, it was possible to re-deliver very old messages if a topic was not gracefully closed. The cursor rolled back to the last persisted position and triggered the re-delivery of those messages.</p><p><strong>Resolution</strong></p><p>Fixed the redelivery of messages by setting a time-bound period after which all cursor updates are flushed to disk.</p><p><strong>Batch index acknowledgement data is no longer persisted (</strong><a href="https://github.com/apache/pulsar/pull/9504"><strong>#9504</strong></a><strong>)</strong></p><p><strong>Issue</strong></p><p>Previously, the batch index acknowledgement data persisted because batchDeletedIndexInfoBuilder generated the batch index acknowledgement data but did not clear the current set before adding the delete set.</p><p><strong>Resolution</strong></p><p>Fixed by clearing the delete set before adding a new delete set.</p><p><strong>Closed ledgers are deleted after expiration (</strong><a href="https://github.com/apache/pulsar/pull/9136"><strong>#9136</strong></a><strong>)</strong></p><p><strong>Issue</strong></p><p>Previously, a closed ledger (with no incoming traffic) could fail to be deleted after expiring because the read position of the cursor still pointed to the last entry of the closed ledger.</p><p><strong>Resolution</strong></p><p>Updated the behavior when closing the current ledger. Now, when the cursor’s mark-delete position points to the last entry of the current ledger, the read position is moved to the newly created ledger.</p><h4>Tiered storage</h4><p><strong>AWS credentials are refreshed after expiry (</strong><a href="https://github.com/apache/pulsar/pull/9387"><strong>#9387</strong></a><strong>)</strong></p><p><strong>Issue</strong></p><p>Previously, expired AWS credentials were reused. With the refactor of Azure support, a regression occurred where the AWS credentials were fetched once and then used for the entire process.</p><p><strong>Resolution</strong></p><p>The AWS credential provider chain takes care of the credential refresh.
When integrating with JClouds, you still need to return a new set of credentials each time.</p><h4>Java client</h4><p><strong>Compression applied during schema preparation (</strong><a href="https://github.com/apache/pulsar/pull/9396"><strong>#9396</strong></a><strong>)</strong></p><p><strong>Issue</strong></p><p>Previously, compression was not applied during deferred schema preparation, and the consumer could receive an uncompressed message and then fail.</p><p><strong>Resolution</strong></p><p>Fixed by enforcing compression during the schema preparation.</p><h3>Get involved</h3><p>To get started, you can <a href="https://pulsar.apache.org/en/download/">download Pulsar directly</a>, or you can spin up a Pulsar cluster with a free 30-day trial of <a href="https://auth.streamnative.cloud/login?state=hKFo2SBVeG81YTFiSWUtdDhhQkgtd19LdWhWYm9jUng4NGpua6FupWxvZ2luo3RpZNkgVHh1bFN0bHozeEFpeDR5QlNGMnlWM19oUHpwcTlvSk2jY2lk2SA2ZXI3M3FLcTQycUIwd2JzcjFTT01hWWJhdTdLaGxldw&amp;client=6er73qKq42qB0wbsr1SOMaYbau7Khlew&amp;protocol=oauth2&amp;audience=https%3A%2F%2Fapi.streamnative.cloud&amp;redirect_uri=https%3A%2F%2Fconsole.streamnative.cloud%2Fcallback&amp;defaultMethod=singup&amp;scope=openid%20profile%20email%20offline_access&amp;response_type=code&amp;response_mode=query&amp;nonce=VDRWNG5rYVhpcWZJYTdOWlF4Q1BDeENxcFZKQlFneU9VYlllRzdTdXF4UQ%3D%3D&amp;code_challenge=W__xPbFyDLkHTgO8p7DmrT84cHkZC3RvLsr3iE438sQ&amp;code_challenge_method=S256&amp;auth0Client=eyJuYW1lIjoiYXV0aDAtc3BhLWpzIiwidmVyc2lvbiI6IjEuMTQuMCJ9">StreamNative Cloud</a>, which ships the Pulsar 2.6.4 changes! Moreover, we offer technical consulting and expert training to help get your organization started. As always, we are highly responsive to your feedback. Feel free to <a href="https://streamnative.io/en/contact">contact us</a> if you have any questions at any time. We look forward to hearing from you, and stay tuned for the next Pulsar release!</p><h3>About the Author</h3><p><strong>Yong Zhang</strong> is an Apache Pulsar committer. He works as a software engineer at StreamNative.</p><p><strong>Yu Liu</strong> is an Apache Pulsar committer and a content strategist from StreamNative. You can follow her on <a href="https://twitter.com/Anonymitaet1">twitter</a>.</p><p>This post was originally published on <a href="https://streamnative.io/blog">StreamNative blog</a>.</p><p><em>Like this post? Please recommend and/or share.</em></p><p><em>Want to learn more? See </em><a href="https://streamnative.io/blog/"><em>https://streamnative.io/blog</em></a><em>.</em> <em>Follow us </em><a href="https://medium.com/streamnative"><em>here</em></a><em> on Medium and check out our </em><a href="https://github.com/streamnative"><em>GitHub</em></a><em>.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=af3ce95119ba" width="1" height="1" alt=""><hr><p><a href="https://medium.com/streamnative/whats-new-in-apache-pulsar-2-6-4-af3ce95119ba">What’s New in Apache Pulsar 2.6.4</a> was originally published in <a href="https://medium.com/streamnative">StreamNative</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Function Mesh — Simplify Complex Streaming Jobs in Cloud]]></title>
            <link>https://medium.com/streamnative/function-mesh-simplify-complex-streaming-jobs-in-cloud-a9d95c19c371?source=rss----ab76d1bbc527---4</link>
            <guid isPermaLink="false">https://medium.com/p/a9d95c19c371</guid>
            <category><![CDATA[tech-blog]]></category>
            <category><![CDATA[open-source]]></category>
            <category><![CDATA[pulsar-functions]]></category>
            <category><![CDATA[apache-pulsar]]></category>
            <dc:creator><![CDATA[Sijia-w]]></dc:creator>
            <pubDate>Thu, 10 Jun 2021 16:02:24 GMT</pubDate>
            <atom:updated>2021-06-10T16:02:22.898Z</atom:updated>
<content:encoded><![CDATA[<h3>Function Mesh — Simplify Complex Streaming Jobs in Cloud</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/900/0*IW5YWRNh1vQca_-W.jpg" /></figure><p>Today, we are excited to introduce Function Mesh, a serverless framework purpose-built for event streaming applications. It brings powerful event-streaming capabilities to your applications by orchestrating multiple <a href="http://pulsar.apache.org/docs/en/next/functions-overview/">Pulsar Functions</a> and <a href="http://pulsar.apache.org/docs/en/next/io-overview/">Pulsar IO connectors</a> for complex event streaming jobs on Kubernetes.</p><h3>What is Function Mesh</h3><p>Function Mesh is a Kubernetes operator that enables users to run Pulsar Functions and connectors natively on Kubernetes, unlocking the full power of Kubernetes’ application deployment, scaling, and management. For example, Function Mesh relies on Kubernetes’ scheduling functionality, which ensures that functions are resilient to failures and can be scheduled properly at any time.</p><p>Function Mesh is also a serverless framework to orchestrate multiple Pulsar Functions and I/O connectors for complex streaming jobs in a simple way. If you’re seeking cloud-native serverless streaming solutions, Function Mesh is an ideal tool for you. The key benefits of Function Mesh include:</p><ul><li>Eases the management of Pulsar Functions and connectors when you run multiple instances of Functions and connectors together.</li><li>Utilizes the full power of the Kubernetes scheduler, including deployment, scaling, and management, to manage and scale Pulsar Functions and connectors.</li><li>Makes Pulsar Functions and connectors run natively in the cloud environment, which leads to greater possibilities when more resources become available in the cloud.</li><li>Enables Pulsar Functions to work with different messaging systems and to integrate with existing tools in the cloud environment (Function Mesh runs Pulsar Functions and connectors independently from Pulsar).</li></ul><p>Function Mesh is well-suited for common, lightweight streaming use cases, such as ETL jobs, and is not intended to be used as a full-power streaming engine.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/900/0*tvyw-mmjB39vbF__.png" /></figure><h3>Why Function Mesh</h3><p>Pulsar has included Pulsar Functions and Pulsar I/O since its 2.0 release.</p><p><a href="http://pulsar.apache.org/docs/en/next/functions-overview/">Pulsar Functions</a> is a turnkey serverless event streaming framework built natively for Apache Pulsar. Pulsar Functions enables users to create event processing logic on a per-message basis and brings simplicity and serverless concepts to event streaming, thus eliminating the need to deploy a separate system. Popular use cases of Pulsar Functions include ETL jobs, real-time aggregation, microservices, reactive services, event routing, and more.</p><p><a href="http://pulsar.apache.org/docs/en/next/io-overview/">Pulsar IO connector</a> is a framework that allows you to ingress or egress data from and to Pulsar using the existing Pulsar Functions framework. Pulsar IO consists of source and sink connectors. A source is an event processor that ingests data from an external system into Pulsar, and a sink is an event processor that egresses data from Pulsar to an external system.</p><p>Both Pulsar Functions and Pulsar I/O have made building event streaming applications simpler.
<p><a href="http://pulsar.apache.org/docs/en/next/io-overview/">Pulsar IO connectors</a> provide a framework that allows you to ingress or egress data from and to Pulsar using the existing Pulsar Functions framework. Pulsar IO consists of source and sink connectors. A source is an event processor that ingests data from an external system into Pulsar, and a sink is an event processor that egresses data from Pulsar to an external system.</p><p>Both Pulsar Functions and Pulsar I/O have made building event streaming applications simpler. Pulsar Functions supports running functions and connectors on Kubernetes. However, the existing implementation has a few drawbacks:</p><ol><li>The function metadata is stored in Pulsar and the function running state is managed by Kubernetes. This results in inconsistency between metadata and running state, which makes management complicated and error-prone. For example, the StatefulSet running Pulsar Functions can be deleted from Kubernetes while Pulsar isn’t aware of it.</li><li>The existing implementation uses Pulsar topics for storing function metadata. It can cause broker crash loops if the function metadata topics are temporarily unavailable.</li><li>Functions are tied to a specific Pulsar cluster, making it difficult to use functions across multiple Pulsar clusters.</li><li>The existing implementation makes it hard for users deploying Pulsar Functions on Kubernetes to implement certain features, such as auto-scaling.</li></ol><p>Additionally, with the increased adoption of Pulsar Functions and Pulsar I/O connectors for building serverless event streaming applications, users want to orchestrate multiple functions into a single streaming job to achieve complex event streaming capabilities. Without Function Mesh, organizing and managing multiple functions to process events requires a lot of manual work.</p><p>To solve these pain points and make Pulsar Functions Kubernetes-native, we developed Function Mesh: a serverless framework purpose-built for running Pulsar Functions and connectors natively on Kubernetes, and for simplifying the building of complex event streaming jobs.</p><h3>Core Concepts</h3><p>Function Mesh enables you to build event streaming applications leveraging your familiarity with Apache Pulsar and modern stream processing technologies. Three concepts are foundational to building an event streaming application: streams, functions, and connectors.</p><h4>Stream</h4><p>A stream is a partitioned, immutable, append-only sequence of events that represents a series of historical facts. For example, the events of a stream could model a sequence of financial transactions, like “Jack sent $100 to Alice”, followed by “Alice sent $50 to Bob”. A stream is used for connecting functions and connectors. The streams in Function Mesh are implemented using topics in Apache Pulsar.</p><h4>Function</h4><p>A function is a lightweight event processor that consumes messages from one or more input streams, applies user-supplied processing logic to one or more messages, and produces the results to another stream. The functions in Function Mesh are implemented based on Pulsar Functions.</p><h4>Connector</h4><p>A connector is an event processor that ingresses or egresses events from and to streams. There are two types of connectors in Function Mesh:</p><ul><li>Source Connector (aka Source): an event processor that ingests events from an external data system into a stream.</li><li>Sink Connector (aka Sink): an event processor that egresses events from streams to an external data system.</li></ul><p>The connectors in Function Mesh are implemented based on Pulsar IO connectors. The available Pulsar IO connectors can be found at <a href="https://hub.streamnative.io/">StreamNative Hub</a>.</p>
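<p>For illustration, here is a minimal sketch of a sink connector written against the Pulsar IO Java API; the class name and its logging behavior are hypothetical stand-ins for whatever external system you target:</p><pre>import java.util.Map;<br>import org.apache.pulsar.functions.api.Record;<br>import org.apache.pulsar.io.core.Sink;<br>import org.apache.pulsar.io.core.SinkContext;<br><br>// A hypothetical sink that egresses every event from a stream to an external system.<br>public class LoggingSink implements Sink&lt;String&gt; {<br>    @Override<br>    public void open(Map&lt;String, Object&gt; config, SinkContext context) {<br>        // Read connection settings for the external system from the connector config.<br>    }<br><br>    @Override<br>    public void write(Record&lt;String&gt; record) {<br>        System.out.println(record.getValue()); // deliver the event, then acknowledge it<br>        record.ack();<br>    }<br><br>    @Override<br>    public void close() {<br>        // Release any resources held by the external system client.<br>    }<br>}</pre>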
<h4>FunctionMesh</h4><p>A FunctionMesh (aka Mesh) is a collection (either a Directed Acyclic Graph (DAG) or a cyclic graph) of functions and connectors connected by streams that are orchestrated together to achieve powerful stream processing logic. All the functions and connectors in a Mesh share the same lifecycle: they are started when the Mesh is created and terminated when the Mesh is destroyed. All the functions and connectors are long-running processes that Function Mesh auto-scales based on the workload.</p><h3>How Function Mesh works</h3><p>Function Mesh APIs build on existing Kubernetes APIs, so Function Mesh resources are compatible with other Kubernetes-native resources and can be managed by cluster administrators using existing Kubernetes tools. The foundational concepts are delivered as Kubernetes Custom Resource Definitions (CRDs), which can be configured by a cluster administrator for developing event streaming applications.</p><p>Instead of using the Pulsar admin CLI tool to send function admin requests to Pulsar clusters, you can now use kubectl to submit a Function Mesh CRD manifest directly to Kubernetes clusters. The Function Mesh controller watches the CRD and creates Kubernetes resources to run the defined Function, Source, Sink, or Mesh. The benefit of this approach is that both the function metadata and the function running state are stored and managed directly by Kubernetes, avoiding the inconsistency problem seen in Pulsar’s existing approach.</p><p>The following diagram illustrates a typical user flow of Function Mesh.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/900/0*Jn5VY6tVOmLytv5D.png" /></figure><h4>Function Mesh Internals</h4><p>Function Mesh mainly consists of two components. One is a Kubernetes operator that watches Function Mesh CRDs and creates Kubernetes resources (e.g., StatefulSets) to run functions, connectors, and meshes on Kubernetes; the other is a Function Runner that invokes the function and connector logic when receiving events from input streams and produces the results to output streams. The Runner is currently implemented using the Pulsar Functions runner.</p><p>The diagram below illustrates the overall architecture of Function Mesh. When a user creates a Function Mesh CRD, the controller receives the submitted CRD from the Kubernetes API server. The controller processes the CRD and generates the corresponding Kubernetes resources. For example, when the controller processes the Function CRD, it creates a StatefulSet to run the function. Each pod of this function StatefulSet launches a Runner to invoke the function logic.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/900/0*blKmmVlpYgo-PnSR.png" /></figure><h3>How to use Function Mesh</h3><p>To use Function Mesh, you need to install the Function Mesh operator and CRDs into the Kubernetes cluster first. For more details, refer to the <a href="https://functionmesh.io/docs/install-function-mesh/">installation guide</a>.</p><p>After installing the Function Mesh operator and deploying a Pulsar cluster, you need to package your functions/connectors, define CRDs for functions, connectors, and Function Mesh, and then submit the CRDs to the Kubernetes cluster with the following command.</p><pre>$ kubectl apply -f /path/to/custom-crd.yaml</pre><p>Once your Kubernetes cluster receives the CRD, the Function Mesh operator will schedule the individual parts and run the functions as a StatefulSet along with other necessary resource objects.</p><p>Below, we illustrate how to run functions, connectors, and meshes with examples.</p><h4>How to run functions using Function Mesh</h4><p>Function Mesh does not change how you develop Pulsar Functions to run in the cloud. 
The submission process simply switches from the pulsar-admin client tool to a YAML file. Behind the scenes, we developed CRD resources for Pulsar Functions and a controller to handle them.</p><p>After developing and testing your function, you need to package it and then submit it to a Pulsar cluster, or build it as a Docker image and upload it to the image registry. For details, refer to <a href="https://functionmesh.io/docs/functions/run-function">run Pulsar Functions using Function Mesh</a>.</p><p>The following Function CRD example launches an ExclamationFunction inside Kubernetes with auto-scaling enabled, using the Java runtime to talk to the Pulsar messaging system.</p><pre>apiVersion: compute.functionmesh.io/v1alpha1<br>kind: Function<br>metadata:<br>  name: function-sample<br>  namespace: default<br>spec:<br>  className: org.apache.pulsar.functions.api.examples.ExclamationFunction<br>  replicas: 1<br>  maxReplicas: 5<br>  image: streamnative/function-mesh-example:latest<br>  logTopic: persistent://public/default/logging-function-logs<br>  input:<br>    topics:<br>    - persistent://public/default/source-topic<br>    typeClassName: java.lang.String<br>  output:<br>    topic: persistent://public/default/sink-topic<br>    typeClassName: java.lang.String<br>  resources:<br>    requests:<br>      cpu: &quot;0.1&quot;<br>      memory: 1G<br>    limits:<br>      cpu: &quot;0.2&quot;<br>      memory: 1.1G<br>  pulsar:<br>    pulsarConfig: &quot;test-pulsar&quot;<br>  java:<br>    jar: &quot;/pulsar/examples/api-examples.jar&quot;</pre><h4>How to run connectors using Function Mesh</h4><p>Sources and sinks are specialized functions. If you use Pulsar built-in or StreamNative-managed connectors, you can create them by specifying the Docker image in the source or sink CRDs. These Docker images are publicly available on Docker Hub, with image names in the format streamnative/pulsar-io-CONNECTOR-NAME:TAG, such as streamnative/pulsar-io-hbase:2.7.1. You can check all supported connectors in the <a href="https://hub.streamnative.io/">StreamNative Hub</a>.</p><p>If you use self-built connectors, you can package them as an external package or a Docker image, upload the package, and then submit the connectors through CRDs. 
For details, refer to <a href="https://functionmesh.io/docs/connectors/run-connector">run Pulsar connectors using Function Mesh</a>.</p><p>In the following CRD YAML files, the source connector captures changes from MongoDB using Debezium, and the sink connector writes the output to Elasticsearch.</p><p>Define the CRD YAML file for the source:</p><pre>apiVersion: compute.functionmesh.io/v1alpha1<br>kind: Source<br>metadata:<br>  name: source-sample<br>spec:<br>  image: streamnative/pulsar-io-debezium-mongodb:2.7.1<br>  className: org.apache.pulsar.io.debezium.mongodb.DebeziumMongoDbSource<br>  replicas: 1<br>  output:<br>    topic: persistent://public/default/destination<br>    typeClassName: org.apache.pulsar.common.schema.KeyValue<br>  sourceConfig:<br>    mongodb.hosts: rs0/mongo-dbz-0.mongo.default.svc.cluster.local:27017,rs0/mongo-dbz-1.mongo.default.svc.cluster.local:27017,rs0/mongo-dbz-2.mongo.default.svc.cluster.local:27017<br>    mongodb.name: dbserver1<br>    mongodb.user: debezium<br>    mongodb.password: dbz<br>    mongodb.task.id: &quot;1&quot;<br>    database.whitelist: inventory<br>    pulsar.service.url: pulsar://test-pulsar-broker.default.svc.cluster.local:6650<br>  pulsar:<br>    pulsarConfig: &quot;test-source&quot;<br>  java:<br>    jar: connectors/pulsar-io-debezium-mongodb-2.7.1.nar<br>    jarLocation: &quot;&quot; # use pulsar provided connectors</pre><p>Define the CRD YAML file for the sink:</p><pre>apiVersion: compute.functionmesh.io/v1alpha1<br>kind: Sink<br>metadata:<br>  name: sink-sample<br>spec:<br>  image: streamnative/pulsar-io-elastic-search:2.7.1<br>  className: org.apache.pulsar.io.elasticsearch.ElasticSearchSink<br>  replicas: 1<br>  input:<br>    topics:<br>    - persistent://public/default/input<br>    typeClassName: &quot;[B&quot;<br>  sinkConfig:<br>    elasticSearchUrl: &quot;http://quickstart-es-http.default.svc.cluster.local:9200&quot;<br>    indexName: &quot;my_index&quot;<br>    typeName: &quot;doc&quot;<br>    username: &quot;elastic&quot;<br>    password: &quot;X2Mq33FMWMnqlhvw598Z8562&quot;<br>  pulsar:<br>    pulsarConfig: &quot;test-sink&quot;<br>  java:<br>    jar: connectors/pulsar-io-elastic-search-2.7.1.nar<br>    jarLocation: &quot;&quot; # use pulsar provided connectors</pre><h4>How to Run Function Mesh on Kubernetes</h4><p>A FunctionMesh orchestrates functions, sources, and sinks together and manages them as a whole. The FunctionMesh CRD has a list of fields for functions, sources, and sinks, and you can connect them together through the topics field. Once the YAML file is submitted, the FunctionMesh controller will reconcile it into multiple function/source/sink resources and delegate each of them to the corresponding controllers. The function/source/sink controllers reconcile each task and launch the corresponding sub-components. 
The FunctionMesh controller collects the status of each component from the system and aggregates it into its own status field.</p><p>The following FunctionMesh job example launches two functions and streams the input through both functions to append exclamation marks.</p><pre>apiVersion: compute.functionmesh.io/v1alpha1<br>kind: FunctionMesh<br>metadata:<br>  name: mesh-sample<br>spec:<br>  functions:<br>    - name: ex1<br>      className: org.apache.pulsar.functions.api.examples.ExclamationFunction<br>      replicas: 1<br>      maxReplicas: 5<br>      input:<br>        topics:<br>          - persistent://public/default/source-topic<br>        typeClassName: java.lang.String<br>      output:<br>        topic: persistent://public/default/mid-topic<br>        typeClassName: java.lang.String<br>      pulsar:<br>        pulsarConfig: &quot;mesh-test-pulsar&quot;<br>      java:<br>        jar: pulsar-functions-api-examples.jar<br>        jarLocation: public/default/test<br>    - name: ex2<br>      className: org.apache.pulsar.functions.api.examples.ExclamationFunction<br>      replicas: 1<br>      maxReplicas: 3<br>      input:<br>        topics:<br>          - persistent://public/default/mid-topic<br>        typeClassName: java.lang.String<br>      output:<br>        topic: persistent://public/default/sink-topic<br>        typeClassName: java.lang.String<br>      pulsar:<br>        pulsarConfig: &quot;mesh-test-pulsar&quot;<br>      java:<br>        jar: pulsar-functions-api-examples.jar<br>        jarLocation: public/default/test</pre><p>The first function’s output topic is the second function’s input topic, so the first function publishes its results to that topic and the second function consumes them from it.</p><h4>Work with pulsar-admin CLI tool</h4><p>If you want to use Function Mesh without changing the way you create and submit functions, you can use the Function Mesh worker service. It is similar to the Pulsar Functions worker service but uses Function Mesh to schedule and run functions. The Function Mesh worker service enables you to use the <a href="https://pulsar.apache.org/docs/en/pulsar-admin/">pulsar-admin</a> CLI tool to manage Pulsar Functions and connectors in Function Mesh. The following figure illustrates how the Function Mesh worker service works with the Pulsar proxy to convert and forward requests to the Kubernetes cluster.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/900/0*zePXs58hAvRws8wL.png" /></figure><p>For details about the usage, you can refer to <a href="https://functionmesh.io/docs/install-function-mesh/#work-with-pulsar-admin-cli-tool">work with pulsar-admin CLI tool</a>.</p><h4>Migrate Pulsar Functions to Function Mesh</h4><p>If you run Pulsar Functions using the existing Kubernetes runtime and want to migrate them to Function Mesh, Function Mesh provides a tool that generates CRDs for your existing functions. You can then apply these CRDs to have Function Mesh take over ownership of the running Pulsar Functions on Kubernetes. 
For details, refer to the <a href="https://functionmesh.io/docs/migration/migrate-function">Pulsar Functions migration guide</a>.</p><h4>Supported Features</h4><p>Currently, Function Mesh supports the following features:</p><ul><li>Running Pulsar Functions and connectors natively in Kubernetes.</li><li>Orchestrating multiple Pulsar Functions and connectors as a streaming job.</li><li>Compatibility with the original Pulsar admin API for submitting functions and connectors.</li><li>Auto-scaling instances for functions and connectors using the Horizontal Pod Autoscaler.</li><li>Authentication and authorization.</li><li>Multiple runtimes with Java, Python, and Golang support.</li><li>Schema and SerDe.</li><li>Resource limits.</li></ul><h3>Future Plans</h3><p>We plan to enable the following features in upcoming releases. If you have any ideas or would like to contribute, feel free to contact us.</p><ul><li>Improve the capability level of the Function Mesh operator.</li><li>Feature parity with Pulsar Functions, such as stateful functions.</li><li>Support additional runtimes based on a self-contained function runtime, such as WebAssembly.</li><li>Develop better tools and frontends to manage and inspect Function Meshes.</li><li>Group individual functions together to improve latency and reduce cost.</li><li>Support advanced auto-scaling based on Pulsar metrics.</li><li>Integrate the function registry with <a href="http://pulsar.apache.org/docs/en/next/admin-api-packages/">Apache Pulsar Packages</a>.</li></ul><h3>Try Function Mesh Now</h3><p>Function Mesh is now <a href="https://github.com/streamnative/function-mesh">open source</a>. Try it on your Kubernetes clusters today!</p><p>To learn more about Function Mesh, <a href="https://functionmesh.io/docs/">read the docs</a> and <a href="https://youtu.be/u_9YDM44fMw">watch a live demo</a>.</p><p>If you have any feedback or suggestions for this project, feel free to <a href="mailto:function-mesh@streamnative.io">contact us</a> or open issues in the <a href="https://github.com/streamnative/function-mesh">GitHub repo</a>. Any feedback is highly appreciated.</p><h3>About the Author</h3><p><strong>Neng Lu</strong> is a staff software engineer at StreamNative, where he drives the development of Apache Pulsar and its integrations with the big data ecosystem. Before that, he was a senior software engineer at Twitter, where he was a core committer on the Heron project and the lead engineer for Heron development. He also worked on Twitter’s monitoring and key-value storage systems. Before joining Twitter, he earned a master’s degree from UCLA and a bachelor’s degree from Zhejiang University. You can follow him on <a href="https://twitter.com/nlu90">Twitter</a> or <a href="https://www.linkedin.com/in/neng-lu-002b4a27/">LinkedIn</a>.</p><p><strong>Rui Fu</strong> is a software engineer at StreamNative. Before joining StreamNative, he was a platform engineer at the Energy Internet Research Institute of Tsinghua University, where he led stream data processing and IoT platform development. Rui received his postgraduate degree from HKUST and an undergraduate degree from The University of Sheffield. You can follow him on <a href="https://www.linkedin.com/in/ruifu1/">LinkedIn</a>.</p><p>This post was originally published on <a href="https://streamnative.io/blog">StreamNative blog</a>.</p><p><em>Like this post? Please recommend and/or share.</em></p><p><em>Want to learn more? 
See </em><a href="https://streamnative.io/blog/"><em>https://streamnative.io/blog</em></a><em>.</em> <em>Follow us </em><a href="https://medium.com/streamnative"><em>here</em></a><em> on Medium and check out our </em><a href="https://github.com/streamnative"><em>GitHub</em></a><em>.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=a9d95c19c371" width="1" height="1" alt=""><hr><p><a href="https://medium.com/streamnative/function-mesh-simplify-complex-streaming-jobs-in-cloud-a9d95c19c371">Function Mesh — Simplify Complex Streaming Jobs in Cloud</a> was originally published in <a href="https://medium.com/streamnative">StreamNative</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Building Connectors On Pulsar Made Simple]]></title>
            <link>https://medium.com/streamnative/building-connectors-on-pulsar-made-simple-af534685fd54?source=rss----ab76d1bbc527---4</link>
            <guid isPermaLink="false">https://medium.com/p/af534685fd54</guid>
            <category><![CDATA[simple]]></category>
            <category><![CDATA[tech-blog]]></category>
            <category><![CDATA[steamnative-hub]]></category>
            <category><![CDATA[apache-pulsar]]></category>
            <category><![CDATA[connector]]></category>
            <dc:creator><![CDATA[Sijia-w]]></dc:creator>
            <pubDate>Fri, 04 Jun 2021 16:01:10 GMT</pubDate>
            <atom:updated>2021-06-04T16:01:08.788Z</atom:updated>
<content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/855/0*GLszlVjWown_8Nuu.jpg" /></figure><p>New updates in StreamNative Hub make developing and using Pulsar connectors even easier! You can also expect more connectors to be deployed and adopted across more cloud providers, with GUI tools!</p><h3>Why Pulsar Connectors?</h3><p><a href="https://pulsar.apache.org/docs/en/next/io-overview/">Pulsar connectors</a> enable Pulsar to quickly and easily integrate with various external systems. In fact, according to the 2021 Pulsar User Survey Report (which will be published later this month), connectors are one of the most-used Pulsar features, with 30% of Pulsar users relying on them.</p><p>To facilitate connector development and improve ease of use, we launched <a href="https://streamnative.io/en/blog/tech/2020-05-26-intro-to-hub">StreamNative Hub</a> in 2020 to provide a single place to find, download, use, store, and share Pulsar-related extensions, and to offer a broad spectrum of Pulsar integrations. Since its launch last year, dozens of connectors have been created and added to the Hub. Some popular Pulsar plugins on StreamNative Hub include the <a href="https://streamnative.io/en/blog/tech/2021-03-17-announcing-aws-sqs-connector-for-apache-pulsar">AWS SQS connector</a>, <a href="https://github.com/streamnative/pulsar-io-aws-lambda">AWS Lambda connector</a>, <a href="https://streamnative.io/en/blog/tech/2021-04-26-announcing-amqp10-connector-for-apache-pulsar">AMQP1_0 connector</a>, <a href="https://github.com/streamnative/pulsar-io-iotdb">IoTDB connector</a>, and more.</p><p>In this blog, we introduce recent updates that make developing and using Pulsar connectors even easier.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*UV9D8N2-44aeuL6z.png" /></figure><h3>About StreamNative Hub</h3><p><a href="https://streamnative.io/en/blog/tech/2020-05-26-intro-to-hub">StreamNative Hub</a> is an app store for developing event streaming applications and provides dozens of plugins and integrations. Its key components include:</p><ul><li>Connectors: Allow you to move streaming data in and out of Pulsar, which simplifies integration for enterprises bringing Pulsar into their existing infrastructure. All Pulsar built-in connectors are shipped in the StreamNative Hub.</li><li>Offloaders: Allow you to offload the majority of the data from BookKeeper to external remote storage, which provides a cheaper form of storage that readily scales with the volume of data.</li><li>Protocol handlers: Allow you to support other messaging protocols natively and dynamically in Pulsar brokers at runtime, which streamlines operations with Pulsar’s enterprise-grade features without modifying code. Kafka, AMQP, and MQTT are supported.</li></ul><p>As more and more members have contributed and used connectors, we’ve identified some opportunities to improve the Hub’s ease of use. Read on to learn more.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*GdsbTtFx9i1UxLLa.png" /></figure><h3>New Pulsar Connector Development Guide</h3><p>To simplify the integration between Pulsar and external systems, we created a new development guide, the <a href="https://github.com/streamnative/pulsar-io-template/blob/master/README.md">Pulsar Connector Development Guide</a>, that developers can reference to improve productivity and boost efficiency when developing a connector. 
This guide helps with the following:</p><ul><li><strong>Developing a new connector</strong>: If you need to pipe data in or out of Pulsar and other systems that do not have a connector yet, you can read the <a href="https://github.com/streamnative/pulsar-io-template/blob/master/README.md">Pulsar Connector Development Guide</a>. It contains step-by-step guidelines for how to develop and contribute a connector to StreamNative Hub, including detailed instructions and various templates for both code and documentation.</li><li><strong>Promoting awareness and usage of an existing connector</strong>: If you already developed a connector and want to make it available to the community, we recommend you host it in a public repository and show it on StreamNative Hub. You can host the connector repo at your desired location and then sync the documentation to StreamNative Hub using a simple script with just one line of code by following the instructions in the <a href="https://github.com/streamnative/pulsar-io-template/blob/master/README.md">Pulsar Connector Development Guide</a>.</li></ul><h3>Future StreamNative Hub Upgrades</h3><p>We are continuously looking for new ways to improve StreamNative Hub, and we are working on additional upgrades, such as adding more comprehensive tests to improve the usability, reliability, and performance of connectors. You can also expect more connectors to be deployed and adopted across more cloud providers, with GUI tools. Stay tuned!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*Sn9SuQpnSf6LlwsZ.png" /></figure><h3>Contribute Your Connector</h3><p>If you develop connectors, we encourage you to add your connector to StreamNative Hub! In StreamNative Hub, your connector will get exposure to the widest possible audience and benefit from faster innovation and development cycles. You will also be contributing to a robust Pulsar ecosystem.</p><h3>Get Involved in the Pulsar Community</h3><p>In addition to adding a connector, there are more ways you can contribute, including:</p><ul><li>Improve documentation! The documentation hosted at StreamNative Hub is open source. Feel free to submit or request changes (fix typos, add clarifications, and more).</li><li>Report bugs.</li><li>Review pull requests.</li><li>Provide feedback on proposed features, enhancements, or designs.</li><li>Suggest new features.</li><li>Answer questions in issues or channels.</li></ul><h3>Ready to Get Started?</h3><p>Start your journey with connectors now with the <a href="https://github.com/streamnative/pulsar-io-template/blob/master/README.md">Quick Start Guide!</a></p><p>Happy Connectoring!</p><h3>About the Author</h3><p><strong>Guangning E</strong> is an Apache Pulsar committer and the main contributor to Apache Pulsar IO and Apache Pulsar Manager. He works as a senior software engineer at StreamNative, where he specializes in cloud platforms, cloud computing, and big data.</p><p><strong>Yu Liu</strong> is an Apache Pulsar committer and a content strategist at StreamNative. You can follow her on <a href="https://twitter.com/Anonymitaet1">Twitter</a>.</p><p>This post was originally published on <a href="https://streamnative.io/blog">StreamNative blog</a>.</p><p><em>Like this post? Please recommend and/or share.</em></p><p><em>Want to learn more? 
See </em><a href="https://streamnative.io/blog/"><em>https://streamnative.io/blog</em></a><em>.</em> <em>Follow us </em><a href="https://medium.com/streamnative"><em>here</em></a><em> on Medium and check out our </em><a href="https://github.com/streamnative"><em>GitHub</em></a><em>.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=af534685fd54" width="1" height="1" alt=""><hr><p><a href="https://medium.com/streamnative/building-connectors-on-pulsar-made-simple-af534685fd54">Building Connectors On Pulsar Made Simple</a> was originally published in <a href="https://medium.com/streamnative">StreamNative</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[What’s New in Apache Pulsar 2.7.2]]></title>
            <link>https://medium.com/streamnative/whats-new-in-apache-pulsar-2-7-2-6cbf5fcb6a16?source=rss----ab76d1bbc527---4</link>
            <guid isPermaLink="false">https://medium.com/p/6cbf5fcb6a16</guid>
            <category><![CDATA[improvement]]></category>
            <category><![CDATA[bug-fixes]]></category>
            <category><![CDATA[tech-blog]]></category>
            <category><![CDATA[apache-pulsar]]></category>
            <category><![CDATA[release-notes]]></category>
            <dc:creator><![CDATA[Sijia-w]]></dc:creator>
            <pubDate>Thu, 27 May 2021 16:03:51 GMT</pubDate>
            <atom:updated>2021-05-27T16:03:49.513Z</atom:updated>
<content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/855/0*QsCgZDj1ds8sVofy.jpg" /></figure><p>We are excited to see that the Apache Pulsar community has successfully released version 2.7.2! More than 38 contributors provided improvements and bug fixes across 85 commits.</p><p>Highlights of this release include:</p><ul><li>Consumers are no longer blocked after receiving multiple retry messages in Docker environments.</li><li>Messages in non-persistent topics are now correctly dispatched to consumers when using the Key_Shared subscription type.</li></ul><p>This blog walks through the most noteworthy changes grouped by key functionality. For the complete list, including all enhancements and bug fixes, check out the <a href="https://pulsar.apache.org/en/release-notes/">Pulsar 2.7.2 Release Notes</a>.</p><h3>Notable bug fixes and enhancements</h3><p>Pulsar 2.7.2 includes the following changes for the broker, bookie, proxy, Pulsar admin, Pulsar SQL, and clients.</p><h4>Broker</h4><ul><li>Fix NPEs and thread safety issues in PersistentReplicator. <a href="https://github.com/apache/pulsar/pull/9763">PR-9763</a><br>Previously, the cursor field in PersistentReplicator was updated asynchronously in another thread without adequate safeguards, which could cause NullPointerExceptions (NPEs). This PR makes the following changes:<br>· Make the cursor field volatile since the field is updated asynchronously in another thread.<br>· Remove the synchronization on the openCursorAsync method since it is not needed.<br>· Add null checks before accessing the cursor field since statistics might be updated before the cursor is available.</li><li>Fix the issue of a message not dispatched for the Key_Shared subscription type in a non-persistent topic. <a href="https://github.com/apache/pulsar/pull/9826">PR-9826</a><br>Previously, in a non-persistent topic with a Key_Shared subscription, messages were marked as published in the topic stats, but consumers did not consume them. This PR fixes this issue.</li><li>Fix the issue of a consumer being blocked after receiving retry messages. <a href="https://github.com/apache/pulsar/pull/10078">PR-10078</a><br>Previously, in the Docker environment, if a consumer enabled the retry feature and set the retry topic in DeadLetterPolicy, the consumer was blocked after receiving multiple retry messages because the hasMessageAvailable check was set to false. This PR fixes this issue. (See the sketch after this list for a typical retry/dead-letter consumer setup.)</li><li>Fix the issue of schema not added when subscribing to an empty topic without schema. <a href="https://github.com/apache/pulsar/pull/9853">PR-9853</a><br>Previously, when a consumer with a schema subscribed to an empty topic without a schema, the check used isActive, which only checked whether the topic could be deleted. However, it should have checked whether the topic had any connected producer or consumer. With the previous implementation, even if a topic had no active producers or consumers, the topic&#39;s subscription list was not empty, so isActive returned true; the consumer&#39;s schema was then not attached to the topic and an IncompatibleSchemaException was thrown. This PR changes the check to whether the topic has active producers or consumers instead of whether it can be deleted.</li><li>Fix the issue of schema type check when using the ALWAYS_COMPATIBLE strategy. <a href="https://github.com/apache/pulsar/pull/10367">PR-10367</a><br>This PR provides the following enhancements when using the ALWAYS_COMPATIBLE strategy for schema type checks:<br>· For a non-transitive strategy, it checks only the schema type of the last schema.<br>· For a transitive strategy, it checks all schema types.<br>· For getting a schema by schema data, it considers different schema types.</li><li>Fix the issue of 100% CPU usage when deleting a namespace. <a href="https://github.com/apache/pulsar/pull/10337">PR-10337</a><br>Previously, when deleting a namespace, the namespace Policies setting was marked as deleted, triggering the topic&#39;s onPoliciesUpdate and a read of ZooKeeper’s Policies node via checkReplicationAndRetryOnFailure. Because the namespace was deleted, the ZooKeeper node no longer existed, and the failed read triggered infinite retries. This PR fixes this issue by adding a method to check for non-deleted policies.</li></ul>
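<p>For context on the retry-message fix above, here is a minimal sketch of how a Java consumer typically enables the retry feature together with a DeadLetterPolicy; the service URL, topic, and subscription names are illustrative assumptions:</p><pre>import java.util.concurrent.TimeUnit;<br>import org.apache.pulsar.client.api.*;<br><br>public class RetryConsumerExample {<br>    public static void main(String[] args) throws Exception {<br>        PulsarClient client = PulsarClient.builder()<br>                .serviceUrl(&quot;pulsar://localhost:6650&quot;) // assumed local broker<br>                .build();<br><br>        // Failed messages are routed to the retry topic, and to the dead<br>        // letter topic after the maximum number of redeliveries.<br>        Consumer&lt;String&gt; consumer = client.newConsumer(Schema.STRING)<br>                .topic(&quot;persistent://public/default/input&quot;)<br>                .subscriptionName(&quot;retry-sub&quot;)<br>                .enableRetry(true)<br>                .deadLetterPolicy(DeadLetterPolicy.builder()<br>                        .maxRedeliverCount(3)<br>                        .retryLetterTopic(&quot;persistent://public/default/retry-topic&quot;)<br>                        .deadLetterTopic(&quot;persistent://public/default/dlq-topic&quot;)<br>                        .build())<br>                .subscribe();<br><br>        Message&lt;String&gt; msg = consumer.receive();<br>        try {<br>            // process the message, then acknowledge it<br>            consumer.acknowledge(msg);<br>        } catch (Exception e) {<br>            // re-deliver the message via the retry topic after a 10-second delay<br>            consumer.reconsumeLater(msg, 10, TimeUnit.SECONDS);<br>        }<br>    }<br>}</pre>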
<a href="https://github.com/apache/pulsar/pull/10367">PR-10367</a><br>This PR provides the following enhancements when using the ALWAYS_COMPATIBLE strategy for schema type check:<br>· For non-transitive strategy, it checks only schema type for the last schema.<br>· For transitive strategy, it checks all schema types.<br>· For getting schema by schema data, it considers different schema types.</li><li>Fix the issue of CPU 100% usage when deleting namespace. <a href="https://github.com/apache/pulsar/pull/10337">PR-10337</a><br>Previously, When deleting a namespace, the namespace Policies setting was marked as deleted, triggering the topic&#39;s onPoliciesUpdate and a read of the data of ZooKeeper’s Policies node as checkReplicationAndRetryOnFailure. Because the namespace was deleted, the ZooKeeper node no longer existed and the failure to read data triggered infinite retries. This PR fixes this issue by adding a method to check for non-deleted policies.</li></ul><h4>Bookie</h4><ul><li>Fallback to PULSAR_GC if BOOKIE_GC is not defined. <a href="https://github.com/apache/pulsar/pull/9621">PR-9621</a><br>This PR changes fallback from PULSAR_MEM to PULSAR_GC if BOOKIE_GC is not defined.</li><li>Fallback to PULSAR_EXTRA_OPTS if BOOKIE_EXTRA_OPTS is not defined. <a href="https://github.com/apache/pulsar/pull/10397">PR-10397</a><br>This PR defines that -Dio.netty.* does not pass the system properties if PULSAR_EXTRA_OPTS or BOOKIE_EXTRA_OPTS is set. This change ensures consistency with PULSAR_EXTRA_OPTS behavior and prevents duplicate properties. This PR also adds -Dio.netty.leakDetectionLevel=disabled (unless BOOKIE_EXTRA_OPTS is set) since PULSAR_EXTRA_OPTS does not include that setting by default.</li></ul><h4>Proxy</h4><ul><li>Fix authorization error while using proxy and Prefix subscription authentication mode. <a href="https://github.com/apache/pulsar/pull/10226">PR-10226</a><br>Previously, when using Pulsar proxy and Prefix subscription authentication mode, org.apache.pulsar.broker.authorization.PulsarAuthorizationProvider#canConsumeAsync threw an exception, which caused the consumer error. This PR updates the org.apache.pulsar.broker.authorization.PulsarAuthorizationProvider#allowTopicOperationAsync logic, checks isSuperUser first, and then returns isAuthorizedFuture.</li></ul><h4>Pulsar admin</h4><ul><li>Add get version command for Pulsar REST API, pulsar-admin, and pulsar-client. <a href="https://github.com/apache/pulsar/pull/9975">PR-9975</a></li></ul><h4>Pulsar SQL</h4><ul><li>Fix the issue of BKNoSuchLedgerExistsException. <a href="https://github.com/apache/pulsar/pull/9910">PR-9910</a><br>Previously, when using Pulsar SQL to query messages, BKNoSuchLedgerExistsException was thrown if the ZooKeeper ledger root directory was changed. This PR fixes this issue.</li></ul><h4>Client</h4><p>Pulsar 2.7.2 includes the following changes for Java, Python, C++, and WebSocket clients.</p><p><strong>Java</strong></p><ul><li>Fix the issue that ClientConfigurationData’s objects are not equal. <a href="https://github.com/apache/pulsar/pull/10091">PR-10091</a><br>This PR fixes this issue and reuses AuthenticationDisabled.INSTANCE as default instead of creating a new one.</li><li>Fix the issue of AutoConsumeSchema KeyValue encoding. <a href="https://github.com/apache/pulsar/pull/10089">PR-10089</a><br>This PR keeps the KeyValueEncodingType when auto-consuming a KeyValue schema.</li><li>Fix the error of OutOfMemoryError while using KeyValue&lt;GenericRecord, GenericRecord&gt;. 
<a href="https://github.com/apache/pulsar/pull/9981">PR-9981</a><br>Previously, a topic with schema KeyValue&lt;GenericRecord, GenericRecord&gt; could not be consumed due to a problem inHttpLookupService. The HttpLookupService downloaded the schema in JSON format but the KeyValue schema was expected to be encoded in binary form. This PR uses the existing utility functions to convert the JSON representation of the KeyValue schema to the desired format.</li><li>Fix the concurrency issue in the client’s producer epoch handling. <a href="https://github.com/apache/pulsar/pull/10436">PR-10436</a><br>This PR uses a volatile field for epoch and AtomicLongFieldUpdater for incrementing the value.</li><li>Handle NPE while receiving ack for a closed producer. <a href="https://github.com/apache/pulsar/pull/8979">PR-8979</a></li><li>Fix the issue of batch size not set when deserializing from a byte array. <a href="https://github.com/apache/pulsar/pull/9855">PR-9855</a><br>Previously, batch index message acknowledgment was added to the seek method to support more precise seek using ACK sets. However, when the seek was performed by a message that was serialized and deserialized, the batchSize was set to zero, which led to a discrepancy between messageId forms and seek results. This PR fixes this issue.</li><li>Fix the issue of a single-topic consumer being unable to close. <a href="https://github.com/apache/pulsar/pull/9849">PR-9849</a></li></ul><p><strong>Python</strong></p><ul><li>Support setting the default value when using Python Avro Schema. <a href="https://github.com/apache/pulsar/pull/10265">PR-10265</a><br>Previously, the default value for the Python Avro schema could not be set, causing the Python schema to not be updated. This PR fixes this issue and adds the following changes:<br>· Add the required field to control the type of schema that can set null.<br>· Add the required_default field to control the schema whether it has a default attribute or not.<br>· Add the default field to control the default value of the schema.</li><li>Fix the issue of nested Map or Array in schema does not work. <a href="https://github.com/apache/pulsar/pull/9548">PR-9548</a><br>Previously, the Python client did not handle nested Map or Array well, and the generated schema string was invalid. When the Map/Array&#39;s schema() method set the values field of the schema string, it ignored the Record type but not Map and Array. This PR fixes the issue and adds 4 tests for Map&lt;Map&gt;, Map&lt;Array&gt;, Array&lt;Array&gt;, and Array&lt;Map&gt; to cover all nested cases that involve Map or Array.</li><li>Add TLS SNI support for Python and C++ clients. <a href="https://github.com/apache/pulsar/pull/8957">PR-8957</a><br>This PR adds TLS SNI support for CPP and Python clients, so you can connect to brokers through the proxy.</li></ul><p><strong>C++</strong></p><ul><li>Fix the issue that the C++ client cannot be built on Windows. <a href="https://github.com/apache/pulsar/pull/10363">PR-10363</a><br>This PR puts PULSAR_PUBLIC before the variable type and keeps the LIB_NAME as the shared library&#39;s name (for example, removing the dll suffix).</li><li>Fix the issue of the paused zero queue consumer pre-fetches messages. <a href="https://github.com/apache/pulsar/pull/10036">PR-10036</a><br>Previously, zero queue consumers (the consumer’s receiver queue size is 0) pre-fetched messages after pauseMessageListener was called. 
<p><strong>Python</strong></p><ul><li>Support setting the default value when using the Python Avro schema. <a href="https://github.com/apache/pulsar/pull/10265">PR-10265</a><br>Previously, the default value for the Python Avro schema could not be set, causing the Python schema to not be updated. This PR fixes this issue and adds the following changes:<br>· Add the required field to control whether the schema type can be set to null.<br>· Add the required_default field to control whether the schema has a default attribute.<br>· Add the default field to control the default value of the schema.</li><li>Fix the issue that nested Map or Array schemas do not work. <a href="https://github.com/apache/pulsar/pull/9548">PR-9548</a><br>Previously, the Python client did not handle nested Map or Array types well, and the generated schema string was invalid: when the Map/Array&#39;s schema() method set the values field of the schema string, it handled the Record type but not nested Map and Array types. This PR fixes the issue and adds 4 tests for Map&lt;Map&gt;, Map&lt;Array&gt;, Array&lt;Array&gt;, and Array&lt;Map&gt; to cover all nested cases that involve Map or Array.</li><li>Add TLS SNI support for Python and C++ clients. <a href="https://github.com/apache/pulsar/pull/8957">PR-8957</a><br>This PR adds TLS SNI support for the C++ and Python clients, so you can connect to brokers through the proxy.</li></ul><p><strong>C++</strong></p><ul><li>Fix the issue that the C++ client cannot be built on Windows. <a href="https://github.com/apache/pulsar/pull/10363">PR-10363</a><br>This PR puts PULSAR_PUBLIC before the variable type and keeps LIB_NAME as the shared library&#39;s name (for example, removing the dll suffix).</li><li>Fix the issue that a paused zero-queue consumer pre-fetches messages. <a href="https://github.com/apache/pulsar/pull/10036">PR-10036</a><br>Previously, zero-queue consumers (consumers whose receiver queue size is 0) pre-fetched messages after pauseMessageListener was called. This was because ConsumerImpl::increaseAvailablePermits did not check the boolean variable messageListenerRunning_, which became false after pauseMessageListener was called. Therefore, after the zero-queue consumer was paused, it still sent the FLOW command to pre-fetch a message into its internal unbounded queue incomingMessages_. This PR fixes this issue and makes the following changes:<br>· Add a check for messageListenerRunning_ in the increaseAvailablePermits method, making the implementation consistent with the Java client&#39;s ConsumerImpl#increaseAvailablePermits, and change the type of availablePermits_ to std::atomic_int.<br>· Invoke increaseAvailablePermits in resumeMessageListener to send the FLOW command after the consumer resumes, since pauseMessageListener no longer pre-fetches messages.</li><li>Fix the issue of a segmentation fault when getting a topic name from the received message ID. <a href="https://github.com/apache/pulsar/pull/10006">PR-10006</a><br>Previously, the C++ client supported getting a topic name from both the received message and its message ID. However, for a consumer subscribed to a non-partitioned topic, getting a topic name from the received message ID caused a segmentation fault. This PR uses setTopicName for every single message when a consumer receives a batch and adds related tests for all types of consumers (including ConsumerImpl, MultiTopicsConsumerImpl, and PartitionedConsumerImpl).</li><li>Fix the issue of the SinglePartitionMessageRouter always picking the same partition. <a href="https://github.com/apache/pulsar/pull/9702">PR-9702</a><br>Previously, the SinglePartitionMessageRouter was supposed to pick a random partition for a given producer and stick with it. The problem was that the C rand() call always used the seed 0, so multiple processes always deterministically picked the same partition. This PR fixes this issue.</li><li>Reduce the log level for the ack-grouping tracker. <a href="https://github.com/apache/pulsar/pull/10094">PR-10094</a><br>Previously, a warning was logged when the ACK grouping tracker tried to send ACKs while the connection was closed. This PR changes the log level to debug when the connection is not ready in AckGroupingTrackerEnabled::flush.</li></ul><p><strong>WebSocket</strong></p><ul><li>Optimize the URL token param value. <a href="https://github.com/apache/pulsar/pull/10187">PR-10187</a><br>This PR removes the Bearer prefix requirement for the token param value of the WebSocket URL.</li><li>Make the browser client support token authentication. <a href="https://github.com/apache/pulsar/pull/9886">PR-9886</a><br>Previously, the WebSocket client used the HTTP request header to transport the authentication params, but the browser JavaScript WebSocket client could not add new headers. This PR uses the query param token to transport the authentication token for the browser JavaScript WebSocket client.</li></ul><h4>Function and connector</h4><ul><li>Allow customizable function logging. <a href="https://github.com/apache/pulsar/pull/10389">PR-10389</a><br>Previously, the function log configuration was in the jar package and could not be dynamically customized. This PR moves the function log configuration file to the configuration directory, where it can be customized.</li><li>Pass through record properties from Pulsar sources. <a href="https://github.com/apache/pulsar/pull/9943">PR-9943</a></li><li>Fix the issue of the time unit in Pulsar Go functions. 
<a href="https://github.com/apache/pulsar/pull/10160">PR-10160</a><br>This PR changes the time unit of avg process latency from ns to ms.</li><li>Fix the issue that the Kinesis sink did not try to resend messages. <a href="https://github.com/apache/pulsar/pull/10420">PR-10420</a><br>Previously, when the Kinesis sink connector failed to send a message, it did not retry. In this case, if retainOrdering was enabled, it would lead to subsequent messages not being sent. This PR adds retry logic for the Kinesis sink connector. A message is retried to send if it fails to send.</li><li>Fix the issue of null error messages in the onFailure exception in the Kinesis sink. <a href="https://github.com/apache/pulsar/pull/10416">PR-10416</a><br>Previously, if the Kinesis producer failed to send a message, the error message in the onFailure exception was null. This PR extracts the UserRecordFailedException to show the real error messages.</li></ul><h4>Tiered storage</h4><ul><li>Prevent class loader leak and restore offloader directory override. <a href="https://github.com/apache/pulsar/pull/9878">PR-9878</a><br>Previously, there was a class loader leak. This PR updates the PulsarService and the PulsarConnectorCache classes to use a map from directory strings to offloaders.</li><li>Add logs for cleanup of offloaded data operation. <a href="https://github.com/apache/pulsar/pull/9852">PR-9852</a><br>Previously, the cleanup offloaded data operation lacked logs making it hard for users to analyze the reason for the tiered storage data loss. This PR adds some logs for the cleanup of offloaded data operation.</li></ul><h3>Get involved</h3><p>To get started, you can <a href="https://pulsar.apache.org/en/download/">download Pulsar</a> directly or you can spin up a Pulsar cluster on StreamNative Cloud with a free 30-day trial of <a href="https://auth.streamnative.cloud/login?state=hKFo2SBVeG81YTFiSWUtdDhhQkgtd19LdWhWYm9jUng4NGpua6FupWxvZ2luo3RpZNkgVHh1bFN0bHozeEFpeDR5QlNGMnlWM19oUHpwcTlvSk2jY2lk2SA2ZXI3M3FLcTQycUIwd2JzcjFTT01hWWJhdTdLaGxldw&amp;client=6er73qKq42qB0wbsr1SOMaYbau7Khlew&amp;protocol=oauth2&amp;audience=https%3A%2F%2Fapi.streamnative.cloud&amp;redirect_uri=https%3A%2F%2Fconsole.streamnative.cloud%2Fcallback&amp;defaultMethod=singup&amp;scope=openid%20profile%20email%20offline_access&amp;response_type=code&amp;response_mode=query&amp;nonce=VDRWNG5rYVhpcWZJYTdOWlF4Q1BDeENxcFZKQlFneU9VYlllRzdTdXF4UQ%3D%3D&amp;code_challenge=W__xPbFyDLkHTgO8p7DmrT84cHkZC3RvLsr3iE438sQ&amp;code_challenge_method=S256&amp;auth0Client=eyJuYW1lIjoiYXV0aDAtc3BhLWpzIiwidmVyc2lvbiI6IjEuMTQuMCJ9">StreamNative Cloud</a> in which Pulsar 2.7.2 changes are shipped! Moreover, we offer technical consulting and expert training to help get your organization started. As always, we are highly responsive to your feedback. Feel free to <a href="https://streamnative.io/en/contact">contact us</a> if you have any questions at any time. Look forward to hearing from you and stay tuned for the next Pulsar release!</p><h3>About the Author</h3><p><strong>Yong Zhang</strong> is an Apache Pulsar committer. He works as a software engineer at StreamNative.</p><p><strong>Yu Liu</strong> is an Apache Pulsar committer and a technical writer from StreamNative. You can follow her on <a href="https://twitter.com/Anonymitaet1">twitter</a>.</p><p>This post was originally published on <a href="https://streamnative.io/blog">StreamNative blog</a>.</p><p><em>Like this post? Please recommend and/or share.</em></p><p><em>Want to learn more? 
See </em><a href="https://streamnative.io/blog/"><em>https://streamnative.io/blog</em></a><em>.</em> <em>Follow us </em><a href="https://medium.com/streamnative"><em>here</em></a><em> on Medium and check out our </em><a href="https://github.com/streamnative"><em>GitHub</em></a><em>.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=6cbf5fcb6a16" width="1" height="1" alt=""><hr><p><a href="https://medium.com/streamnative/whats-new-in-apache-pulsar-2-7-2-6cbf5fcb6a16">What’s New in Apache Pulsar 2.7.2</a> was originally published in <a href="https://medium.com/streamnative">StreamNative</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Flink SQL on StreamNative Cloud]]></title>
            <link>https://medium.com/streamnative/flink-sql-on-streamnative-cloud-4c69b562a165?source=rss----ab76d1bbc527---4</link>
            <guid isPermaLink="false">https://medium.com/p/4c69b562a165</guid>
            <category><![CDATA[sql]]></category>
            <category><![CDATA[tech-blog]]></category>
            <category><![CDATA[apache-pulsar]]></category>
            <category><![CDATA[flink]]></category>
            <category><![CDATA[streaming]]></category>
            <dc:creator><![CDATA[Sijia-w]]></dc:creator>
            <pubDate>Thu, 22 Apr 2021 16:03:37 GMT</pubDate>
            <atom:updated>2021-04-22T16:03:33.821Z</atom:updated>
<content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*cnCgmkllft7_TUpd.png" /></figure><p>We are excited to announce the launch of Flink SQL on StreamNative Cloud. Flink SQL on StreamNative Cloud (aka “Flink SQL”) provides an intuitive and interactive SQL interface that reduces the complexity of building real-time data queries on Apache Pulsar. StreamNative is a cloud partner of Ververica, the original developers of Apache Flink and the company behind it. This partnership has enabled close collaboration and integration and has helped us create a powerful, turnkey platform for real-time data insights.</p><h3>Why Apache Flink and Flink SQL?</h3><p>Apache Flink is a distributed stream processing engine that provides high-throughput, low-latency data processing, powerful abstractions, and operational flexibility. With Apache Flink, users can easily develop and deploy event-driven applications, data analytics jobs, and data pipelines to handle real-time and historical data in complex distributed systems. Because of its powerful functionality and mature community, Apache Flink is widely adopted globally by some of the largest and most successful data-driven enterprises, including Alibaba, Netflix, and Uber.</p><p>Flink SQL provides relational abstractions of events stored in Apache Pulsar. It supports SQL standards for unified stream and batch processing. With Flink SQL, users can write SQL queries and access key insights from their real-time data without having to write a line of Java or Python.</p><p>With a powerful execution engine and a simple abstraction layer, Apache Flink and Flink SQL provide a distributed, real-time data processing solution with low development and maintenance costs. With Pulsar and Flink, StreamNative offers both stream storage and stream compute for a complete streaming solution.</p><h3>Flink + Pulsar: A Cloud-Native Streaming Platform for Infinite Data Streams</h3><p>The need for real-time data insights has never been more critical. But data insights aren’t limited to real-time data. Companies also need to integrate and understand large amounts of historical data in order to gain a complete picture of their business. This requires the ability to capture, store, and compute both real-time and historical data.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*IhMRs_qov9KmaYO4.png" /><figcaption><em>Figure 1. Real-time data insights require the ability to capture, store and compute both real-time and historical data</em></figcaption></figure><p>Pulsar’s <a href="http://pulsar.apache.org/docs/en/concepts-tiered-storage/">tiered storage model</a> provides the storage capabilities required for both batch and stream processing, enabling StreamNative Cloud to offer unified storage. Integrating Apache Flink and Flink SQL enables us to offer unified batch and stream processing, and Flink SQL simplifies the execution.</p><p>In a streaming-first world, the core abstraction of data is the infinite stream. Tables are derived from the stream and updated continuously as new data arrives in the stream. Apache Pulsar is the storage for infinite streams, and Apache Flink is the engine that creates the materialized views in the form of streaming tables. 
You can then run streaming queries to perform continuous transformations, or run batch queries against streaming tables to get the latest value for every key in the stream in real time.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*HlYVNeikx9LBHD5a.png" /><figcaption><em>Figure 2. Stream &amp; Table</em></figcaption></figure><p>Integrating Apache Flink with Apache Pulsar enables companies to represent and process streaming data in new ways. The Pulsar infinite stream is the core storage abstraction for streaming data, and everything else is a materialized view over the infinite stream, including databases, search indexes, or other data serving systems in the company. All the data enrichment and ETL needed to create these derived views can now be done in a streaming fashion using Apache Flink. Monitoring, security, anomaly and threat detection, analytics, and response to failures can be done in real time by combining historical context with real-time data analytics.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*S9ZQvFUNClnW3Hq0.png" /><figcaption><em>Figure 3. StreamNative Cloud as a complete streaming solution</em></figcaption></figure><h3>When to Use Flink SQL</h3><p>With Flink SQL on StreamNative Cloud, Pulsar clusters are treated as Flink catalogs. Users can query infinite streams of events in Apache Pulsar using Flink SQL. Below are some top use cases for streaming SQL queries over Pulsar streams:</p><h4>1. Real-time monitoring</h4><p>We often think of monitoring as tracking low-level performance statistics using counters and gauges. While these metrics can tell you that your CPU usage is high, they can’t tell you if your application is doing what it’s supposed to. Flink SQL allows you to define custom metrics from streams of messages that applications generate, whether they are logging events, captured change data, or any other kind. For example, a cloud service might need to check that every time a new user signs up, a welcome email is sent, a new user record is created, and their credit card is billed. These functions might be spread over multiple services or applications, and you want to verify that each step happened for each new customer within a certain SLA.</p><p>Below is a streaming SQL query to monitor error counts over a stream of error codes.</p><pre>INSERT INTO error_counts<br>SELECT error_code, count(*) FROM monitoring_stream<br>WHERE type = &#39;ERROR&#39;<br>GROUP BY TUMBLE(ts, INTERVAL &#39;1&#39; MINUTE), error_code;</pre>
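<p>On StreamNative Cloud, the SQL editor submits such queries for you. For readers curious about what this looks like programmatically, below is a minimal sketch of submitting the same query with Flink’s Java Table API; the table names come from the example above, and the environment setup is an assumption for illustration:</p><pre>import org.apache.flink.table.api.EnvironmentSettings;<br>import org.apache.flink.table.api.TableEnvironment;<br><br>public class ErrorCountsJob {<br>    public static void main(String[] args) {<br>        // Create a streaming TableEnvironment; the catalog and tables are<br>        // assumed to be configured to point at the Pulsar cluster.<br>        TableEnvironment tableEnv = TableEnvironment.create(<br>                EnvironmentSettings.newInstance().inStreamingMode().build());<br><br>        // Continuously aggregate error events into one-minute tumbling windows.<br>        tableEnv.executeSql(<br>                &quot;INSERT INTO error_counts &quot;<br>              + &quot;SELECT error_code, count(*) FROM monitoring_stream &quot;<br>              + &quot;WHERE type = &#39;ERROR&#39; &quot;<br>              + &quot;GROUP BY TUMBLE(ts, INTERVAL &#39;1&#39; MINUTE), error_code&quot;);<br>    }<br>}</pre>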
<h4>2. Real-time anomaly detection</h4><p>Security use cases often look a lot like monitoring and analytics. Rather than monitoring application behavior or business behavior, application developers are looking for patterns of fraud, abuse, spam, intrusion, or other bad behavior. Flink SQL provides a simple, real-time way of defining these patterns and querying real-time Pulsar streams.</p><p>Below is a streaming SQL query to detect fraud over a stream of transactions.</p><pre>INSERT INTO possible_fraud<br>SELECT card_number, count(*)<br>FROM transactions<br>GROUP BY TUMBLE(ts, INTERVAL &#39;1&#39; MINUTE), card_number<br>HAVING count(*) &gt; 3;</pre><h4>3. Real-time data pipelines</h4><p>Companies build real-time data pipelines for data enrichment. These data pipelines capture data changes coming out of several databases, transform them, join them together, and store them in a key-value database, search index, cache, or other data serving systems.</p><p>For a long time, ETL pipelines were built as periodic batch jobs that, for example, ingest the raw data in real time but transform it only every few hours to enable efficient queries. For many real-time use cases, such as transaction or payment processing, this delay is unacceptable. Flink SQL together with Pulsar I/O connectors enables real-time data integration between different systems.</p><p>Now you can enrich streams of events with metadata stored in a different table using joins, or perform simple filtering of Personally Identifiable Information (PII) data before loading the stream into another system.</p><p>The streaming SQL query below shows an example of enriching a click stream using a users table.</p><pre>INSERT INTO vip_users<br>SELECT user_id, page, action<br>FROM clickstream c<br>LEFT JOIN users u ON c.user_id = u.user_id<br>WHERE u.level = &#39;Platinum&#39;;</pre><h3>Pulsar Abstractions in Flink SQL</h3><p>The integration of Flink SQL and Apache Pulsar utilizes Flink’s <a href="https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/table/catalogs.html">catalog API</a> to reference existing Pulsar metadata and automatically map it to Flink’s corresponding metadata. There are a few core abstractions in this integration that map to the core abstractions in Pulsar and allow you to manipulate Pulsar topics using SQL.</p><ul><li>Catalog: A catalog is a collection of databases. It is mapped to an existing Pulsar cluster.</li><li>Database: A database is a collection of tables. It is mapped to a namespace in Apache Pulsar. All the namespaces within a Pulsar cluster are automatically converted to Flink databases in a Pulsar catalog. Databases can also be created or deleted via Data Definition Language (DDL) queries, where the underlying Pulsar namespaces will be created or deleted.</li></ul><pre>CREATE DATABASE userdb;</pre><ul><li>Table: A Pulsar topic can be presented as a STREAMING table or an UPSERT table.</li><li>Schema: The schema of a Pulsar topic is automatically mapped to a Flink table schema if the topic already exists with a schema. If a Pulsar topic doesn’t exist, creating a table via DDL queries converts the Flink table schema to a Pulsar schema for creating the Pulsar topic.</li><li>Metadata Columns: The message metadata and properties of a Pulsar message are mapped into the metadata columns of a Flink table. These metadata columns are:<br>· messageId: the message ID of a Pulsar message (read-only)<br>· sequenceId: the sequence ID of a Pulsar message (read-only)<br>· publishTime: the publish timestamp of a Pulsar message (read-only)<br>· eventTime: the event timestamp of a Pulsar message (readable/writable)<br>· properties: the message properties of a Pulsar message (readable/writable)</li></ul><p>A Pulsar topic can be presented as a STREAMING table or an UPSERT table in Flink.</p><h4>STREAMING table</h4><p>A streaming table represents an unbounded sequence of structured data (“facts”). For example, we could have a stream of financial transactions such as “Jack sent $100 to Kate, then Alice sent $200 to Kate”. Facts in a table are immutable, which means new events can be inserted into a table, but existing events can never be updated or deleted. 
<ul><li>Table: A Pulsar topic can be presented as a STREAMING table or an UPSERT table.</li><li>Schema: The schema of a Pulsar topic will automatically be mapped to the Flink table schema if the topic already exists with a schema. If a Pulsar topic doesn’t exist, creating a table via DDL queries will convert the Flink table schema to a Pulsar schema and create the Pulsar topic.</li><li>Metadata Columns: The message metadata and properties of a Pulsar message will be mapped into the metadata columns of a Flink table. These metadata columns, illustrated in the query sketch after this list, are:<ul><li>messageId: the message ID of a Pulsar message (read-only)</li><li>sequenceId: the sequence ID of a Pulsar message (read-only)</li><li>publishTime: the publish timestamp of a Pulsar message (read-only)</li><li>eventTime: the event timestamp of a Pulsar message (readable/writable)</li><li>properties: the message properties of a Pulsar message (readable/writable)</li></ul></li></ul>
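<p>As a quick illustration, the sketch below projects message metadata alongside payload columns. It assumes the pageviews table defined below (see the STREAMING table section) has already been created, and that the metadata columns are selectable under the names listed above, per the mapping just described.</p><pre>SELECT messageId, publishTime, eventTime, -- metadata columns mapped by the catalog<br>       user_id, page_id                   -- payload columns of the assumed pageviews table<br>FROM pageviews;</pre>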
<p>A Pulsar topic can thus be presented as either a STREAMING table or an UPSERT table in Flink.</p><h4>STREAMING table</h4><p>A streaming table represents an unbounded sequence of structured data (“facts”). For example, we could have a stream of financial transactions such as “Jack sent $100 to Kate, then Alice sent $200 to Kate”. Facts in a streaming table are immutable, which means new events can be inserted into the table, but existing events can never be updated or deleted. All the topics within a Pulsar namespace will automatically be mapped to streaming tables in a catalog configured to use a pulsar connector. Streaming tables can also be created or deleted via DDL queries, in which case the underlying Pulsar topics will be created or deleted.</p><pre>CREATE TABLE pageviews (<br>  user_id BIGINT,<br>  page_id BIGINT,<br>  viewtime TIMESTAMP,<br>  user_region STRING,<br>  WATERMARK FOR viewtime AS viewtime - INTERVAL &#39;2&#39; SECOND<br>);</pre><h4>UPSERT table</h4><p>An upsert table represents a collection of evolving facts. For example, we could have a table that contains the latest financial information, such as “Kate’s current account balance is $300”. It is the equivalent of a traditional database table, but enriched by streaming semantics such as windowing. Facts in an UPSERT table are mutable, which means new facts can be inserted into the table, and existing facts can be updated or deleted. Upsert tables can be created by setting the connector option to upsert-pulsar.</p><pre>CREATE TABLE pageviews_per_region (<br>  user_region STRING,<br>  pv BIGINT,<br>  uv BIGINT,<br>  PRIMARY KEY (user_region) NOT ENFORCED<br>) WITH (<br>  &#39;connector&#39; = &#39;upsert-pulsar&#39;<br>);</pre><p>By integrating the concepts of streaming tables and upsert tables, Flink SQL allows joining upsert tables that represent the current state of the world with streaming tables that represent events that are happening right now. A topic in Pulsar can be represented as either a streaming table or an upsert table in Flink SQL, depending on the intended semantics of the processing on the topic.</p><p>For instance, if you want to read the data in a topic as a series of independent values, you would treat the Pulsar topic as a streaming table. An example of such a streaming table is a topic that captures page view events, where each page view event is unrelated to and independent of the others. On the other hand, if you want to read the data in a topic as an evolving collection of updatable values, you would treat the topic as an upsert table. An example of a topic that should be read as an UPSERT table in Flink is one that captures user metadata, where each event represents the latest metadata for a particular user ID, including the user’s name, address, or preferences.</p>
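<p>To see the two kinds of tables working together, here is a minimal sketch that continuously materializes the streaming pageviews table from above into the upsert pageviews_per_region table: as new page view events arrive, the per-region values are updated in place rather than appended. Reading pv as total page views and uv as unique users is an assumption based on the column names.</p><pre>INSERT INTO pageviews_per_region<br>SELECT<br>  user_region,<br>  count(*) AS pv,                 -- running count of page views per region<br>  count(DISTINCT user_id) AS uv   -- running count of distinct users per region<br>FROM pageviews<br>GROUP BY user_region;</pre>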
<h3>A Dive into Flink SQL on StreamNative Cloud</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*ZHqVBV-EFJfURdXL.png" /><figcaption><em>Figure 4. StreamNative Cloud Architecture</em></figcaption></figure><p>StreamNative Cloud operates out of a control plane and cloud pools.</p><p>The control plane includes the backend services that StreamNative manages in its own cloud account, chiefly a Cloud API service and a Cloud console. Users can interact with StreamNative Cloud via the Cloud console, and applications can interact with it via the Cloud API service.</p><p>The cloud pools can be managed by StreamNative in its own cloud account or in the customers’ cloud accounts. Pulsar clusters run inside the cloud pools, and SQL queries run on them as well.</p><p>The diagram below demonstrates how authentication and authorization are implemented in our system. It assumes that data has already been ingested into the Pulsar clusters on StreamNative Cloud, but you can also ingest data from external sources, such as event data, streaming data, IoT data, and more, using Pulsar’s pub/sub messaging API.</p><p>Users or applications interact with the StreamNative control plane to create a Pulsar cluster. Once the Pulsar cluster is ready, users can either create a Flink session cluster and use the SQL editor in StreamNative’s Cloud console to run interactive queries, or create long-running deployments to continuously process data streams in the Pulsar cluster.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*kgOafv23liYvKCYf.png" /><figcaption><em>Figure 5. How Flink SQL interacts with Pulsar clusters</em></figcaption></figure><p>For each Flink session cluster, there is a SQL Gateway process that parses SQL queries and either executes them locally or submits them to the Flink cluster. Each SQL session in the SQL Gateway initiates Pulsar catalogs, with each catalog representing one existing Pulsar cluster. The catalog contains all the information needed to securely access that Pulsar cluster. DDL queries are executed directly in the SQL Gateway, while DML queries are submitted to the Flink session cluster for execution. For security, every SQL query is executed under the identity of the user who submits it.</p><h3>What’s Next for Flink + Pulsar Integration on StreamNative Cloud?</h3><p>We are releasing Flink SQL on StreamNative Cloud as a developer preview feature to gather feedback. We plan to add several more capabilities, such as running Flink SQL as continuous deployments and the ability to run arbitrary Flink jobs, as we work with both the Pulsar and Flink communities to build a robust, unified batch and streaming solution.</p><h3>How Do I Access Flink SQL on StreamNative Cloud?</h3><p>You can get started by watching the <a href="https://www.youtube.com/watch?v=0BxXjEqoJlU">quick start tutorial</a> for Flink SQL on StreamNative Cloud. We’d love to hear about any ideas you have for improvement and to work closely with early adopters. Note that the Flink SQL offering is only available on paid clusters for now; we will give free cloud credits to our early adopters. If you are interested in trying it out, please email us at <a href="mailto:info@streamnative.io">info@streamnative.io</a>.</p><p>To learn more about Flink SQL, you can:</p><ul><li>Watch the <a href="https://youtu.be/9ojajM7Zt0M?t=2105">intro video</a>.</li><li>Read about Flink SQL <a href="https://docs.streamnative.io/cloud/stable/compute/flink-sql">here</a>.</li><li>Get started with Flink SQL in <a href="http://console.streamnative.cloud/">StreamNative Cloud</a>.</li></ul><p>Finally, if you’re interested in messaging and event streaming, and want to help build Pulsar and Flink, <a href="https://streamnative.io/en/careers">we are hiring</a>.</p><h3>About the Author</h3><p><strong>Sijie Guo</strong> is the co-founder and CEO of StreamNative, which provides a cloud-native event streaming platform powered by Apache Pulsar. Sijie has worked on messaging and streaming data technologies for more than a decade. Prior to StreamNative, Sijie co-founded Streamlio, a company focused on real-time solutions. At Twitter, Sijie was the tech lead for the messaging infrastructure group, where he co-created DistributedLog and Twitter EventBus. Prior to that, he worked on the push notification infrastructure at Yahoo!, where he was one of the original developers of BookKeeper and Pulsar. He is also the VP of Apache BookKeeper and a PMC member of Apache Pulsar. You can follow him on <a href="https://twitter.com/sijieg">Twitter</a>.</p><p>This post was originally published on the <a href="https://streamnative.io/blog">StreamNative blog</a>.</p><p><em>Like this post? Please recommend and/or share.</em></p><p><em>Want to learn more? See </em><a href="https://streamnative.io/blog/"><em>https://streamnative.io/blog</em></a><em>.</em> <em>Follow us </em><a href="https://medium.com/streamnative"><em>here</em></a><em> on Medium and check out our </em><a href="https://github.com/streamnative"><em>GitHub</em></a><em>.</em></p>]]></content:encoded>
        </item>
    </channel>
</rss>