From zero to data lake: our journey to handling data at scale

Alice (formerly: ActiveFence) — Tue, 18 Jul 2023 14:12:47 GMT

Jonathan Lavon, Principal Engineer @ ActiveFence.

Having all your systems’ data in a structured, central, accessible location is critical, especially when you’re dealing with machine learning and big data analytics. But when kicking off a small startup or project, you usually don’t start out with infrastructure to support large-scale systems right away, and neither did we.

ActiveFence grew rapidly, but we started out with an architecture that was focused on the use cases and data that we had at hand and didn’t support the increasing scale of our processes and data. And as the needs and requirements of the business grew, the system’s architecture adapted organically.

The Data Group at ActiveFence builds the tools to detect and identify malicious content at scale for online platforms. As is pretty typical for startups, we did not set out with a clear vision of the end goal for our systems. In this post, I will share how we became a company that puts data first and the lessons we learned in setting up a data lake. ActiveFence’s process is a case study, but it is relevant for others who might be in the same situation.

We started with a simple system: a visual editor of MongoDB records, backed by a string of microservices that processed these records and communicated using RabbitMQ. This system was used by in-house intelligence researchers, while analysts processed the information to detect harmful content.

All data in the system was processed manually: a human viewed the record to see the content, decided whether it fit the definition of actionable content, reviewed the decision, and acted on it. This, of course, meant that we had an upper bound on how much content could be processed as well as a limit on how fast we could respond.

However, to truly enable automated content moderation at scale, we needed to get bigger and faster. We needed to incorporate machine learning classification models. Within this architecture, we started by exporting the data from MongoDB, processing it with batch jobs triggered manually, and then importing the results back. This was cumbersome and slow and needed a lot of manual work and maintenance.

So we decided to create a data lake — a single accessible place where all of our data could rest in a format that is conducive to processing. It was important not to impact the human analysis work happening in the system, which our clients depended on at the time.

We used FeathersJS as a data access layer on top of MongoDB. The Feathers framework can attach hooks to any change in the DB, catching the change before and after it happens, as well as when errors occur. We used this ability to facilitate rule-based workflows: after a certain record is created, send it to a specific service to do some processing. We built on that by sending each record saved to the DB to a service that formats the record as a JSON object and passes it to another service to be saved to S3.
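As a rough illustration of the hook pattern described above, here is a minimal Python sketch. It is not Feathers' actual JavaScript API: the names `HookedService`, `after_create`, and `to_lake_message` are hypothetical, and a plain list stands in for the message queue that would carry the formatted record to the S3 saving service.

```python
import json
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]
Hook = Callable[[Record], None]


class HookedService:
    """Toy data access layer that runs registered hooks after each create."""

    def __init__(self) -> None:
        self._store: Dict[str, Record] = {}
        self._after_create: List[Hook] = []

    def after_create(self, hook: Hook) -> None:
        """Register a hook to run after a record has been saved."""
        self._after_create.append(hook)

    def create(self, record: Record) -> Record:
        """Save the record, then pass it to every registered hook."""
        self._store[record["_id"]] = record
        for hook in self._after_create:
            hook(record)
        return record


def to_lake_message(record: Record) -> str:
    """Format a record as a single JSON string (assumed wire format)."""
    return json.dumps(record, sort_keys=True, default=str)


# Hypothetical wiring: the list stands in for publishing to a queue.
published: List[str] = []
service = HookedService()
service.after_create(lambda rec: published.append(to_lake_message(rec)))
service.create({"_id": "a1", "text": "hello"})
```

The point of the pattern is that the code saving the record does not need to know about the data lake; the export is attached from the outside.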

The S3 saving service was an existing archiving service that received messages from the message queue, created a local file, and filled it with the fetched messages. When the file was big enough or enough time had passed, the file was written to S3 and the messages were acked. We chose not to save every record to a separate file for performance reasons: big data processes work better with fewer, larger files.
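The size-or-time batching described above can be sketched roughly as follows. This is an illustration under assumptions, not the original service: `BatchBuffer`, `write_batch`, and `ack` are hypothetical names, the write callback stands in for the S3 upload, and the ack callback stands in for acknowledging a queue message.

```python
import time
from typing import Callable, List, Optional, Tuple


class BatchBuffer:
    """Collects messages and flushes them together as one larger batch.

    A flush happens when the buffered bytes reach max_bytes, or when
    max_age_seconds have passed since the first buffered message.
    Messages are acked only after the batch write has returned.
    """

    def __init__(
        self,
        write_batch: Callable[[bytes], None],
        ack: Callable[[int], None],
        max_bytes: int,
        max_age_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._write_batch = write_batch
        self._ack = ack
        self._max_bytes = max_bytes
        self._max_age = max_age_seconds
        self._clock = clock
        self._pending: List[Tuple[int, bytes]] = []
        self._size = 0
        self._first_at: Optional[float] = None

    def add(self, delivery_tag: int, body: bytes) -> None:
        """Buffer one message and flush if a threshold has been reached."""
        if self._first_at is None:
            self._first_at = self._clock()
        self._pending.append((delivery_tag, body))
        self._size += len(body)
        self.maybe_flush()

    def maybe_flush(self) -> None:
        """Flush when the batch is big enough or old enough."""
        if not self._pending:
            return
        too_big = self._size >= self._max_bytes
        too_old = self._clock() - self._first_at >= self._max_age
        if too_big or too_old:
            self.flush()

    def flush(self) -> None:
        """Write all buffered messages as one batch, then ack each one."""
        self._write_batch(b"\n".join(body for _, body in self._pending))
        for tag, _ in self._pending:
            self._ack(tag)
        self._pending, self._size, self._first_at = [], 0, None
```

Acking only after the write returns means a crash before the flush leaves the messages unacked, so the queue can redeliver them.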

This flow had stability issues since we explicitly handled everything, including measuring the file size and time taken, keeping RabbitMQ messages alive until the file persisted, and making sure that files from different pods did not overwrite one another. Each pod crash or problem had to release all the held messages to be reprocessed, which frequently caused a cascading failure. This architecture also had scale issues, since each “target” (different file location) in S3 had a different local file, forcing us to juggle multiple open files at the same time.

This led us to replace the S3 saving service with AWS Kinesis Data Firehose: the hook service formats the records and sends them to Firehose, with a different stream for each target. Firehose does all the heavy lifting: it batches the records into files that are big enough, keeps track of timeouts and such, and has much better uptime than our pods.
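For readers unfamiliar with Firehose, sending formatted records could look roughly like the sketch below. It assumes a boto3 Firehose client is passed in (for example one created with `boto3.client("firehose")`) and uses its `put_record_batch` call; the function name `send_to_firehose`, the stream name, and the newline-delimited JSON encoding are assumptions, not details from the original system.

```python
import json
from typing import Any, Dict, Iterable

# PutRecordBatch accepts at most 500 records per call.
MAX_BATCH = 500


def send_to_firehose(
    client: Any, stream_name: str, records: Iterable[Dict[str, Any]]
) -> int:
    """Send records as newline-delimited JSON; return the count of failed puts.

    `client` is expected to behave like a boto3 Firehose client.
    """
    encoded = [
        {"Data": (json.dumps(record, default=str) + "\n").encode("utf-8")}
        for record in records
    ]
    failed = 0
    for start in range(0, len(encoded), MAX_BATCH):
        response = client.put_record_batch(
            DeliveryStreamName=stream_name,
            Records=encoded[start:start + MAX_BATCH],
        )
        # Firehose reports partial failures per call rather than raising.
        failed += response.get("FailedPutCount", 0)
    return failed
```

A production caller would also retry the individual records that the response marks as failed; that is omitted here for brevity.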

At this point, we had two streams of data flowing through our system, being formatted and saved to S3: raw data as it is inserted into the system, and DB record changes.

But our goal was to have automatic machine learning processes relying on up-to-date data. We decided to use the DB record changes that were saved to S3 to create a mirror image of MongoDB, which was still the heart of our system.

We did this by creating snapshots of the data, at first daily, but later hourly as well. The snapshot process takes the previously generated snapshot as a base for the new snapshot, gathers all the new records in S3 (that signify record deltas in MongoDB), sorts them by timestamp, then merges them into a unified record. This mimics the actions taken on the record in MongoDB itself and creates (when everything works well) a duplicate of the DB as it was when the snapshot was created.
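The snapshot step can be sketched in a few lines of plain Python. This is a simplified illustration, not the production job: the delta shape (`id`, `ts`, `fields`, and an optional `deleted` flag) is an assumption, and fields are merged shallowly.

```python
from typing import Any, Dict, Iterable

Record = Dict[str, Any]


def build_snapshot(
    previous: Dict[str, Record], deltas: Iterable[Record]
) -> Dict[str, Record]:
    """Apply deltas, oldest first, on top of a copy of the previous snapshot."""
    snapshot = {rid: dict(rec) for rid, rec in previous.items()}
    for delta in sorted(deltas, key=lambda d: d["ts"]):
        rid = delta["id"]
        if delta.get("deleted"):
            # A delete delta removes the record from the new snapshot.
            snapshot.pop(rid, None)
            continue
        merged = snapshot.get(rid, {})
        merged.update(delta["fields"])  # later deltas overwrite earlier fields
        snapshot[rid] = merged
    return snapshot


previous = {"a": {"status": "new", "score": 1}}
deltas = [
    {"id": "a", "ts": 2, "fields": {"status": "reviewed"}},
    {"id": "b", "ts": 1, "fields": {"status": "new"}},
    {"id": "a", "ts": 1, "fields": {"status": "pending", "score": 2}},
]
snapshot = build_snapshot(previous, deltas)
```

Because each run starts from the previous snapshot, a missed delta stays missing in every later snapshot until something corrects it.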

We used Apache Airflow to schedule and run the snapshot creation processes, as well as subsequent machine learning model training and application on this data. Once the results of this pipeline are ready, an Airflow task takes the outputs and sends them to a special input endpoint in the original system, which can then be shown to users in the UI.

This architecture works but has a few issues.

Snapshots keep getting bigger and take longer to generate as data continues coming in. Over a long enough period, generating them becomes infeasible without pruning data.
Processes take a while to complete, and users have no indication of their status since they happen in a completely different system. Many times users wondered why they weren’t seeing content in the UI when jobs were delayed.
The source of truth is still MongoDB. All the data in our data lake had to first go through that system, which meant that for every inconsistency, the information in Mongo was the decider.

This last point created issues of synchronization. Since we don’t have 100% uptime, there are failure points in the process, for example:

If all the hook service pods are down, then a record change delta from MongoDB is missed.
If RabbitMQ is acting up, for example by reaching maximum memory and becoming unresponsive, then messages can get lost between the different services that upload them.

These issues cause a gradual desyncing between the data lake and the DB. This can be corrected by a full DB sync which sends all records as they are to S3, but this is costly and manual.

So we decided to invert the responsibilities. The data lake should be the source of truth, and MongoDB should be a client of that source. To that end, we created a new ingestion system responsible for collecting and inserting data into our system by various means, including scraping, manual uploads, and API calls. This system fetches whatever data we need, formats it properly, and then sends it to the data lake snapshot via Firehose. For specific conditions, it sends the records to the MongoDB system directly as well, but only after S3. Airflow processes create whatever outputs are needed from the models and send only the relevant information to the UI system for users to act on. This architecture allows us to scale: automatic models run before users see the data instead of the other way around, and we have a single source of truth in the data lake.
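The ordering guarantee in this design (data lake first, MongoDB only afterwards and only for some records) can be sketched as below. The function and parameter names are hypothetical, and the `urgent` flag in the usage lines merely stands in for the "specific conditions" mentioned above, which the post does not spell out.

```python
from typing import Any, Callable, Dict, List, Tuple

Record = Dict[str, Any]


def ingest(
    record: Record,
    send_to_lake: Callable[[Record], None],
    send_to_db: Callable[[Record], None],
    needs_direct_db: Callable[[Record], bool],
) -> None:
    """Write to the data lake first; forward to the DB only afterwards."""
    # If this call raises, the DB write below never happens, so the DB
    # cannot hold a record that the data lake is missing.
    send_to_lake(record)
    if needs_direct_db(record):
        send_to_db(record)


# Hypothetical usage with in-memory sinks that log the order of writes.
calls: List[Tuple[str, int]] = []
for rec in ({"id": 1, "urgent": True}, {"id": 2, "urgent": False}):
    ingest(
        rec,
        send_to_lake=lambda r: calls.append(("lake", r["id"])),
        send_to_db=lambda r: calls.append(("db", r["id"])),
        needs_direct_db=lambda r: r["urgent"],
    )
```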

Our big issue now was latency — snapshots were getting bigger and longer, and a lot of data was flowing in, thus making our machine-learning processes take a long time. We were approaching a point where we just couldn’t keep up. Once a daily process takes too many hours, it gets to a point where each node failure in the cluster causes a lot of work to be lost, necessitating restarts, as well as causing costs to balloon.

Our business had also evolved, and we were heading to a B2B API offering: send us your data, and we’ll process it and give you a classification in real time, either as an API result or in our UI platform. To address this business need, we separated the batch processes from the real-time needs. We still have all our data in the data lake and still have Airflow batch processes that run periodically, but instead of being used for inference, these are used for offline training and retraining of our models. Once models are ready and the machine learning lifecycle permits it (quality and performance benchmarks), these models are published in an internal repository. The real-time processes take the given data, apply the model to it as part of the pipeline, and send the results both to the user in the system and to the data lake for training.
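A toy version of this split between offline publishing and real-time inference might look like the following. Everything here is an assumption made for illustration: `ModelRegistry`, the single `min_quality` gate, and the callbacks are hypothetical stand-ins for the internal repository, the benchmark checks, and the delivery channels.

```python
from typing import Any, Callable, Dict, Optional

Record = Dict[str, Any]
Model = Callable[[Record], float]


class ModelRegistry:
    """Holds the most recent model that passed its quality gate."""

    def __init__(self, min_quality: float) -> None:
        self._min_quality = min_quality
        self._current: Optional[Model] = None

    def publish(self, model: Model, quality: float) -> bool:
        """Accept a model from batch training only if it meets the benchmark."""
        if quality < self._min_quality:
            return False
        self._current = model
        return True

    def current(self) -> Model:
        """Return the published model, or fail if none exists yet."""
        if self._current is None:
            raise LookupError("no model has been published yet")
        return self._current


def classify_realtime(
    record: Record,
    registry: ModelRegistry,
    respond: Callable[[Record], None],
    send_to_lake: Callable[[Record], None],
) -> float:
    """Score one record and deliver the result to the user and the data lake."""
    score = registry.current()(record)
    result = {**record, "score": score}
    respond(result)       # API response or UI update
    send_to_lake(result)  # kept for later offline training
    return score
```

The batch side only ever calls `publish`, and the real-time side only ever calls `current`, which keeps the two workloads from blocking each other.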

The UI system gets and displays these records, as well as user-defined data. Any change to these records is handled by the UI system and sent to the same ingestion system that sends it to the data lake. The ingestion system also has an observer service that keeps track of the records in the system and caches them in DynamoDB. This is done for random access needs: without the cache, getting the information on a specific record would mean going over the entire snapshot. This observer is limited to data from the last few weeks, so as not to overload it.
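The idea of a time-limited lookup cache can be illustrated with an in-memory stand-in. This sketch does not use DynamoDB and does not reflect how retention is actually enforced in the observer service (the post does not say); `RecentRecordCache` and its parameters are hypothetical.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple

Record = Dict[str, Any]


class RecentRecordCache:
    """Key-value cache that only serves records observed recently."""

    def __init__(
        self, retention_seconds: float, clock: Callable[[], float] = time.time
    ) -> None:
        self._retention = retention_seconds
        self._clock = clock
        self._items: Dict[str, Tuple[float, Record]] = {}

    def observe(self, record_id: str, record: Record) -> None:
        """Store or refresh a record as it passes through ingestion."""
        self._items[record_id] = (self._clock(), record)

    def get(self, record_id: str) -> Optional[Record]:
        """Return the record if it is still inside the retention window."""
        entry = self._items.get(record_id)
        if entry is None:
            return None
        seen_at, record = entry
        if self._clock() - seen_at > self._retention:
            del self._items[record_id]  # expired: drop it to keep the cache small
            return None
        return record
```

A lookup by ID is then a single key read, at the cost of older records only being reachable through the snapshot.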

At ActiveFence, we enable automated content moderation at scale. In this post, I described the journey we took, from a small startup with small data needs to a company with a clear business direction and real-time machine learning on big data at scale. In the end, we realized that to support machine learning at scale, we needed to ensure our data lake was the single source of truth and have a system architecture that supports using and maintaining all this data. Although we constantly rethink and improve our processes, the architecture we have outlined here is flexible and scalable enough to support our needs, while allowing impactful analysis when our clients need it.

From zero to data lake: our journey to handling data at scale was originally published in Engineering @ Alice on Medium, where people are continuing the conversation by highlighting and responding to this story.

Contextual A.I. for Adapting to Adversaries, with Matar Haller

Alice (formerly: ActiveFence) — Sun, 04 Jun 2023 09:46:30 GMT

Matar Haller speaks to Jon Krohn about the challenges of identifying, analyzing, and flagging malicious information online. In this episode, Matar explains how contextual AI and a “database of evil” can help resolve the multiple challenges of blocking dangerous content across a range of media, even those that are live-streamed.

https://medium.com/media/94e52a7a5b32d4fbf93ea7b03f4f6f29/href

Contextual A.I. for Adapting to Adversaries, with Matar Haller was originally published in Engineering @ Alice on Medium, where people are continuing the conversation by highlighting and responding to this story.

ActiveFence R&D @ PyData Conference

Alice (formerly: ActiveFence) — Sun, 04 Jun 2023 09:11:53 GMT

PyData Tel Aviv 2022

Check out the recorded session from PyData Global of Matar Haller and Noam Levy’s presentation on constructing & querying a data model to detect online harm. They discuss how the technology behind detecting harmful content online is multi-layered, as is content that users generate. A typical social post has text, an image, and interactions; each must be assessed against algorithms by the model to define a risk score that ranks harmful content. This data model supports trust and safety teams scaling their efforts to catch malicious content by calculating the probability of risk.

To build algorithms that analyze and detect this harmful activity at scale, we need a data model that can capture the complexities of this online ecosystem. In this talk, we will discuss how ActiveFence models the online content, media, creators, and users that interact with the content with likes, shares, or comments. Modeling the relationships between these items yields a complex connected graph, and to calculate a score that accurately reflects the probability of harm, we need to be able to query and access all of the relations of any given item. We will dive into the details of the complex and adversarial online space, the ActiveFence data model, and how we abstract the complexity of querying a graph-like data model using traditional SQL PySpark queries to provide maximum value to our algorithms.
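To make "querying all of the relations of any given item" concrete, here is a small sketch that uses Python's built-in `sqlite3` module as a stand-in for the SQL and PySpark queries mentioned in the talk. The `items` and `relations` tables, their columns, and the sample rows are invented for illustration and are not ActiveFence's data model.

```python
import sqlite3
from typing import List, Tuple

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE items (id TEXT PRIMARY KEY, kind TEXT);
    CREATE TABLE relations (src TEXT, dst TEXT, relation TEXT);
    """
)
conn.executemany(
    "INSERT INTO items VALUES (?, ?)",
    [("post1", "post"), ("user1", "user"), ("user2", "user"), ("img1", "image")],
)
conn.executemany(
    "INSERT INTO relations VALUES (?, ?, ?)",
    [
        ("user1", "post1", "created"),
        ("user2", "post1", "liked"),
        ("post1", "img1", "contains"),
    ],
)


def related_items(db: sqlite3.Connection, item_id: str) -> List[Tuple[str, str, str]]:
    """Return (relation, neighbor id, neighbor kind) in both edge directions."""
    return db.execute(
        """
        SELECT r.relation, i.id, i.kind
        FROM relations r JOIN items i ON i.id = r.dst
        WHERE r.src = ?
        UNION
        SELECT r.relation, i.id, i.kind
        FROM relations r JOIN items i ON i.id = r.src
        WHERE r.dst = ?
        ORDER BY 1, 2
        """,
        (item_id, item_id),
    ).fetchall()


neighbors = related_items(conn, "post1")
```

A scoring algorithm could then work from `neighbors` without needing to know how the graph is stored.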

https://medium.com/media/fb7552404a576d2db64505eabe5ec2a9/href

ActiveFence R&D @ PyData Conference was originally published in Engineering @ Alice on Medium, where people are continuing the conversation by highlighting and responding to this story.

ActiveFence @ PyData

Alice (formerly: ActiveFence) — Sun, 04 Jun 2023 08:13:21 GMT

Join Matar Haller, VP Data & AI, and Shimon Harush, Data Science Team Lead at ActiveFence, as they explore the significance of prioritizing data quality and quantity in Data-centric AI. Through practical examples, they highlight the benefits of this approach, including enhanced accuracy, scalability, and flexibility, ultimately emphasizing the critical role of data in building successful AI models.

https://medium.com/media/8ebe9d33fdd66713583dd58403f2aba5/href

ActiveFence @ PyData was originally published in Engineering @ Alice on Medium, where people are continuing the conversation by highlighting and responding to this story.
