<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://tech.scribd.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://tech.scribd.com/" rel="alternate" type="text/html" /><updated>2026-03-11T19:52:43+00:00</updated><id>https://tech.scribd.com/feed.xml</id><title type="html">Scribd Technology</title><subtitle>Scribd technology builds and delivers one of the world&apos;s largest libraries, bringing the best books, audiobooks, and journalism to millions of people around the world.</subtitle><entry><title type="html">Dual-Embedding Trust Scoring</title><link href="https://tech.scribd.com/blog/2026/content-trust-score.html" rel="alternate" type="text/html" title="Dual-Embedding Trust Scoring" /><published>2026-02-25T00:00:00+00:00</published><updated>2026-02-25T00:00:00+00:00</updated><id>https://tech.scribd.com/blog/2026/content-trust-score</id><content type="html" xml:base="https://tech.scribd.com/blog/2026/content-trust-score.html"><![CDATA[<p>Scribd is a digital library serving academics and lifelong learners, offering hundreds of millions of documents. This very nature presents a significant concern: content trust and safety. Protecting our library from undesirable and unsafe content is a top priority, but the multilingual and multimodal (text and images) nature of our platform makes this mission very challenging. Also, while third-party tools exist, they often fall short, lacking the nuance to handle our specific trust and safety categories.</p>

<p>To this end, we capitalized on Generative AI (GenAI) signals and our proprietary multilingual embeddings, in conjunction with classical machine learning methods, to develop our <strong>Content Trust Score</strong>. This metric reflects the severity of a document violating a specific trust pillar, enabling us to identify high-risk content and take appropriate actions. Ultimately, the score allows us to build a more robust and scalable moderation system, ensuring a safer and more reliable experience for all users while preserving the rich diversity of our UGC.</p>

<p>The data and methodologies presented here are for research purposes and do not represent Scribd’s overall moderation or policy implementation.</p>

<h2 id="content-trust-pillars">Content Trust Pillars</h2>
<p>Following our internal Trust &amp; Safety framework, we defined four top-level concern pillars and prioritized our current efforts on them:</p>

<ul>
  <li><strong>Illegal:</strong> Documents that contain or promote illegal materials or activities</li>
  <li><strong>Explicit:</strong> Sexual or shocking content</li>
  <li><strong>Privacy/PII:</strong> Documents that violate privacy or contain Personally Identifiable Information (PII)</li>
  <li><strong>Low Quality:</strong> Junk, gibberish, low information, or non-semantic documents</li>
</ul>

<p>To maintain a clear project scope, we focused our research on these four semantic-heavy pillars where our embedding-based approach offers the greatest impact. The remaining violation types are out of scope and are addressed by other specialized detection algorithms.</p>

<h2 id="from-embeddings-to-trust-score">From Embeddings to Trust Score</h2>
<h3 id="datasets--features">Datasets &amp; Features</h3>
<p>We leveraged Scribd’s annotated data, which includes human-assigned trust labels, to craft our core modeling dataset of roughly 100,000 documents. This dataset was split 90/10 into training and testing sets, each distributed across the four trust pillars. The training set was used exclusively to derive the Content Trust Pillar embeddings, while the testing set provided the initial basis for comparing content- and description-based scores. In addition to the four primary Trust Pillars, we also included documents that violate no trust &amp; safety pillar; these <em>clean</em> documents serve as the “baseline” in our analyses. Note that the data presented here is for discussion purposes and does not reflect the actual category distributions within the Scribd corpus.</p>

<style>
.figure-table {
  width: calc(100% - 3rem);
  max-width: 100%;
  margin-left: 1.5rem;
  margin-right: 1.5rem;
  box-sizing: border-box;
}
.figure-table figcaption {
  margin-bottom: 1em;
}
.figure-table table { width: 100%; border-collapse: collapse; table-layout: fixed; }
.figure-table th,
.figure-table td { border: 1px solid #ccc; padding: 0.5rem 0.75rem; text-align: center; box-sizing: border-box; }
.figure-table thead th { background: #f5f5f5; }
</style>

<figure class="figure-table">
  <figcaption><strong>Table 1. Document distribution across Trust Pillars.</strong> This table details the percentage of labeled documents within the training and testing datasets. Note that the <em>Clean</em> documents are included separately as the baseline.</figcaption>
  <table>
    <thead>
      <tr>
        <th>Trust Pillar</th>
        <th>Training Dataset</th>
        <th>Testing Dataset</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>Illegal</td>
        <td>1.49%</td>
        <td>1.56%</td>
      </tr>
      <tr>
        <td>Explicit</td>
        <td>0.39%</td>
        <td>0.41%</td>
      </tr>
      <tr>
        <td>PII/Privacy</td>
        <td>5.43%</td>
        <td>5.48%</td>
      </tr>
      <tr>
        <td>Low Quality</td>
        <td>2.18%</td>
        <td>2.17%</td>
      </tr>
      <tr>
        <td><em>Clean</em></td>
        <td>90.51%</td>
        <td>90.38%</td>
      </tr>
    </tbody>
  </table>
</figure>

<p><br />
The core feature of our project is a 128-dimensional semantic embedding for every document, generated using the <a href="https://huggingface.co/sentence-transformers/LaBSE">LaBSE model</a> fine-tuned on our in-house dataset. Semantic embeddings are dense, numerical vector representations of text in a high-dimensional space. The goal of an embedding is to map linguistic meaning into this vector space such that pieces of text with similar semantics are positioned mathematically closer together, so the degree of similarity between texts can be quantified by the distance between their respective vectors. For instance, in Figure 1, the words “circle” and “square” sit close to each other because they are semantically more similar to one another than to words like “crocodiles” or “alligators”. This allows us to represent all the text in a document as a vector of numbers and accurately quantify semantic relationships between documents.
<br /></p>
<figure>
    <img width="662" alt="Figure 1. Conceptual visualization of semantic embeddings." src="/post-images/2026-content-trust/content-trust-score-Figure-1.png" />
  <figcaption><strong>Figure 1. Conceptual visualization of semantic embeddings.</strong></figcaption>
</figure>
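<p>The distance intuition above can be made concrete with cosine similarity. The sketch below uses tiny, made-up 3-dimensional vectors purely for illustration; real embeddings, such as our 128-dimensional LaBSE vectors, behave the same way:</p>

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: near 1.0 for similar directions, near 0 for unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy embeddings: "circle" and "square" point in similar
# directions, while "crocodile" sits in a different region of the space.
circle = np.array([0.9, 0.1, 0.2])
square = np.array([0.8, 0.2, 0.1])
crocodile = np.array([0.1, 0.9, 0.7])

# "circle" is far more similar to "square" than to "crocodile".
assert cosine_similarity(circle, square) > cosine_similarity(circle, crocodile)
```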


<h3 id="content-trust-score">Content Trust Score</h3>
<p>The first step in generating the Trust Score was creating the representative vectors for each trust pillar. Using the semantic embeddings, we generated the <strong>Content Trust Pillar embeddings</strong> for each trust pillar by averaging the embeddings of all documents with that pillar’s label in the training dataset. The large size of the training dataset helps ensure the representativeness of these Pillar embeddings.</p>

<p>The content trust score for a Trust Pillar was then computed as the cosine similarity between the document’s embedding and the corresponding Trust Pillar’s embedding. Crucially, all scores are generated and evaluated exclusively using the testing dataset to strictly avoid data leakage and circularity in our analysis. Our hypothesis is that documents closely matching a specific trust pillar will yield a high similarity score against that Pillar’s embedding, while non-matching documents will yield a low score.</p>

<p>This concept is visualized in Figure 2, where each “Pillar” represents a distinct trust pillar centroid. Individual documents are clustered around their respective pillar, illustrating that the closer a document’s embedding is to a specific Trust Pillar embedding, the higher its calculated similarity score, which confirms a stronger thematic match to that pillar.</p>

<figure>
    <img width="662" alt="Figure 2. Conceptual visualization of Trust Pillar embeddings and document similarity." src="/post-images/2026-content-trust/content-trust-score-Figure-2.png" />
  <figcaption><strong>Figure 2. Conceptual visualization of Trust Pillar embeddings and document similarity in a high-dimensional space.</strong> Each colored dot represents a single document.</figcaption>
</figure>
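<p>Mechanically, the pillar embeddings and scores reduce to a mean and a cosine similarity. A minimal NumPy sketch on random stand-in data (only the 128-dimensional embedding size is taken from the text; the pillar subsets and counts here are illustrative):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 128  # dimensionality of our document embeddings

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Stand-in training embeddings grouped by human-assigned pillar label.
training = {
    "illegal": rng.normal(size=(500, DIM)),
    "explicit": rng.normal(size=(130, DIM)),
}

# Pillar embedding = mean of all training documents carrying that label.
pillar_embeddings = {p: vecs.mean(axis=0) for p, vecs in training.items()}

def trust_scores(doc_embedding):
    """Cosine similarity of one document against every pillar centroid."""
    d = normalize(doc_embedding)
    return {p: float(np.dot(d, normalize(c))) for p, c in pillar_embeddings.items()}

scores = trust_scores(rng.normal(size=DIM))  # one score per pillar, in [-1, 1]
```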

<h3 id="enhancing-semantics-with-description-embeddings">Enhancing Semantics with Description Embeddings</h3>
<p>While the content-based semantic embeddings are generally effective, they struggle in certain cases where the raw text is not fully informative. Specifically, these embeddings may fail when documents are extremely long, image-heavy, or contain meaningless repetitive text.</p>

<p>In these scenarios, a brief content summary can provide a superior document representation. For example, Figure 3 illustrates a document containing presentation slides where the raw text is minimal, yet the user-provided description is quite informative.</p>

<figure>
    <img width="662" alt="Figure 3. Example of an extremely long document with good descriptive metadata." src="/post-images/2026-content-trust/content-trust-score-Figure-3.png" />
  <figcaption><strong>Figure 3. Example of an extremely long document with good descriptive metadata.</strong> This example demonstrates how a concise, user-provided description (bottom box) provides more focused, informative text for embedding than the raw content of an extremely long document.</figcaption>
</figure>

<p>However, since users often do not provide adequate descriptions upon document upload, we rely on large language models (LLMs) to generate descriptive summaries based on the content. Figure 4 demonstrates this necessity, showing a document with lengthy and repetitive text where the LLM-generated descriptions (<strong>GenAI descriptions</strong>) summarize the core topic effectively.</p>

<p>Consequently, we generated a second set of document semantic embeddings, and the corresponding Content Trust Pillar embeddings, based on the LLM-generated descriptions. This dual approach allowed us to compute the content trust score using the alternative, enhanced representation.</p>

<figure>
    <img width="662" alt="Figure 4. Example of a document with meaningless, repetitive content (top)." src="/post-images/2026-content-trust/content-trust-score-Figure-4.png" />
  <figcaption><strong>Figure 4. Example of a document with meaningless, repetitive content (top).</strong> The LLM successfully analyzes and summarizes the document, providing a usable description for embedding generation (bottom).</figcaption>
</figure>

<h3 id="content--vs-description-based-trust-scores">Content- vs. Description-Based Trust Scores</h3>
<p>For each trust pillar, we compared the distribution of the content trust scores derived from the document’s content to their GenAI-description-based counterparts, using the approximately 10,000-document testing dataset. To ensure a fair comparison, we included only documents for which both sets of scores were available. Our results reveal that <strong>content-based trust scores outperformed the scores generated from GenAI descriptions</strong> for all Trust Pillars (Figure 5a-c) except the Low Quality pillar (Figure 5d).</p>

<p>For the majority of Trust Pillars, the content-based scores demonstrated strong discrimination: they were higher for documents truly violating a given pillar (True Positives) than for documents violating other trust pillars or clean documents. Conversely, for these same pillars, the GenAI-description-based scores were indistinguishable from those of other documents, or showed significantly less separation compared to the content-based counterparts. This suggests that while content-based embeddings offer a superior representation for general trust identification, the descriptive embeddings provided little added value for these pillars.</p>

<p>This performance pattern is reversed for Low Quality documents. Specifically, the content-based scores for Low Quality documents were ineffective, proving to be indistinguishable from those violating other trust pillars or those labeled as clean. The GenAI-based approach, however, showed a distinct advantage: <strong>GenAI-description-based scores were significantly higher for Low Quality documents</strong> compared to all others. This result indicates that the <strong>descriptive summary is crucial for accurately identifying this specific type of document</strong>.</p>

<figure>
    <img width="662" alt="Figure 5. Trust Score Distribution Comparison of Content vs. GenAI-Description Trust Scores." src="/post-images/2026-content-trust/content-trust-score-Figure-5.jpg" />
  <figcaption><strong>Figure 5. Trust Score Distribution Comparison of Content vs. GenAI-Description Trust Scores.</strong> Violin plots showing the distribution of trust scores for documents belonging to a specific violation pillar (blue) compared to all other documents (red; other pillars in scope or clean documents).</figcaption>
</figure>

<p>For completeness and to verify that our results were not skewed by the presence of other violating documents, we conducted a final comparative analysis by isolating the scores of labeled documents against only the clean, non-violating documents. As evident in Figure 6, the core patterns persist: the content-based scores consistently yield superior separation between violating content (blue) and clean content (green) for the Illegal, Explicit, and PII/Privacy pillars (Figure 6a-c). In sharp contrast, the GenAI-description-based scores for these same three pillars exhibit significantly greater distribution overlap. Conversely, for the Low Quality pillar (Figure 6d), the GenAI-description method again established a much clearer boundary from the clean documents than the content-based method, further validating our hybrid scoring approach.</p>

<figure>
    <img width="662" alt="Figure 6. Trust Score Distribution Comparing Pillars Exclusively to Clean Documents." src="/post-images/2026-content-trust/content-trust-score-Figure-6.jpg" />
  <figcaption><strong>Figure 6. Trust Score Distribution Comparing Pillars Exclusively to Clean Documents.</strong> Violin plots showing the distribution of scores for documents belonging to a specific violation pillar (blue) compared only to Clean documents (green).</figcaption>
</figure>

<h3 id="score-generation-for-all-documents">Score Generation for All Documents</h3>
<p>Based on these differentiating findings, we adopted a <strong>hybrid scoring approach:</strong> we use the content-based trust scores for the Illegal, Explicit, and PII/Privacy pillars, and the GenAI-description-based trust scores for the Low Quality pillar. This decision enabled the computation of the most effective Content Trust Scores for all documents in our library across every trust pillar.</p>
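<p>Operationally, the hybrid rule is a lookup: three pillars read the content-based score, one reads the GenAI-description-based score. A sketch (the pillar keys are illustrative spellings):</p>

```python
# Which embedding source won the comparison for each pillar (per Figures 5-6).
SCORE_SOURCE = {
    "illegal": "content",
    "explicit": "content",
    "pii_privacy": "content",
    "low_quality": "genai_description",
}

def hybrid_score(pillar: str, content_score: float, description_score: float) -> float:
    """Pick the score from whichever embedding best separates this pillar."""
    if SCORE_SOURCE[pillar] == "content":
        return content_score
    return description_score

assert hybrid_score("illegal", 0.8, 0.3) == 0.8      # content-based wins
assert hybrid_score("low_quality", 0.2, 0.7) == 0.7  # description-based wins
```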

<h2 id="classification-through-threshold-setting">Classification Through Threshold Setting</h2>
<p>The content trust score reflects the extent to which a document violates a specific pillar – a high score indicates that the document closely resembles the defined trust violation type. To build a classification system that flags violations, we must determine an optimal score threshold.</p>

<h3 id="strategic-thresholding-prioritizing-precision">Strategic Thresholding: Prioritizing Precision</h3>
<p>In this work, we chose to <strong>prioritize precision</strong> to build a high-confidence classification system. Our goal is to maintain a very low mislabeling rate, specifically aiming for a false positive rate (FPR) close to 1%. This decision is driven by the need to minimize user friction – incorrectly flagging documents as violating trust pillars would be an undesirable user experience, making the avoidance of high FPR our primary concern.</p>
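<p>One way to operationalize an FPR target like this is to sweep candidate thresholds over labeled scores and keep the lowest admissible one, since a lower threshold flags more true violations. The sketch below runs on synthetic scores and illustrates the idea, not our production procedure:</p>

```python
import numpy as np

def pick_threshold(scores, labels, target_fpr=0.01):
    """Lowest threshold whose FPR stays at or below the target.

    scores: trust scores; labels: True if the document violates the pillar.
    Returns (threshold, recall, fpr), or None if no candidate qualifies.
    """
    scores = np.asarray(scores)
    labels = np.asarray(labels, dtype=bool)
    for t in np.arange(0.50, 0.96, 0.05):  # ascending: first hit maximizes recall
        flagged = scores >= t
        fpr = flagged[~labels].mean()
        if fpr <= target_fpr:
            return float(t), float(flagged[labels].mean()), float(fpr)
    return None

# Synthetic example: violating documents score high, clean documents score low.
rng = np.random.default_rng(1)
scores = np.concatenate([rng.uniform(0.7, 1.0, 100),    # 100 violating docs
                         rng.uniform(0.0, 0.6, 9900)])  # 9,900 clean docs
labels = np.concatenate([np.ones(100, bool), np.zeros(9900, bool)])
threshold, recall, fpr = pick_threshold(scores, labels)
```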

<h3 id="building-the-evaluation-dataset">Building the Evaluation Dataset</h3>
<p>The inherently low document count for certain violation types (e.g., Explicit) prevented us from performing reliable analyses to determine classification thresholds. To address this methodological challenge, we developed an expanded evaluation dataset. This was built by taking the original modeling data (both training and testing sets) and augmenting it with a high volume of additional human-annotated documents from our existing corpus. By incorporating this high-volume, high-quality labeled data, we established a more comprehensive baseline for threshold analysis. To ensure fair comparisons between the content-based and GenAI-description-based scores, we filtered the data to only include documents with both scores available. This refinement resulted in a final working total of approximately 109,000 documents in the evaluation dataset.</p>

<h3 id="final-classification-thresholds">Final Classification Thresholds</h3>
<p>For each of the four in-scope trust pillars, we calculated classification metrics, specifically recall and false positive rate (FPR), across a range of thresholds (0.50 to 0.95). In line with the precision-first strategy described above, we selected the threshold for each pillar that keeps its FPR close to 1%, minimizing the user friction associated with false flagging. The final score thresholds for the four Trust Pillars’ classification systems are summarized in Table 2.</p>

<figure class="figure-table">
  <figcaption><strong>Table 2. Classification metrics at the chosen thresholds for the Trust Pillars.</strong></figcaption>
  <table>
    <thead>
      <tr>
        <th>Trust Pillar</th>
        <th>Score Threshold</th>
        <th>Recall</th>
        <th>False Positive Rate</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>Illegal</td>
        <td>0.80</td>
        <td>71.83%</td>
        <td>0.79%</td>
      </tr>
      <tr>
        <td>Explicit</td>
        <td>0.80</td>
        <td>10.22%</td>
        <td>1.07%</td>
      </tr>
      <tr>
        <td>PII/Privacy</td>
        <td>0.75</td>
        <td>3.82%</td>
        <td>0.62%</td>
      </tr>
      <tr>
        <td>Low Quality</td>
        <td>0.60</td>
        <td>27.20%</td>
        <td>0.52%</td>
      </tr>
    </tbody>
  </table>
</figure>

<p>The analysis revealed that the Illegal pillar achieved the optimal balance of metrics, securing a high recall of 72% while maintaining an excellent FPR of 0.79%. The Low Quality pillar, which relies on the GenAI-description-based scores, achieved a respectable recall of 27.2% with a very low FPR of 0.52%. This outcome validates our decision to utilize the descriptive embeddings for this challenging content type.</p>

<p>However, this high-performance scenario was not replicated across all Trust Pillars. Specifically, the strict FPR target limited the system’s ability to capture certain violations, with Explicit and PII/Privacy achieving only a recall of 10% and 4%, respectively. This disparity highlights the inherent challenges in identifying documents violating these two pillars, as their topical language is much broader and less defined compared to the other classes.</p>

<p>These results serve as an initial performance baseline. We are actively exploring internal refinements to our embedding representations and scoring logic, as well as integrating complementary models, to progressively enhance detection sensitivity. Our goal is to expand coverage across these more complex pillars while strictly upholding our commitment to a low false-positive environment.</p>

<h2 id="discussion">Discussion</h2>
<p>Our work demonstrates a straightforward and flexible content moderation system by effectively leveraging classical machine learning principles (cosine similarity, thresholding) alongside modern Large Language Models (LLMs) for superior document representation. This hybrid approach offers several key operational and technical advantages:</p>

<h3 id="technical-and-operational-advantages">Technical and Operational Advantages</h3>
<ul>
  <li><strong>Scalability and Efficiency:</strong> The final content trust score calculation relies on simple vector mathematics (cosine similarity) against pre-computed pillar embeddings. This allows the system to run efficiently at scale with a low computational cost for real-time inference.</li>
  <li><strong>Customizable Representations:</strong> The system is easy to fine-tune, allowing us to quickly update the trust category representations (the Pillar Embeddings) using new data. This flexibility is critical for adapting the system to the unique data and specific violation nuances present in our library.</li>
  <li><strong>Enhanced Contextual Understanding:</strong> Incorporating LLM-generated summaries provides a level of contextual understanding that helps handle the nuance and ambiguity often present in challenging document types (e.g., extremely long documents or those with minimal text).</li>
  <li><strong>Resilience to Emerging Threats:</strong> The use of semantic embeddings, which capture underlying meaning rather than just keywords, allows the system to adapt well to new or evolving types of harmful content without requiring constant manual rule updates.</li>
</ul>

<h3 id="potential-applications">Potential Applications</h3>
<p>The Content Trust Score and the underlying classification system created in this project open the door to various critical applications at Scribd:</p>
<ul>
  <li><strong>Content Safety in Discovery:</strong> Serving as a primary filter to ensure safe content appears prominently in search results and recommendation feeds. Our N-way testing experiments revealed that filtering unsafe content from search results significantly increases core business metrics (e.g., signups) and user engagement (e.g., read time).</li>
</ul>

<h2 id="further-reading">Further Reading</h2>
<p>This project was recently presented at TrustCon 2025. For those interested in a visual walkthrough of the dual-embedding approach, you can view the <a href="https://www.slideshare.net/slideshow/enhancing-content-moderation-with-dual-embedding-trust-scoring-using-llm-summarization/286257301?utm_source=clipboard_share_button&amp;utm_campaign=slideshare_make_sharing_viral_v2&amp;utm_variation=control&amp;utm_medium=share">full presentation slides on Slideshare</a>.</p>

<h2 id="acknowledgments">Acknowledgments</h2>
<p>This work was a collaborative effort, and we are incredibly grateful to the following individuals and teams for their invaluable contributions:</p>
<ul>
  <li><strong><a href="https://www.linkedin.com/in/raflac/">Rafael Lacerda</a></strong>, <strong><a href="https://www.linkedin.com/in/moniquealvescruz/">Monique Alves Cruz</a></strong>, and <strong><a href="https://www.linkedin.com/in/seyoonkim/">Seyoon Kim</a></strong> for their strategic guidance and steadfast support throughout the project.</li>
  <li><strong><a href="https://www.linkedin.com/in/johnstrenio/">John Strenio</a></strong> for his foundational research and exploratory work that paved the way for this initiative.</li>
  <li><strong><a href="https://www.linkedin.com/in/kara-killough/">Kara Killough</a></strong> for her diligent efforts in building the high-quality annotated datasets that powered our models.</li>
  <li>The <strong>Search and Recommendation Teams</strong> for their partnership and agility in integrating the trust scores, directly driving the measurable improvements in our user experience and business metrics.</li>
</ul>]]></content><author><name>Eric Chang</name></author><category term="machinelearning" /><category term="scribd" /><category term="featured" /><category term="content-trust-series" /><summary type="html"><![CDATA[Scribd is a digital library serving academics and lifelong learners, offering hundreds of millions of documents. This very nature presents a significant concern: content trust and safety. Protecting our library from undesirable and unsafe content is a top priority, but the multilingual and multimodal (text and images) nature of our platform makes this mission very challenging. Also, while third-party tools exist, they often fall short, lacking the nuance to handle our specific trust and safety categories.]]></summary></entry><entry><title type="html">Screaming in the Cloud</title><link href="https://tech.scribd.com/blog/2026/screaming-in-the-cloud.html" rel="alternate" type="text/html" title="Screaming in the Cloud" /><published>2026-02-10T00:00:00+00:00</published><updated>2026-02-10T00:00:00+00:00</updated><id>https://tech.scribd.com/blog/2026/screaming-in-the-cloud</id><content type="html" xml:base="https://tech.scribd.com/blog/2026/screaming-in-the-cloud.html"><![CDATA[<p>Scribd has absolutely fascinating data-at-scale type problems, all the way
down to the fundamentals of how we use AWS S3. In my <a href="/blog/2026/content-crush.html">previous
post</a> I wrote about the design of Content
Crush and how Scribd is consolidating objects in S3 to minimize our costs.
Related to that work I was fortunate enough to join the (in)famous <a href="https://www.linkedin.com/in/coquinn/">Corey
Quinn</a> to talk about <strong>Engineering around Extreme S3 scale</strong>:</p>

<p><em>Checking if files are damaged? $100K. Using newer S3 tools? Way too expensive.
Normal solutions don’t work anymore. Tyler shares how with this much data, you
can’t just throw money at the problem, but rather you have to engineer your way
out.</em></p>

<iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/TZj38Bm1DC4?si=m_jo0HOFPHqPC--2" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen=""></iframe>

<p>You can also listen
<a href="https://www.everand.com/podcast/980476250/Engineering-Around-Extreme-S3-Scale-with-R-Tyler-Croy">On Everand</a>
or watch via the 
<a href="https://www.youtube.com/watch?v=TZj38Bm1DC4">Last Week in AWS YouTube channel</a></p>]]></content><author><name>R Tyler Croy</name></author><category term="aws" /><category term="featured" /><summary type="html"><![CDATA[Scribd has absolutely fascinating data-at-scale type problems, all the way down to the fundamentals of how we use AWS S3. In my previous post I wrote about the design of Content Crush and how Scribd is consolidating objects in S3 to minimize our costs. Related to that work I was fortunate enough to join the (in)famous Corey Quinn to talk about Engineering around Extreme S3 scale:]]></summary></entry><entry><title type="html">Deploying a Cost-Effective, Scalable PhotoDNA System for CSAM Detection</title><link href="https://tech.scribd.com/blog/2026/photodna-csam-detection.html" rel="alternate" type="text/html" title="Deploying a Cost-Effective, Scalable PhotoDNA System for CSAM Detection" /><published>2026-01-20T00:00:00+00:00</published><updated>2026-01-20T00:00:00+00:00</updated><id>https://tech.scribd.com/blog/2026/photodna-csam-detection</id><content type="html" xml:base="https://tech.scribd.com/blog/2026/photodna-csam-detection.html"><![CDATA[<p>Child safety is a non‑negotiable responsibility for any platform that hosts user‑generated content. Over the last year, we designed and deployed a production system that detects known Child Sexual Abuse Material (CSAM) using PhotoDNA perceptual hashes, integrates with the National Center for Missing and Exploited Children’s (NCMEC) reporting system, and scales efficiently across our ingestion surfaces. This post explains the problem we set out to solve, how PhotoDNA hashing works, the online child-protection ecosystem (NCMEC, Tech Coalition, Project Lantern), our architecture and operational model, cost considerations, and key learnings.</p>

<p>Note: This article discusses safety technology at a high level. We intentionally omit sensitive operational details to protect the effectiveness of these defenses.</p>

<h3 id="problem-accurate-csam-detection-at-scale-within-strict-safety-and-cost-constraints">Problem: Accurate CSAM detection at scale, within strict safety and cost constraints</h3>

<p>We needed to:</p>

<ul>
  <li>Accurately detect known CSAM at upload and in historical backfills.</li>
  <li>Minimize false positives while keeping latency low on critical paths.</li>
  <li>Meet obligations for reporting to NCMEC and preserve chain‑of‑custody evidence.</li>
  <li>Fit within pragmatic cost envelopes and scale elastically with traffic.</li>
  <li>Integrate into Scribd’s existing ML and batch compute ecosystem for observability, auditability, and maintainability.</li>
</ul>

<h3 id="the-ecosystem-tech-coalition-project-lantern-photodna-and-ncmec">The ecosystem: Tech Coalition, Project Lantern, PhotoDNA, and NCMEC</h3>

<ul>
  <li><strong><a href="https://technologycoalition.org/">Tech Coalition and Project Lantern</a>:</strong> An industry consortium and initiative to strengthen cross‑platform child safety, including responsible signal sharing that helps disrupt abusers across services. Lantern focuses on sharing signals that increase detection of predatory accounts and coordinated abuse while respecting privacy and legal constraints.</li>
  <li><strong><a href="https://www.microsoft.com/en-us/photodna">PhotoDNA</a>:</strong> A perceptual hashing technology created by Microsoft in collaboration with Dartmouth College. PhotoDNA transforms an image into a robust hash that stays stable across common modifications (resize, recompress, minor color adjustments). Matching is performed against vetted hash sets of known illegal content.</li>
  <li><strong><a href="https://www.missingkids.org/ourwork/ncmecdata">NCMEC (National Center for Missing &amp; Exploited Children)</a>:</strong> A US-based nonprofit with a Congressional mandate to serve as the clearinghouse and triage point for CSAM reports in the United States via the CyberTipline. NCMEC also serves as the steward of vetted CSAM hash sets. US-based platforms are required to report confirmed CSAM and retain appropriate artifacts for law enforcement.</li>
</ul>

<h3 id="how-photodna-csam-detection-works-at-a-glance">How PhotoDNA CSAM detection works (at a glance)</h3>

<ol>
  <li>An image is normalized (e.g., resized, converted to a canonical colorspace). For PDFs, we first extract images.</li>
  <li>A perceptual transformation produces a PhotoDNA hash vector.</li>
  <li>We compare that hash to vetted hash sets using a distance threshold tuned to minimize both false positives and false negatives.</li>
  <li>A match triggers automated containment (quarantine/blocks), evidence preservation, safety review, and NCMEC reporting workflows.</li>
</ol>
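<p>PhotoDNA itself is licensed and its transform is not public, so the sketch below substitutes a generic perceptual "difference hash" to illustrate steps 2-3: a hash that stays stable under small pixel changes, matched by Hamming distance against a vetted set. Everything here is a simplified stand-in, not the PhotoDNA algorithm:</p>

```python
import numpy as np

def dhash(gray: np.ndarray, size: int = 8) -> np.ndarray:
    """Stand-in perceptual hash: compare adjacent pixels of a downsampled image.

    `gray` is a 2-D grayscale array. PhotoDNA's real transform differs, but
    the matching step works the same way on its hashes.
    """
    h, w = gray.shape
    ys = np.arange(size) * h // size            # sampled rows
    xs = np.arange(size + 1) * w // (size + 1)  # sampled columns
    small = gray[np.ix_(ys, xs)].astype(float)
    return (small[:, 1:] > small[:, :-1]).flatten()  # 64 bits

def matches(image_hash: np.ndarray, vetted: np.ndarray, max_distance: int = 10) -> bool:
    """True if any vetted hash is within the Hamming-distance threshold."""
    distances = np.sum(vetted != image_hash, axis=1)
    return bool(distances.min() <= max_distance)

rng = np.random.default_rng(2)
img = rng.integers(0, 256, size=(64, 64))
# Simulate a light recompression: small per-pixel perturbations.
recompressed = np.clip(img + rng.integers(-3, 4, size=img.shape), 0, 255)

vetted_set = np.stack([dhash(img)])  # a one-entry "known" hash set
assert matches(dhash(recompressed), vetted_set)  # hash survives the edit
```

Because the hash encodes relative brightness of neighboring samples rather than exact pixel values, minor recompression artifacts flip few bits, which is what makes distance-threshold matching against vetted sets viable.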

<h3 id="architecture">Architecture</h3>

<p>At a high level, we separate event-driven, highly parallel PhotoDNA hash generation from a high-throughput, GPU-based batch PhotoDNA matcher. The components in our design are AWS services, but equivalents from any other hyperscaler will suffice.</p>

<p>Key properties:</p>

<ul>
  <li>The deterministic matching path is GPU‑parallel, horizontally scalable, and isolated from image transform and hash generation.</li>
  <li>Hash set updates are versioned and rolled atomically; match records include hash‑set version.</li>
  <li>Matches are logged and reviewed.</li>
</ul>

<p><img src="/post-images/2026-content-trust/photodna-csam-detection-system.png" alt="PhotoDNA CSAM Detection System diagram" /></p>

<p>The diagram above shows the high-level architecture of our PhotoDNA CSAM Detection System. The system is designed to be cost-effective, scalable, and efficient.</p>

<h3 id="hasher-and-matcher-details">Hasher and matcher details</h3>

<h4 id="hasher-event-driven-and-highly-parallel">Hasher: event driven and highly parallel</h4>

<ul>
  <li>Image sources: raw images and images extracted from PDFs (embedded image extraction is deterministic and versioned).</li>
  <li>Parallelism: Each PDF document is processed in parallel by evented and isolated compute (AWS Lambda).</li>
  <li>Storage: PhotoDNA hashes are versioned and stored for every extracted image.</li>
  <li>Observability: structured metrics (throughput, error codes, backlog depth) and end‑to‑end lineage identifiers provide auditability.</li>
</ul>

<h4 id="matcher-highthroughput-batch">Matcher: high‑throughput batch</h4>

<ul>
  <li>Vetted hash sets are loaded for matching; where feasible, we keep matching structures memory‑resident to maximize throughput.</li>
  <li>Batched distance computations with conservative thresholds minimize false positives; thresholds and policies are versioned.</li>
  <li>Aggregation: combine duplicate or near‑duplicate image evidence into per‑asset decisions and preserve the strongest evidence for review.</li>
  <li>Events and evidence: emit match events to quarantine/review flows and include hash‑set version and metadata for audit.</li>
</ul>
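<p>The aggregation step above can be sketched as follows; the record fields and key names here are illustrative, not our production schema.</p>

```python
def per_asset_decision(image_matches):
    """Collapse per-image match records for one asset (e.g. one PDF) into a
    single decision, preserving the strongest evidence: the record with the
    smallest perceptual distance. Field names are illustrative."""
    hits = [m for m in image_matches if m.get("matched")]
    if not hits:
        return {"action": "clear"}
    strongest = min(hits, key=lambda m: m["distance"])
    return {
        "action": "quarantine",
        "evidence": strongest["image_key"],                 # strongest evidence kept for review
        "distance": strongest["distance"],
        "hash_set_version": strongest["hash_set_version"],  # recorded for audit
    }

matches = [
    {"image_key": "doc1/imgs/0.jpg", "matched": False},
    {"image_key": "doc1/imgs/1.jpg", "matched": True, "distance": 3, "hash_set_version": "v7"},
    {"image_key": "doc1/imgs/2.jpg", "matched": True, "distance": 1, "hash_set_version": "v7"},
]
assert per_asset_decision(matches)["evidence"] == "doc1/imgs/2.jpg"
assert per_asset_decision([{"image_key": "a", "matched": False}]) == {"action": "clear"}
```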

<h3 id="lessons-learned--best-practices">Lessons Learned &amp; Best Practices</h3>

<h4 id="which-ncmec-hash-set-to-use">Which NCMEC hash set to use?</h4>

<p>We prioritize vetted, legally curated sources:</p>

<ul>
  <li>Primary: NCMEC‑provided hash sets for known CSAM.</li>
  <li>Supplementary: Industry‑shared signals via Tech Coalition initiatives (e.g., Project Lantern) where applicable and approved.</li>
</ul>

<p>Operationally, we version, verify, and roll out hash updates.</p>

<h4 id="where-do-gpus-come-in">Where do GPUs come in?</h4>

<p>In our final design and implementation, graphics processing units (GPUs) materially improved throughput and unit cost for PhotoDNA hashing when run as SageMaker Batch workloads. We containerized the PhotoDNA pipeline and executed it on GPU‑backed instances to accelerate matching, enabling us to meet tight batch service-level objectives (SLOs) and backfill schedules with fewer nodes.</p>

<ul>
  <li>Batched matching on GPU nodes via SageMaker Batch/Processing reduced runtimes significantly.</li>
  <li>GPU‑accelerated transforms improved end‑to‑end throughput.</li>
  <li>Higher throughput per node reduced cost at scale.</li>
</ul>

<h4 id="learnings-from-microsofts-photodna-guidance">Learnings from Microsoft’s PhotoDNA guidance</h4>

<ul>
  <li>Preprocessing matters: adhere to canonical normalization steps (grayscale, downsample strategy) or use the vetted cloud service where appropriate.</li>
  <li>Treat thresholds conservatively; don’t repurpose perceptual distances beyond vetted safety use cases.</li>
  <li>Keep auditable logs of match context and system versions; separate operational telemetry from sensitive evidence artifacts.</li>
</ul>

<h3 id="machine-learning-ml-deployment-at-scribd-observability-and-operational-rigor">Machine learning (ML) deployment at Scribd: Observability and operational rigor</h3>

<p>Although PhotoDNA isn’t “a model” we train, we run complementary ML components with rigorous observability:</p>

<ul>
  <li><strong>Weights &amp; Biases (W&amp;B)</strong>: We host versioned models in the W&amp;B Model Registry, which provides lineage and provenance for audit. SageMaker Batch jobs resolve the model version from the registry to ensure reproducibility.</li>
  <li><strong>AWS SageMaker Batch Inference</strong>: Runs batch inference jobs using standardized containers, consistent IAM boundaries, and autoscaling.</li>
</ul>

<h3 id="cost-model">Cost model</h3>

<p>We sized for steady‑state uploads and periodic backfills:</p>

<ul>
  <li><strong>Compute</strong>: GPU‑backed SageMaker Batch for PhotoDNA hashing improved throughput/SLOs and, when saturated, delivered better $/throughput than equivalently provisioned CPU fleets.</li>
  <li><strong>Storage</strong>: Keep only what is necessary for safety review and legal retention. Use lifecycle policies and tiering for aging artifacts.</li>
  <li><strong>Queueing and elasticity</strong>: Amazon Simple Queue Service (SQS) buffers absorb bursts; autoscaling workers maintain SLOs without overprovisioning.</li>
  <li><strong>Hash set operations</strong>: Updates are small; cost is dominated by compute and storage around matches and evidence.</li>
</ul>

<p>In practice, the unit economics are driven by: input volume, match rate (rare but higher cost per event), retention windows, and backfill cadence.</p>
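<p>Those drivers can be captured in a toy unit-economics model. Every coefficient below is a hypothetical placeholder, not a real Scribd figure; the point is only the structure: hashing cost scales with input volume, match events are rare but individually expensive, and retention adds a storage term.</p>

```python
def monthly_unit_cost(inputs, match_rate, hash_cost, match_event_cost,
                      retention_gb, gb_month_cost):
    """Toy model of the cost drivers listed above; all coefficients are
    hypothetical placeholders."""
    hashing = inputs * hash_cost                       # scales with input volume
    matches = inputs * match_rate * match_event_cost   # rare, but costly per event
    storage = retention_gb * gb_month_cost             # retention window
    return hashing + matches + storage

# Doubling input volume roughly doubles cost when the match rate is tiny:
base = monthly_unit_cost(1_000_000, 1e-6, 0.0001, 5.0, 100, 0.004)
assert monthly_unit_cost(2_000_000, 1e-6, 0.0001, 5.0, 100, 0.004) > 1.9 * base
```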

<h3 id="wins">Wins</h3>

<ul>
  <li><strong>Safety‑first by design</strong>: Deterministic matching path is simple, fast, and auditable.</li>
  <li><strong>Operational clarity</strong>: Clear blast‑radius boundaries between hashing, matching, enrichment, and reporting.</li>
  <li><strong>Scalable and cost‑effective</strong>: GPU‑accelerated hashing on SageMaker Batch achieved high throughput and favorable unit economics at scale.</li>
  <li><strong>Stronger together</strong>: Collaboration with the ecosystem improves coverage and response speed.</li>
</ul>

<h3 id="operational-guardrails-and-compliance">Operational guardrails and compliance</h3>

<ul>
  <li>Strict identity and access management (IAM) boundaries; least‑privilege for all safety components.</li>
  <li>Immutable logging with retention; separate telemetry from sensitive evidence.</li>
  <li>Privacy and data minimization: collect only what’s necessary for safety and compliance.</li>
</ul>

<h3 id="acknowledgments">Acknowledgments</h3>

<p>This was truly a cross‑functional effort. Thank you:</p>

<ul>
  <li>Machine Learning and Data Engineering team</li>
  <li>Product Managers</li>
  <li>Infrastructure</li>
  <li>Legal</li>
  <li>Partners at NCMEC and Microsoft</li>
  <li><a href="https://technologycoalition.org/">Industry peers via Tech Coalition</a> initiatives, including <a href="https://technologycoalition.org/programs/lantern/">Project Lantern</a></li>
</ul>

<h3 id="collaboration-highlights">Collaboration highlights</h3>

<ul>
  <li>Ongoing alignment with NCMEC reporting workflows (evidence packaging, retention, and audit trails).</li>
  <li>Incorporating best practices from Microsoft’s PhotoDNA guidance for normalization and thresholding.</li>
  <li>Participation with industry groups (e.g., Tech Coalition/Project Lantern) to improve cross‑platform defenses.</li>
</ul>

<h3 id="appendix-faqs">Appendix: FAQs</h3>

<ul>
  <li><strong>Does PhotoDNA require GPUs?</strong> No. However, in our SageMaker Batch implementation, GPUs significantly improved throughput and cost for large‑scale hashing, so we run hashing on GPU for batch workloads.</li>
  <li><strong>How are false positives handled?</strong> Conservative thresholds plus human‑in‑the‑loop review on any flagged item before reporting or account actions.</li>
</ul>]]></content><author><name>Anish Kumar</name></author><category term="featured" /><category term="aws" /><category term="lambda" /><category term="databricks" /><category term="content-trust-series" /><summary type="html"><![CDATA[Child safety is a non‑negotiable responsibility for any platform that hosts user‑generated content. Over the last year, we designed and deployed a production system that detects known Child Sexual Abuse Material (CSAM) using PhotoDNA perceptual hashes, integrates with the National Center for Missing and Exploited Children’s (NCMEC) reporting system, and scales efficiently across our ingestion surfaces. This post explains the problem we set out to solve, how PhotoDNA hashing works, the online child-protection ecosystem (NCMEC, Tech Coalition, Project Lantern), our architecture and operational model, cost considerations, and key learnings.]]></summary></entry><entry><title type="html">Supercharging S3 Intelligent Tiering with Content Crush</title><link href="https://tech.scribd.com/blog/2026/content-crush.html" rel="alternate" type="text/html" title="Supercharging S3 Intelligent Tiering with Content Crush" /><published>2026-01-12T00:00:00+00:00</published><updated>2026-01-12T00:00:00+00:00</updated><id>https://tech.scribd.com/blog/2026/content-crush</id><content type="html" xml:base="https://tech.scribd.com/blog/2026/content-crush.html"><![CDATA[<p>Scribd and Slideshare have been using AWS S3 for almost <em>twenty years</em> and
store hundreds of <em>billions</em> of objects, making storage management quite a
challenge. My focus at Scribd has generally been around data and storage but
only in the past twelve months have I started to really focus on one of our
hardest technology problems: cost-effective storage and availability for the
hundreds of billions of objects that represent our content library.</p>

<p>Since adopting S3 for our object storage in 2007, a <em>lot</em> has changed with the service, most
notably <a href="https://aws.amazon.com/s3/storage-classes/intelligent-tiering/">Intelligent
Tiering</a> which <a href="https://aws.amazon.com/blogs/aws/new-automatic-cost-optimization-for-amazon-s3-via-intelligent-tiering/">was
introduced in
2018</a>.
At a very high level Intelligent Tiering allows object access patterns to
dictate the storage tier for a small per-object monitoring fee. Behind the
scenes S3 manages moving objects which are infrequently accessed into cheaper
storage.</p>

<p>For most organizations simply adopting Intelligent Tiering is the right
solution to save on S3 storage costs. For Scribd however the sheer number of
objects in our buckets makes the problem much more complex.</p>

<blockquote>
  <p><strong>Cost management is an architecture problem</strong></p>

  <p><a href="https://www.linkedin.com/posts/miketjulian_as-pete-alludes-to-cost-management-is-an-activity-7061767041002713088-B1M2">Mike
Julian</a>
of <a href="https://www.duckbillgroup.com">duckbill</a>.</p>
</blockquote>

<p>The <em>small per-object monitoring fee</em> adds up to some serious numbers. While
monitoring 100 million objects costs $250/month, the monitoring fees for 100
<em>billion</em> is $250,000/month. Billion is such a big number that it is hard to
make sense of it sometimes.</p>

<p>The difference between million and billion is <strong>a lot</strong>. Intelligent tiering was not going to work for Scribd unless we found a way to reduce or remove hundreds of billions of objects!</p>
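<p>The scaling can be sanity-checked in one line; this assumes the fee scales linearly from the $250/month per 100 million objects figure above, which holds because the monitoring charge is billed per object.</p>

```python
def monitoring_fee_usd(objects, usd_per_100m=250.0):
    """Monthly S3 Intelligent-Tiering monitoring fee, extrapolated linearly
    from the $250/month per 100 million objects figure in the post."""
    return objects / 100_000_000 * usd_per_100m

assert monitoring_fee_usd(100_000_000) == 250.0          # 100 million objects
assert monitoring_fee_usd(100_000_000_000) == 250_000.0  # 100 billion objects
```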

<h2 id="content-crush">Content Crush</h2>

<p>When users upload a document or presentation to Scribd and Slideshare a lot of
machinery kicks in to process the file and converts it into a multitude of
smaller files to ensure clear and correct rendering on a variety of devices.
Further post-processing is done to help Scribd’s systems understand the
document with a multitude of generated textual and image-based metadata. As a
result one file upload might result in hundreds or sometimes <em>thousands</em> of
different objects being produced in various storage locations.</p>

<p><strong>Content Crush</strong> is the system we have built to bring all these objects back
into a <em>single</em> stored object while preserving a virtualized keyspace and the
discrete retrieval semantics for systems which rely on these smaller files.</p>

<p>Before Content Crush a single document upload could produce something like the following tree:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>s3://bucket/guid/
               /info.json
               /metadata.json
               /imgs/
                    /0.jpg
                    /1.jpg
               /fmts/
                    /original.pdf
                    /v1.tar.bz2
                    /v2.zip
               /other/
                     /random.uuid
                     /debian.iso
                     /dbg.txt
</code></pre></div></div>

<p>After Content Crush these different objects are collapsed into a single <a href="https://parquet.apache.org">Apache Parquet</a> file in S3:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>s3://bucket/guid.parquet
</code></pre></div></div>

<p>We became intimately familiar with the Parquet file format from our work creating <a href="https://github.com/delta-io/delta-rs">delta-rs</a>. The format was designed in a way that really excels in object storage systems like AWS S3. For example:</p>

<ul>
  <li>S3 allows <code class="language-plaintext highlighter-rouge">GetObject</code> with byte ranges for partial reads of an object; most
importantly, it allows for <em>negative offset</em> reads. This allows fetching the
<em>last</em> <code class="language-plaintext highlighter-rouge">N</code> bytes of a file.</li>
  <li>Parquet stores its metadata at the <em>end</em> of a file; the final 8 bytes hold the 4-byte footer
length followed by the <code class="language-plaintext highlighter-rouge">PAR1</code> magic. One can read all of a file’s metadata with
two calls: <code class="language-plaintext highlighter-rouge">GetObject(-8)</code> followed by <code class="language-plaintext highlighter-rouge">GetObject(-footer_len)</code>.</li>
  <li>Parquet’s footer metadata records the byte offsets of each row
group, allowing retrieval of any one of <code class="language-plaintext highlighter-rouge">N</code> row groups rather than requiring full
object reads.</li>
  <li>Additional user-provided metadata in the file footer allows for further optimizations around selective reads.</li>
</ul>
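<p>The two-call read pattern can be sketched against an in-memory stand-in for an S3 object. A real client would issue <code>GetObject</code> with suffix <code>Range</code> headers (<code>bytes=-8</code>, then <code>bytes=-(footer_len + 8)</code>); here, negative slicing plays that role.</p>

```python
import struct

PAR1 = b"PAR1"

def footer_len_from_tail(tail8):
    """Parse the last 8 bytes of a Parquet file: a 4-byte little-endian
    footer length followed by the PAR1 magic."""
    assert tail8[4:] == PAR1, "not a Parquet file"
    return struct.unpack("<I", tail8[:4])[0]

# Stand-in for an S3 object; suffix slicing mimics GetObject with Range="bytes=-N".
metadata = b"\x01" * 123  # pretend thrift-encoded footer metadata
obj = b"...row group bytes..." + metadata + struct.pack("<I", len(metadata)) + PAR1

footer_len = footer_len_from_tail(obj[-8:])   # call 1: Range="bytes=-8"
footer = obj[-(footer_len + 8):-8]            # call 2: Range="bytes=-{footer_len + 8}"
assert footer_len == 123 and footer == metadata
```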

<p><img src="/post-images/2026-content-crush/parquet-file-layout.gif" alt="Parquet file layout" />
<a href="https://parquet.apache.org/docs/file-format/">Parquet file format</a></p>

<p>Without Apache Parquet, Content Crush fundamentally would not work. There is prior art for “compressing objects” into S3 with other formats, but for our purposes they all have downsides:</p>

<ul>
  <li><strong>Zip</strong>: Streamable but not suitable for random access inside the file</li>
  <li><strong>Tar</strong>: Also streamable but same issue as zip, then there’s nuance between different implementations.</li>
  <li><strong>Build your own</strong>: I looked into this but all my designs ended up looking like a less-good version of Apache Parquet.</li>
</ul>

<p>The original prototype implementation used <a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/amazons3-ol-change.html">S3 Object
Lambda</a>,
which allowed for a <em>seamless</em> drop-in for existing S3 clients: applications
could switch from one S3 Access Point to another without any indication that
they were accessing “crushed” files. Since Object Lambda has ceased to be,
Content Crush is being moved over to an S3 API-compatible service.</p>

<h3 id="downsides">Downsides</h3>

<p>No optimization is ever free, and crushed assets have a couple of caveats that
are important to consider:</p>

<ul>
  <li>Retrieval of a single “file” within a crushed object requires at least <em>two</em>
<code class="language-plaintext highlighter-rouge">GetObject</code> calls to retrieve the appropriate data. The worst case is <em>three</em>
since most Parquet readers will read the footer length, the footer, and
<em>then</em> fetch the data they seek. We can typically optimize this by hinting at
the footer size with a 95% estimate.</li>
  <li>This system works well with relatively static objects, since editing a “file”
inside of a crushed object requires the whole object to be read and then
re-written. There are also some concurrency concerns with object updates:
we must ensure that only one process is updating an object at a time.</li>
</ul>
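<p>The footer-size hint mentioned above works like this sketch: speculatively fetch a tail sized to cover roughly 95% of footers, and pay for a second request only on a miss. <code>get_range(n)</code> is a stand-in for a suffix-range <code>GetObject</code> call.</p>

```python
import struct

def read_footer(get_range, guess=64 * 1024):
    """Speculative footer read: fetch the last `guess` bytes in one request;
    fall back to a second request only when the real footer is larger than
    the guess. `get_range(n)` stands in for GetObject with Range="bytes=-n"."""
    tail = get_range(guess)
    footer_len = struct.unpack("<I", tail[-8:-4])[0]
    if footer_len + 8 <= len(tail):
        return tail[-(footer_len + 8):-8]      # hit: one round trip
    return get_range(footer_len + 8)[:-8]      # miss: one extra round trip

# Simulated object with a 300-byte footer:
obj = b"data" * 1000 + b"\x02" * 300 + struct.pack("<I", 300) + b"PAR1"
calls = []
def get_range(n):
    calls.append(n)
    return obj[-n:]

assert read_footer(get_range, guess=1024) == b"\x02" * 300 and len(calls) == 1
calls.clear()
assert read_footer(get_range, guess=100) == b"\x02" * 300 and len(calls) == 2
```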

<p>A related downside with maintaining an S3 API-compatible service is that
retrieving multiple files inside a single object cannot be easily pipelined
or streamed. There are a number of ways to solve for this that I am exploring,
but they all converge on a different API scheme entirely to take advantage of
HTTP2.</p>

<h3 id="upsides">Upsides!</h3>

<p>The ability to effectively use S3 Intelligent Tiering is by far the largest
benefit of this approach. With a dramatic reduction in object counts we can
adopt S3 Intelligent Tiering for large buckets in a way that provides <em>major</em>
cost improvements.</p>

<p>Fewer objects also makes tools like S3 Batch Operations viable for these
massive buckets.</p>

<p>There are also hidden performance optimizations now available that were not
possible before. For example, for heavily requested objects there are now
AZ-local caching opportunities, whether at the API service layer or simply by
pulling popular objects into S3 Express One Zone.</p>

<hr />

<p>Much of this work is ongoing and not completely open source. None of this
would have been possible without the stellar work by the folks in the <a href="https://github.com/apache/arrow-rs">Apache
Arrow Rust</a> community building the
high-performance <a href="https://crates.io/crate/parquet">parquet</a> crate. After we set
off on this path we learned of their similar work in <a href="https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/">Querying Parquet with
Millisecond
Latency</a>.</p>

<p>There remains <em>plenty</em> of work to be done building the foundational storage and
content systems at Scribd which power one of the world’s largest digital
libraries. If you’re interested in learning more we have a <a href="/careers/#open-positions">lot of positions
open</a> right now!</p>

<h2 id="presentation">Presentation</h2>

<p>Content Crush was originally shared at the August 2025 FinOps Meetup hosted by
<a href="https://www.duckbillgroup.com">duckbill</a>, with the slides from that event
hosted on Slideshare below:</p>

<iframe src="https://www.slideshare.net/slideshow/embed_code/key/aBexzIntT7GcC3" width="610" height="515" frameborder="0" marginwidth="0" marginheight="0" scrolling="no" style="border: var(--border-1) solid #CCC; border-width:1px; margin-bottom:5px; max-width:100%;" allowfullscreen=""></iframe>

<div style="margin-bottom:5px"><strong><a href="https://www.slideshare.net/slideshow/2025-08-san-francisco-finops-meetup-tiering-intelligently/282679847" title="2025-08-san-francisco-finops-meetup-tiering-intelligently" target="_blank">2025-08 San Francisco FinOps Meetup</a></strong> from <strong><a href="https://www.slideshare.net/RTylerCroy" target="_blank">RTylerCroy</a></strong></div>]]></content><author><name>R Tyler Croy</name></author><category term="rust" /><category term="aws" /><category term="featured" /><summary type="html"><![CDATA[Scribd and Slideshare have been using AWS S3 for almost twenty years and store hundreds of billions of objects making storage management quite a challenge. My focus at Scribd has generally been around data and storage but only in the past twelve months have I started to really focus on one of our hardest technology problems: cost-effective storage and availability for the hundreds of billions of objects that represent our content library.]]></summary></entry><entry><title type="html">Don’t hardcode IAM credentials in GitHub!</title><link href="https://tech.scribd.com/blog/2026/teraform-oidc-module.html" rel="alternate" type="text/html" title="Don’t hardcode IAM credentials in GitHub!" /><published>2026-01-06T00:00:00+00:00</published><updated>2026-01-06T00:00:00+00:00</updated><id>https://tech.scribd.com/blog/2026/teraform-oidc-module</id><content type="html" xml:base="https://tech.scribd.com/blog/2026/teraform-oidc-module.html"><![CDATA[<p>Scribd deploys a <em>lot</em> of code from GitHub to AWS using GitHub Actions, which
means many of our Actions need to access AWS resources. Managing AWS API keys
and tokens for different IAM users is time-consuming, brittle, and insecure.
Managing key-distribution between AWS and GitHub also makes it difficult to
track which keys go where, when they should be rotated, and what permissions
those keys have. Fortunately AWS supports creating <a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_providers_create_oidc.html">OpenID Connect identity
providers</a>
which is an ideal tool to handle this kind of cross-cloud authentication in a more
maintainable way.</p>

<p>From the AWS documentation:</p>

<blockquote>
  <p>IAM OIDC identity providers are entities in IAM that describe an external
identity provider (IdP) service that supports the OpenID Connect (OIDC)
standard, such as Google or Salesforce.</p>

  <p>You use an IAM OIDC identity provider when you want to establish trust
between an OIDC-compatible IdP and your AWS account.</p>
</blockquote>

<p>The following diagram from <a href="https://docs.github.com/en/actions/deployment/security-hardening-your-deployments/about-security-hardening-with-openid-connect#getting-started-with-oidc">GitHub’s documentation</a>
gives an overview of how GitHub’s OIDC provider integrates with your workflows
and cloud provider:</p>

<p><img src="/post-images/2026-oidc/oidc-architecture.webp" alt="OIDC diagram from GitHub documentation" /></p>

<p>From within GitHub Actions we can specify the repository and role to assume
via the
<a href="https://github.com/aws-actions/configure-aws-credentials">aws-actions/configure-aws-credentials</a>
action, which will automatically configure the necessary credentials for AWS
SDK operations inside the job.</p>

<p>Our newly open sourced <a href="https://github.com/scribd/terraform-oidc-module"><strong>terraform-oidc-module</strong></a> makes setting up the resources necessary to bridge the gap between AWS and GitHub <em>much</em> simpler.</p>

<hr />

<p>Tying OIDC together between AWS and a single GitHub repository starts with the
<code class="language-plaintext highlighter-rouge">aws_iam_openid_connect_provider</code> resource, but then developers must also
configure resources and permissions for common deployment tasks such as:</p>

<ul>
  <li><strong>access an S3 bucket with read-only permissions</strong></li>
  <li><strong>access an S3 bucket with write permissions</strong></li>
  <li><strong>access ECR with read-only permissions</strong></li>
  <li><strong>access ECR with write permissions</strong></li>
  <li><strong>access some AWS service with a specific permission set</strong></li>
</ul>

<p>Redoing this work for <em>every</em> repository in the organization to ensure
segmentation of permissions becomes <em>very</em> tedious without the
<code class="language-plaintext highlighter-rouge">terraform-oidc-module</code>.</p>

<div class="language-hcl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nx">module</span> <span class="s2">"oidc"</span> <span class="p">{</span>
  <span class="nx">source</span> <span class="p">=</span> <span class="s2">"git::https://github.com/scribd/terraform-oidc-module.git?ref=v1.0.0"</span>

  <span class="nx">name</span> <span class="p">=</span> <span class="s2">"example"</span>
  <span class="nx">url</span> <span class="p">=</span> <span class="s2">"https://token.actions.githubusercontent.com"</span>
  <span class="nx">client_id_list</span> <span class="p">=</span> <span class="p">[</span><span class="s2">"sts.amazonaws.com"</span><span class="p">]</span>
  <span class="nx">thumbprint_list</span> <span class="p">=</span> <span class="p">[</span><span class="s2">"example0000example000example"</span><span class="p">]</span>
  <span class="nx">repo_ref</span> <span class="p">=</span> <span class="p">[</span><span class="s2">"repo:REPO_ORG/REPO_NAME:ref:refs/heads/main"</span><span class="p">]</span>

  <span class="nx">custom_policy_arns</span> <span class="p">=</span> <span class="p">[</span><span class="nx">aws_iam_policy</span><span class="err">.</span><span class="nx">example_policy0</span><span class="err">.</span><span class="nx">arn</span><span class="p">,</span><span class="nx">aws_iam_policy</span><span class="err">.</span><span class="nx">example_policy1</span><span class="err">.</span><span class="nx">arn</span> <span class="p">]</span>

  <span class="nx">tags</span> <span class="p">=</span> <span class="p">{</span>
    <span class="nx">Terraform</span> <span class="p">=</span> <span class="s2">"true"</span>
    <span class="nx">Environment</span> <span class="p">=</span> <span class="s2">"dev"</span>
  <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I hope you find this useful for getting started with OIDC and GitHub Actions!</p>]]></content><author><name>Oleh Motrunych</name></author><category term="oidc" /><category term="terraform" /><category term="github" /><summary type="html"><![CDATA[Scribd deploys a lot of code from GitHub to AWS using GitHub Actions, which means many of our Actions need to access AWS resources. Managing AWS API keys and tokens for different IAM users is time-consuming, brittle, and insecure. Managing key-distribution between AWS and GitHub also makes it difficult to track which keys go where, when they should be rotated, and what permissions those keys have. Fortunately AWS supports creating OpenID Connect identity providers which is an ideal tool to handle this kind of cross-cloud authentication in a more maintainable way.]]></summary></entry><entry><title type="html">Building a Scalable Data Lake Backup System with AWS</title><link href="https://tech.scribd.com/blog/2025/building-scalable-data-warehouse-backup-system.html" rel="alternate" type="text/html" title="Building a Scalable Data Lake Backup System with AWS" /><published>2025-09-22T00:00:00+00:00</published><updated>2025-09-22T00:00:00+00:00</updated><id>https://tech.scribd.com/blog/2025/building-scalable-data-warehouse-backup-system</id><content type="html" xml:base="https://tech.scribd.com/blog/2025/building-scalable-data-warehouse-backup-system.html"><![CDATA[<p>We designed and implemented a scalable, cost-optimized backup system for S3 data warehouses that runs automatically on a monthly schedule. The system handles petabytes of data across multiple databases and uses a hybrid approach: AWS Lambda for small workloads and ECS Fargate for larger ones.
At its core, the pipeline performs incremental backups — copying only new or changed parquet files while always preserving delta logs — dramatically reducing costs and runtime compared to full backups. Data is validated through S3 Inventory manifests, processed in parallel, and stored in Glacier for long-term retention.
To avoid data loss and reduce storage costs, we also implemented a safe deletion workflow. Files older than 90 days, successfully backed up, and no longer present in the source are tagged for lifecycle-based cleanup instead of being deleted immediately.
This approach ensures reliability, efficiency, and safety: backups scale seamlessly from small to massive datasets, compute resources are right-sized, and storage is continuously optimized.</p>

<p><img src="/files/backup_system_diagram.png" alt="Open Data Warehouse Backup System diagram" /></p>

<hr />

<h3 id="our-old-approach-had-problems">Our old approach had problems:</h3>

<ul>
  <li>Copying the same files on every run – not cost-effective</li>
  <li>Timeouts when manifests were too large for Lambda</li>
  <li>Redundant backups inflating storage cost</li>
  <li>Orphaned files piling up without clean deletion</li>
</ul>

<hr />

<h3 id="we-needed-a-systematic-automated-and-cost-effective-way-to">We needed a systematic, automated, and cost-effective way to:</h3>

<ul>
  <li>Run monthly backups across all databases</li>
  <li>Scale from small jobs to massive datasets</li>
  <li>Handle incremental changes instead of full copies</li>
  <li>Safely clean up old data without risk of data loss</li>
</ul>

<hr />

<h3 id="the-design-at-a-glance">The Design at a Glance</h3>

<p>We built a hybrid backup architecture on AWS primitives:</p>

<ul>
  <li>Step Functions – orchestrates the workflow</li>
  <li>Lambda – lightweight jobs for small manifests</li>
  <li>ECS Fargate – heavy jobs with no timeout constraints</li>
  <li>S3 + S3 Batch Ops – storage and bulk copy/delete operations</li>
  <li>EventBridge – monthly scheduler</li>
  <li>Glue, CloudWatch, Secrets Manager – reporting, monitoring, secure keys</li>
  <li>IAM – access and roles</li>
</ul>

<p>The core idea: do not copy files that are already in the backup, always copy over the delta logs, and route by size. Small manifests run in Lambda, big ones in ECS.</p>

<hr />

<h3 id="how-it-works">How It Works</h3>

<ol>
  <li>
    <p><strong>Database Discovery</strong></p>

    <p>Parse S3 Inventory manifests<br />
Identify database prefixes<br />
Queue for processing (up to 40 in parallel)</p>
  </li>
  <li>
    <p><strong>Manifest Validation</strong></p>

    <p>Before we touch data, we validate:</p>
    <ul>
      <li>JSON structure</li>
      <li>All CSV parts present</li>
      <li>File counts + checksums match<br />
If incomplete → wait up to 30 minutes before retry</li>
    </ul>
  </li>
  <li>
    <p><strong>Routing by Size</strong></p>

    <ul>
      <li>≤25 files → Lambda (15-minute timeout, 5GB memory)</li>
      <li>&gt;25 files → ECS Fargate (16GB RAM, 4 vCPUs, unlimited runtime)</li>
    </ul>
  </li>
  <li>
    <p><strong>Incremental Backup Logic</strong></p>

    <ul>
      <li>Load exclusion set from last backup</li>
      <li>Always include delta logs</li>
      <li>Only back up parquet files not yet in backup</li>
      <li>Ignore non-STANDARD storage classes (we use Intelligent-Tiering; over time files can go to Glacier and we don’t want to touch them)</li>
      <li>Process CSVs in parallel (20 workers)</li>
      <li>Emit new manifest + checksum for integrity</li>
    </ul>
  </li>
  <li>
    <p><strong>Copying Files</strong></p>

    <ul>
      <li>Feed manifests into S3 Batch Operations</li>
      <li>Copy objects into Glacier storage</li>
    </ul>
  </li>
  <li>
    <p><strong>Safe Deletion</strong></p>

    <ul>
      <li>Compare current inventory vs. incremental manifests</li>
      <li>Identify parquet files that:
        <ul>
          <li>Were backed up successfully</li>
          <li>No longer exist in source</li>
          <li>Are older than 90 days</li>
        </ul>
      </li>
      <li>Tag them for deletion instead of deleting immediately</li>
      <li>Deletion is performed using S3 lifecycle configuration for cost-optimized deletion</li>
      <li>Tags include timestamps for rollback + audit</li>
    </ul>
  </li>
</ol>
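<p>Steps 4 and 6 reduce to two small rules, sketched below with illustrative key names rather than our real bucket layout.</p>

```python
from datetime import timedelta

def select_for_backup(inventory, already_backed_up):
    """Step 4: always copy delta logs; copy parquet files only when absent
    from the previous backup manifest; skip objects no longer in STANDARD."""
    picked = []
    for key, storage_class in inventory:
        if storage_class != "STANDARD":
            continue  # already tiered (e.g. Glacier): don't touch it
        if "/_delta_log/" in key:
            picked.append(key)  # delta logs are always included
        elif key.endswith(".parquet") and key not in already_backed_up:
            picked.append(key)
    return picked

def tag_for_deletion(key, backed_up, in_source, age):
    """Step 6: tag (never delete directly) only files that were backed up,
    are gone from the source, and are older than 90 days."""
    return key in backed_up and key not in in_source and age > timedelta(days=90)

inventory = [
    ("db/t/_delta_log/000001.json", "STANDARD"),
    ("db/t/part-0001.parquet", "STANDARD"),  # new since last backup
    ("db/t/part-0000.parquet", "STANDARD"),  # already in the backup
    ("db/t/part-9999.parquet", "GLACIER"),   # already tiered away
]
assert select_for_backup(inventory, {"db/t/part-0000.parquet"}) == [
    "db/t/_delta_log/000001.json",
    "db/t/part-0001.parquet",
]
assert tag_for_deletion("db/t/part-0000.parquet",
                        backed_up={"db/t/part-0000.parquet"},
                        in_source=set(), age=timedelta(days=120))
```

<p>Keeping the deletion rule a pure predicate makes it easy to audit and to replay against historical inventories before any lifecycle tag is applied.</p>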

<hr />

<h3 id="error-handling--resilience">Error Handling &amp; Resilience</h3>

<ul>
  <li>Retries with exponential backoff + jitter</li>
  <li>Strict validation before deletes</li>
  <li>Exclusion lists ensure delta logs are never deleted</li>
  <li>ECS tasks run in private subnets with VPC endpoints</li>
</ul>

<hr />

<h3 id="cost--performance-gains">Cost &amp; Performance Gains</h3>

<ul>
  <li>Incremental logic = no redundant transfers</li>
  <li>Lifecycle rules = backups → Glacier, old ones cleaned</li>
  <li>Size-based routing = Lambda for cheap jobs, ECS for heavy jobs</li>
  <li>Parallelism = 20 CSV workers per manifest, 40 DBs at once</li>
</ul>

<hr />

<h3 id="lessons-learned">Lessons Learned</h3>

<ul>
  <li>Always validate manifests before processing</li>
  <li>Never delete immediately → tagging first saved us money</li>
  <li>Thresholds matter: 25 files was our sweet spot</li>
  <li>CloudWatch + Slack reports gave us visibility we didn’t have before</li>
</ul>

<hr />

<h3 id="conclusion">Conclusion</h3>

<p>By combining Lambda, ECS Fargate, and S3 Batch Ops, we’ve built a resilient backup system that scales from small to massive datasets. Instead of repeatedly copying the same files, the system now performs truly incremental backups — capturing only new or changed parquet files while always preserving delta logs. This not only minimizes costs but also dramatically reduces runtime.</p>

<p>Our safe deletion workflow ensures that stale data is removed without risk, using lifecycle-based cleanup rather than immediate deletion. Together, these design choices give us reliable backups, efficient scaling, and continuous optimization of storage. What used to be expensive, error-prone, and manual is now automated, predictable, and cost-effective.</p>]]></content><author><name>Oleh Motrunych</name></author><category term="terraform" /><category term="aws" /><category term="deltalake" /><category term="featured" /><summary type="html"><![CDATA[We designed and implemented a scalable, cost-optimized backup system for S3 data warehouses that runs automatically on a monthly schedule. The system handles petabytes of data across multiple databases and uses a hybrid approach: AWS Lambda for small workloads and ECS Fargate for larger ones. At its core, the pipeline performs incremental backups — copying only new or changed parquet files while always preserving delta logs — dramatically reducing costs and runtime compared to full backups. Data is validated through S3 Inventory manifests, processed in parallel, and stored in Glacier for long-term retention. To avoid data loss and reduce storage costs, we also implemented a safe deletion workflow. Files older than 90 days, successfully backed up, and no longer present in the source are tagged for lifecycle-based cleanup instead of being deleted immediately. This approach ensures reliability, efficiency, and safety: backups scale seamlessly from small to massive datasets, compute resources are right-sized, and storage is continuously optimized.]]></summary></entry><entry><title type="html">Let’s save tons of money with cloud-native data ingestion!</title><link href="https://tech.scribd.com/blog/2025/cloud-native-data-ingestion.html" rel="alternate" type="text/html" title="Let’s save tons of money with cloud-native data ingestion!" 
/><published>2025-08-01T00:00:00+00:00</published><updated>2025-08-01T00:00:00+00:00</updated><id>https://tech.scribd.com/blog/2025/cloud-native-data-ingestion</id><content type="html" xml:base="https://tech.scribd.com/blog/2025/cloud-native-data-ingestion.html"><![CDATA[<p>Delta Lake is a fantastic technology for quickly querying massive data sets,
but first you need those massive data sets! In <a href="https://www.youtube.com/watch?v=g1BZH8sbZWk">this
talk</a> from <a href="https://dataandaisummit.com">Data and AI
Summit</a> 2025 I dive into the cloud-native
architecture Scribd has adopted to ingest data from AWS Aurora, SQS, Kinesis
Data Firehose and more!</p>

<p>By using off-the-shelf open source tools like kafka-delta-ingest, oxbow and
Airbyte, Scribd has redefined its ingestion architecture to be more
event-driven, reliable, and most importantly: cheaper. No jobs needed!
Attendees will learn how to use third-party tools in concert with a Databricks
and Unity Catalog environment to provide a highly efficient and available data
platform.</p>

<p>This architecture will be presented in the context of AWS but can be adapted
for Azure, Google Cloud Platform or even on-premise environments. The
<a href="https://www.scribd.com/document/874418144/Data-and-AI-Summit-2025-Presentation">slides</a>
are also available on Scribd!</p>

<iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/g1BZH8sbZWk?si=HM9MXf4nNrGBfHAR" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen=""></iframe>]]></content><author><name>R Tyler Croy</name></author><category term="databricks" /><category term="aws" /><category term="deltalake" /><category term="featured" /><summary type="html"><![CDATA[Delta Lake is a fantastic technology for quickly querying massive data sets, but first you need those massive data sets! In this talk from Data and AI Summit 2025 I dive into the cloud-native architecture Scribd has adopted to ingest data from AWS Aurora, SQS, Kinesis Data Firehose and more!]]></summary></entry><entry><title type="html">Terraform module to manage Oxbow Lambda and its components</title><link href="https://tech.scribd.com/blog/2025/terraform-oxbow-module.html" rel="alternate" type="text/html" title="Terraform module to manage Oxbow Lambda and its components" /><published>2025-03-14T00:00:00+00:00</published><updated>2025-03-14T00:00:00+00:00</updated><id>https://tech.scribd.com/blog/2025/terraform-oxbow-module</id><content type="html" xml:base="https://tech.scribd.com/blog/2025/terraform-oxbow-module.html"><![CDATA[<p><a href="https://github.com/buoyant-data/oxbow">Oxbow</a> is a project to take an existing storage location which contains <a href="https://parquet.apache.org/">Apache Parquet</a> files into a <a href="https://delta.io/">Delta Lake table</a>.
It is intended to run either as an AWS Lambda function or as a command-line application.
We are excited to introduce <a href="https://github.com/scribd/terraform-oxbow">terraform-oxbow</a>, an open-source Terraform module that simplifies the deployment and management of AWS Lambda and its supporting components. Whether you’re working with AWS Glue, Kinesis Data Firehose, SQS, or DynamoDB, this module provides a streamlined approach to infrastructure as code (IaC) in AWS.</p>

<h3 id="-why-terraform-oxbow">✨ Why terraform-oxbow?</h3>
<p>Managing event-driven architectures in AWS can be complex, requiring careful orchestration of multiple services. terraform-oxbow abstracts much of this complexity, enabling users to configure key components with simple boolean flags and module parameters. This makes it easy and efficient to create Delta tables from existing Apache Parquet files.</p>
<h3 id="features">🚀 Features</h3>

<p>With <strong>terraform-oxbow</strong>, you can deploy:</p>

<ul>
  <li>AWS Oxbow Lambda with customizable configurations</li>
  <li>Kinesis Data Firehose for real-time data streaming</li>
  <li>SQS and SQS Dead Letter Queues for event-driven messaging</li>
  <li>IAM policies for secure access management</li>
  <li>S3 bucket notifications to trigger Lambda functions</li>
  <li>DynamoDB tables for data storage and locking</li>
  <li>AWS Glue Catalog and Tables for schema management</li>
</ul>

<h3 id="️-how-it-works">⚙️ How It Works</h3>

<p>This module follows a modular approach, allowing users to enable or disable services based on their specific use case. Here are a few examples, followed by some operational notes:</p>
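
<p>As a baseline, a minimal module invocation might look like the following. This is a hypothetical sketch: only the <code class="language-plaintext highlighter-rouge">enable_*</code> flags shown in this post are taken from the module, and the source path and flag values are illustrative; consult the terraform-oxbow repository for the authoritative variable list.</p>

```hcl
module "oxbow" {
  source = "github.com/scribd/terraform-oxbow"

  # Toggle supporting components with boolean flags (illustrative values)
  enable_aws_glue_catalog_table           = true
  enable_bucket_notification              = true
  enable_kinesis_firehose_delivery_stream = false
  enable_group_events                     = false
}
```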

<ul>
  <li>
    <p>To enable AWS Glue Catalog and Tables:</p>
    <pre><code class="language-hcl">enable_aws_glue_catalog_table = true</code></pre>
  </li>
  <li>
    <p>To enable the Kinesis Data Firehose delivery stream:</p>
    <pre><code class="language-hcl">enable_kinesis_firehose_delivery_stream = true</code></pre>
  </li>
  <li>
    <p>To enable S3 bucket notifications:</p>
    <pre><code class="language-hcl">enable_bucket_notification = true</code></pre>
  </li>
  <li>
    <p>To enable the advanced Oxbow Lambda setup for multi-table filtered optimization:</p>
    <pre><code class="language-hcl">enable_group_events = true</code></pre>
  </li>
  <li>
    <p>AWS S3 bucket notifications have limitations: due to AWS constraints, an S3 bucket can only have a single notification configuration. If you need to trigger multiple Lambda functions from the same S3 bucket, consider fanning events out through SNS or SQS.</p>
  </li>
  <li>
    <p>IAM Policy Management: The module provides the necessary permissions but follows the principle of least privilege. Ensure your IAM policies align with your security requirements.</p>
  </li>
  <li>
    <p>Scalability and Optimization: The module allows fine-grained control over Lambda concurrency, event filtering, and data processing configurations to optimize costs and performance.</p>
  </li>
</ul>]]></content><author><name>Oleh Motrunych</name></author><category term="Oxbow" /><category term="Terraform" /><category term="AWS" /><category term="deltalake" /><category term="rust" /><summary type="html"><![CDATA[Oxbow is a project to take an existing storage location which contains Apache Parquet files into a Delta Lake table. It is intended to run both as an AWS Lambda or as a command line application. We are excited to introduce terraform-oxbow, an open-source Terraform module that simplifies the deployment and management of AWS Lambda and its supporting components. Whether you’re working with AWS Glue, Kinesis Data Firehose, SQS, or DynamoDB, this module provides a streamlined approach to infrastructure as code (IaC) in AWS.]]></summary></entry><entry><title type="html">Cloud-native Data Ingestion with AWS Aurora and Delta Lake</title><link href="https://tech.scribd.com/blog/2025/cloud-native-data-ingestion.html" rel="alternate" type="text/html" title="Cloud-native Data Ingestion with AWS Aurora and Delta Lake" /><published>2025-01-15T00:00:00+00:00</published><updated>2025-01-15T00:00:00+00:00</updated><id>https://tech.scribd.com/blog/2025/cloud-native-data-ingestion</id><content type="html" xml:base="https://tech.scribd.com/blog/2025/cloud-native-data-ingestion.html"><![CDATA[<p>One of the major themes for Infrastructure Engineering over the past couple
years has been higher reliability and better operational efficiency. In a
recent session with the <a href="https://delta.io">Delta Lake</a> project I was able to
share the work led by Kuntal Basu and a number of other people to <em>dramatically</em>
improve the efficiency and reliability of our online data ingestion pipeline.</p>

<blockquote>
  <p>Join Kuntal Basu, Staff Infrastructure Engineer, and R. Tyler Croy, Principal
Engineer at Scribd, Inc. as they take you behind the scenes of Scribd’s data
ingestion setup. They’ll break down the architecture, explain the tools, and
walk you through how they turned off-the-shelf solutions into a robust
pipeline.</p>
</blockquote>

<h2 id="video">Video</h2>

<center><iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/h8nCF_OI0O0?si=3v2sb4hUvPEKOKF_" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen=""></iframe></center>

<h2 id="presentation">Presentation</h2>

<center><iframe src="https://www.slideshare.net/slideshow/embed_code/key/M9NZpsRwwYnjq6?hostedIn=slideshare&amp;page=upload" width="476" height="400" frameborder="0" marginwidth="0" marginheight="0" scrolling="no"></iframe></center>]]></content><author><name>R Tyler Croy</name></author><category term="deltalake" /><category term="rust" /><category term="featured" /><summary type="html"><![CDATA[One of the major themes for Infrastructure Engineering over the past couple years has been higher reliability and better operational efficiency. In a recent session with the Delta Lake project I was able to share the work led Kuntal Basu and a number of other people to dramatically improve the efficiency and reliability of our online data ingestion pipeline.]]></summary></entry><entry><title type="html">The Evolution of the Machine Learning Platform</title><link href="https://tech.scribd.com/blog/2024/evolution-of-mlplatform.html" rel="alternate" type="text/html" title="The Evolution of the Machine Learning Platform" /><published>2024-02-05T00:00:00+00:00</published><updated>2024-02-05T00:00:00+00:00</updated><id>https://tech.scribd.com/blog/2024/evolution-of-mlplatform</id><content type="html" xml:base="https://tech.scribd.com/blog/2024/evolution-of-mlplatform.html"><![CDATA[<p>Machine Learning Platforms (ML Platforms) have the potential to be a key component in achieving production ML at scale without large technical debt, yet ML Platforms are not often understood. This document outlines the key concepts and paradigm shifts that led to the conceptualization of ML Platforms in an effort to increase an understanding of these platforms and how they can best be applied.</p>

<h2 id="technical-debt-and-development-velocity-defined">Technical Debt and development velocity defined</h2>

<h3 id="development-velocity">Development Velocity</h3>

<p>Machine learning development velocity refers to the speed and efficiency at which machine learning (ML) projects progress from initial concept to deployment in a production environment. It encompasses the entire lifecycle of an ML project: data collection and preprocessing; model training, evaluation, validation, deployment, and testing for new models; and re-training, validation, and re-deployment of existing models.</p>

<h3 id="technical-debt">Technical Debt</h3>

<p>The term “technical debt” in software engineering was coined by Ward Cunningham. Cunningham used the metaphor of financial debt to describe the trade-off between implementing a quick and dirty solution to meet immediate needs (similar to taking on financial debt for short-term gain) and taking the time to build a more sustainable and maintainable solution (akin to avoiding financial debt, but requiring more upfront investment). Just as financial debt accumulates interest over time, technical debt can accumulate and make future development more difficult and expensive.</p>

<p>The idea behind technical debt is to highlight the consequences of prioritizing short-term gains over long-term maintainability and the need to address and pay off this “debt” through proper refactoring and improvements. The term has since become widely adopted in the software development community to describe the accrued cost of deferred work on a software project.</p>

<h3 id="technical-debt-in-machine-learning">Technical Debt in Machine Learning</h3>

<p>Although originally a software engineering concept, technical debt is also relevant to machine learning systems; in fact, the landmark Google paper suggests that ML systems have a propensity to accumulate technical debt easily.</p>

<blockquote>
  <p>Machine learning offers a fantastically powerful toolkit for building useful complex prediction systems quickly. This paper argues it is dangerous to think of these quick wins as coming for free. Using the software engineering framework of technical debt, we find it is common to incur massive ongoing maintenance costs in real-world ML systems.</p>

  <p><a href="https://www.scribd.com/document/428241724/Hidden-Technical-Debt-in-Machine-Learning-Systems">Sculley et al. (2015) Hidden Technical Debt in Machine Learning Systems</a></p>
</blockquote>

<blockquote>
  <p>As the machine learning (ML) community continues to accumulate years of experience with live systems, a wide-spread and uncomfortable trend has emerged: developing and deploying ML systems is relatively fast and cheap, but maintaining them over time is difficult and expensive.</p>

  <p><a href="https://www.scribd.com/document/428241724/Hidden-Technical-Debt-in-Machine-Learning-Systems">Sculley et al. (2015) Hidden Technical Debt in Machine Learning Systems</a></p>
</blockquote>

<p>Technical debt is especially important to consider when trying to move fast. Moving fast is easy; moving fast without acquiring technical debt is a lot more complicated.</p>

<h2 id="the-evolution-of-ml-platforms">The Evolution Of ML Platforms</h2>

<h3 id="devops--the-paradigm-shift-that-led-the-way">DevOps – The paradigm shift that led the way</h3>

<p>DevOps is a methodology in software development which advocates for teams owning the entire software development lifecycle. This paradigm shift from fragmented teams to end-to-end ownership enhances collaboration and accelerates delivery. DevOps has become standard practice in modern software development, and its adoption has been widespread, with many organizations considering it an essential part of their software development and delivery processes. Some of the principles of DevOps are:</p>

<ol>
  <li>
    <p><strong>Automation</strong></p>
  </li>
  <li>
    <p><strong>Continuous Testing</strong></p>
  </li>
  <li>
    <p><strong>Continuous Monitoring</strong></p>
  </li>
  <li>
    <p><strong>Collaboration and Communication</strong></p>
  </li>
  <li>
    <p><strong>Version Control</strong></p>
  </li>
  <li>
    <p><strong>Feedback Loops</strong></p>
  </li>
</ol>

<h3 id="platforms--reducing-cognitive-load">Platforms – Reducing Cognitive Load</h3>

<p>This shift to DevOps, with teams owning the entire development lifecycle, introduces a new challenge: additional cognitive load. Cognitive load can be defined as</p>

<blockquote>
  <p>The total amount of mental effort a team uses to understand, operate and maintain their designated systems or tasks.</p>

  <p><a href="https://teamtopologies.com/book">Skelton &amp; Pais (2019) Team Topologies</a></p>
</blockquote>

<p>The additional cognitive load that comes with owning the entire software development lifecycle can hinder productivity, prompting organizations to seek solutions.</p>

<p>Platforms emerged as a strategic solution, abstracting away unnecessary details of the development lifecycle. This abstraction allows engineers to focus on critical tasks, mitigating cognitive load and fostering a more streamlined workflow.</p>

<blockquote>
  <p>The purpose of a platform team is to enable stream-aligned teams to deliver work with substantial autonomy. The stream-aligned team maintains full ownership of building, running, and fixing their application in production. The platform team provides internal services to reduce the cognitive load that would be required from stream-aligned teams to develop these underlying services.</p>

  <p><a href="https://teamtopologies.com/book">Skelton &amp; Pais (2019) Team Topologies</a></p>
</blockquote>

<blockquote>
  <p>Infrastructure Platform teams enable organisations to scale delivery by solving common product and non-functional requirements with resilient solutions. This allows other teams to focus on building their own things and releasing value for their users</p>

  <p><a href="https://martinfowler.com/articles/building-infrastructure-platform.html">Rowse &amp; Shepherd (2022) Building Infrastructure Platforms</a></p>
</blockquote>

<h3 id="ml-ops--reducing-technical-debt-of-machine-learning">ML Ops – Reducing technical debt of machine learning</h3>

<p>The ability of ML systems to rapidly accumulate technical debt has given rise to the concept of MLOps. MLOps is a methodology that takes inspiration from and incorporates the best practices of DevOps, tailoring them to address the distinctive challenges inherent in machine learning. It applies the established principles of DevOps to machine learning, recognizing that merely a fraction of a real-world ML system comprises the actual ML code, and serves as a crucial bridge between development and the ongoing intricacies of maintaining ML systems.
MLOps provides a collection of concepts and workflows designed to promote efficiency, collaboration, and sustainability across the ML lifecycle. Correctly applied, MLOps can play a pivotal role in controlling technical debt and ensuring the efficiency, reliability, and scalability of the machine learning lifecycle over time.</p>

<h2 id="scribds-ml-platform--mlops-and-platforms-in-action">Scribd’s ML Platform – MLOps and Platforms in Action</h2>
<p>At Scribd we have developed a machine learning platform that provides a curated developer experience for machine learning developers. The platform was built with MLOps in mind, as can be seen through its use of common DevOps principles:</p>

<ol>
  <li><strong>Automation:</strong>
    <ul>
      <li>Applying CI/CD strategies to model deployments through the use of Jenkins pipelines which deploy models from the Model Registry to AWS-based endpoints.</li>
      <li>Automating model training through the use of Airflow DAGs, and allowing these DAGs to trigger the deployment pipelines to deploy a model once re-training has occurred.</li>
    </ul>
  </li>
  <li><strong>Continuous Testing:</strong>
    <ul>
      <li>Applying continuous testing as part of a model deployment pipeline, removing the need for manual testing.</li>
      <li>Increased tooling to support model validation testing.</li>
    </ul>
  </li>
  <li><strong>Monitoring:</strong>
    <ul>
      <li>Monitoring real time inference endpoints</li>
      <li>Monitoring training DAGs</li>
      <li>Monitoring batch jobs</li>
    </ul>
  </li>
  <li><strong>Collaboration and Communication:</strong>
    <ul>
      <li>Feature Store which provides feature discovery and re-use</li>
      <li>Model Database which provides model collaboration</li>
    </ul>
  </li>
  <li><strong>Version Control:</strong>
    <ul>
      <li>Applying version control to experiments, machine learning models and features</li>
    </ul>
  </li>
</ol>
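
<p>As a toy illustration of the automation described above (re-training followed by an automatically triggered, validated deployment), the control flow can be sketched in plain Python. All function names here are hypothetical stand-ins: in the real platform these steps are Airflow DAG tasks and a Jenkins deployment pipeline.</p>

```python
# Toy sketch of the retrain-then-deploy flow; names are illustrative only.

def retrain(dataset):
    """Stand-in for a re-training task producing a new model version."""
    return {"version": dataset["version"] + 1, "trained_on": dataset["rows"]}

def validate(model):
    """Stand-in for continuous model validation testing."""
    return model["trained_on"] > 0

def deploy(model, registry):
    """Stand-in for the pipeline deploying from the Model Registry."""
    registry["live"] = model["version"]
    return registry

def retraining_dag(dataset, registry):
    model = retrain(dataset)
    if validate(model):          # deployment triggers only after validation
        return deploy(model, registry)
    return registry

registry = {"live": 1}
print(retraining_dag({"version": 1, "rows": 100}, registry))
# → {'live': 2}
```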

<h2 id="references">References</h2>

<p><a href="https://martinfowler.com/articles/talk-about-platforms.html">Bottcher (2018, March 05). What I Talk About When I Talk About Platforms. https://martinfowler.com/articles/talk-about-platforms.html</a></p>

<p><a href="https://www.scribd.com/document/428241724/Hidden-Technical-Debt-in-Machine-Learning-Systems">D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-François Crespo, Dan Dennison (2015). Hidden Technical Debt in Machine Learning Systems</a></p>

<p><a href="https://martinfowler.com/bliki/ConwaysLaw.html">Fowler (2022, October 20). Conway’s Law. https://martinfowler.com/bliki/ConwaysLaw.html</a></p>

<p><a href="https://platformengineering.org/blog/what-is-platform-engineering">Galante. What is Platform Engineering. https://platformengineering.org/blog/what-is-platform-engineering</a></p>

<p><a href="https://www.scribd.com/document/611845499/Whitepaper-State-of-Platform-Engineering-Report">Humanitec. State of Platform Engineering Report</a></p>

<p><a href="https://martinfowler.com/articles/platform-teams-stuff-done.html">Hodgson (2023, July 19). How platform teams get stuff done. https://martinfowler.com/articles/platform-teams-stuff-done.html</a></p>

<p><a href="https://www.thoughtworks.com/insights/blog/platforms/art-platform-thinking">Murray (2017, April 27). The Art of Platform Thinking. https://www.thoughtworks.com/insights/blog/platforms/art-platform-thinking</a></p>

<p><a href="https://www.techopedia.com/definition/27913/technical-debt">Rouse (2017, March 20). Technical Debt. https://www.techopedia.com/definition/27913/technical-debt</a></p>

<p><a href="https://martinfowler.com/articles/building-infrastructure-platform.html">Rowse &amp; Shepherd (2022).Building Infrastructure Platforms. https://martinfowler.com/articles/building-infrastructure-platform.html</a></p>

<p><a href="https://teamtopologies.com/book">Skelton &amp; Pais (2019) Team Topologies</a></p>]]></content><author><name>Ben Shaw</name></author><category term="mlops" /><category term="featured" /><category term="ml-platform-series" /><summary type="html"><![CDATA[Machine Learning Platforms (ML Platforms) have the potential to be a key component in achieving production ML at scale without large technical debt, yet ML Platforms are not often understood. This document outlines the key concepts and paradigm shifts that led to the conceptualization of ML Platforms in an effort to increase an understanding of these platforms and how they can best be applied.]]></summary></entry></feed>