<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Felipe Hoffa on Medium]]></title>
        <description><![CDATA[Stories by Felipe Hoffa on Medium]]></description>
        <link>https://medium.com/@hoffa?source=rss-279fe54c149a------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/0*ahXIMiIgudZTyqJS.jpeg</url>
            <title>Stories by Felipe Hoffa on Medium</title>
            <link>https://medium.com/@hoffa?source=rss-279fe54c149a------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Sun, 07 Jun 2026 13:07:04 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@hoffa/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[GCP Next ’26: A knowledge graph of 5 million words of videos and slides]]></title>
            <link>https://medium.com/geotab/gcp-next-26-a-knowledge-graph-of-5-million-words-of-videos-and-slides-efbce815c427?source=rss-279fe54c149a------2</link>
            <guid isPermaLink="false">https://medium.com/p/efbce815c427</guid>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[geotab]]></category>
            <category><![CDATA[google-cloud-platform]]></category>
            <category><![CDATA[data-visualization]]></category>
            <category><![CDATA[knowledge-graph]]></category>
            <dc:creator><![CDATA[Felipe Hoffa]]></dc:creator>
            <pubDate>Mon, 01 Jun 2026 11:01:01 GMT</pubDate>
            <atom:updated>2026-06-01T11:01:01.625Z</atom:updated>
            <content:encoded><![CDATA[<h4>How 1,146 sessions, 444 slide decks, and 323 video transcripts became 8,285 nodes and 24,421 edges of structured knowledge. The challenge: how do you make sense of that much material at once?</h4><p><a href="https://fhoffa.github.io/google-cloud-next-2026-unofficial-scrape/google-cloud-next/2026/graphify-out/graph_story.html">Google Cloud Next 2026 - Knowledge Graph Story</a></p><p>The answer was to treat the entire conference as a corpus and build a knowledge graph from it. Here’s how the data pipeline worked, layer by layer.</p><h3>The Pipeline: Layer by Layer</h3><h4>Session Metadata — 1,146 sessions</h4><p>The first layer was the structured index from the Google Cloud Next 2026 session explorer. Each entry had a title, description, speaker names, company affiliations, topic tags, room, and timing. This became the skeleton: every session title and speaker was a candidate entity in the graph.</p><h4>Slide Decks — 444 PDFs, ~2,200 slides analyzed</h4><p>For 444 of the 1,146 sessions, a slide deck was available on the Google site. Each PDF was downloaded, converted to images, and processed with vision analysis — extracting product names, architecture diagrams, metrics, and key claims from individual slides. This yielded <strong>2,205 slide-level analysis records</strong> and per-session brief summaries.</p><p>Slides are uniquely valuable: they contain the distilled, intentional message a speaker wants to convey. A bullet point on a slide is more signal than two minutes of verbal framing around it.</p><h4>YouTube Transcripts — 323 full recordings</h4><p>For 323 sessions, a YouTube recording was available. Full audio transcripts were retrieved when available, or generated when needed, and stored — capturing the speaker’s actual words, including Q&amp;A, live demos, and audience interactions that never appear in a slide deck.</p><p>The combination of slides + transcript for the same session was particularly powerful: slides provided structured claims while transcripts provided context, nuance, and the path that led to each conclusion.</p><h4>Graphify — Entity &amp; Relationship Extraction</h4><p>All 477 source files (~5 million words) were fed through <a href="https://github.com/safishamsi/graphify"><strong>graphify</strong></a>, an AI-powered knowledge graph builder that reads raw text and extracts named entities, concepts, products, companies, and people — along with the semantic relationships between them.</p><p>Most of the graph was built from directly sourced connections found in the conference material itself. Separately, 114 additional edges were added by cross-document reasoning, and only kept when they cleared a 0.85 confidence threshold.</p><h4>Graph Output — 8,285 nodes · 24,421 edges · 44 communities</h4><p>The resulting graph was then analyzed for community structure using modularity-based detection. 44 distinct communities emerged — each representing a cluster of entities that are more densely connected to each other than to the rest of the graph. Community names are derived from the most distinctive nodes in each cluster.</p><h3>Distilling the Ecosystem</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*YiqUyQNEMSCTlb9i.png" /></figure><p>The raw graph is beautiful but barely readable.</p><p>The simplification works by promoting only the highest-degree hub nodes within each detected community and collapsing everything else around them. Intra-layer connections become solid lines within each bubble; cross-layer connections become dashed lines between zones. The result is a map you can actually read and reason about.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*I3mWvp1QcRACdVj_.png" /></figure><p>The picture is pretty simple: Google appears to be assembling a full enterprise agent stack. At the bottom are the protocols and tools that let agents reach other systems. Then come the control layers that secure them, the data systems that ground them, the model and agent platform that powers them, the infrastructure that runs them, and the products where people actually use them.</p><p>The graph makes that structure visible. A long list of launches starts to read like one architecture for enterprise agents.</p><h3>What Can You Ask This Graph?</h3><p>Here’s a list of some sample questions we asked the graph. To see the answers click on the interactive dashboard:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/853/1*GWnN_7Kehb9aC3fbM5zUEA.png" /><figcaption>See the answer to these questions on the <a href="https://fhoffa.github.io/google-cloud-next-2026-unofficial-scrape/google-cloud-next/2026/graphify-out/graph_story.html">interactive page</a></figcaption></figure><h3>What Google Said About Graphs</h3><p>The graph isn’t just an outside analysis trick. Google repeatedly described agent-ready systems in graph-shaped terms: entities, relationships, connected context, and knowledge that agents can reason over.</p><p>In the <a href="https://youtu.be/11PBno-cJ1g">opening keynote</a> Google introduced Knowledge Catalog as a system that can read files, extract entities, map relationships, and learn business semantics. In <a href="https://youtu.be/cdmvSHJEh2E">What’s new with data and AI governance: Building the catalog for AI</a>, that idea became even clearer: the goal is a connected context layer that agents can reason over directly.</p><p>This wasn’t just abstract metadata talk. The source material also includes concrete product pushes around graph-native data systems. <strong>Spanner Graph</strong> appears in <a href="https://fhoffa.github.io/google-cloud-next-2026-unofficial-scrape/?search=What%27s%20new%20in%20Spanner%3A%20enterprise-scale%20AI%2C%20search%2C%20graph%2C%20and%20analytics"><em>What’s new in Spanner: enterprise-scale AI, search, graph, and analytics</em></a>, where Google describes multimodel support, SQL PGQ, integrated graph algorithms, graphs on views, and UI-based graph modeling. <strong>BigQuery Graph</strong> appears in multiple decks as native property-graph support for relationship analytics, visualization, and graph-grounded reasoning on enterprise data.</p><p>That’s what makes this feel new. For years, products like BigQuery and Spanner were mainly framed as places to store, query, and scale data. At Next ’26, Google was also positioning them as systems for modeling relationships directly — not just rows and columns, but connected structures that agents and analytical workflows can traverse. This is Google reframing databases for the agent era.</p><p>And it goes beyond two product names. Other graph-shaped concepts also show up in the graph itself: <strong>BigQuery Property Graph</strong>, <strong>Property Graph</strong>, <strong>GQL / ISO GQL</strong>, <strong>Graph Analytics</strong>, <strong>Graph RAG</strong>, <strong>Knowledge Graph Grounding</strong>, <strong>Gemini Enterprise Agent Platform Graph Neural Networks</strong>, and security-oriented nodes like <strong>Wiz Security Graph</strong> and <strong>SCC Security Graph</strong>.</p><p>That matters because it validates the method of this page. The conference graph here is external and independently built, but it is not conceptually alien to Google’s own framing. If anything, the company’s product story is shifting from isolated data systems toward connected context systems: not just documents, not just tables, and not just prompts, but systems that understand entities, relationships, provenance, and business meaning well enough for agents to act on them.</p><h3><strong>Where Geotab Landed</strong></h3><p>Junaid Gill (Associate Vice President, Geotab) joined Greg Brosman (Senior Product Manager, Google Cloud) and John Murray (Group Product Manager, Google Cloud) on <a href="https://fhoffa.github.io/google-cloud-next-2026-unofficial-scrape/?sessionids=3913070">governing a secure agentic ecosystem</a>.</p><iframe src="https://drive.google.com/viewerng/viewer?url=https%3A//content-cdn.sessionboard.com/content/O3JSmIxQ9utid4PL9jjT_BRK2-092.pdf&amp;embedded=true" width="600" height="780" frameborder="0" scrolling="no"><a href="https://medium.com/media/81be106373c9af3bea1ef5cdbff6717d/href">https://medium.com/media/81be106373c9af3bea1ef5cdbff6717d/href</a></iframe><p><a href="https://medium.com/geotab/the-agentic-ai-reckoning-why-governance-cant-wait-90bdbf38e96e">Francois-Xavier Jeannet </a>(Team Lead, Data &amp; AI Governance, Geotab) joined Anit Patinker (Lead Product Manager, Google Cloud) and Shelley Hershkovitz (Product Manager, Google Cloud) on <a href="https://fhoffa.github.io/google-cloud-next-2026-unofficial-scrape/?sessionids=3913108">agent security at scale</a>.</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FrAa_Dna4F1k%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DrAa_Dna4F1k&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FrAa_Dna4F1k%2Fhqdefault.jpg&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/b9c79014dbda0d5a9bbda752cf7c9a87/href">https://medium.com/media/b9c79014dbda0d5a9bbda752cf7c9a87/href</a></iframe><p>Daniel Lewis (Distinguished Data Scientist, Geotab) joined Dima Melnyk (Senior Product Manager, Google Cloud) and Alex Martin (Product Manager, Google Cloud) on <a href="https://fhoffa.github.io/google-cloud-next-2026-unofficial-scrape/?sessionids=3913164">the agent-quality flywheel</a>.</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FeLQAJqydXqY%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DeLQAJqydXqY&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FeLQAJqydXqY%2Fhqdefault.jpg&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/c354223ab787076f54f978326a5a5b45/href">https://medium.com/media/c354223ab787076f54f978326a5a5b45/href</a></iframe><p>On the conceptual map, those talks land in a tight zone around platform, governance, and evaluation. One Geotab example helps explain why: their internal hackathon generated 86 agent submissions, and 2 of those later made it into production.</p><p>That gap helps explain why Geotab mattered on stage. In earlier years, Geotab often appeared as the customer saying “BigQuery works and here’s how we use it”. This time the posture felt different. AI is still early, but Geotab has already been building agentic systems, and Google is bringing them on stage not merely as a reference customer, but as a partner with production lessons that are helping shape the products now being pushed more broadly.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*4UixDbTjO3-NPBf4.png" /></figure><p>Two of the three Geotab sessions landed inside the <strong>Agent Control Plane</strong>: one on governing agents and securing the agentic ecosystem with GEAP, and one on agent security at scale through the OWASP Top 10 for agents. The third landed in the <strong>Agent Platform</strong> zone, focused on the agent-quality flywheel and GEAP evaluations.</p><p>Geotab was there to talk about the hard part: getting agents into production, keeping them secure, and building the evaluation loop that makes them reliable. That is also why the control-plane layer matters so much on this map — with examples of how an exploratory funnel narrows into a small number of production-worthy agents, and why governance, security, evaluation, and operational control matter so much once a team tries to ship them.</p><h3><strong>The Two Big Takes</strong></h3><p>First<strong>:</strong> <strong>graphing the conference reveals structure you cannot reliably see by reading sessions one by one</strong>. Once 1,146 sessions, 444 slide decks, and 323 transcripts are turned into nodes and edges, repeated patterns become visible: which products sit at the center, which protocols connect otherwise separate systems, which themes cluster together, and which ideas show up across product, security, data, infrastructure, and partner talks.</p><p>That is where many of the strongest insights in this story come from. MCP starts to look like connective tissue. BigQuery starts to look like a governed substrate for AI systems. Cloud Run, GKE, security controls, evaluation loops, and enterprise data products start to read as parts of the same architecture.</p><p>Second<strong>:</strong> <strong>building the graph was fairly straightforward</strong>. Cleaning up the conference corpus was the hard part. The source material was messy: decks, transcripts, session pages, repeated product names, overlapping abstractions, and different levels of specificity. <a href="https://github.com/safishamsi/graphify"><strong>graphify</strong></a> made the workflow feel direct: feed in the corpus, extract grounded entities and relationships, and simplify the result into a graph that a human can actually read.</p><p>The graph also shows what Google seems to believe will matter in production. The center of gravity is a working stack: agents connected through MCP and A2A, grounded in enterprise data, deployed on managed runtimes, evaluated continuously, and wrapped in governance, identity, and security controls.</p><p>If you want to test that thesis yourself, browse the <a href="https://fhoffa.github.io/google-cloud-next-2026-unofficial-scrape/">session explorer</a>, or open the <a href="https://fhoffa.github.io/google-cloud-next-2026-unofficial-scrape/google-cloud-next/2026/graphify-out/graph.html">full interactive map</a> and follow the edges yourself. The graph, the map, and the session index feel like the right publishable artifact here, without redistributing downloaded slides or transcripts.</p><p>One encouraging part of this project is how manageable it was. It took a few steps, and Graphify made the workflow pretty approachable.</p><p>A future attendee could use a graph like this during the next conference to decide which sessions to prioritize, spot the hubs everyone is converging around, and follow emerging ideas across tracks in real time. It works well as a retrospective, and it would be even more useful live.</p><p><strong>Read and discuss this knowledge-graph story on </strong><a href="https://www.linkedin.com/posts/hoffa_i-turned-400-slide-decks-300-youtube-transcripts-ugcPost-7457851475432611840-K8EP/?utm_source=share&amp;utm_medium=member_desktop&amp;rcm=ACoAAAADNXQBNbeHsF7w6H4b2zRq4DuGh8-imaA"><strong>LinkedIn</strong></a><strong> →</strong></p><p><a href="https://www.linkedin.com/feed/update/urn:li:activity:7457851478485913601/">I turned 400+ slide decks, 300+ YouTube transcripts, and 5+ million words from Google Cloud Next &#39;26 into a knowledge graph to see what the conference was really about. The raw graph is huge and... | Felipe Hoffa | 10 comments</a></p><p><em>Originally published at </em><a href="https://fhoffa.github.io/google-cloud-next-2026-unofficial-scrape/google-cloud-next/2026/graphify-out/graph_story.html"><em>https://fhoffa.github.io</em></a><em>.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=efbce815c427" width="1" height="1" alt=""><hr><p><a href="https://medium.com/geotab/gcp-next-26-a-knowledge-graph-of-5-million-words-of-videos-and-slides-efbce815c427">GCP Next ’26: A knowledge graph of 5 million words of videos and slides</a> was originally published in <a href="https://medium.com/geotab">Geotab Innovators</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Is MCP dead? 2025 vs 2026 at GCP Next]]></title>
            <link>https://hoffa.medium.com/is-mcp-dead-2025-vs-2026-at-gcp-next-e19f2d4585c6?source=rss-279fe54c149a------2</link>
            <guid isPermaLink="false">https://medium.com/p/e19f2d4585c6</guid>
            <category><![CDATA[google-cloud-platform]]></category>
            <category><![CDATA[ai-agent]]></category>
            <category><![CDATA[mcp-server]]></category>
            <category><![CDATA[bigquery]]></category>
            <category><![CDATA[ai]]></category>
            <dc:creator><![CDATA[Felipe Hoffa]]></dc:creator>
            <pubDate>Mon, 06 Apr 2026 22:07:23 GMT</pubDate>
            <atom:updated>2026-04-06T22:07:23.823Z</atom:updated>
            <content:encoded><![CDATA[<h4>Comparing the Google Cloud Next session catalogs — 2025 vs 2026: MCP: 1 → 63 sessions, A2A / agent-to-agent: 0 → 16, “Protocol”: 0 → 50. Read on for more details.</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*IWDFFqcux_Zokbim" /><figcaption>MCP: 2025 vs 2026 at GCP Next</figcaption></figure><p>This is Google moving from “AI assistant” framing toward something much closer to an agent execution stack. The 63 MCP sessions span 32 Breakouts, 11 Lightning Talks, 8 Discussion Groups, 5 Demos, 3 Workshops, 3 Solution Talks, 3 Birds of a Feather, 1 Spotlight, plus a Developer Meetup and a Lounge Session.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ZAFj-9nGOnx_XIrPITB6cw.png" /><figcaption>→ Browse all <a href="https://fhoffa.github.io/google-cloud-next-2026-unofficial-scrape/index.html?q=Mcp">63 MCP sessions</a></figcaption></figure><p>The themes are pretty consistent:</p><p>• Gemini CLI as the developer control surface<br>• MCP as the protocol plumbing<br>• ADK as the agent toolkit<br>• BigQuery, Apigee, Cloud Run, and SecOps as the production surfaces</p><p>There’s also a quieter shift sitting next to MCP: the word “skills”: 26 sessions mention it in 2026. But the term is still in transition. Most of those sessions still mean “human training”, “certifications”, “upskilling programs”. But a smaller subset clearly means something else:</p><p>• Automating the UI with Gemini CLI, MCP and Skills<br>• Vibing with Gemini CLI: Custom Web AI solutions and shareable skills<br>• One tool to rule them all: Extending and customizing the Gemini CLI<br>• 10x productivity with the Gemini CLI</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Yfi_QV1-V_wn5dnXfyCTog.png" /><figcaption>→ Browse all <a href="https://fhoffa.github.io/google-cloud-next-2026-unofficial-scrape/index.html?q=Skills">26 Skills sessions</a></figcaption></figure><p>In those sessions, skills starts to mean reusable, executable capability — something you compose and ship, not just something a human learns in a course. The word hasn’t fully crossed over yet. But the direction is visible.</p><p>There’s also a demand signal in the reservations data: 7,778 seats already taken across the MCP sessions, weeks before the conference opens! The six biggest registration totals alone account for well over 1,400:</p><p>• Conversational Analytics agents, API, and MCP — 356 (<a href="https://www.linkedin.com/in/richard-s-kuzma/">Richard Kuzma</a>)<br>• Agent development with BigQuery, ADK, and MCP — 298 (<a href="https://www.linkedin.com/in/guillaume-blaquiere/">guillaume blaquiere</a>)<br>• Build AI agents on Cloud Run — 256 (<a href="https://www.linkedin.com/in/ryanpei/">Ryan Pei</a>)<br>• Google MCP Services — 178 (<a href="https://www.linkedin.com/in/iromin/">Romin Irani</a>)<br>• Beyond the prompt: Build production-ready agents with Google’s MCP servers — 157 (<a href="https://www.linkedin.com/in/tarungumar/">Tarun Gumar</a>)<br>• Unlocking enterprise actions: Bridge your APIs to AI agents with MCP and <a href="https://www.linkedin.com/company/apigee-legacy/">Apigee</a> — 152 (<a href="https://www.linkedin.com/in/drush80/">David Rush</a>)</p><p>And many of the smaller practitioner sessions are already gone (Agent development and AgentOps with BigQuery, ADK, and MCP; Building custom agents with Looker MCP and APIs; Build an AI Agent and Configure an MCP Server on Cloud Run)</p><p>Conference agendas are lagging indicators. CFPs, editorial planning, and speaker lock-in happen months before the event. So this doesn’t prove MCP wins long term. But it does show that well before the discourse settled, Google had already decided MCP was worth putting across one of its biggest stages.</p><p>And if MCP is the obvious signal, skills as executable capability may be the next term worth tracking.</p><p>→ Read more <a href="https://fhoffa.github.io/google-cloud-next-2026-unofficial-scrape/insights.html">insights</a> → Discuss on <a href="https://www.linkedin.com/feed/update/urn:li:activity:7447038163493036032/">LinkedIn</a></p><p><a href="https://www.linkedin.com/feed/update/urn:li:activity:7447038163493036032/">Is MCP dead? I compared the Google Cloud Next session catalogs - 2025 vs 2026: * MCP: 1 → 63 sessions * A2A / agent-to-agent: 0 → 16 * &quot;Protocol&quot;: 0 → 50 This is Google moving from &quot;AI assistant&quot;... | Felipe Hoffa</a></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*F_DP5rVBAKKnClo4.png" /><figcaption>→ Read more <a href="https://fhoffa.github.io/google-cloud-next-2026-unofficial-scrape/insights.html">insights</a></figcaption></figure><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=e19f2d4585c6" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Can AI Understand Fleet Data? Claude Puts Geotab Ace to the Test (MCP Demo)]]></title>
            <link>https://hoffa.medium.com/can-ai-understand-fleet-data-claude-puts-geotab-ace-to-the-test-mcp-demo-98c861e12f03?source=rss-279fe54c149a------2</link>
            <guid isPermaLink="false">https://medium.com/p/98c861e12f03</guid>
            <category><![CDATA[geotab]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[fleet-management]]></category>
            <category><![CDATA[logistics]]></category>
            <category><![CDATA[mcp-server]]></category>
            <dc:creator><![CDATA[Felipe Hoffa]]></dc:creator>
            <pubDate>Fri, 05 Dec 2025 20:24:23 GMT</pubDate>
            <atom:updated>2025-12-05T21:43:45.679Z</atom:updated>
            <content:encoded><![CDATA[<h4>To showcase the capabilities of Geotab Ace, an AI assistant that translates user questions into fleet analytics, we moved beyond manual validation. We used the MCP to give Claude full autonomy: it designed its own test cases, ran them through Geotab, and verified that Ace delivers the accurate, actionable insights a fleet manager would expect. <strong>Read on for the full breakdown, or watch the demo video below.</strong></h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*nuQauhsZamXHpVbR04wwDw.png" /><figcaption>Watch on <a href="https://www.youtube.com/watch?v=jiRCSqKR22A">Youtube</a></figcaption></figure><p>I’m <a href="https://www.linkedin.com/in/hoffa/">Felipe Hoffa</a>, Data and AI Advocate at <a href="https://geotab.com/">Geotab</a>. <strong>Geotab Ace</strong> is our conversational agent: Ask questions in plain English and get answers from your fleet data.</p><p>In this experiment, Claude ran 18 tests, from simple counts to behavioral pattern analysis. Not just checking if Ace gets the right answer, but whether it understands questions like a human would.</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FjiRCSqKR22A%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DjiRCSqKR22A&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FjiRCSqKR22A%2Fhqdefault.jpg&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/b49c8eab173ec5b000072523d1aea570/href">https://medium.com/media/b49c8eab173ec5b000072523d1aea570/href</a></iframe><p>Today’s experiment boils down to one question: <strong>Can Claude verify that Ace truly understands natural language within its domain?</strong></p><p>Here’s what it discovered.</p><h3>The Setup</h3><p>We used an <strong>MCP server</strong> (Model Context Protocol), a standard for connecting AI systems, to<a href="https://github.com/fhoffa/geotab-ace-mcp-demo"> link Claude directly to Ace</a>. Think of MCP as the conversational bridge letting two AIs talk directly. No human in the loop. Claude is running its own mini research project.</p><h3>1. Basic Comprehension</h3><p><strong>Claude’s First Question:</strong></p><blockquote><em>“What can you do?”</em></blockquote><p>Ace lists trip details, GPS logs, vehicle performance, driver safety, and diagnostics, plus tips on being specific and including time frames.</p><p><strong>Claude’s Assessment:</strong> Comprehensive overview, transparency about limitations, well-structured. Now Claude knows what to test.</p><h3>2. Simple to Complex Logic</h3><p>Claude then started simple:</p><blockquote><em>“How many vehicles are in the database?”</em></blockquote><p>Ace handled those cleanly. Then it pushed harder, moving from simple retrieval to multi-condition reasoning:</p><blockquote><em>“Show me vehicles that idled for more than 30 minutes AND drove over 200 miles in the last week.”</em></blockquote><p>Combining multiple conditions, Ace confirmed both had to be met before generating the answer. 10 vehicles matched. Claude was impressed.</p><h3>3. Ambiguity and Human Language</h3><p>This is where Ace surprised Claude. Fleet managers ask all the time: <em>“Which driver is the safest?”</em></p><p>No metric is specified. No definition of “safe” — just a simple question that could mean a dozen different things. Claude deliberately left it vague to see what Ace could do.</p><p>Ace laid out its reasoning, showing how it translated a vast idea into measurable metrics: <strong>safety ranking, predicted collisions, and collision probability.</strong> It didn’t guess; it explained its interpretation before answering.</p><p>Claude noted: <em>This is inference, not just translation. Ace is reasoning about the problem domain.</em></p><h3>4. Colloquialisms</h3><p>Then Claude tried something trickier. A fleet manager might say:</p><blockquote><em>“What’s the deal with vehicle B26? Has it been working hard lately?”</em></blockquote><p>That’s <strong>vague</strong>. Claude used this to see whether Ace could map casual language to specific metrics.</p><ul><li>Ace translated <em>“What’s the deal”</em> into an information request.</li><li><em>“Working hard”</em> became distance, engine hours, and trip count.</li><li><em>“Lately”</em> became 30 days.</li></ul><p>Claude highlighted how Ace understood colloquialisms without formal phrasing.</p><h3>5. Domain Expertise</h3><p>Now it was time to test whether Ace knows Geotab’s world. When something goes wrong, a fleet manager doesn’t dive into technical jargon. They just ask:</p><blockquote><em>“Show me geofence violations last week.”</em></blockquote><p>Ace mapped <em>“geofence violations”</em> to Geotab’s <strong>Zone Exceptions</strong>, calculating week boundaries using proper date logic.</p><p>Claude called this out: Ace knows Geotab terminology and data structures. It’s not just a generic AI translator.</p><h3>6. Logic and Negation</h3><p>Claude then tested negation by asking:</p><blockquote><em>“Which vehicles did not make any trips yesterday?”</em></blockquote><p>Ace explained its two-step logic and found <strong>Demo 05</strong> idle.</p><h3>7. Time and Place</h3><p>Claude tested whether Ace thinks in places and time, not just raw data.</p><blockquote><em>“Compare average daily mileage between weekdays and weekends for the last month.”</em></blockquote><p>Ace explained it aggregates twice: first daily totals, then comparing weekdays to weekends.</p><ul><li><strong>Result:</strong> Weekdays average 6,800 miles; Weekends 5,600.</li></ul><h3>8. Advanced Causal Reasoning</h3><p>Picture this: You are reviewing your fleet dashboard and one vehicle’s fuel stats look off. You want to know if it’s truly using more fuel than usual.</p><blockquote><em>“Find vehicles whose fuel consumption today is more than double their normal average.”</em></blockquote><p>Ace explained it is calculating a baseline average but <strong>excluding today’s data</strong> to establish what’s “normal,” then comparing today against the baseline.</p><p>Claude called this immediately: <strong>That’s causal reasoning.</strong> It recognized that including today’s data would create circular logic.</p><h3>9. Debugging and Resilience</h3><p>Then came the fun part. Claude asked:</p><blockquote><em>“Who are the three least experienced drivers who have driven more than 100 miles this week?”</em></blockquote><p>After hitting a processing timeout, Claude didn’t give up. Like a researcher refining a question after a dead end, it tried a new approach:</p><blockquote><em>“Show me the newest drivers who drove over 100 miles this week.”</em></blockquote><p>That one timed out too, and the team is working on it. But Claude pushed further, testing different approaches, mapping the edges, watching one AI debug another in real time. That’s next-level AI collaboration.</p><h3>10. Patterns and Insights</h3><blockquote><em>“Show me vehicles that consistently speed on the same road segment.”</em></blockquote><p>You probably hear this from operations managers. They don’t want random speeding incidents; they want patterns, problem areas, places where drivers habitually push the limits.</p><p>Ace rounded the GPS coordinates to about <strong>11 meters of precision</strong>, creating location clusters. <strong>Demo Vehicle 41</strong> had 17 speeding events at the same spot. Claude found this “elegant geospatial clustering” was a smart, practical solution.</p><blockquote><em>“Which vehicles increased their mileage by more than 50% compared to last month?”</em></blockquote><p>Ace explained that “last month” refers to the previous full calendar month, not a rolling 30 days. Claude noted this shows Ace understands human concepts of time, not just date arithmetic.</p><blockquote><em>“Find drivers who speed only during rush hour but not at other times.”</em></blockquote><p>This is the kind of insight that reveals behavior patterns. Insights fleet managers use to adjust schedules, reduce stress, or correct routing issues. Maybe stress drives risky driving? Maybe it’s schedule pressure? Understanding <em>when</em> drivers speed tells you <em>why</em> they speed.</p><p>Ace explained it created two groups: Rush Hour Speeders vs. Other Time Speeders, then found the difference. It defined rush hour as <strong>6:00–9:00 AM</strong> and <strong>4:00–7:00 PM</strong>, accounting for time zones.</p><p>Claude observed that Ace’s time zone awareness was spot on — critical for managing multi-time zone fleets.</p><h3>The Verdict</h3><p>After 18 tests, Claude gave a near-perfect score, praising its deep understanding, sharp reasoning, and domain expertise.</p><p><strong>Claude’s Summary:</strong> Ace doesn’t just translate questions; it reasons about the problem domain.</p><p><strong>What stood out?</strong></p><ul><li>Statistical reasoning with circular logic avoidance.</li><li>Geospatial clustering.</li><li>Deep domain expertise in Geotab terminology.</li><li>Set operations from natural language.</li><li>Time zone awareness.</li></ul><p>When Claude hit processing limits, it adapted, reformulated queries, and kept testing. That resilience was fascinating to watch.</p><h3>Why This Matters</h3><p>You don’t have time to write perfect questions. You need systems that get what you mean even when you’re vague. Better understanding means faster decisions, fewer mistakes, and more confident operations.</p><p>Imagine asking, <em>“Which drivers are improving their safety scores?”</em> and getting an instant, accurate answer. Or, <em>“Where are maintenance costs climbing?”</em> or <em>“Which routes are most fuel efficient?”</em> Questions that used to require SQL expertise and analyst time now happen conversationally.</p><p>And now, an AI can validate whether another AI actually understands its domain at scale.</p><h3>What’s Next?</h3><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2F-eID1rXS1p8%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3D-eID1rXS1p8&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2F-eID1rXS1p8%2Fhqdefault.jpg&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/b9cf924316bb5dfaafc1d89f7c0da8f5/href">https://medium.com/media/b9cf924316bb5dfaafc1d89f7c0da8f5/href</a></iframe><p>This opens up some interesting questions.</p><ul><li>How will Ace perform when the data isn’t clean or when it’s handling millions of rows at production scale?</li><li>What if Claude isn’t just testing Ace, but using it as a tool to solve bigger problems? Multiple AI queries orchestrated together, building analytical workflows?</li></ul><p>Well, <strong>MCP</strong> makes this possible: AI using another AI as a capability.</p><p>If you are as excited as I am about where this is headed, follow along. There’s a lot more to explore. This wasn’t just a test of accuracy; it was a glimpse of AI systems that can reason, adapt, and even debug each other. AIs aren’t just answering questions; they are evaluating each other, and that changes what’s possible.</p><p>Check out the <a href="https://www.youtube.com/watch?v=jiRCSqKR22A">full conversation between Ace and Claude</a>. Also <a href="https://linkedin.com/in/hoffa">follow me on LinkedIn</a> for more AI deep dive topics, join the <a href="https://www.reddit.com/r/GEOTAB/">r/geotab subreddit</a> for more Geotab news, or go ahead and try the <a href="https://github.com/fhoffa/geotab-ace-mcp-demo">MCP server I published on GitHub</a>.</p><p>I’m Felipe Hoffa, and I’ll see you on the road.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=98c861e12f03" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[BigQuery vs Snowflake vs Databricks: Which subreddit community beats?]]></title>
            <link>https://hoffa.medium.com/bigquery-vs-snowflake-vs-databricks-which-subreddit-community-beats-385315515494?source=rss-279fe54c149a------2</link>
            <guid isPermaLink="false">https://medium.com/p/385315515494</guid>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[reddit]]></category>
            <category><![CDATA[databricks]]></category>
            <category><![CDATA[snowflake]]></category>
            <category><![CDATA[bigquery]]></category>
            <dc:creator><![CDATA[Felipe Hoffa]]></dc:creator>
            <pubDate>Tue, 16 Sep 2025 23:34:03 GMT</pubDate>
            <atom:updated>2025-09-17T00:42:21.399Z</atom:updated>
            <content:encoded><![CDATA[<h4>Instead of comparing data platforms, I want to know which one has the healthiest community heartbeat on Reddit: BigQuery, Snowflake, or Databricks? I’m not in the business of advocating for any of them anymore, but I’ll always love data and community. So let’s compare the numbers.</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*7UH5TOiuRWV3I92PbeDGXw.png" /><figcaption>Active subreddit engagement rates as reported by reddit, 2025–09–16</figcaption></figure><p>Reddit has grown as the largest internet community, and I’ve been part of growing some subreddits close to my heart. Including:</p><ul><li>I started <a href="https://www.reddit.com/r/bigquery">r/bigquery</a> in 2013. I took it from 0 to 10k subscribers by 2020. After I left Google, it kept growing, but at a slower rate</li><li>In 2020, I reactivated <a href="http://reddit.com/r/snowflake">r/snowflake</a>, taking it from 600 to 18k subscribers by 2024.</li><li>And in 2023, I watched from outside as <a href="https://www.reddit.com/r/databricks">r/databricks</a> went from almost nothing to 9k subscribers at the start of 2025.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/934/0*4kT-Mqxvxjs48JF-.png" /><figcaption>Reddit <a href="https://old.reddit.com/r/bigquery/">/r/bigquery</a> and <a href="https://old.reddit.com/r/snowflake/">/r/snowflake</a> community growth with Felipe’s care (2013–2025)</figcaption></figure><p>So who has the healthiest subreddit community today?</p><h3>Why this matters in an LLM world</h3><p><a href="https://www.semrush.com/blog/ai-mode-comparison-study/">Semrush looked at the top domains cited on LLMs</a> (ChatGPT, Perplexity, AI Mode, AI Overviews) — and found out that Reddit is the top source for their citations (followed by Wikipedia and Youtube):</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*qPJI5P0jTnVcoJhh.png" /><figcaption>Reddit is the top source for LLMs citations, according to <a href="https://www.semrush.com/blog/ai-mode-comparison-study/">Semrush</a></figcaption></figure><h3>The Numbers Tell a Story</h3><p>Looking at these subreddit’s activity is a timely question, as <a href="https://www.reddit.com/r/technology/comments/1nddbx0/reddit_is_dropping_subscriber_counts_on/">Reddit just stopped reporting subscriber counts</a>. Instead, they’re now showing “weekly visitors” and “weekly contributors.” Not all mods are thrilled about this change, but it gives us some interesting insights — as observed on September 16, 2025:</p><p><a href="https://www.reddit.com/r/bigquery"><strong>r/bigquery</strong></a>: Still the largest subscriber base (~19.6k), but surprisingly low engagement (7.5k weekly visitors, 33 weekly contributions).</p><p><a href="https://www.reddit.com/r/snowflake"><strong>r/snowflake</strong></a>: Has likely matched BigQuery’s subscriber count (~18.9k) with much stronger engagement (16k weekly visitors, 172 weekly contributions)</p><p><a href="https://www.reddit.com/r/databricks"><strong>r/databricks</strong></a>: The clear engagement winner despite fewer total subscribers (20k weekly visitors, 424 weekly contributions)</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*PRkUmR6bz2Qq7oy9XU2-jw.png" /><figcaption>Active subreddit engagement rates notes with based on numbers reported by reddit, 2025–09–16</figcaption></figure><h3>What This Tells Us</h3><p><strong>It’s never too late to build a vibrant community.</strong> Databricks entered the game years after the others, but now has the most vibrant community. Sometimes being the newcomer means you try harder.</p><p><strong>Communities need champions to thrive.</strong> r/bigquery’s growth slowed significantly after I left Google. I’m seeing a similar pattern with r/snowflake <a href="https://hoffa.medium.com/after-bigquery-and-snowflake-what-comes-next-felipes-version-7ee50db5d59a">now that I’ve moved on</a>. Don’t just assign someone to manage your subreddit, but nurture those who really care.</p><p><strong>A beating heart is more important than a large body.</strong> Raw subscriber numbers don’t tell the whole story. Databricks has developed a community where people actively participate, ask questions, and help each other.</p><p>I’m impressed by what the Databricks community team has built. My goal is to keep watching and learning from their growth.</p><h3>The road ahead</h3><p>As for me? I just revived <a href="https://www.reddit.com/r/GEOTAB/">r/geotab</a> for the telematics industry — <a href="https://www.linkedin.com/feed/update/urn:li:activity:7348714211255992373/">a space I’m still learning about</a>. That’s the beauty of building communities: you get to learn alongside peers, while connecting with people with shared challenges and interests.</p><p>The data doesn’t lie: the healthiest communities aren’t necessarily the biggest ones. It’s the ones where people show up, contribute, and help each other grow.</p><p>I’m <a href="https://www.linkedin.com/in/hoffa/">Felipe Hoffa</a> — find me on <a href="https://www.linkedin.com/in/hoffa/">LinkedIn</a> (<a href="https://www.linkedin.com/feed/update/urn:li:activity:7373863600911736833/">comments for this post</a>), around the web, and <a href="https://www.instagram.com/fhoffa">on the road</a>.</p><ul><li><a href="https://www.linkedin.com/feed/update/urn:li:activity:7348714211255992373/">After a long break, I&#39;ve chosen my next job. I&#39;m grateful to everyone who opened a door during this search. At first, I tried to figure out who the future winners of this GenAI revolution might be.... | Felipe Hoffa | 237 comments</a></li><li><a href="https://hoffa.medium.com/after-bigquery-and-snowflake-what-comes-next-felipes-version-7ee50db5d59a">After BigQuery and Snowflake: What comes next? (Felipe’s version)</a></li><li><a href="https://www.linkedin.com/posts/hoffa_bigquery-activity-7373863600911736833-fEer?utm_source=share&amp;utm_medium=member_desktop&amp;rcm=ACoAAAADNXQBNbeHsF7w6H4b2zRq4DuGh8-imaA">#bigquery | Felipe Hoffa</a></li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=385315515494" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[After BigQuery and Snowflake: What comes next? (Felipe’s version)]]></title>
            <link>https://hoffa.medium.com/after-bigquery-and-snowflake-what-comes-next-felipes-version-7ee50db5d59a?source=rss-279fe54c149a------2</link>
            <guid isPermaLink="false">https://medium.com/p/7ee50db5d59a</guid>
            <category><![CDATA[developer-relations]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[bigquery]]></category>
            <category><![CDATA[snowflake]]></category>
            <category><![CDATA[ai]]></category>
            <dc:creator><![CDATA[Felipe Hoffa]]></dc:creator>
            <pubDate>Wed, 05 Feb 2025 18:53:28 GMT</pubDate>
            <atom:updated>2025-03-12T06:09:01.607Z</atom:updated>
            <content:encoded><![CDATA[<h4>I’m Felipe Hoffa — a Developer Advocate who became a key player in the meteoric rise of BigQuery and Snowflake. Now on a break, I reflect on how we got here and what comes next.</h4><h3>My career in one chart</h3><p>This chart from <a href="https://db-engines.com/en/ranking_trend">DB-Engines</a> tells a fascinating story. DB-Engines ranks database systems by their popularity, combining factors like technical discussions, job postings, and social media mentions. While traditional databases like Oracle and MySQL have long dominated the top spots, the real story lies in the dramatic rise of cloud-native platforms.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*9REmWr61C9X2kvjVpo2K1Q.png" /><figcaption>The rise of BigQuery and Snowflake’s popularity, as tracked by DB-Engines (plus my annotations)</figcaption></figure><p>When I joined <a href="https://cloudplatform.googleblog.com/2013/06/google-bigquery-bigger-faster-smarter-analytics-functions.html">BigQuery in 2013</a>, it wasn’t even tracked by DB-Engines yet. The platform was so new that many developers weren’t sure what to make of a fully-managed, serverless data warehouse. Over seven years, I worked to change that perception, helping BigQuery grow from an experimental Google service to a cornerstone of modern data architecture.</p><p>In 2020, I made a pivotal <a href="https://www.linkedin.com/posts/hoffa_starting-again-is-quite-an-adventure-i-activity-6715353292727291904-Tw1x?utm_source=share&amp;utm_medium=member_desktop">move to Snowflake</a>, joining shortly before their historic IPO. The timing was perfect — Snowflake was about to become one of tech’s biggest success stories, and I had the opportunity to help shape its narrative among builders and developers. The platform’s growth curve tells the story: what had taken BigQuery seven years to achieve, Snowflake surpassed it and crossed the boundaries to become one of the top 10 databases in the world.</p><p>The annotations show my journey with both platforms, but they also represent a fundamental shift in how companies approach data. I was fortunate to be at the forefront of this transformation, advocating for technologies that would reshape the industry.</p><h3>Background</h3><p>Back in <a href="https://www.linkedin.com/posts/hoffa_its-exactly-10-years-since-my-first-day-activity-6805650636831641600-Dp96/">2010</a>, I was a Software Engineer on a break. Frustrated with the enterprise world in Chile, I decided to take a year off work to figure out what to do with my life. That’s when Google called, and my planned break was cut short with a new life in San Francisco as a Google Software Engineer in 2011. Two years later — when it was time to find a new project — an <a href="https://www.linkedin.com/in/michaelmanoochehri/">unforgettable googler</a> asked me, “<em>Would you like to be BigQuery’s Developer Advocate?</em>” I had never touched BigQuery, but saying yes to that question changed my life.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/616/1*lDLCl5yf7AV86rhA261Zqg.png" /><figcaption>The 2010 email that started my journey from Chile to Google San Francisco</figcaption></figure><h3>What’s a Developer Advocate?</h3><p>For me, a “Developer Advocate” is a “<em>Software Engineer with a License to Speak</em>”. Let me explain: As a Software Engineer, I’m able to speak the same language as our users, and my goal is to get their attention, listen to them, and help them understand and make the best of our platforms.</p><p>My toolbox includes:</p><ul><li>Traveling around the world to <a href="https://www.linkedin.com/pulse/surviving-keynote-disaster-how-we-turned-panic-win-felipe-hoffa-b9aac/">present at conferences</a> and meet customers.</li><li>Writing <a href="https://hoffa.medium.com/">blog posts</a> with engaging stories that promote features, use cases, and spark curiosity to try the same tools I used.</li><li>Making <a href="https://bit.ly/fh-videos">fun videos and shows</a>.</li><li>Using social media to be part of the larger conversation and the moment (beyond Twitter and LinkedIn, I also built and grew the <a href="https://old.reddit.com/r/bigquery/">/r/bigquery</a> and <a href="https://old.reddit.com/r/snowflake/">/r/snowflake</a> subreddits).</li><li>Being a product expert (by answering questions on Stack Overflow I became <a href="https://stackoverflow.com/tags/google-bigquery/topusers">BigQuery’s #2</a> top expert of all time, and <a href="https://stackoverflow.com/tags/snowflake-cloud-data-platform/topusers">Snowflake’s #3</a> top expert of all time).</li><li>Listening to feedback, and delivering user insights inside the company to shape our offerings and channels.</li></ul><h4>Visualizing impact</h4><p>How <a href="https://old.reddit.com/r/bigquery/">/r/bigquery</a> and <a href="https://old.reddit.com/r/snowflake/">/r/snowflake</a> grew on reddit under my stewardship:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/934/1*q_zQp-LhENM8ntF6tG7lmg.png" /><figcaption>Reddit <a href="https://old.reddit.com/r/bigquery/">/r/bigquery</a> and <a href="https://old.reddit.com/r/snowflake/">/r/snowflake</a> community growth with Felipe’s care (2013–2025)</figcaption></figure><p>How the Stack Overflow tags for <a href="https://stackoverflow.com/tags/google-bigquery/topusers">BigQuery</a> and <a href="https://stackoverflow.com/tags/snowflake-cloud-data-platform/topusers">Snowflake</a> grew under my stewardship:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/983/1*moiSEnwxEStEwRjaHSa3Ew.png" /><figcaption>Stack Oveflow <a href="https://towardsdatascience.com/stack-overflow-future-trends-predicting-with-arima-and-bigquery-77d330833329">page views per tag</a> (2017–2024)</figcaption></figure><p>Proving myself as one of the biggest <a href="https://stackoverflow.com/tags/google-bigquery/topusers">BigQuery</a> and <a href="https://stackoverflow.com/tags/snowflake-cloud-data-platform/topusers">Snowflake</a> experts of all time on Stack Overflow:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/910/1*VgDMjJkNAM_27VHtprS5Ww.png" /><figcaption>Top <a href="https://stackoverflow.com/tags/google-bigquery/topusers">BigQuery</a> and <a href="https://stackoverflow.com/tags/snowflake-cloud-data-platform/topusers">Snowflake</a> experts of all time, as per Stack Overflow (screenshot Jan 2025)</figcaption></figure><h3>A Brief Intermission: What Have I Done During My Break?</h3><p>Taking a break hasn’t meant staying still. I’ve been:</p><ul><li><strong>Traveling</strong> — Exploring new places, meeting people, and gathering fresh perspectives</li><li><strong>Engaging with the Community</strong> — Attending <a href="https://www.linkedin.com/posts/hoffa_if-i-lived-anywhere-else-in-the-world-id-activity-7245218051179061249-OK_O">meetups and hackathons</a>, connecting with developers, and staying involved in tech conversations. Being an attendee has brought me new perspectives that I didn’t get as a frequent speaker</li><li><strong>Civic Engagement</strong> — I became a <a href="https://www.linkedin.com/posts/hoffa_im-ready-to-announce-my-newest-label-activity-7247382600086904832-j-ak">U.S. citizen</a>, I went <a href="https://www.linkedin.com/posts/hoffa_how-is-data-made-on-saturday-at-7am-i-boarded-activity-7254875947366039552-UQXu">canvassing</a> for the elections, I did my <a href="https://www.linkedin.com/posts/hoffa_since-my-last-day-at-snowflake-ive-been-activity-7273061078341677057-GmWm">jury duty</a>, and I voted</li><li><strong>And a lot </strong><a href="https://www.linkedin.com/posts/hoffa_my-best-public-speaking-tip-face-your-fears-activity-7286098688391684096--59B/"><strong>more</strong></a>…</li></ul><h4>Personal growth</h4><p>There’s also a more personal story of growth. In 2020, I discovered that I have ADHD, a realization that finally helped me understand how to get the best out of myself. In 2021, I embarked on a journey of self-kindness, which led me to <a href="https://hoffa.medium.com/data-driven-weight-loss-how-i-shed-35-pounds-in-4-months-a4271581510a">lose 40 pounds (~20 kg)</a> in 2023. In 2024 I successfully maintained this weight loss. Beyond the physical transformation, I’ve focused on shifting my mindset, deepening my personal development, and redefining what happiness looks like. Every day, I continue learning and growing, equipping myself with the tools to become the person the future needs me to be.</p><p><a href="https://hoffa.medium.com/data-driven-weight-loss-how-i-shed-35-pounds-in-4-months-a4271581510a">Data-driven weight loss: How I shed 35 pounds in 4 months</a></p><h3>What I’m looking for next</h3><p>As in 2010, I’m <a href="https://www.linkedin.com/posts/hoffa_after-39-years-at-snowflake-its-time-for-activity-7226619126377562112-f67q">again on a break</a>. It’s hard to know what I want next. I never planned to move to the U.S. or to join Google. I didn’t plan to be BigQuery’s Developer Advocate either. When Snowflake first called in 2020, I replied “I see no compelling reason to join”. Then a few months later, I surprised myself by doing exactly that. The pattern here is that I’ve never been able to predict the great opportunities ahead, but I’ve been fortunate to be noticed by amazing people who have pulled me to incredible heights.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/543/1*D241jymnxY20oHT0fNEzVA.png" /><figcaption>Me rejecting an offer to join Snowflake. (Narrator: 83 days later, he joined Snowflake)</figcaption></figure><p>I keep wondering — what will be the next pull into an incredible new story? I’ve spoken with many companies and startups, but I haven’t made up my mind yet. As with Snowflake, I might be tempted to say “no” at first. If I do that to you, please don’t be offended — sometimes I’m silly like that.</p><p>Rather than a checklist of what my next job must be, here’s what I valued from my time at Google and Snowflake:</p><ul><li>I love having an office full of smart colleagues. Working remotely from home isn’t as compelling as an office that sets clear boundaries between work and home. If I joined a startup with no offices, I would probably ask to expense a coworking space.</li><li>I love living in San Francisco. I’m willing to relocate, but probably a lot of companies consider having me in SF a strategic advantage.</li><li>I love <a href="https://www.instagram.com/p/CGaoDolhaYB/">traveling</a>. Google and Snowflake took me around the world to <a href="https://www.linkedin.com/posts/hoffa_as-the-us-goes-into-thanksgiving-mode-i-activity-7132990360670130177-uSzj/">meet developers, customers, and prospects</a>. I’d love to have more of these opportunities.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/678/1*jIEG-D_8SRCnd3J_rbPlVQ.png" /><figcaption>Felipe Hoffa: <a href="https://www.linkedin.com/posts/hoffa_as-the-us-goes-into-thanksgiving-mode-i-activity-7132990360670130177-uSzj/">MVP EMEA 2023</a>, awarded by the Snowflake Sales Engineers who considered me a great resource supporting their teams across the world.</figcaption></figure><ul><li>I love creative freedom. Google and Snowflake let me craft my own narrative and chase what I thought would be interesting to our users. Surprisingly, I had more creative freedom at Google than at Snowflake — contrary to what many might expect from a larger company. But we can discuss that over lunch. :)</li><li>I love managers and companies that understand that everyone has different <a href="https://www.amazon.com/StrengthsFinder-2-0-Tom-Rath/dp/159562015X">strengths</a>. I have a particular set of strengths that I’m proud of, but also areas I don’t excel at. At Google, I had 13 managers over the years, and 4 more at Snowflake. The best ones understood that success comes from aligning people with their strengths, rather than forcing a one-size-fits-all approach. That understanding makes all the difference in both impact and job satisfaction.</li><li>I love community — even while on break <a href="https://www.linkedin.com/posts/hoffa_if-i-lived-anywhere-else-in-the-world-id-activity-7245218051179061249-OK_O/">I love showing up to meetups</a> and making friends with the tech community</li><li><strong>I love data.</strong> Nothing sparks my curiosity more than understanding the world through data. This has been at the core of my tech journey — I learn new tools and methods to analyze the world, and then I share what I’ve discovered. So far, this passion has drawn attention to my work, my stories, and what I have to say.</li><li>I love AI. Every day I use ChatGPT, Claude, DeepSeek, and Gemini — they are changing everything around us, and there’s no choice but to be part of this journey.</li></ul><h3>What can I do for you</h3><p>Maybe you are wondering: Is this all true?</p><p>I’m lucky to have an incredibly public career. Some of it you can’t see (like customer meetings, or internal feedback delivered to shape our products and messaging)— but a lot is out in the world. You can ask people who have witnessed this story, and also the AI bots that have digested my public work.</p><p>Try asking your favorite chatbot “<em>Who is BigQuery’s most famous Developer Advocate</em>”, or questions like “<em>Why would Felipe Hoffa be a great Developer Advocate for &lt;insert name of your company&gt;?</em>”. <a href="https://chatgpt.com/share/67a2ddd7-87d8-8003-8efe-e9105eb552fd">These are the results I got</a>:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*qUTY_FWXIrX_NHXAPUPuIg.png" /><figcaption>Asking ChatGPT “<a href="https://chatgpt.com/share/67a2ddd7-87d8-8003-8efe-e9105eb552fd">Who is the most famous BigQuery Developer Advocate</a>“</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*QcvVJSrV-bhR4_bvp6nfeg.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*-vCw52zsJIMxMX_2k47I1A.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*-e5vkZTirrru8BOR67po-A.png" /><figcaption>Asking ChatGPT “<a href="https://chatgpt.com/share/67a2ddd7-87d8-8003-8efe-e9105eb552fd">Why would Felipe Hoffa be a great Developer Advocate for (insert company). For example, Databricks?</a>“</figcaption></figure><h3>In Summary</h3><p>My journey has been shaped by saying “yes” to unexpected opportunities and being part of transformative stories in data. From helping BigQuery grow from an experimental service to an industry standard, to joining Snowflake right before its historic IPO, I’ve been fortunate to advocate for technologies that changed how the world uses data.</p><p>Now, as AI reshapes our world, I’m looking for my next opportunity to make an impact. If history has taught me anything, I might hesitate at first — but I look forward to working with brilliant people, tackling meaningful challenges, and sharing insights that inspire others.</p><p>If you see a role where my creativity and experience in building communities, explaining complex technologies, and bringing developers and products together could make a difference, let’s talk. You can find me on <a href="https://www.linkedin.com/in/hoffa/">LinkedIn</a>, or let’s meet in person if you’re around San Francisco.</p><p><a href="https://hoffa.medium.com/lessons-learned-10-years-after-moving-to-the-us-to-join-google-92a213dc93f9">Lessons learned, 10 years after moving to the US to join Google</a></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*v7bOfGmAA8rVmUgyupCmig.png" /><figcaption>Video: <a href="https://www.youtube.com/watch?v=XChIl9awtGc">Analyzing data with BigQuery, while re-creating “Lost in Translation”</a> (Tokyo, Japan)</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*nkO2apvQ3Mv7_iGWhaKB8A.png" /><figcaption>Video: Interviewing <a href="https://www.instagram.com/p/CHv8uBChCzk/">Trevor Noah for Snowflake</a> (video not available)</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*p1iUMyD2SJI0L1DEsZsVAg.png" /><figcaption>Video: “<a href="https://www.youtube.com/watch?v=7bW7P-8SJ_4">Cooking data: BigQuery ML vs the online dating spam</a>” (Dresden, Germany)</figcaption></figure><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2Fb4z8xJlTU8c%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3Db4z8xJlTU8c&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2Fb4z8xJlTU8c%2Fhqdefault.jpg&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/5e44a3b4902282bb8b975c4f1df99eca/href">https://medium.com/media/5e44a3b4902282bb8b975c4f1df99eca/href</a></iframe><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2Fb2_zBcZuO3E%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3Db2_zBcZuO3E&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2Fb2_zBcZuO3E%2Fhqdefault.jpg&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/edb29c92d3b76ef89503eee0a203be37/href">https://medium.com/media/edb29c92d3b76ef89503eee0a203be37/href</a></iframe><p><em>“I don’t know what’s next — but if history repeats itself, it’ll be unexpected, impactful, and shaped by the people I meet along the way.”</em></p><p><a href="https://www.linkedin.com/posts/hoffa_i-joined-the-google-bigquery-team-before-activity-7292987125686288387-5hxP/">Felipe Hoffa on LinkedIn: I joined the Google BigQuery team before it even registered on the map and... | 39 comments</a></p><p>Ps: Thanks everyone for the awesome feedback and memories. I’d love to highlight <a href="https://qorbani.medium.com/ive-been-a-huge-fan-of-everything-you-ve-accomplished-from-bigquery-to-snowflake-and-i-want-to-4fe325c2ceb3">this one</a> from <a href="https://medium.com/u/3c6236b95149">Reza Qorbani</a>:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/706/1*-trrFXi-HXVHR7VKOgpYVA.png" /></figure><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=7ee50db5d59a" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Surviving a Keynote Disaster: How We Turned Panic into a Win]]></title>
            <link>https://hoffa.medium.com/surviving-a-keynote-disaster-how-we-turned-panic-into-a-win-3708c08d5dc2?source=rss-279fe54c149a------2</link>
            <guid isPermaLink="false">https://medium.com/p/3708c08d5dc2</guid>
            <category><![CDATA[data-engineering]]></category>
            <category><![CDATA[public-speaking]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[snowflake]]></category>
            <dc:creator><![CDATA[Felipe Hoffa]]></dc:creator>
            <pubDate>Fri, 12 Jul 2024 02:47:20 GMT</pubDate>
            <atom:updated>2024-07-17T02:10:17.284Z</atom:updated>
            <content:encoded><![CDATA[<h4>What would you do if you’re on stage delivering a keynote demo in front of thousands of people — and suddenly your computer dies? This is exactly what happened to me</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*1ldhZrmy8clvP9Bw" /><figcaption>Facing defeat in front of thousands</figcaption></figure><p>In <a href="https://www.youtube.com/watch?v=b2_zBcZuO3E">this video</a>, you can see me panic. I’m up on stage with Polita, ready to deliver an exciting demo we’ve rehearsed countless times. An audience of thousands is watching us, but even more nerve wracking, the company’s founders and CFO are judging us from the first row. We need to go through the motions, following the exact lines that the teleprompter is feeding us. But suddenly, my computer dies.</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2Fb2_zBcZuO3E%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3Db2_zBcZuO3E&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2Fb2_zBcZuO3E%2Fhqdefault.jpg&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/edb29c92d3b76ef89503eee0a203be37/href">https://medium.com/media/edb29c92d3b76ef89503eee0a203be37/href</a></iframe><p>As I’m typing through the demo, I realize something’s wrong. “<em>Where did my monitor go?</em>” you can hear me squeak. I try a little joke, and you can hear the audience laugh. That’s reassuring — they are supporting us, but we need to find a way out of this. I fill the void by loudly wondering how to fill time while the computer turns back on. But it doesn’t. I even try unplugging and plugging the monitor cable. Maybe that will help. It doesn’t.</p><p>And now the audience is clapping! It’s one of the best feelings — whatever happens, I can feel the support of Polita and the audience. We’re going to get through this excruciating moment together.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/992/0*gBLi9I8epZGlF1gz" /><figcaption>An amazing audience, including Snowflake founders and CFO (Thierry Cruanes, Benoit Dageville, Mike Scarpelli). No panic.</figcaption></figure><p><em>“I have no idea how to continue without the computer,</em>” you can hear my disappointment. “<em>We might need to skip this demo.</em>” Polita wonders if we could nevertheless give chocolates to everyone in the audience. But the audience has already noticed a miracle, even while Polita and I are still fixated on the dead computer: my code is back on the big screen.</p><p>What happened is that the team backstage decided to play back a recording of the demo. I look up and think my computer is somehow back. But it’s not; it’s moving by itself. When I realize this, I know I have to start narrating the demo — not at the rhythm of the teleprompter, but at whatever tempo the recording is playing.</p><p>We then deliver the demo we wanted to give — me following the recording, while Polita adjusts to the answers I’m giving to our already scripted conversation. With Snowflake, we translate thousands of restaurant reviews to English, rate them, extract the top topics that people complain about, and use a Snowflake Cortex SQL function to compose an email to the restaurant owner using the data we just compiled by calling LLMs with SQL. Isn’t that cool?</p><p>So what can we learn from this?</p><ol><li><strong>Embrace Audience Support</strong>: Having a demo fail can happen to anyone, and the audience will join you with their support and love. Enjoy the moment, and the audience will enjoy it with you.</li><li><strong>Have a Backup Plan</strong>: In this case, the backstage team was ready to play the recording, even if in my panic, I had forgotten about that.</li><li><strong>Rehearse the Backup Plan</strong>: It’s good to have a backup plan, but we could have rehearsed it too — then we would have felt even safer while going through it. Luckily, Polita and I were ready to play.</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*MTwqImjUVoFvQfmI" /><figcaption>Polita Paulus, Amanda Kelly, Felipe Hoffa, and Marie Coolsaet keynote discussion</figcaption></figure><p>My infinite thanks to <a href="https://www.linkedin.com/in/polita-paulus-27451aba?trk=article-ssr-frontend-pulse_little-mention">Polita Paulus</a>, <a href="https://www.linkedin.com/in/julian-forero?trk=article-ssr-frontend-pulse_little-mention">Julian Forero</a> (who made the whole keynote happen), and everyone else involved on and off-stage, including the awesome audience we had that morning.</p><p>The best part of a keynote disaster? Countless people came afterward to express their amusement and happiness at having lived through those moments with us. They felt the panic, thought everything was lost, and then could feel the accomplishment of saving the day, learning new SQL tricks, and even getting some delicious chocolate as a reward.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*TcJAmY4aRqFCxziv" /><figcaption>Dash Desai, Amanda Kelly, Polita Paulus, Jeff Hollan, Felipe Hoffa, and Marie Coolsaet thanking the audience for their attention.</figcaption></figure><p>If you want to learn how to translate, do sentiment analysis, extract topics, and write emails with Snowflake Cortex, check out the <a href="https://www.youtube.com/watch?v=XCDEty-rs4s&amp;trk=article-ssr-frontend-pulse_little-text-block">demo video I had pre-recorded</a> , and this <a href="https://quickstarts.snowflake.com/guide/customer_reviews_analytics_using_snowflake_cortex/index.html?trk=article-ssr-frontend-pulse_little-text-block">quickstart guide</a>.</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FXCDEty-rs4s%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DXCDEty-rs4s&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FXCDEty-rs4s%2Fhqdefault.jpg&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/7dd627af895c2acb547455feca09cddc/href">https://medium.com/media/7dd627af895c2acb547455feca09cddc/href</a></iframe><p><a href="https://quickstarts.snowflake.com/guide/customer_reviews_analytics_using_snowflake_cortex/index.html">Customer Reviews Analytics using Snowflake Cortex</a></p><h3>Want more?</h3><p>Try this out with a <a href="http://bit.ly/sf-free-trial">Snowflake free trial</a> account — you only need an email address to get started.</p><p>I’m Felipe Hoffa, Data Cloud Advocate for Snowflake. Thanks for joining me on this adventure. You can <a href="https://www.threads.net/@fhoffa">follow me on Threads</a> and <a href="https://www.linkedin.com/in/hoffa/">LinkedIn</a> (while increasingly less on <a href="https://twitter.com/felipehoffa">Twitter</a>). And subscribe to <a href="https://www.reddit.com/r/snowflake/">reddit.com/r/snowflake</a> for the most interesting Snowflake news.</p><p><em>Originally published at </em><a href="https://www.linkedin.com/pulse/surviving-keynote-disaster-how-we-turned-panic-win-felipe-hoffa-b9aac/?trackingId=5UJOHITOR%2F%2B92lwUPOISTQ%3D%3D"><em>https://www.linkedin.com</em></a><em>.</em></p><p><a href="https://www.linkedin.com/pulse/surviving-keynote-disaster-how-we-turned-panic-win-felipe-hoffa-b9aac/">Surviving a Keynote Disaster: How We Turned Panic into a Win</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=3708c08d5dc2" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Uncovering the new Snowflake UDAFs with Apache DataSketches]]></title>
            <link>https://hoffa.medium.com/uncovering-the-new-snowflake-udafs-with-apache-datasketches-ceeca5d22985?source=rss-279fe54c149a------2</link>
            <guid isPermaLink="false">https://medium.com/p/ceeca5d22985</guid>
            <category><![CDATA[data-engineering]]></category>
            <category><![CDATA[snowflake]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[sql]]></category>
            <category><![CDATA[python]]></category>
            <dc:creator><![CDATA[Felipe Hoffa]]></dc:creator>
            <pubDate>Wed, 29 May 2024 17:46:20 GMT</pubDate>
            <atom:updated>2024-05-29T20:16:46.069Z</atom:updated>
            <content:encoded><![CDATA[<h4>Snowflake now supports creating your own user-defined aggregate functions (<a href="https://docs.snowflake.com/en/developer-guide/udf/python/udf-python-aggregate-functions">UDAFs</a>) in Python. Let’s discover them by implementing <a href="https://datasketches.apache.org/">Apache DataSketches</a> HLL approximate counts within the new Snowflake UDAFs.</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*C8dax-ZxyEAWRzMa" /><figcaption>Image created with AI</figcaption></figure><p><a href="https://medium.com/snowflake/python-user-defined-aggregate-functions-now-in-public-preview-3868c51bbc62">Python User-Defined Aggregate Functions: now in Public Preview</a></p><h3>What’s Apache DataSketches?</h3><p>From the <a href="https://datasketches.apache.org/">Apache DataSketches homepage</a>:</p><p><em>In the analysis of big data there are often problem queries that don’t scale because they require huge compute resources and time to generate exact results. Examples include count distinct, quantiles, most-frequent items, joins, matrix computations, and graph analysis.</em></p><p><em>If approximate results are acceptable, there is a class of specialized algorithms, called streaming algorithms, or </em><a href="https://datasketches.apache.org/docs/Background/SketchOrigins.html"><em>sketches</em></a><em> that can produce results orders-of magnitude faster and with mathematically proven error bounds. For interactive queries there may not be other viable alternatives, and in the case of real-time analysis, sketches are the only known solution.</em></p><p>Snowflake has its own approximate aggregation function (<a href="https://docs.snowflake.com/en/sql-reference/functions/hll">HLL</a>, <a href="https://docs.snowflake.com/en/sql-reference/functions/approximate_jaccard_index">APPROXIMATE_JACCARD_INDEX</a>, <a href="https://docs.snowflake.com/en/sql-reference/functions/approx_top_k">APPROX_TOP_K</a>, <a href="https://docs.snowflake.com/en/sql-reference/functions/approx_percentile">APPROX_PERCENTILE</a>). These functions perform much faster than the UDAFs we are going to implement in this post — however this exercise will be interesting for cases were we need compatibility with systems outside Snowflake. This is also a good way to discover the implementation and design decisions behind the Snowflake UDAFs, so let’s get started.</p><h3>How much faster are the native Snowflake approximate functions?</h3><p>I started with an exact distinct count of the customers on TPCH_SF10. This took 6.6s on a Small-wh:</p><pre>select count(distinct c_name)<br>from snowflake_sample_data.tpch_sf1000.customer<br>group by &#39;x&#39;<br>-- 150000000<br>-- 6.6s S</pre><p>Meanwhile Snowflake’s native <a href="https://docs.snowflake.com/en/sql-reference/functions/hll">HLL</a> implementation can get a similar result in 0.9s:</p><pre>select approx_count_distinct(c_name)<br>from snowflake_sample_data.tpch_sf1000.customer<br>group by &#39;x&#39;<br>-- 148133819<br>-- 0.9s S<br>;</pre><p>That’s pretty good. Then my implementation of Apache DataSketches HLL inside a Snowflake UDAF takes 36s to do something similar:</p><pre>select apache_sketches_hll(c_name)<br>from snowflake_sample_data.tpch_sf1000.customer<br>group by &#39;x&#39;<br>-- 152248026.5622<br>-- 36s S-wh<br>; </pre><p>That’s not spectacular, but still gets us the compatibility we might need with other systems. I also checked the previously available <a href="https://docs.snowflake.com/en/developer-guide/udf/python/udf-python-tabular-functions">Snowflake Python UDTFs</a> (table functions), which took 51s:</p><pre>select apache_sketches_hll_udtf_sketch_union(<br>  array_agg(b.sketch)<br>)<br>from snowflake_sample_data.tpch_sf1000.customer a<br>  , table(apache_sketches_hll_udtf_sketch(c_name) )b<br>-- 152248026.5622<br>-- 51s S<br>;</pre><p>That shows that the Python UDAFs are not only easier to use due to their more traditional syntax in SQL, they are also faster thanks to better parallelization.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/710/1*MhVpgK7_r_yPb-UqskQ0LQ.png" /><figcaption>Measuring the performance of approx count distinct over 150 million rows on a Snowflake Small-wh</figcaption></figure><p>Let’s now go deeper into how these were implemented.</p><h3>Apache DataSketches HLL in a Python UDAF</h3><p>This is how to implement Apache DataSketches’ HLL approximate count distinct in a Python UDAF:</p><pre>create or replace aggregate function apache_sketches_hll(a string)<br>returns float<br>language python<br>packages = (&#39;datasketches&#39;)<br>runtime_version=3.11<br>handler = &#39;X&#39;<br>as $$<br><br>from datasketches import hll_sketch, hll_union, tgt_hll_type<br><br>class x:<br>  def __init__(self):<br>    self._sketch = hll_sketch(12)<br><br>  @property<br>  def aggregate_state(self):<br>    return self._sketch.serialize_compact()<br><br>  def accumulate(self, input_value):<br>    self._sketch.update(input_value)<br>    <br>  def merge(self, other_partial_sum):<br>    union = hll_union(12)<br>    union.update(self._sketch)<br>    union.update(hll_sketch.deserialize(other_partial_sum))<br>    self._sketch = union.get_result()<br><br>  def finish(self):<br>    return self._sketch.get_estimate()<br>$$;</pre><p>As seen above, using it is trivial:</p><pre>select apache_sketches_hll(c_name)<br>from snowflake_sample_data.tpch_sf1000.customer<br>group by &#39;x&#39;<br>-- 152248026.5622<br>-- 36s S-wh<br>;</pre><p>Interesting things to notice:</p><ul><li><a href="https://repo.anaconda.com/pkgs/snowflake/">Anaconda already provides</a> datasketches in Snowflake, so getting the required libraries only took requesting them in the UDAF definiton.</li><li>To create a UDAF we need to provide a class with 5 methods: __init__(), aggregate_state(), accumulate(), merge(), finish().</li><li>__init__() takes care of initializing an empty sketch.</li><li>accumulate() looks at each new row of data and adds it to our existing sketch.</li><li>aggregate_state() returns the current state of our sketch.</li><li>merge() is the key that allows Snowflake to parallelize the aggregation through multiple threads — each thread returns partial results and merge() takes care of combining them.</li><li>finish() returns the final result.</li></ul><p>I’m happy to see that the UDAFs are faster than the previously available UDTFs. We’ll implement HLL in a UDTF further down, to confirm this performance gain.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/681/1*iUpO3nup4NLVRZR3dpfhhg.png" /><figcaption>From the <a href="https://docs.snowflake.com/en/developer-guide/udf/python/udf-python-aggregate-functions">Snowflake docs</a></figcaption></figure><h4>Performance review</h4><p>This statistics screenshot comes from a UDAF run that took 46s in a S-wh:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/469/1*7QOIdLbjB0wKiuFeeccERQ.png" /><figcaption>Statistics of running the UDAF over 150,000,000 rows on a S-wh</figcaption></figure><p>In the above screenshot we can see:</p><ul><li>667 partitions were scanned.</li><li>__init__() was called 17 times, which indicates the parallelism of processing those 667 partitions.</li><li>accumulate() was called 150,000,000 times (once per each row), taking in total 406 seconds. The whole query ran in less time thanks to parallelization.</li><li>aggregate_state() was called 16 times, each time to recover the results of one thread started by __init__() (except the last thread).</li><li>merge() was called 16 times, bringing the results of aggregate_state() into another thread.</li><li>finish() was called once, to return the final result.</li></ul><p>The same, but on an XL-wh:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/473/1*bFKh86wshFUa2Q7TJKHNMQ.png" /><figcaption>Statistics of running the UDAF over 150,000,000 rows on a XL-wh</figcaption></figure><p>In this screenshot we can see that the 667 partitions were processed by 128 129 threads in parallelel, bringing the processing time down to 9.4s.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*8pRFJJ_K_vfo-5VF" /><figcaption>Image generated by AI</figcaption></figure><h3>The UDTF alternative</h3><p>When we didn’t have UDAFs, the way to implement this was with a UDTF. I already explored a similar case <a href="https://medium.com/snowflake/bigquerys-hyperloglog-as-a-snowflake-java-udtf-f99e5e84f9cc">with Java and BigQuery’s HLL++</a> — but now let’s check the implementation in a Python UDTF of Apache DataSketches HLL:</p><pre>create or replace function apache_sketches_hll_udtf(input_value string)<br>returns table (total float)<br>language python<br>packages = (&#39;datasketches&#39;)<br>runtime_version=3.11<br>handler=&#39;X&#39;<br>as $$<br>from datasketches import hll_sketch, hll_union, tgt_hll_type<br><br>class X:<br>    def __init__(self):<br>        self._sketch = hll_sketch(12)<br><br>    def process(self, input_value):<br>        self._sketch.update(input_value)  <br><br>    def end_partition(self):<br>        yield (self._sketch.get_estimate(), )<br>$$;</pre><p>You can see that the code above is more concise than the UDAF definition. However the SQL query using it is harder to write and understand:</p><pre>select b.*<br>from snowflake_sample_data.tpch_sf1000.customer a<br>  , table(apache_sketches_hll_udtf(c_name))b<br>-- 50s S<br>;</pre><p>Now only the query is harder to understand, the results don’t match what we need:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/449/1*mirvV80iF9PXDKp04MKolw.png" /><figcaption>We get 16 different results, instead of the total aggregate</figcaption></figure><p>This because the UDTF does its own partitioning (when we don’t define it) to parallelize the query and make it faster. If we try un-parallelizing it then the query takes 12 times longer in the same Small-wh:</p><pre>select b.*<br>from snowflake_sample_data.tpch_sf1000.customer a<br>    , table(apache_sketches_hll_udtf(c_name) over(partition by 1))b<br>-- 617s S-wh<br>;</pre><p>That’s not good. What we really need is to get sketches for each partition, and then aggregate them:</p><pre>create or replace function apache_sketches_hll_udtf_sketch(input_value string)<br>returns table (sketch string)<br>language python<br>packages = (&#39;datasketches&#39;)<br>runtime_version=3.11<br>handler=&#39;X&#39;<br>as $$<br>from datasketches import hll_sketch, hll_union, tgt_hll_type<br>import base64<br><br>class X:<br>    def __init__(self):<br>        self._sketch = hll_sketch(12)<br><br>    def process(self, input_value):<br>        self._sketch.update(input_value)  <br><br>    def end_partition(self):<br>        yield (base64.b64encode(self._sketch.serialize_compact()).decode(&#39;utf-8&#39;), )<br>$$;<br><br>create or replace function apache_sketches_hll_udtf_sketch_union(input_value array)<br>returns float<br>language python<br>packages = (&#39;datasketches&#39;)<br>runtime_version=3.11<br>handler=&#39;x&#39;<br>as $$<br>from datasketches import hll_sketch, hll_union, tgt_hll_type<br>import base64<br><br>def x(arr):<br>    union = hll_union(12)<br>    for sketch in arr:<br>        union.update(hll_sketch.deserialize(base64.b64decode(sketch)))<br>    return union.get_estimate()<br>$$;</pre><p>That’s how we get to a query that looks like this:</p><pre>select apache_sketches_hll_udtf_sketch_union(<br>  array_agg(b.sketch)<br>)<br>from snowflake_sample_data.tpch_sf1000.customer a<br>  , table(apache_sketches_hll_udtf_sketch(c_name) )b<br>-- 152248026.5622<br>-- 51s S<br>;</pre><p>The lesson here is that it’s good to migrate our UDTFs to UDAFs (as long as they are written in Python — Java UDAFs are not yet available). However there’s an important step to consider in this example: What if we want to merge sketch results within UDAFs?</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*QmkRRzTTCiCiBcIj" /><figcaption>Image generated by AI</figcaption></figure><h3>Storing and merging sketches with UDAFs</h3><p>Why would anyone use Apache DataSketches within a Snowflake Python UDAF is they are slower than the native <a href="https://docs.snowflake.com/en/sql-reference/functions/hll">HLL</a> approximate counts?</p><p>Well, the beauty of these sketches comes when achieving compatibility within multiple systems. Currently Apache DataSketches is available in Java, C++, and Python (with the community creating bindings in other languages, like <a href="https://github.com/vlad17/datasketches-rs">Rust</a>).</p><p>It’s good to store partial state sketches in Snowflake, especially if they are coming from other systems. For this to work, we need to split the original HLL UDAF in 3 parts:</p><ul><li>apache_sketches_hll_accumulate(): Produces a sketch representing the probabilist count of elements seen.</li><li>apache_sketches_hll_combine(): Combines multiple sketches (which could be coming from apache_sketches_hll_accumulate() in Snowflake, or from an external systems).</li><li>apache_sketches_hll_estimate(): Transforms a sketch into an estimation.</li></ul><p>My code to define these 3:</p><pre>create or replace aggregate function apache_sketches_hll_accumulate(a string)<br>returns binary<br>language python<br>packages = (&#39;datasketches&#39;)<br>runtime_version=3.11<br>handler = &#39;X&#39;<br>as $$<br>from datasketches import hll_sketch, hll_union, tgt_hll_type<br><br>class X:<br>  def __init__(self):<br>    self._sketch = hll_sketch(12)<br><br>  @property<br>  def aggregate_state(self):<br>    return self._sketch.serialize_compact()<br><br>  def accumulate(self, input_value):<br>    self._sketch.update(input_value)<br>    <br>  def merge(self, other_partial_sum):<br>    union = hll_union(12)<br>    union.update(self._sketch)<br>    union.update(hll_sketch.deserialize(other_partial_sum))<br>    self._sketch = union.get_result()<br><br>  def finish(self):<br>    return self._sketch.serialize_compact()<br>$$;<br><br>create or replace aggregate function apache_sketches_hll_combine(a binary)<br>returns binary<br>language python<br>packages = (&#39;datasketches&#39;)<br>runtime_version=3.11<br>handler = &#39;X&#39;<br>as $$<br>from datasketches import hll_sketch, hll_union, tgt_hll_type<br><br>class X:<br>  def __init__(self):<br>    self._union = hll_union(12)<br><br>  @property<br>  def aggregate_state(self):<br>    return self._union.get_result().serialize_compact()<br><br>  def accumulate(self, input_value):<br>    self._union.update(hll_sketch.deserialize((input_value)))<br>    <br>  def merge(self, other_partial_sum):<br>    self._union.update(hll_sketch.deserialize(other_partial_sum))<br><br>  def finish(self):<br>    return self._union.get_result().serialize_compact()<br>$$;<br><br><br>create or replace function apache_sketches_hll_estimate(sketch binary)<br>returns float<br>language python<br>packages = (&#39;datasketches&#39;)<br>runtime_version=3.11<br>handler = &#39;x&#39;<br>as $$<br>from datasketches import hll_sketch, hll_union, tgt_hll_type<br><br>def x(sketch):<br>    return hll_sketch.deserialize((sketch)).get_estimate()<br>$$;</pre><p>Note above that when moving sketches from external systems to Snowflake (or out) special care needs to be taken care with the binary sketches, by transforming them to/from base64 or similar.</p><p>Using these then becomes easy within SQL:</p><pre>select apache_sketches_hll_accumulate(c_name) sketch<br>from snowflake_sample_data.tpch_sf1000.customer<br>group by left(c_name, 12)<br>-- (151 sketches)<br>-- 41s S<br>;<br><br>select apache_sketches_hll_estimate(sketch)<br>from (<br>    select apache_sketches_hll_combine(sketch) sketch<br>    from table(result_scan(last_query_id(-1)))<br>)<br>-- 152248026.56219986<br>-- 0.9s S-wh<br>;</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/699/1*OsV3Jya-qaB5npCJSJR2JA.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/554/1*_cOHJMvqdG3j1_gxqzUu8g.png" /><figcaption>Estimating cardinality from multiple stored sketches</figcaption></figure><h3>When to use Snowflake HLL, vs Apache DataSketches, vs Google ZetaSketch</h3><ul><li>Use Snowflake’s HLL implementation for the fastest results, but only when the whole life-cycle happens within Snowflake. This because Snowflake has not open-sourced its implementation to be used in other systems.</li><li>Use Apache DataSketches with Snowflake Python UDAFs to achieve compatibility with multiple systems. UDAFs are easy to use within SQL, and have decent performance — while using an Apache project ensures wide industry adoption.</li><li>Use Google <a href="https://github.com/google/zetasketch">zetasketch</a> (HyperLogLog) to achieve compatibility with BigQuery sketches. Note that they have only open-sourced a Java implementation, and Snowflake only has Python UDAFs — so you’ll need to implement Snowflake Java UDTFs instead — as described in <a href="https://medium.com/snowflake/bigquerys-hyperloglog-as-a-snowflake-java-udtf-f99e5e84f9cc">my previous post</a>.</li></ul><h3>Next steps</h3><ul><li>Check the <a href="https://docs.snowflake.com/en/developer-guide/udf/python/udf-python-aggregate-functions">Snowflake UDAFs docs</a>.</li><li>Learn about the multiple interesting classes and use cases of <a href="https://datasketches.apache.org/docs/Architecture/MajorSketchFamilies.html">Apache DataSketches</a>.</li><li>Share your results with us.</li></ul><p><a href="https://docs.snowflake.com/en/developer-guide/udf/python/udf-python-aggregate-functions">Python user-defined aggregate functions | Snowflake Documentation</a></p><h3>Want more?</h3><p>Try this out with a <a href="http://bit.ly/sf-free-trial">Snowflake free trial</a> account — you only need an email address to get started.</p><p>I’m Felipe Hoffa, Data Cloud Advocate for Snowflake. Thanks for joining me on this adventure. You can <a href="https://www.threads.net/@fhoffa">follow me on Threads</a> and <a href="https://www.linkedin.com/in/hoffa/">LinkedIn</a> (while increasingly less on <a href="https://twitter.com/felipehoffa">Twitter</a>). And subscribe to <a href="https://www.reddit.com/r/snowflake/">reddit.com/r/snowflake</a> for the most interesting Snowflake news.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=ceeca5d22985" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Hey Snowflake, send me a fancy HTML email]]></title>
            <link>https://hoffa.medium.com/hey-snowflake-send-me-a-fancy-email-fe04ad2c9888?source=rss-279fe54c149a------2</link>
            <guid isPermaLink="false">https://medium.com/p/fe04ad2c9888</guid>
            <category><![CDATA[data-engineering]]></category>
            <category><![CDATA[snowflake]]></category>
            <category><![CDATA[sql]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[python]]></category>
            <dc:creator><![CDATA[Felipe Hoffa]]></dc:creator>
            <pubDate>Wed, 17 Jan 2024 17:10:17 GMT</pubDate>
            <atom:updated>2024-05-22T03:09:37.921Z</atom:updated>
            <content:encoded><![CDATA[<h3>Hey Snowflake, send me a &lt;fancy&gt; HTML email</h3><h4>Snowflake can now send HTML email notifications, so it’s time to explore how to make the most out of this new ability — including sending pre-rendered image charts with a Snowpark Python stored procedure. Let’s check out how.</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Ep2jt3lI3tAGR4k7uUyixQ.png" /><figcaption>Image generated by AI</figcaption></figure><p>A year ago I published “<a href="https://medium.com/snowflake/hey-snowflake-send-me-an-email-243741a0fe3">Hey Snowflake, send me an email</a>”, showing off Snowflake’s new ability to send email notifications. However, those emails didn’t look too pretty. It’s now time to fix that.</p><p><a href="https://medium.com/snowflake/hey-snowflake-send-me-an-email-243741a0fe3">Hey Snowflake, send me an email</a></p><p>For example, let’s say I want Snowflake to send me an email with the weekly top GitHub projects by number of stars. Following the <a href="https://medium.com/snowflake/hey-snowflake-send-me-an-email-243741a0fe3">previous instructions</a>, the email notification would look like this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/413/1*GsYKAio4DZku8EElqKzlOw.png" /><figcaption>Top GitHub projects email, formatted with Python `tabulate`</figcaption></figure><p>Now we can do better, as we can send an HTML table:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/407/1*ViTj4rIptBCOVdESQomq9A.png" /><figcaption>Top GitHub projects email, formatted as HTML</figcaption></figure><p>That looks decent, but it’s still not pretty enough. For example, we could give that plain HTML table a more modern look:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/496/1*Mfc2Pk6iwXrINxo7VaN5-Q.png" /><figcaption>Top GitHub projects email, formatted as elegant HTML</figcaption></figure><p>And then for extra coolness, we could send any arbitrary pre-rendered plot as an image:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/563/1*P-klC00T-9XY5PUbjHjRTA.png" /><figcaption>Top GitHub projects email, sent with a pre-rendered .png</figcaption></figure><p>Let’s find out how to send these new fancy emails.</p><h3>How-to</h3><h4>Creating images that can be shared via email</h4><p>Being able to publish images fully within Snowflake is cool, so let’s start with those steps:</p><p>The first thing we need is to CREATE STAGE in your Snowflake account to store the pngs:</p><pre>create or replace stage pngstage <br>directory = (enable = true)   <br>encryption=(type = &#39;snowflake_sse&#39;);</pre><p>That looks like a normal <a href="https://docs.snowflake.com/en/sql-reference/sql/create-stage">create stage</a> instruction, with two modifiers:</p><ul><li>With directory = (enable = true) we get the ability to list all the files on this new stage. We don’t need really need this, but a <a href="https://docs.snowflake.com/en/user-guide/data-load-dirtables">directory table</a> will be helpful to keep track and delete any files that are too old.</li><li>Setting encryption=(type = &#39;snowflake_sse&#39;) is important to set up “<a href="https://docs.snowflake.com/en/user-guide/unstructured-intro#label-file-url-server-side-encryption">server side encryption</a>”— otherwise the files we store in this stage won’t be readable outside your Snowflake account.</li></ul><p>Then we can create a Snowflake Python stored procedure that:</p><ul><li>Receives the id of a query that had 2 columns as output (label and count), the labels for these 2 columns, and a title for the chart.</li><li>Uses matplotlib to create an horizontal bar chart with those values.</li><li>Outputs that chart to the stage we created to share these pictures.</li><li>Returns a pre-signed URL that will allow others to view these PNGs from the web and email readers.</li></ul><p>The code:</p><pre>create or replace procedure create_png_web_image(<br>    query_id string<br>    , title string<br>    , x_label string<br>    , y_label string)<br>returns string<br>language python<br>runtime_version = &#39;3.11&#39;<br>packages = (&#39;snowflake-snowpark-python&#39;, &#39;matplotlib&#39;)<br>handler = &#39;run&#39;<br>execute as caller<br>as<br>$$<br>import matplotlib.pyplot as plt<br>import pandas as pd<br>import io<br><br>def save_file(session, buffer, path, dest_filename):<br>    session._conn.upload_stream(buffer, path, dest_filename, compress_data=False, overwrite=True)<br>    return path + &#39;/&#39; + dest_filename<br><br>def run(session, query_id, title, x_label, y_label):  <br>    df = session.sql(&quot;select * from table(result_scan(?))&quot;, params=[query_id]).to_pandas()<br><br>    plt.figure(figsize=(5, 3))<br>    plt.barh(df.iloc[:, 0], df.iloc[:, 1], color=&#39;skyblue&#39;)<br>    plt.xlabel(x_label)<br>    plt.ylabel(y_label)<br>    plt.title(title)<br>    plt.gca().invert_yaxis()  # Invert y-axis for better readability   <br>    plt.tight_layout()<br> <br>    buffer = io.BytesIO()<br>    plt.savefig(buffer, format=&#39;png&#39;)<br>    buffer.seek(0)  # Rewind the buffer<br>    stage = &#39;@pngstage&#39;<br>    path = &#39;web_png&#39;<br>    filename = &#39;png_%s.png&#39; % (query_id)<br>    relative_path = save_file(session, buffer, stage + &#39;/&#39; + path, filename )<br>    return session.sql(&quot;select get_presigned_url(%s, &#39;%s/%s&#39;, 604800) as url&quot; % (stage, path, filename)).collect()[0][&#39;URL&#39;]<br>$$<br>;</pre><p>What’s interesting in this code:</p><ul><li>To make the plot available as a png, we need to save it to a buffer with plt.savefig(buffer, format=&#39;png&#39;).</li><li>Then we can upload that buffer to the stage with the custom function save_file(), that uses the Snowpark internal method session._conn.upload_stream().</li><li>To get a pre-signed URL for that image we run the query select <a href="https://docs.snowflake.com/en/sql-reference/functions/get_presigned_url">get_presigned_url</a>().</li><li>These pre-signed URLs need an expiration time. The maximum is 1 week (604800 seconds). After that time the URLs will stop working.</li><li>TO-DO: Since the URLs will only work for a week, you should automate deleting the generated files when they are not longer needed.</li><li>TO-DO: The above function hard-codes the destination stage, you could make that more flexible.</li><li>TO-DO: The above function creates horizontal bar graphs, but you could <a href="https://matplotlib.org/stable/gallery/index.html">generate anything you want</a> with matplotlib or other libraries.</li></ul><p>A simple SQL stored procedure to send this URL as an image in an email:</p><pre>execute immediate $$<br>declare<br>    signed_png_url string;<br>    formatted string;    <br>begin<br>    signed_png_url := (call create_png_web_image(&#39;01b077e2-0503-d084-001e-248301c57c26&#39;, &#39;Top weekly GitHub projects by stars&#39;, &#39;Stars&#39;, &#39;Project&#39;)); <br>    formatted := &#39;&lt;b&gt;The top GitHub projects by # of stars this week are:&lt;/b&gt;&lt;br/&gt;&lt;img src=&quot;&#39; || signed_png_url || &#39;&quot;&quot;/&gt;&#39;;<br>    call system$send_email(<br>        &#39;my_email_int&#39;, &#39;felipe@example.com&#39;, &#39;Top GitHub weekly projects (png)&#39;, :formatted, &#39;text/html&#39;<br>    );<br>    return &#39;sent: &#39; || :formatted;<br>    end;<br>$$<br>;</pre><p>What’s interesting in this sample code:</p><ul><li>Playing with inline SQL stored procedure can help us iterate quickly over this code until we get exactly what we want.</li><li>With this SQL stored procedure we call the Python stored procedure that generates the image, returns the URL, and we store that URL in a variable.</li><li>With this URL, we create a bit of HTML with the &lt;img src=&#39;&#39;&gt; tag.</li><li>Then we call <a href="https://docs.snowflake.com/en/sql-reference/stored-procedures/system_send_email">system$send_email</a>() containing that bit of HTML, while making sure that the mime type is the newly supported text/html.</li><li>Check my <a href="https://medium.com/snowflake/hey-snowflake-send-me-an-email-243741a0fe3">previous post</a> or the <a href="https://docs.snowflake.com/en/sql-reference/stored-procedures/system_send_email">docs</a> to set up system$send_email() to send emails to your desired destinations.</li><li>TODO: I hard-coded the query id while experimenting, should move to a general purpose stored procedure.</li></ul><p>The query that provides the top weekly GitHub projects by # of stars is (as described in my previous post “<a href="https://medium.com/snowflake/querying-github-archive-with-snowflake-the-essentials-c1f7715133d7">Querying GitHub Archive with Snowflake: The Essentials</a>”):</p><pre>select any_value(repo_name) repo_name, count(distinct actor_id) stars<br>from cybersyn_github_archive.cybersyn.github_events<br>where type = &#39;WatchEvent&#39;<br>and created_at_timestamp &gt; dateadd(day, -7, current_date)<br>group by repo_id<br>order by stars desc<br>limit 10<br>-- 16s S-wh cold<br>-- 1s S-wh warm<br>;</pre><h4>Creating HTML tables that can be shared via email</h4><p>Sending images is cool, but the links will expire in a week and some readers might choose to not display images. So they will see an empty email like this (unless you add more text to it):</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/734/1*MQpqgvxiGYpmnrj-AlPYQg.png" /><figcaption>An empty email without text because the user chose to not display images (or the URL expired)</figcaption></figure><p>Let’s explore an easy way to get an HTML table instead of an &lt;img&gt; tag as seen above:</p><pre>create or replace procedure email_html_formatting(query_id string)<br>returns string<br>language python<br>runtime_version = &#39;3.11&#39;<br>packages = (&#39;snowflake-snowpark-python&#39;)<br>handler = &#39;x&#39;<br>execute as caller<br>as<br>$$<br><br>def x(session, query_id):    <br>    return session.sql(<br>      &quot;select * from table(result_scan(?))&quot;,<br>      params=[query_id]).to_pandas().to_html()<br>$$<br>;</pre><p>The quick solution above:</p><ul><li>Executes a query with Snowpark that brings back the result of the received query id.</li><li>Calls to_pandas().to_html() on those results.</li></ul><p>That gives us the “ugly” classic HTML table:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/291/1*yOamOHlfgRSsNUIyJhPB4A.png" /><figcaption>Classic HTML table</figcaption></figure><h4>Creating a prettier HTML table</h4><p>GMail will only apply CSS styles to a table if these styles are inlined. There are more sophisticated ways of doing this, but a simple replace() in Python did the job for me:</p><pre>create or replace procedure email_html_elegant_table_formatting(query_id string)<br>returns string<br>language python<br>runtime_version = &#39;3.11&#39;<br>packages = (&#39;snowflake-snowpark-python&#39;)<br>handler = &#39;x&#39;<br>execute as caller<br>as<br>$$<br>def x(session, query_id):    <br>    html_table = session.sql(&quot;select * from table(result_scan(?))&quot;, params=[query_id]).to_pandas().to_html()<br>    # https://codepen.io/labnol/pen/poyPejO?editors=1000<br>    html_table = html_table.replace(&#39;class=&quot;dataframe&quot;&#39;, &#39;style=&quot;border: solid 2px #DDEEEE; border-collapse: collapse; border-spacing: 0; font: normal 14px Roboto, sans-serif;&quot;&#39;)<br>    html_table = html_table.replace(&#39;&lt;th&gt;&#39;, &#39;&lt;th style=&quot;background-color: #DDEFEF; border: solid 1px #DDEEEE; color: #336B6B; padding: 10px; text-align: left; text-shadow: 1px 1px 1px #fff;&quot;&gt;&#39;)<br>    html_table = html_table.replace(&#39;&lt;td&gt;&#39;, &#39;&lt;td style=&quot;    border: solid 1px #DDEEEE; color: #333; padding: 10px; text-shadow: 1px 1px 1px #fff;&quot;&gt;&#39;)<br>    return html_table<br>$$<br>;</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/361/1*9yFzo0SIBCF5D_9T-LeBHQ.png" /><figcaption>Styled HTML table</figcaption></figure><ul><li>TO-DO: We could do more sophisticated styling and modifications with BeautifulSoup:</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/621/1*kRGn5b7l6GTB4gZ8dudjqw.png" /><figcaption>Untested code to play with an HTML table using BeautifulSoup.</figcaption></figure><h3>Next steps: Making it production grade</h3><p>This post so far has useful code to generate fancy e-mail notifications inside Snowflake — but more work is needed to make this ready for production.</p><p>For example, it would be great to have:</p><ul><li>A more solid stored-procedure.</li><li>Error handling.</li><li>Max email length awareness.</li><li>Multiple destinataries.</li><li>Leverage the new Snowflake alerts.</li></ul><p>I’ve discussed many of these ideas with <a href="https://medium.com/u/37ed5e1bd358">Ryan M Bacastow</a> (who I’m grateful to for inspiring this post). Hopefully we will able to publish more about this soon.</p><h4>Slack Notifications</h4><p>Two different approaches:</p><ul><li><a href="https://medium.com/u/124b11785b71">Sean Kim</a> published “<a href="https://medium.com/snowflake/send-slack-messages-from-snowflake-with-snowpark-external-network-access-8e3e42a7cde2">Send Slack Messages from Snowflake with Snowpark External Network Access</a>” that uses a External Function to send messages directly to a custom Slack App. Btw, he also uses a similar mechanism to create plots to the one described in this post.</li><li><a href="https://medium.com/u/eb13b4c116f4">TJ Murphy</a> uses custom email addresses to send notifications to Slack, as described in <a href="https://www.linkedin.com/feed/update/urn:li:activity:7153435004625141761?commentUrn=urn%3Ali%3Acomment%3A%28activity%3A7153435004625141761%2C7153469436279095296%29&amp;replyUrn=urn%3Ali%3Acomment%3A%28activity%3A7153435004625141761%2C7153537873093214208%29&amp;dashCommentUrn=urn%3Ali%3Afsd_comment%3A%287153469436279095296%2Curn%3Ali%3Aactivity%3A7153435004625141761%29&amp;dashReplyUrn=urn%3Ali%3Afsd_comment%3A%287153537873093214208%2Curn%3Ali%3Aactivity%3A7153435004625141761%29">his LinkedIn comment</a>:</li></ul><pre>To set up Slack notifications from Snowflake:<br><br>1. Get the Slack notif email address. Right click the channel &gt; View channel details &gt; Integrations &gt; Send emails to this channel<br><br>2. Create a new Snowflake user. Set the user&#39;s email to the Slack notif email address.<br><br>3. Log in as the user. Go to the user profile in Snowsight and click the &quot;Send Verification Email&quot; link.<br><br>4. Go to the Slack channel. You should see the validation email as a message in the channel (see image). Click the verification link.<br><br>5. Create or update a Snowflake Email Integration to add the Slack notif email address.<br><br>6. Call SYSTEM$SEND_EMAIL and get the notifications in Slack!<br><br>-- TJ Murphy</pre><h4>Pdf generation</h4><p><em>Added on 2024–05:</em></p><p><a href="https://stackoverflow.com/a/78515162/132438">Snowflake- IS there any way to generate a PDF report from snowflake table?</a></p><h3>Want more?</h3><p>Try this out with a <a href="http://bit.ly/sf-free-trial">Snowflake free trial</a> account — you only need an email address to get started.</p><p>I’m Felipe Hoffa, Data Cloud Advocate for Snowflake. Thanks for joining me on this adventure. You can <a href="https://www.threads.net/@fhoffa">follow me on Threads</a> and <a href="https://www.linkedin.com/in/hoffa/">LinkedIn</a> (while increasingly less on <a href="https://twitter.com/felipehoffa">Twitter</a>). And subscribe to <a href="https://www.reddit.com/r/snowflake/">reddit.com/r/snowflake</a> for the most interesting Snowflake news.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=fe04ad2c9888" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Testing the Snowflake Query Acceleration Service with a 17 TB table]]></title>
            <link>https://hoffa.medium.com/testing-the-snowflake-query-acceleration-service-with-a-17-tb-table-21677a0d6de6?source=rss-279fe54c149a------2</link>
            <guid isPermaLink="false">https://medium.com/p/21677a0d6de6</guid>
            <category><![CDATA[sql]]></category>
            <category><![CDATA[data-engineering]]></category>
            <category><![CDATA[snowflake]]></category>
            <category><![CDATA[github]]></category>
            <category><![CDATA[data-science]]></category>
            <dc:creator><![CDATA[Felipe Hoffa]]></dc:creator>
            <pubDate>Wed, 10 Jan 2024 17:02:24 GMT</pubDate>
            <atom:updated>2024-01-10T17:02:24.158Z</atom:updated>
            <content:encoded><![CDATA[<h4>Snowflake has an easy way to make queries faster while using the smallest warehouses: The <a href="https://docs.snowflake.com/user-guide/query-acceleration-service">Query Acceleration Service</a> has been ready for production (GA) since <a href="https://docs.snowflake.com/release-notes/2023-02#query-acceleration-service-general-availability">February 2023</a>. Let’s test it right now by scanning 17 terabytes of GitHub events.</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/512/0*B8w1o42KNnQsqy4P" /><figcaption>Image generated by AI</figcaption></figure><h3>Amping up Snowflake’s compute</h3><p>Traditionally Snowflake has offered 2 easy ways of increasing compute power when dealing with larger queries and concurrency:</p><ul><li>Scale your session’s “<a href="https://docs.snowflake.com/en/user-guide/warehouses">virtual warehouse</a>” to a larger size.</li><li>Set up “<a href="https://docs.snowflake.com/en/user-guide/warehouses-multicluster">multi-cluster warehouses</a>” that dynamically add more clusters to deal with peaks of concurrent usage.</li></ul><p>With these 2 basic elements, users are able to set up policies to control costs and divide resources between different teams and workloads.</p><p>For example — for my Snowflake experiments I usually do everything on my own “Small Warehouse”. This keeps costs low, and it’s usually pretty fast and predictable. I only need to scale to larger warehouses when dealing with huge transformations and extracts, like the example we are going to play with today.</p><p><a href="https://docs.snowflake.com/en/user-guide/warehouses-overview">Scaling a warehouse up</a> to get faster results on slower queries is super easy, barely an inconvenience. I can jump at any moment from a “Small WH” to an “Extra Large WH”, to a “Medium WH”, to a “4X-Large WH”, etc. This is cool, but then the question becomes: “<em>How can I tell exactly what’s the best WH size for my upcoming queries?</em>”</p><p>Instead of resizing warehouses, it would be really cool if I could run my whole session on a “Small WH” (or an “Extra Small WH”) and then I could have Snowflake automatically intercept my larger queries, and run them with way more resources in a “magic serverless” way.</p><p>And that’s exactly what the new <a href="https://docs.snowflake.com/user-guide/query-acceleration-service">Query Acceleration Service</a> does. Let’s test it out here (with a Snowflake <a href="https://docs.snowflake.com/en/user-guide/intro-editions">Enterprise Edition</a> account).</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2Fp9hvL8k1Fi0&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3Dp9hvL8k1Fi0&amp;image=http%3A%2F%2Fi.ytimg.com%2Fvi%2Fp9hvL8k1Fi0%2Fhqdefault.jpg&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/fd7070b14c7d275e1204c05b7290be5c/href">https://medium.com/media/fd7070b14c7d275e1204c05b7290be5c/href</a></iframe><h3>Extracting data from GitHub’s 17 Terabyte Archive</h3><p><a href="https://medium.com/snowflake/querying-github-archive-with-snowflake-the-essentials-c1f7715133d7">Querying GitHub Archive with Snowflake: The Essentials</a></p><p>For this experiment we are going to look into <a href="http://gharchive.org">GH Archive </a>— a collection of all GitHub events. I did a lot of experiments with it in my past life at Google, and now <a href="https://www.cybersyn.com/product/github-archive-events/">Cybersyn</a> has made a copy of the <a href="https://app.snowflake.com/marketplace/listing/GZTSZAS2KJ3">GH Archive on the Snowflake Marketplace</a>.</p><p>To bring this dataset into your Snowflake account, just ask your Account Admin to <a href="https://app.snowflake.com/marketplace/listing/GZTSZAS2KJ3">import it at no cost</a>:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/611/1*cp6wXODGUurfF2x6oWIjHw.png" /><figcaption>Importing the GH Archive into your Snowflake account</figcaption></figure><p>For more tips, check my post “<a href="https://medium.com/snowflake/querying-github-archive-with-snowflake-the-essentials-c1f7715133d7">Querying GitHub Archive with Snowflake: The Essentials</a>”. In the meantime let’s continue with a straightforward example.</p><p>Note in the above screenshot that I renamed the incoming database to GHARCHIVE for cleaner querying.</p><p>Once we have GHARCHIVE in our account, we can see 3 tables — with the main one being events:</p><pre>select count(*)<br>from gharchive.cybersyn.github_events;<br>-- 1.4s<br>-- 6,966,010,260</pre><p>That’s 7 billion rows of rich history — and a lot of data to deal with. The first step when exploring datasets this large should be to extract a subset of rows with the data we are interested in:</p><p>For example, this is the whole history of Apache Iceberg on GitHub:</p><pre>create or replace table gharchive_iceberg<br>as<br>select *<br>from gharchive.cybersyn.github_events<br>where repo_id = 158256479<br>order by created_at<br>-- 19m 52s small<br>-- 96s xxlarge<br>;</pre><p>This table extraction took only 96 seconds on a “2X Large WH”, but some long ~22 minutes on my usual “Small WH”.</p><p>Can the Query Acceleration Service (QAS) help here? There’s a very easy way to tell:</p><pre>select system$estimate_query_acceleration(&#39;01b191db-0603-f84f-002f-a0030023f256&#39;);</pre><p>And the response is “yes” (when using the query id from the ~22m run):</p><pre>{<br>  &quot;queryUUID&quot;:&quot;01b191db-0603-f84f-002f-a0030023f256&quot;,<br>  &quot;status&quot;:&quot;eligible&quot;, <br>  &quot;originalQueryTime&quot;:1191.759,<br>  &quot;estimatedQueryTimes&quot;:{&quot;1&quot;:608,&quot;2&quot;:411,&quot;4&quot;:254,&quot;8&quot;:149,&quot;31&quot;:55},<br>  &quot;upperLimitScaleFactor&quot;:31<br>}</pre><p>Snowflake is telling us that the query took 1191s, and if we had let the QAS service help, it could have taken between 608s and 55s — depending on the max scaling factor we would allow it (in this case, up to 31).</p><p>To test QAS, I created a new WH. To make this test more dramatic, I made it an “Extra Small” with unlimited scaling power:</p><pre>use role sysadmin<br>;<br>create warehouse xs_acc<br>warehouse_size = xsmall<br>enable_query_acceleration = true<br>query_acceleration_max_scale_factor = 0<br>;<br>grant usage on warehouse xs_acc to public<br>;</pre><p>If I use the “Extra Small WH with unlimited QAS”, Snowflake now automatically accelerates this query</p><pre>use warehouse xs_acc<br>;<br>create or replace table gharchive_iceberg<br>as<br>select *<br>from gharchive.cybersyn.github_events<br>where repo_id = 158256479<br>order by created_at<br>-- 19m 52s small<br>-- 96s xxlarge<br>-- 77s xs_acc<br>;</pre><p>To check the cost of this QAS query that ran in 77s while I was working within a “Extra Small WH” session, we can check the logs:</p><pre>use role accountadmin<br>;<br>select * from<br>table(information_schema.query_acceleration_history())<br>;<br><br>-- CREDITS_USED WAREHOUSE_NAME NUM_FILES_SCANNED NUM_BYTES_SCANNED<br>-- 0.499199157   XS_ACC         296,389           5,025,053,295,104</pre><p>We can see that the query scanned 5 terabytes of data, for a total cost of 0.4 credits. Depending on the region with an Enterprise Edition Snowflake account that should be around $1.5 dollars.</p><p>In comparison:</p><ul><li><strong>Small WH</strong>, Enterprise edition, <strong>1192s</strong>: $3*2*1192/3600 = <strong>$1.99</strong> (+ time between queries and auto-suspend)</li><li><strong>2XL WH</strong>, Enterprise edition, <strong>96s</strong>: $3*32*96/3600 = <strong>$2.56</strong> (+ time between queries and auto-suspend)</li><li><strong>QAS 21x auto-acceleration</strong>, within a X-Small session, 7<strong>7s</strong>: $3*0.5 = <strong>$1.5 </strong>(serverless model, no auto-suspend needed for the QAS queries — but the XS session kept running for $3*1*67/3600=<strong>$0.06</strong> extra)</li></ul><p>This is the power of the Query Acceleration Service: When it works, we don’t need to worry anymore about re-sizing warehouses, and we can let Snowflake take care of the huge queries that need extra power.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/576/0*Hf6s8pmQ3oPxW7Xq" /><figcaption>Image generated by AI</figcaption></figure><h3>Query Acceleration Caveats</h3><p>QAS is Generally Available (GA) in Snowflake and ready for you to use.</p><p>However you will notice that it’s picky on which queries it decides to accelerate — and I expect this set of supported queries to grow over time.</p><p>You can find a handy history of the queries in your account that could have been accelerated:</p><pre>SELECT query_id, eligible_query_acceleration_time<br>FROM snowflake.account_usage.query_acceleration_eligible<br>ORDER BY eligible_query_acceleration_time DESC;</pre><p>The docs also list what kind of <a href="https://docs.snowflake.com/user-guide/query-acceleration-service#ineligible-queries">queries are not eligible for acceleration</a> (for now):</p><pre>Some queries are ineligible for query acceleration. The following are common reasons why a query cannot be accelerated:<br><br>- The query does not filter or aggregate.<br>- The filters are not selective enough. Alternatively, the GROUP BY expression has a high cardinality.<br>- There are not enough partitions. If there are not enough partitions to scan, the benefits of query acceleration are offset by the latency in acquiring resources for the query acceleration service.<br>- The query includes a LIMIT clause but does not have an ORDER BY clause.<br>- The query includes functions that return nondeterministic results (for example, SEQ or RANDOM).</pre><p>For example this query could not get QAS during my tests:</p><pre>select min(created_at), max(created_at), current_timestamp()<br>from gharchive.cybersyn.gh_events<br>where repo_id = 158256479<br>limit 10</pre><p>But this one that produces the same results does — thanks to adding group by and order by:</p><pre>select repo_id, min(created_at), max(created_at), current_timestamp()<br>from gharchive.cybersyn.github_events<br>where repo_id = 158256479<br>group by repo_id<br>order by repo_id<br>limit 10<br>-- 23s xs_acc</pre><h3>Next steps</h3><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2Fp9hvL8k1Fi0&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3Dp9hvL8k1Fi0&amp;image=http%3A%2F%2Fi.ytimg.com%2Fvi%2Fp9hvL8k1Fi0%2Fhqdefault.jpg&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/fd7070b14c7d275e1204c05b7290be5c/href">https://medium.com/media/fd7070b14c7d275e1204c05b7290be5c/href</a></iframe><ul><li>To go deeper analyzing GitHub, check my post “<a href="https://medium.com/snowflake/querying-github-archive-with-snowflake-the-essentials-c1f7715133d7">Querying GitHub Archive with Snowflake: The Essentials</a>”.</li><li>Check my conversation and <a href="https://www.youtube.com/watch?v=p9hvL8k1Fi0">live demo featuring Query Acceleration Service with Product Manager</a> <a href="https://medium.com/u/d0bbe48b8d80">Tim Sander</a>.</li><li>Check out how many of your queries could have been accelerated with QAS in your account.</li><li>Report results, and give us feedback on what else you’d like QAS to take care of.</li><li>Try the combination of <a href="https://docs.snowflake.com/en/user-guide/query-acceleration-service#compatibility-with-search-optimization">QAS + Search Optimization</a>.</li></ul><h3>Want more?</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/576/0*v6U-NT6lfBRGmRWp" /><figcaption>Image created by AI</figcaption></figure><p>Try this out with a <a href="http://bit.ly/sf-free-trial">Snowflake free trial</a> account — you only need an email address to get started.</p><p>I’m Felipe Hoffa, Data Cloud Advocate for Snowflake. Thanks for joining me on this adventure. You can <a href="https://www.threads.net/@fhoffa">follow me on Threads</a> and <a href="https://www.linkedin.com/in/hoffa/">LinkedIn</a> (while increasingly less on <a href="https://twitter.com/felipehoffa">Twitter</a>). And subscribe to <a href="https://www.reddit.com/r/snowflake/">reddit.com/r/snowflake</a> for the most interesting Snowflake news.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=21677a0d6de6" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Discover the new Snowpark ML Toolkit + dbt Python models]]></title>
            <link>https://hoffa.medium.com/discover-the-new-snowpark-ml-toolkit-dbt-python-models-4f85b85a8c88?source=rss-279fe54c149a------2</link>
            <guid isPermaLink="false">https://medium.com/p/4f85b85a8c88</guid>
            <category><![CDATA[python]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[dbt]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[snowflake]]></category>
            <dc:creator><![CDATA[Felipe Hoffa]]></dc:creator>
            <pubDate>Wed, 20 Sep 2023 15:58:09 GMT</pubDate>
            <atom:updated>2023-09-23T04:26:03.233Z</atom:updated>
            <content:encoded><![CDATA[<h4>Let’s do some feature engineering, training, and inference with Snowpark ML and the dbt Python models. First with with 50k rows and then with 50M rows, to prove how this new toolkit helps us scale with Snowflake — while the dbt Python models take care of all the boilerplate.</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*-BtMF_6bPlXO83oj" /><figcaption>Image generated by AI</figcaption></figure><h3>Intro</h3><p><a href="https://docs.snowflake.com/en/developer-guide/snowpark-ml/index">Snowpark ML</a> is a new set of tools for building and deploying machine learning models in Snowflake. The best part is that you get familiar ML constructs (Scikit-Learn, XGBoost, LightGBM, …), paired with all the power, security, and scalability of Snowflake.</p><h4>Why I’m writing this</h4><ul><li>In my previous post I shared how <a href="https://hoffa.medium.com/how-snowflake-makes-the-dbt-python-models-shine-a36d22960edb">dbt and Snowflake pair beautifully to power the new dbt Python models</a>.</li><li>The dbt Python models docs offer an in-depth use case of <a href="https://docs.getdbt.com/guides/dbt-ecosystem/dbt-python-snowpark/11-machine-learning-prep">how to do ML prep, cleaning, training, and prediction with dbt Python + Snowpark</a>.</li><li>But — those docs don’t leverage the new Snowpark ML Toolkit — which will make these previous tasks perform even better.</li><li>Meanwhile the Snowpark ML Toolkit quickstart offers an in-depth explanation of <a href="https://quickstarts.snowflake.com/guide/intro_to_machine_learning_with_snowpark_ml_for_python/">how to use this new ML Toolkit</a>.</li><li>But — that quickstart starts by asking us to download and install Python libraries in our VMs. I don’t like that step. I would rather use a cloud tool (like dbt Cloud), and have every library managed “magically”.</li></ul><p>So let’s do that: <em>A full exploration of the </em><a href="https://docs.snowflake.com/en/developer-guide/snowpark-ml/snowpark-ml-modeling"><em>Snowpark ML Toolkit </em></a><em>using the </em><a href="https://docs.getdbt.com/docs/building-a-dbt-project/building-models/python-models"><em>dbt Python models</em></a><em> on </em><a href="https://www.getdbt.com/signup/"><em>dbt Cloud</em></a>. Zero packages installation, just scalable ML fun.</p><p><strong>What’s interesting about dbt in this process</strong>:</p><ul><li>dbt Cloud is not aware of the Snowpark ML libraries, and it doesn’t need to.</li><li>dbt Cloud’s here is mainly to wrap our code within a Snowpark boilerplate stored procedure and let Snowflake handle the rest.</li><li>You could run the same with the <a href="https://www.getdbt.com/product/what-is-dbt/">open source dbt-core</a> on your desktop — I just love how dbt Cloud keeps everything running in the cloud while we work on this code.</li><li>Check <a href="https://medium.com/snowflake/how-snowflake-makes-the-dbt-python-models-shine-a36d22960edb">my previous post</a> for details:</li></ul><p><a href="https://medium.com/snowflake/how-snowflake-makes-the-dbt-python-models-shine-a36d22960edb">How Snowflake makes the dbt Python models shine</a></p><h3>Before we start</h3><p>To follow along, you will need:</p><ul><li>A Snowflake account (<a href="https://signup.snowflake.com/">free trial</a> works).</li><li>A dbt Cloud account (the free developer version works fine — or you could run all this locally, if you enjoy installing stuff).</li><li>Set up a user to connect your dbt Cloud account to your Snowflake account.</li><li>Set up a stage in your Snowflake account so it can read the typical ML example file (diamonds.csv) out of a public S3 bucket:</li></ul><pre>-- create csv format<br>CREATE FILE FORMAT IF NOT EXISTS CSVFORMAT <br>SKIP_HEADER = 1 <br>TYPE = &#39;CSV&#39;;<br><br>-- create external stage with the csv format to stage the diamonds dataset<br>CREATE STAGE IF NOT EXISTS DIAMONDS_ASSETS <br>FILE_FORMAT = CSVFORMAT <br>URL = &#39;s3://sfquickstarts/intro-to-machine-learning-with-snowpark-ml-for-python/diamonds.csv&#39;;</pre><p>(Check out <a href="https://medium.com/snowflake/how-snowflake-makes-the-dbt-python-models-shine-a36d22960edb">my previous post</a> to understand what a dbt Python model is and they work with Snowflake)</p><h4>Our first model: the diamonds data</h4><p>This one is a typical dbt SQL model, that copies the data out of S3 (as set up in the previous step) into a Snowflake table:</p><pre>-- diamonds.sql<br>{{config(materialized=&#39;table&#39;)}}<br><br>select $1::float &quot;CARAT&quot;<br>  , trim(upper($2::string), &#39;&quot;&#39;) &quot;CUT&quot;<br>  , trim(upper($3::string), &#39;&quot;&#39;) &quot;COLOR&quot;<br>  , trim(upper($4::string), &#39;&quot;&#39;) &quot;CLARITY&quot;<br>  , $5::float &quot;DEPTH&quot;, $6::float &quot;TABLE_PCT&quot;<br>  , $7::float &quot;PRICE&quot;<br>  , $8::float &quot;X&quot;, $9::float &quot;Y&quot;, $10::float &quot;Z&quot;<br>from @DIAMONDS_ASSETS</pre><p>Running this took less than 2 seconds — as it’s just a small file with 54k rows (we’ll worry about scalability and performance later in this post).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*zKLLjs53GlUOY-4qMHS9uA.png" /><figcaption>The diamonds table</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*J__yqb31CyL4qqIU" /><figcaption>Image generated by AI</figcaption></figure><h3>Normalizing and One-Hot encoding with Snowpark ML</h3><p>This where the fun starts, check out this dbt Python model:</p><pre># snowpark_ml_diamonds_transform.py<br><br>import numpy as np<br><br>import snowflake.ml.modeling.preprocessing as preproc<br>from snowflake.ml.modeling.pipeline import Pipeline<br>import snowflake.snowpark.types as T<br><br># https://github.com/Snowflake-Labs/sfguide-intro-to-machine-learning-with-snowpark-ml-for-python/blob/main/3_snowpark_ml_model_training_deployment.ipynb<br><br>def model(dbt, session):<br>    dbt.config(<br>        packages = [&#39;snowflake-ml-python&#39;]<br>    )  <br><br>    diamonds_df = dbt.ref(&#39;diamonds&#39;)<br><br>    # Normalize the CARAT column<br>    snowml_mms = preproc.MinMaxScaler(input_cols=[&quot;CARAT&quot;], output_cols=[&quot;CARAT_NORM&quot;])<br>    normalized_diamonds_df = snowml_mms.fit(diamonds_df).transform(diamonds_df)<br>    <br>    # Reduce the number of decimals<br>    new_col = normalized_diamonds_df.col(&quot;CARAT_NORM&quot;).cast(T.DecimalType(7, 6))<br>    normalized_diamonds_df = normalized_diamonds_df.with_column(&quot;CARAT_NORM&quot;, new_col)<br><br>    # Encode CUT and CLARITY preserve ordinal importance<br>    categories = {<br>        &quot;CUT&quot;: np.array([&quot;IDEAL&quot;, &quot;PREMIUM&quot;, &quot;VERY GOOD&quot;, &quot;GOOD&quot;, &quot;FAIR&quot;]),<br>        &quot;CLARITY&quot;: np.array([&quot;IF&quot;, &quot;VVS1&quot;, &quot;VVS2&quot;, &quot;VS1&quot;, &quot;VS2&quot;, &quot;SI1&quot;, &quot;SI2&quot;, &quot;I1&quot;, &quot;I2&quot;, &quot;I3&quot;]),<br>    }<br>    snowml_oe = preproc.OrdinalEncoder(input_cols=[&quot;CUT&quot;, &quot;CLARITY&quot;], output_cols=[&quot;CUT_OE&quot;, &quot;CLARITY_OE&quot;], categories=categories)<br>    ord_encoded_diamonds_df = snowml_oe.fit(normalized_diamonds_df).transform(normalized_diamonds_df)<br>    # Encode categoricals to numeric columns<br>    snowml_ohe = preproc.OneHotEncoder(input_cols=[&quot;CUT&quot;, &quot;COLOR&quot;, &quot;CLARITY&quot;], output_cols=[&quot;CUT_OHE&quot;, &quot;COLOR_OHE&quot;, &quot;CLARITY_OHE&quot;])<br>    transformed_diamonds_df = snowml_ohe.fit(ord_encoded_diamonds_df).transform(ord_encoded_diamonds_df)<br><br>    # return ord_encoded_diamonds_df<br>    return transformed_diamonds_df</pre><p>What’s notable here:</p><ul><li>We don’t need to install snowflake-ml-python as it’s already provided by Anaconda in Snowflake. We just need to ask for packages = [‘snowflake-ml-python’].</li><li>The snowflake.ml.modeling.preprocessing libraries offer libraries that look very similar to sklearn — but Snowflake takes care of making these run and scale within the Snowflake world.</li><li>Within these familiar sklearn constructs, we used here MinMaxScaler, OneHotEncoder, and OrdinalEncoder.</li><li>The last step of this model returns a Snowpark Dataframe, that gets persisted as a Snowflake table with the requested transformations.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*YUW-FGIRGu1NxNny-hwdVA.png" /><figcaption>Transformed and one-hot encoded diamonds</figcaption></figure><ul><li>Running this step took 48s (we’ll check later how it scales with more data).</li><li>The main transformation and persistence step took only 0.44s on a Medium-wh. The query profile looks cool, and hints that Snowflake will make this scale:</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*bSNbtyR6be8jPBXRcQ7y0A.png" /><figcaption>Query profile — transforming, encoding, and persisting</figcaption></figure><h3>A transformation pipeline</h3><p>Let’s re-implement the above with a re-usable Pipeline:</p><pre># snowpark_ml_diamonds_transform_pipeline.py<br><br>import io<br>import joblib<br>import numpy as np<br><br>import snowflake.ml.modeling.preprocessing as preproc<br>from snowflake.ml.modeling.pipeline import Pipeline<br>import snowflake.snowpark.functions as F<br>import snowflake.snowpark.types as T<br><br># https://github.com/Snowflake-Labs/sfguide-intro-to-machine-learning-with-snowpark-ml-for-python/blob/main/3_snowpark_ml_model_training_deployment.ipynb<br># https://docs.getdbt.com/guides/dbt-ecosystem/dbt-python-snowpark/12-machine-learning-training-prediction<br><br>def save_file(session, model, path, dest_filename):<br>    input_stream = io.BytesIO()<br>    joblib.dump(model, input_stream)<br>    session._conn.upload_stream(input_stream, path, dest_filename)<br>    return &quot;successfully created file: &quot; + path<br><br>def model(dbt, session):<br>    dbt.config(<br>        packages = [&#39;snowflake-ml-python&#39;, &#39;joblib&#39;]<br>    )  <br><br>    diamonds_df = dbt.ref(&#39;diamonds&#39;)<br><br>    CATEGORICAL_COLUMNS = [&quot;CUT&quot;, &quot;COLOR&quot;, &quot;CLARITY&quot;]<br>    CATEGORICAL_COLUMNS_OE = [&quot;CUT_OE&quot;, &quot;COLOR_OE&quot;, &quot;CLARITY_OE&quot;] # To name the ordinal encoded columns<br>    NUMERICAL_COLUMNS = [&quot;CARAT&quot;, &quot;DEPTH&quot;, &quot;TABLE_PCT&quot;, &quot;X&quot;, &quot;Y&quot;, &quot;Z&quot;]<br><br>    categories = {<br>        &quot;CUT&quot;: np.array([&quot;IDEAL&quot;, &quot;PREMIUM&quot;, &quot;VERY GOOD&quot;, &quot;GOOD&quot;, &quot;FAIR&quot;]),<br>        &quot;CLARITY&quot;: np.array([&quot;IF&quot;, &quot;VVS1&quot;, &quot;VVS2&quot;, &quot;VS1&quot;, &quot;VS2&quot;, &quot;SI1&quot;, &quot;SI2&quot;, &quot;I1&quot;, &quot;I2&quot;, &quot;I3&quot;]),<br>        &quot;COLOR&quot;: np.array([&#39;D&#39;, &#39;E&#39;, &#39;F&#39;, &#39;G&#39;, &#39;H&#39;, &#39;I&#39;, &#39;J&#39;]),<br>    }<br>    preprocessing_pipeline = Pipeline(<br>        steps=[<br>                (<br>                    &quot;OE&quot;,<br>                    preproc.OrdinalEncoder(<br>                        input_cols=CATEGORICAL_COLUMNS,<br>                        output_cols=CATEGORICAL_COLUMNS_OE,<br>                        categories=categories,<br>                    )<br>                ),<br>                (<br>                    &quot;MMS&quot;,<br>                    preproc.MinMaxScaler(<br>                        clip=True,<br>                        input_cols=NUMERICAL_COLUMNS,<br>                        output_cols=NUMERICAL_COLUMNS,<br>                    )<br>                )<br>        ]<br>    )<br><br>    transformed_diamonds_df = preprocessing_pipeline.fit(diamonds_df).transform(diamonds_df)<br><br>    version = &#39;1.0&#39;<br>    session.sql(&#39;create or replace stage MODELSTAGE&#39;).collect()<br>    save_file(session, preprocessing_pipeline, &#39;@MODELSTAGE/preprocessing_pipeline_&#39;+version, &#39;preprocessing_pipeline_&#39;+version+&#39;.joblib&#39; )<br><br>    return transformed_diamonds_df</pre><p><strong>What’s notable here:</strong></p><ul><li>We created a preprocessing_pipeline — which can be re-used for further training and deployment.</li><li>The save_file() method takes care of encoding and saving this binary encoded pipeline into a Snowflake stage for future re-use.</li><li>We went here for Ordinal Encoding (instead of One-Hot), which is a good thing for the saved pipeline to remember.</li><li>We used snowflake.ml.modeling.pipeline and the encoders in snowflake.ml.modeling.preprocessing— which look familiar, yet optimized for Snowflake.</li><li>The <a href="https://docs.snowflake.com/en/developer-guide/snowpark-ml/snowpark-ml-modeling">Snowflake docs</a> display a * next to the libraries that support distributed execution (and you can expect more on the list to do so).</li><li>The model returns a data frame with all the diamonds pre-processed and ready for training our model — which gets persisted into a Snowflake table.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*mIAFNlzBclXEnDoSlUtSfg.png" /><figcaption>Diamonds ready for ML mining</figcaption></figure><ul><li>The whole process took 49s on a M-wh, with the main transformation and persistence taking only 1.1s. The query profile is interesting too:</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*zOBSVZhm6LNwEGDidyjzoA.png" /><figcaption>Query profile of the transformation pipeline</figcaption></figure><h4>Train and predict with XGBoost</h4><p>Another simple dbt Python model that trains and predicts with XGBoost:</p><pre>import numpy as np<br><br>from snowflake.ml.modeling.xgboost import XGBRegressor<br>import snowflake.snowpark.functions as F<br>import snowflake.snowpark.types as T<br><br># https://github.com/Snowflake-Labs/sfguide-intro-to-machine-learning-with-snowpark-ml-for-python/blob/main/3_snowpark_ml_model_training_deployment.ipynb<br># https://docs.getdbt.com/guides/dbt-ecosystem/dbt-python-snowpark/12-machine-learning-training-prediction<br><br>def model(dbt, session):<br>    dbt.config(<br>        packages = [&#39;snowflake-ml-python&#39;]<br>    )  <br><br>    diamonds_df = dbt.ref(&#39;snowpark_ml_diamonds_transform_pipeline&#39;)<br>    diamonds_train_df, diamonds_test_df = diamonds_df.random_split(weights=[0.9, 0.1], seed=0)<br>    train_df, test_df = diamonds_train_df, diamonds_test_df<br><br>    CATEGORICAL_COLUMNS = [&quot;CUT&quot;, &quot;COLOR&quot;, &quot;CLARITY&quot;]<br>    CATEGORICAL_COLUMNS_OE = [&quot;CUT_OE&quot;, &quot;COLOR_OE&quot;, &quot;CLARITY_OE&quot;] # To name the ordinal encoded columns<br>    NUMERICAL_COLUMNS = [&quot;CARAT&quot;, &quot;DEPTH&quot;, &quot;TABLE_PCT&quot;, &quot;X&quot;, &quot;Y&quot;, &quot;Z&quot;]<br>    LABEL_COLUMNS = [&#39;PRICE&#39;]<br>    OUTPUT_COLUMNS = [&#39;PREDICTED_PRICE&#39;]<br>    regressor = XGBRegressor(<br>        input_cols=CATEGORICAL_COLUMNS_OE+NUMERICAL_COLUMNS,<br>        label_cols=LABEL_COLUMNS,<br>        output_cols=OUTPUT_COLUMNS<br>    )<br><br>    regressor.fit(train_df)<br>    return regressor.predict(test_df)</pre><p><strong>What’s notable here:</strong></p><ul><li>Instead of using the Pipeline we saved in the previous step, we can just use the transformed data persisted at the end. We could have used it, if we wanted to show off using that binary file out of a Snowflake stage.</li><li>We get to divide our diamonds data into training and test with a simple diamonds_df.random_split(weights=[0.9, 0.1]).</li><li>Then we get to train with regressor.fit(train_df) and persist the result of our test data with return regressor.predict(test_df).</li><li>We can use Snowsight for a quick visual check that the predicted prices are pretty close to the expected results:</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*yfHsRyLfeBzf4o_PMeWkfA.png" /><figcaption>Predicted prices vs actual prices</figcaption></figure><ul><li>We are using XGBRegressor as implemented by snowflake.ml.modeling.xgboost. The docs say it hasn’t been prepared for distributed execution yet — but by using it we will get any improvements deployed by Snowflake to snowflake.ml.</li><li>The whole process took 1m28s on a Medium-wh.</li><li>The fit step took 29s. It’s interesting to look at the implementation of the stored procedure happening under the hood:</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*kl2SJoC4hwK7JF3QGFt0VQ.png" /><figcaption>XGBoost training in action inside Snowflake</figcaption></figure><ul><li>The inference step took 8.2s, and the query profile shows how it gets distributed with a Python UDF:</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/601/1*RGCjz0d8OZr_gp9vSZqviA.png" /><figcaption>Predicting price for 10% of the rows (test set)</figcaption></figure><h3>Will it scale?</h3><p>That was quick and cool. Now let’s see if it will scale with 1000x the amount of data.</p><h4>Generating 1000 times more data</h4><p>This is how I generated 1000x random data, based on the existing diamonds, with a dbt Python model:</p><pre># diamonds_random_1000x.py<br>import snowflake.snowpark.functions as F<br>import snowflake.snowpark.types as T<br><br>import random<br><br>def model(dbt, session):<br>    df = dbt.ref(&#39;diamonds&#39;)<br><br>    # Summarize the existing data to get random data in that range<br>    grouped_df = df.groupBy().agg(<br>        F.collect_set(&quot;CUT&quot;).alias(&quot;CUT&quot;),<br>        F.collect_set(&quot;COLOR&quot;).alias(&quot;COLOR&quot;),<br>        F.collect_set(&quot;CLARITY&quot;).alias(&quot;CLARITY&quot;),<br>        F.array_construct(F.min(&quot;CARAT&quot;), F.max(&quot;CARAT&quot;)).alias(&quot;CARAT&quot;),<br>        F.array_construct(F.min(&quot;DEPTH&quot;), F.max(&quot;DEPTH&quot;)).alias(&quot;DEPTH&quot;),<br>        F.array_construct(F.min(&quot;TABLE_PCT&quot;), F.max(&quot;TABLE_PCT&quot;)).alias(&quot;TABLE_PCT&quot;),<br>        F.array_construct(F.min(&quot;X&quot;), F.max(&quot;X&quot;)).alias(&quot;X&quot;),<br>        F.array_construct(F.min(&quot;Y&quot;), F.max(&quot;Y&quot;)).alias(&quot;Y&quot;),<br>        F.array_construct(F.min(&quot;Z&quot;), F.max(&quot;Z&quot;)).alias(&quot;Z&quot;),<br>        F.array_construct(F.min(&quot;PRICE&quot;), F.max(&quot;PRICE&quot;)).alias(&quot;PRICE&quot;),<br>    )<br><br>    @F.udf(input_types=[T.ArrayType()], return_type=T.StringType())<br>    def random_from_array(x):<br>        return random.choice(x)<br><br>    num_rows = df.count() * 1000<br>    range_df = session.range(num_rows).withColumn(&quot;dummy&quot;, F.col(&quot;id&quot;))<br>    exploded_df = grouped_df.crossJoin(range_df)<br><br>    # Generate random values within the specified range for numerical columns<br>    result_df = exploded_df.select(<br>        random_from_array(F.col(&#39;CUT&#39;)).alias(&#39;CUT&#39;), <br>        random_from_array(F.col(&#39;COLOR&#39;)).alias(&#39;COLOR&#39;), <br>        random_from_array(F.col(&#39;CLARITY&#39;)).alias(&#39;CLARITY&#39;), <br>        (F.uniform(F.cast(F.lit(0), T.FloatType()), F.lit(1), F.random()) * (F.col(&#39;CARAT&#39;)[1] - F.col(&#39;CARAT&#39;)[0]) + F.col(&#39;CARAT&#39;)[0]).alias(&#39;CARAT&#39;),<br>        (F.uniform(F.cast(F.lit(0), T.FloatType()), F.lit(1), F.random()) * (F.col(&#39;DEPTH&#39;)[1] - F.col(&#39;DEPTH&#39;)[0]) + F.col(&#39;DEPTH&#39;)[0]).alias(&#39;DEPTH&#39;),<br>        (F.uniform(F.cast(F.lit(0), T.FloatType()), F.lit(1), F.random()) * (F.col(&#39;TABLE_PCT&#39;)[1] - F.col(&#39;TABLE_PCT&#39;)[0]) + F.col(&#39;TABLE_PCT&#39;)[0]).alias(&#39;TABLE_PCT&#39;),<br>        (F.uniform(F.cast(F.lit(0), T.FloatType()), F.lit(1), F.random()) * (F.col(&#39;X&#39;)[1] - F.col(&#39;X&#39;)[0]) + F.col(&#39;X&#39;)[0]).alias(&#39;X&#39;),<br>        (F.uniform(F.cast(F.lit(0), T.FloatType()), F.lit(1), F.random()) * (F.col(&#39;Y&#39;)[1] - F.col(&#39;Y&#39;)[0]) + F.col(&#39;Y&#39;)[0]).alias(&#39;Y&#39;),<br>        (F.uniform(F.cast(F.lit(0), T.FloatType()), F.lit(1), F.random()) * (F.col(&#39;Z&#39;)[1] - F.col(&#39;Z&#39;)[0]) + F.col(&#39;Z&#39;)[0]).alias(&#39;Z&#39;),<br>    )<br><br>    result_df = result_df.withColumn(&quot;PRICE&quot;, F.col(&quot;TABLE_PCT&quot;) * F.col(&quot;Y&quot;))<br>    return result_df</pre><p><strong>What’s notable here:</strong></p><ul><li>This is the step that took me the longest time to write.</li><li>You can see how hard I had to work to find a way to produce a random number between the sample data min and max: (F.uniform(F.cast(F.lit(0), T.FloatType()), F.lit(1), F.random()) * (F.col(‘CARAT’)[1] — F.col(‘CARAT’)[0]) + F.col(‘CARAT’)[0]).alias(‘CARAT’) .</li><li>Then I wasn’t able to find an easy Dataframe way to choose a random value within the summary arrays of the categorical data. The easy solution instead was writing a simple Python UDF (random_from_array()).</li><li>I gave the price a simple pattern depending on a couple of these variables —then it will be XGBoost’s job to discover this pattern within the data.</li><li>Creating these 54 million rows took 1m7s in a Medium-wh, with the main generated query taking 57s. As always, it’s fun to look at that query profile:</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/938/1*RbnVXV5VZNnKF1yd9qTlEg.png" /></figure><h4>Transformation pipeline with 1000x data</h4><p>We are going to skip the code here, as it’s exactly the code as above, but instead of depending on dbt.ref(‘diamonds’) it asks for the just created dbt.ref(‘diamonds_random_1000x’) table.</p><p><strong>What’s notable here:</strong></p><ul><li>The whole process took 1m on a M-wh, with the main transformation and persistence taking only 8.3s.</li><li>These are cool numbers at scale: The original transformation pipeline took 49s on a Medium-wh, with the transformation and persistence taking 1.1s. That’s pretty cool, as we scaled to 54 million rows (from only 54 thousand), taking only a little bit more time.</li><li>The query profile:</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Ss9Ys2-8q9wy_sRQ_Lml8A.png" /><figcaption>Only 8s with 1000x more data</figcaption></figure><h4>Training and predicting with 1000x data</h4><p>We are going to skip the code here, as it’s exactly the code as above. But instead of depending on dbt.ref(‘snowpark_ml_diamonds_transform_pipeline’), it asks for the just created dbt.ref(‘snowpark_ml_diamonds_transform_pipeline_1000x’).</p><p><strong>What’s notable here:</strong></p><ul><li>After 2m43s, this step failed (at first). This is the log:</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*AVvysaLLUf96n_inHbhKtg.png" /><figcaption>Failure log of training with 1000x the data</figcaption></figure><ul><li>The reason is simple, and the fix is too: Our Medium-wh didn’t have enough RAM for XGBoost to train the model with 1000x the data. That’s why Snowflake has now the “<a href="https://docs.snowflake.com/en/user-guide/warehouses-snowpark-optimized">Snowpark Optimized Warehouses</a>” — which are like normal warehouses, but with a lot more RAM and related optimizations.</li><li>Once switched to a Snowpark-Medium-wh, the whole process took 5m53s. This is cool, compared to the previous 1m28s, because now we have 1000x the data.</li><li>The training part took 4m48s (compared to 29s). This is where distributed execution support might help us a lot (when/if Snowflake implements it transparently for you).</li><li>Predicting for 5.4 million rows took 15s (compared to 8.2s with 5.4 thousand rows). The query profile:</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/601/1*YG9NlrqanlVb9VgLeRYMiw.png" /><figcaption>Predicting for 5.4 million rows of diamonds</figcaption></figure><ul><li>Performance note: The Medium-Snowpark-wh had a similar performance than a Large-Snowpark-wh — given that this XGBoost isn’t (yet) distributed.</li><li>The predicted prices versus the actual prices look even better in this chart. We can say that XGBoost successfully found out the linear relationship I chose for the pricing of the random data:</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*544b1rD33w6RP1_iIXqjNw.png" /></figure><h3>Summary</h3><p>As we used dbt to connect all these transformations, it also helps us visualize how each model relates to each other:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*sIIoWbW-WU1rMC-WJsY0xg.png" /><figcaption>How dbt transformed `diamonds` through augmentation, transforming, and training.</figcaption></figure><ul><li>dbt allows us to switch between SQL and Python transformations in a pipeline, and takes care of adding boilerplate code when pushing these to Snowflake.</li><li>snowflake.ml.modeling packages familiar Python ML tools (Scikit-Learn, XGBoost, LightGBM, …) and improves their performance and scalability on Snowflake.</li><li>Scaling from 50 thousand rows to 50 million rows was easy and quick.</li><li>We can expect Snowflake to keep improving these libraries and their scalability.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*O9cN73cePdnYPC58" /><figcaption>Image created by AI</figcaption></figure><h3>Next steps</h3><ul><li>We saw how to persist binaries into a Snowflake stage, but registry and deployment of models could be better handled (stay tuned).</li><li>Check “<a href="https://medium.com/snowflake/ml-on-snowflake-at-scale-with-snowpark-python-and-snowpark-ml-part-2-6491d72a9903">ML on Snowflake at scale with Snowpark (Part-2)</a>” from <a href="https://medium.com/u/260ca8415b6d">Simran Khara</a> and “<a href="https://medium.com/snowflake/ml-on-snowflake-at-scale-with-snowpark-python-and-xgboost-c329c30c2feb">ML on Snowflake at scale with Snowpark Python and XGBoost</a>” from <a href="https://medium.com/u/bee9a1f552ef">Chase Ginther</a>.</li></ul><ul><li><a href="https://medium.com/snowflake/ml-on-snowflake-at-scale-with-snowpark-python-and-snowpark-ml-part-2-6491d72a9903">ML on Snowflake at scale with Snowpark (Part-2)</a></li><li><a href="https://medium.com/snowflake/ml-on-snowflake-at-scale-with-snowpark-python-and-xgboost-c329c30c2feb">ML on Snowflake at scale with Snowpark Python and XGBoost</a></li></ul><p>Suggestions from my previous post:</p><ul><li>Try the dbt Python models in your Snowflake account, and share your results.</li><li>Participate in the dbt community to shape <a href="https://github.com/dbt-labs/dbt-core/discussions/5261">the future of the dbt Python models</a>.</li><li>Read the <a href="https://docs.getdbt.com/docs/building-a-dbt-project/building-models/python-models">dbt Python model docs</a>, and the<a href="https://docs.snowflake.com/en/developer-guide/snowpark/python/working-with-dataframes"> Snowflake Snowpark for Python library docs</a>.</li><li>Share your finding withs the dbt community on the <a href="https://www.getdbt.com/community/">dbt Slack</a> channels <a href="https://getdbt.slack.com/archives/C03QUA7DWCW">#dbt-core-python-models</a> and <a href="https://getdbt.slack.com/archives/CJN7XRF1B">#db-snowflake</a>.</li><li>Try some dbt+Snowflake quickstarts like “<a href="https://quickstarts.snowflake.com/guide/data_engineering_with_snowpark_python_and_dbt/">Data Engineering with Snowpark Python and dbt</a>” and “<a href="https://quickstarts.snowflake.com/guide/leverage_dbt_cloud_to_generate_ml_ready_pipelines_using_snowpark_python/">Leverage dbt Cloud to Generate ML ready pipelines using Snowpark Python</a>”.</li><li>Compare the simplicity, power, and performance of the dbt Python models on Snowflake — versus the set up that dbt had to pull off to run Python models in other platforms.</li><li>Check the previous posts by <a href="https://medium.com/u/f113d7da81ea?source=post_page-----a36d22960edb--------------------------------">Jeremiah Hansen</a> and <a href="https://medium.com/u/2380bc86d427?source=post_page-----a36d22960edb--------------------------------">Eda Johnson</a> “<a href="https://medium.com/snowflake/data-engineering-with-snowpark-python-and-dbt-f57c9cf778bb">Data Engineering with Snowpark Python and dbt</a>” and “<a href="https://medium.com/snowflake/a-first-look-at-the-dbt-python-models-with-snowpark-54d9419c1c72">A First Look at the dbt Python Models with Snowpark</a>”. Also phData’s “<a href="https://www.phdata.io/blog/how-to-use-dbt-with-snowpark-python-to-implement-sentiment-analysis/">How to Use dbt With Snowpark Python to Implement Sentiment Analysis</a>” and “<a href="https://www.phdata.io/blog/python-models-in-dbt/">How to Build a Python Model in dbt with Snowflake</a>”.</li><li>For ML, dbt has an in-depth guide “<a href="https://docs.getdbt.com/guides/dbt-ecosystem/dbt-python-snowpark/1-overview-dbt-python-snowpark">Leverage dbt Cloud to generate analytics and ML-ready pipelines with SQL and Python with Snowflake</a>”, and tropos published “<a href="https://www.tropos.io/blog/how-to/how-to-use-python-for-machine-learning-based-forecasting-in-dbt-snowflake-ecosystem/">Time-Series Forecasting With Python For Snowpark And Dbt Labs</a>”.</li></ul><h3>Want more?</h3><ul><li>Try this out with a <a href="http://bit.ly/sf-free-trial">Snowflake free trial</a> account — you only need an email address to get started.</li><li>Try <a href="https://discover.getdbt.com/free-account/">dbt Cloud</a> — which I used as a cool web IDE integrated to dbt and Snowflake to develop the examples in this post (it does a lot more than that too, but that’s a story for another post).</li></ul><p>I’m Felipe Hoffa, Data Cloud Advocate for Snowflake. Thanks for joining me on this adventure. You can <a href="https://twitter.com/felipehoffa">follow me on Twitter</a> and <a href="https://www.linkedin.com/in/hoffa/">LinkedIn</a>. And subscribe to <a href="https://www.reddit.com/r/snowflake/">reddit.com/r/snowflake</a> for the most interesting Snowflake news.</p><p>Oh, and I’m on <a href="https://threads.net/fhoffa">threads.net/fhoffa</a> now too :)</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=4f85b85a8c88" width="1" height="1" alt="">]]></content:encoded>
        </item>
    </channel>
</rss>