Stories by Felipe Hoffa on Medium

GCP Next ’26: A knowledge graph of 5 million words of videos and slides

Felipe Hoffa — Mon, 01 Jun 2026 11:01:01 GMT

How 1,146 sessions, 444 slide decks, and 323 video transcripts became 8,285 nodes and 24,421 edges of structured knowledge. The challenge: how do you make sense of that much material at once?

Google Cloud Next 2026 - Knowledge Graph Story

The answer was to treat the entire conference as a corpus and build a knowledge graph from it. Here’s how the data pipeline worked, layer by layer.

The Pipeline: Layer by Layer

Session Metadata — 1,146 sessions

The first layer was the structured index from the Google Cloud Next 2026 session explorer. Each entry had a title, description, speaker names, company affiliations, topic tags, room, and timing. This became the skeleton: every session title and speaker was a candidate entity in the graph.

Slide Decks — 444 PDFs, ~2,200 slides analyzed

For 444 of the 1,146 sessions, a slide deck was available on the Google site. Each PDF was downloaded, converted to images, and processed with vision analysis — extracting product names, architecture diagrams, metrics, and key claims from individual slides. This yielded 2,205 slide-level analysis records and per-session brief summaries.

Slides are uniquely valuable: they contain the distilled, intentional message a speaker wants to convey. A bullet point on a slide is more signal than two minutes of verbal framing around it.

YouTube Transcripts — 323 full recordings

For 323 sessions, a YouTube recording was available. Full audio transcripts were retrieved when available, or generated when needed, and stored — capturing the speaker’s actual words, including Q&A, live demos, and audience interactions that never appear in a slide deck.

The combination of slides + transcript for the same session was particularly powerful: slides provided structured claims while transcripts provided context, nuance, and the path that led to each conclusion.

Graphify — Entity & Relationship Extraction

All 477 source files (~5 million words) were fed through graphify, an AI-powered knowledge graph builder that reads raw text and extracts named entities, concepts, products, companies, and people — along with the semantic relationships between them.

Most of the graph was built from directly sourced connections found in the conference material itself. Separately, 114 additional edges were added by cross-document reasoning, and only kept when they cleared a 0.85 confidence threshold.
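As a rough illustration of that filtering rule, here is a minimal sketch. The `Edge` shape, the field names, and the sample edges are all made up for this example; only the 0.85 cut-off comes from the text, and treating "cleared" as "greater than or equal to" is an assumption.

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.85  # cut-off quoted in the article


@dataclass(frozen=True)
class Edge:
    source: str
    target: str
    inferred: bool           # True when the edge came from cross-document reasoning
    confidence: float = 1.0  # hypothetical score attached to inferred edges


def keep_edge(edge: Edge, threshold: float = CONFIDENCE_THRESHOLD) -> bool:
    """Directly sourced edges always stay; inferred ones must clear the threshold."""
    return (not edge.inferred) or edge.confidence >= threshold


# Placeholder edges, not taken from the real graph.
edges = [
    Edge("entity_a", "entity_b", inferred=False),
    Edge("entity_a", "entity_c", inferred=True, confidence=0.91),
    Edge("entity_b", "entity_d", inferred=True, confidence=0.60),
]
kept = [e for e in edges if keep_edge(e)]
print(len(kept))  # 2
```

Keeping the two kinds of edges separate makes it easy to audit how much of the graph rests on inference rather than on the source material.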

Graph Output — 8,285 nodes · 24,421 edges · 44 communities

The resulting graph was then analyzed for community structure using modularity-based detection. 44 distinct communities emerged — each representing a cluster of entities that are more densely connected to each other than to the rest of the graph. Community names are derived from the most distinctive nodes in each cluster.
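To make "modularity-based" a little more concrete, the sketch below scores a given partition with the standard Newman modularity formula for an undirected, unweighted graph. It does not search for communities (the article does not say which detection algorithm was used), and the toy graph is invented for illustration.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Newman modularity Q of a partition of an undirected, unweighted graph."""
    m = len(edges)
    degree = defaultdict(int)
    internal = defaultdict(int)  # edges with both endpoints in the same community
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += 1
    degree_sum = defaultdict(int)  # total degree per community
    for node, d in degree.items():
        degree_sum[community_of[node]] += d
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)


# Toy graph: two triangles joined by a single bridge edge.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"),
             ("d", "e"), ("e", "f"), ("d", "f"),
             ("c", "d")]
split = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
single = {node: 0 for node in split}

q_split = modularity(toy_edges, split)    # 5/14, about 0.357
q_single = modularity(toy_edges, single)  # 0.0
```

A detection algorithm looks for the partition that pushes this score as high as it can. Here the two-triangle split scores higher than lumping everything together, which matches the intuition of "more densely connected to each other than to the rest of the graph".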

Distilling the Ecosystem

The raw graph is beautiful but barely readable.

The simplification works by promoting only the highest-degree hub nodes within each detected community and collapsing everything else around them. Intra-layer connections become solid lines within each bubble; cross-layer connections become dashed lines between zones. The result is a map you can actually read and reason about.
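A minimal sketch of that simplification, under two assumptions: each detected community is treated as one "bubble", and ties in degree are broken alphabetically. The function names and the toy graph are hypothetical, not the actual implementation behind the map.

```python
from collections import defaultdict


def promote_hubs(edges, community_of, hubs_per_community=1):
    """Return the highest-degree nodes of each community."""
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    members = defaultdict(list)
    for node, community in community_of.items():
        members[community].append(node)
    return {
        community: sorted(nodes, key=lambda n: (-degree[n], n))[:hubs_per_community]
        for community, nodes in members.items()
    }


def line_style(u, v, community_of):
    """Solid for connections inside a bubble, dashed for connections between bubbles."""
    return "solid" if community_of[u] == community_of[v] else "dashed"


# Toy graph: two triangles joined by one bridge.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"),
             ("d", "e"), ("e", "f"), ("d", "f"),
             ("c", "d")]
toy_communities = {"a": "left", "b": "left", "c": "left",
                   "d": "right", "e": "right", "f": "right"}

hubs = promote_hubs(toy_edges, toy_communities)
print(hubs)  # {'left': ['c'], 'right': ['d']}
```

Everything that is not a hub would then be collapsed into its bubble, so the reader only sees the hubs and the solid or dashed lines between them.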

The picture is pretty simple: Google appears to be assembling a full enterprise agent stack. At the bottom are the protocols and tools that let agents reach other systems. Then come the control layers that secure them, the data systems that ground them, the model and agent platform that powers them, the infrastructure that runs them, and the products where people actually use them.

The graph makes that structure visible. A long list of launches starts to read like one architecture for enterprise agents.

What Can You Ask This Graph?

Here’s a list of some sample questions we asked the graph. To see the answers click on the interactive dashboard:

See the answer to these questions on the interactive page

What Google Said About Graphs

The graph isn’t just an outside analysis trick. Google repeatedly described agent-ready systems in graph-shaped terms: entities, relationships, connected context, and knowledge that agents can reason over.

In the opening keynote Google introduced Knowledge Catalog as a system that can read files, extract entities, map relationships, and learn business semantics. In What’s new with data and AI governance: Building the catalog for AI, that idea became even clearer: the goal is a connected context layer that agents can reason over directly.

This wasn’t just abstract metadata talk. The source material also includes concrete product pushes around graph-native data systems. Spanner Graph appears in What’s new in Spanner: enterprise-scale AI, search, graph, and analytics, where Google describes multimodel support, SQL PGQ, integrated graph algorithms, graphs on views, and UI-based graph modeling. BigQuery Graph appears in multiple decks as native property-graph support for relationship analytics, visualization, and graph-grounded reasoning on enterprise data.

That’s what makes this feel new. For years, products like BigQuery and Spanner were mainly framed as places to store, query, and scale data. At Next ’26, Google was also positioning them as systems for modeling relationships directly — not just rows and columns, but connected structures that agents and analytical workflows can traverse. This is Google reframing databases for the agent era.

And it goes beyond two product names. Other graph-shaped concepts also show up in the graph itself: BigQuery Property Graph, Property Graph, GQL / ISO GQL, Graph Analytics, Graph RAG, Knowledge Graph Grounding, Gemini Enterprise Agent Platform Graph Neural Networks, and security-oriented nodes like Wiz Security Graph and SCC Security Graph.

That matters because it validates the method of this page. The conference graph here is external and independently built, but it is not conceptually alien to Google’s own framing. If anything, the company’s product story is shifting from isolated data systems toward connected context systems: not just documents, not just tables, and not just prompts, but systems that understand entities, relationships, provenance, and business meaning well enough for agents to act on them.

Where Geotab Landed

Junaid Gill (Associate Vice President, Geotab) joined Greg Brosman (Senior Product Manager, Google Cloud) and John Murray (Group Product Manager, Google Cloud) on governing a secure agentic ecosystem.

Francois-Xavier Jeannet (Team Lead, Data & AI Governance, Geotab) joined Anit Patinker (Lead Product Manager, Google Cloud) and Shelley Hershkovitz (Product Manager, Google Cloud) on agent security at scale.

Daniel Lewis (Distinguished Data Scientist, Geotab) joined Dima Melnyk (Senior Product Manager, Google Cloud) and Alex Martin (Product Manager, Google Cloud) on the agent-quality flywheel.

On the conceptual map, those talks land in a tight zone around platform, governance, and evaluation. One Geotab example helps explain why: their internal hackathon generated 86 agent submissions, and 2 of those later made it into production.

That gap helps explain why Geotab mattered on stage. In earlier years, Geotab often appeared as the customer saying “BigQuery works and here’s how we use it”. This time the posture felt different. AI is still early, but Geotab has already been building agentic systems, and Google is bringing them on stage not merely as a reference customer, but as a partner with production lessons that are helping shape the products now being pushed more broadly.

Two of the three Geotab sessions landed inside the Agent Control Plane: one on governing agents and securing the agentic ecosystem with GEAP, and one on agent security at scale through the OWASP Top 10 for agents. The third landed in the Agent Platform zone, focused on the agent-quality flywheel and GEAP evaluations.

Geotab was there to talk about the hard part: getting agents into production, keeping them secure, and building the evaluation loop that makes them reliable. That is also why the control-plane layer matters so much on this map — with examples of how an exploratory funnel narrows into a small number of production-worthy agents, and why governance, security, evaluation, and operational control matter so much once a team tries to ship them.

The Two Big Takes

First: graphing the conference reveals structure you cannot reliably see by reading sessions one by one. Once 1,146 sessions, 444 slide decks, and 323 transcripts are turned into nodes and edges, repeated patterns become visible: which products sit at the center, which protocols connect otherwise separate systems, which themes cluster together, and which ideas show up across product, security, data, infrastructure, and partner talks.

That is where many of the strongest insights in this story come from. MCP starts to look like connective tissue. BigQuery starts to look like a governed substrate for AI systems. Cloud Run, GKE, security controls, evaluation loops, and enterprise data products start to read as parts of the same architecture.

Second: building the graph was fairly straightforward. Cleaning up the conference corpus was the hard part. The source material was messy: decks, transcripts, session pages, repeated product names, overlapping abstractions, and different levels of specificity. graphify made the workflow feel direct: feed in the corpus, extract grounded entities and relationships, and simplify the result into a graph that a human can actually read.

The graph also shows what Google seems to believe will matter in production. The center of gravity is a working stack: agents connected through MCP and A2A, grounded in enterprise data, deployed on managed runtimes, evaluated continuously, and wrapped in governance, identity, and security controls.

If you want to test that thesis yourself, browse the session explorer, or open the full interactive map and follow the edges yourself. The graph, the map, and the session index feel like the right publishable artifact here, without redistributing downloaded slides or transcripts.

One encouraging part of this project is how manageable it was. It took a few steps, and Graphify made the workflow pretty approachable.

A future attendee could use a graph like this during the next conference to decide which sessions to prioritize, spot the hubs everyone is converging around, and follow emerging ideas across tracks in real time. It works well as a retrospective, and it would be even more useful live.

Read and discuss this knowledge-graph story on LinkedIn →

Originally published at https://fhoffa.github.io.

GCP Next ’26: A knowledge graph of 5 million words of videos and slides was originally published in Geotab Innovators on Medium, where people are continuing the conversation by highlighting and responding to this story.

Is MCP dead? 2025 vs 2026 at GCP Next

Felipe Hoffa — Mon, 06 Apr 2026 22:07:23 GMT

Comparing the Google Cloud Next session catalogs — 2025 vs 2026: MCP: 1 → 63 sessions, A2A / agent-to-agent: 0 → 16, “Protocol”: 0 → 50. Read on for more details.

MCP: 2025 vs 2026 at GCP Next

This is Google moving from “AI assistant” framing toward something much closer to an agent execution stack. The 63 MCP sessions span 32 Breakouts, 11 Lightning Talks, 8 Discussion Groups, 5 Demos, 3 Workshops, 3 Solution Talks, 3 Birds of a Feather, 1 Spotlight, plus a Developer Meetup and a Lounge Session.
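For anyone who wants to run a similar comparison on their own catalog export, a keyword tally could look like the sketch below. The record shape (`title`, `description`), the matching rule, and the sample sessions are assumptions made for illustration; the article does not describe exactly how its counts were produced.

```python
import re


def count_sessions(catalog, pattern):
    """Count sessions whose title or description matches the pattern (case-insensitive)."""
    rx = re.compile(pattern, re.IGNORECASE)
    return sum(1 for s in catalog if rx.search(s["title"] + " " + s["description"]))


# Made-up sessions standing in for two years of catalog data.
catalog_2025 = [
    {"title": "Intro to agents", "description": "Build assistants."},
    {"title": "Tooling deep dive", "description": "Connecting tools with MCP."},
]
catalog_2026 = [
    {"title": "MCP on Cloud Run", "description": "Deploy servers."},
    {"title": "Data agents", "description": "Using mcp servers with warehouses."},
    {"title": "Networking basics", "description": "Subnets and routes."},
]

before = count_sessions(catalog_2025, r"\bMCP\b")
after = count_sessions(catalog_2026, r"\bMCP\b")
print(f"MCP: {before} -> {after}")  # MCP: 1 -> 2
```

The word-boundary pattern avoids counting longer tokens that merely contain the letters, which matters more for short, generic terms like "protocol" or "skills".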

→ Browse all 63 MCP sessions

The themes are pretty consistent:

• Gemini CLI as the developer control surface
• MCP as the protocol plumbing
• ADK as the agent toolkit
• BigQuery, Apigee, Cloud Run, and SecOps as the production surfaces

There’s also a quieter shift sitting next to MCP: the word “skills”, which 26 sessions mention in 2026. The term is still in transition, though. Most of those sessions still mean “human training”, “certifications”, or “upskilling programs”, but a smaller subset clearly means something else:

• Automating the UI with Gemini CLI, MCP and Skills
• Vibing with Gemini CLI: Custom Web AI solutions and shareable skills
• One tool to rule them all: Extending and customizing the Gemini CLI
• 10x productivity with the Gemini CLI

→ Browse all 26 Skills sessions

In those sessions, skills starts to mean reusable, executable capability — something you compose and ship, not just something a human learns in a course. The word hasn’t fully crossed over yet. But the direction is visible.

There’s also a demand signal in the reservations data: 7,778 seats already taken across the MCP sessions, weeks before the conference opens! The six biggest registration totals alone account for nearly 1,400 (1,397):

• Conversational Analytics agents, API, and MCP — 356 (Richard Kuzma)
• Agent development with BigQuery, ADK, and MCP — 298 (Guillaume Blaquiere)
• Build AI agents on Cloud Run — 256 (Ryan Pei)
• Google MCP Services — 178 (Romin Irani)
• Beyond the prompt: Build production-ready agents with Google’s MCP servers — 157 (Tarun Gumar)
• Unlocking enterprise actions: Bridge your APIs to AI agents with MCP and Apigee — 152 (David Rush)

And many of the smaller practitioner sessions are already gone (Agent development and AgentOps with BigQuery, ADK, and MCP; Building custom agents with Looker MCP and APIs; Build an AI Agent and Configure an MCP Server on Cloud Run).

Conference agendas are lagging indicators. CFPs, editorial planning, and speaker lock-in happen months before the event. So this doesn’t prove MCP wins long term. But it does show that well before the discourse settled, Google had already decided MCP was worth putting across one of its biggest stages.

And if MCP is the obvious signal, skills as executable capability may be the next term worth tracking.

→ Read more insights → Discuss on LinkedIn

Can AI Understand Fleet Data? Claude Puts Geotab Ace to the Test (MCP Demo)

Felipe Hoffa — Fri, 05 Dec 2025 20:24:23 GMT

To showcase the capabilities of Geotab Ace, an AI assistant that translates user questions into fleet analytics, we moved beyond manual validation. We used MCP to give Claude full autonomy: it designed its own test cases, ran them through Geotab, and verified that Ace delivers the accurate, actionable insights a fleet manager would expect. Read on for the full breakdown, or watch the demo video below.

Watch on YouTube

I’m Felipe Hoffa, Data and AI Advocate at Geotab. Geotab Ace is our conversational agent: Ask questions in plain English and get answers from your fleet data.

In this experiment, Claude ran 18 tests, from simple counts to behavioral pattern analysis. Not just checking if Ace gets the right answer, but whether it understands questions like a human would.

Today’s experiment boils down to one question: Can Claude verify that Ace truly understands natural language within its domain?

Here’s what it discovered.

The Setup

We used an MCP (Model Context Protocol) server to link Claude directly to Ace. MCP is a standard for connecting AI systems; think of it as the conversational bridge that lets two AIs talk directly. No human in the loop. Claude is running its own mini research project.
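Under the hood, MCP messages are JSON-RPC 2.0, and a client asks a server to run a tool with the `tools/call` method. The sketch below builds such a request by hand. The tool name `ask_ace` and its `question` argument are hypothetical placeholders, not the actual interface of the server used in this experiment.

```python
import json


def build_tool_call(request_id, tool_name, arguments):
    """Build an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# `ask_ace` and `question` are made-up names for illustration.
request = build_tool_call(1, "ask_ace", {"question": "How many vehicles are in the database?"})
wire = json.dumps(request)
print(wire)
```

In practice an MCP client library handles this framing, along with the initial handshake and tool discovery, so the model only sees a list of tools it can call.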

1. Basic Comprehension

Claude’s First Question:

“What can you do?”

Ace lists trip details, GPS logs, vehicle performance, driver safety, and diagnostics, plus tips on being specific and including time frames.

Claude’s Assessment: Comprehensive overview, transparency about limitations, well-structured. Now Claude knows what to test.

2. Simple to Complex Logic

Claude then started simple:

“How many vehicles are in the database?”

Ace handled that cleanly. Then Claude pushed harder, moving from simple retrieval to multi-condition reasoning:

“Show me vehicles that idled for more than 30 minutes AND drove over 200 miles in the last week.”

Combining multiple conditions, Ace confirmed both had to be met before generating the answer. 10 vehicles matched. Claude was impressed.

3. Ambiguity and Human Language

This is where Ace surprised Claude. Fleet managers ask all the time: “Which driver is the safest?”

No metric is specified. No definition of “safe” — just a simple question that could mean a dozen different things. Claude deliberately left it vague to see what Ace could do.

Ace laid out its reasoning, showing how it translated a broad idea into measurable metrics: safety ranking, predicted collisions, and collision probability. It didn’t guess; it explained its interpretation before answering.

Claude noted: This is inference, not just translation. Ace is reasoning about the problem domain.

4. Colloquialisms

Then Claude tried something trickier. A fleet manager might say:

“What’s the deal with vehicle B26? Has it been working hard lately?”

That’s vague. Claude used this to see whether Ace could map casual language to specific metrics.

Ace translated “What’s the deal” into an information request.
“Working hard” became distance, engine hours, and trip count.
“Lately” became 30 days.

Claude highlighted how Ace understood colloquialisms without formal phrasing.

5. Domain Expertise

Now it was time to test whether Ace knows Geotab’s world. When something goes wrong, a fleet manager doesn’t dive into technical jargon. They just ask:

“Show me geofence violations last week.”

Ace mapped “geofence violations” to Geotab’s Zone Exceptions, calculating week boundaries using proper date logic.
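"Proper date logic" for "last week" could look like the sketch below. It assumes "last week" means the previous full Monday-to-Sunday calendar week; the article does not say which convention Ace actually uses, and the function name is made up.

```python
from datetime import date, timedelta


def last_week_bounds(today):
    """Return (monday, sunday) of the calendar week before the one containing `today`."""
    this_monday = today - timedelta(days=today.weekday())  # Monday is weekday 0
    last_monday = this_monday - timedelta(days=7)
    return last_monday, last_monday + timedelta(days=6)


start, end = last_week_bounds(date(2025, 12, 5))  # a Friday
print(start, end)  # 2025-11-24 2025-11-30
```

Anchoring on week boundaries rather than subtracting seven days keeps the answer stable no matter which day of the week the question is asked.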

Claude called this out: Ace knows Geotab terminology and data structures. It’s not just a generic AI translator.

6. Logic and Negation

Claude then tested negation by asking:

“Which vehicles did not make any trips yesterday?”

Ace explained its two-step logic and found Demo 05 idle.
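One plausible reading of that two-step logic is a set difference: first collect the vehicles that did make trips, then subtract them from the full fleet. The sketch below shows the idea with made-up records; the field names are assumptions, not Geotab's schema.

```python
def vehicles_without_trips(all_vehicles, trips, day):
    """Step 1: find vehicles with at least one trip on `day`. Step 2: subtract them."""
    active = {t["vehicle"] for t in trips if t["date"] == day}
    return sorted(set(all_vehicles) - active)


# Made-up fleet and trip records.
fleet = ["V1", "V2", "V3"]
trips = [
    {"vehicle": "V1", "date": "2025-12-04"},
    {"vehicle": "V2", "date": "2025-12-04"},
    {"vehicle": "V3", "date": "2025-12-03"},
]
idle = vehicles_without_trips(fleet, trips, "2025-12-04")
print(idle)  # ['V3']
```

The point of the negation test is that a vehicle with no trips leaves no rows in the trip data, so the answer has to start from the full vehicle list rather than from the trips.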

7. Time and Place

Claude tested whether Ace thinks in places and time, not just raw data.

“Compare average daily mileage between weekdays and weekends for the last month.”

Ace explained it aggregates twice: first daily totals, then comparing weekdays to weekends.
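The double aggregation can be sketched as follows: sum trips into daily totals first, then average those totals per day type. The trip tuples are invented, and the sketch only averages over days that have at least one trip, which is a simplification.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def weekday_weekend_averages(trips):
    """trips: iterable of (day, miles). Returns (weekday_avg, weekend_avg) of daily totals."""
    daily = defaultdict(float)
    for day, miles in trips:          # first aggregation: total miles per day
        daily[day] += miles
    weekday = [total for day, total in daily.items() if day.weekday() < 5]
    weekend = [total for day, total in daily.items() if day.weekday() >= 5]
    return mean(weekday), mean(weekend)  # second aggregation: average per day type


# Made-up trips: Monday and Tuesday, then a Saturday.
sample_trips = [
    (date(2025, 12, 1), 100), (date(2025, 12, 1), 50),
    (date(2025, 12, 2), 250),
    (date(2025, 12, 6), 80), (date(2025, 12, 6), 20),
]
weekday_avg, weekend_avg = weekday_weekend_averages(sample_trips)
print(weekday_avg, weekend_avg)  # 200.0 100.0
```

Averaging individual trips directly would answer a different question (average trip length), which is why the intermediate daily totals matter.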

Result: Weekdays average 6,800 miles; Weekends 5,600.

8. Advanced Causal Reasoning

Picture this: You are reviewing your fleet dashboard and one vehicle’s fuel stats look off. You want to know if it’s truly using more fuel than usual.

“Find vehicles whose fuel consumption today is more than double their normal average.”

Ace explained it is calculating a baseline average but excluding today’s data to establish what’s “normal,” then comparing today against the baseline.
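Here is a small sketch of that baseline logic, with invented numbers. The data shape (`{vehicle: {day: fuel_used}}`) and the function name are assumptions for illustration.

```python
from statistics import mean


def over_double_baseline(history, today):
    """Flag vehicles whose fuel use today exceeds twice their average on other days."""
    flagged = []
    for vehicle, per_day in history.items():
        past = [used for day, used in per_day.items() if day != today]  # exclude today
        if not past or today not in per_day:
            continue  # no baseline or no reading today: nothing to compare
        if per_day[today] > 2 * mean(past):
            flagged.append(vehicle)
    return flagged


# Made-up readings.
history = {
    "V1": {"2025-12-03": 10, "2025-12-04": 10, "2025-12-05": 25},
    "V2": {"2025-12-03": 10, "2025-12-04": 10, "2025-12-05": 15},
}
print(over_double_baseline(history, "2025-12-05"))  # ['V1']
```

With these numbers, V1's baseline is 10, so 25 clears the "more than double" bar. If today's 25 were folded into the average, the baseline would rise to 15 and the same reading would no longer qualify, which is the distortion that excluding today avoids.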

Claude called this immediately: That’s causal reasoning. It recognized that including today’s data would create circular logic.

9. Debugging and Resilience

Then came the fun part. Claude asked:

“Who are the three least experienced drivers who have driven more than 100 miles this week?”

After hitting a processing timeout, Claude didn’t give up. Like a researcher refining a question after a dead end, it tried a new approach:

“Show me the newest drivers who drove over 100 miles this week.”

That one timed out too, and the team is working on it. But Claude pushed further, testing different approaches and mapping the edges. Watching one AI debug another in real time is next-level AI collaboration.

10. Patterns and Insights

“Show me vehicles that consistently speed on the same road segment.”

You probably hear this from operations managers. They don’t want random speeding incidents; they want patterns, problem areas, places where drivers habitually push the limits.

Ace rounded the GPS coordinates to about 11 meters of precision, creating location clusters. Demo Vehicle 41 had 17 speeding events at the same spot. Claude found this “elegant geospatial clustering” was a smart, practical solution.
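Rounding coordinates to four decimal places gives roughly that precision, since 0.0001 degrees of latitude is about 11 meters (longitude cells get narrower away from the equator). The sketch below assumes four decimals is what "about 11 meters" refers to; the event tuples and function names are made up.

```python
from collections import Counter


def cluster_key(lat, lon, decimals=4):
    """Snap a coordinate to a grid cell by rounding; 4 decimals is about 11 m of latitude."""
    return (round(lat, decimals), round(lon, decimals))


def hotspots(events, min_events=2):
    """events: iterable of (vehicle, lat, lon). Count speeding events per vehicle and cell."""
    counts = Counter((vehicle, cluster_key(lat, lon)) for vehicle, lat, lon in events)
    return {key: n for key, n in counts.items() if n >= min_events}


# Made-up speeding events: two nearly identical positions and one elsewhere.
events = [
    ("V1", 43.65321, -79.38318),
    ("V1", 43.65319, -79.38322),
    ("V2", 43.70000, -79.40000),
]
print(hotspots(events))  # {('V1', (43.6532, -79.3832)): 2}
```

Rounding is cheap and works well inside a query, at the cost of splitting points that sit right on a cell border into two neighbouring cells.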

“Which vehicles increased their mileage by more than 50% compared to last month?”

Ace explained that “last month” refers to the previous full calendar month, not a rolling 30 days. Claude noted this shows Ace understands human concepts of time, not just date arithmetic.
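The difference between the two readings of "last month" is easy to see in code. This is a generic date sketch with hypothetical function names, not Ace's implementation.

```python
from datetime import date, timedelta


def previous_calendar_month(today):
    """First and last day of the full calendar month before `today`."""
    last_of_prev = today.replace(day=1) - timedelta(days=1)
    return last_of_prev.replace(day=1), last_of_prev


def rolling_30_days(today):
    """The 30 days ending yesterday, for comparison."""
    return today - timedelta(days=30), today - timedelta(days=1)


today = date(2025, 12, 5)
print(previous_calendar_month(today))  # 2025-11-01 to 2025-11-30
print(rolling_30_days(today))          # 2025-11-05 to 2025-12-04
```

Stepping back one day from the first of the current month handles month lengths and year boundaries without any special cases.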

“Find drivers who speed only during rush hour but not at other times.”

This is the kind of insight that reveals behavior patterns. Insights fleet managers use to adjust schedules, reduce stress, or correct routing issues. Maybe stress drives risky driving? Maybe it’s schedule pressure? Understanding when drivers speed tells you why they speed.

Ace explained it created two groups: Rush Hour Speeders vs. Other Time Speeders, then found the difference. It defined rush hour as 6:00–9:00 AM and 4:00–7:00 PM, accounting for time zones.
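A sketch of that grouping, using the rush-hour windows quoted above. Timestamps are stored in UTC and converted to each driver's local time before the check. Fixed UTC offsets stand in for real time zones to keep the example dependency-free; the driver names, the event shape, and the choice to treat window ends as exclusive are all assumptions.

```python
from datetime import datetime, time, timedelta, timezone

RUSH_WINDOWS = [(time(6, 0), time(9, 0)), (time(16, 0), time(19, 0))]


def is_rush_hour(utc_ts, local_tz):
    """Convert to local time first, then test against the rush-hour windows."""
    local = utc_ts.astimezone(local_tz).time()
    return any(start <= local < end for start, end in RUSH_WINDOWS)


def rush_hour_only_speeders(events):
    """events: iterable of (driver, utc_timestamp, local_tz) speeding events."""
    rush, other = set(), set()
    for driver, ts, tz in events:
        (rush if is_rush_hour(ts, tz) else other).add(driver)
    return rush - other  # sped during rush hour and never outside it


EAST = timezone(timedelta(hours=-5))  # fixed offsets, for illustration only
WEST = timezone(timedelta(hours=-8))
utc = timezone.utc
events = [
    ("ana", datetime(2025, 12, 1, 13, 0, tzinfo=utc), EAST),  # 08:00 local
    ("ben", datetime(2025, 12, 1, 13, 0, tzinfo=utc), WEST),  # 05:00 local
    ("ben", datetime(2025, 12, 1, 16, 0, tzinfo=utc), WEST),  # 08:00 local
    ("cy", datetime(2025, 12, 1, 23, 0, tzinfo=utc), EAST),   # 18:00 local
]
print(sorted(rush_hour_only_speeders(events)))  # ['ana', 'cy']
```

The same UTC instant is rush hour for one driver and pre-dawn for another, which is why the conversion has to happen per event. A production version would use named time zones so daylight saving shifts are handled too.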

Claude observed that Ace’s time zone awareness was spot on — critical for managing multi-time zone fleets.

The Verdict

After 18 tests, Claude gave Ace a near-perfect score, praising its deep understanding, sharp reasoning, and domain expertise.

Claude’s Summary: Ace doesn’t just translate questions; it reasons about the problem domain.

What stood out?

Statistical reasoning with circular logic avoidance.
Geospatial clustering.
Deep domain expertise in Geotab terminology.
Set operations from natural language.
Time zone awareness.

When Claude hit processing limits, it adapted, reformulated queries, and kept testing. That resilience was fascinating to watch.

Why This Matters

You don’t have time to write perfect questions. You need systems that get what you mean even when you’re vague. Better understanding means faster decisions, fewer mistakes, and more confident operations.

Imagine asking, “Which drivers are improving their safety scores?” and getting an instant, accurate answer. Or, “Where are maintenance costs climbing?” or “Which routes are most fuel efficient?” Questions that used to require SQL expertise and analyst time now happen conversationally.

And now, an AI can validate whether another AI actually understands its domain at scale.

What’s Next?

This opens up some interesting questions.

How will Ace perform when the data isn’t clean or when it’s handling millions of rows at production scale?
What if Claude isn’t just testing Ace, but using it as a tool to solve bigger problems? Multiple AI queries orchestrated together, building analytical workflows?

Well, MCP makes this possible: AI using another AI as a capability.

If you are as excited as I am about where this is headed, follow along. There’s a lot more to explore. This wasn’t just a test of accuracy; it was a glimpse of AI systems that can reason, adapt, and even debug each other. AIs aren’t just answering questions; they are evaluating each other, and that changes what’s possible.

Check out the full conversation between Ace and Claude. Also follow me on LinkedIn for more AI deep dive topics, join the r/geotab subreddit for more Geotab news, or go ahead and try the MCP server I published on GitHub.

I’m Felipe Hoffa, and I’ll see you on the road.

BigQuery vs Snowflake vs Databricks: Which subreddit community beats?

Felipe Hoffa — Tue, 16 Sep 2025 23:34:03 GMT

Instead of comparing data platforms, I want to know which one has the healthiest community heartbeat on Reddit: BigQuery, Snowflake, or Databricks? I’m not in the business of advocating for any of them anymore, but I’ll always love data and community. So let’s compare the numbers.

Active subreddit engagement rates as reported by Reddit, 2025-09-16

Reddit has grown into the largest internet community, and I’ve helped grow some subreddits close to my heart, including:

I started r/bigquery in 2013 and took it from 0 to 10k subscribers by 2020. After I left Google it kept growing, but at a slower rate.
In 2020, I reactivated r/snowflake, taking it from 600 to 18k subscribers by 2024.
And in 2023, I watched from outside as r/databricks went from almost nothing to 9k subscribers at the start of 2025.

Reddit /r/bigquery and /r/snowflake community growth with Felipe’s care (2013–2025)

So who has the healthiest subreddit community today?

Why this matters in an LLM world

Semrush looked at the top domains cited by LLMs (ChatGPT, Perplexity, AI Mode, AI Overviews) and found that Reddit is the top source for their citations (followed by Wikipedia and YouTube):

Reddit is the top source for LLM citations, according to Semrush

The Numbers Tell a Story

Looking at these subreddits’ activity is timely, as Reddit just stopped reporting subscriber counts. Instead, they’re now showing “weekly visitors” and “weekly contributions.” Not all mods are thrilled about this change, but it gives us some interesting insights. As observed on September 16, 2025:

r/bigquery: Still the largest subscriber base (~19.6k), but surprisingly low engagement (7.5k weekly visitors, 33 weekly contributions).

r/snowflake: Has likely matched BigQuery’s subscriber count (~18.9k), with much stronger engagement (16k weekly visitors, 172 weekly contributions).

r/databricks: The clear engagement winner despite fewer total subscribers (20k weekly visitors, 424 weekly contributions).
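One simple way to put those three communities on the same scale is contributions per 1,000 weekly visitors. That ratio is my own illustrative definition, not necessarily the one behind the chart; the inputs are the rounded figures quoted above.

```python
# (weekly visitors, weekly contributions), as quoted in the text.
stats = {
    "r/bigquery": (7_500, 33),
    "r/snowflake": (16_000, 172),
    "r/databricks": (20_000, 424),
}


def contributions_per_1000_visitors(visitors, contributions):
    """Illustrative engagement ratio: weekly contributions per 1,000 weekly visitors."""
    return 1000 * contributions / visitors


rates = {name: contributions_per_1000_visitors(v, c) for name, (v, c) in stats.items()}
for name, rate in sorted(rates.items(), key=lambda item: -item[1]):
    print(f"{name}: {rate:.2f} contributions per 1,000 visitors")
```

By this measure r/databricks comes out at about 21.2, r/snowflake at about 10.8, and r/bigquery at about 4.4, so the ranking holds even after adjusting for audience size.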

Active subreddit engagement rates, with notes, based on numbers reported by Reddit, 2025-09-16

What This Tells Us

It’s never too late to build a vibrant community. Databricks entered the game years after the others, but now has the most vibrant community. Sometimes being the newcomer means you try harder.

Communities need champions to thrive. r/bigquery’s growth slowed significantly after I left Google. I’m seeing a similar pattern with r/snowflake now that I’ve moved on. Don’t just assign someone to manage your subreddit, but nurture those who really care.

A beating heart is more important than a large body. Raw subscriber numbers don’t tell the whole story. Databricks has developed a community where people actively participate, ask questions, and help each other.

I’m impressed by what the Databricks community team has built. My goal is to keep watching and learning from their growth.

The road ahead

As for me? I just revived r/geotab for the telematics industry — a space I’m still learning about. That’s the beauty of building communities: you get to learn alongside peers, while connecting with people with shared challenges and interests.

The data doesn’t lie: the healthiest communities aren’t necessarily the biggest ones. It’s the ones where people show up, contribute, and help each other grow.

I’m Felipe Hoffa — find me on LinkedIn (comments for this post), around the web, and on the road.

After BigQuery and Snowflake: What comes next? (Felipe’s version)

Felipe Hoffa — Wed, 05 Feb 2025 18:53:28 GMT

I’m Felipe Hoffa — a Developer Advocate who became a key player in the meteoric rise of BigQuery and Snowflake. Now on a break, I reflect on how we got here and what comes next.

My career in one chart

This chart from DB-Engines tells a fascinating story. DB-Engines ranks database systems by their popularity, combining factors like technical discussions, job postings, and social media mentions. While traditional databases like Oracle and MySQL have long dominated the top spots, the real story lies in the dramatic rise of cloud-native platforms.

The rise of BigQuery and Snowflake’s popularity, as tracked by DB-Engines (plus my annotations)

When I joined BigQuery in 2013, it wasn’t even tracked by DB-Engines yet. The platform was so new that many developers weren’t sure what to make of a fully-managed, serverless data warehouse. Over seven years, I worked to change that perception, helping BigQuery grow from an experimental Google service to a cornerstone of modern data architecture.

In 2020, I made a pivotal move to Snowflake, joining shortly before their historic IPO. The timing was perfect: Snowflake was about to become one of tech’s biggest success stories, and I had the opportunity to help shape its narrative among builders and developers. The platform’s growth curve tells the story: Snowflake surpassed what had taken BigQuery seven years to achieve, then kept climbing to become one of the top 10 databases in the world.

The annotations show my journey with both platforms, but they also represent a fundamental shift in how companies approach data. I was fortunate to be at the forefront of this transformation, advocating for technologies that would reshape the industry.

Background

Back in 2010, I was a Software Engineer on a break. Frustrated with the enterprise world in Chile, I decided to take a year off work to figure out what to do with my life. That’s when Google called, and my planned break was cut short with a new life in San Francisco as a Google Software Engineer in 2011. Two years later — when it was time to find a new project — an unforgettable googler asked me, “Would you like to be BigQuery’s Developer Advocate?” I had never touched BigQuery, but saying yes to that question changed my life.

The 2010 email that started my journey from Chile to Google San Francisco

What’s a Developer Advocate?

For me, a “Developer Advocate” is a “Software Engineer with a License to Speak”. Let me explain: As a Software Engineer, I’m able to speak the same language as our users, and my goal is to get their attention, listen to them, and help them understand and make the best of our platforms.

My toolbox includes:

Traveling around the world to present at conferences and meet customers.
Writing blog posts with engaging stories that promote features, use cases, and spark curiosity to try the same tools I used.
Making fun videos and shows.
Using social media to be part of the larger conversation and the moment (beyond Twitter and LinkedIn, I also built and grew the /r/bigquery and /r/snowflake subreddits).
Being a product expert (by answering questions on Stack Overflow I became BigQuery’s #2 top expert of all time, and Snowflake’s #3 top expert of all time).
Listening to feedback, and delivering user insights inside the company to shape our offerings and channels.

Visualizing impact

How /r/bigquery and /r/snowflake grew on reddit under my stewardship:

Reddit /r/bigquery and /r/snowflake community growth with Felipe’s care (2013–2025)

How the Stack Overflow tags for BigQuery and Snowflake grew under my stewardship:

Stack Overflow page views per tag (2017–2024)

Proving myself as one of the biggest BigQuery and Snowflake experts of all time on Stack Overflow:

Top BigQuery and Snowflake experts of all time, as per Stack Overflow (screenshot Jan 2025)

A Brief Intermission: What Have I Done During My Break?

Taking a break hasn’t meant staying still. I’ve been:

Traveling — Exploring new places, meeting people, and gathering fresh perspectives
Engaging with the Community — Attending meetups and hackathons, connecting with developers, and staying involved in tech conversations. Being an attendee has brought me new perspectives that I didn’t get as a frequent speaker
Civic Engagement — I became a U.S. citizen, I went canvassing for the elections, I did my jury duty, and I voted
And a lot more…

Personal growth

There’s also a more personal story of growth. In 2020, I discovered that I have ADHD, a realization that finally helped me understand how to get the best out of myself. In 2021, I embarked on a journey of self-kindness, which led me to lose 40 pounds (~20 kg) in 2023. In 2024 I successfully maintained this weight loss. Beyond the physical transformation, I’ve focused on shifting my mindset, deepening my personal development, and redefining what happiness looks like. Every day, I continue learning and growing, equipping myself with the tools to become the person the future needs me to be.

Data-driven weight loss: How I shed 35 pounds in 4 months

What I’m looking for next

As in 2010, I’m again on a break. It’s hard to know what I want next. I never planned to move to the U.S. or to join Google. I didn’t plan to be BigQuery’s Developer Advocate either. When Snowflake first called in 2020, I replied “I see no compelling reason to join”. Then a few months later, I surprised myself by doing exactly that. The pattern here is that I’ve never been able to predict the great opportunities ahead, but I’ve been fortunate to be noticed by amazing people who have pulled me to incredible heights.

Me rejecting an offer to join Snowflake. (Narrator: 83 days later, he joined Snowflake)

I keep wondering — what will be the next pull into an incredible new story? I’ve spoken with many companies and startups, but I haven’t made up my mind yet. As with Snowflake, I might be tempted to say “no” at first. If I do that to you, please don’t be offended — sometimes I’m silly like that.

Rather than a checklist of what my next job must be, here’s what I valued from my time at Google and Snowflake:

I love having an office full of smart colleagues. Working remotely from home isn’t as compelling as an office that sets clear boundaries between work and home. If I joined a startup with no offices, I would probably ask to expense a coworking space.
I love living in San Francisco. I’m willing to relocate, but probably a lot of companies consider having me in SF a strategic advantage.
I love traveling. Google and Snowflake took me around the world to meet developers, customers, and prospects. I’d love to have more of these opportunities.

Felipe Hoffa: MVP EMEA 2023, awarded by the Snowflake Sales Engineers who considered me a great resource supporting their teams across the world.

I love creative freedom. Google and Snowflake let me craft my own narrative and chase what I thought would be interesting to our users. Surprisingly, I had more creative freedom at Google than at Snowflake — contrary to what many might expect from a larger company. But we can discuss that over lunch. :)
I love managers and companies that understand that everyone has different strengths. I have a particular set of strengths that I’m proud of, but also areas I don’t excel at. At Google, I had 13 managers over the years, and 4 more at Snowflake. The best ones understood that success comes from aligning people with their strengths, rather than forcing a one-size-fits-all approach. That understanding makes all the difference in both impact and job satisfaction.
I love community — even while on break I love showing up to meetups and making friends with the tech community
I love data. Nothing sparks my curiosity more than understanding the world through data. This has been at the core of my tech journey — I learn new tools and methods to analyze the world, and then I share what I’ve discovered. So far, this passion has drawn attention to my work, my stories, and what I have to say.
I love AI. Every day I use ChatGPT, Claude, DeepSeek, and Gemini — they are changing everything around us, and there’s no choice but to be part of this journey.

What can I do for you

Maybe you are wondering: Is this all true?

I’m lucky to have an incredibly public career. Some of it you can’t see (like customer meetings, or internal feedback delivered to shape our products and messaging)— but a lot is out in the world. You can ask people who have witnessed this story, and also the AI bots that have digested my public work.

Try asking your favorite chatbot “Who is BigQuery’s most famous Developer Advocate?”, or questions like “Why would Felipe Hoffa be a great Developer Advocate for (insert company)?”. These are the results I got:

Asking ChatGPT “Who is the most famous BigQuery Developer Advocate”

Asking ChatGPT “Why would Felipe Hoffa be a great Developer Advocate for (insert company). For example, Databricks?”

In Summary

My journey has been shaped by saying “yes” to unexpected opportunities and being part of transformative stories in data. From helping BigQuery grow from an experimental service to an industry standard, to joining Snowflake right before its historic IPO, I’ve been fortunate to advocate for technologies that changed how the world uses data.

Now, as AI reshapes our world, I’m looking for my next opportunity to make an impact. If history has taught me anything, I might hesitate at first — but I look forward to working with brilliant people, tackling meaningful challenges, and sharing insights that inspire others.

If you see a role where my creativity and experience in building communities, explaining complex technologies, and bringing developers and products together could make a difference, let’s talk. You can find me on LinkedIn, or let’s meet in person if you’re around San Francisco.

Lessons learned, 10 years after moving to the US to join Google

Video: Analyzing data with BigQuery, while re-creating “Lost in Translation” (Tokyo, Japan)

Video: Interviewing Trevor Noah for Snowflake (video not available)

Video: “Cooking data: BigQuery ML vs the online dating spam” (Dresden, Germany)

https://medium.com/media/5e44a3b4902282bb8b975c4f1df99eca/href https://medium.com/media/edb29c92d3b76ef89503eee0a203be37/href

“I don’t know what’s next — but if history repeats itself, it’ll be unexpected, impactful, and shaped by the people I meet along the way.”

Felipe Hoffa on LinkedIn: I joined the Google BigQuery team before it even registered on the map and... | 39 comments

Ps: Thanks everyone for the awesome feedback and memories. I’d love to highlight this one from Reza Qorbani:

Surviving a Keynote Disaster: How We Turned Panic into a Win

Felipe Hoffa — Fri, 12 Jul 2024 02:47:20 GMT

What would you do if you’re on stage delivering a keynote demo in front of thousands of people, and suddenly your computer dies? This is exactly what happened to me.

Facing defeat in front of thousands

In this video, you can see me panic. I’m up on stage with Polita, ready to deliver an exciting demo we’ve rehearsed countless times. An audience of thousands is watching us, but even more nerve-racking, the company’s founders and CFO are judging us from the first row. We need to go through the motions, following the exact lines that the teleprompter is feeding us. But suddenly, my computer dies.

https://medium.com/media/edb29c92d3b76ef89503eee0a203be37/href

As I’m typing through the demo, I realize something’s wrong. “Where did my monitor go?” you can hear me squeak. I try a little joke, and you can hear the audience laugh. That’s reassuring — they are supporting us, but we need to find a way out of this. I fill the void by loudly wondering how to fill time while the computer turns back on. But it doesn’t. I even try unplugging and plugging the monitor cable. Maybe that will help. It doesn’t.

And now the audience is clapping! It’s one of the best feelings — whatever happens, I can feel the support of Polita and the audience. We’re going to get through this excruciating moment together.

An amazing audience, including Snowflake founders and CFO (Thierry Cruanes, Benoit Dageville, Mike Scarpelli). No panic.

“I have no idea how to continue without the computer,” you can hear my disappointment. “We might need to skip this demo.” Polita wonders if we could nevertheless give chocolates to everyone in the audience. But the audience has already noticed a miracle, even while Polita and I are still fixated on the dead computer: my code is back on the big screen.

What happened is that the team backstage decided to play back a recording of the demo. I look up and think my computer is somehow back. But it’s not; it’s moving by itself. When I realize this, I know I have to start narrating the demo — not at the rhythm of the teleprompter, but at whatever tempo the recording is playing.

We then deliver the demo we wanted to give — me following the recording, while Polita adjusts to the answers I’m giving to our already scripted conversation. With Snowflake, we translate thousands of restaurant reviews to English, rate them, extract the top topics that people complain about, and use a Snowflake Cortex SQL function to compose an email to the restaurant owner using the data we just compiled by calling LLMs with SQL. Isn’t that cool?

So what can we learn from this?

Embrace Audience Support: Having a demo fail can happen to anyone, and the audience will join you with their support and love. Enjoy the moment, and the audience will enjoy it with you.
Have a Backup Plan: In this case, the backstage team was ready to play the recording, even if in my panic, I had forgotten about that.
Rehearse the Backup Plan: It’s good to have a backup plan, but we could have rehearsed it too — then we would have felt even safer while going through it. Luckily, Polita and I were ready to play.

Polita Paulus, Amanda Kelly, Felipe Hoffa, and Marie Coolsaet keynote discussion

My infinite thanks to Polita Paulus, Julian Forero (who made the whole keynote happen), and everyone else involved on and off-stage, including the awesome audience we had that morning.

The best part of a keynote disaster? Countless people came afterward to express their amusement and happiness at having lived through those moments with us. They felt the panic, thought everything was lost, and then could feel the accomplishment of saving the day, learning new SQL tricks, and even getting some delicious chocolate as a reward.

Dash Desai, Amanda Kelly, Polita Paulus, Jeff Hollan, Felipe Hoffa, and Marie Coolsaet thanking the audience for their attention.

If you want to learn how to translate, do sentiment analysis, extract topics, and write emails with Snowflake Cortex, check out the demo video I had pre-recorded, and this quickstart guide.

https://medium.com/media/7dd627af895c2acb547455feca09cddc/href

Customer Reviews Analytics using Snowflake Cortex

Want more?

Try this out with a Snowflake free trial account — you only need an email address to get started.

I’m Felipe Hoffa, Data Cloud Advocate for Snowflake. Thanks for joining me on this adventure. You can follow me on Threads and LinkedIn (while increasingly less on Twitter). And subscribe to reddit.com/r/snowflake for the most interesting Snowflake news.

Originally published at https://www.linkedin.com.


Uncovering the new Snowflake UDAFs with Apache DataSketches

Felipe Hoffa — Wed, 29 May 2024 17:46:20 GMT

Snowflake now supports creating your own user-defined aggregate functions (UDAFs) in Python. Let’s discover them by implementing Apache DataSketches HLL approximate counts within the new Snowflake UDAFs.

Image created with AI

Python User-Defined Aggregate Functions: now in Public Preview

What’s Apache DataSketches?

From the Apache DataSketches homepage:

In the analysis of big data there are often problem queries that don’t scale because they require huge compute resources and time to generate exact results. Examples include count distinct, quantiles, most-frequent items, joins, matrix computations, and graph analysis.

If approximate results are acceptable, there is a class of specialized algorithms, called streaming algorithms, or sketches that can produce results orders-of magnitude faster and with mathematically proven error bounds. For interactive queries there may not be other viable alternatives, and in the case of real-time analysis, sketches are the only known solution.

Snowflake has its own approximate aggregation functions (HLL, APPROXIMATE_JACCARD_INDEX, APPROX_TOP_K, APPROX_PERCENTILE). These functions perform much faster than the UDAFs we are going to implement in this post. However, this exercise will be interesting for cases where we need compatibility with systems outside Snowflake. This is also a good way to discover the implementation and design decisions behind the Snowflake UDAFs, so let’s get started.
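To make the idea of a mergeable sketch concrete, here’s a textbook HyperLogLog in plain Python. This is not the Apache DataSketches implementation: the TinyHLL class and its method names are made up for this illustration, and its state is not compatible with any real library. It only tries to show the two properties that matter in this post: the state has a fixed size, and merging partial states gives the same state as one pass over all the data.

```python
import hashlib
import math


class TinyHLL:
    """Textbook HyperLogLog, for illustration only (hypothetical class)."""

    def __init__(self, lg_k=12):
        self.lg_k = lg_k
        self.m = 1 << lg_k                  # number of registers (4096 for lg_k=12)
        self.registers = bytearray(self.m)  # fixed-size state, one byte per register

    def update(self, value):
        # Take 64 bits of a hash of the value
        h = int.from_bytes(hashlib.sha1(value.encode("utf-8")).digest()[:8], "big")
        tail_bits = 64 - self.lg_k
        index = h >> tail_bits              # the first lg_k bits pick a register
        tail = h & ((1 << tail_bits) - 1)   # the remaining bits
        rank = tail_bits - tail.bit_length() + 1  # leading zeros in the tail, plus one
        if rank > self.registers[index]:
            self.registers[index] = rank

    def merge(self, other):
        # Merging is a register-wise max, so the order of merges doesn't matter
        if self.lg_k != other.lg_k:
            raise ValueError("sketches must use the same lg_k")
        self.registers = bytearray(
            max(a, b) for a, b in zip(self.registers, other.registers)
        )

    def estimate(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            return m * math.log(m / zeros)  # small-range correction (linear counting)
        return raw


names = [f"Customer#{i:09d}" for i in range(50_000)]

single = TinyHLL()
for name in names:
    single.update(name)

# Two partial sketches over overlapping slices of the same data
left, right = TinyHLL(), TinyHLL()
for name in names[:30_000]:
    left.update(name)
for name in names[20_000:]:
    right.update(name)
left.merge(right)

merged_matches_single_pass = left.registers == single.registers
estimate = left.estimate()
print(merged_matches_single_pass, round(estimate))
```

The overlap between the two slices doesn’t inflate the result, because a value seen twice lands in the same register with the same rank.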

How much faster are the native Snowflake approximate functions?

I started with an exact distinct count of the customers on TPCH_SF1000. This took 6.6s on a Small-wh:

select count(distinct c_name)
from snowflake_sample_data.tpch_sf1000.customer
group by 'x'
-- 150000000
-- 6.6s S

Meanwhile Snowflake’s native HLL implementation can get a similar result in 0.9s:

select approx_count_distinct(c_name)
from snowflake_sample_data.tpch_sf1000.customer
group by 'x'
-- 148133819
-- 0.9s S
;

That’s pretty good. Then my implementation of Apache DataSketches HLL inside a Snowflake UDAF takes 36s to do something similar:

select apache_sketches_hll(c_name)
from snowflake_sample_data.tpch_sf1000.customer
group by 'x'
-- 152248026.5622
-- 36s S-wh
;

That’s not spectacular, but still gets us the compatibility we might need with other systems. I also checked the previously available Snowflake Python UDTFs (table functions), which took 51s:

select apache_sketches_hll_udtf_sketch_union(
  array_agg(b.sketch)
)
from snowflake_sample_data.tpch_sf1000.customer a
  , table(apache_sketches_hll_udtf_sketch(c_name) )b
-- 152248026.5622
-- 51s S
;

That shows that the Python UDAFs are not only easier to use due to their more traditional syntax in SQL, they are also faster thanks to better parallelization.

Measuring the performance of approx count distinct over 150 million rows on a Snowflake Small-wh
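How far off are those approximate counts? A quick check of the numbers above against the exact count. As a yardstick I’m using the classic HyperLogLog standard error of 1.04/sqrt(m), assuming that hll_sketch(12) means 2^12 = 4096 registers. That formula is the textbook one and my assumption here; DataSketches publishes its own error tables, which may differ.

```python
import math

exact = 150_000_000
estimates = {
    "snowflake approx_count_distinct": 148_133_819,
    "datasketches hll in a UDAF": 152_248_026.5622,
}

# Signed relative error of each estimate against the exact distinct count
relative_error = {
    name: (value - exact) / exact for name, value in estimates.items()
}

# Textbook HyperLogLog standard error for m = 2**12 registers
classic_hll_std_error = 1.04 / math.sqrt(2 ** 12)

for name, err in relative_error.items():
    print(f"{name}: {err:+.2%}")
print(f"classic HLL standard error, 4096 registers: {classic_hll_std_error:.3%}")
```

Both results land within about one and a half percent of the exact count, which is inside that textbook yardstick.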

Let’s now go deeper into how these were implemented.

Apache DataSketches HLL in a Python UDAF

This is how to implement Apache DataSketches’ HLL approximate count distinct in a Python UDAF:

create or replace aggregate function apache_sketches_hll(a string)
returns float
language python
packages = ('datasketches')
runtime_version=3.11
handler = 'X'
as $$

from datasketches import hll_sketch, hll_union, tgt_hll_type

class X:
  def __init__(self):
    self._sketch = hll_sketch(12)

  @property
  def aggregate_state(self):
    return self._sketch.serialize_compact()

  def accumulate(self, input_value):
    self._sketch.update(input_value)
    
  def merge(self, other_partial_sum):
    union = hll_union(12)
    union.update(self._sketch)
    union.update(hll_sketch.deserialize(other_partial_sum))
    self._sketch = union.get_result()

  def finish(self):
    return self._sketch.get_estimate()
$$;

As seen above, using it is trivial:

select apache_sketches_hll(c_name)
from snowflake_sample_data.tpch_sf1000.customer
group by 'x'
-- 152248026.5622
-- 36s S-wh
;

Interesting things to notice:

Anaconda already provides datasketches in Snowflake, so getting the required libraries only took requesting them in the UDAF definition.
To create a UDAF we need to provide a class with 5 methods: __init__(), aggregate_state(), accumulate(), merge(), finish().
__init__() takes care of initializing an empty sketch.
accumulate() looks at each new row of data and adds it to our existing sketch.
aggregate_state() returns the current state of our sketch.
merge() is the key that allows Snowflake to parallelize the aggregation through multiple threads — each thread returns partial results and merge() takes care of combining them.
finish() returns the final result.
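To see how these pieces fit together outside Snowflake, here’s a small simulation in plain Python. The handler has the same shape as the UDAF class, but uses an exact set instead of a sketch so that it runs anywhere. The run_like_an_engine() driver is hypothetical: it’s my guess at the calling pattern (one handler per partition, then merging the states), not Snowflake’s actual engine.

```python
import json


class ExactDistinct:
    """Same shape as the UDAF handler, with a set standing in for the sketch."""

    def __init__(self):
        self._seen = set()

    @property
    def aggregate_state(self):
        # The state must be serializable so it can travel between workers
        return json.dumps(sorted(self._seen))

    def accumulate(self, input_value):
        self._seen.add(input_value)

    def merge(self, other_state):
        self._seen.update(json.loads(other_state))

    def finish(self):
        return len(self._seen)


def run_like_an_engine(handler_cls, partitions):
    """Hypothetical driver: one handler per partition, then merge into the first."""
    workers = []
    for rows in partitions:
        worker = handler_cls()          # __init__() once per worker
        for row in rows:
            worker.accumulate(row)      # accumulate() once per row
        workers.append(worker)

    final = workers[0]
    for worker in workers[1:]:
        final.merge(worker.aggregate_state)  # state out of one worker, into another
    return final.finish()               # finish() once


partitions = [["ann", "bob"], ["bob", "cat"], ["cat", "dan", "ann"]]
result = run_like_an_engine(ExactDistinct, partitions)
print(result)
```

With 3 workers this makes 3 calls to __init__(), 2 reads of aggregate_state, 2 calls to merge(), and 1 call to finish().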

I’m happy to see that the UDAFs are faster than the previously available UDTFs. We’ll implement HLL in a UDTF further down, to confirm this performance gain.

From the Snowflake docs

Performance review

This statistics screenshot comes from a UDAF run that took 46s in a S-wh:

Statistics of running the UDAF over 150,000,000 rows on a S-wh

In the above screenshot we can see:

667 partitions were scanned.
__init__() was called 17 times, which indicates the parallelism of processing those 667 partitions.
accumulate() was called 150,000,000 times (once per row), taking in total 406 seconds. The whole query ran in less time thanks to parallelization.
aggregate_state() was called 16 times, each time to recover the results of one thread started by __init__() (except the last thread).
merge() was called 16 times, bringing the results of aggregate_state() into another thread.
finish() was called once, to return the final result.
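A bit of arithmetic on those numbers helps to put them in perspective. I’m assuming here that the 406 seconds are summed across all threads and spread evenly between them, which the screenshot doesn’t confirm.

```python
accumulate_calls = 150_000_000
accumulate_total_s = 406   # total time reported for accumulate()
threads = 17               # number of __init__() calls
wall_clock_s = 46          # duration of the whole query

# Average cost of a single accumulate() call, in microseconds
per_call_us = accumulate_total_s / accumulate_calls * 1e6

# accumulate() time per thread, if the work were spread evenly
per_thread_s = accumulate_total_s / threads

print(f"{per_call_us:.2f} microseconds per accumulate() call")
print(f"{per_thread_s:.1f} s of accumulate() per thread, out of {wall_clock_s} s")
```

Under those assumptions, each call costs under 3 microseconds, and accumulate() accounts for roughly half of the wall-clock time of each thread.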

The same, but on an XL-wh:

Statistics of running the UDAF over 150,000,000 rows on a XL-wh

In this screenshot we can see that the 667 partitions were processed by 129 threads in parallel, bringing the processing time down to 9.4s.

Image generated by AI

The UDTF alternative

When we didn’t have UDAFs, the way to implement this was with a UDTF. I already explored a similar case with Java and BigQuery’s HLL++ — but now let’s check the implementation in a Python UDTF of Apache DataSketches HLL:

create or replace function apache_sketches_hll_udtf(input_value string)
returns table (total float)
language python
packages = ('datasketches')
runtime_version=3.11
handler='X'
as $$
from datasketches import hll_sketch, hll_union, tgt_hll_type

class X:
    def __init__(self):
        self._sketch = hll_sketch(12)

    def process(self, input_value):
        self._sketch.update(input_value)  

    def end_partition(self):
        yield (self._sketch.get_estimate(), )
$$;

You can see that the code above is more concise than the UDAF definition. However the SQL query using it is harder to write and understand:

select b.*
from snowflake_sample_data.tpch_sf1000.customer a
  , table(apache_sketches_hll_udtf(c_name))b
-- 50s S
;

Not only is the query harder to understand, the results also don’t match what we need:

We get 16 different results, instead of the total aggregate

This is because the UDTF does its own partitioning (when we don’t define it) to parallelize the query and make it faster. If we try un-parallelizing it, then the query takes 12 times longer on the same Small-wh:

select b.*
from snowflake_sample_data.tpch_sf1000.customer a
    , table(apache_sketches_hll_udtf(c_name) over(partition by 1))b
-- 617s S-wh
;

That’s not good. What we really need is to get sketches for each partition, and then aggregate them:

create or replace function apache_sketches_hll_udtf_sketch(input_value string)
returns table (sketch string)
language python
packages = ('datasketches')
runtime_version=3.11
handler='X'
as $$
from datasketches import hll_sketch, hll_union, tgt_hll_type
import base64

class X:
    def __init__(self):
        self._sketch = hll_sketch(12)

    def process(self, input_value):
        self._sketch.update(input_value)  

    def end_partition(self):
        yield (base64.b64encode(self._sketch.serialize_compact()).decode('utf-8'), )
$$;

create or replace function apache_sketches_hll_udtf_sketch_union(input_value array)
returns float
language python
packages = ('datasketches')
runtime_version=3.11
handler='x'
as $$
from datasketches import hll_sketch, hll_union, tgt_hll_type
import base64

def x(arr):
    union = hll_union(12)
    for sketch in arr:
        union.update(hll_sketch.deserialize(base64.b64decode(sketch)))
    return union.get_estimate()
$$;

That’s how we get to a query that looks like this:

select apache_sketches_hll_udtf_sketch_union(
  array_agg(b.sketch)
)
from snowflake_sample_data.tpch_sf1000.customer a
  , table(apache_sketches_hll_udtf_sketch(c_name) )b
-- 152248026.5622
-- 51s S
;
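Why go through all this trouble instead of just adding up the 16 partial results? Because the same value can show up in more than one partition. Here’s a tiny illustration with exact sets standing in for the sketches (the customer names are made up):

```python
# Three partitions with overlapping values
partitions = [
    {"Customer#1", "Customer#2", "Customer#3"},
    {"Customer#3", "Customer#4"},
    {"Customer#1", "Customer#4", "Customer#5"},
]

# Adding up per-partition distinct counts counts the overlaps more than once
sum_of_partials = sum(len(partition) for partition in partitions)

# Merging the per-partition states first gives the real distinct count
union_of_partials = len(set().union(*partitions))

print(sum_of_partials, union_of_partials)
```

That’s the role of the union function above: it merges the per-partition states before asking for an estimate.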

The lesson here is that it’s good to migrate our UDTFs to UDAFs (as long as they are written in Python — Java UDAFs are not yet available). However there’s an important step to consider in this example: What if we want to merge sketch results within UDAFs?

Image generated by AI

Storing and merging sketches with UDAFs

Why would anyone use Apache DataSketches within a Snowflake Python UDAF if they are slower than the native HLL approximate counts?

Well, the beauty of these sketches comes when achieving compatibility across multiple systems. Currently Apache DataSketches is available in Java, C++, and Python (with the community creating bindings in other languages, like Rust).

It’s good to store partial state sketches in Snowflake, especially if they are coming from other systems. For this to work, we need to split the original HLL UDAF into 3 parts:

apache_sketches_hll_accumulate(): Produces a sketch representing the probabilistic count of elements seen.
apache_sketches_hll_combine(): Combines multiple sketches (which could be coming from apache_sketches_hll_accumulate() in Snowflake, or from an external system).
apache_sketches_hll_estimate(): Transforms a sketch into an estimation.

My code to define these 3:

create or replace aggregate function apache_sketches_hll_accumulate(a string)
returns binary
language python
packages = ('datasketches')
runtime_version=3.11
handler = 'X'
as $$
from datasketches import hll_sketch, hll_union, tgt_hll_type

class X:
  def __init__(self):
    self._sketch = hll_sketch(12)

  @property
  def aggregate_state(self):
    return self._sketch.serialize_compact()

  def accumulate(self, input_value):
    self._sketch.update(input_value)
    
  def merge(self, other_partial_sum):
    union = hll_union(12)
    union.update(self._sketch)
    union.update(hll_sketch.deserialize(other_partial_sum))
    self._sketch = union.get_result()

  def finish(self):
    return self._sketch.serialize_compact()
$$;

create or replace aggregate function apache_sketches_hll_combine(a binary)
returns binary
language python
packages = ('datasketches')
runtime_version=3.11
handler = 'X'
as $$
from datasketches import hll_sketch, hll_union, tgt_hll_type

class X:
  def __init__(self):
    self._union = hll_union(12)

  @property
  def aggregate_state(self):
    return self._union.get_result().serialize_compact()

  def accumulate(self, input_value):
    self._union.update(hll_sketch.deserialize((input_value)))
    
  def merge(self, other_partial_sum):
    self._union.update(hll_sketch.deserialize(other_partial_sum))

  def finish(self):
    return self._union.get_result().serialize_compact()
$$;


create or replace function apache_sketches_hll_estimate(sketch binary)
returns float
language python
packages = ('datasketches')
runtime_version=3.11
handler = 'x'
as $$
from datasketches import hll_sketch, hll_union, tgt_hll_type

def x(sketch):
    return hll_sketch.deserialize((sketch)).get_estimate()
$$;

Note above that when moving sketches from external systems to Snowflake (or out), special care needs to be taken with the binary sketches, by transforming them to/from base64 or similar.
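Here’s a minimal sketch of that round trip in plain Python. The bytes below are a made-up payload, not a real serialized sketch. The point is only that the text form survives CSV files, JSON documents, and other text-only channels, and that decoding gives back the exact same bytes.

```python
import base64

# Made-up binary payload standing in for a serialized sketch
sketch_bytes = bytes([0x02, 0x01, 0x07, 0x0C, 0x00, 0xFF, 0x80])

# Binary -> text, safe to move through text-only channels
as_base64 = base64.b64encode(sketch_bytes).decode("ascii")

# Text -> binary, on the receiving system
back_to_bytes = base64.b64decode(as_base64)

# Hex is another common text encoding for binary values
as_hex = sketch_bytes.hex()

print(as_base64, as_hex, back_to_bytes == sketch_bytes)
```

Whatever encoding you pick, both systems need to agree on it. On the Snowflake side, conversions like to_binary(text, 'BASE64') can be an alternative to doing the work in Python.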

Using these then becomes easy within SQL:

select apache_sketches_hll_accumulate(c_name) sketch
from snowflake_sample_data.tpch_sf1000.customer
group by left(c_name, 12)
-- (151 sketches)
-- 41s S
;

select apache_sketches_hll_estimate(sketch)
from (
    select apache_sketches_hll_combine(sketch) sketch
    from table(result_scan(last_query_id(-1)))
)
-- 152248026.56219986
-- 0.9s S-wh
;

Estimating cardinality from multiple stored sketches

When to use Snowflake HLL, vs Apache DataSketches, vs Google ZetaSketch

Use Snowflake’s HLL implementation for the fastest results, but only when the whole life-cycle happens within Snowflake. This is because Snowflake has not open-sourced its implementation to be used in other systems.
Use Apache DataSketches with Snowflake Python UDAFs to achieve compatibility with multiple systems. UDAFs are easy to use within SQL, and have decent performance — while using an Apache project ensures wide industry adoption.
Use Google zetasketch (HyperLogLog) to achieve compatibility with BigQuery sketches. Note that they have only open-sourced a Java implementation, and Snowflake only has Python UDAFs — so you’ll need to implement Snowflake Java UDTFs instead — as described in my previous post.

Next steps

Check the Snowflake UDAFs docs.
Learn about the multiple interesting classes and use cases of Apache DataSketches.
Share your results with us.

Python user-defined aggregate functions | Snowflake Documentation

Want more?

Try this out with a Snowflake free trial account — you only need an email address to get started.

Hey Snowflake, send me a fancy HTML email

Felipe Hoffa — Wed, 17 Jan 2024 17:10:17 GMT

Hey Snowflake, send me a HTML email

Snowflake can now send HTML email notifications, so it’s time to explore how to make the most out of this new ability — including sending pre-rendered image charts with a Snowpark Python stored procedure. Let’s check out how.

Image generated by AI

A year ago I published “Hey Snowflake, send me an email”, showing off Snowflake’s new ability to send email notifications. However, those emails didn’t look too pretty. It’s now time to fix that.

Hey Snowflake, send me an email

For example, let’s say I want Snowflake to send me an email with the weekly top GitHub projects by number of stars. Following the previous instructions, the email notification would look like this:

Top GitHub projects email, formatted with Python `tabulate`

Now we can do better, as we can send an HTML table:

Top GitHub projects email, formatted as HTML

That looks decent, but it’s still not pretty enough. For example, we could give that plain HTML table a more modern look:

Top GitHub projects email, formatted as elegant HTML

And then for extra coolness, we could send any arbitrary pre-rendered plot as an image:

Top GitHub projects email, sent with a pre-rendered .png

Let’s find out how to send these new fancy emails.

How-to

Creating images that can be shared via email

Being able to publish images fully within Snowflake is cool, so let’s start with those steps:

The first thing we need is to CREATE STAGE in your Snowflake account to store the pngs:

create or replace stage pngstage 
directory = (enable = true)   
encryption=(type = 'snowflake_sse');

That looks like a normal create stage instruction, with two modifiers:

With directory = (enable = true) we get the ability to list all the files on this new stage. We don’t really need this, but a directory table will be helpful to keep track of and delete any files that are too old.
Setting encryption=(type = 'snowflake_sse') is important to set up “server side encryption”— otherwise the files we store in this stage won’t be readable outside your Snowflake account.

Then we can create a Snowflake Python stored procedure that:

Receives the id of a query that had 2 columns as output (label and count), the labels for these 2 columns, and a title for the chart.
Uses matplotlib to create a horizontal bar chart with those values.
Outputs that chart to the stage we created to share these pictures.
Returns a pre-signed URL that will allow others to view these PNGs from the web and email readers.

The code:

create or replace procedure create_png_web_image(
    query_id string
    , title string
    , x_label string
    , y_label string)
returns string
language python
runtime_version = '3.11'
packages = ('snowflake-snowpark-python', 'matplotlib')
handler = 'run'
execute as caller
as
$$
import matplotlib.pyplot as plt
import pandas as pd
import io

def save_file(session, buffer, path, dest_filename):
    session._conn.upload_stream(buffer, path, dest_filename, compress_data=False, overwrite=True)
    return path + '/' + dest_filename

def run(session, query_id, title, x_label, y_label):  
    df = session.sql("select * from table(result_scan(?))", params=[query_id]).to_pandas()

    plt.figure(figsize=(5, 3))
    plt.barh(df.iloc[:, 0], df.iloc[:, 1], color='skyblue')
    plt.xlabel(x_label)
    plt.ylabel(y_label)
    plt.title(title)
    plt.gca().invert_yaxis()  # Invert y-axis for better readability   
    plt.tight_layout()
 
    buffer = io.BytesIO()
    plt.savefig(buffer, format='png')
    buffer.seek(0)  # Rewind the buffer
    stage = '@pngstage'
    path = 'web_png'
    filename = 'png_%s.png' % (query_id)
    relative_path = save_file(session, buffer, stage + '/' + path, filename )
    return session.sql("select get_presigned_url(%s, '%s/%s', 604800) as url" % (stage, path, filename)).collect()[0]['URL']
$$
;

What’s interesting in this code:

To make the plot available as a png, we need to save it to a buffer with plt.savefig(buffer, format='png').
Then we can upload that buffer to the stage with the custom function save_file(), that uses the Snowpark internal method session._conn.upload_stream().
To get a pre-signed URL for that image we run the query select get_presigned_url().
These pre-signed URLs need an expiration time. The maximum is 1 week (604800 seconds). After that time the URLs will stop working.
TO-DO: Since the URLs will only work for a week, you should automate deleting the generated files when they are no longer needed.
TO-DO: The above function hard-codes the destination stage, you could make that more flexible.
TO-DO: The above function creates horizontal bar graphs, but you could generate anything you want with matplotlib or other libraries.
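For the first TO-DO, here’s one possible starting point, untested inside Snowflake. The helper below is hypothetical (it’s not a Snowflake API): it takes rows like the ones you could get from select relative_path, last_modified from directory(@pngstage), and returns one remove statement for each file older than the one-week URL lifetime. Actually running those statements, for example from a scheduled task, is left out.

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(seconds=604800)  # same one-week lifetime as the pre-signed URLs


def stale_file_statements(files, now, stage="@pngstage"):
    """Return one REMOVE statement per file older than MAX_AGE.

    `files` is a list of (relative_path, last_modified) tuples.
    Hypothetical helper, for illustration only.
    """
    return [
        f"remove {stage}/{relative_path}"
        for relative_path, last_modified in files
        if now - last_modified > MAX_AGE
    ]


now = datetime(2024, 1, 17, tzinfo=timezone.utc)
files = [
    ("web_png/png_old.png", now - timedelta(days=8)),
    ("web_png/png_new.png", now - timedelta(days=2)),
]
statements = stale_file_statements(files, now)
print(statements)
```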

A simple SQL stored procedure to send this URL as an image in an email:

execute immediate $$
declare
    signed_png_url string;
    formatted string;    
begin
    signed_png_url := (call create_png_web_image('01b077e2-0503-d084-001e-248301c57c26', 'Top weekly GitHub projects by stars', 'Stars', 'Project')); 
    formatted := 'The top GitHub projects by # of stars this week are:
<br><img src="' || signed_png_url || '">';
    call system$send_email(
        'my_email_int', 'felipe@example.com', 'Top GitHub weekly projects (png)', :formatted, 'text/html'
    );
    return 'sent: ' || :formatted;
    end;
$$
;

What’s interesting in this sample code:

Playing with an inline SQL stored procedure can help us iterate quickly over this code until we get exactly what we want.
With this SQL stored procedure we call the Python stored procedure that generates the image, returns the URL, and we store that URL in a variable.
With this URL, we create a bit of HTML with the <img> tag.
Then we call system$send_email() containing that bit of HTML, while making sure that the mime type is the newly supported text/html.
Check my previous post or the docs to set up system$send_email() to send emails to your desired destinations.
TODO: I hard-coded the query id while experimenting, should move to a general purpose stored procedure.
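If you prefer building that bit of HTML in Python, here’s a minimal sketch. The build_image_email() helper and the example URL are made up for this illustration. It escapes the URL, since pre-signed URLs usually carry query parameters separated by &, and it adds an alt text so that readers who block images still get something to read.

```python
from html import escape


def build_image_email(intro, image_url, alt_text):
    """Return an HTML body with an intro paragraph and an <img> tag (hypothetical helper)."""
    return (
        f"<p>{escape(intro)}</p>"
        f'<img src="{escape(image_url, quote=True)}" '
        f'alt="{escape(alt_text, quote=True)}">'
    )


body = build_image_email(
    "The top GitHub projects by # of stars this week are:",
    "https://example.com/chart.png?sig=abc&expires=604800",  # made-up URL
    "Bar chart of the top weekly GitHub projects by stars",
)
print(body)
```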

The query that provides the top weekly GitHub projects by # of stars is (as described in my previous post “Querying GitHub Archive with Snowflake: The Essentials”):

select any_value(repo_name) repo_name, count(distinct actor_id) stars
from cybersyn_github_archive.cybersyn.github_events
where type = 'WatchEvent'
and created_at_timestamp > dateadd(day, -7, current_date)
group by repo_id
order by stars desc
limit 10
-- 16s S-wh cold
-- 1s S-wh warm
;

Creating HTML tables that can be shared via email

Sending images is cool, but the links will expire in a week and some readers might choose to not display images. So they will see an empty email like this (unless you add more text to it):

An empty email without text because the user chose to not display images (or the URL expired)

Let’s explore an easy way to get an HTML table instead of an <img> tag as seen above:

create or replace procedure email_html_formatting(query_id string)
returns string
language python
runtime_version = '3.11'
packages = ('snowflake-snowpark-python')
handler = 'x'
execute as caller
as
$$

def x(session, query_id):    
    return session.sql(
      "select * from table(result_scan(?))",
      params=[query_id]).to_pandas().to_html()
$$
;

The quick solution above:

Executes a query with Snowpark that brings back the result of the received query id.
Calls to_pandas().to_html() on those results.

That gives us the “ugly” classic HTML table:

Classic HTML table

Creating a prettier HTML table

Gmail will only apply CSS styles to a table if these styles are inlined. There are more sophisticated ways of doing this, but a simple replace() in Python did the job for me:

create or replace procedure email_html_elegant_table_formatting(query_id string)
returns string
language python
runtime_version = '3.11'
packages = ('snowflake-snowpark-python')
handler = 'x'
execute as caller
as
$$
def x(session, query_id):    
    html_table = session.sql("select * from table(result_scan(?))", params=[query_id]).to_pandas().to_html()
    # https://codepen.io/labnol/pen/poyPejO?editors=1000
    html_table = html_table.replace('class="dataframe"', 'style="border: solid 2px #DDEEEE; border-collapse: collapse; border-spacing: 0; font: normal 14px Roboto, sans-serif;"')
    html_table = html_table.replace('', '')
    html_table = html_table.replace('', '')
    return html_table
$$
;

Styled HTML table
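
To make the inline-style idea concrete outside Snowflake, here is a minimal sketch in plain Python. The style strings and the sample table are assumptions chosen for illustration (they are not the styles used in the procedure above); the only technique shown is swapping the class attribute and the bare <th>/<td> tags for tags that carry inline styles.

```python
# Hypothetical inline styles; colors and padding are illustrative only.
TABLE_STYLE = 'style="border-collapse: collapse; font: normal 14px sans-serif;"'
TH_STYLE = 'style="background-color: #DDEFEF; padding: 10px; text-align: left;"'
TD_STYLE = 'style="border: solid 1px #DDEEEE; padding: 10px;"'


def inline_table_styles(html_table: str) -> str:
    """Replace the class attribute and bare <th>/<td> tags with inline styles."""
    html_table = html_table.replace('class="dataframe"', TABLE_STYLE)
    html_table = html_table.replace("<th>", f"<th {TH_STYLE}>")
    html_table = html_table.replace("<td>", f"<td {TD_STYLE}>")
    return html_table


# A tiny hand-written table shaped like a dataframe export (sample data is made up).
sample = (
    '<table border="1" class="dataframe">'
    "<tr><th>repo_name</th><th>stars</th></tr>"
    "<tr><td>apache/iceberg</td><td>42</td></tr>"
    "</table>"
)
styled = inline_table_styles(sample)
print(styled)
```

Plain string replacement only matches tags that have no attributes, which is why this is a quick fix rather than a general solution.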

TO-DO: We could do more sophisticated styling and modifications with BeautifulSoup:

Untested code to play with an HTML table using BeautifulSoup.

Next steps: Making it production grade

This post so far has useful code to generate fancy e-mail notifications inside Snowflake — but more work is needed to make this ready for production.

For example, it would be great to have:

A more solid stored-procedure.
Error handling.
Max email length awareness.
Multiple recipients.
Leverage the new Snowflake alerts.

I’ve discussed many of these ideas with Ryan M Bacastow (who I’m grateful to for inspiring this post). Hopefully we will be able to publish more about this soon.

Slack Notifications

Two different approaches:

Sean Kim published “Send Slack Messages from Snowflake with Snowpark External Network Access”, which uses an External Function to send messages directly to a custom Slack App. He also creates plots with a mechanism similar to the one described in this post.
TJ Murphy uses custom email addresses to send notifications to Slack, as described in his LinkedIn comment:

To set up Slack notifications from Snowflake:

1. Get the Slack notif email address. Right click the channel > View channel details > Integrations > Send emails to this channel

2. Create a new Snowflake user. Set the user's email to the Slack notif email address.

3. Log in as the user. Go to the user profile in Snowsight and click the "Send Verification Email" link.

4. Go to the Slack channel. You should see the validation email as a message in the channel (see image). Click the verification link.

5. Create or update a Snowflake Email Integration to add the Slack notif email address.

6. Call SYSTEM$SEND_EMAIL and get the notifications in Slack!

-- TJ Murphy

PDF generation

Added on 2024–05:

Snowflake- IS there any way to generate a PDF report from snowflake table?

Want more?

Try this out with a Snowflake free trial account — you only need an email address to get started.

Testing the Snowflake Query Acceleration Service with a 17 TB table

Felipe Hoffa — Wed, 10 Jan 2024 17:02:24 GMT

Snowflake has an easy way to make queries faster while using the smallest warehouses: The Query Acceleration Service has been ready for production (GA) since February 2023. Let’s test it right now by scanning 17 terabytes of GitHub events.

Amping up Snowflake’s compute

Traditionally Snowflake has offered 2 easy ways of increasing compute power when dealing with larger queries and concurrency:

Scale your session’s “virtual warehouse” to a larger size.
Set up “multi-cluster warehouses” that dynamically add more clusters to deal with peaks of concurrent usage.

With these 2 basic elements, users are able to set up policies to control costs and divide resources between different teams and workloads.

For example — for my Snowflake experiments I usually do everything on my own “Small Warehouse”. This keeps costs low, and it’s usually pretty fast and predictable. I only need to scale to larger warehouses when dealing with huge transformations and extracts, like the example we are going to play with today.

Scaling a warehouse up to get faster results on slower queries is super easy, barely an inconvenience. I can jump at any moment from a “Small WH” to an “Extra Large WH”, to a “Medium WH”, to a “4X-Large WH”, etc. This is cool, but then the question becomes: “How can I tell exactly what’s the best WH size for my upcoming queries?”

Instead of resizing warehouses, it would be really cool if I could run my whole session on a “Small WH” (or an “Extra Small WH”) and then I could have Snowflake automatically intercept my larger queries, and run them with way more resources in a “magic serverless” way.

And that’s exactly what the new Query Acceleration Service does. Let’s test it out here (with a Snowflake Enterprise Edition account).

https://medium.com/media/fd7070b14c7d275e1204c05b7290be5c/href

Extracting data from GitHub’s 17 Terabyte Archive

For this experiment we are going to look into GH Archive — a collection of all GitHub events. I did a lot of experiments with it in my past life at Google, and now Cybersyn has made a copy of the GH Archive on the Snowflake Marketplace.

To bring this dataset into your Snowflake account, just ask your Account Admin to import it at no cost:

Importing the GH Archive into your Snowflake account

For more tips, check my post “Querying GitHub Archive with Snowflake: The Essentials”. In the meantime let’s continue with a straightforward example.

Note in the above screenshot that I renamed the incoming database to GHARCHIVE for cleaner querying.

Once we have GHARCHIVE in our account, we can see 3 tables — with the main one being events:

select count(*)
from gharchive.cybersyn.github_events;
-- 1.4s
-- 6,966,010,260

That’s 7 billion rows of rich history, and a lot of data to deal with. The first step when exploring datasets this large should be to extract a subset of rows with the data we are interested in.

For example, this is the whole history of Apache Iceberg on GitHub:

create or replace table gharchive_iceberg
as
select *
from gharchive.cybersyn.github_events
where repo_id = 158256479
order by created_at
-- 19m 52s small
-- 96s xxlarge
;

This table extraction took only 96 seconds on a “2X Large WH”, but a long ~20 minutes on my usual “Small WH”.

Can the Query Acceleration Service (QAS) help here? There’s a very easy way to tell:

select system$estimate_query_acceleration('01b191db-0603-f84f-002f-a0030023f256');

And the response is “yes” (when using the query id from the ~20m run):

{
  "queryUUID":"01b191db-0603-f84f-002f-a0030023f256",
  "status":"eligible", 
  "originalQueryTime":1191.759,
  "estimatedQueryTimes":{"1":608,"2":411,"4":254,"8":149,"31":55},
  "upperLimitScaleFactor":31
}

Snowflake is telling us that the query took about 1,192s, and that if we had let QAS help, it could have taken between 608s and 55s, depending on the maximum scale factor we allow (in this case, up to 31).
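
As a small illustration of how to read that response, the sketch below parses the JSON shown above and picks the smallest listed scale factor that meets a target runtime. The helper name smallest_scale_factor and the 300-second target are my own assumptions for the example; only the JSON fields come from the response.

```python
import json

# Response copied from the example above.
estimate_json = """
{
  "queryUUID": "01b191db-0603-f84f-002f-a0030023f256",
  "status": "eligible",
  "originalQueryTime": 1191.759,
  "estimatedQueryTimes": {"1": 608, "2": 411, "4": 254, "8": 149, "31": 55},
  "upperLimitScaleFactor": 31
}
"""


def smallest_scale_factor(estimate: dict, target_seconds: float):
    """Return the smallest listed scale factor whose estimate meets the target, else None."""
    if estimate.get("status") != "eligible":
        return None
    # JSON object keys are strings, so convert them to integers before sorting.
    times = {int(k): v for k, v in estimate["estimatedQueryTimes"].items()}
    for factor in sorted(times):
        if times[factor] <= target_seconds:
            return factor
    return None


estimate = json.loads(estimate_json)
factor = smallest_scale_factor(estimate, target_seconds=300)  # hypothetical target
speedup = estimate["originalQueryTime"] / estimate["estimatedQueryTimes"][str(factor)]
print(factor, round(speedup, 1))
```

With a 300-second target this selects scale factor 4 (estimated 254s), roughly a 4.7x speedup over the original run.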

To test QAS, I created a new WH. To make this test more dramatic, I made it an “Extra Small” with unlimited scaling power:

use role sysadmin
;
create warehouse xs_acc
warehouse_size = xsmall
enable_query_acceleration = true
query_acceleration_max_scale_factor = 0
;
grant usage on warehouse xs_acc to public
;

If I use the “Extra Small WH with unlimited QAS”, Snowflake now automatically accelerates this query:

use warehouse xs_acc
;
create or replace table gharchive_iceberg
as
select *
from gharchive.cybersyn.github_events
where repo_id = 158256479
order by created_at
-- 19m 52s small
-- 96s xxlarge
-- 77s xs_acc
;

To check the cost of this QAS query that ran in 77s while I was working within an “Extra Small WH” session, we can check the logs:

use role accountadmin
;
select * from
table(information_schema.query_acceleration_history())
;

-- CREDITS_USED WAREHOUSE_NAME NUM_FILES_SCANNED NUM_BYTES_SCANNED
-- 0.499199157   XS_ACC         296,389           5,025,053,295,104

We can see that the query scanned 5 terabytes of data, for a total cost of about 0.5 credits. Depending on the region, with an Enterprise Edition Snowflake account that should be around $1.50.

In comparison:

Small WH, Enterprise edition, 1192s: $3*2*1192/3600 = $1.99 (+ time between queries and auto-suspend)
2XL WH, Enterprise edition, 96s: $3*32*96/3600 = $2.56 (+ time between queries and auto-suspend)
QAS 21x auto-acceleration, within an X-Small session, 77s: $3*0.5 = $1.5 (serverless model, no auto-suspend needed for the QAS queries, but the XS session kept running for $3*1*67/3600=$0.06 extra)
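
The arithmetic in this comparison can be reproduced with a few lines of Python. This is only a sketch of the calculation as stated above: it assumes $3 per credit, uses the credits-per-hour figures implied by the formulas (1 for X-Small, 2 for Small, 32 for 2X-Large), and ignores idle time before auto-suspend.

```python
PRICE_PER_CREDIT = 3.0  # USD per credit, the Enterprise Edition rate assumed in the text

# Credits per hour for the warehouse sizes used in the comparison.
CREDITS_PER_HOUR = {"XS": 1, "S": 2, "2XL": 32}


def warehouse_cost(size: str, seconds: float) -> float:
    """Cost in USD of keeping a warehouse of this size running for `seconds`."""
    return PRICE_PER_CREDIT * CREDITS_PER_HOUR[size] * seconds / 3600


small_cost = warehouse_cost("S", 1192)       # Small WH, 1192s
xxl_cost = warehouse_cost("2XL", 96)         # 2XL WH, 96s
qas_cost = PRICE_PER_CREDIT * 0.499199157    # CREDITS_USED from query_acceleration_history
xs_session_cost = warehouse_cost("XS", 67)   # XS session time, as quoted in the text

for label, cost in [
    ("Small WH", small_cost),
    ("2XL WH", xxl_cost),
    ("QAS credits", qas_cost),
    ("XS session", xs_session_cost),
    ("QAS + XS session", qas_cost + xs_session_cost),
]:
    print(f"{label}: ${cost:.2f}")
```

Adding the X-Small session to the QAS credits gives about $1.55, still below both the Small and the 2XL runs.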

This is the power of the Query Acceleration Service: When it works, we don’t need to worry anymore about re-sizing warehouses, and we can let Snowflake take care of the huge queries that need extra power.

Query Acceleration Caveats

QAS is Generally Available (GA) in Snowflake and ready for you to use.

However, you will notice that it’s picky about which queries it decides to accelerate, and I expect this set of supported queries to grow over time.

You can find a handy history of the queries in your account that could have been accelerated:

SELECT query_id, eligible_query_acceleration_time
FROM snowflake.account_usage.query_acceleration_eligible
ORDER BY eligible_query_acceleration_time DESC;

The docs also list what kind of queries are not eligible for acceleration (for now):

Some queries are ineligible for query acceleration. The following are common reasons why a query cannot be accelerated:

- The query does not filter or aggregate.
- The filters are not selective enough. Alternatively, the GROUP BY expression has a high cardinality.
- There are not enough partitions. If there are not enough partitions to scan, the benefits of query acceleration are offset by the latency in acquiring resources for the query acceleration service.
- The query includes a LIMIT clause but does not have an ORDER BY clause.
- The query includes functions that return nondeterministic results (for example, SEQ or RANDOM).

For example this query could not get QAS during my tests:

select min(created_at), max(created_at), current_timestamp()
from gharchive.cybersyn.github_events
where repo_id = 158256479
limit 10

But this one that produces the same results does — thanks to adding group by and order by:

select repo_id, min(created_at), max(created_at), current_timestamp()
from gharchive.cybersyn.github_events
where repo_id = 158256479
group by repo_id
order by repo_id
limit 10
-- 23s xs_acc

Next steps

To go deeper analyzing GitHub, check my post “Querying GitHub Archive with Snowflake: The Essentials”.
Check my conversation and live demo featuring Query Acceleration Service with Product Manager Tim Sander.
Check out how many of your queries could have been accelerated with QAS in your account.
Report results, and give us feedback on what else you’d like QAS to take care of.
Try the combination of QAS + Search Optimization.

Want more?

Try this out with a Snowflake free trial account — you only need an email address to get started.

Discover the new Snowpark ML Toolkit + dbt Python models

Felipe Hoffa — Wed, 20 Sep 2023 15:58:09 GMT

Let’s do some feature engineering, training, and inference with Snowpark ML and the dbt Python models. First with 50k rows and then with 50M rows, to prove how this new toolkit helps us scale with Snowflake, while the dbt Python models take care of all the boilerplate.

Intro

Snowpark ML is a new set of tools for building and deploying machine learning models in Snowflake. The best part is that you get familiar ML constructs (Scikit-Learn, XGBoost, LightGBM, …), paired with all the power, security, and scalability of Snowflake.

Why I’m writing this

In my previous post I shared how dbt and Snowflake pair beautifully to power the new dbt Python models.
The dbt Python models docs offer an in-depth use case of how to do ML prep, cleaning, training, and prediction with dbt Python + Snowpark.
But — those docs don’t leverage the new Snowpark ML Toolkit — which will make these previous tasks perform even better.
Meanwhile the Snowpark ML Toolkit quickstart offers an in-depth explanation of how to use this new ML Toolkit.
But — that quickstart starts by asking us to download and install Python libraries in our VMs. I don’t like that step. I would rather use a cloud tool (like dbt Cloud), and have every library managed “magically”.

So let’s do that: A full exploration of the Snowpark ML Toolkit using the dbt Python models on dbt Cloud. Zero packages installation, just scalable ML fun.

What’s interesting about dbt in this process:

dbt Cloud is not aware of the Snowpark ML libraries, and it doesn’t need to be.
dbt Cloud’s role here is mainly to wrap our code within a Snowpark boilerplate stored procedure and let Snowflake handle the rest.
You could run the same with the open source dbt-core on your desktop — I just love how dbt Cloud keeps everything running in the cloud while we work on this code.
Check my previous post for details:

How Snowflake makes the dbt Python models shine

Before we start

To follow along, you will need:

A Snowflake account (free trial works).
A dbt Cloud account (the free developer version works fine — or you could run all this locally, if you enjoy installing stuff).
Set up a user to connect your dbt Cloud account to your Snowflake account.
Set up a stage in your Snowflake account so it can read the typical ML example file (diamonds.csv) out of a public S3 bucket:

-- create csv format
CREATE FILE FORMAT IF NOT EXISTS CSVFORMAT 
SKIP_HEADER = 1 
TYPE = 'CSV';

-- create external stage with the csv format to stage the diamonds dataset
CREATE STAGE IF NOT EXISTS DIAMONDS_ASSETS 
FILE_FORMAT = CSVFORMAT 
URL = 's3://sfquickstarts/intro-to-machine-learning-with-snowpark-ml-for-python/diamonds.csv';

(Check out my previous post to understand what a dbt Python model is and how they work with Snowflake)

Our first model: the diamonds data

This one is a typical dbt SQL model, that copies the data out of S3 (as set up in the previous step) into a Snowflake table:

-- diamonds.sql
{{config(materialized='table')}}

select $1::float "CARAT"
  , trim(upper($2::string), '"') "CUT"
  , trim(upper($3::string), '"') "COLOR"
  , trim(upper($4::string), '"') "CLARITY"
  , $5::float "DEPTH", $6::float "TABLE_PCT"
  , $7::float "PRICE"
  , $8::float "X", $9::float "Y", $10::float "Z"
from @DIAMONDS_ASSETS

Running this took less than 2 seconds — as it’s just a small file with 54k rows (we’ll worry about scalability and performance later in this post).

The diamonds table

Normalizing and One-Hot encoding with Snowpark ML

This is where the fun starts. Check out this dbt Python model:

# snowpark_ml_diamonds_transform.py

import numpy as np

import snowflake.ml.modeling.preprocessing as preproc
from snowflake.ml.modeling.pipeline import Pipeline
import snowflake.snowpark.types as T

# https://github.com/Snowflake-Labs/sfguide-intro-to-machine-learning-with-snowpark-ml-for-python/blob/main/3_snowpark_ml_model_training_deployment.ipynb

def model(dbt, session):
    dbt.config(
        packages = ['snowflake-ml-python']
    )  

    diamonds_df = dbt.ref('diamonds')

    # Normalize the CARAT column
    snowml_mms = preproc.MinMaxScaler(input_cols=["CARAT"], output_cols=["CARAT_NORM"])
    normalized_diamonds_df = snowml_mms.fit(diamonds_df).transform(diamonds_df)
    
    # Reduce the number of decimals
    new_col = normalized_diamonds_df.col("CARAT_NORM").cast(T.DecimalType(7, 6))
    normalized_diamonds_df = normalized_diamonds_df.with_column("CARAT_NORM", new_col)

    # Encode CUT and CLARITY, preserving ordinal importance
    categories = {
        "CUT": np.array(["IDEAL", "PREMIUM", "VERY GOOD", "GOOD", "FAIR"]),
        "CLARITY": np.array(["IF", "VVS1", "VVS2", "VS1", "VS2", "SI1", "SI2", "I1", "I2", "I3"]),
    }
    snowml_oe = preproc.OrdinalEncoder(input_cols=["CUT", "CLARITY"], output_cols=["CUT_OE", "CLARITY_OE"], categories=categories)
    ord_encoded_diamonds_df = snowml_oe.fit(normalized_diamonds_df).transform(normalized_diamonds_df)
    # Encode categoricals to numeric columns
    snowml_ohe = preproc.OneHotEncoder(input_cols=["CUT", "COLOR", "CLARITY"], output_cols=["CUT_OHE", "COLOR_OHE", "CLARITY_OHE"])
    transformed_diamonds_df = snowml_ohe.fit(ord_encoded_diamonds_df).transform(ord_encoded_diamonds_df)

    # return ord_encoded_diamonds_df
    return transformed_diamonds_df

What’s notable here:

We don’t need to install snowflake-ml-python, as it’s already provided by Anaconda in Snowflake. We just need to ask for packages = ['snowflake-ml-python'].
The snowflake.ml.modeling.preprocessing module offers classes that look very similar to their sklearn counterparts, while Snowflake takes care of making them run and scale within the Snowflake world.
Among these familiar sklearn constructs, we used MinMaxScaler, OneHotEncoder, and OrdinalEncoder here.
The last step of this model returns a Snowpark Dataframe, that gets persisted as a Snowflake table with the requested transformations.
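
For readers less familiar with these three transformations, here is a plain-Python sketch of what each one computes on a handful of values. It is not how Snowpark ML implements them (those run inside Snowflake on DataFrames); the function names and the three sample diamonds are made up for illustration.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def ordinal_encode(values, categories):
    """Map each value to its position in an ordered list of categories."""
    index = {cat: i for i, cat in enumerate(categories)}
    return [index[v] for v in values]


def one_hot_encode(values, categories):
    """Map each value to a 0/1 vector with a single 1 at its category."""
    return [[1 if v == cat else 0 for cat in categories] for v in values]


# Hypothetical sample rows; the category order matches the model above.
carats = [0.2, 0.5, 1.1]
cuts = ["IDEAL", "GOOD", "PREMIUM"]
CUT_ORDER = ["IDEAL", "PREMIUM", "VERY GOOD", "GOOD", "FAIR"]

carat_norm = min_max_scale(carats)
cut_oe = ordinal_encode(cuts, CUT_ORDER)
cut_ohe = one_hot_encode(cuts, CUT_ORDER)
print(carat_norm, cut_oe, cut_ohe)
```

Ordinal encoding keeps the ranking of the categories in a single column, while one-hot encoding spends one column per category and drops the ranking.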

Transformed and one-hot encoded diamonds

Running this step took 48s (we’ll check later how it scales with more data).
The main transformation and persistence step took only 0.44s on a Medium-wh. The query profile looks cool, and hints that Snowflake will make this scale:

Query profile — transforming, encoding, and persisting

A transformation pipeline

Let’s re-implement the above with a re-usable Pipeline:

# snowpark_ml_diamonds_transform_pipeline.py

import io
import joblib
import numpy as np

import snowflake.ml.modeling.preprocessing as preproc
from snowflake.ml.modeling.pipeline import Pipeline
import snowflake.snowpark.functions as F
import snowflake.snowpark.types as T

# https://github.com/Snowflake-Labs/sfguide-intro-to-machine-learning-with-snowpark-ml-for-python/blob/main/3_snowpark_ml_model_training_deployment.ipynb
# https://docs.getdbt.com/guides/dbt-ecosystem/dbt-python-snowpark/12-machine-learning-training-prediction

def save_file(session, model, path, dest_filename):
    input_stream = io.BytesIO()
    joblib.dump(model, input_stream)
    session._conn.upload_stream(input_stream, path, dest_filename)
    return "successfully created file: " + path

def model(dbt, session):
    dbt.config(
        packages = ['snowflake-ml-python', 'joblib']
    )  

    diamonds_df = dbt.ref('diamonds')

    CATEGORICAL_COLUMNS = ["CUT", "COLOR", "CLARITY"]
    CATEGORICAL_COLUMNS_OE = ["CUT_OE", "COLOR_OE", "CLARITY_OE"] # To name the ordinal encoded columns
    NUMERICAL_COLUMNS = ["CARAT", "DEPTH", "TABLE_PCT", "X", "Y", "Z"]

    categories = {
        "CUT": np.array(["IDEAL", "PREMIUM", "VERY GOOD", "GOOD", "FAIR"]),
        "CLARITY": np.array(["IF", "VVS1", "VVS2", "VS1", "VS2", "SI1", "SI2", "I1", "I2", "I3"]),
        "COLOR": np.array(['D', 'E', 'F', 'G', 'H', 'I', 'J']),
    }
    preprocessing_pipeline = Pipeline(
        steps=[
                (
                    "OE",
                    preproc.OrdinalEncoder(
                        input_cols=CATEGORICAL_COLUMNS,
                        output_cols=CATEGORICAL_COLUMNS_OE,
                        categories=categories,
                    )
                ),
                (
                    "MMS",
                    preproc.MinMaxScaler(
                        clip=True,
                        input_cols=NUMERICAL_COLUMNS,
                        output_cols=NUMERICAL_COLUMNS,
                    )
                )
        ]
    )

    transformed_diamonds_df = preprocessing_pipeline.fit(diamonds_df).transform(diamonds_df)

    version = '1.0'
    session.sql('create or replace stage MODELSTAGE').collect()
    save_file(session, preprocessing_pipeline, '@MODELSTAGE/preprocessing_pipeline_'+version, 'preprocessing_pipeline_'+version+'.joblib' )

    return transformed_diamonds_df

What’s notable here:

We created a preprocessing_pipeline — which can be re-used for further training and deployment.
The save_file() function takes care of serializing this pipeline and saving the binary into a Snowflake stage for future re-use.
We went with Ordinal Encoding here (instead of One-Hot), which is a good thing for the saved pipeline to remember.
We used snowflake.ml.modeling.pipeline and the encoders in snowflake.ml.modeling.preprocessing, which look familiar yet are optimized for Snowflake.
The Snowflake docs display a * next to the libraries that support distributed execution (and you can expect more on the list to do so).
The model returns a data frame with all the diamonds pre-processed and ready for training our model — which gets persisted into a Snowflake table.
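
The idea behind save_file(), serializing an object into an in-memory byte stream that can later be uploaded and restored, can be tried locally with a small sketch. Note the swap: this uses the standard library's pickle in place of joblib, and a plain dictionary as a hypothetical stand-in for the fitted pipeline, so it only shows the round trip and not the upload to a Snowflake stage.

```python
import io
import pickle


def dump_to_stream(obj) -> io.BytesIO:
    """Serialize obj into an in-memory byte stream, rewound and ready to be read."""
    stream = io.BytesIO()
    pickle.dump(obj, stream)
    stream.seek(0)  # rewind so a reader starts at the first byte
    return stream


# Hypothetical stand-in for a fitted pipeline: just the state worth remembering.
fitted_state = {
    "version": "1.0",
    "CUT": ["IDEAL", "PREMIUM", "VERY GOOD", "GOOD", "FAIR"],
}

stream = dump_to_stream(fitted_state)
restored = pickle.load(stream)
print(restored == fitted_state)
```

Whatever consumes the stream, a stage upload or a later load, gets back an equal copy of the original object, which is what makes the saved pipeline re-usable for training and deployment.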

Diamonds ready for ML mining

The whole process took 49s on a M-wh, with the main transformation and persistence taking only 1.1s. The query profile is interesting too:

Query profile of the transformation pipeline

Train and predict with XGBoost

Another simple dbt Python model that trains and predicts with XGBoost:

import numpy as np

from snowflake.ml.modeling.xgboost import XGBRegressor
import snowflake.snowpark.functions as F
import snowflake.snowpark.types as T

# https://github.com/Snowflake-Labs/sfguide-intro-to-machine-learning-with-snowpark-ml-for-python/blob/main/3_snowpark_ml_model_training_deployment.ipynb
# https://docs.getdbt.com/guides/dbt-ecosystem/dbt-python-snowpark/12-machine-learning-training-prediction

def model(dbt, session):
    dbt.config(
        packages = ['snowflake-ml-python']
    )  

    diamonds_df = dbt.ref('snowpark_ml_diamonds_transform_pipeline')
    diamonds_train_df, diamonds_test_df = diamonds_df.random_split(weights=[0.9, 0.1], seed=0)
    train_df, test_df = diamonds_train_df, diamonds_test_df

    CATEGORICAL_COLUMNS = ["CUT", "COLOR", "CLARITY"]
    CATEGORICAL_COLUMNS_OE = ["CUT_OE", "COLOR_OE", "CLARITY_OE"] # To name the ordinal encoded columns
    NUMERICAL_COLUMNS = ["CARAT", "DEPTH", "TABLE_PCT", "X", "Y", "Z"]
    LABEL_COLUMNS = ['PRICE']
    OUTPUT_COLUMNS = ['PREDICTED_PRICE']
    regressor = XGBRegressor(
        input_cols=CATEGORICAL_COLUMNS_OE+NUMERICAL_COLUMNS,
        label_cols=LABEL_COLUMNS,
        output_cols=OUTPUT_COLUMNS
    )

    regressor.fit(train_df)
    return regressor.predict(test_df)

What’s notable here:

Instead of using the Pipeline we saved in the previous step, we can just use the transformed data persisted at the end. We could have used it, if we wanted to show off using that binary file out of a Snowflake stage.
We get to divide our diamonds data into training and test with a simple diamonds_df.random_split(weights=[0.9, 0.1]).
Then we get to train with regressor.fit(train_df) and persist the result of our test data with return regressor.predict(test_df).
We can use Snowsight for a quick visual check that the predicted prices are pretty close to the expected results:

Predicted prices vs actual prices

We are using XGBRegressor as implemented by snowflake.ml.modeling.xgboost. The docs say it hasn’t been prepared for distributed execution yet — but by using it we will get any improvements deployed by Snowflake to snowflake.ml.
The whole process took 1m28s on a Medium-wh.
The fit step took 29s. It’s interesting to look at the implementation of the stored procedure happening under the hood:

XGBoost training in action inside Snowflake

The inference step took 8.2s, and the query profile shows how it gets distributed with a Python UDF:

Predicting price for 10% of the rows (test set)
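
To get a feel for what a weighted random split does, here is a plain-Python sketch that assigns each row to a bucket with a seeded random draw. I am assuming row-by-row random assignment, so the 90/10 proportions are approximate rather than exact; this is an illustration of the idea, not Snowpark's implementation of random_split().

```python
import random


def random_split(rows, weights, seed=0):
    """Assign each row to one bucket by a seeded random draw (bucket sizes are approximate)."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative, normalized upper bounds for each bucket, e.g. [0.9, 1.0].
    bounds, acc = [], 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    parts = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            # The last bucket also catches any floating-point remainder.
            if draw < bound or i == len(bounds) - 1:
                parts[i].append(row)
                break
    return parts


train_rows, test_rows = random_split(range(10_000), weights=[0.9, 0.1], seed=0)
print(len(train_rows), len(test_rows))
```

Fixing the seed makes the split repeatable, so re-running the model trains and tests on the same rows.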

Will it scale?

That was quick and cool. Now let’s see if it will scale with 1000x the amount of data.

Generating 1000 times more data

This is how I generated 1000x random data, based on the existing diamonds, with a dbt Python model:

# diamonds_random_1000x.py
import snowflake.snowpark.functions as F
import snowflake.snowpark.types as T

import random

def model(dbt, session):
    df = dbt.ref('diamonds')

    # Summarize the existing data to get random data in that range
    grouped_df = df.groupBy().agg(
        F.collect_set("CUT").alias("CUT"),
        F.collect_set("COLOR").alias("COLOR"),
        F.collect_set("CLARITY").alias("CLARITY"),
        F.array_construct(F.min("CARAT"), F.max("CARAT")).alias("CARAT"),
        F.array_construct(F.min("DEPTH"), F.max("DEPTH")).alias("DEPTH"),
        F.array_construct(F.min("TABLE_PCT"), F.max("TABLE_PCT")).alias("TABLE_PCT"),
        F.array_construct(F.min("X"), F.max("X")).alias("X"),
        F.array_construct(F.min("Y"), F.max("Y")).alias("Y"),
        F.array_construct(F.min("Z"), F.max("Z")).alias("Z"),
        F.array_construct(F.min("PRICE"), F.max("PRICE")).alias("PRICE"),
    )

    @F.udf(input_types=[T.ArrayType()], return_type=T.StringType())
    def random_from_array(x):
        return random.choice(x)

    num_rows = df.count() * 1000
    range_df = session.range(num_rows).withColumn("dummy", F.col("id"))
    exploded_df = grouped_df.crossJoin(range_df)

    # Generate random values within the specified range for numerical columns
    result_df = exploded_df.select(
        random_from_array(F.col('CUT')).alias('CUT'), 
        random_from_array(F.col('COLOR')).alias('COLOR'), 
        random_from_array(F.col('CLARITY')).alias('CLARITY'), 
        (F.uniform(F.cast(F.lit(0), T.FloatType()), F.lit(1), F.random()) * (F.col('CARAT')[1] - F.col('CARAT')[0]) + F.col('CARAT')[0]).alias('CARAT'),
        (F.uniform(F.cast(F.lit(0), T.FloatType()), F.lit(1), F.random()) * (F.col('DEPTH')[1] - F.col('DEPTH')[0]) + F.col('DEPTH')[0]).alias('DEPTH'),
        (F.uniform(F.cast(F.lit(0), T.FloatType()), F.lit(1), F.random()) * (F.col('TABLE_PCT')[1] - F.col('TABLE_PCT')[0]) + F.col('TABLE_PCT')[0]).alias('TABLE_PCT'),
        (F.uniform(F.cast(F.lit(0), T.FloatType()), F.lit(1), F.random()) * (F.col('X')[1] - F.col('X')[0]) + F.col('X')[0]).alias('X'),
        (F.uniform(F.cast(F.lit(0), T.FloatType()), F.lit(1), F.random()) * (F.col('Y')[1] - F.col('Y')[0]) + F.col('Y')[0]).alias('Y'),
        (F.uniform(F.cast(F.lit(0), T.FloatType()), F.lit(1), F.random()) * (F.col('Z')[1] - F.col('Z')[0]) + F.col('Z')[0]).alias('Z'),
    )

    result_df = result_df.withColumn("PRICE", F.col("TABLE_PCT") * F.col("Y"))
    return result_df
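
The two tricks in the model above, scaling a uniform draw into a [min, max] range and picking a random categorical value, can be sketched in plain Python. The carat range below is a hypothetical stand-in for the min/max summary, and the snippet runs locally rather than as a Snowpark DataFrame expression.

```python
import random


def random_in_range(rng, lo, hi):
    """Scale a uniform draw from [0, 1) into the [lo, hi] range."""
    return rng.random() * (hi - lo) + lo


rng = random.Random(0)  # seeded so the sketch is repeatable

carat_range = (0.2, 5.01)  # hypothetical (min, max) summary of the sample data
cuts = ["IDEAL", "PREMIUM", "VERY GOOD", "GOOD", "FAIR"]

rows = [
    {"CUT": rng.choice(cuts), "CARAT": random_in_range(rng, *carat_range)}
    for _ in range(1000)
]
print(rows[0])
```

The expression u * (max - min) + min is the same shape as the F.uniform(...) * (col[1] - col[0]) + col[0] pattern in the model, and rng.choice() plays the role of the random_from_array() UDF.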

What’s notable here:

This is the step that took me the longest time to write.
You can see how hard I had to work to find a way to produce a random number between the sample data min and max: (F.uniform(F.cast(F.lit(0), T.FloatType()), F.lit(1), F.random()) * (F.col('CARAT')[1] - F.col('CARAT')[0]) + F.col('CARAT')[0]).alias('CARAT').
Then I wasn’t able to find an easy Dataframe way to choose a random value within the summary arrays of the categorical data. The easy solution instead was writing a simple Python UDF (random_from_array()).
I gave the price a simple pattern depending on a couple of these variables; it will then be XGBoost’s job to discover this pattern within the data.
Creating these 54 million rows took 1m7s in a Medium-wh, with the main generated query taking 57s. As always, it’s fun to look at that query profile:

Transformation pipeline with 1000x data

We are going to skip the code here, as it’s exactly the same code as above, but instead of depending on dbt.ref('diamonds') it asks for the just created dbt.ref('diamonds_random_1000x') table.

What’s notable here:

The whole process took 1m on a M-wh, with the main transformation and persistence taking only 8.3s.
These are cool numbers at scale: The original transformation pipeline took 49s on a Medium-wh, with the transformation and persistence taking 1.1s. That’s pretty cool, as we scaled to 54 million rows (from only 54 thousand), taking only a little bit more time.
The query profile:

Only 8s with 1000x more data

Training and predicting with 1000x data

We are going to skip the code here, as it’s exactly the same code as above. But instead of depending on dbt.ref('snowpark_ml_diamonds_transform_pipeline'), it asks for the just created dbt.ref('snowpark_ml_diamonds_transform_pipeline_1000x').

What’s notable here:

After 2m43s, this step failed (at first). This is the log:

Failure log of training with 1000x the data

The reason is simple, and the fix is too: Our Medium-wh didn’t have enough RAM for XGBoost to train the model with 1000x the data. That’s why Snowflake now has the “Snowpark Optimized Warehouses”, which are like normal warehouses, but with a lot more RAM and related optimizations.
Once switched to a Snowpark-Medium-wh, the whole process took 5m53s. This is cool, compared to the previous 1m28s, because now we have 1000x the data.
The training part took 4m48s (compared to 29s). This is where distributed execution support might help us a lot (when/if Snowflake implements it transparently for you).
Predicting for 5.4 million rows took 15s (compared to 8.2s with 5.4 thousand rows). The query profile:

Predicting for 5.4 million rows of diamonds

Performance note: The Medium-Snowpark-wh performed similarly to a Large-Snowpark-wh, given that this XGBoost isn’t (yet) distributed.
The predicted prices versus the actual prices look even better in this chart. We can say that XGBoost successfully found the simple multiplicative relationship (PRICE = TABLE_PCT * Y) I chose for the pricing of the random data:

Summary

As we used dbt to connect all these transformations, it also helps us visualize how the models relate to each other:

How dbt transformed `diamonds` through augmentation, transforming, and training.

dbt allows us to switch between SQL and Python transformations in a pipeline, and takes care of adding boilerplate code when pushing these to Snowflake.
snowflake.ml.modeling packages familiar Python ML tools (Scikit-Learn, XGBoost, LightGBM, …) and improves their performance and scalability on Snowflake.
Scaling from 50 thousand rows to 50 million rows was easy and quick.
We can expect Snowflake to keep improving these libraries and their scalability.

Next steps

We saw how to persist binaries into a Snowflake stage, but registry and deployment of models could be better handled (stay tuned).
Check “ML on Snowflake at scale with Snowpark (Part-2)” from Simran Khara and “ML on Snowflake at scale with Snowpark Python and XGBoost” from Chase Ginther.

Suggestions from my previous post:

Try the dbt Python models in your Snowflake account, and share your results.
Participate in the dbt community to shape the future of the dbt Python models.
Read the dbt Python model docs, and the Snowflake Snowpark for Python library docs.
Share your findings with the dbt community on the dbt Slack channels #dbt-core-python-models and #db-snowflake.
Try some dbt+Snowflake quickstarts like “Data Engineering with Snowpark Python and dbt” and “Leverage dbt Cloud to Generate ML ready pipelines using Snowpark Python”.
Compare the simplicity, power, and performance of the dbt Python models on Snowflake versus the setup that dbt had to pull off to run Python models on other platforms.
Check the previous posts by Jeremiah Hansen and Eda Johnson “Data Engineering with Snowpark Python and dbt” and “A First Look at the dbt Python Models with Snowpark”. Also phData’s “How to Use dbt With Snowpark Python to Implement Sentiment Analysis” and “How to Build a Python Model in dbt with Snowflake”.
For ML, dbt has an in-depth guide “Leverage dbt Cloud to generate analytics and ML-ready pipelines with SQL and Python with Snowflake”, and tropos published “Time-Series Forecasting With Python For Snowpark And Dbt Labs”.

Want more?

Try this out with a Snowflake free trial account — you only need an email address to get started.
Try dbt Cloud, which I used as a cool web IDE integrated with dbt and Snowflake to develop the examples in this post (it does a lot more than that too, but that’s a story for another post).

I’m Felipe Hoffa, Data Cloud Advocate for Snowflake. Thanks for joining me on this adventure. You can follow me on Twitter and LinkedIn. And subscribe to reddit.com/r/snowflake for the most interesting Snowflake news.

Oh, and I’m on threads.net/fhoffa now too :)