<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Tyler White on Medium]]></title>
        <description><![CDATA[Stories by Tyler White on Medium]]></description>
        <link>https://medium.com/@btylerwhite?source=rss-4c938695f2e2------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*aWvMM9wscA0G8HNXS3hbww.jpeg</url>
            <title>Stories by Tyler White on Medium</title>
            <link>https://medium.com/@btylerwhite?source=rss-4c938695f2e2------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Fri, 10 Apr 2026 16:37:34 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@btylerwhite/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Getting started with goose using Snowflake and MCP]]></title>
            <link>https://medium.com/snowflake/getting-started-with-goose-using-snowflake-and-mcp-4e6c78ca0e19?source=rss-4c938695f2e2------2</link>
            <guid isPermaLink="false">https://medium.com/p/4e6c78ca0e19</guid>
            <category><![CDATA[open-source]]></category>
            <category><![CDATA[snowflake]]></category>
            <category><![CDATA[llm]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[data-engineering]]></category>
            <dc:creator><![CDATA[Tyler White]]></dc:creator>
            <pubDate>Tue, 24 Jun 2025 19:01:55 GMT</pubDate>
            <atom:updated>2025-06-24T19:01:55.289Z</atom:updated>
            <content:encoded><![CDATA[<h4>Snowflake’s Cortex completion models can now be used with goose, making the Snowflake MCP server for Cortex Search and Cortex Analyst a powerful combination.</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1000/0*DIKwoxYiRp9zpaEY" /></figure><h3>Introduction</h3><p>In January 2025, Block announced the release of goose, an open source, extensible AI agent that bridges the gap between LLMs and real-world actions.</p><p>Its support for multiple LLM providers and frontier models democratizes the use of generative AI and allows users to build agentic processes and workflows. These workflows are built on the Model Context Protocol (MCP) open standard, of which Block was an early adopter and contributor.</p><p><a href="https://block.github.io/goose/">codename goose</a></p><p>With the newly open-sourced Snowflake MCP server and the goose Snowflake Cortex integration, you can now use Cortex as a first-party provider and get answers backed by your Snowflake data.</p><p>This blog will show you how to get started with goose and Snowflake.</p><p><a href="https://github.com/snowflake-labs/mcp">GitHub - Snowflake-Labs/mcp</a></p><h3>Getting Started with goose and Snowflake</h3><p>Follow the instructions to install <a href="https://block.github.io/goose/docs/getting-started/installation">goose</a>.</p><p>After installation is complete, you’ll be prompted to configure goose with a provider, and Snowflake is now one of the options. 
Choose Snowflake.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/732/1*Nxi5X5dA_xrFij8dDZR1GQ.png" /><figcaption>The provider list available in goose.</figcaption></figure><p>From there, configuring Snowflake as a provider only requires the hostname and a <a href="https://docs.snowflake.com/en/user-guide/programmatic-access-tokens">Programmatic Access Token (PAT)</a>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/850/0*Vg2drpy7XknN1vL1" /><figcaption>Options to configure Snowflake as a provider for goose.</figcaption></figure><h3>Extensions</h3><p>Before you get started, it’s worth knowing a little bit more about extensions and what they mean for the tool.</p><p>goose integrates with extensions that use the open source Model Context Protocol (MCP), giving you access to a wide range of capabilities. This lets you connect goose to various tools, such as content repositories and business apps. Goose has several built-in extensions, but only the Developer extension is activated by default. This core extension offers key software development tools, and you can activate options like Computer Controller and Memory integration.</p><p>As a final step before diving into using Snowflake and goose, you need to enable the Snowflake extension.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*eReooiONFxjP745N" /><figcaption>Configuring extensions in goose.</figcaption></figure><p>And voilà! You are now ready to use Snowflake with goose!</p><h3>Using the MCP server</h3><p>Snowflake Labs’ new <a href="https://github.com/Snowflake-Labs/mcp">open source MCP server</a> supports the following:</p><ul><li>Cortex Complete: generates completion responses using supported language models</li><li>Cortex Search: low-latency, high-quality “fuzzy” search over unstructured data</li><li>Cortex Analyst: text-to-SQL over structured data</li></ul><p>This means we can use all these features with goose as additional tools! 
For example, I’ve configured two search services and one analyst service in my Snowflake MCP server configuration file:</p><pre>search_services:<br>  - service_name: SEC_SEARCH_SERVICE<br>    description: &gt;<br>      Search service that contains reports that publicly traded US companies<br>      must file with the Securities and Exchange Commission (SEC)<br>    database_name: CUBE_TESTING<br>    schema_name: PUBLIC<br>  - service_name: FOMC_MINUTES_SEARCH_SERVICE<br>    description: &gt;<br>      Search service that contains the minutes of regularly scheduled<br>      meetings held by The Federal Open Market Committee<br>    database_name: DEMO_CORTEX_SEARCH<br>    schema_name: FOMC<br>analyst_services:<br>  - service_name: CUSTOMER_DATA<br>    semantic_model: &#39;@CATRANSLATOR.ANALYTICS.DATA/customers.yaml&#39;<br>    description: &gt;<br>      Analyst service that contains structured customer data to query</pre><p>The LLM recognizes when it must invoke the Snowflake extension and determines which tool to use. In the following example, you will see how the LLM routes to Cortex Analyst and Cortex Search when we ask questions about financial performance and our customers.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*-g5vNQZNI59ZlIlx" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/742/0*donjcndwH9klrxGs" /><figcaption>Goose using Cortex Analyst and Cortex Search.</figcaption></figure><p>This capability allows us to use natural language to interact with our local machine and any additional extensions we wish to configure.</p><h3>Conclusion</h3><p>I’ve personally been using goose to analyze some of our team’s Snowflake query usage to detect anomalies. 
From there, I can use other extensions to summarize the findings and help me write queries to diagnose the anomalies further.</p><p>Taking things to the next level, you can transform a goose session into a reusable recipe encompassing the tools, goals, and setup you’re using, packaging it into a new agent that others (or your future self) can launch with a single click.</p><p>I’m excited to have contributed to goose and enabled this Snowflake functionality for users of the project. I can’t wait to see how the integrations evolve in the future. In the meantime, I hope you’ll check out goose and take the new extension for a spin!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=4e6c78ca0e19" width="1" height="1" alt=""><hr><p><a href="https://medium.com/snowflake/getting-started-with-goose-using-snowflake-and-mcp-4e6c78ca0e19">Getting started with goose using Snowflake and MCP</a> was originally published in <a href="https://medium.com/snowflake">Snowflake Builders Blog: Data Engineers, App Developers, AI, &amp; Data Science</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Cortex Complete in Rust]]></title>
            <link>https://medium.com/@btylerwhite/cortex-complete-in-rust-f897e160f621?source=rss-4c938695f2e2------2</link>
            <guid isPermaLink="false">https://medium.com/p/f897e160f621</guid>
            <category><![CDATA[llm]]></category>
            <category><![CDATA[python]]></category>
            <category><![CDATA[snowflake]]></category>
            <category><![CDATA[rust]]></category>
            <dc:creator><![CDATA[Tyler White]]></dc:creator>
            <pubDate>Mon, 31 Mar 2025 15:09:12 GMT</pubDate>
            <atom:updated>2025-03-31T15:09:12.065Z</atom:updated>
            <content:encoded><![CDATA[<h4>Reaching out to Snowflake LLMs using Rust and Python with pyo3.</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Su75yAy-ENhPHuDP39ye0Q.png" /></figure><p>I’ve been exploring Rust casually, creating a few small CLI utilities tailored to my needs. Recently, however, I discovered an opportunity to streamline some of my work with Cortex and Snowflake. Since I frequently use Cortex Complete for various tasks, I decided to leverage the <a href="https://docs.rs/reqwest/latest/reqwest/">reqwest</a> crate in Rust, using a token from the Python Connector to simplify the process.</p><p>Snowflake’s suite of Cortex functions offers capabilities to invoke various LLMs. I’m particularly interested in the COMPLETE function, which has a corresponding REST API. Performing this task in Rust rather than Python might speed things up.</p><p>In this article, I’ll outline the steps and the code needed to use Snowflake’s Cortex Complete function with Rust and invoke it using Python.</p><h4>Getting started</h4><p>We’ll need Rust and a few other tools to build this thing out. 
I’ve got Rust and uv installed, so I’ll set up my project, and we’ll add everything we need.</p><pre>mkdir cortex-complete<br>cd cortex-complete<br>cargo init<br>cargo add pyo3 -F extension-module -F experimental-async<br>cargo add reqwest -F blocking -F json<br>cargo add serde_json<br>uv init --python 3.11<br>uv sync<br>uv add maturin &quot;snowflake-connector-python[secure-local-storage]&quot;</pre><p>This creates all of the configuration files we’ll need, but we must modify both the Cargo.toml and pyproject.toml files.</p><pre>[package]<br>name = &quot;cortex-complete&quot;<br>version = &quot;0.1.0&quot;<br>edition = &quot;2021&quot;<br><br>[lib]<br>name = &quot;cortex&quot;<br>crate-type = [&quot;cdylib&quot;]<br><br>[dependencies]<br>pyo3 = { version = &quot;0.24.0&quot;, features = [&quot;extension-module&quot;, &quot;experimental-async&quot;] }<br>reqwest = { version = &quot;0.12.15&quot;, features = [&quot;blocking&quot;, &quot;json&quot;] }<br>serde_json = &quot;1.0.140&quot;</pre><pre>[build-system]<br>requires = [&quot;maturin&gt;=1,&lt;2&quot;]<br>build-backend = &quot;maturin&quot;<br><br>[project]<br>name = &quot;cortex-complete&quot;<br>version = &quot;0.1.0&quot;<br>readme = &quot;README.md&quot;<br>requires-python = &quot;&gt;=3.11&quot;<br>dependencies = [<br>    &quot;maturin&gt;=1.8.3&quot;,<br>    &quot;snowflake-connector-python[secure-local-storage]&gt;=3.14.0&quot;,<br>]</pre><p>Now, we can start coding!</p><h4>Library</h4><p>In the src folder, we will create a file named lib.rs to hold the Rust code we will execute from Python. A lot is going on here, so I won’t try to explain it all, but essentially, we’re grabbing the “rest” attribute from the Snowflake Python Connection object to submit a POST request and parse the response. 
Also worth noting: if we call the complete function multiple times, we only need to authenticate once, since the connection object is cached; this reduces MFA prompts.</p>
<pre>use pyo3::{<br>    prelude::*,<br>    types::PyString,<br>};<br>use reqwest::{<br>    blocking::Client,<br>    header::{self, HeaderMap, HeaderValue},<br>};<br>use serde_json::Value;<br>use std::sync::OnceLock;<br><br>static CON: OnceLock&lt;PyObject&gt; = OnceLock::new();<br>static HEADERS: OnceLock&lt;HeaderMap&gt; = OnceLock::new();<br><br>fn get_con() -&gt; Result&lt;PyObject, PyErr&gt; {<br>    Python::with_gil(|py| {<br>        Ok(CON<br>            .get_or_init(|| {<br>                let module = py<br>                    .import(&quot;snowflake.connector&quot;)<br>                    .expect(&quot;Failed to import &#39;snowflake.connector&#39;&quot;);<br>                let con = module<br>                    .call_method(&quot;connect&quot;, (), None)<br>                    .expect(&quot;Failed to call &#39;connect&#39;&quot;);<br>                con.into()<br>            })<br>            .clone_ref(py))<br>    })<br>}<br><br>fn get_headers() -&gt; Result&lt;HeaderMap, PyErr&gt; {<br>    Python::with_gil(|py| {<br>        Ok(HEADERS<br>            .get_or_init(|| {<br>                let con = get_con().expect(&quot;Failed to get connection&quot;);<br>                let token: String = con<br>                    .getattr(py, &quot;rest&quot;)<br>                    .expect(&quot;Failed to get &#39;rest&#39;&quot;)<br>                    .getattr(py, &quot;token&quot;)<br>                    .expect(&quot;Failed to get &#39;token&#39;&quot;)<br>                    .extract(py)<br>                    .expect(&quot;Failed to extract token&quot;);<br>                let mut headers = HeaderMap::new();<br>                headers.insert(<br>                    header::AUTHORIZATION,<br>                    HeaderValue::from_str(&amp;format!(&quot;Snowflake Token=\&quot;{}\&quot;&quot;, token))<br>                        .expect(&quot;Failed to create AUTHORIZATION header&quot;),<br>                );<br>                headers.insert(<br>                    header::CONTENT_TYPE,<br>                    HeaderValue::from_static(&quot;application/json&quot;),<br>                );<br>                headers.insert(header::USER_AGENT, HeaderValue::from_static(&quot;Mozilla/5.0&quot;));<br>                headers<br>            })<br>            .clone())<br>    })<br>}<br><br>fn handle_error(message: &amp;str, error: impl std::fmt::Display) -&gt; PyErr {<br>    PyErr::new::&lt;pyo3::exceptions::PyException, _&gt;(format!(&quot;{}: {}&quot;, message, error))<br>}<br><br>fn extract_and_join(json_list: Vec&lt;Value&gt;) -&gt; String {<br>    json_list<br>        .into_iter()<br>        .filter_map(|s| {<br>            s[&quot;choices&quot;][0][&quot;delta&quot;][&quot;content&quot;]<br>                .as_str()<br>                .map(|s| s.to_string())<br>        })<br>        .collect()<br>}<br><br>#[pyfunction]<br>fn complete(model: &amp;str, prompt: &amp;str) -&gt; PyResult&lt;Py&lt;PyString&gt;&gt; {<br>    Python::with_gil(|py| {<br>        let data = serde_json::json!({<br>            &quot;model&quot;: model,<br>            &quot;messages&quot;: [{&quot;content&quot;: prompt}],<br>        });<br>        let con = get_con()?;<br>        let host: String = con.getattr(py, &quot;host&quot;)?.extract(py)?;<br>        let url = format!(&quot;https://{}{}&quot;, host, &quot;/api/v2/cortex/inference:complete&quot;);<br>        let headers = get_headers()?;<br>        let client = Client::new();<br><br>        let response_text = client<br>            .post(url)<br>            .headers(headers)<br>            .json(&amp;data)<br>            .send()<br>            .map_err(|e| handle_error(&quot;Request error&quot;, e))?<br>            .text()<br>            .map_err(|e| handle_error(&quot;Response error&quot;, e))?;<br><br>        let json_list: Vec&lt;Value&gt; = response_text<br>            .lines()<br>            .filter_map(|line| line.trim().strip_prefix(&quot;data: &quot;))<br>            .filter_map(|line| serde_json::from_str::&lt;Value&gt;(line).ok())<br>            .collect();<br><br>        let answer = extract_and_join(json_list);<br><br>        Ok(PyString::new(py, answer.trim()).into())<br>    })<br>}<br><br>#[pymodule]<br>fn cortex(m: &amp;Bound&lt;&#39;_, PyModule&gt;) -&gt; PyResult&lt;()&gt; {<br>    m.add_function(wrap_pyfunction!(complete, m)?)?;<br>    Ok(())<br>}</pre><h4>Using it</h4><p>After building and installing the module into our virtual environment with maturin develop, we can import it and try it out.</p><pre>import cortex<br>cortex.complete(&quot;mistral-large2&quot;, &quot;Where do people typically publish technical articles?&quot;)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/747/1*nQhyfxjo7hvkYDKl7PBb_Q.png" /><figcaption>Using the Rust module</figcaption></figure><h4>Comparison</h4><p>I did a minimal benchmark using the Python requests library and the custom Rust code and found similar results. 
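</p><p>Both versions hinge on the same parsing step: the Cortex REST API streams lines prefixed with "data: ", and each fragment’s choices[0].delta.content must be extracted and joined. Here is a self-contained sketch of just that logic; the function name and sample payload below are my own, made up for illustration:</p>

```python
import json

def parse_stream(text: str) -> str:
    """Join the content fragments from a 'data: '-prefixed streaming response."""
    parts = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("data: "):
            continue  # skip blank lines and non-data noise
        obj = json.loads(line[len("data: "):])
        content = obj["choices"][0]["delta"].get("content")
        if content is not None:
            parts.append(content)
    return "".join(parts).strip()

# Made-up sample payload mimicking the shape of the streaming response.
sample = (
    'data: {"choices": [{"delta": {"content": "Hello"}}]}\n'
    'data: {"choices": [{"delta": {"content": ", world"}}]}\n'
)
print(parse_stream(sample))  # Hello, world
```

<p>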
The context window and token size would likely make a bigger difference than the implementation language.</p><p>Here’s a Python function that performs the same request and parses the results.</p><pre>import json<br>import requests<br>import snowflake.connector<br><br>def python_complete(url: str, data: dict, headers: dict) -&gt; str:<br>    r = requests.post(url, json=data, headers=headers)<br>    return &quot;&quot;.join(<br>        [<br>            json.loads(obj.strip()).get(&quot;choices&quot;)[0].get(&quot;delta&quot;).get(&quot;content&quot;, &quot;&quot;)<br>            for obj in r.text.split(&quot;data: &quot;)<br>            if obj.strip()<br>        ]<br>    ).strip()<br><br><br>con = snowflake.connector.connect()<br>url = f&quot;https://{con.host}/api/v2/cortex/inference:complete&quot;<br>data = {<br>    &quot;model&quot;: &quot;mistral-large2&quot;,<br>    &quot;messages&quot;: [{&quot;content&quot;: &quot;Where do people typically publish technical articles?&quot;}],<br>}<br>headers = {<br>    &quot;Authorization&quot;: f&#39;Snowflake Token=&quot;{con.rest.token}&quot;&#39;,<br>    &quot;Content-Type&quot;: &quot;application/json&quot;,<br>    &quot;Accept&quot;: &quot;application/json&quot;,<br>}<br><br>print(python_complete(url, data, headers))</pre><h4>Conclusion</h4><p>I enjoy working on these projects, and there is plenty of opportunity to swap certain operations into Rust. I hope to rewrite this without using pyo3 so that it’s even lighter (and maybe quicker).</p><p>If you’ve been using Rust and want to try this, I would love to hear from you!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=f897e160f621" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Casting bools with Polars using Rust]]></title>
            <link>https://medium.com/learning-the-computers/casting-bools-with-polars-using-rust-f1ec95b43a4b?source=rss-4c938695f2e2------2</link>
            <guid isPermaLink="false">https://medium.com/p/f1ec95b43a4b</guid>
            <category><![CDATA[polar]]></category>
            <category><![CDATA[rust]]></category>
            <dc:creator><![CDATA[Tyler White]]></dc:creator>
            <pubDate>Tue, 14 Jan 2025 22:13:30 GMT</pubDate>
            <atom:updated>2025-01-14T22:13:30.711Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/509/1*P3xPKANuAADYZ3OGr9Kn0A.png" /></figure><p>I’ve wanted to explore Rust for a while and figured, “What better way to experience it than using a dataframe API I am already familiar with?” With Rust’s built-in Cargo toolchain managing our environment, formatting, and testing, tools like <a href="https://docs.pytest.org/en/stable/"><strong>pytest</strong></a>, <a href="https://docs.astral.sh/uv/"><strong>uv</strong></a>, and <a href="https://docs.astral.sh/ruff/"><strong>ruff</strong></a> will no longer be necessary.</p><p>It would be neat to get <a href="https://crates.io/crates/polars"><strong>Polars</strong></a> up and running using Rust and explore the behavior of boolean columns. This approach lets us stay in the shallow end: we’re not yet using LazyFrames or DataFrames, and we’ll be sticking with Polars’ Series.</p><h4>Installing Rust</h4><p>To get started, we’ll need Rust installed on our preferred system. I’m using macOS, but it’s a reasonably similar operation if you’re using Linux or Windows.</p><p><a href="https://www.rust-lang.org/tools/install">Install Rust</a></p><h4>Cargo</h4><p>We’re ready to create our first project! To get started, we will primarily work with our shell of preference (I am using zsh).</p><pre>mkdir ~/Desktop/bools-polars-rust<br>cd ~/Desktop/bools-polars-rust<br>cargo init</pre><p>If we open this directory using an IDE, we’ll see that we now have a <strong>src</strong> folder, a preconfigured <strong>.gitignore</strong> file, a gitignored <strong>target</strong> folder, and two additional files I’ll explain: <strong>Cargo.lock</strong> and <strong>Cargo.toml</strong>.</p><ul><li><strong>Cargo.toml</strong> is a manifest file that describes your dependencies</li><li><strong>Cargo.lock</strong> contains exact information about your dependencies. 
Cargo maintains and edits it automatically.</li></ul><p>uv follows a similar pattern with its <strong>pyproject.toml</strong> and <strong>uv.lock</strong> files. We will add our first dependency, known in Rust as a crate, with cargo add.</p><blockquote>I recommend configuring appropriate extensions for Rust using your IDE for a better experience.</blockquote><pre>cargo add polars</pre><p>This command updates the crates.io index and adds the newest Polars version to your project dependencies. In a future article, we may explore various crate features, but the core Polars crate should suffice for now.</p><p>Let’s jump into that <strong>src</strong> folder and open up <strong>main.rs</strong>. We can greet the world to ensure things work fine, but we will edit this file shortly.</p><pre>cargo run</pre><p>Compiling may take a moment, but once it is complete, you should see “Hello, world!” printed on your console. That’s nice, but we want to see a Polars boolean Series!</p><h4>Editing the file</h4><p>Let’s edit the file now. Similar to Python imports, we’ll need to make Polars available. At the top of the <strong>main.rs</strong> file, add the following:</p><pre>use polars::prelude::*;</pre><p>The prelude module in a Rust crate typically contains the most commonly used items from that crate. By including prelude::*, you’re importing all the standard and essential functionality from Polars. We also get the traits, functions, and types required for typical operations like creating dataframes, working with series, and performing computations.</p><h4>Making a series</h4><p>We will create and print a simple five-row Polars Series on the console. We can also remove the unneeded “Hello, world!” print. 
The entirety of our script should look like this:</p><pre>use polars::prelude::*;<br><br>fn main() -&gt; Result&lt;(), PolarsError&gt; {<br>    let vals: &amp;[bool;5] = &amp;[true, false, true, false, true];<br>    let bool_ser: Series = Series::new(&quot;bool_ser&quot;.into(), vals);<br>    println!(&quot;{:?}&quot;, bool_ser);<br><br>    Ok(())<br>}</pre><p>When we run this with cargo run we now see a different response:</p><pre>shape: (5,)<br>Series: &#39;bool_ser&#39; [bool]<br>[<br>        true<br>        false<br>        true<br>        false<br>        true<br>]</pre><p>If we were to perform a sum operation over this series, we would get three, as I am assuming that true = 1 and false = 0. Let’s add the following snippet to our main function.</p><pre>println!(&quot;{:?}&quot;, bool_ser.sum::&lt;i8&gt;());</pre><p>We did need to cast this to an 8-bit signed integer, but when we run it, it prints “Ok(3)” below the series. The print works as expected! What if we wanted to compute the minimum and maximum values from bool_ser where we expect 0 and 1, respectively? Add the following two lines to your <strong>src/main.rs</strong> file:</p><pre>println!(&quot;{:?}&quot;, bool_ser.min::&lt;i8&gt;());<br>println!(&quot;{:?}&quot;, bool_ser.max::&lt;i8&gt;());</pre><p>When we run this, we get the following larger output:</p><pre>shape: (5,)<br>Series: &#39;bool_ser&#39; [bool]<br>[<br>        true<br>        false<br>        true<br>        false<br>        true<br>]<br>Ok(3)<br>Ok(Some(0))<br>Ok(Some(1))</pre><p>We can view the final code structure with the following bash command.</p><pre>❯ tree -I target<br>.<br>├── Cargo.lock<br>├── Cargo.toml<br>└── src<br>    └── main.rs</pre><h3>Conclusion</h3><p>While this only ended up being a few lines of code, I was excited to get Rust set up and running with Polars. 
The whole process was pretty straightforward.</p><p>I&#39;m looking forward to experimenting with reading CSV and Parquet files and exploring more DataFrame operations. I hope you give it a try as well. I&#39;m eager to learn about more advanced applications for data engineering using Rust with Polars.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=f1ec95b43a4b" width="1" height="1" alt=""><hr><p><a href="https://medium.com/learning-the-computers/casting-bools-with-polars-using-rust-f1ec95b43a4b">Casting bools with Polars using Rust</a> was originally published in <a href="https://medium.com/learning-the-computers">Learning The Computers</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Using DataFusion with Minio for (sort of remote) reads and writes]]></title>
            <link>https://medium.com/learning-the-computers/using-datafusion-with-minio-for-sort-of-remote-reads-and-writes-8f68b423f620?source=rss-4c938695f2e2------2</link>
            <guid isPermaLink="false">https://medium.com/p/8f68b423f620</guid>
            <category><![CDATA[data-fusion]]></category>
            <category><![CDATA[minio]]></category>
            <category><![CDATA[python]]></category>
            <dc:creator><![CDATA[Tyler White]]></dc:creator>
            <pubDate>Fri, 10 Jan 2025 17:40:53 GMT</pubDate>
            <atom:updated>2025-01-10T17:40:53.316Z</atom:updated>
            <content:encoded><![CDATA[<h4>Using Python and Docker for single-node data engineering efforts.</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/597/1*iauqqzAKQUtVIYg_LYj08g.png" /></figure><p>There are many new “hot” single-node engines out there, and I don’t like to pick favorites. We have DataFusion, DuckDB, pandas, and Polars, to name a few. We’re also reaching an era where the default answer is to throw everything into cloud storage systems such as Amazon S3.</p><p>Let’s combine the best of both worlds and use Docker to host a local Minio container to emulate cloud object storage, with DataFusion/Ibis reading from and writing to this container.</p><h3>Setting things up</h3><h4>uv</h4><p>I’ve been using <a href="https://github.com/astral-sh/uv"><strong>uv</strong></a> for nearly all of my projects recently, and this will be no exception. We’ll set everything up using the following commands:</p><pre>uv venv --python 3.12<br>source .venv/bin/activate<br>uv add datafusion &quot;ibis-framework[datafusion] @ git+https://github.com/ibis-project/ibis.git&quot;<br>uv add --dev python-dotenv</pre><blockquote>I’m installing Ibis from source here to ensure I have the latest features. 😃</blockquote><h4>Docker/Container Management Tool</h4><p>We must set up a quick compose file to configure Minio to run on our machine. I suggest you modify these values and target a specific image in production scenarios. 
Create a file named <strong>docker-compose.yml</strong> and populate it with the following:</p><pre>services:<br>  minio:<br>    image: quay.io/minio/minio<br>    ports:<br>      - &quot;9000:9000&quot;<br>      - &quot;9001:9001&quot;<br>    volumes:<br>      - ${HOME}/minio/data:/data<br>    command: server /data --console-address &quot;:9001&quot;<br>    depends_on:<br>      - create-data-dir<br><br>  create-data-dir:<br>    image: busybox<br>    command: mkdir -p /data<br>    volumes:<br>      - ${HOME}/minio/data:/data</pre><p>Then, using your favorite shell, execute docker compose up -d. To complete this step, you must have <a href="https://www.docker.com/">Docker</a> (or a similar tool, such as <a href="https://podman.io/">Podman</a>) installed for your operating system.</p><p>This will allow you to connect to the console endpoint at <a href="http://localhost:9001">http://localhost:9001</a>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*WWMMfqw6679LG0g_hrYeDA.png" /><figcaption>Login screen for Minio console.</figcaption></figure><p>Sign in with the default credentials:</p><ul><li>username: <strong>minioadmin</strong></li><li>password: <strong>minioadmin</strong></li></ul><p>While we’re at it, let’s add this to our project’s <strong>.env</strong> file:</p><pre>AWS_ACCESS_KEY_ID=minioadmin<br>AWS_SECRET_ACCESS_KEY=minioadmin</pre><p>Successfully logging in should pull up an empty object browser page, since we haven’t created a bucket yet. 
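</p><p>If you’d rather not click through the console, the bucket can also be created automatically at startup. As a rough sketch (the create-bucket service name and the sleep are my own additions, not part of the original setup), you could add one more service to the compose file that runs the MinIO client against the default credentials:</p>

```yaml
  # Hypothetical one-shot service: waits briefly for minio, then creates the bucket.
  create-bucket:
    image: quay.io/minio/mc
    depends_on:
      - minio
    entrypoint: >
      /bin/sh -c "
      sleep 3 &&
      mc alias set local http://minio:9000 minioadmin minioadmin &&
      mc mb --ignore-existing local/nyc-tlc
      "
```

<p>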
Navigate to “Buckets” and select “Create bucket.”</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*WhsVRQ06Vqqn_QsfkxTpmw.png" /><figcaption>The “Create Bucket” button.</figcaption></figure><p>Name your bucket <strong>nyc-tlc</strong> and add one of the yellow_tripdata Parquet files to a local directory named “data” in your repo.</p><pre>mkdir data<br>curl https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet -o data/yellow_tripdata_2023-01.parquet</pre><p>Then, upload this file using the UI to the nyc-tlc bucket.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*_y3OXB3z0_IsAg3uW7-m8Q.png" /><figcaption>The Parquet file in the Minio Object Browser.</figcaption></figure><h4>DataFusion</h4><p>Let’s create a Python file named <strong>aggregate_tripdata.py</strong> and populate it with the following code:</p><pre>import os<br><br>import datafusion<br>from dotenv import load_dotenv<br>from ibis.interactive import *<br><br>load_dotenv(override=True)<br><br>ctx = datafusion.SessionContext()<br><br>s3 = datafusion.object_store.AmazonS3(<br>    bucket_name=&quot;nyc-tlc&quot;,<br>    access_key_id=os.environ.get(&quot;AWS_ACCESS_KEY_ID&quot;),<br>    secret_access_key=os.environ.get(&quot;AWS_SECRET_ACCESS_KEY&quot;),<br>    endpoint=&quot;http://localhost:9000&quot;,<br>    allow_http=True,<br>)<br><br>ctx.register_object_store(&quot;s3://nyc-tlc/&quot;, s3)<br>ctx.register_parquet(&quot;yellow_tripdata&quot;, &quot;s3://nyc-tlc/yellow_tripdata_2023-01.parquet&quot;)</pre><p>We haven’t yet read or written any data; we’ve only configured our SessionContext to communicate with Minio and registered the Parquet file as a table.</p><p>Let’s jump into an IPython shell to complete the rest of the journey.</p><pre>uv add --dev ipython<br>source .venv/bin/activate</pre><p>Now, starting the IPython shell, we can create an Ibis backend and query our table.</p><pre>In [1]: %run aggregate_tripdata.py<br>In [2]: con = 
ibis.datafusion.from_connection(ctx)<br>In [3]: t = con.table(&quot;yellow_tripdata&quot;)<br>In [4]: t<br>Out[4]:<br>┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━┓<br>┃ VendorID ┃ tpep_pickup_datetime ┃ tpep_dropoff_datetime ┃ passenger_count ┃ … ┃<br>┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━┩<br>│ int64    │ timestamp(6)         │ timestamp(6)          │ float64         │ … │<br>├──────────┼──────────────────────┼───────────────────────┼─────────────────┼───┤<br>│        2 │ 2023-01-01 00:32:10  │ 2023-01-01 00:40:36   │             1.0 │ … │<br>│        2 │ 2023-01-01 00:55:08  │ 2023-01-01 01:01:27   │             1.0 │ … │<br>│        2 │ 2023-01-01 00:25:04  │ 2023-01-01 00:37:49   │             1.0 │ … │<br>│        1 │ 2023-01-01 00:03:48  │ 2023-01-01 00:13:25   │             0.0 │ … │<br>│        2 │ 2023-01-01 00:10:29  │ 2023-01-01 00:21:19   │             1.0 │ … │<br>│        2 │ 2023-01-01 00:50:34  │ 2023-01-01 01:02:52   │             1.0 │ … │<br>│        2 │ 2023-01-01 00:09:22  │ 2023-01-01 00:19:49   │             1.0 │ … │<br>│        2 │ 2023-01-01 00:27:12  │ 2023-01-01 00:49:56   │             1.0 │ … │<br>│        2 │ 2023-01-01 00:21:44  │ 2023-01-01 00:36:40   │             1.0 │ … │<br>│        2 │ 2023-01-01 00:39:42  │ 2023-01-01 00:50:36   │             1.0 │ … │<br>│        … │ …                    │ …                     │               … │ … │<br>└──────────┴──────────────────────┴───────────────────────┴─────────────────┴───┘</pre><p>We’re reading this from the Parquet file in Minio! Let’s do some aggregation, and we will write the result back into a “processed” folder in our Minio bucket.</p><p>We’ll aggregate the numeric column statistics by <strong>VendorID</strong> and <strong>tpep_pickup_datetime</strong> (cast as a date).</p><pre>In [5]: t_agg = (<br>    ...    
t.mutate(tpep_pickup_date=t.tpep_pickup_datetime.cast(&quot;date&quot;))<br>    ...    .group_by([&quot;VendorID&quot;, &quot;tpep_pickup_date&quot;])<br>    ...    .agg(s.across(s.numeric(), dict(min=_.min(), max=_.max(), mean=_.mean())))<br>)</pre><p>We haven’t executed this yet; we will write it to a table named “t_agg”.</p><pre>In [6]: con.create_table(&quot;t_agg&quot;, t_agg)<br>Out[6]: <br>┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━┓<br>┃ VendorID ┃ tpep_pickup_date ┃ VendorID_min ┃ passenger_count_min ┃ … ┃<br>┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━┩<br>│ int64    │ date             │ int64        │ float64             │ … │<br>├──────────┼──────────────────┼──────────────┼─────────────────────┼───┤<br>│        2 │ 2023-01-08       │            2 │                 0.0 │ … │<br>│        2 │ 2023-01-10       │            2 │                 0.0 │ … │<br>│        2 │ 2023-01-14       │            2 │                 0.0 │ … │<br>│        1 │ 2023-01-18       │            1 │                 0.0 │ … │<br>│        2 │ 2023-01-21       │            2 │                 0.0 │ … │<br>│        2 │ 2023-01-30       │            2 │                 0.0 │ … │<br>│        1 │ 2023-01-04       │            1 │                 0.0 │ … │<br>│        1 │ 2023-01-12       │            1 │                 0.0 │ … │<br>│        1 │ 2023-01-15       │            1 │                 0.0 │ … │<br>│        2 │ 2023-01-20       │            2 │                 0.0 │ … │<br>│        … │ …                │            … │                   … │ … │<br>└──────────┴──────────────────┴──────────────┴─────────────────────┴───┘</pre><p>We can easily write it back to Minio using the existing DataFusion SessionContext.</p><pre>In [7]: target_file_name = &quot;s3://nyc-tlc/processed/t_agg_2023-01.parquet&quot;<br>In [8]: ctx.table(&quot;t_agg&quot;).write_parquet(target_file_name)</pre><figure><img alt="" 
src="https://cdn-images-1.medium.com/max/1024/1*9yoMY36B2hjxmdFu-l1bfg.png" /><figcaption>The aggregated file in the Minio bucket.</figcaption></figure><p>Reading the file again directly is easy with Ibis as well.</p><pre>In [9]: con.read_parquet(&quot;s3://nyc-tlc/processed/t_agg_2023-01.parquet&quot;)<br>Out[9]:<br>┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━┓<br>┃ VendorID ┃ tpep_pickup_date ┃ VendorID_min ┃ passenger_count_min ┃ … ┃<br>┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━┩<br>│ int64    │ date             │ int64        │ float64             │ … │<br>├──────────┼──────────────────┼──────────────┼─────────────────────┼───┤<br>│        1 │ 2023-01-04       │            1 │                 0.0 │ … │<br>│        1 │ 2023-01-12       │            1 │                 0.0 │ … │<br>│        1 │ 2023-01-15       │            1 │                 0.0 │ … │<br>│        2 │ 2023-01-20       │            2 │                 0.0 │ … │<br>│        2 │ 2023-01-24       │            2 │                 0.0 │ … │<br>│        2 │ 2023-02-01       │            2 │                 1.0 │ … │<br>│        2 │ 2023-01-05       │            2 │                 0.0 │ … │<br>│        1 │ 2023-01-07       │            1 │                 0.0 │ … │<br>│        1 │ 2023-01-16       │            1 │                 0.0 │ … │<br>│        1 │ 2023-01-19       │            1 │                 0.0 │ … │<br>│        … │ …                │            … │                   … │ … │<br>└──────────┴──────────────────┴──────────────┴─────────────────────┴───┘</pre><h3>Conclusion</h3><p>This post combined several technologies to read and write data performantly against local object storage. The data processing landscape changes frequently, so it was fun to experiment with an approach like this and sharpen my DataFusion skills. 
It’s always fun to explore different dataframe APIs!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=8f68b423f620" width="1" height="1" alt=""><hr><p><a href="https://medium.com/learning-the-computers/using-datafusion-with-minio-for-sort-of-remote-reads-and-writes-8f68b423f620">Using DataFusion with Minio for (sort of remote) reads and writes</a> was originally published in <a href="https://medium.com/learning-the-computers">Learning The Computers</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Using pytest in GitLab pipelines]]></title>
            <link>https://medium.com/learning-the-computers/using-pytest-in-gitlab-pipelines-dd22854a9f4a?source=rss-4c938695f2e2------2</link>
            <guid isPermaLink="false">https://medium.com/p/dd22854a9f4a</guid>
            <category><![CDATA[gitlab]]></category>
            <category><![CDATA[pytest]]></category>
            <dc:creator><![CDATA[Tyler White]]></dc:creator>
            <pubDate>Tue, 07 Jan 2025 18:43:04 GMT</pubDate>
            <atom:updated>2025-01-08T13:47:15.148Z</atom:updated>
            <content:encoded><![CDATA[<h4>Make sure your code works like it’s supposed to.</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/609/1*COO3jJ0WvcPFCi_IurKgww.png" /></figure><h3>Introduction</h3><p>In a previous article, we set up <strong>pre-commit</strong> to enforce code quality and formatting standards in our GitLab pipeline. However, ensuring code cleanliness is only half the battle. What’s the point of having pretty code if it doesn’t work? We also need to verify our code functions as expected. This is where <a href="https://github.com/pytest-dev/pytest"><strong>pytest</strong></a>, a popular testing framework for Python, comes in.</p><p><a href="https://medium.com/learning-the-computers/using-pre-commit-in-gitlab-pipelines-3d6854968344">Using pre-commit in GitLab pipelines</a></p><p>In this article, we’ll configure and integrate pytest into our existing GitLab codebase and pipeline. We’ll create a new stage in our <strong>.gitlab-ci.yml</strong> file to run our tests, ensuring our code is both clean and functional.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/320/0*slldZJ-hip23npdt.gif" /><figcaption><a href="https://giphy.com/gifs/8xgqLTTgWqHWU">https://giphy.com/gifs/8xgqLTTgWqHWU</a></figcaption></figure><h3>Configuring the environment</h3><p>Before we use pytest, we must ensure it’s installed in our environment. I’ve been using <a href="https://docs.astral.sh/uv/"><strong>uv</strong></a> to manage project dependencies lately, so we can install pytest and specify that it’s a development dependency by executing uv add --dev pytest in our shell. If you don’t have uv, install pytest using pip by executing pip install pytest in your shell.</p><h4>Why pytest?</h4><p>Now that pytest is installed let’s talk about why we used it in the first place. 
One of the key benefits of pytest is its support for <a href="https://docs.pytest.org/en/6.2.x/fixture.html">fixtures</a>, which allow us to efficiently set up and tear down resources needed for our tests. Fixtures are setup functions that provide a fixed baseline so that tests execute reliably and consistently. We can define fixtures that run before and after each test, or even before and after a group of tests. Fixtures are handy when we need to set up and tear down complex resources, such as database connections or file systems, required for our tests to run.</p><p>With pytest, we can define fixtures using the @pytest.fixture decorator, which allows us to specify the scope of the fixture (e.g., function, class, module, etc.) and the code that should be executed to set up and tear down the fixture. The <a href="https://docs.pytest.org/en/6.2.x/reference.html?highlight=parametrize#pytest-mark-parametrize">@pytest.mark.parametrize</a> decorator allows you to run the same test function with different input parameters. This decorator is especially useful when testing a function with multiple inputs or scenarios.</p><p>By leveraging pytest’s fixture functionality, we can write tests that are easier to maintain and extend.</p><h4>Folder layout</h4><p>By default, pytest looks in your current directory and subdirectories for test files and runs any tests it finds. I prefer to keep a folder named “tests” at the root level of the repository to keep these together in a convenient location for all contributors to find.</p><pre>.<br>├── README.md<br>├── our_project<br>│   ├── __init__.py<br>│   └── classification_metrics.py<br>├── pyproject.toml<br>├── tests<br>│   ├── __init__.py<br>│   ├── conftest.py<br>│   └── test_classification_metrics.py<br>└── uv.lock</pre><p>To tell this story, we will create a fictional project that computes two binary classification metrics using <strong>pandas</strong>. We will test that the metric calculations are correct with pytest. 
We will first add pandas as a project dependency by running uv add pandas. This installs pandas and ensures we can write the classification functions we will test against.</p><h4>The functions to test against</h4><p>Functions to compute accuracy and precision scores are easy to recreate and use as examples. We’ll add a file named <strong>classification_metrics.py</strong> in the project folder, and add two functions:</p><pre>import pandas as pd<br><br><br>def accuracy_score(y_true: pd.Series, y_pred: pd.Series) -&gt; float:<br>    &quot;&quot;&quot;Calculate the accuracy of a binary classification model.<br><br>    Parameters<br>    ----------<br>    y_true : pd.Series<br>        The true labels.<br>    y_pred : pd.Series<br>        The predicted labels.<br><br>    Returns<br>    -------<br>    float<br>        The accuracy of the model.<br><br>    &quot;&quot;&quot;<br>    return (y_true == y_pred).mean()<br><br><br>def precision_score(y_true: pd.Series, y_pred: pd.Series) -&gt; float:<br>    &quot;&quot;&quot;Calculate the precision of a binary classification model.<br><br>    Parameters<br>    ----------<br>    y_true : pd.Series<br>        The true labels.<br>    y_pred : pd.Series<br>        The predicted labels.<br><br>    Returns<br>    -------<br>    float<br>        The precision of the model.<br><br>    &quot;&quot;&quot;<br>    tp = ((y_true == 1) &amp; (y_pred == 1)).sum()<br>    fp = ((y_true == 0) &amp; (y_pred == 1)).sum()<br>    return tp / (tp + fp)</pre><blockquote>The docstrings are a great help for future users in understanding how these functions can be used; usage examples would be a welcome addition as well.</blockquote><p>Both functions accept the same arguments, so they will be easy to exercise together. 
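To make the arithmetic concrete, here is a quick hand-worked example on a few made-up labels (a throwaway sketch; the values and variable names are illustrative only):

```python
import pandas as pd

# Made-up labels, chosen so the arithmetic is easy to follow by hand.
y_true = pd.Series([1, 0, 1, 1])
y_pred = pd.Series([1, 1, 1, 0])

# Accuracy: matching labels / total labels -> 2 of 4 match, so 0.5.
accuracy = (y_true == y_pred).mean()

# Precision: tp / (tp + fp) -> 2 true positives, 1 false positive, so 2/3.
tp = ((y_true == 1) & (y_pred == 1)).sum()
fp = ((y_true == 0) & (y_pred == 1)).sum()
precision = tp / (tp + fp)

print(accuracy)   # 0.5
print(precision)  # 0.666...
```

scikit-learn’s accuracy_score and precision_score report these same values for these inputs, which is exactly the equivalence our tests will check.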
This is a great use case for a pytest fixture to share a dataset across tests, and for a pytest marker to parametrize a single test function to cover both metrics.</p><h4>Creating a test fixture</h4><p>In this project, we will likely want a dataframe readily available to test against. A shared fixture ensures code is tested and evaluated against the same dataset. This fixture will be defined in a file called <strong>conftest.py</strong>, located inside our <strong>tests</strong> folder:</p><pre>import pandas as pd<br>import pytest<br><br><br>@pytest.fixture(scope=&quot;session&quot;)<br>def predictions() -&gt; pd.DataFrame:<br>    d = {<br>         &quot;id&quot;: range(1, 13),<br>         &quot;actual&quot;: [1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1],<br>         &quot;prediction&quot;: [1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1],<br>    }<br>    yield pd.DataFrame(d)</pre><p>This allows us to pass predictions as an argument to any test function that needs to reference this dataframe. Let’s put together our test function now.</p><h4>Creating the test function</h4><p>We can test both our accuracy and precision metric functions with a single test function.</p><p>I want to ensure our code works against scikit-learn’s metrics with the help of parametrize decorators. 
We will add that as an additional development dependency by running uv add --dev scikit-learn.</p><p>We can create a new file named <strong>test_classification_metrics.py</strong> inside our <strong>tests</strong> folder:</p><pre>import pytest<br>import sklearn.metrics<br><br>import our_project.classification_metrics<br><br><br>@pytest.mark.parametrize(<br>    &quot;metric_name&quot;,<br>    [<br>        pytest.param(&quot;accuracy_score&quot;, id=&quot;accuracy_score&quot;),<br>        pytest.param(&quot;precision_score&quot;, id=&quot;precision_score&quot;),<br>    ],<br>)<br>def test_classification_metrics(predictions, metric_name):<br>    our_project_func = getattr(our_project.classification_metrics, metric_name)<br>    sklearn_func = getattr(sklearn.metrics, metric_name)<br>    result = our_project_func(predictions[&quot;actual&quot;], predictions[&quot;prediction&quot;])<br>    expected = sklearn_func(predictions[&quot;actual&quot;], predictions[&quot;prediction&quot;])<br>    assert result == pytest.approx(expected, abs=1e-4)</pre><p>We can kick off our tests by running pytest in our shell. We should see two green dots indicating the tests have passed and our functions are working as intended. This is a convenient way to test both our accuracy_score and precision_score functions against scikit-learn’s equivalent functions.</p><h3>Integrating this to run automatically with GitLab pipelines</h3><p>This is excellent progress, but we’re not out of the woods yet. We can’t always count on individual contributors to test code locally before submitting a merge request, and even if we could, it’s nice as a maintainer to ensure any merge requests pass tests during review. We explored automating pre-commit with our <strong>.gitlab-ci.yml</strong> file. 
We’ll add a test stage to this file now to run pytest on any commits on the main branch or when any merge request is opened in addition to pre-commit so that the complete file looks like this:</p><pre>variables:<br>  UV_VERSION: 0.5<br>  PYTHON_VERSION: 3.12<br>  BASE_LAYER: bookworm-slim<br>  UV_CACHE_DIR: .uv-cache<br>  UV_SYSTEM_PYTHON: 1<br>stages:<br>  - build<br>  - test<br>pre-commit:<br>  stage: build<br>  image: python:3.11<br>  script:<br>    - pip install pre-commit<br>    - pre-commit run --all-files<br>  rules:<br>    - if: $CI_PIPELINE_SOURCE == &quot;merge_request_event&quot;<br>    - if: $CI_COMMIT_BRANCH == &quot;main&quot;<br>pytest:<br>  stage: test<br>  image: ghcr.io/astral-sh/uv:$UV_VERSION-python$PYTHON_VERSION-$BASE_LAYER<br>  cache:<br>    - key:<br>        files:<br>          - uv.lock<br>      paths:<br>        - $UV_CACHE_DIR<br>  script:<br>    - uv sync --all-extras<br>    - uv run pytest<br>  rules:<br>    - if: $CI_PIPELINE_SOURCE == &quot;merge_request_event&quot;<br>    - if: $CI_COMMIT_BRANCH == &quot;main&quot;</pre><blockquote>We’re using uv to run pytest and sync our environment dependencies.</blockquote><p>GitLab will run our tests if the build stage (pre-commit) succeeds.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/932/1*ohT1uWRBejq_xZVqYQWMcQ.png" /><figcaption>Successful merge pipeline with three stages, including a compliance stage.</figcaption></figure><h3>Conclusion</h3><p>I have done this sort of thing with GitHub plenty of times, so it’s been a bit of a challenge to adjust to GitLab, but it’s been fun to learn how to get this all working. 
I am eager to improve this process and hope to continue sharing what I learn.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=dd22854a9f4a" width="1" height="1" alt=""><hr><p><a href="https://medium.com/learning-the-computers/using-pytest-in-gitlab-pipelines-dd22854a9f4a">Using pytest in GitLab pipelines</a> was originally published in <a href="https://medium.com/learning-the-computers">Learning The Computers</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Using pre-commit in GitLab pipelines]]></title>
            <link>https://medium.com/learning-the-computers/using-pre-commit-in-gitlab-pipelines-3d6854968344?source=rss-4c938695f2e2------2</link>
            <guid isPermaLink="false">https://medium.com/p/3d6854968344</guid>
            <category><![CDATA[pre-commit]]></category>
            <category><![CDATA[gitlab]]></category>
            <dc:creator><![CDATA[Tyler White]]></dc:creator>
            <pubDate>Fri, 03 Jan 2025 22:20:22 GMT</pubDate>
            <atom:updated>2025-01-08T13:46:48.263Z</atom:updated>
            <content:encoded><![CDATA[<h4>Catch code quality errors before they happen.</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/499/1*hILBf2vvRdHSeG7DxtU7uw.png" /></figure><p>As developers, we’ve all been there — you’ve spent hours working on a feature, only to have it break when you merge it into the main branch. Or, you’ve pushed code to production only to realize it’s riddled with errors and typos. <a href="https://pre-commit.com/">pre-commit</a> is very convenient for enforcing high-quality commits and keeping the codebase tidy. I’ve been using it for a while now, and I appreciate how easy it is to set up and use with my projects. Using it with GitHub Actions is a breeze. However, we use GitLab for various projects at work.</p><p>In this article, we’ll explore how to automate pre-commit checks in GitLab <a href="https://docs.gitlab.com/ee/ci/pipelines/">pipelines</a>. We enforce quality and formatting standards with pre-commit <a href="https://pre-commit.com/hooks.html">hooks</a>.</p><h3>What are pre-commit hooks?</h3><p>Pre-commit hooks are scripts that automatically execute before you commit code to your repository. They help enforce coding standards and check for syntax errors. These hooks run every time a commit is made, providing a thorough check. By performing these checks before the code is committed, you can identify and resolve issues early, preventing them from entering your remote branch.</p><p>It’s never fun to push your exciting feature only to have the first comment on your merge request say something like, “Please format the code.”</p><h3>How do pre-commit hooks work in GitLab pipelines?</h3><p>In GitLab, you can use pre-commit hooks as part of your CI/CD pipeline. When you push code to your repository, GitLab will run the pre-commit hook before the code is committed. 
If a hook fails, the pipeline fails, and you’ll receive an error message indicating what went wrong.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/643/1*dQeumPnEaRA5AeSt5n8Qqw.png" /><figcaption>A pipeline error in GitLab where pre-commit failed.</figcaption></figure><h3>How to set up pre-commit hooks in GitLab pipelines</h3><p>Setting up pre-commit hooks in GitLab pipelines is relatively straightforward. Here are the steps:</p><h4>Create a .pre-commit-config.yaml file</h4><p>In the root of your repository, create a new file called <strong>.pre-commit-config.yaml</strong>. This file lists the hooks that will run against your code.</p><h4>Define your hooks</h4><p>In the <strong>.pre-commit-config.yaml</strong> file, define the hooks you want to run. For example, run a linter to check for syntax errors or a formatter to enforce coding standards.</p><p>Here’s an example of what your <strong>.pre-commit-config.yaml</strong> file might look like:</p><pre>repos:<br>  - repo: https://github.com/pre-commit/pre-commit-hooks<br>    rev: v5.0.0<br>    hooks:<br>      - id: end-of-file-fixer<br>      - id: trailing-whitespace<br>  - repo: https://github.com/astral-sh/ruff-pre-commit<br>    rev: v0.8.5<br>    hooks:<br>      - id: ruff<br>        args: [--fix]<br>      - id: ruff-format<br>  - repo: https://github.com/codespell-project/codespell<br>    rev: v2.3.0<br>    hooks:<br>      - id: codespell<br>  - repo: https://github.com/google/yamlfmt<br>    rev: v0.14.0<br>    hooks:<br>      - id: yamlfmt</pre><p>These hooks fix missing end-of-file newlines, strip trailing whitespace, enforce ruff linting and formatting standards (which can be tuned further with <a href="https://docs.astral.sh/ruff/configuration/">ruff configuration</a> settings), flag our typos, and format our YAML files.</p><h4>Configure your pipeline</h4><p>To combine all of this to run automatically, we will need to set up a file so GitLab will know what to 
do. In your <strong>.gitlab-ci.yml</strong> configure your pipeline to run the pre-commit hook before the code is committed.</p><p>And here’s an example of what your <strong>.gitlab-ci.yml</strong> file might look like:</p><pre>stages:<br>  - build<br>pre-commit:<br>  stage: build<br>  image: python:3.11<br>  script:<br>    - pip install pre-commit<br>    - pre-commit run --all-files<br>  rules:<br>    - if: $CI_PIPELINE_SOURCE == &quot;merge_request_event&quot;<br>    - if: $CI_COMMIT_BRANCH == &quot;main&quot;</pre><p>This will use Python to execute pre-commit anytime a commit occurs on the main branch, a merge request is created, or a commit is pushed to said merge request. We put this in the “build” stage, but GitLab has a few stages available out of the box. You might want to use the “test” or “deploy” stage for other actions.</p><h3>Conclusion</h3><p>It wasn’t too difficult to adjust from GitHub, but I’m sure there are ways this workflow could be improved, possibly with caching. As I continue to work with GitLab, I look forward to sharing what I have learned.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=3d6854968344" width="1" height="1" alt=""><hr><p><a href="https://medium.com/learning-the-computers/using-pre-commit-in-gitlab-pipelines-3d6854968344">Using pre-commit in GitLab pipelines</a> was originally published in <a href="https://medium.com/learning-the-computers">Learning The Computers</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[PyIceberg — Trying out the SQLite Catalog]]></title>
            <link>https://medium.com/learning-the-computers/pyiceberg-trying-out-the-sqlite-catalog-d7ace2a4ca5f?source=rss-4c938695f2e2------2</link>
            <guid isPermaLink="false">https://medium.com/p/d7ace2a4ca5f</guid>
            <category><![CDATA[iceberg-table]]></category>
            <category><![CDATA[python]]></category>
            <category><![CDATA[duckdb]]></category>
            <category><![CDATA[pandas]]></category>
            <dc:creator><![CDATA[Tyler White]]></dc:creator>
            <pubDate>Mon, 23 Dec 2024 20:29:45 GMT</pubDate>
            <atom:updated>2024-12-24T12:28:55.643Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*wvwKDdB8ePmuh-fj47haDQ.png" /></figure><h3>PyIceberg — Trying out the SQLite Catalog</h3><p>Configuring a development environment, reading and writing to an SQL catalog, and exploring snapshots. In this article, we’ll work with PyIceberg, Ibis, DuckDB, and PyArrow.</p><p>I’ve been following the PyIceberg project for a while now. This video inspired me to give it a try!</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FePId2izONVo%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DePId2izONVo&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FePId2izONVo%2Fhqdefault.jpg&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/d12f455d00a84d6537e8dd8b124af439/href">https://medium.com/media/d12f455d00a84d6537e8dd8b124af439/href</a></iframe><p>In this article, we’ll create and configure a local catalog, load the starwars dataset, create an Iceberg table, populate it, and then query it with the PyIceberg API. You’ll need both Ibis and PyIceberg installed with extras to follow along. These dependencies can be installed using the following command within your virtual environment:</p><pre>pip install &quot;ibis-framework[duckdb,examples]&quot; &quot;pyiceberg[sql-sqlite]&quot;</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/910/1*GZuUpBA8J-IXRLSJ-dfIxQ.png" /></figure><h3>Configuring a local catalog</h3><p>PyIceberg supports loading a configuration file named <strong>.pyiceberg.yaml</strong> that helps to manage credentials. 
By default, this file is searched for in the home directory (~/), but an alternate location can be provided using the PYICEBERG_HOME environment variable.</p><p>The file I’m using contains the following content:</p><pre>catalog:<br>  default:<br>    type: sql<br>    uri: sqlite:////tmp/warehouse/pyiceberg_catalog.db<br>    warehouse: file:///tmp/warehouse</pre><p>This will require creating the /tmp/warehouse directory, but that’s all we’ll need to get started.</p><pre>mkdir /tmp/warehouse</pre><blockquote>The SQLite catalog is intended for exploratory or development purposes.</blockquote><p>We’ll load our catalog in a later section; this step is in preparation.</p><h3>Loading the starwars data</h3><p>Ibis ships with an examples module that includes a starwars dataset for experimentation. As the default Ibis backend is DuckDB, we’ll use that transitively. DuckDB will read the Parquet file, and we’ll use Ibis to convert this to a PyArrow table.</p><pre>import ibis<br><br>starwars = ibis.examples.starwars.fetch().to_pyarrow()<br>starwars</pre><pre>pyarrow.Table<br>name: string<br>height: int64<br>mass: double<br>hair_color: string<br>skin_color: string<br>eye_color: string<br>birth_year: double<br>sex: string<br>gender: string<br>homeworld: string<br>species: string<br>films: string<br>vehicles: string<br>starships: string<br>----<br>name: [[&quot;Luke Skywalker&quot;,&quot;C-3PO&quot;,&quot;R2-D2&quot;,&quot;Darth Vader&quot;,&quot;Leia Organa&quot;,...,&quot;Finn&quot;,&quot;Rey&quot;,&quot;Poe Dameron&quot;,&quot;BB8&quot;,&quot;Captain Phasma&quot;]]<br>height: [[172,167,96,202,150,...,null,null,null,null,null]]<br>mass: [[77,75,32,136,49,...,null,null,null,null,null]]<br>hair_color: [[&quot;blond&quot;,null,null,&quot;none&quot;,&quot;brown&quot;,...,&quot;black&quot;,&quot;brown&quot;,&quot;brown&quot;,&quot;none&quot;,&quot;none&quot;]]<br>skin_color: [[&quot;fair&quot;,&quot;gold&quot;,&quot;white, 
blue&quot;,&quot;white&quot;,&quot;light&quot;,...,&quot;dark&quot;,&quot;light&quot;,&quot;light&quot;,&quot;none&quot;,&quot;none&quot;]]<br>eye_color: [[&quot;blue&quot;,&quot;yellow&quot;,&quot;red&quot;,&quot;yellow&quot;,&quot;brown&quot;,...,&quot;dark&quot;,&quot;hazel&quot;,&quot;brown&quot;,&quot;black&quot;,&quot;unknown&quot;]]<br>birth_year: [[19,112,33,41.9,19,...,null,null,null,null,null]]<br>sex: [[&quot;male&quot;,&quot;none&quot;,&quot;none&quot;,&quot;male&quot;,&quot;female&quot;,...,&quot;male&quot;,&quot;female&quot;,&quot;male&quot;,&quot;none&quot;,&quot;female&quot;]]<br>gender: [[&quot;masculine&quot;,&quot;masculine&quot;,&quot;masculine&quot;,&quot;masculine&quot;,&quot;feminine&quot;,...,&quot;masculine&quot;,&quot;feminine&quot;,&quot;masculine&quot;,&quot;masculine&quot;,&quot;feminine&quot;]]<br>homeworld: [[&quot;Tatooine&quot;,&quot;Tatooine&quot;,&quot;Naboo&quot;,&quot;Tatooine&quot;,&quot;Alderaan&quot;,...,null,null,null,null,null]]<br>...</pre><h3>Connecting to the catalog and creating the table</h3><p>Now that we’ve got our PyArrow Table after querying a Parquet file behind the scenes with DuckDB, we can create our Iceberg table using the schema.</p><p>First, we’ll need to load the catalog we created in a previous step.</p><pre>from pyiceberg.catalog import load_catalog<br><br>catalog = load_catalog(&quot;default&quot;)</pre><p>As this is a new catalog, we’ll need to create a namespace before we can create our table.</p><pre>catalog.create_namespace_if_not_exists(&quot;default&quot;)</pre><p>Now that we have a default catalog, we can create a table to align with the schema of the starwars PyArrow table.</p><pre>catalog.create_table_if_not_exists(&quot;default.starwars&quot;, starwars.schema)</pre><pre>starwars(<br>  1: name: optional string,<br>  2: height: optional long,<br>  3: mass: optional double,<br>  4: hair_color: optional string,<br>  5: skin_color: optional string,<br>  6: eye_color: optional string,<br>  7: 
birth_year: optional double,<br>  8: sex: optional string,<br>  9: gender: optional string,<br>  10: homeworld: optional string,<br>  11: species: optional string,<br>  12: films: optional string,<br>  13: vehicles: optional string,<br>  14: starships: optional string<br>),<br>partition by: [],<br>sort order: [],<br>snapshot: null</pre><p>We have our table here, but it’s empty. We can get an instance of the table with load_table.</p><pre>table = catalog.load_table(&quot;default.starwars&quot;)</pre><p>We can create a scan of the table and bring it to pandas to see that we have an Empty DataFrame.</p><pre>table.scan().to_pandas()</pre><pre>Empty DataFrame<br>Columns: [name, height, mass, hair_color, skin_color, eye_color, birth_year, sex, gender, homeworld, species, films, vehicles, starships]<br>Index: []</pre><p>Let’s fix this by appending the PyArrow table we previously loaded.</p><pre>table.append(starwars)</pre><p>It’s loaded up!</p><h3>Deleting the Droids!</h3><p>We can delete rows from the table by using a filter in the delete method. This is demonstrative only so that we can experiment with the snapshot IDs in a later step. This also serves as an introduction to the PyIceberg expression API.</p><p>First, let’s see how many rows we have before performing the delete operation.</p><pre>table.inspect.partitions()[&quot;record_count&quot;][0].as_py()</pre><pre>87</pre><p>Let’s remove any rows where species == ‘Droid’ and recheck the count.</p><pre>from pyiceberg.expressions import EqualTo<br><br>table.delete(EqualTo(&quot;species&quot;, &quot;Droid&quot;))</pre><p>“C-3PO”, “R2-D2&quot;, “R5-D4&quot;, “IG-88&quot;, “R4-P17”, and “BB8” should no longer be in the table, giving us six fewer rows.</p><pre>table.inspect.partitions()[&quot;record_count&quot;][0].as_py()</pre><pre>81</pre><h3>Scanning the Iceberg table</h3><p>When we use the scan method on an Iceberg table, we can optionally provide a filter, columns we wish to include, a limit, and a snapshot ID. 
It’s generally up to the engine to filter the files, but the scan’s filter will help produce files that might contain matching rows.</p><h4>Checking snapshots</h4><p>Since we removed a few rows in the previous section, we should now have two snapshots, which can be checked with len(table.snapshots()).</p><p>We can look at the snapshot_id of the snapshots to query our table from before the deletion occurred with [snapshot.snapshot_id for snapshot in table.snapshots()]. In my case, I have snapshots ending in 80693 and 35347, with the former including the droids.</p><pre>table.scan(<br>    selected_fields=(&quot;name&quot;, &quot;species&quot;), snapshot_id=3932249899255080693<br>).to_pandas()</pre><pre>              name species<br>0   Luke Skywalker   Human<br>1            C-3PO   Droid<br>2            R2-D2   Droid<br>3      Darth Vader   Human<br>4      Leia Organa   Human<br>..             ...     ...<br>82            Finn   Human<br>83             Rey   Human<br>84     Poe Dameron   Human<br>85             BB8   Droid<br>86  Captain Phasma   Human<br><br>[87 rows x 2 columns]</pre><p>If we exclude the snapshot_id, it will use the latest snapshot, where we will see the droids are no longer present.</p><pre>table.scan(selected_fields=(&quot;name&quot;, &quot;species&quot;)).to_pandas()</pre><pre>                  name species<br>0       Luke Skywalker   Human<br>1          Darth Vader   Human<br>2          Leia Organa   Human<br>3            Owen Lars   Human<br>4   Beru Whitesun Lars   Human<br>..                 ...     
...<br>76          Tion Medon  Pau&#39;an<br>77                Finn   Human<br>78                 Rey   Human<br>79         Poe Dameron   Human<br>80      Captain Phasma   Human<br><br>[81 rows x 2 columns]</pre><h3>The scan can quack</h3><p>We’ll bring scans to DuckDB, process them further using Ibis, and then return a pandas DataFrame.</p><h4>Counting the species</h4><p>Using scan on a PyIceberg Table object allows us to use the to_duckdb method, which returns a DuckDBPyConnection. We can pass the Ibis DuckDB backend connection to this method to work with an existing connection.</p><pre>import pandas as pd<br>import pyiceberg.catalog<br>from ibis import _<br><br><br>def species_counts(<br>    catalog: pyiceberg.catalog.Catalog, connection: ibis.BaseBackend<br>) -&gt; pd.DataFrame:<br>    table = catalog.load_table(&quot;default.starwars&quot;)<br>    table.scan(selected_fields=(&quot;species&quot;,)).to_duckdb(<br>        &quot;starwars_species&quot;, connection=connection.con<br>    )<br>    expr = (<br>        connection.table(&quot;starwars_species&quot;)<br>        .group_by(&quot;species&quot;)<br>        .aggregate(species_count=_.species.count())<br>        .order_by(_.species_count.desc())<br>        .limit(10)<br>    )<br>    return expr.to_pandas()<br><br><br>con = ibis.duckdb.connect()<br>species_counts(catalog, con)</pre><pre>           species  species_count<br>0            Human             35<br>1           Gungan              3<br>2          Twi&#39;lek              2<br>3         Kaminoan              2<br>4           Zabrak              2<br>5          Wookiee              2<br>6         Mirialan              2<br>7             Hutt              1<br>8            Xexto              1<br>9             Ewok              1</pre><h4>Counting the home worlds</h4><p>Following the same pattern we used for the species, we can count the homeworlds.</p><pre>def homeworlds_counts(<br>    catalog: pyiceberg.catalog.Catalog, connection: ibis.BaseBackend<br>) -&gt; 
pd.DataFrame:<br>    table = catalog.load_table(&quot;default.starwars&quot;)<br>    table.scan(selected_fields=(&quot;homeworld&quot;,)).to_duckdb(<br>        &quot;starwars_homeworlds&quot;, connection=connection.con<br>    )<br>    expr = (<br>        connection.table(&quot;starwars_homeworlds&quot;)<br>        .group_by(&quot;homeworld&quot;)<br>        .aggregate(homeworld_count=_.homeworld.count())<br>        .order_by(_.homeworld_count.desc())<br>        .limit(10)<br>    )<br>    return expr.to_pandas()<br><br><br>homeworlds_counts(catalog, con)</pre><pre>    homeworld  homeworld_count<br>0       Naboo               10<br>1    Tatooine                8<br>2      Kamino                3<br>3   Coruscant                3<br>4    Alderaan                3<br>5      Ryloth                2<br>6    Kashyyyk                2<br>7      Mirial                2<br>8    Corellia                2<br>9  Bestine IV                1</pre><h3>Conclusion</h3><p>PyIceberg is pretty neat! Table formats have been an interesting topic as of late, and I suspect they will remain a popular subject. I found it intuitive to start with PyIceberg, especially with the SQLite catalog. While this only demonstrated potential capabilities on a toy dataset, I’m looking forward to trying this with more data.</p><hr><p><a href="https://medium.com/learning-the-computers/pyiceberg-trying-out-the-sqlite-catalog-d7ace2a4ca5f">PyIceberg — Trying out the SQLite Catalog</a> was originally published in <a href="https://medium.com/learning-the-computers">Learning The Computers</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Powering a PDF Chatbot with Cortex Search and Document AI]]></title>
            <link>https://medium.com/snowflake/powering-a-pdf-chatbot-with-cortex-search-and-document-ai-5eae6defe4c9?source=rss-4c938695f2e2------2</link>
            <guid isPermaLink="false">https://medium.com/p/5eae6defe4c9</guid>
            <category><![CDATA[llm]]></category>
            <category><![CDATA[snowflake]]></category>
            <category><![CDATA[streamlit]]></category>
            <dc:creator><![CDATA[Tyler White]]></dc:creator>
            <pubDate>Sat, 14 Dec 2024 15:41:23 GMT</pubDate>
            <atom:updated>2024-12-14T15:41:23.962Z</atom:updated>
            <content:encoded><![CDATA[<h4>By: Tyler White &amp; <a href="https://medium.com/@katy.haynie">Katy Haynie</a></h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Up4LA5U12h2H6EUqdbSGlg.png" /><figcaption>Architecture diagram for PDF Chatbot</figcaption></figure><h3>Why we needed something</h3><p>Unstructured data can be difficult to manage, including formats like video, audio, and specialized types that do not fit predefined models. PDFs are commonly used for documents across various industries, but extracting useful information from them can be challenging. While PDFs are structured with text and graphics, they lack a consistent layout, making it hard to find specific information without maintaining metadata. This often leads to folders filled with numerous PDFs, where users open files based only on their titles.</p><p>For procurement and legal teams, using Retrieval-Augmented Generation (RAG) architecture on PDF files can result in significant operational efficiencies. By leveraging RAG, teams can quickly retrieve key information from large volumes of procurement-related documents — such as contracts, agreements, and compliance records — without manually combing through them. This reduces the time spent on routine tasks, enabling procurement teams to respond faster to supplier inquiries, identify cost-saving opportunities, and ensure compliance with contractual terms. RAG provides added value by enhancing the ability to extract and review specific clauses, obligations, and risks from complex contracts. This can greatly streamline the contract review process, reduce the potential for oversight, and support faster, more accurate decision-making. 
Overall, RAG boosts operational efficiency and empowers legal teams to manage their workloads more effectively, improving productivity and accuracy in day-to-day operations.</p><h3>What we wanted to build</h3><p>Osmose is a utility services company that assists its customers in managing their structural assets and infrastructure. The company is dedicated to ensuring that the electrical grid is as strong, safe, and resilient as the communities it serves. Osmose holds numerous contracts across the U.S. and Canada that require review by the procurement team to ensure compliance with the agreed-upon services.</p><p>For Osmose, the solution we built allowed them to extract data from their contracts, answering questions like “Is there a non-compete clause in this contract?”. We wanted to enable users to ask questions, extract text, and search over documents. We wanted them to have an interface to select and filter down to specific documents given their contents to be more precise with their targeted selection. The solution leverages a <a href="https://docs.snowflake.com/en/user-guide/snowflake-cortex/document-ai/overview">Document AI</a> model to extract that information across all the available contracts. When employees at Osmose want to drill down and ask specific questions against an individual contract they can do so using a combination of <a href="https://docs.snowflake.com/en/user-guide/snowflake-cortex/cortex-search/cortex-search-overview">Cortex Search</a> (hybrid, vector and keyword, search engine on text) and the <a href="https://docs.snowflake.com/en/user-guide/snowflake-cortex/llm-functions#label-cortex-llm-complete-function">Cortex Complete</a> function to summarize the results. 
The combination of Cortex Search and Cortex Complete allows users to easily generate a new non-compete clause by analyzing the existing contracts with non-compete clauses.</p><h3>Solution Overview</h3><p>This solution analyzes contracts using a combination of Document AI, Cortex Complete, and Cortex Search. It allows the user to extract specific information in a tabular format from a document and then ask natural language questions about the set of documents, facilitating a better understanding of each document. The tabular data is leveraged as filters on the Cortex Search service to reduce the number of rows the service needs to look across to return the most relevant information. The Snowflake Cortex Complete function contextualizes the responses to the questions asked against the search service.</p><p><strong>Key Functionality</strong></p><p><strong>Extraction</strong> (<a href="https://docs.snowflake.com/en/user-guide/snowflake-cortex/document-ai/overview">Document AI</a>)</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*NzUH_3n5KbdMvbXO" /></figure><ul><li>Uses a proprietary large language model to extract data from PDFs</li></ul><p><strong>Document Retrieval</strong> (<a href="https://docs.snowflake.com/en/user-guide/snowflake-cortex/cortex-search/cortex-search-overview">Cortex Search</a>)</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*PPl6RZkFyG_-EQY5" /></figure><ul><li>RAG engine for LLM chatbots</li></ul><p><strong>Question Answering</strong> (<a href="https://docs.snowflake.com/en/sql-reference/functions/complete-snowflake-cortex">Cortex Complete</a>)</p><ul><li>Given a prompt, generates a response using your choice of supported language model</li></ul><p><strong>Data Preprocessing</strong> (<a href="https://docs.snowflake.com/en/sql-reference/functions/parse_document-snowflake-cortex">Parse Document</a> / <a href="https://docs.snowflake.com/en/developer-guide/snowpark/python/creating-udfs">Python
UDFs</a>)</p><ul><li>Snowflake PARSE_DOCUMENT function to extract text from documents using the OCR method</li><li>Python UDFs to count page numbers on PDFs and chunk extractions from PARSE_DOCUMENT function</li></ul><p><strong>Automation</strong> (<a href="https://docs.snowflake.com/en/user-guide/tasks-intro">Tasks</a>)</p><ul><li>Auto-processing pipeline for contracts</li></ul><p><strong>User Interface</strong> (<a href="https://docs.snowflake.com/en/developer-guide/streamlit/about-streamlit">Streamlit in Snowflake</a>)</p><ul><li>Simple application for users to interact with their Cortex Search Services and ask questions against their PDFs</li></ul><h3>Closing thoughts</h3><p>It’s nice to be able to leverage your documents and answer questions that are specific to your team or company. In this article, we presented a single solution with which organizations can quickly get value from thousands of documents and precisely find what they’re looking for.</p><p>If you would like to give this a try yourself, please see the following repo on GitHub:</p><p><a href="https://github.com/Snowflake-Labs/sfguide-build-contracts-chatbot-using-documentai-cortex-search-in-snowflake">https://github.com/Snowflake-Labs/sfguide-build-contracts-chatbot-using-documentai-cortex-search-in-snowflake</a></p><hr><p><a href="https://medium.com/snowflake/powering-a-pdf-chatbot-with-cortex-search-and-document-ai-5eae6defe4c9">Powering a PDF Chatbot with Cortex Search and Document AI</a> was originally published in <a href="https://medium.com/snowflake">Snowflake Builders Blog: Data Engineers, App Developers, AI, &amp; Data Science</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Snowpark ML — An Example Project]]></title>
            <link>https://medium.com/snowflake/snowpark-ml-an-example-project-5627e212520c?source=rss-4c938695f2e2------2</link>
            <guid isPermaLink="false">https://medium.com/p/5627e212520c</guid>
            <category><![CDATA[snowpark]]></category>
            <category><![CDATA[snowflake]]></category>
            <dc:creator><![CDATA[Tyler White]]></dc:creator>
            <pubDate>Tue, 30 Apr 2024 15:05:16 GMT</pubDate>
            <atom:updated>2024-06-15T15:13:50.840Z</atom:updated>
            <content:encoded><![CDATA[<h3>Snowflake ML — An Example Project</h3><h4>A brief walkthrough of a template project for Snowflake ML.</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*cP3M82jPEwPpgT9K7C341A.png" /><figcaption>DALL·E 3: A colorful, fictional world of the 1800s, featuring a polar bear and two human data scientists.</figcaption></figure><h3>Background</h3><p>The Snowflake ML Modeling API has been generally available since early December 2023, and the Snowflake Model Registry has been in public preview since January 2024.</p><p>There are a few examples of working with these packages out there that we frequently visit for inspiration and reference. <a href="https://medium.com/u/bc62ab47b4a9">Chase Romano</a> and <a href="https://medium.com/u/829dc2ec3eae">Sikha Das</a> have shared some great material on this subject.</p><ul><li><a href="https://github.com/cromano8/Snowflake_ML_Intro">GitHub - cromano8/Snowflake_ML_Intro: Introduction to performing Machine Learning on Snowflake</a></li><li><a href="https://quickstarts.snowflake.com/guide/intro_to_machine_learning_with_snowpark_ml_for_python">Intro to Machine Learning with Snowpark ML</a></li></ul><p>These examples use notebooks, which are great but can often be challenging to maintain in repositories. We frequently encounter scenarios where users want to keep their code in Python scripts and operationalize it via traditional orchestration tools and/or Snowflake tasks.</p><p><a href="https://medium.com/u/3eeeaa27a8c9">Kirk Mason</a> and I opted to implement our project as a Python module. This helps avoid needing to append a relative path to the system path to specify imports, which we often see in notebooks. 
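</p><p>As an aside, the kind of helper that lives in such a module is something like the get_next_version imported later in the train script. The article doesn’t show its body, so here is one possible sketch; the FakeRegistry stub, its show_versions method, and the V_1, V_2, … naming scheme are all assumptions for illustration, not the actual Snowflake Model Registry API:</p>

```python
class FakeRegistry:
    """Stand-in for a model registry, purely for illustration."""

    def __init__(self, version_names: list[str]):
        self._version_names = version_names

    def show_versions(self, model_name: str) -> list[str]:
        # A real registry would look versions up by model name.
        return self._version_names


def get_next_version(registry: FakeRegistry, model_name: str) -> str:
    """Return the next sequential version name, e.g. V_2 after V_1."""
    names = registry.show_versions(model_name)
    if not names:
        return "V_1"
    latest = max(int(name.rsplit("_", 1)[1]) for name in names)
    return f"V_{latest + 1}"


print(get_next_version(FakeRegistry(["V_1", "V_2"]), "MY_MODEL"))  # V_3
```

<p>Keeping helpers like this in my_project.common is what lets both the scripts and the registered stored procedures import them the same way.</p><p>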
The scripts we will walk through can be executed directly in Python or registered via Snowpark as Stored Procedures, later executed as Snowflake tasks in a DAG.</p><p>We think having a well-organized project structure is essential, so that’s what we’ll outline here.</p><h3>The Project Structure</h3><p>Having a template or framework is essential to start any project. This helps with onboarding new team members as well. While there may be variations of this to fit specific needs, here’s an outline of what we find to be a good starting point.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/944/1*TALLlvyL3VHz7NayJjXT4w.png" /><figcaption>Example Project Structure</figcaption></figure><p>Let’s break down these sections and the various files at the top level.</p><h4>docs</h4><p>The docs folder contains any project-specific documentation needed to encompass requirements and help enable future developers and stakeholders.</p><h4>scratch</h4><p>The scratch folder contains any notebook code, experimentation, or exploration that should not be executed automatically.</p><h4>my_project (or src)</h4><p>This folder serves as the “heart” of our project. Treating code as a Python package has several advantages. It enhances modularity, organization, and readability by separating related functionalities into distinct modules. It also provides namespace isolation to avoid naming clashes and promotes reusability, minimizing code duplication.</p><p>I mentioned in the earlier section that providing options to run these Python scripts directly or by registering the functions was important. Here’s how we did that. 
The code is shortened for brevity, as the model training code isn’t the main focus here.</p><pre>from snowflake.ml.modeling.impute import SimpleImputer<br>from snowflake.ml.modeling.pipeline import Pipeline<br>from snowflake.ml.modeling.xgboost import XGBRegressor<br>from snowflake.ml.registry import Registry<br>from snowflake.snowpark import Session<br>from snowflake.snowpark import functions as F<br>from snowflake.snowpark import types as T<br>from my_project.common import get_next_version<br>from my_project.common import get_performance_metrics<br><br>import logging<br><br><br>def train(session: Session) -&gt; str:<br>    logger = logging.getLogger(__name__)<br>    logger.info(<br>        &quot;{&#39;message&#39;:&#39;Begin model training procedure&#39;, &#39;unit&#39;:&#39;analytics&#39;}&quot;<br>    )<br>    ...<br><br>    pipeline = Pipeline(<br>        [<br>            ...<br>        ]<br>    )<br>    pipeline.fit(train_df)<br><br>    logger.info(<br>        &quot;{&#39;message&#39;:&#39;Obtain metrics&#39;, &#39;unit&#39;:&#39;analytics&#39;}&quot;<br>    )<br><br>    train_result_df = pipeline.predict(train_df)<br>    test_result_df = pipeline.predict(test_df)<br><br>    combined_metrics = dict(<br>        train_metrics=get_performance_metrics(<br>            &quot;REGRESSION&quot;, train_result_df, &quot;PRICE&quot;, &quot;OUTPUT_PRICE&quot;<br>        ),<br>    )<br><br>    reg = Registry(session=session, schema_name=&quot;MODELS&quot;)<br><br>    model_name = &quot;MY_MODEL&quot;<br>    model_version = get_next_version(reg, model_name)<br><br>    reg.log_model(<br>        model_name=model_name,<br>        version_name=model_version,<br>        model=pipeline,<br>        metrics=combined_metrics,<br>    )<br><br>    logger.info(<br>        &quot;{&#39;message&#39;:&#39;Finished training and registering&#39;, &#39;unit&#39;:&#39;analytics&#39;}&quot;<br>    )<br><br>    return f&quot;Model {model_name}.{model_version} is trained and 
deployed.&quot;<br><br><br>if __name__ == &quot;__main__&quot;:<br>    session = Session.builder.getOrCreate()<br>    session.use_warehouse(&quot;ML_BIG&quot;)<br>    session.use_database(&quot;ML_EXAMPLES&quot;)<br>    session.use_schema(&quot;DIAMONDS&quot;)<br>    raise SystemExit(train(session))</pre><p>We define our function train, which can be registered directly as a stored procedure at a later step (<strong>register_deploy_dags</strong>).</p><p>As an alternative to registering as a stored procedure, invoking this script directly will hit the __main__ condition, establish a connection to Snowflake, and execute the train function. This allows using this code with an orchestration tool if you do not intend to use Snowflake tasks.</p><h4>pyproject.toml</h4><p>The pyproject.toml file contains the information needed to set up the project and the Python environment. Here’s what ours looks like.</p><pre>[build-system]<br>requires = [&quot;setuptools&quot;, &quot;wheel&quot;]<br>build-backend = &quot;setuptools.build_meta&quot;<br><br>[project]<br>name = &quot;my_project&quot;<br>description = &quot;A Snowpark ML project.&quot;<br>version = &quot;0.1.0&quot;<br>readme = &quot;README.md&quot;<br>dependencies = [<br>    &quot;snowflake-snowpark-python==1.14.0&quot;,<br>    &quot;numpy==1.26.3&quot;,<br>    &quot;scikit-learn==1.3.0&quot;,<br>    &quot;snowflake[ml]&quot;,<br>    &quot;xgboost==1.7.3&quot;<br>]<br><br>[tool.setuptools.packages.find]<br>include = [&quot;my_project&quot;]<br><br>[project.optional-dependencies]<br>dev = [&quot;nbqa[toolchain]&quot;, &quot;jupyter&quot;]</pre><p>Having this file in the root of our repo will allow us to easily install our project with pip by executing the following command:</p><pre>pip install .</pre><p>Or, if you wish to perform an editable install, you can also do the following:</p><pre>pip install -e .</pre><p>Since we have specified a “dev” extra in our configuration file, we can also install dev dependencies like 
this:</p><pre>pip install -e &quot;.[dev]&quot;</pre><h4>register_deploy_dags.py</h4><p>In this code, we’re registering stored procedures from our Python module and setting them up to create a DAG in Snowflake.</p><pre>session.sproc.register_from_file(<br>    file_path=&quot;my_project/train.py&quot;,<br>    func_name=&quot;train_model&quot;,<br>    name=&quot;TRAIN_MODEL&quot;,<br>    is_permanent=True,<br>    packages=[&quot;snowflake-snowpark-python&quot;, &quot;snowflake-ml-python&quot;],<br>    imports=[&quot;my_project&quot;],<br>    stage_location=&quot;@PYTHON_CODE&quot;,<br>    replace=True,<br>    execute_as=&#39;caller&#39;<br>)</pre><p>Later, we can deploy our DAG in the same script.</p><pre>with DAG(<br>    &quot;EXAMPLE_DAG&quot;,<br>    schedule=Cron(&quot;0 8 * * 1&quot;, &quot;America/New_York&quot;),<br>    stage_location=&quot;@PYTHON_CODE&quot;,<br>    use_func_return_value=True,<br>) as dag:<br>    train_task = DAGTask(<br>        name=&quot;TRAIN_MODEL_TASK&quot;,<br>        definition=&quot;CALL TRAIN_MODEL();&quot;,<br>        warehouse=&quot;COMPUTE_WH&quot;,<br>    )<br>    set_default_task = DAGTask(<br>        name=&quot;SET_DEFAULT_VERSION&quot;,<br>        definition=&quot;CALL SET_DEFAULT_VERSION(&#39;DIAMONDS&#39;, &#39;rmse&#39;, True);&quot;,<br>        warehouse=&quot;COMPUTE_WH&quot;,<br>    )<br>    train_task &gt;&gt; set_default_task</pre><p>If you want to learn more about how this works, please check out the <a href="https://docs.snowflake.com/developer-guide/snowflake-python-api/snowflake-python-managing-tasks">docs</a>.</p><h3>Conclusion</h3><p>We hope this gives you a foundation for getting started and is flexible enough to meet your needs.</p><p>We encourage you to try this, make any necessary changes, and let us know how it works. 
Please don&#39;t hesitate to share your experiences or challenges, as they might help others.</p><p>A working demonstration of this can be found at the following repo:</p><p><a href="https://github.com/IndexSeek/snowflake-ml-example-project">GitHub - IndexSeek/snowflake-ml-example-project: An example project for Snowflake ML.</a></p><hr><p><a href="https://medium.com/snowflake/snowpark-ml-an-example-project-5627e212520c">Snowpark ML — An Example Project</a> was originally published in <a href="https://medium.com/snowflake">Snowflake Builders Blog: Data Engineers, App Developers, AI, &amp; Data Science</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[It’s a wrap(s)! Python Logging Decorators Improve Troubleshooting]]></title>
            <link>https://medium.com/learning-the-computers/its-a-wrap-s-python-logging-decorators-improve-troubleshooting-a96f4ab728d1?source=rss-4c938695f2e2------2</link>
            <guid isPermaLink="false">https://medium.com/p/a96f4ab728d1</guid>
            <category><![CDATA[logging]]></category>
            <category><![CDATA[python]]></category>
            <dc:creator><![CDATA[Tyler White]]></dc:creator>
            <pubDate>Fri, 23 Feb 2024 21:53:50 GMT</pubDate>
            <atom:updated>2024-02-23T21:53:50.128Z</atom:updated>
            <content:encoded><![CDATA[<h4>A short journey into decorators and logging in Python.</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*bC1PlpYJyiA8gWyg9GhtOg.png" /><figcaption>DALL·E 3 Created Image</figcaption></figure><p>Have you ever wondered what arguments are passed when a function is called? Do you need help deciphering long error messages that are often confusing? Hopefully, we can fix that.</p><h3>logging &gt; print?</h3><p>A common practice among developers is to use print statements frequently. However, as they become more familiar with logging and are reminded of its importance, they switch to using logging.info statements everywhere instead.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*GNHA-Xcp7J9-hfNrhfI0cA.png" /><figcaption>DALL·E 3 Created Image</figcaption></figure><p>This approach can result in longer code, though it allows for capturing more information. Fortunately, a simple solution can capture the desired information quite efficiently.</p><h3>What’s the “simple” solution?</h3><p>Decorators!</p><p>Python decorators are a powerful tool that allows developers to dynamically modify the behavior of functions or methods. 
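</p><p>Before the logging decorator itself, a minimal toy example may help show the mechanics: a decorator is just a function that takes a function and returns a replacement for it.</p>

```python
from functools import wraps


def shout(func):
    """A toy decorator that upper-cases whatever the wrapped function returns."""

    @wraps(func)  # preserve the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        # Call the original function, then modify its result.
        return func(*args, **kwargs).upper()

    return wrapper


@shout
def greet(name: str) -> str:
    return f"hello, {name}"


print(greet("tyler"))  # HELLO, TYLER
print(greet.__name__)  # greet, thanks to functools.wraps
```

<p>The logging decorator below follows exactly the same shape; it just logs around the call instead of changing the return value.</p><p>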
They sound scary but aren’t too bad once you understand how they work.</p><p>Let’s see how they can be used with Python&#39;s logging module to improve the functionality and maintainability of code.</p><p>Here’s some code to implement a function to do this.</p><pre>import logging<br>from functools import wraps<br><br>def logger(func):<br>    &quot;&quot;&quot;<br>    A decorator function to log information about function calls and their results.<br>    &quot;&quot;&quot;<br><br>    @wraps(func)<br>    def wrapper(*args, **kwargs):<br>        &quot;&quot;&quot;<br>        The wrapper function that logs function calls and their results.<br>        &quot;&quot;&quot;<br>        logging.info(f&quot;Running {func.__name__} with args: {args}, kwargs: {kwargs}&quot;)<br>        try:<br>            result = func(*args, **kwargs)<br>            logging.info(f&quot;Finished {func.__name__} with result: {result}&quot;)<br>        <br>        except Exception as e:<br>            logging.error(f&quot;Error occurred in {func.__name__}: {e}&quot;)<br>            raise<br>        <br>        else:<br>            return result<br><br>    return wrapper</pre><p>The wrapper function is a middle layer between the original function and the calling code. It logs information related to the function calls and their results, including any exceptions. When the decorated function is called, the wrapper function is invoked and calls the original function while logging relevant information. The decorator preserves the original function&#39;s metadata, ensuring introspection and debugging tools function correctly.</p><h3>How does it work?</h3><p>We can throw it on any function using the decorator.</p><pre>@logger<br>def greet_user(name: str, greeting_type: str):<br>    if greeting_type == &quot;short&quot;:<br>        return f&quot;Hey, {name}!&quot;<br>    elif greeting_type == &quot;long&quot;:<br>        return f&quot;Oh, hello there, {name}! 
How are you today?&quot;<br>    else:<br>        raise ValueError(&quot;Invalid greeting type. Please choose &#39;short&#39; or &#39;long&#39;.&quot;)</pre><p>Calling it is just like using any other function.</p><pre>greet_user(&quot;Tyler&quot;, &quot;short&quot;)<br>greet_user(&quot;Not Tyler&quot;, &quot;short&quot;)<br>greet_user(&quot;Tyler&quot;, &quot;BREAK&quot;)</pre><p>Since my current session logs at the DEBUG level, I get back some helpful messages!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*3Xdy0-D-vrfl6DNNx5DUdQ.png" /><figcaption>Logging messages from running the code.</figcaption></figure><h3>Conclusion</h3><p>Much has been written on this subject, often in great depth. I hope this serves as a gentle introduction and offers a starting point.</p><p>You should experiment with custom logging formats, integrate third-party logging libraries, or explore advanced decorator patterns beyond the scope of this article. Don’t hesitate to share your experiences or challenges while implementing logging decorators, as they might help others.</p><p>For some additional reading, check out these links:</p><ul><li><a href="https://docs.python.org/3/library/functools.html#module-functools">functools - Higher-order functions and operations on callable objects</a></li><li><a href="https://peps.python.org/pep-0318/">PEP 318 - Decorators for Functions and Methods</a></li></ul><hr><p><a href="https://medium.com/learning-the-computers/its-a-wrap-s-python-logging-decorators-improve-troubleshooting-a96f4ab728d1">It’s a wrap(s)! Python Logging Decorators Improve Troubleshooting</a> was originally published in <a href="https://medium.com/learning-the-computers">Learning The Computers</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>