<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Tyler White on Medium]]></title>
        <description><![CDATA[Stories by Tyler White on Medium]]></description>
        <link>https://medium.com/@btylerwhite?source=rss-4c938695f2e2------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*aWvMM9wscA0G8HNXS3hbww.jpeg</url>
            <title>Stories by Tyler White on Medium</title>
            <link>https://medium.com/@btylerwhite?source=rss-4c938695f2e2------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Fri, 10 Apr 2026 16:37:34 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@btylerwhite/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Getting started with goose using Snowflake and MCP]]></title>
            <link>https://medium.com/snowflake/getting-started-with-goose-using-snowflake-and-mcp-4e6c78ca0e19?source=rss-4c938695f2e2------2</link>
            <guid isPermaLink="false">https://medium.com/p/4e6c78ca0e19</guid>
            <category><![CDATA[open-source]]></category>
            <category><![CDATA[snowflake]]></category>
            <category><![CDATA[llm]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[data-engineering]]></category>
            <dc:creator><![CDATA[Tyler White]]></dc:creator>
            <pubDate>Tue, 24 Jun 2025 19:01:55 GMT</pubDate>
            <atom:updated>2025-06-24T19:01:55.289Z</atom:updated>
            <content:encoded><![CDATA[<h4>Snowflake’s Cortex completion models can now be used with goose, making the Snowflake MCP server for Cortex Search and Cortex Analyst a powerful combination.</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1000/0*DIKwoxYiRp9zpaEY" /></figure><h3>Introduction</h3><p>In January 2025, Block announced the release of goose, an open source, extensible AI agent that bridges the gap between LLMs and real-world actions.</p><p>Its support for multiple LLM providers and frontier models democratizes the use of generative AI and allows users to build agentic processes and workflows. These workflows are built on the Model Context Protocol (MCP) open standard, of which Block was an early adopter and contributor.</p><p><a href="https://block.github.io/goose/">codename goose</a></p><p>With the newly open-sourced Snowflake MCP server and the goose Snowflake Cortex integration, you can now use Cortex as a first-party provider and get answers backed by your Snowflake data.</p><p>This blog will show you how to get started with goose and Snowflake.</p><p><a href="https://github.com/snowflake-labs/mcp">GitHub - Snowflake-Labs/mcp</a></p><h3>Getting Started with goose and Snowflake</h3><p>Follow the instructions to install <a href="https://block.github.io/goose/docs/getting-started/installation">goose</a>.</p><p>After installation is complete, you’ll be prompted to configure goose with a provider, and Snowflake is now one of the options. 
Choose Snowflake.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/732/1*Nxi5X5dA_xrFij8dDZR1GQ.png" /><figcaption>The provider list available in goose.</figcaption></figure><p>From there, configuring Snowflake as a provider only requires the hostname and a <a href="https://docs.snowflake.com/en/user-guide/programmatic-access-tokens">Programmatic Access Token (PAT)</a>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/850/0*Vg2drpy7XknN1vL1" /><figcaption>Options to configure Snowflake as a provider for goose.</figcaption></figure><h3>Extensions</h3><p>Before you get started, it’s worth knowing a little bit more about extensions and what they mean for the tool.</p><p>goose integrates with extensions that use the open source Model Context Protocol (MCP), giving you access to a wide range of capabilities. This lets you connect goose to various tools, such as content repositories and business apps. Goose has several built-in extensions, but only the Developer extension is activated by default. This core extension offers key software development tools, and you can activate options like Computer Controller and Memory integration.</p><p>As a final step before diving into using Snowflake and goose, you need to enable the Snowflake extension.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*eReooiONFxjP745N" /><figcaption>Configuring extensions in goose.</figcaption></figure><p>And voilà! You are now ready to use Snowflake with goose!</p><h3>Using the MCP server</h3><p>Snowflake Labs’ new <a href="https://github.com/Snowflake-Labs/mcp">open source MCP server</a> supports the following:</p><ul><li>Cortex Complete: generates completion responses using supported language models</li><li>Cortex Search: low-latency, high-quality “fuzzy” search over unstructured data</li><li>Cortex Analyst: text-to-SQL over structured data</li></ul><p>This means we can use all these features with goose as additional tools! 
For example, I’ve configured two search services and one analyst service in my Snowflake MCP server configuration file:</p><pre>search_services:<br>  - service_name: SEC_SEARCH_SERVICE<br>    description: &gt;<br>      Search service that contains reports that publicly traded US companies<br>      must file with the Securities and Exchange Commission (SEC)<br>    database_name: CUBE_TESTING<br>    schema_name: PUBLIC<br>  - service_name: FOMC_MINUTES_SEARCH_SERVICE<br>    description: &gt;<br>      Search service that contains the minutes of regularly scheduled<br>      meetings held by The Federal Open Market Committee<br>    database_name: DEMO_CORTEX_SEARCH<br>    schema_name: FOMC<br>analyst_services:<br>  - service_name: CUSTOMER_DATA<br>    semantic_model: &#39;@CATRANSLATOR.ANALYTICS.DATA/customers.yaml&#39;<br>    description: &gt;<br>      Analyst service that contains structured customer data to query</pre><p>The LLM recognizes when it must invoke the Snowflake extension and determines which tool to use. In the following example, you will see how the LLM routes to Cortex Analyst and Cortex Search when we ask questions about financial performance and our customers.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*-g5vNQZNI59ZlIlx" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/742/0*donjcndwH9klrxGs" /><figcaption>Goose using Cortex Analyst and Cortex Search.</figcaption></figure><p>This capability allows us to use natural language to interact with our local machine and any additional extensions we wish to configure.</p><h3>Conclusion</h3><p>I’ve personally been using goose to analyze some of our team’s Snowflake query usage to detect anomalies. 
From there, I can use other extensions to summarize the findings and help me write queries to diagnose the anomalies further.</p><p>Taking things to the next level, you can transform a goose session into a reusable recipe encompassing the tools, goals, and setup you’re using, packaging it into a new agent that others (or your future self) can launch with a single click.</p><p>I’m excited to have contributed to goose and enabled this Snowflake functionality for users of the project. I can’t wait to see how the integrations evolve in the future. In the meantime, I hope you’ll check out goose and take the new extension for a spin!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=4e6c78ca0e19" width="1" height="1" alt=""><hr><p><a href="https://medium.com/snowflake/getting-started-with-goose-using-snowflake-and-mcp-4e6c78ca0e19">Getting started with goose using Snowflake and MCP</a> was originally published in <a href="https://medium.com/snowflake">Snowflake Builders Blog: Data Engineers, App Developers, AI, &amp; Data Science</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Cortex Complete in Rust]]></title>
            <link>https://medium.com/@btylerwhite/cortex-complete-in-rust-f897e160f621?source=rss-4c938695f2e2------2</link>
            <guid isPermaLink="false">https://medium.com/p/f897e160f621</guid>
            <category><![CDATA[llm]]></category>
            <category><![CDATA[python]]></category>
            <category><![CDATA[snowflake]]></category>
            <category><![CDATA[rust]]></category>
            <dc:creator><![CDATA[Tyler White]]></dc:creator>
            <pubDate>Mon, 31 Mar 2025 15:09:12 GMT</pubDate>
            <atom:updated>2025-03-31T15:09:12.065Z</atom:updated>
            <content:encoded><![CDATA[<h4>Reaching out to Snowflake LLMs using Rust and Python with pyo3.</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Su75yAy-ENhPHuDP39ye0Q.png" /></figure><p>I’ve been exploring Rust casually, creating a few small CLI utilities tailored to my needs. Recently, however, I discovered an opportunity to streamline some of my work with Cortex and Snowflake. Since I frequently use Cortex Complete for various tasks, I decided to leverage the <a href="https://docs.rs/reqwest/latest/reqwest/">reqwest</a> crate in Rust, using a token from the Python Connector to simplify the process.</p><p>Snowflake’s suite of Cortex functions offers capabilities to invoke various LLMs. I’m particularly interested in the COMPLETE function, which has a corresponding REST API. Performing this task in Rust rather than Python might speed things up.</p><p>In this article, I’ll outline the steps and the code needed to use Snowflake’s Cortex Complete function with Rust and invoke it using Python.</p><h4>Getting started</h4><p>We’ll need Rust and a few other tools to build this thing out. 
I’ve got Rust and uv installed, so I’ll set up my project, and we’ll add everything we need.</p><pre>mkdir cortex-complete<br>cd cortex-complete<br>cargo init<br>cargo add pyo3 -F extension-module -F experimental-async<br>cargo add reqwest -F blocking -F json<br>cargo add serde_json<br>uv init --python 3.11<br>uv sync<br>uv add maturin &quot;snowflake-connector-python[secure-local-storage]&quot;</pre><p>This creates all of the configuration files we’ll need, but we must modify both the Cargo.toml and pyproject.toml files.</p><pre>[package]<br>name = &quot;cortex-complete&quot;<br>version = &quot;0.1.0&quot;<br>edition = &quot;2021&quot;<br><br>[lib]<br>name = &quot;cortex&quot;<br>crate-type = [&quot;cdylib&quot;]<br><br>[dependencies]<br>pyo3 = { version = &quot;0.24.0&quot;, features = [&quot;extension-module&quot;, &quot;experimental-async&quot;] }<br>reqwest = { version = &quot;0.12.15&quot;, features = [&quot;blocking&quot;, &quot;json&quot;] }<br>serde_json = &quot;1.0.140&quot;</pre><pre>[build-system]<br>requires = [&quot;maturin&gt;=1,&lt;2&quot;]<br>build-backend = &quot;maturin&quot;<br><br>[project]<br>name = &quot;cortex-complete&quot;<br>version = &quot;0.1.0&quot;<br>readme = &quot;README.md&quot;<br>requires-python = &quot;&gt;=3.11&quot;<br>dependencies = [<br>    &quot;maturin&gt;=1.8.3&quot;,<br>    &quot;snowflake-connector-python[secure-local-storage]&gt;=3.14.0&quot;,<br>]</pre><p>Now, we can start coding!</p><h4>Library</h4><p>In the src folder, we will create a file named lib.rs to hold the Rust code we will execute from Python. A lot is going on here, so I won’t try to explain it all, but essentially, we’re grabbing the “rest” attribute from the Snowflake Python Connection object to submit a POST request and parse the response. 
Also worth noting: if we call the complete function multiple times, we only need to authenticate once, since the connection object is cached; this reduces MFA prompts.</p>
<pre>use pyo3::{<br>    prelude::*,<br>    types::PyString,<br>};<br>use reqwest::{<br>    blocking::Client,<br>    header::{self, HeaderMap, HeaderValue},<br>};<br>use serde_json::Value;<br>use std::sync::OnceLock;<br><br>static CON: OnceLock&lt;PyObject&gt; = OnceLock::new();<br>static HEADERS: OnceLock&lt;HeaderMap&gt; = OnceLock::new();<br><br>fn get_con() -&gt; Result&lt;PyObject, PyErr&gt; {<br>    Python::with_gil(|py| {<br>        Ok(CON<br>            .get_or_init(|| {<br>                let module = py<br>                    .import(&quot;snowflake.connector&quot;)<br>                    .expect(&quot;Failed to import &#39;snowflake.connector&#39;&quot;);<br>                let con = module<br>                    .call_method(&quot;connect&quot;, (), None)<br>                    .expect(&quot;Failed to call &#39;connect&#39;&quot;);<br>                con.into()<br>            })<br>            .clone_ref(py))<br>    })<br>}<br><br>fn get_headers() -&gt; Result&lt;HeaderMap, PyErr&gt; {<br>    Python::with_gil(|py| {<br>        Ok(HEADERS<br>            .get_or_init(|| {<br>                let con = get_con().expect(&quot;Failed to get connection&quot;);<br>                let token: String = con<br>                    .getattr(py, &quot;rest&quot;)<br>                    .expect(&quot;Failed to get &#39;rest&#39;&quot;)<br>                    .getattr(py, &quot;token&quot;)<br>                    .expect(&quot;Failed to get &#39;token&#39;&quot;)<br>                    .extract(py)<br>                    .expect(&quot;Failed to extract token&quot;);<br>                let mut headers = HeaderMap::new();<br>                headers.insert(<br>                    header::AUTHORIZATION,<br>                    HeaderValue::from_str(&amp;format!(&quot;Snowflake Token=\&quot;{}\&quot;&quot;, token))<br>                        .expect(&quot;Failed to create AUTHORIZATION header&quot;),<br>                );<br>                headers.insert(<br>                    header::CONTENT_TYPE,<br>                    HeaderValue::from_static(&quot;application/json&quot;),<br>                );<br>                headers.insert(header::USER_AGENT, HeaderValue::from_static(&quot;Mozilla/5.0&quot;));<br>                headers<br>            })<br>            .clone())<br>    })<br>}<br><br>fn handle_error(message: &amp;str, error: impl std::fmt::Display) -&gt; PyErr {<br>    PyErr::new::&lt;pyo3::exceptions::PyException, _&gt;(format!(&quot;{}: {}&quot;, message, error))<br>}<br><br>fn extract_and_join(json_list: Vec&lt;Value&gt;) -&gt; String {<br>    json_list<br>        .into_iter()<br>        .filter_map(|s| {<br>            s[&quot;choices&quot;][0][&quot;delta&quot;][&quot;content&quot;]<br>                .as_str()<br>                .map(|s| s.to_string())<br>        })<br>        .collect()<br>}<br><br>#[pyfunction]<br>fn complete(model: &amp;str, prompt: &amp;str) -&gt; PyResult&lt;Py&lt;PyString&gt;&gt; {<br>    Python::with_gil(|py| {<br>        let data = serde_json::json!({<br>            &quot;model&quot;: model,<br>            &quot;messages&quot;: [{&quot;content&quot;: prompt}],<br>        });<br>        let con = get_con()?;<br>        let host: String = con.getattr(py, &quot;host&quot;)?.extract(py)?;<br>        let url = format!(&quot;https://{}{}&quot;, host, &quot;/api/v2/cortex/inference:complete&quot;);<br>        let headers = get_headers()?;<br>        let client = Client::new();<br><br>        let response_text = client<br>            .post(url)<br>            .headers(headers)<br>            .json(&amp;data)<br>            .send()<br>            .map_err(|e| handle_error(&quot;Request error&quot;, e))?<br>            .text()<br>            .map_err(|e| handle_error(&quot;Response error&quot;, e))?;<br><br>        let json_list: Vec&lt;Value&gt; = response_text<br>            .lines()<br>            .filter_map(|line| line.trim().strip_prefix(&quot;data: &quot;))<br>            .filter_map(|line| serde_json::from_str::&lt;Value&gt;(line).ok())<br>            .collect();<br><br>        let answer = extract_and_join(json_list);<br><br>        Ok(PyString::new(py, answer.trim()).into())<br>    })<br>}<br><br>#[pymodule]<br>fn cortex(m: &amp;Bound&lt;&#39;_, PyModule&gt;) -&gt; PyResult&lt;()&gt; {<br>    m.add_function(wrap_pyfunction!(complete, m)?)?;<br>    Ok(())<br>}</pre><h4>Using it</h4><p>After building and installing the module into our virtual environment with maturin develop, we can import it and try it out.</p><pre>import cortex<br>cortex.complete(&quot;mistral-large2&quot;, &quot;Where do people typically publish technical articles?&quot;)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/747/1*nQhyfxjo7hvkYDKl7PBb_Q.png" /><figcaption>Using the Rust module</figcaption></figure><h4>Comparison</h4><p>I did a minimal benchmark using the Python requests library and the custom Rust code and found similar results. 
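</p><p>Both versions hinge on the same parsing step: the Cortex REST API streams lines prefixed with "data: ", and each fragment’s choices[0].delta.content must be extracted and joined. Here is a self-contained sketch of just that logic; the function name and sample payload below are my own, made up for illustration:</p>

```python
import json

def parse_stream(text: str) -> str:
    """Join the content fragments from a 'data: '-prefixed streaming response."""
    parts = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("data: "):
            continue  # skip blank lines and non-data noise
        obj = json.loads(line[len("data: "):])
        content = obj["choices"][0]["delta"].get("content")
        if content is not None:
            parts.append(content)
    return "".join(parts).strip()

# Made-up sample payload mimicking the shape of the streaming response.
sample = (
    'data: {"choices": [{"delta": {"content": "Hello"}}]}\n'
    'data: {"choices": [{"delta": {"content": ", world"}}]}\n'
)
print(parse_stream(sample))  # Hello, world
```

<p>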
The context window and token size would likely make a bigger difference than the implementation language.</p><p>Here’s a Python function that performs the same request and parses the results.</p><pre>import json<br>import requests<br>import snowflake.connector<br><br>def python_complete(url: str, data: dict, headers: dict) -&gt; str:<br>    r = requests.post(url, json=data, headers=headers)<br>    return &quot;&quot;.join(<br>        [<br>            json.loads(obj.strip()).get(&quot;choices&quot;)[0].get(&quot;delta&quot;).get(&quot;content&quot;, &quot;&quot;)<br>            for obj in r.text.split(&quot;data: &quot;)<br>            if obj.strip()<br>        ]<br>    ).strip()<br><br><br>con = snowflake.connector.connect()<br>url = f&quot;https://{con.host}/api/v2/cortex/inference:complete&quot;<br>data = {<br>    &quot;model&quot;: &quot;mistral-large2&quot;,<br>    &quot;messages&quot;: [{&quot;content&quot;: &quot;Where do people typically publish technical articles?&quot;}],<br>}<br>headers = {<br>    &quot;Authorization&quot;: f&#39;Snowflake Token=&quot;{con.rest.token}&quot;&#39;,<br>    &quot;Content-Type&quot;: &quot;application/json&quot;,<br>    &quot;Accept&quot;: &quot;application/json&quot;,<br>}<br><br>print(python_complete(url, data, headers))</pre><h4>Conclusion</h4><p>I enjoy working on these projects, and there is plenty of opportunity to swap certain operations into Rust. I hope to rewrite this without using pyo3 so that it’s even lighter (and maybe quicker).</p><p>If you’ve been using Rust and want to try this, I would love to hear from you!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=f897e160f621" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Casting bools with Polars using Rust]]></title>
            <link>https://medium.com/learning-the-computers/casting-bools-with-polars-using-rust-f1ec95b43a4b?source=rss-4c938695f2e2------2</link>
            <guid isPermaLink="false">https://medium.com/p/f1ec95b43a4b</guid>
            <category><![CDATA[polar]]></category>
            <category><![CDATA[rust]]></category>
            <dc:creator><![CDATA[Tyler White]]></dc:creator>
            <pubDate>Tue, 14 Jan 2025 22:13:30 GMT</pubDate>
            <atom:updated>2025-01-14T22:13:30.711Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/509/1*P3xPKANuAADYZ3OGr9Kn0A.png" /></figure><p>I’ve wanted to explore Rust for a while and figured, “What better way to experience it than using a dataframe API I am already familiar with?” With Rust’s built-in Cargo toolchain managing our environment, formatting, and testing, tools like <a href="https://docs.pytest.org/en/stable/"><strong>pytest</strong></a>, <a href="https://docs.astral.sh/uv/"><strong>uv</strong></a>, and <a href="https://docs.astral.sh/ruff/"><strong>ruff</strong></a> will no longer be necessary.</p><p>It would be neat to get <a href="https://crates.io/crates/polars"><strong>Polars</strong></a> up and running using Rust and explore the behavior of boolean columns. This approach lets us stay in the shallow end: we’re not yet using LazyFrames or DataFrames, and we’ll be sticking with Polars’ Series.</p><h4>Installing Rust</h4><p>To get started, we’ll need Rust installed on our preferred system. I’m using macOS, but it’s a reasonably similar operation if you’re using Linux or Windows.</p><p><a href="https://www.rust-lang.org/tools/install">Install Rust</a></p><h4>Cargo</h4><p>We’re ready to create our first project! To get started, we will primarily work with our shell of preference (I am using zsh).</p><pre>mkdir ~/Desktop/bools-polars-rust<br>cd ~/Desktop/bools-polars-rust<br>cargo init</pre><p>If we open this directory using an IDE, we’ll see that we now have a <strong>src</strong> folder, a preconfigured <strong>.gitignore</strong> file, a gitignored <strong>target</strong> folder, and two additional files I’ll explain: <strong>Cargo.lock</strong> and <strong>Cargo.toml</strong>.</p><ul><li><strong>Cargo.toml</strong> is a manifest file that describes your dependencies</li><li><strong>Cargo.lock</strong> contains exact information about your dependencies. 
Cargo maintains and edits it automatically.</li></ul><p>uv follows a similar pattern with its <strong>pyproject.toml</strong> and <strong>uv.lock</strong> files. We will add our first dependency, known in Rust as a crate, with cargo add.</p><blockquote>I recommend configuring appropriate extensions for Rust using your IDE for a better experience.</blockquote><pre>cargo add polars</pre><p>This command updates the crates.io index and adds the newest Polars version to your project dependencies. In a future article, we may explore various crate features, but the core Polars crate should suffice for now.</p><p>Let’s jump into that <strong>src</strong> folder and open up <strong>main.rs</strong>. We can greet the world to ensure things work fine, but we will edit this file shortly.</p><pre>cargo run</pre><p>Compiling may take a moment, but once it is complete, you should see “Hello, world!” printed on your console. That’s nice, but we want to see a Polars boolean Series!</p><h4>Editing the file</h4><p>Let’s edit the file now. Similar to Python imports, we’ll need to make Polars available. At the top of the <strong>main.rs</strong> file, add the following:</p><pre>use polars::prelude::*;</pre><p>The prelude module in a Rust crate typically contains the most commonly used items from that crate. By including prelude::*, you’re importing all the standard and essential functionality from Polars. We also get the traits, functions, and types required for typical operations like creating dataframes, working with series, and performing computations.</p><h4>Making a series</h4><p>We will create and print a simple five-row Polars Series on the console. We can also remove the unneeded “Hello, world!” print. 
The entirety of our script should look like this:</p><pre>use polars::prelude::*;<br><br>fn main() -&gt; Result&lt;(), PolarsError&gt; {<br>    let vals: &amp;[bool;5] = &amp;[true, false, true, false, true];<br>    let bool_ser: Series = Series::new(&quot;bool_ser&quot;.into(), vals);<br>    println!(&quot;{:?}&quot;, bool_ser);<br><br>    Ok(())<br>}</pre><p>When we run this with cargo run we now see a different response:</p><pre>shape: (5,)<br>Series: &#39;bool_ser&#39; [bool]<br>[<br>        true<br>        false<br>        true<br>        false<br>        true<br>]</pre><p>If we were to perform a sum operation over this series, we would get three, as I am assuming that true = 1 and false = 0. Let’s add the following snippet to our main function.</p><pre>println!(&quot;{:?}&quot;, bool_ser.sum::&lt;i8&gt;());</pre><p>We did need to cast this to an 8-bit signed integer, but when we run it, it prints “Ok(3)” below the series. The print works as expected! What if we wanted to compute the minimum and maximum values from bool_ser where we expect 0 and 1, respectively? Add the following two lines to your <strong>src/main.rs</strong> file:</p><pre>println!(&quot;{:?}&quot;, bool_ser.min::&lt;i8&gt;());<br>println!(&quot;{:?}&quot;, bool_ser.max::&lt;i8&gt;());</pre><p>When we run this, we get the following larger output:</p><pre>shape: (5,)<br>Series: &#39;bool_ser&#39; [bool]<br>[<br>        true<br>        false<br>        true<br>        false<br>        true<br>]<br>Ok(3)<br>Ok(Some(0))<br>Ok(Some(1))</pre><p>We can view the final code structure with the following bash command.</p><pre>❯ tree -I target<br>.<br>├── Cargo.lock<br>├── Cargo.toml<br>└── src<br>    └── main.rs</pre><h3>Conclusion</h3><p>While this only ended up being a few lines of code, I was excited to get Rust set up and running with Polars. 
The whole process was pretty straightforward.</p><p>I&#39;m looking forward to experimenting with reading CSV and Parquet files and exploring more DataFrame operations. I hope you give it a try as well. I&#39;m eager to learn about more advanced applications for data engineering using Rust with Polars.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=f1ec95b43a4b" width="1" height="1" alt=""><hr><p><a href="https://medium.com/learning-the-computers/casting-bools-with-polars-using-rust-f1ec95b43a4b">Casting bools with Polars using Rust</a> was originally published in <a href="https://medium.com/learning-the-computers">Learning The Computers</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Using DataFusion with Minio for (sort of remote) reads and writes]]></title>
            <link>https://medium.com/learning-the-computers/using-datafusion-with-minio-for-sort-of-remote-reads-and-writes-8f68b423f620?source=rss-4c938695f2e2------2</link>
            <guid isPermaLink="false">https://medium.com/p/8f68b423f620</guid>
            <category><![CDATA[data-fusion]]></category>
            <category><![CDATA[minio]]></category>
            <category><![CDATA[python]]></category>
            <dc:creator><![CDATA[Tyler White]]></dc:creator>
            <pubDate>Fri, 10 Jan 2025 17:40:53 GMT</pubDate>
            <atom:updated>2025-01-10T17:40:53.316Z</atom:updated>
            <content:encoded><![CDATA[<h4>Using Python and Docker for single-node data engineering efforts.</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/597/1*iauqqzAKQUtVIYg_LYj08g.png" /></figure><p>There are many new “hot” single-node engines out there, and I don’t like to pick favorites. We have DataFusion, DuckDB, pandas, and Polars, to name a few. We’re also reaching an era where the default answer is to throw everything into cloud storage systems such as Amazon S3.</p><p>Let’s combine the best of both worlds and use Docker to host a local Minio container to emulate cloud object storage, with DataFusion/Ibis reading from and writing to this container.</p><h3>Setting things up</h3><h4>uv</h4><p>I’ve been using <a href="https://github.com/astral-sh/uv"><strong>uv</strong></a> for nearly all of my projects recently, and this will be no exception. We’ll set everything up using the following commands:</p><pre>uv venv --python 3.12<br>source .venv/bin/activate<br>uv add datafusion &quot;ibis-framework[datafusion] @ git+https://github.com/ibis-project/ibis.git&quot;<br>uv add --dev python-dotenv</pre><blockquote>I’m installing Ibis from source here to ensure I have the latest features. 😃</blockquote><h4>Docker/Container Management Tool</h4><p>We must set up a quick compose file to configure Minio to run on our machine. I suggest you modify these values and target a specific image in production scenarios. 
Create a file named <strong>docker-compose.yml</strong> and populate it with the following:</p><pre>services:<br>  minio:<br>    image: quay.io/minio/minio<br>    ports:<br>      - &quot;9000:9000&quot;<br>      - &quot;9001:9001&quot;<br>    volumes:<br>      - ${HOME}/minio/data:/data<br>    command: server /data --console-address &quot;:9001&quot;<br>    depends_on:<br>      - create-data-dir<br><br>  create-data-dir:<br>    image: busybox<br>    command: mkdir -p /data<br>    volumes:<br>      - ${HOME}/minio/data:/data</pre><p>Then, using your favorite shell, execute docker compose up -d. To complete this step, you must have <a href="https://www.docker.com/">Docker</a> (or a similar tool, such as <a href="https://podman.io/">Podman</a>) installed for your operating system.</p><p>This will allow you to connect to the console endpoint at <a href="http://localhost:9001">http://localhost:9001</a>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*WWMMfqw6679LG0g_hrYeDA.png" /><figcaption>Login screen for Minio console.</figcaption></figure><p>Sign in with the default credentials:</p><ul><li>username: <strong>minioadmin</strong></li><li>password: <strong>minioadmin</strong></li></ul><p>While we’re at it, let’s add this to our project’s <strong>.env</strong> file:</p><pre>AWS_ACCESS_KEY_ID=minioadmin<br>AWS_SECRET_ACCESS_KEY=minioadmin</pre><p>Successfully logging in should pull up an empty object browser page, since we haven’t created a bucket yet. 
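</p><p>If you’d rather not click through the console, the bucket can also be created automatically at startup. As a rough sketch (the create-bucket service name and the sleep are my own additions, not part of the original setup), you could add one more service to the compose file that runs the MinIO client against the default credentials:</p>

```yaml
  # Hypothetical one-shot service: waits briefly for minio, then creates the bucket.
  create-bucket:
    image: quay.io/minio/mc
    depends_on:
      - minio
    entrypoint: >
      /bin/sh -c "
      sleep 3 &&
      mc alias set local http://minio:9000 minioadmin minioadmin &&
      mc mb --ignore-existing local/nyc-tlc
      "
```

<p>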
Navigate to “Buckets” and select “Create bucket.”</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*WhsVRQ06Vqqn_QsfkxTpmw.png" /><figcaption>The “Create Bucket” button.</figcaption></figure><p>Name your bucket <strong>nyc-tlc</strong> and add one of the yellow_tripdata Parquet files to a local directory named “data” in your repo.</p><pre>mkdir data<br>curl https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet -o data/yellow_tripdata_2023-01.parquet</pre><p>Then, upload this file using the UI to the nyc-tlc bucket.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*_y3OXB3z0_IsAg3uW7-m8Q.png" /><figcaption>The Parquet file in the Minio Object Browser.</figcaption></figure><h4>DataFusion</h4><p>Let’s create a Python file named <strong>aggregate_tripdata.py</strong> and populate it with the following code:</p><pre>import os<br><br>import datafusion<br>from dotenv import load_dotenv<br>from ibis.interactive import *<br><br>load_dotenv(override=True)<br><br>ctx = datafusion.SessionContext()<br><br>s3 = datafusion.object_store.AmazonS3(<br>    bucket_name=&quot;nyc-tlc&quot;,<br>    access_key_id=os.environ.get(&quot;AWS_ACCESS_KEY_ID&quot;),<br>    secret_access_key=os.environ.get(&quot;AWS_SECRET_ACCESS_KEY&quot;),<br>    endpoint=&quot;http://localhost:9000&quot;,<br>    allow_http=True,<br>)<br><br>ctx.register_object_store(&quot;s3://nyc-tlc/&quot;, s3)<br>ctx.register_parquet(&quot;yellow_tripdata&quot;, &quot;s3://nyc-tlc/yellow_tripdata_2023-01.parquet&quot;)</pre><p>We haven’t yet read or written any data; we’ve only configured our SessionContext to communicate with Minio and registered the Parquet file as a table.</p><p>Let’s jump into an IPython shell to complete the rest of the journey.</p><pre>uv add --dev ipython<br>source .venv/bin/activate</pre><p>Now, starting the IPython shell, we can create an Ibis backend and query our table.</p><pre>In [1]: %run aggregate_tripdata.py<br>In [2]: con = 
ibis.datafusion.from_connection(ctx)<br>In [3]: t = con.table(&quot;yellow_tripdata&quot;)<br>In [4]: t<br>Out[4]:<br>┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━┓<br>┃ VendorID ┃ tpep_pickup_datetime ┃ tpep_dropoff_datetime ┃ passenger_count ┃ … ┃<br>┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━┩<br>│ int64    │ timestamp(6)         │ timestamp(6)          │ float64         │ … │<br>├──────────┼──────────────────────┼───────────────────────┼─────────────────┼───┤<br>│        2 │ 2023-01-01 00:32:10  │ 2023-01-01 00:40:36   │             1.0 │ … │<br>│        2 │ 2023-01-01 00:55:08  │ 2023-01-01 01:01:27   │             1.0 │ … │<br>│        2 │ 2023-01-01 00:25:04  │ 2023-01-01 00:37:49   │             1.0 │ … │<br>│        1 │ 2023-01-01 00:03:48  │ 2023-01-01 00:13:25   │             0.0 │ … │<br>│        2 │ 2023-01-01 00:10:29  │ 2023-01-01 00:21:19   │             1.0 │ … │<br>│        2 │ 2023-01-01 00:50:34  │ 2023-01-01 01:02:52   │             1.0 │ … │<br>│        2 │ 2023-01-01 00:09:22  │ 2023-01-01 00:19:49   │             1.0 │ … │<br>│        2 │ 2023-01-01 00:27:12  │ 2023-01-01 00:49:56   │             1.0 │ … │<br>│        2 │ 2023-01-01 00:21:44  │ 2023-01-01 00:36:40   │             1.0 │ … │<br>│        2 │ 2023-01-01 00:39:42  │ 2023-01-01 00:50:36   │             1.0 │ … │<br>│        … │ …                    │ …                     │               … │ … │<br>└──────────┴──────────────────────┴───────────────────────┴─────────────────┴───┘</pre><p>We’re reading this from the Parquet file in Minio! Let’s do some aggregation, and we will write the result back into a “processed” folder in our Minio bucket.</p><p>We’ll aggregate the numeric column statistics by <strong>VendorID</strong> and <strong>tpep_pickup_datetime</strong> (cast as a date).</p><pre>In [5]: t_agg = (<br>    ...    
t.mutate(tpep_pickup_date=t.tpep_pickup_datetime.cast(&quot;date&quot;))<br>    ...    .group_by([&quot;VendorID&quot;, &quot;tpep_pickup_date&quot;])<br>    ...    .agg(s.across(s.numeric(), dict(min=_.min(), max=_.max(), mean=_.mean())))<br>)</pre><p>We haven’t executed this yet; we will write it to a table named “t_agg”.</p><pre>In [6]: con.create_table(&quot;t_agg&quot;, t_agg)<br>Out[6]: <br>┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━┓<br>┃ VendorID ┃ tpep_pickup_date ┃ VendorID_min ┃ passenger_count_min ┃ … ┃<br>┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━┩<br>│ int64    │ date             │ int64        │ float64             │ … │<br>├──────────┼──────────────────┼──────────────┼─────────────────────┼───┤<br>│        2 │ 2023-01-08       │            2 │                 0.0 │ … │<br>│        2 │ 2023-01-10       │            2 │                 0.0 │ … │<br>│        2 │ 2023-01-14       │            2 │                 0.0 │ … │<br>│        1 │ 2023-01-18       │            1 │                 0.0 │ … │<br>│        2 │ 2023-01-21       │            2 │                 0.0 │ … │<br>│        2 │ 2023-01-30       │            2 │                 0.0 │ … │<br>│        1 │ 2023-01-04       │            1 │                 0.0 │ … │<br>│        1 │ 2023-01-12       │            1 │                 0.0 │ … │<br>│        1 │ 2023-01-15       │            1 │                 0.0 │ … │<br>│        2 │ 2023-01-20       │            2 │                 0.0 │ … │<br>│        … │ …                │            … │                   … │ … │<br>└──────────┴──────────────────┴──────────────┴─────────────────────┴───┘</pre><p>We can easily write it back to Minio using the existing DataFusion SessionContext.</p><pre>In [7]: target_file_name = &quot;s3://nyc-tlc/processed/t_agg_2023-01.parquet&quot;<br>In [8]: ctx.table(&quot;t_agg&quot;).write_parquet(target_file_name)</pre><figure><img alt="" 
src="https://cdn-images-1.medium.com/max/1024/1*9yoMY36B2hjxmdFu-l1bfg.png" /><figcaption>The aggregated file in the Minio bucket.</figcaption></figure><p>Reading the file again directly is easy with Ibis as well.</p><pre>In [9]: con.read_parquet(&quot;s3://nyc-tlc/processed/t_agg_2023-01.parquet&quot;)<br>Out[9]:<br>┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━┓<br>┃ VendorID ┃ tpep_pickup_date ┃ VendorID_min ┃ passenger_count_min ┃ … ┃<br>┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━┩<br>│ int64    │ date             │ int64        │ float64             │ … │<br>├──────────┼──────────────────┼──────────────┼─────────────────────┼───┤<br>│        1 │ 2023-01-04       │            1 │                 0.0 │ … │<br>│        1 │ 2023-01-12       │            1 │                 0.0 │ … │<br>│        1 │ 2023-01-15       │            1 │                 0.0 │ … │<br>│        2 │ 2023-01-20       │            2 │                 0.0 │ … │<br>│        2 │ 2023-01-24       │            2 │                 0.0 │ … │<br>│        2 │ 2023-02-01       │            2 │                 1.0 │ … │<br>│        2 │ 2023-01-05       │            2 │                 0.0 │ … │<br>│        1 │ 2023-01-07       │            1 │                 0.0 │ … │<br>│        1 │ 2023-01-16       │            1 │                 0.0 │ … │<br>│        1 │ 2023-01-19       │            1 │                 0.0 │ … │<br>│        … │ …                │            … │                   … │ … │<br>└──────────┴──────────────────┴──────────────┴─────────────────────┴───┘</pre><h3>Conclusion</h3><p>This post combined several technologies to read and write data performantly against local object storage. The data processing landscape changes frequently, so it was fun to experiment with an approach like this and sharpen my DataFusion skills. 
It’s always fun to explore different dataframe APIs!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=8f68b423f620" width="1" height="1" alt=""><hr><p><a href="https://medium.com/learning-the-computers/using-datafusion-with-minio-for-sort-of-remote-reads-and-writes-8f68b423f620">Using DataFusion with Minio for (sort of remote) reads and writes</a> was originally published in <a href="https://medium.com/learning-the-computers">Learning The Computers</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Using pytest in GitLab pipelines]]></title>
            <link>https://medium.com/learning-the-computers/using-pytest-in-gitlab-pipelines-dd22854a9f4a?source=rss-4c938695f2e2------2</link>
            <guid isPermaLink="false">https://medium.com/p/dd22854a9f4a</guid>
            <category><![CDATA[gitlab]]></category>
            <category><![CDATA[pytest]]></category>
            <dc:creator><![CDATA[Tyler White]]></dc:creator>
            <pubDate>Tue, 07 Jan 2025 18:43:04 GMT</pubDate>
            <atom:updated>2025-01-08T13:47:15.148Z</atom:updated>
            <content:encoded><![CDATA[<h4>Make sure your code works like it’s supposed to.</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/609/1*COO3jJ0WvcPFCi_IurKgww.png" /></figure><h3>Introduction</h3><p>In a previous article, we set up <strong>pre-commit</strong> to enforce code quality and formatting standards in our GitLab pipeline. However, ensuring code cleanliness is only half the battle. What’s the point of having pretty code if it doesn’t work? We also need to verify our code functions as expected. This is where <a href="https://github.com/pytest-dev/pytest"><strong>pytest</strong></a>, a popular testing framework for Python, comes in.</p><p><a href="https://medium.com/learning-the-computers/using-pre-commit-in-gitlab-pipelines-3d6854968344">Using pre-commit in GitLab pipelines</a></p><p>In this article, we’ll configure and integrate pytest into our existing GitLab codebase and pipeline. We’ll create a new stage in our <strong>.gitlab-ci.yml</strong> file to run our tests, ensuring our code is both clean and functional.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/320/0*slldZJ-hip23npdt.gif" /><figcaption><a href="https://giphy.com/gifs/8xgqLTTgWqHWU">https://giphy.com/gifs/8xgqLTTgWqHWU</a></figcaption></figure><h3>Configuring the environment</h3><p>Before we use pytest, we must ensure it’s installed in our environment. I’ve been using <a href="https://docs.astral.sh/uv/"><strong>uv</strong></a> to manage project dependencies lately, so we can install pytest and specify that it’s a development dependency by executing uv add --dev pytest in our shell. If you don’t have uv, install pytest using pip by executing pip install pytest in your shell.</p><h4>Why pytest?</h4><p>Now that pytest is installed let’s talk about why we used it in the first place. 
One of the key benefits of pytest is its support for <a href="https://docs.pytest.org/en/6.2.x/fixture.html">fixtures</a>, which allow us to efficiently set up and tear down resources needed for our tests. Fixtures are setup functions that provide a fixed baseline so that tests execute reliably and consistently. We can define fixtures that run before and after each test, or even before and after a group of tests. Fixtures are handy when we need to set up and tear down complex resources, such as database connections or file systems, required for our tests to run.</p><p>With pytest, we can define fixtures using the @pytest.fixture decorator, which allows us to specify the scope of the fixture (e.g., function, class, module, etc.) and the code that should be executed to set up and tear down the fixture. The <a href="https://docs.pytest.org/en/6.2.x/reference.html?highlight=parametrize#pytest-mark-parametrize">@pytest.mark.parametrize</a> decorator allows you to run the same test function with different input parameters. This decorator is especially useful when testing a function with multiple inputs or scenarios.</p><p>By leveraging pytest’s fixture functionality, we can write tests that are easier to maintain and extend.</p><h4>Folder layout</h4><p>By default, pytest looks in your current directory and subdirectories for test files and runs any tests it finds. I prefer to keep a folder named “tests” at the root level of the repository to keep these together in a convenient location for all contributors to find.</p><pre>.<br>├── README.md<br>├── our_project<br>│   ├── __init__.py<br>│   └── classification_metrics.py<br>├── pyproject.toml<br>├── tests<br>│   ├── __init__.py<br>│   ├── conftest.py<br>│   └── test_classification_metrics.py<br>└── uv.lock</pre><p>To tell this story, we will create a fictional project that computes two binary classification metrics using <strong>pandas</strong>. We will test that the metric calculations are correct with pytest. 
We will first add pandas as a project dependency by running uv add pandas. This installs pandas and ensures we can write the classification functions we will test against.</p><h4>The functions to test against</h4><p>Functions to compute accuracy and precision scores are easy to recreate and use as examples. We’ll add a file named <strong>classification_metrics.py</strong> in the project folder, and add two functions:</p><pre>import pandas as pd<br><br><br>def accuracy_score(y_true: pd.Series, y_pred: pd.Series) -&gt; float:<br>    &quot;&quot;&quot;Calculate the accuracy of a binary classification model.<br><br>    Parameters<br>    ----------<br>    y_true : pd.Series<br>        The true labels.<br>    y_pred : pd.Series<br>        The predicted labels.<br><br>    Returns<br>    -------<br>    float<br>        The accuracy of the model.<br><br>    &quot;&quot;&quot;<br>    return (y_true == y_pred).mean()<br><br><br>def precision_score(y_true: pd.Series, y_pred: pd.Series) -&gt; float:<br>    &quot;&quot;&quot;Calculate the precision of a binary classification model.<br><br>    Parameters<br>    ----------<br>    y_true : pd.Series<br>        The true labels.<br>    y_pred : pd.Series<br>        The predicted labels.<br><br>    Returns<br>    -------<br>    float<br>        The precision of the model.<br><br>    &quot;&quot;&quot;<br>    tp = ((y_true == 1) &amp; (y_pred == 1)).sum()<br>    fp = ((y_true == 0) &amp; (y_pred == 1)).sum()<br>    return tp / (tp + fp)</pre><blockquote>The docstrings are a great help for future users in understanding how these functions can be used; usage examples would be a welcome addition as well.</blockquote><p>Both functions accept the same arguments, so they will be easy to exercise together. 
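To make the arithmetic concrete, here is a quick hand-worked example on a few made-up labels (a throwaway sketch; the values and variable names are illustrative only):

```python
import pandas as pd

# Made-up labels, chosen so the arithmetic is easy to follow by hand.
y_true = pd.Series([1, 0, 1, 1])
y_pred = pd.Series([1, 1, 1, 0])

# Accuracy: matching labels / total labels -> 2 of 4 match, so 0.5.
accuracy = (y_true == y_pred).mean()

# Precision: tp / (tp + fp) -> 2 true positives, 1 false positive, so 2/3.
tp = ((y_true == 1) & (y_pred == 1)).sum()
fp = ((y_true == 0) & (y_pred == 1)).sum()
precision = tp / (tp + fp)

print(accuracy)   # 0.5
print(precision)  # 0.666...
```

scikit-learn’s accuracy_score and precision_score report these same values for these inputs, which is exactly the equivalence our tests will check.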
This is a great use case for a pytest fixture to share a dataset across tests, and for a pytest marker to parametrize a single test function to cover both metrics.</p><h4>Creating a test fixture</h4><p>In this project, we will likely want a dataframe readily available to test against. A shared fixture ensures code is tested and evaluated against the same dataset. This fixture will be defined in a file called <strong>conftest.py</strong>, located inside our <strong>tests</strong> folder:</p><pre>import pandas as pd<br>import pytest<br><br><br>@pytest.fixture(scope=&quot;session&quot;)<br>def predictions() -&gt; pd.DataFrame:<br>    d = {<br>         &quot;id&quot;: range(1, 13),<br>         &quot;actual&quot;: [1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1],<br>         &quot;prediction&quot;: [1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1],<br>    }<br>    yield pd.DataFrame(d)</pre><p>This allows us to pass predictions as an argument to any test function that needs to reference this dataframe. Let’s put together our test function now.</p><h4>Creating the test function</h4><p>We can test both our accuracy and precision metric functions with a single test function.</p><p>I want to ensure our code works against scikit-learn’s metrics with the help of parametrize decorators. 
We will add that as an additional development dependency by running uv add --dev scikit-learn.</p><p>We can create a new file named <strong>test_classification_metrics.py</strong> inside our <strong>tests</strong> folder:</p><pre>import pytest<br>import sklearn.metrics<br><br>import our_project.classification_metrics<br><br><br>@pytest.mark.parametrize(<br>    &quot;metric_name&quot;,<br>    [<br>        pytest.param(&quot;accuracy_score&quot;, id=&quot;accuracy_score&quot;),<br>        pytest.param(&quot;precision_score&quot;, id=&quot;precision_score&quot;),<br>    ],<br>)<br>def test_classification_metrics(predictions, metric_name):<br>    our_project_func = getattr(our_project.classification_metrics, metric_name)<br>    sklearn_func = getattr(sklearn.metrics, metric_name)<br>    result = our_project_func(predictions[&quot;actual&quot;], predictions[&quot;prediction&quot;])<br>    expected = sklearn_func(predictions[&quot;actual&quot;], predictions[&quot;prediction&quot;])<br>    assert result == pytest.approx(expected, abs=1e-4)</pre><p>We can kick off our tests by running pytest in our shell. We should see two green dots indicating the tests have passed and our functions are working as intended. This is a convenient way to test both our accuracy_score and precision_score functions against scikit-learn’s equivalent functions.</p><h3>Integrating this to run automatically with GitLab pipelines</h3><p>This is excellent progress, but we’re not out of the woods yet. We can’t always count on individual contributors to test code locally before submitting a merge request, and even if we could, it’s nice as a maintainer to ensure any merge requests pass tests during review. We explored automating pre-commit with our <strong>.gitlab-ci.yml</strong> file. 
We’ll add a test stage to this file now to run pytest on any commits on the main branch or when any merge request is opened in addition to pre-commit so that the complete file looks like this:</p><pre>variables:<br>  UV_VERSION: 0.5<br>  PYTHON_VERSION: 3.12<br>  BASE_LAYER: bookworm-slim<br>  UV_CACHE_DIR: .uv-cache<br>  UV_SYSTEM_PYTHON: 1<br>stages:<br>  - build<br>  - test<br>pre-commit:<br>  stage: build<br>  image: python:3.11<br>  script:<br>    - pip install pre-commit<br>    - pre-commit run --all-files<br>  rules:<br>    - if: $CI_PIPELINE_SOURCE == &quot;merge_request_event&quot;<br>    - if: $CI_COMMIT_BRANCH == &quot;main&quot;<br>pytest:<br>  stage: test<br>  image: ghcr.io/astral-sh/uv:$UV_VERSION-python$PYTHON_VERSION-$BASE_LAYER<br>  cache:<br>    - key:<br>        files:<br>          - uv.lock<br>      paths:<br>        - $UV_CACHE_DIR<br>  script:<br>    - uv sync --all-extras<br>    - uv run pytest<br>  rules:<br>    - if: $CI_PIPELINE_SOURCE == &quot;merge_request_event&quot;<br>    - if: $CI_COMMIT_BRANCH == &quot;main&quot;</pre><blockquote>We’re using uv to run pytest and sync our environment dependencies.</blockquote><p>GitLab will run our tests if the build stage (pre-commit) succeeds.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/932/1*ohT1uWRBejq_xZVqYQWMcQ.png" /><figcaption>Successful merge pipeline with three stages, including a compliance stage.</figcaption></figure><h3>Conclusion</h3><p>I have done this sort of thing with GitHub plenty of times, so it’s been a bit of a challenge to adjust to GitLab, but it’s been fun to learn how to get this all working. 
I am eager to improve this process and hope to continue sharing what I learn.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=dd22854a9f4a" width="1" height="1" alt=""><hr><p><a href="https://medium.com/learning-the-computers/using-pytest-in-gitlab-pipelines-dd22854a9f4a">Using pytest in GitLab pipelines</a> was originally published in <a href="https://medium.com/learning-the-computers">Learning The Computers</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Using pre-commit in GitLab pipelines]]></title>
            <link>https://medium.com/learning-the-computers/using-pre-commit-in-gitlab-pipelines-3d6854968344?source=rss-4c938695f2e2------2</link>
            <guid isPermaLink="false">https://medium.com/p/3d6854968344</guid>
            <category><![CDATA[pre-commit]]></category>
            <category><![CDATA[gitlab]]></category>
            <dc:creator><![CDATA[Tyler White]]></dc:creator>
            <pubDate>Fri, 03 Jan 2025 22:20:22 GMT</pubDate>
            <atom:updated>2025-01-08T13:46:48.263Z</atom:updated>
            <content:encoded><![CDATA[<h4>Catch code quality errors before they happen.</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/499/1*hILBf2vvRdHSeG7DxtU7uw.png" /></figure><p>As developers, we’ve all been there — you’ve spent hours working on a feature, only to have it break when you merge it into the main branch. Or, you’ve pushed code to production only to realize it’s riddled with errors and typos. <a href="https://pre-commit.com/">pre-commit</a> is very convenient for enforcing high-quality commits and keeping the codebase tidy. I’ve been using it for a while now, and I appreciate how easy it is to set up and use with my projects. Using it with GitHub Actions is a breeze. However, we use GitLab for various projects at work.</p><p>In this article, we’ll explore how to automate pre-commit checks in GitLab <a href="https://docs.gitlab.com/ee/ci/pipelines/">pipelines</a>. We enforce quality and formatting standards with pre-commit <a href="https://pre-commit.com/hooks.html">hooks</a>.</p><h3>What are pre-commit hooks?</h3><p>Pre-commit hooks are scripts that automatically execute before you commit code to your repository. They help enforce coding standards and check for syntax errors. These hooks run every time a commit is made, providing a thorough check. By performing these checks before the code is committed, you can identify and resolve issues early, preventing them from entering your remote branch.</p><p>It’s never fun to push your exciting feature only to have the first comment on your merge request say something like, “Please format the code.”</p><h3>How do pre-commit hooks work in GitLab pipelines?</h3><p>In GitLab, you can use pre-commit hooks as part of your CI/CD pipeline. When you push code to your repository, GitLab will run the pre-commit hook before the code is committed. 
If a hook fails, the pipeline fails, and you’ll receive an error message indicating what went wrong.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/643/1*dQeumPnEaRA5AeSt5n8Qqw.png" /><figcaption>A pipeline error in GitLab where pre-commit failed.</figcaption></figure><h3>How to set up pre-commit hooks in GitLab pipelines</h3><p>Setting up pre-commit hooks in GitLab pipelines is relatively straightforward. Here are the steps:</p><h4>Create a .pre-commit-config.yaml file</h4><p>In the root of your repository, create a new file called <strong>.pre-commit-config.yaml</strong>. This file lists the hooks that will run against your code.</p><h4>Define your hooks</h4><p>In the <strong>.pre-commit-config.yaml</strong> file, define the hooks you want to run. For example, run a linter to check for syntax errors or a formatter to enforce coding standards.</p><p>Here’s an example of what your <strong>.pre-commit-config.yaml</strong> file might look like:</p><pre>repos:<br>  - repo: https://github.com/pre-commit/pre-commit-hooks<br>    rev: v5.0.0<br>    hooks:<br>      - id: end-of-file-fixer<br>      - id: trailing-whitespace<br>  - repo: https://github.com/astral-sh/ruff-pre-commit<br>    rev: v0.8.5<br>    hooks:<br>      - id: ruff<br>        args: [--fix]<br>      - id: ruff-format<br>  - repo: https://github.com/codespell-project/codespell<br>    rev: v2.3.0<br>    hooks:<br>      - id: codespell<br>  - repo: https://github.com/google/yamlfmt<br>    rev: v0.14.0<br>    hooks:<br>      - id: yamlfmt</pre><p>These hooks fix missing end-of-file newlines, strip trailing whitespace, enforce ruff linting and formatting standards (which can be tuned further with <a href="https://docs.astral.sh/ruff/configuration/">ruff configuration</a> settings), flag our typos, and format our YAML files.</p><h4>Configure your pipeline</h4><p>To combine all of this to run automatically, we will need to set up a file so GitLab will know what to 
do. In your <strong>.gitlab-ci.yml</strong> configure your pipeline to run the pre-commit hook before the code is committed.</p><p>And here’s an example of what your <strong>.gitlab-ci.yml</strong> file might look like:</p><pre>stages:<br>  - build<br>pre-commit:<br>  stage: build<br>  image: python:3.11<br>  script:<br>    - pip install pre-commit<br>    - pre-commit run --all-files<br>  rules:<br>    - if: $CI_PIPELINE_SOURCE == &quot;merge_request_event&quot;<br>    - if: $CI_COMMIT_BRANCH == &quot;main&quot;</pre><p>This will use Python to execute pre-commit anytime a commit occurs on the main branch, a merge request is created, or a commit is pushed to said merge request. We put this in the “build” stage, but GitLab has a few stages available out of the box. You might want to use the “test” or “deploy” stage for other actions.</p><h3>Conclusion</h3><p>It wasn’t too difficult to adjust from GitHub, but I’m sure there are ways this workflow could be improved, possibly with caching. As I continue to work with GitLab, I look forward to sharing what I have learned.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=3d6854968344" width="1" height="1" alt=""><hr><p><a href="https://medium.com/learning-the-computers/using-pre-commit-in-gitlab-pipelines-3d6854968344">Using pre-commit in GitLab pipelines</a> was originally published in <a href="https://medium.com/learning-the-computers">Learning The Computers</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[PyIceberg — Trying out the SQLite Catalog]]></title>
            <link>https://medium.com/learning-the-computers/pyiceberg-trying-out-the-sqlite-catalog-d7ace2a4ca5f?source=rss-4c938695f2e2------2</link>
            <guid isPermaLink="false">https://medium.com/p/d7ace2a4ca5f</guid>
            <category><![CDATA[iceberg-table]]></category>
            <category><![CDATA[python]]></category>
            <category><![CDATA[duckdb]]></category>
            <category><![CDATA[pandas]]></category>
            <dc:creator><![CDATA[Tyler White]]></dc:creator>
            <pubDate>Mon, 23 Dec 2024 20:29:45 GMT</pubDate>
            <atom:updated>2024-12-24T12:28:55.643Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*wvwKDdB8ePmuh-fj47haDQ.png" /></figure><h3>PyIceberg — Trying out the SQLite Catalog</h3><p>Configuring a development environment, reading and writing to an SQL catalog, and exploring snapshots. In this article, we’ll work with PyIceberg, Ibis, DuckDB, and PyArrow.</p><p>I’ve been following the PyIceberg project for a while now. This video inspired me to give it a try!</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FePId2izONVo%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DePId2izONVo&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FePId2izONVo%2Fhqdefault.jpg&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/d12f455d00a84d6537e8dd8b124af439/href">https://medium.com/media/d12f455d00a84d6537e8dd8b124af439/href</a></iframe><p>In this article, we’ll create and configure a local catalog, load the starwars dataset, create an Iceberg table, populate it, and then query it with the PyIceberg API. You’ll need both Ibis and PyIceberg installed with extras to follow along. These dependencies can be installed using the following command within your virtual environment:</p><pre>pip install &quot;ibis-framework[duckdb,examples]&quot; &quot;pyiceberg[sql-sqlite]&quot;</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/910/1*GZuUpBA8J-IXRLSJ-dfIxQ.png" /></figure><h3>Configuring a local catalog</h3><p>PyIceberg supports loading a configuration file named <strong>.pyiceberg.yaml</strong> that helps to manage credentials. 
By default, this file is searched for in the home directory (~/), but an alternate location can be provided using the PYICEBERG_HOME environment variable.</p><p>The file I’m using contains the following content:</p><pre>catalog:<br>  default:<br>    type: sql<br>    uri: sqlite:////tmp/warehouse/pyiceberg_catalog.db<br>    warehouse: file:///tmp/warehouse</pre><p>This will require creating the /tmp/warehouse directory, but that’s all we’ll need to get started.</p><pre>mkdir /tmp/warehouse</pre><blockquote>The SQLite catalog is intended for exploratory or development purposes.</blockquote><p>We’ll load our catalog in a later section; this step is in preparation.</p><h3>Loading the starwars data</h3><p>Ibis ships with an examples module that includes a starwars dataset for experimentation. As the default Ibis backend is DuckDB, we’ll use that transitively. DuckDB will read the Parquet file, and we’ll use Ibis to convert this to a PyArrow table.</p><pre>import ibis<br><br>starwars = ibis.examples.starwars.fetch().to_pyarrow()<br>starwars</pre><pre>pyarrow.Table<br>name: string<br>height: int64<br>mass: double<br>hair_color: string<br>skin_color: string<br>eye_color: string<br>birth_year: double<br>sex: string<br>gender: string<br>homeworld: string<br>species: string<br>films: string<br>vehicles: string<br>starships: string<br>----<br>name: [[&quot;Luke Skywalker&quot;,&quot;C-3PO&quot;,&quot;R2-D2&quot;,&quot;Darth Vader&quot;,&quot;Leia Organa&quot;,...,&quot;Finn&quot;,&quot;Rey&quot;,&quot;Poe Dameron&quot;,&quot;BB8&quot;,&quot;Captain Phasma&quot;]]<br>height: [[172,167,96,202,150,...,null,null,null,null,null]]<br>mass: [[77,75,32,136,49,...,null,null,null,null,null]]<br>hair_color: [[&quot;blond&quot;,null,null,&quot;none&quot;,&quot;brown&quot;,...,&quot;black&quot;,&quot;brown&quot;,&quot;brown&quot;,&quot;none&quot;,&quot;none&quot;]]<br>skin_color: [[&quot;fair&quot;,&quot;gold&quot;,&quot;white, 
blue&quot;,&quot;white&quot;,&quot;light&quot;,...,&quot;dark&quot;,&quot;light&quot;,&quot;light&quot;,&quot;none&quot;,&quot;none&quot;]]<br>eye_color: [[&quot;blue&quot;,&quot;yellow&quot;,&quot;red&quot;,&quot;yellow&quot;,&quot;brown&quot;,...,&quot;dark&quot;,&quot;hazel&quot;,&quot;brown&quot;,&quot;black&quot;,&quot;unknown&quot;]]<br>birth_year: [[19,112,33,41.9,19,...,null,null,null,null,null]]<br>sex: [[&quot;male&quot;,&quot;none&quot;,&quot;none&quot;,&quot;male&quot;,&quot;female&quot;,...,&quot;male&quot;,&quot;female&quot;,&quot;male&quot;,&quot;none&quot;,&quot;female&quot;]]<br>gender: [[&quot;masculine&quot;,&quot;masculine&quot;,&quot;masculine&quot;,&quot;masculine&quot;,&quot;feminine&quot;,...,&quot;masculine&quot;,&quot;feminine&quot;,&quot;masculine&quot;,&quot;masculine&quot;,&quot;feminine&quot;]]<br>homeworld: [[&quot;Tatooine&quot;,&quot;Tatooine&quot;,&quot;Naboo&quot;,&quot;Tatooine&quot;,&quot;Alderaan&quot;,...,null,null,null,null,null]]<br>...</pre><h3>Connecting to the catalog and creating the table</h3><p>Now that we’ve got our PyArrow Table after querying a Parquet file behind the scenes with DuckDB, we can create our Iceberg table using the schema.</p><p>First, we’ll need to load the catalog we created in a previous step.</p><pre>from pyiceberg.catalog import load_catalog<br><br>catalog = load_catalog(&quot;default&quot;)</pre><p>As this is a new catalog, we’ll need to create a namespace before we can create our table.</p><pre>catalog.create_namespace_if_not_exists(&quot;default&quot;)</pre><p>Now that we have a default catalog, we can create a table to align with the schema of the starwars PyArrow table.</p><pre>catalog.create_table_if_not_exists(&quot;default.starwars&quot;, starwars.schema)</pre><pre>starwars(<br>  1: name: optional string,<br>  2: height: optional long,<br>  3: mass: optional double,<br>  4: hair_color: optional string,<br>  5: skin_color: optional string,<br>  6: eye_color: optional string,<br>  7: 
birth_year: optional double,<br>  8: sex: optional string,<br>  9: gender: optional string,<br>  10: homeworld: optional string,<br>  11: species: optional string,<br>  12: films: optional string,<br>  13: vehicles: optional string,<br>  14: starships: optional string<br>),<br>partition by: [],<br>sort order: [],<br>snapshot: null</pre><p>We have our table here, but it’s empty. We can get an instance of the table with load_table.</p><pre>table = catalog.load_table(&quot;default.starwars&quot;)</pre><p>We can create a scan of the table and bring it to pandas to see that we have an Empty DataFrame.</p><pre>table.scan().to_pandas()</pre><pre>Empty DataFrame<br>Columns: [name, height, mass, hair_color, skin_color, eye_color, birth_year, sex, gender, homeworld, species, films, vehicles, starships]<br>Index: []</pre><p>Let’s fix this by appending the PyArrow table we previously loaded.</p><pre>table.append(starwars)</pre><p>It’s loaded up!</p><h3>Deleting the Droids!</h3><p>We can delete rows from the table by using a filter in the delete method. This is demonstrative only so that we can experiment with the snapshot IDs in a later step. This also serves as an introduction to the PyIceberg expression API.</p><p>First, let’s see how many rows we have before performing the delete operation.</p><pre>table.inspect.partitions()[&quot;record_count&quot;][0].as_py()</pre><pre>87</pre><p>Let’s remove any rows where species == ‘Droid’ and recheck the count.</p><pre>from pyiceberg.expressions import EqualTo<br><br>table.delete(EqualTo(&quot;species&quot;, &quot;Droid&quot;))</pre><p>“C-3PO”, “R2-D2&quot;, “R5-D4&quot;, “IG-88&quot;, “R4-P17”, and “BB8” should no longer be in the table, giving us six fewer rows.</p><pre>table.inspect.partitions()[&quot;record_count&quot;][0].as_py()</pre><pre>81</pre><h3>Scanning the Iceberg table</h3><p>When we use the scan method on an Iceberg table, we can optionally provide a filter, columns we wish to include, a limit, and a snapshot ID. 
It’s generally up to the engine to filter the files, but the scan’s filter will help produce files that might contain matching rows.</p><h4>Checking snapshots</h4><p>Since we removed a few rows in the previous section, we should now have two snapshots, which can be checked with len(table.snapshots()).</p><p>We can look at the snapshot_id of the snapshots to query our table from before the deletion occurred with [snapshot.snapshot_id for snapshot in table.snapshots()]. In my case, I have snapshots ending in 80693 and 35347, with the former including the droids.</p><pre>table.scan(<br>    selected_fields=(&quot;name&quot;, &quot;species&quot;), snapshot_id=3932249899255080693<br>).to_pandas()</pre><pre>              name species<br>0   Luke Skywalker   Human<br>1            C-3PO   Droid<br>2            R2-D2   Droid<br>3      Darth Vader   Human<br>4      Leia Organa   Human<br>..             ...     ...<br>82            Finn   Human<br>83             Rey   Human<br>84     Poe Dameron   Human<br>85             BB8   Droid<br>86  Captain Phasma   Human<br><br>[87 rows x 2 columns]</pre><p>If we exclude the snapshot_id, it will use the latest snapshot, where we will see the droids are no longer present.</p><pre>table.scan(selected_fields=(&quot;name&quot;, &quot;species&quot;)).to_pandas()</pre><pre>                  name species<br>0       Luke Skywalker   Human<br>1          Darth Vader   Human<br>2          Leia Organa   Human<br>3            Owen Lars   Human<br>4   Beru Whitesun Lars   Human<br>..                 ...     
...<br>76          Tion Medon  Pau&#39;an<br>77                Finn   Human<br>78                 Rey   Human<br>79         Poe Dameron   Human<br>80      Captain Phasma   Human<br><br>[81 rows x 2 columns]</pre><h3>The scan can quack</h3><p>We’ll bring scans to DuckDB, process them further using Ibis, and then return a pandas DataFrame.</p><h4>Counting the species</h4><p>Using scan on a PyIceberg Table object allows us to use the to_duckdb method, which returns a DuckDBPyConnection. We can pass the Ibis DuckDB backend connection to this method to work with an existing connection.</p><pre>import pandas as pd<br>import pyiceberg.catalog<br>from ibis import _<br><br><br>def species_counts(<br>    catalog: pyiceberg.catalog.Catalog, connection: ibis.BaseBackend<br>) -&gt; pd.DataFrame:<br>    table = catalog.load_table(&quot;default.starwars&quot;)<br>    table.scan(selected_fields=(&quot;species&quot;,)).to_duckdb(<br>        &quot;starwars_species&quot;, connection=connection.con<br>    )<br>    expr = (<br>        connection.table(&quot;starwars_species&quot;)<br>        .group_by(&quot;species&quot;)<br>        .aggregate(species_count=_.species.count())<br>        .order_by(_.species_count.desc())<br>        .limit(10)<br>    )<br>    return expr.to_pandas()<br><br><br>con = ibis.duckdb.connect()<br>species_counts(catalog, con)</pre><pre>           species  species_count<br>0            Human             35<br>1           Gungan              3<br>2          Twi&#39;lek              2<br>3         Kaminoan              2<br>4           Zabrak              2<br>5          Wookiee              2<br>6         Mirialan              2<br>7             Hutt              1<br>8            Xexto              1<br>9             Ewok              1</pre><h4>Counting the home worlds</h4><p>Following the same pattern we used for the species, we can count the homeworlds.</p><pre>def homeworlds_counts(<br>    catalog: pyiceberg.catalog.Catalog, connection: ibis.BaseBackend<br>) -&gt; 
pd.DataFrame:<br>    table = catalog.load_table(&quot;default.starwars&quot;)<br>    table.scan(selected_fields=(&quot;homeworld&quot;,)).to_duckdb(<br>        &quot;starwars_homeworlds&quot;, connection=connection.con<br>    )<br>    expr = (<br>        connection.table(&quot;starwars_homeworlds&quot;)<br>        .group_by(&quot;homeworld&quot;)<br>        .aggregate(homeworld_count=_.homeworld.count())<br>        .order_by(_.homeworld_count.desc())<br>        .limit(10)<br>    )<br>    return expr.to_pandas()<br><br><br>homeworlds_counts(catalog, con)</pre><pre>    homeworld  homeworld_count<br>0       Naboo               10<br>1    Tatooine                8<br>2      Kamino                3<br>3   Coruscant                3<br>4    Alderaan                3<br>5      Ryloth                2<br>6    Kashyyyk                2<br>7      Mirial                2<br>8    Corellia                2<br>9  Bestine IV                1</pre><h3>Conclusion</h3><p>PyIceberg is pretty neat! Table formats have been an interesting topic as of late, and I suspect they will remain a popular subject. I found it intuitive to start with PyIceberg, especially with the SQLite catalog. While this only demonstrated potential capabilities on a toy dataset, I’m looking forward to trying this with more data.</p><hr><p><a href="https://medium.com/learning-the-computers/pyiceberg-trying-out-the-sqlite-catalog-d7ace2a4ca5f">PyIceberg — Trying out the SQLite Catalog</a> was originally published in <a href="https://medium.com/learning-the-computers">Learning The Computers</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Powering a PDF Chatbot with Cortex Search and Document AI]]></title>
            <link>https://medium.com/snowflake/powering-a-pdf-chatbot-with-cortex-search-and-document-ai-5eae6defe4c9?source=rss-4c938695f2e2------2</link>
            <guid isPermaLink="false">https://medium.com/p/5eae6defe4c9</guid>
            <category><![CDATA[llm]]></category>
            <category><![CDATA[snowflake]]></category>
            <category><![CDATA[streamlit]]></category>
            <dc:creator><![CDATA[Tyler White]]></dc:creator>
            <pubDate>Sat, 14 Dec 2024 15:41:23 GMT</pubDate>
            <atom:updated>2024-12-14T15:41:23.962Z</atom:updated>
            <content:encoded><![CDATA[<h4>By: Tyler White &amp; <a href="https://medium.com/@katy.haynie">Katy Haynie</a></h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Up4LA5U12h2H6EUqdbSGlg.png" /><figcaption>Architecture diagram for PDF Chatbot</figcaption></figure><h3>Why we needed something</h3><p>Unstructured data can be difficult to manage, including formats like video, audio, and specialized types that do not fit predefined models. PDFs are commonly used for documents across various industries, but extracting useful information from them can be challenging. While PDFs are structured with text and graphics, they lack a consistent layout, making it hard to find specific information without maintaining metadata. This often leads to folders filled with numerous PDFs, where users open files based only on their titles.</p><p>For procurement and legal teams, using Retrieval-Augmented Generation (RAG) architecture on PDF files can result in significant operational efficiencies. By leveraging RAG, teams can quickly retrieve key information from large volumes of procurement-related documents — such as contracts, agreements, and compliance records — without manually combing through them. This reduces the time spent on routine tasks, enabling procurement teams to respond faster to supplier inquiries, identify cost-saving opportunities, and ensure compliance with contractual terms. RAG provides added value by enhancing the ability to extract and review specific clauses, obligations, and risks from complex contracts. This can greatly streamline the contract review process, reduce the potential for oversight, and support faster, more accurate decision-making. 
Overall, RAG boosts operational efficiency and empowers legal teams to manage their workloads more effectively, improving productivity and accuracy in day-to-day operations.</p><h3>What we wanted to build</h3><p>Osmose is a utility services company that assists its customers in managing their structural assets and infrastructure. The company is dedicated to ensuring that the electrical grid is as strong, safe, and resilient as the communities it serves. Osmose holds numerous contracts across the U.S. and Canada that require review by the procurement team to ensure compliance with the agreed-upon services.</p><p>For Osmose, the solution we built allowed them to extract data from their contracts, answering questions like “Is there a non-compete clause in this contract?”. We wanted to enable users to ask questions, extract text, and search over documents. We wanted them to have an interface to select and filter down to specific documents given their contents to be more precise with their targeted selection. The solution leverages a <a href="https://docs.snowflake.com/en/user-guide/snowflake-cortex/document-ai/overview">Document AI</a> model to extract that information across all the available contracts. When employees at Osmose want to drill down and ask specific questions against an individual contract they can do so using a combination of <a href="https://docs.snowflake.com/en/user-guide/snowflake-cortex/cortex-search/cortex-search-overview">Cortex Search</a> (hybrid, vector and keyword, search engine on text) and the <a href="https://docs.snowflake.com/en/user-guide/snowflake-cortex/llm-functions#label-cortex-llm-complete-function">Cortex Complete</a> function to summarize the results. 
The combination of Cortex Search and Cortex Complete allows users to easily generate a new non-compete clause by analyzing the existing contracts with non-compete clauses.</p><h3>Solution Overview</h3><p>This solution analyzes contracts using a combination of Document AI, Cortex Complete, and Cortex Search. It allows the user to extract specific information in a tabular format from a document and then ask natural language questions about the set of documents, facilitating a better understanding of each document. The tabular data is leveraged as filters on the Cortex Search service to reduce the number of rows the service needs to look across to return the most relevant information. The Snowflake Cortex Complete function contextualizes the responses to the questions asked against the search service.</p><p><strong>Key Functionality</strong></p><p><strong>Extraction</strong> (<a href="https://docs.snowflake.com/en/user-guide/snowflake-cortex/document-ai/overview">Document AI</a>)</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*NzUH_3n5KbdMvbXO" /></figure><ul><li>Uses a proprietary large language model to extract data from PDFs</li></ul><p><strong>Document Retrieval</strong> (<a href="https://docs.snowflake.com/en/user-guide/snowflake-cortex/cortex-search/cortex-search-overview">Cortex Search</a>)</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*PPl6RZkFyG_-EQY5" /></figure><ul><li>RAG engine for LLM chatbots</li></ul><p><strong>Question Answering</strong> (<a href="https://docs.snowflake.com/en/sql-reference/functions/complete-snowflake-cortex">Cortex Complete</a>)</p><ul><li>Given a prompt, generates a response using your choice of supported language model</li></ul><p><strong>Data Preprocessing</strong> (<a href="https://docs.snowflake.com/en/sql-reference/functions/parse_document-snowflake-cortex">Parse Document</a> / <a href="https://docs.snowflake.com/en/developer-guide/snowpark/python/creating-udfs">Python
UDFs</a>)</p><ul><li>Snowflake PARSE_DOCUMENT function to extract text from documents using the OCR method</li><li>Python UDFs to count page numbers on PDFs and chunk extractions from PARSE_DOCUMENT function</li></ul><p><strong>Automation</strong> (<a href="https://docs.snowflake.com/en/user-guide/tasks-intro">Tasks</a>)</p><ul><li>Auto-processing pipeline for contracts</li></ul><p><strong>User Interface</strong> (<a href="https://docs.snowflake.com/en/developer-guide/streamlit/about-streamlit">Streamlit in Snowflake</a>)</p><ul><li>Simple application for users to interact with their Cortex Search Services and ask questions against their PDFs</li></ul><h3>Closing thoughts</h3><p>It’s nice to be able to leverage your documents and answer questions that are specific to your team or company. In this article, we presented a single solution with which organizations can quickly get value from thousands of documents and precisely find what they’re looking for.</p><p>If you would like to give this a try yourself, please see the following repo on GitHub:</p><p><a href="https://github.com/Snowflake-Labs/sfguide-build-contracts-chatbot-using-documentai-cortex-search-in-snowflake">https://github.com/Snowflake-Labs/sfguide-build-contracts-chatbot-using-documentai-cortex-search-in-snowflake</a></p><hr><p><a href="https://medium.com/snowflake/powering-a-pdf-chatbot-with-cortex-search-and-document-ai-5eae6defe4c9">Powering a PDF Chatbot with Cortex Search and Document AI</a> was originally published in <a href="https://medium.com/snowflake">Snowflake Builders Blog: Data Engineers, App Developers, AI, &amp; Data Science</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Snowpark ML — An Example Project]]></title>
            <link>https://medium.com/snowflake/snowpark-ml-an-example-project-5627e212520c?source=rss-4c938695f2e2------2</link>
            <guid isPermaLink="false">https://medium.com/p/5627e212520c</guid>
            <category><![CDATA[snowpark]]></category>
            <category><![CDATA[snowflake]]></category>
            <dc:creator><![CDATA[Tyler White]]></dc:creator>
            <pubDate>Tue, 30 Apr 2024 15:05:16 GMT</pubDate>
            <atom:updated>2024-06-15T15:13:50.840Z</atom:updated>
            <content:encoded><![CDATA[<h3>Snowflake ML — An Example Project</h3><h4>A brief walkthrough of a template project for Snowflake ML.</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*cP3M82jPEwPpgT9K7C341A.png" /><figcaption>DALL·E 3: A colorful, fictional world of the 1800s, featuring a polar bear and two human data scientists.</figcaption></figure><h3>Background</h3><p>The Snowflake ML Modeling API has been generally available since early December 2023, and the Snowflake Model Registry has been in public preview since January 2024.</p><p>There are a few examples of working with these packages out there that we frequently visit for inspiration and reference. <a href="https://medium.com/u/bc62ab47b4a9">Chase Romano</a> and <a href="https://medium.com/u/829dc2ec3eae">Sikha Das</a> have shared some great material on this subject.</p><ul><li><a href="https://github.com/cromano8/Snowflake_ML_Intro">GitHub - cromano8/Snowflake_ML_Intro: Introduction to performing Machine Learning on Snowflake</a></li><li><a href="https://quickstarts.snowflake.com/guide/intro_to_machine_learning_with_snowpark_ml_for_python">Intro to Machine Learning with Snowpark ML</a></li></ul><p>These examples use notebooks, which are great but can often be challenging to maintain in repositories. We frequently encounter scenarios where users want to keep their code in Python scripts and operationalize it via traditional orchestration tools and/or Snowflake tasks.</p><p><a href="https://medium.com/u/3eeeaa27a8c9">Kirk Mason</a> and I opted to implement our project as a Python module. This helps avoid needing to append a relative path to the system path to specify imports, which we often see in notebooks. 
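</p><p>As an aside, the kind of helper that lives in such a module is something like the get_next_version imported later in the train script. The article doesn’t show its body, so here is one possible sketch; the FakeRegistry stub, its show_versions method, and the V_1, V_2, … naming scheme are all assumptions for illustration, not the actual Snowflake Model Registry API:</p>

```python
class FakeRegistry:
    """Stand-in for a model registry, purely for illustration."""

    def __init__(self, version_names: list[str]):
        self._version_names = version_names

    def show_versions(self, model_name: str) -> list[str]:
        # A real registry would look versions up by model name.
        return self._version_names


def get_next_version(registry: FakeRegistry, model_name: str) -> str:
    """Return the next sequential version name, e.g. V_2 after V_1."""
    names = registry.show_versions(model_name)
    if not names:
        return "V_1"
    latest = max(int(name.rsplit("_", 1)[1]) for name in names)
    return f"V_{latest + 1}"


print(get_next_version(FakeRegistry(["V_1", "V_2"]), "MY_MODEL"))  # V_3
```

<p>Keeping helpers like this in my_project.common is what lets both the scripts and the registered stored procedures import them the same way.</p><p>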
The scripts we will walk through can be executed directly in Python or registered via Snowpark as Stored Procedures, later executed as Snowflake tasks in a DAG.</p><p>We think having a well-organized project structure is essential, so that’s what we’ll outline here.</p><h3>The Project Structure</h3><p>Having a template or framework is essential to start any project. This helps with onboarding new team members as well. While there may be variations of this to fit specific needs, here’s an outline of what we find to be a good starting point.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/944/1*TALLlvyL3VHz7NayJjXT4w.png" /><figcaption>Example Project Structure</figcaption></figure><p>Let’s break down these sections and the various files at the top level.</p><h4>docs</h4><p>The docs folder contains any project-specific documentation needed to encompass requirements and help enable future developers and stakeholders.</p><h4>scratch</h4><p>The scratch folder contains any notebook code, experimentation, or exploration that should not be executed automatically.</p><h4>my_project (or src)</h4><p>This folder serves as the “heart” of our project. Treating code as a Python package has several advantages. It enhances modularity, organization, and readability by separating related functionalities into distinct modules. It also provides namespace isolation to avoid naming clashes and promotes reusability, minimizing code duplication.</p><p>I mentioned in the earlier section that providing options to run these Python scripts directly or by registering the functions was important. Here’s how we did that. 
The code is shortened for brevity, as the model training code isn’t the main focus here.</p><pre>from snowflake.ml.modeling.impute import SimpleImputer<br>from snowflake.ml.modeling.pipeline import Pipeline<br>from snowflake.ml.modeling.xgboost import XGBRegressor<br>from snowflake.ml.registry import Registry<br>from snowflake.snowpark import Session<br>from snowflake.snowpark import functions as F<br>from snowflake.snowpark import types as T<br>from my_project.common import get_next_version<br>from my_project.common import get_performance_metrics<br><br>import logging<br><br><br>def train(session: Session) -&gt; str:<br>    logger = logging.getLogger(__name__)<br>    logger.info(<br>        &quot;{&#39;message&#39;:&#39;Begin model training procedure&#39;, &#39;unit&#39;:&#39;analytics&#39;}&quot;<br>    )<br>    ...<br><br>    pipeline = Pipeline(<br>        [<br>            ...<br>        ]<br>    )<br>    pipeline.fit(train_df)<br><br>    logger.info(<br>        &quot;{&#39;message&#39;:&#39;Obtain metrics&#39;, &#39;unit&#39;:&#39;analytics&#39;}&quot;<br>    )<br><br>    train_result_df = pipeline.predict(train_df)<br>    test_result_df = pipeline.predict(test_df)<br><br>    combined_metrics = dict(<br>        train_metrics=get_performance_metrics(<br>            &quot;REGRESSION&quot;, train_result_df, &quot;PRICE&quot;, &quot;OUTPUT_PRICE&quot;<br>        ),<br>    )<br><br>    reg = Registry(session=session, schema_name=&quot;MODELS&quot;)<br><br>    model_name = &quot;MY_MODEL&quot;<br>    model_version = get_next_version(reg, model_name)<br><br>    reg.log_model(<br>        model_name=model_name,<br>        version_name=model_version,<br>        model=pipeline,<br>        metrics=combined_metrics,<br>    )<br><br>    logger.info(<br>        &quot;{&#39;message&#39;:&#39;Finished training and registering&#39;, &#39;unit&#39;:&#39;analytics&#39;}&quot;<br>    )<br><br>    return f&quot;Model {model_name}.{model_version} is trained and 
deployed.&quot;<br><br><br>if __name__ == &quot;__main__&quot;:<br>    session = Session.builder.getOrCreate()<br>    session.use_warehouse(&quot;ML_BIG&quot;)<br>    session.use_database(&quot;ML_EXAMPLES&quot;)<br>    session.use_schema(&quot;DIAMONDS&quot;)<br>    raise SystemExit(train(session))</pre><p>We define our function train, which can be registered directly as a stored procedure at a later step (<strong>register_deploy_dags</strong>).</p><p>As an alternative to registering as a stored procedure, invoking this script directly will hit the __main__ condition, establish a connection to Snowflake, and execute the train function. This allows using this code with an orchestration tool if you do not intend to use Snowflake tasks.</p><h4>pyproject.toml</h4><p>The pyproject.toml file contains the information needed to set up the project and the Python environment. Here’s what ours looks like.</p><pre>[build-system]<br>requires = [&quot;setuptools&quot;, &quot;wheel&quot;]<br>build-backend = &quot;setuptools.build_meta&quot;<br><br>[project]<br>name = &quot;my_project&quot;<br>description = &quot;A Snowpark ML project.&quot;<br>version = &quot;0.1.0&quot;<br>readme = &quot;README.md&quot;<br>dependencies = [<br>    &quot;snowflake-snowpark-python==1.14.0&quot;,<br>    &quot;numpy==1.26.3&quot;,<br>    &quot;scikit-learn==1.3.0&quot;,<br>    &quot;snowflake[ml]&quot;,<br>    &quot;xgboost==1.7.3&quot;<br>]<br><br>[tool.setuptools.packages.find]<br>include = [&quot;my_project&quot;]<br><br>[project.optional-dependencies]<br>dev = [&quot;nbqa[toolchain]&quot;, &quot;jupyter&quot;]</pre><p>Having this file in the root of our repo will allow us to easily install our project with pip by executing the following command:</p><pre>pip install .</pre><p>Or, if you wish to perform an editable install, you can also do the following:</p><pre>pip install -e .</pre><p>Since we have specified a “dev” extra in our configuration file, we can also install dev dependencies like 
this:</p><pre>pip install -e &quot;.[dev]&quot;</pre><h4>register_deploy_dags.py</h4><p>In this code, we’re registering stored procedures from our Python module and setting them up to create a DAG in Snowflake.</p><pre>session.sproc.register_from_file(<br>    file_path=&quot;my_project/train.py&quot;,<br>    func_name=&quot;train_model&quot;,<br>    name=&quot;TRAIN_MODEL&quot;,<br>    is_permanent=True,<br>    packages=[&quot;snowflake-snowpark-python&quot;, &quot;snowflake-ml-python&quot;],<br>    imports=[&quot;my_project&quot;],<br>    stage_location=&quot;@PYTHON_CODE&quot;,<br>    replace=True,<br>    execute_as=&#39;caller&#39;<br>)</pre><p>Later, we can deploy our DAG in the same script.</p><pre>with DAG(<br>    &quot;EXAMPLE_DAG&quot;,<br>    schedule=Cron(&quot;0 8 * * 1&quot;, &quot;America/New_York&quot;),<br>    stage_location=&quot;@PYTHON_CODE&quot;,<br>    use_func_return_value=True,<br>) as dag:<br>    train_task = DAGTask(<br>        name=&quot;TRAIN_MODEL_TASK&quot;,<br>        definition=&quot;CALL TRAIN_MODEL();&quot;,<br>        warehouse=&quot;COMPUTE_WH&quot;,<br>    )<br>    set_default_task = DAGTask(<br>        name=&quot;SET_DEFAULT_VERSION&quot;,<br>        definition=&quot;CALL SET_DEFAULT_VERSION(&#39;DIAMONDS&#39;, &#39;rmse&#39;, True);&quot;,<br>        warehouse=&quot;COMPUTE_WH&quot;,<br>    )<br>    train_task &gt;&gt; set_default_task</pre><p>If you want to learn more about how this works, please check out the <a href="https://docs.snowflake.com/developer-guide/snowflake-python-api/snowflake-python-managing-tasks">docs</a>.</p><h3>Conclusion</h3><p>We hope this gives you a foundation for getting started and is flexible enough to meet your needs.</p><p>We encourage you to try this, make any necessary changes, and let us know how it works. 
Please don&#39;t hesitate to share your experiences or challenges, as they might help others.</p><p>A working demonstration of this can be found at the following repo:</p><p><a href="https://github.com/IndexSeek/snowflake-ml-example-project">GitHub - IndexSeek/snowflake-ml-example-project: An example project for Snowflake ML.</a></p><hr><p><a href="https://medium.com/snowflake/snowpark-ml-an-example-project-5627e212520c">Snowpark ML — An Example Project</a> was originally published in <a href="https://medium.com/snowflake">Snowflake Builders Blog: Data Engineers, App Developers, AI, &amp; Data Science</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[It’s a wrap(s)! Python Logging Decorators Improve Troubleshooting]]></title>
            <link>https://medium.com/learning-the-computers/its-a-wrap-s-python-logging-decorators-improve-troubleshooting-a96f4ab728d1?source=rss-4c938695f2e2------2</link>
            <guid isPermaLink="false">https://medium.com/p/a96f4ab728d1</guid>
            <category><![CDATA[logging]]></category>
            <category><![CDATA[python]]></category>
            <dc:creator><![CDATA[Tyler White]]></dc:creator>
            <pubDate>Fri, 23 Feb 2024 21:53:50 GMT</pubDate>
            <atom:updated>2024-02-23T21:53:50.128Z</atom:updated>
            <content:encoded><![CDATA[<h4>A short journey into decorators and logging in Python.</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*bC1PlpYJyiA8gWyg9GhtOg.png" /><figcaption>DALL·E 3 Created Image</figcaption></figure><p>Have you ever wondered what arguments are passed when a function is called? Do you need help deciphering long error messages that are often confusing? Hopefully, we can fix that.</p><h3>logging &gt; print?</h3><p>A common practice among developers is to use print statements frequently. However, as they become more familiar with logging and are reminded of its importance, they switch to using logging.info statements everywhere instead.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*GNHA-Xcp7J9-hfNrhfI0cA.png" /><figcaption>DALL·E 3 Created Image</figcaption></figure><p>This approach can result in longer code, though it allows for capturing more information. Fortunately, a simple solution can capture the desired information quite efficiently.</p><h3>What’s the “simple” solution?</h3><p>Decorators!</p><p>Python decorators are a powerful tool that allows developers to dynamically modify the behavior of functions or methods. 
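</p><p>Before the logging decorator itself, a minimal toy example may help show the mechanics: a decorator is just a function that takes a function and returns a replacement for it.</p>

```python
from functools import wraps


def shout(func):
    """A toy decorator that upper-cases whatever the wrapped function returns."""

    @wraps(func)  # preserve the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        # Call the original function, then modify its result.
        return func(*args, **kwargs).upper()

    return wrapper


@shout
def greet(name: str) -> str:
    return f"hello, {name}"


print(greet("tyler"))  # HELLO, TYLER
print(greet.__name__)  # greet, thanks to functools.wraps
```

<p>The logging decorator below follows exactly the same shape; it just logs around the call instead of changing the return value.</p><p>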
They sound scary but aren’t too bad once you understand how they work.</p><p>Let’s see how they can be used with Python&#39;s logging module to improve the functionality and maintainability of code.</p><p>Here’s some code to implement a function to do this.</p><pre>import logging<br>from functools import wraps<br><br>def logger(func):<br>    &quot;&quot;&quot;<br>    A decorator function to log information about function calls and their results.<br>    &quot;&quot;&quot;<br><br>    @wraps(func)<br>    def wrapper(*args, **kwargs):<br>        &quot;&quot;&quot;<br>        The wrapper function that logs function calls and their results.<br>        &quot;&quot;&quot;<br>        logging.info(f&quot;Running {func.__name__} with args: {args}, kwargs: {kwargs}&quot;)<br>        try:<br>            result = func(*args, **kwargs)<br>            logging.info(f&quot;Finished {func.__name__} with result: {result}&quot;)<br>        <br>        except Exception as e:<br>            logging.error(f&quot;Error occurred in {func.__name__}: {e}&quot;)<br>            raise<br>        <br>        else:<br>            return result<br><br>    return wrapper</pre><p>The wrapper function is a middle layer between the original function and the calling code. It logs information related to the function calls and their results, including any exceptions. When the decorated function is called, the wrapper function is invoked and calls the original function while logging relevant information. The decorator preserves the original function&#39;s metadata, ensuring introspection and debugging tools function correctly.</p><h3>How does it work?</h3><p>We can throw it on any function using the decorator.</p><pre>@logger<br>def greet_user(name: str, greeting_type: str):<br>    if greeting_type == &quot;short&quot;:<br>        return f&quot;Hey, {name}!&quot;<br>    elif greeting_type == &quot;long&quot;:<br>        return f&quot;Oh, hello there, {name}! 
How are you today?&quot;<br>    else:<br>        raise ValueError(&quot;Invalid greeting type. Please choose &#39;short&#39; or &#39;long&#39;.&quot;)</pre><p>Calling it is just like using any other function.</p><pre>greet_user(&quot;Tyler&quot;, &quot;short&quot;)<br>greet_user(&quot;Not Tyler&quot;, &quot;short&quot;)<br>greet_user(&quot;Tyler&quot;, &quot;BREAK&quot;)</pre><p>Since my current session logs at the DEBUG level, I get back some helpful messages!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*3Xdy0-D-vrfl6DNNx5DUdQ.png" /><figcaption>Logging messages from running the code.</figcaption></figure><h3>Conclusion</h3><p>Much has been written on this subject, often in great depth. I hope this serves as a gentle introduction and offers a starting point.</p><p>You should experiment with custom logging formats, integrate third-party logging libraries, or explore advanced decorator patterns beyond the scope of this article. Don’t hesitate to share your experiences or challenges while implementing logging decorators, as they might help others.</p><p>For some additional reading, check out these links:</p><ul><li><a href="https://docs.python.org/3/library/functools.html#module-functools">functools - Higher-order functions and operations on callable objects</a></li><li><a href="https://peps.python.org/pep-0318/">PEP 318 - Decorators for Functions and Methods</a></li></ul><hr><p><a href="https://medium.com/learning-the-computers/its-a-wrap-s-python-logging-decorators-improve-troubleshooting-a96f4ab728d1">It’s a wrap(s)! Python Logging Decorators Improve Troubleshooting</a> was originally published in <a href="https://medium.com/learning-the-computers">Learning The Computers</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>