<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Radovan Bacovic on Medium]]></title>
        <description><![CDATA[Stories by Radovan Bacovic on Medium]]></description>
        <link>https://medium.com/@radovan.bacovic?source=rss-ff65005cbd7e------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*XFnFWQNZqqREMjG13gRURw.png</url>
            <title>Stories by Radovan Bacovic on Medium</title>
            <link>https://medium.com/@radovan.bacovic?source=rss-ff65005cbd7e------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Thu, 16 Apr 2026 03:04:42 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@radovan.bacovic/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Snowflake Task and Pipe Failures]]></title>
            <link>https://medium.com/snowflake/snowflake-task-and-pipe-failures-91b4680b6ba8?source=rss-ff65005cbd7e------2</link>
            <guid isPermaLink="false">https://medium.com/p/91b4680b6ba8</guid>
            <category><![CDATA[aws-lambda]]></category>
            <category><![CDATA[snowflake]]></category>
            <category><![CDATA[slack]]></category>
            <category><![CDATA[gitlab]]></category>
            <dc:creator><![CDATA[Radovan Bacovic]]></dc:creator>
            <pubDate>Tue, 17 Mar 2026 21:01:00 GMT</pubDate>
            <atom:updated>2026-03-17T21:01:00.999Z</atom:updated>
<content:encoded><![CDATA[<h4>How we alert the team in real time with AWS Lambda and Slack</h4><blockquote><strong><em>Note: </em></strong><em>this article was written with huge help from </em><a href="http://twitter.com/csnehansh06"><strong><em>@csnehansh06</em></strong></a><em>; please consider him a co-author.</em></blockquote><p>How the <a href="https://handbook.gitlab.com/handbook/enterprise-data/"><strong>GitLab Data Team</strong></a> uses AWS SNS and a Lambda function to turn silent Snowflake failures into instant Slack notifications before anyone notices data is missing or the pipeline is broken.</p><p>This was our problem: a Snowpipe silently stops loading data. No error is thrown at the pipeline level, and no dashboard turns red. Just a growing gap in your tables that someone notices hours or days later.</p><p>And it’s a problem that’s embarrassingly common with Snowflake tasks and pipes. They fail quietly. Snowflake logs the error, but unless you’re actively polling for it, you won’t know. And polling is the kind of thing that either doesn’t get built, or gets built once, breaks, and nobody notices.</p><p>We needed something better: a push-based alerting system that fires the moment something goes wrong, and drops a message directly into Slack where the team already lives.</p><p>Here’s exactly how we built it.</p><h3>The problem with Snowflake failure visibility</h3><p>Snowflake tasks and Snowpipes are workhorses of any modern data platform. Tasks let you run SQL logic on a cron-like schedule. Snowpipes load data continuously from cloud storage — in our case, S3 — into Snowflake tables.</p><p>Both can fail. And when they do, Snowflake doesn’t shout about it by default.</p><p>For tasks, failures are logged in INFORMATION_SCHEMA.TASK_HISTORY. For Snowpipes, errors show up in INFORMATION_SCHEMA.COPY_HISTORY or via the REST API.
You can query these, but querying is reactive. You&#39;re always behind.</p><p>What we wanted was reactive in the good sense: something that responds to the failure the moment it happens, rather than us having to go looking.</p><p>The solution was Snowflake’s native error notification integration using <strong>AWS SNS (Simple Notification Service)</strong>. Snowflake can publish failure events directly to an SNS topic. From there, you can wire up anything — Lambda, email, PagerDuty, you name it. We wired it to a Lambda function that formats the message and posts it to Slack.</p><p>The full flow looks like this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*XWWKrMds2MLBOxmL0YYLFA.png" /><figcaption><strong><em>How this feature works (clean and simple)</em></strong></figcaption></figure><h3>Building the integration step by step</h3><h4>Step 1: Create the SNS topic in AWS</h4><p>First, create an SNS topic that Snowflake will publish to. You can do this through the AWS Console or CLI.</p><pre>aws sns create-topic --name snowflake-error-notifications --region us-east-1</pre><p>Note the ARN that comes back — you’ll need it in the next step. It looks like:</p><pre>arn:aws:sns:us-east-1:123456789012:snowflake-error-notifications</pre><h4>Step 2: Grant Snowflake permission to publish to the topic</h4><p>Snowflake needs an IAM role with permission to publish to your SNS topic. Create the policy:</p><pre>{<br>  &quot;Version&quot;: &quot;2012-10-17&quot;,<br>  &quot;Statement&quot;: [<br>    {<br>      &quot;Effect&quot;: &quot;Allow&quot;,<br>      &quot;Action&quot;: &quot;sns:Publish&quot;,<br>      &quot;Resource&quot;: &quot;arn:aws:sns:us-east-1:123456789012:snowflake-error-notifications&quot;<br>    }<br>  ]<br>}</pre><p>Attach this to a new IAM role. The trust relationship on the role needs to allow Snowflake’s AWS account to assume it. 
You’ll get Snowflake’s IAM user ARN after creating the notification integration in the next step — so this is a two-step dance.</p><h4>Step 3: Create the Snowflake notification integration</h4><p>In Snowflake, create a notification integration that points to your SNS topic:</p><pre>CREATE OR REPLACE NOTIFICATION INTEGRATION sns_error_integration<br>  ENABLED = TRUE<br>  TYPE = QUEUE<br>  NOTIFICATION_PROVIDER = AWS_SNS<br>  DIRECTION = OUTBOUND<br>  AWS_SNS_TOPIC_ARN = &#39;arn:aws:sns:us-east-1:123456789012:snowflake-error-notifications&#39;<br>  AWS_SNS_ROLE_ARN = &#39;arn:aws:iam::123456789012:role/snowflake-sns-role&#39;;</pre><p>Then describe it to get Snowflake’s side of the IAM trust:</p><pre>DESC INTEGRATION sns_error_integration;</pre><p>Look for SF_AWS_IAM_USER_ARN and SF_AWS_EXTERNAL_ID in the output. Use these to update the trust policy on your IAM role so Snowflake can actually assume it:</p><pre>{<br>  &quot;Version&quot;: &quot;2012-10-17&quot;,<br>  &quot;Statement&quot;: [<br>    {<br>      &quot;Effect&quot;: &quot;Allow&quot;,<br>      &quot;Principal&quot;: {<br>        &quot;AWS&quot;: &quot;arn:aws:iam::SNOWFLAKE_ACCOUNT:user/SNOWFLAKE_USER&quot;<br>      },<br>      &quot;Action&quot;: &quot;sts:AssumeRole&quot;,<br>      &quot;Condition&quot;: {<br>        &quot;StringEquals&quot;: {<br>          &quot;sts:ExternalId&quot;: &quot;YOUR_EXTERNAL_ID&quot;<br>        }<br>      }<br>    }<br>  ]<br>}</pre><h4>Step 4: Attach the notification integration to your tasks and pipes</h4><p>For a <strong>Snowflake task</strong>, add the error integration to its definition:</p><pre>CREATE OR REPLACE TASK my_transform_task<br>  WAREHOUSE = TRANSFORM_WH<br>  SCHEDULE = &#39;USING CRON 0 6 * * * UTC&#39;<br>  ERROR_INTEGRATION = sns_error_integration<br>AS<br>  CALL my_stored_procedure();</pre><p>For an existing task you want to update:</p><pre>ALTER TASK my_transform_task SET ERROR_INTEGRATION = sns_error_integration;</pre><p>For a 
<strong>Snowpipe</strong>, add it at creation time:</p><pre>CREATE OR REPLACE PIPE my_data_pipe<br>  AUTO_INGEST = TRUE<br>  ERROR_INTEGRATION = sns_error_integration<br>AS<br>  COPY INTO my_table<br>  FROM @my_stage<br>  FILE_FORMAT = (TYPE = &#39;JSON&#39;);</pre><p>Now, whenever that task or pipe fails, Snowflake pushes a JSON payload to your SNS topic automatically. No polling, no cron job checking for failures.</p><h4>Step 5: Create the Lambda function</h4><p>Subscribe your Lambda to the SNS topic, then write the handler. The SNS message payload from Snowflake looks like this:</p><pre>{<br>  &quot;version&quot;: &quot;1.0&quot;,<br>  &quot;messageId&quot;: &quot;abc-123&quot;,<br>  &quot;timestamp&quot;: &quot;2024-01-15T08:32:11Z&quot;,<br>  &quot;snowflakeEventType&quot;: &quot;TASK_FAILURE&quot;,<br>  &quot;resource&quot;: {<br>    &quot;database&quot;: &quot;PROD&quot;,<br>    &quot;schema&quot;: &quot;TRANSFORMS&quot;,<br>    &quot;name&quot;: &quot;MY_TRANSFORM_TASK&quot;<br>  },<br>  &quot;errorMessage&quot;: &quot;SQL compilation error: Object &#39;MY_TABLE&#39; does not exist or not authorized.&quot;<br>}</pre><p>Here’s the Lambda function we use to parse that and send it to Slack:</p><pre>import json<br>import os<br>import urllib.request<br><br>SLACK_WEBHOOK_URL = os.environ[&quot;SLACK_WEBHOOK_URL&quot;]<br><br>def format_slack_message(event_data: dict) -&gt; dict:<br>    event_type = event_data.get(&quot;snowflakeEventType&quot;, &quot;UNKNOWN_EVENT&quot;)<br>    resource = event_data.get(&quot;resource&quot;, {})<br>    resource_name = (<br>        f&quot;{resource.get(&#39;database&#39;, &#39;?&#39;)}.&quot;<br>        f&quot;{resource.get(&#39;schema&#39;, &#39;?&#39;)}.&quot;<br>        f&quot;{resource.get(&#39;name&#39;, &#39;?&#39;)}&quot;<br>    )<br>    error_message = event_data.get(&quot;errorMessage&quot;, &quot;No error message provided.&quot;)<br>    timestamp = event_data.get(&quot;timestamp&quot;, &quot;Unknown time&quot;)<br>    
emoji = &quot;:snowflake:&quot; if &quot;PIPE&quot; in event_type else &quot;:gear:&quot;<br><br>    # Pure Block Kit — no legacy `attachments` wrapper<br>    return {<br>        &quot;blocks&quot;: [<br>            {<br>                &quot;type&quot;: &quot;header&quot;,<br>                &quot;text&quot;: {<br>                    &quot;type&quot;: &quot;plain_text&quot;,<br>                    &quot;text&quot;: f&quot;{emoji} Snowflake {event_type.replace(&#39;_&#39;, &#39; &#39;).title()}&quot;,<br>                },<br>            },<br>            {<br>                &quot;type&quot;: &quot;section&quot;,<br>                &quot;fields&quot;: [<br>                    {&quot;type&quot;: &quot;mrkdwn&quot;, &quot;text&quot;: f&quot;*Resource:*\n`{resource_name}`&quot;},<br>                    {&quot;type&quot;: &quot;mrkdwn&quot;, &quot;text&quot;: f&quot;*Time:*\n{timestamp}&quot;},<br>                ],<br>            },<br>            {<br>                &quot;type&quot;: &quot;section&quot;,<br>                &quot;text&quot;: {<br>                    &quot;type&quot;: &quot;mrkdwn&quot;,<br>                    &quot;text&quot;: f&quot;:red_circle: *Error:*\n```{error_message[:500]}```&quot;,<br>                },<br>            },<br>            {&quot;type&quot;: &quot;divider&quot;},<br>        ]<br>    }<br><br>def send_to_slack(message: dict) -&gt; None:<br>    payload = json.dumps(message).encode(&quot;utf-8&quot;)<br>    req = urllib.request.Request(<br>        SLACK_WEBHOOK_URL,<br>        data=payload,<br>        headers={&quot;Content-Type&quot;: &quot;application/json&quot;},<br>        method=&quot;POST&quot;,<br>    )<br>    with urllib.request.urlopen(req) as response:<br>        if response.status != 200:<br>            raise ValueError(f&quot;Slack returned {response.status}: {response.read()}&quot;)<br><br>def lambda_handler(event, context):<br>    for record in event.get(&quot;Records&quot;, []):<br>        sns_message = 
record.get(&quot;Sns&quot;, {}).get(&quot;Message&quot;, &quot;{}&quot;)<br>        <br>        try:<br>            event_data = json.loads(sns_message)<br>        except json.JSONDecodeError:<br>            print(f&quot;Could not parse SNS message: {sns_message}&quot;)<br>            continue<br>        print(f&quot;Processing event: {event_data.get(&#39;snowflakeEventType&#39;)} for {event_data.get(&#39;resource&#39;)}&quot;)<br>        slack_message = format_slack_message(event_data)<br>        send_to_slack(slack_message)<br>        print(&quot;Alert sent to Slack successfully.&quot;)<br>    return {&quot;statusCode&quot;: 200, &quot;body&quot;: &quot;Done&quot;}</pre><p>Set the SLACK_WEBHOOK_URL as an environment variable in your Lambda configuration <em>(not hardcoded, never hardcoded)</em>. You can create a Slack incoming webhook from the Slack API dashboard for your workspace.</p><h3>What the alert looks like in Slack</h3><p>When a task fails, the team sees something like this in the #data-alerts channel:</p><pre>⚙️ Snowflake Task Failure<br>Resource: PROD.TRANSFORMS.MY_TRANSFORM_TASK<br>Time: 2024-01-15T08:32:11Z<br>Error:<br>SQL compilation error: Object &#39;MY_TABLE&#39; does not exist or not authorized.</pre><p>Clean, specific, actionable. No one needs to go digging in Snowflake’s query history to know what broke.</p><h3>Handling multiple tasks and pipes</h3><p>One integration covers everything. Any task or pipe you attach ERROR_INTEGRATION = sns_error_integration to will automatically publish to the same SNS topic and flow through the same Lambda. You don&#39;t need separate integrations per object — just update the task or pipe definition.</p><p>We tag the resource name in the Slack message, so you always know exactly which task or pipe failed without any ambiguity.</p><h3>Making this production-ready</h3><p>Getting the alert firing is the quick win. 
Here’s what we did to make it actually reliable in production.</p><ul><li><strong>Add a dead-letter queue.</strong> Lambda invocations can fail. If your Slack webhook is temporarily down or your Lambda has a bug, you don’t want to silently lose failure notifications — that’s the exact opposite of what you’re building. Configure an SQS dead-letter queue on the Lambda so failed invocations are captured and can be replayed.</li><li><strong>Limit error message length.</strong> Snowflake error messages can be verbose. We cap ours at <strong>500</strong> characters in the Lambda before sending to Slack. Slack blocks have limits, and a wall of text in an alert channel gets ignored fast.</li><li><strong>Route alerts by severity.</strong> Not all failures are equal. A prod Snowpipe failing is an incident. A dev task failing overnight is background noise. We route to different Slack channels based on the database name — prod failures go to #data-incidents, everything else goes to #data-alerts.</li></ul><p>Here’s the routing logic added to the Lambda:</p><pre>def get_slack_channel(resource: dict) -&gt; str:<br>    database = resource.get(&quot;database&quot;, &quot;&quot;).upper()<br>    if database == &quot;SOMETHING_IN_PRODUCTION&quot;:<br>        return &quot;#data-incidents&quot;<br>    return &quot;#data-alerts&quot;</pre><p><em>(With Slack’s incoming webhooks, you’ll need separate webhooks per channel, or migrate to the Slack Web API with a bot token to route dynamically.)</em></p><p><strong>Test it before you trust it.</strong> You can manually trigger a test by creating a task that intentionally fails:</p><pre>CREATE OR REPLACE TASK test_failure_task<br>  WAREHOUSE = TRANSFORM_WH<br>  SCHEDULE = &#39;USING CRON * * * * * UTC&#39;<br>  ERROR_INTEGRATION = sns_error_integration<br>AS<br>  SELECT 1 / 0;  -- guaranteed division by zero</pre><pre>-- I always forget to run this command; don&#39;t make the same mistake.<br>ALTER TASK test_failure_task RESUME;</pre><p>Watch Slack.
Within a minute, you should see the alert fire. Once confirmed, drop the task:</p><pre>ALTER TASK test_failure_task SUSPEND;<br>DROP TASK test_failure_task;</pre><p>The whole setup — from SNS topic to Slack message — takes a reasonably short time to configure. After that, it runs itself. We haven’t missed a Snowflake failure since we deployed this, and the team spends zero time polling task history to check whether things are working.</p><p>If you’re running Snowflake in production and you don’t have something like this, set it up today. Quiet failures are the ones that get you.</p><p>The GitLab Data Team’s full documentation on this integration is publicly available in the GitLab <a href="https://handbook.gitlab.com/handbook/enterprise-data/platform/snowflake/snowpipe/"><strong>handbook</strong></a>.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=91b4680b6ba8" width="1" height="1" alt=""><hr><p><a href="https://medium.com/snowflake/snowflake-task-and-pipe-failures-91b4680b6ba8">Snowflake Task and Pipe Failures</a> was originally published in <a href="https://medium.com/snowflake">Snowflake Builders Blog: Data Engineers, App Developers, AI, &amp; Data Science</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Data classification with Snowflake: from impossible to production]]></title>
            <link>https://medium.com/snowflake/data-classification-with-snowflake-from-impossible-to-production-aa11680aca75?source=rss-ff65005cbd7e------2</link>
            <guid isPermaLink="false">https://medium.com/p/aa11680aca75</guid>
            <category><![CDATA[llm]]></category>
            <category><![CDATA[dbt]]></category>
            <category><![CDATA[gitlab]]></category>
            <category><![CDATA[python]]></category>
            <dc:creator><![CDATA[Radovan Bacovic]]></dc:creator>
            <pubDate>Sat, 21 Feb 2026 15:01:02 GMT</pubDate>
            <atom:updated>2026-02-21T15:01:02.628Z</atom:updated>
<content:encoded><![CDATA[<h3>Automated Data Classification with Snowflake: From Nowhere to Production</h3><h4>What is the problem with today’s data?</h4><p>Let me scare you for a moment.</p><p>Data breaches. Reputation damage. Intellectual property theft. Code leaks. GDPR violations. Multi-million dollar fines.</p><p>This isn’t a dystopian future — it’s happening right now to companies just like yours. And the scary part? Most organisations have no idea which of their database tables contain sensitive information.</p><p>At the <a href="https://handbook.gitlab.com/handbook/enterprise-data/"><strong>GitLab Data Team</strong></a>, we faced this exact problem. Our data landscape had grown to thousands of models across <strong>RAW</strong>, <strong>PREP</strong>, and <strong>PROD</strong> databases. Somewhere in that massive ecosystem lurked personally identifiable information (<a href="https://www.ibm.com/think/topics/pii"><strong>PII</strong></a>) and material non-public information (<a href="https://www.investopedia.com/terms/m/materialinsiderinformation.asp"><strong>MNPI</strong></a>) — the kind of data that could trigger compliance violations, reputation damage, and those terrifying multi-million-dollar fines.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Ahf9mcqrJVoGRoEZDQmw4w.png" /></figure><p>The traditional approach doesn’t work: manual tagging by data engineers who already have full plates. The reality?</p><ol><li>It doesn’t scale</li><li>It’s error-prone, and</li><li>By the time you finish, your data has already changed.</li></ol><h3>Why Data Classification Became Non-Negotiable</h3><p>Here’s the uncomfortable truth: every software company in 2025 is racing to add AI features to its products. But AI on top of unclassified data is a recipe for disaster.</p><p>Our data classification challenge wasn’t just about compliance checkboxes.
We needed to solve three critical problems:</p><p><strong>Who’s accessing our sensitive data?</strong> Without proper classification, we couldn’t audit who was querying <strong>PII</strong> or <strong>MNPI</strong>. Any employee could potentially download customer information without leaving a trace.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*8sIdLdQPbAbahsE226NKEQ.png" /><figcaption>PII and personal data: good to know which is which</figcaption></figure><p><strong>Automated tagging at scale.</strong> With approximately 10,000 models in Snowflake, manual classification was dead on arrival.</p><p><strong>End-to-end governance.</strong> We needed a solution that covered everything from initial tagging to ongoing monitoring and audit trails.</p><p>The market and the open source space offered plenty of tools:</p><ol><li><a href="https://github.com/tokern/piicatcher"><strong>PIICatcher</strong></a><strong>,</strong></li><li><a href="https://microsoft.github.io/presidio/"><strong>Microsoft Presidio</strong></a>, and</li><li>various Snowflake features like <a href="https://docs.snowflake.com/en/user-guide/classify-intro"><strong>Sensitive data classification</strong></a>.</li></ol><p>Here’s what we learned — none of them fully solved our specific problem. Open source tools lacked community support and scalability.
Third-party solutions introduced vendor lock-in and high costs.</p><p>We needed to move quickly <em>(considering upcoming audit deadlines)</em> and had the right ingredients: hands-on experience with Python, Snowflake, GitLab CI/CD, and AI.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ks8Wvto4dF_j2FX7HcS6og.png" /><figcaption>Available options for the automated data classification</figcaption></figure><p>We built our own pipeline around Snowflake’s classification mechanism.</p><h3>How: Building production-grade classification</h3><p>Our success criteria were stringent:</p><blockquote>tag <a href="https://www.ibm.com/think/topics/pii"><strong>PII</strong></a> and <a href="https://www.investopedia.com/terms/m/materialinsiderinformation.asp"><strong>MNPI</strong></a> data across all environments through a fully automated process, with zero (or near-zero) false positives, and the ability to classify 10k+ models in under two hours.</blockquote><p>After evaluating the landscape, we chose <strong>Snowflake’s classification capabilities</strong> as our foundation. Not because it was perfect, but because it offered the best balance of scalability, usability, and development speed for our specific needs. It was a production-ready feature we could adopt immediately, and the cost was reasonable; refer <a href="https://www.snowflake.com/legal-files/CreditConsumptionTable.pdf"><strong>here</strong></a> for more details.</p><h3>The Architecture</h3><p>We built a multi-layered system orchestrated through Apache Airflow:</p><ol><li><strong>Domain configuration (YAML)</strong> We began by defining our scope in configuration files, specifying which databases, schemas, and tables to include or exclude for MNPI and PII classification.
This gave us the flexibility to adapt as our data landscape evolved.</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*JRbr5HO0vZxUFx4v0n-14A.png" /><figcaption>YML specification for the data classification</figcaption></figure><p><strong>2. Python<em> </em></strong><em>(in K8s, where most of our pipelines are implemented)</em><strong><em> </em>+ Airflow Orchestration</strong> A single DAG (data_classification) coordinates five parallel tasks:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*7wj5sEKp55vj03gaEAk0_A.png" /><figcaption>Deployment options for Python code</figcaption></figure><ul><li>extract_classification — pulls metadata from Snowflake</li><li>execute_classification_MNPI — runs MNPI classification</li><li>execute_classification_[RAW,PREP,PROD] — processes each environment separately</li></ul><p><strong>3. Smart parameters:</strong> Three key parameters made our solution flexible:</p><ul><li><strong>DATA_CLASSIFICATION_DAYS</strong> — how far back to scan (default: 90 days)</li><li><strong>DATA_CLASSIFICATION_TAGGING_TYPE</strong> — full (a complete re-tagging) or incremental (a faster, partial run)</li><li><strong>DATA_CLASSIFICATION_UNSET</strong> — remove all tags and start from the beginning</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*eNh4m9l_v9bY-tPKUD6aHA.png" /><figcaption>Airflow setup for Data classification</figcaption></figure><p><strong>4. Snowflake’s LLM-powered classification.</strong> This is where the magic happens.
Snowflake’s built-in classification leverages large language models to understand data semantically — not just pattern matching on column names, but actually analysing sample data to determine if it contains sensitive information.</p><p>We use Snowflake’s <a href="https://docs.snowflake.com/en/sql-reference/stored-procedures/system_classify_schema"><strong>SYSTEM$CLASSIFY_SCHEMA</strong></a> function as the core engine:</p><pre>CALL SYSTEM$CLASSIFY_SCHEMA(&#39;PREP.SALES&#39;, {<br>  &#39;sample_count&#39;: 1000, <br>  &#39;auto_tag&#39;: true<br>});</pre><p>This single function call does the heavy lifting: it samples <strong>1000</strong> rows from each table in the schema, runs them through Snowflake’s LLM models, identifies PII patterns, and automatically applies tags to sensitive columns. Where you need more samples for accuracy, you can increase the <strong>sample_count</strong> parameter:</p><pre>CALL SYSTEM$CLASSIFY_SCHEMA(&#39;RAW.CUSTOMER_DATA&#39;, {<br>  &#39;sample_count&#39;: 5000, <br>  &#39;auto_tag&#39;: true<br>});</pre><p>Snowflake handles the complexity: the LLM inference, the tag management, the metadata updates. We just orchestrate which schemas to process and when.</p><p><strong>Good to know the limitations:</strong> Snowflake’s SYSTEM$CLASSIFY_SCHEMA has a hidden limit — it can only process up to <strong>1,000</strong> tables per schema in a single call.
When you have schemas with thousands of tables <em>(like we did)</em>, the function simply stops processing after hitting that ceiling.</p><p>For MNPI data, we pick up the tags from our dbt model specifications and apply the same tagging in Snowflake.</p><h3>The Technical Win</h3><p>The entire process runs on Snowflake, offering proven scalability, lower maintenance overhead, and tighter security within Snowflake’s trust boundary.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*S7ADOUvE6ZlRfvsiG2istg.png" /><figcaption>Architectural diagram for the data classification project</figcaption></figure><p>We leverage GitLab Duo for code assistance and maintain everything in <a href="https://about.gitlab.com/"><strong>GitLab</strong></a> with full CI/CD pipelines. Every change goes through review, staging, and production — <strong>DevSecOps</strong> culture applied to data governance.</p><h3>Beyond Simple Tagging</h3><p>But classification was just the foundation. We built three additional layers:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*DevmZE2jYL7V1Lbc7iju-w.png" /><figcaption>Tagged table example (fictional)</figcaption></figure><p><strong>Audit layer.</strong> Every classification action gets logged. Every suspicious query gets flagged. Security teams can review and act on inappropriate activity before it becomes a breach. For more details, refer to our dbt <a href="https://dbt.gitlabdata.com/#!/model/model.gitlab_snowflake.wk_sensitive_queries_source"><strong>documentation</strong></a>. Under the hood, we join Snowflake’s metadata tables for tags with the query history, which gives us a comprehensive overview of which queries touched which tagged objects.</p><p><strong>Monitoring layer.</strong> A dbt-based monitoring system continuously checks queries against our tagged data.
It looks for suspicious patterns:</p><ul><li>SELECT * queries with no WHERE clauses or aggregations</li><li>Queries fetching round numbers of results (500, 1000 — typical data extraction patterns)</li><li>GET operations (clear signs of data downloads)</li><li>Unusual access patterns by non-system users</li></ul><p><strong>The hidden data problem.</strong> Here’s something most classification tools miss: when you create a view from a PII-tagged table, or clone a table, the new object should inherit those tags automatically. We solved this by tracking object lineage and propagating tags through the dependency graph.</p><h3>What We Actually Solved</h3><p>A few months after launch, here’s our scorecard:</p><ol><li><strong>Scalability</strong> — 10k+ models classified in under 2 hours</li><li><strong>Automation</strong> — zero human intervention required</li><li><strong>Compliance</strong> — legal and audit teams satisfied</li><li><strong>Adaptability</strong> — handles frequent dbt re-creation of objects</li><li><strong>Security</strong> — catches suspicious access patterns</li><li><strong>Malicious use detection</strong> — audit trail for every sensitive query</li><li><strong>Hidden data tracking</strong> — tags propagate through views and clones</li></ol><p>What we haven’t solved yet:</p><ol><li>❌ <strong>Vendor lock-in</strong> — we’re committed to Snowflake (but we’re okay with that)</li><li>❌ <strong>Full control</strong> — we’re dependent on Snowflake’s classification evolution</li></ol><p>The solution is scheduled in Airflow: incremental tagging finishes in under 1 hour, and a full tagging run completes in under <strong>2</strong> hours on an <strong>L-size</strong> warehouse.</p><h3>The Honest Setbacks</h3><p>Not everything went smoothly. We initially tried Snowflake’s auto-classification during PrP <em>(private preview)</em> — the accuracy wasn’t there yet. We pivoted to their LLM-based approach when it hit GA <em>(generally available)</em>.
The cost per classification run is higher, but warehouse size tuning solved our scalability concerns.</p><h3>Conclusion: Choose the Problem, Not the Tool</h3><p>The data landscape is changing faster than our ability to secure it. Every company rushing to add AI features faces the same fundamental challenge: <strong><em>you can’t safely feed AI tools with unclassified data.</em></strong></p><p>Our journey taught us three critical lessons:</p><ul><li><strong>Stay open-source where possible — </strong>but don’t let it hinder your progress. We evaluated PIICatcher and Presidio extensively, but when Snowflake offered a native solution that immediately solved 80% of our problems, we adopted it.</li><li><strong>Think about scalability from day one, but start small</strong> — we began with a single database, proved the concept, then scaled to 10k+ models. The architectural decisions we made on day one <em>(Airflow, YAML configuration, incremental processing…)</em> enabled that scale.</li><li><strong>Focus on business value, assess costs, but move fast</strong> — compliance deadlines don’t wait for perfect solutions. We shipped a working classification system in weeks, not months, because we chose the problem <em>(data security) </em>over the tool <em>(vendor independence).</em></li></ul><p>The future of data is clear: it needs to be classified, governed, and continuously monitored. Whether you build or buy, the time to start is now. Because the cost of waiting isn’t just measured in dollars — it’s measured in reputation, trust, and the ability to innovate safely with AI.</p><p><em>What’s next for us? 
We’re exploring:</em></p><ol><li><a href="https://docs.snowflake.com/en/user-guide/classify-auto"><strong><em>Auto classification</em></strong></a><em> in Snowflake</em></li><li><em>Integrating with tools like Atlan and Tableau, and connecting our classification data to other governance programs.</em></li><li>Using AI models to find suspicious queries and track tags</li></ol><p><strong><em>Our journey continues.</em></strong></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=aa11680aca75" width="1" height="1" alt=""><hr><p><a href="https://medium.com/snowflake/data-classification-with-snowflake-from-impossible-to-production-aa11680aca75">Data classification with Snowflake: from impossible to production</a> was originally published in <a href="https://medium.com/snowflake">Snowflake Builders Blog: Data Engineers, App Developers, AI, &amp; Data Science</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Snowflake: Rolling window DISTINCT count. How to make this happen?]]></title>
            <link>https://medium.com/snowflake/snowflake-rolling-window-distinct-count-how-to-make-this-happen-b9ba35cc105a?source=rss-ff65005cbd7e------2</link>
            <guid isPermaLink="false">https://medium.com/p/b9ba35cc105a</guid>
            <category><![CDATA[snowflake]]></category>
            <category><![CDATA[sql]]></category>
            <category><![CDATA[python]]></category>
            <category><![CDATA[gitlab]]></category>
            <category><![CDATA[data-engineering]]></category>
            <dc:creator><![CDATA[Radovan Bacovic]]></dc:creator>
            <pubDate>Thu, 22 Jan 2026 20:02:02 GMT</pubDate>
            <atom:updated>2026-01-22T20:02:02.201Z</atom:updated>
<content:encoded><![CDATA[<h3>Snowflake: Rolling Window DISTINCT Count. How to make this happen?</h3><h4><strong>The problem that shouldn’t exist but does</strong></h4><p>You need to count <strong>DISTINCT</strong> users over a rolling 28-day window. Seems straightforward, right? Write a COUNT(DISTINCT user_id) with a <a href="https://docs.snowflake.com/en/user-guide/functions-window-using">window function</a> and you&#39;re done.</p><p><strong>Except you can’t!</strong></p><p><a href="https://www.snowflake.com/en/"><strong>Snowflake</strong></a>, like most other database systems, doesn’t support COUNT(DISTINCT...) over window functions. This fundamental limitation forces data engineers into workarounds that sacrifice either performance or accuracy when dealing with time-series analytics at scale.</p><p>At the <a href="http://about.gitlab.com/"><strong>GitLab</strong></a><strong> Data Team</strong>, we hit this wall while calculating monthly active user metrics across millions of events. We needed to process tables with billions of records efficiently, handle date gaps accurately, and make the solution reusable for our analytics engineers. The standard SQL approaches either crawled to a halt or required complex date-filling logic.</p><h3>What We Built: Why Python Changes Everything</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*y3yUq4jmgRbq7GK9YWNxrA.png" /></figure><p>The breakthrough came from rethinking the problem entirely. Instead of fighting SQL’s <a href="https://docs.snowflake.com/en/user-guide/functions-window-using">window function</a> limitations, we used Python to generate date arrays and leveraged Snowflake’s LATERAL FLATTEN to explode them. This combination eliminates window functions while maintaining scalability.</p><p><strong>The key insight</strong>: a 28-day rolling window is just 28 array members. 
Cache this small array in Python, flatten it with SQL, and suddenly you’re processing hundreds of millions of rows in minutes instead of hours.</p><p>Here’s why this scales:</p><ol><li><strong>Python caching</strong>: The @functools.lru_cache decorator means date array generation happens once per unique date range, not billions of times</li><li><strong>No window partitioning</strong>: Standard window functions scan entire partitions repeatedly; our approach processes each row exactly once</li><li><strong>Native Snowflake operations</strong>: LATERAL FLATTEN is optimized C code, not interpreted SQL</li></ol><p>The performance difference is dramatic. On our <strong>500M+</strong> record dataset, this approach completed in under 5 minutes on an <strong>L</strong> warehouse. Comparable window function solutions either failed or required <strong>XL</strong> warehouses running <strong>30+</strong> minutes.</p><h3>How it Works: The Complete Implementation</h3><p>Let’s walk through a concrete example. 
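</p><p>Before the SQL version, here is the same rolling-window logic as a small, dependency-free Python reference implementation. It is a sketch for sanity-checking the Snowflake results, using the same sample rows as the walkthrough:</p>

```python
import datetime
from collections import defaultdict

# Sample activity rows mirroring the user_activity table in the walkthrough:
# (activity_date, namespace_id, user_id)
rows = [
    ("2024-01-01", 100, 1001), ("2024-01-01", 100, 1002), ("2024-01-01", 100, 1002),
    ("2024-01-02", 100, 1001), ("2024-01-05", 100, 1003), ("2024-01-05", 100, 1001),
    ("2024-01-08", 100, 1004), ("2024-01-10", 100, 1002), ("2024-01-15", 100, 1005),
    ("2024-01-20", 100, 1001), ("2024-01-25", 100, 1006), ("2024-01-28", 100, 1003),
    ("2024-02-01", 100, 1007),
]

def rolling_distinct_counts(rows, window_days=28):
    """For each date between the min and max activity date, count DISTINCT
    users active in the trailing window_days-day window (inclusive)."""
    activity = defaultdict(set)  # date -> set of active user_ids
    for day, _namespace, user_id in rows:
        activity[datetime.date.fromisoformat(day)].add(user_id)
    first, last = min(activity), max(activity)
    counts, report = {}, first
    while report <= last:
        start = report - datetime.timedelta(days=window_days - 1)
        users = set()
        for day, ids in activity.items():
            if start <= day <= report:
                users |= ids
        counts[report.isoformat()] = len(users)
        report += datetime.timedelta(days=1)
    return counts

counts = rolling_distinct_counts(rows)
print(counts["2024-01-01"], counts["2024-01-05"], counts["2024-01-08"])  # 2 3 4
```

<p>Dates inside gaps <em>(like January 3–4)</em> still get a row, which is exactly the behaviour the SQL solution is designed to reproduce.</p><p>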
First, create sample data representing user activity:</p><pre>CREATE OR REPLACE TABLE user_activity (<br>    activity_date DATE,<br>    namespace_id INTEGER,<br>    user_id INTEGER<br>);<br><br>-- Insert sample data with intentional date gaps<br>INSERT INTO user_activity VALUES<br>    (&#39;2024-01-01&#39;, 100, 1001),<br>    (&#39;2024-01-01&#39;, 100, 1002),<br>    (&#39;2024-01-01&#39;, 100, 1002), -- same as prior record, expect to see 2 for 2024-01-01<br>    (&#39;2024-01-02&#39;, 100, 1001),<br>    (&#39;2024-01-05&#39;, 100, 1003), -- gap: no data for 01-03, 01-04<br>    (&#39;2024-01-05&#39;, 100, 1001),<br>    (&#39;2024-01-08&#39;, 100, 1004),<br>    (&#39;2024-01-10&#39;, 100, 1002),<br>    (&#39;2024-01-15&#39;, 100, 1005),<br>    (&#39;2024-01-20&#39;, 100, 1001),<br>    (&#39;2024-01-25&#39;, 100, 1006),<br>    (&#39;2024-01-28&#39;, 100, 1003),<br>    (&#39;2024-02-01&#39;, 100, 1007);</pre><p>Next, create the Python function that generates date arrays with built-in caching:</p><pre>CREATE OR REPLACE FUNCTION generate_date_list(start_date DATE, end_date DATE)<br>RETURNS ARRAY<br>LANGUAGE PYTHON<br>RUNTIME_VERSION = &#39;3.11&#39;<br>HANDLER = &#39;generate_date_list&#39;<br>AS<br>$$<br>import datetime<br>import functools<br><br>@functools.lru_cache(maxsize=100)<br>def generate(start_date, end_date):<br>    &quot;&quot;&quot;<br>    Generate cached date arrays for rolling windows.<br>    <br>    With maxsize=100, we can cache all unique 28-day windows<br>    in a typical monthly processing run, dramatically reducing<br>    computation overhead.<br>    &quot;&quot;&quot;<br>    result = []<br>    current = start_date<br>    while current &lt;= end_date:<br>        result.append(current)<br>        current += datetime.timedelta(days=1)<br>    return result<br><br>def generate_date_list(start_date, end_date):<br>    return generate(start_date, end_date)<br>$$;</pre><p><strong>Why this matters:</strong> When processing millions of rows with a 28-day 
window, most rows will request identical date ranges (like “27 days before today”). Without caching, you’d rebuild the same 28-element array millions of times; with caching, you build it once and reuse it billions of times, turning an <strong><em>O(n × m)</em></strong> operation into effectively <strong><em>O(n)</em></strong>.</p><p>Now implement the rolling window logic:</p><pre>WITH base_data AS (<br>    -- Your source data<br>    SELECT activity_date,<br>           namespace_id,<br>           user_id<br>      FROM user_activity<br>),<br>date_bounds AS (<br>    -- Calculate processing range<br>    SELECT MIN(activity_date) AS min_date,<br>           MAX(activity_date) AS max_date<br>      FROM base_data<br>),<br>rolling_windows AS (<br>    -- Generate 28-day window for each activity<br>    SELECT activity_date,<br>           namespace_id,<br>           user_id,<br>           generate_date_list(<br>               DATEADD(day, -27, activity_date),<br>               activity_date<br>           ) AS date_window<br>      FROM base_data<br>)<br>-- Flatten windows and count distinct users<br>SELECT DATEADD(day, 27, dates.value::DATE) AS report_date,<br>       rolling_windows.namespace_id,<br>       COUNT(DISTINCT rolling_windows.user_id) AS distinct_users_28d<br>  FROM rolling_windows,<br>       LATERAL FLATTEN(INPUT =&gt; rolling_windows.date_window) AS dates<br> WHERE DATEADD(day, 27, dates.value::DATE) <br>       BETWEEN (SELECT min_date FROM date_bounds) <br>           AND (SELECT max_date FROM date_bounds)<br> GROUP BY report_date, namespace_id<br> ORDER BY report_date;</pre><p>What’s happening here:</p><ol><li><strong>base_data</strong>: Your source events (activity date, namespace, user)</li><li><strong>date_bounds</strong>: Establishes the processing range to avoid edge effects</li><li><strong>rolling_windows</strong>: For each activity, generates a 28-day lookback array using the Python function <strong><em>generate_date_list</em></strong></li><li><strong>Final 
SELECT</strong>: Flattens arrays with <strong><em>LATERAL</em></strong> <strong><em>FLATTEN</em></strong>, then counts distinct users for each date</li></ol><p>The crucial performance trick:</p><pre>DATEADD(day, 27, dates.value::DATE)</pre><p>converts each array member back to the “end date” perspective, allowing proper grouping without date gaps.</p><p>Output:</p><pre>| report_date | namespace_id | distinct_users_28d |<br>|-------------|--------------|--------------------|<br>| 2024-01-01  | 100          | 2                  |<br>| 2024-01-02  | 100          | 2                  |<br>| 2024-01-03  | 100          | 2                  |<br>| 2024-01-04  | 100          | 2                  |<br>| 2024-01-05  | 100          | 3                  |<br>| 2024-01-06  | 100          | 3                  |<br>| 2024-01-07  | 100          | 3                  |<br>| 2024-01-08  | 100          | 4                  |<br>...<br>...<br>...<br>| 2024-01-28  | 100          | 6                  |<br>| 2024-02-01  | 100          | 7                  |</pre><p>Notice how date gaps <em>(like January 3–4)</em> are automatically filled with accurate rolling counts — no manual date spine required.</p><h3>Why the Alternatives Fall Short</h3><p>Before arriving at this solution, we evaluated four standard approaches:</p><p><strong>Range-based window functions</strong>: Requires DENSE_RANK() workarounds since COUNT(DISTINCT) isn&#39;t supported. Can&#39;t handle date gaps without manual date spines. Fails on datasets over 100M records due to partition size limits.</p><p><strong>Row-based window functions</strong>: Slightly better performance but still requires extensive date-filling logic. Misses the maximum date row without workarounds. Complexity scales poorly with dataset size.</p><p><strong>Regular subqueries</strong>: Conceptually simple — join the table to itself with date range conditions. Performance degrades quadratically with data volume. 
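</p><p>To make the self-join idea concrete, here is a hedged sketch using an in-memory SQLite database as a stand-in for Snowflake (SQLite date syntax, not Snowflake syntax), run against the same sample data as the walkthrough:</p>

```python
import sqlite3

# Sample rows mirroring the user_activity table from the walkthrough.
rows = [
    ("2024-01-01", 100, 1001), ("2024-01-01", 100, 1002), ("2024-01-01", 100, 1002),
    ("2024-01-02", 100, 1001), ("2024-01-05", 100, 1003), ("2024-01-05", 100, 1001),
    ("2024-01-08", 100, 1004), ("2024-01-10", 100, 1002), ("2024-01-15", 100, 1005),
    ("2024-01-20", 100, 1001), ("2024-01-25", 100, 1006), ("2024-01-28", 100, 1003),
    ("2024-02-01", 100, 1007),
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE user_activity (activity_date TEXT, namespace_id INT, user_id INT)")
con.executemany("INSERT INTO user_activity VALUES (?, ?, ?)", rows)

# Self-join: for every report date, rescan all rows in its 28-day window.
# The work grows with (number of dates) x (rows per window), which is what
# makes this approach crawl on large tables.
query = """
SELECT a.activity_date AS report_date,
       COUNT(DISTINCT b.user_id) AS distinct_users_28d
  FROM (SELECT DISTINCT activity_date FROM user_activity) AS a
  JOIN user_activity AS b
    ON b.activity_date BETWEEN date(a.activity_date, '-27 days') AND a.activity_date
 GROUP BY a.activity_date
 ORDER BY a.activity_date
"""
for report_date, n in con.execute(query):
    print(report_date, n)
```

<p>Note that this version only returns rows for dates that actually appear in the data: the gap dates are simply missing, so you would still need a separate date spine to fill them.</p><p>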
Our 500M record dataset would take hours even on XL warehouses.</p><p><strong>Lateral join subqueries</strong>: Cleaner syntax than regular subqueries but identical performance characteristics. Still requires full table scans per date partition.</p><p>The lateral array construction approach consistently outperformed these alternatives by 10–20x while maintaining code clarity and handling edge cases automatically.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*sj4_xM_tI5dpj-H_i4aSjA.png" /><figcaption>Comparing options for the rolling windows count DISTINCT</figcaption></figure><h3>Do’s and Don’ts for Production</h3><p>✅ <strong>Do’s:</strong></p><ul><li><strong>Pre-aggregate to the daily level</strong> before applying rolling windows. Convert timestamps to dates in a base table first — mixing timestamp and date types in window calculations kills performance</li><li><strong>Right-size your warehouse</strong>. Use XS for &lt;100K rows, S for 100K-1M, L for 1M-20M, XL for 20M+</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/706/1*KjAAR2vz4D9NniJKQCCajQ.png" /><figcaption>Proposed warehouse size, based on the dataset size</figcaption></figure><ul><li><strong>Process monthly</strong>. Calculate 60 days of rolling history once per month rather than recalculating the full history daily</li><li><strong>Monitor cache effectiveness</strong>. If you’re processing many different window sizes, increase lru_cache(maxsize=...) appropriately</li><li><strong>Create a dbt macro</strong> for this pattern and make your analysts happy. Hide the logic behind the macro while keeping full flexibility. 
With this implementation, anyone can simply call the macro from a dbt model, and it will do the rest.</li></ul><pre>{%- macro count_distinct_rolling_window(source_table, distinct_column, other_columns_list, date_column_name=&#39;ping_date&#39;, window_in_days=28) -%}<br><br>{% set source_table_name = source_table %}<br>{% set time_window = window_in_days - 1 %}<br>{% set date_column = date_column_name %}<br><br>  WITH base AS (<br>    SELECT {{ date_column }}     AS ping_date,<br>           {% for other_column in other_columns_list %}<br>              {{ other_column }} AS {{ other_column }},<br>           {%- endfor -%}<br>           {{ distinct_column }} AS {{ distinct_column }}<br>      FROM {{ ref(source_table_name) }}<br>  ), min_max AS (<br>    SELECT MIN(ping_date) AS min_date,<br>           MAX(ping_date) AS max_date<br>      FROM base<br>  ), generate_rolling_window AS (<br>    SELECT ping_date,<br>           {% for other_column in other_columns_list %}<br>              {{ other_column }} AS {{ other_column }},<br>           {%- endfor -%}<br>           {{ distinct_column }},<br>           generate_date_list(DATEADD(day, -{{ time_window }}, ping_date), ping_date) AS rolling_window<br>      FROM base<br>  )<br>  SELECT DATEADD(day, {{ time_window }}, unnest_dates.value::DATE) AS ddate,<br>         {% for other_column in other_columns_list %}<br>              {{ other_column }} AS {{ other_column }},<br>         {%- endfor -%}<br>         COUNT(DISTINCT generate_rolling_window.{{ distinct_column }}) AS distinct_user_count<br>    FROM generate_rolling_window,<br>         LATERAL FLATTEN(INPUT =&gt; generate_rolling_window.rolling_window) AS unnest_dates<br>   WHERE ddate BETWEEN (SELECT DATEADD(day, -{{ time_window }}, min_date) FROM min_max) AND (SELECT max_date FROM min_max)<br>   GROUP BY ALL<br><br>{%- endmacro 
-%}</pre><p>and call the macro from the dbt project:</p><pre>{{ <br>count_distinct_rolling_window(source_table=&#39;my_table&#39;, <br>                              distinct_column=&#39;id&#39;, <br>                              date_column_name=&#39;date&#39;, <br>                              other_columns_list=[&#39;metrics_path&#39;,&#39;namespace&#39;]<br>                              ) <br>}}</pre><p>❌ <strong>Don’ts:</strong></p><ul><li><strong>Don’t use window functions on massive datasets (&gt;100M records)</strong>. The partition scans will overwhelm even large warehouses. Use the lateral array approach instead</li><li><strong>Don’t mix timestamp and date types in calculations</strong>. Always cast to DATE once in your base table (timestamp_column::DATE AS date_column) and query the date column directly. Instead of repeating the cast in every query:</li></ul><pre>SELECT <br>...<br>timestamp_column::DATE AS date_column,<br>...</pre><p>create a base table once:</p><pre>CREATE TABLE<br>...<br>SELECT <br>...<br>timestamp_column::DATE AS date_column,</pre><p>and later on query it as a DATE data type:</p><pre>SELECT<br>...<br>date_column -- this is a DATE data type now</pre><ul><li><strong>Don’t skip the date bounds CTE</strong>. Without it, you’ll get incorrect counts at the edges of your date range</li><li><strong>Don’t process the full history daily</strong>. Rolling windows only need recent data — create a sliding 60-day base table and process incrementally</li></ul><p>Snowflake’s lack of DISTINCT count window functions isn’t a limitation you need to accept. With Python UDFs and lateral joins, you can build rolling window calculations that are faster, cleaner, and more maintainable than traditional SQL workarounds.</p><p>The code is straightforward, the performance is excellent, and the pattern is reusable. 
Stop fighting SQL’s limitations and start combining the best of both worlds (🐍<strong>Python</strong> + ❄️<strong>Snowflake</strong>).</p><hr><p><a href="https://medium.com/snowflake/snowflake-rolling-window-distinct-count-how-to-make-this-happen-b9ba35cc105a">Snowflake: Rolling window DISTINCT count. How to make this happen?</a> was originally published in <a href="https://medium.com/snowflake">Snowflake Builders Blog: Data Engineers, App Developers, AI, &amp; Data Science</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Learn Weaviate in 15 minutes: A practical guide for SQL developers]]></title>
            <link>https://medium.com/@radovan.bacovic/learn-weaviate-in-15-minutes-a-practical-guide-for-sql-developers-2badafc4081a?source=rss-ff65005cbd7e------2</link>
            <guid isPermaLink="false">https://medium.com/p/2badafc4081a</guid>
            <category><![CDATA[data]]></category>
            <category><![CDATA[weaviate]]></category>
            <category><![CDATA[vector-database]]></category>
            <category><![CDATA[sql]]></category>
            <category><![CDATA[gitlab]]></category>
            <dc:creator><![CDATA[Radovan Bacovic]]></dc:creator>
            <pubDate>Thu, 25 Dec 2025 21:25:15 GMT</pubDate>
            <atom:updated>2025-12-30T14:44:15.077Z</atom:updated>
<content:encoded><![CDATA[<p>Understanding semantic search through the lens of relational databases</p><p><strong>📣NOTE: </strong>The complete code from this tutorial can be found in the repo <a href="https://gitlab.com/radovan.bacovic/weaviate101"><strong>Weaviate101</strong></a></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/612/1*UDWpCCdJl_MCNmKCwMsTUQ.png" /></figure><h3>Part 1: Understanding vector databases</h3><p>Before diving into Weaviate specifically, let’s establish what problem vector databases solve and why you should care.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*KHMlzrl2kmigoGiAJ4u6gA.png" /><figcaption>Basic concepts of a vector database (<a href="https://learn.microsoft.com/en-us/data-engineering/playbook/solutions/vector-database/">source</a>)</figcaption></figure><h3>The fundamental problem</h3><p>Traditional databases excel at exact matches:</p><pre>SELECT * <br>  FROM accounts <br> WHERE account_name = &#39;Acme Corp&#39;;</pre><p>But they fail at semantic understanding:</p><pre>-- This doesn&#39;t work in traditional SQL<br>SELECT * <br>  FROM accounts <br> WHERE meaning_similar_to(&#39;accounts showing churn risk&#39;)</pre><p>Vector databases solve this by converting data into mathematical representations that capture semantic meaning. Here’s the core concept:</p><pre>Text: &quot;Customer experiencing integration challenges&quot;<br> ↓<br>Vector: [0.23, -0.15, 0.67, 0.45, -0.82, … 768 dimensions]<br>Text: &quot;Account having technical difficulties&quot;<br> ↓ <br>Vector: [0.21, -0.18, 0.63, 0.48, -0.79, … 768 dimensions]</pre><p>These vectors are <strong>close in mathematical space</strong> even though the text is different. 
This is how semantic search works.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*9CSYp42JLySu2MMfK2HZKw.png" /><figcaption>Search example in Weaviate, <a href="https://weaviate.io/apple-and-weaviate/apple-apps-part-2">source</a>.</figcaption></figure><h3>Why you should use vector databases</h3><p>If you’re used to <strong>SQL</strong> and relational databases, you might wonder: <em>“Why add another database to my stack?”</em> Here are the concrete business problems vector databases solve:</p><p><strong>1. The “synonyms and variations” problem</strong></p><p>Your SQL database can’t find related concepts without exhaustive keyword lists:</p><pre>-- Traditional approach - brittle and incomplete<br>SELECT * <br>  FROM tickets <br> WHERE description LIKE &#39;%slow%&#39; <br>    OR description LIKE &#39;%performance%&#39;<br>    OR description LIKE &#39;%lag%&#39;<br>    OR description LIKE &#39;%timeout%&#39;<br>    OR description LIKE &#39;%unresponsive%&#39;<br>    -- ...and 50 more variations you forgot</pre><p>Vector databases understand that “slow”, “sluggish”, “laggy”, “unresponsive” and “performance issues” are semantically similar — without you listing every variation.</p><p><strong>2. The “exploratory search” problem</strong></p><p>Business users ask questions like:</p><ul><li>“Show me accounts that might churn.”</li><li>“Find customers discussing integration challenges.”</li><li>“Which tickets indicate product-market fit issues?”</li></ul><p>These are <strong>conceptual queries</strong> that don’t map cleanly to SQL predicates. You can’t write:</p><pre>WHERE conceptually_similar_to(&#39;churn risk&#39;)</pre><p>But with vector databases, you can search by concept, not just keywords.</p><p><strong>3. 
The “unstructured data” problem</strong></p><p>You have valuable insights trapped in:</p><ul><li>Support ticket descriptions</li><li>Customer call transcripts</li><li>Contract notes</li><li>Email communications</li><li>Product feedback</li></ul><p>Traditional databases store this text, but can’t make it <strong>searchable by meaning</strong>. Full-text search helps, but it’s still keyword-based. Vector databases make unstructured data as queryable as structured data.</p><p><strong>4. The “recommendation” problem</strong></p><p><strong><em>“Find accounts similar to this one”</em></strong> or <strong><em>“Show me related support tickets” </em></strong>requires understanding similarity across multiple dimensions. SQL can do basic matching:</p><pre>-- Find accounts with similar characteristics<br>SELECT * <br>  FROM accounts <br> WHERE segment = &#39;Enterprise&#39; <br>   AND arr BETWEEN 400000 AND 600000<br>   AND health_score BETWEEN 75 AND 85</pre><p>However, this overlooks accounts that share similar <strong>behaviour patterns</strong>, <strong>engagement styles</strong>, or <strong>business contexts</strong> — aspects that are not captured in structured columns.</p><p><strong>5. 
The “data quality” problem</strong></p><p>Finding duplicates and near-duplicates is hard in SQL:</p><pre>-- Which of these are the same company?<br>&#39;Acme Corp&#39;<br>&#39;ACME Corporation&#39;  <br>&#39;Acme Corp.&#39;<br>&#39;ACME CORP&#39;</pre><p>Vector similarity instantly identifies these as the same entity without writing complex string-matching rules.</p><h3>Real business impact</h3><p>For our project, vector search enabled:</p><ul><li><strong>Account managers</strong>: <em>“Show me accounts like this high-performing customer”</em> → instant recommendations</li><li><strong>Support teams</strong>: <em>“Find similar issues to this ticket”</em> → faster resolution through past solutions</li><li><strong>Executives</strong>: <em>“What are our at-risk accounts saying?”</em> → semantic analysis across all touchpoints</li><li><strong>Data quality teams</strong>: <em>“Find duplicate accounts”</em> → automatic deduplication</li></ul><p><strong>The bottom line</strong>: Vector databases aren’t replacing <strong>SQL</strong> — they’re augmenting it. 
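</p><p>The deduplication case is easy to demo with a toy vector model. The sketch below embeds names as bags of character trigrams, which is an assumption for illustration only (a production setup would use a real embedding model, and “Globex Ltd” is a made-up contrasting name):</p>

```python
import math
from collections import Counter

def trigram_vector(text):
    """Toy embedding: counts of character trigrams (stand-in for a learned model)."""
    t = text.lower().strip(".")
    return Counter(t[i:i + 3] for i in range(len(t) - 2))

def cosine(a, b):
    # Sparse cosine similarity over trigram counts; missing grams count as 0.
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = lambda v: math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm(a) * norm(b))

base = trigram_vector("Acme Corp")
for name in ["ACME Corporation", "Acme Corp.", "ACME CORP", "Globex Ltd"]:
    print(name, round(cosine(base, trigram_vector(name)), 2))
```

<p>The variants score near 1.0 against each other while the unrelated name scores near 0, with no hand-written string-matching rules.</p><p>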
Use SQL for structured queries <em>(“ARR &gt; $100K”)</em>, use vector search for semantic queries <em>(“accounts showing expansion signals”)</em>, and combine them for powerful hybrid searches.</p><h3>Key capabilities vector databases enable</h3><ol><li><strong>Semantic search</strong>: Find conceptually similar items, not just keyword matches</li><li><strong>Similarity ranking</strong>: Order results by how semantically close they are</li><li><strong>Multi-modal search</strong>: Search across text, images, and audio using the same infrastructure</li><li><strong>Recommendation engines</strong>: <em>“Find items like this one”</em> becomes a vector proximity search</li><li><strong>Deduplication</strong>: Identify near-duplicates without exact matching</li><li><strong>Classification and clustering</strong>: Group similar items automatically without predefined categories</li></ol><h3>Part 2: What is <a href="https://weaviate.io/">Weaviate</a>?</h3><blockquote><strong>Weaviate is an open-source vector database that stores both vectors and their original data, enabling semantic search with structured filtering.</strong></blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*lci5ZZvK8HgR9L653FsLhA.png" /><figcaption>Weaviate database (<a href="https://weaviate.io/blog/what-is-a-vector-database">source</a>)</figcaption></figure><h3>What makes <a href="https://weaviate.io/">Weaviate</a> different?</h3><p>Most vector databases only store vectors. 
Weaviate stores:</p><ul><li><strong>Vectors</strong> (the semantic embeddings)</li><li><strong>Properties</strong> (structured data like account_name, segment, health_score)</li><li><strong>Relationships</strong> (cross-references between objects)</li></ul><p>This enables <strong>hybrid queries</strong> — semantic search combined with structured filters:</p><pre># Find semantically similar accounts...<br>&quot;accounts at risk of churning&quot;</pre><pre># ...filtered by business rules<br>WHERE segment = &quot;Enterprise&quot; <br>AND health_score &lt; 50<br>AND renewal_date &lt; 90 days</pre><p><strong>1. Schema-based collections (think: tables with vectors)</strong></p><p>If you’re coming from SQL, think of a Weaviate <strong>collection</strong> as similar to a database <strong>table</strong>:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*y9wo07zLLBwIsRTQXvtpMg.png" /><figcaption>SQL vs Weaviate concepts</figcaption></figure><p><strong>Key difference</strong>: Each object in Weaviate has both traditional properties <em>(such as </em><strong><em>SQL</em></strong><em> columns)</em> and a vector embedding that captures its semantic meaning.</p><p>Here’s how the <strong>AccountIntelligence</strong> collection maps to <strong>SQL</strong> thinking:</p><pre>Collection: &quot;AccountIntelligence&quot; # Like a SQL table named &quot;accounts&quot;<br>├── Properties                    # Like SQL columns<br>│   ├── accountId: TEXT           # Primary key equivalent<br>│   ├── accountName: TEXT         # Regular text column<br>│   ├── segment: TEXT             # Categorical column<br>│   └── healthScore: NUMBER       # Numeric column<br>└── Vector: [768 dimensions]      # NEW: Semantic representation</pre><p>When you query Weaviate, you can:</p><ul><li>Filter by properties<em> (just like SQL WHERE clauses)</em></li><li>Order by properties <em>(just like SQL ORDER BY)</em></li><li>Search by vector similarity <em>(NEW: semantic search 
capability)</em></li></ul><p><strong>Example comparison</strong>:</p><p>Traditional SQL query:</p><pre>SELECT * <br>  FROM accounts <br> WHERE segment = &#39;Enterprise&#39; <br>   AND health_score &lt; 50<br> ORDER BY arr DESC<br> LIMIT 10;</pre><p>Weaviate equivalent using semantic search:</p><pre>collection.query.near_vector(<br>    near_vector=query_embedding,<br>    filters=Filter.by_property(&quot;segment&quot;).equal(&quot;Enterprise&quot;) &amp;<br>            Filter.by_property(&quot;healthScore&quot;).less_than(50),<br>    limit=10<br>)<br># Returns semantically similar accounts PLUS structured filtering</pre><p><strong>2. Multiple search modes</strong>:</p><ul><li><a href="https://docs.weaviate.io/weaviate/concepts/search/vector-search">Pure vector search</a> <em>(semantic only)</em></li><li><a href="https://weaviate.io/learn/knowledgecards/keyword-search">BM25 keyword search</a> <em>(traditional)</em></li><li><a href="https://docs.weaviate.io/weaviate/search/hybrid">Hybrid search</a> <em>(vector + keyword + filters)</em></li><li><a href="https://docs.weaviate.io/weaviate/concepts/filtering">Filtered semantic search</a> <em>(semantic with business rules)</em></li></ul><p><strong>3. Built-in vectorization or “bring-your-own”</strong>:</p><ul><li>Use Weaviate’s modules <em>(OpenAI, Cohere, etc.)</em></li><li>Or provide pre-computed embeddings <em>(we do this with Ollama)</em></li></ul><p><strong>4. GraphQL and REST APIs</strong>: Flexible query interface</p><p><strong>5. 
Horizontal scaling</strong>: Production-ready architecture</p><h3>Part 3: Weaviate walkthrough with code</h3><p>Let’s build a minimal semantic search system step by step.</p><h3>Step 1: Start Weaviate with Docker</h3><p>Copy this code into the <strong><em>docker-compose.yml </em></strong>file:</p><pre># docker-compose.yml<br>services:<br>  weaviate:<br>    image: semitechnologies/weaviate:1.23.0<br>    ports:<br>      - &quot;8080:8080&quot;<br>      - &quot;50051:50051&quot;  # gRPC port<br>    environment:<br>      QUERY_DEFAULTS_LIMIT: 25<br>      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: &#39;true&#39;<br>      PERSISTENCE_DATA_PATH: &#39;/var/lib/weaviate&#39;<br>      DEFAULT_VECTORIZER_MODULE: &#39;none&#39;  # We&#39;ll provide embeddings later on<br>      ENABLE_MODULES: &#39;&#39;<br>      CLUSTER_HOSTNAME: &#39;node1&#39;</pre><p>Start it up:</p><pre>docker-compose up -d<br># Wait 30 seconds for Weaviate to start</pre><pre>[+] Running 2/2<br> ✔ Network weaviate101_default       Created                                                                                                                                                                                                                                  0.0s <br> ✔ Container weaviate101-weaviate-1  Started      </pre><p>and check if Weaviate is working:</p><pre>curl http://localhost:8080/v1</pre><h3>Step 2: Connect and create a schema</h3><p>Install needed libraries from the <strong>requirements.txt</strong> file:</p><pre>anthropic<br>requests<br>weaviate-client==4.19.0</pre><pre>pip install -r requirements.txt</pre><p>Create <strong>weaviate_start.py </strong>file and paste the code:</p><pre>import weaviate<br>from weaviate.classes.config import Configure, DataType, Property<br><br>COLLECTION_NAME = &quot;AccountIntelligence&quot;<br>try:<br>    # Connect to local Weaviate instance<br>    client = weaviate.connect_to_local(host=&quot;localhost&quot;, port=8080, grpc_port=50051)<br>    # Check 
connection<br>    print(client.is_ready())  # Should print True<br>    # Create a collection for account data<br><br>    if client.collections.exists(COLLECTION_NAME):<br>        client.collections.delete(COLLECTION_NAME)<br><br>    accounts = client.collections.create(<br>        name=COLLECTION_NAME,<br>        description=&quot;Customer account data with semantic search&quot;,<br>        # Vector configuration<br>        vectorizer_config=Configure.Vectorizer.none(),  # We provide embeddings<br>        vector_index_config=Configure.VectorIndex.hnsw(),<br>        # Define properties (structured data)<br>        properties=[<br>            Property(<br>                name=&quot;accountId&quot;,<br>                data_type=DataType.TEXT,<br>                skip_vectorization=True,  # Don&#39;t include in vector<br>                description=&quot;Unique account identifier&quot;,<br>            ),<br>            Property(<br>                name=&quot;accountName&quot;,<br>                data_type=DataType.TEXT,<br>                skip_vectorization=True,<br>                description=&quot;Account name&quot;,<br>            ),<br>            Property(<br>                name=&quot;content&quot;,<br>                data_type=DataType.TEXT,<br>                skip_vectorization=False,  # This WILL be vectorized<br>                description=&quot;Rich account summary for semantic search&quot;,<br>            ),<br>            Property(<br>                name=&quot;segment&quot;,<br>                data_type=DataType.TEXT,<br>                skip_vectorization=True,<br>                description=&quot;Account segment: Enterprise, Mid-Market, SMB&quot;,<br>            ),<br>            Property(<br>                name=&quot;healthScore&quot;,<br>                data_type=DataType.NUMBER,<br>                skip_vectorization=True,<br>                description=&quot;Health score 0-100&quot;,<br>            ),<br>            Property(<br>                
name=&quot;arr&quot;,<br>                data_type=DataType.NUMBER,<br>                skip_vectorization=True,<br>                description=&quot;Annual Recurring Revenue&quot;,<br>            ),<br>        ],<br>    )<br>    print(f&quot;Collection &#39;{accounts.name}&#39; created successfully!&quot;)<br>finally:<br>    client.close()</pre><p>When you run the file, you will get a message:</p><pre>True<br>Collection &#39;AccountIntelligence&#39; created successfully!</pre><p><strong>Key concepts explained:</strong></p><ul><li><strong>skip_vectorization=True</strong>: These fields are metadata only. Used for filtering, not semantic search.</li><li><strong>skip_vectorization=False</strong>: The content field gets vectorised for semantic search.</li><li><strong>HNSW index</strong>: Hierarchical Navigable Small World graph — fast approximate nearest neighbour search</li><li><strong>Cosine distance</strong>: Measures the angle between vectors, perfect for text embeddings</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/896/1*LaxNZ7Z-cFp0GAGVnbPJNw.png" /><figcaption>Dummy representation of the AccountIntelligence collection in Weaviate</figcaption></figure><h3>Step 3: Add data with vectors</h3><p>Here’s where the magic happens — inserting data with embeddings. We pre-computed embeddings for easier testing; later on, Weaviate can do it for us:</p><p>Create a file <strong>weaviate_insert_data.py</strong>:</p><pre># Sample account data<br>sample_accounts = [<br>    {<br>        &quot;accountId&quot;: &quot;ACC001&quot;,<br>        &quot;accountName&quot;: &quot;Acme Corp&quot;,<br>        &quot;content&quot;: &quot;Enterprise account showing strong product adoption. Active in community, &quot;<br>                   &quot;high license utilization at 85%. Recent expansion discussion with CSM. 
&quot;<br>                   &quot;Strong technical engagement across engineering teams.&quot;,<br>        &quot;segment&quot;: &quot;Enterprise&quot;,<br>        &quot;healthScore&quot;: 82,<br>        &quot;arr&quot;: 450000,<br>        # Vector would come from embedding model - simplified here<br>        &quot;vector&quot;: [0.23, -0.15, 0.67, 0.45, -0.82, 0.12, 0.56, -0.33]  # 768 dims in reality<br>    },<br>    {<br>        &quot;accountId&quot;: &quot;ACC002&quot;,<br>        &quot;accountName&quot;: &quot;TechStart Inc&quot;,<br>        &quot;content&quot;: &quot;Mid-market account with recent support escalations. Multiple tickets on API &quot;<br>                   &quot;performance issues. Low license utilization at 35%. CSM noted budget concerns &quot;<br>                   &quot;in last quarterly review. Contract renewal in 60 days.&quot;,<br>        &quot;segment&quot;: &quot;Mid-Market&quot;,<br>        &quot;healthScore&quot;: 42,<br>        &quot;arr&quot;: 125000,<br>        &quot;vector&quot;: [-0.15, 0.22, -0.45, 0.67, 0.33, -0.78, 0.11, 0.55]<br>    },<br>    {<br>        &quot;accountId&quot;: &quot;ACC003&quot;,<br>        &quot;accountName&quot;: &quot;Global Solutions Ltd&quot;,<br>        &quot;content&quot;: &quot;Enterprise account with critical escalation. Integration challenges blocking &quot;<br>                   &quot;production deployment. Executive stakeholder expressing frustration. 
&quot;<br>                   &quot;Competitors mentioned in recent calls.&quot;,<br>        &quot;segment&quot;: &quot;Enterprise&quot;,<br>        &quot;healthScore&quot;: 28,<br>        &quot;arr&quot;: 780000,<br>        &quot;vector&quot;: [-0.33, 0.45, -0.67, 0.22, 0.88, -0.12, -0.55, 0.15]<br>    }<br>]</pre><p>Append the insert logic to the same file and try it:</p><pre>import weaviate<br><br># Insert using batch API (much faster than individual inserts)<br>client = weaviate.connect_to_local(host=&quot;localhost&quot;, port=8080, grpc_port=50051)<br><br>collection = client.collections.get(&quot;AccountIntelligence&quot;)<br>with collection.batch.dynamic() as batch:<br>    for account in sample_accounts:<br>        batch.add_object(<br>            properties={<br>                &quot;accountId&quot;: account[&quot;accountId&quot;],<br>                &quot;accountName&quot;: account[&quot;accountName&quot;],<br>                &quot;content&quot;: account[&quot;content&quot;],<br>                &quot;segment&quot;: account[&quot;segment&quot;],<br>                &quot;healthScore&quot;: account[&quot;healthScore&quot;],<br>                &quot;arr&quot;: account[&quot;arr&quot;],<br>            },<br>            vector=account[&quot;vector&quot;],<br>        )<br># Check what was inserted<br>result = collection.aggregate.over_all(total_count=True)<br>print(f&quot;Total accounts in Weaviate: {result.total_count}&quot;)<br>client.close()</pre><p>When you run the file, you will get this result:</p><pre>Total accounts in Weaviate: 3</pre><p><strong>What just happened?</strong></p><p>Each account now exists in Weaviate as:</p><ol><li>A <strong>768-dimensional vector</strong> <em>(in reality, not the simplified 8-dim example) </em>capturing semantic meaning</li><li><strong>Structured properties</strong> available for filtering and display</li><li><strong>Indexed</strong> for fast retrieval</li></ol><h3>Part 4: Search options in Weaviate</h3><p>Now let’s explore the different ways to query this data.</p><h3>Option 1: Pure semantic 
search (vector similarity)</h3><p>Create a file <strong>weaviate_search.py</strong> and paste the code:</p><pre>import weaviate<br>from weaviate.classes.query import MetadataQuery<br><br>client = weaviate.connect_to_local(host=&quot;localhost&quot;, port=8080, grpc_port=50051)<br><br><br>def semantic_search(query_text: str, limit: int = 5):<br>    &quot;&quot;&quot;<br>    Find accounts semantically similar to the query<br>    &quot;&quot;&quot;<br>    collection = client.collections.get(&quot;AccountIntelligence&quot;)<br><br>    # In reality, generate embedding for query_text using same model as data<br>    # For demo, we&#39;ll use a simplified query vector<br>    query_vector = [0.1, 0.3, -0.5, 0.7, 0.2, -0.6, 0.4, 0.1]<br><br>    response = collection.query.near_vector(<br>        near_vector=query_vector,<br>        limit=limit,<br>        return_metadata=MetadataQuery(distance=True, certainty=True),<br>    )<br><br>    print(f&quot;\n🔍 Semantic search: &#39;{query_text}&#39;\n&quot;)<br>    for obj in response.objects:<br>        print(f&quot;Account: {obj.properties[&#39;accountName&#39;]}&quot;)<br>        print(f&quot;Segment: {obj.properties[&#39;segment&#39;]}&quot;)<br>        print(f&quot;Health: {obj.properties[&#39;healthScore&#39;]}&quot;)<br>        print(f&quot;Certainty: {obj.metadata.certainty:.3f}&quot;)<br>        print(f&quot;Content: {obj.properties[&#39;content&#39;][:100]}...&quot;)<br>        print(&quot;-&quot; * 80)<br><br>    return response.objects<br><br><br># Try it<br>results = semantic_search(&quot;accounts at risk of churning&quot;)</pre><p>Result:</p><pre>🔍 Semantic search: &#39;accounts at risk of churning&#39;<br>Account: Global Solutions Ltd<br>Segment: Enterprise<br>Health: 28<br>Certainty: 0.892<br>Content: Enterprise account with critical escalation. 
Integration challenges blocking production...<br>--------------------------------------------------------------------------------<br>Account: TechStart Inc<br>Segment: Mid-Market<br>Health: 42<br>Certainty: 0.765<br>Content: Mid-market account with recent support escalations. Multiple tickets on API performance...<br>--------------------------------------------------------------------------------</pre><p><strong>Notice</strong>: The search found accounts discussing “escalations”, “issues”, and “concerns” even though the query was “risk of churning” — this is semantic understanding!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*nTpTadwxMw542YRogo8RUQ.png" /><figcaption>Graphical representation of vectors (<a href="https://weaviate.io/blog/what-is-a-vector-database">source</a>)</figcaption></figure><h3>Option 2: Keyword search (BM25)</h3><p>Sometimes you want exact keyword matching, not semantic similarity. Add this code at the end of the file <strong>weaviate_search.py</strong>:</p><pre>def keyword_search(keyword: str, limit: int = 5):<br>    &quot;&quot;&quot;<br>    Traditional BM25 keyword search<br>    &quot;&quot;&quot;<br>    collection = client.collections.get(&quot;AccountIntelligence&quot;)<br>    <br>    response = collection.query.bm25(<br>        query=keyword,<br>        limit=limit,<br>        return_metadata=MetadataQuery(score=True)<br>    )<br>    <br>    print(f&quot;\n🔍 Keyword search: &#39;{keyword}&#39;\n&quot;)<br>    for obj in response.objects:<br>        print(f&quot;Account: {obj.properties[&#39;accountName&#39;]}&quot;)<br>        print(f&quot;BM25 Score: {obj.metadata.score:.3f}&quot;)<br>        print(f&quot;Content: {obj.properties[&#39;content&#39;][:100]}...&quot;)<br>        print(&quot;-&quot; * 80)<br>    <br>    return response.objects<br><br>results = keyword_search(&quot;escalation&quot;)</pre><p>And run the file.</p><p><strong>When to use keyword vs semantic:</strong></p><ul><li><strong>Keyword</strong>: 
Specific product names, account IDs, exact terminology</li><li><strong>Semantic</strong>: Conceptual queries, exploratory search, synonym handling</li></ul><h3>Option 3: Hybrid search (best of both worlds)</h3><p>Hybrid search combines vector similarity with keyword matching. Add this to the end of the <strong>weaviate_search.py</strong> file:</p><pre>def hybrid_search(query_text: str, alpha: float = 0.5, limit: int = 5):<br>    &quot;&quot;&quot;<br>    Hybrid search balancing semantic and keyword<br>    <br>    alpha=0.0 → 100% keyword (BM25)<br>    alpha=0.5 → 50% semantic, 50% keyword<br>    alpha=1.0 → 100% semantic<br>    &quot;&quot;&quot;<br>    collection = client.collections.get(&quot;AccountIntelligence&quot;)<br>    query_vector = [0.1, 0.3, -0.5, 0.7, 0.2, -0.6, 0.4, 0.1]<br>    <br>    response = collection.query.hybrid(<br>        query=query_text,<br>        vector=query_vector,<br>        alpha=alpha,<br>        limit=limit,<br>        return_metadata=MetadataQuery(score=True)<br>    )<br>    <br>    print(f&quot;\n🔍 Hybrid search (α={alpha}): &#39;{query_text}&#39;\n&quot;)<br>    for obj in response.objects:<br>        print(f&quot;Account: {obj.properties[&#39;accountName&#39;]}&quot;)<br>        print(f&quot;Hybrid Score: {obj.metadata.score:.3f}&quot;)<br>        print(f&quot;Content: {obj.properties[&#39;content&#39;][:100]}...&quot;)<br>        print(&quot;-&quot; * 80)<br>    <br>    return response.objects<br><br>results = hybrid_search(&quot;critical escalation with enterprise accounts&quot;, alpha=0.5)</pre><p>And run the file.</p><p><strong>Why hybrid?</strong> It finds accounts that are both:</p><ul><li>Semantically similar to the concept</li><li>Literal matches for the keywords</li></ul><p>This usually produces the best results for business queries.</p><h3>Option 4: Filtered semantic search</h3><p>The real power comes from combining semantic search with structured filters. Add this routine to the end of the <strong>weaviate_search.py</strong> file:</p><pre>from 
weaviate.classes.query import Filter<br><br>def filtered_search(query_text: str, segment: str = None,<br>                    max_health: int = None, min_arr: float = None, limit: int = 5):<br>    collection = client.collections.get(&quot;AccountIntelligence&quot;)<br>    query_vector = [0.1, 0.3, -0.5, 0.7, 0.2, -0.6, 0.4, 0.1]<br><br>    # Build filter<br>    where_filter = None<br>    if segment:<br>        where_filter = Filter.by_property(&quot;segment&quot;).equal(segment)<br>    if max_health is not None:<br>        health_filter = Filter.by_property(&quot;healthScore&quot;).less_than(max_health)<br>        where_filter = where_filter &amp; health_filter if where_filter else health_filter<br>    if min_arr is not None:<br>        arr_filter = Filter.by_property(&quot;arr&quot;).greater_than(min_arr)<br>        where_filter = where_filter &amp; arr_filter if where_filter else arr_filter<br><br>    # near_vector takes the filter via the &#39;filters&#39; argument<br>    response = collection.query.near_vector(<br>        near_vector=query_vector,<br>        filters=where_filter,<br>        limit=limit,<br>        return_metadata=MetadataQuery(distance=True)<br>    )<br><br>    print(f&quot;\n🔍 Filtered search: &#39;{query_text}&#39;&quot;)<br>    print(f&quot;   Filters: segment={segment}, health&lt;{max_health}, ARR&gt;${min_arr}\n&quot;)<br><br>    for obj in response.objects:<br>        print(f&quot;Account: {obj.properties[&#39;accountName&#39;]}&quot;)<br>        print(f&quot;Segment: {obj.properties[&#39;segment&#39;]}&quot;)<br>        print(f&quot;Health: {obj.properties[&#39;healthScore&#39;]}&quot;)<br>        print(f&quot;ARR: ${obj.properties[&#39;arr&#39;]:,}&quot;)<br>        print(f&quot;Distance: {obj.metadata.distance:.3f}&quot;)<br>        print(&quot;-&quot; * 80)<br><br>    return response.objects<br><br># Try it - find at-risk enterprise accounts with high revenue<br>results = filtered_search(<br>   
 &quot;accounts showing churn risk&quot;,<br>    segment=&quot;Enterprise&quot;,<br>    max_health=50,<br>    min_arr=200000<br>)</pre><p>Result:</p><pre>🔍 Filtered search: &#39;accounts showing churn risk&#39;<br>   Filters: segment=Enterprise, health&lt;50, ARR&gt;$200000<br>Account: Global Solutions Ltd<br>Segment: Enterprise<br>Health: 28<br>ARR: $780,000<br>Distance: 0.156<br>--------------------------------------------------------------------------------</pre><h3>Part 5: Ollama and local vectorization</h3><p>We use Ollama for local embedding generation. This keeps costs low and data private.</p><h3>What is vectorization?</h3><p>Before we dive into Ollama, let’s clarify what <strong>vectorization</strong> actually means.</p><blockquote><strong>Vectorization</strong> is the process of converting text (or other data) into numerical vectors that capture semantic meaning.</blockquote><p>Think of it as translating human language into math that computers can compare:</p><pre>Text (Human Language):<br>&quot;Customer experiencing integration challenges&quot;<br> ↓ VECTORIZATION ↓<br>Vector (Machine Language):<br>[0.023, -0.156, 0.671, 0.445, -0.823, … 768 numbers total]</pre><p><strong>Why numbers?</strong> Because computers can efficiently:</p><ul><li>Compare vectors using mathematical distance (cosine similarity, Euclidean distance)</li><li>Search through millions of vectors in milliseconds</li><li>Cluster similar concepts automatically</li><li>Rank results by relevance</li></ul><p>The magic: Texts with similar meanings produce similar vectors, even if they use completely different words:</p><pre>&quot;The product is too slow&quot; → [0.21, -0.15, 0.63, …]<br>&quot;Performance issues&quot; → [0.19, -0.18, 0.65, …]<br>&quot;System is laggy&quot; → [0.23, -0.14, 0.61, …]<br> ↑ These vectors are close in 768-dimensional space</pre><p><strong>SQL analogy</strong>: If SQL indexes make lookups fast, vectorization makes <strong>semantic similarity</strong> fast. 
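That closeness claim is easy to check yourself. A minimal sketch, using invented 4-dimensional toy vectors as stand-ins for real 768-dim embeddings:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """dot(a, b) / (|a| * |b|): values near 1.0 mean similar direction (meaning)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors: two "performance complaint" texts and one unrelated "billing" text
too_slow = [0.21, -0.15, 0.63, 0.40]
perf_issues = [0.19, -0.18, 0.65, 0.38]
billing = [-0.70, 0.55, -0.10, 0.05]

print(cosine_similarity(too_slow, perf_issues))  # high, close to 1.0
print(cosine_similarity(too_slow, billing))      # low, here even negative
```

The same function works unchanged on the 768-dimensional vectors Ollama produces later in the article.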
But instead of indexing exact values, you’re indexing meaning.</p><h3>Why <a href="https://ollama.com/">Ollama</a>?</h3><p><strong>Ollama</strong> provides local LLM inference — no API calls, no costs, no data leaving your infrastructure.</p><p>For embeddings, we use <strong>nomic-embed-text</strong>:</p><ul><li>768 dimensions</li><li>MTEB score: 62.39 (competitive with OpenAI)</li><li>Runs locally on CPU or GPU</li><li>Free and open source</li></ul><h3>Setting up Ollama with Weaviate</h3><p>Stop the containers and add the Ollama service to your docker-compose.yml:</p><pre>docker-compose down --remove-orphans</pre><pre># docker-compose.yml - Add Ollama service<br>services:<br>  weaviate:<br>    image: semitechnologies/weaviate:1.34.0<br>    ports:<br>      - &quot;8080:8080&quot;<br>      - &quot;50051:50051&quot;<br>    environment:<br>      DEFAULT_VECTORIZER_MODULE: &#39;none&#39;<br>      CLUSTER_HOSTNAME: &#39;node1&#39;<br><br>  ollama:<br>    image: ollama/ollama:0.12.11<br>    ports:<br>      - &quot;11434:11434&quot;<br>    volumes:<br>      - ollama_data:/root/.ollama<br>    restart: unless-stopped<br>    healthcheck:<br>      test: [&quot;CMD&quot;, &quot;ollama&quot;, &quot;list&quot;]<br>      interval: 30s<br>      timeout: 10s<br>      retries: 3<br>    entrypoint: [&quot;/bin/sh&quot;, &quot;-c&quot;]<br>    command:<br>      - |<br>        ollama serve &amp;<br>        sleep 5<br>        ollama pull nomic-embed-text<br>        wait<br><br>volumes:<br>  ollama_data:</pre><p>Start Docker (again):</p><pre># Start services<br>docker-compose up -d</pre><p>Check the environment:</p><pre># Test if it works<br>curl http://localhost:11434/api/embeddings -d &#39;{<br>  &quot;model&quot;: &quot;nomic-embed-text&quot;,<br>  &quot;prompt&quot;: &quot;Test embedding generation&quot;<br>}&#39;</pre><h3>Generate embeddings with Ollama</h3><p>Add this code to a new file, <strong>weaviate_embeddings.py</strong>:</p><pre>import requests<br><br><br>def generate_embedding(text: str) 
-&gt; list[float]:<br>    &quot;&quot;&quot;<br>    Generate 768-dimensional embedding using Ollama<br>    &quot;&quot;&quot;<br>    response = requests.post(<br>        &quot;http://localhost:11434/api/embeddings&quot;,<br>        json={&quot;model&quot;: &quot;nomic-embed-text&quot;, &quot;prompt&quot;: text},<br>    )<br><br>    if response.status_code == 200:<br>        return response.json()[&quot;embedding&quot;]<br>    else:<br>        raise Exception(f&quot;Ollama error: {response.text}&quot;)<br><br><br># Test it<br>text = &quot;Account showing signs of technical challenges with integration&quot;<br>embedding = generate_embedding(text)<br>print(f&quot;Text: {text}&quot;)<br>print(f&quot;Embedding dimensions: {len(embedding)}&quot;)<br>print(f&quot;First 10 values: {embedding[:10]}&quot;)</pre><p>After running the file, you will see output like this:</p><pre>Text: Account showing signs of technical challenges with integration<br>Embedding dimensions: 768<br>First 10 values: [0.023, -0.156, 0.671, 0.445, -0.823, 0.121, 0.567, -0.334, 0.789, -0.234]</pre><h3>Complete pipeline: Data → embeddings → Weaviate</h3><p>Extend the file <strong>weaviate_embeddings.py</strong> with this code and run it:</p><pre>import weaviate<br><br>def process_and_insert_account(account_data: dict):<br>    &quot;&quot;&quot;<br>    Complete pipeline: take account data, generate embedding, insert to Weaviate<br>    &quot;&quot;&quot;<br>    # 1. 
Create rich content for vectorization<br>    content = f&quot;&quot;&quot;<br>    Account: {account_data[&#39;accountName&#39;]}<br>    Segment: {account_data[&#39;segment&#39;]}<br>    Health Score: {account_data[&#39;healthScore&#39;]}/100<br>    ARR: ${account_data[&#39;arr&#39;]:,}<br><br>    Recent Activity:<br>    {account_data[&#39;activity_summary&#39;]}<br><br>    Support Status:<br>    {account_data[&#39;support_summary&#39;]}<br><br>    Engagement Level:<br>    {account_data[&#39;engagement_summary&#39;]}<br>    &quot;&quot;&quot;<br><br>    # 2. Generate embedding with Ollama<br>    embedding = generate_embedding(content)<br><br>    # 3. Insert to Weaviate<br>    collection = client.collections.get(&quot;AccountIntelligence&quot;)<br><br>    result = collection.data.insert(<br>        properties={<br>            &quot;accountId&quot;: account_data[&quot;accountId&quot;],<br>            &quot;accountName&quot;: account_data[&quot;accountName&quot;],<br>            &quot;content&quot;: content,<br>            &quot;segment&quot;: account_data[&quot;segment&quot;],<br>            &quot;healthScore&quot;: account_data[&quot;healthScore&quot;],<br>            &quot;arr&quot;: account_data[&quot;arr&quot;]<br>        },<br>        vector=embedding<br>    )<br><br>    print(f&quot;✅ Inserted {account_data[&#39;accountName&#39;]} with UUID: {result}&quot;)<br>    return result<br><br>try:<br>    client = weaviate.connect_to_local(host=&quot;localhost&quot;, port=8080, grpc_port=50051)<br>    # Example usage<br>    account = {<br>        &quot;accountId&quot;: &quot;ACC004&quot;,<br>        &quot;accountName&quot;: &quot;DataCorp Systems&quot;,<br>        &quot;segment&quot;: &quot;Enterprise&quot;,<br>        &quot;healthScore&quot;: 65,<br>        &quot;arr&quot;: 890000,<br>        &quot;activity_summary&quot;: &quot;Regular product usage, 3 admin logins per week&quot;,<br>        &quot;support_summary&quot;: &quot;2 open tickets, both low priority&quot;,<br> 
       &quot;engagement_summary&quot;: &quot;Attended last webinar, CSM meeting scheduled&quot;<br>    }<br>    process_and_insert_account(account)<br>finally:<br>    client.close()</pre><p><strong>Key benefits of this approach:</strong></p><ol><li><strong>No API costs</strong>: Ollama runs locally</li><li><strong>Data privacy</strong>: Nothing leaves your infrastructure</li><li><strong>Consistency</strong>: Same model for all embeddings</li><li><strong>Performance</strong>: ~100–200ms per embedding on CPU</li><li><strong>Quality</strong>: nomic-embed-text performs comparably to paid solutions</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/960/1*CbMoie76KGAKXK4GlxFqsw.png" /><figcaption>Pipeline flow</figcaption></figure><h3>Part 6: Search + RAG with Claude</h3><p>The final piece: combining Weaviate search with Claude for intelligent analysis.</p><h3>The RAG pattern (Retrieval-Augmented Generation)</h3><p><strong>RAG prevents hallucinations</strong> by grounding AI responses in retrieved facts:</p><ol><li><strong>Retrieve</strong>: Search Weaviate for relevant accounts</li><li><strong>Augment</strong>: Package search results as context</li><li><strong>Generate</strong>: Claude analyzes the real data, not inventing information</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/998/1*LHBG5cpMTvJ_wof0UgCfbg.png" /><figcaption>Hybrid search + RAG (Claude)</figcaption></figure><h3>Implementation: Search + Claude analysis</h3><p>You should have your ANTHROPIC_API_KEY generated:</p><pre>export ANTHROPIC_API_KEY=&quot;sk-ant-api...&quot;</pre><p>and then you can do the RAG with Weaviate. 
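Before touching the API, it can save a confusing stack trace to fail fast when the key is missing. A small defensive sketch (the helper name require_api_key is my own, not part of the original pipeline; the anthropic client reads ANTHROPIC_API_KEY from the environment by default):

```python
import os

def require_api_key(name: str = "ANTHROPIC_API_KEY") -> str:
    # Fail early with a clear message instead of a deep auth error later
    key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running the RAG code")
    return key
```

Call `require_api_key()` once at startup; if it returns, the rest of the pipeline can assume the key is present.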
Alter the file <strong>weaviate_embeddings.py </strong>and run it:</p><pre>import anthropic<br>import json<br>from weaviate.classes.query import MetadataQuery<br><br>def search_and_analyze(user_query: str, limit: int = 10) -&gt; dict:<br>    &quot;&quot;&quot;<br>    Complete RAG pipeline: Search Weaviate → Analyze with Claude<br>    &quot;&quot;&quot;<br>    # Step 1: Semantic search in Weaviate<br>    print(f&quot;🔍 Searching Weaviate for: &#39;{user_query}&#39;&quot;)<br>    <br>    query_embedding = generate_embedding(user_query)<br>    collection = client.collections.get(&quot;AccountIntelligence&quot;)<br>    <br>    search_results = collection.query.near_vector(<br>        near_vector=query_embedding,<br>        limit=limit,<br>        return_metadata=MetadataQuery(distance=True, certainty=True)<br>    )<br>    <br>    print(f&quot;📊 Found {len(search_results.objects)} relevant accounts\n&quot;)<br>    <br>    # Step 2: Format results as context for Claude<br>    context_accounts = []<br>    for obj in search_results.objects:<br>        context_accounts.append({<br>            &quot;account_id&quot;: obj.properties[&quot;accountId&quot;],<br>            &quot;account_name&quot;: obj.properties[&quot;accountName&quot;],<br>            &quot;segment&quot;: obj.properties[&quot;segment&quot;],<br>            &quot;health_score&quot;: obj.properties[&quot;healthScore&quot;],<br>            &quot;arr&quot;: obj.properties[&quot;arr&quot;],<br>            &quot;content&quot;: obj.properties[&quot;content&quot;],<br>            &quot;relevance_score&quot;: round(obj.metadata.certainty, 3)<br>        })<br>    <br>    # Step 3: Build context window for Claude<br>    context_package = {<br>        &quot;query&quot;: user_query,<br>        &quot;total_accounts_found&quot;: len(context_accounts),<br>        &quot;accounts&quot;: context_accounts<br>    }<br>    <br>    context_json = json.dumps(context_package, indent=2)<br>    <br>    # Step 4: Analyze with Claude<br> 
   print(&quot;🤖 Analyzing with Claude...\n&quot;)<br>    <br>    client_anthropic = anthropic.Anthropic()<br>    <br>    message = client_anthropic.messages.create(<br>        model=&quot;claude-sonnet-4-20250514&quot;,<br>        max_tokens=2000,<br>        temperature=0,  # Deterministic responses<br>        messages=[{<br>            &quot;role&quot;: &quot;user&quot;,<br>            &quot;content&quot;: f&quot;&quot;&quot;You are analyzing customer account data. You must ONLY use information from the provided search results. Do not make up or assume any information.<br>SEARCH RESULTS:<br>{context_json}<br>USER QUESTION: {user_query}<br>Analyze the accounts found and provide:<br>1. Key patterns or themes across these accounts<br>2. Specific risk factors or opportunities identified<br>3. Actionable recommendations with account examples<br>4. Priority ranking if applicable<br>Cite specific accounts by name when making claims. If the data is insufficient to answer the question, state that explicitly.&quot;&quot;&quot;<br>        }]<br>    )<br>    <br>    analysis = message.content[0].text<br>    <br>    # Step 5: Return complete response<br>    return {<br>        &quot;query&quot;: user_query,<br>        &quot;search_results&quot;: context_accounts,<br>        &quot;result_count&quot;: len(context_accounts),<br>        &quot;claude_analysis&quot;: analysis<br>    }<br><br># Try it!<br>result = search_and_analyze(<br>    &quot;Which high-value accounts are showing signs of risk and need immediate attention?&quot;<br>)<br>print(&quot;=&quot; * 80)<br>print(&quot;CLAUDE&#39;S ANALYSIS:&quot;)<br>print(&quot;=&quot; * 80)<br>print(result[&quot;claude_analysis&quot;])<br>print(&quot;\n&quot; + &quot;=&quot; * 80)<br>print(f&quot;\nBased on {result[&#39;result_count&#39;]} accounts retrieved from Weaviate&quot;)</pre><p>Result:</p><pre>🔍 Searching Weaviate for: &#39;Which high-value accounts are showing signs of risk and need immediate attention?&#39;<br>📊 Found 3 
relevant accounts<br>🤖 Analyzing with Claude...<br>================================================================================<br>CLAUDE&#39;S ANALYSIS:<br>================================================================================<br>PRIORITY AT-RISK ACCOUNTS ANALYSIS<br>Based on the search results, I&#39;ve identified 2 high-value accounts requiring immediate attention:<br>1. **CRITICAL: Global Solutions Ltd** (ARR: $780,000)<br>   Risk Level: SEVERE (Health Score: 28/100)<br>   <br>   Key Issues:<br>   - Critical escalation currently active<br>   - Integration challenges blocking production deployment<br>   - Executive stakeholder expressing frustration<br>   - Competitors mentioned in recent calls<br>   - Relevance Score: 0.892 (strong semantic match to query)<br>   <br>   Immediate Actions Needed:<br>   - Executive engagement within 24-48 hours<br>   - Technical escalation team assignment<br>   - Competitor analysis and value proposition reinforcement<br>   - Timeline: Address within this week<br>2. **HIGH PRIORITY: TechStart Inc** (ARR: $125,000)<br>   Risk Level: HIGH (Health Score: 42/100)<br>   <br>   Key Issues:<br>   - Multiple support tickets on API performance<br>   - Low license utilization (35%)<br>   - Budget concerns noted by CSM<br>   - Contract renewal in 60 days<br>   - Relevance Score: 0.765<br>   <br>   Immediate Actions Needed:<br>   - Performance issue resolution<br>   - Value demonstration to justify renewal<br>   - Budget discussion with stakeholders<br>   - Timeline: Next 30 days critical<br>COMMON PATTERNS:<br>- Both accounts show technical challenges as primary risk factor<br>- Support escalations correlate with low health scores<br>- Executive stakeholder sentiment is key indicator<br>RECOMMENDATION PRIORITY:<br>1. Global Solutions Ltd - Highest ARR, lowest health, critical escalation<br>2. 
TechStart Inc - Renewal timeline urgency, budget sensitivity<br>Note: Acme Corp (Health: 82, ARR: $450K) was also in results but shows positive indicators and doesn&#39;t require immediate intervention.<br>================================================================================<br>Based on 3 accounts retrieved from Weaviate</pre><p><strong>Why this works:</strong></p><ol><li><strong>No hallucinations</strong>: Claude only analyzes the 3 accounts Weaviate returned</li><li><strong>Cited examples</strong>: Every claim references specific accounts</li><li><strong>Grounded in facts</strong>: Health scores, ARR, and issues come from real data</li><li><strong>Actionable</strong>: Recommendations tied to specific accounts and timeframes</li></ol>
<h3>Advanced RAG: Adding filters to search</h3><p>Alter the file <strong>weaviate_embeddings.py</strong> and run it:</p><pre>from weaviate.classes.query import Filter<br><br>def filtered_search_and_analyze(<br>        user_query: str,<br>        segment: str = None,<br>        max_health: int = None,<br>        min_arr: float = None,<br>        limit: int = 10<br>) -&gt; dict:<br>    &quot;&quot;&quot;<br>    RAG with structured filters for business rules<br>    &quot;&quot;&quot;<br>    # Build Weaviate filters<br>    filters = []<br>    if segment:<br>        filters.append(Filter.by_property(&quot;segment&quot;).equal(segment))<br>    if max_health is not None:<br>        filters.append(Filter.by_property(&quot;healthScore&quot;).less_than(max_health))<br>    if min_arr is not None:<br>        filters.append(Filter.by_property(&quot;arr&quot;).greater_than(min_arr))<br><br>    where_filter = None<br>    if filters:<br>        where_filter = filters[0]<br>        for f in filters[1:]:<br>            where_filter = where_filter &amp; f<br><br>    # Search with filters<br>    query_embedding = generate_embedding(user_query)<br>    collection = client.collections.get(&quot;AccountIntelligence&quot;)<br><br>    search_results = collection.query.near_vector(<br>        near_vector=query_embedding,<br>        filters=where_filter,<br>        limit=limit,<br>        return_metadata=MetadataQuery(distance=True)<br>    )<br><br>    # Package results for Claude<br>    context_accounts = []<br>    for obj in search_results.objects:<br>        context_accounts.append({<br>            &quot;account_name&quot;: obj.properties[&quot;accountName&quot;],<br>            &quot;segment&quot;: obj.properties[&quot;segment&quot;],<br>            &quot;health_score&quot;: obj.properties[&quot;healthScore&quot;],<br>            &quot;arr&quot;: obj.properties[&quot;arr&quot;],<br>            &quot;content&quot;: obj.properties[&quot;content&quot;]<br>        })<br><br>    # Inform Claude about filters applied<br>    filter_description = []<br>    if segment:<br>        filter_description.append(f&quot;segment={segment}&quot;)<br>    if max_health:<br>        filter_description.append(f&quot;health&lt;{max_health}&quot;)<br>    if min_arr:<br>        filter_description.append(f&quot;ARR&gt;${min_arr:,}&quot;)<br><br>    filters_text = &quot; AND &quot;.join(filter_description) if filter_description else &quot;None&quot;<br><br>    # Analyze with Claude<br>    client_anthropic = anthropic.Anthropic()<br><br>    message = client_anthropic.messages.create(<br>        model=&quot;claude-sonnet-4-20250514&quot;,<br>        max_tokens=2000,<br>        temperature=0,<br>        messages=[{<br>            &quot;role&quot;: &quot;user&quot;,<br>            &quot;content&quot;: f&quot;&quot;&quot;Analyze these filtered account search results.<br><br>FILTERS APPLIED: {filters_text}<br>QUERY: {user_query}<br><br>ACCOUNTS FOUND ({len(context_accounts)}):<br>{json.dumps(context_accounts, indent=2)}<br><br>Provide analysis specifically considering the filter context.&quot;&quot;&quot;<br>        }]<br>    )<br><br>    
return {<br>        &quot;query&quot;: user_query,<br>        &quot;filters_applied&quot;: filters_text,<br>        &quot;search_results&quot;: context_accounts,<br>        &quot;claude_analysis&quot;: message.content[0].text<br>    }<br><br># Use it<br>result = filtered_search_and_analyze(<br>    user_query=&quot;What are the common challenges?&quot;,<br>    segment=&quot;Enterprise&quot;,<br>    max_health=50,<br>    min_arr=500000<br>)<br>print(result[&quot;claude_analysis&quot;])</pre><p>This pattern enables you to ask: <em>“What challenges face high-value Enterprise accounts?”</em> and receive analysis based on that exact filtered subset.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/968/1*4sc4bYixHGb26dxEI9GWxQ.png" /></figure><h3>Putting it all together: Complete workflow</h3><p>Here’s the complete pattern:</p><pre># 1. Generate embeddings locally<br>embedding = generate_embedding(account_content)<br><br># 2. Store in Weaviate with metadata<br>collection.data.insert(<br>    properties=account_properties,<br>    vector=embedding<br>)<br># 3. Search semantically with filters<br>results = collection.query.near_vector(<br>    near_vector=query_embedding,<br>    filters=business_filters,<br>    limit=20<br>)<br># 4. Analyze with Claude (grounded in facts)<br>analysis = claude.analyze(<br>    search_results=results,<br>    user_query=query<br>)<br># 5. 
Return actionable insights<br>return {<br>    &quot;search_results&quot;: results,<br>    &quot;ai_analysis&quot;: analysis<br>}</pre><p><strong>This architecture prevents hallucinations because:</strong></p><ul><li>Weaviate retrieves <strong>facts</strong> (actual account data)</li><li>Claude analyses <strong>only</strong> what Weaviate returned</li><li>No generation without retrieval</li><li>Every claim cites specific data</li></ul><h3>Summary: What you’ve learned</h3><p>By now, you should understand:</p><p>✅ <strong>Vector databases</strong> convert text to numbers that capture meaning<br> ✅ <strong>Weaviate</strong> stores vectors + metadata for hybrid search<br> ✅ <strong>Collections</strong> are like SQL tables with semantic search superpowers<br> ✅ <strong>Schema design</strong> separates vectorized content from filterable properties<br> ✅ <strong>Search modes</strong>: semantic, keyword, hybrid, filtered<br> ✅ <strong>Vectorization</strong> is the process of converting text into numerical embeddings<br> ✅ <strong>Ollama</strong> generates embeddings locally with no API costs<br> ✅ <strong>RAG pattern</strong> grounds Claude in retrieved facts</p><p><strong>Next steps to deepen your understanding:</strong></p><ol><li>Deploy the docker-compose environment and try the code samples</li><li>Experiment with different alpha values in hybrid search</li><li>Compare semantic vs keyword results for your own queries</li><li>Build a simple RAG application combining search + LLM</li><li>Monitor embedding quality and query performance</li></ol><p>Vector search isn’t magic — it’s engineering with semantic understanding.</p><p><strong>Happy embeddings!</strong></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=2badafc4081a" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Snowflake CORTEX_COMPLETE in Full Throttle]]></title>
            <link>https://medium.com/snowflake/snowflake-cortex-complete-in-full-throttle-eb6d143f451a?source=rss-ff65005cbd7e------2</link>
            <guid isPermaLink="false">https://medium.com/p/eb6d143f451a</guid>
            <category><![CDATA[dbt]]></category>
            <category><![CDATA[snowflake]]></category>
            <category><![CDATA[snowflake-cortex-ai]]></category>
            <category><![CDATA[gitlab]]></category>
            <category><![CDATA[ai]]></category>
            <dc:creator><![CDATA[Radovan Bacovic]]></dc:creator>
            <pubDate>Wed, 10 Dec 2025 20:02:11 GMT</pubDate>
            <atom:updated>2025-12-10T20:02:11.567Z</atom:updated>
<content:encoded><![CDATA[<h4>AI-based summarisation and categorisation to consume customer tickets</h4><h3>The challenge of locked data</h3><p>GitLab’s support team wants to process <strong>100k+</strong> customer tickets — a valuable source of customer feedback, product issues, and improvement opportunities. The traditional approach? Manual, ad-hoc summaries requested one at a time. Not scalable, not secure, and not practical.</p><p>We were looking for an automated solution that could summarise and categorise customer support tickets for analytical purposes, converting them into actionable insights without exposing private customer information.</p><h3>Unlocking business value through AI</h3><p>This wasn’t just a technical exercise. Unlocking this dataset meant:</p><ul><li><strong>For Product Teams:</strong> Direct customer feedback to prioritise features and fix recurring issues.</li><li><strong>For Customer Success:</strong> Pattern recognition across accounts to prevent churn.</li><li><strong>For Sales:</strong> Understanding pain points to improve positioning and solutions.</li><li><strong>For the GitLab Data Team:</strong> Dogfooding our own AI capabilities at scale.</li></ul><p><strong>The business case</strong> was that <strong>100k+</strong> tickets represented years of customer voice sitting unused due to compliance constraints.</p><h3>Building the AI processing pipeline</h3><h4>Architecture decision: CORTEX_COMPLETE over Claude API</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Sk7PiN7DbwuGkzGuJztyAg.png" /></figure><p>What is <a href="https://docs.snowflake.com/en/sql-reference/functions/complete-snowflake-cortex"><strong>CORTEX_COMPLETE</strong></a>?</p><blockquote>Given a prompt, generates a response (completion) using your choice of supported language model.</blockquote><p><strong>The critical decision:</strong> We chose Snowflake’s <strong>CORTEX_COMPLETE</strong> function over calling the <a 
href="https://www.claude.com/platform/api"><strong>Claude API</strong></a> directly (real-time or batch).</p><p><strong>Why? Infrastructure simplicity trumps cost optimisation.</strong></p><p>When evaluating our options, we considered three approaches:</p><ol><li><strong>Claude API Real-time:</strong> Call Anthropic’s API directly from Python/external services</li><li><strong>Claude API Batch:</strong> Use Anthropic’s batch processing through API for lower costs</li><li><strong>Snowflake CORTEX_COMPLETE:</strong> Use Snowflake’s native AI function</li></ol><p><strong>CORTEX_COMPLETE won decisively</strong> (even though, personally, I was impressed with the Claude API’s capabilities) because:</p><ol><li><strong>Zero infrastructure overhead</strong></li></ol><ul><li>No external API keys to manage and rotate</li><li>No Python services to deploy, monitor, or scale</li><li>No network egress to configure and secure</li><li>No retry logic, rate limiting, or error handling to implement</li><li>No additional compute environments beyond our existing Snowflake warehouse</li></ul><p><strong>2. Data never leaves Snowflake</strong></p><ul><li>No need to export tickets to external processing services</li><li>Compliance teams approved it — no cross-boundary data flow</li><li>Audit trail built into Snowflake’s query history</li></ul><p><strong>3. 
Native </strong><a href="https://www.getdbt.com/"><strong>dbt</strong></a><strong> integration</strong></p><ul><li>Process tickets with pure <strong>SQL</strong> in <a href="https://www.getdbt.com/"><strong>dbt</strong></a> models</li><li>No orchestration of external services or API calls</li><li>Standard <a href="https://docs.getdbt.com/docs/build/incremental-models"><strong>dbt incremental</strong></a> patterns work out of the box</li><li>Developers work in a single environment (dbt/Snowflake)</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*J5Y-GMKSGBWD_aWg3rFntQ.png" /><figcaption>Decision-making process for choosing the proper architecture</figcaption></figure><p><strong>Yes, CORTEX_COMPLETE costs more per token than direct Claude API calls.</strong> However, when you factor in the engineering time saved — with no infrastructure to build, maintain, or troubleshoot — the total cost of ownership (TCO) is significantly lower. Roughly, a direct Claude API call was 10–15% cheaper, but the overall total cost of ownership is 30% lower when using <strong>CORTEX_COMPLETE</strong>.</p><p><strong>The trade-off is clear:</strong> Pay slightly more per API call to eliminate weeks of infrastructure work, ongoing operational overhead, and security complexity. For our use case, this was a logical decision.</p><p>We chose Snowflake’s CORTEX_COMPLETE function with <a href="https://platform.claude.com/docs/en/about-claude/models/overview"><strong>Claude 4 Sonnet</strong></a> as our processing engine, and business stakeholders validated the quality and consistency of the results. 
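To make the trade-off concrete, here is a back-of-the-envelope sketch in Python. Every dollar figure below is a hypothetical placeholder; only the 10–15% per-call and roughly 30% TCO ratios come from our evaluation:

```python
# Back-of-the-envelope TCO comparison. All dollar figures are
# illustrative assumptions; only the "~10-15% cheaper per call" and
# "~30% lower TCO" ratios reflect what we observed.

def total_cost_of_ownership(api_spend: float, infra_spend: float) -> float:
    """TCO = direct AI-call spend plus engineering/operations spend."""
    return api_spend + infra_spend

claude_api_spend = 10_000.0                       # hypothetical direct API spend
cortex_api_spend = claude_api_spend / (1 - 0.12)  # ~12% pricier per call

claude_infra_spend = 6_500.0  # keys, retries, monitoring, deploys (hypothetical)
cortex_infra_spend = 0.0      # nothing beyond the existing Snowflake warehouse

claude_tco = total_cost_of_ownership(claude_api_spend, claude_infra_spend)
cortex_tco = total_cost_of_ownership(cortex_api_spend, cortex_infra_spend)

savings = 1 - cortex_tco / claude_tco
print(f"Cortex TCO is {savings:.0%} lower")  # roughly the ~30% we saw
```

The point of the sketch: the per-call premium is dwarfed by the infrastructure spend it eliminates.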
After processing thousands of tickets, teams reported high satisfaction with the model&#39;s ability to extract product categories and assess sentiment for customer tickets.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*bo9vzsOF_6RgDc2yMEc2Wg.png" /><figcaption>Data processing architecture for customer tickets</figcaption></figure><p>Note: You can use any of the well-known models for this purpose; it’s up to you. Here is the <a href="https://docs.snowflake.com/en/user-guide/snowflake-cortex/aisql#regional-availability"><strong>complete list</strong></a> of the available models.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*aSgnhyDTPc0Ri_3xdSYL1w.png" /><figcaption>Partial list of models available for use in the CORTEX_COMPLETE function (per region)</figcaption></figure><h3>The core: AI processing function</h3><p>Here’s the actual SQL function that does the heavy lifting:</p><pre>CREATE OR REPLACE FUNCTION ANALYZE_TICKET(<br>    COMMENTS VARCHAR, <br>    IS_PROD  BOOLEAN DEFAULT FALSE<br>)<br>RETURNS VARIANT<br>AS &#39;<br>    IFF(<br>        is_prod = TRUE,<br>        -- PRODUCTION MODE: Call Snowflake Cortex<br>        SNOWFLAKE.CORTEX.COMPLETE(<br>            &#39;&#39;claude-4-sonnet&#39;&#39;,<br>            ARRAY_CONSTRUCT(<br>                OBJECT_CONSTRUCT(<br>                    &#39;&#39;role&#39;&#39;, &#39;&#39;user&#39;&#39;,<br>                    &#39;&#39;content&#39;&#39;, CONCAT(<br>                        &#39;&#39;Analyze these support tickets. 
<br>                        PRIVACY CRITICAL: ...<br>                        <br>                        Return ONLY JSON with this structure:<br>                        {<br>                            ...<br>                        }&#39;&#39;<br>                    )<br>                )<br>            )<br>        ),<br>        -- TEST MODE: Return dummy response<br>        PARSE_JSON(&#39;&#39;{&quot;choices&quot;: [{&quot;messages&quot;: &quot;dummy_data&quot;}]}&#39;&#39;)<br>    )<br>&#39;</pre><p><strong>That’s it.</strong> No external services. No API key management. No network configuration. Just SQL calling a native Snowflake AI function. The complexity reduction alone justified the higher per-token cost.</p><h3>Why the IS_PROD parameter matters</h3><p>The IS_PROD boolean parameter is critical for several reasons:</p><ul><li><strong>Cost control during development.</strong> Every CORTEX_COMPLETE call costs money. During development and testing, we are iterating on SQL logic, debugging dbt models, and validating data transformations. Without the IS_PROD guard, every dbt run in a development environment would trigger expensive API calls for the entire dataset. With <strong>100k+</strong> tickets at stake, testing iterations could burn through thousands of dollars before you even reach production, without delivering any value.</li><li><strong>Faster development cycles.</strong> Calling CORTEX_COMPLETE adds latency — each AI inference takes time. 
In test mode, the function returns instant dummy JSON, allowing developers to validate parsing logic, test incremental strategies, and debug SQL transformations without waiting for real AI processing.</li><li><strong>Preventing accidental production runs.</strong> The parameter creates an explicit contract: <em>“This function only processes real data when explicitly told it’s production.”</em> This prevents accidental full backfills triggered by a misplaced dbt run --full-refresh in the wrong environment.</li></ul><p>Implementation in <strong>dbt</strong>:</p><pre>analyze_ticket(<br>    comments =&gt; ticket_content,<br>    is_prod =&gt; {% if target.name == &#39;prod&#39; %} TRUE {% else %} FALSE {% endif %}<br>)</pre><p>Only when target.name == &#39;prod&#39; does the real processing happen. Every other environment gets dummy data for testing.</p><h3>The dbt pipeline: simple and efficient</h3><p>The dbt model orchestrates the entire flow:</p><p><strong>Step 1: Filter relevant tickets</strong></p><pre>WHERE created_at &gt;= DATEADD(&#39;year&#39;, -3, CURRENT_DATE())<br>  AND ticket_status = &#39;closed&#39;<br></pre><p><strong>Step 2: Exclude noise.</strong> Remove trial accounts, free users, bot-generated tickets, and password resets using tag-based filtering. 
This was also done using a SQL function:</p><pre>CREATE OR REPLACE FUNCTION clean_content(content_text VARCHAR)<br>    RETURNS STRING<br>    LANGUAGE SQL<br>    COMMENT = &#39;Cleans text content by removing PII, paths, and noise&#39;<br>    AS<br>    $$<br>        TRIM(<br>            REGEXP_REPLACE(<br>                REGEXP_REPLACE(<br>                    REGEXP_REPLACE(<br>                        REGEXP_REPLACE(<br>                            content_text,<br>                            &#39;...&#39;,<br>                            &#39;&#39;, 1, 0, &#39;si&#39;<br>                        ),<br>                        &#39;...&#39;,<br>                        &#39;[PATH]&#39;, 1, 0, &#39;i&#39;<br>                    ),<br>                    &#39;...&#39;,<br>                    &#39;[ID]&#39;, 1, 0, &#39;i&#39;<br>                ),<br>                &#39;...&#39;,<br>                ...<br>            )<br>        )<br>    $$;</pre><p><strong>Step 3: AI processing</strong></p><pre>analyze_ticket(<br>    comments =&gt; comments,<br>    is_prod  =&gt; TRUE<br>) AS ai_processed_results</pre><p><strong>Step 4: Structured output.</strong> Parse the JSON response into analytical columns.</p><h3>Why do we prevent full backfills?</h3><p>Notice this critical configuration:</p><pre>{{ config(<br>    materialized = &quot;incremental&quot;,<br>    incremental_strategy = &quot;append&quot;,<br>    unique_key = &quot;ticket_id&quot;,<br>    full_refresh = false  -- THIS IS CRITICAL<br>) }}</pre><p>The full_refresh = false setting is not optional — it&#39;s a financial and operational safeguard:</p><ul><li><strong>Cost protection:</strong> A single accidental dbt run --full-refresh would reprocess all <strong>100k+</strong> tickets, costing hundreds to thousands of dollars in Cortex API calls. The false flag prevents this disaster scenario.</li><li><strong>Idempotency guarantee:</strong> Once a closed ticket is processed, it never changes — the ticket is immutable. 
Reprocessing would generate identical results while wasting compute and money.</li><li><strong>Performance optimisation:</strong> The initial backfill took 10 hours. Preventing full refreshes ensures we only process the incremental delta of newly closed tickets each day.</li></ul><p>The incremental logic ensures we never double-process as the data is immutable:</p><pre>{% if is_incremental() %}<br>  AND ticket_id NOT IN (SELECT ticket_id FROM {{ this }})<br>{% endif %}<br></pre><p>The result? <a href="https://docs.snowflake.com/en/sql-reference/functions/complete-snowflake-cortex">CORTEX_COMPLETE</a> transforms raw data into clean, categorised, de-identified insights ready for analytics. The output table is safe to connect to Tableau or any other visual tool, share with external teams, and query without compliance restrictions.</p><p>Every field that once contained customer names, emails, or authentication details now contains generic terms or structured categories like “Frustrated sentiment due to recurring pipeline failures.”</p><h3>Why prompt engineering is critical</h3><p>In our use case, the prompt isn’t just important — it’s the entire control mechanism for data governance and business value extraction. 
A poorly designed prompt would either:</p><ul><li><strong>Miss business context:</strong> Vague prompts like <em>“summarise this ticket”</em> would produce useless generic summaries instead of structured categorisation by product stage, severity, and sentiment.</li><li><strong>Generate inconsistent output:</strong> Without specifying the exact JSON structure and enum values <em>(Frustrated/Concerned/Satisfied/Neutral)</em>, downstream analytics would break due to parsing errors or inconsistent categories.</li></ul><p>Our prompt does three critical jobs simultaneously:</p><ol><li><strong>Data governance enforcement:</strong> Explicitly instructs the AI to remove personal data using concrete examples</li><li><strong>Business logic implementation:</strong> Defines 15+ structured fields matching GitLab’s product taxonomy</li><li><strong>Quality control:</strong> Requires explanations for classifications (sentiment_reason, severity_reason) to ensure the AI isn’t guessing</li></ol><p>The prompt is essentially a <strong>data governance policy written in natural language</strong> and executed at scale by AI. Get it wrong, and you expose private data or generate useless output. Get it right, and you transform compliance-restricted data into a strategic asset.</p><h3>Production performance: fast and affordable</h3><p>After the initial backfill, daily operations are remarkably efficient:</p><p><strong>Daily incremental runs complete in under 1 minute</strong> on a Snowflake <strong><em>L-size</em></strong> warehouse. Since we’re only processing newly closed tickets <em>(typically 50–100 per day)</em>, the compute overhead is minimal. 
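For intuition, here is a rough sketch of what a daily incremental run might cost. The 50–100 tickets-per-day volume matches our runs; the tokens-per-ticket count and the per-million-token price are invented placeholders, not Snowflake's published Cortex pricing:

```python
# Rough estimate of one day's Cortex spend for the incremental run.
# The 50-100 tickets/day volume is real; tokens-per-ticket and the
# per-token price are hypothetical placeholders, NOT published pricing.

def daily_cortex_cost(tickets: int, tokens_per_ticket: int,
                      usd_per_million_tokens: float) -> float:
    """Estimated spend (USD) for one incremental dbt run."""
    total_tokens = tickets * tokens_per_ticket
    return total_tokens / 1_000_000 * usd_per_million_tokens

# Assume ~2,000 tokens per ticket (prompt + response) at an assumed
# blended price of $15 per million tokens:
for tickets in (50, 100):
    print(f"{tickets} tickets -> ~${daily_cortex_cost(tickets, 2_000, 15.0):.2f}")
```

Under these assumptions a typical day lands in the low single-digit dollars, which matches what we observe in practice.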
The dbt model identifies unprocessed tickets, calls CORTEX_COMPLETE for each new ticket, and appends results to the target table—all within 1–2 minutes.</p><p><strong>AI processing costs remain low</strong> because:</p><ul><li>We only process closed tickets (immutable, process-once guarantee)</li><li>An incremental strategy prevents redundant API calls</li><li>Token usage is monitored per ticket to catch cost anomalies</li><li>The IS_PROD parameter is the guard that prevents accidentally expensive runs in development</li></ul><p>For a typical daily run processing 50–100 new tickets, the Cortex API cost is just a few bucks — far less than the value of unlocking <strong>100k+ </strong>tickets for business analysis. The combination of incremental processing, Snowflake’s native integration, and Claude 4 Sonnet’s efficiency makes this pipeline both performant and cost-effective at scale.</p><h3>The numbers</h3><ul><li><strong>Initial backfill:</strong> <strong>100k+ </strong>tickets processed</li><li><strong>Daily incremental:</strong> New closed tickets are automatically processed in under <strong>1–2</strong> minutes</li><li><strong>Processing time:</strong> <strong>~10</strong> hours for the initial load</li><li><strong>Daily AI costs:</strong> Just a few dollars or less for typical <strong>50–100</strong>-ticket increments</li><li><strong>Warehouse size:</strong> L-size (daily runs), can be even smaller</li><li><strong>Infrastructure complexity:</strong> Zero additional services beyond Snowflake</li></ul><h3>Conclusion: AI as a data governance tool</h3><p>This implementation demonstrates that AI isn’t just for generating content — it’s a powerful tool for <strong>data governance at scale</strong>. 
We transformed compliance-restricted data into a business asset through:</p><ol><li><strong>Automated de-identification</strong> — AI removes personal data more consistently than manual review</li><li><strong>Structured categorisation</strong> — Turn unstructured text into queryable dimensions</li><li><strong>Self-service access</strong> — Product and Success teams query directly without bottlenecks</li><li><strong>Native security</strong> — Processing happens inside Snowflake’s security ecosystem</li><li><strong>Product taxonomy alignment</strong> — Map customer issues to GitLab’s stages for strategic insights</li><li><strong>Zero infrastructure overhead</strong> — <strong>CORTEX_COMPLETE</strong> eliminated weeks of engineering work</li></ol><p>The real innovation isn’t the AI model — it’s choosing the right use case supported with the proper infrastructure to make AI processing operationally simple, secure, and maintainable at scale.</p><p><strong>Paying slightly more per API call to eliminate weeks of engineering work is the smartest business decision.</strong></p><p><em>Want to implement a similar AI-powered data processing solution? Choose infrastructure simplicity over cost optimisation. Snowflake Cortex keeps data secure with zero additional services; dbt provides orchestration with pure SQL; </em><strong><em>Claude 4 Sonnet</em></strong><em> delivers “good enough” accuracy and business results. 
The engineering time you save is worth far more than the marginal cost difference.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=eb6d143f451a" width="1" height="1" alt=""><hr><p><a href="https://medium.com/snowflake/snowflake-cortex-complete-in-full-throttle-eb6d143f451a">Snowflake CORTEX_COMPLETE in Full Throttle</a> was originally published in <a href="https://medium.com/snowflake">Snowflake Builders Blog: Data Engineers, App Developers, AI, &amp; Data Science</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How We Built a Structured Streamlit Framework in Snowflake: From Chaos to Compliance]]></title>
            <link>https://medium.com/snowflake/how-we-built-a-structured-streamlit-framework-in-snowflake-from-chaos-to-compliance-baa3b709aead?source=rss-ff65005cbd7e------2</link>
            <guid isPermaLink="false">https://medium.com/p/baa3b709aead</guid>
            <category><![CDATA[snowflake]]></category>
            <category><![CDATA[streamlit]]></category>
            <category><![CDATA[gitlab]]></category>
            <dc:creator><![CDATA[Radovan Bacovic]]></dc:creator>
            <pubDate>Wed, 15 Oct 2025 22:02:17 GMT</pubDate>
            <atom:updated>2025-10-15T22:02:17.173Z</atom:updated>
<content:encoded><![CDATA[<h4>How we transformed scattered Streamlit applications into a unified, secure, and scalable solution for the Snowflake environment</h4><h3><strong>What You Should Learn</strong></h3><p>What happens when you combine 🐍<strong>Python</strong>, <strong>Streamlit</strong>, ❄️<strong>Snowflake</strong> and 🦊 <strong>GitLab</strong>? Let’s find out together…</p><p>As GitLab’s Data team, we leveraged our unique position as customer zero by building this entire framework on GitLab’s own CI/CD infrastructure and project management tools. Here are our secret ingredients:</p><ol><li><a href="https://about.gitlab.com/platform/"><strong>GitLab</strong></a> <em>(product)</em> — the tool we create for DevSecOps success.</li><li><a href="https://www.snowflake.com/"><strong>Snowflake</strong></a> — our Single Source of Truth <em>(</em><strong><em>SSOT</em></strong><em>)</em> for the Data Warehouse activities <em>(and more than that).</em></li><li><a href="https://streamlit.io/"><strong>Streamlit</strong></a> — an open-source tool for visual applications, pure Python code under the hood.</li></ol><p>This provided us with immediate access to enterprise-grade DevSecOps capabilities, enabling us to implement automated testing, code review processes, and deployment pipelines from the outset. By utilizing GitLab’s built-in features for issue tracking, merge requests, and automated deployments <em>(CI/CD pipelines)</em>, we can iterate rapidly and validate our framework against real-world enterprise requirements. 
This internal-first approach ensured our solution was battle-tested on GitLab’s own infrastructure before any external implementation.</p><p>The most critical lesson from building the <strong>Streamlit Application Framework in </strong>❄️<strong>Snowflake</strong> in the <a href="http://about.gitlab.com"><strong>GitLab</strong></a> Data team is that structure beats chaos every time — implement governance early rather than retrofitting it later when maintenance becomes exponential.</p><p>Success requires clearly defining roles and responsibilities, separating infrastructure concerns from application development, so that each team can focus on its strengths.</p><p><strong>Security and compliance cannot be afterthoughts</strong>; they must be built into templates and automated processes from day one, because it’s far easier to enforce consistent standards upfront than to retrofit them onto existing applications. Invest heavily in automation and CI/CD pipelines, as manual processes don’t scale and introduce human error.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*zZSaZLEjpW72PFMRpBlRPg.png" /><figcaption>Architecture of the framework (general overview)</figcaption></figure><h3>What: The Problem We Solved</h3><p><strong>Imagine this scenario:</strong> Your organisation has dozens of Streamlit applications scattered across different environments, running various Python versions, connecting to sensitive data with inconsistent security practices. Some apps work, others break mysteriously, and nobody knows who built what or how to maintain them.</p><p>This was exactly the challenge our data team faced. Applications were being created in isolation, with no standardization, no security oversight, and no clear deployment process. The result? 
A compliance nightmare and a maintenance burden that was growing exponentially.</p><p>We built a comprehensive Streamlit Framework that transforms how data applications are created, maintained, and deployed in enterprise environments.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*3u-REbAq4L82O4ry1KeLHA.png" /><figcaption>Functional architectural design (high level)</figcaption></figure><h3>So What: Why the Streamlit Application Framework Changes Everything</h3><h4>Three clear roles, one unified process</h4><p>Our framework introduces a structured approach with three distinct roles:</p><ol><li><strong>Maintainers</strong> <em>(Data team members and contributors)</em> handle the infrastructure — CI/CD pipelines, security templates, and compliance rules. They ensure the framework runs smoothly and stays secure.</li><li><strong>Creators</strong> <em>(Those who need to build applications)</em> can focus on what they do best: creating visualizations, connecting to Snowflake data, and building user experiences. They have full flexibility to create new applications from scratch, add new pages to existing apps, integrate additional Python libraries, and build complex data visualizations — all without worrying about deployment pipelines or security configurations.</li><li><strong>Viewers</strong> <em>(End users)</em> access polished, secure applications without any technical overhead. All they need is Snowflake access.</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*6Iv2-5Fm-SZReetOzmKMJw.png" /><figcaption>Roles overview and their functionality</figcaption></figure><h3>Automate Everything</h3><p>We solve the problem with Continuous Integration and Continuous Delivery: the days of manual deployments and configuration headaches are a thing of the past. 
Our framework provides:</p><ul><li><strong>One-click environment preparation:</strong> with a set of <strong>make</strong> commands, the environment is installed and ready in a few seconds:</li></ul><pre>================================================================================<br>✅ Snowflake CLI successfully installed and configured!<br>Connection: gitlab_streamlit<br>User: YOU@GITLAB.COM<br>Account: gitlab<br>================================================================================<br>Using virtualenv: /Users/YOU/repos/streamlit/.venv<br>📚 Installing project dependencies...<br>Installing dependencies from lock file<br><br>No dependencies to install or update<br>✅ Streamlit environment prepared!</pre><ul><li><strong>Automated CI/CD pipelines</strong> that handle testing, code review, and deployment from development to production.</li><li><strong>Secure sandbox environments</strong> for safe development and testing before production deployment:</li></ul><pre>╰─$ make streamlit-rules<br>🔍 Running Streamlit compliance check...<br>================================================================================<br>CODE COMPLIANCE REPORT<br>================================================================================<br>Generated: 2025-07-09 14:01:16<br>Files checked: 1<br><br>SUMMARY:<br>✅ Passed: 1<br>❌ Failed: 0<br>Success Rate: 100.0%<br><br>APPLICATION COMPLIANCE SUMMARY:<br>📱 Total Applications Checked: 1<br>⚠️ Applications with Issues: 0<br>📊 File Compliance Rate: 100.0%<br><br>DETAILED RESULTS BY APPLICATION:<br>...</pre><ul><li><strong>Template-based application creation</strong> that ensures consistency across all applications and pages:</li></ul><pre>╰─$ make streamlit-new-page STREAMLIT_APP=sales_dashboard STREAMLIT_PAGE_NAME=analytics<br>📝 Generating new Streamlit page: analytics for app: sales_dashboard<br><br>📃 Create new page from template:<br>  Page name: analytics<br>  App directory: sales_dashboard<br>  Template path: 
page_template.py<br><br>✅ Successfully created &#39;analytics.py&#39; in &#39;sales_dashboard&#39; directory from template</pre><ul><li><strong>Poetry-based dependency management</strong> that prevents version conflicts and maintains clean environments.</li><li><strong>Organized project structure</strong> with dedicated folders for applications, templates, compliance rules, and configuration management:</li></ul><pre>├── src/<br>│   ├── applications/          # Folder for Streamlit applications<br>│   │   ├── main_app/          # Main dashboard application<br>│   │   ├── components/        # Shared components<br>│   │   └── &lt;your_apps&gt;/       # Your custom application<br>│   │   └── &lt;your_apps2&gt;/      # Your 2nd custom application<br>│   ├── templates/             # Application and page templates<br>│   ├── compliance/            # Compliance rules and checks<br>│   └── setup/                 # Setup and configuration utilities<br>├── tests/                     # Test files<br>├── config.yml                 # Environment configuration<br>├── Makefile                   # Build and deployment automation<br>└── README.md                  # Main README.md file</pre><ul><li><strong>Streamlined workflow</strong> from local development through testing schema to production, all automated through GitLab CI/CD pipelines.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*KImquLnLO-aTpJv19LtAkw.png" /><figcaption>GitLab CI/CD pipelines for full automation of the process</figcaption></figure><h3>Security and Compliance By Design</h3><p>Instead of bolting on security as an afterthought, our framework builds it in from the ground up. Every application adheres to the same security standards, and compliance requirements are automatically enforced; audit trails are maintained throughout the development lifecycle. We introduce our compliance rules and verify them with a single command. 
For instance, we can list which classes and methods are mandatory to use, which files you should have, and which role is allowed and which are forbidden to share the application with. The rules are flexible and descriptive, all you ned to do is to define them in a YAML file:</p><pre>class_rules:<br>  - name: &quot;Inherit code for the page from GitLabDataStreamlitInit&quot;<br>    description: &quot;All Streamlit apps must inherit from GitLabDataStreamlitInit&quot;<br>    severity: &quot;error&quot;<br>    required: true<br>    class_name: &quot;*&quot;<br>    required_base_classes:<br>      - &quot;GitLabDataStreamlitInit&quot;<br>    required_methods:<br>      - &quot;__init__&quot;<br>      - &quot;set_page_layout&quot;<br>      - &quot;setup_ui&quot;<br>      - &quot;run&quot;<br><br>function_rules:<br>  - name: &quot;Main function required&quot;<br>    description: &quot;Must have a main() function&quot;<br>    severity: &quot;error&quot;<br>    required: true<br>    function_name: &quot;main&quot;<br><br>import_rules:<br>  - name: &quot;Import GitLabDataStreamlitInit&quot;<br>    description: &quot;Must import the mandatory base class&quot;<br>    severity: &quot;error&quot;<br>    required: true<br>    module_name: &quot;gitlab_data_streamlit_init&quot;<br>    required_items:<br>      - &quot;GitLabDataStreamlitInit&quot;<br>  - name: &quot;Import streamlit&quot;<br>    description: &quot;Must import streamlit library&quot;<br>    severity: &quot;error&quot;<br>    required: true<br>    module_name: &quot;streamlit&quot;<br><br>file_rules:<br>  - name: &quot;Snowflake configuration required (snowflake.yml)&quot;<br>    description: &quot;Each application must have a snowflake.yml configuration file&quot;<br>    severity: &quot;error&quot;<br>    required: true<br>    file_pattern: &quot;**/applications/**/snowflake.yml&quot;<br>    base_path: &quot;&quot;<br>  - name: &quot;Snowflake environment required (environment.yml)&quot;<br>    description: &quot;Each 
application must have a environment.yml configuration file&quot;<br>    severity: &quot;error&quot;<br>    required: true<br>    file_pattern: &quot;**/applications/**/environment.yml&quot;<br>    base_path: &quot;&quot;<br>  - name: &quot;Share specification required (share.yml)&quot;<br>    description: &quot;Each application must have a share.yml file&quot;<br>    severity: &quot;warning&quot;<br>    required: true<br>    file_pattern: &quot;**/applications/**/share.yml&quot;<br>    base_path: &quot;&quot;<br>  - name: &quot;README.md required (README.md)&quot;<br>    description: &quot;Each application should have a README.md file with a proper documentation&quot;<br>    severity: &quot;error&quot;<br>    required: true<br>    file_pattern: &quot;**/applications/**/README.md&quot;<br>    base_path: &quot;&quot;<br>  - name: &quot;Starting point recommended (dashboard.py)&quot;<br>    description: &quot;Each application must have a dashboard.py as a starting point&quot;<br>    severity: &quot;warning&quot;<br>    required: true<br>    file_pattern: &quot;**/applications/**/dashboard.py&quot;<br>    base_path: &quot;&quot;<br><br>sql_rules:<br>  - name: &quot;SQL files must contain only SELECT statements&quot;<br>    description: &quot;SQL files and SQL code in other files should only contain SELECT statements for data safety&quot;<br>    severity: &quot;error&quot;<br>    required: true<br>    file_extensions: [&quot;.sql&quot;, &quot;.py&quot;]<br>    select_only: true<br>    forbidden_statements:<br>      - ....<br>    case_sensitive: false<br>  - name: &quot;SQL queries should include proper SELECT statements&quot;<br>    description: &quot;When SQL is present, it should contain proper SELECT statements&quot;<br>    severity: &quot;warning&quot;<br>    required: false<br>    file_extensions: [&quot;.sql&quot;, &quot;.py&quot;]<br>    required_statements:<br>      - &quot;SELECT&quot;<br>    case_sensitive: false<br><br>share_rules:<br>  - name: &quot;Valid 
functional roles in share.yml&quot;<br>    description: &quot;Share.yml files must contain only valid functional roles from the approved list&quot;<br>    severity: &quot;error&quot;<br>    required: true<br>    file_pattern: &quot;**/applications/**/share.yml&quot;<br>    valid_roles:<br>      - ...<br>    safe_data_roles:<br>      - ...<br>  - name: &quot;Share.yml file format validation&quot;<br>    description: &quot;Share.yml files must follow the correct YAML format structure&quot;<br>    severity: &quot;error&quot;<br>    required: true<br>    file_pattern: &quot;**/applications/**/share.yml&quot;<br>    required_keys:<br>      - &quot;share&quot;<br>    min_roles: 1<br>    max_roles: 10</pre><p>Running a single command:</p><pre>╰─$ make streamlit-rules</pre><p>We can verify all the rules we have created and validate that the <strong>developers</strong> <em>(who build a Streamlit application)</em> are following the policy specified by the <strong>creators</strong> <em>(who define the policies and building blocks of the framework)</em>, and that all the building blocks are in the right place. 
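</p><p>As an aside, the core of such a checker is straightforward. Below is a minimal, hypothetical sketch (not the framework's actual implementation) that enforces just the first class rule above with Python's standard <code>ast</code> module; the rule values are inlined here rather than read from the YAML file:</p>

```python
import ast

# Hedged sketch: enforce one class_rule from the YAML above.
# Assumption for the example: every class in an app file must inherit
# from GitLabDataStreamlitInit and define the four required methods.
REQUIRED_BASE = "GitLabDataStreamlitInit"
REQUIRED_METHODS = {"__init__", "set_page_layout", "setup_ui", "run"}

def check_class_rules(source: str) -> list[str]:
    """Return a list of rule violations found in the given Python source."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.ClassDef):
            continue
        # Collect simple-name base classes, e.g. `class X(Base):`
        bases = {b.id for b in node.bases if isinstance(b, ast.Name)}
        if REQUIRED_BASE not in bases:
            violations.append(f"{node.name}: missing base class {REQUIRED_BASE}")
        methods = {n.name for n in node.body if isinstance(n, ast.FunctionDef)}
        for missing in sorted(REQUIRED_METHODS - methods):
            violations.append(f"{node.name}: missing method {missing}()")
    return violations

compliant = """
class Dashboard(GitLabDataStreamlitInit):
    def __init__(self): pass
    def set_page_layout(self): pass
    def setup_ui(self): pass
    def run(self): pass
"""
print(check_class_rules(compliant))               # []
print(check_class_rules("class Bad:\n    pass"))  # base class + 4 method violations
```

<p>The real tool generalizes this idea across class, function, import, file, SQL and share rules, all driven by the YAML file.</p><p>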
This ensures consistent behaviour across all Streamlit applications.</p><pre><br>🔍 Running Streamlit compliance check...<br>================================================================================<br>CODE COMPLIANCE REPORT<br>================================================================================<br>Generated: 2025-08-18 17:05:12<br>Files checked: 4<br><br>SUMMARY:<br>✅ Passed: 4<br>❌ Failed: 0<br>Success Rate: 100.0%<br><br>APPLICATION COMPLIANCE SUMMARY:<br>📱 Total Applications Checked: 1<br>⚠️ Applications with Issues: 0<br>📊 File Compliance Rate: 100.0%<br><br>DETAILED RESULTS BY APPLICATION:<br>================================================================================<br><br>✅ PASS APPLICATION: main_app<br>------------------------------------------------------------<br>📁 FILES ANALYZED (4):<br>  ✅ dashboard.py<br>    📦 Classes: SnowflakeConnectionTester<br>    🔧 Functions: main<br>    📥 Imports: os, pwd, gitlab_data_streamlit_init, snowflake.snowpark.exceptions, streamlit<br>  ✅ show_streamlit_apps.py<br>    📦 Classes: ShowStreamlitApps<br>    🔧 Functions: main<br>    📥 Imports: pandas, gitlab_data_streamlit_init, snowflake_session, streamlit<br>  ✅ available_packages.py<br>    📦 Classes: AvailablePackages<br>    🔧 Functions: main<br>    📥 Imports: pandas, gitlab_data_streamlit_init, streamlit<br>  ✅ share.yml<br>    👥 Share Roles: snowflake_analyst_safe<br><br>📄 FILE COMPLIANCE FOR MAIN_APP:<br>  ✅ Required files found:<br>    ✓ snowflake.yml<br>    ✓ environment.yml<br>    ✓ share.yml<br>    ✓ README.md<br>    ✓ dashboard.py<br><br>RULES CHECKED:<br>----------------------------------------<br>Class Rules (1):<br>  - Inherit code for the page from GitLabDataStreamlitInit (error)<br>Function Rules (1):<br>  - Main function required (error)<br>Import Rules (2):<br>  - Import GitLabDataStreamlitInit (error)<br>  - Import streamlit (error)<br>File Rules (5):<br>  - Snowflake configuration required (snowflake.yml) (error)<br>  - Snowflake 
environment required (environment.yml) (error)<br>  - Share specification required (share.yml) (warning)<br>  - README.md required (README.md) (error)<br>  - Starting point recommended (dashboard.py) (warning)<br>SQL Rules (2):<br>  - SQL files must contain only SELECT statements (error)<br>    🗄 SELECT-only mode enabled<br>    🚨 Forbidden: INSERT, UPDATE, DELETE, DROP, ALTER...<br>  - SQL queries should include proper SELECT statements (warning)<br>Share Rules (2):<br>  - Valid functional roles in share.yml (error)<br>    👥 Valid roles: 15 roles defined<br>    🔒 Safe data roles: 11 roles<br>  - Share.yml file format validation (error)<br> <br>------------------------------------------------------------<br>✅ Compliance check passed<br>-----------------------------------------------------------</pre><h3>Developer Experience That Works</h3><p>Whether you prefer your favorite IDE, a web-based development environment or Snowflake Snowsight, the experience remains consistent. The framework provides:</p><ul><li><strong>Template-driven development</strong>: New applications and pages are created through standardized templates, ensuring consistency and best practices from day one. 
No more scattered design and elements.</li></ul><pre>╰─$ make streamlit-new-app NAME=sales_dashboard<br>🔧 Configuration Environment: TEST<br>📝 Configuration File: config.yml<br>📜 Config Loader Script: ./setup/get_config.sh<br>🐍 Python Version: 3.12<br>📁 Applications Directory: ./src/applications<br>🗄 Database: ...<br>📊 Schema: ...<br>🏗️ Stage: ...<br>🏭 Warehouse: ...<br>🆕 Creating new Streamlit app: sales_dashboard<br>Initialized the new project in ./src/applications/sales_dashboar</pre><ul><li><strong>Poetry package management</strong>: All dependencies are managed through Poetry, creating isolated environments that won’t disrupt your existing Python setup:</li></ul><pre>[tool.poetry]<br>name = &quot;GitLab Data Streamlit&quot;<br>version = &quot;0.1.1&quot;<br>description = &quot;GitLab Data Team Streamlit project&quot;<br>authors = [&quot;GitLab Data Team &lt;*****@gitlab.com&gt;&quot;]<br>readme = &quot;README.md&quot;<br><br>[tool.poetry.dependencies]<br>python = &quot;&lt;3.13,&gt;=3.12&quot;<br>snowflake-snowpark-python = &quot;==1.32.0&quot;<br>snowflake-connector-python = {extras = [&quot;development&quot;, &quot;pandas&quot;, &quot;secure-local-storage&quot;], version = &quot;^3.15.0&quot;}<br>streamlit = &quot;==1.22.0&quot;<br>watchdog = &quot;^6.0.0&quot;<br>types-toml = &quot;^0.10.8.20240310&quot;<br>pytest = &quot;==7.0.0&quot;<br>black = &quot;==25.1.0&quot;<br>importlib-metadata= &quot;==4.13.0&quot;<br>pyyaml = &quot;==6.0.2&quot;<br>python-qualiter = &quot;*&quot;<br>ruff = &quot;^0.1.0&quot;<br>types-pyyaml = &quot;^6.0.12.20250516&quot;<br>jinja2 = &quot;==3.1.6&quot;<br><br>[build-system]<br>requires = [&quot;poetry-core&quot;]<br>build-backend = &quot;poetry.core.masonry.api&quot;</pre><ul><li><strong>Multi-page application support</strong>: Creators can easily build complex applications with multiple pages and add new libraries as needed. 
Multi-page applications are part of the framework, so developers focus on the logic, not the design and structure.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*9lK6wyqHbjnsh3t6k7r8nA.png" /><figcaption>Multi-page application example (in Snowflake)</figcaption></figure><ul><li><strong>Seamless Snowflake integration</strong>: Built-in connectors and authentication handling for secure data access provide the same experience, regardless of your environment <em>(local development or directly in Snowflake):</em></li></ul><pre>make streamlit-push-test APPLICATION_NAME=sales_dashboard<br><br>📤 Deploying Streamlit app to test environment: sales_dashboard<br>...<br>------------------------------------------------------------------------------------------------------------<br>🔗 Running share command for application: sales_dashboard<br>Running commands to grant shares<br><br>🚀 Executing: snow streamlit share sales_dashboard with SOME_NICE_ROLE<br>✅ Command executed successfully<br><br>📊 Execution Summary: 1/1 commands succeeded</pre><ul><li><strong>Comprehensive Makefile</strong>: All common commands are wrapped in simple Makefile targets, from local development to testing and deployment, including CI/CD pipelines.</li><li><strong>Safe local development</strong>: Everything runs in isolated Poetry environments, protecting your system while providing a production-like experience.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*BnF37Y9y2O1N3a0D0dxlQw.png" /><figcaption>The same experience regardless of the environment (example of local development)</figcaption></figure><ul><li><strong>Collaboration via code: </strong>All applications and components live in one repository, which allows the entire organization to collaborate on the same resources and avoid double work and redundant setup.</li></ul><h3>Now What: Getting Started and Moving Forward</h3><h4>Next steps — how our experience can improve your 
flow</h4><p>If you’re facing similar challenges with scattered Streamlit applications, here’s how to begin and move quickly:</p><ol><li><strong>Assess your current state</strong>: Inventory your existing applications and identify pain points.</li><li><strong>Define your roles</strong>: Separate maintainer responsibilities from creator and end users&#39; needs.</li><li><strong>Start with templates</strong>: Create standardized application templates that enforce your security and compliance requirements.</li><li><strong>Implement CI/CD</strong>: Automate your deployment pipeline to reduce manual errors and ensure consistency.</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*EdMNHtZ9FS2C9U8t6e2RZw.png" /><figcaption>Deployed application in Snowflake</figcaption></figure><h3>The Bigger Picture</h3><p>This framework represents more than just a technical solution — it’s a paradigm shift toward treating data applications as first-class citizens in your enterprise (Data) architecture.</p><p>By providing structure without sacrificing flexibility, the GitLab Data team created an environment where anyone in the company with minimal technical knowledge can innovate rapidly while maintaining the highest standards of security and compliance.</p><h3>What’s Next?</h3><p>We’re continuing to enhance the framework based on user feedback and emerging needs. 
Future improvements include expanded template libraries, enhanced monitoring capabilities, and more flexibility and a smoother user experience.</p><p><strong>The goal isn’t just to solve today’s problems, but to create a foundation that scales with your organization’s growing data application needs.</strong></p><h3>Summary</h3><p><a href="https://handbook.gitlab.com/handbook/enterprise-data/"><strong>GitLab Data Team</strong></a> transformed from having dozens of scattered, insecure Streamlit applications with no standardisation into a unified, enterprise-grade framework that separates roles cleanly:</p><ol><li><strong>Maintainers</strong> handle infrastructure and security,</li><li><strong>Creators</strong> focus on building applications without deployment headaches, and</li><li><strong>Viewers</strong> access polished, compliant apps.</li></ol><p>Using building blocks that separate concerns:</p><ol><li>Automated <strong>CI/CD</strong> pipelines</li><li><strong>Fully</strong> collaborative and versioned code in <strong>git</strong></li><li><strong>Template</strong>-based development</li><li>Built-in <strong>security</strong>, <strong>compliance</strong>, <strong>testing</strong> and</li><li><a href="https://python-poetry.org/"><strong>Poetry</strong></a>-managed environments</li></ol><p><em>We eliminated the maintenance nightmare while enabling rapid innovation — proving that you can have both structure and flexibility when you treat data applications as first-class enterprise assets rather than throwaway prototypes.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=baa3b709aead" width="1" height="1" alt=""><hr><p><a href="https://medium.com/snowflake/how-we-built-a-structured-streamlit-framework-in-snowflake-from-chaos-to-compliance-baa3b709aead">How We Built a Structured Streamlit Framework in Snowflake: From Chaos to Compliance</a> was originally published in <a href="https://medium.com/snowflake">Snowflake 
Builders Blog: Data Engineers, App Developers, AI, &amp; Data Science</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Yet another Python article — give me quality or give me death]]></title>
            <link>https://medium.com/@radovan.bacovic/yet-another-python-article-give-me-quality-or-give-me-death-7029b71e16ce?source=rss-ff65005cbd7e------2</link>
            <guid isPermaLink="false">https://medium.com/p/7029b71e16ce</guid>
            <category><![CDATA[gitlab]]></category>
            <category><![CDATA[python]]></category>
            <category><![CDATA[data]]></category>
            <dc:creator><![CDATA[Radovan Bacovic]]></dc:creator>
            <pubDate>Mon, 18 Aug 2025 15:10:15 GMT</pubDate>
            <atom:updated>2025-08-18T15:10:15.585Z</atom:updated>
            <content:encoded><![CDATA[<h3>Yet another Python article — give me quality or give me death</h3><h3>Introducing python-qualiter: The All-in-One Python Code Quality Tool You’ve Been Waiting For</h3><p><em>Simplify your Python linting workflow with a single, powerful command-line interface</em></p><p>As Python developers, we’ve all been there. You’re working on a project, and suddenly you find yourself juggling multiple tools: black for formatting, isort for import sorting, mypy for type checking, flake8 for style guide enforcement, and pylint for comprehensive code analysis. Each tool serves its purpose, but managing them all becomes a complexity nightmare, especially when setting up CI/CD pipelines.</p><p>What if I told you there’s now a way to run all these essential code quality checks with a single command?</p><p>Today, I’m excited to introduce <a href="https://pypi.org/project/python-qualiter/"><strong>python-qualiter</strong></a> — an open-source package that wraps all your favorite Python linters and code quality tools into one unified, user-friendly interface.</p><h3>The Problem with Multiple Tools</h3><p>Every Python developer knows the pain:</p><ul><li><strong>Local Development</strong>: Remembering to run multiple commands before committing code</li><li><strong>CI/CD Complexity</strong>: Setting up separate pipeline steps for each linting tool</li><li><strong>Resource Waste</strong>: Multiple pipeline executions consuming unnecessary compute resources</li><li><strong>Inconsistent Results</strong>: Different team members running different combinations of tools</li></ul><p>The result? Fragmented code quality processes that slow down development and create inconsistencies across teams.</p><h3>Meet python-qualiter: Your New Code Quality Companion</h3><p><strong>python-qualiter</strong> is a modern CLI wrapper that brings together the power of multiple Python linting and formatting tools under a single, intuitive interface. 
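</p><p>To make the pattern concrete, here is a minimal, hypothetical sketch of how such a wrapper can work under the hood (this illustrates the approach, not python-qualiter's actual code): each linter runs as a subprocess, its exit code becomes a pass/fail flag, and disabled or missing tools are skipped. The tool list and flags below are each tool's own standard check invocations, chosen for the example:</p>

```python
import shutil
import subprocess

# Hedged sketch of an all-in-one linter wrapper: run each tool as a
# subprocess and collect a pass/fail flag per tool. The commands are
# the tools' standard "check only" invocations.
LINTERS = {
    "black": ["black", "--check"],
    "isort": ["isort", "--check-only"],
    "flake8": ["flake8"],
}

def run_linters(path: str, disable: frozenset = frozenset()) -> dict[str, bool]:
    """Return {linter_name: passed} for every enabled, installed linter."""
    results = {}
    for name, cmd in LINTERS.items():
        if name in disable or shutil.which(cmd[0]) is None:
            continue  # skip disabled linters and tools not on PATH
        proc = subprocess.run(cmd + [path], capture_output=True, text=True)
        results[name] = proc.returncode == 0  # linters report failure via exit code
    return results
```

<p>A result dict like <code>{"black": True, "flake8": False}</code> is all that is needed to render a pass/fail matrix and to decide the wrapper's own exit code.</p><p>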
Think of it as your code quality Swiss Army knife.</p><h3>What Makes It Special?</h3><p>The tool combines industry-standard linters, including:</p><ul><li><strong>isort</strong> for import organization</li><li><strong>black</strong> for code formatting</li><li><strong>mypy</strong> for static type checking</li><li><strong>flake8</strong> for style guide enforcement</li><li><strong>pylint</strong> for comprehensive code analysis</li><li><strong>vulture</strong> for dead code detection</li><li><strong>ruff</strong> — the rising star in Python tooling</li></ul><p>But it’s more than just a collection of tools — it’s a thoughtfully designed experience that makes code quality management effortless.</p><h3>Key Features That Set It Apart</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*3u7RfgdZAcX3XPduPOryTA.png" /></figure><h3>🎯 All-in-One Linting</h3><p>Run every essential code quality check with a single command. No more remembering multiple tool names or parameters.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*kC_2R8mY8fwwCq6d-4nQow.png" /></figure><h3>📊 Visual Result Matrix</h3><p>Get a clear, at-a-glance view of which files pass which linters. The visual feedback makes it easy to identify exactly where issues exist.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*cm6gWdTvlRc8hgfBxF-YHg.png" /></figure><h3>🔧 Auto-Fix Capability</h3><p>Many code quality issues can be automatically resolved. python-qualiter identifies and fixes problems where possible, saving you valuable development time.</p><pre>python-qualiter lint my_file.py --fix</pre><h3>⚙️ Flexible Configuration</h3><p>Enable or disable specific linters based on your project’s needs. 
Not every project requires every tool, and python-qualiter respects that.</p><pre>python-qualiter lint my_file.py --disable pylint</pre><h3>📈 Detailed Reports</h3><p>When issues are found, you get comprehensive information about what went wrong and how to fix it.</p><pre>python-qualiter lint my_file.py --verbose</pre><pre>ruff found issues in ./lint.py:<br>lint.py:21:21: F401 [*] `pathlib.Path` imported but unused<br>lint.py:378:19: F821 Undefined name `lint_file`<br>Found 2 errors.<br>[*] 1 fixable with the `--fix` option.<br><br><br>=====================================================================================================================<br>LINTING RESULTS MATRIX<br>=====================================================================================================================<br>File                                     | black    | flake8   | isort    | mypy     | pylint   | ruff     | vulture <br>---------------------------------------------------------------------------------------------------------------------<br>./__init__.py                            | ✅        | ❌        | ✅        | ✅        | ✅        | ✅        | ✅        | <br>./app.py                                 | ✅        | ❌        | ✅        | ✅        | ❌        | ✅        | ✅        | <br>./lint.py                                | ✅        | ❌        | ✅        | ❌        | ❌        | ❌        | ✅        | <br>=====================================================================================================================<br>❌ 7 FAILURES OUT OF 21 CHECKS<br>=====================================================================================================================</pre><h3>Getting Started: It’s Easier Than You Think</h3><p>Installation couldn’t be simpler:</p><pre>pip install python-qualiter</pre><p>Check your code quality:</p><pre>python-qualiter lint path/to/your/code.py</pre><p>Apply automatic fixes:</p><pre>python-qualiter lint path/to/your/code.py 
--fix</pre><p>For multiple files or directories:</p><pre>python-qualiter lint src/*.py test/*.py -v</pre><h3>Streamlining Your CI/CD Pipeline</h3><p>One of the most powerful applications of python-qualiter is in your CI/CD pipeline. Instead of managing multiple pipeline steps, you can consolidate everything into a single, efficient step:</p><pre># .gitlab-ci.yml<br>python_linters:<br>  script:<br>    - pip install python-qualiter<br>    - python-qualiter lint src/*.py test/*.py -v<br>  allow_failure: true</pre><p>This approach offers several advantages:</p><p><strong>Cost Efficiency</strong>: Reduce compute resources by running all checks in a single pipeline step rather than spawning multiple containers.</p><p><strong>Simplicity</strong>: One pipeline step to maintain instead of multiple complex configurations.</p><p><strong>Consistency</strong>: Ensure the same checks run locally and in CI/CD, eliminating the “works on my machine” problem.</p><p><strong>Speed</strong>: Faster pipeline execution with reduced overhead from multiple tool startups.</p><h3>Why This Matters for Your Team</h3><p>Quality code is consistent code, whether you’re checking it on your local machine or through your CI/CD pipeline. python-qualiter ensures that your entire team has access to the same comprehensive code quality checks without the complexity traditionally associated with multi-tool setups.</p><p>The tool also embraces <a href="https://docs.astral.sh/ruff/"><strong>ruff</strong></a>, the new rising star in the Python ecosystem known for its incredible speed and comprehensive rule set. By integrating ruff alongside established tools, python-qualiter gives you the best of both worlds: proven reliability and cutting-edge performance.</p><h3>The Open Source Advantage</h3><p>python-qualiter is fully open source and available on PyPI. 
This means:</p><ul><li><strong>Transparency</strong>: You can see exactly how your code is being analysed</li><li><strong>Community-Driven</strong>: Contributions and feedback from developers worldwide</li><li><strong>No Vendor Lock-in</strong>: Use it freely in any project, commercial or personal</li><li><strong>Continuous Improvement</strong>: Regular updates and enhancements based on real-world usage</li></ul><h3>Join the Movement</h3><p>Ready to simplify your Python code quality workflow? Here’s how you can get involved:</p><ol><li><strong>Try it out</strong>: pip install python-qualiter and run it on your current project</li><li><strong>Share feedback</strong>: Report bugs, suggest features, or share your experience</li><li><strong>Contribute</strong>: The <a href="https://gitlab.com/rbacovic/python-qualiter"><strong>project</strong></a> welcomes contributions from developers of all skill levels</li><li><strong>Spread the word</strong>: Help other Python developers discover this tool</li></ol><h3>Conclusion</h3><p>Whether you’re a solo developer working on personal projects or part of a large team managing complex applications, python-qualiter adapts to your workflow and makes code quality checks as simple as a single command.</p><p><em>Happy coding! 🐍</em></p><p><em>Have questions or feedback? I’d love to hear from you in the comments below.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=7029b71e16ce" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[“Broken English” 2023 tour recap and videos]]></title>
            <link>https://medium.com/@radovan.bacovic/broken-english-2023-tour-recap-and-videos-8193cc6f1751?source=rss-ff65005cbd7e------2</link>
            <guid isPermaLink="false">https://medium.com/p/8193cc6f1751</guid>
            <category><![CDATA[serbia]]></category>
            <category><![CDATA[data]]></category>
            <category><![CDATA[gitlab]]></category>
            <category><![CDATA[confrence]]></category>
            <dc:creator><![CDATA[Radovan Bacovic]]></dc:creator>
            <pubDate>Fri, 12 Jan 2024 21:08:00 GMT</pubDate>
            <atom:updated>2024-01-12T21:08:24.032Z</atom:updated>
<content:encoded><![CDATA[<h3>My talks — or “Broken English” 2023 tour recap and videos</h3><h3>Hey mom, I am on the Internet now!</h3><p>A quick recap of my “Broken English” tour in 2023 (“Broken English” being my nickname for my talks at various conferences).</p><p>I travelled thousands of miles in 2023, and I always feel overjoyed to share my mileage and experience with the audience. My main driver when stepping on stage is to put a smile on people’s faces and make them feel good and fulfilled. As simple as that!</p><p>Happy to highlight and share a few talks from the last quarter of the year.</p><h3>#9Inspiration</h3><p>If you like AI, DevSecOps and all the buzzwords popular these days, the talk:</p><p>🎥 <a href="https://www.youtube.com/watch?v=h5TWnYI0sCw&amp;list=PLQyyxph2CGupNGhGLZ1ofCxqJe_RzM7ME&amp;index=19&amp;t=3s&amp;pp=gAQBiAQB"><strong>#9Inspiration: When nimble is not fast enough: Will AI and Data leverage your DevSecOps journey</strong></a></p><p>will give you a clear overview of the trends in this area.</p><p>My contribution to the <a href="https://levi9conference.com/"><strong>#9Inspiration Conference</strong></a> in Belgrade, Serbia 🇷🇸 in September 2023.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*W7eDIL8tXV83f68Xj4YopA.jpeg" /><figcaption>Talk, talk, talk and beyond — part 1</figcaption></figure><h3>Crunch Conference</h3><p>Here is one of my greatest hits, as it provides an overview of the secret sauce of a successful Data Platform.</p><p><a href="https://www.youtube.com/watch?v=2_O3jGpicOg&amp;list=PLQyyxph2CGupNGhGLZ1ofCxqJe_RzM7ME&amp;index=18&amp;pp=gAQBiAQB">🎥 <strong>Do the Magic with All-Remote Data Teams… — Radovan Bacovic | Compass Tech Summit 2023</strong></a></p><p>I was happy to take part in the famous <a href="https://crunchconf.com/2023"><strong>Crunch Conference</strong></a> in Budapest, Hungary 🇭🇺 in October 2023.</p><figure><img alt="" 
src="https://cdn-images-1.medium.com/max/1024/1*IcdyjBfpuTe4XO9vChPALQ.jpeg" /><figcaption>Talk, talk, talk and beyond — part 2</figcaption></figure><h3>DSC Europe 2023</h3><p>Should you go back to the office or work from your kitchen? No one really cares. Here is how I see this topic, with an overview from my first-hand experience:</p><p><a href="https://www.youtube.com/watch?v=thkSOcYFPe8&amp;list=PLQyyxph2CGupNGhGLZ1ofCxqJe_RzM7ME&amp;index=20&amp;t=1s&amp;pp=gAQBiAQB">🎥 <strong>Asynchronous Work: The Next Phase of Remote Work | Radovan Bacovic | DSC Europe 23</strong></a></p><p>As always, the DSC organizers provided the ultimate conference experience in Belgrade, Serbia 🇷🇸 in November 2023 at the <a href="https://datasciconference.com/"><strong>Data Science Conference</strong></a>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*WOHHElmHO_h5oHrgYC-HLw.jpeg" /><figcaption>Talk, talk, talk and beyond — part 3</figcaption></figure><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=8193cc6f1751" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[DSC Croatia 2022]]></title>
            <link>https://medium.com/@radovan.bacovic/dsc-croatia-2022-8468fc12bc70?source=rss-ff65005cbd7e------2</link>
            <guid isPermaLink="false">https://medium.com/p/8468fc12bc70</guid>
            <category><![CDATA[data]]></category>
            <category><![CDATA[open-source]]></category>
            <category><![CDATA[gitlab]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[data-engineering]]></category>
            <dc:creator><![CDATA[Radovan Bacovic]]></dc:creator>
            <pubDate>Mon, 16 May 2022 12:05:55 GMT</pubDate>
            <atom:updated>2022-05-16T12:05:55.897Z</atom:updated>
<content:encoded><![CDATA[<p>Well, one more live conference: <a href="https://dsccroatia.com/">DSC Croatia 2022</a> — this time in Zagreb, Croatia, from <strong>10th-12th May 2022. </strong>It always feels good to contribute and exchange experience. As usual, the best talks happened on the margins, over coffee chats and/or a glass of wine.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*jHjaK98dnCMKr-FN1GcTwg.jpeg" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*3aEI17v8GhQxGv0rxhL7tg.jpeg" /></figure><p>I had a great time meeting old and new peers, with good and interesting points for discussion. Good vibe, great atmosphere, and I was happy to share the same space with brilliant minds and to open discussions about data challenges.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/768/1*OQpOFGGzuPLXiEzDiPtRlQ.jpeg" /></figure><p>I spoke about how we do the Data things at <a href="http://about.gitlab.com"><strong>GitLab.com</strong></a> <em>(look, we have a brand new logo; hope you like it).</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/990/1*23wV60Mvjt8UQk0-G4tKqA.jpeg" /></figure><p>Here is the presentation with all the details. 
Feel free to ping me for more details if you are interested in the topic.</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.slideshare.net%2Fslideshow%2Fembed_code%2Fkey%2FNFE2Xva9QGbfte&amp;display_name=SlideShare&amp;url=https%3A%2F%2Fwww.slideshare.net%2FRadovanBaovi%2Fdsc-2021-presentationradovanbacovic&amp;image=https%3A%2F%2Fcdn.slidesharecdn.com%2Fss_thumbnails%2Fdsc2021presentationradovanbacovic-220203100205-thumbnail-4.jpg%3Fcb%3D1643882543&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=slideshare" width="600" height="500" frameborder="0" scrolling="no"><a href="https://medium.com/media/caa7a4069f312b1e936d6b3ef891e4ac/href">https://medium.com/media/caa7a4069f312b1e936d6b3ef891e4ac/href</a></iframe><p>And of course, I strongly recommend Zagreb as a sweet spot with a good atmosphere, delicious food, and a great choice of wines.</p><p>See you around, live, of course.</p><p>#DSCROATIA2022 #data</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=8468fc12bc70" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Data Innovation Summit 2022 Stockholm — brief recap]]></title>
            <link>https://medium.com/@radovan.bacovic/data-innovation-summit-2022-stockholm-brief-recap-3fbcc2135c29?source=rss-ff65005cbd7e------2</link>
            <guid isPermaLink="false">https://medium.com/p/3fbcc2135c29</guid>
            <category><![CDATA[open-source]]></category>
            <category><![CDATA[gitlab]]></category>
            <category><![CDATA[conference]]></category>
            <category><![CDATA[data]]></category>
            <category><![CDATA[data-engineer]]></category>
            <dc:creator><![CDATA[Radovan Bacovic]]></dc:creator>
            <pubDate>Wed, 11 May 2022 11:49:39 GMT</pubDate>
            <atom:updated>2022-05-11T11:49:39.726Z</atom:updated>
<content:encoded><![CDATA[<h3>Data Innovation Summit 2022 Stockholm — brief recap</h3><p>Thrilled to share my latest (live) data conference experience with the audience. I am just back from beautiful Stockholm (Sweden), where I attended the <a href="https://datainnovationsummit.com/">Data Innovation Summit 2022</a>. It was a great data conference where I had a chance to catch up with top-notch data companies, most of which build products we are using.</p><p>I had an opportunity to learn a lot <em>(workshops + talks + informal chats)</em> from:<br>- Snowflake | Google/GCP | Fivetran | Firebolt | Snowplow| DataBricks | Apple | Aiven | Dremio | SODA | Confluent | <a href="http://coalesce.io/">Coalesce.io</a> ...</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/768/1*TsE9x8ltxg5jTDT3BaonQg.jpeg" /></figure><p>Happy to share my thoughts and takeaways about where the data world is going:</p><ul><li>Cloud is not only the primary but the only choice these days. Also, most of the companies praise multi-cloud as a good solution</li><li>GCP seems to have the most aggressive campaign for a data-based approach in the cloud. I spoke with their engineers, and they are building the entire ecosystem around BigQuery - probably to compete with Snowflake and win over its users</li><li>ETL as a Service <em>(out of the box, ready to run in a second)</em> is definitely a rising area - many cloud-based Meltano-ish platforms, like Aiven, Keboola, etc., are really good. 
The focus is moving from coding to researching and choosing the <em>“right tools”</em></li><li>The open-source approach is a good way to avoid vendor lock-in in the long term</li><li>Data observability tools are a <em>“must-have”</em> for 2022 - among other tools, SODA <em>(the Netherlands company)</em> has a slightly different approach than MC, BigEye or Anomalo, and has an open-source version worth checking</li><li>Data mesh was the hottest word at the conference, but I think it is just a trend</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*q2w7HyoT0__JEctbtdxLOg.jpeg" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*yugWN7POYMZklpupDKlgUg.jpeg" /></figure><p>Bottom line (for GitLab): people really like and respect GitLab - a well-known and positively recognized brand. 80% of these companies use a paid version of GitLab to build products we use on a daily basis. That’s so cool!<br>I strongly believe we belong among the top companies in the world, without any doubt.</p><p>See you next year in Stockholm.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=3fbcc2135c29" width="1" height="1" alt="">]]></content:encoded>
        </item>
    </channel>
</rss>