<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Radovan Bacovic on Medium]]></title>
        <description><![CDATA[Stories by Radovan Bacovic on Medium]]></description>
        <link>https://medium.com/@radovan.bacovic?source=rss-ff65005cbd7e------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*XFnFWQNZqqREMjG13gRURw.png</url>
            <title>Stories by Radovan Bacovic on Medium</title>
            <link>https://medium.com/@radovan.bacovic?source=rss-ff65005cbd7e------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Thu, 16 Apr 2026 03:04:42 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@radovan.bacovic/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Snowflake Task and Pipe Failures]]></title>
            <link>https://medium.com/snowflake/snowflake-task-and-pipe-failures-91b4680b6ba8?source=rss-ff65005cbd7e------2</link>
            <guid isPermaLink="false">https://medium.com/p/91b4680b6ba8</guid>
            <category><![CDATA[aws-lambda]]></category>
            <category><![CDATA[snowflake]]></category>
            <category><![CDATA[slack]]></category>
            <category><![CDATA[gitlab]]></category>
            <dc:creator><![CDATA[Radovan Bacovic]]></dc:creator>
            <pubDate>Tue, 17 Mar 2026 21:01:00 GMT</pubDate>
            <atom:updated>2026-03-17T21:01:00.999Z</atom:updated>
<content:encoded><![CDATA[<h4>How we alert the team in real time with AWS Lambda and Slack</h4><blockquote><strong><em>Note: </em></strong><em>this article was written with huge help from </em><a href="http://twitter.com/csnehansh06"><strong><em>@csnehansh06</em></strong></a><em>; please consider him a co-author.</em></blockquote><p>How the <a href="https://handbook.gitlab.com/handbook/enterprise-data/"><strong>GitLab Data Team</strong></a> uses AWS SNS and a Lambda function to turn silent Snowflake failures into instant Slack notifications before anyone notices data is missing or the pipeline is broken.</p><p>This was our problem: a Snowpipe silently stops loading data. No error is thrown at the pipeline level, and no dashboard turns red. Just a growing gap in your tables that someone notices hours or days later.</p><p>And it’s a problem that’s embarrassingly common with Snowflake tasks and pipes. They fail quietly. Snowflake logs the error, but unless you’re actively polling for it, you won’t know. And polling is the kind of thing that either doesn’t get built, or gets built once, breaks, and nobody notices.</p><p>We needed something better: a push-based alerting system that fires the moment something goes wrong, and drops a message directly into Slack where the team already lives.</p><p>Here’s exactly how we built it.</p><h3>The problem with Snowflake failure visibility</h3><p>Snowflake tasks and Snowpipes are workhorses of any modern data platform. Tasks let you run SQL logic on a cron-like schedule. Snowpipes load data continuously from cloud storage — in our case, S3 — into Snowflake tables.</p><p>Both can fail. And when they do, Snowflake doesn’t shout about it by default.</p><p>For tasks, failures are logged in INFORMATION_SCHEMA.TASK_HISTORY. For Snowpipes, errors show up in INFORMATION_SCHEMA.COPY_HISTORY or via the REST API.
You can query these, but querying is reactive. You&#39;re always behind.</p><p>What we wanted was reactive in the good sense: something that responds to the failure the moment it happens, rather than us having to go looking.</p><p>The solution was Snowflake’s native error notification integration using <strong>AWS SNS (Simple Notification Service)</strong>. Snowflake can publish failure events directly to an SNS topic. From there, you can wire up anything — Lambda, email, PagerDuty, you name it. We wired it to a Lambda function that formats the message and posts it to Slack.</p><p>The full flow looks like this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*XWWKrMds2MLBOxmL0YYLFA.png" /><figcaption><strong><em>How this feature works (clean and simple)</em></strong></figcaption></figure><h3>Building the integration step by step</h3><h4>Step 1: Create the SNS topic in AWS</h4><p>First, create an SNS topic that Snowflake will publish to. You can do this through the AWS Console or CLI.</p><pre>aws sns create-topic --name snowflake-error-notifications --region us-east-1</pre><p>Note the ARN that comes back — you’ll need it in the next step. It looks like:</p><pre>arn:aws:sns:us-east-1:123456789012:snowflake-error-notifications</pre><h4>Step 2: Grant Snowflake permission to publish to the topic</h4><p>Snowflake needs an IAM role with permission to publish to your SNS topic. Create the policy:</p><pre>{<br>  &quot;Version&quot;: &quot;2012-10-17&quot;,<br>  &quot;Statement&quot;: [<br>    {<br>      &quot;Effect&quot;: &quot;Allow&quot;,<br>      &quot;Action&quot;: &quot;sns:Publish&quot;,<br>      &quot;Resource&quot;: &quot;arn:aws:sns:us-east-1:123456789012:snowflake-error-notifications&quot;<br>    }<br>  ]<br>}</pre><p>Attach this to a new IAM role. The trust relationship on the role needs to allow Snowflake’s AWS account to assume it. 
You’ll get Snowflake’s IAM user ARN after creating the notification integration in the next step — so this is a two-step dance.</p><h4>Step 3: Create the Snowflake notification integration</h4><p>In Snowflake, create a notification integration that points to your SNS topic:</p><pre>CREATE OR REPLACE NOTIFICATION INTEGRATION sns_error_integration<br>  ENABLED = TRUE<br>  TYPE = QUEUE<br>  NOTIFICATION_PROVIDER = AWS_SNS<br>  DIRECTION = OUTBOUND<br>  AWS_SNS_TOPIC_ARN = &#39;arn:aws:sns:us-east-1:123456789012:snowflake-error-notifications&#39;<br>  AWS_SNS_ROLE_ARN = &#39;arn:aws:iam::123456789012:role/snowflake-sns-role&#39;;</pre><p>Then describe it to get Snowflake’s side of the IAM trust:</p><pre>DESC INTEGRATION sns_error_integration;</pre><p>Look for SF_AWS_IAM_USER_ARN and SF_AWS_EXTERNAL_ID in the output. Use these to update the trust policy on your IAM role so Snowflake can actually assume it:</p><pre>{<br>  &quot;Version&quot;: &quot;2012-10-17&quot;,<br>  &quot;Statement&quot;: [<br>    {<br>      &quot;Effect&quot;: &quot;Allow&quot;,<br>      &quot;Principal&quot;: {<br>        &quot;AWS&quot;: &quot;arn:aws:iam::SNOWFLAKE_ACCOUNT:user/SNOWFLAKE_USER&quot;<br>      },<br>      &quot;Action&quot;: &quot;sts:AssumeRole&quot;,<br>      &quot;Condition&quot;: {<br>        &quot;StringEquals&quot;: {<br>          &quot;sts:ExternalId&quot;: &quot;YOUR_EXTERNAL_ID&quot;<br>        }<br>      }<br>    }<br>  ]<br>}</pre><h4>Step 4: Attach the notification integration to your tasks and pipes</h4><p>For a <strong>Snowflake task</strong>, add the error integration to its definition:</p><pre>CREATE OR REPLACE TASK my_transform_task<br>  WAREHOUSE = TRANSFORM_WH<br>  SCHEDULE = &#39;USING CRON 0 6 * * * UTC&#39;<br>  ERROR_INTEGRATION = sns_error_integration<br>AS<br>  CALL my_stored_procedure();</pre><p>For an existing task you want to update:</p><pre>ALTER TASK my_transform_task SET ERROR_INTEGRATION = sns_error_integration;</pre><p>For a 
<strong>Snowpipe</strong>, add it at creation time:</p><pre>CREATE OR REPLACE PIPE my_data_pipe<br>  AUTO_INGEST = TRUE<br>  ERROR_INTEGRATION = sns_error_integration<br>AS<br>  COPY INTO my_table<br>  FROM @my_stage<br>  FILE_FORMAT = (TYPE = &#39;JSON&#39;);</pre><p>Now, whenever that task or pipe fails, Snowflake pushes a JSON payload to your SNS topic automatically. No polling, no cron job checking for failures.</p><h4>Step 5: Create the Lambda function</h4><p>Subscribe your Lambda to the SNS topic, then write the handler. The SNS message payload from Snowflake looks like this:</p><pre>{<br>  &quot;version&quot;: &quot;1.0&quot;,<br>  &quot;messageId&quot;: &quot;abc-123&quot;,<br>  &quot;timestamp&quot;: &quot;2024-01-15T08:32:11Z&quot;,<br>  &quot;snowflakeEventType&quot;: &quot;TASK_FAILURE&quot;,<br>  &quot;resource&quot;: {<br>    &quot;database&quot;: &quot;PROD&quot;,<br>    &quot;schema&quot;: &quot;TRANSFORMS&quot;,<br>    &quot;name&quot;: &quot;MY_TRANSFORM_TASK&quot;<br>  },<br>  &quot;errorMessage&quot;: &quot;SQL compilation error: Object &#39;MY_TABLE&#39; does not exist or not authorized.&quot;<br>}</pre><p>Here’s the Lambda function we use to parse that and send it to Slack:</p><pre>import json<br>import os<br>import urllib.request<br><br>SLACK_WEBHOOK_URL = os.environ[&quot;SLACK_WEBHOOK_URL&quot;]<br><br>def format_slack_message(event_data: dict) -&gt; dict:<br>    event_type = event_data.get(&quot;snowflakeEventType&quot;, &quot;UNKNOWN_EVENT&quot;)<br>    resource = event_data.get(&quot;resource&quot;, {})<br>    resource_name = (<br>        f&quot;{resource.get(&#39;database&#39;, &#39;?&#39;)}.&quot;<br>        f&quot;{resource.get(&#39;schema&#39;, &#39;?&#39;)}.&quot;<br>        f&quot;{resource.get(&#39;name&#39;, &#39;?&#39;)}&quot;<br>    )<br>    error_message = event_data.get(&quot;errorMessage&quot;, &quot;No error message provided.&quot;)<br>    timestamp = event_data.get(&quot;timestamp&quot;, &quot;Unknown time&quot;)<br>    
emoji = &quot;:snowflake:&quot; if &quot;PIPE&quot; in event_type else &quot;:gear:&quot;<br><br>    # Pure Block Kit — no legacy `attachments` wrapper<br>    return {<br>        &quot;blocks&quot;: [<br>            {<br>                &quot;type&quot;: &quot;header&quot;,<br>                &quot;text&quot;: {<br>                    &quot;type&quot;: &quot;plain_text&quot;,<br>                    &quot;text&quot;: f&quot;{emoji} Snowflake {event_type.replace(&#39;_&#39;, &#39; &#39;).title()}&quot;,<br>                },<br>            },<br>            {<br>                &quot;type&quot;: &quot;section&quot;,<br>                &quot;fields&quot;: [<br>                    {&quot;type&quot;: &quot;mrkdwn&quot;, &quot;text&quot;: f&quot;*Resource:*\n`{resource_name}`&quot;},<br>                    {&quot;type&quot;: &quot;mrkdwn&quot;, &quot;text&quot;: f&quot;*Time:*\n{timestamp}&quot;},<br>                ],<br>            },<br>            {<br>                &quot;type&quot;: &quot;section&quot;,<br>                &quot;text&quot;: {<br>                    &quot;type&quot;: &quot;mrkdwn&quot;,<br>                    &quot;text&quot;: f&quot;:red_circle: *Error:*\n```{error_message[:500]}```&quot;,<br>                },<br>            },<br>            {&quot;type&quot;: &quot;divider&quot;},<br>        ]<br>    }<br><br>def send_to_slack(message: dict) -&gt; None:<br>    payload = json.dumps(message).encode(&quot;utf-8&quot;)<br>    req = urllib.request.Request(<br>        SLACK_WEBHOOK_URL,<br>        data=payload,<br>        headers={&quot;Content-Type&quot;: &quot;application/json&quot;},<br>        method=&quot;POST&quot;,<br>    )<br>    with urllib.request.urlopen(req) as response:<br>        if response.status != 200:<br>            raise ValueError(f&quot;Slack returned {response.status}: {response.read()}&quot;)<br><br>def lambda_handler(event, context):<br>    for record in event.get(&quot;Records&quot;, []):<br>        sns_message = 
record.get(&quot;Sns&quot;, {}).get(&quot;Message&quot;, &quot;{}&quot;)<br>        <br>        try:<br>            event_data = json.loads(sns_message)<br>        except json.JSONDecodeError:<br>            print(f&quot;Could not parse SNS message: {sns_message}&quot;)<br>            continue<br>        print(f&quot;Processing event: {event_data.get(&#39;snowflakeEventType&#39;)} for {event_data.get(&#39;resource&#39;)}&quot;)<br>        slack_message = format_slack_message(event_data)<br>        send_to_slack(slack_message)<br>        print(&quot;Alert sent to Slack successfully.&quot;)<br>    return {&quot;statusCode&quot;: 200, &quot;body&quot;: &quot;Done&quot;}</pre><p>Set the SLACK_WEBHOOK_URL as an environment variable in your Lambda configuration <em>(not hardcoded, never hardcoded)</em>. You can create a Slack incoming webhook from the Slack API dashboard for your workspace.</p><h3>What the alert looks like in Slack</h3><p>When a task fails, the team sees something like this in the #data-alerts channel:</p><pre>⚙️ Snowflake Task Failure<br>Resource: PROD.TRANSFORMS.MY_TRANSFORM_TASK<br>Time: 2024-01-15T08:32:11Z<br>Error:<br>SQL compilation error: Object &#39;MY_TABLE&#39; does not exist or not authorized.</pre><p>Clean, specific, actionable. No one needs to go digging in Snowflake’s query history to know what broke.</p><h3>Handling multiple tasks and pipes</h3><p>One integration covers everything. Any task or pipe you attach ERROR_INTEGRATION = sns_error_integration to will automatically publish to the same SNS topic and flow through the same Lambda. You don&#39;t need separate integrations per object — just update the task or pipe definition.</p><p>We tag the resource name in the Slack message, so you always know exactly which task or pipe failed without any ambiguity.</p><h3>Making this production-ready</h3><p>Getting the alert firing is the quick win. 
Here’s what we did to make it actually reliable in production.</p><ul><li><strong>Add a dead-letter queue.</strong> Lambda invocations can fail. If your Slack webhook is temporarily down or your Lambda has a bug, you don’t want to silently lose failure notifications — that’s the exact opposite of what you’re building. Configure an SQS dead-letter queue on the Lambda so failed invocations are captured and can be replayed.</li><li><strong>Limit error message length.</strong> Snowflake error messages can be verbose. We cap ours at <strong>500</strong> characters in the Lambda before sending to Slack. Slack blocks have limits, and a wall of text in an alert channel gets ignored fast.</li><li><strong>Route alerts by severity.</strong> Not all failures are equal. A prod Snowpipe failing is an incident. A dev task failing overnight is background noise. We route to different Slack channels based on the database name — prod failures go to #data-incidents, everything else goes to #data-alerts.</li></ul><p>Here’s the routing logic added to the Lambda:</p><pre>def get_slack_channel(resource: dict) -&gt; str:<br>    database = resource.get(&quot;database&quot;, &quot;&quot;).upper()<br>    if database == &quot;SOMETHING_IN_PRODUCTION&quot;:<br>        return &quot;#data-incidents&quot;<br>    return &quot;#data-alerts&quot;</pre><p><em>(With Slack’s incoming webhooks, you’ll need separate webhooks per channel, or migrate to the Slack Web API with a bot token to route dynamically.)</em></p><p><strong>Test it before you trust it.</strong> You can manually trigger a test by creating a task that intentionally fails:</p><pre>CREATE OR REPLACE TASK test_failure_task<br>  WAREHOUSE = TRANSFORM_WH<br>  SCHEDULE = &#39;USING CRON * * * * * UTC&#39;<br>  ERROR_INTEGRATION = sns_error_integration<br>AS<br>  SELECT 1 / 0;  -- guaranteed division by zero</pre><pre>-- I always forget to run this command; don&#39;t make the same mistake.<br>ALTER TASK test_failure_task RESUME;</pre><p>Watch Slack.
Within a minute, you should see the alert fire. Once confirmed, drop the task:</p><pre>ALTER TASK test_failure_task SUSPEND;<br>DROP TASK test_failure_task;</pre><p>The whole setup — from SNS topic to Slack message — takes a reasonably short time to configure. After that, it runs itself. We haven’t missed a Snowflake failure since we deployed this, and the team spends zero time polling task history to check whether things are working.</p><p>If you’re running Snowflake in production and you don’t have something like this, set it up today. Quiet failures are the ones that get you.</p><p>The GitLab Data Team’s full documentation on this integration is publicly available in the GitLab <a href="https://handbook.gitlab.com/handbook/enterprise-data/platform/snowflake/snowpipe/"><strong>handbook</strong></a>.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=91b4680b6ba8" width="1" height="1" alt=""><hr><p><a href="https://medium.com/snowflake/snowflake-task-and-pipe-failures-91b4680b6ba8">Snowflake Task and Pipe Failures</a> was originally published in <a href="https://medium.com/snowflake">Snowflake Builders Blog: Data Engineers, App Developers, AI, &amp; Data Science</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Data classification with Snowflake: from impossible to production]]></title>
            <link>https://medium.com/snowflake/data-classification-with-snowflake-from-impossible-to-production-aa11680aca75?source=rss-ff65005cbd7e------2</link>
            <guid isPermaLink="false">https://medium.com/p/aa11680aca75</guid>
            <category><![CDATA[llm]]></category>
            <category><![CDATA[dbt]]></category>
            <category><![CDATA[gitlab]]></category>
            <category><![CDATA[python]]></category>
            <dc:creator><![CDATA[Radovan Bacovic]]></dc:creator>
            <pubDate>Sat, 21 Feb 2026 15:01:02 GMT</pubDate>
            <atom:updated>2026-02-21T15:01:02.628Z</atom:updated>
<content:encoded><![CDATA[<h3>Automated Data Classification with Snowflake: From Nowhere to Production</h3><h4>What is the problem with today’s data?</h4><p>Let me scare you for a moment.</p><p>Data breaches. Reputation damage. Intellectual property theft. Code leaks. GDPR violations. Multi-million dollar fines.</p><p>This isn’t a dystopian future — it’s happening right now to companies just like yours. And the scary part? Most organisations have no idea which of their database tables contain sensitive information.</p><p>At the <a href="https://handbook.gitlab.com/handbook/enterprise-data/"><strong>GitLab Data Team</strong></a>, we faced this exact problem. Our data landscape had grown to thousands of models across <strong>RAW</strong>, <strong>PREP</strong>, and <strong>PROD</strong> databases. Somewhere in that massive ecosystem lurked personally identifiable information (<a href="https://www.ibm.com/think/topics/pii"><strong>PII</strong></a>) and material non-public information (<a href="https://www.investopedia.com/terms/m/materialinsiderinformation.asp"><strong>MNPI</strong></a>) — the kind of data that could trigger compliance violations, reputation damage, and those terrifying multi-million-dollar fines.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Ahf9mcqrJVoGRoEZDQmw4w.png" /></figure><p>The traditional approach doesn’t work: manual tagging by data engineers who already have full plates. The reality?</p><ol><li>It doesn’t scale</li><li>It’s error-prone, and</li><li>By the time you finish, your data has already changed.</li></ol><h3>Why Data Classification Became Non-Negotiable</h3><p>Here’s the uncomfortable truth: every software company in 2025 is racing to add AI features to its products. But AI on top of unclassified data is a recipe for disaster.</p><p>Our data classification challenge wasn’t just about compliance checkboxes.
We needed to solve three critical problems:</p><p><strong>Who’s accessing our sensitive data?</strong> Without proper classification, we couldn’t audit who was querying <strong>PII</strong> or <strong>MNPI</strong>. Any employee could potentially download customer information without leaving a trace.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*8sIdLdQPbAbahsE226NKEQ.png" /><figcaption>PII and personal data: good to know which is which</figcaption></figure><p><strong>Automated tagging at scale.</strong> With approximately 10,000 models in Snowflake, manual classification was dead on arrival.</p><p><strong>End-to-end governance.</strong> We needed a solution that covered everything from initial tagging to ongoing monitoring and audit trails.</p><p>The market and the open source space offered plenty of tools:</p><ol><li><a href="https://github.com/tokern/piicatcher"><strong>PIICatcher</strong></a><strong>,</strong></li><li><a href="https://microsoft.github.io/presidio/"><strong>Microsoft Presidio</strong></a>, and</li><li>various Snowflake features like <a href="https://docs.snowflake.com/en/user-guide/classify-intro"><strong>Sensitive data classification</strong></a>.</li></ol><p>Here’s what we learned — none of them fully solved our specific problem. Open source tools lacked community support and scalability.
Third-party solutions introduced vendor lock-in and high costs.</p><p>We needed to move quickly <em>(considering upcoming audit deadlines)</em> and had the right ingredients: hands-on experience with Python, Snowflake, GitLab CI/CD, and AI.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ks8Wvto4dF_j2FX7HcS6og.png" /><figcaption>Available options for the automated data classification</figcaption></figure><p>We built our own pipeline around Snowflake’s classification mechanism.</p><h3>How: Building production-grade classification</h3><p>Our success criteria were stringent:</p><blockquote>tag <a href="https://www.ibm.com/think/topics/pii"><strong>PII</strong></a> and <a href="https://www.investopedia.com/terms/m/materialinsiderinformation.asp"><strong>MNPI</strong></a> data across all environments through a fully automated process, with zero (or near-zero) false positives, and the ability to classify 10k+ models in under two hours.</blockquote><p>After evaluating the landscape, we chose <strong>Snowflake’s classification capabilities</strong> as our foundation. Not because it was perfect, but because it offered the best balance of scalability, usability, and development speed for our specific needs. It was a production-ready feature we could adopt immediately, and the cost was reasonable; refer <a href="https://www.snowflake.com/legal-files/CreditConsumptionTable.pdf"><strong>here</strong></a> for more details.</p><h3>The Architecture</h3><p>We built a multi-layered system orchestrated through Apache Airflow:</p><ol><li><strong>Domain configuration (YAML)</strong> We began by defining our scope in configuration files, specifying which databases, schemas, and tables to include or exclude for MNPI and PII classification.
This gave us the flexibility to adapt as our data landscape evolved.</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*JRbr5HO0vZxUFx4v0n-14A.png" /><figcaption>YML specification for the data classification</figcaption></figure><p><strong>2. Python<em> </em></strong><em>(in K8s, where most of our pipelines are implemented)</em><strong><em> </em>+ Airflow Orchestration</strong> A single DAG (data_classification) coordinates five parallel tasks:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*7wj5sEKp55vj03gaEAk0_A.png" /><figcaption>Deployment options for Python code</figcaption></figure><ul><li>extract_classification — pulls metadata from Snowflake</li><li>execute_classification_MNPI — runs MNPI classification</li><li>execute_classification_[RAW,PREP,PROD] — processes each environment separately</li></ul><p><strong>3. Smart parameters:</strong> Three key parameters made our solution flexible:</p><ul><li><strong>DATA_CLASSIFICATION_DAYS</strong> — how far back to scan (default: 90 days)</li><li><strong>DATA_CLASSIFICATION_TAGGING_TYPE</strong> — full (a complete re-tagging) or incremental (a faster, partial run)</li><li><strong>DATA_CLASSIFICATION_UNSET</strong> — remove all tags and start from the beginning</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*eNh4m9l_v9bY-tPKUD6aHA.png" /><figcaption>Airflow setup for Data classification</figcaption></figure><p><strong>4. Snowflake’s LLM-powered classification.</strong> This is where the magic happens.
Snowflake’s built-in classification leverages large language models to understand data semantically — not just pattern matching on column names, but actually analysing sample data to determine if it contains sensitive information.</p><p>We use Snowflake’s <a href="https://docs.snowflake.com/en/sql-reference/stored-procedures/system_classify_schema"><strong>SYSTEM$CLASSIFY_SCHEMA</strong></a> function as the core engine:</p><pre>CALL SYSTEM$CLASSIFY_SCHEMA(&#39;PREP.SALES&#39;, {<br>  &#39;sample_count&#39;: 1000, <br>  &#39;auto_tag&#39;: true<br>});</pre><p>This single function call does the heavy lifting: it samples <strong>1000</strong> rows from each table in the schema, runs them through Snowflake’s LLM models, identifies PII patterns, and automatically applies tags to sensitive columns. Where you need more samples for accuracy, you can increase the <strong>sample_count</strong> parameter:</p><pre>CALL SYSTEM$CLASSIFY_SCHEMA(&#39;RAW.CUSTOMER_DATA&#39;, {<br>  &#39;sample_count&#39;: 5000, <br>  &#39;auto_tag&#39;: true<br>});</pre><p>Snowflake handles the complexity: the LLM inference, the tag management, the metadata updates. We just orchestrate which schemas to process and when.</p><p><strong>Good to know the limitations:</strong> Snowflake’s SYSTEM$CLASSIFY_SCHEMA has a hidden limit — it can only process up to <strong>1,000</strong> tables per schema in a single call.
When you have schemas with thousands of tables <em>(like we did)</em>, the function simply stops processing after hitting that ceiling.</p><p>For MNPI data, we pick up the tags from our dbt model specifications and apply the same tagging in Snowflake.</p><h3>The Technical Win</h3><p>The entire process runs on Snowflake, offering proven scalability, lower maintenance overhead, and tighter security within Snowflake’s trust boundary.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*S7ADOUvE6ZlRfvsiG2istg.png" /><figcaption>Architectural diagram for the data classification project</figcaption></figure><p>We leverage GitLab Duo for code assistance and maintain everything in <a href="https://about.gitlab.com/"><strong>GitLab</strong></a> with full CI/CD pipelines. Every change goes through review, staging, and production — <strong>DevSecOps</strong> culture applied to data governance.</p><h3>Beyond Simple Tagging</h3><p>But classification was just the foundation. We built three additional layers:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*DevmZE2jYL7V1Lbc7iju-w.png" /><figcaption>Tagged table example (fictional)</figcaption></figure><p><strong>Audit layer.</strong> Every classification action gets logged. Every suspicious query gets flagged. Security teams can review and act on inappropriate activity before it becomes a breach. For more details, refer to our dbt <a href="https://dbt.gitlabdata.com/#!/model/model.gitlab_snowflake.wk_sensitive_queries_source"><strong>documentation</strong></a>. Under the hood, we join Snowflake’s metadata tables for tags with the query history, which gives us a comprehensive overview of which queries touched which tagged objects.</p><p><strong>Monitoring layer.</strong> A dbt-based monitoring system continuously checks queries against our tagged data.
It looks for suspicious patterns:</p><ul><li>SELECT * queries with no WHERE clauses or aggregations</li><li>Queries fetching round numbers of results (500, 1000 — typical data extraction patterns)</li><li>GET operations (clear signs of data downloads)</li><li>Unusual access patterns by non-system users</li></ul><p><strong>The hidden data problem.</strong> Here’s something most classification tools miss: when you create a view from a PII-tagged table, or clone a table, the new object should inherit those tags automatically. We solved this by tracking object lineage and propagating tags through the dependency graph.</p><h3>What We Actually Solved</h3><p>A few months after launch, here’s our scorecard:</p><ol><li><strong>Scalability</strong> — 10k+ models classified in under 2 hours</li><li><strong>Automation</strong> — zero human intervention required</li><li><strong>Compliance</strong> — legal and audit teams satisfied</li><li><strong>Adaptability</strong> — handles frequent dbt re-creation of objects</li><li><strong>Security</strong> — catches suspicious access patterns</li><li><strong>Malicious use detection</strong> — audit trail for every sensitive query</li><li><strong>Hidden data tracking</strong> — tags propagate through views and clones</li></ol><p>What we haven’t solved yet:</p><ol><li>❌ <strong>Vendor lock-in</strong> — we’re committed to Snowflake (but we’re okay with that)</li><li>❌ <strong>Full control</strong> — we’re dependent on Snowflake’s classification evolution</li></ol><p>The solution is scheduled in Airflow: incremental tagging finishes in under 1 hour, and a full tagging run completes in under <strong>2</strong> hours on an <strong>L-size</strong> warehouse.</p><h3>The Honest Setbacks</h3><p>Not everything went smoothly. We initially tried Snowflake’s auto-classification during PrP <em>(private preview)</em> — the accuracy wasn’t there yet. We pivoted to their LLM-based approach when it hit GA <em>(generally available)</em>.
The cost per classification run is higher, but warehouse size tuning solved our scalability concerns.</p><h3>Conclusion: Choose the Problem, Not the Tool</h3><p>The data landscape is changing faster than our ability to secure it. Every company rushing to add AI features faces the same fundamental challenge: <strong><em>you can’t safely feed AI tools with unclassified data.</em></strong></p><p>Our journey taught us three critical lessons:</p><ul><li><strong>Stay open-source where possible — </strong>but don’t let it hinder your progress. We evaluated PIICatcher and Presidio extensively, but when Snowflake offered a native solution that immediately solved 80% of our problems, we adopted it.</li><li><strong>Think about scalability from day one, but start small</strong> — we began with a single database, proved the concept, then scaled to 10k+ models. The architectural decisions we made on day one <em>(Airflow, YAML configuration, incremental processing…)</em> enabled that scale.</li><li><strong>Focus on business value, assess costs, but move fast</strong> — compliance deadlines don’t wait for perfect solutions. We shipped a working classification system in weeks, not months, because we chose the problem <em>(data security) </em>over the tool <em>(vendor independence).</em></li></ul><p>The future of data is clear: it needs to be classified, governed, and continuously monitored. Whether you build or buy, the time to start is now. Because the cost of waiting isn’t just measured in dollars — it’s measured in reputation, trust, and the ability to innovate safely with AI.</p><p><em>What’s next for us? 
We’re exploring:</em></p><ol><li><a href="https://docs.snowflake.com/en/user-guide/classify-auto"><strong><em>Auto classification</em></strong></a><em> in Snowflake</em></li><li><em>Integrating with tools like Atlan and Tableau, and connecting our classification data to other governance programs.</em></li><li>Using AI models to find suspicious queries and track tags</li></ol><p><strong><em>Our journey continues.</em></strong></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=aa11680aca75" width="1" height="1" alt=""><hr><p><a href="https://medium.com/snowflake/data-classification-with-snowflake-from-impossible-to-production-aa11680aca75">Data classification with Snowflake: from impossible to production</a> was originally published in <a href="https://medium.com/snowflake">Snowflake Builders Blog: Data Engineers, App Developers, AI, &amp; Data Science</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Snowflake: Rolling window DISTINCT count. How to make this happen?]]></title>
            <link>https://medium.com/snowflake/snowflake-rolling-window-distinct-count-how-to-make-this-happen-b9ba35cc105a?source=rss-ff65005cbd7e------2</link>
            <guid isPermaLink="false">https://medium.com/p/b9ba35cc105a</guid>
            <category><![CDATA[snowflake]]></category>
            <category><![CDATA[sql]]></category>
            <category><![CDATA[python]]></category>
            <category><![CDATA[gitlab]]></category>
            <category><![CDATA[data-engineering]]></category>
            <dc:creator><![CDATA[Radovan Bacovic]]></dc:creator>
            <pubDate>Thu, 22 Jan 2026 20:02:02 GMT</pubDate>
            <atom:updated>2026-01-22T20:02:02.201Z</atom:updated>
<content:encoded><![CDATA[<h3>Snowflake: Rolling Window DISTINCT Count. How to make this happen?</h3><h4><strong>The problem that shouldn’t exist but does</strong></h4><p>You need to count <strong>DISTINCT</strong> users over a rolling 28-day window. Seems straightforward, right? Write a COUNT(DISTINCT user_id) with a <a href="https://docs.snowflake.com/en/user-guide/functions-window-using">window function</a> and you&#39;re done.</p><p><strong>Except you can’t!</strong></p><p><a href="https://www.snowflake.com/en/"><strong>Snowflake</strong></a>, like most other database systems, doesn’t support COUNT(DISTINCT...) over window functions. This fundamental limitation forces data engineers into workarounds that sacrifice either performance or accuracy when dealing with time-series analytics at scale.</p><p>At the <a href="http://about.gitlab.com/"><strong>GitLab</strong></a><strong> Data Team</strong>, we hit this wall while calculating monthly active user metrics across millions of events. We needed to process tables with billions of records efficiently, handle date gaps accurately, and make the solution reusable for our analytics engineers. The standard SQL approaches either crawled to a halt or required complex date-filling logic.</p><h3>What We Built: Why Python Changes Everything</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*y3yUq4jmgRbq7GK9YWNxrA.png" /></figure><p>The breakthrough came from rethinking the problem entirely. Instead of fighting SQL’s <a href="https://docs.snowflake.com/en/user-guide/functions-window-using">window function</a> limitations, we used Python to generate date arrays and leveraged Snowflake’s LATERAL FLATTEN to explode them. This combination eliminates window functions while maintaining scalability.</p><p><strong>The key insight</strong>: a 28-day rolling window is just 28 array members. 
Cache this small array in Python, flatten it with SQL, and suddenly you’re processing hundreds of millions of rows in minutes instead of hours.</p><p>Here’s why this scales:</p><ol><li><strong>Python caching</strong>: The @functools.lru_cache decorator means date array generation happens once per unique date range, not billions of times</li><li><strong>No window partitioning</strong>: Standard window functions scan entire partitions repeatedly; our approach processes each row exactly once</li><li><strong>Native Snowflake operations</strong>: LATERAL FLATTEN is optimized C code, not interpreted SQL</li></ol><p>The performance difference is dramatic. On our <strong>500M+</strong> record dataset, this approach completed in under 5 minutes on an <strong>L</strong> warehouse. Comparable window function solutions either failed or required <strong>XL</strong> warehouses running <strong>30+</strong> minutes.</p><h3>How it Works: The Complete Implementation</h3><p>Let’s walk through a concrete example. 
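</p><p>Before the SQL version, here is the same rolling-window logic as a small, dependency-free Python reference implementation. It is a sketch for sanity-checking the Snowflake results, using the same sample rows as the walkthrough:</p>

```python
import datetime
from collections import defaultdict

# Sample activity rows mirroring the user_activity table in the walkthrough:
# (activity_date, namespace_id, user_id)
rows = [
    ("2024-01-01", 100, 1001), ("2024-01-01", 100, 1002), ("2024-01-01", 100, 1002),
    ("2024-01-02", 100, 1001), ("2024-01-05", 100, 1003), ("2024-01-05", 100, 1001),
    ("2024-01-08", 100, 1004), ("2024-01-10", 100, 1002), ("2024-01-15", 100, 1005),
    ("2024-01-20", 100, 1001), ("2024-01-25", 100, 1006), ("2024-01-28", 100, 1003),
    ("2024-02-01", 100, 1007),
]

def rolling_distinct_counts(rows, window_days=28):
    """For each date between the min and max activity date, count DISTINCT
    users active in the trailing window_days-day window (inclusive)."""
    activity = defaultdict(set)  # date -> set of active user_ids
    for day, _namespace, user_id in rows:
        activity[datetime.date.fromisoformat(day)].add(user_id)
    first, last = min(activity), max(activity)
    counts, report = {}, first
    while report <= last:
        start = report - datetime.timedelta(days=window_days - 1)
        users = set()
        for day, ids in activity.items():
            if start <= day <= report:
                users |= ids
        counts[report.isoformat()] = len(users)
        report += datetime.timedelta(days=1)
    return counts

counts = rolling_distinct_counts(rows)
print(counts["2024-01-01"], counts["2024-01-05"], counts["2024-01-08"])  # 2 3 4
```

<p>Dates inside gaps <em>(like January 3–4)</em> still get a row, which is exactly the behaviour the SQL solution is designed to reproduce.</p><p>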
First, create sample data representing user activity:</p><pre>CREATE OR REPLACE TABLE user_activity (<br>    activity_date DATE,<br>    namespace_id INTEGER,<br>    user_id INTEGER<br>);<br><br>-- Insert sample data with intentional date gaps<br>INSERT INTO user_activity VALUES<br>    (&#39;2024-01-01&#39;, 100, 1001),<br>    (&#39;2024-01-01&#39;, 100, 1002),<br>    (&#39;2024-01-01&#39;, 100, 1002), -- same as prior record, expect to see 2 for 2024-01-01<br>    (&#39;2024-01-02&#39;, 100, 1001),<br>    (&#39;2024-01-05&#39;, 100, 1003), -- gap: no data for 01-03, 01-04<br>    (&#39;2024-01-05&#39;, 100, 1001),<br>    (&#39;2024-01-08&#39;, 100, 1004),<br>    (&#39;2024-01-10&#39;, 100, 1002),<br>    (&#39;2024-01-15&#39;, 100, 1005),<br>    (&#39;2024-01-20&#39;, 100, 1001),<br>    (&#39;2024-01-25&#39;, 100, 1006),<br>    (&#39;2024-01-28&#39;, 100, 1003),<br>    (&#39;2024-02-01&#39;, 100, 1007);</pre><p>Next, create the Python function that generates date arrays with built-in caching:</p><pre>CREATE OR REPLACE FUNCTION generate_date_list(start_date DATE, end_date DATE)<br>RETURNS ARRAY<br>LANGUAGE PYTHON<br>RUNTIME_VERSION = &#39;3.11&#39;<br>HANDLER = &#39;generate_date_list&#39;<br>AS<br>$$<br>import datetime<br>import functools<br><br>@functools.lru_cache(maxsize=100)<br>def generate(start_date, end_date):<br>    &quot;&quot;&quot;<br>    Generate cached date arrays for rolling windows.<br>    <br>    With maxsize=100, we can cache all unique 28-day windows<br>    in a typical monthly processing run, dramatically reducing<br>    computation overhead.<br>    &quot;&quot;&quot;<br>    result = []<br>    current = start_date<br>    while current &lt;= end_date:<br>        result.append(current)<br>        current += datetime.timedelta(days=1)<br>    return result<br><br>def generate_date_list(start_date, end_date):<br>    return generate(start_date, end_date)<br>$$;</pre><p><strong>Why this matters:</strong> When processing millions of rows with a 28-day 
window, most rows will request identical date ranges (like “27 days before today”). Without caching, you’d rebuild the same 28-element array millions of times; with caching, you build it once and reuse it billions of times, turning an <strong><em>O(n × m)</em></strong> operation into effectively <strong><em>O(n)</em></strong>.</p><p>Now implement the rolling window logic:</p><pre>WITH base_data AS (<br>    -- Your source data<br>    SELECT activity_date,<br>           namespace_id,<br>           user_id<br>      FROM user_activity<br>),<br>date_bounds AS (<br>    -- Calculate processing range<br>    SELECT MIN(activity_date) AS min_date,<br>           MAX(activity_date) AS max_date<br>      FROM base_data<br>),<br>rolling_windows AS (<br>    -- Generate 28-day window for each activity<br>    SELECT activity_date,<br>           namespace_id,<br>           user_id,<br>           generate_date_list(<br>               DATEADD(day, -27, activity_date),<br>               activity_date<br>           ) AS date_window<br>      FROM base_data<br>)<br>-- Flatten windows and count distinct users<br>SELECT DATEADD(day, 27, dates.value::DATE) AS report_date,<br>       rolling_windows.namespace_id,<br>       COUNT(DISTINCT rolling_windows.user_id) AS distinct_users_28d<br>  FROM rolling_windows,<br>       LATERAL FLATTEN(INPUT =&gt; rolling_windows.date_window) AS dates<br> WHERE DATEADD(day, 27, dates.value::DATE) <br>       BETWEEN (SELECT min_date FROM date_bounds) <br>           AND (SELECT max_date FROM date_bounds)<br> GROUP BY report_date, namespace_id<br> ORDER BY report_date;</pre><p>What’s happening here:</p><ol><li><strong>base_data</strong>: Your source events (activity date, namespace, user)</li><li><strong>date_bounds</strong>: Establishes the processing range to avoid edge effects</li><li><strong>rolling_windows</strong>: For each activity, generates a 28-day lookback array using the Python function <strong><em>generate_date_list</em></strong></li><li><strong>Final 
SELECT</strong>: Flattens arrays with <strong><em>LATERAL</em></strong> <strong><em>FLATTEN</em></strong>, then counts distinct users for each date</li></ol><p>The crucial performance trick:</p><pre>DATEADD(day, 27, dates.value::DATE)</pre><p>converts each array member back to the “end date” perspective, allowing proper grouping without date gaps.</p><p>Output:</p><pre>| report_date | namespace_id | distinct_users_28d |<br>|-------------|--------------|--------------------|<br>| 2024-01-01  | 100          | 2                  |<br>| 2024-01-02  | 100          | 2                  |<br>| 2024-01-03  | 100          | 2                  |<br>| 2024-01-04  | 100          | 2                  |<br>| 2024-01-05  | 100          | 3                  |<br>| 2024-01-06  | 100          | 3                  |<br>| 2024-01-07  | 100          | 3                  |<br>| 2024-01-08  | 100          | 4                  |<br>...<br>...<br>...<br>| 2024-01-28  | 100          | 6                  |<br>| 2024-02-01  | 100          | 7                  |</pre><p>Notice how date gaps <em>(like January 3–4)</em> are automatically filled with accurate rolling counts — no manual date spine required.</p><h3>Why the Alternatives Fall Short</h3><p>Before arriving at this solution, we evaluated four standard approaches:</p><p><strong>Range-based window functions</strong>: Requires DENSE_RANK() workarounds since COUNT(DISTINCT) isn&#39;t supported. Can&#39;t handle date gaps without manual date spines. Fails on datasets over 100M records due to partition size limits.</p><p><strong>Row-based window functions</strong>: Slightly better performance but still requires extensive date-filling logic. Misses the maximum date row without workarounds. Complexity scales poorly with dataset size.</p><p><strong>Regular subqueries</strong>: Conceptually simple — join the table to itself with date range conditions. Performance degrades quadratically with data volume. 
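</p><p>To make the self-join idea concrete, here is a hedged sketch using an in-memory SQLite database as a stand-in for Snowflake (SQLite date syntax, not Snowflake syntax), run against the same sample data as the walkthrough:</p>

```python
import sqlite3

# Sample rows mirroring the user_activity table from the walkthrough.
rows = [
    ("2024-01-01", 100, 1001), ("2024-01-01", 100, 1002), ("2024-01-01", 100, 1002),
    ("2024-01-02", 100, 1001), ("2024-01-05", 100, 1003), ("2024-01-05", 100, 1001),
    ("2024-01-08", 100, 1004), ("2024-01-10", 100, 1002), ("2024-01-15", 100, 1005),
    ("2024-01-20", 100, 1001), ("2024-01-25", 100, 1006), ("2024-01-28", 100, 1003),
    ("2024-02-01", 100, 1007),
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE user_activity (activity_date TEXT, namespace_id INT, user_id INT)")
con.executemany("INSERT INTO user_activity VALUES (?, ?, ?)", rows)

# Self-join: for every report date, rescan all rows in its 28-day window.
# The work grows with (number of dates) x (rows per window), which is what
# makes this approach crawl on large tables.
query = """
SELECT a.activity_date AS report_date,
       COUNT(DISTINCT b.user_id) AS distinct_users_28d
  FROM (SELECT DISTINCT activity_date FROM user_activity) AS a
  JOIN user_activity AS b
    ON b.activity_date BETWEEN date(a.activity_date, '-27 days') AND a.activity_date
 GROUP BY a.activity_date
 ORDER BY a.activity_date
"""
for report_date, n in con.execute(query):
    print(report_date, n)
```

<p>Note that this version only returns rows for dates that actually appear in the data: the gap dates are simply missing, so you would still need a separate date spine to fill them.</p><p>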
Our 500M record dataset would take hours even on XL warehouses.</p><p><strong>Lateral join subqueries</strong>: Cleaner syntax than regular subqueries but identical performance characteristics. Still requires full table scans per date partition.</p><p>The lateral array construction approach consistently outperformed these alternatives by 10–20x while maintaining code clarity and handling edge cases automatically.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*sj4_xM_tI5dpj-H_i4aSjA.png" /><figcaption>Comparing options for the rolling windows count DISTINCT</figcaption></figure><h3>Do’s and Don’ts for Production</h3><p>✅ <strong>Do’s:</strong></p><ul><li><strong>Pre-aggregate to the daily level</strong> before applying rolling windows. Convert timestamps to dates in a base table first — mixing timestamp and date types in window calculations kills performance</li><li><strong>Right-size your warehouse</strong>. Use XS for &lt;100K rows, S for 100K-1M, L for 1M-20M, XL for 20M+</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/706/1*KjAAR2vz4D9NniJKQCCajQ.png" /><figcaption>Proposed warehouse size, based on the dataset size</figcaption></figure><ul><li><strong>Process monthly</strong>. Calculate 60 days of rolling history once per month rather than recalculating the full history daily</li><li><strong>Monitor cache effectiveness</strong>. If you’re processing many different window sizes, increase lru_cache(maxsize=...) appropriately</li><li><strong>Create a dbt macro</strong> for this pattern and make your analysts happy. Hide the logic behind the macro while keeping full flexibility. 
With this implementation, anyone can simply call the macro from a dbt model, and it will do the rest.</li></ul><pre>{%- macro count_distinct_rolling_window(source_table, distinct_column, other_columns_list, date_column_name=&#39;ping_date&#39;, window_in_days=28) -%}<br><br>{% set source_table_name = source_table %}<br>{% set time_window = window_in_days - 1 %}<br>{% set date_column = date_column_name %}<br><br>  WITH base AS (<br>    SELECT {{ date_column }}     AS ping_date,<br>           {% for other_column in other_columns_list %}<br>              {{ other_column }} AS {{ other_column }},<br>           {%- endfor -%}<br>           {{ distinct_column }} AS {{ distinct_column }}<br>      FROM {{ ref(source_table_name) }}<br>  ), min_max AS (<br>    SELECT MIN(ping_date) AS min_date,<br>           MAX(ping_date) AS max_date<br>      FROM base<br>  ), generate_rolling_window AS (<br>    SELECT ping_date,<br>           {% for other_column in other_columns_list %}<br>              {{ other_column }} AS {{ other_column }},<br>           {%- endfor -%}<br>           {{ distinct_column }},<br>           generate_date_list(DATEADD(day, -{{ time_window }}, ping_date), ping_date) AS rolling_window<br>      FROM base<br>  )<br>  SELECT DATEADD(day, {{ time_window }}, unnest_dates.value::DATE) AS ddate,<br>         {% for other_column in other_columns_list %}<br>              {{ other_column }} AS {{ other_column }},<br>         {%- endfor -%}<br>         COUNT(DISTINCT generate_rolling_window.{{ distinct_column }}) AS distinct_user_count<br>    FROM generate_rolling_window,<br>         LATERAL FLATTEN(INPUT =&gt; generate_rolling_window.rolling_window) AS unnest_dates<br>   WHERE ddate BETWEEN (SELECT DATEADD(day, -{{ time_window }}, min_date) FROM min_max) AND (SELECT max_date FROM min_max)<br>   GROUP BY ALL<br><br>{%- endmacro 
-%}</pre><p>and call the macro from the dbt project:</p><pre>{{ <br>count_distinct_rolling_window(source_table=&#39;my_table&#39;, <br>                              distinct_column=&#39;id&#39;, <br>                              date_column_name=&#39;date&#39;, <br>                              other_columns_list=[&#39;metrics_path&#39;,&#39;namespace&#39;]<br>                              ) <br>}}</pre><p>❌ <strong>Don’ts:</strong></p><ul><li><strong>Don’t use window functions on massive datasets (&gt;100M records)</strong>. The partition scans will overwhelm even large warehouses. Use the lateral array approach instead</li><li><strong>Don’t mix timestamp and date types in calculations</strong>. Always cast to DATE once in your base table (timestamp_column::DATE AS date_column) and query the date column directly. Instead of repeating the cast in every query:</li></ul><pre>SELECT <br>...<br>timestamp_column::DATE AS date_column,<br>...</pre><p>create a base table once:</p><pre>CREATE TABLE<br>...<br>SELECT <br>...<br>timestamp_column::DATE AS date_column,</pre><p>and later on query it as a DATE data type:</p><pre>SELECT<br>...<br>date_column -- this is a DATE data type now</pre><ul><li><strong>Don’t skip the date bounds CTE</strong>. Without it, you’ll get incorrect counts at the edges of your date range</li><li><strong>Don’t process the full history daily</strong>. Rolling windows only need recent data — create a sliding 60-day base table and process incrementally</li></ul><p>Snowflake’s lack of DISTINCT count window functions isn’t a limitation you need to accept. With Python UDFs and lateral joins, you can build rolling window calculations that are faster, cleaner, and more maintainable than traditional SQL workarounds.</p><p>The code is straightforward, the performance is excellent, and the pattern is reusable. 
Stop fighting SQL’s limitations and start combining the best of both worlds (🐍<strong>Python</strong> + ❄️<strong>Snowflake</strong>).</p><hr><p><a href="https://medium.com/snowflake/snowflake-rolling-window-distinct-count-how-to-make-this-happen-b9ba35cc105a">Snowflake: Rolling window DISTINCT count. How to make this happen?</a> was originally published in <a href="https://medium.com/snowflake">Snowflake Builders Blog: Data Engineers, App Developers, AI, &amp; Data Science</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Learn Weaviate in 15 minutes: A practical guide for SQL developers]]></title>
            <link>https://medium.com/@radovan.bacovic/learn-weaviate-in-15-minutes-a-practical-guide-for-sql-developers-2badafc4081a?source=rss-ff65005cbd7e------2</link>
            <guid isPermaLink="false">https://medium.com/p/2badafc4081a</guid>
            <category><![CDATA[data]]></category>
            <category><![CDATA[weaviate]]></category>
            <category><![CDATA[vector-database]]></category>
            <category><![CDATA[sql]]></category>
            <category><![CDATA[gitlab]]></category>
            <dc:creator><![CDATA[Radovan Bacovic]]></dc:creator>
            <pubDate>Thu, 25 Dec 2025 21:25:15 GMT</pubDate>
            <atom:updated>2025-12-30T14:44:15.077Z</atom:updated>
<content:encoded><![CDATA[<p>Understanding semantic search through the lens of relational databases</p><p><strong>📣NOTE: </strong>The complete code from this tutorial can be found in the repo <a href="https://gitlab.com/radovan.bacovic/weaviate101"><strong>Weaviate101</strong></a></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/612/1*UDWpCCdJl_MCNmKCwMsTUQ.png" /></figure><h3>Part 1: Understanding vector databases</h3><p>Before diving into Weaviate specifically, let’s establish what problem vector databases solve and why you should care.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*KHMlzrl2kmigoGiAJ4u6gA.png" /><figcaption>Basic concepts of a vector database (<a href="https://learn.microsoft.com/en-us/data-engineering/playbook/solutions/vector-database/">source</a>)</figcaption></figure><h3>The fundamental problem</h3><p>Traditional databases excel at exact matches:</p><pre>SELECT * <br>  FROM accounts <br> WHERE account_name = &#39;Acme Corp&#39;;</pre><p>But they fail at semantic understanding:</p><pre>-- This doesn&#39;t work in traditional SQL<br>SELECT * <br>  FROM accounts <br> WHERE meaning_similar_to(&#39;accounts showing churn risk&#39;)</pre><p>Vector databases solve this by converting data into mathematical representations that capture semantic meaning. Here’s the core concept:</p><pre>Text: &quot;Customer experiencing integration challenges&quot;<br> ↓<br>Vector: [0.23, -0.15, 0.67, 0.45, -0.82, … 768 dimensions]<br>Text: &quot;Account having technical difficulties&quot;<br> ↓ <br>Vector: [0.21, -0.18, 0.63, 0.48, -0.79, … 768 dimensions]</pre><p>These vectors are <strong>close in mathematical space</strong> even though the text is different. 
This is how semantic search works.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*9CSYp42JLySu2MMfK2HZKw.png" /><figcaption>Search example in Weaviate, <a href="https://weaviate.io/apple-and-weaviate/apple-apps-part-2">source</a>.</figcaption></figure><h3>Why you should use vector databases</h3><p>If you’re used to <strong>SQL</strong> and relational databases, you might wonder: <em>“Why add another database to my stack?”</em> Here are the concrete business problems vector databases solve:</p><p><strong>1. The “synonyms and variations” problem</strong></p><p>Your SQL database can’t find related concepts without exhaustive keyword lists:</p><pre>-- Traditional approach - brittle and incomplete<br>SELECT * <br>  FROM tickets <br> WHERE description LIKE &#39;%slow%&#39; <br>    OR description LIKE &#39;%performance%&#39;<br>    OR description LIKE &#39;%lag%&#39;<br>    OR description LIKE &#39;%timeout%&#39;<br>    OR description LIKE &#39;%unresponsive%&#39;<br>    -- ...and 50 more variations you forgot</pre><p>Vector databases understand that “slow”, “sluggish”, “laggy”, “unresponsive” and “performance issues” are semantically similar — without you listing every variation.</p><p><strong>2. The “exploratory search” problem</strong></p><p>Business users ask questions like:</p><ul><li>“Show me accounts that might churn.”</li><li>“Find customers discussing integration challenges.”</li><li>“Which tickets indicate product-market fit issues?”</li></ul><p>These are <strong>conceptual queries</strong> that don’t map cleanly to SQL predicates. You can’t write:</p><pre>WHERE conceptually_similar_to(&#39;churn risk&#39;)</pre><p>But with vector databases, you can search by concept, not just keywords.</p><p><strong>3. 
The “unstructured data” problem</strong></p><p>You have valuable insights trapped in:</p><ul><li>Support ticket descriptions</li><li>Customer call transcripts</li><li>Contract notes</li><li>Email communications</li><li>Product feedback</li></ul><p>Traditional databases store this text, but can’t make it <strong>searchable by meaning</strong>. Full-text search helps, but it’s still keyword-based. Vector databases make unstructured data as queryable as structured data.</p><p><strong>4. The “recommendation” problem</strong></p><p><strong><em>“Find accounts similar to this one”</em></strong> or <strong><em>“Show me related support tickets” </em></strong>requires understanding similarity across multiple dimensions. SQL can do basic matching:</p><pre>-- Find accounts with similar characteristics<br>SELECT * <br>  FROM accounts <br> WHERE segment = &#39;Enterprise&#39; <br>   AND arr BETWEEN 400000 AND 600000<br>   AND health_score BETWEEN 75 AND 85</pre><p>However, this overlooks accounts that share similar <strong>behaviour patterns</strong>, <strong>engagement styles</strong>, or <strong>business contexts</strong> — aspects that are not captured in structured columns.</p><p><strong>5. 
The “data quality” problem</strong></p><p>Finding duplicates and near-duplicates is hard in SQL:</p><pre>-- Which of these are the same company?<br>&#39;Acme Corp&#39;<br>&#39;ACME Corporation&#39;  <br>&#39;Acme Corp.&#39;<br>&#39;ACME CORP&#39;</pre><p>Vector similarity instantly identifies these as the same entity without writing complex string-matching rules.</p><h3>Real business impact</h3><p>For our project, vector search enabled:</p><ul><li><strong>Account managers</strong>: <em>“Show me accounts like this high-performing customer”</em> → instant recommendations</li><li><strong>Support teams</strong>: <em>“Find similar issues to this ticket”</em> → faster resolution through past solutions</li><li><strong>Executives</strong>: <em>“What are our at-risk accounts saying?”</em> → semantic analysis across all touchpoints</li><li><strong>Data quality teams</strong>: <em>“Find duplicate accounts”</em> → automatic deduplication</li></ul><p><strong>The bottom line</strong>: Vector databases aren’t replacing <strong>SQL</strong> — they’re augmenting it. 
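</p><p>The deduplication case is easy to demo with a toy vector model. The sketch below embeds names as bags of character trigrams, which is an assumption for illustration only (a production setup would use a real embedding model, and “Globex Ltd” is a made-up contrasting name):</p>

```python
import math
from collections import Counter

def trigram_vector(text):
    """Toy embedding: counts of character trigrams (stand-in for a learned model)."""
    t = text.lower().strip(".")
    return Counter(t[i:i + 3] for i in range(len(t) - 2))

def cosine(a, b):
    # Sparse cosine similarity over trigram counts; missing grams count as 0.
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = lambda v: math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm(a) * norm(b))

base = trigram_vector("Acme Corp")
for name in ["ACME Corporation", "Acme Corp.", "ACME CORP", "Globex Ltd"]:
    print(name, round(cosine(base, trigram_vector(name)), 2))
```

<p>The variants score near 1.0 against each other while the unrelated name scores near 0, with no hand-written string-matching rules.</p><p>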
Use SQL for structured queries <em>(“ARR &gt; $100K”)</em>, use vector search for semantic queries <em>(“accounts showing expansion signals”)</em>, and combine them for powerful hybrid searches.</p><h3>Key capabilities vector databases enable</h3><ol><li><strong>Semantic search</strong>: Find conceptually similar items, not just keyword matches</li><li><strong>Similarity ranking</strong>: Order results by how semantically close they are</li><li><strong>Multi-modal search</strong>: Search across text, images, and audio using the same infrastructure</li><li><strong>Recommendation engines</strong>: <em>“Find items like this one”</em> becomes a vector proximity search</li><li><strong>Deduplication</strong>: Identify near-duplicates without exact matching</li><li><strong>Classification and clustering</strong>: Group similar items automatically without predefined categories</li></ol><h3>Part 2: What is <a href="https://weaviate.io/">Weaviate</a>?</h3><blockquote><strong>Weaviate is an open-source vector database that stores both vectors and their original data, enabling semantic search with structured filtering.</strong></blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*lci5ZZvK8HgR9L653FsLhA.png" /><figcaption>Weaviate database (<a href="https://weaviate.io/blog/what-is-a-vector-database">source</a>)</figcaption></figure><h3>What makes <a href="https://weaviate.io/">Weaviate</a> different?</h3><p>Most vector databases only store vectors. 
Weaviate stores:</p><ul><li><strong>Vectors</strong> (the semantic embeddings)</li><li><strong>Properties</strong> (structured data like account_name, segment, health_score)</li><li><strong>Relationships</strong> (cross-references between objects)</li></ul><p>This enables <strong>hybrid queries</strong> — semantic search combined with structured filters:</p><pre># Find semantically similar accounts...<br>&quot;accounts at risk of churning&quot;</pre><pre># ...filtered by business rules<br>WHERE segment = &quot;Enterprise&quot; <br>AND health_score &lt; 50<br>AND renewal_date &lt; 90 days</pre><p><strong>1. Schema-based collections (think: tables with vectors)</strong></p><p>If you’re coming from SQL, think of a Weaviate <strong>collection</strong> as similar to a database <strong>table</strong>:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*y9wo07zLLBwIsRTQXvtpMg.png" /><figcaption>SQL vs Weaviate concepts</figcaption></figure><p><strong>Key difference</strong>: Each object in Weaviate has both traditional properties <em>(such as </em><strong><em>SQL</em></strong><em> columns)</em> and a vector embedding that captures its semantic meaning.</p><p>Here’s how the <strong>AccountIntelligence</strong> collection maps to <strong>SQL</strong> thinking:</p><pre>Collection: &quot;AccountIntelligence&quot; # Like a SQL table named &quot;accounts&quot;<br>├── Properties                    # Like SQL columns<br>│   ├── accountId: TEXT           # Primary key equivalent<br>│   ├── accountName: TEXT         # Regular text column<br>│   ├── segment: TEXT             # Categorical column<br>│   └── healthScore: NUMBER       # Numeric column<br>└── Vector: [768 dimensions]      # NEW: Semantic representation</pre><p>When you query Weaviate, you can:</p><ul><li>Filter by properties<em> (just like SQL WHERE clauses)</em></li><li>Order by properties <em>(just like SQL ORDER BY)</em></li><li>Search by vector similarity <em>(NEW: semantic search 
capability)</em></li></ul><p><strong>Example comparison</strong>:</p><p>Traditional SQL query:</p><pre>SELECT * <br>  FROM accounts <br> WHERE segment = &#39;Enterprise&#39; <br>   AND health_score &lt; 50<br> ORDER BY arr DESC<br> LIMIT 10;</pre><p>Weaviate equivalent using semantic search:</p><pre>collection.query.near_vector(<br>    near_vector=query_embedding,<br>    filters=Filter.by_property(&quot;segment&quot;).equal(&quot;Enterprise&quot;) &amp;<br>            Filter.by_property(&quot;healthScore&quot;).less_than(50),<br>    limit=10<br>)<br># Returns semantically similar accounts PLUS structured filtering</pre><p><strong>2. Multiple search modes</strong>:</p><ul><li><a href="https://docs.weaviate.io/weaviate/concepts/search/vector-search">Pure vector search</a> <em>(semantic only)</em></li><li><a href="https://weaviate.io/learn/knowledgecards/keyword-search">BM25 keyword search</a> <em>(traditional)</em></li><li><a href="https://docs.weaviate.io/weaviate/search/hybrid">Hybrid search</a> <em>(vector + keyword + filters)</em></li><li><a href="https://docs.weaviate.io/weaviate/concepts/filtering">Filtered semantic search</a> <em>(semantic with business rules)</em></li></ul><p><strong>3. Built-in vectorization or “bring-your-own”</strong>:</p><ul><li>Use Weaviate’s modules <em>(OpenAI, Cohere, etc.)</em></li><li>Or provide pre-computed embeddings <em>(we do this with Ollama)</em></li></ul><p><strong>4. GraphQL and REST APIs</strong>: Flexible query interface</p><p><strong>5. 
Horizontal scaling</strong>: Production-ready architecture</p><h3>Part 3: Weaviate walkthrough with code</h3><p>Let’s build a minimal semantic search system step by step.</p><h3>Step 1: Start Weaviate with Docker</h3><p>Copy this code into the <strong><em>docker-compose.yml </em></strong>file:</p><pre># docker-compose.yml<br>services:<br>  weaviate:<br>    image: semitechnologies/weaviate:1.23.0<br>    ports:<br>      - &quot;8080:8080&quot;<br>      - &quot;50051:50051&quot;  # gRPC port<br>    environment:<br>      QUERY_DEFAULTS_LIMIT: 25<br>      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: &#39;true&#39;<br>      PERSISTENCE_DATA_PATH: &#39;/var/lib/weaviate&#39;<br>      DEFAULT_VECTORIZER_MODULE: &#39;none&#39;  # We&#39;ll provide embeddings later on<br>      ENABLE_MODULES: &#39;&#39;<br>      CLUSTER_HOSTNAME: &#39;node1&#39;</pre><p>Start it up:</p><pre>docker-compose up -d<br># Wait 30 seconds for Weaviate to start</pre><pre>[+] Running 2/2<br> ✔ Network weaviate101_default       Created                                                                                                                                                                                                                                  0.0s <br> ✔ Container weaviate101-weaviate-1  Started      </pre><p>and check if Weaviate is working:</p><pre>curl http://localhost:8080/v1</pre><h3>Step 2: Connect and create a schema</h3><p>Install needed libraries from the <strong>requirements.txt</strong> file:</p><pre>anthropic<br>requests<br>weaviate-client==4.19.0</pre><pre>pip install -r requirements.txt</pre><p>Create <strong>weaviate_start.py </strong>file and paste the code:</p><pre>import weaviate<br>from weaviate.classes.config import Configure, DataType, Property<br><br>COLLECTION_NAME = &quot;AccountIntelligence&quot;<br>try:<br>    # Connect to local Weaviate instance<br>    client = weaviate.connect_to_local(host=&quot;localhost&quot;, port=8080, grpc_port=50051)<br>    # Check 
connection<br>    print(client.is_ready())  # Should print True<br>    # Create a collection for account data<br><br>    if client.collections.exists(COLLECTION_NAME):<br>        client.collections.delete(COLLECTION_NAME)<br><br>    accounts = client.collections.create(<br>        name=COLLECTION_NAME,<br>        description=&quot;Customer account data with semantic search&quot;,<br>        # Vector configuration<br>        vectorizer_config=Configure.Vectorizer.none(),  # We provide embeddings<br>        vector_index_config=Configure.VectorIndex.hnsw(),<br>        # Define properties (structured data)<br>        properties=[<br>            Property(<br>                name=&quot;accountId&quot;,<br>                data_type=DataType.TEXT,<br>                skip_vectorization=True,  # Don&#39;t include in vector<br>                description=&quot;Unique account identifier&quot;,<br>            ),<br>            Property(<br>                name=&quot;accountName&quot;,<br>                data_type=DataType.TEXT,<br>                skip_vectorization=True,<br>                description=&quot;Account name&quot;,<br>            ),<br>            Property(<br>                name=&quot;content&quot;,<br>                data_type=DataType.TEXT,<br>                skip_vectorization=False,  # This WILL be vectorized<br>                description=&quot;Rich account summary for semantic search&quot;,<br>            ),<br>            Property(<br>                name=&quot;segment&quot;,<br>                data_type=DataType.TEXT,<br>                skip_vectorization=True,<br>                description=&quot;Account segment: Enterprise, Mid-Market, SMB&quot;,<br>            ),<br>            Property(<br>                name=&quot;healthScore&quot;,<br>                data_type=DataType.NUMBER,<br>                skip_vectorization=True,<br>                description=&quot;Health score 0-100&quot;,<br>            ),<br>            Property(<br>                
name=&quot;arr&quot;,<br>                data_type=DataType.NUMBER,<br>                skip_vectorization=True,<br>                description=&quot;Annual Recurring Revenue&quot;,<br>            ),<br>        ],<br>    )<br>    print(f&quot;Collection &#39;{accounts.name}&#39; created successfully!&quot;)<br>finally:<br>    client.close()</pre><p>When you run the file, you will get a message:</p><pre>True<br>Collection &#39;AccountIntelligence&#39; created successfully!</pre><p><strong>Key concepts explained:</strong></p><ul><li><strong>skip_vectorization=True</strong>: These fields are metadata only. Used for filtering, not semantic search.</li><li><strong>skip_vectorization=False</strong>: The content field gets vectorised for semantic search.</li><li><strong>HNSW index</strong>: Hierarchical Navigable Small World graph — fast approximate nearest neighbour search</li><li><strong>Cosine distance</strong>: Measures the angle between vectors, perfect for text embeddings</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/896/1*LaxNZ7Z-cFp0GAGVnbPJNw.png" /><figcaption>Dummy representation of the AccountIntelligence collection in Weaviate</figcaption></figure><h3>Step 3: Add data with vectors</h3><p>Here’s where the magic happens — inserting data with embeddings. We pre-computed embeddings for easier testing; later on, Weaviate can do it for us:</p><p>Create a file <strong>weaviate_insert_data.py</strong>:</p><pre># Sample account data<br>sample_accounts = [<br>    {<br>        &quot;accountId&quot;: &quot;ACC001&quot;,<br>        &quot;accountName&quot;: &quot;Acme Corp&quot;,<br>        &quot;content&quot;: &quot;Enterprise account showing strong product adoption. Active in community, &quot;<br>                   &quot;high license utilization at 85%. Recent expansion discussion with CSM. 
&quot;<br>                   &quot;Strong technical engagement across engineering teams.&quot;,<br>        &quot;segment&quot;: &quot;Enterprise&quot;,<br>        &quot;healthScore&quot;: 82,<br>        &quot;arr&quot;: 450000,<br>        # Vector would come from embedding model - simplified here<br>        &quot;vector&quot;: [0.23, -0.15, 0.67, 0.45, -0.82, 0.12, 0.56, -0.33]  # 768 dims in reality<br>    },<br>    {<br>        &quot;accountId&quot;: &quot;ACC002&quot;,<br>        &quot;accountName&quot;: &quot;TechStart Inc&quot;,<br>        &quot;content&quot;: &quot;Mid-market account with recent support escalations. Multiple tickets on API &quot;<br>                   &quot;performance issues. Low license utilization at 35%. CSM noted budget concerns &quot;<br>                   &quot;in last quarterly review. Contract renewal in 60 days.&quot;,<br>        &quot;segment&quot;: &quot;Mid-Market&quot;,<br>        &quot;healthScore&quot;: 42,<br>        &quot;arr&quot;: 125000,<br>        &quot;vector&quot;: [-0.15, 0.22, -0.45, 0.67, 0.33, -0.78, 0.11, 0.55]<br>    },<br>    {<br>        &quot;accountId&quot;: &quot;ACC003&quot;,<br>        &quot;accountName&quot;: &quot;Global Solutions Ltd&quot;,<br>        &quot;content&quot;: &quot;Enterprise account with critical escalation. Integration challenges blocking &quot;<br>                   &quot;production deployment. Executive stakeholder expressing frustration. 
&quot;<br>                   &quot;Competitors mentioned in recent calls.&quot;,<br>        &quot;segment&quot;: &quot;Enterprise&quot;,<br>        &quot;healthScore&quot;: 28,<br>        &quot;arr&quot;: 780000,<br>        &quot;vector&quot;: [-0.33, 0.45, -0.67, 0.22, 0.88, -0.12, -0.55, 0.15]<br>    }<br>]</pre><p>Append the insert logic to the same file and try it:</p><pre>import weaviate<br><br># Insert using batch API (much faster than individual inserts)<br>client = weaviate.connect_to_local(host=&quot;localhost&quot;, port=8080, grpc_port=50051)<br><br>collection = client.collections.get(&quot;AccountIntelligence&quot;)<br>with collection.batch.dynamic() as batch:<br>    for account in sample_accounts:<br>        batch.add_object(<br>            properties={<br>                &quot;accountId&quot;: account[&quot;accountId&quot;],<br>                &quot;accountName&quot;: account[&quot;accountName&quot;],<br>                &quot;content&quot;: account[&quot;content&quot;],<br>                &quot;segment&quot;: account[&quot;segment&quot;],<br>                &quot;healthScore&quot;: account[&quot;healthScore&quot;],<br>                &quot;arr&quot;: account[&quot;arr&quot;],<br>            },<br>            vector=account[&quot;vector&quot;],<br>        )<br># Check what was inserted<br>result = collection.aggregate.over_all(total_count=True)<br>print(f&quot;Total accounts in Weaviate: {result.total_count}&quot;)<br>client.close()</pre><p>When you run the file, you will get this result:</p><pre>Total accounts in Weaviate: 3</pre><p><strong>What just happened?</strong></p><p>Each account now exists in Weaviate as:</p><ol><li>A <strong>768-dimensional vector</strong> <em>(in reality, not the simplified 8-dim example) </em>capturing semantic meaning</li><li><strong>Structured properties</strong> available for filtering and display</li><li><strong>Indexed</strong> for fast retrieval</li></ol><h3>Part 4: Search options in Weaviate</h3><p>Now let’s explore the different ways to query this data.</p><h3>Option 1: Pure semantic 
search (vector similarity)</h3><p>Create a file <strong>weaviate_search.py</strong> and paste the code:</p><pre>import weaviate<br>from weaviate.classes.query import MetadataQuery<br><br>client = weaviate.connect_to_local(host=&quot;localhost&quot;, port=8080, grpc_port=50051)<br><br><br>def semantic_search(query_text: str, limit: int = 5):<br>    &quot;&quot;&quot;<br>    Find accounts semantically similar to the query<br>    &quot;&quot;&quot;<br>    collection = client.collections.get(&quot;AccountIntelligence&quot;)<br><br>    # In reality, generate embedding for query_text using same model as data<br>    # For demo, we&#39;ll use a simplified query vector<br>    query_vector = [0.1, 0.3, -0.5, 0.7, 0.2, -0.6, 0.4, 0.1]<br><br>    response = collection.query.near_vector(<br>        near_vector=query_vector,<br>        limit=limit,<br>        return_metadata=MetadataQuery(distance=True, certainty=True),<br>    )<br><br>    print(f&quot;\n🔍 Semantic search: &#39;{query_text}&#39;\n&quot;)<br>    for obj in response.objects:<br>        print(f&quot;Account: {obj.properties[&#39;accountName&#39;]}&quot;)<br>        print(f&quot;Segment: {obj.properties[&#39;segment&#39;]}&quot;)<br>        print(f&quot;Health: {obj.properties[&#39;healthScore&#39;]}&quot;)<br>        print(f&quot;Certainty: {obj.metadata.certainty:.3f}&quot;)<br>        print(f&quot;Content: {obj.properties[&#39;content&#39;][:100]}...&quot;)<br>        print(&quot;-&quot; * 80)<br><br>    return response.objects<br><br><br># Try it<br>results = semantic_search(&quot;accounts at risk of churning&quot;)</pre><p>Result:</p><pre>🔍 Semantic search: &#39;accounts at risk of churning&#39;<br>Account: Global Solutions Ltd<br>Segment: Enterprise<br>Health: 28<br>Certainty: 0.892<br>Content: Enterprise account with critical escalation. 
Integration challenges blocking production...<br>--------------------------------------------------------------------------------<br>Account: TechStart Inc<br>Segment: Mid-Market<br>Health: 42<br>Certainty: 0.765<br>Content: Mid-market account with recent support escalations. Multiple tickets on API performance...<br>--------------------------------------------------------------------------------</pre><p><strong>Notice</strong>: The search found accounts discussing “escalations”, “issues”, and “concerns” even though the query was “risk of churning” — this is semantic understanding!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*nTpTadwxMw542YRogo8RUQ.png" /><figcaption>Graphical representation of vectors (<a href="https://weaviate.io/blog/what-is-a-vector-database">source</a>)</figcaption></figure><h3>Option 2: Keyword search (BM25)</h3><p>Sometimes you want exact keyword matching, not semantic similarity. Add this code at the end of the file <strong>weaviate_search.py</strong>:</p><pre>def keyword_search(keyword: str, limit: int = 5):<br>    &quot;&quot;&quot;<br>    Traditional BM25 keyword search<br>    &quot;&quot;&quot;<br>    collection = client.collections.get(&quot;AccountIntelligence&quot;)<br>    <br>    response = collection.query.bm25(<br>        query=keyword,<br>        limit=limit,<br>        return_metadata=MetadataQuery(score=True)<br>    )<br>    <br>    print(f&quot;\n🔍 Keyword search: &#39;{keyword}&#39;\n&quot;)<br>    for obj in response.objects:<br>        print(f&quot;Account: {obj.properties[&#39;accountName&#39;]}&quot;)<br>        print(f&quot;BM25 Score: {obj.metadata.score:.3f}&quot;)<br>        print(f&quot;Content: {obj.properties[&#39;content&#39;][:100]}...&quot;)<br>        print(&quot;-&quot; * 80)<br>    <br>    return response.objects<br><br>results = keyword_search(&quot;escalation&quot;)</pre><p>And run the file.</p><p><strong>When to use keyword vs semantic:</strong></p><ul><li><strong>Keyword</strong>: 
Specific product names, account IDs, exact terminology</li><li><strong>Semantic</strong>: Conceptual queries, exploratory search, synonym handling</li></ul><h3>Option 3: Hybrid search (best of both worlds)</h3><p>Hybrid search combines vector similarity with keyword matching. Add this to the end of the <strong>weaviate_search.py</strong> file:</p><pre>def hybrid_search(query_text: str, alpha: float = 0.5, limit: int = 5):<br>    &quot;&quot;&quot;<br>    Hybrid search balancing semantic and keyword<br>    <br>    alpha=0.0 → 100% keyword (BM25)<br>    alpha=0.5 → 50% semantic, 50% keyword<br>    alpha=1.0 → 100% semantic<br>    &quot;&quot;&quot;<br>    collection = client.collections.get(&quot;AccountIntelligence&quot;)<br>    query_vector = [0.1, 0.3, -0.5, 0.7, 0.2, -0.6, 0.4, 0.1]<br>    <br>    response = collection.query.hybrid(<br>        query=query_text,<br>        vector=query_vector,<br>        alpha=alpha,<br>        limit=limit,<br>        return_metadata=MetadataQuery(score=True)<br>    )<br>    <br>    print(f&quot;\n🔍 Hybrid search (α={alpha}): &#39;{query_text}&#39;\n&quot;)<br>    for obj in response.objects:<br>        print(f&quot;Account: {obj.properties[&#39;accountName&#39;]}&quot;)<br>        print(f&quot;Hybrid Score: {obj.metadata.score:.3f}&quot;)<br>        print(f&quot;Content: {obj.properties[&#39;content&#39;][:100]}...&quot;)<br>        print(&quot;-&quot; * 80)<br>    <br>    return response.objects<br><br>results = hybrid_search(&quot;critical escalation with enterprise accounts&quot;, alpha=0.5)</pre><p>And run the file.</p><p><strong>Why hybrid?</strong> It finds accounts that are both:</p><ul><li>Semantically similar to the concept</li><li>Literal matches for the keywords</li></ul><p>This usually produces the best results for business queries.</p><h3>Option 4: Filtered semantic search</h3><p>The real power comes from combining semantic search with structured filters. Add this routine to the end of the <strong>weaviate_search.py</strong> file:</p><pre>from 
weaviate.classes.query import Filter<br><br>def filtered_search(query_text: str, segment: str = None,<br>                    max_health: int = None, min_arr: float = None, limit: int = 5):<br>    collection = client.collections.get(&quot;AccountIntelligence&quot;)<br>    query_vector = [0.1, 0.3, -0.5, 0.7, 0.2, -0.6, 0.4, 0.1]<br><br>    # Build filter<br>    where_filter = None<br>    if segment:<br>        where_filter = Filter.by_property(&quot;segment&quot;).equal(segment)<br>    if max_health is not None:<br>        health_filter = Filter.by_property(&quot;healthScore&quot;).less_than(max_health)<br>        where_filter = where_filter &amp; health_filter if where_filter else health_filter<br>    if min_arr is not None:<br>        arr_filter = Filter.by_property(&quot;arr&quot;).greater_than(min_arr)<br>        where_filter = where_filter &amp; arr_filter if where_filter else arr_filter<br><br>    # near_vector takes the filter via the &#39;filters&#39; argument<br>    response = collection.query.near_vector(<br>        near_vector=query_vector,<br>        filters=where_filter,<br>        limit=limit,<br>        return_metadata=MetadataQuery(distance=True)<br>    )<br><br>    print(f&quot;\n🔍 Filtered search: &#39;{query_text}&#39;&quot;)<br>    print(f&quot;   Filters: segment={segment}, health&lt;{max_health}, ARR&gt;${min_arr}\n&quot;)<br><br>    for obj in response.objects:<br>        print(f&quot;Account: {obj.properties[&#39;accountName&#39;]}&quot;)<br>        print(f&quot;Segment: {obj.properties[&#39;segment&#39;]}&quot;)<br>        print(f&quot;Health: {obj.properties[&#39;healthScore&#39;]}&quot;)<br>        print(f&quot;ARR: ${obj.properties[&#39;arr&#39;]:,}&quot;)<br>        print(f&quot;Distance: {obj.metadata.distance:.3f}&quot;)<br>        print(&quot;-&quot; * 80)<br><br>    return response.objects<br><br># Try it - find at-risk enterprise accounts with high revenue<br>results = filtered_search(<br>   
 &quot;accounts showing churn risk&quot;,<br>    segment=&quot;Enterprise&quot;,<br>    max_health=50,<br>    min_arr=200000<br>)</pre><p>Result:</p><pre>🔍 Filtered search: &#39;accounts showing churn risk&#39;<br>   Filters: segment=Enterprise, health&lt;50, ARR&gt;$200000<br>Account: Global Solutions Ltd<br>Segment: Enterprise<br>Health: 28<br>ARR: $780,000<br>Distance: 0.156<br>--------------------------------------------------------------------------------</pre><h3>Part 5: Ollama and local vectorization</h3><p>We use Ollama for local embedding generation. This keeps costs low and data private.</p><h3>What is vectorization?</h3><p>Before we dive into Ollama, let’s clarify what <strong>vectorization</strong> actually means.</p><blockquote><strong>Vectorization</strong> is the process of converting text (or other data) into numerical vectors that capture semantic meaning.</blockquote><p>Think of it as translating human language into math that computers can compare:</p><pre>Text (Human Language):<br>&quot;Customer experiencing integration challenges&quot;<br> ↓ VECTORIZATION ↓<br>Vector (Machine Language):<br>[0.023, -0.156, 0.671, 0.445, -0.823, … 768 numbers total]</pre><p><strong>Why numbers?</strong> Because computers can efficiently:</p><ul><li>Compare vectors using mathematical distance (cosine similarity, Euclidean distance)</li><li>Search through millions of vectors in milliseconds</li><li>Cluster similar concepts automatically</li><li>Rank results by relevance</li></ul><p>The magic: Texts with similar meanings produce similar vectors, even if they use completely different words:</p><pre>&quot;The product is too slow&quot; → [0.21, -0.15, 0.63, …]<br>&quot;Performance issues&quot; → [0.19, -0.18, 0.65, …]<br>&quot;System is laggy&quot; → [0.23, -0.14, 0.61, …]<br> ↑ These vectors are close in 768-dimensional space</pre><p><strong>SQL analogy</strong>: If SQL indexes make lookups fast, vectorization makes <strong>semantic similarity</strong> fast. 
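That closeness claim is easy to check yourself. A minimal sketch, using invented 4-dimensional toy vectors as stand-ins for real 768-dim embeddings:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """dot(a, b) / (|a| * |b|): values near 1.0 mean similar direction (meaning)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors: two "performance complaint" texts and one unrelated "billing" text
too_slow = [0.21, -0.15, 0.63, 0.40]
perf_issues = [0.19, -0.18, 0.65, 0.38]
billing = [-0.70, 0.55, -0.10, 0.05]

print(cosine_similarity(too_slow, perf_issues))  # high, close to 1.0
print(cosine_similarity(too_slow, billing))      # low, here even negative
```

The same function works unchanged on the 768-dimensional vectors Ollama produces later in the article.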
But instead of indexing exact values, you’re indexing meaning.</p><h3>Why <a href="https://ollama.com/">Ollama</a>?</h3><p><strong>Ollama</strong> provides local LLM inference — no API calls, no costs, no data leaving your infrastructure.</p><p>For embeddings, we use <strong>nomic-embed-text</strong>:</p><ul><li>768 dimensions</li><li>MTEB score: 62.39 (competitive with OpenAI)</li><li>Runs locally on CPU or GPU</li><li>Free and open source</li></ul><h3>Setting up Ollama with Weaviate</h3><p>Stop the containers and add the Ollama service to your docker-compose.yml:</p><pre>docker-compose down --remove-orphans</pre><pre># docker-compose.yml - Add Ollama service<br>services:<br>  weaviate:<br>    image: semitechnologies/weaviate:1.34.0<br>    ports:<br>      - &quot;8080:8080&quot;<br>      - &quot;50051:50051&quot;<br>    environment:<br>      DEFAULT_VECTORIZER_MODULE: &#39;none&#39;<br>      CLUSTER_HOSTNAME: &#39;node1&#39;<br><br>  ollama:<br>    image: ollama/ollama:0.12.11<br>    ports:<br>      - &quot;11434:11434&quot;<br>    volumes:<br>      - ollama_data:/root/.ollama<br>    restart: unless-stopped<br>    healthcheck:<br>      test: [&quot;CMD&quot;, &quot;ollama&quot;, &quot;list&quot;]<br>      interval: 30s<br>      timeout: 10s<br>      retries: 3<br>    entrypoint: [&quot;/bin/sh&quot;, &quot;-c&quot;]<br>    command:<br>      - |<br>        ollama serve &amp;<br>        sleep 5<br>        ollama pull nomic-embed-text<br>        wait<br><br>volumes:<br>  ollama_data:</pre><p>Start Docker (again):</p><pre># Start services<br>docker-compose up -d</pre><p>Check the environment:</p><pre># Test if it works<br>curl http://localhost:11434/api/embeddings -d &#39;{<br>  &quot;model&quot;: &quot;nomic-embed-text&quot;,<br>  &quot;prompt&quot;: &quot;Test embedding generation&quot;<br>}&#39;</pre><h3>Generate embeddings with Ollama</h3><p>Add this code to a new file, <strong>weaviate_embeddings.py</strong>:</p><pre>import requests<br><br><br>def generate_embedding(text: str) 
-&gt; list[float]:<br>    &quot;&quot;&quot;<br>    Generate 768-dimensional embedding using Ollama<br>    &quot;&quot;&quot;<br>    response = requests.post(<br>        &quot;http://localhost:11434/api/embeddings&quot;,<br>        json={&quot;model&quot;: &quot;nomic-embed-text&quot;, &quot;prompt&quot;: text},<br>    )<br><br>    if response.status_code == 200:<br>        return response.json()[&quot;embedding&quot;]<br>    else:<br>        raise Exception(f&quot;Ollama error: {response.text}&quot;)<br><br><br># Test it<br>text = &quot;Account showing signs of technical challenges with integration&quot;<br>embedding = generate_embedding(text)<br>print(f&quot;Text: {text}&quot;)<br>print(f&quot;Embedding dimensions: {len(embedding)}&quot;)<br>print(f&quot;First 10 values: {embedding[:10]}&quot;)</pre><p>After running the file, you will see output like this:</p><pre>Text: Account showing signs of technical challenges with integration<br>Embedding dimensions: 768<br>First 10 values: [0.023, -0.156, 0.671, 0.445, -0.823, 0.121, 0.567, -0.334, 0.789, -0.234]</pre><h3>Complete pipeline: Data → embeddings → Weaviate</h3><p>Extend the file <strong>weaviate_embeddings.py</strong> with this code and run it:</p><pre>import weaviate<br><br>def process_and_insert_account(account_data: dict):<br>    &quot;&quot;&quot;<br>    Complete pipeline: take account data, generate embedding, insert to Weaviate<br>    &quot;&quot;&quot;<br>    # 1. 
Create rich content for vectorization<br>    content = f&quot;&quot;&quot;<br>    Account: {account_data[&#39;accountName&#39;]}<br>    Segment: {account_data[&#39;segment&#39;]}<br>    Health Score: {account_data[&#39;healthScore&#39;]}/100<br>    ARR: ${account_data[&#39;arr&#39;]:,}<br><br>    Recent Activity:<br>    {account_data[&#39;activity_summary&#39;]}<br><br>    Support Status:<br>    {account_data[&#39;support_summary&#39;]}<br><br>    Engagement Level:<br>    {account_data[&#39;engagement_summary&#39;]}<br>    &quot;&quot;&quot;<br><br>    # 2. Generate embedding with Ollama<br>    embedding = generate_embedding(content)<br><br>    # 3. Insert to Weaviate<br>    collection = client.collections.get(&quot;AccountIntelligence&quot;)<br><br>    result = collection.data.insert(<br>        properties={<br>            &quot;accountId&quot;: account_data[&quot;accountId&quot;],<br>            &quot;accountName&quot;: account_data[&quot;accountName&quot;],<br>            &quot;content&quot;: content,<br>            &quot;segment&quot;: account_data[&quot;segment&quot;],<br>            &quot;healthScore&quot;: account_data[&quot;healthScore&quot;],<br>            &quot;arr&quot;: account_data[&quot;arr&quot;]<br>        },<br>        vector=embedding<br>    )<br><br>    print(f&quot;✅ Inserted {account_data[&#39;accountName&#39;]} with UUID: {result}&quot;)<br>    return result<br><br>try:<br>    client = weaviate.connect_to_local(host=&quot;localhost&quot;, port=8080, grpc_port=50051)<br>    # Example usage<br>    account = {<br>        &quot;accountId&quot;: &quot;ACC004&quot;,<br>        &quot;accountName&quot;: &quot;DataCorp Systems&quot;,<br>        &quot;segment&quot;: &quot;Enterprise&quot;,<br>        &quot;healthScore&quot;: 65,<br>        &quot;arr&quot;: 890000,<br>        &quot;activity_summary&quot;: &quot;Regular product usage, 3 admin logins per week&quot;,<br>        &quot;support_summary&quot;: &quot;2 open tickets, both low priority&quot;,<br> 
       &quot;engagement_summary&quot;: &quot;Attended last webinar, CSM meeting scheduled&quot;<br>    }<br>    process_and_insert_account(account)<br>finally:<br>    client.close()</pre><p><strong>Key benefits of this approach:</strong></p><ol><li><strong>No API costs</strong>: Ollama runs locally</li><li><strong>Data privacy</strong>: Nothing leaves your infrastructure</li><li><strong>Consistency</strong>: Same model for all embeddings</li><li><strong>Performance</strong>: ~100–200ms per embedding on CPU</li><li><strong>Quality</strong>: nomic-embed-text performs comparably to paid solutions</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/960/1*CbMoie76KGAKXK4GlxFqsw.png" /><figcaption>Pipeline flow</figcaption></figure><h3>Part 6: Search + RAG with Claude</h3><p>The final piece: combining Weaviate search with Claude for intelligent analysis.</p><h3>The RAG pattern (Retrieval-Augmented Generation)</h3><p><strong>RAG prevents hallucinations</strong> by grounding AI responses in retrieved facts:</p><ol><li><strong>Retrieve</strong>: Search Weaviate for relevant accounts</li><li><strong>Augment</strong>: Package search results as context</li><li><strong>Generate</strong>: Claude analyzes the real data, not inventing information</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/998/1*LHBG5cpMTvJ_wof0UgCfbg.png" /><figcaption>Hybrid search + RAG (Claude)</figcaption></figure><h3>Implementation: Search + Claude analysis</h3><p>You should have your ANTHROPIC_API_KEY generated:</p><pre>export ANTHROPIC_API_KEY=&quot;sk-ant-api...&quot;</pre><p>and then you can do the RAG with Weaviate. 
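Before touching the API, it can save a confusing stack trace to fail fast when the key is missing. A small defensive sketch (the helper name require_api_key is my own, not part of the original pipeline; the anthropic client reads ANTHROPIC_API_KEY from the environment by default):

```python
import os

def require_api_key(name: str = "ANTHROPIC_API_KEY") -> str:
    # Fail early with a clear message instead of a deep auth error later
    key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running the RAG code")
    return key
```

Call `require_api_key()` once at startup; if it returns, the rest of the pipeline can assume the key is present.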
Alter the file <strong>weaviate_embeddings.py </strong>and run it:</p><pre>import anthropic<br>import json<br>from weaviate.classes.query import MetadataQuery<br><br>def search_and_analyze(user_query: str, limit: int = 10) -&gt; dict:<br>    &quot;&quot;&quot;<br>    Complete RAG pipeline: Search Weaviate → Analyze with Claude<br>    &quot;&quot;&quot;<br>    # Step 1: Semantic search in Weaviate<br>    print(f&quot;🔍 Searching Weaviate for: &#39;{user_query}&#39;&quot;)<br>    <br>    query_embedding = generate_embedding(user_query)<br>    collection = client.collections.get(&quot;AccountIntelligence&quot;)<br>    <br>    search_results = collection.query.near_vector(<br>        near_vector=query_embedding,<br>        limit=limit,<br>        return_metadata=MetadataQuery(distance=True, certainty=True)<br>    )<br>    <br>    print(f&quot;📊 Found {len(search_results.objects)} relevant accounts\n&quot;)<br>    <br>    # Step 2: Format results as context for Claude<br>    context_accounts = []<br>    for obj in search_results.objects:<br>        context_accounts.append({<br>            &quot;account_id&quot;: obj.properties[&quot;accountId&quot;],<br>            &quot;account_name&quot;: obj.properties[&quot;accountName&quot;],<br>            &quot;segment&quot;: obj.properties[&quot;segment&quot;],<br>            &quot;health_score&quot;: obj.properties[&quot;healthScore&quot;],<br>            &quot;arr&quot;: obj.properties[&quot;arr&quot;],<br>            &quot;content&quot;: obj.properties[&quot;content&quot;],<br>            &quot;relevance_score&quot;: round(obj.metadata.certainty, 3)<br>        })<br>    <br>    # Step 3: Build context window for Claude<br>    context_package = {<br>        &quot;query&quot;: user_query,<br>        &quot;total_accounts_found&quot;: len(context_accounts),<br>        &quot;accounts&quot;: context_accounts<br>    }<br>    <br>    context_json = json.dumps(context_package, indent=2)<br>    <br>    # Step 4: Analyze with Claude<br> 
   print(&quot;🤖 Analyzing with Claude...\n&quot;)<br>    <br>    client_anthropic = anthropic.Anthropic()<br>    <br>    message = client_anthropic.messages.create(<br>        model=&quot;claude-sonnet-4-20250514&quot;,<br>        max_tokens=2000,<br>        temperature=0,  # Deterministic responses<br>        messages=[{<br>            &quot;role&quot;: &quot;user&quot;,<br>            &quot;content&quot;: f&quot;&quot;&quot;You are analyzing customer account data. You must ONLY use information from the provided search results. Do not make up or assume any information.<br>SEARCH RESULTS:<br>{context_json}<br>USER QUESTION: {user_query}<br>Analyze the accounts found and provide:<br>1. Key patterns or themes across these accounts<br>2. Specific risk factors or opportunities identified<br>3. Actionable recommendations with account examples<br>4. Priority ranking if applicable<br>Cite specific accounts by name when making claims. If the data is insufficient to answer the question, state that explicitly.&quot;&quot;&quot;<br>        }]<br>    )<br>    <br>    analysis = message.content[0].text<br>    <br>    # Step 5: Return complete response<br>    return {<br>        &quot;query&quot;: user_query,<br>        &quot;search_results&quot;: context_accounts,<br>        &quot;result_count&quot;: len(context_accounts),<br>        &quot;claude_analysis&quot;: analysis<br>    }<br><br># Try it!<br>result = search_and_analyze(<br>    &quot;Which high-value accounts are showing signs of risk and need immediate attention?&quot;<br>)<br>print(&quot;=&quot; * 80)<br>print(&quot;CLAUDE&#39;S ANALYSIS:&quot;)<br>print(&quot;=&quot; * 80)<br>print(result[&quot;claude_analysis&quot;])<br>print(&quot;\n&quot; + &quot;=&quot; * 80)<br>print(f&quot;\nBased on {result[&#39;result_count&#39;]} accounts retrieved from Weaviate&quot;)</pre><p>Result:</p><pre>🔍 Searching Weaviate for: &#39;Which high-value accounts are showing signs of risk and need immediate attention?&#39;<br>📊 Found 3 
relevant accounts<br>🤖 Analyzing with Claude...<br>================================================================================<br>CLAUDE&#39;S ANALYSIS:<br>================================================================================<br>PRIORITY AT-RISK ACCOUNTS ANALYSIS<br>Based on the search results, I&#39;ve identified 2 high-value accounts requiring immediate attention:<br>1. **CRITICAL: Global Solutions Ltd** (ARR: $780,000)<br>   Risk Level: SEVERE (Health Score: 28/100)<br>   <br>   Key Issues:<br>   - Critical escalation currently active<br>   - Integration challenges blocking production deployment<br>   - Executive stakeholder expressing frustration<br>   - Competitors mentioned in recent calls<br>   - Relevance Score: 0.892 (strong semantic match to query)<br>   <br>   Immediate Actions Needed:<br>   - Executive engagement within 24-48 hours<br>   - Technical escalation team assignment<br>   - Competitor analysis and value proposition reinforcement<br>   - Timeline: Address within this week<br>2. **HIGH PRIORITY: TechStart Inc** (ARR: $125,000)<br>   Risk Level: HIGH (Health Score: 42/100)<br>   <br>   Key Issues:<br>   - Multiple support tickets on API performance<br>   - Low license utilization (35%)<br>   - Budget concerns noted by CSM<br>   - Contract renewal in 60 days<br>   - Relevance Score: 0.765<br>   <br>   Immediate Actions Needed:<br>   - Performance issue resolution<br>   - Value demonstration to justify renewal<br>   - Budget discussion with stakeholders<br>   - Timeline: Next 30 days critical<br>COMMON PATTERNS:<br>- Both accounts show technical challenges as primary risk factor<br>- Support escalations correlate with low health scores<br>- Executive stakeholder sentiment is key indicator<br>RECOMMENDATION PRIORITY:<br>1. Global Solutions Ltd - Highest ARR, lowest health, critical escalation<br>2. 
TechStart Inc - Renewal timeline urgency, budget sensitivity<br>Note: Acme Corp (Health: 82, ARR: $450K) was also in results but shows positive indicators and doesn&#39;t require immediate intervention.<br>================================================================================<br>Based on 3 accounts retrieved from Weaviate</pre><p><strong>Why this works:</strong></p><ol><li><strong>No hallucinations</strong>: Claude only analyzes the 3 accounts Weaviate returned</li><li><strong>Cited examples</strong>: Every claim references specific accounts</li><li><strong>Grounded in facts</strong>: Health scores, ARR, and issues come from real data</li><li><strong>Actionable</strong>: Recommendations tied to specific accounts and timeframes</li></ol>
<h3>Advanced RAG: Adding filters to search</h3><p>Alter the file <strong>weaviate_embeddings.py</strong> and run it:</p><pre>from weaviate.classes.query import Filter<br><br>def filtered_search_and_analyze(<br>        user_query: str,<br>        segment: str = None,<br>        max_health: int = None,<br>        min_arr: float = None,<br>        limit: int = 10<br>) -&gt; dict:<br>    &quot;&quot;&quot;<br>    RAG with structured filters for business rules<br>    &quot;&quot;&quot;<br>    # Build Weaviate filters<br>    filters = []<br>    if segment:<br>        filters.append(Filter.by_property(&quot;segment&quot;).equal(segment))<br>    if max_health is not None:<br>        filters.append(Filter.by_property(&quot;healthScore&quot;).less_than(max_health))<br>    if min_arr is not None:<br>        filters.append(Filter.by_property(&quot;arr&quot;).greater_than(min_arr))<br><br>    where_filter = None<br>    if filters:<br>        where_filter = filters[0]<br>        for f in filters[1:]:<br>            where_filter = where_filter &amp; f<br><br>    # Search with filters<br>    query_embedding = generate_embedding(user_query)<br>    collection = client.collections.get(&quot;AccountIntelligence&quot;)<br><br>    search_results = collection.query.near_vector(<br>        near_vector=query_embedding,<br>        filters=where_filter,<br>        limit=limit,<br>        return_metadata=MetadataQuery(distance=True)<br>    )<br><br>    # Package results for Claude<br>    context_accounts = []<br>    for obj in search_results.objects:<br>        context_accounts.append({<br>            &quot;account_name&quot;: obj.properties[&quot;accountName&quot;],<br>            &quot;segment&quot;: obj.properties[&quot;segment&quot;],<br>            &quot;health_score&quot;: obj.properties[&quot;healthScore&quot;],<br>            &quot;arr&quot;: obj.properties[&quot;arr&quot;],<br>            &quot;content&quot;: obj.properties[&quot;content&quot;]<br>        })<br><br>    # Inform Claude about filters applied<br>    filter_description = []<br>    if segment:<br>        filter_description.append(f&quot;segment={segment}&quot;)<br>    if max_health:<br>        filter_description.append(f&quot;health&lt;{max_health}&quot;)<br>    if min_arr:<br>        filter_description.append(f&quot;ARR&gt;${min_arr:,}&quot;)<br><br>    filters_text = &quot; AND &quot;.join(filter_description) if filter_description else &quot;None&quot;<br><br>    # Analyze with Claude<br>    client_anthropic = anthropic.Anthropic()<br><br>    message = client_anthropic.messages.create(<br>        model=&quot;claude-sonnet-4-20250514&quot;,<br>        max_tokens=2000,<br>        temperature=0,<br>        messages=[{<br>            &quot;role&quot;: &quot;user&quot;,<br>            &quot;content&quot;: f&quot;&quot;&quot;Analyze these filtered account search results.<br><br>FILTERS APPLIED: {filters_text}<br>QUERY: {user_query}<br><br>ACCOUNTS FOUND ({len(context_accounts)}):<br>{json.dumps(context_accounts, indent=2)}<br><br>Provide analysis specifically considering the filter context.&quot;&quot;&quot;<br>        }]<br>    )<br><br>    
return {<br>        &quot;query&quot;: user_query,<br>        &quot;filters_applied&quot;: filters_text,<br>        &quot;search_results&quot;: context_accounts,<br>        &quot;claude_analysis&quot;: message.content[0].text<br>    }<br><br># Use it<br>result = filtered_search_and_analyze(<br>    user_query=&quot;What are the common challenges?&quot;,<br>    segment=&quot;Enterprise&quot;,<br>    max_health=50,<br>    min_arr=500000<br>)<br>print(result[&quot;claude_analysis&quot;])</pre><p>This pattern enables you to ask: <em>“What challenges face high-value Enterprise accounts?”</em> and receive analysis based on that exact filtered subset.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/968/1*4sc4bYixHGb26dxEI9GWxQ.png" /></figure><h3>Putting it all together: Complete workflow</h3><p>Here’s the complete pattern:</p><pre># 1. Generate embeddings locally<br>embedding = generate_embedding(account_content)<br><br># 2. Store in Weaviate with metadata<br>collection.data.insert(<br>    properties=account_properties,<br>    vector=embedding<br>)<br># 3. Search semantically with filters<br>results = collection.query.near_vector(<br>    near_vector=query_embedding,<br>    filters=business_filters,<br>    limit=20<br>)<br># 4. Analyze with Claude (grounded in facts)<br>analysis = claude.analyze(<br>    search_results=results,<br>    user_query=query<br>)<br># 5. 
Return actionable insights<br>return {<br>    &quot;search_results&quot;: results,<br>    &quot;ai_analysis&quot;: analysis<br>}</pre><p><strong>This architecture prevents hallucinations because:</strong></p><ul><li>Weaviate retrieves <strong>facts</strong> (actual account data)</li><li>Claude analyses <strong>only</strong> what Weaviate returned</li><li>No generation without retrieval</li><li>Every claim cites specific data</li></ul><h3>Summary: What you’ve learned</h3><p>By now, you should understand:</p><p>✅ <strong>Vector databases</strong> convert text to numbers that capture meaning<br> ✅ <strong>Weaviate</strong> stores vectors + metadata for hybrid search<br> ✅ <strong>Collections</strong> are like SQL tables with semantic search superpowers<br> ✅ <strong>Schema design</strong> separates vectorized content from filterable properties<br> ✅ <strong>Search modes</strong>: semantic, keyword, hybrid, filtered<br> ✅ <strong>Vectorization</strong> is the process of converting text into numerical embeddings<br> ✅ <strong>Ollama</strong> generates embeddings locally with no API costs<br> ✅ <strong>RAG pattern</strong> grounds Claude in retrieved facts</p><p><strong>Next steps to deepen your understanding:</strong></p><ol><li>Deploy the docker-compose environment and try the code samples</li><li>Experiment with different alpha values in hybrid search</li><li>Compare semantic vs keyword results for your own queries</li><li>Build a simple RAG application combining search + LLM</li><li>Monitor embedding quality and query performance</li></ol><p>Vector search isn’t magic — it’s engineering with semantic understanding.</p><p><strong>Happy embeddings!</strong></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=2badafc4081a" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Snowflake CORTEX_COMPLETE in Full Throttle]]></title>
            <link>https://medium.com/snowflake/snowflake-cortex-complete-in-full-throttle-eb6d143f451a?source=rss-ff65005cbd7e------2</link>
            <guid isPermaLink="false">https://medium.com/p/eb6d143f451a</guid>
            <category><![CDATA[dbt]]></category>
            <category><![CDATA[snowflake]]></category>
            <category><![CDATA[snowflake-cortex-ai]]></category>
            <category><![CDATA[gitlab]]></category>
            <category><![CDATA[ai]]></category>
            <dc:creator><![CDATA[Radovan Bacovic]]></dc:creator>
            <pubDate>Wed, 10 Dec 2025 20:02:11 GMT</pubDate>
            <atom:updated>2025-12-10T20:02:11.567Z</atom:updated>
<content:encoded><![CDATA[<h4>AI-based summarisation and categorisation to consume customer tickets</h4><h3>The challenge of locked data</h3><p>GitLab’s support team wants to process <strong>100k+</strong> customer tickets — a valuable source of customer feedback, product issues, and improvement opportunities. The traditional approach? Manual, ad-hoc summaries requested one at a time. Not scalable, not secure, and not practical.</p><p>We were looking for an automated solution that could summarise and categorise customer support tickets for analytical purposes, converting them into actionable insights without exposing private customer information.</p><h3>Unlocking business value through AI</h3><p>This wasn’t just a technical exercise. Unlocking this dataset meant:</p><ul><li><strong>For Product Teams:</strong> Direct customer feedback to prioritise features and fix recurring issues.</li><li><strong>For Customer Success:</strong> Pattern recognition across accounts to prevent churn.</li><li><strong>For Sales:</strong> Understanding pain points to improve positioning and solutions.</li><li><strong>For the GitLab Data Team:</strong> Dogfooding our own AI capabilities at scale.</li></ul><p><strong>The business case</strong> was that <strong>100k+</strong> tickets represented years of customer voice sitting unused due to compliance constraints.</p><h3>Building the AI processing pipeline</h3><h4>Architecture decision: CORTEX_COMPLETE over Claude API</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Sk7PiN7DbwuGkzGuJztyAg.png" /></figure><p>What is <a href="https://docs.snowflake.com/en/sql-reference/functions/complete-snowflake-cortex"><strong>CORTEX_COMPLETE</strong></a>?</p><blockquote>Given a prompt, generates a response (completion) using your choice of supported language model.</blockquote><p><strong>The critical decision:</strong> We chose Snowflake’s <strong>CORTEX_COMPLETE</strong> function over calling the <a 
href="https://www.claude.com/platform/api"><strong>Claude API</strong></a> directly (real-time or batch).</p><p><strong>Why? Infrastructure simplicity trumps cost optimisation.</strong></p><p>When evaluating our options, we considered three approaches:</p><ol><li><strong>Claude API Real-time:</strong> Call Anthropic’s API directly from Python/external services</li><li><strong>Claude API Batch:</strong> Use Anthropic’s batch processing through API for lower costs</li><li><strong>Snowflake CORTEX_COMPLETE:</strong> Use Snowflake’s native AI function</li></ol><p><strong>CORTEX_COMPLETE won decisively</strong> (even though, personally, I was impressed with the Claude API’s capabilities) because:</p><ol><li><strong>Zero infrastructure overhead</strong></li></ol><ul><li>No external API keys to manage and rotate</li><li>No Python services to deploy, monitor, or scale</li><li>No network egress to configure and secure</li><li>No retry logic, rate limiting, or error handling to implement</li><li>No additional compute environments beyond our existing Snowflake warehouse</li></ul><p><strong>2. Data never leaves Snowflake</strong></p><ul><li>No need to export tickets to external processing services</li><li>Compliance teams approved it — no cross-boundary data flow</li><li>Audit trail built into Snowflake’s query history</li></ul><p><strong>3. 
Native </strong><a href="https://www.getdbt.com/"><strong>dbt</strong></a><strong> integration</strong></p><ul><li>Process tickets with pure <strong>SQL</strong> in <a href="https://www.getdbt.com/"><strong>dbt</strong></a> models</li><li>No orchestration of external services or API calls</li><li>Standard <a href="https://docs.getdbt.com/docs/build/incremental-models"><strong>dbt incremental</strong></a> patterns work out of the box</li><li>Developers work in a single environment (dbt/Snowflake)</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*J5Y-GMKSGBWD_aWg3rFntQ.png" /><figcaption>Decision-making process for choosing the proper architecture</figcaption></figure><p><strong>Yes, CORTEX_COMPLETE costs more per token than direct Claude API calls.</strong> However, when you factor in the engineering time saved — with no infrastructure to build, maintain, or troubleshoot — the total cost of ownership (TCO) is significantly lower. Roughly, a direct Claude API call was 10–15% cheaper, but the overall total cost of ownership is 30% lower when using <strong>CORTEX_COMPLETE</strong>.</p><p><strong>The trade-off is clear:</strong> Pay slightly more per API call to eliminate weeks of infrastructure work, ongoing operational overhead, and security complexity. For our use case, this was a logical decision.</p><p>We chose Snowflake’s CORTEX_COMPLETE function with <a href="https://platform.claude.com/docs/en/about-claude/models/overview"><strong>Claude 4 Sonnet</strong></a> as our processing engine, and business stakeholders validated the quality and consistency of the results. 
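To make the trade-off concrete, here is a back-of-the-envelope sketch in Python. Every dollar figure below is a hypothetical placeholder; only the 10–15% per-call and roughly 30% TCO ratios come from our evaluation:

```python
# Back-of-the-envelope TCO comparison. All dollar figures are
# illustrative assumptions; only the "~10-15% cheaper per call" and
# "~30% lower TCO" ratios reflect what we observed.

def total_cost_of_ownership(api_spend: float, infra_spend: float) -> float:
    """TCO = direct AI-call spend plus engineering/operations spend."""
    return api_spend + infra_spend

claude_api_spend = 10_000.0                       # hypothetical direct API spend
cortex_api_spend = claude_api_spend / (1 - 0.12)  # ~12% pricier per call

claude_infra_spend = 6_500.0  # keys, retries, monitoring, deploys (hypothetical)
cortex_infra_spend = 0.0      # nothing beyond the existing Snowflake warehouse

claude_tco = total_cost_of_ownership(claude_api_spend, claude_infra_spend)
cortex_tco = total_cost_of_ownership(cortex_api_spend, cortex_infra_spend)

savings = 1 - cortex_tco / claude_tco
print(f"Cortex TCO is {savings:.0%} lower")  # roughly the ~30% we saw
```

The point of the sketch: the per-call premium is dwarfed by the infrastructure spend it eliminates.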
After processing thousands of tickets, teams reported high satisfaction with the model&#39;s ability to extract product categories and assess sentiment for customer tickets.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*bo9vzsOF_6RgDc2yMEc2Wg.png" /><figcaption>Data processing architecture for customer tickets</figcaption></figure><p>Note: You can use any of the well-known models for this purpose; it’s up to you. Here is the <a href="https://docs.snowflake.com/en/user-guide/snowflake-cortex/aisql#regional-availability"><strong>complete list</strong></a> of the available models.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*aSgnhyDTPc0Ri_3xdSYL1w.png" /><figcaption>Partial list of models available for use in the CORTEX_COMPLETE function (per region)</figcaption></figure><h3>The core: AI processing function</h3><p>Here’s the actual SQL function that does the heavy lifting:</p><pre>CREATE OR REPLACE FUNCTION ANALYZE_TICKET(<br>    COMMENTS VARCHAR, <br>    IS_PROD  BOOLEAN DEFAULT FALSE<br>)<br>RETURNS VARIANT<br>AS &#39;<br>    IFF(<br>        is_prod = TRUE,<br>        -- PRODUCTION MODE: Call Snowflake Cortex<br>        SNOWFLAKE.CORTEX.COMPLETE(<br>            &#39;&#39;claude-4-sonnet&#39;&#39;,<br>            ARRAY_CONSTRUCT(<br>                OBJECT_CONSTRUCT(<br>                    &#39;&#39;role&#39;&#39;, &#39;&#39;user&#39;&#39;,<br>                    &#39;&#39;content&#39;&#39;, CONCAT(<br>                        &#39;&#39;Analyze these support tickets. 
<br>                        PRIVACY CRITICAL: ...<br>                        <br>                        Return ONLY JSON with this structure:<br>                        {<br>                            ...<br>                        }&#39;&#39;<br>                    )<br>                )<br>            )<br>        ),<br>        -- TEST MODE: Return dummy response<br>        PARSE_JSON(&#39;&#39;{&quot;choices&quot;: [{&quot;messages&quot;: &quot;dummy_data&quot;}]}&#39;&#39;)<br>    )<br>&#39;</pre><p><strong>That’s it.</strong> No external services. No API key management. No network configuration. Just SQL calling a native Snowflake AI function. The complexity reduction alone justified the higher per-token cost.</p><h3>Why the IS_PROD parameter matters</h3><p>The IS_PROD boolean parameter is critical for several reasons:</p><ul><li><strong>Cost control during development.</strong> Every CORTEX_COMPLETE call costs money. During development and testing, we are iterating on SQL logic, debugging dbt models, and validating data transformations. Without the IS_PROD guard, every dbt run in a development environment would trigger expensive API calls for the entire dataset. With <strong>100k+</strong> tickets at stake, testing iterations could burn through thousands of dollars before you even reach production, without delivering any value.</li><li><strong>Faster development cycles.</strong> Calling CORTEX_COMPLETE adds latency — each AI inference takes time. 
In test mode, the function returns instant dummy JSON, allowing developers to validate parsing logic, test incremental strategies, and debug SQL transformations without waiting for real AI processing.</li><li><strong>Preventing accidental production runs.</strong> The parameter creates an explicit contract: <em>“This function only processes real data when explicitly told it’s production.”</em> This prevents accidental full backfills triggered by a misplaced dbt run --full-refresh in the wrong environment.</li></ul><p>Implementation in <strong>dbt</strong>:</p><pre>analyze_ticket(<br>    comments =&gt; ticket_content,<br>    is_prod =&gt; {% if target.name == &#39;prod&#39; %} TRUE {% else %} FALSE {% endif %}<br>)</pre><p>Only when target.name == &#39;prod&#39; does the real processing happen. Every other environment gets dummy data for testing.</p><h3>The dbt pipeline: simple and efficient</h3><p>The dbt model orchestrates the entire flow:</p><p><strong>Step 1: Filter relevant tickets</strong></p><pre>WHERE created_at &gt;= DATEADD(&#39;year&#39;, -3, CURRENT_DATE())<br>  AND ticket_status = &#39;closed&#39;<br></pre><p><strong>Step 2: Exclude noise.</strong> Remove trial accounts, free users, bot-generated tickets, and password resets using tag-based filtering. 
This was also done using a SQL function:</p><pre>CREATE OR REPLACE FUNCTION clean_content(content_text VARCHAR)<br>    RETURNS STRING<br>    LANGUAGE SQL<br>    COMMENT = &#39;Cleans text content by removing PII, paths, and noise&#39;<br>    AS<br>    $$<br>        TRIM(<br>            REGEXP_REPLACE(<br>                REGEXP_REPLACE(<br>                    REGEXP_REPLACE(<br>                        REGEXP_REPLACE(<br>                            content_text,<br>                            &#39;...&#39;,<br>                            &#39;&#39;, 1, 0, &#39;si&#39;<br>                        ),<br>                        &#39;...&#39;,<br>                        &#39;[PATH]&#39;, 1, 0, &#39;i&#39;<br>                    ),<br>                    &#39;...&#39;,<br>                    &#39;[ID]&#39;, 1, 0, &#39;i&#39;<br>                ),<br>                &#39;...&#39;,<br>                ...<br>            )<br>        )<br>    $$;</pre><p><strong>Step 3: AI processing</strong></p><pre>analyze_ticket(<br>    comments =&gt; comments,<br>    is_prod  =&gt; TRUE<br>) AS ai_processed_results</pre><p><strong>Step 4: Structured output.</strong> Parse the JSON response into analytical columns.</p><h3>Why do we prevent full backfills?</h3><p>Notice this critical configuration:</p><pre>{{ config(<br>    materialized = &quot;incremental&quot;,<br>    incremental_strategy = &quot;append&quot;,<br>    unique_key = &quot;ticket_id&quot;,<br>    full_refresh = false  -- THIS IS CRITICAL<br>) }}</pre><p>The full_refresh = false setting is not optional — it&#39;s a financial and operational safeguard:</p><ul><li><strong>Cost protection:</strong> A single accidental dbt run --full-refresh would reprocess all <strong>100k+</strong> tickets, costing hundreds to thousands of dollars in Cortex API calls. The false flag prevents this disaster scenario.</li><li><strong>Idempotency guarantee:</strong> Once a closed ticket is processed, it never changes — the ticket is immutable. 
Reprocessing would generate identical results while wasting compute and money.</li><li><strong>Performance optimisation:</strong> The initial backfill took 10 hours. Preventing full refreshes ensures we only process the incremental delta of newly closed tickets each day.</li></ul><p>The incremental logic ensures we never double-process as the data is immutable:</p><pre>{% if is_incremental() %}<br>  AND ticket_id NOT IN (SELECT ticket_id FROM {{ this }})<br>{% endif %}<br></pre><p>The result? <a href="https://docs.snowflake.com/en/sql-reference/functions/complete-snowflake-cortex">CORTEX_COMPLETE</a> transforms raw data into clean, categorised, de-identified insights ready for analytics. The output table is safe to connect to Tableau or any other visual tool, share with external teams, and query without compliance restrictions.</p><p>Every field that once contained customer names, emails, or authentication details now contains generic terms or structured categories like “Frustrated sentiment due to recurring pipeline failures.”</p><h3>Why prompt engineering is critical</h3><p>In our use case, the prompt isn’t just important — it’s the entire control mechanism for data governance and business value extraction. 
A poorly designed prompt would either:</p><ul><li><strong>Miss business context:</strong> Vague prompts like <em>“summarise this ticket”</em> would produce useless generic summaries instead of structured categorisation by product stage, severity, and sentiment.</li><li><strong>Generate inconsistent output:</strong> Without specifying the exact JSON structure and enum values <em>(Frustrated/Concerned/Satisfied/Neutral)</em>, downstream analytics would break due to parsing errors or inconsistent categories.</li></ul><p>Our prompt does three critical jobs simultaneously:</p><ol><li><strong>Data governance enforcement:</strong> Explicitly instructs the AI to remove personal data using concrete examples</li><li><strong>Business logic implementation:</strong> Defines 15+ structured fields matching GitLab’s product taxonomy</li><li><strong>Quality control:</strong> Requires explanations for classifications (sentiment_reason, severity_reason) to ensure the AI isn’t guessing</li></ol><p>The prompt is essentially a <strong>data governance policy written in natural language</strong> and executed at scale by AI. Get it wrong, and you expose private data or generate useless output. Get it right, and you transform compliance-restricted data into a strategic asset.</p><h3>Production performance: fast and affordable</h3><p>After the initial backfill, daily operations are remarkably efficient:</p><p><strong>Daily incremental runs complete in under 1 minute</strong> on a Snowflake <strong><em>L-size</em></strong> warehouse. Since we’re only processing newly closed tickets <em>(typically 50–100 per day)</em>, the compute overhead is minimal. 
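For intuition, here is a rough sketch of what a daily incremental run might cost. The 50–100 tickets-per-day volume matches our runs; the tokens-per-ticket count and the per-million-token price are invented placeholders, not Snowflake's published Cortex pricing:

```python
# Rough estimate of one day's Cortex spend for the incremental run.
# The 50-100 tickets/day volume is real; tokens-per-ticket and the
# per-token price are hypothetical placeholders, NOT published pricing.

def daily_cortex_cost(tickets: int, tokens_per_ticket: int,
                      usd_per_million_tokens: float) -> float:
    """Estimated spend (USD) for one incremental dbt run."""
    total_tokens = tickets * tokens_per_ticket
    return total_tokens / 1_000_000 * usd_per_million_tokens

# Assume ~2,000 tokens per ticket (prompt + response) at an assumed
# blended price of $15 per million tokens:
for tickets in (50, 100):
    print(f"{tickets} tickets -> ~${daily_cortex_cost(tickets, 2_000, 15.0):.2f}")
```

Under these assumptions a typical day lands in the low single-digit dollars, which matches what we observe in practice.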
The dbt model identifies unprocessed tickets, calls CORTEX_COMPLETE for each new ticket, and appends results to the target table—all within 1–2 minutes.</p><p><strong>AI processing costs remain low</strong> because:</p><ul><li>We only process closed tickets (immutable, process-once guarantee)</li><li>An incremental strategy prevents redundant API calls</li><li>Token usage is monitored per ticket to catch cost anomalies</li><li>The IS_PROD parameter is the guard that prevents accidentally expensive runs in development</li></ul><p>For a typical daily run processing 50–100 new tickets, the Cortex API cost is just a few bucks — far less than the value of unlocking <strong>100k+ </strong>tickets for business analysis. The combination of incremental processing, Snowflake’s native integration, and Claude 4 Sonnet’s efficiency makes this pipeline both performant and cost-effective at scale.</p><h3>The numbers</h3><ul><li><strong>Initial backfill:</strong> <strong>100k+ </strong>tickets processed</li><li><strong>Daily incremental:</strong> New closed tickets are automatically processed in under <strong>1–2</strong> minutes</li><li><strong>Processing time:</strong> <strong>~10</strong> hours for the initial load</li><li><strong>Daily AI costs:</strong> Just a few dollars or less for typical <strong>50–100</strong>-ticket increments</li><li><strong>Warehouse size:</strong> L-size (daily runs), can be even smaller</li><li><strong>Infrastructure complexity:</strong> Zero additional services beyond Snowflake</li></ul><h3>Conclusion: AI as a data governance tool</h3><p>This implementation demonstrates that AI isn’t just for generating content — it’s a powerful tool for <strong>data governance at scale</strong>. 
We transformed compliance-restricted data into a business asset through:</p><ol><li><strong>Automated de-identification</strong> — AI removes personal data more consistently than manual review</li><li><strong>Structured categorisation</strong> — Turn unstructured text into queryable dimensions</li><li><strong>Self-service access</strong> — Product and Success teams query directly without bottlenecks</li><li><strong>Native security</strong> — Processing happens inside Snowflake’s security ecosystem</li><li><strong>Product taxonomy alignment</strong> — Map customer issues to GitLab’s stages for strategic insights</li><li><strong>Zero infrastructure overhead</strong> — <strong>CORTEX_COMPLETE</strong> eliminated weeks of engineering work</li></ol><p>The real innovation isn’t the AI model — it’s choosing the right use case supported with the proper infrastructure to make AI processing operationally simple, secure, and maintainable at scale.</p><p><strong>Paying slightly more per API call to eliminate weeks of engineering work is the smartest business decision.</strong></p><p><em>Want to implement a similar AI-powered data processing solution? Choose infrastructure simplicity over cost optimisation. Snowflake Cortex keeps data secure with zero additional services; dbt provides orchestration with pure SQL; </em><strong><em>Claude 4 Sonnet</em></strong><em> delivers “good enough” accuracy and business results. 
The engineering time you save is worth far more than the marginal cost difference.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=eb6d143f451a" width="1" height="1" alt=""><hr><p><a href="https://medium.com/snowflake/snowflake-cortex-complete-in-full-throttle-eb6d143f451a">Snowflake CORTEX_COMPLETE in Full Throttle</a> was originally published in <a href="https://medium.com/snowflake">Snowflake Builders Blog: Data Engineers, App Developers, AI, &amp; Data Science</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How We Built a Structured Streamlit Framework in Snowflake: From Chaos to Compliance]]></title>
            <link>https://medium.com/snowflake/how-we-built-a-structured-streamlit-framework-in-snowflake-from-chaos-to-compliance-baa3b709aead?source=rss-ff65005cbd7e------2</link>
            <guid isPermaLink="false">https://medium.com/p/baa3b709aead</guid>
            <category><![CDATA[snowflake]]></category>
            <category><![CDATA[streamlit]]></category>
            <category><![CDATA[gitlab]]></category>
            <dc:creator><![CDATA[Radovan Bacovic]]></dc:creator>
            <pubDate>Wed, 15 Oct 2025 22:02:17 GMT</pubDate>
            <atom:updated>2025-10-15T22:02:17.173Z</atom:updated>
<content:encoded><![CDATA[<h4>How we transformed scattered Streamlit applications into a unified, secure, and scalable solution for the Snowflake environment</h4><h3><strong>What You Should Learn</strong></h3><p>What happens when you combine 🐍<strong>Python</strong>, <strong>Streamlit</strong>, ❄️<strong>Snowflake</strong> and 🦊 <strong>GitLab</strong>? Let’s find out together…</p><p>As GitLab’s Data team, we leveraged our unique position as customer zero by building this entire framework on GitLab’s own CI/CD infrastructure and project management tools. Here are our secret ingredients:</p><ol><li><a href="https://about.gitlab.com/platform/"><strong>GitLab</strong></a> <em>(product)</em> — the tool we create for DevSecOps success.</li><li><a href="https://www.snowflake.com/"><strong>Snowflake</strong></a> — our Single Source of Truth <em>(</em><strong><em>SSOT</em></strong><em>)</em> for the Data Warehouse activities <em>(and more than that).</em></li><li><a href="https://streamlit.io/"><strong>Streamlit</strong></a> — an open-source tool for visual applications, pure Python code under the hood.</li></ol><p>This provided us with immediate access to enterprise-grade DevSecOps capabilities, enabling us to implement automated testing, code review processes, and deployment pipelines from the outset. By utilizing GitLab’s built-in features for issue tracking, merge requests, and automated deployments <em>(CI/CD pipelines)</em>, we can iterate rapidly and validate our framework against real-world enterprise requirements. 
This internal-first approach ensured our solution was battle-tested on GitLab’s own infrastructure before any external implementation.</p><p>The most critical lesson from building the <strong>Streamlit Application Framework in </strong>❄️<strong>Snowflake</strong> in the <a href="http://about.gitlab.com"><strong>GitLab</strong></a> Data team is that structure beats chaos every time — implement governance early rather than retrofitting it later when maintenance becomes exponential.</p><p>Success requires clearly defining roles and responsibilities, separating infrastructure concerns from application development, so that each team can focus on its strengths.</p><p><strong>Security and compliance cannot be afterthoughts</strong>; they must be built into templates and automated processes from day one, because it’s far easier to enforce consistent standards upfront than to retrofit them onto existing applications. Invest heavily in automation and CI/CD pipelines, as manual processes don’t scale and introduce human error.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*zZSaZLEjpW72PFMRpBlRPg.png" /><figcaption>Architecture of the framework (general overview)</figcaption></figure><h3>What: The Problem We Solved</h3><p><strong>Imagine this scenario:</strong> Your organisation has dozens of Streamlit applications scattered across different environments, running various Python versions, connecting to sensitive data with inconsistent security practices. Some apps work, others break mysteriously, and nobody knows who built what or how to maintain them.</p><p>This was exactly the challenge our data team faced. Applications were being created in isolation, with no standardization, no security oversight, and no clear deployment process. The result? 
A compliance nightmare and a maintenance burden that was growing exponentially.</p><p>We built a comprehensive Streamlit Framework that transforms how data applications are created, maintained, and deployed in enterprise environments.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*3u-REbAq4L82O4ry1KeLHA.png" /><figcaption>Functional architectural design (high level)</figcaption></figure><h3>So What: Why the Streamlit Application Framework Changes Everything</h3><h4>Three clear roles, one unified process</h4><p>Our framework introduces a structured approach with three distinct roles:</p><ol><li><strong>Maintainers</strong> <em>(Data team members and contributors)</em> handle the infrastructure — CI/CD pipelines, security templates, and compliance rules. They ensure the framework runs smoothly and stays secure.</li><li><strong>Creators</strong> <em>(Those who need to build applications)</em> can focus on what they do best: creating visualizations, connecting to Snowflake data, and building user experiences. They have full flexibility to create new applications from scratch, add new pages to existing apps, integrate additional Python libraries, and build complex data visualizations — all without worrying about deployment pipelines or security configurations.</li><li><strong>Viewers</strong> <em>(End users)</em> access polished, secure applications without any technical overhead. All they need is Snowflake access.</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*6Iv2-5Fm-SZReetOzmKMJw.png" /><figcaption>Roles overview and their functionality</figcaption></figure><h3>Automate Everything</h3><p>We solve the problem with Continuous Integration and Continuous Delivery: the days of manual deployments and configuration headaches are a thing of the past. 
Our framework provides:</p><ul><li><strong>One-click environment preparation:</strong> with a set of <strong>make</strong> commands, the environment is installed and ready in a few seconds:</li></ul><pre>================================================================================<br>✅ Snowflake CLI successfully installed and configured!<br>Connection: gitlab_streamlit<br>User: YOU@GITLAB.COM<br>Account: gitlab<br>================================================================================<br>Using virtualenv: /Users/YOU/repos/streamlit/.venv<br>📚 Installing project dependencies...<br>Installing dependencies from lock file<br><br>No dependencies to install or update<br>✅ Streamlit environment prepared!</pre><ul><li><strong>Automated CI/CD pipelines</strong> that handle testing, code review, and deployment from development to production.</li><li><strong>Secure sandbox environments</strong> for safe development and testing before production deployment:</li></ul><pre>╰─$ make streamlit-rules<br>🔍 Running Streamlit compliance check...<br>================================================================================<br>CODE COMPLIANCE REPORT<br>================================================================================<br>Generated: 2025-07-09 14:01:16<br>Files checked: 1<br><br>SUMMARY:<br>✅ Passed: 1<br>❌ Failed: 0<br>Success Rate: 100.0%<br><br>APPLICATION COMPLIANCE SUMMARY:<br>📱 Total Applications Checked: 1<br>⚠️ Applications with Issues: 0<br>📊 File Compliance Rate: 100.0%<br><br>DETAILED RESULTS BY APPLICATION:<br>...</pre><ul><li><strong>Template-based application creation</strong> that ensures consistency across all applications and pages:</li></ul><pre>╰─$ make streamlit-new-page STREAMLIT_APP=sales_dashboard STREAMLIT_PAGE_NAME=analytics<br>📝 Generating new Streamlit page: analytics for app: sales_dashboard<br><br>📃 Create new page from template:<br>  Page name: analytics<br>  App directory: sales_dashboard<br>  Template path: 
page_template.py<br><br>✅ Successfully created &#39;analytics.py&#39; in &#39;sales_dashboard&#39; directory from template</pre><ul><li><strong>Poetry-based dependency management</strong> that prevents version conflicts and maintains clean environments.</li><li><strong>Organized project structure</strong> with dedicated folders for applications, templates, compliance rules, and configuration management:</li></ul><pre>├── src/<br>│   ├── applications/          # Folder for Streamlit applications<br>│   │   ├── main_app/          # Main dashboard application<br>│   │   ├── components/        # Shared components<br>│   │   └── &lt;your_apps&gt;/       # Your custom application<br>│   │   └── &lt;your_apps2&gt;/      # Your 2nd custom application<br>│   ├── templates/             # Application and page templates<br>│   ├── compliance/            # Compliance rules and checks<br>│   └── setup/                 # Setup and configuration utilities<br>├── tests/                     # Test files<br>├── config.yml                 # Environment configuration<br>├── Makefile                   # Build and deployment automation<br>└── README.md                  # Main README.md file</pre><ul><li><strong>Streamlined workflow</strong> from local development through testing schema to production, all automated through GitLab CI/CD pipelines.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*KImquLnLO-aTpJv19LtAkw.png" /><figcaption>GitLab CI/CD pipelines for full automation of the process</figcaption></figure><h3>Security and Compliance By Design</h3><p>Instead of bolting on security as an afterthought, our framework builds it in from the ground up. Every application adheres to the same security standards, and compliance requirements are automatically enforced; audit trails are maintained throughout the development lifecycle. We introduce our compliance rules and verify them with a single command. 
For instance, we can list which classes and methods are mandatory to use, which files you should have, and which role is allowed and which are forbidden to share the application with. The rules are flexible and descriptive, all you ned to do is to define them in a YAML file:</p><pre>class_rules:<br>  - name: &quot;Inherit code for the page from GitLabDataStreamlitInit&quot;<br>    description: &quot;All Streamlit apps must inherit from GitLabDataStreamlitInit&quot;<br>    severity: &quot;error&quot;<br>    required: true<br>    class_name: &quot;*&quot;<br>    required_base_classes:<br>      - &quot;GitLabDataStreamlitInit&quot;<br>    required_methods:<br>      - &quot;__init__&quot;<br>      - &quot;set_page_layout&quot;<br>      - &quot;setup_ui&quot;<br>      - &quot;run&quot;<br><br>function_rules:<br>  - name: &quot;Main function required&quot;<br>    description: &quot;Must have a main() function&quot;<br>    severity: &quot;error&quot;<br>    required: true<br>    function_name: &quot;main&quot;<br><br>import_rules:<br>  - name: &quot;Import GitLabDataStreamlitInit&quot;<br>    description: &quot;Must import the mandatory base class&quot;<br>    severity: &quot;error&quot;<br>    required: true<br>    module_name: &quot;gitlab_data_streamlit_init&quot;<br>    required_items:<br>      - &quot;GitLabDataStreamlitInit&quot;<br>  - name: &quot;Import streamlit&quot;<br>    description: &quot;Must import streamlit library&quot;<br>    severity: &quot;error&quot;<br>    required: true<br>    module_name: &quot;streamlit&quot;<br><br>file_rules:<br>  - name: &quot;Snowflake configuration required (snowflake.yml)&quot;<br>    description: &quot;Each application must have a snowflake.yml configuration file&quot;<br>    severity: &quot;error&quot;<br>    required: true<br>    file_pattern: &quot;**/applications/**/snowflake.yml&quot;<br>    base_path: &quot;&quot;<br>  - name: &quot;Snowflake environment required (environment.yml)&quot;<br>    description: &quot;Each 
application must have a environment.yml configuration file&quot;<br>    severity: &quot;error&quot;<br>    required: true<br>    file_pattern: &quot;**/applications/**/environment.yml&quot;<br>    base_path: &quot;&quot;<br>  - name: &quot;Share specification required (share.yml)&quot;<br>    description: &quot;Each application must have a share.yml file&quot;<br>    severity: &quot;warning&quot;<br>    required: true<br>    file_pattern: &quot;**/applications/**/share.yml&quot;<br>    base_path: &quot;&quot;<br>  - name: &quot;README.md required (README.md)&quot;<br>    description: &quot;Each application should have a README.md file with a proper documentation&quot;<br>    severity: &quot;error&quot;<br>    required: true<br>    file_pattern: &quot;**/applications/**/README.md&quot;<br>    base_path: &quot;&quot;<br>  - name: &quot;Starting point recommended (dashboard.py)&quot;<br>    description: &quot;Each application must have a dashboard.py as a starting point&quot;<br>    severity: &quot;warning&quot;<br>    required: true<br>    file_pattern: &quot;**/applications/**/dashboard.py&quot;<br>    base_path: &quot;&quot;<br><br>sql_rules:<br>  - name: &quot;SQL files must contain only SELECT statements&quot;<br>    description: &quot;SQL files and SQL code in other files should only contain SELECT statements for data safety&quot;<br>    severity: &quot;error&quot;<br>    required: true<br>    file_extensions: [&quot;.sql&quot;, &quot;.py&quot;]<br>    select_only: true<br>    forbidden_statements:<br>      - ....<br>    case_sensitive: false<br>  - name: &quot;SQL queries should include proper SELECT statements&quot;<br>    description: &quot;When SQL is present, it should contain proper SELECT statements&quot;<br>    severity: &quot;warning&quot;<br>    required: false<br>    file_extensions: [&quot;.sql&quot;, &quot;.py&quot;]<br>    required_statements:<br>      - &quot;SELECT&quot;<br>    case_sensitive: false<br><br>share_rules:<br>  - name: &quot;Valid 
functional roles in share.yml&quot;<br>    description: &quot;Share.yml files must contain only valid functional roles from the approved list&quot;<br>    severity: &quot;error&quot;<br>    required: true<br>    file_pattern: &quot;**/applications/**/share.yml&quot;<br>    valid_roles:<br>      - ...<br>    safe_data_roles:<br>      - ...<br>  - name: &quot;Share.yml file format validation&quot;<br>    description: &quot;Share.yml files must follow the correct YAML format structure&quot;<br>    severity: &quot;error&quot;<br>    required: true<br>    file_pattern: &quot;**/applications/**/share.yml&quot;<br>    required_keys:<br>      - &quot;share&quot;<br>    min_roles: 1<br>    max_roles: 10</pre><p>Running a single command:</p><pre>╰─$ make streamlit-rules</pre><p>We can verify all the rules we have created and validate that the <strong>developers</strong> <em>(who build a Streamlit application)</em> are following the policy specified by the <strong>creators</strong> <em>(who define the policies and building blocks of the framework)</em>, and that all the building blocks are in the right place. 
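</p><p>As an aside, the core of such a checker is straightforward. Below is a minimal, hypothetical sketch (not the framework's actual implementation) that enforces just the first class rule above with Python's standard <code>ast</code> module; the rule values are inlined here rather than read from the YAML file:</p>

```python
import ast

# Hedged sketch: enforce one class_rule from the YAML above.
# Assumption for the example: every class in an app file must inherit
# from GitLabDataStreamlitInit and define the four required methods.
REQUIRED_BASE = "GitLabDataStreamlitInit"
REQUIRED_METHODS = {"__init__", "set_page_layout", "setup_ui", "run"}

def check_class_rules(source: str) -> list[str]:
    """Return a list of rule violations found in the given Python source."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.ClassDef):
            continue
        # Collect simple-name base classes, e.g. `class X(Base):`
        bases = {b.id for b in node.bases if isinstance(b, ast.Name)}
        if REQUIRED_BASE not in bases:
            violations.append(f"{node.name}: missing base class {REQUIRED_BASE}")
        methods = {n.name for n in node.body if isinstance(n, ast.FunctionDef)}
        for missing in sorted(REQUIRED_METHODS - methods):
            violations.append(f"{node.name}: missing method {missing}()")
    return violations

compliant = """
class Dashboard(GitLabDataStreamlitInit):
    def __init__(self): pass
    def set_page_layout(self): pass
    def setup_ui(self): pass
    def run(self): pass
"""
print(check_class_rules(compliant))               # []
print(check_class_rules("class Bad:\n    pass"))  # base class + 4 method violations
```

<p>The real tool generalizes this idea across class, function, import, file, SQL and share rules, all driven by the YAML file.</p><p>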
This ensures consistent behaviour across all Streamlit applications.</p><pre><br>🔍 Running Streamlit compliance check...<br>================================================================================<br>CODE COMPLIANCE REPORT<br>================================================================================<br>Generated: 2025-08-18 17:05:12<br>Files checked: 4<br><br>SUMMARY:<br>✅ Passed: 4<br>❌ Failed: 0<br>Success Rate: 100.0%<br><br>APPLICATION COMPLIANCE SUMMARY:<br>📱 Total Applications Checked: 1<br>⚠️ Applications with Issues: 0<br>📊 File Compliance Rate: 100.0%<br><br>DETAILED RESULTS BY APPLICATION:<br>================================================================================<br><br>✅ PASS APPLICATION: main_app<br>------------------------------------------------------------<br>📁 FILES ANALYZED (4):<br>  ✅ dashboard.py<br>    📦 Classes: SnowflakeConnectionTester<br>    🔧 Functions: main<br>    📥 Imports: os, pwd, gitlab_data_streamlit_init, snowflake.snowpark.exceptions, streamlit<br>  ✅ show_streamlit_apps.py<br>    📦 Classes: ShowStreamlitApps<br>    🔧 Functions: main<br>    📥 Imports: pandas, gitlab_data_streamlit_init, snowflake_session, streamlit<br>  ✅ available_packages.py<br>    📦 Classes: AvailablePackages<br>    🔧 Functions: main<br>    📥 Imports: pandas, gitlab_data_streamlit_init, streamlit<br>  ✅ share.yml<br>    👥 Share Roles: snowflake_analyst_safe<br><br>📄 FILE COMPLIANCE FOR MAIN_APP:<br>  ✅ Required files found:<br>    ✓ snowflake.yml<br>    ✓ environment.yml<br>    ✓ share.yml<br>    ✓ README.md<br>    ✓ dashboard.py<br><br>RULES CHECKED:<br>----------------------------------------<br>Class Rules (1):<br>  - Inherit code for the page from GitLabDataStreamlitInit (error)<br>Function Rules (1):<br>  - Main function required (error)<br>Import Rules (2):<br>  - Import GitLabDataStreamlitInit (error)<br>  - Import streamlit (error)<br>File Rules (5):<br>  - Snowflake configuration required (snowflake.yml) (error)<br>  - Snowflake 
environment required (environment.yml) (error)<br>  - Share specification required (share.yml) (warning)<br>  - README.md required (README.md) (error)<br>  - Starting point recommended (dashboard.py) (warning)<br>SQL Rules (2):<br>  - SQL files must contain only SELECT statements (error)<br>    🗄 SELECT-only mode enabled<br>    🚨 Forbidden: INSERT, UPDATE, DELETE, DROP, ALTER...<br>  - SQL queries should include proper SELECT statements (warning)<br>Share Rules (2):<br>  - Valid functional roles in share.yml (error)<br>    👥 Valid roles: 15 roles defined<br>    🔒 Safe data roles: 11 roles<br>  - Share.yml file format validation (error)<br> <br>------------------------------------------------------------<br>✅ Compliance check passed<br>-----------------------------------------------------------</pre><h3>Developer Experience That Works</h3><p>Whether you prefer your favorite IDE, a web-based development environment or Snowflake Snowsight, the experience remains consistent. The framework provides:</p><ul><li><strong>Template-driven development</strong>: New applications and pages are created through standardized templates, ensuring consistency and best practices from day one. 
No more scattered design and elements.</li></ul><pre>╰─$ make streamlit-new-app NAME=sales_dashboard<br>🔧 Configuration Environment: TEST<br>📝 Configuration File: config.yml<br>📜 Config Loader Script: ./setup/get_config.sh<br>🐍 Python Version: 3.12<br>📁 Applications Directory: ./src/applications<br>🗄 Database: ...<br>📊 Schema: ...<br>🏗️ Stage: ...<br>🏭 Warehouse: ...<br>🆕 Creating new Streamlit app: sales_dashboard<br>Initialized the new project in ./src/applications/sales_dashboar</pre><ul><li><strong>Poetry package management</strong>: All dependencies are managed through Poetry, creating isolated environments that won’t disrupt your existing Python setup:</li></ul><pre>[tool.poetry]<br>name = &quot;GitLab Data Streamlit&quot;<br>version = &quot;0.1.1&quot;<br>description = &quot;GitLab Data Team Streamlit project&quot;<br>authors = [&quot;GitLab Data Team &lt;*****@gitlab.com&gt;&quot;]<br>readme = &quot;README.md&quot;<br><br>[tool.poetry.dependencies]<br>python = &quot;&lt;3.13,&gt;=3.12&quot;<br>snowflake-snowpark-python = &quot;==1.32.0&quot;<br>snowflake-connector-python = {extras = [&quot;development&quot;, &quot;pandas&quot;, &quot;secure-local-storage&quot;], version = &quot;^3.15.0&quot;}<br>streamlit = &quot;==1.22.0&quot;<br>watchdog = &quot;^6.0.0&quot;<br>types-toml = &quot;^0.10.8.20240310&quot;<br>pytest = &quot;==7.0.0&quot;<br>black = &quot;==25.1.0&quot;<br>importlib-metadata= &quot;==4.13.0&quot;<br>pyyaml = &quot;==6.0.2&quot;<br>python-qualiter = &quot;*&quot;<br>ruff = &quot;^0.1.0&quot;<br>types-pyyaml = &quot;^6.0.12.20250516&quot;<br>jinja2 = &quot;==3.1.6&quot;<br><br>[build-system]<br>requires = [&quot;poetry-core&quot;]<br>build-backend = &quot;poetry.core.masonry.api&quot;</pre><ul><li><strong>Multi-page application support</strong>: Creators can easily build complex applications with multiple pages and add new libraries as needed. 
Multi-page applications are part of the framework, so developers focus on the logic, not the design and structure.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*9lK6wyqHbjnsh3t6k7r8nA.png" /><figcaption>Multi-page application example (in Snowflake)</figcaption></figure><ul><li><strong>Seamless Snowflake integration</strong>: Built-in connectors and authentication handling for secure data access provide the same experience, regardless of your environment <em>(local development or directly in Snowflake):</em></li></ul><pre>make streamlit-push-test APPLICATION_NAME=sales_dashboard<br><br>📤 Deploying Streamlit app to test environment: sales_dashboard<br>...<br>------------------------------------------------------------------------------------------------------------<br>🔗 Running share command for application: sales_dashboard<br>Running commands to grant shares<br><br>🚀 Executing: snow streamlit share sales_dashboard with SOME_NICE_ROLE<br>✅ Command executed successfully<br><br>📊 Execution Summary: 1/1 commands succeeded</pre><ul><li><strong>Comprehensive Makefile</strong>: All common commands are wrapped in simple Makefile targets, from local development to testing and deployment, including CI/CD pipelines.</li><li><strong>Safe local development</strong>: Everything runs in isolated Poetry environments, protecting your system while providing a production-like experience.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*BnF37Y9y2O1N3a0D0dxlQw.png" /><figcaption>The same experience regardless of the environment (example of local development)</figcaption></figure><ul><li><strong>Collaboration via code: </strong>All applications and components live in one repository, which allows the entire organization to collaborate on the same resources and avoid double work and redundant setup.</li></ul><h3>Now What: Getting Started and Moving Forward</h3><h4>Next steps — how our experience can improve your 
flow</h4><p>If you’re facing similar challenges with scattered Streamlit applications, here’s how to begin and move quickly:</p><ol><li><strong>Assess your current state</strong>: Inventory your existing applications and identify pain points.</li><li><strong>Define your roles</strong>: Separate maintainer responsibilities from creator and end users&#39; needs.</li><li><strong>Start with templates</strong>: Create standardized application templates that enforce your security and compliance requirements.</li><li><strong>Implement CI/CD</strong>: Automate your deployment pipeline to reduce manual errors and ensure consistency.</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*EdMNHtZ9FS2C9U8t6e2RZw.png" /><figcaption>Deployed application in Snowflake</figcaption></figure><h3>The Bigger Picture</h3><p>This framework represents more than just a technical solution — it’s a paradigm shift toward treating data applications as first-class citizens in your enterprise (Data) architecture.</p><p>By providing structure without sacrificing flexibility, the GitLab Data team created an environment where anyone in the company with minimal technical knowledge can innovate rapidly while maintaining the highest standards of security and compliance.</p><h3>What’s Next?</h3><p>We’re continuing to enhance the framework based on user feedback and emerging needs. 
Future improvements include expanded template libraries, enhanced monitoring capabilities, and more flexibility and a smoother user experience.</p><p><strong>The goal isn’t just to solve today’s problems, but to create a foundation that scales with your organization’s growing data application needs.</strong></p><h3>Summary</h3><p><a href="https://handbook.gitlab.com/handbook/enterprise-data/"><strong>GitLab Data Team</strong></a> transformed from having dozens of scattered, insecure Streamlit applications with no standardisation into a unified, enterprise-grade framework that separates roles cleanly:</p><ol><li><strong>Maintainers</strong> handle infrastructure and security,</li><li><strong>Creators</strong> focus on building applications without deployment headaches, and</li><li><strong>Viewers</strong> access polished, compliant apps.</li></ol><p>Using building blocks that separate concerns:</p><ol><li>Automated <strong>CI/CD</strong> pipelines</li><li><strong>Fully</strong> collaborative and versioned code in <strong>git</strong></li><li><strong>Template</strong>-based development</li><li>Built-in <strong>security</strong>, <strong>compliance</strong>, <strong>testing</strong> and</li><li><a href="https://python-poetry.org/"><strong>Poetry</strong></a>-managed environments</li></ol><p><em>We eliminated the maintenance nightmare while enabling rapid innovation — proving that you can have both structure and flexibility when you treat data applications as first-class enterprise assets rather than throwaway prototypes.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=baa3b709aead" width="1" height="1" alt=""><hr><p><a href="https://medium.com/snowflake/how-we-built-a-structured-streamlit-framework-in-snowflake-from-chaos-to-compliance-baa3b709aead">How We Built a Structured Streamlit Framework in Snowflake: From Chaos to Compliance</a> was originally published in <a href="https://medium.com/snowflake">Snowflake 
Builders Blog: Data Engineers, App Developers, AI, &amp; Data Science</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Yet another Python article — give me quality or give me death]]></title>
            <link>https://medium.com/@radovan.bacovic/yet-another-python-article-give-me-quality-or-give-me-death-7029b71e16ce?source=rss-ff65005cbd7e------2</link>
            <guid isPermaLink="false">https://medium.com/p/7029b71e16ce</guid>
            <category><![CDATA[gitlab]]></category>
            <category><![CDATA[python]]></category>
            <category><![CDATA[data]]></category>
            <dc:creator><![CDATA[Radovan Bacovic]]></dc:creator>
            <pubDate>Mon, 18 Aug 2025 15:10:15 GMT</pubDate>
            <atom:updated>2025-08-18T15:10:15.585Z</atom:updated>
            <content:encoded><![CDATA[<h3>Yet another Python article — give me quality or give me death</h3><h3>Introducing python-qualiter: The All-in-One Python Code Quality Tool You’ve Been Waiting For</h3><p><em>Simplify your Python linting workflow with a single, powerful command-line interface</em></p><p>As Python developers, we’ve all been there. You’re working on a project, and suddenly you find yourself juggling multiple tools: black for formatting, isort for import sorting, mypy for type checking, flake8 for style guide enforcement, and pylint for comprehensive code analysis. Each tool serves its purpose, but managing them all becomes a complexity nightmare, especially when setting up CI/CD pipelines.</p><p>What if I told you there’s now a way to run all these essential code quality checks with a single command?</p><p>Today, I’m excited to introduce <a href="https://pypi.org/project/python-qualiter/"><strong>python-qualiter</strong></a> — an open-source package that wraps all your favorite Python linters and code quality tools into one unified, user-friendly interface.</p><h3>The Problem with Multiple Tools</h3><p>Every Python developer knows the pain:</p><ul><li><strong>Local Development</strong>: Remembering to run multiple commands before committing code</li><li><strong>CI/CD Complexity</strong>: Setting up separate pipeline steps for each linting tool</li><li><strong>Resource Waste</strong>: Multiple pipeline executions consuming unnecessary compute resources</li><li><strong>Inconsistent Results</strong>: Different team members running different combinations of tools</li></ul><p>The result? Fragmented code quality processes that slow down development and create inconsistencies across teams.</p><h3>Meet python-qualiter: Your New Code Quality Companion</h3><p><strong>python-qualiter</strong> is a modern CLI wrapper that brings together the power of multiple Python linting and formatting tools under a single, intuitive interface. 
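</p><p>To make the pattern concrete, here is a minimal, hypothetical sketch of how such a wrapper can work under the hood (this illustrates the approach, not python-qualiter's actual code): each linter runs as a subprocess, its exit code becomes a pass/fail flag, and disabled or missing tools are skipped. The tool list and flags below are each tool's own standard check invocations, chosen for the example:</p>

```python
import shutil
import subprocess

# Hedged sketch of an all-in-one linter wrapper: run each tool as a
# subprocess and collect a pass/fail flag per tool. The commands are
# the tools' standard "check only" invocations.
LINTERS = {
    "black": ["black", "--check"],
    "isort": ["isort", "--check-only"],
    "flake8": ["flake8"],
}

def run_linters(path: str, disable: frozenset = frozenset()) -> dict[str, bool]:
    """Return {linter_name: passed} for every enabled, installed linter."""
    results = {}
    for name, cmd in LINTERS.items():
        if name in disable or shutil.which(cmd[0]) is None:
            continue  # skip disabled linters and tools not on PATH
        proc = subprocess.run(cmd + [path], capture_output=True, text=True)
        results[name] = proc.returncode == 0  # linters report failure via exit code
    return results
```

<p>A result dict like <code>{"black": True, "flake8": False}</code> is all that is needed to render a pass/fail matrix and to decide the wrapper's own exit code.</p><p>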
Think of it as your code quality Swiss Army knife.</p><h3>What Makes It Special?</h3><p>The tool combines industry-standard linters, including:</p><ul><li><strong>isort</strong> for import organization</li><li><strong>black</strong> for code formatting</li><li><strong>mypy</strong> for static type checking</li><li><strong>flake8</strong> for style guide enforcement</li><li><strong>pylint</strong> for comprehensive code analysis</li><li><strong>vulture</strong> for dead code detection</li><li><strong>ruff</strong> — the rising star in Python tooling</li></ul><p>But it’s more than just a collection of tools — it’s a thoughtfully designed experience that makes code quality management effortless.</p><h3>Key Features That Set It Apart</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*3u7RfgdZAcX3XPduPOryTA.png" /></figure><h3>🎯 All-in-One Linting</h3><p>Run every essential code quality check with a single command. No more remembering multiple tool names or parameters.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*kC_2R8mY8fwwCq6d-4nQow.png" /></figure><h3>📊 Visual Result Matrix</h3><p>Get a clear, at-a-glance view of which files pass which linters. The visual feedback makes it easy to identify exactly where issues exist.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*cm6gWdTvlRc8hgfBxF-YHg.png" /></figure><h3>🔧 Auto-Fix Capability</h3><p>Many code quality issues can be automatically resolved. python-qualiter identifies and fixes problems where possible, saving you valuable development time.</p><pre>python-qualiter lint my_file.py --fix</pre><h3>⚙️ Flexible Configuration</h3><p>Enable or disable specific linters based on your project’s needs. 
Not every project requires every tool, and python-qualiter respects that.</p><pre>python-qualiter lint my_file.py --disable pylint</pre><h3>📈 Detailed Reports</h3><p>When issues are found, you get comprehensive information about what went wrong and how to fix it.</p><pre>python-qualiter lint my_file.py --verbose</pre><pre>ruff found issues in ./lint.py:<br>lint.py:21:21: F401 [*] `pathlib.Path` imported but unused<br>lint.py:378:19: F821 Undefined name `lint_file`<br>Found 2 errors.<br>[*] 1 fixable with the `--fix` option.<br><br><br>=====================================================================================================================<br>LINTING RESULTS MATRIX<br>=====================================================================================================================<br>File                                     | black    | flake8   | isort    | mypy     | pylint   | ruff     | vulture <br>---------------------------------------------------------------------------------------------------------------------<br>./__init__.py                            | ✅        | ❌        | ✅        | ✅        | ✅        | ✅        | ✅        | <br>./app.py                                 | ✅        | ❌        | ✅        | ✅        | ❌        | ✅        | ✅        | <br>./lint.py                                | ✅        | ❌        | ✅        | ❌        | ❌        | ❌        | ✅        | <br>=====================================================================================================================<br>❌ 7 FAILURES OUT OF 21 CHECKS<br>=====================================================================================================================</pre><h3>Getting Started: It’s Easier Than You Think</h3><p>Installation couldn’t be simpler:</p><pre>pip install python-qualiter</pre><p>Check your code quality:</p><pre>python-qualiter lint path/to/your/code.py</pre><p>Apply automatic fixes:</p><pre>python-qualiter lint path/to/your/code.py 
--fix</pre><p>For multiple files or directories:</p><pre>python-qualiter lint src/*.py test/*.py -v</pre><h3>Streamlining Your CI/CD Pipeline</h3><p>One of the most powerful applications of python-qualiter is in your CI/CD pipeline. Instead of managing multiple pipeline steps, you can consolidate everything into a single, efficient step:</p><pre># .gitlab-ci.yml<br>python_linters:<br>  script:<br>    - pip install python-qualiter<br>    - python-qualiter lint src/*.py test/*.py -v<br>  allow_failure: true</pre><p>This approach offers several advantages:</p><p><strong>Cost Efficiency</strong>: Reduce compute resources by running all checks in a single pipeline step rather than spawning multiple containers.</p><p><strong>Simplicity</strong>: One pipeline step to maintain instead of multiple complex configurations.</p><p><strong>Consistency</strong>: Ensure the same checks run locally and in CI/CD, eliminating the “works on my machine” problem.</p><p><strong>Speed</strong>: Faster pipeline execution with reduced overhead from multiple tool startups.</p><h3>Why This Matters for Your Team</h3><p>Quality code is consistent code, whether you’re checking it on your local machine or through your CI/CD pipeline. python-qualiter ensures that your entire team has access to the same comprehensive code quality checks without the complexity traditionally associated with multi-tool setups.</p><p>The tool also embraces <a href="https://docs.astral.sh/ruff/"><strong>ruff</strong></a>, the new rising star in the Python ecosystem known for its incredible speed and comprehensive rule set. By integrating ruff alongside established tools, python-qualiter gives you the best of both worlds: proven reliability and cutting-edge performance.</p><h3>The Open Source Advantage</h3><p>python-qualiter is fully open source and available on PyPI. 
This means:</p><ul><li><strong>Transparency</strong>: You can see exactly how your code is being analysed</li><li><strong>Community-Driven</strong>: Contributions and feedback from developers worldwide</li><li><strong>No Vendor Lock-in</strong>: Use it freely in any project, commercial or personal</li><li><strong>Continuous Improvement</strong>: Regular updates and enhancements based on real-world usage</li></ul><h3>Join the Movement</h3><p>Ready to simplify your Python code quality workflow? Here’s how you can get involved:</p><ol><li><strong>Try it out</strong>: pip install python-qualiter and run it on your current project</li><li><strong>Share feedback</strong>: Report bugs, suggest features, or share your experience</li><li><strong>Contribute</strong>: The <a href="https://gitlab.com/rbacovic/python-qualiter"><strong>project</strong></a> welcomes contributions from developers of all skill levels</li><li><strong>Spread the word</strong>: Help other Python developers discover this tool</li></ol><h3>Conclusion</h3><p>Whether you’re a solo developer working on personal projects or part of a large team managing complex applications, python-qualiter adapts to your workflow and makes code quality checks as simple as a single command.</p><p><em>Happy coding! 🐍</em></p><p><em>Have questions or feedback? I’d love to hear from you in the comments below.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=7029b71e16ce" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[“Broken English” 2023 tour recap and videos]]></title>
            <link>https://medium.com/@radovan.bacovic/broken-english-2023-tour-recap-and-videos-8193cc6f1751?source=rss-ff65005cbd7e------2</link>
            <guid isPermaLink="false">https://medium.com/p/8193cc6f1751</guid>
            <category><![CDATA[serbia]]></category>
            <category><![CDATA[data]]></category>
            <category><![CDATA[gitlab]]></category>
            <category><![CDATA[confrence]]></category>
            <dc:creator><![CDATA[Radovan Bacovic]]></dc:creator>
            <pubDate>Fri, 12 Jan 2024 21:08:00 GMT</pubDate>
            <atom:updated>2024-01-12T21:08:24.032Z</atom:updated>
<content:encoded><![CDATA[<h3>My talks — or “Broken English” 2023 tour recap and videos</h3><h3>Hey mom, I am on the Internet now!</h3><p>A quick recap of my “Broken English” tour in 2023 (“Broken English” being my nickname for my talks at various conferences).</p><p>I travelled thousands of miles in 2023, and I always feel overjoyed to share my mileage and experience with the audience. My main driver when stepping on stage is to put a smile on people’s faces and make them feel good and fulfilled. As simple as that!</p><p>Happy to highlight and share a few talks from the last quarter of the year.</p><h3>#9Inspiration</h3><p>If you like AI, DevSecOps and all the buzzwords popular these days, the talk:</p><p>🎥 <a href="https://www.youtube.com/watch?v=h5TWnYI0sCw&amp;list=PLQyyxph2CGupNGhGLZ1ofCxqJe_RzM7ME&amp;index=19&amp;t=3s&amp;pp=gAQBiAQB"><strong>#9Inspiration: When nimble is not fast enough: Will AI and Data leverage your DevSecOps journey</strong></a></p><p>will give you a clear overview of the trends in this area.</p><p>My contribution to the <a href="https://levi9conference.com/"><strong>#9Inspiration Conference</strong></a> in Belgrade, Serbia 🇷🇸 in September 2023.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*W7eDIL8tXV83f68Xj4YopA.jpeg" /><figcaption>Talk, talk, talk and beyond — part 1</figcaption></figure><h3>Crunch Conference</h3><p>Here is one of my greatest hits, as it provides an overview of the secret sauce of a successful Data Platform.</p><p><a href="https://www.youtube.com/watch?v=2_O3jGpicOg&amp;list=PLQyyxph2CGupNGhGLZ1ofCxqJe_RzM7ME&amp;index=18&amp;pp=gAQBiAQB">🎥 <strong>Do the Magic with All-Remote Data Teams… — Radovan Bacovic | Compass Tech Summit 2023</strong></a></p><p>I was happy to take part in the famous <a href="https://crunchconf.com/2023"><strong>Crunch Conference</strong></a> in Budapest, Hungary 🇭🇺 in October 2023.</p><figure><img alt="" 
src="https://cdn-images-1.medium.com/max/1024/1*IcdyjBfpuTe4XO9vChPALQ.jpeg" /><figcaption>Talk, talk, talk and beyond — part 2</figcaption></figure><h3>DSC Europe 2023</h3><p>Should you go back to the office or work from your kitchen? No one really cares. Here is how I see this topic, with an overview from my first-hand experience:</p><p><a href="https://www.youtube.com/watch?v=thkSOcYFPe8&amp;list=PLQyyxph2CGupNGhGLZ1ofCxqJe_RzM7ME&amp;index=20&amp;t=1s&amp;pp=gAQBiAQB">🎥 <strong>Asynchronous Work: The Next Phase of Remote Work | Radovan Bacovic | DSC Europe 23</strong></a></p><p>As always, the DSC organizers provided the ultimate conference experience in Belgrade, Serbia 🇷🇸 in November 2023 at the <a href="https://datasciconference.com/"><strong>Data Science Conference</strong></a>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*WOHHElmHO_h5oHrgYC-HLw.jpeg" /><figcaption>Talk, talk, talk and beyond — part 3</figcaption></figure><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=8193cc6f1751" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[DSC Croatia 2022]]></title>
            <link>https://medium.com/@radovan.bacovic/dsc-croatia-2022-8468fc12bc70?source=rss-ff65005cbd7e------2</link>
            <guid isPermaLink="false">https://medium.com/p/8468fc12bc70</guid>
            <category><![CDATA[data]]></category>
            <category><![CDATA[open-source]]></category>
            <category><![CDATA[gitlab]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[data-engineering]]></category>
            <dc:creator><![CDATA[Radovan Bacovic]]></dc:creator>
            <pubDate>Mon, 16 May 2022 12:05:55 GMT</pubDate>
            <atom:updated>2022-05-16T12:05:55.897Z</atom:updated>
<content:encoded><![CDATA[<p>Well, one more live conference: <a href="https://dsccroatia.com/">DSC Croatia 2022</a> — this time in Zagreb, Croatia, from <strong>10th-12th May 2022. </strong>It always feels good to contribute and exchange experience. As usual, the best talks happened on the margins, over coffee chats and/or a glass of wine.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*jHjaK98dnCMKr-FN1GcTwg.jpeg" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*3aEI17v8GhQxGv0rxhL7tg.jpeg" /></figure><p>I had a great time meeting old and new peers, with good and interesting points for discussion. Good vibe, great atmosphere, and I was happy to share the same space with brilliant minds and to open discussions about data challenges.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/768/1*OQpOFGGzuPLXiEzDiPtRlQ.jpeg" /></figure><p>I spoke about how we do the Data things at <a href="http://about.gitlab.com"><strong>GitLab.com</strong></a> <em>(look, we have a brand new logo; hope you like it).</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/990/1*23wV60Mvjt8UQk0-G4tKqA.jpeg" /></figure><p>Here is the presentation with all the details. 
Feel free to ping me for more details if you are interested in the topic.</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.slideshare.net%2Fslideshow%2Fembed_code%2Fkey%2FNFE2Xva9QGbfte&amp;display_name=SlideShare&amp;url=https%3A%2F%2Fwww.slideshare.net%2FRadovanBaovi%2Fdsc-2021-presentationradovanbacovic&amp;image=https%3A%2F%2Fcdn.slidesharecdn.com%2Fss_thumbnails%2Fdsc2021presentationradovanbacovic-220203100205-thumbnail-4.jpg%3Fcb%3D1643882543&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=slideshare" width="600" height="500" frameborder="0" scrolling="no"><a href="https://medium.com/media/caa7a4069f312b1e936d6b3ef891e4ac/href">https://medium.com/media/caa7a4069f312b1e936d6b3ef891e4ac/href</a></iframe><p>And of course, I strongly recommend Zagreb as a sweet spot with a good atmosphere, delicious food, and a great choice of wines.</p><p>See you around, live, of course.</p><p>#DSCROATIA2022 #data</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=8468fc12bc70" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Data Innovation Summit 2022 Stockholm — brief recap]]></title>
            <link>https://medium.com/@radovan.bacovic/data-innovation-summit-2022-stockholm-brief-recap-3fbcc2135c29?source=rss-ff65005cbd7e------2</link>
            <guid isPermaLink="false">https://medium.com/p/3fbcc2135c29</guid>
            <category><![CDATA[open-source]]></category>
            <category><![CDATA[gitlab]]></category>
            <category><![CDATA[conference]]></category>
            <category><![CDATA[data]]></category>
            <category><![CDATA[data-engineer]]></category>
            <dc:creator><![CDATA[Radovan Bacovic]]></dc:creator>
            <pubDate>Wed, 11 May 2022 11:49:39 GMT</pubDate>
            <atom:updated>2022-05-11T11:49:39.726Z</atom:updated>
<content:encoded><![CDATA[<h3>Data Innovation Summit 2022 Stockholm — brief recap</h3><p>Thrilled to share my latest (live) data conference experience with the audience. I am just back from beautiful Stockholm (Sweden), where I attended the <a href="https://datainnovationsummit.com/">Data Innovation Summit 2022</a>. It was a great data conference where I had a chance to catch up with top-notch data companies, most of which build products we are using.</p><p>I had an opportunity to learn a lot <em>(workshops + talks + informal chats)</em> from:<br>- Snowflake | Google/GCP | Fivetran | Firebolt | Snowplow| DataBricks | Apple | Aiven | Dremio | SODA | Confluent | <a href="http://coalesce.io/">Coalesce.io</a> ...</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/768/1*TsE9x8ltxg5jTDT3BaonQg.jpeg" /></figure><p>Happy to share my thoughts and takeaways about where the data world is going:</p><ul><li>Cloud is not only the primary but the only choice these days. Also, most of the companies praise multi-cloud as a good solution</li><li>GCP seems to have the most aggressive campaign for a data-based approach in the cloud. I spoke with their engineers, and they are building the entire ecosystem around BigQuery - probably to compete with Snowflake and win over its users</li><li>ETL as a Service <em>(out of the box, ready to run in a second)</em> is definitely a rising area - many cloud-based Meltano-ish platforms, like Aiven, Keboola, etc., are really good. 
The focus is moving from coding to researching and choosing the <em>“right tools”</em></li><li>The open-source approach is a good way to avoid vendor lock-in in the long term</li><li>Data observability tools are a <em>“must-have”</em> for 2022 - among other tools, SODA <em>(the Netherlands company)</em> has a slightly different approach than MC, BigEye or Anomalo, and has an open-source version worth checking</li><li>Data mesh was the hottest word at the conference, but I think it is just a trend</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*q2w7HyoT0__JEctbtdxLOg.jpeg" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*yugWN7POYMZklpupDKlgUg.jpeg" /></figure><p>Bottom line (for GitLab): people really like and respect GitLab - a well-known and positively recognized brand. 80% of these companies use a paid version of GitLab to build products we use on a daily basis. That’s so cool!<br>I strongly believe we belong among the top companies in the world, without any doubt.</p><p>See you next year in Stockholm.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=3fbcc2135c29" width="1" height="1" alt="">]]></content:encoded>
        </item>
    </channel>
</rss>