<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[GoodData Developers - Medium]]></title>
        <description><![CDATA[News, resources, and advice for developers and data analysts who are building on the world&#39;s most open, secure, scalable, and reliable cloud BI Platform. - Medium]]></description>
        <link>https://medium.com/gooddata-developers?source=rss----e8ee419648ea---4</link>
        <image>
            <url>https://cdn-images-1.medium.com/proxy/1*TGH72Nnw24QL3iV9IOm4VA.png</url>
            <title>GoodData Developers - Medium</title>
            <link>https://medium.com/gooddata-developers?source=rss----e8ee419648ea---4</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Wed, 08 Apr 2026 21:51:06 GMT</lastBuildDate>
        <atom:link href="https://medium.com/feed/gooddata-developers" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[From RAG to GraphRAG: Knowledge Graphs, Ontologies and Smarter AI]]></title>
            <link>https://medium.com/gooddata-developers/from-rag-to-graphrag-knowledge-graphs-ontologies-and-smarter-ai-01854d9fe7c3?source=rss----e8ee419648ea---4</link>
            <guid isPermaLink="false">https://medium.com/p/01854d9fe7c3</guid>
            <category><![CDATA[ontology]]></category>
            <category><![CDATA[graphrag]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[rags]]></category>
            <category><![CDATA[knowledge-graph]]></category>
            <dc:creator><![CDATA[Marcelo G. Almiron]]></dc:creator>
            <pubDate>Tue, 02 Sep 2025 16:45:09 GMT</pubDate>
            <atom:updated>2025-09-03T07:05:08.566Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*dEjt8QwdjI2l4qpTGNIrVQ.png" /></figure><p>Modern AI chatbots often rely on Retrieval-Augmented Generation (RAG), a technique where the chatbot pulls in external data to ground its answers in real facts. If you’ve used a “chat with your documents” tool, you’ve seen RAG in action: the system finds relevant snippets from a document and feeds them into a Large Language Model (LLM) so it can answer your question with accurate information.</p><p>RAG has greatly improved the factual accuracy of LLM answers. However, traditional RAG systems mostly treat knowledge as <strong>disconnected text passages</strong>. The LLM is given a handful of relevant paragraphs and left to piece them together during its response. This works for simple questions, but it <strong>struggles with complex queries</strong> that require connecting the dots across multiple sources.</p><p>This article will demystify two concepts that can take chatbots to the next level, namely, <strong>ontologies</strong> and <strong>knowledge graphs</strong>, and show how they combine with RAG to form GraphRAG (Graph-based Retrieval-Augmented Generation). We’ll explain what they mean and why they matter in simple terms.</p><p>Why does this matter, you might ask? Because <strong>GraphRAG</strong> promises to make chatbot answers <strong>more accurate, context-aware, and insightful</strong> than what you get with a traditional RAG. Businesses exploring AI solutions value these qualities — an AI that can truly <strong>understand context, avoid mistakes, and reason through complex questions</strong> can be a game-changer. (Although these benefits depend on a careful implementation, which is often lacking in practice.)</p><p>By combining unstructured text with a structured knowledge graph, GraphRAG systems can provide answers that feel far more informed. 
Bridging knowledge graphs with LLMs is a key step toward AI that doesn’t just retrieve information, but actually <strong>understands</strong> it.</p><h3>What is RAG?</h3><p>Retrieval-Augmented Generation, or RAG, is a technique for enhancing language model responses by <strong>grounding them in external knowledge</strong>. Instead of replying based solely on what’s in its model memory, which might be outdated or incomplete, a RAG-based system will fetch relevant information from an outside source (e.g., documents, databases and the web) and feed that into the model to help formulate the answer.</p><p>In simple terms, <strong>RAG = LLM + Search Engine</strong>: the model first <strong>retrieves</strong> supporting data, <strong>augments</strong> its understanding of the topic and then <strong>generates</strong> a response using both its built-in knowledge and the retrieved info.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*4ypsbNzZys8nrpdqZ7Gw_A.png" /><figcaption><strong>Figure 1</strong>: Traditional RAG Pipeline.</figcaption></figure><p>As shown in the figure above the typical RAG pipeline involves a few steps that mirror a smart lookup process:</p><ol><li><strong>Indexing Knowledge:</strong> First, the system breaks the knowledge source (say a collection of documents) into chunks of text and creates vector embeddings for each chunk. These embeddings are numerical representations of the text meaning. All these vectors are stored in a vector database or index.</li><li><strong>Query Embedding:</strong> When a user asks a question, the query is also converted into a vector embedding using the same technique.</li><li><strong>Similarity Search:</strong> The system compares the query vector to all the stored vectors to find which text chunks are most “similar” or relevant to the question.</li><li><strong>Generation with Context:</strong> Finally, the language model is given the user’s question plus the retrieved snippets as context. 
It then generates an answer that incorporates the provided information.</li></ol><p>RAG has been a big step forward for making LLMs useful in real-world scenarios. It’s how tools like Bing Chat or various document QA bots can provide current, specific answers with references. By grounding answers in retrieved text, RAG <strong>reduces hallucinations</strong> (the model can be pointed to the facts) and allows access to information beyond the AI’s training cutoff date. However, traditional RAG also has some well-known limitations:</p><ul><li>It treats the retrieved documents essentially as <strong>separate, unstructured blobs</strong>. If an answer requires synthesising info across multiple documents or understanding relationships, the model has to do that heavy lifting itself during generation.</li><li>RAG retrieval is usually based on semantic similarity. It finds relevant passages but <strong>doesn’t inherently understand the meaning of the content</strong> or how one fact might relate to another.</li><li>There is no built-in mechanism for <strong>reasoning</strong> or enforcing consistency across the retrieved data; the LLM just gets a dump of text and tries its best to weave it together.</li></ul><p>In practice, for straightforward factual queries, e.g., “When was this company founded?”, traditional RAG is great. For more complex questions, e.g., “Compare the trends in Q1 sales and Q1 marketing spend and identify any correlations.”, traditional RAG might falter. It could return one chunk about sales, another about marketing, but <strong>leave the logical integration to the LLM</strong>, which may or may not succeed coherently.</p><p>These limitations point to an opportunity. What if, instead of giving the AI system just a pile of documents, we also gave it a <strong>knowledge graph</strong> (i.e. a network of entities and their relationships) as a scaffold for reasoning? 
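To make the four-step pipeline above concrete, here is a minimal, self-contained sketch of traditional RAG retrieval. A toy bag-of-words "embedding" stands in for a real embedding model, and the chunks and query are made up for illustration:

```python
import re
from math import sqrt
from collections import Counter

# Toy "embedding": bag-of-words term counts. A real system would call a
# trained embedding model here; everything below is illustrative.
def embed(text: str) -> Counter:
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# 1. Indexing: chunk the knowledge source and embed each chunk.
chunks = [
    "GoodData was founded in 2007.",
    "Q1 sales grew 12 percent year over year.",
    "Marketing spend in Q1 was flat.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

# 2.-3. Embed the query and rank chunks by cosine similarity.
query = "When was GoodData founded?"
query_vec = embed(query)
top = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)[:2]

# 4. Hand the question plus retrieved snippets to the LLM as context.
prompt = "Context:\n" + "\n".join(c for c, _ in top) + "\n\nQuestion: " + query
print(prompt)
```

Note that each retrieved chunk arrives as an isolated blob: nothing in this pipeline knows how one chunk relates to another, which is exactly the limitation discussed above.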
If RAG retrieval could return not just text based on similarity search, but a set of interconnected facts, the AI system could follow those connections to produce a more insightful answer.</p><p><strong>GraphRAG</strong> is about integrating this graph-based knowledge into the RAG pipeline. By doing so, we aim to overcome the multi-source, ambiguity, and reasoning issues highlighted above.</p><p>Before we get into how GraphRAG works, though, let’s clarify what we mean by knowledge graphs and ontologies — the building blocks of this approach.</p><h3>Knowledge Graphs</h3><p>A <strong>knowledge graph</strong> is a networked representation of real-world knowledge, where each node represents an <strong>entity</strong> and each edge represents a <strong>relationship</strong> between entities.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*8YdRDanogryjPHKR1f6iQw.png" /><figcaption><strong>Figure 2</strong>: Knowledge (sub-)graph sample from <a href="https://archive.ics.uci.edu/dataset/352/online+retail">Online Retail dataset</a>.</figcaption></figure><p>In the figure above, we see a graphical representation of what a knowledge graph looks like. It structures data as a graph, not as tables or isolated documents. This means information is stored in a way that inherently captures connections. Some key traits:</p><ul><li>They are <strong>flexible</strong>: You can add a new type of relationship or a new property to an entity without upending the whole system. Graphs can easily evolve to accommodate new knowledge.</li><li>They are <strong>semantic</strong>: Each edge has meaning, which makes it possible to traverse the graph and retrieve meaningful chains of reasoning. 
The graph can represent <em>context</em> along with content.</li><li>They naturally support <strong>multi-hop queries</strong>: If you want to find how two entities are connected, a graph database can traverse neighbors, then neighbors-of-neighbors, and so on.</li><li>Knowledge graphs are usually stored in specialised <strong>graph databases</strong> or triplestores. These systems are optimised for storing nodes and edges and running graph queries.</li></ul><p>The structure of knowledge graphs is a boon for AI systems, especially in the RAG context. Because facts are linked, an LLM can get a <strong>web of related information</strong> rather than isolated snippets. This means:</p><ul><li>AI systems can better <strong>disambiguate context</strong>. For example, if a question mentions “Jaguar,” the graph can clarify whether it refers to the car or the animal through relationships, providing context that text alone often lacks.</li><li>An AI system can use <strong>“joins” or traversals to collect related facts</strong>. Instead of separate passages, a graph query can provide a connected subgraph of all relevant information, offering the model a pre-connected puzzle rather than individual pieces.</li><li>Knowledge graphs ensure <strong>consistency</strong>. For example, if a graph knows Product X has Part A and Part B, it can reliably list only those parts, unlike text models that might hallucinate or miss information. The structured nature of graphs allows complete and correct aggregation of facts.</li><li>Graphs offer <strong>explainability</strong> by tracing the nodes and edges used to derive an answer, allowing for a clear chain of reasoning and increased trust through cited facts.</li></ul><p>To sum up, a knowledge graph injects <strong>meaning</strong> into the AI’s context. Rather than treating your data as a bag of words, it treats it as a network of knowledge. 
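As a small illustration of multi-hop traversal, a knowledge graph can be modeled as labeled edges and walked with a breadth-first search. The entities and relations below are invented for the example:

```python
from collections import deque

# Tiny knowledge graph: node -> list of (relation, neighbor) edges.
graph = {
    "Customer_42": [("placed", "Order_7")],
    "Order_7": [("contains", "Product_X")],
    "Product_X": [("belongs_to", "Category_Toys")],
}

def multi_hop(graph, start, max_hops):
    """Collect all facts reachable from `start` within `max_hops` edges."""
    facts, frontier, seen = [], deque([(start, 0)]), {start}
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # don't expand beyond the hop limit
        for relation, neighbor in graph.get(node, []):
            facts.append((node, relation, neighbor))
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return facts

print(multi_hop(graph, "Customer_42", max_hops=2))
```

A graph database runs essentially this kind of traversal (much more efficiently) when answering a multi-hop query, returning a connected chain of facts rather than isolated snippets.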
This is exactly what we want for an AI system tasked with answering complex questions: a rich, connected context it can navigate, instead of a heap of documents it has to brute-force parse every time.</p><p>Now that we know what knowledge graphs are, and how they can benefit AI systems, let’s see what ontologies are and how they may help to build better knowledge graphs.</p><h3>Ontologies</h3><p>In the context of knowledge systems, an <strong>ontology</strong> is a formal specification of knowledge for a particular domain. It defines the <strong><em>entities</em></strong> (or concepts) that exist in the domain and the <strong><em>relationships</em></strong> between those entities.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*QcvIvVCVjqNyGazv6BRhtw.png" /><figcaption><strong>Figure 3</strong>: A simplified example of an ontology for e-commerce.</figcaption></figure><p>Ontologies often organise concepts into hierarchies or taxonomies, but they can also include logical constraints or rules: for example, one could declare “Every Order must have at least one Product item.”</p><p>Why do ontologies matter, you may ask? Well, an ontology provides a <strong>shared understanding</strong> of a domain, which is incredibly useful when integrating data from multiple sources or when building AI systems that need to reason about the domain. By defining a common set of entity types and relationships, an ontology ensures that different teams or systems refer to things consistently. For example, if one dataset calls a person a “Client” and another calls them “Customer,” mapping both to the same ontology class (say Customer as a subclass of Person) lets you merge that data seamlessly.</p><p>In the context of AI and GraphRAG, an ontology is the <strong>blueprint for the knowledge graph</strong> — it dictates what kinds of nodes and links your graph will have. This is crucial for complex reasoning. 
If your chatbot knows that “Amazon” in the context of your application is a Company (not a river) and that Company is defined in your ontology (with attributes like headquarters, CEO, etc., and relationships like hasSubsidiary), it can ground its answers much more precisely.</p><p>Now that we know about knowledge graphs and ontologies, let’s see how we put it all together in a RAG-like pipeline.</p><h3>GraphRAG</h3><p><strong>GraphRAG</strong> is an evolution of the traditional RAG approach that explicitly incorporates a knowledge graph into the retrieval process. In GraphRAG, when a user asks a question, the system doesn’t just do a vector similarity search over text; it also queries the knowledge graph for relevant entities and relationships.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*mw-TdYTmNjhwDvXHsgswAw.png" /><figcaption><strong>Figure 4</strong>: GraphRAG Pipeline.</figcaption></figure><p>Let’s walk through a typical GraphRAG pipeline at a high level:</p><ol><li><strong>Indexing knowledge</strong>: Both structured data (e.g., databases, CSV files) and unstructured data (e.g., documents) are taken as input. Structured data goes through data transformation, converting table rows to triples. Unstructured data is broken down into manageable text chunks. Entities and relationships are extracted from these chunks and, simultaneously, embeddings are calculated to create triples with embeddings.</li><li><strong>Question Analysis and Embedding</strong>: The user’s query is analyzed to identify key terms or entities. These elements are embedded with the same embedding model used for indexing.</li><li><strong>Graph Search</strong>: The system queries the knowledge graph for any nodes related to those key terms. 
Instead of retrieving only semantically similar items, the system also leverages relationships.</li><li><strong>Generation with Graph Context</strong>: A generative model uses the user’s query and the retrieved graph-enriched context to produce an answer.</li></ol><p>Under the hood, GraphRAG can use various strategies to integrate the graph query. The system might first do a semantic search for top-K text chunks <em>as usual</em>, then <strong>traverse the graph neighborhood</strong> of those chunks to gather additional context, before generating the answer. This ensures that if relevant info is spread across documents, the graph will help pull in the connecting pieces. In practice, GraphRAG might involve extra steps like entity disambiguation (to make sure the “Apple” in the question is linked to the right node, either Company or Fruit) and graph traversal algorithms to expand the context. But the high-level picture is as described: <strong>search + graph lookup</strong> instead of search alone.</p><p>Overall, for non-technical readers, you can think of GraphRAG as giving the AI a <strong>“brain-like” knowledge network</strong> in addition to the library of documents. Instead of reading each book (document) in isolation, the AI also has an encyclopedia of facts and how those facts relate. For technical readers, you might imagine an architecture where we have both a vector index and a graph database working in tandem — one retrieving raw passages, the other retrieving structured facts, both feeding into the LLM’s context window.</p><h3>Building a Knowledge Graph for RAG: Approaches</h3><p>There are two broad ways to build the knowledge graph that powers a GraphRAG system: a <strong>Top-Down approach</strong> or a <strong>Bottom-Up approach</strong>. 
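Sketching the “search + graph lookup” idea described above: after a normal similarity search returns its top chunks, the system expands each hit’s graph neighborhood and adds those facts to the prompt context. All entity names and links here are invented for illustration:

```python
# Each text chunk is linked to the graph entities it mentions
# (a real system would establish these links during indexing).
chunk_entities = {
    "Q1 sales grew 12 percent.": ["Metric_Sales"],
    "Marketing spend in Q1 was flat.": ["Metric_MarketingSpend"],
}

# Knowledge graph: node -> list of (relation, neighbor) edges.
graph = {
    "Metric_Sales": [("reported_for", "Q1_2025")],
    "Metric_MarketingSpend": [("reported_for", "Q1_2025")],
}

def graph_enrich(retrieved_chunks):
    """Expand retrieved chunks with 1-hop facts from the knowledge graph."""
    facts = set()
    for chunk in retrieved_chunks:
        for entity in chunk_entities.get(chunk, []):
            for relation, neighbor in graph.get(entity, []):
                facts.add(f"{entity} --{relation}--> {neighbor}")
    return sorted(facts)

# Pretend the vector search already returned these top chunks:
hits = list(chunk_entities)
context = "\n".join(hits + graph_enrich(hits))
print(context)
```

Notice how the graph supplies the connecting piece — both metrics are reported for the same quarter — that plain similarity search would leave for the LLM to guess.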
They’re not mutually exclusive (often you might use a bit of both), but it’s helpful to distinguish them.</p><h3>Approach 1: Top-Down (Ontology First)</h3><p>The top-down approach begins by defining the domain’s ontology before adding any data. Domain experts or industry standards help establish the classes, relationships, and rules. This schema, loaded into a graph database as empty scaffolding, then guides data extraction and organization, acting as a blueprint.</p><p>Once the ontology (schema) is in place, the next step is to <strong>instantiate</strong> it with real data. There are a few sub-approaches here:</p><ul><li><strong>Using Structured Sources</strong>: If you have existing structured databases or CSV files, you map those to the ontology. This can sometimes be done via automated ETL tools that convert SQL tables to graph data if the mapping is straightforward.</li><li><strong>Extracting from Text via Ontology</strong>: For unstructured data (like documents, PDFs, etc.), you would use NLP techniques but <em>guided by the ontology</em>. This often involves writing extraction rules or using an LLM with prompts that reference the ontology’s terms.</li><li><strong>Manual or Semi-Manual Curation</strong>: In critical domains, a human might verify each extracted triple or manually input some data into the graph, especially if it’s a one-time setup of key knowledge. For example, a company might manually input its org chart or product hierarchy into the graph according to the ontology, because that data is relatively static and very important.</li></ul><p>The key is that with a top-down approach, <strong>the ontology acts as a guide</strong> at every step. It tells your extraction algorithms what to look for and ensures the data coming in fits a coherent model.</p><p>One big advantage of using a formal ontology is that you can leverage <strong>reasoners and validators</strong> to keep the knowledge graph consistent. 
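As a hand-rolled illustration of this kind of validation — not SHACL or a real reasoner, just the underlying idea — incoming triples can be checked against the ontology’s domain/range rules before they enter the graph. The ontology fragment and entities below are made up:

```python
# Ontology fragment: relation -> (allowed subject type, allowed object type).
ontology = {
    "placed":   ("Customer", "Order"),
    "contains": ("Order", "Product"),
}

# Type assertions for the entities in our (toy) graph.
entity_types = {
    "Customer_42": "Customer",
    "Order_7": "Order",
    "Product_X": "Product",
}

def validate(triple):
    """Return None if the triple fits the ontology, else an error message."""
    subject, relation, obj = triple
    if relation not in ontology:
        return f"unknown relation: {relation}"
    want_subj, want_obj = ontology[relation]
    if entity_types.get(subject) != want_subj:
        return f"{subject} is not a {want_subj}"
    if entity_types.get(obj) != want_obj:
        return f"{obj} is not a {want_obj}"
    return None

print(validate(("Customer_42", "placed", "Order_7")))  # conforms to the schema
print(validate(("Product_X", "placed", "Order_7")))    # violates the domain rule
```

Real validators (SHACL engines, OWL reasoners) go much further — inferring types, checking cardinality rules like “every Order must have at least one Product item” — but the principle is the same: the schema rejects facts that cannot be true.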
Ontology reasoners can automatically infer new facts or check for logical inconsistencies, while tools like SHACL enforce data shape rules (similar to richer database schemas). These checks prevent contradictory facts and enrich the graph by automatically deriving relationships. In GraphRAG, this means answers can be found even if multi-hop connections aren’t explicit, as the ontology helps derive them.</p><h3>Approach 2: Bottom-Up (Data First)</h3><p>The bottom-up approach seeks to generate knowledge graphs directly from data, without relying on a predefined schema. Advances in NLP and LLMs enable the extraction of structured triples from unstructured text, which can then be ingested into a graph database where entities form nodes and relationships form edges.</p><p>Under the hood, bottom-up extraction can combine classical NLP and modern LLMs:</p><ul><li><strong>Named Entity Recognition (NER)</strong>: Identify names of people, organizations, places, etc., in text.</li><li><strong>Relation Extraction (RE)</strong>: Identify if any of those entities have a relationship mentioned.</li><li><strong>Coreference Resolution</strong>: Figure out the referent of a pronoun in a passage, so the triple can use the full name.</li></ul><p>There are libraries like <a href="https://spacy.io/">spaCy</a> or <a href="https://flairnlp.github.io/">Flair</a> for the traditional approach, and newer libraries that integrate LLM calls for IE (Information Extraction). Also, techniques like ChatGPT plugins or LangChain agents can be set up to populate a graph: the agent could iteratively read documents and call a “graph insert” tool as it finds facts. Another interesting strategy is using LLMs to <strong>suggest the schema</strong> by reading a sample of documents (this edges towards ontology generation, but bottom-up).</p><p>A big caution with bottom-up extraction is that <strong>LLMs can be imperfect or even “creative”</strong> in what they output. 
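One cheap validation pass — assuming the extractor returns (subject, relation, object) string triples, which is an illustrative format rather than any particular library’s output — is to reject triples whose entities never appear in the source text:

```python
def grounded(triple, source_text):
    """Keep a triple only if both entities literally occur in the source.

    A crude guard against hallucinated extractions; a real pipeline would
    also normalise aliases and verify the relation itself is supported.
    """
    subject, _, obj = triple
    text = source_text.lower()
    return subject.lower() in text and obj.lower() in text

source = "GoodData was founded by Roman Stanek in 2007."
extracted = [
    ("GoodData", "founded_by", "Roman Stanek"),  # supported by the text
    ("GoodData", "acquired", "ExampleCorp"),     # invented by the model
]
kept = [t for t in extracted if grounded(t, source)]
print(kept)
```

This is only a first filter — it catches entities the model invented outright, while relation-level errors still need the cross-checking and human review described below.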
They might hallucinate a relationship that wasn’t actually stated, or they might mis-label an entity. Therefore, an important step is validation:</p><ul><li>Cross-check critical facts against the source text.</li><li>Use multiple passes: e.g., first pass for entities, second pass just to verify and fill relations.</li><li>Human spot-checking: Have humans review a sample of the extracted triples, especially those that are going to be high impact.</li></ul><p>The process is typically iterative. You run the extraction, find errors or gaps, adjust your prompts or filters, and run again. Over time, this can dramatically refine the knowledge graph quality. The good news is that even with some errors, the knowledge graph can still be useful for many queries — and you can prioritize cleaning the parts of the graph that matter most for your use cases.</p><p>Finally, keep in mind that sending text for extraction exposes your data to the LLM/service, so you should ensure compliance with privacy and retention requirements.</p><h3>Tools and Frameworks in the GraphRAG Ecosystem</h3><p>Building a GraphRAG system might sound daunting, you need to manage a vector database, a graph database, run LLM extraction pipelines, etc. The good news is that the community is developing tools to make this easier. Let’s briefly mention some of the tools and frameworks that can help, and what role they play.</p><h3>Graph Storage</h3><p>First, you’ll need a place to store and query your knowledge graph. 
Traditional graph databases like <a href="https://neo4j.com"><strong>Neo4j</strong></a>, <a href="https://aws.amazon.com/neptune/"><strong>Amazon Neptune</strong></a>, <a href="https://www.tigergraph.com"><strong>TigerGraph</strong></a>, or RDF triplestores (like <a href="https://graphdb.ontotext.com">GraphDB</a> or <a href="https://www.stardog.com">Stardog</a>) are common choices.</p><p>These databases are optimized for exactly the kind of operations we discussed:</p><ul><li>traversing relationships</li><li>finding neighbors</li><li>executing graph queries</li></ul><p>In a GraphRAG setup, the retrieval pipeline can use such queries to fetch relevant subgraphs. Some vector databases (like <a href="https://milvus.io">Milvus</a> or Elasticsearch with <a href="https://www.elastic.co/elasticsearch/graph">Graph plugin</a>) are also starting to integrate graph-like querying, but generally, a specialized graph DB offers the richest capabilities. The important thing is that your graph store should allow efficient retrieval of both direct neighbors and multi-hop neighborhoods, since a complex question might require grabbing a whole network of facts.</p><h3>Emerging Tools</h3><p>New tools are emerging to combine graphs with LLMs:</p><ul><li><a href="https://www.cognee.ai"><strong>Cognee</strong></a> — An open-source “AI memory engine” that builds and uses knowledge graphs for LLMs. It acts as a semantic memory layer for agents or chatbots, turning unstructured data into structured graphs of concepts and relationships. LLMs can then query these graphs for precise answers. Cognee hides graph complexity: developers only need to provide data, and it produces a graph ready for queries. It integrates with graph databases and offers a pipeline for ingesting data, building graphs, and querying them with LLMs.</li><li><a href="https://help.getzep.com/graphiti"><strong>Graphiti (by Zep AI)</strong></a> — A framework for AI agents needing <strong>real-time, evolving memory</strong>. 
Unlike many RAG systems with static data, Graphiti updates knowledge graphs incrementally as new information arrives. It stores both facts and their temporal context, using Neo4j for storage and offering an agent-facing API. Unlike earlier batch-based GraphRAG systems, Graphiti handles streams efficiently with incremental updates, making it suited for long-running agents that learn continuously. This ensures answers always reflect the latest data.</li><li><strong>Other frameworks</strong> — Tools like <a href="https://www.llamaindex.ai"><strong>LlamaIndex</strong></a> and <a href="https://haystack.deepset.ai"><strong>Haystack</strong></a> add graph modules without being graph-first. LlamaIndex can extract triplets from documents and support graph-based queries. Haystack experimented with integrating graph databases to extend question answering beyond vector search. Cloud providers are also adding graph features: <a href="https://aws.amazon.com/bedrock/knowledge-bases/"><strong>AWS Bedrock Knowledge Bases</strong></a> supports GraphRAG with managed ingestion into Neptune, while <strong>Azure Cognitive Search </strong>integrates with graphs. The ecosystem is evolving quickly.</li></ul><h3>No Need to Reinvent the Wheel</h3><p>The takeaway is that if you want to experiment with GraphRAG, you don’t have to build everything from scratch. 
You can:</p><ul><li>Use <strong>Cognee</strong> to handle knowledge extraction and graph construction from your text (instead of writing all the prompts and parsing logic yourself).</li><li>Use <strong>Graphiti</strong> if you need a plug-and-play memory graph especially for an agent that has conversations or time-based data.</li><li>Use <strong>LlamaIndex</strong> or others to get basic KG extraction capabilities with just a few lines of code.</li><li>Rely on proven <strong>graph databases</strong> so you don’t have to worry about writing a custom graph traversal engine.</li></ul><p>In summary, while GraphRAG is at the cutting edge, the surrounding ecosystem is rapidly growing. You can leverage these libraries and services to stand up a prototype quickly, then iteratively refine your knowledge graph and prompts.</p><h3>Conclusion</h3><p>Traditional RAG works well for simple fact lookups, but struggles when queries demand deeper reasoning, accuracy, or multi-step answers. This is where <strong>GraphRAG</strong> excels. By combining documents with a knowledge graph, it grounds responses in structured facts, reduces hallucinations, and supports multi-hop reasoning. Thus enabling AI to connect and synthesize information in ways standard RAG cannot.</p><p>Of course, this power comes with trade-offs. Building and maintaining a knowledge graph requires schema design, extraction, updates, and infrastructure overhead. For straightforward use cases, traditional RAG remains the simpler and more efficient choice. But when richer answers, consistency, or explainability matter, GraphRAG delivers clear benefits.</p><p>Looking ahead, knowledge-enhanced AI is evolving rapidly. Future platforms may generate graphs automatically from documents, with LLMs reasoning directly over them. 
For companies like <a href="https://www.gooddata.com"><strong>GoodData</strong></a>, GraphRAG bridges AI with analytics, enabling insights that go beyond “what happened” to “why it happened.”</p><p>Ultimately, GraphRAG moves us closer to AI that doesn’t just retrieve facts, but truly understands and reasons about them, like a human analyst, but at scale and speed. While the journey involves complexity, the destination (more accurate, explainable, and insightful AI) is well worth the investment. The key lies in not just collecting facts, but connecting them.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=01854d9fe7c3" width="1" height="1" alt=""><hr><p><a href="https://medium.com/gooddata-developers/from-rag-to-graphrag-knowledge-graphs-ontologies-and-smarter-ai-01854d9fe7c3">From RAG to GraphRAG: Knowledge Graphs, Ontologies and Smarter AI</a> was originally published in <a href="https://medium.com/gooddata-developers">GoodData Developers</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Analytics as Code with Cursor: How do you make the most out of it?]]></title>
            <link>https://medium.com/gooddata-developers/analytics-as-code-with-cursor-how-do-you-make-the-most-out-of-it-f9870cd44557?source=rss----e8ee419648ea---4</link>
            <guid isPermaLink="false">https://medium.com/p/f9870cd44557</guid>
            <category><![CDATA[developer-experience]]></category>
            <category><![CDATA[analytics]]></category>
            <category><![CDATA[ai]]></category>
            <dc:creator><![CDATA[Andrii Chumak]]></dc:creator>
            <pubDate>Wed, 06 Aug 2025 11:07:35 GMT</pubDate>
            <atom:updated>2025-08-06T15:46:00.986Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*zGdwn0bNB1OPaYoUnUu4Kg.png" /></figure><p><strong>Analytics as Code</strong> is slowly becoming the norm for analytics solutions, just like Infrastructure as Code or Security as Code. The level of confidence in your solution that you get from version control and CI/CD pipelines is hard to beat.</p><p>Meanwhile, smart IDEs, like <strong>Cursor</strong> and <strong>Windsurf</strong>, are getting more and more popular, helping software engineers be more productive. For data analysts, Cursor can help write SQL or Python code. It’s generally good with widespread technologies and languages, as there is a lot of data to train on. But how can you take advantage of AI features for more niche technologies like Analytics as Code?</p><p>In this article, I’ll show you how to make the most of any AI-powered IDE when working with Analytics as Code by leveraging <strong>rule files</strong>, <strong>VS Code extensions</strong>, and <strong>MCP servers</strong>.</p><h3>Out-of-the-Box Cursor Experience</h3><p>I must admit, when I first tried using our Analytics as Code with Cursor, I was worried we’d have to make a lot of changes to our setup to take advantage of AI features in full. But the out-of-the-box experience was surprisingly good.</p><h4>Work by Example</h4><p>First of all, <strong>AI coding agents are quite good at creating new code by example</strong>. If you’re not starting from scratch and already have some analytics in the workspace, the chances are Cursor will work great for you. It helps that our Analytics as Code syntax is based on YAML, and AI models know how to write a valid YAML file already — that’s included in the training. 
They only need to figure out the correct schema for our specific use case.</p><h4>VS Code Extension</h4><p>The second contributing factor is that we already have a VS Code extension — it works in Cursor and Windsurf, as they’re both based on VS Code.</p><p>The extension is available in both the <a href="https://marketplace.visualstudio.com/items?itemName=GoodData.gooddata-vscode">Microsoft Marketplace for VS Code</a> and the <a href="https://open-vsx.org/extension/GoodData/gooddata-vscode">OpenVSX Marketplace</a> for Cursor. Just look for “GoodData” in the extensions tab.</p><p>Our extension does a lot of things, from syntax highlighting to autocomplete, reference resolution, and previews. Most importantly, it provides schema validation and reference integrity validation. <strong>Cursor listens to any validation errors the extension produces and is capable of fixing them automatically.</strong></p><figure><img alt="GoodData for VS Code highlighting referential error" src="https://cdn-images-1.medium.com/max/1024/0*kZCitlt9J-skB2mV" /><figcaption><em>GoodData for VS Code highlighting referential error</em></figcaption></figure><p>Let’s say you’re building a new visualization: Cursor will look at any existing visualizations for examples and produce a new one. Sure, it may hallucinate a reference to some non-existing metric, but then our extension will highlight the error and suggest a list of valid metrics that can be used instead. 
And since Cursor works iteratively, it can choose the right metric and fix the code before handing control back to you.</p><h3>A Better Cursor Experience</h3><p>But we can take this even further by leveraging more advanced Cursor features: <strong>rule files</strong> and <strong>MCP server integration</strong>.</p><p>It’s worth noting that most of the features described here will also work in other VS Code-based smart IDEs, like Windsurf.</p><h4>Cursor Rules</h4><p>Rule files are designed to let developers <strong>give Cursor more context about the workspace</strong>. A rule file is a simple Markdown file where you can describe how to work with certain file types, how the project is structured, how to perform a specific task, and so on. All the things you probably should have had in the internal documentation a long time ago, but were too lazy to write down.</p><p>Rule files are the perfect place to provide Cursor with examples, reducing its dependence on pre-existing items in the workspace, as well as to cover rare cases that you likely don’t have in the workspace just yet.</p><h4>MCP Server</h4><p>I was <a href="https://www.gooddata.com/blog/everyone-s-talking-about-mcp-i-built-a-server-to-see-if-it-s-ready/">checking out Model Context Protocol (MCP)</a> and noticed a lot of commonalities with Language Server Protocol (LSP). Both protocols provide a useful abstraction for the communication between server and client, as well as feature discoverability, authentication, and transport for the messages. 
This means you can implement your LLM app as an MCP client without needing to care whether the MCP servers you’re connecting to are written in different programming languages, running locally, or deployed on a third-party server.</p><p><strong>This makes an MCP server a perfect companion for AI-driven development in Cursor, as it complements the Language Server.</strong> It can provide additional context and trigger (read-only) actions on the user’s behalf — all good things for your productivity.</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FFe9z8Ugy80A%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DFe9z8Ugy80A&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FFe9z8Ugy80A%2Fhqdefault.jpg&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/c68cfa6d974dfd0015e2b0215e405ca5/href">https://medium.com/media/c68cfa6d974dfd0015e2b0215e405ca5/href</a></iframe><p>Imagine you have a new table added to the data source. With well-defined rules, the Language Server, and the MCP server, Cursor can automatically:</p><ul><li>Scan the database model to review the new table.</li><li>Sample the data to understand the contents.</li><li>Create the dataset definition.</li><li>Validate it and fix any issues.</li><li>Run the dataset preview to verify the result.</li></ul><h3>We’ve Got You Covered</h3><p>Starting with <strong>v0.14.0</strong>, the GoodData for VS Code extension comes with batteries included for AI-assisted development.</p><h4>New Project Initialization and Cursor Rules</h4><p>When starting a new project, you can now pass a --cursor parameter to add boilerplate Cursor rules and a configuration for the MCP server to let Cursor know how to connect to it. 
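For illustration, a rule file generated this way might look roughly like the sketch below. The description, globs, paths, and rules are hypothetical, not the actual boilerplate the extension ships:

```markdown
---
description: Working with GoodData Analytics as Code YAML files
globs: "analytics/**/*.yaml"
---

- Visualizations, metrics, and dashboards are YAML files validated by the
  GoodData for VS Code extension; never ignore its validation errors.
- When creating a new item, follow the structure of an existing file of the
  same type instead of inventing a schema.
- Never guess metric or dataset identifiers; look them up in existing files
  or ask the MCP server for the list of valid references.
```

The point is that the rules encode workspace conventions Cursor cannot infer from a single open file.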
See all commands and options <a href="https://www.gooddata.com/docs/cloud/api-and-sdk/vs-code-extension/cli/">in our documentation</a>.</p><p>The rule files are a good starting point for any analytics project, but feel free to edit those as you identify any gaps in Cursor’s understanding of your code.</p><h4>Bundled MCP Server</h4><figure><img alt="GoodData for VS Code MCP Server connected to Cursor" src="https://cdn-images-1.medium.com/max/1024/0*z7r0j4m-ergNESd1" /><figcaption><em>GoodData for VS Code MCP Server connected to Cursor</em></figcaption></figure><p>The extension itself now comes with the MCP server bundled in. Once enabled and configured in Cursor, it will provide tools for database schema scanning, all kinds of previews, as well as shortcuts for workspace deploy and clone commands.</p><h3>Conclusion</h3><p>AI-assisted software engineering is in a volatile state these days. Some people swear by it, others hate it. Some say it’s a productivity booster, while others call it a time-waster. It’s not clear what final form it will take when all is said and done, but it’s quite clear that software engineering is changing. 
<strong>And, more importantly, data analysts should take note, or risk being left behind.</strong></p><h4>Want to learn more?</h4><p>If you’d like to learn more about our different AI initiatives, you can read some of our other articles, such as <a href="https://www.gooddata.com/blog/why-ai-in-analytics-needs-metadata/">Why AI in Analytics Needs Metadata</a>, or browse our other technical articles.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=f9870cd44557" width="1" height="1" alt=""><hr><p><a href="https://medium.com/gooddata-developers/analytics-as-code-with-cursor-how-do-you-make-the-most-out-of-it-f9870cd44557">Analytics as Code with Cursor: How do you make the most out of it?</a> was originally published in <a href="https://medium.com/gooddata-developers">GoodData Developers</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Evaluation of AI Applications In Practice]]></title>
            <link>https://medium.com/gooddata-developers/evaluation-of-ai-applications-in-practice-ecbb5a97d878?source=rss----e8ee419648ea---4</link>
            <guid isPermaLink="false">https://medium.com/p/ecbb5a97d878</guid>
            <category><![CDATA[llm-as-a-judge]]></category>
            <category><![CDATA[llm]]></category>
            <category><![CDATA[ai-evaluation]]></category>
            <category><![CDATA[ai]]></category>
            <dc:creator><![CDATA[Jakub Zovak]]></dc:creator>
            <pubDate>Wed, 06 Aug 2025 09:20:31 GMT</pubDate>
            <atom:updated>2025-08-06T09:20:31.310Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*1bedRtL6Oo8rFnwOny1_hg.png" /></figure><p>In recent years, many companies have rushed to develop AI features or products, yet most initiatives fail to progress beyond the proof-of-concept stage.</p><p>The main reason behind the failure to productize these applications is often a lack of evaluation and proper data science. The so-called “vibe” development paradigm can only get you so far before things start to fall apart and you begin to feel like you are building something on top of quicksand.</p><p>In this article, I describe how to address these problems by discussing the proper evaluation of applications that utilize LLMs and how to leverage this knowledge to produce reliable solutions.</p><h3>Lifecycle Of AI Application Development</h3><p>There is nothing wrong with writing custom prompts without any evaluation when starting a new product or feature. On the contrary, I would argue that it is a preferred approach when you need to create a proof-of-concept as quickly as possible. However, after a while, you will find that it is insufficient, especially when you begin to transition to the production phase. Without proper evaluation, you will end up going in circles from one regression to another.</p><p>Moreover, you do not have any data to support the reliability of your solution. Therefore, after the initial phase of prototyping your idea, you should move to implementing the evaluation of your AI application. With the evaluation in place, you can gain confidence in how your solution performs and improve it iteratively without reintroducing bugs or undesired behavior.</p><p>Furthermore, you can move on to another step in AI app development and use prompt optimizers such as <a href="https://arxiv.org/pdf/2310.03714">DSPy</a> and discard manual prompt tuning completely. 
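The “improve iteratively without reintroducing bugs” step boils down to gating every change on the evaluation score. A minimal sketch of such a gate (all names are hypothetical; no specific framework is assumed):

```python
from typing import Callable

def evaluate(app: Callable[[str], str], dataset: list[tuple[str, str]]) -> float:
    """Score an application variant: the fraction of dataset items where the
    app's output exactly matches the ground-truth expected output."""
    return sum(app(inp) == expected for inp, expected in dataset) / len(dataset)

def accept_change(baseline: float, candidate: float, tolerance: float = 0.0) -> bool:
    """Gate: keep a prompt/config change only if it does not regress the
    evaluation score by more than the allowed tolerance."""
    return candidate >= baseline - tolerance
```

With such a gate in a CI pipeline, a prompt tweak that drops the score below the baseline is rejected instead of shipped.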
You can see this lifecycle visualized below:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*9zGy0q2Iigo2qZIn" /><figcaption>Development life cycle of an AI application</figcaption></figure><h3>Evaluation of AI Applications</h3><p>Evaluating AI applications differs significantly from traditional software testing or data science validation. These systems, often referred to as <a href="https://www.ycombinator.com/library/MW-andrej-karpathy-software-is-changing-again">Software 3.0</a>, blend traditional software engineering with data science. As a result, the evaluation does not focus on the underlying LLM, nor does it resemble standard unit testing. Instead, it assesses the behavior of an application built on top of various AI models, such as LLMs, embedding models, and rerankers.</p><p>The primary objective is to evaluate the configuration of the complete AI system. This might include RAG pipelines (retrieval and reranking stages), prompt templates (e.g., structured instructions, few-shot examples, prompting strategies), and any surrounding pre-/post-processing logic. This article focuses specifically on the evaluation of the LLM components (i.e., prompt templates) of such applications. The evaluation of RAG pipelines falls under the domain of information retrieval and deserves a separate article.</p><p>To conduct a meaningful evaluation, three core elements are needed:</p><ul><li>A <strong>dataset</strong> with ground truth outputs, if available,</li><li>Appropriate <strong>evaluation metrics</strong> that reflect desired behavior,</li><li>An <strong>evaluation infrastructure</strong> to run and monitor the evaluation process.</li></ul><h3>Datasets</h3><p>To evaluate an AI application, you need a dataset with expected outputs, also called ground truth. The hardest part is often getting the initial dataset. 
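As an illustration, an initial ground-truth dataset can be as small as a few hand-written input/output pairs (the items below are made up):

```python
# A tiny hand-written ground-truth dataset. Each item pairs an input with the
# output we expect from the AI application. Contents are purely illustrative.
dataset = [
    {"input": "Total revenue last quarter?",
     "expected": "SELECT SUM(revenue) FROM sales WHERE quarter = 'Q4'"},
    {"input": "How many active users do we have?",
     "expected": "SELECT COUNT(*) FROM users WHERE active = TRUE"},
]
```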
Fortunately, even a tiny dataset can help you meaningfully tune your application and check if it behaves as expected.</p><p>There are three main ways to obtain a dataset. First, you can manually write a few input-output pairs. This helps clarify exactly what you expect from the application, rather than relying on vague specifications. Second, if your company policy allows and the application is already running, you can use user interactions with positive feedback to extend the dataset. Lastly, you can use an LLM to generate synthetic examples from crafted prompts or existing dataset items, but always review these carefully before using them.</p><h3>Evaluation Metrics</h3><p>Choosing the right metrics is crucial when performing an evaluation to determine whether your AI application behaves as you expect. Metrics can assess the output on their own (e.g., politeness, toxicity, contextual relevance) or measure how closely it aligns with the expected result. Broadly, these evaluation metrics fall into three categories: analytical metrics (commonly used in traditional ML), deterministic assertions (akin to unit tests), and LLM-as-a-Judge (a newer approach using an LLM for evaluation).</p><p>A common mistake is to start with LLM-as-a-Judge and use it for every aspect of the evaluation. While LLM-as-a-Judge is often selected for its ease of use, this approach comes with significant drawbacks. These include the cost and latency of calling the judge itself, as well as the uncertainty it introduces into the evaluation.</p><p><strong>Therefore, it is recommended to use LLM-as-a-Judge only as a last resort, when traditional approaches such as analytical metrics or deterministic assertions are not enough. </strong>It is beneficial to consider these metrics in a similar manner to unit, integration, and E2E tests, where E2E tests are akin to LLM-as-a-Judge since they have the highest cost. 
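To make the two cheaper tiers concrete, here is a standard-library sketch of one analytical metric (perplexity, computed from token log-probabilities) and one deterministic assertion; the function names are my own, not from any framework:

```python
import json
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Exponentiated average negative log-likelihood of a token sequence.
    Lower values mean the model was more confident in its own output."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def assert_valid_answer(output: str, required_keys: set[str]) -> None:
    """Deterministic assertion: the output must be valid JSON containing
    all required keys. Raises on malformed JSON or missing keys."""
    data = json.loads(output)
    missing = required_keys - data.keys()
    assert not missing, f"missing keys: {missing}"
```

An LLM-as-a-Judge check would sit on top of these, invoked only for the aspects the cheaper tiers cannot cover.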
Here are these metric types visualized:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/750/0*e1TjDmqWt_U8cjsF" /><figcaption>A pyramid illustrating the use of metrics according to their cost.</figcaption></figure><h3>Analytical Metrics</h3><p>Analytical metrics are quantitative functions that assign a numerical score to the output of the AI application. These metrics are present at the bottom of our pyramid since they are broadly applicable to all test cases with minimal implementation effort or cost. Let’s describe these metrics, explain how to use them, and discuss their interpretation. Note that the selection of a metric always depends on the specific use case you are evaluating.</p><p>Let’s list commonly used analytical metrics:</p><h4><a href="https://huggingface.co/docs/transformers/en/perplexity"><strong>Perplexity</strong></a></h4><ul><li><strong>Explanation:</strong> Perplexity measures how well the model predicts the sequence of tokens. It is defined as the exponentiated average negative log-likelihood of a token sequence.</li><li><strong>Interpretation:</strong> The lower the perplexity, the more confident the LLM is in its prediction.</li><li><strong>Use Case</strong>: It is a good practice to track Perplexity every time you have access to the token output probabilities.</li></ul><h4>Cosine Similarity</h4><ul><li><strong>Explanation: </strong>Cosine similarity measures how similar two embedding vectors are. These vectors are produced by encoder models (e.g., BERT) trained to capture the semantic meaning of sentences. 
The similarity corresponds to the cosine of the angle between the vectors.</li><li><strong>Interpretation:</strong> The cosine similarity score can be challenging to interpret because it depends significantly on the underlying embedding models and the distribution of scores they produce.</li><li><strong>Use Case: </strong>Due to the difficulty of interpretation, I would not recommend relying on this measure when doing iterative development, but it can be leveraged in automatic prompt optimization frameworks.</li></ul><h4>NLP Metrics (<a href="https://aclanthology.org/P02-1040.pdf"><strong>BLEU</strong></a><strong>, </strong><a href="https://aclanthology.org/W04-1013.pdf"><strong>ROUGE</strong></a><strong>, </strong><a href="https://aclanthology.org/W05-0909.pdf"><strong>METEOR</strong></a><strong>)</strong></h4><ul><li><strong>Explanation: </strong>Traditional NLP metrics compare two texts using token-level overlaps, n-grams, or the number of edits needed to transform one into the other.</li><li><strong>Interpretation: </strong>The result of these metrics is usually normalized between zero and one, where one is the best possible score.</li><li><strong>Use Case: </strong>These types of metrics are ideal for relatively short texts with lower variability.</li></ul><h4>Other</h4><ul><li>Apart from the aforementioned metrics, you can track a host of other aspects of the generation, such as the number of tokens, the number of reasoning tokens, latency, cost, etc.</li></ul><h4>Our Evaluation Infrastructure</h4><p>To properly evaluate a compound AI system with multiple interdependent components that take advantage of LLMs, it is essential to have these <strong>LLM components </strong>encapsulated so they can be easily tested and evaluated.</p><p>This approach was inspired by the chapter “Design Your Evaluation Pipeline” from the book <a href="https://github.com/chiphuyen/aie-book/tree/main?tab=readme-ov-file#about-the-book">AI Engineering by Chip 
Huyen.</a></p><p>In our evaluation infrastructure, each LLM component has its own dataset and evaluation pipeline. You can think of the LLM component as an arbitrary machine learning model that is being evaluated. This separation of components is necessary in a complex application like ours, which focuses on an AI-assisted analytics use case, because evaluating such a system end-to-end can be extremely challenging.</p><p>To evaluate each of these components, we use the following tools:</p><h4><a href="https://langfuse.com/">Langfuse</a></h4><ul><li>An LLM observability framework that is used primarily to track and log user interactions with an AI application</li><li>Additionally, it supports dataset and experiment tracking, which we employ in our infrastructure.</li></ul><h4><a href="https://docs.pytest.org/en/stable/">Pytest</a></h4><ul><li>A minimalistic framework for running unit tests</li><li>We use it as our script runner when evaluating different LLM components</li></ul><h4><a href="https://github.com/confident-ai/deepeval">DeepEval</a></h4><ul><li>An LLM evaluation framework that implements evaluation metrics</li><li>We use it mainly for its <a href="https://docs.google.com/document/d/142tMeTqMcjBzSJ-7vCKS_8IQDYuY_9t35UORNeogf70/edit?tab=t.0">Conversational G-Eval</a> implementation</li></ul><p>For each component, we have precisely one test. Each test is parametrized using the <a href="https://docs.pytest.org/en/stable/how-to/parametrize.html#basic-pytest-generate-tests-example">pytest_generate_tests</a> function, so it runs once for each item of the dataset. The whole infrastructure setup using these tools is visualized below:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*T1ADZaGhdDZHfuzd" /><figcaption>Visualization of our evaluation setup.</figcaption></figure><p>The results of the evaluation of the specific LLM component are logged to Langfuse, as shown in the next image. 
As you can see, we are using G-Eval LLM-as-a-Judge. We threshold the G-Eval scores to determine if the output is correct. On top of that, we are tracking the perplexity of the model. If perplexity values start to spike, it can be an indication that something might be wrong in the configuration of the LLM component.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*r3HwAYSypgj-Zt7H" /><figcaption>Result of the evaluation of the specific component in Langfuse.</figcaption></figure><h4>Conclusion</h4><p>Evaluation is essential for building reliable and production-ready AI applications. Compared to traditional unit testing or model evaluation, evaluating AI systems presents its own unique challenges. The first step is always creating or gathering a dataset that matches your business goals and helps guide improvements. Then, selecting the right metrics is crucial to understanding how effectively your system performs. With these foundations in place, you can apply the ideas through a practical evaluation setup, as described in this article. I hope it helps you take the next step in evaluating your AI application.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=ecbb5a97d878" width="1" height="1" alt=""><hr><p><a href="https://medium.com/gooddata-developers/evaluation-of-ai-applications-in-practice-ecbb5a97d878">Evaluation of AI Applications In Practice</a> was originally published in <a href="https://medium.com/gooddata-developers">GoodData Developers</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Make Key Driver Analysis Smarter with Automation]]></title>
            <link>https://medium.com/gooddata-developers/make-key-driver-analysis-smarter-with-automation-559a32e255f5?source=rss----e8ee419648ea---4</link>
            <guid isPermaLink="false">https://medium.com/p/559a32e255f5</guid>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[automation]]></category>
            <category><![CDATA[kda]]></category>
            <category><![CDATA[analytics]]></category>
            <dc:creator><![CDATA[Štěpán Machovský]]></dc:creator>
            <pubDate>Mon, 28 Jul 2025 08:44:11 GMT</pubDate>
            <atom:updated>2025-07-28T11:49:23.487Z</atom:updated>
            <content:encoded><![CDATA[<p><strong><em>What is Key Driver Analysis, and why should you care?</em></strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*cu3Y67cjIMcuYE6atgGrkg.png" /></figure><p>Key Driver Analysis (KDA) identifies the primary factors that influence changes in your data, enabling informed and timely decisions. Imagine managing an ice cream shop: if your suppliers’ ice cream prices spike unexpectedly, you’d want to quickly pinpoint the reasons, be it rising milk costs, chocolate shortages, or external market factors.</p><p>Traditional KDA can compute what drove the changes in the data, and while it provides valuable insights, it often arrives too late, delaying critical decisions. Why? Because KDA traditionally involves extensive statistical analysis, which can be resource-intensive and slow.</p><p>Automation transforms this scenario by streamlining the process and bringing KDA into your decision-making much faster through modern analytical tools.</p><h3>Why Bring KDA Closer to Your Decisions?</h3><p>Consider the ice cream shop scenario: one morning, the cost of your vanilla ice cream supply spikes by 63%. 
A manual KDA might reveal — hours or even days later, depending on when it’s run (or whether someone remembers to check the dashboard) — that milk and chocolate prices have surged, leaving you scrambling for solutions in the meantime.</p><p>Automating this process through real-time alerts ensures you never miss crucial events:</p><ul><li><strong>Webhook triggers</strong> when ingredient prices exceed defined thresholds.</li><li><strong>Immediate automated KDA execution</strong> identifies critical drivers within moments.</li><li><strong>Instant alerts</strong> enable swift actions like sourcing alternative suppliers or adjusting prices, safeguarding your business agility.</li></ul><p>These systems can significantly reduce your response times, allowing you to mitigate risks and leverage opportunities immediately, rather than reacting post-mortem.</p><h3>Automating KDA</h3><p>Automation significantly streamlines the KDA process. Sometimes, you don’t need to react to alerts immediately, but you need your answers by the next morning. Let’s explore how you can set this up using a practical example with Python for overnight jobs:</p><pre>def get_notifications(self, workspace_id: str) -&gt; list[Notification]:<br>    params = {<br>        &quot;workspaceId&quot;: workspace_id,<br>        &quot;size&quot;: 1000,<br>    }<br>    res = requests.get(<br>        f&quot;{self.host}/api/v1/actions/notifications&quot;,<br>        headers={&quot;Authorization&quot;: f&quot;Bearer {self.token}&quot;},<br>        params=params,<br>    )<br>    res.raise_for_status()<br><br>    ResponseModel = ResponseEnvelope[list[Notification]]<br>    parsed = ResponseModel.model_validate(res.json())<br>    <br>    return parsed.data</pre><p><em>For this example, I have deliberately chosen 1000 as the polling size for notifications. In case you have more than 1000 notifications on a single workspace each day, you might want to reconsider your alerting rules. 
Or you might greatly benefit from things like Anomaly Detection, which I touch on in the last section.</em></p><p>This simply retrieves all notifications for a given workspace, allowing you to run KDA selectively during the night. This saves computation resources and helps you focus only on relevant events in your data.</p><p>Alternatively, you can also automate the processing of the notifications with webhooks or our <a href="https://www.gooddata.com/docs/python-sdk/latest/">PySDK</a>, so you don’t have to poll them proactively. You can easily just react to them and have your KDA computed as soon as any outlier in your data is detected.</p><h3>Automated KDA in GoodData</h3><p>While we are currently working on integrating Key Driver Analysis as a built-in feature, we already have a working flow that elegantly automates it externally. Let’s have a look at the details. If you’d like to learn more or want to try to implement it yourself, feel free to reach out!</p><p>Every time a configured alert in GoodData is triggered, it initiates the KDA workflow (through a webhook). The workflow operates in multiple steps:</p><ul><li><strong>Data Extraction</strong></li><li><strong>Semantic Model Integration</strong></li><li><strong>Work Separation</strong></li><li><strong>Partial Summarization</strong></li><li><strong>External Drivers</strong></li><li><strong>Final Summarization</strong></li></ul><h4>Data Extraction + Semantic Model Integration</h4><p>First, it extracts information about the metric and filters involved in the alert, including the value that triggered the notification, and then it reads the related semantic models using the <a href="https://www.gooddata.com/docs/python-sdk/latest/">PySDK</a>.</p><p>The analysis planner then prepares an analysis plan based on the priority of dimensions available in the semantic model. 
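A rough, hypothetical sketch of such a planner (the real implementation is internal and will differ): dimensions are assumed to be sorted by priority, combinations are generated smallest-first, and generation stops when the allocated analysis credits run out:

```python
from itertools import combinations

def build_plan(dimensions: list[str], max_credits: int, cost_per_analysis: int = 1):
    """Hypothetical analysis planner: expand priority-ordered dimensions into
    combinations (singles before pairs, pairs before triples) and stop once
    the allocated analysis credits are exhausted."""
    plan, spent = [], 0
    for size in range(1, len(dimensions) + 1):
        for combo in combinations(dimensions, size):
            if spent + cost_per_analysis > max_credits:
                return plan
            plan.append(combo)
            spent += cost_per_analysis
    return plan
```

For example, `build_plan(["region", "product", "supplier"], max_credits=5)` yields the three single dimensions plus the two highest-priority pairs.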
This plan defines which dimensions and combinations will be used to analyze the metric.</p><h4>Setting up the Work</h4><p>The analysis planner then initiates analysis workers that execute the plan in parallel. Each worker uses the plan to query data and perform its assigned analyses. These analyses produce signals that the worker evaluates for potential drivers (what drives the change in the data).</p><h4>Partial Summarization</h4><p>If any drivers are found, they are passed to an LLM, which selects the most relevant ones based on past user feedback. It also generates a summary, provides recommendations, and checks for external events that could be related.</p><h4>External Drivers</h4><p>The analysis workers process the plan starting from the most important dimension combinations and continue until all combinations are analyzed or the allocated analysis credits are used up. The credit system is something we implemented to allow users to assign a specific amount of credits to each KDA in order to manage the duration and cost of the analysis/LLMs.</p><h4>Final Summarization</h4><p>Once the analyses are completed, a post-processing step organizes the root causes into a hierarchical tree for easier exploration and understanding of nested drivers. The LLM then generates an executive summary that highlights the most important findings.</p><p>We are currently working on enhancing KDA using the semantic model of the metrics. This will help identify root causes based on combinations of underlying dimensions and related metrics. 
For example, a decline in ice-cream margins may be caused by an increase in the milk price.</p><h3>A Sneak Peek Into the Future</h3><p>Currently, there are three very promising technologies that we are experimenting with.</p><h4>FlexConnect: Enhancing KDA with External APIs</h4><p>Expanding automated KDA further, <a href="https://www.gooddata.com/resources/flexconnect-connect-to-any-data-source/">FlexConnect</a> integrates external data through APIs, providing additional layers of context. Imagine an ice cream shop’s data extended with external market trends, consumer behavior analytics, or global commodity price indexes.</p><p>This integration allows deeper insights beyond internal data limitations. This can make your decision-making process more robust and future-proof. For instance, connecting to a weather API could proactively predict ingredient price fluctuations based on forecasted agricultural impacts.</p><h4>Enhanced Anomaly Detection</h4><p>We plan to integrate machine learning models that highlight significant outliers, improving signal-to-noise ratios and accuracy. This means you could easily move beyond simple thresholds and change gates. Your alerts can take into account the seasonality of your data and simply adapt to it.</p><h4>Chatbot Integration</h4><p>We are currently expanding the possibilities for our AI chatbot, which, of course, includes Key Driver Analysis. Soon, with this capability, the chatbot will be able to help you set up alerts for automatic detection of outliers and send you notifications about them. Also, in the future, it may recommend next steps based on KDA.</p><p>The output could look something like this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/462/1*cKDiQeC1feceHEB1CanUzA.png" /></figure><h4>Practical Application: Ice Cream Shop Example</h4><p>To illustrate, assume your <strong>Anomaly Detection</strong> detects a price deviation. 
Immediately:</p><ol><li>An automated KDA process kicks off, revealing milk shortages as the primary driver.</li><li>Simultaneously,<strong> FlexConnect</strong> fetches external market data, showing a global dairy shortage due to weather conditions.</li><li>An <strong>AI agent</strong> notifies you via instant messaging (or e-mail), offering alternative suppliers or recommending price adjustments based on historical data.</li><li>You can then chat with this agent to uncover even more information about the anomaly (or ask it to use additional data). The agent has the whole context, as it was briefed even before you knew about the anomaly.</li></ol><p>And while this might sound like a very distant future, we are currently experimenting with each of these! Don’t worry: as each of these features nears deployment, we’ll share the PoC with you.</p><h3>Want to learn more?</h3><p>If you’d like to dig deeper into automation in analytics, check out our article on how to <a href="https://www.gooddata.com/blog/5-tips-to-effectively-utilize-scheduled-exports-and-data-alerts-in-your-data-product/">effectively utilize Scheduled Exports &amp; Data Alerts</a>. It explores how to use automation to set up alerts correctly, so that they are useful and not simply a distraction.</p><p>Stay tuned if you’re interested in learning more about KDA, as we’ll soon follow up with a more in-depth article while also exploring its practical application in analytics.</p><p>Have questions or want to implement automated KDA in your workflow? 
Reach out — we’re here to help!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=559a32e255f5" width="1" height="1" alt=""><hr><p><a href="https://medium.com/gooddata-developers/make-key-driver-analysis-smarter-with-automation-559a32e255f5">Make Key Driver Analysis Smarter with Automation</a> was originally published in <a href="https://medium.com/gooddata-developers">GoodData Developers</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Listen to your dashboards. Literally!]]></title>
            <link>https://medium.com/gooddata-developers/listen-to-your-dashboards-literally-80b15e452347?source=rss----e8ee419648ea---4</link>
            <guid isPermaLink="false">https://medium.com/p/80b15e452347</guid>
            <category><![CDATA[podcast]]></category>
            <category><![CDATA[analytics]]></category>
            <category><![CDATA[chatgpt]]></category>
            <category><![CDATA[business-intelligence]]></category>
            <category><![CDATA[ai]]></category>
            <dc:creator><![CDATA[Tomáš Muchka]]></dc:creator>
            <pubDate>Mon, 23 Jun 2025 11:29:13 GMT</pubDate>
            <atom:updated>2025-06-23T14:47:48.110Z</atom:updated>
            <content:encoded><![CDATA[<h4>Transforming your dashboards into a data podcast is easy</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*BRSkNG3yRhKFkYp1" /></figure><p>Making your data speak was one of the earliest applications of AI in analytics I saw. I remember asking myself two questions. How hard would it be to implement it, and is there actually a real user need behind it? TL;DR: it is actually easier than you might imagine, and in this article I will outline how you can create your own data podcast. And although I was a bit skeptical about its usefulness, after trying it out for a week, I must admit it is my <strong>#1 activity on my way to work</strong>.</p><h3>How to build a data podcast</h3><p>Building a naive implementation of an AI-driven data podcast is quite easy. Especially if your analytics platform provides a robust SDK. I will showcase the steps using GoodData and its Python SDK. This nicely matches OpenAI’s Python SDK I’m using to communicate with AI.</p><h4>Step 1: Export dashboard</h4><p>OK, let’s do it. First, I need to connect to an existing GoodData workspace. A workspace is a logical unit that combines data, a model, and dashboards.</p><pre># GoodData base URL, e.g. &quot;https://www.example.com&quot;<br>host = &quot;https://example.cloud.gooddata.com/&quot;<br>token = os.environ.get(&quot;GOODDATA_API_TOKEN&quot;)<br>sdk = GoodDataSdk.create(host, token)</pre><p>Now once I’m connected to the right workspace, let’s find the dashboard I’m interested in and export it. 
GoodData currently allows <a href="https://www.gooddata.com/docs/python-sdk/latest/execution/exports/">exporting the whole dashboard</a> into a PDF file, so let’s use it.</p><pre>import base64<br>import io<br>from pathlib import Path<br><br>from pdf2image import convert_from_path<br><br>EXPORT_FILE_NAME = &quot;test&quot;<br>EXPORT_SUFFIX = &quot;export&quot;<br><br>def export_dashboard_to_images():<br>  # Export a dashboard in PDF format<br>  export_path = Path.cwd() / EXPORT_SUFFIX<br>  export_path.mkdir(parents=True, exist_ok=True)<br>  sdk.export.export_pdf(<br>    workspace_id = &quot;demo_workspace&quot;,<br>    dashboard_id = &quot;demo_dashboard&quot;,<br>    file_name = EXPORT_FILE_NAME,<br>    store_path = export_path,<br>    metadata = {}<br>  )</pre><p>Here is the exported dashboard I spoke about. I’m using the standard GoodData demo dashboard here. The dashboard reports on various aspects of customer engagement in a hypothetical e-shop.</p><figure><img alt="A GoodData dashboard focused on customer analytics for an e-shop. Visuals include a donut chart with all customers as return customers, a bar chart confirming 0 new and 22 return customers, and rankings listing top spenders and buyers." src="https://cdn-images-1.medium.com/max/1024/1*Eu63akrw8Z9nMSVwNurDjg.png" /><figcaption><em>A GoodData demo dashboard used as a podcast source</em></figcaption></figure><p>Unfortunately, OpenAI accepts PDF files only in its chatbot interface. The API accepts just text and images. 
That’s why I need to convert the PDF into images.</p><pre>  # Convert PDF to images<br>  images = convert_from_path(export_path / (EXPORT_FILE_NAME + &quot;.pdf&quot;), dpi=300)  # Change dpi for quality<br><br>  image_data = []<br>  for img in images:<br>    buffered = io.BytesIO()<br>    img.save(buffered, format=&quot;PNG&quot;)<br>    img_str = base64.b64encode(buffered.getvalue()).decode(&quot;utf-8&quot;)<br><br>    image_data.append(img_str)<br><br>  # Return the list of base64-encoded images<br>  return image_data</pre><h4>Step 2: Describe the dashboard</h4><p>With the dashboard exported in a format that OpenAI accepts, let’s get a nice description out of it.</p><p>Notable parts of the code:</p><ul><li>The prompt clearly states that the description should be used as the podcast talk track.</li><li>The language of the podcast can also be set easily using the prompt. Just add “Write the text in {language}.” at the end of it.</li></ul><pre>from openai import OpenAI<br><br>def describe_dashboard(images, language=&quot;en&quot;):<br>    client = OpenAI(<br>        # This is the default and can be omitted<br>        api_key=os.environ.get(&quot;OPENAI_API_KEY&quot;),<br>    )<br><br>    message = [<br>      {<br>      &quot;role&quot;: &quot;user&quot;,<br>      &quot;content&quot;: [<br>        {<br>        &quot;type&quot;: &quot;text&quot;,<br>        &quot;text&quot;: (<br>          &quot;Act as a data analyst who creates a daily data summary podcast. &quot;<br>          &quot;Start with the current day and a brief introduction of the dashboard, &quot;<br>          &quot;then describe the dashboard in a way that is easy to understand for a &quot;<br>          &quot;non-technical audience. Don&#39;t mention there is actually a dashboard; &quot;<br>          &quot;make it sound like you are reading a daily summary report. 
&quot;<br>          f&quot;Write the text in {language}.&quot;<br>        ),<br>        }<br>      ]<br>      }<br>    ]<br><br>    for image in images:<br>      message[0][&quot;content&quot;].append(<br>        {<br>          &quot;type&quot;: &quot;image_url&quot;,<br>          # The images were encoded as PNG above, so declare them as such<br>          &quot;image_url&quot;: {&quot;url&quot;: f&quot;data:image/png;base64,{image}&quot;},<br>        }<br>      )<br><br>    outcome = client.chat.completions.create(<br>      model=&quot;gpt-4o&quot;,<br>      messages=message<br>    )<br>    return outcome.choices[0].message.content</pre><h4>Step 3: Generate audio</h4><p>Generating the audio is by far the easiest step in this guide. OpenAI offers an API for this exact purpose. You just need to set the model and the voice. The output can be stored straight into a file.</p><pre>def generate_audio(text, file, voice=&quot;alloy&quot;):<br>  client = OpenAI()<br><br>  response = client.audio.speech.create(<br>      model=&quot;tts-1&quot;,<br>      voice=voice,<br>      input=text,<br>  )<br>  response.stream_to_file(file + &quot;.mp3&quot;)</pre><p><em>Note: OpenAI automatically detects the language from the text, so there is no need to set it explicitly.</em></p><h4>Step 4: Send it to your audience</h4><p>There are multiple ways to distribute your new podcast to your audience. For example, Oracle Analytics decided to <a href="https://blogs.oracle.com/analytics/post/analytics-anywhere-oracle-analytics-on-mobile">add podcasting capabilities</a> to its mobile app.</p><p>I chose a different direction: publishing the podcast episodes straight into a podcast app (YouTube Music in my case).</p><figure><img alt="YouTube Music for GoodData Dashboard Insights podcast, featuring episodes that convert dashboard data into narrated insights. Latest episode from June 8, 2025, discusses customer engagement and a 14% increase in active customers." 
src="https://cdn-images-1.medium.com/max/1024/0*3lp43wTttHY6HODF" /><figcaption>YouTube Music with the GoodData dashboard insights podcast</figcaption></figure><p>First, we need to generate a title and a description for each episode. This can be done using a simple OpenAI call. Just take the full dashboard description as an input and generate a brief summary out of it.</p><pre>def generate_summary(text, timestamp):<br>    client = OpenAI()<br><br>    message = [<br>        {<br>        &quot;role&quot;: &quot;user&quot;,<br>        &quot;content&quot;: [<br>          {<br>            &quot;type&quot;: &quot;text&quot;,<br>            &quot;text&quot;: (<br>              &quot;Act as a podcaster and data analyst publishing a new podcast episode. Generate output in JSON format with two fields: &#39;title&#39; and &#39;description&#39;.&quot;<br>              &quot; - &#39;title&#39; should be a string of 6 to 10 words, following this pattern: &#39;&lt;Month&gt; &lt;Day&gt;, &lt;Year&gt; - &lt;Short Episode Title&gt;&#39; (e.g. &#39;June 1, 2025 - Exploring the Dynamic Expansion of LEGO Sets&#39;).&quot;<br>              &quot; - &#39;description&#39; should be a very short and brief summary of the episode, suitable for RSS feeds.&quot;<br>              &quot; Important: Output only raw JSON. Do not include any Markdown formatting, code block markers, or extra commentary.&quot;<br>              f&quot; Text to summarize: {text}&quot;<br>              f&quot; Current date is: {timestamp}&quot;<br>            ),<br>          }<br>        ]<br>        }<br>      ]<br><br>    outcome = client.chat.completions.create(<br>        model=&quot;gpt-4o&quot;,<br>        messages=message<br>      )<br>    return outcome.choices[0].message.content</pre><p>The code results in a simple JSON structure with two fields, title and description. 
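Before handing that reply to the upload step, it is worth validating it. Here is a minimal sketch of that parsing side; the helper name and the fence-stripping fallback are my own illustration, not part of the original script (models sometimes wrap JSON in Markdown fences despite the prompt):

```python
import json

def parse_episode_summary(raw: str) -> dict:
    """Parse the model's reply into a dict with 'title' and 'description'.

    Defensive parsing: strips accidental Markdown code fences and
    verifies that both expected fields are present.
    """
    text = raw.strip()
    if text.startswith("```"):
        # Drop ```json ... ``` fences the model may add despite the prompt
        text = text.strip("`")
        if text.lower().startswith("json"):
            text = text[4:]
    summary = json.loads(text)
    missing = {"title", "description"} - set(summary)
    if missing:
        raise ValueError(f"Summary is missing fields: {missing}")
    return summary

# A fenced reply, as the model occasionally returns:
raw = '```json\n{"title": "June 1, 2025 - Demo", "description": "Daily summary."}\n```'
print(parse_episode_summary(raw)["title"])  # → June 1, 2025 - Demo
```

Failing fast here is cheaper than discovering a malformed episode title after the audio has already been published.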
The title always contains the current date and a bare-bones version of the episode summary.</p><p><strong>Now, with the audio file, title, and description in place, we can finally publish the episode! But how?</strong></p><p>I spent quite some time trying to figure out the most suitable way to do it. Here are a few of the dead ends I encountered.</p><ol><li><strong>Don’t try to upload the podcast to GitHub Pages.</strong> Although I was able to generate an RSS file and add the podcast to the YouTube Music app, it could not play any audio. Later I discovered that GitHub Pages is not optimized for streaming media files, which is mandatory for podcasts.</li><li><strong>SoundCloud is quite unfriendly to third-party integrations.</strong> Since I already had a SoundCloud account, I thought it would be easy to upload the episodes there. However, I found the SoundCloud API and integrations quite hostile toward my simple Python-upload use case.</li></ol><p>In the end, I found a podcast platform called <a href="https://www.buzzsprout.com/">Buzzsprout</a>. Uploading episodes is quite easy thanks to their API. 
Before uploading the first episode, I just had to create an API token and a new podcast entity on the Buzzsprout website.</p><p>Uploading an episode requires an HTTP call with the episode details in the form of a JSON payload, plus the episode audio file.</p><pre>import json<br>import os<br><br>import requests<br><br>def upload_episode(path, summary, season_number, episode_number):<br>    API_TOKEN = os.getenv(&quot;BUZZSPROUT_API_TOKEN&quot;)<br>    PODCAST_ID = os.getenv(&quot;BUZZSPROUT_PODCAST_ID&quot;)<br>    summary = json.loads(summary)<br><br>    # Endpoint<br>    url = f&quot;https://www.buzzsprout.com/api/{PODCAST_ID}/episodes.json&quot;<br><br>    # Metadata<br>    data = {<br>        &quot;title&quot;: summary[&quot;title&quot;],<br>        &quot;description&quot;: summary[&quot;description&quot;],<br>        &quot;summary&quot;: summary[&quot;description&quot;],<br>        &quot;artist&quot;: &quot;GoodData AI Assistant&quot;,<br>        &quot;episode_number&quot;: episode_number,<br>        &quot;season_number&quot;: season_number,<br>        &quot;explicit&quot;: False,<br>        &quot;private&quot;: False,<br>        &quot;email_user_after_audio_processed&quot;: True,<br>        &quot;artwork_url&quot;: &quot;https://www.gooddata.com/img/blog/_1200x630/01_21_2022_goodatasociallaunch_rebrand_og.png.webp&quot;<br>    }<br><br>    headers = {<br>        &quot;User-Agent&quot;: &quot;Mozilla/5.0 (compatible; BuzzsproutBot/1.0)&quot;,<br>        &quot;Accept&quot;: &quot;application/json&quot;,<br>        &quot;Authorization&quot;: f&quot;Token token={API_TOKEN}&quot;<br>    }<br><br>    # File upload: keep the audio file open only for the duration of the request<br>    with open(path, &quot;rb&quot;) as audio_file:<br>        files = {<br>            &quot;audio_file&quot;: (os.path.basename(path), audio_file, &quot;audio/mpeg&quot;)<br>        }<br><br>        # POST request<br>        response = requests.post(url, data=data, headers=headers, files=files)<br><br>    # Fail loudly if the upload was rejected<br>    response.raise_for_status()</pre><p><a 
href="https://www.buzzsprout.com/2507509?artist=GoodData+AI+Assistant&amp;client_source=large_player&amp;iframe=true&amp;referrer=https%3A%2F%2Fwww.buzzsprout.com%2F2507509%2Fpodcast%2Fembed">Podcast Episode</a></p><h3>Automate it</h3><p>The circle is closed, and the podcast is published. The only downside is the need to run the code locally every time a new episode should be created.</p><p>I’m a software engineer at heart, and I hate tedious, error-prone processes, which is exactly what this is. Creating a sticky-note reminder doesn’t really solve the underlying problem. Instead, let’s find an easy way to automate the release of a new podcast episode.</p><p><strong>GitHub Actions to the rescue.</strong></p><p>The easiest solution is probably <a href="https://github.com/features/actions">GitHub Actions</a>. I set a daily schedule to release a new episode, and GitHub took care of the rest.</p><p>Notice the environment variables the workflow uses. These are defined in my local unversioned .env file and in the GitHub repository settings. 
This way, they won’t leak anywhere unwanted.</p><p>Poppler is used during the PDF-to-image conversion, and the utility is not available out of the box in the latest Ubuntu runner environment.</p><pre>name: Daily Run<br><br>on:<br>  schedule:<br>    - cron: &#39;0 6 * * *&#39;  # Runs every day at 8:00 Czech time (GMT+2)<br>  workflow_dispatch:      # Allows you to manually trigger it<br><br>jobs:<br>  run-python-script:<br>    runs-on: ubuntu-latest<br><br>    env:<br>      GOODDATA_WORKSPACE_ID: ${{ secrets.GOODDATA_WORKSPACE_ID }}<br>      GOODDATA_API_TOKEN: ${{ secrets.GOODDATA_API_TOKEN }}<br>      GOODDATA_DASHBOARD_ID: ${{ secrets.GOODDATA_DASHBOARD_ID }}<br>      GOODDATA_ENDPOINT: ${{ secrets.GOODDATA_ENDPOINT }}<br>      BUZZSPROUT_API_TOKEN: ${{ secrets.BUZZSPROUT_API_TOKEN }}<br>      BUZZSPROUT_PODCAST_ID: ${{ secrets.BUZZSPROUT_PODCAST_ID }}<br>      OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}<br><br>    steps:<br>      - name: Checkout code<br>        uses: actions/checkout@v3<br><br>      - name: Set up Python<br>        uses: actions/setup-python@v4<br>        with:<br>          python-version: &#39;3.11&#39;<br><br>      - name: Install Poppler<br>        run: sudo apt-get update &amp;&amp; sudo apt-get install -y poppler-utils<br><br>      - name: Install dependencies<br>        run: |<br>          pip install -r requirements.txt<br><br>      - name: Run script<br>        run: |<br>          python app.py</pre><p>And the result as seen on GitHub:</p><figure><img alt="List of eight successful ‘Daily Run’ workflow executions, each triggered by a scheduled event." 
src="https://cdn-images-1.medium.com/max/1024/0*58KrJI5HmYraYTDV" /><figcaption>Daily releases of the podcast episodes through GitHub Actions</figcaption></figure><h3>But… is there a use case?</h3><p>During my 7 years of UX design for an analytics platform, I never heard a single customer wish to listen to dashboard summaries as a podcast.</p><p>Having said that, I often hear customers wish to reduce the time to insight as much as possible. A quick audio summary of what happened in the last X days might actually fulfil this need quite well. I would personally like to push the idea a bit further and accompany the audio summary with an additional infographic to make it stand out.</p><h3>Conclusion</h3><p>In this article, I showed that building a dashboard podcast is actually quite easy and, with proper tooling, requires just a few lines of code, all thanks to a developer-friendly platform. If you&#39;d like to learn more about how that works, check out my <a href="https://medium.com/gooddata-developers/the-road-to-ai-generated-visualizations-923428728479?source=collection_home---5------6-----------------------">AI-generated Visualizations</a> article.</p><h4>There is one but…</h4><p>As already mentioned, the examples above have one big flaw: I’m sending the data through a visual export, losing most of the actual dashboard data.</p><p>For a summary podcast, or a quick glance at the dashboard, a visual export is acceptable, but for in-depth analysis, work directly with the data, not just the visual output.</p><h4>How to make it useful</h4><p>Sending a daily or weekly dashboard summary as a podcast sounds good, but don’t forget to set the date filters properly, and consider including the previous snapshots as well and telling the AI to point out the changes. 
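As a sketch of that snapshot idea (the file name, helper, and prompt wording below are my own illustration, not part of the original script), you could persist each day's generated description and feed the previous one back into the prompt:

```python
import json
from datetime import date
from pathlib import Path

# Hypothetical local history of past summaries
SNAPSHOT_FILE = Path("snapshots.json")

def build_prompt_with_history(today_text: str) -> str:
    """Extend the podcast prompt with the previous summary so the AI can call out changes."""
    history = json.loads(SNAPSHOT_FILE.read_text()) if SNAPSHOT_FILE.exists() else {}
    previous = history.get("latest", "")
    prompt = (
        "Act as a data analyst who creates a daily data summary podcast. "
        f"Today's dashboard description: {today_text} "
    )
    if previous:
        prompt += (
            f"Yesterday's summary was: {previous} "
            "Point out what changed since yesterday."
        )
    # Persist today's text for tomorrow's comparison
    history["latest"] = today_text
    history[str(date.today())] = today_text
    SNAPSHOT_FILE.write_text(json.dumps(history))
    return prompt

SNAPSHOT_FILE.unlink(missing_ok=True)  # start fresh for this demo
first = build_prompt_with_history("22 return customers, 0 new.")
second = build_prompt_with_history("25 return customers, 3 new.")
```

On the first run there is nothing to compare against; from the second run onward, the model gets both snapshots and can narrate the difference.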
This way your podcast will be much more relevant.</p><hr><p><a href="https://medium.com/gooddata-developers/listen-to-your-dashboards-literally-80b15e452347">Listen to your dashboards. Literally!</a> was originally published in <a href="https://medium.com/gooddata-developers">GoodData Developers</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Everyone’s Talking About MCP. I Built a Server to See If It’s Ready]]></title>
            <link>https://medium.com/gooddata-developers/everyones-talking-about-mcp-i-built-a-server-to-see-if-it-s-ready-f85a4c565d62?source=rss----e8ee419648ea---4</link>
            <guid isPermaLink="false">https://medium.com/p/f85a4c565d62</guid>
            <category><![CDATA[mcp-server]]></category>
            <category><![CDATA[analytics]]></category>
            <category><![CDATA[model-context-protocol]]></category>
            <category><![CDATA[ai]]></category>
            <dc:creator><![CDATA[Andrii Chumak]]></dc:creator>
            <pubDate>Tue, 20 May 2025 13:46:41 GMT</pubDate>
            <atom:updated>2025-05-20T13:46:41.241Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*hQG_VIpfpBCIjv0m73yGUQ.png" /></figure><p><a href="https://modelcontextprotocol.io/introduction">Model Context Protocol</a> (MCP) has received a lot of attention lately. Anthropic keeps promoting it as a unified protocol for LLMs. CursorAI now supports it as a client. Even big players like OpenAI, Azure, and Google are getting on board.</p><p>But do we really need yet another abstraction layer? Isn’t the good old REST API (in the form of OpenAI’s function calling) enough?</p><h3>What MCP is and is not</h3><p>Well, I was a little skeptical about MCP at first. OpenAI has had function calling for a while now, and at first glance, it seemed that MCP was no more than a marketing move from Anthropic. What caught my attention was a passage in the documentation about the protocol being inspired by VSCode’s Language Server. At GoodData, we created our own Language Server for our <a href="https://www.gooddata.com/blog/analytics-as-code-integrated-development-environment/">analytics as code features</a>, and I’m a big fan of the protocol. It is more than just a set of rules and schemas for communication with your IDE; it changes the way a language developer thinks about the integration. So I got curious whether MCP could do the same for LLM apps.</p><p>Building an AI agent, or even a simple reactive chatbot, is hard. You need to both understand how LLMs work and be a decent software engineer. A colleague of mine (hey, Jakub!) has described this well with a Venn diagram.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*llAbtFDoi3cV8J8c" /><figcaption>Why building AI applications is hard</figcaption></figure><p>Model Context Protocol is a way to make LLM app development accessible to a wide variety of software engineers, even those with no data science background, such as myself. 
I no longer need a deep understanding of all the LLM oddities in order to create an agent, just as I don’t need to know how VSCode works under the hood in order to add support for my language with a Language Server.</p><h3>GoodData MCP Server PoC</h3><p>There is no better way to learn about a new technology than to use it. So, I went and built a prototype MCP Server with just enough GoodData features to cover a simple usage scenario.</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2F3lvRRjWEKdQ%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3D3lvRRjWEKdQ&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2F3lvRRjWEKdQ%2Fhqdefault.jpg&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/a118dd218b1b638f1e764cc96f79aefa/href">https://medium.com/media/a118dd218b1b638f1e764cc96f79aefa/href</a></iframe><p>As a starting point, I chose the TypeScript SDK. There are plenty of relevant examples on GitHub using this SDK, and it’s one of the official ones, supported by Anthropic.</p><p>As a client, I’m using the Claude Desktop app. Although some feature support is spotty, it is the official client from Anthropic.</p><p>If you want to see my implementation for yourself, check out this <a href="https://github.com/andriichumak/gooddata-mcp">repository</a>.</p><h3>MCP First DevEx Impressions</h3><p>MCP is still under heavy development, and this is immediately evident. Documentation is inconsistent at times, the official desktop client does not support all the MCP features, and there are inconsistencies between different server SDKs. 
This is understandable, given the pace at which MCP is evolving, and it is largely compensated for by <a href="https://github.com/modelcontextprotocol/servers">a plethora of examples on GitHub</a>.</p><p>Your MCP server can deliver several types of integrations: <strong>Tools</strong>, <strong>Resources</strong>, <strong>Prompts</strong>, <strong>Sampling,</strong> and <strong>Roots</strong>.</p><h4>Tools</h4><p><a href="https://modelcontextprotocol.io/docs/concepts/tools"><strong>Tools</strong></a> are by far the most well-supported and robust type of integration. They have existed for a long while (remember OpenAI function calling?). Tools are also well supported by Claude Desktop.</p><p>The idea is that you define a function name and a JSON Schema of the expected input, and the LLM will request the execution of the function with input based on the user prompt.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*GXAhhTed_oM_cZnO" /><figcaption><em>A Semantic Search Tool defined with the TS SDK</em></figcaption></figure><p>The main purpose of Tools is to provide a way for the AI Assistant to execute actions on the user’s behalf.</p><h4>Resources</h4><p>With <a href="https://modelcontextprotocol.io/docs/concepts/resources"><strong>Resources</strong></a> and <strong>Resource Templates,</strong> you can provide the LLM with the context it needs to reply to the user’s question. It can be anything from documentation in PDF files to JSON entities out of your REST API.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*F7zYKP0RmkV8nMAu" /><figcaption><em>A Resource used to help the LLM pick the right visualization type</em></figcaption></figure><p>Personally, I can see how this may become one of the most powerful integration types in MCP. However, right now, in Claude Desktop, the LLM can’t choose which resources to use on its own. The user has to explicitly include the resource in each question. 
This limits the usefulness of Resources, and I hope this becomes seamless in the future.</p><h4>Prompts</h4><p><a href="https://modelcontextprotocol.io/docs/concepts/prompts"><strong>Prompts</strong></a> are another integration type with great potential. On the surface, Prompts are just prompt templates predefined by your MCP server. They provide the user with a standardized way to ask LLMs for certain tasks. The real power of Prompts, however, lies in parametrization and workflows. Think of them as Tools on steroids. Your MCP server can process the input arguments and output a set of instructions to the LLM on how to process the results, how to respond to the user, and which tools to use next in the pre-defined workflow.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*AXAAYyC9RFLfhrXi" /><figcaption><em>A prompt that loads data for a given visualization and asks Claude to analyze it</em></figcaption></figure><p>Claude Desktop’s integration for Prompts is less fluid than for Tools. The user needs to explicitly generate the prompt and provide the input parameters manually.</p><h4>Sampling</h4><p><a href="https://modelcontextprotocol.io/docs/concepts/sampling"><strong>Sampling</strong></a> is the latest addition to MCP. It allows the server to ask for help from the LLM through the client, while keeping the human in the loop. This will be especially useful if you want to orchestrate complex multi-step tasks, like creating a data analytics visualization. Here is how it would work:</p><ol><li>Your user asks a question to create a visualization.</li><li>Let’s say the user did not specify the exact visualization type they wanted. The MCP server can do some heuristics and narrow down the options based on the metrics and dimensionality, but in the end, it needs to decide whether to use a bar chart or a line chart. 
That’s something a quick call to the LLM can easily solve.</li><li>The server sends a Sampling request to the client with the user’s original prompt, a narrowed-down list of supported visualizations, and a guideline on how to select the visualization type.</li><li><strong>Optionally,</strong> this is then displayed to the user, who can edit, approve, or even reject the LLM call. Users can also change the selected model.</li><li>The client forwards it to the LLM and then routes the response back to the server.</li></ol><p>This workflow ensures that all LLM communication always goes through the client, while the user has the final say in what to send to the LLM.</p><p>At the time of writing this article, Sampling is not yet supported by Claude Desktop.</p><h4>Roots</h4><p><a href="https://modelcontextprotocol.io/docs/concepts/roots"><strong>Roots</strong></a> are a way for an MCP client to tell an MCP server which resources it should be accessing for a given operation. A typical example would be a directory that a filesystem MCP server should operate in. This is also applicable to API base URLs or any other URI-identified resources.</p><h3>What’s next for MCP?</h3><p>MCP has come a long way since it was first introduced, and I’m looking forward to seeing it mature further. Here are a few improvements I’d really like to see:</p><ul><li>More seamless integration for <strong>Resources</strong>, <strong>Prompts,</strong> and <strong>Sampling</strong> on the Claude Desktop side. I don’t want my users to manually pick the Resource or fill in the variables for a Prompt. Why not let the LLM choose to access these in the same way it does with <strong>Tools</strong>?</li><li>I want to be able to add my own content types to the MCP server and the client UI. One good use case is GoodData’s data visualization. 
I can easily write a server that returns a GoodData-specific data visualization JSON, but there is no way I can render it in Claude Desktop’s UI, unless I export it to a plain, non-interactive image. Imagine if an MCP server could also provide a link to a WebComponent (React/Angular component) capable of rendering the new content type.</li></ul><h3>What’s in it for GoodData?</h3><p>At GoodData, we are keeping a very close eye on MCP. You can expect to see more PoCs and eventually production integrations from our side. Here is a sneak peek into what’s brewing:</p><ul><li>Now that Anthropic has finished the specification for a remote MCP server, we’re trying it out and will eventually integrate it into our servers.</li><li>Some time ago we built the GoodData <a href="https://www.gooddata.com/docs/cloud/api-and-sdk/vs-code-extension/">VSCode extension</a> to support our Analytics as Code format. Using this extension in a VSCode-based, AI-enabled IDE (like Cursor) and adding an MCP server to it should be quite a powerful combination.</li><li>Last but not least, we are about to release the GoodData AI Assistant. Making it an MCP client will unlock a lot of new use cases. Imagine an analytics-focused assistant that our customers can connect to their own MCP servers to provide additional context and tooling.</li></ul><h3>Conclusion</h3><p>MCP is a very promising technology. One can see all the effort the team at Anthropic put into the protocol design. It needs to mature, though. I would be hesitant to write a generic MCP server with anything other than <strong>Tools</strong>, unless I were targeting a very specific MCP client implementation or, better yet, writing my own.</p><hr><p><a href="https://medium.com/gooddata-developers/everyones-talking-about-mcp-i-built-a-server-to-see-if-it-s-ready-f85a4c565d62">Everyone’s Talking About MCP. 
I Built a Server to See If It’s Ready</a> was originally published in <a href="https://medium.com/gooddata-developers">GoodData Developers</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Hyper-personalised analytics is coming]]></title>
            <link>https://medium.com/gooddata-developers/hyper-personalised-analytics-is-coming-21e4a810fd11?source=rss----e8ee419648ea---4</link>
            <guid isPermaLink="false">https://medium.com/p/21e4a810fd11</guid>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[hyper-personalization]]></category>
            <category><![CDATA[business-intelligence]]></category>
            <category><![CDATA[analytics]]></category>
            <category><![CDATA[chatgpt]]></category>
            <dc:creator><![CDATA[Tomáš Muchka]]></dc:creator>
            <pubDate>Wed, 07 May 2025 09:25:37 GMT</pubDate>
            <atom:updated>2025-05-09T07:30:32.195Z</atom:updated>
            <content:encoded><![CDATA[<h4>Yet, most current analytics solutions are stuck in “one to rule them all” dashboards</h4><figure><img alt="Illustration of a glowing human silhouette at the center of a futuristic dashboard interface, surrounded by flowing data streams and colorful charts, graphs, and icons in shades of purple, blue, and pink, symbolizing hyperpersonalized analytics against a dark purple background." src="https://cdn-images-1.medium.com/max/1024/0*lB7g8yqIX9nsxOOe" /></figure><p>Two years ago, I interviewed a candidate for a UX designer position on our web and marketing team. During the interview, he brought up hyper-personalization as the next big thing in digital marketing. It sounded interesting, but I didn’t give it much thought at the time. Now that this trend is coming to analytics, <strong>the current tools aren’t ready for it</strong>.</p><p>Most analytics solutions are still stuck in the pattern of a standardized set of dashboards that must serve all of their consumers. To add an impression of personalization, these standard dashboards are often accompanied by dozens of filters. This, however, only adds to their complexity, making them incomprehensible at first glance.</p><blockquote><strong>Politely put, the current personalization experience in analytics is suboptimal. But how do we get out of this?</strong></blockquote><h3>Moonshot: Dashboards are dead</h3><p>I know we have been hearing about dashboards being dead at least <a href="https://medium.com/data-science/dashboards-are-dead-3-years-later-72347757bfa6">since 2020</a>, yet they are still here with us. However, if hyper-personalization is really coming, dashboards are not the ideal interface for it. They should evolve into personal spaces or homepages.</p><figure><img alt="Mobile banking app showing current balance of $8,750, monthly spending of $1,250 with a line chart, and category breakdown: Food &amp; Drink $450, Groceries $320, Shopping $210." 
src="https://cdn-images-1.medium.com/max/1024/0*RofRsWw6B3ImgUew" /><figcaption>Example of a personalized homepage from ChatGPT</figcaption></figure><p>Another option that nicely matches hyper-personalization is AI chatbots, and conversational patterns in general. Surprisingly, chatbots already know a lot about us; just ask them yourself.</p><figure><img alt="Summary of user’s professional background and interests, showing they are a UX Architect at GoodData Corporation working on design consistency, accessibility, and AI analytics chatbots, with projects in smart search, external recipients, and visualization prototyping" src="https://cdn-images-1.medium.com/max/896/0*Pw7KQaq_ECAhAE61" /><figcaption><em>What ChatGPT knows about Tomáš Muchka (the author of this article)</em></figcaption></figure><p><strong>Now imagine connecting the information a chatbot knows about you to your data needs.</strong></p><p>However, I haven’t seen a single analytics platform that would be able to deliver such an experience. Even tech giants like <a href="https://www.theverge.com/news/629940/apple-siri-robby-walker-delayed-ai-features">Apple are still struggling to adopt and integrate new AI-</a>driven features. And building this from scratch would require a tremendous amount of work.</p><h3>Realistic goal: Personalized dashboards</h3><p>As I said at the beginning, hyper-personalization is a tempting goal; however, most analytics solutions have close to zero personalized content. This is not because data analysts are villains who want their dashboard users to suffer; they just don’t have enough time and resources to maintain personalized versions of their dashboards for everyone.</p><p><strong>I believe AI can help us personalize analytics.</strong></p><p>How exactly can AI help us here? In my article <a href="https://medium.com/p/4b9a1ba6e199">Can your BI tool import sketched dashboards?</a> I outlined how GoodData professional services work. 
They look at the data and cooperate with our customers to deliver the best possible analytics for their desired use cases. AI can help us bootstrap this process.</p><p>Let’s look at three steps where AI can help us with personalized dashboards.</p><h4>1. Propose audience based on the data</h4><p>While conventional wisdom says to start with the audience and their use cases before building the data around them, reality often flips this approach: we start with the data and then seek the right audience. This is precisely where AI can help us.</p><p>Here is a demo explaining the idea. First, I select data sources, and <strong>the LLM analyzes their metadata to identify the best possible audience and use cases for them.</strong></p><figure><img alt="Modal dialog titled ‘Select your data sources’ with checkboxes; ‘Bricks Set Catalog’ is selected, and there are ‘Cancel’ and ‘Select’ buttons at the bottom." src="https://cdn-images-1.medium.com/max/920/0*oxXzm_SAr80ENTSl" /><figcaption><em>In the first step, select the data sources to build the analytics</em></figcaption></figure><p>Would you dare connect your database to an LLM directly? How would your security and compliance department feel about it? Doing this through GoodData has one big advantage: <strong>you don’t send your data to the LLM, just the table and column names</strong>.</p><p>In this step, the AI creates a list of personas and their use cases based on the data, which we will use in the next step, where we create the model.</p><pre># Scan the data source for metadata only (table and column names)<br>tables = sdk.catalog_data_source.scan_data_source(data_source_id=data_source)<br># Ask the AI to propose personas based on that metadata<br>proposed_personas = ask_ai_to_suggest_personas(tables)</pre><figure><img alt="Dialog titled ‘Generate analytics’ with persona checkboxes; ‘Bricks Historian’ and ‘Market Growth Analyst’ are selected, with a text field to add a custom persona and ‘Cancel’ and ‘Generate’ buttons." 
src="https://cdn-images-1.medium.com/max/920/0*PLZAm0w6a5bXb92M" /><figcaption>AI-generated list of possible personas that might be interested in the data</figcaption></figure><h4>2. Generate audience-based model</h4><p>Now, with the proposed audience, we are ready to generate a logical data model. <strong>A robust model is the very heart of a modern AI-ready analytics platform</strong>. And it is also the part where data analysts and analytics engineers spend most of their time. Some tools allow you to generate the model as a 1:1 copy of the database tables.</p><p>But how well can such a copy of physical data structures serve your analytics needs? The only exception is if you already prepare your analytics model directly in the database. In such a case, you spend most of your time creating the very same logical model, just in a different tool.</p><p>AI can bootstrap the model creation by taking the audience and their use cases into account. Any information you can provide to the AI can help your model become a better fit.</p><pre># Ask the AI to design datasets for the proposed personas<br>datasets = ask_ai_to_create_model(tables, personas)<br><br># Start with a clean analytics/datasets directory<br>datasets_directory = os.path.join(&#39;analytics&#39;, &#39;datasets&#39;)<br>if os.path.exists(datasets_directory):<br>    shutil.rmtree(datasets_directory)<br>os.makedirs(datasets_directory, exist_ok=True)<br><br>for dataset in datasets:<br>    create_dataset(str(dataset))</pre><h4>3. 
Generate audience-oriented dashboards</h4><p>With the logical data model and audience in place, we can go a bit further and generate personalized dashboards and visualizations.</p><pre>personas_with_vizs = ask_ai_to_create_visualizations(list_of_personas)<br><br>for persona in personas_with_vizs:<br>    # Create a directory for each persona&#39;s title<br>    directory = os.path.join(visualisations_directory, persona[&#39;title&#39;])<br>    os.makedirs(directory, exist_ok=True)<br><br>    # Iterate over the visualizations for each persona<br>    for viz in persona[&#39;visualisations&#39;]:<br>        create_visualization(str(viz), directory, False)<br><br>dashboards = ask_ai_to_create_dashboards(personas_with_vizs)<br>for dashboard in dashboards:<br>    create_dashboard(str(dashboard), False)<br>deploy_analytics()</pre><figure><img alt="Dashboard titled ‘Market Growth Analyst’ showing LEGO set analytics from 1996 to 2023, including total sets (6,165), sets released over time, annual growth by origin (licensed vs native), top themes by growth, average price per license, and yearly revenue growth by theme." src="https://cdn-images-1.medium.com/max/1024/0*j5vXQ7MLHgl_fwbu" /><figcaption>AI-generated dashboard for Market Growth Analysts</figcaption></figure><p>Not super happy with the outcome? No problem: all the assets are clearly separated. Feel free to update or recreate any of the steps manually, or, even better, let the AI try again with more specific instructions and more information.</p><h3>Try it yourself</h3><p>In the <a href="https://github.com/gooddata/gooddata-article-demos">repository</a>, you can find the whole code, including large parts of the prompts. It is very simple, yet elegant, because the chatbot does most of the heavy lifting.</p><p>You might ask yourself why you would even want to use GoodData if light prompting can lead to such results. 
The reason is exactly that: you only need light prompts, thanks to our very AI-friendly analytics-as-code approach.</p><p>As you can see in the presented code, we use OpenAI’s GPT-4o for the first step, which is the only step where you want the AI to be creative. Then we use o3-mini, because it is basically generating code; in GoodData, you can represent anything as code.</p><p><strong><em>Disclaimer: </em></strong><em>In the code, we omitted the parts that mimic the GoodData chatbot, so it will not work as seamlessly as in our testing. We also omitted parts of the prompts, and shortened them for demonstration purposes.</em></p><h3>Conclusion</h3><p>Although the future is bright and hyper-personalized analytics is coming soon™ through conversational interfaces or even AI workflows, there is something we can do about personalization even now. AI can help us generate personalized analytics, so data analysts don’t need to start from scratch.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=21e4a810fd11" width="1" height="1" alt=""><hr><p><a href="https://medium.com/gooddata-developers/hyper-personalised-analytics-is-coming-21e4a810fd11">Hyper personalised analytics is coming</a> was originally published in <a href="https://medium.com/gooddata-developers">GoodData Developers</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Can your BI tool import sketched dashboards?]]></title>
            <link>https://medium.com/gooddata-developers/can-your-bi-tool-import-sketched-dashboards-4b9a1ba6e199?source=rss----e8ee419648ea---4</link>
            <guid isPermaLink="false">https://medium.com/p/4b9a1ba6e199</guid>
            <category><![CDATA[openai]]></category>
            <category><![CDATA[chatgpt]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[analytics]]></category>
            <category><![CDATA[dashboard]]></category>
            <dc:creator><![CDATA[Tomáš Muchka]]></dc:creator>
            <pubDate>Tue, 01 Apr 2025 11:51:49 GMT</pubDate>
            <atom:updated>2025-04-01T11:51:49.775Z</atom:updated>
            <content:encoded><![CDATA[<h4>If not, you should probably look for a new one</h4><figure><img alt="Split-screen illustration showing a transition from a rough wireframe dashboard on the left to a polished, colorful data dashboard on the right, symbolizing design or development progress." src="https://cdn-images-1.medium.com/max/1024/0*TFWsp1Ex8yujVo07" /></figure><p>Sketching a dashboard is still one of the most common ways to start basically any analytical project. With today’s technologies, even a paper sketch can be transformed into an interactive dashboard.</p><p>In our experience at GoodData, a typical analytics project starts at a whiteboard. The solution architect and UX designer meet with the customer and put together the first draft of the future analytics product.</p><figure><img alt="Miro board showing a use case table, colored sticky notes, and a prioritization matrix with axes labeled “Impact” and “Effort,” divided into iterations for planning." src="https://cdn-images-1.medium.com/max/1024/0*1zWvLiHP-DaY0xZK" /><figcaption>Example of a project start in a whiteboard tool</figcaption></figure><p>While a future-proof data model is a cornerstone of any successful project, customers can provide much more feedback when discussing the visuals of their future product, in our case, dashboards.</p><p>That’s why we create wireframes of the dashboards as fast as possible and continuously discuss and iterate on them with the customer. We even developed the <a href="https://www.gooddata.com/blog/introducing-goodwire/">GoodWire</a> Figma library to speed up the process. These wireframes focus on information architecture and dashboard layout. 
With such a sketch, we can already validate with the end users whether the dashboard communicates the desired insights correctly.</p><figure><img alt="Dashboard titled “LEGO Journey” displaying LEGO sales analytics, including total sales value, line chart of sets over years, bar charts of biggest and most expensive sets, and a scatter plot showing set size vs. price." src="https://cdn-images-1.medium.com/max/1024/0*N_OIqVQwgHL1PY83" /><figcaption><em>Example of a sketched dashboard</em></figcaption></figure><p>However, the juiciest stuff is the annotations. Our customers tend to clarify their needs through sticky notes on dashboard sketches. These usually include interactions or additional clarifications that cannot be seen on the dashboard sketch (e.g., when the filters should have a parent-child relationship).</p><figure><img alt="Interactive LEGO analytics dashboard with parent-child filters, zoom-enabled line chart, and a scatter plot that allows drilling into set details on bubble click." src="https://cdn-images-1.medium.com/max/1024/0*TWzE4zXnK3M6HuOZ" /><figcaption><em>Example of a sketched dashboard with annotations</em></figcaption></figure><p>OK, now imagine what happens when the dashboard sketch gets approved. The solution architect creates the corresponding semantic layer, so the solution designer can take the approved sketch… and <strong>throw it away</strong>, because it needs to be rebuilt from scratch, step by step, in the BI tool.</p><blockquote><strong>Rebuilding approved sketches into dashboards is a waste of everyone’s time.</strong></blockquote><h3>Don’t rebuild, generate!</h3><p>Instead of recreating the dashboard from the sketch manually, let’s take the previously created sketch and generate a real dashboard out of it. Wouldn’t that be beautiful?</p><p>Although GoodData does not offer such functionality out of the box, I decided to explore the possibilities and build such a feature on top of our existing tooling. 
Similar to my article <a href="https://medium.com/gooddata-developers/from-sketch-to-interactive-data-what-is-napkin-analytics-all-about-d78024a43e6e">From Sketch to Interactive Data</a>, my idea was to use AI to generate the YAML definition of a dashboard based on the sketch.</p><p>GoodData sports the concept of <a href="https://www.gooddata.com/resources/what-is-composable-data-and-analytics/">composable analytics</a>, which means it separates visualisations from dashboards. So I need to create not only the dashboard definition but also separate definitions for all the visualisations on the dashboard.</p><p>Fortunately, GoodData supports the as-code representation of analytics entities, including <a href="https://www.gooddata.com/docs/cloud/api-and-sdk/vs-code-extension/structures/#dashboard">dashboards</a> and <a href="https://www.gooddata.com/docs/cloud/api-and-sdk/vs-code-extension/structures/#visualization">visualisations</a>.</p><p>With all of that in place, check out the interactive dashboard generated from the sketch.</p><figure><img alt="Side-by-side comparison of a wireframe sketch and a finished interactive dashboard for LEGO analytics, showing improvements in visual clarity, interactivity, and detailed data visualizations." src="https://cdn-images-1.medium.com/max/1024/0*uTWz6sU7Ow6zXK3j" /><figcaption><em>Result of our work: an autogenerated dashboard</em></figcaption></figure><h3>As code + AI = ♥️</h3><p>Now let’s take a look at how to make the as-code approach work with AI, namely OpenAI’s GPT-4o (o1 or o3 might be better for this, but they take much longer). I’m not using the chatbot interface, but the OpenAI API. Never tried that? 
Then read the <a href="https://platform.openai.com/docs/quickstart?api-mode=chat">official quickstart</a> first.</p><h4>Step 1: Create the visualisations</h4><p>First, we are going to send the base64-encoded dashboard sketch to the AI, expecting to get back a list of the visualisations that are on the image.</p><pre>outcome = client.chat.completions.create(<br>        model=&quot;gpt-4o&quot;,<br>        messages=[<br>            {&quot;role&quot;: &quot;system&quot;, &quot;content&quot;: INSTRUCTIONS},<br>            {&quot;role&quot;: &quot;assistant&quot;, &quot;content&quot;: f&quot;Here are the fields from which you can build visualizations. Use only existing attributes and facts. {PROMPT_DATASETS}&quot;},<br>            {&quot;role&quot;: &quot;assistant&quot;, &quot;content&quot;: f&quot;Here are examples of existing visualizations: {PROMPT_EXAMPLES_OF_VISUALIZATION}&quot;},<br>            {&quot;role&quot;: &quot;assistant&quot;, &quot;content&quot;: f&quot;Here are the metrics from which you can build visualizations. Use only existing metrics. {PROMPT_METRICS}&quot;},<br>            {<br>                &quot;role&quot;: &quot;user&quot;,<br>                &quot;content&quot;: [<br>                    {<br>                        &quot;type&quot;: &quot;text&quot;,<br>                        &quot;text&quot;: &quot;Identify visualisations (charts) on the image and map it to the described visualization structure. Use the available fields. 
If not found, use the most similar fields.&quot;,<br>                    },<br>                    {<br>                        &quot;type&quot;: &quot;image_url&quot;,<br>                        &quot;image_url&quot;: {&quot;url&quot;: f&quot;data:image/jpeg;base64,{base64_image}&quot;},<br>                    },<br>                ],<br>            }<br>        ], <br>        response_format = schema<br>    )</pre><p>And the schema starts like this:</p><pre>&quot;type&quot;: &quot;json_schema&quot;,<br>    &quot;json_schema&quot;: {<br>        &quot;name&quot;: &quot;visualisations_schema&quot;,<br>        &quot;schema&quot;: {<br>            &quot;type&quot;: &quot;object&quot;,<br>            &quot;properties&quot;: {<br>                &quot;visualisations&quot;: {<br>                    &quot;type&quot;: &quot;array&quot;,<br>                    &quot;items&quot;: {<br>                        &quot;type&quot;: &quot;object&quot;,<br>                        &quot;title&quot;: &quot;Visualisation&quot;,<br>                        &quot;description&quot;: &quot;JSON schema for GoodData Analytics Visualisation&quot;,</pre><p>Once we have the visualisations ready, save them:</p><pre>visus = ask_ai_to_create_dashboard_visualizations(encoded_file)<br>for visu in visus:<br>    create_visualization(visu)</pre><p>Are you interested in more details on how this works? 
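(One small piece the snippets above gloss over is how the <code>base64_image</code> string embedded in the data URL is produced. Here is a minimal sketch using only the Python standard library; the <code>encode_image</code> helper name is mine, not part of the demo repository.)

```python
import base64


def encode_image(path: str) -> str:
    """Read an image file and return its base64 text for a data: URL.

    Hypothetical helper name; the demo repository may do this differently.
    """
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


# The returned string is what gets interpolated into the user message, e.g.:
# {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{encode_image('sketch.jpg')}"}}
```

The same encoded string can be reused for both API calls, so the sketch only needs to be read from disk once.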
In my article <a href="https://medium.com/gooddata-developers/the-road-to-ai-generated-visualizations-923428728479">The Road to AI-generated Visualizations</a>, I go over the whole process.</p><h4>Step 2: Create the dashboard</h4><pre>dashboard = ask_ai_to_create_dashboard(encoded_file)<br>create_dashboard(dashboard)</pre><p>Here, the AI sets the dashboard layout and finds the most suitable visualisations to populate it.</p><pre>outcome = client.chat.completions.create(<br>        model=&quot;gpt-4o&quot;,<br>        messages=[<br>            {&quot;role&quot;: &quot;system&quot;, &quot;content&quot;: INSTRUCTIONS},<br>            {&quot;role&quot;: &quot;assistant&quot;, &quot;content&quot;: f&quot;Here are the fields from which you can build dashboard filters. {PROMPT_DATASETS}&quot;},<br>            {&quot;role&quot;: &quot;assistant&quot;, &quot;content&quot;: f&quot;Here are existing visualizations which you can use when building dashboard: {PROMPT_EXAMPLES_OF_VISUALIZATION}&quot;},<br>            {&quot;role&quot;: &quot;assistant&quot;, &quot;content&quot;: f&quot;Here are already existing dashboards to learn from: {PROMPT_EXAMPLES_OF_DASHBOARDS}&quot;},<br>            {<br>                &quot;role&quot;: &quot;user&quot;,<br>                &quot;content&quot;: [<br>                    {<br>                        &quot;type&quot;: &quot;text&quot;,<br>                        &quot;text&quot;: &quot;Map what is on the image to the described dashboard structure. Use existing visualizations and connect it to the newly created dashboard.&quot;,<br>                    },<br>                    {<br>                        &quot;type&quot;: &quot;image_url&quot;,<br>                        &quot;image_url&quot;: {&quot;url&quot;: f&quot;data:image/jpeg;base64,{base64_image}&quot;},<br>                    },<br>                ],<br>            }<br>        ], <br>        response_format = schema<br>    )</pre><p>Have you noticed that I’m sending the image twice? 
It isn’t ideal, but in my experience, asking the AI to do many things at once often doesn’t lead to a good result. It yields much better results when I first create the visualisations and then build the workspace. These two calls have very different expected outputs (schemas) and prompts.</p><p>Next, you may also notice that, when building the dashboard in the second step, the AI looks for the visualisations created in the first step. In theory, it could find some of the already existing visualizations more suitable than the newly created ones, but I haven’t observed this behaviour so far.</p><p>Here is the final flow from sketch to an interactive dashboard:</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FaksSs_n4Vvc%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DaksSs_n4Vvc&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FaksSs_n4Vvc%2Fhqdefault.jpg&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/f9be151cdde3a5a72af27c84873e5672/href">https://medium.com/media/f9be151cdde3a5a72af27c84873e5672/href</a></iframe><h3>Conclusion</h3><p>Being able to transform sketched dashboards into their interactive twins greatly reduces the time to insight. Even if the transformation is not 100% correct and additional updates are needed, it is a welcome helper for anyone who builds analytics solutions.</p><p>Want to learn more about the GoodData analytics-as-code approach? 
Check out our other articles on <a href="https://medium.com/gooddata-developers">Medium</a>.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=4b9a1ba6e199" width="1" height="1" alt=""><hr><p><a href="https://medium.com/gooddata-developers/can-your-bi-tool-import-sketched-dashboards-4b9a1ba6e199">Can your BI tool import sketched dashboards?</a> was originally published in <a href="https://medium.com/gooddata-developers">GoodData Developers</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Are AI Frameworks Production-ready?]]></title>
            <link>https://medium.com/gooddata-developers/are-ai-frameworks-production-ready-fef852a83d00?source=rss----e8ee419648ea---4</link>
            <guid isPermaLink="false">https://medium.com/p/fef852a83d00</guid>
            <category><![CDATA[developer]]></category>
            <category><![CDATA[ai-framework]]></category>
            <category><![CDATA[ai]]></category>
            <dc:creator><![CDATA[Štěpán Machovský]]></dc:creator>
            <pubDate>Tue, 25 Mar 2025 12:21:17 GMT</pubDate>
            <atom:updated>2025-03-26T09:07:03.150Z</atom:updated>
            <content:encoded><![CDATA[<h4>Everyone is implementing AI. Are AI frameworks ready to help us maintain it?</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*IoUuh-BQy1ox_cqGl5AbJQ.png" /></figure><p>Nearly everyone is trying to implement some sort of AI into their product. From online shopping, like Amazon’s <a href="https://www.aboutamazon.com/news/retail/how-to-use-amazon-rufus">Rufus</a>, to e-mail helpers like <a href="https://support.google.com/mail/answer/14199860?hl=en&amp;co=GENIE.Platform%3DAndroid">Gemini in Gmail</a>.</p><p>Whether that’s a bad thing or not is beyond the scope of this article. However, when even small businesses are trying to use Large Language Models (LLMs), there is, unsurprisingly, significant demand for something that would help you abstract away everything related to the LLM and let you focus solely on the business logic of the application itself.</p><p>So it is only natural that I was asked whether it would make sense to incorporate one of these “somethings”. For now, these are open-source libraries that help you abstract away the technicalities of your use case, but in the future, it might be something different altogether.</p><p>As a part of my exploration, I’ve tried four of these frameworks:</p><ol><li><a href="https://github.com/langchain-ai/langchain">LangChain</a></li><li><a href="https://github.com/run-llama/llama_index">LlamaIndex</a></li><li><a href="https://github.com/deepset-ai/haystack">Haystack</a></li><li><a href="https://github.com/huggingface/smolagents/tree/main">smolagents</a></li></ol><blockquote>Just as a small disclaimer, I am not interested in a simple “slap a chatbot on it” approach, which has of course been implemented millions of times over and quite often only leads to distraught users. 
I am rather interested in more complex use-cases.</blockquote><h3>What are AI Frameworks?</h3><p>AI frameworks are essentially libraries that help you create LLM-based applications, from chatbots to an Alexa-like application that could potentially orchestrate your home IoT through voice input.</p><p>In theory, they should help you focus on what you want to create, rather than creating, abstracting and orchestrating all of the “AI magic” — LLM endpoints, vector stores and all the other already existing (and upcoming) fun stuff you can do with LLMs.</p><p>I started with only a handful of libraries, which are the most well-known in the industry. My logic behind this is that if none of them worked, I wouldn’t have to bother trying the others. But to be extra safe, I’ve included <strong>smolagents</strong> as well, which focuses solely on the agentic approach.</p><p>Before I give you a short rundown of each, let’s focus on why you would even consider using such a framework in the first place.</p><h3>Why would I want to use these frameworks?</h3><p>With these frameworks, you should be able to iterate much faster, because you would only be changing the way your application behaves, rather than focusing on the stuff happening behind the scenes.</p><p>Say there is a new, emerging trend in AI: some sort of optimization of how we work with LLMs (or vector databases). You could possibly change only one or two components, which are readily available from the framework, and thanks to the abstraction, you wouldn’t have to adjust everything around them.</p><p>Of course, for it to run optimally, you would have to tweak a thing or two, but it should help you keep your finger on the pulse of AI trends. And don’t get me wrong: in smaller-scale applications this is true, and I definitely recommend using one of these frameworks for PoCs.</p><p>Now, let’s look at how they work in the bigger picture, when you want to incorporate them into your product. 
This can get quite complex pretty quickly.</p><h3>Are they what they claim to be in practice?</h3><p>In my experience? Not really.</p><p>If I were to give you an analogy of how it feels to work with these frameworks, it’s as if you are riding a bike with training wheels.</p><p>When you start riding a bike, you might fall a few times. Well, that’s what training wheels are for! Suddenly you don’t have to worry about balance as much, and things “just work”. The first few trips with training wheels are bliss, and you get bolder and bolder. Now you are not only riding around your neighborhood, but would also like to go somewhere else.</p><p>Now you might get into some trouble, but it’s still better to play it safe and leave your training wheels on. You know, getting stuck in a sewer cover might be annoying, but it doesn’t really happen every time you go out, right?</p><p>And after a little while longer, you try to ride <a href="https://en.wikipedia.org/wiki/Single_track_(mountain_biking)">singletracks</a>. And oh boy, do things go south from here. On the first sharp curve, your training wheels force you off the track and you plunge head-first into a tree. You come home and tell yourself that you will remove the training wheels, but oh no, they are bolted onto the bike, so you have two options. You either buy a new bike, or you meticulously remove the training wheels with a grinder and then paint over the mess you will inevitably create.</p><p>As you might’ve guessed by now, the frameworks are the training wheels in this analogy. At first, you just need to chain prompts quickly. 
But later on (e.g., when debugging), you will need more granular control, and the abstraction stands in the way.</p><h3>Back to frameworks</h3><p>Now you might be thinking that I am being a little too harsh on the AI frameworks, and to be frank, I am not sure whether I am or not.</p><p>My initial experiments went pretty well, but once I started to test out more complex things, it went downhill pretty fast.</p><p>In today’s fast-paced AI world, where LLM endpoints seem to change every now and then, the last thing you need is another source of unreliability. A framework can lock you out of those changes, because they might not be reflected in your framework of choice for quite some time.</p><p>Also, when AI is only a small part of a larger complex system with multiple micro-services (and components), you might want to interact over gRPC to check whether some transactions are possible, etc. And for this, you would have to customize the solution quite heavily.</p><p>OK, now let’s look a little more into the details of these frameworks.</p><p>I only feel comfortable making judgments about LangChain and Haystack; for the others, I will just list some higher-level notes I made. I will definitely dig deeper in the upcoming articles.</p><h4><strong>LangChain</strong></h4><p><strong>Pros</strong>:</p><ul><li><em>Out of the bunch, it was definitely the easiest to iterate with.</em></li><li><em>Seemed pretty beginner-friendly.</em></li><li><em>It has the biggest and liveliest developer community.</em></li><li><em>Excellent when you need to create a PoC.</em></li></ul><p><strong>Cons</strong>:</p><ul><li>It<strong> changes your prompts.</strong></li><li>Not as customizable as I’d like it to be.</li><li>It <strong>changes your prompts.</strong></li><li>Too much abstraction.</li></ul><p>Alright, so let’s break it down for LangChain. The pros are definitely there and there is a reason why so many people are using it. 
Credit where credit is due.</p><p>However, you might’ve noticed that I’ve bolded and even repeated one of the cons. That is because it quite drastically changes your prompts “behind the scenes”. From my perspective, this is unacceptable, and it goes hand in hand with the last con: there is too much abstraction.</p><p>While it helps you keep your code relatively up to date with current trends, you lose granular control over each of the steps, and in the long run, you will lose a lot of your knowledge, because you won’t be using it. <em>Vibe-coding vibes anyone? :)</em></p><h4><strong>Haystack</strong></h4><p><strong>Pros</strong>:</p><ul><li><em>Highly customizable!</em></li><li><em>Lively community and a lot of usage examples.</em></li><li><em>Total control over every single step of the process.</em></li></ul><p><strong>Cons</strong>:</p><ul><li><em>Highly volatile.</em></li><li><em>Most of the usage examples are already out of date.</em></li><li><em>While customizable, you have to customize.</em></li><li><em>Fewer connections than LangChain.</em></li></ul><p>If I really had to go with one of the frameworks I’ve used so far, it would definitely be Haystack. It may not be the most complete when it comes to connections. It also makes you do a lot of custom coding, but that’s where Haystack shines.</p><p>When I was trying to create a small PoC for our use case, I chose DuckDB as the vector store. And no, I am not insane, don’t worry. I know it might not be the best vector store there is, but it is small, easy to implement, and I’ve worked with it in the past.</p><p>When implementing DuckDB for Haystack, I actually hit a wall, because Haystack does not support it. However, I wrote a custom component that simply imports LangChain’s connection and uses it instead.</p><p>Haystack is not all rainbows and sunshine. When I tried to use some of the Haystack examples, most of them were out of date and used an older version. 
It also forces you to make nearly everything custom (when non-trivial). I can see a future where we actually use Haystack, but I think we need to give it more time to stabilize.</p><h4><strong>LlamaIndex</strong></h4><p><strong>Pros:</strong></p><ul><li><em>Very good for RAG applications.</em></li><li><em>Excellent Data Connectors.</em></li><li><em>Well-suited for question answering.</em></li></ul><p><strong>Cons:</strong></p><ul><li><em>Seems more like a part of the AI stack rather than the backbone.</em></li><li><em>Not as feature-rich as LangChain.</em></li><li><em>Middle-of-the-road customizability.</em></li></ul><p>I don’t feel confident giving a judgment about LlamaIndex just yet. Please wait for one of the follow-up articles, where I will focus solely on LlamaIndex.</p><h4><strong>smolagents</strong></h4><p><strong>Pros</strong>:</p><ul><li><em>Very easy to create agents with.</em></li><li><em>Lightweight</em></li></ul><p><strong>Cons</strong>:</p><ul><li><em>Relies on third parties for its connections (LangChain, </em><a href="https://github.com/BerriAI/litellm"><em>LiteLLM</em></a><em>..)</em></li><li><em>Not really the perfect fit for our use case.</em></li></ul><p>While smolagents seems like a good library for a fun project where I could explore the limitations of the current agentic approach, I don’t feel it is a good fit for our use case, where we take a hybrid approach to ensure the absolute correctness of everything provided by the LLM (crunching the data and numbers and checking the responses).</p><p>In one of the upcoming articles, I’ll try to implement the whole architecture using this approach.</p><h3>Let’s give them a second chance</h3><p>Well, to be honest, my tests are a little dated, so I will try to make new ones and give these frameworks a new go.</p><p>This time, I will create an article for each of the frameworks and try to implement the same structure, which I will show you in the next article. 
At the end, I will create the application without any of these frameworks and give a final judgement on whether it makes sense for us to use them.</p><p>In the next article, I will show you a “complex” application where I will be utilizing LangChain, and then in subsequent ones, I will try to recreate the same functionality with multiple other frameworks. If you have a framework that you would like me to try, let me know in the comments! :)</p><h3>So, should you use AI frameworks?</h3><p>This <strong>boils down to what you expect</strong>. If you want to do a PoC, then go for it!</p><p>If you want to put AI into production, you should be a little more cautious about what you use, because while it can be worth it in the short term, it can definitely lock you out of some functionality in the long run.</p><p>If you are interested in our AI experiments, be sure to check out our <a href="https://medium.com/gooddata-developers">other articles on Medium</a>! :)</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=fef852a83d00" width="1" height="1" alt=""><hr><p><a href="https://medium.com/gooddata-developers/are-ai-frameworks-production-ready-fef852a83d00">Are AI Frameworks Production-ready?</a> was originally published in <a href="https://medium.com/gooddata-developers">GoodData Developers</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[From Sketch to Interactive Data: What Is Napkin Analytics All About?]]></title>
            <link>https://medium.com/gooddata-developers/from-sketch-to-interactive-data-what-is-napkin-analytics-all-about-d78024a43e6e?source=rss----e8ee419648ea---4</link>
            <guid isPermaLink="false">https://medium.com/p/d78024a43e6e</guid>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[business-intelligence]]></category>
            <category><![CDATA[data]]></category>
            <category><![CDATA[visualization]]></category>
            <category><![CDATA[analytics]]></category>
            <dc:creator><![CDATA[Tomáš Muchka]]></dc:creator>
            <pubDate>Thu, 20 Mar 2025 12:34:41 GMT</pubDate>
            <atom:updated>2025-03-25T19:43:04.049Z</atom:updated>
            <content:encoded><![CDATA[<h4>In high-stakes business moments, you often grab the closest tools available. It can even be a napkin. You sketch your vision on it, but does your BI know what to do with such a sketch?</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*6tDEsxwgxFd8z-EAfU7lAw.png" /></figure><p>Imagine you’re in a high-stakes moment. Maybe you’re pitching your startup idea to investors, brainstorming a new product feature with your team, or solving a critical analytics problem during a hackathon. The concept in your head is crystal clear, but you need a way to illustrate it. With no laptop in sight, you grab the closest tool available — a napkin — and sketch your vision.</p><figure><img alt="Two hand-drawn ink sketches on napkins depict surreal alien-themed scenes. The left napkin features a UFO hovering above a city skyline at night, with a crescent moon and stars in the sky. A tentacled alien with a bulbous body appears to be emerging from or interacting with the spacecraft. The right napkin shows a large grinning alien creature with a round, expressive face, tentacle-like appendages, and a segmented body floating above the city. Smaller sun-like creatures surround it, e" src="https://cdn-images-1.medium.com/max/1024/0*HalQbeflFDQNvAA0.jpg" /><figcaption><em>Tim Burton, the famous film director, producer, screenwriter, and animator, sketched most of his ideas on napkins. He even published a </em><a href="https://www.timburton.com/merchandise/napkin-book"><em>napkin book</em></a><em>. Image from </em><a href="https://www.timburton.com/merchandise/napkin-book">timburton.com</a>.</figcaption></figure><p>Your colleague gets it. The idea is solid. But now what? He can take the napkin… and throw it in the trash, because he still needs to manually recreate the visualization from scratch.</p><p>That’s the problem — your BI tool doesn’t know what to do with your napkin sketch. 
Or does it?</p><p>What if your quick doodle could be recognized, understood, and transformed into an interactive chart? What if AI could bridge the gap between analog sketches and digital analytics?</p><p>Let’s find out.</p><h3>Welcome to the era of napkin analytics</h3><p>I’m currently part of the AI team of the GoodData analytics platform, exploring the innovation potential of LLMs and generative AI in general.</p><p>In our experience, every analytics project starts with a drawing board (be it real or virtual), where business people and analysts agree on what to measure and how. This board is effectively a manifesto of how the analytics should behave. It would be a waste if analysts then had to go to the BI tool and manually re-create all those sketches and magnificent ideas.</p><p>That’s where napkin analytics comes into play. It <strong>takes all your analytics sketches and translates them into interactive analytics objects.</strong></p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2F3rN61MBTqXA%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3D3rN61MBTqXA&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2F3rN61MBTqXA%2Fhqdefault.jpg&amp;type=text%2Fhtml&amp;schema=youtube" width="640" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/8c4078fcc83e46d830c341e65f45c2e4/href">https://medium.com/media/8c4078fcc83e46d830c341e65f45c2e4/href</a></iframe><h3>How it works under the hood</h3><p>Napkin analytics consists of two major parts. 
First, it recognizes what is in the image; then the result is mapped onto the existing semantic layer and recreated as an interactive analytics object.</p><h4>Recognizing the image content</h4><p>Image recognition is a well-known challenge with many algorithms, such as <a href="https://en.wikipedia.org/wiki/Convolutional_neural_network">Convolutional Neural Networks</a>, <a href="https://en.wikipedia.org/wiki/You_Only_Look_Once">You Only Look Once</a>, or even non-deep-learning algorithms like <a href="https://en.wikipedia.org/wiki/Support_vector_machine">Support vector machines</a>.</p><p>In our prototype we actually used an LLM (GPT-4o), which reliably identifies the major characteristics of the chart. The performance is not great, but it is good enough for the demonstration.</p><p>In the image, we look for aspects such as the chart title, chart type, axis names, series names…</p><p>Example call to OpenAI to get the image description:</p><pre>{<br>  &quot;role&quot;: &quot;user&quot;,<br>  &quot;content&quot;: [<br>    {<br>      &quot;type&quot;: &quot;text&quot;,<br>      &quot;text&quot;: &quot;Map what is on the image to the described visualization structure. Use the available fields. If not found, use the most similar fields.&quot;,<br>    },<br>    {<br>      &quot;type&quot;: &quot;image_url&quot;,<br>      &quot;image_url&quot;: {&quot;url&quot;: f&quot;data:image/jpeg;base64,{base64_image}&quot;},<br>    },<br>  ],<br>}</pre><h4>Creating interactive charts</h4><p>Now let’s get to the meat. First we need to map the extracted chart properties onto existing analytics objects. Thankfully, GoodData uses the concept of a semantic layer, which greatly simplifies the work. All the fields with analytical meaning are available through a catalog, including their relationships.</p><p>With the selected analytics fields and additional chart properties, it should be quite easy to recreate the chart itself. 
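</p><p>For illustration, here is how the image-description payload shown above could be assembled in Python. This is a minimal sketch: the function name and the file handling are my assumptions; only the message structure and prompt text come from the snippet above.</p>

```python
import base64


def build_napkin_messages(image_bytes: bytes) -> list:
    """Build the chat messages asking an LLM to describe a napkin sketch.

    Mirrors the payload from the article: one text instruction plus the
    photo embedded inline as a base64 data URL.
    """
    base64_image = base64.b64encode(image_bytes).decode("utf-8")
    return [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": (
                        "Map what is on the image to the described visualization "
                        "structure. Use the available fields. If not found, use "
                        "the most similar fields."
                    ),
                },
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"},
                },
            ],
        }
    ]
```

<p>The resulting list would then be passed as <code>messages</code> to a vision-capable chat completions call (e.g. with <code>model=&quot;gpt-4o&quot;</code>).</p><p>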
But only if the BI tool supports some sort of programmatic description of the chart that is easy for the AI to grasp. GoodData has exactly such a human- (and LLM-) <a href="https://www.gooddata.com/docs/cloud/api-and-sdk/vs-code-extension/structures/">readable YAML representation</a> of all the analytics elements.</p><p>With all this knowledge, we used OpenAI’s <a href="https://platform.openai.com/docs/guides/structured-outputs">structured output</a> to force the LLM to return a valid chart object.</p><p>The first few lines of the JSON schema for GoodData visualizations:</p><pre>{<br>    &quot;type&quot;: &quot;json_schema&quot;,<br>    &quot;json_schema&quot;: {<br>        &quot;name&quot;: &quot;datasets_schema&quot;,<br>        &quot;schema&quot;: {<br>            &quot;type&quot;: &quot;object&quot;,<br>            &quot;title&quot;: &quot;Visualisation&quot;,<br>            &quot;description&quot;: &quot;JSON schema for GoodData Analytics Visualisation&quot;,<br>            &quot;properties&quot;: {<br>                &quot;type&quot;: {<br>                    &quot;description&quot;: &quot;Type of visualisation.&quot;,<br>                    &quot;type&quot;: &quot;string&quot;,<br>                    &quot;enum&quot;: [<br>                        &quot;table&quot;, &quot;bar_chart&quot;, &quot;column_chart&quot;, &quot;line_chart&quot;,<br>                        &quot;area_chart&quot;, &quot;scatter_chart&quot;, &quot;bubble_chart&quot;, &quot;pie_chart&quot;,<br>                        &quot;donut_chart&quot;, &quot;treemap_chart&quot;, &quot;pyramid_chart&quot;, &quot;funnel_chart&quot;,<br>                        &quot;heatmap_chart&quot;, &quot;bullet_chart&quot;, &quot;waterfall_chart&quot;,<br>                        &quot;dependency_wheel_chart&quot;, &quot;sankey_chart&quot;, &quot;headline_chart&quot;,<br>                        &quot;combo_chart&quot;, &quot;geo_chart&quot;, &quot;repeater_chart&quot;<br>                    ]<br>                },</pre><p>And here is the outcome:</p><figure><img alt="A screenshot of a browser window displaying three stages of chart creation. On the left, a hand-drawn sketch of a line chart labeled “Number of sets” on the vertical axis and “Date (years)” on the horizontal axis. In the center, a code snippet defining the chart as a line chart using structured data. On the right, the final interactive chart titled “Number of Sets Over Time,” showing plotted data points with “Number of Sets” on the vertical axis and “Launch date — Year” on the horizontal axis." src="https://cdn-images-1.medium.com/max/1024/0*aqMqd7a-4WDSait2" /><figcaption><em>The “flow of a napkin”: from image to YAML code and then finally to an interactive chart.</em></figcaption></figure><blockquote>Have you spotted the suspicious spike in the interactive chart? Keep an eye out for the next article, where I’ll share techniques for handling such anomalies.</blockquote><h4>Tech Stack</h4><p>By now you might wonder why I haven’t mentioned any programming languages. And I didn’t mention them for a good reason — it doesn’t really matter.</p><p>Our platform is made with Developer Experience (DX) in mind, which also includes high flexibility of implementation. This is part of the reason why GoodData is very well prepared for all the AI innovations to come, be it visualization generation or even helping you make decisions directly.</p><p>OK, code philosophy aside, most of my code is written in Python (with our <a href="https://www.gooddata.com/docs/python-sdk/latest/">PySDK</a>), and the whole backend is fewer than 100 lines of code (not counting the prompts…). Another 150 lines of code create a compatible front-end with a canvas for drawing the visualizations and all the other bells and whistles.</p><h3>Conclusion</h3><p>To sum this up, the basic idea is quite simple. The hidden knowledge lies, as usual with LLMs, in the initial prompt, tons of examples, and an API-first analytics platform. 
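</p><p>For illustration, the structured-output request can be wired up as below. This is a minimal sketch: the schema here is a deliberately shortened stand-in (the article shows only the first few lines of the real one), and the <code>strict</code>/<code>required</code> fields are my assumptions based on OpenAI’s structured-output requirements.</p>

```python
import json

# Shortened stand-in for the GoodData visualization schema from the article;
# only a trimmed "type" enum is included here for illustration.
response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "datasets_schema",
        "strict": True,
        "schema": {
            "type": "object",
            "title": "Visualisation",
            "description": "JSON schema for GoodData Analytics Visualisation",
            "properties": {
                "type": {
                    "description": "Type of visualisation.",
                    "type": "string",
                    "enum": ["table", "bar_chart", "line_chart"],  # trimmed
                },
            },
            "required": ["type"],
            "additionalProperties": False,
        },
    },
}

# Passed alongside the image messages, e.g.:
# completion = client.chat.completions.create(
#     model="gpt-4o", messages=messages, response_format=response_format)
# chart = json.loads(completion.choices[0].message.content)
```

<p>With <code>strict</code> mode, the model is constrained to emit JSON that validates against the schema, so the returned chart object can be deserialized directly.</p><p>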
Are you interested in knowing more about these? Then check out my article <a href="https://medium.com/gooddata-developers/the-road-to-ai-generated-visualizations-923428728479">The Road to AI-generated Visualizations</a>.</p><h4><strong>Why does this matter?</strong></h4><p>The most important thing is that there is less friction between your idea and your visualization. With this, the sky is the limit: you don’t need a deep understanding of the model; you only need to know what you want to see.</p><h4><strong>What’s Next? From Sketches to Dashboards</strong></h4><p>Luckily, visualizations are not the limit of sketches in analytics. The next stop is dashboards, so buckle up for a new article that I plan to publish in the near future!</p><hr><p><a href="https://medium.com/gooddata-developers/from-sketch-to-interactive-data-what-is-napkin-analytics-all-about-d78024a43e6e">From Sketch to Interactive Data: What Is Napkin Analytics All About?</a> was originally published in <a href="https://medium.com/gooddata-developers">GoodData Developers</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>