<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Michael Hunger on Medium]]></title>
        <description><![CDATA[Stories by Michael Hunger on Medium]]></description>
        <link>https://medium.com/@mesirii?source=rss-3865848842f9------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/0*hmhjbwkprwEOimDT.jpg</url>
            <title>Stories by Michael Hunger on Medium</title>
            <link>https://medium.com/@mesirii?source=rss-3865848842f9------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Wed, 08 Apr 2026 15:56:57 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@mesirii/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Building AI Agents With the Google Gen AI Toolbox and Neo4j Knowledge Graphs]]></title>
            <link>https://medium.com/neo4j/building-ai-agents-with-the-google-gen-ai-toolbox-and-neo4j-knowledge-graphs-86526659b46a?source=rss-3865848842f9------2</link>
            <guid isPermaLink="false">https://medium.com/p/86526659b46a</guid>
            <category><![CDATA[genai]]></category>
            <category><![CDATA[neo4j]]></category>
            <category><![CDATA[knowledge-graph]]></category>
            <category><![CDATA[agentic-ai]]></category>
            <category><![CDATA[generative-ai-tools]]></category>
            <dc:creator><![CDATA[Michael Hunger]]></dc:creator>
            <pubDate>Mon, 10 Mar 2025 18:20:09 GMT</pubDate>
            <atom:updated>2025-03-12T19:20:22.466Z</atom:updated>
            <content:encoded><![CDATA[<p><a href="https://medium.com/u/3865848842f9">Michael Hunger</a>, Head of Product Innovation, Neo4j<br><a href="https://medium.com/u/f17c1f95a916">Kurtis Van Gent</a>, Staff Software Engineer, Google</p><blockquote>Interested in learning more? Watch the recording of the <a href="https://youtube.com/live/0p3S56JnTCg">Gen AI Toolbox + Neo4j livestream</a> from March 12, 2025.</blockquote><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2F0p3S56JnTCg%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3D0p3S56JnTCg&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2F0p3S56JnTCg%2Fhqdefault.jpg&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/4c0476cd2fce369820dae0f693b1bdea/href">https://medium.com/media/4c0476cd2fce369820dae0f693b1bdea/href</a></iframe><h3>Introduction</h3><p>The Gen AI Toolbox for Databases, recently <a href="https://cloud.google.com/blog/products/ai-machine-learning/announcing-gen-ai-toolbox-for-databases-get-started-today?e=48754805&amp;hl=en">launched in collaboration with LangChain</a>, now includes a Neo4j integration.</p><p><a href="https://cloud.google.com/blog/products/ai-machine-learning/announcing-gen-ai-toolbox-for-databases-get-started-today?e=48754805&amp;hl=en">Announcing Gen AI Toolbox for Databases. Get started today | Google Cloud Blog</a></p><p>This collaboration began following discussions with Kurtis Van Gent after the <a href="https://neo4j.com/blog/developer/genai-graph-gathering/">GraphRAG gathering</a>, where he proposed better infrastructure to improve database integration in agentic systems as both sources and tools. 
The addition of Neo4j brings knowledge graph capabilities to Toolbox users, expanding the functionality for database management and Gen AI applications.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/796/0*iNcgBWfFFnOkSv6z" /></figure><h3>Agentic Architectures</h3><p>While you might have heard much about AI agents, here is a quick recap of how they differ from simple retrieval-augmented generation (RAG) approaches and what that implies.</p><p>In traditional <a href="https://cloud.google.com/use-cases/retrieval-augmented-generation?hl=en"><strong>RAG</strong></a> flows, the user question is not answered by an LLM directly; instead, relevant information from trusted data sources is retrieved first and then provided together with the question as part of the LLM prompt to be used for answering. So the LLM is only used for its language skills, not for its pretrained knowledge. This helps with the relevance and accuracy of the answers and, partially, with explainability.</p><p>With <a href="https://graphrag.com"><strong>GraphRAG</strong></a>, a knowledge graph is the data source, which goes way beyond text-fragment retrieval from a vector store. The connected information in the graph provides relevant context from related entities, documents, and structural summaries. Due to the structured nature of the retrieved information, explaining and transparently showing the detailed sources of the answers also becomes much easier.</p><p>In <strong>agentic architectures</strong>, the LLM is equipped with a set of tools for information retrieval and taking action on the user’s behalf. The LLM takes a much more active role in analyzing and dissecting the question, creating a plan for using a set of tools to retrieve information for the different inputs and requested results, and then executing it. 
During execution, the Agent-LLM can choose to call tools multiple times in sequence or loops or even drill down into more complex information retrieval by using previously retrieved results as inputs for new tool calls. While running in the agentic loop, it observes if the data collected is sufficient to answer the user’s question and also applies guardrails and other techniques like re-ranking, judging, and selection by relevancy.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/960/0*gj7tWF-vccGGrp3J" /><figcaption>Agent architecture building blocks</figcaption></figure><p>Due to the structured analysis and parameter extraction for tool calls, the need for vector similarity search decreases. In general, the work of the LLM agent can be seen as similar to a query planning (and execution) process in the database world.</p><p>The tools made available to the agent have a variety of options and shapes, from code execution in (Python) sandboxes to REST API calls to using databases for retrieval to using other LLMs — there are many possibilities. Most tools will be used for information retrieval or processing, and only a few per agent will allow updates and modifying actions.</p><p>Most LLMs today already support tool calls via function calling, which allows you to pass in a list of signatures (name, description, parameters, results) of functions which the LLM then selects from and provides parameters for execution. 
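To make that mechanism concrete, here is a minimal sketch of the dispatch side of function calling, assuming the LLM has already chosen a tool and emitted the call as a JSON object. The tool name and the stubbed data are hypothetical, standing in for a real database-backed tool:

```python
import json

# Hypothetical database-backed tool; here it is stubbed with a dict.
def search_movies_by_actor(name: str, year: int) -> list:
    data = {("Tom Hanks", 1993): ["Forrest Gump", "Apollo 13"]}
    return data.get((name, year), [])

# Registry of callable tools. Their names, descriptions, and parameter
# signatures are what the LLM sees when planning which tool to call.
TOOLS = {"search_movies_by_actor": search_movies_by_actor}

def dispatch(tool_call: str):
    """Execute a tool call the LLM emitted as a JSON object."""
    call = json.loads(tool_call)
    return TOOLS[call["name"]](**call["arguments"])

movies = dispatch(
    '{"name": "search_movies_by_actor",'
    ' "arguments": {"name": "Tom Hanks", "year": 1993}}'
)
```

The agent framework repeats this dispatch step inside the agentic loop, feeding each result back into the conversation until the question can be answered.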
Agent frameworks add a lot of additional functionality — like guardrails, graph-based workflows, execution memory, and the aforementioned techniques.</p><p>While agents are a powerful mechanism that (as shown in the example below) works surprisingly well even for answering complex questions and tasks and keeping conversations flowing smoothly, there are some challenges, including:</p><ul><li>Tool configuration and descriptions have to be consistent and detailed enough for the LLM to make the correct choices.</li><li>Tool execution management was often left to the developer, so resource management was suboptimal unless the developer was familiar with the underlying data infrastructure.</li><li>Database query tools and other tools couldn’t be configured by data analysts because significant extra work (an OpenAPI spec, function-call signatures, Python code for execution) had to be provided.</li><li>The right tool has to be selected, so it needs to be very clear from the name and description what each tool does and which sub-tasks it applies to.</li><li>The tool has to be very specific about the kinds and data types of its parameters so that the LLM can do conversions and translations before passing the parameters to the tool.</li><li>Parameters need to be used in an injection-safe way to prevent code injections, which could lead to data exfiltration or unrestricted modifications.</li><li>Vector search is less relevant in agentic setups: you will rarely pass the user question as a whole phrase to embedding generation and vector similarity search, but rather do structured parameter extraction and specific tool calls.</li></ul><h3>Gen AI Toolbox</h3><p>When it comes to creating tools that access databases, there are usually a number of challenges that can be difficult to tackle with today’s orchestration frameworks: authentication, authorization, sanitization, connection pooling, and more. 
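Sanitization in particular is worth a concrete illustration. A tool that splices user input into the query text can have the query's meaning rewritten, while a parameterized statement keeps the query text fixed. This is an illustrative sketch, not Toolbox code:

```python
# A hostile "actor name" supplied by a user:
user_input = '" OR true RETURN p.secret //'

# Unsafe: the input becomes part of the Cypher text and changes its meaning.
naive_query = f'MATCH (p:Person) WHERE p.name = "{user_input}" RETURN p.name'

# Safe: the statement is fixed; the value travels separately as a parameter
# (e.g. session.run(safe_query, name=user_input) with the Neo4j Python driver).
safe_query = "MATCH (p:Person) WHERE p.name = $name RETURN p.name"
safe_params = {"name": user_input}
```

Toolbox tools use exactly this parameterized style, as the `$name`/`$year` placeholders in the examples below show.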
These challenges quickly become a burden, slowing down development and leaving room for mistakes when implementing the same boilerplate code over and over again.</p><p>Enter Toolbox — the open source server that makes it easier to build Gen AI tools for interacting with databases. It enables you to develop tools more easily, quickly, and securely by handling complexities such as connection pooling, authentication, and more. Toolbox lets you define the tools in a single location and integrate them with your LangGraph agents in less than three lines of code:</p><pre>from toolbox_langchain import ToolboxClient<br><br># update the URL to point to your server<br>client = ToolboxClient(&quot;http://127.0.0.1:5000&quot;)<br><br># these tools can be passed to your agent! <br>tools = await client.aload_toolset()</pre><p>Toolbox is truly open source: as part of the launch, Google Cloud partnered with other database vendors, including Neo4j, to build support for a large number of open source and proprietary databases:</p><ul><li>PostgreSQL (including <a href="https://googleapis.github.io/genai-toolbox/resources/sources/alloydb-pg/">AlloyDB</a> and <a href="https://googleapis.github.io/genai-toolbox/resources/sources/cloud-sql-pg/">Cloud SQL for PostgreSQL</a>)</li><li><a href="https://googleapis.github.io/genai-toolbox/resources/sources/mysql/">MySQL</a> (including <a href="https://googleapis.github.io/genai-toolbox/resources/sources/cloud-sql-mysql/">Cloud SQL for MySQL</a>)</li><li><a href="https://googleapis.github.io/genai-toolbox/resources/sources/mssql/">SQL Server</a> (including <a href="https://googleapis.github.io/genai-toolbox/resources/sources/cloud-sql-mssql/">Cloud SQL for SQL Server</a>)</li><li><a href="https://googleapis.github.io/genai-toolbox/resources/sources/spanner/">Spanner</a></li><li><a href="https://googleapis.github.io/genai-toolbox/resources/sources/neo4j/"><strong>Neo4j</strong></a></li></ul><p>We also have a number of features to help your application hit 
production more quickly — including built-in support for <a href="https://googleapis.github.io/genai-toolbox/resources/authsources/">end-user authentication in tools</a> and built-in <a href="https://googleapis.github.io/genai-toolbox/concepts/telemetry/">observability</a> through OpenTelemetry.</p><h3>Neo4j Cypher Sources and Tools</h3><p>The Neo4j integration Google added to the Gen AI Toolbox supports the definition of sources and tools for Cypher execution. <a href="https://googleapis.github.io/genai-toolbox/resources/sources/">Sources</a> represent how to connect to your Neo4j instance — things like the URI, credentials, etc. <a href="https://googleapis.github.io/genai-toolbox/resources/tools/">Tools</a> represent specific actions your agent can take — such as reading or writing to a source.</p><p><a href="https://neo4j.com/genai">Neo4j</a> is the leading open source graph database — it manages information not as tables but as entities (nodes) and relationships between them, allowing a flexible and powerful representation of connected information. Graphs add unique capabilities for many domains like biomedical, supply chain, manufacturing, fraud detection, and transport logistics. <a href="https://neo4j.com/blog/what-is-knowledge-graph/">Knowledge graphs</a>, which you can think of as digital twins of your organization (people, processes, products, partners, etc.), 
are a great “factual memory” companion to an LLM’s language skills.</p><p>After <a href="https://googleapis.github.io/genai-toolbox/getting-started/local_quickstart/#step-2-install-and-configure-toolbox">fetching the Toolbox binary</a>, you can provide definitions for multiple Neo4j <a href="https://googleapis.github.io/genai-toolbox/resources/sources/neo4j/"><strong>sources</strong></a> by configuring your database connection details (URL, username, password) as named sources in a file called tools.yaml, provided to the Toolbox binary at startup.</p><p>Tools that you define later are tied to a source since, in most cases, the queries are specific to the data model of that dataset.</p><pre>sources:<br>    my-neo4j-source:<br>        kind: &quot;neo4j&quot;<br>        uri: &quot;neo4j+s://xxxx.databases.neo4j.io:7687&quot;<br>        user: &quot;neo4j&quot;<br>        password: &quot;my-password&quot;<br>        database: &quot;neo4j&quot;</pre><p><a href="https://googleapis.github.io/genai-toolbox/resources/tools/">Tools</a> are specific statements that you want to execute against a source. To be usable in an agentic setup, it is really important to describe the tool, parameters, and results in enough detail so that the LLM can reason about its applicability.</p><p>Statements in tools can be <strong>generic</strong> — like fetching a single entity by a predicate, getting the first- or second-degree neighborhood, or even finding the shortest paths between entities. They can be <strong>specific</strong>, written by a subject-matter expert, following a more complex set of paths in your data (retrieving the gene or protein expression targets of a drug or getting product recommendations for a user, for example). 
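As a sketch of what such a specific, expert-written tool could look like, here is a hypothetical product-recommendation tool in the same tools.yaml format. The `User`/`Product` labels, the `BOUGHT` relationship type, and the query itself are illustrative assumptions, not taken from a real dataset:

```yaml
tools:
  recommend_products:
    kind: neo4j-cypher
    source: my-neo4j-source
    statement: |
      MATCH (u:User {id: $user_id})-[:BOUGHT]->(:Product)
            <-[:BOUGHT]-(:User)-[:BOUGHT]->(rec:Product)
      WHERE NOT (u)-[:BOUGHT]->(rec)
      RETURN rec.name AS product, count(*) AS score
      ORDER BY score DESC LIMIT 5
    description: |
      Product recommendations for a user, based on what other buyers of
      the same products also bought. Takes a user ID and returns up to
      five product names with a relevance score.
    parameters:
      - name: user_id
        type: string
        description: ID of the user to recommend products for
```

The domain knowledge lives entirely in the Cypher statement, so a data analyst can author such a tool without writing any agent code.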
Tools can also <strong>update data</strong> in the graph by updating attributes or adding nodes and relationships representing business concepts.</p><p>They can also make use of advanced functionality like running graph algorithms for clustering, doing spatial operations, or executing vector similarity or full-text search for text phrases.</p><p>Below, you see a simple example of a <a href="https://googleapis.github.io/genai-toolbox/resources/tools/neo4j-cypher/">neo4j-cypher tool</a> retrieving a set of movies by actor and minimum release year. Note the detailed description of the tool and the parameters. That allows the LLM to reason and choose the right tool at each step of planning and execution.</p><p>Each tool has the following configuration:</p><ul><li>A name</li><li>An associated source</li><li>The Cypher query statement using parameters, doing the graph pattern match, and returning results</li><li>A description of the tool/query</li><li>Name, type, and description of each parameter</li></ul><pre>tools:<br>  search_movies_by_actor:<br>    kind: neo4j-cypher<br>    source: my-neo4j-movies-instance<br>    statement: |<br>      MATCH (m:Movie)&lt;-[:ACTED_IN]-(p:Person)<br>      WHERE p.name = $name AND m.year &gt; $year<br>      RETURN m.title, m.year<br>      ORDER BY m.year DESC<br>      LIMIT 10<br>    description: |<br>      Use this tool to get a list of movies for a specific actor and a given<br>      minimum release year. Takes a full actor name, e.g. &quot;Tom Hanks&quot;, and a<br>      year, e.g. 1993, and returns a list of movie titles and release years.<br>      Do NOT use this tool with a movie title. Do NOT guess an actor name.<br>      Do NOT guess a year. An actor name is a full name with first and last<br>      name separated by a space. For example, if given &quot;Hanks, Tom&quot;, the<br>      actor name is &quot;Tom Hanks&quot;. If the tool returns more than one option,<br>      choose the most recent movies.<br>      Example:<br>      {{<br>          &quot;name&quot;: &quot;Meg Ryan&quot;,<br>          &quot;year&quot;: 1993<br>      }}<br>      Example:<br>      {{<br>          &quot;name&quot;: &quot;Clint Eastwood&quot;,<br>          &quot;year&quot;: 2000<br>      }}<br>    parameters:<br>      - name: name<br>        type: string<br>        description: Full actor name, &quot;firstname lastname&quot;<br>      - name: year<br>        type: integer<br>        description: 4-digit year from 1900 up to the current year</pre><h3>Investment Research Agent</h3><p>This is a demonstration of an agentic LangChain application with tools that use GraphRAG patterns combining full-text and graph search.</p><p>The example represents an <strong>investment research agent</strong> that can be used to find recent news about companies, their investors, competitors, partners, and industries. 
It is powered by data from the Diffbot knowledge graph that was imported into Neo4j.</p><p>The code for the example can be found in <a href="https://github.com/jexp/toolbox-companies">this repository</a>:</p><p><a href="https://github.com/jexp/toolbox-companies">GitHub - jexp/toolbox-companies: Google AI Toolbox Investment Research Agent</a></p><h3>Companies and Articles: Diffbot Dataset</h3><p>The dataset is a graph of companies, the industries associated with them, people who work at or invested in the companies, and articles that report on those companies.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*sNEwrOwyosb_E_Lt" /><figcaption>Article, industry, and company graph</figcaption></figure><p>The news articles are chunked, and the chunks are also stored in the graph.</p><p>The database is publicly available with a read-only user. You can explore the data at <a href="https://demo.neo4jlabs.com:7473/browser/">https://demo.neo4jlabs.com:7473/browser/</a>.</p><ul><li>URI: neo4j+s://demo.neo4jlabs.com</li><li>User: companies</li><li>Password: companies</li><li>Database: companies</li></ul><p>We use the agentic LangChain integration with Vertex AI, which allows us to pass the tools we registered with Toolbox automatically to the LLM for tool calling. 
We will use hybrid search, as well as parent-child retrievers and GraphRAG (extract relevant context).</p><p>In our configuration, we provide tools that make use of the full-text index, as well as graph retrieval queries, which fetch the following additional information:</p><ul><li>Parent article of the Chunk (aggregate all chunks for a single article)</li><li>Organization(s) mentioned</li><li>IndustryCategory(ies) for the Organization</li><li>Person(s) connected to the Organization and their roles (e.g., Investor, Chairman, CEO)</li></ul><h3>Tools</h3><p>These are the tools with specific queries that we make available in Toolbox for the agent to use.</p><p>Each tool takes parameters from the agent, executes a graph query, and returns structured results:</p><ul><li>industries — List of industry category names</li><li>companies_in_industry — Companies (ID, name, summary) in a given industry by industry</li><li>companies — List of companies (ID, name, summary) by full-text search</li><li>articles_in_month — List of articles (ID, author, title, date, sentiment) in a month timeframe from the given date (yyyy-mm-dd)</li><li>article — Single article details (ID, author, title, date, sentiment, site, summary, content) by article ID</li><li>companies_in_articles — Companies (ID, name, summary) mentioned in articles by list of article IDs</li><li>people_at_company — People (name, role) associated with a company by company ID</li></ul><h3>Configuration in Toolbox</h3><h4>Neo4j Source Configuration</h4><p>In the sources section of our tools.yaml file, we configure the public Neo4j companies graph database connection as a source:</p><pre>sources:<br>    companies-graph:<br>        kind: &quot;neo4j&quot;<br>        uri: &quot;neo4j+s://demo.neo4jlabs.com&quot;<br>        user: &quot;companies&quot;<br>        password: &quot;companies&quot;<br>        database: &quot;companies&quot;</pre><h3>Neo4j Cypher Tools Setup</h3><p>Now we can configure the first two tools we want to 
test with the Toolbox; the other tools will be configured and described in a similar way as shown below:</p><ul><li>industries — List of industry names</li><li>companies_in_industry — Companies (ID, name, summary) in a given industry by industry</li></ul><pre>tools:<br>  industries:<br>    kind: neo4j-cypher<br>    source: companies-graph<br>    statement: |<br>      MATCH (i:IndustryCategory) <br>      RETURN i.name as industry<br>    description: List of Industry names<br><br>  companies_in_industry:<br>    kind: neo4j-cypher<br>    source: companies-graph<br>    statement: |<br>      MATCH (:IndustryCategory {name:$industry})&lt;-[:HAS_CATEGORY]-(c) <br>      WHERE NOT EXISTS { (c)&lt;-[:HAS_SUBSIDARY]-() }<br>      RETURN c.id as company_id, c.name as name, c.summary as summary<br>    description: Companies (company_id, name, summary) in a given industry by industry<br>    parameters:<br>      - name: industry<br>        type: string<br>        description: Industry name to filter companies by</pre><p>After starting Toolbox with our YAML configuration file, it serves the tool definitions via an API, which can be retrieved with the ToolboxClient and passed to LangChain:</p><pre># one of linux/amd64, darwin/arm64, darwin/amd64, or windows/amd64<br>export OS=&quot;linux/amd64&quot; <br>curl -O https://storage.googleapis.com/genai-toolbox/v0.1.0/$OS/toolbox<br>chmod +x toolbox<br><br>./toolbox --tools_file tools.yaml<br>2025-02-13T13:50:22.052198+01:00 INFO &quot;Initialized 1 sources.&quot; <br>2025-02-13T13:50:22.052967+01:00 INFO &quot;Initialized 0 authSources.&quot; <br>2025-02-13T13:50:22.053095+01:00 INFO &quot;Initialized 2 tools.&quot; <br>2025-02-13T13:50:22.053146+01:00 INFO &quot;Initialized 1 toolsets.&quot; <br>2025-02-13T13:50:22.053742+01:00 INFO &quot;Server ready to serve!&quot;</pre><p>We can now use LangChain with the Gemini 2.0 Flash model and feed our tool definitions to the Model to do a quick test. 
We can follow the <a href="https://googleapis.github.io/genai-toolbox/getting-started/local_quickstart/#step-3-connect-your-agent-to-toolbox">Quickstart example</a> in the Toolbox documentation.</p><pre>from langgraph.prebuilt import create_react_agent<br>from langchain_google_vertexai import ChatVertexAI<br>from langgraph.checkpoint.memory import MemorySaver<br><br>from toolbox_langchain import ToolboxClient<br><br>prompt = &quot;&quot;&quot;<br>  You&#39;re a helpful investment research assistant. <br>  You can use the provided tools to search for companies, <br>  people at companies, industries, and news articles from 2023.<br>  Don&#39;t ask for confirmations from the user.<br>  User: <br>&quot;&quot;&quot;<br><br>queries = [<br>    &quot;What industries deal with computer manufacturing?&quot;,<br>    &quot;List 5 companies in the computer manufacturing industry &quot;<br>    &quot;with their description.&quot;,<br>]<br><br>def main():<br>    model = ChatVertexAI(model_name=&quot;gemini-2.0-flash&quot;)<br><br>    client = ToolboxClient(&quot;http://127.0.0.1:5000&quot;)<br>    tools = client.load_toolset()<br><br>    agent = create_react_agent(model, tools, checkpointer=MemorySaver())<br><br>    config = {&quot;configurable&quot;: {&quot;thread_id&quot;: &quot;thread-1&quot;}}<br>    for query in queries:<br>        inputs = {&quot;messages&quot;: [(&quot;user&quot;, prompt + query)]}<br>        response = agent.invoke(inputs, stream_mode=&quot;values&quot;, config=config)<br>        print(response[&quot;messages&quot;][-1].content)<br><br>main()</pre><p>The outputs we get are:</p><pre>The industries that deal with computer manufacturing are: <br>Computer Hardware Companies, Electronic Products Manufacturers, <br>and Computer Storage Companies.<br><br>Here are 5 companies in the computer hardware industry along with their <br>descriptions:<br><br>1.  **Microsoft Egypt:** Microsoft branch in Egypt<br>2.  
**Apigee:** Software company based in San Jose, California, United States and owned by Google<br>3.  **Microsemi:** Communications corporation<br>4.  **Intermec:** American electronics manufacturing company<br>5.  **Elitegroup Computer Systems:** No summary available</pre><p>Now that this works, we can add the other tools as well:</p><ul><li>companies — List of companies (ID, name, summary) by full-text search</li><li>articles_in_month — List of articles (ID, author, title, date, sentiment) in a month timeframe from the given date (yyyy-mm-dd)</li><li>article — Single article details (ID, author, title, date, sentiment, site, summary, content) by article ID</li><li>companies_in_articles — Companies (ID, name, summary) mentioned in articles by list of article IDs</li><li>people_at_company — People (name, role) associated with a company by company ID</li></ul><p>Please note that to handle array parameters, you have to provide an “items” subgroup that specifies the sub-type and description:</p><pre>tools:<br><br>  companies:<br>    kind: neo4j-cypher<br>    source: companies-graph<br>    statement: |<br>      CALL db.index.fulltext.queryNodes(&#39;entity&#39;, $search, {limit: 100}) <br>      YIELD node as c, score WHERE c:Organization <br>      AND NOT EXISTS { (c)&lt;-[:HAS_SUBSIDARY]-() }<br>      RETURN c.id as company_id, c.name as name, c.summary as summary<br>    description: List of Companies (company_id, name, summary) by fulltext search<br>    parameters:<br>      - name: search<br>        type: string<br>        description: Part of a name of a company to search for<br><br>  articles_in_month:<br>    kind: neo4j-cypher<br>    source: companies-graph<br>    statement: |<br>      MATCH (a:Article)<br>      WHERE date($date) &lt;= date(a.date) &lt; date($date) + duration(&#39;P1M&#39;)<br>      RETURN a.id as article_id, a.author as author, a.title as title, toString(a.date) as date, a.sentiment as sentiment<br>      LIMIT 25<br>    description: List of 
Articles (article_id, author, title, date, sentiment) in a month timeframe from the given date<br>    parameters:<br>      - name: date<br>        type: string<br>        description: Start date in yyyy-mm-dd format<br><br>  article:<br>    kind: neo4j-cypher<br>    source: companies-graph<br>    statement: |<br>      MATCH (a:Article)-[:HAS_CHUNK]-&gt;(c:Chunk)<br>      WHERE a.id = $article_id<br>      WITH a, c ORDER BY id(c) ASC<br>      WITH a, collect(c.text) as contents<br>      RETURN a.id as article_id, a.author as author, a.title as title, toString(a.date) as date,<br>      a.summary as summary, a.siteName as site, a.sentiment as sentiment, apoc.text.join(contents, &#39; &#39;) as content<br>    description: Single Article details (article_id, author, title, date, sentiment, site, summary, content) by article id<br>    parameters:<br>      - name: article_id<br>        type: string<br>        description: ID of the article to retrieve<br><br>  companies_in_articles:<br>    kind: neo4j-cypher<br>    source: companies-graph<br>    statement: |<br>      MATCH (a:Article)-[:MENTIONS]-&gt;(c)<br>      WHERE a.id = $article_id AND not exists { (c)&lt;-[:HAS_SUBSIDARY]-() }<br>      RETURN c.id as company_id, c.name as name, c.summary as summary<br>    description: Companies (company_id, name, summary) mentioned in articles by article id<br>    parameters:<br>      - name: article_id<br>        type: string<br>        description: Article id to find companies mentioned in<br><br>  people_at_company:<br>    kind: neo4j-cypher<br>    source: companies-graph<br>    statement: |<br>      MATCH (c:Organization)-[role]-(p:Person) WHERE c.id = $company_id<br>      RETURN replace(type(role),&quot;HAS_&quot;,&quot;&quot;) as role, p.name as person_name, c.id as company_id, c.name as company_name<br>    description: People (person_name, role, company_id, company_name) associated with a company by company id<br>    parameters:<br>      - name: company_id<br>        type: 
string<br>        description: Company id to find associated people for</pre><h3>Using the Agent — Conversation</h3><p>I’ve been impressed by the following conversations and the LLM’s ability to keep using the tools correctly until all the information needed to answer the question has been collected.</p><p>The LLM doesn’t just call each endpoint once; it also refers back to information from the conversation history and ensures that the relevant context is taken into account.</p><p>It can fix misspelled or incorrect parameters, reformat dates into the required shape, and more. API functions are called multiple times and can cascade if needed — for instance, retrieving a list of IDs for a certain context and then making individual calls for each ID to retrieve details.</p><p>It is also able to resolve pronouns and demonstratives (e.g., “these” is resolved to a concrete list of company IDs or articles that are then used to make multiple API calls).</p><p>For each question, I’ll comment on what makes it tricky and how the LLM agent resolved it.</p><p>Let’s put some questions to our research agent and see how it fares:</p><ul><li>What five industries deal with neurological implants?</li><li>List three companies from those industries with their description and filter afterward by California.</li><li>Who is working at these companies?</li><li>Which companies are mentioned in the news in January 2023 with positive sentiment?</li><li>Who is working there as board members?</li></ul><p>The answers:</p><p><strong>What 5 industries deal with neurological implants?</strong></p><p><em>Initially, we do an easy one — the task here is to call the industries endpoint, but then only limit it to five results (the endpoint takes no limit argument). 
Also, we’re vague about the actual industry.</em></p><p>Based on the available industries, the following seem most relevant to neurological implants:</p><ul><li>Medical technology companies</li><li>Healthcare companies</li><li>Biotechnology companies</li><li>Pharmaceutical companies</li><li>Robotics companies</li></ul><p><strong>List three companies from those industries with their description and filter afterward by California.</strong></p><p><em>Using prior information, fixing parameters, and post-filtering. Here in this question, we provide general back-reference to the industries and an additional location that is not a parameter but needs to be post-filtered.</em></p><p><em>It uses the prior information in the conversation to fix the industry name and applies the post-filter and limit to only show three California companies.</em></p><p>OK, I have a lot of companies. I will filter for companies located in California, then list five of them with their descriptions:</p><ul><li><strong>goBalto:</strong> (biotechnology companies, pharmaceutical companies) Software company based in San Francisco, California, United States</li><li><strong>Quantenna Communications:</strong> (semiconductor companies, electronic products manufacturers) Manufacturing company based in San Jose, California, United States and owned by ON Semiconductor</li><li><strong>…</strong></li></ul><p><strong>Who is working at these companies?</strong></p><p><em>In our next question, we reference back with a “these” for the companies and ask for details that are available in a separate endpoint. 
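This fan-out pattern (a remembered list of IDs, then one tool call per ID) is easy to picture in code. Here is a sketch with a stubbed people_at_company tool; the stub data echoes the answers shown below, and the structure is an assumption about what the agent does internally:

```python
# Stub of the people_at_company tool, keyed by company ID (illustrative data).
PEOPLE = {
    "goBalto": [("Sujay Jadhav", "CEO"), ("Darren Hite", "BOARD_MEMBER")],
    "Quantenna": [("Sam Heidari", "CEO")],
}

def people_at_company(company_id: str) -> list:
    """Single-ID tool: people and roles for one company."""
    return PEOPLE.get(company_id, [])

# The agent remembers the company IDs from earlier in the conversation
# and calls the single-ID tool once per company, then aggregates.
remembered_ids = ["goBalto", "Quantenna"]
staff = {cid: people_at_company(cid) for cid in remembered_ids}
```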
But the agent has to call the people_at_company endpoint multiple times, as it only takes a single company ID.</em></p><p>Here are some people working at the listed companies:</p><ul><li><strong>goBalto:</strong> Sujay Jadhav (CEO), Darren Hite (BOARD_MEMBER), Masashi Kiyomine (BOARD_MEMBER)</li><li><strong>Quantenna Communications:</strong> Sam Heidari (CEO), Dmitry Akhanov (BOARD_MEMBER), Edwin B Hooper III “Ned” (BOARD_MEMBER), Jack Lazar (BOARD_MEMBER), Mark Stevens (BOARD_MEMBER)</li><li><strong>Finisar:</strong> Bob Bashaw (CEO), Mary Jane Raymond (BOARD_MEMBER)</li><li><strong>…</strong></li></ul><p><strong>Which companies are mentioned in the news in January 2023 with positive sentiment?</strong></p><p><em>Here, we want to do two things, switching the conversation from companies to articles without restarting the session. We give a partial date of January 2023, which has to be converted to the format yyyy-mm-dd for the API call. It has to understand that “good sentiment” refers to the numeric score and that we want to have the highest score. The list of articles that come with sentiment output, the model needs to read, understand, sort by sentiment descending, select the highest-ranked articles, and output their titles and the companies mentioned in these. 
It is not the same endpoint, so it needs to take the article IDs and check the companies_in_article tool for each article individually.</em></p><p>Here are the articles from January 2023 with positive sentiment scores and the companies they mention:</p><ul><li><strong>BackBox Unveils Customized Product Updates (0.886):</strong> I don’t have the tools to identify the companies mentioned in this article.</li><li><strong>Mozilla’s Pocket App Just Got a Big Update (0.838, 0.83):</strong> Mozilla</li><li><strong>Onapsis and Wipro help enterprises secure their SAP applications (0.612):</strong> Onapsis, Wipro, SAP</li><li><strong>Onapsis Announces Collaboration with Wipro to Remove Security as a Digital Transformation Roadblock (0.804, 0.735):</strong> Onapsis, Wipro</li><li><strong>Houston Metro and Lone Star college launch new pilot service for Lone Star College-Houston North Fallbrook campus (0.816):</strong> Lone Star College</li><li><strong>…</strong></li></ul><p><strong>Who is working there as board members?</strong></p><p><em>Here, we refer back to the companies from before with “there” and also specify the roles requirement for the API call choice only vaguely. The agent needs to call the people_at_company endpoint multiple times with the IDs of the companies retrieved earlier in the session history, then aggregate and filter the information into the response.</em></p><p>I wasn’t able to retrieve board members for all of the companies. Here’s the information I found:</p><ul><li><strong>Wipro:<br></strong>Azim Premji (BOARD_MEMBER)<br>Rishad Premji (BOARD_MEMBER)</li><li><strong>SAP:<br></strong>Hasso Plattner (BOARD_MEMBER)<br>Jennifer Morgan (BOARD_MEMBER)</li><li><strong>Google:<br></strong>Larry Page (BOARD_MEMBER)<br>Sergey Brin (BOARD_MEMBER)</li></ul><h3>Deployment</h3><p>The deployment of the Toolbox server and the LangGraph application is outside the scope of this article. 
Please refer to the <a href="https://googleapis.github.io/genai-toolbox/how-to/deploy_toolbox/">documentation for GenAI Toolbox</a> and <a href="https://codelabs.developers.google.com/codelabs/build-and-deploy-a-langchain-app-on-cloud-run">LangChain on Cloud Run</a>.</p><h3>Summary</h3><p>We’ve demonstrated how you can build agentic applications that integrate database-backed tools easily with the Google Gen AI Toolbox. While we focused on a <a href="https://graphrag.com">GraphRAG</a> use case with a knowledge graph in Neo4j, you can combine different data sources and types of databases in the same agentic application. Since Toolbox integrates with different agent frameworks, you can define your tools once and reuse them for several use cases. Since Toolbox is open source, contributing to it was straightforward. We hope you will make use of the <a href="https://github.com/googleapis/genai-toolbox">feedback and discussion mechanisms</a> to add your own data sources and contribute improvements.</p><p>You can learn more about knowledge graphs in <a href="https://neo4j.com/blog/genai/what-is-knowledge-graph">What Is a Knowledge Graph?</a> And be sure to check out <a href="https://neo4j.com/blog/genai/graphrag-manifesto/">The GraphRAG Manifesto: Adding Knowledge to GenAI</a>.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=86526659b46a" width="1" height="1" alt=""><hr><p><a href="https://medium.com/neo4j/building-ai-agents-with-the-google-gen-ai-toolbox-and-neo4j-knowledge-graphs-86526659b46a">Building AI Agents With the Google Gen AI Toolbox and Neo4j Knowledge Graphs</a> was originally published in <a href="https://medium.com/neo4j">Neo4j Developer Blog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[LLM Knowledge Graph Builder — First Release of 2025]]></title>
            <link>https://medium.com/neo4j/llm-knowledge-graph-builder-first-release-of-2025-532828c4ba76?source=rss-3865848842f9------2</link>
            <guid isPermaLink="false">https://medium.com/p/532828c4ba76</guid>
            <category><![CDATA[developer-experience]]></category>
            <category><![CDATA[knowledge-graph]]></category>
            <category><![CDATA[graphrag]]></category>
            <category><![CDATA[genai]]></category>
            <category><![CDATA[retrieval-augmented-gen]]></category>
            <dc:creator><![CDATA[Michael Hunger]]></dc:creator>
            <pubDate>Wed, 05 Feb 2025 13:26:58 GMT</pubDate>
            <atom:updated>2025-02-05T19:00:31.688Z</atom:updated>
            <content:encoded><![CDATA[<h3>LLM Knowledge Graph Builder — First Release of 2025</h3><h4>New features include community summaries, parallel retrievers, and expanded model support for better knowledge graph construction from text</h4><p>Many developers try to build <a href="https://neo4j.com/blog/what-is-retrieval-augmented-generation-rag/">retrieval-augmented generation</a> (RAG) experiences to interact with information from unstructured data using only vector search and struggle to get the results they want. Looking at text fragments without context only gets you so far. As usual in data engineering, there are more <a href="https://graphrag.com">advanced patterns</a> for preprocessing the data and extracting knowledge, <a href="https://neo4j.com/blog/what-is-graphrag/#:~:text=GraphRAG%20is%20a%20powerful%20retrieval,for%20trustworthy%20and%20reliable%20results.">one of which is GraphRAG</a>. That way, by the time you use the data, you’ve already surfaced the underlying concepts and can use them to connect the pieces and provide relevant context for a user’s questions.</p><h3>Introduction</h3><p>We built, <a href="https://github.com/neo4j-labs/llm-graph-builder">open-sourced</a>, and hosted the <a href="https://llm-graph-builder.neo4jlabs.com/">LLM Knowledge Graph Builder</a> to let you try out better ways of treating your unstructured data. We preprocess documents, transcripts, web articles, and other sources <em>into chunks, compute text embeddings,</em> and connect them (lexical graph).</p><p>But we don’t stop there. 
We also <em>extract entities and their relationships, </em>which is especially relevant if you ingest multiple documents because you can relate the pieces spread out over multiple sources (entity graph).</p><p>This combined knowledge graph then enables a set of different retrievers to fetch data (see below).</p><p>Since we launched the <a href="https://medium.com/neo4j/get-started-with-graphrag-neo4js-ecosystem-tools-eec476167e86">LLM Knowledge Graph Builder in June 2024</a>, we’ve had an impressive amount of usage and great feedback from users. It’s now the fourth most popular source of user interaction on AuraDB Free, which makes us really happy.</p><p>We shipped a release in fall 2024, but a packed schedule of AI events left no time to write a blog post about it. Over the past few months, the team worked on really nice features — some of which we want to introduce today in the first release of 2025.</p><h3>What Does the LLM Knowledge Graph Builder Do?</h3><p>For those of you who don’t know what the tool does, here’s a quick introduction.</p><p>If you have a number of text documents, web articles, Wikipedia pages, or similar unstructured information, wouldn’t it be great to surface all the knowledge hidden inside those in a structured way and then use those entities and their relationships to better <strong><em>chat with your data</em></strong>?</p><p>The LLM Knowledge Graph Builder:</p><ol><li>Imports your documents</li><li>Splits them into chunks and links them up</li><li>Generates text embeddings for vector search and connects the most similar ones</li><li>Uses a variety of large language models (LLMs) to extract entities and their relationships</li><li>Optionally applies a graph schema you provide</li><li>Stores the nodes and relationships in Neo4j</li><li>When running against a graph data science-enabled Neo4j instance, also performs topic clustering and summarization</li></ol><p>Get a quick overview of the process and try it out at <a 
href="https://llm-graph-builder.neo4jlabs.com">https://llm-graph-builder.neo4jlabs.com</a>.</p><figure><a href="https://llm-graph-builder.neo4jlabs.com/"><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*kk_YGSw0ScCgdR3LgoFbmg.png" /></a></figure><p>The only prerequisite is a publicly accessible Neo4j instance to store your data, which you can <a href="https://console.neo4j.io">create on AuraDB Free</a> (or Aura Pro Trial with Graph Data Science).</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FLlNy5VmV290%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DLlNy5VmV290&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FLlNy5VmV290%2Fhqdefault.jpg&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/a6a1fc0888c952295da39fbf8e69e98f/href">https://medium.com/media/a6a1fc0888c952295da39fbf8e69e98f/href</a></iframe><h3>Under the Hood Blog Series</h3><p>Over the next few weeks, we have a <a href="https://medium.com/neo4j/tagged/knowledge-graph">series of blog posts</a> that explore different aspects of the LLM Knowledge Graph Builder and explain how they work under the hood, so you can benefit from our learnings and apply them to your own GenAI projects.</p><h3>New Features</h3><p>Let’s dive into the new features. The main ones are <strong>generating Community Summaries</strong> and new local and global retrievers for it, as well as the new ability of running <strong>multiple retrievers</strong> in parallel for your questions and <strong>evaluating</strong> them. 
You can now also guide the extraction with <strong>custom prompt instructions</strong>.</p><p>Additionally, we have some user experience improvements to highlight.</p><h3>More Models</h3><p>In development mode or in self-deployment, we tested and configured the LLM Knowledge Graph Builder with a plethora of new models, which we also hint at in the production version:</p><ul><li>OpenAI GPT-4o (and GPT-4o mini)</li><li>Google Gemini 1.5 and 2.0 Pro and Flash</li><li>Qwen 2.5</li><li>Amazon Nova models</li><li>Groq</li><li>Llama 3.x</li><li>Ollama models</li><li>Claude 3.5 Sonnet</li><li>DeepSeek and Microsoft Phi-4, coming soon</li></ul><p>We also test the models internally for integration testing and to see how well they work for extraction.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*xSnt6EXgEYZVZoGtzh03iA.png" /><figcaption>Test results for different LLM extractions with entity counts and runtimes</figcaption></figure><h3>Community Summaries</h3><p>One way you can improve the richness of the graph representing your documents is by running graph algorithms to extract additional information hidden in the graph structure.</p><p>Microsoft did that last year with the “<a href="https://graphrag.com/appendices/research/2404.16130/">From local to global — Query Focused Summarization GraphRAG</a>” paper. They used a hierarchical graph clustering algorithm (<a href="https://en.wikipedia.org/wiki/Leiden_algorithm">Leiden</a>) on the extracted entity domain graph. This algorithm identifies clusters of closely related entities. An LLM then summarizes their content into community summary nodes, which represent the information in that set of entities and relationships. 
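Schematically, the bottom-up summarization over such a cluster hierarchy looks like this (a toy Python sketch with a stubbed LLM call and made-up community IDs, not the tool's actual code):

```python
# Toy hierarchy: level-0 communities group entities; level-1 communities
# group child communities. IDs and members are illustrative only.
communities = {
    "c0": {"level": 0, "members": ["Neo4j", "Cypher"], "parent": "c2"},
    "c1": {"level": 0, "members": ["Leiden", "GDS"], "parent": "c2"},
    "c2": {"level": 1, "members": ["c0", "c1"], "parent": None},
}

def summarize(texts):
    # Placeholder for the LLM call that condenses the input texts.
    return "Summary of: " + ", ".join(texts)

def build_summaries(communities):
    """Summarize communities level by level, so higher-level summaries
    are built from the summaries of their child communities."""
    summaries = {}
    for level in sorted({c["level"] for c in communities.values()}):
        for cid, c in communities.items():
            if c["level"] != level:
                continue
            if level == 0:
                summaries[cid] = summarize(c["members"])
            else:
                summaries[cid] = summarize(summaries[m] for m in c["members"])
    return summaries

print(build_summaries(communities)["c2"])
```

In the real tool, the hierarchy comes from the Leiden algorithm and the summaries are stored as nodes linked to their communities.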
Due to the hierarchical results, this can be done on multiple levels going from very fine-grained to the highest level.</p><blockquote>A more detailed blog post on evaluation will be available later in the series.</blockquote><p>The extracted communities are visible in the graph visualization of your documents, so you can inspect them and their texts, and see which entities they are summarizing.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*kXA9HO5Ku4tZZqSE" /><figcaption>Graph visualization with entities and summarized communities</figcaption></figure><p>Those community summaries are then used in a <a href="https://graphrag.com/reference/graphrag/global-community-summary-retriever/">global retriever</a> to answer general questions about the documents, which aren’t specific to any one entity but identify topics across documents. So instead of the vertical sequence of content of each document, they represent themes spanning individual documents.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*-cJG0OhdgpHP8OG_" /><figcaption>Clustering topics across documents with graph algorithms</figcaption></figure><p>In LLM Knowledge Graph Builder, we implemented it using the <a href="https://neo4j.com/docs/graph-data-science/current/algorithms/leiden/">same clustering algorithm</a>, so if you connect to a Neo4j instance that has graph data science enabled (AuraPro with GDS, AuraDS, Neo4j Sandbox, or self-hosted — which we show as ⚛ on top of the app), and enable the Community Summarization in the Graph Enhancements &gt; Post Processing, we can run the algorithm and create the tree of summaries.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*nnQyRkzhB2vvITh6" /><figcaption>Post Processing Jobs</figcaption></figure><p>In our <strong>global community retriever,</strong> we took a slightly different approach from the paper, which just stuffs all the community summaries (of a certain level) into several LLM 
prompts to answer a question. We generate vector embeddings of the community summaries and use similarity and full-text search to find the most relevant ones for the question (an approach Microsoft Research recently also proposed) and use those to answer the question.</p><p>Since we can now run multiple retrievers in parallel, you can see the difference for yourself. Check out the global and local entity retrievers comparison.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*p3f3XQm8d1pUjPfz" /><figcaption>Global Community Retriever</figcaption></figure><p>Both retrievers, like all the others, allow showing the retrieved contextual graph data (communities, entities, chunks) that went into generating the answer, supporting explainability.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*j6F7x_Vx_wxNbZzo" /><figcaption>Global Community Retriever retrieval details</figcaption></figure><h3>Local Entity Retriever</h3><p>The <a href="https://graphrag.com/reference/graphrag/local-retriever/">local entity retriever</a> turns what is several thousand lines of Python code in the Microsoft paper implementation into about 50 lines of Cypher (one benefit of using an actual graph database), fetching the following:</p><ul><li>Entities with hybrid search (embedding and full-text)</li><li>The relationships between them</li><li>The most common relationships to entities outside the initial set</li><li>The chunks and documents from where the entities were extracted</li><li>The community summaries the entities are part of</li></ul><pre>// previous hybrid search on entities, then graph expansion<br>WITH collect(node) AS nodes,<br>    avg(score) AS score,<br>    collect({id: elementId(node), score: score}) AS metadata<br><br>RETURN score, nodes, metadata,<br><br>    collect {<br>        UNWIND nodes AS n<br>        MATCH (n)&lt;-[:HAS_ENTITY]-(c:Chunk)<br>        WITH c, count(distinct n) AS freq<br>        RETURN c<br>        ORDER BY freq 
DESC<br>        LIMIT 3<br>    } AS chunks,<br><br>    collect {<br>        UNWIND nodes AS n<br>        OPTIONAL MATCH (n)-[:IN_COMMUNITY]-&gt;(c:__Community__)<br>        WITH c, c.community_rank AS rank, c.weight AS weight<br>        RETURN c<br>        ORDER BY rank, weight DESC<br>        LIMIT 3<br>    } AS communities,<br><br>    collect {<br>        UNWIND nodes AS n<br>        UNWIND nodes AS m<br>        MATCH (n)-[r]-&gt;(m)<br>        RETURN DISTINCT r<br>    } AS rels,<br><br>    collect {<br>        UNWIND nodes AS n<br>        MATCH path = (n)-[r]-(m:__Entity__)<br>        WHERE NOT m IN nodes<br>        WITH m, collect(distinct r) AS rels, count(*) AS freq<br>        ORDER BY freq DESC<br>        LIMIT 10<br>        WITH collect(m) AS outsideNodes, apoc.coll.flatten(collect(rels)) AS rels<br>        RETURN { nodes: outsideNodes, rels: rels }<br>    } AS outside</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*ZD7Kzh6nN3-YpUff" /><figcaption>Local entity retriever</figcaption></figure><p>For the entity retriever, we can show not only text chunks but also the entities and their retrieved relationships.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*ZjtFU4KVzknsvA7w" /><figcaption>Local entity retriever result details</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*jSzh1nrGTKeKNapy" /><figcaption>Community summaries used in the local entity retriever</figcaption></figure><h3>Multi-Retrievers</h3><p>As shown in the previous section, you can now select one or more retrievers to run in parallel to generate an answer to your question and switch directly between the results to compare them.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/738/0*LmYm0qyYiuAboeB0" /></figure><p>In the Details link after each answer, the retriever also provides the contextual facts retrieved from the database, which are then sent to the LLM, as well as additional information 
about the model, runtime, and token counts.</p><p>In development mode or in self-hosted deployments, even more retrievers are available for you to test and compare.</p><p>To make this easier, the narrow right conversation sidebar can be maximized and even popped out into a full frame, which also becomes shareable. This is especially useful with read-only database connection setups, which we now also support for sharing your generated knowledge graph.</p><p>The underlying data of your conversation can be downloaded as a JSON file, so you can process it according to your own needs.</p><h3>Retriever Evaluation</h3><p>One reason for running retrievers in parallel was to add the ability to generate evaluation metrics.</p><p>We are using the RAGAs framework to run the evaluation. Currently, we compute the following metrics, some of which you need to provide a ground truth for:</p><ul><li>Relevancy — How well the answer addresses the user’s question</li><li>Faithfulness — How accurately the answer reflects the provided information</li><li>Context relevance — Determines the recall of entities present in the generated answer and the retrieved contexts</li><li>Semantic relevance — How well the generated answer captures the meaning of the reference answer</li><li>ROUGE — Word-by-word similarity to the ground-truth answer</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*zxECJ_fHh2H3kit5xfxPNA.png" /><figcaption>Retriever evaluation</figcaption></figure><blockquote>There will be a more detailed blog post on evaluation later in the series.</blockquote><h3>Guided Instructions for Extraction</h3><p>In the latest version, we added the ability to guide the extraction more by allowing users to pass additional prompts to the LLM for extracting entities. 
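Conceptually, these extra instructions are simply appended to the extraction prompt sent to the LLM. A minimal sketch (the prompt text and function are illustrative, not the app's actual template):

```python
# Illustrative base prompt; the real application uses its own template.
BASE_PROMPT = (
    "Extract entities and relationships from the following text.\n"
    "Return them as (subject, relation, object) triples.\n"
)

def build_extraction_prompt(text, extra_instructions=None):
    """Append optional user-supplied guidance before the text to process."""
    prompt = BASE_PROMPT
    if extra_instructions:
        prompt += f"Additional instructions: {extra_instructions}\n"
    return prompt + f"Text:\n{text}"

print(build_extraction_prompt(
    "Albert Einstein received the Nobel Prize.",
    extra_instructions="Ignore anything related to his physics work.",
))
```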
So you can force it to only focus on certain parts of the document, on certain themes, or use specific additional instructions.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*T5j3ix96derzEGi08B5DjQ.png" /><figcaption>Provide additional extraction instructions</figcaption></figure><p>I tested it by extracting entities and relationships from several articles on Albert Einstein’s work but directing the LLM not to extract anything related to his physics work. And there it was — life, people, awards, peace activities, and other inventions, but not much on his tremendous contributions in physics.</p><h3>User Experience Improvements</h3><p>A quick list of the improvements:</p><ul><li>Allow read-only database access, then only retrieval is possible</li><li>Pop out the chat experience into a separate window</li><li>Improve graph visualization with local search and highlighting</li></ul><h3>Graph Consolidation</h3><p>An experimental feature is automatic graph consolidation, intended for users who just quickly want to see an extracted knowledge graph from their data but don’t want to specify a graph schema upfront.</p><p>In those cases, the LLM often generates a load of entity types and relationships — counting into the thousands if you give it free rein. Our retrievers don’t mind because they use the graph topology, not the actual type, to traverse the graph (although they collect them together with the textual information).</p><p>That’s why we recommend providing a graph schema upfront for a more semantically constrained knowledge graph. But in cases where this didn’t happen, we can use an LLM to categorize a sorted list of node labels and relationship types into a smaller, more general set. Since we’re not 100-percent happy with the reduction, we didn’t enable it by default, but we would love your feedback. 
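As a toy illustration of that consolidation step (the mapping here is hard-coded; in the tool, an LLM proposes it from the sorted list of extracted labels):

```python
from collections import Counter

# Hypothetical mapping from fine-grained labels to general categories.
# In the real feature, an LLM generates this from the extracted labels.
label_mapping = {
    "Physicist": "Person", "Politician": "Person",
    "City": "Location", "Country": "Location",
}

def consolidate(node_labels, mapping):
    """Replace fine-grained labels with their general category,
    counting how often each consolidated label occurs."""
    return Counter(mapping.get(label, label) for label in node_labels)

labels = ["Physicist", "Politician", "City", "Country", "Award"]
print(consolidate(labels, label_mapping))
```

Labels without a mapping (like "Award" above) are kept as-is, so the reduction is lossless for categories the LLM could not generalize.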
You can find it as one of the post-processing jobs in the graph enhancements.</p><h3>Summary</h3><p>Building an open source tool like this is a fulfilling experience — especially when we get as much feedback as we’ve gotten so far. We’ve addressed more than 400 GitHub issues, including internal planned tasks, and gotten more than 2,800 GitHub stars.</p><p>If you haven’t yet, <strong>please try it out and let us know what you think</strong> in the comments. We’d also be excited if you wrote about your experiences using the tool for different sets of documents of different domains.</p><p>Please share if you have questions or feedback. Also, please <a href="https://github.com/neo4j-labs/llm-graph-builder">give us a star</a> if you like the project.</p><p>Happy building!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=532828c4ba76" width="1" height="1" alt=""><hr><p><a href="https://medium.com/neo4j/llm-knowledge-graph-builder-first-release-of-2025-532828c4ba76">LLM Knowledge Graph Builder — First Release of 2025</a> was originally published in <a href="https://medium.com/neo4j">Neo4j Developer Blog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Claude Conversing With Neo4j Via MCP]]></title>
            <link>https://medium.com/neo4j/claude-conversing-with-neo4j-via-mcp-5419d24bb7a0?source=rss-3865848842f9------2</link>
            <guid isPermaLink="false">https://medium.com/p/5419d24bb7a0</guid>
            <category><![CDATA[protocol]]></category>
            <category><![CDATA[genai]]></category>
            <category><![CDATA[chatbots]]></category>
            <category><![CDATA[neo4j]]></category>
            <category><![CDATA[claude]]></category>
            <dc:creator><![CDATA[Michael Hunger]]></dc:creator>
            <pubDate>Fri, 20 Dec 2024 15:56:40 GMT</pubDate>
            <atom:updated>2024-12-22T00:34:06.221Z</atom:updated>
            <content:encoded><![CDATA[<h3>Claude Converses With Neo4j Via MCP</h3><h4>Give LLMs the Power of Knowledge Graphs With Anthropic’s Model Context Protocol</h4><p>Last week, Anthropic published a new <a href="https://www.anthropic.com/news/model-context-protocol">Model Context Protocol</a> (MCP) for LLMs like Claude to communicate with external data sources.</p><p>This protocol allows, for instance, local applications like Claude.ai, Zed, Replit, and others to access file systems, APIs, and databases as part of the conversation with the user to fetch additional information. This can be used to answer questions, fetch additional data, or generate on-the-fly charts and visualizations via code generation.</p><p>Here’s what it could look like in your conversation.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*UnC-f-LmG5RB-wk3G16TgQ.png" /><figcaption>Claude fetching data from Neo4j to render a heat map of ratings</figcaption></figure><p>If you want to test it yourself, you can add this config to your claude_desktop_config.json and restart the Claude Desktop app to follow along. This points to a read-only demo database.</p><pre>{&quot;mcpServers&quot;: {<br>      &quot;movies-neo4j&quot;: {<br>        &quot;command&quot;: &quot;uvx&quot;,<br>        &quot;args&quot;: [&quot;mcp-neo4j-cypher&quot;, <br>                 &quot;--db-url&quot;, &quot;neo4j+s://demo.neo4jlabs.com&quot;, <br>                 &quot;--user&quot;, &quot;recommendations&quot;, <br>                 &quot;--password&quot;, &quot;recommendations&quot;]<br>      }   <br>    }<br> }</pre><p>And they didn’t just publish the protocol. 
They open-sourced a <a href="https://github.com/modelcontextprotocol/servers">number of connectors (servers)</a> with an invitation to the community to add more.</p><h3>The Neo4j MCP Server(s)</h3><p>We immediately started <a href="https://github.com/neo4j-contrib/mcp-neo4j/tree/main/servers/mcp-neo4j-cypher">implementing a Neo4j MCP server</a> to see what that would look like, which we also <a href="https://pypi.org/project/mcp-neo4j-cypher/">published on PyPI</a>, so you can run it directly.</p><p><a href="https://github.com/neo4j-contrib/mcp-neo4j">GitHub - neo4j-contrib/mcp-neo4j: Model Context Protocol with Neo4j</a></p><p>It’s also included in the official <a href="https://github.com/modelcontextprotocol/servers?tab=readme-ov-file#%EF%B8%8F-official-integrations">MCP Servers repository</a>.</p><h3>An Interactive Example</h3><p>Here’s a simple example of how you can use the MCP to query a Neo4j database from an LLM.</p><p>Imagine you have a Neo4j database with an IMDB-like set of movies, actors, directors, genres, and movie ratings.</p><p>I configure the publicly accessible database in my Claude.ai desktop configuration and restart the client.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*_uCb-D3O0Fa5CfdCZybrAA.png" /></figure><p>Now I want to find out which movies were directed by Quentin Tarantino, so I’ll just ask Claude about it.</p><blockquote><em>Check my movie database for movies directed by Quentin Tarantino.</em></blockquote><p>Claude will pop up a permission dialog that asks me if I want to allow the query to the Neo4j database (once or for the whole chat), and if I accept, it will run the query and return the results in a nice format.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*nZ_94geL38hPyf8ODH0lig.png" /></figure><p>Then it answers:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*GIHD854jB_Lo3NnION7eLA.png" /><figcaption>Claude answer for: <em>Check my movie database for 
movies directed by Quentin Tarantino.</em></figcaption></figure><p><em>I’ll help you query the database for movies directed by Quentin Tarantino.</em></p><p><em>View Result from read-neo4j-cypher from movies-neo4j (local)</em></p><pre>{<br>  `query`: `<br>MATCH (d:Person {name: &#39;Quentin Tarantino&#39;})-[:DIRECTED]-&gt;(m:Movie)<br>RETURN m.title, m.year<br>ORDER BY m.year<br>`<br>}</pre><pre>[{&#39;m.title&#39;: &#39;Reservoir Dogs&#39;, &#39;m.year&#39;: 1992}, {&#39;m.title&#39;: &#39;Pulp Fiction&#39;, &#39;m.year&#39;: 1994}, {&#39;m.title&#39;: &#39;Four Rooms&#39;, &#39;m.year&#39;: 1995}, {&#39;m.title&#39;: &#39;Jackie Brown&#39;, &#39;m.year&#39;: 1997}, {&#39;m.title&#39;: &#39;Kill Bill: Vol. 1&#39;, &#39;m.year&#39;: 2003}, {&#39;m.title&#39;: &#39;Kill Bill: Vol. 2&#39;, &#39;m.year&#39;: 2004}, {&#39;m.title&#39;: &#39;Sin City&#39;, &#39;m.year&#39;: 2005}, {&#39;m.title&#39;: &#39;Grindhouse&#39;, &#39;m.year&#39;: 2007}, {&#39;m.title&#39;: &#39;Death Proof&#39;, &#39;m.year&#39;: 2007}, {&#39;m.title&#39;: &#39;Inglourious Basterds&#39;, &#39;m.year&#39;: 2009}, {&#39;m.title&#39;: &#39;Django Unchained&#39;, &#39;m.year&#39;: 2012}, {&#39;m.title&#39;: &#39;The Hateful Eight&#39;, &#39;m.year&#39;: 2015}]</pre><p><em>Here are Quentin Tarantino’s directed films in chronological order:</em></p><ol><li><em>Reservoir Dogs (1992)</em></li><li><em>Pulp Fiction (1994)</em></li><li><em>Four Rooms (1995)</em></li><li><em>Jackie Brown (1997)</em></li><li><em>Kill Bill: Vol. 1 (2003)</em></li><li><em>Kill Bill: Vol. 
2 (2004)</em></li><li><em>Sin City (2005) …​</em></li></ol><p><em>Let me know if you’d like to know more details about any of these films or if you’d like to explore other aspects of Tarantino’s work!</em></p><blockquote><em>What are other popular movies that the actors of Kill Bill acted in?</em></blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*P-QK_c6ayzraCmPsKK-IrQ.png" /><figcaption>Claude answer for: <em>What are other popular movies that the actors of Kill Bill acted in?</em></figcaption></figure><p><em>I’ll query for other notable movies featuring actors from Kill Bill: Vol. 1 and Kill Bill: Vol. 2.</em></p><p><em>View Result from read-neo4j-cypher from movies-neo4j (local)</em></p><pre>{<br>  `query`: `<br>MATCH (kb:Movie)<br>WHERE kb.title IN [&#39;Kill Bill: Vol. 1&#39;, &#39;Kill Bill: Vol. 2&#39;]<br>MATCH (actor:Person)-[:ACTED_IN]-&gt;(kb)<br>MATCH (actor)-[:ACTED_IN]-&gt;(other:Movie)<br>WHERE other.title &lt;&gt; &#39;Kill Bill: Vol. 1&#39;<br>AND other.title &lt;&gt; &#39;Kill Bill: Vol. 2&#39;<br>RETURN DISTINCT actor.name, collect(DISTINCT other.title) as other_movies<br>ORDER BY actor.name<br>`<br>}<br>[{&#39;actor.name&#39;: &#39;Daryl Hannah&#39;, &#39;other_movies&#39;: [&#39;Clan of the Cave Bear, The&#39;, &#39;Tie That Binds, The&#39;, &#39;Addams Family Reunion&#39;, &#39;Summer Lovers&#39;, &#39;Legal Eagles&#39;, &#39;Reckless&#39;, &#39;Pope of Greenwich Village, The&#39;, &#39;Splash&#39;, &#39;Roxanne&#39;, &#39;Gingerbread Man, The&#39;, &#39;My Favorite Martian&#39;, &#39;Steel Magnolias&#39;, &#39;Walk to Remember, A&#39;]}, {&#39;actor.name&#39;: &#39;Lucy Liu&#39;, &#39;other_movies&#39;: [&#39;Afro Samurai: Resurrection&#39;, &#39;Lucky Number Slevin&#39;, &#39;Mulan II&#39;, &#39;Cypher&#39;, &quot;Charlie&#39;s Angels: Full Throttle&quot;, &#39;Ballistic: Ecks vs. 
Sever&#39;, &quot;Charlie&#39;s Angels&quot;, &#39;Shanghai Noon&#39;]}, {&#39;actor.name&#39;: &#39;Michael Parks&#39;, &#39;other_movies&#39;: [&#39;Tusk&#39;, &#39;Death Wish 5: The Face of Death&#39;]}, {&#39;actor.name&#39;: &#39;Uma Thurman&#39;, &#39;other_movies&#39;: [&#39;Misérables, Les&#39;, &#39;My Super Ex-Girlfriend&#39;, &#39;Even Cowgirls Get the Blues&#39;, &#39;Truth About Cats &amp; Dogs, The&#39;, &#39;Batman &amp; Robin&#39;, &#39;Gattaca&#39;, &#39;Avengers, The&#39;, &#39;Month by the Lake, A&#39;, &#39;Jennifer 8&#39;, &#39;Mad Dog and Glory&#39;, &#39;Johnny Be Good&#39;, &#39;Henry &amp; June&#39;, &#39;Be Cool&#39;, &#39;Producers, The&#39;, &#39;Tape&#39;, &#39;Paycheck&#39;]}, {&#39;actor.name&#39;: &#39;Vivica A. Fox&#39;, &#39;other_movies&#39;: [&#39;Set It Off&#39;, &#39;Soul Food&#39;, &#39;Boat Trip&#39;, &#39;Juwanna Mann&#39;, &#39;Why Do Fools Fall In Love?&#39;]}]</pre><p><em>Let me break down some notable movies by the main cast members:</em></p><p><em>Uma Thurman’s other major films include: — The Avengers — Gattaca — Batman &amp; Robin — Les Misérables — The Producers — Be Cool</em></p><p><em>Lucy Liu appeared in: — Charlie’s Angels and Charlie’s Angels: Full Throttle — Shanghai Noon — Lucky Number Slevin — Mulan II</em></p><p><em>Michael Parks appeared in: — Tusk — Death Wish 5: The Face of Death</em></p><p><em>Would you like to know more details about any of these actors or specific movies? I can query for additional information like release years, directors, or other cast members.</em></p><blockquote><em>Tell me more about Tusk; is it worth watching? Use the database schema to figure out the ratings.</em></blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*B6D02g0WvSfUVsn7yV3yBA.png" /><figcaption>Claude answer for: <em>Tell me more about Tusk; is it worth watching? 
Use the database schema to figure out the ratings.</em></figcaption></figure><p>Now, it uses the other tool (get-neo4j-schema) available in the integration to get the database schema, which is important for determining where and how ratings are stored. It returns node types, their attributes, if they’re indexed, and the relationships to other nodes.</p><p><em>Let me first check the database schema to see what kind of rating information is available about movies.</em></p><p><em>View Result from get-neo4j-schema from movies-neo4j (local)</em></p><pre>{}<br>[{&#39;label&#39;: &#39;Movie&#39;, &#39;attributes&#39;: {&#39;budget&#39;: &#39;INTEGER&#39;, &#39;movieId&#39;: &#39;STRING unique indexed&#39;, &#39;tmdbId&#39;: &#39;STRING unique indexed&#39;, &#39;plotEmbedding&#39;: &#39;LIST&#39;, &#39;imdbVotes&#39;: &#39;INTEGER&#39;, &#39;runtime&#39;: &#39;INTEGER&#39;, &#39;posterEmbedding&#39;: &#39;LIST&#39;, &#39;countries&#39;: &#39;LIST&#39;, &#39;imdbId&#39;: &#39;STRING indexed&#39;, &#39;url&#39;: &#39;STRING&#39;, &#39;plot&#39;: &#39;STRING indexed&#39;, &#39;released&#39;: &#39;STRING indexed&#39;, &#39;languages&#39;: &#39;LIST&#39;, &#39;imdbRating&#39;: &#39;FLOAT indexed&#39;, &#39;title&#39;: &#39;STRING indexed&#39;, &#39;poster&#39;: &#39;STRING&#39;, &#39;year&#39;: &#39;INTEGER indexed&#39;, &#39;revenue&#39;: &#39;INTEGER&#39;}, &#39;relationships&#39;: {&#39;IN_GENRE&#39;: &#39;Genre&#39;}}, {&#39;label&#39;: &#39;Genre&#39;, &#39;attributes&#39;: {&#39;name&#39;: &#39;STRING unique indexed&#39;}, &#39;relationships&#39;: {}}, {&#39;label&#39;: &#39;User&#39;, &#39;attributes&#39;: {&#39;name&#39;: &#39;STRING indexed&#39;, &#39;userId&#39;: &#39;STRING unique indexed&#39;}, &#39;relationships&#39;: {&#39;RATED&#39;: &#39;Movie&#39;}}, {&#39;label&#39;: &#39;Actor&#39;, &#39;attributes&#39;: {&#39;born&#39;: &#39;DATE&#39;, &#39;bornIn&#39;: &#39;STRING&#39;, &#39;tmdbId&#39;: &#39;STRING&#39;, &#39;bio&#39;: &#39;STRING&#39;, &#39;died&#39;: 
&#39;DATE&#39;, &#39;name&#39;: &#39;STRING&#39;, &#39;poster&#39;: &#39;STRING&#39;, &#39;imdbId&#39;: &#39;STRING&#39;, &#39;url&#39;: &#39;STRING&#39;}, &#39;relationships&#39;: {&#39;ACTED_IN&#39;: &#39;Movie&#39;, &#39;DIRECTED&#39;: &#39;Movie&#39;}}, {&#39;label&#39;: &#39;Director&#39;, &#39;attributes&#39;: {&#39;born&#39;: &#39;DATE&#39;, &#39;bornIn&#39;: &#39;STRING&#39;, &#39;tmdbId&#39;: &#39;STRING&#39;, &#39;bio&#39;: &#39;STRING&#39;, &#39;died&#39;: &#39;DATE&#39;, &#39;name&#39;: &#39;STRING&#39;, &#39;poster&#39;: &#39;STRING&#39;, &#39;imdbId&#39;: &#39;STRING&#39;, &#39;url&#39;: &#39;STRING&#39;}, &#39;relationships&#39;: {&#39;ACTED_IN&#39;: &#39;Movie&#39;, &#39;DIRECTED&#39;: &#39;Movie&#39;}}, {&#39;label&#39;: &#39;Person&#39;, &#39;attributes&#39;: {&#39;born&#39;: &#39;DATE&#39;, &#39;bornIn&#39;: &#39;STRING&#39;, &#39;tmdbId&#39;: &#39;STRING unique indexed&#39;, &#39;bio&#39;: &#39;STRING indexed&#39;, &#39;died&#39;: &#39;DATE&#39;, &#39;name&#39;: &#39;STRING indexed&#39;, &#39;poster&#39;: &#39;STRING&#39;, &#39;imdbId&#39;: &#39;STRING&#39;, &#39;url&#39;: &#39;STRING&#39;}, &#39;relationships&#39;: {&#39;ACTED_IN&#39;: &#39;Movie&#39;, &#39;DIRECTED&#39;: &#39;Movie&#39;}}]</pre><p><em>Now, let me query specific information about Tusk, including its rating, plot, and genre.</em></p><p><em>View Result from read-neo4j-cypher from movies-neo4j (local)</em></p><pre>{<br>  `query`: `<br>MATCH (m:Movie {title: &#39;Tusk&#39;})<br>MATCH (m)-[:IN_GENRE]-&gt;(g:Genre)<br>MATCH (d:Person)-[:DIRECTED]-&gt;(m)<br>MATCH (a:Person)-[:ACTED_IN]-&gt;(m)<br>RETURN m.title, m.year, m.imdbRating, m.plot, collect(DISTINCT g.name) as genres,<br>       collect(DISTINCT d.name) as directors, collect(DISTINCT a.name) as actors<br>`<br>}<br>[{&#39;m.title&#39;: &#39;Tusk&#39;, &#39;m.year&#39;: 2014, &#39;m.imdbRating&#39;: 5.4, &#39;m.plot&#39;: &#39;When podcaster Wallace Bryton goes missing in the backwoods of Manitoba while interviewing a 
mysterious seafarer named Howard Howe, his best friend Teddy and girlfriend Allison team with an ex-cop to look for him.&#39;, &#39;genres&#39;: [&#39;Comedy&#39;, &#39;Drama&#39;, &#39;Horror&#39;], &#39;directors&#39;: [&#39;Kevin Smith&#39;], &#39;actors&#39;: [&#39;Michael Parks&#39;, &#39;Haley Joel Osment&#39;, &#39;Genesis Rodriguez&#39;, &#39;Justin Long&#39;]}]</pre><p><em>Based on the database information, here’s what I can tell you about Tusk:</em></p><p><em>1. It’s a horror-comedy-drama directed by Kevin Smith (known for Clerks, Mallrats, Chasing Amy) and was released in September 2014.</em></p><p><em>2. The plot follows a podcaster named Wallace Bryton who disappears in Manitoba while interviewing a mysterious seafarer named Howard Howe. His girlfriend and best friend then team up with an ex-cop to search for him.</em></p><p><em>3. The movie has a relatively modest IMDb rating of 5.4/10, suggesting mixed to negative reception.</em></p><p>….</p><p><em>The low IMDb rating suggests it’s not for everyone, but it has developed something of a cult following among fans of unusual horror films.</em></p><p><em>Would you like me to find similar movies in the database that you might enjoy instead?</em></p><blockquote><em>Hmm, OK, give me the best movie of the list that we got from the Kill Bill actors with some detail.</em></blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*-42ylI4L8kOLzpWyvjglSw.png" /><figcaption>Claude answer for: <em>Hmm, ok, give me the best movie of the list that we got from the Kill Bill actors with some detail.</em></figcaption></figure><p><em>Let me search through those movies and their ratings to find the highest rated one.</em></p><pre>{<br>  `query`: `<br>MATCH (a:Person)-[:ACTED_IN]-&gt;(kb:Movie)<br>WHERE kb.title IN [&#39;Kill Bill: Vol. 1&#39;, &#39;Kill Bill: Vol. 2&#39;]<br>MATCH (a)-[:ACTED_IN]-&gt;(m:Movie)<br>WHERE m.title &lt;&gt; &#39;Kill Bill: Vol. 
1&#39; AND m.title &lt;&gt; &#39;Kill Bill: Vol. 2&#39;<br>MATCH (m)-[:IN_GENRE]-&gt;(g:Genre)<br>MATCH (d:Person)-[:DIRECTED]-&gt;(m)<br>RETURN DISTINCT m.title, m.year, m.imdbRating, m.plot,<br>       collect(DISTINCT g.name) as genres,<br>       collect(DISTINCT d.name) as directors,<br>       collect(DISTINCT a.name) as actors<br>ORDER BY m.imdbRating DESC<br>LIMIT 1<br>`<br>}<br>[{&#39;m.title&#39;: &#39;Gattaca&#39;, &#39;m.year&#39;: 1997, &#39;m.imdbRating&#39;: 7.8, &#39;m.plot&#39;: &#39;A genetically inferior man assumes the identity of a superior one in order to pursue his lifelong dream of space travel.&#39;, &#39;genres&#39;: [&#39;Thriller&#39;, &#39;Sci-Fi&#39;, &#39;Drama&#39;], &#39;directors&#39;: [&#39;Andrew Niccol&#39;], &#39;actors&#39;: [&#39;Uma Thurman&#39;]}]</pre><p><em>From the movies we found earlier, Gattaca (1997) stands out as the highest-rated film with an IMDb rating of 7.8/10. Let me get some additional details about this film.</em></p><pre>{<br>  `query`: `<br>MATCH (m:Movie {title: &#39;Gattaca&#39;})<br>MATCH (p:Person)-[:ACTED_IN]-&gt;(m)<br>RETURN m.runtime, collect(p.name) as cast<br>`<br>}<br>[{&#39;m.runtime&#39;: 106, &#39;cast&#39;: [&#39;Xander Berkeley&#39;, &#39;Gore Vidal&#39;, &#39;Uma Thurman&#39;, &#39;Ethan Hawke&#39;]}]</pre><p><em>Here’s what makes Gattaca stand out:</em></p><p><em>1. Plot: It’s a science fiction drama about a man born genetically “inferior” who assumes the identity of a genetically superior person to pursue his dream of space travel. 
The story explores themes of genetic discrimination, determination, and human potential.</em></p><p>…..</p><p><em>Would you like me to find similar movies in the database that you might enjoy if you like Gattaca?</em></p><blockquote><em>What is the other average, min, max user rating for Gattaca?</em></blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*tMPo-q3SQ_HL4UffjPs6RA.png" /><figcaption>Claude answer for: <em>What is the other average, min, max user rating for Gattaca?</em></figcaption></figure><p><em>I’ll look up the user ratings statistics for Gattaca.</em></p><pre>{<br>  `query`: `<br>MATCH (m:Movie {title: &#39;Gattaca&#39;})<br>MATCH (u:User)-[r:RATED]-&gt;(m)<br>RETURN<br>  m.title,<br>  count(r) as number_of_ratings,<br>  round(avg(r.rating)*10)/10 as avg_rating,<br>  min(r.rating) as min_rating,<br>  max(r.rating) as max_rating<br>`<br>}<br>[{&#39;m.title&#39;: &#39;Gattaca&#39;, &#39;number_of_ratings&#39;: 84, &#39;avg_rating&#39;: 3.7, &#39;min_rating&#39;: 1.0, &#39;max_rating&#39;: 5.0}]</pre><p><em>I’ll look up the user ratings statistics for Gattaca.</em></p><p><em>Here are the user rating statistics for Gattaca:<br>- Average User Rating: 3.68 out of 5<br>- Minimum Rating: 1.0<br>- Maximum Rating: 5.0<br>- Number of Ratings: 84 users</em></p><p><em>This average rating of 3.68/5 (equivalent to about 7.4/10) aligns fairly well with the IMDb rating of 7.8/10 we saw earlier. The movie has received the full range of possible scores (1–5), showing that while most people rate it highly, there are some varying opinions. With 84 ratings in the database, this gives us a decent sample size to work with.</em></p><blockquote><em>Ok I watched Gattaca and loved it, what other movies are similar to it? 
Can you create a user for me with an ID mesirii and store my rating of 8 out of 10 (normalize it to fit the other ratings for the movie)?</em></blockquote><p>Now we move on to write statements, which triggers another permission check for the write-operation tool. This is also where Claude makes its first small syntax mistake in a Cypher statement and immediately fixes it. As my database is read-only, Claude cannot actually write to it, but it still offers me other movies to watch next.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*cRTf53ROUR4SFuHHkMwpBQ.png" /></figure><p><em>I’ll help you add your rating first, normalizing 8/10 to 4/5 for consistency with the database’s rating scale.</em></p><p><em>I apologize for the error with creating a user rating — it seems I don’t have permission to create new users in the database. However, I can still help you find similar movies to Gattaca!</em></p><p><em>View Result from write-neo4j-cypher from movies-neo4j (local)</em></p><pre>{<br>  `query`: `<br>MERGE (u:User {userId: &#39;mesirii&#39;, name: &#39;mesirii&#39;})<br>MERGE (m:Movie {title: &#39;Gattaca&#39;})<br>CREATE (u)-[:RATED {rating: 4.0}]-&gt;(m)<br>RETURN u.name, m.title<br>`<br>}</pre><p><em>I apologize for the error — it seems I don’t have permission to create new user records. Let me focus instead on finding similar movies to Gattaca.
I’ll search for movies that share similar genres, themes, and ratings:</em></p><p><em>View Result from read-neo4j-cypher from movies-neo4j (local)</em></p><pre>{<br>  `query`: `<br>MATCH (m:Movie)-[:IN_GENRE]-&gt;(g:Genre)<br>WHERE g.name IN [&#39;Sci-Fi&#39;, &#39;Drama&#39;]<br>AND m.title &lt;&gt; &#39;Gattaca&#39;<br>AND m.imdbRating &gt;= 7.0<br>WITH m, collect(DISTINCT g.name) as genres<br>WHERE size(genres) &gt;= 2<br>RETURN m.title, m.year, m.imdbRating, m.plot, genres<br>ORDER BY m.imdbRating DESC<br>LIMIT 5<br>`<br>}<br>[{&#39;m.title&#39;: &#39;Inception&#39;, &#39;m.year&#39;: 2010, &#39;m.imdbRating&#39;: 8.8, &#39;m.plot&#39;: &#39;A thief, who steals corporate secrets through use of dream-sharing technology, is given the inverse task of planting an idea into the mind of a CEO.&#39;, &#39;genres&#39;: [&#39;Drama&#39;, &#39;Sci-Fi&#39;]}, {&#39;m.title&#39;: &#39;Black Mirror&#39;, &#39;m.year&#39;: None, &#39;m.imdbRating&#39;: 8.8, &#39;m.plot&#39;: &#39;A television anthology series that shows the dark side of life and technology.&#39;, &#39;genres&#39;: [&#39;Drama&#39;, &#39;Sci-Fi&#39;]}, {&#39;m.title&#39;: &quot;Dr. 
Horrible&#39;s Sing-Along Blog&quot;, &#39;m.year&#39;: 2008, &#39;m.imdbRating&#39;: 8.7, &#39;m.plot&#39;: &#39;An aspiring supervillain must balance his career and his pursuit of a beautiful do-gooder.&#39;, &#39;genres&#39;: [&#39;Drama&#39;, &#39;Sci-Fi&#39;]}, {&#39;m.title&#39;: &#39;Battlestar Galactica&#39;, &#39;m.year&#39;: 2003, &#39;m.imdbRating&#39;: 8.7, &#39;m.plot&#39;: &#39;A re-imagining of the original series in which a rag-tag fugitive fleet of the last remnants of mankind flees pursuing robots while simultaneously searching for their true home, Earth.&#39;, &#39;genres&#39;: [&#39;Drama&#39;, &#39;Sci-Fi&#39;]}, {&#39;m.title&#39;: &#39;Prestige, The&#39;, &#39;m.year&#39;: 2006, &#39;m.imdbRating&#39;: 8.5, &#39;m.plot&#39;: &#39;Two stage magicians engage in competitive one-upmanship in an attempt to create the ultimate stage illusion.&#39;, &#39;genres&#39;: [&#39;Sci-Fi&#39;, &#39;Drama&#39;]}]</pre><p><em>Let me search for movies with similar themes, genres, and ratings:</em></p><pre>{<br>  `query`: `<br>MATCH (m:Movie {title: &#39;Gattaca&#39;})<br>MATCH (m)-[:IN_GENRE]-&gt;(g:Genre)<br>MATCH (other:Movie)-[:IN_GENRE]-&gt;(g)<br>WHERE other.title &lt;&gt; &#39;Gattaca&#39; <br>  AND other.imdbRating &gt;= 7.0<br>  AND exists(other.plot)<br>RETURN DISTINCT other.title, other.imdbRating, other.plot, <br>  collect(DISTINCT g.name) as genres<br>ORDER BY other.imdbRating DESC<br>LIMIT 5<br>`<br>}<br><br>Error: {code: Neo.ClientError.Statement.SyntaxError} {message: The property existence syntax `... exists(variable.property)` is no longer supported. Please use `variable.property IS NOT NULL` instead. (line 7, column 7 (offset: 174))<br>&quot;  AND exists(other.plot)&quot;<br>       ^}</pre><p><em>I apologize for the syntax error. 
Let me correct the query:</em></p><p><em>Let me refine this search to focus more specifically on sci-fi and thriller elements like Gattaca:</em></p><pre>{<br>  `query`: `<br>MATCH (m:Movie {title: &#39;Gattaca&#39;})<br>MATCH (m)-[:IN_GENRE]-&gt;(g:Genre)<br>MATCH (other:Movie)-[:IN_GENRE]-&gt;(g)<br>WHERE other.title &lt;&gt; &#39;Gattaca&#39; <br>  AND other.imdbRating &gt;= 7.0<br>  AND other.plot IS NOT NULL<br>RETURN DISTINCT other.title, other.imdbRating, other.plot, <br>  collect(DISTINCT g.name) as genres<br>ORDER BY other.imdbRating DESC<br>LIMIT 5<br>`<br>}<br></pre><p><em>Based on your enjoyment of Gattaca, here are some highly-rated recommendations that share similar themes or elements:</em></p><p><em>1. Black Mirror (8.8/10)<br> — A thought-provoking sci-fi anthology series that, like Gattaca, explores the dark implications of technology on society<br> — Each episode tells a different story about how technology might affect human society</em></p><p>….</p><p><em>These films all share Gattaca’s thoughtful approach to science fiction, focusing more on human drama and ethical questions rather than action. They all explore themes of identity, humanity, and the impact of technology/science on society, which were central to Gattaca.</em></p><p><em>Would you like more details about any of these recommendations?</em></p><p>…</p><p>That’s my first experience with the MCP and the Neo4j connector. Pretty cool, huh? 
It was a great experience, and I can see how this approach can enhance conversations with LLMs like Claude.</p><h3>The Model Context Protocol</h3><p>How does this work under the hood?</p><p>There is comprehensive documentation of the MCP on the modelcontextprotocol.io site, but here’s a brief overview:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*fbiVQnaxHWBKRpdnoXJPVw.png" /><figcaption>Architecture diagram from the MCP documentation</figcaption></figure><p>MCP follows a <a href="https://modelcontextprotocol.io/docs/concepts/architecture">client-server architecture</a> where:</p><ul><li><strong>Hosts</strong> are LLM applications (like Claude Desktop or IDEs) that initiate connections and embed the clients.</li><li><strong>Clients</strong> run within the host application and maintain 1:1 connections with servers over a transport protocol.</li><li><strong>Servers</strong> provide context, tools, and prompts to clients.</li></ul><p>The protocol layer handles message framing, request/response linking, notifications, and high-level communication patterns.</p><p>The MCP allows for different transport protocols.
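Whichever transport is used, the payload has the same shape. As a sketch, a client invoking the read-neo4j-cypher tool might send a JSON-RPC 2.0 request like this (the method name and params shape follow my reading of the MCP specification; the id and the Cypher query are made up for illustration):

```python
import json

# Illustrative JSON-RPC 2.0 message as exchanged over any MCP transport.
# The "tools/call" method and params shape follow the MCP spec; the id
# and the Cypher query here are invented for this example.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "read-neo4j-cypher",
        "arguments": {"query": "MATCH (m:Movie) RETURN m.title LIMIT 5"},
    },
}

# The transport layer is only responsible for framing and delivering
# serialized messages like this in both directions.
wire_message = json.dumps(request)
```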
Those currently supported are HTTPS (with Server-Sent Events (SSE) for server-to-client messages and HTTP POST for client-to-server messages) and STDIO (standard in/out) for local servers, where the server is started by the client and communicates via stdin/stdout.</p><p>All transport message exchanges follow a <a href="https://spec.modelcontextprotocol.io/specification/">specification</a> based on JSON-RPC 2.0, which makes it straightforward to implement the protocol in other languages or transport layers.</p><p>The base message types are:</p><ul><li><strong>Resources</strong>: Context and data for the user or the AI model to use</li><li><strong>Prompts</strong>: Templated messages and workflows for users</li><li><strong>Tools</strong>: Functions for the AI model to execute</li><li><strong>Sampling</strong>: Server-initiated agentic behaviors and recursive LLM interactions</li></ul><p>Additional relevant aspects include configuration, progress tracking, cancellation, error reporting, and logging.</p><p>The protocol spec also addresses security and trust, which is important when giving LLMs access to external data sources. With write access to databases and file systems, servers running locally, and the potential for malicious code execution, security must be a high priority. Foundation models are known to be vulnerable to adversarial attacks, prompt injections, and hallucinations.</p><p>LLM users are often non-technical and might not be aware of the risks involved in allowing an AI model to access their data.</p><p>That’s why the protocol employs the following key principles:</p><ul><li>User consent and control: Users need to approve server and data access and always be able to discern which data was sent and received.
Client implementations need to have clear UIs for these authorizations.</li><li>Data privacy: Access controls protect user data, and hosts and servers must not exfiltrate data to other services.</li><li>Tool safety: Tools must be safe to run, preventing malicious code execution and unwanted side effects. Tools need to be structured in a way that a user clearly understands what each tool is doing.</li><li>LLM sampling controls: The protocol limits server access to user prompts. Users can control sampling and what data and prompts are sent to the server.</li></ul><h3>Neo4j MCP Server Implementation</h3><p>When we implemented the Neo4j MCP server, we had to build three components:</p><ol><li>Connection to the Neo4j database and executing Cypher queries</li><li>Registering and listing the available tools</li><li>Handling the tool execution requests</li></ol><h4>Connection and Cypher Execution With Neo4j Database</h4><p>In our case, we wrote a small neo4jDatabase class that holds the Neo4j Python driver to connect to the database, execute the statements, and render the results as a JSON text block.</p><pre>def _execute_query(self, query: str, params: dict[str, Any] | None = None) -&gt; list[dict[str, Any]]:<br>    &quot;&quot;&quot;Execute a Cypher query and return results as a list of dictionaries&quot;&quot;&quot;<br>    logger.debug(f&quot;Executing query: {query}&quot;)<br>    try:<br>        result = self.driver.execute_query(query, params)<br>        if is_write_query(query):<br>            counters = vars(result.summary.counters)<br>            logger.debug(f&quot;Write query affected {counters}&quot;)<br>            return [counters]<br>        else:<br>            results = [dict(r) for r in result.records]<br>            logger.debug(f&quot;Read query returned {len(results)} rows&quot;)<br>            return results<br>    except Exception as e:<br>        logger.error(f&quot;Database error executing query: {e}\n{query}&quot;)<br>        
raise</pre><h4>Register Available Tools for Interacting With the Database</h4><p>These tools will later be available to the LLM to choose from:</p><ul><li>get-neo4j-schema</li><li>read-neo4j-cypher</li><li>write-neo4j-cypher</li></ul><p>Each has a description and possible parameters (in our case, just a mandatory “query” parameter for the Cypher tools). We can imagine a number of additional generic tools for a graph database like getting the aggregated neighborhood of a node, finding shortest paths, or running graph algorithms.</p><p>Here’s the definition of the write-neo4j-cypher tool:</p><pre>@server.list_tools()<br>    async def handle_list_tools() -&gt; list[types.Tool]:<br>        &quot;&quot;&quot;List available tools&quot;&quot;&quot;<br>        return [<br>            types.Tool(<br>                name=&quot;write-neo4j-cypher&quot;,<br>                description=&quot;Execute a write Cypher query on the neo4j database&quot;,<br>                inputSchema={<br>                    &quot;type&quot;: &quot;object&quot;,<br>                    &quot;properties&quot;: {<br>                        &quot;query&quot;: {&quot;type&quot;: &quot;string&quot;, &quot;description&quot;: &quot;Cypher write query to execute&quot;},<br>                    },<br>                    &quot;required&quot;: [&quot;query&quot;],<br>                },<br>            ),<br>#            ...<br>        ]</pre><h4>Execution of the Registered Tools</h4><p>When a user asks Claude to query the database, Claude asks the user for permission, starts the server if it’s not already running as a local server, and sends a request to the MCP server with the tool name and parameters via the transport protocol.</p><p>In the tool implementations, we just use this class to execute the statements. 
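The is_write_query helper used in _execute_query above can be imagined as a simple keyword check. This is an illustrative sketch under that assumption, not the exact code from the mcp-neo4j repository:

```python
import re

# Sketch of is_write_query: classify a Cypher statement as a write if it
# contains any updating-clause keyword. The real helper in the mcp-neo4j
# repository may use a different heuristic.
WRITE_CLAUSES = re.compile(
    r"\b(CREATE|MERGE|DELETE|DETACH|SET|REMOVE|DROP)\b", re.IGNORECASE
)

def is_write_query(query: str) -> bool:
    """Return True if the Cypher statement modifies the graph."""
    return bool(WRITE_CLAUSES.search(query))
```

A production version would also need to handle keywords that appear inside string literals or comments, which this naive substring match does not.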
There is a small check that we only allow read statements in the read tool and vice versa (the write tool only returns update statistics).</p><pre>@server.call_tool()<br>    async def handle_call_tool(<br>        name: str, arguments: dict[str, Any] | None<br>    ) -&gt; list[types.TextContent | types.ImageContent | types.EmbeddedResource]:<br>        &quot;&quot;&quot;Handle tool execution requests&quot;&quot;&quot;<br>        try:<br>            if name == &quot;get-neo4j-schema&quot;:<br>                results = db._execute_query(<br>                    &quot;&quot;&quot;<br>CALL apoc.meta.data() yield label, property, type, other, unique, index, elementType<br>WHERE elementType = &#39;node&#39;<br>RETURN label,<br>    collect(case when type &lt;&gt; &#39;RELATIONSHIP&#39; then [property, type] end) as attributes,<br>    collect(case when type = &#39;RELATIONSHIP&#39; then [property, head(other)] end) as relationships<br>                    &quot;&quot;&quot;<br>                )<br>                return [types.TextContent(type=&quot;text&quot;, text=str(results))]</pre><p>So you can see it’s pretty straightforward to implement an MCP server for a tool of your choice.</p><h3>From Conversation to Code</h3><p>Combining MCP tool results with Claude’s code generation and execution capabilities is taking the interaction to the next level.</p><p>One thing that’s already impressive in the interactions is that the previous tool call data results are part of the conversational history, so the LLM makes use of them to shape queries and input parameters for further server interactions.</p><p>But it goes even beyond this with the code-generation capabilities, which allow you to easily render results as charts, networks, or just compute derived data to export. 
It generates JavaScript artifacts, typically React components combined with charting libraries, that you can render and visualize directly.</p><p>Here are two examples of generating a heat map per movie or per rating:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*LItlnYmWNjC3KGYT3E2Elw.png" /><figcaption>Heat-map rendering of Tarantino movies</figcaption></figure><blockquote><em>Actually I wanted ratings and years as a square heatmap.</em></blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*6a1W5KdhOK3dcw69n_vrNA.png" /><figcaption>Square heat map of rating-count per rating-value</figcaption></figure><p>And finally, a network visualization of our conversation:</p><blockquote><em>Generate a graph visualization of the conversation so far, starting with Quentin Tarantino’s movies to the recommendations all the way to the final suggestions. Please make sure the rendering has high contrast.</em></blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*6WQO0D2OxLgapo9UYhv3ZA.png" /></figure><h3>Conclusion</h3><p>We’d love for you to test out the Neo4j MCP server implementation and share your experiences, give us <a href="https://github.com/neo4j-contrib/mcp-neo4j">feedback on GitHub</a>, and perhaps write an article of your own.</p><p>Happy experimenting!</p><hr><p><a href="https://medium.com/neo4j/claude-conversing-with-neo4j-via-mcp-5419d24bb7a0">Claude Conversing With Neo4j Via MCP</a> was originally published in <a href="https://medium.com/neo4j">Neo4j Developer Blog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Get Started With GraphRAG: Neo4j’s Ecosystem Tools]]></title>
            <link>https://medium.com/neo4j/get-started-with-graphrag-neo4js-ecosystem-tools-eec476167e86?source=rss-3865848842f9------2</link>
            <guid isPermaLink="false">https://medium.com/p/eec476167e86</guid>
            <category><![CDATA[knowledge-graph]]></category>
            <category><![CDATA[genai]]></category>
            <category><![CDATA[llm]]></category>
            <category><![CDATA[development]]></category>
            <category><![CDATA[retrieval-augmented]]></category>
            <dc:creator><![CDATA[Michael Hunger]]></dc:creator>
            <pubDate>Wed, 19 Jun 2024 17:21:29 GMT</pubDate>
            <atom:updated>2024-06-19T17:21:29.967Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*CZJIEpFTke6IQ1qE.png" /></figure><p>We’re excited to introduce new resources for your GenAI apps: the Neo4j GraphRAG Ecosystem Tools. These open-source tools make it easy to get started with GenAI applications grounded with knowledge graphs, which help improve response quality and explainability and accelerate app development and adoption.</p><p>GraphRAG combines<a href="https://neo4j.com/blog/what-is-retrieval-augmented-generation-rag/"> retrieval-augmented generation (RAG)</a> with knowledge graphs to solve critical LLM issues like hallucination and lack of domain-specific context. Knowledge graphs provide the contextual memory LLMs need to reliably answer questions and serve as trusted agents in complex workflows — and unlike most RAG solutions, which only offer access to fragments of textual data, GraphRAG integrates structured and semi-structured information into the retrieval process.</p><p><a href="https://neo4j.com/blog/what-is-retrieval-augmented-generation-rag/">What Is Retrieval-Augmented Generation (RAG)? - Graph Database &amp; Analytics</a></p><p>These new tools will help you create a knowledge graph from unstructured text and use that graph — or an existing graph database — to retrieve relevant information for generative tasks via both vector and graph search.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*tKEswy3-7Ky8UxVc" /></figure><p>You can use the tools to kickstart GenAI development, integrate them into your own systems, or use them as a reference template for building your own custom implementations. 
The current implementations use our LangChain integrations for Python and JavaScript, but you can also build them with other languages and frameworks.</p><p>In this post, we give an overview of the LLM Knowledge Graph Builder, NeoConverse, and GenAI framework integrations.</p><h3>Quickly Turn Unstructured Text Into a Knowledge Graph</h3><p>The <a href="https://neo4j.com/labs/genai-ecosystem/llm-graph-builder/">Neo4j Knowledge Graph Builder</a> can seem magical — just load unstructured text to produce a structured graph that surfaces hidden entities and relationships within the data. It works with PDFs, Word documents, YouTube transcripts, Wikipedia pages, and many other kinds of unstructured text.</p><figure><a href="https://llm-graph-builder.neo4jlabs.com/"><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*saqPptVg93bBApHH" /></a><figcaption>Steps for Knowledge Graph Creation</figcaption></figure><p>If you’re new to graph technology, you can use the Knowledge Graph Builder to easily create graphs from familiar domain information. More experienced graph developers might use it to kickstart new projects.</p><p><strong>You can </strong><a href="https://llm-graph-builder.neo4jlabs.com/"><strong>use the Graph Builder online</strong></a><strong>.</strong> If you do not have a Neo4j instance, you can create a free <a href="https://console.neo4j.io">Neo4j Aura</a> database.</p><p>In addition to extracting source documents and their chunks and embeddings as the lexical graph, the Graph Builder extracts the graph of entities and their relationships and connects them to the chunks.</p><p>To see the unstructured and structured contextual data behind the answers, you can visualize and question the ingested data.
And because we use GraphRAG behind the scenes for every vector search result, we can fetch the associated entities and provide them to the LLM to generate an answer.</p><figure><a href="https://llm-graph-builder.neo4jlabs.com/"><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*J-E6qJh3RRCGbwKo" /></a></figure><p>The Graph Builder front-end is a React application that uses the Neo4j design system (via the <a href="https://neo4j.com/labs/neo4j-needle-starterkit/">Needle Starter Kit</a>) and the recently published <a href="https://neo4j.com/docs/nvl/current/">Neo4j Visualization Library</a>.</p><p>The backend uses the <a href="https://neo4j.com/labs/genai-ecosystem/langchain/">LangChain integrations</a> for the interactions with Neo4j, the knowledge graph extraction, and the GraphRAG search that combines vector search with graph retrieval queries. Written in Python, it uses FastAPI and runs as containers on Google Cloud Run. But you can also <a href="https://neo4j.com/labs/genai-ecosystem/llm-graph-builder-deployment/">run it locally</a> with Docker Compose.</p><p>You can use the LangChain integrations in your own code, as shown below, for knowledge graph construction:</p><pre>from langchain_experimental.graph_transformers import LLMGraphTransformer<br>from langchain_openai import ChatOpenAI<br>from langchain_community.graphs import Neo4jGraph<br><br>graph = Neo4jGraph()<br>llm = ChatOpenAI(temperature=0, model_name=&quot;gpt-4o&quot;)<br><br>kg_transformer = LLMGraphTransformer( llm=llm,<br>  allowed_nodes=[&quot;Person&quot;, &quot;Country&quot;, &quot;Company&quot;],<br>  allowed_relationships=[&quot;LOCATED_IN&quot;,&quot;WORKED_AT&quot;])<br>results = kg_transformer.convert_to_graph_documents(documents)<br>graph.add_graph_documents(results)</pre><p>To learn more about the Knowledge Graph Builder, find the source code, and see walkthrough videos, check out our <a href="https://neo4j.com/labs/genai-ecosystem/llm-graph-builder/">Graph Builder docs 
page</a>.</p><p>Better yet, <a href="https://llm-graph-builder.neo4jlabs.com/">try it yourself</a> with your own files.</p><p>You can also watch the short demo video here.</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FLlNy5VmV290%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DLlNy5VmV290&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FLlNy5VmV290%2Fhqdefault.jpg&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/a6a1fc0888c952295da39fbf8e69e98f/href">https://medium.com/media/a6a1fc0888c952295da39fbf8e69e98f/href</a></iframe><h3>Query Your Graph Using Natural Language</h3><p>Our NeoConverse tool uses the structure of an existing knowledge graph to generate Cypher graph queries from a user’s question and then executes them against the Neo4j database. Those query results are used to generate a text or chart response.</p><p><strong>You can </strong><a href="https://neoconverse.graphapp.io/"><strong>test out NeoConverse</strong></a><strong> online.</strong></p><p>Here’s the NeoConverse flow:</p><ol><li>User selects the dataset and if they want responses as text or chart.</li><li>User asks the question.</li><li>The system extracts the database schema from the database and combines it with the user question to generate a prompt for the configured LLM.</li><li>The LLM generates a Cypher query to execute against the database.</li><li>The query is validated and executed.</li><li>The query results are sent to the LLM with the user question and a specific prompt to either generate a response as markdown text or configuration for the charting library.</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*wOtRTjzoVabl_zfw" /><figcaption>NeoConverse Flow</figcaption></figure><p>NeoConverse comes with several pre-configured datasets to demonstrate its capabilities. 
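The six-step NeoConverse flow above can be condensed into a short sketch; here, get_schema, ask_llm, and run_cypher are hypothetical stand-ins for the schema extraction, the LLM call, and the Neo4j driver execution, and are not NeoConverse's actual API:

```python
# Hypothetical sketch of the NeoConverse text-to-Cypher flow.
def answer_question(question, get_schema, ask_llm, run_cypher):
    # 3. extract the database schema and combine it with the user question
    prompt = (
        f"Schema:\n{get_schema()}\n\n"
        f"Write a read-only Cypher query answering: {question}"
    )
    # 4. the LLM generates a Cypher query
    cypher = ask_llm(prompt)
    # 5. validate (a naive keyword check in this sketch) and execute it
    if any(kw in cypher.upper() for kw in ("CREATE", "MERGE", "DELETE", "SET")):
        raise ValueError("only read queries are allowed")
    rows = run_cypher(cypher)
    # 6. send the results back with the question to phrase the final answer
    return ask_llm(f"Question: {question}\nResults: {rows}\nAnswer as text.")
```

The real implementation validates the generated Cypher more carefully and, depending on the user's choice in step 1, asks the LLM for charting-library configuration instead of a text answer.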
For each, you can see a schema of the database and example questions (click on the vertical ellipsis). You can also configure additional datasets in NeoConverse to connect to your own Neo4j database.</p><figure><a href="https://neoconverse.graphapp.io/"><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*CJUyoYwGPckGWR0J" /></a><figcaption>NeoConverse UI with 2 questions one for a textual and one for a chart response</figcaption></figure><p>To read more about NeoConverse, and find additional videos, blog posts, and the link to the GitHub repository, head over to <a href="https://neo4j.com/labs/genai-ecosystem/neoconverse/">our NeoConverse docs page</a>.</p><h3>Use Your Preferred LLM Frameworks</h3><p>Neo4j integrates seamlessly with most open-source GenAI ecosystem libraries for Python, JavaScript, Java, and .Net.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*o4yqARlIzUHgdZFi" /></figure><p>We’ve integrated with <a href="https://neo4j.com/labs/genai-ecosystem/langchain/">LangChain Python</a> and <a href="https://neo4j.com/labs/genai-ecosystem/langchain-js/">LangChain JavaScript</a> to provide vector and graph search, text-to-cypher, conversational memory, knowledge graph construction, advanced RAG templates, and much more.</p><p>Our <a href="https://neo4j.com/labs/genai-ecosystem/llamaindex/">LlamaIndex</a> integrations include Cypher search, vector search, knowledge graph representation and construction, and text-to-cypher querying. 
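The vector search that recurs across these integrations is, at its core, nearest-neighbor ranking of embeddings by similarity. As a library-agnostic toy sketch (a real deployment would use the Neo4j vector index rather than an in-memory scan):

```python
# Toy nearest-neighbor retrieval over in-memory embeddings, illustrating
# what a vector index does conceptually: rank stored vectors by cosine
# similarity to the query vector and return the top k.
from math import sqrt

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def top_k(query: list[float], docs: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """docs is a list of (id, embedding) pairs; returns the k closest ids."""
    ranked = sorted(docs, key=lambda d: cosine(query, d[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]
```

In a GraphRAG setting, the ids returned here would be node references, and a follow-up graph query would expand to the connected entities before prompting the LLM.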
We recently collaborated with the LlamaIndex team on a full revamp of the <a href="https://www.llamaindex.ai/blog/introducing-the-property-graph-index-a-powerful-new-way-to-build-knowledge-graphs-with-llms">knowledge graph integration</a> for both construction and querying.</p><p>For Deepset’s <a href="https://neo4j.com/labs/genai-ecosystem/haystack/">Haystack</a>, we got a valuable community contribution for vector search and Cypher querying, and we’ll be adding more capabilities soon.</p><p>In the Java space, we’ve integrated vector search into <a href="https://neo4j.com/labs/genai-ecosystem/spring-ai/">Spring AI</a> and <a href="https://neo4j.com/labs/genai-ecosystem/langchain4j/">LangChain4j</a>. We’ve also implemented Neo4j support for semantic memory in <a href="https://neo4j.com/labs/genai-ecosystem/semantic-kernel/">Semantic Kernel</a>. Finally, in <a href="https://neo4j.com/labs/genai-ecosystem/dspy/">DSPy</a>, we’ve added a Neo4j-based retriever module that makes use of the Neo4j vector index.</p><p>Many of these integration pages point to relevant starter kit implementations that explain how to build <a href="https://neo4j.com/labs/genai-ecosystem/rag-demo/">GraphRAG applications on the EDGAR SEC filings</a> dataset.</p><h3>Up Your Skills for GraphRAG and GenAI Development</h3><p>For a deeper dive into the Neo4j GraphRAG tools and GenAI ecosystem, take a look at our <a href="https://neo4j.com/labs/genai-ecosystem/">GenAI ecosystem pages</a>.</p><p>The pages provide much more detail on Neo4j GenAI features like embedding generation and vector search, as well as information on our cloud-native GenAI integrations with Google (Vertex AI), AWS (Bedrock), and Azure (OpenAI) — including a video tutorial for each service.</p><p>You’ll also find example GenAI projects, including GraphRAG demos, NeoConverse, and the Knowledge Graph Builder in action, explaining the functionality behind each tool.</p><p>Hands-on learning opportunities are available as well. 
We’ve collaborated with Deeplearning AI on a <a href="https://www.deeplearning.ai/short-courses/knowledge-graphs-rag/">knowledge graph course</a> that walks you through building graph-powered GenAI applications.</p><p><a href="https://www.deeplearning.ai/short-courses/knowledge-graphs-rag/">Knowledge Graphs for RAG</a></p><p>We also teach GenAI app development in depth in our free <a href="https://graphacademy.neo4j.com/categories/generative-ai/">GraphAcademy courses</a>.</p><p><a href="https://graphacademy.neo4j.com/categories/generative-ai/">Free Neo4j Generative AI Courses from GraphAcademy</a></p><p>We just ran a livestream on “<a href="https://go.neo4j.com/WBR-240618-GraphRAG-Ecosystem_Registration.html">Kickstarting your GenAI Development with Neo4j’s GraphRAG Ecosystem Tools</a>” with a lot of interesting questions — feel free to watch it here.</p><p><a href="https://go.neo4j.com/WBR-240618-GraphRAG-Ecosystem_Registration.html">Neo4j | Kickstart GenAI Development with Neo4j&#39;s GraphRAG Ecosystem</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=eec476167e86" width="1" height="1" alt=""><hr><p><a href="https://medium.com/neo4j/get-started-with-graphrag-neo4js-ecosystem-tools-eec476167e86">Get Started With GraphRAG: Neo4j’s Ecosystem Tools</a> was originally published in <a href="https://medium.com/neo4j">Neo4j Developer Blog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Google Cloud Next Recap from a GenAI and Databases Perspective]]></title>
            <link>https://medium.com/@mesirii/google-cloud-next-recap-from-a-genai-and-databases-perspective-e0bce24b4535?source=rss-3865848842f9------2</link>
            <guid isPermaLink="false">https://medium.com/p/e0bce24b4535</guid>
            <category><![CDATA[genai]]></category>
            <category><![CDATA[gemini]]></category>
            <category><![CDATA[database]]></category>
            <category><![CDATA[google-next]]></category>
            <category><![CDATA[neo4j]]></category>
            <dc:creator><![CDATA[Michael Hunger]]></dc:creator>
            <pubDate>Mon, 22 Apr 2024 14:47:12 GMT</pubDate>
            <atom:updated>2024-04-22T14:47:12.929Z</atom:updated>
<content:encoded><![CDATA[<p>Last week I was back in Vegas after only a few months, this time for my first Google Cloud Next. Somehow I never made it to the previous ones, but I’m really glad <a href="https://medium.com/u/939df5ed7a01">Sudhir Hasbe</a> convinced me to come.</p><p>It was really worth it, and I could observe him in his natural element, beloved by so many of his Google colleagues. I was surprised to see so many more women and people from underrepresented groups, way more than at other industry conferences. Kudos to Google for being so supportive of diversity.</p><p><strong>Disclaimer</strong>: It was a whirlwind of a week, so if I mixed something up, forgot to mention someone, or attributed something wrongly, please don’t be mad at me.</p><p>The post got a bit long, but I think it’s worth the read (there are a lot of pictures). I only covered the sessions I attended; there were hundreds of other awesome sessions, so check them out on the program site (see below).</p><p>Sadly, I missed the solar eclipse on Monday as I was still in transit.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*g_YGhGSX4RsvXgoL" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*pN6TPgVd_Df3b6jl" /><figcaption>Vegas at Night</figcaption></figure><p>We had a strong Neo4j presence at the conference, with two booths, a lot of folks onsite, and the Partner of the Year award for Data Management for the 2nd year in a row. It was a great opportunity to speak with a lot of customers and folks who were new or curious about graphs and GenAI. 
We had also just <a href="https://neo4j.com/blog/graphrag-genai-googlecloud/">announced new Vertex AI integrations, and DataFlow Templates in GA</a>, so there was a lot to talk about.</p><p><a href="https://neo4j.com/blog/graphrag-genai-googlecloud/">Neo4j Brings GraphRAG Capabilities for GenAI to Google Cloud - Graph Database &amp; Analytics</a></p><p>I also really enjoyed meeting and talking with folks from Google in person: Rupal Jain and Maruti from our partner team, and a number of PMs and engineering leads for databases and Vertex AI.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*yZruXPmkcZ8nh9iV" /></figure><p>It is always impressive to see the logistics of housing, feeding, and entertaining 30K+ developers in action. The lunch choices were also quite good, from poke bowls to a variety of salads. I loved the vast amounts of vinegar potato chips (my family hates them).</p><h3>Sessions, Videos and Slides</h3><p>The volume of great sessions at Google Next was just incredible; it was really hard to choose between database topics with and without AI, applied GenAI, and a lot of customers and partners presenting on their integrations and progress. Big kudos to the Google team for pulling it all off, especially as many of the sessions are already available on the <a href="https://cloud.withgoogle.com/next">Google Next Program Site</a> with slides and videos (on YouTube).</p><figure><a href="https://cloud.withgoogle.com/next"><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*NSscs5gsR-oEWNXs" /></a></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*lekc8usq2_1V0ts9" /></figure><h3>Opening Keynote</h3><p>As expected, the theme of the conference was GenAI as a “new way to help customers &amp; users”. 
Starting with <a href="https://www.youtube.com/watch?v=M-CzbTUVykg">Thomas Kurian’s opening keynote on Tuesday (short video)</a> which was packed in the massive arena, with a really cool stage setup including revolving doors and multi-shape displays.</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FM-CzbTUVykg%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DM-CzbTUVykg&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FM-CzbTUVykg%2Fhqdefault.jpg&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/39b0440178be09cfccf697ac0ee181b2/href">https://medium.com/media/39b0440178be09cfccf697ac0ee181b2/href</a></iframe><p>As I had arrived early, I had a good spot. Everything was in the theme of the AI Hypercomputer, hardware, databases, models, agents, and services. Agents (and Agent Builder) were a big thing as well as adding AI to all Google Cloud databases. Gemini 1.5 was finally made into public preview and Claude 3 is now also in the Model Garden. 
It was a bit sad that Sundar didn’t come to the event in person and there was only a recorded message.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*ovVJVaRC73YdI-iZ" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*E5zknnEmOk8gxJ96" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*xlxk3-qwrIsRpElh" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*QKM7ECx-0hvsXDvh" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*eRLekkX7VzD0oOIk" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*je4M26m7aYerUPkq" /></figure><p>The Google Vids demo by Aparna was also really impressive, looking forward to using it, which will be good as I’m a video noob :) and it was great to see a native ARM CPU from Google too (Axion).</p><h3>Databases (AlloyDB) and AI</h3><p>Even more than the keynote I enjoyed the <a href="https://youtu.be/jR2YY9_IzF4">Databases and AI session (video)</a> from <a href="https://medium.com/u/a4c7b5af1fc8">Andi Gutmans</a>, Anita, Sandy presenting with our friend <a href="https://medium.com/u/bbfa018ac706">Harrison Chase</a> and Hari Ramamurthy from Home Depot.</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FjR2YY9_IzF4%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DjR2YY9_IzF4&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FjR2YY9_IzF4%2Fhqdefault.jpg&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/bc5d1e1fdf6ecaa191ccf54bc5ff123f/href">https://medium.com/media/bc5d1e1fdf6ecaa191ccf54bc5ff123f/href</a></iframe><p>Andi showcased the AI capabilities for all databases, including vector search, embedding generation (functions), advanced text2sql and parameterized secure views to 
safeguard against credential injection attacks, with a great Cymbal Air demo by Sandy Ghai. Here is the <a href="https://cloud.google.com/alloydb/docs/ai">documentation for AlloyDB AI</a>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*m2EoQAd8gy9CSY6X" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*2fQbMvOu-2tyrzsY" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*bbQz72-B6aZckauV" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*WFMjG_b_k1LpsObu" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*NfHcFl5JThtFTjm3" /></figure><p>Andi spoke with Harrison about Google’s focus on LangChain (see below). Hari discussed GenAI opportunities at Home Depot and mentioned knowledge graphs with vector and text search as retrieval components.</p><h3>The Future of Databases and Generative AI</h3><p>There was also a really good session, “<a href="https://youtu.be/tkGHELHofPo">The future of databases and generative AI (video)</a>,” with Sailesh Krishnamurthy and Fiona Tan, the CTO of Wayfair. She discussed using AI to migrate her legacy monolith, heavy on SQL and stored procedures on SQL Server, to smaller GraphQL-fronted services backed by individual databases, and how much time was saved by doing many of the rote migrations automatically (at least to a certain degree).</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FtkGHELHofPo%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DtkGHELHofPo&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FtkGHELHofPo%2Fhqdefault.jpg&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a 
href="https://medium.com/media/81e60ece67f3f20431090f6afe4922b0/href">https://medium.com/media/81e60ece67f3f20431090f6afe4922b0/href</a></iframe><p>Sailesh introduced the <a href="https://assets.swoogo.com/uploads/3775376-6616d1d46be5a.pdf">ScANN index</a> and advanced, interactive text2sql as well as some good conceptual thinking.</p><p>(<a href="https://assets.swoogo.com/uploads/3784391-66186ba26cd46.pdf">slides</a>)</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*WDxUJap4u26Pzo2c" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*gz6ElY1sPDGxRaJL" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*1cB4eRR79UPkg5vz" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*8rn4QgjYazMy6C83" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*SdEhO6luoZImu72z" /></figure><h3>GenAI Applications with AlloyDB Omni</h3><p>Sudhir had introduced me to GG and Anita the day before, so it was nice to attend <a href="https://cloud.withgoogle.com/next/speakers?session=DBS106">GG’s session on GenAI applications with AlloyDB Omni</a> (<a href="https://assets.swoogo.com/uploads/3812145-661ec42aa09bf.pdf">slides</a>). Part of the session introduced the new AI capabilities, especially built in embedding and prediction functions and the ability to call remote models (including Hugging Face and OpenAI).</p><p>He was joined by Tabitha Lewis who did an awesome demo, showing an app running against a local embedding model (registered in the model registry) using the embedding functions. 
What’s really cool about AlloyDB Omni is that it can run everywhere: your local machine, K8s clusters, self-hosted environments, and of course GCP, with much of the functionality of the hosted version.</p><p>And even more impressive was Sharanya Desai from NeuroPace, a company producing FDA-approved electric brain stimulators for epilepsy patients, which used vector search to find patients by their brain patterns.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*cSIUuF1bsSSTM9DW" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*NsSav_IML1N3xNSF" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*t675yXqqAFoPIx-i" /></figure><h3>AI in Database Management</h3><p>Not just for app developers but also for DevOps folks and DBAs, <a href="https://youtu.be/tvoydK1TEvU">Gemini in Databases (video)</a> is helping everyone get work done more efficiently.</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FtvoydK1TEvU%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DtvoydK1TEvU&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FtvoydK1TEvU%2Fhqdefault.jpg&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/c05a111b4e3f66fb160488ec1af78a88/href">https://medium.com/media/c05a111b4e3f66fb160488ec1af78a88/href</a></iframe><p>AI was everywhere in databases, including operations, maintenance, tuning, and migrations, which are really useful capabilities. 
It can also check databases for compliance and potential cost improvements.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*WHyQkMEiSu1LZ9TX" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*xLFSZ-ocUNC55uoH" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*CZYSLI4tRJf2PNnJ" /></figure><h3>Developer Keynote</h3><p>I was speaking to a customer about exciting knowledge graph and LLM topics, so we were a bit late to the <a href="https://www.youtube.com/watch?v=KMH5OcW5UYw">Developer Keynote (video)</a>, which was a massive fireworks show presented by <a href="https://medium.com/u/cf4450e0493f">Richard Seroter</a> and <a href="https://medium.com/u/4e7db1ff5bba">Chloe Condon</a>.</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FKMH5OcW5UYw%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DKMH5OcW5UYw&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FKMH5OcW5UYw%2Fhqdefault.jpg&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/73dff2c699ab445ad9eb08ecb796efe6/href">https://medium.com/media/73dff2c699ab445ad9eb08ecb796efe6/href</a></iframe><p>It was awesome to see my friend <a href="https://medium.com/u/a17df5ec14a4">Josh Long</a> on stage presenting Spring (AI) in a whirlwind tour, as well as Guillermo Rauch with Vercel’s AI SDK and Honeycomb’s Charity Majors on AI observability, closed off by a cool Hugging Face demo by Philipp Schmid on training and deploying HF models in Vertex AI. 
Sadly I had missed Brad Calder’s part, the man whom Sudhir told me so much about.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*BW_YryNpbEREAd_F" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*xzuc-3E495B5W53x" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*nr6xBohD0UnomI0e" /></figure><h3>Good Food and Company</h3><p>One good thing about Vegas, even if it’s not my favorite place, is the good food and that you get to do 20k steps each day just walking between places to compensate for the calories. We had a really good time with the team at Beauty &amp; Essex speakeasy.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*42xSXWJ3P7L6qa-K" /></figure><h3>Vertex AI — LangChain, Extensions, Reasoning Engine</h3><p>Google is going all in on LangChain which makes me happy because we’ve been collaborating with Harrison, Jacob and the team for a long time, so <a href="https://neo4j.com/labs/genai-ecosystem/langchain/">Neo4j is well integrated with LangChain</a> and perfectly set up to participate in your GenAI experience on Google Cloud.</p><p>I’ve been working with Vertex AI extensions and Reasoning Engine for quite some months now — so I really enjoyed @Julia Wiesinger and @Kris Overholt <a href="https://cloud.withgoogle.com/next/speakers?session=AIML231">presenting the Agent integration using LangChain tools deployed with Reasoning Engine</a>. 
There was even a graph database easter egg in one of the slides :) Here is the <a href="https://docs.google.com/presentation/d/e/2PACX-1vSCpEevu0qXws0d8sxt2s09r5v4ywBd3KRSw_00VOx7PhNS6hKsSw1jO9XSdd7l3TYNX5AJ71pYJ9C0/pub?pli=1&amp;slide=id.g2c90fce9fd8_0_15">resources slide</a> with all the links you would need</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*oDE4f3pG04_k2DGq" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*94qn2pHLH1A0ROsC" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*9h8wXfnOmSUIqfsq" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*nmQ9rGGjvZUFoCC3" /></figure><p><strong><em>I’ve almost finished my blog post on Vertex AI extensions and Reasoning Engine with GraphRAG, it’s coming soon, stay tuned.</em></strong></p><p>There were a lot of LangChain related talks which kept Harrison really busy :)</p><h3>LangChain at Google Cloud Next</h3><p>The following slide describes it well. @LangChain is supported throughout the Vertex AI and databases stack. 
That includes building, running, and deploying RAG applications (with templates, LangServe, and Reasoning Engine), as well as monitoring with LangSmith (and hopefully natively in GCP soon).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*15bOccnrJ7lvBLtA" /></figure><p>Here is a small selection of the talks that touched on LangChain, but it was also used for many other applications in talks about AlloyDB Omni, the Cymbal Air demo, and more.</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2Fl7tNx52bnsc%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3Dl7tNx52bnsc&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2Fl7tNx52bnsc%2Fhqdefault.jpg&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/ef626dced943e6310619ab611194891f/href">https://medium.com/media/ef626dced943e6310619ab611194891f/href</a></iframe><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2Fhkd-JJyW-og%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3Dhkd-JJyW-og&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2Fhkd-JJyW-og%2Fhqdefault.jpg&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/9f0cda3c57edfe55f6a560855afbf95c/href">https://medium.com/media/9f0cda3c57edfe55f6a560855afbf95c/href</a></iframe><h3>Gemini Fine Tuning</h3><p>One session with great live demos was the one <a href="https://cloud.withgoogle.com/next/session-library?session=AIML305">on the newly released Gemini Fine Tuning</a> with May Hu and Anand Iyer. They discussed the whys and hows of fine-tuning and demonstrated how to do it with Gemini (training time on 500 examples, about an hour and a few hundred dollars 
of cost) and compared the base and fine-tuned model. May also introduced the student-teacher approach to fine-tuning, where a larger model can train a smaller model (e.g., Gemma 7B) to improve massively on tasks. She also discussed planning, monitoring, validation, and eval aspects of fine-tuning, as well as tips. They also shared a <a href="https://cloud.google.com/vertex-ai/generative-ai/docs/models/gemini-use-supervised-tuning">documentation page with more details</a>.</p><p>@Lavanya Ramani from Palo Alto Networks shared their journey of a chatbot, fine-tuned for their types of cybersecurity documents, including eval results and lessons learned, some of which also resonated with me.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*FDNxTh_L4kwJAxbv" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*Q8UhQDfIPy3mgA-C" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*6L0cGj1ocZuMAw9i" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*a_QBPVuXkKXYnKib" /></figure><h3>RAG &amp; Agent Patterns</h3><p>As I had been speaking with @Kurtis van Gent for a few months now, I knew that he was really interested in <a href="https://youtu.be/g7J3q-MW284">advanced patterns for RAG and Agents (video)</a> and had a lot of best practices to share from his experience implementing them.</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2Fg7J3q-MW284%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3Dg7J3q-MW284&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2Fg7J3q-MW284%2Fhqdefault.jpg&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/89f87decedeaa8c4884edc394c47927d/href">https://medium.com/media/89f87decedeaa8c4884edc394c47927d/href</a></iframe><p>So I was really excited 
for his talk and boy I was not disappointed, such a rich set of tips and guidance in a single session. Totally worth it.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*3Kvrm8Ap3MRqMQoX" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*t-ZwqWBlScpTrSot" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*LPAoybgBLUwzs0ZI" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*yQw2eXjHmOeqQfAu" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*Ra9z7lWhXh3ivKy9" /></figure><h3>BigQuery Developer Experience</h3><p>I really love the <a href="https://www.youtube.com/watch?v=yMgwMHiORHw">BigQuery Data Canvas (video)</a> with Abinav Kushraj and especially the live demo by engineering lead Mike Romano.</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FyMgwMHiORHw%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DyMgwMHiORHw&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FyMgwMHiORHw%2Fhqdefault.jpg&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/82f8317f29907bd7c34e98212badfc40/href">https://medium.com/media/82f8317f29907bd7c34e98212badfc40/href</a></iframe><p>Using an infinite canvas where you can place datasets, queries, results, visualizations and derive one from each other and generate both queries and visualizations with natural language is really a great interactive tool for data analysts.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*U7Ty6Dab-1mVHWIN" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*0G9tTQVYJOdG7OYR" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*hRxDK_HePSbgSNyx" /></figure><h3>Stackoverflow collaboration with Google 
GenAI and what it means for Developer Communities</h3><p>Almost the last talk of the conference was really interesting, and I was curious how it would turn out. As we all know, the advent of ChatGPT coincided with quite a drop in Stack Overflow usage and engagement. Stack Overflow CTO/CPO Ryan Polk and Chen Wei shared some interesting thoughts on the future of the platform. The focus seems to be on investing in strengthening the community with money made from providing the data as training/RAG data for code assistant providers like Google. In the presentation there was a nice demonstration of Gemini Code Assist (which we also contributed to) linking back to Stack Overflow as the source of the provided information. (<a href="https://assets.swoogo.com/uploads/3784222-66185fbb12390.pdf">slides</a>)</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*9EuRnwkET7_l-RBw" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*DS3QGPuu0l8UkJjd" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*xBLQOjtVBSZXow1g" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*RV8siyI8nOhpVTTa" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*tBLPiO8c61ntbh6u" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*YJkrV1gz9yyBXf6e" /></figure><p>After the talk I spoke with Ryan a bit, discussing the power of data and community, and how AI could reduce the threshold for not just asking but also answering questions, and how the timeframe of responses together with a package analytics system like <a href="https://deps.dev">https://deps.dev</a> from Google could help pinpoint versions of software discussed.</p><p>I also shared a bit about how I’ve been using Stack Overflow data in the past as a <a href="https://neo4j.com/blog/import-10m-stack-overflow-questions/">knowledge graph</a>, for <a 
href="https://towardsdatascience.com/tagoverflow-correlating-tags-in-stackoverflow-66e2b0e1117b">tag correlation</a>, and most recently with Docker, LangChain, and Ollama as part of the <a href="https://neo4j.com/developer-blog/genai-app-how-to-build/">GenAI-Stack</a>. Looking forward to future conversations.</p><h3>Blue Men and a Road Trip to the Grand Canyon</h3><p>At the end of the week I enjoyed some good company and went to see Blue Man Group, which was exhilarating, creepy, impressive, and fun.</p><p>And I went on a road trip to the Grand Canyon and Hoover Dam with my colleague and GenAI/GDS expert Zach Blumenfeld. The tour and the views were really impressive, definitely worthwhile when you’re in Vegas (it’s only a 2.5-hour drive). On the way we had the opportunity to visit my long-time team member Elaine Rosenberg, who worked on GraphAcademy and retired last year. Really glad she’s almost back to good health now. She made us fresh apple pie and we had a great time.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*dVvuldKMrMGXyJJx" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*XxuD9wblKhPqiQK3" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*VB2XZ1b1KyIKPqO7" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*kh2mi549u7PaqHCX" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*QzKTuxGe4diqQPZz" /></figure><p>On my flight back I met some Google people from Zurich and Vienna, and we had a good time. Sadly, my plane to Berlin got delayed, so I missed my bus. But that gave me time to start working on this recap. 
I hope you enjoyed it as much as I enjoyed the Wiener sausage with potato salad and Radler in Berlin Airport :)</p><p>Until next time, I hope to meet many of you later this year, perhaps at the <a href="https://www.ai.engineer/worldsfair">AI Engineer World Fair</a> in San Francisco.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*lXwkKe7bKRGt0KJ1" /></figure><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=e0bce24b4535" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Enriching LlamaIndex Models from GraphQL and Graph Databases]]></title>
            <link>https://medium.com/llamaindex-blog/enriching-llamaindex-models-from-graphql-and-graph-databases-bcaecec262d7?source=rss-3865848842f9------2</link>
            <guid isPermaLink="false">https://medium.com/p/bcaecec262d7</guid>
            <category><![CDATA[graph-database]]></category>
            <category><![CDATA[llm]]></category>
            <category><![CDATA[graphql]]></category>
            <category><![CDATA[python]]></category>
            <category><![CDATA[llamaindex]]></category>
            <dc:creator><![CDATA[Michael Hunger]]></dc:creator>
            <pubDate>Fri, 30 Jun 2023 19:28:52 GMT</pubDate>
            <atom:updated>2023-06-30T19:28:52.884Z</atom:updated>
            <content:encoded><![CDATA[<h3>Enriching LlamaIndex Models with GraphQL and Graph Databases</h3><h4>How to add new data loaders to LlamaIndex and use them.</h4><p>In this article I want to share the process of adding new data loaders to LlamaIndex. First we’ll look at what LlamaIndex is and try a simple example of providing additional context to an LLM query using a basic CSV loader. Then we’ll see how easy it is to add a new loader for graph databases to LlamaIndex. Lastly, we’ll try that new loader, and another one for GraphQL APIs that I added, in practice and see how their extra context can help an LLM answer questions better.</p><h3>Background/Context</h3><p>I was listening to the <a href="https://medium.com/llamaindex-blog/llamaindex-on-twiml-ai-a-distilled-summary-using-llamaindex-de2a88551595">&quot;This Week in ML&quot; (TWIML) Podcast</a>, where <a href="https://medium.com/u/e76da1c45ef7">Jerry Liu</a> from LlamaIndex (previously GPT-Index) explained the ideas behind the library: enriching the query context for LLMs with data from any number of sources.</p><p><a href="https://medium.com/llamaindex-blog/llamaindex-on-twiml-ai-a-distilled-summary-using-llamaindex-de2a88551595">LlamaIndex on TWIML AI: A Distilled Summary (using LlamaIndex)</a></p><p><a href="https://gpt-index.readthedocs.io/en/latest/index.html">LlamaIndex</a> is a toolkit to augment LLMs with your own (private) data using in-context learning. It takes care of selecting the right context to retrieve from large knowledge bases. To achieve that, it utilizes a number of connectors or loaders (from <a href="https://llamahub.ai/">LlamaHub</a>) and data structures (indices) to efficiently provide the pre-processed data as Documents.</p><p>Each type of index stores documents in different ways, e.g. via embeddings for vector search, or as a simple list, graph, or tree structure.
Those indices are used as a query interface to the LLM, transparently embedding the relevant context.</p><p>Besides the higher-quality response from the LLM, you also get back the documents that were used to construct the answer. LlamaIndex also allows chain-of-thought reasoning, compare/contrast queries, and natural language querying of databases.</p><p>See also this presentation from Jerry:</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FFQBou-YgxyE%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DFQBou-YgxyE&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FFQBou-YgxyE%2Fhqdefault.jpg&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/178b8b8ed7039d9b7b6b9b0d8468d51b/href">https://medium.com/media/178b8b8ed7039d9b7b6b9b0d8468d51b/href</a></iframe><p>All the code for the blog post is available in this <a href="https://colab.research.google.com/drive/1NUrIoiOh692LaQkBHEmnD-5IuLBpBqGJ#scrollTo=JN4gqQF-NRwj">Colab Notebook</a>.</p><p><a href="https://colab.research.google.com/drive/1NUrIoiOh692LaQkBHEmnD-5IuLBpBqGJ#scrollTo=JN4gqQF-NRwj">Google Colaboratory</a></p><h3>Using a Basic CSV Loader</h3><p>Here is an example of using a basic CSV loader to provide documents for LlamaIndex.</p><p>In our notebook we download countries.csv from the <a href="https://annexare.github.io/Countries/">Countries List Project</a> (MIT) (<a href="https://raw.githubusercontent.com/annexare/Countries/master/dist/countries.csv">raw source</a>).</p><p>Our dependencies are llama-index and python-dotenv.</p><pre>!pip install llama-index==0.6.19 python-dotenv</pre><p>We need to provide our OpenAI API key. To avoid accidentally leaking it in the notebook, I uploaded an openai.env file and use the dotenv library to load its contents as environment variables.</p><p>In the next step we load the env file and prepare
the OpenAI ChatGPTLLMPredictor (using gpt-3.5-turbo by default) and add it to the ServiceContext.</p><pre>import os<br>from pathlib import Path<br>from llama_index import GPTVectorStoreIndex, SimpleDirectoryReader, ServiceContext, GPTListIndex<br>from llama_index.llm_predictor.chatgpt import ChatGPTLLMPredictor<br>from dotenv import load_dotenv<br>from llama_index import download_loader<br><br>load_dotenv(&quot;openai.env&quot;)<br><br>llm_predictor = ChatGPTLLMPredictor()<br>service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor)</pre><p>Now we can use the loader to load the CSV and turn it into documents, and create a GPT index (a GPTVectorStoreIndex in this case), which LlamaIndex can then use to retrieve the relevant information to pass along in the context to the LLM.</p><p>Initializing CSV Loader and GPTVectorStoreIndex</p><pre>SimpleCSVReader = download_loader(&quot;SimpleCSVReader&quot;)<br>loader = SimpleCSVReader(concat_rows=False)<br>documents = loader.load_data(file=Path(&#39;./countries.csv&#39;))<br><br>print(documents)<br>index = GPTVectorStoreIndex.from_documents(documents, service_context=service_context)</pre><p>Documents from the CSV Loader</p><pre>[Document(text=&#39;country, capital, type&#39;, doc_id=&#39;67c30c68-7d9f-4906-945b-9affc96f95d2&#39;, embedding=None, doc_hash=&#39;3a506ebea9c04655b51406d79fdf5e3a87c3d8ff5b5387aace3e5a79711a21b8&#39;, extra_info=None),<br>Document(text=&#39;Abkhazia, Sukhumi, countryCapital&#39;, doc_id=&#39;6e6be4b5-051f-48e0-8774-6d48e0444785&#39;, embedding=None, doc_hash=&#39;ea387d0eab94cc6c59f98c473ac1f0ee64093901673b43e1c0d163bbc203026e&#39;, extra_info=None),<br>...]</pre><p>By default the CSV loader would create only a single Document for the whole file; with concat_rows=False it creates one document per CSV row, as shown above.</p><p>LlamaIndex supports much more involved setups of different kinds of indexes, allowing you to chain them and even conditionally select one
or the other. Here we just do the bare minimum to demonstrate our loaders.</p><p>After setting up the indices with the appropriate loaders, and connected indexes, we now can use the index as an LLM query engine and execute our user query.</p><p>To demonstrate that the LLM still is able to use its world knowledge, we can ask in a mix of English (System), German (Question) and French (requested Answer).</p><pre>queryEngine = index.as_query_engine()<br><br>queryEngine.query(&quot;&quot;&quot;<br>Provide the answer in French.<br>Question: Was ist die Hauptstadt von Albanien?<br>&quot;&quot;&quot;)</pre><p>As you can see in the response below it doesn’t just answer our question correctly in French La capitale de l’Albanie est Tirana., but also provides which documents it used to generate the answer.</p><pre>Response(response=&quot;La capitale de l&#39;Albanie est Tirana.&quot;, <br>source_nodes=[NodeWithScore(node=Node(text=&#39;              &lt;td&gt;Albania&lt;/td&gt;&#39;, doc_id=&#39;3decbee1-98cc-4650-a071-ed25cd3e00d5&#39;, embedding=None, doc_hash=&#39;7d9d85082095471a9663690742d2d49fc37b2ec37cc5acf4e99e006a68a17742&#39;, extra_info=None, <br>node_info={&#39;start&#39;: 0, &#39;end&#39;: 30, &#39;_node_type&#39;: &lt;NodeType.TEXT: &#39;1&#39;&gt;}, <br>relationships={&lt;DocumentRelationship.SOURCE: &#39;1&#39;&gt;: &#39;7b6c861f-2c2f-4905-a047-edfc25f7df19&#39;}), score=0.7926356007369129), <br>NodeWithScore(node=Node(text=&#39;              &lt;td&gt;Algiers&lt;/td&gt;&#39;, doc_id=&#39;8111b737-9f45-4855-8cd8-f958d4eb0ccd&#39;, embedding=None, doc_hash=&#39;8570a02a057a6ebbd0aff6d3f63c9f29a0ee858a81d913298d31b025101d1e44&#39;, <br>extra_info=None, node_info={&#39;start&#39;: 0, &#39;end&#39;: 30, &#39;_node_type&#39;: &lt;NodeType.TEXT: &#39;1&#39;&gt;}, relationships={&lt;DocumentRelationship.SOURCE: &#39;1&#39;&gt;: &#39;22e11ac6-8375-4d0c-91c6-4750fc63a375&#39;}), score=0.7877589022795918)], extra_info={&#39;3decbee1-98cc-4650-a071-ed25cd3e00d5&#39;: 
None, &#39;8111b737-9f45-4855-8cd8-f958d4eb0ccd&#39;: None})</pre><h3>LlamaIndex Loaders</h3><p>The number of existing data sources in <a href="https://llamahub.ai/">LlamaHub</a> is impressive; I counted 100+ integrations in <a href="https://github.com/emptycrown/llama-hub">the repository</a>. You can find anything from Google Docs to GitHub to relational databases.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*CiMuFLSaFdMTdMoAcoBJog.png" /><figcaption>LlamaHub, screenshot by Author</figcaption></figure><p>But I was missing two of my favorite technologies: GraphQL, the API query language open-sourced by Facebook, and graph databases like Neo4j, the best way to store and manage large amounts of connected data, for example in knowledge graphs.</p><blockquote>So I thought: &quot;How hard can it be to add them :)&quot;</blockquote><h3>Adding the New Loaders</h3><p>Adding new loaders is really straightforward. There is a script in the llama-hub repository to help with adding a new loader.
Running ./add-loader.sh &lt;folder&gt; added the skeleton files.</p><p>To get familiar with the existing implementations, I looked at the <a href="https://github.com/emptycrown/llama-hub/tree/main/llama_hub/database">Databases (relational)</a> and <a href="https://github.com/emptycrown/llama-hub/tree/main/llama_hub/mongo">MongoDB integrations</a>, the former for the graph database loader and the latter for the GraphQL one.</p><p>It was easy enough: we only needed to declare the requirements for our loader, implement base.py with a straightforward API, and add a README.md with an explanation and a code example.</p><p>The main difference between my loaders and the existing ones is that they don’t use hard-coded field names for extracting the relevant value from the query result, but instead turn the result into YAML.</p><p>I picked YAML not because I like it, but because it was the closest textual representation of a nested tree of key-value pairs that a user would write as nested bullet lists.</p><p>Below is the example code for the graph database implementation (the GraphQL one is similar).</p><h3>Adding the Graph Database Loader</h3><p>I added the requirement for the neo4j dependency, a Python driver that speaks the Cypher query language over the Bolt protocol and also works with Memgraph and AWS Neptune.</p><p>Then I added the code for <em>__init__</em> to take in a database server URI, database name, and credentials, and to create a driver instance.</p><p>The load_data method takes in the query to run and optional parameters.
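A quick aside before the implementation: the YAML representation mentioned above can be illustrated with a toy sketch in plain Python. This is not the loader code (the real loaders simply call yaml.dump on each record); the function name and exact formatting here are mine.

```python
# Toy sketch only: turn a nested query-result row into indented
# key-value text, roughly what yaml.dump produces for each record.
def to_text(value, indent=0):
    pad = "  " * indent
    if isinstance(value, dict):
        out = ""
        for k, v in value.items():
            if isinstance(v, (dict, list)):
                out += f"{pad}{k}:\n" + to_text(v, indent + 1)
            else:
                out += f"{pad}{k}: {v}\n"
        return out
    if isinstance(value, list):
        return "".join(f"{pad}- {v}\n" for v in value)
    return f"{pad}{value}\n"

# A row like the movie results later in this post
row = {"title": "Heat", "year": 1995, "names": ["Action", "Crime", "Thriller"]}
print(to_text(row), end="")
```

The point is that no field names are hard-coded: whatever shape the query returns becomes readable text for the LLM.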
It’s implemented by calling the driver’s execute_query method.</p><p>Each row of results is mapped into a LlamaIndex Document, with the text being the YAML representation of that row.</p><pre>&quot;&quot;&quot;Graph Database Cypher Reader.&quot;&quot;&quot;<br><br>from typing import Dict, List, Optional<br><br>from llama_index.readers.base import BaseReader<br>from llama_index.readers.schema.base import Document<br><br>import yaml<br><br>class GraphDBCypherReader(BaseReader):<br>    &quot;&quot;&quot;Graph database Cypher reader.<br><br>    Combines all Cypher query results into the Document type used by LlamaIndex.<br><br>    Args:<br>        uri (str): Graph Database URI<br>        username (str): Username<br>        password (str): Password<br><br>    &quot;&quot;&quot;<br><br>    def __init__(<br>        self,<br>        uri: str,<br>        username: str,<br>        password: str,<br>        database: str<br>    ) -&gt; None:<br>        &quot;&quot;&quot;Initialize with parameters.&quot;&quot;&quot;<br>        try:<br>            from neo4j import GraphDatabase, basic_auth<br><br>        except ImportError:<br>            raise ImportError(<br>                &quot;`neo4j` package not found, please run `pip install neo4j`&quot;<br>            )<br>        # fail fast on a missing URI instead of silently skipping driver creation<br>        if not uri:<br>            raise ValueError(&quot;`uri` must be provided.&quot;)<br>        self.client = GraphDatabase.driver(uri=uri, auth=basic_auth(username, password))<br>        self.database = database<br><br>    def load_data(<br>        self, query: str, parameters: Optional[Dict] = None<br>    ) -&gt; List[Document]:<br>        &quot;&quot;&quot;Run the Cypher with optional parameters and turn results into documents<br><br>        Args:<br>            query (str): Graph Cypher query string.<br>            parameters (Optional[Dict]): optional query parameters.<br><br>        Returns:<br>            List[Document]: A list of documents.<br><br>        
&quot;&quot;&quot;<br>        if parameters is None:<br>            parameters = {}<br><br>        records, summary, keys = self.client.execute_query(query, parameters, database_=self.database)<br><br>        documents = [Document(yaml.dump(entry.data())) for entry in records]<br><br>        return documents</pre><p>You’re now ready to use the data loader: simply import `GraphDBCypherReader` and follow the steps below.</p><p>If you wish to submit the loader to LlamaHub, the process is fairly straightforward. After adding an example to the readme that uses an always-on demo server with StackOverflow data, I was ready to create a <a href="https://github.com/emptycrown/llama-hub/pull/266">pull request</a>. After a short discussion, the PR was quickly merged.</p><p>Thanks a lot, Jerry, for the smooth experience.</p><p>Now let’s see how to use our two loaders.</p><h3>Using the Graph Database Loader</h3><p>The GraphDB Cypher loader connects to graph databases, which are specialized databases that store data not in tables but in entities (<em>Nodes</em>) and their <em>Relationships</em>. Because they are schema-free, you can store real-world knowledge without compromising on richness.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*rrH_pwyIEriXFtc_6yjTDA.png" /><figcaption>Image for “Network Graph” generated by Midjourney by Author</figcaption></figure><p>Relationships can also hold attributes, which can represent time, weights, costs, or whatever defines the concrete relationship. Any node can have as many or as few attributes or relationships as needed.</p><blockquote>To query a graph database you can use the <em>Cypher</em> query language, a pattern-based language that expresses those relationships in visual ASCII-art patterns.
You encircle nodes in parentheses <em>()</em> and draw relationships as arrows <em>--&gt;</em>, with additional constraints put in square brackets. Beyond that, Cypher provides many features known from SQL and also supports many graph operations, as well as handling data structures like nested documents, lists, and dicts.</blockquote><p>Let’s use a movie graph database and ask the LLM a question about <strong><em>common action movie plots</em></strong>.</p><p>Setting up the ServiceContext and the ChatGPTLLMPredictor is the same as before.</p><p>Then we get the GraphDBCypherReader and connect it to our database (with a small example movie graph from <a href="https://themoviedb.org">TheMovieDB</a>, used with permission).</p><pre>GraphDBCypherReader = download_loader(&#39;GraphDBCypherReader&#39;)<br><br>reader = GraphDBCypherReader(uri = &quot;neo4j+s://demo.neo4jlabs.com&quot;, \<br>    username = &quot;recommendations&quot;, password = &quot;recommendations&quot;, database = &quot;recommendations&quot;)</pre><p>Then we define our query to the graph database with a year parameter that allows us to pick more recent movies.
When loading the data, each row of results should turn into one Document where the text property of the document is the YAML representation of the row.</p><pre>query = &quot;&quot;&quot;<br>    MATCH (m:Movie)-[rel:ACTED_IN|DIRECTED|IN_GENRE]-(other)<br>    WHERE $year &lt; m.year and m.imdbRating &gt; $rating<br>    WITH m, type(rel) as relation, collect(other.name) as names<br>    RETURN m.title as title, m.year as year, m.plot as plot, relation, names<br>    ORDER BY m.year ASC<br>&quot;&quot;&quot;<br><br>documents = reader.load_data(query, parameters = {&quot;year&quot;:1990,&quot;rating&quot;:8})<br>index = GPTVectorStoreIndex.from_documents(documents, service_context=service_context)<br><br>print(len(documents))<br>print(documents[0:5])</pre><p>The output will look similar to the following:</p><pre>829<br>[Document(text=&#39;names:\n- Saifei He\n- Li Gong\n- Jingwu Ma\n- Cuifen Cao\nplot: A young woman becomes the fourth wife of a wealthy lord, and must learn to live\n  with the strict rules and tensions within the household.\nrelation: ACTED_IN\ntitle: Raise the Red Lantern (Da hong deng long gao gao gua)\nyear: 1991\n&#39;, doc_id=&#39;782d9a63-251b-4bb8-aa3d-5d8f6d1fb5d2&#39;, embedding=None, doc_hash=&#39;f9fd966bc5f2234e94d09efebd3be008db8c891f8666c1a364abf7812f5d7a1c&#39;, extra_info=None), Document(text=&#39;names:\n- Yimou Zhang\nplot: A young woman becomes the fourth wife of a wealthy lord, and must learn to live\n  with the strict rules and tensions within the household.\nrelation: DIRECTED\ntitle: Raise the Red Lantern (Da hong deng long gao gao gua)\nyear: 1991\n&#39;, doc_id=&#39;2e13caf6-b9cf-4263-a264-7121bc77d1ee&#39;, embedding=None, doc_hash=&#39;e1f340ed1fac2f1b8d6076cfc2c9e9cb0109d5d11e5dcdbf3a467332f5995cb1&#39;, extra_info=None), ...]</pre><p>Now we can use our index to run a LLM query to answer the questions we wanted to pose.</p><pre>queryEngine= index.as_query_engine()<br><br>queryEngine.query(&quot;&quot;&quot;<br>What are the most 
common plots in action movies?<br>&quot;&quot;&quot;)</pre><p>The answer shows that the LLM can utilize the inputs, understands the genre &quot;action movies&quot; and can summarize their plots. Here is its answer.</p><blockquote>Based on the given context information, it appears that the most common plots in action movies are heists and battles against controlling forces. However, it is important to note that this conclusion is based on a limited sample size and may not be representative of all action movies.</blockquote><pre>Response(response=&#39;Based on the given context information, it appears that the most common plots in action movies are heists and battles against controlling forces. However, it is important to note that this conclusion is based on a limited sample size and may not be representative of all action movies.&#39;,<br><br><br>source_nodes=[NodeWithScore(node=Node(text=&#39;names:\n- Action\n- Crime\n- Thriller\nplot: A group of professional bank robbers start to feel the heat from police when\n  they unknowingly leave a clue at their latest heist.\nrelation: IN_GENRE\ntitle: Heat\nyear: 1995\n&#39;, doc_id=&#39;bb117618-1cce-4cec-bd9b-8645ab0b50a3&#39;, embedding=None, doc_hash=&#39;4d493a9f33eb7a1c071756f61e1975ae5c313ecd42243f81a8827919a618468b&#39;, extra_info=None, node_info={&#39;start&#39;: 0, &#39;end&#39;: 215, &#39;_node_type&#39;: &lt;NodeType.TEXT: &#39;1&#39;&gt;}, relationships={&lt;DocumentRelationship.SOURCE: &#39;1&#39;&gt;: &#39;dbfffdae-d88c-49e2-9d6b-83dad427a3f3&#39;}), score=0.8247381316731472), NodeWithScore(node=Node(text=&#39;names:\n- Thriller\n- Sci-Fi\n- Action\nplot: A computer hacker learns from mysterious rebels about the true nature of his\n  reality and his role in the war against its controllers.\nrelation: IN_GENRE\ntitle: Matrix, The\nyear: 1999\n&#39;, doc_id=&#39;c4893c61-32ee-4d05-b559-1f65a5197e5e&#39;, embedding=None, doc_hash=&#39;0b6a080bf712548099c5c8c1b033884a38742c73dc23d420ac2e677e7ece82f4&#39;, 
extra_info=None, node_info={&#39;start&#39;: 0, &#39;end&#39;: 227, &#39;_node_type&#39;: &lt;NodeType.TEXT: &#39;1&#39;&gt;}, relationships={&lt;DocumentRelationship.SOURCE: &#39;1&#39;&gt;: &#39;6c8dea11-1371-4f5a-a1a1-7f517f027008&#39;}), score=0.8220633045996049)], extra_info={&#39;bb117618-1cce-4cec-bd9b-8645ab0b50a3&#39;: None, &#39;c4893c61-32ee-4d05-b559-1f65a5197e5e&#39;: None})</pre><h3>Using the GraphQL Loader</h3><p>The GraphQL loader is similarly easy to use.</p><p><a href="https://graphql.org">GraphQL</a> is not a database query language but an API query language based on a strict schema expressed in &quot;type definitions&quot;. There you express your entities and their attributes (fields), both for scalar datatypes and for object datatypes pointing to other entities.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*JBLtKqmxBZjxiLxgmgct4w.png" /><figcaption>What is GraphQL from GraphQL.org, Screenshot by Author</figcaption></figure><p>GraphQL itself is a tree-based query language that expresses a nested structure of data that you want to fetch, starting from a root query. The fields of every entity returned from that query can be selected, and for object fields you can further select fields from the referenced entity, and so on, almost ad infinitum (API limits apply).</p><p>There are a number of GraphQL libraries, most notably the JavaScript reference implementation, but also gql for Python, and integrations like Hasura, Prisma, or the <a href="https://neo4j.com/product/graphql-library/">Neo4j-GraphQL-Library</a>. Several larger projects, including GitHub, Spotify, and Twitter, now provide GraphQL APIs.</p><p>The demo is similar to our first one. We use a public GraphQL endpoint (<a href="https://countries.trevorblades.com/">https://countries.trevorblades.com/</a>) that provides a structure of continent→country→capital.
(<a href="https://github.com/trevorblades/countries">Licensed under MIT</a>)</p><p>A subset of the type-definition is here.</p><pre>type Query {<br>    continent(code: ID!): Continent<br>    continents(filter: ContinentFilterInput = {}): [Continent!]!<br>    countries(filter: CountryFilterInput = {}): [Country!]!<br>    country(code: ID!): Country<br>    language(code: ID!): Language<br>    languages(filter: LanguageFilterInput = {}): [Language!]!<br>}<br><br>type Continent {<br>    code: ID!<br>    countries: [Country!]!<br>    name: String!<br>}<br><br>type Country {<br>    awsRegion: String!<br>    capital: String<br>    code: ID!<br>    continent: Continent!<br>    currencies: [String!]!<br>    currency: String<br>    emoji: String!<br>    emojiU: String!<br>    languages: [Language!]!<br>    name(lang: String): String!<br>    native: String!<br>    phone: String!<br>    phones: [String!]!<br>    states: [State!]!<br>    subdivisions: [Subdivision!]!<br>}<br>...</pre><p>In our demo, we again define the ServiceContext with the ChatGPTLLMPredictor as before. Then we get the GraphQLReader loader and point it to the URL of the endpoint. You can also provide additional HTTP-Headers, e.g. 
for authentication.</p><pre>GraphQLReader = download_loader(&#39;GraphQLReader&#39;)<br>reader = GraphQLReader(uri = &quot;https://countries.trevorblades.com/&quot;, headers = {})</pre><pre>query = &quot;&quot;&quot;<br>query getContinents {<br>  continents {<br>    name<br>    countries {<br>      name<br>      capital<br>    }<br>  }<br>}<br>&quot;&quot;&quot;<br>documents = reader.load_data(query, variables = {})<br>print(len(documents))<br>print(documents)</pre><p>We see that it finds 7 continents with their countries and capitals; each of the root results (a continent) is turned into a document.</p><pre>7<br>[Document(text=&#39;countries:\n- capital: Luanda\n  name: Angola\n- capital: Ouagadougou\n  name: Burkina Faso\n- capital: Bujumbura\n  name: Burundi\n- capital: Porto-Novo\n  name: Benin\n- capital: Gaborone\n  name: Botswana\n- capital: Kinshasa\n  name: Democratic Republic of the Congo\n- capital: Bangui\n  name: Central African Republic\n....&#39;,doc_id=&#39;b82fec36-5e82-4246-b7ab-f590bf6741ab&#39;, embedding=None, doc_hash=&#39;a4caa760423d6ca861b9332f386add3c449f1683168391ae10f7f73a691a2240&#39;, extra_info=None)]</pre><p>Again we stress the LLM only a little bit by asking it in German, &quot;Which capitals are in North America&quot;.</p><pre>index = GPTVectorStoreIndex.from_documents(documents, service_context=service_context)<br>queryEngine = index.as_query_engine()<br><br>response = queryEngine.query(&quot;&quot;&quot;<br>Question: Welche Hauptstädte liegen in Nordamerika?<br>Answer:<br>&quot;&quot;&quot;)<br><br>response.response</pre><p>I was surprised, as I had only expected a handful of countries and cities. But we get 27 countries in North America. This shows how our perception is skewed by a Western worldview.</p><pre>Die Hauptstädte, die in Nordamerika liegen, sind Ottawa, San Jos\xE9, Havana, Willemstad, Roseau, Santo Domingo, St.
George&#39;s, Nuuk, Guatemala City, Tegucigalpa, Port-au-Prince, Kingston, Basseterre, George Town, Castries, Marigot, Fort-de-France, Plymouth, Mexico City, Managua, Panama City, Saint-Pierre, San Juan, San Salvador, Philipsburg, Cockburn Town, Port of Spain, Washington D.C., Kingstown und Road Town.</pre><p>We could also flip the GraphQL query around and then get 250 countries with their respective capitals and continents.</p><pre>query = &quot;&quot;&quot;<br>query getCountries {<br>  countries {<br>    name<br>    capital<br>    continent {<br>        name<br>    }<br>  }<br>}<br>&quot;&quot;&quot;<br>documents = reader.load_data(query, variables = {})<br>print(len(documents))<br>print(documents)</pre><p>Both document lists should work equally well, but let’s see.</p><p>This time the answer from the LLM was much more limited. I’m not sure if that was because the index fed the LLM fewer documents to pick from.</p><pre>index = GPTVectorStoreIndex.from_documents(documents, service_context=service_context)<br>queryEngine= index.as_query_engine()<br><br>response = queryEngine.query(&quot;&quot;&quot;<br>Question: Which capitals are in North America?<br>Answer:<br>&quot;&quot;&quot;)<br><br>response.response</pre><pre>Washington D.C. and Mexico City are in North America.</pre><h3>Conclusion</h3><p>It was really smooth to add new data loaders to LlamaHub, thanks a lot to <a href="https://medium.com/u/e76da1c45ef7">Jerry Liu</a> for making it so easy. 
Please let me know what you’re doing with these loaders and if you have any feedback.</p><p>If I find time in the next weeks I also want to look into the KnowledgeGraphIndex and see if my graph database loader can nicely populate that one.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=bcaecec262d7" width="1" height="1" alt=""><hr><p><a href="https://medium.com/llamaindex-blog/enriching-llamaindex-models-from-graphql-and-graph-databases-bcaecec262d7">Enriching LlamaIndex Models from GraphQL and Graph Databases</a> was originally published in <a href="https://medium.com/llamaindex-blog">LlamaIndex Blog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Analyzing Software Dependencies With deps.dev — Discover AuraDB Free (Week 49)]]></title>
            <link>https://medium.com/neo4j/week-49-analyzing-software-dependencies-with-deps-dev-47ac8edf9c1a?source=rss-3865848842f9------2</link>
            <guid isPermaLink="false">https://medium.com/p/47ac8edf9c1a</guid>
            <category><![CDATA[dependency-management]]></category>
            <category><![CDATA[package-management]]></category>
            <category><![CDATA[data-visualization]]></category>
            <category><![CDATA[software-analysis]]></category>
            <category><![CDATA[graph-database]]></category>
            <dc:creator><![CDATA[Michael Hunger]]></dc:creator>
            <pubDate>Thu, 25 May 2023 20:39:01 GMT</pubDate>
            <atom:updated>2023-05-26T15:37:25.013Z</atom:updated>
            <content:encoded><![CDATA[<h3>Analyzing Software Dependencies With deps.dev — Discover AuraDB Free (Week 49)</h3><h4>This week we looked at software dependencies, an important use case within software analytics for graph databases. You can understand not only which libraries your software uses, both directly and indirectly, but also how you’re affected by software vulnerabilities.</h4><blockquote>If you missed it: the <a href="https://neo4j.com/nodes-2023">call for papers for our online developer conference NODES 2023</a> is open till June 30th, but if you submit early you might be selected as a featured speaker.</blockquote><figure><a href="https://neo4j.com/nodes-2023"><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*X1NCLQNEPcEswMJN47l0gQ.png" /></a></figure><p>Two years ago, Google launched <a href="https://deps.dev">https://deps.dev</a>, an open source package dependency database that makes package information from these systems available:</p><ul><li>npm (JavaScript)</li><li>PyPI (Python)</li><li>Maven (Java/JVM)</li><li>Cargo (Rust)</li><li>NuGet (.NET)</li><li>Go</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*0yF_x7u_XOpXCejC_K8nug.png" /></figure><p>It even talks about dependency graphs in its &quot;How it works&quot; section.</p><blockquote>The service repeatedly examines sites such as github.com, npmjs.com, and pkg.go.dev to find up-to-date information about open source software packages. Using that information it builds for each package the full dependency graph from scratch—not just from package lock files—connecting it to the packages it depends on and to those that depend on it. And then does it all again to keep the information fresh.
This transitive dependency graph allows problems in any package to be made visible to the owners and users of any software they affect.</blockquote><p>If you’d rather watch the recording of the livestream, you can find it here:</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FQOu5VAsCAoA%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DQOu5VAsCAoA&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FQOu5VAsCAoA%2Fhqdefault.jpg&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/23b7ea3d09f3f7d8477d7ada1338efb2/href">https://medium.com/media/23b7ea3d09f3f7d8477d7ada1338efb2/href</a></iframe><p>Back then I threw together a quick <a href="https://gist.github.com/jexp/9541bbfc17e0383d1e55388a7d7e3162">script to load the data</a> via the unofficial REST API that powered the site.</p><p>And tweeted about it:</p><h3>Michael Hunger 🇪🇺 🇺🇦 @mesirii@chaos.social on Twitter: &quot;Neat, REST API, let&#39;s do @Neo4j call apoc.load.json(&quot;https://t.co/2CEy0rS9Ro&quot;) yield value as vmerge (p:Package {name:v.package .name, version:v.version})with * unwind v.dependencies as dmerge (o:Package {name:d.package .name, version:d.version})merge (p)-[:DEPENDS_ON]-&gt;(o) https://t.co/XHdMJxCT6V pic.twitter.com/r4XncqPTXi / Twitter&quot;</h3><p>But meanwhile, they have published <strong>an API</strong> that we can use to access the data.
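As a small convenience for the examples that follow, here is a hypothetical helper (the function and its name are mine, but the URL shapes match the endpoints used in this post) for composing those v3alpha URLs:

```python
# Hypothetical helper: compose deps.dev v3alpha endpoint URLs for a
# package version, optionally the ":dependencies" variant that returns
# the dependency graph. The URL shapes mirror the examples in this post.
BASE = "https://api.deps.dev/v3alpha"

def version_url(system, name, version, dependencies=False):
    url = f"{BASE}/systems/{system}/packages/{name}/versions/{version}"
    return url + ":dependencies" if dependencies else url

print(version_url("npm", "react", "18.2.0"))
print(version_url("pypi", "tensorflow", "2.12.0", dependencies=True))
```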
The <a href="https://docs.deps.dev/api/v3alpha/">API docs</a> are minimal, but good enough for our purposes.</p><p>The endpoint for getting basic information about a package is straightforward but doesn’t give us a lot of data; more interesting is the <a href="https://docs.deps.dev/api/v3alpha/#getversion">information per version</a>, which also lists licenses, security vulnerabilities, and links (homepage, repo, issue tracker).</p><p>Here is the example for React (no security vulnerabilities):</p><p><a href="https://api.deps.dev/v3alpha/systems/npm/packages/react/versions/18.2.0">https://api.deps.dev/v3alpha/systems/npm/packages/react/versions/18.2.0</a></p><pre>{<br>    &quot;versionKey&quot;: {<br>        &quot;system&quot;: &quot;NPM&quot;,<br>        &quot;name&quot;: &quot;react&quot;,<br>        &quot;version&quot;: &quot;18.2.0&quot;<br>    },<br>    &quot;isDefault&quot;: true,<br>    &quot;licenses&quot;: [<br>        &quot;MIT&quot;<br>    ],<br>    &quot;advisoryKeys&quot;: [],<br>    &quot;links&quot;: [<br>        {<br>            &quot;label&quot;: &quot;HOMEPAGE&quot;,<br>            &quot;url&quot;: &quot;https://reactjs.org/&quot;<br>        },<br>        {<br>            &quot;label&quot;: &quot;ISSUE_TRACKER&quot;,<br>            &quot;url&quot;: &quot;https://github.com/facebook/react/issues&quot;<br>        },<br>        {<br>            &quot;label&quot;: &quot;ORIGIN&quot;,<br>            &quot;url&quot;: &quot;https://registry.npmjs.org/react/18.2.0&quot;<br>        },<br>        {<br>            &quot;label&quot;: &quot;SOURCE_REPO&quot;,<br>            &quot;url&quot;: &quot;git+https://github.com/facebook/react.git&quot;<br>        }<br>    ]<br>}</pre><p>But we’re more interested in the graph, so let’s go directly for the package dependencies.</p><h3>Dependencies of a Package</h3><p>You can <a href="https://deps.dev/pypi/tensorflow/2.12.0/dependencies">find the dependencies of a package (like TensorFlow) in the UI</a>.</p><figure><img alt=""
src="https://cdn-images-1.medium.com/max/1024/1*zU0GIU6_IUXXkCU1Wtv58w.png" /><figcaption>Dependencies of Tensorflow</figcaption></figure><p>The <a href="https://docs.deps.dev/api/v3alpha/#getdependencies">API docs are here</a>.</p><p>Loading the data for the TensorFlow package via the API uses the system, name, and version of the package in the URL.</p><p><a href="https://api.deps.dev/v3alpha/systems/pypi/packages/tensorflow/versions/2.12.0:dependencies">https://api.deps.dev/v3alpha/systems/pypi/packages/tensorflow/versions/2.12.0:dependencies</a></p><p>It responds with JSON that is already in a graph format:</p><pre>{<br>&quot;nodes&quot;: [<br>{<br>    &quot;versionKey&quot;: {<br>    &quot;system&quot;: &quot;PYPI&quot;,<br>    &quot;name&quot;: &quot;tensorflow&quot;,<br>    &quot;version&quot;: &quot;2.12.0&quot;<br>    },<br>    &quot;bundled&quot;: false,<br>    &quot;relation&quot;: &quot;SELF&quot;,<br>    &quot;errors&quot;: []<br>},<br>{<br>    &quot;versionKey&quot;: {<br>    &quot;system&quot;: &quot;PYPI&quot;,<br>    &quot;name&quot;: &quot;absl-py&quot;,<br>    &quot;version&quot;: &quot;1.4.0&quot;<br>    },<br>    &quot;bundled&quot;: false,<br>    &quot;relation&quot;: &quot;DIRECT&quot;,<br>    &quot;errors&quot;: []<br>},...],<br>&quot;edges&quot;: [<br>{<br>    &quot;fromNode&quot;: 0,<br>    &quot;toNode&quot;: 1,<br>    &quot;requirement&quot;: &quot;&gt;=1.0.0&quot;<br>},<br>{<br>    &quot;fromNode&quot;: 0,<br>    &quot;toNode&quot;: 2,<br>    &quot;requirement&quot;: &quot;&gt;=1.6.0&quot;<br>},<br>{<br>    &quot;fromNode&quot;: 0,<br>    &quot;toNode&quot;: 6,<br>    &quot;requirement&quot;: &quot;&gt;=2.0&quot;<br>}, ... 
]}</pre><p>The response contains data in a graph format: first a list of nodes, then a list of edges with fromNode and toNode (indexes into the nodes array) and a semantic version requirement.</p><p>To load the data from the API we use apoc.load.json, which provides the response as a nested Cypher structure.</p><pre>call apoc.load.json(&quot;https://api.deps.dev/v3alpha/systems/pypi/packages/tensorflow/versions/2.12.0:dependencies&quot;)<br>yield value as r</pre><p>We can now import the data by creating the nodes first and then collecting them into an array again to provide the index lookup for the edges. We encode the &quot;system&quot; (here &quot;pypi&quot;) as an additional label :PyPi on our :Package nodes, which also carries the uniqueness constraint on name.</p><pre>create constraint package_pypi if not exists for (p:PyPi) require (p.name) is unique</pre><p>In a real system we would create separate version nodes for each package and link to those; here, for simplicity, we stick with the :Package nodes only.</p><p>We then iterate over the nodes with UNWIND within a CALL subquery to create the nodes. 
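</p><p>The index-based edge scheme can also be resolved in plain Python first, to sanity-check a response before writing any Cypher. A small hypothetical helper, with field names taken from the response above:</p>

```python
def resolve_edges(graph):
    """Turn deps.dev's index-based edges into (from, to, requirement) triples.

    `graph` is the parsed JSON response with "nodes" and "edges" lists;
    each edge references nodes by their position in the nodes array.
    """
    names = [node["versionKey"]["name"] for node in graph["nodes"]]
    triples = []
    for edge in graph["edges"]:
        triples.append((names[edge["fromNode"]],
                        names[edge["toNode"]],
                        edge["requirement"]))
    return triples
```

<p>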
And then do a second subquery for the relationships.</p><pre>with &quot;pypi&quot; as system, &quot;tensorflow&quot; as name, &quot;2.12.0&quot; as version<br><br>call apoc.load.json(&quot;https://api.deps.dev/v3alpha/systems/&quot;+system+&quot;/packages/&quot;<br>                    +name+&quot;/versions/&quot;+version+&quot;:dependencies&quot;)<br>yield value as r<br>// create nodes<br>call { with r<br>        unwind r.nodes as package<br>        merge (p:Package:PyPi {name:package.versionKey.name}) on create set p.version = package.versionKey.version<br>        return collect(p) as packages<br>}<br>// create relationships by linking nodes<br>call { with r, packages<br>        unwind r.edges as edge<br>        with packages[edge.fromNode] as from, packages[edge.toNode] as to, edge<br>        merge (from)-[rel:DEPENDS_ON]-&gt;(to) ON CREATE SET rel.requirement = edge.requirement<br>        return count(*) as numRels<br>}<br>return size(packages) as numPackages, numRels</pre><p>Now we can visualize the data in the Query UI by running <br>MATCH path=(:PyPi {name:&quot;tensorflow&quot;})-[:DEPENDS_ON*]-&gt;() RETURN path</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*UM7U0G8OEgPLY2oxSwqdNg.png" /></figure><p>Or we can head over to &quot;Explore&quot; and visualize it in the hierarchical layout and also find the shortest paths between packages visually.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*xiPQji5rsv3oYYtxGuCMbw.png" /><figcaption>Explore dependencies with the hierarchical layout</figcaption></figure><p>We can also use the packages that we already have imported into our graph to fetch their dependencies.</p><p>To achieve that we replace the hardcoded initial data for package and version with data from the graph. 
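</p><p>This repeated fetch-and-mark step amounts to a visited-set traversal. As a rough Python sketch, with a hypothetical fetch_deps callback standing in for the API call:</p>

```python
def crawl(start, fetch_deps):
    """Import reachable packages, fetching each one at most once.

    fetch_deps(name) returns the direct dependency names of a package;
    the `imported` set plays the role of a marker on already-loaded nodes.
    """
    imported = set()
    frontier = [start]
    while frontier:
        name = frontier.pop()
        if name in imported:
            continue
        imported.add(name)
        for dep in fetch_deps(name):
            if dep not in imported:
                frontier.append(dep)
    return imported
```

<p>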
We also set an additional property (or label) to indicate which packages have already been loaded.</p><pre>match (root:Package:PyPi) where root.imported is null<br>set root.imported = true<br>with &quot;pypi&quot; as system, root.name as name, root.version as version<br>call apoc.load.json(&quot;https://api.deps.dev/v3alpha/systems/&quot;+system+&quot;/packages/&quot;<br>                    +name+&quot;/versions/&quot;+version+&quot;:dependencies&quot;)<br>yield value as r<br>call { with r<br>        unwind r.nodes as package<br>        merge (p:Package:PyPi {name:package.versionKey.name}) on create set p.version = package.versionKey.version<br>        return collect(p) as packages<br>}<br>call { with r, packages<br>        unwind r.edges as edge<br>        with packages[edge.fromNode] as from, packages[edge.toNode] as to, edge<br>        merge (from)-[rel:DEPENDS_ON]-&gt;(to) ON CREATE SET rel.requirement = edge.requirement<br>        return count(*) as numRels<br>}<br>return size(packages) as numPackages, numRels</pre><h3>Loading Dependents</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*i7-xPKfkaG730RAMR4xS8A.png" /></figure><p>The UI also shows dependents (i.e. packages that use the current package), which we could infer inversely from our imported data too. Unfortunately, there is no API call for this, so we need to get the REST API call for the UI, which is the following:</p><p><a href="https://deps.dev/_/s/pypi/p/tensorflow/v/2.12.0/dependents">https://deps.dev/_/s/pypi/p/tensorflow/v/2.12.0/dependents</a></p><p>It has a different response format and only lists 100 results, but that’s better than nothing for demonstration purposes. 
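</p><p>If you want to inspect that undocumented response in Python before importing it, here is a hedged sketch; the directSample field names mirror what the Cypher below uses, so treat the shape as an assumption:</p>

```python
def direct_dependents(doc):
    """Extract (name, version) pairs from the UI endpoint's directSample list.

    The endpoint is undocumented, so this response shape is an assumption
    based on the fields the import statement reads.
    """
    pairs = []
    for entry in doc.get("directSample", []):
        pairs.append((entry["package"]["name"], entry.get("version")))
    return pairs
```

<p>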
We can pick the directSample list of entries and connect them to our root package that we start with.</p><pre>with &quot;pypi&quot; as system, &quot;tensorflow&quot; as name, &quot;2.12.0&quot; as version<br>merge (root:PyPi { name:name}) on create set root.version = version<br>with *<br>call apoc.load.json(&quot;https://deps.dev/_/s/&quot;+system+&quot;/p/&quot;+name+&quot;/v/&quot;+version+&quot;/dependents&quot;)<br>yield value as r<br><br><br>unwind r.directSample as entry<br>merge (dep:PyPi:Package {name:entry.package.name})<br>on create set dep.version = entry.version<br>merge (dep)-[:DEPENDS_ON]-&gt;(root)</pre><p>Question from the viewers — Eshwar: How do I fix relationships that I imported wrongly?</p><p>Answer:</p><ul><li>find the relationship to delete or update properties</li><li>e.g. MATCH ()-[rel:SOME_TYPE]-&gt;() DELETE rel</li><li>or use <a href="https://neo4j.com/docs/apoc/5/graph-refactoring/set-relationship-type/">apoc refactor procedures</a> to rename, change direction, redirect</li><li>see also call apoc.help(&quot;refactor&quot;)</li></ul><p>That was it for today. Happy graphing!</p><p>Don’t forget to share the episode or the <a href="https://neo4j.com/video/discover-aura-free-with-fun-datasets/">&quot;Discover AuraDB Free with Fun Datasets&quot; series</a> with your graph-curious friends and colleagues.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=47ac8edf9c1a" width="1" height="1" alt=""><hr><p><a href="https://medium.com/neo4j/week-49-analyzing-software-dependencies-with-deps-dev-47ac8edf9c1a">Analyzing Software Dependencies With deps.dev — Discover AuraDB Free (Week 49)</a> was originally published in <a href="https://medium.com/neo4j">Neo4j Developer Blog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Bluesky User Interaction Graph — Discover AuraDB Free (Week 47)]]></title>
            <link>https://medium.com/neo4j/bluesky-user-interaction-graph-discover-auradb-free-week-47-32d636054dec?source=rss-3865848842f9------2</link>
            <guid isPermaLink="false">https://medium.com/p/32d636054dec</guid>
            <category><![CDATA[data-import-and-export]]></category>
            <category><![CDATA[blue-sky]]></category>
            <category><![CDATA[graph-algorithms]]></category>
            <category><![CDATA[neo4j]]></category>
            <category><![CDATA[social-network]]></category>
            <dc:creator><![CDATA[Michael Hunger]]></dc:creator>
            <pubDate>Tue, 09 May 2023 18:04:47 GMT</pubDate>
            <atom:updated>2023-05-09T18:04:47.252Z</atom:updated>
            <content:encoded><![CDATA[<h3>Bluesky User Interaction Graph — Discover AuraDB Free (Week 47)</h3><p>As social networks are really nice to import and explore in a graph database, this is what I did this week. If you’d rather watch the recording of the stream, you can do so below. Otherwise, this blog goes into the details for you.</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FUuGrblKXnGE%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DUuGrblKXnGE&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FUuGrblKXnGE%2Fhqdefault.jpg&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/7b8793cee012111b9b37d25ca8542264/href">https://medium.com/media/7b8793cee012111b9b37d25ca8542264/href</a></iframe><p><a href="https://blueskyweb.xyz/">Bluesky</a>, as the reference social media application for the <a href="https://atproto.com/guides/overview">Authenticated Transfer Protocol (ATP)</a>, is currently seeing massive hype. There are almost 2M people on the waitlist, while the current user count is 65k with roughly 2M posts so far.</p><p>A number of celebrities, journalists, and tech folks have already made it there, but what is most compelling is that a larger number of people from usually underrepresented minorities like BIPOC and gay and trans folks are active there.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*amxGFQIavOPJ5kIH4v8mfA.png" /><figcaption>Bluesky Timeline</figcaption></figure><p>Bluesky currently feels like the early Twitter of 2007, with lots of fun and jokes and everyone feeling comfortable just sharing tidbits. 
It’s refreshing and cool to be able to ping Jake Tapper, AOC (who also did an AMA), or James Gunn and actually get an answer.</p><p>So far I’ve seen no harassment or bad behavior (but that might also just be my bubble). But there are many conversations about scalable moderation and user protection with varying points of view.</p><p>The app itself is really the bare minimum, but it’s also just meant to provide a reference implementation to test the protocol at a larger scale than before. Expansion happens at a slow pace; most users get one invite code every two weeks, with some exceptions.</p><p>The <a href="https://atproto.com/guides/overview">AT-protocol</a> itself is quite interesting, basing everything on distributed IDs (DID) to identify users. Data (posts, likes, media) is stored cryptographically signed in repositories that are hosted by federated “personal data servers” (PDS).</p><p><a href="https://atproto.com/guides/overview">Protocol Overview | AT Protocol</a></p><p>The AT protocol allows users to use their own domains to authoritatively declare who they are (like washingtonpost.com) with an _atproto DNS record.</p><p>You can get the record from DNS yourself and then query the Placeholder Service (PLC) for more details on that ID; here you can also see the personal data server (PDS) in my case, and my domain as an alias.</p><pre>dig TXT _atproto.mesirii.de</pre><pre>; &lt;&lt;&gt;&gt; DiG 9.10.6 &lt;&lt;&gt;&gt; TXT _atproto.mesirii.de<br>...<br>;; ANSWER SECTION:<br>_atproto.mesirii.de.	
3276	IN	TXT	&quot;did=did:plc:gteyzzitmjhuezvsi6nyrszj&quot;<br></pre><pre>curl -s <a href="https://plc.directory/did:plc:gteyzzitmjhuezvsi6nyrszj/data">https://plc.directory/did:plc:gteyzzitmjhuezvsi6nyrszj/data</a> | jq .<br>{<br>  &quot;did&quot;: &quot;did:plc:gteyzzitmjhuezvsi6nyrszj&quot;,<br>  &quot;verificationMethods&quot;: {<br>    &quot;atproto&quot;: &quot;did:key:zQ3shXjHeiBuRCKmM36cuYnm7YEMzhGnCmCyW92sRJ9pribSF&quot;<br>  },<br>  &quot;rotationKeys&quot;: [<br>    &quot;did:key:zQ3shhCGUqDKjStzuDxPkTxN6ujddP4RkEKJJouJGRRkaLGbg&quot;,<br>    &quot;did:key:zQ3shpKnbdPx3g3CmPf5cRVTPe1HtSwVn5ish3wSnDPQCbLJK&quot;<br>  ],<br>  &quot;alsoKnownAs&quot;: [<br>    &quot;at://mesirii.de&quot;<br>  ],<br>  &quot;services&quot;: {<br>    &quot;atproto_pds&quot;: {<br>      &quot;type&quot;: &quot;AtprotoPersonalDataServer&quot;,<br>      &quot;endpoint&quot;: &quot;https://bsky.social&quot;<br>    }<br>  }<br>}</pre><p>There is more detail in the protocol docs, a number of clients in <a href="https://github.com/bluesky-social/indigo">Go</a>, Rust, Python, Javascript, and other languages are available and about 1000 folks are already on the <a href="https://discord.com/invite/3srmDsHSZJ">Bluesky-Dev-Discord</a>. The protocol uses HTTPS and XRPC behind the scenes for communication, with a “Lexicon” based schema approach to ensure correct interpretation of data.</p><p>Alternative apps like <a href="https://github.com/mozzius/graysky">GraySky</a>, <a href="https://firesky.tv/">FireSky</a>, or <a href="https://skypulse.dvy.io/">SkyPulse</a> are already using the protocol, with more to come.</p><p>One way of getting the firehose of all posts on the platform is to register an event listener with the stream and then get “commit notifications” when new posts have been added to the data repositories. 
This works for new data but not for past events.</p><p>Bluesky user <a href="https://github.com/ericvolp12">Jaz</a> has been working for a while on a system that not only tracks and collects posts from the mainstream in a Postgres database, but also aggregates and visualizes it as an <a href="https://bsky.jazco.dev/">interaction graph</a> (with currently 20k users and 140k connections representing 630k interactions), which is really cool.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*clyQyy2I65E-C-w5DolqIQ.png" /><figcaption>Bluesky Interaction Graph by Jaz</figcaption></figure><p>I dived deeper into the topic in my <a href="https://www.youtube.com/watch?v=6BMCXFii4Po">“Coding with ChatGPT”</a> session last week, where we walked through their code in detail, played around with the APIs, and even integrated Neo4j a bit (more to come).</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2F6BMCXFii4Po%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3D6BMCXFii4Po&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2F6BMCXFii4Po%2Fhqdefault.jpg&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/bf4152ac00a0c266cd47a176060392e0/href">https://medium.com/media/bf4152ac00a0c266cd47a176060392e0/href</a></iframe><p>The interaction graph (at least 2 conversations between the users) is kept in memory and can be serialized, rendered, and enhanced with graph algorithms for clustering, centrality-based sizing, and pre-layouting. 
Jaz uses <a href="https://www.sigmajs.org/">sigma.js</a> to visualize the already laid-out graph in the front-end and add interactivity, like searching for your own handle.</p><p>But that goes too deep for today&#39;s “intro to graphs session”.</p><p>Here we just want to import the interaction graph from Jaz’s precomputed data, visualize and query it in Neo4j, and run some graph algorithms on the data for clustering and sizing, as Jaz did for pre-computation.</p><p>Fortunately, the data loaded into the web visualization is available as a JSON file with &quot;nodes&quot; and &quot;edges&quot; entries that allows us to quickly construct the graph.</p><p>To run the graph algorithms on our data, today we will provision an <a href="https://neo4j.com/aura">AuraDS (data science)</a> instance, which is optimized for this purpose and has the graph-data-science library installed.</p><p>You can also use the blank <a href="https://sandbox.neo4j.com">Neo4j Sandbox</a> or a local installation with <a href="https://neo4j.com/download">Neo4j Desktop</a> or <a href="https://hub.docker.com/_/neo4j">Docker</a>.</p><p>The minimal instance size is a bit overkill for our needs (it could easily manage 1,000 times more data), but it’s the easiest to use and set up. Initially, the estimator allows you to provide your data volume and the algorithms you want to use and gives you an instance size.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*UCrtZxhoMeHIaBDtOchOjg.png" /><figcaption>AuraDS instance estimation</figcaption></figure><p>While creating the instance, make sure to download the credentials file with the database URI, username, and password.</p><p>To import the data we open “Neo4j Workspace” with the “[Open]” button and provide the password.</p><p>Then we can visit the “Query” tab to get started with our quick import.</p><p>Our data model is really simple: we have User nodes that are connected by an INTERACTED relationship. 
Both of them have a number of attributes (key, label (handle), size, weight, color, community, x,y) that we can take from the JSON directly.</p><p>Let’s first create a constraint for User and key so that we can ensure uniqueness and can look the users up quickly by key to connect them.</p><pre>create constraint user_key if not exists for (u:User) require (u.key) is unique;</pre><p>Originally I had also created a constraint for the user name (aka label in the data), but there was a duplicate user that we need to merge first. More about that later.</p><p>Next, we can look at some entries from the file. We use the user-defined procedure <a href="https://neo4j.com/docs/apoc/5/import/load-json/#load-json-json-path">apoc.load.json</a> to load the file and a JSON-Path expression to grab the &quot;nodes&quot; entries.</p><pre>call apoc.load.json(&quot;https://bsky.jazco.dev/exported_graph_minified.json&quot;,&quot;$.nodes&quot;)<br>yield value as nv<br>return nv limit 5;</pre><p>This returns the structure in the file as Cypher constructs: nested maps (dicts) and lists (arrays).</p><pre>{<br>  attributes: {<br>    area: 9.51,<br>    size: 3.48,<br>    color: &quot;#7EC0EE&quot;,<br>    x: -824.56,<br>    y: -6222.31,<br>    label: &quot;maxberger.bsky.social&quot;,<br>    community: 1,<br>    key: 1<br>  },<br>  key: &quot;1&quot;<br>}</pre><p>So we see we can create the user with the key as id and set the other attributes, which is what we’re going to do.</p><blockquote><strong>Note:</strong> We’re going to use MERGE to make our operation idempotent (it’s a get-or-create), so we can re-run it as often as we want. And we’re using batches of 10k rows; not so important with 20k users, but think about all the millions to come.</blockquote><p>We’re also removing key from the attributes before adding them to the nodes, because it’s a different datatype (string vs. 
integer).</p><pre>call apoc.load.json(&quot;https://bsky.jazco.dev/exported_graph_minified.json&quot;,&quot;$.nodes&quot;)<br>yield value as nv<br>call { with nv<br>   merge (n:User {key:nv.key})<br>   on create set n += apoc.map.clean(nv.attributes,[&quot;key&quot;],[])<br>} in transactions of 10000 rows;</pre><p>This gives us roughly 19k lonely nodes in our database.</p><p>We can style and position them based on their attributes, but without relationships, this is all <strong><em>boring</em></strong>.</p><p>So let’s add them, again by first looking at the data and then using it.</p><pre>call apoc.load.json(&quot;https://bsky.jazco.dev/exported_graph_minified.json&quot;,&quot;$.edges&quot;) yield value as ev<br>return ev limit 5;</pre><p>Which returns:</p><pre>{<br>  attributes: {<br>    size: 0.2,<br>    weight: 2,<br>    ogWeight: 2<br>  },<br>  source: &quot;723&quot;,<br>  key: &quot;geid_35_0&quot;,<br>  target: &quot;15220&quot;<br>}</pre><p>So with this data for each edge, we can look up the start and end nodes and then connect them with an INTERACTS relationship and set the attributes from the record on the relationship.</p><p>In our MERGE operation here, we leave off the direction as it combines bi-directional interactions.</p><pre>call apoc.load.json(&quot;https://bsky.jazco.dev/exported_graph_minified.json&quot;,&quot;$.edges&quot;) yield value as ev<br>call { with ev<br>    match (source:User {key:ev.source}),(target:User {key:ev.target})<br>    merge (source)-[r:INTERACTED {key:ev.key}]-(target)<br>    set r += ev.attributes<br>} in transactions of 20000 rows;</pre><p>With the data in the graph, we can now start exploring it, head over to the explore tab, and follow along.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*1nWZ-XVC3ZzBplHGLLhNtg.png" /><figcaption>Styled Bluesky interaction graph in Explore</figcaption></figure><p>Instead of describing all I did with screenshots, why don’t you just watch the 5 minutes section of 
the stream below (from 25:30)?</p><p>At this point, we have:</p><ul><li>explored the interaction graph visually, expanding relationships and loading additional data,</li><li>styled nodes and relationships based on community, size, and weight attributes,</li><li>run graph algorithms (Louvain for clusters and PageRank for size/importance) and used the results for styling,</li><li>laid out the graph both naturally with force layout as well as with the pre-computed x,y-values from Jaz with coordinate layout, and</li><li>used filtering to dismiss the low-weight (2,3) relationships.</li></ul><p><a href="https://www.youtube.com/live/UuGrblKXnGE?feature=share&amp;t=1529">Discovering Neo4j AuraDB Free with Michael and Alexander - Importing Bluesky User Interactions</a></p><p>More recently, Jaz has been working on rendering <strong>large threads</strong> (like the Hellthread with more than 10000 posts) visually.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*mo2Tk5s10jjksxFvk9lF9w.png" /><figcaption>Thread Visualization by Jaz for the Hellthread</figcaption></figure><p>As those thread visualizations also contain posts, I wanted to import them too.</p><p>So let’s grab the data and have a quick look.</p><pre>call apoc.load.json(&quot;https://bsky-search.jazco.io/thread?authorHandle=ihatenfts.hellthread.vet&amp;postID=3jv6terdbop2z&amp;layout=true&quot;,&quot;$&quot;) yield value as post<br>RETURN count(*);<br>// 10939<br><br>call apoc.load.json(&quot;https://bsky-search.jazco.io/thread?authorHandle=ihatenfts.hellthread.vet&amp;postID=3jv6terdbop2z&amp;layout=true&quot;,&quot;$&quot;) yield value as post<br>RETURN post limit 5;</pre><pre>{<br>  depth: 2,<br>  post: {<br>    root_post_id: &quot;3juzlwllznd24&quot;,<br>    parent_relationship: null,<br>    author_did: &quot;did:plc:pbxv2f7r5eo47e3ylwczhk32&quot;,<br>    has_embedded_media: false,<br>    parent_post_id: &quot;3juzoytvr7s2y&quot;,<br>    created_at: &quot;2023-05-05T19:35:36.601-07:00&quot;,<br>    
id: &quot;3juzp3fjlds26&quot;,<br>    text: &quot;Were you supposed to tag someone?&quot;<br>  },<br>  x: 1657.06,<br>  y: -344.76,<br>  author_handle: &quot;kentbye.com&quot;<br>}</pre><p>So we see we have the author handle and did, the post id, text, and created_at, plus whether it has embedded media. The parent post id links it to the previous post in the thread (parent_relationship is always null), and the root post id is a reference to the post that started the thread (one without parents).</p><p>So we can use this to create the structure of:</p><p>(:User {did, label/handle})-[:POSTED]-&gt;(p:Post {id, text, created, media, root})-[:PARENT]-&gt;(parent:Post)</p><p>But first, we need to fix the duplicate users in our source data that we have already imported. Fortunately, there is only one duplicate user with the same label (handle) but different keys.</p><p>We can group users by label, count occurrences, and collect the entries into a list named users. For all labels that have a count greater than one, we merge that list of nodes into a single node while preserving the relationships.</p><p>Our friend here is apoc.refactor.mergeNodes, which does exactly that.</p><pre>match (u:User)<br>with u.label as label, count(*) as count, collect(u) as users<br>where count &gt; 1<br>call apoc.refactor.mergeNodes(users) yield node<br>return node;</pre><p>Now we can create the constraint for that property, so we can look up users quickly by label and connect them to their posts.</p><pre>create constraint user_label if not exists for (u:User) require (u.label) is unique;</pre><p>Now let’s run the import statement to pull in the Hellthread.</p><ol><li>load the posts as a stream</li><li>batch in transactions of 10k</li><li>get-or-create Post with id</li><li>set attributes (x,y,depth) from entry</li><li>set attributes (root, media, created_at, text)</li><li>get-or-create parent-post</li><li>connect to parent-post</li><li>get-or-create author user</li><li>connect to author</li></ol><pre>call
apoc.load.json(&quot;https://bsky-search.jazco.io/thread?authorHandle=ihatenfts.hellthread.vet&amp;postID=3jv6terdbop2z&amp;layout=true&quot;,&quot;$&quot;) yield value as entry<br>call { with entry<br>   with entry, entry.post as post<br>   merge (p:Post {id:post.id})<br>   set<br>     p += entry {.depth, .x, .y},<br>     p += post { .root_post_id, .has_embedded_media, created_at: datetime(post.created_at), .text}<br>   merge (parent:Post {id:post.parent_post_id})<br>   merge (p)-[:PARENT]-&gt;(parent)<br>   with *<br>   merge (u:User { label: entry.author_handle})<br>   set u.did = post.author_did<br>   merge (u)-[:POSTED]-&gt;(p)<br>} in transactions of 10000 rows;<br>// Created 10995 nodes, created 21878 relationships, set 98507 properties, added 10995 labels</pre><p>Now we have a nice subset of posts to explore paths and threading in our database.</p><p>A basic pattern (post:Post)-[:PARENT]→(parent:Post) can be extended to a variable length one by adding a star. Like here for 50 hops: (post:Post)-[:PARENT*50]→(parent:Post)</p><p>So if we fetch that single path or a few, we get a nice long chain (or tree)</p><pre>MATCH path = (post:Post)-[:PARENT*50]-&gt;(parent:Post)<br>RETURN path LIMIT 50</pre><p>But “Explore” is much better with the hierarchical layout, or coordinate layout of the pre-computed thread visualization. Also its WebGL-based layout and rendering scales better than the d3 one in Query.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*uXNfQANpnZYpnhX9T_dXiA.png" /><figcaption>Hierarchical Layout of the Thread</figcaption></figure><p>We can now find the shortest paths between users, e.g. 
via the interaction network, or via posts in the thread(s), either visually in the context menu in Explore or with the following statement.</p><pre>MATCH (start:User {label:&quot;mesirii.de&quot;})<br>MATCH (end:User {label:&quot;rboyd.dev&quot;})<br>MATCH path = shortestPath( (start)-[:INTERACTED*]-(end))<br>RETURN path</pre><p>As you can see, even my friend Ryan and I need James Governor to mediate :)</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*hzcRP_3cXAL0MSZLof3Xlg.png" /><figcaption>Shortest Path between Ryan and me</figcaption></figure><p>You can find the code we used today in my <a href="https://gist.github.com/jexp/320521a881113b853a2f97cc867a25bf">GitHub Gist</a>.</p><p><strong>Happy socializing and graphing. Be kind.</strong></p><p>The next thing I want to write up is how to use the <a href="https://worthdoingbadly.com/bsky/">Bluesky data captured in Postgres</a> by <a href="https://github.com/zhuowei">zhuowei</a> to import the users, posts, likes, and follows graph into Neo4j. Already got the CSV dumps imported. But that’s for the next blog post :)</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=32d636054dec" width="1" height="1" alt=""><hr><p><a href="https://medium.com/neo4j/bluesky-user-interaction-graph-discover-auradb-free-week-47-32d636054dec">Bluesky User Interaction Graph — Discover AuraDB Free (Week 47)</a> was originally published in <a href="https://medium.com/neo4j">Neo4j Developer Blog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Cypher Code-Golf Completion: Hackathon Winners Announced]]></title>
            <link>https://medium.com/neo4j/cypher-code-golf-hackathon-completion-5b43877d2904?source=rss-3865848842f9------2</link>
            <guid isPermaLink="false">https://medium.com/p/5b43877d2904</guid>
            <category><![CDATA[neo4j]]></category>
            <category><![CDATA[community]]></category>
            <category><![CDATA[code-golf]]></category>
            <category><![CDATA[hackathons]]></category>
            <category><![CDATA[database]]></category>
            <dc:creator><![CDATA[Michael Hunger]]></dc:creator>
            <pubDate>Thu, 06 Apr 2023 08:42:00 GMT</pubDate>
            <atom:updated>2023-04-06T13:35:21.496Z</atom:updated>
            <content:encoded><![CDATA[<h4>With spring in full swing and summer just right around the corner, I wanted to take a moment to reflect on the Neo4j Hackathon we ran last year, where we encouraged you all to compete in your Cypher skills.</h4><p>Similar to the <a href="https://en.wikipedia.org/wiki/Code_golf">classic code-golf</a>, where the shortest solution to a problem wins, we wanted to see who could write the most efficient and shortest Cypher query to solve one of the three available challenges in “Cypher Code-Golf,” with difficulty ranging across Beginner, Intermediate, and Advanced Cypher skill levels.</p><p>Efficiency was measured in “database hits,” a metric returned when using PROFILE with your query, and the shortest query was determined by its length in characters. For ties on both metrics, we used the first submission.</p><p>By the numbers, more than 1000 people registered, and over 5,000 queries were submitted. At the conclusion of the hackathon, 18 winners were selected to take home hard-earned cash prizes, and we couldn’t be more thankful to have so many participants in this challenge. More on those 18 winners below!</p><p>Our first-ever hackathon wouldn’t be complete without some learnings and challenges. We quickly realized that not having a result verification tool led to a lot of invalid queries with db-hit counts that simply didn’t make sense. Zero db-hits was an obvious one, and we got plenty of those! 
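</p><p>In principle, such cheats are easy to catch by re-running each submission against a second, hidden input. A hypothetical sketch of that kind of check, with plain Python callables standing in for Cypher statements:</p>

```python
def validate(solution, cases):
    """Run a submitted solution against several (params, expected) cases.

    A hard-coded answer may pass the published case but fails hidden ones.
    """
    return all(solution(params) == expected for params, expected in cases)

# An honest solution computes the answer from its input parameters;
# a cheating one just returns the published answer as a constant.
def honest(params):
    return params["a"] + params["b"]

def cheat(_params):
    return 7

cases = [({"a": 3, "b": 4}, 7),   # the published example
         ({"a": 1, "b": 1}, 2)]   # a hidden verification case
```

<p>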
Next time we’ll apply the approach from Advent of Code: execute the parameterized statement with a second set of input parameters and validate the correctness of the results that way.</p><p>Here’s a quick rundown of the user interface of Code-Golf:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*SkOo8chTbWUp5Tpq" /></figure><p>After signing up and picking an avatar, you could start swinging.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*Z99bYo9-QfT7Iodl" /></figure><p>We provided three difficulty levels to accommodate different skills, one of which you could compete in.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*a5nYE7iXmEP5aFg6" /></figure><h3>Stack Overflow Dataset</h3><p>Being developers, we felt that a Stack Overflow dataset would be easy to understand and fun to query for everyone. Based on our previous work, we used the import of the whole Stack Overflow dump from 2018 (55M nodes, 123M relationships), which provided us with this data model:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*s4sTrMa7OmVVcjAE" /></figure><p>So effectively, the elements we are looking at are:</p><ul><li>Labels: Tag, Question:Post, Answer:Post, User</li><li>Patterns</li><li>(:User)-[:POSTED]-&gt;(:Post), (:Tag)-[:SIMILAR]-&gt;(:Tag)</li><li>(:Question)-[:TAGGED]-&gt;(:Tag)</li><li>(:Question)-[:ACCEPTED]-&gt;(:Answer), (:Question)&lt;-[:ANSWERED]-(:Answer)</li></ul><p>For the three levels, we had one question each.</p><h3>Level 1: Beginner</h3><p>Question: How many tags do questions in the Cypher area have at most?</p><p>This question could be solved in these steps:</p><ol><li>Find the “cypher” tag</li><li>Find questions tagged with that tag</li><li>Return the max degree (relationship count) of those questions</li></ol><p>We got the most submissions in this category, 1197 in total.</p><p>There were a number of cheats, e.g. 
just returning the number, or sneakily creating a statement that just hit 1000 db-hits and then returned fixed results, or similar.</p><p>An efficient statement that meets the criteria with 12719 db-hits is:</p><pre>MATCH (t:Tag {name:&#39;cypher&#39;})&lt;-[:TAGGED]-(q)<br>RETURN max(size( (q)-[:TAGGED]-&gt;())) as most</pre><p>With Neo4j 5.x, you would replace the <strong>size(pattern)</strong> expression with <strong>count { pattern }</strong> instead.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*e9eYKyrCJrFah9KF" /></figure><h3>Winners</h3><p>Our 6 winners in this category are:</p><ol><li>Chris Zirkel</li><li>Paweł Gronowski</li><li>Ertan Kabakcı</li><li>Benjamin Malburg</li><li>Joren Van de Vondel</li><li>Paul Billing-Ross</li></ol><p>Here are two example queries from our winners that differ from the minimal query:</p><pre>MATCH (:Tag {name: &quot;cypher&quot;})&lt;-[:TAGGED]-(q)<br>RETURN max(apoc.node.degree(q,&quot;TAGGED&quot;))</pre><p>And another:</p><pre>MATCH (t:Tag)&lt;-[:TAGGED]-(q)<br>WHERE t.name = &quot;cypher&quot;<br>WITH q, size((q)-[:TAGGED]-&gt;()) as ts<br>RETURN max(ts)</pre><h3>Level 2: Intermediate</h3><p>Question: What is the title of the most highly voted question posted on April 1st in the Perl category?</p><p>This question was also pretty straightforward: you can access components like month or day of a date(-time) property individually.</p><ol><li>Find the “perl” tag</li><li>Find questions with that tag</li><li>Filter those questions by day 1 and month 4</li><li>Sort by score DESCending</li><li>Return the title</li></ol><p>Again, among the 932 submissions, we got a lot of cheats that looked for the question only on that single date or by question-id, or only within a single year and not all years, or just blatantly returned the expected title.</p><p>A reasonable solution would be:</p><pre>MATCH (t:Tag {name:&#39;perl&#39;})&lt;-[:TAGGED]-(q)<br>WHERE q.createdAt.month = 4 and q.createdAt.day = 1<br>RETURN q.title<br>ORDER BY 
q.score DESC LIMIT 1;</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*taoZyDsrngLRRXNq" /></figure><p>An alternative solution iterates over the years and then uses a range (between) predicate, which makes use of the index on the date field and a hash-join between the tag and the question.</p><p>But that one had higher db-hit costs (353k), as it had to filter <strong>all</strong> questions for that date, and not just the ones tagged with perl.</p><pre>UNWIND range(2008,2019) as year<br>WITH datetime({year:year,month:4,day:1}) as date<br>MATCH (t:Tag {name:&#39;perl&#39;})&lt;-[:TAGGED]-(q:Question)<br>USING JOIN on t<br>WHERE date &lt;= q.createdAt &lt; date + duration(&quot;P1D&quot;)<br>RETURN q.title, q.score<br>ORDER BY q.score DESC LIMIT 1;</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*2hD-YD18xn6aqZu-" /></figure><h3>Winners</h3><p>The winners for this category are:</p><ol><li>Justin B</li><li>Brian Lee</li><li>Belinda Dhamers</li><li>Carmi Raz</li><li>Venkatesh Prasanna</li><li>Camille Caulier</li></ol><p>Some of their <em>different</em> solutions:</p><pre>CYPHER runtime=interpreted<br>MATCH (t:Tag{name:&#39;perl&#39;})&lt;-[:TAGGED]-(q:Question)<br>USING INDEX q:Question(createdAt)<br>USING JOIN ON t<br>WHERE datetime({year:2008, month:4, day:1}) &lt;= q.createdAt &lt; datetime({year:2008, month:4, day:2})<br>OR datetime({year:2009, month:4, day:1}) &lt;= q.createdAt &lt; datetime({year:2009, month:4, day:2})<br>OR datetime({year:2010, month:4, day:1}) &lt;= q.createdAt &lt; datetime({year:2010, month:4, day:2})<br>…<br>day:2})<br>OR datetime({year:2022, month:4, day:1}) &lt;= q.createdAt &lt; datetime({year:2022, month:4, day:2})<br>RETURN q.title ORDER BY q.score DESC LIMIT 1</pre><p>Or with a regular expression:</p><pre>MATCH (:Tag {name: &#39;perl&#39;})&lt;-[:TAGGED]-(q)<br>WHERE apoc.convert.toString(q.createdAt) =~ &#39;.*04\-01T.*&#39;<br>RETURN q.title<br>ORDER BY q.score DESC<br>LIMIT 1</pre><p>Or via 
date components:</p><pre>MATCH (:Tag {name:&quot;perl&quot;})&lt;-[:TAGGED]-(q) WHERE EXISTS {<br>MATCH(q) WHERE q.createdAt.month=4 AND q.createdAt.day=1<br>} RETURN q.id ORDER BY q.score DESC LIMIT 1</pre><h3>Level 3: Advanced</h3><p>Question: Given the top 10 tags in the Lua community (except Lua), which other tags most frequently co-occur with them. For those tags, find the 25 people who most frequently answered those questions and see what are the top 10 question tag names they had that were not in the original 10-element list.</p><p>This question was a bit more involved, but you can just follow the steps one at a time to build it up.</p><ol><li>Find the “lua” tag</li><li>Find questions tagged with that tag, and other tags for these questions</li><li>Aggregate the other tags by their frequency, sort descending and select the top 10</li><li>Turn the 10 top other tags into a list</li><li>Find the users who posted answers to questions with those top 10 tags</li><li>Aggregate by frequency, sort descending, and collect the top 25 users</li><li>For those users, find the tags of the questions they answered</li><li>Filter out the tags that were in our top 10 tags list</li><li>Return the remaining tags, ordered by frequency descending, and pick the top 10</li></ol><p>An example query showing how it could be solved:</p><pre>MATCH (:Tag {name:&#39;lua&#39;})&lt;-[:TAGGED]-()-[:TAGGED]-&gt;(o)<br>WITH o, count(*) as c ORDER BY c DESC LIMIT 10<br>CALL { with o return collect(o) as all }<br>MATCH (o)&lt;-[:TAGGED]-()&lt;-[:ANSWERED]-()&lt;-[:POSTED]-(u)<br>WITH u, all, count(*) as c ORDER BY c DESC LIMIT 25<br>MATCH (o)&lt;-[:TAGGED]-()&lt;-[:ANSWERED]-()&lt;-[:POSTED]-(u)<br>WHERE not o in all<br>RETURN o, count(*) as c ORDER BY c DESC LIMIT 10</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*Ed0QQnCxXlHpKTbe" /></figure><p>This category was, as expected, the hardest; many folks also had issues understanding the challenge. 
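To make the nine steps concrete outside of Cypher, here is the same pipeline sketched in plain Python over toy in-memory data (the dataset and the scaled-down limits are invented for illustration; the real challenge uses top 10 tags, 25 users, and 10 result tags):

```python
from collections import Counter

# Toy stand-in for the Stack Overflow graph (purely illustrative data)
question_tags = {
    "q1": ["lua", "c", "table"],
    "q2": ["lua", "c"],
    "q3": ["lua", "table"],
    "q4": ["c", "gc"],  # no lua tag, but answered by a top user
}
answers = [("alice", "q1"), ("alice", "q2"),
           ("bob", "q1"), ("bob", "q3"), ("bob", "q4")]

TOP_TAGS, TOP_USERS, TOP_RESULT = 2, 2, 2  # scaled-down limits

# Steps 1-4: tags co-occurring with "lua", most frequent first
co = Counter(t for tags in question_tags.values() if "lua" in tags
             for t in tags if t != "lua")
top_tags = {t for t, _ in co.most_common(TOP_TAGS)}

# Steps 5-6: users who answered questions carrying those top tags
active = Counter(u for u, q in answers
                 if any(t in top_tags for t in question_tags[q]))
top_users = {u for u, _ in active.most_common(TOP_USERS)}

# Steps 7-9: other tags on those users' questions, minus the top list
other = Counter(t for u, q in answers if u in top_users
                for t in question_tags[q] if t not in top_tags and t != "lua")
print(other.most_common(TOP_RESULT))  # [('gc', 1)]
```

Each aggregate-sort-limit stage maps to one `Counter(...).most_common(n)` call, which mirrors the `count(*) ... ORDER BY c DESC LIMIT n` blocks in the Cypher version.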
Unfortunately, we didn’t have the capacity to provide explanations for all your questions; apologies for that.</p><h3>Winners</h3><p>But we still got 111 submissions, from which we picked the 6 winners.</p><ol><li>Ron van Weverwijk</li><li>Laura Arditti</li><li>Hüseyin Çötel</li><li>Đức Lê Tự</li><li>Niclas Kjäll-Ohlsson</li><li>Rajendra Kadam</li></ol><p>Some different solutions from the one we’ve shown:</p><pre>MATCH (:Tag{name: &quot;lua&quot;})&lt;-[:TAGGED]-(p)-[:TAGGED]-&gt;(o)<br>WITH o, count(p) as np<br>ORDER BY np DESC<br>LIMIT 10<br>WITH collect(o) as os<br>CALL {<br>WITH os<br>UNWIND os as o<br>MATCH (o)&lt;-[:TAGGED]-()&lt;-[:ANSWERED]-(a)&lt;-[:POSTED]-(u)<br>RETURN u, count(DISTINCT a) AS na<br>ORDER BY na DESC<br>LIMIT 25<br>}<br>WITH os, collect(u) as us<br>CALL {<br>WITH us, os<br>UNWIND us as u<br>MATCH (u)--&gt;(:Question)-[t:TAGGED]-&gt;(ot)<br>WHERE (NOT ot IN os)<br>RETURN ot, count(t) as nt<br>ORDER BY nt DESC<br>LIMIT 100<br>}<br>return us, collect(ot)</pre><p>Using APOC:</p><pre>MATCH l = (t:Tag {name: &quot;lua&quot;})-[:SIMILAR]-(s:Tag)&lt;-[:TAGGED]-(p:Post)-[:TAGGED]-&gt;(c:Tag)<br>WITH c, count(l) AS n, collect(p) AS p<br>ORDER BY n DESC LIMIT 10<br>WITH apoc.coll.flatten(collect(p)) AS p, collect(c) AS o<br>MATCH r = (u:User)-[:POSTED]-&gt;(:Answer)-[:ANSWERED]-&gt;(q:Question)<br>WHERE q in p<br>WITH count(r) AS a, u, o, collect(q) AS q<br>ORDER BY a DESC LIMIT 25<br>WITH apoc.coll.flatten(collect(q)) AS q, o<br>MATCH (v:Question)-[:TAGGED]-&gt;(t:Tag)<br>WHERE NOT (t IN o) AND v IN q<br>WITH t, count(v) AS c<br>RETURN t ORDER BY c DESC LIMIT 100</pre><p>Using a single query:</p><pre>MATCH (tf:Tag)&lt;-[tr:TAGGED]-(q:Question)-[:TAGGED]-(t:Tag {name:&quot;lua&quot;}) with tf, count(tr) as cnt order by cnt desc limit 10<br>MATCH (tf)&lt;-[:TAGGED]-(q:Question)&lt;-[:ANSWERED]-(:Post)&lt;-[rp:POSTED]-(u:User) with u, count(rp) as cntrp, collect(tf.name) as tf_list order by cntrp desc limit 25<br>MATCH 
(tt:Tag)&lt;-[rtt:TAGGED]-(:Question)&lt;-[:ANSWERED]-(:Post)&lt;-[:POSTED]-(u) where not tt.name in tf_list with tt, count(rtt) as crtt order by count(rtt) desc limit 100 return tt</pre><p>A smaller Stack Overflow dataset is also available on <a href="https://neo4j.com/aura">Neo4j AuraDB</a>; the full dataset is on the demo server <a href="https://demo.neo4jlabs.com:7473/browser">demo.neo4jlabs.com</a> with username/password/database “stackoverflow”.</p><p>Please make sure you join our <a href="https://community.neo4j.com">user forums</a> or hang out in our <a href="https://dev.neo4j.com/chat">Discord</a> to learn from each other.</p><p><a href="https://community.neo4j.com">Neo4j Online Community</a></p><p>If you want to learn more, we have published additional Cypher, Graph Data Science, and Application Development courses on <a href="https://graphacademy.neo4j.com">GraphAcademy</a>.</p><p><a href="https://graphacademy.neo4j.com">Free, Self-Paced, Hands-on Online Training</a></p><p>And we’re running regular live streams, Meetups, GraphSummits, and other events, all of which you can find on our <a href="https://neo4j.com/events">Events Page</a>.</p><p><a href="https://neo4j.com/events">Events Archive</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=5b43877d2904" width="1" height="1" alt=""><hr><p><a href="https://medium.com/neo4j/cypher-code-golf-hackathon-completion-5b43877d2904">Cypher Code-Golf Completion: Hackathon Winners Announced</a> was originally published in <a href="https://medium.com/neo4j">Neo4j Developer Blog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Git Commit History — Discover AuraDB: Week 44]]></title>
            <link>https://medium.com/neo4j/git-commit-history-discover-auradb-week-44-2ea2337abc86?source=rss-3865848842f9------2</link>
            <guid isPermaLink="false">https://medium.com/p/2ea2337abc86</guid>
            <category><![CDATA[data-exploration]]></category>
            <category><![CDATA[graph-database]]></category>
            <category><![CDATA[data-visualization]]></category>
            <category><![CDATA[neo4j]]></category>
            <category><![CDATA[git]]></category>
            <dc:creator><![CDATA[Michael Hunger]]></dc:creator>
            <pubDate>Mon, 27 Mar 2023 16:20:46 GMT</pubDate>
            <atom:updated>2023-03-27T16:20:46.008Z</atom:updated>
            <content:encoded><![CDATA[<h3>Git Commit History — Discover AuraDB: Week 44</h3><h4>Let’s explore the output of git log as a graph in Neo4j AuraDB Free.</h4><p>This time Alex couldn’t make it; hope he gets better soon.</p><p>Some interesting things that happened since last weekend:</p><ul><li>Went to watch <a href="https://www.youtube.com/watch?v=IiMinixSXII">Dungeons and Dragons: Honor among Thieves</a> with the kids (it was really good)</li><li>Played a lot with GPT-4</li><li>Missed the <a href="https://www.dw.com/en/asteroid-to-zip-between-earth-and-moon-without-collision/a-65120155">asteroid DZ2 2023</a> due to the cloudy sky :(</li><li>Been running the Arc browser, which is really neat (<a href="https://www.lennyspodcast.com/competing-with-giants-an-inside-look-at-how-the-browser-company-builds-product-josh-miller-ceo/">invite in the show notes</a>)</li><li>Went to a local whisky festival, which was great for tasting rare spirits</li></ul><p>Amid all that, I saw the TIL post by Simon Willison about his <a href="https://til.simonwillison.net/jq/git-log-json">GPT-4 coding exercise to turn git logs into JSON output</a>.</p><p>He uses <a href="https://stedolan.github.io/jq/">jq, the Swiss Army knife of JSON processing</a>, which is awesome.</p><p>His post inspired today’s session on looking at <strong>git commit history</strong> as a graph.</p><p>If you’d rather watch the recording, check it out here:</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FhSttMcYxdKQ%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DhSttMcYxdKQ&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FhSttMcYxdKQ%2Fhqdefault.jpg&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a 
href="https://medium.com/media/b33f117202af31e3ad1dbd82d747761a/href">https://medium.com/media/b33f117202af31e3ad1dbd82d747761a/href</a></iframe><h3>Data Source and Preparation</h3><p>We’re using Neo4j’s open-source repository here for our experiment: <a href="https://github.com/neo4j/neo4j">https://github.com/neo4j/neo4j</a></p><p>Clone the repository (I limited it to the 5.6 branch) and follow along.</p><p>Simon did two interesting things:</p><ol><li>Using git log --pretty with NULL-bytes as separators (instead of commas or tabs)</li><li>Using jq to split the raw string by those NULL bytes and output JSON for the fields</li></ol><pre>git log --date=iso --pretty=format:&#39;%H%x00%an%x00%ad%x00%s%x00&#39; | head -2 | \<br>  jq -R -s &#39;[split(&quot;\n&quot;)[:-1] | map(split(&quot;\u0000&quot;)) | .[] | {<br>    &quot;commit&quot;: .[0],<br>    &quot;author&quot;: .[1],<br>    &quot;date&quot;: .[2],<br>    &quot;message&quot;: .[3]<br>  }]&#39;</pre><p>The output is:</p><pre>[<br>  {<br>    &quot;commit&quot;: &quot;5ad4387ed521f169a737f9836402dbac8759a9fc&quot;,<br>    &quot;author&quot;: &quot;Johannes Donath&quot;,<br>    &quot;date&quot;: &quot;2023-03-08 16:17:12 +0100&quot;,<br>    &quot;message&quot;: &quot;Corrected an issue in which the number of writable bytes is incorrectly calculated. ()&quot;<br>  },<br>  {<br>    &quot;commit&quot;: &quot;c53e91519eca145c6879d5c9be9a421fdd223338&quot;,<br>    &quot;author&quot;: &quot;Tobias Johansson&quot;,<br>    &quot;date&quot;: &quot;2023-03-07 13:34:12 +0100&quot;,<br>    &quot;message&quot;: &quot;Do polling of the fabric transaction lock in terminate&quot;<br>  }<br>]</pre><p>The placeholders in the pretty-print string are a bit cryptic, but the man page and the <a href="https://devhints.io/git-log-format">cheat-sheet here</a> help a lot.</p><p>I wanted to add the parent commit via %P to see the git commit history chain. 
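If jq isn't your thing, the same NUL-byte splitting is a few lines of Python. A sketch parsing the extended format string (with %P for the parent) into dicts; the sample log line below is synthetic:

```python
# Parse `git log --pretty=format:'%H%x00%P%x00%an%x00%ad%x00%s%x00'` output
# into one dict per commit, mirroring what the jq pipeline does.
FIELDS = ["commit", "parent", "author", "date", "message"]

def parse_log(raw: str):
    records = []
    for line in raw.splitlines():
        # Each line carries NUL-separated fields, with a trailing NUL
        values = line.split("\x00")[:len(FIELDS)]
        records.append(dict(zip(FIELDS, values)))
    return records

sample = ("abc123\x00def456\x00Jane Doe\x00"
          "2023-03-08 16:17:12 +0100\x00Fix the thing\x00")
print(parse_log(sample)[0]["author"])  # Jane Doe
```

Because %s is the subject line only, it never contains newlines, so splitting on line breaks is safe here.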
I also tried to include the files of each commit, but we’ll leave that for another time.</p><p>The other change we need is to turn the data into a CSV file, not JSON. Fortunately, jq also supports CSV as an output, so we can select our 5 fields and pipe them through the @csv processing step.</p><pre>echo &#39;commit,parent,author,date,message&#39; &gt; ~/Downloads/neo4j-git.csv<br>git log --date=iso --pretty=format:&#39;%H%x00%P%x00%an%x00%ad%x00%s%x00&#39; | \<br>  jq -r -R -s &#39;split(&quot;\n&quot;)[:-1] | map(split(&quot;\u0000&quot;)) | .[] | [<br>   .[0],<br>   .[1],<br>   .[2],<br>   .[3],<br>   .[4]<br>  ] | @csv&#39; &gt;&gt; ~/Downloads/neo4j-git.csv</pre><p>We can check the resulting CSV file with the <a href="https://github.com/BurntSushi/xsv">xsv tool</a> and see that it has the 76k commits and the right fields; it’s about 14MB of data.</p><pre>xsv count ~/Downloads/neo4j-git.csv<br>76567<br><br>xsv stats ~/Downloads/neo4j-git.csv<br>field,type,sum,min,max,min_length,max_length,mean,stddev<br>commit,Unicode,,0000257bb06e29e15c11b6bc5ad4f8253deed4a4,ffff9ee399d15bd834e4aeb4719a02e5e39308d5,40,40,,<br>parent,Unicode,,0000257bb06e29e15c11b6bc5ad4f8253deed4a4,ffff9ee399d15bd834e4aeb4719a02e5e39308d5,0,81,,<br>author,Unicode,,@fbiville,wujek srujek,2,30,,<br>date,Unicode,,2007-05-24 01:34:45 +0000,2023-03-08 16:17:12 +0100,25,25,,<br>message,Unicode,,     - BeansAPITransaction handles multiple calls to tx.finish(),zoo_keeper_servers example changed to something useful (the default ZK port),1,3217,,<br><br>ls -lh ~/Downloads/neo4j-git.csv<br>14M 27 Mar 10:17 /Users/mh/Downloads/neo4j-git.csv</pre><p>OK, now we have everything we need to turn those commits into a graph.</p><p>Are you ready?</p><p>Let’s first spin up a Neo4j AuraDB database and get the data imported.</p><h3>Create a Neo4j AuraDB Free Instance</h3><p>Go to <a 
href="https://dev.neo4j.com/neo4j-aura">https://dev.neo4j.com/neo4j-aura</a> to register or log into the service (you might need to verify your email address).</p><p>After clicking Create Database, you can create a new <strong>Neo4j AuraDB Free</strong> instance.</p><p>Choose the &quot;Empty Instance&quot; option, as we want to import our data ourselves.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*8SycaBOFQjP6cXAYoaWSNA.png" /></figure><p>On the Credentials popup, make sure to save the password somewhere safe; the best option is to download the credentials file, which you can also use for your app development.</p><p>The default username is always neo4j.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*_9VVp2bgdaS_Jp13wdgBSQ.png" /></figure><p>Then wait 2-3 minutes for your instance to be created.</p><p>Afterward, you can connect to Workspace via the &quot;Open&quot; button (you’ll need the password), which offers the &quot;Import&quot; (Data Importer), &quot;Explore&quot; (Neo4j Bloom), and &quot;Query&quot; (Neo4j Browser) tabs to work with your data.</p><p>On the database tile, you can also find the connection URL: neo4j+s://xxx.databases.neo4j.io (it is also contained in your credentials env file).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Q_y2TLjlivNHfMGEsDgQPA.png" /></figure><p>If you want to see examples of programmatically connecting to the database, go to the &quot;Connect&quot; tab of your instance and pick the language of your choice.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*l1VeLL7ICT1jiQCvVaQJLA.png" /></figure><p>After opening Neo4j Workspace via the &quot;Open&quot; button, and logging in with the downloaded credentials, we can go to the &quot;Import&quot; tab and get started.</p><h3>Data Modeling</h3><p>The data model is pretty straightforward: we just have Commit and Author nodes connected by a WROTE relationship.</p><p>Commits also point to 
their parent commit via a PARENT relationship.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/591/1*rRNAKc2zUHl2c6HJmnKWhg.png" /><figcaption>Minimalistic Data Model</figcaption></figure><p>Now we can add our CSV file, map the attributes, and select our id fields.</p><ul><li>Commit (hash, date, parent, subject)</li><li>Author (author)</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*7Cc-OB7vf9TVZx1eAFAi7A.png" /><figcaption>Data Importer with Mapping</figcaption></figure><p>With the data mapped, we can &quot;Preview&quot; our import and inspect nodes and relationship attributes and structure to see if we messed anything up.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1009/1*BobDYHAwKe4ukP5683yS2g.png" /><figcaption>Import Preview</figcaption></figure><p>As we’re good, we can click &quot;Import&quot; and, after a few seconds, the results are presented. There you can also inspect the Cypher statements used to create the data, which you can reuse in your own code or scripts.</p><p>The image below is from a 2nd run, so the data is already in the graph.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/898/1*PCiBPfua1_gyXJYphu_YTA.png" /><figcaption>Import Results</figcaption></figure><h3>Explore</h3><p>Now, with &quot;Start Exploring&quot;, we head over to the &quot;Explore&quot; tab and see the data in the graph.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*abcsgzxcZuZDTfRm_Rl_4Q.png" /></figure><p>There we can also style our nodes and, for instance, determine the shortest paths between commits (select two nodes and right-click for the context menu).</p><p>Now let’s get our hands dirty and write some Cypher statements.</p><h3>Query &amp; Evolve</h3><p>Opening the left sidebar shows us the labels and relationship types in the graph; we can click on any of them to see a subset of our data.</p><figure><img alt="" 
src="https://cdn-images-1.medium.com/max/1024/1*cT_BLuhmxDk6RlUggTHL4g.png" /><figcaption>Query the Graph</figcaption></figure><p>Now we can start to explore a bit.</p><p>First, let’s look at the total number of commits and authors:</p><pre>MATCH (:Commit) RETURN count(*)<br>// 76k<br><br>MATCH (:Author) RETURN count(*)<br>// 338</pre><h3>Contributor Activity</h3><p>Let’s find the most active authors by counting their outgoing relationships.</p><pre>MATCH (a:Author)<br>WITH a, count { (a)-[:WROTE]-&gt;() } as commits<br>ORDER BY commits DESC LIMIT 10<br>RETURN a.author, commits</pre><p>Which gives us these folks:</p><pre>a.author commits<br>Mattias Persson 6592<br>Anders Nawroth 4106<br>Pontus Melke 4030<br>MishaDemianenko 3989<br>Andres Taylor 3640<br>Chris Vest 2983<br>Jacob Hansson 2521<br>Anton Persson 2212<br>Mattias Finné 2148<br>Davide Grohmann 1987</pre><p>The usual suspects :)</p><p>Oh, Mattias has been with us for 15 years but got married in between, so he shows up with two different names.</p><p>Let’s fix that and move the 6592 relationships from his old alter-ego to the new one:</p><pre>MATCH (new:Author { author: &quot;Mattias Finné&quot;})<br>MATCH (old:Author { author: &quot;Mattias Persson&quot;})<br>MATCH (old)-[rel:WROTE]-&gt;(c:Commit)<br>CREATE (new)-[:WROTE]-&gt;(c)<br>DELETE rel<br>RETURN count(*);</pre><p>So that should fix the results:</p><pre>a.author commits<br>Mattias Finné 8740<br>Anders Nawroth 4106<br>Pontus Melke 4030<br>MishaDemianenko 3989<br>Andres Taylor 3640<br>Chris Vest 2983<br>Jacob Hansson 2521<br>Anton Persson 2212<br>Davide Grohmann 1987<br>Satia Herfert 1896</pre><p>If we want to limit the count by commit date, so we see who has been more active recently, we can do that too:</p><pre>MATCH (a:Author)<br>WITH a, count { (a)-[:WROTE]-&gt;(c:Commit) WHERE c.date &gt; datetime(&quot;2019-01-01T00:00:00&quot;)} as commits<br>ORDER BY commits DESC LIMIT 10<br>RETURN a.author, commits</pre><p>Now we have a different set of 
people with fewer total contributions:</p><pre>a.author commits<br>MishaDemianenko 1892<br>Pontus Melke 1570<br>Mattias Finné 1284<br>Satia Herfert 1126<br>Anton Persson 920<br>Chris Vest 746<br>Louise Söderström 578<br>Tobias Johansson 463<br>Therese Magnusson 439<br>Georgiy Kargapolov 415</pre><h3>Fixing Parents</h3><p>Another thing we need to fix: some commits have multiple parents, not just one, so their parent hash was not found in the database to connect them.</p><p>We have roughly 15k commits of that kind. Here is what the property looks like.</p><pre>parent: &quot;e5697a0900ff849f92d0ae3c88bd8e31e3163024 d26c6ab67c34a5f91cc6cfadfc75ca4b1def3bef&quot;</pre><p>We can fix it by:</p><ol><li>Finding these commits</li><li>Splitting the parent field by space</li><li>Turning that list of hashes into rows</li><li>Finding the parent commit by its hash</li><li>Creating the relationships</li></ol><p>Or in Cypher:</p><pre>// find commits without PARENT relationship but with multiple parent hash values<br>MATCH (c:Commit)<br>WHERE NOT exists { (c)-[:PARENT]-&gt;() }<br>  AND c.parent contains &#39; &#39;<br><br>// split by space into list of hashes<br>WITH c, split(c.parent,&#39; &#39;) as parents<br><br>// turn list into rows<br>UNWIND parents as parent<br><br>// find parent commit<br>MATCH (p:Commit {hash:parent})<br><br>// create the relationship<br>MERGE (c)-[:PARENT]-&gt;(p)<br><br>RETURN count(*);</pre><p>Now our graph is better connected, and we can run a few long path queries:</p><pre>MATCH path = (c:Commit)-[:PARENT*100]-&gt;(p:Commit)<br>RETURN path LIMIT 1</pre><p>Which gives us this beautiful, long, flowery chain of commits:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*h-ZjreAShPMpLYPEdRmhZg.png" /></figure><p>We can also look at &quot;root&quot; commits and see which ones have the most children.</p><p>With PROFILE, we can sneak under the hood and see how the query planner optimizes this query; it completes in about 2 seconds, checking 
10-hop paths for all 76k commits.</p><pre>PROFILE<br>MATCH (c)-[:PARENT*..10]-&gt;(p:Commit)<br>WITH p, count(distinct c) as children<br>RETURN p {.*} as parent, children ORDER BY children DESC LIMIT 10</pre><p>Some of them have more than 600 children, even looking only 10 levels up!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*fIyIh-SsecIySwDwep--pw.png" /></figure><p>We can do the same without an upper limit, but for a subset of commits:</p><pre>PROFILE<br>MATCH (p:Commit) WITH p LIMIT 1000<br>MATCH (c)-[:PARENT*]-&gt;(p)<br>WITH p, count(distinct c) as children<br>RETURN p, children ORDER BY children DESC LIMIT 10</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*q0nGJnW6oLofIV47J3Hsvg.png" /></figure><p>And with that, we were out of time and had covered a lot of ground.</p><p>If you want to see videos, write-ups, and data for past livestreams, check out our overview page:</p><p><a href="https://neo4j.com/video/discover-aura-free-with-fun-datasets/">Discovering AuraDB Free with Fun Datasets</a></p><p>Or our repository: <a href="https://github.com/neo4j-examples/discoveraurafree">https://github.com/neo4j-examples/discoveraurafree</a></p><p>Hope this was as much fun for you as it was for me.</p><p>Happy Graphing!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=2ea2337abc86" width="1" height="1" alt=""><hr><p><a href="https://medium.com/neo4j/git-commit-history-discover-auradb-week-44-2ea2337abc86">Git Commit History — Discover AuraDB: Week 44</a> was originally published in <a href="https://medium.com/neo4j">Neo4j Developer Blog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>