<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Michael Hunger on Medium]]></title>
        <description><![CDATA[Stories by Michael Hunger on Medium]]></description>
        <link>https://medium.com/@mesirii?source=rss-3865848842f9------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/0*hmhjbwkprwEOimDT.jpg</url>
            <title>Stories by Michael Hunger on Medium</title>
            <link>https://medium.com/@mesirii?source=rss-3865848842f9------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Wed, 08 Apr 2026 15:56:57 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@mesirii/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Building AI Agents With the Google Gen AI Toolbox and Neo4j Knowledge Graphs]]></title>
            <link>https://medium.com/neo4j/building-ai-agents-with-the-google-gen-ai-toolbox-and-neo4j-knowledge-graphs-86526659b46a?source=rss-3865848842f9------2</link>
            <guid isPermaLink="false">https://medium.com/p/86526659b46a</guid>
            <category><![CDATA[genai]]></category>
            <category><![CDATA[neo4j]]></category>
            <category><![CDATA[knowledge-graph]]></category>
            <category><![CDATA[agentic-ai]]></category>
            <category><![CDATA[generative-ai-tools]]></category>
            <dc:creator><![CDATA[Michael Hunger]]></dc:creator>
            <pubDate>Mon, 10 Mar 2025 18:20:09 GMT</pubDate>
            <atom:updated>2025-03-12T19:20:22.466Z</atom:updated>
            <content:encoded><![CDATA[<p><a href="https://medium.com/u/3865848842f9">Michael Hunger</a>, Head of Product Innovation, Neo4j<br><a href="https://medium.com/u/f17c1f95a916">Kurtis Van Gent</a>, Staff Software Engineer, Google</p><blockquote>Interested in learning more? Watch the recording of the <a href="https://youtube.com/live/0p3S56JnTCg">Gen AI Toolbox + Neo4j livestream</a> from March 12, 2025.</blockquote><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2F0p3S56JnTCg%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3D0p3S56JnTCg&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2F0p3S56JnTCg%2Fhqdefault.jpg&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/4c0476cd2fce369820dae0f693b1bdea/href">https://medium.com/media/4c0476cd2fce369820dae0f693b1bdea/href</a></iframe><h3>Introduction</h3><p>The Gen AI Toolbox for Databases, recently <a href="https://cloud.google.com/blog/products/ai-machine-learning/announcing-gen-ai-toolbox-for-databases-get-started-today?e=48754805&amp;hl=en">launched in collaboration with LangChain</a>, now includes a Neo4j integration.</p><p><a href="https://cloud.google.com/blog/products/ai-machine-learning/announcing-gen-ai-toolbox-for-databases-get-started-today?e=48754805&amp;hl=en">Announcing Gen AI Toolbox for Databases. Get started today | Google Cloud Blog</a></p><p>This collaboration began following discussions with Kurtis Van Gent after the <a href="https://neo4j.com/blog/developer/genai-graph-gathering/">GraphRAG gathering</a>, where he proposed better infrastructure to improve database integration in agentic systems as both sources and tools. 
The addition of Neo4j brings knowledge graph capabilities to Toolbox users, expanding the functionality for database management and Gen AI applications.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/796/0*iNcgBWfFFnOkSv6z" /></figure><h3>Agentic Architectures</h3><p>While you might have heard much about AI agents, here is a quick recap of how they differ from simple retrieval-augmented generation (RAG) approaches and what that implies.</p><p>In traditional <a href="https://cloud.google.com/use-cases/retrieval-augmented-generation?hl=en"><strong>RAG</strong></a> flows, the user question is not answered by an LLM directly; instead, relevant information from trusted data sources is retrieved first and then provided together with the question as part of the LLM prompt to be used for answering. So the LLM is only used for its language skills, not for its pretrained knowledge. This helps with the relevance and accuracy of the answers and, partially, with explainability.</p><p>With <a href="https://graphrag.com"><strong>GraphRAG</strong></a>, a knowledge graph is the data source, which goes way beyond text-fragment retrieval from a vector store. The connected information in the graph provides relevant context from related entities, documents, and structural summaries. Due to the structured nature of the retrieved information, explaining and transparently showing the detailed sources of the answers also becomes much easier.</p><p>In <strong>agentic architectures</strong>, the LLM is equipped with a set of tools for information retrieval and taking action on the user’s behalf. The LLM takes a much more active role in analyzing and dissecting the question, creating a plan for using a set of tools to retrieve information for the different inputs and requested results, and then executing it. 
During execution, the Agent-LLM can choose to call tools multiple times in sequence or loops or even drill down into more complex information retrieval by using previously retrieved results as inputs for new tool calls. While running in the agentic loop, it observes if the data collected is sufficient to answer the user’s question and also applies guardrails and other techniques like re-ranking, judging, and selection by relevancy.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/960/0*gj7tWF-vccGGrp3J" /><figcaption>Agent architecture building blocks</figcaption></figure><p>Due to the structured analysis and parameter extraction for tool calls, the need for vector similarity search decreases. In general, the work of the LLM agent can be seen as similar to a query planning (and execution) process in the database world.</p><p>The tools made available to the agent have a variety of options and shapes, from code execution in (Python) sandboxes to REST API calls to using databases for retrieval to using other LLMs — there are many possibilities. Most tools will be used for information retrieval or processing, and only a few per agent will allow updates and modifying actions.</p><p>Most LLMs today already support tool calls via function calling, which allows you to pass in a list of signatures (name, description, parameters, results) of functions which the LLM then selects from and provides parameters for execution. 
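To make that mechanism concrete, here is a minimal sketch of the dispatch side of function calling, assuming the LLM has already chosen a tool and emitted the call as a JSON object. The tool name and the stubbed data are hypothetical, standing in for a real database-backed tool:

```python
import json

# Hypothetical database-backed tool; here it is stubbed with a dict.
def search_movies_by_actor(name: str, year: int) -> list:
    data = {("Tom Hanks", 1993): ["Forrest Gump", "Apollo 13"]}
    return data.get((name, year), [])

# Registry of callable tools. Their names, descriptions, and parameter
# signatures are what the LLM sees when planning which tool to call.
TOOLS = {"search_movies_by_actor": search_movies_by_actor}

def dispatch(tool_call: str):
    """Execute a tool call the LLM emitted as a JSON object."""
    call = json.loads(tool_call)
    return TOOLS[call["name"]](**call["arguments"])

movies = dispatch(
    '{"name": "search_movies_by_actor",'
    ' "arguments": {"name": "Tom Hanks", "year": 1993}}'
)
```

The agent framework repeats this dispatch step inside the agentic loop, feeding each result back into the conversation until the question can be answered.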
Agent frameworks add a lot of additional functionality — like guardrails, graph-based workflows, execution memory, and the aforementioned techniques.</p><p>While agents are a powerful mechanism that (as shown in the example below) works surprisingly well even for answering complex questions and tasks and keeping conversations flowing smoothly, there are some challenges, including:</p><ul><li>Tool configuration and descriptions have to be consistent and detailed enough for the LLM to make the correct choices.</li><li>Tool execution management was often left to the developer, so resource management was suboptimal unless the developer was familiar with the underlying data infrastructure.</li><li>Database query tools and other tools couldn’t be configured by data analysts because significant extra work (an OpenAPI spec, function-call signatures, Python code for execution) had to be provided.</li><li>The right tool has to be selected, so it needs to be very clear from the name and description what each tool does and which sub-tasks it applies to.</li><li>The tool has to be very specific about the kinds and data types of its parameters so that the LLM can do conversions and translations before passing the parameters to the tool.</li><li>Parameters need to be used in an injection-safe way to prevent code injections, which could lead to data exfiltration or unrestricted modifications.</li><li>Vector search is less relevant in agentic setups: you will rarely pass the user question as a whole phrase to embedding generation and vector similarity search, but rather do structured parameter extraction and specific tool calls.</li></ul><h3>Gen AI Toolbox</h3><p>When it comes to creating tools that access databases, there are usually a number of challenges that can be difficult to tackle with today’s orchestration frameworks: authentication, authorization, sanitization, connection pooling, and more. 
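Sanitization in particular is worth a concrete illustration. A tool that splices user input into the query text can have the query's meaning rewritten, while a parameterized statement keeps the query text fixed. This is an illustrative sketch, not Toolbox code:

```python
# A hostile "actor name" supplied by a user:
user_input = '" OR true RETURN p.secret //'

# Unsafe: the input becomes part of the Cypher text and changes its meaning.
naive_query = f'MATCH (p:Person) WHERE p.name = "{user_input}" RETURN p.name'

# Safe: the statement is fixed; the value travels separately as a parameter
# (e.g. session.run(safe_query, name=user_input) with the Neo4j Python driver).
safe_query = "MATCH (p:Person) WHERE p.name = $name RETURN p.name"
safe_params = {"name": user_input}
```

Toolbox tools use exactly this parameterized style, as the `$name`/`$year` placeholders in the examples below show.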
These challenges quickly become a burden, slowing down development and leaving room for mistakes when implementing the same boilerplate code over and over again.</p><p>Enter Toolbox — the open source server that makes it easier to build Gen AI tools for interacting with databases. It enables you to develop tools more easily, quickly, and securely by handling complexities such as connection pooling, authentication, and more. Toolbox lets you define the tools in a single location and integrate them with your LangGraph agents in less than three lines of code:</p><pre>from toolbox_langchain import ToolboxClient<br><br># update the URL to point to your server<br>client = ToolboxClient(&quot;http://127.0.0.1:5000&quot;)<br><br># these tools can be passed to your agent! <br>tools = await client.aload_toolset()</pre><p>Toolbox is truly open source: as part of the launch, Google Cloud partnered with other database vendors, including Neo4j, to build support for a large number of open source and proprietary databases:</p><ul><li>PostgreSQL (including <a href="https://googleapis.github.io/genai-toolbox/resources/sources/alloydb-pg/">AlloyDB</a> and <a href="https://googleapis.github.io/genai-toolbox/resources/sources/cloud-sql-pg/">Cloud SQL for PostgreSQL</a>)</li><li><a href="https://googleapis.github.io/genai-toolbox/resources/sources/mysql/">MySQL</a> (including <a href="https://googleapis.github.io/genai-toolbox/resources/sources/cloud-sql-mysql/">Cloud SQL for MySQL</a>)</li><li><a href="https://googleapis.github.io/genai-toolbox/resources/sources/mssql/">SQL Server</a> (including <a href="https://googleapis.github.io/genai-toolbox/resources/sources/cloud-sql-mssql/">Cloud SQL for SQL Server</a>)</li><li><a href="https://googleapis.github.io/genai-toolbox/resources/sources/spanner/">Spanner</a></li><li><a href="https://googleapis.github.io/genai-toolbox/resources/sources/neo4j/"><strong>Neo4j</strong></a></li></ul><p>We also have a number of features to help your application hit 
production more quickly — including built-in support for <a href="https://googleapis.github.io/genai-toolbox/resources/authsources/">end-user authentication in tools</a> and built-in <a href="https://googleapis.github.io/genai-toolbox/concepts/telemetry/">observability</a> through OpenTelemetry.</p><h3>Neo4j Cypher Sources and Tools</h3><p>The Neo4j integration Google added to the Gen AI Toolbox supports the definition of sources and tools for Cypher execution. <a href="https://googleapis.github.io/genai-toolbox/resources/sources/">Sources</a> represent how to connect to your Neo4j instance — things like the URI, credentials, etc. <a href="https://googleapis.github.io/genai-toolbox/resources/tools/">Tools</a> represent specific actions your agent can take — such as reading or writing to a source.</p><p><a href="https://neo4j.com/genai">Neo4j</a> is the leading open source graph database — it manages information not as tables but as entities (nodes) and relationships between them, allowing a flexible and powerful representation of connected information. Graphs add unique capabilities for many domains like biomedical, supply chain, manufacturing, fraud detection, and transport logistics. <a href="https://neo4j.com/blog/what-is-knowledge-graph/">Knowledge graphs</a>, which you can think of as digital twins of your organization (people, processes, products, partners, etc.), 
are a great “factual memory” companion to an LLM’s language skills.</p><p>After <a href="https://googleapis.github.io/genai-toolbox/getting-started/local_quickstart/#step-2-install-and-configure-toolbox">fetching the Toolbox binary</a>, you can provide definitions for multiple Neo4j <a href="https://googleapis.github.io/genai-toolbox/resources/sources/neo4j/"><strong>sources</strong></a> by configuring your database connection details (URL, username, password) as named sources in a file called tools.yaml, provided to the Toolbox binary at startup.</p><p>Tools that you define later are tied to a source since, in most cases, the queries are specific to the data model of that dataset.</p><pre>sources:<br>    my-neo4j-source:<br>        kind: &quot;neo4j&quot;<br>        uri: &quot;neo4j+s://xxxx.databases.neo4j.io:7687&quot;<br>        user: &quot;neo4j&quot;<br>        password: &quot;my-password&quot;<br>        database: &quot;neo4j&quot;</pre><p><a href="https://googleapis.github.io/genai-toolbox/resources/tools/">Tools</a> are specific statements that you want to execute against a source. To be usable in an agentic setup, it is really important to describe the tool, parameters, and results in enough detail so that the LLM can reason about its applicability.</p><p>Statements in tools can be <strong>generic</strong> — like fetching a single entity by a predicate, getting the first- or second-degree neighborhood, or even finding the shortest paths between entities. They can be <strong>specific</strong>, written by a subject-matter expert, following a more complex set of paths in your data (retrieving the gene or protein expression targets of a drug or getting product recommendations for a user, for example). 
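As a sketch of what such a specific, expert-written tool could look like, here is a hypothetical product-recommendation tool in the same tools.yaml format. The `User`/`Product` labels, the `BOUGHT` relationship type, and the query itself are illustrative assumptions, not taken from a real dataset:

```yaml
tools:
  recommend_products:
    kind: neo4j-cypher
    source: my-neo4j-source
    statement: |
      MATCH (u:User {id: $user_id})-[:BOUGHT]->(:Product)
            <-[:BOUGHT]-(:User)-[:BOUGHT]->(rec:Product)
      WHERE NOT (u)-[:BOUGHT]->(rec)
      RETURN rec.name AS product, count(*) AS score
      ORDER BY score DESC LIMIT 5
    description: |
      Product recommendations for a user, based on what other buyers of
      the same products also bought. Takes a user ID and returns up to
      five product names with a relevance score.
    parameters:
      - name: user_id
        type: string
        description: ID of the user to recommend products for
```

The domain knowledge lives entirely in the Cypher statement, so a data analyst can author such a tool without writing any agent code.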
Tools can also <strong>update data</strong> in the graph by updating attributes or adding nodes and relationships representing business concepts.</p><p>They can also make use of advanced functionality like running graph algorithms for clustering, doing spatial operations, or executing vector similarity or full-text search for text phrases.</p><p>Below, you see a simple example of a <a href="https://googleapis.github.io/genai-toolbox/resources/tools/neo4j-cypher/">neo4j-cypher tool</a> retrieving a set of movies by actor and minimum release year. Note the detailed description of the tool and the parameters. That allows the LLM to reason and choose the right tool at each step of planning and execution.</p><p>Each tool has the following configuration:</p><ul><li>A name</li><li>An associated source</li><li>The Cypher query statement using parameters, doing the graph pattern match, and returning results</li><li>A description of the tool/query</li><li>Name, type, and description of each parameter</li></ul><pre>tools:<br>  search_movies_by_actor:<br>    kind: neo4j-cypher<br>    source: my-neo4j-movies-instance<br>    statement: |<br>      MATCH (m:Movie)&lt;-[:ACTED_IN]-(p:Person)<br>      WHERE p.name = $name AND m.year &gt; $year<br>      RETURN m.title, m.year<br>      ORDER BY m.year DESC<br>      LIMIT 10<br>    description: |<br>      Use this tool to get a list of movies for a specific actor and a given<br>      minimum release year. Takes a full actor name, e.g. &quot;Tom Hanks&quot;, and a<br>      year, e.g. 1993, and returns a list of movie titles and release years.<br>      Do NOT use this tool with a movie title. Do NOT guess an actor name.<br>      Do NOT guess a year. An actor name is a full name with first and last<br>      name separated by a space. For example, if given &quot;Hanks, Tom&quot;, the<br>      actor name is &quot;Tom Hanks&quot;. If the tool returns more than one option,<br>      choose the most recent movies.<br>      Example:<br>      {{<br>          &quot;name&quot;: &quot;Meg Ryan&quot;,<br>          &quot;year&quot;: 1993<br>      }}<br>      Example:<br>      {{<br>          &quot;name&quot;: &quot;Clint Eastwood&quot;,<br>          &quot;year&quot;: 2000<br>      }}<br>    parameters:<br>      - name: name<br>        type: string<br>        description: Full actor name, &quot;firstname lastname&quot;<br>      - name: year<br>        type: integer<br>        description: 4-digit year from 1900 up to the current year</pre><h3>Investment Research Agent</h3><p>This is a demonstration of an agentic LangChain application with tools that use GraphRAG patterns combining full-text and graph search.</p><p>The example represents an <strong>investment research agent</strong> that can be used to find recent news about companies, their investors, competitors, partners, and industries. 
It is powered by data from the Diffbot knowledge graph that was imported into Neo4j.</p><p>The code for the example can be found in <a href="https://github.com/jexp/toolbox-companies">this repository</a>:</p><p><a href="https://github.com/jexp/toolbox-companies">GitHub - jexp/toolbox-companies: Google AI Toolbox Investment Research Agent</a></p><h3>Companies and Articles: Diffbot Dataset</h3><p>The dataset is a graph of companies, the industries associated with them, people who work at or invested in the companies, and articles that report on those companies.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*sNEwrOwyosb_E_Lt" /><figcaption>Article, industry, and company graph</figcaption></figure><p>The news articles are chunked, and the chunks are also stored in the graph.</p><p>The database is publicly available with a read-only user. You can explore the data at <a href="https://demo.neo4jlabs.com:7473/browser/">https://demo.neo4jlabs.com:7473/browser/</a>.</p><ul><li>URI: neo4j+s://demo.neo4jlabs.com</li><li>User: companies</li><li>Password: companies</li><li>Database: companies</li></ul><p>We use the agentic LangChain integration with Vertex AI, which allows us to pass the tools we registered with Toolbox automatically to the LLM for tool calling. 
We will use hybrid search, as well as parent-child retrievers and GraphRAG (extract relevant context).</p><p>In our configuration, we provide tools that make use of the full-text index, as well as graph retrieval queries, which fetch the following additional information:</p><ul><li>Parent article of the Chunk (aggregate all chunks for a single article)</li><li>Organization(s) mentioned</li><li>IndustryCategory(ies) for the Organization</li><li>Person(s) connected to the Organization and their roles (e.g., Investor, Chairman, CEO)</li></ul><h3>Tools</h3><p>These are the tools with specific queries that we make available in Toolbox for the agent to use.</p><p>Each tool takes parameters from the agent, executes a graph query, and returns structured results:</p><ul><li>industries — List of industry category names</li><li>companies_in_industry — Companies (ID, name, summary) in a given industry by industry</li><li>companies — List of companies (ID, name, summary) by full-text search</li><li>articles_in_month — List of articles (ID, author, title, date, sentiment) in a month timeframe from the given date (yyyy-mm-dd)</li><li>article — Single article details (ID, author, title, date, sentiment, site, summary, content) by article ID</li><li>companies_in_articles — Companies (ID, name, summary) mentioned in articles by list of article IDs</li><li>people_at_company — People (name, role) associated with a company by company ID</li></ul><h3>Configuration in Toolbox</h3><h4>Neo4j Source Configuration</h4><p>In the sources section of our tools.yaml file, we configure the public Neo4j companies graph database connection as a source:</p><pre>sources:<br>    companies-graph:<br>        kind: &quot;neo4j&quot;<br>        uri: &quot;neo4j+s://demo.neo4jlabs.com&quot;<br>        user: &quot;companies&quot;<br>        password: &quot;companies&quot;<br>        database: &quot;companies&quot;</pre><h3>Neo4j Cypher Tools Setup</h3><p>Now we can configure the first two tools we want to 
test with the Toolbox; the other tools will be configured and described in a similar way as shown below:</p><ul><li>industries — List of industry names</li><li>companies_in_industry — Companies (ID, name, summary) in a given industry by industry</li></ul><pre>tools:<br>  industries:<br>    kind: neo4j-cypher<br>    source: companies-graph<br>    statement: |<br>      MATCH (i:IndustryCategory) <br>      RETURN i.name as industry<br>    description: List of Industry names<br><br>  companies_in_industry:<br>    kind: neo4j-cypher<br>    source: companies-graph<br>    statement: |<br>      MATCH (:IndustryCategory {name:$industry})&lt;-[:HAS_CATEGORY]-(c) <br>      WHERE NOT EXISTS { (c)&lt;-[:HAS_SUBSIDARY]-() }<br>      RETURN c.id as company_id, c.name as name, c.summary as summary<br>    description: Companies (company_id, name, summary) in a given industry by industry<br>    parameters:<br>      - name: industry<br>        type: string<br>        description: Industry name to filter companies by</pre><p>After starting Toolbox with our YAML configuration file, it serves the tool definitions via an API, which can be retrieved with the ToolboxClient and passed to LangChain:</p><pre># one of linux/amd64, darwin/arm64, darwin/amd64, or windows/amd64<br>export OS=&quot;linux/amd64&quot; <br>curl -O https://storage.googleapis.com/genai-toolbox/v0.1.0/$OS/toolbox<br>chmod +x toolbox<br><br>./toolbox --tools_file tools.yaml<br>2025-02-13T13:50:22.052198+01:00 INFO &quot;Initialized 1 sources.&quot; <br>2025-02-13T13:50:22.052967+01:00 INFO &quot;Initialized 0 authSources.&quot; <br>2025-02-13T13:50:22.053095+01:00 INFO &quot;Initialized 2 tools.&quot; <br>2025-02-13T13:50:22.053146+01:00 INFO &quot;Initialized 1 toolsets.&quot; <br>2025-02-13T13:50:22.053742+01:00 INFO &quot;Server ready to serve!&quot;</pre><p>We can now use LangChain with the Gemini 2.0 Flash model and feed our tool definitions to the Model to do a quick test. 
We can follow the <a href="https://googleapis.github.io/genai-toolbox/getting-started/local_quickstart/#step-3-connect-your-agent-to-toolbox">Quickstart example</a> in the Toolbox documentation.</p><pre>from langgraph.prebuilt import create_react_agent<br>from langchain_google_vertexai import ChatVertexAI<br>from langgraph.checkpoint.memory import MemorySaver<br><br>from toolbox_langchain import ToolboxClient<br><br>prompt = &quot;&quot;&quot;<br>  You&#39;re a helpful investment research assistant. <br>  You can use the provided tools to search for companies, <br>  people at companies, industries, and news articles from 2023.<br>  Don&#39;t ask for confirmations from the user.<br>  User: <br>&quot;&quot;&quot;<br><br>queries = [<br>    &quot;What industries deal with computer manufacturing?&quot;,<br>    &quot;List 5 companies in the computer manufacturing industry &quot;<br>    &quot;with their description.&quot;,<br>]<br><br>def main():<br>    model = ChatVertexAI(model_name=&quot;gemini-2.0-flash&quot;)<br><br>    client = ToolboxClient(&quot;http://127.0.0.1:5000&quot;)<br>    tools = client.load_toolset()<br><br>    agent = create_react_agent(model, tools, checkpointer=MemorySaver())<br><br>    config = {&quot;configurable&quot;: {&quot;thread_id&quot;: &quot;thread-1&quot;}}<br>    for query in queries:<br>        inputs = {&quot;messages&quot;: [(&quot;user&quot;, prompt + query)]}<br>        response = agent.invoke(inputs, stream_mode=&quot;values&quot;, config=config)<br>        print(response[&quot;messages&quot;][-1].content)<br><br>main()</pre><p>The outputs we get are:</p><pre>The industries that deal with computer manufacturing are: <br>Computer Hardware Companies, Electronic Products Manufacturers, <br>and Computer Storage Companies.<br><br>Here are 5 companies in the computer hardware industry along with their <br>descriptions:<br><br>1.  **Microsoft Egypt:** Microsoft branch in Egypt<br>2.  
**Apigee:** Software company based in San Jose, California, United States and owned by Google<br>3.  **Microsemi:** Communications corporation<br>4.  **Intermec:** American electronics manufacturing company<br>5.  **Elitegroup Computer Systems:** No summary available</pre><p>Now that this works, we can add the other tools as well:</p><ul><li>companies — List of companies (ID, name, summary) by full-text search</li><li>articles_in_month — List of articles (ID, author, title, date, sentiment) in a month timeframe from the given date (yyyy-mm-dd)</li><li>article — Single article details (ID, author, title, date, sentiment, site, summary, content) by article ID</li><li>companies_in_articles — Companies (ID, name, summary) mentioned in articles by list of article IDs</li><li>people_at_company — People (name, role) associated with a company by company ID</li></ul><p>Please note that to handle array parameters, you have to provide an “items” subgroup that specifies the sub-type and description:</p><pre>tools:<br><br>  companies:<br>    kind: neo4j-cypher<br>    source: companies-graph<br>    statement: |<br>      CALL db.index.fulltext.queryNodes(&#39;entity&#39;, $search, {limit: 100}) <br>      YIELD node as c, score WHERE c:Organization <br>      AND NOT EXISTS { (c)&lt;-[:HAS_SUBSIDARY]-() }<br>      RETURN c.id as company_id, c.name as name, c.summary as summary<br>    description: List of Companies (company_id, name, summary) by fulltext search<br>    parameters:<br>      - name: search<br>        type: string<br>        description: Part of a name of a company to search for<br><br>  articles_in_month:<br>    kind: neo4j-cypher<br>    source: companies-graph<br>    statement: |<br>      MATCH (a:Article)<br>      WHERE date($date) &lt;= date(a.date) &lt; date($date) + duration(&#39;P1M&#39;)<br>      RETURN a.id as article_id, a.author as author, a.title as title, toString(a.date) as date, a.sentiment as sentiment<br>      LIMIT 25<br>    description: List of 
Articles (article_id, author, title, date, sentiment) in a month timeframe from the given date<br>    parameters:<br>      - name: date<br>        type: string<br>        description: Start date in yyyy-mm-dd format<br><br>  article:<br>    kind: neo4j-cypher<br>    source: companies-graph<br>    statement: |<br>      MATCH (a:Article)-[:HAS_CHUNK]-&gt;(c:Chunk)<br>      WHERE a.id = $article_id<br>      WITH a, c ORDER BY id(c) ASC<br>      WITH a, collect(c.text) as contents<br>      RETURN a.id as article_id, a.author as author, a.title as title, toString(a.date) as date,<br>      a.summary as summary, a.siteName as site, a.sentiment as sentiment, apoc.text.join(contents, &#39; &#39;) as content<br>    description: Single Article details (article_id, author, title, date, sentiment, site, summary, content) by article id<br>    parameters:<br>      - name: article_id<br>        type: string<br>        description: ID of the article to retrieve<br><br>  companies_in_articles:<br>    kind: neo4j-cypher<br>    source: companies-graph<br>    statement: |<br>      MATCH (a:Article)-[:MENTIONS]-&gt;(c)<br>      WHERE a.id = $article_id AND not exists { (c)&lt;-[:HAS_SUBSIDARY]-() }<br>      RETURN c.id as company_id, c.name as name, c.summary as summary<br>    description: Companies (company_id, name, summary) mentioned in articles by article id<br>    parameters:<br>      - name: article_id<br>        type: string<br>        description: Article id to find companies mentioned in<br><br>  people_at_company:<br>    kind: neo4j-cypher<br>    source: companies-graph<br>    statement: |<br>      MATCH (c:Organization)-[role]-(p:Person) WHERE c.id = $company_id<br>      RETURN replace(type(role),&quot;HAS_&quot;,&quot;&quot;) as role, p.name as person_name, c.id as company_id, c.name as company_name<br>    description: People (person_name, role, company_id, company_name) associated with a company by company id<br>    parameters:<br>      - name: company_id<br>        type: 
string<br>        description: Company id to find associated people for</pre><h3>Using the Agent — Conversation</h3><p>I’ve been impressed by the following conversations and the LLM’s ability to keep using the tools correctly until all the information needed to answer the question has been collected.</p><p>The LLM doesn’t just call each endpoint once; it also refers back to information from the conversation history and ensures that the relevant context is taken into account.</p><p>It can fix misspelled or incorrect parameters, reformat dates into the required shape, and more. API functions are called multiple times and can cascade if needed — for instance, retrieving a list of IDs for a certain context and then making individual calls for each ID to retrieve details.</p><p>It is also able to resolve pronouns and demonstratives (e.g., “these” is resolved to a concrete list of company IDs or articles that are then used to make multiple API calls).</p><p>For each question, I’ll comment on what makes it tricky and how the LLM agent resolved it.</p><p>Let’s put some questions to our research agent and see how it fares:</p><ul><li>What five industries deal with neurological implants?</li><li>List three companies from those industries with their description and filter afterward by California.</li><li>Who is working at these companies?</li><li>Which companies are mentioned in the news in January 2023 with positive sentiment?</li><li>Who is working there as board members?</li></ul><p>The answers:</p><p><strong>What 5 industries deal with neurological implants?</strong></p><p><em>Initially, we do an easy one — the task here is to call the industries endpoint, but then only limit it to five results (the endpoint takes no limit argument). 
Also, we’re vague about the actual industry.</em></p><p>Based on the available industries, the following seem most relevant to neurological implants:</p><ul><li>Medical technology companies</li><li>Healthcare companies</li><li>Biotechnology companies</li><li>Pharmaceutical companies</li><li>Robotics companies</li></ul><p><strong>List three companies from those industries with their description and filter afterward by California.</strong></p><p><em>Using prior information, fixing parameters, and post-filtering. Here in this question, we provide general back-reference to the industries and an additional location that is not a parameter but needs to be post-filtered.</em></p><p><em>It uses the prior information in the conversation to fix the industry name and applies the post-filter and limit to only show three California companies.</em></p><p>OK, I have a lot of companies. I will filter for companies located in California, then list five of them with their descriptions:</p><ul><li><strong>goBalto:</strong> (biotechnology companies, pharmaceutical companies) Software company based in San Francisco, California, United States</li><li><strong>Quantenna Communications:</strong> (semiconductor companies, electronic products manufacturers) Manufacturing company based in San Jose, California, United States and owned by ON Semiconductor</li><li><strong>…</strong></li></ul><p><strong>Who is working at these companies?</strong></p><p><em>In our next question, we reference back with a “these” for the companies and ask for details that are available in a separate endpoint. 
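This fan-out pattern (a remembered list of IDs, then one tool call per ID) is easy to picture in code. Here is a sketch with a stubbed people_at_company tool; the stub data echoes the answers shown below, and the structure is an assumption about what the agent does internally:

```python
# Stub of the people_at_company tool, keyed by company ID (illustrative data).
PEOPLE = {
    "goBalto": [("Sujay Jadhav", "CEO"), ("Darren Hite", "BOARD_MEMBER")],
    "Quantenna": [("Sam Heidari", "CEO")],
}

def people_at_company(company_id: str) -> list:
    """Single-ID tool: people and roles for one company."""
    return PEOPLE.get(company_id, [])

# The agent remembers the company IDs from earlier in the conversation
# and calls the single-ID tool once per company, then aggregates.
remembered_ids = ["goBalto", "Quantenna"]
staff = {cid: people_at_company(cid) for cid in remembered_ids}
```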
But the agent has to call the people_at_company endpoint multiple times, as it only takes a single company ID.</em></p><p>Here are some people working at the listed companies:</p><ul><li><strong>goBalto:</strong> Sujay Jadhav (CEO), Darren Hite (BOARD_MEMBER), Masashi Kiyomine (BOARD_MEMBER)</li><li><strong>Quantenna Communications:</strong> Sam Heidari (CEO), Dmitry Akhanov (BOARD_MEMBER), Edwin B Hooper III “Ned” (BOARD_MEMBER), Jack Lazar (BOARD_MEMBER), Mark Stevens (BOARD_MEMBER)</li><li><strong>Finisar:</strong> Bob Bashaw (CEO), Mary Jane Raymond (BOARD_MEMBER)</li><li><strong>…</strong></li></ul><p><strong>Which companies are mentioned in the news in January 2023 with positive sentiment?</strong></p><p><em>Here, we want to do two things, switching the conversation from companies to articles without restarting the session. We give a partial date of January 2023, which has to be converted to the format yyyy-mm-dd for the API call. It has to understand that “good sentiment” refers to the numeric score and that we want to have the highest score. The list of articles that come with sentiment output, the model needs to read, understand, sort by sentiment descending, select the highest-ranked articles, and output their titles and the companies mentioned in these. 
It is not the same endpoint, so it needs to take the article IDs and check the companies_in_article tool for each article individually.</em></p><p>Here are the articles from January 2023 with positive sentiment scores and the companies they mention:</p><ul><li><strong>BackBox Unveils Customized Product Updates (0.886):</strong> I don’t have the tools to identify the companies mentioned in this article.</li><li><strong>Mozilla’s Pocket App Just Got a Big Update (0.838, 0.83):</strong> Mozilla</li><li><strong>Onapsis and Wipro help enterprises secure their SAP applications (0.612):</strong> Onapsis, Wipro, SAP</li><li><strong>Onapsis Announces Collaboration with Wipro to Remove Security as a Digital Transformation Roadblock (0.804, 0.735):</strong> Onapsis, Wipro</li><li><strong>Houston Metro and Lone Star college launch new pilot service for Lone Star College-Houston North Fallbrook campus (0.816):</strong> Lone Star College</li><li><strong>…</strong></li></ul><p><strong>Who is working there as board members?</strong></p><p><em>Here, we refer back to the companies from before with “there” and also specify the roles requirement for the API call choice only vaguely. The agent needs to call the people_at_company endpoint multiple times with the IDs of the companies retrieved earlier in the session history, then aggregate and filter the information into the response.</em></p><p>I wasn’t able to retrieve board members for all of the companies. Here’s the information I found:</p><ul><li><strong>Wipro:<br></strong>Azim Premji (BOARD_MEMBER)<br>Rishad Premji (BOARD_MEMBER)</li><li><strong>SAP:<br></strong>Hasso Plattner (BOARD_MEMBER)<br>Jennifer Morgan (BOARD_MEMBER)</li><li><strong>Google:<br></strong>Larry Page (BOARD_MEMBER)<br>Sergey Brin (BOARD_MEMBER)</li></ul><h3>Deployment</h3><p>The deployment of the Toolbox server and the LangGraph application is outside the scope of this article. 
Please refer to the <a href="https://googleapis.github.io/genai-toolbox/how-to/deploy_toolbox/">documentation for GenAI Toolbox</a> and <a href="https://codelabs.developers.google.com/codelabs/build-and-deploy-a-langchain-app-on-cloud-run">LangChain on Cloud Run</a>.</p><h3>Summary</h3><p>We’ve demonstrated how you can build agentic applications that integrate database-backed tools easily with the Google Gen AI Toolbox. While we focused on a <a href="https://graphrag.com">GraphRAG</a> use case with a knowledge graph in Neo4j, you can combine different data sources and types of databases in the same agentic application. Since Toolbox integrates with different agent frameworks, you can define your tools once and reuse them for several use cases. Since Toolbox is open source, contributing to it was straightforward. We hope you will make use of the <a href="https://github.com/googleapis/genai-toolbox">feedback and discussion mechanisms</a> to add your own data sources and contribute improvements.</p><p>You can learn more about knowledge graphs in <a href="https://neo4j.com/blog/genai/what-is-knowledge-graph">What Is a Knowledge Graph?</a> And be sure to check out <a href="https://neo4j.com/blog/genai/graphrag-manifesto/">The GraphRAG Manifesto: Adding Knowledge to GenAI</a>.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=86526659b46a" width="1" height="1" alt=""><hr><p><a href="https://medium.com/neo4j/building-ai-agents-with-the-google-gen-ai-toolbox-and-neo4j-knowledge-graphs-86526659b46a">Building AI Agents With the Google Gen AI Toolbox and Neo4j Knowledge Graphs</a> was originally published in <a href="https://medium.com/neo4j">Neo4j Developer Blog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[LLM Knowledge Graph Builder — First Release of 2025]]></title>
            <link>https://medium.com/neo4j/llm-knowledge-graph-builder-first-release-of-2025-532828c4ba76?source=rss-3865848842f9------2</link>
            <guid isPermaLink="false">https://medium.com/p/532828c4ba76</guid>
            <category><![CDATA[developer-experience]]></category>
            <category><![CDATA[knowledge-graph]]></category>
            <category><![CDATA[graphrag]]></category>
            <category><![CDATA[genai]]></category>
            <category><![CDATA[retrieval-augmented-gen]]></category>
            <dc:creator><![CDATA[Michael Hunger]]></dc:creator>
            <pubDate>Wed, 05 Feb 2025 13:26:58 GMT</pubDate>
            <atom:updated>2025-02-05T19:00:31.688Z</atom:updated>
            <content:encoded><![CDATA[<h3>LLM Knowledge Graph Builder — First Release of 2025</h3><h4>New features include community summaries, parallel retrievers, and expanded model support for better knowledge graph construction from text</h4><p>Many developers try to build <a href="https://neo4j.com/blog/what-is-retrieval-augmented-generation-rag/">retrieval-augmented generation</a> (RAG) experiences to interact with information from unstructured data using only vector search and struggle to get the results they want. Looking at text fragments without context only gets you so far. As usual in data engineering, there are more <a href="https://graphrag.com">advanced patterns</a> for preprocessing the data and extracting knowledge, <a href="https://neo4j.com/blog/what-is-graphrag/#:~:text=GraphRAG%20is%20a%20powerful%20retrieval,for%20trustworthy%20and%20reliable%20results.">one of which is GraphRAG</a>. That way, by the time you use the data, you’ve already surfaced the underlying concepts and can use them to connect the pieces and provide relevant context for a user’s questions.</p><h3>Introduction</h3><p>We built, <a href="https://github.com/neo4j-labs/llm-graph-builder">open-sourced</a>, and hosted the <a href="https://llm-graph-builder.neo4jlabs.com/">LLM Knowledge Graph Builder</a> to let you try out better ways of treating your unstructured data. We preprocess documents, transcripts, web articles, and other sources <em>into chunks, compute text embeddings,</em> and connect them (lexical graph).</p><p>But we don’t stop there. 
We also <em>extract entities and their relationships, </em>which is especially relevant if you ingest multiple documents because you can relate the pieces spread out over multiple sources (entity graph).</p><p>This combined knowledge graph then enables a set of different retrievers to fetch data (see below).</p><p>Since we launched the <a href="https://medium.com/neo4j/get-started-with-graphrag-neo4js-ecosystem-tools-eec476167e86">LLM Knowledge Graph Builder in June 2024</a>, we’ve had an impressive amount of usage and great feedback from users. It’s now the fourth most popular source of user interaction on AuraDB Free, which makes us really happy.</p><p>We shipped a release in fall 2024, but a packed schedule of AI events left no time to write a blog post about it. Over the past few months, the team worked on really nice features — some of which we want to introduce today in the first release of 2025.</p><h3>What Does the LLM Knowledge Graph Builder Do?</h3><p>For those of you who don’t know what the tool does, here’s a quick introduction.</p><p>If you have a number of text documents, web articles, Wikipedia pages, or similar unstructured information, wouldn’t it be great to surface all the knowledge hidden inside those in a structured way and then use those entities and their relationships to better <strong><em>chat with your data</em></strong>?</p><p>The LLM Knowledge Graph Builder:</p><ol><li>Imports your documents</li><li>Splits them into chunks and links them up</li><li>Generates text embeddings for vector search and connects the most similar ones</li><li>Uses a variety of large language models (LLMs) to extract entities and their relationships</li><li>Optionally applies a graph schema you provide</li><li>Stores the nodes and relationships in Neo4j</li><li>When running against a graph data science-enabled Neo4j instance, also performs topic clustering and summarization</li></ol><p>Get a quick overview of the process and try it out at <a 
href="https://llm-graph-builder.neo4jlabs.com">https://llm-graph-builder.neo4jlabs.com</a>.</p><figure><a href="https://llm-graph-builder.neo4jlabs.com/"><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*kk_YGSw0ScCgdR3LgoFbmg.png" /></a></figure><p>The only prerequisite is a publicly accessible Neo4j instance to store your data, which you can <a href="https://console.neo4j.io">create on AuraDB Free</a> (or Aura Pro Trial with Graph Data Science).</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FLlNy5VmV290%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DLlNy5VmV290&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FLlNy5VmV290%2Fhqdefault.jpg&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/a6a1fc0888c952295da39fbf8e69e98f/href">https://medium.com/media/a6a1fc0888c952295da39fbf8e69e98f/href</a></iframe><h3>Under the Hood Blog Series</h3><p>Over the next few weeks, we have a <a href="https://medium.com/neo4j/tagged/knowledge-graph">series of blog posts</a> that explore different aspects of the LLM Knowledge Graph Builder and explain how they work under the hood, so you can benefit from our learnings and apply them to your own GenAI projects.</p><h3>New Features</h3><p>Let’s dive into the new features. The main ones are <strong>generating Community Summaries</strong> and new local and global retrievers for it, as well as the new ability of running <strong>multiple retrievers</strong> in parallel for your questions and <strong>evaluating</strong> them. 
You can now also guide the extraction with <strong>custom prompt instructions</strong>.</p><p>Additionally, we have some user experience improvements to highlight.</p><h3>More Models</h3><p>In development mode or in self-deployment, we tested and configured the LLM Knowledge Graph Builder with a plethora of new models, which we also hint at in the production version:</p><ul><li>OpenAI GPT-4o (and GPT-4o mini)</li><li>Google Gemini 1.5 and 2.0 Pro and Flash</li><li>Qwen 2.5</li><li>Amazon Nova models</li><li>Groq</li><li>Llama 3.x</li><li>Ollama models</li><li>Claude 3.5 Sonnet</li><li>DeepSeek and Microsoft Phi-4, coming soon</li></ul><p>We also test the models internally for integration testing and to see how well they work for extraction.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*xSnt6EXgEYZVZoGtzh03iA.png" /><figcaption>Test results for different LLM extractions with entity counts and runtimes</figcaption></figure><h3>Community Summaries</h3><p>One way you can improve the richness of the graph representing your documents is by running graph algorithms to extract additional information hidden in the graph structure.</p><p>Microsoft did that last year with the “<a href="https://graphrag.com/appendices/research/2404.16130/">From local to global — Query Focused Summarization GraphRAG</a>” paper. They used a hierarchical graph clustering algorithm (<a href="https://en.wikipedia.org/wiki/Leiden_algorithm">Leiden</a>) on the extracted entity domain graph. This algorithm identifies clusters of closely related entities. An LLM then summarizes their content into community summary nodes, which represent the information in that set of entities and relationships. 
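Schematically, the bottom-up summarization over such a cluster hierarchy looks like this (a toy Python sketch with a stubbed LLM call and made-up community IDs, not the tool's actual code):

```python
# Toy hierarchy: level-0 communities group entities; level-1 communities
# group child communities. IDs and members are illustrative only.
communities = {
    "c0": {"level": 0, "members": ["Neo4j", "Cypher"], "parent": "c2"},
    "c1": {"level": 0, "members": ["Leiden", "GDS"], "parent": "c2"},
    "c2": {"level": 1, "members": ["c0", "c1"], "parent": None},
}

def summarize(texts):
    # Placeholder for the LLM call that condenses the input texts.
    return "Summary of: " + ", ".join(texts)

def build_summaries(communities):
    """Summarize communities level by level, so higher-level summaries
    are built from the summaries of their child communities."""
    summaries = {}
    for level in sorted({c["level"] for c in communities.values()}):
        for cid, c in communities.items():
            if c["level"] != level:
                continue
            if level == 0:
                summaries[cid] = summarize(c["members"])
            else:
                summaries[cid] = summarize(summaries[m] for m in c["members"])
    return summaries

print(build_summaries(communities)["c2"])
```

In the real tool, the hierarchy comes from the Leiden algorithm and the summaries are stored as nodes linked to their communities.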
Due to the hierarchical results, this can be done on multiple levels going from very fine-grained to the highest level.</p><blockquote>A more detailed blog post on evaluation will be available later in the series.</blockquote><p>The extracted communities are visible in the graph visualization of your documents, so you can inspect them and their texts, and see which entities they are summarizing.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*kXA9HO5Ku4tZZqSE" /><figcaption>Graph visualization with entities and summarized communities</figcaption></figure><p>Those community summaries are then used in a <a href="https://graphrag.com/reference/graphrag/global-community-summary-retriever/">global retriever</a> to answer general questions about the documents, which aren’t specific to any one entity but identify topics across documents. So instead of the vertical sequence of content of each document, they represent themes spanning individual documents.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*-cJG0OhdgpHP8OG_" /><figcaption>Clustering topics across documents with graph algorithms</figcaption></figure><p>In LLM Knowledge Graph Builder, we implemented it using the <a href="https://neo4j.com/docs/graph-data-science/current/algorithms/leiden/">same clustering algorithm</a>, so if you connect to a Neo4j instance that has graph data science enabled (AuraPro with GDS, AuraDS, Neo4j Sandbox, or self-hosted — which we show as ⚛ on top of the app), and enable the Community Summarization in the Graph Enhancements &gt; Post Processing, we can run the algorithm and create the tree of summaries.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*nnQyRkzhB2vvITh6" /><figcaption>Post Processing Jobs</figcaption></figure><p>In our <strong>global community retriever,</strong> we took a slightly different approach from the paper, which just stuffs all the community summaries (of a certain level) into several LLM 
prompts to answer a question. We generate vector embeddings of the community summaries and use similarity and full-text search to find the most relevant ones for the question (an approach Microsoft Research recently also proposed) and use those to answer the question.</p><p>Since we can now run multiple retrievers in parallel, you can see the difference for yourself. Check out the global and local entity retrievers comparison.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*p3f3XQm8d1pUjPfz" /><figcaption>Global Community Retriever</figcaption></figure><p>Both retrievers, like all the others, allow showing the retrieved contextual graph data (communities, entities, chunks) that went into generating the answer, supporting explainability.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*j6F7x_Vx_wxNbZzo" /><figcaption>Global Community Retriever retrieval details</figcaption></figure><h3>Local Entity Retriever</h3><p>The <a href="https://graphrag.com/reference/graphrag/local-retriever/">local entity retriever</a> turns what is several thousand lines of Python code in the Microsoft paper implementation into about 50 lines of Cypher (one benefit of using an actual graph database), fetching the following:</p><ul><li>Entities with hybrid search (embedding and full-text)</li><li>The relationships between them</li><li>The most common relationships to entities outside the initial set</li><li>The chunks and documents from where the entities were extracted</li><li>The community summaries the entities are part of</li></ul><pre>// previous hybrid search on entities, then graph expansion<br>WITH collect(node) AS nodes,<br>    avg(score) AS score,<br>    collect({id: elementId(node), score: score}) AS metadata<br><br>RETURN score, nodes, metadata,<br><br>    collect {<br>        UNWIND nodes AS n<br>        MATCH (n)&lt;-[:HAS_ENTITY]-(c:Chunk)<br>        WITH c, count(distinct n) AS freq<br>        RETURN c<br>        ORDER BY freq 
DESC<br>        LIMIT 3<br>    } AS chunks,<br><br>    collect {<br>        UNWIND nodes AS n<br>        OPTIONAL MATCH (n)-[:IN_COMMUNITY]-&gt;(c:__Community__)<br>        WITH c, c.community_rank AS rank, c.weight AS weight<br>        RETURN c<br>        ORDER BY rank, weight DESC<br>        LIMIT 3<br>    } AS communities,<br><br>    collect {<br>        UNWIND nodes AS n<br>        UNWIND nodes AS m<br>        MATCH (n)-[r]-&gt;(m)<br>        RETURN DISTINCT r<br>    } AS rels,<br><br>    collect {<br>        UNWIND nodes AS n<br>        MATCH path = (n)-[r]-(m:__Entity__)<br>        WHERE NOT m IN nodes<br>        WITH m, collect(distinct r) AS rels, count(*) AS freq<br>        ORDER BY freq DESC<br>        LIMIT 10<br>        WITH collect(m) AS outsideNodes, apoc.coll.flatten(collect(rels)) AS rels<br>        RETURN { nodes: outsideNodes, rels: rels }<br>    } AS outside</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*ZD7Kzh6nN3-YpUff" /><figcaption>Local entity retriever</figcaption></figure><p>For the entity retriever, we can show not only text chunks but also the entities and their retrieved relationships.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*ZjtFU4KVzknsvA7w" /><figcaption>Local entity retriever result details</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*jSzh1nrGTKeKNapy" /><figcaption>Community summaries used in the local entity retriever</figcaption></figure><h3>Multi-Retrievers</h3><p>As shown in the previous section, you can now select one or more retrievers to run in parallel to generate an answer to your question and switch directly between the results to compare them.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/738/0*LmYm0qyYiuAboeB0" /></figure><p>In the Details link after each answer, the retriever also provides the contextual facts retrieved from the database, which are then sent to the LLM, as well as additional information 
about the model, runtime, and token counts.</p><p>In development mode or in self-hosted deployments, even more retrievers are available for you to test and compare.</p><p>To make this easier, the narrow right conversation sidebar can be maximized and even popped out into a full frame, which also becomes shareable. This is especially useful with read-only database connection setups, which we now also support for sharing your generated knowledge graph.</p><p>The underlying data of your conversation can be downloaded as a JSON file, so you can process it according to your own needs.</p><h3>Retriever Evaluation</h3><p>One reason for running retrievers in parallel was to add the ability to generate evaluation metrics.</p><p>We are using the RAGAs framework to run the evaluation. Currently, we compute the following metrics, some of which you need to provide a ground truth for:</p><ul><li>Relevancy — How well the answer addresses the user’s question</li><li>Faithfulness — How accurately the answer reflects the provided information</li><li>Context relevance — Determines the recall of entities present in the generated answer and the retrieved contexts</li><li>Semantic relevance — How well the generated answer captures the meaning of the reference answer</li><li>ROUGE — Word-by-word similarity to the ground-truth answer</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*zxECJ_fHh2H3kit5xfxPNA.png" /><figcaption>Retriever evaluation</figcaption></figure><blockquote>There will be a more detailed blog post on evaluation later in the series.</blockquote><h3>Guided Instructions for Extraction</h3><p>In the latest version, we added the ability to guide the extraction more by allowing users to pass additional prompts to the LLM for extracting entities. 
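Conceptually, these extra instructions are simply appended to the extraction prompt sent to the LLM. A minimal sketch (the prompt text and function are illustrative, not the app's actual template):

```python
# Illustrative base prompt; the real application uses its own template.
BASE_PROMPT = (
    "Extract entities and relationships from the following text.\n"
    "Return them as (subject, relation, object) triples.\n"
)

def build_extraction_prompt(text, extra_instructions=None):
    """Append optional user-supplied guidance before the text to process."""
    prompt = BASE_PROMPT
    if extra_instructions:
        prompt += f"Additional instructions: {extra_instructions}\n"
    return prompt + f"Text:\n{text}"

print(build_extraction_prompt(
    "Albert Einstein received the Nobel Prize.",
    extra_instructions="Ignore anything related to his physics work.",
))
```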
So you can force it to only focus on certain parts of the document, on certain themes, or use specific additional instructions.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*T5j3ix96derzEGi08B5DjQ.png" /><figcaption>Provide additional extraction instructions</figcaption></figure><p>I tested it by extracting entities and relationships from several articles on Albert Einstein’s work but directing the LLM not to extract anything related to his physics work. And there it was — life, people, awards, peace activities, and other inventions, but not much on his tremendous contributions in physics.</p><h3>User Experience Improvements</h3><p>A quick list of the improvements:</p><ul><li>Allow read-only database access, then only retrieval is possible</li><li>Pop out the chat experience into a separate window</li><li>Improve graph visualization with local search and highlighting</li></ul><h3>Graph Consolidation</h3><p>An experimental feature is automatic graph consolidation, intended for users who just quickly want to see an extracted knowledge graph from their data but don’t want to specify a graph schema upfront.</p><p>In those cases, the LLM often generates a load of entity types and relationships — counting into the thousands if you give it free rein. Our retrievers don’t mind because they use the graph topology, not the actual type, to traverse the graph (although they collect them together with the textual information).</p><p>That’s why we recommend providing a graph schema upfront for a more semantically constrained knowledge graph. But in cases where this didn’t happen, we can use an LLM to categorize a sorted list of node labels and relationship types into a smaller, more general set. Since we’re not 100-percent happy with the reduction, we didn’t enable it by default, but we would love your feedback. 
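As a toy illustration of that consolidation step (the mapping here is hard-coded; in the tool, an LLM proposes it from the sorted list of extracted labels):

```python
from collections import Counter

# Hypothetical mapping from fine-grained labels to general categories.
# In the real feature, an LLM generates this from the extracted labels.
label_mapping = {
    "Physicist": "Person", "Politician": "Person",
    "City": "Location", "Country": "Location",
}

def consolidate(node_labels, mapping):
    """Replace fine-grained labels with their general category,
    counting how often each consolidated label occurs."""
    return Counter(mapping.get(label, label) for label in node_labels)

labels = ["Physicist", "Politician", "City", "Country", "Award"]
print(consolidate(labels, label_mapping))
```

Labels without a mapping (like "Award" above) are kept as-is, so the reduction is lossless for categories the LLM could not generalize.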
You can find it as one of the post-processing jobs in the graph enhancements.</p><h3>Summary</h3><p>Building an open source tool like this is a fulfilling experience — especially when we get as much feedback as we’ve gotten so far. We’ve addressed more than 400 GitHub issues, including internal planned tasks, and gotten more than 2,800 GitHub stars.</p><p>If you haven’t yet, <strong>please try it out and let us know what you think</strong> in the comments. We’d also be excited if you wrote about your experiences using the tool for different sets of documents of different domains.</p><p>Please share if you have questions or feedback. Also, please <a href="https://github.com/neo4j-labs/llm-graph-builder">give us a star</a> if you like the project.</p><p>Happy building!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=532828c4ba76" width="1" height="1" alt=""><hr><p><a href="https://medium.com/neo4j/llm-knowledge-graph-builder-first-release-of-2025-532828c4ba76">LLM Knowledge Graph Builder — First Release of 2025</a> was originally published in <a href="https://medium.com/neo4j">Neo4j Developer Blog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Claude Conversing With Neo4j Via MCP]]></title>
            <link>https://medium.com/neo4j/claude-conversing-with-neo4j-via-mcp-5419d24bb7a0?source=rss-3865848842f9------2</link>
            <guid isPermaLink="false">https://medium.com/p/5419d24bb7a0</guid>
            <category><![CDATA[protocol]]></category>
            <category><![CDATA[genai]]></category>
            <category><![CDATA[chatbots]]></category>
            <category><![CDATA[neo4j]]></category>
            <category><![CDATA[claude]]></category>
            <dc:creator><![CDATA[Michael Hunger]]></dc:creator>
            <pubDate>Fri, 20 Dec 2024 15:56:40 GMT</pubDate>
            <atom:updated>2024-12-22T00:34:06.221Z</atom:updated>
            <content:encoded><![CDATA[<h3>Claude Converses With Neo4j Via MCP</h3><h4>Give LLMs the Power of Knowledge Graphs With Anthropic’s Model Context Protocol</h4><p>Last week, Anthropic published a new <a href="https://www.anthropic.com/news/model-context-protocol">Model Context Protocol</a> (MCP) for LLMs like Claude to communicate with external data sources.</p><p>This protocol allows, for instance, local applications like Claude.ai, Zed, Replit, and others to access file systems, APIs, and databases as part of the conversation with the user to fetch additional information. This can be used to answer questions, fetch additional data, or generate on-the-fly charts and visualizations via code generation.</p><p>Here’s what it could look like in your conversation.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*UnC-f-LmG5RB-wk3G16TgQ.png" /><figcaption>Claude fetching data from Neo4j to render a heat map of ratings</figcaption></figure><p>If you want to test it yourself, you can add this config to your claude_desktop_config.json and restart the Claude Desktop app to follow along. This points to a read-only demo database.</p><pre>{&quot;mcpServers&quot;: {<br>      &quot;movies-neo4j&quot;: {<br>        &quot;command&quot;: &quot;uvx&quot;,<br>        &quot;args&quot;: [&quot;mcp-neo4j-cypher&quot;, <br>                 &quot;--db-url&quot;, &quot;neo4j+s://demo.neo4jlabs.com&quot;, <br>                 &quot;--user&quot;, &quot;recommendations&quot;, <br>                 &quot;--password&quot;, &quot;recommendations&quot;]<br>      }   <br>    }<br> }</pre><p>And they didn’t just publish the protocol. 
They open-sourced a <a href="https://github.com/modelcontextprotocol/servers">number of connectors (servers)</a> with an invitation to the community to add more.</p><h3>The Neo4j MCP Server(s)</h3><p>We immediately started <a href="https://github.com/neo4j-contrib/mcp-neo4j/tree/main/servers/mcp-neo4j-cypher">implementing a Neo4j MCP server</a> to see what that would look like, which we also <a href="https://pypi.org/project/mcp-neo4j-cypher/">published on PyPI</a>, so you can run it directly.</p><p><a href="https://github.com/neo4j-contrib/mcp-neo4j">GitHub - neo4j-contrib/mcp-neo4j: Model Context Protocol with Neo4j</a></p><p>It’s also included in the official <a href="https://github.com/modelcontextprotocol/servers?tab=readme-ov-file#%EF%B8%8F-official-integrations">MCP Servers repository</a>.</p><h3>An Interactive Example</h3><p>Here’s a simple example of how you can use the MCP to query a Neo4j database from an LLM.</p><p>Imagine you have a Neo4j database with an IMDB-like set of movies, actors, directors, genres, and movie ratings.</p><p>I configure the publicly accessible database in my Claude.ai desktop configuration and restart the client.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*_uCb-D3O0Fa5CfdCZybrAA.png" /></figure><p>Now I want to find out which movies were directed by Quentin Tarantino, so I’ll just ask Claude about it.</p><blockquote><em>Check my movie database for movies directed by Quentin Tarantino.</em></blockquote><p>Claude will pop up a permission dialog that asks me if I want to allow the query to the Neo4j database (once or for the whole chat), and if I accept, it will run the query and return the results in a nice format.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*nZ_94geL38hPyf8ODH0lig.png" /></figure><p>Then it answers:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*GIHD854jB_Lo3NnION7eLA.png" /><figcaption>Claude answer for: <em>Check my movie database for 
movies directed by Quentin Tarantino.</em></figcaption></figure><p><em>I’ll help you query the database for movies directed by Quentin Tarantino.</em></p><p><em>View Result from read-neo4j-cypher from movies-neo4j (local)</em></p><pre>{<br>  `query`: `<br>MATCH (d:Person {name: &#39;Quentin Tarantino&#39;})-[:DIRECTED]-&gt;(m:Movie)<br>RETURN m.title, m.year<br>ORDER BY m.year<br>`<br>}</pre><pre>[{&#39;m.title&#39;: &#39;Reservoir Dogs&#39;, &#39;m.year&#39;: 1992}, {&#39;m.title&#39;: &#39;Pulp Fiction&#39;, &#39;m.year&#39;: 1994}, {&#39;m.title&#39;: &#39;Four Rooms&#39;, &#39;m.year&#39;: 1995}, {&#39;m.title&#39;: &#39;Jackie Brown&#39;, &#39;m.year&#39;: 1997}, {&#39;m.title&#39;: &#39;Kill Bill: Vol. 1&#39;, &#39;m.year&#39;: 2003}, {&#39;m.title&#39;: &#39;Kill Bill: Vol. 2&#39;, &#39;m.year&#39;: 2004}, {&#39;m.title&#39;: &#39;Sin City&#39;, &#39;m.year&#39;: 2005}, {&#39;m.title&#39;: &#39;Grindhouse&#39;, &#39;m.year&#39;: 2007}, {&#39;m.title&#39;: &#39;Death Proof&#39;, &#39;m.year&#39;: 2007}, {&#39;m.title&#39;: &#39;Inglourious Basterds&#39;, &#39;m.year&#39;: 2009}, {&#39;m.title&#39;: &#39;Django Unchained&#39;, &#39;m.year&#39;: 2012}, {&#39;m.title&#39;: &#39;The Hateful Eight&#39;, &#39;m.year&#39;: 2015}]</pre><p><em>Here are Quentin Tarantino’s directed films in chronological order:</em></p><ol><li><em>Reservoir Dogs (1992)</em></li><li><em>Pulp Fiction (1994)</em></li><li><em>Four Rooms (1995)</em></li><li><em>Jackie Brown (1997)</em></li><li><em>Kill Bill: Vol. 1 (2003)</em></li><li><em>Kill Bill: Vol. 
2 (2004)</em></li><li><em>Sin City (2005) …​</em></li></ol><p><em>Let me know if you’d like to know more details about any of these films or if you’d like to explore other aspects of Tarantino’s work!</em></p><blockquote><em>What are other popular movies that the actors of Kill Bill acted in?</em></blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*P-QK_c6ayzraCmPsKK-IrQ.png" /><figcaption>Claude answer for: <em>What are other popular movies that the actors of Kill Bill acted in?</em></figcaption></figure><p><em>I’ll query for other notable movies featuring actors from Kill Bill: Vol. 1 and Kill Bill: Vol. 2.</em></p><p><em>View Result from read-neo4j-cypher from movies-neo4j (local)</em></p><pre>{<br>  `query`: `<br>MATCH (kb:Movie)<br>WHERE kb.title IN [&#39;Kill Bill: Vol. 1&#39;, &#39;Kill Bill: Vol. 2&#39;]<br>MATCH (actor:Person)-[:ACTED_IN]-&gt;(kb)<br>MATCH (actor)-[:ACTED_IN]-&gt;(other:Movie)<br>WHERE other.title &lt;&gt; &#39;Kill Bill: Vol. 1&#39;<br>AND other.title &lt;&gt; &#39;Kill Bill: Vol. 2&#39;<br>RETURN DISTINCT actor.name, collect(DISTINCT other.title) as other_movies<br>ORDER BY actor.name<br>`<br>}<br>[{&#39;actor.name&#39;: &#39;Daryl Hannah&#39;, &#39;other_movies&#39;: [&#39;Clan of the Cave Bear, The&#39;, &#39;Tie That Binds, The&#39;, &#39;Addams Family Reunion&#39;, &#39;Summer Lovers&#39;, &#39;Legal Eagles&#39;, &#39;Reckless&#39;, &#39;Pope of Greenwich Village, The&#39;, &#39;Splash&#39;, &#39;Roxanne&#39;, &#39;Gingerbread Man, The&#39;, &#39;My Favorite Martian&#39;, &#39;Steel Magnolias&#39;, &#39;Walk to Remember, A&#39;]}, {&#39;actor.name&#39;: &#39;Lucy Liu&#39;, &#39;other_movies&#39;: [&#39;Afro Samurai: Resurrection&#39;, &#39;Lucky Number Slevin&#39;, &#39;Mulan II&#39;, &#39;Cypher&#39;, &quot;Charlie&#39;s Angels: Full Throttle&quot;, &#39;Ballistic: Ecks vs. 
Sever&#39;, &quot;Charlie&#39;s Angels&quot;, &#39;Shanghai Noon&#39;]}, {&#39;actor.name&#39;: &#39;Michael Parks&#39;, &#39;other_movies&#39;: [&#39;Tusk&#39;, &#39;Death Wish 5: The Face of Death&#39;]}, {&#39;actor.name&#39;: &#39;Uma Thurman&#39;, &#39;other_movies&#39;: [&#39;Misérables, Les&#39;, &#39;My Super Ex-Girlfriend&#39;, &#39;Even Cowgirls Get the Blues&#39;, &#39;Truth About Cats &amp; Dogs, The&#39;, &#39;Batman &amp; Robin&#39;, &#39;Gattaca&#39;, &#39;Avengers, The&#39;, &#39;Month by the Lake, A&#39;, &#39;Jennifer 8&#39;, &#39;Mad Dog and Glory&#39;, &#39;Johnny Be Good&#39;, &#39;Henry &amp; June&#39;, &#39;Be Cool&#39;, &#39;Producers, The&#39;, &#39;Tape&#39;, &#39;Paycheck&#39;]}, {&#39;actor.name&#39;: &#39;Vivica A. Fox&#39;, &#39;other_movies&#39;: [&#39;Set It Off&#39;, &#39;Soul Food&#39;, &#39;Boat Trip&#39;, &#39;Juwanna Mann&#39;, &#39;Why Do Fools Fall In Love?&#39;]}]</pre><p><em>Let me break down some notable movies by the main cast members:</em></p><p><em>Uma Thurman’s other major films include: — The Avengers — Gattaca — Batman &amp; Robin — Les Misérables — The Producers — Be Cool</em></p><p><em>Lucy Liu appeared in: — Charlie’s Angels and Charlie’s Angels: Full Throttle — Shanghai Noon — Lucky Number Slevin — Mulan II</em></p><p><em>Michael Parks appeared in: — Tusk — Death Wish 5: The Face of Death</em></p><p><em>Would you like to know more details about any of these actors or specific movies? I can query for additional information like release years, directors, or other cast members.</em></p><blockquote><em>Tell me more about Tusk; is it worth watching? Use the database schema to figure out the ratings.</em></blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*B6D02g0WvSfUVsn7yV3yBA.png" /><figcaption>Claude answer for: <em>Tell me more about Tusk; is it worth watching? 
Use the database schema to figure out the ratings.</em></figcaption></figure><p>Now, it uses the other tool (get-neo4j-schema) available in the integration to get the database schema, which is important for determining where and how ratings are stored. It returns node types, their attributes, if they’re indexed, and the relationships to other nodes.</p><p><em>Let me first check the database schema to see what kind of rating information is available about movies.</em></p><p><em>View Result from get-neo4j-schema from movies-neo4j (local)</em></p><pre>{}<br>[{&#39;label&#39;: &#39;Movie&#39;, &#39;attributes&#39;: {&#39;budget&#39;: &#39;INTEGER&#39;, &#39;movieId&#39;: &#39;STRING unique indexed&#39;, &#39;tmdbId&#39;: &#39;STRING unique indexed&#39;, &#39;plotEmbedding&#39;: &#39;LIST&#39;, &#39;imdbVotes&#39;: &#39;INTEGER&#39;, &#39;runtime&#39;: &#39;INTEGER&#39;, &#39;posterEmbedding&#39;: &#39;LIST&#39;, &#39;countries&#39;: &#39;LIST&#39;, &#39;imdbId&#39;: &#39;STRING indexed&#39;, &#39;url&#39;: &#39;STRING&#39;, &#39;plot&#39;: &#39;STRING indexed&#39;, &#39;released&#39;: &#39;STRING indexed&#39;, &#39;languages&#39;: &#39;LIST&#39;, &#39;imdbRating&#39;: &#39;FLOAT indexed&#39;, &#39;title&#39;: &#39;STRING indexed&#39;, &#39;poster&#39;: &#39;STRING&#39;, &#39;year&#39;: &#39;INTEGER indexed&#39;, &#39;revenue&#39;: &#39;INTEGER&#39;}, &#39;relationships&#39;: {&#39;IN_GENRE&#39;: &#39;Genre&#39;}}, {&#39;label&#39;: &#39;Genre&#39;, &#39;attributes&#39;: {&#39;name&#39;: &#39;STRING unique indexed&#39;}, &#39;relationships&#39;: {}}, {&#39;label&#39;: &#39;User&#39;, &#39;attributes&#39;: {&#39;name&#39;: &#39;STRING indexed&#39;, &#39;userId&#39;: &#39;STRING unique indexed&#39;}, &#39;relationships&#39;: {&#39;RATED&#39;: &#39;Movie&#39;}}, {&#39;label&#39;: &#39;Actor&#39;, &#39;attributes&#39;: {&#39;born&#39;: &#39;DATE&#39;, &#39;bornIn&#39;: &#39;STRING&#39;, &#39;tmdbId&#39;: &#39;STRING&#39;, &#39;bio&#39;: &#39;STRING&#39;, &#39;died&#39;: 
&#39;DATE&#39;, &#39;name&#39;: &#39;STRING&#39;, &#39;poster&#39;: &#39;STRING&#39;, &#39;imdbId&#39;: &#39;STRING&#39;, &#39;url&#39;: &#39;STRING&#39;}, &#39;relationships&#39;: {&#39;ACTED_IN&#39;: &#39;Movie&#39;, &#39;DIRECTED&#39;: &#39;Movie&#39;}}, {&#39;label&#39;: &#39;Director&#39;, &#39;attributes&#39;: {&#39;born&#39;: &#39;DATE&#39;, &#39;bornIn&#39;: &#39;STRING&#39;, &#39;tmdbId&#39;: &#39;STRING&#39;, &#39;bio&#39;: &#39;STRING&#39;, &#39;died&#39;: &#39;DATE&#39;, &#39;name&#39;: &#39;STRING&#39;, &#39;poster&#39;: &#39;STRING&#39;, &#39;imdbId&#39;: &#39;STRING&#39;, &#39;url&#39;: &#39;STRING&#39;}, &#39;relationships&#39;: {&#39;ACTED_IN&#39;: &#39;Movie&#39;, &#39;DIRECTED&#39;: &#39;Movie&#39;}}, {&#39;label&#39;: &#39;Person&#39;, &#39;attributes&#39;: {&#39;born&#39;: &#39;DATE&#39;, &#39;bornIn&#39;: &#39;STRING&#39;, &#39;tmdbId&#39;: &#39;STRING unique indexed&#39;, &#39;bio&#39;: &#39;STRING indexed&#39;, &#39;died&#39;: &#39;DATE&#39;, &#39;name&#39;: &#39;STRING indexed&#39;, &#39;poster&#39;: &#39;STRING&#39;, &#39;imdbId&#39;: &#39;STRING&#39;, &#39;url&#39;: &#39;STRING&#39;}, &#39;relationships&#39;: {&#39;ACTED_IN&#39;: &#39;Movie&#39;, &#39;DIRECTED&#39;: &#39;Movie&#39;}}]</pre><p><em>Now, let me query specific information about Tusk, including its rating, plot, and genre.</em></p><p><em>View Result from read-neo4j-cypher from movies-neo4j (local)</em></p><pre>{<br>  `query`: `<br>MATCH (m:Movie {title: &#39;Tusk&#39;})<br>MATCH (m)-[:IN_GENRE]-&gt;(g:Genre)<br>MATCH (d:Person)-[:DIRECTED]-&gt;(m)<br>MATCH (a:Person)-[:ACTED_IN]-&gt;(m)<br>RETURN m.title, m.year, m.imdbRating, m.plot, collect(DISTINCT g.name) as genres,<br>       collect(DISTINCT d.name) as directors, collect(DISTINCT a.name) as actors<br>`<br>}<br>[{&#39;m.title&#39;: &#39;Tusk&#39;, &#39;m.year&#39;: 2014, &#39;m.imdbRating&#39;: 5.4, &#39;m.plot&#39;: &#39;When podcaster Wallace Bryton goes missing in the backwoods of Manitoba while interviewing a 
mysterious seafarer named Howard Howe, his best friend Teddy and girlfriend Allison team with an ex-cop to look for him.&#39;, &#39;genres&#39;: [&#39;Comedy&#39;, &#39;Drama&#39;, &#39;Horror&#39;], &#39;directors&#39;: [&#39;Kevin Smith&#39;], &#39;actors&#39;: [&#39;Michael Parks&#39;, &#39;Haley Joel Osment&#39;, &#39;Genesis Rodriguez&#39;, &#39;Justin Long&#39;]}]</pre><p><em>Based on the database information, here’s what I can tell you about Tusk:</em></p><p><em>1. It’s a horror-comedy-drama directed by Kevin Smith (known for Clerks, Mallrats, Chasing Amy) and was released in September 2014.</em></p><p><em>2. The plot follows a podcaster named Wallace Bryton who disappears in Manitoba while interviewing a mysterious seafarer named Howard Howe. His girlfriend and best friend then team up with an ex-cop to search for him.</em></p><p><em>3. The movie has a relatively modest IMDb rating of 5.4/10, suggesting mixed to negative reception.</em></p><p>….</p><p><em>The low IMDb rating suggests it’s not for everyone, but it has developed something of a cult following among fans of unusual horror films.</em></p><p><em>Would you like me to find similar movies in the database that you might enjoy instead?</em></p><blockquote><em>Hmm, OK, give me the best movie of the list that we got from the Kill Bill actors with some detail.</em></blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*-42ylI4L8kOLzpWyvjglSw.png" /><figcaption>Claude answer for: <em>Hmm, ok, give me the best movie of the list that we got from the Kill Bill actors with some detail.</em></figcaption></figure><p><em>Let me search through those movies and their ratings to find the highest rated one.</em></p><pre>{<br>  `query`: `<br>MATCH (a:Person)-[:ACTED_IN]-&gt;(kb:Movie)<br>WHERE kb.title IN [&#39;Kill Bill: Vol. 1&#39;, &#39;Kill Bill: Vol. 2&#39;]<br>MATCH (a)-[:ACTED_IN]-&gt;(m:Movie)<br>WHERE m.title &lt;&gt; &#39;Kill Bill: Vol. 
1&#39; AND m.title &lt;&gt; &#39;Kill Bill: Vol. 2&#39;<br>MATCH (m)-[:IN_GENRE]-&gt;(g:Genre)<br>MATCH (d:Person)-[:DIRECTED]-&gt;(m)<br>RETURN DISTINCT m.title, m.year, m.imdbRating, m.plot,<br>       collect(DISTINCT g.name) as genres,<br>       collect(DISTINCT d.name) as directors,<br>       collect(DISTINCT a.name) as actors<br>ORDER BY m.imdbRating DESC<br>LIMIT 1<br>`<br>}<br>[{&#39;m.title&#39;: &#39;Gattaca&#39;, &#39;m.year&#39;: 1997, &#39;m.imdbRating&#39;: 7.8, &#39;m.plot&#39;: &#39;A genetically inferior man assumes the identity of a superior one in order to pursue his lifelong dream of space travel.&#39;, &#39;genres&#39;: [&#39;Thriller&#39;, &#39;Sci-Fi&#39;, &#39;Drama&#39;], &#39;directors&#39;: [&#39;Andrew Niccol&#39;], &#39;actors&#39;: [&#39;Uma Thurman&#39;]}]</pre><p><em>From the movies we found earlier, Gattaca (1997) stands out as the highest-rated film with an IMDb rating of 7.8/10. Let me get some additional details about this film.</em></p><pre>{<br>  `query`: `<br>MATCH (m:Movie {title: &#39;Gattaca&#39;})<br>MATCH (p:Person)-[:ACTED_IN]-&gt;(m)<br>RETURN m.runtime, collect(p.name) as cast<br>`<br>}<br>[{&#39;m.runtime&#39;: 106, &#39;cast&#39;: [&#39;Xander Berkeley&#39;, &#39;Gore Vidal&#39;, &#39;Uma Thurman&#39;, &#39;Ethan Hawke&#39;]}]</pre><p><em>Here’s what makes Gattaca stand out:</em></p><p><em>1. Plot: It’s a science fiction drama about a man born genetically “inferior” who assumes the identity of a genetically superior person to pursue his dream of space travel. 
The story explores themes of genetic discrimination, determination, and human potential.</em></p><p>…..</p><p><em>Would you like me to find similar movies in the database that you might enjoy if you like Gattaca?</em></p><blockquote><em>What is the other average, min, max user rating for Gattaca?</em></blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*tMPo-q3SQ_HL4UffjPs6RA.png" /><figcaption>Claude answer for: <em>What is the other average, min, max user rating for Gattaca?</em></figcaption></figure><p><em>I’ll look up the user ratings statistics for Gattaca.</em></p><pre>{<br>  `query`: `<br>MATCH (m:Movie {title: &#39;Gattaca&#39;})<br>MATCH (u:User)-[r:RATED]-&gt;(m)<br>RETURN<br>  m.title,<br>  count(r) as number_of_ratings,<br>  round(avg(r.rating)*10)/10 as avg_rating,<br>  min(r.rating) as min_rating,<br>  max(r.rating) as max_rating<br>`<br>}<br>[{&#39;m.title&#39;: &#39;Gattaca&#39;, &#39;number_of_ratings&#39;: 84, &#39;avg_rating&#39;: 3.7, &#39;min_rating&#39;: 1.0, &#39;max_rating&#39;: 5.0}]</pre><p><em>I’ll look up the user ratings statistics for Gattaca.</em></p><p><em>Here are the user rating statistics for Gattaca:<br>- Average User Rating: 3.68 out of 5<br>- Minimum Rating: 1.0<br>- Maximum Rating: 5.0<br>- Number of Ratings: 84 users</em></p><p><em>This average rating of 3.68/5 (equivalent to about 7.4/10) aligns fairly well with the IMDb rating of 7.8/10 we saw earlier. The movie has received the full range of possible scores (1–5), showing that while most people rate it highly, there are some varying opinions. With 84 ratings in the database, this gives us a decent sample size to work with.</em></p><blockquote><em>Ok I watched Gattaca and loved it, what other movies are similar to it? 
Can you create a user for me with an ID mesirii and store my rating of 8 out of 10 (normalize it to fit the other ratings for the movie)?</em></blockquote><p>Now we move on to write statements, which triggers another permission check for the write-operation tool. This is also where Claude makes its first small syntax mistake in a Cypher statement and immediately fixes it. As my database is read-only, Claude cannot actually write to it, but it still offers me other movies to watch next.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*cRTf53ROUR4SFuHHkMwpBQ.png" /></figure><p><em>I’ll help you add your rating first, normalizing 8/10 to 4/5 for consistency with the database’s rating scale.</em></p><p><em>I apologize for the error with creating a user rating — it seems I don’t have permission to create new users in the database. However, I can still help you find similar movies to Gattaca!</em></p><p><em>View Result from write-neo4j-cypher from movies-neo4j (local)</em></p><pre>{<br>  `query`: `<br>MERGE (u:User {userId: &#39;mesirii&#39;, name: &#39;mesirii&#39;})<br>MERGE (m:Movie {title: &#39;Gattaca&#39;})<br>CREATE (u)-[:RATED {rating: 4.0}]-&gt;(m)<br>RETURN u.name, m.title<br>`<br>}</pre><p><em>I apologize for the error — it seems I don’t have permission to create new user records. Let me focus instead on finding similar movies to Gattaca.
I’ll search for movies that share similar genres, themes, and ratings:</em></p><p><em>View Result from read-neo4j-cypher from movies-neo4j (local)</em></p><pre>{<br>  `query`: `<br>MATCH (m:Movie)-[:IN_GENRE]-&gt;(g:Genre)<br>WHERE g.name IN [&#39;Sci-Fi&#39;, &#39;Drama&#39;]<br>AND m.title &lt;&gt; &#39;Gattaca&#39;<br>AND m.imdbRating &gt;= 7.0<br>WITH m, collect(DISTINCT g.name) as genres<br>WHERE size(genres) &gt;= 2<br>RETURN m.title, m.year, m.imdbRating, m.plot, genres<br>ORDER BY m.imdbRating DESC<br>LIMIT 5<br>`<br>}<br>[{&#39;m.title&#39;: &#39;Inception&#39;, &#39;m.year&#39;: 2010, &#39;m.imdbRating&#39;: 8.8, &#39;m.plot&#39;: &#39;A thief, who steals corporate secrets through use of dream-sharing technology, is given the inverse task of planting an idea into the mind of a CEO.&#39;, &#39;genres&#39;: [&#39;Drama&#39;, &#39;Sci-Fi&#39;]}, {&#39;m.title&#39;: &#39;Black Mirror&#39;, &#39;m.year&#39;: None, &#39;m.imdbRating&#39;: 8.8, &#39;m.plot&#39;: &#39;A television anthology series that shows the dark side of life and technology.&#39;, &#39;genres&#39;: [&#39;Drama&#39;, &#39;Sci-Fi&#39;]}, {&#39;m.title&#39;: &quot;Dr. 
Horrible&#39;s Sing-Along Blog&quot;, &#39;m.year&#39;: 2008, &#39;m.imdbRating&#39;: 8.7, &#39;m.plot&#39;: &#39;An aspiring supervillain must balance his career and his pursuit of a beautiful do-gooder.&#39;, &#39;genres&#39;: [&#39;Drama&#39;, &#39;Sci-Fi&#39;]}, {&#39;m.title&#39;: &#39;Battlestar Galactica&#39;, &#39;m.year&#39;: 2003, &#39;m.imdbRating&#39;: 8.7, &#39;m.plot&#39;: &#39;A re-imagining of the original series in which a rag-tag fugitive fleet of the last remnants of mankind flees pursuing robots while simultaneously searching for their true home, Earth.&#39;, &#39;genres&#39;: [&#39;Drama&#39;, &#39;Sci-Fi&#39;]}, {&#39;m.title&#39;: &#39;Prestige, The&#39;, &#39;m.year&#39;: 2006, &#39;m.imdbRating&#39;: 8.5, &#39;m.plot&#39;: &#39;Two stage magicians engage in competitive one-upmanship in an attempt to create the ultimate stage illusion.&#39;, &#39;genres&#39;: [&#39;Sci-Fi&#39;, &#39;Drama&#39;]}]</pre><p><em>Let me search for movies with similar themes, genres, and ratings:</em></p><pre>{<br>  `query`: `<br>MATCH (m:Movie {title: &#39;Gattaca&#39;})<br>MATCH (m)-[:IN_GENRE]-&gt;(g:Genre)<br>MATCH (other:Movie)-[:IN_GENRE]-&gt;(g)<br>WHERE other.title &lt;&gt; &#39;Gattaca&#39; <br>  AND other.imdbRating &gt;= 7.0<br>  AND exists(other.plot)<br>RETURN DISTINCT other.title, other.imdbRating, other.plot, <br>  collect(DISTINCT g.name) as genres<br>ORDER BY other.imdbRating DESC<br>LIMIT 5<br>`<br>}<br><br>Error: {code: Neo.ClientError.Statement.SyntaxError} {message: The property existence syntax `... exists(variable.property)` is no longer supported. Please use `variable.property IS NOT NULL` instead. (line 7, column 7 (offset: 174))<br>&quot;  AND exists(other.plot)&quot;<br>       ^}</pre><p><em>I apologize for the syntax error. 
Let me correct the query:</em></p><p><em>Let me refine this search to focus more specifically on sci-fi and thriller elements like Gattaca:</em></p><pre>{<br>  `query`: `<br>MATCH (m:Movie {title: &#39;Gattaca&#39;})<br>MATCH (m)-[:IN_GENRE]-&gt;(g:Genre)<br>MATCH (other:Movie)-[:IN_GENRE]-&gt;(g)<br>WHERE other.title &lt;&gt; &#39;Gattaca&#39; <br>  AND other.imdbRating &gt;= 7.0<br>  AND other.plot IS NOT NULL<br>RETURN DISTINCT other.title, other.imdbRating, other.plot, <br>  collect(DISTINCT g.name) as genres<br>ORDER BY other.imdbRating DESC<br>LIMIT 5<br>`<br>}<br></pre><p><em>Based on your enjoyment of Gattaca, here are some highly-rated recommendations that share similar themes or elements:</em></p><p><em>1. Black Mirror (8.8/10)<br> — A thought-provoking sci-fi anthology series that, like Gattaca, explores the dark implications of technology on society<br> — Each episode tells a different story about how technology might affect human society</em></p><p>….</p><p><em>These films all share Gattaca’s thoughtful approach to science fiction, focusing more on human drama and ethical questions rather than action. They all explore themes of identity, humanity, and the impact of technology/science on society, which were central to Gattaca.</em></p><p><em>Would you like more details about any of these recommendations?</em></p><p>…</p><p>That’s my first experience with the MCP and the Neo4j connector. Pretty cool, huh? 
It was a great experience, and I can see how this approach can enhance conversations with LLMs like Claude.</p><h3>The Model Context Protocol</h3><p>How does this work under the hood?</p><p>There is comprehensive documentation of the MCP on the modelcontextprotocol.io site, but here’s a brief overview:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*fbiVQnaxHWBKRpdnoXJPVw.png" /><figcaption>Architecture diagram from the MCP documentation</figcaption></figure><p>MCP follows a <a href="https://modelcontextprotocol.io/docs/concepts/architecture">client-server architecture</a> where:</p><ul><li><strong>Hosts</strong> are LLM applications (like Claude Desktop or IDEs) that initiate connections and embed the clients.</li><li><strong>Clients</strong> run within the host application and maintain 1:1 connections with servers over a transport protocol.</li><li><strong>Servers</strong> provide context, tools, and prompts to clients.</li></ul><p>The protocol layer handles message framing, request/response linking, notifications, and high-level communication patterns.</p><p>The MCP allows for different transport protocols.
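Whichever transport is used, the payload has the same shape. As a sketch, a client invoking the read-neo4j-cypher tool might send a JSON-RPC 2.0 request like this (the method name and params shape follow my reading of the MCP specification; the id and the Cypher query are made up for illustration):

```python
import json

# Illustrative JSON-RPC 2.0 message as exchanged over any MCP transport.
# The "tools/call" method and params shape follow the MCP spec; the id
# and the Cypher query here are invented for this example.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "read-neo4j-cypher",
        "arguments": {"query": "MATCH (m:Movie) RETURN m.title LIMIT 5"},
    },
}

# The transport layer is only responsible for framing and delivering
# serialized messages like this in both directions.
wire_message = json.dumps(request)
```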
Those currently supported are HTTPS (with Server-Sent Events (SSE) for server-to-client messages and HTTP POST for client-to-server messages) and STDIO (standard in/out) for local servers, where the server is started by the client and communicates via stdin/stdout.</p><p>All transport message exchanges follow a <a href="https://spec.modelcontextprotocol.io/specification/">specification</a> based on JSON-RPC 2.0, which makes it straightforward to implement the protocol in other languages or transport layers.</p><p>The base message types are:</p><ul><li><strong>Resources</strong>: Context and data for the user or the AI model to use</li><li><strong>Prompts</strong>: Templated messages and workflows for users</li><li><strong>Tools</strong>: Functions for the AI model to execute</li><li><strong>Sampling</strong>: Server-initiated agentic behaviors and recursive LLM interactions</li></ul><p>Additional relevant aspects include configuration, progress tracking, cancellation, error reporting, and logging.</p><p>The protocol spec also addresses security and trust, which is important when giving LLMs access to external data sources. With write access to databases and file systems, servers running locally, and the potential for malicious code execution, security must be a high priority. Foundation models are known to be vulnerable to adversarial attacks, prompt injections, and hallucinations.</p><p>LLM users are often non-technical and might not be aware of the risks involved in allowing an AI model to access their data.</p><p>That’s why the protocol employs the following key principles:</p><ul><li>User consent and control: Users need to approve server and data access and always be able to discern which data was sent and received.
Client implementations need to have clear UIs for these authorizations.</li><li>Data privacy: Access controls protect user data, and hosts and servers must not exfiltrate data to other services.</li><li>Tool safety: Tools must be safe to run, preventing malicious code execution and unwanted side effects. Tools need to be structured in a way that a user clearly understands what each tool is doing.</li><li>LLM sampling controls: The protocol limits server access to user prompts. Users can control sampling and what data and prompts are sent to the server.</li></ul><h3>Neo4j MCP Server Implementation</h3><p>When we implemented the Neo4j MCP server, we had to build three components:</p><ol><li>Connection to the Neo4j database and executing Cypher queries</li><li>Registering and listing the available tools</li><li>Handling the tool execution requests</li></ol><h4>Connection and Cypher Execution With Neo4j Database</h4><p>In our case, we wrote a small neo4jDatabase class that holds the Neo4j Python driver to connect to the database, execute the statements, and render the results as a JSON text block.</p><pre>def _execute_query(self, query: str, params: dict[str, Any] | None = None) -&gt; list[dict[str, Any]]:<br>    &quot;&quot;&quot;Execute a Cypher query and return results as a list of dictionaries&quot;&quot;&quot;<br>    logger.debug(f&quot;Executing query: {query}&quot;)<br>    try:<br>        result = self.driver.execute_query(query, params)<br>        if is_write_query(query):<br>            counters = vars(result.summary.counters)<br>            logger.debug(f&quot;Write query affected {counters}&quot;)<br>            return [counters]<br>        else:<br>            results = [dict(r) for r in result.records]<br>            logger.debug(f&quot;Read query returned {len(results)} rows&quot;)<br>            return results<br>    except Exception as e:<br>        logger.error(f&quot;Database error executing query: {e}\n{query}&quot;)<br>        
raise</pre><h4>Register Available Tools for Interacting With the Database</h4><p>These tools will later be available to the LLM to choose from:</p><ul><li>get-neo4j-schema</li><li>read-neo4j-cypher</li><li>write-neo4j-cypher</li></ul><p>Each has a description and possible parameters (in our case, just a mandatory “query” parameter for the Cypher tools). We can imagine a number of additional generic tools for a graph database like getting the aggregated neighborhood of a node, finding shortest paths, or running graph algorithms.</p><p>Here’s the definition of the write-neo4j-cypher tool:</p><pre>@server.list_tools()<br>    async def handle_list_tools() -&gt; list[types.Tool]:<br>        &quot;&quot;&quot;List available tools&quot;&quot;&quot;<br>        return [<br>            types.Tool(<br>                name=&quot;write-neo4j-cypher&quot;,<br>                description=&quot;Execute a write Cypher query on the neo4j database&quot;,<br>                inputSchema={<br>                    &quot;type&quot;: &quot;object&quot;,<br>                    &quot;properties&quot;: {<br>                        &quot;query&quot;: {&quot;type&quot;: &quot;string&quot;, &quot;description&quot;: &quot;Cypher write query to execute&quot;},<br>                    },<br>                    &quot;required&quot;: [&quot;query&quot;],<br>                },<br>            ),<br>#            ...<br>        ]</pre><h4>Execution of the Registered Tools</h4><p>When a user asks Claude to query the database, Claude asks the user for permission, starts the server if it’s not already running as a local server, and sends a request to the MCP server with the tool name and parameters via the transport protocol.</p><p>In the tool implementations, we just use this class to execute the statements. 
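The is_write_query helper used in _execute_query above can be imagined as a simple keyword check. This is an illustrative sketch under that assumption, not the exact code from the mcp-neo4j repository:

```python
import re

# Sketch of is_write_query: classify a Cypher statement as a write if it
# contains any updating-clause keyword. The real helper in the mcp-neo4j
# repository may use a different heuristic.
WRITE_CLAUSES = re.compile(
    r"\b(CREATE|MERGE|DELETE|DETACH|SET|REMOVE|DROP)\b", re.IGNORECASE
)

def is_write_query(query: str) -> bool:
    """Return True if the Cypher statement modifies the graph."""
    return bool(WRITE_CLAUSES.search(query))
```

A production version would also need to handle keywords that appear inside string literals or comments, which this naive substring match does not.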
There is a small check that we only allow read statements in the read tool and vice versa (the write tool only returns update statistics).</p><pre>@server.call_tool()<br>    async def handle_call_tool(<br>        name: str, arguments: dict[str, Any] | None<br>    ) -&gt; list[types.TextContent | types.ImageContent | types.EmbeddedResource]:<br>        &quot;&quot;&quot;Handle tool execution requests&quot;&quot;&quot;<br>        try:<br>            if name == &quot;get-neo4j-schema&quot;:<br>                results = db._execute_query(<br>                    &quot;&quot;&quot;<br>CALL apoc.meta.data() yield label, property, type, other, unique, index, elementType<br>WHERE elementType = &#39;node&#39;<br>RETURN label,<br>    collect(case when type &lt;&gt; &#39;RELATIONSHIP&#39; then [property, type] end) as attributes,<br>    collect(case when type = &#39;RELATIONSHIP&#39; then [property, head(other)] end) as relationships<br>                    &quot;&quot;&quot;<br>                )<br>                return [types.TextContent(type=&quot;text&quot;, text=str(results))]</pre><p>So you can see it’s pretty straightforward to implement an MCP server for a tool of your choice.</p><h3>From Conversation to Code</h3><p>Combining MCP tool results with Claude’s code generation and execution capabilities is taking the interaction to the next level.</p><p>One thing that’s already impressive in the interactions is that the previous tool call data results are part of the conversational history, so the LLM makes use of them to shape queries and input parameters for further server interactions.</p><p>But it goes even beyond this with the code-generation capabilities, which allow you to easily render results as charts, networks, or just compute derived data to export. 
It generates JavaScript artifacts, typically React components combined with charting libraries, that you can render and visualize directly.</p><p>Here are two examples of generating a heat map per movie or per rating:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*LItlnYmWNjC3KGYT3E2Elw.png" /><figcaption>Heat-map rendering of Tarantino movies</figcaption></figure><blockquote><em>Actually I wanted ratings and years as a square heatmap.</em></blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*6a1W5KdhOK3dcw69n_vrNA.png" /><figcaption>Square heat map of rating-count per rating-value</figcaption></figure><p>And finally, a network visualization of our conversation:</p><blockquote><em>Generate a graph visualization of the conversation so far, starting with Quentin Tarantino’s movies to the recommendations all the way to the final suggestions. Please make sure the rendering has high contrast.</em></blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*6WQO0D2OxLgapo9UYhv3ZA.png" /></figure><h3>Conclusion</h3><p>We’d love for you to test out the Neo4j MCP server implementation and share your experiences, give us <a href="https://github.com/neo4j-contrib/mcp-neo4j">feedback on GitHub</a>, and perhaps write an article of your own.</p><p>Happy experimenting!</p><hr><p><a href="https://medium.com/neo4j/claude-conversing-with-neo4j-via-mcp-5419d24bb7a0">Claude Conversing With Neo4j Via MCP</a> was originally published in <a href="https://medium.com/neo4j">Neo4j Developer Blog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Get Started With GraphRAG: Neo4j’s Ecosystem Tools]]></title>
            <link>https://medium.com/neo4j/get-started-with-graphrag-neo4js-ecosystem-tools-eec476167e86?source=rss-3865848842f9------2</link>
            <guid isPermaLink="false">https://medium.com/p/eec476167e86</guid>
            <category><![CDATA[knowledge-graph]]></category>
            <category><![CDATA[genai]]></category>
            <category><![CDATA[llm]]></category>
            <category><![CDATA[development]]></category>
            <category><![CDATA[retrieval-augmented]]></category>
            <dc:creator><![CDATA[Michael Hunger]]></dc:creator>
            <pubDate>Wed, 19 Jun 2024 17:21:29 GMT</pubDate>
            <atom:updated>2024-06-19T17:21:29.967Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*CZJIEpFTke6IQ1qE.png" /></figure><p>We’re excited to introduce new resources for your GenAI apps: the Neo4j GraphRAG Ecosystem Tools. These open-source tools make it easy to get started with GenAI applications grounded with knowledge graphs, which help improve response quality and explainability and accelerate app development and adoption.</p><p>GraphRAG combines<a href="https://neo4j.com/blog/what-is-retrieval-augmented-generation-rag/"> retrieval-augmented generation (RAG)</a> with knowledge graphs to solve critical LLM issues like hallucination and lack of domain-specific context. Knowledge graphs provide the contextual memory LLMs need to reliably answer questions and serve as trusted agents in complex workflows — and unlike most RAG solutions, which only offer access to fragments of textual data, GraphRAG integrates structured and semi-structured information into the retrieval process.</p><p><a href="https://neo4j.com/blog/what-is-retrieval-augmented-generation-rag/">What Is Retrieval-Augmented Generation (RAG)? - Graph Database &amp; Analytics</a></p><p>These new tools will help you create a knowledge graph from unstructured text and use that graph — or an existing graph database — to retrieve relevant information for generative tasks via both vector and graph search.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*tKEswy3-7Ky8UxVc" /></figure><p>You can use the tools to kickstart GenAI development, integrate them into your own systems, or use them as a reference template for building your own custom implementations. 
The current implementations use our LangChain integrations for Python and JavaScript, but you can also build them with other languages and frameworks.</p><p>In this post, we give an overview of the LLM Knowledge Graph Builder, NeoConverse, and GenAI framework integrations.</p><h3>Quickly Turn Unstructured Text Into a Knowledge Graph</h3><p>The <a href="https://neo4j.com/labs/genai-ecosystem/llm-graph-builder/">Neo4j Knowledge Graph Builder</a> can seem magical — just load unstructured text to produce a structured graph that surfaces hidden entities and relationships within the data. It works with PDFs, Word documents, YouTube transcripts, Wikipedia pages, and many other kinds of unstructured text.</p><figure><a href="https://llm-graph-builder.neo4jlabs.com/"><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*saqPptVg93bBApHH" /></a><figcaption>Steps for Knowledge Graph Creation</figcaption></figure><p>If you’re new to graph technology, you can use the Knowledge Graph Builder to easily create graphs from familiar domain information. More experienced graph developers might use it to kickstart new projects.</p><p><strong>You can </strong><a href="https://llm-graph-builder.neo4jlabs.com/"><strong>use the Graph Builder online</strong></a><strong>.</strong> If you do not have a Neo4j instance, you can create a free <a href="https://console.neo4j.io">Neo4j Aura</a> database.</p><p>In addition to extracting source documents and their chunks and embeddings as the lexical graph, the Graph Builder extracts the graph of entities and their relationships and connects them to the chunks.</p><p>To see the unstructured and structured contextual data behind the answers, you can visualize and question the ingested data.
And because we use GraphRAG behind the scenes for every vector search result, we can fetch the associated entities and provide them to the LLM to generate an answer.</p><figure><a href="https://llm-graph-builder.neo4jlabs.com/"><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*J-E6qJh3RRCGbwKo" /></a></figure><p>The Graph Builder front-end is a React application that uses the Neo4j design system (via the <a href="https://neo4j.com/labs/neo4j-needle-starterkit/">Needle Starter Kit</a>) and the recently published <a href="https://neo4j.com/docs/nvl/current/">Neo4j Visualization Library</a>.</p><p>The backend uses the <a href="https://neo4j.com/labs/genai-ecosystem/langchain/">LangChain integrations</a> for the interactions with Neo4j, the knowledge graph extraction, and the GraphRAG search that combines vector search with graph retrieval queries. Written in Python, it uses FastAPI and runs as containers on Google Cloud Run. But you can also <a href="https://neo4j.com/labs/genai-ecosystem/llm-graph-builder-deployment/">run it locally</a> with Docker Compose.</p><p>You can use the LangChain integrations in your own code, as shown below, for knowledge graph construction:</p><pre>from langchain_experimental.graph_transformers import LLMGraphTransformer<br>from langchain_openai import ChatOpenAI<br>from langchain_community.graphs import Neo4jGraph<br><br>graph = Neo4jGraph()<br>llm = ChatOpenAI(temperature=0, model_name=&quot;gpt-4o&quot;)<br><br>kg_transformer = LLMGraphTransformer( llm=llm,<br>  allowed_nodes=[&quot;Person&quot;, &quot;Country&quot;, &quot;Company&quot;],<br>  allowed_relationships=[&quot;LOCATED_IN&quot;,&quot;WORKED_AT&quot;])<br>results = kg_transformer.convert_to_graph_documents(documents)<br>graph.add_graph_documents(results)</pre><p>To learn more about the Knowledge Graph Builder, find the source code, and see walkthrough videos, check out our <a href="https://neo4j.com/labs/genai-ecosystem/llm-graph-builder/">Graph Builder docs 
page</a>.</p><p>Better yet, <a href="https://llm-graph-builder.neo4jlabs.com/">try it yourself</a> with your own files.</p><p>You can also watch the short demo video here.</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FLlNy5VmV290%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DLlNy5VmV290&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FLlNy5VmV290%2Fhqdefault.jpg&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/a6a1fc0888c952295da39fbf8e69e98f/href">https://medium.com/media/a6a1fc0888c952295da39fbf8e69e98f/href</a></iframe><h3>Query Your Graph Using Natural Language</h3><p>Our NeoConverse tool uses the structure of an existing knowledge graph to generate Cypher graph queries from a user’s question and then executes them against the Neo4j database. Those query results are used to generate a text or chart response.</p><p><strong>You can </strong><a href="https://neoconverse.graphapp.io/"><strong>test out NeoConverse</strong></a><strong> online.</strong></p><p>Here’s the NeoConverse flow:</p><ol><li>User selects the dataset and if they want responses as text or chart.</li><li>User asks the question.</li><li>The system extracts the database schema from the database and combines it with the user question to generate a prompt for the configured LLM.</li><li>The LLM generates a Cypher query to execute against the database.</li><li>The query is validated and executed.</li><li>The query results are sent to the LLM with the user question and a specific prompt to either generate a response as markdown text or configuration for the charting library.</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*wOtRTjzoVabl_zfw" /><figcaption>NeoConverse Flow</figcaption></figure><p>NeoConverse comes with several pre-configured datasets to demonstrate its capabilities. 
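The six-step NeoConverse flow above can be condensed into a short sketch; here, get_schema, ask_llm, and run_cypher are hypothetical stand-ins for the schema extraction, the LLM call, and the Neo4j driver execution, and are not NeoConverse's actual API:

```python
# Hypothetical sketch of the NeoConverse text-to-Cypher flow.
def answer_question(question, get_schema, ask_llm, run_cypher):
    # 3. extract the database schema and combine it with the user question
    prompt = (
        f"Schema:\n{get_schema()}\n\n"
        f"Write a read-only Cypher query answering: {question}"
    )
    # 4. the LLM generates a Cypher query
    cypher = ask_llm(prompt)
    # 5. validate (a naive keyword check in this sketch) and execute it
    if any(kw in cypher.upper() for kw in ("CREATE", "MERGE", "DELETE", "SET")):
        raise ValueError("only read queries are allowed")
    rows = run_cypher(cypher)
    # 6. send the results back with the question to phrase the final answer
    return ask_llm(f"Question: {question}\nResults: {rows}\nAnswer as text.")
```

The real implementation validates the generated Cypher more carefully and, depending on the user's choice in step 1, asks the LLM for charting-library configuration instead of a text answer.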
For each, you can see a schema of the database and example questions (click on the vertical ellipsis). You can also configure additional datasets in NeoConverse to connect to your own Neo4j database.</p><figure><a href="https://neoconverse.graphapp.io/"><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*CJUyoYwGPckGWR0J" /></a><figcaption>NeoConverse UI with 2 questions one for a textual and one for a chart response</figcaption></figure><p>To read more about NeoConverse, and find additional videos, blog posts, and the link to the GitHub repository, head over to <a href="https://neo4j.com/labs/genai-ecosystem/neoconverse/">our NeoConverse docs page</a>.</p><h3>Use Your Preferred LLM Frameworks</h3><p>Neo4j integrates seamlessly with most open-source GenAI ecosystem libraries for Python, JavaScript, Java, and .Net.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*o4yqARlIzUHgdZFi" /></figure><p>We’ve integrated with <a href="https://neo4j.com/labs/genai-ecosystem/langchain/">LangChain Python</a> and <a href="https://neo4j.com/labs/genai-ecosystem/langchain-js/">LangChain JavaScript</a> to provide vector and graph search, text-to-cypher, conversational memory, knowledge graph construction, advanced RAG templates, and much more.</p><p>Our <a href="https://neo4j.com/labs/genai-ecosystem/llamaindex/">LlamaIndex</a> integrations include Cypher search, vector search, knowledge graph representation and construction, and text-to-cypher querying. 
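The vector search that recurs across these integrations is, at its core, nearest-neighbor ranking of embeddings by similarity. As a library-agnostic toy sketch (a real deployment would use the Neo4j vector index rather than an in-memory scan):

```python
# Toy nearest-neighbor retrieval over in-memory embeddings, illustrating
# what a vector index does conceptually: rank stored vectors by cosine
# similarity to the query vector and return the top k.
from math import sqrt

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def top_k(query: list[float], docs: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """docs is a list of (id, embedding) pairs; returns the k closest ids."""
    ranked = sorted(docs, key=lambda d: cosine(query, d[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]
```

In a GraphRAG setting, the ids returned here would be node references, and a follow-up graph query would expand to the connected entities before prompting the LLM.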
We recently collaborated with the LlamaIndex team on a full revamp of the <a href="https://www.llamaindex.ai/blog/introducing-the-property-graph-index-a-powerful-new-way-to-build-knowledge-graphs-with-llms">knowledge graph integration</a> for both construction and querying.</p><p>For Deepset’s <a href="https://neo4j.com/labs/genai-ecosystem/haystack/">Haystack</a>, we got a valuable community contribution for vector search and Cypher querying, and we’ll be adding more capabilities soon.</p><p>In the Java space, we’ve integrated vector search into <a href="https://neo4j.com/labs/genai-ecosystem/spring-ai/">Spring AI</a> and <a href="https://neo4j.com/labs/genai-ecosystem/langchain4j/">LangChain4j</a>. We’ve also implemented Neo4j support for semantic memory in <a href="https://neo4j.com/labs/genai-ecosystem/semantic-kernel/">Semantic Kernel</a>. Finally, in <a href="https://neo4j.com/labs/genai-ecosystem/dspy/">DSPy</a>, we’ve added a Neo4j-based retriever module that makes use of the Neo4j vector index.</p><p>Many of these integration pages point to relevant starter kit implementations that explain how to build <a href="https://neo4j.com/labs/genai-ecosystem/rag-demo/">GraphRAG applications on the EDGAR SEC filings</a> dataset.</p><h3>Up Your Skills for GraphRAG and GenAI Development</h3><p>For a deeper dive into the Neo4j GraphRAG tools and GenAI ecosystem, take a look at our <a href="https://neo4j.com/labs/genai-ecosystem/">GenAI ecosystem pages</a>.</p><p>The pages provide much more detail on Neo4j GenAI features like embedding generation and vector search, as well as information on our cloud-native GenAI integrations with Google (Vertex AI), AWS (Bedrock), and Azure (OpenAI) — including a video tutorial for each service.</p><p>You’ll also find example GenAI projects, including GraphRAG demos, NeoConverse, and the Knowledge Graph Builder in action, explaining the functionality behind each tool.</p><p>Hands-on learning opportunities are available as well. 
We’ve collaborated with Deeplearning AI on a <a href="https://www.deeplearning.ai/short-courses/knowledge-graphs-rag/">knowledge graph course</a> that walks you through building graph-powered GenAI applications.</p><p><a href="https://www.deeplearning.ai/short-courses/knowledge-graphs-rag/">Knowledge Graphs for RAG</a></p><p>We also teach GenAI app development in depth in our free <a href="https://graphacademy.neo4j.com/categories/generative-ai/">GraphAcademy courses</a>.</p><p><a href="https://graphacademy.neo4j.com/categories/generative-ai/">Free Neo4j Generative AI Courses from GraphAcademy</a></p><p>We just ran a livestream on “<a href="https://go.neo4j.com/WBR-240618-GraphRAG-Ecosystem_Registration.html">Kickstarting your GenAI Development with Neo4j’s GraphRAG Ecosystem Tools</a>” with a lot of interesting questions — feel free to watch it here.</p><p><a href="https://go.neo4j.com/WBR-240618-GraphRAG-Ecosystem_Registration.html">Neo4j | Kickstart GenAI Development with Neo4j&#39;s GraphRAG Ecosystem</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=eec476167e86" width="1" height="1" alt=""><hr><p><a href="https://medium.com/neo4j/get-started-with-graphrag-neo4js-ecosystem-tools-eec476167e86">Get Started With GraphRAG: Neo4j’s Ecosystem Tools</a> was originally published in <a href="https://medium.com/neo4j">Neo4j Developer Blog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Google Cloud Next Recap from a GenAI and Databases Perspective]]></title>
            <link>https://medium.com/@mesirii/google-cloud-next-recap-from-a-genai-and-databases-perspective-e0bce24b4535?source=rss-3865848842f9------2</link>
            <guid isPermaLink="false">https://medium.com/p/e0bce24b4535</guid>
            <category><![CDATA[genai]]></category>
            <category><![CDATA[gemini]]></category>
            <category><![CDATA[database]]></category>
            <category><![CDATA[google-next]]></category>
            <category><![CDATA[neo4j]]></category>
            <dc:creator><![CDATA[Michael Hunger]]></dc:creator>
            <pubDate>Mon, 22 Apr 2024 14:47:12 GMT</pubDate>
            <atom:updated>2024-04-22T14:47:12.929Z</atom:updated>
<content:encoded><![CDATA[<p>Last week I was back in Vegas after only a few months, this time for my first Google Cloud Next. Somehow I never made it to the previous ones, but I’m really glad <a href="https://medium.com/u/939df5ed7a01">Sudhir Hasbe</a> convinced me to come.</p><p>It was really worth it, and I could observe him in his natural element, beloved by so many of his Google colleagues. I was surprised to see so many more women and people from underrepresented groups, way more than at other industry conferences. Kudos to Google for being so supportive of diversity.</p><p><strong>Disclaimer</strong>: It was a whirlwind of a week, so if I mixed something up, forgot to mention someone, or attributed something wrongly, please don’t be mad at me.</p><p>The post got a bit long, but I think it’s worth the read (there are a lot of pictures). I only covered the sessions I attended; there were hundreds of other awesome sessions, so check them out on the program site (see below).</p><p>Sadly, I missed the solar eclipse on Monday as I was still in transit.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*g_YGhGSX4RsvXgoL" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*pN6TPgVd_Df3b6jl" /><figcaption>Vegas at Night</figcaption></figure><p>We had a strong Neo4j presence at the conference, with two booths, a lot of folks onsite, and the Partner of the Year award for Data Management for the 2nd year in a row. It was a great opportunity to speak with a lot of customers and folks who were new or curious about graphs and GenAI. 
We had also just <a href="https://neo4j.com/blog/graphrag-genai-googlecloud/">announced new Vertex AI integrations, and DataFlow Templates in GA</a>, so there was a lot to talk about.</p><p><a href="https://neo4j.com/blog/graphrag-genai-googlecloud/">Neo4j Brings GraphRAG Capabilities for GenAI to Google Cloud - Graph Database &amp; Analytics</a></p><p>I also really enjoyed meeting and talking with folks from Google in person: Rupal Jain and Maruti from our partner team, and a number of PMs and engineering leads for databases and Vertex AI.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*yZruXPmkcZ8nh9iV" /></figure><p>It is always impressive to see the logistics of housing, feeding, and entertaining 30K+ developers in action. The lunch choices were also quite good, from poke bowls to a variety of salads. I loved the vast amounts of vinegar potato chips (my family hates them).</p><h3>Sessions, Videos and Slides</h3><p>The volume of great sessions at Google Next was just incredible; it was really hard to choose between database topics with and without AI, applied GenAI, and a lot of customers and partners presenting on their integrations and progress. Big kudos to the Google team for pulling it all off, especially as many of the sessions are already available on the <a href="https://cloud.withgoogle.com/next">Google Next Program Site</a> with slides and videos (on YouTube).</p><figure><a href="https://cloud.withgoogle.com/next"><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*NSscs5gsR-oEWNXs" /></a></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*lekc8usq2_1V0ts9" /></figure><h3>Opening Keynote</h3><p>As expected, the theme of the conference was GenAI as a “new way to help customers &amp; users”. 
Starting with <a href="https://www.youtube.com/watch?v=M-CzbTUVykg">Thomas Kurian’s opening keynote on Tuesday (short video)</a> which was packed in the massive arena, with a really cool stage setup including revolving doors and multi-shape displays.</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FM-CzbTUVykg%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DM-CzbTUVykg&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FM-CzbTUVykg%2Fhqdefault.jpg&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/39b0440178be09cfccf697ac0ee181b2/href">https://medium.com/media/39b0440178be09cfccf697ac0ee181b2/href</a></iframe><p>As I had arrived early, I had a good spot. Everything was in the theme of the AI Hypercomputer, hardware, databases, models, agents, and services. Agents (and Agent Builder) were a big thing as well as adding AI to all Google Cloud databases. Gemini 1.5 was finally made into public preview and Claude 3 is now also in the Model Garden. 
It was a bit sad that Sundar didn’t come to the event in person and there was only a recorded message.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*ovVJVaRC73YdI-iZ" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*E5zknnEmOk8gxJ96" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*xlxk3-qwrIsRpElh" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*QKM7ECx-0hvsXDvh" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*eRLekkX7VzD0oOIk" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*je4M26m7aYerUPkq" /></figure><p>The Google Vids demo by Aparna was also really impressive, looking forward to using it, which will be good as I’m a video noob :) and it was great to see a native ARM CPU from Google too (Axion).</p><h3>Databases (AlloyDB) and AI</h3><p>Even more than the keynote I enjoyed the <a href="https://youtu.be/jR2YY9_IzF4">Databases and AI session (video)</a> from <a href="https://medium.com/u/a4c7b5af1fc8">Andi Gutmans</a>, Anita, Sandy presenting with our friend <a href="https://medium.com/u/bbfa018ac706">Harrison Chase</a> and Hari Ramamurthy from Home Depot.</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FjR2YY9_IzF4%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DjR2YY9_IzF4&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FjR2YY9_IzF4%2Fhqdefault.jpg&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/bc5d1e1fdf6ecaa191ccf54bc5ff123f/href">https://medium.com/media/bc5d1e1fdf6ecaa191ccf54bc5ff123f/href</a></iframe><p>Andi showcased the AI capabilities for all databases, including vector search, embedding generation (functions), advanced text2sql and parameterized secure views to 
safeguard against credential injection attacks, with a great Cymbal Air demo by Sandy Ghai. Here is the <a href="https://cloud.google.com/alloydb/docs/ai">documentation for AlloyDB AI</a>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*m2EoQAd8gy9CSY6X" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*2fQbMvOu-2tyrzsY" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*bbQz72-B6aZckauV" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*WFMjG_b_k1LpsObu" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*NfHcFl5JThtFTjm3" /></figure><p>Andi spoke with Harrison about Google’s focus on LangChain (see below). Hari discussed GenAI opportunities at Home Depot and mentioned knowledge graphs with vector and text search as retrieval components.</p><h3>The Future of Databases and Generative AI</h3><p>There was also a really good session, “<a href="https://youtu.be/tkGHELHofPo">The future of databases and generative AI (video)</a>,” with Sailesh Krishnamurthy and Fiona Tan, the CTO of Wayfair. She discussed using AI to migrate her legacy monolith, heavy on SQL and stored procedures on SQL Server, to smaller GraphQL-fronted services backed by individual databases, and how much time was saved by doing many of the rote migrations automatically (at least to a certain degree).</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FtkGHELHofPo%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DtkGHELHofPo&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FtkGHELHofPo%2Fhqdefault.jpg&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a 
href="https://medium.com/media/81e60ece67f3f20431090f6afe4922b0/href">https://medium.com/media/81e60ece67f3f20431090f6afe4922b0/href</a></iframe><p>Sailesh introduced the <a href="https://assets.swoogo.com/uploads/3775376-6616d1d46be5a.pdf">ScANN index</a> and advanced, interactive text2sql as well as some good conceptual thinking.</p><p>(<a href="https://assets.swoogo.com/uploads/3784391-66186ba26cd46.pdf">slides</a>)</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*WDxUJap4u26Pzo2c" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*gz6ElY1sPDGxRaJL" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*1cB4eRR79UPkg5vz" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*8rn4QgjYazMy6C83" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*SdEhO6luoZImu72z" /></figure><h3>GenAI Applications with AlloyDB Omni</h3><p>Sudhir had introduced me to GG and Anita the day before, so it was nice to attend <a href="https://cloud.withgoogle.com/next/speakers?session=DBS106">GG’s session on GenAI applications with AlloyDB Omni</a> (<a href="https://assets.swoogo.com/uploads/3812145-661ec42aa09bf.pdf">slides</a>). Part of the session introduced the new AI capabilities, especially built in embedding and prediction functions and the ability to call remote models (including Hugging Face and OpenAI).</p><p>He was joined by Tabitha Lewis who did an awesome demo, showing an app running against a local embedding model (registered in the model registry) using the embedding functions. 
What’s really cool about AlloyDB Omni is that it can run everywhere: your local machine, K8s clusters, self-hosted environments, and of course GCP, with much of the functionality of the hosted version.</p><p>And even more impressive was Sharanya Desai from NeuroPace, a company producing FDA-approved electric brain stimulators for epilepsy patients, which used vector search to find patients by their brain patterns.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*cSIUuF1bsSSTM9DW" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*NsSav_IML1N3xNSF" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*t675yXqqAFoPIx-i" /></figure><h3>AI in Database Management</h3><p>Not just for app developers but also for DevOps folks and DBAs, <a href="https://youtu.be/tvoydK1TEvU">Gemini in Databases (video)</a> is helping everyone get work done more efficiently.</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FtvoydK1TEvU%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DtvoydK1TEvU&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FtvoydK1TEvU%2Fhqdefault.jpg&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/c05a111b4e3f66fb160488ec1af78a88/href">https://medium.com/media/c05a111b4e3f66fb160488ec1af78a88/href</a></iframe><p>AI was everywhere in databases, including operations, maintenance, tuning, and migrations, which are really useful capabilities. 
It can also check databases for compliance and potential cost improvements.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*WHyQkMEiSu1LZ9TX" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*xLFSZ-ocUNC55uoH" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*CZYSLI4tRJf2PNnJ" /></figure><h3>Developer Keynote</h3><p>I was speaking to a customer about exciting knowledge graph and LLM topics, so we were a bit late to the <a href="https://www.youtube.com/watch?v=KMH5OcW5UYw">Developer Keynote (video)</a>, which was a massive fireworks show presented by <a href="https://medium.com/u/cf4450e0493f">Richard Seroter</a> and <a href="https://medium.com/u/4e7db1ff5bba">Chloe Condon</a>.</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FKMH5OcW5UYw%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DKMH5OcW5UYw&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FKMH5OcW5UYw%2Fhqdefault.jpg&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/73dff2c699ab445ad9eb08ecb796efe6/href">https://medium.com/media/73dff2c699ab445ad9eb08ecb796efe6/href</a></iframe><p>It was awesome to see my friend <a href="https://medium.com/u/a17df5ec14a4">Josh Long</a> on stage presenting Spring (AI) in a whirlwind tour, as well as Guillermo Rauch with Vercel’s AI SDK and Honeycomb’s Charity Majors on AI observability, closed off by a cool Hugging Face demo by Philipp Schmid on training and deploying HF models in Vertex AI. 
Sadly I had missed Brad Calder’s part, the man whom Sudhir told me so much about.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*BW_YryNpbEREAd_F" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*xzuc-3E495B5W53x" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*nr6xBohD0UnomI0e" /></figure><h3>Good Food and Company</h3><p>One good thing about Vegas, even if it’s not my favorite place, is the good food and that you get to do 20k steps each day just walking between places to compensate for the calories. We had a really good time with the team at Beauty &amp; Essex speakeasy.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*42xSXWJ3P7L6qa-K" /></figure><h3>Vertex AI — LangChain, Extensions, Reasoning Engine</h3><p>Google is going all in on LangChain which makes me happy because we’ve been collaborating with Harrison, Jacob and the team for a long time, so <a href="https://neo4j.com/labs/genai-ecosystem/langchain/">Neo4j is well integrated with LangChain</a> and perfectly set up to participate in your GenAI experience on Google Cloud.</p><p>I’ve been working with Vertex AI extensions and Reasoning Engine for quite some months now — so I really enjoyed @Julia Wiesinger and @Kris Overholt <a href="https://cloud.withgoogle.com/next/speakers?session=AIML231">presenting the Agent integration using LangChain tools deployed with Reasoning Engine</a>. 
There was even a graph database easter egg in one of the slides :) Here is the <a href="https://docs.google.com/presentation/d/e/2PACX-1vSCpEevu0qXws0d8sxt2s09r5v4ywBd3KRSw_00VOx7PhNS6hKsSw1jO9XSdd7l3TYNX5AJ71pYJ9C0/pub?pli=1&amp;slide=id.g2c90fce9fd8_0_15">resources slide</a> with all the links you would need</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*oDE4f3pG04_k2DGq" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*94qn2pHLH1A0ROsC" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*9h8wXfnOmSUIqfsq" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*nmQ9rGGjvZUFoCC3" /></figure><p><strong><em>I’ve almost finished my blog post on Vertex AI extensions and Reasoning Engine with GraphRAG, it’s coming soon, stay tuned.</em></strong></p><p>There were a lot of LangChain related talks which kept Harrison really busy :)</p><h3>LangChain at Google Cloud Next</h3><p>The following slide describes it well. @LangChain is supported throughout the Vertex AI and databases stack. 
That includes building, running, and deploying RAG applications (with templates, LangServe, and Reasoning Engine), as well as monitoring with LangSmith (and hopefully natively in GCP soon).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*15bOccnrJ7lvBLtA" /></figure><p>Here is a small selection of the talks that touched on LangChain, but it was also used for many other applications in talks about AlloyDB Omni, the Cymbal Air demo, and more.</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2Fl7tNx52bnsc%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3Dl7tNx52bnsc&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2Fl7tNx52bnsc%2Fhqdefault.jpg&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/ef626dced943e6310619ab611194891f/href">https://medium.com/media/ef626dced943e6310619ab611194891f/href</a></iframe><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2Fhkd-JJyW-og%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3Dhkd-JJyW-og&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2Fhkd-JJyW-og%2Fhqdefault.jpg&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/9f0cda3c57edfe55f6a560855afbf95c/href">https://medium.com/media/9f0cda3c57edfe55f6a560855afbf95c/href</a></iframe><h3>Gemini Fine Tuning</h3><p>One session with great live demos was the one <a href="https://cloud.withgoogle.com/next/session-library?session=AIML305">on the newly released Gemini Fine Tuning</a> with May Hu and Anand Iyer. They discussed the whys and hows of fine-tuning and demonstrated how to do it with Gemini (training time on 500 examples, about an hour and a few hundred dollars 
of cost) and compared the base and fine-tuned model. May also introduced the student-teacher approach to fine-tuning, where a larger model can train a smaller model (e.g., Gemma 7B) to improve massively on tasks. She also discussed planning, monitoring, validation, and eval aspects of fine-tuning, as well as tips. They also shared a <a href="https://cloud.google.com/vertex-ai/generative-ai/docs/models/gemini-use-supervised-tuning">documentation page with more details</a>.</p><p>@Lavanya Ramani from Palo Alto Networks shared their journey of a chatbot, fine-tuned for their types of cybersecurity documents, including eval results and lessons learned, some of which also resonated with me.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*FDNxTh_L4kwJAxbv" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*Q8UhQDfIPy3mgA-C" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*6L0cGj1ocZuMAw9i" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*a_QBPVuXkKXYnKib" /></figure><h3>RAG &amp; Agent Patterns</h3><p>As I had been speaking with @Kurtis van Gent for a few months now, I knew that he was really interested in <a href="https://youtu.be/g7J3q-MW284">advanced patterns for RAG and Agents (video)</a> and had a lot of best practices to share from his experience implementing them.</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2Fg7J3q-MW284%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3Dg7J3q-MW284&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2Fg7J3q-MW284%2Fhqdefault.jpg&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/89f87decedeaa8c4884edc394c47927d/href">https://medium.com/media/89f87decedeaa8c4884edc394c47927d/href</a></iframe><p>So I was really excited 
for his talk and boy I was not disappointed, such a rich set of tips and guidance in a single session. Totally worth it.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*3Kvrm8Ap3MRqMQoX" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*t-ZwqWBlScpTrSot" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*LPAoybgBLUwzs0ZI" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*yQw2eXjHmOeqQfAu" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*Ra9z7lWhXh3ivKy9" /></figure><h3>BigQuery Developer Experience</h3><p>I really love the <a href="https://www.youtube.com/watch?v=yMgwMHiORHw">BigQuery Data Canvas (video)</a> with Abinav Kushraj and especially the live demo by engineering lead Mike Romano.</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FyMgwMHiORHw%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DyMgwMHiORHw&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FyMgwMHiORHw%2Fhqdefault.jpg&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/82f8317f29907bd7c34e98212badfc40/href">https://medium.com/media/82f8317f29907bd7c34e98212badfc40/href</a></iframe><p>Using an infinite canvas where you can place datasets, queries, results, visualizations and derive one from each other and generate both queries and visualizations with natural language is really a great interactive tool for data analysts.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*U7Ty6Dab-1mVHWIN" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*0G9tTQVYJOdG7OYR" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*hRxDK_HePSbgSNyx" /></figure><h3>Stackoverflow collaboration with Google 
GenAI and what it means for Developer Communities</h3><p>Almost the last talk of the conference was really interesting, and I was curious how it would turn out. As we all know, the advent of ChatGPT coincided with quite a drop in Stack Overflow usage and engagement. Stack Overflow CTO/CPO Ryan Polk and Chen Wei shared some interesting thoughts on the future of the platform. The focus seems to be on investing in strengthening the community with money made from providing the data as training/RAG data for code assistant providers like Google. In the presentation there was a nice demonstration of Gemini Code Assist (which we also contributed to) linking back to Stack Overflow as the source of the provided information. (<a href="https://assets.swoogo.com/uploads/3784222-66185fbb12390.pdf">slides</a>)</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*9EuRnwkET7_l-RBw" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*DS3QGPuu0l8UkJjd" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*xBLQOjtVBSZXow1g" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*RV8siyI8nOhpVTTa" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*tBLPiO8c61ntbh6u" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*YJkrV1gz9yyBXf6e" /></figure><p>After the talk I spoke with Ryan a bit, discussing the power of data and community, and how AI could reduce the threshold for not just asking but also answering questions, and how the timeframe of responses together with a package analytics system like <a href="https://deps.dev">https://deps.dev</a> from Google could help pinpoint versions of software discussed.</p><p>I also shared a bit about how I’ve been using Stack Overflow data in the past as a <a href="https://neo4j.com/blog/import-10m-stack-overflow-questions/">knowledge graph</a>, for <a 
href="https://towardsdatascience.com/tagoverflow-correlating-tags-in-stackoverflow-66e2b0e1117b">tag correlation</a>, and most recently with Docker, LangChain, and Ollama as part of the <a href="https://neo4j.com/developer-blog/genai-app-how-to-build/">GenAI-Stack</a>. Looking forward to future conversations.</p><h3>Blue Men and a Road Trip to the Grand Canyon</h3><p>At the end of the week I enjoyed some good company and went to see Blue Man Group, which was exhilarating, creepy, impressive, and fun.</p><p>And I went on a road trip to the Grand Canyon and Hoover Dam with my colleague and GenAI/GDS expert Zach Blumenfeld. The tour and the views were really impressive, definitely worthwhile when you’re in Vegas (it’s only a 2.5-hour drive). On the way we had the opportunity to visit my long-time team member Elaine Rosenberg, who worked on GraphAcademy and retired last year. Really glad she’s almost back to good health now. She made us fresh apple pie and we had a great time.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*dVvuldKMrMGXyJJx" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*XxuD9wblKhPqiQK3" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*VB2XZ1b1KyIKPqO7" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*kh2mi549u7PaqHCX" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*QzKTuxGe4diqQPZz" /></figure><p>On my flight back I met some Google people from Zurich and Vienna, and we had a good time. Sadly, my plane to Berlin got delayed, so I missed my bus. But that gave me time to start working on this recap. 
I hope you enjoyed it as much as I enjoyed the Wiener sausage with potato salad and Radler in Berlin Airport :)</p><p>Until next time, I hope to meet many of you later this year, perhaps at the <a href="https://www.ai.engineer/worldsfair">AI Engineer World Fair</a> in San Francisco.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*lXwkKe7bKRGt0KJ1" /></figure><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=e0bce24b4535" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Enriching LlamaIndex Models from GraphQL and Graph Databases]]></title>
            <link>https://medium.com/llamaindex-blog/enriching-llamaindex-models-from-graphql-and-graph-databases-bcaecec262d7?source=rss-3865848842f9------2</link>
            <guid isPermaLink="false">https://medium.com/p/bcaecec262d7</guid>
            <category><![CDATA[graph-database]]></category>
            <category><![CDATA[llm]]></category>
            <category><![CDATA[graphql]]></category>
            <category><![CDATA[python]]></category>
            <category><![CDATA[llamaindex]]></category>
            <dc:creator><![CDATA[Michael Hunger]]></dc:creator>
            <pubDate>Fri, 30 Jun 2023 19:28:52 GMT</pubDate>
            <atom:updated>2023-06-30T19:28:52.884Z</atom:updated>
            <content:encoded><![CDATA[<h3>Enriching LlamaIndex Models with GraphQL and Graph Databases</h3><h4>How to add new data loaders to LlamaIndex and use them.</h4><p>In this article I want to share the process of adding new data loaders to LlamaIndex. First we’ll look at what LlamaIndex is and try a simple example of providing additional context to an LLM query using a basic CSV loader. Then we’ll see how easy it is to add a new loader for graph databases to LlamaIndex. Lastly, we’ll try that new loader, and another one for GraphQL APIs that I added, in practice and see how their extra context can help an LLM answer questions better.</p><h3>Background/Context</h3><p>I was listening to the <a href="https://medium.com/llamaindex-blog/llamaindex-on-twiml-ai-a-distilled-summary-using-llamaindex-de2a88551595">&quot;This Week in ML&quot; (TWIML) Podcast</a>, where <a href="https://medium.com/u/e76da1c45ef7">Jerry Liu</a> from LlamaIndex (previously GPT-Index) explained the ideas behind the library: enriching the query context for LLMs with data from any number of sources.</p><p><a href="https://medium.com/llamaindex-blog/llamaindex-on-twiml-ai-a-distilled-summary-using-llamaindex-de2a88551595">LlamaIndex on TWIML AI: A Distilled Summary (using LlamaIndex)</a></p><p><a href="https://gpt-index.readthedocs.io/en/latest/index.html">LlamaIndex</a> is a toolkit to augment LLMs with your own (private) data using in-context learning. It takes care of selecting the right context to retrieve from large knowledge bases. To achieve that, it utilizes a number of connectors or loaders (from <a href="https://llamahub.ai/">LlamaHub</a>) and data structures (indices) to efficiently provide the pre-processed data as Documents.</p><p>Each type of index stores documents in different ways, e.g. via embeddings for vector search, or as a simple list, graph, or tree structure.
Those indices are used as a query interface to the LLM, transparently embedding the relevant context.</p><p>Besides the higher-quality response from the LLM, you also get back the documents that were used to construct the answer. LlamaIndex also allows chain-of-thought reasoning, compare/contrast queries, and natural language querying of databases.</p><p>See also this presentation from Jerry:</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FFQBou-YgxyE%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DFQBou-YgxyE&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FFQBou-YgxyE%2Fhqdefault.jpg&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/178b8b8ed7039d9b7b6b9b0d8468d51b/href">https://medium.com/media/178b8b8ed7039d9b7b6b9b0d8468d51b/href</a></iframe><p>All the code for the blog post is available in this <a href="https://colab.research.google.com/drive/1NUrIoiOh692LaQkBHEmnD-5IuLBpBqGJ#scrollTo=JN4gqQF-NRwj">Colab Notebook</a>.</p><p><a href="https://colab.research.google.com/drive/1NUrIoiOh692LaQkBHEmnD-5IuLBpBqGJ#scrollTo=JN4gqQF-NRwj">Google Colaboratory</a></p><h3>Using a Basic CSV Loader</h3><p>Here is an example of using a basic CSV loader to provide documents for LlamaIndex.</p><p>In our notebook we download countries.csv from the <a href="https://annexare.github.io/Countries/">Countries List Project</a> (MIT) (<a href="https://raw.githubusercontent.com/annexare/Countries/master/dist/countries.csv">raw source</a>).</p><p>Our dependencies are llama-index and python-dotenv.</p><pre>!pip install llama-index==0.6.19 python-dotenv</pre><p>We need to provide our OpenAI API key. To avoid accidentally leaking it in the notebook, I uploaded an openai.env file and use the dotenv library to load its contents as environment variables.</p><p>In the next step we load the env file and prepare
the OpenAI ChatGPTLLMPredictor (using gpt-3.5-turbo by default) and add it to the ServiceContext.</p><pre>import os<br>from pathlib import Path<br>from llama_index import GPTVectorStoreIndex, SimpleDirectoryReader, ServiceContext, GPTListIndex<br>from llama_index.llm_predictor.chatgpt import ChatGPTLLMPredictor<br>from dotenv import load_dotenv<br>from llama_index import download_loader<br><br>load_dotenv(&quot;openai.env&quot;)<br><br>llm_predictor = ChatGPTLLMPredictor()<br>service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor)</pre><p>Now we can use the loader to load the CSV and turn it into documents, and create a GPT index (a GPTVectorStoreIndex in this case), which LlamaIndex can then use to retrieve the relevant information to pass along in the context to the LLM.</p><p>Initializing CSV Loader and GPTVectorStoreIndex</p><pre>SimpleCSVReader = download_loader(&quot;SimpleCSVReader&quot;)<br>loader = SimpleCSVReader(concat_rows=False)<br>documents = loader.load_data(file=Path(&#39;./countries.csv&#39;))<br><br>print(documents)<br>index = GPTVectorStoreIndex.from_documents(documents, service_context=service_context)</pre><p>Documents from the CSV Loader</p><pre>[Document(text=&#39;country, capital, type&#39;, doc_id=&#39;67c30c68-7d9f-4906-945b-9affc96f95d2&#39;, embedding=None, doc_hash=&#39;3a506ebea9c04655b51406d79fdf5e3a87c3d8ff5b5387aace3e5a79711a21b8&#39;, extra_info=None),<br>Document(text=&#39;Abkhazia, Sukhumi, countryCapital&#39;, doc_id=&#39;6e6be4b5-051f-48e0-8774-6d48e0444785&#39;, embedding=None, doc_hash=&#39;ea387d0eab94cc6c59f98c473ac1f0ee64093901673b43e1c0d163bbc203026e&#39;, extra_info=None),<br>...]</pre><p>By default the CSV loader would create only a single Document for the whole file; with concat_rows=False it creates one document per CSV row, as shown above.</p><p>LlamaIndex supports much more involved setups of different kinds of indexes, allowing you to chain them and even conditionally select one
or the other. Here we just do the bare minimum to demonstrate our loaders.</p><p>After setting up the indices with the appropriate loaders, and connected indexes, we now can use the index as an LLM query engine and execute our user query.</p><p>To demonstrate that the LLM still is able to use its world knowledge, we can ask in a mix of English (System), German (Question) and French (requested Answer).</p><pre>queryEngine = index.as_query_engine()<br><br>queryEngine.query(&quot;&quot;&quot;<br>Provide the answer in French.<br>Question: Was ist die Hauptstadt von Albanien?<br>&quot;&quot;&quot;)</pre><p>As you can see in the response below it doesn’t just answer our question correctly in French La capitale de l’Albanie est Tirana., but also provides which documents it used to generate the answer.</p><pre>Response(response=&quot;La capitale de l&#39;Albanie est Tirana.&quot;, <br>source_nodes=[NodeWithScore(node=Node(text=&#39;              &lt;td&gt;Albania&lt;/td&gt;&#39;, doc_id=&#39;3decbee1-98cc-4650-a071-ed25cd3e00d5&#39;, embedding=None, doc_hash=&#39;7d9d85082095471a9663690742d2d49fc37b2ec37cc5acf4e99e006a68a17742&#39;, extra_info=None, <br>node_info={&#39;start&#39;: 0, &#39;end&#39;: 30, &#39;_node_type&#39;: &lt;NodeType.TEXT: &#39;1&#39;&gt;}, <br>relationships={&lt;DocumentRelationship.SOURCE: &#39;1&#39;&gt;: &#39;7b6c861f-2c2f-4905-a047-edfc25f7df19&#39;}), score=0.7926356007369129), <br>NodeWithScore(node=Node(text=&#39;              &lt;td&gt;Algiers&lt;/td&gt;&#39;, doc_id=&#39;8111b737-9f45-4855-8cd8-f958d4eb0ccd&#39;, embedding=None, doc_hash=&#39;8570a02a057a6ebbd0aff6d3f63c9f29a0ee858a81d913298d31b025101d1e44&#39;, <br>extra_info=None, node_info={&#39;start&#39;: 0, &#39;end&#39;: 30, &#39;_node_type&#39;: &lt;NodeType.TEXT: &#39;1&#39;&gt;}, relationships={&lt;DocumentRelationship.SOURCE: &#39;1&#39;&gt;: &#39;22e11ac6-8375-4d0c-91c6-4750fc63a375&#39;}), score=0.7877589022795918)], extra_info={&#39;3decbee1-98cc-4650-a071-ed25cd3e00d5&#39;: 
None, &#39;8111b737-9f45-4855-8cd8-f958d4eb0ccd&#39;: None})</pre><h3>LlamaIndex Loaders</h3><p>The number of existing data sources in <a href="https://llamahub.ai/">LlamaHub</a> is impressive; I counted 100+ integrations in <a href="https://github.com/emptycrown/llama-hub">the repository</a>. You can find anything from Google Docs to GitHub to relational databases.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*CiMuFLSaFdMTdMoAcoBJog.png" /><figcaption>LlamaHub, screenshot by Author</figcaption></figure><p>But I was missing two of my favorite technologies: GraphQL, the API query language open-sourced by Facebook, and graph databases like Neo4j, the best way to store and manage large amounts of connected data, for example in knowledge graphs.</p><blockquote>So I thought: &quot;How hard can it be to add them :)&quot;</blockquote><h3>Adding the New Loaders</h3><p>Adding new loaders is really straightforward. There is a script in the llama-hub repository to help with adding a new loader.
Running ./add-loader.sh &lt;folder&gt; added the skeleton files.</p><p>To get familiar with the existing implementations, I looked at the <a href="https://github.com/emptycrown/llama-hub/tree/main/llama_hub/database">Databases (relational)</a> and <a href="https://github.com/emptycrown/llama-hub/tree/main/llama_hub/mongo">MongoDB integrations</a>, the former for the graph database loader and the latter for the GraphQL one.</p><p>It was easy enough: we only needed to declare the requirements for our loader, implement base.py with a straightforward API, and add a README.md with an explanation and a code example.</p><p>The main difference between my loaders and the existing ones is that they don’t use hard-coded field names for extracting the relevant value from the query result, but instead turn the result into YAML.</p><p>I picked YAML not because I like it, but because it was the closest textual representation of a nested tree of key-value pairs that a user would write as nested bullet lists.</p><p>Below is the example code for the graph database implementation (the GraphQL one is similar).</p><h3>Adding the Graph Database Loader</h3><p>I added the requirement for the neo4j dependency, a Python driver that speaks the Cypher query language over the Bolt protocol and also works with Memgraph and AWS Neptune.</p><p>Then I added the code for <em>__init__</em> to take in a database server URI, database name, and credentials, and to create a driver instance.</p><p>The load_data method takes in the query to run and optional parameters.
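A quick aside before the implementation: the YAML representation mentioned above can be illustrated with a toy sketch in plain Python. This is not the loader code (the real loaders simply call yaml.dump on each record); the function name and exact formatting here are mine.

```python
# Toy sketch only: turn a nested query-result row into indented
# key-value text, roughly what yaml.dump produces for each record.
def to_text(value, indent=0):
    pad = "  " * indent
    if isinstance(value, dict):
        out = ""
        for k, v in value.items():
            if isinstance(v, (dict, list)):
                out += f"{pad}{k}:\n" + to_text(v, indent + 1)
            else:
                out += f"{pad}{k}: {v}\n"
        return out
    if isinstance(value, list):
        return "".join(f"{pad}- {v}\n" for v in value)
    return f"{pad}{value}\n"

# A row like the movie results later in this post
row = {"title": "Heat", "year": 1995, "names": ["Action", "Crime", "Thriller"]}
print(to_text(row), end="")
```

The point is that no field names are hard-coded: whatever shape the query returns becomes readable text for the LLM.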
It’s implemented by calling the driver’s execute_query method.</p><p>Each row of results is mapped into a LlamaIndex Document, with the text being the YAML representation of that row.</p><pre>&quot;&quot;&quot;Graph Database Cypher Reader.&quot;&quot;&quot;<br><br>from typing import Dict, List, Optional<br><br>from llama_index.readers.base import BaseReader<br>from llama_index.readers.schema.base import Document<br><br>import yaml<br><br>class GraphDBCypherReader(BaseReader):<br>    &quot;&quot;&quot;Graph database Cypher reader.<br><br>    Combines all Cypher query results into the Document type used by LlamaIndex.<br><br>    Args:<br>        uri (str): Graph Database URI<br>        username (str): Username<br>        password (str): Password<br><br>    &quot;&quot;&quot;<br><br>    def __init__(<br>        self,<br>        uri: str,<br>        username: str,<br>        password: str,<br>        database: str<br>    ) -&gt; None:<br>        &quot;&quot;&quot;Initialize with parameters.&quot;&quot;&quot;<br>        try:<br>            from neo4j import GraphDatabase, basic_auth<br><br>        except ImportError:<br>            raise ImportError(<br>                &quot;`neo4j` package not found, please run `pip install neo4j`&quot;<br>            )<br>        # fail fast on a missing URI instead of silently skipping driver creation<br>        if not uri:<br>            raise ValueError(&quot;`uri` must be provided.&quot;)<br>        self.client = GraphDatabase.driver(uri=uri, auth=basic_auth(username, password))<br>        self.database = database<br><br>    def load_data(<br>        self, query: str, parameters: Optional[Dict] = None<br>    ) -&gt; List[Document]:<br>        &quot;&quot;&quot;Run the Cypher with optional parameters and turn results into documents<br><br>        Args:<br>            query (str): Graph Cypher query string.<br>            parameters (Optional[Dict]): optional query parameters.<br><br>        Returns:<br>            List[Document]: A list of documents.<br><br>        
&quot;&quot;&quot;<br>        if parameters is None:<br>            parameters = {}<br><br>        records, summary, keys = self.client.execute_query(query, parameters, database_=self.database)<br><br>        documents = [Document(yaml.dump(entry.data())) for entry in records]<br><br>        return documents</pre><p>You’re now ready to use the data loader: simply import `GraphDBCypherReader` and follow the steps below.</p><p>If you wish to submit the loader to LlamaHub, the process is fairly straightforward. After adding an example to the readme that uses an always-on demo server with StackOverflow data, I was ready to create a <a href="https://github.com/emptycrown/llama-hub/pull/266">pull request</a>. After a short discussion, the PR was quickly merged.</p><p>Thanks a lot, Jerry, for the smooth experience.</p><p>Now let’s see how to use our two loaders.</p><h3>Using the Graph Database Loader</h3><p>The GraphDB Cypher loader connects to graph databases, which are specialized databases that store data not in tables but in entities (<em>Nodes</em>) and their <em>Relationships</em>. Because they are schema-free, you can store real-world knowledge without compromising on richness.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*rrH_pwyIEriXFtc_6yjTDA.png" /><figcaption>Image for “Network Graph” generated by Midjourney by Author</figcaption></figure><p>Relationships can also hold attributes, which can represent time, weights, costs, or whatever defines the concrete relationship. Any node can have as many or as few attributes or relationships as needed.</p><blockquote>To query a graph database you can use the <em>Cypher</em> query language, a pattern-based language that expresses those relationships in visual ASCII-art patterns.
You encircle nodes in parentheses <em>()</em> and draw relationships as arrows <em>--&gt;</em>, with additional constraints put in square brackets. Beyond that, Cypher provides many features known from SQL and also supports many graph operations, as well as handling data structures like nested documents, lists, and dicts.</blockquote><p>Let’s use a movie graph database and ask the LLM a question about <strong><em>common action movie plots</em></strong>.</p><p>Setting up the ServiceContext and the ChatGPTLLMPredictor is the same as before.</p><p>Then we get the GraphDBCypherReader and connect it to our database (with a small example movie graph from <a href="https://themoviedb.org">TheMovieDB</a>, used with permission).</p><pre>GraphDBCypherReader = download_loader(&#39;GraphDBCypherReader&#39;)<br><br>reader = GraphDBCypherReader(uri = &quot;neo4j+s://demo.neo4jlabs.com&quot;, \<br>    username = &quot;recommendations&quot;, password = &quot;recommendations&quot;, database = &quot;recommendations&quot;)</pre><p>Then we define our query to the graph database with a year parameter that allows us to pick more recent movies.
When loading the data, each row of results should turn into one Document where the text property of the document is the YAML representation of the row.</p><pre>query = &quot;&quot;&quot;<br>    MATCH (m:Movie)-[rel:ACTED_IN|DIRECTED|IN_GENRE]-(other)<br>    WHERE $year &lt; m.year and m.imdbRating &gt; $rating<br>    WITH m, type(rel) as relation, collect(other.name) as names<br>    RETURN m.title as title, m.year as year, m.plot as plot, relation, names<br>    ORDER BY m.year ASC<br>&quot;&quot;&quot;<br><br>documents = reader.load_data(query, parameters = {&quot;year&quot;:1990,&quot;rating&quot;:8})<br>index = GPTVectorStoreIndex.from_documents(documents, service_context=service_context)<br><br>print(len(documents))<br>print(documents[0:5])</pre><p>The output will look similar to the following:</p><pre>829<br>[Document(text=&#39;names:\n- Saifei He\n- Li Gong\n- Jingwu Ma\n- Cuifen Cao\nplot: A young woman becomes the fourth wife of a wealthy lord, and must learn to live\n  with the strict rules and tensions within the household.\nrelation: ACTED_IN\ntitle: Raise the Red Lantern (Da hong deng long gao gao gua)\nyear: 1991\n&#39;, doc_id=&#39;782d9a63-251b-4bb8-aa3d-5d8f6d1fb5d2&#39;, embedding=None, doc_hash=&#39;f9fd966bc5f2234e94d09efebd3be008db8c891f8666c1a364abf7812f5d7a1c&#39;, extra_info=None), Document(text=&#39;names:\n- Yimou Zhang\nplot: A young woman becomes the fourth wife of a wealthy lord, and must learn to live\n  with the strict rules and tensions within the household.\nrelation: DIRECTED\ntitle: Raise the Red Lantern (Da hong deng long gao gao gua)\nyear: 1991\n&#39;, doc_id=&#39;2e13caf6-b9cf-4263-a264-7121bc77d1ee&#39;, embedding=None, doc_hash=&#39;e1f340ed1fac2f1b8d6076cfc2c9e9cb0109d5d11e5dcdbf3a467332f5995cb1&#39;, extra_info=None), ...]</pre><p>Now we can use our index to run a LLM query to answer the questions we wanted to pose.</p><pre>queryEngine= index.as_query_engine()<br><br>queryEngine.query(&quot;&quot;&quot;<br>What are the most 
common plots in action movies?<br>&quot;&quot;&quot;)</pre><p>The answer shows that the LLM can utilize the inputs, understands the genre &quot;action movies&quot; and can summarize their plots. Here is its answer.</p><blockquote>Based on the given context information, it appears that the most common plots in action movies are heists and battles against controlling forces. However, it is important to note that this conclusion is based on a limited sample size and may not be representative of all action movies.</blockquote><pre>Response(response=&#39;Based on the given context information, it appears that the most common plots in action movies are heists and battles against controlling forces. However, it is important to note that this conclusion is based on a limited sample size and may not be representative of all action movies.&#39;,<br><br><br>source_nodes=[NodeWithScore(node=Node(text=&#39;names:\n- Action\n- Crime\n- Thriller\nplot: A group of professional bank robbers start to feel the heat from police when\n  they unknowingly leave a clue at their latest heist.\nrelation: IN_GENRE\ntitle: Heat\nyear: 1995\n&#39;, doc_id=&#39;bb117618-1cce-4cec-bd9b-8645ab0b50a3&#39;, embedding=None, doc_hash=&#39;4d493a9f33eb7a1c071756f61e1975ae5c313ecd42243f81a8827919a618468b&#39;, extra_info=None, node_info={&#39;start&#39;: 0, &#39;end&#39;: 215, &#39;_node_type&#39;: &lt;NodeType.TEXT: &#39;1&#39;&gt;}, relationships={&lt;DocumentRelationship.SOURCE: &#39;1&#39;&gt;: &#39;dbfffdae-d88c-49e2-9d6b-83dad427a3f3&#39;}), score=0.8247381316731472), NodeWithScore(node=Node(text=&#39;names:\n- Thriller\n- Sci-Fi\n- Action\nplot: A computer hacker learns from mysterious rebels about the true nature of his\n  reality and his role in the war against its controllers.\nrelation: IN_GENRE\ntitle: Matrix, The\nyear: 1999\n&#39;, doc_id=&#39;c4893c61-32ee-4d05-b559-1f65a5197e5e&#39;, embedding=None, doc_hash=&#39;0b6a080bf712548099c5c8c1b033884a38742c73dc23d420ac2e677e7ece82f4&#39;, 
extra_info=None, node_info={&#39;start&#39;: 0, &#39;end&#39;: 227, &#39;_node_type&#39;: &lt;NodeType.TEXT: &#39;1&#39;&gt;}, relationships={&lt;DocumentRelationship.SOURCE: &#39;1&#39;&gt;: &#39;6c8dea11-1371-4f5a-a1a1-7f517f027008&#39;}), score=0.8220633045996049)], extra_info={&#39;bb117618-1cce-4cec-bd9b-8645ab0b50a3&#39;: None, &#39;c4893c61-32ee-4d05-b559-1f65a5197e5e&#39;: None})</pre><h3>Using the GraphQL Loader</h3><p>The GraphQL loader is similarly easy to use.</p><p><a href="https://graphql.org">GraphQL</a> is not a database query language but an API query language based on a strict schema expressed in &quot;type definitions&quot;. There you express your entities and their attributes (fields), both for scalar datatypes and for object datatypes pointing to other entities.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*JBLtKqmxBZjxiLxgmgct4w.png" /><figcaption>What is GraphQL from GraphQL.org, Screenshot by Author</figcaption></figure><p>GraphQL itself is a tree-based query language that expresses a nested structure of data that you want to fetch, starting from a root query. The fields of every entity returned from that query can be selected, and for object fields you can further select fields from the referenced entity, and so on, almost ad infinitum (API limits apply).</p><p>There are a number of GraphQL libraries, most notably the JavaScript reference implementation, but also gql for Python, and integrations like Hasura, Prisma, or the <a href="https://neo4j.com/product/graphql-library/">Neo4j-GraphQL-Library</a>. Several larger projects, including GitHub, Spotify, and Twitter, now provide GraphQL APIs.</p><p>The demo is similar to our first one. We use a public GraphQL endpoint (<a href="https://countries.trevorblades.com/">https://countries.trevorblades.com/</a>) that provides a structure of continent→country→capital.
(<a href="https://github.com/trevorblades/countries">Licensed under MIT</a>)</p><p>A subset of the type-definition is here.</p><pre>type Query {<br>    continent(code: ID!): Continent<br>    continents(filter: ContinentFilterInput = {}): [Continent!]!<br>    countries(filter: CountryFilterInput = {}): [Country!]!<br>    country(code: ID!): Country<br>    language(code: ID!): Language<br>    languages(filter: LanguageFilterInput = {}): [Language!]!<br>}<br><br>type Continent {<br>    code: ID!<br>    countries: [Country!]!<br>    name: String!<br>}<br><br>type Country {<br>    awsRegion: String!<br>    capital: String<br>    code: ID!<br>    continent: Continent!<br>    currencies: [String!]!<br>    currency: String<br>    emoji: String!<br>    emojiU: String!<br>    languages: [Language!]!<br>    name(lang: String): String!<br>    native: String!<br>    phone: String!<br>    phones: [String!]!<br>    states: [State!]!<br>    subdivisions: [Subdivision!]!<br>}<br>...</pre><p>In our demo, we again define the ServiceContext with the ChatGPTLLMPredictor as before. Then we get the GraphQLReader loader and point it to the URL of the endpoint. You can also provide additional HTTP-Headers, e.g. 
for authentication.</p><pre>GraphQLReader = download_loader(&#39;GraphQLReader&#39;)<br>reader = GraphQLReader(uri = &quot;https://countries.trevorblades.com/&quot;, headers = {})</pre><pre>query = &quot;&quot;&quot;<br>query getContinents {<br>  continents {<br>    name<br>    countries {<br>      name<br>      capital<br>    }<br>  }<br>}<br>&quot;&quot;&quot;<br>documents = reader.load_data(query, variables = {})<br>print(len(documents))<br>print(documents)</pre><p>We see that it finds 7 continents with their countries and capitals; each of the root results (a continent) is turned into a document.</p><pre>7<br>[Document(text=&#39;countries:\n- capital: Luanda\n  name: Angola\n- capital: Ouagadougou\n  name: Burkina Faso\n- capital: Bujumbura\n  name: Burundi\n- capital: Porto-Novo\n  name: Benin\n- capital: Gaborone\n  name: Botswana\n- capital: Kinshasa\n  name: Democratic Republic of the Congo\n- capital: Bangui\n  name: Central African Republic\n....&#39;,doc_id=&#39;b82fec36-5e82-4246-b7ab-f590bf6741ab&#39;, embedding=None, doc_hash=&#39;a4caa760423d6ca861b9332f386add3c449f1683168391ae10f7f73a691a2240&#39;, extra_info=None)]</pre><p>Again we stress the LLM only a little bit by asking it in German, &quot;Which capitals are in North America&quot;.</p><pre>index = GPTVectorStoreIndex.from_documents(documents, service_context=service_context)<br>queryEngine = index.as_query_engine()<br><br>response = queryEngine.query(&quot;&quot;&quot;<br>Question: Welche Hauptstädte liegen in Nordamerika?<br>Answer:<br>&quot;&quot;&quot;)<br><br>response.response</pre><p>I was surprised, as I had only expected a handful of countries and cities. But we get 27 countries in North America. This shows how our perception is skewed by a Western worldview.</p><pre>Die Hauptstädte, die in Nordamerika liegen, sind Ottawa, San Jos\xE9, Havana, Willemstad, Roseau, Santo Domingo, St.
George&#39;s, Nuuk, Guatemala City, Tegucigalpa, Port-au-Prince, Kingston, Basseterre, George Town, Castries, Marigot, Fort-de-France, Plymouth, Mexico City, Managua, Panama City, Saint-Pierre, San Juan, San Salvador, Philipsburg, Cockburn Town, Port of Spain, Washington D.C., Kingstown und Road Town.</pre><p>We could also flip the GraphQL query around and then get 250 countries with their respective capitals and continents.</p><pre>query = &quot;&quot;&quot;<br>query getCountries {<br>  countries {<br>    name<br>    capital<br>    continent {<br>        name<br>    }<br>  }<br>}<br>&quot;&quot;&quot;<br>documents = reader.load_data(query, variables = {})<br>print(len(documents))<br>print(documents)</pre><p>Both document lists should work equally well, but let’s see.</p><p>This time the answer from the LLM was much more limited. I’m not sure if that was because the index fed the LLM fewer documents to pick from.</p><pre>index = GPTVectorStoreIndex.from_documents(documents, service_context=service_context)<br>queryEngine= index.as_query_engine()<br><br>response = queryEngine.query(&quot;&quot;&quot;<br>Question: Which capitals are in North America?<br>Answer:<br>&quot;&quot;&quot;)<br><br>response.response</pre><pre>Washington D.C. and Mexico City are in North America.</pre><h3>Conclusion</h3><p>It was really smooth to add new data loaders to LlamaHub, thanks a lot to <a href="https://medium.com/u/e76da1c45ef7">Jerry Liu</a> for making it so easy. 
Please let me know what you’re doing with these loaders and if you have any feedback.</p><p>If I find time in the next weeks I also want to look into the KnowledgeGraphIndex and see if my graph database loader can nicely populate that one.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=bcaecec262d7" width="1" height="1" alt=""><hr><p><a href="https://medium.com/llamaindex-blog/enriching-llamaindex-models-from-graphql-and-graph-databases-bcaecec262d7">Enriching LlamaIndex Models from GraphQL and Graph Databases</a> was originally published in <a href="https://medium.com/llamaindex-blog">LlamaIndex Blog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Analyzing Software Dependencies With deps.dev — Discover AuraDB Free (Week 49)]]></title>
            <link>https://medium.com/neo4j/week-49-analyzing-software-dependencies-with-deps-dev-47ac8edf9c1a?source=rss-3865848842f9------2</link>
            <guid isPermaLink="false">https://medium.com/p/47ac8edf9c1a</guid>
            <category><![CDATA[dependency-management]]></category>
            <category><![CDATA[package-management]]></category>
            <category><![CDATA[data-visualization]]></category>
            <category><![CDATA[software-analysis]]></category>
            <category><![CDATA[graph-database]]></category>
            <dc:creator><![CDATA[Michael Hunger]]></dc:creator>
            <pubDate>Thu, 25 May 2023 20:39:01 GMT</pubDate>
            <atom:updated>2023-05-26T15:37:25.013Z</atom:updated>
            <content:encoded><![CDATA[<h3>Analyzing Software Dependencies With deps.dev — Discover AuraDB Free (Week 49)</h3><h4>This week we looked at software dependencies, an important use case within software analytics for graph databases. You can understand not only which libraries your software uses, both directly and indirectly, but also how you’re affected by software vulnerabilities.</h4><blockquote>If you missed it: the <a href="https://neo4j.com/nodes-2023">call for papers for our online developer conference NODES 2023</a> is open till June 30th, but if you submit early you might be selected as a featured speaker.</blockquote><figure><a href="https://neo4j.com/nodes-2023"><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*X1NCLQNEPcEswMJN47l0gQ.png" /></a></figure><p>Two years ago, Google launched <a href="https://deps.dev">https://deps.dev</a>, an open source package dependency database that makes package information from these systems available:</p><ul><li>npm (JavaScript)</li><li>PyPI (Python)</li><li>Maven (Java/JVM)</li><li>Cargo (Rust)</li><li>NuGet (.NET)</li><li>Go</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*0yF_x7u_XOpXCejC_K8nug.png" /></figure><p>It even talks about dependency graphs in its &quot;How it works&quot; section.</p><blockquote>The service repeatedly examines sites such as github.com, npmjs.com, and pkg.go.dev to find up-to-date information about open source software packages. Using that information it builds for each package the full dependency graph from scratch—not just from package lock files—connecting it to the packages it depends on and to those that depend on it. And then does it all again to keep the information fresh.
This transitive dependency graph allows problems in any package to be made visible to the owners and users of any software they affect.</blockquote><p>If you’d rather watch the recording of the livestream, you can find it here:</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FQOu5VAsCAoA%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DQOu5VAsCAoA&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FQOu5VAsCAoA%2Fhqdefault.jpg&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/23b7ea3d09f3f7d8477d7ada1338efb2/href">https://medium.com/media/23b7ea3d09f3f7d8477d7ada1338efb2/href</a></iframe><p>Back then I threw together a quick <a href="https://gist.github.com/jexp/9541bbfc17e0383d1e55388a7d7e3162">script to load the data</a> via the unofficial REST API that powered the site.</p><p>And tweeted about it:</p><h3>Michael Hunger 🇪🇺 🇺🇦 @mesirii@chaos.social on Twitter: &quot;Neat, REST API, let&#39;s do @Neo4j call apoc.load.json(&quot;https://t.co/2CEy0rS9Ro&quot;) yield value as vmerge (p:Package {name:v.package .name, version:v.version})with * unwind v.dependencies as dmerge (o:Package {name:d.package .name, version:d.version})merge (p)-[:DEPENDS_ON]-&gt;(o) https://t.co/XHdMJxCT6V pic.twitter.com/r4XncqPTXi / Twitter&quot;</h3><p>But meanwhile, they have published <strong>an API</strong> that we can use to access the data.
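As a small convenience for the examples that follow, here is a hypothetical helper (the function and its name are mine, but the URL shapes match the endpoints used in this post) for composing those v3alpha URLs:

```python
# Hypothetical helper: compose deps.dev v3alpha endpoint URLs for a
# package version, optionally the ":dependencies" variant that returns
# the dependency graph. The URL shapes mirror the examples in this post.
BASE = "https://api.deps.dev/v3alpha"

def version_url(system, name, version, dependencies=False):
    url = f"{BASE}/systems/{system}/packages/{name}/versions/{version}"
    return url + ":dependencies" if dependencies else url

print(version_url("npm", "react", "18.2.0"))
print(version_url("pypi", "tensorflow", "2.12.0", dependencies=True))
```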
The <a href="https://docs.deps.dev/api/v3alpha/">API docs</a> are minimal, but good enough for our purposes.</p><p>The endpoint for getting basic information about a package is straightforward but doesn’t give us a lot of data; more interesting is the <a href="https://docs.deps.dev/api/v3alpha/#getversion">information per version</a>, which also lists licenses, security vulnerabilities, and links (homepage, repo, issue tracker).</p><p>Here is the example for React (no security vulnerabilities):</p><p><a href="https://api.deps.dev/v3alpha/systems/npm/packages/react/versions/18.2.0">https://api.deps.dev/v3alpha/systems/npm/packages/react/versions/18.2.0</a></p><pre>{<br>    &quot;versionKey&quot;: {<br>        &quot;system&quot;: &quot;NPM&quot;,<br>        &quot;name&quot;: &quot;react&quot;,<br>        &quot;version&quot;: &quot;18.2.0&quot;<br>    },<br>    &quot;isDefault&quot;: true,<br>    &quot;licenses&quot;: [<br>        &quot;MIT&quot;<br>    ],<br>    &quot;advisoryKeys&quot;: [],<br>    &quot;links&quot;: [<br>        {<br>            &quot;label&quot;: &quot;HOMEPAGE&quot;,<br>            &quot;url&quot;: &quot;https://reactjs.org/&quot;<br>        },<br>        {<br>            &quot;label&quot;: &quot;ISSUE_TRACKER&quot;,<br>            &quot;url&quot;: &quot;https://github.com/facebook/react/issues&quot;<br>        },<br>        {<br>            &quot;label&quot;: &quot;ORIGIN&quot;,<br>            &quot;url&quot;: &quot;https://registry.npmjs.org/react/18.2.0&quot;<br>        },<br>        {<br>            &quot;label&quot;: &quot;SOURCE_REPO&quot;,<br>            &quot;url&quot;: &quot;git+https://github.com/facebook/react.git&quot;<br>        }<br>    ]<br>}</pre><p>But we’re more interested in the graph, so let’s go directly for the package dependencies.</p><h3>Dependencies of a Package</h3><p>You can <a href="https://deps.dev/pypi/tensorflow/2.12.0/dependencies">find the dependencies of a package (like TensorFlow) in the UI</a>.</p><figure><img alt=""
src="https://cdn-images-1.medium.com/max/1024/1*zU0GIU6_IUXXkCU1Wtv58w.png" /><figcaption>Dependencies of Tensorflow</figcaption></figure><p>The <a href="https://docs.deps.dev/api/v3alpha/#getdependencies">API docs are here</a>.</p><p>Loading the data for the TensorFlow package via the API uses the system, name, and version of the package in the URL.</p><p><a href="https://api.deps.dev/v3alpha/systems/pypi/packages/tensorflow/versions/2.12.0:dependencies">https://api.deps.dev/v3alpha/systems/pypi/packages/tensorflow/versions/2.12.0:dependencies</a></p><p>It responds with JSON that is already in a graph format:</p><pre>{<br>&quot;nodes&quot;: [<br>{<br>    &quot;versionKey&quot;: {<br>    &quot;system&quot;: &quot;PYPI&quot;,<br>    &quot;name&quot;: &quot;tensorflow&quot;,<br>    &quot;version&quot;: &quot;2.12.0&quot;<br>    },<br>    &quot;bundled&quot;: false,<br>    &quot;relation&quot;: &quot;SELF&quot;,<br>    &quot;errors&quot;: []<br>},<br>{<br>    &quot;versionKey&quot;: {<br>    &quot;system&quot;: &quot;PYPI&quot;,<br>    &quot;name&quot;: &quot;absl-py&quot;,<br>    &quot;version&quot;: &quot;1.4.0&quot;<br>    },<br>    &quot;bundled&quot;: false,<br>    &quot;relation&quot;: &quot;DIRECT&quot;,<br>    &quot;errors&quot;: []<br>},...],<br>&quot;edges&quot;: [<br>{<br>    &quot;fromNode&quot;: 0,<br>    &quot;toNode&quot;: 1,<br>    &quot;requirement&quot;: &quot;&gt;=1.0.0&quot;<br>},<br>{<br>    &quot;fromNode&quot;: 0,<br>    &quot;toNode&quot;: 2,<br>    &quot;requirement&quot;: &quot;&gt;=1.6.0&quot;<br>},<br>{<br>    &quot;fromNode&quot;: 0,<br>    &quot;toNode&quot;: 6,<br>    &quot;requirement&quot;: &quot;&gt;=2.0&quot;<br>}, ... 
]}</pre><p>The response contains data in a graph format: first a list of nodes, then a list of edges with fromNode and toNode (indexes into the nodes array) and a semantic version requirement.</p><p>To load the data from the API we use apoc.load.json, which provides the response as a nested Cypher structure.</p><pre>call apoc.load.json(&quot;https://api.deps.dev/v3alpha/systems/pypi/packages/tensorflow/versions/2.12.0:dependencies&quot;)<br>yield value as r</pre><p>We can now import the data by creating the nodes first and then collecting them into an array again to provide the index lookup for the edges. We encode the &quot;system&quot; (here &quot;pypi&quot;) as an additional label :PyPi on our :Package nodes, which also carries the uniqueness constraint on name.</p><pre>create constraint package_pypi if not exists for (p:PyPi) require (p.name) is unique</pre><p>In a real system we would create separate version nodes for each package and link to those; here, for simplicity, we stick with the :Package nodes only.</p><p>We then iterate over the nodes with UNWIND within a CALL subquery to create the nodes. 
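</p><p>The index-based edge scheme can also be resolved in plain Python first, to sanity-check a response before writing any Cypher. A small hypothetical helper, with field names taken from the response above:</p>

```python
def resolve_edges(graph):
    """Turn deps.dev's index-based edges into (from, to, requirement) triples.

    `graph` is the parsed JSON response with "nodes" and "edges" lists;
    each edge references nodes by their position in the nodes array.
    """
    names = [node["versionKey"]["name"] for node in graph["nodes"]]
    triples = []
    for edge in graph["edges"]:
        triples.append((names[edge["fromNode"]],
                        names[edge["toNode"]],
                        edge["requirement"]))
    return triples
```

<p>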
And then do a second subquery for the relationships.</p><pre>with &quot;pypi&quot; as system, &quot;tensorflow&quot; as name, &quot;2.12.0&quot; as version<br><br>call apoc.load.json(&quot;https://api.deps.dev/v3alpha/systems/&quot;+system+&quot;/packages/&quot;<br>                    +name+&quot;/versions/&quot;+version+&quot;:dependencies&quot;)<br>yield value as r<br>// create nodes<br>call { with r<br>        unwind r.nodes as package<br>        merge (p:Package:PyPi {name:package.versionKey.name}) on create set p.version = package.versionKey.version<br>        return collect(p) as packages<br>}<br>// create relationships by linking nodes<br>call { with r, packages<br>        unwind r.edges as edge<br>        with packages[edge.fromNode] as from, packages[edge.toNode] as to, edge<br>        merge (from)-[rel:DEPENDS_ON]-&gt;(to) ON CREATE SET rel.requirement = edge.requirement<br>        return count(*) as numRels<br>}<br>return size(packages) as numPackages, numRels</pre><p>Now we can visualize the data in the Query UI by running <br>MATCH path=(:PyPi {name:&quot;tensorflow&quot;})-[:DEPENDS_ON*]-&gt;() RETURN path</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*UM7U0G8OEgPLY2oxSwqdNg.png" /></figure><p>Or we can head over to &quot;Explore&quot; and visualize it in the hierarchical layout and also find the shortest paths between packages visually.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*xiPQji5rsv3oYYtxGuCMbw.png" /><figcaption>Explore dependencies with the hierarchical layout</figcaption></figure><p>We can also use the packages that we already have imported into our graph to fetch their dependencies.</p><p>To achieve that we replace the hardcoded initial data for package and version with data from the graph. 
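</p><p>This repeated fetch-and-mark step amounts to a visited-set traversal. As a rough Python sketch, with a hypothetical fetch_deps callback standing in for the API call:</p>

```python
def crawl(start, fetch_deps):
    """Import reachable packages, fetching each one at most once.

    fetch_deps(name) returns the direct dependency names of a package;
    the `imported` set plays the role of a marker on already-loaded nodes.
    """
    imported = set()
    frontier = [start]
    while frontier:
        name = frontier.pop()
        if name in imported:
            continue
        imported.add(name)
        for dep in fetch_deps(name):
            if dep not in imported:
                frontier.append(dep)
    return imported
```

<p>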
We also set an additional property (or label) to indicate which packages have already been loaded.</p><pre>match (root:Package:PyPi) where root.imported is null<br>set root.imported = true<br>with &quot;pypi&quot; as system, root.name as name, root.version as version<br>call apoc.load.json(&quot;https://api.deps.dev/v3alpha/systems/&quot;+system+&quot;/packages/&quot;<br>                    +name+&quot;/versions/&quot;+version+&quot;:dependencies&quot;)<br>yield value as r<br>call { with r<br>        unwind r.nodes as package<br>        merge (p:Package:PyPi {name:package.versionKey.name}) on create set p.version = package.versionKey.version<br>        return collect(p) as packages<br>}<br>call { with r, packages<br>        unwind r.edges as edge<br>        with packages[edge.fromNode] as from, packages[edge.toNode] as to, edge<br>        merge (from)-[rel:DEPENDS_ON]-&gt;(to) ON CREATE SET rel.requirement = edge.requirement<br>        return count(*) as numRels<br>}<br>return size(packages) as numPackages, numRels</pre><h3>Loading Dependents</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*i7-xPKfkaG730RAMR4xS8A.png" /></figure><p>The UI also shows dependents (i.e. packages that use the current package), which we could infer inversely from our imported data too. Unfortunately, there is no API call for this, so we need to get the REST API call for the UI, which is the following:</p><p><a href="https://deps.dev/_/s/pypi/p/tensorflow/v/2.12.0/dependents">https://deps.dev/_/s/pypi/p/tensorflow/v/2.12.0/dependents</a></p><p>It has a different response format and only lists 100 results, but that’s better than nothing for demonstration purposes. 
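</p><p>If you want to inspect that undocumented response in Python before importing it, here is a hedged sketch; the directSample field names mirror what the Cypher below uses, so treat the shape as an assumption:</p>

```python
def direct_dependents(doc):
    """Extract (name, version) pairs from the UI endpoint's directSample list.

    The endpoint is undocumented, so this response shape is an assumption
    based on the fields the import statement reads.
    """
    pairs = []
    for entry in doc.get("directSample", []):
        pairs.append((entry["package"]["name"], entry.get("version")))
    return pairs
```

<p>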
We can pick the directSample list of entries and connect them to our root package that we start with.</p><pre>with &quot;pypi&quot; as system, &quot;tensorflow&quot; as name, &quot;2.12.0&quot; as version<br>merge (root:PyPi { name:name}) on create set root.version = version<br>with *<br>call apoc.load.json(&quot;https://deps.dev/_/s/&quot;+system+&quot;/p/&quot;+name+&quot;/v/&quot;+version+&quot;/dependents&quot;)<br>yield value as r<br><br><br>unwind r.directSample as entry<br>merge (dep:PyPi:Package {name:entry.package.name})<br>on create set dep.version = entry.version<br>merge (dep)-[:DEPENDS_ON]-&gt;(root)</pre><p>Question from the viewers — Eshwar: How do I fix relationships that I imported wrongly?</p><p>Answer:</p><ul><li>find the relationship to delete or update properties</li><li>e.g. MATCH ()-[rel:SOME_TYPE]-&gt;() DELETE rel</li><li>or use <a href="https://neo4j.com/docs/apoc/5/graph-refactoring/set-relationship-type/">apoc refactor procedures</a> to rename, change direction, redirect</li><li>see also call apoc.help(&quot;refactor&quot;)</li></ul><p>That was it for today. Happy graphing!</p><p>Don’t forget to share the episode or the <a href="https://neo4j.com/video/discover-aura-free-with-fun-datasets/">&quot;Discover AuraDB Free with Fun Datasets&quot; series</a> with your graph-curious friends and colleagues.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=47ac8edf9c1a" width="1" height="1" alt=""><hr><p><a href="https://medium.com/neo4j/week-49-analyzing-software-dependencies-with-deps-dev-47ac8edf9c1a">Analyzing Software Dependencies With deps.dev — Discover AuraDB Free (Week 49)</a> was originally published in <a href="https://medium.com/neo4j">Neo4j Developer Blog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Bluesky User Interaction Graph — Discover AuraDB Free (Week 47)]]></title>
            <link>https://medium.com/neo4j/bluesky-user-interaction-graph-discover-auradb-free-week-47-32d636054dec?source=rss-3865848842f9------2</link>
            <guid isPermaLink="false">https://medium.com/p/32d636054dec</guid>
            <category><![CDATA[data-import-and-export]]></category>
            <category><![CDATA[blue-sky]]></category>
            <category><![CDATA[graph-algorithms]]></category>
            <category><![CDATA[neo4j]]></category>
            <category><![CDATA[social-network]]></category>
            <dc:creator><![CDATA[Michael Hunger]]></dc:creator>
            <pubDate>Tue, 09 May 2023 18:04:47 GMT</pubDate>
            <atom:updated>2023-05-09T18:04:47.252Z</atom:updated>
            <content:encoded><![CDATA[<h3>Bluesky User Interaction Graph — Discover AuraDB Free (Week 47)</h3><p>As social networks are really nice to import and explore in a graph database, this is what I did this week. If you’d rather watch the recording of the stream, you can do so below. Otherwise, this blog goes into the details for you.</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FUuGrblKXnGE%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DUuGrblKXnGE&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FUuGrblKXnGE%2Fhqdefault.jpg&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/7b8793cee012111b9b37d25ca8542264/href">https://medium.com/media/7b8793cee012111b9b37d25ca8542264/href</a></iframe><p><a href="https://blueskyweb.xyz/">Bluesky</a>, as the reference social media application for the <a href="https://atproto.com/guides/overview">Authenticated Transfer Protocol (ATP)</a>, is currently seeing massive hype. There are almost 2M people on the waitlist, while the current user count is 65k with roughly 2M posts so far.</p><p>A number of celebrities, journalists, and tech folks have already made it there, but what is most compelling is that a larger number of people from usually underrepresented minorities like BIPOC and gay and trans folks are active there.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*amxGFQIavOPJ5kIH4v8mfA.png" /><figcaption>Bluesky Timeline</figcaption></figure><p>Bluesky currently feels like the early Twitter of 2007, with lots of fun and jokes and everyone feeling comfortable just sharing tidbits. 
It’s refreshing and cool to be able to ping Jake Tapper, AOC (who also did an AMA), or James Gunn and actually get an answer.</p><p>So far I’ve seen no harassment or bad behavior (but that might also just be my bubble). But there are many conversations about scalable moderation and user protection with varying points of view.</p><p>The app itself is really the bare minimum, but it’s also just meant to provide a reference implementation to test the protocol at a larger scale than before. Expansion happens at a slow pace; most users get one invite code every two weeks, with some exceptions.</p><p>The <a href="https://atproto.com/guides/overview">AT-protocol</a> itself is quite interesting, basing everything on distributed IDs (DID) to identify users. Data (posts, likes, media) is stored cryptographically signed in repositories that are hosted by federated “personal data servers” (PDS).</p><p><a href="https://atproto.com/guides/overview">Protocol Overview | AT Protocol</a></p><p>The AT protocol allows users to use their own domains to authoritatively declare who they are (like washingtonpost.com) with an _atproto DNS record.</p><p>You can get the record from DNS yourself and then query the Placeholder Service (PLC) for more details on that ID; here you can also see the personal data server (PDS) in my case, and my domain as an alias.</p><pre>dig TXT _atproto.mesirii.de</pre><pre>; &lt;&lt;&gt;&gt; DiG 9.10.6 &lt;&lt;&gt;&gt; TXT _atproto.mesirii.de<br>...<br>;; ANSWER SECTION:<br>_atproto.mesirii.de.	
3276	IN	TXT	&quot;did=did:plc:gteyzzitmjhuezvsi6nyrszj&quot;<br></pre><pre>curl -s <a href="https://plc.directory/did:plc:gteyzzitmjhuezvsi6nyrszj/data">https://plc.directory/did:plc:gteyzzitmjhuezvsi6nyrszj/data</a> | jq .<br>{<br>  &quot;did&quot;: &quot;did:plc:gteyzzitmjhuezvsi6nyrszj&quot;,<br>  &quot;verificationMethods&quot;: {<br>    &quot;atproto&quot;: &quot;did:key:zQ3shXjHeiBuRCKmM36cuYnm7YEMzhGnCmCyW92sRJ9pribSF&quot;<br>  },<br>  &quot;rotationKeys&quot;: [<br>    &quot;did:key:zQ3shhCGUqDKjStzuDxPkTxN6ujddP4RkEKJJouJGRRkaLGbg&quot;,<br>    &quot;did:key:zQ3shpKnbdPx3g3CmPf5cRVTPe1HtSwVn5ish3wSnDPQCbLJK&quot;<br>  ],<br>  &quot;alsoKnownAs&quot;: [<br>    &quot;at://mesirii.de&quot;<br>  ],<br>  &quot;services&quot;: {<br>    &quot;atproto_pds&quot;: {<br>      &quot;type&quot;: &quot;AtprotoPersonalDataServer&quot;,<br>      &quot;endpoint&quot;: &quot;https://bsky.social&quot;<br>    }<br>  }<br>}</pre><p>There is more detail in the protocol docs, a number of clients in <a href="https://github.com/bluesky-social/indigo">Go</a>, Rust, Python, Javascript, and other languages are available and about 1000 folks are already on the <a href="https://discord.com/invite/3srmDsHSZJ">Bluesky-Dev-Discord</a>. The protocol uses HTTPS and XRPC behind the scenes for communication, with a “Lexicon” based schema approach to ensure correct interpretation of data.</p><p>Alternative apps like <a href="https://github.com/mozzius/graysky">GraySky</a>, <a href="https://firesky.tv/">FireSky</a>, or <a href="https://skypulse.dvy.io/">SkyPulse</a> are already using the protocol, with more to come.</p><p>One way of getting the firehose of all posts on the platform is to register an event listener with the stream and then get “commit notifications” when new posts have been added to the data repositories. 
This works for new data but not for past events.</p><p>Bluesky user <a href="https://github.com/ericvolp12">Jaz</a> has been working for a while on a system that not only tracks and collects posts from the mainstream in a Postgres database, but also aggregates and visualizes it as an <a href="https://bsky.jazco.dev/">interaction graph</a> (with currently 20k users and 140k connections representing 630k interactions), which is really cool.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*clyQyy2I65E-C-w5DolqIQ.png" /><figcaption>Bluesky Interaction Graph by Jaz</figcaption></figure><p>I dived deeper into the topic in my <a href="https://www.youtube.com/watch?v=6BMCXFii4Po">“Coding with ChatGPT”</a> session last week, where we walked through their code in detail, played around with the APIs, and even integrated Neo4j a bit (more to come).</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2F6BMCXFii4Po%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3D6BMCXFii4Po&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2F6BMCXFii4Po%2Fhqdefault.jpg&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/bf4152ac00a0c266cd47a176060392e0/href">https://medium.com/media/bf4152ac00a0c266cd47a176060392e0/href</a></iframe><p>The interaction graph (at least 2 conversations between the users) is kept in memory and can be serialized, rendered, and enhanced with graph algorithms for clustering, centrality-based sizing, and pre-layouting. 
Jaz uses <a href="https://www.sigmajs.org/">sigma.js</a> to visualize the already laid-out graph in the front-end and add interactivity, like searching for your own handle.</p><p>But that goes too deep for today&#39;s “intro to graphs session”.</p><p>Here we just want to import the interaction graph from Jaz’s precomputed data, visualize and query it in Neo4j, and run some graph algorithms on the data for clustering and sizing, as Jaz did for pre-computation.</p><p>Fortunately, the data loaded into the web visualization is available as a JSON file with &quot;nodes&quot; and &quot;edges&quot; entries that allows us to quickly construct the graph.</p><p>To run the graph algorithms on our data, today we will provision an <a href="https://neo4j.com/aura">AuraDS (data science)</a> instance, which is optimized for this purpose and has the graph-data-science library installed.</p><p>You can also use the blank <a href="https://sandbox.neo4j.com">Neo4j Sandbox</a> or a local installation with <a href="https://neo4j.com/download">Neo4j Desktop</a> or <a href="https://hub.docker.com/_/neo4j">Docker</a>.</p><p>The minimal instance size is a bit overkill for our needs (it could easily manage 1,000 times more data), but it’s the easiest to use and set up. Initially, the estimator allows you to provide your data volume and the algorithms you want to use and gives you an instance size.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*UCrtZxhoMeHIaBDtOchOjg.png" /><figcaption>AuraDS instance estimation</figcaption></figure><p>While creating the instance, make sure to download the credentials file with the database URI, username, and password.</p><p>To import the data we open “Neo4j Workspace” with the “[Open]” button and provide the password.</p><p>Then we can visit the “Query” tab to get started with our quick import.</p><p>Our data model is really simple: we have User nodes that are connected by an INTERACTED relationship. 
Both of them have a number of attributes (key, label (handle), size, weight, color, community, x,y) that we can take from the JSON directly.</p><p>Let’s first create a constraint for User and key so that we can ensure uniqueness and can look the users up quickly by key to connect them.</p><pre>create constraint user_key if not exists for (u:User) require (u.key) is unique;</pre><p>Originally I had also created a constraint for the user name (aka label in the data), but there was a duplicate user that we need to merge first. More about that later.</p><p>Next, we can look at some entries from the file. We use the user-defined procedure <a href="https://neo4j.com/docs/apoc/5/import/load-json/#load-json-json-path">apoc.load.json</a> to load the file and a JSON-Path expression to grab the &quot;nodes&quot; entries.</p><pre>call apoc.load.json(&quot;https://bsky.jazco.dev/exported_graph_minified.json&quot;,&quot;$.nodes&quot;)<br>yield value as nv<br>return nv limit 5;</pre><p>This returns the structure in the file as Cypher constructs: nested maps (dicts) and lists (arrays).</p><pre>{<br>  attributes: {<br>    area: 9.51,<br>    size: 3.48,<br>    color: &quot;#7EC0EE&quot;,<br>    x: -824.56,<br>    y: -6222.31,<br>    label: &quot;maxberger.bsky.social&quot;,<br>    community: 1,<br>    key: 1<br>  },<br>  key: &quot;1&quot;<br>}</pre><p>So we see we can create the user with the key as id and set the other attributes, which is what we’re going to do.</p><blockquote><strong>Note:</strong> We’re going to use MERGE to make our operation idempotent (it’s a get-or-create), so we can re-run it as often as we want. And we’re using batches of 10k rows; not so important with 20k users, but think about all the millions to come.</blockquote><p>We’re also removing key from the attributes before adding them to the nodes, because it’s a different datatype (string vs. 
integer).</p><pre>call apoc.load.json(&quot;https://bsky.jazco.dev/exported_graph_minified.json&quot;,&quot;$.nodes&quot;)<br>yield value as nv<br>call { with nv<br>   merge (n:User {key:nv.key})<br>   on create set n += apoc.map.clean(nv.attributes,[&quot;key&quot;],[])<br>} in transactions of 10000 rows;</pre><p>This gives us roughly 19k lonely nodes in our database.</p><p>We can style and position them based on their attributes, but without relationships, this is all <strong><em>boring</em></strong>.</p><p>So let’s add them, again by first looking at the data and then using it.</p><pre>call apoc.load.json(&quot;https://bsky.jazco.dev/exported_graph_minified.json&quot;,&quot;$.edges&quot;) yield value as ev<br>return ev limit 5;</pre><p>Which returns:</p><pre>{<br>  attributes: {<br>    size: 0.2,<br>    weight: 2,<br>    ogWeight: 2<br>  },<br>  source: &quot;723&quot;,<br>  key: &quot;geid_35_0&quot;,<br>  target: &quot;15220&quot;<br>}</pre><p>So with this data for each edge, we can look up the start and end nodes and then connect them with an INTERACTS relationship and set the attributes from the record on the relationship.</p><p>In our MERGE operation here, we leave off the direction as it combines bi-directional interactions.</p><pre>call apoc.load.json(&quot;https://bsky.jazco.dev/exported_graph_minified.json&quot;,&quot;$.edges&quot;) yield value as ev<br>call { with ev<br>    match (source:User {key:ev.source}),(target:User {key:ev.target})<br>    merge (source)-[r:INTERACTED {key:ev.key}]-(target)<br>    set r += ev.attributes<br>} in transactions of 20000 rows;</pre><p>With the data in the graph, we can now start exploring it, head over to the explore tab, and follow along.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*1nWZ-XVC3ZzBplHGLLhNtg.png" /><figcaption>Styled Bluesky interaction graph in Explore</figcaption></figure><p>Instead of describing all I did with screenshots, why don’t you just watch the 5 minutes section of 
the stream below (from 25:30)?</p><p>At this point, we have:</p><ul><li>explored the interaction graph visually, expanding relationships and loading additional data,</li><li>styled nodes and relationships based on community, size, and weight attributes,</li><li>run graph algorithms (Louvain for clusters and PageRank for size/importance) and used the results for styling,</li><li>laid out the graph both naturally with force layout as well as with the pre-computed x,y-values from Jaz with coordinate layout, and</li><li>used filtering to dismiss the low-weight (2,3) relationships.</li></ul><p><a href="https://www.youtube.com/live/UuGrblKXnGE?feature=share&amp;t=1529">Discovering Neo4j AuraDB Free with Michael and Alexander - Importing Bluesky User Interactions</a></p><p>More recently, Jaz has been working on rendering <strong>large threads</strong> (like the Hellthread with more than 10000 posts) visually.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*mo2Tk5s10jjksxFvk9lF9w.png" /><figcaption>Thread Visualization by Jaz for the Hellthread</figcaption></figure><p>As those thread visualizations also contain posts, I wanted to import them too.</p><p>So let’s grab the data and have a quick look.</p><pre>call apoc.load.json(&quot;https://bsky-search.jazco.io/thread?authorHandle=ihatenfts.hellthread.vet&amp;postID=3jv6terdbop2z&amp;layout=true&quot;,&quot;$&quot;) yield value as post<br>RETURN count(*);<br>// 10939<br><br>call apoc.load.json(&quot;https://bsky-search.jazco.io/thread?authorHandle=ihatenfts.hellthread.vet&amp;postID=3jv6terdbop2z&amp;layout=true&quot;,&quot;$&quot;) yield value as post<br>RETURN post limit 5;</pre><pre>{<br>  depth: 2,<br>  post: {<br>    root_post_id: &quot;3juzlwllznd24&quot;,<br>    parent_relationship: null,<br>    author_did: &quot;did:plc:pbxv2f7r5eo47e3ylwczhk32&quot;,<br>    has_embedded_media: false,<br>    parent_post_id: &quot;3juzoytvr7s2y&quot;,<br>    created_at: &quot;2023-05-05T19:35:36.601-07:00&quot;,<br>    
id: &quot;3juzp3fjlds26&quot;,<br>    text: &quot;Were you supposed to tag someone?&quot;<br>  },<br>  x: 1657.06,<br>  y: -344.76,<br>  author_handle: &quot;kentbye.com&quot;<br>}</pre><p>So we see we have the author handle and did, the post id, text, and created_at, plus whether it has embedded media. The parent post id links it to the previous post in the thread (parent_relationship is always null), and the root post id is a reference to the post that started the thread (one without parents).</p><p>So we can use this to create the structure of:</p><p>(:User {did, label/handle})-[:POSTED]-&gt;(p:Post {id, text, created, media, root})-[:PARENT]-&gt;(parent:Post)</p><p>But first, we need to fix the duplicate users in our source data that we have already imported. Fortunately, there is only one duplicate user with the same label (handle) but different keys.</p><p>We can group users by label, count occurrences, and collect the entries into a list named users. For all labels that have a count greater than one, we merge that list of nodes into a single node while preserving the relationships.</p><p>Our friend here is apoc.refactor.mergeNodes, which does exactly that.</p><pre>match (u:User)<br>with u.label as label, count(*) as count, collect(u) as users<br>where count &gt; 1<br>call apoc.refactor.mergeNodes(users) yield node<br>return node;</pre><p>Now we can create the constraint for that property, so we can look up users quickly by label and connect them to their posts.</p><pre>create constraint user_label if not exists for (u:User) require (u.label) is unique;</pre><p>Now let’s run the import statement to pull in the Hellthread.</p><ol><li>load the posts as a stream</li><li>batch in transactions of 10k</li><li>get-or-create Post with id</li><li>set attributes (x,y,depth) from entry</li><li>set attributes (root, media, created_at, text)</li><li>get-or-create parent-post</li><li>connect to parent-post</li><li>get-or-create author user</li><li>connect to author</li></ol><pre>call
apoc.load.json(&quot;https://bsky-search.jazco.io/thread?authorHandle=ihatenfts.hellthread.vet&amp;postID=3jv6terdbop2z&amp;layout=true&quot;,&quot;$&quot;) yield value as entry<br>call { with entry<br>   with entry, entry.post as post<br>   merge (p:Post {id:post.id})<br>   set<br>     p += entry {.depth, .x, .y},<br>     p += post { .root_post_id, .has_embedded_media, created_at: datetime(post.created_at), .text}<br>   merge (parent:Post {id:post.parent_post_id})<br>   merge (p)-[:PARENT]-&gt;(parent)<br>   with *<br>   merge (u:User { label: entry.author_handle})<br>   set u.did = post.author_did<br>   merge (u)-[:POSTED]-&gt;(p)<br>} in transactions of 10000 rows;<br>// Created 10995 nodes, created 21878 relationships, set 98507 properties, added 10995 labels</pre><p>Now we have a nice subset of posts to explore paths and threading in our database.</p><p>A basic pattern (post:Post)-[:PARENT]→(parent:Post) can be extended to a variable length one by adding a star. Like here for 50 hops: (post:Post)-[:PARENT*50]→(parent:Post)</p><p>So if we fetch that single path or a few, we get a nice long chain (or tree)</p><pre>MATCH path = (post:Post)-[:PARENT*50]-&gt;(parent:Post)<br>RETURN path LIMIT 50</pre><p>But “Explore” is much better with the hierarchical layout, or coordinate layout of the pre-computed thread visualization. Also its WebGL-based layout and rendering scales better than the d3 one in Query.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*uXNfQANpnZYpnhX9T_dXiA.png" /><figcaption>Hierarchical Layout of the Thread</figcaption></figure><p>We can now find the shortest paths between users, e.g. 
via the interaction network, or via posts in the thread(s), either visually in the context menu in Explore or with the following statement.</p><pre>MATCH (start:User {label:&quot;mesirii.de&quot;})<br>MATCH (end:User {label:&quot;rboyd.dev&quot;})<br>MATCH path = shortestPath( (start)-[:INTERACTED*]-(end))<br>RETURN path</pre><p>As you can see, even my friend Ryan and I need James Governor to mediate :)</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*hzcRP_3cXAL0MSZLof3Xlg.png" /><figcaption>Shortest Path between Ryan and me</figcaption></figure><p>You can find the code we used today in my <a href="https://gist.github.com/jexp/320521a881113b853a2f97cc867a25bf">GitHub Gist</a>.</p><p><strong>Happy socializing and graphing. Be kind.</strong></p><p>The next thing I want to write up is how to use the <a href="https://worthdoingbadly.com/bsky/">Bluesky data captured in Postgres</a> by <a href="https://github.com/zhuowei">zhuowei</a> to import the users, posts, likes, and follows graph into Neo4j. Already got the CSV dumps imported. But that’s for the next blog post :)</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=32d636054dec" width="1" height="1" alt=""><hr><p><a href="https://medium.com/neo4j/bluesky-user-interaction-graph-discover-auradb-free-week-47-32d636054dec">Bluesky User Interaction Graph — Discover AuraDB Free (Week 47)</a> was originally published in <a href="https://medium.com/neo4j">Neo4j Developer Blog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Cypher Code-Golf Completion: Hackathon Winners Announced]]></title>
            <link>https://medium.com/neo4j/cypher-code-golf-hackathon-completion-5b43877d2904?source=rss-3865848842f9------2</link>
            <guid isPermaLink="false">https://medium.com/p/5b43877d2904</guid>
            <category><![CDATA[neo4j]]></category>
            <category><![CDATA[community]]></category>
            <category><![CDATA[code-golf]]></category>
            <category><![CDATA[hackathons]]></category>
            <category><![CDATA[database]]></category>
            <dc:creator><![CDATA[Michael Hunger]]></dc:creator>
            <pubDate>Thu, 06 Apr 2023 08:42:00 GMT</pubDate>
            <atom:updated>2023-04-06T13:35:21.496Z</atom:updated>
            <content:encoded><![CDATA[<h4>With spring in full swing and summer just right around the corner, I wanted to take a moment to reflect on the Neo4j Hackathon we ran last year, where we encouraged you all to compete in your Cypher skills.</h4><p>Similar to the <a href="https://en.wikipedia.org/wiki/Code_golf">classic code-golf</a>, where the shortest solution to a problem wins, we wanted to see who could write the most efficient and shortest Cypher query to solve one of the three available challenges in “Cypher Code-Golf,” with difficulty ranging across Beginner, Intermediate, and Advanced Cypher skill levels.</p><p>Efficiency was measured in “database hits,” a metric returned when using PROFILE with your query, and the shortest query was determined by its length in characters. For ties on both metrics, we used the first submission.</p><p>By the numbers, more than 1000 people registered, and over 5,000 queries were submitted. At the conclusion of the hackathon, 18 winners were selected to take home hard-earned cash prizes, and we couldn’t be more thankful to have so many participants in this challenge. More on those 18 winners below!</p><p>Our first-ever hackathon wouldn’t be complete without some learnings and challenges. We quickly realized that not having a result verification tool led to a lot of invalid queries with db-hit counts that simply didn’t make sense. Zero db-hits was an obvious one, and we got plenty of those! 
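</p><p>In principle, such cheats are easy to catch by re-running each submission against a second, hidden input. A hypothetical sketch of that kind of check, with plain Python callables standing in for Cypher statements:</p>

```python
def validate(solution, cases):
    """Run a submitted solution against several (params, expected) cases.

    A hard-coded answer may pass the published case but fails hidden ones.
    """
    return all(solution(params) == expected for params, expected in cases)

# An honest solution computes the answer from its input parameters;
# a cheating one just returns the published answer as a constant.
def honest(params):
    return params["a"] + params["b"]

def cheat(_params):
    return 7

cases = [({"a": 3, "b": 4}, 7),   # the published example
         ({"a": 1, "b": 1}, 2)]   # a hidden verification case
```

<p>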
Next time we’ll apply the approach from Advent of Code: execute the parameterized statement with a second set of input parameters and validate the correctness of the results that way.</p><p>Here’s a quick rundown of the user interface of Code-Golf:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*SkOo8chTbWUp5Tpq" /></figure><p>After signing up and picking an avatar, you could start swinging.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*Z99bYo9-QfT7Iodl" /></figure><p>We provided three difficulty levels to accommodate different skills, one of which you could compete in.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*a5nYE7iXmEP5aFg6" /></figure><h3>Stack Overflow Dataset</h3><p>Being developers, we felt that a Stack Overflow dataset would be easy to understand and fun to query for everyone. Based on our previous work, we used the import of the whole Stack Overflow dump from 2018 (55M nodes, 123M relationships), which provided us with this data model:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*s4sTrMa7OmVVcjAE" /></figure><p>So effectively, the elements we are looking at are:</p><ul><li>Labels: Tag, Question:Post, Answer:Post, User</li><li>Patterns</li><li>(:User)-[:POSTED]-&gt;(:Post), (:Tag)-[:SIMILAR]-&gt;(:Tag)</li><li>(:Question)-[:TAGGED]-&gt;(:Tag)</li><li>(:Question)-[:ACCEPTED]-&gt;(:Answer), (:Question)&lt;-[:ANSWERED]-(:Answer)</li></ul><p>For the three levels, we had one question each.</p><h3>Level 1: Beginner</h3><p>Question: How many tags do questions in the Cypher area have at most?</p><p>This question could be solved in these steps:</p><ol><li>Find the “cypher” tag</li><li>Find questions tagged with that tag</li><li>Return the max degree (relationship count) of those questions</li></ol><p>We got the most submissions in this category, 1197 in total.</p><p>There were a number of cheats, e.g. 
just returning the number, or sneakily creating a statement that just hit 1000 db-hits and then returned fixed results, or similar.</p><p>An efficient statement that meets the criteria with 12719 db-hits is:</p><pre>MATCH (t:Tag {name:&#39;cypher&#39;})&lt;-[:TAGGED]-(q)<br>RETURN max(size( (q)-[:TAGGED]-&gt;())) as most</pre><p>With Neo4j 5.x, you would replace the <strong>size(pattern)</strong> expression with <strong>count { pattern }</strong> instead.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*e9eYKyrCJrFah9KF" /></figure><h3>Winners</h3><p>Our 6 winners in this category are:</p><ol><li>Chris Zirkel</li><li>Paweł Gronowski</li><li>Ertan Kabakcı</li><li>Benjamin Malburg</li><li>Joren Van de Vondel</li><li>Paul Billing-Ross</li></ol><p>Here are two example queries from our winners that differ from the minimal query:</p><pre>MATCH (:Tag {name: &quot;cypher&quot;})&lt;-[:TAGGED]-(q)<br>RETURN max(apoc.node.degree(q,&quot;TAGGED&quot;))</pre><p>And another:</p><pre>MATCH (t:Tag)&lt;-[:TAGGED]-(q)<br>WHERE t.name = &quot;cypher&quot;<br>WITH q, size((q)-[:TAGGED]-&gt;()) as ts<br>RETURN max(ts)</pre><h3>Level 2: Intermediate</h3><p>Question: What is the title of the most highly voted question posted on April 1st in the Perl category?</p><p>This question was also pretty straightforward: you can access components like month or day of a date(-time) property individually.</p><ol><li>Find the “perl” tag</li><li>Find questions with that tag</li><li>Filter those questions by day 1 and month 4</li><li>Sort by score DESCending</li><li>Return the title</li></ol><p>Again, among the 932 submissions, we got a lot of cheats that looked for the question only on that single date or by question-id, or only within a single year and not all years, or just blatantly returned the expected title.</p><p>A reasonable solution would be:</p><pre>MATCH (t:Tag {name:&#39;perl&#39;})&lt;-[:TAGGED]-(q)<br>WHERE q.createdAt.month = 4 and q.createdAt.day = 1<br>RETURN q.title<br>ORDER BY 
q.score DESC LIMIT 1;</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*taoZyDsrngLRRXNq" /></figure><p>An alternative solution iterates over the years and then uses a range (between) predicate, which makes use of the index on the date field and a hash-join between the tag and the question.</p><p>But that one had higher db-hit costs (353k), as it had to filter <strong>all</strong> questions for that date, and not just the ones tagged with perl.</p><pre>UNWIND range(2008,2019) as year<br>WITH datetime({year:year,month:4,day:1}) as date<br>MATCH (t:Tag {name:&#39;perl&#39;})&lt;-[:TAGGED]-(q:Question)<br>USING JOIN on t<br>WHERE date &lt;= q.createdAt &lt; date + duration(&quot;P1D&quot;)<br>RETURN q.title, q.score<br>ORDER BY q.score DESC LIMIT 1;</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*2hD-YD18xn6aqZu-" /></figure><h3>Winners</h3><p>The winners for this category are:</p><ol><li>Justin B</li><li>Brian Lee</li><li>Belinda Dhamers</li><li>Carmi Raz</li><li>Venkatesh Prasanna</li><li>Camille Caulier</li></ol><p>Some of their <em>different</em> solutions:</p><pre>CYPHER runtime=interpreted<br>MATCH (t:Tag{name:&#39;perl&#39;})&lt;-[:TAGGED]-(q:Question)<br>USING INDEX q:Question(createdAt)<br>USING JOIN ON t<br>WHERE datetime({year:2008, month:4, day:1}) &lt;= q.createdAt &lt; datetime({year:2008, month:4, day:2})<br>OR datetime({year:2009, month:4, day:1}) &lt;= q.createdAt &lt; datetime({year:2009, month:4, day:2})<br>OR datetime({year:2010, month:4, day:1}) &lt;= q.createdAt &lt; datetime({year:2010, month:4, day:2})<br>…<br>day:2})<br>OR datetime({year:2022, month:4, day:1}) &lt;= q.createdAt &lt; datetime({year:2022, month:4, day:2})<br>RETURN q.title ORDER BY q.score DESC LIMIT 1</pre><p>Or with a regular expression:</p><pre>MATCH (:Tag {name: &#39;perl&#39;})&lt;-[:TAGGED]-(q)<br>WHERE apoc.convert.toString(q.createdAt) =~ &#39;.*04\-01T.*&#39;<br>RETURN q.title<br>ORDER BY q.score DESC<br>LIMIT 1</pre><p>Or via 
date components:</p><pre>MATCH (:Tag {name:&quot;perl&quot;})&lt;-[:TAGGED]-(q) WHERE EXISTS {<br>MATCH(q) WHERE q.createdAt.month=4 AND q.createdAt.day=1<br>} RETURN q.id ORDER BY q.score DESC LIMIT 1</pre><h3>Level 3: Advanced</h3><p>Question: Given the top 10 tags in the Lua community (except Lua), which other tags most frequently co-occur with them. For those tags, find the 25 people who most frequently answered those questions and see what are the top 10 question tag names they had that were not in the original 10-element list.</p><p>This question was a bit more involved, but you can just follow the steps one at a time to build it up.</p><ol><li>Find the “lua” tag</li><li>Find questions tagged with that tag, and other tags for these questions</li><li>Aggregate the other tags by their frequency, sort descending and select the top 10</li><li>Turn the 10 top other tags into a list</li><li>Find the users who posted answers to questions with those top 10 tags</li><li>Aggregate by frequency, sort descending, and collect the top 25 users</li><li>For those users, find the tags of the questions they answered</li><li>Filter out the tags that were in our top 10 tags list</li><li>Return the remaining tags, ordered by frequency descending, and pick the top 10</li></ol><p>An example query showing how it could be solved:</p><pre>MATCH (:Tag {name:&#39;lua&#39;})&lt;-[:TAGGED]-()-[:TAGGED]-&gt;(o)<br>WITH o, count(*) as c ORDER BY c DESC LIMIT 10<br>CALL { with o return collect(o) as all }<br>MATCH (o)&lt;-[:TAGGED]-()&lt;-[:ANSWERED]-()&lt;-[:POSTED]-(u)<br>WITH u, all, count(*) as c ORDER BY c DESC LIMIT 25<br>MATCH (o)&lt;-[:TAGGED]-()&lt;-[:ANSWERED]-()&lt;-[:POSTED]-(u)<br>WHERE not o in all<br>RETURN o, count(*) as c ORDER BY c DESC LIMIT 10</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*Ed0QQnCxXlHpKTbe" /></figure><p>This category was, as expected, the hardest; many folks also had issues understanding the challenge. 
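To make the nine steps concrete outside of Cypher, here is the same pipeline sketched in plain Python over toy in-memory data (the dataset and the scaled-down limits are invented for illustration; the real challenge uses top 10 tags, 25 users, and 10 result tags):

```python
from collections import Counter

# Toy stand-in for the Stack Overflow graph (purely illustrative data)
question_tags = {
    "q1": ["lua", "c", "table"],
    "q2": ["lua", "c"],
    "q3": ["lua", "table"],
    "q4": ["c", "gc"],  # no lua tag, but answered by a top user
}
answers = [("alice", "q1"), ("alice", "q2"),
           ("bob", "q1"), ("bob", "q3"), ("bob", "q4")]

TOP_TAGS, TOP_USERS, TOP_RESULT = 2, 2, 2  # scaled-down limits

# Steps 1-4: tags co-occurring with "lua", most frequent first
co = Counter(t for tags in question_tags.values() if "lua" in tags
             for t in tags if t != "lua")
top_tags = {t for t, _ in co.most_common(TOP_TAGS)}

# Steps 5-6: users who answered questions carrying those top tags
active = Counter(u for u, q in answers
                 if any(t in top_tags for t in question_tags[q]))
top_users = {u for u, _ in active.most_common(TOP_USERS)}

# Steps 7-9: other tags on those users' questions, minus the top list
other = Counter(t for u, q in answers if u in top_users
                for t in question_tags[q] if t not in top_tags and t != "lua")
print(other.most_common(TOP_RESULT))  # [('gc', 1)]
```

Each aggregate-sort-limit stage maps to one `Counter(...).most_common(n)` call, which mirrors the `count(*) ... ORDER BY c DESC LIMIT n` blocks in the Cypher version.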
Unfortunately, we didn’t have the capacity to provide explanations for all your questions; apologies for that.</p><h3>Winners</h3><p>But we still got 111 submissions, from which we picked the 6 winners.</p><ol><li>Ron van Weverwijk</li><li>Laura Arditti</li><li>Hüseyin Çötel</li><li>Đức Lê Tự</li><li>Niclas Kjäll-Ohlsson</li><li>Rajendra Kadam</li></ol><p>Some different solutions from the one we’ve shown:</p><pre>MATCH (:Tag{name: &quot;lua&quot;})&lt;-[:TAGGED]-(p)-[:TAGGED]-&gt;(o)<br>WITH o, count(p) as np<br>ORDER BY np DESC<br>LIMIT 10<br>WITH collect(o) as os<br>CALL {<br>WITH os<br>UNWIND os as o<br>MATCH (o)&lt;-[:TAGGED]-()&lt;-[:ANSWERED]-(a)&lt;-[:POSTED]-(u)<br>RETURN u, count(DISTINCT a) AS na<br>ORDER BY na DESC<br>LIMIT 25<br>}<br>WITH os, collect(u) as us<br>CALL {<br>WITH us, os<br>UNWIND us as u<br>MATCH (u)--&gt;(:Question)-[t:TAGGED]-&gt;(ot)<br>WHERE (NOT ot IN os)<br>RETURN ot, count(t) as nt<br>ORDER BY nt DESC<br>LIMIT 100<br>}<br>return us, collect(ot)</pre><p>Using APOC:</p><pre>MATCH l = (t:Tag {name: &quot;lua&quot;})-[:SIMILAR]-(s:Tag)&lt;-[:TAGGED]-(p:Post)-[:TAGGED]-&gt;(c:Tag)<br>WITH c, count(l) AS n, collect(p) AS p<br>ORDER BY n DESC LIMIT 10<br>WITH apoc.coll.flatten(collect(p)) AS p, collect(c) AS o<br>MATCH r = (u:User)-[:POSTED]-&gt;(:Answer)-[:ANSWERED]-&gt;(q:Question)<br>WHERE q in p<br>WITH count(r) AS a, u, o, collect(q) AS q<br>ORDER BY a DESC LIMIT 25<br>WITH apoc.coll.flatten(collect(q)) AS q, o<br>MATCH (v:Question)-[:TAGGED]-&gt;(t:Tag)<br>WHERE NOT (t IN o) AND v IN q<br>WITH t, count(v) AS c<br>RETURN t ORDER BY c DESC LIMIT 100</pre><p>Using a single query:</p><pre>MATCH (tf:Tag)&lt;-[tr:TAGGED]-(q:Question)-[:TAGGED]-(t:Tag {name:&quot;lua&quot;}) with tf, count(tr) as cnt order by cnt desc limit 10<br>MATCH (tf)&lt;-[:TAGGED]-(q:Question)&lt;-[:ANSWERED]-(:Post)&lt;-[rp:POSTED]-(u:User) with u, count(rp) as cntrp, collect(tf.name) as tf_list order by cntrp desc limit 25<br>MATCH 
(tt:Tag)&lt;-[rtt:TAGGED]-(:Question)&lt;-[:ANSWERED]-(:Post)&lt;-[:POSTED]-(u) where not tt.name in tf_list with tt, count(rtt) as crtt order by count(rtt) desc limit 100 return tt</pre><p>A smaller Stack Overflow dataset is also available on <a href="https://neo4j.com/aura">Neo4j AuraDB</a>; the full dataset is on the demo server <a href="https://demo.neo4jlabs.com:7473/browser">demo.neo4jlabs.com</a> with username/password/database “stackoverflow”.</p><p>Please make sure you join our <a href="https://community.neo4j.com">user forums</a> or hang out in our <a href="https://dev.neo4j.com/chat">Discord</a> to learn from each other.</p><p><a href="https://community.neo4j.com">Neo4j Online Community</a></p><p>If you want to learn more, we have published additional Cypher, Graph Data Science, and Application Development courses on <a href="https://graphacademy.neo4j.com">GraphAcademy</a>.</p><p><a href="https://graphacademy.neo4j.com">Free, Self-Paced, Hands-on Online Training</a></p><p>And we’re running regular live streams, Meetups, GraphSummits, and other events, all of which you can find on our <a href="https://neo4j.com/events">Events Page</a>.</p><p><a href="https://neo4j.com/events">Events Archive</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=5b43877d2904" width="1" height="1" alt=""><hr><p><a href="https://medium.com/neo4j/cypher-code-golf-hackathon-completion-5b43877d2904">Cypher Code-Golf Completion: Hackathon Winners Announced</a> was originally published in <a href="https://medium.com/neo4j">Neo4j Developer Blog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Git Commit History — Discover AuraDB: Week 44]]></title>
            <link>https://medium.com/neo4j/git-commit-history-discover-auradb-week-44-2ea2337abc86?source=rss-3865848842f9------2</link>
            <guid isPermaLink="false">https://medium.com/p/2ea2337abc86</guid>
            <category><![CDATA[data-exploration]]></category>
            <category><![CDATA[graph-database]]></category>
            <category><![CDATA[data-visualization]]></category>
            <category><![CDATA[neo4j]]></category>
            <category><![CDATA[git]]></category>
            <dc:creator><![CDATA[Michael Hunger]]></dc:creator>
            <pubDate>Mon, 27 Mar 2023 16:20:46 GMT</pubDate>
            <atom:updated>2023-03-27T16:20:46.008Z</atom:updated>
            <content:encoded><![CDATA[<h3>Git Commit History — Discover AuraDB: Week 44</h3><h4>Let’s explore the output of git log as a graph in Neo4j AuraDB Free.</h4><p>This time Alex couldn’t make it; hope he gets better soon.</p><p>Some interesting things that happened since last weekend:</p><ul><li>Went to watch <a href="https://www.youtube.com/watch?v=IiMinixSXII">Dungeons and Dragons: Honor among Thieves</a> with the kids (it was really good)</li><li>Played a lot with GPT-4</li><li>Missed the <a href="https://www.dw.com/en/asteroid-to-zip-between-earth-and-moon-without-collision/a-65120155">asteroid DZ2 2023</a> due to the cloudy sky :(</li><li>Been running the Arc browser, which is really neat (<a href="https://www.lennyspodcast.com/competing-with-giants-an-inside-look-at-how-the-browser-company-builds-product-josh-miller-ceo/">invite in the show notes</a>)</li><li>Went to a local whisky festival, which was great for tasting rare spirits</li></ul><p>Amid all that, I saw the TIL post by Simon Willison about his <a href="https://til.simonwillison.net/jq/git-log-json">GPT-4 coding exercise to turn git logs into JSON output</a>.</p><p>He uses <a href="https://stedolan.github.io/jq/">jq, the Swiss Army knife of JSON processing</a>, which is awesome.</p><p>His post inspired today’s session on looking at <strong>git commit history</strong> as a graph.</p><p>If you’d rather watch the recording, check it out here:</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FhSttMcYxdKQ%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DhSttMcYxdKQ&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FhSttMcYxdKQ%2Fhqdefault.jpg&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a 
href="https://medium.com/media/b33f117202af31e3ad1dbd82d747761a/href">https://medium.com/media/b33f117202af31e3ad1dbd82d747761a/href</a></iframe><h3>Data Source and Preparation</h3><p>We’re using Neo4j’s open-source repository here for our experiment: <a href="https://github.com/neo4j/neo4j">https://github.com/neo4j/neo4j</a></p><p>Clone the repository (I limited it to the 5.6 branch) and follow along.</p><p>Simon did two interesting things:</p><ol><li>Using git log --pretty with NULL-bytes as separators (instead of commas or tabs)</li><li>Using jq to split the raw string by those NULL bytes and output JSON for the fields</li></ol><pre>git log --date=iso --pretty=format:&#39;%H%x00%an%x00%ad%x00%s%x00&#39; | head -2 | \<br>  jq -R -s &#39;[split(&quot;\n&quot;)[:-1] | map(split(&quot;\u0000&quot;)) | .[] | {<br>    &quot;commit&quot;: .[0],<br>    &quot;author&quot;: .[1],<br>    &quot;date&quot;: .[2],<br>    &quot;message&quot;: .[3]<br>  }]&#39;</pre><p>The output is:</p><pre>[<br>  {<br>    &quot;commit&quot;: &quot;5ad4387ed521f169a737f9836402dbac8759a9fc&quot;,<br>    &quot;author&quot;: &quot;Johannes Donath&quot;,<br>    &quot;date&quot;: &quot;2023-03-08 16:17:12 +0100&quot;,<br>    &quot;message&quot;: &quot;Corrected an issue in which the number of writable bytes is incorrectly calculated. ()&quot;<br>  },<br>  {<br>    &quot;commit&quot;: &quot;c53e91519eca145c6879d5c9be9a421fdd223338&quot;,<br>    &quot;author&quot;: &quot;Tobias Johansson&quot;,<br>    &quot;date&quot;: &quot;2023-03-07 13:34:12 +0100&quot;,<br>    &quot;message&quot;: &quot;Do polling of the fabric transaction lock in terminate&quot;<br>  }<br>]</pre><p>The placeholders in the pretty-print string are a bit cryptic, but the man page and the <a href="https://devhints.io/git-log-format">cheat-sheet here</a> help a lot.</p><p>I wanted to add the parent commit via %P to see the git commit history chain. 
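If jq isn't your thing, the same NUL-byte splitting is a few lines of Python. A sketch parsing the extended format string (with %P for the parent) into dicts; the sample log line below is synthetic:

```python
# Parse `git log --pretty=format:'%H%x00%P%x00%an%x00%ad%x00%s%x00'` output
# into one dict per commit, mirroring what the jq pipeline does.
FIELDS = ["commit", "parent", "author", "date", "message"]

def parse_log(raw: str):
    records = []
    for line in raw.splitlines():
        # Each line carries NUL-separated fields, with a trailing NUL
        values = line.split("\x00")[:len(FIELDS)]
        records.append(dict(zip(FIELDS, values)))
    return records

sample = ("abc123\x00def456\x00Jane Doe\x00"
          "2023-03-08 16:17:12 +0100\x00Fix the thing\x00")
print(parse_log(sample)[0]["author"])  # Jane Doe
```

Because %s is the subject line only, it never contains newlines, so splitting on line breaks is safe here.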
I also tried to include the files of each commit, but we’ll leave that for another time.</p><p>The other change we need is to turn the data into a CSV file, not JSON. Fortunately, jq also supports CSV as an output, so we can select our 5 fields and pipe them through the @csv processing step.</p><pre>echo &#39;commit,parent,author,date,message&#39; &gt; ~/Downloads/neo4j-git.csv<br>git log --date=iso --pretty=format:&#39;%H%x00%P%x00%an%x00%ad%x00%s%x00&#39; | \<br>  jq -r -R -s &#39;split(&quot;\n&quot;)[:-1] | map(split(&quot;\u0000&quot;)) | .[] | [<br>   .[0],<br>   .[1],<br>   .[2],<br>   .[3],<br>   .[4]<br>  ] | @csv&#39; &gt;&gt; ~/Downloads/neo4j-git.csv</pre><p>We can check the resulting CSV file with the <a href="https://github.com/BurntSushi/xsv">xsv tool</a> and see that it has the 76k commits and the right fields; it’s about 14MB of data.</p><pre>xsv count ~/Downloads/neo4j-git.csv<br>76567<br><br>xsv stats ~/Downloads/neo4j-git.csv<br>field,type,sum,min,max,min_length,max_length,mean,stddev<br>commit,Unicode,,0000257bb06e29e15c11b6bc5ad4f8253deed4a4,ffff9ee399d15bd834e4aeb4719a02e5e39308d5,40,40,,<br>parent,Unicode,,0000257bb06e29e15c11b6bc5ad4f8253deed4a4,ffff9ee399d15bd834e4aeb4719a02e5e39308d5,0,81,,<br>author,Unicode,,@fbiville,wujek srujek,2,30,,<br>date,Unicode,,2007-05-24 01:34:45 +0000,2023-03-08 16:17:12 +0100,25,25,,<br>message,Unicode,,     - BeansAPITransaction handles multiple calls to tx.finish(),zoo_keeper_servers example changed to something useful (the default ZK port),1,3217,,<br><br>ls -lh ~/Downloads/neo4j-git.csv<br>14M 27 Mar 10:17 /Users/mh/Downloads/neo4j-git.csv</pre><p>OK, now we have everything we need to turn those commits into a graph.</p><p>Are you ready?</p><p>Let’s first spin up a Neo4j AuraDB database and get the data imported.</p><h3>Create a Neo4j AuraDB Free Instance</h3><p>Go to <a 
href="https://dev.neo4j.com/neo4j-aura">https://dev.neo4j.com/neo4j-aura</a> to register or log into the service (you might need to verify your email address).</p><p>After clicking Create Database, you can create a new <strong>Neo4j AuraDB Free</strong> instance.</p><p>Choose the &quot;Empty Instance&quot; option, as we want to import our data ourselves.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*8SycaBOFQjP6cXAYoaWSNA.png" /></figure><p>On the Credentials popup, make sure to save the password somewhere safe; the best option is to download the credentials file, which you can also use for your app development.</p><p>The default username is always neo4j.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*_9VVp2bgdaS_Jp13wdgBSQ.png" /></figure><p>Then wait 2-3 minutes for your instance to be created.</p><p>Afterward, you can connect to Workspace via the &quot;Open&quot; button (you’ll need the password), which offers the &quot;Import&quot; (Data Importer), &quot;Explore&quot; (Neo4j Bloom), and &quot;Query&quot; (Neo4j Browser) tabs to work with your data.</p><p>On the database tile, you can also find the connection URL: neo4j+s://xxx.databases.neo4j.io (it is also contained in your credentials env file).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Q_y2TLjlivNHfMGEsDgQPA.png" /></figure><p>If you want to see examples of programmatically connecting to the database, go to the &quot;Connect&quot; tab of your instance and pick the language of your choice.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*l1VeLL7ICT1jiQCvVaQJLA.png" /></figure><p>After opening Neo4j Workspace via the &quot;Open&quot; button, and logging in with the downloaded credentials, we can go to the &quot;Import&quot; tab and get started.</p><h3>Data Modeling</h3><p>The data model is pretty straightforward: we just have Commit and Author nodes connected by a WROTE relationship.</p><p>Commits also point to 
their parent commit via a PARENT relationship.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/591/1*rRNAKc2zUHl2c6HJmnKWhg.png" /><figcaption>Minimalistic Data Model</figcaption></figure><p>Now we can add our CSV file, map the attributes, and select our id fields.</p><ul><li>Commit (hash, date, parent, subject)</li><li>Author (author)</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*7Cc-OB7vf9TVZx1eAFAi7A.png" /><figcaption>Data Importer with Mapping</figcaption></figure><p>With the data mapped, we can &quot;Preview&quot; our import and inspect nodes and relationship attributes and structure to see if we messed anything up.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1009/1*BobDYHAwKe4ukP5683yS2g.png" /><figcaption>Import Preview</figcaption></figure><p>As we’re good, we can click &quot;Import&quot; and, after a few seconds, the results are presented. There you can also inspect the Cypher statements used to create the data, which you can reuse in your own code or scripts.</p><p>The image below is from a 2nd run, so the data is already in the graph.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/898/1*PCiBPfua1_gyXJYphu_YTA.png" /><figcaption>Import Results</figcaption></figure><h3>Explore</h3><p>Now, with &quot;Start Exploring&quot;, we head over to the &quot;Explore&quot; tab and see the data in the graph.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*abcsgzxcZuZDTfRm_Rl_4Q.png" /></figure><p>There we can also style our nodes and, for instance, determine the shortest paths between commits (select two nodes and right-click for the context menu).</p><p>Now let’s get our hands dirty and write some Cypher statements.</p><h3>Query &amp; Evolve</h3><p>Opening the left sidebar shows us the labels and relationship types in the graph; we can click on any of them to see a subset of our data.</p><figure><img alt="" 
src="https://cdn-images-1.medium.com/max/1024/1*cT_BLuhmxDk6RlUggTHL4g.png" /><figcaption>Query the Graph</figcaption></figure><p>Now we can start to explore a bit.</p><p>First, let’s look at the total number of commits and authors:</p><pre>MATCH (:Commit) RETURN count(*)<br>// 76k<br><br>MATCH (:Author) RETURN count(*)<br>// 338</pre><h3>Contributor Activity</h3><p>Let’s find the most active authors by counting their outgoing relationships.</p><pre>MATCH (a:Author)<br>WITH a, count { (a)-[:WROTE]-&gt;() } as commits<br>ORDER BY commits DESC LIMIT 10<br>RETURN a.author, commits</pre><p>Which gives us these folks:</p><pre>a.author commits<br>Mattias Persson 6592<br>Anders Nawroth 4106<br>Pontus Melke 4030<br>MishaDemianenko 3989<br>Andres Taylor 3640<br>Chris Vest 2983<br>Jacob Hansson 2521<br>Anton Persson 2212<br>Mattias Finné 2148<br>Davide Grohmann 1987</pre><p>The usual suspects :)</p><p>Oh, Mattias has been with us for 15 years but got married in between, so he shows up with two different names.</p><p>Let’s fix that and move the 6592 relationships from his old alter-ego to the new one:</p><pre>MATCH (new:Author { author: &quot;Mattias Finné&quot;})<br>MATCH (old:Author { author: &quot;Mattias Persson&quot;})<br>MATCH (old)-[rel:WROTE]-&gt;(c:Commit)<br>CREATE (new)-[:WROTE]-&gt;(c)<br>DELETE rel<br>RETURN count(*);</pre><p>So that should fix the results:</p><pre>a.author commits<br>Mattias Finné 8740<br>Anders Nawroth 4106<br>Pontus Melke 4030<br>MishaDemianenko 3989<br>Andres Taylor 3640<br>Chris Vest 2983<br>Jacob Hansson 2521<br>Anton Persson 2212<br>Davide Grohmann 1987<br>Satia Herfert 1896</pre><p>If we want to limit the count by commit date, so we see who has been more active recently, we can do that too:</p><pre>MATCH (a:Author)<br>WITH a, count { (a)-[:WROTE]-&gt;(c:Commit) WHERE c.date &gt; datetime(&quot;2019-01-01T00:00:00&quot;)} as commits<br>ORDER BY commits DESC LIMIT 10<br>RETURN a.author, commits</pre><p>Now we have a different set of 
people with fewer total contributions:</p><pre>a.author commits<br>MishaDemianenko 1892<br>Pontus Melke 1570<br>Mattias Finné 1284<br>Satia Herfert 1126<br>Anton Persson 920<br>Chris Vest 746<br>Louise Söderström 578<br>Tobias Johansson 463<br>Therese Magnusson 439<br>Georgiy Kargapolov 415</pre><h3>Fixing Parents</h3><p>Another thing we need to fix: some commits have multiple parents, not just one, so their parent hash was not found in the database to connect them.</p><p>We have roughly 15k commits of that kind. Here is what the property looks like.</p><pre>parent: &quot;e5697a0900ff849f92d0ae3c88bd8e31e3163024 d26c6ab67c34a5f91cc6cfadfc75ca4b1def3bef&quot;</pre><p>We can fix it by:</p><ol><li>Finding these commits</li><li>Splitting the parent field by space</li><li>Turning that list of hashes into rows</li><li>Finding the parent commit by its hash</li><li>Creating the relationships</li></ol><p>Or in Cypher:</p><pre>// find commits without PARENT relationship but with multiple parent hash values<br>MATCH (c:Commit)<br>WHERE NOT exists { (c)-[:PARENT]-&gt;() }<br>  AND c.parent contains &#39; &#39;<br><br>// split by space into list of hashes<br>WITH c, split(c.parent,&#39; &#39;) as parents<br><br>// turn list into rows<br>UNWIND parents as parent<br><br>// find parent commit<br>MATCH (p:Commit {hash:parent})<br><br>// create the relationship<br>MERGE (c)-[:PARENT]-&gt;(p)<br><br>RETURN count(*);</pre><p>Now our graph is better connected, and we can run a few long path queries:</p><pre>MATCH path = (c:Commit)-[:PARENT*100]-&gt;(p:Commit)<br>RETURN path LIMIT 1</pre><p>Which gives us this beautiful, long, flowery chain of commits:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*h-ZjreAShPMpLYPEdRmhZg.png" /></figure><p>We can also look at &quot;root&quot; commits and see which ones have the most children.</p><p>With PROFILE, we can sneak under the hood and see how the query planner optimizes this query; it completes in about 2 seconds, checking 
10-hop paths for all 76k commits.</p><pre>PROFILE<br>MATCH (c)-[:PARENT*..10]-&gt;(p:Commit)<br>WITH p, count(distinct c) as children<br>RETURN p {.*} as parent, children ORDER BY children DESC LIMIT 10</pre><p>Some of them have more than 600 children, even looking only 10 levels up!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*fIyIh-SsecIySwDwep--pw.png" /></figure><p>We can do the same without an upper limit, but for a subset of commits:</p><pre>PROFILE<br>MATCH (p:Commit) WITH p LIMIT 1000<br>MATCH (c)-[:PARENT*]-&gt;(p)<br>WITH p, count(distinct c) as children<br>RETURN p, children ORDER BY children DESC LIMIT 10</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*q0nGJnW6oLofIV47J3Hsvg.png" /></figure><p>And with that, we were out of time and had covered a lot of ground.</p><p>If you want to see videos, write-ups, and data for past livestreams, check out our overview page:</p><p><a href="https://neo4j.com/video/discover-aura-free-with-fun-datasets/">Discovering AuraDB Free with Fun Datasets</a></p><p>Or our repository: <a href="https://github.com/neo4j-examples/discoveraurafree">https://github.com/neo4j-examples/discoveraurafree</a></p><p>Hope this was as much fun for you as it was for me.</p><p>Happy Graphing!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=2ea2337abc86" width="1" height="1" alt=""><hr><p><a href="https://medium.com/neo4j/git-commit-history-discover-auradb-week-44-2ea2337abc86">Git Commit History — Discover AuraDB: Week 44</a> was originally published in <a href="https://medium.com/neo4j">Neo4j Developer Blog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>