GraphQL distinct queries for efficient data filtering

GraphQL distinct refers to the common developer task of fetching unique or de-duplicated values from a data source using a GraphQL API. Unlike SQL, GraphQL has no built-in DISTINCT keyword, so developers must implement this logic on the server side — inside a resolver, at the database layer, or through framework-level helpers like Hasura’s distinct_on or Gatsby’s distinct() aggregation. Getting this wrong leads to data over-fetching, bloated payloads, and unnecessary strain on both server and client.

Key Benefits at a Glance

  • Faster Data Processing: Reduces the amount of data sent over the network and processed by the client, speeding up your application.
  • Reduced Server Load: Pushing distinct logic into the resolver or the database query avoids serializing and sending redundant rows, which lowers response size and server strain.
  • Cleaner UI Displays: Ensures that user-facing elements like dropdown menus or lists only show unique options, improving usability.
  • Simplified Client-Side Logic: Handling de-duplication on the server means the frontend code remains cleaner and focused on presentation.
  • Improved Data Accuracy: Guarantees that aggregated counts or lists are based on unique entities, preventing skewed analytics and reporting.

Purpose of this guide

This guide is for developers who need to retrieve unique values through a GraphQL API and aren’t sure where to implement the logic. You’ll get working code examples for three major environments — Gatsby 5, Apollo Server, and Hasura — plus a clear explanation of the Gatsby 5 SELECT keyword, performance optimization strategies, filtering patterns, and a comparison with group functions. By the end, you’ll know exactly how to implement distinct queries cleanly and scalably, regardless of your stack.

Introduction

GraphQL distinct queries are a go-to solution when you need unique values from a dataset — think populating a category dropdown, building a tag cloud, or powering a faceted search interface. The challenge is that GraphQL’s specification has no standardized DISTINCT operator, so the implementation falls to you: the resolver, the database query, or a framework helper. Done right, this means less data over the wire, simpler frontend code, and a faster user experience. Done wrong, you end up retrieving thousands of records just to deduplicate them on the client.

This guide covers the full picture — from what distinct queries actually are, through Gatsby 5’s syntax changes, Apollo resolver patterns, and Hasura’s native distinct_on, to performance optimization and common pitfalls. Whether you’re building a CMS, an e-commerce catalog, or an analytics dashboard, you’ll find actionable patterns here.

What are GraphQL distinct queries

A GraphQL distinct query is an operation that returns only unique values for a specified field, eliminating duplicates from the result set at the API layer. Instead of fetching every post and extracting categories on the frontend, you ask the API for a deduplicated list of categories directly. The result is a smaller payload, no client-side processing, and a single source of truth for unique values across your application.

Think of it like asking a librarian for a list of unique genres in the library rather than pulling every book and reading its genre label. You get a clean, ready-to-use list — not raw data to sift through.

| Aspect | Regular GraphQL Query | Distinct GraphQL Query |
| --- | --- | --- |
| Data Returned | All matching records including duplicates | Only unique values from specified field |
| Use Case | Fetching complete entity data | Getting unique options for filters/dropdowns |
| Performance | Returns more data | Reduces payload size |
| Frontend Processing | Requires client-side deduplication | No additional processing needed |

GraphQL itself has no standardized distinct operator, but major frameworks each provide their own solution. Hasura offers distinct_on for PostgreSQL-backed APIs. Gatsby 5 exposes a distinct() aggregation resolver using the SELECT enum syntax. Apollo Server requires a custom resolver, but gives you full control over the deduplication logic. For a broader look at how queries work, see the official GraphQL query documentation.

When to use distinct queries in GraphQL

Distinct queries are the right tool when your UI needs a list of options — not a list of records. The moment you catch yourself fetching full entities just to extract one field and deduplicate it on the client, that’s the sign to move the logic to the API.

  • Populating dropdown menus with unique category options
  • Creating filter interfaces that show available values
  • Generating tag clouds from content metadata
  • Building analytics dashboards with unique dimension values
  • Reducing data transfer for mobile applications

Content management systems are a natural fit. A blog with hundreds of posts might have only ten unique categories — a distinct query returns those ten values directly, without making the frontend process hundreds of post objects. E-commerce catalogs benefit similarly: instead of scanning all products to find unique brands, colors, or sizes, a distinct query returns the filter options ready to render.

Mobile applications get a disproportionate benefit. Bandwidth and processing constraints make client-side deduplication expensive on mobile. Moving distinct logic to the API directly reduces payload size and speeds up rendering, which matters for both user experience and data costs.

The business case is clear: distinct queries reduce development overhead, eliminate duplicated deduplication logic across teams, and scale far better as datasets grow. A pattern that works on 1,000 records with client-side filtering often breaks at 100,000, whereas a distinct query backed by an indexed column keeps returning the same small list.

Example use case: obtaining distinct values from collections

Consider a content website that needs to extract unique categories from blog posts to populate a category filter dropdown. Multiple posts share the same categories, so a standard query returns duplicates that need client-side processing. A distinct query solves this at the source.

Sample data structure:

[
  {
    "title": "GraphQL Best Practices",
    "frontmatter": {
      "category": "Development",
      "tags": ["graphql", "api"]
    }
  },
  {
    "title": "React Performance Tips",
    "frontmatter": {
      "category": "Development",
      "tags": ["react", "performance"]
    }
  },
  {
    "title": "CSS Grid Layout",
    "frontmatter": {
      "category": "Design",
      "tags": ["css", "layout"]
    }
  }
]

A standard query would return all three posts with their categories, requiring the frontend to deduplicate. The distinct query returns only the unique values:

query GetUniqueCategories {
  allMarkdownRemark {
    distinct(field: {frontmatter: {category: SELECT}})
  }
}

Response:

{
  "data": {
    "allMarkdownRemark": {
      "distinct": ["Development", "Design"]
    }
  }
}

The API becomes the single source of truth for category values. Any component that needs the category list makes one query and gets clean data — no deduplication logic, no risk of inconsistency between components. As the content collection grows to thousands of posts, the query continues to return the same compact array of unique values rather than processing the full dataset on the client.
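For comparison, this is roughly the client-side work that the distinct query removes. It is a minimal sketch built on the sample data above; the `posts` variable and the `uniqueCategories` helper are illustrative names, not part of Gatsby or any other framework.

```javascript
// Sample posts mirroring the data structure shown earlier
const posts = [
  { title: 'GraphQL Best Practices', frontmatter: { category: 'Development' } },
  { title: 'React Performance Tips', frontmatter: { category: 'Development' } },
  { title: 'CSS Grid Layout', frontmatter: { category: 'Design' } },
];

// Hypothetical helper: what every consuming component would have to do
// if the API returned full post records instead of distinct values.
function uniqueCategories(records) {
  const seen = new Set();
  for (const record of records) {
    const category = record.frontmatter && record.frontmatter.category;
    if (category != null) seen.add(category); // skip missing categories
  }
  return [...seen]; // insertion order: first occurrence wins
}

console.log(uniqueCategories(posts)); // [ 'Development', 'Design' ]
```

Every component that needs the list would have to repeat this logic, or share it through a helper, which is the duplication the server-side approach avoids.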

Sorting distinct results effectively

Distinct results from Gatsby’s distinct() resolver are returned as a plain array of strings, and the distinct() argument doesn’t accept sorting directly. If the UI needs a specific order, such as case-insensitive or locale-aware ordering, sort a copy of the array in JavaScript after the query returns.

query GetUniqueCategories {
  allMarkdownRemark {
    distinct(field: {frontmatter: {category: SELECT}})
  }
}

// JavaScript: sort a copy after the query returns (.sort() mutates in place)
const sortedCategories = [...data.allMarkdownRemark.distinct].sort();
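The default comparison orders strings by UTF-16 code units, so uppercase values land before lowercase ones. A hedged sketch of a case-insensitive, locale-aware alternative follows; the sample values and the choice of the `en` locale are assumptions for illustration.

```javascript
// Values as they might come back from a distinct query (illustrative data)
const distinctValues = ['design', 'Development', 'API'];

// Intl.Collator gives locale-aware ordering; sensitivity: 'base'
// treats 'a' and 'A' as equal when comparing.
const collator = new Intl.Collator('en', { sensitivity: 'base' });

// Copy before sorting so the query result itself is never mutated
const sortedValues = [...distinctValues].sort(collator.compare);
const defaultSorted = [...distinctValues].sort();

console.log(sortedValues);  // [ 'API', 'design', 'Development' ]
console.log(defaultSorted); // [ 'API', 'Development', 'design' ]
```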

In Hasura, sorting is built into the query via order_by, which pairs naturally with distinct_on:

query GetSortedUniqueCategories {
  posts(distinct_on: category, order_by: {category: asc}) {
    category
  }
}

Note that in Hasura, order_by must include the same fields used in distinct_on as its first ordering criteria — otherwise the query will return an error. This is a PostgreSQL constraint, not a Hasura limitation.

For more patterns on ordering query results, see the complete guide on GraphQL sorting techniques and order_by usage.

Implementation across different GraphQL environments

GraphQL’s specification defines query structure but says nothing about distinct operations, leaving each framework to implement it differently. Gatsby provides a built-in distinct() resolver. Hasura delegates to PostgreSQL’s native DISTINCT ON. Apollo requires a custom resolver. Understanding each approach matters both for implementing correctly now and for evaluating frameworks when starting new projects.

Distinct in Gatsby GraphQL

Gatsby’s built-in distinct functionality is part of its GraphQL data layer and is available on all connection types (e.g., allMarkdownRemark, allMdx). It runs at build time for static sites, meaning the deduplication overhead doesn’t affect runtime performance.

Gatsby 5 introduced a breaking change in the syntax for field selection. The old triple-underscore path format (frontmatter___category) was replaced with a nested object selector using the SELECT enum keyword.

| Feature | Gatsby 4 | Gatsby 5 |
| --- | --- | --- |
| Distinct Syntax | distinct(field: frontmatter___category) | distinct(field: {frontmatter: {category: SELECT}}) |
| Field Selection | String path with triple underscores | FieldSelectorEnum with SELECT keyword |
| Breaking Change | Legacy syntax | SELECT keyword mandatory |
| Type Safety | No schema-level validation | Field validated at parse time |

Gatsby 4 syntax:

query Gatsby4DistinctExample {
  allMarkdownRemark {
    distinct(field: frontmatter___category)
  }
}

Gatsby 5 syntax:

query Gatsby5DistinctExample {
  allMarkdownRemark {
    distinct(field: {frontmatter: {category: SELECT}})
  }
}

The challenge of finding correct syntax in Gatsby 5

Developers upgrading from Gatsby 4 frequently hit a validation error along these lines when using the old syntax (the exact wording can differ between Gatsby versions):

GraphQL Error: Field "distinct" argument "field" of type "FieldSelectorEnum"
is required but not provided.

The error message doesn’t tell you what the correct format is. The old syntax:

# Does NOT work in Gatsby 5
query IncorrectSyntax {
  allMarkdownRemark {
    distinct(field: frontmatter___category)
  }
}

The correct Gatsby 5 syntax:

# Works in Gatsby 5
query CorrectSyntax {
  allMarkdownRemark {
    distinct(field: {frontmatter: {category: SELECT}})
  }
}

When you run this in GraphiQL on a Gatsby 5 site, autocompletion will show SELECT as the only available value inside the field object — that’s the confirmation you’re using the right structure.
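When many queries need migrating, the conversion from the legacy path to the nested selector is mechanical. The helper below is a hypothetical sketch, not a Gatsby API; `toFieldSelector` is an illustrative name, and it only produces the selector text, which you would still paste or substitute into your queries yourself.

```javascript
// Hypothetical migration helper: turns a Gatsby 4 style path such as
// "frontmatter___category" into the nested selector text used in Gatsby 5.
function toFieldSelector(legacyPath) {
  const segments = legacyPath.split('___');
  if (segments.some(segment => segment.length === 0)) {
    throw new Error(`Invalid legacy field path: "${legacyPath}"`);
  }
  // Build from the innermost field outwards: SELECT marks the leaf
  return segments.reduceRight(
    (inner, segment) => `{${segment}: ${inner}}`,
    'SELECT'
  );
}

console.log(toFieldSelector('frontmatter___category'));
// {frontmatter: {category: SELECT}}
```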

Understanding the SELECT keyword in Gatsby 5

The SELECT keyword is a value of the FieldSelectorEnum type in Gatsby’s generated schema. It replaces the flat string path with a typed, nested object that the GraphQL engine validates against the schema before the query executes. Field reference errors therefore surface immediately in your IDE or during the build, rather than silently returning empty arrays at runtime.

In practice, you can use SELECT on any scalar field nested inside frontmatter or other node types:

query MultipleDistinctFields {
  allMarkdownRemark {
    categories: distinct(field: {frontmatter: {category: SELECT}})
    authors: distinct(field: {frontmatter: {author: SELECT}})
  }
}

The schema validation also means better autocompletion in GraphiQL and IDE plugins — you’ll see valid field paths suggested as you type.

Distinct in Apollo GraphQL

Apollo Server has no built-in distinct operation. You implement it in a custom resolver, which gives you full control over the deduplication logic and lets you integrate with any data source or apply business rules during extraction.

// Apollo Server resolver
const resolvers = {
  Query: {
    distinctCategories: async (parent, args, context) => {
      const allPosts = await context.dataSources.posts.getAllPosts();
      const categories = allPosts.map(post => post.category).filter(Boolean);
      return [...new Set(categories)].sort();
    }
  }
};

Schema definition:

type Query {
  distinctCategories: [String!]!
}

Client query:

query GetDistinctCategories {
  distinctCategories
}
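To try the resolver without starting a server, it can be called directly with a stubbed context. This is a sketch under assumptions: the `fakeContext` object and its sample posts are invented for the demonstration, and the resolver body is repeated only so the snippet runs on its own.

```javascript
// Same resolver logic as above, reproduced so the sketch is self-contained
const resolvers = {
  Query: {
    distinctCategories: async (parent, args, context) => {
      const allPosts = await context.dataSources.posts.getAllPosts();
      const categories = allPosts.map(post => post.category).filter(Boolean);
      return [...new Set(categories)].sort();
    },
  },
};

// Stub context: stands in for a real data source (hypothetical data)
const fakeContext = {
  dataSources: {
    posts: {
      getAllPosts: async () => [
        { id: 1, category: 'Development' },
        { id: 2, category: 'Design' },
        { id: 3, category: 'Development' },
        { id: 4, category: null }, // uncategorized post
      ],
    },
  },
};

const pending = resolvers.Query.distinctCategories(null, {}, fakeContext);
pending.then(categories => console.log(categories)); // [ 'Design', 'Development' ]
```

Because null categories are filtered out before the Set is built, the result satisfies the non-null [String!]! return type in the schema.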

For better performance with large datasets, push the deduplication to the database rather than loading all records into the resolver:

// More efficient: let the database handle distinct.
// Assumes context.db is a node-postgres style client whose query()
// resolves to a result object with a rows array.
const resolvers = {
  Query: {
    distinctCategories: async (parent, args, context) => {
      // NULL categories are excluded so the [String!]! return type holds
      const result = await context.db.query(
        'SELECT DISTINCT category FROM posts WHERE published = $1 AND category IS NOT NULL ORDER BY category',
        [true]
      );
      return result.rows.map(r => r.category);
    }
  }
};

When you need distinct values across multiple fields simultaneously, consider using DataLoader to batch and cache these queries and avoid N+1 issues if distinct resolvers are called from multiple places in the same request.
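The core idea, running each distinct lookup at most once per request, can be sketched without any library. This is not DataLoader itself but a plain per-request memo; `createDistinctLoader` and `fakeFetchDistinct` are hypothetical names, and the canned data stands in for real database calls.

```javascript
// Hypothetical per-request memo: each distinct lookup runs at most once
// per request, no matter how many resolvers ask for it.
function createDistinctLoader(fetchDistinct) {
  const inFlight = new Map(); // field name -> Promise of distinct values
  return function load(field) {
    if (!inFlight.has(field)) {
      inFlight.set(field, fetchDistinct(field));
    }
    return inFlight.get(field);
  };
}

// Stand-in for a database call; counts how often it is invoked
let queryCount = 0;
async function fakeFetchDistinct(field) {
  queryCount += 1;
  const data = { category: ['Design', 'Development'], author: ['Ana', 'Li'] };
  return data[field] || [];
}

// Create one loader per incoming request, e.g. when building the context
const loadDistinct = createDistinctLoader(fakeFetchDistinct);

const demo = Promise.all([
  loadDistinct('category'),
  loadDistinct('category'), // reuses the first promise
  loadDistinct('author'),
]).then(results => {
  console.log(results, 'queries run:', queryCount);
  return results;
});
```

Storing the promise rather than the resolved value means concurrent callers share one in-flight query instead of each starting their own.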

If your Apollo API also handles filtering, combine distinct resolver logic with where clause patterns to pre-filter the dataset before deduplicating.

Distinct in Hasura

Hasura provides native distinct functionality through its auto-generated GraphQL API, delegating directly to PostgreSQL’s DISTINCT ON clause. No custom resolver needed — the capability is available out of the box for every table and view in your database.

“You can fetch rows with only distinct values of a column using the distinct_on argument. It is typically recommended to use order_by along with distinct_on to ensure we get predictable results.”
Hasura GraphQL Docs, 2024

query HasuraDistinctExample {
  posts(distinct_on: category, order_by: {category: asc}) {
    category
  }
}

distinct_on accepts an array of columns, enabling distinct operations across multiple fields. When using multiple columns, order_by must list those same columns first — this is a PostgreSQL requirement for DISTINCT ON to produce deterministic results.

query MultiColumnDistinct {
  posts(
    distinct_on: [category, author_id],
    order_by: [
      {category: asc},
      {author_id: asc},
      {created_at: desc}
    ]
  ) {
    category
    author_id
    title
    created_at
  }
}
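If queries are assembled dynamically, the ordering rule can be checked before the request is sent. The function below is a hypothetical pre-flight check, not part of Hasura; it is deliberately strict and requires the distinct_on columns to appear in the same positions at the start of order_by.

```javascript
// Hypothetical pre-flight check for Hasura-style arguments.
// distinctOn: array of column names; orderBy: array of single-key objects,
// e.g. [{ category: 'asc' }, { created_at: 'desc' }].
function orderByLeadsWithDistinctOn(distinctOn, orderBy) {
  if (orderBy.length < distinctOn.length) return false;
  // Deliberately strict: require the same columns in the same positions
  return distinctOn.every((column, index) => {
    const keys = Object.keys(orderBy[index]);
    return keys.length === 1 && keys[0] === column;
  });
}

const valid = orderByLeadsWithDistinctOn(
  ['category', 'author_id'],
  [{ category: 'asc' }, { author_id: 'asc' }, { created_at: 'desc' }]
);
const invalid = orderByLeadsWithDistinctOn(
  ['category'],
  [{ created_at: 'desc' }, { category: 'asc' }]
);

console.log(valid, invalid); // true false
```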

Because Hasura uses PostgreSQL under the hood, all standard PostgreSQL indexing strategies apply. An index on the distinct_on column — especially a partial index when combined with a filter — can dramatically reduce query execution time.

When your Hasura query needs to return distinct rows alongside related entities, see how nested queries combine with distinct_on to fetch joined data without duplicates.

Filtering distinct results

Combining filtering with distinct queries creates focused, efficient data retrieval. The key decision is whether to filter before or after the distinct operation.

  1. Apply filters to narrow down the dataset first
  2. Execute the distinct operation on filtered results
  3. Sort the unique values if needed for presentation
  4. Validate that filtering doesn’t eliminate required distinct values

Pre-filtering applies conditions before extracting distinct values, reducing the working dataset. This is almost always the better approach for performance when filters eliminate a meaningful portion of records:

query FilteredDistinctCategories {
  allMarkdownRemark(
    filter: {
      frontmatter: {
        published: {eq: true},
        date: {gte: "2024-01-01"}
      }
    }
  ) {
    distinct(field: {frontmatter: {category: SELECT}})
  }
}

In Hasura, the equivalent uses the where argument alongside distinct_on:

query HasuraFilteredDistinct {
  posts(
    where: {published: {_eq: true}},
    distinct_on: category,
    order_by: {category: asc}
  ) {
    category
  }
}

Post-filtering (filtering the distinct results after extraction) is less common and typically handled in application logic rather than at the GraphQL layer — for example, stripping internal or deprecated category values from a list before passing it to the UI.

For complex filter combinations, pre-filtering always wins on performance because it reduces the rows the distinct operation must process. Use post-filtering only when the filter criteria apply to the distinct values themselves, not the underlying dataset.
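The difference between the two stages is easiest to see on a small in-memory dataset. The sketch below uses invented sample records and a hypothetical `hiddenCategories` blocklist; in a real API the pre-filter would run in the database or framework rather than in JavaScript.

```javascript
const posts = [
  { category: 'Development', published: true },
  { category: 'Design', published: false },
  { category: 'Internal', published: true },
  { category: 'Development', published: true },
];

const distinct = values => [...new Set(values)];

// Pre-filter: the condition applies to the records, so it runs first
const publishedCategories = distinct(
  posts.filter(post => post.published).map(post => post.category)
);

// Post-filter: the condition applies to the distinct values themselves
const hiddenCategories = new Set(['Internal']); // hypothetical blocklist
const visibleCategories = publishedCategories.filter(
  category => !hiddenCategories.has(category)
);

console.log(publishedCategories); // [ 'Development', 'Internal' ]
console.log(visibleCategories);   // [ 'Development' ]
```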

For detailed filter syntax across frameworks, see the guide on filtering multiple values in GraphQL. When you also need to cap the number of distinct results returned, combine filtering with result limiting.

Optimizing performance with distinct queries

Performance optimization for distinct queries works at three levels: the database, the query structure, and the application cache. The biggest gains almost always come from the database.

| Optimization Technique | Performance Impact | Implementation Complexity |
| --- | --- | --- |
| Database Indexing | High | Low |
| Query Result Caching | Very High | Medium |
| Field Selection Limiting | Medium | Low |
| Connection Pooling | Medium | High |
| Query Batching | High | Medium |

  • Index frequently queried distinct fields at the database level
  • Implement caching for distinct results that change infrequently
  • Limit field selection to only required data
  • Monitor query execution time in production environments
  • Use database-specific distinct optimizations when available

A partial index on the distinct column, restricted to the rows the filter keeps, gives PostgreSQL a small, pre-sorted structure to read and can often satisfy the query with an index-only scan instead of a full table scan:

-- Supports: distinct_on category WHERE published = true
-- published is not repeated as a key column because the WHERE clause already fixes its value
CREATE INDEX idx_posts_published_category
ON posts (category)
WHERE published = true;

Distinct results are ideal candidates for caching because category lists, tag sets, and filter options change far less frequently than the underlying content. A short TTL cache (5–60 minutes depending on your update frequency) can absorb the majority of distinct query load:

const getDistinctCategories = async () => {
  const cacheKey = 'distinct_categories';
  let categories = await cache.get(cacheKey);

  if (!categories) {
    // Assumes a node-postgres style client: query() resolves to { rows: [...] }
    const result = await db.query(
      'SELECT DISTINCT category FROM posts WHERE published = true ORDER BY category'
    );
    categories = result.rows.map(r => r.category);
    await cache.set(cacheKey, categories, 3600); // 1 hour TTL
  }

  return categories;
};
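The `cache` object above is left unspecified. A minimal in-memory version with the same get/set(key, value, ttlSeconds) shape might look like the following; `createTtlCache` is a hypothetical helper, and a single-process Map like this is only a sketch, since multiple server instances would need a shared store.

```javascript
// Minimal in-memory cache with the get/set(key, value, ttlSeconds) shape
// used above. The `now` parameter is injectable so expiry can be tested.
function createTtlCache(now = () => Date.now()) {
  const entries = new Map(); // key -> { value, expiresAt }
  return {
    async get(key) {
      const entry = entries.get(key);
      if (!entry) return undefined;
      if (now() >= entry.expiresAt) {
        entries.delete(key); // drop stale entries lazily on read
        return undefined;
      }
      return entry.value;
    },
    async set(key, value, ttlSeconds) {
      entries.set(key, { value, expiresAt: now() + ttlSeconds * 1000 });
    },
  };
}

// Usage with a fake clock (milliseconds)
let clock = 0;
const cache = createTtlCache(() => clock);

const demo = (async () => {
  await cache.set('distinct_categories', ['Design', 'Development'], 3600);
  const fresh = await cache.get('distinct_categories');
  clock += 3600 * 1000; // advance exactly one hour
  const expired = await cache.get('distinct_categories');
  return { fresh, expired };
})();

demo.then(result => console.log(result));
```

A miss returns undefined, which matches the `if (!categories)` check in the function above.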

In production, track distinct query execution time as a dedicated metric. Category and filter queries are called on nearly every page load — a 200ms distinct query hitting a table without an index will show up in your p95 latency before anything else.

Using group functions as an alternative approach

When you need unique values and a count per value, group functions are more efficient than running a distinct query followed by a separate count query.

| Approach | Best For | Advantages | Limitations |
| --- | --- | --- | --- |
| Distinct Queries | Simple unique value extraction | Straightforward syntax, widely supported | No aggregation (values only) |
| Group Functions | Unique values with counts or stats | Rich aggregation in a single query | More complex syntax, higher resource usage |

In Gatsby, the group() resolver returns unique values with a totalCount for each bucket — useful for category menus that show post counts:

query GroupedCategories {
  allMarkdownRemark {
    group(field: {frontmatter: {category: SELECT}}) {
      fieldValue
      totalCount
    }
  }
}

Response:

{
  "data": {
    "allMarkdownRemark": {
      "group": [
        {"fieldValue": "Development", "totalCount": 15},
        {"fieldValue": "Design", "totalCount": 8}
      ]
    }
  }
}

If your UI displays “Development (15)” style labels, group functions eliminate a second API call. If you only need the category names, distinct is simpler and faster.
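In a stack without a built-in group() resolver, such as a custom Apollo resolver, the same response shape can be produced in a single pass over the records. The `groupCounts` helper below is a hypothetical sketch that assumes string values; for large tables a GROUP BY in the database would usually be the better fit.

```javascript
// Hypothetical helper for stacks without a built-in group() resolver:
// one pass over the records yields each unique value and its count.
function groupCounts(records, getValue) {
  const counts = new Map();
  for (const record of records) {
    const value = getValue(record);
    if (value == null) continue; // ignore records without a value
    counts.set(value, (counts.get(value) || 0) + 1);
  }
  // Same shape as the Gatsby response above, sorted by fieldValue
  return [...counts]
    .map(([fieldValue, totalCount]) => ({ fieldValue, totalCount }))
    .sort((a, b) => a.fieldValue.localeCompare(b.fieldValue));
}

const posts = [
  { category: 'Development' },
  { category: 'Design' },
  { category: 'Development' },
  { category: null },
];

const grouped = groupCounts(posts, post => post.category);
console.log(grouped);
```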

For the full syntax of group operations across Gatsby, Hasura, and Apollo, see the dedicated guide on GraphQL group by.

Common pitfalls and solutions

  • Using Gatsby 4 syntax (frontmatter___category) in a Gatsby 5 project
  • Applying distinct to non-scalar field types (objects, arrays)
  • Forgetting to filter null values from distinct results
  • Running distinct on large tables without an index
  • Missing the required order_by when using Hasura’s distinct_on

Wrong field type: Distinct is meant for scalar fields, so the selector has to end at a leaf field. In the first query below, frontmatter is an object, which places SELECT one level too high and causes an error or silent empty results:

# Wrong: frontmatter is an object, not a scalar
query ProblematicDistinct {
  allMarkdownRemark {
    distinct(field: {frontmatter: SELECT})
  }
}

# Correct: target the scalar field inside the object
query CorrectDistinct {
  allMarkdownRemark {
    distinct(field: {frontmatter: {category: SELECT}})
  }
}

Null values: If a field is optional, null may appear in distinct results. Filter it out before rendering:

const categories = data.allMarkdownRemark.distinct.filter(Boolean);
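One caveat: filter(Boolean) removes every falsy value, not only null. If an empty string or 0 is a legitimate value in your data, a stricter check is safer. The sample array below is invented for illustration.

```javascript
// Illustrative distinct result containing null and an empty string
const raw = ['Development', null, '', 'Design', undefined];

// filter(Boolean) removes every falsy value, including '' and 0
const truthyOnly = raw.filter(Boolean);

// Stricter alternative: remove only null and undefined
const nonNull = raw.filter(value => value != null);

console.log(truthyOnly); // [ 'Development', 'Design' ]
console.log(nonNull);    // [ 'Development', '', 'Design' ]
```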

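Hasura missing order_by: Using distinct_on without order_by leaves it undefined which row is kept for each distinct value, so results can be non-deterministic. An order_by that does not start with the distinct_on field is rejected with a database error:

# No order_by: the row kept per category is arbitrary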
query BadHasuraDistinct {
  posts(distinct_on: category) {
    category
  }
}

# Correct: order_by must lead with the distinct_on field
query GoodHasuraDistinct {
  posts(distinct_on: category, order_by: {category: asc}) {
    category
  }
}

  1. Verify the correct syntax for your GraphQL implementation version
  2. Confirm the target field is a scalar, not an object or array
  3. Filter null values from results before passing to UI components
  4. Add a database index on the distinct field before going to production
  5. Include order_by with Hasura’s distinct_on to ensure deterministic results

Practical examples and use cases

Here are three implementation patterns that cover the most common real-world uses of distinct queries.

Dynamic category navigation (Gatsby CMS): Category menus that automatically reflect published content without manual maintenance:

query DynamicCategoryNavigation {
  allMarkdownRemark(filter: {frontmatter: {published: {eq: true}}}) {
    distinct(field: {frontmatter: {category: SELECT}})
  }
}

E-commerce filter options (Hasura + PostgreSQL): Extract unique values for product filters directly from the catalog:

query ProductBrandFilter {
  products(
    where: {active: {_eq: true}},
    distinct_on: brand,
    order_by: {brand: asc}
  ) {
    brand
  }
}

Run a separate query per filter dimension (brand, size, color) rather than trying to get all distinct values in one query — this keeps each query simple and independently cacheable.
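The per-dimension approach can be sketched as a small loader that issues the queries in parallel. All names here (`buildDistinctQuery`, `loadFilterOptions`, `fakeRunQuery`) are hypothetical, and the executor is stubbed with canned data so the snippet runs without a Hasura instance.

```javascript
// Hypothetical query builder: one Hasura-style distinct query per dimension.
// Column names are interpolated, so pass only fixed, trusted identifiers.
function buildDistinctQuery(table, column) {
  return `query Distinct_${column} {
  ${table}(where: {active: {_eq: true}}, distinct_on: ${column}, order_by: {${column}: asc}) {
    ${column}
  }
}`;
}

// runQuery is injected so the sketch stays independent of any HTTP client
async function loadFilterOptions(runQuery, table, columns) {
  const results = await Promise.all(
    columns.map(column => runQuery(buildDistinctQuery(table, column)))
  );
  // Map each dimension name to its flat list of unique values
  return Object.fromEntries(
    columns.map((column, i) => [column, results[i][table].map(row => row[column])])
  );
}

// Stub executor returning canned rows (illustrative data only)
const fakeRows = { brand: ['Acme', 'Globex'], color: ['Blue', 'Red'] };
async function fakeRunQuery(query) {
  const column = /distinct_on: (\w+)/.exec(query)[1];
  return { products: fakeRows[column].map(value => ({ [column]: value })) };
}

const demo = loadFilterOptions(fakeRunQuery, 'products', ['brand', 'color']);
demo.then(options => console.log(options));
```

Because each dimension is its own query, each result can be cached under its own key and refreshed on its own schedule.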

Analytics dimensions (Apollo + custom resolver): Supply unique segment and region values to a reporting dashboard:

const resolvers = {
  Query: {
    analyticsDimensions: async (_, __, { db }) => {
      const [segments, regions] = await Promise.all([
        db.query('SELECT DISTINCT segment FROM customers ORDER BY segment'),
        db.query('SELECT DISTINCT region FROM customers ORDER BY region')
      ]);
      return {
        segments: segments.rows.map(r => r.segment),
        regions: regions.rows.map(r => r.region)
      };
    }
  }
};

Frequently Asked Questions

What is a distinct query in GraphQL?

A distinct query in GraphQL retrieves only unique values for a specified field, eliminating duplicates from the result set. GraphQL has no built-in DISTINCT keyword, so the implementation depends on the framework: Gatsby provides a distinct() resolver, Hasura uses distinct_on, and Apollo requires a custom resolver. The result is a clean array of unique values, ideal for populating dropdowns, filter panels, and navigation menus without client-side deduplication.

How do you get distinct values in GraphQL?

The approach depends on your framework. In Gatsby 5, use the built-in distinct(field: {frontmatter: {category: SELECT}}) resolver on any connection type. In Hasura, add distinct_on: fieldName and a matching order_by to your query. In Apollo Server, write a custom resolver that queries the database with SELECT DISTINCT and returns an array of unique values. For all approaches, target scalar fields only; distinct operations on object or array fields will not work.

What is the correct distinct syntax in Gatsby 5?

In Gatsby 5, the correct syntax uses the SELECT enum keyword inside a nested field selector object: distinct(field: {frontmatter: {category: SELECT}}). The Gatsby 4 triple-underscore format (distinct(field: frontmatter___category)) no longer works and produces a FieldSelectorEnum error. The new syntax provides schema-level validation and better IDE autocompletion.

How did the distinct syntax change between Gatsby 4 and Gatsby 5?

Gatsby 5 replaced the triple-underscore path format with a nested object selector. In Gatsby 4 you would write distinct(field: frontmatter___category). In Gatsby 5 the equivalent is distinct(field: {frontmatter: {category: SELECT}}). This is a breaking change: using the old format in Gatsby 5 throws a FieldSelectorEnum error at build time. The change aligns with GraphQL’s type system requirements and enables proper schema validation of field references.

How do distinct queries affect performance?

Without a database index, a distinct query requires a full table scan, which becomes slow at scale. The fix is a standard index on the queried column. Beyond indexing, distinct results are excellent candidates for caching since category lists and filter options change infrequently; a 1-hour cache TTL can eliminate most repeated distinct queries entirely. In Gatsby, distinct queries run at build time, so runtime performance is unaffected for static sites.

When should you use distinct queries?

Use distinct queries when you need a list of unique values for UI elements like dropdown menus, category filters, or tag clouds, and you don’t want to retrieve full records just to extract and deduplicate one field on the client. Moving deduplication to the API layer reduces payload size, eliminates frontend processing overhead, and ensures consistent unique values across all parts of your application. It’s particularly valuable for mobile clients and large datasets where client-side deduplication would be expensive.