What is a wasted tool call?

A wasted tool call is a tool call that sends the model down an unhelpful path. The model may still recover, but the failed or unnecessary call adds cost, latency or noise to the trace.

Why are wasted tool calls worth debugging?

A wasted tool call can look harmless if the model self-corrects and the run succeeds on a later attempt. It still adds cost, latency and noise to the trace, and it can expose a prompt issue or failure pattern that keeps repeating across the dataset.

What's the difference between required and avoidable failed tool calls?

Some failed tool calls are required because the agent has to make the call to retrieve information it does not have. A file lookup may fail because the file does not exist, or a path check may fail because the path is wrong. Other failed tool calls are avoidable because the prompt is inaccurate or not specific enough, which leads the model down an unhelpful path.

Why can a successful run still contain wasted tool calls?

Because the model can self-correct after a failed or unhelpful tool call and still reach a usable result. The run may succeed, but the trace still includes the extra failure, extra calls and extra work that made the path slower, noisier and more expensive.

What do wasted tool calls usually indicate?

They usually indicate that the prompt or call specification needs work. The model may recover, but the wasted call still weakens the signal on whether the prompt is well specified and whether the system is behaving efficiently.

Why do subtle LLM failures often appear only at large scale?

Many LLM behaviors like sycophancy or inconsistent answers show up sparsely across thousands of conversations. They don't form clear patterns until you analyze the dataset broadly.

Hyperparam Blog

What wasted tool calls revealed about my LLM’s behavior

2026-03-25T20:00:00+00:00

We’ve all seen it happen: the LLM starts going down the wrong path and makes dozens of failed or wasted tool calls that don’t actually get it closer to its goal.

Even though models can self-correct and find a new path on a subsequent retry, self-correction can hide repeated failures that make the agent slower, more expensive and more difficult to evaluate across the dataset. In this post, I look at what wasted tool calls do to the trace, when retries are required and when they’re avoidable, and the cost of avoidable failures in practice.

Need the workflow? See my step-by-step guide for debugging wasted tool calls in LLM logs.

Key takeaways

Failed tool calls add cost, latency, noise and weaker prompt signal.
A wasted tool call makes the trace noisier even when the next run succeeds.
Required retried tool calls help the agent retrieve information it didn’t have.
Avoidable retries usually stem from prompts that are inaccurate or not specific enough.
If the model keeps self-correcting and eventually finds a new path, the trace can look healthier than the prompt really is.

A wasted tool call produces a noisy trace

Even when the model eventually finds a new path, that recovery doesn’t erase the earlier failure(s). The trace still shows the failed calls and the subsequent successful one. This makes it harder to review the run and determine whether the agent reached the result efficiently. In a production dataset, that pattern can repeat across many traces, even when the final outputs look fine.

Required retries vs. avoidable retries

Some tool-call failures are required because the agent has to make the call to retrieve information it doesn’t have. If a file lookup fails because the file doesn’t exist, that failure may still be useful because it tells the agent something it needed to know. A path check can work the same way. Other calls are genuine failures because the agent queried the wrong file, took a dead-end path or went down a rabbit hole. The line between these cases can be fuzzy.

Other failed tool calls are avoidable. These occur when the prompt is inaccurate or not specific enough. Think of incorrect parameters, wrong attribute names on objects and shell syntax issues such as unescaped pipe characters.

The hidden cost of failed tool calls in agent traces

Wasted tool calls that are avoidable hide costs that are easy to miss if you only look at whether the run succeeded:

More cost: more calls mean more tokens, more compute and more spend
More latency: retries slow the agent down even when the run succeeds
More trace noise: the extra failure and retry make the trace harder to review
More unhelpful work: the model may recover on a subsequent attempt, but it still spends extra calls on a path that was never going to help
Weaker prompt signal: recovery masks prompt defects, so a successful run is a weaker indicator of whether the prompt is doing what you think it is
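
As a rough illustration of how recovery can mask these costs, the sketch below counts failed tool calls inside runs that still succeeded. The trace shape (`succeeded`, `toolCalls`, `ok`) is a hypothetical schema invented for this example, not the format of any particular logging tool.

```javascript
// Hypothetical trace schema: each run has a `succeeded` flag and a list of
// tool calls, each marked `ok: true` or `ok: false`.
function summarizeFailedCalls(runs) {
  let totalCalls = 0
  let failedCalls = 0
  let successfulRunsWithFailures = 0
  for (const run of runs) {
    const failures = run.toolCalls.filter((call) => !call.ok).length
    totalCalls += run.toolCalls.length
    failedCalls += failures
    // A run that succeeded despite failed calls looks healthy
    // if you only check its final output
    if (run.succeeded && failures > 0) successfulRunsWithFailures++
  }
  return { totalCalls, failedCalls, successfulRunsWithFailures }
}

const exampleRuns = [
  { succeeded: true, toolCalls: [{ ok: false }, { ok: false }, { ok: true }] },
  { succeeded: true, toolCalls: [{ ok: true }] },
  { succeeded: false, toolCalls: [{ ok: false }] },
]
const failureSummary = summarizeFailedCalls(exampleRuns)
console.log(failureSummary)
```

In this toy data, the first run succeeds but spends two of its three calls on failures, which a success-only metric would never surface.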

For my step-by-step workflow, see how to debug wasted tool calls in LLM logs.

FAQ

Why are failed tool calls in LLM logs worth debugging?

Failed tool calls in LLM logs are worth debugging because although the model may correct itself and find a new path on a subsequent attempt, the retry still adds cost, latency and noise to the trace. The failure can also point to a prompt issue that keeps repeating across the dataset.

Why does a retried tool call produce a noisy trace?

A retried tool call produces a noisy trace because the successful call doesn’t erase the failed one that came before it even though the run eventually succeeds. This makes reviewing the trace and diagnosing issues more difficult.

What’s the difference between wasted tool calls that are required and ones that are avoidable?

Some failed tool calls are required because the failure provides the model with new information, for example, that a file doesn’t exist or a path is wrong. Avoidable retries happen when the prompt is inaccurate or not specific enough, which leads to underspecified tool calls.

When a wasted tool call is avoidable, what does that usually indicate?

When a failed tool call is avoidable, the prompt usually needs to be updated. The unnecessary noise in the trace weakens the signal on whether the prompt is well specified.

Virtual Scrolling for Billions of Rows — Techniques from HighTable

2026-02-11T20:00:00+00:00

Editor’s note: This post was originally written by Sylvain Lesage, one of the primary contributors to HighTable, whose work we’ve been sponsoring as part of our broader effort to plan for the data-scale problems that are emerging with LLMs. We’re republishing it here with his permission because it captures a core technical challenge we’ve been working through—how to scroll through billions of rows in the browser.

Stay tuned for a future post on what this innovation means for Hyperparam users.

TL;DR: In this post, I present five techniques related to vertical scrolling used in HighTable, a React component that can display billions of rows in a table while keeping good performance and accessibility.

It’s a long post, which reflects the complexity of rendering billions of rows in a table, and the amount of work we put into building the React component.

Table of contents:

Introduction
Demo
Scrolling basics
Technique 1: lazy loading
Technique 2: table slice
Technique 3: infinite pixels
Technique 4: pixel-precise scroll
Technique 5: two-step random access
Conclusion

Introduction

Showing data in a table is one of the first exercises you’ll find in HTML 101 courses.

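<table>
  <thead>
    <tr><th>Name</th><th>Age</th></tr>
  </thead>
  <tbody>
    <tr><td>Alice</td><td>64</td></tr>
    <tr><td>Bob</td><td>37</td></tr>
  </tbody>
</table>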

Name	Age
Alice	64
Bob	37

But, as often in data science, what works for simple cases breaks when the size increases.

In this post, I’ll showcase five techniques we use to solve challenges related to vertical scrolling in the React component to handle billions of rows.

The component also provides features for columns (sort, hide, resize), rows (select), cells (keyboard navigation, pointer interactions, custom rendering). Feel free to ask and look at the code if you’re interested in knowing more.

The component is developed at hyparam/hightable. It was created by Kenny Daniel for Hyperparam, and I’ve had the chance to contribute to its development for one year now.

This blog post was sponsored by Hyperparam. Thanks for the support and for challenging me to solve the fascinating problem of rendering billions of rows in the browser!

Demo

Try the hightable demo:

HighTable is also used in the Parquet viewer, on source.coop and in Hyperparam:

Scrolling basics

Before diving into the techniques, let’s describe how scrolling works using a standard HTML table.

The HTML structure is composed of a scrollable container, that we call the viewport, and a table element inside it:

<div class="viewport" style="overflow-y: auto;">
  <table class="table">
    ...
  </table>
</div>

In this structure, the viewport is a div with a fixed height and the CSS property overflow-y: auto enables a vertical scrollbar when the table is taller than the viewport.

In the following widget, scroll the left box up and down to see how the right box mimics the scrolling effect.

If you use a keyboard, you can focus the left box with Tab, and scroll with the arrow keys ⏶ and ⏷. Otherwise, you can use mouse wheel, drag the scroll bar, or slide on a touch screen.

The component is delimited by its fixed-size viewport (blue border). The table (golden border) is rendered inside the container. As its height is larger than the viewport height, only part of the table is visible, and a vertical scrollbar lets you change the visible part. The inner table element moves up and down within the viewport, creating the scrolling effect.

On the right side, we mimic the scrolling effect, showing the position of the table relative to the viewport.

Let’s settle some definitions and formulas that will be useful later:

in this post, we assume viewport.clientHeight, the height of the visible area, is constant. In HighTable, we measure it and react to resizing.
viewport.scrollHeight, the total height of the scrollable content, is equal to table.clientHeight. Both are equal to the number of rows in the table multiplied by the row height:
```
 const rowHeight = 33 // in pixels
 const numRows = data.numRows // total number of rows in the table
 const height = numRows * rowHeight
```
In this post, we assume the row height and the number of rows are constant. In HighTable, we react to changes in data.numRows (the number of rows in the data frame, the data structure holding the table data), for example when filtering; but we assume the row height is fixed (see issue #395 to support variable row heights).
viewport.scrollTop is the number of pixels between the top of the scrolled table and the top of the viewport. The minimum value 0px shows the top of the table, while the bottom of the table is reached at the maximum value viewport.scrollHeight - viewport.clientHeight.

The visible pixels can be computed from the viewport scroll top position:

 const firstVisiblePixel = viewport.scrollTop
 const lastVisiblePixel = viewport.scrollTop + viewport.clientHeight
 // firstVisiblePixel is inclusive, lastVisiblePixel is exclusive

Now that we have the basics, let’s see how to handle large datasets.

Technique 1: lazy loading

The first challenge when working on a large dataset is that it will not fit in your browser memory. The good news is that you won’t want to look at every row either, and certainly not all at once. So, instead of loading the whole data file at start, we only load the visible cells.

Note that lazy loading the data does not change the HTML structure of the table.

The following widget shows how lazy loading works. Scroll the left box up and down to see how the cells are loaded on demand on the right side:

In the table, only the visible cells are loaded. When scrolling, newly visible cells are requested and loaded in the background, and rendered when available.

To do so, we compute the visible rows, and only load them:

const rowStart = Math.floor(firstVisiblePixel / rowHeight)
const rowEnd = Math.ceil(lastVisiblePixel / rowHeight)
// rowStart is inclusive, rowEnd is exclusive
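
As a minimal sketch of the two formulas above, the helper below turns a scroll position into a row range. The function name `getVisibleRows` and the clamp to `numRows` are assumptions added for this example, not part of HighTable’s API.

```javascript
// Sketch: compute the visible row range from the scroll position.
// The clamp to `numRows` is an addition for this example, so the
// range never runs past the last row.
function getVisibleRows({ scrollTop, clientHeight, rowHeight, numRows }) {
  const firstVisiblePixel = scrollTop // inclusive
  const lastVisiblePixel = scrollTop + clientHeight // exclusive
  const rowStart = Math.floor(firstVisiblePixel / rowHeight) // inclusive
  const rowEnd = Math.min(numRows, Math.ceil(lastVisiblePixel / rowHeight)) // exclusive
  return { rowStart, rowEnd }
}

const viewportConfig = { clientHeight: 100, rowHeight: 33, numRows: 1000 }

// At the top: rows 0 to 2 are fully visible, row 3 is partially visible
const rangeAtTop = getVisibleRows({ scrollTop: 0, ...viewportConfig })

// At the maximum scroll position (scrollHeight - clientHeight)
const maxScrollTopBasic = viewportConfig.numRows * viewportConfig.rowHeight - viewportConfig.clientHeight
const rangeAtBottom = getVisibleRows({ scrollTop: maxScrollTopBasic, ...viewportConfig })

console.log(rangeAtTop, rangeAtBottom)
```

A partially visible row counts as visible, which is why the range uses `Math.floor` for the start and `Math.ceil` for the end.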

In HighTable, the data loading logic is handled in a data frame, passed to the React component as the data prop:

<HighTable data={data} />

The data frame is an object that defines how to load (i.e. fetch and cache) the data on demand, and how to get the loaded data for rendering. See the DataFrame TypeScript definition in types.ts.

Here is a simplified DataFrame implementation that generates random data for one column, applying some delay to simulate fetching data over the network, and persists the values in memory:

const cache = new Map()
const eventTarget = new EventTarget()
const numRows = 1_000_000

const data = {
  numRows,
  eventTarget,

  // Synchronously return the cached value (if any)
  getCell({ row }) {
    return cache.get(row);
  },

  // Load missing values for the given rows, and cache them
  async fetch({ rowStart, rowEnd }) {
    // Simulate network delay
    await new Promise((resolve) => setTimeout(resolve, 100));
    for (let row = rowStart; row < rowEnd; row++) {
      // Skip already cached rows
      if (cache.has(row)) continue;
      // Generate a random value for the cell, and cache it
      cache.set(row, {value: Math.random()});
    }
    // Emit an event to tell HighTable to re-render the visible cells
    eventTarget.dispatchEvent(new Event('resolve'));
  },
}

The data frame loads the data from the source using the asynchronous data.fetch() method. It must cache the results, and dispatch a resolve event when new data is available. The source can be anything. In our example, the data was randomly generated. It can also be obtained from a local file, an in-memory array, a remote file (using HTTP range requests), or a REST API, to name a few examples.

The data frame must also provide a synchronous data.getCell() method to get the cached data for a given cell, or undefined if the data is not loaded yet.

On every scroll move, the table is rendered, calling data.getCell() for the visible rows, as well as data.fetch() to load them in the background if necessary (it’s the responsibility of the data frame to return fast if the data is already cached). Every time new data is fetched and reported (on resolve events), the table will be re-rendered.
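
The render pass described above can be sketched as follows. The `renderRows` helper, the placeholder text and the stand-in data frame are hypothetical and only mirror the simplified `getCell` / `fetch` shape shown earlier. The real component renders React elements and listens for resolve events.

```javascript
// Sketch of one render pass over the visible rows
function renderRows(data, rowStart, rowEnd) {
  const rendered = []
  for (let row = rowStart; row < rowEnd; row++) {
    const cell = data.getCell({ row })
    // undefined means "not loaded yet": show a placeholder for now
    rendered.push(cell === undefined ? 'loading…' : String(cell.value))
  }
  // Ask the data frame to load the range in the background. It is
  // expected to return quickly for rows that are already cached.
  Promise.resolve(data.fetch({ rowStart, rowEnd })).catch(console.error)
  return rendered
}

// Tiny stand-in data frame with rows 0 and 1 already cached
const demoCache = new Map([[0, { value: 'a' }], [1, { value: 'b' }]])
const fetchRequests = []
const demoData = {
  numRows: 3,
  getCell: ({ row }) => demoCache.get(row),
  fetch: async (range) => { fetchRequests.push(range) },
}

const firstPass = renderRows(demoData, 0, 3)
console.log(firstPass, fetchRequests)
```

In a full implementation, a listener on the data frame’s resolve event would trigger another render pass, at which point the placeholder is replaced by the loaded value.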

You can find a more complete example of a data frame that loads a remote Parquet file (using HTTP range requests) in the hyparquet demo.

The data frame structure is not oriented towards rows or columns, and allows loading and accessing the data by cell. Currently, in HighTable, we load full rows, but we could improve by computing the visible columns and loading them lazily as well. Join the pending discussion if you’re interested in this feature.

Impact of lazy loading

If we assume 10 billion rows, and 100 bytes per row, the total data size is 1TB. Loading it all in memory is not possible, but with lazy loading, we only load 3KB for the visible part (about 30 rows at a time), and keep good performance.

Lazy loading the data is the first step, required to handle large datasets in the browser. The next step is to avoid rendering too many HTML elements at once.

Technique 2: table slice

In software engineering, when you try to optimize, the first step is to remove computation that does nothing. In our case, if the table has one million rows and we can see only 30 at a time, why render one million HTML elements? As a reference, Chrome recommends creating or updating fewer than 300 HTML elements for optimal responsiveness.

In the component, only the visible slice of the table is rendered. The other row elements simply don’t exist.

To achieve this, the HTML structure must be adapted, by adding an intermediate div element, that we call the canvas, between the viewport and the table:

<div class="viewport" style="overflow-y: auto;">
  <div class="canvas" style="position: relative; height: 30000px;">
    <table class="table" style="position: absolute; top: 3000px;">
      ...
    </table>
  </div>
</div>

The HTML structure will remain the same for the rest of the blog post, including techniques 3, 4 and 5.

The canvas div is not related at all to the HTML <canvas> element. I’m open to suggestions for better naming if it’s confusing.

The canvas is sized so that it could contain all the rows:

canvas.style.height = `${data.numRows * rowHeight}px`

It sets the viewport scrollbar to the expected size. As shown in the scrolling basics section, viewport.scrollHeight is equal to canvas.clientHeight.

The canvas serves as a reference for absolutely positioning the table slice.

The following widget shows how table slicing works. Scroll the left box up and down to see how the right box mimics the scrolling effect, while rendering only the visible rows. Toggle the full table button to see how the rendered rows fit in the full table:

On the right side, you see that only the visible rows are rendered. The table slice contains 6 rows (or 7, depending on the scroll position) instead of 10.

The HTML structure inside the table slice is:

<tbody>
  <tr>...row 100...</tr>
  <tr>...row 101...</tr>
  ...
  <tr>...row 119...</tr>
</tbody>

Let’s assume the data has 1,000 rows, each row in the table is 30px high, and the viewport height is 600px (so that about 20 rows are visible at once). If the user has scrolled down 3,000px, HighTable only renders rows 100 to 119 in the actual table element.

The HTML above is a simplification. In hightable, we render a table header and add some padding rows before and after the visible rows to improve the scrolling experience.

The table top position is adjusted so that the slice sits where it would in the full table (toggle the Show / Hide button to render the full table). It equals the position of the first visible row inside the virtual full table. It’s nearly equal to viewport.scrollTop, but differs by the number of hidden pixels at the top of the first visible row. So:

table.style.top = `${
  viewport.scrollTop - (viewport.scrollTop % rowHeight)
}px`;

These computations are done on every scroll event (and on every other change: when the viewport height changes, or when the number of rows is updated). Once computed, the table slice is re-rendered with the new visible rows, the table position is updated with the new top value, and the data frame is queried to load the new visible cells if needed.
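
To make these computations concrete, here is a small sketch that reproduces the numbers of the example above (1,000 rows, 30px rows, 600px viewport, 3,000px scrolled). The function name `computeTableSlice` and its return shape are hypothetical. It ignores the header and padding rows that HighTable adds.

```javascript
// Sketch: everything needed to render and position the table slice
function computeTableSlice({ scrollTop, clientHeight, rowHeight, numRows }) {
  // The canvas is as tall as the full table would be
  const canvasHeight = numRows * rowHeight
  // Visible rows: start inclusive, end exclusive
  const rowStart = Math.floor(scrollTop / rowHeight)
  const rowEnd = Math.min(numRows, Math.ceil((scrollTop + clientHeight) / rowHeight))
  // Align the slice with the top of the first visible row
  const tableTop = scrollTop - (scrollTop % rowHeight)
  return { canvasHeight, rowStart, rowEnd, tableTop }
}

const sliceConfig = { clientHeight: 600, rowHeight: 30, numRows: 1000 }

// Scrolled exactly to a row boundary: rows 100 to 119
const alignedSlice = computeTableSlice({ scrollTop: 3000, ...sliceConfig })

// Scrolled 10px further: row 100 is partly hidden and row 120 appears
const shiftedSlice = computeTableSlice({ scrollTop: 3010, ...sliceConfig })

console.log(alignedSlice, shiftedSlice)
```

The second call shows why the number of rendered rows varies by one: when the scroll position falls inside a row, both the first and the last rows are partially visible.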

A detail worth mentioning is the sticky header. In HighTable, the header with column names is rendered as part of the table element, in <thead>, not as a separate element. It helps with accessibility, as screen readers can easily identify the header cells associated with each data cell, and with column resizing, as the header and data cells are aligned automatically by the browser. Thanks to the CSS property position: sticky (see sticky on MDN), the header row remains visible at the top of the viewport when scrolling. We take it into account to compute the first visible row.

Note that the table slicing technique is not specific to vertical scrolling. The same approach can be used for horizontal scrolling (rendering only the visible columns). It’s less critical, as tables generally have fewer columns than rows. Join the pending discussion on virtual columns if you’re interested in this feature.

Impact of table slicing

If we assume 10 billion rows, and 30 rows are visible at a time, we only render 30 HTML elements instead of 10 billion. This keeps performance good with any number of rows, as the number of rendered elements is constant.

Until now, everything is pretty standard. The next techniques are more specific to hightable, and address challenges that arise when dealing with billions of rows.

Technique 3: infinite pixels

Technique 2 works perfectly, until it breaks… As Eric Meyer explains in his blog post Infinite Pixels, HTML elements have a maximum height, and the exact value depends on the browser. The worst case is Firefox: about 17 million pixels. As the canvas height increases with the number of rows, if the row height is 33px (the default in HighTable), we cannot render more than 500K rows.

Our approach to this issue in HighTable is to set a maximum height for the canvas and downscale the scrollbar resolution above this limit. In HighTable, the threshold is set to 8 million pixels.

Concretely, above the threshold, one scrolled pixel corresponds to multiple pixels in the full table. The downscaling factor is the ratio between the theoretical height of the full table and the maximum height of the canvas. Thanks to that factor, if you scroll half the scrollbar, you reach the middle of the full table, no matter how big it is.

Below the threshold, the downscaling factor is 1, so everything works as before: one scrolled pixel corresponds to one pixel in the full table.

The downscale factor is computed as:

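const fullTableHeight = data.numRows * rowHeight
const maxCanvasHeight = 8_000_000
let downscaleFactor
if (fullTableHeight <= maxCanvasHeight) {
  // below the threshold: one scrolled pixel is one table pixel
  downscaleFactor = 1
} else {
  // above the threshold: map the scrollable range of the canvas
  // onto the scrollable range of the full table
  downscaleFactor =
    (fullTableHeight - viewport.clientHeight) /
    (maxCanvasHeight - viewport.clientHeight)
}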

Now, the first visible row is computed with:

const firstVisibleRow = Math.floor(
  (viewport.scrollTop * downscaleFactor) / rowHeight
)

and the table top position is set to align the first visible row with the top of the viewport:

table.style.top = `${viewport.scrollTop}px`;

This lets the user navigate through the whole table, even with billions of rows.
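
The following sketch wraps the two formulas in functions and checks the property described earlier: scrolling half the scrollbar lands in the middle of the table. The function names and the 600px viewport height are assumptions made for this example.

```javascript
const MAX_CANVAS_HEIGHT = 8_000_000 // threshold quoted in this post

function getDownscaleFactor({ numRows, rowHeight, clientHeight }) {
  const fullTableHeight = numRows * rowHeight
  if (fullTableHeight <= MAX_CANVAS_HEIGHT) return 1
  return (fullTableHeight - clientHeight) / (MAX_CANVAS_HEIGHT - clientHeight)
}

function getFirstVisibleRow({ scrollTop, rowHeight, downscaleFactor }) {
  return Math.floor((scrollTop * downscaleFactor) / rowHeight)
}

// A small table fits in the canvas: no downscaling
const smallFactor = getDownscaleFactor({ numRows: 1000, rowHeight: 33, clientHeight: 600 })

// Ten billion rows: the scrollbar is downscaled
const hugeTable = { numRows: 10_000_000_000, rowHeight: 33, clientHeight: 600 }
const hugeFactor = getDownscaleFactor(hugeTable)
const hugeMaxScrollTop = MAX_CANVAS_HEIGHT - hugeTable.clientHeight

// Half the scrollbar leads to (roughly) the middle row of the table
const middleRow = getFirstVisibleRow({
  scrollTop: hugeMaxScrollTop / 2,
  rowHeight: hugeTable.rowHeight,
  downscaleFactor: hugeFactor,
})
console.log(smallFactor, hugeFactor, middleRow)
```

Subtracting the viewport height in both the numerator and the denominator makes the maximum scroll position of the canvas correspond to the maximum scroll position of the full table, so the last rows stay reachable.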

The following widget shows how scrollbar downscaling works. Scroll the left box up and down to see how the right box mimics the scrolling effect, letting you navigate through ten billion rows.

But there is a drawback. The native scroll bar precision is limited to 1 physical pixel. On “high-resolution” screens, the apparent precision is a fraction of a CSS pixel (1 / devicePixelRatio). But let’s keep one pixel for simplicity.

As an anecdote, the result of setting the scroll value programmatically is hard to predict. It depends on the device pixel ratio, which itself depends on the zoom, and maybe other factors. For example, element.scrollTo({top: 100}) might result in scrollTop = 100, scrollTop = 100.23, or scrollTop = 99.89. You cannot know the exact value in advance, only that it lands within a margin of about one pixel.

The scrollTop value can even be outside of the expected range, for example negative or larger than the maximum value scrollHeight - clientHeight. To prevent such browser-specific over-scroll effects, when reacting to a scroll event, hightable always clamps the scrollTop value within the expected range, and applies the CSS rule overflow-y: clip. clip, instead of hidden, shows the sticky header, even if I’m not sure why to be honest.

So, when the downscale factor is big, like in the example above (2,189,781,021), the minimal scroll move (1px) corresponds to 2,189,781,021 pixels in the full table. With a row height of 30px, it means that the minimal scroll move corresponds to about 72,992,701 rows. It creates gaps in the reachable rows:

if viewport.scrollTop = 0, the visible rows are 0 to 5
if viewport.scrollTop = 1, the visible rows are 72,992,700 to 72,992,705
if viewport.scrollTop = 2, the visible rows are 145,985,401 to 145,985,406
and so on…

There is no way to navigate to the rows 6 to 10, for example. Reaching row 6 would require setting viewport.scrollTop to about 0.0000000822 (180px in the full table divided by the downscale factor), which is impossible because the browser rounds the scroll position to the nearest integer pixel.

Impact of infinite pixels

If we assume 10 billion rows, the infinite pixels technique lets the user navigate through the whole span of rows. There is no limit to the number of rows, as we can always increase the downscale factor to fit in the maximum canvas height.

But due to the limited scrollbar precision, if the row height is 30px and the canvas is 8Mpx, each scrolled pixel moves the table by 1,250 rows. It means that only one row (and its neighbors) out of 1,250 is reachable.

The infinite pixels technique thus provides global navigation through billions of rows. But it does not allow fine scrolling, and some rows are unreachable. Technique 4 addresses this issue.
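
The 1,250 figure can be checked with a few lines of arithmetic. This is only a sketch: the function name and the 600px viewport height are assumptions, and the viewport height barely affects the result at this scale.

```javascript
// Sketch: how many rows one scrolled pixel skips once the scrollbar is downscaled
function rowsPerScrolledPixel({ numRows, rowHeight, clientHeight, maxCanvasHeight }) {
  const fullTableHeight = numRows * rowHeight
  // Below the threshold, one scrolled pixel is one table pixel
  if (fullTableHeight <= maxCanvasHeight) return 1 / rowHeight
  const factor = (fullTableHeight - clientHeight) / (maxCanvasHeight - clientHeight)
  return factor / rowHeight
}

const rowsSkipped = rowsPerScrolledPixel({
  numRows: 10_000_000_000,
  rowHeight: 30,
  clientHeight: 600, // assumed viewport height
  maxCanvasHeight: 8_000_000,
})
console.log(Math.round(rowsSkipped)) // about 1,250 rows per scrolled pixel
```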

Technique 4: pixel-precise scroll

The previous technique lets users scroll globally through the file, but prevents them from scrolling locally because any scroll gesture will jump over gaps of unreachable rows.

To fix that, we implement two scrolling modes: local and global scrolling. Local scrolling means scrolling the table slice pixel by pixel (i.e. even more precisely than row by row), while global scrolling means jumping to the position given by the scrollbar.

The logic requires a state with three values: { scrollTop, globalAnchor, localOffset }

the last viewport scroll top value is stored in the state to compute the scroll move on every scroll event.
the global anchor is the viewport scroll top value corresponding to the last global scroll. It is updated on every global scroll, but not on local scrolls.
the local offset is the offset applied to the global anchor to compute the current scroll position. It is updated on every local scroll, and reset to 0 on global scrolls.

The first visible row is computed from the global anchor and the local offset:

const firstVisibleRow = Math.floor((
    state.globalAnchor * downscaleFactor + state.localOffset
  ) / rowHeight)

The absolute positioning of the table is now:

table.style.top = `${viewport.scrollTop + state.localOffset}px`;

On every scroll event, we compute the magnitude of the scroll move (difference between the new viewport’s scroll top and the previous one, stored in the state) and decide to apply:

a global scroll if the scroll move is big, typically on scrollbar drag and drop, and we jump to the new global position (technique 3),
or a local scroll if the scroll move is small, for example when using the mouse wheel. In that case, we keep the state’s globalAnchor value unchanged (i.e. no longer synced with the real scrollTop value) and adjust the localOffset so that the move appears local (for example, 3 rows downwards).

Represented as code, the logic looks like this (simplified, pseudo-code):

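const state = getState()
const delta = viewport.scrollTop - state.scrollTop
if (Math.abs(delta) > localThreshold) {
  // global scroll: jump to the scrollbar position and reset the local offset
  state.localOffset = 0
  state.globalAnchor = viewport.scrollTop
} else {
  // local scroll: keep the global anchor and move by the scrolled amount
  state.localOffset += delta
}
// remember the scroll position to compute the next scroll move
state.scrollTop = viewport.scrollTop
setState(state)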

Now, the user can navigate around the current row, but also jump to any part of the data.
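
Here is a self-contained sketch of that logic written as a pure state transition, so it can run outside the browser. The `LOCAL_THRESHOLD` value, the function names and the downscale factor of 37,500 are assumptions chosen for the example, not HighTable’s actual values.

```javascript
// Hypothetical threshold (in pixels) separating local from global scroll moves
const LOCAL_THRESHOLD = 500

// Returns the next state after a scroll event reporting `scrollTop`
function onScroll(state, scrollTop) {
  const delta = scrollTop - state.scrollTop
  if (Math.abs(delta) > LOCAL_THRESHOLD) {
    // Global scroll: jump to the scrollbar position, reset the local offset
    return { scrollTop, globalAnchor: scrollTop, localOffset: 0 }
  }
  // Local scroll: keep the anchor, accumulate the small move
  return { ...state, scrollTop, localOffset: state.localOffset + delta }
}

function firstVisibleRowOf(state, downscaleFactor, rowHeight) {
  return Math.floor(
    (state.globalAnchor * downscaleFactor + state.localOffset) / rowHeight
  )
}

let scrollState = { scrollTop: 0, globalAnchor: 0, localOffset: 0 }

// Mouse wheel: 90px is a local move, i.e. 3 rows of 30px
scrollState = onScroll(scrollState, 90)
const rowAfterWheel = firstVisibleRowOf(scrollState, 37_500, 30)

// Scrollbar drag to 4,000px: a global jump
scrollState = onScroll(scrollState, 4_000)
const rowAfterDrag = firstVisibleRowOf(scrollState, 37_500, 30)

console.log(rowAfterWheel, rowAfterDrag)
```

With these numbers, the wheel move advances by 3 rows, while the drag jumps to row 5,000,000 (4,000 × 37,500 / 30).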

The following widget shows the dual scrolling mode. Scroll the left box up and down to see how the right box mimics the scrolling effect, letting you navigate both locally and globally through ten billion rows.

With this approach, small scroll moves appear local, while large scroll moves jump to the expected global position. The user can navigate through the whole table, and reach every row. The user can scroll as expected in the browser, with their mouse wheel, touchpad, keyboard (when the table is focused) or scrollbar.

Impact of pixel-precise scroll

If we assume 10 billion rows, the dual scrolling mode gives access to any pixel of the full table using the native scrollbar. The user can scroll locally with the mouse wheel, and scroll globally by dragging the scrollbar.

This works if the full table height is less than the maximum canvas height (8Mpx in hightable) squared, which corresponds to about 64 trillion pixels. So, 1px fidelity is guaranteed up to 2 trillion rows with a row height of 30px.

Above that limit, the minimal step is greater than 1px, but every row is still reachable up to 64 trillion rows! Above, some rows become unreachable.

The last challenge is to move to any cell programmatically (i.e. random access to any part of the table), be it using the keyboard or through a “jump to row” input, without worrying about the local vs global scrolling mode. Random access requires decoupling vertical and horizontal scrolling. We explain it in the next section.

Technique 5: two-step random access

One of the HighTable requirements is to allow keyboard navigation (e.g. ↓ to go to the next row). Fortunately, the Web Accessibility Initiative (WAI) provides guidance through the Grid Pattern and the Data Grid Examples. We use tabindex roving to handle the focus, providing all the expected keyboard interactions.

The browser provides a useful default when calling cell.focus(): it automatically scrolls to the cell and focuses it. But in HighTable, we don’t use the default behavior, because it positions the cell at the center of the viewport, which does not feel natural.

To get the expected behavior, we first scroll by the minimal amount to show the next row and column, by calling cell.scrollIntoView({block: 'nearest', inline: 'nearest'}). Then we set the focus with no scroll action using cell.focus({preventScroll: true}).

Unfortunately, the keyboard navigation techniques explained in the WAI resources are designed for full tables. Because of techniques 2 (table slice), 3 (infinite pixels) and 4 (pixel-precise scroll), multiple steps are required here. In particular, to let the user move the active cell with the keyboard, we separate the vertical scrolling logic from the horizontal one.

When the user moves the active cell, the final position can be anywhere in the table: ↓ moves to the next row, while Ctrl+↓ moves to the last row. If the move is big, we might have to scroll vertically to have the required cell in the DOM.

The same issue arises whenever we access a random row in the table, for example if an app embedding HighTable provides a “jump to row” feature. The table should programmatically scroll to the expected row, and focus the cell in the expected column, without worrying about the local vs global scrolling mode, or the horizontal scroll position.

The process is as follows:

compute the next state (global anchor and local offset) that will make the row of the required cell visible,
programmatically scroll to the new scrollTop position, if the global anchor has changed,
once scrolled, render the table slice to have the required cell in the DOM,
scroll horizontally if needed with cell.scrollIntoView({inline: 'nearest'}),
set the focus to the new cell with cell.focus({preventScroll: true}).

Note that, for the first step (computing the next state), we respect the block: nearest behavior by minimizing the scroll move. If the next row is below the current viewport, it will be the last visible row in the next viewport. If it is above, it will be the first visible row. If it is already visible, no vertical scroll is applied.
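
The “nearest” rule can be sketched at row granularity as follows. The function name and parameters are hypothetical, and the sketch assumes the viewport shows a whole number of fully visible rows. The real computation works on the pixel-based state (global anchor and local offset).

```javascript
// Sketch: pick the first visible row after moving the active cell to `targetRow`,
// scrolling by the minimal amount
function nextFirstVisibleRow({ targetRow, firstVisibleRow, visibleRowCount }) {
  const lastVisibleRow = firstVisibleRow + visibleRowCount - 1
  // Target above the viewport: it becomes the first visible row
  if (targetRow < firstVisibleRow) return targetRow
  // Target below the viewport: it becomes the last visible row
  if (targetRow > lastVisibleRow) return targetRow - visibleRowCount + 1
  // Target already visible: no vertical scroll
  return firstVisibleRow
}

// The viewport currently shows rows 100 to 119
const currentView = { firstVisibleRow: 100, visibleRowCount: 20 }
const afterMoveDown = nextFirstVisibleRow({ targetRow: 120, ...currentView })
const afterMoveUp = nextFirstVisibleRow({ targetRow: 50, ...currentView })
const afterMoveInside = nextFirstVisibleRow({ targetRow: 110, ...currentView })

console.log(afterMoveDown, afterMoveUp, afterMoveInside)
```

Moving down to row 120 shifts the view by a single row (rows 101 to 120), which matches what users expect from pressing ↓ at the bottom of the viewport.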

The pseudo-code for decoupling vertical and horizontal scrolling requires a flag to prevent horizontal scrolling and focus during the programmatic vertical scroll:

/* in the cell navigation code */
const shouldScroll = state.update()
renderTableSlice()
if (shouldScroll) {
  // set a flag to prevent horizontal scrolling + focus
  // during programmatic scroll
  setFlag('programmaticScroll')
  viewport.scrollTo({top: state.globalAnchor, behavior: 'instant'})
}

/* in the scroll event handler */
if (isFlagSet('programmaticScroll')) {
  // allow horizontal scrolling + focus,
  // once the programmatic scroll is done
  clearFlag('programmaticScroll')
}

/* in the cell rendering code */
if (!isFlagSet('programmaticScroll')) {
  // horizontal scrolling + focus allowed
  cell.scrollIntoView({inline: 'nearest'})
  cell.focus({preventScroll: true})
}

We set behavior: 'instant' when scrolling programmatically to ensure we only receive one scroll event. The alternative, behavior: 'smooth', would trigger multiple scroll events, clearing the flag too early, and generating conflicts with the internal state due to intermediate unexpected scrollTop positions (see the open issue).

Impact of two-step random access

With this technique, the user can access any random cell in the table with the keyboard, and the table will scroll to the expected position, even with billions of rows. The vertical and horizontal scrolling are decoupled, so that the user can move to the next column with → without triggering a vertical scroll, and vice versa with ↓.

Conclusion

No need for a fake scroll bar. No need to render the table in a canvas. We use the Web platform. Thanks to these five techniques that rely on native HTML elements, hightable lets you navigate seamlessly through billions of rows of a remote data file, in the browser.

Give a star ⭐ to the GitHub repo if you liked the article!

How to debug chatbot failures by inspecting LLM logs

2026-01-29T08:00:00+00:00

A real-world workflow for debugging chatbot failures at scale

Most dissatisfied users don’t complain. They churn.

“I asked your chatbot how I could talk with a live customer service agent and it gave me a nonsensical answer, so I never used it again.”

That complaint is unusually helpful. Most users don’t send feedback. They just bounce.

This post walks through a real-world workflow for inspecting LLM logs to debug chatbot failures, identify systemic issues, and validate fixes using real production data.

For Darryl, the engineer debugging the issue, the immediate question wasn’t why this one chat failed. It was:

If one person noticed and complained, how many others hit the same failure and churned?

The team had shipped the chatbot a month earlier. In that time, the LLM logs had already exploded to multiple gigabytes in Parquet and were still growing fast. Reading them manually wasn’t an option. Spot-checking wasn’t an option either: this was a trust failure in a support channel, and you can’t restore trust by guessing.

Darryl needed a workflow that could answer two things quickly:

Find the failing conversations (including this user’s chat) without knowing the exact phrasing.
Reproduce and fix the failure, then validate the fix across real historical inputs—not just a few hand-picked examples.

The first step was to stop thinking in terms of individual conversations and start reasoning over the LLM logs as a dataset.

Step 1: Inspect the logs like a dataset, not a transcript

Darryl loaded the Parquet logs into Hyperparam and started by scanning raw rows to understand what was captured per turn (messages, tool calls, metadata).

The first goal was simple: locate conversations that matched the complaint, such as “trying to reach a human,” “live agent,” “customer service,” etc. That’s awkward in SQL because the query is semantic: the same intent shows up across many different phrasings.

Instead of writing brittle keyword filters, he used an AI agent to filter the dataset down to conversations that likely matched the reported intent. Then he pulled up the specific user’s interaction to review the full conversational context.

Step 2: Identify the real failure mode (it wasn’t “random hallucination”)

Darryl soon discovered that the issue wasn’t just that the chatbot wrote something wrong. In the failing chat, the model attempted a tool call to answer a factual question about support availability—but it called the wrong tool.

That’s an important distinction, because it changes the fix:

If it’s pure model generation, you’re tuning prompts and refusal behavior.
If it’s tool routing, you’re fixing selection, schema, constraints, and guardrails—and you can validate the fix deterministically across historical inputs.

Step 3: Turn “one bug” into a measurable pattern

After fixing the single root cause, Darryl ran a broader review to look for other cases where users asked factual questions and received low-quality or nonsensical answers. Once he zoomed out to the full dataset, individual failures stopped being useful on their own. The questions he needed to ask were:

Which intents fail most?
Which tool calls correlate with failures?
Is the problem isolated to one path or systemic?
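Once each turn carries a failure label, questions like these reduce to a group-by over the log rows. Here is a minimal sketch, assuming hypothetical fields intent, tool, and failed on each row. In practice such labels would come from an AI-scored column, not from the raw logs, and the sample values below are made up for illustration.

```javascript
// Hypothetical labeled rows: one per chatbot turn.
const rows = [
  { intent: 'reach_human', tool: 'search_faq', failed: true },
  { intent: 'reach_human', tool: 'search_faq', failed: true },
  { intent: 'reach_human', tool: 'support_hours', failed: false },
  { intent: 'billing', tool: 'search_faq', failed: false },
  { intent: 'billing', tool: 'lookup_invoice', failed: false },
  { intent: 'billing', tool: 'lookup_invoice', failed: true },
]

// Group rows by a key and compute the failure rate per group,
// sorted from the highest failure rate to the lowest.
function failureRateBy(rows, key) {
  const groups = new Map()
  for (const row of rows) {
    const group = groups.get(row[key]) ?? { total: 0, failed: 0 }
    group.total += 1
    if (row.failed) group.failed += 1
    groups.set(row[key], group)
  }
  return [...groups.entries()]
    .map(([value, { total, failed }]) => ({ [key]: value, total, failed, rate: failed / total }))
    .sort((a, b) => b.rate - a.rate)
}

console.log(failureRateBy(rows, 'intent'))
console.log(failureRateBy(rows, 'tool'))
```

If failures cluster in one intent or one tool, the problem is likely isolated to that path. If they spread evenly across groups, it is more likely systemic.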

The agent surfaced multiple similar issues. What initially appeared to be a single complaint turned out to be a recurring failure that could be costing customers.

Step 4: Replay real conversations to validate changes

At this stage, Darryl needed to validate behavior by asking:

Given the exact same user inputs from the original version (V1), does the updated version (V2) reliably choose the right tool and produce a correct answer?

With Hyperparam, Darryl replayed historical conversations under different configurations (prompts, tooling, model), then compared outputs across variants using LLM-as-a-judge to score improvements at scale.

This made it possible to see whether fixes held up across the full replay dataset, not just a few handpicked samples.
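The comparison step amounts to pairing each replayed conversation's V1 and V2 judge scores and summarizing the difference. Here is a minimal sketch, assuming an LLM judge has already assigned each variant a score between 0 and 1. The result shape and field names are hypothetical and are not Hyperparam's export format.

```javascript
// Hypothetical replay results: one entry per historical conversation,
// with a judge score (0 to 1) for each variant.
const replay = [
  { id: 'c1', v1: 0.2, v2: 0.9 },
  { id: 'c2', v1: 0.8, v2: 0.8 },
  { id: 'c3', v1: 0.5, v2: 0.7 },
  { id: 'c4', v1: 0.9, v2: 0.6 },
]

// Count improvements and regressions, and average the score change.
function compareVariants(results) {
  const summary = { improved: 0, unchanged: 0, regressed: 0, meanDelta: 0 }
  for (const { v1, v2 } of results) {
    if (v2 > v1) summary.improved += 1
    else if (v2 < v1) summary.regressed += 1
    else summary.unchanged += 1
    summary.meanDelta += (v2 - v1) / results.length
  }
  return summary
}

const summary = compareVariants(replay)
// Regressions matter as much as the average: list them for manual review.
const regressions = replay.filter(r => r.v2 < r.v1).map(r => r.id)
console.log(summary, regressions)
```

Reporting regressions alongside the mean guards against a change that lifts the average while breaking conversations that used to work.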

After iterating, he exported a concrete set of changes for the next chatbot version: which tool call behavior to adjust, what prompt or tool constraints to add, and which configuration produced the best outcomes on the replay dataset.

“I found the right setup within a couple of hours without pulling in an entire team. Being able to compare V1 and V2 across real inputs made it obvious which changes actually worked.”

If you’re debugging chatbot failures using real LLM logs, this is the kind of workflow Hyperparam is designed to support.

Squirreling: a new SQL engine for the web

2025-12-30T08:11:00+00:00

Hyperparam is built on three bets: first, there’s a goldmine of information in LLM data — e.g. LLM chat logs or LLM training data. Second, people need tools to understand this data. And third, the browser is the only place to build modern interactive data tools. Over the past year, we’ve built a browser-native tool without a “classical” backend server that helps users transform and analyze massive LLM datasets. One of the core functionalities is the ability to extract structured information at scale with the use of AI.

Now comes the challenge. Hyperparam’s users end up with massive datasets that mix large text-blobby columns, such as chat logs, with structured columns holding labels, scores, or other more classical structured information. Users need the ability to query this data in the browser in an AI-native manner. The only AI-friendly language to do this with is SQL, but there’s no SQL engine built natively for the browser that’s fast enough, memory-efficient enough, or async enough to meet Hyperparam’s standards for interactivity.

So I did what all engineers do: I built Squirreling, a ~9 KB (minified and gzipped) SQL engine with zero external dependencies. It achieves instant startup and constant memory usage for streaming queries.

I made my first commit on November 15th, open-sourced it on November 22nd, and had it live in Hyperparam on November 26th. One person could never have done this in such a short timeframe without AI.

Drawbacks of WebAssembly for SQL engines

To understand why existing browser-based SQL engines struggle with interactive data exploration, it helps to examine how they’re built. Tools like DuckDB-Wasm compile a full analytical SQL engine to WebAssembly so it can run inside the browser. But database engines relying on WebAssembly to run in the browser face inherent limitations:

Large footprint: DuckDB-Wasm, for instance, exceeds 4 MB, which adds to download and startup time.
Synchronous execution model: This limits true streaming execution.
Differences in memory types: WebAssembly’s linear memory model is separate from JavaScript’s heap memory, which requires data copying at boundaries.

In practice, this shows up as noticeable startup times before queries can run, delayed time-to-first-result, and execution behavior that prioritizes throughput over interactivity. Queries run to completion before yielding results. And if you wanted to have derived columns or user-defined functions (UDFs), there’s no way to connect DuckDB-Wasm with async API calls such as LLMs. This makes existing SQL engines less than optimal for exploratory workflows that depend on fast, incremental feedback.

How Squirreling fundamentally differs from existing SQL engines

Squirreling emerged as a response to the simple question: what happens if you design a SQL engine for the browser first, instead of adapting a server-oriented database to run there?

Starting from that premise leads to a different set of design choices than those made by existing solutions:

Async-native execution: Squirreling relies on JavaScript’s AsyncGenerator protocol throughout, resulting in streaming query results and a responsive UI.
Late materialization via lazy cells: Table cells are represented as async functions and are only evaluated when accessed, which minimizes expensive operations.
Pluggable data sources: The AsyncDataSource interface decouples query execution from data retrieval, allowing data sources to only return the specific rows and columns a query requires.

These design choices are reflected throughout Squirreling’s architecture, shaping how queries execute, how data is retrieved, and how work is scheduled.

Squirreling’s architecture

Let’s examine the key parts of Squirreling’s architecture that make these design choices concrete.

Late materialization

Squirreling delays computing column values until the query needs them. Expensive operations only run on cells that survive earlier stages such as filtering, sorting, and limiting.
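The idea can be sketched with cells represented as async functions. This illustrates the technique only and is not Squirreling's actual API: makeRow, expensiveSummary, and the call counter are hypothetical names introduced for the example, and the SQL in the comment is hard-coded into the generator, not parsed.

```javascript
let expensiveCalls = 0

// Stand-in for an expensive async operation (e.g. a model call).
async function expensiveSummary(text) {
  expensiveCalls += 1
  return text.slice(0, 10).toUpperCase()
}

// Each cell is an async function; nothing is computed until it is called.
function makeRow(id, text) {
  return {
    id: async () => id,
    text: async () => text,
    summary: async () => expensiveSummary(text),
  }
}

// Hand-written equivalent of:
//   SELECT id, summary FROM rows WHERE id % 2 = 0 LIMIT 2
async function* query(rows) {
  let emitted = 0
  for (const row of rows) {
    if (emitted >= 2) return // LIMIT reached: stop before touching more rows
    if ((await row.id()) % 2 !== 0) continue // WHERE reads only the cheap id cell
    // Projection: the expensive cell is evaluated only for surviving rows.
    yield { id: await row.id(), summary: await row.summary() }
    emitted += 1
  }
}

async function main() {
  const rows = Array.from({ length: 1000 }, (_, i) => makeRow(i, `conversation number ${i}`))
  const out = []
  for await (const result of query(rows)) out.push(result)
  console.log(out, `expensive calls: ${expensiveCalls}`)
  return out
}

const done = main()
```

Although the table has 1,000 rows, the expensive function runs only for the rows that pass the filter and fit within the limit.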

By delaying materialization, Squirreling executes joins over minimal projections and effectively inherits the asymptotically worst-case-optimal behavior of modern join algorithms, only materializing payload columns for rows that survive the join. [1]

Execution model

Squirreling distinguishes between streaming and buffered paths based on query characteristics:

The streaming path is used for queries without ORDER BY, GROUP BY, or aggregates. It achieves constant memory usage regardless of the size of the dataset: it reads one input row at a time, yields the matching output row, and stops reading as soon as LIMIT is satisfied.
The buffered path is used for queries with ORDER BY or GROUP BY. Late materialization still applies, and LIMIT is applied before projection: Squirreling buffers the rows first and evaluates only the columns required for the current stage of the query.
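The two paths can be contrasted with a small sketch built on async generators. It is an illustration under stated assumptions, not Squirreling's implementation: scan, streamingLimit, bufferedOrderBy, and the counters are hypothetical names, and the queries in the comments are hard-coded.

```javascript
let rowsRead = 0
let payloadReads = 0

// Hypothetical async source of 10,000 rows with lazy cells.
async function* scan() {
  for (let i = 0; i < 10000; i++) {
    rowsRead += 1
    yield {
      score: async () => (i * 37) % 101,
      text: async () => { payloadReads += 1; return `row ${i}` },
    }
  }
}

// Streaming path: SELECT text FROM t LIMIT n
async function* streamingLimit(source, limit) {
  if (limit <= 0) return
  let emitted = 0
  for await (const row of source) {
    yield { text: await row.text() }
    if (++emitted >= limit) return // leaving the loop also closes the source
  }
}

// Buffered path: SELECT text FROM t ORDER BY score DESC LIMIT n
async function* bufferedOrderBy(source, limit) {
  const buffer = []
  for await (const row of source) {
    buffer.push({ row, key: await row.score() }) // only the sort key is evaluated
  }
  buffer.sort((a, b) => b.key - a.key)
  for (const { row } of buffer.slice(0, limit)) { // LIMIT before projection
    yield { text: await row.text() }
  }
}

async function collect(generator) {
  const out = []
  for await (const item of generator) out.push(item)
  return out
}

async function main() {
  const streamed = await collect(streamingLimit(scan(), 3))
  const streamStats = { rowsRead, payloadReads }
  rowsRead = 0; payloadReads = 0
  const ordered = await collect(bufferedOrderBy(scan(), 3))
  const bufferStats = { rowsRead, payloadReads }
  console.log({ streamStats, bufferStats })
  return { streamed, ordered, streamStats, bufferStats }
}

const result = main()
```

In this sketch the streaming query reads only as many rows as the limit requires, while the ordered query must read every row but still evaluates the text cell only for the rows it returns.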

Query processing

Squirreling parses SQL into an AST and executes against the AST directly without a separate planning phase. This simple system avoids heavy planning overhead and allows execution to stay incremental and async. It fits the browser environment, where responsiveness matters more than deep cost-based optimization.

This AST-driven, async, late materialization model applies even to complex queries. Joins are executed directly from the AST and yield an async, streaming result. They can stop early when a LIMIT is applied, and they don’t force evaluation of columns. Columns are treated independently and evaluated individually and lazily, deferring expensive columns until required.

Module footprint

Squirreling is written in pure JavaScript with zero runtime dependencies. The complete library (parser, executor, and all built-in functions) compiles to ~9 KB (minified and gzipped). That’s 500x smaller than DuckDB-Wasm’s 4.5 MB binary.

This small footprint offers the following benefits:

Instant startup: Because there’s no WebAssembly compilation delay, Squirreling is ready to execute queries immediately after the JavaScript loads.
Embeddability: Squirreling can be bundled into applications with almost no impact on size.
Edge deployment: Squirreling’s small size enables deployment in serverless edge functions or service workers — environments that typically can’t host databases at all.

Squirreling: a browser-native SQL engine

These architectural choices follow logically from treating the browser as the primary execution environment for interactive exploration. The result is a browser-native SQL engine with a set of properties that shape how queries run:

Immediate, incremental query execution in the browser. Queries start producing results as soon as execution begins, with rows and cells streaming as they’re ready. Users can inspect partial output, refine queries, or stop execution without blocking the UI or waiting for full execution.
Explicit control over when expensive work runs. Columns are evaluated only when referenced, and LIMIT and ORDER BY act as cost controls as well as query clauses. Expensive operations run only when required, which allows execution to stop early to avoid unnecessary computation.
Lightweight, backend-free exploration over asynchronous data. Squirreling runs entirely client-side as an open-source library, with no backend, account setup, or loading flows. It supports interactive querying over asynchronous data sources, including cloud-native formats like Parquet, in a footprint of ~9 KB (minified and gzipped).

Squirreling is available as an open-source library here: github.com/hyparam/squirreling

References

[1] Abadi, D. J., Myers, D. S., DeWitt, D. J., & Madden, S. R. (2007). Materialization Strategies in a Column-Oriented DBMS. In Proceedings of the 23rd International Conference on Data Engineering (ICDE) (pp. 466–475). IEEE.

Why you’re missing issues in your LLM chat logs

2025-12-16T09:30:00+00:00

You’re missing critical issues in your LLM chat logs

Debugging issues like sycophancy or tone shifts in large LLM chat logs usually starts the same way. Someone flags a problem, and suddenly you’re the engineer staring at hundreds of thousands of rows trying to figure out what went sideways. Your boss wants answers, and the dataset is huge. So you pull a small sample, send it through another LLM to score for sycophancy, and check to see whether the scoring prompt actually captures what you care about. That quick loop works for the iteration phase, but it never tells you how often the issue appears or what triggers it across the full dataset.

LLM chat logs become harder to reason about at scale because the issues you care about are distributed across tens or even hundreds of thousands of lines of text. Chatbot logs consist of multi-GB text files. In this deluge of unstructured data, what matters is finding the important 1% of failures that are relevant to the challenge you’re working on. Most teams start by sampling because it’s the fastest way to inspect a few examples and test their scoring logic. But sampling only shows fragments of the behavior, and there’s no guarantee those fragments reflect the full picture.
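To make the sampling risk concrete, here is a back-of-the-envelope sketch. It assumes each sampled conversation independently shows the issue with probability rate, which is a reasonable approximation when the sample is much smaller than the dataset. The function name and the numbers are illustrative.

```javascript
// Probability that a random sample of `sampleSize` conversations
// contains no example of an issue that affects a fraction `rate` of them.
function probabilityOfMissing(rate, sampleSize) {
  return Math.pow(1 - rate, sampleSize)
}

// An issue present in 1% of conversations, for a few sample sizes:
for (const sampleSize of [50, 200, 1000]) {
  const p = probabilityOfMissing(0.01, sampleSize)
  console.log(`sample of ${sampleSize}: ${(p * 100).toFixed(1)}% chance of seeing zero examples`)
}
```

Under these assumptions, a 50-row sample misses a 1% issue entirely about 60% of the time. Even when the sample does catch an example or two, that is too few to estimate frequency or identify triggers.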

Key takeaways

LLM chat logs hide important behaviors because the issues you care about are distributed across thousands of conversations. Debugging becomes slower and less reliable when you can’t query across the full dataset.
Sampling is an essential first step, but it only surfaces fragments of the behavior and can’t reveal frequency, triggers, or context across the entire dataset.
Reasoning across the entire multi-GB dataset is becoming essential for accurate LLM behavior analysis.

Traditional debugging breaks down with LLM chat logs
The issues you know exist but can’t quantify in sampled logs
What happens when issues stay hidden in your logs
The future of LLM debugging depends on reasoning across multi-GB datasets

Traditional debugging breaks down with the scale of LLM chat logs

Traditional debugging workflows were built for structured tables, not multi-GB datasets of AI-scale text. They’re designed for predictable schemas, uniform rows, and fields you can sort, filter, or compute against. They usually rely on sampling or slice-based queries because they’re the fastest ways to inspect a few examples.

But the new world we live in consists of huge piles of unstructured text data: LLM chat logs that don’t behave that way. A single row can contain a full conversation, a long reasoning chain, or text that spans hundreds of tokens with no consistent structure. Engineers still catch individual failures like an instance of sycophancy or a strange tone shift, but locating those issues in massive logs often requires digging through isolated rows manually. And because conventional SQL or Python workflows weren’t built to analyze unstructured, conversational text across large datasets, they don’t help you map how often an issue occurs, what triggers it, or whether it’s part of a pattern that repeats across thousands of conversations.

So the question becomes: am I seeing the full picture here, or is my view skewed because traditional methods aren’t built to query massive unstructured datasets?

The issues you know exist but can’t quantify in sampled logs

Understanding the issues in your AI systems almost always comes from actual use, whether by using the system yourself (dogfooding) or by listening to reports from your users. In many cases, you might have a general sense of the issues that exist, like sycophancy, unexpected tone shifts, or two conversations that answer the same question differently. But with logs this massive, the underlying behavior often only shows up when issues are viewed across the entire dataset.

Some issues only make sense when you see how often they appear or what triggers them:

Sycophancy triggered by specific phrasing that appears sporadically in large LLM chat logs
Logical inconsistencies that surface only in longer, multi-turn threads
Conflicting answers that only become obvious when similar queries appear in different regions or contexts

These issues aren’t rare, but they’re distributed thinly across the dataset, and that distribution makes them nearly impossible to see without dataset-level querying. A sample shows you symptoms, but only inspecting the entire dataset reveals the scope, frequency, or context that define the real pattern. And you can’t query for this kind of thing with SQL. You need something that understands natural language.

What happens when issues stay hidden in your LLM logs

When you can’t query across the full dataset, you lose the ability to judge the scope or conditions of an issue. This can result in:

Issues surfacing late, often only after a user reports something unexpected.
Teams working on fixes that don’t solve the real problem because the root cause wasn’t visible.
The possibility of updates shipping with issues no one noticed.
Debugging that focuses on the model when the issue actually lives in prompts, context windows, or specific conversational patterns.

These blind spots might start out small, but they can expand quickly and slow down debugging, pushing teams into reactive rather than proactive work.

The future of LLM debugging depends on reasoning across multi-GB datasets

As datasets grow, the limits of slice-based inspection become harder to ignore. Issues in LLM chat logs emerge from patterns that spread across thousands of conversations, across different regions and varying prompts. And with multi-GB datasets now being the norm, finding issues and understanding the patterns behind them requires reasoning across the full dataset, not just the fragments that appear inside a sample.

If you work with LLM chat logs, you can try the Hyperparam app for a faster way to explore and query large datasets. It’s free while in beta.

FAQ

What makes LLM chat logs harder to analyze than other types of AI data?

LLM chat logs combine long-form text, multi-turn conversations, and inconsistent structures that don’t fit traditional structured data workflows. This makes it difficult to map issues and patterns across the full dataset using query-based or search-based methods.

Why do subtle LLM failures often appear only at a large scale?

No one knows how to make good evals. So issues tend to surface from actually deploying and using AI models. The issues that surface only become apparent over time, and are often subtle behavior issues like sycophancy. They don’t form clear patterns until you analyze the dataset broadly.

What happens when issues stay hidden in your LLM logs?

In my experience, inability to look deeply at LLM chat log data has resulted in:

Issues surfacing late, often only after a user reports something unexpected.
Teams working on fixes that don’t solve the real problem because the root cause wasn’t visible.
Shipping updates with issues no one noticed.
Debugging that focuses on the model when the issue actually lives in prompts, tools, and context windows.

These blind spots might start out small, but they can expand quickly and slow down debugging, pushing teams into reactive rather than proactive work.

How can teams reduce blind spots when working with massive unstructured logs?

Teams can reduce blind spots by shifting from spot-checking to dataset-level reasoning, using AI assistance that allows them to query, compare, and evaluate issues across the entire dataset instead of isolated rows.

Explore massive datasets with the Hyperparam AI tool

2025-11-19T14:00:00+00:00

Meet the Hyperparam AI tool for massive datasets

The first-of-its-kind interactive UI for navigating and improving LLM-scale datasets

AI runs on data. Massive amounts of it. On one side you’re training models on large amounts of text, and once deployed, these models constantly produce mountains of AI text data. The entire lifecycle of AI is massive data in and even more data out. Between April 2024 and April 2025, Google’s AI products alone went from processing roughly 9.7 trillion tokens per month to more than 480 trillion tokens per month. That’s almost a 50x increase in just one year, and the total is rapidly approaching 1 quadrillion tokens per month.

However, none of the tools that currently exist are built to work with massive, planet-sized balls of unstructured text. Notebooks, SQL engines, and data visualizers all assume something smaller and more structured than what we actually deal with today.

If we want to keep advancing with AI, we need solutions that let us explore and understand AI data at the speed at which it’s produced. And that’s why Hyperparam exists. The Hyperparam AI tool, a browser-native application built specifically for this environment, lets you explore and transform massive datasets in real time so you can understand and improve your AI datasets.

Key takeaways

AI-scale datasets grow faster than traditional tools can handle, leaving teams unable to understand their own data.
The Hyperparam AI tool pairs a high-speed browser engine with an army of AI agents and natural language analysis to make AI-scale data workable.
You can explore and refine massive unstructured datasets in real time without waiting.
One person can triage issues like sycophancy or hallucinations across tens of thousands of rows inside a single browser tab.

Teams are drowning in overwhelming amounts of unstructured data
AI is only useful when the UI can keep up
How the Hyperparam AI agents accelerate real data work
Hyperparam keeps the human in the loop

Teams are drowning in overwhelming amounts of unstructured data

Every company building with AI now sits on more text than any team can realistically examine. Chat logs, model outputs, product interactions, and support conversations all contain valuable intelligence on how a company’s AI is performing.

But AI data accumulates faster than humans can review or understand. In Q3 2025 alone, Azure’s AI services processed over 100 trillion tokens. Even small teams wind up with tens of thousands of rows overnight, and the rate of growth only accelerates as AI proliferates across more companies and industries.

Traditional tools to help businesses understand their data often rely heavily on the data being structured and accessed via SQL or other structured query languages. But the “signals” in AI data — e.g. did the model hallucinate, did the model ask for clarification, did the user get frustrated — exist fuzzily in text, not in an easy-to-access column. The information to learn from is in the data, but there is no way with traditional tools to access it for any kind of dataset analysis or debugging.

With the pace of AI, that gap only compounds. The more data you produce, the less equipped you are to do anything meaningful with it. The result is a backlog of unknowns that keeps growing while your ability to understand it stays flat.

AI is only useful when the UI can keep up

Ironically, our hypothesis is that AI can help you understand your AI data, but only if the interface makes that possible. AI models can fuzzily extract information, transform, label, and filter for you, but none of that matters when the surrounding tools choke the moment you hit real-world dataset volumes. For example, ChatGPT can help you understand if your AI is hallucinating, but you can’t load more than a few dozen chat logs at a time. Traditional data viewers, even augmented with AI, can’t display more than a few thousand rows instantly. Custom notebooks could be built to use AI, but would require scalable infrastructure to run over the entire dataset.

The Hyperparam AI tool solves this problem. It’s the first tool that makes AI usable at dataset scale by pairing two things that have never existed side by side:

Browser-native speed that streams and renders massive unstructured datasets instantly
A host of AI agents that act like a Swiss Army knife for your data, enabling you to score, label, categorize, and filter rows using natural language

Because the interface is fast enough to keep up, the AI insights become actionable. You can generate columns, score for sentiment, and filter results in real time, all without waiting or guessing. In short, everything clicks into place: the combination of a high-speed UI and Hyperparam’s AI agents gives us the first tool designed to explore and understand AI-scale data and support real LLM dataset debugging.

How the Hyperparam AI agents accelerate real data work

Once the interface is fast enough to keep up with the data, the AI layer turns into a genuine workflow upgrade. The browser engine handles the scale, the model does the hard work of reading through the thousands of rows of text data, and you stay in charge of the decisions. The model scores every row, creates new columns, surfaces issues, and points out strange behavior you might not notice on your own. You explore and validate the results in real time because nothing stalls or blocks you.

Take something as simple as triaging chatbot sycophancy and releasing a new prompt to correct sycophantic behavior. In the Hyperparam chat, you can ask Hyperparam to score every conversation for sycophancy, sort the entire dataset, filter to the outliers, and transform sycophantic results into desired behaviors for evaluations. Then you can try out different prompts, check the responses, and iterate until you have a prompt performing well on your corrected evaluation. You can even export this evaluation to use it later. And you can do this all singlehandedly inside one browser tab.

The Hyperparam AI tool keeps the human in the loop

Large language models can help score conversations or pinpoint odd behavior, but they can’t work through AI-scale datasets on their own. Hyperparam overcomes that limitation by pairing a high-speed browser engine with an army of AI agents that support the parts of the workflow where natural language actually adds value. You move through the data instantly, and the model helps you understand what you’re seeing without ever taking over the decisions.

This setup keeps the judgment where it belongs: with you, the human expert. We believe strongly that human-in-the-loop is the only way to work responsibly with AI. You decide how far to trust a score or when a prompt needs refinement. The UI makes the dataset feel lightweight and the AI does the heavy lifting, but every decision runs through your expert eye.

If you work with AI data, try the Hyperparam AI tool for a faster way to inspect, debug, and refine massive datasets. It’s free while it’s in beta.

Lessons from Hyperparam’s Year of Open-Source Data Transformation

2025-11-12T08:00:00+00:00

I sat down with my former cofounder, Kenny Daniel, to talk about his new startup Hyperparam

Hyperparam is an AI-powered data transformation tool that lets users and an army of AI agents look at, transform, score, and filter massive datasets instantly. As Kenny puts it, “It’s like a Swiss Army knife for your data.” It’s built on an ecosystem of open source data transformation libraries that power its paid app, which delivers the full Hyperparam experience.

Unlike most products that start with an enterprise focus and chase a single proof of concept, Kenny, as I’ve always known him to do, chose his own path. His thesis: Starting from open source is a better, faster way to build a product. In this interview, he shares his take on the open source community, product development in the new world of AI, and how Hyperparam took an intentional approach to open and closed source development.

Key Takeaways:

Open source development provides faster, more authentic product feedback than traditional enterprise development.
Hugging Face’s adoption of Hyperparam’s libraries (HyLlama and Hyparquet) validated the browser-native approach and its value for large-scale AI workflows.
Minimalism in engineering, or building without dependencies, creates faster, more reliable software.
Building tools you want to use yourself often leads to creating products others didn’t realize they needed.

Why Hyperparam Went Open Source
Hyperparam vs. Data Visualization Tools
How Hugging Face Validated the Browser-Native Approach
What’s Open Source and What’s Product at Hyperparam
AI Workflows That Make Large-Scale Data Transformation Faster
Why Minimalism Drives Hyperparam’s Engineering Philosophy
Advice for Developers Exploring Open Source Projects

Why Hyperparam Went Open Source

At our previous company, Algorithmia, we didn’t go open source. What made you decide to do it differently this time?

I built Hyperparam as the data transformation tool I wished I had because there wasn’t one that met my criteria. The first version of Hyperparam was a simple browser-based data viewer for Parquet files with some simple data transformation tools. The majority of large datasets for AI training and monitoring are Parquet files, and I just wanted to look at the data and play around with it. But I didn’t think people would pay for a Parquet viewer, so I doubted I’d be shooting myself in the foot by giving it away. If anything, I was going to get all the benefits of usage in the community. So I just put it out there without promoting it.

One of the most compelling arguments for open source is that I think it’s fundamentally a better way to build a product and get feedback. If you start building a product straight for the enterprise, it’s a recipe for disaster. You start asking the wrong questions, like, “How do we fit into their workflows?” rather than asking, “How do we build a product that would see organic adoption?” With open source, people use your software if it’s useful and if it’s not, they don’t. That’s an incredibly valuable signal.

Hyperparam vs. Data Visualization Tools

Though you can use Hyperparam to view large datasets, you describe it as a data transformation tool. What makes Hyperparam different from data visualizers?

Hyperparam lets you instantly view, explore, and transform millions of rows of data, all through a user-friendly, chat-based UI built for scale and usability. So it’s much more than a data visualizer; it’s a data transformation tool.

When I went looking for a data tool, I just wanted to open one dataset. Jupyter was frustratingly slow. ChatGPT, VS Code, Copilot, and other assistants weren’t designed for interacting with massive datasets. And I quickly realized there wasn’t a single tool out there that let me look at any scaled dataset.

That led me on this path, and the question became: What does the interface look like for using AI across data? The answer is Hyperparam. It delivers the power of instant data transformation with the ease and nuance of natural language querying.

How Hugging Face Validated Hyperparam’s Browser-Native Approach

Hugging Face is just one of the organizations that started using your libraries HyLlama and Hyparquet. What was the significance of that moment?

Hugging Face’s adoption of my libraries was a huge validation that there was something to my idea of moving more AI workflows to the browser. It was the strongest market signal I’d had up to that point, and it made me start thinking about what else we could build with these components.

For context, Hugging Face is the world’s repository for open models and open data. They use multi-gigabyte files in llama.cpp format, and they wanted to let users simply get the metadata instead of downloading the entire file. HyLlama does exactly that: it pulls the metadata and provides the info the user needs, saving bandwidth, time, and disk space.

After Hugging Face integrated HyLlama into their website, they started looking into Hyparquet. When they realized it offered many of the same benefits for data, they started integrating it as well. And it was a great honor that, because they support OSS in general and have adopted our libraries, they gave us a substantial open-source grant.

What’s Open Source and What’s Product at Hyperparam

You’re launching a paid version of Hyperparam soon. What’s open source and what’s part of the product?

I’ve already open sourced the parts of Hyperparam that have shared value to the community, and I’ve kept the full product experience (including the AI workflows) within the paid app.

Hyparquet, HighTable, and HyLlama are some of the libraries we’ve released. They are building blocks that help others explore data in the browser and also power what we’re building internally. My belief is that connectors, frontend components, writers, readers, and other “glue” components should be open source. They’re globally useful beyond what I’m building and should be shared by the community. But on their own, they’re not the Hyperparam product.

Beyond just thinking about what is useful for the community, there are a few other upsides of open sourcing components. For one, I get to control how the component is optimized and designed, and I can make sure it’s designed to work well with Hyperparam. Secondly, and probably most importantly, there’s the community of developers invested in these components. Approximately a dozen people contributed code to Hyparquet and HighTable, and even more filed bugs that I subsequently fixed. Giving away components doesn’t diminish value; it amplifies it through feedback, goodwill, and contributions.

Now, when thinking about the paid product, any AI component and the core user experience are proprietary. I care deeply how users flow through my product, and I need to own that experience because I don’t think anyone else can build it correctly. That’s a bit of a cocky statement, but my team is a select group of people obsessed with the overall data experience and how the AI should work.

AI Workflows That Make Large-Scale Data Transformation Faster

What kind of AI workflows can you do with Hyperparam?

Hyperparam is a general purpose data transformation tool that enables you to do multiple things with your data, so it’s easiest to give an example.

Let’s say you’re a company that’s deploying a chatbot to your users, and a user files a ticket. Your support team needs to dive into the data to understand what happened, whether it’s sycophancy or some other issue. With Hyperparam, you can apply an LLM-generated score to every row in your dataset, look at the values, filter out the bad ones, transform them into something better, export the results, and just continue with your workflow.

That’s just one example of what Hyperparam can do. In addition to applying AI scores and sorting, filtering, and searching based on those scores, you can ask natural-language questions about your data, for example: “Rate every chatbot conversation for sycophancy,” “Did the user seem satisfied?” or “Was a conclusion reached?” You can categorize, tag, and explore your dataset in ways that were never possible before.

You can also run experiments: import historical data, tweak prompts, compare models, and see how the outputs change. It’s a deep research workflow that lets one person do what used to take an entire team.

Why Minimalism Drives Hyperparam’s Engineering Philosophy

What’s your core engineering philosophy and how did open source support that?

One of my fundamental engineering principles is to take no dependencies. I feel very strongly against building a huge stack of dependent software, which is something you see a lot of in JavaScript. Because Hyperparam started as a passion project and it’s open source, I could optimize purely for the function that I cared about. That’s not necessarily how it would have gone at a company.

That simplification reminds me of SpaceX’s Raptor engine. Each iteration of the Raptor keeps getting simpler… yet more powerful. I wanted to do the same for software. It’s an aesthetic choice, but it also influences the architecture and engineering. With an obsession over engineering and product, you can build minimal software, and that’s what Hyperparam is. I built it from the ground up depending on nothing else. That’s why it’s as small, light, and fast as it is.

Advice for Developers Exploring Open Source Projects

What’s your advice for developers starting out with open source?

Build the product you want to use yourself. If you have to rely solely on other people to tell you whether what you are building is useful, your iteration cycles will be slow and painful. This advice might run contrary to conventional startup wisdom, which says you should assume you know nothing, talk to a hundred customers, and then build to solve their problem. That’s viable, but it’s not the only way to create something meaningful.

To build a certain kind of product, you need more vision and aesthetic opinion. In open source, you’ll see these shining monuments to technology, and why? Because someone cared enough to make them both functional and beautiful. Hyperparam is the data transformation tool I wished existed. I’m building for an audience of one: myself. When you build something you want to use yourself, you often end up building something others didn’t realize they needed.

Simulated Personas, Real Insights: Using Snowglobe and Hyperparam to Stress-Test Conversational AI

2025-10-15T07:10:00+00:00

Testing our Conversations Before we Go Live

Hyperparam is building an AI-assisted tool for working with large text datasets. The product includes a viewer for Parquet (as well as CSV and JSONL) datasets and a data assistant chat. Before launching the product, we wanted to anticipate problems that might crop up.

The fastest way there was to generate simulated data with realistic, diverse, edge-case conversations that exposed how our data agent behaved across user types and intents. Once we could generate this data, we had a quick way to interrogate the simulated dataset in order to slice, flag, and transform the conversations into a usable dataset for follow-on fine-tuning (or continued analysis).

Our setup: Snowglobe for simulation + Hyperparam for exploration

Plan:

Simulate 10 realistic personas and conversations using Snowglobe.
Explore and transform that dataset interactively with Hyperparam.
Isolate conversations where the agent recommended using Python versus general analytical queries.
Prepare the subset for evaluation or fine-tuning.

Step 1: Generate Synthetic Conversations with Snowglobe

Snowglobe is a simulation engine for conversational AI teams. You define who your users are, what they want, and how they behave. Snowglobe auto-generates thousands of realistic interactions with your model or API endpoint. Think of it as a load test for reasoning or dialogue, not just latency. Snowglobe uses the information from an application description to create data that’s useful for your specific app. For this blog post, we created an application with the system prompt from Hyperparam. It’s long, but the short version looks like: “This chatbot allows users to chat with their data, pulling out insights and statistics. The data looks like this: ....”

Define Your Personas

In our example, we’re simulating users of Hyperparam, a data exploration tool. We want personas that mirror the real user base: data engineers, data analysts, and tinkerers with different levels of skill and temperament. To create personas like these, we can start with a “Simulation prompt”. For example, we can enter a simulation prompt like “Users are data engineers, scientists, and analysts ask questions about their data.”

This prompt results in personas like the following. They vary in objective, tone, and style:

personas:
  - name: "Hands-On Data Explorer"
    description: Loves examples, learns by doing.
  - name: "Skeptical Analyst"
    description: Double-checks every step, asks 'why'.
  - name: "Product Engineer"
    description: Wants quick, applied answers.
  - name: "Aha Moment Seeker"
    description: Prefers conceptual explanations.
  ...

Configure Conversation Scenarios

This provides us with conversation templates tied to our product use cases:

scenarios:
  - topic: "data analysis"
    prompt: "How can I explore this dataset for outliers?"
  - topic: "python vs sql"
    prompt: "Should I use Python or write an analytic query?"

Snowglobe will orchestrate multi-turn dialogues between each persona and our model endpoint, generating text logs, metadata, and structured output (JSONL).

When the run completes, we have a dataset with 10 personas × 200 conversations each — 2000 total dialogues, complete with role labels, timestamps, and message-level metadata.
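As a quick sanity check on a run like this, one could count conversations per persona before opening the file in a viewer. The sketch below is only an illustration, not Snowglobe's documented export schema: the `persona`, `topic`, and `assistant_message` field names mirror the columns mentioned later in this post, and the three sample records are made up.

```python
import json
from collections import Counter

# Hypothetical JSONL export: one JSON object per line, one line per conversation.
# Field names and values are assumptions for illustration only.
SAMPLE_JSONL = """\
{"persona": "Skeptical Analyst", "topic": "data analysis", "assistant_message": "Filter on the score column."}
{"persona": "Skeptical Analyst", "topic": "python vs sql", "assistant_message": "You could run a Python script."}
{"persona": "Product Engineer", "topic": "data analysis", "assistant_message": "Sort by length to spot outliers."}
"""


def count_by_persona(jsonl_text: str) -> Counter:
    """Count conversations per persona, skipping blank lines."""
    counts = Counter()
    for line in jsonl_text.splitlines():
        if line.strip():
            counts[json.loads(line)["persona"]] += 1
    return counts


counts = count_by_persona(SAMPLE_JSONL)
for persona, n in counts.most_common():
    print(f"{persona}: {n}")
```

On the full run described above, each of the 10 personas would be expected to report 200 conversations, for 2000 in total.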

Step 2: Explore the Dataset in Hyperparam

Hyperparam is an interactive, browser-based dataset explorer purpose-built for ML workflows. It opens local or remote Parquet, JSONL, CSV files instantly and lets you filter, transform, and visualize data directly: no heavy Jupyter notebooks required.

Drop file directly in the explorer

Hyperparam renders the dataset in an interactive table. You can scroll through conversations, inspect columns like persona, topic, or assistant_message, and even preview message trees. Conversation view allows for easy visual exploration.

One of the things we noticed right off the bat was that sometimes a user would ask a question and the model would suggest leaving the product and using a Python script instead. This is not what we want. Sometimes, though, the user asks a question that genuinely needs to be done off-platform. What we’d really like to find are the conversations that could have been solved on-platform, but where the model recommended Python instead.

Step 3: Flag Conversations with Hyperparam

Now we want to detect when the assistant recommended Python code vs analytic query language in its replies.

In Hyperparam, you can do this with the Hyperparam chat: a prompt-based operation that adds a new computed column.

Prompt: “Add columns: flag when a model suggests the user run python themselves, or makes a general analytic-style query instead of a transformation or filtering like we expect.”

The data agent runs across all sample rows, and decides to create two new boolean columns:

suggested_user_python: true/false
is_analytics_query: true/false

This single operation converts raw chat logs into labeled data.

Step 4: Filter, Export, and Iterate

After inspecting the two new columns, we wanted to extract the samples where the agent suggested using Python but the request was not a general analytics query. This is our proxy for things our data agent should be able to do but for some reason did not.

By visually inspecting the data we notice:

“Hands-On Data Explorer” and “Skeptical Analyst” personas trigger Python examples more frequently.
“Pragmatic Insight Seekers” get concise analytic answers.
Conversations recommending Python also tend to have longer message chains (higher cognitive load).

The dataset of [suggested_user_python=true && is_analytics_query=false] is good for further fine-tuning our data agent, because it collects examples where the agent should have offered an on-platform suggestion but did not.
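Outside the UI, the same filter is a one-line predicate over the two boolean columns. The sketch below assumes the labeled rows have been exported as a list of dictionaries; the sample rows and the helper name `python_but_not_analytics` are hypothetical.

```python
# Hypothetical exported rows carrying the two boolean columns added by the data agent.
rows = [
    {"persona": "Hands-On Data Explorer", "suggested_user_python": True, "is_analytics_query": False},
    {"persona": "Skeptical Analyst", "suggested_user_python": True, "is_analytics_query": True},
    {"persona": "Product Engineer", "suggested_user_python": False, "is_analytics_query": False},
]


def python_but_not_analytics(rows):
    """Keep rows where Python was suggested but the request was not a general analytics query."""
    return [
        row for row in rows
        if row["suggested_user_python"] and not row["is_analytics_query"]
    ]


subset = python_but_not_analytics(rows)
print(len(subset), "of", len(rows), "conversations selected")
```

Only the first sample row passes the filter: the second is a general analytics query, and the third never suggested Python.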

Step 5: Why This Workflow Matters

The combination of Snowglobe + Hyperparam closes a crucial loop for conversational AI teams:

Stage	Tool	Outcome
Simulation	Snowglobe	Synthetic but realistic data across personas
Exploration	Hyperparam	Fast, visual filtering and labeling
Transformation	Hyperparam	LLM-assisted column creation
Iteration	Both	Repeat, evaluate, fine-tune

This pipeline lets teams:

Build evaluation datasets before collecting real user data.
Debug reasoning patterns in synthetic interactions.
Scale up diverse conversational contexts without manual labeling.
Quickly explore and interact with the data sets.

Closing Thoughts

Simulation is the new data collection.

When you can generate, label, and filter conversation data quickly, interactively and with precision, you gain the power to test your agent’s reasoning loops and UX outcomes before they ever reach a customer.

Snowglobe gives you the synthetic user base. Hyperparam gives you the interactive microscope.

Next steps:

Try snowglobe.so to generate your own synthetic conversations.
Explore the data instantly with hyperparam.app.

Hyparquet: The Quest for Instant Data

2025-07-24T14:00:00+00:00

I just wanted to build a javascript code model.

Following the common adage that “data quality determines model quality”, I did what every AI engineer does and tried to look at some training data hosted on HuggingFace. I did not care how, I just needed to see some data and interact with it - search around rows, sort, and otherwise get a feel for the data quality.

This is where my goal started to go off the rails… Most modern AI datasets are 10GB or more and are in parquet format – we’ll talk more about this later – which means you need to parse and open the file. No simple less would work. The most common tools to read parquet for easy viewing are pandas/polars and DuckDB. With some ChatGPT help, I was running the command to load the first 5 rows of data. As shown below, I sat there waiting… and waiting.

Modern data viewer tools take anywhere from 5 sec (DuckDB) to 57 sec (Pandas) to load just 10 rows of data. The HCI community largely agrees that the ideal time-to-first-interactivity is 500 ms [Liu, Heer 2014]. Why should that not hold for data? Why is it acceptable for data to take 20x longer to load than a webpage?

The rest of this blog describes my multi-month journey to hyper-optimize time-to-first-data for parquet files. I am still on this journey but, along the way, released Hyparquet, the most conformant browser-based parquet file reader in existence. It’s open source and, most importantly, can load my 10 rows of data in 150 ms.

Legacy Server Architecture

Let’s take a step back and first understand where the runtime is going in existing data viewer tools. Take an oversimplified version of a simple pandas-backed data viewer and pretend it’s hosted in AWS and reading a parquet file from S3. Before the data even gets loaded, the user’s request has to first hit CloudFront, go to an ELB, get redirected to the Node.js frontend server of the data hosting service, go through another ELB, and hit the backend server hosting the data, which finally pings S3 and downloads the data. In total this takes about 40 sec of latency just to get the request and download the data. The data then gets parsed in the backend server (taking about 1 sec), and finally makes its way back to the user’s browser.

This diagram may feel complicated, but it’s drastically oversimplified compared to most real-world architectures, which also have auth, logging, message brokers, and other systems that each add latency.

When optimizing this pipeline, most engineers only have control of the backend and spend time optimizing parsing. This can speed up the time-to-first-data a lot, but it’s not enough for me. I wanted to completely remove the latency before parsing even started.

Browser-First Architecture

Fundamentally, whenever you have a backend, you need layers of tooling on top. And backend servers are generally good ideas: they manage application state, can handle compute-heavy processes, and decouple the viewer from the data models. But I don’t care about any of that. A data viewer isn’t a feature in my application, it is the entire application. So what if I just remove everything backend related and point the browser directly at S3? (Well, you still need CloudFront to optimize the SSL handshake.)

You would be left with just the browser talking straight to cloud storage:

With this architecture, you immediately save latency because you skip the ELB and the backend. As an added bonus, it’s cheaper because you don’t pay cloud costs to host a backend server, and it’s far simpler for developers to maintain.

This simplified architecture does leave two issues: (a) where does user state live so if they, for example, refresh a page, they don’t lose their location in the viewer and (b) you still need to parse a parquet file.

It turns out, if you use browser cookies and local storage, you can manage user state all in the browser. Sure, if the user clears their browsing history, they’re in trouble, but I’m okay with that. The parsing…well, I was just going to have to use a javascript parser instead. Or, as it turns out, build my own.

Parquet in the Browser

At the beginning of 2024, when I started this quest, there were 3 libraries that could load parquet files from cloud storage directly into the browser: ParquetJS, ParquetWASM, and DuckDB WASM. And I had a goal to parse parquet files in under 500ms. As shown below, none of these were fast and ParquetJS wasn’t even supported anymore.

Looking at this waterfall chart we can see that all libraries take at least 600 ms to get a request and parse a parquet file. But, they also show multiple opportunities for optimization. Let’s summarize some of the inefficiencies of the duckdb-wasm library - we will go into more details below.

Loading the WASM engine (extra > 1 sec)
Multiple requests to get metadata when it could be done in one (extra 200 ms)
Sequential read requests when it could be in parallel
Limited optimizations based on metadata
Synchronous fetches versus asynchronous
Inefficient compression algorithms

If I could optimize these pieces, I could achieve my 500ms time-to-first-data. And I could do it in Kenny style: 100% javascript and no dependencies (because who doesn’t want to rebuild everything from scratch). Time to introduce Hyparquet.

Parquet from Scratch

Re-writing a parquet parser from scratch, how hard can it be?? It took about a week to be able to parse my first parquet file, which I thought was pretty good. The problem is that I kept finding more parquet files that I couldn’t open. Parquet is a sprawling format, with many features:

8 physical types (bool, int, float, etc)
22 converted types
17 logical types
8 compression codecs (snappy, gzip, brotli, etc)
2 major versions

It took 6 months to parse ALL the parquet files.

Gotta Go Fast

Javascript is not exactly known as a high performance language. I think this reputation is undeserved. I’m not saying it’s going to beat rust in a benchmark. But with careful engineering and tactical use of modern browser apis, we can make decoding parquet in the browser surprisingly performant.

Let’s dive deeper into some of the mistakes made by other parquet libraries, and how we can make it better in the browser:

Engine Size – DuckDB-WASM requires downloading and compiling several megabytes of WebAssembly, incurring seconds of startup delay before queries can run. That’s seconds where your user sees… nothing.

Could we get a performance advantage from starting with less? Every kilobyte of WASM adds startup latency. Hyparquet’s core engine is only 10KB (minified, gzipped), dramatically reducing startup latency, and is substantially easier to bundle. By narrowing the focus strictly to Parquet parsing with pushdown filters, we achieve near-instant initialization.

We also save an entire round-trip loading the wasm blob.

Smart Metadata Fetching – In parquet, the metadata is stored in the footer of the file. So in order to fetch the metadata, you might naively make at least three requests:

This is what parquet-wasm and parquetjs do:

1. HEAD request to get file size
2. Fetch the last 8 bytes to get the metadata_length field
3. Fetch the metadata

But with hyparquet we actually do a little better: we can skip the second step. Rather than make an 8 byte round-trip fetch request, we optimistically fetch 512kb of the footer of the file. 99% of the time that includes the entire metadata. In the rare cases where this initial request fails to include all the metadata, we use the metadata length in the footer and make another request for just the remainder of the missing metadata. On http over the internet, an 8 byte fetch takes almost the same amount of time as a 512 kb request.

Parallelization – Traditional databases fetch data sequentially. But browsers can handle 6+ concurrent HTTP connections. Hyparquet leverages parallel HTTP range requests, retrieving only needed portions of the Parquet file (specific columns or row groups) in parallel. This overlap of I/O helps reduce wall-clock latency for data access.

Duckdb uses a different (and much worse) algorithm: it does a sequence of exponentially increasing request sizes, all in series (not parallel). This is fine when you’re reading from local disk but is pathological when loading over the network.

Use the Metadata – Hyparquet employs predicate pushdown by analyzing Parquet metadata (schema, column statistics). This allows it to identify and skip irrelevant row groups entirely, reducing network load and improving speed. This isn’t new; every modern columnar database does this. But when network latency is your enemy, skipping even one unnecessary 25MB column chunk can save seconds.
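The row-group skipping idea can be sketched in a few lines. This is not Hyparquet's API: the `RowGroupStats` type, the `row_groups_to_fetch` helper, and the statistics are all hypothetical, and the sketch assumes the per-row-group min/max values for the filtered column have already been read from the file metadata.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    """Hypothetical per-row-group statistics for one column."""
    index: int
    min_value: int
    max_value: int
    byte_size: int


def row_groups_to_fetch(stats, lo, hi):
    """Keep only row groups whose [min, max] range overlaps the query range [lo, hi]."""
    return [rg for rg in stats if rg.max_value >= lo and rg.min_value <= hi]


# Illustrative metadata: three row groups of 25 MB each.
stats = [
    RowGroupStats(index=0, min_value=0, max_value=999, byte_size=25_000_000),
    RowGroupStats(index=1, min_value=1000, max_value=1999, byte_size=25_000_000),
    RowGroupStats(index=2, min_value=2000, max_value=2999, byte_size=25_000_000),
]

needed = row_groups_to_fetch(stats, lo=1500, hi=1600)
skipped_bytes = sum(rg.byte_size for rg in stats) - sum(rg.byte_size for rg in needed)
print([rg.index for rg in needed], skipped_bytes)  # [1] 50000000
```

With these made-up statistics, a filter for values between 1500 and 1600 touches only row group 1, so two of the three chunks are never requested over the network.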

It’s worth mentioning that by default parquetjs does NOT do this. In fact, neither does python! The default pyarrow and pandas parquet readers WILL READ THE ENTIRE FILE. I had to tweak parquetjs to make it load partial data at all. [2]

Async Everything – JavaScript might be the world’s most async-friendly language. We utilize this to return whatever data is ready first. Parquet is a column-oriented format, so if rows are being emitted from a cursor object, you’re making users wait for ALL the columns to load before returning any data to the user. Hyparquet can return data asynchronously whenever it’s ready (but provides helpers for row-oriented data if that’s needed).

Compression That Doesn’t Suck – Standard JavaScript Snappy decompression was too slow, so we implemented HySnappy, a WebAssembly-based decoder that’s 40% faster yet adds minimal size (<4KB). This ensures decompression never becomes the performance bottleneck.

The problem with WASM is that it normally adds an extra round-trip fetch request for the wasm file. We improved this using a little-known browser trick: you can synchronously load wasm if and only if it is less than 4kb! So we wrote our own snappy decompression library, with no dependencies, not even memcpy, and definitely no emscripten. This makes hysnappy super easy to bundle, deploy, and load.
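To make the Smart Metadata Fetching step from the list above more concrete, here is a minimal sketch of the optimistic footer fetch. It is written in Python for illustration and is not Hyparquet's implementation. It relies on the Parquet footer layout (metadata bytes, then a 4-byte little-endian metadata length, then the magic bytes `PAR1`). The `FakeRemoteFile` class is a hypothetical in-memory stand-in for a server that supports HEAD and range requests, and the metadata is treated as opaque bytes rather than decoded.

```python
import struct

MAGIC = b"PAR1"
INITIAL_FETCH = 512 * 1024  # optimistic footer fetch size, per the text above


class FakeRemoteFile:
    """Hypothetical in-memory stand-in for a server that supports range requests."""

    def __init__(self, data: bytes):
        self.data = data
        self.requests = 0  # number of simulated round trips

    def size(self) -> int:
        self.requests += 1  # stands in for the HEAD request
        return len(self.data)

    def read_range(self, start: int, end: int) -> bytes:
        self.requests += 1  # stands in for one ranged GET
        return self.data[start:end]


def fetch_metadata(remote, initial_fetch: int = INITIAL_FETCH) -> bytes:
    """Return the raw metadata bytes using as few round trips as possible."""
    file_size = remote.size()
    tail_start = max(0, file_size - initial_fetch)
    tail = remote.read_range(tail_start, file_size)
    if len(tail) < 8 or tail[-4:] != MAGIC:
        raise ValueError("not a Parquet file: missing trailing magic bytes")
    # The 4 bytes before the magic hold the metadata length (little-endian).
    (metadata_length,) = struct.unpack("<I", tail[-8:-4])
    metadata_start = file_size - 8 - metadata_length
    if metadata_start < 0:
        raise ValueError("metadata length is larger than the file")
    if metadata_start < tail_start:
        # Rare case: request only the part of the metadata that is still missing.
        tail = remote.read_range(metadata_start, tail_start) + tail
        tail_start = metadata_start
    offset = metadata_start - tail_start
    return tail[offset:offset + metadata_length]


def make_fake_parquet(metadata: bytes, body_size: int) -> bytes:
    """Build a byte string with a Parquet-shaped footer for the demo."""
    return MAGIC + bytes(body_size) + metadata + struct.pack("<I", len(metadata)) + MAGIC


small = FakeRemoteFile(make_fake_parquet(b"m" * 1_000, body_size=2_000_000))
large = FakeRemoteFile(make_fake_parquet(b"m" * 600_000, body_size=2_000_000))
small_meta = fetch_metadata(small)
large_meta = fetch_metadata(large)
print(small.requests, large.requests)  # 2 3
```

In the common case the metadata fits inside the first 512 KB tail, so the sketch needs two round trips (the size lookup and one ranged read) instead of three. Only the oversized 600,000-byte metadata triggers the extra request for the missing remainder.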

The Result: Sub-Second Magic

This obsession with latency has real-world implications. Where DuckDB-Wasm might take several seconds just to initialize its query engine, Hyparquet can produce visible results on multi-gigabyte datasets in under a second.

The Philosophy: Bringing Compute to Data (In Your Browser)

Hyparquet demonstrates a shift toward treating the browser as a fully capable query processor operating directly on data stored in cloud storage, suggesting a new paradigm for database research.

This inverts traditional assumptions:

Thin server, client doing the heavy lifting.
Round trips matter more than total bandwidth.
Time-to-first-byte is the new query optimization target.

Hyparquet’s extreme minimalism sets a new benchmark for browser-native analytics. But what are the broader implications?

Hyparquet enables ML researchers and data analysts to interactively explore large datasets directly in the browser, eliminating the need for traditional backend setup or data infrastructure management.

Hyparquet also allows data analysis over a much-simplified infrastructure: By removing backend databases, there’s less infrastructure to maintain, simpler developer experience, and faster user experience.

Where did this lead? Hyperparam.

If you remember where this started, I wanted users to see the first few rows of a large AI parquet dataset in under a second. But I started looking at data because I wanted to train a model. I obsessed over the first step of the AI data curation pipeline because it was painful. But I didn’t stop there. I founded Hyperparam, a company built on this paradigm of hyper optimization of browser native applications for data curation. Hyperparam’s goal is for users to build their training, evaluation, or RAG datasets with the seamless interactivity of the browser they are accustomed to for non data-intensive tasks. Our motto - “javascript can do it too”

Try It Yourself

Want to see Hyparquet in action? Head to https://hyperparam.app, drop any Parquet file or url, and watch your data appear instantly.

Hyperparam: How Browser-Based Tools Will Re-Shape AI

2025-01-21T05:40:00+00:00

What is the key to building the most advanced AI models? Data quality.

Everyone wants better AI models: smarter, cheaper, and with style. How does one achieve that? Whether you’re a mega-scale AI company, or a small enterprise team, the only real lever for making better models is to construct a better training set.

How do you build a better training set? This is a question that has always been one of the most challenging, and labor-intensive parts of the data science process.

Why is data cleaning and data understanding so time-consuming? Because current tools often miss three key capabilities. A good tool should 1) enable very fast free-form data exploration by the user, which is key to finding insights in your data, 2) use AI models to assist with looking at huge volumes of data that would be impractical for a person, and 3) be simple to run locally in the browser and not depend on complex services and data pipelines. Instead, most tools are built around Python, arguably the worst language for creating modern, compelling UIs and tools. This might seem controversial, but think about the most common interface for Python: Jupyter notebooks. Notebooks are great for iteration and experimentation, but they are extremely weak when it comes to interactive data exploration. If you’ve ever tried to open a parquet file (the most common format for modern ML datasets) in a notebook it looks like this:

This table is practically useless. You can’t paginate to the next set of rows. You can’t even see the entire data in a cell (which in this case is an entire github source file). So how are you supposed to get an intuitive sense of your data if you can’t even see it?

Can we do better? If you want to build a highly performant user interface, there is only one choice: JavaScript. The browser is the only place for building modern UIs.

The problem is that ML datasets are massive (often multiple gigabytes of compressed text data), so it’s not obvious if it’s even possible to work with large scale datasets in the browser. However, by using modern data formats like Apache Parquet, and clever frontend engineering, it is in fact possible to work with massive datasets directly in the browser.

Aside: Apache Parquet files are a column-oriented data structure that contains a built-in index. This allows tools like Hadoop and DuckDB to efficiently query parquet datasets without having to retrieve all the data. Furthermore, it allows doing these queries without a server, simply by putting the parquet files in a storage service like S3. What if you could do this same trick in the browser, and pull in just the data needed to render the current view? Hello Hyparquet.

Hyparquet is a new JavaScript parquet parser which can efficiently query against parquet files stored in the cloud. This enables the creation of a new type of client-side only parquet data viewer which is significantly faster than anything that could be done with a server.

The goal here is to get data engineers to look at their data 👀 Anyone who has worked with data for a model before knows that looking at your data is the key to understanding the domain you’re trying to model, and it is virtually impossible to do good data science without looking at your data. Looking at your data is the easiest way to find data and model issues, and is a constant source of ideas of how to improve them.

This is one of the core workflows in data science: build a model, see what data was correctly or incorrectly modeled, fix the data and/or the model, and repeat. This is a repeatable, teachable process! And if it can be taught to a human data scientist, why can’t it be taught to a model to assist?

Can you use a model to assist with dataset curation? The challenges are two-fold: 1) How do you leverage human expertise to express what you want from the model? 2) These datasets are huge, so running a model across all the data is expensive.

You need the human in the loop to express their intent for the data. There is not just one definition of “good” versus “bad” data. What matters is the question “is this data useful for the model I’m trying to build?” This is where the UI comes in as a way to allow the user to look at the data, and use the data to express their intent.

As for the cost, we are entering a new era of LLMs where for the first time it is affordable to do dataset-scale inference in which you run an entire dataset through a model to help filter and label data. In 2023 it cost $5,000,000 USD to process 1 trillion input tokens with a sota model (gpt-4-turbo). In 2024 it cost $75,000 USD to process 1 trillion input tokens with a similar model (gpt-4o-mini). This trend will continue to make dataset-scale inference accessible to model builders. Model-based quality filtering has already been used by Meta to filter the training set for llama3 using labels generated by llama2 [1].
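To put those two price points in per-dataset terms, the arithmetic can be sketched as follows. The costs per trillion tokens are the figures quoted above; the 10-billion-token dataset size is a hypothetical example, not a number from this post.

```python
TOKENS_PER_TRILLION = 1_000_000_000_000

# Cost in USD to process 1 trillion input tokens, as quoted in the text above.
COST_PER_TRILLION_USD = {
    "2023 (gpt-4-turbo)": 5_000_000,
    "2024 (gpt-4o-mini)": 75_000,
}


def dataset_cost_usd(num_tokens: int, cost_per_trillion: float) -> float:
    """Scale the per-trillion-token price down to a dataset of num_tokens tokens."""
    return num_tokens * cost_per_trillion / TOKENS_PER_TRILLION


dataset_tokens = 10_000_000_000  # hypothetical dataset of 10 billion tokens

for label, cost in COST_PER_TRILLION_USD.items():
    print(f"{label}: ${dataset_cost_usd(dataset_tokens, cost):,.0f}")

reduction = COST_PER_TRILLION_USD["2023 (gpt-4-turbo)"] / COST_PER_TRILLION_USD["2024 (gpt-4o-mini)"]
print(f"price reduction: about {reduction:.0f}x")
```

Under these assumptions, a single pass over the hypothetical dataset drops from $50,000 to $750, a reduction of roughly 67x in one year.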

We’re entering a new era in which dataset-scale inference and interactive, browser-based data exploration will define how AI models are built and refined. By combining efficient data formats, high-performance JavaScript interfaces, and affordable AI-based annotations, teams can finally put data quality front and center without prohibitively high costs or clunky workflows.

The future belongs to those who seamlessly blend human expertise with AI-assisted insights—an approach that makes data cleaning faster, more intuitive, and ultimately, far more effective in powering the next generation of advanced AI models.

Ready to explore your machine learning data? Visit Hyperparam to start viewing and analyzing your datasets in seconds.
