Kamu Data - Medium

Is Data Infrastructure a Bottleneck of Distributed Energy?

Sergii Mikhtoniuk — Sun, 24 May 2026 17:31:16 GMT

DERs everywhere!

With the proliferation of renewable power and cheaper batteries, the energy sector is undergoing a transformational change towards decentralization — towards DERs.

DER (Distributed Energy Resource) can refer to:

Generation — from solar panels on your house, to a wind turbine, and industrial solar arrays
Storage — from a car battery and residential home banks, to utility-scale battery storages
Consumption — from your smart heater and EV charger to a small datacenter.

The energy sector is a big balancing act of generation and demand. Balancing requires a lot of coordination. Coordination in turn relies on a lot of data. Negative electricity prices and massive grid outages, which have become a common occurrence, show that this process is not working well, due to large generation forecast errors, curtailment, and output coordination issues.

DERs are exciting because they provide a solution:

Abundant storage capacity that moves closer to consumers
Consumption flexibility with smart chargers and devices, and much more.

To us, as data engineers, DERs are exciting because they require completely re-evaluating the data flows used to monitor and manage current infrastructure, how we integrate the massive number of small-scale devices into a common grid, and utilize all devices to maximize efficiency.

Energy’s data problem

Let’s analyze the current data flows from several perspectives to identify gaps and opportunities to improve.

Ownership

If you have any IoT devices in your home, you likely know the state of data ownership already. Every device manufacturer wants to tie you into their ecosystem and make you send data into some proprietary platform. You get a few quick dashboards out of the box, but if you want to later switch vendors, or your vendor discontinues their platform (very common in a competitive market) — you lose your data. You have to work quite hard to regain control over your data.

The exact same thing happens on a large scale. If you are building an industrial solar array, you buy inverters from some vendor, and those come with subscription-based proprietary data collection and monitoring platforms.

We’re talking about a lot of data. A 10 MW array with 100 string inverters sampled every 5 seconds may produce over 2M data points per hour [1]. So your monitoring stack quickly develops “data gravity” and becomes a big factor in locking you into one vendor.
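The estimate can be reproduced with a few lines of arithmetic. This is a minimal sketch that uses the figures from footnote [1]. The 100 kW inverter rating and the roughly 30 Modbus tags per inverter are that footnote's assumptions, and the variable names are ours.

```python
# Back-of-the-envelope estimate of raw monitoring volume for a solar array.
# Inputs follow footnote [1]; real sites will differ.
ARRAY_CAPACITY_KW = 10_000   # 10 MW array
INVERTER_RATING_KW = 100     # assumed string inverter size
TAGS_PER_INVERTER = 30       # approximate number of Modbus tags polled
SAMPLE_INTERVAL_S = 5        # one sample every 5 seconds

inverters = ARRAY_CAPACITY_KW // INVERTER_RATING_KW
samples_per_hour = 3600 // SAMPLE_INTERVAL_S
points_per_hour = inverters * TAGS_PER_INVERTER * samples_per_hour

print(f"{inverters} inverters, {points_per_hour:,} data points per hour")
```

Changing the sample interval or the tag count shows how quickly the volume grows, and with it the data gravity.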

⚠️ I should clarify that I don’t think hardware vendors are evil and aim to use data gravity as leverage. A few major hardware manufacturers I know said they would prefer not to develop their own data solutions at all, and see them as a distraction from their core business. The problem is that there hasn’t been any middle ground between niche “turn-key” data apps (a convenient product that users want) and a “build your own from a thousand pieces” infrastructure (that provides data sovereignty).

What can we do?

There is a clear parallel between the current state of data and where software was a decade ago. Deploying and operating software was an arduous process that required bespoke integrations and tightly coupled you to your infrastructure provider. The Kubernetes project created a new level of abstraction where software components could be packaged as standard-ish lego pieces and deployed in any cloud or on-prem environment, significantly reducing vendor lock-in.

For convenience and sovereignty to co-exist in data we need two things:

Ability to replicate our valuable data from any app to third-party storage for safe-keeping, or migrate it to another platform. Not a “data export” button that spits out a huge archive file in bespoke formats, but a standard protocol to continuously move large volumes of structured data across company boundaries.
Ability to transfer the data processing logic from one platform to another. The way you clean, aggregate, and enrich your data encodes hard-won domain knowledge about how to interpret raw device readings. Currently, when switching to a different platform you are not migrating this logic, you are rebuilding it, because the execution semantics of platforms differ slightly in many ways.

The Open Data Fabric (ODF) specification is our attempt to build a “Kubernetes of data”. It’s a layer that wraps many open-source data storage and processing systems into “lego pieces” that compose well. Through standard formats it allows data to be replicated efficiently between storages and systems in near-real-time. Thanks to standardized processing semantics it provides portability across implementations and environments: you can run the same pipeline on a laptop, edge server, or in a datacenter, using tools from different vendors.

Interoperability

When an IoT manufacturer builds a data app, it’s more of a “toy” with a few functions than a “data platform” — a check mark in their feature completeness list needed to sell a device. There are no real incentives for them to bother with extensibility, storage diversity, data exports and backups, interoperability with other tools.

But the data these apps hoard has hundreds of potential uses — and you need interoperability to unlock them. If you have ever tried to implement things like smarter heating scheduling for your house, you know that it requires first spending at least a week to build an alternative pipeline to move data to a place where it can be used by scripts and other tools.

The same, again, repeats on a larger scale. Most data tools supplied by industrial hardware vendors come from a long legacy that focused only on monitoring and maintenance. As the number of uses for the same data keeps growing (see next section), you also end up having to build an alternative data pipeline, and at this scale you’ll need a whole data engineering team.

When there were only a handful of large utility companies, having a data team was not a big problem. But in the world of DERs this is no longer viable. Things like generation forecasting or predictive maintenance cannot remain “tools for the big players only”.

Even at a 10 MW array scale, deviating from the forecasted output by 5% on average can cost the company hundreds of thousands of dollars and 10–15% of annual revenue in imbalance penalties [2]. No wonder that many solar developers spend years of engineering time on custom data stacks to achieve better forecasting.

The interoperability problem is arguably even more important on an inter-company level. When one firm builds flexibility into their EV chargers and another adds similar energy flexibility into smart heaters, we need common protocols for them to participate in scheduling optimization together. Every company today invents its own APIs, resulting in an O(n²) explosion of integration complexity.

O(n²) complexity of custom integrations vs. interoperability
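To make the quadratic growth concrete, here is a small sketch. It assumes the worst case where every party integrates directly with every other party, and compares that with each party implementing a single shared protocol. The function names are ours.

```python
def pairwise_integrations(n: int) -> int:
    """Custom point-to-point integrations if every pair of n parties connects directly."""
    return n * (n - 1) // 2


def shared_protocol_adapters(n: int) -> int:
    """Adapters needed if each party implements one common protocol."""
    return n


for n in (10, 100, 1000):
    print(f"{n} parties: {pairwise_integrations(n)} custom integrations "
          f"vs {shared_protocol_adapters(n)} adapters")
```

At 1,000 participants the pairwise approach needs 499,500 integrations, while the shared-protocol approach needs 1,000 adapters.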

The energy sector has many mature standards, but as you’ll see in the next section, data that DERs generate needs to efficiently cross not only site, but industry and sector boundaries. And this goes both ways, as DER operators also require access to external (weather, financial, hardware costs) data to plan strategically.

What can we do?

Existing energy data standards (Modbus, DNP3, OCPP etc.) bundle domain-specific semantics, wire formats, and network protocols into a single specification. This works well for device-to-device communication, but not for cross-domain interoperability.

We’re missing a layer that defines how historical and real-time information is stored and exchanged, independently of semantics and network protocols. A data consumer receiving two feeds, one from a solar plant and one from a financial market, should only focus on domain aspects of each to correctly combine them. Everything else — how records are encoded, transferred, how updates are propagated — would be handled automatically. This is what Open Data Fabric spec defines.

The closest thing to this today is JSON APIs, but they fall short in many ways. Every API is highly custom — there’s no standard for how data is queried, filtered, aggregated, paginated, how updates are signaled, or how historical data is accessed, which means every integration is effectively built from scratch. Almost every real-world integration of several data sources ultimately involves first moving everything into an intermediate database before any actual analysis can begin. ODF is designed to make that intermediate step unnecessary.

This approach scales from capturing raw monitoring data, to feeding it to anomaly detection, to aggregating it for reporting and BI, to enriching it with external market, weather, and irradiance forecast data — all in a uniform way. Let’s see how this approach also scales when it comes to exchanging data with external parties.

Sharing and trading

Imagine having data from thousands of home solar setups around the world — it would be hugely valuable! But you’d really struggle to find a buyer for data from just your house.

Similarly, data from a solar array has thousands of uses.

Many operators are waking up to the idea of their data being a valuable product on its own, but capturing this value when you run just one site is disproportionately hard. Finding a buyer, negotiating a complex contract and NDA, integrating and maintaining data delivery flows via portals, warehouses, FIX and Websocket APIs, is a lot of work outside of your core business. And this effort grows linearly with the number of recipients. That’s what we call a “small publisher dilemma”.

The energy sector is one of the few cases where data is so valuable that multiple “aggregator” companies exist that source it from individual sites to sell in bulk. They may offer you access to market-wide statistics in exchange, but in the end most value of your data is captured by aggregators.

What can we do?

What if we flip the problem on its head and redesign the data infrastructure from a “deliver-to-each-buyer” model to a “publish-once” model? Instead of an operator maintaining separate integrations for every data consumer, the operator advertises a dataset on a public network and defines price, license, and access conditions, including what level of aggregation is visible to whom. Any number of consumers can then discover and access it through standard protocols. For the operator the effort is constant no matter how many downstream consumers exist.

This changes the value dynamics of small datasets. If the publishing infrastructure allows the global community of researchers and data scientists to build derivative pipelines that combine data from many individual sites into higher-order data products — the operator’s data becomes easier to discover and consume, and exponentially more valuable. Every site operator automatically benefits from every subsequent site that joins the publishing network, without any extra work.

Critically, this doesn’t require sacrificing control: query patterns and aggregation rules defined at publish time let the operator control precisely what is visible at each level, and those rules are enforced by the infrastructure.

The economics also shift. Instead of aggregators capturing most of the value by virtue of being the only party with the engineering capacity to assemble and resell the data at scale, value is captured primarily by data owners. Overpriced integration pipelines will immediately invite competition from the global community of data engineers.

Privacy and accountability

When energy data is used in so many applications, it had better be reliable. But how can we trust that a party delivers data that is real and accurate?

We could ask for disaggregated data and near-real-time delivery. Add device signing, and you’ll have robust protection against tampering.

But raw data is a major privacy and security concern:

It exposes the map of your physical infrastructure and reveals when equipment fails
Advertises your attack surface and when on-site personnel are present or absent
Provides competitive intelligence about your performance and curtailment patterns for bidding behavior inference and PPA negotiation leverage.

So data needs to be shared selectively, with levels of anonymisation and aggregation. And this layer is always custom-built — yet another massive task for your data engineering team.

But if the data is processed and doesn’t come signed straight from a device — how can we be sure that it’s not synthetic, generated to make a quick buck, or that it’s not omitting some critical interval to cover something up?

In current approaches — we simply can’t. While revenue-grade metering with certified meters directly read by ISOs provides tamper-evident generation records for settlement — there is no such mechanism for the rest of the data. Financial market consumers don’t have any mechanism to keep providers accountable for API call responses. Data shared via some warehouse tables can be changed overnight without any trace. This makes data products overall much less valuable, and any doubts about the validity of some value or a chart require months of work to address.

What can we do?

The accountability gap is a provenance problem — we need to prove that data we aggregated or anonymized for the sake of privacy actually corresponds to original raw device data. But how can we do this without revealing all raw data for audit?

Enter confidential computing. If data is cryptographically signed at the device and every subsequent processing step produces a cryptographic proof that the computation ran correctly on authentic inputs, then the trustworthiness of a derivative dataset can be confirmed without trusting the parties who handled it. A data consumer can independently verify that an aggregated or anonymized dataset is a faithful derivative of real device readings with no gaps, substitutions, or fabrications, regardless of how many processing steps or organizational boundaries it passed through.

Anonymized data verifiable back to the original device readings
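The device-signing building block can be illustrated with a hash-chained log of signed readings. This is a simplified sketch, not how ODF implements it. It uses a shared-secret HMAC from the Python standard library as a stand-in for the asymmetric signatures a real device would produce, and it does not attempt to prove that a later aggregation step ran correctly, which is what TEEs or zero-knowledge proofs are for. `DEVICE_KEY`, `sign_reading`, and `verify_chain` are hypothetical names.

```python
import hashlib
import hmac
import json

# Hypothetical demo key; a real device would keep a private key in secure hardware.
DEVICE_KEY = b"demo-device-secret"


def _payload(prev_digest: str, reading: dict) -> bytes:
    return json.dumps({"prev": prev_digest, "reading": reading}, sort_keys=True).encode()


def sign_reading(prev_digest: str, reading: dict) -> dict:
    """Link a reading to its predecessor and sign the result."""
    payload = _payload(prev_digest, reading)
    return {
        "prev": prev_digest,
        "reading": reading,
        "digest": hashlib.sha256(payload).hexdigest(),
        "sig": hmac.new(DEVICE_KEY, payload, hashlib.sha256).hexdigest(),
    }


def verify_chain(records: list[dict]) -> bool:
    """True only if every record is authentic and links to the one before it."""
    prev = "genesis"
    for rec in records:
        payload = _payload(rec["prev"], rec["reading"])
        expected_sig = hmac.new(DEVICE_KEY, payload, hashlib.sha256).hexdigest()
        if rec["prev"] != prev:
            return False  # an intermediate record was dropped or reordered
        if not hmac.compare_digest(rec["sig"], expected_sig):
            return False  # the reading was altered or not signed with the device key
        if rec["digest"] != hashlib.sha256(payload).hexdigest():
            return False  # the stored digest does not match the content
        prev = rec["digest"]
    return True


chain, prev = [], "genesis"
for ts, kw in [(0, 4.1), (5, 4.3), (10, 4.2)]:
    rec = sign_reading(prev, {"ts": ts, "kw": kw})
    chain.append(rec)
    prev = rec["digest"]

tampered = [dict(rec) for rec in chain]
tampered[1] = {**tampered[1], "reading": {"ts": 5, "kw": 9.9}}

print(verify_chain(chain))                   # intact chain
print(verify_chain(chain[:1] + chain[2:]))   # middle record omitted
print(verify_chain(tampered))                # value substituted
```

The chain catches substituted values and missing intermediate records. Detecting a truncated tail, or verifying an aggregate without access to the raw readings, needs additional machinery beyond this sketch.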

The implications for ESG markets are significant. Today carbon offsets or RECs are traceable only as far back as the reporting entity that submitted them to a registry. With device-level signing and verifiable computations, every certificate could be cryptographically linked back to the physical event that created it, like a certified flow rate meter measuring sequestration or carbon capture. This would make greenwashing and certificate fraud structurally difficult rather than merely illegal, as it would require tampering with the certified device itself rather than simply changing data values anywhere in the downstream pipelines.

ODF seamlessly integrates modern confidential computing techniques like TEEs and Zero-Knowledge Proofs into data processing pipelines, with pilot projects conducted in the carbon offsets market.

Recency, latency, automation

In our team we often discuss how power-hungry AI workloads and cryptocurrency mining could be instantaneously dispatched to datacenters in areas where energy prices go negative as a perfect dump load. But a quick reality check shows that most operators today still handle day-ahead scheduling manually. Even if ML forecasting is used — it still mostly runs on the same 24-hour schedule and with manual sign-offs. There’s such a big gap between where we want to be with DERs and where we are right now…

Unlocking the potential of DERs will require:

High degree of human-out-of-the-loop automation — to push latency to minutes/seconds
High degree of traceability and auditability — to debug and gain confidence in said automation.

The current state of data tooling just doesn’t cut it. Outside of monitoring stacks, most tools operate on a fundamentally batch-oriented model — jobs are scheduled to run at fixed intervals, reprocess all data from scratch, at certain cutoff times, producing outputs that are stale before they’re consumed. No matter how much you automate — this model requires a lot of care around execution order, ensuring consistency across datasets that were frozen at different moments. A human has to be in the loop to maintain this extremely fragile machinery and to recover it after failures.

What can we do?

Temporal processing — a new computational model alternative to batch — treats all pipelines as continuous flows where each processing step produces outputs incrementally as new data arrives, with the infrastructure guaranteeing consistency and execution order automatically. It’s like switching from algebra to calculus — a better way to express computations that involve passage of time. It isn’t just a latency improvement — it’s the only way to make genuine human-out-of-the-loop automation possible.

Batch flows requiring constant human attention

Temporal flows are configured once and run autonomously
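The difference between the two models can be sketched with a toy hourly aggregation. This is only an illustration of incremental versus batch computation, not ODF's processing model. It ignores late or out-of-order data, which the text says the infrastructure would handle, and all names are ours.

```python
from collections import defaultdict


def batch_hourly_energy(readings):
    """Batch style: recompute every hourly total from the full history."""
    totals = defaultdict(float)
    for ts, kwh in readings:
        totals[ts // 3600] += kwh
    return dict(totals)


class IncrementalHourlyEnergy:
    """Incremental style: fold each new reading into existing state."""

    def __init__(self):
        self.totals = defaultdict(float)

    def on_reading(self, ts, kwh):
        hour = ts // 3600
        self.totals[hour] += kwh
        return hour, self.totals[hour]  # only the affected output row is emitted


# (timestamp in seconds, energy in kWh)
history = [(0, 1.0), (1800, 1.5), (3600, 2.0), (5400, 0.5)]

stream = IncrementalHourlyEnergy()
for ts, kwh in history:
    print(stream.on_reading(ts, kwh))

# Both approaches agree on the final result.
print(batch_hourly_energy(history) == dict(stream.totals))
```

The incremental version does a constant amount of work per reading and always has an up-to-date answer, while the batch version has to be rescheduled and rerun over the whole history to catch up.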

If we ever hope to meaningfully track ESG scope 1, 2, and 3 emissions — across an operator’s own assets, their energy purchases, and their supply chain — and do so without needing an army of specialized accountants crunching numbers day and night — we need to build temporal pipelines that can be reviewed, certified, and continuously improved. Any operator should be able to reuse such pipelines like an open template — plug in their monitoring inputs, and get output data for disclosures at minimal cost of compliance.

Temporal processing combined with confidential computing would make ESG disclosures both verifiable and privacy-preserving.

Fragmentation

Renewable energy certificate registries (aka REC, GO, REGO) are a consequential example of the data fragmentation problem. Globally, RECs exist in dozens of incompatible regional systems [3], each with its own data model, eligibility rules, temporal granularity, and identifier namespace. These registries don’t interoperate. A corporate buyer with operations across multiple continents must manually reconcile fundamentally incompatible certificate schemas, rely on intermediaries to convert certificates across registry boundaries (a process that is slow, expensive, and administratively rather than technically verified), and trust that bilateral agreements between registries prevent the same megawatt-hour from being claimed twice in two different markets.

The result is that the global renewable energy market runs on a foundation of administrative trust and manual reconciliation rather than technical verifiability, which limits how sophisticated it can become: real-time certificate markets, automated ESG reporting, and meaningful participation by small DER operators are all structurally out of reach until a common interoperability layer exists beneath the registries.

What can we do?

Many blockchain-based solutions have tried to address the registry fragmentation problem. A decentralized tamper-proof ledger sounds like a natural fit at first, but in practice even specialized enterprise blockchains struggle with large volumes of data, because they are designed for transactional rather than analytical processing. As the DER space moves from coarse-grained assets (1 REC = 1 MWh) and infrequent settlements to more real-time interactions, a single-ledger model would face scaling issues. And all data would still need to be copied into databases and lakehouse tools to be searchable, queryable, and interoperable with BI and data science tools.

A conventional SQL lakehouse architecture, on the other hand, is well understood and widely deployed. The cryptographic verifiability layer we discussed above can ensure that the data that lands in regional registries is trustworthy. Thanks to tamper-proof properties, not even registry admins will be able to modify it without a trace.

Federated registry of verifiable renewable energy certificates

Through a layer of federation, data from multiple registries can be harmonized and integrated together, and made easily available and efficient to query. Such integration pipelines can exist purely as code that deterministically reconstructs the dataset. It does not need to be governed by an authoritative body. It doesn’t even need any infrastructure! It can be maintained as a pure open-source initiative.
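A harmonization pipeline that exists purely as code can be as simple as a set of pure mapping functions. The sketch below uses two invented registry layouts. The field names, registry labels, and sample records are hypothetical and do not correspond to any real registry's schema.

```python
# Hypothetical record layouts; real registries define their own schemas.
REGISTRY_A = [
    {"cert_id": "A-001", "mwh": 1, "generated_on": "2025-03-01", "plant": "solar-17"},
]
REGISTRY_B = [
    {"serial": "B-9", "energy_kwh": 2000, "period_start": "2025-03-02", "asset_ref": "wind-04"},
]


def from_registry_a(rec: dict) -> dict:
    """Map a registry A record to the common layout."""
    return {
        "registry": "A",
        "certificate": rec["cert_id"],
        "energy_mwh": float(rec["mwh"]),
        "date": rec["generated_on"],
        "asset": rec["plant"],
    }


def from_registry_b(rec: dict) -> dict:
    """Map a registry B record to the common layout, converting kWh to MWh."""
    return {
        "registry": "B",
        "certificate": rec["serial"],
        "energy_mwh": rec["energy_kwh"] / 1000,
        "date": rec["period_start"],
        "asset": rec["asset_ref"],
    }


def harmonize(a_records: list, b_records: list) -> list:
    """Pure function: the same inputs always yield the same unified, sorted table."""
    rows = [from_registry_a(r) for r in a_records] + [from_registry_b(r) for r in b_records]
    return sorted(rows, key=lambda r: (r["date"], r["registry"], r["certificate"]))


unified = harmonize(REGISTRY_A, REGISTRY_B)
for row in unified:
    print(row)
```

Because `harmonize` has no hidden state, anyone can rerun it against the same registry snapshots and get the same unified table. A schema change in one registry shows up as a failing mapping function rather than as a silent downstream error.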

In this model every registry will retain full autonomy in evolving its data schemas, while federated pipelines will give it instant feedback on how its changes affect the harmonization layer and whether they break downstream consumers. The whole system will become more uniform over time thanks to improved visibility.

So is data a bottleneck for DERs?

There are many obstacles to DER adoption, and data infrastructure is unlikely to be at the top of any operator’s list of bottlenecks. But viewing data merely “as one problem among many” understates how deeply data problems compound the others.

These problems scale nonlinearly with DER proliferation and the O(n²) integration complexity becomes unmanageable at millions of small assets.

Conclusion

The negative electricity prices and grid failures aren’t engineering failures — they’re coordination failures. The physics of renewable generation is well understood. The hardware to store and dispatch it is getting cheaper. What’s missing is the connective tissue: the ability to observe, verify, and act on the behavior of millions of distributed resources in real time, across organizational and jurisdictional boundaries. Preferably without armies of engineers needing to maintain it like a giant Rube Goldberg machine.

The problems we’ve described — vendor lock-in, quadratic integration complexity, the small publisher dilemma, the accountability gap, registry fragmentation — look like separate issues when viewed in isolation. But they share a common root: the absence of a sector-agnostic layer that handles how data is managed, processed, and exchanged independently of what the data means or which domain it comes from. Every sector that has developed such a layer — compute infrastructure with Kubernetes, package distribution with language registries, web communication with HTTP — has seen coordination costs collapse and innovation accelerate. Data infrastructure hasn’t had its equivalent yet.

When it does, a different set of possibilities will open. A small solar operator publishes monitoring data once and reaches forecasters, insurers, researchers, and carbon markets without maintaining individual relationships with any of them. A utility approving a new DER interconnection has auditable, verifiable performance history from comparable assets rather than conservative assumptions based on manufacturer specs. A corporate buyer retiring a renewable energy certificate can verify its provenance back to a signed device record rather than trusting a chain of administrative reports. An ESG disclosure that currently requires a quarterly accounting exercise runs continuously and updates automatically as new generation data arrives. A grid operator dispatching thousands of EV batteries during a frequency event has sub-second verified telemetry from each one rather than stale batch reports.

None of these require new hardware or new physics. They require data infrastructure that treats verifiability, interoperability, and automation as first-class properties rather than afterthoughts.

Open Data Fabric is our attempt to specify this layer: an open, sector-agnostic spec that is already in use in biotech, finance, and IoT contexts. We think energy is where it’s needed most urgently, and we’d welcome the conversation with anyone working in this problem space.

[1] 10 MW ÷ 100 kW inverters × ~30 Modbus tags × 3,600 s / 5 s = 2.16M raw records per hour prior to aggregation.
[2] Solar forecast errors of 5% nMAE are typical for sites without calibrated plant models [NREL]. In CAISO’s settlement structure, deviations are settled at a 10% penalty multiplier above or below real-time prices [CAISO BPM]. The figure is highly ISO-dependent.
[3] WREGIS in the western US, M-RETS in the midwest, GATS in PJM, Guarantees of Origin in the EU, I-RECs for international markets, and many more.

Is Data Infrastructure a Bottleneck of Distributed Energy? was originally published in Kamu Data on Medium, where people are continuing the conversation by highlighting and responding to this story.

Open Data Fabric Protocol & Consortium

Sergii Mikhtoniuk — Fri, 27 Feb 2026 01:53:06 GMT

Open Data Fabric Consortium

Feels like just yesterday Kamu set out on an ambitious mission to unlock collaboration in data on a scale similar to the open-source software revolution.

Very early in our journey we introduced the Open Data Fabric protocol as a stand-alone open specification (aka “HTTPS of multi-party data exchange”). We envisioned it to be developed by the community, for the community, and to belong to no one.

Since then, these efforts have grown into a large consortium of contributors that deserves a new home: https://opendatafabric.org

Big thanks to our partners for supporting the vision of an equitable and verifiable data economy.

Check out the website and enjoy the new introductory video:

https://medium.com/media/c822aedbc6be5dfa735480c7b5c25113/href

Want to discuss or contribute? Join the conversation on Discord.

Till next time!

Open Data Fabric Protocol & Consortium was originally published in Kamu Data on Medium, where people are continuing the conversation by highlighting and responding to this story.

Kamu Update — A year-worth of features

Sergii Mikhtoniuk — Mon, 17 Nov 2025 03:03:30 GMT

Hey, it’s been a while since our last update.

We accumulated tons of exciting news, so let’s dig in!

But first, what is Kamu?

What’s the first thing that comes to your mind when I say “data”?

If it’s Excel spreadsheets, then imagine one that keeps a real-time record of events that happen in your business and, like an accounting ledger, allows you to “time-travel” to see what your data looked like at any specific point in time.

If it’s thousands of CSV files, then imagine a data format that allows you to represent highly dynamic data (e.g. IoT), is strictly typed, 1000x more compact, can be efficiently queried like a database, yet remains easy to share and copy between systems, and embeds cryptographic proofs of who owns and who modified each dataset.

If it’s Jupyter notebooks, then imagine a special notebook with code that cleans, harmonizes, and combines data from several different dynamic sources. You can share this notebook with other people, thus providing them with both a stream of continuously refined data AND an explanation of where this data comes from.

If it’s an enterprise data lakehouse, then imagine a peer-to-peer network connecting lakehouses in multiple companies, allowing them to share data with subsidiaries and partners from a “single source of truth” with flexible privacy controls, and build verifiable ETL pipelines that span across company boundaries.

Kamu is a novel data supply chain technology focused on how data travels across organization boundaries. It aims to unlock collaboration on data similar to the open-source software revolution and build a collaborative data economy based on privacy, clear ownership, and verifiable trust.

The technology itself is pretty simple. Explaining it still requires many analogies only because something like that did not exist before (see step by step technical introduction).

New Partners and Exciting Use Cases

We’ve seen way too many companies (especially in Web3) who spend years developing complex technologies only to realize there is no market for them.

In Kamu we believe that a healthy tech company should minimize the time between developing a feature and putting it to good use in production.

In the face of limited resources and a small team, the best way to do it is to build meaningful partnerships with visionary companies, solve their immediate needs, and grow together towards larger common goals.

I’m happy to announce our continued collaboration with Molecule in scientific data sharing (see DeSci Berlin talk).

🧬 100+ Biotech startups are trusting Kamu with their research data!

We have collaborated with several startups in Decentralized AI space:

To connect LLM agents to real-time analytical data (see OAG demo)
To provide provenance trail for GenAI artifacts for copyright and IP infringement protection
To power tokenomics models that reward data owners and model authors for GenAI artifacts

We are also starting to collaborate with several DePIN companies to unlock private data ownership and democratize DePIN data pipelines, and Industrial IoT companies that need to disseminate high-frequency sensor data.

R&D

These collaborations have pushed us hard in terms of features needed from Kamu Node. Nearly all our efforts this year went into turning the technological foundation we created with Kamu CLI into a scalable back-end solution.

Flow System

Just like how GitHub hosts millions of repositories and runs their CI actions, our public node is intended to host millions of datasets and ingest, process, and validate their data.

Given that pipelines need different engines (Flink, Spark, Datafusion etc.) and different versions of those engines to execute, and we don’t have infinite hardware to keep them running all the time — we need a smart way to schedule these computations.

Enter the Flow System. The Flow System is the heart of Kamu Node that beats to drive computations forward.

Computations can be of many types, like polling ingestion, derivative transformations, verification, compactions. Flow System intelligently schedules and prioritizes them and spawns execution tasks that are processed according to available capacity.

We decided that the most important thing in the design of this system was explainability — understanding why some actions happened when they did, and what actions will happen soon.

Happy to say that flows in Kamu offer the best explainability I’ve seen in any job scheduler:

You can see the upcoming flows and the estimated time when they will start executing

You can understand what events triggered a flow and reconstruct the graph of updates

You can see all queues, batching, and scheduling decisions that were made prior to task execution

We are working to make flows easy to monitor, so you can keep your complex ETL pipelines in working order.

Private Datasets & Auth

Some data we work with today is highly sensitive:

In case of Biotech startups it may contain technology secrets, pre-patent discoveries, and financial data
In case of GenAI — it may contain internal company materials used for inference.

So we extended Kamu Node to finally support private datasets and a flexible sharing system, based on a robust ReBAC authorization mechanism.

You can generate access tokens for scripting and automation.

CLI now supports OAuth2 Device Flow to authenticate with remote nodes.

And you can now authenticate with Kamu using cryptographic wallets — a major step we took towards personal data ownership in DeSci and DePIN use cases.

This is only the beginning of our privacy (structured transparency) roadmap.

Python Client

We have released the first version of our official Python client library:

Now you can:

Query data from Python scripts and notebook environments
Query data from local workspaces and remote nodes
Select between multiple SQL engines

A minimal Jupyter notebook example is:

%load_ext kamu

import kamu

con = kamu.connect("grpc+tls://node.demo.kamu.dev:50050")
con.query("select * from 'kamu/co.alphavantage.tickers.daily.spy' limit 3")

The client library is based on ADBC and FlightSQL — the most efficient protocols to transfer columnar data out there.

It works in any Python notebook environment — try running this example notebook in Google Colab.

Verifiable Query API

Imagine running an SQL query that aggregates gigabytes of data, or better yet — JOINs several huge datasets that belong to different organizations. The computations to deliver this result may span across several independent Kamu nodes, located in different countries and operated by different people with different incentives.

How can you trust such a result?

Enter Verifiable Query API. When you enable this mode — every query response from Kamu will be accompanied by a cryptographic proof of result validity.

These proofs are succinct, much smaller than the result data itself (just a few kilobytes), but having them allows you to hold every participant of the entire supply chain that delivered you the response forever accountable for it.

Our novel Structured Recursive Commitments scheme makes it possible to validate a response and assign blame to a specific actor in a multi-party collaboration.

Currently we support an optimistic execution scheme (proofs based on signatures, determinism, and reproducibility), but in the near future we will be expanding it to TEEs, and later to Zero Knowledge proofs.

This proof mechanism is fundamental to ODF Oracle — a system for querying data from blockchain smart contracts.

LLM Connectivity via MCP Server

Kamu, in collaboration with Brian, has developed a new technique to connect LLMs to verifiable real-time factual data.

Oracle-Augmented Generation technique:

Enables LLMs to execute complex analytical queries over large volumes of real-time data
Makes AI answers auditable and far less prone to hallucination
Offers full data supply chain provenance.

Proof saved alongside the agent’s chat history

We believe this technique will be fundamental to granting AI agents more autonomy, without losing control and ability to reason about their actions.

It will likewise be fundamental in building an equitable data economy where owners of information and pipeline maintainers are fairly rewarded for their contributions in answering questions accurately.

The best thing is that you can use it just like any other MCP server.

Semantic Search

The first important step in our LLM connectivity technique was to narrow down the search space of datasets that can potentially answer a given question from thousands to a few dozen, so as not to overload the LLM context window. We achieved this by implementing a semantic search API that finds the most relevant datasets.

Metadata about the dataset like its description, schema, and example queries is converted into vector embeddings and stored in a vector database for efficient querying and ranking.
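To make the ranking idea concrete, here is a minimal sketch of metadata search by vector similarity. It is not Kamu's implementation: the bag-of-words "embedding", the in-memory catalog, and the dataset names are all assumptions made for illustration, standing in for a real embedding model and vector database.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding (assumption): a bag-of-words term-frequency vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query: str, catalog: dict[str, str], top_n: int = 3) -> list[tuple[str, float]]:
    """Rank datasets by similarity between the query and their metadata text."""
    q = embed(query)
    scored = [(name, cosine(q, embed(meta))) for name, meta in catalog.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_n]


# Hypothetical catalog: dataset name -> description text
catalog = {
    "example/tokens.ohlcv": "defi token trading volume open high low close",
    "example/weather.daily": "daily temperature and precipitation by city",
    "example/tokens.prices": "crypto token prices in usd",
}
results = search("token trading volume", catalog, top_n=2)
print(results)
```

A real system would swap `embed` for a learned embedding model so that related terms match even when the exact words differ.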

You can now enter free-form text in the search box to find datasets that are close to desired theme:

Database-backed Metadata Catalog

ODF file format standardizes dataset layout on disk, allows datasets to be interoperably shared and replicated across different environments. But when datasets are in your Kamu Node — scanning metadata as individual files is not very efficient.

Structure of an ODF dataset

We introduced a new metadata catalog that acts as a write-through cache in Postgres and MariaDB databases. This is very similar to multiple metadata catalog implementations for formats like Apache Iceberg, but in our case stores the cryptographic metadata chain blocks, refs, dataset summaries and dependency graph, and a lot more.

This change dramatically improves performance of most operations. It also ensures atomic transactions, which are hard to achieve in pure-storage catalogs like S3.

Kamu CLI also benefited from these changes, as it now uses a SQLite-based implementation of such a catalog.

Webhooks

As data moves through the pipelines in Kamu — there are multiple use-cases for changes in datasets to drive some automation. For this we have added full-fledged support for Webhook mechanism that supports message signatures (RFC 9421) and content digests (RFC 9530) for security.
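As a rough illustration of the content digest part, a webhook receiver could check a `Content-Digest` header in the RFC 9530 format like this. This is a sketch only: it handles a single `sha-256` entry, the payload is made up, and it does not cover RFC 9421 signature verification or the exact headers Kamu sends.

```python
import base64
import hashlib
import hmac


def content_digest(body: bytes) -> str:
    """Build an RFC 9530 style Content-Digest value using SHA-256."""
    digest = hashlib.sha256(body).digest()
    return "sha-256=:" + base64.b64encode(digest).decode("ascii") + ":"


def verify_content_digest(body: bytes, header_value: str) -> bool:
    """Recompute the digest over the received body and compare in constant time."""
    expected = content_digest(body).encode("utf-8")
    return hmac.compare_digest(expected, header_value.encode("utf-8"))


# Hypothetical webhook payload, for illustration only
payload = b'{"event": "dataset.updated"}'
header = content_digest(payload)
print(header, verify_content_digest(payload, header))
```

The digest only proves the body was not altered in transit; it is the message signature that ties the request, including this header, to the sender.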

Variables and Secrets

We added support for variables and secrets that can be passed down into ingestion tasks, making it easy to interact with data APIs that require some form of authorization. All secrets are stored securely encrypted.

Usability

My (very biased) feeling is that we struck the right design and exact right combination of technologies to move the stagnating world of data into a new territory. As a software architect, never before have I seen so many pieces of the puzzle falling neatly together, even those that we have not at all anticipated.

Our biggest challenge admittedly will be making the product extremely easy to use. Accessibility, rapid feedback, and guiding users to the right actions will be key to undoing the decades of dominance of the batch processing model.

A few highlights in this area include:

Ingestion via file uploads allows you to create root datasets by dragging & dropping files.

Email gateway support that will notify you about changes to your account and the state of your pipelines.

Documentation overhaul including:

Much better structure of the documentation portal
Built-in quick search
Inclusion of ODF spec and metadata schemas reference
RFC history
Comprehensive Glossary with cross-linking of all terms throughout our docs and tutorials
REST API reference documentation using OpenAPI
New GraphQL playground

Core ODF Protocol

The core protocol has been getting many cool improvements as well, too many to list.

Some highlights:

More expressive dataset schema definitions
Separation of ODF crates from Kamu implementation to help building more ODF-compatible clients
Redesign of metadata traversal (MetadataChainVisitor) that makes it possible to express complex computations on dataset history while utilizing multiple layers of caching for efficient traversal
Projecting changelog history into a state snapshot using to_table() UDTF
Hard compactions
New ChangelogStream and UpsertStream merge strategies
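To illustrate what projecting a changelog into a state snapshot means, here is a small conceptual sketch. It is not the `to_table()` UDTF itself: the operation labels, the `(op, key, value)` event shape, and the sample data are assumptions chosen for illustration.

```python
from typing import Any, Iterable

# Hypothetical operation labels used only in this sketch
APPEND, RETRACT = "append", "retract"


def project_to_table(changelog: Iterable[tuple[str, str, Any]]) -> dict[str, Any]:
    """Fold (op, key, value) events, in order, into the latest state per key."""
    state: dict[str, Any] = {}
    for op, key, value in changelog:
        if op == APPEND:
            state[key] = value        # insert a new key or overwrite its value
        elif op == RETRACT:
            state.pop(key, None)      # remove the key if present
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return state


changelog = [
    (APPEND, "sensor-1", 20.5),
    (APPEND, "sensor-2", 18.0),
    (APPEND, "sensor-1", 21.0),    # a later reading supersedes the earlier one
    (RETRACT, "sensor-2", None),   # the sensor is removed
]
snapshot = project_to_table(changelog)
print(snapshot)
```

The full history stays in the changelog, while the snapshot answers the question "what is the state right now?".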

What’s Next?

Our near-term roadmap includes:

Securing future funding
Improving dataset schema migrations
Improving usability and shortening feedback loops
Privacy-preserving queries and transformations using TEEs
More declarative pipeline and resource management
Further separation of ODF core libraries
Integrating more engines
System abuse prevention

Our focus will remain primarily on working closely with existing partners, but all improvements we make benefit Kamu Node’s functionality for everyone else.

We remain fully committed to building Kamu as local-first software, meaning everything you can do in Kamu Node you’ll be able to do locally with Kamu CLI, without sign-ups, accounts, and for free.

Thanks so much for reading, and huge thanks to our team for making all this happen. Start a conversation with us on Discord.

Till next time!

Kamu Update — A year-worth of features was originally published in Kamu Data on Medium, where people are continuing the conversation by highlighting and responding to this story.

Ownership and Decentralization of Data in DePINs

Sergii Mikhtoniuk — Sat, 23 Aug 2025 22:24:27 GMT

Decentralized Physical Infrastructure Networks (DePINs) emerged from decentralized finance as incredible new incentive mechanisms that enable large groups of people to collectively provide valuable and cohesive services to society with minimal intermediation and transparent rewards.

Central role of data in DePIN rewards

Data plays a central role in DePINs:

Some DePINs are purpose-built to crowd-source data collection and reward their contributors based on volume and quality of data (e.g. weather, traffic, air and sound pollution networks, wearable personal health devices)
Other DePINs collect data for quality of service purposes (e.g. wireless connectivity providers and idle CPU/GPU compute sharing networks collect node uptime, reliability, and units of work metrics).

In both cases data is the source of truth for determining how much contributors earn in rewards for providing their services to the network. Therefore the validity of data and computations is of utmost importance to fair token distribution.

Challenges of DePIN data management

Managing data is hard, even in a centralized setting. Tech companies retain large data engineering teams for the sole purpose of making sense of internal data. For DePINs this challenge is even more severe:

They often deal with large volumes of device data (IoT), often beyond what a single database can handle.
Their analytics and rewards involve combining on-chain and off-chain data, creating a lot of extra work to index data from blockchains and get data into smart contracts.
DePINs create value from data only when it crosses organizational boundaries, when it is provided to business consumers, or when it’s shared with data science communities, thus requiring a lot more infrastructure around dissemination APIs and data privacy.

So DePINs need to build a far more complex tech stack, often on a tight budget, while also facing transparency and verifiability pressure that their Web2 counterparts do not.

Who owns the data?

As a result of these challenges, many DePINs fail to deliver on their decentralization promises. By building on top of Web2 data solutions like AWS, Snowflake, and BigQuery they re-centralize the data, contrary to their ethos.

You may argue that DePIN data stored this way still can be traced back to the person/device who provided it via device IDs and wallet accounts, but here’s a simple “ownership test”:

If a company behind a DePIN doesn’t pay its server bill — the data of the entire network can be lost, the token value can tank, and years of contributions of people and their hard-earned rewards wiped out in an instant.

The reason for all this is simple: while blockchains mostly solved the decentralization of money, the problem of decentralization of data is largely unsolved.

Evolution of data in DePINs

Most DePIN companies do try hard to deliver on their promises to the community using the best tools and resources at their disposal.

Infrastructure of a typical DePIN with a high degree of centralization

Typical evolution stages of data management in such companies may look like this:

Start collecting data into a single database as a proof of concept
Write scripts to distribute rewards to early adopters, without any transparency or verifiability
Build custom APIs to deliver data to first customers, monetize data, and close the value loop
Develop an interim solution to scale beyond one database, such as offloading historical data to a Parquet data lake in S3
Develop custom data exports to distribute data to the community for hackathons and challenges to unlock more interesting use cases for their data
Under pressure to differentiate themselves from Web2 companies they may start pinning data in IPFS or Filecoin.

This is a long and winding path: it takes many years and millions in engineering resources to execute, yet still doesn’t lead to satisfactory results. Having data pinned in IPFS doesn’t solve the underlying re-centralization issues; it only masks them by duplicating data into Web3 storage.

By that time data has already lost its verifiability properties and its connection to who provided it, and all data pipelines — the arteries that transport and refine data into more valuable forms — remain proprietary, impossible to reproduce.

Kamu: Decentralized data middleware

Kamu achieves what we used to think was impossible:

It flips the data management model upside-down, re-focusing it on personal data ownership
But does so in a way that is non-disruptive and maximally compatible with existing data tools.

DePIN negotiates access to personally-owned data and refines it via decentralized pipelines

Here’s how it works:

People or devices write data into datasets that they own and control. This data can be a high-frequency stream like IoT and reside in any storage of their choice.
Kamu stores data as a ledger where information about ownership is inseparable from data, simultaneously protecting data owners from the copy problem and holding them accountable for data they provide.
DePIN networks don’t own source data — they are simply granted access to personally-owned datasets.
DePINs combine and aggregate real-time data from contributors by creating “views” or derivative datasets that can be queried as efficiently as in centralized infrastructure, but preserve the link to source data throughout all transformations.
Through a series of verifiable computing pipelines they refine raw data into higher-order valuable datasets and calculate data quality scores of every contributor in a fully transparent way.
They monetize high-quality data through smart contracts and grant access permissions to 3rd parties, simultaneously creating a financial trace and an access audit log.
In another smart contract they query monetization records in combination with data quality scores to fairly distribute token rewards. By doing this on-chain using Kamu’s verifiable oracle they link every token disbursement to the exact data it was based on, ensuring complete transparency, reproducibility, and auditability.
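As a purely illustrative sketch of the last step, here is one way a reward pool could be split in proportion to data quality scores. The proportional rule, the integer token units, and the contributor names are assumptions; a real DePIN defines its own formula, typically inside a smart contract.

```python
def distribute_rewards(pool: int, quality_scores: dict[str, int]) -> dict[str, int]:
    """Split an integer token pool proportionally to non-negative quality scores."""
    total = sum(quality_scores.values())
    if total <= 0:
        return {contributor: 0 for contributor in quality_scores}

    # Floor of each proportional share, so we never hand out more than the pool
    shares = {c: pool * s // total for c, s in quality_scores.items()}

    # Tokens lost to rounding go to the largest fractional remainders first,
    # with ties broken by name to keep the result deterministic
    leftover = pool - sum(shares.values())
    by_remainder = sorted(
        quality_scores,
        key=lambda c: (-(pool * quality_scores[c] % total), c),
    )
    for contributor in by_remainder[:leftover]:
        shares[contributor] += 1
    return shares


# Hypothetical contributors and scores
rewards = distribute_rewards(1000, {"alice": 50, "bob": 30, "carol": 20})
print(rewards)
```

Keeping the calculation deterministic matters here: anyone re-running it over the same verified inputs must arrive at exactly the same payouts.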

In the same way as DePINs rely on L1 blockchain networks for decentralized financial services — Kamu provides decentralized data storage and processing services that cover most needs of DePIN infrastructure.

This approach unlocks a dazzling number of cool properties:

Refined datasets, business reports, trained AI models — all artifacts can be forever linked to data from verifiable pipelines and the information about who contributed every data point.
Verifiable provenance provides the mechanism to both fairly reward everyone who contributed data and maintained the pipelines and to hold everyone accountable for their actions.
If a DePIN company dissolves, data contributors don’t lose their data. They can take their data elsewhere, or even pool up with data consumers and keep all pipelines running as before. It’s an absolutely crucial step towards making DePINs actually behave like true DAOs.
Same devices can be contributing data to multiple DePIN networks at once, earning multiple rewards, without duplicating data.
DePIN projects can manage data with full transparency and engage the community to contribute towards improving pipelines as easily as they would contribute to open-source code.

🤔 Interested how all this works? See our DePIN mini-course 🤔

Takeaway

Many early-stage DePINs severely underestimate the challenges of data management at scale. They tie up significant engineering resources in building highly custom internal data platforms and end up with a high degree of centralization that contradicts their ethos.

We attribute this primarily to lack of Web3-native data tools in the past.

At Kamu we built the first data middleware that can both stand up to most challenges of scale and allow DePINs to skip this long winding path altogether, so that from day one they can start building with tools that are purpose-made to enable decentralization and personal data ownership.

We believe it’s a major step that will shift DePIN space closer to operating as DAOs, allow them to more effectively engage their communities, unlock more value from data globally, and let us rebuild the global data economy around tightly aligned incentives.

Oracle-Augmented Generation: Connecting AI to Real-Time Verifiable Data

Sergii Mikhtoniuk — Sat, 11 Jan 2025 22:50:58 GMT

In collaboration between Kamu and Brian we are excited to introduce a new technique for connecting LLM-based AI agents to verifiable data we call Oracle-Augmented Generation.

You can find a quick overview of the technique in this video:

https://medium.com/media/8a6fca1ba5405baaff431a6bc0702c81/href

In this post we will quickly set the stage and then dive deeper into technical and implementation details.

The Challenge of Factual Data in LLMs

Large language models (LLMs) have become very powerful reasoning and automation tools. Trained on vast quantities of data they rely on generalization to extract concepts, find rules and patterns, and infer relationships between them.

The model training is akin to multi-dimensional approximation or lossy compression. The generalization of knowledge that gives models their power of high-level reasoning is the same thing that takes away their ability to manipulate accurate factual data points. Asking models to recall specific facts often leads to “hallucinations” — models making up plausible-sounding but false information.

Embedding the world’s factual data into LLMs not only would be impractical from the model size perspective, but because such models take months to re-train they would also be always lagging far behind the real-time data the world is increasingly relying on for decision-making.

Retrieval-Augmented Generation

To address this, many AI systems employ Retrieval-Augmented Generation (RAG), pairing LLMs with vector databases to retrieve contextually relevant information at the time of the query.

Retrieval-Augmented Generation

While this approach improves the factualness and recency of responses, it has notable drawbacks:

Source staleness and credibility — major LLM companies employ RAG mostly to search for relevant web pages, lots of which may contain outdated or unreliable information. While different ranking mechanisms are employed to pick the best results, it is difficult to imagine a sustainable approach to assigning and maintaining such ranks on a global scale.
Complex query conditions — RAG performs best when required information already exists close to its desired form, a form that can be easily processed by an LLM. But the number of questions with non-trivial conditions (e.g. narrowing down geographically, or by time) is so large that we cannot expect a web page to exist for each of them — answering a question correctly may require non-trivial computations over a large amount of data to be performed uniquely for that user.
Opaque data selection and centralized control — The RAG operator has full control over which sources are included in search results and which aren’t, raising concerns about transparency and potential bias. The proprietary data collection pipelines built by LLM companies to improve model training and RAG also have an adverse effect of concentrating an alarming amount of power in very few hands.

Introducing Oracle-Augmented Generation

We propose a new technique called Oracle-Augmented Generation (OAG) that pairs an AI agent with a verifiable analytical data processing system working with a set of trusted data sources.

The term “oracle” here is borrowed from the domain of Web3 and blockchains where “oracle” is a system that acts as a bridge between blockchains and the external world, enabling smart contracts to access off-chain data while providing some form of guarantees of the validity of data. The term “oracle” also exists in computational theory as a black-box function or device that can answer queries that might otherwise be difficult or impossible to compute, as is the case for recalling facts from a generalized language model.

At a high level OAG proceeds in 4 stages:

Context gathering — where in its simplest form the Oracle system is asked to provide top N most relevant datasets to user’s question
Query generation — where the LLM is asked to use dataset metadata to generate a query (e.g. SQL) that computes data that may answer the question
Query execution — where Oracle performs the query and returns a verifiable result
Answer generation — where LLM is asked to interpret the data from the query result to the user
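The four stages above can be sketched as a small orchestration function. This is a hedged outline, not Brian's or Kamu's code: the function name, the callable signatures, and the stubbed oracle and LLM are all hypothetical.

```python
from typing import Callable


def oracle_augmented_answer(
    prompt: str,
    search_datasets: Callable[[str, int], list[dict]],
    generate_query: Callable[[str, list[dict]], str],
    execute_query: Callable[[str], dict],
    interpret: Callable[[str, list[dict], str, list], str],
    top_n: int = 3,
) -> dict:
    """Run one OAG interaction and keep the proof next to the answer."""
    context = search_datasets(prompt, top_n)                     # 1. context gathering (oracle)
    sql = generate_query(prompt, context)                        # 2. query generation (LLM)
    response = execute_query(sql)                                # 3. query execution (oracle)
    answer = interpret(prompt, context, sql, response["data"])   # 4. answer generation (LLM)
    return {"answer": answer, "query": sql, "proof": response.get("proof")}


# Stubs standing in for a real oracle and a real language model
result = oracle_augmented_answer(
    "How many rows are there?",
    search_datasets=lambda prompt, n: [{"alias": "example/dataset"}][:n],
    generate_query=lambda prompt, ctx: f"select count(*) as n from '{ctx[0]['alias']}'",
    execute_query=lambda sql: {"data": [{"n": 42}], "proof": "stub-proof"},
    interpret=lambda prompt, ctx, sql, data: f"There are {data[0]['n']} rows.",
)
print(result["answer"])
```

Note that the model never touches the raw data: it only writes the query and reads the small result, while the oracle does the computation and supplies the proof.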

OAG Sequence Diagram

A key distinction from typical “Text-to-SQL” approaches here is the requirement for oracle to provide cryptographic proofs for context gathering and query execution stages.

Example Interaction

Using the Brian LLM agent and Kamu Node playing the role of an oracle, let’s see what one OAG user interaction may look like.

User Prompt: What was the total trading volume of USDC between Oct 10th and Oct 20th 2024?

Context gathering: Brian agent passes the prompt to Kamu Node’s search API verbatim, and Kamu will perform a semantic search to find datasets that are most relevant to this question, returning datasets like:

kamu/io.codex.tokens.olhcv - DeFi trading data, most relevant to us
kamu/com.cryptocompare.ohlcv.eth-usd - a crypto exchange dataset that has mentions of trading volumes
kamu/com.defillama.tokens.prices - crypto token prices dataset that has mentions of USDC

Brian agent fetches lots of metadata about these datasets from Kamu, including their schemas and column descriptions, readme files, and popular queries.

Although Brian naturally prioritizes DeFi datasets, OAG in Kamu is domain agnostic. Any dataset added by our community becomes automatically available for querying.

Query generation: Brian passes the above context to its underlying language model, prompting it to generate a Postgres-compatible SQL query that answers the user’s prompt.

The model returns:

select
  sum(volume) as total_volume
from 'kamu/io.codex.tokens.olhcv'
where
  symbol = 'USDC'
  and event_time >= '2024-10-10'
  and event_time < '2024-10-21'

Query execution: This SQL is sent to Kamu’s query endpoint and returns a response in JSON:

{
  "input": {
    "query": "select\n  sum(volume) as total_volume\nfrom 'kamu/io.codex.tokens.olhcv'\nwhere ...",
    "queryDialect": "SqlDataFusion",
    "dataFormat": "JsonAoS",
    "include": ["Input", "Proof"],
    "datasets": [
      {
        "id": "did:odf:fed011b209e776577c1688affdab1db2d3bda4822852dcaf9d59d108df8b441544938",
        "alias": "kamu/io.codex.tokens.olhcv",
        "blockHash": "f16206ead4be7fd3c3efbaa3de1c15e303e2ce9f6c2bc605f11e033e83a0206573722"
      }
    ],
    "skip": 0,
    "limit": 100
  },
  "output": {
    "data": [{ "total_volume": 4039963082.961011 }],
    "dataFormat": "JsonAoS"
  },
  "subQueries": [],
  "commitment": {
    "inputHash": "f16207ad1730665365efa3acb77ea33a169e49b292c03a13ae01d2718b7d958afb46b",
    "outputHash": "f1620510c1a4b28136b6f79c971e86c37e9b6ea77833ad511515c7a4e0133e47113b6",
    "subQueriesHash": "f1620ca4510738395af1429224dd785675309c344b2b549632e20275c69b15ed1d210"
  },
  "proof": {
    "type": "Ed25519Signature2020",
    "verificationMethod": "did:key:z6MkkhJQPHpA41mTPLFgBeygnjeeADUSwuGDoF9pbGQsfwZp",
    "proofValue": "uLaanvQHkx5w6yOcLmI-VH1IquEFTMjmlJRqqgd1Na1qYYcb6CIpxERLjtlYRasqiIwL2hg6NAEHMoNz68xwSBQ"
  }
}

The "output" part of the response is the actual result of the query, while the rest of the fields form a cryptographic proof of this request.

In this example we have a proof via reproducibility, where a specific Kamu node (identified by did:key:z6..fwZp W3C DID) commits to having correctly executed the query on the dataset did:odf:fe..4938 at a specific state snapshot (block) f1..3722.
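As a small aid for reading the response, the hash strings in the "commitment" section appear to follow the multibase and multihash conventions: a leading `f` for base16, then the function code `0x16` (sha3-256) and the digest length `0x20` (32 bytes). The sketch below decodes such a string under that assumption. It only checks the structure of a hash; it does not recompute it, since the exact canonical encoding of the hashed content is not described here.

```python
SHA3_256_CODE = 0x16  # multihash function code for sha3-256


def decode_hex_multihash(value: str) -> bytes:
    """Decode a multibase base16 ('f') multihash and return the raw digest.

    Simplification: codes and lengths below 0x80 fit in a single varint byte,
    which is all this sketch supports.
    """
    if not value.startswith("f"):
        raise ValueError("expected multibase base16 prefix 'f'")
    raw = bytes.fromhex(value[1:])
    code, length, digest = raw[0], raw[1], raw[2:]
    if code != SHA3_256_CODE:
        raise ValueError(f"unexpected hash function code: {code:#x}")
    if length != len(digest):
        raise ValueError("digest length does not match the length prefix")
    return digest


# The inputHash value from the response above
input_hash = "f16207ad1730665365efa3acb77ea33a169e49b292c03a13ae01d2718b7d958afb46b"
digest = decode_hex_multihash(input_hash)
print(len(digest), digest.hex())
```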

Other types of proofs, including zero-knowledge can similarly be used in OAG to achieve desired effects.

For a detailed overview of the query proofs mechanism and its properties see our documentation.

Answer generation: LLM is asked to interpret the result for the user given the:

Original user prompt
Context with metadata about the datasets
Generated SQL query
Response data (in our case just [{ "total_volume": 4039963082.961011 }])

LLM returns the final answer: The total trading volume of USDC between October 10th and October 20th, 2024, was approximately $4,039,963,083.

Both the final answer and the cryptographic commitment are saved in the chat history.

Proof saved alongside the agent’s chat history

A hyperlink that Brian includes into the response allows users to quickly decode the proof and audit the query in Kamu’s Web UI.

Auditing proof in Kamu Web UI

User can see:

Which SQL query was executed
What datasets and what snapshots of their state were used in the computation
The result data that was reproduced identically to the past query
The validation status of the proof, that ties the response to one or more nodes that executed the query

The ability of Kamu to provide indefinite reproducibility of queries even for fast-moving datasets relies on ledgerized data structure of the Open Data Fabric protocol. You can find more details in the ODF specification.
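The intuition behind this can be shown with a toy append-only ledger. This is a conceptual sketch, not the ODF metadata chain: the `Ledger` class, its block layout, and the JSON-based hashing are assumptions made for illustration.

```python
import hashlib
import json


class Ledger:
    """Toy append-only ledger: each block commits to its records and its predecessor."""

    def __init__(self) -> None:
        self.blocks: list[dict] = []

    def append(self, records: list[dict]) -> str:
        """Add a block of records and return its hash."""
        prev = self.blocks[-1]["hash"] if self.blocks else None
        body = json.dumps({"prev": prev, "records": records}, sort_keys=True)
        block_hash = hashlib.sha3_256(body.encode("utf-8")).hexdigest()
        self.blocks.append({"hash": block_hash, "prev": prev, "records": records})
        return block_hash

    def records_as_of(self, block_hash: str) -> list[dict]:
        """Return all records up to and including the given block."""
        out: list[dict] = []
        for block in self.blocks:
            out.extend(block["records"])
            if block["hash"] == block_hash:
                return out
        raise KeyError(block_hash)


def total_volume(records: list[dict]) -> int:
    return sum(r["volume"] for r in records)


ledger = Ledger()
ledger.append([{"volume": 10}])
pinned = ledger.append([{"volume": 5}])      # a query is pinned to this block

before = total_volume(ledger.records_as_of(pinned))
ledger.append([{"volume": 100}])             # the dataset keeps moving
after = total_volume(ledger.records_as_of(pinned))
print(before, after)
```

Because history is never rewritten, a query pinned to a block hash sees the same input no matter how much data arrives later, which is what makes re-running it meaningful.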

Current Limitations

Kamu node does not yet provide a proof for the context gathering phase, that is, a proof that it correctly returned the most relevant datasets without any additions or omissions. A reproducibility-based proof similar to the query proof will soon be provided by anchoring the state of all “known datasets” of a node in an ODF dataset and using deterministic embeddings generation and vectorized search algorithms.

OAG vs. RAG

While in RAG the model has to infer the answer directly from unstructured context data, in OAG the model generates query code and delegates the execution to an analytical system. OAG thus can work with much larger volumes of data that would otherwise never fit into the RAG context. This also significantly decreases the likelihood of hallucinating because most computations are performed by a deterministic system.

When it comes to structured data, OAG can work with a significantly wider range of questions, including:

Point-in-time and time-range questions
Complex filters
Statistical aggregations (medians, quantiles, OLAP cubes etc.)

In RAG it’s often hard to tell whether the LLM has reached a correct conclusion based on the context data. Verifying this would require a human to analyze the entire context themselves, defeating the purpose. OAG offers superior auditability and provenance. Even if the answer requires processing terabytes of data, the generated query can be easily inspected and understood. The oracle system can also provide ways to check whether the query is using reliable data sources. As the query usually aggregates many data points into an easily digestible statistical summary or a chart, it’s a lot easier for humans to check whether the LLM’s interpretation of a result is correct.

In OAG, proofs introduce accountability. The query proof described above allows users to hold data providers and the parties that executed the query forever accountable for the validity of the results. Proofs can expose any attempts to alter or withhold data and thus become building blocks for penalizing malicious actors. Succinct cryptographic proofs provided by Kamu remain small even when the result includes millions of data points, and thus can be cheaply stored alongside the chat history of agent responses.

OAG and Kamu for Data Supply Chain Verifiability

The verifiability aspect of OAG is subtle yet transformative. Today companies pay huge sums for data APIs that don’t offer a single mechanism to hold the API provider accountable for the correctness of their data. But as the world increases its reliance on real-time data and taps into more independent data sources, we will inevitably need to consider the consequences that the presence of even a few malicious actors may have in such a system.

While OAG can work with any verifiable database, we are especially excited about the properties it unlocks in combination with a decentralized data processing network like Kamu. Through its use of verifiable stream processing, Kamu extends OAG’s verifiability from a single query to cover an entire data supply chain.

Data supply chain verifiability

Data from a reputable publisher or a blockchain can be processed by multiple third parties, cleaned and combined into higher-order more useful datasets that AI agents will naturally prefer to work with when mining for answers. But thanks to Kamu, the proof of a single query will automatically include provenance proofs of the entire supply chain, no matter how many hands the data went through!

We believe that Kamu and OAG can help elevate the state of data exchange from all of us just scrambling to get any satisfactory data to get by, to us collectively caring about the quality of data and where it comes from, and inventing new ways to organize efficient and trustworthy data supply chains on a global scale.

Role of OAG in AI and Data Economy

We believe that conversational AI agents will soon become the primary way of how humans interact with data. We will increasingly rely on AI to source and distill important information for us. It is therefore important to make sure AI agents play nicely with the global data economy and that all incentives are aligned.

We unfortunately see a big problem in how big AI companies apply RAG today. With LLMs providing us the results we need, many users might no longer feel the need to open the web pages where RAG sourced the information from. This means that the websites RAG sourced data from start losing their traffic and ad revenue. We may soon see data brokers banning AI agents on their websites. This may in turn make major LLM companies use their virtually bottomless pockets to invest into their own proprietary data pipelines. With legislation around AI and IP rights still stuck in limbo, such a degree of centralization would be catastrophic.

We have previously shown [1][2] how a decentralized system like Kamu offers a better foundation for data economy:

Through federated querying it allows us to efficiently combine data from multiple independent sources without the need to move data into one place
Integrated provenance makes answers based on official reputable sources more valuable than unverifiable data from proprietary sources
Verifiable pipelines allow data sourcing, cleaning, and composition to be crowd-sourced to a global community that can self-organize into a new generation of more transparent data brokers and aggregators.

Now with OAG we extend these properties to the AI economy.

Once again, verifiable provenance is playing the key role. After all, a query proof that tells us where every bit of data came from and who participated in computations is the exact information we need to fairly compensate everyone involved.

Provenance-based reward distribution

Provenance provides us a quantifiable way to distribute the revenue upstream to every participant of a data supply chain:

Data publishers
Storage and compute providers
And the global community that builds and maintains the processing pipelines.

Future work

The first OAG prototype has exceeded our expectations, but we have many more ideas to try:

Agentic exploratory data analysis — to help with one of the biggest problems today, where LLMs sometimes struggle to filter data correctly without enough hints about the specific values present in data, we want the agent to be able to decide when it doesn’t have sufficient information to form a query and needs to perform intermediate steps to explore the content of candidate datasets.
Progressive context expansion via knowledge/semantic graph — to help LLM generate correct JOINs between datasets in separate domains we want to extend the metadata with semantic annotations (e.g. RDF).
Fuzzy querying — where SQL layer on Kamu’s side could notice and auto-correct typical mistakes in queries
Fine-tuning with OAG-in-the-loop — as we believe best results can be achieved when LLM is trained in conjunction with an oracle and learns how to use it for best advantage.

Thanks for reading! Please give Brian AI agent a try and let us know what you think on Discord!

Kamu at FIL Dev Summit 2024

Sergii Mikhtoniuk — Mon, 15 Jul 2024 18:50:09 GMT

I had an honor to present at the 4th FIL Dev Summit organized by Protocol Labs in Brussels on July 9–11.
In my talk titled “Supercharging SQL: Global Data Supply Chains for Verifiable AI and Analytics” I make the following key points:
With 80% of AI researcher time spent on data acquisition and prep — AI is a data problem
Advancing AI means moving the world towards the global data economy
Existing data lakehouse model is unfit for global data exchange
Batch processing is the culprit that makes data manual and fragile
Stream (temporal) processing is the solution to make data processing autonomous and composable
By layering Web3 properties on top we can move the world towards data economy based on collectively-owned data supply chains
You can find the full recording here:
https://medium.com/media/4593450b28322df1f23c9cdf5bfa0f39/href
It was really exciting to share the details of several projects we have been working on and to show the trajectory that we’re taking:
Kamu is fast becoming the “Kubernetes for Data” with 4 powerful enterprise data processing engines already integrated into one system
Kamu is the first to combine a blockchain indexer, off-chain data lakehouse, and an oracle under one technology blurring the line between on- and off-chain data
Connecting AI to a web of community-operated factual data supply chains simultaneously solves several major problems that surround LLMs today, like veracity, attribution, and compensation
Our verifiable data processing model is already being used in generative-AI space to fairly distribute rewards to IP owners.
Big thanks to Protocol Labs for organizing the event and to everyone who attended!
It was especially great to meet some of our Discord members in person.
Till next time!
Kamu at FIL Dev Summit 2024 was originally published in Kamu Data on Medium, where people are continuing the conversation by highlighting and responding to this story.

Next Decade of Data: From algebra to calculus, from silos to real-time supply chains

Sergii Mikhtoniuk — Wed, 03 Jul 2024 03:36:48 GMT

This post will be very different from the rest — it’s our team’s attempt to explain the state of the data market as we see it and make some predictions.
It’s a long read.
The data market is damn complex, to the point where I think no one has a complete understanding of it, including data company founders, gatekeepers of capital, and especially the users. So what makes us qualified to even attempt to touch on this topic?
We believe that a lot of this complexity is incidental, not inherent, and can be untangled. By the nature of what we’re building, we interact with a much wider range of users than most, as we try to bridge data between previously non-overlapping areas like scientific data management, government, healthcare, finance, insurance, civil infrastructure, Web3, telecom, automotive, aerospace, commercial fishing, green energy and ESG, and many more. This gave us a good picture of what solutions everyone is using and what aspects of data management they prioritize. As a highly technical team, we can now quickly put each new solution we encounter into a certain “bucket”. Along the way we also developed some opinions on where all of this is going, that we will share today.
This problem is multi-dimensional and impossible to linearize, so excuse me for hopping from one topic to another. Also forgive the many generalizations and the assertiveness of the tone — it’s for the sake of being concise.
On Industry Outlook
Technological transition to data economy
For the past four decades innovation in data engineering was driven almost exclusively by big tech solving their internal problems using analytics over internal data. A revolutionary invention — relational database — was followed by steady evolution towards larger volumes and lower latencies.
We see this trend continuing in the current generation of startups and emerging tech:
Startup takes a popular data solution X
Builds a faster version of X
Open-sources the core
Sells the cloud version
We believe this model will have steeply diminishing returns:
As performance gains stemming largely from Apache Arrow and moving away from the legacy JVM stack are incorporated into all major data engines, and as open-source data formats like Iceberg gain wide adoption — this leaves very little room for differentiation. There is surely enough market for competition — lots of big companies are still transitioning away from the outdated warehouses and you can “catch” them at the right moment — but this will increasingly be a sales race, not a technological one.
The current processing model is reaching its limits — not in performance, but in human cost of operating it, how manual and fragile it is. Its flaws cannot be patched and will require a major re-design of engines, disrupting the established order in the analytics market.
An organization can only derive limited value from internal data, whereas external data holds immense untapped potential. Using internal data organizations can optimize their processes and efficiencies, but only using external data can they act strategically — access customer insights, compare their performance against peers, and better understand their place in the global market. [sloan] [deloitte] [forbes]
Organizations exchanging information with one-another will become the main focus of the next decade in virtually every sector of the economy and we’ll show that the current solutions simply aren’t designed with this in mind.
Today we barely scratch the surface of strategic use of data — the next decade will be the time of transition from internal silos towards the global data economy.
Aside: Don’t we have a data economy already?
Emailing Excel files and PDF reports still remains the primary way of data exchange — a highly manual and error prone process that we often compare to the Rube Goldberg machine. In data startup circles this is considered a “low-tech” segment and not the area to focus on, but we believe that this is a clear indication of a big gap in solutions and a big part of the systemic problem.
Companies that already operate data lakes are not much better off. They often drown in internal data, unable to tell where most data came from and how it was processed. While solving storage and querying, data lakes are low-level tools that don’t provide good mechanisms to manage data workflows, so even internal data exchange often ends up plagued with provenance and quality problems. When companies need to share data with partners, they often resort to highly custom solutions tailored to each consumer, which are costly to develop and maintain. The lack of good privacy frameworks often leads to multiple copies of data being produced at different levels of detail, which adds to the confusion and erodes the “source of truth”.
Lack of interoperable solutions is most noticeable in the scientific and government data space where universities and governments have policies to publish data, but keep developing custom non-interoperable portals. The amount of redundant effort is in tens of billions and data remains hard to access. They do this not because they don’t have money to pay for Databricks or Snowflake — govtech solutions are extremely overpriced — they do so because their requirements around data sovereignty are not met.
REST APIs remain the most widespread way to monetize data, but we think this model is flawed:
APIs are the means of delivering data to applications, they are not fit for data science and analytics.
They are non-standard and always require custom integrations
If the focus of your company is not on selling data — building data API infrastructure is too significant an investment. Many companies in automotive, telecom, and other spaces have very valuable data, but told us they can’t monetize it because of the cost of building and maintaining such infrastructure.
APIs thus favor big players, lead to centralization, and produce data monopolies. It’s not possible for a few small publishers to “pool” data together — APIs don’t compose.
When it comes to publisher-supplier relationships, all companies still operate through bilateral agreements. A health insurer will contract an actuary that in turn has deals with dozens of hospitals to periodically get statistical data. Again, the entry barrier into this system is very high, getting data takes a lot of time, and due to how manual the whole process is — you will likely be seeing data that is long outdated and can barely be trusted. The only thing that tells you that the data you get from an actuary is real is an ephemeral feeling that “they do data as a company, so it must be real”.
So, no — we don’t see the current state as representative of even a small fraction of what the global data economy will look like.
On Transition Path
There of course have been hundreds of attempts to enable data economy by building “global data portals”, and “data marketplaces”, but all of them turned into data graveyards. There has been a resurgence of data marketplace attempts in Web3, but they all focus on using blockchains to experiment with different economic incentive models, leaving data engineering foundation the same. The continuing attempts are evidence that the problem is worth solving, but making actual progress will require substantially rethinking the approach.
We don’t see this transition happening gradually as an evolution of existing practices. Cross-org information exchange is a trustless environment, very different from mostly-aligned incentives within a company — existing solutions were not built with any of these requirements in mind and they are impossible to retrofit. The current solutions for managing the internal data already require organizations to have large engineering teams to sustain — we cannot expect them to also handle the external sharing side. We’ve hit the complexity limit — this won’t scale.
The current state is a local optimum and getting out of it is not going to be easy.
Local optima of modern data
The problems that must be solved during this transition are:
Automation
Composability
Trust & Accountability
Ownership
Decentralization & Interoperability
Privacy
While many VCs are looking for “The Next Snowflake / Databricks” — we don’t think this is what a solution will look like. The gap is certainly wide enough for another behemoth, but thankfully the market will not tolerate a centralized / proprietary solution to this problem. A solution will likely have the form of open formats, a specification that multiple different data engines can adhere to, and a “Fat Protocol” that serves as an interoperability layer for various on-prem-first implementations that respect data sovereignty, unlike the current cloud data platforms. We of course hope that our Open Data Fabric initiative can fulfill this delicate role.
Data lakes vs. supply chains
On Automation
When I say “real-time data” what’s the first thing that comes to your mind? For most, it’s probably the stock market with its high-frequency tickers.
We ask: Why isn’t all data like that? Why do events like the release of quarterly employment reports in the US still cause market swings as traders try to predict what these numbers will look like? Why can’t we get those numbers on an hourly basis?
The answer is — because of the current data processing model that is extremely manual, fragile, and keeps humans in the loop.
To emphasize, even if all data in the world by some miracle was in a single well-structured, perfectly-permissioned database — we don’t think the situation would get significantly better — the current model is unfit for automation at scale.
In our blog post The End of Batch Era we attribute this problem to batch processing — a computational model that accounts for >99% of data processing today. This model doesn’t know how to work with time — the most fundamental dimension of data. With industrial-IoT growing massively, blockchains becoming significant and high-frequency data sources, and all sorts of dynamic consumers like devices, automation, and AI bots in play — we need a better computational framework.
This framework already exists — Stream / Temporal Processing (we prefer the term Temporal Processing to break away from the old solutions that were just working with time-series data and message queues like Kafka that for many people are synonymous with streaming). It is already revolutionizing how big tech works with real-time data, but we think enterprise data companies are again largely missing the point. Yes, it improves latency, but this is actually one of its least fascinating properties.
More importantly:
It provides a solution for dealing with data incompleteness
It allows writing code that processes data continuously while it evolves over time, automatically reacting to backfills, retractions, and corrections
It removes humans from the loop — people design and build pipelines, but don’t move any data manually. Removed from the “hot path” of data they don’t contribute their oh-so-human latency and mistakes to it
It composes — pipelines can have thousands of steps that are reactive, self-healing, and autonomous.
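To make the reaction to backfills, retractions, and corrections concrete, here is a minimal sketch of an aggregate that stays correct as its input changes. The `Change` record, its `op` names, and the `RunningTotal` class are invented for illustration and do not correspond to any particular stream processing engine; real engines handle this with far more machinery (watermarks, event time, state stores).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    op: str             # "append", "retract", or "correct"
    key: str            # hypothetical record identifier
    value: float = 0.0


class RunningTotal:
    """Keeps a sum correct as records are appended, retracted, or corrected."""

    def __init__(self):
        self.records = {}
        self.total = 0.0

    def apply(self, change):
        old = self.records.get(change.key)
        if change.op == "retract":
            if old is not None:
                self.total -= old
                del self.records[change.key]
        elif change.op in ("append", "correct"):
            # A correction is a retraction of the old value plus a new append.
            if old is not None:
                self.total -= old
            self.records[change.key] = change.value
            self.total += change.value
        else:
            raise ValueError(f"unknown op: {change.op}")
        return self.total


agg = RunningTotal()
history = [
    agg.apply(Change("append", "jan", 100.0)),
    agg.apply(Change("append", "mar", 120.0)),
    agg.apply(Change("append", "feb", 90.0)),    # late-arriving backfill
    agg.apply(Change("correct", "jan", 110.0)),  # publisher fixes a value
    agg.apply(Change("retract", "mar")),         # publisher withdraws a record
]
```

Nobody re-runs a job here: each change updates the result as it arrives, which is the property that lets downstream steps react on their own.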
The importance of stream processing technology is comparable to the invention of calculus — it’s a new framework for problems for which previous methods were not a good fit.
If a new data solution does not consider temporality — I usually stop looking.
On Composability
Let’s say with a swing of a magic wand we move millions of existing data publishers and consumers to one platform. This utopia can easily backfire and turn into a systematization nightmare. We immediately run into an M*N quadratic complexity problem where all data is hard to discover, takes a lot of effort to combine, is riddled with quality and trust problems, and every consumer has to deal with these issues themselves. These are the same issues that a poorly executed transition to a data lake leads to, but now on a global scale.
Once again — removing interoperability barriers alone is not enough.
Quadratic complexity explosion in a non-composable model
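A back-of-the-envelope sketch of how fast this grows. The point-to-point count follows directly from the M*N description above. The M+N figure is an assumption made for comparison only: it supposes that each publisher's data is cleaned and harmonized once into a shared dataset, and that each consumer connects once to that dataset. The publisher and consumer counts are made up.

```python
def point_to_point(publishers, consumers):
    """Every consumer discovers, cleans, and vets every publisher on its own."""
    return publishers * consumers


def via_shared_dataset(publishers, consumers):
    """Assumption: each publisher is harmonized once into a shared dataset,
    and each consumer connects once to that dataset."""
    return publishers + consumers


m, n = 1_000, 10_000  # illustrative numbers, not measurements
direct = point_to_point(m, n)
shared = via_shared_dataset(m, n)
```

With these numbers the non-composable model needs ten million integrations, while the shared one needs eleven thousand.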
For the data economy to function we should stop thinking about data as fruits on a stand that change hands, or as parcels sent through a logistics network. Instead we should think about data as plumbing or an electrical grid — a highly-interconnected network of supply chains that flow, diverge, and converge continuously and autonomously. These chains form a DAG, and graphs provide structure and hierarchy.
DAG of streaming data pipelines
This is where composability on processing level that we just touched on turns into composability on a global cross-organizational scale:
Data from publishers is flowing
It’s picked up by the global community of data scientists that combine, aggregate, and extract value from it
Consumers access data on various levels of aggregation
Any retractions or corrections propagate instantly to consumers
Data composition is the only way to solve systematization and quality problems at global scale, yet no mainstream data solution offers a composable model.
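The flow above, including the propagation of a correction to every downstream consumer, can be sketched with a small dependency graph. The dataset names and numbers are hypothetical, each derived dataset is simply the sum of its inputs, and the whole graph is recomputed on every change for brevity, whereas a real system would update incrementally.

```python
from graphlib import TopologicalSorter

# Hypothetical supply chain: each derived dataset lists its direct inputs.
inputs = {
    "city_rides": ["taxi_a", "taxi_b"],
    "country_rides": ["city_rides", "taxi_c"],
}
# Raw publisher data: ride counts made up for illustration.
raw = {"taxi_a": 8, "taxi_b": 4, "taxi_c": 7}


def recompute(raw, inputs):
    """Recompute every derived dataset in dependency order."""
    values = dict(raw)
    for node in TopologicalSorter(inputs).static_order():
        if node in inputs:  # derived dataset: here, the sum of its inputs
            values[node] = sum(values[parent] for parent in inputs[node])
    return values


before = recompute(raw, inputs)
raw["taxi_a"] = 6  # one publisher issues a correction
after = recompute(raw, inputs)
```

The consumer of `country_rides` never talks to `taxi_a`, yet sees the corrected total because the correction travels along the graph.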
On Supply and Demand
Data economy needs a balanced supply and demand. Demand is already high and will only get higher once data access barriers drop. Supply however is critically low. The current technologies created a steep entry barrier and failed to create the right incentives for publishers to share data.
We must recognize that for most companies publishing data will never be the core part of their business — a supplementary income at best. Publishers therefore should be treated as the most fragile and vulnerable persona in the model as any extra effort needed on their end significantly reduces chances of getting the data.
That’s why we believe that a much more opinionated and simpler to use data solution is necessary to fill the gap between Excel and a data lake in these lower-tech companies. Publishing data externally must be a natural extension of internal data workflows otherwise operating overhead will be hard to justify.
Also that’s why we pay close attention to the government use cases (another underserved area), as regulatory requirements may just be the driving force that makes such companies upgrade their data stack to keep up. Privacy-preserving verifiable analytics is essential to make regulations and compliance sustainable without falling into the extremes of a “surveillance state” or a 3rd party audit mess.
There is also what we call a “small publisher dilemma”. The smaller the publisher — the less data they have, the harder it is to find, the more work for a consumer it is to integrate, and therefore the less valuable it is. So, for example, while data from all taxi companies in the world would be immensely valuable — data of one taxi company may not have any demand, thus discouraging the company from ever trying to publish.
The NY Taxis dataset was a clear illustration of how fast things can go once you overcome the initial friction. When the New York government collected and shared this previously inaccessible data it turned out immensely valuable and has since spread to dozens of data portals, was featured in hundreds of conference talks, and benefited many ride sharing startups.
One of many NY Taxis dataset visualizations (by Ravi Shekhar)
Again, we believe that composability is the solution:
Publishers only need to expose data in its raw form
Data scientists can then find publishers within a common domain, clean their data up, harmonize and pool it into higher-order datasets
The data is then provided wholesale to the consumers
Pipelines keep running autonomously as long as publishers continue to provide data
Rewards flow upstream and are shared between publishers and the pipeline maintainers.
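As a rough illustration of rewards flowing upstream, the sketch below walks a dataset's lineage and splits a payment between pipeline maintainers and root publishers. The flat maintainer cut, the equal split between inputs, and all names are assumptions made for the example. The post does not specify a reward formula, and a real scheme would likely weight inputs by their contribution.

```python
def distribute(reward, dataset, inputs, maintainer_cut):
    """Split `reward` paid for `dataset` between maintainers and root publishers.

    `inputs` maps a derived dataset to its direct inputs; datasets absent from
    it are treated as raw publisher data.
    """
    payouts = {}

    def walk(node, amount):
        upstream = inputs.get(node)
        if not upstream:
            # Root of the supply chain: the publisher keeps what arrives.
            payouts[node] = payouts.get(node, 0.0) + amount
            return
        fee = amount * maintainer_cut
        who = f"maintainer:{node}"
        payouts[who] = payouts.get(who, 0.0) + fee
        share = (amount - fee) / len(upstream)  # equal split, an assumption
        for parent in upstream:
            walk(parent, share)

    walk(dataset, reward)
    return payouts


inputs = {
    "global_taxi": ["city_taxi", "taxi_c"],
    "city_taxi": ["taxi_a", "taxi_b"],
}
payouts = distribute(100.0, "global_taxi", inputs, maintainer_cut=0.2)
```

The full reward is always accounted for: whatever maintainers do not take ends up with the publishers at the roots.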
Composability here plays a crucial role of allowing parties other than the publishers to perform most of the heavy-lifting, dramatically reducing the barrier of entry.
Today there are many companies that specialize in sourcing data and data cleaning. By making data composable, this role can be opened up to a global community of hobbyists and researchers, replacing actuaries with transparent, auditable, and verifiable pipelines.
We see composability as foundational to collectively-owned data supply chains, governance of which can be automated through smart contracts (aka Data DAOs) to avoid the complexity of custom data redistribution and revenue sharing agreements and further simplify publisher participation.
On Ownership
Taking composability a step further in another direction naturally brings us to personal data ownership. In a truly composable system raw data can be collected on device / individual level and then be progressively combined into higher-order datasets. For example, data of a commercial fishing company may be a union of datasets of every ship they hire, while the ship’s data is a union of datasets of individual IoT devices on board. As devices are installed, break, get replaced changing vendors — a digital twin of a ship can be seen as an ever-evolving (temporal) graph of streaming pipelines.
Note how this approach is bottom-up — a device is installed and starts writing data to its own dataset, and then this data is incorporated into the supply chains feeding the ship’s digital twin and company analytics. Many digital twin solutions use a top-down approach where a common data schema is designed first and then devices are made to write to it in a specified format. Designing such a common data schema is a tough social exercise of seeking some common denominator and consensus between many device manufacturers — good luck attempting this on a global scale.
In a bottom-up approach:
Data can start flowing immediately
Be quickly aggregated in a way that suits internal needs of a company
And later harmonized for consumption on country / global level
Even if the internal model of a company ends up incompatible with what the rest of the market settled on — a company can easily create and migrate to a new model as all raw data on device level is still preserved.
The permissions in this model work very naturally as well — they can be enforced on per-device level, the control can be delegated to captains, who authorize access by fisheries. When the ship’s contract ends with one company — the read permission for future data can be revoked — leaving the company only with access to data that was gathered while the contract was active. Complete data stays available to captains to base maintenance on, and may or may not be available to the next company that contracts them (e.g. a ZK proof of sufficient maintenance may be all they get).
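The contract-bound access described above can be sketched as a grant with a validity window that is checked against each record's event time. The `Grant` class, the `readable` helper, and the sample readings are hypothetical and exist only to show the idea. In a real system the check would be enforced by the data node, not by the reader.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Grant:
    reader: str
    valid_from: datetime
    valid_until: Optional[datetime] = None  # None: the contract is still active

    def covers(self, event_time):
        if event_time < self.valid_from:
            return False
        return self.valid_until is None or event_time < self.valid_until


def readable(records, grants, reader):
    """Return records whose event time falls inside one of the reader's grants."""
    mine = [g for g in grants if g.reader == reader]
    return [r for r in records if any(g.covers(r["event_time"]) for g in mine)]


# Made-up device readings from one ship.
records = [
    {"event_time": datetime(2024, 1, 10), "engine_temp_c": 71.0},
    {"event_time": datetime(2024, 6, 10), "engine_temp_c": 74.5},
]
grant = Grant("fishery_a", valid_from=datetime(2024, 1, 1))
grants = [grant]
during_contract = readable(records, grants, "fishery_a")

grant.valid_until = datetime(2024, 3, 1)  # contract ends: future data is cut off
after_contract = readable(records, grants, "fishery_a")
```

After the contract ends the fishery still sees the January reading gathered while it was active, but not the June one.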
Individually-owned data will make up the “big data” of global analytics while preserving the IP ownership, enforcing the desired degree of privacy, and maintaining provenance links for equitable reward distribution.
On AI
No matter how sophisticated your AI model is, if you train it with poor data you will get poor results: garbage in, garbage out. Acquiring and preparing data is absurdly hard — AI startups spend >80% of time gathering data instead of designing and tweaking models.
AI is a data problem.
Our theory is that the boom of LLMs is a side-effect of the data problem — it’s a lot easier to build an AI startup that scrapes open information from the web than to go through all the hurdles of getting specialized domain data. Companies that crowd-source data generation, like Shutterstock and Getty, are making $100M annually from AI companies. Meanwhile domain-specific AI stagnates, facing privacy barriers, complexity of bilateral agreements, and custom integrations.
It’s not the algorithms that make a successful AI company — it’s the strategies and business relationships behind data acquisition.
If these barriers are lifted we will see a boom of domain-specific AI applications.
LLMs continue to improve rapidly, but are also rapidly commoditized. Many open-source LLMs exist as alternatives to GPT, and none of the applied LLM startups seem to have a strong defensibility moat, other than how they prepare data for fine-tuning. Their early gains will likely be quickly eaten away by the competition. When dealing with hype technologies it’s better to pay attention to the fundamental infrastructure that supports them. This was the case with semiconductors that surged first due to the blockchain mining and then the AI booms — NVIDIA is the most valuable company in the world as of this writing — and the same will be the case with data infrastructure. Unlike AI models though, data has an inherent and profound network effect — data is a lot more valuable when highly interconnected and can be easily combined with other data. A protocol that solves data exchange problems will have a significant moat due to this built-in network effect.
In Kamu we see the relationship between LLMs and data changing significantly. LLMs are notoriously bad at factual information — they have no defined boundary between facts and “dreaming”. The models are already huge, so it’s simply impractical to try to embed more factual data into them. Even if the size is not a concern — in the constantly changing world a lot of data will already be out-of-date by the time you finish re-training the model. We therefore think that LLMs should be treated as a human interface that can help people access and make sense of factual data, but not necessarily contain it within itself.
A simple example of what we’re prototyping:
LLM receives a question like “what is the median price of a 2BR apartment in Vancouver?”
Using a semantic graph of datasets in Kamu (via RAG) it finds datasets that are most suitable to answer this question
It composes an SQL query to fetch the data
And determines the best representation (verbal, or graphical) to serve the results.
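The steps above can be sketched end to end with stand-ins for the parts that need a model. Everything here is hypothetical: `CATALOG`, `find_dataset`, and `compose_sql` are placeholders for the semantic graph lookup and the LLM, SQLite stands in for the query engine, and the rental prices are made up.

```python
import sqlite3
import statistics

# Hypothetical catalog: dataset name -> description used for retrieval.
CATALOG = {
    "rentals": "price of apartments by city and number of bedrooms",
    "weather": "daily temperature and precipitation by city",
}


def find_dataset(question):
    """Stand-in for semantic retrieval: most words shared with the question."""
    words = set(question.lower().replace("?", "").split())
    return max(CATALOG, key=lambda name: len(words & set(CATALOG[name].split())))


def compose_sql(dataset):
    """Stand-in for the LLM, which would generate this from the dataset schema."""
    return f"SELECT price FROM {dataset} WHERE city = ? AND bedrooms = ?"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rentals (city TEXT, bedrooms INTEGER, price REAL)")
conn.executemany(
    "INSERT INTO rentals VALUES (?, ?, ?)",
    [
        ("Vancouver", 2, 3000.0),
        ("Vancouver", 2, 3400.0),
        ("Vancouver", 2, 2900.0),
        ("Vancouver", 1, 2200.0),
    ],
)

question = "what is the median price of a 2BR apartment in Vancouver?"
dataset = find_dataset(question)
prices = [row[0] for row in conn.execute(compose_sql(dataset), ("Vancouver", 2))]
answer = statistics.median(prices)  # the factual part comes from data, not the model
```

The model only decides where to look and what to ask. The number itself is computed from the dataset at query time, so it stays current without re-training.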
The symbiosis of LLMs and a data supply network can deliver users factual real-time results, without having to frequently re-train the model.
Perhaps even more importantly, it can deliver the results with full provenance information, and therefore enable both accountability and fair compensation of data providers. When this state of technology is achieved we see AI becoming possibly the largest consumer and the primary user interface for data.
The Scarlett Johansson lawsuit against OpenAI once again turned the attention towards provenance of data used for training and compensating the IP owners. We think this is the wrong side to start digging from. You can already establish a link between the AI model and data used for training — currently through versioning / anchoring data and deterministic / verifiable computations. But we also need to establish provenance across all the hundreds of transformation steps data usually goes through before training, linking it to ultimate roots of trust — the individual data publishers. Training an AI model is not that fundamentally different from executing an aggregate function in SQL, thus if we solve provenance at a level of multi-stage data supply chains — AI training will fit into the general framework nicely. And if your data ownership extends to individual level (as we described in Ownership section) — you’ll have a programmatic way to compensate the IP owners. Try starting with the training provenance problem and you’ll end up linking models to data the origin of which cannot be established.
Using data provenance to reward IP owners
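One way to picture provenance across transformation steps is a chain of records where each derived dataset's hash covers its own content, the transformation applied, and the hashes of its inputs. The sketch below is only an illustration of that idea: the `record` and `roots` helpers and the record layout are invented here and are not the Open Data Fabric metadata format.

```python
import hashlib
import json


def record(name, payload, transform=None, inputs=()):
    """Create a provenance record whose hash covers data, transform, and inputs."""
    body = json.dumps(
        {
            "payload": payload,
            "transform": transform,
            "inputs": [parent["hash"] for parent in inputs],
        },
        sort_keys=True,
    )
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    return {"name": name, "hash": digest, "inputs": list(inputs)}


def roots(node):
    """Walk provenance links back to the original publishers."""
    if not node["inputs"]:
        return {node["name"]}
    found = set()
    for parent in node["inputs"]:
        found |= roots(parent)
    return found


# Made-up two-publisher supply chain ending in a trivially simple "model".
a = record("publisher_a", [1, 2, 3])
b = record("publisher_b", [4, 5])
merged = record("merged", [1, 2, 3, 4, 5], transform="union", inputs=[a, b])
model = record("model", {"mean": 3.0}, transform="mean", inputs=[merged])

owners = roots(model)
```

Because each hash depends on the input hashes, changing any publisher's data changes the identity of everything derived from it, which is what lets a model be tied to specific roots.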
And finally, today AI/ML is primarily the end product of data processing — you collect and shape data, train a model, then deploy it to serve users. We think this will change as well. Once we can produce verifiable models with strong provenance we will see more of them being used as intermediate stages in data processing pipelines. We are building Kamu to make AI-enabled classification, feature extraction, alerting, and all kinds of decision making a seamless part of the data supply chains, as simple as running an SQL function.
On Web3
We already stated that the market will (rightfully) resist any centralized / proprietary solution to data exchange. For most organizations NO data exchange is better than having their data held hostage. This naturally brings us to Web3 and the realm of decentralized protocols.
Blockchain investors themselves describe this market as “self-serving” and in the state of a “big casino”. There’s definitely a lot of that. Despite that, Web3 operates much closer to the underlying challenges of multi-party trustless information exchange and already gave birth to several extremely important new technologies like decentralized storage and compute networks, and novel privacy-preserving techniques.
When it comes to data, the biggest issue we see is Web3 companies building solutions for other Web3 companies in isolation, disregarding the transition path for the other 99.9% of the organizations in the world. If a company already operates a data lake based on S3 — why on earth would they want to move their terabytes of data to some decentralized database and lose access to Spark and other vast sets of analytical and BI tools that they built their business around? Our theory is that the current state of affairs is caused by the profile of engineers — most people that come to Web3 today were only exposed to OLTP (transactional) data processing before and create data solutions that unfortunately repeat the same mistakes that OLAP (analytical / warehousing) data processing already evolved out of decades before.
In Kamu we take a very sober, practical look at Web3 and build towards its core values, but with existing enterprise OLAP data users as the starting point. Web3 values are the “north star”, but we need to make sure that the on-ramp is as smooth as possible. With our products, we have already proven that provenance and verifiability can be introduced into modern data lakes in-place, with little to no disruption. Users can continue using their existing storage, their existing tools, and open their data to others and decentralize progressively, when there is a strong business case to do so.
As a quick rundown of technologies:
Blockchains: We see them as OLTP engines for multi-party interactions. Most data is not that. Putting data on blockchain is prohibitively expensive, so this discourages the abuse of blockchains for data storage, although many startups still try. Private / enterprise blockchains are just stopgaps for privacy and scaling issues — we don’t consider them good mediums for data exchange either, as none of them can handle Industrial-IoT volumes and frequencies of data. We use blockchains as a censorship-resistant data catalog, a permission management layer, and a source/consumer of some data (see below), while keeping all data processing off-chain. Data doesn’t need a global ledger. Verifiability of data pipelines has much more efficient solutions than what general purpose blockchains offer.
Oracles: Access to trustworthy off-chain data is essential to make Smart Contracts useful, but The Oracle Problem remains a big barrier. We strongly believe that the oracle problem is a feature, not a bug. The issue of how to hold data providers accountable exists even when you call Google’s REST API from your app — today you have no way of proving that resulting data even came from Google, not to mention disputing it. The oracle problem simply exposed the lack of accountability in modern data exchange in stark contrast with the properties achieved and held sacred by blockchains. Complexity and cost of Oracle Networks is a big factor in the stagnation of the smart contract ecosystem. If the accountability problem is addressed at the core — bridging data to smart contracts will become trivial, opening up millions of new use cases for smart contracts other than just finance.
Indexers: OLTP transactions on the blockchains generate a lot of interesting data. The desire to use it for analytics gave birth to the category of solutions that ingest decoded blockchain data into databases and lakes. They all end up re-centralizing the data and going back to all the trust concerns blockchains are solving. The situation is improving with some indexers using ZK proofs to make some claims verifiable, but this is still very far from general-purpose analytics. More importantly, all indexers provide access to only blockchain data, meaning that if you need to correlate some on-chain data with events happening in the outside world — you are suddenly thrown into the complexity of building and operating your own data infrastructure — an investment few can afford. In this blog post we show how Kamu unifies oracles, indexers, off- and on-chain data under one simple model.
DePIN (decentralized physical infrastructure networks): Some DePINs like weather station networks exist to gather data. For others data is a byproduct crucial for provisioning, monitoring, and service level control. From the perspective of data, the only difference between DePIN and industrial IoT is in where you draw the ownership boundary. While blockchains are used as an incentive layer for these networks, the massive volumes of high-frequency data will have to go somewhere off-chain while preserving verifiability.
On Privacy
Most companies still consider data privacy only in the form of private networks and centralized permission control. In many cases this may be actually enough for them. Many Web3 ideals like using public storage networks where data is openly accessible but protected by encryption are viewed as additional complexity and often a showstopper. Again, we need a smooth on-ramp — industry must accept that it will take time for these approaches to earn customers’ trust — we should meet users where they are.
Only when encountering cross-org data exchange do companies start looking at the domain of structured transparency, of which the previous approach is just a tiny subset. The most popular solutions there still remain Compute-to-Data and Federated Learning — simple ideas of performing computations within the data owner’s infrastructure, without ever sending raw data outside the private network. While posed as the holy grail of medical AI, we still have not seen a single sustainable integration. These solutions require not only advanced infrastructure, but significant technical expertise on the data owner’s side. It’s still hard for me to imagine hospitals en masse ever employing people capable of reviewing the submitted code and AI training routines for PII leaks.
Luckily this field is booming with new techniques that can make structured transparency easier to operate:
Differential Privacy can control how much information is revealed in aggregate throughout the computation and automatically terminate it when it exceeds some threshold without any human input
Software Enclaves and Homomorphic Encryption can allow 3rd parties to operate the compute infrastructure without leaking any information to them
Arguments, Proofs, and Zero-Knowledge enable verifiable and privacy-preserving computing without the need for specialized hardware and the “root of trust”.
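To make the first technique more concrete, here is a minimal sketch of a differential privacy budget that stops answering queries once a threshold is reached, with no human in the loop. It is only an illustration under simple assumptions (counting queries, the Laplace mechanism, basic additive composition of epsilon). The names `PrivacyBudget`, `BudgetExceeded`, and `noisy_count` are hypothetical and do not come from any particular library.

```python
import random


class BudgetExceeded(Exception):
    """Raised when a query would push total privacy loss past the budget."""


class PrivacyBudget:
    """Tracks cumulative epsilon spent across queries (basic composition)."""

    def __init__(self, total_epsilon: float):
        self.total_epsilon = total_epsilon
        self.spent = 0.0

    def charge(self, epsilon: float) -> None:
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        # Small tolerance so float rounding does not reject an exact fit
        if self.spent + epsilon > self.total_epsilon + 1e-12:
            raise BudgetExceeded(
                f"spent {self.spent}, requested {epsilon}, limit {self.total_epsilon}"
            )
        self.spent += epsilon


def laplace_noise(scale: float, rng: random.Random) -> float:
    # The difference of two i.i.d. exponential samples is Laplace-distributed
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(records, predicate, epsilon, budget, rng):
    """Answer a counting query with Laplace noise, charging the budget first."""
    budget.charge(epsilon)  # refuses to run once the budget is exhausted
    true_count = sum(1 for r in records if predicate(r))
    # A counting query has sensitivity 1, so the noise scale is 1 / epsilon
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    budget = PrivacyBudget(total_epsilon=1.0)
    rng = random.Random()
    ages = [34, 71, 68, 45, 80]
    print(noisy_count(ages, lambda a: a >= 65, 0.5, budget, rng))
    print(noisy_count(ages, lambda a: a < 65, 0.5, budget, rng))
    try:
        noisy_count(ages, lambda a: True, 0.1, budget, rng)
    except BudgetExceeded as err:
        print("query refused:", err)
```

The point of the design is that the data owner only has to choose one number (the total budget) up front, rather than review every submitted query.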
We are accumulating a robust set of privacy tools, but the main problem now is that there is no common foundation to integrate them through. When two or three parties are involved in a computation, the likelihood of them using the same tech stack is slim, and we are extremely far from having any standard spec to make verifiable computing interoperable. Until this changes — we will see structured transparency deployed very haphazardly, as custom solutions with hefty price tags, and being close to impractical in terms of cognitive overhead for the operators.
Privacy has to be layered over the open data supply chain protocols. Going the other way around will only fragment the already highly fragmented market.
Conclusion
Every sci-fi movie already captured the desired UX:
Instant, uniform access to millions of data sources
Feeds of relevant data delivered in real-time
Ability to drill down to individual data points and cross-verify veracity of data between sources.
Let’s figure out how to make this vision a reality.
We believe that at Kamu we have largely cracked the problem of what the foundation of this should look like:
Separation of data ownership from storage and compute infrastructure
Composability of stream / temporal processing as the key to automation, ownership, and global collaboration on data, the effects of which will likely be similar to, if not greater than, those of the Open Source Software revolution
Ecosystem based on open formats and protocols, open to many engines and alternative node implementations
Network economy that fairly compensates all participants of supply chains to keep things sustainable.
Unfortunately this foundation is not a gradual evolution of existing frameworks — a lot of things will need to be re-designed.
Surprisingly, for such a complex problem we have not yet encountered many “forks” on the path to the solution. We have been following the same vision for over 3 years and watching the many puzzle pieces we had not even considered (e.g. the oracle problem) fall neatly into place.
Do you agree with the trends and problems we described? Reach out to us on Discord with your thoughts and comments.
Try our demo to see a lot of this new foundation already functioning.
If you’d like to know more technical details of what major drivers influenced the technical design of Kamu — don’t miss our upcoming blog post: “The Grand Convergence: Five silent revolutions that shape the future of data”.
Next Decade of Data: From algebra to calculus, from silos to real-time supply chains was originally published in Kamu Data on Medium, where people are continuing the conversation by highlighting and responding to this story.

Kamu Update: Funding, hiring, tech

Sergii Mikhtoniuk — Thu, 10 Aug 2023 22:28:51 GMT

Next week, Kamu is turning 3 years old! 🎉
When we set out to build the “world’s first global data supply chain” we didn’t expect it to be easy … and it’s not. For the last 7 months we had to channel all our resources into things that help us survive as a company: customers, product, and fundraising, so despite the lack of public updates we were hard at work.
Funding secured
Our efforts paid off — the Kamu project has secured more funding!
Addressing systemic issues of data economy might seem akin to a “hot startup”, but in reality, the Enterprise and Web3 Data markets have grown so complex that few possess a clear perspective on them anymore. Building something outside the established buckets (like data lake, catalog, indexer, oracle) and pursuing a fundamental shift also means “big risks” for investors. If very few are properly equipped to assess these risks, even fewer are willing to take them during the market downturn.
We are deeply grateful to exceptional people in Revere, Protocol Labs, Faber, and Gagra Ventures for taking the time to understand us, for sharing our vision and for believing in our ability to bring it to life.
And a very special thanks to Dell Technologies, Infinity, the University of Groningen, and other members of the “Open Data for Data Science” consortium for their support and for connecting us with so many scientists, field experts, research institutes, and companies.
Growing the team
We not only have the means to keep moving forward but can also finally expand our team!
When members of our community were asking about working with us we previously had nothing to offer. “We worked without a salary for 2 years — join us!” was not the job pitch I ever wanted to give. Things are changing now!
We are now looking for exceptional people in three key areas:
Data Eng (Arrow, DataFusion, Spark, Flink, blockchains) — to work on our novel data formats, protocols, and data engines for verifiable and privacy-preserving data processing.
Backend (Rust, Kubernetes, multi-cloud) — to work on our decentralized data network and its supporting services
Frontend (Angular, GraphQL, data viz) — help us build the first collaborative data processing platform, a “GitHub for Data Pipelines”
Technology updates
CLI Tool
kamu-cli was released in early 2021 as a technology prototype that made (still very novel) stream processing far more accessible and combined it with blockchain-like provenance and verifiability.
We were pleasantly surprised that our tool attracted the attention of a very experienced and technical crowd. So many CTOs, CDOs, and experts from fields like science, healthcare, finance, insurance, and web3 gave it a try. Thank you all for your feedback!
Our plan is to continue to evolve kamu-cli into a powerful “Swiss army knife” for data flows - a new alternative to “data-as-code” tools like dbt that takes an opinionated approach to how data should be managed internally and externally for us to achieve:
Provenance and accountability
Superior automation and data consistency at low latencies
Collaborative data cleaning and aggregation
And other foundational pieces of reproducible science and functioning data economy.
Compute Node
We are now working to evolve the technology you see in the tool into a “deploy-anywhere” data processing service.
Kamu Compute Node is a set of Kubernetes-native applications that can be deployed in any cloud or on-prem to:
Operate the stream processing pipelines for a certain set of data flows
Continuously verify datasets that you are interested in to catch malicious behavior
Serve ad-hoc batch queries and Oracle queries for ingesting data into smart contracts.
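As an illustration of what continuously verifying a derived dataset can mean, the sketch below re-runs a deterministic transformation over the inputs and compares content hashes with what the publisher claimed. This is a simplified model, not the actual Open Data Fabric verification procedure: the JSON-based hashing and the names `content_hash`, `hourly_average`, and `verify` are assumptions made for the example.

```python
import hashlib
import json


def content_hash(rows):
    """Hash rows in a canonical JSON encoding so equal data gives equal hashes."""
    encoded = json.dumps(rows, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def hourly_average(readings):
    """Deterministic example transformation: average temperature per hour."""
    buckets = {}
    for r in readings:
        buckets.setdefault(r["hour"], []).append(r["temp_c"])
    return [
        {"hour": hour, "avg_temp_c": round(sum(vals) / len(vals), 2)}
        for hour, vals in sorted(buckets.items())
    ]


def verify(inputs, transform, claimed_input_hash, claimed_output_hash):
    """Return True only if the inputs match and re-running the transform
    reproduces exactly the output the publisher claimed."""
    if content_hash(inputs) != claimed_input_hash:
        return False  # we are not looking at the data the publisher used
    return content_hash(transform(inputs)) == claimed_output_hash


if __name__ == "__main__":
    readings = [
        {"hour": 0, "temp_c": 10.0},
        {"hour": 0, "temp_c": 12.0},
        {"hour": 1, "temp_c": 9.0},
    ]
    in_hash = content_hash(readings)
    honest = content_hash(hourly_average(readings))
    forged = content_hash([{"hour": 0, "avg_temp_c": 99.0}])
    print(verify(readings, hourly_average, in_hash, honest))
    print(verify(readings, hourly_average, in_hash, forged))
```

This only works when the transformation is deterministic and its definition is public, which is why provenance and reproducibility go hand in hand.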
Compute nodes are the building pieces of the Open Data Fabric network. Unlike blockchain nodes that maintain a single ledger, Kamu compute nodes can form loosely connected clusters based on the vested interest of their operators in certain types of data.
Roles in ODF network
Rather than a flat peer-to-peer, it’s a federated network where:
A university that cares about the reproducibility of science can contribute storage resources to replicate research datasets,
A government can provide compute resources to keep civic healthcare data pipelines running, or
A company using crowdsourced weather data can dedicate resources to continuously verify these communal data pipelines to detect malicious actors.
Our work so far was focused on:
Helm charts — making it easy to install the node in Kubernetes
Closer S3 and IPFS integration — making our engines read and transform externally-hosted data while minimizing the data transfers
Multi-tenancy, authentication, authorization
Work scheduling, capacity control, and backpressure.
As with the rest of our tech, all code is open, and under a time-delayed Apache license which is free to use from the get-go for almost any organization.
Web Interface
We are continuing to build a “window” into the Open Data Fabric network.
You can think of it as a mix of:
GitHub: a place for discovering, collaborating on, and governing data
Etherscan: a place to inspect the dataset metadata ledgers, see verification attestations, and check the state of the network
Snowflake: a place to do ad-hoc exploratory data analysis
And Jupyter Notebooks: providing basic analytics/dashboarding functionality
There will be a public instance coming, but you will always be able to run it yourself. It even comes embedded in kamu-cli so you can run it to explore your local workspace:
kamu ui
Our work so far was focused on:
Covering most of the read & explore functionality (search, pipeline graph, metadata block explorer)
Responsive data querying and exploration
Dataset creation and editing UI, including especially complex pieces of UI for ingesting data from external sources.
Web UI interacts with the Compute Node via GraphQL API which you can also use for all kinds of automation.
GitHub repo
New DataFusion Engine
As you may know, Kamu / ODF does not dictate the use of any specific data processing framework or language. Any engine can be integrated as a plug-in. For our initial engines, we chose Spark and Flink as the most mature implementations of the bitemporal stream processing model.
However, these engines were developed with an enterprise setting in mind. They are designed to operate in large long-running clusters, where a startup time of 10–15 seconds and the memory-hungry JVM didn’t matter that much. The quick checkpoint-to-checkpoint nature of processing in kamu-cli goes against this, with 90% of processing time usually spent waiting for engines to boot … it’s a poor user experience.
Recently we have completed a prototype integration of Apache DataFusion engine — a rapidly-developing Rust-based data processing framework built on top of Apache Arrow with a powerful SQL API. While this engine has some limitations it can be many orders of magnitude faster than Spark or Flink in many use cases.
We are very excited about this new generation of data engines. We hope they will pay closer attention to the temporal / streaming data processing paradigm. And as a project that is constantly seeking common ground among the wide variety of implementations, we are strong supporters of Apache Arrow ecosystem and the Substrait project, aiming to unify how we compute on data.
We are already working on replacing Spark with DataFusion in our data ingestion path, so expect kamu to get much faster!
Fast & Efficient Data Transfer
From the very beginning the ODF protocol has followed a spec-first approach, and we remain committed to designing and building in the open and seeking opportunities for standardization.
As a snippet of this low-level protocol work consider RFC-008: Smart Transfer Protocol. When you think about efficient and standard structured data transfer protocols, what comes to your mind?
Is it the “good” old JDBC/ODBC?
Or REST+JSON API, which are not standard at all?
If you are “in the know”, perhaps Arrow Flight?
There is a big gap in these protocols:
Often the server controlling the access to data is not the one storing it. If our Compute Node (server) is deployed in EKS while all data is in S3 (storage) and clients often want to download large datasets — the entire data will have to flow through the server, creating a lot of extra work and a bigger bill.
ODF Smart Transfer Protocol takes care of this problem by letting the server point the client to a piece of data potentially served by a different party over a different protocol, avoiding proxying. In the above case, it will return to the client a list of pre-signed S3 URLs to download Parquet files from.
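The idea of handing out links instead of proxying bytes can be sketched as follows. This is a toy model only: it does not reproduce the RFC-008 message format or the real S3 signing algorithm. The HMAC-based `presign`, the `storage.example.com` host, the shared `SECRET`, and the `plan_download` and `storage_accepts` helpers are all hypothetical stand-ins.

```python
import hashlib
import hmac
from dataclasses import dataclass
from typing import Dict, List
from urllib.parse import urlencode

# Hypothetical key known to the access-control server and the storage service
SECRET = b"demo-storage-secret"


@dataclass
class TransferItem:
    url: str
    size_bytes: int


def _signature(path: str, expires_at: int) -> str:
    message = f"{path}:{expires_at}".encode("utf-8")
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()


def presign(path: str, expires_at: int) -> str:
    """Build a time-limited URL that the storage service can check on its own."""
    query = urlencode({"expires": expires_at, "sig": _signature(path, expires_at)})
    return f"https://storage.example.com/{path}?{query}"


def plan_download(files: Dict[str, int], now: int, ttl: int = 900) -> List[TransferItem]:
    """Server side: after access checks (omitted here), return links, not bytes."""
    return [TransferItem(presign(path, now + ttl), size) for path, size in sorted(files.items())]


def storage_accepts(path: str, expires_at: int, sig: str, now: int) -> bool:
    """Storage side: validate the link without ever contacting the server."""
    return now <= expires_at and hmac.compare_digest(_signature(path, expires_at), sig)


if __name__ == "__main__":
    plan = plan_download({"data/part-0.parquet": 1024, "data/part-1.parquet": 2048}, now=1000)
    for item in plan:
        print(item.size_bytes, item.url)
```

The server's work is now proportional to the number of files rather than their size, and the client downloads directly from storage.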
Note: The Delta Sharing protocol by Databricks is the closest thing to this, but unfortunately was too Spark-specific for us to adopt it.
In the future, we are planning to extend it to handle in-band data transfer as well, via the Arrow Flight protocol, so that, depending on the query, the system can pick the most efficient way to transfer data to the client.
What’s next?
We will soon be publishing a public roadmap so that you can both follow our progress and provide input for future development!
A few upcoming things I’m most excited about are:
DataFusion-based ingest
Laying the foundation for granular authorization (based on the UCAN protocol)
IPLD migration
Documentation upgrade.
So stay tuned! You can always find us on Discord as well as many other like-minded people happy to chat about everything data-related.
Kamu Update: Funding, hiring, tech was originally published in Kamu Data on Medium, where people are continuing the conversation by highlighting and responding to this story.

Kamu at Compute Over Data Summit

Sergii Mikhtoniuk — Sun, 22 Jan 2023 03:17:56 GMT

I had a great opportunity to present at the Compute Over Data Summit organized by Protocol Labs in Lisbon.
In my talk titled “Structured Data in Web3 — Rethinking databases” I covered:
Major trends in OLTP and OLAP that already build towards what we would consider Web3 properties of data
How the Web3 data community can reuse much of the momentum already happening in modern data lakes
The need to standardize how we represent structured data in Web3
Open Data Fabric as a potential solution to composable processing of structured data
You can find the recording here:
https://medium.com/media/17dcd7ebe958bdc3961cf53a8d02d74f/href
My personal highlight of the summit was when Peter Wang from Anaconda was asked “if you could wave a magic wand — what system you’d want to exist in the world — it can be ANYTHING” … and he went on to describe something that sounds exactly like Kamu:
Global registry of datasets
Data and code combined in one system
Verifiability and provenance of all derived data
I’d also recommend checking out Juan Benet’s talk on the future of COD and the high-energy demo of Bacalhau from David Aronchick.
Big thanks to Protocol Labs for organizing the event and to everyone who attended!
It was especially great to meet some of our Discord members in person.
Till next time!
Kamu at Compute Over Data Summit was originally published in Kamu Data on Medium, where people are continuing the conversation by highlighting and responding to this story.

Kamu Update: We join CDL and CODWG

Sergii Mikhtoniuk — Mon, 12 Sep 2022 18:33:13 GMT

Kamu is selected for CDL program
We are very happy to announce that Kamu has been accepted into Creative Destruction Lab, a program for early-stage science and technology companies. We will be joining the new Compute Stream in Vancouver that focuses on “technologies that will impact humankind in the same ways the printing press, the personal computer, and the internet did”. This is quite an expectation to live up to, and it definitely reflects our mission: to bring humanity to the next level of data-driven decision making and bootstrap the new era of digital economy based on fair and rapid data exchange.
We join Protocol Labs “Compute Over Data” Working Group
Following our graduation from the Faber-Filecoin Web3 accelerator, Protocol Labs (the company behind such amazing projects as IPFS and Filecoin) invited us to join their newly formed Compute Over Data Working Group. We are proud to work alongside many amazing companies to collectively address the problem of decentralized data processing.
Companies in the group tackle a wide variety of problems like:
Sandboxed computation environments that can run co-located with data (in WASM VMs or containers)
Verifiable computations and identifying malicious actors
Decentralized ownership and authorization
Privacy-preserving computations
Web3-native databases
General-purpose compute networks (think decentralized alternatives to AWS)
With so much activity in this space we see a clear niche where Kamu can bring the most value:
Structured data processing: WASM and general-purpose compute are great, but data processing requires much more robust and higher-level primitives. We need interoperable data and schema formats, and data-centric processing languages like SQL that go beyond a single dataset (e.g. in the case of JOINs)
Dynamic data: in addition to one-shot processing tasks, how do we represent dynamic data sources (e.g. IoT devices, medical records) in decentralized and content-addressable storage, and how do we build pipelines that can continuously process them?
Bridging Web3 with the existing ecosystems of enterprise, government, and research data: for Web3 data to become mainstream we need to provide a smooth transition path for existing organizations through the use of standard analytical data formats, languages, and APIs, and allow them to pick a comfortable level of decentralization.
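One common way to reconcile dynamic data with content-addressable storage is an append-only chain of immutable blocks, where each new block points at the previous one and the dataset is identified by its latest head. The sketch below illustrates that general pattern only. It is not the ODF block format, and `BlockStore`, `append`, and `read_all` are hypothetical names for an in-memory stand-in.

```python
import hashlib
import json


class BlockStore:
    """In-memory stand-in for content-addressable storage."""

    def __init__(self):
        self._blocks = {}

    def put(self, block: dict) -> str:
        data = json.dumps(block, sort_keys=True).encode("utf-8")
        address = hashlib.sha256(data).hexdigest()  # address is derived from content
        self._blocks[address] = data
        return address

    def get(self, address: str) -> dict:
        data = self._blocks[address]
        if hashlib.sha256(data).hexdigest() != address:
            raise ValueError("block content does not match its address")
        return json.loads(data)


def append(store: BlockStore, head, records: list) -> str:
    """Add a batch of new records; returns the new head address."""
    return store.put({"prev": head, "records": records})


def read_all(store: BlockStore, head) -> list:
    """Walk the chain back from head and return records in arrival order."""
    batches = []
    while head is not None:
        block = store.get(head)
        batches.append(block["records"])
        head = block["prev"]
    return [record for batch in reversed(batches) for record in batch]


if __name__ == "__main__":
    store = BlockStore()
    head = append(store, None, [{"t": 1, "v": 20.5}])
    head = append(store, head, [{"t": 2, "v": 21.0}, {"t": 3, "v": 21.4}])
    print(read_all(store, head))
```

Old heads stay valid forever, so a pipeline can record exactly which state of a source it has already processed and resume from there.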
Here you can find our introduction and the technology demo we presented to the group.
Technology updates
Our progress on technology in the past two months includes:
Support for publishing and syncing data from IPFS: our first integration with a content-addressable file system went very smoothly, as we originally designed our protocol around this
New chapter in our self-serve demo showcases using Kamu for Web3 data analytics — follow it to build a complex pipeline that combines data from Ethereum blockchain and Web2 data sources
Updates to core protocol’s block structure that allowed us to significantly improve sync efficiency
We’ve put a major effort into our web frontend’s internals — it’s mostly non user facing, but sets us up for rapid feature development to simplify the user experience.
Kamu Update: We join CDL and CODWG was originally published in Kamu Data on Medium, where people are continuing the conversation by highlighting and responding to this story.
