<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Kamu Data - Medium]]></title>
        <description><![CDATA[Decentralized data lakehouse and collaborative data processing network - Medium]]></description>
        <link>https://medium.com/kamu-data?source=rss----7bbcdfcb365---4</link>
        <image>
            <url>https://cdn-images-1.medium.com/proxy/1*TGH72Nnw24QL3iV9IOm4VA.png</url>
            <title>Kamu Data - Medium</title>
            <link>https://medium.com/kamu-data?source=rss----7bbcdfcb365---4</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Mon, 08 Jun 2026 12:20:24 GMT</lastBuildDate>
        <atom:link href="https://medium.com/feed/kamu-data" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Is Data Infrastructure a Bottleneck of Distributed Energy?]]></title>
            <link>https://medium.com/kamu-data/is-data-infrastructure-a-bottleneck-of-distributed-energy-d57ac8e44cf3?source=rss----7bbcdfcb365---4</link>
            <guid isPermaLink="false">https://medium.com/p/d57ac8e44cf3</guid>
            <category><![CDATA[data-engineering]]></category>
            <category><![CDATA[energy]]></category>
            <category><![CDATA[renewable-energy]]></category>
            <category><![CDATA[data-science]]></category>
            <dc:creator><![CDATA[Sergii Mikhtoniuk]]></dc:creator>
            <pubDate>Sun, 24 May 2026 17:31:16 GMT</pubDate>
            <atom:updated>2026-05-24T17:34:05.849Z</atom:updated>
            <content:encoded><![CDATA[<p><a href="https://medium.com/kamu-data/is-data-infrastructure-a-bottleneck-of-distributed-energy-d57ac8e44cf3?source=friends_link&amp;sk=16e58ac285fd0b6775bd75593d6267a6"><em>No-paywall link</em></a></p><h3>DERs everywhere!</h3><p>With the proliferation of renewable power and cheaper batteries, the energy sector is undergoing a transformational change towards decentralization — towards <em>DERs</em>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*pnsATTPzPIJbo3WY.png" /></figure><p>DER (Distributed Energy Resource) can refer to:</p><ul><li>Generation — from solar panels on your house, to a wind turbine, and industrial solar arrays</li><li>Storage — from a car battery and residential home banks, to utility-scale battery storage</li><li>Consumption — from your smart heater and EV charger to a small datacenter.</li></ul><p>The energy sector is a big balancing act of generation and demand. Balancing requires a lot of coordination. Coordination in turn relies on <strong>a lot of data</strong>. 
<a href="https://www.pv-magazine.com/2026/05/08/europes-negative-electricity-price-hours-double-in-q1-amid-renewables-surpluses-market-imbalances/">Negative electricity prices</a> and massive <a href="https://en.wikipedia.org/wiki/2025_Iberian_Peninsula_blackout">grid outages</a>, which have become a common occurrence, show that this process is not working well, due to large generation forecast errors, curtailment, and output coordination issues.</p><p>DERs are exciting because they provide a solution:</p><ul><li>Abundant storage capacity that moves closer to consumers</li><li>Consumption flexibility with smart chargers and devices, and much more.</li></ul><p>To us, as data engineers, DERs are exciting because they require completely re-evaluating the data flows used to monitor and manage current infrastructure, how we integrate the massive number of small-scale devices into a common grid, and how we utilize all devices to maximize efficiency.</p><h3>Energy’s data problem</h3><p>Let’s analyze the current data flows from several perspectives to identify gaps and opportunities to improve.</p><h3>Ownership</h3><p>If you have any IoT devices in your home, you likely know the state of data ownership already. Every device manufacturer wants to tie you into their ecosystem and make you send data into some proprietary platform. You get a few quick dashboards out of the box, but if you want to later switch vendors, or your vendor discontinues their platform <em>(very common in a competitive market)</em> — <strong>you lose your data</strong>. You have to work quite hard to regain control over your data.</p><p>The exact same thing happens on a large scale. If you are building an industrial solar array — you buy inverters from some vendor, and those come with subscription-based proprietary data collection and monitoring platforms.</p><p>We’re talking about a lot of data. 
A 10 MW array with 100 string inverters sampled every 5 seconds may produce over <strong>2M data points per hour</strong> [1]. So your monitoring stack quickly develops <strong>“data gravity”</strong> and becomes a big factor in locking you into one vendor.</p><blockquote>⚠️ I should clarify that I don’t think hardware vendors are evil and aim to use data gravity as leverage. A few major hardware manufacturers I know said they would prefer not to develop their own data solutions at all, and see them as a distraction from their core business. The problem is that there hasn’t been any middle ground between niche “turn-key” data apps (a convenient product that users want) and a “build your own from a thousand pieces” infrastructure (that provides <strong>data sovereignty</strong>).</blockquote><h3>What can we do?</h3><p>There is a clear parallel between the current state of data and where software was a decade ago. Deploying and operating software was an arduous process that required bespoke integrations and tightly coupled you to your infrastructure provider. The Kubernetes project created a new level of abstraction where software components could be packaged as standard-ish lego pieces and deployed in any cloud or on-prem environment, significantly reducing vendor lock-in.</p><p>For <strong>convenience and sovereignty to co-exist</strong> in data we need two things:</p><ol><li>Ability to replicate our valuable data from any app to third-party storage for safe-keeping, or migrate it to another platform. Not a “data export” button that spits out a huge archive file in bespoke formats — a standard protocol to <em>continuously</em> move large volumes of structured data across company boundaries.</li><li>Ability to transfer the data processing logic from one platform to another. The way you clean, aggregate, and enrich your data encodes hard-won domain knowledge about how to interpret raw device readings. 
Currently, when switching to a different platform you are not migrating this logic — you’re rebuilding it, because the execution semantics between platforms differ slightly in many ways.</li></ol><p>The <a href="https://www.opendatafabric.org/">Open Data Fabric</a> (ODF) specification is our attempt to build a <em>“Kubernetes of data”</em>. It’s a layer that wraps many open-source data storage and processing systems into “lego pieces” that compose well. Through standard formats it allows data to be replicated efficiently between storages and systems in near-real-time. Thanks to standardized processing semantics it provides portability across implementations and environments — you can run the same pipeline on a laptop, edge server, or in a datacenter, using tools from different vendors.</p><h3>Interoperability</h3><p>When an IoT manufacturer builds a data app, it’s more of a “toy” with a few functions than a “data platform” — a check mark in their feature completeness list needed to sell a device. There are no real incentives for them to bother with extensibility, storage diversity, data exports and backups, or interoperability with other tools.</p><p>But the data these apps hoard has <strong>hundreds of potential uses</strong> — and you need interoperability to unlock them. If you have ever tried to implement things like smarter heating scheduling for your house, you know that it requires first spending at least a week to build an <strong>alternative pipeline</strong> to move data to a place where it can be used by scripts and other tools.</p><p>The same, again, repeats on a larger scale. Most data tools supplied by industrial hardware vendors come from a long legacy that focused only on monitoring and maintenance. 
As the number of uses for the same data keeps growing <em>(see next section)</em>, you also end up having to build an alternative data pipeline, and at this scale you’ll need a whole <strong>data engineering team</strong>.</p><blockquote>When there were only a handful of large utility companies, having a data team was not a big problem. But in the world of DERs this is no longer viable. Things like generation forecasting or predictive maintenance cannot remain “tools for the big players only”.</blockquote><p>Even at a 10 MW array scale, deviating from the forecasted output by 5% on average can cost the company hundreds of thousands of dollars and <strong>10–15% of annual revenue in imbalance penalties</strong> [2]. No wonder that many solar developers spend years of engineering time on custom data stacks to achieve better forecasting.</p><p>The interoperability problem is arguably even more important on an <strong>inter-company level</strong>. When one firm builds flexibility into their EV chargers and another adds similar energy flexibility into smart heaters — we need common protocols for them to participate in scheduling optimization together. Every company today invents its own APIs, resulting in an <strong>O(n²) explosion of integration complexity</strong>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*jAx3mohXd8CZWqvE.png" /><figcaption>O(n²) complexity of custom integrations vs. interoperability</figcaption></figure><p>The energy sector has many mature standards, but as you’ll see in the next section, data that DERs generate needs to efficiently cross not only site boundaries, but also industry and sector boundaries. And this goes both ways, as DER operators also require access to external (weather, financial, hardware costs) data to plan strategically.</p><h3>What can we do?</h3><p>Existing energy data standards <em>(Modbus, DNP3, OCPP etc.)</em> bundle domain-specific semantics, wire formats, and network protocols into a single specification. 
This works well for device-to-device communication, but not for cross-domain interoperability.</p><p>We’re missing a layer that defines how historical and real-time information is stored and exchanged, independently of semantics and network protocols. A data consumer receiving two feeds, one from a solar plant and one from a financial market, should only focus on domain aspects of each to correctly combine them. Everything else — how records are encoded, transferred, how updates are propagated — would be handled automatically. This is what Open Data Fabric spec defines.</p><p>The closest thing to this today is JSON APIs, but they fall short in many ways. Every API is highly custom — there’s no standard for how data is queried, filtered, aggregated, paginated, how updates are signaled, or how historical data is accessed, which means every integration is effectively built from scratch. Almost every real-world integration of several data sources ultimately involves first moving everything into an intermediate database before any actual analysis can begin. ODF is designed to make that intermediate step unnecessary.</p><p>This approach scales from capturing raw monitoring data, to feeding it to anomaly detection, to aggregating it for reporting and BI, to enriching it with external market, weather, and irradiance forecast data — all in a <strong>uniform way</strong>. Let’s see how this approach also scales when it comes to exchanging data with external parties.</p><h3>Sharing and trading</h3><p>Imagine having data from thousands of home solar setups around the world — it would be <strong>hugely valuable</strong>! 
But you’d really struggle to find a buyer for data from just your house.</p><p>Similarly, data from a solar array has thousands of uses:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*xgXmZrInOSevmahZT8etJQ.png" /></figure><p>Many operators are waking up to the idea of their data being a valuable product on its own, but capturing this value when you run just one site is disproportionately hard. Finding a buyer, negotiating a complex contract and NDA, integrating and maintaining data delivery flows via portals, warehouses, FIX and Websocket APIs, is a lot of work outside of your core business. And this <strong>effort grows linearly</strong> with the number of recipients. That’s what we call a <a href="https://www.kamu.dev/blog/2024-06-27-next-decade-of-data/#on-supply-and-demand">“small publisher dilemma”</a>.</p><p>The energy sector is one of the few cases where data is <em>so valuable</em> that multiple “aggregator” companies exist that source it from individual sites to sell in bulk. They may offer you access to market-wide statistics in exchange, but in the end <strong>most of the value of your data is captured by aggregators</strong>.</p><h3>What can we do?</h3><p>What if we flip the problem on its head and redesign the data infrastructure from a “deliver-to-each-buyer” model to a “publish-once” model? Instead of an operator maintaining separate integrations for every data consumer, the operator advertises a dataset on a public network, defines price, license, and access conditions including what level of aggregation is visible to whom — and any number of consumers can discover and access it through standard protocols. For the operator <strong>the effort is constant</strong> no matter how many downstream consumers exist.</p><p>This changes the <strong>value dynamics</strong> of small datasets. 
If the publishing infrastructure allows the global community of researchers and data scientists to build derivative pipelines that combine data from many individual sites into <strong>higher-order data products</strong> — the operator’s data becomes easier to discover and consume, and <strong>exponentially more valuable</strong>. Every site operator automatically benefits from every subsequent site that joins the publishing network, without any extra work.</p><p>Critically, this doesn’t require sacrificing control: query patterns and aggregation rules defined at publish time let the operator control precisely what is visible at each level, and those rules are enforced by the infrastructure.</p><p>The economics also shift. Instead of aggregators capturing most of the value by virtue of being the only party with the engineering capacity to assemble and resell the data at scale — <strong>value is captured primarily by data owners</strong>. Overpriced integration pipelines will immediately invite competition from the global community of data engineers.</p><h3>Privacy and accountability</h3><p>When energy data is used in so many applications, it had better be reliable. But how can we trust that a party delivers data that is <strong>real and accurate</strong>?</p><p>We could ask for disaggregated data and near-real-time delivery. Add device signing, and you’ll have robust protection against tampering.</p><p>But <strong>raw data is a major privacy and security concern</strong>:</p><ul><li>It exposes the map of your physical infrastructure and reveals when equipment fails</li><li>Advertises your attack surface and when on-site personnel are present or absent</li><li>Provides competitive intelligence about your performance and curtailment patterns for bidding behavior inference and PPA negotiation leverage.</li></ul><p>So data needs to be shared selectively, with levels of anonymisation and aggregation. 
And <strong>this layer is always custom-built</strong> — yet another massive task for your data engineering team.</p><p>But if the data is processed and doesn’t come signed straight from a device — <strong>how can we be sure that it’s not synthetic</strong>, generated to make a quick buck, or that it’s not omitting some critical interval to cover something up?</p><p>In current approaches — we simply can’t. While revenue-grade metering with certified meters directly read by ISOs provides tamper-evident generation records for settlement — there is no such mechanism for the rest of the data. Financial market consumers don’t have any mechanism to keep providers <strong>accountable</strong> for API call responses. Data shared via some warehouse tables can be changed overnight without any trace. This makes data products overall much less valuable, and any doubts about the validity of some value or a chart require months of work to address.</p><h3>What can we do?</h3><p>The accountability gap is a provenance problem — we need to prove that data we aggregated or anonymized for the sake of privacy actually corresponds to original raw device data. But how can we do this without revealing all raw data for audit?</p><p>Enter <a href="https://en.wikipedia.org/wiki/Confidential_computing">confidential computing</a>. If data is cryptographically signed at the device and every subsequent processing step produces a <strong>cryptographic proof</strong> that the computation ran correctly on authentic inputs, then the trustworthiness of a derivative dataset can be confirmed without trusting the parties who handled it. 
A data consumer can independently verify that an aggregated or anonymized dataset is a faithful derivative of real device readings with no gaps, substitutions, or fabrications, regardless of how many processing steps or organizational boundaries it passed through.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*A5QVAXMD1m5isaHpTEu2pg.png" /><figcaption>Anonymized data verifiable back to the original device readings</figcaption></figure><p>The implications for ESG markets are significant. Today carbon offsets or RECs are traceable only as far back as the reporting entity that submitted them to a registry. With device-level signing and verifiable computations, every certificate could be cryptographically linked back to the physical event that created it, like a certified flow rate meter measuring sequestration or carbon capture. This would make greenwashing and certificate fraud structurally difficult rather than merely illegal, as it would require tampering with the certified device itself rather than simply changing data values anywhere in the downstream pipelines.</p><p>ODF seamlessly integrates modern confidential computing techniques like <a href="https://en.wikipedia.org/wiki/Trusted_execution_environment">TEEs</a> and <a href="https://en.wikipedia.org/wiki/Zero-knowledge_proof">Zero-Knowledge Proofs</a> into data processing pipelines, with pilot projects conducted in the carbon offsets market.</p><h3>Recency, latency, automation</h3><p>In our team we often discuss how power-hungry AI workloads and cryptocurrency mining could be instantaneously dispatched to datacenters in areas where energy prices go negative as a perfect dump load. But a quick reality check shows that most operators today still handle day-ahead scheduling manually. Even if ML forecasting is used — it still mostly runs on the same 24-hour schedule and with manual sign-offs. 
There’s <em>such a big gap</em> between where we want to be with DERs and where we are right now…</p><p>Unlocking the potential of DERs will require:</p><ul><li>High degree of <strong>human-out-of-the-loop automation</strong> — to push latency to minutes/seconds</li><li>High degree of <strong>traceability and auditability</strong> — to debug and gain confidence in said automation.</li></ul><p>The current state of data tooling just doesn’t cut it. Outside of monitoring stacks, most tools operate on a fundamentally batch-oriented model — jobs are scheduled to run at fixed intervals, reprocess all data from scratch, at certain cutoff times, producing outputs that are stale before they’re consumed. No matter how much you automate — this model requires a lot of care around execution order, ensuring consistency across datasets that were frozen at different moments. A human has to be in the loop to maintain this extremely fragile machinery and to recover it after failures.</p><h3>What can we do?</h3><p><a href="https://www.kamu.dev/blog/end-of-batch-era/">Temporal processing</a> — a new computational model alternative to batch — treats all pipelines as continuous flows where each processing step produces outputs incrementally as new data arrives, with the infrastructure guaranteeing consistency and execution order automatically. It’s like switching from algebra to calculus — a better way to express computations that involve passage of time. 
It isn’t just a latency improvement — it’s the only way to make genuine human-out-of-the-loop automation possible.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*JngA03_m-0b-QzPs.png" /><figcaption>Batch flows requiring constant human attention</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*TJ61fb0AK0rO36Tz.png" /><figcaption>Temporal flows are configured once and run autonomously</figcaption></figure><p>If we ever hope to meaningfully track ESG scope 1, 2, and 3 emissions — across an operator’s own assets, their energy purchases, and their supply chain — and do so without needing an army of specialized accountants crunching numbers day and night — we need to build temporal pipelines that can be reviewed, certified, and continuously improved. Any operator should be able to reuse such pipelines like an open template — plug in their monitoring inputs, and get output data for disclosures at <strong>minimal cost of compliance</strong>.</p><p>Temporal processing combined with confidential computing would make ESG disclosures both <strong>verifiable and privacy-preserving</strong>.</p><h3>Fragmentation</h3><p>Renewable energy certificate registries (aka REC, GO, REGO) are a consequential example of the data fragmentation problem. Globally, RECs exist in dozens of incompatible regional systems <a href="https://www.kamu.dev/use-cases/energy/#fn:3">[3]</a> — each with its own data model, eligibility rules, temporal granularity, and identifier namespace. These registries don’t interoperate. 
A corporate buyer with operations across multiple continents must manually reconcile fundamentally incompatible certificate schemas, rely on intermediaries to convert certificates across registry boundaries — a process that is slow, expensive, and administratively rather than technically verified — and trust that bilateral agreements between registries prevent the same megawatt-hour from being claimed twice in two different markets.</p><p>The result is that the global renewable energy market runs on a foundation of administrative trust and manual reconciliation rather than technical verifiability, which limits how sophisticated it can become: real-time certificate markets, automated ESG reporting, and meaningful participation by small DER operators are all structurally out of reach until a common interoperability layer exists beneath the registries.</p><h3>What can we do?</h3><p>Many blockchain-based solutions tried to address the registry fragmentation problem. A decentralized tamper-proof ledger sounds like a natural fit at first, but in practice, even specialized enterprise blockchains struggle with large volumes of data — they are designed for transactional rather than analytical processing. As the DER space moves from high granularity assets (1 REC = 1 MWh) and infrequent settlements to more real-time interactions — a single-ledger model would face scaling issues. And all data would still need to be copied into databases and lakehouse tools to be searchable, queryable, and interoperable with BI and data science tools.</p><p>A conventional SQL lakehouse architecture, on the other hand, is well understood and widely deployed. The cryptographic verifiability layer we discussed above can ensure that the data that lands in regional registries is trustworthy. 
Thanks to tamper-proof properties, not even registry admins will be able to modify it without a trace.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/703/0*TdHYJf0O-I_D2Oc4.png" /><figcaption>Federated registry of verifiable renewable energy certificates</figcaption></figure><p>Through a layer of <strong>federation</strong>, data from multiple registries can be harmonized and integrated together, and made easily available and efficient to query. Such integration pipelines can exist purely as code that deterministically reconstructs the dataset. It does not need to be governed by an authoritative body. It doesn’t even need any infrastructure! It can be maintained as a pure open-source initiative.</p><p>In this model every registry will retain full autonomy in evolving its data schemas, while federated pipelines will provide it with <strong>instant feedback</strong> on how its changes affect the harmonization layer and whether they break downstream consumers. The whole system will become more uniform over time by the nature of improved visibility.</p><h3>So is data a bottleneck for DERs?</h3><p>There are many obstacles to DER adoption, and data infrastructure will likely not be at the top of any operator’s list of bottlenecks. But viewing data merely “as one problem among many” understates how deeply <strong>data problems compound the others</strong>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*5HDVY4QsU4efkf93ghZmvg.png" /></figure><p>These problems scale nonlinearly with DER proliferation and the O(n²) integration complexity becomes unmanageable at millions of small assets.</p><h3>Conclusion</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*OfBuEfTUWTaEgyZE.png" /></figure><p>The negative electricity prices and grid failures aren’t engineering failures — they’re coordination failures. The physics of renewable generation is well understood. 
The hardware to store and dispatch it is getting cheaper. What’s missing is the connective tissue: the ability to observe, verify, and act on the behavior of millions of distributed resources in real time, across organizational and jurisdictional boundaries. Preferably without armies of engineers needing to maintain it like a giant Rube Goldberg machine.</p><p>The problems we’ve described — vendor lock-in, quadratic integration complexity, the small publisher dilemma, the accountability gap, registry fragmentation — look like separate issues when viewed in isolation. But they share a common root: the absence of a sector-agnostic layer that handles how data is managed, processed, and exchanged independently of what the data means or which domain it comes from. Every sector that has developed such a layer — compute infrastructure with Kubernetes, package distribution with language registries, web communication with HTTP — has seen coordination costs collapse and innovation accelerate. Data infrastructure hasn’t had its equivalent yet.</p><p>When it does, a different set of possibilities will open. A small solar operator publishes monitoring data once and reaches forecasters, insurers, researchers, and carbon markets without maintaining individual relationships with any of them. A utility approving a new DER interconnection has auditable, verifiable performance history from comparable assets rather than conservative assumptions based on manufacturer specs. A corporate buyer retiring a renewable energy certificate can verify its provenance back to a signed device record rather than trusting a chain of administrative reports. An ESG disclosure that currently requires a quarterly accounting exercise runs continuously and updates automatically as new generation data arrives. 
A grid operator dispatching thousands of EV batteries during a frequency event has sub-second verified telemetry from each one rather than stale batch reports.</p><p>None of these require new hardware or new physics. They require data infrastructure that treats verifiability, interoperability, and automation as first-class properties rather than afterthoughts.</p><p><a href="https://www.opendatafabric.org/">Open Data Fabric</a> is our attempt to specify this layer — an open, sector-agnostic spec that is already in use in <a href="https://www.kamu.dev/blog/2025-06-30-molecule-desci-berlin/">biotech</a>, <a href="https://www.kamu.dev/blog/2025-01-08-oracle-augmented-generation/">finance</a>, and <a href="https://www.kamu.dev/blog/2025-08-23-depin-data/">IoT</a> contexts. We think energy is where it’s needed most urgently, and we’d welcome the <a href="https://www.kamu.dev/contact/">conversation</a> with anyone working in this problem space.</p><ol><li>10 MW ÷ 100 kW inverters × ~30 Modbus tags × 3,600s / 5s = 2.16M raw records per hour prior to aggregation.</li><li>Solar forecast errors of 5% nMAE are typical for sites without calibrated plant models <a href="https://nrel.gov/publications">[NREL]</a>. In CAISO’s settlement structure, deviations are settled at a 10% penalty multiplier above or below real-time prices <a href="https://www.caiso.com/legal-regulatory/business-practice-manuals">[CAISO BPM]</a>. 
Figure is highly ISO-dependent.</li><li>WREGIS in the western US, M-RETS in the midwest, GATS in PJM, Guarantees of Origin in the EU, I-RECs for international markets, and many more.</li></ol><hr><p><a href="https://medium.com/kamu-data/is-data-infrastructure-a-bottleneck-of-distributed-energy-d57ac8e44cf3">Is Data Infrastructure a Bottleneck of Distributed Energy?</a> was originally published in <a href="https://medium.com/kamu-data">Kamu Data</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Open Data Fabric Protocol & Consortium]]></title>
            <link>https://medium.com/kamu-data/open-data-fabric-protocol-consortium-1d654625ca15?source=rss----7bbcdfcb365---4</link>
            <guid isPermaLink="false">https://medium.com/p/1d654625ca15</guid>
            <category><![CDATA[data]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[data-engineering]]></category>
            <category><![CDATA[open-data]]></category>
            <category><![CDATA[github]]></category>
            <dc:creator><![CDATA[Sergii Mikhtoniuk]]></dc:creator>
            <pubDate>Fri, 27 Feb 2026 01:53:06 GMT</pubDate>
            <atom:updated>2026-02-27T01:53:06.637Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*lOuFDmofqAijlp0LJqzEog.png" /></figure><h3>Open Data Fabric Consortium</h3><p><a href="https://medium.com/@mikhtoniuk/1d654625ca15?source=friends_link&amp;sk=8da9342196fdf1c535b0df34e51a821b">Read for free</a></p><p>Feels like just yesterday <a href="https://kamu.dev">Kamu</a> set out on an ambitious mission to unlock collaboration in data on a scale similar to the open-source software revolution.</p><p>Very early in our journey we <a href="https://medium.com/kamu-data/introducing-open-data-fabric-eaf9fdcd3903">introduced</a> the Open Data Fabric protocol as a stand-alone open specification <em>(aka “HTTPs of multi-party data exchange”)</em>. We envisioned it to be developed by the community, for the community, and to belong to no one.</p><p>Since then, these efforts have grown into a large consortium of contributors and <strong>it deserved a new home</strong>: <a href="https://opendatafabric.org">https://opendatafabric.org</a></p><p>Big thanks to our partners for supporting the vision of an equitable and verifiable data economy.</p><p>Check out the website and enjoy the new introductory video:</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FZyiZVfuVM9E%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DZyiZVfuVM9E&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FZyiZVfuVM9E%2Fhqdefault.jpg&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/c822aedbc6be5dfa735480c7b5c25113/href">https://medium.com/media/c822aedbc6be5dfa735480c7b5c25113/href</a></iframe><p>Want to discuss or contribute? 
Join the conversation on <a href="https://discord.gg/nU6TXRQNXC">Discord</a>.</p><p>Till next time!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=1d654625ca15" width="1" height="1" alt=""><hr><p><a href="https://medium.com/kamu-data/open-data-fabric-protocol-consortium-1d654625ca15">Open Data Fabric Protocol &amp; Consortium</a> was originally published in <a href="https://medium.com/kamu-data">Kamu Data</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Kamu Update — A year-worth of features]]></title>
            <link>https://medium.com/kamu-data/kamu-update-a-year-worth-of-features-61565a626301?source=rss----7bbcdfcb365---4</link>
            <guid isPermaLink="false">https://medium.com/p/61565a626301</guid>
            <category><![CDATA[open-data]]></category>
            <category><![CDATA[desci]]></category>
            <category><![CDATA[data-engineering]]></category>
            <category><![CDATA[blockchain-technology]]></category>
            <dc:creator><![CDATA[Sergii Mikhtoniuk]]></dc:creator>
            <pubDate>Mon, 17 Nov 2025 03:03:30 GMT</pubDate>
            <atom:updated>2025-11-17T03:03:29.238Z</atom:updated>
            <content:encoded><![CDATA[<h3>Kamu Update — A year-worth of features</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/500/0*jSq_1dTp2Nh_C9dH.png" /></figure><p>Hey, it’s been a while since our last update.</p><p>We accumulated tons of exciting news, so let’s dig in!</p><h3>But first, what is Kamu?</h3><p><em>What’s the first thing that comes to your mind when I say “data”?</em></p><p>If it’s <strong>Excel spreadsheets</strong> — then imagine one that keeps a real-time record of events that happen in your business, and like an accounting ledger allows you to “time-travel” to see what your data looked like at any specific point in time.</p><p>If it’s thousands of <strong>CSV files</strong> — then imagine a data format that allows you to represent highly dynamic data (e.g. IoT), is strictly typed, 1000x more compact, can be efficiently queried like a database, yet remains easy to share and copy between systems, and embeds cryptographic proofs of who owns and who modified each dataset.</p><p>If it’s <strong>Jupyter notebooks</strong> — then imagine a special notebook with code that cleans, harmonizes, and combines data from several different dynamic sources. You can share this notebook with other people, thus providing them with both a stream of continuously refined data AND an explanation of where this data comes from.</p><p>If it’s an <strong>enterprise data lakehouse</strong> — then imagine a peer-to-peer network connecting lakehouses in multiple companies, allowing them to share data with subsidiaries and partners from a “single source of truth” with flexible privacy controls, and build verifiable ETL pipelines that span across company boundaries.</p><blockquote><strong>Kamu is a novel data supply chain technology</strong> focused on how data travels across organization boundaries. 
It aims to unlock collaboration on data similar to the Open Source Software revolution and build a collaborative data economy based on privacy, clear ownership, and verifiable trust.</blockquote><p>The technology itself is pretty simple. Explaining it still requires many analogies only because nothing like it existed before (see <a href="https://www.kamu.dev/blog/2024-08-28-intro-to-web3-data-engineering/">step by step technical introduction</a>).</p><h3>New Partners and Exciting Use Cases</h3><p>We’ve seen way too many companies (especially in Web3) that spend years developing complex technologies only to realize there is no market for them.</p><p>At Kamu we believe that a healthy tech company should minimize the time between developing a feature and putting it to good use in production.</p><p>In the face of limited resources and a small team, the best way to do it is to build meaningful <strong>partnerships</strong> with visionary companies, solve their immediate needs, and grow together towards larger common goals.</p><p>I’m happy to announce our continued collaboration with <a href="https://molecule.xyz/">Molecule</a> in <strong>scientific data sharing</strong> (see <a href="https://www.youtube.com/watch?v=d-Nha1sO0tM">DeSci Berlin talk</a>).</p><p>🧬 <em>100+ Biotech startups are trusting Kamu with their research data!</em></p><p>We have collaborated with several startups in the <strong>Decentralized AI</strong> space:</p><ul><li>To connect LLM agents to real-time analytical data (see <a href="https://www.youtube.com/watch?v=RuQQNAgfOlY">OAG demo</a>)</li><li>To provide a provenance trail for GenAI artifacts for copyright and IP infringement protection</li><li>To power tokenomics models that reward data owners and model authors for GenAI artifacts</li></ul><p>We are also starting to collaborate with several <strong>DePIN</strong> companies to unlock private data ownership and <a href="https://www.kamu.dev/blog/2025-08-23-depin-data/">democratize DePIN data 
pipelines</a>, and <strong>Industrial IoT</strong> companies that need to disseminate high-frequency sensor data.</p><h3>R&amp;D</h3><p>These collaborations have pushed us hard in terms of features needed from <a href="https://docs.kamu.dev/node/">Kamu Node</a>. Nearly all our efforts this year went into turning the technological foundation we created with <a href="https://docs.kamu.dev/cli/">Kamu CLI</a> into a <strong>scalable back-end solution</strong>.</p><h4>Flow System</h4><p>Just as GitHub hosts millions of repositories and runs their CI actions, our public node is intended to host millions of datasets and ingest, process, and validate their data.</p><p>Given that pipelines need different engines (Flink, Spark, DataFusion, etc.) and different <em>versions</em> of those engines to execute, and we don’t have infinite hardware to keep them running all the time — we need a smart way to schedule these computations.</p><p>Enter the <strong>Flow System</strong>. The Flow System is <strong>the heart of Kamu Node</strong> that beats to drive computations forward.</p><p>Computations can be of many types, like polling ingestion, derivative transformations, verification, and compaction. 
The Flow System intelligently schedules and prioritizes them and spawns execution tasks that are processed according to available capacity.</p><p>We decided that the most important thing in the design of this system was <strong>explainability</strong> — understanding why some actions happened when they did, and what actions <em>will</em> happen soon.</p><p>Happy to say that flows in Kamu offer the best explainability I’ve seen in any job scheduler:</p><ul><li>You can see the upcoming flows and the estimated time when they will start executing</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*cT9Ld8UYGqFrObZq.png" /></figure><ul><li>You can understand what events triggered a flow and reconstruct the graph of updates</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*qETl64eZ0ze4mW4x.png" /></figure><ul><li>You can see all queues, batching, and scheduling decisions that were made prior to task execution</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*eOxvS39FbLhnqdZw.png" /></figure><p>We are working to make flows easy to monitor, so you can keep your complex ETL pipelines in working order.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*vnUemYyT3I1vldU6.png" /></figure><h4>Private Datasets &amp; Auth</h4><p>Some data we work with today is highly sensitive:</p><ul><li>In the case of Biotech startups it may contain technology secrets, pre-patent discoveries, and financial data</li><li>In the case of GenAI — it may contain internal company materials used for inference.</li></ul><p>So we extended Kamu Node to finally support <strong>private datasets</strong> and a flexible sharing system, based on a robust ReBAC authorization mechanism.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/681/0*fDljmP7mEBvB-ASH.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*6pLOJyak40yOJ3Ni.png" /></figure><p>You can generate <strong>access 
tokens</strong> for scripting and automation.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*0Jw8zR-jl6146RYo.png" /></figure><p>CLI now supports <strong>OAuth2 Device Flow</strong> to authenticate with remote nodes.</p><p>And you can now authenticate with Kamu using <strong>cryptographic wallets</strong> — a major step we took towards <a href="https://www.kamu.dev/blog/2025-08-23-depin-data/">personal data ownership</a> in DeSci and DePIN use cases.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*h-HNOgCi2X4tR6nc.png" /></figure><p>This is only the beginning of our privacy (<a href="https://arxiv.org/abs/2012.08347">structured transparency</a>) roadmap.</p><h4>Python Client</h4><p>We have released the first version of our <a href="https://github.com/kamu-data/kamu-client-python">official Python client</a> library:</p><p>Now you can:</p><ul><li>Query data from Python scripts and notebook environments</li><li>Query data from local workspaces and remote nodes</li><li>Select between multiple SQL engines</li></ul><p>A minimal Jupyter notebook example is:</p><pre>%load_ext kamu<br><br>import kamu<br><br>con = kamu.connect(&quot;grpc+tls://node.demo.kamu.dev:50050&quot;)<br>con.query(&quot;select * from &#39;kamu/co.alphavantage.tickers.daily.spy&#39; limit 3&quot;)</pre><p>The client library is based on ADBC and FlightSQL — the most efficient protocols to transfer columnar data out there.</p><p>It works in any Python notebook environment — try running this <a href="https://colab.research.google.com/drive/1WQqZJsPQpipU4kW6SPea9H2qmHH4rF8k?usp=drive_link">example notebook</a> in Google Colab.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*lqMzqqA_nYhlzYcy.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*flfv-gYS5CvM-CuE.png" /></figure><h4>Verifiable Query API</h4><p>Imagine running an SQL query that aggregates gigabytes of data, or better yet — JOINs several huge 
datasets that belong to <strong>different organizations</strong>. The computations to deliver this result may span across several independent Kamu nodes, located in different countries and operated by different people with different incentives.</p><p><em>How can you trust such a result?</em></p><p>Enter <strong>Verifiable Query API</strong>. When you enable this mode — every query response from Kamu will be accompanied by a cryptographic proof of result validity.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*SrxXEvWy9QY9RdM0.png" /></figure><p>These proofs are succinct, much smaller than the result data itself (just a few kilobytes), but having them allows you to hold <strong>every participant</strong> of the entire supply chain that delivered you the response <strong>forever accountable</strong> for it.</p><p>Our novel <a href="https://docs.kamu.dev/node/commitments/">Structured Recursive Commitments</a> scheme makes it possible to validate a response and assign blame to a specific actor in a multi-party collaboration.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*zFTH-rkEpJszPdEk.png" /></figure><p>Currently we support an optimistic execution scheme (proofs based on signatures, determinism, and reproducibility), but in the near future we will be expanding it to TEEs, and later to Zero Knowledge proofs.</p><p>This proof mechanism is fundamental to <a href="https://docs.kamu.dev/node/protocols/oracle/">ODF Oracle</a> — a system for querying data from blockchain smart contracts.</p><h4>LLM Connectivity via MCP Server</h4><p>Kamu, in collaboration with <a href="https://www.brianknows.org/">Brian</a>, has developed a <a href="https://www.kamu.dev/blog/2025-01-08-oracle-augmented-generation/">new technique</a> to connect LLMs to verifiable real-time factual data.</p><p>The <strong>Oracle-Augmented Generation</strong> technique:</p><ul><li>Enables LLMs to execute complex analytical queries over large volumes of real-time 
data</li><li>Makes AI answers auditable and far less prone to hallucination</li><li>Offers full data supply chain provenance.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*BfoVTUOwvISoEojt.png" /><figcaption>Proof saved alongside the agent’s chat history</figcaption></figure><p>We believe this technique will be fundamental to granting AI agents more autonomy, without losing control and the ability to reason about their actions.</p><p>It will likewise be fundamental in building an equitable data economy where owners of information and pipeline maintainers are fairly rewarded for their contributions in answering questions accurately.</p><p>The best thing is that you can use it just like any other MCP server.</p><h4>Semantic Search</h4><p>The first important step in our LLM connectivity technique was to narrow down the search space of datasets that can potentially answer a certain question from thousands to a few dozen, so as not to overload the LLM context window. We achieved this by implementing a semantic search API to find the most relevant datasets.</p><p>Metadata about the dataset like its description, schema, and example queries is converted into vector embeddings and stored in a vector database for efficient querying and ranking.</p><p>You can now enter free-form text in the search box to find datasets that are close to the desired theme:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*g8PU6vf8Ihf6VKNW.png" /></figure><h4>Database-backed Metadata Catalog</h4><p><a href="https://docs.kamu.dev/odf/spec/#dataset-layout">ODF file format</a> standardizes dataset layout on disk and allows datasets to be interoperably shared and replicated across different environments. 
But when datasets are in your Kamu Node — scanning metadata as individual files is not very efficient.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/670/1*o04r0VWY6JNPM0jwcagAJg.png" /><figcaption>Structure of an ODF dataset</figcaption></figure><p>We introduced a new metadata catalog that acts as a write-through cache in Postgres and MariaDB databases. This is very similar to multiple metadata catalog implementations for formats like Apache Iceberg, but in our case it stores the cryptographic metadata chain blocks, refs, dataset summaries, the dependency graph, and a lot more.</p><p>This change dramatically improves performance of most operations. It also ensures atomic transactions, which are hard to achieve in pure-storage catalogs like S3.</p><p>Kamu CLI also benefited from these changes, as it now uses an SQLite-based implementation of such a catalog.</p><h4>Webhooks</h4><p>As data moves through the pipelines in Kamu — there are multiple use-cases for changes in datasets to drive some automation. For this we have added a full-fledged Webhook mechanism that supports message signatures (<a href="https://www.rfc-editor.org/rfc/rfc9421.html">RFC 9421</a>) and content digests (<a href="https://www.rfc-editor.org/rfc/rfc9530.html">RFC 9530</a>) for security.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*QMHuYE2eiaSac6_V.png" /></figure><h4>Variables and Secrets</h4><p>We added support for variables and secrets that can be passed down into ingestion tasks, making it easy to interact with data APIs that require some form of authorization. All secrets are stored securely encrypted.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*Sw0sFnPq7jtdrKa4.png" /></figure><h4>Usability</h4><p>My (very biased) feeling is that we struck the right design and exact right combination of technologies to move the stagnating world of data into new territory. 
As a software architect, never before have I seen so many pieces of the puzzle falling neatly together, even those that we had not anticipated at all.</p><p>Our biggest challenge admittedly will be making the product extremely easy to use. Accessibility, rapid feedback, and guiding users to the right actions will be the key to <a href="https://www.kamu.dev/blog/end-of-batch-era/">undo the decades of dominance</a> of the batch processing model.</p><p>A few highlights in this area include:</p><p>Ingestion via <strong>file uploads</strong> allows you to create root datasets by dragging &amp; dropping files.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*IIwFsGsHNKzpxEkQ.png" /></figure><p><strong>Email gateway</strong> support will notify you about changes to your account and the state of your pipelines.</p><p><strong>Documentation overhaul</strong> including:</p><ul><li>Much better structure of the <a href="https://docs.kamu.dev/">documentation portal</a></li><li>Built-in quick search</li><li>Inclusion of <a href="https://docs.kamu.dev/odf/spec/">ODF spec</a> and <a href="https://docs.kamu.dev/odf/reference/">metadata schemas</a> reference</li><li><a href="https://docs.kamu.dev/odf/rfcs/">RFC history</a></li><li>Comprehensive <a href="https://docs.kamu.dev/glossary/">Glossary</a> with cross-linking of all terms throughout our docs and tutorials</li><li><a href="https://docs.kamu.dev/node/api/rest/">REST API reference</a> documentation using OpenAPI</li><li>New <a href="https://node.demo.kamu.dev/">GraphQL playground</a></li></ul><h4>Core ODF Protocol</h4><p>The core protocol has been getting many cool improvements as well, too many to list.</p><p>Some highlights:</p><ul><li>More expressive dataset schema definitions</li><li>Separation of ODF crates from Kamu implementation to help build more ODF-compatible clients</li><li>Redesign of metadata traversal (MetadataChainVisitor) that allows expressing complex computations on dataset history 
while utilizing multiple layers of caching for efficient traversal</li><li>Projecting changelog history into a state snapshot using to_table() UDTF</li><li>Hard compactions</li><li>New ChangelogStream and UpsertStream merge strategies</li></ul><h3>What’s Next?</h3><p>Our near-term roadmap includes:</p><ul><li>Securing future funding</li><li>Improving dataset schema migrations</li><li>Improving usability and shortening feedback loops</li><li>Privacy-preserving queries and transformations using TEEs</li><li>More declarative pipeline and resource management</li><li>Further separation of ODF core libraries</li><li>Integrating more engines</li><li>System abuse prevention</li></ul><p>Our focus will remain primarily on working closely with existing partners, but all improvements we make benefit Kamu Node’s functionality for everyone else.</p><p>We remain fully committed to building Kamu as <strong>local-first software</strong>, meaning everything you can do in Kamu Node you’ll be able to do locally with Kamu CLI, without sign-ups or accounts, and for free.</p><p>Thanks so much for reading, and huge thanks to our team for making all this happen. Start a conversation with us on <a href="https://discord.gg/nU6TXRQNXC">Discord</a>.</p><p>Till next time!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=61565a626301" width="1" height="1" alt=""><hr><p><a href="https://medium.com/kamu-data/kamu-update-a-year-worth-of-features-61565a626301">Kamu Update — A year-worth of features</a> was originally published in <a href="https://medium.com/kamu-data">Kamu Data</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Ownership and Decentralization of Data in DePINs]]></title>
            <link>https://medium.com/kamu-data/ownership-and-decentralization-of-data-in-depins-c7b60999847c?source=rss----7bbcdfcb365---4</link>
            <guid isPermaLink="false">https://medium.com/p/c7b60999847c</guid>
            <category><![CDATA[blockchain-technology]]></category>
            <category><![CDATA[web3]]></category>
            <category><![CDATA[data-engineering]]></category>
            <category><![CDATA[depin]]></category>
            <dc:creator><![CDATA[Sergii Mikhtoniuk]]></dc:creator>
            <pubDate>Sat, 23 Aug 2025 22:24:27 GMT</pubDate>
            <atom:updated>2025-08-23T22:24:27.690Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Y2A7sSXxUB_kEcjNi80E7Q.png" /></figure><p><em>Decentralized Physical Infrastructure Networks</em> (DePINs) emerged from decentralized finance as incredible new mechanisms of incentives that enable large groups of people to collectively provide valuable and cohesive services to society with minimal intermediation and transparent rewards.</p><h3>Central role of data in DePIN rewards</h3><p>Data plays a central role in DePINs:</p><ul><li>Some DePINs are purpose-built to <strong>crowd-source data collection</strong> and reward their contributors based on volume and quality of data <em>(e.g. weather, traffic, air and sound pollution networks, wearable personal health devices)</em></li><li>Other DePINs collect data for <strong>quality of service</strong> purposes <em>(e.g. wireless connectivity providers and idle CPU/GPU compute sharing networks collect node uptime, reliability, and units of work metrics)</em>.</li></ul><p>In both cases data is the source of truth for determining how much contributors earn in rewards for providing their services to the network. Therefore <strong>validity of data and computations</strong> is of utmost importance to fair token distribution.</p><h3>Challenges of DePIN data management</h3><p>Managing data is hard, even in a centralized setting. Tech companies retain large data engineering teams for the sole purpose of making sense of internal data. 
For DePINs this challenge is even more severe:</p><ul><li>They often deal with <strong>large volumes</strong> of device data (IoT), often beyond what a single database can handle.</li><li>Their analytics and rewards involve combining <strong>on-chain and off-chain</strong> data, creating a lot of extra work to index data from blockchains and get data into smart contracts.</li><li>DePINs create value from data only when it <strong>crosses organizational boundaries</strong>, when it is provided to business consumers, or when it’s shared with data science communities, thus requiring a lot more infrastructure around dissemination APIs and data privacy.</li></ul><p>So DePINs need to build a far more complex tech stack, often on a tight budget, while also facing <strong>transparency and verifiability pressure</strong> that their Web2 counterparts do not.</p><h3>Who owns the data?</h3><p>As a result of these challenges, many DePINs fail to deliver on their decentralization promises. By building on top of Web2 data solutions like AWS, Snowflake, and BigQuery they <strong>re-centralize</strong> the data, contrary to their ethos.</p><p>You may argue that DePIN data stored this way still can be traced back to the person/device who provided it via device IDs and wallet accounts, but here’s a simple “ownership test”:</p><blockquote>If a company behind a DePIN doesn’t pay its server bill — the data of the entire network can be lost, the token value can tank, and years of contributions of people and their hard-earned rewards can be wiped out in an instant.</blockquote><p>The reason for all this is simple: while blockchains mostly solved the decentralization of money, the problem of <strong>decentralization of data is largely unsolved</strong>.</p><h3>Evolution of data in DePINs</h3><p>Most DePIN companies do try hard to deliver on their promises to the community using the best tools and resources at their disposal.</p><figure><img alt="" 
src="https://cdn-images-1.medium.com/max/1024/0*Kh47teWGec3oEBSN.png" /><figcaption>Infrastructure of a typical DePIN with a high degree of centralization</figcaption></figure><p>Typical evolution stages of data management in such companies may look like this:</p><ol><li>Start collecting data into a <em>single database</em> as a proof of concept</li><li>Write <em>scripts to distribute rewards</em> to early adopters, without any transparency or verifiability</li><li>Build <em>custom APIs</em> to deliver data to first customers, monetize data, and close the value loop</li><li>Develop an <em>interim solution to scale</em> beyond one database, such as offloading historical data to a Parquet data lake in S3</li><li>Develop <em>custom data exports</em> to distribute data to the community for hackathons and challenges to unlock more interesting use cases for their data</li><li>Under pressure to differentiate themselves from Web2 companies they may start <em>pinning data in IPFS or Filecoin</em>.</li></ol><p>This is a long and winding path: it takes many years and millions in engineering resources to execute, yet still doesn’t lead to satisfactory results. 
Having data pinned in IPFS <strong>doesn’t solve the underlying re-centralization issues</strong>, only masks them under the guise of duplicating data into Web3 storage.</p><p>By that time data has already lost its verifiability properties and its connection to who provided it, and all data pipelines — the arteries that transport and refine data into more valuable forms — remain proprietary and impossible to reproduce.</p><h3>Kamu: Decentralized data middleware</h3><p>Kamu achieves what we used to think was impossible:</p><ul><li>It flips the data management model upside-down, re-focusing it on <strong>personal data ownership</strong></li><li>But does so in a way that is <strong>non-disruptive</strong> and <strong>maximally compatible</strong> with existing data tools.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*F6N2K2FoEQRNJy6z.png" /><figcaption>DePIN negotiates access to personally-owned data and refines it via decentralized pipelines</figcaption></figure><p>Here’s how it works:</p><ol><li>People or devices write data into datasets that <strong>they own and control</strong>. 
This data can be a high-frequency stream like IoT and reside in any storage of their choice.</li><li>Kamu stores data as a ledger where <strong>information about ownership is inseparable from data</strong>, simultaneously protecting data owners from the copy problem and holding them accountable for data they provide.</li><li>DePIN networks don’t own source data — they are simply <strong>granted access</strong> to personally-owned datasets.</li><li>DePINs combine and aggregate real-time data from contributors by creating “views” or <strong>derivative datasets</strong> that can be queried as efficiently as in centralized infrastructure, but preserve the <strong>link to source data</strong> throughout all transformations.</li><li>Through a series of verifiable computing pipelines they refine raw data into <strong>higher-order valuable datasets</strong> and calculate <strong>data quality scores</strong> of every contributor in a fully transparent way.</li><li>They monetize high-quality data through smart contracts and grant access permissions to 3rd parties, simultaneously creating a financial trace and an access audit log.</li><li>In another smart contract they query monetization records in combination with data quality scores to <strong>fairly distribute token rewards</strong>. 
By doing this on-chain using Kamu’s <strong>verifiable oracle</strong> they link every token disbursement to the exact data it was based on, ensuring complete transparency, reproducibility, and auditability.</li></ol><blockquote>In the same way as DePINs rely on L1 blockchain networks for decentralized financial services — Kamu provides decentralized data storage and processing services that cover most needs of DePIN infrastructure.</blockquote><p>This approach unlocks a dazzling number of cool properties:</p><ul><li>Refined datasets, business reports, trained AI models — all artifacts can be forever linked to data from verifiable pipelines and the information about who contributed every data point.</li><li>Verifiable provenance provides the mechanism to both fairly reward everyone who contributed data and maintained the pipelines and to hold everyone accountable for their actions.</li><li>If a DePIN company dissolves — data contributors don’t lose their data. They can take their data elsewhere, or even pool up with data consumers and <strong>keep all pipelines running as before</strong>. It’s an absolutely crucial step towards making DePINs actually behave like true DAOs.</li><li>The same devices can contribute data to multiple DePIN networks at once, earning multiple rewards, without duplicating data.</li><li>DePIN projects can manage data with full transparency and <strong>engage the community to contribute</strong> towards improving pipelines as easily as they would contribute to open-source code.</li></ul><p>🤔 <em>Interested in how all this works? See our </em><a href="https://www.kamu.dev/blog/2024-08-28-intro-to-web3-data-engineering/"><em>DePIN mini-course</em></a> 🤔</p><h3>Takeaway</h3><p>Many early-stage DePINs severely underestimate the challenges of data management at scale. 
They tie up significant engineering resources in building highly custom internal data platforms and end up with a high degree of centralization that contradicts their ethos.</p><p>We attribute this primarily to the lack of Web3-native data tools in the past.</p><p>At Kamu we built the first data middleware that can both stand up to most challenges of scale and allow DePINs to skip this long winding path altogether and start building from day one with tools that are purpose-made to enable decentralization and personal data ownership.</p><p>We believe it’s a major step that will shift the DePIN space closer to operating as DAOs, allow them to more effectively engage their communities, unlock more value from data globally, and let us <strong>rebuild the global data economy around tightly aligned incentives</strong>.</p><h4>See Also</h4><ul><li><a href="https://www.kamu.dev/blog/2024-08-28-intro-to-web3-data-engineering/">Introduction to Web3 data</a> — a mini-course exemplifying decentralization of DePIN data.</li><li><a href="https://www.kamu.dev/blog/2025-01-08-oracle-augmented-generation/">Oracle-Augmented Generation</a> — how Kamu connects AI agents to real-time verifiable data.</li><li><a href="https://www.kamu.dev/blog/2024-06-27-next-decade-of-data/">Next Decade of Data</a> — our view of the upcoming transformation towards global data economy.</li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=c7b60999847c" width="1" height="1" alt=""><hr><p><a href="https://medium.com/kamu-data/ownership-and-decentralization-of-data-in-depins-c7b60999847c">Ownership and Decentralization of Data in DePINs</a> was originally published in <a href="https://medium.com/kamu-data">Kamu Data</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Oracle-Augmented Generation: Connecting AI to Real-Time Verifiable Data]]></title>
            <link>https://medium.com/kamu-data/oracle-augmented-generation-connecting-ai-to-real-time-verifiable-data-62193673e7f6?source=rss----7bbcdfcb365---4</link>
            <guid isPermaLink="false">https://medium.com/p/62193673e7f6</guid>
            <category><![CDATA[verifiable-computation]]></category>
            <category><![CDATA[blockchain]]></category>
            <category><![CDATA[llm]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[agentic-ai]]></category>
            <dc:creator><![CDATA[Sergii Mikhtoniuk]]></dc:creator>
            <pubDate>Sat, 11 Jan 2025 22:50:58 GMT</pubDate>
            <atom:updated>2025-01-11T22:50:57.997Z</atom:updated>
            <content:encoded><![CDATA[<p>In collaboration between <a href="https://kamu.dev">Kamu</a> and <a href="https://brianknows.org/">Brian</a> we are excited to introduce a new technique for connecting LLM-based AI agents to verifiable data we call Oracle-Augmented Generation.</p><p>You can find a quick overview of the technique in this video:</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FRuQQNAgfOlY%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DRuQQNAgfOlY&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FRuQQNAgfOlY%2Fhqdefault.jpg&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/8a6fca1ba5405baaff431a6bc0702c81/href">https://medium.com/media/8a6fca1ba5405baaff431a6bc0702c81/href</a></iframe><p>In this post we will quickly set the stage and then dive deeper into technical and implementation details.</p><h3>The Challenge of Factual Data in LLMs</h3><p>Large language models (LLMs) have become very powerful reasoning and automation tools. Trained on vast quantities of data they rely on generalization to extract concepts, find rules and patterns, and infer relationships between them.</p><p>The model training is akin to multi-dimensional approximation or lossy compression. The generalization of knowledge that gives models their power of high-level reasoning is the same thing that takes away their ability to manipulate accurate factual data points. 
Asking models to recall specific facts often leads to “hallucinations” — models making up plausible-sounding but false information.</p><p>Embedding the world’s factual data into LLMs would not only be impractical in terms of model size; because such models take months to re-train, they would also always lag far behind the real-time data the world increasingly relies on for decision-making.</p><h3>Retrieval-Augmented Generation</h3><p>To address this, many AI systems employ Retrieval-Augmented Generation (RAG), pairing LLMs with vector databases to retrieve contextually relevant information at the time of the query.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*WR7WLUYB-MW8yYDQ.png" /><figcaption>Retrieval-Augmented Generation</figcaption></figure><p>While this approach improves the factual accuracy and recency of responses, it has notable drawbacks:</p><ul><li><strong>Source staleness and credibility</strong> — major LLM companies employ RAG mostly to search for relevant web pages, many of which may contain outdated or unreliable information. While different ranking mechanisms are employed to pick the best results, it is difficult to imagine a sustainable approach to assigning and maintaining such ranks on a global scale.</li><li><strong>Complex query conditions</strong> — RAG performs best when the required information already exists close to its desired form, one that an LLM can easily process. But the number of questions with non-trivial conditions (e.g. 
narrowing down geographically, or by time) is so large that we cannot expect a web page to exist for each of them — answering a question correctly may require non-trivial computations over a large amount of data to be performed <em>uniquely</em> for that user.</li><li><strong>Opaque data selection and centralized control</strong> — The RAG operator has full control over which sources are included in search results and which aren’t, raising concerns about transparency and potential bias. The proprietary data collection pipelines built by LLM companies to improve model training and RAG also have the adverse effect of concentrating an alarming amount of power in very few hands.</li></ul><h3>Introducing Oracle-Augmented Generation</h3><p>We propose a new technique called Oracle-Augmented Generation (OAG) that pairs an AI agent with a <strong>verifiable analytical data processing system</strong> working with a set of <strong>trusted data sources</strong>.</p><p>The term “oracle” here is borrowed from the domain of Web3 and blockchains, where an “oracle” is a system that acts as a bridge between blockchains and the external world, enabling smart contracts to access off-chain data while providing some form of <strong>guarantees of the validity of data</strong>. The term “oracle” also exists in the theory of computation as a black-box function or device that can answer queries that might otherwise be difficult or impossible to compute, as is the case for recalling facts from a generalized language model.</p><p>At a high level, OAG proceeds in four stages:</p><ol><li><strong>Context gathering</strong> — where, in its simplest form, the oracle system is asked to provide the top N datasets most relevant to the user’s question</li><li><strong>Query generation</strong> — where the LLM is asked to use dataset metadata to generate a query (e.g. 
SQL) that computes data that may answer the question</li><li><strong>Query execution</strong> — where the oracle performs the query and returns a <em>verifiable</em> result</li><li><strong>Answer generation</strong> — where the LLM is asked to interpret the query result for the user</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*GNoUYdowMPpGIXsE.png" /><figcaption>OAG Sequence Diagram</figcaption></figure><p>A key distinction from typical “Text-to-SQL” approaches here is the requirement for the oracle to provide <strong>cryptographic proofs</strong> for the context gathering and query execution stages.</p><h3>Example Interaction</h3><p>Using the <a href="https://brianknows.org/">Brian</a> LLM agent, with <a href="https://docs.kamu.dev/node/">Kamu Node</a> playing the role of the oracle, let’s see what one OAG user interaction may look like.</p><p><strong>User Prompt:</strong> <em>What was the total trading volume of USDC between Oct 10th and Oct 20th 2024?</em></p><p><strong>Context gathering:</strong> The Brian agent passes the prompt to Kamu Node’s search API verbatim, and Kamu performs a semantic search to find the datasets most relevant to this question, returning datasets like:</p><ul><li><a href="https://platform.demo.kamu.dev/kamu/io.codex.tokens.olhcv">kamu/io.codex.tokens.olhcv</a> - DeFi trading data, most relevant to us</li><li><a href="https://platform.demo.kamu.dev/kamu/com.cryptocompare.ohlcv.eth-usd">kamu/com.cryptocompare.ohlcv.eth-usd</a> - a crypto exchange dataset that has mentions of trading volumes</li><li><a href="https://platform.demo.kamu.dev/kamu/com.defillama.tokens.prices">kamu/com.defillama.tokens.prices</a> - crypto token prices dataset that has mentions of USDC</li></ul><p>The Brian agent then fetches extensive metadata about these datasets from Kamu, including their schemas and column descriptions, readme files, and popular queries.</p><p>Although Brian naturally prioritizes DeFi datasets, OAG in Kamu is domain-agnostic. 
Any dataset added by our community becomes automatically available for querying.</p><p><strong>Query generation:</strong> Brian passes the above context to its underlying language model, prompting it to generate a Postgres-compatible SQL query that answers the user’s prompt.</p><p>The model returns:</p><pre>select<br>  sum(volume) as total_volume<br>from &#39;kamu/io.codex.tokens.olhcv&#39;<br>where<br>  symbol = &#39;USDC&#39;<br>  and event_time &gt;= &#39;2024-10-10&#39;<br>  and event_time &lt; &#39;2024-10-21&#39;</pre><p><strong>Query execution:</strong> This SQL is sent to Kamu’s query endpoint and returns a response in JSON:</p><pre>{<br>  &quot;input&quot;: {<br>    &quot;query&quot;: &quot;select\n  sum(volume) as total_volume\nfrom &#39;kamu/io.codex.tokens.olhcv&#39;\nwhere ...&quot;,<br>    &quot;queryDialect&quot;: &quot;SqlDataFusion&quot;,<br>    &quot;dataFormat&quot;: &quot;JsonAoS&quot;,<br>    &quot;include&quot;: [&quot;Input&quot;, &quot;Proof&quot;],<br>    &quot;datasets&quot;: [<br>      {<br>        &quot;id&quot;: &quot;did:odf:fed011b209e776577c1688affdab1db2d3bda4822852dcaf9d59d108df8b441544938&quot;,<br>        &quot;alias&quot;: &quot;kamu/io.codex.tokens.olhcv&quot;,<br>        &quot;blockHash&quot;: &quot;f16206ead4be7fd3c3efbaa3de1c15e303e2ce9f6c2bc605f11e033e83a0206573722&quot;<br>      }<br>    ],<br>    &quot;skip&quot;: 0,<br>    &quot;limit&quot;: 100<br>  },<br>  &quot;output&quot;: {<br>    &quot;data&quot;: [{ &quot;total_volume&quot;: 4039963082.961011 }],<br>    &quot;dataFormat&quot;: &quot;JsonAoS&quot;<br>  },<br>  &quot;subQueries&quot;: [],<br>  &quot;commitment&quot;: {<br>    &quot;inputHash&quot;: &quot;f16207ad1730665365efa3acb77ea33a169e49b292c03a13ae01d2718b7d958afb46b&quot;,<br>    &quot;outputHash&quot;: &quot;f1620510c1a4b28136b6f79c971e86c37e9b6ea77833ad511515c7a4e0133e47113b6&quot;,<br>    &quot;subQueriesHash&quot;: &quot;f1620ca4510738395af1429224dd785675309c344b2b549632e20275c69b15ed1d210&quot;<br>  
},<br>  &quot;proof&quot;: {<br>    &quot;type&quot;: &quot;Ed25519Signature2020&quot;,<br>    &quot;verificationMethod&quot;: &quot;did:key:z6MkkhJQPHpA41mTPLFgBeygnjeeADUSwuGDoF9pbGQsfwZp&quot;,<br>    &quot;proofValue&quot;: &quot;uLaanvQHkx5w6yOcLmI-VH1IquEFTMjmlJRqqgd1Na1qYYcb6CIpxERLjtlYRasqiIwL2hg6NAEHMoNz68xwSBQ&quot;<br>  }<br>}</pre><p>The &quot;output&quot; part of the response is the actual result of the query, while the rest of the fields form a <strong>cryptographic proof</strong> of this request.</p><p>In this example we have a proof via <strong>reproducibility</strong>, where a specific Kamu node (identified by the <a href="https://www.w3.org/TR/did-core/">W3C DID</a> did:key:z6..fwZp) commits to having correctly executed the query on the dataset did:odf:fe..4938 at a specific state snapshot (block) f1..3722.</p><p>Other types of proofs, including <strong>zero-knowledge</strong> proofs, can similarly be used in OAG to achieve desired effects.</p><p>For a detailed overview of the query proofs mechanism and its properties see <a href="https://docs.kamu.dev/node/commitments/">our documentation</a>.</p><p><strong>Answer generation:</strong> The LLM is asked to interpret the result for the user, given the:</p><ul><li>Original user prompt</li><li>Context with metadata about the datasets</li><li>Generated SQL query</li><li>Response data (in our case just [{ &quot;total_volume&quot;: 4039963082.961011 }])</li></ul><p>The LLM returns the final answer: <em>The total trading volume of USDC between October 10th and October 20th, 2024, was approximately $4,039,963,083.</em></p><p>Both the final answer and the cryptographic commitment are saved in the chat history.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*H8QVDGbBe8sEM55T.png" /><figcaption>Proof saved alongside the agent’s chat history</figcaption></figure><p>A hyperlink that Brian includes in the response allows users to quickly decode the proof and audit the query in Kamu’s Web UI.</p><figure><img 
alt="" src="https://cdn-images-1.medium.com/max/1024/0*4Grkn30cC4nOXJwA.png" /><figcaption>Auditing proof in Kamu Web UI</figcaption></figure><p>The user can see:</p><ul><li>Which SQL query was executed</li><li>Which datasets, and which snapshots of their state, were used in the computation</li><li>The result data, reproduced identically to that of the original query</li><li>The validation status of the proof, which ties the response to one or more nodes that executed the query</li></ul><p>The ability of Kamu to provide indefinite reproducibility of queries even for fast-moving datasets relies on the ledgerized data structure of the Open Data Fabric protocol. You can find more details in the <a href="https://docs.kamu.dev/odf/spec/">ODF specification</a>.</p><h3>Current Limitations</h3><p>Kamu Node does not yet provide Proof&lt;S&gt;, a proof that the context gathering phase was executed correctly and returned the most relevant datasets without any additions or omissions. A reproducibility-based proof, similar to the query proof Proof&lt;Q&gt;, will soon be provided by anchoring the state of all “known datasets” of a node in an ODF dataset and using deterministic embeddings generation and vectorized search algorithms.</p><h3>OAG vs. RAG</h3><p>While in RAG the model has to infer the answer directly from unstructured context data, in OAG the model generates query code and delegates the execution to an analytical system. OAG thus <strong>can work with much larger volumes of data</strong> that would otherwise never fit into the RAG context. 
This also significantly <strong>decreases the likelihood of hallucinating</strong> because most computations are performed by a deterministic system.</p><p>When it comes to structured data, OAG can work with a significantly <strong>wider range of questions</strong>, including:</p><ul><li>Point-in-time and time-range questions</li><li>Complex filters</li><li>Statistical aggregations (medians, quantiles, OLAP cubes etc.)</li></ul><p>In RAG it’s often hard to tell whether the LLM has drawn a correct conclusion from the context data. Verifying this would require a human to analyze the entire context themselves, defeating the purpose. OAG offers superior <strong>auditability and provenance</strong>. Even if the answer requires processing terabytes of data, the generated query can be easily inspected and understood. The oracle system can also provide ways to check whether the query uses reliable data sources. Because the query usually aggregates many data points into an easily digestible statistical summary or chart, it’s a lot easier for humans to check whether the LLM’s interpretation of a result is correct.</p><p>In OAG, proofs introduce <strong>accountability</strong>. A query proof like the one described above allows users to hold data providers and the parties that executed the query <strong>forever accountable</strong> for the validity of the results. Such proofs can expose any attempts to alter or withhold data and thus become building blocks for penalizing malicious actors. Succinct cryptographic proofs provided by Kamu remain small even when the result includes millions of data points, and thus can be cheaply stored alongside the chat history of agent responses.</p><h3>OAG and Kamu for Data Supply Chain Verifiability</h3><p>The verifiability aspect of OAG is subtle yet transformative. Today companies pay huge sums for data APIs that don’t offer a single mechanism to hold the API provider accountable for the correctness of their data. 
But as the world increases its reliance on real-time data and taps into more independent data sources, we will inevitably need to consider the consequences that the presence of a few malicious actors may have in such a system.</p><p>While OAG can work with any verifiable database, we are especially excited about the properties it unlocks in combination with a <strong>decentralized data processing network</strong> like Kamu. Through its use of verifiable stream processing, Kamu extends OAG’s verifiability from a single query to <strong>cover an entire data supply chain</strong>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*LaMJUH5-h4UeVjeW.png" /><figcaption>Data supply chain verifiability</figcaption></figure><p>Data from a reputable publisher or a blockchain can be processed by multiple third parties, cleaned and combined into higher-order, more useful datasets that AI agents will naturally prefer to work with when mining for answers. But thanks to Kamu, the proof of a single query will automatically include provenance proofs of the entire supply chain, no matter how many hands the data went through!</p><p>We believe that Kamu and OAG can help elevate the state of data exchange from all of us just scrambling to get any satisfactory data to get by, to us collectively caring about the quality of data and where it comes from, and inventing new ways to organize efficient and trustworthy data supply chains on a global scale.</p><h3>Role of OAG in AI and Data Economy</h3><p>We believe that conversational AI agents will soon become the primary way humans interact with data. We will increasingly rely on AI to source and distill important information for us. It is therefore important to make sure AI agents play nicely with the global data economy and that all incentives are aligned.</p><p>We unfortunately see a big problem in how big AI companies apply RAG today. 
With LLMs providing us with the results we need, many users might no longer feel the need to open the web pages from which RAG sourced the information. This means that the websites RAG sourced data from start losing their traffic and ad revenue. We may soon see data brokers banning AI agents on their websites. This may in turn make major LLM companies use their virtually bottomless pockets to invest in their own proprietary data pipelines. With legislation around AI and IP rights still stuck in limbo, such a degree of centralization would be catastrophic.</p><p>We have previously shown <a href="https://www.kamu.dev/blog/2024-06-27-next-decade-of-data/">[1]</a><a href="https://www.kamu.dev/blog/2024-08-28-intro-to-web3-data-engineering/">[2]</a> how a decentralized system like Kamu offers a better foundation for the data economy:</p><ul><li>Through <strong>federated querying</strong> it allows us to efficiently combine data from multiple independent sources without the need to move data into one place</li><li>Integrated provenance makes answers based on official reputable sources <strong>more valuable</strong> than unverifiable data from proprietary sources</li><li>Verifiable pipelines allow data sourcing, cleaning, and composition to be <strong>crowd-sourced to a global community</strong> that can self-organize into a new generation of more transparent data brokers and aggregators.</li></ul><p>Now with OAG we <strong>extend these properties to the AI economy</strong>.</p><p>Once again, verifiable provenance is playing the key role. 
After all, a query proof that tells us where every bit of data came from and who participated in computations is the exact information we need to fairly compensate everyone involved.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*tcAmfBHIs0qh5RKc.png" /><figcaption>Provenance-based reward distribution</figcaption></figure><p>Provenance provides us with a <strong>quantifiable way to distribute the revenue</strong> upstream to every participant of a data supply chain:</p><ul><li>Data publishers</li><li>Storage and compute providers</li><li>And the global community that builds and maintains the processing pipelines.</li></ul><h3>Future Work</h3><p>The first OAG prototype has exceeded our expectations, but we have many more ideas to try:</p><ul><li><strong>Agentic exploratory data analysis</strong> — to help with one of the biggest problems today, where LLMs sometimes struggle to filter data correctly without enough hints about the specific values present in the data, we want the agent to be able to decide when it doesn’t have sufficient information to form a query and needs to perform intermediate steps to explore the content of candidate datasets.</li><li><strong>Progressive context expansion via knowledge/semantic graph</strong> — to help the LLM generate correct JOINs between datasets in separate domains we want to extend the metadata with semantic annotations (e.g. <a href="https://en.wikipedia.org/wiki/Resource_Description_Framework">RDF</a>).</li><li><strong>Fuzzy querying</strong> — where the SQL layer on Kamu’s side could notice and auto-correct typical mistakes in queries</li><li><strong>Fine-tuning with OAG-in-the-loop </strong>— as we believe the best results can be achieved when the LLM is trained in conjunction with an oracle and learns how to use it to best advantage.</li></ul><p>Thanks for reading! 
Please give the <a href="https://brianknows.org/">Brian</a> AI agent a try and let us know what you think on <a href="https://discord.gg/nU6TXRQNXC">Discord</a>!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=62193673e7f6" width="1" height="1" alt=""><hr><p><a href="https://medium.com/kamu-data/oracle-augmented-generation-connecting-ai-to-real-time-verifiable-data-62193673e7f6">Oracle-Augmented Generation: Connecting AI to Real-Time Verifiable Data</a> was originally published in <a href="https://medium.com/kamu-data">Kamu Data</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Kamu at FIL Dev Summit 2024]]></title>
            <link>https://medium.com/kamu-data/kamu-at-fil-dev-summit-2024-6de26c8e655d?source=rss----7bbcdfcb365---4</link>
            <guid isPermaLink="false">https://medium.com/p/6de26c8e655d</guid>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[data-engineering]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[blockchain]]></category>
            <category><![CDATA[decentralized-data]]></category>
            <dc:creator><![CDATA[Sergii Mikhtoniuk]]></dc:creator>
            <pubDate>Mon, 15 Jul 2024 18:50:09 GMT</pubDate>
            <atom:updated>2024-07-15T18:50:09.271Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*_hUhVDNNJMWxaI3hELKYmA.png" /></figure><p>I had the honor of presenting at the <a href="http://www.fildev.io/FDS-4">4th FIL Dev Summit</a> organized by Protocol Labs in Brussels on July 9–11.</p><p>In my talk titled <strong>“Supercharging SQL: Global Data Supply Chains for Verifiable AI and Analytics”</strong> I make the following key points:</p><ul><li>With 80% of AI researcher time spent on data acquisition and prep — <em>AI is a data problem</em></li><li>Advancing AI means moving the world towards the <em>global data economy</em></li><li>The existing data lakehouse model is <em>unfit</em> for global data exchange</li><li>Batch processing is the culprit that makes data manual and fragile</li><li>Stream (temporal) processing is the solution to make data processing autonomous and composable</li><li>By layering Web3 properties on top we can move the world towards a data economy based on <em>collectively-owned data supply chains</em></li></ul><p>You can find the full recording here:</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2Fc9UCjJdvJAU%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3Dc9UCjJdvJAU&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2Fc9UCjJdvJAU%2Fhqdefault.jpg&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/4593450b28322df1f23c9cdf5bfa0f39/href">https://medium.com/media/4593450b28322df1f23c9cdf5bfa0f39/href</a></iframe><p>It was really exciting to share the details of several projects we have been working on and to show the trajectory that we’re taking:</p><ul><li>Kamu is fast becoming the <strong>“Kubernetes for Data”</strong> with 4 powerful enterprise data processing engines already integrated into one system</li><li>Kamu is the first to 
combine a blockchain indexer, off-chain data lakehouse, and an oracle under one technology, <strong>blurring the line between on- and off-chain data</strong></li><li>Connecting AI to a web of community-operated factual data supply chains simultaneously solves several major problems that surround LLMs today, like veracity, attribution, and compensation</li><li>Our verifiable data processing model is already being used in the generative-AI space to fairly distribute rewards to IP owners.</li></ul><p>Big thanks to Protocol Labs for organizing the event and to everyone who attended!</p><p>It was especially great to meet some of our Discord members in person.</p><p>Till next time!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=6de26c8e655d" width="1" height="1" alt=""><hr><p><a href="https://medium.com/kamu-data/kamu-at-fil-dev-summit-2024-6de26c8e655d">Kamu at FIL Dev Summit 2024</a> was originally published in <a href="https://medium.com/kamu-data">Kamu Data</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Next Decade of Data: From algebra to calculus, from silos to real-time supply chains]]></title>
            <link>https://medium.com/kamu-data/next-decade-of-data-from-algebra-to-calculus-from-silos-to-real-time-supply-chains-699d94c0c067?source=rss----7bbcdfcb365---4</link>
            <guid isPermaLink="false">https://medium.com/p/699d94c0c067</guid>
            <category><![CDATA[data-lake]]></category>
            <category><![CDATA[open-data]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[big-data]]></category>
            <category><![CDATA[blockchain]]></category>
            <dc:creator><![CDATA[Sergii Mikhtoniuk]]></dc:creator>
            <pubDate>Wed, 03 Jul 2024 03:36:48 GMT</pubDate>
            <atom:updated>2024-07-03T03:36:48.476Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*gNQhy8ql_hHc5fET.png" /></figure><p>This post will be very different from the rest — it’s our team’s attempt to explain the state of the data market as we see it and make some predictions.</p><p>It’s a long read.</p><p>The data market is damn complex, to the point where I think no one has a complete understanding of it, including data company founders, gatekeepers of capital, and especially the users. So what makes us qualified to even attempt to touch on this topic?</p><p>We believe that a lot of this <strong>complexity is incidental</strong>, not inherent, and can be untangled. By the nature of what we’re building, we interact with a much wider range of users than most, as we try to bridge data between previously non-overlapping areas like scientific data management, government, healthcare, finance, insurance, civil infrastructure, Web3, telecom, automotive, aerospace, commercial fishing, green energy and ESG, and many more. This gave us a good picture of what solutions everyone is using and what aspects of data management they prioritize. As a highly technical team, we can now quickly put each new solution we encounter into a certain “bucket”. Along the way we also developed some opinions on where all of this is going, which we will share today.</p><p>This problem is multi-dimensional and impossible to linearize, so excuse me for hopping from one topic to another. 
Also forgive the many generalizations and the assertive tone — it’s for the sake of being concise.</p><h3>On Industry Outlook</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*YyEdnIigeZdJe5xX.png" /><figcaption>Technological transition to data economy</figcaption></figure><p>For the past four decades innovation in data engineering was driven almost exclusively by <strong>big tech</strong> solving their <strong>internal problems</strong> using analytics over <strong>internal data</strong>. A revolutionary invention — the relational database — was followed by steady evolution towards <em>larger volumes and lower latencies</em>.</p><p>We see this trend continuing in the current generation of startups and emerging tech:</p><ul><li>Startup takes a popular data solution X</li><li>Builds a faster version of X</li><li>Open-sources the core</li><li>Sells the cloud version</li></ul><p>We believe this model will have <strong>steeply diminishing returns</strong>:</p><ul><li>As performance gains stemming largely from <a href="https://arrow.apache.org/">Apache Arrow</a> and moving away from the legacy JVM stack are incorporated into all major data engines, and as open-source data formats like <a href="https://iceberg.apache.org/">Iceberg</a> gain wide adoption — this leaves very <strong>little room for differentiation</strong>. There is surely enough market for competition — lots of big companies are still transitioning away from outdated warehouses and you can “catch” them at the right moment — but this will increasingly be a <strong>sales race</strong>, not a technological one.</li><li>The current <strong>processing model is reaching its limits</strong> — not in performance, but in the human cost of operating it: how manual and fragile it is. 
Its flaws cannot be patched and will require a major re-design of engines, disrupting the established order in the analytics market.</li><li>An organization can only derive limited value from internal data, whereas <strong>external data</strong> holds immense untapped potential. Using internal data, organizations can <strong>optimize</strong> their processes and efficiencies, but only by using external data can they <strong>act strategically</strong> — access customer insights, compare their performance against peers, and better understand their place in the global market. <a href="https://mitsloan.mit.edu/ideas-made-to-matter/why-external-data-should-be-part-your-data-strategy">[sloan]</a> <a href="https://www2.deloitte.com/us/en/insights/focus/signals-for-strategists/smart-analytics-with-external-data.html">[deloitte]</a> <a href="https://www.forbes.com/sites/bernardmarr/2022/03/30/why-external-data-is-so-important-for-every-business/">[forbes]</a></li></ul><p>Organizations exchanging information with one another will become the main focus of the next decade in virtually every sector of the economy, and we’ll show that the current solutions simply aren’t designed with this in mind.</p><p>Today we barely scratch the surface of strategic use of data — the next decade will be the time of transition from internal silos towards the <strong>global data economy</strong>.</p><h4>Aside: Don’t we have a data economy already?</h4><p>Emailing Excel files and PDF reports remains the primary way of data exchange — a highly manual and error-prone process that we often compare to a Rube Goldberg machine. In data startup circles this is considered a “low-tech” segment and not the area to focus on, but we believe that this is a clear indication of a big gap in solutions and a big part of the <strong>systemic problem</strong>.</p><p>Companies that already operate data lakes are not much better off. 
They often drown in internal data, unable to tell where most data came from and how it was processed. While solving storage and querying, data lakes are low-level tools that don’t provide good mechanisms to manage data workflows, thus even internal data exchange always ends up plagued with provenance and quality problems. When companies need to share data with partners, they often resort to highly custom solutions tailored to each consumer, which are costly to develop and maintain. The lack of good privacy frameworks often leads to multiple copies of data being produced at different levels of detail, which add to the confusion and erode the “source of truth”.</p><p>Lack of interoperable solutions is most noticeable in the scientific and government data space where universities and governments have policies to publish data, but keep developing <strong>custom non-interoperable portals</strong>. The amount of redundant effort is in the tens of billions and data remains hard to access. They do this not because they don’t have money to pay for Databricks or Snowflake — govtech solutions are <a href="https://laist.com/news/kpcc-archive/how-much-do-open-data-portals-cost-so-cal-governme">extremely overpriced</a> — they do so because their requirements around data sovereignty are not met.</p><p><strong>REST APIs</strong> remain the most widespread way to monetize data, but we think this model is flawed:</p><ul><li>APIs are the means of delivering data to applications, they are not fit for data science and analytics.</li><li>They are non-standard and always require custom integrations</li><li>If the focus of your company is not on selling data — building data API infrastructure is too significant an investment. Many companies in automotive, telecom, and other spaces have very valuable data but told us they can’t monetize it because of the cost of building and maintaining such infrastructure.</li><li>APIs thus favor big players, lead to centralization, and produce data monopolies. 
It’s not possible for a few small publishers to “pool” data together — APIs <strong>don’t compose</strong>.</li></ul><p>When it comes to publisher-supplier relationships, all companies still operate through <strong>bilateral agreements</strong>. A health insurer will contract an actuary that in turn has deals with dozens of hospitals to periodically get statistical data. Again, the entry barrier into this system is very high, getting data takes a lot of time, and due to how manual the whole process is — you will likely be seeing data that is long outdated and can barely be trusted. The only thing that tells you that the data you get from an actuary is real is an ephemeral feeling that “they do data as a company, so it must be real”.</p><p>So, no — we don’t see the current state as representative of even a small fraction of what the global data economy will look like.</p><h3>On Transition Path</h3><p>There have of course been hundreds of attempts to enable a data economy by building “global data portals” and “data marketplaces”, but all of them turned into data graveyards. There has been a resurgence of data marketplace attempts in Web3, but they all focus on using blockchains to experiment with different economic incentive models, leaving the data engineering foundation unchanged. The continuing attempts are evidence that the problem is worth solving, but making actual progress will require <strong>substantially rethinking the approach</strong>.</p><p>We don’t see this transition happening gradually as an evolution of existing practices. Cross-org information exchange is a <strong>trustless environment</strong>, very different from mostly-aligned incentives within a company — existing solutions were not built with any of these requirements in mind and they are <strong>impossible to retrofit</strong>. 
Current solutions for managing internal data already require large engineering teams to sustain — we cannot expect those teams to also handle the external sharing side. We’ve hit the complexity limit — this <strong>won’t scale</strong>.</p><p>The current state is a <strong>local optimum</strong> and getting out of it is not going to be easy.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/445/0*O9u_G-iMElclFl76.png" /><figcaption>Local optima of modern data</figcaption></figure><p>The problems that must be solved during this transition are:</p><ul><li>Automation</li><li>Composability</li><li>Trust &amp; Accountability</li><li>Ownership</li><li>Decentralization &amp; Interoperability</li><li>Privacy</li></ul><p>While many VCs are looking for <em>“The Next Snowflake / Databricks”</em> — we don’t think this is what a solution will look like. The gap is certainly wide enough for another behemoth, but thankfully <strong>the market will not tolerate a centralized / proprietary solution</strong> to this problem. A solution will likely take the form of open formats, a specification that multiple different data engines can adhere to, and a <a href="https://www.usv.com/writing/2016/08/fat-protocols/">“Fat Protocol”</a> that serves as an interoperability layer for various on-prem-first implementations that respect data sovereignty, unlike the current cloud data platforms. We of course hope that our <a href="https://docs.kamu.dev/odf/">Open Data Fabric</a> initiative can fulfill this delicate role.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/527/0*Fsrq4o4HC23w6K9T.png" /><figcaption>Data lakes vs. supply chains</figcaption></figure><h3>On Automation</h3><p>When I say <em>“real-time data”</em> what’s the first thing that comes to your mind? For most, it’s probably the stock market with its high-frequency tickers.</p><p>We ask: <em>Why isn’t all data like that? 
Why do events like the release of quarterly employment reports in the US still cause market swings as traders try to predict what these numbers will look like? Why can’t we get those numbers on an hourly basis?</em></p><p>The answer is the current data processing model, which is extremely manual, fragile, and keeps humans in the loop.</p><p>To emphasize, even if all data in the world by some miracle were in a single well-structured, perfectly-permissioned database — we don’t think the situation would get significantly better — the current model is <strong>unfit for automation at scale</strong>.</p><p>In our <a href="https://www.kamu.dev/blog/end-of-batch-era/">The End of Batch Era</a> blog post we attribute this problem to batch processing — a computational model that constitutes &gt;99% of data processing today. This model doesn’t know how to work with time — <a href="https://www.youtube.com/watch?v=XxKnTusccUM">the most fundamental dimension of data</a>. With industrial-IoT growing massively, blockchains becoming significant and high-frequency data sources, and all sorts of dynamic consumers like devices, automation, and AI bots in play — we need a better computational framework.</p><p>This framework already exists — <a href="https://www.kamu.dev/blog/end-of-batch-era/">Stream / Temporal Processing</a> (we prefer the term Temporal Processing to break away from the old solutions that were just working with time-series data and message queues like Kafka that for many people are synonymous with streaming). It is already revolutionizing how big tech works with real-time data, but we think enterprise data companies are again largely missing the point. 
Yes, it improves latency, but this is actually one of its least fascinating properties.</p><p>More importantly:</p><ul><li>It provides a <strong>solution for dealing with data incompleteness</strong></li><li>It allows writing code that processes data continuously as it evolves over time, <strong>automatically reacting to backfills, retractions, and corrections</strong></li><li>It <strong>removes humans from the loop</strong> — people design and build pipelines, but don’t move any data manually. Removed from the “hot path” of data, they don’t contribute their oh-so-human latency and mistakes to it</li><li>It <strong>composes</strong> — pipelines can have thousands of steps that are reactive, self-healing, and autonomous.</li></ul><p>The importance of stream processing technology is <strong>comparable to the invention of calculus</strong> — it’s a new framework for problems for which previous methods were not a good fit.</p><p>If a new data solution does not consider temporality — I usually stop looking.</p><h3>On Composability</h3><p>Let’s say with a swing of a magic wand we move millions of existing data publishers and consumers to one platform. This utopia can easily backfire and turn into a systematization nightmare. We immediately run into an <strong>M*N quadratic complexity problem</strong> where all data is hard to discover, takes a lot of effort to combine together, is riddled with quality and trust problems, and every consumer has to deal with these issues themselves. 
These are the same issues that a poorly executed transition to a data lake leads to, but now on a global scale.</p><p>Once again — removing interoperability barriers alone is not enough.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/394/0*vZWWxLhhOkSBbcKU.png" /><figcaption>Quadratic complexity explosion in a non-composable model</figcaption></figure><p>For the data economy to function we should stop thinking about data as fruits on a stand that change hands, or as parcels sent through a logistics network. Instead we should think about data as plumbing or an electrical grid — a highly-interconnected network of supply chains that flow, diverge, and converge continuously and autonomously. These chains form a DAG, and graphs provide <strong>structure and hierarchy</strong>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*8RF1qVEWFhX3KYmC.png" /><figcaption>DAG of streaming data pipelines</figcaption></figure><p>This is where composability at the processing level that we just touched on turns into <strong>composability on a global cross-organizational scale</strong>:</p><ul><li>Data from publishers is flowing</li><li>It’s picked up by the global community of data scientists who combine, aggregate, and extract value from it</li><li>Consumers access data on various levels of aggregation</li><li>Any retractions or corrections propagate instantly to consumers</li></ul><p>Data composition is the only way to solve systematization and quality problems at global scale, yet no mainstream data solution offers a composable model.</p><h3>On Supply and Demand</h3><p>A data economy needs a balanced supply and demand. Demand is already high and will only get higher once data access barriers drop. Supply, however, is critically low. 
The current technologies created a steep entry barrier and failed to create the right incentives for publishers to share data.</p><p>We must recognize that for most companies publishing data will never be the core part of their business — a supplementary income at best. Publishers therefore should be treated as the most fragile and vulnerable persona in the model, as any extra effort needed on their end significantly reduces the chances of getting the data.</p><p>That’s why we believe that a much more opinionated and simpler to use data solution is necessary to <strong>fill the gap between Excel and a data lake</strong> in these lower-tech companies. Publishing data externally must be a natural extension of internal data workflows, otherwise the operating overhead will be hard to justify.</p><p>That’s also why we pay close attention to the government use cases (another underserved area), as <strong>regulatory requirements</strong> may just be the driving force that makes such companies upgrade their data stack to keep up. Privacy-preserving verifiable analytics is essential to make regulations and compliance sustainable without falling into the extremes of a “surveillance state” or a 3rd party audit mess.</p><p>There is also what we call a <strong>“small publisher dilemma”</strong>. The smaller the publisher — the less data they have, the harder it is to find, the more work for a consumer it is to integrate, and therefore the less valuable it is. So, for example, while data from all taxi companies in the world would be immensely valuable — data of one taxi company may not have any demand, which discourages the company from ever trying to publish.</p><p>The <a href="https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page">NY Taxis dataset</a> was a clear illustration of how fast things can go once you overcome the initial friction. 
When the New York government collected and shared this previously inaccessible data it turned out to be immensely valuable and has since spread to dozens of data portals, was featured in hundreds of conference talks, and benefited many ride sharing startups.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/636/0*NQDL7LLIzPnnqmvV.png" /><figcaption>One of many NY Taxis dataset visualizations (by Ravi Shekhar)</figcaption></figure><p>Again, we believe that <strong>composability</strong> is the solution:</p><ul><li>Publishers only need to expose data in its raw form</li><li>Data scientists can then find publishers within a common domain, clean their data up, harmonize and pool it into higher-order datasets</li><li>The data is then provided wholesale to the consumers</li><li>Pipelines keep running autonomously as long as publishers continue to provide data</li><li>Rewards flow upstream and are shared between publishers and the pipeline maintainers.</li></ul><p>Composability here plays the crucial role of allowing parties other than the publishers to perform most of the heavy-lifting, dramatically reducing the barrier to entry.</p><p>Today there are many companies that specialize in sourcing data and data cleaning. By making data composable, this role can be opened up to a <strong>global community</strong> of hobbyists and researchers, replacing actuaries with transparent, auditable, and verifiable pipelines.</p><p>We see composability as foundational to <strong>collectively-owned data supply chains</strong>, governance of which can be automated through smart contracts (aka Data DAOs) to avoid the complexity of custom data redistribution and revenue sharing agreements and further simplify publisher participation.</p><h3>On Ownership</h3><p>Taking composability a step further in another direction naturally brings us to <strong>personal data ownership</strong>. 
In a truly composable system raw data can be collected on device / individual level and then be progressively combined into higher-order datasets. For example, data of a commercial fishing company may be a <em>union</em> of datasets of every ship they hire, while the ship’s data is a <em>union</em> of datasets of individual IoT devices on board. As devices are installed, break, and get replaced, possibly changing vendors — a <strong>digital twin</strong> of a ship can be seen as an ever-evolving (temporal) graph of streaming pipelines.</p><p>Note how this approach is <strong>bottom-up</strong> — a device is installed and starts writing data to its own dataset, and then this data is incorporated into the supply chains feeding the ship’s digital twin and company analytics. Many digital twin solutions use a <strong>top-down</strong> approach where a common data schema is designed first and then devices are made to write to it in a specified format. Designing such a common data schema is a tough social exercise of seeking some common denominator and consensus between many device manufacturers — good luck attempting this on a global scale.</p><p>In a bottom-up approach:</p><ul><li>Data can start flowing immediately</li><li>Be quickly aggregated in a way that suits internal needs of a company</li><li>And later harmonized for consumption on country / global level</li><li>Even if the internal model of a company ends up incompatible with what the rest of the market settled on — a company can easily create and migrate to a new model as all raw data on device level is still preserved.</li></ul><p>The permissions in this model work very naturally as well — they can be enforced on a per-device level, and control can be delegated to captains, who authorize access by fisheries. When the ship’s contract ends with one company — the read permission for future data can be revoked — leaving the company only with access to data that was gathered while the contract was active. 
Complete data stays available to captains to base maintenance on, and may or may not be available to the next company that contracts them (e.g. a ZK proof of sufficient maintenance may be all they get).</p><blockquote>Individually-owned data will make up the “big data” of global analytics while preserving the IP ownership, enforcing the desired degree of privacy, and maintaining provenance links for equitable reward distribution.</blockquote><h3>On AI</h3><p>No matter how sophisticated your AI model is, if you train it with poor data you will get poor results: <em>garbage in, garbage out</em>. Acquiring and preparing data is absurdly hard — AI startups spend &gt;80% of their time gathering data instead of designing and tweaking models.</p><p>AI is a data problem.</p><p>Our theory is that the <strong>boom of LLMs is a side-effect of the data problem</strong> — it’s a lot easier to build an AI startup that scrapes open information from the web than to go through all the hurdles of getting specialized domain data. Companies that crowd-source data generation, like Shutterstock and Getty, are <a href="https://www.bnnbloomberg.ca/shutterstock-s-ai-licensing-business-generated-104-million-last-year-1.2080875">making $100M annually</a> from AI companies. Meanwhile domain-specific AI stagnates, facing privacy barriers, complexity of bilateral agreements, and custom integrations.</p><blockquote>It’s not the algorithms that make a successful AI company — it’s the strategies and business relationships behind data acquisition.</blockquote><p>If these barriers are lifted we will see a boom of domain-specific AI applications.</p><p>LLMs continue to improve rapidly, but are also being <strong>rapidly commoditized</strong>. Many open-source LLMs exist as alternatives to GPT, and none of the applied LLM startups seem to have a strong defensibility moat, other than how they prepare data for fine-tuning. Their early gains will likely be quickly eaten away by the competition. 
When dealing with hype technologies it’s better to pay attention to the fundamental infrastructure that supports them. This was the case with semiconductors that surged first due to Blockchain mining and then AI booms — NVIDIA is the <a href="https://www.google.com/url?q=https://www.forbes.com/sites/antoniopequenoiv/2024/06/18/nvidia-now-worlds-most-valuable-company-topping-microsoft-and-apple/&amp;sa=D&amp;source=docs&amp;ust=1719274017910652&amp;usg=AOvVaw1jsJfbXNSmdZc6vc70G0k-">most valuable company</a> in the world as of this writing — same will be the case with data infrastructure. Unlike AI models though, <strong>data has inherent and profound network effect</strong> — data is a lot more valuable when highly interconnected and can be easily combined with other data. A protocol that solves data exchange problems will have a significant moat due to this built-in network effect.</p><p>In Kamu we see the relationship between LLMs and data changing significantly. LLMs are notoriously bad at <strong>factual information</strong> — they have no defined boundary between facts and “dreaming”. The models are already huge, so it’s simply impractical to try to embed more factual data into them. Even if the size is not a concern — in the constantly changing world a lot of data will already be out-of-date by the time you finish re-training the model. 
We therefore think that LLMs should be treated as a <strong>human interface</strong> that can help people access and make sense of factual data, but not necessarily contain it within itself.</p><p>A simple example of what we’re prototyping:</p><ul><li>LLM receives a question like <em>“what is the median price of a 2BR apartment in Vancouver?”</em></li><li>Using semantic graph of datasets in Kamu (via <a href="https://blogs.nvidia.com/blog/what-is-retrieval-augmented-generation/">RAG</a>) it finds datasets that are most suitable to answer this question</li><li>It composes an SQL query to fetch the data</li><li>And determines the best representation (verbal, or graphical) to serve the results.</li></ul><blockquote>The symbiosis of LLMs and a data supply network can deliver users factual real-time results, without having to frequently re-train the model.</blockquote><p>Perhaps even more importantly, it can deliver the results with <a href="https://docs.kamu.dev/node/protocols/oracle/#verifiability">full provenance information</a>, and therefore enable both accountability and fair compensation of data providers. When this state of technology is achieved we see AI becoming possibly the <strong>largest consumer and the primary user interface for data</strong>.</p><p>The <a href="https://www.wired.com/story/scarlett-johansson-v-openai-could-look-like-in-court/">Scarlett Johansson lawsuit against OpenAI</a> once again turned the attention towards <strong>provenance of data used for training</strong> and compensating the IP owners. We think this is the wrong side to start digging from. You can already establish a link between the AI model and data used for training — currently through versioning / anchoring data and deterministic / verifiable computations. But we also need to <strong>establish provenance across all the hundreds of transformation steps data usually goes through before training</strong>, linking it to ultimate roots of trust — the individual data publishers. 
Training an AI model is not that fundamentally different from executing an aggregate function in SQL, so if we solve provenance at the level of multi-stage data supply chains, AI training will fit into the general framework nicely. And if your data ownership extends to individual level (as we described in the <em>Ownership</em> section) — you’ll have a programmatic way to compensate the IP owners. Try starting with the training provenance problem and you’ll end up linking models to data whose origin cannot be established.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/753/0*j1-PMfVtAWXUxqd9.png" /><figcaption>Using data provenance to reward IP owners</figcaption></figure><p>And finally, today AI/ML is primarily the <strong>end product</strong> of data processing — you collect and shape data, train a model, then deploy it to serve users. We think this will change as well. Once we can produce verifiable models with strong provenance we will see more of them being used as <strong>intermediate stages</strong> in data processing pipelines. We are building Kamu to make AI-enabled classification, feature extraction, alerting, and all kinds of decision making a seamless part of the data supply chains, as simple as running an SQL function.</p><h3>On Web3</h3><p>We already stated that the market will (rightfully) resist any centralized / proprietary solution to data exchange. For most organizations NO data exchange is better than having their data held hostage. This naturally brings us to Web3 and the realm of decentralized protocols.</p><p>Blockchain investors themselves describe this market as “self-serving” and in the state of a “big casino”. There’s definitely a lot of that. 
Despite that, Web3 operates much closer to the underlying challenges of multi-party trustless information exchange and already gave birth to several extremely important new technologies like decentralized storage and compute networks, and novel privacy-preserving techniques.</p><p>When it comes to data, the biggest issue we see is Web3 companies building solutions for other Web3 companies in isolation, disregarding the transition path for the other 99.9% of the organizations in the world. If a company already operates a data lake based on S3 — why on earth would they want to move their terabytes of data to some decentralized database and lose access to Spark and other vast sets of analytical and BI tools that they built their business around? Our theory is that the current state of affairs is caused by the profile of engineers — most people that come to Web3 today were only exposed to OLTP (transactional) data processing before and create data solutions that unfortunately repeat the same mistakes that OLAP (analytical / warehousing) data processing already evolved out of decades before.</p><p>In Kamu we take a very sober, practical look at Web3 and build towards its core values, but with existing enterprise OLAP data users as the starting point. Web3 values are the “north star”, but we need to make sure that the <strong>on-ramp</strong> is as smooth as possible. With our products, we have already proven that provenance and verifiability can be introduced into modern data lakes in-place, with <strong>little to no disruption</strong>. Users can continue using their existing storage, their existing tools, and open their data to others and decentralize progressively, when there is a strong business case to do so.</p><p>As a quick rundown of technologies:</p><p><strong>Blockchains:</strong> We see them as OLTP engines for multi-party interactions. Most data is not that. 
Putting data on a blockchain is prohibitively expensive, which discourages the abuse of blockchains for data storage, although many startups still try. Private / enterprise blockchains are just stopgaps for privacy and scaling issues — we don’t consider them good mediums for data exchange either, as none of them can handle Industrial-IoT volumes and frequencies of data. We use blockchains as a censorship-resistant data catalog, a permission management layer, and a source/consumer of some data (see below), while keeping all data processing off-chain. Data doesn’t need a global ledger. Verifiability of data pipelines has much more efficient solutions than what general purpose blockchains offer.</p><p><strong>Oracles:</strong> Access to trustworthy off-chain data is essential to make Smart Contracts useful, but <a href="https://ethereum.org/en/developers/docs/oracles/">The Oracle Problem</a> remains a big barrier. We strongly believe that the oracle problem is a feature, not a bug. The issue of how to hold data providers accountable exists even when you call Google’s REST API from your app — today you have no way of proving that the resulting data even came from Google, not to mention disputing it. The oracle problem simply exposed the lack of accountability in modern data exchange in stark contrast with the properties achieved and held sacred by blockchains. The complexity and cost of <a href="https://chain.link/education/blockchain-oracles">Oracle Networks</a> are a big factor in the stagnation of the smart contract ecosystem. If the accountability problem is addressed at the core — bridging data to smart contracts will become trivial, opening up millions of new use cases for smart contracts other than just finance.</p><p><strong>Indexers:</strong> OLTP transactions on the blockchains generate a lot of interesting data. The desire to use it for analytics gave birth to the category of solutions that ingest decoded blockchain data into databases and lakes. 
They all end up re-centralizing the data and going back to all the trust concerns blockchains are solving. The situation is improving with some indexers using ZK proofs to make some claims verifiable, but this is still very far from general-purpose analytics. More importantly, all indexers provide access to only blockchain data, meaning that if you need to correlate some on-chain data with events happening in the outside world — you are suddenly thrown into the complexity of building and operating your own data infrastructure — an investment few can afford. In <a href="https://www.kamu.dev/coming-soon">this blog post</a> we show how Kamu unifies oracles, indexers, off- and on-chain data under one simple model.</p><p><strong>DePIN</strong> (decentralized physical infrastructure networks): Some DePINs like <a href="https://weatherxm.com/">weather station networks</a> exist to gather data. For others data is a byproduct crucial for provisioning, monitoring, and service level control. From the perspective of data, the only difference between DePIN and industrial IoT is in where you draw the ownership boundary. While blockchains are used as an incentive layer for these networks, the massive volumes of high-frequency data will have to go somewhere off-chain while preserving verifiability.</p><h3>On Privacy</h3><p>Most companies still consider data privacy only in the form of private networks and centralized permission control. In many cases this may actually be enough for them. Many Web3 ideals like using public storage networks where data is openly accessible but protected by encryption are viewed as additional complexity and often a showstopper. 
Again, we need a smooth on-ramp — industry must accept that it will take time for these approaches to earn customers’ trust — we should meet users where they are.</p><p>Only when encountering cross-org data exchange do companies start looking at the domain of <a href="https://blog.openmined.org/structured-transparency-input-output-privacy/">structured transparency</a>, of which the previous approach is just a tiny subset. The most popular solutions there still remain Compute-to-Data and Federated Learning — simple ideas of performing computations within the data owner’s infrastructure, without ever sending raw data outside the private network. While they are posed as the holy grail of medical AI, we still have not seen a single sustainable integration. These solutions require not only advanced infrastructure but also <strong>significant technical expertise</strong> on the data owner’s side. It’s still hard for me to imagine hospitals at large ever employing people capable of reviewing the submitted code and AI training routines for PII leaks.</p><p>Luckily this field is booming with new techniques that can make structured transparency easier to operate:</p><ul><li>Differential Privacy can control how much information is revealed in aggregate throughout the computation and automatically terminate it when it exceeds some threshold without any human input</li><li>Software Enclaves and Homomorphic Encryption can allow 3rd parties to operate the compute infrastructure without leaking any information to them</li><li>Arguments, Proofs, and Zero-Knowledge enable verifiable and privacy-preserving computing without the need for specialized hardware and the “root of trust”.</li></ul><p>We are accumulating a robust set of privacy tools, but the main problem now is that there is <strong>no common foundation to integrate them</strong> through. 
When two or three parties are involved in a computation, the likelihood of them using the same tech stack is slim, and we are extremely far from having any standard spec to make verifiable computing interoperable. Until this changes — we will see structured transparency deployed very haphazardly, as custom solutions with hefty price tags, and being close to impractical in terms of cognitive overhead for the operators.</p><p>Privacy has to be <strong>layered over</strong> the open data supply chain protocols. Going the other way around will only fragment the already highly fragmented market.</p><h3>Conclusion</h3><p>Every sci-fi movie already captured the desired UX:</p><ul><li>Instant, uniform access to millions of data sources</li><li>Feeds of relevant data delivered in real-time</li><li>Ability to drill down to individual data points and cross-verify veracity of data between sources.</li></ul><p>Let’s figure out how to make this vision a reality.</p><p>We believe that in Kamu we largely cracked the problem of what the foundation of this should look like:</p><ul><li>Separation of data ownership from storage and compute infrastructure</li><li>Composability of stream / temporal processing as the key to automation, ownership, and global collaboration on data, the effects of which will likely be similar to, if not greater than, the Open Source Software revolution</li><li>Ecosystem based on open formats and protocols, open to many engines and alternative node implementations</li><li>Network economy that fairly compensates all participants of supply chains to keep things sustainable.</li></ul><p>Unfortunately this foundation is not a gradual evolution of existing frameworks — a lot of things will need to be re-designed.</p><p>Surprisingly, for such a complex problem we have not yet encountered many “forks” on the path to the solution. We have been following the same vision for over 3 years and watching the many puzzle pieces we had not even considered (e.g. 
the oracle problem) fall neatly into their places.</p><p>Do you agree with the trends and problems we described? Reach out to us on <a href="https://discord.gg/nU6TXRQNXC">Discord</a> with your thoughts and comments.</p><p>Try our <a href="https://demo.kamu.dev/">demo</a> to see a lot of this new foundation already functioning.</p><p>If you’d like to know more technical details of what major drivers influenced the technical design of Kamu — don’t miss our upcoming blog post: <a href="https://www.kamu.dev/coming-soon">“The Grand Convergence: Five silent revolutions that shape the future of data”</a>.</p><hr><p><a href="https://medium.com/kamu-data/next-decade-of-data-from-algebra-to-calculus-from-silos-to-real-time-supply-chains-699d94c0c067">Next Decade of Data: From algebra to calculus, from silos to real-time supply chains</a> was originally published in <a href="https://medium.com/kamu-data">Kamu Data</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Kamu Update: Funding, hiring, tech]]></title>
            <link>https://medium.com/kamu-data/kamu-update-funding-hiring-tech-8524645124?source=rss----7bbcdfcb365---4</link>
            <guid isPermaLink="false">https://medium.com/p/8524645124</guid>
            <category><![CDATA[open-data]]></category>
            <category><![CDATA[apache-arrow]]></category>
            <category><![CDATA[stream-processing]]></category>
            <category><![CDATA[web3]]></category>
            <category><![CDATA[etl-tool]]></category>
            <dc:creator><![CDATA[Sergii Mikhtoniuk]]></dc:creator>
            <pubDate>Thu, 10 Aug 2023 22:28:51 GMT</pubDate>
            <atom:updated>2023-08-10T22:28:51.598Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/500/0*bdrjhqKd9csv18TU.png" /></figure><p>Next week, Kamu is turning 3 years old! 🎉</p><p>When we set out to build the “world’s first global data supply chain” we didn’t expect it to be easy … and it’s not. For the last 7 months we had to channel all our resources into things that help us survive as a company: customers, product, and fundraising, so despite the lack of public updates we were hard at work.</p><h3>Funding secured</h3><p>Our efforts paid off — the Kamu project has secured more funding!</p><p>Addressing systemic issues of data economy might seem akin to a “hot startup”, but in reality, the Enterprise and Web3 Data markets have grown so complex that few possess a clear perspective on them anymore. Building something outside the established buckets (like data lake, catalog, indexer, oracle) and pursuing a fundamental shift also means “big risks” for investors. If very few are properly equipped to assess these risks, even fewer are willing to take them during the market downturn.</p><p>We are deeply grateful to exceptional people in <strong>Revere</strong>, <strong>Protocol Labs</strong>, <strong>Faber</strong>, and <strong>Gagra Ventures</strong> for taking the time to understand us, for sharing our vision and for believing in our ability to bring it to life.</p><p>And a very special thanks to <strong>Dell Technologies</strong>, <strong>Infinity</strong>, the <strong>University of Groningen</strong>, and other members of the <a href="https://www.project-odds.com/">“Open Data for Data Science” consortium</a> for their support and for connecting us with so many scientists, field experts, research institutes, and companies.</p><h3>Growing the team</h3><p>We not only have the means to keep moving forward but can also finally expand our team!</p><p>When members of our community were asking about working with us we previously had nothing to offer. 
<em>“We worked without a salary for 2 years — join us!”</em> was not the job pitch I ever wanted to give. Things are changing now!</p><p>We are now <a href="https://www.kamu.dev/jobs/">looking for exceptional people</a> in three key areas:</p><ul><li>Data Eng <em>(Arrow, DataFusion, Spark, Flink, blockchains)</em> — to work on our novel data formats, protocols, and data engines for <strong>verifiable</strong> and <strong>privacy-preserving</strong> data processing.</li><li>Backend <em>(Rust, Kubernetes, multi-cloud)</em> — to work on our <strong>decentralized</strong> data network and its supporting services</li><li>Frontend <em>(Angular, GraphQL, data viz)</em> — help us build the first <strong>collaborative</strong> data processing platform, a “GitHub for Data Pipelines”</li></ul><h3>Technology updates</h3><h3>CLI Tool</h3><p><a href="https://github.com/kamu-data/kamu-cli">kamu-cli</a> was released in early 2021 as a technology prototype that made (still very novel) <a href="https://www.kamu.dev/blog/end-of-batch-era/">stream processing</a> far more accessible and combined it with blockchain-like provenance and verifiability.</p><p>We were pleasantly surprised that our tool attracted the attention of a very experienced and technical crowd. So many CTOs, CDOs, and experts from fields like science, healthcare, finance, insurance, and web3 gave it a try. 
Thank you all for your feedback!</p><p><strong>Our plan</strong> is to continue to evolve kamu-cli into a powerful <strong>“Swiss army knife” for data flows</strong> - a new alternative to <strong>“data-as-code”</strong> tools like dbt that takes an opinionated approach to how data should be managed internally and externally for us to achieve:</p><ul><li>Provenance and accountability</li><li>Superior automation and data consistency at low latencies</li><li>Collaborative data cleaning and aggregation</li><li>And other foundational pieces of <strong>reproducible science</strong> and a <strong>functioning data economy</strong>.</li></ul><h3>Compute Node</h3><p>We are now working to evolve the technology you see in the tool into a <em>“deploy-anywhere”</em> data processing service.</p><p><strong>Kamu Compute Node</strong> is a set of Kubernetes-native applications that can be deployed in any cloud or on-prem to:</p><ul><li>Operate the stream processing pipelines for a certain set of data flows</li><li>Continuously verify datasets that you are interested in to catch malicious behavior</li><li>Serve ad-hoc batch queries and oracle queries for ingesting data into smart contracts.</li></ul><p>Compute nodes are the building blocks of the <a href="https://www.kamu.dev/blog/introducing-odf/">Open Data Fabric</a> network. 
Unlike blockchain nodes that maintain a single ledger, Kamu compute nodes can form <strong>loosely connected clusters</strong> based on the vested interest of their operators in certain types of data.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*6r7hSRPgjOXYbEAK.png" /><figcaption>Roles in ODF network</figcaption></figure><p>Rather than a flat peer-to-peer network, it’s a federated network where:</p><ul><li>A university that cares about the reproducibility of science can contribute storage resources to replicate research datasets,</li><li>A government can provide compute resources to keep civic healthcare data pipelines running, or</li><li>A company using crowdsourced weather data can dedicate resources to continuously verify these communal data pipelines to detect malicious actors.</li></ul><p>Our <strong>work so far</strong> has focused on:</p><ul><li>Helm charts — making it easy to install the node in Kubernetes</li><li>Closer S3 and IPFS integration — letting our engines read and transform externally-hosted data while minimizing data transfers</li><li>Multi-tenancy, authentication, authorization</li><li>Work scheduling, capacity control, and backpressure.</li></ul><p>As with the rest of our tech, all <a href="https://github.com/kamu-data">code is open</a> and released under a <a href="https://docs.kamu.dev/contrib/license/">time-delayed Apache license</a>, which is free to use from the get-go for almost any organization.</p><h3>Web Interface</h3><p>We are continuing to build a “window” into the Open Data Fabric network.</p><p>You can think of it as a mix of:</p><ul><li><em>GitHub</em> — as a place for discovering, collaborating on, and governing data:</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*C0OshJTq6MNB8ZaU.png" /></figure><ul><li><em>Etherscan</em> — as a place to inspect the dataset metadata ledgers, see verification attestations, and view the state of the network:</li></ul><figure><img alt="" 
src="https://cdn-images-1.medium.com/max/1024/0*PluMimMC487U-Ppo.png" /></figure><ul><li><em>Snowflake</em> — as a place to do ad-hoc exploratory data analysis:</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*b6p61aFkBMwvzeEg.png" /></figure><ul><li>And <em>Jupyter Notebooks</em> — providing basic analytics/dashboarding functionality</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/700/0*EL83jzEtRJBCTQRV.jpeg" /></figure><p>There will be a public instance coming, but you will always be able to <strong>run it yourself</strong>. It even comes embedded in kamu-cli so you can run it to explore your local workspace:</p><pre>kamu ui</pre><p>Our <strong>work so far</strong> was focused on:</p><ul><li>Covering most of the read &amp; explore functionality (search, pipeline graph, metadata block explorer)</li><li>Responsive data querying and exploration</li><li>Dataset creation and editing UI, including especially complex pieces of UI for ingesting data from external sources.</li></ul><p>Web UI interacts with the Compute Node via <strong>GraphQL API</strong> which you can also use for all kinds of automation.</p><p><a href="https://github.com/kamu-data/kamu-web-ui">GitHub repo</a></p><h3>New DataFusion Engine</h3><p>As you may know, Kamu / ODF does not dictate the use of any specific data processing framework or language. Any engine can be integrated as a plug-in. For our <a href="https://docs.kamu.dev/cli/supported-engines/">initial engines</a>, we chose Spark and Flink as the most mature implementations of the bitemporal stream processing model.</p><p>However, these engines were developed with an enterprise setting in mind. They are designed to operate in large long-running clusters, where a startup time of 10–15 seconds and the memory-hungry JVM didn’t matter that much. 
The quick checkpoint-to-checkpoint nature of processing in kamu-cli goes against this, with 90% of processing time usually spent waiting for engines to boot … it’s a poor user experience.</p><p>Recently we have completed a prototype <a href="https://github.com/kamu-data/kamu-engine-datafusion">integration</a> of the <a href="https://github.com/apache/arrow-datafusion">Apache DataFusion</a> engine — a rapidly-developing Rust-based data processing framework built on top of <a href="https://arrow.apache.org/">Apache Arrow</a> with a powerful SQL API. While this engine has <a href="https://github.com/kamu-data/kamu-engine-datafusion#features">some limitations</a>, it can be <strong>many orders of magnitude faster</strong> than Spark or Flink in many use cases.</p><p>We are very excited about this <strong>new generation of data engines</strong>. We hope they will pay closer attention to the temporal / streaming data processing paradigm. And as a project that is constantly seeking common ground among the wide variety of implementations, we are strong supporters of the <a href="https://arrow.apache.org/">Apache Arrow</a> ecosystem and the <a href="https://substrait.io/">Substrait</a> project, which aim to unify how we compute on data.</p><p>We are already working on replacing Spark with DataFusion in our data ingestion path, so expect kamu to get much faster!</p><h3>Fast &amp; Efficient Data Transfer</h3><p>From the very beginning, the <a href="https://github.com/open-data-fabric/open-data-fabric/">ODF protocol</a> has followed a <strong>spec-first approach</strong>, and we remain committed to designing and building in the open and seeking opportunities for standardization.</p><p>As a sample of this low-level protocol work, consider <a href="https://github.com/open-data-fabric/open-data-fabric/blob/master/rfcs/008-smart-transfer-protocol.md">RFC-008: Smart Transfer Protocol</a>. 
When you think about <strong>efficient and standard structured data transfer protocols</strong>, what comes to your mind?</p><ul><li>Is it the “good” old JDBC/ODBC?</li><li>Or REST+JSON APIs, which are not standard at all?</li><li>If you are “in the know”, perhaps <a href="https://arrow.apache.org/docs/format/Flight.html">Arrow Flight</a>?</li></ul><p>There is a <strong>big gap</strong> in these protocols:</p><p>Often the server controlling the access to data is not the one storing it. If our Compute Node (server) is deployed in EKS while all data is in S3 (storage) and clients often want to download large datasets — all of the data will have to flow <strong>through</strong> the server, creating a lot of extra work and a bigger bill.</p><p><strong>ODF Smart Transfer Protocol</strong> takes care of this problem by letting the server point the client to a piece of data potentially served by a different party over a different protocol, <strong>avoiding proxying</strong>. In the above case, it will return to the client a list of pre-signed S3 URLs to download Parquet files from.</p><p><em>Note: The </em><a href="https://github.com/delta-io/delta-sharing/blob/main/PROTOCOL.md"><em>Delta Sharing</em></a><em> protocol by Databricks is the closest thing to this, but unfortunately was too Spark-specific for us to adopt it.</em></p><p>In the future, we are planning to extend it to handle <strong>in-band data transfer</strong> as well, via the <a href="https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/">Arrow Flight protocol</a>, so that depending on the query the system could pick the most efficient way to transfer data to the client.</p><h3>What’s next?</h3><p>We will soon be publishing a <strong>public roadmap</strong> so you can both follow our progress and provide input for future development!</p><p>A few upcoming things I’m most excited about are:</p><ul><li>DataFusion-based ingest</li><li>Laying the foundation for granular authorization (based on <a 
href="http://ucan.xyz/">UCAN protocol</a>)</li><li>IPLD migration</li><li>Documentation upgrade.</li></ul><p>So stay tuned! You can always find us on <a href="https://discord.gg/nU6TXRQNXC">Discord</a> as well as many other like-minded people happy to chat about everything data-related.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=8524645124" width="1" height="1" alt=""><hr><p><a href="https://medium.com/kamu-data/kamu-update-funding-hiring-tech-8524645124">Kamu Update: Funding, hiring, tech</a> was originally published in <a href="https://medium.com/kamu-data">Kamu Data</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Kamu at Compute Over Data Summit]]></title>
            <link>https://medium.com/kamu-data/kamu-at-compute-over-data-summit-8a7812bc5acd?source=rss----7bbcdfcb365---4</link>
            <guid isPermaLink="false">https://medium.com/p/8a7812bc5acd</guid>
            <category><![CDATA[blockchain-development]]></category>
            <category><![CDATA[data-science]]></category>
            <dc:creator><![CDATA[Sergii Mikhtoniuk]]></dc:creator>
            <pubDate>Sun, 22 Jan 2023 03:17:56 GMT</pubDate>
            <atom:updated>2023-01-22T03:17:56.031Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/591/1*JCryA_2AY6wtqVHR-eB5Gw.png" /></figure><p>I had a great opportunity to present at the <a href="https://www.codsummit.io/">Compute Over Data Summit</a> organized by Protocol Labs in Lisbon.</p><p>In my talk titled <strong>“Structured Data in Web3 — Rethinking databases”</strong> I covered:</p><ul><li>Major trends in OLTP and OLAP that already build towards what we would consider Web3 properties of data</li><li>How the Web3 data community can reuse much of the momentum already happening in modern data lakes</li><li>The need to standardize how we represent structured data in Web3</li><li>Open Data Fabric as a potential solution to composable processing of structured data</li></ul><p>You can find the recording here:</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FZQ-MdKj3BjU%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DZQ-MdKj3BjU&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FZQ-MdKj3BjU%2Fhqdefault.jpg&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/17dcd7ebe958bdc3961cf53a8d02d74f/href">https://medium.com/media/17dcd7ebe958bdc3961cf53a8d02d74f/href</a></iframe><p>My <strong>personal highlight</strong> of the summit was when Peter Wang from Anaconda was asked <em>“if you could wave a magic wand — what system you’d want to exist in the world — it can be ANYTHING”</em> … and he <a href="https://youtu.be/WqquUQDgHj0?t=33865">went on to describe</a> something that sounds exactly like Kamu:</p><ul><li>Global registry of datasets</li><li>Data and code combined in one system</li><li>Verifiability and provenance of all derived data</li></ul><p>I’d also recommend checking out <a 
href="https://youtu.be/WqquUQDgHj0?t=13907">Juan Benet’s talk on the future of COD</a> and the high-energy <a href="https://youtu.be/WqquUQDgHj0?t=10409">demo of Bacalhau from David Aronchick</a>.</p><p>Big thanks to Protocol Labs for organizing the event and to everyone who attended!</p><p>It was especially great to meet some of <a href="https://discord.gg/nU6TXRQNXC">our Discord</a> members in person.</p><p>Till next time!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=8a7812bc5acd" width="1" height="1" alt=""><hr><p><a href="https://medium.com/kamu-data/kamu-at-compute-over-data-summit-8a7812bc5acd">Kamu at Compute Over Data Summit</a> was originally published in <a href="https://medium.com/kamu-data">Kamu Data</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Kamu Update: We join CDL and CODWG]]></title>
            <link>https://medium.com/kamu-data/kamu-update-we-join-cdl-and-codwg-bdbcfda349cf?source=rss----7bbcdfcb365---4</link>
            <guid isPermaLink="false">https://medium.com/p/bdbcfda349cf</guid>
            <category><![CDATA[open-data]]></category>
            <category><![CDATA[desci]]></category>
            <category><![CDATA[web3]]></category>
            <dc:creator><![CDATA[Sergii Mikhtoniuk]]></dc:creator>
            <pubDate>Mon, 12 Sep 2022 18:33:13 GMT</pubDate>
            <atom:updated>2022-09-12T18:33:13.304Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/662/1*SebyIW67PmeN5UZb9zhIkg.png" /></figure><h3>Kamu is selected for the CDL program</h3><p>We are very happy to announce that <a href="https://kamu.dev">Kamu</a> has been accepted into <a href="https://creativedestructionlab.com/">Creative Destruction Lab</a> — a program for early-stage science and technology companies. We will be joining the new <a href="https://creativedestructionlab.com/streams/compute/">Compute Stream</a> in Vancouver that focuses on “technologies that will impact humankind in the same ways the printing press, the personal computer, and the internet did”. This is quite an expectation to live up to, and it definitely reflects our mission — to bring humanity to the next level of data-driven decision making and bootstrap a new era of digital economy based on fair and rapid data exchange.</p><h3>We join Protocol Labs “Compute Over Data” Working Group</h3><p>Following our graduation from the Faber-Filecoin Web3 accelerator, Protocol Labs — the company behind such amazing projects as IPFS and Filecoin — invited us to join their newly-formed <a href="https://www.cod.cloud/">Compute Over Data Working Group</a>. 
We are proud to work alongside many amazing companies to collectively address the problem of decentralized data processing.</p><p>Companies in the group tackle a wide variety of problems like:</p><ul><li>Sandboxed computation environments that can run co-located with data (in WASM VMs or containers)</li><li>Verifiable computations and identifying malicious actors</li><li>Decentralized ownership and authorization</li><li>Privacy-preserving computations</li><li>Web3-native databases</li><li>General-purpose compute networks (think decentralized alternatives to AWS)</li></ul><p>With so much activity in this space we see a clear niche <strong>where Kamu can bring the most value</strong>:</p><ul><li>Structured data processing — WASM and general-purpose compute are great, but data processing requires much more robust and higher-level primitives. We need interoperable data and schema formats, and data-centric processing languages like SQL that go beyond a single dataset (e.g. in the case of JOINs)</li><li>Dynamic data — in addition to one-shot processing tasks, how do we represent dynamic data sources (e.g. 
IoT devices, medical records) in decentralized and content-addressable storage and how do we build pipelines that can continuously process them</li><li>Bridging Web3 with existing ecosystems of enterprise, government, and research data — for Web3 data to become mainstream we need to provide a smooth transition path for existing organizations through the use of standard analytical data formats, languages, and APIs, and allow them to pick a comfortable level of decentralization.</li></ul><p>Here you can find <a href="https://www.youtube.com/watch?v=GieOLV1IlmE">our introduction</a> and the <a href="https://www.youtube.com/watch?v=1uJkawkYoEI">technology demo</a> we presented to the group.</p><h3>Technology updates</h3><p>Our progress on technology in the past two months includes:</p><ul><li>Support for <a href="https://docs.kamu.dev/cli/collab/ipfs/">publishing and syncing data from IPFS</a> — our first integration with a content-addressable file system went very smoothly, as we originally designed our protocol around this</li><li>New chapter in our <a href="https://docs.kamu.dev/cli/get-started/self-serve-demo/">self-serve demo</a> showcases using Kamu for Web3 data analytics — follow it to build a complex pipeline that combines data from the Ethereum blockchain and Web2 data sources</li><li>Updates to the core protocol’s block structure that allowed us to significantly improve sync efficiency</li><li>We’ve put a major effort into our <a href="https://docs.kamu.dev/cli/get-started/first-steps/#web-ui">web frontend’s</a> internals — it’s mostly not user-facing, but sets us up for rapid feature development to simplify the user experience.</li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=bdbcfda349cf" width="1" height="1" alt=""><hr><p><a href="https://medium.com/kamu-data/kamu-update-we-join-cdl-and-codwg-bdbcfda349cf">Kamu Update: We join CDL and CODWG</a> was originally published in <a 
href="https://medium.com/kamu-data">Kamu Data</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>