Stories by Bhalisa Sodo on Medium

NFTs for Data

Bhalisa Sodo — Tue, 01 Mar 2022 12:56:57 GMT

Data NFTs allow for more power & flexibility in managing & selling data

https://www.jotform.com/blog/40-beautiful-and-amazing-aquatic-life-photos/

I asked Trent McConaghy of Ocean Protocol, one of the brightest minds at the intersection of artificial intelligence and blockchain, what the purpose of Data NFTs really is. I had mischaracterised them as tools that enable data owners to unlock multiple revenue streams. It turns out that “The main idea of Data NFTs is more to explicitly represent base IP (e.g. copyright) which allows for more power & flexibility in managing & selling data”. This addresses a fundamental issue: the legal documents that back copyright are ambiguous and open to interpretation[4]. With the help of NFTs and blockchain technology, copyright can be represented transparently and unambiguously on an immutable public ledger. Transferring ownership and rights therefore becomes much smoother, cutting immense legal and transfer costs and removing the middleman altogether. This post focuses solely on intellectual property in the context of copyright, and particularly on data ownership and its transfer.

I think this matters because data is becoming more democratised and ubiquitous with the proliferation of public data marketplaces like Ocean Market.

Fungible and Non-Fungible Tokens (NFTs)

Fungible tokens, like the cryptocurrencies we are accustomed to on the Ethereum blockchain, follow the ERC-20 token standard, meaning that tokens of the same smart contract are interchangeable. Take the dollar as an example: any 1 dollar note is worth exactly the same as the next 1 dollar note. It does not matter where you got that dollar from; it is homogeneous. You can also think of apples from the same tree: you could trade one for another, provided they do not differ fundamentally.

Non-fungible tokens, on the other hand, follow the ERC-721 token standard. Each token is distinct from every other token of the same smart contract. NFTs are used to identify something or someone in a unique way. According to ethereum.org, this type of token is perfect for platforms that offer collectible items, access keys, lottery tickets, numbered seats for concerts and sports matches, etc. Tokens of this standard are therefore great candidates to represent a one-of-a-kind item: no other token can stand in for it.
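To make the contrast concrete, here is a minimal Python sketch of the two bookkeeping models. It is only an illustration: the class and method names are hypothetical and do not reproduce the actual ERC-20 or ERC-721 interfaces, which are Solidity smart contracts. The fungible ledger tracks how much each account holds, while the non-fungible ledger tracks who holds each specific token.

```python
from collections import defaultdict


class FungibleLedger:
    """Toy ERC-20-style ledger (hypothetical): only the amount held matters."""

    def __init__(self):
        self.balances = defaultdict(int)

    def mint(self, to, amount):
        self.balances[to] += amount

    def transfer(self, sender, recipient, amount):
        if self.balances[sender] < amount:
            raise ValueError("insufficient balance")
        self.balances[sender] -= amount
        self.balances[recipient] += amount


class NonFungibleLedger:
    """Toy ERC-721-style ledger (hypothetical): each token ID has one owner."""

    def __init__(self):
        self.owner_of = {}

    def mint(self, to, token_id):
        if token_id in self.owner_of:
            raise ValueError("token ID already exists")
        self.owner_of[token_id] = to

    def transfer(self, sender, recipient, token_id):
        if self.owner_of.get(token_id) != sender:
            raise ValueError("sender does not own this token")
        self.owner_of[token_id] = recipient


coins = FungibleLedger()
coins.mint("alice", 10)
coins.transfer("alice", "bob", 3)            # any 3 units will do

collectibles = NonFungibleLedger()
collectibles.mint("alice", token_id=1)
collectibles.mint("alice", token_id=2)
collectibles.transfer("alice", "bob", token_id=2)   # this specific token moves
```

Note that the fungible transfer takes an amount, whereas the non-fungible transfer has to name the exact token being moved.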

The blockchain game CryptoKitties ushered NFTs into the mainstream in 2017. This decentralised application (dapp) allows players to purchase, collect, breed and sell virtual cats. A CryptoKitty can have unique attributes, from colour and family history to genetic makeup. Because they are non-fungible, secure, immutable and transparent, as enforced by the Ethereum blockchain, NFTs are great candidates to represent ownership of a specific ‘bundle of rights’.

Base Intellectual Property (IP) and Sub-licenses

Let us consider an artist who has made a painting. Initially, the artist has full ownership of, or copyright to, the painting[3], until the painting is sold to a collector, who in turn claims all ownership of the copyright. The copyright to the painting is its base IP. I would relate this to data sold on a public market: when a data owner publishes a dataset, say on Ocean Market, they immediately assume copyright over their intellectual property, provided it is rightfully theirs.

Now let us consider the author of a book[3]. The author owns all claims to the copyright of the book and exclusively licenses the base IP to a particular publisher. The exclusive license earns the publisher the right to sell and profit from the book. The publisher then distributes digital copies (eBooks) of the hard copy, each of which carries a fungible sub-license to the base IP[3]. This means that all sub-licenses are the same and entail the same set of rules on how the eBook can and cannot be used by the consumer.

Data NFTs

The underlying idea is to tokenise base IP and its sub-licenses, thereby representing them more explicitly in machine-readable ‘dry code’ rather than in conventional human-readable ‘wet code’, such as lengthy legal documents that can be interpreted differently depending on who you are in the legal process[4]. Bitcoin introduced a ‘dry code’ approach to electronic money, enabled by blockchain. In the real world, ownership of assets is defined through access control (if you hold it, you own it); digital assets are no different, as Bitcoin made possible[1]. Data NFTs combine base IP, which is best implemented by some incarnation of the ERC-721 standard, with sub-licenses to base IP, which are best implemented by way of the ERC-20 standard.

Let’s put this into perspective. Ocean Market is a data marketplace that allows data owners to publish & monetise their data, and scientists to consume the data by training their models on it for a fee.

In the beginning, the publisher has sole ownership of their copyright or base IP (ERC-721) and can choose to mint any number of sub-licenses (ERC-20), which are all under the publisher's control at first. The publisher can then either:

Transfer the fungible sub-licenses to third parties,
Or transfer the non-fungible base IP, in turn ceding all copyright of the data and its sub-licenses.
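The two options above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Ocean Protocol's contracts: the `DataNFT` class, its method names and the account names are all hypothetical. It also assumes that sub-licenses already issued stay with their holders when the base IP changes hands, and that only the right to mint new ones moves with it.

```python
class DataNFT:
    """Toy model (hypothetical): one base-IP token plus fungible sub-licenses."""

    def __init__(self, publisher, dataset_name):
        self.dataset_name = dataset_name
        self.owner = publisher        # holder of the base IP (ERC-721 role)
        self.sublicenses = {}         # account -> count (ERC-20 role)

    def mint_sublicenses(self, caller, amount):
        # Only the current base-IP owner may create new sub-licenses.
        if caller != self.owner:
            raise PermissionError("only the base-IP owner can mint")
        self.sublicenses[caller] = self.sublicenses.get(caller, 0) + amount

    def transfer_sublicenses(self, sender, recipient, amount):
        if self.sublicenses.get(sender, 0) < amount:
            raise ValueError("insufficient sub-licenses")
        self.sublicenses[sender] -= amount
        self.sublicenses[recipient] = self.sublicenses.get(recipient, 0) + amount

    def transfer_base_ip(self, sender, recipient):
        # Assumption: issued sub-licenses are untouched; minting rights move.
        if sender != self.owner:
            raise PermissionError("only the base-IP owner can transfer it")
        self.owner = recipient


nft = DataNFT("publisher", "example-dataset")
nft.mint_sublicenses("publisher", 100)
nft.transfer_sublicenses("publisher", "scientist", 1)   # option 1
nft.transfer_base_ip("publisher", "new_owner")          # option 2
```

After the last call, the original publisher can no longer mint sub-licenses, because that right follows whoever holds the base IP.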

Protecting and transferring copyright has been a tedious, long, error-prone and expensive process because of numerous legal nuances, technicalities, legal costs and the semantics of human-readable language. NFTs propose a new way in which ownership can be seamlessly proven, protected, and transferred at a fraction of the cost, without ambiguity.

This post has been kept purposely simple to convey the main idea of how NFTs can be used in intellectual property, particularly the copyright of data shared in public data marketplaces, so that it is readable and easy to understand for a larger non-technical audience. For a more in-depth and technical explanation of the relationship between NFTs and intellectual property, I highly recommend reading Trent McConaghy’s ‘NFTs and IP’ series.

Check out Trent McConaghy’s ‘NFTs and IP’ series:

[1] NFTs & IP 1: Practical Connections of ERC721 with Intellectual Property

[2] NFTs & IP 2: Leveraging ERC20 Fungibility

[3] NFTs & IP 3: Combining ERC721 & ERC20

[4] Wet and dry code, by Nick Szabo

Privacy Preservation in Web 3.0

Bhalisa Sodo — Fri, 28 Jan 2022 11:08:51 GMT

How Compute-to-Data Relates to Differential Privacy and Federated Learning

Introduction

We outline the key similarities and differences between Compute-2-Data, Differential Privacy, and Federated Learning, all of which are means to facilitate the sharing of sensitive data while preserving privacy, that is, withholding the sensitive contents of the data while compute jobs are run on it. The comparison is largely informed by how the three technologies are used by Ocean Protocol, OpenMined, and Google AI respectively, and is written in layman's terms.

As things stand, sensitive data is largely transferred between parties by sharing a copy of the dataset. Parties must ensure that the data lands in the right (intended) hands and is not intercepted by malicious actors who might misuse it. However, this requires a copy of the data to leave the premises of the owner, which creates a fundamental trade-off between the benefits of sharing the data with someone and the risk of them misusing it. According to OpenMined, Remote Data Science alleviates this problem by making it possible for one person to answer a question using data owned by another, without ever seeing or acquiring a copy of that data. Let us investigate Compute-2-Data, Differential Privacy and Federated Learning as privacy-preserving layers in transactions that involve running compute jobs and analytics models on data not owned by the analysts.

OpenMined and Differential Privacy

OpenMined uses servers deployed by data owners to store data, which analysts who need answers to questions can later query remotely. In this implementation, the basic flow of how data is sent to and queried from the server looks something like this[1]:

Use the HAGrid command-line interface to deploy and communicate with the data owner's server
Host your data for study on a PyGrid server
Study the data remotely using the PySyft library
Privacy is preserved by Differential Privacy (DP)

Differential Privacy (DP)

Differential Privacy ensures privacy by adding random noise to the result of a query. This strikes a balance between data consumers getting some utility from the data and the data provider not having to forgo a data subject’s privacy in the process. Though DP does a good job of obfuscating the sensitive contents of the data through noise, if enough queries are run on a dataset, the combined output can become an accurate estimate of the ground truth, thereby breaching privacy. This is because, in theory, privacy loss accumulates with every additional query: asking the same question twice loses twice the privacy[2]. This can be mitigated by allocating a ‘privacy budget’ that limits the scope of the job(s) a data consumer can perform on the data, and by enforcing mandatory noise, calibrated by the privacy parameter epsilon, in every function that attempts to query the data from the server.
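As a rough illustration of noise plus a privacy budget, the sketch below answers counting queries using the Laplace mechanism, a standard way of adding calibrated noise. It is not OpenMined's implementation: the `PrivateCounter` class, its parameters and the sample data are hypothetical, and the budget accounting simply adds up the epsilon spent on each query.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials is Laplace-distributed.
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


class PrivateCounter:
    """Hypothetical helper: noisy counts under a total privacy budget."""

    def __init__(self, records, total_epsilon, seed=None):
        self._records = list(records)
        self.remaining_epsilon = total_epsilon
        self._rng = random.Random(seed)

    def noisy_count(self, predicate, epsilon):
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if epsilon > self.remaining_epsilon:
            raise RuntimeError("privacy budget exhausted")
        self.remaining_epsilon -= epsilon   # losses add up across queries
        true_count = sum(1 for record in self._records if predicate(record))
        # A count changes by at most 1 when one person is added or removed,
        # so the sensitivity is 1 and the noise scale is 1 / epsilon.
        return true_count + laplace_noise(1 / epsilon, self._rng)


ages = [23, 35, 41, 52, 67, 29, 38]          # made-up example data
counter = PrivateCounter(ages, total_epsilon=1.0, seed=0)
first = counter.noisy_count(lambda age: age > 30, epsilon=0.5)
second = counter.noisy_count(lambda age: age > 30, epsilon=0.5)
# A third query with epsilon=0.5 would raise RuntimeError: the budget is spent.
```

A smaller epsilon means more noise per answer but a slower drain on the budget, which is the utility versus privacy trade-off described above.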

One simple way to add such noise is an algorithmic coin flip that introduces ‘plausible deniability’. For example, in a study with binary outcomes (Yes/No) where each participant answers truthfully on heads and lets a second coin flip decide the answer on tails, there is a 1 in 4 chance that any given answer is wrong, so you cannot trust what you learn by handpicking the outcome that pertains to a single individual.
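The coin flip can be sketched as follows, assuming the classic two-coin "randomized response" scheme: on heads the participant answers truthfully, on tails a second flip picks the answer. The function names and the simulated population are hypothetical. The sketch also shows why the data remains useful: the population-level rate can still be recovered even though no single report can be trusted.

```python
import random


def randomized_response(truth, rng):
    """Report a yes/no answer with plausible deniability."""
    if rng.random() < 0.5:        # first coin: heads -> answer truthfully
        return truth
    return rng.random() < 0.5     # tails -> a second coin decides the answer


def estimate_true_yes_rate(reports):
    # P(report yes) = 0.5 * p + 0.25, so p = 2 * (observed - 0.25).
    observed = sum(reports) / len(reports)
    return 2 * (observed - 0.25)


rng = random.Random(42)
truths = [i < 6000 for i in range(20000)]    # simulated: 30% true "yes"
reports = [randomized_response(t, rng) for t in truths]

# Roughly a quarter of individual reports differ from the truth ...
wrong_rate = sum(r != t for r, t in zip(reports, truths)) / len(truths)
# ... yet the population-level rate is still recoverable.
estimate = estimate_true_yes_rate(reports)
```

A wrong report requires tails on the first flip and the opposite answer on the second, which is where the 1 in 4 figure comes from.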

Simply Explained

To avoid exposing undisclosed private information about a data subject, a rule of thumb is that an insight about a dataset must stay essentially the same with or without a particular person’s information. This way, the analysis exposes no new truth about any particular person. In other words, an analysis should tell us more about the population and nothing about a single person[3].

Kobbi Nissim, et al.
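For readers who want the formal version of this rule of thumb, the standard textbook definition of ε-differential privacy is given below. It is not specific to OpenMined, and the symbols are introduced here only for illustration. A randomised analysis M is ε-differentially private if, for any two datasets D and D' that differ in one person's record and for any set of possible outputs S:

```latex
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[\mathcal{M}(D') \in S]
```

The smaller ε is, the less the output can depend on any single person's data.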

Ocean Protocol and Compute-2-Data

Decentralised data markets allow anyone to monetise their data as long as other market participants can agree on the value or price of that data. As a result, they open up data sets that were previously unavailable for research and artificial intelligence. However, this was not done in the most secure and privacy-preserving manner until the introduction of Compute-to-Data in Ocean Market.

Ocean Market is a Web3 data marketplace where data publishers allow data consumers (data scientists, researchers, etc.) to train algorithms on their data while preserving privacy, without the data ever having to be copied or leave the publisher’s premises. Users can publish both data sets and algorithms, collectively referred to as data assets.

Compute-2-Data (C2D)

Compute-2-Data allows data to be shared without it having to leave the premises or compromise the data subjects’ privacy. Consumers purchase compute jobs on the data to improve the accuracy of their AI models or to derive relevant insights. Either the publisher's own algorithms or third-party algorithms can be used to analyse the data. The image below illustrates the general user flow.

The publisher can choose between two consumption access permissions, download and compute:

Download: this access type is probably best for non-personal data (e.g. climate change related data).
Compute: this access type is best for personal data (e.g. health records) whose exposure would likely pose a risk.

With compute access, only the algorithm has viewing rights to the data. Therefore, no one except the data owner can know who is implicated in the data set, and in what way. The analysts are privy only to the algorithm's outputs and not to the contents of the dataset.
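The difference between the two access types can be sketched in plain Python. This is a conceptual toy, not Ocean Protocol's API: the `DataAsset` class, its methods and the sample records are hypothetical. Note that the sketch does nothing to stop an algorithm from returning the raw rows, so a real deployment would also need to control which algorithms are allowed to run.

```python
class DataAsset:
    """Toy data asset (hypothetical): access is either download or compute."""

    def __init__(self, rows, access="compute"):
        self._rows = list(rows)      # stays on the publisher's premises
        self.access = access         # "download" or "compute"

    def download(self):
        if self.access != "download":
            raise PermissionError("this asset only allows compute access")
        return list(self._rows)

    def run_compute(self, algorithm):
        # The algorithm runs next to the data; only its output is returned.
        return algorithm(self._rows)


def average_age(rows):
    return sum(row["age"] for row in rows) / len(rows)


health_records = DataAsset(
    [{"name": "A", "age": 30}, {"name": "B", "age": 50}],   # made-up rows
    access="compute",
)
result = health_records.run_compute(average_age)   # the analyst sees only this
```

Calling `health_records.download()` here raises `PermissionError`, mirroring the idea that the analyst is privy only to the algorithm's output.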

Google AI & Federated Learning

According to Google AI, Federated Learning enables mobile phones to collaboratively learn a shared prediction model while keeping all the training data on device [4]. This means that the data no longer has to be copied from the device to the cloud. For the first time, the model can be trained on the device while analysts are only privy to the model updates.

Google AI

The image above demonstrates how this is achieved in the following steps:

Device downloads current model
The model learns on the device and is improved
A summary of changes as a small focused update is derived
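The three steps above can be sketched for a single device. This is a minimal illustration with a one-parameter model y = weight * x trained by gradient descent. The function name, hyperparameters and data are hypothetical and are not Google's implementation.

```python
def local_update(global_weight, local_data, learning_rate=0.1, epochs=5):
    """Steps 1-3 on one device for a toy model y = weight * x."""
    weight = global_weight                       # 1. download current model
    for _ in range(epochs):                      # 2. learn on the device
        gradient = sum(2 * x * (weight * x - y) for x, y in local_data)
        weight -= learning_rate * gradient / len(local_data)
    return weight - global_weight                # 3. small focused update


phone_data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]   # stays on the phone
update = local_update(global_weight=0.0, local_data=phone_data)
```

Only `update`, a single number in this toy case, would leave the device; `phone_data` never does.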

The punchline is that only the update is sent to the cloud. The update is averaged with other devices’ updates in the network to apply improvements to the shared model. The improved model is immediately available to augment the personalisation of the user experience. The model is trained while the phone is idle, plugged in, and on a free wireless connection, so mobile phone performance is not negatively affected[4]. To ensure privacy and security, Google AI developed a Secure Aggregation protocol, which enables a coordinating server to decrypt only the average update, meaning that no individual's update can be inspected. Currently, this development mostly centres on Gboard functionality on Android-powered mobile phones.
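The averaging step, and the intuition behind aggregation that hides individual updates, can be illustrated with a toy example. It uses pairwise random masks that cancel out in the sum. This shows only the core idea: the real Secure Aggregation protocol relies on cryptographic key agreement and handles devices dropping out, none of which is modelled here. The function names and numbers are hypothetical.

```python
import random


def mask_updates(updates, rng):
    """Add pairwise masks that cancel out when all updates are summed."""
    masked = list(updates)
    for i in range(len(updates)):
        for j in range(i + 1, len(updates)):
            mask = rng.uniform(-100, 100)   # secret shared by devices i and j
            masked[i] += mask
            masked[j] -= mask
    return masked


def server_average(masked_updates):
    # The server sees only masked values, yet their mean is the true mean.
    return sum(masked_updates) / len(masked_updates)


rng = random.Random(7)
device_updates = [0.20, -0.10, 0.05, 0.25]    # made-up per-device updates
masked = mask_updates(device_updates, rng)
new_weight = 1.0 + server_average(masked)     # apply the averaged update
```

Each masked value on its own looks like noise, but the masks cancel in the sum, so the shared model still receives the true average update.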

According to Trent McConaghy of Ocean Protocol, Federated Learning as implemented by OpenMined could be further improved by Compute-2-Data to manage computation at each silo in a more secure fashion. In fact, we may see some incarnation of a collaboration between Federated Learning & Compute-2-Data in the upcoming Ocean V4 release. Differential Privacy holds potential for Compute-to-Data contexts too.

Useful resources

[1] Remote Data Science in 15 Minutes

[2] A Brief Introduction to Differential Privacy

[3] Differential Privacy: A Primer for a Non-technical Audience

[4] Federated Learning: Collaborative Machine Learning without Centralized Training Data

[5] How Ocean Compute-to-Data Relates to Other Privacy-Preserving Approaches