Stories by Andrei Lapets on Medium

Accessible and Scalable Secure Data Evaluation

Andrei Lapets — Wed, 21 Jul 2021 06:07:20 GMT

Infutor protects customer data using multi-party computation

Members of the Nth Party team have spent the last six years building and enhancing libraries, applications, and products that deliver or incorporate secure multi-party computation (MPC) protocols and capabilities. More recently, browser-based tools released by Nth Party (such as nth.link) have demonstrated that MPC is ready to secure common data workflows and transactions encountered in domains such as marketing and advertising.

Over the last several months, Nth Party has had the opportunity to collaborate with Infutor Data Solutions in building an accessible, scalable secure MPC solution for a common data workflow: evaluation of a prospective customer’s data against a full-scale identity graph. Launched as part of Infutor’s new Test Drive experience, the solution allows prospects and customers to explore how Infutor’s products can add value without revealing their data to Infutor or any other third party.

Secure Data Evaluation at Scale with Infutor

Within the fractured value chains of the data-driven marketing and advertising spaces, companies engage in an innumerable quantity of data transactions on a daily basis. Often, even transactions that may be exploratory in nature (such as those in which a prospective customer is evaluating a vendor’s product or service) involve the transfer of sensitive data about consumers. Conducting these transactions in a safe and compliant manner can require lengthy negotiations, the assembly of legal agreements, the use of data governance tools and platforms, and investment in secure data storage infrastructure or services. But these mitigation techniques are not addressing the root issue: data must be transferred between parties and processed in its decrypted form.

Secure MPC eliminates the root issue: only encrypted data is ever transferred, and no one except the original data owner has a decryption key. This is a key feature of secure MPC protocols, and is sometimes misunderstood: transfer of encrypted data to a recipient who has no key involves no transfer of information, and thus carries no risk, because the data cannot be decrypted.

Using MPC techniques and protocols, Infutor’s Test Drive experience ensures that data is encrypted in a prospective customer’s browser and never decrypted after that point. Quoting the press release:

“Data security and privacy has always been our #1 priority,” says Gary Walter, Chief Executive Officer of Infutor, “With Nth Party’s encryption technology we don’t need to see the data to demonstrate value. It’s a win-win for us and our clients.”

In only a few minutes, the secure computation compares the customer’s encrypted data to Infutor’s identity graph (which features hundreds of millions of records). The customer can then view the results to instantly explore how Infutor’s products can help them achieve their business objectives.

Accessible and Scalable MPC

Nth Party’s work with Infutor builds on previous product releases and demonstrates once again that, given the right delivery mechanism and configuration, MPC protocols are more than ready to address real-world data analysis workflows at industrial scales. While skepticism is sometimes expressed about the maturity of MPC techniques, we hope this work builds confidence in the marketplace that service providers can offer their customers the strong security benefits of well-studied technologies such as secure MPC and private set intersection (which underlie nth.link, Facebook’s Private-ID, and Google’s Private Join and Compute, among others). Furthermore, the security benefits of MPC can be incorporated into on-premise software-only solutions…

without the need for trusted third parties or clean rooms,
without use of any specialized hardware, and
without moving data (either that of the service providers or their customers) to expensive third-party SaaS platforms.

How was the MPC solution for Infutor’s Test Drive experience designed and developed? Throughout its work on practical use cases involving MPC, the Nth Party team has focused on identifying low-hanging fruit opportunities: find the simplest protocol that leverages asymmetry of participant roles to offer the required privacy benefits at scale, combine it with well-understood and long-established algorithms and data structures, and build software libraries and applications that leverage contemporary serverless computing platforms and operate within ubiquitous environments such as web browsers.

The Nth Party team looks forward to building more commercial MPC applications that help secure data-driven workflows, and to contributing open-source libraries that help everyone build production-quality, at-scale secure applications.

Accessible and Scalable Secure Data Evaluation was originally published in Nth Party on Medium, where people are continuing the conversation by highlighting and responding to this story.

Privacy-Preserving Information Exchange Using Python

Andrei Lapets — Fri, 30 Oct 2020 18:01:00 GMT


Imagine a scenario involving a vendor that offers two distinct digital products for sale at the same price (such as mobile apps or digital content) and a customer that would like to purchase exactly one of these two products. Furthermore, suppose that the customer and vendor are interested in performing this transaction in a way that preserves the privacy of the customer. To be more specific, whatever approach is used needs to satisfy the following two criteria:

the customer does not disclose to the vendor which of the two digital products they are purchasing, but
the vendor allows the customer to purchase exactly one of the two digital products.

Is this even possible? One approach the vendor and customer can employ is to recruit a trusted third party. This third party can provide a kind of escrow service: retrieve a copy of each of the two products from the vendor and accept the selection from the customer, and then deliver to the customer only a copy of the selected product.

This approach satisfies the two criteria as they were originally stated, but it introduces a number of potential issues. Two of these include (1) the need to recruit (and potentially compensate) a third party and (2) the necessary disclosure of the customer’s selection to the third party. The latter of these may be particularly problematic if the customer actually does not wish to disclose their selection to any third party (which just happens to include the vendor).

Exchanging Information via Oblivious Transfer

In fact, it is possible for the vendor and customer to perform this transaction while satisfying the original two conditions and without relying on a third party. Instead, the vendor and customer could each use a simple piece of software to communicate with one another via a cryptographic protocol for exchanging information known as oblivious transfer, or OT.

A technique first published in 1985 called one-out-of-two oblivious transfer is a form of secure computation that allows two parties to interact in exactly the way the vendor and customer in our scenario would like: a sender can deliver exactly one of two messages to a receiver without knowing which message it delivered. Since that time, generalized and streamlined variants of this technique have been developed, such as a simple protocol published in 2015 by Genç, Iovino, and Rial.

Simple OT using Python

The open-source otc Python library published by Nth Party provides an encapsulated implementation of the protocol published by Genç, Iovino, and Rial. How can the library be used to perform a privacy-preserving transaction that satisfies the original criteria?

The library lets programmers construct one of two objects: a sender object or a receiver object. In the transaction scenario, the vendor is the sender and the customer is the receiver. For the purposes of this example, the two products available for purchase are each a 16-byte string (this is not implausible in practice, as the two 16-byte strings could be cryptographic keys that can be used to decrypt larger files).

At the beginning of a transaction, the vendor must create a sender object (assigned to the variable s in the example below) and send a public key s.public to the receiver. To learn more about why this is called a public key and how public keys are used in secure communication, you may want to delve deeper into the details of public-key cryptography.

>>> import otc
>>> s = otc.send()
>>> s.public # Public key to send to the receiver.
b'\x18\x91\xee\xc9\xe7|\x81k\xf5a\xd2\x9b\xdbc\x92\xe9\x8c\xc4\x1c)\xb6u\x90\xb0\xfc\x91\x04\xc7\x80\xcd~z'

Once they receive the public key s.public, the customer can create a receiver object r and use it to build a query byte string (assigned to the variable selection below) that effectively represents an encrypted request for one of the two items. In the example below, the receiver is requesting item 1 (the choices are 0 or 1). Note that the sender is not able to decrypt this query and cannot determine which item is being requested by examining it.

>>> import otc
>>> r = otc.receive()
>>> selection = r.query(s.public, 1) # Use public key from sender.
>>> selection # Selection to share with sender.
b'z\x01T\xbc\xa8\r2\xf0@v\x16k\xb7_\x01\x1a:\xdd\x8d\xb2\x8du1\xee\x99\xd1\xe0\xd1|\xe5\xad\x11'

Once the sender receives the query byte string selection, they can build a reply (assigned to the variable replies in the example below) consisting of a pair of byte strings. These two byte strings are the encrypted versions of the two products, but only the product originally selected by the receiver can be successfully decrypted. In the example below, the products offered by the vendor are two 16-letter words.

>>> replies = s.reply(
...     selection,
...     'absentmindedness'.encode(),
...     'wholeheartedness'.encode()
... )
... 
>>> replies # Encrypted replies to share with receiver.
(b'\xd8\xda\xdf\xf0\x89JsJ\xb5\x9e0\x0b\xe8Kd\xcf\x1f\x92\xf2\x18\r\xc6r\xdc)\x04\xa0\x990\x93\xc1f', b'\xd0iy3\xa2\xb8\xbf\xefI\x0eF\xf9\rI^\xf9\xaf\x7fwO\xbd\x18\x9cL\x12\xba>\xd2V\xed\xec\xb4')

The receiver can now use the original receiver object r constructed at the beginning to obtain the product (which is selection 1 from the two choices 0 and 1) to which they originally committed when they generated their query.

>>> r.elect(s.public, 1, *replies).decode()
'wholeheartedness'

Note that reversing the two parts of the reply (in effect, attempting to decrypt the other product) will result in a decryption error.

>>> try:
...     r.elect(s.public, 1, *reversed(replies)).decode()
... except Exception as e:
...     print(e)
...
Decryption failed. Ciphertext failed verification

Of course, having only two products that are each a 16-byte string is not a very realistic scenario. But these limitations can be overcome by creatively chaining multiple instances of the back-and-forth exchange presented in the example above. For now, we leave these generalizations of the approach as an exercise for the reader.
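One way to lift the 16-byte limit, building on the earlier remark that the two strings could be decryption keys, is sketched below. The vendor encrypts each full-size product under its own 16-byte key, publishes both ciphertexts, and offers only the keys through the oblivious transfer exchange. The keystream_xor helper is a toy cipher invented for this illustration (an assumption, not part of otc), and the result of the OT exchange is simulated rather than performed.

```python
import hashlib
import secrets

def keystream_xor(key: bytes, data: bytes) -> bytes:
    """XOR data with a keystream derived from the key (toy cipher)."""
    stream = hashlib.shake_256(key).digest(len(data))
    return bytes(d ^ s for d, s in zip(data, stream))

# Vendor: encrypt each product under its own random 16-byte key and
# make both ciphertexts available to the customer.
products = [
    b"First product: a payload much longer than 16 bytes.",
    b"Second product: another payload longer than 16 bytes.",
]
keys = [secrets.token_bytes(16) for _ in products]
ciphertexts = [keystream_xor(k, p) for k, p in zip(keys, products)]

# The two 16-byte keys would be the messages passed to the OT exchange;
# here its outcome for selection 1 is simulated directly.
received_key = keys[1]

# Customer: only the product matching the received key decrypts correctly.
print(keystream_xor(received_key, ciphertexts[1]))
```

Applying the same XOR a second time undoes the encryption, so one helper serves for both directions. A real deployment would use an authenticated cipher from a vetted library in place of this toy construction.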

Implementation Details

The otc library is an implementation of the Genç, Iovino, and Rial protocol that relies on cryptographic primitives that consist of operations involving elliptic curve points and scalars. This OT library in turn leverages another library, oblivious, that provides a single API that can seamlessly switch between libsodium and pure-Python implementations of the necessary primitives. This means the OT library can be used either in conjunction with a compiled instance of libsodium (yielding much better performance) or on its own (ensuring better portability at the expense of performance).

Practical Secure Computation

Hopefully, this article illustrates that incorporating cryptographic techniques like oblivious transfer into software applications that are implemented using contemporary, ubiquitous programming languages such as Python is becoming more practical. Visit the Nth Party GitHub to find and/or contribute to otc, oblivious, and other libraries that you might find useful for your projects. And, of course, be sure to check back here in the future for more examples and tutorials like this one.

This article is also available as a Jupyter Notebook on GitHub.

Privacy-Preserving Information Exchange Using Python was originally published in Nth Party on Medium, where people are continuing the conversation by highlighting and responding to this story.

Guide to Publishing Packages

Andrei Lapets — Thu, 29 Oct 2020 21:51:00 GMT

This article is a step-by-step guide to assembling and publishing a small, open-source Python package. While not all of the steps below will be appropriate or desirable for every package, each of these can contribute to the accessibility and maintainability of the package. Note that this material is meant to serve as a roadmap and overview; it is not a thorough review of all the nuances and trade-offs involved. You are encouraged to consult additional resources (links are included throughout the text) for other options and viewpoints on how to approach every portion of the process.

Each section covers a task or category of tasks related to organizing and publishing a package and, where appropriate, provides templates and examples. Note that depending on when you are reading this article, some of the steps, templates, and examples may be out of date.

Project Organization and Directory Tree Template

The directory structure below is one way in which a hypothetical project called published might be organized.

├─ .gitignore ............. File name patterns for Git to ignore 
├─ .travis.yml ............ Travis CI configuration
│
├─ LICENSE ................ Distribution license
├─ README.rst ............. Project README
├─ setup.cfg .............. Configuration with parameters for setup
├─ setup.py ............... Package file
│
├─ published/
│  ├─ __init__.py ......... Namespace module
│  └─ published.py ........ Library module
│
└─ test/
   └─ test_published.py ... Unit tests

For the purposes of this article, it is assumed that the package contains just one source module (i.e., published/published.py). Therefore, it is sufficient to place it (along with its namespace module published/__init__.py) into its own directory. The sections below go into more detail about each of the files and their purpose.

Package File Organization

The setup.py package file specifies relevant metadata associated with your project, the locations of some important project files, and the dependencies that your project requires. An example of a possible setup.py file for the hypothetical package called published is presented below.

from setuptools import setup

with open("README.rst", "r") as fh:
    long_description = fh.read()

setup(
    name="published",
    version="0.1.0",
    packages=["published",],
    install_requires=[],
    license="MIT",
    url="https://pypi.org/project/published",
    author="python.supply",
    author_email="contact@python.supply",
    description="Example illustrating how an open-source library "
                "can be organized and published.",
    long_description=long_description,
    long_description_content_type="text/x-rst",
    test_suite="nose.collector",
    tests_require=["nose"],
)

Establishing and Checking Style Conventions

One easy way to add linting to your project is to create a default configuration file for Pylint.

python -m pip install pylint
pylint --generate-rcfile > .pylintrc

You can then check your project source files in the following way.

pylint published

In some cases, you may find it necessary to alter certain parameters in the configuration.

The variable-rgx, class-rgx, constant-rgx, and similar parameters can be used to specify an alternate standard for acceptable naming conventions. This can be reasonable to do in certain scenarios. For example, if you are implementing a specialized library that involves certain mathematical constructs, it may be more clear to your audience and in accordance with domain conventions to use single-letter variables such as x and y rather than longer identifiers that use snake case.
You might add rules that you want to ignore to the comma-separated list that follows the disable= parameter.

You may also want to direct Pylint to ignore a rule in a specific location within a source file. This can be done using a directive inside a comment. An example is presented below.

class published(): # pylint: disable=C0103,R0903
    def __init__(self):
        self.published = True

Defining Unit Tests and Measuring Test Coverage

There is an extensive supply of tools, packages, and conventions for defining unit tests, running unit tests, and measuring unit test coverage. In this article, the focus is on techniques that are suitable for small Python libraries and packages, so only two approaches are presented.

Using doctest

Unit tests for individual classes and methods can be included inside the docstrings that appear at the top of their bodies. The built-in doctest library can then be used to find and validate them. In the example below, the definition for the class published includes one such collection of tests in its docstring (i.e., a sequence of executed statements and the corresponding results, as might be observed when engaging a session via the interactive prompt).

class published():
    """
    A published package.

    >>> p = published()
    >>> p.is_published()
    True
    >>> p.published = False
    >>> p.is_published()
    Traceback (most recent call last):
      ...
    RuntimeError: package must be published
    """
    def __init__(self):
        """Build an instance."""
        self.published = True

    def is_published(self):
        """Check publication status."""
        if not self.published:
            raise RuntimeError("package must be published")

        return self.published

To check if the outputs in these sequences are accurate descriptions of what is actually returned by the Python interpreter when it runs each statement, simply add the following to your main module (e.g., inside published.py in this case):

if __name__ == "__main__":
    import doctest  # Only needed when the module is run directly.
    doctest.testmod()

It is then possible to run these tests as follows (the -v option ensures a report is displayed even if all tests succeed):

python published/published.py -v
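The same docstring examples can also be run from within a Python script, which may be convenient when you want to inspect the outcome programmatically. The sketch below repeats the class from above in condensed form so that it is self-contained; it uses the standard doctest.DocTestParser and doctest.DocTestRunner classes, and the variable names are illustrative.

```python
import doctest

class published():
    """
    A published package.

    >>> p = published()
    >>> p.is_published()
    True
    >>> p.published = False
    >>> p.is_published()
    Traceback (most recent call last):
      ...
    RuntimeError: package must be published
    """
    def __init__(self):
        self.published = True

    def is_published(self):
        if not self.published:
            raise RuntimeError("package must be published")
        return self.published

# Extract the examples from the docstring and run them explicitly.
parser = doctest.DocTestParser()
test = parser.get_doctest(
    published.__doc__, {"published": published}, "published", None, 0
)
runner = doctest.DocTestRunner(verbose=False)
results = runner.run(test)
print(results.failed, results.attempted)  # 0 4
```

Each line that begins with the interactive prompt counts as one example, so four examples are attempted here, including the one that is expected to raise an exception.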

Alternatively, you can allow nose to find and run doctests using appropriate parameters in a setup.cfg file. An example of a setup.cfg file that is sufficient for the examples above is presented below.

[nosetests]
exe=True
with-doctest=1
tests=published/published.py

Using unittest

If you would like to employ a more extensive test suite, you can create one using the built-in unittest library and put it in a location that the nose tool can locate automatically. Below is an example of the layout a test script test/test_published.py might have.

from unittest import TestCase
from published.published import published

class Test_published(TestCase):
    def test_is_published(self):
        p = published()
        self.assertTrue(p.is_published())

To ensure nose can find the test script, the setup.cfg configuration file should specify the directory that contains the test script.

[nosetests]
exe=True
with-doctest=1
tests=test/
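If you want to try a test case before configuring a test runner, the standard library alone is enough. The sketch below defines a condensed copy of the class and its test in a single file (an arrangement chosen only to keep the example self-contained) and runs the test using unittest's own loader and runner.

```python
import unittest

class published():
    def __init__(self):
        self.published = True

    def is_published(self):
        return self.published

class Test_published(unittest.TestCase):
    def test_is_published(self):
        p = published()
        self.assertTrue(p.is_published())

# Collect the tests defined in the test case and run them.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(Test_published)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.testsRun, result.wasSuccessful())  # 1 True
```

The runner writes its own report to the standard error stream; the final line prints only the summary values taken from the result object.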

Using both doctest and unittest with nose

It is possible to use nosetests to run all tests, including both doctest unit tests and any testing scripts. An example of a setup.cfg file that enables this is provided below.

[nosetests]
exe=True
with-doctest=1
tests=published/published.py, test/

Measuring coverage

To determine how much of your code is covered by unit tests every time you use nose, you can add a few optional lines to the setup.cfg file.

[nosetests]
exe=True
with-doctest=1
cover-package=published
cover-html=1
tests=published/published.py, test/

By assigning your package name to cover-package, you are indicating that coverage should be measured (typically using coverage). By assigning 1 to cover-html, you are indicating that human-readable HTML files should be generated that highlight what portions of the module files are not covered by any unit tests.

Continuous Integration and Coverage Reporting

You can connect your GitHub personal or organization account with Travis CI and Coveralls in order to automatically run tests and publish test coverage every time you or a contributor pushes to the package’s GitHub repository or makes a pull request. Travis CI and Coveralls provide extensive documentation describing how you can link your GitHub account with their services; this article focuses on what configuration files you need to add to your project.

A template for a simple .travis.yml Travis CI configuration file for the hypothetical package called published is provided below.

notifications:
  email:
    on_success: never
    on_failure: always
language: python
python:
  - "3.8"
cache: pip
install:
  - pip install pylint
  - pip install coveralls
  - pip install .
script:
  - pylint published
  - nosetests
after_success:
  - coveralls

The notifications section indicates that email notifications should only be sent when a build fails, which happens if any step in the script section returns an exit status other than 0 (an exit status of 0 conventionally indicates success). The pylint and nosetests commands conform to this convention. The after_success section includes an entry that publishes the coverage results via Coveralls if there are no linting issues and no errors arise when the unit tests are executed.
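The success-or-failure signal that continuous integration services rely on is the exit status of each command. The sketch below illustrates that convention using two tiny child processes that stand in for build steps; the step names and the run_step helper are hypothetical and are not part of Travis CI.

```python
import subprocess
import sys

def run_step(code: str) -> int:
    """Run a small Python program as a child process; return its exit status."""
    return subprocess.run([sys.executable, "-c", code]).returncode

# Stand-ins for build steps: one exits with status 0, one with status 1.
steps = {
    "passing step": "import sys; sys.exit(0)",
    "failing step": "import sys; sys.exit(1)",
}
statuses = {name: run_step(code) for name, code in steps.items()}

# A build counts as successful only if every step returned 0.
build_ok = all(status == 0 for status in statuses.values())
print(statuses, build_ok)
```

A linter or test runner follows the same pattern: it exits with a nonzero status when it finds a problem, and that status (not anything printed to the terminal) is what marks the build as failed.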

README Organization and Format

An effective README document might cover some of the following topics:

the purpose of the package/library;
a quick start guide showing how a first-time user can begin using the package/library;
how to run unit tests and measure test coverage;
how others can contribute;
the versioning standards or conventions being used.

Two popular formats for a README file that are supported by GitHub and PyPI are Markdown and reStructuredText. You can also learn more about how to structure a README file for PyPI.

Badges

You may want to add badges to your README that provide information about the status of your project (e.g., the host package repository and version number of the latest release, the last Travis CI build outcome, test coverage statistics, and so on). This requires determining the image URL for the badge you want to display (for example, the badge image URL for the last Travis CI build status for the hypothetical package called published might be https://travis-ci.com/python-supply/published.svg) and then inserting that image into your README document using the appropriate syntax. For example, when using the reStructuredText format, the badge section of the README.rst file for the package published might look as follows if it uses a substitution definition for an image in order to include a badge for PyPI.

|pypi|

.. |pypi| image:: https://badge.fury.io/py/published.svg
   :target: https://badge.fury.io/py/published

Versioning and Contributions

When deciding how to assign version numbers to different versions or releases of your package, there are advantages to adopting a published standard. An example of a popular standard is Semantic Versioning 2.0.0. One benefit of a standard is that it makes it easier for anyone working with your package to understand what the difference between two versions signifies (e.g., whether they can expect one version to be backwards compatible with another). Another benefit is that it reduces the burden on you as the author and maintainer, as you do not need to invest effort in documenting your own conventions and resolving any corner cases that might arise. Of course, this may come at some cost or additional effort if your goal is to adhere to the standard consistently.
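As a rough illustration of what a versioning standard lets a user infer, the sketch below compares version strings of the form MAJOR.MINOR.PATCH. The helper names parse and is_compatible_upgrade are hypothetical, and the sketch deliberately ignores the pre-release and build metadata suffixes that Semantic Versioning also allows.

```python
def parse(version: str) -> tuple:
    """Split a MAJOR.MINOR.PATCH string into a tuple of integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return (major, minor, patch)

def is_compatible_upgrade(old: str, new: str) -> bool:
    """Check whether moving from old to new should preserve the public API."""
    o, n = parse(old), parse(new)
    # Under Semantic Versioning, a major version of 0 signals initial
    # development, so no compatibility is promised in that range.
    return o[0] > 0 and n[0] == o[0] and n >= o

print(is_compatible_upgrade("1.2.3", "1.3.0"))  # True
print(is_compatible_upgrade("1.2.3", "2.0.0"))  # False
print(is_compatible_upgrade("0.1.0", "0.2.0"))  # False
print(parse("1.10.0") > parse("1.9.0"))         # True
```

Comparing tuples of integers rather than raw strings matters for the last case: as strings, "1.10.0" would sort before "1.9.0".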

You may also want to specify in the README document any expectations you have of contributors who may be interested in reporting issues or making improvements to your package. For example, you might direct potential contributors to report problems via GitHub Issues and to submit suggested fixes or enhancements via pull requests.

Publishing to PyPI

Once you are ready to publish your package, you may want to test one last time that the package file and other project files are organized appropriately by installing the package locally.

python -m pip install .

You can then generate the archive files for distribution.

python setup.py sdist bdist_wheel

These can then be contributed to PyPI after you have set up an account. When you run the command below for the first time, you will be asked for your PyPI credentials.

twine upload dist/*

You can then try installing your package from PyPI. The example below is for the hypothetical package called published.

python -m pip install published

If you already installed a version locally, you may want to ensure you upgrade to the latest version using the --upgrade and --force-reinstall options.

python -m pip install --upgrade --force-reinstall published

Privacy-Preserving Matching and Computation in the Browser

Andrei Lapets — Wed, 30 Sep 2020 19:55:35 GMT

There are a variety of obstacles that can make it a challenge to deliver the benefits of privacy-preserving secure computation capabilities to organizations. These can include a lengthy process for obtaining approval to install new software, the cost and personnel effort associated with provisioning new cloud resources, and the technical hurdles involved in integrating existing infrastructure with new APIs. Fortunately, there is a way to sidestep these challenges for at least some secure computation workflows by delivering them via a web browser.

Privacy-Preserving Matching and Computation

Facebook AI Research (FAIR) recently released an open-source framework that makes it possible for two parties to match their respective data sets based on a common identifier — and to perform some aggregate computations over the results of that match — without requiring that either party share data with the other party. The framework is implemented in Rust, with Facebook’s stated rationale for the choice including “superior safety features and ease of writing multithreaded code”.

There are a variety of compelling use cases for a software solution that makes it possible to align the rows within two private or sensitive data sets (based on common row attributes) and to subsequently perform a computation such as a count or a sum over some of the columns in the combined data set. We enumerated some of these use cases in our white paper: enabling business-to-business decision-making without introducing the cost or complexity of third parties or data clean rooms, supporting benchmarking efforts involving multiple organizations, and the evaluation of sensitive or proprietary models or data.

Facebook’s release consists of a command-line software solution that allows two data owners to perform a join operation followed by an aggregation on a pair of data sets (with each data owner contributing one of these data sets). For example, a hospital may have information about the length of each patient’s stay and an insurance provider may know whether each patient visits their primary care physician regularly. The hospital and insurance provider can use this solution to create a report that lists the average hospital stay for each of the two categories of patients (at least for the patients they both have in common). What is unique about this solution is that it uses a cryptographic approach known as secure multi-party computation (including a specific technique known as private set intersection for the join step) to make this computation possible without requiring that either data owner reveal their data set to the other side.
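To pin down what is being computed in the hospital example, the sketch below performs the same join and aggregation in plain Python over made-up records. It offers no privacy at all (both data sets sit in one process in the clear) and does not use Facebook's framework; it only shows the result that the privacy-preserving workflow would produce without either side revealing its inputs.

```python
from collections import defaultdict

# Hypothetical records keyed by a shared patient identifier.
hospital = {"p1": 3, "p2": 7, "p3": 2, "p4": 5}  # Length of stay in days.
insurer = {"p2": True, "p3": False, "p4": True, "p5": False}  # Regular visits?

# Join step: keep only the patients that both parties have in common.
common = hospital.keys() & insurer.keys()

# Aggregation step: average length of stay for each category of patient.
stays = defaultdict(list)
for patient in common:
    stays[insurer[patient]].append(hospital[patient])
report = {regular: sum(days) / len(days) for regular, days in stays.items()}

print(report[True], report[False])  # 6.0 2.0
```

In the secure version, the intersection is computed with private set intersection and the averages are computed over protected values, so neither party learns the other's rows.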

Secure Computation in the Browser

The members of our team at Nth Party have extensive experience developing and deploying secure computation libraries, frameworks, and software applications that can run in a standard web browser. We were excited by Facebook’s announcement of their framework, but we were particularly pleased with the choice of Rust for its implementation. This is because over the past few years, we have seen that it is possible to improve the performance of browser-based secure computation solutions by implementing them using WebAssembly. It so happens that it is possible to compile Rust to WebAssembly, with a variety of tools available for accomplishing this task.

Leveraging our team’s experience and combining FAIR’s framework with our own JavaScript secure multi-party computation libraries, we were able to assemble a solution that can deliver via a standard browser the same privacy-preserving matching and computation capabilities available in Facebook’s original solution. This is accomplished by first compiling the client-side and server-side components of the original Rust framework into WebAssembly. It is then possible to tie the client-side portion to a user interface that can run in a browser and to package the server-side component inside a Node.js application.

The Result: Scalable Browser-Based Join and Aggregation Workflows

Our browser-based application displaying some simulated input data (left-hand pane) and the output of a privacy-preserving computation (right-hand pane).

Our open-source solution is available on GitHub. The application as a whole scales reasonably well for a browser-based application, performing in under one minute a privacy-preserving join and aggregation workflow over data sets that number in the thousands. This demonstrates not only that secure computing techniques are ready to address real-world problems, but that they are relatively straightforward to adapt for modern software application stacks. This reinforces our views at Nth Party that secure computation is ready to help make business-to-business and consumer-facing data-oriented services and workflows more secure.

Privacy-Preserving Matching and Computation in the Browser was originally published in Nth Party on Medium, where people are continuing the conversation by highlighting and responding to this story.

Applications of Immutability

Andrei Lapets — Wed, 30 Sep 2020 19:31:00 GMT

Suppose you are implementing a service that works with data sets that represent routes on a map (e.g., collections of driving directions or logs of past trips). The road network is represented as a graph with nodes and edges, and each route is conceptually a collection of edges. You are tasked with choosing an appropriate data structure for routes that meets at least the following criteria:

it should be possible to deduplicate large collections of routes,
it should be possible to use individual routes as keys into a dictionary (e.g., to build a cache that maps each route to its total distance or average trip time),
programmers should not be able to modify a route instance if they obtain a reference to it, and
the collections of edges found in a route may be supplied to your data structure’s constructor in any order (even though they represent the same route for purposes of caching or deduplication).

What are your options in implementing a data structure for routes that meets these criteria, and what issues should you consider?

Both built-in and user-defined data structures in Python can be either mutable or immutable. This article explains why Python makes this distinction for built-in data structures, breaks down the independent characteristics that are often associated with immutable data structures, and explores several approaches you can employ when addressing the above use case.

Mutable and Immutable Built-in Types

Each of the built-in types found in Python is either mutable or immutable:

instances of the collection types list, dict, and set are mutable,
instances of the collection types tuple, frozenset, and range are immutable,
instances of types such as bool, int, float, and str are immutable, and
instances of bytes are immutable but instances of bytearray are mutable.

Mutable types are accompanied by methods that modify instances of the corresponding data structure in-place (usually returning None) while immutable types are usually accompanied by functions and methods that return a new instance of that type (such as string concatenation). But why does Python distinguish between mutable and immutable built-in types? The reasons are subtle and relate to an interplay between programming language design decisions and practical performance requirements. A brief overview is provided below, and you can find a detailed answer to this question in the Python documentation.
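
As a brief illustration of the distinction described above, the sketch below contrasts an in-place list method with string concatenation. The variable names are chosen here only for the example.

```python
# In-place modification: list.sort() changes the list and returns None.
numbers = [3, 1, 2]
alias = numbers            # A second reference to the same list object.
result = numbers.sort()
print(result, alias)       # None [1, 2, 3]

# Immutable type: concatenation builds a new string object,
# so the object that `original` refers to is left unchanged.
greeting = "abc"
original = greeting        # A second reference to the same string object.
greeting = greeting + "!"
print(original, greeting)  # abc abc!
```

Because the list is modified in place, every reference to it observes the change. The string referenced by original is never altered.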

Some programming languages such as Haskell have only immutable values (and thus all new values are necessarily copies or entirely new objects). One significant benefit of this approach is that code in Haskell is much easier and safer to refactor and transform (e.g., for purposes of optimization) because the context of an expression will never affect its meaning. As an example, consider the following for loop.

>>> for i in range(3):
...     x = 1 + 2 + 3 + 4
...     print(x)
...
10
10
10

Because the expression 1 + 2 + 3 + 4 is immutable (i.e., its value will not change depending on where it appears in a program), it can safely be moved up and outside of the for loop without changing the behavior of the program. Modern interpreters and compilers routinely apply optimizations of this kind directly to the abstract syntax tree of the program.

>>> immutable_value = 1 + 2 + 3 + 4
>>> for i in range(3):
...     x = immutable_value
...     print(x)
...
10
10
10

Python set instances support an extremely fast lookup/membership operation (which can be invoked using the infix in operator) because the Python interpreter builds a hash table that contains hashes of all the individual elements of a set. As a reference point, consider the performance when searching for an element in a list.

>>> import time
>>> l = list(range(0,1000000))
>>> start = time.perf_counter()
>>> 999999 in l
True
>>> time.perf_counter() - start
0.07584109995514154

When evaluating the in operator in an expression such as 4 in {1, 2, 3, 4, 5}, the interpreter hashes 4 and finds the hash value in the hash table for {1, 2, 3, 4, 5} in nearly constant time. As shown below, this is significantly faster than searching through all of the elements of a list as in the above example (which is the only option without some sort of alternative comparison and sorting mechanism).

>>> s = set(l)
>>> start = time.perf_counter()
>>> 999999 in s
True
>>> time.perf_counter() - start
0.00026190001517534256

Suppose that elements in a set instance were mutable. This would mean that their hash would also be mutable, which would in turn mean that the hash table for the set instance would need to be updated. But how would the interpreter even know to update the hash table? Consider the example below.

e = [1, 2, 3]
u = {e, "a"}
v = {e, True, False}
w = {e, 1.2, 2.3, 3.4}
e.pop()

Every time a statement such as e.pop() is executed, the interpreter would need to check whether e is a member of any sets (there are three such sets in this case) and would need to update the hash table for every one of them to accurately reflect that the hash value corresponding to e is different. If the interpreter did not do this, then an expression such as [1, 2, 3] in u could not be evaluated both correctly and efficiently.
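
In practice, Python sidesteps this problem by rejecting the hypothetical example outright: a list has no hash, so attempting to place one in a set raises an exception. The short check below is a sketch of that behavior. The helper name is_hashable is introduced here purely for illustration.

```python
def is_hashable(value):
    """Return True if the built-in hash function accepts the value."""
    try:
        hash(value)
        return True
    except TypeError:
        return False

e = [1, 2, 3]
print(is_hashable(e))         # False: lists are mutable and unhashable.
print(is_hashable(tuple(e)))  # True: a tuple of integers is hashable.

try:
    u = {e, "a"}
except TypeError as error:
    print(error)              # unhashable type: 'list'
```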

It is worth noting that Python’s set and dict data structures were intentionally designed this way under an assumption: programmers will usually want to perform lookup in set and dict instances based on the value and not the particular instance of a data structure. An alternate approach could have been to simply build the hash table using the memory address of each element rather than its value. However, this approach begins to fall apart as soon as strings are used as elements: after the statements k = "abc!" and d = {"abc!": 123}, a programmer probably expects d[k] == 123 to be True even though the address of the string instance "abc!" in k = "abc!" is different from the address of the distinct string instance "abc!" in d = {"abc!": 123}. This can be confirmed using the built-in id function (though you should be aware of interning to avoid confusion when testing your own examples).

>>> k = "abc!"
>>> d = {"abc!": 123}
>>> id(k) == id(list(d.keys())[0])
False

The above explains why built-in immutable types exist and why Python requires them in certain contexts. This also indicates that to satisfy the criteria in the motivating scenario described in the introduction, the data structure you define for representing a route must be one that Python recognizes as being immutable.

Defining an Immutable Data Structure

In Python, there are a number of approaches available to you when you are defining a data structure for a use case such as the one described in the introduction.

One approach is simply to adopt a convention of using an existing built-in type to represent instances of your data structure. For example, a route can be represented as a frozenset of tuple instances (with two int components in each tuple representing the endpoints of the edge). This may be advantageous if you are trying to avoid unnecessary clutter or would like to make it easier for other libraries or components to use your data without dealing with (or introducing within their own code) application-specific boilerplate.
A second approach is to define a derived class that inherits the features of a built-in type (as demonstrated in another article on operator overloading). This has most of the benefits of the first approach above, but gives you more control over the interface of the data structure. This is useful if you would like to add custom methods, to modify how certain default methods inherited from the built-in type behave, to enforce type or value constraints on method arguments, to raise application-specific exceptions, or simply to provide more user-friendly and application-specific synonyms for existing functions and methods.
A third approach is to define a brand new class that conceals its internal representation. This has the benefit of encapsulation (allowing you to modify the internal representation of the actual route in the future), but requires a more careful approach on the conceptual side and more boilerplate code on the practical side.

Using Built-in Types

A route could be represented using an instance of frozenset containing instances of tuple (with each tuple representing one edge) that in turn each contain two integers (with each integer representing one of the two nodes that an edge connects). Because integers and tuples are immutable, they can be elements inside frozenset instances.

>>> route_one = frozenset({(0, 1), (1, 2), (2, 3)})
>>> route_two = frozenset({(1, 2), (0, 1), (2, 3)})
>>> len({route_one, route_two}) # Two routes with the same edges.
1

Because frozenset instances are immutable, all four criteria for the route data structure can be satisfied. In particular, routes can be deduplicated by inserting them into a set and a mapping from routes to their distances can be implemented using a dict. Because a frozenset behaves like a mathematical set, the order of the edges does not matter.

>>> distances = {route_one: 3} # No exception is raised.
>>> len({route_one, route_two}) # Deduplication occurs.
1

Note that an instance of an immutable type can contain a mutable object inside it. However, in order for a value to be used as a key in a dict instance or an element in a set instance, all elements inside the immutable type must also be immutable (and so on, down to the leaves of the data structure instance). In the example below, the mutable object [True, False] causes an exception even though it is inside an immutable tuple instance that is itself inside another immutable tuple instance.

>>> try:
...     {("a", ([True, False],)): 0}
... except Exception as e:
...     print(e)
...
unhashable type: 'list'

Defining a Derived Class

It is possible to take advantage of Python’s support for inheritance to define a class that is derived from one of the immutable types. This ensures that the derived class has the same attributes and methods as the base class. In the example below, the route class inherits all the features of frozenset. In addition, it has a method distance that computes the distance of the route (defined as the number of hops that occur between two distinct nodes) and a custom definition of __repr__ to display an instance in a friendly way.

class route(frozenset):
    def distance(self):
        return sum([1 for e in self if e[0] != e[1]])
    
    def __repr__(self):
        return "route({" + ", ".join([str(e) for e in self]) + "})"

To create an instance of route, it is sufficient to pass any iterable of edges (such as a set or a frozenset) to the constructor.

>>> route({(0,0), (0,1), (1,1)}).distance()
1

With respect to its immutability, instances of route can be used in any context in which frozenset can be used.

>>> {route({(0,1), (1,2)}): 2}
{route({(0, 1), (1, 2)}): 2}

If your data structure were simpler (e.g., a record with a fixed collection of named attributes), you could have your derived class inherit from a type generated using namedtuple from the built-in collections library.

>>> from collections import namedtuple
>>> class record(namedtuple("record", "name age")):
...     pass
...
>>> record("Alice", 32)
record(name='Alice', age=32)

Defining a New Class

When creating your own class for the route data structure, you first need to determine which characteristics of built-in immutable types you would like your class to possess.

If you would like to be able to use instances of the class inside a set instance or as a key in a dict instance, you must define appropriate methods to make this possible.
If you would like to ensure that it is not possible to modify or extend an instance of your class once it has been created, you will need to redefine specific methods and attributes in a particular way. However, it is worth noting that this approach does not enforce immutability to the same extent as the approach of defining a class derived from an immutable type.

To satisfy the first requirement, it is sufficient to provide definitions for the __hash__ and __eq__ methods. The Python interpreter will invoke these methods when building a hash table (e.g., for a set or dict instance) as well as when retrieving a value. The __eq__ method is required in addition to the __hash__ method because hashes are not guaranteed to be unique and the interpreter needs to be able to disambiguate between objects of your class in such scenarios. Note that the Python interpreter expects that two values that are equal according to __eq__ must have the same hash values according to __hash__.

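import hashlib

class route():
    def __init__(self, edges):
        self.es = edges

    def __hash__(self):
        # Normalize (deduplicate and sort) so that the order and
        # multiplicity of the edges do not affect the hash.
        es = sorted(set(self.es))
        return int(hashlib.sha256(str(es).encode()).hexdigest(), 16)

    def __eq__(self, other):
        # Defer to the other operand when it is not a route.
        if not isinstance(other, route):
            return NotImplemented
        return set(self.es) == set(other.es)

    def distance(self):
        return sum([1 for e in self.es if e[0] != e[1]])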

It is now possible to use instances of route as elements of set instances and as keys for dict instances.

>>> route_one = route([(0,0), (0,1), (1,1)])
>>> route_two = route([(0,1), (1,2), (2,3)])
>>> route_three = route([(0,1), (1,2), (2,3)])
>>> distances = {route_one: 3}
>>> len({route_one, route_two, route_three})
2

Note that in the definition of the __hash__ method, a cryptographic hash function sha256 from the built-in hashlib library is applied to a string version of a normalized representation (i.e., sorted and deduplicated) of a route. Normalization ensures that the order and multiplicity of edges do not change the hash of a route. Use of sha256 ensures that the hash of any instance of the same string will always be the same in any environment and under any version of Python. This is not guaranteed by the built-in hash function, which may return different results across different Python sessions (even in the same environment). Such an inconsistency could be an issue if, for example, users of your data structure store instances of it on disk and load them again at a later time.

One way to satisfy the requirement that users cannot modify or extend instances of your data structure is to explicitly set the __slots__ attribute to a tuple containing only those attributes that you require for your implementation.

class route():
    __slots__ = ("es",)
    
    def __init__(self, edges):
        self.es = edges

It is now not possible to create new attributes for any route instance.

>>> try:
...     route_example = route([(0,1), (1,2)])
...     route_example.duration = 123
... except Exception as e:
...     print(e)
...
'route' object has no attribute 'duration'

Another option is to provide a definition for __setattr__ that raises an exception, ensuring that it is not possible to add new attributes or to assign new values to an existing attribute.
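
A minimal sketch of this second option follows, combined with __slots__. The error message, the conversion of the edges to a frozenset, and the use of object.__setattr__ inside __init__ are choices made here for illustration and are not part of the definitions above.

```python
class route():
    __slots__ = ("es",)

    def __init__(self, edges):
        # Bypass the overridden __setattr__ exactly once, during construction.
        object.__setattr__(self, "es", frozenset(edges))

    def __setattr__(self, name, value):
        raise AttributeError("route instances cannot be modified")

    def __delattr__(self, name):
        raise AttributeError("route instances cannot be modified")

r = route([(0, 1), (1, 2)])
try:
    r.es = []
except AttributeError as e:
    print(e)  # route instances cannot be modified
```

As noted earlier, this does not enforce immutability completely: a determined caller can still invoke object.__setattr__ directly on an instance.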

Comprehensions and Combinations

Andrei Lapets — Thu, 27 Aug 2020 03:51:00 GMT

If you are working on a project in which you must enumerate, traverse, and/or test every possible combination of elements from one or more finite (or even infinite) collections, Python provides a variety of ways to do so. This article reviews some of the more syntactically concise approaches, while also addressing some relevant memory utilization aspects. In particular, the focus is on comprehension syntax as a foundational building block that can be employed in conjunction with functions and recursion.

Conventions for Terminology and Notation

In order to maintain consistency across examples and to keep outputs deterministic, this article follows the conventions enumerated below.

Python lists are used to represent both collections that may have duplicates and sets that do not have duplicates. For example, the set {0, 1, 2} is represented using [0, 1, 2].
When including multiple collections (e.g., U, V, and W) within a Cartesian product expression (e.g., U × V × W), the collections are called components of the Cartesian product.
When an output evaluates to an iterable, it is sometimes immediately consumed and turned into a list using the list function so that it can be reused throughout an example multiple times. In practice, this may not be necessary and may even be unnecessarily expensive in terms of both memory utilization and running time. This distinction is noted explicitly where applicable.
A single-letter variable x usually refers to individual elements in a collection, a variable such as xs usually refers to collections of elements, and a variable such as xss usually refers to a collection of collections.

Cartesian Products

One of the simplest scenarios is one in which it is necessary to generate every combination of elements from two or more collections, where each combination has one component from each collection. This corresponds to the Cartesian product of the collections. Python’s comprehension syntax is a powerful language feature that, under the right circumstances, provides a way to implement such operations involving collections in a way that is concise and closely resembles widely used and recognized forms of mathematical notation. The example below demonstrates how this syntax can be used to build a Cartesian product of two lists.

>>> [(x, y) for x in [0, 1, 2] for y in ["a", "b"]]
[(0, 'a'), (0, 'b'), (1, 'a'), (1, 'b'), (2, 'a'), (2, 'b')]

Comprehension syntax can become difficult to manage and read as the number of component collections involved increases. In such cases, it may make sense to build a function so that the repetitive aspects of building the definition are handled programmatically. The end of this section demonstrates how such a function can be defined. But first, some examples of how it can be used are illustrated using the product function found in the built-in itertools library.

>>> from itertools import product
>>> u = [0, 1, 2]
>>> v = ["a", "b"]
>>> list(product(u, v)) # Wrapped in `list` for display purposes.
[(0, 'a'), (0, 'b'), (1, 'a'), (1, 'b'), (2, 'a'), (2, 'b')]

The product function takes any number of iterable arguments and creates an iterable result containing one tuple for every combination of elements from each of the arguments.

>>> r = [False, True]
>>> s = [0, 1, 2]
>>> t = ['a', 'b', 'c', 'd']
>>> p = list(product(r, s, t))
>>> len(p) == 2*3*4 and len(p[0]) == 3
True

Using the unpacking operator in conjunction with the list repetition operator (also available as a method), it is possible to concisely describe a Cartesian product of any number of instances of a finite collection. In the example below, a ten-dimensional discrete finite space is created (where each dimension is [0, 1, 2]).

>>> s = [0, 1, 2]
>>> p = list(product(*[s]*10))
>>> len(p) == 3**10
True

How much memory does an output of product consume? Because it is an iterable that generates data dynamically, it only takes about as much memory as the component collections provided as inputs to product.

>>> import sys
>>> s = [0, 1, 2, 3, 4, 5, 6, 7]
>>> n = 10
>>> p = product(*[s]*n)
>>> (sys.getsizeof([s]*n), sys.getsizeof(p))
(68, 72)
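
Because the result is generated on demand, it is also possible to consume only part of it. The sketch below uses islice from the same itertools library to take the first few combinations. The variable name first_three is introduced here for illustration.

```python
from itertools import islice, product

s = [0, 1, 2, 3, 4, 5, 6, 7]
p = product(*[s]*10)  # Describes 8**10 tuples without building them.

# Take only the first three combinations; the rest are never generated.
first_three = list(islice(p, 3))
print(len(first_three))  # 3
print(first_three[2])    # (0, 0, 0, 0, 0, 0, 0, 0, 0, 2)
```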

Now that it has been demonstrated how the function can be used, consider a recursive implementation of such a function. To understand how this can be accomplished, a concrete example may help. Suppose a Cartesian product p of two collections [False, True] and [0, 1] has already been built. How do you turn it into a Cartesian product of three collections? You can iterate over all combinations of elements from the third collection ["a", "b"] and from the Cartesian product p, concatenating each element-combination pair.

>>> p = [(x, y) for x in [False, True] for y in [0, 1]]
>>> q = [(z,) + t for z in ["a", "b"] for t in p]
>>> q
[('a', False, 0),
 ('a', False, 1),
 ('a', True, 0),
 ('a', True, 1),
 ('b', False, 0),
 ('b', False, 1),
 ('b', True, 0),
 ('b', True, 1)]

Below is a complete implementation of the function based on the above approach. Note that the base case is a collection containing a single tuple of length zero. The recursive case consists of a comprehension that prepends every element in the first collection to every tuple in the Cartesian product of all the remaining collections.

def cart(xss):
    if len(xss) == 0:
        return [()]
    else:
        return [(x,) + ys for x in xss[0] for ys in cart(xss[1:])]

You can confirm that cart produces the same output as product. However, note that its result is generated in its entirety when the function is called.

>>> c = cart([range(0, 100), range(0, 100)])
>>> p = product(*[range(0, 100), range(0, 100)])
>>> set(c) == set(p)
True
>>> sys.getsizeof(c), sys.getsizeof(p)
(43808, 40)

A variant of this function that uses memory more efficiently can be created by turning the original definition into a generator. One added benefit of this approach is that the first component collection can now be a generator itself (and, thus, can potentially contain an unknown number or even infinitely many elements). The remaining components are traversed once for every element of the first, so they must still be collections that can be iterated repeatedly.

def cart(xss):
    if len(xss) == 0:
        yield ()
    else:
        for t in ((x,)+ys for x in xss[0] for ys in cart(xss[1:])):
            yield t

The generator variant of the cart function is nearly identical to product; the only difference is the absence of argument unpacking.

>>> list(cart([[0, 1, 2], ["a", "b"]]))
[(0, 'a'), (0, 'b'), (1, 'a'), (1, 'b'), (2, 'a'), (2, 'b')]

Because this function is a generator, it consumes approximately as much memory as is needed to keep track of the two component collections.

>>> c = cart([range(0, 1000), range(0, 1000)])
>>> 2 * sys.getsizeof(range(0, 1000))
48
>>> sys.getsizeof(c)
56
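
To illustrate what lazy generation makes possible, the sketch below passes an unbounded iterator (count from itertools) as the first component and takes only the first five results. The generator is restated with explicit nested loops so that the block is self-contained. The later component is an ordinary list because it is traversed again for every element of the first component.

```python
from itertools import count, islice

def cart(xss):
    if len(xss) == 0:
        yield ()
    else:
        for x in xss[0]:
            for ys in cart(xss[1:]):
                yield (x,) + ys

# The first component is infinite; the second is an ordinary list.
pairs = cart([count(), ["a", "b"]])
print(list(islice(pairs, 5)))
# [(0, 'a'), (0, 'b'), (1, 'a'), (1, 'b'), (2, 'a')]
```

The same call is not possible with product, which consumes each of its input iterables completely before producing any output.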

Power Sets

A closely related scenario is one in which it may be necessary to generate every subset of a finite set (also known as the power set). There are a number of approaches to building a power set. It is possible to use the Cartesian product as a building block by noting that every element in a set s of size len(s) is either absent (corresponding to False) or present (corresponding to True). Thus, you can first build a Cartesian product of len(s) instances of the set {False, True}.

>>> s = {0, 1, 2}
>>> p = list(product(*[[False, True]]*len(s)))
>>> p
[(False, False, False),
 (False, False, True),
 (False, True, False),
 (False, True, True),
 (True, False, False),
 (True, False, True),
 (True, True, False),
 (True, True, True)]

It is then possible to use the built-in zip function and a comprehension (which employs an if clause for filtering) to associate the boolean values that make up each of the tuples in the Cartesian product with the corresponding elements in the original set.

>>> ss = [{x for (b, x) in zip(bs, s) if b} for bs in p]
>>> len(ss) == 2**len(s)
True
>>> ss
[set(), {2}, {1}, {1, 2}, {0}, {0, 2}, {0, 1}, {0, 1, 2}]

Alternatively, the Python documentation provides a recipe for building power sets using the built-in itertools library.

from itertools import chain, combinations
def powerset(iterable):
    s = list(iterable)
    return chain.from_iterable(
        combinations(s, r) for r in range(len(s) + 1)
    )

The above variant yields a collection of tuples rather than sets, but otherwise produces the same collection of combinations.

>>> list(powerset([0, 1, 2]))
[(), (0,), (1,), (2,), (0, 1), (0, 2), (1, 2), (0, 1, 2)]

The recursive definition below employs an approach that is nearly identical to that of the recursive definition of the Cartesian product function. In the recursive case, every subset of all the remaining elements (excluding the current first element) is included in the overall result. However, a second copy of each of these subsets is taken and then paired (as in the Cartesian product function) with the current first element. This accounts for all subsets that do include the first element and all subsets that do not include it.

def powerset(xs):
    if len(xs) == 0:
        return [tuple()]
    else:
        x = xs[0]
        yss = powerset(xs[1:])
        return yss + [(x,) + ys for ys in yss]

This approach yields the same results as the definition in the recipe, though it is worth noting that the exact implementation above leads to a different order. This distinction may be important in some applications (such as when performing a search for the largest subset that satisfies some criteria).

>>> powerset([0, 1, 2])
[(), (2,), (1,), (1, 2), (0,), (0, 2), (0, 1), (0, 1, 2)]

Can you apply the same technique used in the implementations of the cart function to turn the above recursive definition of powerset into a generator?
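
One possible answer is sketched below; it is only one of several reasonable designs. To avoid storing the subsets of the remaining elements, this version invokes the recursive call twice, trading additional running time for lower memory use.

```python
def powerset(xs):
    if len(xs) == 0:
        yield ()
    else:
        x = xs[0]
        # First pass: every subset that excludes x.
        yield from powerset(xs[1:])
        # Second pass: every subset that includes x.
        for ys in powerset(xs[1:]):
            yield (x,) + ys

print(list(powerset([0, 1, 2])))
# [(), (2,), (1,), (1, 2), (0,), (0, 2), (0, 1), (0, 1, 2)]
```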

Tools for Organizations in a Rapidly Evolving Data Privacy Landscape

Andrei Lapets — Wed, 05 Aug 2020 20:40:57 GMT

Privacy-enhancing and secure computation technologies are ready today to transform your workflows and services.

By building and deploying web services or data analysis workflows that employ emerging privacy-enhancing and secure computation technologies such as secure multi-party computation (MPC), organizations can provide new services, identify and leverage new business opportunities, reduce risks and costs for both themselves and their customers, and comply with evolving regulations. MPC is already being incorporated into some industry, government, and nonprofit software solutions and workflows. But leveraging its advantages to address an organization’s data privacy challenges requires an understanding of how its features can satisfy technical, business, and legal constraints. Keep reading for an introduction to MPC and its security and privacy advantages, some context around MPC and related cryptographic techniques, and an overview of scenarios and challenges for which MPC is ready today.

Understanding MPC and its Security and Privacy Advantages

MPC is a family of cryptographic techniques that allows organizations and individuals to enjoy the benefits of web-based services and data analysis workflows while mitigating or removing the risks normally associated with providing or sharing the data that those services and workflows require. Thus, MPC can reduce costs associated with existing workflows and can enable new opportunities in scenarios in which data sharing is encumbered or restricted due to the security and privacy concerns of individuals, policies maintained by companies and other organizations, and legal constraints and regulations.

In traditional services and workflows that require computation over sensitive or private data that may be encrypted at rest, it is usually necessary to decrypt that data at some location and for some period of time. This is done so that existing computational tools can be applied to it (e.g., within a data clean room that is set up temporarily so that two organizations can run analyses on their joint datasets inside it). The result of the computation might then be encrypted again before it leaves the location where the computation took place. MPC removes the requirement that sensitive or private data must be decrypted in such scenarios. This means that the risks, liabilities, and costs associated with protecting the data while it is in a decrypted form can be reduced or eliminated.

How is computation over encrypted data possible? That depends on the choice of MPC technique. However, there are some common characteristics:

First, at least two distinct parties must be involved (this is usually already the case in almost all interesting applications).
Second, these parties must have the ability to generate random numbers privately (i.e., the process they use to generate the random numbers cannot be observed or influenced by other parties), which allows them to encrypt any form of data by separating it into constituent parts that appear random on their own (for the curious, this is equivalent to the one-time pad, which is an encryption technique that provides information-theoretic security).
Third, these parties should be incentivized not to simply share or publish the random values they create or exchange as part of a secure computation process (e.g., because of legal or economic incentives, contractual obligations, regulations, and so on).

These conditions are sufficient to allow any service or workflow to operate on encrypted data, and a variety of MPC techniques exist (each representing various trade-offs between performance and other factors) that rely on such an exchange of information that appears random to participants.
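
As a rough illustration of how data can be separated into parts that appear random on their own, the sketch below uses additive secret sharing, one common building block of MPC protocols. It is a simplified example and not a description of any particular product or deployment: the modulus, the function names share and reconstruct, and the two-party setting are all assumptions made here, and the networking and protections against misbehaving parties that a real protocol needs are omitted.

```python
import secrets

MODULUS = 2**32  # Hypothetical choice; all arithmetic is modulo this value.

def share(value, parties=2):
    """Split value into shares that each look random on their own."""
    shares = [secrets.randbelow(MODULUS) for _ in range(parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares

def reconstruct(shares):
    """Recombine all of the shares to recover the original value."""
    return sum(shares) % MODULUS

# Two private inputs are split; each party holds one share of each input.
a_shares = share(20)
b_shares = share(22)

# Each party adds the shares it holds locally, never seeing 20 or 22.
sum_shares = [(a + b) % MODULUS for a, b in zip(a_shares, b_shares)]
print(reconstruct(sum_shares))  # 42
```

Only when the parties combine their result shares is the sum revealed. Any single share on its own is a uniformly random number.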

MPC in the Context of Related Privacy-Enhancing Techniques

Many related cryptographic techniques, including both those that have been ubiquitous for decades and those that are as novel as MPC, can be integrated with MPC. Traditional techniques such as symmetric and public-key cryptography can be leveraged to ease the burden of deploying MPC and thus to accommodate a broader range of scenarios, as has been demonstrated in multiple real-world use cases.

But MPC can also be combined with other emerging techniques that provide complementary but orthogonal advantages, such as differential privacy (DP) and blockchain:

DP techniques protect individual records within a dataset while still allowing aggregate analyses over that dataset to be computed and shared. An MPC implementation of a DP workflow can protect all input and intermediate data within a computation (thanks to MPC), as well as anything the output implies about individual records in the original inputs (thanks to DP).
A range of blockchain techniques allow individuals and organizations to collectively maintain a distributed ledger that is verifiable and cannot be modified. If a scenario requires both (1) the ability to store and compute over encrypted data and (2) permanent storage and/or verifiability of encrypted data, it may make sense to combine the two technologies.

It is worth noting that the performance overheads and costs of MPC and other secure computation technologies almost always exceed those of traditional, non-secure solutions. When combining these technologies, their respective overheads and costs may be cumulative.

Contemporary Challenges and the Opportunities of MPC

Today, organizations operating in a broad range of domains (health, finance, education, government, and so on) face a number of existing and emerging challenges as they create and execute workflows within their own organizations and across their partners and customers: reducing the costs and liabilities of inter-organizational data exchange, protecting customer privacy, adhering to regulatory requirements, and others. MPC can help these organizations by introducing additional options and flexibilities where they might not otherwise exist.

MPC can act as a less expensive alternative to clean rooms and trusted third parties. It can address a variety of specific use cases involving one or more organizations, helping decision-makers analyze data to answer simple questions (e.g., with “yes” or “no” answers) without first undertaking a burdensome negotiation process that may involve legal expenses, delays, and risks of data exposure or unauthorized data reuse. Example use cases include:

Partner organizations can evaluate the value or effectiveness of their partnerships, such as calculating conversion rates or identifying response correlations across audience segments and outreach strategies.
Consortia of organizations can create benchmarks using pooled data without sharing any of their constituent datasets.
Industry competitors can run fully confidential surveys across customer populations.

MPC can also have a significant impact on the evolution of consumer-facing web applications. In addition to letting organizations offer customer-facing applications that are privacy-preserving (i.e., allowing customers to enjoy the value of a service while not sharing their data with the service provider), MPC dramatically expands the range of available tools for navigating the rapidly evolving landscape of data protection and privacy regulations. For example, the General Data Protection Regulation (GDPR) within the European Union and the California Consumer Privacy Act (CCPA) in the US state of California impose requirements on organizations that collect or process personal data of individuals.

The key takeaway is that MPC can allow organizations to transform their workflows and services to operate only on encrypted, pseudonymized, or de-identified data while retaining some or all of the utility of those services and workflows. Organizations can continue to innovate and offer a wider range of new services to customers while maintaining compliance.

Getting Started with Secure Computation

Nth Party builds and offers products that help organizations introduce MPC into their web services and data analysis workflows. Leveraging its expertise and relying on years of experience developing MPC open-source frameworks, software, and applications that have been proven in real-world deployments, the Nth Party team maintains a rich suite of libraries and tools for quickly assembling MPC solutions to address contemporary privacy and security challenges.

If you would like to see the material in this article covered in greater depth and with more detailed examples, take a look at our recent white paper and check back for upcoming articles that will delve into each of the above topic areas.

Tools for Organizations in a Rapidly Evolving Data Privacy Landscape was originally published in Nth Party on Medium, where people are continuing the conversation by highlighting and responding to this story.

Working with Foreign Functions

Andrei Lapets — Thu, 30 Jul 2020 22:01:00 GMT

Suppose your Python library needs to load some sensitive binary data from a file into a contiguous block of memory (e.g., in order to use it for some application-specific operation). Furthermore, you have some additional requirements that must be met for security and auditing purposes:

you need to ensure that your code does not inadvertently cause the interpreter to copy any part of the data loaded from the file into some other region of memory,
you need to log the memory address at which the data was stored, and
you need to clear the memory region that held the data by overwriting it with random bytes.

One strategy you might employ in order to maintain tight control over what your code is doing is to use C functions found in a compiled shared library to read the data from disk, to load that data into a region of memory, and at the end to clear that region. What minimal collection of built-in Python features will you need to invoke functions that are found in a shared library? How can you transform Python values (such as strings representing the location of the file) into an appropriate form on which the function can operate?

Python offers a rich set of capabilities via the built-in ctypes library that make it possible to invoke (or wrap in a Python function) foreign functions that have been implemented using another language (such as C/C++) and compiled into shared libraries. This article reviews the basics of employing foreign functions by demonstrating how to load and apply to the above use case the instance of the GNU C Library available on most operating systems. The same techniques can be used for any shared library. An alternative approach used by some popular Python packages is briefly reviewed, as well.

Loading a Shared Library

To load a shared library file for which you know the relative or absolute path, you can normally use the LoadLibrary method of either the cdll or the windll instance (depending on your operating system) of the LibraryLoader class found in ctypes. For the purposes of the use case in this article, it is sufficient to load the GNU C Library. In the example below, the system function from the platform library is used to distinguish between Windows and Linux/macOS environments. In the Linux/macOS case, the find_library function is used to determine the absolute path of the shared library.

import ctypes
import ctypes.util  # Required explicitly for find_library.
import platform

if platform.system() == "Windows":
    libc = ctypes.windll.msvcrt
else:
    libc = ctypes.cdll.LoadLibrary(ctypes.util.find_library("c"))

Invoking Foreign Functions

The first portion of your workflow involves loading a file into memory. The Python code below writes a file to disk that contains a sequence of 32 random bytes. The file can be used to test the workflow.

from secrets import token_bytes
with open("data.txt", "wb") as file:
    file.write(token_bytes(32))

The C function fopen expects two arguments: a pointer to the first character of a string that represents the path of the file, and a pointer to the first character of the string that represents the mode (i.e., reading or writing) in which the file is opened. You can use the c_char_p type to turn Python strings into a representation in memory that can be handled by the C function. Note the use of the encode string method to provide an explicit encoding for the string as a byte sequence.

from ctypes import c_char_p
file = c_char_p("data.txt".encode("ascii"))
mode = c_char_p("rb".encode("ascii"))

Unfortunately, it is not possible within Python to examine the libc object that was created by the LibraryLoader instance in order to determine what symbols are defined within it. However, in this case we know that the functions fopen, fread, and fclose must exist. For each of these functions, an instance of the FuncPtr class can be found in libc.
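Although the symbols in a loaded library cannot be enumerated, looking up one particular name does tell you whether it is present, because attribute access on the library object raises an AttributeError for an unknown symbol. The sketch below illustrates this under a few assumptions: the helper has_symbol and the name no_such_function_xyz are hypothetical, and the C library is loaded with ctypes.CDLL (using cdll.msvcrt on Windows) rather than exactly as in the earlier example.

```python
import ctypes
import ctypes.util
import platform

# Load the C library (an assumption of this sketch: CDLL is used directly).
if platform.system() == "Windows":
    libc = ctypes.cdll.msvcrt
else:
    # If find_library returns None, CDLL(None) exposes the symbols
    # already loaded into the current process on POSIX systems.
    libc = ctypes.CDLL(ctypes.util.find_library("c"))

def has_symbol(library, name):
    """
    Hypothetical helper: report whether `name` can be looked up
    in the loaded shared library `library`.
    """
    # Attribute access performs the symbol lookup and raises an
    # AttributeError on failure, which hasattr turns into False.
    return hasattr(library, name)

found = {
    name: has_symbol(libc, name)
    for name in ["fopen", "fread", "fclose", "no_such_function_xyz"]
}
```

A check of this kind can be useful before assigning argtypes and restype, since it turns a missing function into an explicit condition your code can handle.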

fopen = libc.fopen
fread = libc.fread
fclose = libc.fclose

Before you can safely invoke these functions, you need to specify their argument types and their return types. This can be accomplished by first consulting the GNU C Library documentation to find the signature for each of the C functions you would like to use. Then, the appropriate data type classes can be used to assign the correct sequence of argument types and the correct return type to the argtypes and restype attributes, respectively, of each FuncPtr class instance.

from ctypes import c_int, c_size_t, c_void_p

fopen.argtypes = [c_char_p, c_char_p]
fopen.restype = c_void_p

fread.argtypes = [c_void_p, c_size_t, c_size_t, c_void_p]
fread.restype = c_size_t

fclose.argtypes = [c_void_p]
fclose.restype = c_int

It is now possible to invoke these functions on some inputs. You can allocate a memory buffer for the 32 bytes of data that you will be loading from the file using the create_string_buffer function.

from ctypes import create_string_buffer
data = create_string_buffer(32)

You can now open the file, load the data, and close the file.

>>> fp = fopen(file, mode)
>>> fread(data, 32, 1, fp)
>>> fclose(fp)
>>> bytes(data).hex()
'6d1481fda1e3853c14d81c6f1c4f87fcf26aca8f8d628d2cc31067781f9624ce'
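The open, read, and close steps can also be wrapped in a single Python function that checks the values returned by the C functions. The following is a minimal sketch under stated assumptions rather than a definitive implementation: the name load_into_buffer and its error messages are hypothetical, and the library is loaded with ctypes.CDLL (or cdll.msvcrt on Windows).

```python
import ctypes
import ctypes.util
import platform
from ctypes import c_char_p, c_int, c_size_t, c_void_p, create_string_buffer

if platform.system() == "Windows":
    libc = ctypes.cdll.msvcrt
else:
    libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Signatures as described in the article.
libc.fopen.argtypes = [c_char_p, c_char_p]
libc.fopen.restype = c_void_p
libc.fread.argtypes = [c_void_p, c_size_t, c_size_t, c_void_p]
libc.fread.restype = c_size_t
libc.fclose.argtypes = [c_void_p]
libc.fclose.restype = c_int

def load_into_buffer(path, length):
    """
    Hypothetical wrapper: read exactly `length` bytes from the file
    at `path` into a newly allocated buffer using the C functions.
    """
    data = create_string_buffer(length)
    fp = libc.fopen(path.encode("ascii"), b"rb")
    if fp is None:
        # A NULL pointer is returned as None when restype is c_void_p.
        raise OSError("could not open file")
    try:
        # Request one item of `length` bytes; fread returns the
        # number of complete items that were read.
        if libc.fread(data, length, 1, fp) != 1:
            raise OSError("could not read the expected number of bytes")
    finally:
        libc.fclose(fp)
    return data
```

Calling load_into_buffer("data.txt", 32) would then correspond to the interactive steps above, with a missing or truncated file reported as an exception instead of passing unnoticed.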

You can determine the memory address corresponding to the memory buffer data using the addressof function.

>>> from ctypes import addressof
>>> hex(addressof(data))
'0xd788e0'

You can now clear the memory region. The example below uses the memset C function for this purpose. An example that uses a random sequence generator that is appropriate for cryptographic applications appears in the next section.

>>> libc.memset.argtypes = [c_void_p, c_int, c_size_t]
>>> libc.memset(data, 0, 32)
>>> bytes(data).hex()
'0000000000000000000000000000000000000000000000000000000000000000'
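The third requirement listed at the start of this article calls for overwriting the region with random bytes rather than zeros. One possible way to do so without leaving ctypes is sketched below; it assumes that the memmove function from ctypes and the token_bytes function from the built-in secrets module are acceptable for your purposes, and the helper name overwrite_with_random is hypothetical. The random bytes themselves pass through a temporary Python bytes object, which is harmless because they are not sensitive.

```python
from ctypes import addressof, create_string_buffer, memmove
from secrets import token_bytes

def overwrite_with_random(buffer):
    """
    Hypothetical helper: overwrite a ctypes buffer in place
    with random bytes from a cryptographically suitable source.
    """
    length = len(buffer)
    # Copy the random bytes into the memory already owned by the
    # buffer; no new buffer is allocated for the destination.
    memmove(buffer, token_bytes(length), length)

data = create_string_buffer(32)
address_before = addressof(data)
overwrite_with_random(data)
address_after = addressof(data)
```

Comparing address_before and address_after confirms that the same region of memory was overwritten.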

Alternative Approaches

The C Foreign Function Interface library is similar to the built-in ctypes module and is used by some popular packages, including the PyNaCl library that acts as a Python interface for the cryptographic library libsodium. In the example below, the C implementation of the randombytes function is invoked on a character buffer bs and then the contents of that buffer are displayed.

>>> from nacl import _sodium
>>> lib = _sodium.lib
>>>
>>> from cffi import FFI
>>> ffi = FFI()
>>>
>>> bs = ffi.new("unsigned char[]", 8)
>>> lib.randombytes(bs, 8)
>>> bytes(bs).hex()
'a215d495f3e4248f'

You might choose to call the C implementation directly for a variety of reasons, including to improve performance. This may be useful to do when a more high-level library method allocates new memory for a byte sequence during every invocation, while your own solution can reuse the same memory over and over to store each new batch of bytes. In the example below, the time to invoke the C function over one million iterations is measured.

>>> import time
>>> start = time.perf_counter()
>>> for _ in range(10**6):
...     lib.randombytes(bs, 8)
...
>>> time.perf_counter() - start
2.583150100079365

The example below measures the amount of time it takes to invoke the Python wrapper in PyNaCl over the same number of iterations. The longer running time may be the result of a number of factors; regardless of the underlying reason that may apply for any particular function, the example demonstrates that direct access to the C method gives you more control over those factors.

>>> from nacl.bindings import randombytes
>>> start = time.perf_counter()
>>> for _ in range(10**6):
...     bs = randombytes(8)
...
>>> time.perf_counter() - start
5.637964699999429

Permutation Circuit Synthesis via Embedded Languages and Recursion

Andrei Lapets — Mon, 29 Jun 2020 20:10:00 GMT

The ability to synthesize logical circuits as data structures (without any intention of implementing such circuits as hardware) is becoming increasingly relevant as technologies such as garbled circuit protocols and quantum computing platforms begin to mature. Consequently, there is a growing population working in research, prototyping, and even in software application development settings that may find it convenient to have the ability to synthesize circuits dynamically in popular languages such as Python.

This article describes how an embedded language approach coupled with recursion can be used to create a framework that can synthesize a relatively efficient logical circuit for any chosen permutation of the set of all bit vectors of some fixed length. The described approach can actually be applied to the synthesis of a circuit for any function over bit vectors of a fixed length. This article focuses on the case of permutations because it is more challenging to know in advance whether and how circuits that represent a permutation of a space of bit vectors can be optimized, thus motivating a general approach that can produce circuits that may be used directly or as an input to a more specialized circuit optimization process.

Embedded Language for Synthesizing Circuits

This article leverages the circuit and circuitry libraries, which constitute an embedded domain-specific language (with Python acting as the host language) for representing, building, and evaluating circuit descriptions.

from parts import parts
from circuit import *
from circuitry import *

Testing a Synthesis Approach

Before exploring and comparing synthesis techniques, it is useful to establish a standard approach for testing that the synthesis technique produces a circuit that is functionally correct. The function below performs such a test given a synthesis technique.

from itertools import product
from random import shuffle
from tqdm import tqdm

def test(synthesis):
    # Create a permutation of all 8-bit vectors.
    vs_original = list(product(*[[0,1]]*8))
    vs_permuted = list(product(*[[0,1]]*8))
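    # Note: the second argument of `shuffle` was removed in
    # Python 3.11, so this call requires an earlier version.
    shuffle(vs_permuted, lambda: 0.5)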

    # Execute the synthesis function that is being tested.
    # A synthesis function must accept as its inputs an
    # initial vector to evaluate while constructing the
    # circuit (as necessitated by the `circuitry` library),
    # the original list of vectors, and a permuted list of
    # vectors.
    bit.circuit(circuit())
    bs = synthesis(
        bits([input(i) for i in ([0]*8)]),
        vs_original,
        vs_permuted
    )

    # Display some statistics and whether the circuit
    # correctly implements the permutation.
    c = bit.circuit()
    checks = ([
        (vo == tuple(c.evaluate(list(vi))))
        for (vi, vo) in tqdm(
            list(zip(vs_original, vs_permuted)),
            position=0, leave=True
        )
    ])
    print(all(checks))
    print({
        o: c.count(lambda g: g.operation == o)
        for o in [op.and_, op.or_, op.not_]
    })

Naive Synthesis Approach

A naive synthesis approach that utilizes logical formulas can act as a starting point. First, split the permutation f: {0, 1}ⁿ → {0, 1}ⁿ into n separate component functions {fᵢ: {0, 1}ⁿ → {0, 1} | i ∈ {0, …, n − 1}} such that each component function computes one bit of the output bit vector. For each function fᵢ, convert every input vector v ∈ {0, 1}ⁿ that fᵢ maps to 1 into a corresponding formula ψᵥ that is true for exactly that vector. For example, given v = (0, 1, 1, 0), the formula would be ψᵥ(a, b, c, d) = (¬a) ∧ b ∧ c ∧ (¬d).

Then, it is just a matter of taking the disjunction of all such formulas to obtain the formula φᵢ for the component function fᵢ. Finally, the output f(v) ∈ {0, 1}ⁿ of the overall function f on an input vector v ∈ {0, 1}ⁿ can be computed by evaluating each of the n formulas φᵢ on the same input vector v. This approach can be implemented in a very concise manner, as demonstrated below.

from functools import reduce

def naive(xs, vs_ins, vs_outs):
    """
    Synthesize a circuit for the given permutation.
    """
    def clause(xs, kcs):
        if len(kcs) == 1:
            (k, c) = kcs[0]
            return xs[k] if c == 1 else ~xs[k]
        else:
            (kcs0, kcs1) = parts(kcs, 2)
            return clause(xs, kcs0) & clause(xs, kcs1)

    # The set of all clauses, one for each input vector.
    cs = [clause(xs, tuple(enumerate(vi))) for vi in vs_ins]

    # Index sets of input vectors that should be included
    # for each output bit.
    ps = [
        reduce(
            (lambda p, q: p | q),
            [
                clause(xs, tuple(enumerate(vs_ins[r])))
                for (r, vo) in enumerate(vs_outs) if vo[c] == 1
            ]
        )
        for c in range(8)
    ]

    return outputs(ps)

The naive approach can be evaluated and tested. The circuit generated using the approach is correct, but has a relatively large number of gates.

>>> test(naive)
100%|█████████████████████████| 256/256 [00:19<00:00, 13.43it/s]
True
{(0, 0, 0, 1): 7168, (0, 1, 1, 1): 1016, (1, 0): 3711}
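The formula construction described above can also be checked independently of the circuit libraries. The sketch below represents each formula as an ordinary Python function instead of a circuit; the names psi and phi and the 2-bit permutation are illustrative assumptions that do not appear in the original implementation.

```python
from itertools import product

def psi(v):
    """Return a predicate that is true for exactly the vector `v`."""
    # A conjunction of literals: x[k] where v[k] is 1,
    # and the negation of x[k] where v[k] is 0.
    return lambda x: all(
        (xk == 1) if vk == 1 else (xk == 0)
        for (xk, vk) in zip(x, v)
    )

def phi(i, vs_ins, vs_outs):
    """
    Return the disjunction of the clauses for all input vectors
    whose output vector has a 1 in position `i`.
    """
    clauses = [
        psi(vi)
        for (vi, vo) in zip(vs_ins, vs_outs) if vo[i] == 1
    ]
    return lambda x: any(c(x) for c in clauses)

# A hypothetical permutation of all 2-bit vectors (a cyclic shift).
vs_ins = list(product([0, 1], repeat=2))
vs_outs = vs_ins[1:] + vs_ins[:1]

# One formula per output bit.
phis = [phi(i, vs_ins, vs_outs) for i in range(2)]

def f(x):
    """Evaluate every component formula on the same input vector."""
    return tuple(int(p(x)) for p in phis)

results = [f(v) for v in vs_ins]
```

Here results is equal to vs_outs, which mirrors the correctness check performed by the test function for the synthesized circuit.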

Optimized Synthesis Approach

The naive approach described and implemented above creates a circuit that performs a large amount of redundant work. For any pair of input variables a and b, the circuit may have many instances of a gate such as a ∧ b. The optimized approach below attempts to take advantage of the fact that a circuit is a directed acyclic graph, finding opportunities to reuse gates where possible.

Note that the overall goal is not to implement an algorithm that can take a permutation as an input and find the optimal circuit with the minimal number of gates. Instead, the goal is to demonstrate that it is possible to leverage the embedded language for circuits to implement in a concise way a general-purpose greedy circuit synthesis algorithm that is a significant improvement over the naive approach (in terms of the size of the circuits it synthesizes for a given permutation).

Most Frequent Pairs

The optimized synthesis approach relies on the ability to identify a pair of elements (e.g., logical variables) that appears most frequently across a collection of sets. A function for identifying such a pair given a collection of sets ss is presented below. This function takes advantage of the Counter class found in the built-in collections library. Note that in addition to identifying a pair, the function performs a few additional operations that will be useful within the synthesis algorithm.

from collections import Counter

def pair(ss, ds):
    """
    Add to `ds` the pair of elements that appears most
    frequently across all sets in `ss`.
    """
    # Collect all pairs of elements found in every set in `ss`.
    ps = [
        p 
        for s in ss
        for p in [(x, y) for x in s for y in s if x < y]
    ]

    if len(ps) == 0:
        return (ss, ds, False)
    else:
        # Find the most common pair.
        (p, i) = (Counter(ps).most_common(1)[0][0], len(ds))
        ds.append(p)
        
        # Replace these pairs of elements with an index into
        # the corresponding pair in `ds`.
        ss = [
            ((s - set(p)) | {i}) if set(p).issubset(s) else s
            for s in ss
        ]

        return (ss, ds, True)
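A single step of this function can be traced on a small input. In the sketch below, the logic of pair is repeated in condensed form only so that the block runs on its own, and the particular sets are made up for illustration.

```python
from collections import Counter

def pair(ss, ds):
    # Same logic as the function above, condensed.
    ps = [(x, y) for s in ss for x in s for y in s if x < y]
    if len(ps) == 0:
        return (ss, ds, False)
    (p, i) = (Counter(ps).most_common(1)[0][0], len(ds))
    ds.append(p)
    ss = [
        ((s - set(p)) | {i}) if set(p).issubset(s) else s
        for s in ss
    ]
    return (ss, ds, True)

# The pair (0, 1) appears in all three sets, so it is the most
# frequent pair and is assigned the next available index, 4.
ss = [{0, 1, 2}, {0, 1, 3}, {0, 1}]
ds = [0, 1, 2, 3]
(ss, ds, updated) = pair(ss, ds)
# ss is now [{2, 4}, {3, 4}, {4}]
# ds is now [0, 1, 2, 3, (0, 1)]
```

Every occurrence of the elements 0 and 1 together has been replaced by the single index 4, which refers to the entry (0, 1) appended to ds.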

Synthesis with Reuse

The synthesis approach below modifies the naive synthesis approach by introducing two kinds of reuse:

subformulas ψᵥ built for individual conjunction clauses are cached and reused (across all conjunctions) whenever possible and
clauses ψᵥ and their disjunctions are reused across the formulas constructed for the component functions fᵢ (via the heuristic above that looks for disjunctions of pairs of subformulas that occur most frequently at any given stage in the process).

def optimized(xs, vs_ins, vs_outs):
    """
    Synthesize a circuit for the given permutation.
    """
    cache = {}
    def clause(xs, kcs):
        if kcs in cache:
            return cache[kcs]
        elif len(kcs) == 1:
            (k, c) = kcs[0]
            cache[(k, c)] = xs[k] if c == 1 else ~xs[k]
            return cache[(k, c)]
        else:
            (kcs0, kcs1) = parts(kcs, 2)
            cache[kcs] = clause(xs, kcs0) & clause(xs, kcs1)
            return cache[kcs]

    # Construct an initial collection of sets: for each output
    # bit, the indices of the input vectors that map to `1`.
    ss = [
        set(r for (r, vo) in enumerate(vs_outs) if vo[c] == 1)
        for c in range(8)
    ]

    # Keep merging the most frequent pair across all sets
    # until there are no pairs left.
    (ds, updated) = (list(range(len(vs_ins))), True)
    while updated:
        (ss, ds, updated) = pair(ss, ds)

    # Take the disjunction of every formula that corresponds
    # to an input vector that maps to `1`.
    cs = [clause(xs, tuple(enumerate(vi))) for vi in vs_ins]
    for (k, (i, j)) in enumerate(ds[len(vs_ins):]):
        cs.append(cs[i] | cs[j])

    return outputs([cs[k] for [k] in ss])

A test of the optimized approach demonstrates a significant reduction in the number of gates.

>>> test(optimized)
100%|█████████████████████████| 256/256 [00:01<00:00, 189.33it/s]
True
{(0, 0, 0, 1): 303, (0, 1, 1, 1): 494, (1, 0): 16}

This article is also available as a Jupyter Notebook on GitHub.

Permutation Circuit Synthesis via Embedded Languages and Recursion was originally published in Reity LLC on Medium, where people are continuing the conversation by highlighting and responding to this story.

Static Checking via Metaclasses

Andrei Lapets — Mon, 15 Jun 2020 03:48:00 GMT

Suppose you are maintaining a domain-specific machine learning library. Users of the library’s API expect that every machine learning algorithm offered by the API will have the same interface (i.e., the same methods with the same signatures) regardless of its underlying implementation. You would like to allow a community of contributors to define new algorithms that can be added to the library, but you would like to reduce your own effort and that of contributors when it comes to validating that a new algorithm conforms to the API.

Python metaclasses are the underlying, higher-order constructs that instantiate class definitions. Understanding what metaclasses are and how they can be used gives you a significant amount of control over what happens when a new class is introduced by users. This in turn allows you to constrain users when necessary and to provide assistance to users that can save them time and effort.

How Classes are Made

In Python, functions, classes, objects, and values are all on an equal footing. One consequence of this is that it is possible to pass any of these entities as arguments to functions and to return any of these entities as the result of a function (this fact was discussed in another article that covered Python decorators). But this also means that much of the syntax you normally use is actually just syntactic sugar for function calls.

What happens when the Python interpreter executes a class definition such as the one below?

class Document():
    def __init__(self):
        self.is_document = True

The class (not an instance or object of that class, but the class itself) is created and assigned to a variable that is in scope. In the example above, that variable is Document.

>>> Document
__main__.Document

Python’s built-in type function actually serves a number of purposes beyond determining the type of a value. Given a few additional parameters, the type function can be used to define a new class. Executing the statement in the example below is equivalent to executing the class definition for Document above.

def __init__(self):
    self.is_document = True

Document = type('Document', (), {'__init__': __init__})

Now that Document is a class, it is possible to create objects of this class.

>>> d = Document()
>>> d.is_document
True

How Metaclasses are Made

In a manner similar to that of many programming languages that support the object-oriented programming paradigm, Python allows programmers to define derived classes that inherit the attributes and methods of a base class. The example below illustrates this by defining a class Passport that is derived from the Document class. Notice that the base class constructor Document is specified in the class definition.

class Passport(Document):
    pass

The Passport class inherits the attributes of the Document class. The example below illustrates that it inherits the __init__ method of the Document class.

>>> p = Passport()
>>> p.is_document
True

The example in which Document was defined using the built-in type function suggests that type can be viewed (at least using a loose analogy) as a means for creating classes. In a way, it behaves like a constructor for the "class of all possible classes". Thus, if type is a kind of constructor for a class, it should be possible to use it in the same context as any other class constructor. But what should this mean? What is MetaClass in the example below?

class MetaClass(type):
    pass

Following the analogy to its logical conclusion, this must mean that MetaClass has inherited the capabilities of type. And, indeed, it has. In the example below, MetaClass is used to define a new class — in the same way that type was used before.

>>> Document = MetaClass('Document', (), {'__init__': __init__})
>>> d = Document()
>>> d.is_document
True

The ability to use a metaclass in place of type as in the above example is also supported by the more common class syntactic construct.

class Document(metaclass=MetaClass):
    def __init__(self):
        self.is_document = True
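One way to confirm the relationship between a class and its metaclass is to apply the built-in type function to the class itself. The short sketch below repeats the definitions from this section so that it runs on its own; the class Plain and the variable names that hold the results are introduced only for illustration.

```python
class MetaClass(type):
    pass

class Document(metaclass=MetaClass):
    def __init__(self):
        self.is_document = True

class Passport(Document):
    pass

class Plain:
    pass

# The type of a class is its metaclass.
document_meta = type(Document)

# A derived class has the same metaclass as its base class.
passport_meta = type(Passport)

# A class defined without a metaclass parameter is created by type.
plain_meta = type(Plain)

# The type of an object is still its class, as usual.
object_class = type(Document())
```

Here document_meta and passport_meta are both MetaClass, plain_meta is type, and object_class is Document.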

Using Metaclasses to Enforce an API

Returning to the motivating example from the first paragraph, suppose you introduce a metaclass called MetaAlgorithm for machine learning algorithms that is derived from type. This metaclass definition can override the method __new__ that is normally invoked when a new class is defined using type (or using the equivalent class syntactic construct). This alternate definition of __new__ performs some additional checks before the class is actually created. In this use case, that additional work involves validating that the class being defined (corresponding to a new machine learning algorithm) conforms to your API.

from types import FunctionType

class MetaAlgorithm(type):
    def __new__(cls, clsname, bases, attrs):
        
        # The base class does not need to conform to the API.
        # See the paragraph below for an explanation of this check.
        if clsname != 'Algorithm':
            
            # Check that the programmer-defined
            # class has a contributor string.
            if 'contributor' not in attrs or\
               not isinstance(attrs['contributor'], str):
                raise RuntimeError('missing contributor')
            
            # Check that the programmer-defined class has the
            # methods required for your API.
            if 'train' not in attrs or\
               not isinstance(attrs['train'], FunctionType):
                raise RuntimeError('missing training method')
            
            if 'classify' not in attrs or\
               not isinstance(attrs['classify'], FunctionType):
                raise RuntimeError('missing classification method')
        
        return\
            super(MetaAlgorithm, cls)\
            .__new__(cls, clsname, bases, attrs)

Now that there is a way to define new classes, there are two ways to proceed. One approach is to require that all algorithm classes that contributors implement must include the metaclass=MetaAlgorithm parameter in the class definition. However, this is easy for a contributor to forget and also may require that contributors have a solid understanding of metaclasses. An alternative is to create a base class from which all contributed algorithm classes must be derived.

class Algorithm(metaclass=MetaAlgorithm):
    pass

Using this approach, it is sufficient to export the Algorithm base class and to inform all contributors that their classes must be derived from this base class. The example below illustrates how a contributor might do so for a very basic algorithm.

class Guess(Algorithm):
    contributor = "Author"
    
    def train(self, items, labels):
        pass
    
    def classify(self, item):
        import random
        return random.choice([True, False])

As the example below illustrates, an attempt by a user to define a class that does not conform to the API results in an error.

>>> try:
...     class Guess(Algorithm):
...         def classify(self, item):
...             return False
...
... except RuntimeError as error:
...     print("RuntimeError:", str(error))
...
RuntimeError: missing contributor

To emphasize: the error above occurs when the Python interpreter tries to execute the definition of the class, and not when an object of the class is created. It would be impossible to reach the point at which the interpreter attempts to create an object of this class because the class itself can never be defined.

Despite the fact that Python does not technically support static checking beyond ensuring that the syntax of a module is correct, it is arguably justifiable to say that what MetaAlgorithm does is a form of static checking. In many routine scenarios, the checks would be performed at the time that the module is imported and before any other code has had a chance to run.
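One design choice worth noting is that the checks in MetaAlgorithm inspect only attrs, which holds the names defined directly in the body of the class. A class derived from a conforming algorithm that overrides just one method would therefore be rejected. The sketch below shows one possible variant that also accepts inherited attributes. It is an illustration under stated assumptions and not part of the approach described above: the names MetaAlgorithmInherited, REQUIRED_METHODS, and Refined are hypothetical, and the base class is recognized by having no explicit base classes instead of by its name.

```python
from types import FunctionType

# Hypothetical list of the methods required by the API.
REQUIRED_METHODS = ("train", "classify")

class MetaAlgorithmInherited(type):
    def __new__(cls, clsname, bases, attrs):
        # Create the class first so that attribute lookups
        # also find anything inherited from its base classes.
        new_cls = super().__new__(cls, clsname, bases, attrs)

        # Assumption: the exported base class has no explicit
        # base classes and does not need to conform to the API.
        if bases:
            contributor = getattr(new_cls, "contributor", None)
            if not isinstance(contributor, str):
                raise RuntimeError("missing contributor")
            for name in REQUIRED_METHODS:
                method = getattr(new_cls, name, None)
                if not isinstance(method, FunctionType):
                    raise RuntimeError("missing " + name + " method")

        return new_cls

class Algorithm(metaclass=MetaAlgorithmInherited):
    pass

class Guess(Algorithm):
    contributor = "Author"

    def train(self, items, labels):
        pass

    def classify(self, item):
        return False

# Accepted: `contributor` and `train` are inherited from `Guess`.
class Refined(Guess):
    def classify(self, item):
        return True
```

A class derived directly from Algorithm that omits the contributor string or a required method is still rejected at the time its definition is executed.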
