Use Case: Scientific Computing with Multiple Players

In this example, we demonstrate how to perform scientific computation across multiple data owners while keeping the data encrypted throughout the computation.

For simplicity, we start by prototyping this computation between the different parties locally using pm.LocalMooseRuntime. We then execute the same Moose computation over the network with pm.GrpcMooseRuntime. You can also check an additional gRPC example here.

import pathlib

import numpy as np

import pymoose as pm

np.random.seed(1234)

Use case

Here is the use case we are trying to solve: researchers would like to measure the correlation between alcohol consumption and students’ grades. However, the alcohol consumption data and the grades data are owned by the Department of Public Health and the Department of Education, respectively. These datasets are too sensitive to be moved to a central location or exposed directly to the researchers. To solve this problem, we want to compute the correlation metric on an encrypted version of these datasets.

Data

For this demo, we generate synthetic datasets for 100 students. Of course, the correlation result is made up for the purpose of this demo; it only illustrates how Moose can be used.

def generate_synthetic_correlated_data(n_samples):
    mu = np.array([10, 0])
    r = np.array(
        [
            [3.40, -2.75],
            [-2.75, 5.50],
        ]
    )
    rng = np.random.default_rng(12)
    x = rng.multivariate_normal(mu, r, size=n_samples)
    return x[:, 0], x[:, 1]


alcohol_consumption, grades = generate_synthetic_correlated_data(100)

print(
    f"Alcohol consumption data from Department of Public Health: {alcohol_consumption[:5]}"
)
print(f"Grades data from Department of Education: {grades[:5]}")
Alcohol consumption data from Department of Public Health: [11.06803447  9.58819631  6.28498731  9.63183684 11.17578054]
Grades data from Department of Education: [ 0.71290544  2.16473508  2.78613359 -2.32336413  0.4538998 ]

Define Moose Computation

To measure the correlation between alcohol consumption and students’ grades, we will compute the Pearson correlation coefficient.
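As a plaintext reference, the Pearson coefficient is the sum of the products of the deviations of both variables from their means, divided by the square root of the product of their sums of squared deviations. The sketch below spells this out in plain Python. The name pearson_plaintext and the sample lists are illustrative only and are not part of PyMoose.

```python
import math


def pearson_plaintext(x, y):
    """Pearson correlation coefficient on plain Python lists of floats."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Sums of squared deviations from the mean for each variable
    ss_x = sum((xi - x_mean) ** 2 for xi in x)
    ss_y = sum((yi - y_mean) ** 2 for yi in y)
    # Sum of the products of paired deviations
    corr_num = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    corr_denom = math.sqrt(ss_x * ss_y)
    return corr_num / corr_denom


# Illustrative data: one perfectly increasing and one perfectly decreasing relation
hours = [1.0, 2.0, 3.0, 4.0]
rising = [2.0, 4.0, 6.0, 8.0]
falling = [8.0, 6.0, 4.0, 2.0]

print(pearson_plaintext(hours, rising))
print(pearson_plaintext(hours, falling))
```

The Moose version defined in this section follows the same sequence of steps, with each arithmetic operation replaced by its pm counterpart.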

To express this computation, Moose offers a Python DSL (internally referred to as the eDSL, i.e. “embedded” DSL). As you will notice, the syntax is very similar to that of the scientific computing library NumPy.

The main difference is the notion of placements: host placement and replicated placement. With Moose, every operation under a host placement context is computed on plaintext values (not encrypted). Every operation under a replicated placement is performed on secret shared values (encrypted).

We will compute the correlation coefficient between three different players, each of them representing a host placement: Department of Public Health, Department of Education, and a data scientist. The three players are grouped under the replicated placement to perform the encrypted computation.
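To build some intuition for what "secret shared" means, here is a simplified sketch of additive secret sharing among three parties, written with the Python standard library. It is an illustration under stated assumptions: the 64-bit ring size and the helper names share, reconstruct and add_shared are hypothetical, and the protocol Moose actually runs under a replicated placement is more involved and is not reproduced here.

```python
import secrets

RING_BITS = 64  # assumed ring size, chosen for this illustration only
MODULUS = 1 << RING_BITS


def share(value, n_parties=3):
    """Split an integer into n additive shares modulo 2**RING_BITS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % MODULUS
    return shares + [last]


def reconstruct(shares):
    """Recombine shares; values in the upper half of the ring decode as negatives."""
    total = sum(shares) % MODULUS
    return total - MODULUS if total >= MODULUS // 2 else total


def add_shared(a, b):
    """Add two shared values share by share, with no exchange between parties."""
    return [(ai + bi) % MODULUS for ai, bi in zip(a, b)]


a_shares = share(25)
b_shares = share(-7)

# Each individual share is a uniformly random ring element
print(a_shares)
# Only the combination of all shares reveals the result
print(reconstruct(add_shared(a_shares, b_shares)))  # 18
```

Each share on its own is a random number, so no single party learns the input; the value only appears once all the shares are combined.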

The Moose computation below performs the following steps:

  • Loads Department of Public Health’s data in plaintext from its storage.

  • Loads Department of Education’s data in plaintext from its storage.

  • Secret shares (encrypts) the datasets.

  • Computes the correlation coefficient on secret shared data.

  • Reveals the correlation result only to the data scientist and saves it into its storage.

fixedpoint_dtype = pm.fixed(24, 40)
pub_health_dpt = pm.host_placement(name="pub_health_dpt")
education_dpt = pm.host_placement(name="education_dpt")
data_scientist = pm.host_placement(name="data_scientist")

encrypted_governement = pm.replicated_placement(
    name="encrypted_governement",
    players=[pub_health_dpt, education_dpt, data_scientist],
)


def pearson_correlation_coefficient(x, y):
    x_mean = pm.mean(x, 0)
    y_mean = pm.mean(y, 0)
    stdv_x = pm.sum(pm.square(pm.sub(x, x_mean)))
    stdv_y = pm.sum(pm.square(pm.sub(y, y_mean)))
    corr_num = pm.sum(pm.mul(pm.sub(x, x_mean), pm.sub(y, y_mean)))
    corr_denom = pm.sqrt(pm.mul(stdv_x, stdv_y))
    return pm.div(corr_num, corr_denom)


@pm.computation
def multiparty_correlation():

    # The Department of Public Health loads its data in plaintext.
    # The data is then converted from float to fixed-point.
    with pub_health_dpt:
        alcohol = pm.load("alcohol_data", dtype=pm.float64)
        alcohol = pm.cast(alcohol, dtype=fixedpoint_dtype)

    # The Department of Education loads its data in plaintext.
    # The data is then converted from float to fixed-point.
    with education_dpt:
        grades = pm.load("grades_data", dtype=pm.float64)
        grades = pm.cast(grades, dtype=fixedpoint_dtype)

    # The alcohol and grades data are secret shared when moving from the
    # host placements to the replicated placement.
    # The correlation coefficient is then computed on secret shared data.
    with encrypted_governement:
        correlation = pearson_correlation_coefficient(alcohol, grades)

    # Only the correlation coefficient is revealed to the data scientist.
    # Convert it from fixed-point to float and save it in the storage.
    with data_scientist:
        correlation = pm.cast(correlation, dtype=pm.float64)
        correlation = pm.save("correlation", correlation)

    return correlation

Evaluate Computation

For simplicity, we will use LocalMooseRuntime to locally simulate this computation running across hosts. To do so, we need to provide: a Moose computation, a list of host identities to simulate, and a mapping of the data stored by each simulated host.

  • Since we decorated the function multiparty_correlation with pm.computation, we can simply supply this as our Moose computation.

  • The identities correspond to the names of the host placements we defined for our pm.computation.

  • For the simulated data storage, we provide a dictionary mapping between a key and a locally-provided dataset. The key will be used by the load operations in our computation to load the dataset into Moose tensors.

Once you have instantiated the LocalMooseRuntime with the identities and the storage mapping, and set it as the default runtime, you can simply call the Moose computation to evaluate it. If you prefer, you can also evaluate the computation with runtime.evaluate_computation(computation=multiparty_correlation, arguments={}). We can also provide arguments to the computation if needed, but we don’t have any in this example. Note that the output of evaluate_computation is an empty dictionary, since this function’s output operation pm.save returns the Unit type.

executors_storage = {
    "pub_health_dpt": {"alcohol_data": alcohol_consumption},
    "education_dpt": {"grades_data": grades},
}

runtime = pm.LocalMooseRuntime(
    identities=["pub_health_dpt", "education_dpt", "data_scientist"],
    storage_mapping=executors_storage,
)

runtime.set_default()

_ = multiparty_correlation()

Results

Once the computation is done, we can extract the result. The correlation coefficient has been stored in the data scientist’s storage. We can extract the value from the storage with read_value_from_storage.

moose_correlation = runtime.read_value_from_storage("data_scientist", "correlation")
print(f"Correlation result with PyMoose: {moose_correlation}")
Correlation result with PyMoose: -0.5462326644010318

The correlation coefficient is approximately -0.55.

In this simulated setting, we can validate that the result on encrypted data matches the computation on plaintext data. To do so, we compute the Pearson correlation coefficient with NumPy.

np_correlation = np.corrcoef(np.squeeze(alcohol_consumption), np.squeeze(grades))[1, 0]
print(f"Correlation result with Numpy: {np_correlation}")
Correlation result with Numpy: -0.5481005967856092

As you can see, the two coefficients agree to about two decimal places (-0.546 versus -0.548). For improved precision, we can adjust the configuration of the fixed-point dtype used in multiparty_correlation. For example, we can trade off integral precision for fractional precision, or try re-scaling/normalizing the data before casting to fixed-point.
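To make the precision trade-off more concrete, the sketch below encodes a float as a fixed-point integer and measures the round-trip error for a few fractional bit widths. It assumes that the second argument of pm.fixed(24, 40) is the number of fractional bits; the encode and decode helpers are hypothetical and only mimic the idea of a fixed-point cast, not Moose's implementation.

```python
def encode(value, fractional_bits):
    """Encode a float as an integer carrying the given number of fractional bits."""
    return round(value * (1 << fractional_bits))


def decode(encoded, fractional_bits):
    """Convert a fixed-point integer back to a float."""
    return encoded / (1 << fractional_bits)


# Values printed earlier in this tutorial
moose_correlation = -0.5462326644010318
np_correlation = -0.5481005967856092
gap = abs(moose_correlation - np_correlation)
print(f"Gap between Moose and NumPy results: {gap:.6f}")

# Round-trip error of the encoding alone, for different fractional precisions
for fractional_bits in (8, 16, 40):
    roundtrip = decode(encode(np_correlation, fractional_bits), fractional_bits)
    error = abs(roundtrip - np_correlation)
    print(f"{fractional_bits} fractional bits: round-trip error {error:.3e}")
```

With 40 fractional bits, the error introduced by the encoding alone is far smaller than the observed gap. This suggests that the difference comes mostly from the fixed-point arithmetic performed during the computation (for example in the square root and the division) rather than from the initial cast; that is an inference from the numbers above, not something measured here.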

Voilà! You were able to compute the correlation while keeping the data encrypted during the entire process.

Run Computation over the Network with gRPC

To run the same computation over the network, you need to launch a gRPC worker at the right endpoint for each party (the Department of Public Health, the Department of Education, and the data scientist). You can launch the three workers as follows:

cargo run --bin comet -- --identity localhost:50000 --port 50000
cargo run --bin comet -- --identity localhost:50001 --port 50001
cargo run --bin comet -- --identity localhost:50002 --port 50002
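Before moving on, it can be handy to check that the three workers are accepting connections. The helper below is a hypothetical convenience written with the standard library and is not part of PyMoose; it only verifies that something is listening on each TCP port, not that the listener is a Moose worker.

```python
import socket


def is_listening(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Endpoints taken from the three commands above
for port in (50000, 50001, 50002):
    status = "up" if is_listening("localhost", port) else "not reachable"
    print(f"localhost:{port} is {status}")
```

If a port is reported as not reachable, the corresponding worker has most likely not finished starting or was launched on a different port.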

For the data, we will use the NumPy files saved in the data folder of this tutorial, which contain the alcohol consumption data (alcohol_consumption.npy) and the grades data (grades.npy).

For the Moose computation, we use exactly the same computation as with LocalMooseRuntime, except that the keys of the load operations are now the actual file paths.

_DATA_DIR = pathlib.Path().parent / "data"

alcohol_consumption_path = str((_DATA_DIR / "alcohol_consumption.npy").resolve())
grades_path = str((_DATA_DIR / "grades.npy").resolve())


@pm.computation
def multiparty_correlation():

    # The Department of Public Health loads its data in plaintext.
    # The data is then converted from float to fixed-point.
    with pub_health_dpt:
        alcohol = pm.load(alcohol_consumption_path, dtype=pm.float64)
        alcohol = pm.cast(alcohol, dtype=fixedpoint_dtype)

    # The Department of Education loads its data in plaintext.
    # The data is then converted from float to fixed-point.
    with education_dpt:
        grades = pm.load(grades_path, dtype=pm.float64)
        grades = pm.cast(grades, dtype=fixedpoint_dtype)

    # The alcohol and grades data are secret shared when moving from the
    # host placements to the replicated placement.
    # The correlation coefficient is then computed on secret shared data.
    with encrypted_governement:
        correlation = pearson_correlation_coefficient(alcohol, grades)

    # Only the correlation coefficient is revealed to the data scientist.
    # Convert it from fixed-point to float and return the correlation
    # to the data scientist who is launching the computation. You could also
    # save the result to their storage with `pm.save`.
    with data_scientist:
        correlation = pm.cast(correlation, dtype=pm.float64)

    return correlation

For the runtime, we will use pm.GrpcMooseRuntime this time. As an argument, we need to provide a mapping between each player's identity and its gRPC host address. Once you have set the runtime as default, you can call the Moose computation to compute the encrypted correlation. You could also evaluate the computation with runtime.evaluate_computation(computation=multiparty_correlation, arguments={}) if you prefer.

role_map = {
    pub_health_dpt: "localhost:50000",
    education_dpt: "localhost:50001",
    data_scientist: "localhost:50002",
}

runtime = pm.GrpcMooseRuntime(role_map)
runtime.set_default()

correlation_result = multiparty_correlation()

We can finally confirm that we get the same result when running this computation with gRPC 🎉!

print(f"Correlation result: {correlation_result[0]['output_0']}")