As a Data Scientist, you log in with the username and password provided by the data owner and perform Remote Data Science.
import syft as sy
import numpy as np
import matplotlib, matplotlib.pyplot as plt
import os
import pandas as pd
%matplotlib inline
ds_node = sy.login(email="zoheb@amat.com", password="bazinga", port=8081)
Connecting to None... done! Logging into local_node... done!
Let's check our initial privacy budget.
The privacy budget represents how much noise the data scientist can remove from a dataset when accessing it. Domains will set a privacy budget per data scientist.
ds_node.privacy_budget
700.0
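To make the idea of a per-user budget concrete, here is a minimal sketch of a budget ledger. It is a toy model only: the `BudgetLedger` class, its `spend` method, and the cost of 12.5 are hypothetical names and numbers chosen for illustration, and they do not reflect how PySyft accounts for privacy budget internally.

```python
from dataclasses import dataclass


@dataclass
class BudgetLedger:
    """Toy per-data-scientist privacy budget (hypothetical, not PySyft's accounting)."""

    remaining: float

    def spend(self, epsilon: float) -> float:
        """Deduct the cost of one query, refusing it if the budget would go negative."""
        if epsilon < 0:
            raise ValueError("epsilon must be non-negative")
        if epsilon > self.remaining:
            raise PermissionError("query would exceed the remaining privacy budget")
        self.remaining -= epsilon
        return self.remaining


# Start from the budget shown above and spend an illustrative, made-up amount.
ledger = BudgetLedger(remaining=700.0)
ledger.spend(12.5)
print(ledger.remaining)  # 687.5

# A query that costs more than what is left is rejected.
try:
    ledger.spend(1000.0)
    overspend_allowed = True
except PermissionError:
    overspend_allowed = False
```

The point of the sketch is only that every noised release has a cost, and that once the budget is exhausted no further releases are allowed until the data owner grants more.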
ds_node.datasets
Idx | Name | Description | Assets | Id |
---|---|---|---|---|
[0] | COVID19 Cases in 175 countries | Weekly data for an entire year | ["Country 0"] -> Tensor ["Country 1"] -> Tensor ["Country 2"] -> Tensor ... | 51da7d0f-7e80-4b82-b5aa-9814a3ee9cef |
covid_ds = ds_node.datasets[0]
covid_ds
Dataset: COVID19 Cases in 175 countries Description: Weekly data for an entire year WARNING: Too many assets to print... truncating... You may run assets = my_dataset.assets to view receive a dictionary you can parse through using Python (as opposed to blowing up your notebook with a massive printed table).
Asset Key | Type | Shape |
---|---|---|
["Country 0"] | Tensor | (53,) |
["Country 1"] | Tensor | (53,) |
["Country 2"] | Tensor | (53,) |
["Country 3"] | Tensor | (53,) |
["Country 4"] | Tensor | (53,) |
["Country 5"] | Tensor | (53,) |
["Country 6"] | Tensor | (53,) |
["Country 7"] | Tensor | (53,) |
["Country 8"] | Tensor | (53,) |
["Country 9"] | Tensor | (53,) |
["Country 10"] | Tensor | (53,) |
["Country 11"] | Tensor | (53,) |
["Country 12"] | Tensor | (53,) |
["Country 13"] | Tensor | (53,) |
["Country 14"] | Tensor | (53,) |
Printing the dataset does not reveal its values, so the underlying data cannot be copied this way. Printing it shows only a reference to the remote Dataset object:
print(covid_ds)
<syft.core.node.common.client_manager.dataset_api.Dataset object at 0x10f9856a0>
Create result, a pointer to one of the selected dataset's tensors.
result = covid_ds["Country 0"]
publish uses the privacy budget approved by the data owner to return the data in a noised form that does not compromise the original dataset. sigma sets how much noise is added: a larger sigma means a noisier result and less privacy budget spent.
published_result = result.publish(sigma=1000)
We call get() to retrieve the contents of the published_result pointer created above; block_with_timeout(60) waits up to 60 seconds for the result to become available.
published_data = published_result.block_with_timeout(60).get()
published_data
array([ 100.73878185, 541.4500865 , -375.10293856, 2858.73702973, 686.31945154, -20.76044026, 1197.38230958, 640.06508438, 347.7196077 , 990.81971463, 588.6657162 , 1142.07340362, 64.05201107, 1212.90968109, 1354.55840716, 469.85859676, 800.12409571, 406.92008934, -581.09715378, -182.33866302, -1601.69871867, 344.73025418, -1440.2914348 , -1037.69893063, -1455.43654042, 15.35680767, -562.58933802, 1449.02276369, -321.00256185, 455.77455451, -367.60258788, 1993.28491317, -1531.85406781, 489.68772356, 354.53473314, 91.88429386, 729.65001485, 1101.29951442, -257.16234613, 88.52534715, 204.61057498, 321.02971848, 1061.47491978, -1127.56615556, 263.99707188, -1471.40921471, -207.98838313, 729.49451665, 125.73934123, 1501.26873026, 1553.67660508, 681.24566677, -973.26207448])
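To build intuition for what sigma does in a published result like the one above, the following sketch adds zero-mean Gaussian noise to a made-up series with plain NumPy. It assumes the noise is Gaussian with standard deviation sigma; `publish_with_noise` and `true_counts` are hypothetical names, and this is not PySyft's implementation of publish.

```python
import numpy as np


def publish_with_noise(values, sigma, seed=None):
    """Return values plus zero-mean Gaussian noise with standard deviation sigma.

    Illustrative stand-in only; the real publish call also tracks privacy budget.
    """
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    return values + rng.normal(loc=0.0, scale=sigma, size=values.shape)


# Made-up weekly series with the same length as one asset (53 weeks).
true_counts = np.linspace(0.0, 500.0, 53)

low_noise = publish_with_noise(true_counts, sigma=10, seed=0)
high_noise = publish_with_noise(true_counts, sigma=1000, seed=0)

# Mean absolute distortion grows with sigma.
err_low = np.abs(low_noise - true_counts).mean()
err_high = np.abs(high_noise - true_counts).mean()
print(err_low, err_high)
```

With sigma=1000 the individual values can be off by hundreds or thousands, which matches the negative case counts visible in the published array.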
Check the privacy budget again. Publishing is expected to reduce it, although the value recorded in this run is unchanged.
print(ds_node.privacy_budget)
700.0
You can request additional budget from the data owner:
#ds_node.request_budget(eps=100, reason="I want to do more data exploration")
Let's plot the noisy data. Compared with the data as visualized by the data owner, differential privacy makes it impossible to reproduce the exact same plot, but the broad statistical properties of the data that modeling relies on are preserved.
data_df = pd.DataFrame(published_data)
data_df.plot(legend=False)
<AxesSubplot:>
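def plot_extrapolated_country(idx):
    # Plot the 53 published weekly values for one column of data_df
    x = list(range(53))
    y = data_df.loc[:, idx].values
    plt.plot(y)
    # Fit a degree-2 polynomial to the noisy series
    z = np.polyfit(x, y, 2)
    f = np.poly1d(z)
    # Extrapolate 12 weeks past the end of the data (weeks 53 to 64)
    new_points = range(12)
    new_y = []
    for x2 in new_points:
        new_y.append(f(53 + x2))
    plt.plot(range(53, 65), new_y)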
plot_extrapolated_country(0)
As you can see above, the data is obscured by noise, but the fitted trend still moves in the expected direction.
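The claim that a trend survives the noise can be checked on synthetic data where the true signal is known. The sketch below is an assumption-laden illustration: the quadratic `true_trend` and its coefficients are made up, and the noise is modeled as Gaussian with standard deviation 1000 to mirror the sigma used earlier. It fits the same degree-2 polynomial and compares how far the raw noisy points and the fitted curve are from the true signal.

```python
import numpy as np

rng = np.random.default_rng(42)
weeks = np.arange(53)

# Made-up underlying trend (hypothetical coefficients, for illustration only).
true_trend = 0.5 * weeks**2 + 3.0 * weeks + 20.0

# Noisy release: the true trend plus Gaussian noise with standard deviation 1000.
noisy = true_trend + rng.normal(0.0, 1000.0, size=weeks.shape)

# Fit a degree-2 polynomial to the noisy points, as in plot_extrapolated_country.
coeffs = np.polyfit(weeks, noisy, 2)
fitted = np.poly1d(coeffs)(weeks)

# Root-mean-square distance from the true trend, before and after fitting.
rmse_noisy = np.sqrt(np.mean((noisy - true_trend) ** 2))
rmse_fitted = np.sqrt(np.mean((fitted - true_trend) ** 2))
print(rmse_noisy, rmse_fitted)
```

Because the least-squares fit averages the noise over all 53 points, the fitted curve lies closer to the true trend than the individual noisy values do, which is why aggregate modeling remains useful even when single data points are unreliable.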
This is the power of Remote Data Science. We're able to work with data and get its benefits without directly owning it or compromising the privacy of the subjects whose data was collected.