This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Privacy-Preserving Identification of Target Patients from Outsourced Patient Data

Xiaojie Zhu University of OsloOsloNorway [email protected] Erman Ayday Case Western Reserve UniversityOHUSA [email protected]  and  Roman Vitenberg University of OsloOsloNorway [email protected]
Abstract.

With the increasing affordability and availability of patient data, hospitals tend to outsource their data to cloud service providers (CSPs) for the purpose of storage and analytics. However, the concern of data privacy significantly limits the data owners’ choice. In this work, we propose the first solution, to the best of our knowledge, that allows a CSP to perform efficient identification of target patients (e.g., pre-processing for a genome-wide association study - GWAS) over multi-tenant encrypted phenotype data (owned by multiple hospitals or data owners). We first propose an encryption mechanism for phenotype data, where each data owner is allowed to encrypt its data with a unique secret key. Moreover, the ciphertext supports privacy-preserving search and, consequently, enables the selection of the target group of patients (e.g., case and control groups). In addition, we provide a per-query based authorization mechanism for a client to access and operate on the data stored at the CSP. Based on the identified patients, the proposed scheme can either (i) directly conduct GWAS (i.e., computation of statistics about genomic variants) at the CSP or (ii) provide the identified groups to the client to directly query the corresponding data owners and conduct GWAS using existing distributed solutions. We implement the proposed scheme and run experiments over a real-life genomic dataset to show its effectiveness. The result shows that the proposed solution is capable to efficiently identify the case/control groups in a privacy-preserving way.

Privacy, Cloud computing, Genomics, Association studies, Applied cryptography
copyright: noneccs: Security and privacy Management and querying of encrypted data

1. Introduction

With the advent of precision medicine, healthcare providers are empowered with the ability to select treatments based on a genetic understanding of the patient’s disease. For instance, for cancer patients, a potential treatment may be a combination of surgery, chemotherapy, radiation, and immunotherapy, depending on the type of cancer and its stage. Precision medicine can help decide on specific personalized treatment plans with certain drugs proving more effective treatment for specific genetic profiles.

However, to pave the way to such precision medicine, research over large volumes of patient data is required. This leads to the popularity of large-scale patient data sharing, such as Patient-Centered Clinical Research Network (PCORNet) (selby2012patient, ) in the US, TranSMART (Athey2013, ) in the EU, and the Global Alliance for Genomics and Health (GA4GH) (GAGH, ). These data-sharing systems are required to comply with the increasingly stringent privacy regulations (e.g., HIPAA (hipaa, ) or GDPR (gdpr, )).

Both centralized and decentralized approaches have been explored in this context. In the centralized schemes (kim2015private, ; sadat2017safety, ), all data owners encrypt their data and outsource them to a common repository, where collective computation becomes possible. In the decentralized schemes (froelicher2021truly, ; raisaro2018m, ), all data owners store their data locally and run a distributed protocol to conduct the computation over all databases. Although both of these approaches provide the ability to conduct computation over combined data, both approaches assume that the client (who sends a query for the computation) already knows which database owners to query for the analysis. However, in real-life, there is a pre-computation step to identify which databases are utilized in the query processing. For instance, if a physician wants to identify similar patients to a given one based on the symptoms and then conduct some statistical tests on the identified patients, they should first contact all database owners to identify the useful databases. Similarly, to conduct a genome-wide association study (GWAS), a client first needs to identify which database owners have genomes belonging to the desired case and control group specifications.

In this work we aim to fill this gap. In general, we propose to develop efficient, privacy-preserving algorithms that can conduct this pre-computation (i.e., identification of target patients) at a centralized setting, at a CSP, over data from multiple database owners while also providing inherent access control. After the pre-computation step, we support either (i) computation of statistics over the identified records at the CSP (e.g., for simple statistical operations, such as calculation of minor allele frequencies and chi-square values) or (ii) utilization of other distributed solutions (e.g., (froelicher2021truly, ; raisaro2018m, )) between the data owners of the identified records (for more complex operations).

In the following, we discuss the main challenges of conducting large-scale privacy-preserving identification of target patients over multi-tenant patient data at the CSP.

Encryption of data using unique cryptographic keys. To protect data from a potentially curious CSP, each data owner (e.g., hospital) should encrypt its dataset before outsourcing. Moreover, each hospital should use its unique secret key for encryption. By doing so, hospitals can maintain data privacy even if some of the hospitals are corrupted. On the other hand, when each hospital encrypts its dataset with its own unique key, computation over a combined dataset becomes challenging.

Efficient and privacy-preserving search over encrypted data across hospitals. All the patient data should be encrypted against a potentially curious CSP. However, the ciphertext should support secure search operation to facilitate selection of target patients (e.g., case and control groups for the association study) in a privacy-preserving and efficient way. Moreover, to avoid issuing a separate query for each hospital’s dataset, searchability of the encrypted data should be supported across hospitals.

Fine-grained access control of data. Since the data is stored at the CSP, the data owner loses the direct control of the data. In order to guarantee that the data is accessed properly, the data owner needs to enforce an access policy over the outsourced data.

In this work, we propose a privacy-preserving framework for the identification of target patients over outsourced patient data. To the best of our knowledge, this is the first framework that tackles all the aforementioned challenges. In the proposed scheme, each hospital encrypts its own dataset with its unique key and outsources the storage and processing of the search operation over encrypted data. We mainly focus on identifying the target patients based on their encrypted phenotype data (which is typically the case in GWAS). For privacy of phenotype data, we encrypt them with a novel pairing-based encryption algorithm. The proposed encryption algorithm also provides the ability to efficiently search over encrypted phenotype data from several hospitals. This enables efficient identification of target group of patients (e.g., case and control groups) across hospitals. Furthermore, to let the phenotype data be properly accessible by legitimate/authorized clients, we introduce a fine-grained authorization scheme. For a single authorization request, each phenotype is considered as a unit of authorization. We implement and evaluate the proposed scheme using a real-life genomic dataset. Our result shows that the proposed scheme requires less than 2 seconds to identify the case and control groups (each of size 50) over a dataset with 1052 patients.

The rest of the paper is organized as follows. In the next section, we discuss the state of the art. In Section 3, we introduce the primitives we used in this paper. In Section 4, we present the models, including system model, data model, query model, and security model. Then, in Section 5 we describe the proposed scheme in detail. In Section 6, we provide the privacy analysis of the proposed solution. In Section 7, we evaluate the performance. Finally, we conclude the paper in Section 9.

2. Related Work

With the increasing volume of patient data, there are two lines of research work to process the data. The first approach is the centralized solution, which requires large amount of data to be stored in a single repository. In order to match the requirement of storage and computation power, the CSP is the most popular central repository. However, concerns about data privacy in cloud storage arise due to malicious attacks from inside and outside of the CSPs (clouddatabreach, ). To resolve the concern, Kim et al. (kim2015private, ) applied the BGV scheme (gentry2012homomorphic, ) and YASHE scheme (bos2013improved, ) to encrypt the patient data and conduct secure evaluation of χ2\chi^{2} distribution over the encrypted data. Sadat et al. (sadat2017safety, ) proposed a hybrid system called SAFETY, which combines Paillier encryption scheme and Intel Software Guard Extensions (Intel SGX) to improve the efficiency of the chi-square test. The second approach is the decentralized solution, where each participant stores its data locally. In order to inter-operate the patient data over multiple data providers, Kamm et al. (kamm2013new, ) proposed to secretly share the sensitive data among several parties and compute GWAS over the distributed data. Similarly, Bogdandov et al. (bogdanov2014privacy, ) adopted secret-sharing based techniques to implement a privacy-preserving framework for statistical analysis on federated datasets. Raisaro et al (raisaro2018m, ) combines homomorphic encryption and obfuscation techniques to achieve privacy-preserving medical data sharing among many clinical sites. Froelicher et al. (froelicher2021truly, ) proposed a multiparty homomorphic encryption algorithm. Based on the algorithm, each party can store the data locally and be able to run analysis algorithms over all the participants’ data without privacy violation.

In the previous work (zhu2019privacy, ), we proposed a scheme to find similar patients based on genomic makeup. In that scheme, with an assumption that all the data owners share a secret key, all the data owners build an index for their data using the secret key to support similar patient search. In this paper, we attempt to remove not only the assumption, but also the index.

Contribution. Compared with the existing work, our main contributions are as follows:

  1. (1)

    Our scheme supports multiple hospitals to outsource their data to the CSP and each of them encrypts its data with its own key while computation can be properly conducted over all the data.

  2. (2)

    Previous work assumes that the case and control groups for the association study are already known. This cannot meet the requirement of dynamic selection of case/control groups. In our paper, we propose a scheme that supports to dynamically select the case and control groups, which can be easily integrated to the existing solutions of association study.

  3. (3)

    The authorization mechanism in the proposed scheme is designed based on per query request and it supports fine-grained authorization.

3. Background

In this section, we first briefly introduce the background of genomics. Then, we introduce symmetric bilinear groups applied in the proposed pairing-based encryption algorithm.

3.1. Genomics Background

The most common mutation in human population is called single nucleotide polymorphism (SNP). It is the variation in a single nucleotide at a particular position of the genome (risch2000searching, ). There are about 5 million SNPs observed per individual and sensitive information about individuals (such as disease predispositions) are typically inferred by analyzing the SNPs. Two kinds of nucleotides (or alleles) are observed for each SNP: (i) major allele is the one that is observed with a high frequency and (ii) minor allele is the one that is observed with low frequency. The frequency of the minor allele in a given population is denoted as the minor allele frequency (MAF). Each SNP includes two nucleotides, one inherited from the father and the other one from mother. For simplicity, we represent the value of a SNP ii as the number of its minor alleles, and hence SNPi{0,1,2}SNP_{i}\in\{0,1,2\}. A SNP is represented by an (ID, value) pair, where the ID is taken from a large standardized set of strings and the value is in {0,1,2}\{0,1,2\}. In the following sections, if we mention a SNP (or SNPs) without mentioning the ID or value, we mean both parts.

3.2. Symmetric Bilinear Groups

Let GG be an additive group of prime order pp and g1Gg_{1}\in G be the generators of GG and g2Gg_{2}\in G is a random element from GG. Let e:G×GGTe:G\times G\rightarrow G_{T} be a function which maps two elements from GG to an element in the target group GTG_{T} having prime order pp. The tuple (G,GT,p,e)(G,G_{T},p,e) is an symmetric bilinear group if following properties hold:
(a) the group operations in GG, GTG_{T} are efficiently computable.
(b) the mapping ee from GG to GTG_{T} is efficiently computable.
(c) the mapping ee is non-degenerate: e(g1,g2)1e(g_{1},g_{2})\neq 1.
(d) the mapping ee is bilinear: for all a,bpa,b\in\mathbb{Z}_{p}, e(g1a,g2b)=e(g1,g2)abe(g_{1}^{a},g_{2}^{b})=e(g_{1},g_{2})^{ab}. Based on the symmetric bilinear group, we design an encryption algorithm to encrypt phenotypes of patients.

4. Models

In this section, we describe the system, data, query, and security models for our proposed scheme. We present the frequently used notation in Table 1.

Table 1. Frequently used notation
λ\lambda the security parameter of the proposed scheme
τ\tau the number of hospitals
nn the number of phenotype attributes
for all the hospitals
mm the number of SNP identities for all the hospitals
tit_{i} the number of patients in hospital ii
NN the size of case and control groups
𝐀\mathbf{A} the global ordered set of phenotype attributes, |A|=n|A|=n
P.name the attribute of a phenotype, P.name \in 𝐀\mathbf{A}
P.val the value of a phenotype, P.val \in {0, 11}
SNP.ID the identity of a SNP
SNP.val the value of a SNP
pi,jp_{i,j} the pseudonym of patient jj in hospital ii.
𝐯i,j\mathbf{v}_{i,j} the vector of SNP values for patient jj in hospital ii
Ψi\Psi_{i} set of all patient records in hospital ii
θi,j\theta_{i,j} a record of phenotype of patient jj in hospital ii,
including the pseudonym and all phenotypes
of patient jj
ϑi,j\vartheta_{i,j} a record of SNP data of patient jj in hospital ii,
including pseudonym and set of pairs of
SNP.id and SNP.val
𝐂i,SNP\mathbf{C}_{i,\textit{SNP}} the encrypted SNP dataset of hospital ii
DictiP\textit{Dict}_{i}^{P} the encrypted phenotype dataset of hospital ii
KiK_{i} the symmetric key of hospital ii,
applied to encrypt/decrypt phenotype data
PKiPK_{i} the private key of hospital ii, only revealed to a
legitimate client
SKiSK_{i} the master key of hospital ii
qcq_{c} the query generated by client c
tki,c,ktk_{i,c,k} the transform key that hospital ii generates for
authorizing client c to access the phenotype kk
HH a hash function
ee a bilinear mapping

4.1. System Model

As shown in Figure 1, the system consists of three types of entities: clients, hospitals, and a cloud service provider (CSP). The hospital collects biological samples from patients and sequences them with patients’ consent. In parallel, the hospital records various phenotypes of patients (e.g., height, eye color, and blood type). Instead of storing all genotype and phenotype data locally, the hospital encrypts the data and outsources them to the CSP. The client (e.g., a medical researcher) queries the CSP for different association studies on datasets of several hospitals. Before sending a query to the CSP, the client needs authorization from the involved hospital(s). If the client gets such an authorization, the corresponding computation is allowed to be conducted over all hospitals’ datasets. Upon receiving a query from a client, the CSP first constructs the identification of target group of patients (e.g., case and control groups) by running the query over the encrypted phenotype data, and then executes the other algorithms (e.g., GWAS) over the encrypted genotype data of individuals in the identified groups (e.g., case and control groups).

Refer to caption
Figure 1. Proposed system model. Hospitals encrypt their data and outsource them to the cloud service provider (CSP). A client sends an authorization request to a hospital for accessing its data. If the request is approved, the client gets authorization and is able to send request to the CSP to access the hospital’s data.

4.2. Data Model

In the proposed scheme, we make an assumptions to make sure that the data model is uniform across hospitals. We assume there exists a common set of terms applied to describe all the phenotypes across hospitals (referred as “phenotype attributes”). That is, all the hospitals use the same terms to describe the same phenotypes.

For efficiency, we represent the value of a phenotype attribute in a binary format (as 0 or 11). 11 means that the patient matches the phenotype attribute while 0 denotes the opposite. For instance, Table 2 illustrates a partial taxonomy of phenotypes that includes different height ranges in centimeters, as well as presence of breast cancer.

The value of each SNP is set to represent its number of minor alleles (0, 11, or 22) (see the details in Section 3.1). Based on these settings, the phenotype and SNP dataset for hospital ii are represented as in Tables 2 and 3, respectively. In the following sections, when we mention phenotype, it means both phenotype attribute and value, and when we mention SNP, it means both SNP identity and value.

Table 2. Phenotype dataset of hospital ii in terms of height and breast cancer. A patient record includes a list of phenotype attribute values and each of them is either 0 or 11.
Pseudonym P.name height breast cancer
<100<100 [100,120) \cdots >180>180
pi,1p_{i,1} 0 1 \cdots 0 0
\vdots \vdots \vdots \cdots \vdots \vdots
pi,tip_{i,t_{i}} 0 0 \cdots 0 1
Table 3. SNP dataset of hospital ii. A patient record have a list of SNP values and each SNP value is either 0, 11 or 22.
Pseudonym SNP identity SNP.ID1\textit{SNP.ID}_{1} \cdots SNP.IDm\textit{SNP.ID}_{m}
pi,1p_{i,1} 1 \cdots 0
\vdots \vdots \cdots \vdots
pi,tip_{i,t_{i}} 2 \cdots 1

4.3. Query Model

The query mechanism is designed to find target groups of patients for specified phenotypes. In the proposed scheme, the query includes two parameters: (i) a list of phenotypes of interest and (ii) a parameter to set the size of matching groups (e.g., case and control groups). As an instance, we use case and groups to illustrate the query mechanism. For simplicity, we assume the size of case and control groups to be equal, but it can be customized to support different sizes.

Prior to outsourcing the data to the CSP, a hospital first encrypts its phenotype dataset by using the proposed pairing-based encryption algorithm that supports efficient search over encrypted phenotypes (as discussed in Section 5). A query is generated by the client with input phenotype attributes and sent to the CSP. The proposed scheme allows a client to form the case and control groups based on multiple phenotype criterion (e.g., an association study for a particular type of cancer can be conducted only on males within a specific age range). The CSP, using the properties of the proposed pairing-based encryption, can check whether an encrypted patient’s record (in a hospital’s dataset) contains the phenotype attributes inside the query. If all the phenotype attributes in the query are included in a patient’s record, the record is added into case group. In contrast, if a patient’s record does not include any of the phenotype attributes in the query, the patient’s record is added into the control group. Note that the query is encrypted without disclosing any phenotype information to the CSP and the CSP cannot learn any information from the search process. Thus, the CSP identifies the individuals in the case and control groups in a privacy-preserving way. The case and control groups may possibly contain patients from multiple hospitals. Specifically, given a query from a client and transform keys (detailed in Section 5.6) from hospitals that authorize the client to access their data, the CSP can search multiple hospitals’ phenotype datasets (encrypted by using different secret keys). Thus, the search result is not limited to one hospital’s dataset.

4.4. Threat Model

Our threat model is consistent with previous work (lu2015privacy, ; schneider2019episode, ). Client, hospital(s), and the CSP are assumed to be semi-honest, that is, they honestly follow the protocol while trying to learn extra information during the protocol. A hospital may try to use its own data and knowledge to infer another hospital’s data either via collusion with the CSP or exploring the common patient records in different hospitals. The CSP may analyze the stored ciphertext and observe the encrypted queries. Based on this information, the CSP may try to extract sensitive information. Also, a client may try to infer patients’ data without having proper access authorization. Specifically, we consider following attacks:

Ciphertext analysis: Since all the data are outsourced to the CSP, the CSP can run different algorithms to analyze ciphertext and try to extract meaningful information.

Query analysis: Since all the queries are sent to the CSP, the CSP may analyze the received queries and their frequency. Consequently, the CSP may try to infer the query pattern and content.

Operation analysis: The CSP conducts search computation and is able to obtain all the transcripts of the operation. Based on this information, the CSP may try to infer the content of query and stored ciphertext.

Unauthorized access: Since all the hospitals’ data is outsourced to the CSP, a client may try to access a hospital’s data without authorization from that hospital.

Collusion between hospitals: To infer a target hospital’s data, several hospital may collude with each other and combine their knowledge.

Collusion between hospitals and the CSP: If some hospitals and the CSP reach consensus on common interest to learn another hospital’s (victim’s) data, all the related parties combine their knowledge and try to infer the sought information. For instance, if the search (at the CSP) includes two hospitals and if one of these hospital collude with the CSP, the CSP can learn which patients of the other (victim) hospital has the considered phenotype as a result of the search operation. However, to provide the common search functionality, this attack is unavoidable, and hence we do not consider it in this work.

We thoroughly analyze all these attacks and robustness of the proposed scheme against them in Section 6.

5. Proposed Scheme

In this section, we first give an overview of the proposed scheme, and then describe its details.

5.1. Overview

In the proposed scheme, hospitals independently encrypt and outsource their phenotype datasets to a CSP and the CSP conduct search operation over the outsourced federated data (from multiple hospitals) to identify target group of patients. To process phenotype data in an efficient and privacy-preserving way, we propose a novel encryption mechanism to encrypt the phenotype data. The proposed encryption scheme allows different hospitals to encrypt their phenotype data independently with their own secret keys. Moreover, the encryption algorithm supports identification of case and control groups efficiently, without information leakage. Furthermore, the identification process requires less communication compared with secure multi-party computation-based approaches (schneider2019episode, ) and less computation compared to homomorphic encryption-based approaches (akavia2019setup, ). In general, the execution of privacy-preserving identification of target group of patients can be divided into seven phases: data preprocessing, initialization, key generation, data encryption, client authorization, query generation, identification of case and control groups. We now present a high-level overview of these seven phases while the rest of this section provides in-depth details.

Data Preprocessing. Phenotype and genotype data need to be properly encoded in the required data format so that further processing can be conducted.

Initialization. In this phase, the Setup is performed to initialize the parameters and functions.

Key generation. In this phase, the KeyGen function is executed to generate secret keys for hospitals.

Data encryption. Hospitals recursively call SinglePhenotypeEncrypt to encrypt their phenotype data.

Client authorization. Before requesting access to a hospital’s dataset (from the CSP), a client obtains an authorization from the hospital.

Query generation. To run query over outsourced data with specified phenotypes, a query is generated by recursively running query generation algorithm SinglePhenotypeQueryGen with the input of target phenotypes.

Identification of the case and control groups. Given a query, the CSP identifies whether a patient record is in case group or control group by running the Search algorithm.

5.2. Data Preprocessing

In this section, we present the procedures we use to preprocess the phenotype data, so that the processed data matches the data format requirements.

5.2.1. Preprocessing Phenotype Data

The representations of phenotype attributes should be uniform for all the hospitals, and hence we assume that all the hospitals share a common ordered set of phenotype attributes, denoted as 𝐀\mathbf{A}. We also assume that the value corresponding to a phenotype attribute is binary: 11 represents a patient having such phenotype attribute, while 0 means the opposite. In Table 2, we illustrate the processed phenotype data for height and breast cancer. As seen in the table, in the dataset (Ψi\Psi_{i}) of hospital ii, a patient record can be represented as θi,j={pi,j,(P.name1,P.valj,1),,(P.namen,P.valj,n)}\theta_{i,j}=\{p_{i,j},(\textit{P.name}_{1},\textit{P.val}_{j,1}),\cdots,(\textit{P.name}_{n},\textit{P.val}_{j,n})\}, where P.namek𝐀\textit{P.name}_{k}\in\mathbf{A}, P.valj,k{0,1}\textit{P.val}_{j,k}\in\{0,1\}, k[1,n]k\in[1,n] and j[1,ti]j\in[1,t_{i}].

5.3. System Initialization

In this section, we present the procedures during the initialization so that all algorithms use the same initial parameters.

5.3.1. Initialization for Phenotype Data Encryption

With the input of the security parameter λ\lambda, the system first generates a symmetric bilinear mapping e:G×GGTe:G\times G\rightarrow G_{T}, where the multiplicative cyclic group GG is generated by generator gg and has the prime order pp (p>2λp>2^{\lambda}). Then, a cryptographic hash function HH is selected. Above procedures are detailed in Setup algorithm that is shown in Algorithm 1.

Algorithm 1 Setup
1:λ\lambda
2:ee, HH
3:set e:G×GGTe:G\times G\rightarrow G_{T}
4:choose H:{0,1}×{0,1}{0,1}λH:\{0,1\}^{*}\times\{0,1\}^{*}\rightarrow\{0,1\}^{\lambda}
5:return ee, HH

5.4. Key Generation

In this section, we present the procedures to generate the required keys in the system.

5.4.1. Key Generation for Phenotype Data Encryption

Hospital ii randomly selects a master key SKiZpSK_{i}\in Z_{p}, and sets its private key as PKi=gSKiGPK_{i}=g^{SK_{i}}\in G. In addition, hospital ii selects a secret key KiK_{i} from {0,1}λ\{0,1\}^{\lambda}. This is detailed in the Keygen procedure presented in Algorithm 2.

Algorithm 2 KeyGen
1:ii, λ\lambda
2:PKiPK_{i}, SKiSK_{i}, KiK_{i}
3:SKiZpSK_{i}\leftarrow Z_{p}
4:PKigSKiPK_{i}\leftarrow g^{SK_{i}}
5:Ki{0,1}λK_{i}\leftarrow\{0,1\}^{\lambda}
6:return PKiPK_{i}, SKiSK_{i}, KiK_{i}

5.4.2. Key Generation for client

The system randomly selects a number SKcpSK_{c}\in\mathbb{Z}_{p} and generates a key PKc=gSKcPK_{c}=g^{SK_{c}} for each client. Once a client joins the system, it is assigned SKcSK_{c} and PKcPK_{c}.

5.5. Data Encryption

In this section, we describe the proposed phenotype data encryption algorithm, which supports efficient privacy-preserving search and update.

5.5.1. Phenotype Data Encryption

The set of phenotype attributes is denoted by 𝐀\mathbf{A}, as described before. Phenotype data of a patient jj in hospital ii is represented as θi,j={pi,j\theta_{i,j}=\{p_{i,j}, (P.name1\textit{P.name}_{1}, P.valj,1\textit{P.val}_{j,1}), \cdots, (P.namen\textit{P.name}_{n}, P.valj,n)}\textit{P.val}_{j,n})\}. For each phenotype kk of the patient (phenotype name-value pair, i.e., P.namek,P.valj,k\textit{P.name}_{k},\textit{P.val}_{j,k}), the hospital first selects a random number ri,j,kr_{i,j,k} from p\mathbb{Z}_{p}, where pp is a large prime that is larger than 2λ2^{\lambda}. The pair of phenotype attribute and value is first hashed and then encrypted into a group value ci,j,kc_{i,j,k}. Afterwards, a symmetric encryption algorithm SE (e.g., AES) is invoked with the input of a secret key KiK_{i} and the pair of phenotype attribute P.namek\textit{P.name}_{k} and value P.valj,k\textit{P.val}_{j,k}. Finally, the algorithm outputs the ciphertext c^i,j,k\hat{c}_{i,j,k} consisting of a random group element gri,j,kg^{r_{i,j,k}}, an encoded group element ci,j,kc_{i,j,k}, and a symmetrically encrypted ciphertext c¯i,j,k\bar{c}_{i,j,k}.

Algorithm 3 shows how hospital ii encrypts a single phenotype kk of patient jj (i.e., P.namek,P.valj,k\textit{P.name}_{k},\textit{P.val}_{j,k}). To encrypt all patients’ phenotype data in a hospital, we iteratively encrypt each patient’s phenotypes. In detail, for each patient jj in hospital ii, the hospital first reads its phenotype data θi,j\theta_{i,j}, and then invokes the SinglePhenotypeEncrypt to encrypt each phenotype of the patient. Once all the phenotypes of the patient jj is encrypted, the result denoted as 𝐂i,j\mathbf{C}_{i,j} is stored in the dataset DictiP\textit{Dict}_{i}^{P} with the patient pseudonym pi,jp_{i,j}. This process is repeated for each patient, detailed in Algorithm 4.

Algorithm 3 SinglePhenotypeEncrypt
1:(P.namek\textit{P.name}_{k}, P.valj,k\textit{P.val}_{j,k})θi,j\in\theta_{i,j}, SKiSK_{i}, KiK_{i}
2:c^i,j,k\hat{c}_{i,j,k}
3:ri,j,k$pr_{i,j,k}\xleftarrow{\$}\mathbb{Z}_{p}
4:αi,j,kH(P.namek,P.valj,k)\alpha_{i,j,k}\leftarrow H(\textit{P.name}_{k},\textit{P.val}_{j,k})
5:ci,j,kgri,j,k/(SKiαi,j,k)c_{i,j,k}\leftarrow g^{-r_{i,j,k}/(SK_{i}-\alpha_{i,j,k})}
6:c¯i,j,kSE(Ki,P.namekP.valj,k)\bar{c}_{i,j,k}\leftarrow SE(K_{i},\textit{P.name}_{k}\circ\textit{P.val}_{j,k})
7:c^i,j,k(gri,j,k,ci,j,k,c¯i,j,k)\hat{c}_{i,j,k}\leftarrow(g^{r_{i,j,k}},c_{i,j,k},\bar{c}_{i,j,k})
8:return c^i,j,k\hat{c}_{i,j,k}
Algorithm 4 PhenotypeEnc
1:Ψi\Psi_{i}, SKiSK_{i}, KiK_{i}
2:DictiP\textit{Dict}_{i}^{P}
3:initialize a dictionary DictiP\textit{Dict}_{i}^{P}
4:for all θi,jΨi\theta_{i,j}\in\Psi_{i} do
5:     𝐂i,jϕ\mathbf{C}_{i,j}\leftarrow\phi
6:     for all  (P.namek,P.valj,k)θi,j(\textit{P.name}_{k},\textit{P.val}_{j,k})\in\theta_{i,j} do
7:         c^i,j,kSinglePhenotypeEncrypt(P.namek,P.valj,k,SKi,Ki)\hat{c}_{i,j,k}\leftarrow\textit{SinglePhenotypeEncrypt}(\textit{P.name}_{k},\textit{P.val}_{j,k},SK_{i},K_{i})
8:         𝐂i,j𝐂i,jc^i,j,k\mathbf{C}_{i,j}\leftarrow\mathbf{C}_{i,j}\cup\hat{c}_{i,j,k}      
9:     pi,jθi,jp_{i,j}\leftarrow\theta_{i,j}
10:     DictiP[pi,j]𝐂i,j\textit{Dict}_{i}^{P}[p_{i,j}]\leftarrow\mathbf{C}_{i,j}
11:return DictiP\textit{Dict}_{i}^{P}

5.6. Client Authorization

In the proposed scheme, a client needs to get authorization from a hospital before it can access to (operate on) a hospital’s data. Meanwhile, the hospital is capable to authorise the client in fine-granularity and the CSP should not learn the data that the hospital authorises the client to access. In this section, we first present a simple client authorization mechanism, which requires a client to generate a query for each hospital that authorizes the client to access its data. In addition, this mechanism allows a client to access a hospital’s data without any limitations once it is authorized. Then, to achieve flexible authorization and let the client generate a single query that can be used to operate on all authorized hospitals’ datasets, we further present an improved mechanism. The improved mechanism supports per-query based fine-grained authorization and it allows a single query to access multiple hospitals’ data.

5.6.1. Simple Authorization

In the simple authorization mechanism, if a client gets authorization from a hospital, the client can access all the data of the corresponding hospital for all future queries. To present this authorization mechanism, we assume that there exists a client with private key PKc=gSKcPK_{c}=g^{SK_{c}} who wants to access (operate on) data from hospital ii. The client first sends an authorization request to hospital ii with its private key PKcPK_{c}. If the hospital approves the request, the hospital signs the private key σi,c=PKcSKi\sigma_{i,c}=PK_{c}^{SK_{i}} and sends it back to the client. The client recovers PKi=σi,c1/SKcPK_{i}=\sigma_{i,c}^{1/{SK_{c}}}. Upon obtaining PKiPK_{i}, the client can generate a legitimate query. The details of this authorization mechanism are also shown in Algorithm 5.

Algorithm 5 SimpleAuthorization
1:PKcPK_{c}, SKiSK_{i}, SKcSK_{c}
2:PKiPK_{i}
3:Client: send PKc=gSKcPK_{c}=g^{SK_{c}} to hospital ii
4:Hospital ii: compute σi,cPKcSKi\sigma_{i,c}\leftarrow PK_{c}^{SK_{i}} and send σi,c\sigma_{i,c} to the client
5:Client: compute PKiσi1/SKcPK_{i}\leftarrow\sigma_{i}^{1/{SK_{c}}}

5.6.2. Improved Authorization

In the simple authorization mechanism, a client can access hospital’s data indefinitely once the hospital authorizes the client. Also, the hospital is unable to control the granularity of the authorization. In other words, the hospital authorizes either all its data to a client or none of them. Furthermore, simple authorization mechanism results in a query that cannot be executed across multiple hospitals’ data (i.e., the client needs to generate separate queries for each hospital’s dataset).

In order to overcome these concerns, we propose an improved authorization mechanism. The new authorization mechanism supports fine-grained authorization on a per-query basis. Moreover, using the transform key function, a single query sent to the CSP can be executed across hospitals. As a result, the complexity of a client’s query generation is reduced from O(τ\tau) to O(1)O(1), where τ\tau is the number of accessible hospitals. The steps for the improved authorization mechanism are as follows. A client first generates an authorization request and sends it to a hospital. If the hospital approves the request, the hospital first generates a transform key for the client (per query) and sends it to the CSP, such that the CSP can apply the transform key to transform the ciphertext into the format that supports the client’s query. After the transformation, the query can be executed by the CSP over the phenotype data of each hospital that authorizes the client. The construction of the transform key needs to consider the privacy of the transform key, transformation process, and the query. With these privacy requirements in mind, we set the transform key at the granularity of phenotypes and compute it as tki,c,k=grc,k(SKiSKc)SKiαc,ktk_{i,c,k}=g^{\frac{-r_{c,k}(SK_{i}-SK_{c})}{SK_{i}-\alpha_{c,k}}}, where ii denotes the target hospital, cc represents the associated data from a client, and kk represents the phenotype attribute in phenotype attribute set 𝐀\mathbf{A}. The details of the transform key construction are as follows.

  1. (1)

    A client randomly selects rc,kpr_{c,k}\in\mathbb{Z}_{p}  (1kn1\leq k\leq n) and sends it to a hospital with PKcPK_{c} and αc,k\alpha_{c,k} (1kn1\leq k\leq n) to request corresponding data access.

  2. (2)

    Upon receiving the query request, the hospital decides which phenotype attributes can be accessed and generates tki,c,ktk_{i,c,k} (1kn1\leq k\leq n) for each approved phenotype attribute. Specifically, if the hospital approves the requested αc,k\alpha_{c,k}, the corresponding tki,c,ktk_{i,c,k} is properly constructed based on αc,k\alpha_{c,k} and rc,kr_{c,k}. Otherwise, a random group element is selected, such that the access structure can be protected from the CSP.

  3. (3)

    For each phenotype attribute, a transform key tki,c,ktk_{i,c,k} (1kn1\leq k\leq n)) is generated. The hospital sends all of them to the CSP and replies to the client with a success message.

Given the transform key, the CSP can transform the ciphertext into the format that supports corresponding client’s query. Using this technique, the CSP can process all the ciphertext before a query is launched. The details of the transformation are discussed in Section 5.8.

5.7. Query Generation

The query is generated based on client’s input that specifies phenotype data of interest. Then, the query is executed over encrypted phenotype data. Algorithm 6 shows query generation based on the input of a single phenotype. The client first applies the hash function HH on the input phenotype attribute and value, obtaining αc,k\alpha_{c,k} (k𝐀k\in\mathbf{A}) as the output. Then, the client computes a group element grc,kg^{r_{c,k}}, where rc,kr_{c,k} is selected during the generation of authorization request. grc,kg^{r_{c,k}} is included in the query. Afterwards, the client computes the other part of the query: grc,kαc,kPKcrc,kg^{-r_{c,k}\alpha_{c,k}}PK_{c}^{r_{c,k}}. To support multiple phenotypes in a query, the client can iteratively invoke the SinglePhenotypeQueryGen algorithm. In detail, given a set Δ\Delta of phenotypes, for each phenotype inside Δ\Delta, the client calls SinglePhenotypeQueryGen to encrypt it. The result qc,kq_{c,k} (1kn1\leq k\leq n) is stored in a vector 𝐯c\mathbf{v}_{c}, which has the same dimension and the same order of phenotype attributes as 𝐀\mathbf{A}. For the phenotypes whose phenotype attributes are in 𝐀\mathbf{A} but not in Δ\Delta, the corresponding value is set to 0 in vector 𝐯c\mathbf{v}_{c}. qcq_{c} is a set and it includes the vector 𝐯c\mathbf{v}_{c}. Finally, in addition to 𝐯c\mathbf{v}_{c}, the size NN of case and control groups is also included in qcq_{c} before qcq_{c} is sent to the CSP.

Algorithm 7 shows how to extends the input from a single phenotype to multiple phenotypes. Here, for each phenotype inside the input phenotype data set Δ\Delta, the client invokes Algorithm 6. Once all the input data are processed, the algorithm outputs the final query qcq_{c}. Note that all the positions, where phenotype attributes are not included in the input phenotype data are filled with zeros. For example, a client can input a set of pairs of phenotype attribute and value as {(breast cancer, 1), (lung cancer, 0), (blue eye color, 1)} to generate a query. Applying above algorithms, each target phenotype attribute is corresponding to a component qc,iq_{c,i} in the query, the remaining that is not included in the input phenotype attribute set is set to 0. The final query also includes the size NN of case and control groups.

Algorithm 6 SinglePhenotypeQueryGen
1:PKcPK_{c}, (P.namek,P.valc,k\textit{P.name}_{k},\textit{P.val}_{c,k}), rc,kr_{c,k}
2:qc,iq_{c,i}
3:αc,kH(P.namek,P.valc,k)\alpha_{c,k}\leftarrow H(\textit{P.name}_{k},\textit{P.val}_{c,k})
4:qc,k(grc,k,grc,kαc,kPKcrc,k)q_{c,k}\leftarrow(g^{r_{c,k}},g^{-r_{c,k}\alpha_{c,k}}PK_{c}^{r_{c,k}})
5:return qc,kq_{c,k}
Algorithm 7 QueryGen
1:PKcPK_{c}, Δ\Delta
2:qcq_{c}
3:qcϕq_{c}\leftarrow\phi
4:for all (P.namek,P.valc,k)Δ(\textit{P.name}_{k},\textit{P.val}_{c,k})\in\Delta do
5:     qc,kSinglePhenotypeQueryGen(PKc,P.namek,P.valc,k)q_{c,k}\leftarrow\textit{SinglePhenotypeQueryGen}(PK_{c},\textit{P.name}_{k},\textit{P.val}_{c,k})
6:     qcqcqc,kq_{c}\leftarrow q_{c}\cup q_{c,k}
7:fill qcq_{c} with 0 in locations of phenotype attributes in A\ΔA\backslash\Delta
8:qcNq_{c}\leftarrow N
9:return qcq_{c}

5.8. Identification of Case and Control groups

Without loss of generality, we present the identification process at the CSP over a single hospital ii’s dataset. Multiple hospitals’ datasets are processed separately and in parallel. We describe the identification process in two steps. First, we show the search operation over a single phenotype. Then, we extend the algorithm to multiple phenotypes. The details of search over a single phenotype are shown in Algorithm 8. Specifically, given the ciphertext c^i,j,k\hat{c}_{i,j,k} of phenotype kk of patient jj in hospital ii, we first extract the components gri,j,kg^{r_{i,j,k}}, ci,j,kc_{i,j,k}, c¯i,j,k\bar{c}_{i,j,k} from c^i,j,k\hat{c}_{i,j,k}, denoted as c^i,j,k,0\hat{c}_{i,j,k,0}, c^i,j,k,1\hat{c}_{i,j,k,1}, and c^i,j,k,2\hat{c}_{i,j,k,2}. Based on the input of the query qc,kq_{c,k} from client cc querying for phenotype kk, we extract grc,k,grc,kαc,kPKcrc,kg^{r_{c,k}},g^{-r_{c,k}\alpha_{c,k}}PK_{c}^{r_{c,k}} from qc,kq_{c,k}, denoted as qc,k,0q_{c,k,0} and qc,k,1q_{c,k,1}. With the three pairs (c^i,j,k,0\hat{c}_{i,j,k,0}, qc,k,0q_{c,k,0}), (c^j,i,k,1\hat{c}_{j,i,k,1}, qc,k,1q_{c,k,1}), and (c^i,j,k,0\hat{c}_{i,j,k,0} tki,c,ktk_{i,c,k}), we compute the bilinear mapping ee over each pair and multiply all the results one by one. If the computed result is 11, the current phenotype matches the query, otherwise, it does not match. The correctness of the algorithm is shown in Eq. 1.

Algorithm 8 SinglePhenotypeSearch
1:c^i,j,k\hat{c}_{i,j,k}, qc,kq_{c,k}, tki,c,ktk_{i,c,k}
2:11 or 0
3:(gri,j,k,ci,j,k,c¯i,j,k)c^i,j,k(g^{r_{i,j,k}},c_{i,j,k},\bar{c}_{i,j,k})\leftarrow\hat{c}_{i,j,k}
4:(c^i,j,k,0,c^i,j,k,1,c^i,j,k,2)(gri,j,k,ci,j,k,c¯i,j,k)(\hat{c}_{i,j,k,0},\hat{c}_{i,j,k,1},\hat{c}_{i,j,k,2})\leftarrow(g^{r_{i,j,k}},c_{i,j,k},\bar{c}_{i,j,k})
5:(grc,k,grc,kαc,kPKcrc,k)qc,k(g^{r_{c,k}},g^{-r_{c,k}\alpha_{c,k}}PK_{c}^{r_{c,k}})\leftarrow q_{c,k}
6:(qc,k,0,qc,k,1)(grc,k,grc,kαc,kPKcrc,k)(q_{c,k,0},q_{c,k,1})\leftarrow(g^{r_{c,k}},g^{-r_{c,k}\alpha_{c,k}}PK_{c}^{r_{c,k}})
7:if e(c^i,j,k,0,tki,c,k)e(c^i,j,k,0,qc,k,0)e(c^i,j,k,1,qc,k,1)=1e(\hat{c}_{i,j,k,0},tk_{i,c,k})e(\hat{c}_{i,j,k,0},q_{c,k,0})e(\hat{c}_{i,j,k,1},q_{c,k,1})=1 then
8:    return 1
9:else
10:    return 0
(1) e(c^i,i,k,0,tki,c,k)e(c^i,j,k,0,qc,k,0)e(c^i,j,k,1,qc,k,1)=e(g,g)ri,j,krc,k(SKiSKc)(SKiαc,k)e(g,g)ri,j,krc,k(SKcαc,k)SKiαi,j,ke(g,g)rc,kri,j,k{=1αi,j,k=αc,k1otherwise\begin{split}&e(\hat{c}_{i,i,k,0},tk_{i,c,k})e(\hat{c}_{i,j,k,0},q_{c,k,0})e(\hat{c}_{i,j,k,1},q_{c,k,1})\\ &=e(g,g)^{\frac{-r_{i,j,k}r_{c,k}(SK_{i}-SK_{c})}{(SK_{i}-\alpha_{c,k})}}e(g,g)^{\frac{-r_{i,j,k}r_{c,k}(SK_{c}-\alpha_{c,k})}{SK_{i}-\alpha_{i,j,k}}}e(g,g)^{r_{c,k}r_{i,j,k}}\\ &\begin{cases}=1&\alpha_{i,j,k}=\alpha_{c,k}\\ \neq 1&otherwise\\ \end{cases}\end{split}

To support a query that contains multiple phenotypes, we extend Algorithm 8 to Algorithm 9. In detail, we first initialize two lists Υcase\Upsilon_{\textit{case}} and Υcontrol\Upsilon_{\textit{control}}. Then, for each patient with pseudonym pi,jp_{i,j} in hospital ii’s encrypted phenotype dataset DictiP\textit{Dict}_{i}^{P}, the corresponding encrypted phenotype data Ci,jC_{i,j} is read. Given the encrypted phenotype data Ci,jC_{i,j}, query qcq_{c}, and transform key (tki,c,1tk_{i,c,1} , \cdots, tki,c,ntk_{i,c,n}), the per-hospital components of the given data are extracted and passed as the input to the SinglePhenotypeSearch algorithm. For example, both the first element of phenotype data query and transform key are extracted, and then input into the SinglePhenotypeSearch algorithm. If a patient contains all the phenotype data inside a query, the patient’s pseudonym pi,jp_{i,j} is added to the case group Υcase\Upsilon_{\textit{case}}. If a patient does not match query phenotypes, its pseudonym is added to the control group Υcontrol\Upsilon_{\textit{control}}. This process is executed until the size of both case and control groups reaches NN or until all the data is processed.

Algorithm 9 Search
1:DictiP\textit{Dict}_{i}^{P}, qcq_{c}, (tki,c,1tk_{i,c,1},\cdots, tki,c,ntk_{i,c,n})
2:Υcase\Upsilon_{\textit{case}} and Υcontrol\Upsilon_{\textit{control}}
3:initialize two lists Υcase\Upsilon_{\textit{case}} and Υcontrol\Upsilon_{\textit{control}}
4:set the number of non-zero elements in qcq_{c} as thrthr
5:NqcN\leftarrow q_{c}
6:for all pi,jDictiPp_{i,j}\in\textit{Dict}_{i}^{P} do
7:    𝐂i,jDictiP[pi,j]\mathbf{C}_{i,j}\leftarrow\textit{Dict}_{i}^{P}[p_{i,j}]
8:    for all c^i,j,k𝐂i,j\hat{c}_{i,j,k}\in\mathbf{C}_{i,j} do
9:         score \leftarrow score + SinglePhenotypeSearch(c^i,j,k\hat{c}_{i,j,k},qc,kq_{c,k}, tki,c,ktk_{i,c,k})     
10:    if score = thrthr and |Υcase|<N|\Upsilon_{case}|<N then
11:         Υcase\Upsilon_{\textit{case}}.append(pi,jp_{i,j})
12:    else if score = 0 and |Υcontrol|<N|\Upsilon_{control}|<N then
13:         Υcontrol\Upsilon_{\textit{control}}.append(pi,jp_{i,j})     
14:    if |Υcase|=N|\Upsilon_{\textit{case}}|=N and |Υcontrol|=N|\Upsilon_{\textit{control}}|=N  then
15:         break     
16:return Υcase\Upsilon_{\textit{case}} and Υcontrol\Upsilon_{\textit{control}}

5.9. Computation over Identified Target Patients

Here, we briefly discuss how our scheme can compute GWAS (or other statistics) over identified patients in centralized and decentralized approaches. In a centralized setting, the patient genomic data is also stored at the CSP, and the genomic data of each patient can be encrypted using multi-key fully homomorphic encryption mechanism (e.g., (chen2019efficient, )) for storage at the CSP. Then, after identifying the case and control groups (as in Section 5.8), the CSP can directly execute the GWAS algorithm over the identified patient data using the homomorphic properties of multi-key fully homomorphic encryption mechanism. In a decentralized setting, the patient genomic data is not stored at the CSP. In this case, the CSP can return the identified patients to the client. Based on the identified patients, the client can send requests to all the involved hospitals to get access to their data for computing GWAS in a distributed way (raisaro2018m, ; froelicher2021truly, ).

5.10. Managing Dynamic Phenotype Data

Here, we show how the proposed scheme supports efficient update of patient phenotype data stored at the CSP. Assume the hospital ii wants to update phenotype kk of patient jj. Given the patient pseudonym pi,jp_{i,j} and the phenotype (P.namek,P.valk\textit{P.name}_{k},\textit{P.val}_{k}), the phenotype encryption algorithm (in Section 5.5) is called to encrypt the phenotype. Once the encryption is completed, the vector of the update query is constructed by inserting the ciphertext at the position where P.namek\textit{P.name}_{k} is located in 𝐀\mathbf{A} and setting the remaining values to 0. Additionally, the command “update” is added to the query. The update query is sent to the CSP. Upon receiving the update query, the CSP first identifies the entry of the patient pi,jp_{i,j}, then identifies the location corresponding to the non-zero element inside the vector of the update query, and finally replaces the old cipher with the new cipher from the update query. To insert a new phenotype or to delete an existing phenotype from a patient record, the proposed update algorithm can also be used by directly appending or removing a record from stored ciphertext.

6. Privacy Analysis

In this section, we analyze the privacy of phenotype data. We first provide a high level discussion on how the proposed scheme achieves robustness in the presence of attacks presented in Section 4.4. Then we present formal privacy definition and proof.

6.1. Privacy Against Ciphertext Analysis

Phenotype data is encrypted and stored at the CSP. This allows the CSP to analyze the stored ciphertext. The encrypted phenotype data can be split into two parts (as in Section 5.5.1). The first part of the encrypted phenotype data is constructed in two steps. Hospital first computes the hash of the phenotype attribute and value. Then, hospital randomly selects a number to mask above hash result and raised the result to the power of a group element. The privacy of this part relies on the hardness of discrete logarithm problem, one-wayness of the hash function, and randomness of the selected number. The second part of the encrypted phenotype data results from directly encrypting phenotype attribute and value. Hospital uses its secret key and invokes symmetric encryption algorithm (e.g., AES) to encrypt the concatenation of phenotype attribute and value. The privacy of the second part relies on the robustness of the utilized symmetric encryption algorithm (e.g., AES). From the above description, we can conclude that both parts of the encrypted phenotype are semantically secure. Thus, the CSP is unable to learn significant information from the ciphertext analysis.

6.2. Privacy Against Query Analysis

The input of the query includes a set of phenotypes. Each phenotype includes a pair of phenotype attribute and value, which is first hashed and the hash result is lifted as a power of a group element (as in Section 5.7). Additionally, a random mask is selected to hide this result, which enables the encrypted query to be semantically secure. Due to the semantic security of the query, the CSP is unable to learn the query content from the query analysis.

6.3. Privacy Against Operation Analysis

Since the ciphertext of phenotype genotype data is stored at the CSP, the CSP is responsible for conducting search over the ciphertext. For the search operation, the CSP applies the query to search over the ciphertext of phenotype data. If a phenotype is included in the query, the corresponding search result is 11, otherwise, it is 0. Through the execution of the search process, the CSP can learn the number of matching phenotypes of each patient. However, for each matching phenotype, its value can either be 0 or 1, and the CSP cannot distinguish between the two. Thus, the CSP cannot learn any information about the query and ciphertext from the search operation.

6.4. Robustness Against Unauthorized Access

Access control is designed through the transform key (as discussed in Section 5.6). A client selects random numbers rc,kr_{c,k} (1kn1\leq k\leq n) and sends them to the target hospital. The hospital generates the transform key (tki,j,k=grc,k(SKic)SKiαc,ktk_{i,j,k}=g^{\frac{-r_{c,k}(SK_{i}-c)}{SK_{i}-\alpha_{c,k}}}) by lifting these numbers into the power of a group element. Due to the randomness of rc,kr_{c,k} and the hardness of the discrete logarithm problem, the CSP is unable to extract any information from the transform key. Analyzing the search operation, the output of using incorrect transform key is 0, which does not disclose any information to the CSP, as described in Section 6.3. Therefore, the proposed authorization scheme is robust against the unauthorized access.

6.5. Robustness Against Colluding Hospitals

Each hospital independently encrypts its phenotype data and genotype data with its own secret key. Even if several hospitals collude with each other, they cannot get any advantage to infer another hospital’s data.

6.6. Robustness against Collusion between a Hospital and CSP

If one or more hospitals collude with the CSP, the CSP cannot obtain any advantage to infer the remaining hospitals’ data since each hospital’s data is encrypted independently. However, as we clarified in Section 4.4, we do not consider following case. For instance, if the search (at the CSP) includes two hospitals and if one of these hospital collude with the CSP, the CSP can learn which patients of the other (victim) hospital has the considered phenotype as a result of the search operation (but not the genomic data of the identified patients). To provide the common search functionality, this attack is unavoidable, and hence we do not consider it in this work.

6.7. Privacy Analysis

In the following, we provide a formal privacy analysis of the proposed scheme. Following previous works (chen2019efficient, ), the allowed leakage includes (i) size pattern and (ii) access pattern. The size pattern discloses the size of the ciphertext, while the access pattern reveals the access frequency of matching patient records. The allowed leakage is not considered violation of our privacy goal. The privacy of the proposed scheme is based on the hardness of discrete logarithm problem, the randomness of selected random mask, and the robustness of applied symmetric encryption (e.g., AES).

The privacy of the proposed scheme can be divided into two elements: phenotype data privacy, and query privacy. The privacy of phenotype data can be further divided into two parts. One part is encrypted by using symmetric encryption algorithm (the third element in a ciphertext, detailed in Section 5.5.1) and the other part is not (the first and second elements in a ciphertext, detailed in Section 5.5.1). Due to the robustness of symmetric encryption algorithm (e.g., AES), the part with the symmetric encryption is also semantically secure. The other part is first protected by a hash function and then masked with a random value. Both the hash result and random mask are put into the power of a group element. Due to the random mask and the hardness of discrete logarithm problem, the ciphertext is semantically secure in the presence of chosen plaintext attack. The query privacy is achieved similarly, relying on the random mask and discrete logarithm problem.

Formally, we define the privacy experiments as follows. Let Π\Pi be the scheme, the advantage of the adversary is defined as ADV𝒜Π(λ)=Pr(b=b)1/2\text{ADV}_{\mathcal{A}}^{\Pi}(\lambda)=Pr(b^{*}=b)-1/2, where bb and bb^{*} are defined below. In the following, we detail the game.

Init: The adversary 𝒜\mathcal{A} selects two datasets DB0DB_{0} and DB1DB_{1} with same size and sends them to the challenger.

Setup: With the input of security parameter λ\lambda, the challenger runs Setup to initialize the parameters. Then, the challenger calls KeyGen to generate the keys.

Phase 1: 𝒜\mathcal{A} is allowed to ask the following request:

Phenotype data encryption request::

𝒜\mathcal{A} is allowed to send a dataset with phenotype data to ask for encryption. The challenger calls Encrypt algorithm to encrypt the dataset and sends the result back to 𝒜\mathcal{A}.

Challenge: The challenger randomly selects bb from {0,1}\{0,1\}, encrypts the dataset DBbDB_{b}, and sends it to the adversary 𝒜\mathcal{A}.

Phase 2: 𝒜\mathcal{A} is allowed to send requests as in Phase 1. Additionally, 𝒜\mathcal{A} is allowed to send a query request. The challenger only authorizes a query containing the phenotype attributes, where two datasets have the same value. Then, it generates a transform key for 𝒜\mathcal{A}, where the mask (rc,kr_{c,k}) in transform key is randomly selected by the challenger (see details in Section 5.6 ).

Guess: 𝒜\mathcal{A} outputs bb^{*} as the guess for bb.

We say the scheme Π\Pi is privacy-preserving if the advantage of the adversary is negligible, i.e, ADV𝒜Πnegl(λ)ADV_{\mathcal{A}}^{\Pi}\leq negl(\lambda), where negl(λ)negl(\lambda) is a negligible function in λ\lambda.

From above defined privacy game, the adversary 𝒜\mathcal{A} is only allowed to learn the information from Phase 1 and Phase 2. The difference of Phase 2 from Phase 1 is that 𝒜\mathcal{A} holds the challenged ciphertext and is allowed to ask a query request. Thus, we only need to prove that what the adversary 𝒜\mathcal{A} can learn from ciphertext request and query request is negligible as follows.

Phenotype data encryption request::

𝒜\mathcal{A} submits the dataset DBDB of phenotype data to ask for encryption from hospital ii.

If the PhenotypeEnc algorithm is semantically secure, 𝒜\mathcal{A} is unable to distinguish ciphertext from a random string. The ciphertext of each pair of phenotype attribute and value includes three components. The first component gri,j,kg^{r_{i,j,k}} is randomly selected from p\mathbb{Z}_{p}, which does not reveal any information. The second component ci,j,kc_{i,j,k} is gri,j,k/(SKiαi,j,k)g^{-r_{i,j,k}/(SK_{i}-\alpha_{i,j,k})}. Due to the hardness of discrete logarithm problem, 𝒜\mathcal{A} is unable to extract the power of a group element. Similarly, 𝒜\mathcal{A} is unable to distinguish the second component from a random element in 𝔾\mathbb{G}. The third component is encrypted by using a symmetric encryption algorithm (e.g., AES), which is semantically secure.

Therefore, the ciphertext obtained through Encrypt algorithm is semantically secure.

Query request::

First, the transform key is indistinguishable from a random element of group GG. Second, for each pair of phenotype attribute and value, the query is (grc,k(g^{r_{c,k}}, grc,kαc,kPKcrc,k)g^{-r_{c,k}\alpha_{c,k}}PK_{c}^{r_{c,k}}), where PKcPK_{c} is an element from G{G}. Based on the hardness of discrete logarithm problem and the randomness of rc,kr_{c,k}, the query is indistinguishable from two random elements from G{G}. Third, given the query and transform key, the adversary 𝒜\mathcal{A} is capable to run search operation over the ciphertext of phenotype data. Furthermore, 𝒜\mathcal{A} is also capable to run analysis algorithms on ciphertext of genotype data. However, due to the constraint of issuing client authorization, two datasets should output the same search result. Thus, 𝒜\mathcal{A} cannot learn any significant information via executing search operation.

Based on above analysis, we can conclude ADV𝒜Π(λ)ADV^{\Pi}_{\mathcal{A}}(\lambda) is negligible and the proposed scheme is privacy-preserving.

7. Evaluation

In this section, we evaluate the performance of the proposed scheme. We run the experiments on a commodity machine with i7i7 CPU and 16GB RAM. The proposed phenotype encryption algorithm is implemented by Charm (charm13, ) written in Python while the SNP encryption algorithm is implemented by HEAAN (Heaan, ) written in C++. Each experiment is run 10 times; we report the average results.

7.1. Data Model

We use the rsnps tool (rsnps, ) to obtain all the raw patient files from the publicly available OpenSNP dataset (OpenSNP, ). Then, we converted the raw patient files into the VCF format using an open source software named personal-genome-analysis (oga, ). Eventually, we ended up with 1052 valid VCF files. For the phenotype data, we also used the OpenSNP dataset. In total, we collected 1052 records and extracted 1052 phenotype attributes.

7.2. Experimental Results

In this section, we first show the efficiency and scalability of the phenotype and genotype data encryption algorithms. After that, we present the scalability and efficiency of client authorization and query generation.

7.2.1. Phenotype Data Encryption

We adopt symmetric pairing group SS512 to construct the bilinear mapping and AES to implement the symmetric encryption. Accordingly, the time required to encrypt the phenotype data can be divided into two constituents. The first constituent is due to the pairing group operation. The second constituent is due to using AES to encrypt phenotype attribute and value. The secret key of AES is set to 256256 bits. Table 4 shows the performance of the phenotype data encryption for different number of phenotypes. We observed that with the linear increase in the number of patients, the time cost of phenotype encryption for both AES and non-AES (pairing based encryption) constituents increases linearly. Similarly, when the number of phenotypes increases, the required encryption time also increases linearly.

Table 4. Time cost of phenotype data encryption for different number of patients and phenotypes
# patients # phenotype Non-AES (s) AES (s)
1052 1052 8137.5 5
1052 526 3954.8 2.7
1052 263 2068.9 1.38
526 1052 3948 2.7
263 1052 2042.1 1.38
Table 5. Time cost of authorization request generation for different number of requested phenotypes.
# requested phenotypes time (s)
1052 4.47
526 2.28
263 1.14

7.2.2. Client Authorization

The process of client authorization can be divided into two phases. In the first phase, the client generates an authorization request while in the second phase, the hospital generates the transform key. Table 5 shows the efficiency of authorization request generation for different number of requested phenotypes. We observe that the required time of authorization request generation increases linearly upon increasing the number of requested phenotype attributes. Table 6 shows the performance of the transform key generation for different number of phenotype attributes. Here, we observe that the time required for transform key generation increases linearly as a function of the number of phenotype attributes.

Table 6. Time cost of transform key generation for different number of authorized phenotypes
# authorized phenotypes time (s)
1052 5.87
526 3.2
263 1.62
Table 7. Time cost of query generation for different number of input phenotypes
# input phenotypes time (s)
1052 4.8
526 2.39
263 1.19

7.2.3. Query Generation

The performance of the query generation algorithm is affected by the number of input phenotypes. The experimental results are shown in Table 7. From the table, we observe that with the linear increase in the number of input phenotypes, the time required for the query generation also increases linearly.

7.2.4. Phenotype Data Search

The search process involves a number of patient records to be processed to construct the case and control groups. To access each hospital’s data, the query needs to be transformed using the transform key. In addition, the desired size of case and control groups can be a factor to stop the search process earlier once the required number of individuals are identified in case and control groups. Table 8 shows the effect of the number of hospitals, number of input phenotypes, and the size of case and control groups on the efficiency. We observed that if the number of patients is fixed, the number of hospitals almost does not affect the efficiency, while the performance is sensitive to the size of case and control groups and the number of input phenotypes. When the size of case/control groups are set to 1010 and 5050, the search time is reduced to 0.290.29s and 1.781.78s, respectively. The reason is that once the required number of patients are identified for the case and control groups, the search process is terminated. We also evaluate the performance for reduced number of input phenotypes. The observed search times are 32.232.2s, 166.7166.7s, and 327.1327.1s for 1010, 5050, and 100100 phenotypes, respectively. From these results, we can say that the search time increases linearly as the number of input phenotypes grows.

Table 8. Time cost of phenotype data search for the proposed algorithm with a total of 1052 patients and each patient having 1052 phenotypes
# hospitals # queried phenotypes # case/control groups time (s)
1 10 100 32.2
10 10 100 33.1
100 10 100 34.7
1 10 10 0.29
1 10 50 1.78
1 50 100 166.7
1 100 100 327.1

To show the efficiency of the proposed algorithm for phenotype data search, we also designed a following fully homomorphic encryption (FHE)-based version of it for comparison. In detail, the alternative FHE-based approach includes four steps. First, all the phenotype data is encrypted by using CKKS (cheon2017homomorphic, ). Second, the client sends a query to the CSP in order to determine the case/control groups. Third, the CSP sends the computed result to different hospitals. Fourth, hospitals decrypt the result in parallel and send the result (identified case/control groups) to the client. Without considering the time cost of communication, Table 9 shows the required time to complete the search operation using this FHE-based scheme for different number of patients and phenotypes. Comparing Table 8 with Table 9, we observe that the proposed algorithm is more than 20 times faster than the FHE-based algorithm when the number of queried phenotypes is at most 10. Notably, when the size of case/control groups is smaller than 50, the proposed algorithm is more than 300 times faster. Furthermore, in Table 10 we show the comparison between the FHE-based solution and proposed algorithm for different number of patients. From the table, we observe that (i) the proposed scheme consistently exhibits similar performance advantage over the FHE-based scheme and (ii) the time cost of both algorithms increases linearly with the number of patients.

Table 9. Time cost of phenotype data search for the CKKS algorithm with a total of 1052 patients and each patient having 1052 phenotypes
# hospitals # queried phenotypes # case/control group each hospital decryption time (s) total
1 10 100 6.2 684.6
10 10 100 5.9 683.9
100 10 100 0.6 678.6
1 10 10 1.5 678.5
1 10 50 3.1 681.1
1 50 100 6.3 684.7
1 100 100 6.5 684.9
Table 10. Time cost of phenotype data search for the CKKS and proposed algorithm for different number of patients (each patient having 1052 phenotypes). The size of case/control groups is not limited as input parameter.
# hospital # queried phenotypes # patients total time (s)
CKKS algorithm
1 10 1052 747.5
1 10 526 355.9
Proposed algorithm
1 10 1052 35.9
1 10 526 17.4

8. Discussion

The proposed pairing-based encryption scheme for phenotype data (in Section 5.8) is not limited to efficient identification of case and control groups. The scheme can be extended to support additional functions, such as similar patient search, target record retrieval, and statistical analysis of phenotype data. In the following, we discuss some of these potential extensions.

Similar patient search. In the proposed scheme, a client is allowed to input several phenotypes of interest into a query and send it to the CSP. Upon receiving the query, the CSP searches the stored datasets of several hospitals. Based on the number of matching phenotypes, the CSP can order the search results and return similar patient records. Similar functionality can also be provided in the genome level by encrypting genome data with the proposed pairing-based encryption.

Phenotype data retrieval. In the proposed scheme, we show how to use the Search algorithm (in Section 5.8) to find target pseudonyms based on the query. One can extend this function to support a client to retrieve phenotype data of interest. For instance, a client (e.g., a physician) may be interested to know the phenotypes of patients having lung cancer. Then, the client can generate a query and send it to the CSP to retrieve the phenotype data from patients that are diagnosed with lung cancer. Once the client receives the phenotype data from the CSP, the SinglePhenotypeDecrypt algorithm 10 is called to decrypt the phenotype data.

Algorithm 10 SinglePhenotypeDecrypt
1:c^i,j,k\hat{c}_{i,j,k}, KiK_{i}
2:P.namek\textit{P.name}_{k}, P.valj,k\textit{P.val}_{j,k}
3:(gri,j,k,ci,j,k,c¯i,j,k)c^i,j,k(g^{r_{i,j,k}},c_{i,j,k},\bar{c}_{i,j,k})\leftarrow\hat{c}_{i,j,k}
4:(P.namek,P.valj,k)DE(Ki,c¯i,j,k)(\textit{P.name}_{k},\textit{P.val}_{j,k})\leftarrow DE(K_{i},\bar{c}_{i,j,k})
5:return P.namek,P.valj,k\textit{P.name}_{k},\textit{P.val}_{j,k}

9. Conclusion

In this paper, we have designed a privacy-preserving framework for identification of a target group of patients across multi-tenant data. To achieve this, we have proposed a novel phenotype encryption algorithm. To support search and computation over multi-tenant data by a cloud service provider (CSP), we have introduced a transform key to enable the CSP to transform a single query and execute it over different hospitals’ datasets without privacy violation. To manage the authorization of clients, we have proposed a per-query based authorization mechanism supporting selective phenotype data authorization. Via simulations on real genomic data, we have shown the practicality and efficiency of the proposed scheme. We believe that the proposed scheme will further facilitate the use of genomic data in clinical settings and pave the way for personalized medicine. In future work, we will focus on improving the search efficiency of genomic data and batch queries.

References

  • (1) Heaan (02-10-2019), https://github.com/snucrypto/HEAAN
  • (2) rsnps (02-12-2018), https://github.com/ropensci/rsnps/
  • (3) opensnp (14-10-2018), https://opensnp.org/
  • (4) Akavia, A., Gentry, C., Halevi, S., Leibovich, M.: Setup-free secure search on encrypted data: Faster and post-processing free. Proceedings on Privacy Enhancing Technologies 2019(3), 87–107 (2019)
  • (5) Akinyele, J.A., Garman, C., Miers, I., Pagano, M.W., Rushanan, M., Green, M., Rubin, A.D.: Charm: a framework for rapidly prototyping cryptosystems. Journal of Cryptographic Engineering 3(2), 111–128 (2013). https://doi.org/10.1007/s13389-013-0057-3, http://dx.doi.org/10.1007/s13389-013-0057-3
  • (6) Athey, B., Braxenthaler, M., Haas, M., Y, G.: transmart: An open source and community-driven informatics and data sharing platform for clinical and translational research. AMIA Joint Summits on Translational Science proceedings. AMIA Joint Summits on Translational Science 6-8 (2013)
  • (7) Bogdanov, D., Kamm, L., Laur, S., Pruulmann-Vengerfeldt, P., Talviste, R., Willemson, J.: Privacy-preserving statistical data analysis on federated databases. In: Annual Privacy Forum. pp. 30–55. Springer (2014)
  • (8) Bos, J.W., Lauter, K., Loftus, J., Naehrig, M.: Improved security for a ring-based fully homomorphic encryption scheme. In: IMA International Conference on Cryptography and Coding. pp. 45–64. Springer (2013)
  • (9) Chen, H., Dai, W., Kim, M., Song, Y.: Efficient multi-key homomorphic encryption with packed ciphertexts with application to oblivious neural network inference. In: Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security. pp. 395–412 (2019)
  • (10) Cheon, J.H., Kim, A., Kim, M., Song, Y.: Homomorphic encryption for arithmetic of approximate numbers. In: International Conference on the Theory and Application of Cryptology and Information Security. pp. 409–437. Springer (2017)
  • (11) Froelicher, D., Troncoso-Pastoriza, J.R., Raisaro, J.L., Cuendet, M., Sousa, J.S., Fellay, J., Hubaux, J.P.: Truly privacy-preserving federated analytics for precision medicine with multiparty homomorphic encryption. bioRxiv (2021)
  • (12) for Genomics, T.G.A., Health: A federated ecosystem for sharing genomic, clinical data. Science 352(6291), 1278–1280 (2016). https://doi.org/10.1126/science.aaf6162, https://science.sciencemag.org/content/352/6291/1278
  • (13) Gentry, C., Halevi, S., Smart, N.P.: Homomorphic evaluation of the aes circuit. In: Annual Cryptology Conference. pp. 850–867. Springer (2012)
  • (14) Goud, N.: Top 5 Cloud Security related Data Breaches! (2020), https://www.cybersecurity-insiders.com/top-5-cloud-security-related-data-breaches/
  • (15) Hammerbacher, J.: personal-genome-analysis (27-09-2018), https://github.com/hammer/personal-genome-analysis
  • (16) of Health, U.D., Services, H.: The health insurance portability and accountability act (hipaa) (2021-04-27), https://www.hhs.gov/hipaa/index.html
  • (17) Kamm, L., Bogdanov, D., Laur, S., Vilo, J.: A new way to protect privacy in large-scale genome-wide association studies. Bioinformatics 29(7), 886–893 (2013)
  • (18) Kim, M., Lauter, K.: Private genome analysis through homomorphic encryption. In: BMC medical informatics and decision making. vol. 15, p. S3. Springer (2015)
  • (19) Lu, W.J., Yamada, Y., Sakuma, J.: Privacy-preserving genome-wide association studies on cloud environment using fully homomorphic encryption. In: BMC medical informatics and decision making. vol. 15, p. S1. Springer (2015)
  • (20) Parlament, E.: The EU General Data Protection Regulation (GDPR) (2021-04-27), http://www.eugdpr.org/
  • (21) Raisaro, J.L., Troncoso-Pastoriza, J.R., Misbach, M., Sousa, J.S., Pradervand, S., Missiaglia, E., Michielin, O., Ford, B., Hubaux, J.P.: Medco: Enabling secure and privacy-preserving exploration of distributed clinical and genomic data. IEEE/ACM transactions on computational biology and bioinformatics 16(4), 1328–1341 (2018)
  • (22) Risch, N.J.: Searching for genetic determinants in the new millennium. Nature 405(6788),  847 (2000)
  • (23) Sadat, M.N., Aziz, M.M.A., Mohammed, N., Chen, F., Wang, S., Jiang, X.: Safety: Secure gwas in federated environment through a hybrid solution with intel sgx and homomorphic encryption. arXiv preprint arXiv:1703.02577 (2017)
  • (24) Schneider, T., Tkachenko, O.: Episode: Efficient privacy-preserving similar sequence queries on outsourced genomic databases. ASIACCS (2019)
  • (25) Selby, J.V., Beal, A.C., Frank, L.: The patient-centered outcomes research institute (pcori) national priorities for research and initial research agenda. Jama 307(15), 1583–1584 (2012)
  • (26) Zhu, X., Ayday, E., Vitenberg, R.: A privacy-preserving framework for outsourcing location-based services to the cloud. IEEE Transactions on Dependable and Secure Computing (2019)