Redundancy-aware unsupervised rankings for collections of gene sets
Abstract.
The biological roles of gene sets are used to group them into collections. These collections are often characterized by being high-dimensional, overlapping, and redundant families of sets, thus precluding a straightforward interpretation and study of their content. Bioinformatics looked for solutions to reduce their dimension or increase their intepretability. One possibility lies in aggregating overlapping gene sets to create larger pathways, but the modified biological pathways are hardly biologically justifiable. We propose to use importance scores to rank the pathways in the collections studying the context from a set covering perspective. The proposed Shapley values-based scores consider the distribution of the singletons and the size of the sets in the families; Furthermore, a trick allows us to circumvent the usual exponential complexity of Shapley values’ computation. Finally, we address the challenge of including a redundancy awareness in the obtained rankings where, in our case, sets are redundant if they show prominent intersections.
The rankings can be used to reduce the dimension of collections of gene sets, such that they show lower redundancy and still a high coverage of the genes. We further investigate the impact of our selection on Gene Sets Enrichment Analysis. The proposed method shows a practical utility in bioinformatics to increase the interpretability of the collections of gene sets and a step forward to include redundancy into Shapley values computations.
1. Introduction
One of the main challenges when working with collections of gene sets is the sheer size, and the following low interpretability of the many gene sets belonging to the same collection. We refer to gene sets or pathways as sets of genes deriving from a biological classification of genes concerning chemical or biological functions; these are grouped in variably sized (between hundreds to several thousands of gene sets) collections based on some prior biological or chemical function (liberzon_molecular_2015, ). From this grouping derived scarcely interpretable collections of partly overlapping pathways.
Shapley values (ref:shapley, ) allows allocating resources fairly among cooperative game players. In recent years, they found application in feature selection where the ’players’, i.e., the features, cooperate in creating a model with high accuracy, and Shapley values help discriminate among relevant and non-relevant features for the label prediction. Shapley values derive their success from the flexible and non-demanding definition of the value function, making them an easily applicable tool in various contexts (rozemberczki2022shapley, ; lundberg_unified_2017, ; balestraUSV, ). We introduce unsupervised importance scores based on Shapley values for sets within sheer-sized families of sets to reduce their size. Being a function of the distributions of the elements among sets, the scores are (1) positively correlated with the size of the sets and (2) unaware of intersections among them. Hence, we introduce pruning criteria and get new rankings of sets that show no correlation with the sets’ sizes and low overlap among sets similarly ranked. As a case study, we apply the method to collections of gene sets: our punished Shapley values affect the correlation among sizes of gene sets and their position in the rankings, although not directly meant to solve this issue. The rankings show excellent behavior regarding the redundancy reduction, and the results suggest that the switch to smaller collections of gene sets does not affect the coverage of the genes. Furthermore, the lower dimensional collections still include a similar number of significant pathways when used for gene set enrichment analysis. As an extension of the paper (balestra2023redundancy, ), we propose an analysis of additional collections of gene sets.
2. Ranking sets in families of sets
A cooperative game is a pair where is a finite set of players and is the so-called value function. maps each subset (usually referred as coalition) of players to a real non-negative number under the hypothesis that and . Shapley values represent one possibility of fairly dividing the payoff of the grand coalition among the players such that the amount of resources allocated to each is fair concerning the player’s contribution in any possible coalition within the game. The Shapley value of is the average of its marginal contributions across the possible coalitions , i.e.,
where and . The formula requires computing times the value function; instead, microarray games, as a special case of Sum-Of-Unanimity Games (SOUG), admit a polynomial closed-form solution.
2.1. The computation of Shapley values for microarray games
Consider a family of sets. We denote with the elements belonging to at least one set and . Starting from and , we build a binary matrix where if and otherwise. Transposing the definition given by Moretti et al. (moretti_class_2007, ), for each element , we look at the set of sets in which is present, and the information on is conveyed by the column of . We define the support of as the set . The microarray game is then defined as the cooperative game where
(1) |
Hence, is the ratio of the number of genes’ supports that contains and the number of elements in . As is fixed, is proportional to the number of supports contains; higher scores are achieved by sets covering the full distribution among sets of a high number of elements. Following Sun et al. (sun_game_2020, ) approach, the value function can be easily expressed as a SOUG game such that the computation of Shapley values has polynomial runtime.
Shapley values assign similar scores to redundant players; hence, similar players are ranked in close positions. We integrate into Shapley values a redundancy-awareness concept such that the player ranked at the -th place is the least overlapping possible with the first -ranked players introducing pruning criteria for players highly correlated to the previously ranked ones.

The sets in are arbitrarily large and can overlap, and each contains a variable number of elements. We construct a microarray game based on the binary matrix ; Each row of represents a set , and each column represents the partial ordering relationship of the element belonging to the sets. The Shapley value of is computed following two steps (sun_game_2020, ): (1) from the matrix , we get ; (2) then each Shapley value is computed through the formula:
(2) |
We use the computed Shapley values as importance scores to rank the s: The higher the score of , the more important the set is in the microarray game defined. The importance scores quantify the number of elements contained in , which are also often contained in other sets. is a real number in and we know that from Shapley values’ properties. As mentioned, Shapley values alone are unaware of possible overlaps among players.
2.2. Low redundancy and high coverage
We aim for a redundancy-aware ranking of sets in families of sets where two sets are redundant if they contain many shared elements. We use as redundancy measure the Jaccard index: Given two sets and , their Jaccard index is .
While reducing the redundancy among sets, we want to maintain the coverage of . We hereby define various pruning criteria and compare them w.r.t. coverage and redundancy of the obtained family of set , using as metrics the generalized Jaccard index and the coverage.
Generalized Jaccard index – it is defined as the average Jaccard index among pair of sets in and measures the redundancy in :
(3) |
Coverage of – it is the percentage of elements that are included at least in one set in , i.e.,
(4) |
The ranking given by Shapley values tends to rank bigger sets first. Moreover, bigger sets are more likely to overlap. The punishment criteria will also affect the association between the position in the rankings and the size of the sets.
2.3. Different punishments criteria
We introduce various pruning criteria and provide a detailed comparison of the rankings obtained using collections of gene sets as particular cases (Section 3). We obtain five different rankings of the sets through a greedy ranking process.
Shapley values SV, the original ranking given by Equation (2) that favors larger sets, and it is unaware of overlaps among sets. Because of the tendency of ranking first larger sets, the ranking obtains high coverage of when selecting a small number of sets.
Punished Ordering PO – given , at each step , the score obtained by a not yet ranked set is given by the following recursive formula
At the step , the algorithm ranks the set . The Shapley values are re-computed after each iteration to satisfy the efficiency property where is the Shapley value at the th iteration.
Punished Ordering with Rescaling POR – at each iteration, POR rescales the punishment of PO, i.e., , to the interval .
Artificial Ordering AO – we introduce an artificial set updated at each iteration representing the already-ranked sets’ union, i.e., at the th iteration. The importance score is defined by uniquely pruning with the Jaccard index among the new set and . In AO, the punishment depends on the elements in already covered by the first ranked sets. Introducing the artificial pathway containing the already covered elements avoids multiple punishments for the same overlapping elements typical of PO and POR.
Artificial Ordering with Rescaling AOR – AOR rescales the punishment of AO, i.e., to the interval .
Note that each criterion orders first the set with the highest Shapley value. The orderings differ from each other from the second-ranked set. The choice of which particular punishment to use is highly dependent on the goal, and there is not a unique correct way of choosing which ranking to use. Table 1 summarizes the properties of the different rankings.
3. Application to collections of gene sets
SV | PO | POR | AO | AOR | ||
---|---|---|---|---|---|---|
KEGG13 | 0.691 | -0.251 | 0.336 | 0.295 | 0.344 | |
BIOCARTA | 0.32 | -0.372 | 0.058 | -0.17 | 0.02 | |
COVID | 0.713 | -0.352 | 0.41 | -0.616 | 0.401 |
We apply our methods to gene sets, i.e., lists of genes based on a potential common biological functionality. Typically, the gene sets are grouped in collections of gene sets according to some prior biological knowledge, e.g., involvement in the same biological processes, presence of common biochemical mechanisms, or shared associations with a phenotype (liberzon_molecular_2015, ). We selected collections of gene sets, i.e., KEGG13, BIOCARTA, and COVID, from the Enrich library and some clinical traits to assess whether associations exist between gene sets and traits. Each selected collection of gene sets is a set of pathways whose size ranges between and ; the association traits considered contain at least relevant genes. The implementation code is available on Github. The results here presented summarize both the three collections of genes sets (KEGG13, BIOCARTA, and COVID) and the ones discussed in Balestra et al. (balestra2023redundancy, )
KEGG13 | BIOCARTA | COVID | |||||||
---|---|---|---|---|---|---|---|---|---|
% | 10 | 20 | 40 | 10 | 20 | 40 | 10 | 20 | 40 |
SV | 0.00 | 0.00 | 0.08 | 1.0 | 1.71 | 1.57 | 1.58 | 1.76 | 2.14 |
PO | 0.50 | 0.54 | 0.36 | 0.00 | 0.06 | 0.41 | 0.01 | 0.10 | 0.37 |
POR | 0.58 | 0.70 | 0.58 | 0.64 | 0.49 | 0.79 | 1.01 | 1.15 | 1.18 |
AO | 0.94 | 0.63 | 0.46 | 1.06 | 1.03 | 0.80 | 0.42 | 1.54 | 1.12 |
AOR | 0.69 | 0.88 | 0.88 | 0.8 | 0.97 | 1.03 | 1.29 | 1.13 | 1.10 |
SV | 54.46 | 73.02 | 89.38 | 41.80 | 62.58 | 83.74 | 46.98 | 70.50 | 88.35 |
PO | 43.17 | 48.96 | 69.66 | 31.25 | 48.78 | 72.68 | 11.47 | 23.30 | 63.6 |
POR | 54.36 | 73.92 | 86.89 | 39.20 | 61.99 | 81.22 | 46.75 | 68.64 | 88.07 |
AO | 31.48 | 42.27 | 72.88 | 27.84 | 42.54 | 64.29 | 13.29 | 15.76 | 34.09 |
AOR | 46.97 | 59.16 | 78.31 | 35.78 | 53.38 | 75.80 | 42.35 | 63.26 | 84.00 |
Our rankings represent a new framework to reduce the dimension of collections of gene sets while keeping a high coverage of genes. As previously stated, collections of gene sets of biologically relevant sets of pathways and pathways s are sets of genes s; therefore, our approach directly applies to this context. We present the results for collections of gene sets and some phenotypic traits. We obtain
no favoritism towards larger pathways: The Shapley value function assigns to each set a positive real number incorporating information on the distribution of its elements in the other sets of . This leads to a positive correlation with the sizes of pathways, i.e., sets with higher dimensions are more likely to get a higher Shapley value. Using Kendall’s scores to measure the ordinal association between the pathways’ size and the position in the rankings, we show that using AO and PO, this correlation effect is reversed in most of the collections of gene sets; In AOR and POR, there is no clear tendency of a correlation among the dimension of pathways and the position in the ranking (see Balestra et al. (balestra2023redundancy, ) and Table 1). The punishments indirectly affect the strength of the positive correlation between ranking position and their size, as higher-dimensional pathways are more likely to show overlaps among them.
rankings of the original pathways without modifying them: In contrast to usual dimensional reduction methods for collections of gene sets, the pathways are selected and not modified, thus preserving their biological meaning.
HEIGHT |
B. PLATELET C. |
STANDING H. |
B. RED C. |
HEEL TSCORE |
B. WHITE C. |
B. EOSINOPHIL C. |
SITTING H. |
TRUNK PRED. M. |
W. BODY FAT FREE M. |
W. BODY WATER M. |
BMR |
IMP. OF W. BODY |
BMI |
COMP. H. SIZE AGE 10 |
ARM PRED. M. R |
ARM FAT FREE M. R |
LEG FAT FREE M. R |
LEG PRED. M. R |
IMP. OF LEG R |
LUNG FVC |
WHRATIO |
HAIR PIGMENT |
||
KEGG13 | SV | 1 | 1 | - | 2 | - | - | - | - | 1 | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
PO | 2 | - | 2 | 1 | - | 1 | - | - | 6 | - | - | - | - | - | 1 | - | - | - | - | - | - | - | - | |
POR | 1 | - | 1 | 1 | 3 | - | - | - | 1 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | |
AO | 2 | 1 | 3 | 1 | 5 | 1 | - | - | 6 | - | - | - | - | - | 1 | - | - | - | - | - | - | - | - | |
AOR | 1 | - | 1 | 1 | 1 | - | - | - | 1 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | |
ALL | 7 | 1 | 3 | 1 | 3 | 1 | - | - | 8 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | |
BIOCARTA | SV | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
PO | 2 | - | 2 | 2 | 1 | - | - | - | 1 | - | - | 1 | - | - | - | - | - | - | - | - | - | - | - | |
POR | 1 | - | 1 | 2 | 1 | - | - | - | - | - | - | 1 | - | - | - | - | - | - | - | - | - | - | - | |
AO | 1 | - | 1 | 1 | 1 | - | 2 | - | 1 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | |
AOR | 2 | - | 2 | 4 | 1 | - | - | - | 1 | - | - | 1 | - | - | - | - | - | - | - | - | - | - | - | |
ALL | 2 | - | 2 | 2 | 1 | - | - | - | 1 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | |
COVID | SV | 9 | 5 | 4 | 5 | 1 | 5 | 6 | - | - | - | - | - | - | - | 4 | 2 | 2 | - | - | - | - | - | 4 |
PO | 22 | 15 | 5 | 26 | 4 | 6 | 11 | - | 6 | 2 | 3 | 1 | - | 2 | 3 | 4 | 4 | - | - | - | 2 | 2 | 1 | |
POR | 12 | 7 | 7 | 9 | 2 | 3 | 4 | - | 6 | 6 | 5 | 6 | 4 | - | 1 | 3 | 2 | 3 | 3 | 3 | 5 | 2 | 3 | |
AO | 14 | 6 | 6 | 7 | - | 4 | 5 | 1 | 6 | - | - | - | - | 2 | - | - | - | - | - | - | - | - | - | |
AOR | 18 | 8 | 5 | 8 | 3 | 6 | 1 | 1 | 3 | 4 | 3 | 4 | 5 | - | 4 | 1 | 2 | 3 | 3 | 4 | - | 2 | 5 | |
ALL | 64 | 32 | 27 | 44 | 14 | 13 | 16 | 5 | 16 | 12 | 10 | 11 | 5 | 4 | 10 | 3 | 5 | 6 | 6 | - | 2 | 5 | 9 |
evident redundancy reduction: The pathways’ selections deriving from the rankings proposed show much lower overlap than the selection given by the Shapley values alone. We evaluate the redundancy reduction using the generalized Jaccard index. We rescale the generalized Jaccard indexes to the maximum generalized Jaccard index, i.e., the maximum Jaccard index among any pairs of pathways within the collection of gene set. The lower the generalized Jaccard index, the better the ranking performs in selecting non-overlapping pathways. PO and AO use strong punishments such that highly overlapping pathways with the first are ranked far from each other. KEGG13 represents an exceptional case where SV achieves low redundancy without additional pruning criteria. Table 2 shows a direct comparison for the , and of the pathways.
high coverage of the genes: We investigate the coverage of the genes in percentage when only considering the first ranked pathways using the various rankings. SV, POR, and AOR outperform the rankings PO and AO give. The high coverage achieved by SV is due to the correlation between the size of the pathways and their positions in the ranking; however, not the same can be argued about POR and AOR. The lower performances of PO and AO are due to the harsh punishments for overlapping gene sets; Hence, they select first small pathways which were ranked in the lowest positions by Shapley values alone. We refer to Table 2 for additional details.
similar statistical power as the whole collections: We use the Fisher’s exact test (fisher_logic_1935, ; agresti_introduction_2018, ) to determine whether a pathway is significant and apply FDR multiple hypothesis testing corrections for the p-values (benjamini_controlling_1995, ; dudoit_multiple_2008, ). Figure 2 shows that the numbers of statistical significant pathways recalled when using only the portions of the collection of gene sets given by the proposed rankings and by the full collections are not remarkably different; we specifically use some remarkable association traits and report the numbers of statistical significant pathways when using the first of the ranked pathways in Table 3. For additional experiments with other collections and a comparison with the most common GSEA methods, we refer to Balestra et al. (balestra2023redundancy, ).

4. Related work
CGT led to applications in computer science in various contexts and applications; In particular, Shapley values have been extended to feature selection (cohen_feature_2007, ; pfannschmidt_evaluating_2016, ; balestraUSV, ), networks and security (van_campen_new_2017, ), explainable machine learning (lundberg_unified_2017, ) and bioinformatics (sun_game_2020, ). A recent paper by Rozemberczki et al. (rozemberczki2022shapley, ) summarizes the major applications of Shapley values in machine learning literature. Moreover, importance scores based on Shapley values have been adapted to study the relationships among genetic and phenotypic characteristics for gene sets prioritization analysis (lucchetti_shapley_2010, ; moretti_combining_2008, ). Shapley values’ exact computation, on the other hand, requires evaluations of a value function where is the number of players; on the other hand, the introduction of microarray games (moretti_class_2007, ) reduces the computational challenges of exact Shapley values’ computation to polynomial time under the assumption that the game can be written using only binary relationships.
Collections of gene sets are families of pathways based on some prior biological knowledge (liberzon_molecular_2015, ); the pathways are often overlapping thus making these collections not easily interpretable. Several methods have been proposed to address the low interpretability of the high-dimensional collections, including visualization tools and techniques to merge gene sets into a non-redundant single and unified pathway (belinky_pathcards_2015, ; van_iersel_presenting_2008, ; doderer_pathway_2012, ). Stoney et al. (stoney_using_2018, ) was the first to point out the importance of maximizing gene coverage without altering the pathways from their original form. However, their work handles redundancy among collections of gene sets in the different databases and not in the single collections.
One of the main applications of gene sets is enrichment analyses GSEA (subramanian_gene_2005, ; mathur_gene_2018, ); GSEA tests for the potential over/under-representation of the analyzed genes in specifically biologically annotated gene sets. The number of statistical hypothesis tests performed equals the number of pathways within the collection of gene sets. Thus, correcting for multiple testing becomes a major challenge (dudoit_multiple_2008, ; noble_how_2009, ). Among the different corrections for multiple hypothesis testing, we recall the Bonferroni correction and the false discovery rate FDR (benjamini_controlling_1995, ; benjamini_control_2001, ), where the latter is less conservative.
5. Conclusions and discussion
We proposed redundancy-aware Shapley values to rank sets in families of sets. The four presented rankings aim to satisfy various properties when selecting sets based on them. Motivated by the numerous applications of unsupervised feature selection, the proposed importance scores consider the distribution of elements within the family of sets and their overlap. In the presented application on collections of gene sets, we show good performance for various metrics. However, the range of potential applications is much broader, and our proposed methods can open new research paths in the applied and theoretical fields. For the specific case of collections of gene sets, one possible extension is the addition of a supervised punishment that also considers the relevance of the single pathway to a specific phenotypic trait. This could lead to a higher number of significant pathways when using the Shapley values-based rankings to reduce the dimension of the collections.
6. Acknowledgements
This research was supported by the research training group Dataninja funded by the German federal state of North Rhine-Westphalia and by the Research Center Trustworthy Data Science and Security, an institution of the University Alliance Ruhr.
References
- (1) Agresti, A. An Introduction to Categorical Data Analysis. Wiley Series in Probability and Statistics. Wiley, 2018.
- (2) Balestra, C., Huber, F., Mayr, A., and Müller, E. Unsupervised features ranking via coalitional game theory for categorical data. In Big Data Analytics and Knowledge Discovery (DaWaK) (2022).
- (3) Balestra, C., Maj, C., Müller, E., and Mayr, A. Redundancy-aware unsupervised ranking based on game theory: Ranking pathways in collections of gene sets. Plos one 18, 3 (2023), e0282699.
- (4) Belinky, F., Nativ, N., Stelzer, G., Zimmerman, S., Iny Stein, T., Safran, M., and Lancet, D. PathCards: multi-source consolidation of human biological pathways. Database 2015, bav006 (2015).
- (5) Benjamini, Y., and Hochberg, Y. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society. Series B (Methodological) 57, 1 (1995).
- (6) Benjamini, Y., and Yekutieli, D. The Control of the False Discovery Rate in Multiple Testing under Dependency. The Annals of Statistics 29, 4 (2001).
- (7) Cohen, S., Dror, G., and Ruppin, E. Feature selection via coalitional game theory. Neural Computation 19, 7 (2007).
- (8) Doderer, M. S., Anguiano, Z., Suresh, U., Dashnamoorthy, R., Bishop, A. J., and Chen, Y. Pathway Distiller - multisource biological pathway consolidation. BMC Genomics 13, 6 (2012).
- (9) Dudoit, S., and Laan, M. Multiple Testing Procedures With Applications to Genomics. 2008.
- (10) Fisher, R. A. The Logic of Inductive Inference. Journal of the Royal Statistical Society 98, 1 (1935).
- (11) Liberzon, A., Birger, C., Thorvaldsdóttir, H., Ghandi, M., Mesirov, J. P., and Tamayo, P. The Molecular Signatures Database (MSigDB) hallmark gene set collection. Cell Systems 1, 6 (2015).
- (12) Lucchetti, R., Moretti, S., Patrone, F., and Radrizzani, P. The Shapley and Banzhaf values in microarray games. Computers & Operations Research 37, 8 (2010).
- (13) Lundberg, S. M., and Lee, S.-I. A unified approach to interpreting model predictions. Advances in neural information processing systems 30 (2017).
- (14) Mathur, R., Rotroff, D., Ma, J., Shojaie, A., and Motsinger-Reif, A. Gene set analysis methods: a systematic comparison. BioData Mining 11, 1 (2018).
- (15) Moretti, S., Patrone, F., and Bonassi, S. The class of microarray games and the relevance index for genes. TOP 15, 2 (2007).
- (16) Moretti, S., van Leeuwen, D., Gmuender, H., Bonassi, S., van Delft, J., Kleinjans, J., Patrone, F., and Merlo, D. F. Combining Shapley value and statistics to the analysis of gene expression data in children exposed to air pollution. BMC Bioinformatics 9, 1 (2008).
- (17) Noble, W. S. How does multiple testing correction work? Nature Biotechnology 27, 12 (2009).
- (18) Pfannschmidt, K., Hüllermeier, E., Held, S., and Neiger, R. Evaluating Tests in Medical Diagnosis: Combining Machine Learning with Game-Theoretical Concepts. Information Processing and Management of Uncertainty in Knowledge-Based Systems 610 (2016).
- (19) Rozemberczki, B., Watson, L., Bayer, P., Yang, H.-T., Kiss, O., Nilsson, S., and Sarkar, R. The shapley value in machine learning, 2022.
- (20) Shapley, L. S. A value for n-person games. Contributions to the Theory of Games (1953).
- (21) Stoney, R. A., Schwartz, J.-M., Robertson, D. L., and Nenadic, G. Using set theory to reduce redundancy in pathway sets. BMC Bioinformatics 19, 1 (2018).
- (22) Subramanian, A., Tamayo, P., Mootha, V., Mukherjee, S., Ebert, B., Gillette, M., Paulovich, A., Pomeroy, S., Golub, T., Lander, E., and Mesirov, J. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America (2005).
- (23) Sun, M., Moretti, S., Paskov, K., Stockham, N., Varma, M., Chrisman, B., Washington, P., Jung, J.-Y., and Wall, D. Game theoretic centrality: a novel approach to prioritize disease candidate genes by combining biological networks with the Shapley value. BMC Bioinformatics 21 (2020).
- (24) van Campen, T., Hamers, H., Husslage, B., and Lindelauf, R. A new approximation method for the Shapley value applied to the WTC 9/11 terrorist attack. Social Network Analysis and Mining 8, 1 (2017).
- (25) van Iersel, M. P., Kelder, T., Pico, A. R., Hanspers, K., Coort, S., Conklin, B. R., and Evelo, C. Presenting and exploring biological pathways with PathVisio. BMC Bioinformatics 9, 1 (2008).