Gene expression data classification using topology and machine learning models

Interpretation of high-throughput gene expression data continues to require mathematical tools in data analysis that recognizes the shape of the data in high dimensions. Topological data analysis (TDA) has recently been successful in extracting robust features in several applications dealing with high dimensional constructs. In this work, we utilize some recent developments in TDA to curate gene expression data. Our work differs from the predecessors in two aspects: (1) Traditional TDA pipelines use topological signatures called barcodes to enhance feature vectors which are used for classification. In contrast, this work involves curating relevant features to obtain somewhat better representatives with the help of TDA. This representatives of the entire data facilitates better comprehension of the phenotype labels. (2) Most of the earlier works employ barcodes obtained using topological summaries as fingerprints for the data. Even though they are stable signatures, there exists no direct mapping between the data and said barcodes.

Results

The topology relevant curated data that we obtain provides an improvement in shallow learning as well as deep learning based supervised classifications. We further show that the representative cycles we compute have an unsupervised inclination towards phenotype labels. This work thus shows that topological signatures are able to comprehend gene expression levels and classify cohorts accordingly.

Conclusions

In this work, we engender representative persistent cycles to discern the gene expression data. These cycles allow us to directly procure genes entailed in similar processes.

Background

The rapid advances in genome-scale sequencing have dispensed a comprehensive list of genes for different organisms. These data gives us a broad scope to comprehend the developmental and functional processes of these organisms. Since the advent of DNA microarray, it is now possible to measure the expression levels of large number of genes simultaneously. This has made a holistic analysis possible for gene data using their expression levels. The stochastic nature of biological processes and associated noise acquired during the mining process pose a fundamental challenge in modelling a mathematical structure explaining these high dimensional data. We look into two problems in data analysis involving gene expressions that are of current research interest.

A genome-wide association study (GWAS) is a method to link a subset of genes to a particular disease or physical phenomenon in an organism. It has been especially important to identify specific gene subsets not only from a clinical perspective but also from a data science perspective as well. The assimilation of these subsets enable better phenotype identification and improve prediction of cohort status using machine learning based approach. Our definition of cohort follows its common usage in biology where a cohort is a group of animals of the same species, identified by a common characteristic, which are studied over a period of time as part of a scientific or medical investigation. In our case all cohorts for each experiment belong to the same taxa.For small or medium sized data sets, since the number of gene expression in a cohort profile is far greater than the number of sample cohorts, disease prediction using neural networks is challenging as these architectures largely succeed when the number of samples is much larger. It becomes important for these cases to identify a subset of genes whose expression levels reflect the phenotype of the cohorts.

In addition, it is often the case that some cohort have incorrect or uncorrelated data due to instrumental or manual error. Hence, their gene expressions may not reflect their phenotype class. We find in practice that the elimination of such instances leads to better prediction scores and performance. In this work, we use topological data analysis to investigate both of these issues. We identify cohorts which are topologically relevant (Section Topo-Curated Cohort). We show that the use of these cohorts to determine phenotypes instead of the entire set improves classification. Next, in Topo-Relevant Gene Expression section, we look into the classic GWAS problem mentioned above to identify a small subset of genes by using topological data analysis. We compare classification results obtained by using this reduced gene subsets against the one obtained by using full gene pool. The results for the receded gene profile yields better prediction rate.

Topological data analysis (TDA), loosely speaking, explains the shape of a data using topological structures. Topological properties can be thought to remain invariant under continuous deformation. For instance, given a donut made of clay, topologically its shape remains the same if we stretch, twist, or bend it but changes if we cut or glue it. The theory of Algebraic Topology lays the mathematical foundation formalising this idea. Persistent Homology is a method to derive topological structures from a given data. Topological signatures, particularly based on Persistent Homology, enjoy some nice theoretical properties including robustness and scale invariance. These features are global and more resilient to local perturbations. This has made TDA an exciting area in data analysis with encouraging results in medical imaging [1, 2], protein analysis [3, 4], and molecular architecture [5, 6] among others. In previous works it has been shown that genes sharing similar attributes tend to cluster in high dimensions [7, 8]. This is because protein encoding genes that are part of the same biological pathway or have similar functionality are corregulated. This ultimately leads such gene clusters to have similar expression profiles. The property of clustering is essentially captured by the $zero^$ order homology class in Persistent Homology (see next section). Motivated by these works, we are interested in finding if there exist relationships among similar genes in the higher order homology classes as well.

Traditional TDA pipelines use Persistent Homology to compute a set of intervals called barcodes which are used as topological feature in subsequent processing such as learning [3, 9,10,11]. While such barcodes provide robust topological signatures for the persistent features in data (such as tunnels, voids, loops, cycles etc.), their association to data is not immediately clear thus missing some crucial information. In effect, since these intervals represent homology classes, they contain the set of all loops around a topological hole. Thus using barcodes, it is hard to localize a feature, e.g., the shortest cycles or holes in a Persistent Homology class. This, in turn, hinders getting any direct mapping between the topological signatures and input cohorts or genes. So far there had been few studies addressing the problem of localizing persistent features and it has been shown that finding shortest cycles in given Persistent Homology classes is an NP-hard problem [12, 13]. However, taking advantage of the recent results in [12, 13], we are able to to compute good representative cycles for our application. These cycles capture definitive geometric features and provide a mapping between two domains of gene expression and topology.

In this paper we conduct two main experiments using the representative cycles: one to extract topologically relevant genes and the other to curate relevant cohorts. For these studies, some organisms were control units while others were either infected and/or injected with some antigen. The input consists of a matrix $<\mathcal >$ which has n rows signifying the cohorts and their corresponding gene expressions in m columns. For obtaining and classifying topologically relevant (topo-relevant) genes, our experiment follow the pipeline in Fig. 1a whereas for determining curated cohorts, it follows the pipeline in Fig. 1b. For a large data set, we can trim out both insignificant cohorts and genes starting from the ‘Training data $<\mathcal >_$ ’. This can be done following the pipeline in Fig. 2. We train our neural network architecture on the final curated dataset and thereby test against any unknown cohort. For our experiments, we work on gene expressions extracted from different organisms including Drosophila, Mus muculus, and Homo sapiens. We convert these data into a binary or multi-classification problem based on their phenotype and feed it into the pipeline. Our methodology and results have been listed in Computing Topological Signature of Gene-Expression Data section (Fig. 3).

Related works

In [14], gene expressions from peripheral blood data was used to build a model based on TDA network model and discrete Morse theory to look into routes of disease progression. These results on viruses suggest that Persistent Homology can be also used to study different forms of reticulate evolution. Topological structures have been used to analyse viruses by Emmet et al. [15]. They worked on influenza to show a bimodality of distribution of intervals in the persistent diagram. This bimodality indicates two scales of topological structure, corresponding to intra-subtype (involving one HA subtype) and inter-subtype (involving multiple HA subtypes) viral reassortments. Persistent Homology has also been used to identify DNA copy number aberrations [16]. Their experiments found a new amplification in 11q at the location of the progesterone receptor in the Luminal A subtype. Seemann et al. [17] used Persistent Homology to identify correlated patient samples in gene expression profiles. Their work focuses on the $<\mathcal >_0$ homology class which is used to partition the point cloud. The paper by Nicolau et al. [18] identified a subgroup of breast cancers using topological data analysis in gene expressions.

Several works [19, 20] on use of machine learning techniques on gene expression profile have shown promising results. Kong et al. [21] used random forests to extract features for their Neural Network architecture. They investigate a problem similar to our ‘Topo-relevant gene’ and the results show significant improvement. [22] analyzes gene expression data to classify cancer types. Different techniques of supervised learning are used to understand genes and classify cancer. The authors of [23] use machine learning to identify novel diagnostic and prognostic markers and therapeutic targets for soft tissue sarcomas. Their work shows overlap of three groups of tumors in their molecular profile.

Our contribution

We provide a technique based on persistent cycles (introduced in [12, 13]) to curate cohort data (datapoints) and gene expressions (features) (See Topo-Curated Cohort and Topo-Relevant Gene Expression sections). Through the experiments we show that these geometric structures, i.e. cycles, encode important information about the cohorts as choosing these topo-curated cohorts improve a classifier’s accuracy. In a separate experiment, we demonstrate that choosing these topo-curated gene expressions provide a better classification. In a way, we provide empirical evidence that there is a one-to-one correspondence between topological features and important gene functionality.

Results

We now discuss the two ways to reduce the input $<\mathcal >_$ into $<\mathcal >_$ where $n'\le n$ and $m'\le m$ . The first section deals with finding pertinent cohorts, and the next with finding pertinent genes. In each subsection we describe the relevant procedure followed by results.

Topo-curated cohort

For our first proof of concept, we find a subset of cohorts who provide topologically relevant information for classification. The aim is to remove cohorts having either incorrect or uncorrelated data due to instrumental or manual error. Specifically, given $<\mathcal >_$ , we would like to find $<\mathcal >'_\subseteq <\mathcal >_$ for $n\le n'$ which improves classification odds for the cohorts. This subset of $n'$ cohorts should therefore be topologically more relevant. We start by converting the matrix $<\mathcal >_$ into a point cloud. This point cloud has n points each of dimension m. Hence each cohort in the matrix is converted to an m-dimensional point where each dimension represents the expression level for each gene. We use Sparse Rips on the resulting point cloud to obtain a simplicial complex $<\mathcal >$ and its filtration ( $<\mathcal >$ ) and apply the theory of Persistent Homology to obtain the set of finite intervals.

We consider the dataset $<\mathbb >0$ having three phenotypes. We generate the longest 100 $<\mathcal >_2$ cycles based on their interval length ( $\delta -\beta$ ). For each cycle, we consider the constituent vertices and their corresponding phenotype labels ( $<\mathcal >$ ). We plot the count of $<\mathcal >$ values in individual $<\mathcal >_2$ -cycles in Fig. 11a with the X, Y and Z axis representing $<\mathcal > \in 0, 1,\text < and >2$ respectively. The black points indicate cycles where all vertices belong to a single phenotype. The red, green, and blue points indicate cycles having labels $(0,1),(0,2)\text < and >(1,2)$ respectively. The yellow points correspond to cycles having all three labels 0, 1, and 2. The takeaway from this plot is that, since most points are skewed towards some particular axis, most $<\mathcal >_2$ -cycles have constituent vertices who belong predominantly to some particular label in $<\mathcal >$ . Thus topological cycles in general are inclined towards some $<\mathcal >$ labels without any supervision as they were not fed with the phenotype labels. Note that we added a small random noise to each point coordinate to illustrate multiplicity. Figure 11b plots similar values for the top 200 $<\mathcal >_2$ cycles for dataset $<\mathbb >1$ . Since this dataset has two phenotypes, we get a 2d plot. The red labels denote cycles which have an equal constituent phenotypes, whereas blue and cyan represent skew, with blacks representing single labeled cycles as before. As is evident, most cycles exhibit a predominance in either $<\mathcal > \in \ 0\ or\ 1$ . Based on the intuition of this plot, we define a cycle $<\mathcal >$ as a Dominant Cycle if, there exists a vertex set $U\subseteq <\mathrm Vert>(<\mathcal >)$ Footnote 1 so that every vertex in U has the same label and $|U|\ge |<\mathrm Vert>(<\mathcal >)/2|$ Footnote 2 .

To illustrate the frequency of dominating cycles versus non-dominating ones, we plot the geodesic centers of the $<\mathcal >_2$ cycles for $<\mathbb >0$ by projecting them down to 2d using T-sNE (Fig. 4). Red vertices indicate non dominating cycles while each of the graded green points indicate the dominating ones. Clearly, most of the topology cycles are dominating and indicate a vote towards some phenotype class. The alpha values (denoted by the green bar at the right) indicates the ratio of the dominating phenotype in each cycle versus the other labels ( $<\mathcal >$ ). Hence, intuitively, more opaque a given point is, more is it dominated by a single class phenotype. Finally, we plot some of the individual dominating $<\mathcal >_2$ cycles along with their phenotype labels in Fig. 5. Note that these points are part of the original D0 cohort point cloud and they were projected down to 3D using PCA.

Classification using machine learning

We work on several gene expression data extracted from different organisms. On each of these, we create a classification problem as described in the data section. For each dataset, we use the entire cohort list (irrespective of their phenotype) as an $(n\times m)$ dimensional point cloud. We generate the top 100 $<\mathcal >_1$ and $<\mathcal >_2$ cycles and select the dominant cycles. Next we select the vertices contained in these dominant cycles which form our new set of $n'(\le n)$ -cohorts. Taking gene expression for these $n'$ -cohorts lets us form our new smaller matrix $<\mathcal >'_$ . Thereafter, we train supervised classification models once using $<\mathcal >_$ and then again using $<\mathcal >'_$ and compare results for each. We use 10-fold cross validation by splitting the data randomly into $8020\%$ in each fold. For our classification models, we use two probability based classification models: Decision Tree and Naive Bayes.Note that we are interested in finding out whether the TDA pipeline can curate and retain the faithful representation of data. As a result we are comparing the performance of Decision Tree and Naive Bayes classifier on Topo-curated data. We do not report Support Vector Machine (SVM) result as its accuracy is too low to report. In general, probability based classification fared better than kernal based (SVM) techniques, hence we have reported our results on the same. The average value of accuracy, precision, and recall for the 10-fold cross validation is reported in Table 4. The column ‘FULL’ represents training on $<\mathcal >_$ while $<\mathcal >_1+<\mathcal >_2$ represent the union of $n'$ topo-relevant cohorts obtained from the dominant cycles in either $<\mathcal >_1$ or $<\mathcal >_2$ . We also get good classification statistics for the vertices in dominant cycles picked up only by $<\mathcal >_2$ cycles only as reported in the same table. As is evident from the results, reduction in the number of cohorts leads to an increase in classification measures. Thus TDA is able to pick up cohorts who carry more decisive gene expression levels for their individual phenotype classes.

Topo-relevant gene expression

Our next problem is to reduce the matrix $<\mathcal >_$ to $<\mathcal >'$ of dimension $(n\times m')$ where $m'\ll m$ . We use the persistent cycle descriptors $<\mathcal >_1$ and $<\mathcal >_2$ introduced in the previous section to extract $|m'|$ meaningful genes ( $\mathcal $ ) such that $\mathcal \subset <\mathcal >$ . To this effect, we use the annotation of the gene set $<\mathcal >$ based on their functional classification obtained from the ‘Panther Classification System by Geneontology’ [24] and the ‘NCBI Gene Data set’ [25]. Thus for each $g \in <\mathcal >$ , $\exists f: g\rightarrow R$ , where R is a vector of functional attributes obtained from [24].

Once we obtain the representative cycles, we find the maximal cover of each cycle defined as follows:

Maximal cover of representative cycle ( $\kappa$ ) For each gene expression $g\in <\mathrm>(<\mathcal >)$ represented as vertices in a single representative cycle, we have a set of annotations f(g). We select the minimum set consisting of at least one annotation for each $g\in Vert(<\mathcal >)$ . Let $<\mathcal ~~>$ be any set of annotations which contains at least one annotation for each $g\in \mathrm(<\mathcal >)$ . Thus,~~

$$\begin \kappa = inf\< |<\mathcal >| \, | \,\forall g \in <<\mathrm>>(<\mathcal >), <\mathcal > \cap f(g)\not =\emptyset \> \end$$
The idea behind using $\kappa$ is to get a sense of the functionality of the gene. A gene may be responsible for multiple processes described in the Panther and NCBI database. If $\kappa$ is low or unity for a certain $<\mathcal >$ , it probably indicates that the gene expressions involved in $<\mathcal >$ reflect the functionality captured by $\kappa$ . This is illustrated in Fig. 6 where we plot some of the $<\mathcal >_2$ cycles generated on $\mathcal $ with color annotated by their functionality. We use PCA as before to project the points down to 3-dimensions. The three figures illustrate three instances of different $\kappa$ -values. Consider the example in Fig. 6a for getting the intuition behind $\kappa$ . The six vertices representing genes in the $<\mathcal >_2$ cycles have function annotations: . Out of this the set: covers all the vertices and hence $\kappa$ is 3.

We choose $<\mathcal >$ with low $\kappa$ values and select their component genes as part of $\mathcal $ . We can control the size of $\mathcal $ based on the value of $\kappa$ .

For all our experiments, we run each architecture and obtain performance measures on $<\mathcal >$ which contains the exhaustive set of m-genes. We re-run these experiments on our trimmed set $\mathcal $ containing $m'(\ll m)$ topologically significant genes. Note that we may use the topo-relevant cohort extraction to additionally reduce $\mathcal _$ into $\mathcal _.$ But since the public datasets we work on as our proof of concept have much less number of data to work with a Neural Net architecture, we do not trim the dataset. The results are in Table 1.

Since the number of samples is still less for CNN, overfitting is an issue. Notice that, for this precise reason, we do not curate this data using the pipeline in Topo-Curated Cohort section. Dropout layers are added after each layer to further prevent overfitting and reduce high variance. We, however, do not initiate early stopping as those pipelines are not amenable to orthogonalization. Finally, the model is optimised using Adam (Adaptive Moment Optimizer) [26]. The dataset is split evenly into 80-20 and cross validated for 50 epochs. The neural network has been implemented in Python using Tensorflow and Keras. The results for our experiment on datasets $<\mathbb >5, <\mathbb >6$ and $<\mathbb >7$ is shown in Table 1. The row # genes show that our architecture using vertices selected from topological cycles are less than $30\%$ the size of the original gene pool. The results have, however, improved in all the cases. For experiments with Neural networks we follow the trend of the loss, accuracy and F1 score by plotting their value after every epoch in our algorithm. Figure 8 shows this result on dataset $<\mathbb >7$ . We see that the loss function on test data has been slightly higher but smoother than the full dataset. Despite this, using TDA the accuracy and F1 score has consistently performed better in every iteration for both the training and test data.

Discussions

Comparison with baseline

We compare our Topo curated cohort with the following standard outlier detection, unsupervised clustering methods.

Local Outlier Factor

Density-based spatial clustering of applications with noise [27] (DBSCAN)

It is noteworthy for the Droso breeding dataset DBSCAN fails to cluster and reports all the cohorts as outliers (Table 2). With DBSCAN as outlier detector and Decision Tree as a classifier, classifier’s accuracy reaches upto $100\%$ which is probably due to imbalance in the dataset and overfitting. In Table 2 we report the maximum accuracy obtained by our method, i.e. $max(Acc_, Acc_, Acc_)$ from Table 4. We then compare our Topo-relevant gene expression method with the following feature selection methods (See Table 3).

Variance thresholding: Removes the low variance features that provide little information for modelling. [28,29,30].

Select K-best features: Selects k-features according to the highest scores. The scoring function used is F-value from analysis of variance (ANOVA). [29, 30]

Principal Component Analysis (PCA): Although this is a dimensionality reduction technique and not a feature selection method we incorporate it because PCA is used widely while analyzing high-dimensional data such as gene expression [31].

Uniform Manifold Approximation and Projection for Dimension Reduction (UMAP): UMAP is a manifold learning learning and TDA based dimensionality reduction technique. UMAP assumes the data to be uniformly distributed on a Riemannian manifold and finds a low dimensional embedding by building a ‘fuzzy topological’ structure on it [32].

We use the same neural net architecture described in Neural Network Architecture section and report the test-set accuracy after selecting the features with the aforementioned feature selection methods and compare with our Topo-Relevant Gene Expression procedure. We report the test-accuracy only because we have observed during the training that the datasets are prone to overfitting (Table 4).

The persistence of holes are captured by a set of birth-death pairs (intervals) of homology classes that indicate at which value of r the class is born and where it dies. Each of these pairs is visualized using horizontal line segment known as a bar which together form the barcode [35] (Fig. 3e). The rank of a Persistent Homology group called the persistent Betti number captures the number of persistent features. This means, for the $^$ homology group $<\mathcal >_0$ consisting of $^$ homology classes, $betti_0$ counts the number of connected components that arise in the filtration. For $<\mathcal >_1$ , $betti_1$ counts the number of circular holes (loops) being born as we proceed through the filtration. Similarly for $<\mathcal >_2$ , $betti_2$ gives a count of the number of surfaces enclosing three dimensional voids in the data. Thus, the short blue line segments of $betti_0$ in (Fig. 3e) correspond to the separate components that are joined to form one big connected component corresponding to the red line. The two long blue line segments of $betti_1$ correspond to the two holes in the structure, the largest representing the bigger hole.

For computational purposes, the growing sequence of the union of balls is converted to a growing sequence of triangulations, simplicial complexes in general, called a filtration (Fig. 9). The topological signatures are born when a series of say, edges (1-simplices), are connected to form a cycle and die when they are filled in with triangles (2-simplices). If we take the example in Fig. 9a, the theory of Persistent Homology suggests that in the filtration $<\mathcal >=K_0\rightarrow K_1\rightarrow \ldots \rightarrow K_n=K$ , the edges inserted in $<\mathcal >_4$ , $<\mathcal >_5$ and $<\mathcal >_6$ (1-simplex denoted $\sigma ^1_4$ , $\sigma ^1_5$ and $\sigma ^1_6$ respectively)are the creators as introduction of which create class of homology cycles. We can think the creators as the representative of the cycles. By convention when a triangle appears in the filtration it kills the youngest homology class and is denoted by pairing with creators. In $<\mathcal >_7$ the triangle kills the cycle created by the edge that came in $<\mathcal >_6$ . So, it pairs with $\sigma ^1_6$ . In $<\mathcal >_8$ the triangle pairs with $\sigma ^1_7$ and the big hole (created by $\sigma ^1_4$ and is the youngest creator unpaired) is filled up and destroyed by the last triangle (2-simplex denoted $\sigma ^2_9$ ) inserted in $<\mathcal >_9$ . Thus $\sigma ^1_4$ is paired with $\sigma ^2_9$ for interval [4, 9). The problem with relying only on the barcodes is that they tell us when the classes are born or die given a filtration. But for each homology class, there can be several cycles in the same class (Fig. 9b). Ideally, we would like the tightest cycle (blue one) in the class to be a representative cycle for a given bar. However, it is shown in [12] that computing such cycles even for $<\mathcal >_1$ is an NP-Hard problem. A follow up paper [13] shows that for dimensions $\ge 1$ , the problem remains NP-Hard. We therefore, use alternate polynomial time algorithms to build good representative $<\mathcal >_1$ and $<\mathcal >_2$ cycles given any barcode interval $[\beta ,\delta )$ . The first algorithm [12] computes a good but not necessarily the tightest representative cycles. The second algorithm [13] computes the tightest representative cycles but for a specific class of domains called pseudo-manifolds. We briefly describe these two algorithms.

Algo. 1 generates $<\mathcal >_1$ cycles. The idea is briefly as follows: we know that at the birth time $\beta$ of a 1-cycle (found by the persistence algorithm), an edge $\sigma ^1_\beta$ is inserted in $<\mathcal >$ to form a cycle in $<\mathcal >_\beta$ . We hence check for the shortest path between the vertices of $\sigma ^1_\beta$ in $<\mathcal >_$ before $\sigma ^1_\beta$ is inserted. Since we know that at least one cycle containing $\sigma ^1_\beta$ is formed at $<\mathcal >_\beta$ , adding $\sigma ^1_\beta$ to this path gives us the shortest cycle at $<\mathcal >_\beta$ . At $\delta$ , we need to know which cycle belonging to the homology class has died. This can be a linear combination of any cycles still alive including the cycle found at $<\mathcal >_\beta$ . This is found using a strategy of annotations [4]. In fact, it is shown in the paper that the shortest cycle found at $<\mathcal >_\beta$ is exactly the shortest cycle for the interval in most practical cases.

Algo. 2 is used to compute $<\mathcal >_2$ cycles for an interval $[\beta ,\delta )$ and can be extended to any $<\mathcal >_n$ . We first construct an undirected dual graph G for $<\mathcal >$ where vertices of G are dual to 2-simplices of $<\mathcal >$ and edges of G are dual to 1-simplices of $<\mathcal >$ . One dummy vertex termed as infinite vertex which does not correspond to any 2-simplices is added to G for graph edges dual to the boundary 1-simplices. We then build an undirected flow network on top of G where the source is the vertex dual to the death of an interval and the sink is the infinite vertex along with the set of vertices dual to those 2-simplices which are added to $<\mathcal >$ after $\delta$ . If a 1-simplex is $\sigma _\beta ^1$ or added to $<\mathcal >$ before $\sigma _\beta ^1$ , we let the capacity of its dual graph edge be its weight; otherwise, we let the capacity of its dual graph edge be $+\infty$ . Finally, we compute a minimal cut of this flow network and return the 2-chain dual to the edges across the minimal cut as a minimal persistent cycle for the interval. The readers may consult the respective papers for $<\mathcal >_1$ -cycles [12] and $<\mathcal >_2$ -cycles [13] computations for more details.

Since computation of $H_n$ -cycles is computationally expensive, especially in higher dimensions, we restrict ourselves with the computation of upto $<\mathcal >_2$ -cycles for our experiments. Most previous works on TDA had mainly included $<\mathcal >_1$ intervals, with applications in gene expression being restricted to $<\mathcal >_0$ , so we hope to shed some new light into the problem even with this restricted setup.

Computing topological signature of gene-expression data

We work under the hypothesis that topological data analysis extracts relevant information sufficient for cohort classification. We note that topological feature extraction methods used in earlier works may not work in this setting. Traditionally, for many applications in bio science (say protein classification) and engineering, we find corresponding topological signatures using Persistent Homology for each sample (in this case cohorts or genes). These signatures are appended to the feature vectors. However, in this case, since each cohort is represented by a single 1D vector of gene expression levels, we are not able to find suitable signatures to append. This is why the algorithms we described in the previous section comes handy, as we will see in this section. We use our tools in two separate set of experiments. For algorithms 1 and 2, we need a simplicial complex $<\mathcal >$ , a filtration $<\mathcal >$ , and finite intervals. For all the studies in the paper, we use Sparse Rips [36] to obtain the simplicial complex $<\mathcal >$ and its filtration ( $<\mathcal >$ ). We can apply the theory of Persistent Homology to obtain the set of all finite intervals. In addition, algorithm 2 requires a pseudo-manifold ( $>$ ) instead of a regular simplicial complex $<\mathcal >$ . For our case, this means that all triangles ( $d=2$ -simplices) has at most two tetrahedrons ( $d+1=3$ -simplices) attached to it. We convert $<\mathcal >$ into $>$ by allowing at most two cofaces (tetrahedra) per triangle which appear first in the filtration:

Add all $\sigma ^<0\ldots d>$ to $>$ :

$\forall \sigma ^ \in K$ :

Sort: its co-faces $<\mathcal > = \sigma ^$ by $<\mathcal >(\sigma ^)$

If: $|<\mathcal >|\ge 2$ , insert into $>$ , the first two $\sigma ^$ in $<\mathcal >$ ,

Else: insert $<\mathcal >$ in $>$