3. TDA and visualisation
If we want to use TDA to get an overall view of a dataset in a visual analytics setting, there are different ways in which we can create a network. Here, we’ll describe the mapper algorithm and STAD (simplified topological approximation of data). In both cases, we take a high-dimensional dataset and generate a network like the one below:
Source: Torres BY et al. (2016). Tracking Resilience to Infections by Mapping Disease Space. PLOS Biology 14(4): e1002436.
Once we have such a network, we can for example colour it according to different features (see example below). For continuous values, the colour of a node represents the average value of the datapoints it contains: red nodes contain data samples with higher average values, whereas blue nodes contain samples with lower average values. For categorical values, the colour can represent the concentration of a value.
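As a small illustration of this colouring step (a minimal sketch, not taken from the cited paper), the snippet below computes one colour value per node as the average of a single feature over the datapoints the node contains. The `node_members` mapping and `feature` array are assumptions about how the network and data are stored.

```python
# Minimal sketch: colour each node of a TDA network by the average value of one
# feature over the datapoints it contains. `node_members` (node id -> datapoint
# indices) and `feature` (one column of the dataset) are assumed inputs.
import numpy as np

def node_colours(node_members, feature):
    return {node: float(np.mean(feature[list(members)]))
            for node, members in node_members.items()}
```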
3.1. The mapper algorithm
In their 2007 paper, Singh et al. describe the method, which consists of the following steps (a minimal code sketch follows the list):
- Choose a "lens" (aka "filter") along which to spread the data.
- Create overlapping data bins across this lens (the cover).
- Cluster the datapoints within each bin. These clusters become the nodes in the network.
- Because the bins overlap, some clusters in one bin will have datapoints in common with a cluster in an adjacent bin. If that is the case, these clusters are connected with an edge.
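The sketch below is a minimal, hand-rolled illustration of these four steps on a toy 2D point cloud; it is not the reference implementation from the paper, and the choice of lens, cover parameters and DBSCAN settings are arbitrary assumptions.

```python
# Minimal, hand-rolled sketch of the mapper algorithm on a toy 2D point cloud.
# Assumes numpy, scikit-learn and networkx are available; all names are illustrative.
import numpy as np
import networkx as nx
from itertools import combinations
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
data = rng.normal(size=(500, 2))           # toy point cloud

# 1. Lens/filter: here simply the first coordinate (a projection).
lens = data[:, 0]

# 2. Cover: overlapping intervals along the lens.
n_intervals, overlap = 10, 0.3
lo, hi = lens.min(), lens.max()
length = (hi - lo) / n_intervals
starts = lo + np.arange(n_intervals) * length
bins = [(s - overlap * length, s + (1 + overlap) * length) for s in starts]

# 3. Cluster within each bin; every cluster becomes a node.
graph = nx.Graph()
node_members = {}                          # node id -> set of datapoint indices
for b, (left, right) in enumerate(bins):
    idx = np.where((lens >= left) & (lens <= right))[0]
    if len(idx) == 0:
        continue
    labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(data[idx])
    for lab in set(labels) - {-1}:         # ignore DBSCAN noise points
        node = (b, int(lab))
        node_members[node] = set(idx[labels == lab])
        graph.add_node(node, size=len(node_members[node]))

# 4. Connect clusters from different bins that share datapoints.
for u, v in combinations(node_members, 2):
    if u[0] != v[0] and node_members[u] & node_members[v]:
        graph.add_edge(u, v)

print(graph)
```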
Let’s look at how Lum et al. apply this to the hand-shaped point cloud shown above.
- The lens is the horizontal direction, and is indicated using a colour scale.
- The data is binned (with overlaps) across this lens.
- The data is clustered within each bin (leading to 3 clusters/fingertips in the left-most bin, the addition of a cluster for the tip of the little finger in the second bin, etc.), and these clusters are connected if they share datapoints.
As you can see, the resulting network captures the overall shape of the hand in a compressed representation (i.e. with fewer datapoints).
3.1.1. Choosing parameters
Using the mapper approach, we have to make choices regarding several parameters: the filter, the cover (the bin size and the overlap between the bins), and the clustering algorithm.
Filter
The lens or filter has an immense impact on the resulting network. In the example below, imagine a data cloud shaped like a pair of trousers, investigated using two different lenses: one looks at the data horizontally, the other vertically. One lens will result in a Y-shaped network (i.e. a network with 3 flares), whereas the other will create a network consisting of two connected loops.
Source: Anthony Bak, Ayasdi
This example shows that it is crucial to consider the most important (or interesting) lenses; a non-optimal choice will result in a network without useful shapes.
In general, there are three main types of filter (a code sketch of all three follows the list):
- projection: With a projection, you use an existing or derived feature of the data to spread the datapoints. This can for example be a specific feature/column of the dataset (e.g. the expression of a specific gene in a transcriptomics dataset), but also the first principal component or eigenvalues.
- distance from norm: We often use a distance from the norm if we want to identify and understand outliers, or to create extruding flares in our visualisation. We basically pull apart those datapoints that are outliers and would become the tips of flares, whereas "common" datapoints remain in the centre. Typically, we look at eccentricity, which we define as
\[\mathrm{eccentricity}(i) = \max_j d(x_i, x_j)\]
- density: Finally, we can also look at the local density of the data cloud, using knn-density. In this case, we take the distance to the k-th nearest neighbour as an (inverse) measure of density.
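As a rough sketch (with illustrative names and toy data, assuming numpy, scipy and scikit-learn are available), the three filter types could be computed as follows:

```python
# Sketch of the three filter types for a data matrix `data` (rows = datapoints).
# Toy data and parameter values are arbitrary.
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.decomposition import PCA

data = np.random.default_rng(1).normal(size=(300, 5))
dist = cdist(data, data)                       # pairwise Euclidean distances

# Projection: an existing column, or the first principal component.
lens_column = data[:, 0]
lens_pca = PCA(n_components=1).fit_transform(data).ravel()

# Distance from norm: eccentricity(i) = max_j d(x_i, x_j).
lens_eccentricity = dist.max(axis=1)

# Density: distance to the k-th nearest neighbour as an inverse density measure.
k = 10
lens_knn_density = np.sort(dist, axis=1)[:, k]
```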
Cover
The cover determines the overlapping intervals across the lens, in which we’ll do the clustering. Here we have to decide on the number of intervals (or conversely: the size of the intervals) and the amount of overlap between them. With larger intervals we get fewer nodes in our eventual TDA network; with larger overlaps we generate more edges than with small overlaps.
We can decide to use a uniform cover, where all bins have the same size, or a balanced cover, where all bins contain the same number of datapoints.
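A minimal sketch of the difference, assuming a 1-D lens stored in a numpy array: a uniform cover splits the lens range into equally wide intervals, whereas a balanced cover places the interval boundaries at quantiles of the lens values (the overlap between the resulting intervals still has to be added around these boundaries).

```python
# Sketch: uniform vs balanced cover boundaries along a 1-D lens (illustrative only).
import numpy as np

lens = np.random.default_rng(2).normal(size=1000)
n_intervals = 10

# Uniform cover: intervals of equal width along the lens range.
uniform_edges = np.linspace(lens.min(), lens.max(), n_intervals + 1)

# Balanced cover: intervals containing (roughly) equal numbers of datapoints,
# i.e. boundaries at quantiles of the lens values.
balanced_edges = np.quantile(lens, np.linspace(0, 1, n_intervals + 1))
```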
Clustering
The clustering within each bin works exactly as it would on any other dataset; any standard clustering algorithm can be used.
There are different implementations of the mapper algorithm. These include giotto-tda and the Kepler Mapper from the scikit-tda suite.
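For example, with Kepler Mapper the analysis sketched above can be expressed roughly as follows; the parameter values are arbitrary choices, and the API may differ slightly between versions.

```python
# Rough usage sketch of Kepler Mapper (parameter values are arbitrary choices).
import numpy as np
import kmapper as km
from sklearn.cluster import DBSCAN

data = np.random.default_rng(0).normal(size=(500, 2))    # toy point cloud

mapper = km.KeplerMapper(verbose=0)
lens = mapper.fit_transform(data, projection=[0])         # filter: first coordinate
graph = mapper.map(lens, data,
                   cover=km.Cover(n_cubes=10, perc_overlap=0.3),
                   clusterer=DBSCAN(eps=0.5, min_samples=3))
mapper.visualize(graph, path_html="mapper_output.html")   # interactive HTML view
```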
3.2. The STAD algorithm
Work by Daniel Alcaide, Jelmer Bot and Jan Aerts takes another approach to creating these networks. A description can be found in their 2021 paper "Spanning Trees as Approximation of Data Structures". The Python implementation is available at https://gitlab.com/vda-lab/pystad2.
In this approach we approximate the distances in the high-dimensional space with distances in graph space. For a dataset \(X\) with \(n\) elements in \(\mathbb{R}^m\), the methodology (pictured below) follows the steps below (a rough code sketch follows the list):
1. Create the distance matrix \(D_x\) of size \(n \times n\) in the original high-dimensional space, containing all distances \(d_{ij}\). The distance metric used for this matrix can be generic, like the Euclidean distance (for lower-dimensional spaces) or the cosine distance, but can also be a custom distance metric that is specific to the dataset (we will provide an example below). We consider this distance matrix as a fully connected, undirected, weighted graph \(G_x\) where each node represents a datapoint, and the edge between two datapoints \(x_i\) and \(x_j\) has a weight corresponding to the distance \(d_{ij}\).
2. Build the minimum spanning tree (MST) of this complete weighted graph. An MST is a subset of \(n-1\) edges that connects all vertices without loops and with a minimal total sum of edge weights. Once we have the MST, we consider the "distance" between two nodes (i.e. datapoints) to be the path length between them. In other words, we convert the MST with weighted edges into a unit-distance graph \(U\) in which each edge gets a weight of 1. We consider the MST as the STAD network with zero additional edges, \(U_0\).
3. Calculate the distance matrix of the unit-distance graph, i.e. a full distance matrix of the graph based on path lengths.
4. Evaluate an objective function: calculate the correlation between the original distance matrix in high-dimensional space and the distance matrix based on graph \(U_i\). We want to maximise this correlation.
5. Add edges to the unit-distance graph: based on the list of edges sorted by their original weight, add edges to network \(U_i\).
6. Repeat from step 3 until the objective function is maximised.
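A rough, greedy sketch of this procedure is shown below. It is not the authors' pystad2 implementation, which optimises the number of added edges more carefully; all names, the greedy loop and the stopping criterion (a fixed edge budget) are assumptions.

```python
# Rough, greedy sketch of the STAD idea: approximate high-dimensional distances
# with path lengths in a unit-distance graph built on top of the MST.
import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse.csgraph import minimum_spanning_tree, shortest_path

def stad_sketch(data, max_extra_edges=200):
    d_x = cdist(data, data)                            # step 1: distance matrix D_x
    mst = minimum_spanning_tree(d_x).toarray() > 0     # step 2: MST ...
    adj = np.logical_or(mst, mst.T).astype(float)      # ... as unit-distance graph U_0

    n = len(data)
    iu = np.triu_indices(n, k=1)
    # Candidate edges not yet in the network, sorted by their original weight (step 5).
    candidates = sorted((d_x[i, j], i, j) for i, j in zip(*iu) if not adj[i, j])

    def objective(a):
        # Steps 3-4: path-length distances in the graph vs. the original distances.
        d_u = shortest_path(a, unweighted=True)
        return np.corrcoef(d_u[iu], d_x[iu])[0, 1]

    best, best_adj = objective(adj), adj.copy()
    for _, i, j in candidates[:max_extra_edges]:       # step 6: keep adding edges
        adj[i, j] = adj[j, i] = 1.0
        score = objective(adj)
        if score > best:
            best, best_adj = score, adj.copy()
    return best_adj, best                              # network with the best correlation

# Example: network, corr = stad_sketch(np.random.default_rng(0).normal(size=(150, 4)))
```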
For more specific details, see the paper mentioned above.
3.2.1. Custom distance metrics
Although this approach relies on fewer parameters than the mapper algorithm, the choice of distance metric does have an important impact: how do you define the "distance" between two datapoints (e.g. patients)? In general, for a high-dimensional space with numerical data, the cosine distance might be a good first attempt. One example of a custom distance metric concerns the comparison of intensive care unit (ICU) patients based on their diagnoses. Consider the following two patients:
| | Patient X | Patient Y |
|---|---|---|
| 1 | Infection & inflammatory reaction due to implant | Unspecified intracranial hemorrhage |
| 2 | Sepsis | Cerebral artery occlusion |
| 3 | Urinary tract infection | Sepsis |
| 4 | Unspecified essential hypertension | Iatrogenic cerebrovascular infarction |
| 5 | Urinary tract infection | |
| 6 | Unspecified essential hypertension | |
Each patient has a list of diagnoses, ranked from more to less important. To define a distance between them, we need to consider both the presence/absence of diagnoses and their position within the list. A possible match score \(M\) between any two concepts \(c_A\) and \(c_B\), which contributes to the similarity score, could be
\[M(c_A, c_B) = \ln\left(1 + \frac{1}{\max(\mathrm{position}(c_A), \mathrm{position}(c_B))}\right)\]
The total similarity between two patients is then the sum of the individual contributions of all matched concepts:
\[S(X,Y) = \sum_{c \in X \cap Y} M(c_X, c_Y)\]
where \(c_X\) and \(c_Y\) denote the occurrences of the shared concept \(c\) in the lists of patient X and patient Y, respectively.
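As a toy illustration of these formulas (with shortened diagnosis labels and an assumed 1-based position convention), the match and similarity scores for the two example patients could be computed as follows:

```python
# Toy sketch of the ranked-diagnosis similarity for the two example patients.
import math

patient_x = ["infection due to implant", "sepsis", "urinary tract infection",
             "essential hypertension", "urinary tract infection", "essential hypertension"]
patient_y = ["intracranial hemorrhage", "cerebral artery occlusion", "sepsis",
             "iatrogenic cerebrovascular infarction"]

def match(pos_a, pos_b):
    # M(c_A, c_B) = ln(1 + 1 / max(position(c_A), position(c_B))), positions 1-based
    return math.log(1 + 1 / max(pos_a, pos_b))

def similarity(a, b):
    # Sum the contributions of every concept that appears in both lists,
    # using the first (highest-ranked) occurrence in each list.
    shared = set(a) & set(b)
    return sum(match(a.index(c) + 1, b.index(c) + 1) for c in shared)

print(similarity(patient_x, patient_y))   # only "sepsis" matches: ln(1 + 1/3)
```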
Using this approach on the MIMIC-III ICU dataset, we were able to identify interesting patient subpopulations for those who have "alcohol withdrawal delirium" as one of their diagnoses.
Source: Alcaide D & Aerts J (2021). A visual analytic approach for the identification of ICU patient subpopulations using ICD diagnostic codes, 1-20.
In this network, community 1 corresponds to patients whose primary diagnosis is alcohol withdrawal delirium itself, whereas community 3 is characterised by head injuries and fractures, suggesting that these patients fell or hit their heads while intoxicated.
3.2.2. Lenses
Another feature of the STAD approach is that we can use a lens to incorporate domain-specific knowledge. Consider two datapoints in a dataset with 1,000 dimensions that are identical across 999 of these dimensions, but differ in one. Using a generic distance metric, these datapoints would be considered very similar. However, the one dimension in which they differ might contain important information, which would lead the domain expert to not want these datapoints to be connected in the graph.
In the example below, we have a dataset with 3 dimensions: x, y and colour. We can clearly see a loop and 4 flares, which would be identified using a default topological data analysis. However, we see that the colours within the flares vary as well: from yellow to blue to yellow to red to yellow again. Using a lens on colour, we are able to pull apart these flares into loops as well (i.e. we don’t want blue nodes connected to red ones).
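A minimal sketch of this idea is shown below; it is illustrative only and not the pystad2 API. The lens values are binned, and only pairs of datapoints whose lens values fall in the same or an adjacent bin are allowed to be connected; a circular flag can handle cyclic lenses such as an angle or hue.

```python
# Sketch: use a lens to restrict candidate edges in a STAD-like network.
# Only pairs whose lens values fall in the same or an adjacent bin may be
# connected (all names are illustrative; this is not the pystad2 API).
import numpy as np

def allowed_edges(lens_values, n_bins=10, circular=False):
    edges = np.linspace(lens_values.min(), lens_values.max(), n_bins + 1)
    bins = np.clip(np.digitize(lens_values, edges) - 1, 0, n_bins - 1)
    diff = np.abs(bins[:, None] - bins[None, :])
    if circular:                          # cyclic lens, e.g. an angle or hue
        diff = np.minimum(diff, n_bins - diff)
    return diff <= 1                      # boolean mask over candidate edges
```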