4. TDA and analysis

In the previous section we saw that we can create a network of these datapoints by clustering in overlapping bins, or by minimizing the path-lengths in graph space with the distances in high-dimensional space. But we can also use topology to know if a signal is "real" or not. For the group of points below: are there two or three clusters?

clusters

Or in this example: is there a hole in this dataset?

loops

We can answer these questions using TDA.

4.1. Simplicial complexes

As we are working in topological space, we have to talk about the data structures that represent this space: simplicial complexes. A simplex is a subset of \(\mathbb{R}^n\) that is the convex hull of "\(k\)-points", where \(k \le n+1\). For \(k\) equal to 2,3 and 4, simplices are intervals, triangles, and tetrahedra, respectively.

simplex

A simplicial complex is a collection of these simplices, across the different dimensions. Consider the following simplicial complex:

simplicial complex

This complex is a set of the following simplices:

  • at \(k\) = 1: {a},{b},{c},{d},{e},{f},{g}

  • at \(k\) = 2: {a,b},{a,c},{a,d},{b,c},{b,d},{d,e},{f,g}

  • at \(k\) = 3: {a,b,c}

Simplices at a dimension \(k\) are bounded by simplices at dimension \(k-1\). For example: {a,b} is bounded by {a} and {b}; {a,b,c} is bounded by {a,b}, {a,c}, and {b,c}. Conversely we say that {a} and {b} are the boundaries for {a,b}, that {a,b}, {a,c} and {b,c} are the boundaries for {a,b,c}, etc. We can describe the topology of a complex based on the number of its connected components, loops, and voids.

4.2. Building simplicial complexes

Given a point cloud ({a},{b},{c},{d},{e},{f},{g}), how to we create such simplicial complexes? Many times, our data or network cannot be immediately thought of as a complex. However, we can generate one based on a collection of data points and a notion of similarity or distance between these points. The Vietoris-Rips (VR) complex is a much-used approach. It starts with data in a metric space and a fixed non-negative parameter \(r\) (radius). If two vertices are close enough, their disks with radius \(r\) will overlap and the VR complex will have an edge between those vertices. Similarly, if there are three vertices lcose enough so that the disks of every pair of them overlap, then the VR complex will have a triangle betwen those three vertices. Every time we have a triangle, we also have the three edges ("boundaries") that make the border of this triangle, and vice versa.

vietoris rips complex 2

Notice that the image above (with edges \(a\) to \(g\)) is not a Vietoris-Rips complex. If it were, than the triangles {a,b,c}, {a,c,d} and {b,c,d} would also be filled in. In contrast to the Vietoris-Rips complex, Çech complex does not automatically fill in each triangle when the sides exist, but only when there is a point of triple intersection. The Çech complex is interesting mathematically, but harder to compute that the VR complex.

Different values for \(r\) will generate different Vietoris-Rips complexes. A filtration of a simplicial complex, K, is a nested sequence of subcomplexes starting at the empty set and ending with the full simplicial complex, i.e.

\[ K_{r_1} \subset K_{r_2} \subset K_{r_3} \subset K_{r_4} \subset K_{r_5} \]

Every simplex that is present at an early stage in the filtration is also present at all later stages.

vr filtration

4.3. Persistent homology

Doing this filtration, we see that connected components get merged, holes are created and filled in. We can use this to identify those signals that are stable, i.e. are present across a large range within the filtration.

vr filtration clusters

In the image above, every datapoint starts as a separate connected component (\(r\) = 5). At \(r\) = 10, some of the datapoints get connected so that the number of connected components decreases. At \(r\) of 15 we already only have three components. These stay separate until \(r\) = 25 where the two groups at the top-right merge, leaving us with 2 components. From then it takes until \(r\) equal to 75 when the two are finally joined into a single connected component. We can consider the two groups in the top-right pretty persistent, and the separation from the bottom-left even more.

vr filtration holes

We can do the same thing for holes/loops. In this image, a first hole appears at radius \(r\) 35, and a second one at \(r\) = 65. They both disappear at \(r\) = 85.

4.3.1. Barcode

We can describe what happens across such filtration (i.e. when does a feature like a connected component or hole appear and disappear again) using persistence diagrams and so-called Betti numbers. The \(d\)th Betti number (after the Italian mathematician Enrico Betti) counts the number of \(d\)-dimensional holes: \(\beta_0\) refers to connected components, \(\beta_1\) to tunnels, and \(\beta_2\) to tunnels.

The simplest way of representing these is using a barcode showing the filtration at the bottom (i.e. the change in radius \(r\) and the Betti numbers on the vertical axis, as shown below. Red lines represent elements for \(\beta_0\) (i.e. connected components); blue lines represent elements for \(\beta_1\) (i.e. loops).

barcode2

Adapted from: Chazal F & Michel B (2021). An Introduction to Topological Data Analysis: Fundamental and Practical Aspects for Data Scientists. Frontiers in Artificial Intelligence, 4(September), 1-28

In panel A, each datapoint is separate so the number of connected components is the number of datapoints. These connected components (n=14) are represented by the 14 red lines in the barcode. In panel B, many of these points have merged into 5 connected components; the others have died. Panel C is the filtration at a stage where two loops exist (Betti number \(\beta_1\) = 2); one started a little earlier than the other. All datapoints are now also connected. In panel D we see that one of the loops has already died. Finally, panel E shows us that no loops exist anymore, and there only remains a single connected component.

The important message to take here, is that long lines correspond to persistent signals and are therefore more "real".

4.3.2. Calculating persistence

It’s nice to do this visually, but how do we compute persistence? Consider the filtered (as in: having a filtration) simplicial complex below. We can create a boundary matrix for this filtration. We’ll see that by performing column reduction on this matrix we get the persistent signals.

persistent homology

For the boundary matrix \(D\), we order all simplices along the filtration; within each filtration, we order along their dimensionality. In other words: simplex \(\sigma_1\) comes before simplex \(\sigma_2\) if:

  • \(\sigma_1\) is born before \(\sigma_2\), or

  • \(\sigma_1\) is a face of \(\sigma_2\).

The values in this matrix are set as follows:

\[ D_{ij} = \begin{cases} 1, \text{if \(\sigma_1\) is a face of \(\sigma_2\)}\\ 0, \text{otherwise} \end{cases} \]

In the first step of the filtration, we see the appearance of 0-d simplices (points) {a}, {b}, {c} and {d}; in the second we see {e}, and the 1-d simplices (lines) {ab} and {ad}; etc. These simplices are put in the same order in the rows and columns.

boundary matrix

In column reduction of this matrix, we search for pivot points by adding columns together using the following algorithm:

for col = 1 to n do
    while exists prev_col < col with low(prev_col) = low(col) != 0 do
        add prev_col to col
    end while
end for

In other words:

  • We go through the columns from left to right.

  • We check if the lowest 1 in the column is also the first one in its row. If so: go to next column.

  • If not: add the column with that previous 1 to the column you’re considering, using base 2 (i.e. 1+1 = 0).

  • Indicate in the column header which other column was added.

  • Check again (for the same column!) if the lowest 1 is the first in its row and add additional columns if necessary.

persistence calculation 2

In the example above, we have already gone through columns {a}, {b}, {c}, {d}, {e}, {ab}, and {ad}, and indicated the pivot 1 's for the latter two columns. So we are now considering column {bd}, and see that the lowest 1 is preceded with a 1 in column {ad}. We add column {ad} to column {bd} and get the situation in panel 2. We again check for the lowest 1 and see that it is preceded by a 1 in column {ab} so we add that column as well, making sure that we record this transformation in the column header. This will give us the situation in panel 3. Panel 4 shows us the final reduced matrix. We indicate the lowest 1 in each column and indeed see that each is on its own row.

Now how do we get from this reduced matrix (panel 4 above, and the matrix below) to the persistence?

persistence calculation columns

We look at the columns that have empty columns. Why? Because these are the simplices that have no boundaries, or - in other words - represent holes (which is what we were looking for in the first place). The columns in the matrix represent when a simplex was born; the rows when they died.

Let’s go from left to right again.

  • Column {a}: simplex {a} is born in step 1 of the filtration (i.e. in column-block 1). If we look at the row for {a}, we do not see a pivot 1 in that row, meaning that this simplex does not die.

  • Column {b}: The column for simplex {b} is in the first filtration step, so that is where it is born. Looking at the row for {b}, we see a pivot in step 2, in the {ab} column. This means that {b} dies in step 2, because it is linked in {ab}.

  • …​

  • Column {ab}{ad}{bd}: These 3 edges create a one-dimensional hole/triangle (A to B to D to A). They are born in step 3 with the appearance of {bd}, the right-most edge of the three in the original matrix, so the last one that was added. If we look up {bd} in the rows, we see that the triangle {ab}{ad}{bd} dies in step 5 because it is filled with the 2d simplex {abd}.

  • …​

  • Column {ab}{ad}{be}{de}: Notice that the hole created with these four edges is immediately killed in the same filtration step when {de} becomes part of {bde}.

Based on this information we can create the following barcode:

barcode example

Red bars are 0-d holes (i.e. connected components); blue bars are 1-d holes (i.e. loops). At the left of each bar we’ve added the simplex that gave rise to that hole; at the right the simplex that killed it.

4.3.3. Types of plots - Representing persistent features

The way that these persistent features are most often represented is not using the barcodes, but using a persistence diagram, which is basically a rebased version of the barcode with birth on the horizontal axis and death on the vertical. Below we show the persistence diagram for the simple simplicial complex filtration from above.

persistence diagram

Any point that is far away from the diagonal persists for a longer time; anything close to the diagonal can be considered noise. Note that for the holes corresponding to the connected component that started with {a} and the loop {bc}{be}{ce} do not die. You will see these often placed on a dashed line above diagram.

Because the plot above is so sparse it can give a wrong impression of what to expect. Here’s a slightly more realistic diagram:

persistence diagram giotta

Source: giotto-ai.github.io

In this diagram, we see two purple 2D holes (or voids) that are far from the diagonal, so are persistent. Some of the green 1D holes (or loops) also seem interesting.