Topological Data Analysis Intro

summary

Start with some points. Determine a rule for incrementally connecting more and more of the points with lines, planes, solids, and so on. As points become connected, geometric structure emerges. As even more points become connected, geometric structure disappears. We quantify this structure, group it into classes, and observe when each class is born and dies. Lastly we compare the birth-death plots of various datasets to determine if the datasets share underlying structural similarities.

more-detailed-but-still-high-level summary

Topological Data Analysis (TDA) encompasses a variety of techniques that "connect the dots", thereby creating shapes whose topology can be understood. As a simple example, we can cover each of the dots in a point cloud with a circle of a given radius. As we increase that radius, we'll produce a shape with fewer and fewer "holes" until we're left with a mass with no holes. When the radii are too small to overlap, we gain no additional information. When the radii are large enough to encompass the entire point cloud, we lose all information. TDA aims to find the sweet-spot radius that will provide useful information about our data.
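
To make the merging concrete: two balls of radius \(r\) overlap exactly when their centers are within \(2r\) of each other, so we can watch components merge with a union-find. A minimal Python sketch (the function name `count_components` and the toy data are mine, not from any TDA library):

```python
import numpy as np


def count_components(points, radius):
    """Count connected components of the union of balls of the given
    radius centered at each point. Two balls overlap exactly when
    their centers are within 2 * radius of each other."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(points[i] - points[j]) <= 2 * radius:
                parent[find(i)] = find(j)  # merge the two components

    return len({find(i) for i in range(n)})


# Ten points around a circle: tiny radii see ten separate blobs,
# larger radii see one connected mass.
theta = np.linspace(0, 2 * np.pi, 10, endpoint=False)
circle = np.column_stack([np.cos(theta), np.sin(theta)])
for r in [0.1, 0.35, 1.0]:
    print(r, count_components(circle, r))  # 10, then 1, then 1
```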

In practice, most interesting datasets are complex enough that the radius technique (described above) falls short. It doesn't provide a rigorous way to determine which radius creates a topological space that best represents the underlying data. Additionally, it doesn't distinguish between major and minor features in the data. As a result, mathematicians have developed more advanced analysis techniques. One such technique is persistent homology, whose core idea is to examine how the topological space changes as the radius changes. This approach smooths over noise in the data, as geometric structures created by noise will disappear quickly as the radius changes, while the structures created by the signal will be long-lived.
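
Persistence is easiest to see in dimension 0, where every point is born at \(\delta = 0\) and a component dies when it merges into another. A hedged sketch of that special case, using a Kruskal-style union-find over sorted pairwise distances (all names are mine):

```python
import numpy as np


def zero_dim_persistence(points):
    """Birth-death pairs for connected components (dimension 0).
    Every component is born at delta = 0; when two components merge,
    one of them dies at the merging delta. The last surviving
    component never dies."""
    n = len(points)
    # Sorted pairwise distances: the deltas at which edges appear.
    edges = sorted(
        (float(np.linalg.norm(points[i] - points[j])), i, j)
        for i in range(n)
        for j in range(i + 1, n)
    )
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    pairs = []
    for delta, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                       # this edge merges two components
            parent[ri] = rj
            pairs.append((0.0, delta))
    pairs.append((0.0, float("inf")))      # the component that survives
    return pairs


points = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 0.0]])
print(zero_dim_persistence(points))  # [(0.0, 1.0), (0.0, 4.0), (0.0, inf)]
```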

Distance is implicit in our discussion of radii. Said differently, in order to be able to draw circles of a given radius around points, we need to have defined a way to measure the length of a line originating from any point in the data. (As a philosophical aside, is this actually the same as being able to measure the distance between two points?) Thus we will always need a well-defined distance as we explore data topologically.
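
In code, one way to keep the distance explicit is to pass the metric in as a parameter. A minimal sketch, assuming points are NumPy arrays (names like `pairwise_distances` are mine):

```python
import numpy as np


def euclidean(p, q):
    return float(np.linalg.norm(p - q))


def manhattan(p, q):
    return float(np.sum(np.abs(p - q)))


def pairwise_distances(points, metric=euclidean):
    """Distance matrix under a caller-supplied metric, so nothing
    downstream silently assumes Euclidean distance."""
    n = len(points)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = metric(points[i], points[j])
    return dist
```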

To more formally study the geometric shape of data, we use simplicial complexes. For an \(n\)-dimensional set of data points, we can connect any two points with a line, connect any three points with a plane, connect any four points with a solid, and so on. Any \(k+1\) (affinely independent) points comprise a \(k\)-simplex, which generalizes the progression from a point to a line to a triangle to a tetrahedron to a 5-cell to something even harder to visualize. A collection of simplices is a simplicial complex \(K\) if:

  1. Every face of a simplex in \(K\) is in \(K\)
  2. Every nonempty intersection of simplices \(A\) and \(B\) in \(K\) is a face of both \(A\) and \(B\)

Counterexample to 1 (but not to 2): The singleton set \(X = \{ triangle \}\), where the triangle consists of three points, the edges that connect them, and the area inside, is not a simplicial complex because none of the triangle's faces (its three edges and three vertices) is a member of \(X\).

Counterexample to 2 (but not to 1): Draw four points in a square. Draw two lines connecting the diagonally opposite points (you've drawn an "X"). Each line is a simplex. Their intersection (at the center of the square) is not a face of either simplex.

Failed Counterexample to 2: (I was intersecting complexes and not simplices.) Complex \(A\): Draw four points; draw lines connecting one point \(c\) to each of the other three, \(x\), \(y\), \(z\). Complex \(B\): Draw a fifth point \(w\); draw a line connecting \(c\) and \(w\); draw lines connecting \(c\) to \(y\) and \(c\) to \(z\). The intersection \(A \cap B\) consists of \(c\), \(y\), \(z\) and the two connecting lines, which is a subcomplex but not a face of any single simplex. Requirement 2 constrains intersections of simplices, not intersections of complexes.
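
Requirement 1 is purely combinatorial, so we can check it mechanically if we represent each simplex as its set of vertices. A small sketch (requirement 2 depends on where the simplices sit geometrically, so this check cannot see it; names are mine):

```python
from itertools import combinations


def faces(simplex):
    """All nonempty faces of a simplex, the simplex itself included,
    with each simplex represented as a frozenset of vertices."""
    return {
        frozenset(c)
        for k in range(1, len(simplex) + 1)
        for c in combinations(simplex, k)
    }


def is_closed_under_faces(complex_):
    """Requirement 1: every face of every simplex is present."""
    return all(faces(s) <= complex_ for s in complex_)


# The singleton {triangle} from the first counterexample:
triangle = frozenset({"p", "q", "r"})
print(is_closed_under_faces({triangle}))       # False: edges and vertices missing
print(is_closed_under_faces(faces(triangle)))  # True once all faces are added
```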

We can now construct an interesting simplicial complex: the Vietoris-Rips complex. Given a dataset consisting of \(n\)-dimensional points, a metric (so we have distance), and a fixed value \(\delta > 0\), connect all pairs of points that are within \(\delta\) of each other. Whenever three or more points are pairwise connected, we additionally include the higher-dimensional simplex they span. The collection of all such simplices is the Vietoris-Rips complex. By construction, requirements 1 and 2 are satisfied: any subset of a pairwise-close point set is itself pairwise close, so every face of a simplex is present (requirement 1), and the intersection of two simplices is spanned by their shared vertices, which therefore already span a simplex that is a face of both (requirement 2).
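
A hedged sketch of the construction, capped at dimension 2 to keep the combinatorics small (the name `vietoris_rips` and the brute-force enumeration are mine, not a library API):

```python
from itertools import combinations

import numpy as np


def vietoris_rips(points, delta, max_dim=2):
    """Vietoris-Rips complex as a set of frozensets of point indices:
    every simplex (up to max_dim) whose vertices are pairwise within
    delta of each other."""
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    simplices = {frozenset({i}) for i in range(n)}
    for k in range(2, max_dim + 2):  # k vertices span a (k - 1)-simplex
        for combo in combinations(range(n), k):
            if all(dist[i, j] <= delta for i, j in combinations(combo, 2)):
                simplices.add(frozenset(combo))
    return simplices


# Ten points around a circle with a delta that connects neighbors only:
# 10 vertices + 10 edges and no triangles, i.e. a loop.
theta = np.linspace(0, 2 * np.pi, 10, endpoint=False)
circle = np.column_stack([np.cos(theta), np.sin(theta)])
print(len(vietoris_rips(circle, delta=0.7)))  # 20
```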

How do we determine which value(s) of \(\delta\) create a complex that exhibits the underlying geometric structure of the data? We use persistence diagrams. A persistence diagram illustrates which topological features of a dataset persist as \(\delta\) varies. In particular, we consider the homology of the complex as \(\delta\) varies. Homology provides us with a formalized way to quantify the shape of a space in each dimension. For example, the (hollow) torus has, modulo homotopy smooshing, one connected component, two independent loops, and one enclosed void. These counts are called Betti numbers: \(\beta_0 = 1\), \(\beta_1 = 2\), \(\beta_2 = 1\). Each Betti number counts the equivalence classes of components, loops, voids, and their higher-dimensional analogues.
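
To make "quantify the shape in each dimension" concrete: working over the two-element field, \(\beta_k\) equals the number of \(k\)-simplices minus the ranks of the two adjacent boundary matrices. A sketch under that simplification (all names are mine):

```python
from itertools import combinations


def boundary_matrix(k_simplices, km1_simplices):
    """Mod-2 boundary matrix: entry [i][j] is 1 exactly when the i-th
    (k-1)-simplex is a face of the j-th k-simplex."""
    index = {s: i for i, s in enumerate(km1_simplices)}
    mat = [[0] * len(k_simplices) for _ in km1_simplices]
    for j, s in enumerate(k_simplices):
        for face in combinations(sorted(s), len(s) - 1):
            mat[index[frozenset(face)]][j] = 1
    return mat


def rank_mod2(mat):
    """Rank over the two-element field, by Gaussian elimination."""
    mat = [row[:] for row in mat]
    rank = 0
    for col in range(len(mat[0]) if mat else 0):
        pivot = next((r for r in range(rank, len(mat)) if mat[r][col]), None)
        if pivot is None:
            continue
        mat[rank], mat[pivot] = mat[pivot], mat[rank]
        for r in range(len(mat)):
            if r != rank and mat[r][col]:
                mat[r] = [a ^ b for a, b in zip(mat[r], mat[rank])]
        rank += 1
    return rank


def betti_numbers(complex_, max_dim=1):
    """beta_0..beta_max_dim of a complex given as frozensets of vertices:
    beta_k = (# of k-simplices) - rank(boundary_k) - rank(boundary_{k+1})."""
    by_dim = {d: [s for s in complex_ if len(s) == d + 1]
              for d in range(max_dim + 2)}
    ranks = {}
    for d in range(1, max_dim + 2):
        if by_dim[d] and by_dim[d - 1]:
            ranks[d] = rank_mod2(boundary_matrix(by_dim[d], by_dim[d - 1]))
    return [len(by_dim[d]) - ranks.get(d, 0) - ranks.get(d + 1, 0)
            for d in range(max_dim + 1)]


# A hollow triangle: one component, one loop.
triangle_boundary = (
    [frozenset({v}) for v in "pqr"]
    + [frozenset(e) for e in [("p", "q"), ("q", "r"), ("p", "r")]]
)
print(betti_numbers(set(triangle_boundary)))  # [1, 1]
```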

We are still on our way to understanding the persistence diagram. Let's consider all the classes of \(k\)-dimensional "loops". As \(\delta\) increases from 0, classes will emerge and classes will disappear. The \(\delta\) value at which a class emerges is called its birth value \(b\), and the \(\delta\) value at which a class disappears is called its death value \(d\). For each class, we have a tuple \((b,d)\). What can we do with tuples? Graph them! This graph is the persistence diagram. Traditionally \(b\) is plotted on the horizontal axis and \(d\) is plotted on the vertical axis. The further a point is from the line \(y=x\), the longer its class persists. Thus the data's noise lives near the diagonal and the data's signal lives above the diagonal (for many applications).
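
A minimal plotting sketch, assuming matplotlib and finite \((b,d)\) pairs (infinite deaths are simply dropped here; the function name is mine):

```python
import matplotlib.pyplot as plt


def plot_persistence_diagram(pairs):
    """Scatter the finite (birth, death) pairs above the diagonal;
    points far from the y = x line are the long-lived features."""
    finite = [(b, d) for b, d in pairs if d != float("inf")]
    hi = 1.1 * max((d for _, d in finite), default=1.0)
    plt.plot([0, hi], [0, hi], linestyle="--", color="gray")  # y = x
    plt.scatter([b for b, _ in finite], [d for _, d in finite])
    plt.xlabel("birth")
    plt.ylabel("death")
    plt.title("persistence diagram")
    plt.show()
```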

Now that we have a handful (or maybe many more!) of significant \((b,d)\) pairs for a given dataset, we can compare two datasets to determine how geometrically similar they are. We make this comparison by matching points in one persistence diagram with points in the other persistence diagram. Two such ways to measure the distance (as a single value!) between two persistence diagrams are the bottleneck distance and the Wasserstein distance.
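
For tiny diagrams we can compute the bottleneck distance by brute force: pad each diagram with the other's projections onto the diagonal so a perfect matching always exists, then minimize the largest matched \(\ell_\infty\) distance over all matchings. A sketch (exponential in diagram size, so illustrative only; names are mine):

```python
from itertools import permutations


def bottleneck_distance(diagram_a, diagram_b):
    """Bottleneck distance between two small diagrams, given as lists
    of finite (birth, death) pairs. Each point may match a point of
    the other diagram or the diagonal."""
    def proj(p):  # nearest point on the diagonal y = x
        m = (p[0] + p[1]) / 2.0
        return (m, m)

    def cost(p, q):
        if p[0] == p[1] and q[0] == q[1]:
            return 0.0  # diagonal-to-diagonal matches are free
        return max(abs(p[0] - q[0]), abs(p[1] - q[1]))

    a = list(diagram_a) + [proj(q) for q in diagram_b]
    b = list(diagram_b) + [proj(p) for p in diagram_a]
    return min(
        max(cost(p, q) for p, q in zip(a, perm))
        for perm in permutations(b)
    )


print(bottleneck_distance([(0.0, 4.0)], [(0.0, 3.5), (1.0, 1.2)]))  # 0.5
```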

Do we know that this distance is useful? We do, thanks to the stability theorem, which bounds the distance between two persistence diagrams by the distance between the two corresponding datasets. In other words, a small perturbation of the data can only perturb the diagram by a small amount, so the diagram is a trustworthy summary.
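
In one common formulation (the function version due to Cohen-Steiner, Edelsbrunner, and Harer; the point-cloud statement is analogous, and the details vary by setting):

\[ d_B\big(\mathrm{Dgm}(f), \mathrm{Dgm}(g)\big) \le \| f - g \|_\infty \]

where \(d_B\) is the bottleneck distance.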

degrees of freedom (what choices can we make)

a. How do we draw the complex? Vietoris-Rips or something else?

b. How do we measure the distance between the points in the persistence diagram? Bottleneck distance? Wasserstein distance? etc.

c. Which distance do we use between the data points themselves? Euclidean distance is not always the best tool for the task (see the sketch below)! (Thanks to Hubert Wagner for pointing this out!)
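
A toy illustration of (c): the same point cloud and the same \(\delta\) produce different edge sets under different metrics (the data here is random and hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.random((20, 2))
delta = 0.3


def edge_count(metric):
    """Number of Vietoris-Rips edges at this delta under a metric."""
    return sum(
        metric(points[i] - points[j]) <= delta
        for i in range(len(points))
        for j in range(i + 1, len(points))
    )


print("euclidean:", edge_count(lambda v: float(np.linalg.norm(v))))
print("manhattan:", edge_count(lambda v: float(np.abs(v).sum())))
```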

next up

barcodes, linear-algebra-based computation, filtrations, persistent homology modules, a module as a direct sum of barcode lines