Overview

Diffusion geometries refer to the large-scale geometry of a manifold or a graph representing a data set, which is determined by long-time heat flows on the manifold/graph/data set. A multiscale analysis, that includes all scales and not just large scales, is possible with diffusion wavelets. Code for computing proximity graphs, Laplacians, diffusion wavelets is available here. This behavior can be accurately described by the bottom eigenfunctions of the Laplacian operator on the manifold or on the graph (or data set). Moreover, these eigenfunctions can be used to define an embedding (sometimes called eigenmap) of the manifold or graph into low-dimensional Euclidean space, in such a way that Euclidean distances in the range of the eigenmap correspond to “diffusion distances” on the manifold or graph. Techniques based on eigenvectors of similarity matrices (which are related to the Laplacian have been used successfully for some time in several problems in data analysis, segmentation, clustering, etc... These connections, and connection to differential-geometric structures of manifolds, and problems related to the stability, computability, and extendibility of these eigenfunctions has been explored extensively in Stephane Lafon's thesis. Further material, including talks and papers, are available on Stephane's web-page. The link below points to a talk I gave some time ago at l'Escorial, the first part of which is about these Diffusion Geometries:Diffusion Geometries, Diffusion Wavelets and Harmonic Analysis of Large Data Sets, talk given at the International Harmonic Analysis Conference a l'Escorial, Madrid, May 2004.

Papers

ACHA special issue on Diffusion Maps and Wavelets are a good starting point. The short papers (1,2) that appeared in P.N.A.S. are a short introduction to the main ideas. A general framework for adaptive regularization based on diffusion processes on graphs, A.D. Szlam, M. Maggioni, R.R. Coifman, to appear in J.M.L.R., August 2006. Contains applications to machine learning, in particular semi-supervised learning, as well as image denoising (in a perhaps suprisingly unifying framework). The paper Multiscale Analysis of Data Sets using Diffusion Wavelets, appeared in Proc. Data Mining for BIOmedical Informatics, in conjuction with the SIAM Conference in Data Mining, contains applications to the analysis of document corpora.

Tutorial and Pictures

Most of the pictures on this web page are courtesy of Stephane Lafon (now at Google) Check his home page for cool demos introducing the basics of diffusion geometries. Geodesic distance ---> Diffusion distance Diffusion distance is more stable, uses a “preponderance of evidence” On the graph of “documents” with similarities there is a natural random walk: we get a Markov chain represented by a matrix P(x,y). If P is symmetric and positive semidefinite, we can define the diffusion distance by Embeds the graph in Euclidean space, up to precision, via the eigenfunctions, mapping diffusion distance into Euclidean distance. For a set of points in Euclidean space, sampled from a Riemannian manifold, one can build a discretized Laplace-Beltrami operator (associated to the canonical Brownian motion constrained on the manifold) and map the manifold with diffusion distance isometrically in Euclidean space. From left to right: points on a spiral in R^3 (but could be R^n for any n) are provided in random order, and are sampled nonuniformly; the embedding with the Laplacian as commonly used; the embedding with the Laplacian normalized a la Beltrami; the density function on the points. The color is used to identify the points across the pictures.
Original data set Embedding of the set with eigenmap

Original data set

Embedding of the set with eigenmap

Original data set

Embedding of the set with eigenmap

Example: document analysis Given 1,000 documents, each of which is associated with 10,000 words, with a value indicating the relevance of each word in that document. View this as a set of 1,000 in 10,000 dimensions, construct a graph with 1,000 vertices, each of which with few selected edges to very close-by documents.