Overview
Diffusion geometries refer to the large-scale geometry of a manifold or a graph representing a data set, which is determined by long-time heat flows on the manifold/graph/data set.
A multiscale analysis, that includes all scales and not just large scales, is possible with
diffusion wavelets.
Code for computing proximity graphs, Laplacians, diffusion wavelets is available
here.
This behavior can be accurately described by the bottom eigenfunctions of the Laplacian
operator on the manifold or on the graph (or data set). Moreover, these eigenfunctions can be used to define an embedding (sometimes called eigenmap) of the manifold or graph into low-dimensional Euclidean space, in such a way that Euclidean distances in the range of the eigenmap correspond to “diffusion distances” on the manifold or graph. Techniques based on eigenvectors of similarity matrices (which are related to the Laplacian have been used successfully for some time in several problems in data analysis, segmentation, clustering, etc...
These connections, and connection to differential-geometric structures of manifolds, and
problems related to the stability, computability, and extendibility of these eigenfunctions has been explored extensively in Stephane Lafon's thesis.
Further material, including talks and papers, are available on
Stephane's web-page.
The link below points to a talk I gave some time ago at l'Escorial, the first part of which is about these Diffusion Geometries:
Diffusion
Geometries, Diffusion Wavelets and Harmonic Analysis of Large Data
Sets, talk given at the International Harmonic Analysis
Conference a l'Escorial, Madrid, May 2004.
ACHA special issue on Diffusion Maps and Wavelets are a good starting point. The short papers (
1,
2) that appeared in
P.N.A.S. are a short introduction to the main ideas.
A general framework for adaptive regularization based on diffusion processes on graphs, A.D. Szlam, M. Maggioni, R.R. Coifman, to appear in
J.M.L.R., August 2006. Contains applications to machine learning, in particular semi-supervised learning, as well as image denoising (in a perhaps suprisingly unifying framework).
The paper
Multiscale Analysis of Data Sets using Diffusion Wavelets, appeared in Proc.
Data Mining for BIOmedical Informatics, in conjuction with the
SIAM Conference in Data Mining, contains applications to the analysis of document corpora.
Most of the pictures on this web page are courtesy of
Stephane Lafon (now at Google) Check his home page for cool demos introducing the basics of diffusion geometries.
Geodesic distance ---> Diffusion distance
Diffusion distance is more stable, uses a “preponderance of evidence”

On the graph of “documents” with similarities there is a
natural random walk: we get a
Markov chain represented
by a matrix P(x,y). If P is symmetric and positive semidefinite, we
can define the diffusion distance by

Embeds the graph in Euclidean space, up to precision, via the eigenfunctions, mapping diffusion distance into Euclidean distance.
For a set of points in Euclidean space, sampled from a Riemannian manifold, one can build a discretized Laplace-Beltrami operator (associated to the canonical Brownian motion constrained on the manifold) and map the manifold with diffusion distance isometrically in Euclidean space.

From left to right: points on a spiral in R^3 (but could be R^n for any n) are provided in random order, and are sampled nonuniformly; the embedding with the Laplacian as commonly used; the embedding with the Laplacian normalized a la Beltrami; the density function on the points. The color is used to identify the points across the pictures.
Original data set
|
Embedding of the set with eigenmap
|
|
Original data set
|
Embedding of the set with eigenmap
|
|
Original data set
|
Embedding of the set with eigenmap
|

Example: document analysis
Given 1,000 documents, each of which is associated with 10,000 words, with a value indicating the relevance of each word in that document.
View this as a set of 1,000 in 10,000 dimensions, construct a graph with 1,000 vertices, each of which with few selected edges to very close-by documents.