The faculty in this group work collaboratively on many different projects, often involving vertically integrated teams of undergraduates, graduates, and post-docs. Here are some current examples.
Gene Expression
Harer and Mukherjee study the yeast cell cycle with Steve Haase in the
Duke Biology Department. They are working to discover the
transcriptional network that underlies the yeast
cell cycle. Gene expression data collected at different times gives a
high dimensional data set, and periodic
genes give points distributed near a circle. This data set is
inherently noisy, includes non-periodic genes
and has missing data. These issues are dealt with by a mixture of
topological, statistical and biological tools.
Software already exists to analyze the data, and collaboration with
biologists at Duke makes this an effective
and interesting project.
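To make the "points near a circle" picture concrete, here is a minimal sketch on simulated data. It is not the group's software or method: it swaps in a plain Fourier projection at an assumed known period, and the 80-minute cycle, 5-minute sampling interval, noise level, and all function names are hypothetical. Each gene's time series is projected onto a cosine and a sine at the cycle frequency; a unit-amplitude periodic gene then lands near the unit circle at an angle equal to its phase, while a non-periodic gene lands near the origin. Missing data is not handled.

```python
import math
import random

PERIOD_MIN = 80.0   # hypothetical cell-cycle length in minutes
DT_MIN = 5.0        # hypothetical sampling interval in minutes
N_SAMPLES = 32      # two full cycles at the settings above


def simulate_gene(periodic, rng, noise_sd=0.2):
    """Return a toy expression series: unit-amplitude cosine plus Gaussian noise."""
    phase = rng.uniform(0.0, 2.0 * math.pi)
    omega = 2.0 * math.pi / PERIOD_MIN
    series = []
    for k in range(N_SAMPLES):
        signal = math.cos(omega * k * DT_MIN - phase) if periodic else 0.0
        series.append(signal + rng.gauss(0.0, noise_sd))
    return series


def circle_coordinates(series, period=PERIOD_MIN, dt=DT_MIN):
    """Project a series onto cosine and sine at the assumed period."""
    n = len(series)
    mean = sum(series) / n
    omega = 2.0 * math.pi / period
    x = 2.0 / n * sum((v - mean) * math.cos(omega * k * dt)
                      for k, v in enumerate(series))
    y = 2.0 / n * sum((v - mean) * math.sin(omega * k * dt)
                      for k, v in enumerate(series))
    return x, y


def radius(series):
    """Distance from the origin; near 1 for a unit-amplitude periodic gene."""
    return math.hypot(*circle_coordinates(series))
```

Thresholding `radius` is one crude way to separate periodic from non-periodic genes in this toy setting; real data would need the mixture of topological, statistical and biological tools described above.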
Mathematics Meets Art History
At Princeton, Daubechies mentored Josephine Wolff, a mathematics major
whose senior thesis explores how
mathematical analysis can capture the distinctive signature style of
different artists. She started from
high-quality images of paintings by Goossen van der Weyden (GvdW), mastered
wavelet analysis, and learned
firsthand the advantages of dual-tree complex wavelet transforms for
image analysis. Josephine looked
at patterns of wavelet coefficients across different scales to capture
signature features, and learned how
techniques from machine learning can be used to generate strong
classifiers. GvdW made extensive use, as
was customary in his time, of preliminary drawings on the panel to be
painted, typically in charcoal or black
ink; these can now be uncovered with infrared reflectography. The style
in these underdrawings has been
hypothesized to be related to whether the master or an apprentice did
the underpainting, and art historians
have developed a classification of these styles. The goal of the
project was to see whether the features of
the visible painting layer could be used to recover the classification
by style, and, using the machine learning
Bloit Boost algorithm, she achieved a remarkably good 90% success rate. Further
validation is under way and plans
are being made by Prof. Martens from Ghent, who provided the original
images, to incorporate this into his
research. This will provide further opportunities for interesting
projects for budding researchers.
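The idea of comparing wavelet coefficients across scales can be illustrated with a much simpler transform than the one used in the thesis. The sketch below is an assumption-laden stand-in: it uses a one-dimensional orthonormal Haar transform rather than a dual-tree complex wavelet transform on images, and the function names are hypothetical. It returns the energy of the detail coefficients at each scale, a feature vector that a classifier could then be trained on.

```python
import math


def haar_step(signal):
    """One level of the orthonormal Haar transform: (approximation, detail)."""
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail


def scale_energies(signal, levels):
    """Energy of the detail coefficients at each scale, finest scale first.

    The signal length must be divisible by 2 ** levels.
    """
    if len(signal) % (2 ** levels) != 0:
        raise ValueError("signal length must be divisible by 2 ** levels")
    energies = []
    approx = list(signal)
    for _ in range(levels):
        approx, detail = haar_step(approx)
        energies.append(sum(d * d for d in detail))
    return energies
```

A rapidly alternating signal such as `[1.0, -1.0] * 8` puts all of its energy in the finest scale, while a single slow step such as `[1.0] * 8 + [-1.0] * 8` puts it in the coarsest of four scales, so the two are easy to tell apart from these features alone.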
Social Networks
Mukherjee is involved in a collaborative project in modeling social
networks with Prof. Banks in Statistical
Science, Prof. Moody in Sociology and Prof. Alberts in Biology.
Geometric approaches are being used
to infer higher-order interactions, hypergraphs, and the dynamics of
these interactions in two data sets:
Wikipedia links, and group/social dynamics in a wild population that
Prof. Alberts has studied. Preliminary
software for this problem exists.
Prof. Willett is also studying social networks, especially their
temporal evolution based on meeting
or communication patterns. Preliminary studies of the email database
for the three years surrounding the
downfall of the notorious Enron Corporation (conducted by an
undergraduate using methods developed by
more senior lab members) show that simply knowing who communicates
with whom is a valuable predictor
of what is being communicated. However, social networks can easily
contain hundreds of thousands of
people or more, resulting in very large data sets and requiring
sophisticated yet scalable analysis methods.
Forest Change
Gelfand is involved in a project modeling forest dynamics on large
spatial scales, e.g., the eastern United
States. Forest change is best understood at the level of the
individual tree. Individual-level survival, growth,
fecundity, flowering, root development, etc., are much studied.
Immediately, all of the foregoing issues
arise in proceeding from individual-level performance to forest-scale
response. Such a project provides
an excellent example of collaboration between stochastic modelers
(statisticians), geometric computing
algorithm experts (computer scientists) and tree physiologists
(ecologists) and an excellent opportunity for
young researchers to gain knowledge and expertise across multiple fields.
Volcanic Flows
An example of a project with Wolpert that can involve both
undergraduate and graduate students is the
computer simulation of the frequency and magnitude of pyroclastic
flows and the resulting flow levels at
hundreds or thousands of locations, and the comparison of these to
historical records. Students with a first
undergraduate course in probability and a bit of experience in
programming already have enough background
to do this under simplifying assumptions about the underlying
physics, topography, geological processes,
and their probability distributions; several Ph.D. students
are now in the process of extending known
theory and methodology to make more reliable risk predictions without
those simplifying assumptions. Both
the RHIC study and the volcanology project entail computer modeling,
quantifying the uncertainty about
the deterministic outcome of large, slow computer models at thousands
or millions of possible input values,
most of them as yet untried; this too is an area where undergraduate
and graduate students can work together
on simple and (almost!) hopelessly complex versions of the same
models, respectively.
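As an illustration of the kind of simulation a student with a first probability course could write, here is a minimal sketch under strong simplifying assumptions that are ours, not the project's: flows arrive as a homogeneous Poisson process, flow volumes are independent Pareto draws, and topography and physics are ignored entirely. The rate, tail index, and threshold used in the usage note are hypothetical. The Monte Carlo estimate of the probability of at least one large flow can be checked against the closed form that these assumptions allow.

```python
import math
import random


def simulate_flow_volumes(rate, years, alpha, v_min, rng):
    """Volumes of all flows in one simulated period.

    Arrivals: Poisson process with `rate` flows per year.
    Sizes: Pareto with tail index `alpha` and minimum volume `v_min`.
    """
    volumes = []
    t = rng.expovariate(rate)
    while t < years:
        volumes.append(v_min * rng.paretovariate(alpha))
        t += rng.expovariate(rate)
    return volumes


def exceedance_probability(rate, years, alpha, v_min, threshold, n_runs, seed=0):
    """Monte Carlo estimate of P(at least one flow exceeds `threshold`)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_runs):
        volumes = simulate_flow_volumes(rate, years, alpha, v_min, rng)
        if any(v > threshold for v in volumes):
            hits += 1
    return hits / n_runs


def exact_exceedance_probability(rate, years, alpha, v_min, threshold):
    """Closed form under the same assumptions, valid for threshold >= v_min."""
    p_large = (v_min / threshold) ** alpha
    return 1.0 - math.exp(-rate * years * p_large)
```

With, say, `rate=2.0`, `years=10.0`, `alpha=0.7`, `v_min=1.0` and `threshold=50.0`, the simulated and exact values should agree to within Monte Carlo error. Dropping the simplifying assumptions removes the closed form, which is where the harder versions of the problem begin.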
Voting Data
An interesting recent problem being studied by Carin involves
integration of document data with other
data that may come in the form of an (incomplete) matrix.
Specifically, he has looked at voting data in
the US Senate and House, from the first Congress to the present, and
learned the latent structure associated
with legislators and legislation. Since the model is time
evolving, one can examine how the latent
characteristics of a senator/congressman evolve with time.
Additionally, he has a database of the legislation
(document) itself, and therefore can integrate nonparametric Bayesian
topic models with matrix analysis,
to examine how the legislation wording impacts its latent structure
and hence the voting. He plans to also
integrate speeches. Similar analyses may be performed for Supreme
Court votes and cases (documents).
The analysis allows one to infer the low-dimensional latent space
(manifold) on which both legislators and
legislation reside.
Tree-based Processes
An example of a project with Schmidler concerns statistical inference
and stochastic sampling of tree topologies.
Methods for efficient sampling of the space of trees are being
developed for applications in statistical
classification/regression trees, phylogenetic tree construction using
3D protein structures, and segmentation
of blood vessels in retinal images (in collaboration with Carlo Tomasi
of Duke CS). Currently one undergraduate
and two graduate students are involved, one in each of the
three applications; opportunities range
from the highly applied to the computationally intensive to the highly
theoretical (e.g., mixing times of
non-random-walk Markov chains on complex tree distributions, or methods
for α-stable processes (with Prof.
Wolpert) to compute marginal likelihoods for model selection).
Expander Graphs and Compressed Sensing
Mentored jointly by Calderbank and Willett, CS graduate student Sina
Jafarpour derived performance
bounds for compressed sensing in the presence of Poisson noise using
expander graphs. The Poisson noise
model is appropriate for a variety of applications, including
low-light imaging and digital streaming, where
the signal-independent and/or bounded noise models used in the
compressed sensing literature are no longer
applicable. Jafarpour developed a new sensing paradigm based on
expander graphs and proposed a MAP
algorithm for recovering sparse or compressible signals from Poisson
observations. The geometry of the
expander graphs and the positivity of the corresponding sensing
matrices play a crucial role in establishing
the bounds on signal reconstruction error. In fact, it is the geometry
of the expander graphs that makes them
provably superior to random dense sensing matrices, such as Gaussian
or partial Fourier ensembles, for the
Poisson noise model. There is also some very recent work that uses
expander graphs for sparse monitoring
of data streams.
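The measurement model described above can be sketched in a few lines. This is only the forward (sensing) side under our own assumptions, not Jafarpour's construction or the MAP recovery algorithm: the graph is a random bipartite graph in which every signal coordinate is wired to the same number of measurements (such graphs are good expanders with high probability for suitable parameters, which is not checked here), the sensing matrix is left unnormalized with 0/1 entries, and all function names are hypothetical.

```python
import math
import random


def sparse_sensing_graph(n_signal, n_meas, degree, rng):
    """For each signal coordinate, pick `degree` distinct measurement indices."""
    return [rng.sample(range(n_meas), degree) for _ in range(n_signal)]


def measurement_means(signal, graph, n_meas):
    """Noise-free measurements: each one sums the signal entries wired to it."""
    means = [0.0] * n_meas
    for value, neighbors in zip(signal, graph):
        for j in neighbors:
            means[j] += value
    return means


def poisson_sample(mean, rng):
    """Knuth's multiplication method; adequate for the small means used here."""
    limit = math.exp(-mean)
    count = 0
    product = rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def observe(signal, graph, n_meas, rng):
    """Poisson counts: the noise level depends on the signal itself."""
    means = measurement_means(signal, graph, n_meas)
    return [poisson_sample(m, rng) for m in means]
```

Because the matrix entries are nonnegative, a nonnegative signal always yields valid Poisson intensities, and a measurement wired to no active coordinate is exactly zero. Both facts reflect the positivity that the text singles out as important for this noise model.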