Ahmed El Alaoui (Berkeley)

Oct 17, 2016.

Title and Abstract

Decoding from Random Histogram Queries: Information-Theoretic Bounds and Algorithms
Consider a population consisting of n individuals, each of whom has one of d types (e.g. their blood type, in which case d=4). We are allowed to query this database by specifying a subset of the population, and in response we observe a noiseless histogram (a d-dimensional vector of counts) of the types of the pooled individuals. This measurement model arises in practical situations such as the pooling of genetic data, and may also be motivated by privacy considerations. We are interested in the number of queries one needs to unambiguously determine the type of each individual, first ignoring computational considerations and then taking them into account. We study the power of random queries, where each query selects a random subset of individuals of size proportional to n. We discuss the information-theoretic question and sketch the key steps of a rigorous proof of almost matching upper and lower bounds on the minimum number of queries m above which the planted assignment is the unique solution. Time permitting, we discuss computational considerations and give heuristic arguments that an AMP-based algorithm breaks down well above the information-theoretic threshold, indicating the existence of a wide “detectable but hard” region in the phase space of the problem.
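
For concreteness, here is a minimal Python sketch of the query model described above. The parameter values and the Bernoulli construction of each pooled subset are illustrative assumptions, not the exact setup of the work; it only simulates the noiseless histogram observations.

    import numpy as np

    # Illustrative sketch of the pooled-histogram measurement model.
    # n, d, m, and alpha are example values chosen for this sketch.
    rng = np.random.default_rng(0)

    n, d = 100, 4      # n individuals, each carrying one of d types
    m = 30             # number of random queries
    alpha = 0.5        # each query pools a random subset of size roughly alpha * n

    types = rng.integers(d, size=n)      # hidden type of each individual

    def histogram_query(subset_indices):
        # Noiseless d-dimensional vector of type counts in the pooled subset.
        return np.bincount(types[subset_indices], minlength=d)

    queries, observations = [], []
    for _ in range(m):
        subset = rng.random(n) < alpha   # include each individual independently
        queries.append(subset)
        observations.append(histogram_query(np.flatnonzero(subset)))

    # Each observation sums to the pool size of its query.
    print(observations[0], observations[0].sum(), queries[0].sum())

The decoding problem is then to recover the vector of hidden types from the m pairs of (query subset, observed histogram).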

The proof relies on an exact computation of the “annealed free energy” of the model in the thermodynamic limit. To this end, we investigate the rich combinatorial structure of the problem, and as a by-product of the analysis, we show a curious identity that relates the Gaussian integral over the space of Eulerian flows of a graph to its spanning tree polynomial.
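
As background, and as an assumption about terminology rather than a statement from the talk, the spanning tree polynomial referred to here is presumably the weighted generating polynomial over spanning trees, which the standard matrix-tree theorem expresses as a Laplacian determinant:

    % Weighted matrix-tree theorem (standard background, not the identity proved in the talk):
    \[
      T_G(w) \;=\; \sum_{T \in \mathcal{T}(G)} \prod_{e \in T} w_e
             \;=\; \det \widehat{L}_G(w),
    \]
    % where \mathcal{T}(G) is the set of spanning trees of G, L_G(w) is the weighted
    % Laplacian of G, and \widehat{L}_G(w) is L_G(w) with the row and column of any
    % fixed vertex removed.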

This is joint work with Aaditya Ramdas, Florent Krzakala, Lenka Zdeborova, and Michael Jordan.

Bio

Ahmed El Alaoui is a graduate student in the EECS PhD program at UC Berkeley, advised by Michael Jordan. His research interests are in mathematical statistics, probability, and machine learning. Specifically, he is interested in the resource tradeoffs at play in extracting information from noisy data, and in the design of algorithms that navigate these tradeoffs optimally.