There is an arms race to perform increasingly sophisticated data analysis on ever more varied types of data (text, audio, video, OCR, sensor data, etc.). Current data processing systems typically assume that the data have rigid, precise semantics, which these new data sources do not possess. On the other hand, many of the state-of-the-art approaches both to coping with variations in the structure of data and to deeply analyzing data are statistical. The Hazy project is exploring the integration of statistical processing techniques with data processing systems, with the goal of making such systems easier to build, to deploy, and to maintain.
To demonstrate our ideas, we are building several applications, including systems that read large amounts of text and answer sophisticated questions (see WiscI and GeoDeepDive), as well as general primitives for data analytics that are now incorporated in products from Oracle and Pivotal. Additionally, some of our ideas have helped to find neutrinos with IceCube (see IceCube).
DeepDive, a general-purpose statistical inference system, has been released. Check it out! One of the most popular uses of the software is in machine reading (or knowledge-base construction), but we are starting to branch out.