Yesterday afternoon I listened to Alon Halevy of Google Research present a lecture at the Sybase-sponsored Database Seminar Series at the University of Waterloo. Alon’s talk was entitled “Bringing (Web) Databases to the Masses” – here is Alon’s abstract:
The World-Wide Web contains vast quantities of structured data on a variety of domains, such as hobbies, products and reference data. Moreover, the Web provides a platform that can encourage publishing more data sets from governments and other public organizations and support new data management opportunities, such as effective crisis response, data journalism and crowd-sourcing data sets. To enable such wide-spread dissemination and use of structured data on the Web, we need to create an ecosystem that makes it easier for users to discover, manage, visualize and publish structured data on the Web.
I will describe some of the efforts we are conducting at Google towards this goal and the technical challenges they raise. In particular, I will describe Google Fusion Tables, a service that makes it easy for users to contribute data and visualizations to the Web, and the WebTables Project that attempts to discover high-quality tables on the Web and provide effective search over the resulting collection of 200 million tables.
Alon gave a super-interesting talk, motivating the University of Waterloo students in the audience to dream about what the web could be. Google Fusion Tables and WebTables are steps in that direction. My notes from Alon’s talk follow:
Structured data and the Web
- there is a huge amount of structured data on the web: reference data, hobbies, products, etc. Currently we are primarily reacting to this existing data.
- but the web could be a platform for getting more data out: government data, crime data, data regarding water conditions, the list is (nearly) boundless. Hence we could be more proactive in our use of the web.
- the web could also enable new kinds of data collection and management: collaboration, crowd-sourcing, real-time data, and crisis response, all of which could be used to invent a much brighter future.
There is currently no single system that will achieve all these goals. Key processes that are required include data cleaning, querying, sharing, integrating, and visualizing. In particular, data visualization is key: data volumes on the web are so large that visualization techniques are necessary to synthesize the data. Synthesis should be easy, and visualization is a prerequisite. With visualization, one could really tell a story by using a sequence of visualizations.
The challenges in making the web a better place for information sharing and collaboration include:
- finding high quality structured data
- getting data out of silos
- extracting data from web pages
- databases are hard to use
- integrating heterogeneous data
- publishing data is cumbersome
Two projects in progress at Google aim to address these problems:
- Google Fusion Tables – a database management service for the web; and
- Google WebTables
Google Fusion Tables
The first goal of Google Fusion Tables is to develop an easy-to-use DBMS that is integrated with the web. Its key features are:
- easy upload (CSV, KML, spreadsheets)
- sharing (even outside your company)
- visualizations front and center
- easy publishing
and, perhaps most importantly, for all of these services to be available with zero administration (no DBA).
The second goal of Google Fusion Tables is to provide an integrated, poly-structured data cloud – to enable the discovery of others’ data and combine it with your own. Hence users can create tables, then create visualizations and share them. Fusion Tables has an API so application developers can develop applications that populate Fusion Tables, synthesize that data with other shared data on the web, and then visualize the result. From Alon’s talk, there is a significant thrust towards data visualization – the predictable problem is that with massive amounts of data, you must employ visualization techniques in order to analyze it.
It turns out that as part of the Fusion Tables initiative, it sort of “happened” that Google built a GIS “in the cloud” as part of the infrastructure. Challenges included “trickling”: showing only a small number of geometry or geographic features, i.e. points or polygons, from a large data set. To automatically render such data, the software needs to thin polygons, clip them to the window, and style features on the fly – and do it all in under 100ms.
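Alon did not say which algorithms Google uses for these steps, so the following is only a minimal Python sketch of what "thinning" and "clipping" can look like. It swaps in two textbook techniques: Douglas-Peucker simplification for thinning and Sutherland-Hodgman clipping against a rectangular window. The function names, the tolerance parameter and the window coordinates are my own assumptions for illustration, not anything from the talk.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def _point_segment_distance(p: Point, a: Point, b: Point) -> float:
    """Distance from point p to the line segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return ((px - ax) ** 2 + (py - ay) ** 2) ** 0.5
    # Project p onto the segment and clamp the projection to its endpoints.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)))
    cx, cy = ax + t * dx, ay + t * dy
    return ((px - cx) ** 2 + (py - cy) ** 2) ** 0.5


def thin(points: List[Point], tolerance: float) -> List[Point]:
    """Douglas-Peucker: drop vertices that lie within `tolerance` of the simplified line."""
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    # Find the interior vertex farthest from the chord first-last.
    index, farthest = 0, 0.0
    for i in range(1, len(points) - 1):
        d = _point_segment_distance(points[i], first, last)
        if d > farthest:
            index, farthest = i, d
    if farthest <= tolerance:
        return [first, last]
    left = thin(points[: index + 1], tolerance)
    right = thin(points[index:], tolerance)
    return left[:-1] + right  # the split vertex appears once


def clip_to_window(polygon: List[Point], xmin, ymin, xmax, ymax) -> List[Point]:
    """Sutherland-Hodgman clipping of a polygon against an axis-aligned window."""
    def clip_edge(vertices, inside, intersect):
        result = []
        for i, current in enumerate(vertices):
            previous = vertices[i - 1]  # wraps around to the last vertex for i == 0
            if inside(current):
                if not inside(previous):
                    result.append(intersect(previous, current))
                result.append(current)
            elif inside(previous):
                result.append(intersect(previous, current))
        return result

    def cross_x(x):
        # Intersection of segment p-q with the vertical line at x.
        return lambda p, q: (x, p[1] + (x - p[0]) / (q[0] - p[0]) * (q[1] - p[1]))

    def cross_y(y):
        # Intersection of segment p-q with the horizontal line at y.
        return lambda p, q: (p[0] + (y - p[1]) / (q[1] - p[1]) * (q[0] - p[0]), y)

    edges = [
        (lambda p: p[0] >= xmin, cross_x(xmin)),
        (lambda p: p[0] <= xmax, cross_x(xmax)),
        (lambda p: p[1] >= ymin, cross_y(ymin)),
        (lambda p: p[1] <= ymax, cross_y(ymax)),
    ]
    vertices = list(polygon)
    for inside, intersect in edges:
        if not vertices:
            break  # polygon lies entirely outside the window
        vertices = clip_edge(vertices, inside, intersect)
    return vertices
```

A renderer along these lines would clip each feature to the visible map window first and then thin what remains with a tolerance tied to the zoom level, so that far fewer vertices need to be drawn within the time budget.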
Google WebTables
In his talk, Alon indicated that a three-year-old Google study found approximately 14 billion English HTML “tables” (i.e. English-language web content with TABLE tags). Of these 14 billion, over 99% are “uninteresting”. Part of the problem is that there is a brittle relationship between table values and their semantics; sometimes the semantics are hidden and complicated, and on the web one does not readily have a domain model or context to exploit to analyze data semantics, particularly when you introduce cultural differences and differences in language.
In a previously published paper (WebTables: Exploring the Relational Web, VLDB 2008), Google researchers estimated that of this corpus of 14B raw “tables”, 154 million were “good” relations (i.e. they could be used for structured data analysis). WebTables is designed to recover “good” relations from a crawl and enable search – which Alon referred to as discovering a (structured) needle in an (unstructured) haystack. Once these are discovered, one can then build novel applications using that data. To capture semantic meaning, such applications can again rely on the web: its content covers the same range of topics as the tables themselves, so one can use the web to develop ontologies to categorize and analyze the original data.
So of the original 14B “tables”, 154 million are “good” tables, with 2.6 million distinct schemas, and 5.4 million total attributes. With the semantic information from the web, one can join these schemas by similarity in attribute names through synonym discovery.
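To make the idea of synonym discovery a little more concrete, here is a minimal Python sketch of one possible heuristic: two attribute names are synonym candidates if they never appear together in the same schema, yet tend to appear alongside the same other attributes. This is an illustrative heuristic of my own choosing, not the method described in the talk or the WebTables papers, and the schemas, function name and `min_shared` threshold are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def synonym_candidates(schemas, min_shared=2):
    """Rank attribute pairs that never co-occur but share context attributes."""
    context = defaultdict(set)  # attribute -> attributes seen alongside it
    together = set()            # attribute pairs seen in the same schema
    for schema in schemas:
        attrs = set(schema)
        for a in attrs:
            context[a] |= attrs - {a}
        for a, b in combinations(sorted(attrs), 2):
            together.add((a, b))

    scored = []
    for a, b in combinations(sorted(context), 2):
        if (a, b) in together:
            continue  # attributes used side by side are unlikely to be synonyms
        shared = context[a] & context[b]
        if len(shared) >= min_shared:
            scored.append((len(shared), a, b))
    # Pairs with the most shared context attributes come first.
    return [(a, b) for _, a, b in sorted(scored, key=lambda s: (-s[0], s[1], s[2]))]


# Hypothetical toy schemas, invented for illustration only.
schemas = [
    ("name", "address", "phone"),
    ("name", "address", "telephone"),
    ("name", "phone", "email"),
    ("name", "telephone", "email"),
    ("name", "address", "email"),
]
candidates = synonym_candidates(schemas)  # [("phone", "telephone")]
```

With only a handful of schemas a heuristic like this is fragile, but over millions of schemas the co-occurrence statistics become far more informative, which is presumably why a corpus of this size matters.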
During the talk, Alon showed a variety of examples that illustrated how one can use Google Fusion Tables to combine data sets and create information from their combination. One example Alon showed is a map that combines earthquake instances since 1973 with current nuclear installations, which conveys a lot of meaningful information very quickly.
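The earthquake example is essentially a join of two data sets on location. As a rough illustration of that kind of combination, the Python sketch below pairs each site with the events that fall within a given radius, using the standard haversine great-circle distance. It is plain Python rather than anything from the Fusion Tables API, and the rows, field names and radius are made up for the example.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two latitude/longitude points, in km."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def quakes_near_sites(sites, quakes, radius_km):
    """For each site, list the ids of quakes whose epicentre lies within radius_km."""
    return {
        site["name"]: [
            q["id"] for q in quakes
            if haversine_km(site["lat"], site["lon"], q["lat"], q["lon"]) <= radius_km
        ]
        for site in sites
    }


# Made-up rows for illustration; these are not real installations or events.
sites = [
    {"name": "site-A", "lat": 0.0, "lon": 0.0},
    {"name": "site-B", "lat": 45.0, "lon": 90.0},
]
quakes = [
    {"id": "q1", "lat": 0.5, "lon": 0.5},
    {"id": "q2", "lat": 44.0, "lon": 91.0},
    {"id": "q3", "lat": -30.0, "lon": 150.0},
]
nearby = quakes_near_sites(sites, quakes, radius_km=200)
```

The nested loop compares every site with every event, which is fine for a sketch; a service handling large data sets would need a spatial index instead.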
Two requirements are a clear thrust of this research: ease of use and zero administration. Ideally, creating information should require only the minimal computing knowledge needed to accomplish the task. Google is certainly pushing the envelope in that direction. In his concluding remarks, Alon hinted that we can expect more developments in this area over the next while, as Google develops better techniques to discover structured data on the web and unify that data with information from other sources.
Various academic papers on WebTables and Google Fusion Tables have been published in recent conferences:
WebTables: VLDB 2008, VLDB 2009, VLDB 2011
Fusion Tables: SIGMOD 2010, SOCC 2010 – and see http://google.com/fusiontables
Communications of the ACM, February 2011