Tuesday, September 15, 2009

Intensional Associations in Dataspaces

One problem that many users have in managing their data is how to obtain connected items while searching. For example, picture yourself searching for information on an interesting classroom project you developed some years ago. You may type a few keywords into a search tool, which will lead you to one or two documents about that project lost in the vast amount of information on your hard drive. Unfortunately, not all of the items you are interested in, such as graphs, emails, and results of interesting experiments, may contain the keywords you chose to type in the search box.

The problem in this example is that even though you could find some information related to your project, you cannot navigate from that information to other important items in the same context. Together with colleagues from Saarland University and ETH Zurich, I have explored an idea to solve this problem in a paper recently accepted for publication at ICDE 2010. The full version of our paper can be found here (link to draft).

In order to define connections among items in a dataspace, we propose association trails. An association trail is a declarative definition of how items in the dataspace are connected by virtual association edges to other items. A set of association trails defines a logical graph of associations over the dataspace. For example, you may connect documents in your personal dataspace by associating items touched around the same time, documents with similar content, different versions of documents you authored or received, or items that reside in similar folder hierarchies in your email server and in your filesystem.
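To make the idea of a declarative definition more concrete, here is a minimal Python sketch. It is not the formalism from our paper or the iMeMex API: the `Item` and `AssociationTrail` classes, their fields, and the two example trails are assumptions made up for illustration. The point it shows is that a trail only states when two items are associated, and the edges stay virtual until someone asks for them.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Tuple


@dataclass(frozen=True)
class Item:
    """Hypothetical dataspace item (fields are illustrative assumptions)."""
    name: str
    modified: int  # modification time, in hours since some arbitrary epoch
    folder: str


@dataclass(frozen=True)
class AssociationTrail:
    """Hypothetical trail: a name plus a predicate over pairs of items."""
    name: str
    related: Callable[[Item, Item], bool]


def edges(trails: Iterable[AssociationTrail],
          items: Iterable[Item]) -> Iterator[Tuple[str, str, str]]:
    """Lazily enumerate the virtual edges declared by the trails.

    Nothing is stored: each edge is derived on demand from the predicates.
    This naive version compares every pair of items for every trail.
    """
    items = list(items)
    for trail in trails:
        for a in items:
            for b in items:
                if a != b and trail.related(a, b):
                    yield (trail.name, a.name, b.name)


items = [
    Item("report.doc", 100, "project"),
    Item("results.csv", 101, "project"),
    Item("graph.png", 101, "plots"),
    Item("holiday.jpg", 500, "photos"),
]

trails = [
    # Items touched within one hour of each other.
    AssociationTrail("same_time", lambda a, b: abs(a.modified - b.modified) <= 1),
    # Items that reside in the same folder.
    AssociationTrail("same_folder", lambda a, b: a.folder == b.folder),
]

logical_graph = set(edges(trails, items))
```

In this toy dataspace, `report.doc` is connected to `graph.png` only through the `same_time` trail, even though the two files share no folder and need not share any keyword.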

Coming back to our classroom project search, association trails create connections from your one or two search results to a rich set of related emails, documents, and experiment results. Our paper calls the task of automatically obtaining all of this context information from search results a neighborhood query. While neighborhood queries are very useful in helping you find information in your data, they are also very expensive to process over the logical graph of connections created by association trails. To address this problem, our paper investigates a new indexing technique, called the grouping-compressed index (GCI). In a nutshell, GCI creates a compressed representation of the logical graph declared by association trails. We can use this compressed representation to answer neighborhood queries without ever having to expand it to the whole graph. As a consequence, GCI can achieve over an order of magnitude better indexing or querying times when compared to various alternatives.
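The intuition behind answering neighborhood queries over a compressed representation can be sketched in a few lines of Python. This is a deliberately simplified illustration and not the GCI described in the paper: it assumes a single equality-style trail ("same folder"), and the function names and sample data are hypothetical. Items that share a key are all connected to each other, so it is enough to store one group per key instead of every pairwise edge.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def build_group_index(keys: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each grouping key to the set of item ids that carry it.

    `keys` maps an item id to its grouping key (here: the item's folder).
    The index holds one entry per item, whereas materializing the edges of
    a group with n items would take n * (n - 1) directed edges.
    """
    groups: Dict[str, Set[str]] = defaultdict(set)
    for item_id, key in keys.items():
        groups[key].add(item_id)
    return dict(groups)


def neighborhood(result_ids: Iterable[str],
                 keys: Dict[str, str],
                 groups: Dict[str, Set[str]]) -> Set[str]:
    """Return all items associated with the search results.

    The answer is read off the groups directly, so the pairwise edges of
    the logical graph are never expanded.
    """
    results = set(result_ids)
    neighbors: Set[str] = set()
    for item_id in results:
        neighbors |= groups.get(keys[item_id], set())
    return neighbors - results


# Hypothetical personal dataspace: item id -> folder.
folders = {
    "report.doc": "project",
    "results.csv": "project",
    "mail-17.eml": "project",
    "holiday.jpg": "photos",
}

index = build_group_index(folders)
related = neighborhood(["report.doc"], folders, index)
```

Here a keyword search that only finds `report.doc` still leads to `results.csv` and `mail-17.eml`, because all three sit in the same group.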

Association trails have been integrated into the iMeMex Dataspace Management System and the code is released under an open-source license. If you are interested in dataspaces, you can also find out about other work I have done in iMeMex by taking a look at my PhD thesis.

I am looking forward to an interesting conference in Long Beach next year! Hope to see you there!
