At the USETIM (Using Search Engine Technology for Information Management) Workshop at VLDB 2009, we presented a paper that explores the technical similarities and differences between column stores and search engines. The motivation for the paper is that we believe that the two systems have a lot of similarities, and that development in both fields point towards a confluence. We therefore try to evaluate the synergies obtainable from implementing one system to support both workloads, and we find that there are both opportunities and challenges associated with developing such a hybrid system.
With a high-level perspective, we can claim that both column stores and search engines are column-oriented. While column stores obviously store column-oriented representations of tables, an inverted index used in search engines can be interpreted as a column-oriented representation of a term-document matrix where the entries describe the occurrence of terms in documents. Despite this similarity, it is clear that the queries typically supported in the two types of systems are quite different. While search engines focus on returning the top-k ranked documents for a keyword query, column stores are mainly used in decision support systems where queries with aggregations are dominant. However, typical search engine queries might become more similar to traditional DSS queries through the introduction of faceted search. In addition, new approaches suggested for storing the entries in inverted lists/posting lists make the storage layout in search engines more column-oriented. The fact that that column stores have been used as back-ends to search engines with good performance also indicates similarities between the fields.
The similarities might suggest that one system can support both workloads efficiently. Our qualitative evaluation indicates that there are clear opportunities associated with constructing a hybrid system, such as efficient storage structures for facets in faceted search engines based on a columnar layout, and technology transfer of loading techniques and indexing structures. But, there are also challenges that need to be overcome for a hybrid system to be efficient. Examples of challenges include the fact that search engine workloads require low latency, while throughput is the major optimization criterion for column stores. A successful hybrid system might also be able to support other related workloads, but supporting more workloads typically involves supporting more query languages. When supporting several query languages, it becomes a challenge to find a physical algebra for the system that enables supporting all query languages efficiently.
If you are interested in hearing some more details about this work, or in discussing the potential in developing a hybrid system for both search and decision support workloads, comment in this blog or read our paper at the USETIM workshop at VLDB 2009!