Range queries over DHTs

From P2P Wiki

Jump to: navigation, search

Introduction

Earlier peer-to-peer proposals were unstructured overlays, like Gnutella. Their communication scheme was based on flooding mechanisms, incurring on an important amount of overhead. Later, structured overlays and the DHT paradigm emerged (CAN, Chord, Pastry), providing exact match queries with good performance qualities. These systems perform insertion and lookup of data via hashing functions, thus guaranteeing a uniform data distribution over the network.

Nevertheless, real-life applications demand more complex searches, e.g. range searches. These kinds of searches enable the user retrieve related or similar data objects to the query. But, DHTs do not deal with these kinds of searches. Instead, we would need to flood the entire network for retrieving all related content, which becomes unacceptable for large-scale networks.

All these (distributed) applications present the same elements in their architecture: a multi-dimensional data domain with data objects being stored and searched; a distance function which defines the similarity between two data objects; and the algorithms for storing and searching the content. For instance, consider an image querying system where users publish images. These images are characterized by real-valued d-dimensional feature vectors. A query in this system consists of such a vector and the user expects the most similar images to it. Also, Information Retrieval (IR) applications work in the same manner, where each text document is characterized by a d-dimensional vector with the best descriptive terms. In both cases, when a search is performed, applications must evaluate the similarity of various data objects according to some distance function, e.g. Euclidean distance for image retrieval applications or Cosine distance for text retrieval ones. And also each (distributed) application decides how and where to store the content.

Related Work

There exists a lot of work in similarity search. Earlier works are PHT [1] and SkipNet [2], providing range queries for one-dimensional datasets. Other systems support multi-dimensional similarity searches, like SkipIndex [3] and the work of Bin Liu et al. [4]

Some of existing systems operates with a d-dimensional domain to identify nodes and can map d-dimensional content onto nodes with a little effort, e.g. PRoBe [5], Mercury [6]. Because most DHTs operate with one-dimensional domains to identify nodes, they need some mechanism to map this d-dimensional content onto the network. Some work, e.g. Skipnet[2], Mercury[6], [7], use linearization functions like SFCs [8] instead of DHT’s hash functions. Linearization functions address getting similar content distributed near the same place within the network. Hence, only few nodes are necessary to visit for an arbitrary search. But linearization functions and related querying algorithms are oblivious to the underlying DHT topology. Thus, semantically close nodes may be far in terms of communication cost. ZNet [9] works in this line and SQS [10].

ZNet[9] is based on Skip Graphs and uses SFCs to map multi-dimensional data objects to nodes. ZNet and SQS are similar in the following terms: a) the mapping scheme builds a logical hierarchy, defining where to place the content; b) this scheme enables nodes to easily check whether they own content related to a similarity query, by comparing the value of their node identifier and the query; and c) consecutive nodes in the node identifier domain have contiguous data domain fragments. In ZNet, nevertheless, data is mapped onto a flat overlay network and data load balancing is addressed by joining and rejoining nodes into heavily loaded network zones, which becomes an excessive management cost. Eventually, ZNet does not deal with network communication costs and can support just one application at a time.

References

  1. S. Ramabhadran et al. Prefix Hash Tree: An Indexing Data Structure over Distributed Hash Tables, IRB Tech Report, Feb. 2004
  2. N. Harvey et al.SkipNet: A Scalable Overlay Network with Practical Locality Properties. Proc. USITS, 2003.
  3. C. Zhang et al., SkipIndex: Towards a scalable peer-to-peer index service for high dimensional data. Technical Report TR-703-04, Princeton University, 2004
  4. Bin Liu et al., Supporting Complex Multi-dimensional Queries in P2P systems. Proc. IEEE ICDCS, 2005.
  5. O. D. Sahin et al., PRoBe: Multi-Dimensional Range Queries in P2P Networks. WISE, 2005.
  6. A. R. Bharambe et al., Mercury: Supporting Scalable Multi-Attribute Range Queries. Proc. SIGCOMM, 2004
  7. Indrajit Bhattacharya et al., Similarity Searching in Peer-to-Peer Databases. Proc IEEE ICDCS, 2005.
  8. H. Sagan,Space-Filling Curves.Springer-Verlag, 1994.
  9. Y. Shu et al., Supporting Multi-dimensional Range Queries in Peer-to-Peer Systems. Proc. IEEE P2P, 2005.
  10. Jordi Pujol Ahulló et al. SQS: Similarity Query Scheme for Peer-to-Peer Databases. Proc. IEEE ISCC, 2007
Personal tools