
Recurrent Binary Embedding for GPU-Enabled Exhaustive Retrieval from Billion-Scale Semantic Vectors

1. Introduction

Information retrieval (IR) is the activity of retrieving information from a collection of sources stored on computers, based on user queries. IR has a history of roughly a century [1] and lies at the heart of many ubiquitous applications such as web search, product recommendation, and personal feeds on social networks. A major breakthrough came in the 1960s and 1970s [2], when researchers started to encode queries and documents as high-dimensional vectors.

However, dealing with high-dimensional vectors is challenging. Statistical measures based on entropy and Kullback-Leibler divergence were widely adopted in retrieval tasks involving high-dimensional data, although such approaches usually run into the curse of dimensionality [3,4].

A classic algorithm, k-nearest neighbor (k-NN), attempted to tame the curse of dimensionality. The k-NN algorithm applies “brute-force” (or exhaustive) search to find the k nearest neighbors of the query point in the reference point set. However, k-NN is computationally intensive. Approaches such as Approximate Nearest Neighbors (ANN) [5,6] were hence proposed to reduce the number of computations, for example by pre-arranging the data in a kd-tree.
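
To make the cost of exhaustive search concrete, here is a minimal NumPy sketch of brute-force k-NN. The function name `knn_exhaustive`, the Euclidean distance, and the random data are assumptions chosen for illustration; they are not taken from the paper.

```python
import numpy as np

def knn_exhaustive(query, points, k):
    """Return indices and distances of the k reference points closest to query."""
    # Squared Euclidean distance from the query to every reference point.
    diffs = points - query
    dists = np.einsum("ij,ij->i", diffs, diffs)
    # argpartition isolates the k smallest distances; only those k are sorted.
    nearest = np.argpartition(dists, k - 1)[:k]
    idx = nearest[np.argsort(dists[nearest])]
    return idx, np.sqrt(dists[idx])

rng = np.random.default_rng(0)
points = rng.normal(size=(10_000, 64))   # hypothetical reference set
query = points[42] + 0.001               # a query placed very close to point 42
idx, dist = knn_exhaustive(query, points, k=5)
```

Every query touches every reference point, so the work grows linearly with the size of the reference set. That linear cost is what ANN methods try to avoid and what a GPU can parallelize.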

The past decade has witnessed a wide adoption of Graphics Processing Units (GPUs) in image and video processing tasks. However, fewer efforts have been made to develop GPU-based information retrieval, especially for exhaustive search. An early attempt at GPU-enabled exhaustive k-NN search was proposed by Garcia et al. [7].

In this short article, we review a paper by Microsoft Bing researchers that proposes a novel model called “Recurrent Binary Embedding” (RBE), with which GPU-enabled exhaustive k-NN is, for the first time, applied to billion-scale real-time retrieval.

The major contribution of this paper is the design of “Recurrent Binary Embedding” (RBE), which generates compact semantic representations that can be stored on GPUs, enabling billion-scale retrieval in real time. The RBE model was implemented using BrainScript in CNTK and trained on a GPU cluster. (According to the paper, the BrainScript will be made available as open source.) In addition, RBE GPU-enabled Information Retrieval (rbeGIR for short) was implemented on a customized multi-GPU server, as shown in Fig. 1.

Figure 1: The rbeGIR server with four NVIDIA GeForce GTX 1080 GPUs, two 8-core CPUs, and 128GB DDR memory.

The rbeGIR system was evaluated on data collected from a paid search engine operated by Microsoft. The training data for the RBE model contained 175 million (M) unique clicked pairs sampled from two years' worth of search logs. Ten negative samples were generated per clicked pair through cross sampling, bringing the total number of training samples to 1,925M (175M × 11). The validation data consisted of 1.2M pairs sampled a month later in order to avoid overlap, and the test data consisted of 0.8M pairs labeled by human judges.

The rbeGIR system achieved impressive recall and latency. Evaluated on 1.2 billion keywords and 10,000 queries, the system's average recall is 99.99%, while its latency is only 31.17 ms on average, which qualifies rbeGIR for real-time retrieval.

2. Highlights

The RBE model is proposed in the context of sponsored search for a major search engine. Essentially, sponsored search shows advertisements alongside the search results. The three major agents in the ecosystem are “the user”, “the advertiser”, and “the search platform”. The search platform aims to show ads that users would like to click.

The user enters a “query” in the search box, and the search engine finds relevant information. The search engine then uses IR techniques to retrieve “keywords” that associate the user’s intent with the advertiser’s intent. Finally, the search engine displays a few ads (or “impressions”) based on the keywords. A “click” event is recorded if the user clicks on an ad.

In this section, we will introduce three highlights of the paper. The RBE model is selected as the first highlight. It provides a novel way of generating compact vector representations for queries and keywords in the context of sponsored search. The second highlight, RBE-based information retrieval, involves applying the proposed RBE model to an actual GPU-enabled IR system. Finally, the third highlight, exhaustive k-NN selection with GPU, is an essential component of the rbeGIR system particularly designed for billion-scale retrieval.

2.1 The RBE Model

As shown in Fig. 2, the RBE model aims at generating compact vector representations for queries and keywords in sponsored search. The RBE model is built upon the CLSM model [8] (whose architecture is illustrated in Fig. 3).

Compared with the CLSM model, RBE has the same forward pass up to the multi-layers; the parts beyond them are called the RBE layers. The RBE layers are formulated by Eq. (8)-(11). What makes the RBE model “recurrent” is the looped pattern shown in Fig. 3 and Eq. (9)-(11). However, the RBE model is not related to network structures such as RNN or LSTM; the “recurrent” analogy refers only to the looped pattern. In RNN- or LSTM-like models, the transformations from time step t-1 to t share the same set of parameters in order to learn a persistent memory unit. The RBE model, meanwhile, is more flexible in choosing whether the set of parameters is shared or time-varying.

Figure 2: Recurrent Binary Embedding (RBE)

Figure 3: CLSM model architecture

Essentially, the key idea behind Eq. (8)-(11) is to construct the binary decomposition b_i^t by maximizing the information extracted from the real-valued vectors f_i. Multiple intermediate vectors are generated during the training process in order to construct the binary decomposition.
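
The intuition of repeatedly binarizing what is still missing from a real-valued vector can be illustrated with a small sketch. Note that this is not the paper's Eq. (8)-(11): the RBE layers learn their weight matrices during training, whereas the sketch below swaps in plain residual sign quantization with fixed scales of 1/2, 1/4, and so on. The function name `residual_binary_codes` and the random input are hypothetical.

```python
import numpy as np

def residual_binary_codes(f, steps):
    """Encode f (values in [-1, 1]) as `steps` binary vectors in {-1, +1}."""
    codes = []
    approx = np.zeros_like(f)
    for t in range(steps):
        residual = f - approx                    # what the codes so far fail to capture
        b = np.where(residual >= 0, 1.0, -1.0)   # binarize the residual
        approx = approx + 0.5 ** (t + 1) * b     # fixed scale; RBE learns this mapping instead
        codes.append(b)
    return codes, approx

rng = np.random.default_rng(0)
f = np.tanh(rng.normal(size=64))                 # stand-in for a real-valued embedding
errors = []
for steps in (1, 2, 3, 4):
    _, approx = residual_binary_codes(f, steps)
    errors.append(float(np.max(np.abs(f - approx))))
```

With these fixed scales, the worst-case reconstruction error is at most 1/2 after one binary vector and halves with each additional one, which shows how a few extra bits per dimension can recover much of the real-valued signal.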

2.2 RBE-based Information Retrieval

The architecture of the system for keyword retrieval is shown in Fig. 4. The system is referred to as RBE GPU-enabled Information Retrieval, or the rbeGIR system. To begin with, the system uses multiple GPUs to store and process the RBE embeddings. The rounded rectangle shows the components of the p-th GPU. The bottom of Fig. 4 shows that keywords are first transformed offline into RBE vectors and then uploaded from CPU memory to GPU memory. On the other side, the query is transformed into an RBE vector on the fly before being uploaded to GPU memory. The exhaustive match component inside the GPU is responsible for computing the similarity between the RBE embedding of the query and that of each keyword. The match results guide the local selection and global selection processes for the p-th partition in order to find the best keywords. The results from all GPUs contribute to the top N keywords via Selection Merge.
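
The flow described above (partition the keywords, score every keyword in each partition, select locally, then merge) can be sketched on the CPU as follows. This is a simplified illustration under stated assumptions: the partitions are NumPy arrays standing in for per-GPU memory, the similarity is a plain dot product between ±1 vectors rather than the paper's similarity function, and the names `local_top_n` and `retrieve` are hypothetical.

```python
import numpy as np

def local_top_n(query, keywords, offset, n):
    """Score every keyword in one partition; return its n best (score, global index) pairs."""
    scores = keywords @ query                      # exhaustive match: one score per keyword
    best = np.argsort(-scores, kind="stable")[:n]  # ties keep the lower index first
    return [(float(scores[i]), offset + int(i)) for i in best]

def retrieve(query, partitions, n):
    candidates = []
    offset = 0
    for part in partitions:                        # in rbeGIR each partition sits on its own GPU
        candidates.extend(local_top_n(query, part, offset, n))
        offset += len(part)
    # Selection merge: keep the n best candidates across all partitions.
    candidates.sort(key=lambda c: (-c[0], c[1]))
    return [idx for _, idx in candidates[:n]]

rng = np.random.default_rng(1)
keywords = rng.choice([-1.0, 1.0], size=(4000, 64))  # hypothetical binary embeddings
query = rng.choice([-1.0, 1.0], size=64)
partitions = np.array_split(keywords, 4)
top = retrieve(query, partitions, n=10)
```

Because each partition returns its own top N, the merged result equals the top N of a single global scan, so partitioning changes where the work happens but not the answer.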

A key advantage of the RBE model is its memory efficiency. For example, storing one billion keywords requires only 14.9GB of memory, instead of 238GB with floating-point vectors. This paves the way for in-memory retrieval on multiple GPUs. In addition, because RBE learns application-specific representations, it is more accurate than general-purpose quantization algorithms.
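
The two memory figures can be turned into a per-keyword budget with a few lines of arithmetic. The sketch assumes that “GB” here means 2^30 bytes and that floats are stored in 4 bytes; both are assumptions, not statements from the article.

```python
GIB = 2 ** 30                    # assumption: "GB" in the text means 2^30 bytes
n_keywords = 1_000_000_000

rbe_bytes_per_keyword = 14.9 * GIB / n_keywords     # about 16 bytes, i.e. 128 bits
float_bytes_per_keyword = 238 * GIB / n_keywords    # about 256 bytes
floats_per_keyword = float_bytes_per_keyword / 4    # about 64 four-byte floats
compression = 238 / 14.9                            # roughly a 16x reduction
```

Under these assumptions the numbers are consistent with roughly 128 bits per keyword for RBE versus 64 single-precision values for the real-valued embedding.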

Figure 4: The RBE GPU-enabled Information Retrieval (rbeGIR) system.

The rbeGIR system was evaluated against a production retrieval system developed by the authors. There are two baselines: prod_1 refers to the production setting with the same amount of memory, while prod_2 refers to the one with the same number of keywords. The embeddings of 1.2 billion unique keywords were stored in the rbeGIR system.

As shown in Table 3, the quality of the top five keywords returned for each of 2,000 queries was labelled as “bad”, “fair”, “good” or “excellent”. Each column in Table 3 gives the percentage gap between the counts of query-keyword pairs carrying that column's label. For example, for the “good” label, the rbeGIR system outperformed prod_1 and prod_2 by 18.52% and 11.19% respectively.

In order to evaluate the recall of the rbeGIR system, 10,000 queries were first matched offline against 1.2 billion keywords using exact nearest neighbor search on the RBE embeddings. The per-query recall@1000 is the number of top keywords returned that overlap with the relevant keywords, divided by 1,000. The average recall@1000 for rbeGIR was 99.99%. Moreover, with an average latency of 31.17 ms, the proposed system qualifies for real-time retrieval.
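
The recall@k definition above translates directly into code. The following is a minimal sketch with hypothetical function names and toy keyword IDs; it is not the authors' evaluation script.

```python
def recall_at_k(retrieved, relevant, k=1000):
    """Number of top-k retrieved items that are among the k relevant items, divided by k."""
    top = set(retrieved[:k])
    return len(top & set(relevant[:k])) / k

def average_recall(all_retrieved, all_relevant, k=1000):
    """Mean of the per-query recall@k over a set of queries."""
    pairs = list(zip(all_retrieved, all_relevant))
    return sum(recall_at_k(r, g, k) for r, g in pairs) / len(pairs)

# Toy example with k=4: three of the four relevant keywords are retrieved.
r = recall_at_k([1, 2, 3, 4], [2, 3, 4, 5], k=4)   # 0.75
```

In the paper's setting, the relevant set for each query comes from the offline exact nearest neighbor match, and k is 1,000.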

2.3 Exhaustive k-NN Selection with GPU

An essential component of the rbeGIR system is the brute-force k-NN selection algorithm designed for billion-scale retrieval. The selection algorithm begins with a local selection process, which relies on a k-NN kernel, as shown in Algorithm 1. The kernel takes the RBE embeddings of the query and of each keyword, and outputs a priority queue of the top similarity scores and their indices. The output priority queue is fed into global selection and selection merge in order to derive the top keywords. Both global selection and selection merge adopt the radix sort method [9], which is one of the fastest sorting algorithms.
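
Algorithm 1 is not reproduced in this article, so the following is only a sequential sketch of the idea behind local selection: scan the similarity scores once and keep the best k in a bounded priority queue. The paper's kernel runs in parallel on the GPU; the function name `local_selection` and the toy scores are hypothetical.

```python
import heapq

def local_selection(scores, k):
    """Scan the scores once, keeping the k highest (score, index) pairs in a min-heap."""
    heap = []  # heap[0] is always the weakest of the current top k
    for index, score in enumerate(scores):
        if len(heap) < k:
            heapq.heappush(heap, (score, index))
        elif score > heap[0][0]:
            # The new score beats the weakest kept score, so it takes its place.
            heapq.heapreplace(heap, (score, index))
    return sorted(heap, reverse=True)

scores = [0.2, 0.9, 0.1, 0.7, 0.4, 0.8]
top3 = local_selection(scores, 3)   # [(0.9, 1), (0.8, 5), (0.7, 3)]
```

The queue never holds more than k entries, so memory stays constant no matter how many keywords are scanned, which matters when each GPU holds hundreds of millions of embeddings.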

3. Conclusion

The paper introduces the RBE model, which generates semantic vector representations for billion-scale information retrieval that can be stored and processed on GPUs. The RBE representation can be further integrated with exhaustive k-NN search, resulting in the proposed rbeGIR system, an early example of IR harnessing both deep learning algorithms and powerful GPUs.

The authors noted that the RBE representations are not confined to the CLSM model. As demonstrated in Fig. 5, the concept of RBE can be further generalized to networks such as semantic hashing [10] and word2vec [11] in the future.

Figure 5: The concept of RBE can be generalized to other networks such as semantic hashing (left) and word2vec (right).


[1] Sanderson, M., and Croft, W. B. The history of information retrieval research. Proceedings of the IEEE 100, Special Centennial Issue (2012), 1444-1451.

[2] Salton, G., Wong, A., and Yang, C.-S. A vector space model for automatic indexing. Communications of the ACM 18, 11 (1975), 613-620.

[3] Weber, R., Schek, H.-J., and Blott, S. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In VLDB (1998), vol. 98, pp. 194-205.

[4] Aggarwal, C. C., Hinneburg, A., and Keim, D. A. On the surprising behavior of distance metrics in high dimensional spaces. In ICDT (2001), vol. 1, Springer, pp. 420-434.

[5] Friedman, J. H., Bentley, J. L., and Finkel, R. A. An algorithm for finding best matches in logarithmic expected time. ACM Transactions on Mathematical Software (TOMS) 3, 3 (1977), 209-226.

[6] Datar, M., Immorlica, N., Indyk, P., and Mirrokni, V. S. Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the twentieth annual symposium on Computational geometry (2004), ACM, pp. 253-262.

[7] Garcia, V., Debreuve, E., and Barlaud, M. Fast k nearest neighbor search using GPU. arXiv preprint arXiv:0804.1448 (2008).

[8] Shen, Y., He, X., Gao, J., Deng, L., and Mesnil, G. A latent semantic model with convolutional-pooling structure for information retrieval. In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management (2014), ACM, pp. 101-110.

[9] Merrill, D. G., and Grimshaw, A. S. Revisiting sorting for GPGPU stream architectures. In Proceedings of the 19th international conference on Parallel architectures and compilation techniques (2010), ACM, pp. 545-546.

[10] Salakhutdinov, R., and Hinton, G. Semantic hashing. RBM 500, 3 (2007), 500.

[11] Mikolov, T., Chen, K., Corrado, G., and Dean, J. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013).

Author: Olli Huang | Editor: Hao Wang, Michael Sarazen
