BCTH: A Novel Text Hashing Approach via Bayesian Clustering

Ying Wenjie1, Yuquan Le2, Hantao Xiong2
1School of Computer and Information Technology, Beijing Jiaotong University, Beijing, China, 2Changsha Lvzhidao Information Technology Co., Ltd.


Abstract

Similarity search aims to find the items most similar to a given target item. The ability to perform similarity search at large scale plays a significant role in many information retrieval applications and has therefore received much attention. Text hashing is a promising strategy that represents documents as binary codes and achieves attractive performance. This paper makes the first attempt to use Bayesian Clustering for Text Hashing, in an approach dubbed BCTH. Specifically, BCTH maps documents to binary codes by running multiple Bayesian Clusterings in parallel, where each Bayesian Clustering is responsible for one bit. Our approach employs a bit-balanced constraint to maximize the amount of information in each bit, and a bit-uncorrelated constraint to keep the bits independent of one another. The hash codes and the hash function are learned jointly, and the time complexity of BCTH is linear. Experimental results on four widely used datasets demonstrate that BCTH is competitive with current strong baselines in both precision and training speed.
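As a rough illustration of the two constraints named in the abstract, the sketch below checks them on a toy set of codes. It assumes the formulation common in the hashing literature, which the abstract itself does not spell out: bits take values in {-1, +1}, a balanced bit has zero mean over the corpus, and uncorrelated bits give (1/n) BᵀB equal to the identity. The function names and the toy data are hypothetical and are not taken from BCTH.

```python
from typing import List

Code = List[int]  # one hash code; each bit is +1 or -1 (assumed encoding)


def hamming_distance(a: Code, b: Code) -> int:
    """Count the bit positions where two codes differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def bit_balance(codes: List[Code]) -> List[float]:
    """Mean of each bit over all codes; 0.0 means the bit is perfectly balanced."""
    n = len(codes)
    return [sum(col) / n for col in zip(*codes)]


def bit_correlation(codes: List[Code]) -> List[List[float]]:
    """Compute (1/n) * B^T B; the identity matrix means pairwise uncorrelated bits."""
    n = len(codes)
    cols = list(zip(*codes))
    return [[sum(x * y for x, y in zip(ci, cj)) / n for cj in cols] for ci in cols]


# Toy 2-bit codes for four documents (hypothetical data, not from the paper).
codes = [[1, 1], [1, -1], [-1, 1], [-1, -1]]
balance = bit_balance(codes)
corr = bit_correlation(codes)
```

On this toy set every bit is +1 for exactly half of the documents and the two bits vary independently, so both checks pass. Retrieval with such codes then reduces to ranking documents by `hamming_distance` to the query code.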