Supervised by: Ministry of Culture of PRC

Sponsored by: National Library of China
  Library Society of China

ISSN 1001-8867    CN 11-2746/G2

Filtering and Classifying Relevant Short Text with a Few Seed Words

Abstract: Filtering out irrelevant documents and classifying the relevant ones into topical categories is a de facto task in many applications. However, supervised learning solutions require extravagant human effort for document labeling. In this paper, we propose a novel seed-guided topic model for dataless short text classification and filtering, named SSCF. Without using any labeled documents, SSCF takes a few “seed words” for each category of interest, and conducts short text filtering and classification in a weakly supervised manner. To overcome the issues of data sparsity and imbalance, the short text collection is mapped to a collection of pseudo-documents, one for each word. SSCF infers two kinds of topics on pseudo-documents: category-topics and general-topics. Each category-topic is associated with one category of interest, covering the meaning of the latter. In SSCF, we devise a novel word relevance estimation process based on the seed words, for hidden topic inference. The dominating topic of a short text is identified through post-inference and then used for filtering and classification. On two real-world datasets in two languages, experimental results show that our proposed SSCF consistently achieves better classification accuracy than state-of-the-art baselines. We also observe that SSCF can even outperform the supervised classifiers supervised latent Dirichlet allocation (sLDA) and support vector machine (SVM) on some testing tasks.

Keywords: dataless text classification, short text, topic modeling, seed word, pseudo-document
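
The mapping from short texts to pseudo-documents described in the abstract can be illustrated with a minimal sketch. The abstract states only that there is one pseudo-document per word. The construction rule used here (a word's pseudo-document collects the words that co-occur with it in the same short text) is an assumption, and the function name `build_pseudo_documents` and the toy corpus are hypothetical.

```python
from collections import defaultdict


def build_pseudo_documents(short_texts):
    """Map tokenized short texts to one pseudo-document per word.

    Assumption (not stated in the abstract): the pseudo-document of a
    word w is the bag of words that co-occur with w in the same short text.
    """
    pseudo_docs = defaultdict(list)
    for tokens in short_texts:
        for i, word in enumerate(tokens):
            # Every other token in the same short text counts as co-occurring.
            pseudo_docs[word].extend(tokens[:i] + tokens[i + 1:])
    return dict(pseudo_docs)


# Hypothetical toy corpus, already tokenized.
corpus = [
    ["stock", "market", "falls"],
    ["market", "rally", "stock"],
    ["team", "wins", "match"],
]
pseudo = build_pseudo_documents(corpus)
```

Under this assumption, a word that appears in many short texts accumulates a long pseudo-document even though each individual text is short, which is one plausible way such a mapping could ease data sparsity.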

