Micro-blog Commercial Word Extraction Based On Improved TF-IDF Algorithm

IEEE TENCON 2013: Some existing micro-blog commercial word extraction algorithms consider only the relationship between a keyword and the number of times it appears in the texts, and ignore how the keyword is distributed within a given category, which reduces the accuracy of commercial word extraction. To address this problem, this paper studies the application of the TF-IDF algorithm to term weight calculation. Drawing on information theory and an analysis of the distribution of keywords within a class, it proposes an improved TF-IDF algorithm and applies it to term weight calculation. To test the feasibility of the improved algorithm, the massive volume of micro-blog posts was first classified into categories, and the improved TF-IDF algorithm was then used to calculate term weights across the categories; the calculation was implemented on the Hadoop distributed framework. The experimental results show that the improved TF-IDF algorithm is effective and feasible for micro-blog commercial word extraction. Compared with traditional algorithms, the improved algorithm greatly improved accuracy, and data processing speed also improved greatly under the Hadoop framework.
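The abstract does not give the improved formula, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes posts are already tokenized and grouped by category, computes a classic TF-IDF weight at the category level, and multiplies it by a hypothetical within-class factor (`coverage`, the fraction of posts in the category that contain the term). The function name `class_term_weights` and the choice of factor are illustrative assumptions; the paper's information-theoretic term may differ.

```python
import math
from collections import Counter


def class_term_weights(classes):
    """Weight terms per category.

    classes: dict mapping a category name to a list of posts,
             where each post is a list of tokens.
    Returns: dict mapping category name to {term: weight}.
    """
    n_classes = len(classes)

    # Class frequency: in how many categories each term appears.
    class_freq = Counter()
    for posts in classes.values():
        class_freq.update({term for post in posts for term in post})

    result = {}
    for name, posts in classes.items():
        counts = Counter(term for post in posts for term in post)
        total = sum(counts.values())
        if not posts or total == 0:
            result[name] = {}
            continue

        weights = {}
        for term, count in counts.items():
            tf = count / total                              # term frequency in the category
            idf = math.log(n_classes / class_freq[term])    # rarity across categories
            # Assumed within-class distribution factor (not from the paper):
            # share of the category's posts that mention the term.
            coverage = sum(1 for post in posts if term in post) / len(posts)
            weights[term] = tf * idf * coverage
        result[name] = weights
    return result
```

With this factor, a term that is spread across many posts of one category and absent from the others is ranked above a term that appears in only a few posts or in every category. On Hadoop, the per-category counting would naturally map to a MapReduce job, which is omitted here.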


Keywords: TF-IDF algorithm; term weight calculation; Hadoop distributed framework; micro-blog commercial words; data processing; IEEE TENCON 2013

Speaker: Xing Huang Affiliation: School of Computer Science and Technology, Hangzhou Dianzi University

Duration: 0:14:05 Year: 2013