• Research and Application of a Decision-Tree-Based Algorithm for Recognizing Sensitive-Word Variants

    Subjects: Computer Science >> Integration Theory of Computer Science; submitted 2019-04-01; cooperative journal: 《计算机应用研究》

    Abstract: To address the low efficiency of recognizing variant forms ("deformed bodies") of sensitive words in online text, this paper proposes a decision-tree-based algorithm for recognizing sensitive-word variants. First, it studies sensitive words and their variants by analyzing the characteristics of Chinese characters, their pronunciation, and related features. Second, it constructs a sensitive-word decision tree from a sensitive-word library. Finally, it calculates the sensitivity of text from new media such as Weibo using a multi-factor improvement model. The experimental results show that, when identifying Chinese sensitive words and their variants, the proposed algorithm achieves a highest recall of 95% and a highest precision of 94%. Compared with the improved algorithm based on finite automata, recall and precision increase by 19.8% and 21.1%, respectively. Compared with the sensitive-information decision tree filtering algorithm, recall and precision increase by 17.9% and 18.1%, respectively. The analysis shows that the algorithm is effective for recognizing and automatically filtering sensitive-word variants.
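
The abstract does not specify how the decision tree is built or traversed, so the following is only a minimal sketch of one plausible reading: a prefix tree over the sensitive-word library, walked character by character, that tolerates filler symbols inserted inside a word. The names `build_tree`, `find_matches`, `is_noise`, and the `END` marker are hypothetical, and treating every non-alphanumeric character as filler is an assumption made for illustration, not the paper's rule.

```python
from typing import Dict, List, Tuple

END = "__end__"  # hypothetical marker key: a library word ends at this node


def build_tree(words: List[str]) -> Dict:
    """Build a prefix tree (nested dicts) from a sensitive-word library."""
    root: Dict = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = word  # remember which word terminates here
    return root


def is_noise(ch: str) -> bool:
    """Assumption: any character that is not a letter, digit, or CJK
    character counts as filler inserted to disguise a word."""
    return not ch.isalnum()


def find_matches(text: str, root: Dict) -> List[Tuple[int, int, str]]:
    """Return (start, end, word) for every library word found in text,
    skipping filler characters that appear inside a word."""
    matches = []
    for start in range(len(text)):
        if is_noise(text[start]):
            continue  # a match never begins on a filler character
        node = root
        i = start
        while i < len(text):
            ch = text[i]
            if ch in node:
                node = node[ch]
                if END in node:
                    matches.append((start, i + 1, node[END]))
            elif is_noise(ch) and node is not root:
                pass  # filler inside a partial match: stay on this node
            else:
                break  # this branch of the tree cannot continue
            i += 1
    return matches
```

With a library containing only `"敏感"`, calling `find_matches("这是敏*感词", build_tree(["敏感"]))` reports one match covering the disguised spelling `敏*感`. Scanning from every start position keeps the sketch simple at the cost of speed; a finite-automaton formulation would avoid the rescans.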

  • Research on Recognition Methods for Variant Forms of Chinese Sensitive Words

    Subjects: Computer Science >> Integration Theory of Computer Science; submitted 2018-05-02; cooperative journal: 《计算机应用研究》

    Abstract: Purifying the network environment requires reviewing network information. Recognizing sensitive words in that information, especially the variant forms of Chinese sensitive words, is an urgent problem. By analyzing the structure and pronunciation of Chinese characters, this paper proposes a method for recognizing variant forms of Chinese sensitive words. The method comprises three algorithms: a sensitive-word recognition algorithm based on grouping confusable pinyin, for pinyin variants of a word; a string-abbreviation recognition algorithm, for abbreviated variants; and a KMP-based character-split recognition algorithm, for split variants. Together they improve the accuracy and efficiency of the review. The experimental results show that the proposed method has higher recall and precision when recognizing variant forms of Chinese sensitive words.
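
As an illustration of the KMP-based character-split idea only, the sketch below expands each character of a word into its components and then searches for that expanded spelling with a standard Knuth-Morris-Pratt matcher. The abstract gives no implementation details, so the `SPLITS` table (two hand-picked entries), `split_form`, and `find_split_variants` are hypothetical; a real system would need a full component dictionary, and the pinyin-grouping and abbreviation algorithms are not covered here.

```python
from typing import Dict, List

# Hypothetical component table: character -> its parts written side by side.
SPLITS: Dict[str, str] = {"好": "女子", "明": "日月"}


def failure_table(pattern: str) -> List[int]:
    """KMP failure function: for each prefix of pattern, the length of its
    longest proper prefix that is also a suffix."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_find_all(text: str, pattern: str) -> List[int]:
    """Return the start index of every occurrence of pattern in text."""
    if not pattern:
        return []
    fail = failure_table(pattern)
    hits: List[int] = []
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = fail[k - 1]  # keep going so overlapping hits are found
    return hits


def split_form(word: str) -> str:
    """Expand each character into its components when the table knows it."""
    return "".join(SPLITS.get(ch, ch) for ch in word)


def find_split_variants(text: str, word: str) -> List[int]:
    """Locate the component-split spelling of word inside text."""
    return kmp_find_all(text, split_form(word))
```

For example, `split_form("明天")` yields `"日月天"`, and `find_split_variants("今天日月天气", "明天")` locates that spelling at index 2. KMP runs in time linear in the text length plus the pattern length, which is the usual reason to prefer it over naive substring scanning here.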