
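NLP Preprocessing: Encoding, Traditional-to-Simplified Conversion, Stop Words, Emoji, Tags

Date: 2023-02-24 00:31:05

Preface: the more experience I accumulate, the more important preprocessing problems seem to me, so it is time to organize my notes.

Environment: Mac, Anaconda2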

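1. Text encoding conversion

2. Traditional-to-simplified Chinese conversion

3. Stop words

4. Emoji and abnormal symbols

5. HTML/JSON/XML tag handling

6. Tokenization & character splitting

7. Full-width & half-width conversion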

1. Text encoding conversion

Python 2 vs Python 3

Python 2: the default encoding is ASCII, and text read from a file has type str. Demo of converting to UTF-8:

$ ipython
# Change the default encoding to UTF-8
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

s = '今天上海的天气也蛮不错的'
us = unicode(s)
print type(s), len(s), s
print type(us), len(us), us
s
us
'''
# Output
<type 'str'> 36 今天上海的天气也蛮不错的
<type 'unicode'> 12 今天上海的天气也蛮不错的
'\xe4\xbb\x8a\xe5\xa4\xa9\xe4\xb8\x8a\xe6\xb5\xb7\xe7\x9a\x84\xe5\xa4\xa9\xe6\xb0\x94\xe4\xb9\x9f\xe8\x9b\xae\xe4\xb8\x8d\xe9\x94\x99\xe7\x9a\x84'
u'\u4eca\u5929\u4e0a\u6d77\u7684\u5929\u6c14\u4e5f\u86ee\u4e0d\u9519\u7684'
'''

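Python 3:

To get the Unicode escape form of a Chinese string: Chinesename.encode('unicode_escape'). Reference: /xyq046463/article/details/58606657
Source files default to UTF-8 and text is shown as type str, so no encoding conversion is needed. This part is much nicer.
Unicode (str) to bytes: s.encode()
Bytes to Unicode (str): s.decode()
Usage example 1: when building a TensorFlow TFRecord, the input text has to be converted to bytes and passed to BytesList.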

# Preparing a TFRecord under Python 3
import tensorflow as tf

# doc: a list of sentences (str), defined elsewhere
values = [sentence.encode("utf-8") for sentence in doc]
record = {
    'text_feature_keys': tf.train.Feature(bytes_list=tf.train.BytesList(value=values)),
}

Example:

$ /anaconda3/bin/ipython
s = '今天上海的天气也蛮不错的'
bs = s.encode()
print('{},{},{}'.format(type(s), len(s), s))
print('{},{},{}'.format(type(bs), len(bs), bs))
'''
# Output
<class 'str'>,12,今天上海的天气也蛮不错的
<class 'bytes'>,36,b'\xe4\xbb\x8a\xe5\xa4\xa9\xe4\xb8\x8a\xe6\xb5\xb7\xe7\x9a\x84\xe5\xa4\xa9\xe6\xb0\x94\xe4\xb9\x9f\xe8\x9b\xae\xe4\xb8\x8d\xe9\x94\x99\xe7\x9a\x84'
'''

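Python 2 pitfalls to watch for: see my earlier blog post, "Python functions: encoding issues: the difference between str and Unicode".
For encoding work I still recommend Python 3. Python 2 will not be supported going forward, and encoding in Python 2 has far too many pitfalls.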
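When the encoding of incoming bytes is unknown, one option under Python 3 is to try a short list of candidate encodings in order. The sketch below is only an illustration of that idea: the helper name decode_with_fallback and the candidate list (UTF-8, then GB18030) are assumptions, not part of the original notes.

```python
def decode_with_fallback(raw, encodings=("utf-8", "gb18030")):
    """Decode bytes by trying each candidate encoding in order (hypothetical helper)."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Last resort: decode with the first candidate, replacing bad bytes with U+FFFD.
    return raw.decode(encodings[0], errors="replace"), encodings[0]


text = "今天上海的天气也蛮不错的"
utf8_bytes = text.encode("utf-8")  # 3 bytes per character here
gbk_bytes = text.encode("gbk")     # 2 bytes per character here

print(decode_with_fallback(utf8_bytes))
print(decode_with_fallback(gbk_bytes))
```

Trying UTF-8 first is a deliberate choice: it is strict enough that GBK-encoded Chinese text usually fails to decode as UTF-8, so the fallback is reached instead of producing mojibake.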

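2. Traditional-to-simplified Chinese conversion

Background: Weibo posts, Dianping reviews, and Douban film reviews all contain some traditional Chinese. Its share is small but cannot be ignored: without preprocessing, the computer treats a traditional character as a separate character, so during training it cannot share the semantics of its simplified counterpart.

Pitfall: there are nasty cases where the conversion also swaps vocabulary (the length changes, and by a lot):
泡麵 -> 方便面
雪糕 -> 冰淇淋

Conversion method 1 (slow)
Files to download locally: zh_wiki.py, langconv.py
Usage: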

from langconv import *

keyword = '飛機飛向藍天'
# Traditional to simplified
keyword = Converter("zh-hans").convert(keyword.decode("utf-8")).encode("utf-8")
# Simplified to traditional
keyword = Converter("zh-hant").convert(keyword.decode("utf-8")).encode("utf-8")

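Reference: "Implementing traditional and simplified Chinese conversion in Python"
Drawback: slow, because every word is run through the traditional-character dictionary. When I processed over a million reviews, adding this conversion step immediately stretched preprocessing time by several minutes.

Conversion method 2 (fast)
Use the conversion that ships with the snownlp package. Demo: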

$ pip install SnowNLP
$ ipython
from snownlp import SnowNLP
s = SnowNLP(u'「繁體字」「繁體中文」的叫法在臺灣亦很常見。')
print s.han

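Files snownlp needs for the conversion: trie.py and zh.py, found in the utils and normal folders of the snownlp package, i.e.
/anaconda2/lib/python2.7/site-packages/snownlp/utils/trie.py
/anaconda2/lib/python2.7/site-packages/snownlp/normal/zh.py
In practice only these two files are required: copy them into your own project or onto the production machine and call the transfer function in zh.py.
In essence, snownlp uses a trie to speed up the conversion. For simplified-to-traditional, invert the zh2hans variable into hans2zh and build the trie object the same way.

The hanziconv package
This package is decent and does not rewrite 雪糕 as 冰激凌. However, HanziConv.toSimplified('附註') is not converted to simplified, whereas SnowNLP('附註').han is.
/berniey/hanziconv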
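The table-driven conversion described above boils down to longest-match replacement against a mapping table. The sketch below shows that idea with a plain dict instead of a trie, for brevity. The tiny TRAD_TO_SIMP table and the function name to_simplified are made up for illustration and are not snownlp's actual data or API.

```python
# Toy mapping table (assumption): a real table has thousands of entries.
TRAD_TO_SIMP = {
    "飛": "飞",
    "機": "机",
    "藍": "蓝",
    "體": "体",
    "附註": "附注",  # multi-character entry, matched before single characters
}
MAX_KEY_LEN = max(len(key) for key in TRAD_TO_SIMP)


def to_simplified(text, table=TRAD_TO_SIMP, max_len=MAX_KEY_LEN):
    """Greedy longest-match replacement; unmapped characters pass through."""
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for size in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in table:
                out.append(table[chunk])
                i += size
                break
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


print(to_simplified("飛機飛向藍天"))
```

A table restricted to character-level entries keeps the output length equal to the input length. Phrase-level entries such as 泡麵 -> 方便面 are what change the length.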

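3. Stop words

Common stop word lists: plenty of resources are available online, so details are omitted: /download/u010454729/11010194
Many NLP packages also ship stop words, for example the stop words in nltk: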

import nltk
# Needs the corpus once: nltk.download('stopwords')
stopwords = nltk.corpus.stopwords.words('english')
# 179 English stop words; no Chinese

Pitfall: the problem you are likely to run into is, once again, encoding.
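A minimal Python 3 sketch of loading a stop word list and filtering tokens with it. The one-word-per-line file layout, the helper names, and the file name in the comment are assumptions. An in-memory io.StringIO stands in for the real file so the example runs on its own.

```python
import io


def load_stopwords(fileobj):
    """Read one stop word per line, skipping blank lines."""
    return {line.strip() for line in fileobj if line.strip()}


def remove_stopwords(tokens, stopwords):
    """Keep only the tokens that are not in the stop word set."""
    return [tok for tok in tokens if tok not in stopwords]


# Stand-in for open("stopwords.txt", encoding="utf-8"); the file name is hypothetical.
fake_file = io.StringIO("的\n也\n了\n\n")
stopwords = load_stopwords(fake_file)

tokens = ["今天", "上海", "的", "天气", "也", "蛮", "不错", "的"]
print(remove_stopwords(tokens, stopwords))
```

Passing the encoding explicitly when opening the real file avoids depending on the platform default, which is where the encoding pitfall usually comes from.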

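4. Emoji and abnormal symbols

Background: Weibo posts, reviews, and similar text contain a large number of emoji. Emoji can sometimes serve as a feature, for example in sentiment analysis, but for most other tasks they carry little meaning and can be removed. Whether or not you remove them, you need to know how to extract/remove emoji and other symbols from text.

Symbols:
English punctuation: from string import punctuation as enpunctuation
Chinese punctuation: zhonPunctuation = u'''＂＃＄％＆＇（）＊＋，－／：；＜＝＞＠［＼］＾＿｀｛｜｝～｟｠｢｣､〃〈〉《》「」『』【】〔〕〖〗〘〙〚〛〜〝〞〟〰〾〿 – — ‘ ’ ‛ “ ” „ ‟ … ‧ ﹏ ﹑ ﹔ · ！？｡、。'''
punctuations = set([unicode(i) for i in enpunctuation]) | set([unicode(i) for i in zhonPunctuation])

BERT's punctuation handling: full-width characters such as the greater-than, less-than, and equals signs are not treated as punctuation. (What is going on here???)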

import unicodedata

def _is_punctuation(char):
    """Checks whether `chars` is a punctuation character."""
    cp = ord(char)
    # We treat all non-letter/number ASCII as punctuation.
    # Characters such as "^", "$", and "`" are not in the Unicode
    # Punctuation class but we treat them as punctuation anyways, for
    # consistency.
    if ((cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or
            (cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126)):
        return True
    cat = unicodedata.category(char)
    if cat.startswith("P"):
        return True
    return False

sentence = '上海＃＄％＆＇（）＊＋，－／：；＜＝＞＠［＼］＾＿｀｛｜｝～｟｠｢｣､\xa0〃〈〉《》「」『』【】〔〕〖〗〘〙〚〛〜〝〞〟〰〾〿–—‘’‛“”„‟…‧﹏﹑﹔·！？｡、。\xa0'
for i in sentence:
    print(i, _is_punctuation(i))
'''
上 False
海 False
＃ True
＄ False
％ True
＆ True
＇ True
（ True
） True
＊ True
＋ False
， True
－ True
／ True
： True
； True
＜ False
＝ False
＞ False
＠ True
［ True
＼ True
］ True
＾ False
＿ True
｀ False
｛ True
｜ False
｝ True
～ False
｟ True
｠ True
｢ True
｣ True
､ True
  False   <- the \xa0 character
〃 True
〈 True
〉 True
《 True
》 True
「 True
」 True
『 True
』 True
【 True
】 True
〔 True
〕 True
〖 True
〗 True
〘 True
〙 True
〚 True
〛 True
〜 True
〝 True
〞 True
〟 True
〰 True
〾 False
〿 False
– True
— True
‘ True
’ True
‛ True
“ True
” True
„ True
‟ True
… True
‧ True
﹏ True
﹑ True
﹔ True
· True
！ True
？ True
｡ True
、 True
。 True
  False   <- the \xa0 character
'''
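A likely explanation for the False results above: outside the ASCII ranges, the BERT check only accepts Unicode categories that start with P (punctuation), while full-width characters such as ＜, ＝, ＞, ＋, and ＄ are classified as symbols (categories that start with S). If symbols should count as well, one option is to accept both groups. The function name is_punct_or_symbol below is hypothetical.

```python
import unicodedata


def is_punct_or_symbol(char):
    """True for Unicode punctuation (P*) and symbol (S*) categories."""
    return unicodedata.category(char)[0] in ("P", "S")


# Print the category next to the decision for a few characters.
for ch in "上＄＋＜＝＞＾｀｜～，。":
    print(ch, unicodedata.category(ch), is_punct_or_symbol(ch))
```

Accepting the S categories also covers the ASCII cases the BERT code special-cases by code point range (such as "^", "$", and "`"), since those are symbols too.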

Emoji, box drawing, faces: https://apps.timwhitlock.info/emoji/tables/unicode#block-6c-other-additional-symbols

import re
import emoji  # pip install emoji

# A more complete way to filter emoji
title = '''The process changes a little each time and that’s what makes it fun❤️ #learnfromme #learnontiktok #songwriter #singer #music #fyp #foryou Bussin✨ use my sound & tag meee? #fyp #foryou #bussin #wishmeluck'''
title = title.replace(':', '_')  # replace colons in title with another symbol first
# Decode emoji into :name: form with the emoji package, then strip the colon-delimited names with a regex.
# After decoding, emoji become e.g. :red_heart_selector:, :sparkles:
title = re.sub('(:.+?:)', ' ', emoji.demojize(title))

def filterEmoji(desstr, restr=' '):
    # Filter emoji
    try:
        co = re.compile(u'[\U00010000-\U0010ffff]')
    except re.error:
        # Narrow Python builds: match surrogate pairs instead
        co = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')
    return co.sub(restr, desstr)

def filterBoxDrawing(desstr, restr=' '):
    # Filter box-drawing characters such as ╠ and ╤
    co = re.compile(u'[\u2500-\u257f]')
    return co.sub(restr, desstr)

def filterFace(desstr, restr=' '):
    # Filter bracketed faces such as [衰], [生气], [开心], [捂脸]; a dictionary would work better
    p = re.compile('\[.{1,4}\]')
    t = p.findall(desstr)
    for i in t:
        desstr = desstr.replace(i, restr)
    return desstr

def filterSpecialSym(desstr, restr=' '):
    # print u'1\u20e3\ufe0f'  # the 10 special emoji-like keycap symbols
    co = re.compile(u'[0-9]?\u20e3\ufe0f?')
    return co.sub(restr, desstr)

def bodyNorm(body):
    # body = re.compile(u'''\\\\\\\\\\\\\\\\n''').sub(' ', body)  # needs 16 backslashes to work, shocking
    body = re.compile(u'''\\\\+?n''').sub(' ', body)
    body = filterSpecialSym(body)
    body = filterEmoji(body)
    body = filterBoxDrawing(body)
    body = filterFace(body)
    return body

# contentbody: the raw UTF-8 byte string to clean, defined elsewhere
ucontentbody = unicode(contentbody, 'utf-8')
print 'Before:', ucontentbody
body = bodyNorm(ucontentbody)
print 'After:', body

import jieba
from string import punctuation as enpunctuation
zhonPunctuation = u'''＂＃＄％＆＇（）＊＋，－／：；＜＝＞＠［＼］＾＿｀｛｜｝～｟｠｢｣､〃〈〉《》「」『』【】〔〕〖〗〘〙〚〛〜〝〞〟〾〿 – — ‘ ’ ‛ “ ” „ ‟ … ‧ ﹏ ﹑ ﹔ · ！？｡、。'''
punctuations = set([unicode(i) for i in enpunctuation]) | set([unicode(i) for i in zhonPunctuation])
words = [seg for seg in jieba.cut(body) if seg != ' ' and seg not in punctuations]
print 'Tokens:', ' '.join(words)
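The filters above are written for Python 2. A rough Python 3 equivalent using only the standard library might look like the sketch below. The pattern names and clean_text are made up for illustration, and the keycap pattern is an assumption that accepts the variation selector on either side of U+20E3.

```python
import re

ASTRAL = re.compile("[\U00010000-\U0010ffff]")      # most emoji live outside the BMP
BOX_DRAWING = re.compile("[\u2500-\u257f]")         # box-drawing block
KEYCAP = re.compile("[0-9#*]\ufe0f?\u20e3\ufe0f?")   # keycap sequences (digit + U+20E3)
FACE_TAG = re.compile(r"\[[^\[\]]{1,4}\]")          # bracketed faces such as [开心]
EXTRA_SPACE = re.compile(r"\s+")


def clean_text(text, repl=" "):
    """Replace each matched span with repl, then collapse runs of whitespace."""
    for pattern in (KEYCAP, ASTRAL, BOX_DRAWING, FACE_TAG):
        text = pattern.sub(repl, text)
    return EXTRA_SPACE.sub(" ", text).strip()


print(clean_text("今天\U0001F600好开心[开心]╠1\ufe0f\u20e3楼"))
```

Two limitations are worth keeping in mind: emoji inside the BMP (for example U+2764, the heart) are not caught by the astral-plane range, and the bracket pattern also removes any short bracketed text, which is why a dictionary of known face names is the safer choice.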

The ultimate weapon against all troublesome characters: drop everything that is not Chinese or English (Chinese Unicode range: [0x4E00, 0x9FA5])

pChineseEnglishText = re.compile(u'[\u4E00-\u9FA5\s\w]').findall(desstr)

ChineseEnglishText = "".join(pChineseEnglishText)
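Under Python 3, \w matches word characters from every script by default, so a pattern that relies on it no longer means "English only". A sketch with explicit ranges, assuming the goal is to keep Chinese characters in [0x4E00, 0x9FA5], ASCII letters and digits, and whitespace. The function name keep_chinese_english is hypothetical.

```python
import re

# Explicit ranges instead of \w, which also matches other scripts in Python 3.
KEEP = re.compile(r"[\u4e00-\u9fa5A-Za-z0-9\s]")


def keep_chinese_english(text):
    """Keep only the characters that match KEEP, in their original order."""
    return "".join(KEEP.findall(text))


print(keep_chinese_english("上海☀Costco，100w+人关注！"))
```

This drops punctuation too, so sentence boundaries are lost. If those matter, split into sentences before applying the filter.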

5. HTML/JSON/XML tag handling

Omitted.
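As a rough starting point for HTML input, Python 3's standard-library html.parser can pull the visible text out of markup. The sketch below is one possible approach, not a complete solution: the class name TextExtractor and the function strip_html are made up, and JSON or XML input would need json or xml.etree.ElementTree instead.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes; skips the contents of script and style elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # turns &nbsp; etc. into characters
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def strip_html(markup):
    """Return the visible text with whitespace runs collapsed to one space."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join("".join(parser.parts).split())


print(strip_html("<p>上海&nbsp;<b>天气</b>不错</p><script>var x = 1;</script>"))
```

A parser is used here instead of a regex such as <[^>]+> because it also decodes character references and lets script and style contents be skipped.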

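6. Tokenization & character splitting

The difference between character splitting and tokenization: after splitting into characters, building bigram and trigram combinations loses no information. New words and trending words, for example, are hard for a tokenizer to discover, yet every word is guaranteed to be covered by the bigrams and trigrams built from the split characters. The drawback is the sheer volume: with about 5k common characters, bigram combinations reach 100k+ and trigram combinations 10 million+. Filtering by frequency reduces the candidates.

Tokenization: makes it convenient to work with words and avoids the dimensionality explosion caused by combining characters into words.

Common way to split text into characters: sentence = "今天上海天气不错。", word = [i for i in sentence]

BERT splitting: BERT ships its own Tokenizer. Chinese is split per character. English words in its vocabulary are kept whole, and other words are broken into subword pieces marked with ##. Tokens that are not in the vocabulary become [UNK], for example the 枞 in 鸡枞菇.

Desired splitting, for example: sentence = '上海最近开得costco，超级火爆，100w+人关注。' -> 上海最近开得 costco ，超级火爆， 100w + 人关注。
That is, consecutive English letters and digits should stay together, instead of every English word and number run being split character by character. Similar functionality can be built on the unicodedata package. Code: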

# python3
import string

def splitSentence(sentence):
    # e.g. sentence = 'POPMART快闪1235测试Star。TestFor678abc'
    words = []
    slength = len(sentence)
    idx = 0
    while idx < slength:
        word = sentence[idx]
        if word in string.ascii_letters:
            # Extend the run while the next character is an ASCII letter
            delta = 1
            while (idx + delta) <= slength:
                if (idx + delta) == slength or sentence[idx + delta] not in string.ascii_letters:
                    words.append(''.join(sentence[idx:idx + delta]))
                    break
                else:
                    delta += 1
            idx = idx + delta
        elif word in string.digits:
            # Extend the run while the next character is a digit
            delta = 1
            while (idx + delta) <= slength:
                if (idx + delta) == slength or sentence[idx + delta] not in string.digits:
                    words.append(''.join(sentence[idx:idx + delta]))
                    break
                else:
                    delta += 1
            idx = idx + delta
        else:
            words.append(word)
            idx += 1
    return words


# Basic splitting with unicodedata: basic_tokenizer.py
# coding=utf8
import unicodedata


class BaseTokenizer(object):

    def __init__(self):
        pass

    def tokenize(self, _str):
        """Splits text so that each CJK character, whitespace character, or
        punctuation mark is its own token; other runs stay together."""
        output = []
        tmp_str = ''
        for char in _str:
            cp = ord(char)
            if self._is_chinese_char(cp) or self._is_whitespace(char) or self._is_punctuation(char):
                if len(tmp_str) > 0:
                    output.append(tmp_str)
                    tmp_str = ''
                output.append(char)
            else:
                tmp_str += char
        if len(tmp_str) > 0:
            output.append(tmp_str)
        return output

    def _is_chinese_char(self, cp):
        """Checks whether CP is the codepoint of a CJK character."""
        # This defines a "chinese character" as anything in the CJK Unicode block:
        # /wiki/CJK_Unified_Ideographs_(Unicode_block)
        #
        # Note that the CJK Unicode block is NOT all Japanese and Korean characters,
        # despite its name. The modern Korean Hangul alphabet is a different block,
        # as is Japanese Hiragana and Katakana. Those alphabets are used to write
        # space-separated words, so they are not treated specially and handled
        # like all of the other languages.
        if ((cp >= 0x4E00 and cp <= 0x9FFF) or  #
                (cp >= 0x3400 and cp <= 0x4DBF) or  #
                (cp >= 0x20000 and cp <= 0x2A6DF) or  #
                (cp >= 0x2A700 and cp <= 0x2B73F) or  #
                (cp >= 0x2B740 and cp <= 0x2B81F) or  #
                (cp >= 0x2B820 and cp <= 0x2CEAF) or
                (cp >= 0xF900 and cp <= 0xFAFF) or  #
                (cp >= 0x2F800 and cp <= 0x2FA1F)):  #
            return True
        return False

    @staticmethod
    def _is_punctuation(char):
        """Checks whether `chars` is a punctuation character."""
        cp = ord(char)
        # We treat all non-letter/number ASCII as punctuation.
        # Characters such as "^", "$", and "`" are not in the Unicode
        # Punctuation class but we treat them as punctuation anyways, for
        # consistency.
        if ((cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or
                (cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126)):
            return True
        cat = unicodedata.category(char)
        if cat.startswith("P"):
            return True
        return False

    @staticmethod
    def _is_whitespace(char):
        """Checks whether `chars` is a whitespace character."""
        # \t, \n, and \r are technically control characters but we treat them
        # as whitespace since they are generally considered as such.
        if char == " " or char == "\t" or char == "\n" or char == "\r":
            return True
        cat = unicodedata.category(char)
        if cat == "Zs":
            return True
        return False


if __name__ == "__main__":
    test_str = "我要我要我要watch book，我要我要写字字"
    test_arr = BaseTokenizer().tokenize(test_str)
    print(test_arr)
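The character n-gram idea described at the start of this section, including the frequency filter, can be sketched as follows. The function names and the min_count threshold are assumptions chosen for illustration, and the three-sentence corpus is a toy example.

```python
from collections import Counter


def char_ngrams(sentence, n):
    """All contiguous character n-grams of a sentence."""
    return [sentence[i:i + n] for i in range(len(sentence) - n + 1)]


def frequent_ngrams(sentences, n, min_count=2):
    """Count n-grams over a corpus and keep those seen at least min_count times."""
    counts = Counter()
    for sent in sentences:
        counts.update(char_ngrams(sent, n))
    return {gram: c for gram, c in counts.items() if c >= min_count}


corpus = ["今天上海天气不错", "上海天气很好", "北京天气不错"]
print(char_ngrams("上海天气", 2))
print(frequent_ngrams(corpus, 2))
```

Even on this toy corpus the filter keeps fragments that are not words, such as 海天 and 气不, so frequency alone only narrows the candidate list. It does not decide what counts as a word.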

7. Full-width & half-width conversion

def strQ2B(ustring):
    """Convert full-width characters in a string to half-width."""
    ss = []
    for s in ustring:
        rstring = ""
        for uchar in s:
            inside_code = ord(uchar)
            if inside_code == 12288:  # full-width space maps directly to the ASCII space
                inside_code = 32
            elif (inside_code >= 65281 and inside_code <= 65374):  # other full-width characters differ by a fixed offset
                inside_code -= 65248
            rstring += chr(inside_code)
        ss.append(rstring)
    return ''.join(ss)


def strB2Q(ustring):
    """Convert half-width characters in a string to full-width."""
    ss = []
    for s in ustring:
        rstring = ""
        for uchar in s:
            inside_code = ord(uchar)
            if inside_code == 32:  # half-width space maps directly to the full-width space
                inside_code = 12288
            elif (inside_code >= 33 and inside_code <= 126):  # other half-width characters differ by a fixed offset
                inside_code += 65248
            rstring += chr(inside_code)
        ss.append(rstring)
    return ''.join(ss)
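For the full-width to half-width direction, an alternative to the hand-written offset arithmetic in strQ2B is Unicode NFKC normalization from the standard library. This is a sketch of that alternative, not a drop-in replacement, and the wrapper name to_halfwidth_nfkc is hypothetical.

```python
import unicodedata


def to_halfwidth_nfkc(text):
    """Fold full-width forms to ASCII via NFKC compatibility normalization."""
    return unicodedata.normalize("NFKC", text)


print(to_halfwidth_nfkc("Ｃｏｓｔｃｏ　１００ｗ＋"))
```

NFKC is broader than strQ2B: it also rewrites other compatibility characters (for example ㎏ becomes kg and ① becomes 1), so the explicit range-based version is the safer choice when only the full-width ASCII block should change.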

Reference: /sparkexpert/article/details/82749207

Edited: -03-10
