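The Python stopword list has been updated and now includes the latest hot words. These words can affect the accuracy of text analysis results, so they need to be excluded.
Python Stopword List: Updating the Hot Word List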

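1. Get the stopword list
We need a Chinese stopword list downloaded from the web. Note that jieba does not expose a stopword set as a module attribute; its jieba.analyse.set_stop_words(path) only configures the keyword extractors. The code below therefore assumes the downloaded list has been saved as stopwords.txt (an assumed file name), one word per line.
# Load the stopword list (stopwords.txt is an assumed file name)
with open('stopwords.txt', 'r', encoding='utf8') as f:
    stopwords = {line.strip() for line in f if line.strip()}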
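Downloaded stopword lists often need to be combined, and they sometimes contain blank lines, stray whitespace, or a byte-order mark. The sketch below is one possible way to handle that. The helper `load_stopwords`, its `extra` parameter, and the file name in the usage note are illustrative assumptions and are not part of jieba.

```python
from pathlib import Path


def load_stopwords(*paths, extra=()):
    """Union the words from several one-word-per-line files plus extra words."""
    stopwords = set(extra)
    for path in paths:
        # utf-8-sig also strips a byte-order mark if the file starts with one
        text = Path(path).read_text(encoding="utf-8-sig")
        for line in text.splitlines():
            word = line.strip()
            if word:  # skip blank lines
                stopwords.add(word)
    return stopwords
```

A call such as `load_stopwords("stopwords.txt", extra={"转发"})` would then return a single set that can be used for membership tests during filtering.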
2. Read the text data
Next we read the text data, which we assume is stored in a file named text_data.txt.
with open('text_data.txt', 'r', encoding='utf8') as f:
    text = f.read()
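3. Segment the text and remove stopwords
Use jieba to segment the text and remove the stopwords.
import jieba.posseg as pseg
# Segment the text; drop stopwords and whitespace-only tokens (the POS flag is not used)
words = [word for word, flag in pseg.cut(text) if word.strip() and word not in stopwords]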
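In practice, segmented text also contains punctuation and whitespace tokens that a stopword list may not cover. The following is a minimal sketch of a stricter filter, assuming the text has already been segmented into a list of tokens. The function name `filter_tokens`, the `min_len` parameter, and the toy token list are assumptions made for illustration.

```python
def filter_tokens(tokens, stopwords, min_len=1):
    """Keep tokens that are not stopwords, not pure punctuation, and long enough."""
    kept = []
    for token in tokens:
        token = token.strip()
        if len(token) < min_len or token in stopwords:
            continue
        # Drop tokens with no letter, digit, or CJK character (pure punctuation)
        if not any(ch.isalnum() for ch in token):
            continue
        kept.append(token)
    return kept


# Toy input standing in for the output of a segmenter such as jieba
tokens = ["我们", "的", "人工智能", "，", " ", "5G", "很", "热门", "\n"]
stopwords = {"我们", "的", "很"}
print(filter_tokens(tokens, stopwords))  # ['人工智能', '5G', '热门']
```

Setting `min_len=2` additionally drops single-character tokens, which are rarely useful as hot words.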
4. Count word frequencies

Use the Counter class from the collections module to count word frequencies.
from collections import Counter
# Count word frequencies
word_freq = Counter(words)
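5. Update the hot word list
Sort the words by frequency in descending order and take the top N as hot words.
# Update the hot word list
N = 20  # number of hot words to keep (assumed value, adjust as needed)
hotwords = word_freq.most_common(N)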
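To make the counting and top-N selection concrete, here is a small self-contained sketch on a toy word list. The helper `top_hotwords` and its `min_count` threshold are assumptions added for illustration; only `Counter` and `most_common` come from the steps above.

```python
from collections import Counter


def top_hotwords(words, n, min_count=1):
    """Return up to n (word, count) pairs, most frequent first."""
    counts = Counter(words)
    return [(w, c) for w, c in counts.most_common(n) if c >= min_count]


words = ["疫情", "5G", "疫情", "芯片", "5G", "疫情"]
print(top_hotwords(words, 2))               # [('疫情', 3), ('5G', 2)]
print(top_hotwords(words, 3, min_count=2))  # [('疫情', 3), ('5G', 2)]
```

Words with equal counts are returned in the order they were first encountered, so the cut-off at N can be somewhat arbitrary when many words tie.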
6. Output the hot word list
Write the hot word list to a file.
# Output the hot word list
with open('hotwords.txt', 'w', encoding='utf8') as f:
    for word, freq in hotwords:
        f.write(f'{word}: {freq}\n')
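If the hot word file is to be reused later, it helps to be able to read it back in the same "word: count" format. The sketch below shows one way to do a round trip. The function names `save_hotwords` and `load_hotwords` are hypothetical, and the parser assumes every non-empty line ends with ": " followed by an integer.

```python
def save_hotwords(hotwords, path):
    """Write (word, count) pairs as 'word: count' lines."""
    with open(path, "w", encoding="utf-8") as f:
        for word, freq in hotwords:
            f.write(f"{word}: {freq}\n")


def load_hotwords(path):
    """Read 'word: count' lines back into a list of (word, count) pairs."""
    hotwords = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            # Split on the last ': ' so a word that itself contains ': ' survives
            word, freq = line.rsplit(": ", 1)
            hotwords.append((word, int(freq)))
    return hotwords
```

Splitting from the right keeps the parser working even if a word happens to contain a colon.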
This completes the process of updating the hot word list using a Python stopword list.

The simple table below has two columns: one is the Python stopword list, the other is the updated hot word list.
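| Stopword list | Updated hot word list |
| --- | --- |
| a | 新冠病毒 (novel coronavirus) |
| about | 疫情 (epidemic) |
| above | 云计算 (cloud computing) |
| after | 5G |
| again | 人工智能 (artificial intelligence) |
| all | 大数据 (big data) |
| almost | 区块链 (blockchain) |
| along | 芯片 (chips) |
| also | 无人驾驶 (autonomous driving) |
| always | 虚拟现实 (virtual reality) |
| among | 生物技术 (biotechnology) |
| an | 量子计算 (quantum computing) |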
| and | |
| any | |
| are | |
| as | |
| at | |
| be | |
| because | |
| been | |
| before | |
| being | |
| below | |
| between | |
| both | |
| but | |
| by | |
| can | |
| could | |
| did | |
| do | |
| does | |
| doing | |
| down | |
| during | |
| each | |
| few | |
| for | |
| from | |
| further | |
| had | |
| has | |
| have | |
| having | |
| he | |
| her | |
| here | |
| hers | |
| herself | |
| him | |
| himself | |
| his | |
| how | |
| however | |
| i | |
| if | |
| in | |
| into | |
| is | |
| it | |
| its | |
| itself | |
| just | |
| kg | |
| km | |
| lb | |
| left | |
| like | |
| ln | |
| ltd | |
| m | |
| mg | |
| might | |
| ml | |
| mm | |
| more | |
| most | |
| mr | |
| mrs | |
| ms | |
| much | |
| must | |
| my | |
| myself | |
| n | |
| no | |
| nor | |
| not | |
| of | |
| off | |
| often | |
| on | |
| once | |
| only | |
| or | |
| other | |
| our | |
| ours | |
| ourselves | |
| out | |
| over | |
| own | |
| part | |
| per | |
| perhaps | |
| put | |
| rather | |
| re | |
| s | |
| same | |
| she | |
| should | |
| since | |
| so | |
| some | |
| such | |
| t | |
| than | |
| that | |
| the | |
| their | |
| theirs | |
| them | |
| themselves | |
| then | |
| there | |
| these | |
| they | |
| thick | |
| thin | |
| this | |
| those | |
| through | |
| to | |
| too | |
| under | |
| until | |
| up | |
| very | |
| was | |
| we | |
| well | |
| were | |
| what | |
| when | |
| where | |
| which | |
| while | |
| who | |
| whom | |
| why | |
| with | |
| within | |
| without | |
| would | |
| yet | |
| you | |
| your | |
| yours | |
| yourself | |
| yourselves |
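Note that the stopword list here is in English while the hot word list is in Chinese. The table is only an example; in practice the contents of both lists can be adjusted to fit actual needs. A stopword list usually contains common words that carry little meaning on their own, while a hot word list contains currently popular topics or keywords.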
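Adjusting the two lists over time can be sketched as follows: counts from a new batch of words are merged into a stored hot word list, and any word that has since been added to the stopword list is dropped. The function `update_hotwords` and the sample data are assumptions for illustration, not a prescribed procedure.

```python
from collections import Counter


def update_hotwords(previous, new_words, stopwords, n):
    """Merge counts from a new batch of words into an existing hot word list."""
    counts = Counter(dict(previous))  # start from the stored (word, count) pairs
    counts.update(w for w in new_words if w not in stopwords)
    for w in stopwords:  # drop words that have since become stopwords
        counts.pop(w, None)
    return counts.most_common(n)


previous = [("疫情", 5), ("5G", 2)]
new_words = ["5G", "芯片", "5G", "的", "芯片", "芯片"]
print(update_hotwords(previous, new_words, {"的"}, 3))
# [('疫情', 5), ('5G', 4), ('芯片', 3)]
```

Keeping the stored list as (word, count) pairs means the same structure can be written to and read from the hot word file between runs.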