>>> import jieba
>>> import jieba.posseg as pseg
>>> words = pseg.cut("我爱北京天安门") # jieba default mode
>>> jieba.enable_paddle() # enable paddle mode (supported since v0.40; earlier versions do not support it)
>>> words = pseg.cut("我爱北京天安门", use_paddle=True) # paddle mode
>>> for word, flag in words:
... print('%s %s' % (word, flag))
...
我 r
爱 v
北京 ns
天安门 ns
> python -m jieba --help
Jieba command line interface.
positional arguments:
filename input file
optional arguments:
-h, --help show this help message and exit
-d [DELIM], --delimiter [DELIM]
use DELIM instead of ' / ' for word delimiter; or a
space if it is used without DELIM
-p [DELIM], --pos [DELIM]
enable POS tagging; if DELIM is specified, use DELIM
instead of '_' for POS delimiter
-D DICT, --dict DICT use DICT as dictionary
-u USER_DICT, --user-dict USER_DICT
use USER_DICT together with the default dictionary or
DICT (if specified)
-a, --cut-all full pattern cutting (ignored with POS tagging)
-n, --no-hmm don't use the Hidden Markov Model
-q, --quiet don't print loading messages to stderr
-V, --version show program's version number and exit
If no filename specified, use STDIN instead.
“Jieba” (Chinese for “to stutter”) Chinese text segmentation: built to be the best Python Chinese word segmentation module.
Features
Supports three segmentation modes:
Accurate Mode attempts to cut the sentence into the most accurate segmentations, which is suitable for text analysis.
Full Mode gets all the possible words from the sentence. Fast but not accurate.
Search Engine Mode, based on the Accurate Mode, attempts to cut long words into several short words, which can raise the recall rate. Suitable for search engines.
Manual installation: place the jieba directory in the current directory or Python site-packages directory, then import jieba to use it.
Algorithm
Based on a prefix dictionary structure to achieve efficient word graph scanning. Build a directed acyclic graph (DAG) for all possible word combinations.
Use dynamic programming to find the most probable combination based on the word frequency.
For unknown words, an HMM-based model with the Viterbi algorithm is used.
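The DAG-plus-dynamic-programming steps above can be sketched in miniature. The snippet below is an illustrative toy, not jieba's actual implementation: the small frequency dictionary and its counts are invented for the example, and unknown single characters fall back to frequency 1.

```python
import math

# Toy frequency dictionary (invented values, not jieba's dict.txt).
FREQ = {"我": 5, "来到": 4, "北京": 8, "清华": 3, "清华大学": 6, "华大": 1, "大学": 5}
TOTAL = sum(FREQ.values())

def build_dag(sentence):
    """For each start index, list every end index that forms a known word
    (a single character is always allowed as a fallback)."""
    dag = {}
    n = len(sentence)
    for i in range(n):
        ends = [j for j in range(i + 1, n + 1) if sentence[i:j] in FREQ]
        dag[i] = ends or [i + 1]
    return dag

def cut(sentence):
    """Dynamic programming from right to left: at each position, pick the
    word that maximizes the total log-probability of the remainder."""
    dag = build_dag(sentence)
    n = len(sentence)
    route = {n: (0.0, n)}
    for i in range(n - 1, -1, -1):
        route[i] = max(
            (math.log(FREQ.get(sentence[i:j], 1)) - math.log(TOTAL) + route[j][0], j)
            for j in dag[i]
        )
    i, words = 0, []
    while i < n:
        j = route[i][1]
        words.append(sentence[i:j])
        i = j
    return words

print(cut("我来到北京清华大学"))  # ['我', '来到', '北京', '清华大学']
```

Note how the most probable path keeps 清华大学 whole even though 清华, 华大 and 大学 are also candidate words in the DAG.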
Main Functions
Cut
The jieba.cut function accepts four input parameters: the string to be cut; cut_all, which controls the cut mode; HMM, which controls whether to use the Hidden Markov Model; and use_paddle, which controls whether to use paddle mode.
jieba.cut_for_search accepts two parameters: the string to be cut and whether to use the Hidden Markov Model. It cuts the sentence into short words suitable for search engines.
The input string can be a unicode/str object, or a str/bytes object encoded in UTF-8 or GBK. Note that GBK encoding is not recommended because it may be unexpectedly decoded as UTF-8.
jieba.cut and jieba.cut_for_search return a generator; use a for loop to get the segmentation result (in unicode).
jieba.lcut and jieba.lcut_for_search return a list.
jieba.Tokenizer(dictionary=DEFAULT_DICT) creates a new customized Tokenizer, which enables you to use different dictionaries at the same time. jieba.dt is the default Tokenizer, to which almost all global functions are mapped.
[Full Mode]: 我/ 来到/ 北京/ 清华/ 清华大学/ 华大/ 大学
[Accurate Mode]: 我/ 来到/ 北京/ 清华大学
[Unknown Word Recognition] 他, 来到, 了, 网易, 杭研, 大厦 (here "杭研" is not in the dictionary, but is identified by the Viterbi algorithm)
[Search Engine Mode]: 小明, 硕士, 毕业, 于, 中国, 科学, 学院, 科学院, 中国科学院, 计算, 计算所, 后, 在, 日本, 京都, 大学, 日本京都大学, 深造
Add a custom dictionary
Load dictionary
Developers can specify their own custom dictionary to be used alongside the jieba default dictionary. Jieba can identify new words on its own, but adding your own entries ensures higher accuracy.
Usage: jieba.load_userdict(file_name) # file_name is a file-like object or the path of the custom dictionary
The dictionary format is the same as that of dict.txt: one word per line; each line has three parts separated by spaces: word, word frequency, and POS tag, in that order. If file_name is a path or a file opened in binary mode, the dictionary must be UTF-8 encoded.
The word frequency and POS tag can each be omitted; an omitted word frequency is filled in with a suitable value.
For example:
创新办 3 i
云计算 5
凱特琳 nz
台中
Change a Tokenizer’s tmp_dir and cache_file attributes to specify the directory and file name of the cache file, for use on a restricted file system.
jieba.analyse.TextRank() creates a new TextRank instance.
Part of Speech Tagging
jieba.posseg.POSTokenizer(tokenizer=None) creates a new customized Tokenizer. tokenizer specifies the jieba.Tokenizer to internally use. jieba.posseg.dt is the default POSTokenizer.
Tags the POS of each word after segmentation, using labels compatible with ICTCLAS.
Example:
>>> import jieba.posseg as pseg
>>> words = pseg.cut("我爱北京天安门")
>>> for w in words:
... print('%s %s' % (w.word, w.flag))
...
我 r
爱 v
北京 ns
天安门 ns
Parallel Processing
Principle: Split target text by line, assign the lines into multiple Python processes, and then merge the results, which is considerably faster.
Based on the multiprocessing module of Python.
Usage:
jieba.enable_parallel(4) # Enable parallel processing. The parameter is the number of processes.
jieba.disable_parallel() # Disable parallel processing.
Example: https://github.com/fxsjy/jieba/blob/master/test/parallel/test_file.py
Result: on a four-core 3.4GHz Linux machine, accurate word segmentation of the Complete Works of Jin Yong reaches 1MB/s, 3.3 times the speed of the single-process version.
Note that parallel processing supports only default tokenizers, jieba.dt and jieba.posseg.dt.
Tokenize: return words with position
The input must be unicode
Default mode
result = jieba.tokenize(u'永和服装饰品有限公司')
for tk in result:
    print("word %s\t\t start: %d \t\t end:%d" % (tk[0], tk[1], tk[2]))
word 永和 start: 0 end:2
word 服装 start: 2 end:4
word 饰品 start: 4 end:6
word 有限公司 start: 6 end:10
Search mode
result = jieba.tokenize(u'永和服装饰品有限公司',mode='search')
for tk in result:
    print("word %s\t\t start: %d \t\t end:%d" % (tk[0], tk[1], tk[2]))
word 永和 start: 0 end:2
word 服装 start: 2 end:4
word 饰品 start: 4 end:6
word 有限 start: 6 end:8
word 公司 start: 8 end:10
word 有限公司 start: 6 end:10
Initialization
By default, Jieba doesn’t build the prefix dictionary unless it’s necessary. Building takes 1-3 seconds and is done only once. If you want to initialize Jieba manually, you can call:
import jieba
jieba.initialize() # (optional)
You can also specify the dictionary (not supported before version 0.28) :
jieba.set_dictionary('data/dict.txt.big')
Features
The code is compatible with both Python 2 and 3.
Paddle mode requires paddlepaddle-tiny (pip install paddlepaddle-tiny==1.6.1) and jieba v0.40 or above; for jieba versions below v0.40, upgrade with pip install jieba --upgrade. See the PaddlePaddle website for installation instructions.
Installation: easy_install jieba, pip install jieba, or pip3 install jieba; or run python setup.py install. Then import jieba to use it.
Main Functions
jieba.cut accepts four input parameters: the string to be segmented; cut_all, which controls whether full mode is used; HMM, which controls whether the HMM model is used; and use_paddle, which controls whether paddle mode is used. Paddle mode is lazily loaded: the enable_paddle interface installs paddlepaddle-tiny and imports the relevant code.
jieba.cut_for_search accepts two parameters: the string to be segmented and whether to use the HMM model. It is suited to segmentation for building search-engine inverted indexes, with finer granularity.
jieba.cut and jieba.cut_for_search both return an iterable generator; use a for loop to obtain each word (unicode), or use jieba.lcut and jieba.lcut_for_search to get a list directly.
jieba.Tokenizer(dictionary=DEFAULT_DICT) creates a customized tokenizer, which allows different dictionaries to be used at the same time. jieba.dt is the default tokenizer; all global segmentation functions are mappings of this tokenizer.
Load dictionary
The dictionary format is the same as dict.txt: one word per line; each line has three parts separated by spaces: the word, its frequency (may be omitted), and its POS tag (may be omitted), in that fixed order. If file_name is a path or a file opened in binary mode, the file must be UTF-8 encoded.
Change the tmp_dir and cache_file attributes of a tokenizer (jieba.dt by default) to specify the cache file's directory and file name respectively, for use on restricted file systems.
Custom dictionary: https://github.com/fxsjy/jieba/blob/master/test/userdict.txt
Usage example: https://github.com/fxsjy/jieba/blob/master/test/test_userdict.py
Adjust the dictionary
Use add_word(word, freq=None, tag=None) and del_word(word) to modify the dictionary dynamically at runtime.
Use suggest_freq(segment, tune=True) to tune the frequency of a single word so that it can (or cannot) be segmented out. Note: automatically computed frequencies may be ineffective when the HMM new-word discovery feature is used.
Code example:
Keyword Extraction Based on the TF-IDF Algorithm
import jieba.analyse
Code example (keyword extraction): https://github.com/fxsjy/jieba/blob/master/test/extract_tags.py
The inverse document frequency (IDF) corpus used for keyword extraction can be switched to a custom corpus path.
The stop-words corpus used for keyword extraction can be switched to a custom corpus path.
Example of returning the keywords together with their weights.
Keyword Extraction Based on the TextRank Algorithm
Paper: TextRank: Bringing Order into Texts
Basic idea:
Usage example:
See test/demo.py
jieba.posseg.POSTokenizer(tokenizer=None) creates a customized tokenizer; the tokenizer parameter specifies the jieba.Tokenizer used internally. jieba.posseg.dt is the default POS tokenizer. The POS mapping for paddle mode is shown below:
The paddle-mode POS and named-entity tag set comprises 24 POS tags (lowercase letters) and 4 named-entity category tags (uppercase letters).
Principle: split the target text by line, distribute the lines to multiple Python processes for parallel segmentation, then merge the results, yielding a considerable speedup.
Based on Python's built-in multiprocessing module; Windows is currently not supported.
Usage:
jieba.enable_parallel(4) # enable parallel mode; the argument is the number of processes
jieba.disable_parallel() # disable parallel mode
Example: https://github.com/fxsjy/jieba/blob/master/test/parallel/test_file.py
Experimental result: on a 4-core 3.4GHz Linux machine, accurate segmentation of the complete works of Jin Yong reached 1MB/s, 3.3 times the speed of the single-process version.
Note: parallel segmentation supports only the default tokenizers jieba.dt and jieba.posseg.dt.
ChineseAnalyzer: from jieba.analyse import ChineseAnalyzer
Command line usage: python -m jieba news.txt > cut_result.txt. The command-line options are shown in the --help output above.
Lazy Loading
jieba uses lazy loading: import jieba and jieba.Tokenizer() do not trigger dictionary loading immediately; the dictionary is loaded, and the prefix dictionary built, only when necessary. You can also initialize jieba manually if you wish. Versions before 0.28 could not specify the path of the main dictionary; with lazy loading you can now change it:
Example: https://github.com/fxsjy/jieba/blob/master/test/test_change_dictpath.py
Other Dictionaries
A smaller dictionary file with a lower memory footprint: https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.small
A dictionary file with better support for traditional Chinese: https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.big
Download the dictionary you need and overwrite jieba/dict.txt, or use jieba.set_dictionary('data/dict.txt.big').
Implementations in Other Languages
Jieba for Java
Author: piaolingxue. Repository: https://github.com/huaban/jieba-analysis
Jieba for C++
Author: yanyiwu. Repository: https://github.com/yanyiwu/cppjieba
Jieba for Rust
Authors: messense, MnO2. Repository: https://github.com/messense/jieba-rs
Jieba for Node.js
Author: yanyiwu. Repository: https://github.com/yanyiwu/nodejieba
Jieba for Erlang
Author: falood. Repository: https://github.com/falood/exjieba
Jieba for R
Author: qinwf. Repository: https://github.com/qinwf/jiebaR
Jieba for iOS
Author: yanyiwu. Repository: https://github.com/yanyiwu/iosjieba
Jieba for PHP
Author: fukuball. Repository: https://github.com/fukuball/jieba-php
Jieba for .NET (C#)
Author: anderscui. Repository: https://github.com/anderscui/jieba.NET/
Jieba for Go
Jieba for Android
Links
System Integration
Segmentation Speed
FAQ
1. How is the model data generated?
See: https://github.com/fxsjy/jieba/issues/7
2. Why is "台中" always segmented into "台 中"? (and similar cases)
Because P(台中) < P(台) × P(中): the frequency of "台中" is too low, so its word-formation probability is low.
Solution: force a higher frequency with jieba.add_word('台中') or jieba.suggest_freq('台中', True).
3. "今天天气 不错" should be segmented as "今天 天气 不错"? (and similar cases)
Solution: force a lower frequency with jieba.suggest_freq(('今天', '天气'), True), or delete the word directly with jieba.del_word('今天天气').
4. Words not in the dictionary are segmented out and the result is unsatisfactory?
Solution: disable new-word discovery: jieba.cut('丰田太省了', HMM=False) or jieba.cut('我们中出了一个叛徒', HMM=False).
For more questions, see: https://github.com/fxsjy/jieba/issues?sort=updated&state=closed
Revision History
https://github.com/fxsjy/jieba/blob/master/Changelog
Online demo
http://jiebademo.ap01.aws.af.cm/
(Powered by Appfog)
Usage
Install with easy_install jieba or pip install jieba, or run python setup.py install after extracting the archive.
Manual installation: place the jieba directory in the current directory or Python site-packages directory, then import jieba to use it.
Modify dictionary
Use add_word(word, freq=None, tag=None) and del_word(word) to modify the dictionary dynamically in programs.
Use suggest_freq(segment, tune=True) to adjust the frequency of a single word so that it can (or cannot) be segmented. Note that HMM may affect the final result.
Example:
import jieba.analyse
jieba.analyse.extract_tags(sentence, topK=20, withWeight=False, allowPOS=())
sentence: the text to be extracted
topK: return how many keywords with the highest TF/IDF weights; the default value is 20
withWeight: whether to return TF/IDF weights with the keywords; the default value is False
allowPOS: filter words to those with the listed POS tags; empty means no filtering
jieba.analyse.TFIDF(idf_path=None) creates a new TFIDF instance; idf_path specifies the IDF file path.
Example (keyword extraction): https://github.com/fxsjy/jieba/blob/master/test/extract_tags.py
Developers can specify their own custom IDF corpus in jieba keyword extraction:
jieba.analyse.set_idf_path(file_name) # file_name is the path of the custom corpus
Developers can specify their own custom stop-words corpus in jieba keyword extraction:
jieba.analyse.set_stop_words(file_name) # file_name is the path of the custom corpus
There is also a TextRank implementation available.
Use: jieba.analyse.textrank(sentence, topK=20, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v'))
Note that it filters POS by default.
Using Other Dictionaries
It is possible to use your own dictionary with Jieba, and there are also two dictionaries ready for download:
A smaller dictionary for a smaller memory footprint: https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.small
There is also a bigger dictionary that has better support for traditional Chinese (繁體): https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.big
By default, an in-between dictionary is used, called dict.txt and included in the distribution.
In either case, download the file you want, and then call jieba.set_dictionary('data/dict.txt.big') or just replace the existing dict.txt.
Segmentation speed