NAME

    Lingua::ZH::Jieba - Perl wrapper for CppJieba (Chinese text
    segmentation)

VERSION

    version 0.007

SYNOPSIS

        use Lingua::ZH::Jieba;
    
        binmode STDOUT, ":utf8";
        
        my $jieba = Lingua::ZH::Jieba->new();
    
        # default cut (mixed MP/HMM method)
        my $words = $jieba->cut("他来到了网易杭研大厦");
        print join('/', @$words), "\n";
        # 他/来到/了/网易/杭研/大厦
    
        # cut without HMM (MP method only)
        my $words_nohmm = $jieba->cut(
            "他来到了网易杭研大厦",
            { no_hmm => 1 } );
        print join('/', @$words_nohmm), "\n";
        # 他/来到/了/网易/杭/研/大厦
    
        # cut all (full mode: emit every word found in the dictionary)
        my $words_cutall = $jieba->cut(
            "我来到北京清华大学",
            { cut_all => 1 } );
        print join('/', @$words_cutall), "\n";
        # 我/来到/北京/清华/清华大学/华大/大学
    
        # cut for search (cut with the Mix method first, then re-cut
        # longer words with the full method)
        my $words_cut4search = $jieba->cut_for_search(
            "小明硕士毕业于中国科学院计算所,后在日本京都大学深造" );
        print join('/', @$words_cut4search), "\n";
        # 小明/硕士/毕业/于/中国/科学/学院/科学院/中国科学院/计算/计算所/,/后/在/日本/京都/大学/日本京都大学/深造
    
        # get word offset and length with cut_ex() or cut_for_search_ex()
        my $words_ex = $jieba->cut_ex("他来到了网易杭研大厦");
        # [
        #   [ "ä»–",     0, 1 ],
        #   [ "来到",   1, 2 ],
        #   [ "了",     3, 1 ],
        #   [ "网易",   4, 2 ],
        #   [ "杭研",   6, 2 ],
        #   [ "大厦",   8, 2 ],
        # ]
    
        # part-of-speech tagging
        my $word_pos_tags = $jieba->tag("我是蓝翔技工拖拉机学院手扶拖拉机专业的。");
        for my $pair (@$word_pos_tags) {
            my ($word, $part_of_speech) = @$pair;
            print "$word:$part_of_speech\n";
        }
        # 我:r
        # 是:v
        # 蓝翔:nz
        # 技工:n
        # 拖拉机:n
        # ...
    
        # keyword extraction
        my $extractor = $jieba->extractor();
        my $word_scores = $extractor->extract(
            "我是拖拉机学院手扶拖拉机专业的。不用多久,我就会升职加薪,当上CEO,走上人生巅峰。",
            5
        );
        for my $pair (@$word_scores) {
            my ($word, $score) = @$pair;
            printf "%s:%.3f\n", $word, $score;
        }
        # CEO:11.739
        # 升职:10.856
        # 加薪:10.643
        # 手扶拖拉机:10.009
        # 巅峰:9.494
    
        # insert user word (dynamically add a user word)
        my $words_before_insert = $jieba->cut("男默女泪");
        print join('/', @$words_before_insert), "\n";
        # 男默/女泪
    
        $jieba->insert_user_word("男默女泪");
    
        my $words_after_insert = $jieba->cut("男默女泪");
        print join('/', @$words_after_insert), "\n";
        # 男默女泪

DESCRIPTION

    This module is the Perl wrapper for CppJieba, which is a C++
    implementation of the Jieba Chinese text segmentation library. The
    Perl/C++ binding is generated via SWIG.

    This distribution contains several packages. Unless stated otherwise,
    you only need to use Lingua::ZH::Jieba; in your programs.

    At present this module is still in an alpha state. Its interface is
    subject to change in the future, although I will try to keep it
    backward compatible where possible.

CONSTRUCTOR

 new

        my $jieba = Lingua::ZH::Jieba->new;

    By default the constructor uses the data files from the "share"
    directory of the installation. It is possible to override any of the
    data files as shown below.

        my $jieba = Lingua::ZH::Jieba->new(
            {
                dict_path => $my_dict_path,
                hmm_path => $my_hmm_path,
                user_dict_path => $my_user_dict_path,
                idf_path => $my_idf_path,
                stop_word_path => $my_stop_word,
            }
        );
        # if you would just like to override the user dictionary
        my $jieba = Lingua::ZH::Jieba->new(
            {
                user_dict_path => $my_user_dict_path,
            }
        );

METHODS

 cut

        my $words = $jieba->cut($sentence);

    Default cut mode. Returns an arrayref of the words (UTF-8 strings)
    cut from the sentence.

        my $words = $jieba->cut($sentence, { no_hmm => 1 });

    Cut without HMM mode.

        my $words = $jieba->cut($sentence, { cut_all => 1 });

    Cut all possible words in dictionary.
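
    For example, to compare the three cut modes on the same sentence (a
    minimal sketch; $sentence is any UTF-8 string as in the SYNOPSIS):

        # print the same sentence cut in each of the three modes
        print join('/', @{ $jieba->cut($sentence) }), "\n";
        print join('/', @{ $jieba->cut($sentence, { no_hmm => 1 }) }), "\n";
        print join('/', @{ $jieba->cut($sentence, { cut_all => 1 }) }), "\n";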

 cut_ex

        my $words_ex = $jieba->cut_ex($sentence);

    Similar to cut(), but returns an arrayref of complex data. Each element
    in the result arrayref is [ word, offset, length ].
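
    In the SYNOPSIS example above the offsets and lengths are counted in
    characters. A minimal sketch of walking the triples (assuming
    $sentence is a UTF-8 string as in the SYNOPSIS):

        my $words_ex = $jieba->cut_ex($sentence);
        for my $item (@$words_ex) {
            my ($word, $offset, $length) = @$item;
            printf "%s\t(offset %d, length %d)\n", $word, $offset, $length;
        }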

 cut_for_search

        my $words = $jieba->cut_for_search($sentence);
        my $words_nohmm = $jieba->cut_for_search($sentence, { no_hmm => 1 });

    Cut in search-engine mode: the sentence is first cut with the default
    (Mix) method, and longer words from that result are cut again with
    the full method (see the SYNOPSIS example).

 cut_for_search_ex

        my $words_ex = $jieba->cut_for_search_ex($sentence);

    Similar to cut_for_search(), but returns an arrayref of complex data.
    Each element in the result arrayref is [ word, offset, length ].

 tag

        my $word_pos_tags = $jieba->tag($sentence);
        for my $pair (@$word_pos_tags) {
            my ($word, $part_of_speech) = @$pair;
            ...
        }

    POS (part-of-speech) tagging. Returns an arrayref in which each
    element is of the form [ $word, $part_of_speech ].

 insert_user_word

        $jieba->insert_user_word($word);

    Dynamically inserts a user word.
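
    A minimal sketch of loading user words from a file at startup (the
    file name my_terms.txt and its one-word-per-line format are
    hypothetical, not part of this module):

        open my $fh, '<:encoding(UTF-8)', 'my_terms.txt'
            or die "cannot open my_terms.txt: $!";
        while (my $term = <$fh>) {
            chomp $term;
            $jieba->insert_user_word($term) if length $term;
        }
        close $fh;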

 extractor

        my $extractor = $jieba->extractor();

    Get the keyword extractor object. For more about the extractor, see
    Lingua::ZH::Jieba::KeywordExtractor.
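
    A minimal usage sketch, mirroring the SYNOPSIS (the second argument
    to extract() is the number of top keywords to return):

        my $word_scores = $extractor->extract($sentence, 5);
        for my $pair (@$word_scores) {
            my ($word, $score) = @$pair;
            printf "%s:%.3f\n", $word, $score;
        }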

SEE ALSO

    https://github.com/fxsjy/jieba - Jieba, the Chinese text segmentation
    library

    https://github.com/yanyiwu/cppjieba - CppJieba, Jieba implemented in
    C++

    http://www.swig.org - SWIG, the Simplified Wrapper and Interface
    Generator

ACKNOWLEDGEMENTS

    Thanks to Junyi Sun and Yanyi Wu. This Perl library would not exist
    without their work on Jieba and CppJieba.

AUTHOR

    Stephan Loyd <stephanloyd9@gmail.com>

COPYRIGHT AND LICENSE

    This software is copyright (c) 2017-2023 by Stephan Loyd.

    This is free software; you can redistribute it and/or modify it under
    the same terms as the Perl 5 programming language system itself.