NAME
    Lingua::JA::NormalizeText - All-in-One Japanese text normalizer

SYNOPSIS
      use Lingua::JA::NormalizeText;
      use utf8;

      my @options = ( qw/nfkc decode_entities/, \&dearinsu_to_desu );
      my $normalizer = Lingua::JA::NormalizeText->new(@options);

      my $text = $normalizer->normalize('鳥が㌧㌦でありんす&hearts;');
      # => '鳥がトンドルです♥'

      sub dearinsu_to_desu
      {
          my $text = shift;
          $text =~ s/でありんす/です/g;

          return $text;
      }

      # or

      use Lingua::JA::NormalizeText qw/old2new_kanji/;
      use utf8;

      my $text = old2new_kanji('惡の華');
      # => '悪の華'

DESCRIPTION
    All-in-One Japanese text normalizer.

METHODS
  new(@options)
    Creates a new Lingua::JA::NormalizeText instance.

    The following options are available:

      OPTION                  SAMPLE INPUT                  OUTPUT FOR SAMPLE INPUT
      ---------------------   ---------------------------   -----------------------
      lc                      DdD                           ddd
      uc                      DdD                           DDD
      nfkc                    ｶﾞ                            ガ (U+30AC)
      nfkd                    ｶﾞ                            ガ (U+30AB, U+3099)
      nfc                     ド                            ド (U+30C9)
      nfd                     ド                            ド (U+30C8, U+3099)
      decode_entities         &hearts;                      ♥
      strip_html              <em>あ</em>                   あ
      alnum_z2h               ＡＢＣ１２３                  ABC123
      alnum_h2z               ABC123                        ＡＢＣ１２３
      space_z2h               \x{3000}                      \x{0020}
      space_h2z               \x{0020}                      \x{3000}
      katakana_z2h            ハァハァ                      ﾊｧﾊｧ
      katakana_h2z            ｽｰﾊｰｽｰﾊｰ                      スーハースーハー
      katakana2hiragana       パンツ                        ぱんつ
      hiragana2katakana       ぱんつ                        パンツ
      wave2tilde              〜, 〰                        ～
      tilde2wave              ～                            〜
      wavetilde2long          〜, 〰, ～                    ー
      wave2long               〜, 〰                        ー
      tilde2long              ～                            ー
      fullminus2long          －                            ー
      dashes2long             —                             ー
      drawing_lines2long      ─                             ー
      unify_long_repeats      ヴァーーー                    ヴァー
      nl2space                (LF)(CR)(CRLF)                (space)(space)(space)
      unify_nl                (LF)(CR)(CRLF)                \n\n\n
      unify_long_spaces       あ(space)(space)あ            あ(space)あ
      unify_whitespaces       \x{00A0}                      (space)
      trim                    (space)あ(space)あ(space)     あ(space)あ
      ltrim                   (space)あ(space)              あ(space)
      rtrim                   ああ(space)(space)            ああ
      old2new_kana            ゐヰゑヱヸヹ                  いイえエイ゙エ゙
      old2new_kanji           亞逸鬭                        亜逸闘
      tab2space               (tab)(tab)                    (space)(space)
      remove_controls         あ\x{0000}あ                  ああ
      remove_spaces           \x{0020}あ\x{3000}あ\x{0020}  ああ
      dakuon_normalize        さ\x{3099}                    ざ (U+3056)
      handakuon_normalize     は\x{309A}                    ぱ (U+3071)
      all_dakuon_normalize    さ\x{3099}は\x{309A}          ざぱ (U+3056, U+3071)
      square2katakana
                              ㌢                            センチ
      circled2kana            ㋙㋛㋑㋟㋑                    コシイタイ
      circled2kanji           ㊩㊫㊚㊒㊖                    医学男有財

    The options are applied in the order in which they appear in @options:
    the first element is applied first and the last element is applied last.

    External functions can also be added. (See the dearinsu_to_desu function
    in the SYNOPSIS section.)

  normalize($text)
    Normalizes $text.

OPTIONS
  lc, uc
    These options are the same as CORE::lc and CORE::uc.

  nfkc, nfkd, nfc, nfd
    See Unicode::Normalize.

  decode_entities
    See HTML::Entities.

  strip_html
    Strips the HTML tags from the given text.

  alnum_z2h, alnum_h2z
    Converts English letters, digits and symbols between ZENKAKU
    (full-width) and HANKAKU (half-width).

      ZENKAKU:
        ＇［ｖｏ，～４ｃ９Ｆｕ＿ＭＧＴＷＰｑ￣｠ＶｉＩｒ：ＺＸ］ｌ＞
        ｝￦！｜ｘ６％ｔ＾８ｅＤＫ５ｊ－￠ｈ１｛Ｕ２ＮＨ＆０＃Ｏｎ￢
        ＠｟ｆ３ＱａｐＪ￥？Ａｗ＼＄＂ＢｍＣ７；￤＝ｙ＋ｇＹＲｂＬｋ
        ）Ｓ｀Ｅ（￡＊．ｚｓ／＜ｄ

      HANKAKU:
        '[vo,~4c9Fu_MGTWPq¯⦆ViIr:ZX]l>
        }₩!|x6%t^8eDK5j-¢h1{U2NH&0#On¬
        @⦅f3QapJ¥?Aw\$"BmC7;¦=y+gYRbLk
        )S`E(£*.zs/<d

  space_z2h, space_h2z
    SPACE (U+0020) <-> IDEOGRAPHIC SPACE (U+3000)

  katakana_z2h, katakana_h2z
    Converts katakana ZENKAKU <-> HANKAKU.

    See Lingua::JA::Regular::Unicode.
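
The ordering rule described under new(@options) can be illustrated without the module itself. The following is only a sketch under stated assumptions: apply_in_order and doru_to_dollar are hypothetical helpers written for this example (they are not part of Lingua::JA::NormalizeText), and the core function Unicode::Normalize::NFKC stands in for the nfkc option.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Normalize qw/NFKC/;

# Hypothetical stand-in for the normalizer (not the module's code):
# applies each step to the text in list order.
sub apply_in_order {
    my ($text, @steps) = @_;
    $text = $_->($text) for @steps;
    return $text;
}

# Hypothetical external function, in the style of dearinsu_to_desu.
sub doru_to_dollar {
    my $text = shift;
    $text =~ s/ドル/dollar/g;
    return $text;
}

my $nfkc = sub { my $text = shift; return NFKC($text) };

# NFKC first: U+3326 (SQUARE DORU) is expanded to "ドル" before the
# substitution runs, so the substitution matches.
my $nfkc_first = apply_in_order("100\x{3326}", $nfkc, \&doru_to_dollar);

# NFKC last: the substitution still sees U+3326 and replaces nothing.
my $nfkc_last = apply_in_order("100\x{3326}", \&doru_to_dollar, $nfkc);
```

With these assumptions, $nfkc_first ends up as "100dollar" while $nfkc_last stays "100ドル", which is why an external function that relies on normalized text should be listed after the option that produces it.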
  hiragana2katakana
      INPUT:
        ぷゔにむていでべゞゐふとおりげそづよはつざしゃのっねひぃたょ
        けまれびやがぽぬぺくぞぱごをへずかぴゅゎあきゖぇどだろもえわ
        んぶぜめなちばぢるすぁゕぼらぉゝぐほさゑぎみせじこぅゆう

      OUTPUT FOR INPUT:
        プヴニムテイデベヾヰフトオリゲソヅヨハツザシャノッネヒィタョ
        ケマレビヤガポヌペクゾパゴヲヘズカピュヮアキヶェドダロモエワ
        ンブゼメナチバヂルスァヵボラォヽグホサヱギミセジコゥユウ

  katakana2hiragana
      INPUT:
        リボズシキｭﾙﾈグネキェヱテクニトロドェコヽﾁガﾍトｩダヤレ
        ニチソノｿｻパヨァノハゴゲォヮモヰルヲムアテゼポフハャサッラ
        ﾏアィョｳｵオクメﾕゥヂギメウナススラセザブフヘコカペカイヾ
        エワヴンタャホョヨツゾバプモセムケリデﾐミホケイヒッユツマヵ
        タレピジシヌビヅヌィﾝｴァォヶナｦュﾔロﾋベワ

      OUTPUT FOR INPUT:
        りぼずしきゅるねぐねきぇゑてくにとろどぇこゝちがへとぅだやれ
        にちそのそさぱよぁのはごげぉゎもゐるをむあてぜぽふはゃさっら
        まあぃょうおおくめゆぅぢぎめうなすすらせざぶふへこかぺかいゞ
        えわゔんたゃほょよつぞばぷもせむけりでみみほけいひっゆつまゕ
        たれぴじしぬびづぬぃんえぁぉゖなをゅやろひべわ

  wave2tilde
    Converts WAVE DASH (U+301C) and WAVY DASH (U+3030) into tilde (U+FF5E).

  tilde2wave
    Converts tilde (U+FF5E) into wave (U+301C).

  wavetilde2long
    Converts WAVE DASH (U+301C), WAVY DASH (U+3030) and tilde (U+FF5E)
    into long (U+30FC).

  wave2long
    Converts WAVE DASH (U+301C) and WAVY DASH (U+3030) into long (U+30FC).

  tilde2long
    Converts tilde (U+FF5E) into long (U+30FC).

  fullminus2long
    Converts FULLWIDTH HYPHEN-MINUS (U+FF0D) into long (U+30FC).

  dashes2long
    Converts the following characters into long (U+30FC).

      U+2012  FIGURE DASH
      U+2013  EN DASH
      U+2014  EM DASH
      U+2015  HORIZONTAL BAR

    Note that this option does not convert hyphens into long.

  drawing_lines2long
    Converts the following characters into long (U+30FC).

      U+2500  BOX DRAWINGS LIGHT HORIZONTAL
      U+2501  BOX DRAWINGS HEAVY HORIZONTAL
      U+254C  BOX DRAWINGS LIGHT DOUBLE DASH HORIZONTAL
      U+254D  BOX DRAWINGS HEAVY DOUBLE DASH HORIZONTAL
      U+2574  BOX DRAWINGS LIGHT LEFT
      U+2576  BOX DRAWINGS LIGHT RIGHT
      U+2578  BOX DRAWINGS HEAVY LEFT
      U+257A  BOX DRAWINGS HEAVY RIGHT

  unify_long_repeats
    Collapses consecutive long (U+30FC) characters into one.

  nl2space
    Converts new lines (LF, CR, CRLF) into SPACE (U+0020).

  unify_nl
    Converts each new line (LF, CR, CRLF) into \n.

  unify_long_spaces
    Collapses runs of repeated spaces (U+0020 and U+3000) into one.

  unify_whitespaces
    Converts the following characters into SPACE (U+0020).

      U+000B  LINE TABULATION
      U+000C  FORM FEED
      U+0085  NEXT LINE
      U+00A0  NO-BREAK SPACE
      U+1680  OGHAM SPACE MARK
      U+180E  MONGOLIAN VOWEL SEPARATOR
      U+2000  EN QUAD
      U+2001  EM QUAD
      U+2002  EN SPACE
      U+2003  EM SPACE
      U+2004  THREE-PER-EM SPACE
      U+2005  FOUR-PER-EM SPACE
      U+2006  SIX-PER-EM SPACE
      U+2007  FIGURE SPACE
      U+2008  PUNCTUATION SPACE
      U+2009  THIN SPACE
      U+200A  HAIR SPACE
      U+2028  LINE SEPARATOR
      U+2029  PARAGRAPH SEPARATOR
      U+202F  NARROW NO-BREAK SPACE
      U+205F  MEDIUM MATHEMATICAL SPACE

    Note that this option does not convert the following characters:

      U+0009  CHARACTER TABULATION
      U+000A  LINE FEED
      U+000D  CARRIAGE RETURN
      U+3000  IDEOGRAPHIC SPACE

  trim
    Removes leading and trailing whitespace.

  ltrim
    Removes only leading whitespace.

  rtrim
    Removes only trailing whitespace.
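
The code point list given for unify_whitespaces can be read as a single transliteration. The sketch below is an assumption about how such a mapping could be written in core Perl, not the module's implementation; sketch_unify_whitespaces is a hypothetical name introduced only for this example.

```perl
use strict;
use warnings;

# Hypothetical helper, not the module's code: maps every code point
# listed under unify_whitespaces to SPACE (U+0020). Because the
# replacement list is shorter than the search list, tr/// reuses its
# last character (the space) for every remaining search character.
sub sketch_unify_whitespaces {
    my $text = shift;
    $text =~ tr/\x{000B}\x{000C}\x{0085}\x{00A0}\x{1680}\x{180E}\x{2000}-\x{200A}\x{2028}\x{2029}\x{202F}\x{205F}/ /;
    return $text;
}

# NO-BREAK SPACE and EM SPACE are converted; IDEOGRAPHIC SPACE and
# CHARACTER TABULATION are left alone, matching the exclusion list.
my $unified = sketch_unify_whitespaces("a\x{00A0}b\x{2003}c\x{3000}d\te");
```

Here $unified is "a b c\x{3000}d\te": U+3000 and the tab survive, so space_z2h and tab2space would still be needed to normalize them.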
  old2new_kana
      INPUT    OUTPUT FOR INPUT
      -----    --------------------
      ゐ       い
      ヰ       イ
      ゑ       え
      ヱ       エ
      ヸ       イ゙ (U+30A4, U+3099)
      ヹ       エ゙ (U+30A8, U+3099)

  old2new_kanji
      INPUT:
        亞惡壓圍爲醫壹逸稻飮隱營榮衞驛謁圓緣艷鹽奧應橫歐毆黃溫穩假價
        禍畫會壞悔懷海繪慨槪擴殼覺學嶽樂喝渴褐勸卷寬歡漢罐觀關陷顏器
        既歸氣祈龜僞戲犧舊據擧虛峽挾狹鄕響曉勤謹區驅勳薰徑惠揭溪經繼
        莖螢輕鷄藝擊缺儉劍圈檢權獻硏縣險顯驗嚴效廣恆鑛號國穀黑濟碎齋
        劑櫻册殺雜參慘棧蠶贊殘祉絲視齒兒辭濕實舍寫煮社者釋壽收臭從澁
        獸縱祝肅處暑緖署諸敍奬將涉燒祥稱證乘剩壤孃條淨狀疊讓釀囑觸寢
        愼眞神盡圖粹醉隨髓數樞瀨聲靜齊攝竊節專戰淺潛纖踐錢禪曾祖僧雙
        壯層搜插巢爭瘦總莊裝騷增憎臟藏贈卽屬續墮體對帶滯臺瀧擇澤單嘆
        擔膽團彈斷癡遲晝蟲鑄著廳徵懲聽敕鎭塚遞鐵轉點傳都黨盜燈當鬭德
        獨讀突屆繩難貳惱腦霸廢拜梅賣麥發髮拔繁晚蠻卑碑祕濱賓頻敏甁侮
        福拂佛倂塀竝變邊勉辨瓣辯舖步穗寶襃豐墨沒飜每萬滿免麵默餠戾彌
        藥譯豫餘與譽搖樣謠來賴亂欄覽隆龍虜兩獵綠壘淚類勵禮隸靈齡曆歷
        戀練鍊爐勞廊朗樓郞錄灣堯巖晉槇渚猪琢瑤祐祿禎穰聰遙

      OUTPUT FOR INPUT:
        亜悪圧囲為医壱逸稲飲隠営栄衛駅謁円縁艶塩奥応横欧殴黄温穏仮価
        禍画会壊悔懐海絵慨概拡殻覚学岳楽喝渇褐勧巻寛歓漢缶観関陥顔器
        既帰気祈亀偽戯犠旧拠挙虚峡挟狭郷響暁勤謹区駆勲薫径恵掲渓経継
        茎蛍軽鶏芸撃欠倹剣圏検権献研県険顕験厳効広恒鉱号国穀黒済砕斎
        剤桜冊殺雑参惨桟蚕賛残祉糸視歯児辞湿実舎写煮社者釈寿収臭従渋
        獣縦祝粛処暑緒署諸叙奨将渉焼祥称証乗剰壌嬢条浄状畳譲醸嘱触寝
        慎真神尽図粋酔随髄数枢瀬声静斉摂窃節専戦浅潜繊践銭禅曽祖僧双
        壮層捜挿巣争痩総荘装騒増憎臓蔵贈即属続堕体対帯滞台滝択沢単嘆
        担胆団弾断痴遅昼虫鋳著庁徴懲聴勅鎮塚逓鉄転点伝都党盗灯当闘徳
        独読突届縄難弐悩脳覇廃拝梅売麦発髪抜繁晩蛮卑碑秘浜賓頻敏瓶侮
        福払仏併塀並変辺勉弁弁弁舗歩穂宝褒豊墨没翻毎万満免麺黙餅戻弥
        薬訳予余与誉揺様謡来頼乱欄覧隆竜虜両猟緑塁涙類励礼隷霊齢暦歴
        恋練錬炉労廊朗楼郎録湾尭巌晋槙渚猪琢瑶祐禄禎穣聡遥

  tab2space
    Converts CHARACTER TABULATION (U+0009) into SPACE (U+0020).
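
The old2new_kana table can be read as a plain lookup from six obsolete kana to their modern spellings. The following is a minimal sketch under that reading, not the module's implementation; %OLD2NEW_KANA and sketch_old2new_kana are hypothetical names introduced for this example, and the code points are taken from the table.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical lookup table built from the old2new_kana table.
my %OLD2NEW_KANA = (
    "\x{3090}" => "\x{3044}",           # ゐ -> い
    "\x{30F0}" => "\x{30A4}",           # ヰ -> イ
    "\x{3091}" => "\x{3048}",           # ゑ -> え
    "\x{30F1}" => "\x{30A8}",           # ヱ -> エ
    "\x{30F8}" => "\x{30A4}\x{3099}",   # ヸ -> イ + combining dakuten
    "\x{30F9}" => "\x{30A8}\x{3099}",   # ヹ -> エ + combining dakuten
);

# Hypothetical helper, not the module's code: replaces each old kana
# with its entry in the lookup table and leaves everything else as is.
sub sketch_old2new_kana {
    my $text = shift;
    $text =~ s/([\x{3090}\x{3091}\x{30F0}\x{30F1}\x{30F8}\x{30F9}])/$OLD2NEW_KANA{$1}/g;
    return $text;
}

my $converted = sketch_old2new_kana("\x{3090}\x{30F8}x");
```

Note that the last two entries produce two code points each (base kana plus U+3099), so the output can be longer than the input.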
  remove_controls
    Removes the following characters:

      U+0000 - U+0008
      U+000B
      U+000C
      U+000E - U+001F
      U+007E - U+009F

    Note that this option does not remove the following characters:

      U+0009  CHARACTER TABULATION
      U+000A  LINE FEED
      U+000D  CARRIAGE RETURN

  remove_spaces
    Removes SPACE (U+0020) and IDEOGRAPHIC SPACE (U+3000).

  dakuon_normalize, handakuon_normalize, all_dakuon_normalize
    See Lingua::JA::Dakuon.

  square2katakana, circled2kana, circled2kanji
    See Lingua::JA::Moji.

AUTHOR
    pawa <pawapawa@cpan.org>

SEE ALSO
    新旧字体表: <http://www.asahi-net.or.jp/~ax2s-kmtn/ref/old_chara.html>

    Lingua::JA::Regular::Unicode

    Lingua::JA::Dakuon

    Lingua::JA::Moji

    Unicode::Normalize

    Unicode::Number

    HTML::Entities

    HTML::Scrubber

LICENSE
    This library is free software; you can redistribute it and/or modify it
    under the same terms as Perl itself.