Parsing kanji

Development

2014/5/16

faker, faker_japanese, hiragana, japanese, kanji, katakana, language processing, mecab, natto, romaji, ruby

denvazh

I promised myself that this post would be dead simple but very informative. So let’s skip the introduction and go straight to “why?”, followed by “how?”.

“Why?”

Recently I was working on a project that needed a lot of test data. In Ruby this is easy to produce with the Faker gem and/or the sequence feature of the FactoryGirl gem. Everything worked fine until I started feeding the tests real data in Japanese.

Faker supports locale switching and can, to some degree, generate Japanese names, but that functionality is quite limited. To generate names that come not only in kanji but also in katakana, hiragana and romaji (written in the Latin alphabet), I had to implement my own solution.

I don’t want to go into too much detail, because I have already explained everything quite well
here.

What I would like to explain in this post is how I generated that long list of static fake data.

“How?”

While searching the web for existing solutions, I found natto, a library that provides an interface to MeCab, a natural language processor for Japanese.

Because I use a Mac with Homebrew, I was able to install it like this:

$ brew install mecab mecab-ipadic

This grabs a recent version of MeCab and its dictionary.

Now we can start writing some Ruby code.

$ mkdir parse_kanji && cd parse_kanji

Create a Gemfile

$ bundle init

Open the Gemfile, delete all lines starting with gem and add this line:

gem "natto"

Install the dependencies

$ bundle install

From this point on you can create any Ruby script file and work normally.

Let’s do something interesting.
Suppose we have a file with a list of kanji words that we have no idea how to read, and we would like to create a CSV file with their readings.

面白
目黒
岡田
無論
外国
漠然

For this task we also include another gem that conveniently converts katakana to romaji. Add the line below to the Gemfile (and don’t forget to run bundle install again):

gem "romaji"

Now we can write our script. Below I point out a few important details and then give a reference to the full script.

We keep a reference to the natto interface for MeCab in a global variable. This is convenient for small scripts.

require 'natto'
$nm = Natto::MeCab.new

Then, for the conversion part of the code, we just need to implement two functions:

Conversion of a kanji string to katakana

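def to_katakana(s)
  arr = []

  $nm.parse(s) do |n|
    if n.char_type == 2
      # kanji token: the reading (yomi) is the second-to-last
      # field of the comma-separated feature string
      yomi = n.feature.split(',')[-2]
      arr << yomi
    else
      # anything else is kept exactly as written
      arr << n.surface
    end
  end

  arr.join
end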
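
To make the split(',')[-2] step easier to see, here is a minimal sketch that works on a plain feature string, so it runs without MeCab installed. The sample strings are illustrative and assume the IPA dictionary layout, in which reading and pronunciation are the last two of nine fields. reading_from_feature is a hypothetical helper, not part of natto. It also falls back to the surface form when the reading field is missing, which I assume is the case for words the dictionary does not know.

```ruby
# Hypothetical helper: pull the reading out of an IPAdic-style feature string.
# Assumed layout (9 comma-separated fields):
#   pos, pos1, pos2, pos3, conj. type, conj. form, base form, reading, pronunciation
def reading_from_feature(feature, surface)
  fields = feature.split(',')
  # With fewer than 9 fields there is no reading field; keep the surface form.
  return surface if fields.size < 9

  fields[-2]
end

# Illustrative feature strings, not captured from a real MeCab run.
puts reading_from_feature('名詞,固有名詞,地域,一般,*,*,目黒,メグロ,メグロ', '目黒')
puts reading_from_feature('名詞,一般,*,*,*,*,*', '目黒')
```

The same guard could be added to to_katakana, so that an unknown word stays as it is instead of being replaced by a placeholder field.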

Conversion to hiragana is more of a convenience method than an implementation of something new. We merely wrap the NKF method that converts a katakana string to hiragana, and we ask for the result in UTF-8 as well.

require 'nkf'

def to_hiragana(s)
  # -h1: katakana to hiragana, -w: UTF-8 output
  NKF.nkf('-h1 -w', s)
end
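
If NKF is not available, the same conversion can be sketched with String#tr, because the katakana and hiragana blocks are laid out in parallel in Unicode. This is an alternative I am adding for illustration, not what the script uses, and to_hiragana_tr is a hypothetical name.

```ruby
# Hypothetical alternative to the NKF wrapper: map katakana onto hiragana.
# Characters outside the ァ-ヶ range (kanji, Latin letters, the long
# vowel mark ー) are left untouched.
def to_hiragana_tr(s)
  s.tr('ァ-ヶ', 'ぁ-ゖ')
end

puts to_hiragana_tr('オモシロ')
```

Unlike NKF, this does no encoding detection, so it assumes the input is already a UTF-8 string.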

The full script can be found here.

Finally, running the script gives us the readings of the kanji above:

面白,オモシロ,おもしろ,omoshiro
目黒,メグロ,めぐろ,meguro
岡田,オカダ,おかだ,okada
無論,ムロン,むろん,muron
外国,ガイコク,がいこく,gaikoku
漠然,バクゼン,ばくぜん,bakuzen
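
For orientation, here is a rough sketch of how a main loop producing such lines might look. It is not the linked script: build_lines and its keyword arguments are hypothetical names, and the converters are passed in as callables so that the sketch runs without MeCab or the romaji gem. In the real script they would presumably be to_katakana, to_hiragana and the romaji gem's katakana-to-romaji conversion.

```ruby
# Hypothetical main loop: one "kanji,katakana,hiragana,romaji" line per word.
# The converters are injected, so the sketch itself has no gem dependencies.
def build_lines(words, katakana:, hiragana:, romaji:)
  words.map(&:strip).reject(&:empty?).map do |word|
    kata = katakana.call(word)
    [word, kata, hiragana.call(kata), romaji.call(kata)].join(',')
  end
end

# Stand-in converters for demonstration only (lookup tables instead of MeCab).
kana_for   = { '目黒' => 'メグロ' }
romaji_for = { 'メグロ' => 'meguro' }

lines = build_lines(
  ["目黒\n", "\n"],
  katakana: ->(w) { kana_for.fetch(w) },
  hiragana: ->(k) { k.tr('ァ-ヶ', 'ぁ-ゖ') },
  romaji:   ->(k) { romaji_for.fetch(k) }
)
puts lines
```

Passing the converters in keeps the loop easy to test with stubs. In a small script, calling the functions directly works just as well.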
