Esperanto de Masoris

Esperanto, the international language created by Dr. Zamenhof in 1887

Hanja Hangul Converter 0.0.7

漢字, Unicode

2013. 11. 6.


hanja-0.0.7.7z

kordic.rar



 

1. License

All files in Hanja Hangul Converter are released into the public domain. Anyone may use them freely.

 

Hanja Hangul Converter project homepage: http://kldp.net/projects/hanja/

Author: Masoris Sung-il Kim (masoris@gmail.com)

 

2. File Descriptions

1. hanja.py: A simple Hanja-to-Hangul converter, written in Python.

2. hanconv.py: The Hanja Hangul Converter module.

3. README.odt and README.pdf: The document you are reading now.

4. dic0.txt: A list for converting CJK Compatibility Forms to Han unification. This list does the same thing as the Unicode normalization algorithm.

5. dic1.txt: A Hanja character list that does not apply duuembeobchik (두음법칙). In the Korean language as used in South Korea, this rule changes the first sound of a Sino-Korean word to make it easier to pronounce. The rule is not used in North Korea. The original hanja-hangul.ods file is available for download. (Han unification)

6. dic2.txt: Representative phonetic data from Unihan. (Han unification)

7. dic3.txt: Representative phonetic data from Unihan. (CJK Compatibility Forms)

8. dic4.txt: A list of Sino-Korean words, extracted from libhangul-0.0.4 and converted with the Unicode normalization algorithm. (Han unification)

9. dic5.txt: An exception list for converting Hanja to Hangul. See 3.2 for details. (Han unification)

10. dic6.txt: (experimental) A database for converting Hangul to Hanja.
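The README does not spell out the layout of the dic*.txt files. Judging from the loader in hanconv.py, each useful line holds a source string and its replacement separated by a tab, and lines containing '#' or '?' are skipped. A minimal sketch of that reading, written for Python 3 (the project itself targets Python 2); the name parse_dictionary and the sample lines are illustrative and not part of the project:

```python
def parse_dictionary(text):
    """Parse tab-separated dictionary text into (source, target) pairs.

    Follows the filtering seen in hanconv.py: lines without a tab, lines
    containing '#' or '?', empty fields, and pairs of unequal length are
    skipped. This is a sketch, not the project's own loader.
    """
    pairs = []
    for line in text.splitlines():
        if "\t" not in line or "#" in line or "?" in line:
            continue
        fields = line.split("\t")
        source, target = fields[0], fields[1]
        if not source or not target or len(source) != len(target):
            continue
        pairs.append((source, target))
    return pairs


# Hypothetical sample content; the real dic files are much larger.
sample = "# comment line\n歷史\t역사\n韓國\t한국\n未詳\t?\n國\t\n"
pairs = parse_dictionary(sample)
print(pairs)
```

The equal-length check means a Hanja entry is kept only when it maps to the same number of Hangul syllables, which fits one-syllable-per-character conversion.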

 

3. How to use this database

3.1 What you need to know first

There are currently two ways to write Hanja in Unicode. The first uses both Han unification and CJK Compatibility Forms (the Unicode block formally named CJK Compatibility Ideographs, U+F900 to U+FAFF). This is the way most often used in everyday Korean text; MS Word and the Hangul word processor support it. Here the CJK Compatibility Forms indicate the pronunciation of a Hanja character. Most Hanja characters have a single pronunciation in Korean, but some have more than one (usually because of duuembeobchik, 두음법칙). Characters with several pronunciations are therefore mapped to several code points in CJK Compatibility Forms, one per pronunciation, which makes conversion to Hangul easy.

 

Unicode, however, does not recommend mapping one character to several code points, and a tool called the 'Unicode normalization algorithm' maps each character to a single code point. Text processed this way does not use CJK Compatibility Forms. Some websites, such as Wikipedia, follow this approach.

 

As a result, when someone converts Hanja text from Wikipedia to Hangul with MS Word, it is not converted correctly. Converting such text correctly requires a Sino-Korean dictionary; the file 'dic4.txt' serves this purpose.

 

Unlike South Korean, North Korean has no duuembeobchik, so for North Korean readings you only need 'dic1.txt', which does not apply the rule.
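The effect of the Unicode normalization algorithm described in 3.1 can be observed with Python's standard unicodedata module. The sketch below is written for Python 3 and uses NFC as a stand-in for what 'dic0.txt' does to Hanja; the function name to_unified is an assumption made for illustration.

```python
import unicodedata


def to_unified(text):
    """Replace CJK compatibility ideographs with their unified forms.

    Every Unicode normalization form (NFC, NFD, NFKC, NFKD) performs this
    replacement, because these characters have canonical decompositions.
    """
    return unicodedata.normalize("NFC", text)


# U+F902 is a compatibility form of U+8ECA (the character for "vehicle").
compat = "\uf902"
unified = to_unified(compat)
print("U+%04X -> U+%04X" % (ord(compat), ord(unified)))
```

Once a text has been normalized this way, the pronunciation hint carried by the compatibility code point is gone, which is why a word dictionary is needed afterwards.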

 

3.2 Convert Hanja text which uses only Han unification

Converting Hanja text that uses only Han unification (Hanja processed by the Unicode normalization algorithm, as in Wikipedia) requires the Sino-Korean dictionary 'dic4.txt', because some Hanja characters are pronounced differently depending on the word. The remaining problem is the characters that 'dic4.txt' cannot convert: the characters themselves carry no phonetic information, so the recommended approach is to convert them to their most commonly used sound. The recommended order of priority is therefore:

1.       dic4.txt

2.       dic5.txt

3.       dic1.txt

 

When converting this way, there is no guarantee that all of the text is written in Han unification only, so I recommend first converting the Hanja with 'dic0.txt', which maps all Hanja to Han unification, and then converting to Hangul.

 

'dic5.txt' makes the result more natural, because 'dic1.txt' does not take duuembeobchik or other phonetic factors into account. 'dic5.txt' contains Korean surnames and characters that always follow duuembeobchik.
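The priority order in 3.2 can be sketched as a series of replacement passes, where an earlier dictionary gets the first chance to match. This is a Python 3 sketch under stated assumptions: the three small lists are hypothetical stand-ins for 'dic4.txt', 'dic5.txt' and 'dic1.txt', not excerpts from them, and convert_layered is an illustrative name.

```python
def convert_layered(text, layers):
    """Apply each dictionary in order; earlier layers take priority."""
    for layer in layers:
        for source, target in layer:
            text = text.replace(source, target)
    return text


# Hypothetical miniature stand-ins for dic4.txt, dic5.txt and dic1.txt.
words = [("歷史", "역사")]    # word-level reading (in the role of dic4.txt)
exceptions = [("金", "김")]   # surname exception (in the role of dic5.txt)
characters = [("韓", "한"), ("國", "국"), ("歷", "력"), ("史", "사"), ("金", "금")]

print(convert_layered("韓國 歷史", [words, exceptions, characters]))
print(convert_layered("韓國 歷史", [characters]))
```

With all three layers the word 歷史 is caught by the word list before the per-character list can turn it into 력사; with the character list alone the result keeps the reading without duuembeobchik.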

 

3.3 Convert Hanja text which uses both Han unification and CJK Compatibility Forms

Converting Hanja that uses both Han unification and CJK Compatibility Forms, as MS Word and the Hangul word processor do, is simple: just convert with the following databases, which contain the representative phonetic data from Unihan.

1.       dic2.txt

2.       dic3.txt
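When compatibility code points are present, a plain per-character lookup is enough, because the code point itself selects the reading. A Python 3 sketch with a few hypothetical sample entries in the spirit of 'dic2.txt' and 'dic3.txt' (the table and the function name are illustrative):

```python
# Hypothetical sample entries in the spirit of dic2.txt and dic3.txt.
# U+8ECA is the unified character for "vehicle"; U+F902 is a
# compatibility form of the same character with a different reading.
readings = {
    "\u8eca": "차",  # unified form
    "\uf902": "거",  # compatibility form
    "自": "자",
    "動": "동",
    "轉": "전",
}


def convert_by_code_point(text):
    """Look up each character; unknown characters pass through unchanged."""
    return "".join(readings.get(ch, ch) for ch in text)


print(convert_by_code_point("自動\u8eca"))  # automobile
print(convert_by_code_point("自轉\uf902"))  # bicycle
```

The two inputs look identical on screen in their last character, yet they convert differently because they use different code points.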

 

3.4 Convert Hanja text to Hangul (without duuembeobchik)

1. Convert the Hanja to Han unification using 'dic0.txt'.

2. Then convert to Hangul using 'dic1.txt'.
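The two steps of 3.4 can be chained as follows. This is a Python 3 sketch, assuming Unicode normalization as a stand-in for 'dic0.txt' and a hypothetical four-entry table in place of 'dic1.txt'; the names are illustrative.

```python
import unicodedata

# Hypothetical sample of dic1.txt-style readings (no duuembeobchik applied).
base_readings = {
    "\u6b77": "력",  # first character of the word for "history"
    "\u53f2": "사",  # second character of the word for "history"
    "\u99ac": "마",  # horse
    "\u8eca": "차",  # vehicle (unified form)
}


def convert_without_duuem(text):
    """Step 1: unify compatibility forms. Step 2: map each character."""
    unified = unicodedata.normalize("NFC", text)  # stand-in for dic0.txt
    return "".join(base_readings.get(ch, ch) for ch in unified)


print(convert_without_duuem("\u6b77\u53f2"))
print(convert_without_duuem("\u99ac\uf902"))  # U+F902 becomes U+8ECA first
```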

 

4. hanja.py

'hanja.py' is a Python script that converts Hanja to Hangul.

 

4.1 Execution

4.1.1 In a Linux Terminal

To run 'hanja.py', do the following in a Linux terminal:

$ chmod a+x hanja.py    # Give execute permission to the hanja.py file.

$ ./hanja.py            # Then run it.

 

Or you can run it with the Python interpreter:

$ python hanja.py

 

4.1.2 In Microsoft Windows

1. Run 'hanja.py' by double-clicking it in Explorer.

2. If Python is not installed yet, download it from the official website, or download the Python 2.5 Windows installer directly, and install it.

 

4.2 Convert in Input Mode

Converting Hanja to Hangul in input mode is very simple.

$ ./hanja.py  # First, run the Python script.

Hanja Hangul Converter 0.0.4    by Sung-il KIM (masoris@gmail.com)

Commands: exit(종료), mode(방식), reverse(역변환), list(목록)

Type Hanja to convert and press enter.

 

Choose mode to convert Hanja to Hangul

1. Han unification only (Wikipedia, default)

2. Han unification and CJK Compatibility Forms (MS Word)

3. Without duuembeobchik (North Korean)

4. Apply Unicode normalization algorithm

5. (experiment) Compatible convert with both ways (1, 2)

6. (experiment) Convert Hangul to Hanja

> 3  # Choose the mode you want. Here I choose 'Without duuembeobchik (North Korean)'.

Load dic1.txt file, 27496 of indexes

Total 27496 of indexes

 

1> 韓國 歷史  # Just type the Hanja you want to convert and press enter.

1> 한국의 력사  # The computer converts the Hanja to Hangul directly. The result is correct: 歷史 is pronounced 력사 in North Korean.

 

2> mode  # To change the conversion mode, type 'mode' or '방식' and press enter.

Choose mode to convert Hanja to Hangul

1. Han unification only (Wikipedia, default)

2. Han unification and CJK Compatibility Forms (MS Word)

3. Without duuembeobchik (North Korean)

4. Apply Unicode normalization algorithm

5. (experiment) Compatible convert with both ways (1, 2)

6. (experiment) Convert Hangul to Hanja

> 1  # I choose '1. Han unification only (Wikipedia, default)'. You can also select '1' by pressing enter without typing anything; if you type nothing when choosing a mode, mode '1' is selected by default.

Load dic4.txt file, 32237 of indexes

Load dic5.txt file, 10 of indexes

Load dic1.txt file, 27496 of indexes

Total 59743 of indexes

 

Hanja Hangul Converter 0.0.4    by Sung-il KIM (masoris@gmail.com)

Commands: exit(종료), mode(방식), reverse(역변환), list(목록)

Type Hanja to convert and press enter.

 

3> 韓國 歷史  # I typed the same Hanja again.

3> 한국의 역사  # The result differs from before, because '歷史' is pronounced differently in South Korea and North Korea: '력사' in North Korean and '역사' in South Korean. The result '한국의 역사' is therefore correct, because I chose mode 1.

 

4> reverse  # The command 'reverse' or '역변환' reverses the database. It is an experimental function for testing the database, so do not rely on it to convert Hangul to Hanja.

Convert Hangul to Hanja (Reverse)

 

5> 한국의 역사

5> 韓國 力士 

 

6> list  # If you type 'list' or '목록', you can see every entry in the database currently in use.

 ( ... )

 

7> exit  # To exit the programme, just type 'exit' or '종료'.

 

4.2.1 Commands

- exit(종료) : Exit the programme.

- mode(방식, 모드) : Change the conversion mode.

- reverse(역변환, 정변환) : Reverse the database.

- list(목록) : Print the entries of the database.

- convfile(파일변환) : Convert an existing file. For example, to convert 'foo.txt', type "convfile foo.txt" or "파일변환 foo.txt". The source file must be encoded in UTF-8, and it is overwritten with the converted text.
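The English and Korean command names above can be handled with a single alias table. The sketch below is written for Python 3 and is a simplification of what hanja.py does, not its actual logic; ALIASES and parse_command are illustrative names.

```python
# Aliases taken from the command list above; the table layout is illustrative.
ALIASES = {
    "exit": "exit", "종료": "exit",
    "mode": "mode", "방식": "mode", "모드": "mode",
    "reverse": "reverse", "역변환": "reverse", "정변환": "reverse",
    "list": "list", "목록": "list",
    "convfile": "convfile", "파일변환": "convfile",
}


def parse_command(line):
    """Return (command, argument); command is None for text to convert."""
    head, _, rest = line.strip().partition(" ")
    command = ALIASES.get(head.lower())
    if command is None:
        return None, line
    return command, rest.strip()


print(parse_command("종료"))
print(parse_command("파일변환 foo.txt"))
print(parse_command("韓國 歷史"))
```

Any input whose first word is not a known command is passed back unchanged so that the caller can convert it.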

 

4.3 Convert in Terminal

You can also convert a text file by typing a command in the terminal. The file to convert must be encoded in UTF-8, and the result file is also saved in UTF-8.

1. $ ./hanja.py [File name to convert or String]  # Converts the file or string in the default mode and prints the result on screen.

2. $ ./hanja.py [File name to convert or String] [Mode]  # Converts the file or string in the selected mode and prints the result on screen.

3. $ ./hanja.py [File name to convert or String] [Mode] [File name to save result]  # Converts the file or string in the selected mode and saves the result to a file.
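The first argument is treated as a file name when such a file can be read, and as the text to convert otherwise. A Python 3 sketch of that decision (hanja.py itself is Python 2 and simply tries to open the argument); read_input and the temporary file are assumptions made for the demonstration.

```python
import os
import tempfile


def read_input(argument):
    """Return the file's UTF-8 text if argument names a file, else argument."""
    if os.path.isfile(argument):
        with open(argument, encoding="utf-8") as handle:
            return handle.read()
    return argument


# Demonstration with a temporary file (the file name is arbitrary).
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "foo.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("韓國")
    from_file = read_input(path)

from_string = read_input("歷史")
print(from_file, from_string)
```

One consequence of this design is that a string which happens to match an existing file name is read as a file rather than converted as text.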

 

5. hanconv.py

This file is the module for converting Hanja to Hangul.

 

5.1 Available Functions

5.1.1 convert([text], [mode], [reverse]) -> unicode

This function converts Hanja text to Hangul.

 

The following modes are available:

- 'unionly', (1): Han unification only (Wikipedia, default)

- 'uniandcomp', (2): Han unification and CJK Compatibility Forms (MS Word)

- 'withoutduuem', (3): Without duuembeobchik (North Korean)

- 'uninormal', (4): Apply Unicode normalization algorithm

- 'compboth', (5): (experiment) Compatible convert with both ways (1, 2)

- 'hangul2hanja', (6): (experiment) Convert Hangul to Hanja

The numbers in parentheses may change in future versions.
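Since each mode has both a name and a number, callers may pass either. One way to accept a name, an integer or a numeric string is sketched below for Python 3. It is not the module's own code: normalize_mode is a hypothetical helper, and unlike hanja.py, which falls back to mode 1, this sketch rejects unknown modes.

```python
MODE_NAMES = ["unionly", "uniandcomp", "withoutduuem",
              "uninormal", "compboth", "hangul2hanja"]


def normalize_mode(mode="unionly"):
    """Accept a mode name, a number, or a numeric string; return the name."""
    text = str(mode).strip()
    if text in MODE_NAMES:
        return text
    if text.isdigit() and 1 <= int(text) <= len(MODE_NAMES):
        return MODE_NAMES[int(text) - 1]
    raise ValueError("unknown mode: %r" % (mode,))


print(normalize_mode(3))
print(normalize_mode("1"))
print(normalize_mode("compboth"))
```

Using the names rather than the numbers in calling code avoids breakage if the numbering changes between versions.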

 

5.1.2 getlistfrommode(mode) -> list

This function returns the conversion list for the given mode.

 

5.1.3 getlenoflistfrommode(mode) -> int

This function returns the number of entries in the given mode.

 

5.1.4 printlistfrommode(mode) -> None

This function prints every entry in the list for the given mode.

 

5.2 Example

$ python  # Run Python in the Hanja Hangul Converter directory

Python 2.4.4c1 (#2, Oct 11 2006, 21:51:02)

[GCC 4.1.2 20060928 (prerelease) (Ubuntu 4.1.1-13ubuntu5)] on linux2

Type "help", "copyright", "credits" or "license" for more information.

>>> import hanconv  # Import the hanconv module

>>> print hanconv.convert('歷史')

역사

>>> print hanconv.convert('歷史', 3)

력사

>>> print hanconv.convert('역사', 1, True)

力士

>>> print hanconv.getlenoflistfrommode(1)

60019
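The reverse call in the example returns 力士 rather than 歷史 because both words are written 역사 in Hangul, so the reverse direction cannot tell them apart. The Python 3 sketch below imitates the sequential replace loop of hanconv.py with a two-entry list; the list and its order are assumptions chosen to reproduce the effect.

```python
def convert(text, pairs, reverse=False):
    """Apply replacements in list order, like a sequential replace loop."""
    for hanja, hangul in pairs:
        if reverse:
            text = text.replace(hangul, hanja)
        else:
            text = text.replace(hanja, hangul)
    return text


# Two homophones: 力士 (strong man) and 歷史 (history) are both 역사.
pairs = [("力士", "역사"), ("歷史", "역사")]

print(convert("歷史", pairs))                # forward is unambiguous
print(convert("역사", pairs, reverse=True))  # reverse picks the first entry
```

Whichever homophone comes first in the list wins, which is one reason the README treats the reverse direction as experimental.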

 






#!/usr/bin/env python
# -*- coding: utf-8 -*-
# File name: hanja.py
import sys
import hanconv
import os.path

#Definitions
def preface(): #Print Preface
    print u'Hanja Hangul Converter 0.0.6    by Sung-il KIM (masoris@gmail.com)'
    print u'Commands: exit(종료), mode(방식), reverse(역변환), list(목록)'
    print u'          convfile(파일변환) [filename]'
    print u'Type hanja to convert and press enter.'

def command(cmd): #Analyzing Command and Get Result
    global looping, reverse

    #Commands
    if cmd.lower() in (u'exit',u'종료'):
        looping = False
        return
    elif cmd.lower() in (u'list', u'목록'):
        hanconv.printlistfrommode(mode)
        return #Do not fall through and convert the command word itself
    elif cmd.lower() in (u'reverse', u'정변환', u'역변환'):
        if(reverse == True):
            reverse = False
            print u'Convert Hanja to Hangul'
            return
        elif(reverse == False):
            reverse = True
            print u'Convert Hangul to Hanja (Reverse)'
            return
    elif cmd.lower() in (u'mode', u'방식', u'모드'):
        setmode()
        print ''
        preface()
        return

    if cmd.lower()[0:5] == u'파일변환 ':
        cmd = u'convfile '+cmd[5:]
    if cmd.lower()[0:9] == u'convfile ':
        filename = cmd[9:].replace('\'','').replace('\"','')
        if not os.path.exists(filename):
            print 'ERROR: file \''+filename+'\' doesn\'t exist'
            return
        try:
            oritxt = unicode(file(filename).read(),'utf8')
            print u'Input :\n'+oritxt
            resulttxt = hanconv.convert(oritxt,mode,reverse)
            print u'Output :\n'+resulttxt
            #The source file is overwritten with the converted text
            file(filename,'w').write(resulttxt.encode('utf8'))
            print 'Successfully converted file \''+filename+'\'.'
        except:
            print 'ERROR: An error occurred while converting file \''+filename+'\'.'
        return

    #Convert
    print u'出'+unicode(times)+u'> '+convert(cmd)

def convert(txt): #Convert Hanja to Hangul
    return hanconv.convert(txt, mode, reverse)

def setmode(mod=u"-1"):
    global mode

    #Select Mode
    if(mod == u"-1"):
        print u'Choose mode to convert Hanja to Hangul'
        print u'1. Han unification only (Wikipedia, default)'
        print u'2. Han unification and CJK Compatibility Forms (MS Word)'
        print u'3. Without duuembeobchik (North Korean)'
        print u'4. Apply Unicode normalization algorithm'
        print u'5. (experiment) Compatible convert with both ways (1, 2)'
        print u'6. (experiment) Convert Hangul to Hanja'
        mod = unicode(raw_input(u'擇> '.encode(defaultencoding)),defaultencoding)
    
    mod = unicode(mod)
    if mod in hanconv.modes:
        mode = mod
    else:
        mode = u'1'

    if(message):
        print u'Mode '+mode+u' Selected'
        print u'Total '+unicode(hanconv.getlenoflistfrommode(mode))+' of indexes'


#Set Variables
looping = True
times = 0 #The number of looping times
reverse = False #Reverse Converting
message = False #Print Message
mode = '1' #Current Mode
arginput = ''
defaultencoding = sys.getfilesystemencoding()


    
#Start Programme
#Arguments Start
#Is argv1 a filename or the text to convert? Try it as a file first.
if(len(sys.argv)>=2):
    try:
        arginput = unicode(file(sys.argv[1]).read(),'utf8')        
    except:
        arginput = unicode(sys.argv[1],defaultencoding)

#Convert argv1 and print result
if(len(sys.argv)==2):
    print hanconv.convert(arginput)

#Convert argv1 by argv2 mode and print result
elif(len(sys.argv)==3):
    print hanconv.convert(arginput,sys.argv[2])

#Convert argv1 by argv2 mode and save result to argv3 file
elif(len(sys.argv)==4):
    file(sys.argv[3], 'w').write(hanconv.convert(arginput,sys.argv[2]).encode('utf8'))

#Start looping
else:
    message = True
    preface()
    print ''
    setmode()
    
    while(looping):
        times = times + 1
        command(unicode(raw_input(u'\n入'.encode(defaultencoding)+str(times)+'> '), defaultencoding))
#End Programme







#!/usr/bin/env python
# -*- coding: utf-8 -*-
# File name: hanconv.py
import sys

def getlistfromfile(filename):
    if type(filename) is not unicode:filename = unicode(filename, defaultencoding)
    result = [[],[]]
    for line in unicode(file(filename).read(),'utf8').splitlines():
        if line.find(u'\t') == -1:continue
        if not -1 == line.find(u'#'):continue
        if not -1 == line.find(u'?'):continue
        splited = line.rsplit(u'\t')
        if len(splited) in (0, 1):continue
        if len(splited[0]) == 0 or len(splited[1]) == 0:continue
        if len(splited[0]) != len(splited[1]):continue #Only for Hanja-Hangul Converting
        result[0].append(splited[0])
        result[1].append(splited[1])
    return result

def getlistfromfilenames(filenames):
    result = [[],[]]
    for filename in filenames:
        l = getlistfromfile(filename)
        for n in range(0, len(l[0])):
            result[0].append(l[0][n])
            result[1].append(l[1][n])
    return result

def getlistfrommode(mode):
    if isinstance(mode, str):mode = unicode(mode, defaultencoding)
    elif not isinstance(mode, unicode):mode = unicode(mode) #Accept numbers such as 3, as in the README example
    global getlistfrommodecache, getlistfrommodelastmode
    if getlistfrommodecacheenable == True:
        if getlistfrommodelastmode == mode:
            result = getlistfrommodecache
        else:
            getlistfrommodelastmode = mode
            result = getlistfromfilenames(modes[mode])
            getlistfrommodecache = result
    else:
        result = getlistfromfilenames(modes[mode])
    return result

def getlenoflistfrommode(mode):
    if isinstance(mode, str):mode = unicode(mode, defaultencoding)
    elif not isinstance(mode, unicode):mode = unicode(mode) #Accept numbers such as 1, as in the README example
    return len(getlistfrommode(mode)[0])

def printlistfrommode(mode):
    if isinstance(mode, str):mode = unicode(mode, defaultencoding)
    elif not isinstance(mode, unicode):mode = unicode(mode) #Accept numeric modes as well as names
    modelist = getlistfrommode(mode)
    errornum = 0
    for n in range(0, len(modelist[0])):
        try:
            print modelist[0][n]+u'\t'+modelist[1][n]
        except:
            errornum = errornum + 1
    print u"Total "+unicode(getlenoflistfrommode(mode))+u" of indexes"
    if errornum != 0:print unicode(errornum)+u' of indexes couldn\'t be printed because of ERROR'
   

def convert(text = u'', mode = u'unionly', reverse=False):
    if isinstance(mode, str):mode = unicode(mode, defaultencoding)
    elif not isinstance(mode, unicode):mode = unicode(mode) #Accept numbers such as 3, as in the README example
    if type(text) is not type(unicode()):text = unicode(text, defaultencoding)
    convlist = getlistfrommode(mode)
    for n in range(0, len(convlist[0])):
        if reverse == True:
            text = text.replace(convlist[1][n], convlist[0][n])
        else:
            text = text.replace(convlist[0][n], convlist[1][n])
    return text

def initmodes():
    global modes
    #Set default modes
    modes[u'unionly'] = [u'dic0.txt', u'dic4.txt', u'dic5.txt', u'dic1.txt']
    modes[u'uniandcomp'] = [u'dic3.txt', u'dic2.txt']
    modes[u'withoutduuem'] = [u'dic0.txt', u'dic1.txt']
    modes[u'uninormal'] = [u'dic0.txt']
    modes[u'compboth'] = [u'dic4.txt', u'dic3.txt', u'dic2.txt']
    modes[u'hangul2hanja'] = [u'dic6.txt']

    #Alternative mode names
    modes[u'1'] = modes[u'unionly']
    modes[u'2'] = modes[u'uniandcomp']
    modes[u'3'] = modes[u'withoutduuem']
    modes[u'4'] = modes[u'uninormal']
    modes[u'5'] = modes[u'compboth']
    modes[u'6'] = modes[u'hangul2hanja']

getlistfrommodecacheenable = True
getlistfrommodecache = u''
getlistfrommodelastmode = u''
defaultencoding = sys.getfilesystemencoding()

modes = {}
initmodes()






