regex - Making multiple search and replace more precise in Python for a lemmatizer
I am trying to make my own lemmatizer for Spanish in Python 2.7, using a lemmatization dictionary. I want to replace all of the words in a text with their lemma form. This is the code I have been working on so far.
def replace_all(text, dic):
    for i, j in dic.iteritems():
        text = text.replace(i, j)
    return text

my_text = 'flojo y cargantes. decepcionantes. decenté decentó'
my_text_lower = my_text.lower()

lemmatize_list = 'exampledictionary'
lemmatize_word_dict = {}
with open(lemmatize_list) as f:
    for line in f:
        depurated_line = line.rstrip()
        (val, key) = depurated_line.split("\t")
        lemmatize_word_dict[key] = val

txt = replace_all(my_text_lower, lemmatize_word_dict)
print txt
Here is the example dictionary. The file contains the lemmatized forms used to replace the words in the input, my_text_lower. The example dictionary is a tab-separated, two-column file in which column 1 holds the values and column 2 holds the keys to match.
exampledictionary
flojo	floja
flojo	flojas
flojo	flojos
cargamento	cargamentos
cargante	cargantes
decepción	decepciones
decepcionante	decepcionantes
decentar	decenté
decentar	decentéis
decentar	decentemos
decentar	decentó
My desired output is as follows:
flojo y cargante. decepcionante. decentar decentar
Using these inputs (and the example phrase listed in my_text within the code), the actual output is:
felitrojo y cargramarramarrartserargramarramarrunirdo. decepáginacionarrtícolitroargramarramarrunirdo. decentar decentar
Currently, I can't seem to understand what is going wrong in my code. It seems to be replacing letters or chunks of each word, instead of recognizing the whole word, finding it in the lemma dictionary, and replacing that instead.
For instance, this is the result I am getting when I use the entire dictionary (more than 50,000 entries). The problem does not happen with the small example dictionary, and the fact that it only appears with the complete dictionary makes me think that perhaps there is double "replacing" at some point?
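That suspicion is plausible: str.replace matches substrings, so a later dictionary entry can rewrite part of a word that an earlier entry already produced. A minimal sketch of the effect, using hypothetical entries chosen to trigger it (Python 3 syntax):

```python
# Hypothetical dictionary entries, applied in a fixed order to mimic
# the loop over dic.iteritems() in replace_all.
pairs = [("cargantes", "cargante"), ("gan", "X")]

text = "cargantes"
for key, val in pairs:
    text = text.replace(key, val)  # matches substrings, not whole words

# First pass: "cargantes" -> "cargante"
# Second pass rewrites the substring "gan" inside the new word.
print(text)  # carXte
```

With 50,000 entries, collisions like this become almost inevitable, which would explain why the small example dictionary behaves correctly.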
Is there a pythonic technique I am missing that I can incorporate into the code to make the search-and-replace function more precise, so that it identifies full words for replacement rather than chunks and/or does not make double replacements?
Because you use text.replace there's a chance that you'll still be matching a sub-string, and the replaced text will then be processed again. It's better to process one input word at a time and build the output string word by word.

I've switched the key and value the other way around (because you want to match the form on the right and find the word on the left), and changed replace_all:
import re

def replace_all(text, dic):
    result = ""
    input = re.findall(r"[\w']+|[.,!?;]", text)
    for word in input:
        changed = dic.get(word, word)
        result = result + " " + changed
    return result

my_text = 'flojo y cargantes. decepcionantes. decenté decentó'
my_text_lower = my_text.lower()

lemmatize_list = 'exampledictionary'
lemmatize_word_dict = {}
with open(lemmatize_list) as f:
    for line in f:
        kv = line.split()
        lemmatize_word_dict[kv[1]] = kv[0]

txt = replace_all(my_text_lower, lemmatize_word_dict)
print txt
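As a variation on the same idea, the whole-word lookup can also be done with a single re.sub call and a replacement function, which preserves the original spacing and punctuation instead of rebuilding the string. This is a sketch in Python 3 syntax, not part of the original answer; the lemmatize function and the in-memory lemmas dictionary (standing in for the exampledictionary file) are illustrative:

```python
import re

def lemmatize(text, dic):
    # Replace each whole word with its dictionary entry, leaving
    # punctuation, whitespace, and unknown words untouched.
    return re.sub(r"[\w']+", lambda m: dic.get(m.group(0), m.group(0)), text)

# Hypothetical in-memory dictionary standing in for 'exampledictionary'.
lemmas = {"cargantes": "cargante", "decepcionantes": "decepcionante",
          "decenté": "decentar", "decentó": "decentar"}

print(lemmatize("flojo y cargantes. decepcionantes. decenté decentó", lemmas))
# flojo y cargante. decepcionante. decentar decentar
```

Each word is looked up exactly once, so no double replacement can occur, and accented characters such as é and ó are matched because \w is Unicode-aware in Python 3.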