regex - Making multiple search and replace more precise in Python for a lemmatizer
I am trying to make my own lemmatizer for Spanish in Python 2.7, using a lemmatization dictionary. I want to replace all of the words in a text with their lemma form. This is the code I have been working on so far.
def replace_all(text, dic):
    for i, j in dic.iteritems():
        text = text.replace(i, j)
    return text

my_text = 'flojo y cargantes. decepcionantes. decenté decentó'
my_text_lower = my_text.lower()

lemmatize_list = 'exampledictionary'
lemmatize_word_dict = {}
with open(lemmatize_list) as f:
    for line in f:
        depurated_line = line.rstrip()
        (val, key) = depurated_line.split("\t")
        lemmatize_word_dict[key] = val

txt = replace_all(my_text_lower, lemmatize_word_dict)
print txt
Here is the example dictionary. The file contains the lemmatized forms used to replace the words in the input, my_text_lower. The example dictionary is a tab-separated, two-column file in which column 1 holds the values and column 2 holds the keys to match.
exampledictionary
flojo	floja
flojo	flojas
flojo	flojos
cargamento	cargamentos
cargante	cargantes
decepción	decepciones
decepcionante	decepcionantes
decentar	decenté
decentar	decentéis
decentar	decentemos
decentar	decentó
My desired output is as follows:
flojo y cargante. decepcionante. decentar decentar
Using these inputs (and the example phrase listed in my_text within the code), the actual output is:
felitrojo y cargramarramarrartserargramarramarrunirdo. decepáginacionarrtícolitroargramarramarrunirdo. decentar decentar
Currently, I can't seem to understand what is going wrong in my code. It seems to be replacing letters or chunks of each word, instead of recognizing the whole word, finding it in the lemma dictionary, and replacing that instead.
For instance, this is the result I am getting when I use the entire dictionary (more than 50,000 entries). The problem does not happen with the small example dictionary, and the fact that it only appears with the complete dictionary makes me think that perhaps there is double "replacing" at some point?
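That suspicion is plausible: str.replace matches substrings, so a later dictionary entry can rewrite part of a word that an earlier entry already produced. A minimal sketch of the effect, using hypothetical entries chosen to trigger it (Python 3 syntax):

```python
# Hypothetical dictionary entries, applied in a fixed order to mimic
# the loop over dic.iteritems() in replace_all.
pairs = [("cargantes", "cargante"), ("gan", "X")]

text = "cargantes"
for key, val in pairs:
    text = text.replace(key, val)  # matches substrings, not whole words

# First pass: "cargantes" -> "cargante"
# Second pass rewrites the substring "gan" inside the new word.
print(text)  # carXte
```

With 50,000 entries, collisions like this become almost inevitable, which would explain why the small example dictionary behaves correctly.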
Is there a pythonic technique I am missing that I can incorporate into the code to make the search-and-replace function more precise, so that it identifies full words for replacement rather than chunks and/or does not make double replacements?
Because you use text.replace there's a chance that you'll still be matching a sub-string, and the replaced text will then be processed again. It's better to process one input word at a time and build the output string word by word.

I've switched the key and value the other way around (because you want to match the form on the right and find the word on the left), and changed replace_all:
import re

def replace_all(text, dic):
    result = ""
    input = re.findall(r"[\w']+|[.,!?;]", text)
    for word in input:
        changed = dic.get(word, word)
        result = result + " " + changed
    return result

my_text = 'flojo y cargantes. decepcionantes. decenté decentó'
my_text_lower = my_text.lower()

lemmatize_list = 'exampledictionary'
lemmatize_word_dict = {}
with open(lemmatize_list) as f:
    for line in f:
        kv = line.split()
        lemmatize_word_dict[kv[1]] = kv[0]

txt = replace_all(my_text_lower, lemmatize_word_dict)
print txt
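As a variation on the same idea, the whole-word lookup can also be done with a single re.sub call and a replacement function, which preserves the original spacing and punctuation instead of rebuilding the string. This is a sketch in Python 3 syntax, not part of the original answer; the lemmatize function and the in-memory lemmas dictionary (standing in for the exampledictionary file) are illustrative:

```python
import re

def lemmatize(text, dic):
    # Replace each whole word with its dictionary entry, leaving
    # punctuation, whitespace, and unknown words untouched.
    return re.sub(r"[\w']+", lambda m: dic.get(m.group(0), m.group(0)), text)

# Hypothetical in-memory dictionary standing in for 'exampledictionary'.
lemmas = {"cargantes": "cargante", "decepcionantes": "decepcionante",
          "decenté": "decentar", "decentó": "decentar"}

print(lemmatize("flojo y cargantes. decepcionantes. decenté decentó", lemmas))
# flojo y cargante. decepcionante. decentar decentar
```

Each word is looked up exactly once, so no double replacement can occur, and accented characters such as é and ó are matched because \w is Unicode-aware in Python 3.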