API Reference (auto-generated)

Morphological Analyzer

class pymorphy2.analyzer.MorphAnalyzer(path=None, lang=None, result_type=<class 'pymorphy2.analyzer.Parse'>, units=None, probability_estimator_cls=<object object>, char_substitutes=<object object>)[source]

Morphological analyzer for the Russian language.

For a given word it can find all possible inflectional paradigms and thus compute all possible tags and normal forms.

The analyzer uses morphological word features and a lexicon (a dictionary compiled from the XML source available at OpenCorpora.org); for unknown words a heuristic algorithm is used.

Create a MorphAnalyzer object:

>>> import pymorphy2
>>> morph = pymorphy2.MorphAnalyzer()

MorphAnalyzer uses dictionaries from pymorphy2-dicts package (which can be installed via pip install pymorphy2-dicts).

Alternatively (e.g. if you have your own precompiled dictionaries), either set the PYMORPHY2_DICT_PATH environment variable to the dictionary path, or pass the path argument to the pymorphy2.MorphAnalyzer constructor:

>>> morph = pymorphy2.MorphAnalyzer(path='/path/to/dictionaries') 

By default, methods of this class return parsing results as Parse namedtuples. This has performance implications under CPython, so if you need maximum speed, pass result_type=None to make the analyzer return plain unwrapped tuples:

>>> morph = pymorphy2.MorphAnalyzer(result_type=None)
DEFAULT_SUBSTITUTES = {'е': 'ё'}
DEFAULT_UNITS = [
    [DictionaryAnalyzer(),
     AbbreviatedFirstNameAnalyzer(letters='АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЭЮЯ', score=0.1, tag_pattern='NOUN,anim,%(gender)s,Sgtm,Name,Fixd,Abbr,Init sing,%(case)s'),
     AbbreviatedPatronymicAnalyzer(letters='АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЭЮЯ', score=0.1, tag_pattern='NOUN,anim,%(gender)s,Sgtm,Patr,Fixd,Abbr,Init sing,%(case)s')],
    NumberAnalyzer(score=0.9),
    PunctuationAnalyzer(score=0.9),
    [RomanNumberAnalyzer(score=0.9), LatinAnalyzer(score=0.9)],
    HyphenSeparatedParticleAnalyzer(particles_after_hyphen=['-то', '-ка', '-таки', '-де', '-тко', '-тка', '-с', '-ста'], score_multiplier=0.9),
    HyphenAdverbAnalyzer(score_multiplier=0.7),
    HyphenatedWordsAnalyzer(score_multiplier=0.75, skip_prefixes=<...>),
    KnownPrefixAnalyzer(known_prefixes=<...>, min_remainder_length=3, score_multiplier=0.75),
    [UnknownPrefixAnalyzer(score_multiplier=0.5), KnownSuffixAnalyzer(min_word_length=4, score_multiplier=0.5)],
    UnknAnalyzer()]
DICT_PATH_ENV_VARIABLE = 'PYMORPHY2_DICT_PATH'
TagClass
Return type: pymorphy2.tagset.OpencorporaTag
char_substitutes = None
classmethod choose_dictionary_path(path=None, lang=None)[source]
classmethod choose_language(dictionary, lang)[source]
cyr2lat(tag_or_grammeme)[source]

Return Latin representation for tag_or_grammeme string

get_lexeme(form)[source]

Return the lexeme this parse belongs to.

iter_known_word_parses(prefix='')[source]

Return an iterator over parses of dictionary words that start with a given prefix (the default empty prefix means “all words”).

lat2cyr(tag_or_grammeme)[source]

Return Cyrillic representation for tag_or_grammeme string

normal_forms(word)[source]

Return a list of word normal forms.

parse(word)[source]

Analyze the word and return a list of pymorphy2.analyzer.Parse namedtuples:

Parse(word, tag, normal_form, para_id, idx, _score)

(or plain tuples if result_type=None was used in constructor).
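Each Parse result carries a _score field estimating P(tag|word), so a common pattern is to pick the highest-scoring parse. A minimal sketch with mock data, assuming only the tuple layout shown above (no dictionary is loaded; _score is renamed to score because namedtuples forbid leading underscores):

```python
from collections import namedtuple

# Same field layout as pymorphy2.analyzer.Parse (methods omitted).
Parse = namedtuple('Parse', 'word tag normal_form para_id idx score')

# Mock results, standing in for morph.parse('стали'); the scores are illustrative.
parses = [
    Parse('стали', 'VERB,perf,intr plur,past,indc', 'стать', 0, 0, 0.6),
    Parse('стали', 'NOUN,inan,femn sing,gent', 'сталь', 1, 2, 0.4),
]

best = max(parses, key=lambda p: p.score)
print(best.normal_form)  # the most probable normal form: 'стать'
```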

tag(word)[source]
word_is_known(word, strict=False)[source]

Check if a word is in the dictionary.

By default, some fuzziness is allowed, depending on the dictionary; e.g. for Russian, ё letters replaced with е are handled. Pass strict=True to make matching strict (e.g. if it is guaranteed that the word has correct е/ё or г/ґ letters).

Note

Dictionary words are not always correct words; the dictionary also contains incorrect forms which are commonly used. So for spellchecking tasks this method should be used with extra care.
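The е/ё fuzziness can be pictured as looking the word up after folding ё to е on both sides. A toy sketch only; the real implementation uses DAWG replace-compilation, not a Python set:

```python
def fold(word):
    # Collapse ё to е, mirroring DEFAULT_SUBSTITUTES = {'е': 'ё'}.
    return word.lower().replace('ё', 'е')

# Toy "dictionary"; the real one is a compiled DAWG.
dictionary = {'ёж', 'ещё'}
folded = {fold(w) for w in dictionary}

def word_is_known(word, strict=False):
    if strict:
        return word.lower() in dictionary
    return fold(word) in folded

print(word_is_known('еж'))               # True: ё/е fuzziness allowed
print(word_is_known('еж', strict=True))  # False
```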

class pymorphy2.analyzer.Parse[source]

Parse result wrapper.

inflect(required_grammemes)[source]
is_known

True if this form is a known dictionary form.

lexeme

A lexeme this form belongs to.

make_agree_with_number(num)[source]

Inflect the word so that it agrees with num

normalized

A Parse instance for self.normal_form.
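make_agree_with_number follows the standard Russian agreement rule, keyed on the last one or two digits of the number. The grammeme choice can be sketched without any dictionary (the grammeme names match the tagset above; with a real analyzer the chosen grammemes would feed Parse.inflect):

```python
def agreement_grammemes(num):
    """Grammemes a noun needs to agree with num (standard Russian rule)."""
    n = abs(num)
    if n % 100 in (11, 12, 13, 14):  # 11-14 are always plural genitive
        return {'plur', 'gent'}
    if n % 10 == 1:                  # 1, 21, 31, ... singular nominative
        return {'sing', 'nomn'}
    if n % 10 in (2, 3, 4):          # 2-4, 22-24, ... singular genitive
        return {'sing', 'gent'}
    return {'plur', 'gent'}          # 0, 5-20, 25-30, ...

print(sorted(agreement_grammemes(21)))  # ['nomn', 'sing']
```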

class pymorphy2.analyzer.ProbabilityEstimator(dict_path)[source]
apply_to_parses(word, word_lower, parses)[source]
apply_to_tags(word, word_lower, tags)[source]
pymorphy2.analyzer.lang_dict_path(lang)[source]

Return language-specific dictionary path

Analyzer units

Dictionary analyzer unit

class pymorphy2.units.by_lookup.DictionaryAnalyzer[source]

Analyzer unit that analyzes words using the dictionary.

get_lexeme(form)[source]

Return a lexeme (given a parsed word).

parse(word, word_lower, seen_parses)[source]

Parse a word using this dictionary.

tag(word, word_lower, seen_tags)[source]

Tag a word using this dictionary.

Analogy analyzer units

This module provides analyzer units that analyze unknown words by looking at how similar known words are analyzed.

class pymorphy2.units.by_analogy.KnownPrefixAnalyzer(known_prefixes, score_multiplier=0.75, min_remainder_length=3)[source]

Parse the word by checking if it starts with a known prefix and parsing the remainder.

Example: псевдокошка -> (псевдо) + кошка.
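The unit's strategy can be sketched as: for each known prefix, check that the remainder is long enough and parses as a dictionary word, then reuse that parse with a reduced score. A toy version with a stand-in dictionary (the prefix list and dictionary here are illustrative):

```python
KNOWN_PREFIXES = ['псевдо', 'анти', 'супер']  # a tiny illustrative subset
MIN_REMAINDER_LENGTH = 3
SCORE_MULTIPLIER = 0.75

# Stand-in for DictionaryAnalyzer results: word -> (tag, score)
dictionary = {'кошка': ('NOUN,anim,femn sing,nomn', 1.0)}

def parse_with_known_prefix(word):
    results = []
    for prefix in KNOWN_PREFIXES:
        if word.startswith(prefix):
            remainder = word[len(prefix):]
            if len(remainder) >= MIN_REMAINDER_LENGTH and remainder in dictionary:
                tag, score = dictionary[remainder]
                results.append((word, tag, score * SCORE_MULTIPLIER))
    return results

print(parse_with_known_prefix('псевдокошка'))
```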

class pymorphy2.units.by_analogy.KnownSuffixAnalyzer(score_multiplier=0.5, min_word_length=4)[source]

Parse the word by checking how words with similar suffixes are parsed.

Example: бутявкать -> …вкать

class FakeDictionary[source]

This is just a DictionaryAnalyzer with a different __repr__.

class pymorphy2.units.by_analogy.UnknownPrefixAnalyzer(score_multiplier=0.5)[source]

Parse the word by parsing only the word suffix (with restrictions on prefix & suffix lengths).

Example: байткод -> (байт) + код

Analyzer units for unknown words with hyphens

class pymorphy2.units.by_hyphen.HyphenAdverbAnalyzer(score_multiplier=0.7)[source]

Detect adverbs that start with “по-”.

Example: по-западному

class pymorphy2.units.by_hyphen.HyphenSeparatedParticleAnalyzer(particles_after_hyphen, score_multiplier=0.9)[source]

Parse the word by analyzing it without a particle after a hyphen.

Example: смотри-ка -> смотри + “-ка”.

Note

This analyzer doesn’t remove particles from the result so for normalization you may need to handle particles at tokenization level.
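Since the particle is kept in the result, one way to normalize is to split known particles off at tokenization time. A hypothetical helper (the particle list mirrors particles_after_hyphen above):

```python
PARTICLES = ['-то', '-ка', '-таки', '-де', '-тко', '-тка', '-с', '-ста']

def split_particle(token):
    """Split a trailing hyphen-attached particle off a token, if present."""
    for particle in PARTICLES:
        if token.endswith(particle) and len(token) > len(particle):
            return token[:-len(particle)], particle
    return token, None

print(split_particle('смотри-ка'))  # ('смотри', '-ка')
print(split_particle('пока'))       # ('пока', None): no hyphen, no split
```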

lexemizer(form, this_method)[source]

A coroutine for preparing lexemes

normalizer(form, this_method)[source]

A coroutine for normalization

class pymorphy2.units.by_hyphen.HyphenatedWordsAnalyzer(skip_prefixes, score_multiplier=0.75)[source]

Parse the word by parsing its hyphen-separated parts.

Examples:

  • интернет-магазин -> “интернет-” + магазин
  • человек-гора -> человек + гора

Analyzer units that analyze non-word tokens

class pymorphy2.units.by_shape.LatinAnalyzer(score=0.9)[source]

This analyzer marks Latin words with the “LATN” tag. Example: “pdf” -> LATN

class pymorphy2.units.by_shape.NumberAnalyzer(score=0.9)[source]

This analyzer marks numbers with “NUMB,int” or “NUMB,real” tags. Example: “12” -> NUMB,int; “12.4” -> NUMB,real

Note

Don’t confuse it with “NUMR”: “тридцать” -> NUMR
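The int/real distinction is a pure token-shape check; a sketch with regular expressions (the exact patterns compiled into the unit may differ, e.g. in which decimal separators are accepted):

```python
import re

INT_RE = re.compile(r'^[+-]?\d+$')
REAL_RE = re.compile(r'^[+-]?\d+[.,]\d+$')  # assuming both . and , separators

def number_tag(token):
    if INT_RE.match(token):
        return 'NUMB,int'
    if REAL_RE.match(token):
        return 'NUMB,real'
    return None

print(number_tag('12'))    # NUMB,int
print(number_tag('12.4'))  # NUMB,real
print(number_tag('XII'))   # None
```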

class pymorphy2.units.by_shape.PunctuationAnalyzer(score=0.9)[source]

This analyzer tags punctuation marks as “PNCT”. Example: “,” -> PNCT

class pymorphy2.units.by_shape.RomanNumberAnalyzer(score=0.9)[source]

Tagset

Utils for working with grammatical tags.

class pymorphy2.tagset.OpencorporaTag(tag)[source]

Wrapper class for OpenCorpora.org tags.

Warning

In order to work properly, the class has to be globally initialized with actual grammemes (using the _init_grammemes method).

Pymorphy2 initializes it when loading a dictionary; it may not be a good idea to use this class directly. If possible, use morph_analyzer.TagClass instead.

Example:

>>> from pymorphy2 import MorphAnalyzer
>>> morph = MorphAnalyzer()
>>> Tag = morph.TagClass  # get an initialized Tag class
>>> tag = Tag('VERB,perf,tran plur,impr,excl')
>>> tag
OpencorporaTag('VERB,perf,tran plur,impr,excl')

Tag instances have attributes for accessing grammemes:

>>> print(tag.POS)
VERB
>>> print(tag.number)
plur
>>> print(tag.case)
None

Available attributes are: POS, animacy, aspect, case, gender, involvement, mood, number, person, tense, transitivity and voice.

You may check if a grammeme is in tag or if all grammemes from a given set are in tag:

>>> 'perf' in tag
True
>>> 'nomn' in tag
False
>>> 'Geox' in tag
False
>>> set(['VERB', 'perf']) in tag
True
>>> set(['VERB', 'perf', 'sing']) in tag
False

To guard against typos, an exception is raised for unknown grammemes:

>>> 'foobar' in tag
Traceback (most recent call last):
...
ValueError: Grammeme is unknown: foobar
>>> set(['NOUN', 'foo', 'bar']) in tag
Traceback (most recent call last):
...
ValueError: Grammemes are unknown: {'bar', 'foo'}

This also works for attributes:

>>> tag.POS == 'plur'
Traceback (most recent call last):
...
ValueError: 'plur' is not a valid grammeme for this attribute. Valid grammemes: ...
classmethod cyr2lat(tag_or_grammeme)[source]

Return Latin representation for tag_or_grammeme string

cyr_repr

Cyrillic representation of this tag

classmethod fix_rare_cases(grammemes)[source]

Replace rare cases (loc2/voct/…) with common ones (loct/nomn/…).

grammemes

A frozenset with grammemes for this tag.

grammemes_cyr

A frozenset with Cyrillic grammemes for this tag.

classmethod lat2cyr(tag_or_grammeme)[source]

Return Cyrillic representation for tag_or_grammeme string

updated_grammemes(required)[source]

Return a new set of grammemes with required grammemes added and incompatible grammemes removed.
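Conceptually, this drops grammemes that belong to the same mutually exclusive group as a required one (e.g. a tag cannot carry two cases), then adds the required set. A toy sketch with illustrative groups only; the real incompatibility data comes from the tagset:

```python
# Illustrative mutually exclusive grammeme groups (not the full tagset).
GROUPS = [
    {'sing', 'plur'},                                  # number
    {'nomn', 'gent', 'datv', 'accs', 'ablt', 'loct'},  # case
]

def updated_grammemes(grammemes, required):
    result = set(grammemes)
    for req in required:
        for group in GROUPS:
            if req in group:
                result -= group  # drop grammemes incompatible with req
    return result | set(required)

tag = {'NOUN', 'sing', 'nomn'}
print(sorted(updated_grammemes(tag, {'plur', 'gent'})))
# ['NOUN', 'gent', 'plur']
```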

Command-Line Interface

Usage:

pymorphy parse [options] [<input>]
pymorphy dict meta [--lang <lang> | --dict <path>]
pymorphy dict mem_usage [--lang <lang> | --dict <path>] [--verbose]
pymorphy -h | --help
pymorphy --version

Options:

-l --lemmatize      Include normal forms (lemmas)
-s --score          Include non-contextual P(tag|word) scores
-t --tag            Include tags
--thresh <NUM>      Drop all results with estimated P(tag|word) less
                    than a threshold [default: 0.0]
--tokenized         Assume that input text is already tokenized:
                    one token per line.
-c --cache <SIZE>   Cache size, in entries. Set it to 0 to disable
                    cache; use 'unlim' value for unlimited cache
                    size [default: 20000]
--lang <lang>       Language to use. Allowed values: ru, uk [default: ru]
--dict <path>       Dictionary folder path
-v --verbose        Be more verbose
-h --help           Show this help

Utilities for OpenCorpora Dictionaries

class pymorphy2.opencorpora_dict.wrapper.Dictionary(path)[source]

OpenCorpora dictionary wrapper class.

build_normal_form(para_id, idx, fixed_word)[source]

Build a normal form.

build_paradigm_info(para_id)[source]

Return a list of

(prefix, tag, suffix)

tuples representing the paradigm.

build_stem(paradigm, idx, fixed_word)[source]

Return word stem (given a word, paradigm and the word index).

build_tag_info(para_id, idx)[source]

Return tag as a string.

iter_known_words(prefix='')[source]

Return an iterator over (word, tag, normal_form, para_id, idx) tuples for dictionary words that start with a given prefix (the default empty prefix means “all words”).

word_is_known(word, substitutes_compiled=None)[source]

Check if a word is in the dictionary.

To allow some fuzziness, pass the substitutes_compiled argument; it should be the result of DAWG.compile_replaces(). This way you can, e.g., handle ё letters replaced with е in the input words.

Note

Dictionary words are not always correct words; the dictionary also contains incorrect forms which are commonly used. So for spellchecking tasks this method should be used with extra care.

Various Utilities

pymorphy2.tokenizers.simple_word_tokenize(text, _split=<built-in method split of re.Pattern object>)[source]

Split text into tokens. Don’t split on hyphens. Punctuation is preserved, but whitespace is not.
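The behaviour can be approximated with one regular expression that keeps word characters and hyphens together and emits each punctuation character as its own token (a sketch; the exact pattern inside pymorphy2 may differ):

```python
import re

# A run of letters/digits/hyphens, or any single non-space, non-word character.
TOKEN_RE = re.compile(r'[\w-]+|[^\w\s]', re.UNICODE)

def tokenize(text):
    return TOKEN_RE.findall(text)

print(tokenize('Кто-то пришёл, увидел.'))
# ['Кто-то', 'пришёл', ',', 'увидел', '.']
```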

pymorphy2.shapes.is_latin(token)[source]

Return True if all token letters are Latin and there is at least one Latin letter in the token:

>>> is_latin('foo')
True
>>> is_latin('123-FOO')
True
>>> is_latin('123')
False
>>> is_latin(':)')
False
>>> is_latin('')
False
pymorphy2.shapes.is_punctuation(token)[source]

Return True if the token contains only spaces and punctuation marks and there is at least one punctuation mark:

>>> is_punctuation(', ')
True
>>> is_punctuation('..!')
True
>>> is_punctuation('x')
False
>>> is_punctuation(' ')
False
>>> is_punctuation('')
False
pymorphy2.shapes.is_roman_number(token, _match=<built-in method match of re.Pattern object>)[source]

Return True if token looks like a Roman number:

>>> is_roman_number('II')
True
>>> is_roman_number('IX')
True
>>> is_roman_number('XIIIII')
False
>>> is_roman_number('')
False
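Such a check is commonly written as a single anchored regular expression over valid Roman numeral structure; a sketch (the pattern actually compiled into pymorphy2 may differ):

```python
import re

# Classic Roman numeral pattern; every group is optional, so rule out ''.
ROMAN_RE = re.compile(r'^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$')

def is_roman_number(token):
    return bool(token) and ROMAN_RE.match(token) is not None

print(is_roman_number('II'))      # True
print(is_roman_number('IX'))      # True
print(is_roman_number('XIIIII'))  # False: at most three trailing I's
print(is_roman_number(''))        # False
```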
pymorphy2.shapes.restore_capitalization(word, example)[source]

Make the capitalization of the word be the same as in example:

>>> restore_capitalization('bye', 'Hello')
'Bye'
>>> restore_capitalization('half-an-hour', 'Minute')
'Half-An-Hour'
>>> restore_capitalization('usa', 'IEEE')
'USA'
>>> restore_capitalization('pre-world', 'anti-World')
'pre-World'
>>> restore_capitalization('123-do', 'anti-IEEE')
'123-DO'
>>> restore_capitalization('123--do', 'anti--IEEE')
'123--DO'

If the alignment fails, the remainder is lower-cased:

>>> restore_capitalization('foo-BAR-BAZ', 'Baz-Baz')
'Foo-Bar-baz'
>>> restore_capitalization('foo', 'foo-bar')
'foo'
pymorphy2.shapes.restore_word_case(word, example)[source]

This function was renamed to restore_capitalization.

pymorphy2.utils.combinations_of_all_lengths(it)[source]

Return an iterable with all possible combinations of items from it:

>>> for comb in combinations_of_all_lengths('ABC'):
...     print("".join(comb))
A
B
C
AB
AC
BC
ABC
pymorphy2.utils.get_mem_usage()[source]

Return memory usage of the current process, in bytes. Requires psutil Python package.

pymorphy2.utils.json_read(filename, **json_options)[source]

Read an object from a JSON file filename

pymorphy2.utils.json_write(filename, obj, **json_options)[source]

Create file filename with obj serialized to JSON

pymorphy2.utils.kwargs_repr(kwargs=None, dont_show_value=None)[source]
>>> kwargs_repr(dict(foo="123", a=5, x=8))
"a=5, foo='123', x=8"
>>> kwargs_repr(dict(foo="123", a=5, x=8), dont_show_value=['foo'])
'a=5, foo=<...>, x=8'
>>> kwargs_repr()
''
pymorphy2.utils.largest_elements(iterable, key, n=1)[source]

Return a list of the largest elements of the iterable (according to the key function).

n is the number of top values to consider: when n==1 (default) only the largest elements are returned; when n==2, elements with one of the top two values, etc.

>>> s = [-4, 3, 5, 7, 4, -7]
>>> largest_elements(s, abs)
[7, -7]
>>> largest_elements(s, abs, 2)
[5, 7, -7]
>>> largest_elements(s, abs, 3)
[-4, 5, 7, 4, -7]
pymorphy2.utils.longest_common_substring(data)[source]

Return a longest common substring of a list of strings:

>>> longest_common_substring(["apricot", "rice", "cricket"])
'ric'
>>> longest_common_substring(["apricot", "banana"])
'a'
>>> longest_common_substring(["foo", "bar", "baz"])
''
>>> longest_common_substring(["", "foo"])
''
>>> longest_common_substring(["apricot"])
'apricot'
>>> longest_common_substring([])
''

See http://stackoverflow.com/questions/2892931/.

pymorphy2.utils.with_progress(iterable, desc=None, total=None, leave=True)[source]

Return an iterator that prints the iteration progress using the tqdm package. Return the iterable intact if tqdm is not available.

pymorphy2.utils.word_splits(word, min_reminder=3, max_prefix_length=5)[source]

Return all splits of a word (taking into account min_reminder and max_prefix_length).
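A minimal sketch of the splitting logic, assuming it enumerates (prefix, remainder) pairs with the prefix at most max_prefix_length characters and the remainder at least min_reminder characters:

```python
def word_splits(word, min_reminder=3, max_prefix_length=5):
    """All (prefix, remainder) splits honouring both length limits."""
    max_split = min(max_prefix_length, len(word) - min_reminder)
    return [(word[:i], word[i:]) for i in range(1, max_split + 1)]

print(word_splits('байткод'))
# [('б', 'айткод'), ('ба', 'йткод'), ('бай', 'ткод'), ('байт', 'код')]
```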