API Reference (auto-generated)

Morphological Analyzer

class pymorphy2.analyzer.MorphAnalyzer(path=None, lang=None, result_type=<class 'pymorphy2.analyzer.Parse'>, units=None, probability_estimator_cls=<object object>, char_substitutes=<object object>)

Morphological analyzer for the Russian language.

For a given word it can find all possible inflectional paradigms and thus compute all possible tags and normal forms.

The analyzer uses morphological word features and a lexicon (a dictionary compiled from the XML available at OpenCorpora.org); for unknown words a heuristic algorithm is used.

Create a MorphAnalyzer object:

>>> import pymorphy2
>>> morph = pymorphy2.MorphAnalyzer()

MorphAnalyzer uses dictionaries from the pymorphy2-dicts package (which can be installed via pip install pymorphy2-dicts).

Alternatively (e.g. if you have your own precompiled dictionaries), either set the PYMORPHY2_DICT_PATH environment variable to the dictionary path, or pass the path argument to the pymorphy2.MorphAnalyzer constructor:

>>> morph = pymorphy2.MorphAnalyzer(path='/path/to/dictionaries')

By default, methods of this class return parsing results as Parse namedtuples. This has performance implications under CPython, so if you need maximum speed, pass result_type=None to make the analyzer return plain unwrapped tuples:

>>> morph = pymorphy2.MorphAnalyzer(result_type=None)
DEFAULT_SUBSTITUTES = {'е': 'ё'}

DEFAULT_UNITS = [[DictionaryAnalyzer(), AbbreviatedFirstNameAnalyzer(letters='АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЭЮЯ', score=0.1, tag_pattern='NOUN,anim,%(gender)s,Sgtm,Name,Fixd,Abbr,Init sing,%(case)s'), AbbreviatedPatronymicAnalyzer(letters='АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЭЮЯ', score=0.1, tag_pattern='NOUN,anim,%(gender)s,Sgtm,Patr,Fixd,Abbr,Init sing,%(case)s')], NumberAnalyzer(score=0.9), PunctuationAnalyzer(score=0.9), [RomanNumberAnalyzer(score=0.9), LatinAnalyzer(score=0.9)], HyphenSeparatedParticleAnalyzer(particles_after_hyphen=['-то', '-ка', '-таки', '-де', '-тко', '-тка', '-с', '-ста'], score_multiplier=0.9), HyphenAdverbAnalyzer(score_multiplier=0.7), HyphenatedWordsAnalyzer(score_multiplier=0.75, skip_prefixes=<...>), KnownPrefixAnalyzer(known_prefixes=<...>, min_remainder_length=3, score_multiplier=0.75), [UnknownPrefixAnalyzer(score_multiplier=0.5), KnownSuffixAnalyzer(min_word_length=4, score_multiplier=0.5)], UnknAnalyzer()]

DICT_PATH_ENV_VARIABLE = 'PYMORPHY2_DICT_PATH'

TagClass

Return type: pymorphy2.tagset.OpencorporaTag

char_substitutes = None
iter_known_word_parses(prefix='')

Return an iterator over parses of dictionary words that start with a given prefix (the default empty prefix means “all words”).
parse(word)

Analyze the word and return a list of pymorphy2.analyzer.Parse namedtuples:

Parse(word, tag, normal_form, para_id, idx, _score)

(or plain tuples if result_type=None was used in the constructor).
word_is_known(word, strict=False)

Check if a word is in the dictionary.

By default, some fuzziness is allowed, depending on the dictionary - e.g. for Russian, ё letters replaced with е are handled. Pass strict=True to make matching strict (e.g. if it is guaranteed that the word has correct е/ё or г/ґ letters).

Note

Dictionary words are not always correct words; the dictionary also contains incorrect forms which are commonly used. So for spellchecking tasks this method should be used with extra care.
class pymorphy2.analyzer.Parse

Parse result wrapper.

is_known

True if this form is a known dictionary form.

lexeme

A lexeme this form belongs to.
Analyzer units

Dictionary analyzer unit

Analogy analyzer units

This module provides analyzer units that analyze unknown words by looking at how similar known words are analyzed.
class pymorphy2.units.by_analogy.KnownPrefixAnalyzer(known_prefixes, score_multiplier=0.75, min_remainder_length=3)

Parse the word by checking if it starts with a known prefix and parsing the remainder.

Example: псевдокошка -> (псевдо) + кошка.
Analyzer units for unknown words with hyphens

class pymorphy2.units.by_hyphen.HyphenAdverbAnalyzer(score_multiplier=0.7)

Detect adverbs that start with “по-”.

Example: по-западному
class pymorphy2.units.by_hyphen.HyphenSeparatedParticleAnalyzer(particles_after_hyphen, score_multiplier=0.9)

Parse the word by analyzing it without the particle after a hyphen.

Example: смотри-ка -> смотри + “-ка”.

Note

This analyzer doesn’t remove particles from the result, so for normalization you may need to handle particles at the tokenization level.
Analyzer units that analyze non-word tokens
class pymorphy2.units.by_shape.LatinAnalyzer(score=0.9)

This analyzer marks Latin words with the “LATN” tag. Example: “pdf” -> LATN
class pymorphy2.units.by_shape.NumberAnalyzer(score=0.9)

This analyzer marks integer and real numbers with “NUMB,int” or “NUMB,real” tags. Example: “12” -> NUMB,int; “12.4” -> NUMB,real

Note

Don’t confuse it with “NUMR”: “тридцать” -> NUMR
Tagset

Utils for working with grammatical tags.

Wrapper class for OpenCorpora.org tags.

Warning

In order to work properly, the class has to be globally initialized with actual grammemes (using the _init_grammemes method).

Pymorphy2 initializes it when loading a dictionary; it may not be a good idea to use this class directly. If possible, use morph_analyzer.TagClass instead.

Example:

>>> from pymorphy2 import MorphAnalyzer
>>> morph = MorphAnalyzer()
>>> Tag = morph.TagClass  # get an initialized Tag class
>>> tag = Tag('VERB,perf,tran plur,impr,excl')
>>> tag
OpencorporaTag('VERB,perf,tran plur,impr,excl')
Tag instances have attributes for accessing grammemes:

>>> print(tag.POS)
VERB
>>> print(tag.number)
plur
>>> print(tag.case)
None
Available attributes are: POS, animacy, aspect, case, gender, involvement, mood, number, person, tense, transitivity and voice.
You may check if a grammeme is in the tag, or if all grammemes from a given set are in the tag:

>>> 'perf' in tag
True
>>> 'nomn' in tag
False
>>> 'Geox' in tag
False
>>> set(['VERB', 'perf']) in tag
True
>>> set(['VERB', 'perf', 'sing']) in tag
False
In order to fight typos, an exception is raised for unknown grammemes:

>>> 'foobar' in tag
Traceback (most recent call last):
...
ValueError: Grammeme is unknown: foobar
>>> set(['NOUN', 'foo', 'bar']) in tag
Traceback (most recent call last):
...
ValueError: Grammemes are unknown: {'bar', 'foo'}

This also works for attributes:

>>> tag.POS == 'plur'
Traceback (most recent call last):
...
ValueError: 'plur' is not a valid grammeme for this attribute. Valid grammemes: ...
Return the Latin representation of a tag_or_grammeme string.

Cyrillic representation of this tag.

Replace rare cases (loc2/voct/…) with common ones (loct/nomn/…).

A frozenset with grammemes for this tag.

A frozenset with Cyrillic grammemes for this tag.

Return the Cyrillic representation of a tag_or_grammeme string.

Return a new set of grammemes with the required grammemes added and incompatible grammemes removed.
Command-Line Interface
Usage:
pymorphy parse [options] [<input>]
pymorphy dict meta [--lang <lang> | --dict <path>]
pymorphy dict mem_usage [--lang <lang> | --dict <path>] [--verbose]
pymorphy -h | --help
pymorphy --version
Options:
-l --lemmatize Include normal forms (lemmas)
-s --score Include non-contextual P(tag|word) scores
-t --tag Include tags
--thresh <NUM> Drop all results with estimated P(tag|word) less
than a threshold [default: 0.0]
--tokenized Assume that input text is already tokenized:
one token per line.
-c --cache <SIZE> Cache size, in entries. Set it to 0 to disable
cache; use 'unlim' value for unlimited cache
size [default: 20000]
--lang <lang> Language to use. Allowed values: ru, uk [default: ru]
--dict <path> Dictionary folder path
-v --verbose Be more verbose
-h --help Show this help
Utilities for OpenCorpora Dictionaries

class pymorphy2.opencorpora_dict.wrapper.Dictionary(path)

OpenCorpora dictionary wrapper class.

build_paradigm_info(para_id)

Return a list of (prefix, tag, suffix) tuples representing the paradigm.

build_stem(paradigm, idx, fixed_word)

Return the word stem (given a word, a paradigm and the word index).

iter_known_words(prefix='')

Return an iterator over (word, tag, normal_form, para_id, idx) tuples with dictionary words that start with a given prefix (the default empty prefix means “all words”).

word_is_known(word, substitutes_compiled=None)

Check if a word is in the dictionary.

To allow some fuzziness, pass the substitutes_compiled argument; it should be a result of DAWG.compile_replaces(). This way you can e.g. handle ё letters replaced with е in the input words.

Note

Dictionary words are not always correct words; the dictionary also contains incorrect forms which are commonly used. So for spellchecking tasks this method should be used with extra care.
Various Utilities

pymorphy2.tokenizers.simple_word_tokenize(text, _split=<built-in method split of re.Pattern object>)

Split text into tokens. Don’t split by a hyphen. Preserve punctuation, but not whitespace.
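The documented behavior can be approximated with a small regex. This is a sketch, not the exact pattern pymorphy2 uses; the function name is hypothetical:

```python
import re

# keep word characters and hyphens together; emit each punctuation
# character as its own token; drop whitespace
_TOKEN_RE = re.compile(r"[\w-]+|[^\w\s]", re.UNICODE)

def simple_word_tokenize_sketch(text):
    return _TOKEN_RE.findall(text)

print(simple_word_tokenize_sketch('Привет, по-западному!'))
# ['Привет', ',', 'по-западному', '!']
```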
pymorphy2.shapes.is_latin(token)

Return True if all token letters are Latin and there is at least one Latin letter in the token:

>>> is_latin('foo')
True
>>> is_latin('123-FOO')
True
>>> is_latin('123')
False
>>> is_latin(':)')
False
>>> is_latin('')
False
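The check can be sketched with the stdlib unicodedata module (an approximation of the documented behavior, not pymorphy2's actual implementation; the name is hypothetical):

```python
import unicodedata

def is_latin_sketch(token):
    letters = [ch for ch in token if ch.isalpha()]
    # at least one letter, and every letter must be from the Latin script
    return bool(letters) and all(
        unicodedata.name(ch).startswith('LATIN') for ch in letters
    )

print(is_latin_sketch('123-FOO'))  # True
print(is_latin_sketch('тест'))     # False
```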
pymorphy2.shapes.is_punctuation(token)

Return True if a word contains only spaces and punctuation marks and there is at least one punctuation mark:

>>> is_punctuation(', ')
True
>>> is_punctuation('..!')
True
>>> is_punctuation('x')
False
>>> is_punctuation(' ')
False
>>> is_punctuation('')
False
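The same behavior can be sketched via Unicode categories (a reconstruction of the documented semantics, not the library's code; the name is hypothetical):

```python
import unicodedata

def is_punctuation_sketch(token):
    # allow only spaces and punctuation; require at least one punctuation mark
    has_punct = False
    for ch in token:
        if unicodedata.category(ch).startswith('P'):
            has_punct = True
        elif not ch.isspace():
            return False
    return has_punct

print(is_punctuation_sketch('..!'))  # True
print(is_punctuation_sketch(' '))    # False
```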
pymorphy2.shapes.is_roman_number(token, _match=<built-in method match of re.Pattern object>)

Return True if token looks like a Roman number:

>>> is_roman_number('II')
True
>>> is_roman_number('IX')
True
>>> is_roman_number('XIIIII')
False
>>> is_roman_number('')
False
pymorphy2.shapes.restore_capitalization(word, example)

Make the capitalization of word the same as in example:

>>> restore_capitalization('bye', 'Hello')
'Bye'
>>> restore_capitalization('half-an-hour', 'Minute')
'Half-An-Hour'
>>> restore_capitalization('usa', 'IEEE')
'USA'
>>> restore_capitalization('pre-world', 'anti-World')
'pre-World'
>>> restore_capitalization('123-do', 'anti-IEEE')
'123-DO'
>>> restore_capitalization('123--do', 'anti--IEEE')
'123--DO'

If the alignment fails, the remainder is lower-cased:

>>> restore_capitalization('foo-BAR-BAZ', 'Baz-Baz')
'Foo-Bar-baz'
>>> restore_capitalization('foo', 'foo-bar')
'foo'
pymorphy2.shapes.restore_word_case(word, example)

This function has been renamed to restore_capitalization.
pymorphy2.utils.combinations_of_all_lengths(it)

Return an iterable with all possible combinations of items from it:

>>> for comb in combinations_of_all_lengths('ABC'):
...     print("".join(comb))
A
B
C
AB
AC
BC
ABC
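The documented behavior matches a straightforward itertools sketch (my reconstruction, not the library's code; the name is hypothetical):

```python
from itertools import chain, combinations

def combinations_of_all_lengths_sketch(it):
    items = list(it)
    # all combinations of length 1..len(items), shortest first
    return chain.from_iterable(
        combinations(items, r) for r in range(1, len(items) + 1)
    )

print(["".join(c) for c in combinations_of_all_lengths_sketch('ABC')])
# ['A', 'B', 'C', 'AB', 'AC', 'BC', 'ABC']
```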
pymorphy2.utils.get_mem_usage()

Return the memory usage of the current process, in bytes. Requires the psutil Python package.
pymorphy2.utils.json_read(filename, **json_options)

Read an object from a JSON file filename.

pymorphy2.utils.json_write(filename, obj, **json_options)

Create a file filename with obj serialized to JSON.
pymorphy2.utils.kwargs_repr(kwargs=None, dont_show_value=None)

>>> kwargs_repr(dict(foo="123", a=5, x=8))
"a=5, foo='123', x=8"
>>> kwargs_repr(dict(foo="123", a=5, x=8), dont_show_value=['foo'])
'a=5, foo=<...>, x=8'
>>> kwargs_repr()
''
pymorphy2.utils.largest_elements(iterable, key, n=1)

Return a list of the largest elements of the iterable (according to the key function).

n is the number of top element values to consider; when n==1 (the default) only the largest elements are returned; when n==2 - elements with one of the top-2 values, etc.

>>> s = [-4, 3, 5, 7, 4, -7]
>>> largest_elements(s, abs)
[7, -7]
>>> largest_elements(s, abs, 2)
[5, 7, -7]
>>> largest_elements(s, abs, 3)
[-4, 5, 7, 4, -7]
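The semantics above can be sketched in a few lines (a reconstruction under the documented behavior, not the library's code; the name is hypothetical):

```python
def largest_elements_sketch(iterable, key, n=1):
    items = list(iterable)
    # top-n distinct key values, largest first
    top = sorted({key(x) for x in items}, reverse=True)[:n]
    # keep input order; keep every element whose key is among the top values
    return [x for x in items if key(x) in top]

print(largest_elements_sketch([-4, 3, 5, 7, 4, -7], abs))  # [7, -7]
```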
pymorphy2.utils.longest_common_substring(data)

Return the longest common substring of a list of strings:

>>> longest_common_substring(["apricot", "rice", "cricket"])
'ric'
>>> longest_common_substring(["apricot", "banana"])
'a'
>>> longest_common_substring(["foo", "bar", "baz"])
''
>>> longest_common_substring(["", "foo"])
''
>>> longest_common_substring(["apricot"])
'apricot'
>>> longest_common_substring([])
''
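A simple quadratic sketch reproduces the documented behavior (my reconstruction, not the library's implementation; the name is hypothetical):

```python
def longest_common_substring_sketch(data):
    if not data:
        return ''
    first, best = data[0], ''
    for i in range(len(first)):
        # only try candidates longer than the current best
        for j in range(i + len(best) + 1, len(first) + 1):
            cand = first[i:j]
            if all(cand in s for s in data[1:]):
                best = cand
    return best

print(longest_common_substring_sketch(["apricot", "rice", "cricket"]))  # 'ric'
```

When a tie exists, this sketch returns one of the longest common substrings; the reference doctests above only exercise unambiguous cases.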