EduNLP.SIF
SIF
- EduNLP.SIF.sif.is_sif(item)
- Parameters
item –
- Returns
True when item does not need to be modified;
False when item needs to be modified.
Raises an error when item cannot be parsed correctly.
Examples
>>> text = '若$x,y$满足约束条件' \
... '$\\left\\{\\begin{array}{c}2 x+y-2 \\leq 0 \\\\ x-y-1 \\geq 0 \\\\ y+1 \\geq 0\\end{array}\\right.$,' \
... '则$z=x+7 y$的最大值$\\SIFUnderline$'
>>> is_sif(text)
True
>>> text = '某校一个课外学习小组为研究某作物的发芽率y和温度x(单位...'
>>> is_sif(text)
False
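The three outcomes described above (True, False, or an exception) can be folded into one status value when checking many items in a batch. The following is a minimal sketch, not part of EduNLP: `classify_item` and `toy_checker` are hypothetical names, and `toy_checker` is only a stand-in rule so the block runs without the library. In real use, `is_sif` would be passed as the checker.

```python
from typing import Callable


def classify_item(item: str, checker: Callable[[str], bool]) -> str:
    """Map the True / False / exception contract onto a single label."""
    try:
        ok = checker(item)
    except Exception:
        # The checker could not parse the item at all.
        return "unparseable"
    return "valid" if ok else "needs_conversion"


def toy_checker(item: str) -> bool:
    """Hypothetical stand-in for is_sif; NOT the library's actual rule.

    Raises on an odd number of '$' delimiters, and returns False when
    ASCII letters appear outside the $...$ spans.
    """
    if item.count("$") % 2:
        raise ValueError("unbalanced $ delimiters")
    # Parts at even indices lie outside the $...$ spans.
    outside = "".join(item.split("$")[0::2])
    return not any(c.isascii() and c.isalpha() for c in outside)


print(classify_item("若$x,y$满足约束条件", toy_checker))
print(classify_item("发芽率y和温度x", toy_checker))
print(classify_item("若$x", toy_checker))
```

Catching the exception at the call site keeps one malformed item from aborting a whole batch, at the cost of hiding the specific error; logging it before returning is a reasonable extension.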
- EduNLP.SIF.sif.sif4sci(item: str, figures: (dict, bool) = None, safe=True, symbol: str = None, tokenization=True, tokenization_params=None, errors='raise')
Uses the linear tokenizer by default; to change the tokenizer, specify tokenization_params.
- Parameters
item –
figures –
safe –
symbol –
tokenization –
tokenization_params –
method: which tokenizer to use, “linear” or “ast”
- Parameters only used by “linear”:
- Parameters only used by “ast”:
ord2token: whether to convert variables (mathord) and constants (textord) to special tokens
var_numbering: whether to add a numeric suffix to distinguish different variables
errors – one of “warn”, “raise”, “coerce”, “strict”, “ignore”
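To make the effect of var_numbering concrete, here is a minimal sketch that numbers variables by order of first appearance. It is an illustration under stated assumptions, not the library's implementation: `number_variables` and its `link` flag are hypothetical, and single ASCII letters are assumed to be the only variables. With `link=True` one numbering is shared across all formulas, which mirrors the effect that `link_formulas` has in this function's examples.

```python
def number_variables(formulas, link=False):
    """Replace single-letter variables with 'mathord_<n>' tokens.

    formulas: list of token lists.
    link: when True, share one numbering across all formulas (hypothetical
          flag); otherwise each formula is numbered independently.
    """
    shared = {}
    result = []
    for tokens in formulas:
        mapping = shared if link else {}
        numbered = []
        for tok in tokens:
            if len(tok) == 1 and tok.isalpha():
                # First appearance decides the suffix.
                if tok not in mapping:
                    mapping[tok] = "mathord_%d" % len(mapping)
                numbered.append(mapping[tok])
            else:
                numbered.append(tok)
        result.append(numbered)
    return result


stem = [["x", "=", "2"], ["y", "=", "\\sqrt", "{", "x", "}"]]
options = [["x", "<", "y"], ["y", "=", "x"], ["y", "<", "x"]]

print(number_variables(options))
print(number_variables(stem + options, link=True)[2:])
```

Numbered independently, `y = x` becomes `mathord_0 = mathord_1` because `y` appears first. Once the numbering is shared with the stem, `x` keeps suffix 0 everywhere and `y` keeps suffix 1.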
- Returns
When tokenization is False, return SegmentList;
When tokenization is True, return TokenList
Examples
>>> test_item = r"如图所示,则$\bigtriangleup ABC$的面积是$\SIFBlank$。$\FigureID{1}$"
>>> tl = sif4sci(test_item)
>>> tl
['如图所示', '\\bigtriangleup', 'ABC', '面积', '\\SIFBlank', \FigureID{1}]
>>> tl.describe()
{'t': 2, 'f': 2, 'g': 1, 'm': 1}
>>> with tl.filter('fgm'):
...     tl
['如图所示', '面积']
>>> with tl.filter(keep='t'):
...     tl
['如图所示', '面积']
>>> with tl.filter():
...     tl
['如图所示', '\\bigtriangleup', 'ABC', '面积', '\\SIFBlank', \FigureID{1}]
>>> tl.text_tokens
['如图所示', '面积']
>>> tl.formula_tokens
['\\bigtriangleup', 'ABC']
>>> tl.figure_tokens
[\FigureID{1}]
>>> tl.ques_mark_tokens
['\\SIFBlank']
>>> sif4sci(test_item, symbol="gm", tokenization_params={"formula_params": {"method": "ast"}})
['如图所示', <Formula: \bigtriangleup ABC>, '面积', '[MARK]', '[FIGURE]']
>>> sif4sci(test_item, symbol="tfgm")
['[TEXT]', '[FORMULA]', '[TEXT]', '[MARK]', '[TEXT]', '[FIGURE]']
>>> sif4sci(test_item, symbol="gm",
...         tokenization_params={"formula_params": {"method": "ast", "return_type": "list"}})
['如图所示', '\\bigtriangleup', 'A', 'B', 'C', '面积', '[MARK]', '[FIGURE]']
>>> test_item_1 = {
...     "stem": r"若$x=2$, $y=\sqrt{x}$,则下列说法正确的是$\SIFChoice$",
...     "options": [r"$x < y$", r"$y = x$", r"$y < x$"]
... }
>>> tls = [
...     sif4sci(e, symbol="gm",
...             tokenization_params={
...                 "formula_params": {
...                     "method": "ast", "return_type": "list", "ord2token": True, "var_numbering": True,
...                     "link_variable": False}
...             })
...     for e in ([test_item_1["stem"]] + test_item_1["options"])
... ]
>>> tls[1:]
[['mathord_0', '<', 'mathord_1'], ['mathord_0', '=', 'mathord_1'], ['mathord_0', '<', 'mathord_1']]
>>> link_formulas(*tls)
>>> tls[1:]
[['mathord_0', '<', 'mathord_1'], ['mathord_1', '=', 'mathord_0'], ['mathord_1', '<', 'mathord_0']]
>>> from EduNLP.utils import dict2str4sif
>>> test_item_1_str = dict2str4sif(test_item_1, tag_mode="head", add_list_no_tag=False)
>>> test_item_1_str
'$\\SIFTag{stem}$...则下列说法正确的是$\\SIFChoice$$\\SIFTag{options}$$x < y$$\\SIFSep$$y = x$$\\SIFSep$$y < x$'
>>> tl1 = sif4sci(test_item_1_str, symbol="gm",
...               tokenization_params={"formula_params": {"method": "ast", "return_type": "list", "ord2token": True}})
>>> tl1.get_segments()[0]
['\\SIFTag{stem}']
>>> tl1.get_segments()[1:3]
[['[TEXT_BEGIN]', '[TEXT_END]'], ['[FORMULA_BEGIN]', 'mathord', '=', 'textord', '[FORMULA_END]']]
>>> tl1.get_segments(add_seg_type=False)[0:3]
[['\\SIFTag{stem}'], ['mathord', '=', 'textord'], ['mathord', '=', 'mathord', '{ }', '\\sqrt']]
>>> test_item_2 = {"options": [r"$x < y$", r"$y = x$", r"$y < x$"]}
>>> test_item_2
{'options': ['$x < y$', '$y = x$', '$y < x$']}
>>> test_item_2_str = dict2str4sif(test_item_2, tag_mode="head", add_list_no_tag=False)
>>> test_item_2_str
'$\\SIFTag{options}$$x < y$$\\SIFSep$$y = x$$\\SIFSep$$y < x$'
>>> tl2 = sif4sci(test_item_2_str, symbol="gms",
...               tokenization_params={"formula_params": {"method": "ast", "return_type": "list"}})
>>> tl2
['\\SIFTag{options}', 'x', '<', 'y', '[SEP]', 'y', '=', 'x', '[SEP]', 'y', '<', 'x']
>>> tl2.get_segments(add_seg_type=False)
[['\\SIFTag{options}'], ['x', '<', 'y'], ['[SEP]'], ['y', '=', 'x'], ['[SEP]'], ['y', '<', 'x']]
>>> tl2.get_segments(add_seg_type=False, drop="s")
[['\\SIFTag{options}'], ['x', '<', 'y'], ['y', '=', 'x'], ['y', '<', 'x']]
>>> tl3 = sif4sci(test_item_1["stem"], symbol="gs")
>>> tl3.text_segments
[['说法', '正确']]
>>> tl3.formula_segments
[['x', '=', '2'], ['y', '=', '\\sqrt', '{', 'x', '}']]
>>> tl3.figure_segments
[]
>>> tl3.ques_mark_segments
[['\\SIFChoice']]
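The `describe()` counts and the `filter()` views in the examples above can be understood as operations over tokens tagged with a one-letter type (t = text, f = formula, g = figure, m = question mark). The sketch below models that idea with plain tuples. `describe` and `select` here are hypothetical helper functions written for illustration, not the TokenList methods themselves.

```python
from collections import Counter

# (type, value) pairs mirroring the first example's token list.
TOKENS = [
    ("t", "如图所示"),
    ("f", "\\bigtriangleup"),
    ("f", "ABC"),
    ("t", "面积"),
    ("m", "\\SIFBlank"),
    ("g", "\\FigureID{1}"),
]


def describe(typed_tokens):
    """Count tokens per type letter."""
    return dict(Counter(kind for kind, _ in typed_tokens))


def select(typed_tokens, drop="", keep=None):
    """Return token values, dropping the types in `drop`, or keeping only
    the types in `keep` when `keep` is given."""
    if keep is not None:
        return [value for kind, value in typed_tokens if kind in keep]
    return [value for kind, value in typed_tokens if kind not in drop]


print(describe(TOKENS))
print(select(TOKENS, drop="fgm"))
print(select(TOKENS, keep="t"))
```

Dropping `"fgm"` and keeping `"t"` select the same tokens here, which matches the two equivalent `filter` calls in the examples.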
Segment
- EduNLP.SIF.segment.seg(item, figures=None, symbol=None)
- Parameters
item –
figures –
symbol –
Examples
>>> test_item = r"如图所示,则$\bigtriangleup ABC$的面积是$\SIFBlank$。$\FigureID{1}$"
>>> s = seg(test_item)
>>> s
['如图所示,则', '\\bigtriangleup ABC', '的面积是', '\\SIFBlank', '。', \FigureID{1}]
>>> s.describe()
{'t': 3, 'f': 1, 'g': 1, 'm': 1}
>>> with s.filter("f"):
...     s
['如图所示,则', '的面积是', '\\SIFBlank', '。', \FigureID{1}]
>>> with s.filter(keep="t"):
...     s
['如图所示,则', '的面积是', '。']
>>> with s.filter():
...     s
['如图所示,则', '\\bigtriangleup ABC', '的面积是', '\\SIFBlank', '。', \FigureID{1}]
>>> seg(test_item, symbol="fgm")
['如图所示,则', '[FORMULA]', '的面积是', '[MARK]', '。', '[FIGURE]']
>>> seg(test_item, symbol="tfgm")
['[TEXT]', '[FORMULA]', '[TEXT]', '[MARK]', '[TEXT]', '[FIGURE]']
>>> seg(r"如图所示,则$\FormFigureID{0}$的面积是$\SIFBlank$。$\FigureID{1}$")
['如图所示,则', \FormFigureID{0}, '的面积是', '\\SIFBlank', '。', \FigureID{1}]
>>> seg(r"如图所示,则$\FormFigureID{0}$的面积是$\SIFBlank$。$\FigureID{1}$", symbol="fgm")
['如图所示,则', '[FORMULA]', '的面积是', '[MARK]', '。', '[FIGURE]']
>>> s.text_segments
['如图所示,则', '的面积是', '。']
>>> s.formula_segments
['\\bigtriangleup ABC']
>>> s.figure_segments
[\FigureID{1}]
>>> s.ques_mark_segments
['\\SIFBlank']
>>> test_item_1 = {
...     "stem": r"若复数$z=1+2 i+i^{3}$,则$|z|=$",
...     "options": ['0', '1', r'$\sqrt{2}$', '2']
... }
>>> from EduNLP.utils import dict2str4sif
>>> test_item_1_str = dict2str4sif(test_item_1)
>>> test_item_1_str
'$\\SIFTag{stem_begin}$...$\\SIFTag{stem_end}$$\\SIFTag{options_begin}$$\\SIFTag{list_0}$0...$\\SIFTag{options_end}$'
>>> s1 = seg(test_item_1_str, symbol="tfgm")
>>> s1
['\\SIFTag{stem_begin}'...'\\SIFTag{stem_end}', '\\SIFTag{options_begin}', '\\SIFTag{list_0}', ...]
>>> with s1.filter(keep="a"):
...     s1
[...'\\SIFTag{list_0}', '\\SIFTag{list_1}', '\\SIFTag{list_2}', '\\SIFTag{list_3}', '\\SIFTag{options_end}']
>>> s1.tag_segments
['\\SIFTag{stem_begin}', '\\SIFTag{stem_end}', '\\SIFTag{options_begin}', ...
'\\SIFTag{options_end}']
>>> test_item_1_str_2 = dict2str4sif(test_item_1, tag_mode="head", add_list_no_tag=False)
>>> seg(test_item_1_str_2, symbol="tfgmas")
['[TAG]', ... '[TAG]', '[TEXT]', '[SEP]', '[TEXT]', '[SEP]', '[FORMULA]', '[SEP]', '[TEXT]']
>>> s2 = seg(test_item_1_str_2, symbol="fgm")
>>> s2.tag_segments
['\\SIFTag{stem}', '\\SIFTag{options}']
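For readers who want an intuition for what segmentation does, the sketch below splits an item on `$...$` spans and tags each piece as text, formula, figure, or question mark, then optionally replaces whole segments with placeholder symbols. It is a simplified stand-in written for illustration: `toy_seg` and `toy_seg_symbol` are hypothetical names, tags (`\SIFTag`), separators (`\SIFSep`), and `\FormFigureID` are deliberately not handled, and the real `seg` may work differently.

```python
import re

SYMBOLS = {"t": "[TEXT]", "f": "[FORMULA]", "g": "[FIGURE]", "m": "[MARK]"}
MARKS = ("\\SIFBlank", "\\SIFChoice")  # marks that appear in the examples


def toy_seg(item):
    """Split an item into (type, value) segments on $...$ boundaries."""
    segments = []
    # With one capturing group, re.split alternates between text outside
    # the dollars (even indices) and content inside them (odd indices).
    for i, part in enumerate(re.split(r"\$(.*?)\$", item)):
        if not part:
            continue
        if i % 2 == 0:
            segments.append(("t", part))
        elif part.startswith("\\FigureID"):
            segments.append(("g", part))
        elif part in MARKS:
            segments.append(("m", part))
        else:
            segments.append(("f", part))
    return segments


def toy_seg_symbol(item, symbol=""):
    """Replace every segment whose type letter is in `symbol`."""
    return [SYMBOLS[kind] if kind in symbol else value
            for kind, value in toy_seg(item)]


item = r"如图所示,则$\bigtriangleup ABC$的面积是$\SIFBlank$。$\FigureID{1}$"
print(toy_seg_symbol(item))
print(toy_seg_symbol(item, symbol="fgm"))
```

Because replacement happens per segment rather than per token, `symbol="tfgm"` yields exactly one placeholder for each of the six segments.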
Parser
Tokenization
tokenize
- class EduNLP.SIF.tokenization.tokenization.TokenList(segment_list: EduNLP.SIF.segment.segment.SegmentList, text_params=None, formula_params=None, figure_params=None)
- get_segments(add_seg_type=True, add_seg_mode='delimiter', keep='*', drop='', depth=None)
- Parameters
add_seg_type –
add_seg_mode –
delimiter: both at the head and at the tail
head: only at the head
tail: only at the tail
keep –
drop –
depth (int or None) –
0: only separate at SIFSep
1: only separate at SIFTag
2: separate at SIFTag and SIFSep
otherwise, separate all segments
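The three add_seg_mode values can be pictured as different ways of wrapping one segment's tokens with type markers. The sketch below is a guess at the shape of the result, not the library's code: `wrap_segment` is a hypothetical helper, and only the `delimiter` form (`[FORMULA_BEGIN]` ... `[FORMULA_END]`) is confirmed by the sif4sci examples. The marker names used for `head` and `tail` are an assumption.

```python
def wrap_segment(tokens, seg_type, mode="delimiter"):
    """Wrap a segment's tokens with type markers.

    seg_type: e.g. "TEXT" or "FORMULA".
    mode: "delimiter" (head and tail), "head", or "tail".
    """
    begin = "[%s_BEGIN]" % seg_type  # assumed marker name for head mode
    end = "[%s_END]" % seg_type      # assumed marker name for tail mode
    if mode == "delimiter":
        return [begin] + list(tokens) + [end]
    if mode == "head":
        return [begin] + list(tokens)
    if mode == "tail":
        return list(tokens) + [end]
    raise ValueError("unknown add_seg_mode: %r" % mode)


formula = ["mathord", "=", "textord"]
print(wrap_segment(formula, "FORMULA"))
print(wrap_segment(formula, "FORMULA", mode="head"))
print(wrap_segment(formula, "FORMULA", mode="tail"))
```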
text
- EduNLP.SIF.tokenization.text.tokenize(text, granularity='word', stopwords='default')
- Parameters
text –
granularity –
stopwords (str, None or set) –
Examples
>>> tokenize("三角函数是基本初等函数之一")
['三角函数', '初等', '函数']
>>> tokenize("三角函数是基本初等函数之一", granularity="char")
['三', '角', '函', '数', '基', '初', '函', '数']
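The `granularity="char"` result above can be reproduced in spirit by splitting the text into characters and dropping stopwords. The sketch below does just that. `char_tokenize` is a hypothetical helper, and `STOP_CHARS` is a small stopword set chosen for this illustration only; it is not the library's default stopword list. Word granularity needs a Chinese word segmenter and is not sketched here.

```python
# Hypothetical stopword characters, chosen for this illustration only.
STOP_CHARS = {"是", "本", "等", "之", "一"}


def char_tokenize(text, stopwords=STOP_CHARS):
    """Split text into characters, skipping whitespace and stopwords."""
    stopwords = stopwords or set()
    return [ch for ch in text if not ch.isspace() and ch not in stopwords]


print(char_tokenize("三角函数是基本初等函数之一"))
print(char_tokenize("三角函数", stopwords=None))
```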
formula
- EduNLP.SIF.tokenization.formula.tokenize(formula, method='linear', errors='raise', **kwargs)
- Parameters
formula –
method –
errors – how to handle an exception raised during “ast” tokenization
“coerce”: fall back to linear_tokenize
“raise”: raise the exception
kwargs –
Examples
>>> tokenize(r"\frac{\pi}{x + y} + 1 = x")
['\\frac', '{', '\\pi', '}', '{', 'x', '+', 'y', '}', '+', '1', '=', 'x']
>>> tokenize(r"\frac{\pi}{x + y} + 1 = x", method="ast", ord2token=True)
<Formula: \frac{\pi}{x + y} + 1 = x>
>>> tokenize(r"\frac{\pi}{x + y} + 1 = x", method="ast", ord2token=True, return_type="list")
['mathord', '{ }', 'mathord', '+', 'mathord', '{ }', '\\frac', '+', 'textord', '=', 'mathord']
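As a rough illustration of linear tokenization and of the `errors="coerce"` fallback, the sketch below tokenizes LaTeX with a regular expression and falls back to it when a stricter tokenizer fails. All names (`toy_linear`, `toy_strict`, `toy_tokenize`) are hypothetical. The regex is a stand-in rather than the library's linear tokenizer, and `toy_strict` merely rejects unbalanced braces to play the role of a parser-based method that can raise.

```python
import re

# LaTeX commands, escaped characters, letter runs, digit runs, then any
# other single non-space character.
TOKEN_RE = re.compile(r"\\[A-Za-z]+|\\.|[A-Za-z]+|\d+|\S")


def toy_linear(formula):
    """Flat, left-to-right tokenization; never raises."""
    return TOKEN_RE.findall(formula)


def toy_strict(formula):
    """Stand-in for a parser-based tokenizer: raises on unbalanced braces."""
    tokens = toy_linear(formula)
    depth = 0
    for tok in tokens:
        if tok == "{":
            depth += 1
        elif tok == "}":
            depth -= 1
            if depth < 0:
                raise ValueError("unexpected closing brace")
    if depth:
        raise ValueError("unclosed brace")
    return tokens


def toy_tokenize(formula, errors="raise"):
    """Try the strict tokenizer; with errors='coerce', fall back to the
    linear one instead of propagating the exception."""
    try:
        return toy_strict(formula)
    except ValueError:
        if errors == "coerce":
            return toy_linear(formula)
        raise


print(toy_tokenize(r"\frac{\pi}{x + y} + 1 = x"))
print(toy_tokenize(r"\frac{x", errors="coerce"))
```

The fallback trades structural information for robustness: a malformed formula still yields tokens, but without any guarantee that they form a valid expression.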