Corpus

ChineseDailyNerCorpus

class kashgari.corpus.ChineseDailyNerCorpus[source]

Bases: object

Chinese Daily News NER corpus. Source: https://github.com/zjy-ucas/ChineseNER/

Example

>>> from kashgari.corpus import ChineseDailyNerCorpus
>>> train_x, train_y = ChineseDailyNerCorpus.load_data('train')
>>> test_x, test_y = ChineseDailyNerCorpus.load_data('test')
>>> valid_x, valid_y = ChineseDailyNerCorpus.load_data('valid')
>>> print(train_x)
    [['海', '钓', '比', '赛', '地', '点', '在', '厦', '门', ...], ...]
>>> print(train_y)
    [['O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'I-LOC', ...], ...]

classmethod load_data(subset_name: str = 'train', shuffle: bool = True) → Tuple[List[List[str]], List[List[str]]][source]

Load the dataset in sequence labeling format, tokenized at the character level.

Parameters:
  • subset_name – {train, test, valid}
  • shuffle – whether to shuffle the data; defaults to True.
Returns:

dataset_features and dataset labels
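
The labels in the example above follow the BIO scheme (B-LOC opens a location entity, I-LOC continues it, O marks everything else). As a minimal sketch of how the returned features and labels line up, the helper below regroups one tagged sentence into entity strings. The function name extract_entities is hypothetical and not part of kashgari, and the sample is only the visible prefix of the first sentence shown above.

```python
from typing import List, Optional, Tuple


def extract_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collect (entity_text, entity_type) pairs from one BIO-tagged sentence.

    Hypothetical helper for illustration; it is not part of kashgari.
    """
    entities: List[Tuple[str, str]] = []
    current_chars: List[str] = []
    current_type: Optional[str] = None

    for token, tag in zip(tokens, tags):
        if tag.startswith('B-'):
            # A new entity starts: flush the one in progress, if any.
            if current_chars:
                entities.append((''.join(current_chars), current_type))
            current_chars = [token]
            current_type = tag[2:]
        elif tag.startswith('I-') and current_chars and tag[2:] == current_type:
            # Continuation of the entity currently being built.
            current_chars.append(token)
        else:
            # 'O' tag (or an I- tag with no matching B-): close any open entity.
            if current_chars:
                entities.append((''.join(current_chars), current_type))
            current_chars = []
            current_type = None

    # Flush an entity that runs to the end of the sentence.
    if current_chars:
        entities.append((''.join(current_chars), current_type))
    return entities


# Visible prefix of the first training sentence from the example above.
sample_x = ['海', '钓', '比', '赛', '地', '点', '在', '厦', '门']
sample_y = ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'I-LOC']

print(extract_entities(sample_x, sample_y))  # [('厦门', 'LOC')]
```

With real data, the same helper could be applied to each pair from zip(train_x, train_y).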

SMP2018ECDTCorpus

class kashgari.corpus.SMP2018ECDTCorpus[source]

Bases: object

https://worksheets.codalab.org/worksheets/0x27203f932f8341b79841d50ce0fd684f/

This Chinese human-computer dialogue dataset was released for Task 1 of the Evaluation of Chinese Human-Computer Dialogue Technology (SMP2018-ECDT) and is provided by iFLYTEK Corporation.

Sample:

      label           query
0   weather        今天东莞天气如何
1       map  从观音桥到重庆市图书馆怎么走
2  cookbook          鸭蛋怎么腌?
3    health         怎么治疗牛皮癣
4      chat             唠什么

Example

>>> from kashgari.corpus import SMP2018ECDTCorpus
>>> train_x, train_y = SMP2018ECDTCorpus.load_data('train')
>>> test_x, test_y = SMP2018ECDTCorpus.load_data('test')
>>> valid_x, valid_y = SMP2018ECDTCorpus.load_data('valid')
>>> print(train_x)
[['听', '新', '闻', '。'], ['电', '视', '台', '在', '播', '什', '么'], ...]
>>> print(train_y)
['news', 'epg', ...]

classmethod load_data(subset_name: str = 'train', shuffle: bool = True, cutter: str = 'char') → Tuple[List[List[str]], List[str]][source]

Load the dataset in sequence classification format, tokenized by the selected cutter (character level by default).

Parameters:
  • subset_name – {train, test, valid}
  • shuffle – whether to shuffle the data; defaults to True.
  • cutter – sentence cutter, one of {char, jieba}; defaults to char.
Returns:

dataset_features and dataset labels
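
To make the returned shape concrete, here is a rough sketch of what character-level cutting produces for a few rows of the sample table above. The char_cut helper is an assumption made for illustration: it only mimics the effect of cutter='char' and is not kashgari's implementation.

```python
from typing import List, Tuple


def char_cut(text: str) -> List[str]:
    """Split a sentence into single characters.

    Hypothetical stand-in for the effect of cutter='char'.
    """
    return list(text)


# (label, query) rows copied from the sample table above.
samples: List[Tuple[str, str]] = [
    ('weather', '今天东莞天气如何'),
    ('map', '从观音桥到重庆市图书馆怎么走'),
    ('chat', '唠什么'),
]

# Features are one token list per query; labels are one string per query.
features = [char_cut(query) for _, query in samples]
labels = [label for label, _ in samples]

print(features[2])  # ['唠', '什', '么']
print(labels)       # ['weather', 'map', 'chat']
```

Each feature is a list of tokens while each label is a single string, which matches the Tuple[List[List[str]], List[str]] return type in the signature.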

JigsawToxicCommentCorpus

class kashgari.corpus.JigsawToxicCommentCorpus(corpus_train_csv_path: str, sample_count: int = None, tokenizer: kashgari.tokenizers.base_tokenizer.Tokenizer = None)[source]

Bases: object

Kaggle Toxic Comment Classification Challenge corpus

Download the corpus from https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/overview to a local folder, then initialize a JigsawToxicCommentCorpus object with the path to train.csv.

Examples

>>> from kashgari.corpus import JigsawToxicCommentCorpus
>>> corpus = JigsawToxicCommentCorpus('<train.csv file-path>')
>>> train_x, train_y = corpus.load_data('train')
>>> test_x, test_y = corpus.load_data('test')
>>> print(train_x)
[['Please', 'stop', 'being', 'a', 'penis—', 'and', 'Grow', 'Up', 'Regards-'], ...]
>>> print(train_y)
[['obscene', 'insult'], ...]
__init__(corpus_train_csv_path: str, sample_count: int = None, tokenizer: kashgari.tokenizers.base_tokenizer.Tokenizer = None) → None[source]

Initialize the corpus from the train.csv file at corpus_train_csv_path.

load_data(subset_name: str = 'train', shuffle: bool = True) → Tuple[List[List[str]], List[List[str]]][source]

Load the dataset in multi-label classification format: each sample is a list of tokens, and each label entry is a list of the classes assigned to that sample.

Parameters:
  • subset_name – {train, test, valid}
  • shuffle – whether to shuffle the data; defaults to True.
Returns:

dataset_features and dataset labels
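
Because each label entry is a list that can hold several classes, or none, a common next step is to turn it into a fixed-length multi-hot vector. The sketch below shows one way to do that in plain Python. The helper names build_label_index and to_multi_hot are hypothetical, and apart from the first row, which is taken from the example above, the sample labels are made up for illustration.

```python
from typing import Dict, List


def build_label_index(label_lists: List[List[str]]) -> Dict[str, int]:
    """Map every distinct label to a column index, in sorted order."""
    distinct = sorted({label for labels in label_lists for label in labels})
    return {label: idx for idx, label in enumerate(distinct)}


def to_multi_hot(labels: List[str], label_index: Dict[str, int]) -> List[int]:
    """Encode one sample's labels as a 0/1 vector over all known labels."""
    vector = [0] * len(label_index)
    for label in labels:
        vector[label_index[label]] = 1
    return vector


# First row comes from the example output above; the rest are illustrative.
sample_y = [['obscene', 'insult'], [], ['toxic']]

label_index = build_label_index(sample_y)
encoded = [to_multi_hot(labels, label_index) for labels in sample_y]

print(label_index)  # {'insult': 0, 'obscene': 1, 'toxic': 2}
print(encoded)      # [[1, 1, 0], [0, 0, 0], [0, 0, 1]]
```

An empty label list encodes to an all-zero vector, which is how a sample with no assigned class would be represented under this scheme.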