Corpus

ChineseDailyNerCorpus

class kashgari.corpus.ChineseDailyNerCorpus[source]

Bases: object

Chinese Daily News NER Corpus: https://github.com/zjy-ucas/ChineseNER/

Example

>>> from kashgari.corpus import ChineseDailyNerCorpus
>>> train_x, train_y = ChineseDailyNerCorpus.load_data('train')
>>> test_x, test_y = ChineseDailyNerCorpus.load_data('test')
>>> valid_x, valid_y = ChineseDailyNerCorpus.load_data('valid')
>>> print(train_x)
    [['海', '钓', '比', '赛', '地', '点', '在', '厦', '门', ...], ...]
>>> print(train_y)
    [['O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'I-LOC', ...], ...]
classmethod load_data(subset_name='train', shuffle=True)[source]

Load the dataset in sequence-labeling format, tokenized at the character level.

Parameters
  • subset_name (str) – one of {train, test, valid}

  • shuffle (bool) – whether to shuffle the samples; defaults to True.

Returns

dataset features and dataset labels

Return type

Tuple[List[List[str]], List[List[str]]]
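The returned labels use the BIO tagging scheme shown in the example above. As a minimal sketch (not part of the kashgari API), here is how those per-character tags can be decoded back into entity spans:

```python
def decode_bio(chars, tags):
    """Collect contiguous B-X/I-X runs into (entity_text, entity_type) spans."""
    spans, current, ctype = [], [], None
    for ch, tag in zip(chars, tags):
        if tag.startswith('B-'):
            # a new entity begins; flush any entity in progress
            if current:
                spans.append((''.join(current), ctype))
            current, ctype = [ch], tag[2:]
        elif tag.startswith('I-') and current and tag[2:] == ctype:
            # continuation of the current entity
            current.append(ch)
        else:
            # 'O' tag or inconsistent 'I-' tag ends the current entity
            if current:
                spans.append((''.join(current), ctype))
            current, ctype = [], None
    if current:
        spans.append((''.join(current), ctype))
    return spans

# sample in the documented format
chars = ['海', '钓', '比', '赛', '地', '点', '在', '厦', '门']
tags = ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'I-LOC']
print(decode_bio(chars, tags))  # [('厦门', 'LOC')]
```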

SMP2018ECDTCorpus

class kashgari.corpus.SMP2018ECDTCorpus[source]

Bases: object

https://worksheets.codalab.org/worksheets/0x27203f932f8341b79841d50ce0fd684f/

This Chinese human-computer dialogue dataset was released for task 1 of the Evaluation of Chinese Human-Computer Dialogue Technology (SMP2018-ECDT) and is provided by iFLYTEK Corporation.

Sample:

      label           query
0   weather        今天东莞天气如何
1       map  从观音桥到重庆市图书馆怎么走
2  cookbook          鸭蛋怎么腌?
3    health         怎么治疗牛皮癣
4      chat             唠什么

Example

>>> from kashgari.corpus import SMP2018ECDTCorpus
>>> train_x, train_y = SMP2018ECDTCorpus.load_data('train')
>>> test_x, test_y = SMP2018ECDTCorpus.load_data('test')
>>> valid_x, valid_y = SMP2018ECDTCorpus.load_data('valid')
>>> print(train_x)
[['听', '新', '闻', '。'], ['电', '视', '台', '在', '播', '什', '么'], ...]
>>> print(train_y)
['news', 'epg', ...]
classmethod load_data(subset_name='train', shuffle=True, cutter='char')[source]

Load the dataset in text-classification format, tokenized at the character level by default.

Parameters
  • subset_name (str) – one of {train, test, valid}

  • shuffle (bool) – whether to shuffle the samples; defaults to True.

  • cutter (str) – tokenization strategy, one of {char, jieba}; defaults to char.

Returns

dataset features and dataset labels

Return type

Tuple[List[List[str]], List[str]]
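Since this corpus returns one flat label per sample, a common next step is building a label-to-index vocabulary for a classification head. A minimal sketch (not part of the kashgari API), using a stand-in list in the documented label format:

```python
from collections import Counter

# stand-in for the train_y list returned by SMP2018ECDTCorpus.load_data
train_y = ['news', 'epg', 'news', 'weather']

# index labels by descending frequency for a stable head layout;
# ties keep first-seen order (Counter preserves insertion order)
counts = Counter(train_y)
label2idx = {label: i for i, (label, _) in enumerate(counts.most_common())}
print(label2idx)  # {'news': 0, 'epg': 1, 'weather': 2}
```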

JigsawToxicCommentCorpus

class kashgari.corpus.JigsawToxicCommentCorpus(corpus_train_csv_path, sample_count=None, tokenizer=None)[source]

Bases: object

Kaggle Toxic Comment Classification Challenge corpus

Download the corpus from https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/overview to a local folder, then initialize a JigsawToxicCommentCorpus object with the path to train.csv.

Examples

>>> from kashgari.corpus import JigsawToxicCommentCorpus
>>> corpus = JigsawToxicCommentCorpus('<train.csv file-path>')
>>> train_x, train_y = corpus.load_data('train')
>>> test_x, test_y = corpus.load_data('test')
>>> print(train_x)
[['Please', 'stop', 'being', 'a', 'penis—', 'and', 'Grow', 'Up', 'Regards-'], ...]
>>> print(train_y)
[['obscene', 'insult'], ...]
__init__(corpus_train_csv_path, sample_count=None, tokenizer=None)[source]

Initialize the corpus with the path to the downloaded train.csv file.

Parameters
  • corpus_train_csv_path (str) – path to the downloaded train.csv file.

  • sample_count (Optional[int]) – if given, limit the corpus to this many samples.

  • tokenizer (Optional[kashgari.tokenizers.base_tokenizer.Tokenizer]) – tokenizer used to split comments; a default tokenizer is used when None.

Return type

None

load_data(subset_name='train', shuffle=True)[source]

Load the dataset in multi-label classification format: features are tokenized word lists, and each sample carries a list of zero or more labels.

Parameters
  • subset_name (str) – one of {train, test, valid}

  • shuffle (bool) – whether to shuffle the samples; defaults to True.

Returns

dataset features and dataset labels

Return type

Tuple[List[List[str]], List[List[str]]]
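Because each sample may carry several labels at once (e.g. ['obscene', 'insult'] in the example above), training a sigmoid classification head typically requires multi-hot label vectors. A minimal sketch (not part of the kashgari API), using the six class names from the Kaggle challenge:

```python
# the six classes defined by the Jigsaw Toxic Comment challenge
CLASSES = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

def to_multi_hot(labels):
    """Map a per-sample label list to a fixed-order 0/1 vector."""
    return [1 if c in labels else 0 for c in CLASSES]

# stand-in samples in the documented label format; an empty list
# means the comment was not flagged for any class
train_y = [['obscene', 'insult'], []]
print([to_multi_hot(y) for y in train_y])
# [[0, 0, 1, 0, 1, 0], [0, 0, 0, 0, 0, 0]]
```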