Corpus

ChineseDailyNerCorpus

class kashgari.corpus.ChineseDailyNerCorpus[source]

Bases: object

Chinese Daily News NER Corpus: https://github.com/zjy-ucas/ChineseNER/

Example

>>> from kashgari.corpus import ChineseDailyNerCorpus
>>> train_x, train_y = ChineseDailyNerCorpus.load_data('train')
>>> test_x, test_y = ChineseDailyNerCorpus.load_data('test')
>>> valid_x, valid_y = ChineseDailyNerCorpus.load_data('valid')
>>> print(train_x)
    [['海', '钓', '比', '赛', '地', '点', '在', '厦', '门', ...], ...]
>>> print(train_y)
    [['O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'I-LOC', ...], ...]
classmethod load_data(subset_name='train', shuffle=True)[source]

Load the dataset in sequence-labeling format, tokenized at the character level.

Parameters
  • subset_name (str) – one of {train, test, valid}

  • shuffle (bool) – whether to shuffle the samples; defaults to True.

Returns

dataset features and dataset labels

Return type

Tuple[List[List[str]], List[List[str]]]
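The returned labels use the BIO tagging scheme shown in the example above. As a minimal sketch (not part of the kashgari API), here is how those per-character tags can be decoded back into entity spans:

```python
def decode_bio(chars, tags):
    """Collect contiguous B-X/I-X runs into (entity_text, entity_type) spans."""
    spans, current, ctype = [], [], None
    for ch, tag in zip(chars, tags):
        if tag.startswith('B-'):
            # a new entity begins; flush any entity in progress
            if current:
                spans.append((''.join(current), ctype))
            current, ctype = [ch], tag[2:]
        elif tag.startswith('I-') and current and tag[2:] == ctype:
            # continuation of the current entity
            current.append(ch)
        else:
            # 'O' tag or inconsistent 'I-' tag ends the current entity
            if current:
                spans.append((''.join(current), ctype))
            current, ctype = [], None
    if current:
        spans.append((''.join(current), ctype))
    return spans

# sample in the documented format
chars = ['海', '钓', '比', '赛', '地', '点', '在', '厦', '门']
tags = ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'I-LOC']
print(decode_bio(chars, tags))  # [('厦门', 'LOC')]
```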

SMP2018ECDTCorpus

class kashgari.corpus.SMP2018ECDTCorpus[source]

Bases: object

https://worksheets.codalab.org/worksheets/0x27203f932f8341b79841d50ce0fd684f/

This Chinese human-computer dialogue dataset was released for task 1 of the Evaluation of Chinese Human-Computer Dialogue Technology (SMP2018-ECDT) and is provided by iFLYTEK Corporation.

Sample:

      label           query
0   weather        今天东莞天气如何
1       map  从观音桥到重庆市图书馆怎么走
2  cookbook          鸭蛋怎么腌?
3    health         怎么治疗牛皮癣
4      chat             唠什么

Example

>>> from kashgari.corpus import SMP2018ECDTCorpus
>>> train_x, train_y = SMP2018ECDTCorpus.load_data('train')
>>> test_x, test_y = SMP2018ECDTCorpus.load_data('test')
>>> valid_x, valid_y = SMP2018ECDTCorpus.load_data('valid')
>>> print(train_x)
[['听', '新', '闻', '。'], ['电', '视', '台', '在', '播', '什', '么'], ...]
>>> print(train_y)
['news', 'epg', ...]
classmethod load_data(subset_name='train', shuffle=True, cutter='char')[source]

Load the dataset in text-classification format, tokenized at the character level by default.

Parameters
  • subset_name (str) – one of {train, test, valid}

  • shuffle (bool) – whether to shuffle the samples; defaults to True.

  • cutter (str) – tokenization strategy, one of {char, jieba}; defaults to char.

Returns

dataset features and dataset labels

Return type

Tuple[List[List[str]], List[str]]
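Since this corpus returns one flat label per sample, a common next step is building a label-to-index vocabulary for a classification head. A minimal sketch (not part of the kashgari API), using a stand-in list in the documented label format:

```python
from collections import Counter

# stand-in for the train_y list returned by SMP2018ECDTCorpus.load_data
train_y = ['news', 'epg', 'news', 'weather']

# index labels by descending frequency for a stable head layout;
# ties keep first-seen order (Counter preserves insertion order)
counts = Counter(train_y)
label2idx = {label: i for i, (label, _) in enumerate(counts.most_common())}
print(label2idx)  # {'news': 0, 'epg': 1, 'weather': 2}
```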

JigsawToxicCommentCorpus

class kashgari.corpus.JigsawToxicCommentCorpus(corpus_train_csv_path, sample_count=None, tokenizer=None)[source]

Bases: object

Kaggle Toxic Comment Classification Challenge corpus

Download the corpus from https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/overview to a local folder, then initialize a JigsawToxicCommentCorpus object with the path to train.csv.

Examples

>>> from kashgari.corpus import JigsawToxicCommentCorpus
>>> corpus = JigsawToxicCommentCorpus('<train.csv file-path>')
>>> train_x, train_y = corpus.load_data('train')
>>> test_x, test_y = corpus.load_data('test')
>>> print(train_x)
[['Please', 'stop', 'being', 'a', 'penis—', 'and', 'Grow', 'Up', 'Regards-'], ...]
>>> print(train_y)
[['obscene', 'insult'], ...]
__init__(corpus_train_csv_path, sample_count=None, tokenizer=None)[source]

Initialize the corpus with the path to the downloaded train.csv file.

Parameters
  • corpus_train_csv_path (str) – path to the downloaded train.csv file.

  • sample_count (Optional[int]) – if given, limit the corpus to this many samples.

  • tokenizer (Optional[kashgari.tokenizers.base_tokenizer.Tokenizer]) – tokenizer used to split comments; a default tokenizer is used when None.

Return type

None

load_data(subset_name='train', shuffle=True)[source]

Load the dataset in multi-label classification format: features are tokenized word lists, and each sample carries a list of zero or more labels.

Parameters
  • subset_name (str) – one of {train, test, valid}

  • shuffle (bool) – whether to shuffle the samples; defaults to True.

Returns

dataset features and dataset labels

Return type

Tuple[List[List[str]], List[List[str]]]
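Because each sample may carry several labels at once (e.g. ['obscene', 'insult'] in the example above), training a sigmoid classification head typically requires multi-hot label vectors. A minimal sketch (not part of the kashgari API), using the six class names from the Kaggle challenge:

```python
# the six classes defined by the Jigsaw Toxic Comment challenge
CLASSES = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

def to_multi_hot(labels):
    """Map a per-sample label list to a fixed-order 0/1 vector."""
    return [1 if c in labels else 0 for c in CLASSES]

# stand-in samples in the documented label format; an empty list
# means the comment was not flagged for any class
train_y = [['obscene', 'insult'], []]
print([to_multi_hot(y) for y in train_y])
# [[0, 0, 1, 0, 1, 0], [0, 0, 0, 0, 0, 0]]
```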