Corpus
ChineseDailyNerCorpus
class kashgari.corpus.ChineseDailyNerCorpus
Bases: object
Chinese Daily NER Corpus. Source: https://github.com/zjy-ucas/ChineseNER/
Example
>>> from kashgari.corpus import ChineseDailyNerCorpus
>>> train_x, train_y = ChineseDailyNerCorpus.load_data('train')
>>> test_x, test_y = ChineseDailyNerCorpus.load_data('test')
>>> valid_x, valid_y = ChineseDailyNerCorpus.load_data('valid')
>>> print(train_x)
[['海', '钓', '比', '赛', '地', '点', '在', '厦', '门', ...], ...]
>>> print(train_y)
[['O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'I-LOC', ...], ...]
__init__()
Initialize self. See help(type(self)) for accurate signature.
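The tags printed in the example above follow the BIO scheme (B-LOC opens a location entity, I-LOC continues it, O marks tokens outside any entity). As a minimal sketch of how such output might be consumed, the helper below collects entity spans from one tagged sentence. The function name extract_entities is hypothetical and not part of Kashgari, and the sample sentence is only the visible prefix of the truncated example output.

```python
from typing import List, Tuple


def extract_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collect (entity_text, entity_type) pairs from one BIO-tagged sentence.

    Hypothetical helper for illustration; it is not a Kashgari API.
    """
    entities = []
    chars, etype = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; close any entity that is still open.
            if chars:
                entities.append(("".join(chars), etype))
            chars, etype = [token], tag[2:]
        elif tag.startswith("I-") and chars and tag[2:] == etype:
            # Continuation of the currently open entity.
            chars.append(token)
        else:
            # 'O' tag, or an I- tag without a matching open entity (ignored).
            if chars:
                entities.append(("".join(chars), etype))
            chars, etype = [], None
    if chars:
        entities.append(("".join(chars), etype))
    return entities


# Visible prefix of the first training sentence from the example above.
tokens = ['海', '钓', '比', '赛', '地', '点', '在', '厦', '门']
tags = ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'I-LOC']
print(extract_entities(tokens, tags))  # [('厦门', 'LOC')]
```

Characters are joined without a separator because the corpus is tokenized per Chinese character; a word-level corpus would need a space when joining.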
SMP2018ECDTCorpus
class kashgari.corpus.SMP2018ECDTCorpus
Bases: object
https://worksheets.codalab.org/worksheets/0x27203f932f8341b79841d50ce0fd684f/
This Chinese human-computer dialogue dataset was released for Task 1 of the Evaluation of Chinese Human-Computer Dialogue Technology (SMP2018-ECDT) and is provided by iFLYTEK Corporation.
Sample:
   label     query
0  weather   今天东莞天气如何
1  map       从观音桥到重庆市图书馆怎么走
2  cookbook  鸭蛋怎么腌?
3  health    怎么治疗牛皮癣
4  chat      唠什么
Example
>>> from kashgari.corpus import SMP2018ECDTCorpus
>>> train_x, train_y = SMP2018ECDTCorpus.load_data('train')
>>> test_x, test_y = SMP2018ECDTCorpus.load_data('test')
>>> valid_x, valid_y = SMP2018ECDTCorpus.load_data('valid')
>>> print(train_x)
[['听', '新', '闻', '。'], ['电', '视', '台', '在', '播', '什', '么'], ...]
>>> print(train_y)
['news', 'epg', ...]
__init__()
Initialize self. See help(type(self)) for accurate signature.
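Because load_data returns character lists and one string label per sample, a common first step is to inspect the label distribution and map labels to integer ids. The sketch below does this with the standard library only. The two samples are copied from the example output above rather than loaded from the corpus, and the names label2idx and encoded_y are illustrative assumptions, not Kashgari API.

```python
from collections import Counter

# Two samples copied from the example output above (the real split is larger).
train_x = [['听', '新', '闻', '。'], ['电', '视', '台', '在', '播', '什', '么']]
train_y = ['news', 'epg']

# How often each intent label occurs.
label_counts = Counter(train_y)

# Stable label-to-id mapping: sort the labels so the ids are reproducible.
label2idx = {label: idx for idx, label in enumerate(sorted(label_counts))}
encoded_y = [label2idx[label] for label in train_y]

# Rebuild the original query strings from the per-character tokens.
sentences = ["".join(chars) for chars in train_x]

print(label2idx)   # {'epg': 0, 'news': 1}
print(encoded_y)   # [1, 0]
print(sentences)   # ['听新闻。', '电视台在播什么']
```

Sorting the labels before enumerating keeps the mapping identical across runs, which matters if the ids are saved alongside a trained model.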
JigsawToxicCommentCorpus
class kashgari.corpus.JigsawToxicCommentCorpus(corpus_train_csv_path, sample_count=None, tokenizer=None)
Bases: object
Kaggle Toxic Comment Classification Challenge corpus
Download the corpus from https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/overview into a local folder, then initialize a JigsawToxicCommentCorpus object with the path to train.csv.
Examples
>>> from kashgari.corpus import JigsawToxicCommentCorpus
>>> corpus = JigsawToxicCommentCorpus('<train.csv file-path>')
>>> train_x, train_y = corpus.load_data('train')
>>> test_x, test_y = corpus.load_data('test')
>>> print(train_x)
[['Please', 'stop', 'being', 'a', 'penis—', 'and', 'Grow', 'Up', 'Regards-'], ...]
>>> print(train_y)
[['obscene', 'insult'], ...]
__init__(corpus_train_csv_path, sample_count=None, tokenizer=None)
Initialize self. See help(type(self)) for accurate signature.
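Unlike the other two corpora, each sample here carries a list of labels, so the task is multi-label classification. A minimal sketch of turning those label lists into multi-hot vectors is shown below. The helper to_multi_hot is hypothetical, and the label order in LABELS is an assumption based on the six label columns of the Kaggle train.csv; it is not something this page specifies.

```python
from typing import List, Sequence

# Assumed label set and order (the six label columns of the Kaggle train.csv).
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


def to_multi_hot(label_lists: Sequence[Sequence[str]],
                 labels: Sequence[str] = LABELS) -> List[List[int]]:
    """Encode each list of label names as a 0/1 vector over `labels`.

    Hypothetical helper for illustration; it is not a Kashgari API.
    """
    index = {label: i for i, label in enumerate(labels)}
    vectors = []
    for sample_labels in label_lists:
        vec = [0] * len(labels)
        for label in sample_labels:
            vec[index[label]] = 1  # raises KeyError on an unknown label
        vectors.append(vec)
    return vectors


# First label list from the example above, plus a comment with no labels.
train_y = [['obscene', 'insult'], []]
print(to_multi_hot(train_y))
# [[0, 0, 1, 0, 1, 0], [0, 0, 0, 0, 0, 0]]
```

A comment with an empty label list maps to the all-zero vector, which is how non-toxic comments would be represented under this encoding.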