Corpus
ChineseDailyNerCorpus

class kashgari.corpus.ChineseDailyNerCorpus

Bases: object

Chinese Daily News NER corpus: https://github.com/zjy-ucas/ChineseNER/

Example:

>>> from kashgari.corpus import ChineseDailyNerCorpus
>>> train_x, train_y = ChineseDailyNerCorpus.load_data('train')
>>> test_x, test_y = ChineseDailyNerCorpus.load_data('test')
>>> valid_x, valid_y = ChineseDailyNerCorpus.load_data('valid')
>>> print(train_x)
[['海', '钓', '比', '赛', '地', '点', '在', '厦', '门', ...], ...]
>>> print(train_y)
[['O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'I-LOC', ...], ...]
__init__()

Initialize self. See help(type(self)) for accurate signature.

classmethod load_data(subset_name: str = 'train', shuffle: bool = True) → Tuple[List[List[str]], List[List[str]]]

Load the dataset in sequence-labeling format, tokenized at the character level.

Parameters:
- subset_name – one of {'train', 'test', 'valid'}
- shuffle – whether to shuffle the samples, default True

Returns: dataset features and dataset labels
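The BIO-tagged label sequences returned by load_data can be turned into entity spans with a few lines of plain Python. The helper below is not part of kashgari; it is a minimal sketch applied to the sample sequence from the example above.

```python
# Minimal sketch (not a kashgari API): collect (entity_type, start, end)
# spans from a BIO-tagged label sequence like the ones load_data returns.
from typing import List, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Return (entity_type, start, end) spans; end is exclusive."""
    spans = []
    start, ent_type = None, None
    for i, tag in enumerate(tags):
        if tag.startswith('B-'):
            if start is not None:          # close any open span
                spans.append((ent_type, start, i))
            start, ent_type = i, tag[2:]
        elif tag.startswith('I-') and ent_type == tag[2:]:
            continue                       # extend the current span
        else:                              # 'O' or a mismatched 'I-' tag
            if start is not None:
                spans.append((ent_type, start, i))
            start, ent_type = None, None
    if start is not None:                  # span running to the end
        spans.append((ent_type, start, len(tags)))
    return spans


tokens = ['海', '钓', '比', '赛', '地', '点', '在', '厦', '门']
tags = ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'I-LOC']
spans = bio_to_spans(tags)
print(spans)                                        # [('LOC', 7, 9)]
print([''.join(tokens[s:e]) for _, s, e in spans])  # ['厦门']
```

This recovers the location entity 厦门 (Xiamen) from the char-level tags shown in the example output.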
SMP2018ECDTCorpus

class kashgari.corpus.SMP2018ECDTCorpus

Bases: object

https://worksheets.codalab.org/worksheets/0x27203f932f8341b79841d50ce0fd684f/

This Chinese human-computer dialogue dataset was released for task 1 of the Evaluation of Chinese Human-Computer Dialogue Technology (SMP2018-ECDT) and is provided by the iFLYTEK Corporation.

Sample:

    label     query
0   weather   今天东莞天气如何
1   map       从观音桥到重庆市图书馆怎么走
2   cookbook  鸭蛋怎么腌?
3   health    怎么治疗牛皮癣
4   chat      唠什么

Example:

>>> from kashgari.corpus import SMP2018ECDTCorpus
>>> train_x, train_y = SMP2018ECDTCorpus.load_data('train')
>>> test_x, test_y = SMP2018ECDTCorpus.load_data('test')
>>> valid_x, valid_y = SMP2018ECDTCorpus.load_data('valid')
>>> print(train_x)
[['听', '新', '闻', '。'], ['电', '视', '台', '在', '播', '什', '么'], ...]
>>> print(train_y)
['news', 'epg', ...]
__init__()

Initialize self. See help(type(self)) for accurate signature.

classmethod load_data(subset_name: str = 'train', shuffle: bool = True, cutter: str = 'char') → Tuple[List[List[str]], List[str]]

Load the dataset in sequence-classification format, tokenized at the character level by default.

Parameters:
- subset_name – one of {'train', 'test', 'valid'}
- shuffle – whether to shuffle the samples, default True
- cutter – sentence cutter, one of {'char', 'jieba'}

Returns: dataset features and dataset labels
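The cutter argument controls tokenization granularity: 'char' splits each query into individual characters (matching the example output above), while 'jieba' uses the jieba word segmenter, which must be installed separately. A rough, kashgari-independent sketch of the char-level behaviour on the sample queries:

```python
# Sketch (not kashgari's internal implementation): char-level cutting
# simply turns each query string into a list of its characters,
# producing the same shape as train_x in the example above.
from collections import Counter


def char_cut(query: str) -> list:
    return list(query)


queries = ['今天东莞天气如何', '从观音桥到重庆市图书馆怎么走', '鸭蛋怎么腌?']
labels = ['weather', 'map', 'cookbook']

train_x = [char_cut(q) for q in queries]
print(train_x[0])       # ['今', '天', '东', '莞', '天', '气', '如', '何']
print(Counter(labels))  # label distribution of this toy subset
```

Char-level cutting avoids any dependency on a word segmenter, at the cost of longer input sequences than word-level (jieba) cutting.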
JigsawToxicCommentCorpus

class kashgari.corpus.JigsawToxicCommentCorpus(corpus_train_csv_path: str, sample_count: int = None, tokenizer: kashgari.tokenizers.base_tokenizer.Tokenizer = None)

Bases: object

Kaggle Toxic Comment Classification Challenge corpus.

You need to download the corpus from https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/overview to a folder, then initialize a JigsawToxicCommentCorpus object with the train.csv path.

Examples:

>>> from kashgari.corpus import JigsawToxicCommentCorpus
>>> corpus = JigsawToxicCommentCorpus('<train.csv file-path>')
>>> train_x, train_y = corpus.load_data('train')
>>> test_x, test_y = corpus.load_data('test')
>>> print(train_x)
[['Please', 'stop', 'being', 'a', 'penis—', 'and', 'Grow', 'Up', 'Regards-'], ...]
>>> print(train_y)
[['obscene', 'insult'], ...]
__init__(corpus_train_csv_path: str, sample_count: int = None, tokenizer: kashgari.tokenizers.base_tokenizer.Tokenizer = None) → None

Initialize self. See help(type(self)) for accurate signature.

load_data(subset_name: str = 'train', shuffle: bool = True) → Tuple[List[List[str]], List[List[str]]]

Load the dataset as multi-label classification data, tokenized at the word level (each sample's label is a list of the toxicity classes it carries, as shown in the example above).

Parameters:
- subset_name – one of {'train', 'test', 'valid'}
- shuffle – whether to shuffle the samples, default True

Returns: dataset features and dataset labels
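Because each comment can carry several toxicity labels at once, train_y is a list of label lists. When training a classifier outside kashgari, such multi-label targets are commonly converted to fixed-length binary indicator vectors. A minimal sketch; the six class names below are the Jigsaw challenge's label columns, hard-coded here as an assumption rather than read from the CSV:

```python
# Sketch (not a kashgari API): turn multi-label lists such as
# [['obscene', 'insult'], ...] into binary indicator vectors.
CLASSES = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']


def binarize(label_lists, classes=CLASSES):
    index = {c: i for i, c in enumerate(classes)}
    vectors = []
    for labels in label_lists:
        vec = [0] * len(classes)
        for label in labels:
            vec[index[label]] = 1  # mark each class the sample carries
        vectors.append(vec)
    return vectors


train_y = [['obscene', 'insult'], []]
print(binarize(train_y))  # [[0, 0, 1, 0, 1, 0], [0, 0, 0, 0, 0, 0]]
```

Each vector position corresponds to one class in CLASSES, so an all-zero vector means a non-toxic comment.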