Kashgari
Overview | Performance | Installation | Documentation | Contributing
🎉🎉🎉 We released the 2.0.0 version with TF2 Support. 🎉🎉🎉
If you use this project for your research, please cite:
@misc{Kashgari,
  author = {Eliyar Eziz},
  title = {Kashgari},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/BrikerMan/Kashgari}}
}
Overview¶
Kashgari is a simple and powerful NLP transfer-learning framework that lets you build a state-of-the-art model in 5 minutes for named entity recognition (NER), part-of-speech tagging (PoS), and text classification tasks.
Human-friendly. Kashgari’s code is straightforward, well documented and tested, which makes it very easy to understand and modify.
Powerful and simple. Kashgari allows you to apply state-of-the-art natural language processing (NLP) models to your text, such as named entity recognition (NER), part-of-speech tagging (PoS) and classification.
Built-in transfer learning. Kashgari has built-in pre-trained BERT and Word2vec embedding models, which makes it very simple to use transfer learning when training your model.
Fully scalable. Kashgari provides a simple, fast, and scalable environment for experimentation: train your models and try new approaches using different embeddings and model structures.
Production ready. Kashgari can export models in the SavedModel format for TensorFlow Serving, so you can deploy them directly in the cloud.
Our Goal¶
Academic users: easier experimentation to prove their hypotheses without coding from scratch.
NLP beginners: learn how to build an NLP project with production-level code quality.
NLP developers: build a production-level classification/labeling model within minutes.
Performance¶
Performance reports are welcome.
Task | Language | Dataset | Score
---|---|---|---
Named Entity Recognition | Chinese | ChineseDailyNerCorpus | 95.57
Text Classification | Chinese | SMP2018ECDTCorpus | 94.57
Installation¶
The project is based on Python 3.6+, because it is 2019 and type hinting is cool.
Backend | pypi version | desc
---|---|---
TensorFlow 2.1+ | kashgari 2.x.x | TF2.1+ with tf.keras
TensorFlow 1.14+ | kashgari 1.x.x | TF1.14+ with tf.keras
Keras | kashgari 0.x.x | keras version
Tutorials¶
Here is a set of quick tutorials to get you started with the library:
There are also articles and posts that illustrate how to use Kashgari:
Examples:
Contributors ✨¶
Thanks go to these wonderful people. There are many ways to get involved: start with the contributor guidelines and then check the open issues for specific tasks.
Text Classification Model¶
Kashgari provides several models for text classification. All classification models inherit from `ABCClassificationModel`, so you can easily switch from one model to another just by changing one line of code, as sketched after the table below.
Available Models¶
Name | Info
---|---
BiLSTM_Model |
BiGRU_Model |
CNN_Model |
CNN_LSTM_Model |
CNN_GRU_Model |
CNN_Attention_Model |
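Switching from one architecture to another is a one-line change, since all of these classes share the same API. A minimal sketch (assuming the SMP2018ECDTCorpus used later in this guide):
from kashgari.corpus import SMP2018ECDTCorpus
from kashgari.tasks.classification import BiLSTM_Model, CNN_Model
train_x, train_y = SMP2018ECDTCorpus.load_data('train')
valid_x, valid_y = SMP2018ECDTCorpus.load_data('valid')
# Change this single line to CNN_Model() (or any class in the table above) to switch models
model = BiLSTM_Model()
model.fit(train_x, train_y, valid_x, valid_y)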
Train basic classification model¶
Kashgari provides a basic intent-classification corpus for experiments. You could also use your own corpus in any language for training.
# Load built-in corpus.
from kashgari.corpus import SMP2018ECDTCorpus
train_x, train_y = SMP2018ECDTCorpus.load_data('train')
valid_x, valid_y = SMP2018ECDTCorpus.load_data('valid')
test_x, test_y = SMP2018ECDTCorpus.load_data('test')
# Or use your own corpus
train_x = [['Hello', 'world'], ['Hello', 'Kashgari']]
train_y = ['a', 'b']
valid_x, valid_y = train_x, train_y
test_x, test_y = train_x, train_y
Then train our first model. All models provide the same APIs, so you could use any classification model here.
import kashgari
from kashgari.tasks.classification import BiLSTM_Model
import logging
logging.basicConfig(level='DEBUG')
model = BiLSTM_Model()
model.fit(train_x, train_y, valid_x, valid_y)
# Evaluate the model
model.evaluate(test_x, test_y)
# Model data will be saved to the `saved_classification_model` folder
model.save('saved_classification_model')
# Load saved model
loaded_model = BiLSTM_Model.load_model('saved_classification_model')
loaded_model.predict(test_x[:10])
# To continue training, compile the newly loaded model first
loaded_model.compile_model()
loaded_model.fit(train_x, train_y, valid_x, valid_y)
That's all you need to do. Easy, right?
Text classification with transfer learning¶
Kashgari provides various language model embeddings for transfer learning. Here is an example using the BERT embedding.
import kashgari
from kashgari.tasks.classification import BiGRU_Model
from kashgari.embeddings import BertEmbedding
import logging
logging.basicConfig(level='DEBUG')
bert_embed = BertEmbedding('<PRE_TRAINED_BERT_MODEL_FOLDER>')
model = BiGRU_Model(bert_embed, sequence_length=100)
model.fit(train_x, train_y, valid_x, valid_y)
You could replace `bert_embed` with any embedding class in `kashgari.embeddings`. See the Language Embeddings section for more details.
Adjust model’s hyper-parameters¶
You can easily change a model's hyper-parameters. For example, let's change the LSTM units in `BiLSTM_Model` from 128 to 32.
from kashgari.tasks.classification import BiLSTM_Model
hyper = BiLSTM_Model.default_hyper_parameters()
print(hyper)
# {'layer_bi_lstm': {'units': 128, 'return_sequences': False}, 'layer_dense': {'activation': 'softmax'}}
hyper['layer_bi_lstm']['units'] = 32
model = BiLSTM_Model(hyper_parameters=hyper)
Use custom optimizer¶
Kashgari already supports using customized optimizers, such as RAdam.
from kashgari.corpus import SMP2018ECDTCorpus
from kashgari.tasks.classification import BiLSTM_Model
# Remember to import kashgari before RAdam
from keras_radam import RAdam
train_x, train_y = SMP2018ECDTCorpus.load_data('train')
valid_x, valid_y = SMP2018ECDTCorpus.load_data('valid')
test_x, test_y = SMP2018ECDTCorpus.load_data('test')
model = BiLSTM_Model()
# This step will build token dict, label dict and model structure
model.build_model(train_x, train_y, valid_x, valid_y)
# Compile model with custom optimizer, you can also customize loss and metrics.
optimizer = RAdam()
model.compile_model(optimizer=optimizer)
# Train model
model.fit(train_x, train_y, valid_x, valid_y)
Use callbacks¶
Kashgari is based on tf.keras, so you can use all of the tf.keras callbacks directly with a Kashgari model. For example, here is how to visualize training with TensorBoard.
from tensorflow import keras
from kashgari.tasks.classification import BiGRU_Model
from kashgari.callbacks import EvalCallBack
import logging
logging.basicConfig(level='DEBUG')
model = BiGRU_Model()
tf_board_callback = keras.callbacks.TensorBoard(log_dir='./logs', update_freq=1000)
# Built-in callback that prints precision, recall and F1 every `step` epochs
eval_callback = EvalCallBack(kash_model=model,
valid_x=valid_x,
valid_y=valid_y,
step=5)
model.fit(train_x,
train_y,
valid_x,
valid_y,
batch_size=100,
callbacks=[eval_callback, tf_board_callback])
Multi-Label Classification¶
Kashgari supports multi-label classification. Here is how to build one. Let's assume we have a dataset like this:
x = [
['This','news','are' , 'very','well','organized'],
['What','extremely','usefull','tv','show'],
['The','tv','presenter','were','very','well','dress'],
['Multi-class', 'classification', 'means', 'a', 'classification', 'task', 'with', 'more', 'than', 'two', 'classes']
]
y = [
['A', 'B'],
['A',],
['B', 'C'],
[]
]
Now we need to initialize an embedding for our model, then build the model with `multi_label=True` and fit it; a prediction sketch follows the training snippet.
import logging
from kashgari.embeddings import BertEmbedding
from kashgari.tasks.classification import BiLSTM_Model
logging.basicConfig(level='DEBUG')
bert_embed = BertEmbedding('<PRE_TRAINED_BERT_MODEL_FOLDER>')
model = BiLSTM_Model(bert_embed, sequence_length=100, multi_label=True)
model.fit(x, y)
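Once fitted, prediction works like any other classification model; here is a sketch (the threshold parameter is documented in the predict() API reference below, and the printed output is illustrative, not a recorded run):
# Labels whose score exceeds `multi_label_threshold` are returned for each sample
preds = model.predict(x[:2], multi_label_threshold=0.5)
print(preds)
# e.g. [['A', 'B'], ['A']]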
Customize your own model¶
It is very easy and straightforward to build your own customized model: just inherit from `ABCClassificationModel` and implement the `default_hyper_parameters()` function and the `build_model_arc()` function.
from typing import Dict, Any
from tensorflow import keras
from kashgari.tasks.classification.abc_model import ABCClassificationModel
from kashgari.layers import L
import logging
logging.basicConfig(level='DEBUG')
class DoubleBLSTMModel(ABCClassificationModel):
    """Bidirectional LSTM Classification Model"""

    @classmethod
    def default_hyper_parameters(cls) -> Dict[str, Dict[str, Any]]:
        """
        Get hyper parameters of model
        Returns:
            hyper parameters dict
        """
        return {
            'layer_blstm1': {
                'units': 128,
                'return_sequences': True
            },
            'layer_blstm2': {
                'units': 128,
                'return_sequences': False
            },
            'layer_dropout': {
                'rate': 0.4
            },
            'layer_time_distributed': {},
            'layer_output': {}
        }

    def build_model_arc(self):
        """
        Build the model architecture
        """
        output_dim = len(self.processor.label2idx)
        config = self.hyper_parameters
        embed_model = self.embedding.embed_model

        # Define your layers
        layer_blstm1 = L.Bidirectional(L.LSTM(**config['layer_blstm1']),
                                       name='layer_blstm1')
        layer_blstm2 = L.Bidirectional(L.LSTM(**config['layer_blstm2']),
                                       name='layer_blstm2')
        layer_dropout = L.Dropout(**config['layer_dropout'],
                                  name='layer_dropout')
        layer_time_distributed = L.Dense(output_dim, **config['layer_output'])
        # You need to use this activation layer as the final activation
        # to support multi-label classification
        layer_activation = self._activation_layer()

        # Chain the layers on top of the embedding output
        tensor = layer_blstm1(embed_model.output)
        tensor = layer_blstm2(tensor)
        tensor = layer_dropout(tensor)
        tensor = layer_time_distributed(tensor)
        output_tensor = layer_activation(tensor)

        # Init model
        self.tf_model = keras.Model(embed_model.inputs, output_tensor)
model = DoubleBLSTMModel()
model.fit(train_x, train_y, valid_x, valid_y)
Short Sentence Classification Performance¶
We have run the classification tests on SMP2018ECDTCorpus. Here is the full code: colab link
SEQUENCE_LENGTH = 60
EPOCHS = 30
EARLY_STOPPING_PATIENCE = 10
REDUCE_LR_PATIENCE = 5
BATCH_SIZE = 64
# | Embedding | Model | Best F1-Score | Best F1 @ epochs |
---|---|---|---|---|
0 | RoBERTa-wwm-ext | BiLSTM_Model | 92.89 | 15 |
1 | RoBERTa-wwm-ext | BiGRU_Model | 94.57 | 10 |
2 | RoBERTa-wwm-ext | CNN_Model | 92.95 | 12 |
3 | RoBERTa-wwm-ext | CNN_Attention_Model | 92.07 | 3 |
4 | RoBERTa-wwm-ext | CNN_GRU_Model | 89.56 | 22 |
5 | RoBERTa-wwm-ext | CNN_LSTM_Model | 90.9 | 26 |
6 | Bert-Chinese | BiLSTM_Model | 93.74 | 4 |
7 | Bert-Chinese | BiGRU_Model | 93.12 | 13 |
8 | Bert-Chinese | CNN_Model | 92.95 | 13 |
9 | Bert-Chinese | CNN_Attention_Model | 92.04 | 8 |
10 | Bert-Chinese | CNN_GRU_Model | 92.88 | 8 |
11 | Bert-Chinese | CNN_LSTM_Model | 91.15 | 24 |
12 | Bare | BiLSTM_Model | 81.96 | 11 |
13 | Bare | BiGRU_Model | 82.86 | 9 |
14 | Bare | CNN_Model | 86.61 | 11 |
15 | Bare | CNN_Attention_Model | 78.84 | 12 |
16 | Bare | CNN_GRU_Model | 66.14 | 26 |
17 | Bare | CNN_LSTM_Model | 48.13 | 29 |
Text Labeling Model¶
Kashgari provides several models for text labeling. All labeling models inherit from `ABCLabelingModel`, so you can easily switch from one model to another just by changing one line of code, as sketched after the table below.
Available Models¶
Name | Info
---|---
CNN_LSTM_Model |
BiLSTM_Model |
BiGRU_Model |
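The same one-line switch applies here; a minimal sketch using the built-in NER corpus introduced below:
from kashgari.corpus import ChineseDailyNerCorpus
from kashgari.tasks.labeling import BiGRU_Model
train_x, train_y = ChineseDailyNerCorpus.load_data('train')
valid_x, valid_y = ChineseDailyNerCorpus.load_data('valid')
# Swap BiGRU_Model for BiLSTM_Model or CNN_LSTM_Model to change architectures
model = BiGRU_Model()
model.fit(train_x, train_y, valid_x, valid_y)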
Train basic NER model¶
Kashgari provides a basic NER corpus for experiments. You could also use your own corpus in any language for training.
# Load built-in corpus.
from kashgari.corpus import ChineseDailyNerCorpus
train_x, train_y = ChineseDailyNerCorpus.load_data('train')
valid_x, valid_y = ChineseDailyNerCorpus.load_data('valid')
test_x, test_y = ChineseDailyNerCorpus.load_data('test')
# Or use your own corpus
train_x = [['Hello', 'world'], ['Hello', 'Kashgari'], ['I', 'love', 'Beijing']]
train_y = [['O', 'O'], ['O', 'B-PER'], ['O', 'O', 'B-LOC']]
valid_x, valid_y = train_x, train_y
test_x, test_y = train_x, train_y
If you use your own corpus, it needs to be tokenized like this:
>>> print(train_x[0])
['海', '钓', '比', '赛', '地', '点', '在', '厦', '门', '与', '金', '门', '之', '间', '的', '海', '域', '。']
>>> print(train_y[0])
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'I-LOC', 'O', 'B-LOC', 'I-LOC', 'O', 'O', 'O', 'O', 'O', 'O']
Then train our first model. All models provide the same APIs, so you could use any labeling model here.
import kashgari
from kashgari.tasks.labeling import BiLSTM_Model
model = BiLSTM_Model()
model.fit(train_x, train_y, valid_x, valid_y)
# Evaluate the model
model.evaluate(test_x, test_y)
# Model data will save to `saved_ner_model` folder
model.save('saved_ner_model')
# Load saved model
loaded_model = BiLSTM_Model.load_model('saved_ner_model')
loaded_model.predict(test_x[:10])
# To continue training, compile the newly loaded model first
loaded_model.compile_model()
loaded_model.fit(train_x, train_y, valid_x, valid_y)
That's all you need to do. Easy, right?
Sequence labeling with transfer learning¶
Kashgari provides various language model embeddings for transfer learning. Here is an example using the BERT embedding.
from kashgari.tasks.labeling import BiLSTM_Model
from kashgari.embeddings import BertEmbedding
bert_embed = BertEmbedding('<PRE_TRAINED_BERT_MODEL_FOLDER>')
model = BiLSTM_Model(bert_embed, sequence_length=100)
model.fit(train_x, train_y, valid_x, valid_y)
You could replace `bert_embed` with any embedding class in `kashgari.embeddings`. See the Language Embeddings section for more details.
Adjust model’s hyper-parameters¶
You can easily change a model's hyper-parameters. For example, let's change the LSTM units in `BiLSTM_Model` from 128 to 32.
from kashgari.tasks.labeling import BiLSTM_Model
hyper = BiLSTM_Model.default_hyper_parameters()
print(hyper)
# {'layer_blstm': {'units': 128, 'return_sequences': True}, 'layer_dropout': {'rate': 0.4}, 'layer_time_distributed': {}, 'layer_activation': {'activation': 'softmax'}}
hyper['layer_blstm']['units'] = 32
model = BiLSTM_Model(hyper_parameters=hyper)
Use custom optimizer¶
Kashgari already supports using customized optimizers, such as RAdam.
from kashgari.corpus import SMP2018ECDTCorpus
from kashgari.tasks.classification import BiLSTM_Model
# Remember to import kashgari before RAdam
from keras_radam import RAdam
train_x, train_y = SMP2018ECDTCorpus.load_data('train')
valid_x, valid_y = SMP2018ECDTCorpus.load_data('valid')
test_x, test_y = SMP2018ECDTCorpus.load_data('test')
model = BiLSTM_Model()
# This step will build token dict, label dict and model structure
model.build_model(train_x, train_y, valid_x, valid_y)
# Compile model with custom optimizer, you can also customize loss and metrics.
optimizer = RAdam()
model.compile_model(optimizer=optimizer)
# Train model
model.fit(train_x, train_y, valid_x, valid_y)
Use callbacks¶
Kashgari is based on tf.keras, so you can use all of the tf.keras callbacks directly with a Kashgari model. For example, here is how to visualize training with TensorBoard.
from tensorflow import keras
from kashgari.tasks.labeling import BiLSTM_Model
from kashgari.callbacks import EvalCallBack
model = BiLSTM_Model()
tf_board_callback = keras.callbacks.TensorBoard(log_dir='./logs', update_freq=1000)
# Built-in callback that prints precision, recall and F1 every `step` epochs
eval_callback = EvalCallBack(kash_model=model,
valid_x=valid_x,
valid_y=valid_y,
step=5)
model.fit(train_x,
train_y,
valid_x,
valid_y,
batch_size=100,
callbacks=[eval_callback, tf_board_callback])
Customize your own model¶
It is very easy and straightforward to build your own customized model: just inherit from `ABCLabelingModel` and implement the `default_hyper_parameters()` function and the `build_model_arc()` function.
from typing import Dict, Any
from tensorflow import keras
from kashgari.tasks.labeling.abc_model import ABCLabelingModel
from kashgari.layers import L
import logging
logging.basicConfig(level='DEBUG')
class DoubleBLSTMModel(ABCLabelingModel):
    """Bidirectional LSTM Sequence Labeling Model"""

    @classmethod
    def default_hyper_parameters(cls) -> Dict[str, Dict[str, Any]]:
        """
        Get hyper parameters of model
        Returns:
            hyper parameters dict
        """
        return {
            'layer_blstm1': {
                'units': 128,
                'return_sequences': True
            },
            'layer_blstm2': {
                'units': 128,
                'return_sequences': True
            },
            'layer_dropout': {
                'rate': 0.4
            },
            'layer_time_distributed': {},
            'layer_activation': {
                'activation': 'softmax'
            }
        }

    def build_model_arc(self):
        """
        Build the model architecture
        """
        output_dim = len(self.processor.label2idx)
        config = self.hyper_parameters
        embed_model = self.embedding.embed_model

        # Define your layers
        layer_blstm1 = L.Bidirectional(L.LSTM(**config['layer_blstm1']),
                                       name='layer_blstm1')
        layer_blstm2 = L.Bidirectional(L.LSTM(**config['layer_blstm2']),
                                       name='layer_blstm2')
        layer_dropout = L.Dropout(**config['layer_dropout'],
                                  name='layer_dropout')
        layer_time_distributed = L.TimeDistributed(L.Dense(output_dim,
                                                           **config['layer_time_distributed']),
                                                   name='layer_time_distributed')
        layer_activation = L.Activation(**config['layer_activation'])

        # Chain the layers on top of the embedding output
        tensor = layer_blstm1(embed_model.output)
        tensor = layer_blstm2(tensor)
        tensor = layer_dropout(tensor)
        tensor = layer_time_distributed(tensor)
        output_tensor = layer_activation(tensor)

        # Init model
        self.tf_model = keras.Model(embed_model.inputs, output_tensor)
model = DoubleBLSTMModel()
model.fit(train_x, train_y, valid_x, valid_y)
Chinese NER Performance¶
We have run the sequence-labeling tests on ChineseDailyNerCorpus. Here is the full code: colab link
SEQUENCE_LENGTH = 100
EPOCHS = 30
EARLY_STOPPING_PATIENCE = 10
REDUCE_LR_PATIENCE = 5
BATCH_SIZE = 64
# | Embedding | Model | Best F1-Score | Best F1 @ epochs |
---|---|---|---|---|
0 | RoBERTa-wwm-ext | BiGRU_Model | 93.22 | 11 |
1 | RoBERTa-wwm-ext | BiGRU_CRF_Model | 95.13 | 29 |
2 | RoBERTa-wwm-ext | BiLSTM_Model | 93.37 | 19 |
3 | RoBERTa-wwm-ext | BiLSTM_CRF_Model | 95.43 | 26 |
4 | RoBERTa-wwm-ext | CNN_LSTM_Model | 94.05 | 23 |
5 | Bert-Chinese | BiGRU_Model | 93.01 | 16 |
6 | Bert-Chinese | BiGRU_CRF_Model | 95.01 | 24 |
7 | Bert-Chinese | BiLSTM_Model | 93.85 | 17 |
8 | Bert-Chinese | BiLSTM_CRF_Model | 95.57 | 26 |
9 | Bert-Chinese | CNN_LSTM_Model | 93.17 | 16 |
10 | Bare | BiGRU_Model | 74.85 | 16 |
11 | Bare | BiGRU_CRF_Model | 81.24 | 21 |
12 | Bare | BiLSTM_Model | 74.7 | 19 |
13 | Bare | BiLSTM_CRF_Model | 82.37 | 25 |
14 | Bare | CNN_LSTM_Model | 75.07 | 14 |
Seq2Seq Model¶
Train a translate model¶
# Original Corpus
x_original = [
'Who am I?',
'I am sick.',
'I like you.',
'I need help.',
'It may hurt.',
'Good morning.']
y_original = [
'مەن كىم ؟',
'مەن كېسەل.',
'مەن سىزنى ياخشى كۆرمەن',
'ماڭا ياردەم كېرەك.',
'ئاغىرىشى مۇمكىن.',
'خەيىرلىك ئەتىگەن.']
# Tokenize sentences with a custom tokenizing function.
# We use the Bert Tokenizer for this demo.
from kashgari.tokenizers import BertTokenizer
tokenizer = BertTokenizer()
x_tokenized = [tokenizer.tokenize(sample) for sample in x_original]
y_tokenized = [tokenizer.tokenize(sample) for sample in y_original]
After tokenizing the corpus, we can build a seq2seq Model.
from kashgari.tasks.seq2seq import Seq2Seq
model = Seq2Seq()
model.fit(x_tokenized, y_tokenized)
# predict with model
preds, attention = model.predict(x_tokenized)
print(preds)
Train with custom embedding¶
You can define both the encoder's and the decoder's embedding. This is how to use a BERT embedding as the encoder's embedding layer.
from kashgari.tasks.seq2seq import Seq2Seq
from kashgari.embeddings import BertEmbedding
bert = BertEmbedding('<PATH_TO_BERT_EMBEDDING>')
model = Seq2Seq(encoder_embedding=bert, hidden_size=512)
model.fit(x_tokenized, y_tokenized)
Language Embeddings¶
Kashgari provides several embeddings for language representation. Embedding layers convert the input sequence into tensors for the downstream task. Available embeddings:
class name | description
---|---
BareEmbedding | random init embedding
WordEmbedding | pre-trained Word2Vec embedding
BertEmbedding | pre-trained BERT embedding
TransformerEmbedding | pre-trained transformer embedding (BERT, ALBERT, RoBERTa, NEZHA)
All embedding classes inherit from the `ABCEmbedding` class and implement the `embed()` function to embed your input sequence, plus the `embed_model` property that you need to build your own model. By providing the `embed()` function and the `embed_model` property, Kashgari hides the complexity of different language embeddings from users; all you need to care about is which language embedding you need.
You can check out the embedding API documentation in the Embeddings reference section below.
Quick start¶
Feature Extract From Pre-trained Embedding¶
Feature extraction is one of the major ways to use a pre-trained language embedding, and Kashgari provides a simple API for this task. All you need to do is initialize an embedding object, set up its text processor, and then call the `embed()` function. Here is an example; all embeddings share the same embed API.
from kashgari.embeddings import BertEmbedding
from kashgari.processors import SequenceProcessor
bert = BertEmbedding('<BERT_MODEL_FOLDER>')
processor = SequenceProcessor()
bert.setup_text_processor(processor)
# call for embed
embed_tensor = bert.embed([['语', '言', '模', '型']])
print(embed_tensor)
# array([[-0.5001117 , 0.9344998 , -0.55165815, ..., 0.49122602,
# -0.2049343 , 0.25752577],
# [-1.05762 , -0.43353617, 0.54398274, ..., -0.61096823,
# 0.04312163, 0.03881482],
# [ 0.14332692, -0.42566583, 0.68867105, ..., 0.42449307,
# 0.41105768, 0.08222893],
# ...,
# [-0.86124015, 0.08591427, -0.34404194, ..., 0.19915134,
# -0.34176797, 0.06111742],
# [-0.73940575, -0.02692179, -0.5826528 , ..., 0.26934686,
# -0.29708537, 0.01855129],
# [-0.85489404, 0.007399 , -0.26482674, ..., 0.16851354,
# -0.36805922, -0.0052386 ]], dtype=float32)
Classification and Labeling¶
See details at classification and labeling tutorial.
Customized model¶
You can access the tf.keras model of an embedding and add your own layers or any other kind of customization; you just need to access the `embed_model` property of the embedding object, as sketched below.
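For example, here is a minimal sketch (not from the original docs) that stacks plain tf.keras layers on top of an embedding's embed_model; x_data, y_data and the 3-class output layer are stand-ins:
from tensorflow import keras
from kashgari.embeddings import BareEmbedding
embedding = BareEmbedding(embedding_size=100)
embedding.analyze_corpus(x_data, y_data)  # builds the vocabulary, see the next section
embed_model = embedding.embed_model       # a tf.keras model that outputs token embeddings
tensor = keras.layers.GlobalAveragePooling1D()(embed_model.output)
output = keras.layers.Dense(3, activation='softmax')(tensor)
custom_model = keras.Model(embed_model.inputs, output)
custom_model.compile(loss='categorical_crossentropy', optimizer='adam')
custom_model.summary()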
Bare Embedding¶
BareEmbedding is a randomly initialized tf.keras.layers.Embedding layer for text sequence embedding, which is the default embedding class for Kashgari models.
- kashgari.embeddings.BareEmbedding.__init__(self, embedding_size=100, **kwargs)¶
- Parameters
embedding_size (int) – Dimension of the dense embedding.
kwargs (Any) – additional params
Here is a sample of how to use the embedding class. The key difference here is that you must call the `analyze_corpus()` function before using the `embed()` function. This is because the embedding layer is not pre-trained and does not contain any word list, so we need to build the word list from the corpus.
import kashgari
from kashgari.embeddings import BareEmbedding
embedding = BareEmbedding(embedding_size=100)
embedding.analyze_corpus(x_data, y_data)
embed_tensor = embedding.embed_one(['语', '言', '模', '型'])
Word Embedding¶
WordEmbedding is a tf.keras.layers.Embedding layer with pre-trained Word2Vec/GloVe embedding weights. A usage sketch follows the parameter list below.
- kashgari.embeddings.WordEmbedding.__init__(self, w2v_path, *, w2v_kwargs=None, **kwargs)¶
- Parameters
w2v_path (str) – Word2Vec file path.
w2v_kwargs (Optional[Dict[str, Any]]) – params passed to the load_word2vec_format() function of gensim.models.KeyedVectors
kwargs (Any) – additional params
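A hedged usage sketch ('<PATH_TO_W2V_FILE>' is a placeholder, and the w2v_kwargs shown are simply forwarded to gensim's load_word2vec_format()):
from kashgari.embeddings import WordEmbedding
from kashgari.tasks.classification import BiLSTM_Model
word_embed = WordEmbedding('<PATH_TO_W2V_FILE>', w2v_kwargs={'binary': True})
model = BiLSTM_Model(word_embed, sequence_length=100)
model.fit(train_x, train_y, valid_x, valid_y)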
Bert Embedding¶
BertEmbedding is a simple wrapper class around TransformerEmbedding. If you need to load another kind of transformer-based language model, please use TransformerEmbedding.
Note
When using a pre-trained embedding, remember to use the same tokenization tool as the embedding model; this lets you access the full power of the embedding.
- kashgari.embeddings.BertEmbedding.__init__(self, model_folder, **kwargs)¶
- Parameters
model_folder (str) – path of checkpoint folder.
kwargs (Any) – additional params
Example Usage - Text Classification¶
Let’s run a text classification model with BERT.
sentences = [
"Jim Henson was a puppeteer.",
"This here's an example of using the BERT tokenizer.",
"Why did the chicken cross the road?"
]
labels = [
"class1",
"class2",
"class1"
]
########## Load Bert Embedding ##########
import os
from kashgari.embeddings import BertEmbedding
from kashgari.tokenizers import BertTokenizer
bert_embedding = BertEmbedding('<PATH_TO_BERT_EMBEDDING>')
tokenizer = BertTokenizer.load_from_vocab_file(os.path.join('<PATH_TO_BERT_EMBEDDING>', 'vocab_chinese.txt'))
sentences_tokenized = [tokenizer.tokenize(s) for s in sentences]
"""
The sentences will become tokenized into:
[
['jim', 'henson', 'was', 'a', 'puppet', '##eer', '.'],
['this', 'here', "'", 's', 'an', 'example', 'of', 'using', 'the', 'bert', 'token', '##izer', '.'],
['why', 'did', 'the', 'chicken', 'cross', 'the', 'road', '?']
]
"""
train_x, train_y = sentences_tokenized[:2], labels[:2]
validate_x, validate_y = sentences_tokenized[2:], labels[2:]
########## build model ##########
from kashgari.tasks.classification import CNN_LSTM_Model
model = CNN_LSTM_Model(bert_embedding)
########## /build model ##########
model.fit(
train_x, train_y,
validate_x, validate_y,
epochs=3,
batch_size=32
)
# save model
model.save('path/to/save/model/to')
Use sentence pairs for input¶
Let's assume the input pair sample is "First do it" and "then do it right". First tokenize the two sentences with the BERT tokenizer, then join them with a [SEP] token; after that, training works exactly as before (see the sketch after the snippet).
sentence1 = ['First', 'do', 'it']
sentence2 = ['then', 'do', 'it', 'right']
sample = sentence1 + ["[SEP]"] + sentence2
# Add a special separation token `[SEP]` between two sentences tokens
# Generate a new token list
# ['First', 'do', 'it', '[SEP]', 'then', 'do', 'it', 'right']
train_x = [sample]
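From here, training works exactly like the single-sentence example above; a sketch (the label is hypothetical and bert_embedding is the embedding loaded earlier):
from kashgari.tasks.classification import CNN_LSTM_Model
train_y = ['class1']
model = CNN_LSTM_Model(bert_embedding)
model.fit(train_x, train_y)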
Transformer Embedding¶
TransformerEmbedding is based on bert4keras. The embedding itself is wrapped in our simple embedding interface so that it can be used like any other embedding. TransformerEmbedding supports the following models:
Model | Author | Link
---|---|---
BERT | Google |
ALBERT | Google |
ALBERT | brightmart |
RoBERTa | brightmart |
RoBERTa | 哈工大 (HIT) |
RoBERTa | 苏剑林 (Su Jianlin) |
NEZHA | Huawei | https://github.com/huawei-noah/Pretrained-Language-Model/tree/master/NEZHA
Note
When using a pre-trained embedding, remember to use the same tokenization tool as the embedding model; this lets you access the full power of the embedding.
- kashgari.embeddings.TransformerEmbedding.__init__(self, vocab_path, config_path, checkpoint_path, model_type='bert', **kwargs)¶
- Parameters
Example Usage - Text Classification¶
Let’s run a text classification model with BERT.
sentences = [
"Jim Henson was a puppeteer.",
"This here's an example of using the BERT tokenizer.",
"Why did the chicken cross the road?"
]
labels = [
"class1",
"class2",
"class1"
]
# ------------ Load Bert Embedding ------------
import os
from kashgari.embeddings import TransformerEmbedding
from kashgari.tokenizers import BertTokenizer
# Setup paths
model_folder = '/xxx/xxx/albert_base'
checkpoint_path = os.path.join(model_folder, 'model.ckpt-best')
config_path = os.path.join(model_folder, 'albert_config.json')
vocab_path = os.path.join(model_folder, 'vocab_chinese.txt')
tokenizer = BertTokenizer.load_from_vocab_file(vocab_path)
embed = TransformerEmbedding(vocab_path, config_path, checkpoint_path,
model_type='albert')
sentences_tokenized = [tokenizer.tokenize(s) for s in sentences]
"""
The sentences will become tokenized into:
[
['jim', 'henson', 'was', 'a', 'puppet', '##eer', '.'],
['this', 'here', "'", 's', 'an', 'example', 'of', 'using', 'the', 'bert', 'token', '##izer', '.'],
['why', 'did', 'the', 'chicken', 'cross', 'the', 'road', '?']
]
"""
train_x, train_y = sentences_tokenized[:2], labels[:2]
validate_x, validate_y = sentences_tokenized[2:], labels[2:]
# ------------ Build Model Start ------------
from kashgari.tasks.classification import CNN_LSTM_Model
model = CNN_LSTM_Model(embed)
# ------------ Build Model End ------------
model.fit(
train_x, train_y,
validate_x, validate_y,
epochs=3,
batch_size=32
)
# save model
model.save('path/to/save/model/to')
Tensorflow Serving¶
from kashgari.tasks.classification import BiGRU_Model
from kashgari.corpus import SMP2018ECDTCorpus
from kashgari import utils
train_x, train_y = SMP2018ECDTCorpus.load_data()
model = BiGRU_Model()
model.fit(train_x, train_y)
# Save model
utils.convert_to_saved_model(model,
model_path="saved_model/bgru",
version=1)
Then run tensorflow-serving.
docker run -t --rm -p 8501:8501 -v "<path_to>/saved_model:/models/" -e MODEL_NAME=bgru tensorflow/serving
Load the processors from the model, then predict through the serving endpoint. We need to check the model's input keys first.
import requests
res = requests.get("http://localhost:8501/v1/models/bgru/metadata")
inputs = res.json()['metadata']['signature_def']['signature_def']['serving_default']['inputs']
input_sample_keys = list(inputs.keys())
print(input_sample_keys)
# ['Input-Token', 'Input-Segment']
If we have only one input key, i.e. we are not using a BERT-like embedding, we need to pass JSON in this format to the predict endpoint.
{
"instances": [
[2, 1, 1, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[2, 9, 41, 459, 1, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0]
]
}
Here is the code.
import requests
import numpy as np
from kashgari.processors import load_processors_from_model
text_processor, label_processor = load_processors_from_model('/Users/brikerman/Desktop/tf-serving/1603683152')
samples = [
['hello', 'world'],
['你', '好', '世', '界']
]
tensor = text_processor.transform(samples)
instances = [i.tolist() for i in tensor]
# predict
r = requests.post("http://localhost:8501/v1/models/bgru:predict", json={"instances": instances})
predictions = r.json()['predictions']
# Convert result back to labels
labels = label_processor.inverse_transform(np.array(predictions).argmax(-1))
print(labels)
If we are using BERT, then we need to handle multiple input fields; for example, we get the two keys ['Input-Token', 'Input-Segment'] from the metadata endpoint. Then we need to pass JSON in this format.
[
{
"Input-Token": [2, 1, 1, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
"Input-Segment": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
},
{
"Input-Token": [2, 9, 41, 459, 1, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0],
"Input-Segment": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
}
]
Here is the code.
import requests
import numpy as np
from kashgari.processors import load_processors_from_model
text_processor, label_processor = load_processors_from_model('/Users/brikerman/Desktop/tf-serving/1603683152')
samples = [
['hello', 'world'],
['你', '好', '世', '界']
]
tensor = text_processor.transform(samples)
instances = [{
"Input-Token": i.tolist(),
"Input-Segment": np.zeros(i.shape).tolist()
} for i in tensor]
# predict
r = requests.post("http://localhost:8501/v1/models/bgru:predict", json={"instances": instances})
predictions = r.json()['predictions']
# Convert result back to labels
labels = label_processor.inverse_transform(np.array(predictions).argmax(-1))
print(labels)
Corpus¶
ChineseDailyNerCorpus¶
- class kashgari.corpus.ChineseDailyNerCorpus[source]¶
Bases:
object
Chinese Daily News NER Corpus: https://github.com/zjy-ucas/ChineseNER/
Example
>>> from kashgari.corpus import ChineseDailyNerCorpus
>>> train_x, train_y = ChineseDailyNerCorpus.load_data('train')
>>> test_x, test_y = ChineseDailyNerCorpus.load_data('test')
>>> valid_x, valid_y = ChineseDailyNerCorpus.load_data('valid')
>>> print(train_x)
[['海', '钓', '比', '赛', '地', '点', '在', '厦', '门', ...], ...]
>>> print(train_y)
[['O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'I-LOC', ...], ...]
SMP2018ECDTCorpus¶
- class kashgari.corpus.SMP2018ECDTCorpus[source]¶
Bases:
object
https://worksheets.codalab.org/worksheets/0x27203f932f8341b79841d50ce0fd684f/
This Chinese human-computer dialogue dataset was released for task 1 of the Evaluation of Chinese Human-Computer Dialogue Technology (SMP2018-ECDT) and is provided by the iFLYTEK Corporation.
Sample:
 | label | query
---|---|---
0 | weather | 今天东莞天气如何
1 | map | 从观音桥到重庆市图书馆怎么走
2 | cookbook | 鸭蛋怎么腌?
3 | health | 怎么治疗牛皮癣
4 | chat | 唠什么
Example
>>> from kashgari.corpus import SMP2018ECDTCorpus
>>> train_x, train_y = SMP2018ECDTCorpus.load_data('train')
>>> test_x, test_y = SMP2018ECDTCorpus.load_data('test')
>>> valid_x, valid_y = SMP2018ECDTCorpus.load_data('valid')
>>> print(train_x)
[['听', '新', '闻', '。'], ['电', '视', '台', '在', '播', '什', '么'], ...]
>>> print(train_y)
['news', 'epg', ...]
JigsawToxicCommentCorpus¶
- class kashgari.corpus.JigsawToxicCommentCorpus(corpus_train_csv_path, sample_count=None, tokenizer=None)[source]¶
Bases:
object
Kaggle Toxic Comment Classification Challenge corpus
You need to download corpus from https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/overview to a folder. Then init a JigsawToxicCommentCorpus object with train.csv path.
Examples
>>> from kashgari.corpus import JigsawToxicCommentCorpus
>>> corpus = JigsawToxicCommentCorpus('<train.csv file-path>')
>>> train_x, train_y = corpus.load_data('train')
>>> test_x, test_y = corpus.load_data('test')
>>> print(train_x)
[['Please', 'stop', 'being', 'a', 'penis—', 'and', 'Grow', 'Up', 'Regards-'], ...]
>>> print(train_y)
[['obscene', 'insult'], ...]
- Parameters
- Return type
- __init__(corpus_train_csv_path, sample_count=None, tokenizer=None)[source]¶
Initialize self. See help(type(self)) for accurate signature.
Embeddings¶
BareEmbedding¶
- class kashgari.embeddings.BareEmbedding(embedding_size=100, **kwargs)[source]¶
Bases:
kashgari.embeddings.abc_embedding.ABCEmbedding
BareEmbedding is a randomly initialized tf.keras.layers.Embedding layer for text sequence embedding, which is the default embedding class for Kashgari models.
- Parameters
embedding_size (int) –
kwargs (Any) –
- __init__(embedding_size=100, **kwargs)[source]¶
- Parameters
embedding_size (int) – Dimension of the dense embedding.
kwargs (Any) – additional params
- embed(sentences, *, debug=False)¶
batch embed sentences
- Parameters
- Returns
vectorized sentence list
- Return type
- get_seq_length_from_corpus(generators, *, use_label=False, cover_rate=0.95)¶
Calculate proper sequence length according to the corpus
- Parameters
generators (List[kashgari.generators.CorpusGenerator]) –
use_label (bool) –
cover_rate (float) –
- Return type
Returns:
- setup_text_processor(processor)¶
- Parameters
processor (kashgari.processors.abc_processor.ABCProcessor) –
- Return type
WordEmbedding¶
- class kashgari.embeddings.WordEmbedding(w2v_path, *, w2v_kwargs=None, **kwargs)[source]¶
Bases:
kashgari.embeddings.abc_embedding.ABCEmbedding
- __init__(w2v_path, *, w2v_kwargs=None, **kwargs)[source]¶
- Parameters
w2v_path (str) – Word2Vec file path.
w2v_kwargs (Optional[Dict[str, Any]]) – params passed to the load_word2vec_format() function of gensim.models.KeyedVectors
kwargs (Any) – additional params
- embed(sentences, *, debug=False)¶
batch embed sentences
- Parameters
- Returns
vectorized sentence list
- Return type
- get_seq_length_from_corpus(generators, *, use_label=False, cover_rate=0.95)¶
Calculate proper sequence length according to the corpus
- Parameters
generators (List[kashgari.generators.CorpusGenerator]) –
use_label (bool) –
cover_rate (float) –
- Return type
Returns:
- setup_text_processor(processor)¶
- Parameters
processor (kashgari.processors.abc_processor.ABCProcessor) –
- Return type
TransformerEmbedding¶
- class kashgari.embeddings.TransformerEmbedding(vocab_path, config_path, checkpoint_path, model_type='bert', **kwargs)[source]¶
Bases:
kashgari.embeddings.abc_embedding.ABCEmbedding
TransformerEmbedding is based on bert4keras. The embedding itself is wrapped in our simple embedding interface so that it can be used like any other embedding.
- Parameters
- embed(sentences, *, debug=False)¶
batch embed sentences
- Parameters
- Returns
vectorized sentence list
- Return type
- get_seq_length_from_corpus(generators, *, use_label=False, cover_rate=0.95)¶
Calculate proper sequence length according to the corpus
- Parameters
generators (List[kashgari.generators.CorpusGenerator]) –
use_label (bool) –
cover_rate (float) –
- Return type
Returns:
- setup_text_processor(processor)¶
- Parameters
processor (kashgari.processors.abc_processor.ABCProcessor) –
- Return type
BertEmbedding¶
- class kashgari.embeddings.BertEmbedding(model_folder, **kwargs)[source]¶
Bases:
kashgari.embeddings.transformer_embedding.TransformerEmbedding
BertEmbedding is a simple wrapper class around TransformerEmbedding. If you need to load another kind of transformer-based language model, please use TransformerEmbedding.
- Parameters
model_folder (str) –
kwargs (Any) –
- __init__(model_folder, **kwargs)[source]¶
- Parameters
model_folder (str) – path of checkpoint folder.
kwargs (Any) – additional params
- build_embedding_model(*, vocab_size=None, force=False, **kwargs)¶
- embed(sentences, *, debug=False)¶
batch embed sentences
- Parameters
- Returns
vectorized sentence list
- Return type
- get_seq_length_from_corpus(generators, *, use_label=False, cover_rate=0.95)¶
Calculate proper sequence length according to the corpus
- Parameters
generators (List[kashgari.generators.CorpusGenerator]) –
use_label (bool) –
cover_rate (float) –
- Return type
Returns:
- load_embed_vocab()¶
Load vocab dict from embedding layer
- setup_text_processor(processor)¶
- Parameters
processor (kashgari.processors.abc_processor.ABCProcessor) –
- Return type
Classification Models¶
Bidirectional LSTM Model¶
- class kashgari.tasks.classification.BiLSTM_Model(embedding=None, *, sequence_length=None, hyper_parameters=None, multi_label=False, text_processor=None, label_processor=None)[source]¶
Bases:
kashgari.tasks.classification.abc_model.ABCClassificationModel
- Parameters
- __init__(embedding=None, *, sequence_length=None, hyper_parameters=None, multi_label=False, text_processor=None, label_processor=None)¶
- Parameters
embedding (Optional[kashgari.embeddings.abc_embedding.ABCEmbedding]) – embedding object
sequence_length (Optional[int]) – target sequence length
hyper_parameters (Optional[Dict[str, Dict[str, Any]]]) – hyper_parameters to overwrite
multi_label (bool) – is multi-label classification
text_processor (Optional[kashgari.processors.abc_processor.ABCProcessor]) – text processor
label_processor (Optional[kashgari.processors.abc_processor.ABCProcessor]) – label processor
- build_model(x_train, y_train)¶
Build Model with x_data and y_data
This function will set up a CorpusGenerator, then call ABCClassificationModel.build_model_gen() to prepare the processor and model.
- Parameters
- Return type
Returns:
- build_model_generator(generators)¶
- Parameters
generators (List[kashgari.generators.CorpusGenerator]) –
- Return type
- compile_model(loss=None, optimizer=None, metrics=None, **kwargs)¶
Configures the model for training. Call this to compile the model with a custom loss, optimizer and metrics; the arguments are forwarded to tf.keras.Model.compile().
Examples
>>> model = BiLSTM_Model()
# Build model with corpus
>>> model.build_model(train_x, train_y)
# Compile model with custom loss, optimizer and metrics
>>> model.compile_model(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
- Parameters
loss (Optional[Any]) – name of objective function, objective function or tf.keras.losses.Loss instance.
optimizer (Optional[Any]) – name of optimizer or optimizer instance.
metrics (object) – List of metrics to be evaluated by the model during training and testing.
**kwargs (Any) – additional params passed to tf.keras.Model.compile().
- Return type
- classmethod default_hyper_parameters()[source]¶
The default hyper parameters of the model dict, all models must implement this function.
You could easily change model’s hyper-parameters.
For example, change the LSTM unit in BiLSTM_Model from 128 to 32.
>>> from kashgari.tasks.classification import BiLSTM_Model
>>> hyper = BiLSTM_Model.default_hyper_parameters()
>>> print(hyper)
{'layer_bi_lstm': {'units': 128, 'return_sequences': False}, 'layer_output': {}}
>>> hyper['layer_bi_lstm']['units'] = 32
>>> model = BiLSTM_Model(hyper_parameters=hyper)
- evaluate(x_data, y_data, *, batch_size=32, digits=4, multi_label_threshold=0.5, truncating=False)¶
- fit(x_train, y_train, x_validate=None, y_validate=None, *, batch_size=64, epochs=5, callbacks=None, fit_kwargs=None)¶
Trains the model for a given number of epochs with given data set list.
- Parameters
x_train (List[List[str]]) – Array of train feature data (if the model has a single input), or tuple of train feature data array (if the model has multiple inputs)
y_train (Union[List[str], List[List[str]], List[Tuple[str]]]) – Array of train label data
x_validate (Optional[List[List[str]]]) – Array of validation feature data (if the model has a single input), or tuple of validation feature data array (if the model has multiple inputs)
y_validate (Optional[Union[List[str], List[List[str]], List[Tuple[str]]]]) – Array of validation label data
batch_size (int) – Number of samples per gradient update, default to 64.
epochs (int) – Number of epochs to train the model. An epoch is an iteration over the entire x and y data provided.
callbacks (Optional[List[tensorflow.python.keras.callbacks.Callback]]) – List of tf.keras.callbacks.Callback instances to apply during training. See tf.keras.callbacks.
fit_kwargs (Optional[Dict]) – additional arguments passed to tf.keras.Model.fit()
- Returns
A tf.keras.callback.History object. Its History.history attribute is a record of training loss values and metrics values at successive epochs, as well as validation loss values and validation metrics values (if applicable).
- Return type
tensorflow.python.keras.callbacks.History
- fit_generator(train_sample_gen, valid_sample_gen=None, *, batch_size=64, epochs=5, callbacks=None, fit_kwargs=None)¶
Trains the model for a given number of epochs with given data generator.
Data generator must be the subclass of CorpusGenerator
- Parameters
train_sample_gen (kashgari.generators.CorpusGenerator) – train data generator.
valid_sample_gen (Optional[kashgari.generators.CorpusGenerator]) – valid data generator.
batch_size (int) – Number of samples per gradient update, default to 64.
epochs (int) – Number of epochs to train the model. An epoch is an iteration over the entire x and y data provided.
callbacks (Optional[List[tensorflow.python.keras.callbacks.Callback]]) – List of tf.keras.callbacks.Callback instances. List of callbacks to apply during training. See tf.keras.callbacks.
fit_kwargs (Optional[Dict]) – fit_kwargs: additional arguments passed to
tf.keras.Model.fit()
- Returns
A tf.keras.callback.History object. Its History.history attribute is a record of training loss values and metrics values at successive epochs, as well as validation loss values and validation metrics values (if applicable).
- Return type
tensorflow.python.keras.callbacks.History
- classmethod load_model(model_path)¶
- Parameters
model_path (str) –
- Return type
Union[ABCLabelingModel, ABCClassificationModel]
- predict(x_data, *, batch_size=32, truncating=False, multi_label_threshold=0.5, predict_kwargs=None)¶
Generates output predictions for the input samples.
Computation is done in batches.
- Parameters
x_data (List[List[str]]) – The input data, as a Numpy array (or list of Numpy arrays if the model has multiple inputs).
batch_size (int) – Integer. If unspecified, it will default to 32.
truncating (bool) – remove values from sequences larger than model.embedding.sequence_length
multi_label_threshold (float) –
predict_kwargs (Optional[Dict]) – arguments passed to the predict() function of tf.keras.Model
- Returns
array(s) of predictions.
- Return type
- to_dict()¶
- Return type
Dict
Bidirectional GRU Model¶
- class kashgari.tasks.classification.BiGRU_Model(embedding=None, *, sequence_length=None, hyper_parameters=None, multi_label=False, text_processor=None, label_processor=None)[source]¶
Bases:
kashgari.tasks.classification.abc_model.ABCClassificationModel
- Parameters
- __init__(embedding=None, *, sequence_length=None, hyper_parameters=None, multi_label=False, text_processor=None, label_processor=None)¶
- Parameters
embedding (Optional[kashgari.embeddings.abc_embedding.ABCEmbedding]) – embedding object
sequence_length (Optional[int]) – target sequence length
hyper_parameters (Optional[Dict[str, Dict[str, Any]]]) – hyper_parameters to overwrite
multi_label (bool) – is multi-label classification
text_processor (Optional[kashgari.processors.abc_processor.ABCProcessor]) – text processor
label_processor (Optional[kashgari.processors.abc_processor.ABCProcessor]) – label processor
- build_model(x_train, y_train)¶
Build Model with x_data and y_data
This function will set up a CorpusGenerator, then call ABCClassificationModel.build_model_gen() to prepare the processor and model.
- Parameters
- Return type
Returns:
- build_model_generator(generators)¶
- Parameters
generators (List[kashgari.generators.CorpusGenerator]) –
- Return type
- compile_model(loss=None, optimizer=None, metrics=None, **kwargs)¶
Configures the model for training. Call this to compile the model with a custom loss, optimizer and metrics; the arguments are forwarded to tf.keras.Model.compile().
Examples
>>> model = BiLSTM_Model()
# Build model with corpus
>>> model.build_model(train_x, train_y)
# Compile model with custom loss, optimizer and metrics
>>> model.compile_model(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
- Parameters
loss (Optional[Any]) – name of objective function, objective function or tf.keras.losses.Loss instance.
optimizer (Optional[Any]) – name of optimizer or optimizer instance.
metrics (object) – List of metrics to be evaluated by the model during training and testing.
**kwargs (Any) – additional params passed to tf.keras.Model.compile().
- Return type
- classmethod default_hyper_parameters()[source]¶
The default hyper parameters of the model dict, all models must implement this function.
You could easily change model’s hyper-parameters.
For example, change the LSTM unit in BiLSTM_Model from 128 to 32.
>>> from kashgari.tasks.classification import BiLSTM_Model
>>> hyper = BiLSTM_Model.default_hyper_parameters()
>>> print(hyper)
{'layer_bi_lstm': {'units': 128, 'return_sequences': False}, 'layer_output': {}}
>>> hyper['layer_bi_lstm']['units'] = 32
>>> model = BiLSTM_Model(hyper_parameters=hyper)
- evaluate(x_data, y_data, *, batch_size=32, digits=4, multi_label_threshold=0.5, truncating=False)¶
- fit(x_train, y_train, x_validate=None, y_validate=None, *, batch_size=64, epochs=5, callbacks=None, fit_kwargs=None)¶
Trains the model for a given number of epochs with given data set list.
- Parameters
x_train (List[List[str]]) – Array of train feature data (if the model has a single input), or tuple of train feature data array (if the model has multiple inputs)
y_train (Union[List[str], List[List[str]], List[Tuple[str]]]) – Array of train label data
x_validate (Optional[List[List[str]]]) – Array of validation feature data (if the model has a single input), or tuple of validation feature data array (if the model has multiple inputs)
y_validate (Optional[Union[List[str], List[List[str]], List[Tuple[str]]]]) – Array of validation label data
batch_size (int) – Number of samples per gradient update, default to 64.
epochs (int) – Number of epochs to train the model. An epoch is an iteration over the entire x and y data provided.
callbacks (Optional[List[tensorflow.python.keras.callbacks.Callback]]) – List of tf.keras.callbacks.Callback instances to apply during training. See tf.keras.callbacks.
fit_kwargs (Optional[Dict]) – additional arguments passed to tf.keras.Model.fit()
- Returns
A tf.keras.callback.History object. Its History.history attribute is a record of training loss values and metrics values at successive epochs, as well as validation loss values and validation metrics values (if applicable).
- Return type
tensorflow.python.keras.callbacks.History
- fit_generator(train_sample_gen, valid_sample_gen=None, *, batch_size=64, epochs=5, callbacks=None, fit_kwargs=None)¶
Trains the model for a given number of epochs with given data generator.
Data generator must be the subclass of CorpusGenerator
- Parameters
train_sample_gen (kashgari.generators.CorpusGenerator) – train data generator.
valid_sample_gen (Optional[kashgari.generators.CorpusGenerator]) – valid data generator.
batch_size (int) – Number of samples per gradient update, default to 64.
epochs (int) – Number of epochs to train the model. An epoch is an iteration over the entire x and y data provided.
callbacks (Optional[List[tensorflow.python.keras.callbacks.Callback]]) – List of tf.keras.callbacks.Callback instances. List of callbacks to apply during training. See tf.keras.callbacks.
fit_kwargs (Optional[Dict]) – fit_kwargs: additional arguments passed to
tf.keras.Model.fit()
- Returns
A tf.keras.callback.History object. Its History.history attribute is a record of training loss values and metrics values at successive epochs, as well as validation loss values and validation metrics values (if applicable).
- Return type
tensorflow.python.keras.callbacks.History
- classmethod load_model(model_path)¶
- Parameters
model_path (str) –
- Return type
Union[ABCLabelingModel, ABCClassificationModel]
- predict(x_data, *, batch_size=32, truncating=False, multi_label_threshold=0.5, predict_kwargs=None)¶
Generates output predictions for the input samples.
Computation is done in batches.
- Parameters
x_data (List[List[str]]) – The input data, as a Numpy array (or list of Numpy arrays if the model has multiple inputs).
batch_size (int) – Integer. If unspecified, it will default to 32.
truncating (bool) – remove values from sequences larger than model.embedding.sequence_length
multi_label_threshold (float) –
predict_kwargs (Optional[Dict]) – arguments passed to the predict() function of tf.keras.Model
- Returns
array(s) of predictions.
- Return type
- to_dict()¶
- Return type
Dict
CNN Model¶
- class kashgari.tasks.classification.CNN_Model(embedding=None, *, sequence_length=None, hyper_parameters=None, multi_label=False, text_processor=None, label_processor=None)[source]¶
Bases:
kashgari.tasks.classification.abc_model.ABCClassificationModel
- Parameters
- __init__(embedding=None, *, sequence_length=None, hyper_parameters=None, multi_label=False, text_processor=None, label_processor=None)¶
- Parameters
embedding (Optional[kashgari.embeddings.abc_embedding.ABCEmbedding]) – embedding object
sequence_length (Optional[int]) – target sequence length
hyper_parameters (Optional[Dict[str, Dict[str, Any]]]) – hyper_parameters to overwrite
multi_label (bool) – is multi-label classification
text_processor (Optional[kashgari.processors.abc_processor.ABCProcessor]) – text processor
label_processor (Optional[kashgari.processors.abc_processor.ABCProcessor]) – label processor
- build_model(x_train, y_train)¶
Build Model with x_data and y_data
This function will set up a CorpusGenerator, then call ABCClassificationModel.build_model_gen() to prepare the processor and model.
- Parameters
- Return type
Returns:
- build_model_generator(generators)¶
- Parameters
generators (List[kashgari.generators.CorpusGenerator]) –
- Return type
- compile_model(loss=None, optimizer=None, metrics=None, **kwargs)¶
Configures the model for training. Call this to compile the model with a custom loss, optimizer and metrics; the arguments are forwarded to tf.keras.Model.compile().
Examples
>>> model = BiLSTM_Model()
# Build model with corpus
>>> model.build_model(train_x, train_y)
# Compile model with custom loss, optimizer and metrics
>>> model.compile_model(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
- Parameters
loss (Optional[Any]) – name of objective function, objective function or tf.keras.losses.Loss instance.
optimizer (Optional[Any]) – name of optimizer or optimizer instance.
metrics (object) – List of metrics to be evaluated by the model during training and testing.
**kwargs (Any) – additional params passed to tf.keras.Model.compile().
- Return type
- classmethod default_hyper_parameters()[source]¶
The default hyper parameters of the model dict, all models must implement this function.
You could easily change model’s hyper-parameters.
For example, change the LSTM unit in BiLSTM_Model from 128 to 32.
>>> from kashgari.tasks.classification import BiLSTM_Model
>>> hyper = BiLSTM_Model.default_hyper_parameters()
>>> print(hyper)
{'layer_bi_lstm': {'units': 128, 'return_sequences': False}, 'layer_output': {}}
>>> hyper['layer_bi_lstm']['units'] = 32
>>> model = BiLSTM_Model(hyper_parameters=hyper)
- evaluate(x_data, y_data, *, batch_size=32, digits=4, multi_label_threshold=0.5, truncating=False)¶
- fit(x_train, y_train, x_validate=None, y_validate=None, *, batch_size=64, epochs=5, callbacks=None, fit_kwargs=None)¶
Trains the model for a given number of epochs with given data set list.
- Parameters
x_train (List[List[str]]) – Array of train feature data (if the model has a single input), or tuple of train feature data array (if the model has multiple inputs)
y_train (Union[List[str], List[List[str]], List[Tuple[str]]]) – Array of train label data
x_validate (Optional[List[List[str]]]) – Array of validation feature data (if the model has a single input), or tuple of validation feature data array (if the model has multiple inputs)
y_validate (Optional[Union[List[str], List[List[str]], List[Tuple[str]]]]) – Array of validation label data
batch_size (int) – Number of samples per gradient update, default to 64.
epochs (int) – Number of epochs to train the model. An epoch is an iteration over the entire x and y data provided.
callbacks (Optional[List[tensorflow.python.keras.callbacks.Callback]]) – List of tf.keras.callbacks.Callback instances to apply during training. See tf.keras.callbacks.
fit_kwargs (Optional[Dict]) – additional arguments passed to tf.keras.Model.fit()
- Returns
A tf.keras.callback.History object. Its History.history attribute is a record of training loss values and metrics values at successive epochs, as well as validation loss values and validation metrics values (if applicable).
- Return type
tensorflow.python.keras.callbacks.History
- fit_generator(train_sample_gen, valid_sample_gen=None, *, batch_size=64, epochs=5, callbacks=None, fit_kwargs=None)¶
Trains the model for a given number of epochs with given data generator.
Data generator must be the subclass of CorpusGenerator
- Parameters
train_sample_gen (kashgari.generators.CorpusGenerator) – train data generator.
valid_sample_gen (Optional[kashgari.generators.CorpusGenerator]) – valid data generator.
batch_size (int) – Number of samples per gradient update, default to 64.
epochs (int) – Number of epochs to train the model. An epoch is an iteration over the entire x and y data provided.
callbacks (Optional[List[tensorflow.python.keras.callbacks.Callback]]) – List of tf.keras.callbacks.Callback instances. List of callbacks to apply during training. See tf.keras.callbacks.
fit_kwargs (Optional[Dict]) – fit_kwargs: additional arguments passed to
tf.keras.Model.fit()
- Returns
A tf.keras.callback.History object. Its History.history attribute is a record of training loss values and metrics values at successive epochs, as well as validation loss values and validation metrics values (if applicable).
- Return type
tensorflow.python.keras.callbacks.History
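The sketch below shows one plausible way to drive fit_generator; it assumes CorpusGenerator can simply wrap parallel lists of tokenized samples and labels, and that a custom subclass would be used for corpora that do not fit in memory.
# Hedged sketch: CorpusGenerator here just wraps in-memory lists; subclass it
# for data streamed from disk.
from kashgari.generators import CorpusGenerator
from kashgari.tasks.classification import BiLSTM_Model

train_x = [['all', 'work', 'and', 'no', 'play'],
           ['makes', 'jack', 'a', 'dull', 'boy'],
           ['kashgari', 'is', 'simple'],
           ['kashgari', 'is', 'powerful']]
train_y = ['proverb', 'proverb', 'statement', 'statement']

train_gen = CorpusGenerator(train_x, train_y)

model = BiLSTM_Model()
model.fit_generator(train_gen, batch_size=2, epochs=1)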
- classmethod load_model(model_path)¶
- Parameters
model_path (str) –
- Return type
Union[ABCLabelingModel, ABCClassificationModel]
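Saving and restoring might look like the sketch below; it continues the training sketch above and assumes model.save(path) is the counterpart of the documented load_model(path) classmethod.
# Hedged sketch: `model` is the trained classifier from the sketch above;
# `save` is assumed to be the counterpart of `load_model`.
from kashgari.tasks.classification import BiLSTM_Model

model.save('saved_classification_model')
loaded_model = BiLSTM_Model.load_model('saved_classification_model')
print(loaded_model.predict([['this', 'is', 'a', 'test']]))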
- predict(x_data, *, batch_size=32, truncating=False, multi_label_threshold=0.5, predict_kwargs=None)¶
Generates output predictions for the input samples.
Computation is done in batches.
- Parameters
x_data (List[List[str]]) – The input data, as a Numpy array (or list of Numpy arrays if the model has multiple inputs).
batch_size (int) – Integer. If unspecified, it will default to 32.
truncating (bool) – remove values from sequences larger than model.embedding.sequence_length
multi_label_threshold (float) – probability threshold for multi-label classification outputs, defaults to 0.5.
predict_kwargs (Optional[Dict]) – arguments passed to the predict() function of tf.keras.Model.
- Returns
array(s) of predictions.
- Return type
- to_dict()¶
- Return type
Dict
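A quick usage sketch for predict() on a trained classification model; inputs are pre-tokenized token lists, and `model` is assumed to be trained as in the sketches above.
# Hedged sketch: `model` is a trained classification model from the sketches above.
samples = [['Kashgari', 'makes', 'NLP', 'easy'],
           ['今天', '天气', '不错']]
predictions = model.predict(samples, batch_size=32)
for tokens, label in zip(samples, predictions):
    print(' '.join(tokens), '->', label)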
Labeling Models¶
Bidirectional LSTM Model¶
- class kashgari.tasks.labeling.BiLSTM_Model(embedding=None, sequence_length=None, hyper_parameters=None)[source]¶
Bases:
kashgari.tasks.labeling.abc_model.ABCLabelingModel
- Parameters
- __init__(embedding=None, sequence_length=None, hyper_parameters=None)¶
- build_model(x_data, y_data)¶
Build model with x_data and y_data.
This function will set up a CorpusGenerator, then call ABCClassificationModel.build_model_gen() to prepare the processor and model.
- build_model_generator(generators)¶
- Parameters
generators (List[kashgari.generators.CorpusGenerator]) –
- Return type
- compile_model(loss=None, optimizer=None, metrics=None, **kwargs)¶
Configures the model for training. Calls tf.keras.Model.compile() to compile the model with a custom loss, optimizer and metrics.
Examples
>>> model = BiLSTM_Model()
>>> # Build model with corpus
>>> model.build_model(train_x, train_y)
>>> # Compile model with custom loss, optimizer and metrics
>>> model.compile_model(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
- Parameters
loss (Optional[Any]) – name of objective function, objective function or tf.keras.losses.Loss instance.
optimizer (Optional[Any]) – name of optimizer or optimizer instance.
metrics (object) – List of metrics to be evaluated by the model during training and testing.
kwargs (Any) – additional params passed to tf.keras.Model.compile().
- Return type
- classmethod default_hyper_parameters()[source]¶
The default hyper parameters of the model as a dict. All models must implement this function.
You could easily change the model's hyper-parameters.
For example, change the LSTM units in BiLSTM_Model from 128 to 32.
>>> from kashgari.tasks.classification import BiLSTM_Model
>>> hyper = BiLSTM_Model.default_hyper_parameters()
>>> print(hyper)
{'layer_bi_lstm': {'units': 128, 'return_sequences': False}, 'layer_output': {}}
>>> hyper['layer_bi_lstm']['units'] = 32
>>> model = BiLSTM_Model(hyper_parameters=hyper)
- evaluate(x_data, y_data, batch_size=32, digits=4, truncating=False)¶
Build a text report showing the main labeling metrics.
- Parameters
- Returns
A report dict
- Return type
Dict
Example
>>> from kashgari.tasks.labeling import BiGRU_Model
>>> model = BiGRU_Model()
>>> model.fit(train_x, train_y, valid_x, valid_y)
>>> report = model.evaluate(test_x, test_y)
           precision    recall  f1-score   support
      ORG     0.0665    0.1108    0.0831       984
      LOC     0.1870    0.2086    0.1972      1951
      PER     0.1685    0.0882    0.1158       884
micro avg     0.1384    0.1555    0.1465      3819
macro avg     0.1516    0.1555    0.1490      3819
>>> print(report)
{
    'f1-score': 0.14895159934887792,
    'precision': 0.1516294012813676,
    'recall': 0.15553809897879026,
    'support': 3819,
    'detail': {
        'LOC': {'f1-score': 0.19718992248062014, 'precision': 0.18695452457510336, 'recall': 0.20861096873398258, 'support': 1951},
        'ORG': {'f1-score': 0.08307926829268293, 'precision': 0.06646341463414634, 'recall': 0.11077235772357724, 'support': 984},
        'PER': {'f1-score': 0.11581291759465479, 'precision': 0.16846652267818574, 'recall': 0.08823529411764706, 'support': 884}
    }
}
- fit(x_train, y_train, x_validate=None, y_validate=None, batch_size=64, epochs=5, callbacks=None, fit_kwargs=None)¶
Trains the model for a given number of epochs with given data set list.
- Parameters
x_train (List[List[str]]) – Array of train feature data (if the model has a single input), or tuple of train feature data array (if the model has multiple inputs)
y_train (List[List[str]]) – Array of train label data
x_validate (Optional[List[List[str]]]) – Array of validation feature data (if the model has a single input), or tuple of validation feature data array (if the model has multiple inputs)
y_validate (Optional[List[List[str]]]) – Array of validation label data
batch_size (int) – Number of samples per gradient update, default to 64.
epochs (int) – Number of epochs to train the model. An epoch is an iteration over the entire x and y data provided.
callbacks (Optional[List[tensorflow.python.keras.callbacks.Callback]]) – List of tf.keras.callbacks.Callback instances to apply during training. See tf.keras.callbacks.
fit_kwargs (Optional[Dict]) – additional arguments passed to tf.keras.Model.fit().
- Returns
A tf.keras.callbacks.History object. Its History.history attribute is a record of training loss values and metrics values at successive epochs, as well as validation loss values and validation metrics values (if applicable).
- Return type
tensorflow.python.keras.callbacks.History
- fit_generator(train_sample_gen, valid_sample_gen=None, batch_size=64, epochs=5, callbacks=None, fit_kwargs=None)¶
Trains the model for a given number of epochs with given data generator.
Data generator must be a subclass of CorpusGenerator.
- Parameters
train_sample_gen (kashgari.generators.CorpusGenerator) – train data generator.
valid_sample_gen (Optional[kashgari.generators.CorpusGenerator]) – valid data generator.
batch_size (int) – Number of samples per gradient update, default to 64.
epochs (int) – Number of epochs to train the model. An epoch is an iteration over the entire x and y data provided.
callbacks (Optional[List[tensorflow.python.keras.callbacks.Callback]]) – List of tf.keras.callbacks.Callback instances to apply during training. See tf.keras.callbacks.
fit_kwargs (Optional[Dict]) – additional arguments passed to tf.keras.Model.fit().
- Returns
A tf.keras.callbacks.History object. Its History.history attribute is a record of training loss values and metrics values at successive epochs, as well as validation loss values and validation metrics values (if applicable).
- Return type
tensorflow.python.keras.callbacks.History
- classmethod load_model(model_path)¶
- Parameters
model_path (str) –
- Return type
Union[ABCLabelingModel, ABCClassificationModel]
- predict(x_data, *, batch_size=32, truncating=False, predict_kwargs=None)¶
Generates output predictions for the input samples.
Computation is done in batches.
- Parameters
x_data (List[List[str]]) – The input data, as a Numpy array (or list of Numpy arrays if the model has multiple inputs).
batch_size (int) – Integer. If unspecified, it will default to 32.
truncating (bool) – remove values from sequences larger than model.embedding.sequence_length
predict_kwargs (Optional[Dict]) – arguments passed to
tf.keras.Model.predict()
- Returns
array(s) of predictions.
- Return type
List[List[str]]
- predict_entities(x_data, batch_size=32, join_chunk=' ', truncating=False, predict_kwargs=None)¶
Gets entities from sequence.
- Parameters
x_data (List[List[str]]) – The input data, as a Numpy array (or list of Numpy arrays if the model has multiple inputs).
batch_size (int) – Integer. If unspecified, it will default to 32.
truncating (bool) – remove values from sequences larger than model.embedding.sequence_length
join_chunk (str) – separator used to join the tokens of each entity chunk, or False to keep the tokens as a list.
predict_kwargs (Optional[Dict]) – arguments passed to
tf.keras.Model.predict()
- Returns
list of entities.
- Return type
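To make the labeling API concrete, here is a hedged end-to-end NER sketch; it assumes the bundled ChineseDailyNerCorpus helper and its split names ('train', 'validate', 'test') work in your environment.
# Hedged NER sketch (assumption: ChineseDailyNerCorpus downloads on first use
# and exposes 'train' / 'validate' / 'test' splits).
from kashgari.corpus import ChineseDailyNerCorpus
from kashgari.tasks.labeling import BiLSTM_Model

train_x, train_y = ChineseDailyNerCorpus.load_data('train')
valid_x, valid_y = ChineseDailyNerCorpus.load_data('validate')
test_x, test_y = ChineseDailyNerCorpus.load_data('test')

model = BiLSTM_Model()
model.fit(train_x, train_y, valid_x, valid_y, epochs=1)

# Per-token tag sequences
print(model.predict(test_x[:2]))

# Grouped entities; join_chunk='' concatenates the characters of each chunk,
# which reads better than the default space separator for Chinese text.
print(model.predict_entities(test_x[:2], join_chunk=''))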
Bidirectional GRU Model¶
- class kashgari.tasks.labeling.BiGRU_Model(embedding=None, sequence_length=None, hyper_parameters=None)[source]¶
Bases:
kashgari.tasks.labeling.abc_model.ABCLabelingModel
- Parameters
- __init__(embedding=None, sequence_length=None, hyper_parameters=None)¶
- build_model(x_data, y_data)¶
Build model with x_data and y_data.
This function will set up a CorpusGenerator, then call ABCClassificationModel.build_model_gen() to prepare the processor and model.
- build_model_generator(generators)¶
- Parameters
generators (List[kashgari.generators.CorpusGenerator]) –
- Return type
- compile_model(loss=None, optimizer=None, metrics=None, **kwargs)¶
Configures the model for training. Calls tf.keras.Model.compile() to compile the model with a custom loss, optimizer and metrics.
Examples
>>> model = BiLSTM_Model()
>>> # Build model with corpus
>>> model.build_model(train_x, train_y)
>>> # Compile model with custom loss, optimizer and metrics
>>> model.compile_model(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
- Parameters
loss (Optional[Any]) – name of objective function, objective function or tf.keras.losses.Loss instance.
optimizer (Optional[Any]) – name of optimizer or optimizer instance.
metrics (object) – List of metrics to be evaluated by the model during training and testing.
kwargs (Any) – additional params passed to tf.keras.Model.compile().
- Return type
- classmethod default_hyper_parameters()[source]¶
The default hyper parameters of the model as a dict. All models must implement this function.
You could easily change the model's hyper-parameters.
For example, change the LSTM units in BiLSTM_Model from 128 to 32.
>>> from kashgari.tasks.classification import BiLSTM_Model
>>> hyper = BiLSTM_Model.default_hyper_parameters()
>>> print(hyper)
{'layer_bi_lstm': {'units': 128, 'return_sequences': False}, 'layer_output': {}}
>>> hyper['layer_bi_lstm']['units'] = 32
>>> model = BiLSTM_Model(hyper_parameters=hyper)
- evaluate(x_data, y_data, batch_size=32, digits=4, truncating=False)¶
Build a text report showing the main labeling metrics.
- Parameters
- Returns
A report dict
- Return type
Dict
Example
>>> from kashgari.tasks.labeling import BiGRU_Model
>>> model = BiGRU_Model()
>>> model.fit(train_x, train_y, valid_x, valid_y)
>>> report = model.evaluate(test_x, test_y)
           precision    recall  f1-score   support
      ORG     0.0665    0.1108    0.0831       984
      LOC     0.1870    0.2086    0.1972      1951
      PER     0.1685    0.0882    0.1158       884
micro avg     0.1384    0.1555    0.1465      3819
macro avg     0.1516    0.1555    0.1490      3819
>>> print(report)
{
    'f1-score': 0.14895159934887792,
    'precision': 0.1516294012813676,
    'recall': 0.15553809897879026,
    'support': 3819,
    'detail': {
        'LOC': {'f1-score': 0.19718992248062014, 'precision': 0.18695452457510336, 'recall': 0.20861096873398258, 'support': 1951},
        'ORG': {'f1-score': 0.08307926829268293, 'precision': 0.06646341463414634, 'recall': 0.11077235772357724, 'support': 984},
        'PER': {'f1-score': 0.11581291759465479, 'precision': 0.16846652267818574, 'recall': 0.08823529411764706, 'support': 884}
    }
}
- fit(x_train, y_train, x_validate=None, y_validate=None, batch_size=64, epochs=5, callbacks=None, fit_kwargs=None)¶
Trains the model for a given number of epochs with given data set list.
- Parameters
x_train (List[List[str]]) – Array of train feature data (if the model has a single input), or tuple of train feature data array (if the model has multiple inputs)
y_train (List[List[str]]) – Array of train label data
x_validate (Optional[List[List[str]]]) – Array of validation feature data (if the model has a single input), or tuple of validation feature data array (if the model has multiple inputs)
y_validate (Optional[List[List[str]]]) – Array of validation label data
batch_size (int) – Number of samples per gradient update, default to 64.
epochs (int) – Number of epochs to train the model. An epoch is an iteration over the entire x and y data provided.
callbacks (Optional[List[tensorflow.python.keras.callbacks.Callback]]) – List of tf.keras.callbacks.Callback instances to apply during training. See tf.keras.callbacks.
fit_kwargs (Optional[Dict]) – additional arguments passed to tf.keras.Model.fit().
- Returns
A tf.keras.callbacks.History object. Its History.history attribute is a record of training loss values and metrics values at successive epochs, as well as validation loss values and validation metrics values (if applicable).
- Return type
tensorflow.python.keras.callbacks.History
- fit_generator(train_sample_gen, valid_sample_gen=None, batch_size=64, epochs=5, callbacks=None, fit_kwargs=None)¶
Trains the model for a given number of epochs with given data generator.
Data generator must be a subclass of CorpusGenerator.
- Parameters
train_sample_gen (kashgari.generators.CorpusGenerator) – train data generator.
valid_sample_gen (Optional[kashgari.generators.CorpusGenerator]) – valid data generator.
batch_size (int) – Number of samples per gradient update, default to 64.
epochs (int) – Number of epochs to train the model. An epoch is an iteration over the entire x and y data provided.
callbacks (Optional[List[tensorflow.python.keras.callbacks.Callback]]) – List of tf.keras.callbacks.Callback instances to apply during training. See tf.keras.callbacks.
fit_kwargs (Optional[Dict]) – additional arguments passed to tf.keras.Model.fit().
- Returns
A tf.keras.callbacks.History object. Its History.history attribute is a record of training loss values and metrics values at successive epochs, as well as validation loss values and validation metrics values (if applicable).
- Return type
tensorflow.python.keras.callbacks.History
- classmethod load_model(model_path)¶
- Parameters
model_path (str) –
- Return type
Union[ABCLabelingModel, ABCClassificationModel]
- predict(x_data, *, batch_size=32, truncating=False, predict_kwargs=None)¶
Generates output predictions for the input samples.
Computation is done in batches.
- Parameters
x_data (List[List[str]]) – The input data, as a Numpy array (or list of Numpy arrays if the model has multiple inputs).
batch_size (int) – Integer. If unspecified, it will default to 32.
truncating (bool) – remove values from sequences larger than model.embedding.sequence_length
predict_kwargs (Optional[Dict]) – arguments passed to
tf.keras.Model.predict()
- Returns
array(s) of predictions.
- Return type
List[List[str]]
- predict_entities(x_data, batch_size=32, join_chunk=' ', truncating=False, predict_kwargs=None)¶
Gets entities from sequence.
- Parameters
x_data (List[List[str]]) – The input data, as a Numpy array (or list of Numpy arrays if the model has multiple inputs).
batch_size (int) – Integer. If unspecified, it will default to 32.
truncating (bool) – remove values from sequences larger than model.embedding.sequence_length
join_chunk (str) – separator used to join the tokens of each entity chunk, or False to keep the tokens as a list.
predict_kwargs (Optional[Dict]) – arguments passed to
tf.keras.Model.predict()
- Returns
list of entities.
- Return type
Bidirectional LSTM CRF Model¶
- class kashgari.tasks.labeling.BiLSTM_CRF_Model(embedding=None, sequence_length=None, hyper_parameters=None)[source]¶
Bases:
kashgari.tasks.labeling.abc_model.ABCLabelingModel
- Parameters
- __init__(embedding=None, sequence_length=None, hyper_parameters=None)¶
- build_model(x_data, y_data)¶
Build model with x_data and y_data.
This function will set up a CorpusGenerator, then call ABCClassificationModel.build_model_gen() to prepare the processor and model.
- build_model_generator(generators)¶
- Parameters
generators (List[kashgari.generators.CorpusGenerator]) –
- Return type
- compile_model(loss=None, optimizer=None, metrics=None, **kwargs)[source]¶
Configures the model for training. Calls tf.keras.Model.compile() to compile the model with a custom loss, optimizer and metrics.
Examples
>>> model = BiLSTM_Model()
>>> # Build model with corpus
>>> model.build_model(train_x, train_y)
>>> # Compile model with custom loss, optimizer and metrics
>>> model.compile_model(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
- Parameters
loss (Optional[Any]) – name of objective function, objective function or tf.keras.losses.Loss instance.
optimizer (Optional[Any]) – name of optimizer or optimizer instance.
metrics (object) – List of metrics to be evaluated by the model during training and testing.
kwargs (Any) – additional params passed to tf.keras.Model.compile().
- Return type
- classmethod default_hyper_parameters()[source]¶
The default hyper parameters of the model as a dict. All models must implement this function.
You could easily change the model's hyper-parameters.
For example, change the LSTM units in BiLSTM_Model from 128 to 32.
>>> from kashgari.tasks.classification import BiLSTM_Model
>>> hyper = BiLSTM_Model.default_hyper_parameters()
>>> print(hyper)
{'layer_bi_lstm': {'units': 128, 'return_sequences': False}, 'layer_output': {}}
>>> hyper['layer_bi_lstm']['units'] = 32
>>> model = BiLSTM_Model(hyper_parameters=hyper)
- evaluate(x_data, y_data, batch_size=32, digits=4, truncating=False)¶
Build a text report showing the main labeling metrics.
- Parameters
- Returns
A report dict
- Return type
Dict
Example
>>> from kashgari.tasks.labeling import BiGRU_Model
>>> model = BiGRU_Model()
>>> model.fit(train_x, train_y, valid_x, valid_y)
>>> report = model.evaluate(test_x, test_y)
           precision    recall  f1-score   support
      ORG     0.0665    0.1108    0.0831       984
      LOC     0.1870    0.2086    0.1972      1951
      PER     0.1685    0.0882    0.1158       884
micro avg     0.1384    0.1555    0.1465      3819
macro avg     0.1516    0.1555    0.1490      3819
>>> print(report)
{
    'f1-score': 0.14895159934887792,
    'precision': 0.1516294012813676,
    'recall': 0.15553809897879026,
    'support': 3819,
    'detail': {
        'LOC': {'f1-score': 0.19718992248062014, 'precision': 0.18695452457510336, 'recall': 0.20861096873398258, 'support': 1951},
        'ORG': {'f1-score': 0.08307926829268293, 'precision': 0.06646341463414634, 'recall': 0.11077235772357724, 'support': 984},
        'PER': {'f1-score': 0.11581291759465479, 'precision': 0.16846652267818574, 'recall': 0.08823529411764706, 'support': 884}
    }
}
- fit(x_train, y_train, x_validate=None, y_validate=None, batch_size=64, epochs=5, callbacks=None, fit_kwargs=None)¶
Trains the model for a given number of epochs with given data set list.
- Parameters
x_train (List[List[str]]) – Array of train feature data (if the model has a single input), or tuple of train feature data array (if the model has multiple inputs)
y_train (List[List[str]]) – Array of train label data
x_validate (Optional[List[List[str]]]) – Array of validation feature data (if the model has a single input), or tuple of validation feature data array (if the model has multiple inputs)
y_validate (Optional[List[List[str]]]) – Array of validation label data
batch_size (int) – Number of samples per gradient update, default to 64.
epochs (int) – Number of epochs to train the model. An epoch is an iteration over the entire x and y data provided.
callbacks (Optional[List[tensorflow.python.keras.callbacks.Callback]]) – List of tf.keras.callbacks.Callback instances to apply during training. See tf.keras.callbacks.
fit_kwargs (Optional[Dict]) – additional arguments passed to tf.keras.Model.fit().
- Returns
A tf.keras.callbacks.History object. Its History.history attribute is a record of training loss values and metrics values at successive epochs, as well as validation loss values and validation metrics values (if applicable).
- Return type
tensorflow.python.keras.callbacks.History
- fit_generator(train_sample_gen, valid_sample_gen=None, batch_size=64, epochs=5, callbacks=None, fit_kwargs=None)¶
Trains the model for a given number of epochs with given data generator.
Data generator must be a subclass of CorpusGenerator.
- Parameters
train_sample_gen (kashgari.generators.CorpusGenerator) – train data generator.
valid_sample_gen (Optional[kashgari.generators.CorpusGenerator]) – valid data generator.
batch_size (int) – Number of samples per gradient update, default to 64.
epochs (int) – Number of epochs to train the model. An epoch is an iteration over the entire x and y data provided.
callbacks (Optional[List[tensorflow.python.keras.callbacks.Callback]]) – List of tf.keras.callbacks.Callback instances to apply during training. See tf.keras.callbacks.
fit_kwargs (Optional[Dict]) – additional arguments passed to tf.keras.Model.fit().
- Returns
A tf.keras.callbacks.History object. Its History.history attribute is a record of training loss values and metrics values at successive epochs, as well as validation loss values and validation metrics values (if applicable).
- Return type
tensorflow.python.keras.callbacks.History
- classmethod load_model(model_path)¶
- Parameters
model_path (str) –
- Return type
Union[ABCLabelingModel, ABCClassificationModel]
- predict(x_data, *, batch_size=32, truncating=False, predict_kwargs=None)¶
Generates output predictions for the input samples.
Computation is done in batches.
- Parameters
x_data (List[List[str]]) – The input data, as a Numpy array (or list of Numpy arrays if the model has multiple inputs).
batch_size (int) – Integer. If unspecified, it will default to 32.
truncating (bool) – remove values from sequences larger than model.embedding.sequence_length
predict_kwargs (Optional[Dict]) – arguments passed to
tf.keras.Model.predict()
- Returns
array(s) of predictions.
- Return type
List[List[str]]
- predict_entities(x_data, batch_size=32, join_chunk=' ', truncating=False, predict_kwargs=None)¶
Gets entities from sequence.
- Parameters
x_data (List[List[str]]) – The input data, as a Numpy array (or list of Numpy arrays if the model has multiple inputs).
batch_size (int) – Integer. If unspecified, it will default to 32.
truncating (bool) – remove values from sequences larger than model.embedding.sequence_length
join_chunk (str) – separator used to join the tokens of each entity chunk, or False to keep the tokens as a list.
predict_kwargs (Optional[Dict]) – arguments passed to
tf.keras.Model.predict()
- Returns
list of entities.
- Return type
Bidirectional GRU CRF Model¶
- class kashgari.tasks.labeling.BiGRU_CRF_Model(embedding=None, sequence_length=None, hyper_parameters=None)[source]¶
Bases:
kashgari.tasks.labeling.abc_model.ABCLabelingModel
- Parameters
- __init__(embedding=None, sequence_length=None, hyper_parameters=None)¶
- build_model(x_data, y_data)¶
Build model with x_data and y_data.
This function will set up a CorpusGenerator, then call ABCClassificationModel.build_model_gen() to prepare the processor and model.
- build_model_generator(generators)¶
- Parameters
generators (List[kashgari.generators.CorpusGenerator]) –
- Return type
- compile_model(loss=None, optimizer=None, metrics=None, **kwargs)[source]¶
Configures the model for training. Calls tf.keras.Model.compile() to compile the model with a custom loss, optimizer and metrics.
Examples
>>> model = BiLSTM_Model()
>>> # Build model with corpus
>>> model.build_model(train_x, train_y)
>>> # Compile model with custom loss, optimizer and metrics
>>> model.compile_model(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
- Parameters
loss (Optional[Any]) – name of objective function, objective function or tf.keras.losses.Loss instance.
optimizer (Optional[Any]) – name of optimizer or optimizer instance.
metrics (object) – List of metrics to be evaluated by the model during training and testing.
kwargs (Any) – additional params passed to tf.keras.Model.compile().
- Return type
- classmethod default_hyper_parameters()[source]¶
The default hyper parameters of the model as a dict. All models must implement this function.
You could easily change the model's hyper-parameters.
For example, change the LSTM units in BiLSTM_Model from 128 to 32.
>>> from kashgari.tasks.classification import BiLSTM_Model
>>> hyper = BiLSTM_Model.default_hyper_parameters()
>>> print(hyper)
{'layer_bi_lstm': {'units': 128, 'return_sequences': False}, 'layer_output': {}}
>>> hyper['layer_bi_lstm']['units'] = 32
>>> model = BiLSTM_Model(hyper_parameters=hyper)
- evaluate(x_data, y_data, batch_size=32, digits=4, truncating=False)¶
Build a text report showing the main labeling metrics.
- Parameters
- Returns
A report dict
- Return type
Dict
Example
>>> from kashgari.tasks.labeling import BiGRU_Model
>>> model = BiGRU_Model()
>>> model.fit(train_x, train_y, valid_x, valid_y)
>>> report = model.evaluate(test_x, test_y)
           precision    recall  f1-score   support
      ORG     0.0665    0.1108    0.0831       984
      LOC     0.1870    0.2086    0.1972      1951
      PER     0.1685    0.0882    0.1158       884
micro avg     0.1384    0.1555    0.1465      3819
macro avg     0.1516    0.1555    0.1490      3819
>>> print(report)
{
    'f1-score': 0.14895159934887792,
    'precision': 0.1516294012813676,
    'recall': 0.15553809897879026,
    'support': 3819,
    'detail': {
        'LOC': {'f1-score': 0.19718992248062014, 'precision': 0.18695452457510336, 'recall': 0.20861096873398258, 'support': 1951},
        'ORG': {'f1-score': 0.08307926829268293, 'precision': 0.06646341463414634, 'recall': 0.11077235772357724, 'support': 984},
        'PER': {'f1-score': 0.11581291759465479, 'precision': 0.16846652267818574, 'recall': 0.08823529411764706, 'support': 884}
    }
}
- fit(x_train, y_train, x_validate=None, y_validate=None, batch_size=64, epochs=5, callbacks=None, fit_kwargs=None)¶
Trains the model for a given number of epochs with given data set list.
- Parameters
x_train (List[List[str]]) – Array of train feature data (if the model has a single input), or tuple of train feature data array (if the model has multiple inputs)
y_train (List[List[str]]) – Array of train label data
x_validate (Optional[List[List[str]]]) – Array of validation feature data (if the model has a single input), or tuple of validation feature data array (if the model has multiple inputs)
y_validate (Optional[List[List[str]]]) – Array of validation label data
batch_size (int) – Number of samples per gradient update, default to 64.
epochs (int) – Number of epochs to train the model. An epoch is an iteration over the entire x and y data provided.
callbacks (Optional[List[tensorflow.python.keras.callbacks.Callback]]) – List of tf.keras.callbacks.Callback instances to apply during training. See tf.keras.callbacks.
fit_kwargs (Optional[Dict]) – additional arguments passed to tf.keras.Model.fit().
- Returns
A tf.keras.callbacks.History object. Its History.history attribute is a record of training loss values and metrics values at successive epochs, as well as validation loss values and validation metrics values (if applicable).
- Return type
tensorflow.python.keras.callbacks.History
- fit_generator(train_sample_gen, valid_sample_gen=None, batch_size=64, epochs=5, callbacks=None, fit_kwargs=None)¶
Trains the model for a given number of epochs with given data generator.
Data generator must be a subclass of CorpusGenerator.
- Parameters
train_sample_gen (kashgari.generators.CorpusGenerator) – train data generator.
valid_sample_gen (Optional[kashgari.generators.CorpusGenerator]) – valid data generator.
batch_size (int) – Number of samples per gradient update, default to 64.
epochs (int) – Number of epochs to train the model. An epoch is an iteration over the entire x and y data provided.
callbacks (Optional[List[tensorflow.python.keras.callbacks.Callback]]) – List of tf.keras.callbacks.Callback instances to apply during training. See tf.keras.callbacks.
fit_kwargs (Optional[Dict]) – additional arguments passed to tf.keras.Model.fit().
- Returns
A tf.keras.callbacks.History object. Its History.history attribute is a record of training loss values and metrics values at successive epochs, as well as validation loss values and validation metrics values (if applicable).
- Return type
tensorflow.python.keras.callbacks.History
- classmethod load_model(model_path)¶
- Parameters
model_path (str) –
- Return type
Union[ABCLabelingModel, ABCClassificationModel]
- predict(x_data, *, batch_size=32, truncating=False, predict_kwargs=None)¶
Generates output predictions for the input samples.
Computation is done in batches.
- Parameters
x_data (List[List[str]]) – The input data, as a Numpy array (or list of Numpy arrays if the model has multiple inputs).
batch_size (int) – Integer. If unspecified, it will default to 32.
truncating (bool) – remove values from sequences larger than model.embedding.sequence_length
predict_kwargs (Optional[Dict]) – arguments passed to
tf.keras.Model.predict()
- Returns
array(s) of predictions.
- Return type
List[List[str]]
- predict_entities(x_data, batch_size=32, join_chunk=' ', truncating=False, predict_kwargs=None)¶
Gets entities from sequence.
- Parameters
x_data (List[List[str]]) – The input data, as a Numpy array (or list of Numpy arrays if the model has multiple inputs).
batch_size (int) – Integer. If unspecified, it will default to 32.
truncating (bool) – remove values from sequences larger than model.embedding.sequence_length
join_chunk (str) – separator used to join the tokens of each entity chunk, or False to keep the tokens as a list.
predict_kwargs (Optional[Dict]) – arguments passed to
tf.keras.Model.predict()
- Returns
list of entities.
- Return type
Bidirectional CNN LSTM Model¶
- class kashgari.tasks.labeling.CNN_LSTM_Model(embedding=None, sequence_length=None, hyper_parameters=None)[source]¶
Bases:
kashgari.tasks.labeling.abc_model.ABCLabelingModel
- Parameters
- __init__(embedding=None, sequence_length=None, hyper_parameters=None)¶
- build_model(x_data, y_data)¶
Build model with x_data and y_data.
This function will set up a CorpusGenerator, then call ABCClassificationModel.build_model_gen() to prepare the processor and model.
- build_model_generator(generators)¶
- Parameters
generators (List[kashgari.generators.CorpusGenerator]) –
- Return type
- compile_model(loss=None, optimizer=None, metrics=None, **kwargs)¶
Configures the model for training. Calls tf.keras.Model.compile() to compile the model with a custom loss, optimizer and metrics.
Examples
>>> model = BiLSTM_Model()
>>> # Build model with corpus
>>> model.build_model(train_x, train_y)
>>> # Compile model with custom loss, optimizer and metrics
>>> model.compile_model(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
- Parameters
loss (Optional[Any]) – name of objective function, objective function or tf.keras.losses.Loss instance.
optimizer (Optional[Any]) – name of optimizer or optimizer instance.
metrics (object) – List of metrics to be evaluated by the model during training and testing.
kwargs (Any) – additional params passed to tf.keras.Model.compile().
- Return type
- classmethod default_hyper_parameters()[source]¶
The default hyper parameters of the model as a dict. All models must implement this function.
You could easily change the model's hyper-parameters.
For example, change the LSTM units in BiLSTM_Model from 128 to 32.
>>> from kashgari.tasks.classification import BiLSTM_Model
>>> hyper = BiLSTM_Model.default_hyper_parameters()
>>> print(hyper)
{'layer_bi_lstm': {'units': 128, 'return_sequences': False}, 'layer_output': {}}
>>> hyper['layer_bi_lstm']['units'] = 32
>>> model = BiLSTM_Model(hyper_parameters=hyper)
- evaluate(x_data, y_data, batch_size=32, digits=4, truncating=False)¶
Build a text report showing the main labeling metrics.
- Parameters
- Returns
A report dict
- Return type
Dict
Example
>>> from kashgari.tasks.labeling import BiGRU_Model
>>> model = BiGRU_Model()
>>> model.fit(train_x, train_y, valid_x, valid_y)
>>> report = model.evaluate(test_x, test_y)
           precision    recall  f1-score   support
      ORG     0.0665    0.1108    0.0831       984
      LOC     0.1870    0.2086    0.1972      1951
      PER     0.1685    0.0882    0.1158       884
micro avg     0.1384    0.1555    0.1465      3819
macro avg     0.1516    0.1555    0.1490      3819
>>> print(report)
{
    'f1-score': 0.14895159934887792,
    'precision': 0.1516294012813676,
    'recall': 0.15553809897879026,
    'support': 3819,
    'detail': {
        'LOC': {'f1-score': 0.19718992248062014, 'precision': 0.18695452457510336, 'recall': 0.20861096873398258, 'support': 1951},
        'ORG': {'f1-score': 0.08307926829268293, 'precision': 0.06646341463414634, 'recall': 0.11077235772357724, 'support': 984},
        'PER': {'f1-score': 0.11581291759465479, 'precision': 0.16846652267818574, 'recall': 0.08823529411764706, 'support': 884}
    }
}
- fit(x_train, y_train, x_validate=None, y_validate=None, batch_size=64, epochs=5, callbacks=None, fit_kwargs=None)¶
Trains the model for a given number of epochs with given data set list.
- Parameters
x_train (List[List[str]]) – Array of train feature data (if the model has a single input), or tuple of train feature data array (if the model has multiple inputs)
y_train (List[List[str]]) – Array of train label data
x_validate (Optional[List[List[str]]]) – Array of validation feature data (if the model has a single input), or tuple of validation feature data array (if the model has multiple inputs)
y_validate (Optional[List[List[str]]]) – Array of validation label data
batch_size (int) – Number of samples per gradient update, default to 64.
epochs (int) – Number of epochs to train the model. An epoch is an iteration over the entire x and y data provided.
callbacks (Optional[List[tensorflow.python.keras.callbacks.Callback]]) – List of tf.keras.callbacks.Callback instances to apply during training. See tf.keras.callbacks.
fit_kwargs (Optional[Dict]) – additional arguments passed to tf.keras.Model.fit().
- Returns
A tf.keras.callbacks.History object. Its History.history attribute is a record of training loss values and metrics values at successive epochs, as well as validation loss values and validation metrics values (if applicable).
- Return type
tensorflow.python.keras.callbacks.History
- fit_generator(train_sample_gen, valid_sample_gen=None, batch_size=64, epochs=5, callbacks=None, fit_kwargs=None)¶
Trains the model for a given number of epochs with given data generator.
Data generator must be a subclass of CorpusGenerator.
- Parameters
train_sample_gen (kashgari.generators.CorpusGenerator) – train data generator.
valid_sample_gen (Optional[kashgari.generators.CorpusGenerator]) – valid data generator.
batch_size (int) – Number of samples per gradient update, default to 64.
epochs (int) – Number of epochs to train the model. An epoch is an iteration over the entire x and y data provided.
callbacks (Optional[List[tensorflow.python.keras.callbacks.Callback]]) – List of tf.keras.callbacks.Callback instances to apply during training. See tf.keras.callbacks.
fit_kwargs (Optional[Dict]) – additional arguments passed to tf.keras.Model.fit().
- Returns
A tf.keras.callbacks.History object. Its History.history attribute is a record of training loss values and metrics values at successive epochs, as well as validation loss values and validation metrics values (if applicable).
- Return type
tensorflow.python.keras.callbacks.History
- classmethod load_model(model_path)¶
- Parameters
model_path (str) –
- Return type
Union[ABCLabelingModel, ABCClassificationModel]
- predict(x_data, *, batch_size=32, truncating=False, predict_kwargs=None)¶
Generates output predictions for the input samples.
Computation is done in batches.
- Parameters
x_data (List[List[str]]) – The input data, as a Numpy array (or list of Numpy arrays if the model has multiple inputs).
batch_size (int) – Integer. If unspecified, it will default to 32.
truncating (bool) – remove values from sequences larger than model.embedding.sequence_length
predict_kwargs (Optional[Dict]) – arguments passed to
tf.keras.Model.predict()
- Returns
array(s) of predictions.
- Return type
List[List[str]]
- predict_entities(x_data, batch_size=32, join_chunk=' ', truncating=False, predict_kwargs=None)¶
Gets entities from sequence.
- Parameters
x_data (List[List[str]]) – The input data, as a Numpy array (or list of Numpy arrays if the model has multiple inputs).
batch_size (int) – Integer. If unspecified, it will default to 32.
truncating (bool) – remove values from sequences larger than model.embedding.sequence_length
join_chunk (str) – separator used to join the tokens of each entity chunk, or False to keep the tokens as a list.
predict_kwargs (Optional[Dict]) – arguments passed to
tf.keras.Model.predict()
- Returns
list of entities.
- Return type
Generators¶
CorpusGenerator¶
BatchDataSet¶
- class kashgari.generators.BatchDataSet(corpus, *, text_processor, label_processor, seq_length=None, max_position=None, segment=False, batch_size=64)[source]¶
Bases:
Iterable
Data Processors¶
SequenceProcessor¶
- class kashgari.processors.SequenceProcessor(build_in_vocab='text', min_count=3, build_vocab_from_labels=False, **kwargs)[source]¶
Bases:
kashgari.processors.abc_processor.ABCProcessor
Generic processors for the sequence samples.
- Parameters
- Return type
- build_vocab(x_data, y_data)¶
- build_vocab_generator(generators)[source]¶
- Parameters
generators (List[kashgari.generators.CorpusGenerator]) –
- Return type
- get_tensor_shape(batch_size, seq_length)¶
- inverse_transform(labels, *, lengths=None, threshold=0.5, **kwargs)[source]¶
- Parameters
labels (Union[List[List[int]], numpy.ndarray]) –
lengths (Optional[List[int]]) –
threshold (float) –
kwargs (Any) –
- Return type
List[List[str]]
- transform(samples, *, seq_length=None, max_position=None, segment=False)[source]¶
- Parameters
- Return type
- property is_vocab_build: bool¶
- property vocab_size: int¶
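The processor can also be exercised on its own. Below is a hedged sketch that uses only the methods documented above; exact token ids and padding behaviour depend on internals this page does not cover.
# Hedged sketch of using SequenceProcessor directly; the dummy tag sequences
# are only there to satisfy build_vocab's (x_data, y_data) signature.
from kashgari.processors import SequenceProcessor

samples = [['all', 'work', 'and', 'no', 'play'],
           ['makes', 'jack', 'a', 'dull', 'boy'],
           ['all', 'play', 'and', 'no', 'work']]
dummy_tags = [['O'] * len(sample) for sample in samples]

processor = SequenceProcessor(min_count=1)
processor.build_vocab(samples, dummy_tags)
print(processor.vocab_size)

ids = processor.transform(samples, seq_length=8)             # padded id matrix
tokens = processor.inverse_transform(ids, lengths=[5, 5, 5]) # back to tokens
print(tokens)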
ClassificationProcessor¶
- class kashgari.processors.ClassificationProcessor(multi_label=False, **kwargs)[source]¶
Bases:
kashgari.processors.abc_processor.ABCProcessor
- __init__(multi_label=False, **kwargs)[source]¶
Initialize self. See help(type(self)) for accurate signature.
- build_vocab(x_data, y_data)¶
- build_vocab_generator(generators)[source]¶
- Parameters
generators (List[kashgari.generators.CorpusGenerator]) –
- Return type
- transform(samples, *, seq_length=None, max_position=None, segment=False)[source]¶
- Parameters
- Return type
- property is_vocab_build: bool¶
- property vocab_size: int¶
Contributing & Support¶
We are happy to accept contributions that make Kashgari better and more awesome! You can contribute in various ways:
Bug Reports¶
Please read the documentation and search the issue tracker to try and find the answer to your question before posting an issue.
When creating an issue on the repository, please provide as much info as possible:
Version being used.
Operating system.
Version of Python.
Errors in console.
Detailed description of the problem.
Examples for reproducing the error. You can post pictures, but if specific text or code is required to reproduce the issue, please provide the text in a plain text format for easy copy/paste.
The more info provided the greater the chance someone will take the time to answer, implement, or fix the issue.
Be prepared to answer questions and provide additional information if required. Issues in which the creator refuses to respond to follow up questions will be marked as stale and closed.
Reviewing Code¶
Take part in reviewing pull requests and/or reviewing direct commits. Make suggestions to improve the code and discuss solutions to overcome weakness in the algorithm.
Answer Questions in Issues¶
Take time and answer questions and offer suggestions to people who’ve created issues in the issue tracker. Often people will have questions that you might have an answer for. Or maybe you know how to help them accomplish a specific task they are asking about. Feel free to share your experience with others to help them out.
Pull Requests¶
Pull requests are welcome, and a great way to help fix bugs and add new features.
Accuracy Benchmarks¶
Use Kashgari on your own data and report the F1 score.
Adding New Models¶
New models can be of two basic types:
Adding New Tasks¶
Currently, Kashgari can handle text-classification and sequence-labeling tasks. If you want to apply Kashgari to a new task, please submit a feature-request issue and explain why we should consider adding the new task to Kashgari.
Documentation Improvements¶
A ton of time has been spent not only creating and supporting this tool, but also on making this documentation. If you feel it is still lacking, show your appreciation for the tool by helping to improve/translate the documentation.
You can build the docs by running these commands in the project root folder. Source files are in the docs folder.
pip install -r docs/requirements.txt
python setup.py install
sh ./scripts/docs-live.sh
Release notes¶
Upgrading¶
To upgrade Kashgari to the latest version, use pip:
pip uninstall -y kashgari-tf
pip install --upgrade kashgari
To inspect the currently installed version, use the following command:
pip show kashgari
Current Release¶
[2.0.1] - 2020.10.28¶
✨ Add convert_to_saved_model API for tf-serving use case.
✨ Add tf-serving documents.
[2.0.0] - 2020.09.10¶
This is a fully re-implemented version with TF2.
✨ Embeddings
✨ Text Classification Task
✨ Text Labeling Task
✨ Seq2Seq Task
✨ Examples
✨ Neural machine translation with Seq2Seq
✨ Benchmarks
1.1.1 - 2020.03.13¶
✨ Add BERTEmbeddingV2.
💥 Migrate documents to https://readthedoc.org for version control.
1.1.0 - 2019.12.27¶
✨ Add Scoring task. (#303)
✨ Add tokenizers.
🐛 Fixing multi-label classification model loading. #304
1.0.0 - 2019.10.18¶
Unfortunately, we have to change the package name for clarity and consistency. Here is the new naming style.
Backend | pypi version | desc |
---|---|---|
TensorFlow 2.x | kashgari 2.x.x | coming soon |
TensorFlow 1.14+ | kashgari 1.x.x | |
Keras | kashgari 0.x.x | legacy version |
Here is how the existing versions change
Supported Backend | Kashgari Versions | Kashgari-tf Version |
---|---|---|
TensorFlow 2.x | kashgari 2.x.x | - |
TensorFlow 1.14+ | kashgari 1.0.1 | - |
TensorFlow 1.14+ | kashgari 1.0.0 | 0.5.5 |
TensorFlow 1.14+ | - | 0.5.4 |
TensorFlow 1.14+ | - | 0.5.3 |
TensorFlow 1.14+ | - | 0.5.2 |
TensorFlow 1.14+ | - | 0.5.1 |
Keras (legacy) | kashgari 0.2.6 | - |
Keras (legacy) | kashgari 0.2.5 | - |
Keras (legacy) | kashgari 0.x.x | - |
0.5.4 - 2019.09.30¶
✨ Add shuffle parameter to fit function (#249)
✨ Improved type hinting for loaded model (#248)
🐛 Fix the configuration changes during embedding save/load (#224)
🐛 Fix stacked embedding save/load (#224)
🐛 Fix evaluate function where the list has int instead of str ([#222])
💥 Renaming model.pre_processor to model.processor
🚨 Removing TensorFlow and numpy warnings
📝 Add docs how to specify which CPU or GPU
📝 Add docs how to compile model with custom optimizer
0.5.1 - 2019.07.15¶
📝 Rewrite documents with mkdocs
📝 Add Chinese documents
✨ Add predict_top_k_class for classification model to get predict probabilities (#146)
🚸 Add label2idx, token2idx properties to Embeddings and Models
🚸 Add tokenizer property for BERT Embedding. (#136)
🚸 Add predict_kwargs for models' predict() function
⚡️ Change multi-label classification's default loss function to binary_crossentropy (#151)
Legacy Version Changelog¶
0.2.0¶
multi-label classification for all classification models
support cuDNN cell for sequence labeling
add option for output BOS and EOS in sequence labeling result, fix #31
0.1.9¶
add AVCNNModel, KMaxCNNModel, RCNNModel, AVRNNModel, DropoutBGRUModel, DropoutAVRNNModel models to classification task
fix several small bugs
0.1.8¶
fix BERT Embedding model’s to_json function, issue #19
0.1.7¶
remove class candidates filter to fix #16
overwrite init function in CustomEmbedding
add parameter check to custom_embedding layer
add keras-bert version to setup.py file
0.1.6¶
add output_dict, debug_info params to text_classification model
add output_dict, debug_info and chunk_joiner params to text_classification model
fix possible crash at data_generator
0.1.5¶
fix sequence labeling evaluate result output
refactor model save and load function
0.1.4¶
fix classification model evaluate result output
change test settings