Kashgari

GitHub Slack Coverage Status PyPI

Overview | Performance | Installation | Documentation | Contributing

🎉🎉🎉 We released the 2.0.0-alpha2 version with Seq2Seq Support. 🎉🎉🎉

Overview

Kashgari is a simple and powerful NLP transfer learning framework that lets you build a state-of-the-art model in 5 minutes for named entity recognition (NER), part-of-speech tagging (PoS), and text classification tasks.

  • Human-friendly. Kashgari’s code is straightforward, well documented and tested, which makes it very easy to understand and modify.
  • Powerful and simple. Kashgari allows you to apply state-of-the-art natural language processing (NLP) models to your text, such as named entity recognition (NER), part-of-speech tagging (PoS) and classification.
  • Built-in transfer learning. Kashgari ships with pre-trained BERT and Word2vec embedding models, which makes it very simple to apply transfer learning when training your model.
  • Fully scalable. Kashgari provides a simple, fast, and scalable environment for experimentation: train your models and try new approaches using different embeddings and model structures.
  • Production ready. Kashgari can export models in SavedModel format for TensorFlow Serving, so you can deploy them directly to the cloud (see the sketch below).
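
A rough sketch of the export step (assuming `model` is an already trained Kashgari model from the tutorials below; the underlying tf.keras model is exposed as `model.tf_model`):

# A minimal sketch: write the underlying tf.keras model out in SavedModel format
# so it can be served with TensorFlow Serving. The target directory name is arbitrary.
model.tf_model.save('tf_serving_model/1', save_format='tf')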

Our Goal

  • Academic users: easier experimentation to prove their hypotheses without coding from scratch.
  • NLP beginners: learn how to build an NLP project with production-level code quality.
  • NLP developers: build a production-level classification/labeling model within minutes.

Performance

Additional performance reports are welcome.

| Task                     | Language | Dataset                   | Score |
| ------------------------ | -------- | ------------------------- | ----- |
| Named Entity Recognition | Chinese  | People's Daily NER Corpus | 95.57 |
| Text Classification      | Chinese  | SMP2018ECDTCorpus         | 94.57 |

Installation

The project is based on Python 3.6+, because it is 2019 and type hinting is cool.

| Backend          | pypi version                         | desc                  |
| ---------------- | ------------------------------------ | --------------------- |
| TensorFlow 2.1+  | pip install 'kashgari>=2.0.0'        | TF2.1+ with tf.keras  |
| TensorFlow 1.14+ | pip install 'kashgari>=1.0.0,<2.0.0' | TF1.14+ with tf.keras |
| Keras            | pip install 'kashgari<1.0.0'         | keras version         |
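
To double-check which backend and version you ended up with, you can print the package versions (a minimal sketch, assuming the kashgari package exposes __version__, as its releases do):

# Quick sanity check after installation
import kashgari
import tensorflow as tf

print(kashgari.__version__)  # assumes the package exposes __version__
print(tf.__version__)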

Contributors ✨

Thanks go to these wonderful people. There are many ways to get involved: start with the contributor guidelines and then check the open issues for specific tasks.

Text Classification Model

Kashgari provides several models for text classification. All classification models inherit from ABCClassificationModel, so you can easily switch from one model to another by changing a single line of code.

Available Models

  • BiLSTM_Model
  • BiGRU_Model
  • CNN_Model
  • CNN_LSTM_Model
  • CNN_GRU_Model
  • CNN_Attention_Model

Train basic classification model

Kashgari provides a basic intent-classification corpus for experiments. You can also use your own corpus, in any language, for training.

# Load built-in corpus.
from kashgari.corpus import SMP2018ECDTCorpus

train_x, train_y = SMP2018ECDTCorpus.load_data('train')
valid_x, valid_y = SMP2018ECDTCorpus.load_data('valid')
test_x, test_y = SMP2018ECDTCorpus.load_data('test')

# Or use your own corpus
train_x = [['Hello', 'world'], ['Hello', 'Kashgari']]
train_y = ['a', 'b']

valid_x, valid_y = train_x, train_y
test_x, test_y = train_x, train_y

Then train our first model. All models provide the same APIs, so you can use any classification model here.

import kashgari
from kashgari.tasks.classification import BiLSTM_Model

import logging
logging.basicConfig(level='DEBUG')

model = BiLSTM_Model()
model.fit(train_x, train_y, valid_x, valid_y)

# Evaluate the model
model.evaluate(test_x, test_y)

# Model data will be saved to the `saved_classification_model` folder
model.save('saved_classification_model')

# Load saved model
loaded_model = BiLSTM_Model.load_model('saved_classification_model')
loaded_model.predict(test_x[:10])

# To continue training, compile the newly loaded model first
loaded_model.compile_model()
loaded_model.fit(train_x, train_y, valid_x, valid_y)

That’s all you need to do. Easy, right?

Text classification with transfer learning

Kashgari provides various language model embeddings for transfer learning. Here is an example using the BERT embedding.

import kashgari
from kashgari.tasks.classification import BiGRU_Model
from kashgari.embeddings import BertEmbedding

import logging
logging.basicConfig(level='DEBUG')

bert_embed = BertEmbedding('<PRE_TRAINED_BERT_MODEL_FOLDER>')
model = BiGRU_Model(bert_embed, sequence_length=100)
model.fit(train_x, train_y, valid_x, valid_y)

You can replace bert_embed with any embedding class in kashgari.embeddings, as in the sketch below. More info about embeddings: LINK THIS.
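
For instance, here is a minimal sketch that swaps BERT for the randomly initialized BareEmbedding (embedding_size=200 is an arbitrary choice for illustration):

from kashgari.embeddings import BareEmbedding
from kashgari.tasks.classification import BiGRU_Model

# Randomly initialized embedding instead of a pre-trained one
bare_embed = BareEmbedding(embedding_size=200)
model = BiGRU_Model(bare_embed, sequence_length=100)
model.fit(train_x, train_y, valid_x, valid_y)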

Adjust model’s hyper-parameters

You can easily change a model’s hyper-parameters. For example, here we change the LSTM units in BiLSTM_Model from 128 to 32.

from kashgari.tasks.classification import BiLSTM_Model

hyper = BiLSTM_Model.default_hyper_parameters()
print(hyper)
# {'layer_bi_lstm': {'units': 128, 'return_sequences': False}, 'layer_dense': {'activation': 'softmax'}}

hyper['layer_bi_lstm']['units'] = 32

model = BiLSTM_Model(hyper_parameters=hyper)

Use custom optimizer

Kashgari supports custom optimizers, such as RAdam.

from kashgari.corpus import SMP2018ECDTCorpus
from kashgari.tasks.classification import BiLSTM_Model
# Remember to import kashgari before RAdam
from keras_radam import RAdam

train_x, train_y = SMP2018ECDTCorpus.load_data('train')
valid_x, valid_y = SMP2018ECDTCorpus.load_data('valid')
test_x, test_y = SMP2018ECDTCorpus.load_data('test')

model = BiLSTM_Model()
# This step will build token dict, label dict and model structure
model.build_model(train_x, train_y, valid_x, valid_y)
# Compile model with custom optimizer, you can also customize loss and metrics.
optimizer = RAdam()
model.compile_model(optimizer=optimizer)

# Train model
model.fit(train_x, train_y, valid_x, valid_y)

Use callbacks

Kashgari is built on tf.keras, so you can use any tf.keras callback directly with a Kashgari model. For example, here is how to visualize training with TensorBoard.

from tensorflow.python import keras
from kashgari.tasks.classification import BiGRU_Model
from kashgari.callbacks import EvalCallBack

import logging
logging.basicConfig(level='DEBUG')

model = BiGRU_Model()

tf_board_callback = keras.callbacks.TensorBoard(log_dir='./logs', update_freq=1000)

# Built-in callback that reports precision, recall and F1 at the given step interval
eval_callback = EvalCallBack(kash_model=model,
                             valid_x=valid_x,
                             valid_y=valid_y,
                             step=5)

model.fit(train_x,
          train_y,
          valid_x,
          valid_y,
          batch_size=100,
          callbacks=[eval_callback, tf_board_callback])
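
Because these are plain tf.keras callbacks, you can freely mix in others. A minimal sketch adding early stopping and learning-rate reduction (the patience values are only examples; monitoring val_loss assumes validation data is passed to fit):

# Standard tf.keras callbacks combined with the Kashgari ones above
early_stopping = keras.callbacks.EarlyStopping(monitor='val_loss', patience=10)
reduce_lr = keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=5)

model.fit(train_x,
          train_y,
          valid_x,
          valid_y,
          batch_size=100,
          callbacks=[eval_callback, tf_board_callback, early_stopping, reduce_lr])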

Multi-Label Classification

Kashgari supports multi-label classification. Here is how to build one.

Let’s assume we have a dataset like this.

x = [
   ['This','news','are' , 'very','well','organized'],
   ['What','extremely','usefull','tv','show'],
   ['The','tv','presenter','were','very','well','dress'],
   ['Multi-class', 'classification', 'means', 'a', 'classification', 'task', 'with', 'more', 'than', 'two', 'classes']
]

y = [
   ['A', 'B'],
   ['A',],
   ['B', 'C'],
   []
]

Now we need to initialize an embedding for our model, then build the model with multi_label=True and fit it.

import logging
from kashgari.embeddings import BertEmbedding
from kashgari.tasks.classification import BiLSTM_Model

logging.basicConfig(level='DEBUG')

bert_embed = BertEmbedding('<PRE_TRAINED_BERT_MODEL_FOLDER>')

model = BiLSTM_Model(bert_embed, sequence_length=100, multi_label=True)
model.fit(x, y)
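
After training, prediction works the same way as in the single-label case; with multi_label=True each prediction is a list of labels (a minimal sketch; 0.5 below is simply the default threshold of the predict() API):

# Predict on the training samples for a quick sanity check
predictions = model.predict(x, multi_label_threshold=0.5)
print(predictions)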

Customize your own model

It is very easy and straightforward to build your own customized model: just inherit from ABCClassificationModel and implement the default_hyper_parameters() and build_model_arc() functions.

from typing import Dict, Any

from tensorflow import keras

from kashgari.tasks.classification.abc_model import ABCClassificationModel
from kashgari.layers import L

import logging
logging.basicConfig(level='DEBUG')


class DoubleBLSTMModel(ABCClassificationModel):
    """Bidirectional LSTM Sequence Labeling Model"""

    @classmethod
    def default_hyper_parameters(cls) -> Dict[str, Dict[str, Any]]:
        """
        Get hyper parameters of model
        Returns:
            hyper parameters dict
        """
        return {
            'layer_blstm1': {
                'units': 128,
                'return_sequences': True
            },
            'layer_blstm2': {
                'units': 128,
                'return_sequences': False
            },
            'layer_dropout': {
                'rate': 0.4
            },
            'layer_time_distributed': {},
            'layer_output': {

            }
        }

    def build_model_arc(self):
        """
        Build the model architecture
        """
        output_dim = len(self.processor.label2idx)
        config = self.hyper_parameters
        embed_model = self.embedding.embed_model

        # Define your layers
        layer_blstm1 = L.Bidirectional(L.LSTM(**config['layer_blstm1']),
                                       name='layer_blstm1')
        layer_blstm2 = L.Bidirectional(L.LSTM(**config['layer_blstm2']),
                                       name='layer_blstm2')

        layer_dropout = L.Dropout(**config['layer_dropout'],
                                  name='layer_dropout')

        layer_time_distributed = L.Dense(output_dim, **config['layer_output'])

        # You need to use this activation layer as the final activation
        # to support multi-label classification
        layer_activation = self._activation_layer()

        # Define tensor flow
        tensor = layer_blstm1(embed_model.output)
        tensor = layer_blstm2(tensor)
        tensor = layer_dropout(tensor)
        tensor = layer_time_distributed(tensor)
        output_tensor = layer_activation(tensor)

        # Init model
        self.tf_model = keras.Model(embed_model.inputs, output_tensor)

model = DoubleBLSTMModel()
model.fit(train_x, train_y, valid_x, valid_y)

Short Sentence Classification Performance

We have run the classification tests on SMP2018ECDTCorpus. Here is the full code: colab link

  • SEQUENCE_LENGTH = 60
  • EPOCHS = 30
  • EARLY_STOPPING_PATIENCE = 10
  • REDUCE_LR_PATIENCE = 5
  • BATCH_SIZE = 64
| Embedding       | Model               | Best F1-Score | Best F1 @ epochs |
| --------------- | ------------------- | ------------- | ---------------- |
| RoBERTa-wwm-ext | BiLSTM_Model        | 92.89         | 15               |
| RoBERTa-wwm-ext | BiGRU_Model         | 94.57         | 10               |
| RoBERTa-wwm-ext | CNN_Model           | 92.95         | 12               |
| RoBERTa-wwm-ext | CNN_Attention_Model | 92.07         | 3                |
| RoBERTa-wwm-ext | CNN_GRU_Model       | 89.56         | 22               |
| RoBERTa-wwm-ext | CNN_LSTM_Model      | 90.9          | 26               |
| Bert-Chinese    | BiLSTM_Model        | 93.74         | 4                |
| Bert-Chinese    | BiGRU_Model         | 93.12         | 13               |
| Bert-Chinese    | CNN_Model           | 92.95         | 13               |
| Bert-Chinese    | CNN_Attention_Model | 92.04         | 8                |
| Bert-Chinese    | CNN_GRU_Model       | 92.88         | 8                |
| Bert-Chinese    | CNN_LSTM_Model      | 91.15         | 24               |
| Bare            | BiLSTM_Model        | 81.96         | 11               |
| Bare            | BiGRU_Model         | 82.86         | 9                |
| Bare            | CNN_Model           | 86.61         | 11               |
| Bare            | CNN_Attention_Model | 78.84         | 12               |
| Bare            | CNN_GRU_Model       | 66.14         | 26               |
| Bare            | CNN_LSTM_Model      | 48.13         | 29               |

[Figure: F1 scores of each embedding/model combination on SMP2018ECDTCorpus]

Text Labeling Model

Kashgari provides several models for text labeling. All labeling models inherit from ABCLabelingModel, so you can easily switch from one model to another by changing a single line of code.

Available Models

  • CNN_LSTM_Model
  • BiLSTM_Model
  • BiGRU_Model

Train basic NER model

Kashgari provides a basic NER corpus for experiments. You can also use your own corpus, in any language, for training.

# Load built-in corpus.
from kashgari.corpus import ChineseDailyNerCorpus

train_x, train_y = ChineseDailyNerCorpus.load_data('train')
valid_x, valid_y = ChineseDailyNerCorpus.load_data('valid')
test_x, test_y = ChineseDailyNerCorpus.load_data('test')

# Or use your own corpus
train_x = [['Hello', 'world'], ['Hello', 'Kashgari'], ['I', 'love', 'Beijing']]
train_y = [['O', 'O'], ['O', 'B-PER'], ['O', 'B-LOC']]

valid_x, valid_y = train_x, train_y
test_x, test_y = train_x, train_y

If you use your own corpus, it needs to be tokenized like this:

>>> print(train_x[0])
['海', '钓', '比', '赛', '地', '点', '在', '厦', '门', '与', '金', '门', '之', '间', '的', '海', '域', '。']

>>> print(train_y[0])
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'I-LOC', 'O', 'B-LOC', 'I-LOC', 'O', 'O', 'O', 'O', 'O', 'O']

Then train our first model. All models provide the same APIs, so you can use any labeling model here.

import kashgari
from kashgari.tasks.labeling import BiLSTM_Model

model = BiLSTM_Model()
model.fit(train_x, train_y, valid_x, valid_y)

# Evaluate the model

model.evaluate(test_x, test_y)

# Model data will be saved to the `saved_ner_model` folder
model.save('saved_ner_model')

# Load saved model
loaded_model = BiLSTM_Model.load_model('saved_ner_model')
loaded_model.predict(test_x[:10])

# To continue training, compile the newly loaded model first
loaded_model.compile_model()
loaded_model.fit(train_x, train_y, valid_x, valid_y)

That’s all you need to do. Easy, right?

Sequence labeling with transfer learning

Kashgari provides various language model embeddings for transfer learning. Here is an example using the BERT embedding.

from kashgari.tasks.labeling import BiLSTM_Model
from kashgari.embeddings import BertEmbedding

bert_embed = BertEmbedding('<PRE_TRAINED_BERT_MODEL_FOLDER>')
model = BiLSTM_Model(bert_embed, sequence_length=100)
model.fit(train_x, train_y, valid_x, valid_y)

You can replace bert_embed with any embedding class in kashgari.embeddings, as in the sketch below. More info about embeddings: LINK THIS.
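
For instance, here is a minimal sketch using TransformerEmbedding with an ALBERT checkpoint instead of BertEmbedding (the three paths are placeholders for your own files; the constructor arguments follow the TransformerEmbedding documentation below):

from kashgari.embeddings import TransformerEmbedding
from kashgari.tasks.labeling import BiLSTM_Model

# Pre-trained ALBERT checkpoint wrapped as an embedding layer
embed = TransformerEmbedding('<PATH_TO_VOCAB_TXT>',
                             '<PATH_TO_CONFIG_JSON>',
                             '<PATH_TO_CHECKPOINT>',
                             model_type='albert')
model = BiLSTM_Model(embed, sequence_length=100)
model.fit(train_x, train_y, valid_x, valid_y)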

Adjust model’s hyper-parameters

You can easily change a model’s hyper-parameters. For example, here we change the LSTM units in BiLSTM_Model from 128 to 32.

from kashgari.tasks.labeling import BiLSTM_Model

hyper = BiLSTM_Model.default_hyper_parameters()
print(hyper)
# {'layer_blstm': {'units': 128, 'return_sequences': True}, 'layer_dropout': {'rate': 0.4}, 'layer_time_distributed': {}, 'layer_activation': {'activation': 'softmax'}}

hyper['layer_blstm']['units'] = 32

model = BiLSTM_Model(hyper_parameters=hyper)

Use custom optimizer

Kashgari supports custom optimizers, such as RAdam.

from kashgari.corpus import SMP2018ECDTCorpus
from kashgari.tasks.classification import BiLSTM_Model
# Remember to import kashgari before RAdam
from keras_radam import RAdam

train_x, train_y = SMP2018ECDTCorpus.load_data('train')
valid_x, valid_y = SMP2018ECDTCorpus.load_data('valid')
test_x, test_y = SMP2018ECDTCorpus.load_data('test')

model = BiLSTM_Model()
# This step will build token dict, label dict and model structure
model.build_model(train_x, train_y, valid_x, valid_y)
# Compile model with custom optimizer, you can also customize loss and metrics.
optimizer = RAdam()
model.compile_model(optimizer=optimizer)

# Train model
model.fit(train_x, train_y, valid_x, valid_y)

Use callbacks

Kashgari is built on tf.keras, so you can use any tf.keras callback directly with a Kashgari model. For example, here is how to visualize training with TensorBoard.

from tensorflow import keras
from kashgari.tasks.labeling import BiLSTM_Model
from kashgari.callbacks import EvalCallBack


model = BiLSTM_Model()

tf_board_callback = keras.callbacks.TensorBoard(log_dir='./logs', update_freq=1000)

# Built-in callback that reports precision, recall and F1 at the given step interval
eval_callback = EvalCallBack(kash_model=model,
                             valid_x=valid_x,
                             valid_y=valid_y,
                             step=5)

model.fit(train_x,
          train_y,
          valid_x,
          valid_y,
          batch_size=100,
          callbacks=[eval_callback, tf_board_callback])

Customize your own model

It is very easy and straightforward to build your own customized model: just inherit from ABCLabelingModel and implement the default_hyper_parameters() and build_model_arc() functions.

from typing import Dict, Any

from tensorflow import keras

from kashgari.tasks.labeling.abc_model import ABCLabelingModel
from kashgari.layers import L

import logging
logging.basicConfig(level='DEBUG')

class DoubleBLSTMModel(ABCLabelingModel):
    """Bidirectional LSTM Sequence Labeling Model"""

    @classmethod
    def default_hyper_parameters(cls) -> Dict[str, Dict[str, Any]]:
        """
        Get hyper parameters of model
        Returns:
            hyper parameters dict
        """
        return {
            'layer_blstm1': {
                'units': 128,
                'return_sequences': True
            },
            'layer_blstm2': {
                'units': 128,
                'return_sequences': True
            },
            'layer_dropout': {
                'rate': 0.4
            },
            'layer_time_distributed': {},
            'layer_activation': {
                'activation': 'softmax'
            }
        }

    def build_model_arc(self):
        """
        Build the model architecture
        """
        output_dim = len(self.processor.label2idx)
        config = self.hyper_parameters
        embed_model = self.embedding.embed_model

        # Define your layers
        layer_blstm1 = L.Bidirectional(L.LSTM(**config['layer_blstm1']),
                                       name='layer_blstm1')
        layer_blstm2 = L.Bidirectional(L.LSTM(**config['layer_blstm2']),
                                       name='layer_blstm2')

        layer_dropout = L.Dropout(**config['layer_dropout'],
                                  name='layer_dropout')

        layer_time_distributed = L.TimeDistributed(L.Dense(output_dim,
                                                           **config['layer_time_distributed']),
                                                   name='layer_time_distributed')
        layer_activation = L.Activation(**config['layer_activation'])

        # Define tensor flow
        tensor = layer_blstm1(embed_model.output)
        tensor = layer_blstm2(tensor)
        tensor = layer_dropout(tensor)
        tensor = layer_time_distributed(tensor)
        output_tensor = layer_activation(tensor)

        # Init model
        self.tf_model = keras.Model(embed_model.inputs, output_tensor)

model = DoubleBLSTMModel()
model.fit(train_x, train_y, valid_x, valid_y)

Chinese NER Performance

We ran the sequence labeling benchmarks on ChineseDailyNerCorpus. Here is the full code: colab link

  • SEQUENCE_LENGTH = 100
  • EPOCHS = 30
  • EARLY_STOPPING_PATIENCE = 10
  • REDUCE_LR_PATIENCE = 5
  • BATCH_SIZE = 64
| Embedding       | Model            | Best F1-Score | Best F1 @ epochs |
| --------------- | ---------------- | ------------- | ---------------- |
| RoBERTa-wwm-ext | BiGRU_Model      | 93.22         | 11               |
| RoBERTa-wwm-ext | BiGRU_CRF_Model  | 95.13         | 29               |
| RoBERTa-wwm-ext | BiLSTM_Model     | 93.37         | 19               |
| RoBERTa-wwm-ext | BiLSTM_CRF_Model | 95.43         | 26               |
| RoBERTa-wwm-ext | CNN_LSTM_Model   | 94.05         | 23               |
| Bert-Chinese    | BiGRU_Model      | 93.01         | 16               |
| Bert-Chinese    | BiGRU_CRF_Model  | 95.01         | 24               |
| Bert-Chinese    | BiLSTM_Model     | 93.85         | 17               |
| Bert-Chinese    | BiLSTM_CRF_Model | 95.57         | 26               |
| Bert-Chinese    | CNN_LSTM_Model   | 93.17         | 16               |
| Bare            | BiGRU_Model      | 74.85         | 16               |
| Bare            | BiGRU_CRF_Model  | 81.24         | 21               |
| Bare            | BiLSTM_Model     | 74.7          | 19               |
| Bare            | BiLSTM_CRF_Model | 82.37         | 25               |
| Bare            | CNN_LSTM_Model   | 75.07         | 14               |

[Figure: F1 scores of each embedding/model combination on ChineseDailyNerCorpus]

Seq2Seq Model

Train a translation model

# Original Corpus
x_original = [
    'Who am I?',
    'I am sick.',
    'I like you.',
    'I need help.',
    'It may hurt.',
    'Good morning.']

y_original = [
    'مەن كىم ؟',
    'مەن كېسەل.',
    'مەن سىزنى ياخشى كۆرمەن',
    'ماڭا ياردەم كېرەك.',
    'ئاغىرىشى مۇمكىن.',
    'خەيىرلىك ئەتىگەن.']

# Tokenize sentences with a custom tokenizing function
# We use the Bert tokenizer for this demo
from kashgari.tokenizers import BertTokenizer
tokenizer = BertTokenizer()
x_tokenized = [tokenizer.tokenize(sample) for sample in x_original]
y_tokenized = [tokenizer.tokenize(sample) for sample in y_original]

After tokenizing the corpus, we can build and train a Seq2Seq model.

from kashgari.tasks.seq2seq import Seq2Seq

model = Seq2Seq()
model.fit(x_tokenized, y_tokenized)

# predict with model
preds, attention = model.predict(x_tokenized)
print(preds)

Train with custom embedding

You can define both the encoder’s and the decoder’s embedding. Here is how to use a BERT embedding as the encoder’s embedding layer.

from kashgari.tasks.seq2seq import Seq2Seq
from kashgari.embeddings import BertEmbedding

bert = BertEmbedding('<PATH_TO_BERT_EMBEDDING>')
model = Seq2Seq(encoder_embedding=bert, hidden_size=512)

model.fit(x_tokenized, y_tokenized)

Language Embeddings

Kashgari provides several embeddings for language representation. Embedding layers convert input sequences into tensors for the downstream task. Available embeddings:

| Class name           | Description                                                                       |
| -------------------- | --------------------------------------------------------------------------------- |
| BareEmbedding        | randomly initialized tf.keras.layers.Embedding layer for text sequence embedding   |
| WordEmbedding        | pre-trained Word2Vec embedding                                                      |
| BertEmbedding        | pre-trained BERT embedding                                                          |
| TransformerEmbedding | pre-trained transformer embedding (BERT, ALBERT, RoBERTa, NEZHA)                    |

All embedding classes inherit from the base embedding class and implement embed() to embed your input sequence, as well as the embed_model property, which you need to build your own model. By providing the embed() function and the embed_model property, Kashgari hides the complexity of the different language embeddings from users; all you need to care about is which language embedding you need.

You can check out the Embedding API documentation here.

Quick start

Feature Extract From Pre-trained Embedding

Feature extraction is one of the major ways to use a pre-trained language embedding, and Kashgari provides a simple API for it. All you need to do is initialize an embedding object, set up its pre-processor, and call the embed function. Here is an example; all embeddings share the same embed API.

from kashgari.embeddings import BertEmbedding
from kashgari.processors import SequenceProcessor

bert = BertEmbedding('<BERT_MODEL_FOLDER>')
processor = SequenceProcessor()
bert.setup_text_processor(processor)
# call for embed
embed_tensor = bert.embed([['语', '言', '模', '型']])

print(embed_tensor)
# array([[-0.5001117 ,  0.9344998 , -0.55165815, ...,  0.49122602,
#         -0.2049343 ,  0.25752577],
#        [-1.05762   , -0.43353617,  0.54398274, ..., -0.61096823,
#          0.04312163,  0.03881482],
#        [ 0.14332692, -0.42566583,  0.68867105, ...,  0.42449307,
#          0.41105768,  0.08222893],
#        ...,
#        [-0.86124015,  0.08591427, -0.34404194, ...,  0.19915134,
#         -0.34176797,  0.06111742],
#        [-0.73940575, -0.02692179, -0.5826528 , ...,  0.26934686,
#         -0.29708537,  0.01855129],
#        [-0.85489404,  0.007399  , -0.26482674, ...,  0.16851354,
#         -0.36805922, -0.0052386 ]], dtype=float32)

Classification and Labeling

See details in the classification and labeling tutorials.

Customized model

You can access the underlying tf.keras model of an embedding and add your own layers or any other kind of customization: just use the embed_model property of the embedding object, as in the sketch below.
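
A minimal sketch of this idea (it assumes x_data and y_data are tokenized corpora like those in the tutorials above, and the three output classes are arbitrary):

from tensorflow import keras
from kashgari.embeddings import BareEmbedding

embedding = BareEmbedding(embedding_size=100)
embedding.analyze_corpus(x_data, y_data)  # builds the vocabulary and the embed_model

# Stack your own layers on top of the embedding's tf.keras model
embed_model = embedding.embed_model
tensor = keras.layers.GlobalAveragePooling1D()(embed_model.output)
output = keras.layers.Dense(3, activation='softmax')(tensor)  # 3 classes, chosen for illustration

custom_model = keras.Model(embed_model.inputs, output)
custom_model.summary()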

Bare Embedding

BareEmbedding is a randomly initialized tf.keras.layers.Embedding layer for text sequence embedding, and it is the default embedding class for Kashgari models.

kashgari.embeddings.BareEmbedding.__init__(self, embedding_size: int = 100, **kwargs)
Parameters:
  • embedding_size – Dimension of the dense embedding.
  • kwargs – additional params

Here is a sample of how to use this embedding class. The key difference is that you must call the analyze_corpus function before using the embed function, because the embedding layer is not pre-trained and does not contain a word list; the word list has to be built from the corpus.

import kashgari
from kashgari.embeddings import BareEmbedding

embedding = BareEmbedding(embedding_size=100)

embedding.analyze_corpus(x_data, y_data)

embed_tensor = embedding.embed_one(['语', '言', '模', '型'])

Word Embedding

WordEmbedding is a tf.keras.layers.Embedding layer with pre-trained Word2Vec/GloVe embedding weights.

kashgari.embeddings.WordEmbedding.__init__(self, w2v_path: str, *, w2v_kwargs: Dict[str, Any] = None, **kwargs)
Parameters:
  • w2v_path – Word2Vec file path.
  • w2v_kwargs – params pass to the load_word2vec_format() function of gensim.models.KeyedVectors
  • kwargs – additional params
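
A minimal usage sketch, mirroring the BertEmbedding quick start above ('<PATH_TO_W2V_FILE>' is a placeholder for your Word2Vec/GloVe file; the w2v_kwargs values are passed straight to gensim's load_word2vec_format()):

from kashgari.embeddings import WordEmbedding
from kashgari.processors import SequenceProcessor

word_embed = WordEmbedding('<PATH_TO_W2V_FILE>',
                           w2v_kwargs={'binary': True})  # set according to your w2v file format
processor = SequenceProcessor()
word_embed.setup_text_processor(processor)

# Tokens should be covered by the Word2Vec vocabulary
embed_tensor = word_embed.embed([['语', '言', '模', '型']])
print(embed_tensor.shape)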

Bert Embedding

BertEmbedding is a simple wrapper class around TransformerEmbedding. If you need to load other kinds of transformer-based language models, please use TransformerEmbedding.

Note

When using a pre-trained embedding, remember to use the same tokenization tool as the embedding model; this lets you access the full power of the embedding.

kashgari.embeddings.BertEmbedding.__init__(self, model_folder: str, **kwargs)
Parameters:
  • model_folder – path of checkpoint folder.
  • kwargs – additional params

Example Usage - Text Classification

Let’s run a text classification model with BERT.

sentences = [
    "Jim Henson was a puppeteer.",
    "This here's an example of using the BERT tokenizer.",
    "Why did the chicken cross the road?"
            ]
labels = [
    "class1",
    "class2",
    "class1"
]
########## Load Bert Embedding ##########
import os
from kashgari.embeddings import BertEmbedding
from kashgari.tokenizers import BertTokenizer

bert_embedding = BertEmbedding('<PATH_TO_BERT_EMBEDDING>')

tokenizer = BertTokenizer(os.path.join('<PATH_TO_BERT_EMBEDDING>', 'vocab_chinese.txt'))
sentences_tokenized = [tokenizer.tokenize(s) for s in sentences]

"""
The sentences will be tokenized into:
[
    ['jim', 'henson', 'was', 'a', 'puppet', '##eer', '.'],
    ['this', 'here', "'", 's', 'an', 'example', 'of', 'using', 'the', 'bert', 'token', '##izer', '.'],
    ['why', 'did', 'the', 'chicken', 'cross', 'the', 'road', '?']
]
"""

train_x, train_y = sentences_tokenized[:2], labels[:2]
validate_x, validate_y = sentences_tokenized[2:], labels[2:]

########## build model ##########
from kashgari.tasks.classification import CNN_LSTM_Model
model = CNN_LSTM_Model(bert_embedding)

########## /build model ##########
model.fit(
    train_x, train_y,
    validate_x, validate_y,
    epochs=3,
    batch_size=32
)
# save model
model.save('path/to/save/model/to')

Use sentence pairs for input

Let’s assume the input pair sample is "First do it" and "then do it right". First tokenize the sentences using the BERT tokenizer, then join them with the special [SEP] token:

sentence1 = ['First', 'do', 'it']
sentence2 = ['then', 'do', 'it', 'right']

sample = sentence1 + ["[SEP]"] + sentence2
# Add a special separation token `[SEP]` between two sentences tokens
# Generate a new token list
# ['First', 'do', 'it', '[SEP]', 'then', 'do', 'it', 'right']

train_x = [sample]
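
A minimal sketch of feeding such pair samples into a classification model (the label below is made up for illustration; in practice you would have many pairs):

from kashgari.embeddings import BertEmbedding
from kashgari.tasks.classification import BiGRU_Model

train_y = ['not_related']  # hypothetical label for the single pair sample above

bert_embed = BertEmbedding('<PRE_TRAINED_BERT_MODEL_FOLDER>')
model = BiGRU_Model(bert_embed, sequence_length=100)
model.fit(train_x, train_y)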

Transformer Embedding

TransformerEmbedding is based on bert4keras. The embeddings themselves are wrapped in our simple embedding interface so that they can be used like any other embedding.

TransformerEmbedding supports the following models:

| Model   | Author     | Link                                                                        |
| ------- | ---------- | --------------------------------------------------------------------------- |
| BERT    | Google     | https://github.com/google-research/bert                                      |
| ALBERT  | Google     | https://github.com/google-research/ALBERT                                    |
| ALBERT  | brightmart | https://github.com/brightmart/albert_zh                                      |
| RoBERTa | brightmart | https://github.com/brightmart/roberta_zh                                     |
| RoBERTa | 哈工大     | https://github.com/ymcui/Chinese-BERT-wwm                                    |
| RoBERTa | 苏剑林     | https://github.com/ZhuiyiTechnology/pretrained-models                        |
| NEZHA   | Huawei     | https://github.com/huawei-noah/Pretrained-Language-Model/tree/master/NEZHA   |

Note

When using a pre-trained embedding, remember to use the same tokenization tool as the embedding model; this lets you access the full power of the embedding.

kashgari.embeddings.TransformerEmbedding.__init__(self, vocab_path: str, config_path: str, checkpoint_path: str, model_type: str = 'bert', **kwargs)
Parameters:
  • vocab_path – vocab file path, example vocab.txt
  • config_path – model config path, example config.json
  • checkpoint_path – model weight path, example model.ckpt-100000
  • model_type – transformer model type, one of {bert, albert, nezha, gpt2_ml, t5}
  • kwargs – additional params

Example Usage - Text Classification

Let’s run a text classification model with ALBERT.

sentences = [
    "Jim Henson was a puppeteer.",
    "This here's an example of using the BERT tokenizer.",
    "Why did the chicken cross the road?"
            ]
labels = [
    "class1",
    "class2",
    "class1"
]
# ------------ Load Bert Embedding ------------
import os
from kashgari.embeddings import TransformerEmbedding
from kashgari.tokenizers import BertTokenizer

# Setup paths
model_folder = '/xxx/xxx/albert_base'
checkpoint_path = os.path.join(model_folder, 'model.ckpt-best')
config_path = os.path.join(model_folder, 'albert_config.json')
vocab_path = os.path.join(model_folder, 'vocab_chinese.txt')

tokenizer = BertTokenizer.load_from_vocab_file(vocab_path)
embed = TransformerEmbedding(vocab_path, config_path, checkpoint_path,
                             model_type='albert')

sentences_tokenized = [tokenizer.tokenize(s) for s in sentences]
"""
The sentences will be tokenized into:
[
    ['jim', 'henson', 'was', 'a', 'puppet', '##eer', '.'],
    ['this', 'here', "'", 's', 'an', 'example', 'of', 'using', 'the', 'bert', 'token', '##izer', '.'],
    ['why', 'did', 'the', 'chicken', 'cross', 'the', 'road', '?']
]
"""

train_x, train_y = sentences_tokenized[:2], labels[:2]
validate_x, validate_y = sentences_tokenized[2:], labels[2:]

# ------------ Build Model Start ------------
from kashgari.tasks.classification import CNN_LSTM_Model
model = CNN_LSTM_Model(embed)

# ------------ Build Model End ------------

model.fit(
    train_x, train_y,
    validate_x, validate_y,
    epochs=3,
    batch_size=32
)
# save model
model.save('path/to/save/model/to')

Corpus

ChineseDailyNerCorpus

class kashgari.corpus.ChineseDailyNerCorpus[source]

Bases: object

Chinese Daily News NER corpus https://github.com/zjy-ucas/ChineseNER/

Example

>>> from kashgari.corpus import ChineseDailyNerCorpus
>>> train_x, train_y = ChineseDailyNerCorpus.load_data('train')
>>> test_x, test_y = ChineseDailyNerCorpus.load_data('test')
>>> valid_x, valid_y = ChineseDailyNerCorpus.load_data('valid')
>>> print(train_x)
    [['海', '钓', '比', '赛', '地', '点', '在', '厦', '门', ...], ...]
>>> print(train_y)
    [['O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'I-LOC', ...], ...]
__init__

Initialize self. See help(type(self)) for accurate signature.

classmethod load_data(subset_name: str = 'train', shuffle: bool = True) → Tuple[List[List[str]], List[List[str]]][source]

Load dataset as sequence labeling format, char level tokenized

Parameters:
  • subset_name – {train, test, valid}
  • shuffle – should shuffle or not, default True.
Returns:

dataset_features and dataset labels

SMP2018ECDTCorpus

class kashgari.corpus.SMP2018ECDTCorpus[source]

Bases: object

https://worksheets.codalab.org/worksheets/0x27203f932f8341b79841d50ce0fd684f/

This is a Chinese human-computer dialogue dataset released for task 1 of the Evaluation of Chinese Human-Computer Dialogue Technology (SMP2018-ECDT) and provided by the iFLYTEK Corporation.

Sample:

      label           query
0   weather        今天东莞天气如何
1       map  从观音桥到重庆市图书馆怎么走
2  cookbook          鸭蛋怎么腌?
3    health         怎么治疗牛皮癣
4      chat             唠什么

Example

>>> from kashgari.corpus import SMP2018ECDTCorpus
>>> train_x, train_y = SMP2018ECDTCorpus.load_data('train')
>>> test_x, test_y = SMP2018ECDTCorpus.load_data('test')
>>> valid_x, valid_y = SMP2018ECDTCorpus.load_data('valid')
>>> print(train_x)
[['听', '新', '闻', '。'], ['电', '视', '台', '在', '播', '什', '么'], ...]
>>> print(train_y)
['news', 'epg', ...]
__init__

Initialize self. See help(type(self)) for accurate signature.

classmethod load_data(subset_name: str = 'train', shuffle: bool = True, cutter: str = 'char') → Tuple[List[List[str]], List[str]][source]

Load dataset as sequence classification format, char level tokenized

Parameters:
  • subset_name – {train, test, valid}
  • shuffle – should shuffle or not, default True.
  • cutter – sentence cutter, {char, jieba}
Returns:

dataset_features and dataset labels

JigsawToxicCommentCorpus

class kashgari.corpus.JigsawToxicCommentCorpus(corpus_train_csv_path: str, sample_count: int = None, tokenizer: kashgari.tokenizers.base_tokenizer.Tokenizer = None)[source]

Bases: object

Kaggle Toxic Comment Classification Challenge corpus

You need to download corpus from https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/overview to a folder. Then init a JigsawToxicCommentCorpus object with train.csv path.

Examples

>>> from kashgari.corpus import JigsawToxicCommentCorpus
>>> corpus = JigsawToxicCommentCorpus('<train.csv file-path>')
>>> train_x, train_y = corpus.load_data('train')
>>> test_x, test_y = corpus.load_data('test')
>>> print(train_x)
[['Please', 'stop', 'being', 'a', 'penis—', 'and', 'Grow', 'Up', 'Regards-'], ...]
>>> print(train_y)
[['obscene', 'insult'], ...]
__init__(corpus_train_csv_path: str, sample_count: int = None, tokenizer: kashgari.tokenizers.base_tokenizer.Tokenizer = None) → None[source]

Initialize self. See help(type(self)) for accurate signature.

load_data(subset_name: str = 'train', shuffle: bool = True) → Tuple[List[List[str]], List[List[str]]][source]

Load dataset as sequence labeling format, char level tokenized

Parameters:
  • subset_name – {train, test, valid}
  • shuffle – should shuffle or not, default True.
Returns:

dataset_features and dataset labels

Embeddings

BareEmbedding

class kashgari.embeddings.BareEmbedding(embedding_size: int = 100, **kwargs)[source]

Bases: kashgari.embeddings.abc_embedding.ABCEmbedding

BareEmbedding is a randomly initialized tf.keras.layers.Embedding layer for text sequence embedding, and it is the default embedding class for Kashgari models.

__init__(embedding_size: int = 100, **kwargs)[source]
Parameters:
  • embedding_size – Dimension of the dense embedding.
  • kwargs – additional params
build_embedding_model(*, vocab_size: int = None, force: bool = False, **kwargs) → None[source]
embed(sentences: List[List[str]], *, debug: bool = False) → numpy.ndarray

batch embed sentences

Parameters:
  • sentences – Sentence list to embed
  • debug – show debug info
Returns:

vectorized sentence list

get_seq_length_from_corpus(generators: List[kashgari.generators.CorpusGenerator], *, use_label: bool = False, cover_rate: float = 0.95) → int

Calculate proper sequence length according to the corpus

Parameters:
  • generators
  • use_label
  • cover_rate

Returns:

load_embed_vocab() → Optional[Dict[str, int]][source]

Load vocab dict from embedding layer

Returns:vocab dict or None
setup_text_processor(processor: kashgari.processors.abc_processor.ABCProcessor) → None
to_dict() → Dict[str, Any]

WordEmbedding

class kashgari.embeddings.WordEmbedding(w2v_path: str, *, w2v_kwargs: Dict[str, Any] = None, **kwargs)[source]

Bases: kashgari.embeddings.abc_embedding.ABCEmbedding

__init__(w2v_path: str, *, w2v_kwargs: Dict[str, Any] = None, **kwargs)[source]
Parameters:
  • w2v_path – Word2Vec file path.
  • w2v_kwargs – params pass to the load_word2vec_format() function of gensim.models.KeyedVectors
  • kwargs – additional params
build_embedding_model(*, vocab_size: int = None, force: bool = False, **kwargs) → None[source]
embed(sentences: List[List[str]], *, debug: bool = False) → numpy.ndarray

batch embed sentences

Parameters:
  • sentences – Sentence list to embed
  • debug – show debug info
Returns:

vectorized sentence list

get_seq_length_from_corpus(generators: List[kashgari.generators.CorpusGenerator], *, use_label: bool = False, cover_rate: float = 0.95) → int

Calculate proper sequence length according to the corpus

Parameters:
  • generators
  • use_label
  • cover_rate

Returns:

load_embed_vocab() → Optional[Dict[str, int]][source]

Load vocab dict from embedding layer

Returns:vocab dict or None
setup_text_processor(processor: kashgari.processors.abc_processor.ABCProcessor) → None
to_dict() → Dict[str, Any][source]

TransformerEmbedding

class kashgari.embeddings.TransformerEmbedding(vocab_path: str, config_path: str, checkpoint_path: str, model_type: str = 'bert', **kwargs)[source]

Bases: kashgari.embeddings.abc_embedding.ABCEmbedding

TransformerEmbedding is based on bert4keras. The embeddings themselves are wrapped in our simple embedding interface so that they can be used like any other embedding.

__init__(vocab_path: str, config_path: str, checkpoint_path: str, model_type: str = 'bert', **kwargs)[source]
Parameters:
  • vocab_path – vocab file path, example vocab.txt
  • config_path – model config path, example config.json
  • checkpoint_path – model weight path, example model.ckpt-100000
  • model_type – transformer model type, one of {bert, albert, nezha, gpt2_ml, t5}
  • kwargs – additional params
build_embedding_model(*, vocab_size: int = None, force: bool = False, **kwargs) → None[source]
embed(sentences: List[List[str]], *, debug: bool = False) → numpy.ndarray

batch embed sentences

Parameters:
  • sentences – Sentence list to embed
  • debug – show debug info
Returns:

vectorized sentence list

get_seq_length_from_corpus(generators: List[kashgari.generators.CorpusGenerator], *, use_label: bool = False, cover_rate: float = 0.95) → int

Calculate proper sequence length according to the corpus

Parameters:
  • generators
  • use_label
  • cover_rate

Returns:

load_embed_vocab() → Optional[Dict[str, int]][source]

Load vocab dict from embedding layer

Returns:vocab dict or None
setup_text_processor(processor: kashgari.processors.abc_processor.ABCProcessor) → None
to_dict() → Dict[str, Any][source]

BertEmbedding

class kashgari.embeddings.BertEmbedding(model_folder: str, **kwargs)[source]

Bases: kashgari.embeddings.transformer_embedding.TransformerEmbedding

BertEmbedding is a simple wrapper class around TransformerEmbedding. If you need to load other kinds of transformer-based language models, please use TransformerEmbedding.

__init__(model_folder: str, **kwargs)[source]
Parameters:
  • model_folder – path of checkpoint folder.
  • kwargs – additional params
build_embedding_model(*, vocab_size: int = None, force: bool = False, **kwargs) → None
embed(sentences: List[List[str]], *, debug: bool = False) → numpy.ndarray

batch embed sentences

Parameters:
  • sentences – Sentence list to embed
  • debug – show debug info
Returns:

vectorized sentence list

get_seq_length_from_corpus(generators: List[kashgari.generators.CorpusGenerator], *, use_label: bool = False, cover_rate: float = 0.95) → int

Calculate proper sequence length according to the corpus

Parameters:
  • generators
  • use_label
  • cover_rate

Returns:

load_embed_vocab() → Optional[Dict[str, int]]

Load vocab dict from embedding layer

Returns:vocab dict or None
setup_text_processor(processor: kashgari.processors.abc_processor.ABCProcessor) → None
to_dict() → Dict[str, Any][source]

Classification Models

Bidirectional LSTM Model

class kashgari.tasks.classification.BiLSTM_Model(embedding: kashgari.embeddings.abc_embedding.ABCEmbedding = None, *, sequence_length: int = None, hyper_parameters: Dict[str, Dict[str, Any]] = None, multi_label: bool = False, text_processor: kashgari.processors.abc_processor.ABCProcessor = None, label_processor: kashgari.processors.abc_processor.ABCProcessor = None)[source]

Bases: kashgari.tasks.classification.abc_model.ABCClassificationModel

__init__(embedding: kashgari.embeddings.abc_embedding.ABCEmbedding = None, *, sequence_length: int = None, hyper_parameters: Dict[str, Dict[str, Any]] = None, multi_label: bool = False, text_processor: kashgari.processors.abc_processor.ABCProcessor = None, label_processor: kashgari.processors.abc_processor.ABCProcessor = None)
Parameters:
  • embedding – embedding object
  • sequence_length – target sequence length
  • hyper_parameters – hyper_parameters to overwrite
  • multi_label – is multi-label classification
  • text_processor – text processor
  • label_processor – label processor
build_model(x_train: List[List[str]], y_train: Union[List[str], List[List[str]], List[Tuple[str]]]) → None

Build Model with x_data and y_data

This function will set up a CorpusGenerator,
then call ABCClassificationModel.build_model_generator to prepare the processor and the model.
Parameters:
  • x_train
  • y_train

Returns:

build_model_arc() → None[source]
build_model_generator(generators: List[kashgari.generators.CorpusGenerator]) → None
compile_model(loss: Any = None, optimizer: Any = None, metrics: Any = None, **kwargs) → None

Configures the model for training. Calls tf.keras.Model.compile() to compile the model with custom loss, optimizer and metrics.

Examples

>>> model = BiLSTM_Model()
# Build model with corpus
>>> model.build_model(train_x, train_y)
# Compile model with custom loss, optimizer and metrics
>>> model.compile_model(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
Parameters:
  • loss – name of objective function, objective function or tf.keras.losses.Loss instance.
  • optimizer – name of optimizer or optimizer instance.
  • metrics (object) – List of metrics to be evaluated by the model during training and testing.
  • **kwargs – additional params passed to tf.keras.Model.compile().
classmethod default_hyper_parameters() → Dict[str, Dict[str, Any]][source]

The default hyper parameters of the model dict, all models must implement this function.

You could easily change model’s hyper-parameters.

For example, change the LSTM unit in BiLSTM_Model from 128 to 32.

>>> from kashgari.tasks.classification import BiLSTM_Model
>>> hyper = BiLSTM_Model.default_hyper_parameters()
>>> print(hyper)
{'layer_bi_lstm': {'units': 128, 'return_sequences': False}, 'layer_output': {}}
>>> hyper['layer_bi_lstm']['units'] = 32
>>> model = BiLSTM_Model(hyper_parameters=hyper)
Returns:hyper params dict
evaluate(x_data: List[List[str]], y_data: Union[List[str], List[List[str]], List[Tuple[str]]], *, batch_size: int = 32, digits: int = 4, multi_label_threshold: float = 0.5, truncating: bool = False) → Dict[KT, VT]
fit(x_train: List[List[str]], y_train: Union[List[str], List[List[str]], List[Tuple[str]]], x_validate: List[List[str]] = None, y_validate: Union[List[str], List[List[str]], List[Tuple[str]]] = None, *, batch_size: int = 64, epochs: int = 5, callbacks: List[keras.callbacks.Callback] = None, fit_kwargs: Dict[KT, VT] = None) → tensorflow.python.keras.callbacks.History

Trains the model for a given number of epochs with given data set list.

Parameters:
  • x_train – Array of train feature data (if the model has a single input), or tuple of train feature data array (if the model has multiple inputs)
  • y_train – Array of train label data
  • x_validate – Array of validation feature data (if the model has a single input), or tuple of validation feature data array (if the model has multiple inputs)
  • y_validate – Array of validation label data
  • batch_size – Number of samples per gradient update, default to 64.
  • epochs – Number of epochs to train the model. An epoch is an iteration over the entire x and y data provided.
  • callbacks – List of tf.keras.callbacks.Callback instances. List of callbacks to apply during training. See tf.keras.callbacks.
  • fit_kwargs – fit_kwargs: additional arguments passed to tf.keras.Model.fit()
Returns:

A tf.keras.callback.History object. Its History.history attribute is a record of training loss values and metrics values at successive epochs, as well as validation loss values and validation metrics values (if applicable).

fit_generator(train_sample_gen: kashgari.generators.CorpusGenerator, valid_sample_gen: kashgari.generators.CorpusGenerator = None, *, batch_size: int = 64, epochs: int = 5, callbacks: List[keras.callbacks.Callback] = None, fit_kwargs: Dict[KT, VT] = None) → tensorflow.python.keras.callbacks.History

Trains the model for a given number of epochs with given data generator.

Data generator must be the subclass of CorpusGenerator

Parameters:
  • train_sample_gen – train data generator.
  • valid_sample_gen – valid data generator.
  • batch_size – Number of samples per gradient update, default to 64.
  • epochs – Number of epochs to train the model. An epoch is an iteration over the entire x and y data provided.
  • callbacks – List of tf.keras.callbacks.Callback instances. List of callbacks to apply during training. See tf.keras.callbacks.
  • fit_kwargs – fit_kwargs: additional arguments passed to tf.keras.Model.fit()
Returns:

A tf.keras.callback.History object. Its History.history attribute is a record of training loss values and metrics values at successive epochs, as well as validation loss values and validation metrics values (if applicable).

classmethod load_model(model_path: str) → Union[ABCLabelingModel, ABCClassificationModel]
predict(x_data: List[List[str]], *, batch_size: int = 32, truncating: bool = False, multi_label_threshold: float = 0.5, predict_kwargs: Dict[KT, VT] = None) → Union[List[str], List[List[str]], List[Tuple[str]]]

Generates output predictions for the input samples.

Computation is done in batches.

Parameters:
  • x_data – The input data, as a Numpy array (or list of Numpy arrays if the model has multiple inputs).
  • batch_size – Integer. If unspecified, it will default to 32.
  • truncating – remove values from sequences larger than model.embedding.sequence_length
  • multi_label_threshold
  • predict_kwargs – arguments passed to predict() function of tf.keras.Model
Returns:

array(s) of predictions.

save(model_path: str) → str

Save the model to the given model_path.

to_dict() → Dict[KT, VT]

Bidirectional GRU Model

class kashgari.tasks.classification.BiGRU_Model(embedding: kashgari.embeddings.abc_embedding.ABCEmbedding = None, *, sequence_length: int = None, hyper_parameters: Dict[str, Dict[str, Any]] = None, multi_label: bool = False, text_processor: kashgari.processors.abc_processor.ABCProcessor = None, label_processor: kashgari.processors.abc_processor.ABCProcessor = None)[source]

Bases: kashgari.tasks.classification.abc_model.ABCClassificationModel

__init__(embedding: kashgari.embeddings.abc_embedding.ABCEmbedding = None, *, sequence_length: int = None, hyper_parameters: Dict[str, Dict[str, Any]] = None, multi_label: bool = False, text_processor: kashgari.processors.abc_processor.ABCProcessor = None, label_processor: kashgari.processors.abc_processor.ABCProcessor = None)
Parameters:
  • embedding – embedding object
  • sequence_length – target sequence length
  • hyper_parameters – hyper_parameters to overwrite
  • multi_label – is multi-label classification
  • text_processor – text processor
  • label_processor – label processor
build_model(x_train: List[List[str]], y_train: Union[List[str], List[List[str]], List[Tuple[str]]]) → None

Build Model with x_data and y_data

This function will set up a CorpusGenerator,
then call ABCClassificationModel.build_model_generator to prepare the processor and the model.
Parameters:
  • x_train
  • y_train

Returns:

build_model_arc() → None[source]
build_model_generator(generators: List[kashgari.generators.CorpusGenerator]) → None
compile_model(loss: Any = None, optimizer: Any = None, metrics: Any = None, **kwargs) → None

Configures the model for training. Calls tf.keras.Model.compile() to compile the model with custom loss, optimizer and metrics.

Examples

>>> model = BiLSTM_Model()
# Build model with corpus
>>> model.build_model(train_x, train_y)
# Compile model with custom loss, optimizer and metrics
>>> model.compile_model(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
Parameters:
  • loss – name of objective function, objective function or tf.keras.losses.Loss instance.
  • optimizer – name of optimizer or optimizer instance.
  • metrics (object) – List of metrics to be evaluated by the model during training and testing.
  • **kwargs – additional params passed to tf.keras.Model.compile().
classmethod default_hyper_parameters() → Dict[str, Dict[str, Any]][source]

The default hyper parameters of the model dict, all models must implement this function.

You could easily change model’s hyper-parameters.

For example, change the LSTM unit in BiLSTM_Model from 128 to 32.

>>> from kashgari.tasks.classification import BiLSTM_Model
>>> hyper = BiLSTM_Model.default_hyper_parameters()
>>> print(hyper)
{'layer_bi_lstm': {'units': 128, 'return_sequences': False}, 'layer_output': {}}
>>> hyper['layer_bi_lstm']['units'] = 32
>>> model = BiLSTM_Model(hyper_parameters=hyper)
Returns:hyper params dict
evaluate(x_data: List[List[str]], y_data: Union[List[str], List[List[str]], List[Tuple[str]]], *, batch_size: int = 32, digits: int = 4, multi_label_threshold: float = 0.5, truncating: bool = False) → Dict[KT, VT]
fit(x_train: List[List[str]], y_train: Union[List[str], List[List[str]], List[Tuple[str]]], x_validate: List[List[str]] = None, y_validate: Union[List[str], List[List[str]], List[Tuple[str]]] = None, *, batch_size: int = 64, epochs: int = 5, callbacks: List[keras.callbacks.Callback] = None, fit_kwargs: Dict[KT, VT] = None) → tensorflow.python.keras.callbacks.History

Trains the model for a given number of epochs with given data set list.

Parameters:
  • x_train – Array of train feature data (if the model has a single input), or tuple of train feature data array (if the model has multiple inputs)
  • y_train – Array of train label data
  • x_validate – Array of validation feature data (if the model has a single input), or tuple of validation feature data array (if the model has multiple inputs)
  • y_validate – Array of validation label data
  • batch_size – Number of samples per gradient update, default to 64.
  • epochs – Number of epochs to train the model. An epoch is an iteration over the entire x and y data provided.
  • callbacks – List of tf.keras.callbacks.Callback instances. List of callbacks to apply during training. See tf.keras.callbacks.
  • fit_kwargs – fit_kwargs: additional arguments passed to tf.keras.Model.fit()
Returns:

A tf.keras.callback.History object. Its History.history attribute is a record of training loss values and metrics values at successive epochs, as well as validation loss values and validation metrics values (if applicable).

fit_generator(train_sample_gen: kashgari.generators.CorpusGenerator, valid_sample_gen: kashgari.generators.CorpusGenerator = None, *, batch_size: int = 64, epochs: int = 5, callbacks: List[keras.callbacks.Callback] = None, fit_kwargs: Dict[KT, VT] = None) → tensorflow.python.keras.callbacks.History

Trains the model for a given number of epochs with given data generator.

Data generator must be the subclass of CorpusGenerator

Parameters:
  • train_sample_gen – train data generator.
  • valid_sample_gen – valid data generator.
  • batch_size – Number of samples per gradient update, default to 64.
  • epochs – Number of epochs to train the model. An epoch is an iteration over the entire x and y data provided.
  • callbacks – List of tf.keras.callbacks.Callback instances. List of callbacks to apply during training. See tf.keras.callbacks.
  • fit_kwargs – fit_kwargs: additional arguments passed to tf.keras.Model.fit()
Returns:

A tf.keras.callback.History object. Its History.history attribute is a record of training loss values and metrics values at successive epochs, as well as validation loss values and validation metrics values (if applicable).

classmethod load_model(model_path: str) → Union[ABCLabelingModel, ABCClassificationModel]
predict(x_data: List[List[str]], *, batch_size: int = 32, truncating: bool = False, multi_label_threshold: float = 0.5, predict_kwargs: Dict[KT, VT] = None) → Union[List[str], List[List[str]], List[Tuple[str]]]

Generates output predictions for the input samples.

Computation is done in batches.

Parameters:
  • x_data – The input data, as a Numpy array (or list of Numpy arrays if the model has multiple inputs).
  • batch_size – Integer. If unspecified, it will default to 32.
  • truncating – remove values from sequences larger than model.embedding.sequence_length
  • multi_label_threshold
  • predict_kwargs – arguments passed to predict() function of tf.keras.Model
Returns:

array(s) of predictions.

save(model_path: str) → str

Save the model to the given model_path.

to_dict() → Dict[KT, VT]

CNN Model

class kashgari.tasks.classification.CNN_Model(embedding: kashgari.embeddings.abc_embedding.ABCEmbedding = None, *, sequence_length: int = None, hyper_parameters: Dict[str, Dict[str, Any]] = None, multi_label: bool = False, text_processor: kashgari.processors.abc_processor.ABCProcessor = None, label_processor: kashgari.processors.abc_processor.ABCProcessor = None)[source]

Bases: kashgari.tasks.classification.abc_model.ABCClassificationModel

__init__(embedding: kashgari.embeddings.abc_embedding.ABCEmbedding = None, *, sequence_length: int = None, hyper_parameters: Dict[str, Dict[str, Any]] = None, multi_label: bool = False, text_processor: kashgari.processors.abc_processor.ABCProcessor = None, label_processor: kashgari.processors.abc_processor.ABCProcessor = None)
Parameters:
  • embedding – embedding object
  • sequence_length – target sequence length
  • hyper_parameters – hyper_parameters to overwrite
  • multi_label – is multi-label classification
  • text_processor – text processor
  • label_processor – label processor
build_model(x_train: List[List[str]], y_train: Union[List[str], List[List[str]], List[Tuple[str]]]) → None

Build Model with x_data and y_data

This function will set up a CorpusGenerator,
then call ABCClassificationModel.build_model_generator to prepare the processor and the model.
Parameters:
  • x_train
  • y_train

Returns:

build_model_arc() → None[source]
build_model_generator(generators: List[kashgari.generators.CorpusGenerator]) → None
compile_model(loss: Any = None, optimizer: Any = None, metrics: Any = None, **kwargs) → None

Configures the model for training. Calls tf.keras.Model.compile() to compile the model with custom loss, optimizer and metrics.

Examples

>>> model = BiLSTM_Model()
# Build model with corpus
>>> model.build_model(train_x, train_y)
# Compile model with custom loss, optimizer and metrics
>>> model.compile_model(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
Parameters:
  • loss – name of objective function, objective function or tf.keras.losses.Loss instance.
  • optimizer – name of optimizer or optimizer instance.
  • metrics (object) – List of metrics to be evaluated by the model during training and testing.
  • **kwargs – additional params passed to tf.keras.Model.compile().
classmethod default_hyper_parameters() → Dict[str, Dict[str, Any]][source]

The default hyper-parameters of the model, as a dict. All models must implement this function.

You could easily change a model's hyper-parameters.

For example, change the LSTM units in BiLSTM_Model from 128 to 32.

>>> from kashgari.tasks.classification import BiLSTM_Model
>>> hyper = BiLSTM_Model.default_hyper_parameters()
>>> print(hyper)
{'layer_bi_lstm': {'units': 128, 'return_sequences': False}, 'layer_output': {}}
>>> hyper['layer_bi_lstm']['units'] = 32
>>> model = BiLSTM_Model(hyper_parameters=hyper)
Returns: hyper params dict
evaluate(x_data: List[List[str]], y_data: Union[List[str], List[List[str]], List[Tuple[str]]], *, batch_size: int = 32, digits: int = 4, multi_label_threshold: float = 0.5, truncating: bool = False) → Dict[KT, VT]
fit(x_train: List[List[str]], y_train: Union[List[str], List[List[str]], List[Tuple[str]]], x_validate: List[List[str]] = None, y_validate: Union[List[str], List[List[str]], List[Tuple[str]]] = None, *, batch_size: int = 64, epochs: int = 5, callbacks: List[keras.callbacks.Callback] = None, fit_kwargs: Dict[KT, VT] = None) → tensorflow.python.keras.callbacks.History

Trains the model for a given number of epochs with given data set list.

Parameters:
  • x_train – Array of train feature data (if the model has a single input), or tuple of train feature data array (if the model has multiple inputs)
  • y_train – Array of train label data
  • x_validate – Array of validation feature data (if the model has a single input), or tuple of validation feature data array (if the model has multiple inputs)
  • y_validate – Array of validation label data
  • batch_size – Number of samples per gradient update, default to 64.
  • epochs – Number of epochs to train the model. An epoch is an iteration over the entire x and y data provided.
  • callbacks – List of tf.keras.callbacks.Callback instances. List of callbacks to apply during training. See tf.keras.callbacks.
  • fit_kwargs – additional arguments passed to tf.keras.Model.fit()
Returns:

A tf.keras.callbacks.History object. Its History.history attribute is a record of training loss values and metrics values at successive epochs, as well as validation loss values and validation metrics values (if applicable).
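
A short sketch of training from in-memory lists (train_x/train_y/valid_x/valid_y are placeholder corpora, i.e. lists of tokenized sentences and their labels):

history = model.fit(train_x, train_y,
                    x_validate=valid_x, y_validate=valid_y,
                    batch_size=64, epochs=5)
# History.history holds the per-epoch loss and metric values.
print(history.history['loss'])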

fit_generator(train_sample_gen: kashgari.generators.CorpusGenerator, valid_sample_gen: kashgari.generators.CorpusGenerator = None, *, batch_size: int = 64, epochs: int = 5, callbacks: List[keras.callbacks.Callback] = None, fit_kwargs: Dict[KT, VT] = None) → tensorflow.python.keras.callbacks.History

Trains the model for a given number of epochs with given data generator.

Data generator must be a subclass of CorpusGenerator

Parameters:
  • train_sample_gen – train data generator.
  • valid_sample_gen – valid data generator.
  • batch_size – Number of samples per gradient update, default to 64.
  • epochs – Number of epochs to train the model. An epoch is an iteration over the entire x and y data provided.
  • callbacks – List of tf.keras.callbacks.Callback instances. List of callbacks to apply during training. See tf.keras.callbacks.
  • fit_kwargs – additional arguments passed to tf.keras.Model.fit()
Returns:

A tf.keras.callbacks.History object. Its History.history attribute is a record of training loss values and metrics values at successive epochs, as well as validation loss values and validation metrics values (if applicable).
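
For corpora that should be streamed instead of held as plain lists, the same training can be driven by generators. A hedged sketch, with train_x/train_y/valid_x/valid_y again as placeholders:

from kashgari.generators import CorpusGenerator

train_gen = CorpusGenerator(train_x, train_y)
valid_gen = CorpusGenerator(valid_x, valid_y)

history = model.fit_generator(train_gen,
                              valid_sample_gen=valid_gen,
                              batch_size=64,
                              epochs=5)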

classmethod load_model(model_path: str) → Union[ABCLabelingModel, ABCClassificationModel]
predict(x_data: List[List[str]], *, batch_size: int = 32, truncating: bool = False, multi_label_threshold: float = 0.5, predict_kwargs: Dict[KT, VT] = None) → Union[List[str], List[List[str]], List[Tuple[str]]]

Generates output predictions for the input samples.

Computation is done in batches.

Parameters:
  • x_data – The input data, as a Numpy array (or list of Numpy arrays if the model has multiple inputs).
  • batch_size – Integer. If unspecified, it will default to 32.
  • truncating – whether to truncate sequences longer than model.embedding.sequence_length
  • multi_label_threshold – probability threshold above which a class is included in the output; only used by multi-label models.
  • predict_kwargs – additional arguments passed to tf.keras.Model.predict()
Returns:

array(s) of predictions.

save(model_path: str) → str

Save model to the target path.

Parameters:
  • model_path – target path to save the model
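
A hedged save-and-reload sketch; 'cnn_model_dir' is an arbitrary target directory:

from kashgari.tasks.classification import CNN_Model

# Persist the trained model, then restore it with the classmethod documented above.
model.save('cnn_model_dir')
loaded = CNN_Model.load_model('cnn_model_dir')
loaded.predict([['a', 'tokenized', 'sample']])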

to_dict() → Dict[KT, VT]

Labeling Models

Bidirectional LSTM Model

class kashgari.tasks.labeling.BiLSTM_Model(embedding: kashgari.embeddings.abc_embedding.ABCEmbedding = None, sequence_length: int = None, hyper_parameters: Dict[str, Dict[str, Any]] = None)[source]

Bases: kashgari.tasks.labeling.abc_model.ABCLabelingModel

__init__(embedding: kashgari.embeddings.abc_embedding.ABCEmbedding = None, sequence_length: int = None, hyper_parameters: Dict[str, Dict[str, Any]] = None)
Parameters:
  • embedding – embedding object
  • sequence_length – target sequence length
  • hyper_parameters – hyper_parameters to overwrite
build_model(x_data: List[List[str]], y_data: List[List[str]]) → None

Build Model with x_data and y_data

This function will set up a CorpusGenerator, then call build_model_generator() to prepare the processor and model.
Parameters:
  • x_data
  • y_data

Returns:

build_model_arc() → None[source]
build_model_generator(generators: List[kashgari.generators.CorpusGenerator]) → None
compile_model(loss: Any = None, optimizer: Any = None, metrics: Any = None, **kwargs) → None

Configures the model for training. Calls tf.keras.Model.compile() under the hood to compile the model with a custom loss, optimizer and metrics.

Examples

>>> model = BiLSTM_Model()
# Build model with corpus
>>> model.build_model(train_x, train_y)
# Compile model with custom loss, optimizer and metrics
>>> model.compile_model(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
Parameters:
  • loss – name of objective function, objective function or tf.keras.losses.Loss instance.
  • optimizer – name of optimizer or optimizer instance.
  • metrics (object) – List of metrics to be evaluated by the model during training and testing.
  • kwargs – additional params passed to tf.keras.Model.compile().
classmethod default_hyper_parameters() → Dict[str, Dict[str, Any]][source]

The default hyper-parameters of the model, as a dict. All models must implement this function.

You could easily change a model's hyper-parameters.

For example, change the LSTM units in BiLSTM_Model from 128 to 32.

>>> from kashgari.tasks.classification import BiLSTM_Model
>>> hyper = BiLSTM_Model.default_hyper_parameters()
>>> print(hyper)
{'layer_bi_lstm': {'units': 128, 'return_sequences': False}, 'layer_output': {}}
>>> hyper['layer_bi_lstm']['units'] = 32
>>> model = BiLSTM_Model(hyper_parameters=hyper)
Returns: hyper params dict
evaluate(x_data: List[List[str]], y_data: List[List[str]], batch_size: int = 32, digits: int = 4, truncating: bool = False) → Dict[KT, VT]

Build a text report showing the main labeling metrics.

Parameters:
  • x_data
  • y_data
  • batch_size
  • digits
  • truncating
Returns:

A report dict

Example

>>> from kashgari.tasks.labeling import BiGRU_Model
>>> model = BiGRU_Model()
>>> model.fit(train_x, train_y, valid_x, valid_y)
>>> report = model.evaluate(test_x, test_y)
           precision    recall  f1-score   support
    <BLANKLINE>
          ORG     0.0665    0.1108    0.0831       984
          LOC     0.1870    0.2086    0.1972      1951
          PER     0.1685    0.0882    0.1158       884
    <BLANKLINE>
    micro avg     0.1384    0.1555    0.1465      3819
    macro avg     0.1516    0.1555    0.1490      3819
    <BLANKLINE>
>>> print(report)
    {
     'f1-score': 0.14895159934887792,
     'precision': 0.1516294012813676,
     'recall': 0.15553809897879026,
     'support': 3819,
     'detail': {'LOC': {'f1-score': 0.19718992248062014,
                        'precision': 0.18695452457510336,
                        'recall': 0.20861096873398258,
                        'support': 1951},
                'ORG': {'f1-score': 0.08307926829268293,
                        'precision': 0.06646341463414634,
                        'recall': 0.11077235772357724,
                        'support': 984},
                'PER': {'f1-score': 0.11581291759465479,
                        'precision': 0.16846652267818574,
                        'recall': 0.08823529411764706,
                        'support': 884}},
    }
fit(x_train: List[List[str]], y_train: List[List[str]], x_validate: List[List[str]] = None, y_validate: List[List[str]] = None, batch_size: int = 64, epochs: int = 5, callbacks: List[tensorflow.python.keras.callbacks.Callback] = None, fit_kwargs: Dict[KT, VT] = None) → tensorflow.python.keras.callbacks.History

Trains the model for a given number of epochs with given data set list.

Parameters:
  • x_train – Array of train feature data (if the model has a single input), or tuple of train feature data array (if the model has multiple inputs)
  • y_train – Array of train label data
  • x_validate – Array of validation feature data (if the model has a single input), or tuple of validation feature data array (if the model has multiple inputs)
  • y_validate – Array of validation label data
  • batch_size – Number of samples per gradient update, default to 64.
  • epochs – Number of epochs to train the model. An epoch is an iteration over the entire x and y data provided.
  • callbacks – List of tf.keras.callbacks.Callback instances. List of callbacks to apply during training. See tf.keras.callbacks.
  • fit_kwargs – additional arguments passed to tf.keras.Model.fit()
Returns:

A tf.keras.callbacks.History object. Its History.history attribute is a record of training loss values and metrics values at successive epochs, as well as validation loss values and validation metrics values (if applicable).

fit_generator(train_sample_gen: kashgari.generators.CorpusGenerator, valid_sample_gen: kashgari.generators.CorpusGenerator = None, batch_size: int = 64, epochs: int = 5, callbacks: List[tf.keras.callbacks.Callback] = None, fit_kwargs: Dict[KT, VT] = None) → tensorflow.python.keras.callbacks.History

Trains the model for a given number of epochs with given data generator.

Data generator must be a subclass of CorpusGenerator

Parameters:
  • train_sample_gen – train data generator.
  • valid_sample_gen – valid data generator.
  • batch_size – Number of samples per gradient update, default to 64.
  • epochs – Number of epochs to train the model. An epoch is an iteration over the entire x and y data provided.
  • callbacks – List of tf.keras.callbacks.Callback instances. List of callbacks to apply during training. See tf.keras.callbacks.
  • fit_kwargs – additional arguments passed to tf.keras.Model.fit()
Returns:

A tf.keras.callbacks.History object. Its History.history attribute is a record of training loss values and metrics values at successive epochs, as well as validation loss values and validation metrics values (if applicable).

classmethod load_model(model_path: str) → Union[ABCLabelingModel, ABCClassificationModel]
predict(x_data: List[List[str]], *, batch_size: int = 32, truncating: bool = False, predict_kwargs: Dict[KT, VT] = None) → List[List[str]]

Generates output predictions for the input samples.

Computation is done in batches.

Parameters:
  • x_data – The input data, as a Numpy array (or list of Numpy arrays if the model has multiple inputs).
  • batch_size – Integer. If unspecified, it will default to 32.
  • truncating – remove values from sequences larger than model.embedding.sequence_length
  • predict_kwargs – arguments passed to tf.keras.Model.predict()
Returns:

array(s) of predictions.

predict_entities(x_data: List[List[str]], batch_size: int = 32, join_chunk: str = ' ', truncating: bool = False, predict_kwargs: Dict[KT, VT] = None) → List[Dict[KT, VT]]

Gets entities from sequence.

Parameters:
  • x_data – The input data, as a Numpy array (or list of Numpy arrays if the model has multiple inputs).
  • batch_size – Integer. If unspecified, it will default to 32.
  • truncating – whether to truncate sequences longer than model.embedding.sequence_length
  • join_chunk – string used to join the tokens of each entity chunk, or False to keep the chunk as a token list
  • predict_kwargs – arguments passed to tf.keras.Model.predict()
Returns:

list of entity.

Return type:

list
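
A hedged sketch of extracting entities with a trained labeling model (`model` and the character-level token list are placeholders; the exact keys of each entity dict depend on the Kashgari version):

tokens = list('我在北京工作')  # character-level Chinese input
entities = model.predict_entities([tokens], join_chunk='')

# One dict per input sample, describing the entities found in it.
print(entities[0])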

save(model_path: str) → str

Save model to the target path.

Parameters:
  • model_path – target path to save the model

to_dict() → Dict[str, Any]

Bidirectional GRU Model

class kashgari.tasks.labeling.BiGRU_Model(embedding: kashgari.embeddings.abc_embedding.ABCEmbedding = None, sequence_length: int = None, hyper_parameters: Dict[str, Dict[str, Any]] = None)[source]

Bases: kashgari.tasks.labeling.abc_model.ABCLabelingModel

__init__(embedding: kashgari.embeddings.abc_embedding.ABCEmbedding = None, sequence_length: int = None, hyper_parameters: Dict[str, Dict[str, Any]] = None)
Parameters:
  • embedding – embedding object
  • sequence_length – target sequence length
  • hyper_parameters – hyper_parameters to overwrite
build_model(x_data: List[List[str]], y_data: List[List[str]]) → None

Build Model with x_data and y_data

This function will set up a CorpusGenerator, then call build_model_generator() to prepare the processor and model.
Parameters:
  • x_data
  • y_data

Returns:

build_model_arc() → None[source]
build_model_generator(generators: List[kashgari.generators.CorpusGenerator]) → None
compile_model(loss: Any = None, optimizer: Any = None, metrics: Any = None, **kwargs) → None

Configures the model for training. Calls tf.keras.Model.compile() under the hood to compile the model with a custom loss, optimizer and metrics.

Examples

>>> model = BiLSTM_Model()
# Build model with corpus
>>> model.build_model(train_x, train_y)
# Compile model with custom loss, optimizer and metrics
>>> model.compile_model(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
Parameters:
  • loss – name of objective function, objective function or tf.keras.losses.Loss instance.
  • optimizer – name of optimizer or optimizer instance.
  • metrics (object) – List of metrics to be evaluated by the model during training and testing.
  • kwargs – additional params passed to tf.keras.Model.compile().
classmethod default_hyper_parameters() → Dict[str, Dict[str, Any]][source]

The default hyper-parameters of the model, as a dict. All models must implement this function.

You could easily change a model's hyper-parameters.

For example, change the LSTM units in BiLSTM_Model from 128 to 32.

>>> from kashgari.tasks.classification import BiLSTM_Model
>>> hyper = BiLSTM_Model.default_hyper_parameters()
>>> print(hyper)
{'layer_bi_lstm': {'units': 128, 'return_sequences': False}, 'layer_output': {}}
>>> hyper['layer_bi_lstm']['units'] = 32
>>> model = BiLSTM_Model(hyper_parameters=hyper)
Returns: hyper params dict
evaluate(x_data: List[List[str]], y_data: List[List[str]], batch_size: int = 32, digits: int = 4, truncating: bool = False) → Dict[KT, VT]

Build a text report showing the main labeling metrics.

Parameters:
  • x_data
  • y_data
  • batch_size
  • digits
  • truncating
Returns:

A report dict

Example

>>> from kashgari.tasks.labeling import BiGRU_Model
>>> model = BiGRU_Model()
>>> model.fit(train_x, train_y, valid_x, valid_y)
>>> report = model.evaluate(test_x, test_y)
           precision    recall  f1-score   support
    <BLANKLINE>
          ORG     0.0665    0.1108    0.0831       984
          LOC     0.1870    0.2086    0.1972      1951
          PER     0.1685    0.0882    0.1158       884
    <BLANKLINE>
    micro avg     0.1384    0.1555    0.1465      3819
    macro avg     0.1516    0.1555    0.1490      3819
    <BLANKLINE>
>>> print(report)
    {
     'f1-score': 0.14895159934887792,
     'precision': 0.1516294012813676,
     'recall': 0.15553809897879026,
     'support': 3819,
     'detail': {'LOC': {'f1-score': 0.19718992248062014,
                        'precision': 0.18695452457510336,
                        'recall': 0.20861096873398258,
                        'support': 1951},
                'ORG': {'f1-score': 0.08307926829268293,
                        'precision': 0.06646341463414634,
                        'recall': 0.11077235772357724,
                        'support': 984},
                'PER': {'f1-score': 0.11581291759465479,
                        'precision': 0.16846652267818574,
                        'recall': 0.08823529411764706,
                        'support': 884}},
    }
fit(x_train: List[List[str]], y_train: List[List[str]], x_validate: List[List[str]] = None, y_validate: List[List[str]] = None, batch_size: int = 64, epochs: int = 5, callbacks: List[tensorflow.python.keras.callbacks.Callback] = None, fit_kwargs: Dict[KT, VT] = None) → tensorflow.python.keras.callbacks.History

Trains the model for a given number of epochs with given data set list.

Parameters:
  • x_train – Array of train feature data (if the model has a single input), or tuple of train feature data array (if the model has multiple inputs)
  • y_train – Array of train label data
  • x_validate – Array of validation feature data (if the model has a single input), or tuple of validation feature data array (if the model has multiple inputs)
  • y_validate – Array of validation label data
  • batch_size – Number of samples per gradient update, default to 64.
  • epochs – Number of epochs to train the model. An epoch is an iteration over the entire x and y data provided.
  • callbacks – List of tf.keras.callbacks.Callback instances. List of callbacks to apply during training. See tf.keras.callbacks.
  • fit_kwargs – additional arguments passed to tf.keras.Model.fit()
Returns:

A tf.keras.callbacks.History object. Its History.history attribute is a record of training loss values and metrics values at successive epochs, as well as validation loss values and validation metrics values (if applicable).

fit_generator(train_sample_gen: kashgari.generators.CorpusGenerator, valid_sample_gen: kashgari.generators.CorpusGenerator = None, batch_size: int = 64, epochs: int = 5, callbacks: List[tf.keras.callbacks.Callback] = None, fit_kwargs: Dict[KT, VT] = None) → tensorflow.python.keras.callbacks.History

Trains the model for a given number of epochs with given data generator.

Data generator must be a subclass of CorpusGenerator

Parameters:
  • train_sample_gen – train data generator.
  • valid_sample_gen – valid data generator.
  • batch_size – Number of samples per gradient update, default to 64.
  • epochs – Number of epochs to train the model. An epoch is an iteration over the entire x and y data provided.
  • callbacks – List of tf.keras.callbacks.Callback instances. List of callbacks to apply during training. See tf.keras.callbacks.
  • fit_kwargs – additional arguments passed to tf.keras.Model.fit()
Returns:

A tf.keras.callbacks.History object. Its History.history attribute is a record of training loss values and metrics values at successive epochs, as well as validation loss values and validation metrics values (if applicable).

classmethod load_model(model_path: str) → Union[ABCLabelingModel, ABCClassificationModel]
predict(x_data: List[List[str]], *, batch_size: int = 32, truncating: bool = False, predict_kwargs: Dict[KT, VT] = None) → List[List[str]]

Generates output predictions for the input samples.

Computation is done in batches.

Parameters:
  • x_data – The input data, as a Numpy array (or list of Numpy arrays if the model has multiple inputs).
  • batch_size – Integer. If unspecified, it will default to 32.
  • truncating – remove values from sequences larger than model.embedding.sequence_length
  • predict_kwargs – arguments passed to tf.keras.Model.predict()
Returns:

array(s) of predictions.

predict_entities(x_data: List[List[str]], batch_size: int = 32, join_chunk: str = ' ', truncating: bool = False, predict_kwargs: Dict[KT, VT] = None) → List[Dict[KT, VT]]

Gets entities from sequence.

Parameters:
  • x_data – The input data, as a Numpy array (or list of Numpy arrays if the model has multiple inputs).
  • batch_size – Integer. If unspecified, it will default to 32.
  • truncating – whether to truncate sequences longer than model.embedding.sequence_length
  • join_chunk – string used to join the tokens of each entity chunk, or False to keep the chunk as a token list
  • predict_kwargs – arguments passed to tf.keras.Model.predict()
Returns:

list of entity.

Return type:

list

save(model_path: str) → str

Save model to the target path.

Parameters:
  • model_path – target path to save the model

to_dict() → Dict[str, Any]

Bidirectional LSTM CRF Model

class kashgari.tasks.labeling.BiLSTM_CRF_Model(embedding: kashgari.embeddings.abc_embedding.ABCEmbedding = None, sequence_length: int = None, hyper_parameters: Dict[str, Dict[str, Any]] = None)[source]

Bases: kashgari.tasks.labeling.abc_model.ABCLabelingModel

__init__(embedding: kashgari.embeddings.abc_embedding.ABCEmbedding = None, sequence_length: int = None, hyper_parameters: Dict[str, Dict[str, Any]] = None)
Parameters:
  • embedding – embedding object
  • sequence_length – target sequence length
  • hyper_parameters – hyper_parameters to overwrite
build_model(x_data: List[List[str]], y_data: List[List[str]]) → None

Build Model with x_data and y_data

This function will set up a CorpusGenerator, then call build_model_generator() to prepare the processor and model.
Parameters:
  • x_data
  • y_data

Returns:

build_model_arc() → None[source]
build_model_generator(generators: List[kashgari.generators.CorpusGenerator]) → None
compile_model(loss: Any = None, optimizer: Any = None, metrics: Any = None, **kwargs) → None[source]

Configures the model for training. Calls tf.keras.Model.compile() under the hood to compile the model with a custom loss, optimizer and metrics.

Examples

>>> model = BiLSTM_Model()
# Build model with corpus
>>> model.build_model(train_x, train_y)
# Compile model with custom loss, optimizer and metrics
>>> model.compile_model(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
Parameters:
  • loss – name of objective function, objective function or tf.keras.losses.Loss instance.
  • optimizer – name of optimizer or optimizer instance.
  • metrics (object) – List of metrics to be evaluated by the model during training and testing.
  • kwargs – additional params passed to tf.keras.Model.compile().
classmethod default_hyper_parameters() → Dict[str, Dict[str, Any]][source]

The default hyper-parameters of the model, as a dict. All models must implement this function.

You could easily change a model's hyper-parameters.

For example, change the LSTM units in BiLSTM_Model from 128 to 32.

>>> from kashgari.tasks.classification import BiLSTM_Model
>>> hyper = BiLSTM_Model.default_hyper_parameters()
>>> print(hyper)
{'layer_bi_lstm': {'units': 128, 'return_sequences': False}, 'layer_output': {}}
>>> hyper['layer_bi_lstm']['units'] = 32
>>> model = BiLSTM_Model(hyper_parameters=hyper)
Returns: hyper params dict
evaluate(x_data: List[List[str]], y_data: List[List[str]], batch_size: int = 32, digits: int = 4, truncating: bool = False) → Dict[KT, VT]

Build a text report showing the main labeling metrics.

Parameters:
  • x_data
  • y_data
  • batch_size
  • digits
  • truncating
Returns:

A report dict

Example

>>> from kashgari.tasks.labeling import BiGRU_Model
>>> model = BiGRU_Model()
>>> model.fit(train_x, train_y, valid_x, valid_y)
>>> report = model.evaluate(test_x, test_y)
           precision    recall  f1-score   support
    <BLANKLINE>
          ORG     0.0665    0.1108    0.0831       984
          LOC     0.1870    0.2086    0.1972      1951
          PER     0.1685    0.0882    0.1158       884
    <BLANKLINE>
    micro avg     0.1384    0.1555    0.1465      3819
    macro avg     0.1516    0.1555    0.1490      3819
    <BLANKLINE>
>>> print(report)
    {
     'f1-score': 0.14895159934887792,
     'precision': 0.1516294012813676,
     'recall': 0.15553809897879026,
     'support': 3819,
     'detail': {'LOC': {'f1-score': 0.19718992248062014,
                        'precision': 0.18695452457510336,
                        'recall': 0.20861096873398258,
                        'support': 1951},
                'ORG': {'f1-score': 0.08307926829268293,
                        'precision': 0.06646341463414634,
                        'recall': 0.11077235772357724,
                        'support': 984},
                'PER': {'f1-score': 0.11581291759465479,
                        'precision': 0.16846652267818574,
                        'recall': 0.08823529411764706,
                        'support': 884}},
    }
fit(x_train: List[List[str]], y_train: List[List[str]], x_validate: List[List[str]] = None, y_validate: List[List[str]] = None, batch_size: int = 64, epochs: int = 5, callbacks: List[tensorflow.python.keras.callbacks.Callback] = None, fit_kwargs: Dict[KT, VT] = None) → tensorflow.python.keras.callbacks.History

Trains the model for a given number of epochs with given data set list.

Parameters:
  • x_train – Array of train feature data (if the model has a single input), or tuple of train feature data array (if the model has multiple inputs)
  • y_train – Array of train label data
  • x_validate – Array of validation feature data (if the model has a single input), or tuple of validation feature data array (if the model has multiple inputs)
  • y_validate – Array of validation label data
  • batch_size – Number of samples per gradient update, default to 64.
  • epochs – Number of epochs to train the model. An epoch is an iteration over the entire x and y data provided.
  • callbacks – List of tf.keras.callbacks.Callback instances. List of callbacks to apply during training. See tf.keras.callbacks.
  • fit_kwargs – additional arguments passed to tf.keras.Model.fit()
Returns:

A tf.keras.callbacks.History object. Its History.history attribute is a record of training loss values and metrics values at successive epochs, as well as validation loss values and validation metrics values (if applicable).

fit_generator(train_sample_gen: kashgari.generators.CorpusGenerator, valid_sample_gen: kashgari.generators.CorpusGenerator = None, batch_size: int = 64, epochs: int = 5, callbacks: List[tf.keras.callbacks.Callback] = None, fit_kwargs: Dict[KT, VT] = None) → tensorflow.python.keras.callbacks.History

Trains the model for a given number of epochs with given data generator.

Data generator must be a subclass of CorpusGenerator

Parameters:
  • train_sample_gen – train data generator.
  • valid_sample_gen – valid data generator.
  • batch_size – Number of samples per gradient update, default to 64.
  • epochs – Number of epochs to train the model. An epoch is an iteration over the entire x and y data provided.
  • callbacks – List of tf.keras.callbacks.Callback instances. List of callbacks to apply during training. See tf.keras.callbacks.
  • fit_kwargs – additional arguments passed to tf.keras.Model.fit()
Returns:

A tf.keras.callbacks.History object. Its History.history attribute is a record of training loss values and metrics values at successive epochs, as well as validation loss values and validation metrics values (if applicable).

classmethod load_model(model_path: str) → Union[ABCLabelingModel, ABCClassificationModel]
predict(x_data: List[List[str]], *, batch_size: int = 32, truncating: bool = False, predict_kwargs: Dict[KT, VT] = None) → List[List[str]]

Generates output predictions for the input samples.

Computation is done in batches.

Parameters:
  • x_data – The input data, as a Numpy array (or list of Numpy arrays if the model has multiple inputs).
  • batch_size – Integer. If unspecified, it will default to 32.
  • truncating – remove values from sequences larger than model.embedding.sequence_length
  • predict_kwargs – arguments passed to tf.keras.Model.predict()
Returns:

array(s) of predictions.

predict_entities(x_data: List[List[str]], batch_size: int = 32, join_chunk: str = ' ', truncating: bool = False, predict_kwargs: Dict[KT, VT] = None) → List[Dict[KT, VT]]

Gets entities from sequence.

Parameters:
  • x_data – The input data, as a Numpy array (or list of Numpy arrays if the model has multiple inputs).
  • batch_size – Integer. If unspecified, it will default to 32.
  • truncating – whether to truncate sequences longer than model.embedding.sequence_length
  • join_chunk – string used to join the tokens of each entity chunk, or False to keep the chunk as a token list
  • predict_kwargs – arguments passed to tf.keras.Model.predict()
Returns:

list of entity.

Return type:

list

save(model_path: str) → str

Save model to the target path.

Parameters:
  • model_path – target path to save the model

to_dict() → Dict[str, Any]

Bidirectional GRU CRF Model

class kashgari.tasks.labeling.BiGRU_CRF_Model(embedding: kashgari.embeddings.abc_embedding.ABCEmbedding = None, sequence_length: int = None, hyper_parameters: Dict[str, Dict[str, Any]] = None)[source]

Bases: kashgari.tasks.labeling.abc_model.ABCLabelingModel

__init__(embedding: kashgari.embeddings.abc_embedding.ABCEmbedding = None, sequence_length: int = None, hyper_parameters: Dict[str, Dict[str, Any]] = None)
Parameters:
  • embedding – embedding object
  • sequence_length – target sequence length
  • hyper_parameters – hyper_parameters to overwrite
build_model(x_data: List[List[str]], y_data: List[List[str]]) → None

Build Model with x_data and y_data

This function will set up a CorpusGenerator, then call build_model_generator() to prepare the processor and model.
Parameters:
  • x_data
  • y_data

Returns:

build_model_arc() → None[source]
build_model_generator(generators: List[kashgari.generators.CorpusGenerator]) → None
compile_model(loss: Any = None, optimizer: Any = None, metrics: Any = None, **kwargs) → None[source]

Configures the model for training. Calls tf.keras.Model.compile() under the hood to compile the model with a custom loss, optimizer and metrics.

Examples

>>> model = BiLSTM_Model()
# Build model with corpus
>>> model.build_model(train_x, train_y)
# Compile model with custom loss, optimizer and metrics
>>> model.compile_model(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
Parameters:
  • loss – name of objective function, objective function or tf.keras.losses.Loss instance.
  • optimizer – name of optimizer or optimizer instance.
  • metrics (object) – List of metrics to be evaluated by the model during training and testing.
  • kwargs – additional params passed to tf.keras.Model.compile().
classmethod default_hyper_parameters() → Dict[str, Dict[str, Any]][source]

The default hyper-parameters of the model, as a dict. All models must implement this function.

You could easily change a model's hyper-parameters.

For example, change the LSTM units in BiLSTM_Model from 128 to 32.

>>> from kashgari.tasks.classification import BiLSTM_Model
>>> hyper = BiLSTM_Model.default_hyper_parameters()
>>> print(hyper)
{'layer_bi_lstm': {'units': 128, 'return_sequences': False}, 'layer_output': {}}
>>> hyper['layer_bi_lstm']['units'] = 32
>>> model = BiLSTM_Model(hyper_parameters=hyper)
Returns: hyper params dict
evaluate(x_data: List[List[str]], y_data: List[List[str]], batch_size: int = 32, digits: int = 4, truncating: bool = False) → Dict[KT, VT]

Build a text report showing the main labeling metrics.

Parameters:
  • x_data
  • y_data
  • batch_size
  • digits
  • truncating
Returns:

A report dict

Example

>>> from kashgari.tasks.labeling import BiGRU_Model
>>> model = BiGRU_Model()
>>> model.fit(train_x, train_y, valid_x, valid_y)
>>> report = model.evaluate(test_x, test_y)
           precision    recall  f1-score   support
    <BLANKLINE>
          ORG     0.0665    0.1108    0.0831       984
          LOC     0.1870    0.2086    0.1972      1951
          PER     0.1685    0.0882    0.1158       884
    <BLANKLINE>
    micro avg     0.1384    0.1555    0.1465      3819
    macro avg     0.1516    0.1555    0.1490      3819
    <BLANKLINE>
>>> print(report)
    {
     'f1-score': 0.14895159934887792,
     'precision': 0.1516294012813676,
     'recall': 0.15553809897879026,
     'support': 3819,
     'detail': {'LOC': {'f1-score': 0.19718992248062014,
                        'precision': 0.18695452457510336,
                        'recall': 0.20861096873398258,
                        'support': 1951},
                'ORG': {'f1-score': 0.08307926829268293,
                        'precision': 0.06646341463414634,
                        'recall': 0.11077235772357724,
                        'support': 984},
                'PER': {'f1-score': 0.11581291759465479,
                        'precision': 0.16846652267818574,
                        'recall': 0.08823529411764706,
                        'support': 884}},
    }
fit(x_train: List[List[str]], y_train: List[List[str]], x_validate: List[List[str]] = None, y_validate: List[List[str]] = None, batch_size: int = 64, epochs: int = 5, callbacks: List[tensorflow.python.keras.callbacks.Callback] = None, fit_kwargs: Dict[KT, VT] = None) → tensorflow.python.keras.callbacks.History

Trains the model for a given number of epochs with given data set list.

Parameters:
  • x_train – Array of train feature data (if the model has a single input), or tuple of train feature data array (if the model has multiple inputs)
  • y_train – Array of train label data
  • x_validate – Array of validation feature data (if the model has a single input), or tuple of validation feature data array (if the model has multiple inputs)
  • y_validate – Array of validation label data
  • batch_size – Number of samples per gradient update, default to 64.
  • epochs – Number of epochs to train the model. An epoch is an iteration over the entire x and y data provided.
  • callbacks – List of tf.keras.callbacks.Callback instances. List of callbacks to apply during training. See tf.keras.callbacks.
  • fit_kwargs – additional arguments passed to tf.keras.Model.fit()
Returns:

A tf.keras.callbacks.History object. Its History.history attribute is a record of training loss values and metrics values at successive epochs, as well as validation loss values and validation metrics values (if applicable).

fit_generator(train_sample_gen: kashgari.generators.CorpusGenerator, valid_sample_gen: kashgari.generators.CorpusGenerator = None, batch_size: int = 64, epochs: int = 5, callbacks: List[tf.keras.callbacks.Callback] = None, fit_kwargs: Dict[KT, VT] = None) → tensorflow.python.keras.callbacks.History

Trains the model for a given number of epochs with given data generator.

Data generator must be a subclass of CorpusGenerator

Parameters:
  • train_sample_gen – train data generator.
  • valid_sample_gen – valid data generator.
  • batch_size – Number of samples per gradient update, default to 64.
  • epochs – Number of epochs to train the model. An epoch is an iteration over the entire x and y data provided.
  • callbacks – List of tf.keras.callbacks.Callback instances. List of callbacks to apply during training. See tf.keras.callbacks.
  • fit_kwargs – additional arguments passed to tf.keras.Model.fit()
Returns:

A tf.keras.callbacks.History object. Its History.history attribute is a record of training loss values and metrics values at successive epochs, as well as validation loss values and validation metrics values (if applicable).

classmethod load_model(model_path: str) → Union[ABCLabelingModel, ABCClassificationModel]
predict(x_data: List[List[str]], *, batch_size: int = 32, truncating: bool = False, predict_kwargs: Dict[KT, VT] = None) → List[List[str]]

Generates output predictions for the input samples.

Computation is done in batches.

Parameters:
  • x_data – The input data, as a Numpy array (or list of Numpy arrays if the model has multiple inputs).
  • batch_size – Integer. If unspecified, it will default to 32.
  • truncating – remove values from sequences larger than model.embedding.sequence_length
  • predict_kwargs – arguments passed to tf.keras.Model.predict()
Returns:

array(s) of predictions.

predict_entities(x_data: List[List[str]], batch_size: int = 32, join_chunk: str = ' ', truncating: bool = False, predict_kwargs: Dict[KT, VT] = None) → List[Dict[KT, VT]]

Gets entities from sequence.

Parameters:
  • x_data – The input data, as a Numpy array (or list of Numpy arrays if the model has multiple inputs).
  • batch_size – Integer. If unspecified, it will default to 32.
  • truncating – whether to truncate sequences longer than model.embedding.sequence_length
  • join_chunk – string used to join the tokens of each entity chunk, or False to keep the chunk as a token list
  • predict_kwargs – arguments passed to tf.keras.Model.predict()
Returns:

list of entity.

Return type:

list

save(model_path: str) → str

Save model to the target path.

Parameters:
  • model_path – target path to save the model

to_dict() → Dict[str, Any]

CNN LSTM Model

class kashgari.tasks.labeling.CNN_LSTM_Model(embedding: kashgari.embeddings.abc_embedding.ABCEmbedding = None, sequence_length: int = None, hyper_parameters: Dict[str, Dict[str, Any]] = None)[source]

Bases: kashgari.tasks.labeling.abc_model.ABCLabelingModel

__init__(embedding: kashgari.embeddings.abc_embedding.ABCEmbedding = None, sequence_length: int = None, hyper_parameters: Dict[str, Dict[str, Any]] = None)
Parameters:
  • embedding – embedding object
  • sequence_length – target sequence length
  • hyper_parameters – hyper_parameters to overwrite
build_model(x_data: List[List[str]], y_data: List[List[str]]) → None

Build Model with x_data and y_data

This function will set up a CorpusGenerator, then call build_model_generator() to prepare the processor and model.
Parameters:
  • x_data
  • y_data

Returns:

build_model_arc() → None[source]
build_model_generator(generators: List[kashgari.generators.CorpusGenerator]) → None
compile_model(loss: Any = None, optimizer: Any = None, metrics: Any = None, **kwargs) → None

Configures the model for training. Calls tf.keras.Model.compile() under the hood to compile the model with a custom loss, optimizer and metrics.

Examples

>>> model = BiLSTM_Model()
# Build model with corpus
>>> model.build_model(train_x, train_y)
# Compile model with custom loss, optimizer and metrics
>>> model.compile_model(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
Parameters:
  • loss – name of objective function, objective function or tf.keras.losses.Loss instance.
  • optimizer – name of optimizer or optimizer instance.
  • metrics (object) – List of metrics to be evaluated by the model during training and testing.
  • kwargs – additional params passed to tf.keras.Model.compile().
classmethod default_hyper_parameters() → Dict[str, Dict[str, Any]][source]

The default hyper-parameters of the model, as a dict. All models must implement this function.

You could easily change a model's hyper-parameters.

For example, change the LSTM units in BiLSTM_Model from 128 to 32.

>>> from kashgari.tasks.classification import BiLSTM_Model
>>> hyper = BiLSTM_Model.default_hyper_parameters()
>>> print(hyper)
{'layer_bi_lstm': {'units': 128, 'return_sequences': False}, 'layer_output': {}}
>>> hyper['layer_bi_lstm']['units'] = 32
>>> model = BiLSTM_Model(hyper_parameters=hyper)
Returns: hyper params dict
evaluate(x_data: List[List[str]], y_data: List[List[str]], batch_size: int = 32, digits: int = 4, truncating: bool = False) → Dict[KT, VT]

Build a text report showing the main labeling metrics.

Parameters:
  • x_data
  • y_data
  • batch_size
  • digits
  • truncating
Returns:

A report dict

Example

>>> from kashgari.tasks.labeling import BiGRU_Model
>>> model = BiGRU_Model()
>>> model.fit(train_x, train_y, valid_x, valid_y)
>>> report = model.evaluate(test_x, test_y)
           precision    recall  f1-score   support
    <BLANKLINE>
          ORG     0.0665    0.1108    0.0831       984
          LOC     0.1870    0.2086    0.1972      1951
          PER     0.1685    0.0882    0.1158       884
    <BLANKLINE>
    micro avg     0.1384    0.1555    0.1465      3819
    macro avg     0.1516    0.1555    0.1490      3819
    <BLANKLINE>
>>> print(report)
    {
     'f1-score': 0.14895159934887792,
     'precision': 0.1516294012813676,
     'recall': 0.15553809897879026,
     'support': 3819,
     'detail': {'LOC': {'f1-score': 0.19718992248062014,
                        'precision': 0.18695452457510336,
                        'recall': 0.20861096873398258,
                        'support': 1951},
                'ORG': {'f1-score': 0.08307926829268293,
                        'precision': 0.06646341463414634,
                        'recall': 0.11077235772357724,
                        'support': 984},
                'PER': {'f1-score': 0.11581291759465479,
                        'precision': 0.16846652267818574,
                        'recall': 0.08823529411764706,
                        'support': 884}},
    }
fit(x_train: List[List[str]], y_train: List[List[str]], x_validate: List[List[str]] = None, y_validate: List[List[str]] = None, batch_size: int = 64, epochs: int = 5, callbacks: List[tensorflow.python.keras.callbacks.Callback] = None, fit_kwargs: Dict[KT, VT] = None) → tensorflow.python.keras.callbacks.History

Trains the model for a given number of epochs with given data set list.

Parameters:
  • x_train – Array of train feature data (if the model has a single input), or tuple of train feature data array (if the model has multiple inputs)
  • y_train – Array of train label data
  • x_validate – Array of validation feature data (if the model has a single input), or tuple of validation feature data array (if the model has multiple inputs)
  • y_validate – Array of validation label data
  • batch_size – Number of samples per gradient update, default to 64.
  • epochs – Number of epochs to train the model. An epoch is an iteration over the entire x and y data provided.
  • callbacks – List of tf.keras.callbacks.Callback instances. List of callbacks to apply during training. See tf.keras.callbacks.
  • fit_kwargs – additional arguments passed to tf.keras.Model.fit()
Returns:

A tf.keras.callbacks.History object. Its History.history attribute is a record of training loss values and metrics values at successive epochs, as well as validation loss values and validation metrics values (if applicable).

fit_generator(train_sample_gen: kashgari.generators.CorpusGenerator, valid_sample_gen: kashgari.generators.CorpusGenerator = None, batch_size: int = 64, epochs: int = 5, callbacks: List[tf.keras.callbacks.Callback] = None, fit_kwargs: Dict[KT, VT] = None) → tensorflow.python.keras.callbacks.History

Trains the model for a given number of epochs with given data generator.

Data generator must be a subclass of CorpusGenerator

Parameters:
  • train_sample_gen – train data generator.
  • valid_sample_gen – valid data generator.
  • batch_size – Number of samples per gradient update, default to 64.
  • epochs – Number of epochs to train the model. An epoch is an iteration over the entire x and y data provided.
  • callbacks – List of tf.keras.callbacks.Callback instances. List of callbacks to apply during training. See tf.keras.callbacks.
  • fit_kwargs – additional arguments passed to tf.keras.Model.fit()
Returns:

A tf.keras.callbacks.History object. Its History.history attribute is a record of training loss values and metrics values at successive epochs, as well as validation loss values and validation metrics values (if applicable).

classmethod load_model(model_path: str) → Union[ABCLabelingModel, ABCClassificationModel]
predict(x_data: List[List[str]], *, batch_size: int = 32, truncating: bool = False, predict_kwargs: Dict[KT, VT] = None) → List[List[str]]

Generates output predictions for the input samples.

Computation is done in batches.

Parameters:
  • x_data – The input data, as a Numpy array (or list of Numpy arrays if the model has multiple inputs).
  • batch_size – Integer. If unspecified, it will default to 32.
  • truncating – remove values from sequences larger than model.embedding.sequence_length
  • predict_kwargs – arguments passed to tf.keras.Model.predict()
Returns:

array(s) of predictions.

predict_entities(x_data: List[List[str]], batch_size: int = 32, join_chunk: str = ' ', truncating: bool = False, predict_kwargs: Dict[KT, VT] = None) → List[Dict[KT, VT]]

Gets entities from sequence.

Parameters:
  • x_data – The input data, as a Numpy array (or list of Numpy arrays if the model has multiple inputs).
  • batch_size – Integer. If unspecified, it will default to 32.
  • truncating – whether to truncate sequences longer than model.embedding.sequence_length
  • join_chunk – string used to join the tokens of each entity chunk, or False to keep the chunk as a token list
  • predict_kwargs – arguments passed to tf.keras.Model.predict()
Returns:

list of entity.

Return type:

list

save(model_path: str) → str

Save model to the target path.

Parameters:
  • model_path – target path to save the model

to_dict() → Dict[str, Any]

Generators

CorpusGenerator

class kashgari.generators.CorpusGenerator(x_data: List[T], y_data: List[T], *, buffer_size: int = 2000)[source]

Bases: kashgari.generators.ABCGenerator

__init__(x_data: List[T], y_data: List[T], *, buffer_size: int = 2000) → None[source]

Initialize self. See help(type(self)) for accurate signature.

sample() → Iterator[Tuple[Any, Any]]

BatchDataSet

class kashgari.generators.BatchDataSet(corpus: kashgari.generators.CorpusGenerator, *, text_processor: ABCProcessor, label_processor: ABCProcessor, seq_length: int = None, max_position: int = None, segment: bool = False, batch_size: int = 64)[source]

Bases: collections.abc.Iterable, typing.Generic

__init__(corpus: kashgari.generators.CorpusGenerator, *, text_processor: ABCProcessor, label_processor: ABCProcessor, seq_length: int = None, max_position: int = None, segment: bool = False, batch_size: int = 64) → None[source]

Initialize self. See help(type(self)) for accurate signature.

take(batch_count: int = None) → Any[source]

take batches from the dataset

Parameters: batch_count – number of batches to take; iterates forever when batch_count is None.
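
A hedged end-to-end sketch of the generator pipeline (x_data/y_data are placeholder corpora of tokenized sentences and labels; the assumption that take() yields (x, y) batch pairs is noted in the comments):

from kashgari.generators import CorpusGenerator, BatchDataSet
from kashgari.processors import SequenceProcessor, ClassificationProcessor

corpus = CorpusGenerator(x_data, y_data)

text_processor = SequenceProcessor()
label_processor = ClassificationProcessor()
text_processor.build_vocab(x_data, y_data)
label_processor.build_vocab(x_data, y_data)

dataset = BatchDataSet(corpus,
                       text_processor=text_processor,
                       label_processor=label_processor,
                       seq_length=64,
                       batch_size=32)

# Take two batches; with batch_count=None the dataset iterates forever.
# Each yielded item is assumed to be an (x, y) pair of numpy arrays.
for batch_x, batch_y in dataset.take(2):
    print(batch_x.shape, batch_y.shape)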

Data Processors

SequenceProcessor

class kashgari.processors.SequenceProcessor(build_in_vocab: str = 'text', min_count: int = 3, build_vocab_from_labels: bool = False, **kwargs)[source]

Bases: kashgari.processors.abc_processor.ABCProcessor

Generic processor for sequence samples.

__init__(build_in_vocab: str = 'text', min_count: int = 3, build_vocab_from_labels: bool = False, **kwargs) → None[source]
Parameters:
  • build_in_vocab – initial vocab dict type, one of 'text' or 'labeling'.
  • **kwargs
build_vocab(x_data: List[List[str]], y_data: List[List[str]]) → None
build_vocab_generator(generators: List[kashgari.generators.CorpusGenerator]) → None[source]
get_tensor_shape(batch_size: int, seq_length: int) → Tuple
inverse_transform(labels: Union[List[List[int]], numpy.ndarray], *, lengths: List[int] = None, threshold: float = 0.5, **kwargs) → List[List[str]][source]
is_vocab_build
to_dict() → Dict[str, Any][source]
transform(samples: List[List[str]], *, seq_length: int = None, max_position: int = None, segment: bool = False) → numpy.ndarray[source]
vocab_size
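
A minimal sketch of turning tokenized text into padded index arrays (the tiny corpus is a placeholder; min_count=1 keeps every token in such a small vocabulary):

from kashgari.processors import SequenceProcessor

x_data = [['hello', 'world'], ['kashgari', 'is', 'simple']]
y_data = [['O', 'O'], ['B-LIB', 'O', 'O']]

processor = SequenceProcessor(min_count=1)
processor.build_vocab(x_data, y_data)

tensor = processor.transform(x_data, seq_length=10)
print(tensor.shape)          # (2, 10): padded/truncated to seq_length
print(processor.vocab_size)  # size of the built token vocabulary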

ClassificationProcessor

class kashgari.processors.ClassificationProcessor(multi_label: bool = False, **kwargs)[source]

Bases: kashgari.processors.abc_processor.ABCProcessor

__init__(multi_label: bool = False, **kwargs) → None[source]

Initialize self. See help(type(self)) for accurate signature.

build_vocab(x_data: List[List[str]], y_data: List[List[str]]) → None
build_vocab_generator(generators: List[kashgari.generators.CorpusGenerator]) → None[source]
get_tensor_shape(batch_size: int, seq_length: int) → Tuple[source]
inverse_transform(labels: Union[List[int], numpy.ndarray], *, lengths: List[int] = None, threshold: float = 0.5, **kwargs) → Union[List[List[str]], List[str]][source]
is_vocab_build
to_dict() → Dict[str, Any][source]
transform(samples: List[List[str]], *, seq_length: int = None, max_position: int = None, segment: bool = False) → numpy.ndarray[source]
vocab_size
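
A hedged sketch of the label-side processor (the label list is a placeholder, and the round-trip behaviour of transform/inverse_transform is an assumption based on the signatures above):

from kashgari.processors import ClassificationProcessor

labels = ['sports', 'news', 'sports', 'finance']

processor = ClassificationProcessor()
processor.build_vocab([], labels)     # the label vocabulary comes from y_data

print(processor.vocab_size)           # number of distinct labels
encoded = processor.transform(labels) # numpy array of encoded labels
print(processor.inverse_transform(encoded))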

Contributing & Support

We are happy to accept contributions that make Kashgari better and more awesome! You could contribute in various ways:

Bug Reports

  1. Please read the documentation and search the issue tracker to try and find the answer to your question before posting an issue.

  2. When creating an issue on the repository, please provide as much info as possible:

    • Version being used.
    • Operating system.
    • Version of Python.
    • Errors in console.
    • Detailed description of the problem.
    • Examples for reproducing the error. You can post pictures, but if specific text or code is required to reproduce the issue, please provide the text in a plain text format for easy copy/paste.

    The more info provided the greater the chance someone will take the time to answer, implement, or fix the issue.

  3. Be prepared to answer questions and provide additional information if required. Issues in which the creator refuses to respond to follow up questions will be marked as stale and closed.

Reviewing Code

Take part in reviewing pull requests and/or reviewing direct commits. Make suggestions to improve the code and discuss solutions to overcome weaknesses in the algorithm.

Answer Questions in Issues

Take time and answer questions and offer suggestions to people who’ve created issues in the issue tracker. Often people will have questions that you might have an answer for. Or maybe you know how to help them accomplish a specific task they are asking about. Feel free to share your experience with others to help them out.

Pull Requests

Pull requests are welcome, and a great way to help fix bugs and add new features.

Accuracy Benchmarks

Use Kashgari on your own data, and report the F1 score.

Adding New Models

New models can be of two basic types: new text classification models or new sequence labeling models.

Adding New Tasks

Currently, Kashgari can handle text-classification and sequence-labeling tasks. If you want to apply Kashgari to a new task, please submit a feature request issue and explain why we should consider adding the new task to Kashgari.

Documentation Improvements

A ton of time has been spent not only creating and supporting this tool, but also on writing this documentation. If you feel it is still lacking, show your appreciation for the tool by helping to improve/translate the documentation.

You can build the docs by running these commands in the project root folder. Source files are in the docs folder.

pip install -r docs/requirements.txt
python setup.py install
sh ./scripts/docs-live.sh

Release notes

Upgrading

To upgrade Kashgari to the latest version, use pip:

pip uninstall -y kashgari-tf
pip install --upgrade kashgari

To inspect the currently installed version, use the following command:

pip show kashgari

Current Release

[2.0.0] - 2020.09.10

This is a fully re-implemented version with TF2.

  • ✨ Embeddings
  • ✨ Text Classification Task
  • ✨ Text Labeling Task
  • ✨ Seq2Seq Task
  • ✨ Examples
    • ✨ Neural machine translation with Seq2Seq
    • ✨ Benchmarks

1.1.1 - 2020.03.13

  • ✨ Add BERTEmbeddingV2.
  • 💥 Migrate documents to https://readthedoc.org for version control.

1.1.0 - 2019.12.27

  • ✨ Add Scoring task. (#303)
  • ✨ Add tokenizers.
  • 🐛 Fixing multi-label classification model loading. #304

1.0.0 - 2019.10.18

Unfortunately, we have to change the package name for clarity and consistency. Here is the new naming style.

Backend pypi version desc
TensorFlow 2.x kashgari 2.x.x coming soon
TensorFlow 1.14+ kashgari 1.x.x
Keras kashgari 0.x.x legacy version

Here is how the existing versions change

Supported Backend Kashgari Versions Kashgari-tf Version
TensorFlow 2.x kashgari 2.x.x -
TensorFlow 1.14+ kashgari 1.0.1 -
TensorFlow 1.14+ kashgari 1.0.0 0.5.5
TensorFlow 1.14+ - 0.5.4
TensorFlow 1.14+ - 0.5.3
TensorFlow 1.14+ - 0.5.2
TensorFlow 1.14+ - 0.5.1
Keras (legacy) kashgari 0.2.6 -
Keras (legacy) kashgari 0.2.5 -
Keras (legacy) kashgari 0.x.x -

0.5.4 - 2019.09.30

  • ✨ Add shuffle parameter to fit function (#249)
  • ✨ Improved type hinting for loaded model (#248)
  • 🐛 Fix loading models with CRF layers (#244, #228)
  • 🐛 Fix the configuration changes during embedding save/load (#224)
  • 🐛 Fix stacked embedding save/load (#224)
  • 🐛 Fix evaluate function where the list has int instead of str (#222)
  • 💥 Renaming model.pre_processor to model.processor
  • 🚨 Removing TensorFlow and numpy warnings
  • 📝 Add docs on how to specify which CPU or GPU to use
  • 📝 Add docs on how to compile the model with a custom optimizer

0.5.3 - 2019.08.11

  • 🐛 Fixing CuDNN Error (#198)

0.5.2 - 2019.08.10

  • 💥 Add CuDNN Cell config, disable auto CuDNN cell. (#182, #198)

0.5.1 - 2019.07.15

  • 📝 Rewrite documents with mkdocs
  • 📝 Add Chinese documents
  • ✨ Add predict_top_k_class for classification models to get prediction probabilities (#146)
  • 🚸 Add label2idx, token2idx properties to Embeddings and Models
  • 🚸 Add tokenizer property for BERT Embedding. (#136)
  • 🚸 Add predict_kwargs for models predict() function
  • ⚡️ Change multi-label classification’s default loss function to binary_crossentropy (#151)

0.5.0 - 2019.07.11

🎉🎉 tf.keras version 🎉🎉

  • 🎉 Rewrite Kashgari using tf.keras (#77)
  • 🎉 Rewrite Documents
  • ✨ Add TPU support
  • ✨ Add TF-Serving support.
  • ✨ Add advance customization support, like multi-input model
  • 🐎 Performance optimization

Legacy Version Changelog

0.2.6 - 2019.07.12

  • 📝 Add tf.keras version info
  • 🐛 Fixing lstm issue in labeling model (#125)

0.2.4 - 2019.06.06

  • Add BERT output feature layer fine-tune support. Discussion: (#103)
  • Add BERT output feature layer number selection, default 4 according to BERT paper
  • Fix BERT embedding token index offset issue (#104)

0.2.1 - 2019.03.05

  • fix missing sequence_labeling_tokenize_add_bos_eos config

0.2.0

  • multi-label classification for all classification models
  • support cuDNN cell for sequence labeling
  • add option to output BOS and EOS in sequence labeling result, fix #31

0.1.9

  • add AVCNNModel, KMaxCNNModel, RCNNModel, AVRNNModel, DropoutBGRUModel, DropoutAVRNNModel models to the classification task.
  • fix several small bugs

0.1.8

  • fix BERT Embedding model’s to_json function, issue #19

0.1.7

  • remove class candidates filter to fix #16
  • overwrite init function in CustomEmbedding
  • add parameter check to custom_embedding layer
  • add keras-bert version to setup.py file

0.1.6

  • add output_dict, debug_info params to text_classification model
  • add output_dict, debug_info and chunk_joiner params to text_labeling model
  • fix possible crash at data_generator

0.1.5

  • fix sequence labeling evaluate result output
  • refactor model save and load function

0.1.4

  • fix classification model evaluate result output
  • change test settings