Transformer Embedding

TransformerEmbedding is based on bert4keras. The embeddings themselves are wrapped in our simple embedding interface so that they can be used like any other embedding.

TransformerEmbedding supports the following models:

Model     Author
BERT      Google
ALBERT    brightmart
RoBERTa   brightmart
RoBERTa   Harbin Institute of Technology (哈工大)
RoBERTa   Su Jianlin (苏剑林)
NEZHA     Huawei


When using a pre-trained embedding, remember to tokenize your text with the same tokenizer that the embedding model was trained with. Otherwise many tokens will not match the model's vocabulary and you will not get the full power of the embedding.
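
The effect of a tokenizer mismatch can be illustrated with a small, self-contained sketch. It does not use kashgari or a real vocabulary file: TOY_VOCAB, wordpiece and count_unknown are hypothetical names introduced only for this illustration, and the greedy longest-match split is a simplified version of WordPiece-style tokenization. The input words are assumed to be lowercased and already separated from punctuation.

```python
# Toy vocabulary (an assumption for illustration, not a real BERT vocab file)
TOY_VOCAB = {"[UNK]", "jim", "henson", "was", "a", "puppet", "##eer", "."}


def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the '##' prefix
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matched at this position: the whole word is unknown
            return [unk]
        pieces.append(piece)
        start = end
    return pieces


def count_unknown(tokens, vocab):
    """Number of tokens that have no entry in the vocabulary."""
    return sum(1 for t in tokens if t not in vocab)


words = ["jim", "henson", "was", "a", "puppeteer", "."]
matched = [p for w in words for p in wordpiece(w, TOY_VOCAB)]
mismatched = words  # plain word-level tokens, no sub-word splitting

print(matched)
print(count_unknown(matched, TOY_VOCAB), count_unknown(mismatched, TOY_VOCAB))
```

With the matching tokenizer every piece is found in the vocabulary, while the word-level tokens leave 'puppeteer' without an entry, so the embedding could only represent it as an unknown token.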

kashgari.embeddings.TransformerEmbedding.__init__(self, vocab_path: str, config_path: str, checkpoint_path: str, model_type: str = 'bert', **kwargs)
  • vocab_path – path to the vocabulary file, for example vocab.txt
  • config_path – path to the model config file, for example config.json
  • checkpoint_path – path to the model weights, for example model.ckpt-100000
  • model_type – transformer model type, one of {bert, albert, nezha, gpt2_ml, t5}
  • kwargs – additional params
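
Because a wrong path or model type usually only surfaces when the weights are loaded, it can help to check the arguments first. The following is a minimal sketch, not part of kashgari: check_embedding_args and SUPPORTED_MODEL_TYPES are hypothetical names, and the set of model types is copied from the parameter list above. It assumes the checkpoint is a TensorFlow-style checkpoint, where a name such as model.ckpt-100000 is a prefix shared by several files rather than a single file.

```python
import os

# Model types copied from the parameter description (assumed to be exhaustive)
SUPPORTED_MODEL_TYPES = {"bert", "albert", "nezha", "gpt2_ml", "t5"}


def check_embedding_args(vocab_path, config_path, checkpoint_path, model_type="bert"):
    """Hypothetical pre-flight check; returns kwargs for TransformerEmbedding."""
    if model_type not in SUPPORTED_MODEL_TYPES:
        raise ValueError(
            f"unknown model_type {model_type!r}, expected one of {sorted(SUPPORTED_MODEL_TYPES)}"
        )

    # The vocab and config files must exist as regular files
    missing = [p for p in (vocab_path, config_path) if not os.path.isfile(p)]

    # The checkpoint name is a prefix, so accept any file that starts with it
    ckpt_dir = os.path.dirname(checkpoint_path) or "."
    ckpt_prefix = os.path.basename(checkpoint_path)
    has_ckpt = os.path.isdir(ckpt_dir) and any(
        name.startswith(ckpt_prefix) for name in os.listdir(ckpt_dir)
    )
    if not has_ckpt:
        missing.append(checkpoint_path)

    if missing:
        raise FileNotFoundError("missing model files: " + ", ".join(missing))

    return {
        "vocab_path": vocab_path,
        "config_path": config_path,
        "checkpoint_path": checkpoint_path,
        "model_type": model_type,
    }


# The model type is checked before any file access, so this fails fast
try:
    check_embedding_args("vocab.txt", "config.json", "model.ckpt-100000", model_type="xlnet")
except ValueError as err:
    print(err)
```

The returned dictionary could then be unpacked into the constructor, for example TransformerEmbedding(**args).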

Example Usage - Text Classification

Let’s run a text classification model with a BERT-family embedding (the paths in this example point to an ALBERT checkpoint).

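sentences = [
    "Jim Henson was a puppeteer.",
    "This here's an example of using the BERT tokenizer.",
    "Why did the chicken cross the road?"
]
# Placeholder labels, one per sentence (the original label values are not shown)
labels = ["class1", "class2", "class1"]
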
# ------------ Load Bert Embedding ------------
import os
from kashgari.embeddings import TransformerEmbedding
from kashgari.tokenizers import BertTokenizer

# Setup paths
model_folder = '/xxx/xxx/albert_base'
checkpoint_path = os.path.join(model_folder, 'model.ckpt-best')
config_path = os.path.join(model_folder, 'albert_config.json')
vocab_path = os.path.join(model_folder, 'vocab_chinese.txt')

tokenizer = BertTokenizer.load_from_vocab_file(vocab_path)
embed = TransformerEmbedding(vocab_path, config_path, checkpoint_path,
                             model_type='albert')

sentences_tokenized = [tokenizer.tokenize(s) for s in sentences]
# The sentences will be tokenized into:
# [
#     ['jim', 'henson', 'was', 'a', 'puppet', '##eer', '.'],
#     ['this', 'here', "'", 's', 'an', 'example', 'of', 'using', 'the', 'bert', 'token', '##izer', '.'],
#     ['why', 'did', 'the', 'chicken', 'cross', 'the', 'road', '?']
# ]

train_x, train_y = sentences_tokenized[:2], labels[:2]
validate_x, validate_y = sentences_tokenized[2:], labels[2:]

# ------------ Build Model Start ------------
from kashgari.tasks.classification import CNN_LSTM_Model
model = CNN_LSTM_Model(embed)

# ------------ Build Model End ------------

model.fit(
    train_x, train_y,
    validate_x, validate_y,
)
# save model
model.save('path/to/save/model/to')