Transformer Embedding¶
TransformerEmbedding is based on bert4keras. The embeddings are wrapped into our simple embedding interface so that they can be used like any other embedding.

TransformerEmbedding supports the following models:
| Model | Author | Link |
| --- | --- | --- |
| BERT | Google | https://github.com/google-research/bert |
| ALBERT | Google | https://github.com/google-research/ALBERT |
| ALBERT | brightmart | https://github.com/brightmart/albert_zh |
| RoBERTa | brightmart | https://github.com/brightmart/roberta_zh |
| RoBERTa | 哈工大 | https://github.com/ymcui/Chinese-BERT-wwm |
| RoBERTa | 苏剑林 | https://github.com/ZhuiyiTechnology/pretrained-models |
| NEZHA | Huawei | https://github.com/huawei-noah/Pretrained-Language-Model/tree/master/NEZHA |
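Which `bert_type` value (described in the arguments below) to pass depends on the checkpoint format. A rough guide, with the caveat that the RoBERTa mapping is an assumption based on those checkpoints being published in BERT format:

```python
# Rough mapping from the model families above to the `bert_type` argument.
# The RoBERTa entry is an assumption: the listed RoBERTa checkpoints are
# published in BERT format, so they load with bert_type='bert'.
BERT_TYPE_BY_MODEL = {
    'BERT': 'bert',
    'ALBERT': 'albert',
    'RoBERTa': 'bert',
    'NEZHA': 'nezha',
}
```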
!!! tip
    When using a pre-trained embedding, remember to use the same tokenization tool as the embedding model. This lets you access the full power of the embedding.
```python
kashgari.embeddings.TransformerEmbedding(vocab_path: str,
                                         config_path: str,
                                         checkpoint_path: str,
                                         bert_type: str = 'bert',
                                         task: str = None,
                                         sequence_length: Union[str, int] = 'auto',
                                         processor: Optional[BaseProcessor] = None,
                                         from_saved_model: bool = False)
```
**Arguments**

- **vocab_path**: path of the model's `vocab.txt` file
- **config_path**: path of the model's `model.json` file
- **checkpoint_path**: path of the model's checkpoint file
- **bert_type**: `bert`, `albert`, `nezha`, `electra`, `gpt2_ml` or `t5`. Type of the BERT model.
- **task**: `kashgari.CLASSIFICATION` or `kashgari.LABELING`. Downstream task type. If you only need feature extraction, just set it to `kashgari.CLASSIFICATION` (see the sketch after this list).
- **sequence_length**: `'auto'` or an integer. With `'auto'`, the 95th percentile of the corpus length is used as the sequence length. With an integer, say `50`, the input and output sequence length will be set to 50.
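If you only need feature extraction, construct the embedding with `task=kashgari.CLASSIFICATION` and call it directly. The sketch below assumes an ALBERT checkpoint laid out as in the example that follows, and the generic `embed_one` / `embed` methods that Kashgari embeddings expose for pre-tokenized input:

```python
import os
import kashgari
from kashgari.embeddings import TransformerEmbedding

# Placeholder path; point it at your own downloaded ALBERT checkpoint.
model_folder = '/path/to/albert_base'

embed = TransformerEmbedding(
    vocab_path=os.path.join(model_folder, 'vocab_chinese.txt'),
    config_path=os.path.join(model_folder, 'albert_config.json'),
    checkpoint_path=os.path.join(model_folder, 'model.ckpt-best'),
    bert_type='albert',
    task=kashgari.CLASSIFICATION,   # feature extraction only
    sequence_length=100)

# Embed a single pre-tokenized sentence and inspect the vector shape.
vector = embed.embed_one(['jim', 'henson', 'was', 'a', 'puppet', '##eer', '.'])
print(vector.shape)
```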
Example Usage - Text Classification¶
Let's run a text classification model on top of a pre-trained ALBERT checkpoint.
```python
sentences = [
"Jim Henson was a puppeteer.",
"This here's an example of using the BERT tokenizer.",
"Why did the chicken cross the road?"
]
labels = [
"class1",
"class2",
"class1"
]
# ------------ Load Bert Embedding ------------
import os
import kashgari
from kashgari.embeddings import TransformerEmbedding
from kashgari.tokenizer import BertTokenizer
# Setup paths
model_folder = '/Users/brikerman/Desktop/nlp/language_models/albert_base'
checkpoint_path = os.path.join(model_folder, 'model.ckpt-best')
config_path = os.path.join(model_folder, 'albert_config.json')
vocab_path = os.path.join(model_folder, 'vocab_chinese.txt')
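# Build the tokenizer from the model's own vocab file so that tokenization
# matches the pre-trained model (see the tip above).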
tokenizer = BertTokenizer.load_from_vacob_file(vocab_path)
embed = TransformerEmbedding(vocab_path, config_path, checkpoint_path,
bert_type='albert',
task=kashgari.CLASSIFICATION,
sequence_length=100)
sentences_tokenized = [tokenizer.tokenize(s) for s in sentences]
"""
The sentences will become tokenized into:
[
['jim', 'henson', 'was', 'a', 'puppet', '##eer', '.'],
['this', 'here', "'", 's', 'an', 'example', 'of', 'using', 'the', 'bert', 'token', '##izer', '.'],
['why', 'did', 'the', 'chicken', 'cross', 'the', 'road', '?']
]
"""
train_x, train_y = sentences_tokenized[:2], labels[:2]
validate_x, validate_y = sentences_tokenized[2:], labels[2:]
# ------------ Build Model Start ------------
from kashgari.tasks.classification import CNNLSTMModel
model = CNNLSTMModel(embed)
# ------------ Build Model End ------------
model.fit(
train_x, train_y,
validate_x, validate_y,
epochs=3,
batch_size=32
)
# save model
model.save('path/to/save/model/to')
```
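The saved model can be restored later for prediction. A minimal sketch, assuming `kashgari.utils.load_model` and input tokenized with the same tokenizer as during training:

```python
from kashgari.utils import load_model

# Reload the trained model from the path used in model.save(...) above.
loaded_model = load_model('path/to/save/model/to')

# Inputs must be tokenized with the same BertTokenizer used for training;
# here we reuse one of the already-tokenized sentences from the example.
sample = ['why', 'did', 'the', 'chicken', 'cross', 'the', 'road', '?']
print(loaded_model.predict([sample]))
```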