Bert Embedding¶
BertEmbedding is a simple wrapper class around the Transformer Embedding. If you need to load another kind of transformer-based language model, please use the Transformer Embedding directly.
Note
When using a pre-trained embedding, remember to use the same tokenization tool that the embedding model was trained with; this allows you to access the full power of the embedding.
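As a concrete illustration of this note, here is a minimal sketch (not Kashgari's BertTokenizer, and using a made-up toy vocabulary) showing how a naive whitespace tokenizer and a WordPiece-style tokenizer split the same sentence differently:

```python
# A minimal sketch (NOT Kashgari's BertTokenizer) contrasting naive whitespace
# splitting with a greedy longest-match-first WordPiece-style split over a toy
# vocabulary. It illustrates why the tokenizer must match the embedding model:
# a different tokenizer yields different tokens, and unknown tokens become [UNK].

VOCAB = {"jim", "henson", "was", "a", "puppet", "##eer", "."}

def whitespace_tokenize(text):
    # Naive tokenizer: lowercase, detach the final period, split on spaces.
    return text.lower().replace(".", " .").split()

def wordpiece_tokenize(word, vocab):
    # Greedy longest-match-first split, as WordPiece does.
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces get the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no vocab entry covers this span
        pieces.append(piece)
        start = end
    return pieces

print(whitespace_tokenize("Jim Henson was a puppeteer."))
# ['jim', 'henson', 'was', 'a', 'puppeteer', '.']
print(wordpiece_tokenize("puppeteer", VOCAB))
# ['puppet', '##eer']
```

With whitespace tokenization, 'puppeteer' is not in the toy vocabulary and would be mapped to [UNK], losing information that the pre-trained embedding could otherwise use; the WordPiece-style split recovers it as known sub-word pieces.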
- kashgari.embeddings.BertEmbedding.__init__(self, model_folder, **kwargs)¶
- Parameters
model_folder (str) – path to the checkpoint folder.
kwargs (Any) – additional parameters
Example Usage - Text Classification¶
Let’s run a text classification model with BERT.
sentences = [
"Jim Henson was a puppeteer.",
"This here's an example of using the BERT tokenizer.",
"Why did the chicken cross the road?"
]
labels = [
"class1",
"class2",
"class1"
]
########## Load Bert Embedding ##########
import os
from kashgari.embeddings import BertEmbedding
from kashgari.tokenizers import BertTokenizer
bert_embedding = BertEmbedding('<PATH_TO_BERT_EMBEDDING>')
tokenizer = BertTokenizer.load_from_vocab_file(os.path.join('<PATH_TO_BERT_EMBEDDING>', 'vocab_chinese.txt'))
sentences_tokenized = [tokenizer.tokenize(s) for s in sentences]
"""
The sentences will be tokenized into:
[
['jim', 'henson', 'was', 'a', 'puppet', '##eer', '.'],
['this', 'here', "'", 's', 'an', 'example', 'of', 'using', 'the', 'bert', 'token', '##izer', '.'],
['why', 'did', 'the', 'chicken', 'cross', 'the', 'road', '?']
]
"""
train_x, train_y = sentences_tokenized[:2], labels[:2]
validate_x, validate_y = sentences_tokenized[2:], labels[2:]
########## build model ##########
from kashgari.tasks.classification import CNN_LSTM_Model
model = CNN_LSTM_Model(bert_embedding)
########## /build model ##########
model.fit(
train_x, train_y,
validate_x, validate_y,
epochs=3,
batch_size=32
)
# save model
model.save('path/to/save/model/to')
Use sentence pairs for input¶
Let’s assume the input is the sentence pair "First do it" / "then do it right".
First tokenize each sentence with the BERT tokenizer, then join the two token lists:
sentence1 = ['First', 'do', 'it']
sentence2 = ['then', 'do', 'it', 'right']
sample = sentence1 + ["[SEP]"] + sentence2
# Add a special separation token `[SEP]` between two sentences tokens
# Generate a new token list
# ['First', 'do', 'it', '[SEP]', 'then', 'do', 'it', 'right']
train_x = [sample]
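The pairing step above can be wrapped in a small helper (hypothetical, not part of Kashgari's API) so that many pairs can be prepared the same way:

```python
def make_pair_sample(tokens_a, tokens_b, sep_token="[SEP]"):
    # Join two token lists with the special separator token, as BERT
    # expects for sentence-pair inputs.
    return tokens_a + [sep_token] + tokens_b

sample = make_pair_sample(['First', 'do', 'it'], ['then', 'do', 'it', 'right'])
print(sample)
# ['First', 'do', 'it', '[SEP]', 'then', 'do', 'it', 'right']
train_x = [sample]
```

The resulting token lists can then be fed to the model exactly like single-sentence samples.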