Language Embeddings

Kashgari provides several embeddings for language representation. An embedding layer converts an input sequence into a tensor for the downstream task. Available embeddings:

| class name           | description                                                              |
| -------------------- | ------------------------------------------------------------------------ |
| BareEmbedding        | randomly initialized tf.keras.layers.Embedding layer for text sequence embedding |
| WordEmbedding        | pre-trained Word2Vec embedding                                           |
| BertEmbedding        | pre-trained BERT embedding                                               |
| TransformerEmbedding | pre-trained Transformer embedding (BERT, ALBERT, RoBERTa, NEZHA)         |
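
Each class is constructed differently. The snippet below is a rough sketch of typical construction calls; the paths are placeholders, and the exact parameter names (such as embedding_size and model_type) should be checked against the Embedding API document:

from kashgari.embeddings import (
    BareEmbedding,
    WordEmbedding,
    BertEmbedding,
    TransformerEmbedding,
)

# Randomly initialized embedding layer; embedding_size is an assumed parameter name
bare = BareEmbedding(embedding_size=100)

# Pre-trained Word2Vec vectors loaded from a local file (placeholder path)
w2v = WordEmbedding('<WORD2VEC_FILE_PATH>')

# Pre-trained BERT checkpoint folder, as in the Quick start below
bert = BertEmbedding('<BERT_MODEL_FOLDER>')

# Generic transformer checkpoints; model_type selects BERT, ALBERT, RoBERTa or NEZHA
transformer = TransformerEmbedding('<VOCAB_PATH>', '<CONFIG_PATH>', '<CHECKPOINT_PATH>',
                                   model_type='bert')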

All embedding classes inherit from the Embedding class and implement the embed() method, which embeds your input sequence, and the embed_model property, which you need to build your own model. By providing the embed() function and the embed_model property, Kashgari hides the complexity of the different language embeddings from users; all you need to care about is which language embedding you need.

You can check out the Embedding API document here.

Quick start

Feature Extraction from a Pre-trained Embedding

Feature extraction is one of the major ways to use a pre-trained language embedding, and Kashgari provides a simple API for this task. All you need to do is initialize an embedding object, set up its pre-processor, and then call the embed function. Here is an example; all embeddings share the same embed API.

from kashgari.embeddings import BertEmbedding
from kashgari.processors import SequenceProcessor

# Load the pre-trained BERT checkpoint from a local folder
bert = BertEmbedding('<BERT_MODEL_FOLDER>')
# Attach a text processor so the embedding can tokenize and index the input
processor = SequenceProcessor()
bert.setup_text_processor(processor)
# Embed a batch containing a single tokenized sentence
embed_tensor = bert.embed([['语', '言', '模', '型']])

print(embed_tensor)
# array([[-0.5001117 ,  0.9344998 , -0.55165815, ...,  0.49122602,
#         -0.2049343 ,  0.25752577],
#        [-1.05762   , -0.43353617,  0.54398274, ..., -0.61096823,
#          0.04312163,  0.03881482],
#        [ 0.14332692, -0.42566583,  0.68867105, ...,  0.42449307,
#          0.41105768,  0.08222893],
#        ...,
#        [-0.86124015,  0.08591427, -0.34404194, ...,  0.19915134,
#         -0.34176797,  0.06111742],
#        [-0.73940575, -0.02692179, -0.5826528 , ...,  0.26934686,
#         -0.29708537,  0.01855129],
#        [-0.85489404,  0.007399  , -0.26482674, ...,  0.16851354,
#         -0.36805922, -0.0052386 ]], dtype=float32)
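
The return value is a plain numpy array, so you can inspect its shape directly. The exact dimensions depend on the model and the tokenization; the shape in the comment below is an assumption for a base-size BERT:

print(embed_tensor.shape)
# Something like (batch_size, sequence_length, embedding_size),
# e.g. a 768-dimensional last axis for a base-size BERT (assumption)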

Classification and Labeling

See details in the classification and labeling tutorials.
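
As a minimal sketch of how an embedding plugs into a task model, the snippet below passes a BertEmbedding into a BiLSTM_Model classifier; the toy corpus and labels are purely illustrative:

from kashgari.embeddings import BertEmbedding
from kashgari.tasks.classification import BiLSTM_Model

bert = BertEmbedding('<BERT_MODEL_FOLDER>')

# Toy corpus: tokenized sentences and their labels (illustrative data only)
x = [['语', '言', '模', '型'], ['你', '好', '世', '界']]
y = ['domain_a', 'domain_b']

# The task model consumes the embedding and builds on its embed_model internally
model = BiLSTM_Model(bert)
model.fit(x, y)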

Customized model

You can access the tf.keras model of an embedding and add your own layers or any other kind of customization. You just need to access the embed_model property of the embedding object.
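
For example, the sketch below sets up a BertEmbedding as in the Quick start, then stacks a pooling layer and an assumed two-class softmax head on top of embed_model, treating it as an ordinary tf.keras model:

import tensorflow as tf

from kashgari.embeddings import BertEmbedding
from kashgari.processors import SequenceProcessor

bert = BertEmbedding('<BERT_MODEL_FOLDER>')
bert.setup_text_processor(SequenceProcessor())

# embed_model is a regular tf.keras model: reuse its inputs and output tensors
embed_model = bert.embed_model

# Pool the token embeddings into a single sentence vector, then add an assumed
# two-class classification head for illustration
pooled = tf.keras.layers.GlobalAveragePooling1D()(embed_model.output)
output = tf.keras.layers.Dense(2, activation='softmax')(pooled)

custom_model = tf.keras.Model(embed_model.inputs, output)
custom_model.summary()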