Handle Numeric features¶
This feature is a experimental feature
https://github.com/BrikerMan/Kashgari/issues/90
At times there have some additional features like text formatting (italic, bold, centered),
position in text and more. Kashgari provides NumericFeaturesEmbedding
and StackedEmbedding
for this kind data. Here is the details:
If you have a dataset like this.
token=NLP start_of_p=True bold=True center=True B-Category
token=Projects start_of_p=False bold=True center=True I-Category
token=Project start_of_p=True bold=True center=False B-Project-name
token=Name start_of_p=False bold=True center=False I-Project-name
token=: start_of_p=False bold=False center=False I-Project-name
First, numerize your additional features. Convert your data to this. Remember to leave 0
for padding.
text = ['NLP', 'Projects', 'Project', 'Name', ':']
start_of_p = [1, 2, 1, 2, 2]
bold = [1, 1, 1, 1, 2]
center = [1, 1, 2, 2, 2]
label = ['B-Category', 'I-Category', 'B-Project-name', 'I-Project-name', 'I-Project-name']
Then you have four input sequence and one output sequence. Prepare your embedding layers.
import kashgari
from kashgari.embeddings import NumericFeaturesEmbedding, BareEmbedding, StackedEmbedding
import logging
logging.basicConfig(level='DEBUG')
text = ['NLP', 'Projects', 'Project', 'Name', ':']
start_of_p = [1, 2, 1, 2, 2]
bold = [1, 1, 1, 1, 2]
center = [1, 1, 2, 2, 2]
label = ['B-Category', 'I-Category', 'B-ProjectName', 'I-ProjectName', 'I-ProjectName']
text_list = [text] * 100
start_of_p_list = [start_of_p] * 100
bold_list = [bold] * 100
center_list = [center] * 100
label_list = [label] * 100
SEQUENCE_LEN = 100
# You can use Word Embedding or BERT Embedding for your text embedding
text_embedding = BareEmbedding(task=kashgari.LABELING, sequence_length=SEQUENCE_LEN)
start_of_p_embedding = NumericFeaturesEmbedding(feature_count=2,
feature_name='start_of_p',
sequence_length=SEQUENCE_LEN)
bold_embedding = NumericFeaturesEmbedding(feature_count=2,
feature_name='bold',
sequence_length=SEQUENCE_LEN)
center_embedding = NumericFeaturesEmbedding(feature_count=2,
feature_name='center',
sequence_length=SEQUENCE_LEN)
# first attribute, must be the text embedding
stack_embedding = StackedEmbedding([
text_embedding,
start_of_p_embedding,
bold_embedding,
center_embedding
])
x = (text_list, start_of_p_list, bold_list, center_list)
y = label_list
stack_embedding.analyze_corpus(x, y)
# Now we can embed using this stacked embedding layer
print(stack_embedding.embed(x))
Once the embedding layer prepared, you can use all of the classification and labeling models.
# We can build any labeling model with this embedding
from kashgari.tasks.labeling import BLSTMModel
model = BLSTMModel(embedding=stack_embedding)
model.fit(x, y)
print(model.predict(x))
print(model.predict_entities(x))
This is the structurer of this model.