embeddings

__init__

Each embedding layer has its own __init__ function; see the layer's documentation page for details.

class name                description
BareEmbedding             randomly initialized tf.keras.layers.Embedding layer for text sequence embedding
WordEmbedding             pre-trained Word2Vec embedding
BERTEmbedding             pre-trained BERT embedding
GPT2Embedding             pre-trained GPT-2 embedding
NumericFeaturesEmbedding  randomly initialized tf.keras.layers.Embedding layer for numeric feature embedding
StackedEmbedding          stacks other embeddings for multi-input models

All embedding layers share the same API except for the __init__ function.
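
For example, constructing a BareEmbedding for a classification task. This is a minimal sketch; it assumes the kashgari package and the constructor arguments task, sequence_length and embedding_size described on the BareEmbedding documentation page:

import kashgari
from kashgari.embeddings import BareEmbedding

# Randomly initialized embedding layer; sequence_length and
# embedding_size are ordinary hyperparameters.
embedding = BareEmbedding(task=kashgari.CLASSIFICATION,
                          sequence_length=100,
                          embedding_size=100)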

Properties

token_count

int, corpus token count

sequence_length

int, model sequence length

label2idx

dict, label to index dict

token2idx

dict, token to index dict

tokenizer

Built-in tokenizer of the embedding layer; available in BERTEmbedding.

Methods

analyze_corpus

Analyzes the corpus and builds the token dict and the label dict.

def analyze_corpus(self,
                   x: List[List[str]],
                   y: Union[List[List[str]], List[str]]):

Args:

  • x: Array of input data
  • y: Array of label data
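
For example, reusing the BareEmbedding sketch above with a tiny made-up corpus:

# Tokenized sentences and one label per sentence (classification).
train_x = [['Hello', 'world'], ['Hello', 'Kashgari']]
train_y = ['greeting', 'greeting']

embedding.analyze_corpus(train_x, train_y)
print(embedding.token_count)   # vocabulary size built from the corpus
print(embedding.label2idx)     # label-to-index dict built from the labels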

process_x_dataset

Batch-processes feature data into a tensor; mostly delegates to the processor's process_x_dataset function.

def process_x_dataset(self,
                      data: List[List[str]],
                      subset: Optional[List[int]] = None) -> np.ndarray:

Args:

  • data: target dataset
  • subset: optional list of indices selecting a subset of data

Returns:

  • vectorized feature tensor
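
Continuing the sketch above (output shapes depend on sequence_length, so the comment is illustrative):

tensor_x = embedding.process_x_dataset(train_x)
print(tensor_x.shape)                # e.g. (2, sequence_length)

# Vectorize only the first sample via the subset index list.
tensor_x_head = embedding.process_x_dataset(train_x, subset=[0])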

process_y_dataset

Batch-processes label data into a tensor; mostly delegates to the processor's process_y_dataset function.

def process_y_dataset(self,
                      data: List[List[str]],
                      subset: Optional[List[int]] = None) -> np.ndarray:

Args:

  • data: target dataset
  • subset: optional list of indices selecting a subset of data

Returns:

  • vectorized label tensor
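
Continuing the same sketch; the exact label encoding (e.g. one-hot for classification) depends on the task and processor, and for a classification task the labels are a flat list, as in analyze_corpus:

tensor_y = embedding.process_y_dataset(train_y)
print(tensor_y.shape)                # e.g. (2, number_of_labels)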

reverse_numerize_label_sequences

Converts numerized (index) label sequences back to their original label values.

def reverse_numerize_label_sequences(self,
                                     sequences,
                                     lengths=None):
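
A usage sketch, assuming the label dicts built by analyze_corpus above; building the index list by hand from label2idx is illustrative:

# Numerize labels by hand, then map the indices back to label strings.
idx_sequences = [embedding.label2idx[label] for label in train_y]
labels = embedding.reverse_numerize_label_sequences(idx_sequences)
print(labels)                        # e.g. ['greeting', 'greeting']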

embed

Batch-embeds sentences; use this function for feature extraction. Given input text, it returns the tensor representation.

def embed(self,
          sentence_list: Union[List[List[str]], List[List[int]]],
          debug: bool = False) -> np.ndarray:

Args:

  • sentence_list: Sentence list to embed
  • debug: Show debug info, default False

Returns:

  • A numpy array representing the embeddings
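
For example, extracting features with the embedding prepared above (the output shape depends on the embedding type and sequence_length, so the comment is illustrative):

embed_tensor = embedding.embed([['Hello', 'world'],
                                ['Hello', 'Kashgari']])
print(embed_tensor.shape)            # e.g. (2, sequence_length, embedding_size)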

embed_one

Convenience function for embedding a single sentence.

Args:

  • sentence: Target sentence, list of tokens

Returns:

  • Numpy arrays representing the embeddings
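
A short sketch following the embed example above, treating embed_one as the single-sentence counterpart of embed, per the description:

# Embed one tokenized sentence and inspect its representation.
vector = embedding.embed_one(['Hello', 'world'])
print(vector.shape)                  # e.g. (sequence_length, embedding_size)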

info

Returns a dictionary containing the configuration of the model.

def info(self) -> Dict:
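
A minimal sketch; the exact keys of the returned dictionary are not specified here, so the comment is illustrative:

config = embedding.info()
print(config)                        # configuration of the embedding, as a dict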