embeddings

__init__

Each embedding layer has its own __init__ function; see the layer's documentation page for details.

class name                description
BareEmbedding             randomly initialized tf.keras.layers.Embedding layer for text sequence embedding
WordEmbedding             pre-trained Word2Vec embedding
BERTEmbedding             pre-trained BERT embedding
GPT2Embedding             pre-trained GPT-2 embedding
NumericFeaturesEmbedding  randomly initialized tf.keras.layers.Embedding layer for numeric feature embedding
StackedEmbedding          stacks other embeddings for multi-input models

All embedding layers share the same API except for the __init__ function.
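
For example, constructing a BareEmbedding for a classification task. This is a minimal sketch; it assumes the kashgari package and the constructor arguments task, sequence_length and embedding_size described on the BareEmbedding documentation page:

import kashgari
from kashgari.embeddings import BareEmbedding

# Randomly initialized embedding layer; sequence_length and
# embedding_size are ordinary hyperparameters.
embedding = BareEmbedding(task=kashgari.CLASSIFICATION,
                          sequence_length=100,
                          embedding_size=100)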

Properties

token_count

int, corpus token count

sequence_length

int, model sequence length

label2idx

dict, label to index dict

token2idx

dict, token to index dict

tokenizer

Built-in tokenizer of the embedding layer; available in BERTEmbedding.

Methods

analyze_corpus

Analyzes the corpus and builds the token dict and the label dict.

def analyze_corpus(self,
                   x: List[List[str]],
                   y: Union[List[List[str]], List[str]]):

Args:

  • x: Array of input data
  • y: Array of label data
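
For example, reusing the BareEmbedding sketch above with a tiny made-up corpus:

# Tokenized sentences and one label per sentence (classification).
train_x = [['Hello', 'world'], ['Hello', 'Kashgari']]
train_y = ['greeting', 'greeting']

embedding.analyze_corpus(train_x, train_y)
print(embedding.token_count)   # vocabulary size built from the corpus
print(embedding.label2idx)     # label-to-index dict built from the labels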

process_x_dataset

Batch-processes feature data into a tensor; mostly delegates to the processor's process_x_dataset function.

def process_x_dataset(self,
                      data: List[List[str]],
                      subset: Optional[List[int]] = None) -> np.ndarray:

Args:

  • data: target dataset
  • subset: optional list of indices selecting a subset of data

Returns:

  • vectorized feature tensor
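
Continuing the sketch above (output shapes depend on sequence_length, so the comment is illustrative):

tensor_x = embedding.process_x_dataset(train_x)
print(tensor_x.shape)                # e.g. (2, sequence_length)

# Vectorize only the first sample via the subset index list.
tensor_x_head = embedding.process_x_dataset(train_x, subset=[0])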

process_y_dataset

Batch-processes label data into a tensor; mostly delegates to the processor's process_y_dataset function.

def process_y_dataset(self,
                      data: List[List[str]],
                      subset: Optional[List[int]] = None) -> np.ndarray:

Args:

  • data: target dataset
  • subset: optional list of indices selecting a subset of data

Returns:

  • vectorized label tensor
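
Continuing the same sketch; the exact label encoding (e.g. one-hot for classification) depends on the task and processor, and for a classification task the labels are a flat list, as in analyze_corpus:

tensor_y = embedding.process_y_dataset(train_y)
print(tensor_y.shape)                # e.g. (2, number_of_labels)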

reverse_numerize_label_sequences

Converts numerized (index) label sequences back to their original label values.

def reverse_numerize_label_sequences(self,
                                     sequences,
                                     lengths=None):
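
A usage sketch, assuming the label dicts built by analyze_corpus above; building the index list by hand from label2idx is illustrative:

# Numerize labels by hand, then map the indices back to label strings.
idx_sequences = [embedding.label2idx[label] for label in train_y]
labels = embedding.reverse_numerize_label_sequences(idx_sequences)
print(labels)                        # e.g. ['greeting', 'greeting']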

embed

Batch-embeds sentences; use this function for feature extraction. Given input text, it returns the tensor representation.

def embed(self,
          sentence_list: Union[List[List[str]], List[List[int]]],
          debug: bool = False) -> np.ndarray:

Args:

  • sentence_list: Sentence list to embed
  • debug: Show debug info, default False

Returns:

  • A numpy array representing the embeddings
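
For example, extracting features with the embedding prepared above (the output shape depends on the embedding type and sequence_length, so the comment is illustrative):

embed_tensor = embedding.embed([['Hello', 'world'],
                                ['Hello', 'Kashgari']])
print(embed_tensor.shape)            # e.g. (2, sequence_length, embedding_size)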

embed_one

Convenience function for embedding a single sentence.

Args:

  • sentence: Target sentence, list of tokens

Returns:

  • Numpy arrays representing the embeddings
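
A short sketch following the embed example above, treating embed_one as the single-sentence counterpart of embed, per the description:

# Embed one tokenized sentence and inspect its representation.
vector = embedding.embed_one(['Hello', 'world'])
print(vector.shape)                  # e.g. (sequence_length, embedding_size)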

info

Returns a dictionary containing the configuration of the model.

def info(self) -> Dict:
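
A minimal sketch; the exact keys of the returned dictionary are not specified here, so the comment is illustrative:

config = embedding.info()
print(config)                        # configuration of the embedding, as a dict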