embeddings¶
__init__¶
Each embedding layer has its own __init__ function; see the documentation page for each class for details.
Class name | Description |
---|---|
BareEmbedding | Randomly initialized tf.keras.layers.Embedding layer for text sequence embedding |
WordEmbedding | Pre-trained Word2Vec embedding |
BERTEmbedding | Pre-trained BERT embedding |
GPT2Embedding | Pre-trained GPT-2 embedding |
NumericFeaturesEmbedding | Randomly initialized tf.keras.layers.Embedding layer for numeric feature embedding |
StackedEmbedding | Stacks other embeddings for multi-input models |
All embedding layers share the same API, except for the __init__ function.
Properties¶
token_count¶
int, corpus token count
sequence_length¶
int, model sequence length
label2idx¶
dict, label to index dict
tokenizer¶
Built-in tokenizer of the embedding layer, available in BERTEmbedding.
Methods¶
analyze_corpus¶
Analyze the data and build the token dict and the label dict.
def analyze_corpus(self,
x: List[List[str]],
y: Union[List[List[str]], List[str]]):
Args:
- x: Array of input data
- y: Array of label data
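The following is a minimal sketch of what "building the token dict and label dict" can look like. It is not Kashgari's implementation: the helper name `build_vocab` and the reserved `<PAD>` and `<UNK>` entries are assumptions made for illustration, and it only handles the case where each sample has a single label.

```python
from typing import Dict, List, Tuple


def build_vocab(x: List[List[str]],
                y: List[str]) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Hypothetical helper: build token and label dicts from a corpus."""
    # Reserve the lowest indices for padding and unknown tokens
    # (an assumed convention, not taken from the library).
    token2idx = {"<PAD>": 0, "<UNK>": 1}
    for sentence in x:
        for token in sentence:
            if token not in token2idx:
                token2idx[token] = len(token2idx)

    # Labels get indices in order of first appearance.
    label2idx: Dict[str, int] = {}
    for label in y:
        if label not in label2idx:
            label2idx[label] = len(label2idx)
    return token2idx, label2idx


x = [["hello", "world"], ["hello", "there"]]
y = ["greeting", "greeting"]
token2idx, label2idx = build_vocab(x, y)
token_count = len(token2idx)  # size of the token dict, special tokens included
```

In this sketch, `token_count` and `label2idx` play the same role as the properties of the same name listed above.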
process_x_dataset¶
Batch-process feature data into a tensor. In most cases this calls the processor's process_x_dataset function to handle the data.
def process_x_dataset(self,
data: List[List[str]],
subset: Optional[List[int]] = None) -> np.ndarray:
Args:
- data: target dataset
- subset: subset index list
Returns:
- vectorized feature tensor
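As a rough illustration of the vectorization step, the sketch below maps tokens to indices, optionally selects a subset of samples, and pads or truncates each row to a fixed length. The function name `numericalize_x`, the `<PAD>`/`<UNK>` entries, and the use of plain lists instead of `np.ndarray` are assumptions chosen to keep the example dependency-free; the real processor may behave differently.

```python
from typing import Dict, List, Optional


def numericalize_x(data: List[List[str]],
                   token2idx: Dict[str, int],
                   sequence_length: int,
                   subset: Optional[List[int]] = None) -> List[List[int]]:
    """Hypothetical helper: turn token sequences into fixed-length index rows."""
    # Keep only the requested sample indices, if any were given.
    if subset is not None:
        data = [data[i] for i in subset]

    pad = token2idx["<PAD>"]
    unk = token2idx["<UNK>"]
    rows = []
    for sentence in data:
        # Unknown tokens fall back to <UNK>; long sentences are truncated.
        ids = [token2idx.get(tok, unk) for tok in sentence][:sequence_length]
        # Short sentences are right-padded with <PAD>.
        ids += [pad] * (sequence_length - len(ids))
        rows.append(ids)
    return rows


token2idx = {"<PAD>": 0, "<UNK>": 1, "hello": 2, "world": 3}
data = [["hello", "world"],
        ["hello", "unseen", "world", "hello", "world"]]

rows = numericalize_x(data, token2idx, sequence_length=4)
subset_rows = numericalize_x(data, token2idx, sequence_length=4, subset=[1])
```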
process_y_dataset¶
Batch-process label data into a tensor. In most cases this calls the processor's process_y_dataset function to handle the data.
def process_y_dataset(self,
data: List[List[str]],
subset: Optional[List[int]] = None) -> np.ndarray:
Args:
- data: target dataset
- subset: subset index list
Returns:
- vectorized label tensor
reverse_numerize_label_sequences¶
def reverse_numerize_label_sequences(self,
sequences,
lengths=None):
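This page gives no description for this method. Judging only by its name and parameters, it appears to convert numeric label sequences back into label strings, with `lengths` used to drop padded positions. The sketch below illustrates that reading; the helper `reverse_labels` and the `idx2label` dict are hypothetical and not part of the documented API.

```python
from typing import Dict, List, Optional


def reverse_labels(sequences: List[List[int]],
                   idx2label: Dict[int, str],
                   lengths: Optional[List[int]] = None) -> List[List[str]]:
    """Hypothetical helper: map label indices back to label strings."""
    result = []
    for i, seq in enumerate(sequences):
        # If the true lengths are known, cut off the padded tail first.
        if lengths is not None:
            seq = seq[:lengths[i]]
        result.append([idx2label[idx] for idx in seq])
    return result


label2idx = {"O": 0, "B-PER": 1, "I-PER": 2}
# Invert the label dict so indices can be looked up.
idx2label = {idx: label for label, idx in label2idx.items()}

decoded = reverse_labels([[1, 2, 0, 0], [0, 0, 0, 0]],
                         idx2label,
                         lengths=[3, 2])
```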
embed¶
Batch-embed sentences. Use this function for feature extraction: pass in tokenized text and get its tensor representation back.
def embed(self,
sentence_list: Union[List[List[str]], List[List[int]]],
debug: bool = False) -> np.ndarray:
Args:
- sentence_list: Sentence list to embed
- debug: Show debug info, default False
Returns:
- A list of numpy arrays representing the embeddings
embed_one¶
Convenience function for embedding a single sentence.
Args:
- sentence: Target sentence, list of tokens
Returns:
- Numpy arrays representing the embeddings
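One plausible way to relate embed_one to embed is to wrap the sentence in a one-element batch and unwrap the result. The toy class below sketches that pattern; `ToyEmbedding`, its lookup table, and the zero-vector fallback are all assumptions for illustration and do not reflect how the library's embedding classes are implemented.

```python
from typing import Dict, List


class ToyEmbedding:
    """Hypothetical stand-in for an embedding layer with a fixed lookup table."""

    def __init__(self, vectors: Dict[str, List[float]], dim: int):
        self.vectors = vectors
        self.dim = dim

    def embed(self, sentence_list: List[List[str]]) -> List[List[List[float]]]:
        # Look up each token; unknown tokens get a zero vector (assumed behavior).
        zero = [0.0] * self.dim
        return [[self.vectors.get(tok, zero) for tok in sentence]
                for sentence in sentence_list]

    def embed_one(self, sentence: List[str]) -> List[List[float]]:
        # Treat the sentence as a batch of size one, then unwrap it.
        return self.embed([sentence])[0]


emb = ToyEmbedding({"hello": [1.0, 0.0], "world": [0.0, 1.0]}, dim=2)
batch = emb.embed([["hello", "world"], ["world"]])
single = emb.embed_one(["hello", "unseen"])
```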