Embeddings
BareEmbedding

class kashgari.embeddings.BareEmbedding(embedding_size: int = 100, **kwargs)

Bases: kashgari.embeddings.abc_embedding.ABCEmbedding

BareEmbedding is a randomly initialized tf.keras.layers.Embedding layer for text sequence embedding, and the default embedding class for Kashgari models.
__init__(embedding_size: int = 100, **kwargs)

Parameters:
- embedding_size – Dimension of the dense embedding.
- kwargs – additional params.
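As a minimal usage sketch, assuming Kashgari's BiLSTM_Model task class and a toy corpus (both illustrative, not part of this class's API):

```python
from kashgari.embeddings import BareEmbedding
from kashgari.tasks.classification import BiLSTM_Model  # assumed task model

# Toy corpus for illustration only.
train_x = [['Hello', 'world'], ['Kashgari', 'is', 'simple']]
train_y = ['greeting', 'statement']

# Randomly initialized embedding with 100-dimensional vectors.
embedding = BareEmbedding(embedding_size=100)

# Fitting the model builds the embedding's vocabulary from the corpus.
model = BiLSTM_Model(embedding=embedding)
model.fit(train_x, train_y)
```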
embed(sentences: List[List[str]], *, debug: bool = False) → numpy.ndarray

Batch embed sentences.

Parameters:
- sentences – sentence list to embed.
- debug – show debug info.

Returns: vectorized sentence list.
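A short sketch of batch embedding, assuming the embedding from the previous example has already built its vocabulary (for instance by fitting a model first):

```python
# `embedding` is the fitted BareEmbedding from the sketch above.
vectors = embedding.embed([
    ['Hello', 'world'],
    ['Kashgari', 'is', 'simple'],
])
print(vectors.shape)  # e.g. (2, seq_len, embedding_size)
```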
get_seq_length_from_corpus(generators: List[kashgari.generators.CorpusGenerator], *, use_label: bool = False, cover_rate: float = 0.95) → int

Calculate a proper sequence length according to the corpus.

Parameters:
- generators – corpus generators to scan.
- use_label – measure label sequences instead of text sequences.
- cover_rate – fraction of samples the returned length should cover, e.g. 0.95 covers 95% of samples.

Returns: recommended sequence length.
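A sketch of estimating a sequence length; it assumes CorpusGenerator wraps parallel text/label lists (the exact constructor is an assumption here):

```python
from kashgari.generators import CorpusGenerator

# Assumption: CorpusGenerator(x, y) wraps parallel text/label lists.
gen = CorpusGenerator(train_x, train_y)

# Pick a length that covers 95% of the samples in the corpus.
seq_len = embedding.get_seq_length_from_corpus([gen], cover_rate=0.95)
```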
load_embed_vocab() → Optional[Dict[str, int]]

Load the vocab dict from the embedding layer.

Returns: vocab dict, or None.
setup_text_processor(processor: kashgari.processors.abc_processor.ABCProcessor) → None

Bind a text processor to this embedding.
to_dict() → Dict[str, Any]

Serialize the embedding's configuration to a dict.
WordEmbedding

class kashgari.embeddings.WordEmbedding(w2v_path: str, *, w2v_kwargs: Dict[str, Any] = None, **kwargs)

Bases: kashgari.embeddings.abc_embedding.ABCEmbedding

WordEmbedding loads pre-trained word vectors in Word2Vec format via gensim's KeyedVectors.
__init__(w2v_path: str, *, w2v_kwargs: Dict[str, Any] = None, **kwargs)

Parameters:
- w2v_path – Word2Vec file path.
- w2v_kwargs – params passed to the load_word2vec_format() function of gensim.models.KeyedVectors.
- kwargs – additional params.
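A construction sketch; the file path is a placeholder, and the w2v_kwargs shown (binary, limit) are keyword arguments of gensim's KeyedVectors.load_word2vec_format():

```python
from kashgari.embeddings import WordEmbedding

# '/path/to/vectors.txt' is a placeholder for your own Word2Vec file.
embedding = WordEmbedding(
    w2v_path='/path/to/vectors.txt',
    w2v_kwargs={'binary': False, 'limit': 100000},
)
```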
embed(sentences: List[List[str]], *, debug: bool = False) → numpy.ndarray

Batch embed sentences.

Parameters:
- sentences – sentence list to embed.
- debug – show debug info.

Returns: vectorized sentence list.
get_seq_length_from_corpus(generators: List[kashgari.generators.CorpusGenerator], *, use_label: bool = False, cover_rate: float = 0.95) → int

Calculate a proper sequence length according to the corpus.

Parameters:
- generators – corpus generators to scan.
- use_label – measure label sequences instead of text sequences.
- cover_rate – fraction of samples the returned length should cover, e.g. 0.95 covers 95% of samples.

Returns: recommended sequence length.
load_embed_vocab() → Optional[Dict[str, int]]

Load the vocab dict from the embedding layer.

Returns: vocab dict, or None.
setup_text_processor(processor: kashgari.processors.abc_processor.ABCProcessor) → None

Bind a text processor to this embedding.
TransformerEmbedding

class kashgari.embeddings.TransformerEmbedding(vocab_path: str, config_path: str, checkpoint_path: str, model_type: str = 'bert', **kwargs)

Bases: kashgari.embeddings.abc_embedding.ABCEmbedding

TransformerEmbedding is based on bert4keras. The embedding itself is wrapped into Kashgari's simple embedding interface, so it can be used like any other embedding.
__init__(vocab_path: str, config_path: str, checkpoint_path: str, model_type: str = 'bert', **kwargs)

Parameters:
- vocab_path – vocab file path, e.g. vocab.txt.
- config_path – model config path, e.g. config.json.
- checkpoint_path – model weight path, e.g. model.ckpt-100000.
- model_type – transformer model type, one of {bert, albert, nezha, gpt2_ml, t5}.
- kwargs – additional params.
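A construction sketch; all paths are placeholders following the file layout of Google's released BERT checkpoints:

```python
from kashgari.embeddings import TransformerEmbedding

# Placeholder paths; point these at your own downloaded checkpoint.
embedding = TransformerEmbedding(
    vocab_path='/models/bert/vocab.txt',
    config_path='/models/bert/bert_config.json',
    checkpoint_path='/models/bert/bert_model.ckpt',
    model_type='bert',
)
```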
embed(sentences: List[List[str]], *, debug: bool = False) → numpy.ndarray

Batch embed sentences.

Parameters:
- sentences – sentence list to embed.
- debug – show debug info.

Returns: vectorized sentence list.
get_seq_length_from_corpus(generators: List[kashgari.generators.CorpusGenerator], *, use_label: bool = False, cover_rate: float = 0.95) → int

Calculate a proper sequence length according to the corpus.

Parameters:
- generators – corpus generators to scan.
- use_label – measure label sequences instead of text sequences.
- cover_rate – fraction of samples the returned length should cover, e.g. 0.95 covers 95% of samples.

Returns: recommended sequence length.
load_embed_vocab() → Optional[Dict[str, int]]

Load the vocab dict from the embedding layer.

Returns: vocab dict, or None.
setup_text_processor(processor: kashgari.processors.abc_processor.ABCProcessor) → None

Bind a text processor to this embedding.
BertEmbedding

class kashgari.embeddings.BertEmbedding(model_folder: str, **kwargs)

Bases: kashgari.embeddings.transformer_embedding.TransformerEmbedding

BertEmbedding is a simple wrapper around TransformerEmbedding. If you need to load another kind of transformer-based language model, please use TransformerEmbedding instead.
__init__(model_folder: str, **kwargs)

Parameters:
- model_folder – path of the checkpoint folder.
- kwargs – additional params.
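A construction sketch; the folder name is a placeholder and is assumed to contain the standard files of a BERT release (vocab.txt, bert_config.json, bert_model.ckpt):

```python
from kashgari.embeddings import BertEmbedding

# Placeholder folder; it should hold vocab.txt, bert_config.json
# and bert_model.ckpt from a standard BERT checkpoint release.
embedding = BertEmbedding(model_folder='/models/chinese_L-12_H-768_A-12')
```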
build_embedding_model(*, vocab_size: int = None, force: bool = False, **kwargs) → None

Build the underlying embedding model.
embed(sentences: List[List[str]], *, debug: bool = False) → numpy.ndarray

Batch embed sentences.

Parameters:
- sentences – sentence list to embed.
- debug – show debug info.

Returns: vectorized sentence list.
get_seq_length_from_corpus(generators: List[kashgari.generators.CorpusGenerator], *, use_label: bool = False, cover_rate: float = 0.95) → int

Calculate a proper sequence length according to the corpus.

Parameters:
- generators – corpus generators to scan.
- use_label – measure label sequences instead of text sequences.
- cover_rate – fraction of samples the returned length should cover, e.g. 0.95 covers 95% of samples.

Returns: recommended sequence length.
load_embed_vocab() → Optional[Dict[str, int]]

Load the vocab dict from the embedding layer.

Returns: vocab dict, or None.
setup_text_processor(processor: kashgari.processors.abc_processor.ABCProcessor) → None

Bind a text processor to this embedding.