Embeddings

BareEmbedding

class kashgari.embeddings.BareEmbedding(embedding_size: int = 100, **kwargs)[source]

Bases: kashgari.embeddings.abc_embedding.ABCEmbedding

BareEmbedding is a randomly initialized tf.keras.layers.Embedding layer for text sequence embedding. It is the default embedding class for Kashgari models.

__init__(embedding_size: int = 100, **kwargs)[source]
Parameters:
  • embedding_size – Dimension of the dense embedding.
  • kwargs – additional params
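
A minimal usage sketch (assuming the Kashgari 2.x task-model API, where task models such as kashgari.tasks.classification.BiLSTM_Model accept an embedding argument; the corpus below is a toy example):

    from kashgari.embeddings import BareEmbedding
    from kashgari.tasks.classification import BiLSTM_Model

    # Randomly initialized embedding layer; its weights are learned during model training.
    embedding = BareEmbedding(embedding_size=100)

    # Toy corpus, for illustration only.
    train_x = [['hello', 'world'], ['good', 'morning']]
    train_y = ['greeting', 'greeting']

    # Pass the embedding to a task model like any other Kashgari embedding.
    model = BiLSTM_Model(embedding=embedding)
    model.fit(train_x, train_y, epochs=1)
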
build_embedding_model(*, vocab_size: int = None, force: bool = False, **kwargs) → None[source]
embed(sentences: List[List[str]], *, debug: bool = False) → numpy.ndarray

batch embed sentences

Parameters:
  • sentences – Sentence list to embed
  • debug – show debug info
Returns: vectorized sentence list
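
A sketch of calling embed, assuming the embedding has already been attached to a task model (which sets up its text processor) or otherwise had setup_text_processor called; the batch and shapes are illustrative:

    # Embed a small batch of tokenized sentences.
    sentences = [['hello', 'world'], ['good', 'morning', 'everyone']]
    vectors = embedding.embed(sentences)

    # Result is a numpy array of shape (batch_size, sequence_length, embedding_size),
    # e.g. roughly (2, 3, 100) for the toy batch above with embedding_size=100.
    print(vectors.shape)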

get_seq_length_from_corpus(generators: List[kashgari.generators.CorpusGenerator], *, use_label: bool = False, cover_rate: float = 0.95) → int

Calculate proper sequence length according to the corpus

Parameters:
  • generators – corpus generators to calculate the sequence length from
  • use_label – use the label sequences instead of the text sequences when measuring length
  • cover_rate – proportion of corpus samples that the returned sequence length should cover

Returns: recommended sequence length

load_embed_vocab() → Optional[Dict[str, int]][source]

Load vocab dict from embedding layer

Returns: vocab dict or None

setup_text_processor(processor: kashgari.processors.abc_processor.ABCProcessor) → None
to_dict() → Dict[str, Any]

WordEmbedding

class kashgari.embeddings.WordEmbedding(w2v_path: str, *, w2v_kwargs: Dict[str, Any] = None, **kwargs)[source]

Bases: kashgari.embeddings.abc_embedding.ABCEmbedding

__init__(w2v_path: str, *, w2v_kwargs: Dict[str, Any] = None, **kwargs)[source]
Parameters:
  • w2v_path – Word2Vec file path.
  • w2v_kwargs – parameters passed to the load_word2vec_format() function of gensim.models.KeyedVectors
  • kwargs – additional params
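
A construction sketch (the path is a placeholder; binary and limit are standard keyword arguments of gensim's KeyedVectors.load_word2vec_format):

    from kashgari.embeddings import WordEmbedding

    # Load pre-trained word2vec vectors; w2v_kwargs are forwarded to
    # gensim.models.KeyedVectors.load_word2vec_format().
    embedding = WordEmbedding(
        w2v_path='/path/to/word2vec.bin',              # placeholder path
        w2v_kwargs={'binary': True, 'limit': 10000},   # binary format, load only the first 10k vectors
    )
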
build_embedding_model(*, vocab_size: int = None, force: bool = False, **kwargs) → None[source]
embed(sentences: List[List[str]], *, debug: bool = False) → numpy.ndarray

batch embed sentences

Parameters:
  • sentences – Sentence list to embed
  • debug – show debug info
Returns: vectorized sentence list

get_seq_length_from_corpus(generators: List[kashgari.generators.CorpusGenerator], *, use_label: bool = False, cover_rate: float = 0.95) → int

Calculate proper sequence length according to the corpus

Parameters:
  • generators – corpus generators to calculate the sequence length from
  • use_label – use the label sequences instead of the text sequences when measuring length
  • cover_rate – proportion of corpus samples that the returned sequence length should cover

Returns: recommended sequence length

load_embed_vocab() → Optional[Dict[str, int]][source]

Load vocab dict from embedding layer

Returns: vocab dict or None

setup_text_processor(processor: kashgari.processors.abc_processor.ABCProcessor) → None
to_dict() → Dict[str, Any][source]

TransformerEmbedding

class kashgari.embeddings.TransformerEmbedding(vocab_path: str, config_path: str, checkpoint_path: str, model_type: str = 'bert', **kwargs)[source]

Bases: kashgari.embeddings.abc_embedding.ABCEmbedding

TransformerEmbedding is based on bert4keras. The embedding itself is wrapped in Kashgari's simple embedding interface so that it can be used like any other Kashgari embedding.

__init__(vocab_path: str, config_path: str, checkpoint_path: str, model_type: str = 'bert', **kwargs)[source]
Parameters:
  • vocab_path – vocab file path, e.g. vocab.txt
  • config_path – model config file path, e.g. config.json
  • checkpoint_path – model checkpoint (weights) path, e.g. model.ckpt-100000
  • model_type – transformer model type, one of {bert, albert, nezha, gpt2_ml, t5}
  • kwargs – additional params
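
A construction sketch (the paths are placeholders pointing at a downloaded BERT checkpoint, e.g. Google's chinese_L-12_H-768_A-12 release):

    from kashgari.embeddings import TransformerEmbedding

    embedding = TransformerEmbedding(
        vocab_path='/path/to/bert/vocab.txt',
        config_path='/path/to/bert/bert_config.json',
        checkpoint_path='/path/to/bert/bert_model.ckpt',
        model_type='bert',   # or 'albert', 'nezha', 'gpt2_ml', 't5'
    )
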
build_embedding_model(*, vocab_size: int = None, force: bool = False, **kwargs) → None[source]
embed(sentences: List[List[str]], *, debug: bool = False) → numpy.ndarray

batch embed sentences

Parameters:
  • sentences – Sentence list to embed
  • debug – show debug info
Returns: vectorized sentence list

get_seq_length_from_corpus(generators: List[kashgari.generators.CorpusGenerator], *, use_label: bool = False, cover_rate: float = 0.95) → int

Calculate proper sequence length according to the corpus

Parameters:
  • generators – corpus generators to calculate the sequence length from
  • use_label – use the label sequences instead of the text sequences when measuring length
  • cover_rate – proportion of corpus samples that the returned sequence length should cover

Returns: recommended sequence length

load_embed_vocab() → Optional[Dict[str, int]][source]

Load vocab dict from embedding layer

Returns: vocab dict or None

setup_text_processor(processor: kashgari.processors.abc_processor.ABCProcessor) → None
to_dict() → Dict[str, Any][source]

BertEmbedding

class kashgari.embeddings.BertEmbedding(model_folder: str, **kwargs)[source]

Bases: kashgari.embeddings.transformer_embedding.TransformerEmbedding

BertEmbedding is a simple wrapper class around TransformerEmbedding. If you need to load another kind of transformer-based language model, please use TransformerEmbedding.

__init__(model_folder: str, **kwargs)[source]
Parameters:
  • model_folder – path to the checkpoint folder.
  • kwargs – additional params
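
A construction sketch (the folder path is a placeholder; it is assumed to contain the checkpoint's vocab, config, and model weight files, as in a standard BERT release):

    from kashgari.embeddings import BertEmbedding

    # Equivalent to a TransformerEmbedding with model_type='bert', but only the
    # checkpoint folder needs to be supplied.
    embedding = BertEmbedding(model_folder='/path/to/chinese_L-12_H-768_A-12')
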
build_embedding_model(*, vocab_size: int = None, force: bool = False, **kwargs) → None
embed(sentences: List[List[str]], *, debug: bool = False) → numpy.ndarray

batch embed sentences

Parameters:
  • sentences – Sentence list to embed
  • debug – show debug info
Returns: vectorized sentence list

get_seq_length_from_corpus(generators: List[kashgari.generators.CorpusGenerator], *, use_label: bool = False, cover_rate: float = 0.95) → int

Calculate proper sequence length according to the corpus

Parameters:
  • generators – corpus generators to calculate the sequence length from
  • use_label – use the label sequences instead of the text sequences when measuring length
  • cover_rate – proportion of corpus samples that the returned sequence length should cover

Returns: recommended sequence length

load_embed_vocab() → Optional[Dict[str, int]]

Load vocab dict from embedding layer

Returns: vocab dict or None

setup_text_processor(processor: kashgari.processors.abc_processor.ABCProcessor) → None
to_dict() → Dict[str, Any][source]