Embeddings

BareEmbedding

class kashgari.embeddings.BareEmbedding(embedding_size=100, **kwargs)[source]

Bases: kashgari.embeddings.abc_embedding.ABCEmbedding

BareEmbedding is a randomly initialized tf.keras.layers.Embedding layer for text sequence embedding, and the default embedding class for kashgari models.

__init__(embedding_size=100, **kwargs)[source]
Parameters
  • embedding_size (int) – Dimension of the dense embedding.

  • kwargs (Any) – additional params

build_embedding_model(*, vocab_size=None, force=False, **kwargs)[source]
Parameters
  • vocab_size (Optional[int]) –

  • force (bool) –

  • kwargs (Dict) –

Return type

None

embed(sentences, *, debug=False)

batch embed sentences

Parameters
  • sentences (List[List[str]]) – Sentence list to embed

  • debug (bool) – show debug info

Returns

vectorized sentence list

Return type

numpy.ndarray
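Conceptually, embed() maps each token of each sentence to a row of the embedding matrix and returns the stacked vectors as a NumPy array. A minimal self-contained sketch in plain NumPy (not kashgari's implementation, which pads via its text processor and runs the tf.keras model):

```python
import numpy as np

# Sketch of batch token embedding with a randomly initialised lookup
# table; names below are illustrative, not kashgari internals.
rng = np.random.default_rng(seed=0)
vocab = {"<pad>": 0, "hello": 1, "world": 2}
embedding_size = 100
table = rng.normal(size=(len(vocab), embedding_size))

def embed_batch(sentences, seq_len=4):
    # Pad/truncate every sentence to seq_len, then look up each token's row.
    ids = [
        [vocab[t] for t in s][:seq_len] + [0] * max(0, seq_len - len(s))
        for s in sentences
    ]
    return table[np.asarray(ids)]  # shape: (batch, seq_len, embedding_size)

vectors = embed_batch([["hello", "world"], ["hello"]])
print(vectors.shape)  # (2, 4, 100)
```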

get_seq_length_from_corpus(generators, *, use_label=False, cover_rate=0.95)

Calculate proper sequence length according to the corpus

Parameters
  • generators – corpus data generators

  • use_label (bool) –

  • cover_rate (float) –

Return type

int
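A sketch of the cover_rate idea, as inferred from the signature (kashgari's exact rule may differ): pick the smallest sequence length that fully covers cover_rate of the corpus samples.

```python
import math

def seq_length_for_cover_rate(lengths, cover_rate=0.95):
    # Hypothetical helper: smallest length covering `cover_rate` of the
    # samples. Not kashgari's code; semantics assumed from the signature.
    ordered = sorted(lengths)
    index = min(len(ordered) - 1, math.ceil(cover_rate * len(ordered)) - 1)
    return ordered[index]

lengths = list(range(1, 21))                # corpus sentence lengths 1..20
print(seq_length_for_cover_rate(lengths))  # 19: covers 19/20 = 95% of samples
```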

load_embed_vocab()[source]

Load vocab dict from embedding layer

Returns

vocab dict or None

Return type

Optional[Dict[str, int]]

setup_text_processor(processor)
Parameters

processor (kashgari.processors.abc_processor.ABCProcessor) –

Return type

None

to_dict()
Return type

Dict[str, Any]

WordEmbedding

class kashgari.embeddings.WordEmbedding(w2v_path, *, w2v_kwargs=None, **kwargs)[source]

Bases: kashgari.embeddings.abc_embedding.ABCEmbedding

__init__(w2v_path, *, w2v_kwargs=None, **kwargs)[source]
Parameters
  • w2v_path (str) – Word2Vec file path.

  • w2v_kwargs (Optional[Dict[str, Any]]) – params passed to the load_word2vec_format() function of gensim.models.KeyedVectors

  • kwargs (Any) – additional params

build_embedding_model(*, vocab_size=None, force=False, **kwargs)[source]
Parameters
  • vocab_size (Optional[int]) –

  • force (bool) –

  • kwargs (Dict) –

Return type

None

embed(sentences, *, debug=False)

batch embed sentences

Parameters
  • sentences (List[List[str]]) – Sentence list to embed

  • debug (bool) – show debug info

Returns

vectorized sentence list

Return type

numpy.ndarray

get_seq_length_from_corpus(generators, *, use_label=False, cover_rate=0.95)

Calculate proper sequence length according to the corpus

Parameters
  • generators – corpus data generators

  • use_label (bool) –

  • cover_rate (float) –

Return type

int

load_embed_vocab()[source]

Load vocab dict from embedding layer

Returns

vocab dict or None

Return type

Optional[Dict[str, int]]
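For WordEmbedding, the vocab dict comes from the pretrained vectors. As an illustration of what a token-to-index mapping from a word2vec *text-format* file looks like, here is a hedged sketch with made-up file contents (kashgari loads via gensim and also reserves indices for special tokens, so real indices will differ):

```python
import io

# Fake word2vec text-format content: "<vocab_size> <vector_size>" header,
# then one "<token> <floats...>" line per word.
w2v_text = io.StringIO(
    "3 4\n"
    "the 0.1 0.2 0.3 0.4\n"
    "cat 0.5 0.6 0.7 0.8\n"
    "sat 0.9 1.0 1.1 1.2\n"
)

header = w2v_text.readline().split()
vocab_size, vector_size = int(header[0]), int(header[1])

vocab = {}
vectors = []
for index, line in enumerate(w2v_text):
    parts = line.rstrip().split(" ")
    vocab[parts[0]] = index
    vectors.append([float(x) for x in parts[1:]])

print(vocab)  # {'the': 0, 'cat': 1, 'sat': 2}
```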

setup_text_processor(processor)
Parameters

processor (kashgari.processors.abc_processor.ABCProcessor) –

Return type

None

to_dict()[source]
Return type

Dict[str, Any]

TransformerEmbedding

class kashgari.embeddings.TransformerEmbedding(vocab_path, config_path, checkpoint_path, model_type='bert', **kwargs)[source]

Bases: kashgari.embeddings.abc_embedding.ABCEmbedding

TransformerEmbedding is based on bert4keras. The embedding itself is wrapped in Kashgari's simple embedding interface so that it can be used like any other embedding.

__init__(vocab_path, config_path, checkpoint_path, model_type='bert', **kwargs)[source]
Parameters
  • vocab_path (str) – vocab file path, e.g. vocab.txt

  • config_path (str) – model config path, e.g. config.json

  • checkpoint_path (str) – model weight path, e.g. model.ckpt-100000

  • model_type (str) – transformer model type, one of {bert, albert, nezha, gpt2_ml, t5}

  • kwargs (Any) – additional params
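A hedged sketch of wiring up these arguments for a downloaded BERT checkpoint. The folder and file names below are assumptions based on Google's BERT release layout, not kashgari requirements; adjust them to the checkpoint you actually use.

```python
from pathlib import Path

# Hypothetical checkpoint folder; file names follow the standard BERT
# release layout and may differ for other checkpoints.
model_folder = Path("chinese_L-12_H-768_A-12")

kwargs = dict(
    vocab_path=str(model_folder / "vocab.txt"),
    config_path=str(model_folder / "bert_config.json"),
    checkpoint_path=str(model_folder / "bert_model.ckpt"),
    model_type="bert",  # or albert / nezha / gpt2_ml / t5
)

# embedding = kashgari.embeddings.TransformerEmbedding(**kwargs)
print(kwargs["vocab_path"])
```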

build_embedding_model(*, vocab_size=None, force=False, **kwargs)[source]
Parameters
  • vocab_size (Optional[int]) –

  • force (bool) –

  • kwargs (Dict) –

Return type

None

embed(sentences, *, debug=False)

batch embed sentences

Parameters
  • sentences (List[List[str]]) – Sentence list to embed

  • debug (bool) – show debug info

Returns

vectorized sentence list

Return type

numpy.ndarray

get_seq_length_from_corpus(generators, *, use_label=False, cover_rate=0.95)

Calculate proper sequence length according to the corpus

Parameters
  • generators – corpus data generators

  • use_label (bool) –

  • cover_rate (float) –

Return type

int

load_embed_vocab()[source]

Load vocab dict from embedding layer

Returns

vocab dict or None

Return type

Optional[Dict[str, int]]

setup_text_processor(processor)
Parameters

processor (kashgari.processors.abc_processor.ABCProcessor) –

Return type

None

to_dict()[source]
Return type

Dict[str, Any]

BertEmbedding

class kashgari.embeddings.BertEmbedding(model_folder, **kwargs)[source]

Bases: kashgari.embeddings.transformer_embedding.TransformerEmbedding

BertEmbedding is a simple wrapper class around TransformerEmbedding. If you need to load another kind of transformer-based language model, use TransformerEmbedding directly.

__init__(model_folder, **kwargs)[source]
Parameters
  • model_folder (str) – path of checkpoint folder.

  • kwargs (Any) – additional params
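Unlike TransformerEmbedding, BertEmbedding takes only the checkpoint folder and locates the vocab, config, and weight files itself. Roughly, assuming the standard BERT release layout (kashgari's actual lookup logic may differ):

```python
from pathlib import Path

def resolve_bert_files(model_folder):
    # Hypothetical sketch of the wrapper's job: map a checkpoint folder to
    # the three paths TransformerEmbedding needs. File names assume the
    # standard Google BERT release layout.
    folder = Path(model_folder)
    return (
        folder / "vocab.txt",
        folder / "bert_config.json",
        folder / "bert_model.ckpt",
    )

vocab, config, ckpt = resolve_bert_files("uncased_L-12_H-768_A-12")
print(vocab.name, config.name, ckpt.name)
```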

build_embedding_model(*, vocab_size=None, force=False, **kwargs)
Parameters
  • vocab_size (Optional[int]) –

  • force (bool) –

  • kwargs (Dict) –

Return type

None

embed(sentences, *, debug=False)

batch embed sentences

Parameters
  • sentences (List[List[str]]) – Sentence list to embed

  • debug (bool) – show debug info

Returns

vectorized sentence list

Return type

numpy.ndarray

get_seq_length_from_corpus(generators, *, use_label=False, cover_rate=0.95)

Calculate proper sequence length according to the corpus

Parameters
  • generators – corpus data generators

  • use_label (bool) –

  • cover_rate (float) –

Return type

int

load_embed_vocab()

Load vocab dict from embedding layer

Returns

vocab dict or None

Return type

Optional[Dict[str, int]]

setup_text_processor(processor)
Parameters

processor (kashgari.processors.abc_processor.ABCProcessor) –

Return type

None

to_dict()[source]
Return type

Dict[str, Any]