Problems with LDA model #989

Open
phoenix-asv opened this issue Nov 15, 2019 · 2 comments

@phoenix-asv

I am trying to follow the tutorial at http://docs.bigartm.org/en/stable/tutorials/python_tutorial.html and have run into some issues. Here is the sample code for training an LDA model on the wiki-enru dataset:

import artm

batch_vectorizer = artm.BatchVectorizer(
    data_path='vw.wiki-enru.txt', 
    data_format='vowpal_wabbit',
    target_folder='wiki-enru-batches',
)

lda = artm.LDA(
    num_topics=10, 
    num_document_passes=5, 
    dictionary=batch_vectorizer.dictionary
)

lda.fit_offline(batch_vectorizer=batch_vectorizer, num_collection_passes=10)
topics = lda.transform(batch_vectorizer=batch_vectorizer)

Then I try to get some scores and the top words for each topic with this code:

print(f"perplexity: {lda.perplexity_last_value}")
print(f"sparsity_phi: {lda.sparsity_phi_last_value}")
print(f"sparsity_theta: {lda.sparsity_theta_last_value}")

top_tokens = lda.get_top_tokens(num_tokens=10)
for i, token_list in enumerate(top_tokens):
    print(f"Topic #{i}: {token_list}")

and I get:

perplexity: 3975.892333984375
sparsity_phi: nan
sparsity_theta: 0.0
Topic #0: []
...
Topic #9: []

The transformation seems to be working fine, but I am not sure of its quality.
No warnings or error messages are shown in the Jupyter notebook.

Host: Linux

Version: 14c93c2 (current stable)

Log file: bigartm.WARNING

Log file created at: 2019/11/15 19:40:51
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
W1115 19:40:51.799517 13698 helpers.cc:221] File already exists: wiki-enru-batches/aaaaaa.batch
W1115 19:41:02.016083 13677 check_messages.h:1121] Inconsistent fields size in ThetaMatrix: 1000 vs 0 vs 1000 vs 0;
W1115 19:53:09.125246 13677 check_messages.h:1121] Inconsistent fields size in ThetaMatrix: 1000 vs 0 vs 1000 vs 0;
@bt2901
Contributor

bt2901 commented Nov 15, 2019

It seems that the problem is caused by the presence of multiple modalities. artm.LDA doesn't support the multimodal case very well (I'm not sure which modality it uses, if any). As a result, the scores are confused about which modality they belong to.
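A cleaner alternative to patching artm.LDA internals would be to switch to the lower-level artm.ARTM class, which lets you bind each score to a modality explicitly via class_id. This is only a sketch (I haven't run it against this dataset), and the score names ('sparsity_phi_en', 'top_tokens_en') are arbitrary placeholders:

```python
import artm

# Assumes batches were already built as in the snippet above.
batch_vectorizer = artm.BatchVectorizer(
    data_path='vw.wiki-enru.txt',
    data_format='vowpal_wabbit',
    target_folder='wiki-enru-batches',
)

model = artm.ARTM(
    num_topics=10,
    num_document_passes=5,
    dictionary=batch_vectorizer.dictionary,
    class_ids={'@english': 1.0, '@russian': 1.0},  # modality weights
    scores=[
        artm.SparsityPhiScore(name='sparsity_phi_en', class_id='@english'),
        artm.TopTokensScore(name='top_tokens_en', num_tokens=10, class_id='@english'),
    ],
)
model.fit_offline(batch_vectorizer=batch_vectorizer, num_collection_passes=10)

# Each score now unambiguously refers to the @english modality.
print(model.score_tracker['sparsity_phi_en'].last_value)
for topic_name, tokens in model.score_tracker['top_tokens_en'].last_tokens.items():
    print(topic_name, tokens)
```

The same pair of scores can be added for '@russian' to inspect both modalities.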

A quick and dirty workaround: add the following before calling fit_offline:

from artm import SparsityPhiScore

POSSIBLE_MODALITIES = ["@russian", "@english"]
for modality_name in POSSIBLE_MODALITIES:
    lda._internal_model.scores.add(
        SparsityPhiScore(name=lda._sp_phi_score_name + modality_name, class_id=modality_name)
    )

Then read the per-modality values like so:

for modality_name in POSSIBLE_MODALITIES:
    value = lda._internal_model.score_tracker[lda._sp_phi_score_name + modality_name].last_value
    print(f"sparsity_phi for {modality_name}: {value}")

Additionally, you need to replace the get_top_tokens function:

from artm import TopTokensScore

def get_top_tokens(model, num_tokens=10, with_weights=False, modality_name=None):
    """
    :Description: returns the most probable tokens for each topic
    :param int num_tokens: number of top tokens to be returned
    :param bool with_weights: if False, return only tokens; if True, return tuples (token, its p_wt)
    :return:
      * list of lists of str, each internal list corresponding to one topic in\
        natural order, if with_weights == False; otherwise a list of lists\
        of tuples, each tuple being (str, float)
    """
    model._internal_model.scores.add(
        TopTokensScore(name=model._tt_score_name, num_tokens=num_tokens, class_id=modality_name),
        overwrite=True)
    result = model._internal_model.get_score(model._tt_score_name)

    tokens = []
    global_token_index = 0
    for topic_index in range(model.num_topics):
        if not with_weights:
            tokens.append(result.token[global_token_index: (global_token_index + num_tokens)])
        else:
            result_token = result.token[global_token_index: (global_token_index + num_tokens)]
            result_weight = result.weight[global_token_index: (global_token_index + num_tokens)]
            tokens.append(list(zip(result_token, result_weight)))
        global_token_index += num_tokens

    return tokens


for modality_name in POSSIBLE_MODALITIES:
    top_tokens = get_top_tokens(lda, num_tokens=10, modality_name=modality_name)
    for i, token_list in enumerate(top_tokens):
        print(f"Topic #{i}: {token_list}")

I cannot access the wiki-enru dataset to check right now. Does it help?

@phoenix-asv
Author

Thanks for the fast reply! The workaround helped.

However, the same problem appears on my own dataset:
a single language, with the vw file containing lines of the form `docID word1:cnt1 word2:cnt2 ... wordN:cntN`.
So it seems this problem is not caused (or not only caused) by the presence of multiple modalities.
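For reference, this is roughly how I generate those lines. A minimal sketch (to_vw_line is just an illustrative helper, not part of BigARTM); my file omits the modality marker, so if I understand the Vowpal Wabbit format correctly, all tokens fall into the default modality, which can also be spelled out explicitly with a `|@default_class` marker:

```python
from collections import Counter

def to_vw_line(doc_id, modality_tokens):
    """Format one document as a BigARTM-style Vowpal Wabbit line.

    modality_tokens maps a modality name (e.g. '@default_class')
    to that modality's raw token list; counts are aggregated here.
    """
    parts = [doc_id]
    for modality, tokens in modality_tokens.items():
        counts = Counter(tokens)
        body = ' '.join(f'{tok}:{cnt}' for tok, cnt in sorted(counts.items()))
        parts.append(f'|{modality} {body}')
    return ' '.join(parts)

print(to_vw_line('doc1', {'@default_class': ['cat', 'cat', 'dog']}))
# doc1 |@default_class cat:2 dog:1
```

With the marker written out, the scores' class_id can be set to '@default_class' to match.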
