Problems with LDA model #989

Open
phoenix-asv opened this issue Nov 15, 2019 · 2 comments

@phoenix-asv

I am trying to follow the tutorial at http://docs.bigartm.org/en/stable/tutorials/python_tutorial.html and have run into some issues. Here is the sample code for training an LDA model on the wiki-enru dataset:

import artm

batch_vectorizer = artm.BatchVectorizer(
    data_path='vw.wiki-enru.txt', 
    data_format='vowpal_wabbit',
    target_folder='wiki-enru-batches',
)

lda = artm.LDA(
    num_topics=10, 
    num_document_passes=5, 
    dictionary=batch_vectorizer.dictionary
)

lda.fit_offline(batch_vectorizer=batch_vectorizer, num_collection_passes=10)
topics = lda.transform(batch_vectorizer=batch_vectorizer)

Then I try to get some scores and the top words for each topic with this code:

print(f"perplexity: {lda.perplexity_last_value}")
print(f"sparsity_phi: {lda.sparsity_phi_last_value}")
print(f"sparsity_theta: {lda.sparsity_theta_last_value}")

top_tokens = lda.get_top_tokens(num_tokens=10)
for i, token_list in enumerate(top_tokens):
    print(f"Topic #{i}: {token_list}")

and I get:

perplexity: 3975.892333984375
sparsity_phi: nan
sparsity_theta: 0.0
Topic #0: []
...
Topic #9: []

The transformation seems to be working fine, but I am not sure of its quality.
No warnings or error messages are shown in the Jupyter notebook.

Host: Linux

Version: 14c93c2 (current stable)

Log file: bigartm.WARNING

Log file created at: 2019/11/15 19:40:51
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
W1115 19:40:51.799517 13698 helpers.cc:221] File already exists: wiki-enru-batches/aaaaaa.batch
W1115 19:41:02.016083 13677 check_messages.h:1121] Inconsistent fields size in ThetaMatrix: 1000 vs 0 vs 1000 vs 0;
W1115 19:53:09.125246 13677 check_messages.h:1121] Inconsistent fields size in ThetaMatrix: 1000 vs 0 vs 1000 vs 0;
@bt2901
Contributor

bt2901 commented Nov 15, 2019

It seems that the problem is caused by the presence of multiple modalities. artm.LDA doesn't support the multimodal case very well (I'm not sure which modality it uses, if any). As a result, the scores are confused about which modality they belong to.
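A cleaner alternative to patching artm.LDA internals would be to switch to the lower-level artm.ARTM class, which lets you bind each score to a modality explicitly via class_id. This is only a sketch (I haven't run it against this dataset), and the score names ('sparsity_phi_en', 'top_tokens_en') are arbitrary placeholders:

```python
import artm

# Assumes batches were already built as in the snippet above.
batch_vectorizer = artm.BatchVectorizer(
    data_path='vw.wiki-enru.txt',
    data_format='vowpal_wabbit',
    target_folder='wiki-enru-batches',
)

model = artm.ARTM(
    num_topics=10,
    num_document_passes=5,
    dictionary=batch_vectorizer.dictionary,
    class_ids={'@english': 1.0, '@russian': 1.0},  # modality weights
    scores=[
        artm.SparsityPhiScore(name='sparsity_phi_en', class_id='@english'),
        artm.TopTokensScore(name='top_tokens_en', num_tokens=10, class_id='@english'),
    ],
)
model.fit_offline(batch_vectorizer=batch_vectorizer, num_collection_passes=10)

# Each score now unambiguously refers to the @english modality.
print(model.score_tracker['sparsity_phi_en'].last_value)
for topic_name, tokens in model.score_tracker['top_tokens_en'].last_tokens.items():
    print(topic_name, tokens)
```

The same pair of scores can be added for '@russian' to inspect both modalities.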

A quick and dirty workaround: add the following before calling fit_offline:

from artm import SparsityPhiScore

POSSIBLE_MODALITIES = ["@russian", "@english"]
for modality_name in POSSIBLE_MODALITIES:
    lda._internal_model.scores.add(
        SparsityPhiScore(name=lda._sp_phi_score_name + modality_name, class_id=modality_name)
    )

Then read the per-modality values like so:

for modality_name in POSSIBLE_MODALITIES:
    value = lda._internal_model.score_tracker[lda._sp_phi_score_name + modality_name].last_value
    print(f"sparsity_phi for {modality_name}: {value}")

Additionally, you need to replace the get_top_tokens function:

from artm import TopTokensScore

def get_top_tokens(model, num_tokens=10, with_weights=False, modality_name=None):
    """
    :Description: returns the most probable tokens for each topic
    :param int num_tokens: number of top tokens to be returned
    :param bool with_weights: if False, return only tokens; if True, return tuples (token, its p_wt)
    :return:
      * list of lists of str, each internal list corresponding to one topic in\
        natural order, if with_weights == False; otherwise a list of lists\
        of tuples, each tuple being (str, float)
    """
    model._internal_model.scores.add(
        TopTokensScore(name=model._tt_score_name, num_tokens=num_tokens, class_id=modality_name),
        overwrite=True)
    result = model._internal_model.get_score(model._tt_score_name)

    tokens = []
    global_token_index = 0
    for topic_index in range(model.num_topics):
        if not with_weights:
            tokens.append(result.token[global_token_index: (global_token_index + num_tokens)])
        else:
            result_token = result.token[global_token_index: (global_token_index + num_tokens)]
            result_weight = result.weight[global_token_index: (global_token_index + num_tokens)]
            tokens.append(list(zip(result_token, result_weight)))
        global_token_index += num_tokens

    return tokens


for modality_name in POSSIBLE_MODALITIES:
    top_tokens = get_top_tokens(lda, num_tokens=10, modality_name=modality_name)
    for i, token_list in enumerate(top_tokens):
        print(f"Topic #{i}: {token_list}")

I cannot access the wiki-enru dataset to check right now. Does it help?

@phoenix-asv
Author

Thanks for the fast reply! The workaround helped.

However, the same problem appears on my own dataset:
a single language, with the vw file containing lines of the form `docID word1:cnt1 word2:cnt2 ... wordN:cntN`.
So it seems this problem is not caused (or not only caused) by the presence of multiple modalities.
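For reference, this is roughly how I generate those lines. A minimal sketch (to_vw_line is just an illustrative helper, not part of BigARTM); my file omits the modality marker, so if I understand the Vowpal Wabbit format correctly, all tokens fall into the default modality, which can also be spelled out explicitly with a `|@default_class` marker:

```python
from collections import Counter

def to_vw_line(doc_id, modality_tokens):
    """Format one document as a BigARTM-style Vowpal Wabbit line.

    modality_tokens maps a modality name (e.g. '@default_class')
    to that modality's raw token list; counts are aggregated here.
    """
    parts = [doc_id]
    for modality, tokens in modality_tokens.items():
        counts = Counter(tokens)
        body = ' '.join(f'{tok}:{cnt}' for tok, cnt in sorted(counts.items()))
        parts.append(f'|{modality} {body}')
    return ' '.join(parts)

print(to_vw_line('doc1', {'@default_class': ['cat', 'cat', 'dog']}))
# doc1 |@default_class cat:2 dog:1
```

With the marker written out, the scores' class_id can be set to '@default_class' to match.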
