Analysis Failure with Unreadable Data Files if a Language is selected #1312

Open
PrajjwalDatir opened this issue Nov 11, 2023 · 2 comments

PrajjwalDatir commented Nov 11, 2023

Description:
When a specific programming language is selected and the uploaded zip file contains unreadable data files such as PDFs or videos, the analysis fails. However, the analysis succeeds when the language is set to automatic.

Error logs:
In the console: Uncaught TypeError: Cannot read properties of undefined (reading 'data')
In the UI: An error occurred while analyzing the dataset (out-of-memory)

Steps to Reproduce:
Select Python as the programming language.
Upload a zip file that contains unreadable data files, such as PDFs.
Observe the analysis failure.

Expected Behavior:
The analysis should read or skip PDF files in the zip without causing failure.

Additional Information:
Workaround: Removing the PDF files and re-zipping the contents allows for successful analysis.
Environment: https://dolos.ugent.be/server/#/
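
The workaround above can be automated instead of re-zipping by hand. The following is a minimal Python sketch, assuming the upload only needs the source files themselves; `filter_zip` is a hypothetical helper and not part of Dolos. It copies only the entries with a wanted extension into a new archive.

```python
import io
import zipfile
from pathlib import PurePosixPath


def filter_zip(data: bytes, extensions: set[str]) -> bytes:
    """Return a new zip archive keeping only entries whose extension is in `extensions`.

    Hypothetical helper for illustration; extensions are compared case-insensitively.
    """
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(data)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            if info.is_dir():
                continue  # directory entries carry no content
            # Zip entry names always use forward slashes, hence PurePosixPath.
            if PurePosixPath(info.filename).suffix.lower() in extensions:
                dst.writestr(info, src.read(info))
    return out.getvalue()
```

For example, reading the bytes of an existing archive, calling `filter_zip(data, {".py"})`, and writing the result to a new file gives an archive without the PDFs or videos.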

Note
You can assign this issue to me and connect me with someone with whom I can discuss an approach to solving it.

@PrajjwalDatir
Author

@rien @maartenvn @ArneCJacobs

@rien
Member

rien commented Nov 15, 2023

Hi @PrajjwalDatir, thanks for the extensive report. There is no need to mention developers; this might come across as pushy and could have an adverse effect.

The current behavior occurs because the -l <language> option skips language detection and ignores file extensions. The automatic detection (when -l is not given) prints a warning when it finds files that do not match the detected language's extension:

The language of the files was detected as <language> but <n> files were ignored because they did not have a matching extension.
You can override this behavior by setting the language explicitly.

I do agree that the difference in how files are selected might be confusing, especially in the web server.

I see two options:

  • Adding a CLI option, in addition to the -l <language> option, to filter out files that do not match the language's extension (for example --filter-extension)
  • Harmonizing the default behavior to always filter out files that do not match the extension, unless an option --ignore-extension is given.

I think the last option is the least confusing, since someone will rarely want to analyze files whose extension does not match the selected language.
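
To make the second option concrete, here is a minimal Python sketch of the selection rule only. It is not Dolos code: `select_files`, the `EXTENSIONS` table, and the `ignore_extension` parameter are hypothetical names, with the parameter mirroring the proposed --ignore-extension flag.

```python
from pathlib import PurePath

# Hypothetical mapping for illustration; a real tool would use its own language definitions.
EXTENSIONS = {"python": {".py"}, "java": {".java"}}


def select_files(paths, language, ignore_extension=False):
    """Split `paths` into (selected, skipped) for the given language.

    By default, files whose extension does not match the language are skipped.
    With ignore_extension=True, every file is selected regardless of extension.
    """
    if ignore_extension:
        return list(paths), []
    allowed = EXTENSIONS[language]
    selected, skipped = [], []
    for path in paths:
        matches = PurePath(path).suffix.lower() in allowed
        (selected if matches else skipped).append(path)
    return selected, skipped
```

With this default, PDFs and videos in an upload would land in the skipped list (which could feed the existing warning) instead of reaching the analysis, whether or not a language was chosen explicitly.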

What is your opinion on this? Which change would fit your use case best?
