Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Huge text handling #3121

Open
wants to merge 15 commits into
base: master
Choose a base branch
from
Open

Conversation

idodeclare
Copy link
Contributor

Hello,

Please consider for integration this patch to add Huge Text file handling.

Indexer and Configuration get two new settings, hugeTextThresholdBytes (default 1_000_000) and hugeTextLimitCharacters (default 5_000_000). The threshold determines when OpenGrok will override a PLAIN genre file as a hugetext DATA file instead. The character limit determines how much to read and index for hugetext (with contextless truncation); the limit may be zero.

hugeTextThresholdBytes is checked for applicable files with each run, while no state for hugeTextLimitCharacters is stored. Changing hugeTextLimitCharacters after indexing would require touching affected source code files to revise the index.

For affected gzip and bzip2 files, changes to either hugeTextThresholdBytes or hugeTextLimitCharacters would require touching affected compressed files to revise the index.

Thank you.

@coveralls
Copy link

coveralls commented Apr 18, 2020

Pull Request Test Coverage Report for Build 5464

  • 122 of 219 (55.71%) changed or added relevant lines in 17 files are covered.
  • 7 unchanged lines in 5 files lost coverage.
  • Overall coverage decreased (-0.07%) to 75.639%

Changes Missing Coverage Covered Lines Changed/Added Lines %
opengrok-indexer/src/main/java/org/opengrok/indexer/util/LimitedReader.java 18 19 94.74%
opengrok-indexer/src/main/java/org/opengrok/indexer/analysis/StreamSource.java 0 2 0.0%
opengrok-indexer/src/main/java/org/opengrok/indexer/analysis/archive/GZIPAnalyzer.java 2 4 50.0%
opengrok-indexer/src/main/java/org/opengrok/indexer/index/IndexDatabase.java 17 19 89.47%
opengrok-indexer/src/main/java/org/opengrok/indexer/analysis/archive/BZip2Analyzer.java 5 8 62.5%
opengrok-indexer/src/main/java/org/opengrok/indexer/index/Indexer.java 7 10 70.0%
opengrok-indexer/src/main/java/org/opengrok/indexer/web/SingleResult.java 0 6 0.0%
opengrok-web/src/main/java/org/opengrok/web/PageConfig.java 3 14 21.43%
opengrok-indexer/src/main/java/org/opengrok/indexer/web/SearchHelper.java 0 12 0.0%
opengrok-indexer/src/main/java/org/opengrok/indexer/analysis/AnalyzerGuru.java 25 43 58.14%
Files with Coverage Reduction New Missed Lines %
opengrok-indexer/src/main/java/org/opengrok/indexer/analysis/AnalyzerGuru.java 1 84.21%
opengrok-indexer/src/main/java/org/opengrok/indexer/index/IndexDatabase.java 1 60.19%
opengrok-indexer/src/main/java/org/opengrok/indexer/index/Indexer.java 1 51.3%
opengrok-indexer/src/main/java/org/opengrok/indexer/configuration/IndexTimestamp.java 2 42.86%
opengrok-web/src/main/java/org/opengrok/web/PageConfig.java 2 39.1%
Totals Coverage Status
Change from base Build 5462: -0.07%
Covered Lines: 41408
Relevant Lines: 54744

💛 - Coveralls

@tarzanek
Copy link
Contributor

this actually looks good, I just quickly skimmed through

maybe only one concern, I think we should WARN instead of FINE print if a file is skipped because of limits ... (unless FINE is printed by default to log file ... but then I'd like to see those as WARN on console too ... )

@vladak
Copy link
Member

vladak commented Apr 20, 2020

What happens if index is created with particular limits and then the limits are changed ?

@idodeclare
Copy link
Contributor Author

maybe only one concern, I think we should WARN instead of FINE print if a file is skipped because of limits ... (unless FINE is printed by default to log file ... but then I'd like to see those as WARN on console too ... )

No file is skipped. It is still included but under the HugeTextAnalyzer where the user chooses how much to analyze (as low as 0). It also gets no xref.

@idodeclare
Copy link
Contributor Author

What happens if index is created with particular limits and then the limits are changed ?

I tried to describe that above, but to clarify:

You can change hugeTextThresholdBytes and then re-index, and (as applicable) PLAIN files will be reclassified as hugetext DATA or hugetext DATA will be re-classified to PLAIN.

If you change hugeTextLimitCharacters and re-index, then nothing will happen because no state is stored related to that limit.

PLAIN files inside gzip or bzip2 are handled by HugeTextAnalyzer, but changing hugeTextThresholdBytes and re-indexing does nothing because the uncompressed size is not known when checkSettings would need to decide. (Changing hugeTextLimitCharacters and reindexing also does nothing for gzip or bzip2, for the same reason above.)

@tarzanek
Copy link
Contributor

sorry, I meant "trimmed" down, not skipped
anyways, logs should say it loud that file x and y are now not considered fully
I don't expect this output will flood the screen, but it should be a warning to get printed on console

@idodeclare
Copy link
Contributor Author

@tarzanek , that's done.

@idodeclare
Copy link
Contributor Author

PLAIN files inside gzip or bzip2 are handled by HugeTextAnalyzer, but changing hugeTextThresholdBytes and re-indexing does nothing because the uncompressed size is not known when checkSettings would need to decide.

I suppose it would be straight-forward to store a value for uncompressed size in the Document so that we could compare against hugeTextThresholdBytes in checkSettings for gzip/bzip2. Worth it?

@idodeclare
Copy link
Contributor Author

PLAIN files inside gzip or bzip2 are handled by HugeTextAnalyzer, but changing hugeTextThresholdBytes and re-indexing does nothing because the uncompressed size is not known when checkSettings would need to decide.

I suppose it would be straight-forward to store a value for uncompressed size in the Document so that we could compare against hugeTextThresholdBytes in checkSettings for gzip/bzip2. Worth it?

Oh but that would mean decompressing entirely. Probably not a good idea.

@idodeclare idodeclare marked this pull request as draft April 24, 2020 01:39
@idodeclare idodeclare marked this pull request as ready for review May 10, 2020 04:42
@tarzanek
Copy link
Contributor

lgtm, so +1 from my side
but since this PR is big, I'd like to get at least one other review to be merged
@vladak , @tulinkry ?

@idodeclare
Copy link
Contributor Author

Just rebased on master since this needed revision to accommodate the Configuration API changes of #3127 (but would not have shown up as having any Git conflicts)

@vladak
Copy link
Member

vladak commented Aug 20, 2020

I will take a look; also needs rebase.

@idodeclare
Copy link
Contributor Author

Just trivial conflicts upon rebase

@idodeclare idodeclare force-pushed the feature/huge_text branch 2 times, most recently from 1b51b08 to 2fb1b2e Compare September 27, 2020 21:00
@idodeclare idodeclare force-pushed the feature/huge_text branch 2 times, most recently from 5403dfd to 29aad0e Compare October 6, 2020 17:53
@idodeclare
Copy link
Contributor Author

Just rebasing for trivial conflicts related to R analyzer and then again after parallel detection merged

@idodeclare
Copy link
Contributor Author

Rebased for trivial conflict in search.jsp

@idodeclare
Copy link
Contributor Author

Rebased for PageConfig.java re-lo, and git automatic-merge took care of it

@vladak vladak self-assigned this Jun 6, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants