Huge text handling #3121

idodeclare · 2020-04-18T04:29:13Z

Hello,

Please consider for integration this patch to add Huge Text file handling.

Indexer and Configuration get two new settings, hugeTextThresholdBytes (default 1_000_000) and hugeTextLimitCharacters (default 5_000_000). The threshold determines when OpenGrok will reclassify a PLAIN genre file as a hugetext DATA file. The character limit determines how much of a hugetext file is read and indexed (with contextless truncation); the limit may be zero.
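
As a rough illustration of how the two settings interact, here is a minimal sketch. The names `HugeTextSketch`, `classify`, and `charsToIndex` are hypothetical and not taken from the patch; only the setting names and default values come from the description above. Whether the threshold comparison is inclusive is also an assumption.

```java
/** Hypothetical sketch of the two settings; not the patch's actual code. */
final class HugeTextSketch {
    enum Genre { PLAIN, DATA }

    // Defaults quoted from the description above.
    static final long DEFAULT_THRESHOLD_BYTES = 1_000_000L;
    static final int DEFAULT_LIMIT_CHARACTERS = 5_000_000;

    /**
     * Reclassifies a PLAIN file as DATA once its size reaches the threshold.
     * Assumption: the comparison is inclusive (>=).
     */
    static Genre classify(Genre detected, long sizeBytes, long thresholdBytes) {
        if (detected == Genre.PLAIN && sizeBytes >= thresholdBytes) {
            return Genre.DATA;
        }
        return detected;
    }

    /** Number of characters that would be indexed; the limit may be zero. */
    static long charsToIndex(long availableChars, int limitCharacters) {
        return Math.min(availableChars, Math.max(0, limitCharacters));
    }
}
```

With the defaults, a 2 MB PLAIN file would be reclassified as DATA, and at most 5,000,000 of its characters would be indexed.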

hugeTextThresholdBytes is checked for applicable files with each run, while no state for hugeTextLimitCharacters is stored. Changing hugeTextLimitCharacters after indexing would require touching affected source code files to revise the index.

For affected gzip and bzip2 files, changes to either hugeTextThresholdBytes or hugeTextLimitCharacters would require touching affected compressed files to revise the index.

Thank you.

coveralls · 2020-04-18T05:02:51Z

Pull Request Test Coverage Report for Build 5464

122 of 219 (55.71%) changed or added relevant lines in 17 files are covered.
7 unchanged lines in 5 files lost coverage.
Overall coverage decreased (-0.07%) to 75.639%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
opengrok-indexer/src/main/java/org/opengrok/indexer/util/LimitedReader.java	18	19	94.74%
opengrok-indexer/src/main/java/org/opengrok/indexer/analysis/StreamSource.java	0	2	0.0%
opengrok-indexer/src/main/java/org/opengrok/indexer/analysis/archive/GZIPAnalyzer.java	2	4	50.0%
opengrok-indexer/src/main/java/org/opengrok/indexer/index/IndexDatabase.java	17	19	89.47%
opengrok-indexer/src/main/java/org/opengrok/indexer/analysis/archive/BZip2Analyzer.java	5	8	62.5%
opengrok-indexer/src/main/java/org/opengrok/indexer/index/Indexer.java	7	10	70.0%
opengrok-indexer/src/main/java/org/opengrok/indexer/web/SingleResult.java	0	6	0.0%
opengrok-web/src/main/java/org/opengrok/web/PageConfig.java	3	14	21.43%
opengrok-indexer/src/main/java/org/opengrok/indexer/web/SearchHelper.java	0	12	0.0%
opengrok-indexer/src/main/java/org/opengrok/indexer/analysis/AnalyzerGuru.java	25	43	58.14%

Files with Coverage Reduction	New Missed Lines	%
opengrok-indexer/src/main/java/org/opengrok/indexer/analysis/AnalyzerGuru.java	1	84.21%
opengrok-indexer/src/main/java/org/opengrok/indexer/index/IndexDatabase.java	1	60.19%
opengrok-indexer/src/main/java/org/opengrok/indexer/index/Indexer.java	1	51.3%
opengrok-indexer/src/main/java/org/opengrok/indexer/configuration/IndexTimestamp.java	2	42.86%
opengrok-web/src/main/java/org/opengrok/web/PageConfig.java	2	39.1%

Totals
Change from base Build 5462:	-0.07%
Covered Lines:	41408
Relevant Lines:	54744

💛 - Coveralls

tarzanek · 2020-04-20T18:55:03Z

this actually looks good, I just quickly skimmed through

maybe only one concern, I think we should WARN instead of FINE print if a file is skipped because of limits ... (unless FINE is printed by default to log file ... but then I'd like to see those as WARN on console too ... )

vladak · 2020-04-20T19:23:29Z

What happens if index is created with particular limits and then the limits are changed ?

idodeclare · 2020-04-20T21:40:45Z

> maybe only one concern, I think we should WARN instead of FINE print if a file is skipped because of limits ... (unless FINE is printed by default to log file ... but then I'd like to see those as WARN on console too ... )

No file is skipped. It is still included but under the HugeTextAnalyzer where the user chooses how much to analyze (as low as 0). It also gets no xref.
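
To make the "how much to analyze" idea concrete, here is a minimal sketch of a reader that reports end-of-stream after a fixed number of characters. `CharLimitReader` and `truncate` are hypothetical names for illustration; this is not the `LimitedReader` class from the patch, whose API is not shown in this thread.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Hypothetical character-limited reader; not the patch's LimitedReader. */
final class CharLimitReader extends Reader {
    private final Reader in;
    private long remaining;

    CharLimitReader(Reader in, long limitCharacters) {
        this.in = in;
        this.remaining = Math.max(0, limitCharacters);
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        if (remaining == 0) {
            return -1; // report end-of-stream once the limit is reached
        }
        int n = in.read(buf, off, (int) Math.min(len, remaining));
        if (n > 0) {
            remaining -= n;
        }
        return n;
    }

    @Override
    public void close() throws IOException {
        in.close();
    }

    /** Convenience helper: returns at most limitCharacters of the text. */
    static String truncate(String text, long limitCharacters) {
        try (Reader r = new CharLimitReader(new StringReader(text), limitCharacters)) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            int n;
            while ((n = r.read(buf, 0, buf.length)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A limit of zero yields an empty stream, so the file would still be listed in the index but contribute no searchable text.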

idodeclare · 2020-04-20T21:56:30Z

> What happens if index is created with particular limits and then the limits are changed ?

I tried to describe that above, but to clarify:

You can change hugeTextThresholdBytes and then re-index, and (as applicable) PLAIN files will be reclassified as hugetext DATA or hugetext DATA will be re-classified to PLAIN.

If you change hugeTextLimitCharacters and re-index, then nothing will happen because no state is stored related to that limit.

PLAIN files inside gzip or bzip2 are handled by HugeTextAnalyzer, but changing hugeTextThresholdBytes and re-indexing does nothing because the uncompressed size is not known when checkSettings would need to decide. (Changing hugeTextLimitCharacters and reindexing also does nothing for gzip or bzip2, for the same reason above.)
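
The re-index behavior described above could be sketched roughly as follows. This is an assumption-laden illustration: `ReindexCheckSketch` and `needsReanalysis` are hypothetical names, and the real checkSettings logic is not reproduced in this thread. Note that hugeTextLimitCharacters does not appear as an input, since no state is stored for it.

```java
import java.util.OptionalLong;

/** Hypothetical sketch of the re-index decision; not the patch's checkSettings. */
final class ReindexCheckSketch {
    /**
     * @param indexedAsHugeText whether the existing index entry is hugetext DATA
     * @param knownSize file size if known; empty for content inside gzip or bzip2
     * @param thresholdBytes current hugeTextThresholdBytes value
     * @return true if the classification would flip and the file needs re-analysis
     */
    static boolean needsReanalysis(boolean indexedAsHugeText,
                                   OptionalLong knownSize,
                                   long thresholdBytes) {
        if (!knownSize.isPresent()) {
            // Uncompressed size is unknown without decompressing, so no decision.
            return false;
        }
        // Assumption: the threshold comparison is inclusive (>=).
        boolean hugeNow = knownSize.getAsLong() >= thresholdBytes;
        return hugeNow != indexedAsHugeText;
    }
}
```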

tarzanek · 2020-04-21T06:57:25Z

sorry, I meant "trimmed" down, not skipped
anyways, logs should say it loud that file x and y are now not considered fully
I don't expect this output will flood the screen, but it should be a warning to get printed on console
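
For context on why the level matters, here is a small java.util.logging sketch. It assumes a handler at level INFO, which is the documented default for `java.util.logging.ConsoleHandler`; the logger name, message text, and the `HugeTextLogSketch` class are made up for illustration, and OpenGrok's own logging configuration may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

/** Hypothetical demo of FINE versus WARNING visibility. */
final class HugeTextLogSketch {
    /** Collects only the messages that pass the handler's level check. */
    static final class CollectingHandler extends Handler {
        final List<String> messages = new ArrayList<>();

        @Override
        public void publish(LogRecord record) {
            if (isLoggable(record)) {
                messages.add(record.getLevel() + ": " + record.getMessage());
            }
        }

        @Override
        public void flush() { }

        @Override
        public void close() { }
    }

    static List<String> demo() {
        Logger logger = Logger.getLogger("hugetext.sketch"); // made-up logger name
        logger.setUseParentHandlers(false);
        logger.setLevel(Level.ALL);

        CollectingHandler handler = new CollectingHandler();
        handler.setLevel(Level.INFO); // same default level as ConsoleHandler
        logger.addHandler(handler);

        logger.log(Level.FINE, "big.log classified as huge text");    // filtered out
        logger.log(Level.WARNING, "big.log classified as huge text"); // kept

        logger.removeHandler(handler);
        return handler.messages;
    }
}
```

Only the WARNING record reaches a handler at level INFO, which is the behavior being asked for on the console.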

idodeclare · 2020-04-21T22:27:54Z

@tarzanek , that's done.

idodeclare · 2020-04-21T22:32:06Z

> PLAIN files inside gzip or bzip2 are handled by HugeTextAnalyzer, but changing hugeTextThresholdBytes and re-indexing does nothing because the uncompressed size is not known when checkSettings would need to decide.

I suppose it would be straight-forward to store a value for uncompressed size in the Document so that we could compare against hugeTextThresholdBytes in checkSettings for gzip/bzip2. Worth it?

idodeclare · 2020-04-21T22:46:17Z

> PLAIN files inside gzip or bzip2 are handled by HugeTextAnalyzer, but changing hugeTextThresholdBytes and re-indexing does nothing because the uncompressed size is not known when checkSettings would need to decide.
>
> I suppose it would be straight-forward to store a value for uncompressed size in the Document so that we could compare against hugeTextThresholdBytes in checkSettings for gzip/bzip2. Worth it?

Oh but that would mean decompressing entirely. Probably not a good idea.
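
To illustrate the cost, the sketch below counts uncompressed bytes by streaming through `GZIPInputStream`, which means reading the whole stream. A bounded variant that stops once the threshold is reached is also shown. `GzipSizeSketch` and its methods are hypothetical, and nothing here is claimed to match what the patch does.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical sketch; not the patch's GZIPAnalyzer logic. */
final class GzipSizeSketch {
    /** Helper for the demo: compresses raw bytes in memory. */
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bos)) {
            out.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    /** Full count: every compressed byte has to be inflated. */
    static long uncompressedSize(byte[] gzipped) {
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(gzipped))) {
            byte[] buf = new byte[8192];
            long total = 0;
            int n;
            while ((n = in.read(buf)) != -1) {
                total += n;
            }
            return total;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Bounded check: stops inflating as soon as the threshold is reached. */
    static boolean reachesThreshold(byte[] gzipped, long thresholdBytes) {
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(gzipped))) {
            byte[] buf = new byte[8192];
            long total = 0;
            int n;
            while (total < thresholdBytes && (n = in.read(buf)) != -1) {
                total += n;
            }
            return total >= thresholdBytes;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The bounded check is cheaper for files far above the threshold, but a file just below it still has to be inflated in full.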

tarzanek · 2020-05-26T08:26:31Z

lgtm, so +1 from my side
but since this PR is big, I'd like to get at least one other review to be merged
@vladak , @tulinkry ?

idodeclare · 2020-05-27T14:49:47Z

Just rebased on master since this needed revision to accommodate the Configuration API changes of #3127 (but would not have shown up as having any Git conflicts)

vladak · 2020-08-20T14:57:41Z

I will take a look; also needs rebase.

idodeclare · 2020-08-20T17:07:07Z

Just trivial conflicts upon rebase

idodeclare · 2020-10-06T17:55:24Z

Just rebasing for trivial conflicts related to R analyzer and then again after parallel detection merged

idodeclare · 2020-10-07T18:43:53Z

Rebased for trivial conflict in search.jsp

idodeclare · 2020-10-09T15:58:36Z

Rebased for PageConfig.java re-lo, and git automatic-merge took care of it

idodeclare marked this pull request as draft April 24, 2020 01:39

idodeclare force-pushed the feature/huge_text branch from 5dca6b3 to 1d15e51 Compare May 10, 2020 03:47

idodeclare marked this pull request as ready for review May 10, 2020 04:42

idodeclare force-pushed the feature/huge_text branch from cdcbea3 to 7559eb7 Compare May 27, 2020 14:48

vladak self-requested a review August 20, 2020 14:57

idodeclare force-pushed the feature/huge_text branch from 7559eb7 to 8ae1950 Compare August 20, 2020 17:06

idodeclare force-pushed the feature/huge_text branch 2 times, most recently from 1b51b08 to 2fb1b2e Compare September 27, 2020 21:00

idodeclare force-pushed the feature/huge_text branch 2 times, most recently from 5403dfd to 29aad0e Compare October 6, 2020 17:53

vladak reviewed Oct 7, 2020

opengrok-indexer/src/main/java/org/opengrok/indexer/index/Indexer.java

vladak reviewed Oct 7, 2020

opengrok-indexer/src/test/java/org/opengrok/indexer/index/HugeTextTest.java

idodeclare force-pushed the feature/huge_text branch from 29aad0e to 358d2f6 Compare October 7, 2020 18:43

idodeclare added 2 commits October 9, 2020 10:56

Fix to cut extra elements from Java 11 serialization

317755b

Relocate as static final

4766031

idodeclare added 13 commits October 9, 2020 10:56

Store QueryBuilder.T for every Genre

5cdfd57

AnalyzerGuru is actually (unfortunately) a singleton at the moment

b5b252e

Delete unused fileUpdate()

1b26df7

Fix oracle#534 Fix oracle#1646 Fix oracle#3097 : constrain huge text …

752a7ce

…files

Fix oracle#2560 : recognize Huge Text in gzip or bzip2

f9ff866

Attempt InputStream.skip() for more efficiency

9f39558

Tweak test to work on Windows

1f5d484

Show WARNINGs about content classified as Huge Text

15458da

Make list.jsp Huge-Text-aware

bc1aea9

Also, move some logic properly to AnalyzerGuru that had crept into IndexDatabase.

Fix to get origFileTypeName earlier for logging

3a6b916

Fix possible NPE by ensuring acquisition of analyzer factory

0ce96e7

Fix not properly reanalyzing for Genre.HTML

9c87d35

Fix String mangled during automatic-merge

36245a5

idodeclare force-pushed the feature/huge_text branch from f6bdc40 to 36245a5 Compare October 9, 2020 15:57

vladak self-assigned this Jun 6, 2022
