Add a directory searcher that uses `ripgrep` as the backend #19348

rafeca · 2019-05-16T18:42:06Z

Summary

This PR adds a new RipgrepDirectorySearcher (the backend that atom.workspace.scan() uses) that implements the search using ripgrep.

For simplicity (and backwards compatibility), the RipgrepDirectorySearcher implements the exact same interface as the DefaultDirectorySearcher (to verify that, we're running the exact same tests against both of them).

🍐'd with @nathansobo

Benefits

This should considerably speed up searches across big projects (I'll provide some benchmarks later), since ripgrep has proven to be dramatically faster than using Node's filesystem APIs to crawl a big project.

Alternate designs

Instead of mimicking the data structure of scandal, we could have chosen to return the data in a structure similar to ripgrep's output (which aligns better with what's used in the find and replace UI). This would mean less data processing and would allow simplifying the find and replace logic.

At the same time, this alternative had some big drawbacks:

Even if it would allow simplifying the logic in find and replace, we would need to make big changes there, which would mean much more testing and a higher risk of introducing bugs. Since for now we are exploring whether ripgrep is a suitable alternative for find and replace, we prefer to avoid this amount of work just to test the hypothesis.
Even more complexity would have to be introduced since we want to test both the ripgrep and the standard scanner at the same time to compare them. This means that we would need to keep both types of handling logic in find and replace, with the risk of them diverging or of introducing bugs in only one of them without realizing.
If we change the data structure returned by scan(), other packages using that method (or other parts of Atom) won't be able to easily benefit from ripgrep improvements without being refactored.

Verification process

The same unit tests are now run for both ripgrep and scandal, and a few new tests have been added to test the context lines logic.
Do some sanity testing on the find and replace package.

nathansobo

Clever approach keeping the set of trailing context lines and continuing to populate them in subsequent matches. Did you profile this at all to get a sense of how much overhead this state tracking introduces? I don't imagine it would be too bad though... just a bit of extra work per match, and the alternative is breaking the API, which I agree is probably not worth it.

nathansobo · 2019-05-17T11:07:44Z

src/ripgrep-directory-searcher.js

+
+  if (options.trailingContextLineCount) {
+    for (const trailingContextLines of pendingTrailingContexts) {
+      trailingContextLines.push(message.data.lines.text.trim())


What if message.data.lines.text contains newlines?

I haven't tested this, but I'm going to add a test for that.

I remember that when we were testing the current implementation we saw some issues with multi-line results right?

Yeah, although I honestly can't remember what we saw.
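
The trailing-context bookkeeping discussed in this thread can be sketched roughly as follows. This is a minimal illustration and not the code from the PR: the function name collectMatchesWithContext and the shape of the returned objects are hypothetical, and it assumes every ripgrep --json message carries exactly one line in data.lines.text, so it does not settle the multi-line question raised above.

```javascript
// Hypothetical helper (not from the PR): walks already-parsed `rg --json`
// messages and attaches leading/trailing context lines to each match.
function collectMatchesWithContext (messages, options = {}) {
  const leadingCount = options.leadingContextLineCount || 0
  const trailingCount = options.trailingContextLineCount || 0

  const matches = []
  let recentLines = [] // the last `leadingCount` lines seen in this file
  const pendingTrailingContexts = new Set() // arrays still waiting for lines

  for (const message of messages) {
    if (message.type === 'begin') {
      // A new file starts: context never crosses file boundaries.
      recentLines = []
      pendingTrailingContexts.clear()
      continue
    }
    if (message.type !== 'match' && message.type !== 'context') continue

    // Assumption: one line per message. Only the line terminator is removed.
    const lineText = message.data.lines.text.replace(/\r?\n$/, '')

    // Any line, match or context, is trailing context for earlier matches.
    for (const trailingLines of pendingTrailingContexts) {
      trailingLines.push(lineText)
      if (trailingLines.length >= trailingCount) {
        pendingTrailingContexts.delete(trailingLines)
      }
    }

    if (message.type === 'match') {
      const trailingContextLines = []
      matches.push({
        lineText,
        leadingContextLines: recentLines.slice(),
        trailingContextLines
      })
      if (trailingCount > 0) pendingTrailingContexts.add(trailingContextLines)
    }

    if (leadingCount > 0) {
      recentLines.push(lineText)
      if (recentLines.length > leadingCount) recentLines.shift()
    }
  }

  return matches
}
```

The extra work per line is one loop over the arrays that are still waiting for lines, and there are never more of those than the trailing context count, which is consistent with the expectation that the overhead stays small.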

rafeca · 2019-05-17T12:44:27Z

Did you profile this at all to get a sense of how much overhead this state tracking introduces?

Good point! In my local tests I couldn't see any issue even when returning a lot of results, but I'm going to do some profiling.

rafeca · 2019-05-21T12:47:34Z

Did you profile this at all to get a sense of how much overhead this state tracking introduces?

I've done some profiling and the overhead added is not significant. Processing a single line from ripgrep takes around 0.4ms in total (most of it comes from the JSON parsing of the line).

For comparison, etch takes ~36ms to update the UI when it receives each of these results.

rafeca · 2019-05-22T14:22:38Z

I've added a new commit which enables multiline support in the ripgrep scanner. This, paired with some minor changes in the find-and-replace package, will allow users to search using multiline regexps, a popular feature requested a long time ago.

I've also added a bunch of tests to verify that the behaviour is correct ✨

rafeca · 2019-05-22T20:06:32Z

Oh god I've had to handle yet more edge cases (created tests for each of them):

Handle unicode correctly: ripgrep returns the start and end positions of results within a line in bytes, while our UI (and JS in general) expects these positions as character offsets in the string. I've had to add some logic to do this conversion (which needs to iterate over the whole line if it has characters that are represented by more than 1 byte). I've made this logic as performant as possible (it's O(n) in the line length), and in my benchmarks, even when getting thousands of results from a single line of more than 100k characters, the processing only takes ~10ms.
Convert includes/excludes into globs: ripgrep expects globs for include and exclude patterns, but scandal (and find-all interfaces in general) also handles things like src/, so I had to add some logic to convert these kinds of patterns into globs that match all the expected files.
Unescape slashes from RegExps: ripgrep is quite picky about unnecessarily escaped sequences, so we need to unescape slashes, which get automatically escaped by the JS RegExp implementation.
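
To make the three conversions above concrete, here is a rough sketch of each one. These are illustrations under stated assumptions and not the code merged in this PR: the helper names (byteOffsetsToCharOffsets, toGlobs, unescapeSlashes) are hypothetical, the byte offsets are assumed to be sorted and to fall on UTF-8 character boundaries, and the glob helper only covers the trailing-slash case mentioned above.

```javascript
// Hypothetical helper: converts sorted UTF-8 byte offsets within `text` into
// JS string offsets (UTF-16 code units) in a single pass over the line.
function byteOffsetsToCharOffsets (text, byteOffsets) {
  // Fast path: with only 1-byte characters both kinds of offsets coincide.
  if (Buffer.byteLength(text) === text.length) return byteOffsets.slice()

  const bytes = Buffer.from(text) // UTF-8 by default
  const charOffsets = []
  let previousByte = 0
  let previousChar = 0
  for (const byteOffset of byteOffsets) {
    // Decode only the bytes between the previous offset and this one.
    previousChar += bytes.subarray(previousByte, byteOffset).toString().length
    previousByte = byteOffset
    charOffsets.push(previousChar)
  }
  return charOffsets
}

// Hypothetical helper: turns an include/exclude pattern such as `src/` into
// globs that cover the path itself and everything below it.
function toGlobs (pattern) {
  const trimmed = pattern.endsWith('/') ? pattern.slice(0, -1) : pattern
  if (trimmed.length === 0) return []
  return trimmed.endsWith('/**') ? [trimmed] : [trimmed, `${trimmed}/**`]
}

// Hypothetical helper: `RegExp.prototype.source` escapes forward slashes
// (`a/b` becomes `a\/b`), so undo that before passing the pattern along.
function unescapeSlashes (regexSource) {
  return regexSource.replace(/\\\//g, '/')
}
```

For example, byteOffsetsToCharOffsets would map the submatch start and end fields that ripgrep reports for a line onto positions usable with String.prototype.slice, and toGlobs('src/') yields both src and src/** so that a directory name still selects the files inside it.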

There may be other edge cases that I'm missing, so my current recommendation is to add this as a setting on the find-and-replace package and leave it off by default for a couple of releases while we get some usage.

Stirfry70 · 2020-08-07T16:40:41Z

Hey

WIP

fd82a58

Co-Authored-By: Rafael Oleza <rafeca@users.noreply.github.com>

rafeca added the FY2019Q5 atom perf More information: https://github.com/github/pe-atom-log/issues/728 label May 16, 2019

rafeca requested a review from nathansobo May 16, 2019 18:42

rafeca force-pushed the ns-ro/ripgrep-scan branch from 293eda6 to c46522e Compare May 16, 2019 19:19

rafeca changed the title ~~Add a directorty searcher that uses ripgrep as the backend~~ Add a directory searcher that uses ripgrep as the backend May 16, 2019

rafeca added 2 commits May 17, 2019 11:40

Add context support to ripgrep-directory-searcher

0647a00

Exclude the ripgrep module from the v8 snapshots

d35aef3

rafeca force-pushed the ns-ro/ripgrep-scan branch from 1c47e4f to d35aef3 Compare May 17, 2019 09:59

rafeca added 2 commits May 17, 2019 12:13

Delay the require of the vscode-ripgrep module

0cbd329

Since a new RipgrepDirectorySearcher is constructed during the snapshot creation, we cannot require vscode-ripgrep from there (since it's not in the snapshot).

Use the correct ripgrep path on asar packages

0a6a798
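
The two commits above (delaying the require and fixing the path for asar packages) follow a common pattern that can be sketched like this. It is an illustration only, not the code from those commits: the class name LazyRipgrepPath and the injectable loadModule parameter are hypothetical, and the path rewrite assumes the binary is shipped unpacked in an app.asar.unpacked directory next to the archive.

```javascript
// Hypothetical sketch: defer loading the module that knows the binary path
// until it is first needed, so constructing the object has no side effects.
class LazyRipgrepPath {
  // `loadModule` is injectable for testing. By default it requires
  // vscode-ripgrep, which exposes the binary location as `rgPath`.
  constructor (loadModule = () => require('vscode-ripgrep')) {
    this.loadModule = loadModule
    this.cachedPath = null
  }

  get () {
    if (this.cachedPath === null) {
      const { rgPath } = this.loadModule()
      // Assumption: a binary inside an asar archive cannot be spawned, so
      // point at the unpacked copy kept next to the archive instead.
      this.cachedPath = rgPath.replace(
        /\bapp\.asar(?!\.unpacked)\b/,
        'app.asar.unpacked'
      )
    }
    return this.cachedPath
  }
}
```

Because nothing is required in the constructor, an instance can be created while the snapshot is being built, and the module is only loaded when get() is first called at search time.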

nathansobo approved these changes May 17, 2019

rafeca added 4 commits May 20, 2019 15:45

Merge branch 'master' into ns-ro/ripgrep-scan

9d1406c

Update workspace-spec tests to be run against both ripgrep and standa…

d9c27cc

…rd implementation

Fix ripgrep scan implementation to pass all the tests

73e2b60

Fix path issues in Windows

4f9bd50

rafeca force-pushed the ns-ro/ripgrep-scan branch from 884d1b2 to 4f9bd50 Compare May 21, 2019 11:24

Add proper multiline support to ripgrep scanner

c3845dd

rafeca added 2 commits May 22, 2019 21:31

Unescape slashes in regexps

411e2a9

Process unicode results from ripgrep correctly

6748b84

rafeca force-pushed the ns-ro/ripgrep-scan branch from eb7060a to 6748b84 Compare May 22, 2019 20:00

rafeca mentioned this pull request May 22, 2019

Improve time to find on a large repository by using ripgrep atom/find-and-replace#1075

Closed

1 task

rafeca added 2 commits May 22, 2019 23:37

Merge branch 'master' into ns-ro/ripgrep-scan

78c0a28

Add missing fixture

854471f

rafeca merged commit a2a1de8 into master May 23, 2019

rafeca deleted the ns-ro/ripgrep-scan branch May 23, 2019 07:39

rafeca mentioned this pull request May 24, 2019

Add config option to use ripgrep for scanning files atom/find-and-replace#1086

Merged

rafeca mentioned this pull request May 24, 2019

Fix handling of binary files when using ripgrep scanner #19403

Merged
