Skip to content
This repository has been archived by the owner on Mar 3, 2023. It is now read-only.

Add a directory searcher that uses ripgrep as the backend #19348

Merged
merged 14 commits into from
May 23, 2019

Conversation

rafeca
Copy link
Contributor

@rafeca rafeca commented May 16, 2019

Summary

This PR adds a new RipgrepDirectorySearcher (the backend that atom.workspace.scan() uses that implements the search using ripgrep.

For simplicity (and backwards compatibility), the RipgrepDirectorySearcher implements the exact same interface that the DefaultDirectorySearcher (in order to verify that, we're running the exact same tests against both of them).

🍐'd with @nathansobo

Benefits

This should speed up considerably searches along big projects (I'll provide some benchmarks later), since ripgrep has proven to be dramatically faster than using Node's filesystem apis to crawl a big project.

Alternate designs

Instead of mimicking the data structure of scandal, we could have chosen to just return the data in a similar structure than the output of ripgrep (which aligns better with what's used on the find and replace UI). This would mean less processing of data and would allow simplifying the find and replace logic.

At the same time, this alternative had some big drawbacks:

  1. Even if it would allow simplifying the logic in find and replace, we would need to do big changes there, which would mean much more testing and risk of introducing bugs. Since for now we are exploring if ripgrep is a suitable alternative for find and replace, we prefer to avoid having to do this amount of work just to test the hypothesis.
  2. Even more complexity would have to be introduced since we want to test both the ripgrep and the standard scanner at the same time to compare them. This means that we would need to keep both types of handling logic in the find and replace, with the risk of diverging or introducing bugs in one logic only without realizing.
  3. If we change the data structure returned by scan(), other packages using that method (or other parts of Atom) won't be able to easily benefit from ripgrep improvements without being refactored.

Verification process

  • The same unit tests are now run for both ripgrep and scandal, a few new tests have been added to test the context lines logic.
  • Do some sanity testing on the find and replace package.

Co-Authored-By: Rafael Oleza <rafeca@users.noreply.github.com>
@rafeca rafeca added the FY2019Q5 atom perf More information: https://github.com/github/pe-atom-log/issues/728 label May 16, 2019
@rafeca rafeca requested a review from nathansobo May 16, 2019 18:42
@rafeca rafeca changed the title Add a directorty searcher that uses ripgrep as the backend Add a directory searcher that uses ripgrep as the backend May 16, 2019
Since a new RipgrepDirectorySearcher is constructed during the snapshot
creation, we cannot require vscode-ripgrep from there (since it's not in
the snapshot).
Copy link
Contributor

@nathansobo nathansobo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clever approach keeping the set of trailing context lines and continuing populating them in subsequent matches. Did you profile this at all to get a sense of how much overhead this state tracking introduces? I don't imagine it would be too bad though... just a bit of extra work per match, and the alternative is breaking the API which I agree is probably not worth it.


if (options.trailingContextLineCount) {
for (const trailingContextLines of pendingTrailingContexts) {
trailingContextLines.push(message.data.lines.text.trim())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if message.data.lines.text contains newlines?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't tested this, but I'm going to add a test for that.

I remember that when we were testing the current implementation we saw some issues with multi-line results right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, although I honestly can't remember what we saw.

@rafeca
Copy link
Contributor Author

rafeca commented May 17, 2019

Did you profile this at all to get a sense of how much overhead this state tracking introduces?

Good point! On my local tests I couldn't see any issue even when returning a lot of results but I'm going to do some profiling

@rafeca
Copy link
Contributor Author

rafeca commented May 21, 2019

Did you profile this at all to get a sense of how much overhead this state tracking introduces?

I've done some profiling and the overhead added is non-significative. The whole processing of a single line from ripgrep takes around ~0.4ms (most of it comes from the JSON parsing of the line).

For comparison, etch takes ~36ms to update the UI when it receives each of these results:

Screenshot 2019-05-21 at 14 44 46

@rafeca
Copy link
Contributor Author

rafeca commented May 22, 2019

I've added a new commit which enables multiline support in the ripgrep scanner. This, paired with some minor changes in the find-and-replace package, will allow users to search using multiline regexps, a popular feature requested a long time ago.

I've also added a bunch of tests to verify that the behaviour is correct ✨

@rafeca
Copy link
Contributor Author

rafeca commented May 22, 2019

Oh god I've had to handle yet more edge cases (created tests for each of them):

  1. Handle correctly the unicode support: ripgrep returns the start and end positions of results within a line in bytes, while our UI (and JS usually) expects these positions in terms of character position on the string. I've had to add some logic to do this conversion (which needs to iterate over the whole line if it has characters that are represented by more than 1 byte). I've made this logic as performant as possible (It's O(n) based on the line length), and in my benchmarks even when getting thousands of results from a single line of > 100k characters length, the processing only takes ~10ms.
  2. Convert includes/exludes into globs: ripgrep expects globs for include and exclude patterns, but scandal (and in general find all interfaces) handle also things like src/ so I had to add some logic to convert these kind of things into globs that would get all expected files.
  3. Unescape slashes from RegExps: ripgrep is quite picky about unnecessarily escaped sequences, so we need to unescape slashes which get automatically escaped by JS RegExp implementation.

There may be other edge cases that I'm missing, so my current recommendation is to add this as a setting on the find-and-replace package and leave it by default for a couple of releases while we get some usage.

@rafeca rafeca merged commit a2a1de8 into master May 23, 2019
@rafeca rafeca deleted the ns-ro/ripgrep-scan branch May 23, 2019 07:39
@Stirfry70
Copy link

Hey

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FY2019Q5 atom perf More information: https://github.com/github/pe-atom-log/issues/728
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants