Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add experimental hash protection from concurrent modifications #1783

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

vrurg
Copy link
Contributor

@vrurg vrurg commented Jan 9, 2024

This is a quick-baked attempt to provide protection from a hash being modified while concurrent threads are also accessing it.

This PR is a working draft. It does nothing to the lower-level MVMStrHashTable to protect from MoarVM internal problems. But it does intercept NQP and Raku attempts to break the rules.

Also, I know too little about speshing to get it play well with the protection. Therefore this PR simply disables speshing of hashes and slurpy named arguments.

@vrurg
Copy link
Contributor Author

vrurg commented Jan 9, 2024

This was my attempt to resolve systematic cannot iterate over VMNull errors I get with my work application. This patch does allow to catch unsafe operations in the following example:

use v6.e.PREVIEW;

my %h;

constant THREADS-PER-GROUP = 20;

my $starter = Promise.new;

sub writer($tid) {
    my $id = $tid.fmt('%5d');
    for ^100000 -> $i {
        %h{$i} = $i ~ ":" ~ $id;
    }
}

sub reader($tid) {
    my $id = $tid.fmt('%5d');
    for ^100000 -> $i {
        note "[$id] $i: ", %h{$i};
    }
}

my @w;
my @pw;
my @pr;
for ^THREADS-PER-GROUP -> $thread-id {
    @pw[$thread-id] = Promise.new;
    @w.push: start {
        @pw[$thread-id].keep;
        await $starter;
        writer($thread-id);
    }

    @pr[$thread-id] = Promise.new;
    @w.push: start {
        @pr[$thread-id].keep;
        await $starter;
        reader($thread-id);
    }
}

await |@pr, |@pw;

$starter.keep;

await @w;

note "Done";

But it proved that corrupted hash data is not the cause of the error above. Instead, on a few occasions, I observed another problem where metamodel classes was recveiving a VMNull while reading from attributes. I.e. in something like the following code if %!attr has never been accessed by a thread yet then there is a chance it'd be nullish:

method foo() {
    if $debug { note(%!attr.HOW.name(%!attr)); } # Here we may get VMNull
    my $v = %!attr<foo>; # Here a valid hash is observed
}

Unfortunately, I haven't managed to produce a reliable scenario where the problem would show up more or less regularly.

@lizmat
Copy link
Contributor

lizmat commented Jan 9, 2024

Perhaps relatedly: the REA ecosystem harvester regularly crashes on https://github.com/rakudo/rakudo/blob/225533d6f2a207f8b913c0f58bc9b6a4dcd4b3b8/src/core.c/IO/Pipe.rakumod#L19 where apparently $!nl-in has become unset.

Now. https://github.com/rakudo/rakudo/blob/main/src/core.c/IO/Handle.rakumod#L5 shows that $!nl-in will normally always be set. And that uses the (possibly fragile) nqp::attrinited functionality, of which I know that @jnthn wants to get rid of.

As an experiment, I will change that code in IO::Handle to set the default in TWEAK, which would not use nqp::attrinited. And see if the harvester crashes less. If it does crash less, then I guess we need to look at nqp::attrinited more closely.

lizmat added a commit to rakudo/rakudo that referenced this pull request Jan 9, 2024
In response to #5444 and the
research that vrurg did for MoarVM/MoarVM#1783
this removes the default value setting of the $!chomp, $!nl-in and
$!nl-out to TWEAK, instead being specified on the attribute definition.
By doing this, the default value won't be set internally by the
nqp::attrinited op, which is now suspected of being a possible cause
of random crashes.

If the REA harvester doesn't crash anymore (runs hourly, crashed about
twice a week), then we have a direction to further look into these
crashes.
@vrurg
Copy link
Contributor Author

vrurg commented Jan 9, 2024

As an experiment, I will change that code in IO::Handle to set the default in TWEAK, which would not use nqp::attrinited. And see if the harvester crashes less. If it does crash less, then I guess we need to look at nqp::attrinited more closely.

Sounds promising.

While I was able to reproduce the problem with a local test (then something changed and the test doesn't crash anymore) the scenario was always pointing and deserialization stage. If memory doesn't lie to me then MoarVM does lazy deserialization on demand and I was thinking that there is a race possible where while one thread is still deserializing, the other one already reads NULL from undeserialized attribute. But later observations did show that the real app log has errors recorded long after all related modules should've been loaded. It doesn't mean that I don't overlook something, but attrinit is way more likely source of issues.

Now. https://github.com/rakudo/rakudo/blob/main/src/core.c/IO/Handle.rakumod#L5 shows that $!nl-in will normally always be set.

This is the only part where I'm slightly confused. Rakudo doesn't currently use attrinit with MoarVM – all is done by dispatchers. NQP does and my experience backs this by pointing out that whenever I was able to pinpoint the location where VMNull pops up it was NQP code. But here it is clearly a Raku case. It would make sense to somehow get a full backtrace to see if it's the attribute itself or something in the underlying NQP.

@lizmat
Copy link
Contributor

lizmat commented Jan 9, 2024

Rakudo doesn't currently use attrinit with MoarVM – all is done by dispatchers

I think that's incorrect: https://github.com/rakudo/rakudo/blob/main/src/Perl6/World.nqp#L3853 is where the BUILDALL builder creates nqp::attrinited code.

@lizmat
Copy link
Contributor

lizmat commented Jan 9, 2024

And the default BUILDALL at https://github.com/rakudo/rakudo/blob/main/src/core.c/Mu.rakumod#L222 contains references to nqp::p6attrinited. Not sure what the difference is, if there is any.

@lizmat
Copy link
Contributor

lizmat commented Jan 9, 2024

Documentation of nqp::attrinited: https://github.com/Raku/nqp/blob/main/docs/ops.markdown#attrinited

"Note that any access to the attribute that results in a getattr call causes it to be inited."

I wonder if that would be the explanation: that a getattr sneaking in would mark the attribute as initialized, when it wasn't yet.

@lizmat
Copy link
Contributor

lizmat commented Jan 9, 2024

Continue to think out loud re attribute initialization: I wonder if Opaque objects could have their attributes (cheaply) initialized with a sentinel value, and then an nqp::eqaddr could be used to see whether the attribute had been initialized already or not.

@vrurg
Copy link
Contributor Author

vrurg commented Jan 9, 2024

Along the way of investigation you missed another location: https://github.com/rakudo/rakudo/blob/69b8a24ae34ec70f51e1b9f20b37d240cec1ffd3/src/Perl6/Actions.nqp#L977

Continue to think out loud re attribute initialization: I wonder if Opaque objects could have their attributes (cheaply) initialized with a sentinel value

This is kind of what @jnthn has done to it, in my understanding (I never really in-depth analyzed that part): https://github.com/rakudo/rakudo/blob/69b8a24ae34ec70f51e1b9f20b37d240cec1ffd3/src/Perl6/bootstrap.c/BOOTSTRAP.nqp#L1425

@lizmat
Copy link
Contributor

lizmat commented Jan 10, 2024

The REA harvester crashed again with a VMnull error at the same location. So it does NOT appear to be related to nqp::attrinited.

@vrurg
Copy link
Contributor Author

vrurg commented Jan 26, 2024

@niner Would it be possible to have a look at this PR and decide what to do about it? It's already let me locate a problem with %!mro in Metamodel::C3MRO which otherwise wouldn't be that easy to spot. So, there is a point in having this functionality in the VM.

@@ -158,6 +158,7 @@ MVM_trycas_AO(volatile AO_t *addr, uintptr_t old, const uintptr_t new) {
#define MVM_HASH_RANDOMIZE 1
#define MVM_HASH_MAX_PROBE_DISTANCE 255
#define MVM_HASH_INITIAL_BITS_IN_METADATA 5
#define MVM_HASH_PROTECT 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We surely want this to be disabled by default, don't we?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For sure. The first comment says that this is draft. Don't remember why I haven't explicitly marked it so.

I more worried about the necessity to disable speshing because I haven't found a way to initialize per-hash lock for spesh-allocated objects.

This is a quick-baked attempt to provide protection from a hash being
modified while concurrent threads are also accessing it.
@vrurg vrurg force-pushed the draft-of-hash-lock-protection branch from 1595d9a to 751a2ff Compare February 22, 2024 14:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants