-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Namespaced actors (Multitenant placement service) #7745
Conversation
Signed-off-by: Elena Kolevska <elena@kolevska.com>
Signed-off-by: Elena Kolevska <elena@kolevska.com>
Signed-off-by: Elena Kolevska <elena@kolevska.com>
Signed-off-by: Elena Kolevska <elena@kolevska.com>
Signed-off-by: Elena Kolevska <elena@kolevska.com>
Signed-off-by: Elena Kolevska <elena@kolevska.com>
Signed-off-by: Elena Kolevska <elena@kolevska.com>
Signed-off-by: Elena Kolevska <elena@kolevska.com>
Signed-off-by: Elena Kolevska <elena@kolevska.com>
Signed-off-by: Elena Kolevska <elena@kolevska.com>
Signed-off-by: Elena Kolevska <elena@kolevska.com>
Signed-off-by: Elena Kolevska <elena@kolevska.com>
Signed-off-by: Elena Kolevska <elena@kolevska.com>
Signed-off-by: Elena Kolevska <elena@kolevska.com>
Signed-off-by: Elena Kolevska <elena@kolevska.com>
Signed-off-by: Elena Kolevska <elena@kolevska.com>
Signed-off-by: Elena Kolevska <elena@kolevska.com>
Signed-off-by: Elena Kolevska <elena@kolevska.com>
Signed-off-by: Elena Kolevska <elena@kolevska.com>
pkg/placement/placement.go
Outdated
grpcServer := grpc.NewServer(sec.GRPCServerOptionMTLS()) | ||
|
||
keepaliveParams := keepalive.ServerParameters{ | ||
Time: 1 * time.Second, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can discuss these numbers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 1 second is fairly aggressive, but not insane. gRPC library uses 5 seconds by default https://github.com/grpc/grpc-go/blob/e22436abb80982dab3294250b7284eedd8f20768/examples/features/keepalive/server/main.go#L47
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We've seen recently a few examples of managed K8s clusters running in what appeared to be sporadically unreliable networks. In these cases 1 second does seem a bit aggressive, however if we're confident that reconnections shouldn't cause any widespread issues then I guess its fine to keep
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Time
in this case is the period of inactivity after which the server will try to ping the client to check if the stream is still alive. Knowing that the sidecars pings every second, maybe we can relax this requirement, and make it 2 seconds.
Timeout
represents the duration after the ping during which the server will wait for a response, after which the connection is closed. I think we can increase this to 3 seconds, which will match the old faulty host detection timeout of 3 seconds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some comments from me 🙂
leaderLoopCancel() | ||
<-loopNotRunning | ||
}() | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have three wrapped contexts here in this function, as well as double select
s, once
s, defers
etc. which can be tricky to follow. Would using a concurrency.RunnerManager
per for
loop help?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is existing code I moved from membership.go
to its own file. I would like to limit changes to that part, just so we can keep the scope of the PR smaller, but I agree that this is a good candidate for refactoring.
|
||
barrier := raft.Barrier(barrierWriteTimeout) | ||
if err := barrier.Error(); err != nil { | ||
log.Error("failed to wait for barrier", "error", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a transient error we can ignore? Should we not return the error here and ultimately close placement- given this could mean we can no longer write to the db (?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this looks safe, because the loop will just restart. If there's a real problem with the raft node, it will lose its leader status anyway, which will cancel the context and the loop.
require.Len(t, placementTables.GetEntries(), 6) | ||
require.Contains(t, placementTables.GetEntries(), "actor2") | ||
require.Contains(t, placementTables.GetEntries(), "actor3") | ||
require.Contains(t, placementTables.GetEntries(), "actor4") | ||
require.Contains(t, placementTables.GetEntries(), "actor5") | ||
require.Contains(t, placementTables.GetEntries(), "actor6") | ||
require.Contains(t, placementTables.GetEntries(), "actor10") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think require
will cause a panic when an assertion fails inside a EventuallyWithT
loop. Consider using assert
, with an if
check on the assertion to gate further assertions to avoid a nil pointer panic .
if assert.NoError(c, rErr) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The requires should be ok, because they're under an if
condition, but in the eventual case of them failing in the future, it's better to have them as asserts.
I also added a check for the element's presence.
tests/integration/suite/placement/dissemination/namespaced_actors_tls.go
Outdated
Show resolved
Hide resolved
tests/integration/suite/placement/dissemination/namespaced_actors_tls.go
Outdated
Show resolved
Hide resolved
@@ -124,7 +126,7 @@ func (i *insecure) Run(t *testing.T, ctx context.Context) { | |||
if err != nil { | |||
return false | |||
} | |||
err = stream.Send(&v1pb.Host{Id: "app-1"}) | |||
err = stream.Send(&v1pb.Host{Id: "app-1", Namespace: "default"}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add tests which have namespace as an empty string. We should also include empty string namespace tests elsewhere if we don't already.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tests/integration/suite/placement/dissemination/namespaced_actors_tls.go
Outdated
Show resolved
Hide resolved
Signed-off-by: Elena Kolevska <elena@kolevska.com>
Signed-off-by: Elena Kolevska <elena@kolevska.com>
Signed-off-by: Elena Kolevska <elena@kolevska.com>
Signed-off-by: Elena Kolevska <elena@kolevska.com>
Signed-off-by: Elena Kolevska <elena@kolevska.com>
…/dapr into feature/namespaced-actors
Signed-off-by: Elena Kolevska <elena@kolevska.com>
Signed-off-by: Elena Kolevska <elena@kolevska.com>
Signed-off-by: Elena Kolevska <elena@kolevska.com>
Signed-off-by: Elena Kolevska <elena@kolevska.com>
Signed-off-by: Elena Kolevska <elena@kolevska.com>
…ad of EOF Signed-off-by: Elena Kolevska <elena@kolevska.com>
Signed-off-by: Elena Kolevska <elena@kolevska.com>
Signed-off-by: Elena Kolevska <elena@kolevska.com>
Signed-off-by: Elena Kolevska <elena@kolevska.com>
Signed-off-by: Elena Kolevska <elena@kolevska.com>
Signed-off-by: Elena Kolevska <elena@kolevska.com>
pkg/actors/placement/placement.go
Outdated
@@ -128,6 +130,7 @@ func NewActorPlacement(opts internal.ActorsProviderOptions) internal.PlacementSe | |||
closeCh: make(chan struct{}), | |||
resiliency: opts.Resiliency, | |||
virtualNodesCache: hashing.NewVirtualNodesCache(), | |||
namespace: security.CurrentNamespace(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please pass as option as this function uses env var which can be awkward for testing.
p.memberUpdateCount.Add(1) | ||
// ApplyCommand returns true only if the command changes the hashing table. | ||
if updateCount { | ||
c, _ := p.memberUpdateCount.GetOrSet(op.host.Namespace, &atomic.Uint32{}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we ignoring errors here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, this just returns a bool indicating if the value already existed.
func getClient(t *testing.T, ctx context.Context, addr string) rtv1.DaprClient { | ||
t.Helper() | ||
|
||
conn, err := grpc.DialContext(ctx, addr, grpc.WithTransportCredentials(insecure.NewCredentials()), grpc.WithBlock()) | ||
require.NoError(t, err) | ||
t.Cleanup(func() { require.NoError(t, conn.Close()) }) | ||
return rtv1.NewDaprClient(conn) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
func getClient(t *testing.T, ctx context.Context, addr string) rtv1.DaprClient { | |
t.Helper() | |
conn, err := grpc.DialContext(ctx, addr, grpc.WithTransportCredentials(insecure.NewCredentials()), grpc.WithBlock()) | |
require.NoError(t, err) | |
t.Cleanup(func() { require.NoError(t, conn.Close()) }) | |
return rtv1.NewDaprClient(conn) | |
} |
n.daprd3.WaitUntilAppHealth(t, ctx) | ||
|
||
t.Run("host1 can see actor 1 in ns1, but not actors 2 and 3 in ns2", func(t *testing.T) { | ||
client := getClient(t, ctx, n.daprd1.GRPCAddress()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
client := getClient(t, ctx, n.daprd1.GRPCAddress()) | |
client := n.daprd1.GRPCClient() |
Id: "app-1", | ||
Namespace: "", | ||
}) | ||
require.NoError(t, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please check there error type returned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shouldn't have an error here.
var err error | ||
var err error | ||
|
||
require.EventuallyWithT(t, func(c *assert.CollectT) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This Eventually
isn't doing anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We use it as a retry until placement is ready, and then check the error outside of the function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Eventually
is only going to be run once as c
is not asserted. It doesn't seem like we need this Eventually
if the tests are passing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes! Fixed.
@@ -252,7 +252,7 @@ func (p *Placement) RegisterHostWithMetadata(t *testing.T, parentCtx context.Con | |||
for { | |||
in, err := stream.Recv() | |||
if err != nil { | |||
if ctx.Err() != nil || errors.Is(err, context.Canceled) || errors.Is(err, io.EOF) || status.Code(err) == codes.Canceled { | |||
if ctx.Err() != nil || errors.Is(err, context.Canceled) || errors.Is(err, io.EOF) || status.Code(err) == codes.Canceled || status.Code(err) == codes.FailedPrecondition || status.Code(err) == codes.Unavailable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we doing this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was an old code I added to, but it doesn't make sense as it is... I changed it to return on any error.
Signed-off-by: Elena Kolevska <elena@kolevska.com>
Signed-off-by: Elena Kolevska <elena@kolevska.com>
Signed-off-by: Elena Kolevska <elena@kolevska.com>
Incredibly happy to see this land. |
Description
This PR implements support for namespaced actors (multi-tenant placement service).
Requirements
Some implementation details
Backwards compatibility scenarios:
TLS enabled
As soon as the placement server is updated, sidecars A, B and C will see each other’s actor types, and no others. The same is true for sidecars D, E and F.
TLS not enabled
For true backwards compatibility, we would keep sending actor types from all six sidecars to the old sidecars, but this is a security concern and breaks the first requirement for multi-tenancy above (”sidecars belonging to tenant A will not receive placement information for tenant B”). A malicious sidecar could connect with an older version on purpose and see actor types that belong to other namespaces.
This behaviour will be clearly documented (”If you’re not using TLS you need to have a uniform version per namespace, either upgrade all sidecars in a namespace or none.”).
Issue reference
#4711
#3167
Checklist
Please make sure you've completed the relevant tasks for this PR, out of the following list: