
Namespaced actors (Multitenant placement service) #7745

Merged: 67 commits into dapr:master on Jun 5, 2024

Conversation

elena-kolevska (Contributor)

Description

This PR implements support for namespaced actors (multi-tenant placement service).

Requirements

  • Multi-tenancy means that sidecars belonging to tenant A will not receive placement information for tenant B (as a consequence, multiple tenants can have actor types with the same name, a hotly requested feature).
  • When mTLS is enabled, the namespace reported by the sidecar is verified against its SPIFFE ID; when mTLS is not enabled, we accept the reported namespace as-is (see the sketch below).
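A minimal sketch of that verification step, assuming Dapr's SPIFFE ID path convention (/ns/<namespace>/<app-id>) and the go-spiffe v2 spiffeid type; the helper name and error handling are illustrative, not the exact code in this PR:

```go
package placement

import (
	"fmt"
	"strings"

	"github.com/spiffe/go-spiffe/v2/spiffeid"
)

// verifyNamespace compares the namespace a sidecar reports on the stream with
// the namespace embedded in its SPIFFE ID (path form: /ns/<namespace>/<app-id>).
// Hypothetical helper, for illustration only.
func verifyNamespace(id spiffeid.ID, reported string) (string, error) {
	segments := strings.Split(strings.TrimPrefix(id.Path(), "/"), "/")
	if len(segments) < 3 || segments[0] != "ns" {
		return "", fmt.Errorf("unexpected SPIFFE ID path %q", id.Path())
	}
	ns := segments[1]
	if reported != "" && reported != ns {
		return "", fmt.Errorf("reported namespace %q does not match SPIFFE ID namespace %q", reported, ns)
	}
	return ns, nil
}
```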

Some implementation details

  • The in-memory raft-based store was adapted to group placement data by namespace
  • Placement tables are kept per namespace
  • Sidecars in a namespace will only receive the placement table for their own namespace
  • A stream connection pool will be kept per namespace (see the sketch below)
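A rough sketch of that per-namespace grouping (namespace → placement table, namespace → connection pool); the type and field names below are illustrative, not the actual raft store types touched in this PR:

```go
package placement

import "sync"

// namespacedState illustrates grouping placement data by namespace: each
// namespace owns its own table and its own set of sidecar streams, so
// dissemination for one tenant never includes entries from another.
type namespacedState struct {
	mu sync.RWMutex
	// namespace -> (actor type -> host addresses)
	tables map[string]map[string][]string
	// namespace -> connected sidecar stream IDs (opaque placeholders here)
	streams map[string][]string
}

// tableFor returns the placement table for a namespace, creating it lazily.
func (s *namespacedState) tableFor(ns string) map[string][]string {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.tables == nil {
		s.tables = make(map[string]map[string][]string)
	}
	t, ok := s.tables[ns]
	if !ok {
		t = make(map[string][]string)
		s.tables[ns] = t
	}
	return t
}
```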

Backwards compatibility scenarios:

  • The format of the placement table that the sidecars receive will not change, only its content will, making it fully backwards compatible.
  • Old sidecars that use TLS already make their namespace known (through their SPIFFE ID), so we will default to that namespace even though the sidecar doesn’t report it explicitly (because it’s on an old version).
  • We have no way of knowing the namespace for old sidecars that do not use TLS. For these sidecars we will use a special “empty” namespace. When they connect to a new placement service, they will only get the actor types hosted on other old sidecars that are not on TLS (in the “empty namespace”).

[Diagram: six sidecars (A–F) split across two namespaces, illustrating the upgrade scenarios described below]

TLS enabled

As soon as the placement server is updated, sidecars A, B and C will see each other’s actor types, and no others. The same is true for sidecars D, E and F.

TLS not enabled

  • Sidecars A and B will not have information about the actor types hosted on sidecar C.
  • Sidecar D will not have information about the actor types hosted on sidecar E and F.
  • Sidecars C, E and F will see each other’s actor types, as before.

For true backwards compatibility, we would keep sending actor types from all six sidecars to the old sidecars, but this is a security concern and breaks the first requirement for multi-tenancy above (”sidecars belonging to tenant A will not receive placement information for tenant B”). A malicious sidecar could connect with an older version on purpose and see actor types that belong to other namespaces.

This behaviour will be clearly documented (”If you’re not using TLS you need to have a uniform version per namespace, either upgrade all sidecars in a namespace or none.”).

Issue reference

#4711
#3167

Checklist

Please make sure you've completed the relevant tasks for this PR, out of the following list:

grpcServer := grpc.NewServer(sec.GRPCServerOptionMTLS())

keepaliveParams := keepalive.ServerParameters{
	Time: 1 * time.Second,
elena-kolevska (Author):

We can discuss these numbers

@JoshVanL (Contributor), May 20, 2024:

👍 1 second is fairly aggressive, but not insane. The gRPC library uses 5 seconds by default: https://github.com/grpc/grpc-go/blob/e22436abb80982dab3294250b7284eedd8f20768/examples/features/keepalive/server/main.go#L47

Member:

We've recently seen a few examples of managed K8s clusters running on what appeared to be sporadically unreliable networks. In these cases 1 second does seem a bit aggressive; however, if we're confident that reconnections shouldn't cause any widespread issues, then I guess it's fine to keep.

elena-kolevska (Author):

Time in this case is the period of inactivity after which the server pings the client to check whether the stream is still alive. Since the sidecars ping every second, maybe we can relax this and make it 2 seconds.
Timeout is how long the server waits for a response to that ping before closing the connection. I think we can increase it to 3 seconds, which matches the old faulty-host detection timeout of 3 seconds.
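For reference, the relaxed values discussed above would look roughly like this with grpc-go's keepalive package (the exact numbers are still up for discussion):

```go
package placement

import (
	"time"

	"google.golang.org/grpc/keepalive"
)

var keepaliveParams = keepalive.ServerParameters{
	// Ping after 2s of stream inactivity; sidecars already ping every second,
	// so this allows one missed ping before the server probes.
	Time: 2 * time.Second,
	// Close the connection if the ping is not acknowledged within 3s,
	// matching the old faulty-host detection timeout.
	Timeout: 3 * time.Second,
}
```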

@JoshVanL (Contributor) left a review:

Some comments from me 🙂

pkg/actors/placement/placement.go (review thread resolved; outdated)
pkg/placement/leadership.go (review thread resolved)
pkg/placement/leadership.go (review thread resolved)
leaderLoopCancel()
<-loopNotRunning
}()

Contributor:

We have three wrapped contexts in this function, as well as double selects, onces, defers, etc., which can be tricky to follow. Would using a concurrency.RunnerManager per for-loop help?

elena-kolevska (Author):

This is existing code I moved from membership.go to its own file. I would like to limit changes to that part, just so we can keep the scope of the PR smaller, but I agree that this is a good candidate for refactoring.
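For later reference, a sketch of the suggested shape, assuming the RunnerManager from dapr/kit's concurrency package (runners are func(context.Context) error); the function and parameter names here are made up for illustration:

```go
package placement

import (
	"context"

	"github.com/dapr/kit/concurrency"
)

// runLeadershipLoops shows how the hand-rolled context wrapping, selects and
// onces could collapse into one RunnerManager: each loop becomes a runner,
// and the manager owns cancellation and shutdown ordering.
func runLeadershipLoops(ctx context.Context, leaderLoop, membershipLoop func(context.Context) error) error {
	return concurrency.NewRunnerManager(
		leaderLoop,
		membershipLoop,
	).Run(ctx)
}
```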


barrier := raft.Barrier(barrierWriteTimeout)
if err := barrier.Error(); err != nil {
	log.Error("failed to wait for barrier", "error", err)
Contributor:

Is this a transient error we can ignore? Should we not return the error here and ultimately close placement, given this could mean we can no longer write to the db?

elena-kolevska (Author):

Yes, this looks safe, because the loop will just restart. If there's a real problem with the raft node, it will lose its leader status anyway, which will cancel the context and the loop.
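A sketch of the pattern described above (placeholder names and standard-library logging, not the exact leadership.go code): a barrier failure just skips the iteration, while loss of leadership cancels the context and ends the loop.

```go
package placement

import (
	"context"
	"log"
	"time"

	"github.com/hashicorp/raft"
)

// leaderLoop illustrates why the barrier error can be tolerated: on failure we
// log and wait for the next tick, and if the node truly lost its health it
// also loses leadership, which cancels ctx and exits the loop.
func leaderLoop(ctx context.Context, r *raft.Raft, interval, barrierWriteTimeout time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if err := r.Barrier(barrierWriteTimeout).Error(); err != nil {
				log.Printf("failed to wait for barrier: %v", err)
				continue
			}
			// ... membership/dissemination work goes here ...
		}
	}
}
```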

Comment on lines 123 to 129
require.Len(t, placementTables.GetEntries(), 6)
require.Contains(t, placementTables.GetEntries(), "actor2")
require.Contains(t, placementTables.GetEntries(), "actor3")
require.Contains(t, placementTables.GetEntries(), "actor4")
require.Contains(t, placementTables.GetEntries(), "actor5")
require.Contains(t, placementTables.GetEntries(), "actor6")
require.Contains(t, placementTables.GetEntries(), "actor10")
Contributor:

I think require will cause a panic when an assertion fails inside an EventuallyWithT loop. Consider using assert, with an if check on the assertion to gate further assertions and avoid a nil pointer panic.

elena-kolevska (Author):

The requires should be OK because they're under an if condition, but in case they do fail in the future it's better to have them as asserts.
I also added a check for the element's presence.
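A sketch of that pattern with testify (the getEntries helper is hypothetical): asserting against the CollectT and gating the Contains checks on the Len check means a failing iteration is retried instead of panicking.

```go
package placement_test

import (
	"testing"
	"time"

	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"
)

// getEntries stands in for however the test fetches the current placement
// table entries; hypothetical for this sketch.
var getEntries = func() map[string]struct{} { return nil }

func TestEntriesEventually(t *testing.T) {
	require.EventuallyWithT(t, func(c *assert.CollectT) {
		entries := getEntries()
		// Gate further checks on the length assertion so a not-yet-disseminated
		// table doesn't cause missing-key failures to cascade; EventuallyWithT retries.
		if !assert.Len(c, entries, 6) {
			return
		}
		assert.Contains(c, entries, "actor2")
		assert.Contains(c, entries, "actor10")
	}, 10*time.Second, 100*time.Millisecond)
}
```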

@@ -124,7 +126,7 @@ func (i *insecure) Run(t *testing.T, ctx context.Context) {
 		if err != nil {
 			return false
 		}
-		err = stream.Send(&v1pb.Host{Id: "app-1"})
+		err = stream.Send(&v1pb.Host{Id: "app-1", Namespace: "default"})
Contributor:

Please add tests which have namespace as an empty string. We should also include empty string namespace tests elsewhere if we don't already.

elena-kolevska (Author):

Yeah, we do. We have one here specific to dissemination, and two more in an auth context here and here.
I don't know if it makes sense to do the same for the quorum tests, since the namespace doesn't influence the leader election or Dapr finding the right leader.

pkg/placement/raft/state.go (review thread resolved; outdated)
pkg/placement/raft/fsm.go (review thread resolved; outdated)
pkg/placement/pool.go (review thread resolved; outdated)
@@ -128,6 +130,7 @@ func NewActorPlacement(opts internal.ActorsProviderOptions) internal.PlacementSe
 		closeCh:           make(chan struct{}),
 		resiliency:        opts.Resiliency,
 		virtualNodesCache: hashing.NewVirtualNodesCache(),
+		namespace:         security.CurrentNamespace(),
Contributor:

Please pass this as an option, as this function uses an env var, which can be awkward for testing.
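A sketch of what passing it as an option could look like (the option and constructor names here are hypothetical, not the final API), so tests can inject a namespace without setting the env var:

```go
package placement

import "os"

// Options is a hypothetical constructor options struct; only the namespace
// field matters for this sketch.
type Options struct {
	Namespace string
}

type actorPlacement struct {
	namespace string
}

// New lets callers (and tests) pass the namespace explicitly, while production
// wiring can still default it from the environment, which is what the
// env-var-based lookup does today.
func New(opts Options) *actorPlacement {
	ns := opts.Namespace
	if ns == "" {
		ns = os.Getenv("NAMESPACE")
	}
	return &actorPlacement{namespace: ns}
}
```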

p.memberUpdateCount.Add(1)
// ApplyCommand returns true only if the command changes the hashing table.
if updateCount {
c, _ := p.memberUpdateCount.GetOrSet(op.host.Namespace, &atomic.Uint32{})
Contributor:

Are we ignoring errors here?

elena-kolevska (Author):

No, this just returns a bool indicating if the value already existed.
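For comparison (not the map type used here), sync.Map's LoadOrStore has the same shape: the second return value reports whether the key already existed; it is not an error.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

func main() {
	var counts sync.Map
	// Returns the existing counter if one is already stored for "default",
	// otherwise stores the new one; loaded says which case happened.
	v, loaded := counts.LoadOrStore("default", &atomic.Uint32{})
	v.(*atomic.Uint32).Add(1)
	fmt.Println(loaded) // false the first time the namespace is seen
}
```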

tests/integration/suite/placement/ha/placementfailover.go (review thread resolved; outdated)
Comment on lines 302 to 309
func getClient(t *testing.T, ctx context.Context, addr string) rtv1.DaprClient {
	t.Helper()

	conn, err := grpc.DialContext(ctx, addr, grpc.WithTransportCredentials(insecure.NewCredentials()), grpc.WithBlock())
	require.NoError(t, err)
	t.Cleanup(func() { require.NoError(t, conn.Close()) })
	return rtv1.NewDaprClient(conn)
}
Contributor:

Suggested change (delete this helper and use the built-in client instead, as in the next suggestion):
-func getClient(t *testing.T, ctx context.Context, addr string) rtv1.DaprClient {
-	t.Helper()
-	conn, err := grpc.DialContext(ctx, addr, grpc.WithTransportCredentials(insecure.NewCredentials()), grpc.WithBlock())
-	require.NoError(t, err)
-	t.Cleanup(func() { require.NoError(t, conn.Close()) })
-	return rtv1.NewDaprClient(conn)
-}

n.daprd3.WaitUntilAppHealth(t, ctx)

t.Run("host1 can see actor 1 in ns1, but not actors 2 and 3 in ns2", func(t *testing.T) {
client := getClient(t, ctx, n.daprd1.GRPCAddress())
Contributor:

Suggested change:
-client := getClient(t, ctx, n.daprd1.GRPCAddress())
+client := n.daprd1.GRPCClient()

Id: "app-1",
Namespace: "",
})
require.NoError(t, err)
Contributor:

Please check the error type returned.

elena-kolevska (Author):

We shouldn't have an error here.

var err error
var err error

require.EventuallyWithT(t, func(c *assert.CollectT) {
Contributor:

This Eventually isn't doing anything.

elena-kolevska (Author):

We use it as a retry until placement is ready, and then check the error outside of the function.

Contributor:

The Eventually is only going to run once, as c is not asserted against. It doesn't seem like we need this Eventually if the tests are passing?

elena-kolevska (Author):

Ah yes! Fixed.

@@ -252,7 +252,7 @@ func (p *Placement) RegisterHostWithMetadata(t *testing.T, parentCtx context.Con
 	for {
 		in, err := stream.Recv()
 		if err != nil {
-			if ctx.Err() != nil || errors.Is(err, context.Canceled) || errors.Is(err, io.EOF) || status.Code(err) == codes.Canceled {
+			if ctx.Err() != nil || errors.Is(err, context.Canceled) || errors.Is(err, io.EOF) || status.Code(err) == codes.Canceled || status.Code(err) == codes.FailedPrecondition || status.Code(err) == codes.Unavailable {
Contributor:

Why are we doing this?

elena-kolevska (Author):

It was old code I added to, but it doesn't make sense as it is... I changed it to return on any error.
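The shape after that change is roughly the following (a generic sketch; the stream interface and handler are placeholders, not the test framework's actual types):

```go
package placementtest

// recvStream captures just the part of the placement stream this sketch needs.
type recvStream[T any] interface {
	Recv() (T, error)
}

// readLoop stops on any receive error instead of allow-listing specific gRPC
// status codes, which is the behaviour settled on in the thread above.
func readLoop[T any](stream recvStream[T], handle func(T)) {
	for {
		in, err := stream.Recv()
		if err != nil {
			return
		}
		handle(in)
	}
}
```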

pkg/placement/pool.go (review thread resolved; outdated)
pkg/placement/pool.go (review thread resolved; outdated)
@yaron2 yaron2 merged commit 941178c into dapr:master Jun 5, 2024
18 of 20 checks passed
@yaron2 (Member) commented Jun 5, 2024:

Incredibly happy to see this land.
