Scale Paralus and Relay replicas for high availability and no-downtime upgrades #264

akshay196 · 2023-10-11T07:17:14Z

Briefly describe the feature

Paralus and Relay should be scalable to more than one replica. Relay should work properly when the relay server or agent is scaled up.

What problem does this feature solve? Please link any relevant documentation or Issues

Support HA and zero-downtime upgrades.

(optional) What is your current workaround?

None

akshay196 · 2023-10-18T11:19:41Z

First attempt: I scaled the relay server to 2 replicas with an imported cluster in my local Kind setup. The Paralus server crashed with the error below.

{"AccountID":"","PartnerID":"","OrganizationID":"","Username":"","IsSSO":false,"EnforceSession":false,"SessionType":"","SystemUser":false,"RelayNetwork":false}}
panic: uuid: Parse(): invalid UUID length: 0

goroutine 405 [running]:
github.com/google/uuid.MustParse({0x0, 0x0})
	/go/pkg/mod/github.com/google/uuid@v1.3.0/uuid.go:163 +0xb9
github.com/paralus/paralus/pkg/service.(*accountPermissionService).GetAccount(0xc000593388, {0x239ebc0, 0xc0005413b0}, {0x0, 0x0})
	/build/pkg/service/account_permission.go:91 +0x33
github.com/paralus/paralus/server.(*auditInfoServer).LookupUser(0xc00049c660, {0x239ebc0, 0xc0005413b0}, 0xc0003dcfc0)
	/build/server/audit_info.go:47 +0x367
github.com/paralus/paralus/proto/rpc/sentry._AuditInformationService_LookupUser_Handler({0x1dda6e0, 0xc00049c660}, {0x239ebc0, 0xc0005413b0}, 0xc0000bb080, 0x0)
	/build/proto/rpc/sentry/audit_info_grpc.pb.go:91 +0x170
google.golang.org/grpc.(*Server).processUnaryRPC(0xc0005ee1c0, {0x23d4b70, 0xc000315040}, 0xc0003c5c20, 0xc000aceff0, 0x365f220, 0x0)
	/go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:1282 +0xccf
google.golang.org/grpc.(*Server).handleStream(0xc0005ee1c0, {0x23d4b70, 0xc000315040}, 0xc0003c5c20, 0x0)
	/go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:1616 +0xa2a
google.golang.org/grpc.(*Server).serveStreams.func1.2()
	/go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:921 +0x98
created by google.golang.org/grpc.(*Server).serveStreams.func1
	/go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:919 +0x294

It looks like the common name (or the entire peer certificate) is missing from the request coming to the relay server. Auditing handler - https://github.com/paralus/relay/blob/cc8661975750da3f4c6e156d72d8a955d9ccf6cd/pkg/audit/audit.go#L69

akshay196 · 2023-10-25T16:55:18Z

After fixing the above issue, scaling up the relay server causes failures when accessing the target cluster.

kubectl get pod --kubeconfig kubeconfig-admin@paralus.local.yaml 
E1025 22:24:11.932897  218790 memcache.go:265] couldn't get current server API group list: the server has asked for the client to provide credentials
E1025 22:24:11.934600  218790 memcache.go:265] couldn't get current server API group list: the server has asked for the client to provide credentials
E1025 22:24:11.936381  218790 memcache.go:265] couldn't get current server API group list: the server has asked for the client to provide credentials
E1025 22:24:11.938531  218790 memcache.go:265] couldn't get current server API group list: the server has asked for the client to provide credentials
E1025 22:24:11.940673  218790 memcache.go:265] couldn't get current server API group list: the server has asked for the client to provide credentials
error: You must be logged in to the server (the server has asked for the client to provide credentials)

akshay196 · 2023-11-01T03:17:06Z

When the dial-in connection key (dialinsni) is not present in the dial-in pool, we look up the peer cache - https://github.com/paralus/relay/blob/cc8661975750da3f4c6e156d72d8a955d9ccf6cd/pkg/tunnel/server.go#L679
But no routine was found that inserts the relay peer into the peer cache -

paralus/pkg/sentry/peering/peering.go, line 49 in cc1a68a:

    func InsertPeerCache(cache *ristretto.Cache, expiry time.Duration, key, value interface{}) bool {
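To illustrate why a missing peer-cache insert matters with two replicas, here is a simplified sketch of the lookup order described above: the dial-in pool first, then the peer cache. Everything in it (`relayReplica`, `registerDialin`, `insertPeer`, `route`, the SNI and address strings) is hypothetical and uses plain maps; it is not the relay's actual data structures or its ristretto-backed cache.

```go
package main

import (
	"fmt"
	"sync"
)

// relayReplica is a hypothetical model of one relay server replica.
type relayReplica struct {
	mu         sync.RWMutex
	dialinPool map[string]string // dial-in SNI -> local agent connection ID
	peerCache  map[string]string // dial-in SNI -> address of the peer replica holding it
}

func newRelayReplica() *relayReplica {
	return &relayReplica{
		dialinPool: map[string]string{},
		peerCache:  map[string]string{},
	}
}

// registerDialin records an agent connection that terminated on this replica.
func (r *relayReplica) registerDialin(sni, connID string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.dialinPool[sni] = connID
}

// insertPeer records which peer replica holds the dial-in connection for sni.
func (r *relayReplica) insertPeer(sni, peerAddr string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.peerCache[sni] = peerAddr
}

// route checks the local dial-in pool first and falls back to the peer cache.
// viaPeer reports whether the request must be forwarded to another replica.
func (r *relayReplica) route(sni string) (target string, viaPeer bool, err error) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	if conn, ok := r.dialinPool[sni]; ok {
		return conn, false, nil
	}
	if peer, ok := r.peerCache[sni]; ok {
		return peer, true, nil
	}
	return "", false, fmt.Errorf("no dial-in connection or peer for %q", sni)
}

func main() {
	a, b := newRelayReplica(), newRelayReplica()

	// The agent dials in to replica A only.
	a.registerDialin("cluster-1.example", "conn-42")

	// A user request lands on replica B, whose peer cache was never populated.
	_, _, err := b.route("cluster-1.example")
	fmt.Println("before peer insert:", err)

	// Once something inserts the peer entry, B can forward to A.
	b.insertPeer("cluster-1.example", "relay-a:443")
	target, viaPeer, err := b.route("cluster-1.example")
	fmt.Println("after peer insert:", target, viaPeer, err)
}
```

In this model, a request that reaches the replica without the agent connection fails until some routine populates the peer cache, which matches the symptom of requests failing once the relay server is scaled past one replica.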

akshay196 added the enhancement and needs-triage labels Oct 11, 2023

akshay196 self-assigned this Oct 11, 2023

akshay196 added the triage/accepted label and removed the needs-triage label Oct 18, 2023

nirav-rafay added this to the v0.2.7 milestone Jan 19, 2024

niravparikh05 modified the milestones: v0.2.7, v0.2.8 Feb 28, 2024

akshay196 mentioned this issue Mar 12, 2024

audit_info lookupUser with empty content leads to parse error #164

Open
