Should routing of /data requests be restricted to Space or Room-level nodes? (Netdata Cloud) #13554
Replies: 7 comments 1 reply
-
@juacker this is a follow-up to the conversation on this bug netdata/netdata-cloud#525. @amalkov @ktsaou @papazach @ralphm @stelfrag it would be good to get your input, since we are also experiencing this on our current Netdata Production Space, e.g. on
-
The routing logic is such that all node instances for a given node are candidates for querying, as long as those node instances are represented by Agents that have been claimed to the given Space. This means that if you don't want a certain Agent responding for a given node, you should not claim that Agent to that Space. Unfortunately, because nodes can only exist in one Space on a (Cloud) Hub, in practice this also means you should not claim the Agent to the same Hub. So, if you want to have experimental parent nodes, you should claim them to a different Hub (e.g. our internal staging or testing environments).

Rooms have nothing to do with this logic, and that makes sense: requiring a parent node to be a member of a particular Room so that the nodes it represents can be queried seems undesirable to me, since the parent typically has nothing to do with the workload of the monitored nodes. Maybe we could find a way for instances of a given node to be in different Spaces, but I don't think our current data model supports that.
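To make the candidate rule above concrete, here is a minimal hypothetical sketch (names like `NodeInstance` and `candidate_instances` are illustrative, not the actual Netdata Cloud implementation): an instance is a candidate only if its hosting Agent is claimed to the Space, and Room membership plays no part.

```python
# Hypothetical sketch of the routing rule described above.
# Assumption: we know, per Space, the set of Agent IDs claimed to it.
from dataclasses import dataclass

@dataclass(frozen=True)
class NodeInstance:
    agent_id: str   # the Agent (e.g. a parent) hosting this instance
    node_id: str    # the node this instance represents

def candidate_instances(node_id, instances, claimed_agent_ids):
    """Return all instances of `node_id` whose Agent is claimed to the Space.

    Rooms are deliberately absent: they do not influence routing.
    """
    return [
        inst for inst in instances
        if inst.node_id == node_id and inst.agent_id in claimed_agent_ids
    ]
```

This also illustrates why the only way to exclude an experimental parent today is to not claim it to the Space (or Hub) at all: there is no finer filter in the rule.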
-
I think we can see the current authorization model as having two layers.
If the user decides to stream data from one Agent to an experimental Agent in a different Space, should Cloud block that? With the current authorization model, I think the user has complete control over which users can access each piece of data and where they can access it from, so in terms of data access and resource utilization they have full control and can pick the approach that best suits their needs. For the scenarios mentioned above, some users may want to access the information regardless of whether it lives on an experimental Agent, and some may not, so a one-size-fits-all rule that suits everyone and every purpose will be difficult to find.

I think, however, that Cloud does not learn from bad routing decisions, and improving in that direction could help us solve this kind of issue. Do we want to give the user finer-grained control over routing? Do we want to add intelligence to Cloud so it learns automatically when a routing request fails repeatedly? Both approaches have pros and cons, but maybe we can move in one of these directions.
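The "learning from bad routing decisions" idea could be as simple as tracking consecutive failures per candidate and preferring the healthiest one. A purely hypothetical sketch, not existing Cloud behaviour (`FailureAwareRouter` and its methods are invented names):

```python
# Hypothetical sketch: penalize instances whose recent /data requests
# failed, preferring the candidate with the fewest consecutive failures.
from collections import defaultdict

class FailureAwareRouter:
    def __init__(self):
        # consecutive failure count per candidate instance
        self.consecutive_failures = defaultdict(int)

    def pick(self, candidates):
        # Prefer the candidate with the fewest consecutive failures;
        # ties keep the original candidate order.
        return min(candidates, key=lambda c: self.consecutive_failures[c])

    def record(self, candidate, ok):
        # A success resets the counter; a failure increments it.
        if ok:
            self.consecutive_failures[candidate] = 0
        else:
            self.consecutive_failures[candidate] += 1
```

A real implementation would also need decay/expiry so a parent that was briefly unreachable isn't penalized forever, which is part of the pros-and-cons trade-off mentioned above.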
-
I agree with your inputs @ralphm, but then our routing is "broken", since, as @juacker mentioned:
If nodes are in different spaces there could be cross-space requests flowing. @juacker, if this isn't too much effort we could start by fixing this - I will open a bug if all agree. The smart/learning-routing ideas are really interesting, but maybe we could start smaller. I remember it being discussed in the past that we should provide a way for users to define a routing priority, e.g. experimental nodes could be given the lowest priority; this doesn't invalidate the more sophisticated approach of learning from missed requests.
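The user-defined priority idea could look roughly like this hypothetical sketch (the `priority` map and `reachable` callback are assumptions for illustration): `/data` goes to the reachable candidate with the highest priority, so experimental parents with the lowest priority are only used as a last resort.

```python
# Hypothetical sketch of user-defined routing priorities.
# Assumption: the user assigns a numeric priority per parent/instance;
# experimental parents get the lowest values.
def pick_by_priority(candidates, priority, reachable):
    """Pick the reachable candidate with the highest priority value.

    `priority` maps candidate -> int (default 0);
    `reachable` is a predicate telling whether a candidate can serve now.
    """
    live = [c for c in candidates if reachable(c)]
    if not live:
        return None  # no candidate can serve the request
    return max(live, key=lambda c: priority.get(c, 0))
```

This keeps experimental nodes available as a fallback when every production parent is down, rather than excluding them entirely.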
-
@ralphm with your last comment I'm not clear whether you consider this as something that needs to be fixed or not.
-
Isn't the point that right now nodes are tied to Spaces, and node instances to nodes? I.e. a node instance doesn't have a direct connection to the Space. If you want to have certain instances in different Spaces, we'd have to change that model. I'm not sure the Agent really cares.
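The model described above can be summarized in a few lines; this is an illustrative sketch of the relationships (type and field names are invented), showing that an instance's Space is only reachable through its Node, which is why instances of one node cannot live in different Spaces today.

```python
# Illustrative sketch of the data model as described:
# a Node belongs to exactly one Space; a NodeInstance points at a Node,
# never directly at a Space.
from dataclasses import dataclass

@dataclass
class Node:
    node_id: str
    space_id: str   # a node lives in exactly one Space

@dataclass
class NodeInstance:
    instance_id: str
    node_id: str    # instance -> node; no direct space link

def space_of(instance, nodes_by_id):
    """The instance's Space is always derived from its Node."""
    return nodes_by_id[instance.node_id].space_id
```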
-
Correct, these are the two layers that @juacker mentioned above.
Why is the Agent relevant here for routing? What @juacker mentioned is:
As I understand it, these nodes and node instances are kept in our Cloud BE, which should be the representation of what is on the Agents. I just wanted to clarify whether we want to fix this last part, but I'm totally OK to close this discussion and park it, since there are other priorities and this isn't an issue reported by any user.
-
Intro
When a `/data` request needs to be sent to a Node which has multiple Node Instances available (on parent nodes), routing needs to be applied to identify the proper Node Instance to satisfy that request. Currently, Cloud looks across all the available Node Instances for a given Node, independent of the Space or Room the instances are in.
Why restrict routing?
Ideal scenario (?)
If you have a Production Space or Room that has the nodes running with the configurations you want, on the Agent versions you want, you ensure that all the `/data` requests are sent to at least one Node with the setup you want.
Current scenario
On that Space but in a different Room, or in another Space, you could have some experimental Nodes receiving data from Production nodes that you are using to run some experiments. With the current approach, it isn't easy to ensure the experimental nodes won't be targeted for `/data` requests on data for some child node, causing some unwanted behaviour.
Note: this change is only intended for the Cloud routing logic; it shouldn't affect current Agent streaming in any way.
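The proposal above boils down to adding a Space filter to candidate selection. A hypothetical before/after sketch (function and field names are invented for illustration, not Cloud's actual code):

```python
# Hypothetical contrast of current vs proposed routing behaviour.
def route_current(instances, node_id):
    # Today: every instance of the node is a candidate,
    # regardless of the Space/Room its parent sits in.
    return [i for i in instances if i["node_id"] == node_id]

def route_space_restricted(instances, node_id, space_id):
    # Proposed: only instances whose parent is in the requesting Space
    # are candidates, so experimental parents elsewhere are never picked.
    return [i for i in instances
            if i["node_id"] == node_id and i["parent_space_id"] == space_id]
```

Note this only narrows which instances Cloud may query; the Agents still stream to whatever parents they are configured for, consistent with the note above.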