-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Watch and Read cilium network policies from static directory path #32599
base: main
Are you sure you want to change the base?
Conversation
cfac0a5
to
dc4bac8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR! I think the motivation of the feature makes sense overall, but I have some feedback on the implementation
pkg/policy/k8s/watcher.go
Outdated
} | ||
|
||
cnp := &cilium_v2.CiliumNetworkPolicy{} | ||
err = json.Unmarshal(jsonData, cnp) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also think we should also think about what we want to do with the CNP fields which don't make sense when loaded from file, e.g. K8s name and namespace. In particular if there are two files with the same policy name on the object metadata, then I think the current system is non-deterministic in what policy gets applied.
Instead of parsing in the full cilium_v2.CiliumNetworkPolicy
schema, we could consider only reading in api.Rule
and overwrite the labels of each rule to contain the filename or something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there are 2 policies with same name, then it would merge both. If policy is same, then it would ignore adding same rule and if different it merges rules from both policy. Wouldn't namespace be significant if user wants to apply policy for specific namespace.? If we skip name and namespace, then it supports only clusterwide policies.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there are 2 policies with same name, then it would merge both.
Yes, I think this could be a good solution. This is not yet implemented in the current version, right?
I am still unsure though if we even want these policies to have names. The names don't serve any purpose and I think the file name is a much better identifier (since it is actually unique). Maybe it would be better if we just ignored the name completely.
Wouldn't namespace be significant if user wants to apply policy for specific namespace.?
Ah, I missed this bit. So the way this is implemented is that if something is read from a CNP that we simply attach the namespace to the endpoint selector when coverting them to low-level api.Rules
:
cilium/pkg/k8s/apis/cilium.io/utils/utils.go
Lines 335 to 345 in b35017b
if namespace != "" { | |
userNamespace, present := r.EndpointSelector.GetMatch(podPrefixLbl) | |
if present && !namespacesAreValid(namespace, userNamespace) { | |
log.WithFields(logrus.Fields{ | |
logfields.K8sNamespace: namespace, | |
logfields.CiliumNetworkPolicyName: name, | |
logfields.K8sNamespace + ".illegal": userNamespace, | |
}).Warn("CiliumNetworkPolicy contains illegal namespace match in EndpointSelector." + | |
" EndpointSelector always applies in namespace of the policy resource, removing illegal namespace match'.") | |
} | |
retRule.EndpointSelector.AddMatch(podPrefixLbl, namespace) |
If we skip name and namespace, then it supports only clusterwide policies.
Sort of, but not necessarily - a clusterwide policy can always be translated into a namespaced policy by modifying the endpoint selector. So reading api.Rules
does not take away any capabilities.
But I do see an argument to be made that the CNP/CCNPs format might be easier for most Cilium users to understand - since this is how most users interact with policies these days. But it does have the downside that there are the "K8s" name field in the CNP metadata causes some confusion, since it is completely unused.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep correct. instead of name, can update with filename instead but still I like to namespace. wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I only saw this after I re-reviewed. Yes, I think keeping the namespace and forcing the name to be the filename seems like an elegant solution
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tamilmani1989 Nice work! Thank you for this PR. Might be out of context here but I have a few concerns as well regarding this feature.
pkg/policy/k8s/watcher.go
Outdated
} | ||
} | ||
if event.Op&fsnotify.Remove == fsnotify.Remove { | ||
p.log.WithField("file", event.Name).Debug("CNP file removed from directory") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should move this to a method i.e deleteFromPolicyEngine
or the likes so we can test the functionality outside the watcher.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, this should be moved to its own function. Why has this been marked as resolved?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry, not my intention to resolve. Agreed, will take as separate function as like add
Thanks for reviewing this @gandro and @derailed. Initially I thought about having separate watcher for loading from file but it seems to be duplicating few api and structs again ( |
Thank you for the feedback! Overall I think one design decision to take is if we want
|
fcd8a43
to
02bc933
Compare
@gandro @derailed I updated based on your suggestion. Separated directory watcher and starting it from a different cell. Also removed |
0601118
to
f2066e0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally we want to not add more logic to the daemon initializization.I understand the concern regarding the circular dependency on Daemon. I took a quick look this and it seems we can move the "policy manger" part into its own type.
I made a draft PR to do just that #32847.
I suggest we merge this PR in its current state, once it merges I can rebase and update this watcher to use the separate policy manager so a lifecycle can be used here.
@tamilmani1989 please adjust the I'll also note that the release freeze for v1.16 is coming up quick, so you'll need to coordinate with sig-policy folks about timing and whether you think it's viable to land in the next week or so. Additionally, the checkboxes in the issue description are for you to fill out to ensure the PR follows the guidelines for merging. Please check through them and check them off to indicate the status of the PR. Thanks. |
/test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
API changes look good to go, but I have a couple of questions regarding the implementation. Thanks!
pkg/policy/k8s/watcher.go
Outdated
} | ||
} | ||
if event.Op&fsnotify.Remove == fsnotify.Remove { | ||
p.log.WithField("file", event.Name).Debug("CNP file removed from directory") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, this should be moved to its own function. Why has this been marked as resolved?
// Listen for file add, update and delete | ||
for { | ||
select { | ||
case event := <-watcher.Events: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason why the Rename and Write events are not handled? Not handling these events could cause policy leaks and/or the state in the directory to not match the state in Cilium's policy engine. For example:
- If a user modifies an existing policy on disk, that change will not be pushed to the policy engine.
- If a user moves an existing policy on disk to a different directory, that policy will not be removed from the policy engine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes I have to handle rename and write events. First wanted to make sure approach was right and planned to add this later. i addressed it now
pkg/policy/directory/watcher.go
Outdated
for { | ||
select { | ||
case event := <-watcher.Events: | ||
if !(event.Op&fsnotify.Create == fsnotify.Create || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be safer and easier to read if event.Has
was used instead of event.Op&
, as this is the recommended usage in the documentation.
pkg/policy/directory/watcher.go
Outdated
} | ||
reportCNPChangeMetrics(err) | ||
case err := <-watcher.Errors: | ||
p.log.Error("Error:", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we expand on the error handling here to provide more information on the context of this message? Specifically, it would be useful it we could we add a prefix like unknown error from fsnotify while watching policy directory
and even handle specific errors such as ErrEventOverflow
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you please clarify how should we handle ErrEventOverflow? I saw increasing buffer size is one option to auto recover but suggested method works only for windows.
This only has effect on Windows systems, and is a no-op for other backends.
I would rather keep simple and return error instead of increasing buffer size which may have other complications
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah ok, I see what you mean. Can we just change the Error:
string to something like Unexpected error thrown by fsnotify watcher when watching policy directory
then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd generally expect the form p.log.WithError(err).Error(...)
to ensure proper structured logging, too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep agreed. I addressed that
pkg/policy/directory/watcher.go
Outdated
labels.NewLabel("name", name, labels.LabelSourceDirectory), | ||
labels.NewLabel("path", filePath, labels.LabelSourceDirectory), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Kubernetes has restrictions on the characters that can be put into labels, as well as the label's format (see here). The filePath
argument needs to undergo input validation to ensure it fits within these guidelines. For example, the forward-slashes in the path should probably be replaced with a different character, such as an underscore, since the forward-slash character has a specific meaning. Additionally, Kubernetes labels are restricted to ASCII characters, but file paths can contain UTF-8 characters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are not K8s labels - they are Cilium labels. Slashes should be fine - as to my knowledge, they are not exposed to K8s anywhere, so I don't think we need to be overly restrictive with the label format. But I'm not 100% certain where labels on policy are are used - the only place I know is Hubble policy correlation.
Still, it probably would be worth sanitizing any non-UTF8 characters
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah interesting that's good to know, TIL. Do we have the format of Cilium labels documented somewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of absolute path, i'm planning to add only name of file.
- Anyway name of file will be unique under directory and directory watcher listens only on one directory
- can avoid longer filepaths
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome, thank you! I like this version much better.
I have focused mainly on questions around policy ingestion and policy lifetime. I have not yet focused on the details of the file watcher itself, but I see Ryan has looked at that.
I think there are still some questions around the lifecycle of policies and in particular the meaning of the name
of a policy
daemon/cmd/daemon.go
Outdated
@@ -760,6 +760,12 @@ func newDaemon(ctx context.Context, cleaner *daemonCleanup, params *daemonParams | |||
} | |||
} | |||
|
|||
if option.Config.StaticCiliumNetworkPolicyPath != "" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of having a global option, we could move this flag into the directory watcher cell. newDirectoryPolicyResourcesWatcher
would then return nil if no path is set, thereby simplifying the code a bit and only requiring the nil check here
pkg/policy/directory/watcher.go
Outdated
labels.NewLabel("name", name, labels.LabelSourceDirectory), | ||
labels.NewLabel("path", filePath, labels.LabelSourceDirectory), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are not K8s labels - they are Cilium labels. Slashes should be fine - as to my knowledge, they are not exposed to K8s anywhere, so I don't think we need to be overly restrictive with the label format. But I'm not 100% certain where labels on policy are are used - the only place I know is Hubble policy correlation.
Still, it probably would be worth sanitizing any non-UTF8 characters
pkg/policy/k8s/watcher.go
Outdated
} | ||
|
||
cnp := &cilium_v2.CiliumNetworkPolicy{} | ||
err = json.Unmarshal(jsonData, cnp) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there are 2 policies with same name, then it would merge both.
Yes, I think this could be a good solution. This is not yet implemented in the current version, right?
I am still unsure though if we even want these policies to have names. The names don't serve any purpose and I think the file name is a much better identifier (since it is actually unique). Maybe it would be better if we just ignored the name completely.
Wouldn't namespace be significant if user wants to apply policy for specific namespace.?
Ah, I missed this bit. So the way this is implemented is that if something is read from a CNP that we simply attach the namespace to the endpoint selector when coverting them to low-level api.Rules
:
cilium/pkg/k8s/apis/cilium.io/utils/utils.go
Lines 335 to 345 in b35017b
if namespace != "" { | |
userNamespace, present := r.EndpointSelector.GetMatch(podPrefixLbl) | |
if present && !namespacesAreValid(namespace, userNamespace) { | |
log.WithFields(logrus.Fields{ | |
logfields.K8sNamespace: namespace, | |
logfields.CiliumNetworkPolicyName: name, | |
logfields.K8sNamespace + ".illegal": userNamespace, | |
}).Warn("CiliumNetworkPolicy contains illegal namespace match in EndpointSelector." + | |
" EndpointSelector always applies in namespace of the policy resource, removing illegal namespace match'.") | |
} | |
retRule.EndpointSelector.AddMatch(podPrefixLbl, namespace) |
If we skip name and namespace, then it supports only clusterwide policies.
Sort of, but not necessarily - a clusterwide policy can always be translated into a namespaced policy by modifying the endpoint selector. So reading api.Rules
does not take away any capabilities.
But I do see an argument to be made that the CNP/CCNPs format might be easier for most Cilium users to understand - since this is how most users interact with policies these days. But it does have the downside that there are the "K8s" name field in the CNP metadata causes some confusion, since it is completely unused.
pkg/policy/directory/watcher.go
Outdated
resourceID := ipcacheTypes.NewResourceID( | ||
ipcacheTypes.ResourceKindCNP, | ||
cnp.ObjectMeta.Namespace, | ||
cnp.ObjectMeta.Name, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is incorrect. We must not use the K8s name and namespace for tracking the lifecycle of IPCache entries owned by this policy. This ID here is used to e.g. if the policy contains a ToCIDR
rule, requiring us to insert a CIDR prefix into IPCache. If we use the K8s name here, and you have a real K8s policy and a non-K8s policy read from disk with the same name and namespace, both resources would claim ownership over the IPCache CIDR entry, causing conflicting updates for those IPCache entries.
Instead, I would suggest introducing a new ipcacheTypes.ResourceKind
here (e.g. ResourceKind("file")
), leave the namespace empty and use the filename as the name. Then the IPCache entry is always owned by the file (and thus updated when the file is updated and removed when the file is removed, which is exactly what we want).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree on using fileName as the name but still thinking to keep namespace for the reasons i specified in other comment (to restrict policy to apply to specific namespace)
f91b5a5
to
f8670e7
Compare
Thanks for your wonderful suggestions @gandro @learnitall. Fairly new to this code and apologies if I'm missing basic stuff. Appreciate your help in reviewing this and guiding with right set of changes. I addressed most of your comments and replied for which I didn't address. I'm happy to address any comments further that makes this code better. |
@joestringer yep I sync'd with sig-policy folks in last sig-policy meeting to get this for 1.16 release |
f8670e7
to
c9cfcb3
Compare
Cilium reads CNP yaml if `static-cnp-path` is specified in cilium config. It converts to rules and add those rules to policy engine. This allows admin to configure policy to not allow traffic to certain secure infrastructure endpoints from pods running in cloud. Signed-off-by: tamanoha <tamanoha@microsoft.com>
c9cfcb3
to
5b5b378
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is starting to look really really good. I still have some feedback, but most of it is rather minor
@@ -44,6 +44,7 @@ var ( | |||
ResourceKindCCNP = ResourceKind("ccnp") | |||
ResourceKindDaemon = ResourceKind("daemon") | |||
ResourceKindEndpoint = ResourceKind("ep") | |||
ResourceKindFile = ResourceKind("File") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit
ResourceKindFile = ResourceKind("File") | |
ResourceKindFile = ResourceKind("file") |
|
||
// WatchDirectoryPolicyResources starts watching Cilium Network policy files created under a directory. | ||
func (p *PolicyResourcesWatcher) WatchDirectoryPolicyResources(ctx context.Context, policyManager PolicyManager, | ||
readStatus DirectoryWatcherReadStatus) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit/suggestion:
Since DirectoryWatcherReadStatus
is constructed in the hive, we could make it a field in PolicyWatcherParams
instead of passing it in here.
This would also allow you to call close(readStatus)
in newDirectoryPolicyResourcesWatcher
if the watcher functionality is disabled
return err | ||
} | ||
|
||
//update labels |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
micro nit:
//update labels | |
// update labels |
|
||
resourceID := ipcacheTypes.NewResourceID( | ||
ipcacheTypes.ResourceKindFile, | ||
cnp.ObjectMeta.Namespace, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as below, if we end up changing the resource ID for addToPolicyEngine
resourceID := ipcacheTypes.NewResourceID( | ||
ipcacheTypes.ResourceKindFile, | ||
cnp.ObjectMeta.Namespace, | ||
cnp.ObjectMeta.Name, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion:
I tripped up on this, thinking you were using the CNP name instead of the filename here (which would be wrong). I think it would be easier to read if you assigned the filename into a separate variable to make sure this expression here looks the same as deleteFromPolicyEngine
.
In addition, I think instead of using the namespace I'd use the base directory. That way, if we end up having other code in another package using ResourceKindFile
, there is no accidental conflict because we're only using the base name here.
resourceID := ipcacheTypes.NewResourceID( | |
ipcacheTypes.ResourceKindFile, | |
cnp.ObjectMeta.Namespace, | |
cnp.ObjectMeta.Name, | |
) | |
resourceID := ipcacheTypes.NewResourceID( | |
ipcacheTypes.ResourceKindFile, | |
option.Config.StaticCiliumNetworkPolicyPath, | |
filepath.Base(cnpFilePath), | |
) |
|
||
type policyWatcher struct { | ||
log logrus.FieldLogger | ||
config *option.DaemonConfig |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: Using DaemonConfig
for flags is no longer recommended, unless the Daemon needs to know the flag.
In this case here, I don't think it does (since we simply check if the policy watcher is nil or not). Therefore I suggest we instead add the option.Config.StaticCiliumNetworkPolicyPath
as a cell flag like this:
Line 63 in 2311f3d
func (cfg Config) Flags(flags *pflag.FlagSet) { |
dir := p.config.StaticCiliumNetworkPolicyPath | ||
err = watcher.Add(dir) | ||
if err != nil { | ||
p.log.WithFields(logrus.Fields{"Dir": dir, "err": err}).Error("Directory Watcher Add failed") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
We're using "Dir"
here but "dir"
below, that's an inconsistency . I'd recommend using logfields.Path
as a key instead. Also please use WithError(err)
:
p.log.WithFields(logrus.Fields{"Dir": dir, "err": err}).Error("Directory Watcher Add failed") | |
p.log.WithError(err).WithField(logfields.Path, dir).Error("Failed to watch policy directory. Policies will not be loaded from disk") |
} else { | ||
|
||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Empty else
for { | ||
select { | ||
case event := <-watcher.Events: | ||
if !utf8.ValidString(event.Name) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that we now use the filename as the CNP name, I wonder if we want to enforce that the filename is a valid CNP name. We can always relax later, but that way we can be sure there is no breaking change if somehow cnp.Parse()
becomes more restrictive in the future.
To validate if a filename is a valid CNP name, you can use validation.IsDNS1123Subdomain(event.Name)
@@ -48,6 +48,8 @@ const ( | |||
// by the previous agent instance. Can be overwritten by all other | |||
// sources (except for unspec). | |||
Restored Source = "restored" | |||
|
|||
Directory Source = "Directory" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add some Go doc string here. Also:
Directory Source = "Directory" | |
Directory Source = "directory" |
watcher, err := fsnotify.NewWatcher() | ||
if err != nil { | ||
p.log.WithField("Err", err).Error("Initializing NewWatcher failed") | ||
return | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the future, it would be great if we could automate retrying the watcher if it fails here, with a specific backoff. That way if some kind of one-off error occurs, the watcher can recover without requiring a Cilium agent restart.
Cilium reads CNP files from directory if path is configured via
static-cnp-path
field in cilium config. It watches the directory for any changes and read those files and convert to CNP object and add it to policy engine. This allows admin to configure policy to not allow traffic to certain endpoints without showing up as policy resource in kubernetes. This is implemented based on this discussion: #30060 (comment)Please ensure your pull request adheres to the following guidelines:
description and a
Fixes: #XXX
line if the commit addresses a particularGitHub issue.
Fixes: <commit-id>
tag, thenplease add the commit author[s] as reviewer[s] to this issue.
Fixes: #issue-number