Kubernetes: Consider exposing node labels via downward api

97

What keywords did you search in Kubernetes issues before filing this one? (If you have found any duplicates, you should instead reply there.): downward api node labels


Is this a BUG REPORT or FEATURE REQUEST? (choose one): Feature request

I think the downward API should support exposing the labels of the node on which the pod is running. While I understand that a Pod should not be coupled too tightly to the node it is scheduled on, I think there are certain scenarios that require the pod to have more information about its host.

  • failure zones, like rack names on premises or availability zones in the cloud
  • instance types, for adapting thread settings per instance type
  • other labels, like whether local SSD storage is available

This helps cloud-native applications understand where they are located. While this could be achieved with API calls to the Node object, I think it should be supported by the downward API: a Node object exposes far more information than should be available to any pod. In OpenShift's default configuration, access to Node objects is prevented by default.
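For illustration: the downward API can already expose a pod's own labels as env vars, so one hypothetical shape for a node-label extension is sketched below. The metadata.labels['app'] reference is the real, existing pod-label form; the node.metadata.labels[...] field path is invented for this sketch and does not exist:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: zone-aware-app
  labels:
    app: zone-aware-app
spec:
  containers:
  - name: app
    image: example/app        # placeholder image
    env:
    # Works today: the pod's own labels via the downward API.
    - name: MY_APP_LABEL
      valueFrom:
        fieldRef:
          fieldPath: metadata.labels['app']
    # Hypothetical: the labels of the node the pod was scheduled to.
    - name: MY_NODE_ZONE
      valueFrom:
        fieldRef:
          fieldPath: node.metadata.labels['failure-domain.beta.kubernetes.io/zone']
```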

@mattbates @smarterclayton @bprashanth

simonswine picture simonswine  ·  27 Jan 2017

Most helpful comment

27

It is a bit sad seeing this feature, with so many practical use cases listed above, left unanswered for so long.
As a fully owned, single-tenant Kubernetes user, I take care of both nodes AND pods, and I've been needing this feature for a long time.
It is utterly pointless to set topology information on nodes if that information cannot be propagated to pods to implement intelligent, topology-aware deployments. Besides, pod affinity/anti-affinity rules already rely on node labels, and that is not considered a security concern.

I have never considered hiding nodes' labels a security feature.

I am wondering if we can tackle the security concerns by adding a requirement that the user creating (or the service account associated with) a pod that requests node information, via one of the methods proposed by @bjhaid in the linked Google doc, needs to have the get node permission?

Feder1co5oave picture Feder1co5oave  ·  14 Jun 2020

All comments

0

There could be security implications for some types of clusters (exposing information the admin doesn't want the pod to have, like the exact core architecture). That said, for most deployments it doesn't seem completely unreasonable. A per-node setting might decide whether to expose it or to return an empty map.

smarterclayton picture smarterclayton  ·  28 Jan 2017
0

This was previously explored in #25957

tpetracca picture tpetracca  ·  27 Apr 2017
0

/sig node

kargakis picture kargakis  ·  21 Jun 2017
0

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

fejta-bot picture fejta-bot  ·  29 Dec 2017
0

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

fejta-bot picture fejta-bot  ·  28 Jan 2018
0

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

fejta-bot picture fejta-bot  ·  27 Feb 2018
0

/remove-lifecycle rotten
Can somebody please re-open this issue? I think this is a valid concern. I don't think that node labels and annotations are sensitive but if people disagree, maybe whitelisting some labels would be an option.

My use case is setting labels for node group detection in the autoscaler. If I could use the downward API, I wouldn't have to hardcode (or template) the manifests.

discordianfish picture discordianfish  ·  6 Apr 2018
9

/reopen

kargakis picture kargakis  ·  6 Apr 2018
-4

@kargakis: you can't re-open an issue/PR unless you authored it or you are assigned to it.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot picture k8s-ci-robot  ·  6 Apr 2018
0

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

fejta-bot picture fejta-bot  ·  5 Jul 2018
0

/remove-lifecycle stale
/lifecycle freeze

discordianfish picture discordianfish  ·  9 Jul 2018
25

Is this still being considered? I'm currently trying to find a sensible way to retrieve the zone a pod is scheduled in for rack awareness, as per the original use case.

dwilliams782 picture dwilliams782  ·  20 Sep 2018
2

We're looking at doing this for a per-zone load balancer, and per-region load balancer on a single deployment. Our use case is that customers must be able to bind at layer 4 to multiple failure domains. This could be achieved with 3x deployments (one per zone) but it would be much nicer to only have to worry about a single deployment.

laurieodgers picture laurieodgers  ·  9 Oct 2018
8

I think this should be brought up in a sig-node meeting; otherwise I doubt this will get attention and it will ultimately be closed off by @fejta-bot

discordianfish picture discordianfish  ·  10 Oct 2018
0

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

fejta-bot picture fejta-bot  ·  8 Jan 2019
0

/remove-lifecycle stale

runningman84 picture runningman84  ·  8 Jan 2019
16

I am in favor of something like this. The real problem here is that most people seem to want this for topology information and that information is somewhat arbitrary. Hostname/Zone/Region are codified, so they get the most attention, but there is some demand for other topology. That leads me to assert that the common API is labels and not something special for topology info.

thockin picture thockin  ·  26 Mar 2019
6

Would folks be comfortable with the kubelet having a configurable whitelist of node labels that can be consumed by pods via the downward API? I am constantly hitting snags that this feature would have unblocked, or that forced weird workarounds.
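A sketch of what such a kubelet-side whitelist could look like; the allowedNodeLabels field is invented here for illustration and is not a real KubeletConfiguration option:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Hypothetical field: only these node labels would be resolvable by pods
# via the downward API; everything else would come back empty.
allowedNodeLabels:
- topology.kubernetes.io/zone
- topology.kubernetes.io/region
```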

bjhaid picture bjhaid  ·  11 Jun 2019
0

Can we talk about concrete needs? What snags are you hitting?

I have heard concrete need for topology info, which is stored as labels and is indistinguishable from other labels. That (rightly) sets off alarms for security-conscious people. Do we need to first-class topology? I'm less confident in that.

The last thing we really want is another configuration knob that admins have to know about to be secure...

thockin picture thockin  ·  12 Jun 2019
6

Our use case is being able to inform elasticsearch pods about topology so that we can make sure shards get placed across failure domains (eg racks). Currently, we have to contort ourselves through some awkward initContainers to get those details into elasticsearch configs.

erik-stephens picture erik-stephens  ·  12 Jun 2019
0

ACK. Another +1 for topology. I am not sure of the right answer. Moving topology to a distinct struct would make it easy to copy that info into pods, but there's a lot of momentum around labels already, so we'd have to do both for the indefinite future.

As far as I know, nobody is signed up to try to push that yet.

thockin picture thockin  ·  12 Jun 2019
1

Can we talk about concrete needs? What snags are you hitting?

Sorry for not providing a helpful case :) My snags have all been topology-related: the app needs to know what region/AZ or datacenter it is deployed to.

As far as I know, nobody is signed up to try to push that yet.

Where do I get to sign up for this :) I don't mind pushing it if I get guidance

bjhaid picture bjhaid  ·  12 Jun 2019
1

My use case is for the EC2 instance ID (which is automatically added as a label by the cloud provider) to be added as an environment variable for a container. Since this feature was lacking, I had to use a workaround (granting my pod access to the AWS metadata FUSE filesystem we use at my company, https://github.com/jszwedko/ec2-metadatafs) that I wasn't really keen on. Having node labels available in the downward API just makes sense honestly, I don't know why we need to introduce brand new features for this :D

2rs2ts picture 2rs2ts  ·  20 Jun 2019
6

Where do I get to sign up for this :) I don't mind pushing it if I get guidance

You sign up by saying "I'd like to tackle this". :)

As for guidance, the real problem is that I don't know what the right design is. The way I would usually start such an effort is with a brainstorm doc - try to pin down the kernel of the requirements expressed here and then dream up 10 different ways we could solve it, no judgement on good/bad ideas. Then write the most aggressive pros/cons you can for each. Then we discuss.

e.g.

1) we could add node labels as a downward API "fake" field path
1b) we could make new syntax for field paths that follow object ref pointers
2) we could literally copy node labels to pods
3) we could add a new node topology block and copy THAT into pods
4) we could add a link-local metadata API for use at run-time
5) we could add a new type of volume/envFrom source that is node-metadata
...
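For instance, option 5 might be sketched like this; the nodeMetadata volume source is invented for the brainstorm, nothing here exists today:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: topology-aware
spec:
  containers:
  - name: app
    image: example/app   # placeholder image
    volumeMounts:
    - name: node-labels
      mountPath: /etc/node-labels
      readOnly: true
  volumes:
  # Hypothetical volume source: one file per exposed node label,
  # analogous to the existing downwardAPI volume for pod fields.
  - name: node-labels
    nodeMetadata:
      labels:
      - topology.kubernetes.io/zone
      - topology.kubernetes.io/region
```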

thockin picture thockin  ·  20 Jun 2019
8

You sign up by saying "I'd like to tackle this". :)

I'd like to tackle this :)

Thanks for all of this context, I'll start writing a doc and share it as soon as I have something

bjhaid picture bjhaid  ·  20 Jun 2019
0

Like @erik-stephens failure domain awareness would be a prime motivator for this: in our case Ceph.

intlabs picture intlabs  ·  11 Jul 2019
11

For folks interested, @thockin and I had a brainstorm in this doc https://docs.google.com/document/d/1fH64mEfZH597luup-ZBfBNkiTVjfoFYGEa-G1G_TM6A/edit#, and I wrote something that covers the last idea (which is Tim's preference): https://github.com/bjhaid/node-labels-to-files

bjhaid picture bjhaid  ·  21 Jul 2019
0

Thanks @bjhaid, this helps, but for my use case (elasticsearch) I will still need another initContainer to get those values into a form that elasticsearch will read (eg write an updated config, inject environment variables). Maybe I can use something like envdir. Some soft goals I'm aiming for:

  1. Minimal changes to the vendor's image
  2. Use the default entrypoint & command
  3. Minimal initContainers

Chiming in here in case others on this thread have ideas. I was hoping there was a stock/standard Linux way to do what envdir does, but I couldn't find one.
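On the envdir question: a minimal shell sketch of an envdir-style loop, assuming an initContainer (e.g. @bjhaid's node-labels-to-files) has already written one file per node label into a directory:

```shell
# envdir-style loop: export each file in a directory as an environment
# variable named after the file, then hand off to the real command.
dir=$(mktemp -d)                          # stands in for the shared mount
printf 'us-east-1a' > "$dir/NODE_ZONE"    # stand-in for a label written earlier

for f in "$dir"/*; do
  export "$(basename "$f")=$(cat "$f")"
done

echo "zone: $NODE_ZONE"   # prints: zone: us-east-1a
# exec /usr/local/bin/docker-entrypoint.sh "$@"   # hand off here in a real image
```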

erik-stephens picture erik-stephens  ·  13 Aug 2019
4

we need this for 2 use cases:

  • node zone info in pod labels, so a service selector can deliver traffic inside the same zone instead of making cross-zone requests
  • an environment variable with the node zone info, so cluster-based apps can have the correct info. It can also be used to request the correct in-zone DNS and to add the correct labels to monitoring (like /metrics) and logs

While we can use an initContainer to set this, it requires giving pods patch permission on the affected pods, opening a security hole that would allow an attacker to change the Kubernetes setup.

My proposal is to have a node "flag"/label/annotation that would enable label propagation and define which labels are allowed to be propagated (say: node-labels-propagation: false (default), true (all labels), '{"failure-domain.beta.kubernetes.io/zone","failure-domain.beta.kubernetes.io/region"}')

  • This allows the admin to control what node info leaks to pods.

Then in the pods, it would be required to request the info they want, with two annotations, one for labels and another for environment (say: pod-labels-from-nodes-labels: false, true, '{"...","..."}' for label copy and pod-environment-from-nodes-labels: false, true, '{..}' for environment). If needed, another option would be a volume with the info (pod-volume-from-node-labels: ...)

  • This allows the pod admin to request only what is needed and ignore any other label they don't need
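A sketch of how this might look in manifests; every annotation key below is the commenter's proposed name, not an existing Kubernetes API:

```yaml
# Node side: admin opts in and limits which labels may propagate.
apiVersion: v1
kind: Node
metadata:
  name: worker-1
  annotations:
    node-labels-propagation: '{"topology.kubernetes.io/zone","topology.kubernetes.io/region"}'
---
# Pod side: the pod requests only the labels it needs.
apiVersion: v1
kind: Pod
metadata:
  name: app
  annotations:
    pod-labels-from-nodes-labels: '{"topology.kubernetes.io/zone"}'
    pod-environment-from-nodes-labels: '{"topology.kubernetes.io/region"}'
spec:
  containers:
  - name: app
    image: example/app   # placeholder image
```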

danielmotaleite picture danielmotaleite  ·  16 Sep 2019
0

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

fejta-bot picture fejta-bot  ·  15 Dec 2019
5

I'm not sure that fejta-bot is useful. Why must every subscriber get this message, just because nobody takes care of the issues and hopes they automagically get closed over time?

/remove-lifecycle stale

Bessonov picture Bessonov  ·  15 Dec 2019
0

@Bessonov tell that to @fejta
I've just stopped filing issues I don't have the resources to fix right away.

discordianfish picture discordianfish  ·  19 Dec 2019
15

One similar use case from istio: to support locality load balancing, we need to watch all nodes of the cluster to acquire their topology labels, such as [topology.kubernetes.io/region, topology.kubernetes.io/zone]. We only need this locality info, but we have to fetch all the nodes, which wastes a lot of memory. It would be better if k8s provided a way to set this topology info directly on the pod.

hzxuzhonghu picture hzxuzhonghu  ·  13 Mar 2020
12

+1 for this as well

We would like access to this information so that a highly available data store service can know which [topology.kubernetes.io/region, topology.kubernetes.io/zone] its processes are running in. Without this information, we could end up in a scenario where data is stored in only a single region or zone; if that region became unavailable, the data store service could be 0% available.

jpmcb picture jpmcb  ·  9 Apr 2020
3

node-labels-propagation: false (default), true (all labels) ...

I think that when node labels are allowed to propagate, 'all labels' should be avoided, since there may be information the admin doesn't want to reveal.
Listing the specific labels to propagate would serve the use cases.

tedyu picture tedyu  ·  19 Apr 2020
16

I'm another +1 for topology info. Topology spread constraints let you easily spread pods across different regions/racks etc, but there's no (easy) way for the pod to know where it is actually scheduled, so the application can't make smart decisions based on topology. Even making only topology.kubernetes.io labels available would be an improvement.
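For context, the scheduling half already exists as a real API; a minimal topologySpreadConstraints fragment looks like this, and the missing piece is the pod reading back where it actually landed:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: spread-example
  labels:
    app: web
spec:
  topologySpreadConstraints:
  # Keep replicas evenly spread across zones (skew of at most 1).
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: web
  containers:
  - name: app
    image: example/app   # placeholder image
```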

anyasabo picture anyasabo  ·  14 Aug 2020
0

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

fejta-bot picture fejta-bot  ·  14 Nov 2020
0

/remove-lifecycle stale

Bessonov picture Bessonov  ·  14 Nov 2020
0

/remove-lifecycle stale

george-angel picture george-angel  ·  14 Nov 2020
5

Another solid use case is a cluster-per-environment setup where a node label can be added (e.g. in GKE or EKS you could add a label cluster-environment: staging to all nodes in a cluster) and then provided to workloads via the downward API so they behave appropriately according to business logic.
This obviously doesn't fit a topology structure, so I might propose that the topology framing is slightly too narrow in scope. Something to consider would be a format for "exposable" labels, perhaps along the lines of anything matching user/.* being exposable.
Already regretting this comment just because of the notifications I'll get from fejta forever as it languishes.

iMartyn picture iMartyn  ·  18 Nov 2020
0

I want to have the node zone name in an environment variable, to keep some latency-critical services strictly within the same zone (for example: mongos).

mongos is a separate service from mongodb, through which all requests in a sharded configuration are routed (app <-> mongos <-> mongodb). Since these are database queries, each hop has a significant impact on overall performance.

On bare hardware we deploy and use mongos on the same host (localhost) to reduce traffic and latency.

In a Kubernetes environment we have tried to:

  • Reproduce the localhost topology with a DaemonSet deployment. This does not work well due to very different resource consumption from node to node and service to service. We have had situations where mongod created more than 50% of the load on the node.
  • Deploy mongos in the same pod. This leads to too much RAM and CPU overhead.
  • Deploy a separate mongos Service in every zone. This solution looks very ugly, as the application has to execute a lot of code to get the zone from the node's metadata. Moreover, this code is different in each language. If it were possible to put the zone in an environment variable, it would be much easier (for example: Spring in Java expands environment variables in properties out of the box).

bozaro picture bozaro  ·  26 Nov 2020
5

Hi,

we need to propagate the node labels
failure-domain.beta.kubernetes.io/region
failure-domain.beta.kubernetes.io/zone
topology.kubernetes.io/region
topology.kubernetes.io/zone
to our pods because we need the information to distribute elasticsearch shards correctly and more safely.

So it would be great if this possibility could be implemented.

steffeneichler picture steffeneichler  ·  1 Dec 2020
0

Given EndpointSlice is removing topology, this becomes even more important, as the only source of region/zone is now the node.

howardjohn picture howardjohn  ·  2 Dec 2020
0

Given EndpointSlice is removing topology

Any links? @howardjohn

hzxuzhonghu picture hzxuzhonghu  ·  3 Dec 2020
1

https://opensource.googleblog.com/2020/11/kubernetes-efficient-multi-zone.html, https://github.com/kubernetes/enhancements/tree/master/keps/sig-network/2030-endpointslice-subsetting, and https://github.com/kubernetes/enhancements/blob/master/keps/sig-network/2086-service-internal-traffic-policy/README.md

After thinking about it a bit more, I don't think it is directly related to this issue (needing to know the topology of all endpoints vs. your own node's labels), but this would still help in many other cases

howardjohn picture howardjohn  ·  3 Dec 2020
0

@steffeneichler

This is what I had to do to implement rack-awareness for elasticsearch in kubernetes:

Add rack awareness in elasticsearch.yml:

cluster.routing.allocation.awareness.attributes: rack

Label my nodes with their rack-information:

kubectl label node my-node topology.kubernetes.io/rack=L101B09

Extract the node information via kubectl, and export it to elasticsearch via an env variable: (I stripped pod affinity, ports, probes, volumes... for brevity)

apiVersion: apps/v1
kind: StatefulSet
spec:
  serviceName: elasticsearch
  template:
    spec:
      serviceAccountName: elasticsearch

      initContainers:
      - name: get-rack
        image: kubectl
        command:
        - bash
        - -c
        - |
          set -ex
          RACK=$(kubectl get node $NODE_NAME -o jsonpath='{.metadata.labels.topology\.kubernetes\.io/rack}')
          echo "$RACK" > /tmp/es-rack
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        volumeMounts:
        - name: tmp
          mountPath: /tmp/

      containers:
      - name: elasticsearch
        image: elasticsearch
        command:
        - bash
        - -c
        - |
          set -ex
          env "node.attr.rack=$(</tmp/es-rack)" /usr/local/bin/docker-entrypoint.sh eswrapper
        env:
        - name: node.name
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: node.attr.node
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        volumeMounts:
        - name: tmp
          mountPath: /tmp/
      volumes:
      - name: tmp
        emptyDir: {}

Aaaaand of course you need RBAC to allow the serviceaccount to extract node information:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: elasticsearch
  namespace: elastic
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: elasticsearch-get-nodes
rules:
- apiGroups: [""]
  resources: [nodes]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: elasticsearch-get-nodes
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: elasticsearch-get-nodes
subjects:
- kind: ServiceAccount
  name: elasticsearch
  namespace: elastic

Yes, it is this ugly. I actually could have done it all in the main container by creating an image with both elasticsearch and kubectl.

Feder1co5oave picture Feder1co5oave  ·  3 Dec 2020
0

Hi @Feder1co5oave,

I found this way too, but it is a very ugly way. The best way would be to propagate the node topology information to the pods, like the node name.

steffeneichler picture steffeneichler  ·  4 Dec 2020