Change the default shm size

Hi NV Team,

Now we are using the NVIDIA DGX Station with the latest OS (Desktop 4.3.0).
I would like to change the default shm-size of Docker, but I get the error below after changing the configuration.

  • Changed /etc/docker/daemon.json as follows:

{
  "default-runtime": "nvidia",
  "default-shm-size": "1G",
  "default-ulimits": {
    "memlock": {
      "hard": -1,
      "name": "memlock",
      "soft": -1
    },
    "stack": {
      "hard": 67108864,
      "name": "stack",
      "soft": 67108864
    }
  },
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}

Job for docker.service failed because the control process exited with error code.
See "systemctl status docker.service" and "journalctl -xe" for details.
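
For reference, the commands mentioned in that message, plus a quick syntax check of the config file, can be run like this (the json.tool line is just one way to validate the JSON):

sudo systemctl status docker.service
sudo journalctl -xe
python3 -m json.tool /etc/docker/daemon.json   # prints a parse error if daemon.json is not valid JSON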

Could you please tell me the correct way to change the default shm-size?

Best regards.
Kaka

Hi NV team,

Could you please advise on the shm-size when using the DGX Station in a Kubernetes environment?
I would like to increase the shm-size, but according to the following issue, there seems to be a known limitation with shm-size when using Kubernetes:
https://github.com/kubernetes/kubernetes/issues/28272

What should I do?

I added the following to my manifest, but do you think it could cause any problem with the DGX Station's behavior?

volumeMounts:
  - mountPath: /dev/shm
    name: dshm
volumes:
  - name: dshm
    emptyDir:
      medium: Memory
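
For reference, here is roughly how that fragment fits into a complete pod spec (the pod name, container name, and image below are placeholders rather than my actual manifest):

apiVersion: v1
kind: Pod
metadata:
  name: shm-test
spec:
  containers:
    - name: cuda-container
      image: nvidia/cuda:10.0-base
      command: ["sleep"]
      args: ["100000"]
      volumeMounts:
        - mountPath: /dev/shm   # mount the memory-backed volume over the default 64M /dev/shm
          name: dshm
  volumes:
    - name: dshm
      emptyDir:
        medium: Memory          # tmpfs-backed emptyDir, shared by the containers in this pod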

Best regards.
Kaka

I am waiting for your advice.

Best regards.
Kaka

This turns out to be relatively easy since you are using DeepOps and Kubernetes (right?). You want to edit config/group_vars/k8s-cluster.yml, specifically the "NVIDIA Docker Configuration" section.

# NVIDIA Docker Configuration
# Setting reference: https://docs.nvidia.com/deeplearning/dgx/user-guide/index.html
docker_daemon_json:
  bip: 192.168.99.1/24
  default-shm-size: 1G

  default-ulimits:
    memlock:
      name: memlock
      hard: -1
      soft: -1
    stack:
      name: stack
(snip)

Adjust the "default-shm-size", and then redeploy the nvidia-docker plugin.
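
If /etc/docker/daemon.json is edited by hand on a node rather than through the playbooks, Docker itself also needs a restart for the new default to take effect, along these lines:

sudo systemctl restart docker
sudo systemctl status docker   # confirm the daemon came back up cleanly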

Hi Scott-san,

Thank you for your comments.
But actually, the shm size still ends up at the default when a container is created via Kubernetes, even though the shm size in the daemon.json file is set to 1G. If you have any time, please check it. It can be reproduced by creating a simple pod like the one below.

apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  containers:
    - name: cuda-container
      image: nvidia/cuda:10.0-base
      command: ["sleep"]
      args: ["100000"]
      resources:
        limits:
          nvidia.com/gpu: 1
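
Once that pod is running, the shm size inside it can be checked with something like:

kubectl exec gpu-test -- df -h /dev/shm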

Best regards.
Kaka

Hello Kaka,

I think I know what your problem might be.

Sometimes, depending on how Docker, Kubernetes, and nvidia-docker are installed, there can be conflicts in the configuration files.

Docker has configuration in /etc/docker/daemon.json and in /etc/systemd/system/docker.service.d/.

I've seen several cases where daemon.json and /etc/systemd/system/docker.service.d/docker-override.conf both define the shm size, which will cause the error you posted above.
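
To see which drop-in files are actually in effect for the Docker unit, standard systemd commands like these help:

systemctl cat docker.service               # shows the unit file plus every drop-in applied to it
ls /etc/systemd/system/docker.service.d/   # lists the drop-in files themselves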

Take a look at the override file, and if it is defining the shm setting, try removing most of the variables from your daemon.json so it looks something like this:

{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
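
After trimming daemon.json (and/or the drop-in files), reload and restart Docker so the change is picked up, for example:

sudo systemctl daemon-reload
sudo systemctl restart docker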

Hi aletelman-san,

Thank you for your response. I understand your comments.
On the other hand, I have hit an issue where the shm size was still 64M even though we had set the default shm size to 1G in the config file under the docker.service.d folder.

It seems to be caused by Kubernetes, but I would like to know a workaround for it.

Best regards.
Kaka

That’s strange. As far as I know Kubernetes relies on the Docker configuration to dictate the shared memory size.

If you set the Docker shm size to 1G and restarted the Docker service, I would have expected it to work.

Were you able to verify/check the shm size running on a Docker container (outside of Kubernetes)?
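
For example, something along these lines (reusing the image from your pod spec) would show it directly:

docker run --rm nvidia/cuda:10.0-base df -h /dev/shm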

Hi

Sure. I confirmed the shm size with a plain docker command. As a result, the shm size was 1G, as we expected.

root@0ecdeb743016:/workspace# df -h
Filesystem      Size  Used Avail Use% Mounted on
overlay         1.8T   70G  1.6T   5% /
tmpfs            64M     0   64M   0% /dev
tmpfs           126G     0  126G   0% /sys/fs/cgroup
shm             1.0G     0  1.0G   0% /dev/shm
/dev/sda2       1.8T   70G  1.6T   5% /data
tmpfs           126G   12K  126G   1% /proc/driver/nvidia
tmpfs            26G  3.5M   26G   1% /run/nvidia-persistenced/socket
udev            126G     0  126G   0% /dev/nvidia0
tmpfs           126G     0  126G   0% /proc/asound
tmpfs           126G     0  126G   0% /proc/acpi
tmpfs           126G     0  126G   0% /proc/scsi
tmpfs           126G     0  126G   0% /sys/firmware

Best regards.
Kaka

Hey Kaka,

You’re correct. I just did a bit of digging and it looks like this is currently a limitation in Kubernetes. There does not appear to be an out-of-the-box solution to increase the shm size.

The only way to change the shm size at this point appears to be by following the "hack" documented here: https://docs.okd.io/latest/dev_guide/shared_memory.html

Hi Atetelman

Thank you for your comments. Yes, I had already seen the web site in your post.
With this approach, the shm directory is mounted into the container, but its size is equal to the host machine's memory. I am concerned whether there could be any issue with mounting all of the host's shm storage into a container and sharing it with every container.

Best regards.
Kaka