We have two grids, both running Sun Grid Engine. One has three K20s per node and the other has four NVIDIA M40 GPUs per node. When I send 100 GPU jobs to the K20 grid, where each slot (20 slots per node) shares a GPU card, all 20 slots run and progress properly. All is well.
However, when I send the same jobs to the M40 grid, only some of the jobs start running on the M40s and progress, while many of the others just pause and don't progress. The only configuration difference I can see between the K20s and M40s is that the "GPU Operation Mode" is set to "Compute" on the K20s (which work properly) and to "N/A" on the M40s. Some of the ECC settings are also "N/A" on the M40s, whereas they are set to '0' on the K20s.
I'm really stuck and need to finish my work. Does anyone have any suggestions? I'm trying to get the permissions to change the GPU card settings.
Sorry, I don't know anything about grid engines. To make it easier for others to assist, you might want to mention the version of Sun Grid Engine being run, as well as relevant information about the grid engine configuration.
At this point, have you excluded the grid engine itself as a potential source of the issues you observe? According to NVIDIA's product brief for the M40 (http://images.nvidia.com/content/tesla/pdf/tesla-m40-product-brief.pdf), it does have ECC (which is enabled by default), so it is strange that the ECC status would show as N/A. Can you show the output of nvidia-smi -q for the M40s?
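For reference, the ECC portion of that report can be pulled out directly. These are standard nvidia-smi query flags, though the exact field names in the output can vary slightly between driver versions; the guard is just so the snippet degrades gracefully on a host without the tool:

```shell
# Capture the per-GPU ECC report (current/pending mode, error counters).
if command -v nvidia-smi >/dev/null 2>&1; then
    ECC_REPORT=$(nvidia-smi -q -d ECC)
else
    ECC_REPORT="nvidia-smi not available on this host"
fi
printf '%s\n' "$ECC_REPORT"
```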
What is the compute mode set to on the GPUs in each case? It should be something like "Default", "Exclusive Process", or "Exclusive Thread".
Do all nodes (both M40 and K20) have the same CUDA version and the same GPU driver version? What are they?
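A quick way to gather those answers on each node type. The --query-gpu fields below are standard nvidia-smi query names; nvcc only reports the toolkit version if the CUDA toolkit is on the PATH:

```shell
# Compute mode and driver version per GPU, in CSV form.
if command -v nvidia-smi >/dev/null 2>&1; then
    GPU_INFO=$(nvidia-smi --query-gpu=index,name,compute_mode,driver_version --format=csv)
else
    GPU_INFO="nvidia-smi not available on this host"
fi
printf '%s\n' "$GPU_INFO"

# CUDA toolkit version, if nvcc is installed.
if command -v nvcc >/dev/null 2>&1; then
    CUDA_INFO=$(nvcc --version)
else
    CUDA_INFO="nvcc not available on this host"
fi
printf '%s\n' "$CUDA_INFO"
```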
Thanks for taking the time to reply. We are actually using Univa Grid Engine (the paid version of Sun Grid Engine), version 8.3 I think. I don't have access to the grid at the moment, but will try to post soon. The compute mode is set to "Default" for both the K20 and M40 cards so that jobs can share the GPUs. For the M40s we have it set to receive roughly 7 jobs per GPU card x 4 cards, so about 28 jobs per node.
The K20s have many of the ECC settings set to '0' and "GPU Operation Mode" set to "Compute". Again, everything works fine on the K20s. The M40s have many of the ECC settings set to N/A, although I think ECC is enabled for all GPUs. I don't know the CUDA version, but I can check and will repost.
I think the first step will be to make the settings, CUDA versions, etc. exactly the same on both grids and then see if that works. However, do I need to reboot the node to make changes to the settings? I am able to change the settings, but when I tried to reset the GPU it said something like it can't and that I had to reboot the system.
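For what it's worth, a hedged sketch of the commands involved (run as root; the GPU index 0 and the values are just examples). ECC changes stay "pending" until the next reboot, and a GPU reset refuses to run while the GPU is in use, which matches the message I saw:

```shell
# All of these require root and a GPU that is not currently in use.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi -i 0 -e 1        # enable ECC on GPU 0 (takes effect after reboot)
    nvidia-smi -i 0 -c DEFAULT  # compute mode DEFAULT lets multiple jobs share the GPU
    nvidia-smi -i 0 -r          # attempt a GPU reset; fails if the GPU is busy
else
    echo "nvidia-smi not available on this host"
fi
STATUS="attempted"
```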