nvidia-smi and exclusive compute mode

sWienke · April 20, 2010, 9:15am

Hi,
we have a Tesla S1070 system and are using 2 of its GPUs in a Linux environment. Now we want to use the GPUs in batch mode. I read that the Nvidia driver tool nvidia-smi is quite useful for scheduling jobs onto both GPUs, because you can set the GPUs to exclusive compute mode, which guarantees that only one job can run on each (free) GPU. I tried it with CUDA C and everything is working fine.
However, I want to use it with PGI Acc C/Fortran (and CUDA Fortran) as well. Is there any support for this feature?
I tried a small test application with PGI Acc C (just starting a couple of these Acc programs, which do not explicitly set a device). I got “call to cuCtxCreate returned error 999: Unknown”. ACC_NOTIFY shows that only one kernel is launched (on device 0).
So, do you know how I could make the program let itself be scheduled by the nvidia driver tool?
Regards, Sandra

MatColgrove · April 23, 2010, 8:10pm

Hi Sandra,

I’m not familiar with nvidia-smi myself, but CUDA Fortran is interoperable with CUDA C. Can you call a CUDA C routine to perform the task?

In general, GPU scheduling needs to be performed by the user.

Mat

sWienke · April 26, 2010, 6:57am

Hello Mat,
with CUDA C and CUDA Fortran everything is working fine (see source code below). That is, if I don’t use cudaSetDevice in my program, the kernel is executed on an available GPU (if there is one), and that is how it should work. (Thus, if the user does not request a certain GPU, the automatic scheduling works well.)
However, with the PGI Accelerator programming model my program always wants to run on device 0. That’s the problem.
But PGI Accelerator is based on CUDA as well, isn’t it? Thus, I assume there must be an explicit call somewhere in the PGI Accelerator implementation that sets the CUDA device to 0. If that is the case, is there a reason for setting the device explicitly? Otherwise I would appreciate it if that call could somehow be removed. I think this might become a general problem for any batch-mode use of GPUs with PGI Accelerator.

This is a small CUDA Fortran program which is executed 3 times (we have 2 GPUs): twice it is scheduled onto GPU 0/1, and once I get an error:

program MAIN
        use cudafor
        implicit none
        integer, device, allocatable:: d_ptr(:)
        integer:: error, dev
        ! No cudaSetDevice call: the first device allocation picks the GPU
        allocate(d_ptr(1))
        error = cudaGetLastError()
        if (error /= cudaSuccess) then
                write(*,*) 'Error: ', cudaGetErrorString(error)
                stop
        end if
        ! Report which device was selected
        error = cudaGetDevice(dev)
        write(*,*) "Running on device ", dev
        ! Keep the GPU occupied so that concurrent runs overlap
        call sleep(10)
        deallocate(d_ptr)
end program MAIN

And this is the shell script I’m using:

pgfortran -Mcuda=3.0 devtest.CUF -o devtest;
for i in 1 2 3; do ./devtest & done;

The output looks like the following (order of output statements differs):

0: ALLOCATE: 4 bytes requested; status = 38
 Running on device             1
 Running on device             0

BTW: If you want to try it out, you have to set your GPUs to exclusive compute mode first. You can do that by: “nvidia-smi -g 0 -c 1” where -g denotes the ID of the GPU (so you probably have to do “nvidia-smi -g 1 -c 1”, too) and -c specifies the compute mode: 1 means exclusive compute mode (0 is the default). Test it by using: “nvidia-smi -s” → All GPUs and their compute mode number are listed.
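
The per-GPU commands above can be wrapped in a small loop. The following is only a minimal sketch: it uses the `-g` and `-c` flags exactly as quoted above (other driver versions may use different flags), and the function name `set_exclusive_mode` and the `DRY_RUN` variable are hypothetical helpers introduced for illustration. Changing the compute mode will typically require sufficient privileges.

```shell
#!/bin/sh
# Sketch: put GPUs 0..N-1 into exclusive compute mode.
# set_exclusive_mode and DRY_RUN are hypothetical names, not part of nvidia-smi.
set_exclusive_mode() {
    num_gpus=$1
    gpu=0
    while [ "$gpu" -lt "$num_gpus" ]; do
        # -g selects the GPU ID, -c 1 requests exclusive compute mode.
        # If DRY_RUN is set and non-empty, the command is only printed.
        ${DRY_RUN:+echo} nvidia-smi -g "$gpu" -c 1
        gpu=$((gpu + 1))
    done
}
```

For the two GPUs used here, `DRY_RUN=1 set_exclusive_mode 2` prints the two commands without running them, and `set_exclusive_mode 2` runs them.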

MatColgrove · April 26, 2010, 7:50pm

Hi Sandra,

In the PGI Accelerator model the default is to use device 0. However, you can choose which device to use by calling “acc_set_device_num” from your program or by setting the environment variable ACC_DEVICE_NUM. (See pages 23 and 27 of http://www.pgroup.com/lit/whitepapers/pgi_accel_prog_model_1.2.pdf)

Does this solve the problem?

Mat

sWienke · April 27, 2010, 10:46am

No, unfortunately that does not help. To set the device explicitly (with acc_set_device_num or the environment variable) you already have to know which GPU is available and which is not. But as a normal user in a multi-user cluster environment you don’t know that.
So I’m afraid we have to implement it somehow on our own with SGE. I just thought there might be a software/driver solution for the PGI Accelerator model, as there is for CUDA.
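
One possible way to do this "on our own" without knowing in advance which GPU is free is a cooperative lock per GPU, combined with the ACC_DEVICE_NUM variable mentioned above. The following is only a sketch under several assumptions: every job is started through the wrapper (for example from an SGE job script), the value of ACC_DEVICE_NUM matches the GPU ID, and the names `LOCK_ROOT`, `claim_gpu`, `release_gpu` and `run_on_free_gpu` are hypothetical. It does not query the driver, so it is not a replacement for exclusive compute mode.

```shell
#!/bin/sh
# Sketch: cooperative GPU selection via lock directories.
# LOCK_ROOT and all function names are hypothetical, chosen for illustration.
LOCK_ROOT=${LOCK_ROOT:-/tmp/gpu-locks}

# Try to claim one of GPUs 0..N-1; prints the claimed ID on success.
claim_gpu() {
    num_gpus=$1
    mkdir -p "$LOCK_ROOT" || return 1
    gpu=0
    while [ "$gpu" -lt "$num_gpus" ]; do
        # mkdir without -p fails if the directory already exists,
        # so only one process can claim a given GPU.
        if mkdir "$LOCK_ROOT/gpu$gpu" 2>/dev/null; then
            echo "$gpu"
            return 0
        fi
        gpu=$((gpu + 1))
    done
    return 1
}

# Give a claimed GPU back.
release_gpu() {
    rmdir "$LOCK_ROOT/gpu$1"
}

# Usage: run_on_free_gpu NUM_GPUS command [args...]
run_on_free_gpu() {
    num_gpus=$1
    shift
    dev=$(claim_gpu "$num_gpus") || { echo "no free GPU" >&2; return 1; }
    # Pass the claimed device to the PGI Accelerator program.
    ACC_DEVICE_NUM=$dev "$@"
    status=$?
    release_gpu "$dev"
    return "$status"
}
```

With two GPUs, a job would then be started as `run_on_free_gpu 2 ./myprog`, where `./myprog` stands for any PGI Accelerator executable. A job that is killed before it calls `release_gpu` leaves a stale lock behind, so a real setup would need some cleanup step.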