How to specify which GPUs to run on

I am running a section of an application on a Tesla S1070 using the PGI accelerator directives.

The node I am running on has 2 GPUs and 4 AMD CPUs, and pgaccelinfo also reports 2 GPUs:

[sindimo@superbeast]$ pgaccelinfo 
CUDA Driver Version:           3010

Device Number:                 0
Device Name:                   Tesla T10 Processor
Device Revision Number:        1.3
Global Memory Size:            4294770688
Number of Multiprocessors:     30
Number of Cores:               240
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 16384
Registers per Block:           16384
Warp Size:                     32
Maximum Threads per Block:     512
Maximum Block Dimensions:      512, 512, 64
Maximum Grid Dimensions:       65535 x 65535 x 1
Maximum Memory Pitch:          2147483647B
Texture Alignment:             256B
Clock Rate:                    1296 MHz
Initialization time:           1861641 microseconds
Current free memory:           4254142208
Upload time (4MB):             2502 microseconds (2760 ms pinned)
Download time:                 3464 microseconds (1465 ms pinned)
Upload bandwidth:              1676 MB/sec (1519 MB/sec pinned)
Download bandwidth:            1210 MB/sec (2863 MB/sec pinned)

Device Number:                 1
Device Name:                   Tesla T10 Processor
Device Revision Number:        1.3
Global Memory Size:            4294770688
Number of Multiprocessors:     30
Number of Cores:               240
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 16384
Registers per Block:           16384
Warp Size:                     32
Maximum Threads per Block:     512
Maximum Block Dimensions:      512, 512, 64
Maximum Grid Dimensions:       65535 x 65535 x 1
Maximum Memory Pitch:          2147483647B
Texture Alignment:             256B
Clock Rate:                    1296 MHz
Initialization time:           1861641 microseconds
Current free memory:           4254142208
Upload time (4MB):             2356 microseconds (2773 ms pinned)
Download time:                 3222 microseconds (1480 ms pinned)
Upload bandwidth:              1780 MB/sec (1512 MB/sec pinned)
Download bandwidth:            1301 MB/sec (2833 MB/sec pinned)

My question is: when I run my application, how can I tell whether it is using one of the GPUs or both? Is there a way to force it to run on both GPUs?

I read in the documentation that you can set the ACC_DEVICE_NUM environment variable, but that only sets the default GPU to run on. Is there something similar to tell the program to run on both GPUs?


I timed two runs, one using 1 CPU and the other using 2 CPUs, and noticed that data movement with 2 CPUs is much slower even though the same job and data are being processed.

I am just wondering if the 2-CPU run is using only 1 GPU, hence causing congestion on that GPU's PCIe link. If I could associate each CPU with 1 GPU, maybe that would distribute the data-movement load, since each GPU would have its own PCIe link (I guess so?).

#Results using 1 node with 1 CPU  
[sindimo@superbeast]$ /usr/local/mpi/mpich2/pgi10.9/bin/mpiexec -np 1 -f myNodes app.exe 
Accelerator Kernel Timing data
    175: region entered 423 times
        time(us): total=40654257 init=2066869 region=38587388
                  kernels=19509459 data=17751884
        w/o init: total=38587388 max=108797 min=89307 avg=91223
        177: kernel launched 423 times
            grid: [34]  block: [256]
            time(us): total=19509459 max=46225 min=46052 avg=46121


#Results using 1 node with 2 CPUs
[sindimo@superbeast]$ /usr/local/mpi/mpich2/pgi10.9/bin/mpiexec -np 2 -f myNodes app.exe 
Accelerator Kernel Timing data

    175: region entered 423 times
        time(us): total=75512482 init=2089617 region=73422865
                  kernels=11542442 data=48740662
        w/o init: total=73422865 max=198729 min=93850 avg=173576
        177: kernel launched 423 times
            grid: [34]  block: [256]
            time(us): total=11542442 max=27336 min=27241 avg=27287

Thank you for your help.

Mohamad Sindi

Hi Mohamad Sindi,

is there something similar to tell the program to run on both GPUs?

A single CPU thread can only attach to a single GPU. Hence, to use multiple GPUs you need to add another level of parallelization, such as OpenMP, MPI, or pthreads.

I am just wondering if the 2-CPU run is using only 1 GPU, hence causing congestion on that GPU's PCIe link. If I could associate each CPU with 1 GPU, maybe that would distribute the data-movement load, since each GPU would have its own PCIe link (I guess so?).

Most likely. To use multiple GPUs, each of your MPI processes needs to attach to a particular device using the “acc_set_device_num” function.
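As a language-neutral sketch of that idea (Python here, purely for illustration; in the PGI Fortran API the actual call is acc_set_device_num), each MPI rank can be mapped to a device round-robin:

```python
def device_for_rank(rank, num_gpus):
    """Map an MPI rank to a GPU device number by simple round-robin.

    In the real Fortran application, each process would then call
    acc_set_device_num(device, acc_device_nvidia) with the returned
    device number before entering any accelerator region.
    """
    if num_gpus < 1:
        raise ValueError("need at least one GPU")
    return rank % num_gpus

# With 2 GPUs, consecutive ranks alternate between device 0 and device 1:
print([device_for_rank(r, 2) for r in range(4)])  # [0, 1, 0, 1]
```

With the modulo mapping, the even/odd split for 2 GPUs falls out naturally, and the same line works unchanged on nodes with more devices.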

Although it’s for OpenMP, this post might help: PGI compiler with OMP option

  • Mat

Dear Mat, acc_set_device is exactly what I was looking for.

However, I tested it on a simple program and it doesn’t seem to work.

Basically, I set my ACC_NOTIFY environment variable to 1, call acc_set_device with the GPU device number in the program, and then run it to see which GPU device the kernel gets launched on.

From the example below, I set the GPU device once to 0 and another time to 1, but in both cases it only runs on GPU device 0 (i.e. device=0).

[sindimo@superbeast]$ cat test.f 

      integer dim1, dim2, dim3
      parameter (dim1 = 10, dim2 = 10, dim3 = 10)
      double precision A(dim1, dim2), B(dim2, dim3), C(dim1, dim3)
      real start, finish
      call srand(86456)
      do i = 1, dim1
        do j = 1, dim2
          A(i, j) = rand()
        enddo
      enddo
      do i = 1, dim2
        do j = 1, dim3
          B(i, j) = rand()
        enddo
      enddo

      call cpu_time(start)

!Setting which GPU to use; we have two GPUs on this system, 0 and 1
       call acc_set_device(0)


!$acc region
        do j = 1, dim3
        do i = 1, dim1
          C(i, j) = 0
        enddo
        do k = 1, dim2
          do i = 1, dim1
            C(i, j) = C(i, j) + A(i, k)*B(k, j)
          enddo
        enddo
       enddo
!$acc end region


      call cpu_time(finish)
      print *,'time for C(',dim1,',',dim3,') = A(',dim1,',',dim2,') B(',
     1dim2,',',dim3,') is',finish - start,' s'
      end

Run done on GPU 0:
[sindimo@superbeast]$ setenv ACC_NOTIFY 1
[sindimo@superbeast]$ grep acc_set_device test.f
call acc_set_device(0)
[sindimo@superbeast]$ mpif90 -fast -ta=nvidia -Minfo=all,accel -Minline test.f
MAIN:
8, Loop not vectorized/parallelized: contains call
13, Loop not vectorized/parallelized: contains call
24, Generating copyin(a(1:10,1:10))
Generating copyin(b(1:10,1:10))
Generating copyout(c(1:10,1:10))
Generating compute capability 1.3 binary
25, Loop is parallelizable
26, Loop is parallelizable
Accelerator kernel generated
25, !$acc do parallel, vector(10)
26, !$acc do parallel, vector(10)
CC 1.3 : 6 registers; 24 shared, 44 constant, 0 local memory bytes; 100 occupancy
29, Loop carried reuse of ‘c’ prevents parallelization
30, Loop is parallelizable
Accelerator kernel generated
25, !$acc do parallel, vector(10)
29, !$acc do seq
Cached references to size [10x10] block of ‘a’
Cached references to size [10x10] block of ‘b’
30, !$acc do parallel, vector(10)
Using register for ‘c’
CC 1.3 : 15 registers; 1624 shared, 48 constant, 0 local memory bytes; 100 occupancy
[sindimo@superbeast]$ ./a.out
launch kernel file=/red/ssd/usr/sindimo/GPU-Stuff/working-fortran-example-with-gpu/test.f function=MAIN line=26 device=0 grid=1 block=10x10
launch kernel file=/red/ssd/usr/sindimo/GPU-Stuff/working-fortran-example-with-gpu/test.f function=MAIN line=30 device=0 grid=1 block=10x10

time for C( 10 , 10 ) = A( 10 , 10
) B( 10 , 10 ) is 2.004053 s


Run done on GPU 1:
[sindimo@superbeast]$ grep acc_set_device test.f
call acc_set_device(1)
[sindimo@superbeast]$ mpif90 -fast -ta=nvidia -Minfo=all,accel -Minline test.f
MAIN:
8, Loop not vectorized/parallelized: contains call
13, Loop not vectorized/parallelized: contains call
24, Generating copyin(a(1:10,1:10))
Generating copyin(b(1:10,1:10))
Generating copyout(c(1:10,1:10))
Generating compute capability 1.3 binary
25, Loop is parallelizable
26, Loop is parallelizable
Accelerator kernel generated
25, !$acc do parallel, vector(10)
26, !$acc do parallel, vector(10)
CC 1.3 : 6 registers; 24 shared, 44 constant, 0 local memory bytes; 100 occupancy
29, Loop carried reuse of ‘c’ prevents parallelization
30, Loop is parallelizable
Accelerator kernel generated
25, !$acc do parallel, vector(10)
29, !$acc do seq
Cached references to size [10x10] block of ‘a’
Cached references to size [10x10] block of ‘b’
30, !$acc do parallel, vector(10)
Using register for ‘c’
CC 1.3 : 15 registers; 1624 shared, 48 constant, 0 local memory bytes; 100 occupancy
[sindimo@superbeast]$ ./a.out
launch kernel file=/red/ssd/usr/sindimo/GPU-Stuff/working-fortran-example-with-gpu/test.f function=MAIN line=26 device=0 grid=1 block=10x10
launch kernel file=/red/ssd/usr/sindimo/GPU-Stuff/working-fortran-example-with-gpu/test.f function=MAIN line=30 device=0 grid=1 block=10x10

time for C( 10 , 10 ) = A( 10 , 10
) B( 10 , 10 ) is 2.011037 s


Any clue why it’s not working?

Thanks again for your help!

Mohamad Sindi

I think I figured it out while going through the manual.

The function mentioned in the link you posted earlier was “acc_set_device”, which didn’t work for me.

I tried using “acc_set_device_num” from the manual and that worked fine, see example below.

Since I am using MPI for parallelism in my real application, I will get the rank of each MPI process; if it’s even I will assign it to GPU 0, and if it’s odd I will assign it to GPU 1.

Thanks Mat!

#On GPU 0
[sindimo@superbeast]$ grep acc_set test.f
       call acc_set_device_num(0, acc_device_nvidia) 
[sindimo@superbeast]$ mpif90 -fast -ta=nvidia -Minfo=all,accel -Minline test.f
MAIN:
     14, Loop not vectorized/parallelized: contains call
     19, Loop not vectorized/parallelized: contains call
     29, Generating copyin(a(1:10,1:10))
         Generating copyin(b(1:10,1:10))
         Generating copyout(c(1:10,1:10))
         Generating compute capability 1.3 binary
     30, Loop is parallelizable
     31, Loop is parallelizable
         Accelerator kernel generated
         30, !$acc do parallel, vector(10)
         31, !$acc do parallel, vector(10)
             CC 1.3 : 6 registers; 24 shared, 44 constant, 0 local memory bytes; 100 occupancy
     34, Loop carried reuse of 'c' prevents parallelization
     35, Loop is parallelizable
         Accelerator kernel generated
         30, !$acc do parallel, vector(10)
         34, !$acc do seq
             Cached references to size [10x10] block of 'a'
             Cached references to size [10x10] block of 'b'
         35, !$acc do parallel, vector(10)
             Using register for 'c'
             CC 1.3 : 15 registers; 1624 shared, 48 constant, 0 local memory bytes; 100 occupancy
[sindimo@tlca058 working-fortran-example-with-gpu]$ ./a.out
launch kernel  file=/red/ssd/usr/sindimo/GPU-Stuff/working-fortran-example-with-gpu/test.f function=MAIN line=31 device=0 grid=1 block=10x10
launch kernel  file=/red/ssd/usr/sindimo/GPU-Stuff/working-fortran-example-with-gpu/test.f function=MAIN line=35 device=0 grid=1 block=10x10
 time for C(           10 ,           10 ) = A(           10 ,           10 
 ) B(           10 ,           10 ) is    2.012127      s


#On GPU 1
[sindimo@superbeast]$ grep acc_set test.f
       call acc_set_device_num(1, acc_device_nvidia) 
[sindimo@superbeast]$ mpif90 -fast -ta=nvidia -Minfo=all,accel -Minline test.f
MAIN:
     14, Loop not vectorized/parallelized: contains call
     19, Loop not vectorized/parallelized: contains call
     29, Generating copyin(a(1:10,1:10))
         Generating copyin(b(1:10,1:10))
         Generating copyout(c(1:10,1:10))
         Generating compute capability 1.3 binary
     30, Loop is parallelizable
     31, Loop is parallelizable
         Accelerator kernel generated
         30, !$acc do parallel, vector(10)
         31, !$acc do parallel, vector(10)
             CC 1.3 : 6 registers; 24 shared, 44 constant, 0 local memory bytes; 100 occupancy
     34, Loop carried reuse of 'c' prevents parallelization
     35, Loop is parallelizable
         Accelerator kernel generated
         30, !$acc do parallel, vector(10)
         34, !$acc do seq
             Cached references to size [10x10] block of 'a'
             Cached references to size [10x10] block of 'b'
         35, !$acc do parallel, vector(10)
             Using register for 'c'
             CC 1.3 : 15 registers; 1624 shared, 48 constant, 0 local memory bytes; 100 occupancy
[sindimo@superbeast]$ ./a.out
launch kernel  file=/red/ssd/usr/sindimo/GPU-Stuff/working-fortran-example-with-gpu/test.f function=MAIN line=31 device=1 grid=1 block=10x10
launch kernel  file=/red/ssd/usr/sindimo/GPU-Stuff/working-fortran-example-with-gpu/test.f function=MAIN line=35 device=1 grid=1 block=10x10
 time for C(           10 ,           10 ) = A(           10 ,           10 
 ) B(           10 ,           10 ) is    1.998724      s

Mohamad Sindi

Just for everyone’s reference, this is the chunk of code I used to make even-numbered processes bind to GPU 0 and odd-numbered processes bind to GPU 1:

        if (mod(get_myid(),2)==0) then
           !Process is even, run on GPU 0
           call acc_set_device_num(0, acc_device_nvidia)
        else
           !Process is odd, run on GPU 1
           call acc_set_device_num(1, acc_device_nvidia)
        endif

Now when I run my actual program, it distributes the load on both GPUs:

launch kernel file=/myapp.f function=myapp line=190 device=0 grid=34 block=256
launch kernel file=/myapp.f function=myapp line=190 device=1 grid=34 block=256
launch kernel file=/myapp.f function=myapp line=248 device=0 grid=34 block=256
launch kernel file=/myapp.f function=myapp line=248 device=1 grid=34 block=256
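A quick way to confirm the launches are balanced is to count the device= field in the ACC_NOTIFY output. A minimal sketch in Python, using the four log lines above as sample input:

```python
import re
from collections import Counter

# Sample ACC_NOTIFY lines as printed by the PGI runtime (copied from above).
log = """\
launch kernel file=/myapp.f function=myapp line=190 device=0 grid=34 block=256
launch kernel file=/myapp.f function=myapp line=190 device=1 grid=34 block=256
launch kernel file=/myapp.f function=myapp line=248 device=0 grid=34 block=256
launch kernel file=/myapp.f function=myapp line=248 device=1 grid=34 block=256
"""

# Count kernel launches per GPU device number.
counts = Counter(re.findall(r"device=(\d+)", log))
print(dict(counts))  # {'0': 2, '1': 2}
```

Equal counts per device mean the kernel launches are evenly split across the GPUs.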

I hope others find this useful.

Thank you.

Mohamad Sindi

Hi Mohamad Sindi,

The function mentioned in the link you posted earlier was “acc_set_device”, which didn’t work for me.

I tried using “acc_set_device_num” from the manual and that worked fine, see example below.

Sorry about that. I meant “acc_set_device_num”. “acc_set_device” toggles whether the code should run on the host or on the device, not which device number to use. I’ll correct my post.

  • Mat