I have a section of an application that I run on a Tesla S1070 using the PGI Accelerator directives.
The node I am running on has 2 GPUs and 4 AMD CPUs, and pgaccelinfo also reports 2 GPUs:
[sindimo@superbeast]$ pgaccelinfo
CUDA Driver Version: 3010
Device Number: 0
Device Name: Tesla T10 Processor
Device Revision Number: 1.3
Global Memory Size: 4294770688
Number of Multiprocessors: 30
Number of Cores: 240
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 16384
Registers per Block: 16384
Warp Size: 32
Maximum Threads per Block: 512
Maximum Block Dimensions: 512, 512, 64
Maximum Grid Dimensions: 65535 x 65535 x 1
Maximum Memory Pitch: 2147483647B
Texture Alignment: 256B
Clock Rate: 1296 MHz
Initialization time: 1861641 microseconds
Current free memory: 4254142208
Upload time (4MB): 2502 microseconds (2760 ms pinned)
Download time: 3464 microseconds (1465 ms pinned)
Upload bandwidth: 1676 MB/sec (1519 MB/sec pinned)
Download bandwidth: 1210 MB/sec (2863 MB/sec pinned)
Device Number: 1
Device Name: Tesla T10 Processor
Device Revision Number: 1.3
Global Memory Size: 4294770688
Number of Multiprocessors: 30
Number of Cores: 240
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 16384
Registers per Block: 16384
Warp Size: 32
Maximum Threads per Block: 512
Maximum Block Dimensions: 512, 512, 64
Maximum Grid Dimensions: 65535 x 65535 x 1
Maximum Memory Pitch: 2147483647B
Texture Alignment: 256B
Clock Rate: 1296 MHz
Initialization time: 1861641 microseconds
Current free memory: 4254142208
Upload time (4MB): 2356 microseconds (2773 ms pinned)
Download time: 3222 microseconds (1480 ms pinned)
Upload bandwidth: 1780 MB/sec (1512 MB/sec pinned)
Download bandwidth: 1301 MB/sec (2833 MB/sec pinned)
My question is: when I run my application, how can I tell whether it is using only one of the GPUs or both? Is there a way to force it to run on both GPUs?
I read in the documentation that you can set the ACC_DEVICE_NUM environment variable, but that only sets the default GPU to run on. Is there something similar to tell the program to run on both GPUs?
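To check the first part myself, I was thinking of having each MPI rank print which device it is attached to at run time. Here is a rough, untested sketch in C of what I mean (my real code is more involved); I'm assuming accel.h provides acc_get_num_devices() and acc_get_device_num() in 10.9, and that the Fortran accel_lib interface is similar, so please correct me if those routines are named differently:

#include <stdio.h>
#include <mpi.h>
#include <accel.h>   /* PGI accelerator runtime, assumed to declare the acc_* routines used below */

int main(int argc, char *argv[])
{
    int rank, ndev, dev;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* How many NVIDIA GPUs does this process see, and which one is it currently set to use? */
    ndev = acc_get_num_devices(acc_device_nvidia);
    dev  = acc_get_device_num(acc_device_nvidia);
    printf("rank %d: sees %d GPUs, current device is %d\n", rank, ndev, dev);

    MPI_Finalize();
    return 0;
}

If both ranks end up reporting device 0, that would confirm my suspicion below.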
I timed two runs, one using 1 CPU and the other using 2 CPUs, and I noticed that the data movement is a lot worse with 2 CPUs, even though it is the same job processing the same data.
I am just wondering if the 2-CPU run is only using 1 GPU, which would cause congestion on that GPU's PCIe link. If I could have each CPU (MPI rank) associated with 1 GPU, maybe that would distribute the data-movement load, since each GPU would have its own PCIe link (I guess so??).
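What I would like to do is something along these lines: bind each rank to its own GPU before the first accelerator region, so each rank gets its own PCIe link. Again, this is just an untested sketch of the idea, assuming acc_set_device_num() in accel.h is the right call for this and that the device numbers match what pgaccelinfo reports:

#include <stdio.h>
#include <mpi.h>
#include <accel.h>

int main(int argc, char *argv[])
{
    int rank, ndev;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Map MPI rank -> GPU, so with -np 2 rank 0 drives GPU 0 and rank 1 drives GPU 1.
       This has to happen before the first accelerator region so the context is
       created on the right device. */
    ndev = acc_get_num_devices(acc_device_nvidia);
    if (ndev > 0)
        acc_set_device_num(rank % ndev, acc_device_nvidia);
    printf("rank %d assigned to GPU %d of %d\n", rank, (ndev > 0) ? rank % ndev : -1, ndev);

    /* ... the existing compute loop with the accelerator region would go here, unchanged ... */

    MPI_Finalize();
    return 0;
}

Is that the recommended way to do it, or is there a better approach using the directives or an environment variable?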
#Results using 1 node with 1 CPU
[sindimo@superbeast]$ /usr/local/mpi/mpich2/pgi10.9/bin/mpiexec -np 1 -f myNodes app.exe
Accelerator Kernel Timing data
175: region entered 423 times
time(us): total=40654257 init=2066869 region=38587388
kernels=19509459 data=17751884
w/o init: total=38587388 max=108797 min=89307 avg=91223
177: kernel launched 423 times
grid: [34] block: [256]
time(us): total=19509459 max=46225 min=46052 avg=46121
#Results using 1 node with 2 CPUs
[sindimo@superbeast]$ /usr/local/mpi/mpich2/pgi10.9/bin/mpiexec -np 2 -f myNodes app.exe
Accelerator Kernel Timing data
175: region entered 423 times
time(us): total=75512482 init=2089617 region=73422865
kernels=11542442 data=48740662
w/o init: total=73422865 max=198729 min=93850 avg=173576
177: kernel launched 423 times
grid: [34] block: [256]
time(us): total=11542442 max=27336 min=27241 avg=27287
Thank you for your help.
Mohamad Sindi