3D grids

Hello,

It is my understanding that compute capability 2.x supports 3D grids (pgaccelinfo indicates this for my card).
Is it correct that pgfortran only supports a 2D grid (z-index must be equal to 1)? If so, will a future release support 3D grids?

Thanks, Jan

Hi Jan,

The allowable kernel schedule isn’t a PGI restriction; rather, it has more to do with the CUDA driver, the runtime environment, and your device. Hence, if your card and driver support 3D grids, then you can use 3D grids in your CUDA Fortran code.
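If you want to confirm what your device actually reports, one quick check (a sketch only; it assumes device 0 and omits most error handling) is to query the device properties via the cudafor module:

```fortran
! Minimal sketch: print the maximum grid dimensions the device reports.
! Assumes device number 0; compile with pgfortran (e.g. "pgfortran checkgrid.cuf").
program checkgrid
  use cudafor
  implicit none
  type(cudadeviceprop) :: prop
  integer :: istat

  istat = cudaGetDeviceProperties(prop, 0)
  if (istat /= cudaSuccess) stop 'cudaGetDeviceProperties failed'

  ! maxGridSize is a 3-element array: x, y, z grid limits
  print *, 'Maximum grid dimensions: ', prop%maxGridSize
end program checkgrid
```

On a compute capability 2.x device with a CUDA 4.0+ driver, the z entry should be greater than 1, which is the same information pgaccelinfo prints as “Maximum Grid Dimensions”.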

Hope this helps,
Mat

Hi Mat,

Do you know which driver supports 3D grids? My device (GTX 580) supports it according to pgaccelinfo. In my code, I can define a 3D grid, and the kernel launches and finishes without an error message, but the result is wrong. Some debugging in emulation mode showed that blockidx%z never counts above 1, even though griddim%z is 2.

I got concerned because the guide states on page 7 that "The value of blockidx%z is always one."

Thanks, Jan


:~$ pgaccelinfo
CUDA Driver Version:           4020
NVRM version: NVIDIA UNIX x86_64 Kernel Module  295.41  Fri Apr  6 23:18:58 PDT 2012

Device Number:                 0
Device Name:                   GeForce GTX 580
Device Revision Number:        2.0
Global Memory Size:            1609760768
Number of Multiprocessors:     16
Number of Cores:               512
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 49152
Registers per Block:           32768
Warp Size:                     32
Maximum Threads per Block:     1024
Maximum Block Dimensions:      1024, 1024, 64
Maximum Grid Dimensions:       65535 x 65535 x 65535
Maximum Memory Pitch:          2147483647B
Texture Alignment:             512B
Clock Rate:                    1544 MHz
Execution Timeout:             No
Integrated Device:             No
Can Map Host Memory:           Yes
Compute Mode:                  default
Concurrent Kernels:            Yes
ECC Enabled:                   No
Memory Clock Rate:             2004 MHz
Memory Bus Width:              384 bits
L2 Cache Size:                 786432 bytes
Max Threads Per SMP:           1536
Async Engines:                 1
Unified Addressing:            Yes
Initialization time:           1203421 microseconds
Current free memory:           1535369216
Upload time (4MB):             2255 microseconds (1587 ms pinned)
Download time:                 3341 microseconds (2321 ms pinned)
Upload bandwidth:              1860 MB/sec (2642 MB/sec pinned)
Download bandwidth:            1255 MB/sec (1807 MB/sec pinned)

Hi Jan,

Your driver version is fine. Also, I just tried a trivial example that seems to work. The CUDA version does need to be 4.0, so if you’re using PGI 11.6 to 11.9, add the flag “-Mcuda=cuda4.0”. The 12.x compilers use CUDA 4.0 by default so should just work.

Note that it does appear our emulation mode still assumes that the Z dimension is always 1. I’ve added a problem report (TPR#18799) and sent it to our engineers.

Here’s my trivial example:

% cat test3D.cuf 
module test3D

    integer, device, allocatable, dimension(:,:,:) :: Ad
    integer, allocatable, dimension(:,:,:) :: A
    
contains

   attributes(global) subroutine test3Dsub(l,m,n)

     integer, value :: l,m,n
     integer ix,iy,iz

     ix = (blockidx%x-1) * blockdim%x + threadidx%x 
     iy = (blockidx%y-1) * blockdim%y + threadidx%y 
     iz = (blockidx%z-1) * blockdim%z + threadidx%z

     if (ix .le. l .and. iy .le. m .and. iz .le. n) then
        Ad(ix,iy,iz) = iz
     endif
     
   end subroutine test3Dsub

end module test3D

program test
  use cudafor 
  use test3D

  integer l,m,n
  type(dim3) :: blocks
  type(dim3) :: threads
	
  l=64
  m=128
  n=64

  threads = dim3(8,8,8)
  blocks = dim3(l/8, m/8, n/8) 

  allocate(A(l,m,n), Ad(l,m,n))
  Ad=0
  call test3Dsub <<<blocks,threads>>>(l,m,n)
  A=Ad
  print *, A(45,23,:)	

end program test
% pgf90 test3D.cuf -V12.5 ; a.out
            1            2            3            4            5            6 
            7            8            9           10           11           12 
           13           14           15           16           17           18 
           19           20           21           22           23           24 
           25           26           27           28           29           30 
           31           32           33           34           35           36 
           37           38           39           40           41           42 
           43           44           45           46           47           48 
           49           50           51           52           53           54 
           55           56           57           58           59           60 
           61           62           63           64
% pgaccelinfo
CUDA Driver Version:           4020
NVRM version: NVIDIA UNIX x86_64 Kernel Module  295.59  Wed Jun  6 21:19:40 PDT 2012

Device Number:                 0
Device Name:                   Tesla C2070
Device Revision Number:        2.0
Global Memory Size:            6441598976
Number of Multiprocessors:     14
Number of Cores:               448
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 49152
Registers per Block:           32768
Warp Size:                     32
Maximum Threads per Block:     1024
Maximum Block Dimensions:      1024, 1024, 64
Maximum Grid Dimensions:       65535 x 65535 x 65535
Maximum Memory Pitch:          2147483647B
Texture Alignment:             512B
Clock Rate:                    1147 MHz
Execution Timeout:             No
Integrated Device:             No
Can Map Host Memory:           Yes
Compute Mode:                  default
Concurrent Kernels:            Yes
ECC Enabled:                   No
Memory Clock Rate:             1494 MHz
Memory Bus Width:              384 bits
L2 Cache Size:                 786432 bytes
Max Threads Per SMP:           1536
Async Engines:                 2
Unified Addressing:            Yes
Initialization time:           1174550 microseconds
Current free memory:           6367338496
Upload time (4MB):              968 microseconds ( 712 ms pinned)
Download time:                 1042 microseconds ( 672 ms pinned)
Upload bandwidth:              4332 MB/sec (5890 MB/sec pinned)
Download bandwidth:            4025 MB/sec (6241 MB/sec pinned)

Hi Mat,

Thanks, adding the cuda4.0 flag did the trick on 11.9. I can confirm that emulation mode does not count blockidx%z beyond 1.

Jan