Hi Jan,
Your driver version is fine. Also, I just tried a trivial example that seems to work. The CUDA version does need to be 4.0, so if you’re using PGI 11.6 to 11.9, add the flag “-Mcuda=cuda4.0”. The 12.x compilers use CUDA 4.0 by default so should just work.
Note that it does appear our emulation mode still assumes that the Z dimension is always 1. I’ve added a problem report (TPR#18799) and sent it to our engineers.
Here’s my trivial example:
% cat test3D.cuf
module test3D
integer, device, allocatable, dimension(:,:,:) :: Ad
integer, allocatable, dimension(:,:,:) :: A
contains
attributes(global) subroutine test3Dsub(l,m,n)
integer, value :: l,m,n
integer ix,iy,iz
ix = (blockidx%x-1) * blockdim%x + threadidx%x
iy = (blockidx%y-1) * blockdim%y + threadidx%y
iz = (blockidx%z-1) * blockdim%z + threadidx%z
if (ix .le. l .and. iy .le. m .and. iz .le. n) then
Ad(ix,iy,iz) = iz
endif
end subroutine test3Dsub
end module test3D
program test
use cudafor
use test3D
integer l,m,n
type(dim3) :: blocks
type(dim3) :: threads
l=64
m=128
n=64
threads = dim3(8,8,8)
blocks = dim3(l/8, m/8, n/8)
allocate(A(l,m,n), Ad(l,m,n))
Ad=0
call test3Dsub <<<blocks,threads>>>(l,m,n)
A=Ad
print *, A(45,23,:)
end program test
% pgf90 test3D.cuf -V12.5 ; a.out
1 2 3 4 5 6
7 8 9 10 11 12
13 14 15 16 17 18
19 20 21 22 23 24
25 26 27 28 29 30
31 32 33 34 35 36
37 38 39 40 41 42
43 44 45 46 47 48
49 50 51 52 53 54
55 56 57 58 59 60
61 62 63 64
% pgaccelinfo
CUDA Driver Version: 4020
NVRM version: NVIDIA UNIX x86_64 Kernel Module 295.59 Wed Jun 6 21:19:40 PDT 2012
Device Number: 0
Device Name: Tesla C2070
Device Revision Number: 2.0
Global Memory Size: 6441598976
Number of Multiprocessors: 14
Number of Cores: 448
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 49152
Registers per Block: 32768
Warp Size: 32
Maximum Threads per Block: 1024
Maximum Block Dimensions: 1024, 1024, 64
Maximum Grid Dimensions: 65535 x 65535 x 65535
Maximum Memory Pitch: 2147483647B
Texture Alignment: 512B
Clock Rate: 1147 MHz
Execution Timeout: No
Integrated Device: No
Can Map Host Memory: Yes
Compute Mode: default
Concurrent Kernels: Yes
ECC Enabled: No
Memory Clock Rate: 1494 MHz
Memory Bus Width: 384 bits
L2 Cache Size: 786432 bytes
Max Threads Per SMP: 1536
Async Engines: 2
Unified Addressing: Yes
Initialization time: 1174550 microseconds
Current free memory: 6367338496
Upload time (4MB): 968 microseconds ( 712 ms pinned)
Download time: 1042 microseconds ( 672 ms pinned)
Upload bandwidth: 4332 MB/sec (5890 MB/sec pinned)
Download bandwidth: 4025 MB/sec (6241 MB/sec pinned)