I have had a rough start with CUDA and the PGI Fortran compiler. I fully intend to persevere, but I can't get past the problem that cufinfo reports that my card has no global memory, i.e.,
Device Number: 0
Device Name: Device Emulation (CPU)
Total Global Memory: 0.000 Gbytes <---- this line
sharedMemPerBlock: 16384 bytes
warpSize: 1 <---- is this correct by the way? shouldn't it be 32?
maxThreadsDim: 512 x 512 x 64
maxGridSize: 65535 x 65535 x 1
ClockRate: 1.350 GHz
Total Const Memory: 65536 bytes
Compute Capability Revision: 9999.9999
TextureAlignment: 256 bytes
The above was run on a MacBook Pro equipped with a GeForce 8600M GT. As a result, the matmult example returns an error when allocating device memory for the matrices.
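For what it's worth, a compute capability of 9999.9999 is what the CUDA runtime reports when it falls back to device emulation, i.e. when it cannot find a usable GPU/driver combination. You can check for this explicitly; a minimal sketch in CUDA Fortran (hypothetical program name, assuming the PGI `cudafor` module is available) might look like:

```fortran
! Sketch: detect device-emulation mode by querying device properties.
! Compile with: pgfortran -Mcuda checkdev.cuf
program checkdev
  use cudafor
  implicit none
  type(cudaDeviceProp) :: prop
  integer :: istat

  istat = cudaGetDeviceProperties(prop, 0)
  if (prop%major == 9999) then
    print *, 'Device emulation mode: no usable GPU was found.'
  else
    print *, 'Real device, compute capability ', prop%major, '.', prop%minor
  end if
end program checkdev
```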
Many thanks for your help!
Following up on my last thread, here is what deviceQuery from the CUDA SDK reports:
CUDA Device Query (Runtime API) version (CUDART static linking)
There is 1 device supporting CUDA
Device 0: "GeForce 8600M GT"
CUDA Driver Version: 3.0
CUDA Runtime Version: 3.0
CUDA Capability Major revision number: 1
CUDA Capability Minor revision number: 1
Total amount of global memory: 134021120 bytes
Number of multiprocessors: 4
Number of cores: 32
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Clock rate: 0.94 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: Yes
Support host page-locked memory mapping: No
Compute mode: Default (multiple host threads can use this device simultaneously)
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 53331, CUDA Runtime Version = 3.0, NumDevs = 1, Device = GeForce 8600M GT
However, even though deviceQuery reports that my GPU does have memory, and I managed to run some toy codes written in C, I still cannot allocate any variable on the device from Fortran.
Please post your code so we can see how you are allocating data in device memory.
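For comparison, a minimal device allocation in CUDA Fortran looks roughly like the sketch below (hypothetical names and sizes; assumes the PGI `cudafor` module):

```fortran
! Sketch: allocate an array in device memory and report success/failure.
! Compile with: pgfortran -Mcuda devalloc.cuf
program devalloc
  use cudafor
  implicit none
  real, device, allocatable :: a_d(:)   ! device-resident array
  integer :: istat

  allocate(a_d(1024), stat=istat)
  if (istat /= 0) then
    print *, 'device allocation failed, stat = ', istat
  else
    print *, 'device allocation succeeded'
    deallocate(a_d)
  end if
end program devalloc
```

If even a bare-bones program like this fails on your machine, the problem is in the toolchain setup rather than in your application code.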
Hi Sohrab Kehtari,
I was able to reproduce the issue here on a MacBook Pro. It appears that the CUDA 2.3 libraries we ship with the compilers are incompatible with NVIDIA's CUDA 3.0 Mac OS driver. To fix it, either rename or remove the "/opt/pgi/osx86/2010/cuda/2.3" directory, and compile with "-ta=nvidia,cuda3.0" when using the PGI Accelerator model or "-Mcuda=cuda3.0" when using CUDA Fortran.
Note that the incompatibility seems to occur only with devices of compute capability 1.1.
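Concretely, the workaround amounts to something like the following (paths as shipped with the 2010 PGI release; adjust the version directory and file names to match your installation):

```shell
# Move the bundled CUDA 2.3 libraries out of the way so the
# compiler cannot pick them up.
sudo mv /opt/pgi/osx86/2010/cuda/2.3 /opt/pgi/osx86/2010/cuda/2.3.disabled

# Then build against the CUDA 3.0 toolkit instead:
pgfortran -Mcuda=cuda3.0 matmul.cuf -o matmul      # CUDA Fortran
pgfortran -ta=nvidia,cuda3.0 matmul.f90 -o matmul  # PGI Accelerator model
```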
Hope this helps,
Many thanks Mat, this was very helpful and did fix the problem.
Mat, this raises the question: when will PGI begin shipping CUDA 3.0, or 3.1?
We started shipping CUDA 3.0 with the 10.4 release. Future CUDA versions will be added after NVIDIA officially releases them (i.e., not Beta) and once we have validated them.