Hello All,
I have been using CUDA 2.1 on my MacBook Pro (Mac OS X 10.5.7) with the 512MB NVIDIA 8600M GT and decided it was about time to upgrade to CUDA 2.2. After upgrading, though, deviceQuery reports a different value in the concurrent copy and execution field. Below is what CUDA 2.2 says.
[codebox]frisdawg:release wfrisby$ ./deviceQuery
CUDA Device Query (Runtime API) version (CUDART static linking)
There is 1 device supporting CUDA
Device 0: "GeForce 8600M GT"
CUDA Capability Major revision number: 1
CUDA Capability Minor revision number: 1
Total amount of global memory: 536674304 bytes
Number of multiprocessors: 4
Number of cores: 32
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 0.75 GHz
Concurrent copy and execution: No
Run time limit on kernels: Yes
Integrated: No
Support host page-locked memory mapping: No
Compute mode: Default (multiple host threads can use this device simultaneously)
Test PASSED
Press ENTER to exit…[/codebox]
The test passes, but concurrent copy and execution is reported as "No". I know this worked in CUDA 2.1, so I removed 2.2 and reinstalled 2.1 to check. This is what I got:
[codebox]frisdawg:release wfrisby$ ./deviceQuery
CUDA Device Query (Runtime API) version (CUDART static linking)
There is 1 device supporting CUDA
Device 0: "GeForce 8600M GT"
CUDA Capability Major revision number: 1
CUDA Capability Minor revision number: 1
Total amount of global memory: 536674304 bytes
Number of multiprocessors: 4
Number of cores: 32
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 0.75 GHz
Concurrent copy and execution: Yes
Test PASSED
Press ENTER to exit…[/codebox]
Still on CUDA 2.1, I also ran simpleStreams from the SDK, and it does show a speedup when using streams.
[codebox]frisdawg:release wfrisby$ ./simpleStreams
running on: GeForce 8600M GT
memcopy: 39.70
kernel: 88.01
non-streamed: 125.98 (127.71 expected)
4 streams: 94.69 (97.93 expected with compute capability 1.1 or later)
Test PASSED
Press ENTER to exit…
[/codebox]
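For reference, the "expected" figures in that output can be reproduced from the measured memcopy and kernel times. The sketch below is only my reading of the arithmetic, not code taken from the SDK sample. It assumes that without overlap the copy and the kernel simply run back to back, and that with overlap only one stream's share of the copy (1/4 of it here) stays exposed. The variable names are made up for illustration.

```python
# Timings copied from the simpleStreams output above (same units as the sample prints).
memcopy = 39.70
kernel = 88.01
measured_non_streamed = 125.98
measured_streamed = 94.69
nstreams = 4

# No overlap: the copy and the kernel run one after the other.
expected_non_streamed = memcopy + kernel

# Assumption: with copy/execute overlap, all but one stream's chunk of the
# copy is hidden behind kernel execution, so only memcopy / nstreams remains.
expected_streamed = kernel + memcopy / nstreams

# Speedup actually observed in the run above.
speedup = measured_non_streamed / measured_streamed

print(f"expected non-streamed: {expected_non_streamed:.2f}")
print(f"expected with {nstreams} streams: {expected_streamed:.2f}")
print(f"measured speedup: {speedup:.2f}x")
```

Under those assumptions the sums are 39.70 + 88.01 = 127.71 and 88.01 + 9.925 = 97.935, which line up with the 127.71 and 97.93 the sample prints, and the measured times correspond to a speedup of roughly 1.33x. If the device really could not overlap copies with kernels, the streamed time would be expected to stay near the non-streamed figure.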
Has anyone else seen this before? BTW, I also ran simpleStreams with CUDA 2.2 and there was no speedup. Could this be caused by CUDA 2.2 not correctly detecting the capabilities of the GPU?
Thanks,
Wes