Concurrent copy and execution bug in CUDA 2.2

wfrisby · June 9, 2009, 7:44pm

Hello All,

I have been using CUDA 2.1 on my MacBook Pro (Mac OS X 10.5.7) with the 512MB NVIDIA GeForce 8600M GT and decided it was about time to upgrade to CUDA 2.2. After upgrading, deviceQuery reports a different value in the "Concurrent copy and execution" field. Below is what CUDA 2.2 reports.

[codebox]frisdawg:release wfrisby$ ./deviceQuery
CUDA Device Query (Runtime API) version (CUDART static linking)
There is 1 device supporting CUDA

Device 0: "GeForce 8600M GT"
CUDA Capability Major revision number: 1
CUDA Capability Minor revision number: 1
Total amount of global memory: 536674304 bytes
Number of multiprocessors: 4
Number of cores: 32
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 0.75 GHz
Concurrent copy and execution: No
Run time limit on kernels: Yes
Integrated: No
Support host page-locked memory mapping: No
Compute mode: Default (multiple host threads can use this device simultaneously)

Test PASSED

Press ENTER to exit...[/codebox]

It passes the test but reports that concurrent copy and execution is not supported. I know that this worked in CUDA 2.1, so I removed 2.2 and reinstalled 2.1 to check. This is what I got:

[codebox]frisdawg:release wfrisby$ ./deviceQuery
CUDA Device Query (Runtime API) version (CUDART static linking)
There is 1 device supporting CUDA

Device 0: "GeForce 8600M GT"
CUDA Capability Major revision number: 1
CUDA Capability Minor revision number: 1
Total amount of global memory: 536674304 bytes
Number of multiprocessors: 4
Number of cores: 32
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 0.75 GHz
Concurrent copy and execution: Yes

Test PASSED

Press ENTER to exit...[/codebox]

I also ran simpleStreams from the SDK under CUDA 2.1, and it does show a speedup with streams.

[codebox]frisdawg:release wfrisby$ ./simpleStreams
running on: GeForce 8600M GT
memcopy: 39.70
kernel: 88.01
non-streamed: 125.98 (127.71 expected)
4 streams: 94.69 (97.93 expected with compute capability 1.1 or later)

Test PASSED

Press ENTER to exit...
[/codebox]
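For reference, the "expected" figures in the simpleStreams output look consistent with a simple overlap model: without streams the copy and the kernel run back to back, and with 4 streams only about one quarter of the copy stays exposed while the rest hides behind kernel execution. This is my reading of the printed numbers, not something taken from the SDK source, and the inputs below are the rounded values as printed:

```latex
\begin{align*}
t_{\text{non-streamed}} &\approx t_{\text{memcopy}} + t_{\text{kernel}}
  = 39.70 + 88.01 = 127.71 \\
t_{\text{4 streams}} &\approx t_{\text{kernel}} + \frac{t_{\text{memcopy}}}{4}
  = 88.01 + \frac{39.70}{4} \approx 97.93
\end{align*}
```

The measured times (125.98 and 94.69) are close to both estimates, which works out to a speedup of roughly 1.33x from overlapping copies with kernel execution.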

Has anyone else seen this before? BTW, I did run simpleStreams with CUDA 2.2 and there was no speedup. Is CUDA 2.2 failing to detect the capabilities of the GPU correctly?
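To rule out a problem in the deviceQuery sample itself, the same flag can be read directly from the runtime. This is a minimal sketch, assuming the "Concurrent copy and execution" line corresponds to the `deviceOverlap` field of `cudaDeviceProp` (which is my understanding of the runtime API of this era; later toolkits deprecate that field in favour of `asyncEngineCount`):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        err = cudaGetDeviceProperties(&prop, dev);
        if (err != cudaSuccess) {
            fprintf(stderr, "cudaGetDeviceProperties(%d) failed: %s\n",
                    dev, cudaGetErrorString(err));
            return 1;
        }

        printf("Device %d: \"%s\" (compute capability %d.%d)\n",
               dev, prop.name, prop.major, prop.minor);

        // Assumption: deviceOverlap is the field behind the deviceQuery line.
        // Nonzero means the device reports it can copy while a kernel runs.
        printf("  Concurrent copy and execution: %s\n",
               prop.deviceOverlap ? "Yes" : "No");
    }
    return 0;
}
```

Built with `nvcc` against each toolkit, this should print the same Yes/No as deviceQuery if the value really comes from the runtime and driver rather than from the sample.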

Thanks,

Wes

tmurray · June 9, 2009, 8:00pm

sounds like a bug, I’ll look into it…

tmurray · June 9, 2009, 9:42pm

already fixed in 2.3, apparently.