9800GX2 cannot overlap memcpy and kernel execution?

I have a 9800GX2 card, which is of compute capability 1.1, and according to deviceQuery from the 2.3 SDK, it supports “Concurrent copy and execution”. However, when I ran the simpleStream sample from the SDK, the result was:

Device name : GeForce 9800 GX2

CUDA Capable SM 1.1 hardware with 16 multi-processors

scale_factor = 1.0000

array_size = 16777216

memcopy: 23.92

kernel: 37.70

non-streamed: 61.32 (61.62 expected)

4 streams: 62.26 (43.68 expected with compute capability 1.1 or later)

It seems that memcopy and kernel execution are not overlapped. Can someone tell me why?
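For reference, the two "expected" figures in that output are consistent with a simple overlap model: without streams the copy and the kernel run back to back, and with n streams only one chunk's copy stays exposed while the rest hides behind kernel execution. A minimal sketch of that arithmetic follows. The function names are made up for illustration, and the model is an assumption inferred from the printed numbers rather than a quote of the sample's source.

```cpp
// Expected timings in the same units as the sample output (milliseconds).
// Both function names are hypothetical, chosen for this illustration.

// No streams: the copy and the kernel execute one after the other.
double expectedSerial(double memcopyMs, double kernelMs) {
    return memcopyMs + kernelMs;
}

// n streams with perfect overlap: each stream handles 1/n of the data, so
// only one chunk's copy is not hidden behind kernel execution.
double expectedOverlapped(double memcopyMs, double kernelMs, int nStreams) {
    return kernelMs + memcopyMs / nStreams;
}
```

With the values above, expectedSerial(23.92, 37.70) gives 61.62 and expectedOverlapped(23.92, 37.70, 4) gives 43.68, matching the printed expectations. The measured 62.26 for 4 streams is essentially the serial figure, which is what "not overlapped" looks like.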

Here is the output from the CUDA profiler (showing only the asynchronous memcpy calls and the kernels):

method=[ memcpyDtoHasync ] gputime=[ 5986.496 ] cputime=[ 3.000 ]

method=[ memcpyDtoHasync ] gputime=[ 6027.520 ] cputime=[ 2.000 ]

method=[ memcpyDtoHasync ] gputime=[ 6027.776 ] cputime=[ 2.000 ]

method=[ memcpyDtoHasync ] gputime=[ 6028.480 ] cputime=[ 2.000 ]

method=[ _Z10init_arrayPiS_i ] gputime=[ 9357.216 ] cputime=[ 33343.000 ] occupancy=[ 0.667 ] gld_coherent=[ 32768 ] gld_incoherent=[ 524288 ] gst_coherent=[ 131072 ] gst_incoherent=[ 0 ]

method=[ _Z10init_arrayPiS_i ] gputime=[ 9352.864 ] cputime=[ 9366.000 ] occupancy=[ 0.667 ] gld_coherent=[ 32768 ] gld_incoherent=[ 524288 ] gst_coherent=[ 131072 ] gst_incoherent=[ 0 ]

method=[ _Z10init_arrayPiS_i ] gputime=[ 9363.552 ] cputime=[ 9376.000 ] occupancy=[ 0.667 ] gld_coherent=[ 32768 ] gld_incoherent=[ 524288 ] gst_coherent=[ 131072 ] gst_incoherent=[ 0 ]

method=[ _Z10init_arrayPiS_i ] gputime=[ 9349.312 ] cputime=[ 9363.000 ] occupancy=[ 0.667 ] gld_coherent=[ 32768 ] gld_incoherent=[ 524288 ] gst_coherent=[ 131072 ] gst_incoherent=[ 0 ]

If cputime is the time measured on the CPU side, then the four asynchronous memcpy calls really are asynchronous. However, the cputime of the first init_array kernel (33343) is roughly the sum of its own gputime and the gputime of the four memcpy calls. Does that mean the first init_array was launched before the memcpy calls had finished, but the execution was still serialized?

Can anyone help?

P.S. I am using CentOS 5.2 x86_64

I seem to have found a clue. When running deviceQuery from the SDK, it shows:

Support host page-locked memory mapping: No

It seems my card does not support page-locked memory. Is this a limitation of the GPU?

What’s weird is that, when running bandwidthTest from the SDK, pinned memory and pageable memory do make a difference:

Host to Device Bandwidth for Pinned memory

.

Transfer Size (Bytes) Bandwidth(MB/s)

33554432 2703.9

Quick Mode

Device to Host Bandwidth for Pinned memory

.

Transfer Size (Bytes) Bandwidth(MB/s)

33554432 2051.3

Host to Device Bandwidth for Pageable memory

.

Transfer Size (Bytes) Bandwidth(MB/s)

33554432 1337.9

Quick Mode

Device to Host Bandwidth for Pageable memory

.

Transfer Size (Bytes) Bandwidth(MB/s)

33554432 1101.8

Can anyone tell me whether my card supports page-locked memory?
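A note on that question: the deviceQuery line about "host page-locked memory mapping" refers to mapping pinned host memory into the device address space (zero-copy). That is a separate feature from allocating page-locked memory, which is what asynchronous copies and the pinned bandwidthTest numbers rely on. A minimal sketch that checks the two properties separately is below. It assumes device 0 and uses the cudaDeviceProp field names from toolkits of that era (deviceOverlap, canMapHostMemory); newer toolkits report overlap through asyncEngineCount instead.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const int dev = 0;  // assumption: the first device is the one of interest

    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, dev);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Concurrent copy and kernel execution (needs pinned memory and streams).
    printf("deviceOverlap:    %d\n", prop.deviceOverlap);
    // Zero-copy mapping of pinned host memory into the device address space.
    printf("canMapHostMemory: %d\n", prop.canMapHostMemory);

    // Plain page-locked allocation does not depend on canMapHostMemory.
    void* hostBuf = NULL;
    err = cudaMallocHost(&hostBuf, 1 << 20);  // 1 MiB of pinned host memory
    printf("cudaMallocHost:   %s\n", cudaGetErrorString(err));
    if (err == cudaSuccess) {
        cudaFreeHost(hostBuf);
    }
    return 0;
}
```

If cudaMallocHost succeeds while canMapHostMemory is 0, the card supports page-locked memory and only lacks zero-copy mapping, which would agree with the bandwidthTest results above.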

OK, I found the answer: if you want to overlap memcpy and kernel execution, you have to turn off the CUDA profiler!
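In the profiler log above, each kernel's cputime is about equal to its gputime, which suggests that launches were being timed synchronously while profiling was on. A minimal sketch of a startup guard for timing code is below, assuming the profiler in question is the one switched on through the CUDA_PROFILE environment variable (the one that writes log lines in the method=[ ... ] format). The helper name is hypothetical.

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical helper: returns true when CUDA_PROFILE is set to a non-empty
// value other than "0", i.e. when the environment-variable-driven profiler
// is probably active for this process.
bool profilerLikelyEnabled() {
    const char* value = std::getenv("CUDA_PROFILE");
    return value != NULL && value[0] != '\0' && std::strcmp(value, "0") != 0;
}
```

A benchmark could call profilerLikelyEnabled() before its timed section and print a warning that overlap measurements may not be meaningful. Unsetting CUDA_PROFILE (or setting it to 0) before launching the program avoids the problem altogether.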