I have a 9800GX2 card, which has compute capability 1.1, and according to deviceQuery from the 2.3 SDK, it supports “Concurrent copy and execution”. However, when I ran the simpleStreams sample from the SDK, the result was:
Device name : GeForce 9800 GX2
CUDA Capable SM 1.1 hardware with 16 multi-processors
scale_factor = 1.0000
array_size = 16777216
memcopy: 23.92
kernel: 37.70
non-streamed: 61.32 (61.62 expected)
4 streams: 62.26 (43.68 expected with compute capability 1.1 or later)
It seems that memcopy and kernel execution are not overlapped. Can someone tell me why?
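For reference, the two "expected" figures in the output above are consistent with a simple model: without overlap the copy and the kernel add up, and with overlap only one stream's share of the copy stays visible. The sketch below just reproduces that arithmetic. The function names are my own, and the overlap formula is my assumption about how the sample derives its number, not something I have confirmed in the SDK source.

```c
#include <assert.h>

/* Expected time when the copy and the kernel run back to back (no overlap).
   Both arguments are in the same unit as the sample's output. */
double expected_serialized(double memcpy_time, double kernel_time) {
    return memcpy_time + kernel_time;
}

/* Expected time under the assumed overlap model: the full kernel time plus
   one stream's share of the copy, with the rest of the copy hidden. */
double expected_overlapped(double memcpy_time, double kernel_time, int nstreams) {
    assert(nstreams > 0);
    return kernel_time + memcpy_time / nstreams;
}
```

With the figures above, expected_serialized(23.92, 37.70) gives 61.62 and expected_overlapped(23.92, 37.70, 4) gives 43.68, so the measured 62.26 matches the no-overlap case.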
Here is the output from the CUDA profiler (showing only the asynchronous memcpy and kernel calls):
method=[ memcpyDtoHasync ] gputime=[ 5986.496 ] cputime=[ 3.000 ]
method=[ memcpyDtoHasync ] gputime=[ 6027.520 ] cputime=[ 2.000 ]
method=[ memcpyDtoHasync ] gputime=[ 6027.776 ] cputime=[ 2.000 ]
method=[ memcpyDtoHasync ] gputime=[ 6028.480 ] cputime=[ 2.000 ]
method=[ _Z10init_arrayPiS_i ] gputime=[ 9357.216 ] cputime=[ 33343.000 ] occupancy=[ 0.667 ] gld_coherent=[ 32768 ] gld_incoherent=[ 524288 ] gst_coherent=[ 131072 ] gst_incoherent=[ 0 ]
method=[ _Z10init_arrayPiS_i ] gputime=[ 9352.864 ] cputime=[ 9366.000 ] occupancy=[ 0.667 ] gld_coherent=[ 32768 ] gld_incoherent=[ 524288 ] gst_coherent=[ 131072 ] gst_incoherent=[ 0 ]
method=[ _Z10init_arrayPiS_i ] gputime=[ 9363.552 ] cputime=[ 9376.000 ] occupancy=[ 0.667 ] gld_coherent=[ 32768 ] gld_incoherent=[ 524288 ] gst_coherent=[ 131072 ] gst_incoherent=[ 0 ]
method=[ _Z10init_arrayPiS_i ] gputime=[ 9349.312 ] cputime=[ 9363.000 ] occupancy=[ 0.667 ] gld_coherent=[ 32768 ] gld_incoherent=[ 524288 ] gst_coherent=[ 131072 ] gst_incoherent=[ 0 ]
If cputime is the time measured from the CPU side, then the 4 asynchronous memcopies really are asynchronous. However, the cputime of the first init_array kernel seems to be the sum of its own gputime and the gputime of the 4 memcopies. Does that mean the first init_array starts before all the memcpys, but the execution was serialized?
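To back up the "sum" observation, here is a minimal sketch of the arithmetic, assuming gputime and cputime are reported in the same unit. The helper name sum_times is hypothetical; the inputs are the gputime values from the profiler lines above.

```c
#include <stddef.h>

/* Add up a list of profiler times (result is in the same unit as the input). */
double sum_times(const double *times, size_t count) {
    double total = 0.0;
    for (size_t i = 0; i < count; ++i) {
        total += times[i];
    }
    return total;
}
```

The four memcpyDtoHasync gputimes (5986.496, 6027.520, 6027.776, 6028.480) sum to 24070.272. Adding the first kernel's gputime of 9357.216 gives 33427.488, which is within about 0.3% of the 33343.000 cputime reported for that kernel.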
Can anyone help?
P.S. I am using CentOS 5.2 x86_64