Hi all!
I wanted to reduce the latency of my algorithm by using streams, but the results I got from the CUDA profiler look strange: it seems the kernel and the memcpy calls are still executed sequentially. I hope you can see what I mean. These are the values the profiler shows before using streams:
Method      | Timestamp | GPU time | CPU time | Occupancy | streamId | grid size | block size | sm cta launched | cta launched
memcpyHtoD  |         0 |   110058 |    4,795 |           |          |           |            |                 |
Kernel      |    110071 |  4001,22 |   113280 |         1 |        1 |        72 |        256 |               3 |            8
memcpyDtoH  |    114243 |  87783,4 |    5,159 |           |          |           |            |                 |
And these are the values WITH three streams:
Method      | Timestamp | GPU time | CPU time | Occupancy | streamId | grid size | block size | sm cta launched | cta launched
memcpyHtoD  |         0 |    36684 |    4,795 |           |          |           |            |                 |
memcpyHtoD  |     36685 |    36739 |    4,795 |           |          |           |            |                 |
memcpyHtoD  |     73425 |    36734 |    4,795 |           |          |           |            |                 |
Kernel      |    110172 |     2207 |   111600 |         1 |        1 |        24 |        256 |               1 |            3
Kernel      |    112602 |     2201 |     2216 |         1 |        2 |        24 |        256 |               1 |            3
Kernel      |    114994 |     2195 |     2239 |         1 |        3 |        24 |        256 |               1 |            2
memcpyDtoH  |    117346 |    29261 |    2,911 |           |          |           |            |                 |
memcpyDtoH  |    146608 |    29317 |     1,49 |           |          |           |            |                 |
memcpyDtoH  |    175926 |    29310 |    1,351 |           |          |           |            |                 |
As you can see, there is no latency improvement. I know the kernel execution time is too short to produce really significant differences (there is not much time to overlap with the memcpys, but this is only a test), but I should still see small differences, shouldn't I?
Also, the memcpy calls don't show a stream number. Is this right? The profiler manual says the stream id is usually shown for every call issued in a stream (so for memcpy calls, too).
What did I do wrong?
I'm using a GeForce GTX 260 (compute capability 1.3) with SDK 2.3 and Toolkit 2.3. For the stream implementation I don't use much more code than the example in the programming guide. Here are the relevant parts:
[codebox]
// Create the streams
cudaStream_t *Stream = NULL;
Stream = (cudaStream_t*)malloc(uiNumOfStreams * sizeof(cudaStream_t));
for (int i = 0; i < uiNumOfStreams; ++i)
    cudaStreamCreate(&Stream[i]);

// Asynchronous memcpy (used for both directions)
for (int i = 0; i < uiNumOfStreams; ++i)
{
    unsigned int uiOffset = i * uiWidth * uiHeight * m_uiImagesPerStream;
    if (uiDirection == cudaMemcpyHostToDevice)
    {
        Destination = CudaBase::GetDeviceMemoryPtr().idtInputDevData + uiOffset;
        Source      = CudaBase::sm_HostPtr + uiOffset;
    }
    else
    {
        Source      = CudaBase::GetDeviceMemoryPtr().idtOutputDevData + uiOffset;
        Destination = CudaBase::sm_HostPtr + uiOffset;
    }
    cudaMemcpyAsync(Destination, Source,
                    GetCopySize(i, uiNumberOfImages, uiWidth, uiHeight),
                    (cudaMemcpyKind)uiDirection, Stream[i]);
}

// Kernel execution
for (int i = 0; i < uiNumOfStreams; ++i)
{
    unsigned int uiOffset = i * uiWidth * uiHeight * uiImagesPerStream;
    if (uiNumberOfImages - uiImagesPerStream * i >= uiImagesPerStream)
        uiImgPerCurrStream = uiImagesPerStream;
    else
        uiImgPerCurrStream = uiNumberOfImages - uiImagesPerStream * i;
    KernelXY<<<(uiWidth / NUMOFTHREADS) * uiImgPerCurrStream, NUMOFTHREADS,
               0, Stream[i]>>>(idtInputDevData + uiOffset,
                               idtOutputDevData + uiOffset,
                               uiWidth, uiHeight);
}

// Destroy the streams
for (int i = 0; i < uiNumOfStreams; ++i)
    cudaStreamDestroy(Stream[i]);
[/codebox]
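To make the pattern clearer, here is a stripped-down, self-contained version of what I am trying to do. The names (`dummyKernel`, `nStreams`, `chunk`) and sizes are made up for this sketch; it is not my real code, just the pinned-memory + breadth-first issue pattern from the programming guide:

```cuda
#include <cuda_runtime.h>

// Trivial stand-in for my real kernel (hypothetical name)
__global__ void dummyKernel(const float *in, float *out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        out[idx] = in[idx] * 2.0f;
}

int main()
{
    const int nStreams = 3;        // three streams, as in my test
    const int chunk    = 1 << 20;  // elements per stream (made up)
    const int n        = nStreams * chunk;

    float *hIn, *hOut, *dIn, *dOut;
    // Page-locked host memory is required for cudaMemcpyAsync
    // to actually run asynchronously.
    cudaMallocHost((void**)&hIn,  n * sizeof(float));
    cudaMallocHost((void**)&hOut, n * sizeof(float));
    cudaMalloc((void**)&dIn,  n * sizeof(float));
    cudaMalloc((void**)&dOut, n * sizeof(float));

    cudaStream_t stream[nStreams];
    for (int i = 0; i < nStreams; ++i)
        cudaStreamCreate(&stream[i]);

    // Breadth-first issue order: all HtoD copies,
    // then all kernels, then all DtoH copies.
    for (int i = 0; i < nStreams; ++i)
        cudaMemcpyAsync(dIn + i * chunk, hIn + i * chunk,
                        chunk * sizeof(float),
                        cudaMemcpyHostToDevice, stream[i]);
    for (int i = 0; i < nStreams; ++i)
        dummyKernel<<<chunk / 256, 256, 0, stream[i]>>>(
            dIn + i * chunk, dOut + i * chunk, chunk);
    for (int i = 0; i < nStreams; ++i)
        cudaMemcpyAsync(hOut + i * chunk, dOut + i * chunk,
                        chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[i]);

    cudaThreadSynchronize();  // wait for all streams (CUDA 2.3 API)

    for (int i = 0; i < nStreams; ++i)
        cudaStreamDestroy(stream[i]);
    cudaFreeHost(hIn); cudaFreeHost(hOut);
    cudaFree(dIn);     cudaFree(dOut);
    return 0;
}
```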
Some additional questions:
- Is it possible to perform data transfers between host and device in both directions concurrently?
- What is the meaning of the profiler column "cta launched"? I know it stands for the number of thread blocks (CTAs) executed, but how are the values above obtained? How can "cta launched" be 3 if the grid size is 24? And how can the number of executed blocks differ between two identical kernels (here the values are 2 and 3)?
I hope you can help me!
Best regards!
Edit: Naturally I use page-locked memory. And I never call any CUDA function on the NULL (default) stream in between the other commands.