Communication delay factors: what is the significant one?

Within a given CUDA program, I sent 4 sets of size ‘n’ and 2 sets of size ‘m’ (where m < n*n) to a kernel, which returned a set of size ‘n’.
I got these results using the CUDA timer:

for n=319
HOST to DEVICE Communication time: 0.119052 (ms)
Processing time: 0.062505 (ms)
DEVICE TO HOST Communication time: 0.020511 (ms)

for n=1072
HOST to DEVICE Communication time: 0.203001 (ms)
Processing time: 0.062257 (ms)
DEVICE TO HOST Communication time: 0.031104 (ms)

After these operations, I sent one set of size ‘n*n’ and one set of size ‘n’ to another kernel, which returned a set of size ‘n’.
I got these results:

for n=319
HOST TO DEVICE Communication time: 0.358962 (ms)
Processing time: 0.048393 (ms)
DEVICE TO HOST Communication time: 1.930656 (ms)

for n=1072
HOST TO DEVICE Communication time: 4.485273 (ms)
Processing time: 0.052156 (ms)
DEVICE TO HOST Communication time: 72.515274 (ms)

As you see, despite the returned data sets being the same size, the delay in the second scenario is much larger than in the first.
I wonder if you could tell me the reason!
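For scale, a quick back-of-the-envelope check of what the copies alone should cost. The ~1 GB/s effective PCIe throughput used here is an assumed round figure, not a measurement from any particular board:

```cuda
#include <stdio.h>

int main(void)
{
    // Second scenario, n = 1072: one n*n float set up, one n float set back.
    const int    n        = 1072;
    const double bytesH2D = (double)n * n * sizeof(float); // ~4.6 MB upload
    const double bytesD2H = (double)n * sizeof(float);     // ~4.3 KB download
    const double bw       = 1.0e9;                         // assumed 1 GB/s PCIe

    printf("H2D: %.2f MB -> ~%.2f ms expected\n",
           bytesH2D / 1e6, bytesH2D / bw * 1e3);
    printf("D2H: %.2f KB -> ~%.4f ms expected\n",
           bytesD2H / 1e3, bytesD2H / bw * 1e3);
    return 0;
}
```

At that assumed rate the n*n upload works out to roughly the 4.5 ms actually measured, while the n-element download is only a few kilobytes and should take microseconds, far too small to explain tens of milliseconds on its own.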

Hard to say without seeing the code. Could you strip your kernel of the compute code (leaving just the global memory I/O) and post your code?


// allocate device memory
float*        d_Y;
float*        d_DIAG;
float*        d_X;
float*        d_OFFDIAG;
unsigned int* d_R;
unsigned int* d_C;
unsigned int* d_K;

CUDA_SAFE_CALL( cudaMalloc( (void**) &d_Y,       MemSize));
CUDA_SAFE_CALL( cudaMalloc( (void**) &d_DIAG,    MemSize));
CUDA_SAFE_CALL( cudaMalloc( (void**) &d_X,       MemSize));
CUDA_SAFE_CALL( cudaMalloc( (void**) &d_OFFDIAG, OffDiagSize));
CUDA_SAFE_CALL( cudaMalloc( (void**) &d_R,       IMemSize));
CUDA_SAFE_CALL( cudaMalloc( (void**) &d_C,       IOffDiagSize));
CUDA_SAFE_CALL( cudaMalloc( (void**) &d_K,       IMemSize));

// copy host memory to device (timed)
unsigned int timer0 = 0;
CUT_SAFE_CALL( cutCreateTimer( &timer0));
CUT_SAFE_CALL( cutStartTimer( timer0));
CUDA_SAFE_CALL( cudaMemcpy(d_X,       h_X,       MemSize,      cudaMemcpyHostToDevice));
CUDA_SAFE_CALL( cudaMemcpy(d_DIAG,    h_DIAG,    MemSize,      cudaMemcpyHostToDevice));
CUDA_SAFE_CALL( cudaMemcpy(d_OFFDIAG, h_OFFDIAG, OffDiagSize,  cudaMemcpyHostToDevice));
CUDA_SAFE_CALL( cudaMemcpy(d_R,       h_R,       IMemSize,     cudaMemcpyHostToDevice));
CUDA_SAFE_CALL( cudaMemcpy(d_C,       h_C,       IOffDiagSize, cudaMemcpyHostToDevice));
CUDA_SAFE_CALL( cudaMemcpy(d_K,       h_K,       IMemSize,     cudaMemcpyHostToDevice));
CUT_SAFE_CALL( cutStopTimer( timer0));
printf( "HOST to DEVICE Communication time: %f (ms)\n", cutGetTimerValue( timer0));
CUT_SAFE_CALL( cutDeleteTimer( timer0));

// setup execution parameters
dim3 grid(VectSize/BLOCKSIZE, 1, 1);
dim3 threads(BLOCKSIZE, 1, 1);

// execute the kernel (timed; note that the launch returns immediately)
unsigned int timer1 = 0;
CUT_SAFE_CALL( cutCreateTimer( &timer1));
CUT_SAFE_CALL( cutStartTimer( timer1));
SparseMUL<<< grid, threads, 0 >>>(d_Y, d_DIAG, d_OFFDIAG, d_X, d_R, d_C, d_K);
CUT_SAFE_CALL( cutStopTimer( timer1));
printf( "Processing time: %f (ms)\n", cutGetTimerValue( timer1));
CUT_SAFE_CALL( cutDeleteTimer( timer1));

// check if kernel execution generated an error
CUT_CHECK_ERROR("Kernel execution failed");

// copy result from device to host (timed)
unsigned int timer11 = 0;
CUT_SAFE_CALL( cutCreateTimer( &timer11));
CUT_SAFE_CALL( cutStartTimer( timer11));
CUDA_SAFE_CALL( cudaMemcpy(h_Y, d_Y, MemSize, cudaMemcpyDeviceToHost));
CUT_SAFE_CALL( cutStopTimer( timer11));
printf( "DEVICE TO HOST Communication time: %f (ms)\n", cutGetTimerValue( timer11));
CUT_SAFE_CALL( cutDeleteTimer( timer11));

One thing that I have seen: if you are trying to print out the times, you have to put a synchronization call right after the call to the kernel. What happens, I think, is that the CPU continues to work when you start the kernel, because the launch is asynchronous. The next call that accesses the resulting data is then suspended until the kernel is done. In other words, some of the GPU's processing time is running over into the device-to-host copy time.
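Applied to the posted code, the fix would look roughly like this. I'm assuming cudaThreadSynchronize() here, which was the host-side synchronization call in that era of the toolkit (cudaDeviceSynchronize() replaces it in later toolkits):

```cuda
CUT_SAFE_CALL( cutStartTimer( timer1));
SparseMUL<<< grid, threads, 0 >>>(d_Y, d_DIAG, d_OFFDIAG, d_X, d_R, d_C, d_K);

// The launch returns immediately; without this, the timer measures only the
// launch overhead, and the kernel's run time spills into the next cudaMemcpy.
CUDA_SAFE_CALL( cudaThreadSynchronize());

CUT_SAFE_CALL( cutStopTimer( timer1));
```

With the synchronization in place, "Processing time" should grow and the device-to-host figure should shrink back to something proportional to the n-element copy.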

I hope this helps you…
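For what it's worth, CUDA events avoid this pitfall entirely, since their timestamps are taken on the GPU's own stream rather than on the CPU. A sketch using the standard cudaEvent API, with the kernel call and variable names from the posted code:

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

// Both events are enqueued on the same stream as the kernel, so they
// bracket the kernel's actual execution, not just its launch.
cudaEventRecord(start, 0);
SparseMUL<<< grid, threads, 0 >>>(d_Y, d_DIAG, d_OFFDIAG, d_X, d_R, d_C, d_K);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);   // wait until the stop event has been reached

float kernelMs = 0.0f;
cudaEventElapsedTime(&kernelMs, start, stop);   // elapsed GPU time in ms
printf("Processing time: %f (ms)\n", kernelMs);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

The same start/stop pattern works for timing the memcpys, and it keeps the measurements honest regardless of what the CPU does in between.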