I programmed a filter that simply inverts an image. I noticed that copying the image (900 KB) to the device takes around 4 ms. I think I'm doing something wrong.
// Filter values
int filterStart = startY * stride + startX * pixelSize;
int filterWidth = stopX - startX;
int filterHeight = stopY - startY;
// Thread properties
int threads = 256;
int blocks = (filterHeight + threads - 1) / threads; // one thread per line, rounded up
// Debug information
printf("blocks: %d threads: %d \n", blocks, threads);
printf("invert => filterHeight: %d filterWidth: %d filterStart: %d pixelSize: %d stride: %d img_length: %d \n", filterHeight, filterWidth, filterStart, pixelSize, stride, img_length);
// Allocate memory on device
BYTE* img_d;
cudaMalloc(&img_d, img_length);
// Copy complete image to device
cudaMemcpy(img_d, img, img_length, cudaMemcpyHostToDevice);
// Call kernel function for each line (note: pass the device pointer, not the host pointer)
invertCuda<<<blocks, threads>>>(img_d, filterHeight, filterWidth, filterStart, pixelSize, stride);
printf("%s\n", cudaGetErrorString(cudaPeekAtLastError()));
printf("%s\n", cudaGetErrorString(cudaDeviceSynchronize())); // cudaThreadSynchronize() is deprecated
// Copy image back to host
cudaMemcpy(img, img_d, img_length, cudaMemcpyDeviceToHost);
// Free memory
cudaFree(img_d);
The call invertCuda<<<blocks, threads>>> is asynchronous: it returns almost immediately, without waiting for the kernel to complete. When you then call cudaMemcpy the second time, it first waits for the kernel to finish and then performs the copy. So the number you see in "Time for calculating inverted image" is just the ~30 microseconds of launch overhead, and the number you see in "Time for copying memory back to host" is really the total time to invert plus the time to copy.
You can get correct numbers by adding a call to cudaThreadSynchronize() (or its modern equivalent, cudaDeviceSynchronize()) right after the kernel launch.
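Alternatively, a sketch of how the two phases could be timed separately with CUDA events, reusing the variables and launch configuration from the code above (the event names here are just illustrative):

```cuda
// Sketch: measure kernel time and copy-back time separately with CUDA events.
// Assumes img, img_d, img_length and the launch configuration from above.
cudaEvent_t start, afterKernel, afterCopy;
cudaEventCreate(&start);
cudaEventCreate(&afterKernel);
cudaEventCreate(&afterCopy);

cudaEventRecord(start);
invertCuda<<<blocks, threads>>>(img_d, filterHeight, filterWidth, filterStart, pixelSize, stride);
cudaEventRecord(afterKernel);                       // marks end of kernel work
cudaMemcpy(img, img_d, img_length, cudaMemcpyDeviceToHost);
cudaEventRecord(afterCopy);                          // marks end of copy back
cudaEventSynchronize(afterCopy);                     // wait until everything finished

float kernelMs = 0.0f, copyMs = 0.0f;
cudaEventElapsedTime(&kernelMs, start, afterKernel); // kernel-only time
cudaEventElapsedTime(&copyMs, afterKernel, afterCopy); // copy-only time
printf("kernel: %f ms, copy back: %f ms\n", kernelMs, copyMs);

cudaEventDestroy(start);
cudaEventDestroy(afterKernel);
cudaEventDestroy(afterCopy);
```

Events are recorded into the same stream as the kernel and the copy, so the elapsed times reflect the actual device-side durations rather than host-side launch overhead.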
0.65 ms to copy 900 KB is about 1.4 GB/s. You should do better, I think; 3.5 to 5 GB/s is what people usually see. There's an app called bandwidthTest in the CUDA SDK, did you try running that?
Your calculation is very accurate. It seems like my device is slow. The test produced the following results:
Unpinned memory
Pinned memory
So this means I need to reduce the amount of data copied to the device as much as possible, since it's the bottleneck. Additionally, switching to pinned memory might also increase the speed drastically.
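For reference, a minimal sketch of what the pinned-memory variant could look like, assuming the same BYTE buffer and img_length as in the original code (the staging memcpy is only needed if the image arrives in a pageable buffer):

```cuda
// Sketch: using pinned (page-locked) host memory for faster host<->device copies.
// Assumes BYTE, img and img_length as defined in the original code.
BYTE* img_pinned;
cudaMallocHost(&img_pinned, img_length);   // page-locked allocation instead of malloc()
memcpy(img_pinned, img, img_length);       // stage the image once in pinned memory

BYTE* img_d;
cudaMalloc(&img_d, img_length);
cudaMemcpy(img_d, img_pinned, img_length, cudaMemcpyHostToDevice); // DMA from pinned memory

// ... kernel launch and copy back as before ...

cudaFreeHost(img_pinned);                  // pinned memory has its own free function
cudaFree(img_d);
```

Pinned memory lets the driver DMA directly instead of going through an internal staging buffer, which is where the bandwidth gain comes from; the trade-off is that page-locked memory is a scarcer resource than pageable memory.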
Here are the new values with your fix:
/edit
Why does allocating memory take that much time? Possibly due to very fragmented RAM on my device?
The first CUDA function you call will be slow because it also has to create the CUDA context. If you want to test this, time a second copy after the first one and see if it takes the same amount of time.
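The warm-up test described above could look like this (a sketch; the buffer name is arbitrary):

```cuda
// Sketch: separating context-creation overhead from steady-state copy time.
// Assumes BYTE, img and img_length as in the original code.
BYTE* buf_d;
cudaMalloc(&buf_d, img_length);  // first CUDA call: pays the context-creation cost

// First copy: may still include one-time setup costs.
cudaMemcpy(buf_d, img, img_length, cudaMemcpyHostToDevice);

// Second copy: time this one to see the real steady-state transfer speed.
cudaMemcpy(buf_d, img, img_length, cudaMemcpyHostToDevice);

cudaFree(buf_d);
```

If the second copy is much faster than the first, the difference is one-time initialization, not transfer bandwidth.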
As far as I can tell, cudaMalloc involves running some kind of kernel internally. In my tests its execution time scales with allocation size, and in inverse proportion to the GPU clock. It's not clear why this happens (the only thing that would explain the behavior is the device clearing the memory inside cudaMalloc, but the documentation explicitly says it does not do that). But that's the way things are. On a GeForce 560 (which is about 20 times faster than your Quadro 770M), the relationship is 0.048 ms per megabyte.
I tried debugging this a little; most of the time is spent inside a system function called D3DKMTCreateAllocation2, which, in turn, seems to call a kernel-mode driver. At that point, the trail goes cold. We’d have to ask people who wrote the driver.
Edit: and it’s only happening in Windows. The Linux driver does not have the problem, all calls to cudaMemcpy (except the first one) take 0.03-0.05 ms.
Wow, thanks a lot for your help! I was also experimenting some more. I now allocate the memory once and reuse it for each frame without deallocating it; the deallocation happens at the end of the stream. This mostly works, but I'm running into trouble when I combine two filters. I need to do some more testing, since it really doesn't make sense, so probably my test framework is buggy or something.
I guess it shouldn't be a problem to keep the device memory allocated across CUDA calls and reuse it without freeing it in between?
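A sketch of what I'm doing now; frames, frameCount and processFrame are placeholders for my actual video buffers and filter calls:

```cuda
// Sketch: allocate device memory once, reuse it for every frame, free at the end.
// frames, frameCount and processFrame are hypothetical names for illustration.
BYTE* img_d;
cudaMalloc(&img_d, img_length);            // allocate once, before the stream starts

for (int frame = 0; frame < frameCount; frame++) {
    cudaMemcpy(img_d, frames[frame], img_length, cudaMemcpyHostToDevice);
    processFrame(img_d, img_length);       // one or more filter kernel launches
    cudaMemcpy(frames[frame], img_d, img_length, cudaMemcpyDeviceToHost);
}

cudaFree(img_d);                           // free once, at the end of the stream
```

Device allocations persist until they are freed or the CUDA context is destroyed, so reusing a buffer this way avoids paying the cudaMalloc cost on every frame.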