Memory copy very slow memory copy, image

Hi,

I programmed a filter which simply inverts an image. I noticed that copying the image (900kb) to the device takes around 4 ms. I think I’m doing something wrong.

// Filter values

	int filterStart = startY * stride + startX * pixelSize;

	int filterWidth = stopX - startX;

	int filterHeight = stopY - startY;

	//// Threads properties

	int threads = 256;

	int blocks = filterHeight / threads + 1;

	// Debug information

	printf("blocks: %d threads: %d \n", blocks, threads);

	printf("invert => filterHeight: %d filterWidth: %d filterStart: %d pixelSize: %d stride: %d img_length: %d \n", filterHeight, filterWidth, filterStart, pixelSize, stride, img_length);

	

	// Allocate memory on device

	BYTE* img_d;

	cudaMalloc(&img_d, img_length);

	// Copy complete image to device

	cudaMemcpy(img_d, img, img_length, cudaMemcpyHostToDevice);

	//Call kernel function for each line

	invertCuda<<<blocks, threads>>>(img, filterHeight, filterWidth, filterStart, pixelSize, stride);

	printf("%s\n", cudaGetErrorString(cudaPeekAtLastError()));

	printf("%s\n", cudaGetErrorString(cudaThreadSynchronize()));

	// Copy image back to host

	cudaMemcpy(img, img_d, img_length, cudaMemcpyDeviceToHost);

	// Free memory

	cudaFree(img_d);

The first call to a CUDA function sets up the context, so it takes some extra time.

The thing is that I call this code from c# (via a dll). So it seems to need this extra time at each call. Is there any kind of work around for this?

It shouldn’t need to set up context every time if all calls are made from the same process. You need to double check, which line takes 4 ms?

If it’s cudaMalloc(), 4 ms for the first call is actually low (I see 50-60 ms on my system). Subsequent calls take fractions of millisecond.

If it’s cudaMemcpy(), 4 ms is very high, unless it’s on a very old hardware.

Thanks for your answer. I did some more tests and it seems that even allocating takes a hell of time.

My video card isn’t that old: Quadro FX 770M

Memory allocation (900kB) takes 1.5ms and every second time 3ms.

Copying the image takes around 0.7ms

Copying back the image takes around 1.5ms (why would this take double the time?)

Code which I used to measure:

__global__ void invertCuda(BYTE* img, int filterHeight, int filterWidth, int filterStart, int pixelSize, int stride) {

	// Calculate line, only x component is available

	int line = blockIdx.x * blockDim.x + threadIdx.x;

	// check boundaries

	if(line < filterHeight) {

		// Go through each pixel on the current line

		int currPixel;

		for(int i = 0; i < filterWidth * pixelSize; i++) {

			currPixel = filterStart + line * stride + i;

			img[currPixel] = 255 - img[currPixel];

		} 

	}

}

extern "C" __declspec(dllexport) void invert(BYTE* img, int startX, int startY, int stopX, int stopY, int pixelSize, int stride, int img_length) {

	LARGE_INTEGER li_start, li_stop, li_frequency;

	int time;

	QueryPerformanceFrequency(&li_frequency);

	QueryPerformanceCounter(&li_start);

	// Filter values

	int filterStart = startY * stride + startX * pixelSize;

	int filterWidth = stopX - startX;

	int filterHeight = stopY - startY;

	//// Threads properties

	int threads = 256;

	int blocks = filterHeight / threads + 1;

	QueryPerformanceCounter(&li_stop);

	time = int ((1000000 * (li_stop.QuadPart - li_start.QuadPart)) / li_frequency.QuadPart);

	printf("Time for initial calculations %d", time);

	// Debug information

	//printf("blocks: %d threads: %d \n", blocks, threads);

	//printf("invert => filterHeight: %d filterWidth: %d filterStart: %d pixelSize: %d stride: %d img_length: %d \n", filterHeight, filterWidth, filterStart, pixelSize, stride, img_length);

	

	QueryPerformanceFrequency(&li_frequency);

	QueryPerformanceCounter(&li_start);

	// Allocate memory on device

	BYTE* img_d;

	cudaMalloc(&img_d, img_length);

	QueryPerformanceCounter(&li_stop);

	time = int ((1000000 * (li_stop.QuadPart - li_start.QuadPart)) / li_frequency.QuadPart);

	printf("Time for allocating memory %d \n", time);

	// Copy complete image to device

	QueryPerformanceFrequency(&li_frequency);

	QueryPerformanceCounter(&li_start);

	

	cudaMemcpy(img_d, img, img_length, cudaMemcpyHostToDevice);

	

	QueryPerformanceCounter(&li_stop);

	time = int ((1000000 * (li_stop.QuadPart - li_start.QuadPart)) / li_frequency.QuadPart);

	printf("Time for copying memory %d \n", time);

	//Call kernel function for each line

	QueryPerformanceFrequency(&li_frequency);

	QueryPerformanceCounter(&li_start);

	

	invertCuda<<<blocks, threads>>>(img_d, filterHeight, filterWidth, filterStart, pixelSize, stride);

	QueryPerformanceCounter(&li_stop);

	time = int ((1000000 * (li_stop.QuadPart - li_start.QuadPart)) / li_frequency.QuadPart);

	printf("Time for calculating inverted image %d \n", time);

	//printf("%s\n", cudaGetErrorString(cudaPeekAtLastError()));

	//printf("%s\n", cudaGetErrorString(cudaThreadSynchronize()));

	// Copy image back to host

	QueryPerformanceFrequency(&li_frequency);

	QueryPerformanceCounter(&li_start);

	cudaMemcpy(img, img_d, img_length, cudaMemcpyDeviceToHost);

	QueryPerformanceCounter(&li_stop);

	time = int ((1000000 * (li_stop.QuadPart - li_start.QuadPart)) / li_frequency.QuadPart);

	printf("Time for copying memory back to host %d \n \n", time);

	// Free memory

	cudaFree(img_d);

}

Please see the image attached for more measurements.
cuda_performance.png

You’re not measuring time quite right.

The call invertCuda<<<blocks, threads>>> returns almost immediately, it does not wait for the operation to complete. Then when you call cudaMemcpy for the second time, first it waits for the kernel to finish, then it waits for the copy operation to complete. So the number you see in “Time for calculating inverted image” is just 30 microseconds and the number you see in “Time for copying memory back to host” is really the total time to invert + to copy.

You could get correct numbers by adding a call to cudaThreadSynchronize() after invertCuda<<<blocks, threads>>>.

0.65 ms to copy 900 kb is 1.4 Gb/s. You should do better, I think. 3.5 to 5 Gb/s is what people usually see. There’s an app called bandwidthTest in CUDA SDK, did you try to run that?

Thanks for your hints.

Your calculation is very accurate. It seems like my device is slow. The test produced the following results:

Unpinned memory

Pinned memory

So this means I need to reduce the amount of data copied to the device as much as possible, since it’s the bottle neck. Additonally switching to pinned memory might also increase the speed drastically.

Here are the new values with your fix:

/edit

Why does allocating memory take that much time? Possible due to very fragmented RAM on my device?

The first CUDA function you call will be slow because it also has to create the CUDA context. If you want to test this, time a second copy after the first one and see if it takes the same amount of time.

I know that. It’s 90ms. So the next calls for memory allocation are still like 1.4ms, which is a lot compared to the calculation.

As far as I can tell, cudaMalloc involves running some kind of kernel internally. In my tests its execution time scales with allocation size, and in inverse proportion to the GPU clock. Does not really make sense why this is happening (the only thing that could explain the behavior is if the device was clearing the memory inside cudaMalloc, but documentation says explicitly that it is not so.) But that’s the way things are. On a GeForce 560 (which is about 20 times faster than your Quadro 770M), the relationship is 0.048 ms per megabyte.

I tried debugging this a little; most of the time is spent inside a system function called D3DKMTCreateAllocation2, which, in turn, seems to call a kernel-mode driver. At that point, the trail goes cold. We’d have to ask people who wrote the driver.

Edit: and it’s only happening in Windows. The Linux driver does not have the problem, all calls to cudaMemcpy (except the first one) take 0.03-0.05 ms.

Wow thanks a lot for your help! I was also experimenting some more. I’m allocating the memory now once and use it for each frame without deallocating it. The deallocation happens in the end of the stream. This kind of works but I’m having trouble, if I combine two filters. But I need to do some more testing, since it really doesn’t make sense. So prolly my testframework is buggy or something.

I guess it shouldn’t be a problem not to deallocate memory after a CUDA call and to reuse it without freeing it?