Texture memory performance on a multi-GPU system: texture setup takes too long on some GPUs

Hi all,

I have an issue with texture memory performance on a multi-GPU system. My setup has four GTX 295 cards, for a total of 8 GPUs in one computer.

I wrote CUDA code for an image processing algorithm. To distribute work to the 8 GPUs, I split the input image evenly into 8 segments and make each segment a texture for its GPU, so there are 8 textures in total.

My code for preparing the textures follows; each GPU calls this function with its device ID:

void setupTexture(int texID, int iw, int ih)
{
	cudaChannelFormatDesc desc = cudaCreateChannelDesc<unsigned char>();

	cutilSafeCall(cudaMallocArray(&initArray, &desc, iw, ih));
	cutilSafeCall(cudaMemcpyToArray(initArray, 0, 0, frame, sizeof(unsigned char)*iw*ih, cudaMemcpyHostToDevice));

	switch (texID) {
		case 0: cutilSafeCall(cudaBindTextureToArray(frameTex0, initArray)); break;
		case 1: cutilSafeCall(cudaBindTextureToArray(frameTex1, initArray)); break;
		case 2: cutilSafeCall(cudaBindTextureToArray(frameTex2, initArray)); break;
		case 3: cutilSafeCall(cudaBindTextureToArray(frameTex3, initArray)); break;
		case 4: cutilSafeCall(cudaBindTextureToArray(frameTex4, initArray)); break;
		case 5: cutilSafeCall(cudaBindTextureToArray(frameTex5, initArray)); break;
		case 6: cutilSafeCall(cudaBindTextureToArray(frameTex6, initArray)); break;
		case 7: cutilSafeCall(cudaBindTextureToArray(frameTex7, initArray)); break;
	}
}



With this, I get some odd execution time results. Below are the results for a 512 x 512 image. As you can see, setting up the texture takes very little time on GPU0 but much longer on the rest, and the same pattern is always observed. With a 512 x 512 image each GPU gets a 64 x 64 texture, and I believe it should not take this long, for example 1861.027 ms on GPU2.

GPU0: Kernel time (ms.): 536.776978
GPU0: MemCopy time (ms.): 0.899000
GPU1: Kernel time (ms.): 532.778992
GPU1: MemCopy time (ms.): 716.260986
GPU2: Kernel time (ms.): 566.466980
GPU2: MemCopy time (ms.): 1861.026978
GPU3: Kernel time (ms.): 564.166016
GPU3: MemCopy time (ms.): 684.744995
GPU4: Kernel time (ms.): 772.661011
GPU4: MemCopy time (ms.): 698.025024
GPU5: Kernel time (ms.): 781.702026
GPU5: MemCopy time (ms.): 683.133972
GPU6: Kernel time (ms.): 564.013000
GPU6: MemCopy time (ms.): 684.008972
GPU7: Kernel time (ms.): 559.395020
GPU7: MemCopy time (ms.): 685.513000

With a 1024 x 1024 image (which means 128 x 128 texture per GPU), I got:

GPU0: Kernel time (ms.): 2134.320068
GPU0: MemCopy time (ms.): 2.591000
GPU1: Kernel time (ms.): 2146.308105
GPU1: MemCopy time (ms.): 2181.872070
GPU2: Kernel time (ms.): 2149.717041
GPU2: MemCopy time (ms.): 2183.366943
GPU3: Kernel time (ms.): 2121.158936
GPU3: MemCopy time (ms.): 2219.584961
GPU4: Kernel time (ms.): 3841.458008
GPU4: MemCopy time (ms.): 2186.680908
GPU5: Kernel time (ms.): 3853.925049
GPU5: MemCopy time (ms.): 2185.125000
GPU6: Kernel time (ms.): 2154.561035
GPU6: MemCopy time (ms.): 2184.000977
GPU7: Kernel time (ms.): 2155.851074
GPU7: MemCopy time (ms.): 2187.509033

I really cannot interpret what these results mean and don’t know what could cause the problem. Does anybody have an idea what is going on here?



If you “evenly split the input image into 8 segments”, then 512 x 512 is not 8 tiles of 64 x 64, just as 1024 x 1024 is not 8 tiles of 128 x 128.

Oops, I hadn't realized that. Thank you L F for pointing this out.
I looked into my code, and what I actually did to make the textures is this:
I made the whole input image a texture on each GPU, so every GPU gets a texture the same size as the input image.

So, for a 512 x 512 input, each GPU has a 512 x 512 texture, and the workload of each GPU is 512*(512/8) pixels.

I changed the code so that each GPU gets a W x (H/8) texture for a W x H input image. However, I still see a similar pattern in the results. To build its texture, each GPU reads the original image starting at row offset H/number_of_GPUs * deviceID, so I believe there should be no memory contention problem.

Have you initialized all the GPUs before running your code? And how do you calculate the time?

Yep. I initialize all the GPUs and reset them at the end of the program. I use an array of global timers with the cutStartTimer/cutStopTimer APIs.

FYI, I used the MonteCarloMultiGPU sample from the SDK as the base for my code.

for (i = 0; i < GPU_N; i++) {
	threadID[i] = cutStartThread((CUT_THREADROUTINE)solverThread, (void *)(plan + i));
}

cutWaitForThreads(threadID, GPU_N);

and solverThread looks like:

static CUT_THREADPROC solverThread(TGPUplan *plan)
{
	// Set device
	cutilSafeCall( cudaSetDevice(plan->device) );

	// Set texture
	cutStartTimer( hMemcpyTimer[plan->device] );
	setupTexture(plan->device, imWidth, imHeight);
	cutStopTimer( hMemcpyTimer[plan->device] );

	// Perform GPU computations
	cutStartTimer( hKernelTimer[plan->device] );
	launch_Kernel(plan, h_result, h_basis, h_idx, h_coefNumInfo, sharedMemNeeded);
	cutStopTimer( hKernelTimer[plan->device] );

	CUT_THREADEND;
}


Thanks for helping me!

It’s not clear what launch_Kernel does. Are you waiting inside it until the kernel has finished?

Thanks for the input. launch_Kernel sets up some constant memory and launches the actual kernel.

I think cudaStreamSynchronize(0) ensures the GPUs are synchronized.
I did a small experiment, putting cudaThreadSynchronize() inside launch_Kernel, but it made no difference. Could this be a hardware (board) problem?

Another question: is launch_Kernel copying a result back to the host? In that case the transfers may overlap.

And if you don't call cudaDeviceSynchronize() before cutStopTimer(), you will be timing the kernel launch, not its execution.

Oh, I dropped cudaDeviceSynchronize() before cutStopTimer( hKernelTimer[plan->device] ) in the code snippet above, although I do have it in my actual code.

I do indeed copy results back to the CPU in launch_Kernel. Each GPU accesses host memory, say h_A, at a different offset, the same offset used for segmenting the input image. E.g., for a 512 x 512 image, GPU deviceID writes to h_A + 512*(512/8)*deviceID*sizeof(char).

I thought this was causing the problem, so I ran the code with the host memory access commented out. However, I still see similar numbers and the same pattern. :wallbash:

Thank you L F, your advice is really helpful.

I think this is related to bus latency in the communication between the GPUs and the CPU, interrupt handling in the OS (I am using Linux), and some other factors. I can't explain it further yet; I need to explore more. Thanks again L F!