texture memory performance on a multiGPU system takes too much time to setup a texture for some GPUs

mya1114 · February 15, 2012, 7:46am

Hi all,

I have an issue with the performance of texture memory on a multiGPU system. My setup is of four GTX 295 cards, so a total of 8 GPUs in a computer.

I wrote a CUDA code for some image processing algorithm. When I distribute works to the 8 GPUs, I evenly split the input image into 8 segments and make each segment as a texture for each GPU. Therefore, there are 8 textures in total.

My code for preparing the textures follows, and each GPU call this function with its device ID:

void setupTexture(int texID, int iw, int ih)

{

   	cudaChannelFormatDesc desc;

	desc = cudaCreateChannelDesc<unsigned char>();

	cutilSafeCall(cudaMallocArray(&initArray, &desc, iw, ih));    

	cutilSafeCall(cudaMemcpyToArray(initArray, 0, 0, frame, sizeof(unsigned char)*iw*ih, cudaMemcpyHostToDevice));

	switch(texID)

	{

		case 0:

		cutilSafeCall(cudaBindTextureToArray(frameTex0, initArray)); break;

		case 1:

		cutilSafeCall(cudaBindTextureToArray(frameTex1, initArray)); break;

		case 2:

		cutilSafeCall(cudaBindTextureToArray(frameTex2, initArray)); break;

		case 3:

		cutilSafeCall(cudaBindTextureToArray(frameTex3, initArray)); break;

		case 4:

		cutilSafeCall(cudaBindTextureToArray(frameTex4, initArray)); break;

		case 5:

		cutilSafeCall(cudaBindTextureToArray(frameTex5, initArray)); break;

		case 6:

		cutilSafeCall(cudaBindTextureToArray(frameTex6, initArray)); break;

		case 7:

		cutilSafeCall(cudaBindTextureToArray(frameTex7, initArray)); break;

	}

}

With this, I get some weird execution time results. Following is the test results with a 512 x 512 image: As you can see, it does not take that much time for GPU0 to setup a texture, but it does for the rest. The same pattern is always observed. With 512 x 512 image, each GPU gets a texture of 64 x 64 and I believe it should not take this much time, for example, 1861.027 msec in GPU2.

GPU0: Kernel time (ms.): 536.776978

GPU0: MemCopy time (ms.): 0.899000

GPU1: Kernel time (ms.): 532.778992

GPU1: MemCopy time (ms.): 716.260986

GPU2: Kernel time (ms.): 566.466980

GPU2: MemCopy time (ms.): 1861.026978

GPU3: Kernel time (ms.): 564.166016

GPU3: MemCopy time (ms.): 684.744995

GPU4: Kernel time (ms.): 772.661011

GPU4: MemCopy time (ms.): 698.025024

GPU5: Kernel time (ms.): 781.702026

GPU5: MemCopy time (ms.): 683.133972

GPU6: Kernel time (ms.): 564.013000

GPU6: MemCopy time (ms.): 684.008972

GPU7: Kernel time (ms.): 559.395020

GPU7: MemCopy time (ms.): 685.513000

With a 1024 x 1024 image (which means 128 x 128 texture per GPU), I got:

GPU0: Kernel time (ms.): 2134.320068

GPU0: MemCopy time (ms.): 2.591000

GPU1: Kernel time (ms.): 2146.308105

GPU1: MemCopy time (ms.): 2181.872070

GPU2: Kernel time (ms.): 2149.717041

GPU2: MemCopy time (ms.): 2183.366943

GPU3: Kernel time (ms.): 2121.158936

GPU3: MemCopy time (ms.): 2219.584961

GPU4: Kernel time (ms.): 3841.458008

GPU4: MemCopy time (ms.): 2186.680908

GPU5: Kernel time (ms.): 3853.925049

GPU5: MemCopy time (ms.): 2185.125000

GPU6: Kernel time (ms.): 2154.561035

GPU6: MemCopy time (ms.): 2184.000977

GPU7: Kernel time (ms.): 2155.851074

GPU7: MemCopy time (ms.): 2187.509033

I really cannot interpret what these results mean and donâ€™t know what could cause the problem. Does anybody have an idea what is going on here?

Thanks,

Mya1114

L_F · February 15, 2012, 3:50pm

If you “evenly split the input image into 8 segments”, then 512x512 is not 64x64x8, as well as 1024x1024 is not 1281288.

mya1114 · February 15, 2012, 5:36pm

Oops, I realized that. Thank you L F for pointing out this.
I looked into my code, and actually what I did to make a texture is;
I made a whole input image as a texture to each GPU, so all GPUs got textures in the same size of the input image.

So, for a 512 x 512 input, each GPU has a 512 x 512 texture. The workload of each GPU is 512*(512/8) pixels.

added:
I changed the code to make each GPU have a W*(H/8) texture given a W x H input image. However I still get the similar pattern of results. To make a texture, each GPU accesses the original image with the offset of H/(number_of_GPU)*deviceID to the row index, so I believe there would be no memory contention problem.

L_F · February 15, 2012, 7:51pm

Have you initialized all the GPUs before running your code? And how do you calculate the time?

mya1114 · February 15, 2012, 8:21pm

Yep. I initialize all the GPUs and reset them at the end of the program. I use an array of global timer with cutStartTimer/cutStopTimer APIs.

FYI, I used MonteCarloMultiGPU sample from the SDK as a base code of mine.

for(i = 0; i < GPU_N; i++) {

        	threadID[i] = cutStartThread((CUT_THREADROUTINE)solverThread, (void *)(plan + i));

	}

cutWaitForThreads(threadID, GPU_N);

and solverThread looks like:

static CUT_THREADPROC solverThread(TGPUplan *plan)

{		

	 //Set device

	cutilSafeCall( cudaSetDevice(plan->device) );

....

//Set texture	

	cutStartTimer( hMemcpyTimer[plan->device] );

	setupTexture(plan->device, imWidth, imHeight);

	cutStopTimer( hMemcpyTimer[plan->device] );

	   

    	//Perform GPU computations  

	cutStartTimer( hKernelTimer[plan->device] );

	launch_Kernel(plan, h_result, h_basis, h_idx, h_coefNumInfo, sharedMemNeeded);

        cudaDeviceSynchronize()

        cutStopTimer( hKernelTimer[plan->device] );

Â  Â  Â  Â  ....

        cudaStreamSynchronize(0);

        cutilDeviceReset();

}

Thanks for helping me!

L_F · February 15, 2012, 8:43pm

It’s not clear what does the launch_Kernel do. Are you waiting inside it until the kernel finished?

mya1114 · February 15, 2012, 8:57pm

Thanks for the input . The launch_Kernel sets up some constant memory, and launches an actual kernel code.

I think cudaStreamSynchronize(0) ensures the synchronization of GPUs.
I did a little experiment by putting cudaThreadSynchronize() inside the launch_Kernel, but it made no difference. Could this be a hardware(board) problem?

L_F · February 15, 2012, 9:14pm

Another question is if the launch_kernel is copying a result back to host, in this case you may have overlapped and .

And if you have not cudaDeviceSynchronize() before cutStopTimer(), it will be time for launching the kernel, not for executing.

mya1114 · February 15, 2012, 10:44pm

Oh I dropped cudaDeviceSynchronize() before the cutStopTimer( hKernelTimer[plan->device] ) in the above code snippet although I have it in my actual code.

I indeed copy back results to CPU in the launch_Kernel. Each GPU accesses to a host memory, say h_A, with a different offset. This offset is the same as the one for segmenting an input image. e.g., For a 512 x 512 image, GPU deviceID writes to h_A+512*(512/8)deviceIDsizeof(char).

I thought that this caused the problem, so ran the code with commenting out the part for the host memory access. However I still see the similar numbers and patterns. External Image

Thank you L F, your advice is really helpful.

mya1114 · February 16, 2012, 1:43am

I think this is related to a bus latency issue for communication among multiGPUs and CPU, interrupt handling in OS (I am using Linux), and some other factors. Can’t explain further, I need more exploration. Thanks again L F!

Topic		Replies	Views
Slow multigpu When converting the code to run as multigpu, the kernel is much slower CUDA Programming and Performance	2	4513	May 1, 2010
Memory allocation bug? using several textures CUDA Programming and Performance	3	2804	July 17, 2009
Decreased performance when using textures CUDA Programming and Performance	2	508	April 8, 2019
Streams and Texture Memory CUDA Programming and Performance	4	2887	August 18, 2011
Kernel optimization problem : Texture memory CUDA Programming and Performance	0	2781	June 10, 2010
memory allocation bug? CUDA Programming and Performance	1	1882	March 27, 2007
Texture Memory Are there any limitations to texture memory CUDA Programming and Performance	5	2614	January 22, 2012
multiGPU poor performance up to 10x lowest performance in multiGPU CUDA Programming and Performance	14	10962	January 18, 2008
CUDA texture memory performance CUDA Programming and Performance	4	33701	January 13, 2009
Accessing texture reference from mult CPU threads on single GPU CUDA Programming and Performance	0	689	April 23, 2010

texture memory performance on a multiGPU system takes too much time to setup a texture for some GPUs

Related topics