Hi all,
I have an issue with the performance of texture memory on a multi-GPU system. My setup has four GTX 295 cards, for a total of 8 GPUs in one machine.
I wrote CUDA code for an image processing algorithm. To distribute the work across the 8 GPUs, I split the input image evenly into 8 segments and bind each segment as a texture on its own GPU, so there are 8 textures in total.
My code for preparing the textures follows; each GPU calls this function with its device ID:
// initArray, frame, and frameTex0..frameTex7 are declared at file scope;
// frame points to the host-side image segment for this GPU.
void setupTexture(int texID, int iw, int ih)
{
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<unsigned char>();
    cutilSafeCall(cudaMallocArray(&initArray, &desc, iw, ih));
    cutilSafeCall(cudaMemcpyToArray(initArray, 0, 0, frame,
                                    sizeof(unsigned char) * iw * ih,
                                    cudaMemcpyHostToDevice));

    // Texture references are file-scope symbols, so I pick the one
    // that belongs to this GPU.
    switch (texID)
    {
        case 0: cutilSafeCall(cudaBindTextureToArray(frameTex0, initArray)); break;
        case 1: cutilSafeCall(cudaBindTextureToArray(frameTex1, initArray)); break;
        case 2: cutilSafeCall(cudaBindTextureToArray(frameTex2, initArray)); break;
        case 3: cutilSafeCall(cudaBindTextureToArray(frameTex3, initArray)); break;
        case 4: cutilSafeCall(cudaBindTextureToArray(frameTex4, initArray)); break;
        case 5: cutilSafeCall(cudaBindTextureToArray(frameTex5, initArray)); break;
        case 6: cutilSafeCall(cudaBindTextureToArray(frameTex6, initArray)); break;
        case 7: cutilSafeCall(cudaBindTextureToArray(frameTex7, initArray)); break;
    }
}
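Incidentally, I suspect the eight separate texture references may be unnecessary: as far as I understand, a texture reference is per-context state, and since each GPU is driven by its own host thread (and therefore its own context), every thread could bind the same file-scope reference without clashing. A sketch of that simplification (untested; assumes the same file-scope `initArray`, `frame`, and `cutilSafeCall` as above, with a single `frameTex` replacing `frameTex0..7`):

```cpp
// Untested sketch: one texture reference shared by all host threads.
// With one host thread per device, each thread owns its own CUDA
// context, so binding the same file-scope reference does not clash.
texture<unsigned char, 2, cudaReadModeElementType> frameTex;

void setupTexture(int iw, int ih)
{
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<unsigned char>();
    cutilSafeCall(cudaMallocArray(&initArray, &desc, iw, ih));
    cutilSafeCall(cudaMemcpyToArray(initArray, 0, 0, frame,
                                    sizeof(unsigned char) * iw * ih,
                                    cudaMemcpyHostToDevice));
    // No switch needed: each thread's context sees only its own binding.
    // NOTE: initArray and frame would also need to be per-thread rather
    // than single shared globals for this to be safe.
    cutilSafeCall(cudaBindTextureToArray(frameTex, initArray));
}
```

This would not explain the timing differences by itself, but it removes the texID plumbing.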
With this, I get some weird execution times. Below are the results for a 512 x 512 image (so, by my splitting, each GPU gets a 64 x 64 texture). As the numbers show, setting up the texture costs almost nothing on GPU0 but a lot on every other GPU, and the same pattern appears on every run. For a texture this small, 1861.027 ms on GPU2 seems far too long.
GPU0: Kernel time (ms.): 536.776978
GPU0: MemCopy time (ms.): 0.899000
GPU1: Kernel time (ms.): 532.778992
GPU1: MemCopy time (ms.): 716.260986
GPU2: Kernel time (ms.): 566.466980
GPU2: MemCopy time (ms.): 1861.026978
GPU3: Kernel time (ms.): 564.166016
GPU3: MemCopy time (ms.): 684.744995
GPU4: Kernel time (ms.): 772.661011
GPU4: MemCopy time (ms.): 698.025024
GPU5: Kernel time (ms.): 781.702026
GPU5: MemCopy time (ms.): 683.133972
GPU6: Kernel time (ms.): 564.013000
GPU6: MemCopy time (ms.): 684.008972
GPU7: Kernel time (ms.): 559.395020
GPU7: MemCopy time (ms.): 685.513000
With a 1024 x 1024 image (which means 128 x 128 texture per GPU), I got:
GPU0: Kernel time (ms.): 2134.320068
GPU0: MemCopy time (ms.): 2.591000
GPU1: Kernel time (ms.): 2146.308105
GPU1: MemCopy time (ms.): 2181.872070
GPU2: Kernel time (ms.): 2149.717041
GPU2: MemCopy time (ms.): 2183.366943
GPU3: Kernel time (ms.): 2121.158936
GPU3: MemCopy time (ms.): 2219.584961
GPU4: Kernel time (ms.): 3841.458008
GPU4: MemCopy time (ms.): 2186.680908
GPU5: Kernel time (ms.): 3853.925049
GPU5: MemCopy time (ms.): 2185.125000
GPU6: Kernel time (ms.): 2154.561035
GPU6: MemCopy time (ms.): 2184.000977
GPU7: Kernel time (ms.): 2155.851074
GPU7: MemCopy time (ms.): 2187.509033
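I am currently timing the copies with CPU-side timers. One thing I still want to rule out is measurement and paging effects, i.e. timing the copy with CUDA events and staging the data in a pinned (page-locked) host buffer instead of pageable memory. A sketch of that variant (untested; `frame` and `cutilSafeCall` are the same globals as above, and `pinnedFrame` / `timedCopyToArray` are names I just made up):

```cpp
#include <string.h>  // memcpy

// Untested sketch: event-based timing of the host-to-array copy,
// with a page-locked staging buffer (cudaMallocHost) so the copy
// is a true DMA transfer rather than a pageable-memory copy.
float timedCopyToArray(cudaArray *arr, int iw, int ih)
{
    unsigned char *pinnedFrame;           // hypothetical pinned buffer
    size_t bytes = sizeof(unsigned char) * iw * ih;
    cutilSafeCall(cudaMallocHost((void **)&pinnedFrame, bytes));
    memcpy(pinnedFrame, frame, bytes);    // stage this GPU's segment

    cudaEvent_t start, stop;
    cutilSafeCall(cudaEventCreate(&start));
    cutilSafeCall(cudaEventCreate(&stop));
    cutilSafeCall(cudaEventRecord(start, 0));
    cutilSafeCall(cudaMemcpyToArray(arr, 0, 0, pinnedFrame, bytes,
                                    cudaMemcpyHostToDevice));
    cutilSafeCall(cudaEventRecord(stop, 0));
    cutilSafeCall(cudaEventSynchronize(stop));

    float ms = 0.0f;
    cutilSafeCall(cudaEventElapsedTime(&ms, start, stop));
    cutilSafeCall(cudaEventDestroy(start));
    cutilSafeCall(cudaEventDestroy(stop));
    cutilSafeCall(cudaFreeHost(pinnedFrame));
    return ms;  // milliseconds measured with GPU events, not the CPU clock
}
```

If the event-measured times come out small, that would point at the CPU-side timing (or contention during paging) rather than the GPU itself.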
I really cannot interpret what these results mean and don’t know what could cause the problem. Does anybody have an idea what is going on here?
Thanks,
Mya1114