Cuda Convolution - best memory usage

ErisDyscordia · March 28, 2012, 4:17pm

Okay, I have written my own little separable convolution program that has two separate kernels (one for row convolution and one for column convolution). It is very similar to how the C version works in the SDK, except written in parallel (the loop-unrolling version in the SDK did not fit what I needed exactly). Basically, the user calls the program and tells it how many times to run the convolution. Currently it copies the array (call it arrayA) over to the GPU, saves the row result in arrayB on the GPU, then passes arrayB to the second kernel, which saves its result back to arrayA. Then arrayA is copied back to the host and returned to the user. If the user wants more than one run, the whole thing starts over (copying the new arrayA to the GPU, etc.). This means that for 600 convolutions (the low end of what I am doing), there are 1200 copies (600 each way) between the GPU and CPU. Is there a better way to deal with the memory for this?

I don’t know how big the images are beforehand, so I couldn’t figure out a way to just make it all sit in global memory, but maybe I am overthinking it. I am very new to CUDA.

This is the loop:

int counter;

for (counter = 0; counter < numSmooths; counter++)
{
    cutilSafeCall( cudaMemcpy(d_DataA, h_DataA, DATA_SIZE, cudaMemcpyHostToDevice) );

    cutilSafeCall( cudaThreadSynchronize() );

    convolutionRowGPU<<<blocks, threads>>>(
        d_DataB,
        d_DataA,
        d_Kernel,
        DATA_W,
        DATA_H,
        KERNEL_R
    );
    cutilCheckMsg("convolutionRowGPU() execution failed\n");

    convolutionColumnGPU<<<blocks, threads>>>(
        d_DataA,
        d_DataB,
        d_Kernel,
        DATA_W,
        DATA_H,
        KERNEL_R
    );
    cutilCheckMsg("convolutionColumnGPU() execution failed\n");

    cutilSafeCall( cudaThreadSynchronize() );

    cutilSafeCall( cudaMemcpy(h_DataA, d_DataA, DATA_SIZE, cudaMemcpyDeviceToHost) );
}

Gilles_C · March 29, 2012, 5:14am

Hi,
Since I’m not familiar with the type of computation you are doing, I might say something plain stupid, so please forgive me if I do.
That said, if the code snippet you gave actually reflects your algorithm, and if you don’t touch “h_DataA” inside your loop, why don’t you just move both cudaMemcpy calls (H2D and D2H) outside of the loop? The H2D copy would go right before the loop and the D2H copy right after it. That way, no unnecessary data transfer would occur…
While writing this, it looks so obvious that I must be missing something here…
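
For illustration, here is a rough sketch of what that could look like, with one upload before the loop and one download after it. The two kernels are simplified stand-ins (zero-padded borders, one thread per pixel) because the real ones are not shown in this thread, and smoothOnDevice and CHECK are made-up names. The buffers are sized with cudaMalloc at run time, so the image dimensions do not need to be known in advance.

```cuda
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical stand-in for cutilSafeCall: abort if a runtime call fails.
#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err_)); \
            exit(EXIT_FAILURE);                                         \
        }                                                               \
    } while (0)

// Simplified row pass (assumption: pixels outside the image count as zero).
__global__ void convolutionRowGPU(float *dst, const float *src,
                                  const float *kernel, int w, int h, int r)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    float sum = 0.0f;
    for (int k = -r; k <= r; ++k) {
        int xx = x + k;
        if (xx >= 0 && xx < w) sum += src[y * w + xx] * kernel[r - k];
    }
    dst[y * w + x] = sum;
}

// Simplified column pass, same border assumption as the row pass.
__global__ void convolutionColumnGPU(float *dst, const float *src,
                                     const float *kernel, int w, int h, int r)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    float sum = 0.0f;
    for (int k = -r; k <= r; ++k) {
        int yy = y + k;
        if (yy >= 0 && yy < h) sum += src[yy * w + x] * kernel[r - k];
    }
    dst[y * w + x] = sum;
}

// Runs numSmooths row+column passes on h_Data in place, with a single
// host-to-device copy before the loop and a single device-to-host copy after.
void smoothOnDevice(float *h_Data, const float *h_Kernel,
                    int w, int h, int r, int numSmooths)
{
    assert(w > 0 && h > 0 && r >= 0 && numSmooths >= 0);
    size_t dataBytes = (size_t)w * h * sizeof(float);
    size_t kernelBytes = (size_t)(2 * r + 1) * sizeof(float);

    float *d_DataA = NULL, *d_DataB = NULL, *d_Kernel = NULL;
    CHECK(cudaMalloc((void **)&d_DataA, dataBytes));   // sized at run time
    CHECK(cudaMalloc((void **)&d_DataB, dataBytes));
    CHECK(cudaMalloc((void **)&d_Kernel, kernelBytes));

    CHECK(cudaMemcpy(d_DataA, h_Data, dataBytes, cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(d_Kernel, h_Kernel, kernelBytes, cudaMemcpyHostToDevice));

    dim3 threads(16, 16);
    dim3 blocks((w + threads.x - 1) / threads.x, (h + threads.y - 1) / threads.y);

    for (int i = 0; i < numSmooths; ++i) {
        // A -> B (rows), then B -> A (columns); the result stays on the GPU
        // in d_DataA and is the input of the next iteration.
        convolutionRowGPU<<<blocks, threads>>>(d_DataB, d_DataA, d_Kernel, w, h, r);
        convolutionColumnGPU<<<blocks, threads>>>(d_DataA, d_DataB, d_Kernel, w, h, r);
    }
    CHECK(cudaGetLastError());   // catches launch errors from the loop

    // A blocking cudaMemcpy on the default stream waits for the kernels above.
    CHECK(cudaMemcpy(h_Data, d_DataA, dataBytes, cudaMemcpyDeviceToHost));

    CHECK(cudaFree(d_DataA));
    CHECK(cudaFree(d_DataB));
    CHECK(cudaFree(d_Kernel));
}
```

Kernels launched into the same stream run in issue order, so no synchronize call is needed between the row and column passes or between iterations. With this layout, 600 smooths cost two image copies instead of 1200.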

ErisDyscordia · March 30, 2012, 1:53pm

Wow. Okay, I see what you are saying. For some reason I had it in my head that to restart the loop with the resulting image from the previous iteration, I had to copy it over again…

I am very new at this, thanks!

pasoleatis · March 30, 2012, 5:16pm

Take a look at the CUDA C Best Practices Guide on the NVIDIA website. It gives a good idea of what to look out for in order to get good performance.
