Copies between CPU and GPU

Hi everyone,
I faced the following problem. My algorithm has two parts that operate on the same data: the first part runs on the CPU (there is no chance to parallelize it, and a sequential implementation on the GPU is very, very slow), and the second part runs on the GPU and is very fast. But since the second part needs results from the first part, I have to copy data from the host to the device using cudaMemcpy, and these copies slow down the whole algorithm significantly. Of course I’m using pinned memory for the data that is transferred to the device. But are there other tricks for accelerating CPU <-> GPU copies?

The second problem is a consequence of the first one. To get any benefit from using the GPU I’m using 16-bit integers instead of 32-bit (then I have less data to transfer). But using 16 bits brings another limit to my algorithm, since values can be higher than 16 535. A 24-bit integer would solve my problem (because, as I said, with 32 bits I have to copy a lot), but this format doesn’t exist. Or maybe someone has experience using 24-bit integers on the GPU?

Thanks a lot in advance for any advice!

I’m sure you could hack together 24-bit integers using bitmasks. I’ve never done it myself, though.

Also… 16 bit ints can have values up to 65535 - I’m guessing you typoed?

Have you tried hiding some of the latency by overlapping kernel execution and memory transfers?


Hi, thanks. By overlapping you mean async copies and kernel executions using streams, I suppose? I haven’t actually tried it yet, but I will and let you know whether it helps or not.

I mean that my values can be bigger than 65535…

Yep, that’s what I mean. I’ve seen it mentioned here that H2D copies and kernel execution can only overlap for linear transfers using cudaMemcpy, not when copying to device arrays or pitch-linear memory, although I haven’t found any mention of this in the programming guide.


Hi again,

actually I tried using streams to overlap memory copies between CPU and GPU, but I ran into some problems. My code is based on NVIDIA’s “simpleStreams” SDK example:


// h_Labels (height*width) and h_Table (height*width/4) are input arrays

int nstreams = 4;
int n = height * width;
int nbytes = n * sizeof(unsigned short);
dim3 threads, blocks;
float elapsed_time;

unsigned short *d_Labels = 0;
unsigned short *d_Table = 0;
SC( cudaMalloc((void **)&d_Labels, height * width * sizeof(unsigned short)) );
SC( cudaMalloc((void **)&d_Table, height * width * sizeof(unsigned short)) );

cudaStream_t *streams = (cudaStream_t *) malloc(nstreams * sizeof(cudaStream_t));
for(int i = 0; i < nstreams; i++)
    cutilSafeCall( cudaStreamCreate(&(streams[i])) );

cudaEvent_t start_event, stop_event;
cutilSafeCall( cudaEventCreate(&start_event) );
cutilSafeCall( cudaEventCreate(&stop_event) );
cudaEventRecord(start_event, 0);

// asynchronously copy each portion of the input data in its own stream
for(int i = 0; i < nstreams; i++){
    cudaMemcpyAsync(d_Labels + i*n/nstreams, h_Labels + i*n/nstreams, nbytes/nstreams, cudaMemcpyHostToDevice, streams[i]);
    cudaMemcpyAsync(d_Table + i*n/nstreams, h_Table + i*n/nstreams, nbytes/nstreams, cudaMemcpyHostToDevice, streams[i]);
}

threads = dim3(512, 1);
blocks = dim3(n / (nstreams * threads.x), 1);

// asynchronously launch nstreams kernels, each operating on its own portion of data
for(int i = 0; i < nstreams; i++)
    DEVICE_Labelling_fast_async<<<blocks, threads, 0, streams[i]>>>(d_Labels + i*n/nstreams, d_Table + i*n/nstreams, height, width);

for(int i = 0; i < nstreams; i++)
    cudaMemcpyAsync(h_Labels + i*n/nstreams, d_Labels + i*n/nstreams, nbytes/nstreams, cudaMemcpyDeviceToHost, streams[i]);

cudaEventRecord(stop_event, 0);
cudaEventSynchronize(stop_event); // wait for all streams to finish before timing
cutilSafeCall( cudaEventElapsedTime(&elapsed_time, start_event, stop_event) );
std::cout << "elapsed_time = " << elapsed_time << std::endl;

for(int i = 0; i < nstreams; i++)
    cudaStreamDestroy(streams[i]);

The kernel code:

__global__ void DEVICE_Labelling_fast_async(unsigned short *d_Data, unsigned short *d_Table, int height, int width)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    *(d_Data + idx) = *(d_Table + *(d_Data + idx));
}

The problem is that only 1/4 of the Labels array is processed correctly. If I run the code above with only one stream (streams[0]), the output is correct. What could be the problem? Maybe I’m using streams incorrectly? I’m using CUDA 2.1.

Thanks a lot in advance!

Recently I installed CUDA 2.3 and tried the new features like write-combining memory and mapped memory. I’m using a GTX 295. Using mapped memory doesn’t help in my case; my program became even slower. I think it’s just a programming convenience, because the data still has to be transferred to the device, doesn’t it?

So the fastest way to copy data to the device is to use pinned host memory allocated with the “cudaHostAllocWriteCombined” flag. Should it really be like that?

Thanks for any advice!

Mapped memory is useful if you have an integrated graphics card, since then the memory you’re using is physically the same. For non-integrated cards it’s slower.

Write-combined memory is meant to be faster for host-to-device transfers, but if you want to read it back from the host side at any point, those reads will be slow.
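A minimal sketch of the allocation pattern being discussed, assuming a CUDA 2.3+ toolchain (the buffer names and size are made up). The key point is that the host code only ever writes into the write-combined buffer before copying:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t n = 1 << 19; // hypothetical element count
    const size_t nbytes = n * sizeof(unsigned short);

    // Pinned, write-combined host memory: fast for H2D copies,
    // but very slow if the CPU ever reads from it.
    unsigned short *h_buf = 0;
    cudaHostAlloc((void **)&h_buf, nbytes, cudaHostAllocWriteCombined);

    unsigned short *d_buf = 0;
    cudaMalloc((void **)&d_buf, nbytes);

    // Fill the buffer with writes only -- never read h_buf back on the host.
    for (size_t i = 0; i < n; i++)
        h_buf[i] = (unsigned short)i;

    cudaMemcpy(d_buf, h_buf, nbytes, cudaMemcpyHostToDevice);

    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```

If the host needs to read results back (like the D2H copy of h_Labels above), that buffer should be plain pinned memory (cudaHostAllocDefault), not write-combined.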