Copies between CPU and GPU

Hi everyone,
I faced the following problem. My algorithm has two parts that operate on the same data: the first part runs on the CPU (there is no chance to parallelize it, and a sequential implementation on the GPU is very, very slow), and the second part runs on the GPU and is very fast. But since the second part needs results from the first part, I have to copy data from the host to the device using cudaMemcpy, and these copies slow down the whole algorithm significantly. Of course I’m using pinned memory for the data that is transferred to the device. But are there other tricks for accelerating CPU <-> GPU copies?

The second problem is a consequence of the first one. To get any benefit from using the GPU I’m using 16-bit integers instead of 32-bit (then I have less data to transfer). But using 16 bits brings another limit to my algorithm, since values can be higher than 16 535. A 24-bit integer would solve my problem (because, as I said, with 32 bits I have to copy a lot), but this format doesn’t exist. Or maybe someone has experience using 24-bit integers on the GPU?

Thanks a lot in advance for any advice!

I’m sure you could hack together 24-bit integers using bitmasks. I’ve never done it myself, though.

Also… 16 bit ints can have values up to 65535 - I’m guessing you typoed?

Have you tried hiding some of the latency by overlapping kernel execution and memory transfers?


Hi, thanks. By overlapping you mean async copies and kernel executions using streams, I suppose? I haven’t actually tried it yet, but I will and let you know whether it helps or not.

I mean that my values can be bigger than 65535…

Yep, that’s what I mean. I’ve seen it mentioned here that H2D copies and kernel execution can only overlap for linear transfers using cudaMemcpy, not when copying to device arrays or pitch-linear memory, although I haven’t found any mention of this in the programming guide.


Hi again,

actually I tried using streams to overlap memory copies between CPU and GPU, but I ran into some problems. My code is based on NVIDIA’s “simpleStreams” SDK example:


// h_Labels (height*width) and h_Table (height*width/4) are input arrays

int nstreams = 4;
int n = height * width;
int nbytes = n * sizeof(unsigned short);
dim3 threads, blocks;
float elapsed_time;

unsigned short *d_Labels = 0;
unsigned short *d_Table = 0;
SC( cudaMalloc((void **)&d_Labels, height * width * sizeof(unsigned short)) );
SC( cudaMalloc((void **)&d_Table, height * width * sizeof(unsigned short)) );

cudaStream_t *streams = (cudaStream_t *) malloc(nstreams * sizeof(cudaStream_t));
for(int i = 0; i < nstreams; i++)
    cutilSafeCall( cudaStreamCreate(&(streams[i])) );

cudaEvent_t start_event, stop_event;
cutilSafeCall( cudaEventCreate(&start_event) );
cutilSafeCall( cudaEventCreate(&stop_event) );
cudaEventRecord(start_event, 0);

// asynchronously copy each portion of the input data in its own stream
for(int i = 0; i < nstreams; i++){
    cudaMemcpyAsync(d_Labels + i*n/nstreams, h_Labels + i*n/nstreams, nbytes/nstreams, cudaMemcpyHostToDevice, streams[i]);
    cudaMemcpyAsync(d_Table + i*n/nstreams, h_Table + i*n/nstreams, nbytes/nstreams, cudaMemcpyHostToDevice, streams[i]);
}

threads = dim3(512, 1);
blocks = dim3(n / (nstreams * threads.x), 1);

// asynchronously launch nstreams kernels, each operating on its own portion of data
for(int i = 0; i < nstreams; i++)
    DEVICE_Labelling_fast_async<<<blocks, threads, 0, streams[i]>>>(d_Labels + i*n/nstreams, d_Table + i*n/nstreams, height, width);

for(int i = 0; i < nstreams; i++)
    cudaMemcpyAsync(h_Labels + i*n/nstreams, d_Labels + i*n/nstreams, nbytes/nstreams, cudaMemcpyDeviceToHost, streams[i]);

cudaEventRecord(stop_event, 0);
cudaEventSynchronize(stop_event); // wait for all streams to finish before timing
cutilSafeCall( cudaEventElapsedTime(&elapsed_time, start_event, stop_event) );
std::cout << "elapsed_time = " << elapsed_time << std::endl;

for(int i = 0; i < nstreams; i++)
    cudaStreamDestroy(streams[i]);

The kernel code:

__global__ void DEVICE_Labelling_fast_async(unsigned short *d_Data, unsigned short *d_Table, int height, int width)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    *(d_Data + idx) = *(d_Table + *(d_Data + idx));
}

The problem is that only 1/4 of the Labels array is processed correctly. If I run the code above with only one stream (streams[0]), the output is correct. What could be the problem? Maybe I’m using streams incorrectly? I’m using CUDA 2.1.

Thanks a lot in advance!

Recently I installed CUDA 2.3 and tried the new features like write-combining memory and mapped memory. I’m using a GTX 295. Using mapped memory doesn’t help in my case; my program became even slower. I think it’s just a programming convenience, because the data still has to be transferred to the device, doesn’t it?

So the fastest way to copy data to the device is to use pinned host memory allocated with the “cudaHostAllocWriteCombined” flag. Should it really be like that?

Thanks for any advice!

Mapped memory is useful if you have an integrated graphics card, since then the memory you’re using is physically the same. For non-integrated cards it’s slower.

Write-combined memory is meant to be faster for host-to-device transfers, but if you want to read it back from the host side at any point, those reads will be slow.
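A minimal sketch of the allocation pattern being discussed, assuming a CUDA 2.3+ toolchain (the buffer names and size are made up). The key point is that the host code only ever writes into the write-combined buffer before copying:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t n = 1 << 19; // hypothetical element count
    const size_t nbytes = n * sizeof(unsigned short);

    // Pinned, write-combined host memory: fast for H2D copies,
    // but very slow if the CPU ever reads from it.
    unsigned short *h_buf = 0;
    cudaHostAlloc((void **)&h_buf, nbytes, cudaHostAllocWriteCombined);

    unsigned short *d_buf = 0;
    cudaMalloc((void **)&d_buf, nbytes);

    // Fill the buffer with writes only -- never read h_buf back on the host.
    for (size_t i = 0; i < n; i++)
        h_buf[i] = (unsigned short)i;

    cudaMemcpy(d_buf, h_buf, nbytes, cudaMemcpyHostToDevice);

    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```

If the host needs to read results back (like the D2H copy of h_Labels above), that buffer should be plain pinned memory (cudaHostAllocDefault), not write-combined.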