Hi everyone,
I am facing the following problem. My algorithm has two parts that operate on the same data: the first part runs on the CPU (there is no chance to parallelize it, and a sequential implementation on the GPU is extremely slow), while the second part runs on the GPU and is very fast. Since the second part needs the results of the first part, I have to copy data from the host to the device with cudaMemcpy, and these copies slow down the whole algorithm significantly. Of course I am already using pinned memory for the data that is transferred to the device. Are there any other tricks for speeding up CPU <-> GPU copies?
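For reference, the pinned-memory transfer I mean follows the usual pattern, roughly like this (just a sketch; the names, the element type and the size are placeholders):
[codebox]
#include <cuda_runtime.h>

int main()
{
    const size_t numBytes = 1 << 20;        // example size
    unsigned short *h_data = 0;             // page-locked (pinned) host buffer
    unsigned short *d_data = 0;             // device buffer

    cudaMallocHost((void**)&h_data, numBytes);   // pinned host allocation
    cudaMalloc((void**)&d_data, numBytes);

    // ... fill h_data with the results of the CPU part ...

    // copies from pinned memory avoid an extra staging copy in the driver
    cudaMemcpy(d_data, h_data, numBytes, cudaMemcpyHostToDevice);

    cudaFree(d_data);
    cudaFreeHost(h_data);
    return 0;
}
[/codebox]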
The second problem is a consequence of the first one. To get any benefit from the GPU I use 16-bit integers instead of 32-bit ones, so that there is less data to transfer. But 16 bits imposes another limit on my algorithm, since values can be higher than 65 535. A 24-bit integer would solve my problem (as I said, with 32 bits I have to copy too much), but that format doesn't exist. Or maybe someone has experience with using 24-bit integers on the GPU?
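For illustration, a 24-bit value could be emulated by packing three bytes per value on the host and unpacking them in the kernel; a rough, untested sketch (pack24/unpack24 are hypothetical names):
[codebox]
// device side: rebuild a 32-bit value from 3 packed bytes
__global__ void unpack24(const unsigned char *packed, unsigned int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
    {
        int b = 3 * i;
        out[i] = (unsigned int)packed[b]
               | ((unsigned int)packed[b + 1] << 8)
               | ((unsigned int)packed[b + 2] << 16);
    }
}

// host side: store 3 bytes per value instead of 4 before the transfer
void pack24(const unsigned int *values, unsigned char *packed, int n)
{
    for (int i = 0; i < n; ++i)
    {
        packed[3 * i]     =  values[i]        & 0xFF;
        packed[3 * i + 1] = (values[i] >> 8)  & 0xFF;
        packed[3 * i + 2] = (values[i] >> 16) & 0xFF;
    }
}
[/codebox]
This trades a 25% smaller transfer for some extra unpacking work and unaligned byte loads on the device.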
Hi, thanks. By overlapping I suppose you mean asynchronous copies overlapped with kernel execution using streams, right? I haven't tried that yet, but I will and let you know whether it helps.
Yep, that's what I mean. I've seen it mentioned here that H2D copies and kernel execution can only overlap for asynchronous copies to linear device memory, not when copying to CUDA arrays or pitch-linear memory, although I haven't found any mention of this in the programming guide.
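Roughly, the pattern that is supposed to overlap looks like this (a sketch fragment only; processKernel, the buffers and the sizes are placeholders, and the host buffer is assumed to be pinned):
[codebox]
cudaStream_t copyStream, computeStream;
cudaStreamCreate(&copyStream);
cudaStreamCreate(&computeStream);

// async copy into plain linear device memory...
cudaMemcpyAsync(d_next, h_pinned, numBytes, cudaMemcpyHostToDevice, copyStream);

// ...can run while a kernel launched in another stream is executing
processKernel<<<grid, threads, 0, computeStream>>>(d_current, numElems);

cudaStreamSynchronize(copyStream);
cudaStreamSynchronize(computeStream);

cudaStreamDestroy(copyStream);
cudaStreamDestroy(computeStream);
[/codebox]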
Actually, I tried to use streams to overlap memory copies between the CPU and the GPU, but I ran into a problem. My code is based on NVIDIA's "simpleStreams" example:
[codebox]
// h_Labels (height*width) and h_Table (height*width/4) are input arrays
[/codebox]
The problem is that only 1/4 of the Labels array is processed correctly. If I run the code above with only one stream (stream[0]), the output is correct. What could be the problem? Maybe I am using streams incorrectly? I am using CUDA 2.1.
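For reference, the simpleStreams-style partitioning I am trying to follow looks roughly like this (a simplified sketch; processLabels, the block size, the element type and nStreams = 4 are placeholders, and h_Labels is assumed to be pinned memory): each stream copies and launches over its own slice of the array.
[codebox]
__global__ void processLabels(unsigned short *labels, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) { /* placeholder work on labels[i] */ }
}

void runWithStreams(unsigned short *h_Labels, int height, int width)
{
    const int nStreams   = 4;
    const int totalElems = height * width;
    const int sliceElems = totalElems / nStreams;

    unsigned short *d_Labels;
    cudaMalloc((void**)&d_Labels, totalElems * sizeof(unsigned short));

    cudaStream_t stream[nStreams];
    for (int i = 0; i < nStreams; ++i)
        cudaStreamCreate(&stream[i]);

    for (int i = 0; i < nStreams; ++i)
    {
        int offset = i * sliceElems;
        // copy this stream's slice...
        cudaMemcpyAsync(d_Labels + offset, h_Labels + offset,
                        sliceElems * sizeof(unsigned short),
                        cudaMemcpyHostToDevice, stream[i]);
        // ...and launch the kernel over the SAME slice; if each launch
        // covers the whole array, most of it is read before its copy
        // has happened
        processLabels<<<(sliceElems + 255) / 256, 256, 0, stream[i]>>>
                     (d_Labels + offset, sliceElems);
    }

    cudaThreadSynchronize();   // CUDA 2.x-era global synchronization

    for (int i = 0; i < nStreams; ++i)
        cudaStreamDestroy(stream[i]);
    cudaFree(d_Labels);
}
[/codebox]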
Recently I installed CUDA 2.3 and tried the new features, such as write-combining memory and mapped memory. I am using a GTX 295. Mapped memory does not help in my case; my program became even slower. I think it is mainly a programming convenience, because the data still has to travel to the device anyway, doesn't it?
The fastest way I have found to copy data to the device is pinned host memory allocated with the cudaHostAllocWriteCombined flag. Should it really be like that?
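For reference, the two allocation variants I am comparing look like this (a sketch fragment; h_mapped, h_wc, d_data and the size are placeholders):
[codebox]
const size_t numBytes = 1 << 20;            // example size
unsigned short *h_mapped, *d_mapped, *h_wc, *d_data;
cudaMalloc((void**)&d_data, numBytes);

// 1) mapped (zero-copy) memory: the kernel dereferences host memory
//    directly over PCIe instead of using an explicit cudaMemcpy
cudaSetDeviceFlags(cudaDeviceMapHost);      // must precede context creation
cudaHostAlloc((void**)&h_mapped, numBytes, cudaHostAllocMapped);
cudaHostGetDevicePointer((void**)&d_mapped, h_mapped, 0);
// d_mapped can now be passed to a kernel; every access crosses the bus

// 2) write-combining pinned memory: fast host-to-device copies, but
//    reading it back from the CPU side is very slow
cudaHostAlloc((void**)&h_wc, numBytes, cudaHostAllocWriteCombined);
cudaMemcpy(d_data, h_wc, numBytes, cudaMemcpyHostToDevice);
[/codebox]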
Mapped memory is useful if you have an integrated graphics card, because the memory you are using is physically the same. For discrete cards it is slower.
Write-combined memory is meant to make host-to-device copies faster, but if you want to read it from the host side at any point, those reads will be slow.