I faced the following problem. My algorithm has two parts that operate on the same data: the first part runs on the CPU (there is no way to parallelize it, and a sequential implementation on the GPU is very slow), while the second part runs on the GPU and is very fast. But since the second part needs the results of the first, I have to copy data from the host to the device using cudaMemcpy, and these copies slow the whole algorithm down significantly. Of course I'm using pinned memory for the data being transferred to the device. But are there other tricks to accelerate CPU <-> GPU copies?
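For reference, my transfer path currently looks roughly like the sketch below (simplified, error checking omitted; buffer names and sizes are just placeholders). The host buffer is pinned with cudaMallocHost so that cudaMemcpyAsync can run asynchronously with respect to the CPU:

```cuda
#include <cuda_runtime.h>
#include <stdint.h>

int16_t *h_buf, *d_buf;
size_t n = 1 << 20;                            // example element count

cudaMallocHost(&h_buf, n * sizeof(int16_t));   // pinned host memory
cudaMalloc(&d_buf, n * sizeof(int16_t));

cudaStream_t stream;
cudaStreamCreate(&stream);

// ... CPU part of the algorithm fills h_buf here ...

// Asynchronous copy; could in principle overlap with GPU work
// issued in other streams.
cudaMemcpyAsync(d_buf, h_buf, n * sizeof(int16_t),
                cudaMemcpyHostToDevice, stream);
// gpu_part_kernel<<<grid, block, 0, stream>>>(d_buf, ...);
cudaStreamSynchronize(stream);
```

Since the GPU part depends on the whole result of the CPU part, I don't see an obvious way to overlap the copy with useful computation, which is why I'm asking about other tricks.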
The second problem is a consequence of the first one. To get any benefit from the GPU I'm using 16-bit integers instead of 32-bit (so there is less data to transfer). But 16 bits imposes another limit on my algorithm, since values can be higher than 65 535. A 24-bit integer would solve my problem (because, as I said, with 32 bits I have to copy a lot), but this format doesn't exist. Or maybe someone has experience with using 24-bit integers on the GPU?
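One workaround I've been considering is to emulate a 24-bit type by packing each value into 3 bytes on the host and unpacking on the device, so n values cost 3n bytes on the bus instead of 4n. A minimal host-side sketch in plain C (the names `pack24`/`unpack24` are mine; on the GPU the unpacking would go into a small kernel before the main computation):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack each 32-bit value (which must fit in 24 bits) into 3
   little-endian bytes for the host-to-device transfer. */
static void pack24(const uint32_t *in, uint8_t *out, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        assert(in[i] < (1u << 24));            /* must fit in 24 bits */
        out[3 * i + 0] = (uint8_t)(in[i] & 0xFF);
        out[3 * i + 1] = (uint8_t)((in[i] >> 8) & 0xFF);
        out[3 * i + 2] = (uint8_t)((in[i] >> 16) & 0xFF);
    }
}

/* Inverse: rebuild full 32-bit values from the 3-byte triples
   (this part would run on the device after the copy). */
static void unpack24(const uint8_t *in, uint32_t *out, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        out[i] = (uint32_t)in[3 * i]
               | ((uint32_t)in[3 * i + 1] << 8)
               | ((uint32_t)in[3 * i + 2] << 16);
    }
}
```

The obvious trade-off is the extra pack/unpack work and unaligned byte accesses on the device, so I don't know whether the 25% smaller transfer actually wins in practice.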
Thanks a lot in advance for any advice!