I have a problem in my CUDA program. I have a well-optimized algorithm for real-time video encoding. My kernel copies part of the data from mapped memory to shared memory, then processes the data in shared memory. Processing the data is 13 times faster than copying it from mapped to shared memory. The device is an ION. The copy is performed with coalesced global memory accesses. What ways are there to speed up the copy?
How close do you get to the theoretical memory bandwidth? Maybe there is no way of improving the throughput further.
That said, there are two things that seem to improve memory bandwidth slightly even with optimal coalesced accesses:
Try to use the widest memory accesses possible, which would be uint4 (16 bytes per thread).
And, if possible, read sequentially through memory and avoid interleaving reads and writes.
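To make the two suggestions concrete, here is a minimal sketch, not your actual kernel. It assumes the input lives in mapped (zero-copy) host memory, that each thread loads one uint4 into shared memory, and that the only global write is one word per block so that reads and writes are not interleaved. The names `THREADS`, `load_tile_uint4`, `effective_bandwidth_gbps` and `run_copy_benchmark` are made up for illustration, and the XOR reduction is only a stand-in for the real processing. The reported GB/s figure is what you would compare against the theoretical bandwidth of the device.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define THREADS 128  // hypothetical block size (power of two); tune for the real kernel

// Each thread loads one 16-byte uint4 from mapped memory into shared memory.
// Consecutive threads read consecutive elements, so the loads coalesce.
__global__ void load_tile_uint4(const uint4* __restrict__ src, size_t n_vec,
                                unsigned* block_xor)
{
    __shared__ uint4 tile[THREADS];
    __shared__ unsigned acc[THREADS];
    size_t i = (size_t)blockIdx.x * THREADS + threadIdx.x;
    tile[threadIdx.x] = (i < n_vec) ? src[i] : make_uint4(0, 0, 0, 0);
    __syncthreads();

    // Stand-in for the real processing: XOR-reduce the tile in shared memory.
    uint4 v = tile[threadIdx.x];
    acc[threadIdx.x] = v.x ^ v.y ^ v.z ^ v.w;
    __syncthreads();
    for (int s = THREADS / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) acc[threadIdx.x] ^= acc[threadIdx.x + s];
        __syncthreads();
    }
    // Only one global write per block, after all reads are done.
    if (threadIdx.x == 0) block_xor[blockIdx.x] = acc[0];
}

// Effective bandwidth in GB/s for `bytes` moved in `ms` milliseconds.
double effective_bandwidth_gbps(double bytes, double ms)
{
    return bytes / 1.0e6 / ms;
}

// Returns the measured read bandwidth in GB/s, or -1.0 on error or mismatch.
// Assumes n_bytes is at least 16; any tail beyond a multiple of 16 is ignored.
double run_copy_benchmark(size_t n_bytes)
{
    size_t n_vec = n_bytes / sizeof(uint4);
    size_t n_words = n_vec * 4;
    int blocks = (int)((n_vec + THREADS - 1) / THREADS);
    unsigned *h_src = NULL, *d_out = NULL;
    uint4 *d_src = NULL;

    cudaSetDeviceFlags(cudaDeviceMapHost);  // allow mapped host memory
    if (cudaHostAlloc((void**)&h_src, n_vec * sizeof(uint4),
                      cudaHostAllocMapped) != cudaSuccess) return -1.0;
    unsigned expected = 0;
    for (size_t k = 0; k < n_words; ++k) {
        h_src[k] = (unsigned)(k * 2654435761u);  // arbitrary test pattern
        expected ^= h_src[k];
    }
    if (cudaHostGetDevicePointer((void**)&d_src, h_src, 0) != cudaSuccess ||
        cudaMalloc((void**)&d_out, blocks * sizeof(unsigned)) != cudaSuccess) {
        cudaFreeHost(h_src);
        return -1.0;
    }

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    load_tile_uint4<<<blocks, THREADS>>>(d_src, n_vec, d_out);  // warm-up
    cudaEventRecord(t0);
    load_tile_uint4<<<blocks, THREADS>>>(d_src, n_vec, d_out);  // timed run
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);

    // Check the result against the host-side XOR of the same data.
    std::vector<unsigned> h_out(blocks);
    cudaMemcpy(h_out.data(), d_out, blocks * sizeof(unsigned),
               cudaMemcpyDeviceToHost);
    unsigned got = 0;
    for (size_t b = 0; b < h_out.size(); ++b) got ^= h_out[b];

    cudaEventDestroy(t0);
    cudaEventDestroy(t1);
    cudaFree(d_out);
    cudaFreeHost(h_src);
    if (got != expected) return -1.0;
    return effective_bandwidth_gbps((double)(n_vec * sizeof(uint4)), ms);
}

int main()
{
    double gbps = run_copy_benchmark((size_t)16 << 20);  // 16 MiB test buffer
    if (gbps < 0.0) { printf("benchmark failed\n"); return 1; }
    printf("effective read bandwidth: %.2f GB/s\n", gbps);
    return 0;
}
```

If the number printed here is already close to what the memory system can deliver, the copy itself has little headroom left and the remaining option is to move less data. The uint4 loads require the source pointer to be 16-byte aligned, which holds for buffers returned by `cudaHostAlloc`.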