Hello CUDA developers,
I have a problem with my CUDA program. I have a well-optimized algorithm for real-time video encoding. My kernel copies part of the data from mapped memory to shared memory, then processes the data in shared memory. The processing is 13 times faster than the copy from mapped memory to shared memory. The device is an ION. The copy is performed with coalesced global memory accesses. What are some ways to speed up the copy?
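The pattern described above can be sketched roughly as follows. This is a simplified illustration, not the actual encoder kernel: the names `copyThenProcess` and `runCopyDemo`, the tile size, the 32-bit element type, and the "processing" step are all placeholders assumed for the example. It only shows the shape of the problem: a coalesced copy from mapped (zero-copy) host memory into shared memory, followed by work done entirely in shared memory.

```cuda
#include <cassert>
#include <cuda_runtime.h>

#define TILE 256  // hypothetical tile size: one 32-bit word per thread

// Each block copies one tile from mapped host memory into shared memory,
// then works only on the shared copy.
__global__ void copyThenProcess(const unsigned int *mapped,
                                unsigned int *out, int n)
{
    __shared__ unsigned int tile[TILE];
    int i = blockIdx.x * TILE + threadIdx.x;

    // Copy stage: consecutive threads read consecutive words (coalesced).
    tile[threadIdx.x] = (i < n) ? mapped[i] : 0u;
    __syncthreads();

    // Processing stage (placeholder): combine with the next word in the tile.
    unsigned int v = tile[threadIdx.x] + tile[(threadIdx.x + 1) % TILE];
    if (i < n) out[i] = v;
}

// Host-side driver: returns the number of mismatches, or -1 on a CUDA error.
int runCopyDemo()
{
    const int n = 4 * TILE;
    assert(n % TILE == 0);  // the check below assumes whole tiles

    unsigned int *hIn = NULL, *dIn = NULL, *dOut = NULL;
    unsigned int hOut[n];

    // Mapped memory must be enabled before the context is created.
    cudaSetDeviceFlags(cudaDeviceMapHost);
    if (cudaHostAlloc((void **)&hIn, n * sizeof(unsigned int),
                      cudaHostAllocMapped) != cudaSuccess)
        return -1;
    for (int i = 0; i < n; ++i) hIn[i] = (unsigned int)i;

    // Device-side alias of the mapped host buffer.
    if (cudaHostGetDevicePointer((void **)&dIn, hIn, 0) != cudaSuccess ||
        cudaMalloc((void **)&dOut, n * sizeof(unsigned int)) != cudaSuccess) {
        cudaFreeHost(hIn);
        return -1;
    }

    copyThenProcess<<<n / TILE, TILE>>>(dIn, dOut, n);

    // cudaMemcpy waits for the kernel and reports any launch failure.
    cudaError_t err = cudaMemcpy(hOut, dOut, n * sizeof(unsigned int),
                                 cudaMemcpyDeviceToHost);

    int bad = 0;
    if (err == cudaSuccess) {
        for (int i = 0; i < n; ++i) {
            int t = i % TILE, base = i - t;
            unsigned int expected = hIn[i] + hIn[base + (t + 1) % TILE];
            if (hOut[i] != expected) ++bad;
        }
    }

    cudaFree(dOut);
    cudaFreeHost(hIn);
    return (err == cudaSuccess) ? bad : -1;
}
```

In this sketch each thread issues a single 32-bit load from the mapped buffer. Whether the real kernel uses the same access width is not stated above.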
Sorry for my bad English.