Kernel execution time for 100M-element arrays: 15 ms, which works out to 6.67 GFLOPS, while peak performance should be ~900 GFLOPS.
The theoretical video memory bandwidth (102 GB/s) should not be limiting it. Why is the performance so low?
That code block is far too big for me to read right now.
With the numbers you quoted you’re on 80GB/s memory bandwidth. You’re memory bound. I’ve found it’s fairly hard to push too much beyond 80-85GB/s on a C1060.
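For anyone wanting to check that figure, here is a rough back-of-the-envelope sketch. It assumes one floating-point operation and 12 bytes of memory traffic per element (two 4-byte float reads plus one 4-byte float write); that traffic pattern is a guess rather than something read from the posted kernel, and the helper name `effective_rates` is made up for illustration.

```python
# Back-of-the-envelope check of the figures quoted in this thread.
# ASSUMPTION: 1 FLOP and 12 bytes of traffic per element
# (two 4-byte float reads + one 4-byte float write). The real kernel may differ.

def effective_rates(n_elements, seconds, flops_per_element, bytes_per_element):
    """Return (GFLOP/s, GB/s) achieved by a kernel, with giga = 1e9."""
    gflops = n_elements * flops_per_element / seconds / 1e9
    gbytes = n_elements * bytes_per_element / seconds / 1e9
    return gflops, gbytes

N = 100_000_000   # elements per array, from the original post
T = 15e-3         # measured kernel time in seconds, from the original post
PEAK_BW = 102.0   # theoretical memory bandwidth in GB/s, from the original post

gflops, gbs = effective_rates(N, T, flops_per_element=1, bytes_per_element=12)
share = 100 * gbs / PEAK_BW
print(f"{gflops:.2f} GFLOP/s, {gbs:.1f} GB/s, {share:.0f}% of theoretical bandwidth")
```

Under those assumptions the kernel does very little arithmetic per byte moved, so the memory bus saturates long before the ALUs do. Getting anywhere near 900 GFLOPS at 102 GB/s would take roughly 9 FLOPs per byte of traffic.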
Pinned memory and streams won’t help your kernel execution time at all. Streams might even slow it down. They’ll only change the overall run time.
As for streams, I am using them to overlap copies with kernel execution. The ‘CUDA Best Practices Guide’ says this can increase overall performance )