Need advice - low performance

Hello, I need some help with learning CUDA.

Task: adding two float arrays
GPU: Tesla C1060

Kernel execution time for 100M-element arrays: 15 ms, which equals 6.67 GFLOPS, while peak performance should be ~900 GFLOPS.
The theoretical video memory bandwidth (102 GB/s) should not limit it. Why can performance be so low?

I am using page-locked memory and asynchronous memory copies.

Thank you.

Looks like your kernel is memory-bound.

I think you can try using fewer threads per block, say 64 instead of 512,
because your threads do not need to cooperate inside a block.

Also try computing several sums per thread (say 4 sums),
to take advantage of overlapping memory accesses with ALU work.

And use zero-copy instead of streams (easier to handle).
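
A rough sketch of what these two suggestions could look like together. This is not the original poster's code: the kernel name addArrays, the helper runAdd and the constant sumsPerThread are made up for illustration, and a grid-stride loop is used here as one possible way to give each thread several sums. The block count is capped at 65535 because that is the 1D grid limit on compute capability 1.x devices such as the C1060, so for very large arrays each thread ends up doing more than 4 sums.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Grid-stride loop: each thread handles elements i, i + stride, i + 2*stride, ...
// Neighbouring threads touch neighbouring addresses, so accesses stay coalesced.
__global__ void addArrays(const float *a, const float *b, float *c, int n)
{
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        c[i] = a[i] + b[i];
}

// Hypothetical helper (not from the thread): adds two n-element arrays (n > 0)
// on the GPU and returns the number of wrong results, or -1 on a CUDA error.
int runAdd(int n)
{
    std::vector<float> ha(n), hb(n), hc(n);
    for (int i = 0; i < n; ++i) { ha[i] = float(i % 100); hb[i] = 2.0f; }

    size_t bytes = size_t(n) * sizeof(float);
    float *da = 0, *db = 0, *dc = 0;
    if (cudaMalloc(&da, bytes) != cudaSuccess ||
        cudaMalloc(&db, bytes) != cudaSuccess ||
        cudaMalloc(&dc, bytes) != cudaSuccess) {
        cudaFree(da); cudaFree(db); cudaFree(dc);
        return -1;
    }
    cudaMemcpy(da, ha.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb.data(), bytes, cudaMemcpyHostToDevice);

    const int threads = 64;        // small blocks: threads do not cooperate
    const int sumsPerThread = 4;   // assumed target amount of work per thread
    int blocks = (n + threads * sumsPerThread - 1) / (threads * sumsPerThread);
    if (blocks > 65535) blocks = 65535;  // 1D grid limit on compute capability 1.x

    addArrays<<<blocks, threads>>>(da, db, dc, n);
    // The blocking copy below also waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(hc.data(), dc, bytes, cudaMemcpyDeviceToHost);
    cudaFree(da); cudaFree(db); cudaFree(dc);
    if (err != cudaSuccess) return -1;

    int wrong = 0;
    for (int i = 0; i < n; ++i)
        if (hc[i] != ha[i] + hb[i]) ++wrong;
    return wrong;
}
```

Calling runAdd(1 << 20) from main should return 0 if every sum matches. Whether 64 threads per block actually beats 512 on a given card is something to measure rather than assume.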

Tried it, but it did not help much.

As I understand it, zero-copy means direct access to ordinary host memory via cudaHostGetDevicePointer?

It also runs slower than the usual copying from page-locked memory.
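
For reference, a minimal zero-copy sketch, assuming the usual pattern of mapped pinned memory: cudaHostAlloc with cudaHostAllocMapped, then cudaHostGetDevicePointer to get a pointer the kernel can use. The names addArrays and runZeroCopyAdd are invented for this example. Note that the kernel then reads and writes host memory over the PCI Express bus, which is much slower than device memory, so a slower kernel is the expected result for a purely memory-bound operation like this one.

```cuda
#include <cuda_runtime.h>

__global__ void addArrays(const float *a, const float *b, float *c, int n)
{
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        c[i] = a[i] + b[i];
}

// Hypothetical helper: adds two n-element arrays (n > 0) through mapped host
// memory. Returns the number of wrong results, or -1 on a CUDA error.
int runZeroCopyAdd(int n)
{
    // Enable mapped pinned memory. On older CUDA versions this must be
    // called before anything else creates the CUDA context.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    size_t bytes = size_t(n) * sizeof(float);
    float *ha = 0, *hb = 0, *hc = 0;   // host pointers
    float *da = 0, *db = 0, *dc = 0;   // device-side aliases of the same memory

    if (cudaHostAlloc((void **)&ha, bytes, cudaHostAllocMapped) != cudaSuccess ||
        cudaHostAlloc((void **)&hb, bytes, cudaHostAllocMapped) != cudaSuccess ||
        cudaHostAlloc((void **)&hc, bytes, cudaHostAllocMapped) != cudaSuccess ||
        cudaHostGetDevicePointer((void **)&da, ha, 0) != cudaSuccess ||
        cudaHostGetDevicePointer((void **)&db, hb, 0) != cudaSuccess ||
        cudaHostGetDevicePointer((void **)&dc, hc, 0) != cudaSuccess) {
        cudaFreeHost(ha); cudaFreeHost(hb); cudaFreeHost(hc);
        return -1;
    }

    for (int i = 0; i < n; ++i) { ha[i] = float(i % 100); hb[i] = 2.0f; }

    const int threads = 64;
    int blocks = (n + threads - 1) / threads;
    if (blocks > 65535) blocks = 65535;  // 1D grid limit on compute capability 1.x

    // No cudaMemcpy anywhere: the kernel accesses host memory directly.
    addArrays<<<blocks, threads>>>(da, db, dc, n);
    cudaError_t err = cudaDeviceSynchronize();  // results are valid only after this

    int wrong = 0;
    if (err == cudaSuccess)
        for (int i = 0; i < n; ++i)
            if (hc[i] != ha[i] + hb[i]) ++wrong;

    cudaFreeHost(ha); cudaFreeHost(hb); cudaFreeHost(hc);
    return err == cudaSuccess ? wrong : -1;
}
```

The convenience is that the explicit copies and the stream bookkeeping disappear. The cost is that every element crosses the bus during the kernel itself.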

That code block is far too big for me to read right now.

With the numbers you quoted you’re at 80GB/s memory bandwidth: each sum reads two floats and writes one, so 100M sums move 1.2GB in 15ms. You’re memory bound. I’ve found it’s fairly hard to push much beyond 80-85GB/s on a C1060.

Pinned memory and streams won’t help your kernel execution time at all. Streams might even slow it down. They’ll only change overall run time.
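
To separate kernel time from copy time, one option is to time the kernel alone with CUDA events and convert the result to effective bandwidth. A sketch, with hypothetical names (addArrays, measureAddBandwidth) and an assumed launch configuration of 256 threads per block; it counts 12 bytes of traffic per element (two float reads, one float write).

```cuda
#include <cuda_runtime.h>

__global__ void addArrays(const float *a, const float *b, float *c, int n)
{
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        c[i] = a[i] + b[i];
}

// Hypothetical helper: times one kernel launch on n elements (n > 0) and
// returns the effective bandwidth in GB/s, or -1.0 on a CUDA error.
double measureAddBandwidth(int n)
{
    size_t bytes = size_t(n) * sizeof(float);
    float *da = 0, *db = 0, *dc = 0;
    if (cudaMalloc(&da, bytes) != cudaSuccess ||
        cudaMalloc(&db, bytes) != cudaSuccess ||
        cudaMalloc(&dc, bytes) != cudaSuccess) {
        cudaFree(da); cudaFree(db); cudaFree(dc);
        return -1.0;
    }
    cudaMemset(da, 0, bytes);
    cudaMemset(db, 0, bytes);

    const int threads = 256;             // assumed block size for this sketch
    int blocks = (n + threads - 1) / threads;
    if (blocks > 65535) blocks = 65535;  // 1D grid limit on compute capability 1.x

    addArrays<<<blocks, threads>>>(da, db, dc, n);  // warm-up launch, not timed

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    addArrays<<<blocks, threads>>>(da, db, dc, n);  // only the kernel is timed
    cudaEventRecord(stop, 0);
    cudaError_t err = cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(da); cudaFree(db); cudaFree(dc);
    if (err != cudaSuccess || ms <= 0.0f) return -1.0;

    // Two reads and one write of 4 bytes each per element.
    double gigabytes = 3.0 * double(bytes) / 1e9;
    return gigabytes / (ms / 1e3);
}
```

Comparing the returned figure against the card's theoretical bandwidth shows how close the kernel already is to the memory limit, independent of how the host copies are done.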

Thank you for the information.

As for streams, I am using them to run copies and kernels concurrently. The ‘CUDA Best Practices Guide’ says this can improve overall performance :)
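
A minimal sketch of that kind of overlap, assuming the work is split into one chunk per stream. The names addArrays, runStreamedAdd and numStreams are illustrative, not taken from the code posted earlier. The issue order used here (all uploads, then all kernels, then all downloads) is an assumption aimed at devices with a single copy engine; which order overlaps best depends on the device, so it is worth timing both variants.

```cuda
#include <cuda_runtime.h>

__global__ void addArrays(const float *a, const float *b, float *c, int n)
{
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        c[i] = a[i] + b[i];
}

// Hypothetical helper: adds two n-element arrays (n > 0) in chunks, one chunk
// per stream. Returns the number of wrong results, or -1 on a CUDA error.
int runStreamedAdd(int n)
{
    const int numStreams = 4;   // assumed; tune for the real workload
    const int threads = 64;
    const int chunk = (n + numStreams - 1) / numStreams;
    size_t bytes = size_t(n) * sizeof(float);

    float *ha = 0, *hb = 0, *hc = 0, *da = 0, *db = 0, *dc = 0;
    // Async copies can only overlap with kernels when host memory is pinned.
    if (cudaMallocHost((void **)&ha, bytes) != cudaSuccess ||
        cudaMallocHost((void **)&hb, bytes) != cudaSuccess ||
        cudaMallocHost((void **)&hc, bytes) != cudaSuccess ||
        cudaMalloc(&da, bytes) != cudaSuccess ||
        cudaMalloc(&db, bytes) != cudaSuccess ||
        cudaMalloc(&dc, bytes) != cudaSuccess) {
        cudaFreeHost(ha); cudaFreeHost(hb); cudaFreeHost(hc);
        cudaFree(da); cudaFree(db); cudaFree(dc);
        return -1;
    }
    for (int i = 0; i < n; ++i) { ha[i] = float(i % 100); hb[i] = 2.0f; }

    cudaStream_t streams[numStreams];
    int offset[numStreams], count[numStreams];
    for (int s = 0; s < numStreams; ++s) {
        cudaStreamCreate(&streams[s]);
        offset[s] = s * chunk;
        int left = n - offset[s];
        count[s] = left > chunk ? chunk : (left > 0 ? left : 0);
    }

    // Stage 1: upload every chunk.
    for (int s = 0; s < numStreams; ++s) {
        if (count[s] == 0) continue;
        size_t cb = size_t(count[s]) * sizeof(float);
        cudaMemcpyAsync(da + offset[s], ha + offset[s], cb, cudaMemcpyHostToDevice, streams[s]);
        cudaMemcpyAsync(db + offset[s], hb + offset[s], cb, cudaMemcpyHostToDevice, streams[s]);
    }
    // Stage 2: one kernel per chunk; each waits only for its own stream's uploads.
    for (int s = 0; s < numStreams; ++s) {
        if (count[s] == 0) continue;
        int blocks = (count[s] + threads - 1) / threads;
        if (blocks > 65535) blocks = 65535;  // 1D grid limit on compute capability 1.x
        addArrays<<<blocks, threads, 0, streams[s]>>>(
            da + offset[s], db + offset[s], dc + offset[s], count[s]);
    }
    // Stage 3: download every chunk.
    for (int s = 0; s < numStreams; ++s) {
        if (count[s] == 0) continue;
        size_t cb = size_t(count[s]) * sizeof(float);
        cudaMemcpyAsync(hc + offset[s], dc + offset[s], cb, cudaMemcpyDeviceToHost, streams[s]);
    }

    cudaError_t err = cudaDeviceSynchronize();  // wait for all streams

    int wrong = 0;
    if (err == cudaSuccess)
        for (int i = 0; i < n; ++i)
            if (hc[i] != ha[i] + hb[i]) ++wrong;

    for (int s = 0; s < numStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFreeHost(ha); cudaFreeHost(hb); cudaFreeHost(hc);
    cudaFree(da); cudaFree(db); cudaFree(dc);
    return err == cudaSuccess ? wrong : -1;
}
```

This can shorten the total wall-clock time by hiding part of the transfer behind computation, but it does not change how long the kernel itself needs for a given amount of data.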