Hello, I need some help learning CUDA.
Task: adding two float arrays
GPU: Tesla C1060
Kernel execution time for arrays of 100M elements: 15 ms, which equals 6.67 GFLOPS, while peak performance should be ~900 GFLOPS.
The theoretical video memory bandwidth (102 GB/s) should not be the limit. Why is the performance so low?
I am using page-locked memory and asynchronous memory copies.