Performance problem: not getting any performance at all

Hey,

I have changed the scalar product example as follows:
Host:
get_time(a)
loop i = 1 to NUM {
    initialize data (buffers)
    send the buffers to the device
    invoke the kernel
    get results (copy back)
}
get_time(b)
total = (b - a) / NUM;

I didn’t change anything in the kernel itself, but I think this change is necessary because we always have to copy the data to the device, compute, and copy the result back.
As far as I can see, I don’t get any performance gain at all. Can anyone help me figure out what the problem is?

Copying from the host to the device and back again is an extremely expensive and time-consuming operation (roughly 2-6 GiB/s of transfer bandwidth).
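
To put that bandwidth figure in perspective, here is a rough back-of-envelope estimate in Python. The vector length (2**20 floats per input) is only an assumption for illustration, and the helper name transfer_seconds is made up for this sketch. It ignores per-transfer latency, so real copies will be somewhat slower.

```python
GIB = 1024 ** 3  # bytes in one GiB

def transfer_seconds(num_bytes, gib_per_s):
    """Time to move num_bytes over a link sustaining gib_per_s GiB/s."""
    return num_bytes / (gib_per_s * GIB)

# Assumed workload: two input vectors of 2**20 single-precision floats each.
num_elements = 2 ** 20
input_bytes = 2 * num_elements * 4  # 4 bytes per float

slow = transfer_seconds(input_bytes, 2.0)  # low end of the quoted range
fast = transfer_seconds(input_bytes, 6.0)  # high end of the quoted range
print(f"host -> device copy: {fast * 1e3:.2f} ms to {slow * 1e3:.2f} ms")
# prints "host -> device copy: 1.30 ms to 3.91 ms"
```

Each element that crosses the bus is used in just one multiply-add of the scalar product, so under these assumptions the copy can easily take longer than the computation it feeds.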

The way you measure the performance of a microbenchmark depends entirely on how that code will be used in the full application. Ideally, you get your entire algorithm onto the GPU so that the slow host ↔ device transfers happen only at initialization and completion of the app, not for each individual operation.
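
A simple cost model shows why this matters for the timing loop in the question. The sketch below is in Python, and the timings in it are hypothetical placeholders, not measurements. It also assumes the data can stay on the device between kernel launches, which depends on the real application.

```python
def per_iteration_transfers(num, t_h2d, t_kernel, t_d2h):
    """Total time when every iteration copies in, computes, and copies out."""
    return num * (t_h2d + t_kernel + t_d2h)

def resident_on_device(num, t_h2d, t_kernel, t_d2h):
    """Total time when data is copied in once, reused, and copied out once."""
    return t_h2d + num * t_kernel + t_d2h

# Hypothetical timings in milliseconds, for illustration only.
NUM = 100
T_H2D, T_KERNEL, T_D2H = 4.0, 0.1, 0.01

looped = per_iteration_transfers(NUM, T_H2D, T_KERNEL, T_D2H)
resident = resident_on_device(NUM, T_H2D, T_KERNEL, T_D2H)
print(f"copy every iteration: {looped / NUM:.2f} ms per iteration")
print(f"copy once:            {resident / NUM:.2f} ms per iteration")
```

With these made-up numbers the loop from the question is dominated by the copies, while the version that keeps data on the device approaches the kernel time alone as NUM grows. Timing the copy, the kernel, and the copy back separately on the real code would show which case applies.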