Hey,
I have changed the host code of the scalar product example as follows:
HOST:
get_time(a)
Loop i = 1 to NUM {
    Initialize data (buffers)
    Send the buffers to the device
    Invoke the kernel
    Get results (copy back)
}
get_time(b)
total = (b - a) / NUM;
I didn't change anything in the kernel itself. I think this change is necessary, though, because the measurement should cover the whole time: copying the data to the device, computing, and copying the results back.
As far as I can see, I get no performance gain at all this way. Can anyone help me figure out what the problem is?
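
In case it helps to pin down where the time goes, here is a rough sketch of the same timing loop that also records each phase separately. It is only an illustration: init_data, send_to_device, run_kernel and copy_back are hypothetical stand-ins that run on the CPU, not my real host code or any real GPU API, and NUM and N are assumed values.

```python
import time

NUM = 10        # number of timed iterations (assumed value)
N = 1 << 16     # vector length (assumed value)

# Hypothetical stand-ins for the real host/device calls.
# Everything here runs on the CPU; only the timing structure matters.
def init_data(n):
    a = [float(i % 7) for i in range(n)]
    b = [float(i % 3) for i in range(n)]
    return a, b

def send_to_device(a, b):
    # A plain copy stands in for the host-to-device transfer.
    return list(a), list(b)

def run_kernel(dev_a, dev_b):
    # Scalar (dot) product of the two vectors.
    return sum(x * y for x, y in zip(dev_a, dev_b))

def copy_back(dev_result):
    # Stands in for the device-to-host transfer of the result.
    return float(dev_result)

def timed_scalar_product(num=NUM, n=N):
    phases = {"init": 0.0, "send": 0.0, "kernel": 0.0, "copy_back": 0.0}
    result = None
    start = time.perf_counter()
    for _ in range(num):
        t0 = time.perf_counter()
        a, b = init_data(n)
        t1 = time.perf_counter()
        dev_a, dev_b = send_to_device(a, b)
        t2 = time.perf_counter()
        dev_result = run_kernel(dev_a, dev_b)
        t3 = time.perf_counter()
        result = copy_back(dev_result)
        t4 = time.perf_counter()
        phases["init"] += t1 - t0
        phases["send"] += t2 - t1
        phases["kernel"] += t3 - t2
        phases["copy_back"] += t4 - t3
    total = (time.perf_counter() - start) / num   # same as (b - a) / NUM
    per_phase = {name: t / num for name, t in phases.items()}
    return result, total, per_phase

if __name__ == "__main__":
    result, total, per_phase = timed_scalar_product()
    print("result:", result)
    print("average time per iteration:", total)
    for name, t in per_phase.items():
        print(name, t)
```

With a real device API the per-phase numbers only mean something if the kernel has actually finished before the clock is read. Kernel launches in CUDA and OpenCL return asynchronously, so a synchronization call would be needed after the launch.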