Transferring data from device to host and vice versa is very time consuming. So is there any way for one kernel to generate certain values (an array) and a second kernel to use them, without transferring the generated array back to the host and then passing it into the device again for the second kernel's use?
Which will give better performance: 4 threads doing 16 operations each, or 64 threads doing 1 operation each? What I mean to ask is, what is the effect on performance of using a small number of threads (each with multiple operations) versus a large number of threads (each with a single operation)?
Yup, it’s possible. You just need to allocate a buffer in the device’s global memory and pass the same pointer to both kernels. Look at the sample code below. (However, this code assumes that the HOST CODE between the 2 kernel calls will not use the data stored in ‘d_data’. If that is not true, then you would again have to transfer the data from device to host :) )
float* d_data;
cudaMalloc((void**)&d_data, ....);
kernelCallNumber1<<<a, b>>>(d_data);
... HOST CODE ...
kernelCallNumber2<<<c, d>>>(d_data);
... HOST CODE ...
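To make the idea concrete, here is a minimal self-contained sketch of the same pattern. The kernel names, the array size, and the launch configuration are made up for illustration; the point is only that ‘d_data’ never leaves the device between the two kernel calls:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// First kernel: generates the values directly in device memory.
__global__ void produce(float* d_data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_data[i] = (float)i;
}

// Second kernel: consumes the array produced by the first kernel,
// with no round trip through host memory in between.
__global__ void consume(float* d_data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_data[i] *= 2.0f;
}

int main() {
    const int n = 1024;
    float* d_data;
    cudaMalloc((void**)&d_data, n * sizeof(float));

    produce<<<(n + 255) / 256, 256>>>(d_data, n);
    consume<<<(n + 255) / 256, 256>>>(d_data, n);  // reuses d_data in place

    // Only the final result is copied back to the host, once.
    float h_result[4];
    cudaMemcpy(h_result, d_data, 4 * sizeof(float), cudaMemcpyDeviceToHost);
    printf("%f %f %f %f\n", h_result[0], h_result[1], h_result[2], h_result[3]);

    cudaFree(d_data);
    return 0;
}
```

Since kernel launches on the same stream are serialized, ‘consume’ is guaranteed to see the values written by ‘produce’ without any explicit synchronization here.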
Or, you could simply use pinned (page-locked) host memory (see the CUDA Programming Guide, version >= 2.2, for more details on this technique).
All the threads you create through a kernel launch are ultimately executed in warps. So with too small a number of threads, there may not be enough warps available to hide the latency of memory fetches.
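A hypothetical pair of kernels illustrating the trade-off from the question (both update the same 64 floats; the names and sizes are invented for the example):

```cuda
#include <cuda_runtime.h>

// Variant A: 4 threads, 16 operations each. Four threads occupy a
// single, mostly empty warp -- when that warp stalls on a memory
// fetch, the scheduler has no other warp to switch to.
__global__ void fewThreadsManyOps(float* data) {
    int base = threadIdx.x * 16;
    for (int k = 0; k < 16; ++k)
        data[base + k] += 1.0f;
}

// Variant B: 64 threads, 1 operation each. Sixty-four threads form
// two full warps, so while one warp waits on memory the other can
// make progress, hiding part of the latency.
__global__ void manyThreadsOneOp(float* data) {
    data[threadIdx.x] += 1.0f;
}
```

On this tiny scale both launches (`fewThreadsManyOps<<<1, 4>>>(d)` vs. `manyThreadsOneOp<<<1, 64>>>(d)`) are trivially fast, but the principle generalizes: all else being equal, prefer the configuration with more threads, since latency hiding depends on having multiple resident warps per multiprocessor.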