Some Questions regarding CUDA

miztaken · September 29, 2009, 6:21am

Transferring data back and forth between the device and the host is very time consuming. Is there any way for one kernel to generate certain values (an array) and for another kernel to use them, without transferring the generated array back to the host and then passing it to the device again for the second kernel?
Which will give better performance: 4 threads doing 16 operations each, or 64 threads doing 1 operation each? What I want to know is the effect on performance of using a small number of threads (each doing multiple operations) versus a large number of threads (each doing a single operation).

_teju · September 29, 2009, 8:56am

Yup, it’s possible. You just need to allocate a buffer in device global memory and pass the same pointer to both kernels; the data stays on the device between the two launches. Look at the sample code below. (However, this code assumes that the HOST CODE between the 2 kernel calls does not use the data stored in ‘d_data’. If it does, then you will still have to transfer the data from device to host :) )

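float* d_data;  // device pointer; the allocation lives in global memory

cudaMalloc((void**)&d_data, ....);

kernelCallNumber1<<<a, b>>>(d_data);  // first kernel writes its results into d_data

... HOST CODE (does not touch d_data) ...

kernelCallNumber2<<<c, d>>>(d_data);  // second kernel reads the same buffer, no host copy in between

.... HOST CODE ...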
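For anyone who wants something they can compile, here is a minimal sketch of the same idea. The kernel names `produce` and `consume`, the helper `runTwoKernels`, and the 256-thread block size are made up for illustration and are not from the snippet above. It relies on the fact that kernels launched into the same (default) stream run in order, so the second kernel sees what the first one wrote.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical producer kernel: writes 2*i into element i.
__global__ void produce(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = 2.0f * i;
}

// Hypothetical consumer kernel: adds 1 to whatever is already in the buffer.
__global__ void consume(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Runs both kernels on one device buffer and returns the number of elements
// that differ from the expected 2*i + 1 (0 means success, -1 a CUDA error).
int runTwoKernels(int n) {
    assert(n > 0);
    float* d_data = nullptr;
    if (cudaMalloc((void**)&d_data, n * sizeof(float)) != cudaSuccess) return -1;

    int threads = 256;                         // assumed block size
    int blocks = (n + threads - 1) / threads;  // enough blocks to cover n

    produce<<<blocks, threads>>>(d_data, n);
    // No cudaMemcpy here: both launches go to the default stream, so they
    // execute in order and the data never leaves the device.
    consume<<<blocks, threads>>>(d_data, n);

    // One copy at the very end, only to check the result on the host.
    // A blocking cudaMemcpy waits for the preceding kernels to finish.
    std::vector<float> h_data(n);
    cudaError_t err = cudaMemcpy(h_data.data(), d_data, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (h_data[i] != 2.0f * i + 1.0f) ++mismatches;
    return mismatches;
}
```

Calling `runTwoKernels(1 << 20)` from `main` should return 0 if both kernels ran; the final copy is only there for verification and can be dropped if the host never needs the values.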

Alternatively, you could use mapped pinned (zero-copy) host memory, which kernels can access directly through a device pointer (see the CUDA Programming Guide, version 2.2 or later, for more details on this technique).
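A rough sketch of the pinned-memory route, assuming the mapped ("zero-copy") flavor is what is meant. The kernels and the helper `runWithMappedMemory` are hypothetical names; the runtime calls (`cudaSetDeviceFlags`, `cudaHostAlloc`, `cudaHostGetDevicePointer`, `cudaFreeHost`) are the standard ones. Keep in mind that on a discrete GPU every access to mapped memory goes over the bus, so this is not automatically faster than a plain device buffer.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Hypothetical kernels, same roles as a producer/consumer pair.
__global__ void produce(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = 2.0f * i;
}

__global__ void consume(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Returns the number of wrong elements (0 means success, -1 a CUDA error).
int runWithMappedMemory(int n) {
    assert(n > 0);
    // Ask for host-memory mapping. On older setups this has to happen before
    // the CUDA context is created; the return value is ignored in this sketch.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // Page-locked host buffer that is also mapped into the device address space.
    float* h_data = nullptr;
    if (cudaHostAlloc((void**)&h_data, n * sizeof(float),
                      cudaHostAllocMapped) != cudaSuccess) return -1;

    // Device-side alias of the same memory.
    float* d_alias = nullptr;
    if (cudaHostGetDevicePointer((void**)&d_alias, h_data, 0) != cudaSuccess) {
        cudaFreeHost(h_data);
        return -1;
    }

    int threads = 256;  // assumed block size
    int blocks = (n + threads - 1) / threads;
    produce<<<blocks, threads>>>(d_alias, n);
    consume<<<blocks, threads>>>(d_alias, n);

    // Kernel launches are asynchronous: wait before reading on the host.
    if (cudaDeviceSynchronize() != cudaSuccess) {
        cudaFreeHost(h_data);
        return -1;
    }

    // No cudaMemcpy at all: the host reads the results through h_data.
    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (h_data[i] != 2.0f * i + 1.0f) ++mismatches;

    cudaFreeHost(h_data);
    return mismatches;
}
```

The difference from a `cudaMalloc` buffer is that the host can read `h_data` directly once the kernels have finished, which may suit cases where the host code between the two kernel calls does need the data.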

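All the threads you create through a kernel launch are ultimately executed in groups of 32 called warps. So with too small a number of threads, there might not be enough warps available to hide the latency of memory fetches. With only 4 threads, or even 64, a kernel occupies at most two warps, which is far too few to keep the GPU busy.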
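If you want to measure this yourself rather than take it on faith, something like the sketch below might help. It is only an illustration: the kernel `scale` and the helper `timeScale` are invented names, and the numbers you get will depend heavily on the GPU. The same amount of work is split across however many threads you launch, so you can compare a handful of threads doing many operations each against many threads doing one operation each.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Each thread processes opsPerThread elements, spaced by the total number of
// threads in the grid so that neighboring threads touch neighboring elements.
__global__ void scale(float* data, int n, int opsPerThread) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int total = gridDim.x * blockDim.x;
    for (int k = 0; k < opsPerThread; ++k) {
        int i = tid + k * total;
        if (i < n) data[i] *= 2.0f;
    }
}

// Times one launch over n elements with the given configuration.
// Returns elapsed milliseconds, or -1 on a CUDA error.
float timeScale(int n, int blocks, int threads) {
    assert(n > 0 && blocks > 0 && threads > 0);
    int total = blocks * threads;
    int opsPerThread = (n + total - 1) / total;  // work per thread, rounded up

    float* d_data = nullptr;
    if (cudaMalloc((void**)&d_data, n * sizeof(float)) != cudaSuccess) return -1.0f;
    cudaMemset(d_data, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    scale<<<blocks, threads>>>(d_data, n, opsPerThread);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until the kernel and stop event are done

    float ms = -1.0f;
    if (cudaGetLastError() == cudaSuccess)
        cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return ms;
}
```

For example, compare `timeScale(1 << 22, 1, 4)` (4 threads, about a million operations each) with `timeScale(1 << 22, (1 << 22) / 256, 256)` (one operation per thread). The very first launch in a process carries extra start-up cost, so run each configuration a couple of times before trusting the timings.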
