Optimum way for this code?

Manjunath_Gudisi · March 12, 2009, 9:05am

Hi,

Here is a sample of my code. It takes 240+ ms, which is too slow for my purposes, so:

1) Could you please tell me the optimum way to write the following program?

I want to execute the device function using threads (I mean, each iteration individually).

     2) How can I execute the device function using threads?
     3) What should the kernel launch configuration be?

// device function
__device__ void DevFun()
{
  // loop should execute 640 times
  for (int j = 0; j < 640; ++j) { /* … some code … */ }
}

// kernel function
__global__ void KerFun()
{
  // … some code …
  // … some code …

  // call to device function
  DevFun();

  // … some code …
  // … some code …
}

// main function
int main()
{
  // … some code …
  // … some code …

  // call to kernel function (1 block of 1 thread, so everything runs serially)
  KerFun<<<1,1,0>>>();

  // … some code …
  // … some code …
}

Thanks
Manjunath

Jamie_K · March 12, 2009, 7:24pm

Run the 640 iterations in parallel instead of sequentially, by creating 640 threads (20 blocks of 32 threads per block):

__global__ void KerFun() {

  int j = blockIdx.x * blockDim.x + threadIdx.x;

// do here what you would do within the loop.

}

int main() {

  // set up for kernel execution

  KerFun<<<20,32,0>>>();

}
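
A fuller sketch of the same idea, for illustration only. The loop body in the original post is elided, so the work done here (writing 2 * j into an output array) is a made-up stand-in, and the names `out`, `n`, and `runParallelLoop` are assumptions, not part of the original code. Error checking is left out for brevity. The bounds check matters whenever the grid holds more threads than there are iterations.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for the elided loop body: each thread handles one j.
__global__ void KerFun(float *out, int n)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j < n) {               // guard in case the grid is larger than n
        out[j] = 2.0f * j;     // placeholder for "some code"
    }
}

// Host helper (hypothetical name): launches one thread per former loop
// iteration and copies the results back to the host.
std::vector<float> runParallelLoop(int n)
{
    std::vector<float> host(n);
    float *dev = nullptr;
    cudaMalloc(&dev, n * sizeof(float));

    int threads = 32;
    int blocks = (n + threads - 1) / threads;   // 20 blocks when n == 640
    KerFun<<<blocks, threads>>>(dev, n);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(host.data(), dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return host;
}
```

Calling `runParallelLoop(640)` from `main` gives the 20 x 32 launch described above; rounding the block count up means the same helper also works when `n` is not a multiple of 32.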

Manjunath_Gudisi · March 13, 2009, 6:21am

Hi Jamie,

Thanks for your reply.

I have one doubt here:

When we specify the kernel launch configuration like this: KerFun<<<20,32,0>>>();

it means that KerFun() is executed 20*32 times. Am I right?

But I don't want to execute the whole function that many times (it takes too much time).

I want to execute only DevFun(), which is called from the kernel function, 20*32 times. How can I do this?

Thanks

Manjunath

Jamie_K · March 13, 2009, 1:59pm

It’s not possible to spawn threads from within kernel code, so from one invocation of KerFun you can’t launch multiple threads for the 640 executions of DevFun.

If the rest of KerFun needs to run only once, you could either do that processing on the host and pass the results to the device, or invoke it as a separate kernel:

__global__ void DevFun() {

  int j = blockIdx.x * blockDim.x + threadIdx.x;

// do here what you would do within the loop.

}

__global__ void KerFun() {

  // do here what you want to do only once.

}

int main() {

  // set up for kernel execution

  KerFun<<<1,1,0>>>();

  DevFun<<<20,32,0>>>();

}
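
To make the two-kernel pattern concrete, here is a minimal sketch. The kernel bodies in this thread are elided, so the work shown is invented: `KerFun` writes a single scale factor once, and `DevFun` uses it in each of the 640 threads. The names `scale`, `out`, and `runTwoKernels` are hypothetical, and error checking is omitted. Kernels launched into the same (default) stream run in launch order, so `DevFun` sees what `KerFun` wrote.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Launched with a single thread: the "only once" part of the work.
__global__ void KerFun(float *scale)
{
    *scale = 3.0f;                       // placeholder for the one-time code
}

// Launched with one thread per former loop iteration.
__global__ void DevFun(const float *scale, float *out, int n)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j < n) {                         // guard against surplus threads
        out[j] = (*scale) * j;           // placeholder for the loop body
    }
}

// Host helper (hypothetical name): runs both kernels and returns the results.
std::vector<float> runTwoKernels(int n)
{
    std::vector<float> host(n);
    float *scale = nullptr;
    float *out = nullptr;
    cudaMalloc(&scale, sizeof(float));
    cudaMalloc(&out, n * sizeof(float));

    KerFun<<<1, 1>>>(scale);             // runs once
    int threads = 32;
    int blocks = (n + threads - 1) / threads;
    DevFun<<<blocks, threads>>>(scale, out, n);   // runs after KerFun completes

    // cudaMemcpy on the default stream waits for both kernels to finish.
    cudaMemcpy(host.data(), out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(scale);
    cudaFree(out);
    return host;
}
```

With `n` set to 640, element j of the returned vector is 3 * j. Passing the one-time result through device memory avoids a round trip to the host between the two launches.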
