loop inside kernel

gpugpu · May 1, 2009, 4:35pm

Is it possible to have a loop of variable size in the CUDA kernel?

Thanks.

jesusgumbau · May 1, 2009, 4:38pm

Yes it is.
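A minimal sketch of what this can look like, assuming the loop bound is passed to the kernel as an ordinary argument. The names `sum_to_n` and `run_sum_to_n` are made up for illustration and do not come from this thread.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread sums the integers 0..n-1. The trip count n is a plain
// kernel argument, so it is only known when the kernel is launched.
__global__ void sum_to_n(int n, int *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int acc = 0;
    for (int i = 0; i < n; ++i)   // loop bound is a runtime value
        acc += i;
    out[tid] = acc;
}

// Hypothetical host helper: launches one block of `threads` threads
// and returns the value computed by thread 0.
int run_sum_to_n(int n, int threads = 32)
{
    int *d_out = nullptr;
    cudaMalloc(&d_out, threads * sizeof(int));
    sum_to_n<<<1, threads>>>(n, d_out);
    std::vector<int> h_out(threads);
    // cudaMemcpy waits for the kernel to finish before copying.
    cudaMemcpy(h_out.data(), d_out, threads * sizeof(int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return h_out[0];
}
```

Here every thread runs the same number of iterations, so the variable bound by itself causes no divergence between threads.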

ludx · May 3, 2009, 5:39pm

Yes. However, in my experience such a loop slowed the computation down massively. This is of course specific to the algorithm I run as a kernel; in my case the for loop would certainly be much faster on the CPU than on the GPU.

black_ij · May 3, 2009, 7:36pm

I guess it is worth a try, as it seems to be somewhat application-specific.

Just to clarify: can I have threads that each execute the same loop with a different number of iterations?

MisterAnderson42 · May 3, 2009, 9:18pm

Absolutely. Why would you think you wouldn't be able to?

Lermy · May 3, 2009, 9:26pm

Yes, that is completely possible. It is one way of exposing parallelism when you have too many elements to process and assigning exactly one thread per data element would exceed the resources of the device. Each thread can process a completely different group of elements, so you keep the device at its full processing capability.
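One common way to let each thread process a group of elements is a grid-stride loop. The sketch below assumes a simple element-wise scaling; `scale`, `run_scale` and the launch configuration are illustrative choices, not something given in this thread.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Grid-stride loop: each thread handles elements tid, tid + stride,
// tid + 2*stride, ... so a fixed-size grid can cover any n.
__global__ void scale(float *data, int n, float factor)
{
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        data[i] *= factor;
}

// Hypothetical host helper: copies the input to the device, scales it
// with a deliberately small grid, and returns the result.
std::vector<float> run_scale(std::vector<float> host, float factor,
                             int blocks = 2, int threads = 64)
{
    int n = static_cast<int>(host.size());
    float *d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemcpy(d_data, host.data(), n * sizeof(float),
               cudaMemcpyHostToDevice);
    scale<<<blocks, threads>>>(d_data, n, factor);
    cudaMemcpy(host.data(), d_data, n * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    return host;
}
```

With the default 2 blocks of 64 threads and 1000 elements, each thread loops 7 or 8 times, so the trip counts within a warp differ by at most one iteration.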

black_ij · May 4, 2009, 12:02am

I thought that if threads execute the same loop with different numbers of iterations, the SIMD paradigm could be violated.

Assume that the loop has reached its end in some threads while it has not finished in other threads of the same thread block …

So I want to know whether this can be implemented within the SIMD paradigm.

Lermy · May 4, 2009, 2:37am

Oh no, by no means would the paradigm be violated. Let's see: having loops of different lengths doesn't mean you aren't executing the same instruction over several different data. You will, however, lose some performance, because the application has to wait for the longer loops to finish, while the elements still waiting to be processed could have been distributed among other stream processors that had fewer elements. That's why it is not advised unless there is no other way to solve the problem and, like you said, it depends on the specific application.

Lermy · May 4, 2009, 2:47am

Sorry, I forgot to mention one case: if only one thread has a longer loop than the others, rather than several threads having long loops, then you are right, and SIMD is of course not being applied. In that case the SIMD architecture of the device would be completely wasted, with only one thread executing instructions on the multiprocessor.
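To make the discussion concrete, here is a small sketch in which every thread reads its own trip count from an array. The names `count_up` and `run_count_up` are hypothetical. The results are correct whatever the trip counts are; threads of a warp that finish early simply sit idle until the longest loop in that warp is done, which is where the performance loss described above comes from.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread runs the same loop, but with its own iteration count.
__global__ void count_up(const int *trips, int *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    int my_trips = trips[tid];        // per-thread loop bound
    int acc = 0;
    for (int i = 0; i < my_trips; ++i)
        ++acc;                        // stand-in for real per-element work
    out[tid] = acc;
}

// Hypothetical host helper: one thread per entry of `trips`;
// returns what each thread counted.
std::vector<int> run_count_up(const std::vector<int> &trips)
{
    int n = static_cast<int>(trips.size());
    int *d_trips = nullptr, *d_out = nullptr;
    cudaMalloc(&d_trips, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_trips, trips.data(), n * sizeof(int),
               cudaMemcpyHostToDevice);
    int threads = 64;
    int blocks = (n + threads - 1) / threads;
    count_up<<<blocks, threads>>>(d_trips, d_out, n);
    std::vector<int> out(n);
    cudaMemcpy(out.data(), d_out, n * sizeof(int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_trips);
    cudaFree(d_out);
    return out;
}
```

If the trip counts were, say, 0, 1, 5 and 100 for four threads of the same warp, that warp would stay busy for about 100 iterations, with three of the four threads idle for most of that time.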

hoixhoi · May 4, 2009, 9:59am

I have implemented a program with large loops in each thread, and if the loops are really very large, the program crashes.
