loop inside kernel

Is it possible to have a loop of variable size in the CUDA kernel?


Yes it is.

Yes, but in my experience such a loop slowed the computation down massively. That of course applies only to my particular algorithm running as a kernel. In other words, in my case the for loop was definitely much faster on the CPU than on the GPU.

I guess it is worth trying, since it seems to be application-specific.

Just to clarify on this, can I have threads that each can execute the same loop with variable length?

Absolutely. Why would you think you would not be able to?

Yes, that is completely possible. It is one way of exposing parallelism when you have too many elements to process and they would exceed the resources of the device if you assigned exactly one thread to each data element. Each thread may process a completely different group of elements, so you keep the device at its full processing capability.
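As a minimal sketch of what "each thread processes a group of elements" can look like, here is a loop whose trip count depends on a run-time size `n`. The kernel name `scale`, the helper `run_scale`, and the launch configuration of 4 blocks of 128 threads are made up for illustration and are not taken from anyone's code in this thread.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each thread handles elements i, i + stride, i + 2 * stride, ...
// The number of iterations depends on n, which is only known at run time.
__global__ void scale(float *data, int n, float factor)
{
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        data[i] *= factor;
    }
}

// Hypothetical host helper: runs the kernel and checks the result on the CPU.
bool run_scale(int n)
{
    std::vector<float> host(n, 1.0f);
    float *dev = nullptr;
    if (cudaMalloc(&dev, n * sizeof(float)) != cudaSuccess) return false;
    cudaMemcpy(dev, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // Deliberately launch far fewer threads than elements (assumed sizes).
    scale<<<4, 128>>>(dev, n, 2.0f);

    // cudaMemcpy waits for the kernel and reports any earlier launch error.
    cudaError_t err = cudaMemcpy(host.data(), dev, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i)
        if (host[i] != 2.0f) return false;
    return true;
}

int main()
{
    printf("%s\n", run_scale(100000) ? "ok" : "failed");
    return 0;
}
```

With 512 threads and 100000 elements, most threads run the loop 196 times and some only 195, so the loop length already differs slightly between threads.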

I thought that if threads execute the same loop with different numbers of iterations, the SIMD paradigm could be violated.

Assume that a loop reached its end in some threads while it is not finished in the other threads in the same thread block …

So, I want to know whether this can be implemented according to the SIMD paradigm?

Oh no, the paradigm is by no means violated. Having loops of different lengths does not mean you are no longer executing the same instruction over several different data elements. What happens is that you lose some of the performance gain, because the application has to wait for the longer loops to finish, when the elements still waiting to be processed could have been distributed among other stream processors that had fewer elements. That is why it is not advised unless there is no other way to solve the problem, and as you said, it depends on the specific application.

Sorry, I forgot to mention one case: if only one thread has a loop longer than the others, rather than several threads with long loops, then you are right, and SIMD of course does not apply there. In that case the SIMD architecture of the device would be completely wasted, with only one thread executing instructions on the multiprocessor.
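To make the discussion about differing loop lengths concrete, here is a small sketch in which every thread runs the same loop with its own iteration count. The names `sum_to` and `run_variable_loops` and the choice `counts[i] = i % 50` are assumptions made up for this example. The results come out correct even though the loop lengths differ. The cost, as far as I understand the hardware, is that threads in a warp that finish early sit idle until the longest loop in that warp is done.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Every thread runs the same loop, but with its own iteration count.
__global__ void sum_to(const int *counts, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    int count = counts[i];   // per-thread loop length
    int total = 0;
    for (int k = 1; k <= count; ++k)
        total += k;          // threads with a small count finish early
    out[i] = total;
}

// Hypothetical host helper: checks each result against count * (count + 1) / 2.
bool run_variable_loops(int n)
{
    std::vector<int> counts(n), out(n, -1);
    for (int i = 0; i < n; ++i)
        counts[i] = i % 50;  // loop lengths from 0 to 49 (assumed test data)

    int *d_counts = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_counts, n * sizeof(int)) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, n * sizeof(int)) != cudaSuccess) {
        cudaFree(d_counts);
        return false;
    }
    cudaMemcpy(d_counts, counts.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    sum_to<<<blocks, threads>>>(d_counts, d_out, n);

    // cudaMemcpy waits for the kernel and reports any earlier launch error.
    cudaError_t err = cudaMemcpy(out.data(), d_out, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_counts);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i)
        if (out[i] != counts[i] * (counts[i] + 1) / 2) return false;
    return true;
}

int main()
{
    printf("%s\n", run_variable_loops(1000) ? "ok" : "failed");
    return 0;
}
```

If the loop lengths are known up front, grouping elements with similar lengths into the same warp is one way to reduce the idle time, though whether that pays off depends on the application.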

I have implemented a program with large loops in each thread, and if the loops are really very large, the program crashes.