Can I use a MIMD execution model in CUDA, as on a CPU?

Hi everyone.
Can I use a MIMD (multiple instruction, multiple data) execution model in CUDA, as on a CPU?
(I want to handle branch divergence efficiently.)

I’m porting my code with OpenACC.
While working on it, I found that my code causes branch divergence, which leaves threads in the ‘inactive’ state much of the time
(over 80% of the total running time).
But I can’t rewrite my whole code, so I want to handle branch divergence efficiently.

I heard that the CUDA execution model is SIMT with warps of 32 threads. In this architecture, a thread that diverges at a branch blocks the other threads in its warp
until the divergent instructions finish executing.
So I think I need to run the GPU with a MIMD execution model, like threads on a CPU. (I’m trying to find out how.)

I found some documents, but their contents seem to disagree.

From ‘CUDA Programming Guide’
A multiprocessor is designed to execute hundreds of threads concurrently. To manage such a large amount of threads, it employs a unique architecture called SIMT (Single-Instruction, Multiple-Thread) that is described in SIMT Architecture. The instructions are pipelined to leverage instruction-level parallelism within a single thread, as well as thread-level parallelism extensively through simultaneous hardware multithreading as detailed in Hardware Multithreading. Unlike CPU cores they are issued in order however and there is no branch prediction and no speculative execution.

But from ‘GPU Gems 2’ (a book)
The latest GPUs, such as the NVIDIA GeForce 6 Series, have similar branch instructions, though their performance characteristics are slightly different. Older GPUs do not have native branching of this form, so other strategies are necessary to emulate these operations.
The two most common control mechanisms in parallel architectures are single instruction, multiple data (SIMD) and multiple instruction, multiple data (MIMD). All processors in a SIMD-parallel architecture execute the same instruction at the same time; in a MIMD-parallel architecture, different processors may simultaneously execute different instructions. There are three current methods used by GPUs to implement branching: MIMD branching, SIMD branching, and condition codes.
MIMD branching is the ideal case, in which different processors can take different data-dependent branches without penalty, much like a CPU. The NVIDIA GeForce 6 Series supports MIMD branching in its vertex processors.

Which is true?
Can I use a MIMD execution model on an NVIDIA GPU with CUDA?
If so, how can I use it?

Thank you very much for all your replies. :)

Ubuntu 14.04 LTS
CUDA 7.0
NVIDIA GeForce 960
compiler : PGI 15.7 (for OpenACC)

At the CUDA thread level, the only CUDA GPU branching mechanisms are SIMD (32-way) or predication (or a combination of the two). If you arrange for the branch divergence to occur on a 32-way boundary (warp boundary) or perhaps a threadblock boundary, most of the penalties associated with branching can be mitigated. Warp specialization and threadblock specialization are two techniques that extend this concept and could be used to get “MIMD”-like capability. However, these specialization techniques are not “typical” CUDA programming, and current PGI OpenACC compilers don’t know how to generate warp or threadblock specialization automatically.
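To make the warp-boundary point concrete, here is a minimal host-side C++ sketch. It is a model, not device code: it only counts how many warps would contain threads that disagree on a branch. The names `count_divergent_warps`, `by_parity`, and `by_warp` are made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Host-side model only: counts the warps in which at least two threads
// disagree on a branch predicate. A warp is a run of warp_size thread IDs.
std::size_t count_divergent_warps(
    std::size_t num_threads, std::size_t warp_size,
    const std::function<bool(std::size_t)>& takes_branch) {
    assert(warp_size > 0);
    std::size_t divergent = 0;
    for (std::size_t base = 0; base < num_threads; base += warp_size) {
        const std::size_t end =
            base + warp_size < num_threads ? base + warp_size : num_threads;
        const bool first = takes_branch(base);
        for (std::size_t tid = base + 1; tid < end; ++tid) {
            if (takes_branch(tid) != first) {
                ++divergent;  // this warp would execute both paths
                break;
            }
        }
    }
    return divergent;
}

// Branch on thread parity: every warp mixes both paths.
bool by_parity(std::size_t tid) { return tid % 2 == 0; }

// Branch on warp-index parity (assumes 32-thread warps): all threads
// of a given warp take the same path.
bool by_warp(std::size_t tid) { return (tid / 32) % 2 == 0; }
```

With 128 threads and 32-thread warps, `by_parity` makes all 4 warps divergent, while `by_warp` makes none divergent, even though both predicates send half of the threads down each path.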

The GPU Gems 2 book you are referencing was published in 2005, and it predates CUDA. The reference to vertex processors there has no analog in CUDA. Likewise, the specific GPU families referenced are not CUDA-capable GPUs.

So, the CUDA programming guide is “true” with respect to CUDA. The GPUGems 2 book refers to technology that is no longer commercially available.

Thank you, txbob. :)
From now on I will trust only the CUDA reference.

And can I ask one more question, if you don’t mind?
Is there any way to improve performance in the regions with branch divergence using OpenACC?
As you said, the PGI OpenACC compiler offers no way to specify
block and thread indices. (For that matter, the OpenACC 2.0 spec doesn’t provide one either.)

Is there any way for you to re-group your threads, such that threads belonging to the same warp are likely to follow the same branches?

This can be the case for ray tracing algorithms, where one CUDA warp processes several similar rays (such as neighboring sub-pixel rays when rendering oversampled images). Initially, most of these rays would intersect the same polygons and get scattered and dispersed in a similar way.
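One way to picture the re-grouping is to reorder the work items by the branch they will take before launching the parallel loop, so that neighboring items (and hence neighboring threads) agree. The following is a minimal host-side C++ sketch under that assumption. `regroup_by_branch` and the `branch_key` array are hypothetical names, and computing such a key cheaply up front is itself an assumption that may not hold for every code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Returns a processing order in which items with the same branch key are
// contiguous. branch_key[i] identifies the path item i will take.
std::vector<std::size_t> regroup_by_branch(const std::vector<int>& branch_key) {
    std::vector<std::size_t> order(branch_key.size());
    std::iota(order.begin(), order.end(), std::size_t{0});  // 0, 1, 2, ...
    // Stable sort keeps the original relative order within each group,
    // which helps preserve whatever memory locality the input had.
    std::stable_sort(order.begin(), order.end(),
                     [&](std::size_t a, std::size_t b) {
                         return branch_key[a] < branch_key[b];
                     });
    assert(order.size() == branch_key.size());
    return order;
}
```

The parallel loop would then process item `order[i]` at iteration `i`. The indirection costs extra memory traffic, so whether this pays off depends on how expensive the divergent branches are.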

I’ve done that, but only for the simpler case where you just want to skip certain items or split them into just 2 groups (each with the same branch behavior). That’s called stream compaction.
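As a rough illustration of stream compaction, here is a host-side C++ sketch of the usual scan-then-scatter formulation. It is serial and only mirrors the structure a GPU version would have. The function name `compact` and the `keep` flag array are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Keeps items[i] where keep[i] != 0, preserving order.
std::vector<int> compact(const std::vector<int>& items,
                         const std::vector<int>& keep) {
    assert(items.size() == keep.size());
    // Step 1: an exclusive prefix sum over the flags gives each surviving
    // item its output slot.
    std::vector<std::size_t> slot(items.size());
    std::size_t total = 0;
    for (std::size_t i = 0; i < items.size(); ++i) {
        slot[i] = total;
        total += keep[i] ? 1 : 0;
    }
    // Step 2: scatter. Every survivor writes to a distinct slot, so in a
    // parallel version each item could be handled by its own thread.
    std::vector<int> out(total);
    for (std::size_t i = 0; i < items.size(); ++i) {
        if (keep[i]) out[slot[i]] = items[i];
    }
    return out;
}
```

Running it once with the flags and once with the inverted flags yields the two groups, each of which can then be processed by a loop with uniform branch behavior.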

I follow computer architecture advances closely, and there is talk about improving performance on divergent code. One method is temporal SIMD, which executes a vector instruction on a scalar execution unit over multiple cycles and keeps the same saving as conventional SIMD from issuing fewer instructions. If not all SIMD lanes execute an instruction, then the next instruction can be executed sooner.