Assign instructions / algorithms to specific ALUs?

Hi everybody,

I am pretty new to CUDA and GPU programming. I read some material explaining the basic concepts.

However, I would like to know whether it is possible to assign a set of instructions, or a whole algorithm/binary, to a specific ALU, so that I can be sure these instructions are executed only by that ALU (thereby bypassing the system that automatically takes care of parallelization)?

Best regards,
André

Hi,

You could launch a kernel with one block and one thread, but if you are trying to do that you should ask yourself why, because you are giving up the strength of a GPU. A single thread on a GPU will run significantly slower than the equivalent thread on a CPU. If you can give a few more details about exactly what you are trying to do, people may be able to suggest alternatives that leverage the strengths of the GPU.

The parallelization on a GPU works like this: you provide a set of instructions that each of N threads will run through, and then you launch as many of those threads as you need to work through your entire data set efficiently.
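To make the "one block, one thread" suggestion concrete, here is a minimal sketch. The kernel and variable names (`scaleSerial`, `d_in`, `d_out`) are made up for illustration; the point is only that a `<<<1, 1>>>` launch serializes the work in a single thread.

```cuda
#include <cstdio>

// One thread walks the whole array; nothing runs in parallel.
__global__ void scaleSerial(const float *in, float *out, int n)
{
    for (int i = 0; i < n; ++i)
        out[i] = in[i] * 2.0f;
}

int main()
{
    const int n = 1024;
    float h_in[n], h_out[n];
    for (int i = 0; i < n; ++i) h_in[i] = (float)i;

    float *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    scaleSerial<<<1, 1>>>(d_in, d_out, n);   // one block, one thread
    cudaDeviceSynchronize();

    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("out[10] = %f\n", h_out[10]);

    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
```

Note that even here you do not choose *which* SM or ALU runs the thread; the hardware scheduler decides that.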

Hi,

thanks for the reply. I intentionally do not want to use the strength of the GPU. Rather, I have to "benchmark" the individual ALUs on a GPU with respect to potential differences in computing latency among them. Thus, I want to assign a (similar) set of instructions to several specific ALUs, measure the time needed to execute this set of instructions, and compare the results to see if there are any differences.

In general, I want to check a GPU for certain sources of race conditions. The first one I thought of is a potential, minuscule difference in the execution speed of the different ALUs. Maybe you guys know of other potential sources of race conditions.

However, since my goal is rather diametrically opposed to the typical use of a GPU (parallelization, etc.), it is difficult for me to see how I can access an individual ALU at a low level with the common tools.
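One way to approach the timing part of this, sketched under the assumption that per-block cycle counts are good enough: each block runs the same fixed instruction sequence once and records its duration with `clock64()`. The kernel and variable names (`timeWorkload`, `sink`) are made up for illustration.

```cuda
#include <cstdio>

// Each block times one fixed arithmetic workload in GPU clock cycles.
__global__ void timeWorkload(long long *cycles, float *sink)
{
    long long start = clock64();

    float x = 1.0f;
    #pragma unroll 1
    for (int i = 0; i < 10000; ++i)      // fixed workload to time
        x = x * 1.000001f + 0.000001f;

    long long stop = clock64();

    cycles[blockIdx.x] = stop - start;
    sink[blockIdx.x] = x;                // keep x live so the loop
                                         // is not optimized away
}

int main()
{
    const int blocks = 8;
    long long *d_cycles; float *d_sink;
    cudaMalloc(&d_cycles, blocks * sizeof(long long));
    cudaMalloc(&d_sink,   blocks * sizeof(float));

    timeWorkload<<<blocks, 1>>>(d_cycles, d_sink);  // one thread per block
    cudaDeviceSynchronize();

    long long cycles[blocks];
    cudaMemcpy(cycles, d_cycles, sizeof(cycles), cudaMemcpyDeviceToHost);

    for (int b = 0; b < blocks; ++b)
        printf("block %d: %lld cycles\n", b, cycles[b]);

    cudaFree(d_cycles); cudaFree(d_sink);
    return 0;
}
```

One caveat: the `clock64()` counter is per-SM, so the absolute start/stop values from different SMs are not comparable; only the durations (stop minus start) are meaningful to compare across blocks.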

Best regards,
André

One idea to make sure that the same set of instructions is processed by different ALUs:

To the best of my knowledge, all threads of the same thread block are executed on the same streaming multiprocessor (SM). Thus, if I assign a set of instructions to several threads that are all in different blocks, shouldn't this guarantee that these instructions are computed by different SMs and thus by different ALUs?
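This assumption can be checked empirically rather than relied upon: the scheduler may place several blocks on the *same* SM, so different blocks are not guaranteed to land on different SMs. A sketch that reads the PTX special register `%smid` (which reports the SM a block is currently running on) per block; the kernel name `whichSm` is made up for illustration.

```cuda
#include <cstdio>

// Each block records the id of the SM it is running on.
__global__ void whichSm(unsigned *smids)
{
    unsigned sm;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(sm));  // inline PTX read
    smids[blockIdx.x] = sm;
}

int main()
{
    const int blocks = 16;
    unsigned *d_smids;
    cudaMalloc(&d_smids, blocks * sizeof(unsigned));

    whichSm<<<blocks, 1>>>(d_smids);   // one thread per block
    cudaDeviceSynchronize();

    unsigned smids[blocks];
    cudaMemcpy(smids, d_smids, sizeof(smids), cudaMemcpyDeviceToHost);

    for (int b = 0; b < blocks; ++b)
        printf("block %2d ran on SM %u\n", b, smids[b]);

    cudaFree(d_smids);
    return 0;
}
```

If two blocks report the same SM id, they shared an SM; note also that `%smid` is documented as potentially changing over a block's lifetime (e.g. after preemption), so it is a diagnostic aid, not a placement guarantee.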

Best regards,
André