Running Just One Thread on Each Core


My scenario is to run just one thread (one kernel instance) on each core.
The main reason is that I am not running a truly parallel operation; rather, my operation takes very long (about one day), and my goal is to get the maximum number of results per run. That is, I need to run this long operation just once per core, because it contains a huge while loop and on a normal CPU it saturates a full CPU core.
For example: if my GPU contains 48 cores, how can I run my kernel exactly 48 times, with each instance running on a different core??


You can’t do that. The basic scheduling unit is a warp, which consists of 32 threads. So if you want to run completely independent tasks, this means you can use just one of these 32 threads, and have to throw away 31/32 or 96.875% of the computational power of your GPU. Some basic arithmetic shows that in that scenario a multicore CPU will be faster under any circumstances.
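To make that waste concrete, here is a minimal sketch (names and structure are illustrative, not from the original posts) of what "one independent task per warp" would look like: only lane 0 of each warp does work, and the other 31 lanes sit idle in lockstep for the whole kernel.

```cuda
__global__ void one_task_per_warp(const float *in, float *out, int n_tasks)
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int warp = tid / 32;   // one logical task per warp
    int lane = tid % 32;   // this thread's position within its warp

    // Only lane 0 does useful work; lanes 1..31 occupy hardware
    // but contribute nothing, wasting 31/32 of the warp's capacity.
    if (lane == 0 && warp < n_tasks) {
        float x = in[warp];
        // ... long-running per-task computation would go here ...
        out[warp] = x * x;
    }
}
```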

I think Nvidia’s marketing department didn’t do themselves a favor when they decided to rename the ALU to “core”, planting a wrong impression of what a GPU is. Judging from the rate at which similar questions pop up on the forums, they have succeeded, and anyone they impressed is in for a massive disappointment on finding out what CUDA is actually capable of.

Thanks for the help.
As I said before, my process (kernel) will consume a core’s full performance, and what I want is to run exactly one thread on each core.
Is it possible to know which core a thread is running on, so that I perform my job only once per core and kill the other threads directly? Like this:

if (coreid == 1) // or 32, or 64, or 96 …
    // perform some operation
// exit kernel


If, as you said, the scheduler runs exactly 32 threads on each core, then I will use only the first thread and the other threads will finish immediately; the one running thread will consume the full performance of its core!!

Threads are not assigned to cores, because “CUDA Cores” are more like fancy ALUs. The SM has a scheduler that identifies warps available for execution (i.e. not waiting on memory reads or other things) and issues the next instruction for the warp to a group of CUDA Cores (8 or 16 depending on your device) that process it for all 32 threads. Two or four clocks later (again, depending on device), the scheduler issues another instruction, probably from a different warp.

Because of the pipeline depth in the CUDA Cores (16 to 20 stages? I lost track), the next instruction is issued before the previous one has finished. If you don’t have enough threads available, the scheduler will not have another warp available for execution and stall. Having too few threads actually can reduce throughput. This is totally different than a CPU core, where there is often significant operating system overhead in context switching between threads, and maximum compute throughput is achieved by matching the number of threads to the number of cores.
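As a rough illustration of this point, a typical launch configuration deliberately creates far more threads than there are “cores”, so the scheduler always has a warp ready to issue while other warps wait on memory. This sketch uses made-up sizes and a hypothetical kernel name:

```cuda
// Illustrative sizes: 256 threads per block = 8 warps per block,
// giving each SM many resident warps to hide pipeline and memory latency.
const int threadsPerBlock = 256;
const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
my_kernel<<<blocks, threadsPerBlock>>>(d_in, d_out, n);
```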

So do you really want to run the same program with different data (or conditions) multiple times?

How many different instances (data and/or conditions) do you want to run?

PS: CUDA prefers to run thousands of threads at once. Do you have hundreds or thousands of instances?

I need to run only 30 threads!! Is there any problem? Why can’t those threads be distributed one per core!!!
Suppose I am calculating x², n times, and I want a single kernel instance to compute all n iterations.
If n = 100000000, what happens? The thread may take about an hour, and so the core will be busy performing this thread!!

What I want to say is that I must be able to distribute my threads according to the kernel’s job. If I am not able to split my kernel into subkernels, does that mean I cannot use CUDA???

Yes there is a problem. CUDA fundamentally doesn’t work like that.

If you were doing that in CUDA you would launch 100000000 threads, not a single thread containing a large loop.
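Restructured the CUDA way, the x² example above would look something like this sketch, assuming the n computations really are independent of one another: each thread handles one element, instead of one thread looping n times.

```cuda
__global__ void square_all(const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = x[i] * x[i];  // one element per thread, no long loop
}

// Host side: launch enough threads to cover all n elements, e.g.
//   square_all<<<(n + 255) / 256, 256>>>(d_x, d_y, n);
```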

It sounds like either your problem is a very poor fit for the CUDA programming and execution model, or you are being overly prescriptive about how you think your problem should be solved, without really understanding how CUDA works.

Thanks for the help