Running Just One Thread on each Core

Hello

My scenario is to run just one thread, or one kernel, on each core.
The main reason is that I am not running a truly parallel operation; my operation simply takes too long (about one day), and my goal is to get the maximum number of results out of each run. In other words, I need to run this long operation exactly once per core, because it contains a huge while loop and on a normal CPU it fully occupies one CPU core.
For example: if my GPU contains 48 cores, how can I run my kernel exactly 48 times, with each instance running on a different core?

thanks

You can’t do that. The basic scheduling unit is a warp, which consists of 32 threads. So if you want to run completely independent tasks, you can effectively use only one of these 32 threads, and you have to throw away 31/32, or 96.875%, of the computational power of your GPU. Some basic arithmetic shows that in that scenario a multicore CPU will be faster under any circumstances.

I think Nvidia’s marketing department didn’t do themselves a favor when they decided to rename the ALU to “core”, planting a wrong impression of what a GPU is. Now that they have succeeded (judging from the rate at which similar questions pop up on the forums), anybody they managed to impress starts off with a massive disappointment once they find out what CUDA is actually capable of.

Thanks for the help.
As I said before, my process (kernel) will take all of a core's performance, and what I want is to run exactly one thread on each core.
Is it possible to know which core a thread is running on, so that I perform my job only once per core and kill the other threads immediately, like this:

if (coreid == 1)   // or 32, or 64, or 96 …
    // perform some operation
else
    // exit kernel

???

If, as you said, the scheduler runs exactly 32 threads on each core, then I will use only the first thread and let the other 31 exit immediately; the one running thread will then consume the full performance of its core!

Threads are not assigned to cores, because “CUDA Cores” are more like fancy ALUs. The SM has a scheduler that identifies warps available for execution (i.e. not waiting on memory reads or other things) and issues the next instruction for the warp to a group of CUDA Cores (8 or 16 depending on your device) that process it for all 32 threads. Two or four clocks later (again, depending on device), the scheduler issues another instruction, probably from a different warp.

Because of the pipeline depth of the CUDA cores (16 to 20 stages? I lost track), the next instruction is issued before the previous one has finished. If you don't have enough threads, the scheduler will not have another warp available for execution and will stall, so having too few threads can actually reduce throughput. This is totally different from a CPU core, where there is often significant operating-system overhead in context switching between threads, and maximum compute throughput is achieved by matching the number of threads to the number of cores.
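
As a minimal sketch of what the hardware actually exposes (the kernel name is just for illustration, and it assumes a device of compute capability 2.0 or later so device-side printf is available), each thread can compute which warp and lane it belongs to; there is no "core ID" to query:

#include <cstdio>

__global__ void whoAmI()
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    int warp = threadIdx.x / 32;                       // warp index within the block
    int lane = threadIdx.x % 32;                       // lane (0-31) within the warp
    printf("thread %d -> warp %d, lane %d\n", tid, warp, lane);
}

// e.g. whoAmI<<<1, 64>>>(); launches 64 threads = two complete warps

The scheduler always issues instructions for whole warps, never for individual "cores", which is why a one-thread-per-core scheme cannot be expressed in CUDA.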

So what you really want is to run the same program multiple times with different data (or conditions)?

How many different instances (data and/or conditions) do you want to run?

PS: CUDA prefers to be running thousands of threads at once. Do you have hundreds or thousands of instances?

I need to run only 30 threads!! Is there any problem? Why can't those threads be distributed, each one on a different core?!
Suppose I am calculating x², n times, and I want a single kernel to perform all n iterations.
If n = 100000000, what would happen? The thread may take about an hour, and so the core will be busy performing that thread!

What I want to say is that I must be able to distribute my threads according to the kernel's job. If I cannot split my kernel into sub-kernels, does that mean I cannot use CUDA???

Yes there is a problem. CUDA fundamentally doesn’t work like that.

If you were doing that in CUDA you would launch 100000000 threads, not a single thread containing a large loop.
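
For the x² example above, here is a minimal sketch of the usual CUDA approach (the kernel name, launch sizes, and omitted error checking are all illustrative): launch one thread per element instead of one thread looping n times.

#include <cuda_runtime.h>

// Each thread squares exactly one element; no thread contains a long loop.
__global__ void squareAll(const float *x, float *y, long long n)
{
    long long i = blockIdx.x * (long long)blockDim.x + threadIdx.x;
    if (i < n)                        // guard the last, partially filled block
        y[i] = x[i] * x[i];
}

// Host side, for n = 100000000 elements already in device memory as d_x, d_y:
//   int block = 256;
//   int grid  = (int)((n + block - 1) / block);   // roughly 390625 blocks
//   squareAll<<<grid, block>>>(d_x, d_y, n);
// (On old devices limited to 65535 blocks per grid dimension, the grid would
//  need to be split, made two-dimensional, or turned into a grid-stride loop.)

Each thread does a tiny amount of work, and the scheduler keeps the hardware busy by switching between warps while memory loads are in flight. That is the kind of decomposition the GPU is built for.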

It sounds like either your problem is a very poor fit for the CUDA programming and execution model, or you are being overly prescriptive about how you think your problem should be solved, without really understanding how CUDA works.

Thanks for the help