Varying the number of cores at runtime

saiyedul · November 7, 2009, 9:26am

Can we restrict our program to run on only a specified number of processor cores?
Actually, I am doing a project on showing the speedup achieved by the same algorithm on different numbers of processing cores.
Does C for CUDA provide any way of doing this?

Please, can anybody reply?

avidday · November 7, 2009, 9:47am

The short answer is no.

In CUDA, the hardware details are completely abstracted away and the programmer really only controls the number of threads to be run. The hardware itself decides how that thread count is translated into hardware execution parameters, and that translation can change depending on the hardware generation you are running on.

Cygnus_X1 · November 7, 2009, 10:00am

I was told you can use RivaTuner to do that. Originally it was designed for overclocking, but you can do various things with it, and disabling some SMs is likely one of its functions. If I am not mistaken, you can also downclock the cores or the memory to see what impact these have on your execution and, in a way, check whether your kernels are compute- or bandwidth-limited.

Never used it so far, but I will probably try it one day too…

saiyedul · November 7, 2009, 10:14am

Hey, actually I have a 9400GT (2 multiprocessors, 16 SMs). Originally I had planned to run my program on 2, 4, 8 and 16 SMs and watch the difference in run time.

So what should I do now?

avidday · November 7, 2009, 10:19am

Give up, because you can’t do that. Scheduling and execution control happen at the multiprocessor level - there is no way to have finer-grained scheduling than that.

cbuchner1 · November 7, 2009, 10:43am

Giving up? That word is not known to me.

If you take one card with, say, 12 SMs like the 9600GSO, you can easily launch a grid consisting of hundreds of blocks, where you hardcode the first N blocks to wait while the rest do the real work (using spinlocks on global atomics, for example).

    if (blockIdx.x < N)
    {
        // spinlock until algorithm is finished
    }
    else
    {
        // do real work based on blockIdx.x - N as "true" block index.
    }

N should be < 12 of course. So you can have your algorithm execute on 12-N SMs.
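The outline above could be fleshed out roughly as follows. This is only a sketch under stated assumptions, not Christian's actual code: the names `throttledKernel`, `trueBlockIndex`, `idleBlocks` and `finishedBlocks` are made up for illustration, the "real work" is a placeholder, and the launch shown does not by itself force one block per SM. It also assumes the low-numbered blocks are scheduled first and that `idleBlocks` is smaller than the number of blocks the GPU can keep resident; otherwise the spinning blocks could starve the workers and the kernel would hang.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: map a hardware block index to the "true" block index.
__host__ __device__ inline unsigned int trueBlockIndex(unsigned int blockIdxX,
                                                       unsigned int idleBlocks)
{
    return blockIdxX - idleBlocks;
}

// The first `idleBlocks` blocks only spin; the remaining blocks do the work.
__global__ void throttledKernel(float *data, int n, unsigned int idleBlocks,
                                unsigned int *finishedBlocks)
{
    const unsigned int workers = gridDim.x - idleBlocks;

    if (blockIdx.x < idleBlocks) {
        // Placeholder block: thread 0 polls the counter until all workers are done.
        if (threadIdx.x == 0) {
            while (atomicAdd(finishedBlocks, 0u) < workers) { }
        }
        return;
    }

    // Worker block: placeholder computation over a strided range.
    const unsigned int block = trueBlockIndex(blockIdx.x, idleBlocks);
    for (int i = block * blockDim.x + threadIdx.x; i < n; i += workers * blockDim.x)
        data[i] = 2.0f * data[i] + 1.0f;

    // Signal completion once per worker block.
    __syncthreads();
    if (threadIdx.x == 0) {
        __threadfence();
        atomicAdd(finishedBlocks, 1u);
    }
}

int main()
{
    const int n = 1 << 20, threads = 256;
    const unsigned int idleBlocks = 1, workerBlocks = 64;  // illustrative values

    float *data = nullptr;
    unsigned int *finished = nullptr;
    if (cudaMalloc(&data, n * sizeof(float)) != cudaSuccess ||
        cudaMalloc(&finished, sizeof(unsigned int)) != cudaSuccess) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }
    cudaMemset(data, 0, n * sizeof(float));
    cudaMemset(finished, 0, sizeof(unsigned int));

    throttledKernel<<<idleBlocks + workerBlocks, threads>>>(data, n, idleBlocks, finished);
    cudaError_t err = cudaDeviceSynchronize();

    float first = 0.0f;
    cudaMemcpy(&first, data, sizeof(float), cudaMemcpyDeviceToHost);
    printf("status: %s, data[0] = %.1f\n", cudaGetErrorString(err), first);

    cudaFree(data);
    cudaFree(finished);
    return err == cudaSuccess ? 0 : 1;
}
```

Polling with `atomicAdd(ptr, 0)` is just one way to read the counter without the compiler caching it; a `volatile` load would be another option.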

However, you have to make sure that one SM only executes one block at a time, for example by allocating more than 8192 bytes of shared memory per block.

Or by using enough registers to not allow for a second block.
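Whether a given shared-memory size really limits an SM to one block depends on the GPU, so it may be worth asking the runtime rather than relying on a fixed number. The sketch below uses `cudaOccupancyMaxActiveBlocksPerMultiprocessor`, which only exists in CUDA toolkits much newer than this thread; `paddedKernel` and `residentBlocksPerSM` are hypothetical names, and the "just over half of the SM's shared memory" choice is an assumption that may be clamped by the per-block limit on some devices.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel that uses dynamic shared memory sized at launch time.
__global__ void paddedKernel(float *out)
{
    extern __shared__ float scratch[];
    scratch[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    if (threadIdx.x == 0)
        out[blockIdx.x] = scratch[blockDim.x - 1];
}

// Returns how many blocks of paddedKernel can be resident on one SM for the
// given dynamic shared-memory size, or -1 if the query fails.
static int residentBlocksPerSM(size_t smemBytes, int threads)
{
    int blocks = 0;
    cudaError_t err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocks, paddedKernel, threads, smemBytes);
    return err == cudaSuccess ? blocks : -1;
}

int main()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        fprintf(stderr, "no usable CUDA device\n");
        return 1;
    }

    const int threads = 128;

    // Ask for just over half of one SM's shared memory so that two blocks
    // cannot fit, but stay within what a single block may request.
    size_t smem = prop.sharedMemPerMultiprocessor / 2 + 1024;
    if (smem > prop.sharedMemPerBlock)
        smem = prop.sharedMemPerBlock;

    printf("no padding:       %d block(s) per SM\n", residentBlocksPerSM(0, threads));
    printf("%zu bytes padding: %d block(s) per SM\n", smem,
           residentBlocksPerSM(smem, threads));
    return 0;
}
```

If the second line does not report 1, the padding (or the register usage) would need to be increased before the idle-block trick isolates whole SMs.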

Cheers, you can quote me when you get the Nobel prize for your work.

Christian

saiyedul · November 7, 2009, 12:34pm

Thanks, buddy, for your idea. But I am totally new to parallel programming and CUDA (in fact I am a 3rd-year engineering graduate student), so could you please explain your method a bit more, or suggest some further reading on this particular problem?

avidday · November 7, 2009, 12:59pm

What he is suggesting is that you write code which runs on all cores, but simulates running on fewer through a combination of manipulating execution parameters to precisely control how many threads run on each multiprocessor, and communication between the threads to determine which of them actually do computations. Through careful instrumentation you could deduce the approximate scalability of the algorithm on 2N cores, where N could be 1, 2, 4 or 8 for your 2 multiprocessor GPU.

It isn’t trivial and it isn’t really what you are asking, just an approximation of it. CUDA isn’t like MPI or OpenMP where you can just pick a number of processes or threads and their affinity at runtime and the code itself need not know anything about it.
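As a rough idea of what the instrumentation might look like, the sketch below times a kernel with CUDA events and reports speedup relative to a one-block baseline. It swaps in a simpler stand-in for the technique discussed above: it varies the number of launched blocks and assumes the scheduler places each block on a different SM, which is not guaranteed. `scaleKernel`, `timeLaunch` and `speedup` are hypothetical names and the workload is a placeholder.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder workload: each block strides over the array.
__global__ void scaleKernel(float *data, int n)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        data[i] = 2.0f * data[i] + 1.0f;
}

// Speedup of a run relative to the baseline time.
static float speedup(float baselineMs, float ms)
{
    return baselineMs / ms;
}

// Time one launch of scaleKernel in milliseconds using CUDA events.
static float timeLaunch(float *data, int n, int blocks, int threads)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    scaleKernel<<<blocks, threads>>>(data, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const int n = 1 << 22, threads = 256;
    float *data = nullptr;
    if (cudaMalloc(&data, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }
    cudaMemset(data, 0, n * sizeof(float));

    timeLaunch(data, n, 1, threads);                    // warm-up, result discarded
    const float base = timeLaunch(data, n, 1, threads); // one-block baseline

    for (int blocks = 1; blocks <= 16; blocks *= 2) {
        const float ms = timeLaunch(data, n, blocks, threads);
        printf("%2d block(s): %.3f ms, speedup %.2fx\n",
               blocks, ms, speedup(base, ms));
    }

    cudaFree(data);
    return 0;
}
```

Single measurements like these are noisy; averaging several runs per configuration would give more trustworthy numbers.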

Cygnus_X1 · November 7, 2009, 1:12pm

I still believe RivaTuner is the simplest way to do it. Try googling it: it is a program which runs in the background and allows you to tweak a lot of parameters of your GPU. Just be careful, because if you set your GPU to do too much too fast you may simply burn it!

From my understanding you don’t have to change which multiprocessors are running at runtime, but rather between program launches.

Btw. “SM” stands for “Streaming Multiprocessor”. I understand you want to try running it on one or both of them on your GPU.

I think that tweaking it so much that you use fewer than all the SPs (scalar processors) of your SM wouldn’t be that useful. What for? So far warp size does not change between different GPUs, while SM count does.
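Both numbers can be read (though not set) from the runtime. A minimal sketch, assuming the GPU of interest is device 0; the helper name `smCount` is made up for this example:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Returns the number of multiprocessors on the given device, or -1 on error.
static int smCount(int device)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess)
        return -1;
    return prop.multiProcessorCount;
}

int main()
{
    const int device = 0;  // assumption: the first GPU in the system
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("%s: %d multiprocessor(s), warp size %d\n",
           prop.name, smCount(device), prop.warpSize);
    return 0;
}
```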

P.S.

I like this way of thinking :D