Dynamically restricting # of threads

I’m working on adapting a neural net library to CUDA and I’m a bit confused about how to handle some situations.

Depending on the size and configuration of the network, it’s quite possible that the biggest restriction on the number of threads is going to be the memory requirements of the weights on the inputs. For example, with one net I’m testing, there are 4K inputs. Given that each neuron will have 4K weights associated with it, this works out to a maximum of about 30 neurons, after overhead, that can be calculated at a time (with my GeForce 9800 GTX).

In most cases, an entire layer can be calculated at once, but in situations where the # of inputs is really large, these need to be done in sections.

I’m a bit confused on how to launch the kernel method.

I’m assuming that in the <<<Dg, Db>>> launch configuration (grid size and threads per block), Dg and Db can be variables.

Assuming a dynamic # of threads, depending on memory, what’s the best way to layout the grid and threads per block?

Why only 30 neurons at a time? Maybe I’m missing something about how much memory you use, but you can easily allocate hundreds of MiB of global memory.

And yes, Dg, and Db are variables. Calculate what you need and launch a kernel of any size within the limits. Grid and thread layout depends heavily on the algorithm.
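For example, the grid size can be computed at runtime from the neuron count. A minimal host-side sketch (numBlocks is just a hypothetical helper name, not anything from the CUDA API):

```cpp
// Round the thread count up to whole blocks so Dg * Db covers every
// neuron; the extra threads at the end exit early inside the kernel.
int numBlocks(int numNeurons, int threadsPerBlock) {
    return (numNeurons + threadsPerBlock - 1) / threadsPerBlock;
}

// Usage on the host, with Db chosen as a tunable:
//   int db = 256;
//   int dg = numBlocks(numNeurons, db);
//   kernel<<<dg, db>>>(...);
```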

Each neuron essentially needs to read all inputs, sum them while applying some function along the way, and write a single output, correct? Two different grid/thread combinations immediately jump to my mind.
The simplest first: run one thread per neuron, with an arbitrary block size (adjust for maximum performance). This will probably only pan out performance-wise if you have 5,000+ neurons calculating at a time.

The 2nd method would be to run one block per neuron. Run at a block size of 128 or 256 (or 512 if you can keep the register usage low enough) and use a reduction to calculate the output of each neuron. See the reduction sample in the SDK and read the white paper that comes with it. This method could potentially perform well for as small as 50-100 neurons.

And keep in mind that data structures matter! (often much more than the number of FLOPs you perform) Depending on which block/thread setup you use, your threads will be reading memory in different patterns. So you need to choose the way your weights/inputs are structured in memory based on the block configuration so that they are read coalesced.
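As a concrete illustration of that point, suppose you use the one-thread-per-neuron setup. A hypothetical input-major weight layout makes neighboring threads read neighboring addresses at every step:

```cpp
#include <cstddef>

// Hypothetical layout: store the weight matrix input-major, so at a given
// input index, consecutive neurons (= consecutive threads) read
// consecutive addresses -- the coalesced pattern for one thread per neuron.
size_t weightIndex(size_t input, size_t neuron, size_t numNeurons) {
    return input * numNeurons + neuron;
}
```

With the transposed (neuron-major) layout, each thread would instead stride through memory by the input count, and the reads would not coalesce.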

Whether overall performance is better with a greater number of threads per block depends on the task. In the case of my task it is, so I try to launch as many threads per block as possible. Since the number of threads per block is limited by register usage and the amount of shared memory each thread requires, I’ve implemented two macros:

#define MAX_THREADS_PER_BLOCK_REGS(DeviceProps, RegUsage) \
	(((DeviceProps->regsPerBlock / (16 * RegUsage)) & ~3) * 16)

#define MAX_THREADS_PER_BLOCK_SHMEM(DeviceProps, SharedMemUsage, SharedMemPerThread) \
	(((((int)DeviceProps->sharedMemPerBlock - SharedMemUsage) / SharedMemPerThread) / \
	DeviceProps->warpSize) * DeviceProps->warpSize)

DeviceProps - pointer to the CUDA properties of the device,

RegUsage - the number of registers your kernel requires (compiling with -Xptxas -v or -cubin provides the shared memory and register usage),

SharedMemUsage - number of bytes of shared memory your kernel allocates by default,

SharedMemPerThread - number of bytes of shared memory your kernel requires per thread for actual work; this value may vary with the input data to be processed.

Final number of threads per block is min(MAX_THREADS_PER_BLOCK_REGS, MAX_THREADS_PER_BLOCK_SHMEM).
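Putting the two macros together, here is a self-contained sketch with hypothetical G80-era device numbers (the Props struct stands in for cudaDeviceProp, and in practice you'd also clamp the result to maxThreadsPerBlock):

```cpp
#include <algorithm>

// Minimal stand-in for the cudaDeviceProp fields the macros use.
struct Props { int regsPerBlock; int sharedMemPerBlock; int warpSize; };

#define MAX_THREADS_PER_BLOCK_REGS(DeviceProps, RegUsage) \
    (((DeviceProps->regsPerBlock / (16 * (RegUsage))) & ~3) * 16)

#define MAX_THREADS_PER_BLOCK_SHMEM(DeviceProps, SharedMemUsage, SharedMemPerThread) \
    (((((int)DeviceProps->sharedMemPerBlock - (SharedMemUsage)) / (SharedMemPerThread)) / \
      DeviceProps->warpSize) * DeviceProps->warpSize)

// Final block size = the tighter of the register and shared-memory limits.
int blockSize(const Props* p, int regUsage, int shmemStatic, int shmemPerThread) {
    return std::min(MAX_THREADS_PER_BLOCK_REGS(p, regUsage),
                    MAX_THREADS_PER_BLOCK_SHMEM(p, shmemStatic, shmemPerThread));
}
```

With 8192 registers per block and 10 registers per thread, the register limit works out to 768 threads; with 16 KiB of shared memory and 64 bytes per thread, the shared-memory limit is 256, so 256 wins.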

The number of blocks in the grid may vary. According to the programming guide, an SM may process no more than 8 blocks at a time. If the kernel does not rely on the index of the block in the grid, I’d recommend running 2 × NumberOfSMs or 4 × NumberOfSMs blocks, where NumberOfSMs is the number of streaming multiprocessors on your particular GPU (in this case your kernel should be functional on a grid of arbitrary size).

All the statements described above should be tested anyway - your kernel may do its best with other launch settings.

Hope this may help.

Thanks for the information. I hadn’t really considered register usage and that’s obviously something I’ll need to take into consideration.

Mr. Anderson asked why only 30 neurons at a time. That would be because I suck at math and I was off by a factor of 100. The actual number is closer to 3000 neurons (not taking into consideration other limitations), so I’m in much better shape there.

I’m still not entirely clear on the block/grid sizing stuff and I guess, to some degree, that stuff will probably become more clear as I’ve done more development. But I think part of my question has to do with how to parcel things out in a generic fashion.

Let’s say that my neural net is memory constrained and I can do a big block of threads and after those are done, I have a prime number of neurons left to execute, say 37.

How would I execute 37 threads?

I could do say 1 neuron per block and execute 37 blocks, but in the case where I’m executing large numbers of neurons, I’ll probably want more than 1 neuron per block.

The thing is, the neural net library is a library, so I can’t know, up-front, how many neurons are going to be used, so I could end up in situations where things aren’t easily divisible like this.

So how do you handle the sizing in a generic fashion like this? Does that make sense?



First, it is always best to run blocks in multiples of the warp size (which is 32 on all released hardware to date) because of the way warps are executed on the device.

As long as you have no __syncthreads for intra-block communication, you can always do something like this:

__global__ void kernel(int N_neurons, ....)
{
    int neuron = blockDim.x * blockIdx.x + threadIdx.x;

    // due to the block size, some threads will be past the end of N_neurons, quit now
    if (neuron >= N_neurons)
        return;

    // .. rest of calculation
}

Okay, that makes sense, but that raises a question regarding divergent execution that I’ve been wondering about. Let’s say you have some code that looks like this:

__global__ void myMethod(…)
{
    if (hasData) {
        … do work here …
    }
}

Let’s say the “if” statement covers all of the method and “hasData” is simply a flag indicating whether or not the thread has data to work with.

The threads will diverge. The ones that have data will do the work. What happens to the threads that don’t have data? Do they just exit early? What’s the penalty for that kind of divergence? Is there a penalty if some of the threads of a block exit early?


If all even threads have data and all odd threads don’t, then every warp will diverge in two and the calculation will potentially take twice as long. I say potentially because memory-bound calculations will leave the device with more than enough FLOP/s left over to hide the penalty due to divergence.

Divergences only happen at the warp level, so if you group threads that work and threads that don’t separately, the penalty is even less. In the case I gave, only one warp in one block will diverge and you will never even notice it.

Other than keeping in mind the thought of attempting to group threads together, divergences should be the last thing you optimize for (if at all).

Thanks, that’s a very useful thing to know. Really, at this point, optimizations are the last thing I’m going to address. But knowing where NOT to waste time optimizing is always helpful.

I just built a restricted Boltzmann machine that operates with a large number of missing values, and my approach was a block size equal to the number of hidden features/neurons (which in practice tops out at 256). I get something like a 25x speedup on an overclocked GTX 260 over a single-threaded CPU.

You get excellent memory read characteristics for the feature weights and the feature biases in this setup. I get fairly horrible characteristics for reading in the input data and visible unit biases because of the sparsity in my situation. There are a number of optimizations I can make to handle that but they lead to significant code complexity.

I devote one kernel to this, but there might be room for a multiple-kernel approach. Currently, some of the data I generate early in the process is written back to global memory (the negative-phase data reconstruction) and then read later when updating the weights. I suspect there is room to separate and optimize this into two kernels.