I’m working on adapting a neural net library to CUDA and I’m a bit confused about how to handle some situations.
Depending on the size and configuration of the network, the biggest restriction on the number of threads may well be the memory required for the input weights. For example, one net I'm testing has 4,096 inputs, so each neuron carries 4,096 weights; after overhead, that works out to a maximum of about 30 neurons that can be computed at a time (on my GeForce 9800 GTX).
In most cases an entire layer can be computed at once, but when the number of inputs is very large, the layer has to be processed in sections.
I’m a bit confused about how to launch the kernel.
I’m assuming that in the launch configuration <<<Dg, Db>>>, the grid dimension Dg and the block dimension Db can be runtime variables rather than compile-time constants.
Given a dynamic number of threads that depends on available memory, what’s the best way to lay out the grid and the threads per block?