Reworking Library to launch a graph for Deep Neural Network

Hello,

I am relatively new to CUDA and have spent the last couple of months optimising a closed-loop deep learning library for a university project. I have found that the main remaining performance bottleneck is kernel launch overhead, and I am trying to address it using CUDA graphs.

Basically, my current application launches a series of kernels for a single iteration of learning, following the steps shown below.

Overall, I currently launch four kernels for every layer in the network, which makes the total launch overhead far too high.

I would like to use stream capture if possible, but I need some way of handling the fact that the inputs to each kernel change after each preceding kernel launch.

I have a few key questions:

  1. Do I need to create a graph for every iteration of learning, or is there some way to create the graph with dynamic kernel inputs and just relaunch it each time? (I have put a rough sketch of what I am considering below the questions.)
  2. Is it likely that creating a graph for this purpose will take longer than the launch overhead of the 4n individual kernel launches (four per layer across n layers)?
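
What I have been considering so far for question 1: if the device pointers passed to each kernel stay the same between iterations (the values in the buffers change but the addresses do not), then I think the captured graph never needs to change and I could simply relaunch the same executable graph. If the kernel arguments really do change every iteration, my understanding is that I could re-capture cheaply and refresh the instantiated graph with cudaGraphExecUpdate instead of re-instantiating, since instantiation is supposed to be the expensive step (which is also what question 2 hinges on). Below is a rough sketch of that second case, using CUDA 12.x signatures; launchOneIteration() is a placeholder for my real per-layer launches, not anything from the library, and error checking is omitted for brevity.

#include <cuda_runtime.h>

// Placeholder: enqueues the 4*n per-layer kernels of one learning iteration on `stream`.
void launchOneIteration(cudaStream_t stream);

// Rough sketch only (CUDA 12.x signatures), error checking omitted.
void trainWithGraph(cudaStream_t stream, int nIterations) {
    cudaGraphExec_t graphExec = nullptr;

    for (int iter = 0; iter < nIterations; ++iter) {
        // Re-capture this iteration's launches; capture itself is cheap.
        cudaGraph_t graph;
        cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
        launchOneIteration(stream);                    // placeholder
        cudaStreamEndCapture(stream, &graph);

        if (graphExec == nullptr) {
            // Pay the expensive instantiation cost once.
            cudaGraphInstantiate(&graphExec, graph, 0);
        } else {
            // Update the kernel parameters in the existing executable graph.
            cudaGraphExecUpdateResultInfo info;
            if (cudaGraphExecUpdate(graphExec, graph, &info) != cudaSuccess) {
                // Only needed if the topology changed; re-instantiate as a fallback.
                cudaGraphExecDestroy(graphExec);
                cudaGraphInstantiate(&graphExec, graph, 0);
            }
        }
        cudaGraphDestroy(graph);

        // One launch call replaces the 4*n individual kernel launches.
        cudaGraphLaunch(graphExec, stream);
        cudaStreamSynchronize(stream);
    }
    cudaGraphExecDestroy(graphExec);
}

My worry with this approach is whether the per-iteration capture and update cost would eat into the savings, which is essentially question 2.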

Thanks in advance for any help and sorry if I have worded anything poorly

full library code: CLDL-CUDA/lib at main · L-A-F-987/CLDL-CUDA · GitHub

// Step 1: set the inputs of layer 0 (pseudocode for a kernel launch)
setInputs_layer_0

// Step 2: forward pass through the network
for (int i = 0; i < nLayers - 1; i++) {
    // Calculates the outputs of the given layer using a layer function
    layers[i]->calcOutputs();
    double* layerOutputs = layers[i]->getOutputs();
    // Propagates the new outputs to the inputs of the next layer
    layers[i+1]->propInputs(layerOutputs);
}
layers[nLayers-1]->calcOutputs();

// Step 3: set the error at the output layer (pseudocode)
set_backward_error(output - input)

// Step 4: backward pass, propagating the error towards layer 0
double* sumlist;
for (int i = nLayers - 1; i > 0; i--) {
    sumlist = layers[i]->calcErrorWeightProductSum();
    layers[i-1]->propErrorBackward(sumlist);
}
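
For completeness, this is roughly how I was imagining capturing the sequence above into a graph once and then relaunching it. It is only a sketch, not working code: it assumes every layer method enqueues its kernels on the non-default stream being captured (capture does not record work issued to the legacy default stream), that getOutputs() and calcErrorWeightProductSum() just return device pointers without synchronising, and that those pointers stay valid across iterations. The setInputs and error-setting kernels are shown as comments because I have left them as pseudocode above.

// Rough sketch, not working code: every layer method must issue its kernels on
// `stream`, and nothing inside the captured region may synchronise or use the
// default stream, otherwise capture fails.
cudaStream_t stream;
cudaStreamCreate(&stream);

cudaGraph_t graph;
cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);

// setInputs kernel for layer 0 would be launched into `stream` here.

for (int i = 0; i < nLayers - 1; i++) {
    layers[i]->calcOutputs();
    layers[i+1]->propInputs(layers[i]->getOutputs());
}
layers[nLayers-1]->calcOutputs();

// set_backward_error kernel would be launched into `stream` here.

for (int i = nLayers - 1; i > 0; i--) {
    layers[i-1]->propErrorBackward(layers[i]->calcErrorWeightProductSum());
}

cudaStreamEndCapture(stream, &graph);

cudaGraphExec_t graphExec;
cudaGraphInstantiate(&graphExec, graph, 0);   // CUDA 12.x signature
cudaGraphDestroy(graph);

// Every subsequent iteration is then a single launch instead of 4*n launches,
// provided the pointers baked in at capture time are still the ones in use.
cudaGraphLaunch(graphExec, stream);
cudaStreamSynchronize(stream);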

Perhaps you’ve already seen this, and the other related articles listed down the right-hand side.


Thank you!