High GPU usage for small grid size?

Hello everyone,

I was wondering what causes the GPU usage to be high. Is it the grid size initialization? I'm currently trying to run multiple instances of a handmade neural network, and the GPU is always at 99% (nvidia-smi) regardless of the size of my neural net. My GPU is a Titan X (Maxwell). Here is a sample of the initialization used before launching the different kernels:

// Initializing grid dimensions for each case
dim3 dim_grid((loader.maxValue + BLOCK_SIZE - 1) / BLOCK_SIZE);
dim3 dim_grid_output((loader.nbOutputs + BLOCK_SIZE - 1) / BLOCK_SIZE);
dim3 dim_grid_update((loader.weightSize + BLOCK_SIZE - 1) / BLOCK_SIZE);
dim3 dim_grid_train((loader.trainBatchSize + BLOCK_SIZE - 1) / BLOCK_SIZE);
dim3 dim_grid_dev((loader.devBatchSize + BLOCK_SIZE - 1) / BLOCK_SIZE);
dim3 dim_grid_test((loader.testBatchSize + BLOCK_SIZE - 1) / BLOCK_SIZE);
dim3 dim_grid_batch((loader.miniBatchSize + BLOCK_SIZE - 1) / BLOCK_SIZE);
dim3 dim_block(BLOCK_SIZE);


The biggest one is dim_grid_update, where loader.weightSize represents the total number of weights in the neural network. But in my current example it's only 36 …
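For concreteness, with e.g. BLOCK_SIZE = 256 (the actual value isn't shown above), the rounding-up division gives a single block:

#include <cstdio>

int main()
{
    const int BLOCK_SIZE = 256;  // assumed for illustration; not shown in the post
    const int weightSize = 36;   // the value from my current example
    // Same ceiling division as dim_grid_update above:
    int blocks = (weightSize + BLOCK_SIZE - 1) / BLOCK_SIZE;
    printf("dim_grid_update = %d block(s) of %d threads\n", blocks, BLOCK_SIZE);  // 1 block
    return 0;
}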

When I run one instance of the neural network I get 4 epochs per second; when I run two of them, I get 2 epochs per second. So if I try to launch 10 instances, it's freakin slow.

Thanks for the help!

GPU utilization as reported by nvidia-smi has nothing to do with grid size: it is the percentage of time over the sample period during which at least one kernel was executing on the GPU, regardless of how large that kernel is. It also has nothing to do with memory utilization (which is reported separately anyway).
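To see this concretely, here is a toy sketch (nothing to do with your code): it launches a 1-block, 1-thread kernel back to back in a loop. While it runs, nvidia-smi will report ~100% utilization even though almost the entire GPU is idle.

#include <cuda_runtime.h>

__global__ void tiny_kernel(float *x)
{
    // A single thread doing negligible work.
    x[0] += 1.0f;
}

int main()
{
    float *d_x;
    cudaMalloc(&d_x, sizeof(float));
    // Back-to-back launches keep *some* kernel in flight nearly all the
    // time, which is exactly what the utilization counter measures.
    for (int i = 0; i < 1000000; ++i)
        tiny_kernel<<<1, 1>>>(d_x);
    cudaDeviceSynchronize();
    cudaFree(d_x);
    return 0;
}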


If one instance is “filling the GPU” (i.e. utilizing resources in such a way that it mostly precludes kernel concurrency), then running multiple instances is not likely to see any benefit, as you are witnessing.
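Also note that kernels launched from separate processes never run concurrently on the GPU; their contexts are time-sliced (unless you use something like MPS). That is consistent with your throughput halving from 4 to 2 epochs per second when you run two instances. Within a single process, two kernels can overlap only if they are launched into different non-default streams and the first leaves resources free. A rough sketch of that idea (the kernel and sizes here are made up, not taken from your code):

#include <cuda_runtime.h>

__global__ void busy_kernel(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = x[i];
        for (int k = 0; k < 100000; ++k)  // spin so any overlap is visible
            v = v * 1.0001f + 0.0001f;
        x[i] = v;
    }
}

int main()
{
    const int n = 36;  // tiny problem size, as in your example
    float *d_a, *d_b;
    cudaMalloc(&d_a, n * sizeof(float));
    cudaMalloc(&d_b, n * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // In separate non-default streams these two small kernels may run
    // concurrently (resources permitting); in the default stream, or in
    // two separate processes, they serialize.
    busy_kernel<<<1, 64, 0, s1>>>(d_a, n);
    busy_kernel<<<1, 64, 0, s2>>>(d_b, n);
    cudaDeviceSynchronize();

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(d_a);
    cudaFree(d_b);
    return 0;
}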

Well, thank you for the answer, but I can't understand how a simple kernel-by-kernel program (no concurrent kernel execution) with such a small grid size can fill the GPU in a way that prevents me from launching another instance of the same program.