High GPU usage for small grid size?

Hello everyone,

I was wondering what causes the GPU usage to be high. Is it the grid size initialization? I'm currently trying to run multiple instances of a handmade neural network, and the GPU is always at 99% (according to nvidia-smi) regardless of the size of my neural net. My GPU is a Titan X (Maxwell). Here is a sample of the initialization used before launching the different kernels:

// Initializing grid dimensions for each case
dim3 dim_grid((loader.maxValue + BLOCK_SIZE - 1) / BLOCK_SIZE);
dim3 dim_grid_output((loader.nbOutputs + BLOCK_SIZE - 1) / BLOCK_SIZE);
dim3 dim_grid_update((loader.weightSize + BLOCK_SIZE - 1) / BLOCK_SIZE);
dim3 dim_grid_train((loader.trainBatchSize + BLOCK_SIZE - 1) / BLOCK_SIZE);
dim3 dim_grid_dev((loader.devBatchSize + BLOCK_SIZE - 1) / BLOCK_SIZE);
dim3 dim_grid_test((loader.testBatchSize + BLOCK_SIZE - 1) / BLOCK_SIZE);
dim3 dim_grid_batch((loader.miniBatchSize + BLOCK_SIZE - 1) / BLOCK_SIZE);
dim3 dim_block(BLOCK_SIZE);

with BLOCK_SIZE = 64.

The biggest one is dim_grid_update, where loader.weightSize is the number of weights in the network. But in my current example that's only 36, so the launch is just (36 + 64 - 1) / 64 = 1 block of 64 threads…

When I run one instance of the neural network I get 4 epochs per second; when I run two of them, I get 2 epochs per second. So if I try to launch 10 instances it's painfully slow.

Thanks for the help!

GPU utilization as reported by nvidia-smi has nothing to do with grid size. It also has nothing to do with memory utilization (which is reported separately anyway). It is simply the percentage of the sample period during which one or more kernels was executing on the GPU, no matter how small those kernels were.

http://stackoverflow.com/questions/40937894/nvidia-smi-volatile-gpu-utilization-explanation/40938696#40938696
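
To illustrate (a minimal sketch of my own, not code from your program): even a single 64-thread block launched back-to-back in a loop will typically drive the reported utilization to ~99%, because some kernel is resident during nearly every sampling interval:

#include <cuda_runtime.h>

__global__ void tiny_kernel(float *x)
{
    // One block of 64 threads: a vanishingly small fraction of the GPU.
    int i = threadIdx.x;
    x[i] = x[i] * 2.0f + 1.0f;
}

int main()
{
    float *d_x;
    cudaMalloc(&d_x, 64 * sizeof(float));
    // Keep the GPU busy with back-to-back tiny launches; while this
    // runs, nvidia-smi reports utilization near 100%.
    for (int iter = 0; iter < 1000000; ++iter)
        tiny_kernel<<<1, 64>>>(d_x);
    cudaDeviceSynchronize();
    cudaFree(d_x);
    return 0;
}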

If one instance is "filling the GPU" (i.e., utilizing resources in such a way that it mostly precludes kernel concurrency), then running multiple instances is not likely to see any benefit, as you are witnessing.
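
For reference, here is a minimal sketch (my own illustration, not your code) of what kernel concurrency looks like when it is possible: two tiny kernels issued to separate streams of the same process can overlap, because each one leaves plenty of SMs free:

#include <cuda_runtime.h>

// Spin each thread for roughly `cycles` GPU clock cycles.
__global__ void spin(long long cycles)
{
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

int main()
{
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Two tiny launches in different streams of the SAME process:
    // both fit on the GPU at once, and a profiler timeline
    // (nvprof / nvvp) shows them overlapping.
    spin<<<1, 64, 0, s1>>>(100000000LL);
    spin<<<1, 64, 0, s2>>>(100000000LL);

    cudaDeviceSynchronize();
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    return 0;
}

Separate processes, however, get separate CUDA contexts, and the GPU time-slices between contexts rather than running their kernels concurrently (short of using the CUDA Multi-Process Service). That is consistent with the halved epoch rate you measured when running two instances.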

Well, thank you for this answer, but I can't understand how a simple kernel-by-kernel program (no concurrent kernel execution) with a small grid size can fill the GPU in such a way that I can't launch another instance of the same program.