TX1: inconsistent GPU utilization ratio for the same function loop

Hi, I tried profiling inference timing for GoogLeNet on my TX1, with a batch size of 1 over about 8k images.
I expected consistent timing for each input image because I fixed the GPU frequency to its maximum of 998 MHz.

However, the timing fluctuates by about 200%.
I believe this is due to an inconsistent TX1 GPU utilization ratio: tegrastats shows GR3D fluctuating by about 200% as well.
When I set the batch size to 16, both the timing and the GPU utilization ratio are consistent.

May I have recommendations on measurement techniques or nvprof metrics for further analysis? Thank you.

Attached are the code, the corresponding timings, and the tegrastats output over 1 second.

for (int m = 0; m < total_googlenet_layer; m += batch)
{
    copy_from_cpu_to_gpu();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);

    forward_googlenet_layer_using_cudnn();

    cudaEventRecord(stop, 0);
    // Busy-wait until the stop event completes, then synchronize fully.
    while (cudaEventQuery(stop) == cudaErrorNotReady) {}
    gpuErrchk(cudaEventSynchronize(stop));
    cudaDeviceSynchronize();
    cudaStreamSynchronize(0);

    float time0;
    cudaEventElapsedTime(&time0, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    gpuErrchk(cudaPeekAtLastError());
}
time   forward timing
3:29:26	33.61ms
3:29:26	24.28ms
3:29:26	27.96ms
3:29:26	24.42ms
3:29:26	26.63ms
3:29:26	24.33ms
3:29:26	47.63ms
3:29:26	31.28ms
3:29:26	37.65ms
3:29:26	48.37ms
3:29:26	24.34ms
3:29:26	44.53ms
3:29:26	31.80ms
3:29:26	24.26ms
3:29:26	38.80ms
3:29:26	44.69ms
3:29:26	25.70ms
3:29:26	25.43ms
3:29:26	24.25ms
3:29:26	28.40ms
3:29:26	51.45ms
3:29:26	35.59ms
3:29:26	24.20ms
3:29:26	25.15ms
3:29:26	24.35ms
3:29:26	24.30ms
3:29:26	24.97ms
3:29:26	44.61ms
3:29:26	30.35ms
3:29:26	27.33ms
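For reference, the jitter in the table above can be quantified offline. A minimal Python sketch (the timing values are copied from the table; everything else is my own scaffolding):

```python
# Per-image forward timings (ms) copied from the table above.
timings = [
    33.61, 24.28, 27.96, 24.42, 26.63, 24.33, 47.63, 31.28, 37.65, 48.37,
    24.34, 44.53, 31.80, 24.26, 38.80, 44.69, 25.70, 25.43, 24.25, 28.40,
    51.45, 35.59, 24.20, 25.15, 24.35, 24.30, 24.97, 44.61, 30.35, 27.33,
]

fastest, slowest = min(timings), max(timings)
mean = sum(timings) / len(timings)

# The slowest iteration takes roughly 2.1x the fastest one,
# which matches the "about 200%" fluctuation described above.
print(f"min {fastest:.2f} ms, max {slowest:.2f} ms, "
      f"mean {mean:.2f} ms, max/min {slowest / fastest:.2f}x")
```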
RAM usage is consistent at 1607/3853 MB.
[time]	per-core CPU utilization @ MHz	GR3D (GPU) utilization @ MHz
[03:29:26]	cpu	[0%,66%,0%,90%]@1734	EMC	GR3D	90%@998
[03:29:26]	cpu	[10%,69%,16%,83%]@1734	EMC	GR3D	85%@998
[03:29:26]	cpu	[9%,66%,0%,90%]@1734	EMC	GR3D	91%@998
[03:29:26]	cpu	[9%,80%,9%,91%]@1734	EMC	GR3D	30%@998
[03:29:26]	cpu	[10%,75%,0%,100%]@1734	EMC	GR3D	65%@998
[03:29:26]	cpu	[0%,80%,0%,90%]@1734	EMC	GR3D	48%@998
[03:29:26]	cpu	[9%,54%,0%,88%]@1734	EMC	GR3D	88%@998
[03:29:26]	cpu	[0%,72%,9%,90%]@1734	EMC	GR3D	43%@998
[03:29:26]	cpu	[0%,58%,0%,75%]@1734	EMC	GR3D	89%@998
[03:29:26]	cpu	[10%,66%,0%,81%]@1734	EMC	GR3D	75%@998
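The GR3D spread over that second can be pulled out of the log the same way. A small Python sketch, assuming the tegrastats format shown above (the GR3D fields are copied from the output; the parsing code is my own assumption):

```python
import re

# GR3D fields copied from the one second of tegrastats output above.
log = """\
GR3D 90%@998
GR3D 85%@998
GR3D 91%@998
GR3D 30%@998
GR3D 65%@998
GR3D 48%@998
GR3D 88%@998
GR3D 43%@998
GR3D 89%@998
GR3D 75%@998
"""

# Extract the GR3D (GPU) utilization percentage from each sample.
gr3d = [int(m.group(1)) for m in re.finditer(r"GR3D\s+(\d+)%@", log)]

# Utilization swings between 30% and 91% within a single second,
# consistent with the per-image timing jitter reported above.
print(f"GR3D min {min(gr3d)}%, max {max(gr3d)}%, "
      f"mean {sum(gr3d) / len(gr3d):.1f}%")
```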