Unexpected results on Setting GPU Clocks with nvmlDeviceSetApplicationsClocks

Hi, guys

I am using the following function to control the GPU frequency on my Tesla K40c, and I found that setting the graphics frequency has a huge impact on a memory-intensive kernel. Any comments?

nvmlReturn_t nvmlDeviceSetApplicationsClocks (nvmlDevice_t device, unsigned int memClockMHz, unsigned int graphicsClockMHz)
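For context, this is roughly the call sequence I use (a minimal sketch, assuming the K40c is device index 0 and the process has permission to set application clocks; error handling abbreviated):

```c
#include <stdio.h>
#include <nvml.h>

int main(void)
{
    nvmlDevice_t device;

    /* Initialize NVML and get a handle to GPU 0 (assumption: the K40c is device 0) */
    if (nvmlInit() != NVML_SUCCESS)
        return 1;
    if (nvmlDeviceGetHandleByIndex(0, &device) != NVML_SUCCESS) {
        nvmlShutdown();
        return 1;
    }

    /* Request one of the supported memory/graphics clock pairs (MHz) */
    nvmlReturn_t r = nvmlDeviceSetApplicationsClocks(device, 3004, 875);
    if (r != NVML_SUCCESS)
        printf("SetApplicationsClocks failed: %s\n", nvmlErrorString(r));

    nvmlShutdown();
    return 0;
}
```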

I have queried all the combinations of memory clocks and graphics clocks on Tesla K40c, which are:
  0. Memory - 3004, Graphics - 875 (MHz)
  1. Memory - 3004, Graphics - 810
  2. Memory - 3004, Graphics - 745
  3. Memory - 3004, Graphics - 666
  4. Memory - 324, Graphics - 324
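The pairs above can also be enumerated programmatically; a sketch using the NVML supported-clocks queries (again assuming device index 0, with fixed-size buffers that are large enough for the K40c):

```c
#include <stdio.h>
#include <nvml.h>

int main(void)
{
    nvmlDevice_t device;
    unsigned int memClocks[16], gfxClocks[32];
    unsigned int nMem = 16;

    if (nvmlInit() != NVML_SUCCESS)
        return 1;
    if (nvmlDeviceGetHandleByIndex(0, &device) != NVML_SUCCESS) {
        nvmlShutdown();
        return 1;
    }

    /* Enumerate supported memory clocks, then the graphics clocks valid for each */
    if (nvmlDeviceGetSupportedMemoryClocks(device, &nMem, memClocks) == NVML_SUCCESS) {
        for (unsigned int i = 0; i < nMem; i++) {
            unsigned int nGfx = 32;
            if (nvmlDeviceGetSupportedGraphicsClocks(device, memClocks[i],
                                                     &nGfx, gfxClocks) != NVML_SUCCESS)
                continue;
            for (unsigned int j = 0; j < nGfx; j++)
                printf("Memory - %u, Graphics - %u (MHz)\n", memClocks[i], gfxClocks[j]);
        }
    }

    nvmlShutdown();
    return 0;
}
```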

I wrote a kernel that does lots of random memory accesses and almost no computation. (The code is shown below.)
With the memory clock fixed at 3004 MHz, raising the graphics clock from 666 to 875 MHz reduces the kernel's execution time from 812 us to 615 us.

I don’t quite understand this. Since the following kernel is memory bound, the compute resources should be underutilized while it runs, so why does a higher graphics clock lead to higher throughput?

__global__ static void memoryAccess(float *input, float *output, unsigned int *random)
{
    const int tid = threadIdx.x;
    int i;
    double sum = 0.0;                       // initialize the accumulator
    unsigned int start = tid + random[tid];
    for (i = 0; i < 100; i++) {
        sum += input[start % N];            // random load
        start = random[(tid + start) % N];  // pointer chase: next index depends on this load
    }
    output[tid] = sum;
}
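For reference, this is roughly how I measure the execution time, using CUDA events (a sketch, not my exact harness: the launch configuration is an assumption, and the device buffers are presumed already allocated and initialized):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Kernel as defined above (assumed to be in the same translation unit)
__global__ void memoryAccess(float *input, float *output, unsigned int *random);

void timeKernel(float *d_in, float *d_out, unsigned int *d_rand)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    memoryAccess<<<1, 256>>>(d_in, d_out, d_rand);  // assumption: one block of 256 threads
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
    printf("kernel time: %.0f us\n", ms * 1000.0f);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```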

Any comments would be appreciated.

Based on the design of the K40, it turns out that in order to get maximum memory bandwidth available to your application, you also need to set the core clocks to their highest value (in addition to the memory clocks).

This is because part of the processing of memory traffic originating in the GPU SMs is controlled by the core clock, not the memory clock, before that traffic reaches the clock domain governed by the memory clock.

This is documented here: