Does reading the %globaltimer register or calling clock()/clock64() in a loop prevent concurrent kernel execution?

Hi, I’m trying to learn how the GPU schedules concurrent kernels under resource limits (threads/registers/shared memory). It allows normal computation to run concurrently, scheduling blocks from different kernels together as much as possible. But when I try to control the running time of each block, by reading the %globaltimer register or calling clock()/clock64() in a loop, the kernels are serialized: a new one starts only after the previous one finishes.

The documentation only mentions that configuration changes, such as the L1/shared memory split, can serialize kernels. Does reading %globaltimer (and the like) cause similar serialization? In other words, does reading %globaltimer or calling clock()/clock64() prevent concurrent kernel execution?

Thanks,

  • Ming

Reading clock or clock64 should have no effect on serialization of concurrent kernels.

But then how do you explain that if I have the following code in the kernel, no other kernel executes concurrently with it? If I remove this code, concurrent kernel execution happens.

I couldn’t find the reason either - ideally clock shouldn’t influence concurrent kernel execution, but it seems to be the only thing preventing CKE here.

long long int start = clock64();
while (clock64() < start + 1000000000) continue;  // spin for ~1e9 GPU clock cycles
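For context, a complete delay kernel built around this spin loop might look like the following sketch (the kernel name and parameterized cycle count are my own choices, not from the original post):

```cuda
// Hypothetical standalone version of the spin loop above.
// Each thread busy-waits for roughly `cycles` GPU clock cycles;
// no memory traffic, just repeated reads of the SM cycle counter.
__global__ void delay_kernel(long long cycles)
{
    long long start = clock64();
    while (clock64() < start + cycles)
        continue;
}

// Launched e.g. as: delay_kernel<<<1, 1024, 0, stream>>>(1000000000LL);
```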

How many blocks does your grid have - do you leave enough resources for the following kernel to launch in parallel?

Yes, all resources are sufficient to run the kernels concurrently. My experiment details are below:

Two streams:

Stream 1: K1 (1 block * 1024 threads/block)
Stream 2: K2 (4 blocks * 1024 threads/block)

Running on TX1 (Maxwell, 2 SMs, 4 * 1024 threads limit in total)

Comparison between the two experiments:

Experiment 1: use clock/globaltimer to control run time for each block
Result: K2 starts after K1 finishes

Experiment 2: no clock/globaltimer code in the kernel, pure normal computation (assigning the square root of the index to an array element)
Result: K2 is concurrent with K1
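A minimal sketch of this two-stream setup, with kernel bodies reconstructed from the description above (names and exact bodies are my guesses, not the original code):

```cuda
#include <math.h>

// Experiment 1: spin on clock64() to control per-block run time.
__global__ void k_timed(long long cycles)
{
    long long start = clock64();
    while (clock64() < start + cycles) continue;
}

// Experiment 2: plain compute, no timer reads.
__global__ void k_compute(float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = sqrtf((float)i);
}

// Host side (error checking omitted):
//   cudaStream_t s1, s2;
//   cudaStreamCreate(&s1); cudaStreamCreate(&s2);
//   k_timed<<<1, 1024, 0, s1>>>(1000000000LL);   // K1: 1 block
//   k_timed<<<4, 1024, 0, s2>>>(1000000000LL);   // K2: 4 blocks
//   cudaDeviceSynchronize();
```

On a 2-SM device with 2048 threads per SM, K1 (1024 threads) plus K2 (4096 threads) slightly oversubscribes, but at least some K2 blocks should still co-reside with K1.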

I don’t have a TX1 to run on. I’m sure I can demonstrate concurrent kernels where the kernels read from clock() or clock64().

Here is an example, albeit with CDP, on a device that is approximately as resource constrained as TX1:

http://stackoverflow.com/questions/31058850/overlap-kernel-execution-on-multiple-streams

(that actually has 4 kernels simultaneously resident, two parent and two child kernels, on a device with 2 SMs)

How are you determining concurrency? Are you using CUDA events?

Thanks for the information. I tried it, and then realized one thing I forgot to mention, which is important:
If two kernels are launched close together in time, they are always concurrent. But if the second is launched late enough (for example, 1 second later, though still before the first finishes), then they are not co-scheduled. Code below:

#include <stdio.h>
#include <time.h>   // clock_gettime
#include <math.h>   // round

#define DELAY_VAL 5000000000ULL

long milliseconds()
{
    long            ms; // Milliseconds
    time_t          s;  // Seconds
    struct timespec spec;

    clock_gettime(CLOCK_REALTIME, &spec);

    s  = spec.tv_sec;
    ms = round(spec.tv_nsec / 1.0e6); // Convert nanoseconds to milliseconds
    return ms + s *1000;
}

__global__ void child(){

    unsigned long long start = clock64();
    while (clock64()< start+DELAY_VAL);
}

__global__ void parent(){

    child<<<1,1>>>();
}

int main(int argc, char* argv[]){

    cudaStream_t st1, st2;
    cudaStreamCreate(&st1);
    cudaStreamCreate(&st2);

    long start = milliseconds();
    long now = 0;
    parent<<<1,1,0,st1>>>();
    if (argc > 1){
        printf("running double kernel\n");
        while ( now < start + 1000) {
            now = milliseconds();
        }
        parent<<<1,1,0,st2>>>();
    }
    cudaDeviceSynchronize();
}

With normal computation (no timing code), the kernels always run concurrently, at least every time I’ve seen. So it seems the GPU treats code referencing time with a different policy.

I use nvvp to visualize the execution timeline.
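Besides nvvp, one way to check for overlap programmatically is with CUDA events bracketing each launch. This is my own sketch (variable names are hypothetical), not code from the thread:

```cuda
// Bracket each kernel with events, then compare the two intervals.
cudaEvent_t a0, a1, b0, b1;
cudaEventCreate(&a0); cudaEventCreate(&a1);
cudaEventCreate(&b0); cudaEventCreate(&b1);

cudaEventRecord(a0, st1);
parent<<<1,1,0,st1>>>();
cudaEventRecord(a1, st1);

cudaEventRecord(b0, st2);
parent<<<1,1,0,st2>>>();
cudaEventRecord(b1, st2);

cudaDeviceSynchronize();

// If the [a0,a1] and [b0,b1] intervals overlap, the kernels ran concurrently:
// a positive elapsed time from b0 to a1 means K1 was still running when K2 started.
float overlap_ms = 0.0f;
cudaEventElapsedTime(&overlap_ms, b0, a1);
if (overlap_ms > 0.0f)
    printf("kernels overlapped by ~%.1f ms\n", overlap_ms);
```

Note that event records go into the same streams as the kernels, so they measure when each stream's work actually ran on the device, independent of host-side launch timing.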

That’s not what I see.

Here is my test case, using exactly the code you posted, on CUDA 8, CentOS7, Tesla K20X:

$ nvcc -arch=sm_35 -rdc=true -o t958 t958.cu -lcudadevrt
$ time ./t958

real    0m12.794s
user    0m2.373s
sys     0m8.661s
$ time ./t958 1
running double kernel

real    0m13.768s
user    0m3.169s
sys     0m8.815s
$ time cuda-memcheck ./t958 1
========= CUDA-MEMCHECK
running double kernel
========= ERROR SUMMARY: 0 errors

real    0m19.088s
user    0m4.087s
sys     0m13.272s
$

This system has a lot of main memory and 4 GPUs in it, and I don’t have the GPUs in persistence mode, so there is a long start-up delay (5-6s) for any CUDA code running on this system.

In any event, we see above that the single kernel run takes 12.794s, the double kernel run takes approximately 1 second more at 13.768s, and the serialized run (cuda-memcheck serializes kernel launches) takes ~19s.

These numbers make sense to me: in the double-kernel case, delaying the second kernel's launch by 1 second adds almost exactly 1 second of total execution time (13.768s vs 12.794s), which is exactly what I would expect. The only way these numbers add up is if the two parent kernels (and their child kernels) ran concurrently in the 13.768s case.

Interesting! This could be specific to the Jetson TX1 then, or to a newer architecture or compute capability than your K20X (TX1 is Maxwell, sm_53). My results show they are not concurrent; I also confirmed with nvvp that they are serialized.

$ time ./t815

real    0m5.451s
user    0m0.560s
sys     0m0.820s

$ time ./t815 1
running double kernel

real    0m10.467s
user    0m2.330s
sys     0m0.870s

$ time cuda-memcheck ./t815 1
========= CUDA-MEMCHECK
running double kernel
========= ERROR SUMMARY: 0 errors

real    0m12.028s
user    0m3.870s
sys     0m1.680s

What version of CUDA is installed on that TX1? I assume it is a Jetson TX1?

Yes Jetson TX1. CUDA 8.0 with Ubuntu 16.04. Full log of deviceQuery is below:

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA Tegra X1"
  CUDA Driver Version / Runtime Version          8.0 / 8.0
  CUDA Capability Major/Minor version number:    5.3
  Total amount of global memory:                 3994 MBytes (4188229632 bytes)
  ( 2) Multiprocessors, (128) CUDA Cores/MP:     256 CUDA Cores
  GPU Max Clock rate:                            72 MHz (0.07 GHz)
  Memory Clock rate:                             13 Mhz
  Memory Bus Width:                              64-bit
  L2 Cache Size:                                 262144 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            Yes
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = NVIDIA Tegra X1
Result = PASS

Jetson has a variety of settings that I’m not intimately familiar with. It’s possible your issue is specific to Jetson. You might get some additional folks to look at it by posting in the Jetson forum.

Thanks for your help. New post link: https://devtalk.nvidia.com/default/topic/1000107/jetson-tx1/reading-globaltimer-register-or-calling-clock-clock64-in-loop-prevent-concurrent-kernel-execution-/

Concurrent kernels on the TK1 were a real problem at the time. I don’t know whether that was solved, or whether the same problem applies to the TX1.

Normal computation kernels run concurrently just fine. I only see the problem with the timing functions and the %globaltimer register.

Problem solved: I hadn’t stopped the lightdm service. Although I still haven’t figured out how such a small <<<1,1>>> kernel could be delayed, it doesn’t happen any more once lightdm is stopped.

Thanks to everyone!