How is the compiler optimizing the thread launch?

Hi:

I was doing some testing on CUDA, and I found an interesting thing regarding thread scheduling… So, I have this “dummy” program that finds the first prime factor of a given number:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include "conio.h"
#include "stdio.h"
#include "string.h"
#include "stdlib.h"
#include "time.h"

__global__ void addKernel(int* x, const int *n, const int o, int* r);

__global__ void addKernel(int* x, const int *n, const int o, int* r){
  while (*r == 0) {
    //THIS PRINTF IN PARTICULAR!
    printf("Trying with: %i result: %i\n", x[blockIdx.x * blockDim.x + threadIdx.x], *n % x[blockIdx.x * blockDim.x + threadIdx.x]);

    if (*n % x[blockIdx.x * blockDim.x + threadIdx.x] == 0 && x[blockIdx.x * blockDim.x + threadIdx.x] > 1) {
      memcpy(r, &x[blockIdx.x * blockDim.x + threadIdx.x], sizeof(int));
    }

    x[blockIdx.x * blockDim.x + threadIdx.x] += o;
  }
}

int main(int argc, char* argv[]) {
  //kernel control
  int cn;
  int cm;

  //aux data
  int i;

  if (argc > 2) { //need both arguments before reading argv[2]
    cn = atoi(argv[1]);
    cm = atoi(argv[2]);
  } else {
    cn = 4;
    cm = 4;
  }

  //heap memory control
  size_t maxMem;

  //loop control
  int ex = 0;
  cudaError_t cuerr;

  //measurement
  float str;
  float end;
  float freq;

  //HOST data
  int* hx = (int*)malloc(sizeof(int) * cn * cm);
  int* hr = (int*)malloc(sizeof(int));
  int* hn = (int*)malloc(sizeof(int));

  //DEVICE data
  int* x;
  int* r;
  int* n;

  cudaDeviceGetLimit(&maxMem, cudaLimitMallocHeapSize);
  cudaDeviceSetLimit(cudaLimitMallocHeapSize, maxMem);
  printf("Heap memory set result: %s\n", cudaGetErrorString(cudaGetLastError()));
  printf("Kernel <<<%i, %i>>> will be launched\n", cn, cm);

  //init DEVICE data
  cudaMalloc(&x, sizeof(int) * cn * cm);
  cudaMalloc(&r, sizeof(int));
  cudaMalloc(&n, sizeof(int));

  if (hx == NULL || hr == NULL || hn == NULL)
    exit(-1);

  //init HOST data
  for (i = 0; i < cn * cm; i++)
    hx[i] = i;

  *hn = 131 * 313; //test data
  *hr = 0;

  //copy from HOST to DEVICE
  cudaMemcpy(x, hx, sizeof(int) * cn * cm, cudaMemcpyHostToDevice);
  cudaMemcpy(n, hn, sizeof(int), cudaMemcpyHostToDevice);
  cudaMemcpy(r, hr, sizeof(int), cudaMemcpyHostToDevice);

  printf("Calculation - Loop Start @%i\n", clock());
  addKernel<<<cn, cm>>>(x, n, (cn * cm), r);
  cuerr = cudaDeviceSynchronize();

  printf("Calculation - Loop Ends @%i\n", clock());

  //copy r to HOST
  cudaMemcpy(hr, r, sizeof(int), cudaMemcpyDeviceToHost);

  printf("Result: %i\n", *hr);

  return 0;
}

So, if I run it with the first printf uncommented (the one that says “THIS PRINTF IN PARTICULAR”), the program ends quickly (as expected) and returns the lowest prime factor of the number:

[...]
Trying with: 158 result: 81
Trying with: 159 result: 140
Calculation - Loop Ends @1245
Result: 131

On the other hand, if I comment out that particular printf, the program takes considerably longer and returns the highest prime factor of the number:

Heap memory set result: no error
Kernel <<<4, 4>>> will be launched
Calculation - Loop Start @1107
Calculation - Loop Ends @48101
Result: 313

I don’t know why the behaviour changes so much… I even tried to use __shared__ data in the kernel, but got a similar result:

__global__ void addKernel(int* x, const int *n, const int o, int* r){
  __shared__ int sr;
  sr = 0;
  __syncthreads(); //ensure all threads have initial value of 0

  while (sr == 0) {
    //printf("Trying with: %i result: %i\n", x[blockIdx.x * blockDim.x + threadIdx.x], *n % x[blockIdx.x * blockDim.x + threadIdx.x]);

    if (*n % x[blockIdx.x * blockDim.x + threadIdx.x] == 0 && x[blockIdx.x * blockDim.x + threadIdx.x] > 1) {
      //memcpy(r, &x[blockIdx.x * blockDim.x + threadIdx.x], sizeof(int));
      sr = x[blockIdx.x * blockDim.x + threadIdx.x];
    }

    x[blockIdx.x * blockDim.x + threadIdx.x] += o;

    __syncthreads(); //sync so every thread gets updated sr value
  }

  __syncthreads();
  memcpy(r, &x[blockIdx.x * blockDim.x + threadIdx.x], sizeof(int));
}

EDIT: after a few rounds of testing, I found that the way to make it work is by referencing the block index explicitly:

__global__ void addKernel(int* x, const int n, const int o, int* r, const int cn){
  if (blockIdx.x > 0) {
    printf("Block %i activated...\n", blockIdx.x); //kind of "hack" to make it work"
  }

  while (*r == 0) {
    if (n % x[blockIdx.x * blockDim.x + threadIdx.x] == 0 && x[blockIdx.x * blockDim.x + threadIdx.x] > 1) {
      memcpy(r, &x[blockIdx.x * blockDim.x + threadIdx.x], sizeof(int));
    }

    x[blockIdx.x * blockDim.x + threadIdx.x] += o;
  }
}

Is there maybe some compiler optimization that is changing some values when I remove the printf?

Thanks.

Your code has a bug: multiple kernel threads could update *r simultaneously. This is a race condition and must be avoided.

Hi.
It’s not a bug, it’s intentional.
In this scenario, there are only two possible values that can satisfy the condition, so having a race condition is acceptable (whichever value I get, I can derive the other one).

Bear in mind that this is only a practical exercise; also, as I showed, the use of shared memory was neither the cause nor the solution.

Thanks.

You have a race condition.

Your expectations for program output could only be valid if you assumed that the thread responsible for discovering the 131 factor executed before the thread that discovers the 313 factor (or if you somehow knew that it would be the same thread that would make both “discoveries”).

CUDA provides no guarantees of order of thread execution

Since CUDA provides no guarantees of the order of thread execution, threads can execute in any order if you don’t impose one (e.g. via explicit barriers).

If threads can execute in any order, then the thread that discovers the 131 factor may run before, concurrently with, or after the one that discovers the 313 factor. If it runs after, it may observe that *r is not zero, in which case it does nothing, because the result has already been published as 313. We can reverse this statement as well, to declare that 131 is also a possible valid output.

Therefore either outcome is valid, and since you have provided no ordering of thread execution, it’s unreasonable to conclude that one output should be expected over the other.

Since either outcome is valid, the output will be determined by a race. I call that a race condition. It’s OK if we disagree about that. You can call it whatever you want to call it. And it doesn’t matter if it is intentional or not. If it is problematic, it is problematic.
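As a side note before moving on: if you actually needed the smallest factor deterministically, you could drop the early-exit loop and combine per-thread results with an atomic. A minimal sketch (a hypothetical kernel of my own, not your code; it assumes the host initializes *r to INT_MAX instead of 0):

//Hypothetical kernel: every thread walks its strided candidate range,
//and atomicMin records the smallest factor found by any thread,
//regardless of the order in which threads execute.
__global__ void smallestFactor(const int n, const int stride, int* r) {
  for (int c = blockIdx.x * blockDim.x + threadIdx.x; c <= n; c += stride) {
    if (c > 1 && n % c == 0) {
      atomicMin(r, c); //smallest published value always wins
      break;           //this thread's smallest candidate is found
    }
  }
}

The host would set *r to INT_MAX before launch and read it back after cudaDeviceSynchronize(); since no thread exits early based on another thread’s result, the answer no longer depends on scheduling.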

As an aside, your use of global memory for inter-thread communication is also problematic. Just because one thread writes a value to global memory does not mean another thread will read that value, unless you take specific steps to make it so. If you’d like to see this in action, comment out the first in-kernel printf, and put this line as the last line in the body of the if-statement in your kernel:

printf("thread: %d, %d wrote: %d\n", (int)blockIdx.x, (int)threadIdx.x, *r);

When I do that on my Tesla V100, I get this:

$ ./t2122
Heap memory set result: no error
Kernel <<<4, 4>>> will be launched
Calculation - Loop Start @380000
thread: 2, 1 wrote: 313
thread: 0, 3 wrote: 131

I imagine you wouldn’t expect that sort of output. In this case, due to the lack of visibility of *r, both threads attempted to update the output. Which result you end up seeing is decided by a race, which fits the definition of a race condition. Evidently your first printf statement is making a difference in the “visibility” of *r, but of course you should not depend on that.

Given that we have established that the visibility of *r is not guaranteed, we can now observe that a thread that cannot read the modified value of *r written by another thread, and has no factors of its own to discover, will run forever, since the only termination condition of your while loop is broken.

Hi Robert:

Thanks for the data. I know that in a race condition more than one thread can write the same data, and that this can be problematic if the output is important. The problem is not the output value, as both values given (131 and 313) are valid.

What I’m trying to point out is one of the things you wrote: some thread doesn’t have an updated value for *r. But somehow the same happens when using shared memory (I tried declaring shared memory both inside the kernel and at global scope outside the kernel).

Only when using printf do all threads seem to acknowledge updated values (of both *r and the shared values).

(EDIT: I jumped on my laptop to add some more information.)

Example 1: shared memory, without the blockIdx.x printf:

__global__ void addKernel(int* x, const int n, const int o, int* r, const int cn);
__shared__ int ssr;

__global__ void addKernel(int* x, const int n, const int o, int* r, const int cn){
  ssr = 0;
  int i = 0;

  //if (blockIdx.x > 0) {
  //  printf("Block %i activated...\n", blockIdx.x);
  //}

  __syncthreads(); //ensure all threads have initial value of 0

  while (ssr == 0 && *r == 0) {

    if (n % x[blockIdx.x * blockDim.x + threadIdx.x] == 0 && x[blockIdx.x * blockDim.x + threadIdx.x] > 1) {
      //memcpy(r, &x[blockIdx.x * blockDim.x + threadIdx.x], sizeof(int));
      i = 1;
      ssr = x[blockIdx.x * blockDim.x + threadIdx.x];
    }

    x[blockIdx.x * blockDim.x + threadIdx.x] += o;
    __syncthreads();
  }

  __syncthreads();
  if (i == 1) { //only overwrite if that was the thread that found it
    x[blockIdx.x * blockDim.x + threadIdx.x] -= o;
    memcpy(r, &x[blockIdx.x * blockDim.x + threadIdx.x], sizeof(int));
    printf("FOUND! ssr [%i, %i, %i] = %i, addr: %p, r: %i\n", blockIdx.x, blockDim.x, threadIdx.x, ssr, &ssr, *r);
  }
}

Output:

Heap memory set result: no error
Kernel <<<4, 2>>> will be launched
Calculation - Loop Start @1948
FOUND! ssr [1, 2, 1] = 131, addr: 0000021074000000, r: 131
FOUND! ssr [0, 2, 1] = 313, addr: 0000021074000000, r: 313

{Then the program hangs while the blocks with blockIdx.x = 2 and 3 keep running; I have to force the program to stop manually}

Example 2: same code, but adding the blockIdx.x printf:

__global__ void addKernel(int* x, const int n, const int o, int* r, const int cn);
__shared__ int ssr;

__global__ void addKernel(int* x, const int n, const int o, int* r, const int cn){
  ssr = 0;
  int i = 0;

  if (blockIdx.x > 0) {
    printf("Block %i activated...\n", blockIdx.x);
  }

  __syncthreads(); //ensure all threads have initial value of 0

  while (ssr == 0 && *r == 0) {
    if (n % x[blockIdx.x * blockDim.x + threadIdx.x] == 0 && x[blockIdx.x * blockDim.x + threadIdx.x] > 1) {
      //memcpy(r, &x[blockIdx.x * blockDim.x + threadIdx.x], sizeof(int));
      i = 1;
      ssr = x[blockIdx.x * blockDim.x + threadIdx.x];
    }

    x[blockIdx.x * blockDim.x + threadIdx.x] += o;
    __syncthreads();
  }

  __syncthreads();
  if (i == 1) { //only overwrite if that was the thread that found it
    x[blockIdx.x * blockDim.x + threadIdx.x] -= o;
    memcpy(r, &x[blockIdx.x * blockDim.x + threadIdx.x], sizeof(int));
    printf("FOUND! ssr [%i, %i, %i] = %i, addr: %p, r: %i\n", blockIdx.x, blockDim.x, threadIdx.x, ssr, &ssr, *r);
  }

  __syncthreads();
  printf("On output, value for ssr: %i\n", ssr);
}

Output:

Heap memory set result: no error
Kernel <<<4, 2>>> will be launched
Calculation - Loop Start @1738
Block 1 activated...
Block 1 activated...
Block 3 activated...
Block 3 activated...
Block 2 activated...
Block 2 activated...
FOUND! ssr [0, 2, 1] = 313, addr: 000001B46C000000, r: 313
On output, value for ssr: 0
On output, value for ssr: 0
On output, value for ssr: 0
On output, value for ssr: 0
On output, value for ssr: 0
On output, value for ssr: 0
On output, value for ssr: 313
On output, value for ssr: 313
Calculation - Loop Ends @1740
Result: 313

{In this case, all threads finish and the program ends automatically}

I don’t know how this simple printf is changing the visibility of *r.

I’m sorry, but I don’t know why printing blockIdx.x is capable of changing the visibility of a variable. I would appreciate some info about that, please.

Then, how does one guarantee visibility for variables? I mean, not even ssr, which is declared at global scope and __shared__, was updated for all threads in this example…

Thanks.

Shared memory is only shared between threads of the same block, not between all threads. If block 1 executes ssr = 131, the value of ssr in blocks 2 and 3 remains unchanged. That’s why the kernel does not terminate.
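A minimal illustration of that scoping (a hypothetical kernel, not taken from your code):

//Each block gets its own private copy of s; a write in one block is
//never observed by threads of any other block.
__global__ void sharedScopeDemo(int* out) {
  __shared__ int s;
  if (threadIdx.x == 0)
    s = blockIdx.x;      //one writer per block
  __syncthreads();       //make the write visible within this block only
  if (threadIdx.x == 0)
    out[blockIdx.x] = s; //always this block's own value
}

Launched as sharedScopeDemo<<<4, 2>>>(out), out ends up as {0, 1, 2, 3}: no block ever sees another block’s write.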

That is not correct. A race condition is always problematic because it is undefined behavior. See the relevant sections of the CUDA C++ Programming Guide, which explain how your issue can be solved. One of those sections also states:

It is undefined behavior for two threads to read from or write to the same memory location without synchronization.

Thanks for that clarification. I thought the only implication of a race condition was the data issue per se (which, in this specific scenario, is not relevant).

I will take a look at the provided links, for sure.

It’s not undefined behavior for two threads to read from the same location.

It is undefined behavior for two threads to write to the same location without synchronization, but the extent of the undefined aspect is limited to which value will be present in that location. It is guaranteed, subject to a few requirements, that one of the written values will end up in that location.

One of those requirements is that each thread must access the location according to its type, and the use of memcpy does not satisfy that requirement.

If we switch the memcpy statement to a more ordinary-looking assignment statement, we can address that requirement.

Moving beyond that, we still have 2 issues:

  • a race condition that is still present, but you have stated that you don’t care whether 131 or 313 is reported.
  • visibility of *r.

We can “fix” the visibility of *r by marking it with the volatile keyword. You can look up the definition of that keyword in the programming guide.
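As for why the printf appeared to matter: my reading (a sketch of what the optimizer is permitted to do; I have not inspected the SASS generated for your build) is that without volatile the compiler may keep *r cached in a register, and an opaque call such as printf can inhibit that caching:

//Illustration only: with a non-volatile r, the compiler is allowed to
//compile the first kernel as if it were written like the second.
__global__ void spinAsWritten(int* r) {
  while (*r == 0) { }     //source re-reads *r every iteration
}

__global__ void spinAsOptimized(int* r) {
  int cached = *r;        //single load, hoisted out of the loop
  while (cached == 0) { } //never reloads, so it can spin forever
}

Marking r volatile forbids that transformation: every read of *r must go back to memory.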

I believe these modifications should allow you to get what you seem to be looking for:

__global__ void addKernel(int* x, const int *n, const int o, volatile int* r);

__global__ void addKernel(int* x, const int *n, const int o, volatile int* r){
  while (*r == 0) {
    //THIS PRINTF IN PARTICULAR!
    //printf("Trying with: %i result: %i\n", x[blockIdx.x * blockDim.x + threadIdx.x], *n % x[blockIdx.x * blockDim.x + threadIdx.x]);

    if (*n % x[blockIdx.x * blockDim.x + threadIdx.x] == 0 && x[blockIdx.x * blockDim.x + threadIdx.x] > 1) {
      *r = x[blockIdx.x * blockDim.x + threadIdx.x];
      printf("thread %d,%d wrote value: %d\n", (int)blockIdx.x, (int)threadIdx.x, *r);
    }

    x[blockIdx.x * blockDim.x + threadIdx.x] += o;
  }
}

Note that the race condition is still present. In the above code I presented, it is still possible to get printout from multiple different threads. However, only a thread that has discovered a factor will do so. And given that you have said you don’t care which factor gets reported, this would seem to be “benign”.
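If you ever wanted to guarantee that at most one thread publishes, an atomic compare-and-swap would do it. A minimal sketch (a hypothetical helper of my own, assuming *r is initialized to 0 as in your code):

//Hypothetical helper: atomicCAS stores value only if *r still holds 0,
//so exactly one thread's factor is ever recorded; the return value
//tells the caller whether it was the winning thread.
__device__ bool publishOnce(int* r, int value) {
  return atomicCAS(r, 0, value) == 0;
}

A thread would then print only when publishOnce() returns true, so the printout appears exactly once.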

Hi:

Thanks for the extra details.
The volatile keyword worked (I didn’t know it was available; I use it sometimes on DB2 databases).

Just to be sure: is the visibility of read-only variables subject to the requirements you wrote in your post? I mean, does accessing x[blockIdx.x * blockDim.x + threadIdx.x], as in x[blockIdx.x * blockDim.x + threadIdx.x] += o;, have some limitations on visibility?

Thanks.

In that case, x is not read-only.

A read-only variable, initialized in global memory prior to the launch of a kernel, should be identically visible to all threads in the kernel, without any special measures.

Oh, sure… Sorry, I didn’t use a good example.
Then *n and o are visible to all threads, as their values are assigned in main, right?

Sorry for the confusion.

Yes, *n and o are visible to all threads.
