Stop other threads from executing in parallel

This is not a real problem - I know that there are other algorithms to solve this. I was wondering if there is a way to handle loop-carried dependencies in CUDA kernels using naive CPU algorithms (with a big hit in performance, of course).

#include <iostream>

__global__ void foo(int* bar) {
    int64_t j = 0;
    int64_t i = threadIdx.x + blockIdx.x*blockDim.x;
    if (i<5) {
        bar[j] = j;
        j++;
    }
}

int main() {
    int bar[5];
    int *d_bar;
    cudaMalloc((void **) &d_bar, 5*sizeof(int));
    cudaMemcpy(d_bar, bar, 5*sizeof(int), cudaMemcpyHostToDevice);
    foo<<<1, 5>>>(d_bar);
    cudaMemcpy(bar, d_bar, 5*sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_bar);
    for (int i=0; i<5; i++) {
        std::cout << bar[i] << "\n";
    }
    return 0;
}

The loop-carried dependency is j. Since these are parallel threads, there will be a race condition and I will not get the values I want.

How do I stop the other threads from executing when I come to the assignment involving j, and only resume the next thread once the previous thread has completed?

atomics

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomic-functions
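
For illustration, a minimal sketch of how an atomic could resolve the dependency on j in your example. The extra counter argument (which the host would have to zero before the launch) is my own addition, not something from your code:

__global__ void foo(int* bar, int* counter) {
    int i = threadIdx.x + blockIdx.x*blockDim.x;
    if (i < 5) {
        // atomicAdd returns the old value of *counter,
        // so each thread receives a unique j with no race
        int j = atomicAdd(counter, 1);
        bar[j] = j;
    }
}

Note that this hands out values of j in whatever order the threads happen to arrive; that is enough for bar[j] = j, but it is not the same as serializing the threads.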

Thanks! But I think I might be doing something wrong.

I replaced -

bar[j] = j;
j++;

with -

atomicExch(bar+j, j);
atomicExch(&j, j+1);

and I still get garbage values. What am I doing wrong?

Your code doesn’t make much sense to me:

__global__ void foo(int* bar) {
    int64_t j = 0;
    int64_t i = threadIdx.x + blockIdx.x*blockDim.x;
    if (i<5) {
        bar[j] = j;
        j++;
    }
}

Every thread initially has j set to zero. Then the first 5 threads will execute:

bar[j] = j;

which is basically:

bar[0] = 0;

So I would expect the first element of bar to be zero, and the other elements to be unchanged. That is true even without atomics.

You need to communicate what your desired behavior is. Your initial code doesn’t communicate it to me, anyway.

I’ll try to explain my desired behaviour better.

In my original code -

__global__ void foo(int* bar) {     
    int64_t j = 0;     
    int64_t i = threadIdx.x + blockIdx.x*blockDim.x;

I want all threads to stop here.

Just the first thread executes this -

if (i<5) {         
    bar[j] = j;         
    j++;     
}

The value of j is updated for the second thread.
The second thread then executes this -

if (i<5) {         
    bar[j] = j;         
    j++;     
}

and so on…

For now I’m ignoring some inconsistencies such as the fact that j is a thread-local variable in your code. It’s evident that you intend it to be shared or global somehow.

I guess you’re asking for a specific machine behavior. You want threads 2-5 to wait for thread 1, then threads 3-5 to wait for thread 2, and so on; i.e., you want serialization.

You can certainly do it. It’s not an efficient way to use the machine. When doing parallel programming, especially when “converting” from a serial approach, it’s often a good idea to construct the problem statement from the standpoint of algorithm inputs and outputs, rather than specifying low-level thread behavior. Many algorithms will have a substantially different realization in a parallel environment, for efficiency, compared to the serial realization.

Since I don’t know what you really want to accomplish algorithmically, I’m not sure I can offer any more concrete advice. For example, your proposed intent would place values like so:

bar[0] = 0;
bar[1] = 1;
etc.

That input-output statement would be trivial to accomplish in a thread-parallel way:

__global__ void foo(int* bar) {
    int64_t i = threadIdx.x + blockIdx.x*blockDim.x;
    if (i<5) {
        bar[i] = i;
    }
}

But I’m guessing from all the dust in the air that that is not really what you want, so we have an x-y problem. I can’t solve that without knowing the actual problem (x), because the problem you have presented (y) cannot be done efficiently. I suspect that y is not your actual desire, however.

For now I’m ignoring some inconsistencies such as the fact that j is a thread-local variable in your code. It’s evident that you intend it to be shared or global somehow.

Yes! Thank you!

I guess you’re asking for a specific machine behavior. You want threads 2-5 to wait for thread 1, then threads 3-5 to wait for thread 2, and so on; i.e., you want serialization.

You can certainly do it

Can you please tell me how? That is exactly what I am trying to figure out.

But I’m guessing from all the dust in the air that that is not really what you want, so we have an x-y problem. I can’t solve that without knowing the actual problem (x), because the problem you have presented (y) cannot be done efficiently. I suspect that y is not your actual desire, however.

I am sorry if I wasn’t clear: this is not a real problem. I am trying to understand something. That’s it :)

Here is one possible approach. It is not very efficient, and it would need to be rewritten if you want sz > 1024:

$ cat t1655.cu
#include <iostream>
const int sz = 5;
__global__ void foo(int* bar) {
    __shared__ int j;
    int idx = threadIdx.x + blockIdx.x*blockDim.x;
    if (!threadIdx.x) j = 0;
    for (int i = 0; i < sz; i++){
      __syncthreads();
      if (i == idx){ bar[j] = j; j++;}
      }
}

int main() {
    int bar[sz];
    int *d_bar;
    cudaMalloc((void **) &d_bar, sz*sizeof(int));
    cudaMemcpy(d_bar, bar, sz*sizeof(int), cudaMemcpyHostToDevice);
    foo<<<1, sz>>>(d_bar);
    cudaMemcpy(bar, d_bar, sz*sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_bar);
    for (int i=0; i<sz; i++) {
        std::cout << bar[i] << "\n";
    }
    return 0;
}
$ nvcc -o t1655 t1655.cu
$ cuda-memcheck ./t1655
========= CUDA-MEMCHECK
0
1
2
3
4
========= ERROR SUMMARY: 0 errors
$
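
Note that the __syncthreads() at the top of each loop iteration is what does the serialization work here: it is a barrier for the whole thread block, and it also makes the previous thread’s update of the shared variable j visible before the next thread takes its turn.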

Thanks!

Can you explain this line please?

if (!threadIdx.x) j = 0;

And does this only work for a single thread block?

  1. It says, “if you are thread 0 of the thread block, then initialize j to 0.”

  2. Correct. And a thread block can hold a maximum of 1024 threads.
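
For what it’s worth, here is a sketch (untested, and my own variation rather than anything from earlier in this thread) of one possible rewrite that handles sz larger than the block size, still within a single block:

__global__ void foo(int* bar, int sz) {
    __shared__ int j;
    if (!threadIdx.x) j = 0;
    for (int i = 0; i < sz; i++){
      __syncthreads();                      // all threads hit the barrier together
      if ((i % blockDim.x) == threadIdx.x){ // exactly one thread takes each turn
        bar[j] = j;
        j++;
      }
    }
}

Each thread now takes every blockDim.x-th turn instead of at most one, so it could be launched as, say, foo<<<1, 1024>>>(d_bar, sz) for any sz. It is still limited to one block, because __syncthreads() only synchronizes the threads of a single block.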

Thanks!

And there cannot be more than 1 thread block because __syncthreads() only syncs the threads within a single thread block?

In the above solution, will it break if the number of threads is less than sz?