Stop other threads from executing in parallel

This is not a real problem - I know that there are other algorithms to solve this. I was wondering if there is a way to handle loop-carried dependencies in CUDA kernels using naive CPU algorithms (with a big hit in performance, of course).

#include <iostream>

__global__ void foo(int* bar) {
    int64_t j = 0;
    int64_t i = threadIdx.x + blockIdx.x*blockDim.x;
    if (i<5) {
        bar[j] = j;
        j++;
    }
}

int main() {
    int bar[5];
    int *d_bar;
    cudaMalloc((void **) &d_bar, 5*sizeof(int));
    cudaMemcpy(d_bar, bar, 5*sizeof(int), cudaMemcpyHostToDevice);
    foo<<<1, 5>>>(d_bar);
    cudaMemcpy(bar, d_bar, 5*sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_bar);
    for (int i=0; i<5; i++) {
        std::cout << bar[i] << "\n";
    }
    return 0;
}

The loop-carried dependency is j. Since these are parallel threads, there will be a race condition and I will not get the values I want.

How do I stop the other threads from executing when I come to the assignment involving j, and only resume the next thread once the previous thread has completed?

atomics

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomic-functions
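
For illustration, a minimal sketch of how an atomic could resolve the dependency on j in your example. The extra counter argument (which the host would have to zero before the launch) is my own addition, not something from your code:

__global__ void foo(int* bar, int* counter) {
    int i = threadIdx.x + blockIdx.x*blockDim.x;
    if (i < 5) {
        // atomicAdd returns the old value of *counter,
        // so each thread receives a unique j with no race
        int j = atomicAdd(counter, 1);
        bar[j] = j;
    }
}

Note that this hands out values of j in whatever order the threads happen to arrive; that is enough for bar[j] = j, but it is not the same as serializing the threads.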

Thanks! But I think I might be doing something wrong.

I replaced -

bar[j] = j;
j++;

with -

atomicExch(bar+j, j);
atomicExch(&j, j+1);

and I still get garbage values. What am I doing wrong?

Your code doesn’t make much sense to me:

__global__ void foo(int* bar) {
    int64_t j = 0;
    int64_t i = threadIdx.x + blockIdx.x*blockDim.x;
    if (i<5) {
        bar[j] = j;
        j++;
    }
}

Every thread initially has j set to zero. Then the first 5 threads will execute:

bar[j] = j;

which is basically:

bar[0] = 0;

So I would expect the first element of bar to be zero, and the other elements to be unchanged. That is true even without atomics.

You need to communicate what your desired behavior is. Your initial code doesn’t communicate it to me, anyway.

I’ll try to explain my desired behaviour better.

In my original code -

__global__ void foo(int* bar) {     
    int64_t j = 0;     
    int64_t i = threadIdx.x + blockIdx.x*blockDim.x;

I want all threads to stop here.

Just the first thread executes this -

if (i<5) {         
    bar[j] = j;         
    j++;     
}

The value of j is updated for the second thread.
The second thread then executes this -

if (i<5) {         
    bar[j] = j;         
    j++;     
}

and so on…

For now I’m ignoring some inconsistencies such as the fact that j is a thread-local variable in your code. It’s evident that you intend it to be shared or global somehow.

I guess you’re asking for a specific machine behavior. You want threads 2-5 to wait for thread 1, then threads 3-5 to wait for thread 2, and so on; i.e., you want serialization.

You can certainly do it. It’s not an efficient way to use the machine. When doing parallel programming, especially when “converting” from a serial approach, it’s often a good idea to construct the problem statement from the standpoint of algorithm inputs and outputs, rather than specifying low-level thread behavior. Many algorithms will have a substantially different realization in a parallel environment, for efficiency, compared to the serial realization.

Since I don’t know what you really want to accomplish algorithmically, I’m not sure I can offer any more concrete advice. For example, your proposed intent would place values like so:

bar[0] = 0;
bar[1] = 1;
etc.

That input-output statement would be trivial to accomplish in a thread-parallel way:

__global__ void foo(int* bar) {
    int64_t i = threadIdx.x + blockIdx.x*blockDim.x;
    if (i<5) {
        bar[i] = i;
    }
}

But I’m guessing from all the dust in the air that that is not really what you want, so we have an x-y problem. I can’t solve that without knowing the actual problem (x), because the problem you have presented (y) cannot be done efficiently. I suspect that y is not your actual desire, however.

For now I’m ignoring some inconsistencies such as the fact that j is a thread-local variable in your code. It’s evident that you intend it to be shared or global somehow.

Yes! Thank you!

I guess you’re asking for a specific machine behavior. You want threads 2-5 to wait for thread 1, then threads 3-5 to wait for thread 2, and so on; i.e., you want serialization.

You can certainly do it

Can you please tell me how? That is exactly what I am trying to figure out.

But I’m guessing from all the dust in the air that that is not really what you want, so we have an x-y problem. I can’t solve that without knowing the actual problem (x), because the problem you have presented (y) cannot be done efficiently. I suspect that y is not your actual desire, however.

I am sorry if I wasn’t clear: this is not a real problem. I am trying to understand something. That’s it :)

Here is one possible approach. It is not very efficient, and it would need to be rewritten if you want sz > 1024:

$ cat t1655.cu
#include <iostream>
const int sz = 5;
__global__ void foo(int* bar) {
    __shared__ int j;
    int idx = threadIdx.x + blockIdx.x*blockDim.x;
    if (!threadIdx.x) j = 0;
    for (int i = 0; i < sz; i++){
      __syncthreads();
      if (i == idx){ bar[j] = j; j++;}
      }
}

int main() {
    int bar[sz];
    int *d_bar;
    cudaMalloc((void **) &d_bar, sz*sizeof(int));
    cudaMemcpy(d_bar, bar, sz*sizeof(int), cudaMemcpyHostToDevice);
    foo<<<1, sz>>>(d_bar);
    cudaMemcpy(bar, d_bar, sz*sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_bar);
    for (int i=0; i<sz; i++) {
        std::cout << bar[i] << "\n";
    }
    return 0;
}
$ nvcc -o t1655 t1655.cu
$ cuda-memcheck ./t1655
========= CUDA-MEMCHECK
0
1
2
3
4
========= ERROR SUMMARY: 0 errors
$
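
Note that the __syncthreads() at the top of each loop iteration is what does the serialization work here: it is a barrier for the whole thread block, and it also makes the previous thread’s update of the shared variable j visible before the next thread takes its turn.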

Thanks!

Can you explain this line please?

if (!threadIdx.x) j = 0;

And does this only work for a single thread block?

  1. It says, “if you are thread 0 of the thread block, then initialize j to 0.”

  2. Correct. And a thread block can hold a maximum of 1024 threads.
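
For what it’s worth, here is a sketch (untested, and my own variation rather than anything from earlier in this thread) of one possible rewrite that handles sz larger than the block size, still within a single block:

__global__ void foo(int* bar, int sz) {
    __shared__ int j;
    if (!threadIdx.x) j = 0;
    for (int i = 0; i < sz; i++){
      __syncthreads();                      // all threads hit the barrier together
      if ((i % blockDim.x) == threadIdx.x){ // exactly one thread takes each turn
        bar[j] = j;
        j++;
      }
    }
}

Each thread now takes every blockDim.x-th turn instead of at most one, so it could be launched as, say, foo<<<1, 1024>>>(d_bar, sz) for any sz. It is still limited to one block, because __syncthreads() only synchronizes the threads of a single block.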

Thanks!

And there cannot be more than 1 thread block because __syncthreads() only syncs the threads within a single thread block?

In the above solution, will it break if the number of threads is less than sz?