This is not a real problem. I know there are other algorithms that solve this; I was wondering whether there is a way to handle loop-carried dependencies in CUDA kernels using a naive CPU-style algorithm (with a big hit to performance, of course).
#include <iostream>

__global__ void foo(int* bar) {
    int64_t j = 0;  // intended loop-carried variable
    int64_t i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < 5) {
        bar[j] = j;  // the loop-carried dependency is on j
        j++;
    }
}

int main() {
    int bar[5];
    int* d_bar;
    cudaMalloc((void**)&d_bar, 5 * sizeof(int));
    cudaMemcpy(d_bar, bar, 5 * sizeof(int), cudaMemcpyHostToDevice);
    foo<<<1, 5>>>(d_bar);
    cudaMemcpy(bar, d_bar, 5 * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_bar);
    for (int i = 0; i < 5; i++) {
        std::cout << bar[i] << "\n";
    }
    return 0;
}
The loop-carried dependency is j. Since these are parallel threads, there will be a race condition and I will not get the values I want.
How do I stop the other threads from executing when I reach the assignment to j, and only let the next thread resume once the previous thread has completed?
For now I’m ignoring some inconsistencies such as the fact that j is a thread-local variable in your code. It’s evident that you intend it to be shared or global somehow.
I guess you’re asking for a specific machine behavior: you want threads 2-5 to wait for thread 1, then threads 3-5 to wait for thread 2, and so on; i.e., you want serialization.
You can certainly do it, but it’s not an efficient way to use the machine. When doing parallel programming, especially when “converting” from a serial approach, it’s often a good idea to frame the problem in terms of algorithm inputs and outputs rather than prescribing low-level thread behavior. Many algorithms have a substantially different realization in a parallel environment, for efficiency, compared to the serial realization.
Since I don’t know what you really want to accomplish algorithmically, I’m not sure I can offer any more concrete advice. For example, your proposed intent would place values like so:
bar[0] = 0;
bar[1] = 1;
etc.
That input-output statement would be trivial to accomplish in a thread-parallel way:
__global__ void foo(int* bar) {
    int64_t i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < 5) {
        bar[i] = i;  // each thread writes its own element: no dependency, no race
    }
}
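With the same main() as in your post, this prints 0 through 4. Each thread writes a distinct element, so there is nothing to serialize.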
But I’m guessing from all the dust in the air that this is not really what you want, so we have an X-Y problem. I can’t solve that without knowing the actual problem X, because the problem you have presented (Y) cannot be done efficiently, and I suspect Y is not your actual desire.
[i]For now I’m ignoring some inconsistencies such as the fact that j is a thread-local variable in your code. It’s evident that you intend it to be shared or global somehow.[/i]
Yes! Thank you!
[i]I guess you’re asking for a specific machine behavior: you want threads 2-5 to wait for thread 1, then threads 3-5 to wait for thread 2, and so on; i.e., you want serialization.
You can certainly do it.[/i]
Can you please tell me how? That is exactly what I am trying to figure out.
[i]But I’m guessing from all the dust in the air that this is not really what you want, so we have an X-Y problem. I can’t solve that without knowing the actual problem X, because the problem you have presented (Y) cannot be done efficiently, and I suspect Y is not your actual desire.[/i]
I am sorry if I wasn’t clear: this is not a real problem. I am just trying to understand something. That’s it :)
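Since the follow-up asks how: one way to get exactly that thread-after-thread ordering is to spin-wait on a shared counter. Below is a minimal sketch (the name foo_serial is mine, not from the original posts), assuming a single threadblock and a GPU with independent thread scheduling (Volta or newer); on older architectures, a divergent spin-wait inside a warp like this can hang. It demonstrates the mechanics only; it deliberately serializes the threads and therefore throws away the parallelism of the machine.

// j is the shared, loop-carried counter; it doubles as the "ticket":
// thread i may proceed only once j has reached i.
__device__ volatile int j = 0;

__global__ void foo_serial(int* bar) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < 5) {
        while (j != i) { }  // spin until the previous thread passes the turn
        bar[j] = j;         // the "loop body", executed by one thread at a time
        __threadfence();    // make prior writes visible before passing the turn
        j = j + 1;          // pass the turn to thread i+1 (one writer at a time, so no atomic needed)
    }
}

Launched as foo_serial<<<1, 5>>>(d_bar) from the same main() as in the original post, this fills bar with 0 through 4 in thread-index order. The volatile qualifier keeps the compiler from caching j in a register during the spin, and the __threadfence() matters whenever the serialized body writes data that the next thread must read.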