How to synchronize a kernel with many for loops

I have a kernel like this (I’ve simplified it for better understanding):

__global__ void calculate(int* sizes, int* sums, int n)
{
    int node = blockIdx.x * blockDim.x + threadIdx.x;
    int sum = 0;
    int i, j, k, node_i, node_j, node_k;

    if (node < n) {
        for (i = 0; i < sizes[node]; i++) {
            // node_i = some arbitrary node
            for (j = 0; j < sizes[node]; j++) {
                // node_j = some arbitrary node
                if (node_i != node_j) {
                    for (k = 0; k < sizes[node_i]; k++) {
                        // node_k = some arbitrary node
                        if (node_k == node_j) {
                            sum++;
                            break;
                        }
                    }
                }
            }
        }
        sums[node] = sum;
    }
    __syncthreads();
}

It has three nested for loops whose iteration counts depend on the size of the node.

I call it like this:

calculate <<<blocksPerGrid, threadsPerBlock>>> (...);
cudaThreadSynchronize();

The problem is that the threads do not seem to finish properly: the program ends before all the sums have been computed.

I thought “cudaThreadSynchronize” would do this, but it seems that it’s not working.

How can I set up a barrier or something like that to wait for all the threads to finish their jobs? (At this point I don’t mind if the performance is bad.)

Thank you in advance.

cudaStreamSynchronize(0) waits until all kernels in the default stream have finished.

However, what do you need this for? Usually you would either use the computation results in another kernel, or copy them back to the host with cudaMemcpy(). Both operations wait until the data is available (as long as they are not on a different stream), so there is no need for explicit synchronization.
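For example, a minimal sketch of this implicit synchronization (the _d/_h suffixes and buffer names are just placeholders, not from your code):

// cudaMemcpy on the default stream implicitly waits for the
// preceding kernel launch, so no explicit barrier is needed.
int* sums_h = (int*)malloc(n * sizeof(int));

calculate<<<blocksPerGrid, threadsPerBlock>>>(sizes_d, sums_d, n);

// This copy does not start until calculate() has finished.
cudaMemcpy(sums_h, sums_d, n * sizeof(int), cudaMemcpyDeviceToHost);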

Thank you, but it didn’t work.

Maybe it’s not a synchronization problem after all, but look: if I put

sums[node] = 1;

instead of

sums[node] = sum;

it works; at the end the sum is 1, just as assigned after the loops. Any guess?

Apparently the code does not compute what you think it does. Can you post the complete code?

Sure.

// nodos = nodes, arestas = edges, inicio = index of the node's first
// edge in the arestas array, outdegree = number of outgoing edges
__global__ void clusteringc_nlinks(dados_nodo* nodos, int* arestas, dados_clusteringc* clusteringc, int nnodos)
{
    int nodo = blockIdx.x * blockDim.x + threadIdx.x;
    int n_links = 0, aresta_i, aresta_j, aresta_k, nodo_i, nodo_j, nodo_k;

    if (nodo < nnodos) {
        // for every ordered pair of distinct neighbors (nodo_i, nodo_j) of nodo ...
        for (aresta_i = 0; aresta_i < (nodos[nodo].outdegree); aresta_i++) {
            nodo_i = arestas[nodos[nodo].inicio + aresta_i];
            for (aresta_j = 0; aresta_j < (nodos[nodo].outdegree); aresta_j++) {
                nodo_j = arestas[nodos[nodo].inicio + aresta_j];
                if (nodo_i != nodo_j) {
                    // ... count a link if nodo_i has an edge to nodo_j
                    for (aresta_k = 0; aresta_k < (nodos[nodo_i].outdegree); aresta_k++) {
                        nodo_k = arestas[nodos[nodo_i].inicio + aresta_k];
                        if (nodo_k == nodo_j) {
                            n_links++;
                            break;
                        }
                    }
                }
            }
        }
        clusteringc[nodo].n_links = n_links;
    }
    __syncthreads();
}

I don’t see anything that could be problematic with CUDA. Does the algorithm produce the desired result in plain C on the CPU?
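If you haven’t, a straight CPU port of the kernel lets you diff the results element by element. A minimal sketch in plain C, using the same names as your kernel:

void clusteringc_nlinks_cpu(const dados_nodo* nodos, const int* arestas,
                            dados_clusteringc* clusteringc, int nnodos)
{
    for (int nodo = 0; nodo < nnodos; nodo++) {
        int n_links = 0;
        for (int aresta_i = 0; aresta_i < nodos[nodo].outdegree; aresta_i++) {
            int nodo_i = arestas[nodos[nodo].inicio + aresta_i];
            for (int aresta_j = 0; aresta_j < nodos[nodo].outdegree; aresta_j++) {
                int nodo_j = arestas[nodos[nodo].inicio + aresta_j];
                if (nodo_i == nodo_j) continue;
                for (int aresta_k = 0; aresta_k < nodos[nodo_i].outdegree; aresta_k++) {
                    if (arestas[nodos[nodo_i].inicio + aresta_k] == nodo_j) {
                        n_links++;
                        break;
                    }
                }
            }
        }
        clusteringc[nodo].n_links = n_links;
    }
}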

BTW, the __syncthreads() at the end of the kernel serves no purpose and should just be dropped.

Yes. I wrote the C version first, and it gives the right results.

Then I “ported” it to CUDA.

So the problem must lie elsewhere: maybe in the kernel call, an error in the kernel itself, or a compiler error.

I’m calling like this:

int threadsPerBlock = 256;
int blocksPerGrid = (nnodos + threadsPerBlock - 1) / threadsPerBlock;

clusteringc_nlinks <<<blocksPerGrid, threadsPerBlock>>> (clusteringc_d, nnodos);

Have you checked the return code from cudaStreamSynchronize(0)?

printf("cudaStreamSynchronize(0) returned \"%s\".\n", cudaGetErrorString(cudaStreamSynchronize(0)));

What happens if you run your program with cuda-memcheck?
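More generally, it helps to check the return code of every CUDA call. A minimal sketch of such a checking macro (the name CUDA_CHECK is just a placeholder, not a CUDA API):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with file/line information if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error \"%s\" at %s:%d\n",           \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage: wrap runtime calls directly; kernel launches have no return
// value, so poll with cudaGetLastError() right after the launch:
//   clusteringc_nlinks<<<blocksPerGrid, threadsPerBlock>>>(...);
//   CUDA_CHECK(cudaGetLastError());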

cudaStreamSynchronize(0) returned "no error".

cuda-memcheck returned an error:

========= Invalid __global__ read of size 4
=========     at 0x000000a0 in clusteringc_nlinks
=========     by thread (2,0,0) in block (0,0,0)
=========     Address 0x00101128 is out of bounds
=========
========= ERROR SUMMARY: 1 error

Ok, so you have an out-of-bounds memory access. As the access pattern of your code is data dependent, you need to look at the data your kernel is working on. Try copying the data back from GPU to CPU and check it is what you expect. Also make sure this particular data causes no out-of-bounds access on the CPU.

If nothing works out, you can explicitly check the array bounds in your kernel and print a useful diagnostic on failure.
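For example, a minimal sketch of such a check, recording a diagnostic flag in global memory (device-side printf is not available on all hardware; narestas, the length of the edge array, is a parameter you would have to pass in yourself):

// Records the first out-of-bounds edge index instead of performing the bad read.
__device__ int g_bad_index = -1;

__device__ int read_aresta_checked(const int* arestas, int idx, int narestas)
{
    if (idx < 0 || idx >= narestas) {
        atomicCAS(&g_bad_index, -1, idx);   // remember the first offender
        return -1;                          // sentinel: caller should bail out
    }
    return arestas[idx];
}

// After the kernel, copy the flag back on the host:
//   int bad_index;
//   cudaMemcpyFromSymbol(&bad_index, g_bad_index, sizeof(int));
//   if (bad_index != -1) printf("out-of-bounds edge index: %d\n", bad_index);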

I got it!

The malloc for the “arestas” array used the size of the “nodos” array, so it was too small!
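In other words (a minimal sketch with hypothetical names nodos_d, arestas_d and counts nnodos, narestas, since I didn’t post the allocation code), each buffer must be sized by its own element count and element type:

// Wrong: sizing the edge array by the node count and node type
// cudaMalloc((void**)&arestas_d, nnodos * sizeof(dados_nodo));

// Right: each buffer sized by its own element count and type
cudaMalloc((void**)&nodos_d,   nnodos   * sizeof(dados_nodo));
cudaMalloc((void**)&arestas_d, narestas * sizeof(int));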

Thank you very much!