How to synchronize a Kernel with many for loops

I have a kernel like this (I've simplified it for better understanding):

__global__ void calculate(int* sizes, int* sums, int n)
{
    int node = blockIdx.x * blockDim.x + threadIdx.x;
    int sum = 0;
    int node_i, node_j, node_k;

    if (node < n) {
        for (int i = 0; i < sizes[node]; i++) {
            // node_i = some arbitrary node
            for (int j = 0; j < sizes[node]; j++) {
                // node_j = some arbitrary node
                if (node_i != node_j) {
                    for (int k = 0; k < sizes[node_i]; k++) {
                        // node_k = some arbitrary node
                        if (node_k == node_j) {
                            sum++;
                        }
                    }
                }
            }
        }
        sums[node] = sum;
    }
}

It has 3 for loops whose iteration counts depend on the size of the node.

I call it like this:

calculate <<<blocksPerGrid, threadsPerBlock>>> (...);


The problem is that the threads do not seem to finish their work: the program ends before all the sums have been computed.

I thought cudaThreadSynchronize() would take care of this, but it doesn't seem to work.

How can I set up a barrier or something similar to wait until all the threads have finished their jobs? (At this point I don't mind if the performance is bad.)

Thank you in advance.

cudaStreamSynchronize(0) waits until all kernels in the default stream have finished.

However, what do you need this for? Usually you would either use the computation results in another kernel, or copy them back to the host with cudaMemcpy(). Both operations wait until the data is available (as long as they are not on a different stream), so there is no need for explicit synchronization.
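If in doubt, a quick sanity check is to inspect the error codes around the launch. A minimal sketch (the device pointer names sizes_d and sums_d are placeholders, not from the original code):

```cuda
// Launch the simplified kernel from above, then check both the launch
// itself and the kernel's execution for errors.
calculate<<<blocksPerGrid, threadsPerBlock>>>(sizes_d, sums_d, n);

cudaError_t err = cudaGetLastError();   // errors from the launch (bad config, etc.)
if (err != cudaSuccess)
    printf("launch failed: %s\n", cudaGetErrorString(err));

err = cudaDeviceSynchronize();          // errors from kernel execution
if (err != cudaSuccess)
    printf("kernel failed: %s\n", cudaGetErrorString(err));
```

cudaGetLastError() catches launch-time problems (e.g. an invalid configuration), while the code returned by cudaDeviceSynchronize() reflects anything that went wrong while the kernel actually ran.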

Thank you, but it didn’t work.

Maybe it's not a sync problem, but look: if I put

sums[node] = 1;

instead of

sums[node] = sum;

it works: at the end the sum is 1, just as I assigned it after the loops. Any guess?

Apparently the code does not compute what you think it does. Can you post the complete code?


__global__ void clusteringc_nlinks(dados_nodo* nodos, int* arestas, dados_clusteringc* clusteringc, int nnodos)
{
    int nodo = blockIdx.x * blockDim.x + threadIdx.x;
    int n_links = 0, aresta_i, aresta_j, aresta_k, nodo_i, nodo_j, nodo_k;

    if (nodo < nnodos) {
        for (aresta_i = 0; aresta_i < (nodos[nodo].outdegree); aresta_i++) {
            nodo_i = arestas[nodos[nodo].inicio + aresta_i];
            for (aresta_j = 0; aresta_j < (nodos[nodo].outdegree); aresta_j++) {
                nodo_j = arestas[nodos[nodo].inicio + aresta_j];
                if (nodo_i != nodo_j) {
                    for (aresta_k = 0; aresta_k < (nodos[nodo_i].outdegree); aresta_k++) {
                        nodo_k = arestas[nodos[nodo_i].inicio + aresta_k];
                        if (nodo_k == nodo_j) {
                            n_links++;
                        }
                    }
                }
            }
        }
        clusteringc[nodo].n_links = n_links;
    }
    __syncthreads();
}

I don’t see anything that could be problematic with CUDA. Does the algorithm produce the desired result in plain C on the CPU?

BTW, the __syncthreads() at the end of the kernel serves no purpose and should just be dropped.

Yes, I wrote the C version first, and it gives the right results.

Then I "ported" it to CUDA.

So it must be a different problem — maybe in the kernel call, or a kernel or compiler error.

I’m calling like this:

int threadsPerBlock = 256;

int blocksPerGrid = (nnodos + threadsPerBlock - 1) / threadsPerBlock;

clusteringc_calcula <<<blocksPerGrid, threadsPerBlock>>> (clusteringc_d, nnodos);

Have you checked the return code from cudaStreamSynchronize(0)?

printf("cudaStreamSynchronize(0) returned \"%s\".\n", cudaGetErrorString(cudaStreamSynchronize(0)));

What happens if you run your program with cuda-memcheck?

cudaStreamSynchronize(0) returned "no error".

cuda-memcheck returned an error:

========= Invalid __global__ read of size 4

=========     at 0x000000a0 in clusteringc_nlinks

=========     by thread (2,0,0) in block (0,0,0)

=========     Address 0x00101128 is out of bounds


========= ERROR SUMMARY: 1 error

Ok, so you have an out-of-bounds memory access. As the access pattern of your code is data dependent, you need to look at the data your kernel is working on. Try copying the data back from GPU to CPU and check it is what you expect. Also make sure this particular data causes no out-of-bounds access on the CPU.

If nothing works out, you can explicitly check the array bounds in your kernel and print a useful diagnostic on failure.
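An in-kernel bounds check could look like the sketch below. It assumes an extra parameter narestas (the total length of the arestas array) that the original kernel does not have; note that device-side printf requires a device of compute capability 2.0 or later.

```cuda
// Inside the innermost loops, validate every index before using it.
int idx = nodos[nodo].inicio + aresta_i;
if (idx < 0 || idx >= narestas) {       // narestas: assumed extra kernel parameter
    printf("out of bounds: nodo=%d idx=%d narestas=%d\n", nodo, idx, narestas);
    return;                             // bail out instead of reading garbage
}
nodo_i = arestas[idx];
```

The same guard would apply to the reads indexed by aresta_j and aresta_k.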

I got it!

The malloc for the "arestas" array was sized for the "nodos" structure, so it was too small!
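For anyone hitting the same issue, the allocation has to be sized by the edge count, not the node count. A sketch of the fix (the names narestas, nodos_d, and arestas_d are assumed here, not taken from the original code):

```cuda
// Wrong: the edge array was sized like the node array
// cudaMalloc((void**)&arestas_d, nnodos * sizeof(dados_nodo));

// Right: one int per edge
cudaMalloc((void**)&nodos_d,   nnodos   * sizeof(dados_nodo));
cudaMalloc((void**)&arestas_d, narestas * sizeof(int));
```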

Thank you very much!