Concurrent thread access in Dynamic Programming Convolution kernel

Hi everyone. First of all, here is my code:

#include <cstdio>
#include <cuda_runtime.h>

// Child kernel: each thread ANDs one input element in the window with one
// mask element; any hit clears the single output cell shared by the grid.
__global__ void inner_convolution(bool* input, bool* kernel, bool* output, int support_index, int n_kernel)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if(idx >= n_kernel) return;

    if(input[support_index + idx] && kernel[idx])
    {
        output[support_index] = false;
    }
}

// Parent kernel: each thread records its window start index and launches a
// child grid (CUDA Dynamic Parallelism) to apply the mask to that window.
__global__ void outer_convolution(bool* input, bool* kernel, bool* output, int* support_indices, int n_input, int n_kernel)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if(idx >= n_input - n_kernel + 1) return;
    support_indices[idx] = idx;

    // The child grid only needs enough threads to cover the mask elements.
    dim3 block(128);
    dim3 grid((n_kernel + block.x - 1) / block.x);
    inner_convolution<<<grid, block>>>(input, kernel, output, support_indices[idx], n_kernel);
}

int main()
{
    int n_input_array = 100000;
    int n_kernel_array = 400;

    bool* input_array = (bool*)malloc(sizeof(bool) * n_input_array);
    for(int i = 0; i < n_input_array; i++)
    {
        input_array[i] = ((i + 39) / 5) % 2 == 0 ? false: true;
    }
    bool* kernel_array = (bool*)malloc(sizeof(bool) * n_kernel_array);
    for(int i = 0; i < n_kernel_array; i++)
    {
        kernel_array[i] = ((i % 777) == 0);
    }

    bool* d_input_array;
    bool* d_kernel_array;

    cudaMalloc((void**)&d_input_array, sizeof(bool) * n_input_array);
    cudaMalloc((void**)&d_kernel_array, sizeof(bool) * n_kernel_array);
    cudaMemcpy(d_input_array, input_array, sizeof(bool) * n_input_array, cudaMemcpyHostToDevice);
    cudaMemcpy(d_kernel_array, kernel_array, sizeof(bool) * n_kernel_array, cudaMemcpyHostToDevice);

    int* d_support_indices;
    bool* d_output;
    cudaMalloc((void**)&d_support_indices, sizeof(int) * (n_input_array - n_kernel_array + 1));
    cudaMalloc((void**)&d_output, sizeof(bool) * (n_input_array - n_kernel_array + 1));
    // Initialize every output cell to true (cudaMemset writes the byte value 1).
    cudaMemset(d_output, true, sizeof(bool) * (n_input_array - n_kernel_array + 1));

    dim3 block1(128);
    dim3 grid1(((n_input_array - n_kernel_array + 1) + block1.x - 1) / block1.x);

    outer_convolution<<<grid1, block1>>>(d_input_array, d_kernel_array, d_output, d_support_indices, n_input_array, n_kernel_array);
    cudaDeviceSynchronize();

    bool* output = (bool*) malloc(sizeof(bool) * (n_input_array - n_kernel_array + 1));
    cudaMemcpy(output, d_output, sizeof(bool) * (n_input_array - n_kernel_array + 1), cudaMemcpyDeviceToHost);

    cudaFree(d_input_array);
    cudaFree(d_kernel_array);
    cudaFree(d_support_indices);
    cudaFree(d_output);

    for(int i = 0; i < n_input_array - n_kernel_array + 1; i++)
    {
        printf("%d ", output[i]);
    }
    printf("\n");

    free(input_array);
    free(kernel_array);
    free(output);

    return 0;
}

The bool data type is used since this is C++ code. The code itself performs a kind of convolution:

  1. An array of boolean values, input_array, is given, along with a kernel, kernel_array; they hold n_input_array and n_kernel_array elements respectively.
  2. The data is copied to device global memory, and a first kernel, “outer_convolution”, is launched. The only goal of each thread of this first kernel is to “forward” its own index to a child grid that it creates, by storing that index in the corresponding cell of “support_indices”.
  3. Each child thread then applies the kernel mask to a batch of consecutive elements of the input array; the starting index is retrieved from what the parent thread forwarded to the child grid. The masking works as follows: each child thread computes the logical AND between one element of the input array and one element of the mask. If both are true, it sets one element of the output array, which is shared by all the child threads launched by that single parent thread, directly to false. (A sequential sketch of the intended computation follows this list.) This means that different threads access the same data concurrently, which can cause race conditions. Or does it, really? Host-side, at the very beginning, the array d_output, in which the results of the convolution will be stored, is initialized entirely to true. The only thing a thread can do is set a specific element of that output array to false, never to true. So, at least in theory, even with concurrent access no race condition should occur: either no thread wants to set the value at a given location of d_output to false, or one or more threads want to set it to false concurrently. In the first case the data is left untouched; in the second, it ends up false either way.
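For clarity, here is a minimal sequential sketch of the computation the two kernels are meant to perform (plain C++, written just for this post):

// Sequential reference: output[i] stays true unless the mask "hits" the
// input anywhere in the window starting at i.
void reference_convolution(const bool* input, const bool* kernel, bool* output,
                           int n_input, int n_kernel)
{
    for (int i = 0; i < n_input - n_kernel + 1; i++)
    {
        output[i] = true;
        for (int j = 0; j < n_kernel; j++)
        {
            if (input[i + j] && kernel[j])
            {
                output[i] = false; // any overlap of true values clears this cell
            }
        }
    }
}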

At least, this is the behaviour I would expect. But if you run this program twice and compare the outputs, you will see that they are different. My question is: how is that possible? I get that concurrent access to the same address is bad, and that’s exactly what I’m doing here. But I’m not allowing different threads to concurrently set the same shared variable sometimes to true and sometimes to false, only ever to false. So I would have thought that concurrent access wouldn’t be much of a problem in this specific case, but apparently it is.

Any help would be appreciated: am I doing something wrong? Why does the data seemingly get corrupted? What happens under the hood here?

Have you set cudaLimitDevRuntimePendingLaunchCount high enough? The default number of launch slots is 2048.
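If you need more slots, you can raise the limit from the host before the first launch. A minimal sketch (the value is illustrative; one child grid is launched per parent thread, so size the pool accordingly):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // Reserve enough pending-launch slots for all child grids
    // (99601 parent threads for the dimensions in the posted code).
    cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 100000);

    size_t limit = 0;
    cudaDeviceGetLimit(&limit, cudaLimitDevRuntimePendingLaunchCount);
    printf("pending launch count: %zu\n", limit);
    return 0;
}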

Your code as posted will not compile.

You are indeed exceeding the pending launch limit.

Any time you are having trouble with CUDA code, I encourage proper CUDA error checking. When using CUDA Dynamic Parallelism (CDP) as you are, that applies to proper instrumentation in device code as well. If you apply that to your code, you will observe errors being reported. Even if you don’t understand the error output, it will be useful for others trying to help you.
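For reference, a minimal host-side checking pattern looks something like this (a sketch; the macro name is just a convention, and under CDP the same cudaGetLastError call is also available from device code after a child launch):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap every runtime API call; report the failing file/line and abort.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)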

Thank you so much, that was the problem. Despite my research online, I never found out about this property. Do you know if there is a maximum value it can be set to? I have run some tests but, even with very high numbers, I never got an error. That’s a good thing of course; I’m just curious about the limitations.

You are right, sorry for not including error checking. I did include it in my main code and then wrote the version I posted here, where I skipped it because I didn’t think it was important, as my goal was just to reproduce my main problem at a smaller scale. One thing I didn’t know, and learnt from your link, is to call cudaGetLastError after a kernel launch. For some reason I was doing error checking everywhere else, but not for kernel launches.
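Concretely, the pattern I was missing after a launch is (a minimal sketch, applied to the code above):

outer_convolution<<<grid1, block1>>>(d_input_array, d_kernel_array, d_output,
                                     d_support_indices, n_input_array, n_kernel_array);
cudaError_t launch_err = cudaGetLastError();    // reports launch-time errors
if (launch_err != cudaSuccess)
{
    printf("launch failed: %s\n", cudaGetErrorString(launch_err));
}
cudaError_t sync_err = cudaDeviceSynchronize(); // reports errors during execution
if (sync_err != cudaSuccess)
{
    printf("execution failed: %s\n", cudaGetErrorString(sync_err));
}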
