Hi everyone. First of all, here is my code:
#include <cstdio>
// Child kernel: each thread ANDs one input element against one mask element;
// on a match it clears the output cell for this support index.
__global__ void inner_convolution(bool* input, bool* kernel, bool* output, int support_index, int n_kernel)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n_kernel) return;
    if (input[support_index + idx] && kernel[idx])
    {
        output[support_index] = false;
    }
}
// Parent kernel: each thread stores its own index and launches a child grid
// that applies the mask starting at that index (dynamic parallelism).
__global__ void outer_convolution(bool* input, bool* kernel, bool* output, int* support_indices, int n_input, int n_kernel)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n_input - n_kernel + 1) return;
    support_indices[idx] = idx;
    dim3 block(128);
    dim3 grid((n_kernel + block.x - 1) / block.x); // the child grid only needs n_kernel threads
    inner_convolution<<<grid, block>>>(input, kernel, output, support_indices[idx], n_kernel);
}
int main()
{
    int n_input_array = 100000;
    int n_kernel_array = 400;

    // Build the host-side input and kernel arrays.
    bool* input_array = (bool*)malloc(sizeof(bool) * n_input_array);
    for (int i = 0; i < n_input_array; i++)
    {
        input_array[i] = ((i + 39) / 5) % 2 == 0 ? false : true;
    }
    bool* kernel_array = (bool*)malloc(sizeof(bool) * n_kernel_array);
    for (int i = 0; i < n_kernel_array; i++)
    {
        kernel_array[i] = ((i % 777) == 0);
    }

    // Copy the inputs to device global memory.
    bool* d_input_array;
    bool* d_kernel_array;
    cudaMalloc((bool**)&d_input_array, sizeof(bool) * n_input_array);
    cudaMalloc((bool**)&d_kernel_array, sizeof(bool) * n_kernel_array);
    cudaMemcpy(d_input_array, input_array, sizeof(bool) * n_input_array, cudaMemcpyHostToDevice);
    cudaMemcpy(d_kernel_array, kernel_array, sizeof(bool) * n_kernel_array, cudaMemcpyHostToDevice);

    // Allocate the support indices and the output, initialized to all-true.
    int* d_support_indices;
    bool* d_output;
    cudaMalloc((int**)&d_support_indices, sizeof(int) * (n_input_array - n_kernel_array + 1));
    cudaMalloc((bool**)&d_output, sizeof(bool) * (n_input_array - n_kernel_array + 1));
    cudaMemset(d_output, true, sizeof(bool) * (n_input_array - n_kernel_array + 1));

    dim3 block1(128);
    dim3 grid1(((n_input_array - n_kernel_array + 1) + block1.x - 1) / block1.x);
    outer_convolution<<<grid1, block1>>>(d_input_array, d_kernel_array, d_output, d_support_indices, n_input_array, n_kernel_array);
    cudaDeviceSynchronize();

    // Copy the result back and release device memory.
    bool* output = (bool*)malloc(sizeof(bool) * (n_input_array - n_kernel_array + 1));
    cudaMemcpy(output, d_output, sizeof(bool) * (n_input_array - n_kernel_array + 1), cudaMemcpyDeviceToHost);
    cudaFree(d_input_array);
    cudaFree(d_kernel_array);
    cudaFree(d_support_indices);
    cudaFree(d_output);

    for (int i = 0; i < n_input_array - n_kernel_array + 1; i++)
    {
        printf("%d ", output[i]);
    }
    printf("\n");

    free(input_array);
    free(kernel_array);
    free(output);
    return 0;
}
The bool data type is available because this is C++ code. The code itself performs a kind of convolution:
- An array of boolean values, input_array, is given, together with a kernel, kernel_array; they hold n_input_array and n_kernel_array elements respectively.
- The data is copied to device global memory, and a first kernel, “outer_convolution”, is invoked. The only job of each thread of this first kernel is to “forward” its own index to a child grid that it creates, by storing that index in the corresponding cell of “support_indices”.
- Each child thread then applies the kernel mask to a batch of consecutive elements of the input array. The starting index is retrieved from what the parent thread forwarded to the grid. The masking works as follows: each child thread of the grid launched by a parent thread computes the logical AND between one element of the input array and one element of the mask. If both are true, a specific element of the output array, shared by all the child threads of that grid, is directly set to false. This means that different threads access the same data concurrently, which can cause race conditions. Or does it really? At the very beginning, host-side, the array d_output, which will hold the results of the convolution, is initialized entirely to true. The only thing the threads can ever do is set a specific element of that output array to false, never to true. So in theory, even with concurrent access, no race condition should matter: either no thread wants to set the value at a given location of d_output to false, or one or more threads want to set it to false concurrently. In the first case the data is left untouched; in the second, it ends up false either way.
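For reference, this is the sequential result I would expect the two kernels to reproduce (a hedged sketch of my understanding; convolve_reference is just a name I made up for this post, not part of the code above):

```cpp
#include <cassert>
#include <vector>

// Serial reference: output[i] stays true unless some mask element
// overlaps a true input element when the mask is placed at offset i.
std::vector<bool> convolve_reference(const std::vector<bool>& input,
                                     const std::vector<bool>& kernel)
{
    int n_out = (int)input.size() - (int)kernel.size() + 1;
    std::vector<bool> output(n_out, true);
    for (int i = 0; i < n_out; i++)
        for (int j = 0; j < (int)kernel.size(); j++)
            if (input[i + j] && kernel[j])
                output[i] = false; // only ever cleared, never set back to true
    return output;
}
```

For example, with input = {true, false, true, false, true} and kernel = {true, false}, only the first mask element is true, so the result is {false, true, false, true}.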
At least, this is the behaviour I would expect. But if you run this program twice and compare the outputs, you will see that they differ. My question is: how can that be? I get that concurrent access to the same data address is bad, and that is exactly what I am doing here. But I am not letting different threads concurrently set the same shared variable sometimes to true and sometimes to false; they only ever set it to false. So I would think that, in this specific case, the concurrent access would not be much of a problem, but apparently it is.
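To make the assumption explicit, here is a hypothetical host-side analogue of the write pattern I am relying on (using std::atomic so that, unlike a plain bool, the concurrent stores are well-defined in C++): every writer stores the same value, so the final result should not depend on the interleaving.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Host-side analogue: n_writers threads all store the same value (false)
// into one shared cell that starts out true. Since no thread ever stores
// true, the final value cannot depend on the interleaving of the writes.
bool race_to_false(int n_writers)
{
    std::atomic<bool> cell(true);
    std::vector<std::thread> writers;
    for (int t = 0; t < n_writers; t++)
        writers.emplace_back([&cell] { cell.store(false, std::memory_order_relaxed); });
    for (std::thread& w : writers)
        w.join();
    return cell.load();
}
```

Here race_to_false(8) returns false on every run, which is exactly the determinism I expected from the CUDA version.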
Any help would be appreciated: am I doing something wrong? Why does the data seemingly get corrupted? What happens under the hood here?