Why does this implementation of argmax with Numba CUDA return the wrong result 0.01% of the time?

5xx7xx · August 11, 2023, 2:10pm

I’ve made a simple kernel implementing argmax using Numba (code below). It returns the correct result about 99.99% of the time, and a wrong one about 0.01% of the time.
I’ve tried to debug this, and the closest I got is that sometimes what should be the max value gets overwritten by a thread that definitely should not be writing anything at that moment. That would be explainable if syncthreads didn’t work, but I think my usage of it is correct: I call syncthreads before and after any reads and writes to shared memory, and none of the calls are inside conditionals that could resolve differently for different threads. What am I missing?
Code:

from numba import cuda
import math

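@cuda.jit
def argmax2(arr, out):
    n_actual_threads = cuda.blockDim.x
    max_virtual_threads = len(arr)
    # Dynamic shared memory: size 0 here, the byte count is supplied at launch.
    # Both arrays are views of the same buffer, so the values use the first
    # len(arr) 4-byte slots and the indices use the slots after them.
    argmax_value_scratchpad = cuda.shared.array(0, dtype='float32')[:max_virtual_threads]
    argmax_index_scratchpad = cuda.shared.array(0, dtype='int32')[max_virtual_threads:]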
    n_virtual_threads = max_virtual_threads
    for thread_batch in range(math.ceil(n_virtual_threads / n_actual_threads)):
        thread_idx = thread_batch * n_actual_threads + cuda.threadIdx.x
        if thread_idx < n_virtual_threads:
            argmax_value_scratchpad[thread_idx] = arr[thread_idx]
            argmax_index_scratchpad[thread_idx] = thread_idx
    cuda.syncthreads()
    scratchpad_length = len(arr)
    while scratchpad_length > 1:
        new_length = math.ceil(scratchpad_length / 2)
        n_virtual_threads = new_length
        for thread_batch in range(math.ceil(n_virtual_threads / n_actual_threads)):
            thread_idx = thread_batch * n_actual_threads + cuda.threadIdx.x
            cuda.syncthreads()
            if thread_idx < n_virtual_threads:
                a = argmax_value_scratchpad[thread_idx*2]
                idx_a = argmax_index_scratchpad[thread_idx*2]
                b = argmax_value_scratchpad[thread_idx*2+1]
                idx_b = argmax_index_scratchpad[thread_idx*2+1]
                if scratchpad_length % 2 == 1 and thread_idx == new_length - 1:
                    # special case if the length is odd. ceil already added one more thread for it
                    # a is already set correctly for it too
                    # but b is missing so we gotta set it to the first value in the array
                    b = argmax_value_scratchpad[0]
                    idx_b = argmax_index_scratchpad[0]
                if a > b:
                    winning_value = a
                    winning_index = idx_a
                else:
                    winning_value = b
                    winning_index = idx_b
            cuda.syncthreads()
            if thread_idx < new_length:
                argmax_value_scratchpad[thread_idx] = winning_value
                argmax_index_scratchpad[thread_idx] = winning_index
            cuda.syncthreads()
        scratchpad_length = new_length

    cuda.syncthreads()
    out[0] = argmax_index_scratchpad[0]

Robert_Crovella · August 11, 2023, 3:52pm

I guess your intent is to run this code with only a single threadblock? It would probably help if you gave a complete example showing the kernel launch, stated the data set size you are passing, and indicated how to reproduce a failure.
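A reproducible report of the kind asked for here could be packaged roughly as follows. This is only a sketch in plain Python: `find_first_mismatch`, `reference_argmax`, and `always_zero` are hypothetical names that do not appear in the thread, and `argmax_under_test` stands in for whatever host-side wrapper launches the kernel and reads back `out[0]`. Seeding each trial means a failing input can be regenerated exactly.

```python
import random


def reference_argmax(values):
    # CPU reference: index of the first element holding the maximum value
    return max(range(len(values)), key=values.__getitem__)


def find_first_mismatch(argmax_under_test, n_items, n_trials, base_seed=0):
    """Return (seed, values, got, expected) for the first disagreement, or None."""
    for trial in range(n_trials):
        seed = base_seed + trial
        rng = random.Random(seed)  # seeded, so a failing input can be replayed
        values = [rng.random() for _ in range(n_items)]
        got = argmax_under_test(values)
        expected = reference_argmax(values)
        if got != expected:
            return seed, values, got, expected
    return None


def always_zero(values):
    # Hypothetical, deliberately wrong stand-in, used only to show the report
    return 0


report = find_first_mismatch(always_zero, n_items=8, n_trials=50)
if report is not None:
    seed, values, got, expected = report
    print(f"seed={seed} got={got} expected={expected}")
```

For the posted kernel, the wrapper would presumably launch a single block and pass the dynamic shared memory size in bytes as the fourth launch-configuration item (8 bytes per element here: one float32 plus one int32), since the kernel declares its shared arrays with size 0.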

5xx7xx · August 13, 2023, 7:48pm

Yeah, actually, your comment pointed me in the right direction. The code works 100% of the time; it’s my testing method that fails 0.01% of the time. My sample size was big enough that sometimes two or more random floats shared the exact same max value, so the two argmaxes gave two different, but both correct, answers. Sorry to have wasted space on this forum for something so obvious. You can lock/delete this thread.
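A tie-tolerant check avoids this kind of false alarm by comparing the value at the returned index rather than the index itself. The sketch below is plain Python with hypothetical helper names (`argmax_first`, `argmax_last`, `is_valid_argmax`); it assumes the reference was a first-occurrence argmax, which the thread does not state. In the posted kernel the comparison is `if a > b`, so ties go to `b` and the result need not be the first occurrence.

```python
def argmax_first(values):
    """Index of the first occurrence of the maximum."""
    best = 0
    for i in range(1, len(values)):
        if values[i] > values[best]:  # strict: the earlier index wins ties
            best = i
    return best


def argmax_last(values):
    """Index of the last occurrence of the maximum."""
    best = 0
    for i in range(1, len(values)):
        if values[i] >= values[best]:  # non-strict: the later index wins ties
            best = i
    return best


def is_valid_argmax(values, index):
    """Tie-tolerant check: accept any index that holds the maximum value."""
    return 0 <= index < len(values) and values[index] == max(values)


data = [0.25, 0.75, 0.5, 0.75]  # the maximum 0.75 appears twice
first = argmax_first(data)       # 1
last = argmax_last(data)         # 3

# The indices differ, yet both pass the value-based check
print(first != last, is_valid_argmax(data, first), is_valid_argmax(data, last))
```

Comparing indices directly is only safe when the inputs are known to have a unique maximum or when both implementations break ties the same way.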
