Garbage Values Read From Array When Introducing Additional Complexity

Hi all,

I’m coding my first CUDA application. The idea is that I have 5 slot reel strips and I am brute-forcing every possible combination of stops and summing the pays to determine the overall expected return. In the example code my reel strip lengths are 100, 100, 10, 100, and 100, which comes to 1 billion combinations to evaluate. The main work happens in the evaluateCombo function in kernel.cu. Everything works as expected when I evaluate each of the 1 billion combinations.
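
To spell out the indexing: each combination index is decoded into one stop per reel, mixed-radix style, with the reel lengths as the bases. This is the same decode that appears in evaluateCombo below; combo_idx and stops are just illustrative names here:

// Sketch of the decode only -- the real version lives in evaluateCombo below.
long long t = combo_idx;                      // combo_idx in [0, 1,000,000,000)
for (int reel = 0; reel < NUM_REELS; ++reel) {
    stops[reel] = (int)(t % reelSize[reel]);  // stop position on this reel
    t /= reelSize[reel];                      // move on to the next "digit"
}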

I then introduced an extra layer of complexity: for each of the 1 billion combinations I now evaluate 30 lines around that set of stops*, similar to a real slot machine, and this is where my code starts to break down and return garbage values. This part is confusing to me because the extra logic is not very complicated. The only difference is doing the same computations 30 times, once per line, instead of just once.

*This added complexity is the loop for (int iLine = 0; iLine < NUM_LINES; iLine++). With everything else the same, my code works fine when I remove the loop and the lookups into lineset.

I noticed that the lower I set the number of lines to loop through, the higher the chance of the run completing; however, I don’t always get the same result when running the same configuration multiple times. I also noticed that if I hard-code the payout, the overall payout reflects that hard-coded payout accurately, which tells me that my computation of temp_idx is working and that the breakage may occur when fetching the payout from payout_table. I’m not sure how that could happen, since we are simply indexing into a 2D array to grab a value. My best guess is that I’m encountering some sort of race condition or memory issue that becomes more apparent as I add complexity. Based on my code below, is there any glaring issue that stands out?
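
To be concrete, the lookup I’m suspicious of is just the following (a trimmed-down sketch of the relevant lines from the kernel below); as far as I can tell, both indices should always be in range:

// symbol comes straight out of reels[][], so it is always 0..NUM_SYMBOLS-1,
// and consecutive_count is 1..NUM_REELS, so consecutive_count - 1 is 0..NUM_REELS-1.
// Both indices should therefore stay inside payout_table[NUM_SYMBOLS][NUM_REELS].
int payout = payout_table[symbol][consecutive_count - 1];
if (payout > 0) {
    batch_payout += payout;
    batch_hits++;
}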

My project isn’t too big; there’s just a main.cpp and a kernel.cu. Please let me know if there’s any additional clarification I can provide. Thank you for any help.

main.cpp:

#include <iostream>
#include <climits>   // INT_MAX

// Function declaration for the CUDA slot simulation
extern "C" void runDoCycle(long long num_batches, long long batch_size, long long* h_payouts_accumulated, long long* h_hits_accumulated, int* h_reelSize, int offset);

int main() {
    const long long bet_amount = 50;

    long long total_payout = 0;
    long long total_hits = 0;

    int reelSize[5] = { 100, 100, 10, 100, 100 };

    long long total_combinations = static_cast<long long>(reelSize[0]) * reelSize[1] * reelSize[2] * reelSize[3] * reelSize[4];

    int num_rounds = (total_combinations / INT_MAX) + 1;

    long long num_batches = 20000;

    if (total_combinations < num_batches)
        num_batches = total_combinations;


    const long long batch_size = (total_combinations / num_batches / num_rounds);  // Adjust the batch size based on available memory

    for (int iRound = 0; iRound < num_rounds; iRound++) {

        // Allocate host memory for the payout results and hit results
        long long* h_payouts_accumulated = new long long[num_batches];
        long long* h_hits_accumulated = new long long[num_batches];

        // Run the CUDA simulation
        runDoCycle(num_batches, batch_size, h_payouts_accumulated, h_hits_accumulated, reelSize, iRound);

        // Calculate the total payout, total hits, and track specific symbol combinations
        for (long long i = 0; i < num_batches; i++) {
            total_payout += h_payouts_accumulated[i];
            total_hits += h_hits_accumulated[i];
        }

        // Free host memory
        delete[] h_payouts_accumulated;
        delete[] h_hits_accumulated;
    }

    // Calculate total cost and Expected Value (EV)
    long long total_cost = total_combinations * bet_amount;
    float expected_value = static_cast<float>(total_payout) / total_cost;

    // Display the overall results
    std::cout << "Cycle Size: " << total_combinations << std::endl;
    std::cout << "Total Payout: " << total_payout << std::endl;
    std::cout << "Total Cost: " << total_cost << std::endl;
    std::cout << "Expected Value (EV): " << expected_value << std::endl;
    std::cout << "Total Hits: " << total_hits << std::endl;

    return 0;
}

kernel.cu:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <curand_kernel.h>
#include <stdio.h>

#define THREADS_PER_BLOCK 256
#define NUM_REELS 5
#define NUM_SYMBOLS 3
#define NUM_LINES 30

__device__ int lineset[NUM_LINES][NUM_REELS] = {
    {1,1,1,1,1}, //0
    {0,0,0,0,0},
    {2,2,2,2,2},
    {0,1,2,1,0},
    {2,1,0,1,2},

    {1,0,1,2,1}, //5
    {1,2,1,0,1},
    {0,0,1,2,2},
    {2,2,1,0,0},
    {0,1,0,1,0},

    {2,1,2,1,2}, //10
    {1,0,0,0,1},
    {1,2,2,2,1},
    {0,1,1,1,0},
    {2,1,1,1,2},

    {1,1,0,1,1}, //15
    {1,1,2,1,1},
    {1,0,1,0,1},
    {1,2,1,2,1},
    {0,0,0,1,2},

    {2,2,2,1,0}, //20
    {0,1,2,2,2},
    {2,1,0,0,0},
    {1,1,1,0,1},
    {1,1,1,2,1},

    {0,1,1,1,2}, //25
    {2,1,1,1,0},
    {2,2,1,0,1},
    {0,0,1,2,1},
    {2,1,2,2,2}
};

// Define symbol payouts: each row is for a specific symbol (0, 1, 2),
// and columns represent payouts for 1 through 5 in a row
__device__ int payout_table[NUM_SYMBOLS][NUM_REELS] = {
    {0, 0, 0, 0, 1000},  // Symbol 0: 5 in a row pays 1000
    {0, 0, 0, 0, 750},   // Symbol 1: 5 in a row pays 750
    {0, 0, 0, 0, 500}    // Symbol 2: 5 in a row pays 500
};

// Reel strips (2D array)
__device__ int reels[NUM_REELS][100] = {
    {0, 1, 2, 0, 1, 2, 0, 1, 2, 0,0, 1, 2, 0, 1, 2, 0, 1, 2, 0,0, 1, 2, 0, 1, 2, 0, 1, 2, 0,0, 1, 2, 0, 1, 2, 0, 1, 2, 0,0, 1, 2, 0, 1, 2, 0, 1, 2, 0,0, 1, 2, 0, 1, 2, 0, 1, 2, 0,0, 1, 2, 0, 1, 2, 0, 1, 2, 0,0, 1, 2, 0, 1, 2, 0, 1, 2, 0,0, 1, 2, 0, 1, 2, 0, 1, 2, 0,0, 1, 2, 0, 1, 2, 0, 1, 2, 0}, // Reel 1
    {1, 2, 0, 1, 2, 0, 1, 2, 0, 1,1, 2, 0, 1, 2, 0, 1, 2, 0, 1,1, 2, 0, 1, 2, 0, 1, 2, 0, 1,1, 2, 0, 1, 2, 0, 1, 2, 0, 1,1, 2, 0, 1, 2, 0, 1, 2, 0, 1,1, 2, 0, 1, 2, 0, 1, 2, 0, 1,1, 2, 0, 1, 2, 0, 1, 2, 0, 1,1, 2, 0, 1, 2, 0, 1, 2, 0, 1,1, 2, 0, 1, 2, 0, 1, 2, 0, 1,1, 2, 0, 1, 2, 0, 1, 2, 0, 1}, // Reel 2
    {2, 0, 1, 2, 0, 1, 2, 0, 1, 2},  // Reel 3
    {0, 1, 2, 0, 1, 2, 0, 1, 2, 0,0, 1, 2, 0, 1, 2, 0, 1, 2, 0,0, 1, 2, 0, 1, 2, 0, 1, 2, 0,0, 1, 2, 0, 1, 2, 0, 1, 2, 0,0, 1, 2, 0, 1, 2, 0, 1, 2, 0,0, 1, 2, 0, 1, 2, 0, 1, 2, 0,0, 1, 2, 0, 1, 2, 0, 1, 2, 0,0, 1, 2, 0, 1, 2, 0, 1, 2, 0,0, 1, 2, 0, 1, 2, 0, 1, 2, 0,0, 1, 2, 0, 1, 2, 0, 1, 2, 0}, // Reel 4
    {0, 1, 2, 0, 1, 2, 0, 1, 2, 0,0, 1, 2, 0, 1, 2, 0, 1, 2, 0,0, 1, 2, 0, 1, 2, 0, 1, 2, 0,0, 1, 2, 0, 1, 2, 0, 1, 2, 0,0, 1, 2, 0, 1, 2, 0, 1, 2, 0,0, 1, 2, 0, 1, 2, 0, 1, 2, 0,0, 1, 2, 0, 1, 2, 0, 1, 2, 0,0, 1, 2, 0, 1, 2, 0, 1, 2, 0,0, 1, 2, 0, 1, 2, 0, 1, 2, 0,0, 1, 2, 0, 1, 2, 0, 1, 2, 0} // Reel 5
};

__global__ void evaluateCombo(long long num_batches, long long batch_size, long long* payouts_accumulated, long long* hits_accumulated, int* reelSize, int offset) {
    long long idx = blockIdx.x * blockDim.x + threadIdx.x;

    if (idx < num_batches) {
        int symbols[NUM_REELS];
        long long batch_payout = 0;
        long long batch_hits = 0;

        for (long long i = 0; i < batch_size; i++) {
            long long temp_idx = ((batch_size * idx) + i) + (offset * batch_size * num_batches);

            if (temp_idx >= 1000000000)
                break;

            for (int iLine = 0; iLine < NUM_LINES; iLine++) {
                long long temp_line_idx = temp_idx;

                // Calculate the symbol on each reel dynamically
                for (int reel = 0; reel < NUM_REELS; ++reel) {
                    symbols[reel] = reels[reel][(temp_line_idx + lineset[iLine][reel]) % reelSize[reel]];
                    temp_line_idx /= reelSize[reel];
                }

                // Assume all reels have the same symbol initially
                int symbol = symbols[0];
                int consecutive_count = 1;

                // Count consecutive reels showing the same symbol
                for (int j = 1; j < NUM_REELS; j++) {
                    if (symbols[j] == symbol) {
                        consecutive_count++;
                    }
                    else {
                        break;  // Stop counting if symbols don't match consecutively
                    }
                }

                // Retrieve payout based on symbol and consecutive match count
                // Make sure consecutive_count does not exceed the number of columns in payout_table
               
                int payout = payout_table[symbol][consecutive_count - 1];
                //int payout = 10;

                if (payout > 0) {
                    batch_payout += payout;
                    batch_hits++;
                }
            }
        }

        payouts_accumulated[idx] = batch_payout;
        hits_accumulated[idx] = batch_hits;
    }
}

extern "C" void runDoCycle(long long num_batches, long long batch_size, long long* h_payouts_accumulated, long long* h_hits_accumulated, int* h_reelSize, int offset) {
    long long* d_payouts_accumulated;
    long long* d_hits_accumulated;
    int* d_reelSize;

    cudaMalloc(&d_payouts_accumulated, num_batches * sizeof(long long));
    cudaMalloc(&d_hits_accumulated, num_batches * sizeof(long long));
    cudaMalloc(&d_reelSize, NUM_REELS * sizeof(int));

    cudaMemcpy(d_reelSize, h_reelSize, NUM_REELS * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_payouts_accumulated, h_payouts_accumulated, num_batches * sizeof(long long), cudaMemcpyHostToDevice);
    cudaMemcpy(d_hits_accumulated, h_hits_accumulated, num_batches * sizeof(long long), cudaMemcpyHostToDevice);

    int blocks = (num_batches + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
    evaluateCombo<<<blocks, THREADS_PER_BLOCK>>>(num_batches, batch_size, d_payouts_accumulated, d_hits_accumulated, d_reelSize, offset);

    cudaMemcpy(h_payouts_accumulated, d_payouts_accumulated, num_batches * sizeof(long long), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_hits_accumulated, d_hits_accumulated, num_batches * sizeof(long long), cudaMemcpyDeviceToHost);

    cudaFree(d_payouts_accumulated);
    cudaFree(d_hits_accumulated);
    cudaFree(d_reelSize);
}
  1. I guess you are on Windows. Possibly a kernel timeout situation?

  2. If you instrument your code with proper CUDA error checking, are any errors reported?

  3. Are you compiling a debug project or are you compiling a release project?

  4. Is it obvious to tell what is a correct result and what is an incorrect result?

  5. When I run your code as posted, I get the following output:

Cycle Size: 1000000000
Total Payout: 294300000000
Total Cost: 50000000000
Expected Value (EV): 5.886
Total Hits: 367200000

Is that an incorrect output? If so, what would the correct output be?

  6. Suppose I wanted to eliminate your “added complexity”. Could that be done by setting NUM_LINES to 1?

Thank you for the quick reply!

  1. Yes, I’m specifically running it using Visual Studio 2022. I don’t think it’s a timeout issue, as the code always executes; it’s just that I get garbage values as results. That said, I’m not very experienced here, so I may not be diagnosing the behavior correctly.
  2. I tried some error checking but wasn’t sure if it was implemented properly, so I deleted it. The farthest I got with it was an unknown error returned on the first cudaMalloc line, regardless of which of the three cudaMallocs came first. I can look into this more if my responses to the subsequent questions lead to a dead end.
  3. Release
  4. and 5. Yes, the output you’re getting is correct! When I run the same code I get garbage values.
  6. Logically speaking, yes. However, to fully revert to my last working state you would remove that loop and replace the contents of the int reel = 0; reel < NUM_REELS loop with:

symbols[reel] = reels[reel][temp_idx % reelSize[reel]];
temp_idx /= reelSize[reel];
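
So the fully reverted inner loop (my last working state) looks roughly like this:

// Last working state (sketch): no per-line loop, just decode temp_idx directly
for (int reel = 0; reel < NUM_REELS; ++reel) {
    symbols[reel] = reels[reel][temp_idx % reelSize[reel]];
    temp_idx /= reelSize[reel];
}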

Also, I’m running on a Quadro M4000 GPU.

There is no point in continuing execution of a program after a failed memory allocation. You would want to sort out the allocation failure(s) before looking at anything else. This advice applies entirely independently of the use of CUDA.


I’ve added the following error-checking macro to my files:

#define CUDA_CHECK_ERROR(call) do { \
    cudaError_t err = call; \
    if (err != cudaSuccess) { \
        printf("CUDA error at %s:%d - %s\n", __FILE__, __LINE__, cudaGetErrorString(err)); \
        return; \
    } \
} while (0)

Then I updated my runDoCycle code to the below, doing an error check on the first cudaMalloc and calling cudaPeekAtLastError and cudaDeviceSynchronize after the kernel.

extern "C" void runDoCycle(long long num_batches, long long batch_size, long long* h_payouts_accumulated, long long* h_hits_accumulated, int* h_reelSize, int offset) {
    long long* d_payouts_accumulated;
    long long* d_hits_accumulated;
    int* d_reelSize;

    
    CUDA_CHECK_ERROR(cudaMalloc(&d_payouts_accumulated, num_batches * sizeof(long long)));
    cudaMalloc(&d_hits_accumulated, num_batches * sizeof(long long));
    cudaMalloc(&d_reelSize, NUM_REELS * sizeof(int));

    cudaMemcpy(d_reelSize, h_reelSize, NUM_REELS * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_payouts_accumulated, h_payouts_accumulated, num_batches * sizeof(long long), cudaMemcpyHostToDevice);
    cudaMemcpy(d_hits_accumulated, h_hits_accumulated, num_batches * sizeof(long long), cudaMemcpyHostToDevice);

    int blocks = (num_batches + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
    evaluateCombo<<<blocks, THREADS_PER_BLOCK>>>(num_batches, batch_size, d_payouts_accumulated, d_hits_accumulated, d_reelSize, offset);

    CUDA_CHECK_ERROR(cudaPeekAtLastError());
    CUDA_CHECK_ERROR(cudaDeviceSynchronize());

    cudaMemcpy(h_payouts_accumulated, d_payouts_accumulated, num_batches * sizeof(long long), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_hits_accumulated, d_hits_accumulated, num_batches * sizeof(long long), cudaMemcpyDeviceToHost);

    cudaFree(d_payouts_accumulated);
    cudaFree(d_hits_accumulated);
    cudaFree(d_reelSize);
}

When I run with these changes I get this output:
CUDA error at C:\Users\*user*\source\repos\CudaDoCycle\CudaDoCycle\kernel.cu:152 - unknown error
Cycle Size: 1000000000
Total Payout: 408284517997793
Total Cost: 50000000000
Expected Value (EV): 8165.69
Total Hits: 210894013195140

Line 152 is the cudaDeviceSynchronize call, so it appears that the extra layer of complexity is causing a device synchronization issue? I’m not sure how to debug it, since the error is unknown.

When I re-run with NUM_LINES as 1, I get the expected output with no error, so the error appears to be introduced when I add that layer of complexity.

That has the typical signature of a kernel timeout. I’ve already run your code with compute-sanitizer, which reports no issues in my setup (which is not subject to kernel duration limits). And you’ve already indicated that the result my setup produces with your originally posted code is the correct one.

Traditionally, kernel timeouts on Windows started to become an issue when kernel durations hit the 2-second boundary. I note that the overall runtime of the application on my GPU (L4, Linux, CUDA 12.2) is ~3 seconds. Your M4000 is a Maxwell-generation GPU. The details are not well defined, but I think preemption support did not reach its full operational potential until Pascal GPUs.

Adding the “complexity” increases the kernel duration. You can find additional information about the Windows TDR mechanism with a Google search or here.
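
If you need to stay on this GPU/OS combination, one common workaround (besides adjusting the TDR settings) is to split the cycle across more, shorter kernel launches so that no single launch approaches the watchdog limit, accumulating the partial results on the host between launches. That is essentially what your existing iRound loop in main() already does for larger cycles, just with coarser chunks. A rough host-side sketch, reusing your kernel’s existing batch_size and offset parameters (num_slices and slice_size are illustrative names and would need tuning):

// Illustrative sketch only: each launch covers slice_size * num_batches indices,
// starting at slice * slice_size * num_batches, which is exactly how the kernel
// already interprets its batch_size and offset arguments.
const int num_slices = 8;   // tune so each individual launch stays well under ~2 s
const long long slice_size = total_combinations / num_batches / num_slices;
for (int slice = 0; slice < num_slices; ++slice) {
    evaluateCombo<<<blocks, THREADS_PER_BLOCK>>>(num_batches, slice_size,
                                                 d_payouts_accumulated, d_hits_accumulated,
                                                 d_reelSize, slice);
    cudaDeviceSynchronize();                       // each short launch completes before the next
    cudaMemcpy(h_payouts_accumulated, d_payouts_accumulated,
               num_batches * sizeof(long long), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_hits_accumulated, d_hits_accumulated,
               num_batches * sizeof(long long), cudaMemcpyDeviceToHost);
    for (long long i = 0; i < num_batches; ++i) {  // accumulate partial results on the host
        total_payout += h_payouts_accumulated[i];
        total_hits   += h_hits_accumulated[i];
    }
}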


Wow, that’s so interesting. I agree, that’s almost certainly what’s happening. Funnily enough, before implementing multiple lines I was experimenting with a 10-billion reel strip cycle and experienced the same problem. I added that outer “Round” loop in main, which resolved the issue, although I didn’t know why at the time. Turns out I did exactly what that first bullet point suggested.

I can’t test the theory until tomorrow, but assuming that’s it, I’ll go that route while I look for another GPU that has preemption support.

Thank you both for your quick informative replies and resources.

Hi, as also indicated in Robert’s link, there is also the possibility of changing the timeout duration or turning it off. You could also think about using two graphics cards, one for displaying the user interface and one for computation.

Hi Curefab, thanks for pointing out those additional options! I’ll have plenty of routes to go. I’ve confirmed that it is indeed a kernel timeout issue, as it now works when I do the work in batches. I want to minimize the performance hit from constantly jumping back and forth between host and device, so I’ll be looking into all the options everyone has shared. Thanks for your help.