Optimizing performance of a serial <<<1, 1>>> kernel, after long debugging hours

UPDATE AT THE END ==============================

Good morning, all.

I have a wrapper function that declares 2 cudaMallocManaged variables, which need to be passed by reference to kernel_func_1, as it will populate them, and then kernel_func_2 is called, using the results in these variables.
The wrapper function is defined as:

__declspec(dllexport) void A_Wrapper_Func(void)    // Wrapper function in DLL
    {
    double *scale_factor_1;    // Shared between kernel functions
    size_t *scale_factor_2;    // Shared between kernel functions
    cudaError_t cuda_error;

    cudaMallocManaged(&scale_factor_1, sizeof(double));    // Just 1 element, it is an accumulator
    cudaMallocManaged(&scale_factor_2, sizeof(size_t));    // Just 1 element, it is an accumulator

    // Initializes both variables with 0 and prints errors, if any (but no error is returned)
    if (cudaMemset(scale_factor_1, 0, sizeof(double)) != cudaSuccess)
        {
        cuda_error = cudaGetLastError();
        cudaGetErrorString(cuda_error);
        std::cout << cuda_error << std::endl;
        }

    if (cudaMemset(scale_factor_2, 0, sizeof(size_t)) != cudaSuccess)
        {
        cuda_error = cudaGetLastError();
        cudaGetErrorString(cuda_error);
        std::cout << cuda_error << std::endl;
        }

    // Prints the sizes of the variables before kernel call, returns 8 and 8 as expected
    std::cout << "Sizes before kernel: " << sizeof(scale_factor_1[0]) << " " << sizeof(scale_factor_2[0]) << std::endl;

    // Serial part of the computation, populates scale factors 1 and 2
    kernel_func_1 <<< 1, 1 >>> (array_1, scale_factor_1, scale_factor_2, array_length);
    cudaDeviceSynchronize();

    // Parallel part of the computation (scaling array_1 and writing to array_2)
    kernel_func_2 <<< 200, 256 >>> (array_1, array_2, scale_factor_1, scale_factor_2, array_length);
    cudaDeviceSynchronize();

    cudaFree(scale_factor_1);
    cudaFree(scale_factor_2);
    }

The kernel functions are declared as:

__global__ void kernel_func_1(float *array_1, double *scale_fact_1, size_t *scale_fact_2, size_t array_len);
__global__ void kernel_func_2(float *array_1, float *array_2, double *scale_fact_1, size_t *scale_fact_2, size_t array_len);

Function 2 just takes 1 more argument than Function 1, which is a second array where array_1, scaled by factors 1 and 2, is written to.
From inside these kernel functions I do a:

printf("%i %i %i %i\n", sizeof(scale_fact_1[0], scale_fact_2[0], sizeof(double), sizeof(size_t));

It prints 8 0 8 0, when the expected is 8 8 8 8, indicating that the size_t variable as well as sizeof(size_t) are returning 0.
The code will run a couple of times without complaining (but providing incorrect results in array_2, which I am printing somewhere else in the program. As I increase the size of the arrays, an error is caught when allocating them (which I am doing and treating in other part of the program), even if there is more than enough free memory in the device.
If I comment the calls to the kernel functions in the wrapper, I can call it as many times as I want with array sizes as large as +45% (each) of the free device memory. When I uncomment any call to kernel function, then the problems arise.

So my questions are:
1 - What is wrong here? Why do these kernel functions consider a size_t as 0 bytes? Yes, I’ve rebooted the computer to make sure there was no memory address messed up.
2 - When I define a variable with cudaMallocManaged, it is visible/usable by the device so no need to pass anything by reference, correct?
3 - We are stuck with the 1-element array notation to refer to the variable. No way around it?

Thanks a lot.

UPDATE ========================================

I have changed scale_factor_2 to double as well as the parameters for the functions just to see what happens, and it has the same problem. So it is NOT a problem of size_t, as the size of scale_factor_2[0] and double inside the kernel functions is still 0.
There is something wrong with this second scaling factor being passed to the functions, but I am unable to figured out what, as it is just another argument.

UPDATE 2 ========================================

The functions were originally declared as:

__global__ void kernel_func_1(float *array_1, double *scale_fact_1, size_t *scale_fact_2, size_t array_len);
__global__ void kernel_func_2(float *array_1, float *array_2, double *scale_fact_1, size_t *scale_fact_2, size_t array_len);

Then I changed the order of the double and size_t arguments, making them:

__global__ void kernel_func_1(float *array_1, size_t *scale_fact_2, double *scale_fact_1, size_t array_len);
__global__ void kernel_func_2(float *array_1, float *array_2, size_t *scale_fact_2, double *scale_fact_1, size_t array_len);

Now it is the “double *scale_factor_1” that has problems. That is, if inside a kernel I print the sizeof(scale_factor_1[0]), and strangely sizeof(double) too, they come as 0.
So it seems to be third argument is function_1 and forth argument in function_2, regardless of type.
I honestly have no idea of what is happening, as no errors are returned from the CUDA calls or the kernels…

There is a bug in this line. The correct format specifier for size_t, which is what sizeof() returns, is %zu.

Here is a handy overview of printf() format specifiers:

https://en.cppreference.com/w/cpp/io/c/fprintf

Hi, Njuffa. Thanks for your time looking into this.

I have consulted the CPP reference for the formats of these types, however, a %zu in a printf from a kernel function will just print the string “%zu”, as if it didn’t recognize this format, even after including stdio.h (which for the use of size_t is not needed).

However, what could to be causing the strange behavior is most likely one of the 2:

  • An overflow on one of these variables, that I define as managed and serves as an accumulator. Which is kind of strange, because the CPU-version of the computation is identical and the thing seems to be handling fine.
  • A call to std::abs() from inside the first kernel function, which proved to be incredibly expensive (CPU version being much faster), to the point of causing Windows to do that driver recovery and most certainly messing up the running program session. Those cudaDeviceSynhronize() lines after each kernel launch will return error code 4 if I try to run again after a driver recovery.

Just to give you an idea, I can run this operation on arrays of 6GB each on CPU code. It takes time, but it works. The GPU version will fail at about 40MB, where the program takes more than a few seconds to complete and Windows thinks it has to recover the driver. At this point, I don’t really know what happens internally: if the program finishes anyway, or if it stops in the middle of the computation and messes up the variables. So I have to close and open the program again to have things “cleaned”.

I suspect it is this std::abs() call from inside a kernel function. I tried using those accumulators as device, but I didn’t manage. Since these are updated in gigantic loops, I certainly need them in faster storage.

What do you think?

I don’t know what set of format specifiers are supported by CUDA’s device-side printf(). Check the CUDA documentation. In practical terms, CUDA runs on 64-bit platforms where ‘size_t’ is the same as ‘unsigned long long int’, so try %llu instead. The relevant portion of the docs is here:

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#format-specifiers

Your original code was trying to print 64-bit quantities as 32-bit quantities, causing a 0 to be printed every time this grabbed the upper 32-bits of a 64-bit variable.

I didn’t look at the rest of the code. I find debugging my own code annoying enough and try to avoid debugging other people’s code.

Using %llu worked for returning the size of size_t inside the device. Thanks, Njuffa.

However, because I was getting strange values in my output array and the program triggered the driver recovery with very small data load, the first thing I did was to check the scaling factors. Since I was printing them with the wrong format, it led me to think that the device wasn’t properly using the types I defined. Which of course makes no sense.

Now that I found that kernel_function_1 has this super-slow call to std::abs() (at least from the device), making Windows recover the driver during the execution and messing with the calculation, I’m stuck with a misleading title in this topic. What I have to do now is replace this call to std::abs() and learn how to use the faster storage of the device for accumulators.

There is probably plenty of online material on how to access these portions of device memory. So I already want to thank you for your patience and these inputs.

CUDA has an overloaded abs() function that works on integers and floating-point data, maybe try that for device code?

I tried abs() as you suggested and the gains are marginal. I am using C++ high resolution timer around the kernel function and maybe I just need to get over the fact that higher GHz on a CPU, plus its bigger cache, wins this one.

I was also reading some topics on SO you contributed some years ago to see if it is possible to force a variable to be a register. But it seems to be up to the compiler to decide, even if we declare as “register”.

This code will also be run on Linux, where we don’t have the timeout/recovery thing happening.

Any compiler for C/C++/CUDA produced in the past twenty years (or even longer), ignores ‘register’. In general, the CUDA compiler makes very good decisions about which data to place in registers.

I am not surprised by your results at all, as there is no reason that abs() of any kind, however expressed at source level, should be slow on the GPU as it’s actually baked into other instructions as an option in many cases.

I would strongly suggest using the CUDA profiler to help you pinpoint the bottlenecks in your code. Given that ALU operations (both integer and floating-point) are mostly “too cheap to meter” these days, there is a good chance that performance is limited by memory access of some kind or other.

You can edit your topic title.

You should take care when using std::(anything) in kernel code. Only a “small” subset of standard library functionality is specifically available in device code. While undefined usage works in some cases, you may be relying on unsupported behavior:

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#mathematical-functions-appendix
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#standard-library

I don’t actually see std::abs variants listed as supported in device code.

I’m not sure how you decided that the abs function is the super-slow item in your kernel code. I suppose its possible, of course, but at first encounter, it seems unlikely to me, and I would only be able to reach that conclusion with some careful analysis. Simple code-substitution/code-commenting analyses for these types of inquiries are usually error-prone, in my experience.

Good evening, txbob.

I replaced the std::abs() with CUDA’s abs(), and also gave a try to:

for(size_t i = 0; i < array_len; i++)    // Immense loop
    {
    if(array_1[i] < 0)
        scale_factor_1 += (array_1[i] * -1);
    else
        scale_factor_1 += array_1[i];
    // Do more computation below
    }

Also tried the ternary operator ?:, but the results in time taken are essentially the same. As Njuffa said, compilers are very smart today and will most likely do the best optimizations.

When I comment out the accumulators lines, I just get wrong output as array_1 values get multiplied/subtracted by 0 and saved to array_2. However, as Njuffa suggested, a quick run of nvprof on the executable showed that the kernel function in question (runs serial code in configuration <<< 1, 1 >>>) takes 770ms to complete and cudaDeviceSynchronize takes 1,55s. Which is consistent with the result provided by C++ high resolution timer of 2,3s for an operation on an array of 20MB.

For comparison, the CPU version of this code running on a 500MB array takes 2,5s.
In the end, it turns out that it seems to be more a problem of finding the bottleneck and fixing it, just like any other CUDA app. So I need to learn how to interpret nvprof’s output and change this topic to a suitable name.

Oh, and this nvprof warning message, “Found 101 invalid records in the result. This can happen if the device ran out of memory or if a kernel was stopped due to an assertion”… damn this 1080ti working on a 20MB array. :)

As I understand from posts here and SO in general, the time shown in nvprof for cudaDeviceSynchronize() is not that it is taking long to “execute”. It is just waiting for the kernel function to finish. Since I have a slow running kernel, it looks fair enough.
Back to the drawing board…

I assume you are aware that GPUs are designed as highly parallel throughput-optimized processors that require on the order of 10,000 threads to achieve best performance. Running single-threaded on a GPU therefore pretty much never makes sense, nor does comparing the single-thread performance of a GPU to the single-thread performance of a CPU designed to run a dozen threads in latency-optimized fashion.

The way pointers and array references work in CUDA is identical to how pointers and array references work in standard C++.

Good morning, Njuffa, and thanks for your comments and patience.

Yes, I am aware of the highly parallel nature of GPUs, but what I didn’t mention is that I am also after those situations where a CPU is going to be faster and, more importantly, by how much.

I am making a benchmark with various geophysical processes applied to 1D, 2D and 3D data, running on CPU and GPU with user-specified data sizes. So what I need to do now is prepare the most optimized version of these processes to run on the GPU, so the fight is fair, and see how they fare against each other in various scenarios. But it is half a year worth of work ahead. :-)

I am not sure what your specific objective is, but comparing single-threaded performance seems the not to go into the right direction, for any such processing.

A comparison between a SIMD-optimized, multi-threaded CPU code and the equivalent GPU implementation, on the other hand, likely is just what the doctor ordered. Typically, that results in a factor 2x to 10x win for the GPU for performance comparisons at application level, with 5x to 6x speed-up as the average across a multitude of applications.