Trying to take some fundamentally serial code and cuda-ize it. Is there a way?

I didn’t think this kernel would work, and it didn’t. But is there some way to accomplish what I’m trying to do here (basically take some very serial code and cuda-parallelize it)?

__global__ void kernel_binomial_up_down(T_TREE_Args* d_aOpts, real* d_aUP, real* d_aDN)
{
    //trying to cuda-ize this cpu serial code: for each block, create arrays UP[1, u, u^2, ..., u^N] and DN[1, d, d^2, ..., d^N]:
    /*
    UP[ndx_zero] = 1.0;
    DN[ndx_zero] = 1.0;
    for (k = 1; k < bsteps_p_1; k++)
    {
        UP[ndx_zero + k] = u * UP[ndx_zero + k - 1];
        DN[ndx_zero + k] = d * DN[ndx_zero + k - 1];
    }
    for (k = bsteps_p_1; k < jdim; k++)
    {
        UP[ndx_zero + k] = 0.0;
        DN[ndx_zero + k] = 0.0;
    }
    */
    //populate the d_aUP, d_aDN arrays
    const long jsteps_p_1 = 1 + d_aOpts[blockIdx.x].jsteps;
    const long jndx = d_aOpts[blockIdx.x].iea;  //d_aOpts[*].iea is initialized to 0
    const real u = d_aOpts[blockIdx.x].u;
    const real d = d_aOpts[blockIdx.x].d;

    const long block_zero = blockIdx.x * blockDim.x;

    if (jndx == 0)
    {
        d_aUP[block_zero + jndx] = 1.0;
        d_aDN[block_zero + jndx] = 1.0;
    }
    else if(jndx < jsteps_p_1)
    {
        d_aUP[block_zero + jndx] = u * d_aUP[block_zero + jndx - 1];
        d_aDN[block_zero + jndx] = d * d_aDN[block_zero + jndx - 1];
    }
    else
    {
        d_aUP[block_zero + jndx] = 0.0;
        d_aDN[block_zero + jndx] = 0.0;
    }
    //atomicAdd(&d_aOpts[blockIdx.x].iea, 1); //compile error: no instance of overloaded function "atomicAdd" matched the argument list
    d_aOpts[blockIdx.x].iea++;
}

Why are the powers computed by repeated multiplication? If you used an exponentiation function like pow() instead, all powers 1 … n could be computed in parallel.

Because N pow() calls are extremely inefficient compared to N multiplications in any language … and just for fun I tried cuda pow() in a different kernel several days ago and it was even slower than I expected.

How large is n typically? If this can be parallelized across tens of thousands of threads, the “slowness” of pow() should not be an issue. For real == float it might even be limited by memory throughput.

FWIW, no context was provided for this computation. I would re-examine the need to pre-compute and store the powers.

Calculating integer powers with a for loop has a more efficient alternative: e.g. x^18 = x^16 * x^2, and so on.

The trick is to calculate the power-of-two powers and use the binary bit representation of the exponent to choose the needed factors.
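
A minimal sketch of that scheme (a generic helper of my own naming, not anything from the original code):

__host__ __device__ double ipow(double base, unsigned int e)
{
    //exponentiation by squaring: square the base each step and multiply it
    //into the result whenever the corresponding bit of the exponent is set,
    //e.g. x^18 = x^16 * x^2
    double result = 1.0;
    while (e != 0)
    {
        if (e & 1u)          //this bit contributes a factor
            result *= base;
        base *= base;        //base, base^2, base^4, base^8, ...
        e >>= 1;
    }
    return result;
}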

Yes, that’s an issue because the block size N is typically quite small, most commonly 160. But I’m not necessarily opposed to using pow(). And it’s not quite as bad as pow(), since for these integer powers I would precompute log(u) and log(d) and use exp(j*log) rather than pow(). I’m just trying first for the maximum speed theoretically possible. The cpu code is already fast enough; what I’m really trying to do here is avoid 2 big cudaMemcpy() calls. So anything at least as fast as the cpu would be a win.
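
For what it’s worth, a rough, untested sketch of that exp(j*log) idea, keeping the same flat one-block-per-option layout as the kernel above (taking the logs inside the kernel just for brevity; the function and variable names are just for illustration):

__global__ void kernel_powers_via_exp(T_TREE_Args* d_aOpts, real* d_aUP, real* d_aDN)
{
    //one block per option, one thread per power index j
    const long jsteps_p_1 = 1 + d_aOpts[blockIdx.x].jsteps;
    const real log_u = log(d_aOpts[blockIdx.x].u);
    const real log_d = log(d_aOpts[blockIdx.x].d);

    const long block_zero = blockIdx.x * blockDim.x;
    const long j = threadIdx.x;

    if (j < jsteps_p_1)
    {
        //u^j = exp(j * log(u)); no thread depends on a neighbor's result
        d_aUP[block_zero + j] = exp((real)j * log_u);
        d_aDN[block_zero + j] = exp((real)j * log_d);
    }
    else
    {
        d_aUP[block_zero + j] = 0.0;
        d_aDN[block_zero + j] = 0.0;
    }
}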

The storage of the powers is definitely not optional.

Yes I’m aware of the efficient exponentiation algorithm. Again, trying first to get by with the max speed if it’s possible.

Obviously no massive parallelism is exposed and this is therefore somewhat poorly suited to GPUs. However, it can be beneficial to tolerate sub-optimal GPU code to ensure an entire processing pipeline with all its data can remain resident on the GPU.

If real == float you could also try __powf() (note reduced accuracy!). Given that n is around 160, I strongly suspect that this is a double-precision computation, though.

I think you can get close to shared-memory speed when calculating the powers, and afterwards store them in shared memory.

Double is mainly in view; there are also float versions, but they are secondary. And the grid size is not bad: at least several thousand, up to about 20,000.

I am reasonably sure this is what CUDA’s pow(double, int) does internally. At least I distinctly recall that it did that in early versions of CUDA, by bit-wise examination of the exponent. One can do even better than that for some exponents, e.g. a^19:

    t = a * a * a;    // t = a^3
    u = t * t * t;    // u = a^9
    r = a * u * u;    // r = a^19

So back to the original post, is it totally fruitless to try and “trick” cuda somehow to increment the array index serially in the kernel? What happens, exactly, in a given block or warp when the kernel has a line like my d_aOpts[blockIdx.x].iea++; above? Does the value get incremented more or less instantaneously by the block or warp size before I can do anything about it?

d_aOpts[blockIdx.x].iea++

is the same as

d_aOpts[blockIdx.x].iea = d_aOpts[blockIdx.x].iea + 1;

The threads of a warp probably (no guarantee by CUDA) read at the same time and write at the same time, so each warp would increment the value by 1 (as one arbitrary thread is successful in writing).

Different warps may read the updated (by other warps) or the old value.

So I would guess the most likely outcome is that the variable is increased by somewhere between one and the number of warps; it could be anything in that range.

Why not just synchronize over the block (to make sure all threads are finished) and then let one specific thread (threadIdx.x==threadIdx.y==threadIdx.z==0) do the increment?
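
Something like this minimal sketch at the end of the kernel (assuming a 1-D block):

__syncthreads();                //wait until every thread in the block is done
if (threadIdx.x == 0)           //exactly one thread per block does the update
    d_aOpts[blockIdx.x].iea++;  //plain increment, race-free within the block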

Mainly because that throws away the parallelism (and hence the speed) I’m trying to get. The whole intention is to determine if what I’m trying to do (the serial increment) is in any way possible. I don’t think it is, but several times cuda has surprised me to the upside on what is possible, which is why I made the original post.

At least within a warp I would just let one thread do the work. Between warps you can use atomic increments.
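
For example (a sketch that assumes iea has been changed to int, so the atomicAdd(int*, int) overload applies):

const unsigned lane = threadIdx.x & 31;        //lane index within the warp (warp size 32)
if (lane == 0)
    atomicAdd(&d_aOpts[blockIdx.x].iea, 1);    //one atomic per warp instead of one per thread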

Lots of ways to get the job done in acceptable time. I have implemented one of them since I don’t think what I’m trying to do is possible. But I’m still interested in how close I can get to what I really want, since it would potentially have broad application. This link looks interesting on a quick read:

I wish I could use atomicAdd(int*, int) but when I try I get the dreaded compiler error “no instance of overloaded function “atomicAdd” matched the argument list”. I googled this for a half hour yesterday and only got mentally worn out. Do you have any hints on how to get rid of the compiler error?

That’s supported on all CUDA GPUs, currently. The problem is almost certainly that one of your supplied arguments does not match that prototype.

If you want to provide a short, complete, compilable example that demonstrates the compile error you are getting, I think it’s quite likely that someone could sort out what is happening.

Thanks, I’m really glad you posted because I learned something very fundamental: in cuda I can’t just mix and match int and long as I do in C++. When I was fussing with atomicAdd yesterday my variables were longs, so to figure out the error I was naturally trying:

long* ptr;
atomicAdd(ptr, 1);

and getting the compile error. Just now when I finally tried

int* ptr;
atomicAdd(ptr, 1);

of course it compiled fine.

Googling a bit, I gather that in cuda longs and ints can be different sizes depending on the platform. But is it a not-dumb question to ask why there is no atomicAdd overload for (long*, int) or (long*, long)?

CUDA is a dialect of C++. I think currently it supports most of C++17. CUDA also goes through pains to ensure that built-in types are handled exactly the same way in device code as they are handled in host code.

In particular, the type system on 64-bit x86 Windows dictates that sizeof(int) == sizeof(long) == 4, whereas on 64-bit x86 Linux it dictates that sizeof(int) == 4, while sizeof(long) == sizeof(long long) == 8.

The corollary to this is that the use of long should be avoided at all times. In particular, it needs to be avoided for all CUDA-specific functions that require hardware backing and where the hardware supports 4-byte data but not 8-byte data (GPUs are fundamentally 32-bit processors), because that would make those functions inoperable on 64-bit Linux.
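
If a 64-bit counter is really needed, the portable approach is to declare it with a fixed-size type such as unsigned long long, for which an atomicAdd overload does exist. A tiny sketch (names here are just for illustration):

__device__ unsigned long long counter64;   //8 bytes on both 64-bit Windows and 64-bit Linux

__global__ void bump_counter()
{
    //matches the atomicAdd(unsigned long long*, unsigned long long) overload,
    //regardless of what sizeof(long) happens to be on the host platform
    atomicAdd(&counter64, 1ULL);
}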

Thanks for that, it’s great stuff that should be required reading before coders like me are allowed to hit the road with their “cuda license”. I’m going to refactor my current project so that no C++ longs are piped into the cuda module.