Need help optimizing a kernel function - avoid divergence and loops?

I have a fairly simple algorithm that is easy to implement but is proving hard to optimize. The basic idea is that it walks through some number of elements in two different arrays and generates a third array whose ith element is the ith element of one of the two source arrays.

In particular, it goes something like this:

__global__ void MySampling(float* array1,
                           float* array2,
                           int*   Reference,
                           unsigned long seed) // seed for curand_init below
{
        int idx = blockIdx.x * blockDim.x + threadIdx.x;

        int thread_position = *(Reference + idx); // used to access elements from one part of array1, array2

        register float U;       // Random uniform variable
        register int i, base;

        // Draw a uniform random number
        curandStateXORWOW_t state; // Make sure state exists
        curand_init(seed + idx, 0, 0, &state);
        U = curand_uniform(&state);

        base = (U < 0.5f ? 0 : 1);

        if (base == 0)
                *(array1 + pitch*idx) = *(array1 + pitch*thread_position);
        if (base == 1)
                *(array1 + pitch*idx) = *(array2 + pitch*thread_position);

        for (i = 1; i < NUM_IN_CONSTANT_MEMORY; i++)
        {
                U = curand_uniform(&state); // draw a uniform

                // If the uniform < baseline[i-1], switch from copying array1 to array2
                base = (U < baseline[i - 1] ? (1 - base) : base);

                if (base == 0)
                        *(array1 + idx*pitch + i) = *(array1 + thread_position*pitch + i);
                if (base == 1)
                        *(array1 + idx*pitch + i) = *(array2 + thread_position*pitch + i);
        }
}




The variables array1 and array2 are both in global memory (given the scale of the problem, fitting them in other memory types isn’t possible).

The variable “pitch” is an int and baseline is a float array; both are loaded into constant memory. One problem is that the array baseline isn’t all 0.5, so, for example, if element #2 is copied from array1, then element #3 could have a high likelihood of being copied from array1 as well. Otherwise the optimal approach would simply be to draw at random for each element from each array.

As you can see there is a lot of potential for divergence as well, but the constraints placed on baseline don’t suggest an easy way to parallelize that segment of the code.

I’ve tried rewriting the algorithm as a matrix multiplication problem (I could expand on this if it would help), but for the reasons mentioned above it hasn’t improved performance.

Any ideas ???

That’s not how I read the code: it seems to be copying elements from array1 to array1 and from array2 to array2, but not from either of them to any other array.

The line “base = (U < baseline[i - 1] ? (1 - base) : base);” does not really cause divergence; it will be compiled into a conditional assignment instruction. Of the next two ifs, exactly one will always execute, and there might be a way to take advantage of that.

Are you sure that your run time isn’t dominated by calls to curand_init and curand_uniform?

base = (U < 0.5 ? 0:1);

        if (base == 0)

                array1[pitch*idx] = array1[pitch*thread_position];

        if (base == 1)

                array2[pitch*idx] = array2[pitch*thread_position];

can easily be optimized to use just one global memory transaction:

float* array = (U < 0.5 ? array1:array2);

        array[pitch*idx] = array[pitch*thread_position];

By the way, this kernel will not give repeatable results as the order in which blocks are executed is undefined.

You are correct, and actually I just noticed an error in the way I rewrote the kernel to post here. What it actually does is copy elements from array1 and array2 into a different part of array1, so the idea is:

if (base == 0)

                *(array1 + pitch*idx) = *(array1 + pitch*thread_position);

        if (base == 1)

                *(array1 + pitch*idx) = *(array2 + pitch*thread_position);

Sorry about that.

Good point. I’m going to look into this.

Apologies, as I actually made a mistake in typing out the original snippet. What it does is actually:

base = (U < 0.5 ? 0:1);

        if (base == 0)

                array1[pitch*idx] = array1[pitch*thread_position];

        if (base == 1)

                array1[pitch*idx] = array2[pitch*thread_position];

which I think under the modification you note potentially still requires 2 global memory reads.

Hmmm, admittedly I’m confused why that would be when dimBlock and dimGrid are specified (currently

dim3 dimGrid(STARTSIZE/512 + (STARTSIZE % 512 ? 1 : 0));

        dim3 dimBlock(512);


) when calling the kernel. I thought nvcc automatically handled the scheduling?

Even better:

if (U < baseline[i-1])

        base = 1 - base;

float* src = (base == 0) ? array1 : array2;

array1[idx * pitch + i] = src[thread_position * pitch + i];
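Folded back into the loop this way, each iteration does exactly one global read and one global write; a sketch of the resulting loop body (declarations as in the original kernel):

```cuda
for (i = 1; i < NUM_IN_CONSTANT_MEMORY; i++)
{
        U = curand_uniform(&state); // draw a uniform
        if (U < baseline[i - 1])
                base = 1 - base;    // switch source arrays
        // Pick the source pointer once; the store target never diverges.
        float* src = (base == 0) ? array1 : array2;
        array1[idx * pitch + i] = src[thread_position * pitch + i];
}
```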

I think what he means is that the outcome is non-deterministic if the memory region written by the kernel overlaps with the region that is read by the kernel (since you’re using array1 for both purposes).

How many threads per block?
And how many blocks are you running?
And what is the value you are passing to pitch?
And what GPU are you using?

Thanks Hamster!

Hmm, I just tried that, along with removing the uniform random variable draws. It might not be an issue of the data across the two source arrays being located far apart, as the approach described above didn’t really change much. Getting rid of the curand calls takes away about 20% of the GPU run time. However, I’ll still need to generate these uniform random variables in another kernel and store them in global memory for access, so if this is the only way to optimize it, the fact that it won’t get better than 20% gives me pause about forging ahead with eliminating the curand calls.
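If the uniforms do get hoisted out, a separate generation kernel could look something like this (just a sketch; the kernel name and the buffer layout are my assumptions):

```cuda
// Fill a global buffer with the NUM_IN_CONSTANT_MEMORY uniforms each
// sampling thread needs, so curand drops out of the hot kernel entirely.
__global__ void GenerateUniforms(float* uniforms, unsigned long seed, int n)
{
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx >= n)
                return;
        curandStateXORWOW_t state;
        curand_init(seed + idx, 0, 0, &state);
        for (int i = 0; i < NUM_IN_CONSTANT_MEMORY; i++)
                uniforms[idx * NUM_IN_CONSTANT_MEMORY + i] = curand_uniform(&state);
}
```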

Since it isn’t obvious how to parallelize the routine further to avoid the loop altogether (the value of “base” at the current time step depends on its value at the previous step), perhaps there isn’t much more to squeeze out of this.

Oh, got it. Yeah, actually the input array “Reference” is set up so that idx < min(Reference), which prevents the data from getting overwritten.

Sure, my answers are in the same order:

512 threads per block

1500 blocks (should be able to handle anywhere from 1 to ~2000 blocks because of the current GPU memory limitation)

The value of pitch (an int) is generally between about 5 and 350

Currently I am doing the development/prototyping on the GT 240

It takes about 70200 microseconds according to the profiler, which isn’t bad, but it also shows up as at least 10 times costlier than the other routines, and because it has to be called millions of times it seemed like a logical choice to optimize.

I checked on the CUDA occupancy calculator, and the current settings seem to be maximizing use of the cores, at least for my system.


My feeling is that performance is limited by the amount of data that has to be transferred to and from global memory.

So if it is an issue of global-memory bandwidth, would there be a way to improve the coalescing? One constraint we currently have is that the input array “Reference” can’t really be changed much; however, we could potentially make the pitch variable a multiple of 16, for example. Or is this one of those things where we have to wait for the hardware to catch up so we can get away with a naive implementation?
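For the pitch-as-a-multiple-of-16 idea, cudaMallocPitch already does this kind of padding for you; a host-side sketch (numRows/numCols are placeholders for your actual dimensions):

```cuda
size_t pitchBytes;
float* d_array1;
// cudaMallocPitch pads each row so every row starts on an alignment
// the hardware can coalesce against; pitchBytes comes back in bytes.
cudaMallocPitch((void**)&d_array1, &pitchBytes, numCols * sizeof(float), numRows);
int pitch = (int)(pitchBytes / sizeof(float)); // pitch in elements, not bytes
```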

Even if you can’t change the format of the input and output data, it’s probably worth reorganizing the data in device memory before and after the kernel runs. Check the matrix transpose example in the SDK to see how it can be done efficiently.