I have a rather simple algorithm that is easy to implement but proving rather hard to optimize. The basic idea is that it goes through some number of elements on two different arrays, and then generates a third array whose ith element is the ith element of one of the two arrays.

In particular, it goes something like this:

```
__global__ void MySampling(float* array1,
float* array2,
int* Reference,
)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x ;
int thread_position = *(Reference + idx); // used to access elements from one part of array1, array2
register float U; // Random uniform variable
register int i, base;
// Draw a uniform random number
curandStateXORWOW_t state; // Make sure state exists
curand_init(seed + idx, 0, 0, &state);
U = curand_uniform(&state);
base = (U < 0.5 ? 0:1);
if (base == 0)
*(array1 + pitch*idx) = *(array1 + pidch*thread_position);
if (base == 1)
*(array1 + pitch*idx) = *(array2 + pidch*thread_position);
for (i = 1; i < NUM_IN_CONSTANT_MEMORY; i++)
{
U = curand_uniform(&state); // draw a uniform
// If the uniform < baseline[i-1] switch from copying array1 to array2
base = (U < baseline[i - 1] ? (1 - base) : base);
if (base == 0)
*(array1 + idx * pitch + i) = *(array1 + thread_position *pitch + i);
if (base == 1)
*(array1 + idx * pitch + i) = *(array2 + thread_position *pitch + i);
}
__syncthreads();
}
```

The variable array1, array2 are both in global memory (due to the scale of the problem fitting them in other memory types isn’t possible).

The variable “pitch” is an int and baseline is a float array both are loaded into constant memory. One problem is that the array baseline isn’t all 0.5 - so for example it could be that if element #2 is copied from array1, then element #3 could have a high likelihood of being copied from array1 as well. Otherwise the optimal way to do this would be draw at random for each element from each array.

As you can see there is a lot of potential for divergence as well, but the constraints placed on baseline don’t suggest an easy way to paralellize that segment of the code.

I’ve tried rewriting the algorithm as a matrix multiplication problem (I could expand on this if it would help at all), but for reasons mentioned above it hasn’t seemed to have affected performance.

Any ideas ???