Striped memory access

I’m rather new to CUDA development and I’ve stumbled onto some odd performance issues when using striped vs coalesced memory access. The kernel I’m playing with is as follows:
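For reference, a sketch of the kernel, reconstructed from the annotated copy quoted later in the thread (the exact line numbering of the original gist may differ):

```cuda
__global__ void testKernel(float* d_array, int size)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    for (int i = 0; i < 1000; ++i) {
        if (idx % 2 == 0) {
            d_array[idx] = idx + i + 1;   // even threads write
        }
        else {
            d_array[idx] = idx + i + 2;   // odd threads write; presumably the
                                          // "line 18" that gets commented out
        }
    }
}
```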

I’m calling this kernel with 8192 blocks of 128 threads (2^20 or ~1mil threads total).

As is, the kernel runs in 0.0256 ms, but if I comment out line 18 it takes 9.9717 ms (roughly 390 times longer).

By removing line 18 we’re going from coalesced to striped access, but I was under the impression that the memory controller would fetch the whole cache line it’s touching regardless. Since we’re striping over the same range, shouldn’t we expect the same number of memory transactions and similar performance?

GTX 980 Ti
CUDA 7.0.28
NVIDIA x86_64-352.21

I left a comment on your gist that shows what instructions were generated for an sm_50 target.

Commenting out line 18 appears to let the compiler unroll your loop, and the unrolled version requires quite a few more registers.

I’m not sure if it’s a compiler bug but the performance difference smells like one.

You might want to file a bug report if you find that it is.

FWIW, I’ve found that if/else statements can sometimes be surprising in CUDA because of warp divergence.

A more succinct idiom might be helpful:

d_array[idx] = idx + i + ((idx % 2 == 0) ? 1 : 2);


Or even:

d_array[idx] = idx + i + 1 + (idx & 1);

Each iteration performs a global memory access! And why is there a loop at all? The variable just gets overwritten on every iteration.
If the original intent was to accumulate values in the variable, use a local variable and write it out once at the end. Even then you would not need a loop, as the result is directly computable.

// Kernel

__global__ void testKernel(float* d_array, int size){

// Index

  int idx = blockIdx.x * blockDim.x + threadIdx.x;

// Initialize

  d_array[idx] = idx + 1000 + (idx & 0x01);

// Loop - WHY ???

  for(int i = 0; i < 1000; ++i){

// Condition

    if(idx % 2 == 0){
      d_array[idx] = idx + i + 1;
    }
    else{
      d_array[idx] = idx + i + 2;
    }
  }
}