Striped memory access

I’m rather new to CUDA development and I’ve stumbled onto some odd performance issues when using striped vs coalesced memory access. The kernel I’m playing with is as follows:
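For reference, a sketch of the kernel, reconstructed from the annotated copy quoted later in the thread (the exact line numbering of the original gist may differ):

```cuda
__global__ void testKernel(float* d_array, int size)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    for (int i = 0; i < 1000; ++i) {
        if (idx % 2 == 0) {
            d_array[idx] = idx + i + 1;   // even threads write
        }
        else {
            d_array[idx] = idx + i + 2;   // odd threads write; presumably the
                                          // "line 18" that gets commented out
        }
    }
}
```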

I’m calling this kernel with 8192 blocks of 128 threads (2^20 or ~1mil threads total).

As is, the kernel runs in 0.0256 ms, but if I comment out line 18 it takes 9.9717 ms (roughly 390 times longer).

By removing line 18 we’re going from coalesced to striped access, but I was under the impression that the memory controller would fetch the whole cache line it’s touching regardless. Since we’re striping over the same range, shouldn’t we expect the same number of memory transactions and similar performance?

GTX 980 Ti
CUDA 7.0.28
NVIDIA x86_64-352.21

I left a comment on your gist that shows what instructions were generated for an sm_50 target.

Commenting out line 18 appears to let the compiler unroll your loop, and the unrolled version requires quite a few more registers.

I’m not sure if it’s a compiler bug but the performance difference smells like one.

You might want to file a bug report if you find that it is.

FWIW, I’ve found that if/else statements can sometimes be surprising in CUDA because of warp divergence.

A more succinct idiom might be helpful:

d_array[idx] = idx + i + ((idx % 2 == 0) ? 1 : 2);


Or even:

d_array[idx] = idx + i + 1 + (idx & 1);

Each iteration performs a global memory access! And why is there a loop at all? The variable just gets overwritten on every iteration.
If the original intent was to accumulate values in the variable, use a local variable and write it out once at the end. Even then you would not need a loop, as the result is directly computable.

// Kernel

__global__ void testKernel(float* d_array, int size){

// Index

  int idx = blockIdx.x * blockDim.x + threadIdx.x;

// Initialize

  d_array[idx] = idx + 1000 + (idx & 0x01);

// Loop - WHY ???

  for(int i = 0; i < 1000; ++i){

// Condition

    if(idx % 2 == 0){
      d_array[idx] = idx + i + 1;
    }
    else{
      d_array[idx] = idx + i + 2;
    }
  }
}