Coalescing issue, presumably due to the CUDA Optimizer

e.ping · December 7, 2009, 9:00pm

I believe there is an issue with the CUDA compiler/optimizer that actually forces my particular code to be uncoalesced, even though the data is structured in such a way that coalescing should occur. I can, in fact, force the issue and make it be coalesced, but it is a total hack to make that be the case. Can someone either a) confirm that this is a real issue or B) tell me how I don’t know what I am talking about? Either would work for me!

Some background - I have an array of complex data, called in. in is actually stored as a (float2 *).

The kernel in question needs to return the real part of in. The following kernel yields many uncoalesced reads:

global void getReals(int N, float *out, float2 *in)
{
unsigned int idx = (blockIdx.x * blockDim.x) + threadIdx.x;
unsigned int gridSize = blockDim.x * gridDim.x; // stride by a gridsize number of threads

while(idx < N)
{
	out[idx] = in[idx].x;
	idx += gridSize;
}

}

The following simple tweak yields all coalesced reads, is substantially faster, and of course, yields the wrong answer:

global void getReals(int N, float *out, float2 *in)
{
unsigned int idx = (blockIdx.x * blockDim.x) + threadIdx.x;
unsigned int gridSize = blockDim.x * gridDim.x; // stride by a gridsize number of threads

while(idx < N)
{
	out[idx] = in[idx].x + in[idx].y * 0.00000000000001;
	idx += gridSize;
}

}

So, what seems to be going on is that the compiler, in the first case, sees through my attempt at reading in a float2 and instead reads in a float (since it realizes that I never use the .y part). What this means is that successive threads read non-contiguous floats, and I get uncoalesced reads.
In the second case, I use a teeny bit of .y part - just enough to make the compiler not be too smart for its own good, and I get entirely coalesced results. Plus, the code is about 3x faster. But of course, I don’t like this solution - a) its wrong, and B) its a hack.

Any thoughts?
FWIW - I am using CUDA 2.3.0, on a 9800 GTX.

Thanks!

Eddie

ps -I have already tried this:

float2 temp; temp = in[idx]; out[idx] = temp.x; <<— Doesn’t help

Cygnus_X1 · December 7, 2009, 9:48pm

I tried both versions of your code on my GTX 260, CUDA 2.3 and didn’t observed any significant difference in execution time.
Case 1 (correct result) for N=1610241024: 18.397ms
Case 2 (incorrect result) for same N: 18.777ms

e.ping · December 7, 2009, 9:52pm

Thanks for the prompt reply - I can’t explain why you see such little difference… Could you do me a favor and run it in the profiler and see if you get different amounts of coalescing from one version to the other?

avidday · December 7, 2009, 9:57pm

I can - he is using a compute 1.3 capable card which can do a couple of memory access patterns that your 1.1 capable card can’t. Profiling for memory coalescing on a 1.3 card doesn’t work anyway, the profiler won’t let you collect that data for the same reason - it doesn’t have much relevance on compute 1.3 devices.

e.ping · December 7, 2009, 10:00pm

Ah! You just barely beat me! I just realized that myself! Thanks a lot though! Hopefully someone else with a 1.1 device can try this out?

avidday · December 7, 2009, 10:03pm

Unfortunately (or fortunately, depending on your perspective) not me…

Cygnus_X1 · December 7, 2009, 10:06pm

case 1:
GPU Time: 17134.4
CPU Time: 18932.2
Block size X: 512
Static shared mem: 28
Reg/thread: 7
Occupancy: 0.5
CTA launched: 1
Branch: 524369
Div branch: 1
Instructions: 3670359
gld 128b: 1048576
gst 64b: 1048576
gld request: 524288
gst request: 524288

case 2:
GPU Time: 17496.3
CPU Time: 19298.6
Block size X: 512
Static shared mem: 28
Reg/thread: 8
Occupancy: 0.5
CTA launched: 1
Branch: 524369
Div branch: 1
Instructions: 4194647
gld 128b: 1048576
gst 64b: 1048576
gld request: 524288
gst request: 524288

relevant nvcc options I compiled the code with:
-m32 -arch sm_13 -Xcompiler "/EHsc /W3 /nologo /O2 /Zi /MD " -maxrregcount=24 --ptxas-options=-v

C:>nvcc --version
nvcc: NVIDIA Â® Cuda compiler driver
Copyright Â© 2005-2009 NVIDIA Corporation
Built on Mon_Aug__3_19:43:55_PDT_2009
Cuda compilation tools, release 2.3, V0.2.1221

running on Windows 7 x64

Edit: It did collect some data regarding memory transactions (the gld and gst values)

1.3 devices still need global memory to be accessed in a coalesced way. It is just that if the block you fetch is misaligned and hence spans into two banks/chunks/whatever-you-call-them it takes twice as much time to load and not 16 times as much as it is with 1.1 devices.

avidday · December 7, 2009, 10:10pm

But not uncoalesced loads and stores, which is what he wants.

Cygnus_X1 · December 7, 2009, 10:15pm

Case 3:

out[idx] = in[idx+1].x

Leads to the following profiler output:

GPU Time: 17532.6 (worse than in case 1 and 2)

CPU Time: 19348.1

Block size X: 512

Static shared mem: 28

Reg/thread: 7

Occupancy: 0.5

CTA launched: 1

Branch: 524369

Div branch: 1

Instructions: 4194647

gld 32b: 1048576 ← additional load transaction which does not appear in previous 2 cases because of misalignment

gld 128b: 1048576

gst 64b: 1048576

gld request: 524288

gst request: 524288

To conclude:

case1 and case2 have perfectly coalesced reads of size 128 bytes per half-warp and perfectly coalesced writes of size 64 bytes per half-warp. (note 1048576=sizeof(float2)*N/128)

case3 had not coalesced read resulting in 2 as many read transactions on 1.3 machine.

e.ping · December 8, 2009, 3:21am

Case 3:
out[idx] = in[idx+1].x
Leads to the following profiler output:

GPU Time: 17532.6 (worse than in case 1 and 2)

CPU Time: 19348.1

Block size X: 512

Static shared mem: 28

Reg/thread: 7

Occupancy: 0.5

CTA launched: 1

Branch: 524369

Div branch: 1

Instructions: 4194647

gld 32b: 1048576 ← additional load transaction which does not appear in previous 2 cases because of misalignment

gld 128b: 1048576

gst 64b: 1048576

gld request: 524288

gst request: 524288

To conclude:

case1 and case2 have perfectly coalesced reads of size 128 bytes per half-warp and perfectly coalesced writes of size 64 bytes per half-warp. (note 1048576=sizeof(float2)*N/128)

case3 had not coalesced read resulting in 2 as many read transactions on 1.3 machine.

So I am definitely appreciative of people’s chiming in on this, but I am still left hanging about a 1.1 capable card… I understand now that this is not an issue with a 1.3 capable card, but I’d be interested in knowing for certain that there is no nicer way around this in my humble 1.1 situation. It is frustrating to me, because it seems like an optimizer trying to be a little too cute. As I mentioned earlier (I think in my first post), I can change my statements around so that I explicitly read in a float2 instead of a float, and it is clear that the compiler recognizes that since I don’t use the latter half of the float2, it will read a float from memory instead of a float2. Hence my code becoming uncoalesced. Even turning off the optimizer doesn’t help!

Thanks for all your help so far Cygnus!

Eddie

pcchen · December 8, 2009, 12:19pm

This is unfortunate, but if you really need have to do this, you can try using shared memory to read everything in coalesced fashion (you may have to treat the array as float* instead as float2*). Although you have to be careful to avoid bank conflict.

Cygnus_X1 · December 8, 2009, 3:48pm

I am now writing from my older laptop which uses 1.1 capable card and CUDA 2.0. Hopefully what I found here will be applicable to your case as well:

#include <stdio.h>

#include <cutil.h>

__global__ void dummyKernel() {

}

__global__ void case1(int N, float2 *in, float *out) {

  unsigned int idx=blockIdx.x*blockDim.x+threadIdx.x;

  unsigned int gridSize=blockDim.x*gridDim.x;

  while(idx<N) {

	float2 d=in[idx];

	out[idx]=d.x;

	idx+=gridSize;

  }

}

__global__ void case2(int N, float2 *in, float *out) {

  unsigned int idx=blockIdx.x*blockDim.x+threadIdx.x;

  unsigned int gridSize=blockDim.x*gridDim.x;

  while(idx<N) {

	float2 d=in[idx];

	out[idx]=d.x+d.y;

	idx+=gridSize;

  }

}

__global__ void case3(int N, float2 *in, float *out) {

  unsigned int idx=blockIdx.x*blockDim.x+threadIdx.x;

  unsigned int gridSize=blockDim.x*gridDim.x;

  while(idx<N) {

	volatile float2 d=in[idx];

	out[idx]=d.x;

	idx+=gridSize;

  }

}

int main() {

  const int dataSize=1024*1024*4;

  float2 *in;

  cudaMalloc((void**)&in,dataSize*sizeof(float2));

  float *out;

  cudaMalloc((void**)&out,dataSize*sizeof(float));

  unsigned int timer;

  dummyKernel<<<1,512>>>();

  cudaThreadSynchronize();

  cutCreateTimer(&timer);

  cutStartTimer(timer);

  case1<<<1,512>>>(dataSize,in,out);

  cudaThreadSynchronize();

  cutStopTimer(timer);

  printf("Case 1 run time: %f ms\n",cutGetTimerValue(timer));

  cudaThreadSynchronize();

  cutCreateTimer(&timer);

  cutStartTimer(timer);

  case2<<<1,512>>>(dataSize,in,out);

  cudaThreadSynchronize();

  cutStopTimer(timer);

  printf("Case 2 run time: %f ms\n",cutGetTimerValue(timer));

  cutCreateTimer(&timer);

  cutStartTimer(timer);

  case3<<<1,512>>>(dataSize,in,out);

  cudaThreadSynchronize();

  cutStopTimer(timer);

  printf("Case 3 run time: %f ms\n",cutGetTimerValue(timer));

  return 0;

}

Output:

Case 1 run time: 60.236000 ms

Case 2 run time: 26.169001 ms

Case 3 run time: 25.672001 ms

e.ping · December 8, 2009, 4:39pm

I am now writing from my older laptop which uses 1.1 capable card and CUDA 2.0. Hopefully what I found here will be applicable to your case as well:

#include <stdio.h>

#include <cutil.h>

__global__ void dummyKernel() {

}

__global__ void case1(int N, float2 *in, float *out) {

  unsigned int idx=blockIdx.x*blockDim.x+threadIdx.x;

  unsigned int gridSize=blockDim.x*gridDim.x;

  while(idx<N) {

	float2 d=in[idx];

	out[idx]=d.x;

	idx+=gridSize;

  }

}

__global__ void case2(int N, float2 *in, float *out) {

  unsigned int idx=blockIdx.x*blockDim.x+threadIdx.x;

  unsigned int gridSize=blockDim.x*gridDim.x;

  while(idx<N) {

	float2 d=in[idx];

	out[idx]=d.x+d.y;

	idx+=gridSize;

  }

}

__global__ void case3(int N, float2 *in, float *out) {

  unsigned int idx=blockIdx.x*blockDim.x+threadIdx.x;

  unsigned int gridSize=blockDim.x*gridDim.x;

  while(idx<N) {

	volatile float2 d=in[idx];

	out[idx]=d.x;

	idx+=gridSize;

  }

}

int main() {

  const int dataSize=1024*1024*4;

  float2 *in;

  cudaMalloc((void**)&in,dataSize*sizeof(float2));

  float *out;

  cudaMalloc((void**)&out,dataSize*sizeof(float));

  unsigned int timer;

  dummyKernel<<<1,512>>>();

  cudaThreadSynchronize();

  cutCreateTimer(&timer);

  cutStartTimer(timer);

  case1<<<1,512>>>(dataSize,in,out);

  cudaThreadSynchronize();

  cutStopTimer(timer);

  printf("Case 1 run time: %f ms\n",cutGetTimerValue(timer));

  cudaThreadSynchronize();

  cutCreateTimer(&timer);

  cutStartTimer(timer);

  case2<<<1,512>>>(dataSize,in,out);

  cudaThreadSynchronize();

  cutStopTimer(timer);

  printf("Case 2 run time: %f ms\n",cutGetTimerValue(timer));

  cutCreateTimer(&timer);

  cutStartTimer(timer);

  case3<<<1,512>>>(dataSize,in,out);

  cudaThreadSynchronize();

  cutStopTimer(timer);

  printf("Case 3 run time: %f ms\n",cutGetTimerValue(timer));

  return 0;

}

Output:

Case 1 run time: 60.236000 ms

Case 2 run time: 26.169001 ms

Case 3 run time: 25.672001 ms

Cygnus -

Thanks a lot! You found the solution! The volatile keyword on the float2 was the key to making this behave the way I want it to!

Eddie

Cygnus_X1 · December 8, 2009, 5:37pm

I am just worried that I don’t really know what volatile means for cuda compiler regarding registers and exactly which optimalisations it disables. It could happen that with some future release of the compiler this will actually revert to the previous slow code…

e.ping · December 8, 2009, 7:19pm

I hear you - the solution kind of has me annoyed a little bit! Just to complain a little bit more - the following will not even compile:

volatile float2 temp;

temp = Z[i];

while this does compile:

volatile float2 temp = Z[i];

This definitely has me flummoxed!

Eddie

Cygnus_X1 · December 8, 2009, 9:05pm

What was compiler’s error?

e.ping · December 9, 2009, 4:24pm

The following two lines:

volatile float2 z;

z = Z[idx];

yields the following compiler error:

error: no operator “=” matches these operands

1> operand types are: volatile float2 = float2

but simply saying

volatile float2 z = Z[idx];

compiles just fine.

This seems like I ought to be able to get the first case to work with proper casting, but syntactically, I have yet to figure it out. In other words,

z = (volatile float2) (Z[idx]);

does not compile either - and I get the kind of silly looking error message:

error: no operator “=” matches these operands

1> operand types are: volatile float2 = volatile float2

Cygnus_X1 · December 9, 2009, 4:55pm

In C++

Type var=value;

Actually calls a constructor of Type with parameter value which is equivalent to

Type var(value);

In any other case = is an assignment operator.

That is a semantic difference between those two, simmingly similar constructs. It does not explain however why there is no = operator for volatile float2. Sounds like a tiny ommision in the compiler for nonstandard types.

Could you try replacing float2 with float and see if you face similar compiler errors?

e.ping · December 9, 2009, 5:34pm

In C++
Type var=value;
Actually calls a constructor of Type with parameter value which is equivalent to
Type var(value);
In any other case = is an assignment operator.

That is a semantic difference between those two, simmingly similar constructs. It does not explain however why there is no = operator for volatile float2. Sounds like a tiny ommision in the compiler for nonstandard types.

Could you try replacing float2 with float and see if you face similar compiler errors?

Cygnus -

Thanks - your explanation makes a lot of sense. I tried out your suggestion and sure enough, I don’t have the problem with assigning to a volatile float. Just with a volatile float2. For grins, I tried out volatile char (works), volatile char2 (doesn’t work), volatile float3 (doesn’t work), etc. So I guess, as you say, there is a small bug in the compiler for these situations. I really appreciate all your help in this matter!

Eddie

Topic		Replies	Views
Float2 Coalescing CUDA Programming and Performance	12	10817	December 9, 2010
Cannot coalesce global memory reads using builtin vector types CUDA Programming and Performance	6	3375	July 14, 2010
Coalesced vs non-coalesced in reduction example Why float4-reads are not coalesced? CUDA Programming and Performance	10	4163	October 15, 2008
coalescing struct loading problem CUDA Programming and Performance	21	12849	March 5, 2010
Coalesced Memory Access to Structs CUDA Programming and Performance	11	4713	September 19, 2009
why is it uncoalesced ? SDK example simpleGL CUDA Programming and Performance	9	13736	February 3, 2011
Cannot coalesce global memory reads using builtin vector types CUDA Programming and Performance	1	1274	July 12, 2010
Kernel has 0 coalesced reads/writes... Profiler reveals my newbness CUDA Programming and Performance	1	1112	February 18, 2009
volatile breaks coalescing for vector types volatile trick can backfire. CUDA Programming and Performance	7	2101	November 4, 2009
Coalesced VBO Access CUDA Programming and Performance	14	1833	February 4, 2011

Coalescing issue, presumably due to the CUDA Optimizer

Related topics