Passing a structure is much slower than passing "rough" variables

I have two functions which do exactly the same job - rasterize a triangle. They differ only in the way their arguments are passed. Here are the definitions:

__global__ void rasterizePixel(
	byte *colorBuffer, float *depthBuffer, int width,
	Vertex v0,
	Vertex v1,
	Vertex v2,
	const CTexture* texture,
	float one_over_h0,
	float one_over_h1,
	float one_over_h2,
	int minX,
	int maxX,
	int minY,
	int maxY,
	float one_over_v0ToLine12,
	float one_over_v1ToLine20,
	float one_over_v2ToLine01,
	plane alphaPlane,
	plane betaPlane,
	plane gammaPlane,
	float one_over_alpha_c,
	float one_over_beta_c,
	float one_over_gamma_c,
	float alpha_ffx,
	float beta_ffx,
	float gamma_ffx,
	float alpha_ffy,
	float beta_ffy,
	float gamma_ffy)

And the second:

__global__ void rasterizePixel(
	byte *colorBuffer, float *depthBuffer, int width,
	TriangleToRasterize t)

TriangleToRasterize looks like this:

struct TriangleToRasterize
{
	Vertex v0, v1, v2;
	const CTexture* texture;
	float one_over_h0;
	float one_over_h1;
	float one_over_h2;
	int minX;
	int maxX;
	int minY;
	int maxY;
	float one_over_v0ToLine12;
	float one_over_v1ToLine20;
	float one_over_v2ToLine01;
	plane alphaPlane;
	plane betaPlane;
	plane gammaPlane;
	float one_over_alpha_c;
	float one_over_beta_c;
	float one_over_gamma_c;
	float alpha_ffx;
	float beta_ffx;
	float gamma_ffx;
	float alpha_ffy;
	float beta_ffy;
	float gamma_ffy;
};

To my surprise, when I call the first function I get around 8ms of total frame time, and when I call the second I get 40ms. Any idea why passing a structure could be slower?

I previously experienced the same problem. I had to pass 4x4 matrices as 16 individual function arguments and not as a single object to make it work fast.

Pass-by-value of large structures will use local memory and therefore consume global memory bandwidth (stoopid compiler!). Use the verbose output of ptxas to see shared, local and constant memory use.

Alternatively, try passing by reference or using pointers. Unfortunately the GPU's registers can't be indexed, so pointer addressing may have the same issue (it may use local memory instead of registers). If you need to make a copy of the data, investigate whether you can do that in a controlled way (e.g. in shared memory).
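For illustration only (a sketch, not code from this thread), the pointer-based variant could look something like this, assuming the struct has already been copied into device-accessible memory before the launch:

__global__ void rasterizePixel(
	byte *colorBuffer, float *depthBuffer, int width,
	const TriangleToRasterize *t)	// pointer instead of a by-value copy
{
	// t must point at device memory, e.g. filled with cudaMemcpy before the launch
	int x = t->minX + threadIdx.x + 16*blockIdx.x;
	int y = t->minY + threadIdx.y + 16*blockIdx.y;
	// ...
}

The verbose ptxas output mentioned above comes from passing -Xptxas -v (or --ptxas-options=-v) to nvcc.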

Code I:

__global__ void rasterizePixel(
	byte *colorBuffer, float *depthBuffer, int width,
	[long list of attributes])
{
	int x = minX + threadIdx.x + 16*blockIdx.x;
	int y = minY + threadIdx.y + 16*blockIdx.y;
...

Takes 16ms.

Code II:

__global__ void rasterizePixel(
	byte *colorBuffer, float *depthBuffer, int width,
	TriangleToRasterize t)
{
	int x = t.minX + threadIdx.x + 16*blockIdx.x;
	int y = t.minY + threadIdx.y + 16*blockIdx.y;
...

Takes 80ms.

Code III:

__global__ void rasterizePixel(
	byte *colorBuffer, float *depthBuffer, int width,
	TriangleToRasterize tt)
{
	__shared__ TriangleToRasterize t;
	t = tt;
	__syncthreads();
	int x = t.minX + threadIdx.x + 16*blockIdx.x;
	int y = t.minY + threadIdx.y + 16*blockIdx.y;
...

Takes 110ms.

I think I could greatly benefit from shared memory, but I don't know how to load the data into it from global memory and keep it there (the approach depicted in Code III definitely does not work as expected). The TriangleToRasterize struct is the same for all threads within a block.

In what kind of memory do a kernel's arguments reside?

Shared memory on 1.x devices and constant memory on 2.x. I would have thought that getting the arguments into shared memory on compute capability 1.x requires global memory access. If you are on 2.x, though, the constant cache will be persistent across blocks.

You should be able to achieve the same speed if you place the struct in constant memory yourself (assuming you are not going to change it).
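As a sketch of that suggestion, and assuming TriangleToRasterize were a plain POD struct (which, as noted in the next reply, it currently is not), placing it in constant memory could look roughly like this; c_tri and h_tri are illustrative names:

__constant__ TriangleToRasterize c_tri;	// assumes a POD struct with no constructors

// Host side, before the launch (h_tri is the host copy of the struct):
//   cudaMemcpyToSymbol(c_tri, &h_tri, sizeof(TriangleToRasterize));

__global__ void rasterizePixel(byte *colorBuffer, float *depthBuffer, int width)
{
	int x = c_tri.minX + threadIdx.x + 16*blockIdx.x;
	int y = c_tri.minY + threadIdx.y + 16*blockIdx.y;
	// ...
}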

Constant memory seems to be trouble because I can't put C++ elements there. My TriangleToRasterize struct contains Vertex and plane structures which have constructors and overloaded operators, and because of this CUDA does not want to compile it. Ideally, I would be happy if I could manually place my structure in the shared memory of a block. Is there a way to do so other than the one I presented in Code III?

You may use a trick: take the size of t, use a pointer to tt in global memory, and fill t in shared memory using many threads and coalesced access, like:

int *buf = (int*)&t;
if (threadIdx.x < (sizeof(t) >> 2))
	buf[threadIdx.x] = ((int*)&tt)[threadIdx.x];
__syncthreads();

This is pretty strange. I used your code, and I also modified it a bit:

__shared__ TriangleToRasterize t;
byte *buf = (byte*)&t;
int index = threadIdx.y*16 + threadIdx.x;
if (index < 236) // 236 == sizeof(TriangleToRasterize)
	buf[index] = ((byte*)&tt)[index];
__syncthreads();

and not only did I get an artifacted output image, but the performance did not change at all.

Memory statistics:

1>ptxas info    : Compiling entry function '_Z17rasterizePixel222PhPfi19TriangleToRasterize' for 'sm_10'
1>ptxas info    : Used 19 registers, 236+0 bytes lmem, 248+16 bytes smem, 16 bytes cmem[1]
1>ptxas info    : Compiling entry function '_Z17rasterizePixel111PhPfi6VertexS1_S1_fffiiiifff5planeS2_S2_fffffffff' for 'sm_10'
1>ptxas info    : Used 12 registers, 244+16 bytes smem, 16 bytes cmem[1]
1>renderer.cu
1>ptxas info    : Compiling entry function '_Z17rasterizePixel222PhPfi19TriangleToRasterize' for 'sm_20'
1>ptxas info    : Used 24 registers, 4+0 bytes lmem, 280 bytes cmem[0], 8 bytes cmem[16]
1>ptxas info    : Compiling entry function '_Z17rasterizePixel111PhPfi6VertexS1_S1_fffiiiifff5planeS2_S2_fffffffff' for 'sm_20'
1>ptxas info    : Used 20 registers, 276 bytes cmem[0], 4 bytes cmem[16]

rasterizePixel111 == rough arguments

rasterizePixel222 == structure passed as the argument

You need to copy tt to global memory using cudaMemcpy and pass its address to the kernel. Also, nvcc may use a different alignment, so the size of tt may be different than you suppose.
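A rough host-side sketch of that suggestion (d_tri and h_tri are illustrative names, not from the original code):

TriangleToRasterize h_tri;	// filled on the host as before
TriangleToRasterize *d_tri = 0;
cudaMalloc((void**)&d_tri, sizeof(TriangleToRasterize));
cudaMemcpy(d_tri, &h_tri, sizeof(TriangleToRasterize), cudaMemcpyHostToDevice);
// the kernel now takes a pointer instead of the struct by value;
// grid and block are the existing launch configuration (16x16 blocks)
rasterizePixel<<<grid, block>>>(colorBuffer, depthBuffer, width, d_tri);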

Now I get 20-25ms (which is still not that good compared to the 16ms from the "rough" arguments approach). Moreover, the image is artifacted: there are random 16x16 blocks of pixels (my CUDA blocks are sized 16x16) which are single-colored, and they change "position" every frame.

Get rid of byte access; use 32-bit data.
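Putting the suggestions together, a sketch of what the kernel prologue might look like with the struct in global memory and a word-wise (32-bit) copy into shared memory; it assumes the device-side sizeof(TriangleToRasterize) is a multiple of 4:

__global__ void rasterizePixel(
	byte *colorBuffer, float *depthBuffer, int width,
	const TriangleToRasterize *tt)	// tt now lives in global memory
{
	__shared__ TriangleToRasterize t;
	// coalesced, word-wise copy into shared memory, one 32-bit word per thread;
	// assumes the device-side struct size is a multiple of 4 bytes
	int index = threadIdx.y*16 + threadIdx.x;
	if (index < (int)(sizeof(TriangleToRasterize) >> 2))
		((int*)&t)[index] = ((const int*)tt)[index];
	__syncthreads();
	int x = t.minX + threadIdx.x + 16*blockIdx.x;
	int y = t.minY + threadIdx.y + 16*blockIdx.y;
	// ...
}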

Nothing changed.

It may be a compiler bug with structures, or different alignment on the host and device sides.

One more question regarding the memory statistics. I see that passing a structure consumes additional local memory, whereas passing "rough" arguments consumes only shared memory (and a bit of constant memory). Does passing a structure mean that the structure is placed in global memory, and shared memory is used because this data is downloaded from global to shared memory? I do not see any other explanation for why the structure-passing method would consume additional local memory.


"Pass-by-value of large structures will use local memory and therefore consumes global memory bandwidth (stoopid compiler!). Use the verbose output of ptxas to see shared, local and constant memory use.

"

Right. I got lost in the subject :). I guess there is no way to force the compiler to "roll out" the structure. Anyway, thank you for your help.