vector data types Speedup by Vectorizing

I see that there are many vector data types declared in CUDA. Like “float4”, “int4” etc… But I dont see (pardon my oversight if any) any vector operations that can be applied on them.

For example I tried to do this but could not get it compiled for Device :

shared extern float4 prices;

prices[i] = prices[i]*2

The compiler said “float4*int” is NOt possible. So, I tried this:

prices[i] = prices[i]*(2,2,2,2)

but this one too did NOT work.

Does PTX support Vector instructions? How can I take advantage of this one?

The only advantage I see now is that I can access more global memory per warp resulting in amazing speedups (I got 20X as pointed out by Mark in some earlier post)

Any inputs? Thanks.

Look in cutil_math.h for overloaded operators to use in code compiled with nvcc. Or your can create your own. nvcc supports some C++ features like operator overloading and templates.

I’m not sure if PTX has vector operations, but I don’t think so.

I was wrong in the above statement. I had a bug in my program that was showing very lessss time. Apologies.

But still, I would like to know the vector operations that can be performed.

G8x is a scalar architecture.

Interesting! So, What purpose do these vector data types serve? Is there other NVIDIA cards that support vector processing? Are these data types meant for those cards?

So, If all threads in a warp access “Float4” elements instead of “float” and assuming that all such float4 elements are in consecutive addresses, Is it possible to achieve good performance gain using “float4” ?

These data types are a programming convenience. Previous NVIDIA cards did vector operations in hardware, but they are not supported by CUDA.

When I’m working in image data

I’m using the vector types to boost my memory accesses on char arrays.

Means that if let’s say I have an array

char* d_idata

which contains the image data

I cast the pointer for memory accesses to (uint4*) and read all the elements in the kernel like this:

uint4 rgb_pixel = d_idata[globalTid];

So every thread reads 16 char elements.

I experienced a speedup of 5x compared to reading each 8 bit char element out of the array, since they are coalesced now.

I think it is even a 128-bit read.

Nice to know that. I tried similar thing with float4 and did NOT get correct results. Resulted in an unusual long period of execution and wrong results. I think I must have messed something.

According to the manual, The memory coalescing works best for 32-bit coalescing. 64-bit and 128-bit are NOT that effective though they are faster than non-coalesced accesses. 1.1 Manual , Page 65/143, Pg no 51, Last Paragraph.

Coalesced 128-bit reads are faster only by a factor of 2 when compared to non-coalesced reads. But still will have to be much faster than processing 1 character at a time. So, your speed up looks justifiable. Aah… or else – may b, you should try reading 4 characters per thread and see what kind of speed you get. I think that would be an interesting thing to look at. If you ever do it, Can you kindly post the results here. Thanks.

here’s my kernel code for converting an RGBA to a Grayscale Image.

the kernel takes 1,1 ms with 32 bit coalesced reads for a 3008x2000 RGBA image on a GeForce 8800GTX.

__global__ void convertRGBAtoGRAY(uint4* g_idata, unsigned int* g_odata)



	//Current global thread index

    const unsigned int globalTid = IMUL(blockIdx.x, blockDim.x) + threadIdx.x;

	uint4 in_pixel;

	unsigned int out_pixel;

	in_pixel.x = g_idata [globalTid].x;

	in_pixel.y = g_idata [globalTid].y;

	in_pixel.z = g_idata [globalTid].z;

	in_pixel.w = g_idata [globalTid].w;


	unsigned int gray_4 = ((in_pixel.x & 0xFF) * 0.1140) + (((in_pixel.x & 0xFF00) >> 8) * 0.5870) + (((in_pixel.x & 0xFF0000) >> 16) * 0.2989);

	unsigned int gray_3 = ((in_pixel.y & 0xFF) * 0.1140) + (((in_pixel.y & 0xFF00) >> 8) * 0.5870) + (((in_pixel.y & 0xFF0000) >> 16) * 0.2989);

	unsigned int gray_2 = ((in_pixel.z & 0xFF) * 0.1140) + (((in_pixel.z & 0xFF00) >> 8) * 0.5870) + (((in_pixel.z & 0xFF0000) >> 16) * 0.2989);

	unsigned int gray_1 = ((in_pixel.w & 0xFF) * 0.1140) + (((in_pixel.w & 0xFF00) >> 8) * 0.5870) + (((in_pixel.w & 0xFF0000) >> 16) * 0.2989);

	out_pixel = 0 | (gray_1 << 24) | (gray_2 << 16) | (gray_3 << 8) | gray_4;


	g_odata[globalTid] = out_pixel;


…forgot to mention the kernel call:

convertRGBAtoGRAY<<< dimGrid, dimBlock >>>(	(uint4*)cuMemory.d_rgba_imageData, (unsigned int*)cuMemory.d_grayscale_imageData);

d_rgba_imageData and d_grayscale_imageData are both char*

That is only 40.7 GiB/s, you can nearly double your kernel’s performance. Sarnath is correct that 128-bit coalesced reads are slow. See my testing in this post…ndpost&p=290441 to confirm the manual’s claims.

  1. Are you sure that this

in_pixel.x = g_idata [globalTid].x;

in_pixel.y = g_idata [globalTid].y;

in_pixel.z = g_idata [globalTid].z;

in_pixel.w = g_idata [globalTid].w;

results in a coalesced read? I for one, wouldn’t trust the compiler to be that smart and would write in_pixel = g_idata[globalTid];

  1. point 1) doesn’t matter anyways given the testing in the post I mentioned. Create a simple 1D texture and bind g_idata to it. Then access in_pixel with

in_pixel = tex1Dfetch(tex, globalTid); and watch your performance nearly double.

Since the performance I got was sufficient for my purposes i didn’t change the code.
I was hoping the compiler made the best out of it ;)

Thank you very much for the tip with the 1D texture Mister Anderson, results are stunning!

Should have looked at your post first before i started my implementation!

No problem, you’re welcome. I stumbled across the 128-bit coalesced read -> tex1Dfetch read optimization a while back (boosted my code’s overall performance by 6%) and it is unintuitive enough I’ve tried to share it on the forums when these issues come up.

I understand that when performance is as fast as you need it there is little reason to waste your time optimizing, but lately I’ve been trying to get every single percent of performance I can out of my code so I can’t escape that mindset :) One particular unoptimized part of my code that was “fast enough” a few months ago with only 1% of overhead now takes up almost 10% so now it actually is worth it to go back and optimize things.