Vector data types: speedup by vectorizing

I see that there are many vector data types declared in CUDA, like float4, int4, etc., but I don't see (pardon my oversight, if any) any vector operations that can be applied to them.

For example, I tried this but could not get it to compile for the device:

extern __shared__ float4 prices[];

prices[i] = prices[i] * 2;

The compiler said “float4 * int” is not possible. So I tried this:

prices[i] = prices[i] * (2, 2, 2, 2);

but this one did not work either.

Does PTX support vector instructions? How can I take advantage of these types?

The only advantage I see now is that I can access more global memory per warp, resulting in amazing speedups (I got 20x, as pointed out by Mark in an earlier post).

Any inputs? Thanks.

Look in cutil_math.h for overloaded operators to use in code compiled with nvcc, or you can create your own. nvcc supports some C++ features like operator overloading and templates.
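For example, a float4 multiply could be defined like this (a sketch of the kind of overloads cutil_math.h contains; check the header for the exact set):

// Component-wise float4 * scalar, usable on both host and device.
inline __host__ __device__ float4 operator*(float4 a, float s)
{
    return make_float4(a.x * s, a.y * s, a.z * s, a.w * s);
}

// Component-wise float4 * float4.
inline __host__ __device__ float4 operator*(float4 a, float4 b)
{
    return make_float4(a.x * b.x, a.y * b.y, a.z * b.z, a.w * b.w);
}

With overloads like these in scope, prices[i] = prices[i] * 2.0f; compiles fine; the compiler just expands it into four scalar multiplies.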

I’m not sure if PTX has vector operations, but I don’t think so.

I was wrong in the above statement. I had a bug in my program that was reporting a much shorter time than the real one. Apologies.

But I would still like to know which vector operations can be performed.

G8x is a scalar architecture.

Interesting! So what purpose do these vector data types serve? Are there other NVIDIA cards that support vector processing? Are these data types meant for those cards?

So, if all threads in a warp access float4 elements instead of float, and assuming all those float4 elements are at consecutive addresses, is it possible to achieve a good performance gain using float4?

These data types are a programming convenience. Previous NVIDIA cards did vector operations in hardware, but they are not supported by CUDA.
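For instance, float4 is just a 16-byte-aligned struct of four floats; arithmetic on it compiles to separate scalar instructions per thread. A sketch:

// float4 is a plain aligned struct; G8x has no per-thread SIMD ALU.
float4 p = make_float4(1.0f, 2.0f, 3.0f, 4.0f);
p.x *= 2.0f;  // each component is a separate scalar operation
p.y *= 2.0f;
p.z *= 2.0f;
p.w *= 2.0f;

The hardware benefit is the single 128-bit load/store that the aligned type allows, not the arithmetic.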

When I’m working with image data, I use the vector types to boost my memory accesses on char arrays. Say I have an array

char* d_idata;

which contains the image data. I cast the pointer to (uint4*) and read the elements in the kernel like this:

uint4 rgb_pixel = d_idata[globalTid];

So every thread reads 16 char elements.

I measured a 5x speedup compared to reading each 8-bit char element out of the array individually, since the accesses are now coalesced.

I think it is even a 128-bit read.
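Put together, the whole pattern looks roughly like this (a sketch; cudaMalloc aligns the base pointer, so the uint4 cast is safe as long as the buffer length is a multiple of 16 bytes):

// Each thread reads 16 chars of image data in one 128-bit transaction.
__global__ void copy16(const uint4* d_idata, uint4* d_odata)
{
    const unsigned int globalTid = blockIdx.x * blockDim.x + threadIdx.x;
    uint4 rgb_pixel = d_idata[globalTid];
    d_odata[globalTid] = rgb_pixel;
}

// Host side: one thread per 16 bytes.
// copy16<<<numBytes / 16 / 256, 256>>>((uint4*)d_idata, (uint4*)d_odata);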

Nice to know. I tried a similar thing with float4 and did not get correct results: an unusually long execution time and wrong output. I think I must have messed something up.

According to the manual, coalescing works best for 32-bit accesses; 64-bit and 128-bit coalesced accesses are not as effective, though they are still faster than non-coalesced accesses (CUDA 1.1 Programming Guide, page 51, last paragraph).

Coalesced 128-bit reads are only about 2x faster than non-coalesced reads, but that is still much faster than processing one character at a time, so your speedup looks justifiable. Or maybe you should try reading 4 characters per thread (a single 32-bit read) and see what kind of speed you get; I think that would be an interesting thing to look at. If you ever do it, can you kindly post the results here? Thanks.
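For concreteness, the 4-characters-per-thread version I have in mind would look something like this (an untested sketch; the names are made up):

// One 32-bit coalesced load per thread: a uchar4 holds 4 chars.
__global__ void read4(const uchar4* d_idata, unsigned int* d_odata)
{
    const unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    uchar4 px = d_idata[tid];
    // Dummy use of the bytes so the compiler keeps the load.
    d_odata[tid] = px.x + px.y + px.z + px.w;
}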

Here’s my kernel code for converting an RGBA image to grayscale.

The kernel takes 1.1 ms with 32-bit coalesced reads for a 3008x2000 RGBA image on a GeForce 8800 GTX.

__global__ void convertRGBAtoGRAY(uint4* g_idata, unsigned int* g_odata)
{
    // Current global thread index
    const unsigned int globalTid = IMUL(blockIdx.x, blockDim.x) + threadIdx.x;

    uint4 in_pixel;
    unsigned int out_pixel;

    // Each uint4 component holds one packed 32-bit RGBA pixel.
    in_pixel.x = g_idata[globalTid].x;
    in_pixel.y = g_idata[globalTid].y;
    in_pixel.z = g_idata[globalTid].z;
    in_pixel.w = g_idata[globalTid].w;

    // Weight the three color bytes of each pixel and sum to a gray value.
    unsigned int gray_4 = ((in_pixel.x & 0xFF) * 0.1140) + (((in_pixel.x & 0xFF00) >> 8) * 0.5870) + (((in_pixel.x & 0xFF0000) >> 16) * 0.2989);
    unsigned int gray_3 = ((in_pixel.y & 0xFF) * 0.1140) + (((in_pixel.y & 0xFF00) >> 8) * 0.5870) + (((in_pixel.y & 0xFF0000) >> 16) * 0.2989);
    unsigned int gray_2 = ((in_pixel.z & 0xFF) * 0.1140) + (((in_pixel.z & 0xFF00) >> 8) * 0.5870) + (((in_pixel.z & 0xFF0000) >> 16) * 0.2989);
    unsigned int gray_1 = ((in_pixel.w & 0xFF) * 0.1140) + (((in_pixel.w & 0xFF00) >> 8) * 0.5870) + (((in_pixel.w & 0xFF0000) >> 16) * 0.2989);

    // Pack the four 8-bit gray values into one 32-bit output word.
    out_pixel = (gray_1 << 24) | (gray_2 << 16) | (gray_3 << 8) | gray_4;

    g_odata[globalTid] = out_pixel;
}

…forgot to mention the kernel call:

convertRGBAtoGRAY<<<dimGrid, dimBlock>>>((uint4*)cuMemory.d_rgba_imageData, (unsigned int*)cuMemory.d_grayscale_imageData);

d_rgba_imageData and d_grayscale_imageData are both char*
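dimGrid and dimBlock just give one thread per uint4, i.e. per four RGBA pixels; roughly like this (I'm quoting the block size from memory, so treat it as a sketch):

// One thread per uint4 = 4 packed RGBA pixels.
const unsigned int numPixels  = 3008 * 2000;
const unsigned int numThreads = numPixels / 4;   // 1,504,000 threads
dim3 dimBlock(256);                              // block size may differ
dim3 dimGrid(numThreads / dimBlock.x);           // 5875 blocks
convertRGBAtoGRAY<<<dimGrid, dimBlock>>>(
    (uint4*)cuMemory.d_rgba_imageData,
    (unsigned int*)cuMemory.d_grayscale_imageData);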

That is only 40.7 GiB/s; you can nearly double your kernel’s performance. Sarnath is correct that 128-bit coalesced reads are slow. See my testing in this post http://forums.nvidia.com/index.php?showtop…ndpost&p=290441 to confirm the manual’s claims.

  1. Are you sure that this

in_pixel.x = g_idata[globalTid].x;
in_pixel.y = g_idata[globalTid].y;
in_pixel.z = g_idata[globalTid].z;
in_pixel.w = g_idata[globalTid].w;

results in a coalesced read? I, for one, wouldn’t trust the compiler to be that smart and would write in_pixel = g_idata[globalTid]; instead.

  2. Point 1) doesn’t matter anyway, given the testing in the post I mentioned. Create a simple 1D texture and bind g_idata to it. Then access in_pixel with

in_pixel = tex1Dfetch(tex, globalTid);

and watch your performance nearly double.
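The whole texture setup is only a few lines (tex and imageSizeInBytes are placeholder names; this uses the standard texture reference API):

// File scope: texture reference for the packed RGBA input.
texture<uint4, 1, cudaReadModeElementType> tex;

// Host side, before the launch: bind the existing device buffer.
cudaBindTexture(0, tex, d_idata, imageSizeInBytes);

// In the kernel: a cached fetch replaces the 128-bit global load.
uint4 in_pixel = tex1Dfetch(tex, globalTid);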

Since the performance I got was sufficient for my purposes, I didn’t change the code.
I was hoping the compiler would make the best out of it ;)

Thank you very much for the tip with the 1D texture, Mister Anderson; the results are stunning!

I should have looked at your post before I started my implementation!

No problem, you’re welcome. I stumbled across the 128-bit coalesced read → tex1Dfetch optimization a while back (it boosted my code’s overall performance by 6%), and it is unintuitive enough that I’ve tried to share it on the forums whenever these issues come up.

I understand that when performance is as fast as you need, there is little reason to waste time optimizing, but lately I’ve been trying to squeeze every single percent of performance out of my code, so I can’t escape that mindset :) One particular unoptimized part of my code that was “fast enough” a few months ago with only 1% overhead now takes up almost 10%, so it actually is worth going back and optimizing it.