FLOAT4 address align problem

202476410arsmart · May 7, 2022, 8:12am

Hi! I am trying to use float4 and have run into an alignment problem. I am writing an SGEMM function that keeps its results in registers, and I want to write them back to global memory, like this:

float4_write(register[0], register[1], register[2], register[3], c);    // here c is a pointer (address in global memory)
float4_write(register[0], register[1], register[2], register[3], c+5);  // the line above works, but this one does not seem to

So I have two questions here:

1. I guess the lesson here is that once we use a vector (float4) read for locations 0, 1, 2, 3, we cannot do a vector read of 5, 6, 7, 8. We have to read 4, 5, 6, 7 or 8, 9, 10, 11. Is this really correct, or am I misusing it and getting the error for some other reason?
2. Is there another way to handle this (other than float4, but still fast)?

Some algorithms only work for specific sizes, but I want to support all sizes. When the number of columns of c is not a multiple of 4, this becomes a problem. Any suggestions for storing the results faster? Must I store them one by one, like this:

write_in(register[0], c);
write_in(register[1], c+1);
write_in(register[2], c+2);
write_in(register[3], c+3);
write_in(register[4], c+4);

In a previous post, one very clever contributor, njuffa, suggested:

Where that is not easily possible, buffering in shared memory may help.
Float4 must read adjacent element? Can we modify it for coalesced reading? - #2 by njuffa

But I am not very sure how to do it in this specific case…

Thank you!!!

njuffa · May 7, 2022, 8:45am

Reading a float4 that exceeds the bound of the data object is non-critical. Since the load itself must be naturally aligned, it is either fully contained in a memory page owned by the user’s process, or not at all. So reading outside the bounds of the data object may pick up the contents of immediately adjacent data objects, but as long as that extraneous data is properly ignored by downstream code, nothing bad will happen.

The same analysis does not apply to writes. A write to a float4 that exceeds the bounds of the data object is likely to destroy the contents of an adjacent data object. You definitely do not want that. If your code is a general purpose library routine, you will have to add code to handle end cases for left-over slivers that are 1, 2, or 3 floats wide. If both called and calling code are 100% under your control, you could introduce a convention that all relevant data objects must be padded to a multiple of four floats.
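As an illustration of the end-case handling described above, here is a minimal host-side C++ sketch (not CUDA kernel code). The names `store_row` and `padded_length` are hypothetical, and the 16-byte `std::memcpy` only stands in for the single float4 store a kernel would issue at that point.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: copy n floats from src to dst without ever writing
// outside dst[0..n). Leading floats at unaligned addresses and the 0-3 float
// tail are written one by one; only 16-byte-aligned groups of four take the
// wide path.
inline void store_row(float* dst, const float* src, std::size_t n) {
    assert(dst != nullptr && src != nullptr);
    std::size_t i = 0;
    // Head: scalar stores until dst + i is 16-byte aligned.
    while (i < n && reinterpret_cast<std::uintptr_t>(dst + i) % 16 != 0) {
        dst[i] = src[i];
        ++i;
    }
    // Body: groups of four floats at 16-byte-aligned addresses.
    // In a kernel, this is where one float4 store would go.
    for (; i + 4 <= n; i += 4) {
        std::memcpy(dst + i, src + i, 4 * sizeof(float));
    }
    // Tail: the left-over sliver of 0-3 floats.
    for (; i < n; ++i) {
        dst[i] = src[i];
    }
}

// Padding convention: round a length up to the next multiple of four floats.
inline std::size_t padded_length(std::size_t n) {
    return (n + 3) / 4 * 4;
}
```

With a 16-byte-aligned `c`, calling `store_row(c + 5, regs, 7)` writes three floats singly (indices 5, 6, 7), then one aligned group of four (indices 8 to 11), and has no tail. Under the padding convention, a row of 5 floats would be allocated as `padded_length(5)`, i.e. 8 floats.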

202476410arsmart · May 7, 2022, 8:47am

Yes, thank you!!!
Well, the real problem is that I cannot always read 4, then the next 4, and so on. I read 4, skip some length, and then read 4 again…

njuffa · May 7, 2022, 8:56am

Before you get too deep into forcing the use of float4, you might want to research how much performance difference this is likely going to make. SGEMM tends to be compute limited.

If you are fairly new to writing code like SGEMM, I would suggest sticking to a straightforward implementation style that is readable and maintainable, allowing a fully functional implementation. Once you have progressed to ninja level, you can write SGEMM code that has the best possible performance but that nobody (you included, a couple of weeks after writing it) can read and understand at first sight, and that is a challenge to maintain when moving across new GPU architectures.

If you just want a fast SGEMM, use CUBLAS: Someone else already went through all the development, testing, and maintenance pain on your behalf.

202476410arsmart · May 7, 2022, 9:01am

Yes!! Well, I have an idea and I am doing this for research purposes, so I think writing my own SGEMM is worthwhile.

Still, about my earlier question: if I read 4, skip some length, and then read 4 again, can I really not use float4 reads? (Or is my usage incorrect, and it can actually work?)

(Maybe there is no faster way…? Even if we write to shared memory first and then float4-write to c, that may waste even more time…)

striker159 · May 8, 2022, 6:56am

You can only read a float4 if the source address is aligned to sizeof(float4) = 16 bytes. For pointers returned by cudaMalloc*, this results in the observations you made, i.e., you can read 0,1,2,3 ; 4,5,6,7 ; 8,9,10,11 ; … but not 5,6,7,8.
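The rule can be checked with a little pointer arithmetic. The following is a host-side C++ sketch, not kernel code, and the helper names are made up for illustration. It also covers the "read 4, skip some length, read 4" pattern: if each row starts at `base + r * ld`, the row starts stay aligned only when the skip `ld` is itself a multiple of 4 floats.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// True if p may be used as the address of a 16-byte (float4-sized) access.
inline bool is_float4_aligned(const float* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Number of floats to step forward from p to reach the next 16-byte-aligned
// address. Assumes p is at least 4-byte aligned, as any valid float* is.
inline std::size_t floats_until_aligned(const float* p) {
    std::uintptr_t misalign = reinterpret_cast<std::uintptr_t>(p) % 16;
    return misalign == 0 ? 0 : (16 - misalign) / sizeof(float);
}

// For rows starting at base + r * ld, the start of every row r is
// float4-aligned exactly when base is aligned and ld is a multiple of 4.
inline bool all_rows_float4_aligned(const float* base, std::size_t ld) {
    return is_float4_aligned(base) && ld % 4 == 0;
}
```

For a 16-byte-aligned `c`, `is_float4_aligned(c + 5)` is false and `floats_until_aligned(c + 5)` is 3, so the next usable float4 address is `c + 8`. A leading dimension of 8 keeps every row aligned; a leading dimension of 5 does not.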

202476410arsmart · May 11, 2022, 3:11am

Thank you very much!!! Yes, indeed I cannot…

system · May 25, 2022, 3:11am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
