Hi! I am trying to use float4 and have run into an alignment problem. I am writing an SGEMM function that keeps its results in registers, and I want to write them back to global memory, like this:
float4_write(register, register, register, register, c) // here c is a pointer address
float4_write(register, register, register, register, c+5) // the line above works, but this one does not seem to
So I have two questions here:
I guess the lesson here is: once we use a vector4 read for locations 0, 1, 2, 3, we cannot do a vector4 read of 5, 6, 7, 8. We have to read 4, 5, 6, 7 or 8, 9, 10, 11. (Is this really correct, or am I misusing it and getting an error for that reason?)
Can I do this some other way (not float4, but still fast)?
Some algorithms only work for specific sizes, but I want to handle all sizes. When the number of columns of c is not a multiple of 4, this becomes a problem. Any suggestions for storing the results faster? Must I store them one by one? Like:
In a previous post, one very clever contributor, njuffa, suggested:
Where that is not easily possible, buffering in shared memory may help.
Float4 must read adjacent element? Can we modify it for coalesced reading? - #2 by njuffa
But I am not very sure how to do it in this specific case…
A read of a float4 that exceeds the bounds of the data object is non-critical. Since the load itself must be naturally aligned, it is either fully contained in a memory page owned by the user’s process, or not at all. So reading outside the bounds of the data object may pick up the contents of immediately adjacent data objects, but as long as that extraneous data is properly ignored by downstream code, nothing bad will happen.
The same analysis does not apply to writes. A write to a float4 that exceeds the bounds of the data object is likely to destroy the contents of an adjacent data object. You definitely do not want that. If your code is a general purpose library routine, you will have to add code to handle end cases for left-over slivers that are 1, 2, or 3 floats wide. If both called and calling code are 100% under your control, you could introduce a convention that all relevant data objects must be padded to a multiple of four floats.
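A minimal sketch of the two options described above, not a definitive implementation. The names `round_up4`, `store4`, and `write_rows` are hypothetical, and the values written by the kernel are stand-ins for the SGEMM accumulators. It assumes each thread owns four consecutive columns of one row of C and that C is stored row-major with leading dimension `ldc`.

```cuda
#include <cstddef>
#include <cstdint>
#include <cuda_runtime.h>

// Padding convention: round a column count up to the next multiple of 4.
__host__ __device__ inline int round_up4(int n) { return (n + 3) / 4 * 4; }

// Store up to four values at c. `valid` (1..4) says how many of them lie
// inside the row. A float4 store is used only when all four are valid and
// c is 16-byte aligned; otherwise the values are stored one by one.
__host__ __device__ inline void store4(float* c, float r0, float r1,
                                       float r2, float r3, int valid)
{
    bool aligned = reinterpret_cast<std::uintptr_t>(c) % sizeof(float4) == 0;
    if (valid == 4 && aligned) {
        *reinterpret_cast<float4*>(c) = make_float4(r0, r1, r2, r3);
    } else {
        if (valid > 0) c[0] = r0;
        if (valid > 1) c[1] = r1;
        if (valid > 2) c[2] = r2;
        if (valid > 3) c[3] = r3;
    }
}

// Hypothetical write-back step: one thread per group of 4 columns.
__global__ void write_rows(float* C, int rows, int cols, int ldc)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = 4 * (blockIdx.x * blockDim.x + threadIdx.x);
    if (row >= rows || col >= cols) return;

    // Stand-in values; a real SGEMM would pass its accumulator registers.
    float v = static_cast<float>(row * cols + col);

    // Left-over sliver at the end of a row: 1, 2, or 3 floats wide.
    int valid = (cols - col < 4) ? (cols - col) : 4;
    store4(C + static_cast<size_t>(row) * ldc + col,
           v, v + 1.0f, v + 2.0f, v + 3.0f, valid);
}
```

If C comes from cudaMalloc and `ldc = round_up4(cols)`, every row starts on a 16-byte boundary, so only the last group of each row can take the scalar path. With `ldc = cols` and `cols` not a multiple of 4, most rows start off a 16-byte boundary and their groups fall back to scalar stores.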
Yes, thank you!!!
Well, the real problem is that I cannot always read 4, then the next 4, and so on. I read 4, skip some length, and then read 4 again…
Before you get too deep into forcing the use of float4, you might want to research how much performance difference this is likely going to make. SGEMM tends to be compute limited.
If you are fairly new to writing code like SGEMM, I would suggest sticking to a straightforward implementation style that is readable and maintainable, allowing a fully functional implementation. Once you have progressed to ninja level, you can write SGEMM code that has the best possible performance but that nobody (you included, a couple of weeks after writing it) can read and understand at first sight, and that is a challenge to maintain when moving across new GPU architectures.
If you just want a fast SGEMM, use CUBLAS: Someone else already went through all the development, testing, and maintenance pain on your behalf.
Yes!! Well, I have some ideas and I am doing this for research purposes, so I think writing my own SGEMM is worthwhile.
Still, about my earlier question: if I read 4, skip some length, and then read 4 again, can I really not use float4 reads? (Or is my usage incorrect, and it can actually be done?)
(Maybe there is no faster way? Even if we write to shared memory first and then float4-write to c, that might waste even more time…)
You can only read a float4 if the source address is aligned to sizeof(float4) = 16 bytes. For pointers returned by cudaMalloc* this results in the observations you made, i.e. you can read 0,1,2,3; 4,5,6,7; 8,9,10,11; … but not 5,6,7,8.
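The rule can be checked in code before choosing between a vector and a scalar access. This is a rough sketch: `is_float4_aligned` and `load4` are hypothetical helper names, not part of the CUDA API, and the scalar fallback is just one possible way to handle the "read 4, skip some length, read 4" pattern.

```cuda
#include <cstdint>
#include <cuda_runtime.h>

// True when p may be used as the address of a float4 load or store,
// i.e. when it is a multiple of sizeof(float4) = 16 bytes.
__host__ __device__ inline bool is_float4_aligned(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % sizeof(float4) == 0;
}

// Load 4 consecutive floats starting at p. Uses one float4 load when p is
// 16-byte aligned and four scalar loads otherwise. The caller must make
// sure that p[0..3] are all inside the data object.
__host__ __device__ inline float4 load4(const float* p)
{
    if (is_float4_aligned(p)) {
        return *reinterpret_cast<const float4*>(p);
    }
    return make_float4(p[0], p[1], p[2], p[3]);
}
```

For a 16-byte aligned base pointer `c`, `is_float4_aligned(c + 4)` is true while `is_float4_aligned(c + 5)` is false, so `load4(c + 5)` takes the scalar path. If the skip length between reads is itself a multiple of 4 floats, every read stays on the vector path.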
Thank you very much!!! Yes, indeed I can not…