Hi! I am trying to use float4 and have run into an alignment problem. I am writing an SGEMM function that keeps its results in registers, and I want to write them back to global memory, like this:
float4_write(register[0], register[1], register[2], register[3], c) // here c is a pointer address
float4_write(register[0], register[1], register[2], register[3], c+5) // the line above works, but this one does not seem to
So I have two questions here:
1. I guess the lesson here is: if we use a float4 read for locations 0, 1, 2, 3, then we cannot use a float4 read for 5, 6, 7, 8. We have to read 4, 5, 6, 7 or 8, 9, 10, 11. (Is this really correct, or am I misusing it and getting the error for that reason?)
2. Is there another way to handle this (something other than float4, but still fast)?
Some algorithms only work for specific sizes, but I want to support all sizes. When the number of columns of c is not a multiple of 4, there is a problem. Any suggestions for storing the results faster? Must I store them one by one? Like:
Reading a float4 that extends past the end of the data object is not critical. Since the load itself must be naturally aligned, it cannot straddle a page boundary: it lies either entirely inside a memory page owned by the user’s process or entirely outside it. So a load that contains at least one valid element may pick up the contents of immediately adjacent data objects, but as long as that extraneous data is properly ignored by downstream code, nothing bad will happen.
The same analysis does not apply to writes. A write to a float4 that exceeds the bounds of the data object is likely to destroy the contents of an adjacent data object. You definitely do not want that. If your code is a general purpose library routine, you will have to add code to handle end cases for left-over slivers that are 1, 2, or 3 floats wide. If both called and calling code are 100% under your control, you could introduce a convention that all relevant data objects must be padded to a multiple of four floats.
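To make the end-case handling for writes concrete, here is a minimal sketch, not a tuned SGEMM epilogue. It assumes a row-major c with leading dimension equal to n_cols, and all names (store4_guarded, write_tiles, run_guarded_store_demo) are hypothetical. Each thread owns four consecutive columns and uses a float4 store only when the whole vector is in bounds and the destination is 16-byte aligned; otherwise it falls back to scalar stores for the 1, 2, or 3 left-over floats.

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: store acc[0..3] to row[col..col+3] without ever
// writing past column n_cols - 1.
__device__ void store4_guarded(float* row, int col, int n_cols, const float* acc)
{
    float* dst = row + col;
    const bool aligned = reinterpret_cast<std::uintptr_t>(dst) % sizeof(float4) == 0;
    if (col + 4 <= n_cols && aligned) {
        // Fast path: one 16-byte store.
        *reinterpret_cast<float4*>(dst) = make_float4(acc[0], acc[1], acc[2], acc[3]);
    } else {
        // Slow path: scalar stores, clipped at the end of the row.
        for (int i = 0; i < 4 && col + i < n_cols; ++i) dst[i] = acc[i];
    }
}

// Each thread writes four consecutive columns of one row (blockIdx.y = row).
__global__ void write_tiles(float* c, int n_rows, int n_cols)
{
    const int row = blockIdx.y;
    const int col = 4 * (blockIdx.x * blockDim.x + threadIdx.x);
    if (row >= n_rows || col >= n_cols) return;
    float acc[4];
    // Stand-in for real SGEMM results: the linear index of each element.
    for (int i = 0; i < 4; ++i) acc[i] = static_cast<float>(row * n_cols + col + i);
    store4_guarded(c + static_cast<std::size_t>(row) * n_cols, col, n_cols, acc);
}

// Host-side check: fills the buffer plus 4 guard floats with -1, runs the
// kernel, and verifies that every element is written and no guard is touched.
bool run_guarded_store_demo(int n_rows, int n_cols)
{
    const std::size_t n = static_cast<std::size_t>(n_rows) * n_cols;
    const std::size_t guard = 4;
    std::vector<float> host(n + guard, -1.0f);
    float* dev = nullptr;
    if (cudaMalloc(&dev, (n + guard) * sizeof(float)) != cudaSuccess) return false;
    cudaMemcpy(dev, host.data(), (n + guard) * sizeof(float), cudaMemcpyHostToDevice);
    const int threads = 64;
    const dim3 grid((n_cols + 4 * threads - 1) / (4 * threads), n_rows);
    write_tiles<<<grid, threads>>>(dev, n_rows, n_cols);
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    cudaMemcpy(host.data(), dev, (n + guard) * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (err != cudaSuccess) return false;
    for (std::size_t i = 0; i < n; ++i)
        if (host[i] != static_cast<float>(i)) return false;
    for (std::size_t i = n; i < n + guard; ++i)
        if (host[i] != -1.0f) return false;
    return true;
}
```

Calling run_guarded_store_demo(3, 7) exercises both paths: row 0 starts aligned and ends with a 3-float sliver, while rows 1 and 2 start at offsets 7 and 14, so none of their float4 candidates are aligned and they go through the scalar path. Padding the leading dimension to a multiple of four, as suggested above, would keep every row on the fast path.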
Before you get too deep into forcing the use of float4, you might want to research how much performance difference this is likely going to make. SGEMM tends to be compute limited.
If you are fairly new to writing code like SGEMM, I would suggest sticking to a straightforward implementation style that is readable and maintainable, allowing a fully functional implementation. Once you have progressed to ninja level, you can write SGEMM code that has the best possible performance but that nobody (you included, a couple of weeks after writing it) can read and understand at first sight, and that is a challenge to maintain when moving across new GPU architectures.
If you just want a fast SGEMM, use CUBLAS: Someone else already went through all the development, testing, and maintenance pain on your behalf.
Yes!! Well, I have some ideas and I am doing this for research purposes, so I think writing my own SGEMM is worthwhile.
Still, regarding my earlier question: if I do a float4 read, skip some elements, and then do another float4 read, is it really impossible to use float4 for the second read? (Or is my usage incorrect, and it can actually be done?)
(Maybe there is no other faster way…? Even if we write to shared memory first and then float4-write to c, that would waste even more time…)
You can only read a float4 if the source address is aligned to sizeof(float4) = 16 bytes. Pointers returned by cudaMalloc* are aligned to at least 256 bytes, so for them this results in the observations you made, i.e., you can read 0,1,2,3 ; 4,5,6,7 ; 8,9,10,11 ; … but not 5,6,7,8
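The alignment rule can be checked directly on the pointer value. The following is a small sketch under the assumption that the base pointer comes from cudaMalloc; the helper names (is_float4_aligned, scalars_until_aligned, offsets_behave_as_described) are made up for illustration. The second helper reports how many leading floats would have to be handled with scalar accesses before float4 accesses become legal again, which is one possible way to deal with an offset like c+5.

```cuda
#include <cuda_runtime.h>
#include <cstdint>

// True if p may be reinterpreted as a float4* for a vector load or store.
__host__ __device__ inline bool is_float4_aligned(const float* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % sizeof(float4) == 0;
}

// Number of floats (0..3) to process one by one before p reaches the next
// 16-byte boundary. Assumes p is at least float-aligned.
__host__ __device__ inline int scalars_until_aligned(const float* p)
{
    const std::uintptr_t idx = reinterpret_cast<std::uintptr_t>(p) / sizeof(float);
    return static_cast<int>((4 - idx % 4) % 4);
}

// Host-side check of the offsets discussed in this thread. Only pointer
// values are inspected; device memory is never dereferenced on the host.
bool offsets_behave_as_described()
{
    float* c = nullptr;
    if (cudaMalloc(&c, 16 * sizeof(float)) != cudaSuccess) return false;
    const bool ok = is_float4_aligned(c)          // elements 0,1,2,3
                 && is_float4_aligned(c + 4)      // elements 4,5,6,7
                 && is_float4_aligned(c + 8)      // elements 8,9,10,11
                 && !is_float4_aligned(c + 5)     // elements 5,6,7,8: not allowed
                 && scalars_until_aligned(c + 5) == 3   // 5,6,7 scalar, then 8 is aligned
                 && scalars_until_aligned(c + 8) == 0;
    cudaFree(c);
    return ok;
}
```

Whether peeling off leading scalars like this pays off in an SGEMM epilogue depends on how the results are laid out in registers, so treat it as something to measure rather than a guaranteed win.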