For this code, the number of blocks must always be equal to the number of columns. In this model, 1 block processes 1 column. If you have 6000 columns, you need to launch 6000 blocks. If you foresee using more than 65535 columns, then you will need to modify the kernel to use a 2D grid, which would allow you up to 65535 × 65535 = 4294836225 columns. If you ever had more columns than that, you would have to split the matrix up and process it in sections. That would also be the way to parallelise the problem over multiple GPUs.
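As a hedged sketch of that one-block-per-column model (the kernel name and argument order follow this thread, but the body here is an assumption, not the actual code): A is column-major, lda is the column height (A.h), m is the total element count (A.h * A.w), and blockDim.x is assumed to be a power of two no larger than 512.

```cuda
// Sketch only: one block sums one column of a column-major matrix.
__global__ void columnsumreduction(int m, int lda, const float *A, float *sums)
{
    __shared__ float buf[512];            // assumes blockDim.x <= 512
    int col = blockIdx.x;                 // one block per column
    float acc = 0.0f;
    // Each thread strides down its block's column...
    for (int row = threadIdx.x; row < lda && col * lda + row < m; row += blockDim.x)
        acc += A[col * lda + row];
    buf[threadIdx.x] = acc;
    __syncthreads();
    // ...then the block reduces the partial sums in shared memory
    // (requires blockDim.x to be a power of two).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            buf[threadIdx.x] += buf[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        sums[col] = buf[0];               // one result per column
}
```

Launched with one block per column, e.g. columnsumreduction<<<numColumns, 512>>>(m, lda, d_A, d_sums);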
I have no intention of using more than max numBlocks columns, so that is not a problem :)
If there is no other reason for the slightly different results, I shall leave it as is, although the change does seem strange, considering that sequential sums do give the correct result.
All I can suggest about the differences in results is to try coding a small test outside of MATLAB using normalised inputs. You should get within 1 digit in the last place, I would have thought. Representation limitations should not come into play for 6000 numbers between 0 and 1, and the code is straight addition, where multiply-add fusion and intermediate-result rounding cannot influence the results.
It is your project, your verification process, and your choice. Just for the record, I hereby declare that code was provided on an as-is basis, with no warranty, guarantees, or expressions of fitness for purpose, implied or otherwise. You use it at your own risk.
I compiled and ran, but just got a load of zeros… ;*(
I called it using:
void sumCols(Matrix C, Matrix A)
{
    int numBlocks = A.w;
    int numThreads = 512;
    const int m = A.h * A.w;
    const int lda = A.h;
    columnsumreduction<<<numThreads,numBlocks>>>(m, lda, A.data, C.data);
}
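One likely cause of the zeros, for what it's worth: in CUDA the execution configuration is <<<gridDim, blockDim>>>, i.e. blocks first, threads second, so the arguments in that launch look swapped. As written, it asks for numThreads blocks of numBlocks (e.g. 6000) threads each, which exceeds the per-block thread limit, so the kernel never launches and the output buffer is left untouched. A corrected call (same names as above) would be:

```cuda
// <<<grid, block>>>: number of blocks first, threads per block second.
columnsumreduction<<<numBlocks, numThreads>>>(m, lda, A.data, C.data);
```

Checking the return value of cudaGetLastError() after the launch would have flagged this as an invalid configuration rather than failing silently.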