If you want to use a 2D array of M rows and N columns, you can allocate a 1D array of size M*N and access the elements as data[y*N + x], where x and y are the column and row indices respectively.
Thanks for your answers. I think I'll use the first solution, but could you be a little more explicit about the arrays of pointers on the host?
And let me remind you that my number of rows and columns are not the same.
Thanks a lot everybody
However, don't be a noob… use the 2nd solution. With an array of arrays you force your GPU to perform 2 memory accesses per array element - and you want to avoid that. The solution A[j*N + i] (oops - A[i*N + j] to be C compliant - the other way is the Fortran column-major ordering! - edited) is the correct one, since there are also CUDA functions explicitly designed to ensure your 2D matrices are correctly aligned for coalescing.
Thanks for your answers but I have just a little problem again,
I need to write a kernel that subtracts the mean values of a matrix stored as a 1D array;
How can I do that?
My function is like this:
Actually, I have not really understood the question… However I would write it like this (untested):
// one row per thread
int i = blockIdx.x*blockDim.x + threadIdx.x;
if (i >= nRows) return;
float avg = 0.0f;
float *mrow = &matrix[i*nCols];
// or, if matrix is your awful array of arrays:
// float *mrow = matrix[i];
for (int j = 0; j < nCols; j++) avg += mrow[j];
avg /= nCols;
for (int j = 0; j < nCols; j++) mrow[j] -= avg;
Note that this way the threads work very badly with memory accesses: if this is the core of your problem I would optimize by transposing the problem (obviously, forget it if you are still working with float**): running the kernel over columns instead of rows will be much faster, allowing coalesced memory accesses.
If this does not need to run optimized, just ignore it.
Ok thanks, I will explain what I want to do with a picture and the program that I made, which I want to port to CUDA.
-I have a matrix (in the form of a float**, but I already transformed it into a float*)
-It's a matrix because I have several vectors, each with a number of samples.
my matrix is float** frame
->nmsamples
->nmvects
OK, I see I guessed it right then. What I posted is more or less the kernel you need.
Transposed: put the data to average on columns:
// one column per thread
int j = blockIdx.x*blockDim.x + threadIdx.x;
if (j >= nCols) return;
float avg = 0.0f;
// now each thread's data sits adjacent to its neighbours': coalesced!
float *mcol = &matrix[j];
for (int i = 0; i < nRows; i++) avg += mcol[i*nCols];
avg /= nRows;
for (int i = 0; i < nRows; i++) mcol[i*nCols] -= avg;
If the usage of the column pointer is not clear, here it is without it… the compiler will probably generate identical code:
// one column per thread
int j = blockIdx.x*blockDim.x + threadIdx.x;
if (j >= nCols) return;
float avg = 0.0f;
// now each thread's data sits adjacent to its neighbours': coalesced!
for (int i = 0; i < nRows; i++) avg += matrix[i*nCols + j];
avg /= nRows;
for (int i = 0; i < nRows; i++) matrix[i*nCols + j] -= avg;
The main problem I see in porting such a problem to CUDA is that it is not really compute intensive: if the matrix values all come from the host, you will probably spend more time transferring data to the GPU than computing results.
I say this because the algorithm is O(N) (the number of operations grows linearly with the number of data elements) - and the compute part is really light.
So even if you apply the "transpose trick", that is, you arrange the data for best usage by the GPU, the bottleneck will be your PCI Express bus.
However, try it anyway, smash your head against it :) : see for yourself what I am saying - or hopefully show where I am wrong :D - and you will get acquainted with the topic of GPU optimization: profiling, trying different strategies, comparing against the programming guide and the SDK samples, disassembling PTX, and so on.
I have problems now because the kernel runs but is not quick enough;
I tried your answer but I didn't get good results.
Could someone help me with my kernel, I want to do this function:
void Removemean(float* rfData)
{
//nmpts is my number of columns
//nmlne is my number of rows