Help me please, I am a noob in CUDA and I have some problems

Hello everyone,
My problem is:

I would like to process a float** (a two-dimensional float array), but I have a problem with the kernel that I am working on.

__global__ void mykernel(float** data, int nbline, int nbcolumn)
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    if (idx < nbline)
    {
        for (int j = 0; j < nbcolumn; j++)
        {
            data[idx][j]++; // it's something like this, but simplified
        }
    }
}

You see, I would like to use a float** in the kernel, and to allocate the memory on the GPU I am using:

cudaMalloc((void **) &data_d, size_nmlne);
for (i = 0; i < nmlne; i++)
{
    cudaMalloc((void **) &data_d[i], size_nmpts);
}
for (i = 0; i < nmlne; i++)
{
    cudaMemcpy(data_d[i], data_h[i], size_nmpts, cudaMemcpyHostToDevice);
}

You see, I want to use this for a two-dimensional array, but something is wrong: my kernel seems to do nothing (the float** I retrieve afterwards is unchanged).

Please help me, and if this can't be done, tell me how to process a 2D array like this.
Many thanks

This is a regular NOOB problem.

cudaMalloc((void **) &data_d, size_nmlne);
for (i = 0; i < nmlne; i++)
{
    cudaMalloc((void **) &data_d[i], size_nmpts);
}

In the code above, inside the FOR loop, “data_d” is a GPU pointer… You cannot dereference it on the host, so you cannot pass “&data_d[i]” to “cudaMalloc”…

You need a CPU array to hold the row pointers, and then copy that array onto “data_d”…
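For example, a minimal sketch of that approach (untested; data_h, nmlne and size_nmpts are taken from your snippet, and h_ptrs is just a helper name I made up):

// Host-side array that will hold the device row pointers.
float **h_ptrs = (float **) malloc(nmlne * sizeof(float *));

// Allocate each row on the device and copy its data from the host.
for (int i = 0; i < nmlne; i++)
{
    cudaMalloc((void **) &h_ptrs[i], size_nmpts);
    cudaMemcpy(h_ptrs[i], data_h[i], size_nmpts, cudaMemcpyHostToDevice);
}

// Allocate the array of row pointers on the device and copy the pointers over.
float **data_d;
cudaMalloc((void **) &data_d, nmlne * sizeof(float *));
cudaMemcpy(data_d, h_ptrs, nmlne * sizeof(float *), cudaMemcpyHostToDevice);

// data_d can now be passed to the kernel as a float**.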

If you want to use a 2D array of M rows and N columns, you can instead allocate a 1D array of size M*N and access the elements as data[y*N + x], where x and y are the column and row indices respectively.
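A minimal sketch of that flat layout (untested; rows, cols and data_h are placeholder names I am assuming for your dimensions and contiguous host data):

// One contiguous device allocation of rows*cols floats.
float *data_d;
cudaMalloc((void **) &data_d, rows * cols * sizeof(float));

// If the host matrix is also a flat, contiguous float*, one copy suffices.
cudaMemcpy(data_d, data_h, rows * cols * sizeof(float), cudaMemcpyHostToDevice);

// In the kernel, element (y, x) is simply:
//     data_d[y * cols + x];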

N.

Thanks for your answers. I think I'll use the first solution, but could you be a little more explicit about the array of pointers on the host?
And let me point out that I don't have the same number of rows and columns.
Thanks a lot everybody

Hi,

you can find the solution in this post:
http://forums.nvidia.com/index.php?showtop…=0&p=560125

However, don’t be a noob… use the 2nd solution. With an array of arrays you force your GPU to perform 2 memory accesses per array element (one for the row pointer, one for the data), and you want to avoid that. The solution A[j*N + i] (oops - A[i*N + j] to be C compliant - the other way is Fortran column-major ordering! - edited) is the correct one, since there are also CUDA functions explicitly designed to ensure your 2D matrices are correctly aligned for coalescing.
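For example, a minimal sketch of those alignment helpers using cudaMallocPitch / cudaMemcpy2D (untested; rows, cols and data_h are placeholder names for your dimensions and a tightly packed host matrix):

// Allocate a 2D region where each row is padded to a pitch chosen by the driver,
// so that row starts are properly aligned for coalesced access.
float *data_d;
size_t pitch; // row stride in bytes, may be larger than cols * sizeof(float)
cudaMallocPitch((void **) &data_d, &pitch, cols * sizeof(float), rows);

// Copy the packed host matrix (row stride = cols * sizeof(float)) into the pitched allocation.
cudaMemcpy2D(data_d, pitch,
             data_h, cols * sizeof(float),
             cols * sizeof(float), rows,
             cudaMemcpyHostToDevice);

// In a kernel, element (y, x) is then:
//     float *row = (float *) ((char *) data_d + y * pitch);
//     float value = row[x];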

Thanks for your answers, but I have one more little problem.
I need a kernel that subtracts the mean value of each row of a matrix stored in 1D.
How can I do that?
My function is like this:

float avg = 0;
for (int i = 0; i < nRows; i++)
{
    avg = 0;
    for (int j = 0; j < nCols; j++) avg += matrix[i][j];
    avg /= nCols;
    for (int j = 0; j < nCols; j++) matrix[i][j] -= avg;
}

I see how to transform it for a 1D float* or int*, but I really don't know how to write it as a CUDA kernel.

Actually, I have not really understood the question… However I would write it like this (untested):

// one row per thread
int i = blockIdx.x*blockDim.x + threadIdx.x;
if (i >= nRows) return;

float avg = 0.0f;
float *mrow = &matrix[i*nCols];
// or, if matrix is your awful array of arrays:
// float *mrow = matrix[i];

for (int j = 0; j < nCols; j++) avg += mrow[j];
avg /= nCols;
for (int j = 0; j < nCols; j++) mrow[j] -= avg;

Note that in this way the threads' memory access pattern is very bad: if this is the core of your problem, I would optimize by transposing the problem (obviously, forget it if you are still working with float**): running the kernel on columns instead of rows will be much faster, allowing for coalesced memory accesses.

If this does not need to run optimized, just ignore it.
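For reference, a minimal sketch of how the fragment above could be wrapped into a full kernel and launched (untested; the kernel name removeRowMean, the block size of 256 and the device pointer matrix_d are my own choices, and it assumes the matrix already lives in device memory as nRows*nCols contiguous floats):

__global__ void removeRowMean(float *matrix, int nRows, int nCols)
{
    // one row per thread
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i >= nRows) return;

    float avg = 0.0f;
    float *mrow = &matrix[i*nCols];
    for (int j = 0; j < nCols; j++) avg += mrow[j];
    avg /= nCols;
    for (int j = 0; j < nCols; j++) mrow[j] -= avg;
}

// Host side: one thread per row, rounded up to whole blocks.
int threads = 256;
int blocks  = (nRows + threads - 1) / threads;
removeRowMean<<<blocks, threads>>>(matrix_d, nRows, nCols);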

OK thanks, I will explain what I want to do with a picture and the program that I made and which I want to port to CUDA.
-I have a matrix (in the form of a float**, but I already transformed it into a float*).
-It's a matrix because I have several vectors, each with a number of samples.
My matrix is float** frame, with nmvects rows (vectors) and nmsamples columns (samples per vector).

Here is my function

void my_function(float** frame)
{
    float avg = 0;
    for (int i = 0; i < nmvects; i++)
    {
        avg = 0;
        for (int j = 0; j < nmsamples; j++) avg += frame[i][j];
        avg /= nmsamples;
        for (int j = 0; j < nmsamples; j++) frame[i][j] -= avg;
    }
}

And I would like to process it with CUDA.
I think it’s clearer like that.

PS:
You can see the uploaded picture for a better understanding.

If you can give me some tips about the kernel for this process, that'll be nice.
[attachment: my_matrix.JPG]

OK, I see I guessed it right then. What I posted is more or less the kernel you need.

Transposed: put the data to be averaged along the columns:

// one column per thread
int j = blockIdx.x*blockDim.x + threadIdx.x;
if (j >= nCols) return;

float avg = 0.0f;

// now adjacent threads access adjacent data: coalesced!
float *mcol = &matrix[j];

for (int i = 0; i < nRows*nCols; i += nCols) avg += mcol[i];
avg /= nRows;
for (int i = 0; i < nRows*nCols; i += nCols) mcol[i] -= avg;

If the usage of the column pointer is not clear, here it is without… the compiler will probably generate identical code:

// one column per thread
int j = blockIdx.x*blockDim.x + threadIdx.x;
if (j >= nCols) return;

float avg = 0.0f;

// now adjacent threads access adjacent data: coalesced!
for (int i = 0; i < nRows; i++) avg += matrix[i*nCols + j];
avg /= nRows;
for (int i = 0; i < nRows; i++) matrix[i*nCols + j] -= avg;

The main problem I see in porting such a problem to CUDA is that it is not really compute intensive, and if the matrix values all come from the host, you will probably spend more time transferring data to the GPU than computing the results.

I say this because the algorithm is O(N) (the number of operations is linear in the number of data elements), and the compute part is really light.

So, even if you apply the "transpose trick", that is, you arrange the data for best usage by the GPU, the bottleneck will be your PCI Express bus.

However, try it anyway, smash your head against it :) : see for yourself what I am saying - or hopefully show where I am wrong :D - and you will get acquainted with the topic of GPU optimization: profiling, trying different strategies, comparing with the examples in the programming guide and SDK samples, disassembling PTX and so on.

Good luck!

I tested it and it seems to work, so thank you all, and now I can try something more difficult.

I have problems now because the kernel is running but not quickly enough.
I tried your answer but I didn't get good results.
Could someone help me with my kernel? I want to implement this function:

void Removemean(float* rfData)
{
    // nmpts is my number of columns
    // nmlne is my number of rows

    float avg = 0.0;
    for (int i = 0; i < nmlne; i++)
    {
        avg = 0.0;
        for (int j = 0; j < nmpts; j++) avg += rfData[i*nmpts + j];
        avg /= nmpts;
        for (int j = 0; j < nmpts; j++) rfData[i*nmpts + j] -= avg;
    }
}

I have a kernel that is working, but it is not fast enough.

How can I do that?
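For what it's worth, a minimal sketch of the direct one-line-per-thread translation of this function (untested; the kernel name removeMeanKernel, the block size of 256 and the device pointer rfData_d are my own choices, and it assumes the matrix has already been copied to device memory):

__global__ void removeMeanKernel(float *rfData, int nmlne, int nmpts)
{
    // one row (line) per thread, exactly mirroring the CPU loop
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i >= nmlne) return;

    float avg = 0.0f;
    for (int j = 0; j < nmpts; j++) avg += rfData[i*nmpts + j];
    avg /= nmpts;
    for (int j = 0; j < nmpts; j++) rfData[i*nmpts + j] -= avg;
}

// Host side launch: one thread per line.
int threads = 256;
int blocks  = (nmlne + threads - 1) / threads;
removeMeanKernel<<<blocks, threads>>>(rfData_d, nmlne, nmpts);

Note that, as discussed earlier in the thread, this access pattern is uncoalesced (thread i starts at rfData[i*nmpts] while thread i+1 starts nmpts elements away), which is likely why such a kernel feels slow; storing the data transposed so that the samples of one line run down a column lets adjacent threads read adjacent addresses.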