Help me please, I am a noob in CUDA and I have some problems

Hello everyone,
My problem is:

I would like to process a float** (a two-dimensional float array), but I have a problem with the kernel that I am working on.

__global__ void mykernel(float** data, int nbline, int nbcolumn)
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    if (idx < nbline)
    {
        for (int j = 0; j < nbcolumn; j++)
        {
            data[idx][j]++; // it's something like this, but simplified
        }
    }
}

You see, I would like to use a float** in the kernel, and to allocate the memory on the GPU I am using:

cudaMalloc((void **) &data_d, size_nmlne);
for (i = 0; i < nmlne; i++)
{
    cudaMalloc((void **) &data_d[i], size_nmpts);
}
for (i = 0; i < nmlne; i++)
{
    cudaMemcpy(data_d[i], data_h[i], size_nmpts, cudaMemcpyHostToDevice);
}

You see, I want to use this for a two-dimensional array, but something is wrong: my kernel seems to do nothing (the float** I retrieve afterwards is unchanged).

Please help me, and if this can't be done, tell me how to process a 2D array like this.
Many thanks

This is a regular NOOB problem.

cudaMalloc((void **) &data_d, size_nmlne);
for (i = 0; i < nmlne; i++)
{
    cudaMalloc((void **) &data_d[i], size_nmpts);
}

In the code above, inside the FOR loop, “data_d” is a GPU pointer… You cannot dereference it on the host, so you cannot pass “&data_d[i]” to “cudaMalloc”…

You need a CPU array to hold the row pointers, and then copy that array onto “data_d”…
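For example, a minimal sketch of that approach (untested; data_h, nmlne and size_nmpts are taken from your snippet, and h_ptrs is just a helper name I made up):

// Host-side array that will hold the device row pointers.
float **h_ptrs = (float **) malloc(nmlne * sizeof(float *));

// Allocate each row on the device and copy its data from the host.
for (int i = 0; i < nmlne; i++)
{
    cudaMalloc((void **) &h_ptrs[i], size_nmpts);
    cudaMemcpy(h_ptrs[i], data_h[i], size_nmpts, cudaMemcpyHostToDevice);
}

// Allocate the array of row pointers on the device and copy the pointers over.
float **data_d;
cudaMalloc((void **) &data_d, nmlne * sizeof(float *));
cudaMemcpy(data_d, h_ptrs, nmlne * sizeof(float *), cudaMemcpyHostToDevice);

// data_d can now be passed to the kernel as a float**.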

If you want to use a 2D array of M rows and N columns, you can instead allocate a 1D array of size M*N and access the elements as data[y*N + x], where x and y are the column and row indices respectively.
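A minimal sketch of that flat layout (untested; rows, cols and data_h are placeholder names I am assuming for your dimensions and contiguous host data):

// One contiguous device allocation of rows*cols floats.
float *data_d;
cudaMalloc((void **) &data_d, rows * cols * sizeof(float));

// If the host matrix is also a flat, contiguous float*, one copy suffices.
cudaMemcpy(data_d, data_h, rows * cols * sizeof(float), cudaMemcpyHostToDevice);

// In the kernel, element (y, x) is simply:
//     data_d[y * cols + x];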

N.

Thanks for your answers. I think I'll use the first solution, but could you be a little more explicit about the array of pointers on the host?
And let me point out that I don't have the same number of rows and columns.
Thanks a lot everybody

Hi,

you can find the solution in this post:
http://forums.nvidia.com/index.php?showtop…=0&p=560125

However, don’t be a noob… use the 2nd solution. With an array of arrays you force your GPU to perform 2 memory accesses per array element (one for the row pointer, one for the data), and you want to avoid that. The solution A[j*N + i] (oops - A[i*N + j] to be C compliant - the other way is Fortran column-major ordering! - edited) is the correct one, since there are also CUDA functions explicitly designed to ensure your 2D matrices are correctly aligned for coalescing.
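For example, a minimal sketch of those alignment helpers using cudaMallocPitch / cudaMemcpy2D (untested; rows, cols and data_h are placeholder names for your dimensions and a tightly packed host matrix):

// Allocate a 2D region where each row is padded to a pitch chosen by the driver,
// so that row starts are properly aligned for coalesced access.
float *data_d;
size_t pitch; // row stride in bytes, may be larger than cols * sizeof(float)
cudaMallocPitch((void **) &data_d, &pitch, cols * sizeof(float), rows);

// Copy the packed host matrix (row stride = cols * sizeof(float)) into the pitched allocation.
cudaMemcpy2D(data_d, pitch,
             data_h, cols * sizeof(float),
             cols * sizeof(float), rows,
             cudaMemcpyHostToDevice);

// In a kernel, element (y, x) is then:
//     float *row = (float *) ((char *) data_d + y * pitch);
//     float value = row[x];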

Thanks for your answers, but I have one more little problem.
I need a kernel that subtracts the mean value of each row of a matrix stored in 1D.
How can I do that?
My function is like this:

float avg = 0;
for (int i = 0; i < nRows; i++)
{
    avg = 0;
    for (int j = 0; j < nCols; j++) avg += matrix[i][j];
    avg /= nCols;
    for (int j = 0; j < nCols; j++) matrix[i][j] -= avg;
}

I see how to transform it for a 1D float* or int*, but I really don't know how to write it as a CUDA kernel.

Actually, I have not really understood the question… However I would write it like this (untested):

// one row per thread
int i = blockIdx.x*blockDim.x + threadIdx.x;
if (i >= nRows) return;

float avg = 0.0f;
float *mrow = &matrix[i*nCols];
// or, if matrix is your awful array of arrays:
// float *mrow = matrix[i];

for (int j = 0; j < nCols; j++) avg += mrow[j];
avg /= nCols;
for (int j = 0; j < nCols; j++) mrow[j] -= avg;

Note that in this way the threads' memory access pattern is very bad: if this is the core of your problem, I would optimize by transposing the problem (obviously, forget it if you are still working with float**): running the kernel on columns instead of rows will be much faster, allowing for coalesced memory accesses.

If this does not need to run optimized, just ignore it.
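For reference, a minimal sketch of how the fragment above could be wrapped into a full kernel and launched (untested; the kernel name removeRowMean, the block size of 256 and the device pointer matrix_d are my own choices, and it assumes the matrix already lives in device memory as nRows*nCols contiguous floats):

__global__ void removeRowMean(float *matrix, int nRows, int nCols)
{
    // one row per thread
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i >= nRows) return;

    float avg = 0.0f;
    float *mrow = &matrix[i*nCols];
    for (int j = 0; j < nCols; j++) avg += mrow[j];
    avg /= nCols;
    for (int j = 0; j < nCols; j++) mrow[j] -= avg;
}

// Host side: one thread per row, rounded up to whole blocks.
int threads = 256;
int blocks  = (nRows + threads - 1) / threads;
removeRowMean<<<blocks, threads>>>(matrix_d, nRows, nCols);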

OK thanks, I will explain what I want to do with a picture and the program that I made and which I want to port to CUDA.
-I have a matrix (in the form of a float**, but I already transformed it into a float*).
-It's a matrix because I have several vectors, each with a number of samples.
My matrix is float** frame, with nmvects rows (vectors) and nmsamples columns (samples per vector).

Here is my function

void my_function(float** frame)
{
    float avg = 0;
    for (int i = 0; i < nmvects; i++)
    {
        avg = 0;
        for (int j = 0; j < nmsamples; j++) avg += frame[i][j];
        avg /= nmsamples;
        for (int j = 0; j < nmsamples; j++) frame[i][j] -= avg;
    }
}

And I would like to process it with CUDA.
I think it’s clearer like that.

PS:
You can see the uploaded picture for a better understanding.

If you can give me some tips about the kernel for this process, that'll be nice.
[attachment: my_matrix.JPG]

OK, I see I guessed it right then. What I posted is more or less the kernel you need.

Transposed: put the data to be averaged along the columns:

// one column per thread
int j = blockIdx.x*blockDim.x + threadIdx.x;
if (j >= nCols) return;

float avg = 0.0f;

// now adjacent threads access adjacent data: coalesced!
float *mcol = &matrix[j];

for (int i = 0; i < nRows*nCols; i += nCols) avg += mcol[i];
avg /= nRows;
for (int i = 0; i < nRows*nCols; i += nCols) mcol[i] -= avg;

If the usage of the column pointer is not clear, here it is without… the compiler will probably generate identical code:

// one column per thread
int j = blockIdx.x*blockDim.x + threadIdx.x;
if (j >= nCols) return;

float avg = 0.0f;

// now adjacent threads access adjacent data: coalesced!
for (int i = 0; i < nRows; i++) avg += matrix[i*nCols + j];
avg /= nRows;
for (int i = 0; i < nRows; i++) matrix[i*nCols + j] -= avg;

The main problem I see in porting such a problem to CUDA is that it is not really compute intensive, and if the matrix values all come from the host, you will probably spend more time transferring data to the GPU than computing the results.

I say this because the algorithm is O(N) (the number of operations is linear in the number of data elements), and the compute part is really light.

So, even if you apply the "transpose trick", that is, you arrange the data for best usage by the GPU, the bottleneck will be your PCI Express bus.

However, try it anyway, smash your head against it :) : see for yourself what I am saying - or hopefully show where I am wrong :D - and you will get acquainted with the topic of GPU optimization: profiling, trying different strategies, comparing with the examples in the programming guide and SDK samples, disassembling PTX and so on.

Good luck!

I tested it and it seems to work, so thank you all, and now I can try something more difficult.

I have problems now because the kernel is running but not quickly enough.
I tried your answer but I didn't get good results.
Could someone help me with my kernel? I want to implement this function:

void Removemean(float* rfData)
{
    // nmpts is my number of columns
    // nmlne is my number of rows

    float avg = 0.0;
    for (int i = 0; i < nmlne; i++)
    {
        avg = 0.0;
        for (int j = 0; j < nmpts; j++) avg += rfData[i*nmpts + j];
        avg /= nmpts;
        for (int j = 0; j < nmpts; j++) rfData[i*nmpts + j] -= avg;
    }
}

I have a kernel that is working, but it is not fast enough.

How can I do that?
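For what it's worth, a minimal sketch of the direct one-line-per-thread translation of this function (untested; the kernel name removeMeanKernel, the block size of 256 and the device pointer rfData_d are my own choices, and it assumes the matrix has already been copied to device memory):

__global__ void removeMeanKernel(float *rfData, int nmlne, int nmpts)
{
    // one row (line) per thread, exactly mirroring the CPU loop
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i >= nmlne) return;

    float avg = 0.0f;
    for (int j = 0; j < nmpts; j++) avg += rfData[i*nmpts + j];
    avg /= nmpts;
    for (int j = 0; j < nmpts; j++) rfData[i*nmpts + j] -= avg;
}

// Host side launch: one thread per line.
int threads = 256;
int blocks  = (nmlne + threads - 1) / threads;
removeMeanKernel<<<blocks, threads>>>(rfData_d, nmlne, nmpts);

Note that, as discussed earlier in the thread, this access pattern is uncoalesced (thread i starts at rfData[i*nmpts] while thread i+1 starts nmpts elements away), which is likely why such a kernel feels slow; storing the data transposed so that the samples of one line run down a column lets adjacent threads read adjacent addresses.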