How to index last element of a row/column of an array selectively index specific elements of an arra

hello

I didn’t get many responses to my last post, so I figured I’d ask a simpler question.

Firstly, elements seem to correspond to threads in the world of CUDA, but I will say “elements” since I am not well versed in CUDA.

Suppose I have a 2-D array and I want to access the very last element of the first row and add its value to some other element (say the last element in the first column) of another array. Can anyone shed some light on how to do this in a CUDA kernel?

How would I address the elements/threads using blockIdx.x, blockDim.x, threadIdx.x, threadIdx.y?

Please help!

That isn’t the case at all. CUDA is just standard C. You can use just about any kind of data structures and access patterns you want in CUDA. There are efficiency and performance considerations which make certain similarities between storage layout and block/grid layout advantageous (as well as being relatively intuitive), but it certainly isn’t some kind of requirement or prerequisite.

If all you want to do is add one number to another number, why use the GPU at all? For it to make sense, you need a parallel problem.

Conceptually, think of the CUDA threads as a loop where the body of the loop is executed for all values simultaneously. Everything you’d do in that loop, you can do in a CUDA kernel. So, if on the CPU your code is

for (int x = 0; x < nx; ++x)
  for (int y = 0; y < ny; ++y) {
    // do some stuff with x and y
  }

in CUDA, you’d have a kernel that is launched for all combinations of x and y simultaneously:

__global__ void kernel() {
  int x = threadIdx.x;
  int y = threadIdx.y;
  // do some stuff with x and y
}

In the loop you are free to index whatever arrays you want in whatever way you want. You are similarly free to do whatever you want with the x and y values in the kernel. You just have to make sure that the threads don’t step on each other and that you’re not relying on the sequential execution of the loop.
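To connect this to the blockIdx/blockDim/threadIdx question, here is a minimal sketch of how the loop analogy might extend to an array that needs more than one block. The names scale2d and launchScale2d, the 16x16 block size, and the row-major layout (ny rows of nx floats) are assumptions made for illustration, not something from the posts above.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Hypothetical kernel: multiplies every element of an array with ny rows
// of nx floats (row-major) by factor. One thread handles one (x, y) pair,
// like one iteration of the nested CPU loop.
__global__ void scale2d(float *data, int nx, int ny, float factor) {
  // Global coordinates: block offset plus position inside the block.
  int x = blockIdx.x * blockDim.x + threadIdx.x;
  int y = blockIdx.y * blockDim.y + threadIdx.y;

  // The grid is rounded up to whole blocks, so some threads fall
  // outside the array and must do nothing.
  if (x < nx && y < ny) {
    data[y * nx + x] *= factor;
  }
}

// Hypothetical host helper: d_data must point to nx * ny floats on the device.
cudaError_t launchScale2d(float *d_data, int nx, int ny, float factor) {
  assert(nx > 0 && ny > 0);
  dim3 block(16, 16);  // assumed block size
  dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y);
  scale2d<<<grid, block>>>(d_data, nx, ny, factor);
  cudaError_t err = cudaGetLastError();  // launch-configuration errors
  if (err != cudaSuccess) return err;
  return cudaDeviceSynchronize();        // errors raised while the kernel ran
}
```

Using threadIdx alone, as in the kernel above, only covers a single block; adding blockIdx * blockDim gives each thread a unique coordinate across the whole grid.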

Then there are performance considerations. But first you need to wrap your head around the programming model.

Hope that helps,

/Patrik

Thanks for your message,

I kind of understand what you mean about CUDA threads executing the body of a C loop simultaneously. I have actually already written a working kernel that initializes the arrays in my algorithm by adding noise to each element. That was easy because there are abundant CUDA references on doing computations on all the elements of an array simultaneously, especially in order. But my problem persists.

Continuing from the OP, say I have two arrays, ArrayA and ArrayB, both the same size, and I simply want to add the last element of each row of ArrayA to the last element of each column of ArrayB. How would I write such a kernel? How can I write it so that the rest of the elements in the arrays are not touched at all, only the last elements of each row/column?

Please help!

btw wtf is up with the IPS drive error, isn’t this an official NVIDIA website…

I know this is a long time after your original post, but if you never got an answer, the following post (The Official NVIDIA Forums | NVIDIA) may be able to help you.

I had the same sort of issue as you and wasn’t sure how to go about indexing. I got several replies, but the last couple by avidday really helped clarify things for me (I am still confused about some stuff, but then again who isn’t).
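For the row/column question asked earlier in the thread, a minimal sketch of one possible kernel might look like this. It assumes both arrays are square (n by n) and stored row-major, so that row i of the first array can be paired with column i of the second; the names addRowEndsToColumnEnds and launchAddRowEndsToColumnEnds and the block size of 128 are made up for illustration.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Assumption: a and b are square n-by-n arrays stored row-major.
// Thread i reads the last element of row i of a and adds it to the
// last element of column i of b. No other element is read or written.
__global__ void addRowEndsToColumnEnds(const float *a, float *b, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    int lastInRowOfA = i * n + (n - 1);  // row i, last column
    int lastInColOfB = (n - 1) * n + i;  // last row, column i
    b[lastInColOfB] += a[lastInRowOfA];
  }
}

// Hypothetical host helper: d_a and d_b each hold n * n floats on the device.
cudaError_t launchAddRowEndsToColumnEnds(const float *d_a, float *d_b, int n) {
  assert(n > 0);
  int threads = 128;  // assumed block size
  int blocks = (n + threads - 1) / threads;
  addRowEndsToColumnEnds<<<blocks, threads>>>(d_a, d_b, n);
  cudaError_t err = cudaGetLastError();  // launch-configuration errors
  if (err != cudaSuccess) return err;
  return cudaDeviceSynchronize();        // errors raised while the kernel ran
}
```

Only n threads do any work, one per row/column pair, and each writes a different element of b, so the threads do not step on each other and the rest of both arrays stays untouched.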

Cheers

Andrew
