[beginer] Different arrays size indexing

AndrewCSIRO · November 6, 2010, 3:49am

Hi,

new to CUDA. Trying to port C code to CUDA.

Problem: Take two input arrays of size A512,126,100 and B4,126,100 and calculate the following in the kernel,
D[x,y,z]=A[x,y,z]-B[1,y,z]

(EDIT: I have made a mistake here the A array should actually be A[x,y] only. I really liked avidday’ response so i am leaving the original A array I posted so that others can see avidday was answering my OP.
The correct D should read as;
D[x,y,z]=A[x,y]-B[1,y,z])

E=B[2,y,z]/D[x,y,z]

C[x,y,z]=E[x,y,z]/(1.0+E[x,y,z]*B[3,y,z])

I am really confused by the indexing to use here and because I think (not entirely sure) I need as many threads as there are elements (512126100) I am really lost :( I am not trying to use my arrays in the form above, I am wanting to use 1D indexing as I did in C aka element [0,1,2] is represented by a single index.

To do this in C was trivial for me but involved 3 nested loops (I am not a good C programmer either :) )

All of the examples I see are for arrays of the same size dimension so the indexing is trivial. Cant find any examples where the inputs are arrays of different sizes and the calculations are using the different arrays. Any help greatly appreciated. I am sending in my input arrays as 1D arrays via cudaMALLOC and the standard host to device copy. Was hoping to avoid for loops (seems I might as well stick to C if I have to do this) but…

This is a remote sensing (aka satellite and airborne imagery) problem. The second array B contains atmospheric parameters required to convert my data from radiance to reflectance.

Cheers
Andrew (the slightly frazzled CUDA noob)

AndrewCSIRO · November 6, 2010, 3:49am

Hi,

new to CUDA. Trying to port C code to CUDA.

Problem: Take two input arrays of size A512,126,100 and B4,126,100 and calculate the following in the kernel,
D[x,y,z]=A[x,y,z]-B[1,y,z]

(EDIT: I have made a mistake here the A array should actually be A[x,y] only. I really liked avidday’ response so i am leaving the original A array I posted so that others can see avidday was answering my OP.
The correct D should read as;
D[x,y,z]=A[x,y]-B[1,y,z])

E=B[2,y,z]/D[x,y,z]

C[x,y,z]=E[x,y,z]/(1.0+E[x,y,z]*B[3,y,z])

I am really confused by the indexing to use here and because I think (not entirely sure) I need as many threads as there are elements (512126100) I am really lost :( I am not trying to use my arrays in the form above, I am wanting to use 1D indexing as I did in C aka element [0,1,2] is represented by a single index.

To do this in C was trivial for me but involved 3 nested loops (I am not a good C programmer either :) )

All of the examples I see are for arrays of the same size dimension so the indexing is trivial. Cant find any examples where the inputs are arrays of different sizes and the calculations are using the different arrays. Any help greatly appreciated. I am sending in my input arrays as 1D arrays via cudaMALLOC and the standard host to device copy. Was hoping to avoid for loops (seems I might as well stick to C if I have to do this) but…

This is a remote sensing (aka satellite and airborne imagery) problem. The second array B contains atmospheric parameters required to convert my data from radiance to reflectance.

Cheers
Andrew (the slightly frazzled CUDA noob)

avidday · November 6, 2010, 7:12am

If I am not misunderstanding this, don’t the calcualtions for each [y,z] slice reduce to:

D[x] = A[x] - B1

E[x] = B2 / D[x]

C[x] = E[x] / (1 + E[x]*B3)

Where B1, B2, B3 will be constants?

How best to implement the kernel for that will depend on whether you are using column major or row major ordering in the storage (column major would probably be preferable in this case). For the column order case, launch one block for each [y,z] slice (so use a 126x100 grid), and then have the threads within a block evaluate the expressions for every x entry of the slice, with B1,B2 and B3 read into shared memory at the beginning of the block. Block size can be any round multiple of 32 you want - it might be more efficient to have each thread process more than a single x dimension entry. All the array reads and writes will be coalesced for column major ordering. For row major ordering, you would want to do something different to optimize the memory access patterns.

avidday · November 6, 2010, 7:12am

If I am not misunderstanding this, don’t the calcualtions for each [y,z] slice reduce to:

D[x] = A[x] - B1

E[x] = B2 / D[x]

C[x] = E[x] / (1 + E[x]*B3)

Where B1, B2, B3 will be constants?

How best to implement the kernel for that will depend on whether you are using column major or row major ordering in the storage (column major would probably be preferable in this case). For the column order case, launch one block for each [y,z] slice (so use a 126x100 grid), and then have the threads within a block evaluate the expressions for every x entry of the slice, with B1,B2 and B3 read into shared memory at the beginning of the block. Block size can be any round multiple of 32 you want - it might be more efficient to have each thread process more than a single x dimension entry. All the array reads and writes will be coalesced for column major ordering. For row major ordering, you would want to do something different to optimize the memory access patterns.

AndrewCSIRO · November 6, 2010, 8:17am

If I am not misunderstanding this, don’t the calcualtions for each [y,z] slice reduce to:
D[x] = A[x] - B1

E[x] = B2 / D[x]

C[x] = E[x] / (1 + E[x]*B3)
Where B1, B2, B3 will be constants?

How best to implement the kernel for that will depend on whether you are using column major or row major ordering in the storage (column major would probably be preferable in this case). For the column order case, launch one block for each [y,z] slice (so use a 126x100 grid), and then have the threads within a block evaluate the expressions for every x entry of the slice, with B1,B2 and B3 read into shared memory at the beginning of the block. Block size can be any round multiple of 32 you want - it might be more efficient to have each thread process more than a single x dimension entry. All the array reads and writes will be coalesced for column major ordering. For row major ordering, you would want to do something different to optimize the memory access patterns.

Okay that is sounding pretty cool avidday, and i really like the block launch description as that was confusing the heck out of me.

I have also realized :( that I posted one of the arrays incorrectly. The A array has dimensions of A512,126 and not 512,126,100 as originally posted. So i guess that changes things somewhat. Not sure how I missed that. Not that I want to but I may just expand the A array to the dimensions listed in the OP so IU can follow the lead given by your post. But if i didn’t what would you recommend? The reason I ask is that this problem represents a single row of an spectral image. Ultimately I want to be sending in large chunks of the image at a time (but that is for later when I understand how to handle this case :) )

It is almost never that I am working with input arrays that have the same dimensions, they always share a common dimension or maybe two. I wish the CUDA folk would also make some examples that are not just constrained to “all of our arrays are of size N”

Cheers

Andrew

AndrewCSIRO · November 6, 2010, 8:17am

If I am not misunderstanding this, don’t the calcualtions for each [y,z] slice reduce to:
D[x] = A[x] - B1

E[x] = B2 / D[x]

C[x] = E[x] / (1 + E[x]*B3)
Where B1, B2, B3 will be constants?

How best to implement the kernel for that will depend on whether you are using column major or row major ordering in the storage (column major would probably be preferable in this case). For the column order case, launch one block for each [y,z] slice (so use a 126x100 grid), and then have the threads within a block evaluate the expressions for every x entry of the slice, with B1,B2 and B3 read into shared memory at the beginning of the block. Block size can be any round multiple of 32 you want - it might be more efficient to have each thread process more than a single x dimension entry. All the array reads and writes will be coalesced for column major ordering. For row major ordering, you would want to do something different to optimize the memory access patterns.

Okay that is sounding pretty cool avidday, and i really like the block launch description as that was confusing the heck out of me.

I have also realized :( that I posted one of the arrays incorrectly. The A array has dimensions of A512,126 and not 512,126,100 as originally posted. So i guess that changes things somewhat. Not sure how I missed that. Not that I want to but I may just expand the A array to the dimensions listed in the OP so IU can follow the lead given by your post. But if i didn’t what would you recommend? The reason I ask is that this problem represents a single row of an spectral image. Ultimately I want to be sending in large chunks of the image at a time (but that is for later when I understand how to handle this case :) )

It is almost never that I am working with input arrays that have the same dimensions, they always share a common dimension or maybe two. I wish the CUDA folk would also make some examples that are not just constrained to “all of our arrays are of size N”

Cheers

Andrew

AndrewCSIRO · November 6, 2010, 11:05am

Also forgot that while I like what avidday is saying I still don’t understand how I actually index my input arrays. Would someone be able to show me the kernel using the dimensions I outlined in the op. I am assuming that this kernel is actually quite small. Any examples will help me at this point as it is the code that i need to see to understand the implementation.

Do the NVIDIA folk read this forum at all?

Cheers

Andrew

AndrewCSIRO · November 6, 2010, 11:05am

Also forgot that while I like what avidday is saying I still don’t understand how I actually index my input arrays. Would someone be able to show me the kernel using the dimensions I outlined in the op. I am assuming that this kernel is actually quite small. Any examples will help me at this point as it is the code that i need to see to understand the implementation.

Do the NVIDIA folk read this forum at all?

Cheers

Andrew

avidday · November 6, 2010, 12:10pm

Before anyone can provide code, you need to answer the question I asked in my first reply: column major or row major ordering?

avidday · November 6, 2010, 12:10pm

Before anyone can provide code, you need to answer the question I asked in my first reply: column major or row major ordering?

seibert · November 6, 2010, 3:11pm

Some do in their spare time to keep an eye on user feedback, but I don’t believe any are officially paid to provide tech support on the forums.

seibert · November 6, 2010, 3:11pm

Some do in their spare time to keep an eye on user feedback, but I don’t believe any are officially paid to provide tech support on the forums.

AndrewCSIRO · November 6, 2010, 3:15pm

sorry, if I have an array that is [5,2] in this case the first dimension is my x values

0 1 2 3 4

5 6 7 8 9

then in memory the data is stored in memory as 0123456789

I hope this helps. The data comes from idl which the remote sensing community use for image processing. There are always arguments over the major type e.g row or column. I know that if I send my array from idl to C and look at the first 5 memory locations, for example, then I get 01234 reported back to me. So in C if iwant the value of the 2nd row, 1st column in a zero based index then my index is Ndx=(col-1)+(row-1)â€¢width=0+1â€¢5=5

I hope this helps. Sorry about the lack of information.

AndrewCSIRO · November 6, 2010, 3:15pm

sorry, if I have an array that is [5,2] in this case the first dimension is my x values

0 1 2 3 4

5 6 7 8 9

then in memory the data is stored in memory as 0123456789

I hope this helps. The data comes from idl which the remote sensing community use for image processing. There are always arguments over the major type e.g row or column. I know that if I send my array from idl to C and look at the first 5 memory locations, for example, then I get 01234 reported back to me. So in C if iwant the value of the 2nd row, 1st column in a zero based index then my index is Ndx=(col-1)+(row-1)â€¢width=0+1â€¢5=5

I hope this helps. Sorry about the lack of information.

happyjack272 · November 6, 2010, 3:52pm

my first impression:

combine the y,z into one dimension (y+z*ys). then give a subset of that space to each block.

first part of the kernel is moving that part of B into shared memory.

__global__ stuff(float* A, float* _B, float* C, int xs, int yzs, int yzs_per_block) {

__shared__ float B[yzs_per_block];

//move b into shared memory

   int offset = yzs_per_block*blockIdx.x;

   for( int y = threadIdx.x; y < yzs_per_block; y += blockDim.x)

      B[y] = _B[offset+y];

...

now distribute the x dimension among threads in the block.

for( int y = 0; y < yzs_per_block; y++) {

      int B1 = B[y].x;

      int B2 = B[y].y;

      int B3 = B[y].z;

      for( int x = threadIdx.x; x < xs; x += blockDim.x) {

          int E = B2 / (A[y+offset][x] - B1);

          C[y+offset][x] = E / (1 + E*B3);

      }

   }

}

roughly speaking.

the advantage here is that if there’s a stall on an access to B there are other threads/blocks that can fill that with a different access to B, so you amortize your load time.

happyjack272 · November 6, 2010, 3:52pm

my first impression:

combine the y,z into one dimension (y+z*ys). then give a subset of that space to each block.

first part of the kernel is moving that part of B into shared memory.

__global__ stuff(float* A, float* _B, float* C, int xs, int yzs, int yzs_per_block) {

__shared__ float B[yzs_per_block];

//move b into shared memory

   int offset = yzs_per_block*blockIdx.x;

   for( int y = threadIdx.x; y < yzs_per_block; y += blockDim.x)

      B[y] = _B[offset+y];

...

now distribute the x dimension among threads in the block.

for( int y = 0; y < yzs_per_block; y++) {

      int B1 = B[y].x;

      int B2 = B[y].y;

      int B3 = B[y].z;

      for( int x = threadIdx.x; x < xs; x += blockDim.x) {

          int E = B2 / (A[y+offset][x] - B1);

          C[y+offset][x] = E / (1 + E*B3);

      }

   }

}

roughly speaking.

the advantage here is that if there’s a stall on an access to B there are other threads/blocks that can fill that with a different access to B, so you amortize your load time.

happyjack272 · November 6, 2010, 4:04pm

Okay that is sounding pretty cool avidday, and i really like the block launch description as that was confusing the heck out of me.

I have also realized :( that I posted one of the arrays incorrectly. The A array has dimensions of A512,126 and not 512,126,100 as originally posted. So i guess that changes things somewhat. Not sure how I missed that. Not that I want to but I may just expand the A array to the dimensions listed in the OP so IU can follow the lead given by your post. But if i didn’t what would you recommend? The reason I ask is that this problem represents a single row of an spectral image. Ultimately I want to be sending in large chunks of the image at a time (but that is for later when I understand how to handle this case :) )

It is almost never that I am working with input arrays that have the same dimensions, they always share a common dimension or maybe two. I wish the CUDA folk would also make some examples that are not just constrained to “all of our arrays are of size N”

Cheers

Andrew

wait, so a has dimensions (x,y) and b has dimensiont (y,z)? then it’s like a matrix multiplication. if there are some dimentions unique to a, some unique to b, and some shared, then it’s like a matrix multiplication and your memory access should be organized accordingly. in which case i’d take a look at the matrix multiplication example provvided in the sdk and modify it to use the functions you wrote instead of multiplication and accumulation. though i don’t see how that could be the case because your formula doesn’t include a “reduction” operator. i guess that mean your c would have every dimension either in a or b? clarify please.

happyjack272 · November 6, 2010, 4:04pm

Okay that is sounding pretty cool avidday, and i really like the block launch description as that was confusing the heck out of me.

I have also realized :( that I posted one of the arrays incorrectly. The A array has dimensions of A512,126 and not 512,126,100 as originally posted. So i guess that changes things somewhat. Not sure how I missed that. Not that I want to but I may just expand the A array to the dimensions listed in the OP so IU can follow the lead given by your post. But if i didn’t what would you recommend? The reason I ask is that this problem represents a single row of an spectral image. Ultimately I want to be sending in large chunks of the image at a time (but that is for later when I understand how to handle this case :) )

It is almost never that I am working with input arrays that have the same dimensions, they always share a common dimension or maybe two. I wish the CUDA folk would also make some examples that are not just constrained to “all of our arrays are of size N”

Cheers

Andrew

wait, so a has dimensions (x,y) and b has dimensiont (y,z)? then it’s like a matrix multiplication. if there are some dimentions unique to a, some unique to b, and some shared, then it’s like a matrix multiplication and your memory access should be organized accordingly. in which case i’d take a look at the matrix multiplication example provvided in the sdk and modify it to use the functions you wrote instead of multiplication and accumulation. though i don’t see how that could be the case because your formula doesn’t include a “reduction” operator. i guess that mean your c would have every dimension either in a or b? clarify please.

AndrewCSIRO · November 7, 2010, 2:45am

Hi sorry to confuse the issue. I didn’t want to go into the details of each array as I didn’t think it was appropriate for this forum but to try and clarify I think I need to so that I am giving a, hopefully, more complete picture of what I am doing.

Array A is an array of radiance-at-the-sensor values for 512 spatial locations (pixel) and 126 spectral bands per spatial location e.g. [512,126]. So if I was to view a given spatial location x then i would see a radiance spectra of 126 bands (spanning the visible to short wave infrared spectral region)

Array B is an array of atmospheric parameters that are needed to solve the radiance-to-reflectance problem. Array B has no ties to array A’s spatial location. By this I mean that I could use the values in array B on any of the 512 pixels to produce a surface reflectance (e.g. it is only a function of the spectral band and water vapor amount). The values in array B are defined at a given spectral band (these correspond to the 126 spectral bands in array A) and for a given atmospheric water vapor amount (of which there are 100 values). So I can now see the value of a given atmospheric parameter for a given water vapor amount at a particular spectral band.

When I solve the radiance to reflectance for all 126 spectral bands, lets say for 1 spatial location of the possible 512, I need 3 of the atmospheric parameters from array B (the 4th value is unimportant at this point). So lets say I am at location x, and want to know the reflectance value C for water vapor amount z at spectral band y e.g C(x,y,z). Then I would calculate that particular value as outlined in the OP. So when I run this through a loop in C for all possible 126 spectral bands and all possible 100 water vapor amounts I will produce a new array that has the 100 estimated spectral reflectances at each of the 126 spectral bands. Then I would move onto the next pixel and do it all again.

So now I could look at my output array C and say show me the calculated reflectance for all 126 spectral bands at location x and for water vapor amount z. Equally I could also say I wonder what the spectral reflectance looks like at the same location but where the atmospheric water vapor was z+10 rather than z etc etc.

I could rewrite the B array as 3 separate arrays e.g. B1=C(1,0-125,0,99), B2=C(2,0-125,0,99), B3=C(3,0-125,0,99). So the original OP would be written as (the arrays C, D and E have been declared prior to any calculation and space allocated for them)

D[x,y,z]=A[x,y]-B1[y,z]

so the above in a for loop would look like

for x=0,511

for y=0,125

for z=0,99

D[x,y,z] = A[x,y]-B1[y,z]

E[x,y,z]=B2[y,z]/D[x,y,z]

so the above in a for loop would look like

for x=0,511

for y=0,125

for z=0,99

E[x,y,z] = B2[y,z]/D[x,y,z]

C[x,y,z]=E[x,y,z]/(1.0+E[x,y,z]*B3[y,z])

so the above in a for loop would look like

for x=0,511

for y=0,125

for z=0,99

C[x,y,z] = E[x,y,z]/(1.+E[x,y,z]*B3[y,z])

Of course I combine all of the above into the one set of for loops (the order of the loops here is only for demonstration purposes only). So we are not talking matrix multiplication in the linear algebra sense.

In a single set of for loops:

for x=0,511

for y=0,125

for z=0,99

value1 = A[x,y]-B1[y,z]

value2 = B2[y,z]/value1

C[x,y,z] = value2/(1.+value2*B3[y,z])

Super long-winded I know but I am hoping it clarifies how I go from array A to array C in my “C” code. I am hoping that this is something that the GPU can perform much faster than the three for loops since my understanding is that I can in the input arrays into a single kernel. Forgot to add that the for loops above are done in C with a single index for each array of course. I have used the [x,y,z] notation for clarity.

AndrewCSIRO · November 7, 2010, 2:45am

Hi sorry to confuse the issue. I didn’t want to go into the details of each array as I didn’t think it was appropriate for this forum but to try and clarify I think I need to so that I am giving a, hopefully, more complete picture of what I am doing.

Array A is an array of radiance-at-the-sensor values for 512 spatial locations (pixel) and 126 spectral bands per spatial location e.g. [512,126]. So if I was to view a given spatial location x then i would see a radiance spectra of 126 bands (spanning the visible to short wave infrared spectral region)

Array B is an array of atmospheric parameters that are needed to solve the radiance-to-reflectance problem. Array B has no ties to array A’s spatial location. By this I mean that I could use the values in array B on any of the 512 pixels to produce a surface reflectance (e.g. it is only a function of the spectral band and water vapor amount). The values in array B are defined at a given spectral band (these correspond to the 126 spectral bands in array A) and for a given atmospheric water vapor amount (of which there are 100 values). So I can now see the value of a given atmospheric parameter for a given water vapor amount at a particular spectral band.

When I solve the radiance to reflectance for all 126 spectral bands, lets say for 1 spatial location of the possible 512, I need 3 of the atmospheric parameters from array B (the 4th value is unimportant at this point). So lets say I am at location x, and want to know the reflectance value C for water vapor amount z at spectral band y e.g C(x,y,z). Then I would calculate that particular value as outlined in the OP. So when I run this through a loop in C for all possible 126 spectral bands and all possible 100 water vapor amounts I will produce a new array that has the 100 estimated spectral reflectances at each of the 126 spectral bands. Then I would move onto the next pixel and do it all again.

So now I could look at my output array C and say show me the calculated reflectance for all 126 spectral bands at location x and for water vapor amount z. Equally I could also say I wonder what the spectral reflectance looks like at the same location but where the atmospheric water vapor was z+10 rather than z etc etc.

I could rewrite the B array as 3 separate arrays e.g. B1=C(1,0-125,0,99), B2=C(2,0-125,0,99), B3=C(3,0-125,0,99). So the original OP would be written as (the arrays C, D and E have been declared prior to any calculation and space allocated for them)

D[x,y,z]=A[x,y]-B1[y,z]

so the above in a for loop would look like

for x=0,511

for y=0,125

for z=0,99

D[x,y,z] = A[x,y]-B1[y,z]

E[x,y,z]=B2[y,z]/D[x,y,z]

so the above in a for loop would look like

for x=0,511

for y=0,125

for z=0,99

E[x,y,z] = B2[y,z]/D[x,y,z]

C[x,y,z]=E[x,y,z]/(1.0+E[x,y,z]*B3[y,z])

so the above in a for loop would look like

for x=0,511

for y=0,125

for z=0,99

C[x,y,z] = E[x,y,z]/(1.+E[x,y,z]*B3[y,z])

Of course I combine all of the above into the one set of for loops (the order of the loops here is only for demonstration purposes only). So we are not talking matrix multiplication in the linear algebra sense.

In a single set of for loops:

for x=0,511

for y=0,125

for z=0,99

value1 = A[x,y]-B1[y,z]

value2 = B2[y,z]/value1

C[x,y,z] = value2/(1.+value2*B3[y,z])

Super long-winded I know but I am hoping it clarifies how I go from array A to array C in my “C” code. I am hoping that this is something that the GPU can perform much faster than the three for loops since my understanding is that I can in the input arrays into a single kernel. Forgot to add that the for loops above are done in C with a single index for each array of course. I have used the [x,y,z] notation for clarity.

Topic		Replies	Views
How to allocate a 3d array such that you can use the indecies to access its elements CUDA Programming and Performance	20	5543	October 24, 2009
multi dimension array CUDA Programming and Performance	26	32952	February 12, 2010
Indexing in 3D data CUDA Programming and Performance	19	18614	July 26, 2010
Working with really large arrays in CUDA (how to prevent negative indexes?) CUDA Programming and Performance	5	2380	November 19, 2019
Is it possible to process multidimensional arrays inside the kernel? CUDA Programming and Performance	13	9171	March 31, 2015
char array - my mistake? CUDA Programming and Performance	33	7803	October 22, 2010
Does anyone know how to build a 2D/3D array using CudaMalloc? CUDA Programming and Performance	7	19776	November 3, 2010
Kernel configuration: management of an array of 2 million positions CUDA Programming and Performance	7	1312	June 13, 2014
trouble indexing 1D->3D array in my kernel CUDA Programming and Performance	2	17902	January 31, 2011
matrix multiply reduction CUDA Programming and Performance	41	35871	January 15, 2011

[beginer] Different arrays size indexing

Related topics