Problem: Take two input arrays of size A512,126,100 and B4,126,100 and calculate the following in the kernel,
D[x,y,z]=A[x,y,z]-B[1,y,z]

(EDIT: I have made a mistake here the A array should actually be A[x,y] only. I really liked avidday’ response so i am leaving the original A array I posted so that others can see avidday was answering my OP.
The correct D should read as;
D[x,y,z]=A[x,y]-B[1,y,z])

E=B[2,y,z]/D[x,y,z]

C[x,y,z]=E[x,y,z]/(1.0+E[x,y,z]*B[3,y,z])

I am really confused by the indexing to use here and because I think (not entirely sure) I need as many threads as there are elements (512126100) I am really lost :( I am not trying to use my arrays in the form above, I am wanting to use 1D indexing as I did in C aka element [0,1,2] is represented by a single index.

To do this in C was trivial for me but involved 3 nested loops (I am not a good C programmer either :) )

All of the examples I see are for arrays of the same size dimension so the indexing is trivial. Cant find any examples where the inputs are arrays of different sizes and the calculations are using the different arrays. Any help greatly appreciated. I am sending in my input arrays as 1D arrays via cudaMALLOC and the standard host to device copy. Was hoping to avoid for loops (seems I might as well stick to C if I have to do this) but…

This is a remote sensing (aka satellite and airborne imagery) problem. The second array B contains atmospheric parameters required to convert my data from radiance to reflectance.

Problem: Take two input arrays of size A512,126,100 and B4,126,100 and calculate the following in the kernel,
D[x,y,z]=A[x,y,z]-B[1,y,z]

(EDIT: I have made a mistake here the A array should actually be A[x,y] only. I really liked avidday’ response so i am leaving the original A array I posted so that others can see avidday was answering my OP.
The correct D should read as;
D[x,y,z]=A[x,y]-B[1,y,z])

E=B[2,y,z]/D[x,y,z]

C[x,y,z]=E[x,y,z]/(1.0+E[x,y,z]*B[3,y,z])

I am really confused by the indexing to use here and because I think (not entirely sure) I need as many threads as there are elements (512126100) I am really lost :( I am not trying to use my arrays in the form above, I am wanting to use 1D indexing as I did in C aka element [0,1,2] is represented by a single index.

To do this in C was trivial for me but involved 3 nested loops (I am not a good C programmer either :) )

All of the examples I see are for arrays of the same size dimension so the indexing is trivial. Cant find any examples where the inputs are arrays of different sizes and the calculations are using the different arrays. Any help greatly appreciated. I am sending in my input arrays as 1D arrays via cudaMALLOC and the standard host to device copy. Was hoping to avoid for loops (seems I might as well stick to C if I have to do this) but…

This is a remote sensing (aka satellite and airborne imagery) problem. The second array B contains atmospheric parameters required to convert my data from radiance to reflectance.

How best to implement the kernel for that will depend on whether you are using column major or row major ordering in the storage (column major would probably be preferable in this case). For the column order case, launch one block for each [y,z] slice (so use a 126x100 grid), and then have the threads within a block evaluate the expressions for every x entry of the slice, with B1,B2 and B3 read into shared memory at the beginning of the block. Block size can be any round multiple of 32 you want - it might be more efficient to have each thread process more than a single x dimension entry. All the array reads and writes will be coalesced for column major ordering. For row major ordering, you would want to do something different to optimize the memory access patterns.

How best to implement the kernel for that will depend on whether you are using column major or row major ordering in the storage (column major would probably be preferable in this case). For the column order case, launch one block for each [y,z] slice (so use a 126x100 grid), and then have the threads within a block evaluate the expressions for every x entry of the slice, with B1,B2 and B3 read into shared memory at the beginning of the block. Block size can be any round multiple of 32 you want - it might be more efficient to have each thread process more than a single x dimension entry. All the array reads and writes will be coalesced for column major ordering. For row major ordering, you would want to do something different to optimize the memory access patterns.

Okay that is sounding pretty cool avidday, and i really like the block launch description as that was confusing the heck out of me.

I have also realized :( that I posted one of the arrays incorrectly. The A array has dimensions of A512,126 and not 512,126,100 as originally posted. So i guess that changes things somewhat. Not sure how I missed that. Not that I want to but I may just expand the A array to the dimensions listed in the OP so IU can follow the lead given by your post. But if i didn’t what would you recommend? The reason I ask is that this problem represents a single row of an spectral image. Ultimately I want to be sending in large chunks of the image at a time (but that is for later when I understand how to handle this case :) )

It is almost never that I am working with input arrays that have the same dimensions, they always share a common dimension or maybe two. I wish the CUDA folk would also make some examples that are not just constrained to “all of our arrays are of size N”

Okay that is sounding pretty cool avidday, and i really like the block launch description as that was confusing the heck out of me.

I have also realized :( that I posted one of the arrays incorrectly. The A array has dimensions of A512,126 and not 512,126,100 as originally posted. So i guess that changes things somewhat. Not sure how I missed that. Not that I want to but I may just expand the A array to the dimensions listed in the OP so IU can follow the lead given by your post. But if i didn’t what would you recommend? The reason I ask is that this problem represents a single row of an spectral image. Ultimately I want to be sending in large chunks of the image at a time (but that is for later when I understand how to handle this case :) )

It is almost never that I am working with input arrays that have the same dimensions, they always share a common dimension or maybe two. I wish the CUDA folk would also make some examples that are not just constrained to “all of our arrays are of size N”

Also forgot that while I like what avidday is saying I still don’t understand how I actually index my input arrays. Would someone be able to show me the kernel using the dimensions I outlined in the op. I am assuming that this kernel is actually quite small. Any examples will help me at this point as it is the code that i need to see to understand the implementation.

Also forgot that while I like what avidday is saying I still don’t understand how I actually index my input arrays. Would someone be able to show me the kernel using the dimensions I outlined in the op. I am assuming that this kernel is actually quite small. Any examples will help me at this point as it is the code that i need to see to understand the implementation.

sorry, if I have an array that is [5,2] in this case the first dimension is my x values

0 1 2 3 4

5 6 7 8 9

then in memory the data is stored in memory as 0123456789

I hope this helps. The data comes from idl which the remote sensing community use for image processing. There are always arguments over the major type e.g row or column. I know that if I send my array from idl to C and look at the first 5 memory locations, for example, then I get 01234 reported back to me. So in C if iwant the value of the 2nd row, 1st column in a zero based index then my index is Ndx=(col-1)+(row-1)â€¢width=0+1â€¢5=5

I hope this helps. Sorry about the lack of information.

sorry, if I have an array that is [5,2] in this case the first dimension is my x values

0 1 2 3 4

5 6 7 8 9

then in memory the data is stored in memory as 0123456789

I hope this helps. The data comes from idl which the remote sensing community use for image processing. There are always arguments over the major type e.g row or column. I know that if I send my array from idl to C and look at the first 5 memory locations, for example, then I get 01234 reported back to me. So in C if iwant the value of the 2nd row, 1st column in a zero based index then my index is Ndx=(col-1)+(row-1)â€¢width=0+1â€¢5=5

I hope this helps. Sorry about the lack of information.

combine the y,z into one dimension (y+z*ys). then give a subset of that space to each block.

first part of the kernel is moving that part of B into shared memory.

__global__ stuff(float* A, float* _B, float* C, int xs, int yzs, int yzs_per_block) {
__shared__ float B[yzs_per_block];
//move b into shared memory
int offset = yzs_per_block*blockIdx.x;
for( int y = threadIdx.x; y < yzs_per_block; y += blockDim.x)
B[y] = _B[offset+y];
...

now distribute the x dimension among threads in the block.

for( int y = 0; y < yzs_per_block; y++) {
int B1 = B[y].x;
int B2 = B[y].y;
int B3 = B[y].z;
for( int x = threadIdx.x; x < xs; x += blockDim.x) {
int E = B2 / (A[y+offset][x] - B1);
C[y+offset][x] = E / (1 + E*B3);
}
}
}

roughly speaking.

the advantage here is that if there’s a stall on an access to B there are other threads/blocks that can fill that with a different access to B, so you amortize your load time.

combine the y,z into one dimension (y+z*ys). then give a subset of that space to each block.

first part of the kernel is moving that part of B into shared memory.

__global__ stuff(float* A, float* _B, float* C, int xs, int yzs, int yzs_per_block) {
__shared__ float B[yzs_per_block];
//move b into shared memory
int offset = yzs_per_block*blockIdx.x;
for( int y = threadIdx.x; y < yzs_per_block; y += blockDim.x)
B[y] = _B[offset+y];
...

now distribute the x dimension among threads in the block.

for( int y = 0; y < yzs_per_block; y++) {
int B1 = B[y].x;
int B2 = B[y].y;
int B3 = B[y].z;
for( int x = threadIdx.x; x < xs; x += blockDim.x) {
int E = B2 / (A[y+offset][x] - B1);
C[y+offset][x] = E / (1 + E*B3);
}
}
}

roughly speaking.

the advantage here is that if there’s a stall on an access to B there are other threads/blocks that can fill that with a different access to B, so you amortize your load time.

wait, so a has dimensions (x,y) and b has dimensiont (y,z)? then it’s like a matrix multiplication. if there are some dimentions unique to a, some unique to b, and some shared, then it’s like a matrix multiplication and your memory access should be organized accordingly. in which case i’d take a look at the matrix multiplication example provvided in the sdk and modify it to use the functions you wrote instead of multiplication and accumulation. though i don’t see how that could be the case because your formula doesn’t include a “reduction” operator. i guess that mean your c would have every dimension either in a or b? clarify please.

wait, so a has dimensions (x,y) and b has dimensiont (y,z)? then it’s like a matrix multiplication. if there are some dimentions unique to a, some unique to b, and some shared, then it’s like a matrix multiplication and your memory access should be organized accordingly. in which case i’d take a look at the matrix multiplication example provvided in the sdk and modify it to use the functions you wrote instead of multiplication and accumulation. though i don’t see how that could be the case because your formula doesn’t include a “reduction” operator. i guess that mean your c would have every dimension either in a or b? clarify please.

Hi sorry to confuse the issue. I didn’t want to go into the details of each array as I didn’t think it was appropriate for this forum but to try and clarify I think I need to so that I am giving a, hopefully, more complete picture of what I am doing.

Array A is an array of radiance-at-the-sensor values for 512 spatial locations (pixel) and 126 spectral bands per spatial location e.g. [512,126]. So if I was to view a given spatial location x then i would see a radiance spectra of 126 bands (spanning the visible to short wave infrared spectral region)

Array B is an array of atmospheric parameters that are needed to solve the radiance-to-reflectance problem. Array B has no ties to array A’s spatial location. By this I mean that I could use the values in array B on any of the 512 pixels to produce a surface reflectance (e.g. it is only a function of the spectral band and water vapor amount). The values in array B are defined at a given spectral band (these correspond to the 126 spectral bands in array A) and for a given atmospheric water vapor amount (of which there are 100 values). So I can now see the value of a given atmospheric parameter for a given water vapor amount at a particular spectral band.

When I solve the radiance to reflectance for all 126 spectral bands, lets say for 1 spatial location of the possible 512, I need 3 of the atmospheric parameters from array B (the 4th value is unimportant at this point). So lets say I am at location x, and want to know the reflectance value C for water vapor amount z at spectral band y e.g C(x,y,z). Then I would calculate that particular value as outlined in the OP. So when I run this through a loop in C for all possible 126 spectral bands and all possible 100 water vapor amounts I will produce a new array that has the 100 estimated spectral reflectances at each of the 126 spectral bands. Then I would move onto the next pixel and do it all again.

So now I could look at my output array C and say show me the calculated reflectance for all 126 spectral bands at location x and for water vapor amount z. Equally I could also say I wonder what the spectral reflectance looks like at the same location but where the atmospheric water vapor was z+10 rather than z etc etc.

I could rewrite the B array as 3 separate arrays e.g. B1=C(1,0-125,0,99), B2=C(2,0-125,0,99), B3=C(3,0-125,0,99). So the original OP would be written as (the arrays C, D and E have been declared prior to any calculation and space allocated for them)

D[x,y,z]=A[x,y]-B1[y,z]

so the above in a for loop would look like

for x=0,511

for y=0,125

for z=0,99

D[x,y,z] = A[x,y]-B1[y,z]

E[x,y,z]=B2[y,z]/D[x,y,z]

so the above in a for loop would look like

for x=0,511

for y=0,125

for z=0,99

E[x,y,z] = B2[y,z]/D[x,y,z]

C[x,y,z]=E[x,y,z]/(1.0+E[x,y,z]*B3[y,z])

so the above in a for loop would look like

for x=0,511

for y=0,125

for z=0,99

C[x,y,z] = E[x,y,z]/(1.+E[x,y,z]*B3[y,z])

Of course I combine all of the above into the one set of for loops (the order of the loops here is only for demonstration purposes only). So we are not talking matrix multiplication in the linear algebra sense.

Super long-winded I know but I am hoping it clarifies how I go from array A to array C in my “C” code. I am hoping that this is something that the GPU can perform much faster than the three for loops since my understanding is that I can in the input arrays into a single kernel. Forgot to add that the for loops above are done in C with a single index for each array of course. I have used the [x,y,z] notation for clarity.

Hi sorry to confuse the issue. I didn’t want to go into the details of each array as I didn’t think it was appropriate for this forum but to try and clarify I think I need to so that I am giving a, hopefully, more complete picture of what I am doing.

Array A is an array of radiance-at-the-sensor values for 512 spatial locations (pixel) and 126 spectral bands per spatial location e.g. [512,126]. So if I was to view a given spatial location x then i would see a radiance spectra of 126 bands (spanning the visible to short wave infrared spectral region)

Array B is an array of atmospheric parameters that are needed to solve the radiance-to-reflectance problem. Array B has no ties to array A’s spatial location. By this I mean that I could use the values in array B on any of the 512 pixels to produce a surface reflectance (e.g. it is only a function of the spectral band and water vapor amount). The values in array B are defined at a given spectral band (these correspond to the 126 spectral bands in array A) and for a given atmospheric water vapor amount (of which there are 100 values). So I can now see the value of a given atmospheric parameter for a given water vapor amount at a particular spectral band.

When I solve the radiance to reflectance for all 126 spectral bands, lets say for 1 spatial location of the possible 512, I need 3 of the atmospheric parameters from array B (the 4th value is unimportant at this point). So lets say I am at location x, and want to know the reflectance value C for water vapor amount z at spectral band y e.g C(x,y,z). Then I would calculate that particular value as outlined in the OP. So when I run this through a loop in C for all possible 126 spectral bands and all possible 100 water vapor amounts I will produce a new array that has the 100 estimated spectral reflectances at each of the 126 spectral bands. Then I would move onto the next pixel and do it all again.

So now I could look at my output array C and say show me the calculated reflectance for all 126 spectral bands at location x and for water vapor amount z. Equally I could also say I wonder what the spectral reflectance looks like at the same location but where the atmospheric water vapor was z+10 rather than z etc etc.

I could rewrite the B array as 3 separate arrays e.g. B1=C(1,0-125,0,99), B2=C(2,0-125,0,99), B3=C(3,0-125,0,99). So the original OP would be written as (the arrays C, D and E have been declared prior to any calculation and space allocated for them)

D[x,y,z]=A[x,y]-B1[y,z]

so the above in a for loop would look like

for x=0,511

for y=0,125

for z=0,99

D[x,y,z] = A[x,y]-B1[y,z]

E[x,y,z]=B2[y,z]/D[x,y,z]

so the above in a for loop would look like

for x=0,511

for y=0,125

for z=0,99

E[x,y,z] = B2[y,z]/D[x,y,z]

C[x,y,z]=E[x,y,z]/(1.0+E[x,y,z]*B3[y,z])

so the above in a for loop would look like

for x=0,511

for y=0,125

for z=0,99

C[x,y,z] = E[x,y,z]/(1.+E[x,y,z]*B3[y,z])

Of course I combine all of the above into the one set of for loops (the order of the loops here is only for demonstration purposes only). So we are not talking matrix multiplication in the linear algebra sense.

Super long-winded I know but I am hoping it clarifies how I go from array A to array C in my “C” code. I am hoping that this is something that the GPU can perform much faster than the three for loops since my understanding is that I can in the input arrays into a single kernel. Forgot to add that the for loops above are done in C with a single index for each array of course. I have used the [x,y,z] notation for clarity.