Hi,
I’m trying to use CUDA for my C++ code, where I need to parallelize nested loops like the following:
for (int k = 0; k < 500; k++) {
    a1 = b1 + k*c1;
    for (int j = 0; j < 500; j++) {
        a2 = b2 + j*c2;
        for (int i = 0; i < 500; i++) {
            a3 = b3 + i*c3;
            // some operations on a1, a2, a3 ...
        }
    }
}
[b1, b2, b3 and c1, c2, c3 are pointers on the host]
So my question is: how would you approach porting this to CUDA in terms of defining the threads and blocks? Here I have three loops; do you suggest using a grid in this case, or would you vectorize those loops and use blocks and threads?
Also, what is your suggestion for dealing with the a’s [a1, a2, a3]? Should I create two kernels, one for calculating the a’s and then passing the a’s to a separate kernel for the rest of the operations, or would you suggest doing all the calculations in one kernel on the GPU?
The sizes of the a’s are very large.
Thank you very much
Ami
Hi Ami,
Sorry, but there’s not enough information here to give you a definitive answer. Can you provide a minimal example of the host code that you’re wanting to convert to CUDA?
For example, I’m not clear whether the a’s are pointer offsets into the b arrays, or arrays themselves into which you’re gathering values from the b’s.
Also, the schedule to use would depend on how the a’s are being used and on the data access pattern. You may want a1 and a2 to be shared by the threads, and therefore map k to blockIdx.x and j to blockIdx.y, with i mapping to threadIdx.x, as in the sketch below. Though perhaps a different schedule would be better? Hard to tell.
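For illustration, a minimal sketch of that mapping, assuming the a’s are scalar values (your b’s and c’s are pointers, so adjust the dereferencing to your case); the kernel name, the output array out, and the extent N are placeholders, not from your post:

__global__ void nested_sketch(const float *b1, const float *b2, const float *b3,
                              float c1, float c2, float c3, float *out, int N)
{
    int k = blockIdx.x;                       // one block per (k, j) pair
    int j = blockIdx.y;
    float a1 = b1[0] + k * c1;                // uniform across the whole block
    float a2 = b2[0] + j * c2;
    for (int i = threadIdx.x; i < N; i += blockDim.x) {   // inner loop strided over threads
        float a3 = b3[0] + i * c3;
        out[((size_t)k * N + j) * N + i] = a1 + a2 + a3;  // stand-in for "some operations"
    }
}

With this mapping, a1 and a2 are computed once per block rather than once per element, and consecutive threads touch consecutive i elements.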
Should I create two kernels, one for calculating the a’s and then passing the a’s to a separate kernel for the rest of the operations?
Doubtful, but possible. If the a’s are array offsets then this wouldn’t make sense, but if you’re gathering non-contiguous data into a contiguous array, then it may be helpful to do that gather in a separate kernel, along the lines of the sketch below.
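A rough sketch of that gather case (dst, src, idx, and n are hypothetical names, not from your code):

__global__ void gather(float *dst, const float *src, const int *idx, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = src[idx[i]];   // pull scattered source values into a contiguous array
}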
Also, unless you’re using CUDA Unified Memory, the b and c host pointers, as well as the data they point to, would need to be copied to the device.
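Something along these lines, where n is a placeholder for the array length:

float *d_b1;
cudaMalloc(&d_b1, n * sizeof(float));                             // allocate on the device
cudaMemcpy(d_b1, b1, n * sizeof(float), cudaMemcpyHostToDevice);  // copy host data over
// ... likewise for b2, b3 and the c arrays, then pass the device pointers to the kernel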
-Mat
Hi Mat,
Thanks for the message, here is the relevant piece of my C++ code:
bool MLEM::line_calc(Event &ev, Head &head, Image &line)
{
    return algo_CV(ev, head, line);
}

bool MLEM::algo_CV(const Event &ev, Head &head, Image &line)
{
    const Point &V2 = ev.get_V2();               // V2 is coincidence position 2
    const Point &V1 = ev.get_V1();               // V1 is coincidence position 1
    const Point &V2_source = ev.get_V2_source(); // source position for V2
    const Point &V1_source = ev.get_V1_source(); // source position for V1
    Point OjV1, OjV1_init = line.Corner;
    for (int k = 0; k < line.DimIn[2]; k++)
    {
        OjV1.z = OjV1_init.z + k*line.Size.z;
        for (int j = 0; j < line.DimIn[1]; j++)
        {
            OjV1.y = OjV1_init.y + j*line.Size.y;
            for (int i = 0; i < line.DimIn[0]; i++)
            {
                OjV1.x = OjV1_init.x + i*line.Size.x;
                // [some other operations for Maximum Likelihood calculations]
            }
        }
    }
    return true; // return statement was missing in the original excerpt
}
I want to port algo_CV to CUDA, where I have the nested loops, so I just want to see what would be the best way to define the threads and blocks in this case. Any help is appreciated.
Please let me know if you need more information.
thank you,
Ami
Thanks Ami, though there’s still too much missing information to give you good advice.
What data type are OjV1 x, y, and z? Pointers? Integers to be used as indices?
In general, you want to understand the data access pattern in the kernel so that the data is accessed contiguously by the threads. However, you’ve omitted the body of the inner loop, so I’m not sure how this is done.
Though to get you started, I’d probably recommend you use a 3D grid and a 3D block, where “k” maps to blockIdx%z and threadIdx%z, “j” to blockIdx%y and threadIdx%y, and “i” to blockIdx%x and threadIdx%x. So you’d have something like the following in the kernel:
int k = (blockIdx%z * blockDim%z) + threadIdx%z;
int j = (blockIdx%y * blockDim%y) + threadIdx%y;
int i = (blockIdx%x * blockDim%x) + threadIdx%x;
if (k < DimIn2 && j < DimIn1 && i < DimIn0) {   // guard against out-of-range threads
    OjV1.z = OjV1_init.z + k*line.Size.z;
    OjV1.y = OjV1_init.y + j*line.Size.y;
    OjV1.x = OjV1_init.x + i*line.Size.x;
    // ... body of the loop ...
}
Then your launch configuration would look something like:
dim3 block(32, 4, 4);
dim3 grid;
grid.x = (line.DimIn[0] + block.x - 1) / block.x;  // ceiling division to cover all elements
grid.y = (line.DimIn[1] + block.y - 1) / block.y;
grid.z = (line.DimIn[2] + block.z - 1) / block.z;
kernel<<<grid,block>>>(...args...);
No idea if this will be the best schedule, but it will be the simplest and most straightforward to implement.
-Mat
Hi Mat,
Yes, all of the OjV1 x, y, z are pointers which are accessed from a class (line).
Thank you very much for the help. I think your suggestion works and gives a simple solution, but I thought I might run into problems since DimIn0, DimIn1, and DimIn2 are above 300 for the 3D grid. But I think I would need to try that.
One more question: I have a bunch of norm2 calculations inside the body of the loop. Is there a library or an efficient way to calculate the norm2 (the norm2 operates on OjV1.x, OjV1.y, OjV1.z)?
Thank you very much!
Ami
Yes, all of the OjV1 x, y, z are pointers which are accessed from a class (line).
Ok, though keep in mind that the OjV1_init pointers need to point to data on the device, not the host, unless the memory is allocated using cudaMallocManaged and is thus handled via CUDA Unified Memory.
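A minimal sketch of the Unified Memory route, assuming OjV1_init points to an array of n Points (n is a placeholder):

Point *OjV1_init;
cudaMallocManaged(&OjV1_init, n * sizeof(Point));  // one pointer, valid on host and device
// ... fill it in on the host, then use it directly in the kernel; no explicit cudaMemcpy needed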
One more question: I have a bunch of norm2 calculations inside the body of the loop. Is there a library or an efficient way to calculate the norm2 (the norm2 operates on OjV1.x, OjV1.y, OjV1.z)?
Not sure; it’s out of my area of expertise. Looking at the CUDA docs, I see “norm”, but for “norm2” I only see it available in cuBLAS, where it would need to be called from the host.
https://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH__DOUBLE.html#group__CUDA__MATH__DOUBLE_1gb8cdb44f3624276b26f6b151581b3088
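That said, if what you need is just the Euclidean length of an (x, y, z) triple, the CUDA math API does provide norm3d (norm3df in single precision), which is callable from device code. A minimal sketch, with point_norm2 as a hypothetical wrapper name:

__device__ double point_norm2(double x, double y, double z)
{
    return norm3d(x, y, z);   // sqrt(x*x + y*y + z*z), evaluated on the device
}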
Hi,
Sorry for buzzing you again about this. I was just thinking about your code and I’m having a hard time understanding the following:
Did you mean e.g. (blockIdx.z * blockDim.z) + threadIdx.z? What does % mean in your code?
Thank you,
Ami
Yes, sorry. I mostly do CUDA Fortran programming, so I mixed up the languages. Using ‘.’ is correct for C++.
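So the index computation from the earlier post, written in C++ syntax, would read:

int k = (blockIdx.z * blockDim.z) + threadIdx.z;
int j = (blockIdx.y * blockDim.y) + threadIdx.y;
int i = (blockIdx.x * blockDim.x) + threadIdx.x;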