Hi,

I’m trying to use CUDA for my C++ code, where I need to parallelize nested loops like the following:

```
for k=0; k<500
    a1 = b1 + k*c1
    for j=0; j<500
        a2 = b2 + j*c2
        for i=0; i<500
            a3 = b3 + i*c3
            some operations on a1, a2, a3 ...
        end
    end
end
```

[b1, b2, b3 and c1, c2, c3 are pointers on the host]

So my question is: how would you approach moving this to CUDA in terms of defining the threads and blocks? Here I have three loops; do you suggest using a 3D grid in this case, or would you vectorize those loops and map them onto blocks and threads?

Also, what is your suggestion for dealing with the a’s (a1, a2, a3)? Should I create two kernels: one for calculating the a’s and then passing the a’s to a separate kernel for the rest of the operations, or would you suggest doing all the calculations in one kernel on the GPU?

The sizes of the a’s are very large.

Thank you very much

Ami

Hi Ami,

Sorry, but there’s not enough information here to give you a definitive answer. Can you provide a minimal example of the host code that you’re wanting to convert to CUDA?

For example, I’m not clear if the a’s are pointer offsets into the b arrays or if they are arrays themselves where you’re gathering values from the b’s.

Also, the schedule to use depends on how the a’s are being used and on the data access pattern. You may want a1 and a2 to be shared by the threads of a block, and therefore map k to blockIdx.x and j to blockIdx.y, with i mapping to threadIdx.x. Though perhaps a different schedule would be better? Hard to tell.
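For instance, one rough sketch of that mapping (just an illustration; the argument list and body are placeholders, and it assumes the inner extent fits in one block, e.g. 500 <= 1024 threads):

```
// Sketch: k -> blockIdx.x, j -> blockIdx.y, i -> threadIdx.x,
// so a1 and a2 are uniform across the threads of a block
__global__ void nested_loops(/* ...b's, c's, etc... */)
{
    int k = blockIdx.x;    // one block per (k, j) pair
    int j = blockIdx.y;
    int i = threadIdx.x;   // threads sweep the inner loop
    // a1 = b1 + k*c1;  a2 = b2 + j*c2;  a3 = b3 + i*c3;
    // ... operations on a1, a2, a3 ...
}
```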

Should I create two kernels: one for calculating the a’s and then passing the a’s to a separate kernel for the rest of the operations

Doubtful, but possible. If the a’s are array offsets then this wouldn’t make sense, but if you’re gathering non-contiguous data into a contiguous array, then it may be helpful to do the gather in a separate kernel.
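For example, a gather kernel along these lines (a sketch only; the names are hypothetical):

```
// Sketch: gather non-contiguous elements of b into a contiguous
// device array a, one element per thread (hypothetical names)
__global__ void gather(double *a, const double *b, const int *idx, int n)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < n)
        a[t] = b[idx[t]];  // contiguous writes, gathered reads
}
```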

Also, unless you’re using CUDA Unified Memory, the b and c host pointers, as well as the data they point to, would need to be copied to the device.
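For example, something like this for each array (a sketch, assuming b1 points at N doubles; the names are hypothetical):

```
// Sketch: copy a host array b1 of N doubles to the device
double *b1_d;
cudaMalloc((void**)&b1_d, N * sizeof(double));
cudaMemcpy(b1_d, b1, N * sizeof(double), cudaMemcpyHostToDevice);
// ... launch the kernel with b1_d instead of b1 ...
cudaFree(b1_d);
```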

-Mat

Hi Mat,

Thanks for the message, here is the piece of my c++ code:

```
bool MLEM::line_calc(Event &ev, Head &head, Image &line)
{
    return algo_CV(ev, head, line);
}

bool MLEM::algo_CV(const Event &ev, Head &head, Image &line)
{
    const Point &V2 = ev.get_V2(); // V2 is coincidence position 2
    const Point &V1 = ev.get_V1(); // V1 is coincidence position 1

    const Point &V2_source = ev.get_V2_source();
    const Point &V1_source = ev.get_V1_source();

    Point OjV1, OjV1_init = line.Corner;

    for (int k = 0; k < line.DimIn[2]; k++)
    {
        OjV1.z = OjV1_init.z + k*line.Size.z;
        for (int j = 0; j < line.DimIn[1]; j++)
        {
            OjV1.y = OjV1_init.y + j*line.Size.y;
            for (int i = 0; i < line.DimIn[0]; i++)
            {
                OjV1.x = OjV1_init.x + i*line.Size.x;

                // [some other operations for Maximum Likelihood calculations]
            }
        }
    }
}
```

I want to port algo_CV to CUDA, where I have the nested loops, so I just want to see what would be the best way to define the threads and blocks in this case. Any help is appreciated.

Please let me know if you need more information.

thank you,

Ami

Thanks Ami, though there’s still too much missing information to give you good advice.

What data type are OjV1’s x, y, and z? Pointers? Integers to be used as indices?

In general, you want to arrange the kernel so that the data is accessed contiguously by the threads. However, you’ve omitted the body of the inner loop, so I’m not sure how this is done.

Though to get you started, I’d probably recommend you use a 3D grid and a 3D block, where “k” maps to blockIdx%z and threadIdx%z, “j” to blockIdx%y and threadIdx%y, and “i” to blockIdx%x and threadIdx%x. So you’d have something like the following in the kernel:

```
int k = (blockIdx%z * blockDim%z) + threadIdx%z;
int j = (blockIdx%y * blockDim%y) + threadIdx%y;
int i = (blockIdx%x * blockDim%x) + threadIdx%x;
if (k < DimIn2 && j < DimIn1 && i < DimIn0) {
OjV1.z = OjV1_init.z+ k*line.Size.z;
OjV1.y = OjV1_init.y +j*line.Size.y;
OjV1.x= OjV1_init.x+ i*line.Size.x;
... body of the loop ...
}
```

Then your launch configuration would look something like:

```
dim3 grid;
dim3 block( 32, 4, 4 );
grid.x = (line.DimIn[0]+block.x-1)/block.x;
grid.y = (line.DimIn[1]+block.y-1)/block.y;
grid.z = (line.DimIn[2]+block.z-1)/block.z;
kernel<<<grid,block>>>(...args...);
```

No idea if this will be the best schedule, but it will be the simplest and most straightforward to implement.

-Mat

Hi Mat,

Yes, all the OjV1 x, y, z are pointers, which are accessed through a class (line).

Thank you very much for the help. I think your suggestion works and gives a simple solution, but I thought I might run into problems since DimIn0, DimIn1, and DimIn2 are above 300 for the 3D grid. But I think I would need to try that.

One more question: I have a bunch of norm2 calculations inside the body of the loop. Is there a library or an efficient way to calculate the norm2? (It operates on OjV1.x, OjV1.y, OjV1.z.)

Thank you very much!

Ami

Yes, all the OjV1 x, y, z are pointers, which are accessed through a class (line).

Ok, though keep in mind that the OjV1_init pointers need to point to data on the device, not the host, unless the memory is allocated with cudaMallocManaged and handled via CUDA Unified Memory.

One more question: I have a bunch of norm2 calculations inside the body of the loop. Is there a library or an efficient way to calculate the norm2? (It operates on OjV1.x, OjV1.y, OjV1.z.)

Not sure; it’s outside my area of expertise. Looking at the CUDA docs, I see “norm”, but for “norm2” I only see it available in cuBLAS, where it would need to be called from the host.

https://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH__DOUBLE.html#group__CUDA__MATH__DOUBLE_1gb8cdb44f3624276b26f6b151581b3088


Hi,

Sorry for buzzing you again about this. I was just thinking about your code, and I have a hard time understanding the following:

Did you mean e.g. (blockIdx.z * blockDim.z) + threadIdx.z? What does % mean in your code?

Thank you,

Ami

Yes, sorry. I mostly do CUDA Fortran programming, so I mixed languages. Using ‘.’ is correct for C++.
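So with the ‘.’ syntax, the kernel sketch from earlier would read like this (the kernel name is a placeholder; the body and the remaining arguments are still elided):

```
__global__ void mykernel(int DimIn0, int DimIn1, int DimIn2 /* , ...args... */)
{
    int k = (blockIdx.z * blockDim.z) + threadIdx.z;
    int j = (blockIdx.y * blockDim.y) + threadIdx.y;
    int i = (blockIdx.x * blockDim.x) + threadIdx.x;
    if (k < DimIn2 && j < DimIn1 && i < DimIn0) {
        // OjV1.z = OjV1_init.z + k*line.Size.z;  (and likewise for y and x)
        // ... body of the loop ...
    }
}
```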