There is data mapped to a 3D grid of dimensions ptdimx × ptdimy × ptdimz.
A kernel processes the data by computing, for each cell, a value based on the cell itself and its neighbors within a two-cell halo, for example:
D00–D01–D02–D03–D04
 |   |   |   |   |
D10–D11–D12–D13–D14
 |   |   |   |   |
D20–D21– X –D23–D24
 |   |   |   |   |
D30–D31–D32–D33–D34
 |   |   |   |   |
D40–D41–D42–D43–D44
Currently, my code launches blocks of 8×8×8 threads to read and compute the data:

__global__
void calcdata(int ptdimx, int ptdimy, int ptdimz,
              float* extdata, float* extdata2)
{
    // two data fields per cell; read window = 12^3 = 1728 cells each
    __shared__ float datas[2][12][12][12];

    // flattened thread index in [0..511]
    int thlocid = threadIdx.x
                + threadIdx.y * 8
                + threadIdx.z * 64;

    // global coordinates of this thread
    int thglox = blockIdx.x * 8 + threadIdx.x;
    int thgloy = blockIdx.y * 8 + threadIdx.y;
    int thgloz = blockIdx.z * 8 + threadIdx.z;

    // origin of the 12^3 read window (block start minus the 2-cell halo)
    int orix = blockIdx.x * 8 - 2;
    int oriy = blockIdx.y * 8 - 2;
    int oriz = blockIdx.z * 8 - 2;

    // each thread performs up to 4 reads of 512 elements to cover 1728
    for (int multiread = 0; multiread < 4; ++multiread)
    {
        int idx = thlocid + multiread * 512;
        int full_x = idx % 12;
        int full_y = (idx / 12) % 12;
        int full_z = idx / 144;
        if (full_z >= 12) break;   // only 1728 of the 2048 slots are used

        int fx = orix + full_x;
        int fy = oriy + full_y;
        int fz = oriz + full_z;

        // periodic wrap in x
        if (fx == -1) fx = ptdimx - 1;
        else if (fx == ptdimx) fx = 0;
        else if (fx == -2) fx = ptdimx - 2;
        else if (fx == ptdimx + 1) fx = 1;

        // periodic wrap in y
        if (fy == -1) fy = ptdimy - 1;
        else if (fy == ptdimy) fy = 0;
        else if (fy == -2) fy = ptdimy - 2;
        else if (fy == ptdimy + 1) fy = 1;

        // periodic wrap in z
        if (fz == -1) fz = ptdimz - 1;
        else if (fz == ptdimz) fz = 0;
        else if (fz == -2) fz = ptdimz - 2;
        else if (fz == ptdimz + 1) fz = 1;

        if (fx < ptdimx && fy < ptdimy && fz < ptdimz)
        {
            int offset = fx + fy * ptdimx + fz * ptdimx * ptdimy;
            datas[0][full_z][full_y][full_x] = extdata[offset];
            datas[1][full_z][full_y][full_x] = extdata2[offset];
        }
    }

    // make the shared tile visible to all threads before using it
    __syncthreads();

    // some calculations and outputs are omitted
}
In summary, this occupies 2 × 12³ × 4 bytes = 13,824 bytes (about 13.8 KB) of shared memory per block.
However, I’m concerned about scalability: if I increase either the number of threads per block (currently 512) or the number of data fields per cell (currently 2), I may hit the shared-memory limit.
Say the GPU is an RTX 4090 Ti: what optimizations can I apply to handle more data fields (e.g. 4–5) per kernel launch?
Thank you!
LEE