I have a 3D matrix in global memory. It contains Z matrices, each containing X * Y float samples.
The samples in the Z axis are consecutive in memory.
For each group of Z samples, I have to move each sample to another place in memory, so that I end up with Z matrices of X * Y samples. Those 2D matrices will be consecutive in memory.
I know 3 alternatives:
- Each thread will run a for loop on Z samples.
- Run the kernel Z times
- Use the following code:
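dim3 block (16, 16, 4);  // a block holds at most 1024 threads, so (32, 32, nz) only launches when nz == 1
dim3 grid ((nx + block.x-1)/block.x, (ny + block.y-1)/block.y, (nz+block.z-1)/block.z);
my_kernel <<<grid, block>>> (pSrc, pDest, nx, ny, nz);
__global__ void my_kernel (float *pSrc, float *pDest, int nx, int ny, int nz)
{
    int ix = threadIdx.x + blockIdx.x * blockDim.x;
    int iy = threadIdx.y + blockIdx.y * blockDim.y;
    int iz = threadIdx.z + blockIdx.z * blockDim.z;
    if (ix >= nx || iy >= ny || iz >= nz) return;  // skip threads in the partial blocks at the edges
    // move the sample at (ix, iy, iz) from pSrc to pDest here
}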
Can you please advise which will give the best performance?
If Z is deep and X * Y is not large enough to span many thread blocks and keep the device as close to 100% occupancy as possible, then #1 is a no-go, because a few threads will be busy running the loop while a large portion of the device gets no work.
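For reference, here is a minimal sketch of what option #1 could look like. It assumes the source is indexed as src[(iy * nx + ix) * nz + iz] and the destination as dst[(iz * ny + iy) * nx + ix]; the thread only says that Z samples are consecutive in the source and that rows are consecutive in the 2D matrices, so the exact x/y ordering is an assumption. The names deinterleave_loop and launch_deinterleave_loop are hypothetical.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Option #1 sketch: one thread per (ix, iy) position, looping over Z.
// Assumed layouts (not confirmed in the thread):
//   src[(iy * nx + ix) * nz + iz]   Z samples consecutive
//   dst[(iz * ny + iy) * nx + ix]   Z planes, each row-major
__global__ void deinterleave_loop(const float *src, float *dst, int nx, int ny, int nz)
{
    int ix = threadIdx.x + blockIdx.x * blockDim.x;
    int iy = threadIdx.y + blockIdx.y * blockDim.y;
    if (ix >= nx || iy >= ny) return;

    size_t pixel = (size_t)iy * nx + ix;   // position inside one plane
    size_t plane = (size_t)nx * ny;        // number of samples per plane
    for (int iz = 0; iz < nz; ++iz)
        dst[iz * plane + pixel] = src[pixel * nz + iz];
}

// Hypothetical host helper; dSrc and dDst are device pointers.
cudaError_t launch_deinterleave_loop(const float *dSrc, float *dDst, int nx, int ny, int nz)
{
    assert(nx > 0 && ny > 0 && nz > 0);
    dim3 block(32, 8);                     // 256 threads per block
    dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y);
    deinterleave_loop<<<grid, block>>>(dSrc, dDst, nx, ny, nz);
    return cudaGetLastError();
}
```

With this mapping only nx * ny threads exist, which is the concern raised above when X * Y is small.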
#2 is fine if your problem genuinely requires more than 1 kernel launch (a reduction, for example). But you don’t mention what you are going to do with the samples, only that “I have to take each sample to another place in memory”. A single kernel launch is cheap, but if you launch inside a long loop, the launch overhead adds up and becomes easy to see.
#3 seems to be the appropriate path to take, so that each thread works on 1 element, possibly inside a strided loop. You also have to account for the number of reads and writes to global memory, as it is the slowest memory on the device. You have probably read about shared memory, a small but very fast user-programmable cache that can be used as a buffer until you have to write things back to global memory.
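A minimal sketch of the one-thread-per-element approach follows. The index formulas are assumptions: src[(iy * nx + ix) * nz + iz] for the source and dst[(iz * ny + iy) * nx + ix] for the destination, since the thread does not spell out the x/y ordering. The names deinterleave_kernel and launch_deinterleave are hypothetical.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Option #3 sketch: one thread moves exactly one sample.
// Assumed layouts (not confirmed in the thread):
//   src[(iy * nx + ix) * nz + iz]   Z samples consecutive
//   dst[(iz * ny + iy) * nx + ix]   Z planes, each row-major
__global__ void deinterleave_kernel(const float *src, float *dst, int nx, int ny, int nz)
{
    int ix = threadIdx.x + blockIdx.x * blockDim.x;
    int iy = threadIdx.y + blockIdx.y * blockDim.y;
    int iz = threadIdx.z + blockIdx.z * blockDim.z;
    if (ix >= nx || iy >= ny || iz >= nz) return;

    size_t pixel = (size_t)iy * nx + ix;
    // Neighbouring threads in x write neighbouring addresses,
    // but their reads are nz floats apart.
    dst[(size_t)iz * nx * ny + pixel] = src[pixel * nz + iz];
}

// Hypothetical host helper; dSrc and dDst are device pointers.
cudaError_t launch_deinterleave(const float *dSrc, float *dDst, int nx, int ny, int nz)
{
    assert(nx > 0 && ny > 0 && nz > 0);
    dim3 block(32, 8, 1);                  // 256 threads, well under the 1024 limit
    dim3 grid((nx + block.x - 1) / block.x,
              (ny + block.y - 1) / block.y,
              (nz + block.z - 1) / block.z);
    deinterleave_kernel<<<grid, block>>>(dSrc, dDst, nx, ny, nz);
    return cudaGetLastError();
}
```

The strided reads are where the global memory traffic mentioned above is likely to hurt; whether that matters in practice has to be measured.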
But we are only discussing generic strategies based on the information provided. In general, try to make 1 thread operate on 1 element, using grid-stride loops, as these devices perform better that way. Long per-thread loops are more CPU territory.
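A grid-stride loop for the same rearrangement could be sketched as below. It assumes the layouts src[(iy * nx + ix) * nz + iz] and dst[(iz * ny + iy) * nx + ix], which the thread does not state explicitly. The function names and the block cap of 64 are hypothetical choices for illustration, not tuned values.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Grid-stride sketch: a fixed-size 1D grid walks over all destination elements.
// Assumed layouts (not confirmed in the thread):
//   src[(iy * nx + ix) * nz + iz]   Z samples consecutive
//   dst[(iz * ny + iy) * nx + ix]   Z planes, each row-major
__global__ void deinterleave_grid_stride(const float *src, float *dst, int nx, int ny, int nz)
{
    size_t total  = (size_t)nx * ny * nz;
    size_t stride = (size_t)blockDim.x * gridDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < total; i += stride) {
        // i is the linear destination index; recover (ix, iy, iz) from it.
        int ix = (int)(i % nx);
        int iy = (int)((i / nx) % ny);
        int iz = (int)(i / ((size_t)nx * ny));
        dst[i] = src[((size_t)iy * nx + ix) * nz + iz];
    }
}

// Hypothetical host helper; dSrc and dDst are device pointers.
cudaError_t launch_deinterleave_grid_stride(const float *dSrc, float *dDst, int nx, int ny, int nz)
{
    assert(nx > 0 && ny > 0 && nz > 0);
    const int threads   = 256;
    const int maxBlocks = 64;              // arbitrary cap for this sketch
    size_t total  = (size_t)nx * ny * nz;
    size_t needed = (total + threads - 1) / threads;
    int blocks = needed < (size_t)maxBlocks ? (int)needed : maxBlocks;
    deinterleave_grid_stride<<<blocks, threads>>>(dSrc, dDst, nx, ny, nz);
    return cudaGetLastError();
}
```

Because the grid size no longer depends on the problem size, the same launch configuration works for any nx, ny and nz; each thread simply handles more elements when the data is larger than the grid.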
Hello saulocpp, All,
Thank you very much for the detailed answer.
After arranging the 3D matrix into Z (e.g. 12) 2D matrices in global memory, the next step is:
For each 2D matrix (X columns, Y rows, each row consecutive in memory), each row is subtracted from its previous one.
So I will get Z new matrices in global memory, each with Y-1 rows and X columns.
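The row subtraction described here could be sketched as follows. The sign convention is an assumption: "each row is subtracted from its previous one" is read as out[row] = in[row] - in[row + 1]; swap the operands if the opposite is intended. The names row_diff_kernel and launch_row_diff are hypothetical.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Row-difference sketch: one thread per output sample.
// Input:  nz planes of ny rows by nx columns, row-major.
// Output: nz planes of (ny - 1) rows by nx columns, row-major.
// Assumed convention: out[row] = in[row] - in[row + 1].
__global__ void row_diff_kernel(const float *in, float *out, int nx, int ny, int nz)
{
    int ix = threadIdx.x + blockIdx.x * blockDim.x;
    int iy = threadIdx.y + blockIdx.y * blockDim.y;   // output row
    int iz = threadIdx.z + blockIdx.z * blockDim.z;
    if (ix >= nx || iy >= ny - 1 || iz >= nz) return;

    size_t inPlane  = (size_t)iz * ny * nx;
    size_t outPlane = (size_t)iz * (ny - 1) * nx;
    float prev = in[inPlane + (size_t)iy * nx + ix];
    float next = in[inPlane + (size_t)(iy + 1) * nx + ix];
    out[outPlane + (size_t)iy * nx + ix] = prev - next;
}

// Hypothetical host helper; dIn and dOut are device pointers.
cudaError_t launch_row_diff(const float *dIn, float *dOut, int nx, int ny, int nz)
{
    assert(nx > 0 && ny > 1 && nz > 0);
    dim3 block(32, 8, 1);
    dim3 grid((nx + block.x - 1) / block.x,
              (ny - 1 + block.y - 1) / block.y,
              nz);
    row_diff_kernel<<<grid, block>>>(dIn, dOut, nx, ny, nz);
    return cudaGetLastError();
}
```

Neighbouring threads in x touch neighbouring addresses for both reads and the write, so the accesses follow the row-major layout.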
According to my measurements on the TX2, option #2 is the fastest. A thread runs a loop of 12.
Do you think I can use shared memory in this scenario?
You can use shared memory any time you want, but I can’t really guess whether it will be beneficial in a particular case, so I think you will have to try it and see what happens.
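If you want something to try, here is one possible shared-memory variant of the row subtraction, offered as an experiment rather than a recommendation. Each block stages TILE_Y + 1 input rows in shared memory so that every input sample is read from global memory once per block. It assumes row-major planes and the convention out[row] = in[row] - in[row + 1]; the tile sizes and all names are hypothetical. The hardware caches may already absorb the repeated reads, so it may not be faster.

```cuda
#include <cassert>
#include <cuda_runtime.h>

#define TILE_X 32
#define TILE_Y 8

// Shared-memory sketch of the row difference. Must be launched with
// blockDim = (TILE_X, TILE_Y, 1) and gridDim.z = nz.
// Assumed convention: out[row] = in[row] - in[row + 1].
__global__ void row_diff_shared(const float *in, float *out, int nx, int ny, int nz)
{
    __shared__ float tile[TILE_Y + 1][TILE_X];

    int ix  = blockIdx.x * TILE_X + threadIdx.x;
    int iy0 = blockIdx.y * TILE_Y;         // first row handled by this block
    int iy  = iy0 + threadIdx.y;
    int iz  = blockIdx.z;
    const float *plane = in + (size_t)iz * ny * nx;

    // Each thread loads its own row; the first thread row also loads
    // the one extra row needed by the last output row of the tile.
    if (ix < nx && iy < ny)
        tile[threadIdx.y][threadIdx.x] = plane[(size_t)iy * nx + ix];
    if (threadIdx.y == 0 && ix < nx && iy0 + TILE_Y < ny)
        tile[TILE_Y][threadIdx.x] = plane[(size_t)(iy0 + TILE_Y) * nx + ix];
    __syncthreads();                       // reached by every thread in the block

    if (ix < nx && iy < ny - 1)
        out[(size_t)iz * (ny - 1) * nx + (size_t)iy * nx + ix] =
            tile[threadIdx.y][threadIdx.x] - tile[threadIdx.y + 1][threadIdx.x];
}

// Hypothetical host helper; dIn and dOut are device pointers.
cudaError_t launch_row_diff_shared(const float *dIn, float *dOut, int nx, int ny, int nz)
{
    assert(nx > 0 && ny > 1 && nz > 0);
    dim3 block(TILE_X, TILE_Y, 1);
    dim3 grid((nx + TILE_X - 1) / TILE_X, (ny - 1 + TILE_Y - 1) / TILE_Y, nz);
    row_diff_shared<<<grid, block>>>(dIn, dOut, nx, ny, nz);
    return cudaGetLastError();
}
```

Timing this against a plain global-memory version on the target device is the only reliable way to tell which one wins.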
Good luck and have fun. :)