I have an application that allocates 512 threads per block. Each thread needs several local variables, and the biggest one occupies 486 bytes. But my graphics card only has 16 kB of shared memory per block, which works out to 16384/512 = 32 bytes per thread.
It is said that local variables in a kernel reside in shared memory. If that is the case, then my kernel would fail to launch, right?
Global memory is really painfully slow, so you may want to benchmark your code with fewer threads per block so that all the variables fit into the Scalar Processors' registers instead of spilling to global memory. (Say, 64 threads = 256 bytes/thread, 32 threads = 512 bytes/thread, or even 16 threads = 1024 bytes/thread.)
I know it's "unnatural" to launch fewer than 32 threads per block: GPU occupancy will end up under 50%, and everything you read says to launch at least 6×32 = 192 threads per block.
Another point to consider, if 16 threads turn out faster than 512, is to exploit the natural lockstep between threads of the same warp (they share a program counter!) to micro-parallelize some tasks and occupy the GPU more completely… (I am preparing an article describing this technique, which I call "micro-threading".)
Thanks a lot for your reply! You are right, it's better to keep all the variables in registers. But there is always a tradeoff between the number of threads per block and the number of registers the block consumes.
Instead of allocating big local variables inside the kernel, I now use a loop.
The kernel is running very fast now, only 0.2 ms. I think that's because I'm dealing with two relatively small image sets, each of size 512×512×65×sizeof(unsigned short), and my graphics card has a clock rate of 1.19 GHz with 24 multiprocessors.
My other question is about 3D texture fetching. The voxel values of the images are of unsigned short type. When I use tex3D with float voxel coordinates and point filter mode, I get wrong unsigned short outputs. I checked the programming guide, and it says 16-bit unsigned integer elements are OK.
I also tried the simple example below with unsigned short elements and also got strange outputs. The program just does a 3D interpolation.
[codebox]// image sizes: 8×4×4 and 16×4×4
#define USHRT unsigned short int
using namespace std;

texture<USHRT, 3, cudaReadModeElementType> texIn;

__global__ static void CudaTrilinearInterpTex(USHRT *ImageO, int wO, int hO, int dO,
                                              int wI, int hI, int dI)
{
    int const xO = blockIdx.x * blockDim.x + threadIdx.x;
    int const yO = blockIdx.y * blockDim.y + threadIdx.y;
    int const zO = threadIdx.z;
    if (xO < wO && yO < hO && zO < dO)
    {
        int idx = zO * wO * hO + yO * wO + xO;
        ImageO[idx] = tex3D(texIn, ((float)xO) * wI / wO + 0.5f,
                                   ((float)yO) * hI / hO + 0.5f,
                                   ((float)zO) * dI / dO + 0.5f);
    }
}[/codebox]
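For reference, the host-side setup that a kernel like the one above relies on might look roughly like this (a sketch only, using the texture-reference API the kernel already uses; the array and pointer names such as d_volArray and h_vol are made up). One thing worth checking: with cudaReadModeElementType, the programming guide restricts linear filtering to floating-point returned values, so the filter mode here must be cudaFilterModePoint — requesting linear filtering on an integer texture is a common source of strange outputs.

```cuda
// Hypothetical host-side setup for texIn (names are invented for illustration).
cudaChannelFormatDesc desc = cudaCreateChannelDesc<unsigned short>();
cudaExtent extent = make_cudaExtent(wI, hI, dI);

cudaArray *d_volArray;
cudaMalloc3DArray(&d_volArray, &desc, extent);

// Copy the tightly packed input volume h_vol into the 3D array.
cudaMemcpy3DParms p = {0};
p.srcPtr   = make_cudaPitchedPtr(h_vol, wI * sizeof(unsigned short), wI, hI);
p.dstArray = d_volArray;
p.extent   = extent;
p.kind     = cudaMemcpyHostToDevice;
cudaMemcpy3D(&p);

// With cudaReadModeElementType the filter mode must be point:
// linear filtering is only available when the returned value is
// floating point (e.g. with cudaReadModeNormalizedFloat).
texIn.filterMode     = cudaFilterModePoint;
texIn.addressMode[0] = cudaAddressModeClamp;
texIn.addressMode[1] = cudaAddressModeClamp;
texIn.addressMode[2] = cudaAddressModeClamp;

cudaBindTextureToArray(texIn, d_volArray, desc);
```

With point filtering the + 0.5f offsets in the kernel address the center of each voxel, which is the convention tex3D expects for unnormalized coordinates.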
For the micro-threading, it is similar in some ways but a little different, because CUDA's Scalar Processors share the same program counter within a warp (32 threads). I will probably finish my article this weekend.
As for the 3D texture, I'm afraid I can't help you there; I don't use 3D textures in my own developments :-(