texture memory limt

Hello Everyone,

I have an application which allocates 512 threads per block. In each thread I will have to allocate several local variables, and the biggest local variable will occupy 486 bytes memory. But my graphics card only has 16kB shared memory per block, which indicates only 16k/512 = 32B per thread.
It is said that local variables in the kernel reside in shared memory. So if this is the case, then my kernel would not launch, right?

Thanks for any advice!

Yuping

I’m sorry about the title…

I wanted to ask about texture memory limit, but found the answer.

Now the above topic is what I really want to ask…

Yuping

Thread-local variables reside in registers or in local memory (which in turn resides in global memory). Shared memory is not used unless explicitly specified by the keyword.

Thanks, Big_Mac! So now I have nothing to worry about.

Yuping

You will have to worry a lot about that!

Global Memory is really painfully slow, and you may consider benchmarking your code launching less threads per blocks to have all the variables into Scalar Processors Registers instead Global Memory! (Say 64 threads = 256 bytes/threads or 32 threads = 512 bytes / threads, or even 16 threads = 1024 bytes/ thread)

I know it’s “unnatural” to launch less than 32 threads per blocks, with GPU occupancy that will end-up under 50%, and all that you can read says that you have to launch at least 6x32 (192) threads per block.

Another point you may consider, if 16 threads is faster than 512, is to use natural communication between same-warp threads (says, the program counter!) to micro-parallelize some tasks and occupy the GPU completely… (I prepare an article describing this technique called “micro-threading”)

iAPX,

Thanks a lot for your reply! You are right, it’s better to have all variables in the registers. But there is always a tradeoff between how many threads per block and how many registers occupied in the block.

Instead of allocate big local variables inside the kernel, I now use a loop.

The kernel is running very fast right now, only 0.2ms. I think it’s because I’m dealing with two relatively small image sets, each of which is 51251265*sizeof(unsigned short), and my graphics card has a clock rate of 1.19 GHz, with 24 MPs.

My other question is about 3D texture fetching. The voxel values of the images are unsigned short type. When I use tex3D with the voxel coordinates (float), and filter mode point, I got wrong unsigned short outputs. I checked programming guide, it’s OK to have 16-bit unsigned integer elements.

I also tried a simple example below with unsigned short elements. Also got strange outputs. The program is just doing a 3D interpolation.

[codebox]//image size: 844, 1644

include

include

include <cutil.h>

#ifndef USHRT

define USHRT unsigned short int

endif

using namespace std;

texture<USHRT, 3, cudaReadModeElementType> texIn;

global static void CudaTrilinearInterpTex(USHRT *ImageO, int wO, int hO, int dO, int wI, int hI, int dI)

{

int const xO = blockIdx.x * blockDim.x + threadIdx.x;

int const yO = blockIdx.y * blockDim.y + threadIdx.y;

int const zO = threadIdx.z;

if(xO < wO && yO < hO && zO < dO)

{

	int idx = zO*wO*hO + yO*wO + xO;

	ImageO[idx] = tex3D(texIn, ((float)xO)*wI/wO + 0.5f, ((float)yO)*hI/hO + 0.5f, ((float)zO)*dI/dO + 0.5f);		

}

}

int main()

{

int wI, wO, hI, hO, dI, dO; //image sizes, input and output 

USHRT *hIn = NULL;         //input image

USHRT *hOut = NULL;         //output image



cout << "input image width wI: " << endl;

cin >> wI;

cout << "input image height hI: " << endl;

cin >> hI;

cout << "input image depth dI: " << endl;

cin >> dI;

cout << "output image width wO: " << endl;

cin >> wO;

cout << "output image height hO: " << endl;

cin >> hO;

cout << "output image depth dO: " << endl;

cin >> dO;

int NUM_PIX_IN = wI * hI * dI;

int NUM_PIX_OUT = wO * hO * dO;



hIn = (USHRT *)malloc(NUM_PIX_IN*sizeof(USHRT));

hOut = (USHRT *)malloc(NUM_PIX_OUT*sizeof(USHRT));

for(int i = 0; i < NUM_PIX_IN; i++) hIn[i] = (USHRT)(i + NUM_PIX_IN);



for(int z = 0; z < dI; z++)

{		

	cout << "pixels of slice number " << z;

	cout << endl;

	for(int y = 0; y < hI; y++)

	{			

		cout << "pixels of row " << y << endl;

		for(int x = 0; x < wI; x++)

			cout << hIn[wI*hI*z + wI*y + x] << " ";

		cout << endl;

	}

}

cudaArray *cuArray = NULL;

cudaChannelFormatDesc shortTex = cudaCreateChannelDesc<USHRT>();

cudaExtent const ext = {wI, hI, dI};

CUDA_SAFE_CALL(cudaMalloc3DArray(&cuArray, &shortTex, ext));

cudaMemcpy3DParms copyParams = {0};

copyParams.extent = make_cudaExtent(wI, hI, dI);

copyParams.kind = cudaMemcpyHostToDevice;

copyParams.dstArray = cuArray;

copyParams.srcPtr = make_cudaPitchedPtr((void*)hIn, ext.width*sizeof(USHRT),ext.width,ext.height);

CUDA_SAFE_CALL(cudaMemcpy3D(&copyParams));

texIn.normalized = 0;

texIn.filterMode = cudaFilterModeLinear;

for(int dim = 0; dim < 3; dim++)

	texIn.addressMode[dim] = cudaAddressModeClamp;

CUDA_SAFE_CALL(cudaBindTextureToArray(texIn, cuArray, shortTex));

unsigned int timer = 0;

cutCreateTimer(&timer);

cutResetTimer(timer);

cutStartTimer(timer);

USHRT *ImageO = NULL;

CUDA_SAFE_CALL(cudaMalloc((void**)&ImageO, NUM_PIX_OUT*sizeof(USHRT)));

dim3 const block(16,4,4);

dim3 const grid(ceil((float)wO/block.x), ceil((float)hO/block.y), 1);

CudaTrilinearInterpTex <<<grid,block>>> (ImageO,wO,hO,dO,wI,hI,dI);

cudaThreadSynchronize();

cutStopTimer(timer);

CUDA_SAFE_CALL(cudaMemcpy(hOut,ImageO,NUM_PIX_OUT*sizeof(USH

RT),cudaMemcpyDeviceToHost));

CUDA_SAFE_CALL(cudaFree(ImageO));

cout << "kernel run time: " << (cutGetTimerValue(timer)) << " ms" << endl;

for(int z = 0; z < dO; z++)

{

	cout << "pixels of slice number " << z;

	cout << endl;

	for(int y = 0; y < hO; y++)

	{

		cout << "pixels of row " << y << endl;

		for(int x = 0; x < wO; x++)

			cout << hOut[wO*hO*z + wO*y + x] << " ";

		cout << endl;

	}

}

CUDA_SAFE_CALL(cudaUnbindTexture(texIn));

CUDA_SAFE_CALL(cudaFreeArray(cuArray));

free(hIn);

free(hOut);

return 0;

}[/codebox]

If I’m just going to fetch the voxel values, what do I change besides cudaFilterMode?

Btw, you mentioned micro-threading, is that “Micro-Threading: A New Approach to Future RISC”?

Thanks,

Yuping

I have the same problems of 3d texture fetching,
Could anyone help. :">

For the micro-threading it is similar in some ways, but a little different, due to CUDA’s GPU Scalar Processor sharing the same program counter inside a warp (32 threads). I will probably finish my article this week-end.

For the 3D texture, I may not help you at this time, I dont use that on my own developments :-(