Code modified for Textures - Now will run the Kernel only once... ?

Hi all,

I’ve modified a large CUDA program to use textures for one particular array that is constant but too large to fit in constant memory.

This array is read in an essentially random pattern (the accesses cannot be coalesced, except by luck), so the texture cache should help.

Anyway, I’ve implemented the following in my program, and now the kernel runs cleanly only once per program execution. If I launch the kernel inside a CPU loop (which is the normal scenario), it fails with the message: “Cuda error: kernel invocation: the launch timed out and was terminated.”

Anyone have any ideas why that is? I’ve posted relevant parts of the code below.

Texture declaration (global in context)

[codebox]texture<float,1,cudaReadModeElementType> big_Emesh_t;[/codebox]

Assigning the array to texture big_Emesh_t:

[codebox]cudaMalloc((void**)&big_Emesh_d,big_Emesh_size*sizeof(float));[/codebox]

Using tex1Dfetch in a binary search algorithm:

[codebox]__device__ unsigned int textureSearch(unsigned int first, unsigned int last, float key, unsigned int loc)
{
    unsigned int return_val = 0;  // 0 doubles as the "not found yet" marker

    while ((first <= last) && (return_val == 0))
    {
        unsigned int mid = (first + last) / 2;  // compute mid point.

        if (key > tex1Dfetch(big_Emesh_t, mid))
            first = mid + 1;   // repeat search in top half.
        else if (key < tex1Dfetch(big_Emesh_t, mid))
            last = mid - 1;    // repeat search in bottom half.
        else
            return_val = mid;  // found it. return position
    }

    if (return_val == 0)
        return_val = last + 1;  // failed to find key

    return return_val;
}[/codebox]

Never mind big_Emesh_offsets; it just corrects the result to point to the correct location in a separate array.

EDIT: I should note that the result for that one loop is exactly what it should be and matches my ‘pre-texture’ code.

Thanks all!
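
A side note on the textureSearch function posted above, which may or may not be related to the timeout: it uses return_val==0 to mean "not found yet", so a match at index 0 never ends the loop, and last = mid - 1 wraps around to a huge unsigned value when mid is 0. Both cases can only arise if a search range starts at index 0. The sketch below is one way to avoid them, assuming the array is sorted in ascending order. It is written as a lower-bound search over a half-open range, which should return the same index as the original when the values are distinct (the matching position, or the insertion point when the key is absent). The name lowerBoundSearch, the HOST_DEVICE macro, and the plain mesh pointer are placeholders for illustration, not part of the original code.

```cpp
// Lets the same function compile as plain C++ or as CUDA device code.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Hypothetical replacement for textureSearch: returns the first index in
// [first, last] whose value is >= key, or last + 1 if every value is smaller.
// Assumes last < UINT_MAX and ascending order.
// `mesh` is an ordinary pointer here; inside the kernel the mesh[mid] read
// would become tex1Dfetch(big_Emesh_t, mid).
HOST_DEVICE unsigned int lowerBoundSearch(const float* mesh,
                                          unsigned int first,
                                          unsigned int last,
                                          float key)
{
    unsigned int lo = first;
    unsigned int hi = last + 1;                 // half-open range [lo, hi)
    while (lo < hi) {
        unsigned int mid = lo + (hi - lo) / 2;  // never wraps below lo
        float v = mesh[mid];                    // one read per iteration
        if (v < key)
            lo = mid + 1;                       // answer lies above mid
        else
            hi = mid;                           // mid itself may be the answer
    }
    return lo;  // no sentinel value, so index 0 is a valid result
}
```

Because the range shrinks on every iteration, the loop is bounded by roughly log2 of the range size, and each iteration performs a single read instead of two.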


Texture reads can be no slower than global memory reads, correct? I ask because the first run-through of my kernel should have more hits in the same region of memory (i.e., most requests to textureSearch will have the same value for the key) than subsequent runs. But if uncached texture reads are no slower than global memory reads, a cache miss should not slow my program down to the point where it times out after 5 seconds.

Any help is greatly appreciated… this textureSearch function is where approximately 20% of the time is spent, so anything I can do to speed it up would make my day that much better.

Well, last night I quit Gnome, and just ran the prog through the terminal. The code, untouched, worked just fine.

Isn’t that interesting? I didn’t think I was anywhere close to a memory limit, even with X/Gnome running. Does this behaviour make sense??

(I am running Ubuntu 9.04 64-bit, w/ 8800GTS 320MB.)
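
For what it is worth, "the launch timed out and was terminated" is the error the driver's watchdog reports when a single kernel launch runs too long on a GPU that is also driving a display, which would be consistent with the code working once X/Gnome is shut down. If that is what is happening here (an assumption, not something confirmed above), one common workaround is to split the work into several shorter launches. The helper below is a minimal host-side sketch of that idea; Batch, splitWork, myKernel, and the batch size are all made-up names and values for illustration.

```cpp
#include <vector>

// Hypothetical helper type: one entry per kernel launch.
struct Batch {
    unsigned int offset;  // index of the first work item in this launch
    unsigned int count;   // number of work items in this launch
};

// Split `total` work items into launches of at most `maxPerLaunch` items.
std::vector<Batch> splitWork(unsigned int total, unsigned int maxPerLaunch)
{
    std::vector<Batch> batches;
    if (maxPerLaunch == 0)
        return batches;  // invalid batch size; caller should check for this
    for (unsigned int offset = 0; offset < total; ) {
        unsigned int remaining = total - offset;
        Batch b;
        b.offset = offset;
        b.count = remaining < maxPerLaunch ? remaining : maxPerLaunch;
        batches.push_back(b);
        offset += b.count;
    }
    return batches;
}

// Intended use on the host (myKernel and its arguments are placeholders):
//   std::vector<Batch> batches = splitWork(n, 65536);
//   for (size_t i = 0; i < batches.size(); ++i) {
//       myKernel<<<blocks, threads>>>(batches[i].offset, batches[i].count);
//       cudaThreadSynchronize();  // cudaDeviceSynchronize() in newer toolkits
//   }
```

Each launch then only has to finish its own batch within the time limit, at the cost of some extra launch overhead.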

Two things to add: 1) It runs slower than just reading from global memory. Huh?

More importantly: 2) As I add more texture fetches to the code (the same texture, mind you), the problem reappears, even in the terminal; it just takes more texture fetches to trigger it than when running inside Gnome.

What is going on here?

Well, I forgot to pass the size argument to cudaBindTexture… but adding it made no difference, FYI.

Anyone have any suggestions??