Hi all,
I’ve modified a large CUDA program to use a texture for one particular array that is constant but too large to fit in constant memory.
This array is read in an essentially random pattern (the accesses can't be coalesced except by luck), so the texture cache should help.
Anyway, I’ve added the code below to my program, and now the kernel runs cleanly only once per program execution. When the kernel is launched inside a CPU loop (the normal scenario), it fails with the message: “Cuda error: kernel invocation: the launch timed out and was terminated.”
Does anyone have an idea why? I’ve posted the relevant parts of the code below.
Texture declaration (at global scope):
[codebox]texture<float,1,cudaReadModeElementType> big_Emesh_t;
[/codebox]
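Binding the array to the texture big_Emesh_t:
[codebox]// allocate the device copy of the mesh and upload it
cudaMalloc((void**)&big_Emesh_d, big_Emesh_size*sizeof(float));
cudaMemcpy(big_Emesh_d, big_Emesh, big_Emesh_size*sizeof(float), cudaMemcpyHostToDevice);
// bind the linear device memory to the texture reference (offset not needed)
cudaBindTexture(0, big_Emesh_t, big_Emesh_d);
checkCUDAError("texture");[/codebox]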
Using tex1Dfetch in a binary search:
[codebox]__device__ unsigned int textureSearch(unsigned int first, unsigned int last, float key, unsigned int loc)
{
    unsigned int return_val = 0;
    while ((first <= last) && (return_val == 0))
    {
        unsigned int mid = (first + last) / 2;      // compute midpoint
        if (key > tex1Dfetch(big_Emesh_t, mid))
            first = mid + 1;                        // repeat search in top half
        else if (key < tex1Dfetch(big_Emesh_t, mid))
            last = mid - 1;                         // repeat search in bottom half
        else
            return_val = mid;                       // found it: return position
    }
    if (return_val == 0)
        return_val = last + 1;                      // failed to find key
    return_val -= big_Emesh_offsets_d[loc];        // offset correction into the separate array
    return return_val;
}[/codebox]
Never mind big_Emesh_offsets_d; it just corrects the result to point to the right location in a separate array.
EDIT: I should note that the result of that one iteration is exactly what it should be and matches my ‘pre-texture’ code.
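In case the search itself is what hangs: as posted, textureSearch uses return_val == 0 both as "not found yet" and as a possible result, so an exact match at index 0 leaves first and last unchanged and the loop never ends. Also, with unsigned indices, `mid - 1` wraps to UINT_MAX when mid is 0, which sends the search far outside the array. Below is a minimal sketch of the same search with both issues removed, written as plain C++ so it can be checked on the host. The names HOST_DEVICE, meshSearch and checkMeshSearch are hypothetical; it assumes the mesh has fewer than 2^31 entries so signed int indices are safe. In the kernel, the mesh[mid] read would become tex1Dfetch(big_Emesh_t, mid), and the big_Emesh_offsets_d[loc] correction would be applied to the result as before.

```cpp
#include <cassert>

// Under nvcc, mark the search callable from host and device; with a plain
// C++ compiler the macro expands to nothing so the logic can be tested on the CPU.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Binary search over the sorted range mesh[first..last] (inclusive).
// Returns the index of an exact match, or first == last + 1 on a miss,
// mirroring the "last + 1" result of textureSearch.
// Signed indices mean `mid - 1` can go to -1 instead of wrapping to UINT_MAX,
// and returning directly on a match removes the 0-as-sentinel ambiguity.
HOST_DEVICE int meshSearch(const float* mesh, int first, int last, float key)
{
    while (first <= last) {
        int mid = first + (last - first) / 2;  // avoids overflow of first + last
        float value = mesh[mid];               // in the kernel: tex1Dfetch(big_Emesh_t, mid)
        if (key > value)
            first = mid + 1;                   // continue in the top half
        else if (key < value)
            last = mid - 1;                    // continue in the bottom half
        else
            return mid;                        // exact match, including index 0
    }
    return first;                              // miss: first == last + 1 here
}

// Host-only sanity check on a small sorted mesh.
void checkMeshSearch()
{
    const float mesh[] = {0.0f, 1.0f, 2.5f, 4.0f};
    assert(meshSearch(mesh, 0, 3, 0.0f) == 0);   // match at index 0 terminates
    assert(meshSearch(mesh, 0, 3, -1.0f) == 0);  // below range, no unsigned wrap
    assert(meshSearch(mesh, 0, 3, 3.0f) == 3);   // between 2.5 and 4.0
    assert(meshSearch(mesh, 0, 3, 9.0f) == 4);   // above range
}
```

Fetching the value once per step also halves the texture reads compared with calling tex1Dfetch in both comparisons.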
Thanks all!
Adam