Launching kernel(s) on data whose length is not a power of two

Dear CUDA fellows,

If this topic was answered before, my apologies, but the search function doesn’t work for me.

My question is about how to launch kernel(s) to process data whose length is not a power of two:

I know two approaches.

The first is to launch one kernel on the closest power of two and then launch a second kernel on the rest of the data, which avoids launching idle threads.

The second is to launch just one kernel that calculates the thread id (aka tid), and before any other code I must put an IF statement:

if ( tid < Total_length )

The first approach never launches an idle thread, but it requires two kernel launches, and the second kernel must be arranged to access the rest of the data (offset). The second approach avoids the second kernel and any offset calculation, but it launches idle threads in the last block.

I am just looking for a nice and efficient method. What do you guys usually do to overcome this situation?

Daniel.

Well, it is not just about power-of-two data lengths. I mean any data length that is not a multiple of 128 or 192 (the block size), you know what I mean.
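For reference, the two-launch approach could be sketched roughly as below. This is only an illustration under assumptions: the kernel `double_from_offset`, the helper `double_two_launches`, and the block size of 128 are hypothetical names and values, not anything from a real code base. The first launch covers the full blocks and the second launch covers the leftover elements through an offset.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: doubles one element per thread.
// `offset` shifts the index so a second launch can cover the tail of the array.
__global__ void double_from_offset(float *data, int offset)
{
    int tid = offset + blockIdx.x * blockDim.x + threadIdx.x;
    data[tid] *= 2.0f;  // no bounds check: the launch must match the data exactly
}

// Hypothetical host helper: full blocks first, then one partial block.
std::vector<float> double_two_launches(std::vector<float> host, int block_size = 128)
{
    int n = static_cast<int>(host.size());
    if (n == 0) return host;

    float *dev = nullptr;
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemcpy(dev, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    int full_blocks = n / block_size;  // blocks that are completely filled
    int remainder   = n % block_size;  // elements left over after those blocks

    if (full_blocks > 0)
        double_from_offset<<<full_blocks, block_size>>>(dev, 0);
    if (remainder > 0)
        double_from_offset<<<1, remainder>>>(dev, full_blocks * block_size);

    // cudaMemcpy waits for both kernels on the default stream to finish.
    cudaMemcpy(host.data(), dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return host;
}
```

Note that the hardware schedules threads in warps of 32, so a remainder block of, say, 5 threads still occupies a whole warp with 27 inactive lanes. The second launch therefore does not really eliminate idle lanes; it only moves the bookkeeping to the host.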

The second approach is fine and what most people do. The cost of the branch is negligible.
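A minimal sketch of that second approach, assuming a hypothetical kernel `add_one_guarded` and host helper `add_one_on_gpu` (both names and the block size of 128 are made up for illustration): the grid is rounded up so that it covers every element, and the guard keeps the surplus threads of the last block from touching memory.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: adds 1 to each element, guarded by a bounds check.
__global__ void add_one_guarded(float *data, int total_length)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < total_length)  // threads past the end of the data do nothing
        data[tid] += 1.0f;
}

// Hypothetical host helper: rounds the block count up to cover all elements.
std::vector<float> add_one_on_gpu(std::vector<float> host, int block_size = 128)
{
    int n = static_cast<int>(host.size());
    if (n == 0) return host;

    float *dev = nullptr;
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemcpy(dev, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // Ceiling division: e.g. n = 1000, block_size = 128 gives 8 blocks.
    int blocks = (n + block_size - 1) / block_size;
    add_one_guarded<<<blocks, block_size>>>(dev, n);

    cudaMemcpy(host.data(), dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return host;
}
```

With 1000 elements and 128 threads per block this launches 1024 threads, so only 24 threads in the last block fail the test and exit immediately.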

Just thought I’d show my way of doing things here. If there is a better way, please feel free to rip it apart.

Assuming you have a large number N of 1D elements to perform some computation on, where N is arbitrary and not a power of two, with a single block:

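[codebox]int threadpos = threadIdx.x;
while(1)
{
    // Stop once this thread has walked past the end of the data
    if(threadpos>=N)
        break;
    // Do stuff here
    // Move on to the next element owned by this thread
    threadpos+=blockDim.x;
}[/codebox]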

Or for multiple blocks too:

[codebox]int threadpos = threadIdx.x+blockDim.x*blockIdx.x;
int increment = gridDim.x*blockDim.x;
while(1)
{
    if(threadpos>=N)
        break;
    //Do stuff here
    threadpos+=increment;
}[/codebox]
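To make the multi-block version concrete, here is a self-contained sketch of the same idea written as a `for` loop. The kernel `square_grid_stride`, the helper `square_on_gpu`, and the launch configuration of 4 blocks of 128 threads are hypothetical choices for illustration only. Each thread starts at its global index and advances by the total number of threads in the grid, so a fixed-size grid can handle any N.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: squares every element using a grid-stride loop.
__global__ void square_grid_stride(float *data, int n)
{
    int stride = gridDim.x * blockDim.x;  // total threads in the grid
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        data[i] = data[i] * data[i];
}

// Hypothetical host helper: the grid size is fixed and independent of n.
std::vector<float> square_on_gpu(std::vector<float> host,
                                 int blocks = 4, int block_size = 128)
{
    int n = static_cast<int>(host.size());
    if (n == 0) return host;

    float *dev = nullptr;
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemcpy(dev, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    square_grid_stride<<<blocks, block_size>>>(dev, n);

    cudaMemcpy(host.data(), dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return host;
}
```

If the grid already contains at least N threads, the loop body runs at most once per thread and the loop behaves exactly like the plain `if ( tid < N )` guard, so threads that only need one pass lose nothing by using it.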

I changed one of my kernels to use this approach, but the remaining threads make the program crash. I’m trying to figure it out. The error message is:

Microsoft C++ exception: cudaError at memory location 0x0003c928…

Microsoft C++ exception: cudaError at memory location 0x0003c924…

It’s strange because the remaining threads do nothing but declare variables at startup. I ran emudebug and it looks like all the non-idle threads run fine, but when I step over the first idle thread the program crashes immediately. What can I do?
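One way to narrow down a crash like this is to check the status the runtime reports right after the launch and again after the kernel has finished. The sketch below is only an assumption about how that could look: `guarded_noop` and `launch_and_check` are hypothetical names, and `cudaDeviceSynchronize` is the current runtime call (older toolkits used `cudaThreadSynchronize` for the same purpose).

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical kernel: idle threads return before touching any memory.
__global__ void guarded_noop(float *data, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;   // nothing below this line runs for idle threads
    data[tid] = data[tid];  // placeholder for the real work
}

// Hypothetical host helper: returns the first CUDA error it encounters.
cudaError_t launch_and_check(int n, int block_size = 128)
{
    float *dev = nullptr;
    cudaError_t err = cudaMalloc(&dev, n * sizeof(float));
    if (err != cudaSuccess) return err;
    cudaMemset(dev, 0, n * sizeof(float));

    int blocks = (n + block_size - 1) / block_size;
    guarded_noop<<<blocks, block_size>>>(dev, n);

    err = cudaGetLastError();           // errors in the launch configuration
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();  // errors raised while the kernel ran

    if (err != cudaSuccess)
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));

    cudaFree(dev);
    return err;
}
```

A frequent cause of this kind of failure, though not necessarily the one here, is an array access that happens before the bounds check, so that the threads past the end of the data read or write out of range.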

I don’t have a clear picture of what the ‘while’ is for there; my threads don’t need to loop over the data again for the computation.
