.uni is a suffix for the PTX branch, call, and return instructions (bra, call, ret). You have to manually edit the PTX assembly of your program to insert it. It merely asserts that the branch is taken in the same direction by every thread in a warp, which the PTX assembler can use to optimize the code for a particular GPU. The gains are not that dramatic (I once implemented this for an architecture without hardware support for divergence, at a cost of two extra instructions per branch), and NVCC should already detect a significant number of these cases. Unless you are writing a compiler for CUDA, I wouldn't worry about this at all.
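To make "taken in the same direction by every thread in a warp" concrete, here is a minimal sketch in CUDA. The kernel `branch_demo`, the helper `run_branch_demo`, and the `mode` parameter are hypothetical names made up for this illustration. Whether the compiler emits an actual branch for either `if` (rather than predicated instructions) depends on the compiler and target, so treat the comments as a description of the conditions, not of the generated PTX.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel, for illustration only.
__global__ void branch_demo(int* out, int n, int mode)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    int v = 0;

    // Warp-uniform condition: `mode` is a kernel argument, so every
    // thread in a warp evaluates it identically. In hand-edited PTX,
    // a branch on such a predicate is the kind that could be written
    // as `@%p bra.uni LABEL;`.
    if (mode == 1) v += 10;

    // Potentially divergent condition: it depends on the thread index,
    // so odd and even lanes of the same warp disagree. Marking a branch
    // like this as .uni would be incorrect.
    if (threadIdx.x & 1) v += 1;

    out[i] = v;
}

// Hypothetical host helper: launches one block of 64 threads and
// checks the result on the CPU. Returns true on success.
bool run_branch_demo()
{
    const int n = 64;
    int h_out[n];
    int* d_out = nullptr;

    if (cudaMalloc(&d_out, n * sizeof(int)) != cudaSuccess) return false;

    branch_demo<<<1, n>>>(d_out, n, 1);

    // cudaMemcpy waits for the kernel and reports any earlier error.
    cudaError_t err = cudaMemcpy(h_out, d_out, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i) {
        int expected = 10 + (i & 1);  // uniform part + per-lane part
        if (h_out[i] != expected) return false;
    }
    return true;
}
```

Compiling with `nvcc -ptx` and looking at the output is the easiest way to see which branches, if any, the compiler has already marked as uniform.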
Hello, I tried but failed! The real problem is this:
[codebox]
__global__ void cuMykernel(float* g_odata, int width, int height)
{
unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;
__shared__ float p,q,q_out,out;
__shared__ int pos,i,j,mask,flag,idx,idy;