Binary Arithmetic

Hi, I’m having a problem with the following code. When I time it, the kernel sometimes runs in 10 ms, sometimes in 0.1 ms, and sometimes it fails with an “invalid configuration argument” error. I launch it with 256 threads per block and (1600*1200*sizeof(float)/256) blocks (30,000 blocks). The purpose of the kernel is to translate each byte of Src into a sequence of bits, so the cod array has 256*80 entries and length has 256. To encode bit by bit I have to use bit operations, so I set a mask and use OR and AND operations to set or clear a specific bit. The next step will be to copy the result into the Dst array, but first I need to understand why the kernel behaves so differently from run to run.

Thank you all, A.

__global__ void Encode(float *Src,unsigned char *Dst,bool *cod, unsigned int *length){
	const int tid=threadIdx.x;
	const int bid=blockIdx.x*blockDim.x;
	unsigned char mycod[10];
	unsigned char mask;
	unsigned char *c=(unsigned char*)Src;
	unsigned char curr=c[tid+bid];
	unsigned int len=length[curr];
	//#pragma unroll
	for(int i=0;i<len;i++){

			mask <<= i%8;
			mycod[i%8] |=mask;

			mask <<= i%8;
			mask= ~mask;
			mycod[i%8] &=mask;
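
For comparison, here is a minimal sketch of setting and clearing a single bit in a byte buffer with a mask, written as plain C so it can be checked on the host. The helper names `set_bit`, `clear_bit` and `write_bit` are hypothetical and not from the post. The sketch assumes that bit i lives in byte i/8 at position i%8, and that the mask is rebuilt from 1 for every bit instead of being shifted cumulatively.

```c
#include <stddef.h>

/* Hypothetical helpers, not from the original post.
   Assumption: bit i of the code is stored in byte i/8, at position i%8. */

/* Set bit i: OR with a mask that has only that bit on. */
static void set_bit(unsigned char *buf, size_t i)
{
    unsigned char mask = (unsigned char)(1u << (i % 8));
    buf[i / 8] |= mask;
}

/* Clear bit i: AND with the inverted mask. */
static void clear_bit(unsigned char *buf, size_t i)
{
    unsigned char mask = (unsigned char)(1u << (i % 8));
    buf[i / 8] &= (unsigned char)~mask;
}

/* Write one bit of the code; value is treated as a boolean. */
static void write_bit(unsigned char *buf, size_t i, int value)
{
    if (value)
        set_bit(buf, i);
    else
        clear_bit(buf, i);
}
```

A zero-initialized 10-byte buffer used this way holds up to 80 bits, which appears to match the 80 entries per symbol in cod.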




What happens if you run your program under cuda-memcheck?

Thanks for the reply, but:

========= ERROR SUMMARY: 0 errors

always, even when the run gives the “invalid configuration argument” error…

Hmm. How do you launch the kernel?

I launch the kernel in the classic way,


grid and threads always have the same values I gave in the first post; da is already in device memory, and dcod and dlength are copied just before the kernel call…

I found where the problem is… The grid variable changes at runtime, even though I set it as a constant and even though I had already used it once to call another kernel. Sometimes grid.x > 65,000, sometimes it is correct at 30,000, other times it’s 257, and so on… What the hell is happening???
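
One way to narrow this down is to recompute the block count right before each launch and check it against the device limit, rather than trusting a variable set earlier. Below is a minimal host-side sketch in plain C. It assumes a 1D grid and the 65535 limit on grid.x that applies to devices of compute capability below 3.0; `blocks_needed` and `grid_x_is_valid` are hypothetical helper names.

```c
#include <stddef.h>

/* Assumption: grid.x limit for compute capability 1.x/2.x devices.
   Newer devices allow far more; query cudaDeviceProp::maxGridSize to be sure. */
#define MAX_GRID_X 65535u

/* Number of blocks needed to cover n_bytes with one thread per byte (rounded up). */
static size_t blocks_needed(size_t n_bytes, size_t threads_per_block)
{
    return (n_bytes + threads_per_block - 1) / threads_per_block;
}

/* A launch with zero blocks, or with more than MAX_GRID_X blocks in x,
   is rejected with "invalid configuration argument". */
static int grid_x_is_valid(size_t blocks)
{
    return blocks > 0 && blocks <= MAX_GRID_X;
}
```

For the sizes in the first post, `blocks_needed(1600*1200*sizeof(float), 256)` gives 30000, which passes the check, while a value above 65535 would explain the error on such a device. If the value differs between launches, that would suggest something on the host side is overwriting the variable, which cuda-memcheck would not report since it checks device memory accesses.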