dct8x8 code in CUDA 4.0 SDK


I am trying to run the dct8x8 code example from the CUDA 4.0 SDK. It runs perfectly on its own. But if I call CUDA functions such as cudaMemcpy or cudaThreadSynchronize after it, they return “unknown error”.
What I have actually done is something like this:

if (isCameraOutOfFocus())
//printf("Camera Out of Focus\n");
cutilSafeCall(cudaMemcpy2D(In_D_2d, DeviceStride_char * sizeof(char), In, ImgStride * sizeof(char), frame_width * sizeof(char), frame_height, cudaMemcpyHostToDevice));

pGMM_long = cvCreateFastBgGMM(pGMMParams_long,d_pGMMParams_long, In_rgb, frame_width, frame_height, 1);
pGMM = cvCreateFastBgGMM(pGMMParams, d_pGMMParams, In_rgb, frame_width, frame_height, 0);

cvUpdateFastBgGMM(d_pGMMParams_long, pGMM_long, In_rgb, frame_width, frame_height, 1);
cvUpdateFastBgGMM(d_pGMMParams, pGMM, In_rgb, frame_width, frame_height, 0);

cutilSafeCall(cudaMemcpy(Fn_gmm_long, pGMM_long->d_outputImg2, framesize*sizeof(char), cudaMemcpyDeviceToHost));

Inside the isCameraOutOfFocus function I launch the CUDAkernel2DCT kernel. No error is reported by cudaMemcpy2D or by any CUDA function called in cvCreateFastBgGMM and cvUpdateFastBgGMM, but the cudaMemcpy on the last line fails. The CUDAkernel2DCT kernel ends with this code:
for(unsigned int i = 0; i < BLOCK_SIZE; i++)
dst[i * ImgStride] = bl_ptr[i * KER2_SMEMBLOCK_STRIDE];
If I comment out the “bl_ptr[i * KER2_SMEMBLOCK_STRIDE]” part, the “unknown error” goes away. So this line somehow affects other CUDA functions, but not all of them.
I don’t know if I am making sense, but I don’t know how else to describe it.
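
As far as I understand, kernel launches are asynchronous, so a fault inside a kernel is usually reported by a later call that synchronizes with the device, not at the launch itself. That could explain why the error only shows up at the final cudaMemcpy. A minimal sketch of how the error could be pinned to the kernel, by checking cudaGetLastError and cudaDeviceSynchronize right after the launch, is below. The kernel writeColumn and the helpers checkCuda and runCheckedLaunch are hypothetical stand-ins for illustration, not code from the SDK sample:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the column write at the end of CUDAkernel2DCT:
// each thread writes `rows` values down one column of a pitched image.
__global__ void writeColumn(float *dst, int stride, int width, int rows)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col >= width) return;              // skip threads past the image edge
    for (int i = 0; i < rows; i++)
        dst[i * stride + col] = 1.0f;      // stride is in elements, not bytes
}

// Hypothetical helper: report the first pending error at a named checkpoint.
static bool checkCuda(const char *where)
{
    cudaError_t err = cudaGetLastError();  // errors from the launch itself
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();     // errors raised while the kernel ran
    if (err != cudaSuccess) {
        fprintf(stderr, "%s: %s\n", where, cudaGetErrorString(err));
        return false;
    }
    return true;
}

// Allocate a pitched image, launch the kernel, and check it immediately.
static bool runCheckedLaunch(int width, int rows)
{
    float *d_img = NULL;
    size_t pitchBytes = 0;
    if (cudaMallocPitch((void **)&d_img, &pitchBytes,
                        width * sizeof(float), rows) != cudaSuccess)
        return false;

    // cudaMallocPitch returns the pitch in bytes; the kernel indexes floats.
    int stride  = (int)(pitchBytes / sizeof(float));
    int threads = 64;
    int blocks  = (width + threads - 1) / threads;

    writeColumn<<<blocks, threads>>>(d_img, stride, width, rows);
    bool ok = checkCuda("writeColumn");

    cudaFree(d_img);
    return ok;
}

int main()
{
    bool ok = runCheckedLaunch(640, 8);
    printf("kernel %s\n", ok ? "completed without error" : "failed");
    return ok ? 0 : 1;
}
```

If a check like this fails right after CUDAkernel2DCT, the problem is in the kernel itself (for example a write past the end of dst) and not in the later cudaMemcpy.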