cudaHostAlloc and cudaHostGetDevicePointer seem to slow kernels

I am creating a zero-copy (mapped) shared memory region between my Tesla and my Intel CPU with the following statements:

char *a = NULL;
char *b = NULL;
HANDLE_ERROR( cudaSetDeviceFlags(cudaDeviceMapHost) ); // enable mapped (zero-copy) host memory
HANDLE_ERROR( cudaHostAlloc((void **)&a, 32, ( cudaHostAllocMapped | cudaHostAllocPortable )) );
HANDLE_ERROR( cudaHostGetDevicePointer((void **)&b, (void *)a, 0) ); // device alias of a (note: a, not b)

my_kernel<<<blocks,threads>>>(b); // kernel declared as __global__ void my_kernel(char *b)

Inside the kernel:

*b = 0;    // slow
// *b = 0; // very fast when commented out

My code is not trivial and takes a while to complete. What I have discovered is that if *b = 0 is present,
the CUDA kernels run very slowly, but if it is commented out I see an orders-of-magnitude increase in performance. Can anybody tell me why?

I think it depends largely on how the hidden part of the code uses b.

If computing the value stored through b is the main work of the kernel, then commenting out the write to global memory lets the compiler eliminate the entire computation as dead code, turning the kernel into a no-op. In that case you are not measuring faster memory access; you are measuring a kernel that does nothing.
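A minimal sketch of this effect (the loop below is a hypothetical stand-in for the real work, which is not shown in the question):

__global__ void my_kernel(char *b)
{
    // Stand-in for an expensive computation.
    char acc = 0;
    for (int i = 0; i < 1000000; ++i)
        acc += (char)(i * i);

    *b = acc;    // with this store, the compiler must keep the loop
    // *b = acc; // commented out: acc is unused, so nvcc can remove
                 // the loop entirely, and the "kernel" finishes instantly
}

The same reasoning applies to the original *b = 0 if the store is the only observable side effect of the kernel.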