Since CUDA kernel launches are asynchronous, the full processing power of the CPU is in fact available while the GPU kernel is running, unless you use synchronization calls that cause the CPU to busy-wait.
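For instance, here is a minimal sketch (with a hypothetical kernel and placeholder host work) of how that overlap plays out; the launch returns immediately and the host thread keeps working until it hits an explicit synchronization point:

#include <cuda_runtime.h>

__global__ void longRunningKernel(float *data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] *= 2.0f;                       // placeholder device work
}

void launchAndOverlap(float *d_data, int n)
{
    longRunningKernel<<<(n + 255) / 256, 256>>>(d_data, n); // returns immediately

    // ... do useful CPU work here; it runs concurrently with the kernel ...

    cudaDeviceSynchronize();                              // blocks the host until the kernel finishes
}

Whether that final wait spins or yields the CPU can be influenced with cudaSetDeviceFlags(), e.g. cudaDeviceScheduleBlockingSync.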
However, the GUI freezes: GUI updates also require the GPU, which is completely taken up by the CUDA kernel and therefore unavailable for graphics while the kernel is running.
A straightforward workaround is to acquire a cheap low-end GPU (for example, I have used a Quadro 600 in the past for this purpose, but there are likely cheaper options than that) and use that to drive the display, while the expensive high-performance GPU remains reserved for CUDA apps.
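If you go that route (or simply want to check which of your GPUs is subject to the display watchdog), something along these lines should work; the device ordering and the choice of selection criterion are assumptions about your setup, so verify them against nvidia-smi:

#include <cuda_runtime.h>

// Pick the first GPU that is not constrained by the run-time watchdog.
// kernelExecTimeoutEnabled is typically non-zero on the GPU driving a display.
int pickComputeDevice(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        if (!prop.kernelExecTimeoutEnabled) {
            cudaSetDevice(dev);   // route subsequent CUDA work to this GPU
            return dev;
        }
    }
    return -1;                    // every device has the watchdog enabled
}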
Getting back to your original question: given that the code seems to be bug-free and is merely affected by the time-out, try pre-computing the stride at the start of the kernel:
const int stride = d_cholp / (int)sizeof(float); // convert the row pitch from bytes to elements
Then later use the stride in the index computation:
tmp2 = _chol_[ cb + i*stride + j ]; // cb = cluster offset, i = row, j = column
That way you avoid the magic constant while retaining good performance.
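Put together, a kernel using this pattern might look roughly like the following; the signature and the names useCholesky, nRows, and nCols are made up for illustration, and d_cholp is assumed to be the row pitch of _chol_ in bytes, as in your original code:

__global__ void useCholesky(const float *_chol_, size_t d_cholp,
                            int nRows, int nCols, int cb)
{
    const int stride = (int)(d_cholp / sizeof(float));   // pitch in bytes -> pitch in elements

    int i = blockIdx.y * blockDim.y + threadIdx.y;       // row
    int j = blockIdx.x * blockDim.x + threadIdx.x;       // column

    if (i < nRows && j < nCols) {
        float tmp2 = _chol_[cb + i * stride + j];        // cb = cluster offset
        // ... use tmp2 ...
    }
}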