Branching on device


I would like to run code that looks like this:

operationA<<<grid, threads>>>(matrixA);
if (matrixA[i][j] == 0.)   // matrixA lives in device memory
    operationB<<<grid, threads>>>(matrixA);
operationC<<<grid, threads>>>(matrixA);

For the moment, I get the value of matrixA[i][j] by copying it from the device to the host and doing the comparison on the host. However, I guess this is far from optimal, since it requires a device-to-host transfer and forces the host to wait for operationA before it can launch anything else.
Is there a better solution that keeps everything on the device?
What if operationA, operationB and operationC are cuBLAS functions (i.e., I cannot modify their device code)?
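
For reference, here is a minimal sketch of what my current host-side check looks like. The kernel bodies are placeholders I made up for this post (my real operations are different), and the names `runPipeline`, `d_matrixA`, `rows` and `cols` are hypothetical; I also assume the matrix is stored as a flat row-major array of doubles.

```cuda
#include <cuda_runtime.h>

// Placeholder kernels (assumption): stand-ins for the real operationA/B/C.
__global__ void operationA(double* m, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) m[idx] = 0.0;   // placeholder: set every element to zero
}

__global__ void operationB(double* m, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) m[idx] += 1.0;  // placeholder: add one to every element
}

__global__ void operationC(double* m, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) m[idx] *= 2.0;  // placeholder: double every element
}

// Runs A, copies element (i, j) back to the host, branches there, then runs C.
// Returns element (i, j) as it is after operationC.
double runPipeline(int rows, int cols, int i, int j) {
    const int n = rows * cols;
    double* d_matrixA = nullptr;
    cudaMalloc(&d_matrixA, n * sizeof(double));

    const int threads = 256;
    const int grid = (n + threads - 1) / threads;

    operationA<<<grid, threads>>>(d_matrixA, n);

    // Copy a single element back. This blocking copy on the default stream
    // waits for operationA to finish, which is the stall I would like to avoid.
    double value = 0.0;
    cudaMemcpy(&value, d_matrixA + i * cols + j, sizeof(double),
               cudaMemcpyDeviceToHost);

    // The branch is taken on the host.
    if (value == 0.0)
        operationB<<<grid, threads>>>(d_matrixA, n);
    operationC<<<grid, threads>>>(d_matrixA, n);

    // Read the same element again so the caller can inspect the result.
    cudaMemcpy(&value, d_matrixA + i * cols + j, sizeof(double),
               cudaMemcpyDeviceToHost);
    cudaFree(d_matrixA);
    return value;
}
```

Error checking on the CUDA calls is left out to keep the sketch short. The part I am asking about is the `cudaMemcpy` between operationA and the `if`.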

Another thing: can cuBLAS functions be called directly from device code?

Thank you very much in advance!