What is the practical difference between these two?

  1. kernel<<<1,e>>>(a,b)
  2. kernel<<<e,1>>>(a,b)

Both a and b are arrays.

One launches one block of e threads, and the other launches e blocks of 1 thread. The total thread count is the same, but the hardware utilization on the device will be different. Neither approach is good for performance on the GPU.
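To make the two shapes concrete, here is a minimal sketch. It assumes the launches being compared are `<<<1, e>>>` and `<<<e, 1>>>`; the kernel `record_index`, the helper `total_threads`, and the value of `e` are made up for illustration and are not from the question.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: each thread writes its global index.
// The same formula works for both launch shapes.
__global__ void record_index(int *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = i;
}

// Host-side helper (illustrative): total threads in a 1-D launch.
int total_threads(int blocks, int threads_per_block)
{
    return blocks * threads_per_block;
}

int main()
{
    const int e = 64;  // assumed size; a single block cannot exceed 1024 threads
    int *d_out = nullptr;
    cudaMalloc(&d_out, e * sizeof(int));

    // One block of e threads: blockIdx.x is always 0, threadIdx.x runs 0..e-1.
    record_index<<<1, e>>>(d_out);

    // e blocks of 1 thread: threadIdx.x is always 0, blockIdx.x runs 0..e-1.
    record_index<<<e, 1>>>(d_out);

    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}
```

Both launches write the same `e` values; only the way the work is split into blocks changes.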

Why, and which one is better for GPU utilisation?

The total thread count should ideally be on the order of 10,000 or more, and you would typically want around 128 to 256 threads per thread block (the per-block count should definitely be a multiple of 32, the warp size). The exact approach will differ based on the use case; we don’t know anything about yours.
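As a rough sketch of that sizing advice, and not a recommendation tuned to any particular workload: pick a block size such as 256, then round the block count up so every element is covered. The kernel `add_arrays`, the helper `blocks_for`, and the array length `n` below are hypothetical names and values chosen for the example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: adds b into a, one element per thread.
__global__ void add_arrays(float *a, const float *b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)  // the last block may contain threads past the end of the arrays
        a[i] += b[i];
}

// Number of blocks needed to cover n elements, rounding up.
int blocks_for(int n, int threads_per_block)
{
    return (n + threads_per_block - 1) / threads_per_block;
}

int main()
{
    const int n = 1 << 20;              // assumed problem size
    const int threads_per_block = 256;  // a multiple of 32
    const int blocks = blocks_for(n, threads_per_block);

    float *a = nullptr, *b = nullptr;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    cudaMemset(a, 0, n * sizeof(float));
    cudaMemset(b, 0, n * sizeof(float));

    add_arrays<<<blocks, threads_per_block>>>(a, b, n);

    // Report any launch or execution error.
    cudaError_t err = cudaDeviceSynchronize();
    printf("status: %s\n", cudaGetErrorString(err));

    cudaFree(a);
    cudaFree(b);
    return 0;
}
```

The bounds check in the kernel is what lets the block count be rounded up safely when `n` is not a multiple of the block size.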