Is there any advantage to using reinterpret_cast<T*>?

Example:
I have a pointer to a device array of floats (allocated as float), but I intend to read it in the kernel as float4.

Will there be any performance difference (optimizations) between using reinterpret_cast within the kernel and casting in the kernel call from the host?

In other words, will it be 'fundamentally' the same if I do this:

(calling from host)

kernel<<<…>>>((float4*)(&arr[0]), …)

Or this:

(within the kernel, with arr as float*, assuming correct indexing is done)

… = reinterpret_cast<float4*>(arr)[offset];

EDIT: So far, in my testing of both methods, there does not seem to be much of a difference. Another issue is that some of the aligned types will be loaded via __ldg().
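For anyone who wants to reproduce the comparison, here is a minimal sketch of the two variants. The kernel names, array size and launch configuration are made up for illustration. It assumes the array comes from cudaMalloc (which returns pointers aligned to at least 256 bytes, so the 16-byte alignment float4 needs is satisfied) and that the element count is a multiple of 4. A pointer cast only changes the type the compiler sees, so I would expect both kernels to issue the same loads, but the sketch only checks that the results match, not the generated code.

```cuda
#include <cstdio>
#include <cstdint>
#include <vector>
#include <cuda_runtime.h>

// Variant A: the host casts, so the kernel already receives a float4 pointer.
__global__ void sumHostCast(const float4* in, float* out, int n4)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) {
        float4 v = in[i];
        out[i] = v.x + v.y + v.z + v.w;
    }
}

// Variant B: the kernel receives a float pointer and casts it itself.
__global__ void sumKernelCast(const float* in, float* out, int n4)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) {
        float4 v = reinterpret_cast<const float4*>(in)[i];
        out[i] = v.x + v.y + v.z + v.w;
    }
}

// float4 accesses require a 16-byte aligned address.
bool isAligned16(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Runs both variants on the same data; returns true if the results are identical.
bool compareVariants()
{
    const int n = 1024;          // hypothetical size, multiple of 4
    const int n4 = n / 4;
    std::vector<float> h(n);
    for (int i = 0; i < n; ++i) h[i] = static_cast<float>(i);

    float *arr = nullptr, *outA = nullptr, *outB = nullptr;
    cudaMalloc(&arr, n * sizeof(float));
    cudaMalloc(&outA, n4 * sizeof(float));
    cudaMalloc(&outB, n4 * sizeof(float));
    cudaMemcpy(arr, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    bool ok = isAligned16(arr);

    const int threads = 128;
    const int blocks = (n4 + threads - 1) / threads;
    sumHostCast<<<blocks, threads>>>((const float4*)arr, outA, n4);
    sumKernelCast<<<blocks, threads>>>(arr, outB, n4);

    // These copies run on the default stream, so they wait for both kernels.
    std::vector<float> a(n4), b(n4);
    cudaMemcpy(a.data(), outA, n4 * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(b.data(), outB, n4 * sizeof(float), cudaMemcpyDeviceToHost);

    for (int i = 0; i < n4; ++i) ok = ok && (a[i] == b[i]);

    cudaFree(arr);
    cudaFree(outA);
    cudaFree(outB);
    return ok;
}

int main()
{
    bool ok = compareVariants();
    std::printf("variants match: %s\n", ok ? "yes" : "no");
    return ok ? 0 : 1;
}
```

The alignment check matters more if the float4 pointer is formed from an offset into the array (for example &arr[1]), since that address would no longer be 16-byte aligned.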

How big is the array, and how many times do you recast (the array)?

Would the total execution time spent on recasting really be significant enough to matter in either case?

I tend to think that both approaches amount to essentially the same work, so there cannot be a significant performance difference. My thinking is that in both cases, much of the recasting would be done by the threads.
Perhaps it also depends on where the recast data ends up: local, shared or global memory.

I suppose you only need to recast the array once; if you had to recast it repeatedly, permanently keeping the array as the recast type would become worth considering.
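To illustrate the "recast once" idea, the cast can be hoisted into a local pointer at the top of the kernel and reused for every access. This is only a sketch under assumptions: the kernel name and sizes are hypothetical, the element count is a multiple of 4, the input is read-only for the lifetime of the kernel, and __ldg() requires compute capability 3.5 or newer.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Multiplies every element by s, reading and writing four floats at a time.
__global__ void scaleVec4(const float* __restrict__ in,
                          float* __restrict__ out, int n4, float s)
{
    // Cast once, then index through the float4 views.
    const float4* in4 = reinterpret_cast<const float4*>(in);
    float4* out4 = reinterpret_cast<float4*>(out);

    // Grid-stride loop so any launch configuration covers all n4 elements.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n4;
         i += gridDim.x * blockDim.x) {
        float4 v = __ldg(&in4[i]);   // load through the read-only data cache
        v.x *= s; v.y *= s; v.z *= s; v.w *= s;
        out4[i] = v;
    }
}

// Runs the kernel on a small array; returns true if every output equals 2 * input.
bool runScale()
{
    const int n = 4096;          // hypothetical size, multiple of 4
    std::vector<float> h(n), r(n);
    for (int i = 0; i < n; ++i) h[i] = static_cast<float>(i);

    float *in = nullptr, *out = nullptr;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    cudaMemcpy(in, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    scaleVec4<<<4, 128>>>(in, out, n / 4, 2.0f);

    // Default-stream copy, so it waits for the kernel to finish.
    cudaMemcpy(r.data(), out, n * sizeof(float), cudaMemcpyDeviceToHost);

    bool ok = true;
    for (int i = 0; i < n; ++i) ok = ok && (r[i] == 2.0f * h[i]);

    cudaFree(in);
    cudaFree(out);
    return ok;
}

int main()
{
    bool ok = runScale();
    std::printf("scale ok: %s\n", ok ? "yes" : "no");
    return ok ? 0 : 1;
}
```

The cast itself does not touch the data in memory; the sketch only gives the pointer a new type, so the same buffer can still be read as plain float elsewhere.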