Example:
I have a pointer to a device array of floats (allocated as float), but I intend to read it in the kernel as float4.
Will there be any performance difference (for example, in compiler optimizations) between using reinterpret_cast inside the kernel and casting the pointer at the kernel call on the host?
In other words, will it be ‘fundamentally’ the same if I do this:
(calling from host)
kernel<<<…>>>((float4*)(&arr[0]),…)
Or this:
(within the kernel, with arr as float*, assuming correct indexing is done)
… = reinterpret_cast<float4*>(arr)[offset];
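To make the comparison concrete, here is a minimal sketch of the two variants side by side. The kernel names (sum4_host_cast, sum4_kernel_cast), the helper variants_agree, and the launch configuration are all hypothetical and not from my real code. It only checks that both variants produce the same results. It does not measure performance.

```cuda
#include <cassert>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Variant A: the kernel takes float4* directly; the cast is done on the host.
__global__ void sum4_host_cast(const float4* in, float* out, int n4) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) {
        float4 v = in[i];
        out[i] = v.x + v.y + v.z + v.w;
    }
}

// Variant B: the kernel takes float* and reinterprets it inside the kernel.
__global__ void sum4_kernel_cast(const float* in, float* out, int n4) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) {
        float4 v = reinterpret_cast<const float4*>(in)[i];
        out[i] = v.x + v.y + v.z + v.w;
    }
}

// Hypothetical helper: runs both variants on the same input and
// returns true only if every CUDA call succeeded and the outputs match.
bool variants_agree(int n4) {
    assert(n4 > 0);
    std::vector<float> h_in(4 * n4);
    for (int i = 0; i < 4 * n4; ++i) h_in[i] = static_cast<float>(i % 7);

    float *d_in = nullptr, *d_a = nullptr, *d_b = nullptr;
    // cudaMalloc returns memory aligned to at least 256 bytes,
    // so d_in satisfies float4's 16-byte alignment requirement.
    bool ok = cudaMalloc(&d_in, 4 * n4 * sizeof(float)) == cudaSuccess &&
              cudaMalloc(&d_a, n4 * sizeof(float)) == cudaSuccess &&
              cudaMalloc(&d_b, n4 * sizeof(float)) == cudaSuccess;
    std::vector<float> h_a(n4), h_b(n4);
    if (ok) {
        cudaMemcpy(d_in, h_in.data(), 4 * n4 * sizeof(float), cudaMemcpyHostToDevice);
        int threads = 256;
        int blocks = (n4 + threads - 1) / threads;
        sum4_host_cast<<<blocks, threads>>>(reinterpret_cast<const float4*>(d_in), d_a, n4);
        sum4_kernel_cast<<<blocks, threads>>>(d_in, d_b, n4);
        // The blocking copies also surface any earlier launch or execution error.
        ok = cudaMemcpy(h_a.data(), d_a, n4 * sizeof(float), cudaMemcpyDeviceToHost) == cudaSuccess &&
             cudaMemcpy(h_b.data(), d_b, n4 * sizeof(float), cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_in);  // cudaFree(nullptr) is a harmless no-op
    cudaFree(d_a);
    cudaFree(d_b);
    return ok && h_a == h_b;
}

int main() {
    bool ok = variants_agree(1 << 20);
    std::printf("%s\n", ok ? "outputs match" : "outputs differ or a CUDA call failed");
    return ok ? 0 : 1;
}
```

One way to see whether the two variants really compile to the same thing would be to compare the generated code for both kernels, for example with cuobjdump -sass on the compiled binary.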
EDIT: So far, in my testing of both methods, there does not seem to be much of a difference. Another issue is that some of the aligned types will be loaded via __ldg().
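For the __ldg() case, a rough sketch of what I mean is below. The kernel name copy4_ldg and the helper is_float4_aligned are hypothetical. The sketch assumes a device of compute capability 3.5 or higher, where __ldg() is available and has an overload for const float4*. A float4 load needs a 16-byte aligned address, so the helper checks that on the host before the launch.

```cuda
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical host-side check: true if p meets float4's alignment (16 bytes).
bool is_float4_aligned(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % alignof(float4) == 0;
}

// Hypothetical kernel: reads float data as float4 through the read-only
// data cache via __ldg(), then writes the vector out unchanged.
__global__ void copy4_ldg(const float* __restrict__ in, float4* __restrict__ out, int n4) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) {
        out[i] = __ldg(reinterpret_cast<const float4*>(in) + i);
    }
}

int main() {
    const int n4 = 1024;
    float* d_in = nullptr;
    float4* d_out = nullptr;
    if (cudaMalloc(&d_in, 4 * n4 * sizeof(float)) != cudaSuccess ||
        cudaMalloc(&d_out, n4 * sizeof(float4)) != cudaSuccess) {
        std::printf("allocation failed\n");
        return 1;
    }
    cudaMemset(d_in, 0, 4 * n4 * sizeof(float));

    // The base pointer from cudaMalloc is aligned; an offset such as d_in + 1 is not.
    assert(is_float4_aligned(d_in));

    copy4_ldg<<<(n4 + 255) / 256, 256>>>(d_in, d_out, n4);
    cudaError_t err = cudaDeviceSynchronize();
    std::printf("%s\n", cudaGetErrorString(err));

    cudaFree(d_in);
    cudaFree(d_out);
    return err == cudaSuccess ? 0 : 1;
}
```

Here the cast has to happen somewhere before the __ldg() call either way, so the same host-cast versus in-kernel-cast question applies to this path too.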