I have a CUDA float4 array on the device that I would like to copy back to my object on the host using memcpy.
So, do I first have to use cudaMemcpy with cudaMemcpyDeviceToHost to get it into a host array, and then iterate over this array and copy the values, or is there a faster and more efficient way to do this?
Why would you need to iterate over the values? Just copy the data straight back into a float4 array on the host with a single cudaMemcpy. If you wanted to, you could probably cast the resulting float4* to a float (*)[4] on the host, but I haven't verified this. Doing so would let you access each element's components on the host by index instead of by field name (x, y, z, w).
const size_t floatCount(20);
const size_t memoryReq(floatCount * sizeof(float4));
float4* hostDataFloat4 = new float4[floatCount];
float (*hostDataArray)[4] = new float[floatCount][4]; // floatCount rows of 4 floats each
float4* deviceData;
cudaMalloc((void**)&deviceData, memoryReq);
...
cudaMemcpy(hostDataFloat4, deviceData, memoryReq, cudaMemcpyDeviceToHost);
cudaMemcpy(hostDataArray, deviceData, memoryReq, cudaMemcpyDeviceToHost);
cudaFree(deviceData);
...
// Both of the following should be equivalent, I think (again, I haven't tested this code)
float xVal1 = hostDataFloat4[0].x;
float xVal2 = hostDataArray[0][0];
...
delete[] hostDataFloat4;
delete[] hostDataArray;
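
For what it's worth, here is a minimal host-only sketch of the "copy straight into the object" idea. It is not tested against a real device: Float4 is a hypothetical stand-in for CUDA's float4 (so it builds without the toolkit), Particles is a made-up host class, and copyToHost uses std::memcpy where real code would call cudaMemcpy(dst.data(), deviceData, dst.bytes(), cudaMemcpyDeviceToHost). The static_assert checks the layout assumption the index-based access relies on.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for CUDA's float4: four consecutive floats.
// (The real float4 is additionally 16-byte aligned.)
struct Float4 { float x, y, z, w; };

// The layout assumption behind indexing components as float[4].
static_assert(sizeof(Float4) == sizeof(float[4]),
              "Float4 must be exactly four packed floats");

// Hypothetical host object that owns the destination storage.
class Particles {
public:
    explicit Particles(std::size_t count) : positions_(count) {}

    // Raw destination pointer and size in bytes for one bulk copy.
    Float4* data() { return positions_.data(); }
    std::size_t bytes() const { return positions_.size() * sizeof(Float4); }

    // Read component c (0..3) of element i by index. The element is
    // copied into a float[4] first, which avoids casting the pointer.
    float component(std::size_t i, std::size_t c) const {
        float tmp[4];
        std::memcpy(tmp, &positions_[i], sizeof(tmp));
        return tmp[c];
    }

private:
    std::vector<Float4> positions_;
};

// Stand-in for the device-to-host transfer: one bulk copy directly
// into the object's storage, with no intermediate array and no loop.
inline void copyToHost(Particles& dst, const Float4* src) {
    std::memcpy(dst.data(), src, dst.bytes());
}
```

With something like this, the object exposes its own buffer as the copy destination, so there is nothing left to iterate over afterwards; p.data()[i].x and p.component(i, 0) read the same value.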