1. My kernel reads elements from one device array, d_Src, and writes them into another, d_Dst. When the array elements are of type int4, it’s blazing fast; when I changed int4 to Pos, where Pos is my own struct with these members:
int x, y, z, w;
it’s 4~5 times slower. Why is that?
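A minimal sketch of the setup described above, assuming a copy kernel that moves one element per thread. Only d_Src, d_Dst, int4 and Pos come from the post; copyKernel, timeCopy, PosAligned, the array size and the launch configuration are hypothetical. One difference that can be checked directly is alignment: int4 is declared with 16-byte alignment, while a plain struct of four ints only has 4-byte alignment. Whether that accounts for the slowdown here is an assumption to test, which is why the sketch also times a 16-byte-aligned variant.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Plain struct as in the post: four ints, so 4-byte alignment.
struct Pos { int x, y, z, w; };

// Hypothetical variant: same members, but 16-byte aligned like int4.
struct __align__(16) PosAligned { int x, y, z, w; };

// Hypothetical copy kernel: each thread copies one element.
template <typename T>
__global__ void copyKernel(const T* d_Src, T* d_Dst, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_Dst[i] = d_Src[i];
}

// Runs the copy for element type T and returns the effective bandwidth
// in GB/s, counting one read and one write per element.
template <typename T>
double timeCopy(int n)
{
    T *d_Src = nullptr, *d_Dst = nullptr;
    size_t bytes = (size_t)n * sizeof(T);
    cudaMalloc(&d_Src, bytes);
    cudaMalloc(&d_Dst, bytes);
    cudaMemset(d_Src, 0, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    copyKernel<T><<<blocks, threads>>>(d_Src, d_Dst, n);  // warm-up launch

    cudaEventRecord(start);
    copyKernel<T><<<blocks, threads>>>(d_Src, d_Dst, n);  // timed launch
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_Src);
    cudaFree(d_Dst);
    return (2.0 * bytes) / (ms * 1.0e6);  // bytes per ms -> GB/s
}

int main()
{
    const int n = 1 << 22;  // hypothetical element count
    printf("alignment: int4 %zu, Pos %zu, PosAligned %zu\n",
           alignof(int4), alignof(Pos), alignof(PosAligned));
    printf("int4:       %.1f GB/s\n", timeCopy<int4>(n));
    printf("Pos:        %.1f GB/s\n", timeCopy<Pos>(n));
    printf("PosAligned: %.1f GB/s\n", timeCopy<PosAligned>(n));
    return 0;
}
```

If PosAligned runs at about the same speed as int4 while Pos stays slow, alignment is the likely cause; if not, the difference lies elsewhere.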
By the way, the int4 version reaches roughly 70 GB/s. What is the peak memory bandwidth of the G80, please?
2. I used cudaMemcpy() to test device-memory bandwidth: roughly, copy-in is 0.6 GB/s and copy-out is 0.8 GB/s. What is the peak bandwidth for copy-in/copy-out, please?
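A minimal sketch of how such a cudaMemcpy() measurement could be made, assuming that copy-in means host to device and copy-out means device to host, and that the host buffer is ordinary pageable memory from malloc(). The buffer size and the names gbPerSec, h_buf and d_buf are hypothetical, not taken from the original test.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Converts a byte count and elapsed milliseconds to GB/s (1 GB = 1e9 bytes).
static double gbPerSec(size_t bytes, float ms)
{
    return bytes / (ms * 1.0e6);
}

int main()
{
    const size_t bytes = (size_t)64 << 20;  // hypothetical 64 MiB buffer
    char* h_buf = (char*)malloc(bytes);     // pageable host memory
    char* d_buf = nullptr;
    if (!h_buf || cudaMalloc(&d_buf, bytes) != cudaSuccess) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }
    memset(h_buf, 0, bytes);  // touch the host pages before timing

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float ms = 0.0f;

    // Copy-in: host to device.
    cudaEventRecord(start);
    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("copy-in:  %.2f GB/s\n", gbPerSec(bytes, ms));

    // Copy-out: device to host.
    cudaEventRecord(start);
    cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("copy-out: %.2f GB/s\n", gbPerSec(bytes, ms));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_buf);
    free(h_buf);
    return 0;
}
```

Averaging over several copies and trying a few buffer sizes gives steadier numbers than a single transfer.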