Hi,
1. My kernel reads elements from one device array, d_Src, and writes them into d_Dst. When the array elements are int4, it's blazing fast; when I change int4 to Pos, where Pos is my own struct:
struct Pos
{
    int x, y, z, w;
};
it's 4 to 5 times slower. Why is that?
By the way, the int4 version reaches roughly 70 GB/s. What is the peak memory bandwidth of G80, please?
2. I use cudaMemcpy() to test device-memory transfer bandwidth: roughly 0.6 GB/s for copy-in and 0.8 GB/s for copy-out. What is the peak bandwidth for copy-in/copy-out, please?
Thanks!
Thank you very much! I was using the old Jan 15 Programming Guide, sorry for the trouble :) My account doesn't currently have permission to install the Feb 6 release. Please forgive me for following the Jan 15 guide's alignment sample code, align(16), which might be a typo.
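For anyone who finds this thread later, here is a minimal sketch of the aligned struct as I understand it. I'm assuming the intended specifier is `__align__(16)`, which is the spelling current CUDA toolkits document; the kernel name copyKernel, the array size and the launch configuration are made up for illustration and are not my real code. It also uses the current runtime API names, so it may need adjusting for older releases.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Same four ints as before, but declared 16-byte aligned (assumed spelling:
// __align__(16)) so the compiler is allowed to move the whole struct with a
// single 128-bit load/store, as it can for int4.
struct __align__(16) Pos
{
    int x, y, z, w;
};

// Compile-time checks that the layout matches int4's size and alignment.
static_assert(sizeof(Pos) == 16, "Pos should occupy 16 bytes");
static_assert(alignof(Pos) == 16, "Pos should be 16-byte aligned");

// Hypothetical copy kernel: each thread copies one element.
__global__ void copyKernel(Pos *d_Dst, const Pos *d_Src, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d_Dst[i] = d_Src[i];
}

int main()
{
    const int n = 1 << 20;                 // illustrative element count
    const size_t bytes = n * sizeof(Pos);
    Pos *d_Src = NULL, *d_Dst = NULL;

    if (cudaMalloc((void **)&d_Src, bytes) != cudaSuccess ||
        cudaMalloc((void **)&d_Dst, bytes) != cudaSuccess) {
        printf("cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_Src, 0, bytes);           // give the source defined contents

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    copyKernel<<<blocks, threads>>>(d_Dst, d_Src, n);

    // Wait for the kernel and report any launch or execution error.
    cudaError_t err = cudaDeviceSynchronize();
    printf("copy kernel: %s\n", cudaGetErrorString(err));

    cudaFree(d_Src);
    cudaFree(d_Dst);
    return err == cudaSuccess ? 0 : 1;
}
```

The only change from my original struct is the alignment specifier; the fields and the kernel body stay the same. I have not re-measured the bandwidth with this exact sketch.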