I just copied and pasted it into my CUDA port, but it doesn’t seem to work the same way. This macro is supposed to add v bytes to sb (which is an array) [int = 4 bytes / little endian].
Here is a sample program using it (standard C) followed by execution traces.
int i, tab[5];
for(i=0; i<5; i++)
tab[i] = 2;
for(i=0; i<5; i++)
printf("%d = %ld\n", i, tab[i]);
printf("=> %ld\n", SBA(tab, 1 & 0xffff));
printf("=> %ld\n", SBA(tab, 3 & 0xffff));
So the story seems to be about little and big endian. However, I don’t know how to write code (a C function or macro) in CUDA that gives the same results as the CPU macro. You also mentioned that unaligned numbers give bad performance; do you have some information about that?
Results from CPU and GPU are the same. So, as you mentioned, it’s not an endian issue.
And as you can see, this is not a sync issue either.
The point is that I’m porting a big CPU code to the GPU. I think there is another tricky issue in my port. I will investigate in depth and come back if I need to. In any case, thank you for your time & help.
Your problem is related to signed/unsigned. I tested your results wrongly: the first time, I had hypothesized this and then, wrongly, discarded it! (n00b me too!) See:
2^32 - 3003121664 = 1291845632
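The arithmetic above can be checked in plain C. This is only a minimal sketch: the helper names `as_signed32` and `distance_to_2_32` are made up for illustration and do not come from the original program. It assumes the two printed values are the same 32 bits read once as unsigned and once as signed.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read the 32 bits of an unsigned value as a
   signed two's-complement integer. memcpy copies the bit pattern
   without any value conversion. */
static int32_t as_signed32(uint32_t bits)
{
    int32_t s;
    memcpy(&s, &bits, sizeof s);
    return s;
}

/* Hypothetical helper: 2^32 minus the unsigned value, computed in
   64 bits so the subtraction cannot wrap around. */
static int64_t distance_to_2_32(uint32_t bits)
{
    return ((int64_t)1 << 32) - (int64_t)bits;
}
```

With these, `as_signed32(3003121664u)` gives -1291845632 and `distance_to_2_32(3003121664u)` gives 1291845632, i.e. the same bits are just being interpreted in two ways.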
64 bits will change the size of your addresses, not of your data.
It is possible that nvcc's printf behaves differently from that of gcc (or whatever compiler you are using).
However you should declare tab unsigned (but I do not think it will change what you see printed).
Instead of %ld, try with %u.
On many compilers, %ld also refers to 32-bit integers (you need %lld to print 64-bit long long ints). However, that’s not your intent: you want to print a 32-bit unsigned value, so %u is your magic word.
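To see what the conversion specifier changes, here is a small host-side sketch in standard C. The function names are hypothetical, and it assumes a platform where int is 32 bits, as stated earlier in the thread. It formats into a buffer with snprintf so the text can be compared directly.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: format a 32-bit value with %u.
   The cast makes the argument type match the specifier exactly. */
static int format_unsigned32(char *buf, size_t n, uint32_t v)
{
    return snprintf(buf, n, "%u", (unsigned int)v);
}

/* Hypothetical helper: format a 32-bit value with %d. */
static int format_signed32(char *buf, size_t n, int32_t v)
{
    return snprintf(buf, n, "%d", (int)v);
}
```

Passing 3003121664u to `format_unsigned32` yields the text "3003121664", while passing -1291845632 to `format_signed32` yields "-1291845632". Mixing a specifier with a differently sized or differently signed argument (such as %ld with an int) is what makes the output depend on the compiler.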
You can also get consistent output by transferring the result to the host and printing it there.