Porting a macro / CUDA Architecture

Hello,

I have the following macro in my standard C code.

#define SBA(sb, v) (*(unsigned int*)((char*)(sb)+(v)))

I just copied and pasted it into my CUDA port, but it doesn’t seem to work the same way. This macro is supposed to add v bytes to sb (which is an array) and read an unsigned int from there [int = 4 bytes / little endian].

Here is a sample program using it (standard C) followed by execution traces.

int i, tab[5];

for(i=0; i<5; i++)
	tab[i] = 2;

for(i=0; i<5; i++)
	printf("%d = %ld\n", i, tab[i]);

printf("=> %ld\n", SBA(tab, 1 & 0xffff));
printf("=> %ld\n", SBA(tab, 3 & 0xffff));

0 = 2
1 = 2
2 = 2
3 = 2
4 = 2
=> 33554432
=> 512

Does anyone have any clue? :-/

Thanks

In little endian, the int 2 is stored in memory as (lowest-address byte first):

2 = 00000010 00000000 00000000 00000000

tab contains this pattern repeated:

adding a one-byte offset, you start reading from here:


---------v

00000010 00000000 00000000 00000000 00000010 00000000 00000000 00000000 00000010 00000000 00000000 00000000


and you get the bytes:

00000000 00000000 00000000 00000010

which, read as a little-endian unsigned int, is 0x02000000 = 2^25 = 33554432. That’s what you got in the first result, so what you are reporting is exactly what I would expect from such code.

If your results are inverted, then you are on a big-endian architecture.

Take care when using unaligned accesses… performance is often worse.
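
If alignment ever becomes a problem, one way around it is to copy the bytes instead of dereferencing an unaligned pointer. This is just a sketch of the idea (sba_safe is a name I made up, and I have only thought it through for host code):

#include <string.h>

/* Alignment-safe variant of SBA: memcpy has no alignment
   requirement, so the four bytes at offset v are gathered
   whatever the address is. */
static unsigned int sba_safe(const void *sb, size_t v)
{
	unsigned int out;
	memcpy(&out, (const char *)sb + v, sizeof out);
	return out;
}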

Thanks sigismondo for your reply.

Even if it helps me understand the problem (I think!), it doesn’t help me solve it!

Here is some “new” code I made to compare results:

int i, tab[5];

for(i=0; i<5; i++)
	tab[i] = 15795 * i;

for(i=0; i<5; i++)
	printf("%d = %ld\n", i, tab[i]);

printf("=> %ld\n", SBA(tab, 1 & 0xffff));
printf("=> %ld\n", SBA(tab, 3 & 0xffff));

As you can see, it’s the same code as previously posted, except for the initialization of “tab”.

Here is the output of the code executed on my CPU (AMD Athlon 64 X2) on a Gentoo system.

0 = 0
1 = 15795
2 = 31590
3 = 47385
4 = 63180
=> -1291845632
=> 4043520

And then exactly the same code executed in a kernel in emulation mode (only threadIdx.x == 0 executes the code). My GPU is a GTX 260.

0 = 0
1 = 15795
2 = 31590
3 = 47385
4 = 63180
=> 3003121664
=> 4043520

So the story seems to be about little and big endian. However, I don’t know how to write a C function or macro in CUDA that gives the same results as the CPU macro. You also mentioned that unaligned accesses give bad performance; do you have some information about that?
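
In the meantime, the best idea I have is to assemble the bytes explicitly, so the value read no longer depends on the byte order of the machine. Only a sketch for now (sba_le is a name I made up, and I haven’t tried it in a kernel yet):

/* Endian-independent read: build a little-endian unsigned int
   from the 4 bytes at offset v, byte by byte. */
static unsigned int sba_le(const void *sb, unsigned int v)
{
	const unsigned char *p = (const unsigned char *)sb + v;
	return  (unsigned int)p[0]
	      | ((unsigned int)p[1] << 8)
	      | ((unsigned int)p[2] << 16)
	      | ((unsigned int)p[3] << 24);
}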

Thanks,

Actually, the fact that the second result is correct makes me think it is not related to endianness. I found this post regarding it:
http://forums.nvidia.com/index.php?showtopic=29256

Why don’t you post the CUDA code, so we can have a look at it?

Could it be some problem related to synchronization before transferring the results?

First, sorry for the delay!

Then, I compiled a small CUDA program running the same code on the CPU and the GPU to compare results.

#include <stdio.h>
#include <cuda.h>

#define SBA(sb, v) (*(unsigned int*)((char*)(sb)+(v)))

__host__ void mykernel_cpu()
{
	int i, tab[5];

	for(i=0; i<5; i++)
		tab[i] = 15795 * i;

	for(i=0; i<5; i++)
		printf("%d = %ld\n", i, tab[i]);

	printf("=> %ld\n", SBA(tab, 1 & 0xffff));
	printf("=> %ld\n", SBA(tab, 3 & 0xffff));
}

__global__ void mykernel_gpu()
{
	int i, tab[5];

	if( threadIdx.x == 0 )
	{
		for(i=0; i<5; i++)
			tab[i] = 15795 * i;

		//__syncthreads();

		for(i=0; i<5; i++)
			printf("%d = %ld\n", i, tab[i]);

		printf("=> %ld\n", SBA(tab, 1 & 0xffff));
		printf("=> %ld\n", SBA(tab, 3 & 0xffff));
	}
}

int main( void )
{
	printf("\n-----------------------\n");
	printf("\tCPU RUN");
	printf("\n-----------------------\n");
	mykernel_cpu();

	printf("\n-----------------------\n");
	printf("\tGPU RUN");
	printf("\n-----------------------\n");
	dim3 dimGrid(5,5,1);
	dim3 dimBlock(2,2,2);
	mykernel_gpu <<< dimGrid, dimBlock >>> ();

	printf("\n");
	return 0;
}

Results from the CPU and the GPU are the same. So, as you mentioned, it’s not an endian issue.

And as you can see, it is not a sync issue either.

The point is I’m porting a big CPU code to the GPU. I think there is another tricky issue in my port. I will investigate in depth and come back if I need to. However, thank you for your time & help.

Hi again !

After working on my project for 3 days, I think I found the problem: the macro was developed for a 32-bit CPU and I’m working with 64-bit CUDA!

… I know… I know… I’m a n00b! :o)

++

Your problem is related to signed/unsigned… I checked your results wrongly: the first time around I had hypothesized it and - wrongly - discarded it! (n00b me too!) See:

2^32 - 3003121664 = 1291845632
64 bits will change the size of your addresses, not of your data.
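
A quick check makes the point (assuming a typical LP64 64-bit Linux toolchain):

#include <stdio.h>

int main(void)
{
	/* On LP64, int stays 4 bytes; only pointers grow to 8,
	   so the data SBA reads does not change size. */
	printf("sizeof(int)   = %zu\n", sizeof(int));
	printf("sizeof(void*) = %zu\n", sizeof(void *));
	return 0;
}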

It is possible that the printf from nvcc behaves differently from the one from gcc (or whatever compiler you are using).
However, you should declare tab unsigned (though I do not think it will change what you see printed).

Instead of %ld, try %u.

On many compilers, %ld refers to 32-bit integers too (you need %lld to print 64-bit long long ints). However, that’s not your intent - you want to print a 32-bit unsigned - %u is your magic word.
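
You can see the whole effect in two lines (assuming a two’s-complement machine, where the cast to int does the obvious thing):

#include <stdio.h>

int main(void)
{
	unsigned int x = 3003121664u;  /* 0xB3000000, the value read at offset 1 */
	printf("%d\n", (int)x);        /* -1291845632: same bits, viewed signed  */
	printf("%u\n", x);             /* 3003121664: viewed unsigned            */
	return 0;
}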

You can also transfer the result to the host and let the host print it, so the comparison is consistent.
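
Something along these lines should do it (untested sketch; read_offset is a made-up name and error checking is omitted):

#include <stdio.h>
#include <cuda.h>

#define SBA(sb, v) (*(unsigned int*)((char*)(sb)+(v)))

/* One thread computes the value under test and writes it to global
   memory; the host copies it back and prints it, so a single printf
   implementation is involved. Note that the unaligned dereference
   inside SBA is exactly what is being tested and may still misbehave
   on real hardware. */
__global__ void read_offset(unsigned int *out)
{
	int i, tab[5];

	for(i=0; i<5; i++)
		tab[i] = 15795 * i;

	*out = SBA(tab, 1);
}

int main(void)
{
	unsigned int host = 0, *dev;

	cudaMalloc((void **)&dev, sizeof(unsigned int));
	read_offset<<<1, 1>>>(dev);
	cudaMemcpy(&host, dev, sizeof(unsigned int), cudaMemcpyDeviceToHost);
	printf("=> %u\n", host);
	cudaFree(dev);
	return 0;
}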