Memory access with shifted pointer

Hello everyone,

I try to migrate a CPU code on CUDA but I’m stuck on one aspect of memory adressing which work on CPU but not on GPU.

About the context, my purpose is to be able to read/write a small pack of bits (i.e. smaller compared to the size of a element of the array) on only one step in an array even if these bits are distributed on 2 index of the array. For that i create a pointer shifted of the half of the size of a element of the array.
If required, I can explain more why I do that but I don’t think this important.

Here is an simplified example with no error on CPU execution but one error on GPU for the same code.

int main() {

	if (true) {
		setbuf(stdout, NULL); //For avoid conflict between output on cpu and gpu
		printf("**** Start Test CPU ****\n");
		int nbItems = 3;

		printf("ull\n");
		unsigned long long* h_grilleTest = (unsigned long long*)malloc(nbItems * sizeof(unsigned long long));
		h_grilleTest[0] = 0b0000000100000001;
		h_grilleTest[1] = 0b000000010000000100000000;
		for (int i = 0; i < nbItems; i++) {
			printf("%d = %p : %d\n", i, h_grilleTest + i, h_grilleTest[i]);
		}
		printf("\n");

		printf("ull access - shift 4 bytes\n");
		unsigned long long* h_grilleTestDecale_ull = (unsigned long long*)(h_grilleTest_c + 4);
		for (int i = 0; i < nbItems - 1; i++) {
			printf("Index %d | MemAdress : %p\n", i, h_grilleTestDecale_ull + i);
			printf("Index %d | MemAdress : %p | value : %d\n", i, h_grilleTestDecale_ull + i, h_grilleTestDecale_ull[i]); //No error on read h_grilleTestDecale_ull[i]
		}
		printf("\n");

		printf("**** End Test CPU ****\n");
	}

	if (true) {		
		test << <1, 1 >> > ();
		cudaDeviceSynchronize();		
	}

	return;
}


__global__ void test() {

	if (true) {
		printf("**** Start Test GPU ****\n");
		int nbItems = 3;

		printf("ull\n");
		unsigned long long* h_grilleTest = (unsigned long long*)malloc(nbItems * sizeof(unsigned long long));
		h_grilleTest[0] = 0b0000000100000001;
		h_grilleTest[1] = 0b000000010000000100000000;
		for (int i = 0; i < nbItems; i++) {
			printf("%d = %p : %d\n", i, h_grilleTest + i, h_grilleTest[i]);
		}
		printf("\n");

		printf("ull access - shift 4 bytes\n");
		unsigned long long* h_grilleTestDecale_ull = (unsigned long long*)(h_grilleTest_c + 4);
		for (int i = 0; i < nbItems - 1; i++) {
			printf("Index %d | MemAdress : %p\n", i, h_grilleTestDecale_ull + i);
			printf("Index %d | MemAdress : %p | value : %d\n", i, h_grilleTestDecale_ull + i, h_grilleTestDecale_ull[i]); //Error on read h_grilleTestDecale_ull[i]
		}
		printf("\n");

		printf("**** End Test GPU ****\n");

		return;
	}
}

Result :

**** Start Test CPU ****
ull
0 = 000001BE0C503780 : 257
1 = 000001BE0C503788 : 65792
2 = 000001BE0C503790 : 0

ull access - shift 4 bytes
Index 0 | MemAdress : 000001BE0C503784
Index 0 | MemAdress : 000001BE0C503784 | value : 0
Index 1 | MemAdress : 000001BE0C50378C
Index 1 | MemAdress : 000001BE0C50378C | value : 0

**** End Test CPU ****
**** Start Test GPU ****
ull
0 = 0000000B035FF920 : 257
1 = 0000000B035FF928 : 65792
2 = 0000000B035FF930 : 0

ull access - shift 4 bytes
Index 0 | MemAdress : 0000000B035FF924 
//Error

Is there a constraint on the alignement of GPU memory depending on the size of the variable that not exist on CPU ?
Do you know how I could read in my array with a bit precision for the pointer (or at least byte precision) to avoid to have to reconstruct my small stack of bits from two index ?

Thanks for your help

Yes, it is documented. The requirement is that data be “naturally aligned”. This means that retrieving a byte must be done on a 1-byte boundary. Retrieving a short must be done on a 2-byte boundary. Retrieving a 4-byte quantity such as int or float must be done on a 4-byte boundary, and so on for 8 or 16 byte quantities.

This means that the lower bits of the address used must be zero. For the 1 byte case, there are no bits that must be zero. For the 2-byte case, the lowest bit must be zero. For the 4-byte case, the lowest 2 bits must be zero. For the 8 byte case, the lowest 3 bits must be zero. And for the 16 byte case, the lowest 4 bits of the address must be zero.

You will have to reconstruct from two indices, if the data you require lies across boundaries. But if the data you require can be found within a larger boundary, you may be able to retrieve a larger quantity.

For example, data that exists at byte address/offset 15, and data that exists at offset 16, cannot be retrieved in a single request (by a single thread). But data that exists at byte address/offset 7 and offset 8 can be retrieved in a single request by a single thread by requesting a 16-byte quantity at location 0.

You could also retrieve adjacent data across multiple threads, and then recombine data (perhaps) using e.g. warp shuffle.

In any event, taking an unsigned long long pointer that is properly formatted, and then “shifting it by 4 bytes” (and then attempting to use that shifted pointer for unsigned long long data retrieval) is illegal, in CUDA.

Also this may be of interest.

Thank you for this prompt, thorough, and precise response !

I gonna adapt the code and reconstruct the data from 2 requests because the array is larger than 8 (even 16) bytes.

I will check later how (and if) I can use multiple threads to improve performance on this operation.

Kind regards

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.