Shared memory boundary writing issue: writing different data types to shared memory banks

Hey

As part of a project I am working on, I have many data types with different bit sizes that I am trying to pack into a dynamically allocated shared memory array, and I ran into a strange bit of behaviour that I cannot classify as either a bug or a feature.

Below is a piece of code that illustrates the problem:

#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>

/**
 * Kernel creates some values in shared memory and copies them
 * back out to device memory.
 */
__global__
void testit(char * res_b, char * res_c, float * res_d, char * bit_layout) {
	// Init shared pool
	extern __shared__ char spool[];
	// Carve some pointers out of the shared pool
	char * thechar1 = (char *)spool;
	char * thechar2 = (char *)&thechar1[1];
	float * thefloat = (float *)&thechar2[1];
	(*thechar1) = '*'; // Set the first char to 42, this works as expected
	(*thechar2) = '*'; // Set the second char so that spool as a string is "**"
	(*thefloat) = 1.0f; // To recap, the layout should be [42|42|0|0] [-128|63|X|X], with X unknown
	// Copy to device mem
	(*res_b) = (*thechar1);
	(*res_c) = (*thechar2);
	(*res_d) = (*thefloat);
	// Copy contents of spool out so the host can inspect it
	bit_layout[0] = spool[0];
	bit_layout[1] = spool[1];
	bit_layout[2] = spool[2];
	bit_layout[3] = spool[3];
	bit_layout[4] = spool[4];
	bit_layout[5] = spool[5];
	bit_layout[6] = spool[6];
	bit_layout[7] = spool[7];
}

int main() {
	char c_h;
	char * c_d;
	char b_h;
	char * b_d;
	float d_h;
	float * d_d;
	// Layout of shared memory
	char * bitlayout_h = (char *)calloc(8, sizeof(char));
	char * bitlayout_d;

	// 8 bytes: 2 chars + the (misaligned) float + the trailing bytes we dump
	size_t shared = 8 * sizeof(char);
	cudaMalloc((void**)&b_d, sizeof(char));
	cudaMemset(b_d, 0, sizeof(char));
	cudaMalloc((void**)&c_d, sizeof(char));
	cudaMemset(c_d, 0, sizeof(char));
	cudaMalloc((void**)&d_d, sizeof(float));
	cudaMemset(d_d, 0, sizeof(float));
	cudaMalloc((void**)&bitlayout_d, 8 * sizeof(char));
	cudaMemset(bitlayout_d, 0, 8 * sizeof(char));
	// Only one thread, no race conditions
	testit<<<1, 1, shared>>>(b_d, c_d, d_d, bitlayout_d);
	cudaMemcpy(&b_h, b_d, sizeof(char), cudaMemcpyDeviceToHost);
	cudaMemcpy(&c_h, c_d, sizeof(char), cudaMemcpyDeviceToHost);
	cudaMemcpy(&d_h, d_d, sizeof(float), cudaMemcpyDeviceToHost);
	cudaMemcpy(bitlayout_h, bitlayout_d, 8 * sizeof(char), cudaMemcpyDeviceToHost);
	cudaFree(b_d);
	cudaFree(c_d);
	cudaFree(d_d);
	cudaFree(bitlayout_d);
	printf("Char 1: %c Char 2: %c The Float: %f\n", b_h, c_h, d_h);
	printf("Expected bitlayout:  [42|42|0|0] [-128|63|X|X]\n");
	printf("Bitlayout: [%i|%i|%i|%i] [%i|%i|%i|%i]\n", (int)bitlayout_h[0],
	                                                   (int)bitlayout_h[1],
	                                                   (int)bitlayout_h[2],
	                                                   (int)bitlayout_h[3],
	                                                   (int)bitlayout_h[4],
	                                                   (int)bitlayout_h[5],
	                                                   (int)bitlayout_h[6],
	                                                   (int)bitlayout_h[7]);
	free(bitlayout_h);
	return 0;
}

On output I get the following:

Char 1:  Char 2:  The Float: 1.000000

Expected bitlayout:  [42|42|0|0] [-128|63|X|X]

Bitlayout: [0|0|-128|63] [0|14|0|1]
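
Side note: the [0|0|-128|63] pattern is exactly the little-endian byte representation of 1.0f (0x3F800000), so the float clearly landed in the first four bytes of spool. A quick host-side check, just a sketch in plain C:

#include <stdio.h>
#include <string.h>

int main(void) {
	float f = 1.0f; /* 0x3F800000 in IEEE-754 single precision */
	char bytes[sizeof(float)];
	memcpy(bytes, &f, sizeof(float));
	/* On a little-endian host this prints: 0 0 -128 63 */
	for (int i = 0; i < (int)sizeof(float); i++)
		printf("%i ", (int)bytes[i]);
	printf("\n");
	return 0;
}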

It appears that the float write hits the entire 32-bit chunk starting at the first bank of shared memory, erasing anything that was there previously. I have also tried it with 3 chars followed by 1 short; in that case the last of the 3 chars was overwritten.

Basically the problem is that when a write through a pointer would have to spill over into an adjacent shared memory bank, it instead lands lower, on banks with values already written, erasing them.
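
If I work out where the stores actually landed, it looks as if the address is simply being truncated down to the type's natural alignment. This is just my guess at a model of the behaviour, nothing I found documented, and align_down is my own helper name:

#include <stdint.h>

/* Hypothetical model: the hardware appears to drop the low address bits
 * so that a 4-byte store always lands on a 4-byte boundary. */
static inline void * align_down(void * p, size_t alignment) {
	return (void *)((uintptr_t)p & ~(uintptr_t)(alignment - 1));
}

/* With spool 4-byte aligned, align_down(spool + 2, 4) == spool, which
 * matches the float showing up in bytes 0..3 instead of 2..5. */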

I have run this on two different cards, an old 8800 GTX (G80 chip) and a Quadro NVS 290, with similar results on both. I also single-stepped it on the NVS 290 with the debugger and watched the changes as they occurred. In both cases I was running Ubuntu Linux with CUDA SDK version 2.2 installed.

So my question is: is this a bug or a feature? Is it documented? If it is documented, where? I could not find anything about it.

I bet you have a memory address alignment issue.

Floats are 4 bytes and should be aligned to a 4-byte boundary.

Your line "float * thefloat = (float *)&thechar2[1];" gives you a pointer that isn't 4-byte aligned (it points at offset 2 in the pool).

The easy solution is to put the float first and the chars second. Alternatively, you can leave a couple of unused padding bytes before the float.
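
Something like this is what I mean; a sketch only, reusing the names from your kernel with the layout reordered:

__global__
void testit_fixed(char * res_b, char * res_c, float * res_d) {
	extern __shared__ char spool[];
	// The 4-byte float goes at offset 0, which is always 4-byte aligned.
	float * thefloat = (float *)spool;
	// chars have no alignment requirement, so they can follow anywhere.
	char * thechar1 = (char *)&spool[sizeof(float)];
	char * thechar2 = (char *)&spool[sizeof(float) + 1];
	// Alternative: keep the chars first and pad the float's offset up to
	// the next 4-byte boundary, i.e. float * thefloat = (float *)&spool[4];
	(*thechar1) = '*';
	(*thechar2) = '*';
	(*thefloat) = 1.0f;
	(*res_b) = (*thechar1);
	(*res_c) = (*thechar2);
	(*res_d) = (*thefloat);
}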

I'm not even sure whether the CUDA docs talk about memory alignment restrictions for floats, but this would certainly be a problem on the CPU, and I would guess it's the same on the GPU.
