Hey
As part of a project I am working I have many different data types with different bit sizes I am trying to squeeze into dynamically allocated array and I encountered a weird bit of behaviour that I cannot mark as either a bug or a feature.
Below is is a piece of code that illustrates the problem :
#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>
/**
* Function creates some values in shared memory and copies them
* to device memory
*/
__global__
void testit(char * res_b, char * res_c, float * res_d, char * bit_layout) {
// Init shared pool
extern __shared__ char spool[];
// Create some pointers to shared memory
char * thechar1 = (char *)spool;
char * thechar2 = (char *)&thechar1[1];
float * thefloat = (float *)&thechar2[1];
(*thechar2) = '*'; // Set the char to 42, this works as expected
(*thechar2) = '*'; // Set the short so that spool as a string is '**'
(*thefloat) = 1.0; // To recap layout should be = [42|42|0|0] [-128|63|X|X] with X being unknown
//Copy to device mem
(*res_b) = (*thechar1);
(*res_c) = (*thechar2);
(*res_d) = (*thefloat);
//Copy contents of spool to host
bit_layout[0] = spool[0];
bit_layout[1] = spool[1];
bit_layout[2] = spool[2];
bit_layout[3] = spool[3];
bit_layout[4] = spool[4];
bit_layout[5] = spool[5];
bit_layout[6] = spool[6];
bit_layout[7] = spool[7];
}
int main() {
char c_h;
char * c_d;
char b_h;
char * b_d;
float d_h;
float * d_d;
// Layout of shared memory
char * bitlayout_h = (char *)calloc(8, sizeof(char));
char * bitlayout_d;
size_t shared = (sizeof(char) * 2) + sizeof(short);
cudaMalloc((void**)&b_d, sizeof(char));
cudaMemset(b_d, 0, sizeof(char));
cudaMalloc((void**)&c_d, sizeof(char));
cudaMemset(c_d, 0, sizeof(char));
cudaMalloc((void**)&d_d, sizeof(float));
cudaMemset(d_d, 0, sizeof(float));
cudaMalloc((void**)&bitlayout_d, 8 * sizeof(char));
cudaMemset(bitlayout_d, 0, 8 * sizeof(char));
//Only one thread, no race conditions
testit<<<1, 1, shared>>>(b_d, c_d, d_d, bitlayout_d);
cudaMemcpy(&b_h, b_d, sizeof(char), cudaMemcpyDeviceToHost);
cudaMemcpy(&c_h, c_d, sizeof(char), cudaMemcpyDeviceToHost);
cudaMemcpy(&d_h, d_d, sizeof(float), cudaMemcpyDeviceToHost);
cudaMemcpy(bitlayout_h, bitlayout_d, 8 * sizeof(char), cudaMemcpyDeviceToHost);
cudaFree(b_d);
cudaFree(c_d);
cudaFree(d_d);
cudaFree(bitlayout_d);
printf("Char 1: %c Char 2: %c The Float: %f\n", b_h, c_h, d_h);
printf("Expected bitlayout: [42|42|0|0] [-128|63|X|X]\n");
printf("Bitlayout: [%i|%i|%i|%i] [%i|%i|%i|%i]\n", (int)bitlayout_h[0],
(int)bitlayout_h[1],
(int)bitlayout_h[2],
(int)bitlayout_h[3],
(int)bitlayout_h[4],
(int)bitlayout_h[5],
(int)bitlayout_h[6],
(int)bitlayout_h[7]);
free(bitlayout_h);
}
On output I get the following :
Char 1: Char 2: The Float: 1.000000
Expected bitlayout: [42|42|0|0] [-128|63|X|X]
Bitlayout: [0|0|-128|63] [0|14|0|1]
It appears that the float is being written to the whole 32 bit chunk into the first bank of shared memory, erasing anything that was there previously. I have also done it with writing 3 chars and 1 short, when I did this the last of the 3 chars was overwritten.
Basically the problem is that when writing to a pointer where the value has to spill over into an adjacent shared memory bank it spills up to a bank with values already written, erasing them.
I have run this on two different cards: one an old 8800 GTX with a G80 chip and a Quadro NVS 290 both with similar results, I also single stepped it running on the 290 with the debugger and saw the changes as they occurred. Both times I was running Ubuntu Linux with cuda SDK version 2.2 installed.
So my question is this a bug or a feature? Is it documented? If it is documented where is it because I could not find anything about it.