Memcpy trouble at runtime invalid device pointer

hdinh · March 9, 2009, 12:04am

Hi, I am getting a runtime error when I tried to allocate an array of 3 bytes structs:

This is my error:

cudaSafeCall() Runtime API error in file <test.cu>, line 120 : invalid device pointer.

const int FIXELEMENT = 72;

struct bit24_t

{

	unsigned a:24;

};

//allocate device memory

	bit24_t *d_bseq, *d_reg, *d_cbseq, *d_creg;

	size_t pitch;

	cutilSafeCall(cudaMallocPitch((void**) &d_bseq, &pitch, 12*sizeof(bit24_t), FIXELEMENT));

	cutilSafeCall(cudaMallocPitch((void**) &d_reg, &pitch, 12*sizeof(bit24_t), FIXELEMENT));

	cutilSafeCall(cudaMallocPitch((void**) &d_cbseq, &pitch, 12*sizeof(bit24_t), FIXELEMENT));

	cutilSafeCall(cudaMallocPitch((void**) &d_creg, &pitch, 12*sizeof(bit24_t), FIXELEMENT));

	std::cout<<"size of bit24_t"<<sizeof(bit24_t)<<endl;

	// copy host to device

	for(int r = 0; r<12; r++)

	{

		bit24_t *rowbseq = (bit24_t*)((char*)d_bseq + r*pitch);

		bit24_t *rowreg = (bit24_t*)((char*)d_reg + r*pitch);

		bit24_t *rowcbseq = (bit24_t*)((char*)d_cbseq + r*pitch);

		bit24_t *rowcreg = (bit24_t*)((char*)d_creg + r*pitch);

		bit24_t *hrow = new bit24_t[FIXELEMENT];

		for(int c=0; c<FIXELEMENT; c++)

		{

			hrow[c] = bseq[c][r];

		}

		cutilSafeCall(cudaMemcpy(rowbseq, hrow, FIXELEMENT*sizeof(bit24_t), cudaMemcpyHostToDevice));

		

		for(int c=0; c<FIXELEMENT; c++)

		{

			hrow[c] = reg[c][r];

		}

		cutilSafeCall(cudaMemcpy(rowreg, hrow, FIXELEMENT*sizeof(bit24_t), cudaMemcpyHostToDevice));

		

		for(int c=0; c<FIXELEMENT; c++)

		{

			hrow[c] = cbseq[c][r];

		}

		cutilSafeCall(cudaMemcpy(rowcbseq, hrow, FIXELEMENT*sizeof(bit24_t), cudaMemcpyHostToDevice));

		

		for(int c=0; c<FIXELEMENT; c++)

		{

			hrow[c] = creg[c][r];

		}

		cutilSafeCall(cudaMemcpy(rowcreg, hrow, FIXELEMENT*sizeof(bit24_t), cudaMemcpyHostToDevice));

		

	}

Please help!

SPWorley · March 9, 2009, 1:08am

Bitfields are always complex and dangerous tools to use! You’ll likely run into problems.

As a guess, your problem is not CUDA related, it’s making an assumption that an array of structs will be packed tightly even when the structs are an odd size smaller than a word.
This is extremely unlikely to work in practice… think of how the compiler would have to access an element, it’d have to do integer math to figure out the phase of the struct, read from either one or two words based on that phase, patch the results back together with shifts to re-align in, THEN access it. Ugh. So likely what a compiler would do is pad the struct to a whole word length when accessing by an array.

Again, not a CUDA specific issue. The C standard says that bit fields are packed into implementation-dependent “storage units” which are likely 4-byte words in this case.

I’d say don’t try to use the complexity and ugliness of bitfields at all. If tight packing is really crucial, deal with a char array and assemble the data yourself with shifts and reads.
But also remember in CUDA words are the preferable quantum memory unit, not chars.
If you just use an int array, you may have 25% wasted space, but it will be easy, clean, portable, and efficient.

jack · March 9, 2009, 1:28am

I’ll reiterate what SPWorley said…you don’t want to do 24-bit integers on CUDA. I believe the smallest ‘atomic’ size (I hesitate saying that, since with CUDA, that word has a totally different meaning) is the 32-bit integer. Also, keep in mind that you’ll (probably) get better performance by ‘wasting’ that extra space, but having more coalesced memory reads.

If you absolutely want/need to use the 24-bit ints, I might go with the approach of using a 96-bit struct (3x32-bit integers), and then breaking out the 4x24-bit integers once the data has been loaded inside the thread.

hdinh · March 9, 2009, 3:04am

Thanks for the head-up!

Since my arrays aren’t going to get larger, I’m going to go with using int.

However, after changing to int, I am still getting the same error.

//allocate device memory

	size_t pitch;

	cutilSafeCall(cudaMallocPitch((void**) &d_bseq, &pitch, 12*sizeof(int), FIXELEMENT));

	cutilSafeCall(cudaMallocPitch((void**) &d_reg, &pitch, 12*sizeof(int), FIXELEMENT));

	cutilSafeCall(cudaMallocPitch((void**) &d_cbseq, &pitch, 12*sizeof(int), FIXELEMENT));

	cutilSafeCall(cudaMallocPitch((void**) &d_creg, &pitch, 12*sizeof(int), FIXELEMENT));

	// copy host to device

	for(int r = 0; r<12; r++)

	{

		int *rowbseq = (int*)((char*)d_bseq + r*pitch);

		int *rowreg = (int*)((char*)d_reg + r*pitch);

		int *rowcbseq = (int*)((char*)d_cbseq + r*pitch);

		int *rowcreg = (int*)((char*)d_creg + r*pitch);

		int *hrow = new int[FIXELEMENT];

		for(int c=0; c<FIXELEMENT; c++)

		{

			hrow[c] = (int)bseq[c][r].a;

		}

		cutilSafeCall(cudaMemcpy(rowbseq, hrow, FIXELEMENT*sizeof(int), cudaMemcpyHostToDevice));

		

		for(int c=0; c<FIXELEMENT; c++)

		{

			hrow[c] = (int)reg[c][r].a;

		}

		cutilSafeCall(cudaMemcpy(rowreg, hrow, FIXELEMENT*sizeof(int), cudaMemcpyHostToDevice));

		

		for(int c=0; c<FIXELEMENT; c++)

		{

			hrow[c] = (int)cbseq[c][r].a;

		}

		cutilSafeCall(cudaMemcpy(rowcbseq, hrow, FIXELEMENT*sizeof(int), cudaMemcpyHostToDevice));

		

		for(int c=0; c<FIXELEMENT; c++)

		{

			hrow[c] = (int)creg[c][r].a;

		}

		cutilSafeCall(cudaMemcpy(rowcreg, hrow, FIXELEMENT*sizeof(int), cudaMemcpyHostToDevice));

		

	}

hdinh · March 9, 2009, 6:43am

gave up on 2D matrix, linearized to array of 12*FIXELEMENT, and it’s working now. :rolleyes: