Writing to global memory

I have the following kernel, launched with a 6 by 6 grid and one thread per block:

integrals_2e_kernel<<<dimGrid, dimBlock, 1045 * sizeof(double)>>>(deviceruntime, integral_device, shell_device, d_F, d_fgtuvfinal);
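(dimGrid and dimBlock are set up roughly like this; sketched here for context, not my exact declarations:)

dim3 dimGrid(6, 6);      // 6 x 6 grid of blocks
dim3 dimBlock(1, 1, 1);  // one thread per block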

double *p_F_temp;
double *p_F_sh;

__shared__ double F_temp[784];
__shared__ double F_sh[225];
__shared__ double c1fac[3][3][3];
__shared__ double s_sab[1];
__shared__ double s_fgtuv[1];
__shared__ double s_c2x[1];

p_F_temp = F_temp;
for (i = 0; i < sheli * shelj * shelk * shell; i++) {
	*p_F_temp = zero;
	p_F_temp++;
}

// write to array in global memory
d_fgtuvfinal[blockIdx.x + blockIdx.y * 6] = s_fgtuv[0];
}

/*
p_F_sh = F_sh;
for (i = 0; i < sheli1 * shelj1 * shelk1 * shell1; i++) {
	*p_F_sh = zero;
	p_F_sh++;
}
}
*/

The code executes fine in device and emulation mode with that last pointer initialization, p_F_sh = F_sh, commented out. That is to say, d_fgtuvfinal receives all the expected values from s_fgtuv, these are brought back onto the host, and all is well. When I uncomment the p_F_sh initialization, things go awry: d_fgtuvfinal is not written to at all, and subsequent kernel calls return the same values of d_fgtuvfinal that were there at the last kernel call made while F_sh was still commented out.

I have a cudaFree(d_fgtuvfinal) statement in the host code and am surprised that the values aren't getting cleaned up after each call, since the rest of the program runs to completion. Is this behaviour likely to be one of my arrays overflowing, or am I doing something illegal with shared memory?

thanks very much in advance.

If you do some error checking after the kernel launch, I think you will find that the kernel never launches with that second code segment uncommented. In that case you are trying to use 16624 bytes of shared memory, and it should be failing with an insufficient resources error. With it commented out, the dead code optimizer is probably optimizing F_sh away and putting the kernel just under the maximum 16 KB of shared memory per block, which allows it to run.
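For example, a bare-bones check right after the launch will surface it (a generic sketch, not code from your listing):

integrals_2e_kernel<<<dimGrid, dimBlock, 1045 * sizeof(double)>>>(deviceruntime, integral_device, shell_device, d_F, d_fgtuvfinal);
cudaError_t err = cudaGetLastError();   // anything other than cudaSuccess here means the launch itself failed
if (err != cudaSuccess)
	printf("kernel launch failed: %s\n", cudaGetErrorString(err));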

Why are you dynamically allocating shared memory and defining all those static shared memory variables when the dynamic allocation doesn't look to be used for anything?
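To be clear about the distinction: the third launch parameter only sizes memory reached through an extern __shared__ declaration, while statically declared __shared__ arrays are accounted for automatically. A throwaway example (nothing to do with your kernel):

__global__ void demo_kernel(double *out)
{
	extern __shared__ double dyn[];   // sized by the third <<< >>> launch parameter
	__shared__ double stat[225];      // static; counted against the per-block limit automatically

	dyn[threadIdx.x] = 1.0;
	stat[threadIdx.x] = 2.0;
	out[threadIdx.x] = dyn[threadIdx.x] + stat[threadIdx.x];
}

// only the extern array needs an explicit byte count at launch
demo_kernel<<<1, 128, 128 * sizeof(double)>>>(d_out);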

I was worried that might be the case, but I'm not getting any kernel launch failure messages. When I compile with the cubin build option I get the following memory usage output:
with p_F_sh commented:
Local: 448
Shared: 6624
Registers: 25

with it uncommented:
Local: 448
Shared: 8432
Registers: 28

I've attached a more complete version of the sample code, which might make it clearer what's going on with those shared variables. (There the shared variables are declared outside the kernel definition.) Those shared variables should all load values from arrays I've stored in global memory, which are then manipulated within the block… although this attached code might become a 'how not to' CUDA demonstration.
ATOM_SCF_post.cu (10.5 KB)

That should be OK: 6624 + 1045*8 = 14984 bytes per block.

That should fail: 8432 + 1045*8 = 16792 bytes per block, which is more than the 16384 byte per block limit (actually a bit less than that, because the execution parameters use some shared memory).

The symptoms you are describing are consistent with the kernel never running. I haven't looked at the code yet, but are you sure the error checking is right?
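If you want to double-check the numbers at runtime rather than from the cubin, something like this should work (a sketch, assuming a toolkit new enough to have cudaFuncGetAttributes and that you are running on device 0):

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);                  // per-device limits, e.g. sharedMemPerBlock
cudaFuncAttributes attr;
cudaFuncGetAttributes(&attr, integrals_2e_kernel);  // static shared/local/register usage of the kernel
printf("static shared = %lu bytes, dynamic request = %lu bytes, device limit = %lu bytes\n",
       (unsigned long)attr.sharedSizeBytes,
       (unsigned long)(1045 * sizeof(double)),
       (unsigned long)prop.sharedMemPerBlock);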

Ah. That was basic. I had added the shared memory specification in the kernel call, <<< , 1045 * 8 >>>, thinking it represented the amount of static shared memory being allocated rather than dynamic shared memory: foolish. However, should calling this line after the kernel invocation:
CUDA_SAFE_CALL( cudaThreadSynchronize() );
be sufficient for error checking?
thanks for the help.

No, that won't work. Either examine the return status of cudaThreadSynchronize() directly or call cudaGetLastError() directly, but don't try to do both. Better still, don't use cutil at all; it does more harm than good…
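For example, something as simple as this after the launch (a sketch of the non-cutil approach, with the error handling reduced to a message):

cudaError_t status = cudaThreadSynchronize();   // returns any pending launch or execution error
if (status != cudaSuccess)
	fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(status));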