Device memset only works on debug builds

Hi Floks,

At the bottom of this post is a simple test kernel that does a memset on a shared memory array on the device. However, copying the memory back to the host and examining it shows that the memset does not always work.

On debug builds (compiled with -G0), memset always works.
On non-debug builds (compiled without -G0), memset never works on a GT 240.
On non-debug builds (compiled without -G0), memset fails on a GTX 460 unless the memory is being set to zero.
In other words, on GTX 460 non-debug builds, memset(rgBuffer,5,…) always fails but memset(rgBuffer,0,…) always works.

Examining the PTX files from each compilation reveals the most probable culprit.

Without -G0 (version that fails):
.loc 28 107 0
mov.s32 %r5, %r1;
mov.s32 %r6, 500;
$Lt_0_3842:
// Loop body line 107, nesting depth: 1, iterations: 500
st.shared.u32 [%r5+0], %r7; <------ ** %r7 is not initialized anywhere in the PTX file
add.u32 %r5, %r5, 4;
sub.s32 %r6, %r6, 1;
mov.u32 %r8, 0;
setp.ne.s32 %p2, %r6, %r8;
@%p2 bra $Lt_0_3842;

With -G0 (version that works):
$L_0_2562:
.loc 28 107 0
mov.s32 %r14, 0; <---------**Notice how this version initializes the register used for the memset
mov.u32 %r15, __cuda___cuda_local_var_67239_29_rgBuffer4;
mov.s32 %r16, %r15;
mov.s32 %r17, 500;
$L_0_2818:
.loc 28 107 0
st.shared.u32 [%r16+0], %r14;
add.u32 %r16, %r16, 4;
sub.s32 %r17, %r17, 1;
mov.u32 %r18, 0;
setp.gt.s32 %p3, %r17, %r18;
@%p3 bra $L_0_2818;

Notice how the working version initializes the register used for the memset (second line), while the failing version never does. Adding the missing register initialization to the failing version's PTX file and manually building the executable fixes the problem.
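For reference, the manual fix amounted to inserting a single instruction before the store loop in the failing version, mirroring the `mov.s32 %r14, 0;` that the -G0 version emits (a sketch; the exact register number can differ between builds):

```
// inserted just before $Lt_0_3842 in the failing PTX:
mov.s32 %r7, 0;   // give the store register a defined value (0 for memset-to-zero)
```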

Is this a known issue? (I could not find anything in the forums.)
Why does the GTX 460 seem to work (at least when setting to zero) even without the register initialization?

Kernel code:
#define BUFFSIZE 500

__global__ void MemsetTestCUDA( int* piBuffer )
{
    __shared__ int rgBuffer[BUFFSIZE];
    int i;

    for( i = 0; i < BUFFSIZE; i++ )
    {
        rgBuffer[i] = 1; // make sure non-zero
    }

    memset( rgBuffer, 0, sizeof(int) * BUFFSIZE ); // zero out memory

    for( i = 0; i < BUFFSIZE; i++ )
    {
        piBuffer[i] = rgBuffer[i]; // copy to client buffer
    }
}

#include <windows.h>
#include <stdio.h>

void MemsetTest()
{
    void* pvBuffer;
    int i;
    int rgBuffer[BUFFSIZE];

    cudaMalloc( &pvBuffer, sizeof(int) * BUFFSIZE );

    MemsetTestCUDA<<< 1, 1 >>>( (int*)pvBuffer );

    cudaMemcpy( rgBuffer, pvBuffer, sizeof(int) * BUFFSIZE, cudaMemcpyDeviceToHost );

    for( i = 0; i < BUFFSIZE; i++ )
    {
        if( rgBuffer[i] != 0 )
        {
            goto Error;
        }
    }

    printf("\r\n---------------------- MemsetTest Passed --------------------------" );
    return;

Error:
    printf("\r\n---------------------- MemsetTest Failed! --------------------------" );
}
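Until the codegen issue is resolved, one possible workaround is to avoid the device-side memset entirely and zero the shared array with an explicit store loop. This is a sketch assuming the same single-threaded <<<1,1>>> launch as the test above; with more threads the loops would need to be strided by blockDim.x and followed by __syncthreads():

```
__global__ void MemsetTestCUDA( int* piBuffer )
{
    __shared__ int rgBuffer[BUFFSIZE];
    int i;

    for( i = 0; i < BUFFSIZE; i++ )
        rgBuffer[i] = 1;               // make sure non-zero

    for( i = 0; i < BUFFSIZE; i++ )    // explicit loop instead of memset()
        rgBuffer[i] = 0;

    for( i = 0; i < BUFFSIZE; i++ )
        piBuffer[i] = rgBuffer[i];     // copy to client buffer
}
```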


Interesting find. What CUDA SDK and toolkit were you working with?

Don’t call us Floks - I don’t even know what that is ;)

I experienced this on both the 3.1 and 3.2 CUDA SDKs.
I use VS 2008 + Parallel Nsight 1.5.

I promise to spell “Folks” correctly next time :smile:

Well, I have just this exact problem.

void move_to_cuda() {
    cudaMalloc(&h_d, sizeof(int) * width * height);
    cudaMemset(h_d, 1, sizeof(int) * width * height);
    cudaMemcpy(h_h, h_d, sizeof(int) * width * height, cudaMemcpyDeviceToHost);

    for(int i = 0; i < width * height; ++i) {
        printf("%d ", h_h[i]);
    }
    printf("\n");
}

This simple code leaves the memory at seemingly random values on both a GeForce 8800 GTX and a GTX 480.

I am using the build customization for CUDA 3.2 and Nsight 1.5 under VS 2010.

If anyone has a simple solution to this problem, other than memsetting the memory on the host and then copying it over with cudaMemcpy, I'd really appreciate it.

These are my build command lines:

(Approximate command-line, please see the output window after a build for the full command-line)

Driver API (NVCC Compilation Type is .cubin, .gpu, or .ptx)

set CUDAFE_FLAGS=--sdk_dir "C:\Program Files\Microsoft SDKs\Windows\v6.0A"

"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v3.2\bin\nvcc.exe" --use-local-env --cl-version 2008 -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\bin" -G0 --keep-dir "Debug\" -maxrregcount=32 --machine 32 --compile -o "Debug\%(Filename).obj" "%(FullPath)"

Runtime API (NVCC Compilation Type is hybrid object or .c file)

set CUDAFE_FLAGS=--sdk_dir "C:\Program Files\Microsoft SDKs\Windows\v6.0A"

"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v3.2\bin\nvcc.exe" --use-local-env --cl-version 2008 -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\bin" -G0 --keep-dir "Debug\" -maxrregcount=32 --machine 32 --compile -D_NEXUS_DEBUG -g -Xcompiler "/EHsc /nologo /Od /Zi /MDd " -o "Debug\%(Filename).obj" "%(FullPath)"
