Memory Corruption on a Fermi-Class GPU Error only on Fermis, program works on non-Fermis.

Hi,

I am facing the following problem on a GeForce GTX 580 (Fermi-class) GPU.

Just to give you some background, I am reading single-byte samples packed in the following manner in a file: Real(Signal 1), Imaginary(Signal 1), Real(Signal 2), Imaginary(Signal 2). (Each byte is a signed char, taking values between, -128 and 127.) I read these into a char4 array, use a custom function to copy them to two float2 arrays corresponding to each signal. (This is just an isolated part of a larger program.)

When I run the program using cuda-memcheck, I get either:

or

or

The thread and block indices where the invalid writes happen are random.

The code is attached, and the main kernel is reproduced below. The strange thing is that this code works (and cuda-memcheck throws no error) on a non-Fermi-class GPU that I have access to. Is it that the kernel needs to be rewritten in some way for Fermi-class GPUs, or could it be that the specific GPU that I am working on is broken? Another point to note is that the Fermi gives no error for N <= 8192, so I am more inclined towards the latter possibility.

Here is the kernel:

__global__ void CopyDataForFFT(char4 *pc4Data,

                               float2 *pf2FFTInX,

                               float2 *pf2FFTInY)

{

    int i = (blockIdx.x * blockDim.x) + threadIdx.x;

pf2FFTInX[i].x = (float) pc4Data[i].x;

    pf2FFTInX[i].y = (float) pc4Data[i].y;

    pf2FFTInY[i].x = (float) pc4Data[i].z;

    pf2FFTInY[i].y = (float) pc4Data[i].w;

return;

}

I use CUDA 4.0 (also tried CUDA 3.2) on RHEL 5.6. Details of the GPU that might be relevant:

CUDA Capability Major/Minor version number: 2.0

Total amount of global memory: 1535 MBytes

(16) Multiprocessors x (32) CUDA Cores/MP: 512 CUDA Cores

Total amount of constant memory: 65536 bytes

Total amount of shared memory per block: 49152 bytes

Total number of registers available per block: 32768

Maximum number of threads per block: 1024

Maximum sizes of each dimension of a block: 1024 x 1024 x 64

Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535

Any help will be greatly appreciated! Thank you!
testfft.cu (3.78 KB)

“Address 0x600222738” suspiciously reminds me of this thread: Weird pointer arithmetic bug ?

I’m sorry I don’t have a solution, but I suspect compiling as a 32 bit program might work around the problem.

Showing the kernel launch parameters could help shine some more light on this External Image

Thank you for your responses!

@tera: I compiled the program with the ‘-m32’ option, but when I run the program I get the error ‘CUDA driver version is insufficient for CUDA runtime version’, which I suspect is because I haven’t installed the 32-bit compatibility libraries. I’ll try again after I install those.

@Skybuck: The kernel invocation and related parts of the code are given below. (The entire code is attached with my first post, if you’d like to take a look.)

#define N   32768

...

dim3 dimBCopy(1, 1, 1);

    dim3 dimGCopy(1, 1);

...

dimBCopy.x = 1024;  /* number of threads in a block, for my GPU. I tried 512 instead of 1024 on these two lines, too, just for fun - didn't help. */

    dimGCopy.x = N / 1024;

CopyDataForFFT<<<dimGCopy, dimBCopy>>>(pc4Buf_d,

                                           pf2InX_d,

                                           pf2InY_d);

In continuing with the weird behaviour, today I observed that when I run the program using cuda-memcheck, I get these 16x16 blocks of multi-coloured pixels distributed randomly all over my screen. This does not happen if I run the program directly. External Image

UPDATE: The 32-bit compatibility libraries are OpenGL libraries, so that’s not relevant here. Anyway, I tried installing the 32-bit CUDA Toolkit 4.0, but faced the same problem.

One other thing I noticed in my program is that if I comment out a couple of lines in my kernel, there’s no memory error.

The original kernel:

__global__ void CopyDataForFFT(char4 *pc4Data,

                               float2 *pf2FFTInX,

                               float2 *pf2FFTInY)

{

    int i = (blockIdx.x * blockDim.x) + threadIdx.x;

pf2FFTInX[i].x = (float) pc4Data[i].x;

    pf2FFTInX[i].y = (float) pc4Data[i].y;

    pf2FFTInY[i].x = (float) pc4Data[i].z;

    pf2FFTInY[i].y = (float) pc4Data[i].w;

return;

}

Compilation output (with --ptxas-options=-v):

Modified kernel:

__global__ void CopyDataForFFT(char4 *pc4Data,

                               float2 *pf2FFTInX,

                               float2 *pf2FFTInY)

{

    int i = (blockIdx.x * blockDim.x) + threadIdx.x;

pf2FFTInX[i].x = (float) pc4Data[i].x;

    pf2FFTInX[i].y = (float) pc4Data[i].y;

    //pf2FFTInY[i].x = (float) pc4Data[i].z;

    //pf2FFTInY[i].y = (float) pc4Data[i].w;

return;

}

Compilation output (with --ptxas-options=-v):

As you can see in the compilation output, the modified kernel uses one less register.

Is this some sort of hardware issue, or a CUDA bug?

Hi Cudator, does your app use OpenGL interop? i.e. are the cuda kernels writing to buffers that have been allocated using OpenGL calls and then mapped for use by CUDA? Cuda-memcheck in CUDA 4.0 does not support detection of accesses to interop buffers, which may explain some of the issues you are seeing. If possible, could you try converting the buffer allocations to use cudaMalloc/cudaFree ?

I’m not using OpenGL. I’m allocating memory using plain cudaMalloc().

If this code works:

pf2FFTInX[i].x = (float) pc4Data[i].x;
pf2FFTInX[i].y = (float) pc4Data[i].y;
//pf2FFTInY[i].x = (float) pc4Data[i].z;
//pf2FFTInY[i].y = (float) pc4Data[i].w;

But without comments it doesn’t then this could be an indication that .z and .w does not exist for the last element i.

So perhaps the pointer pc4Data is not actually pointing towards an array of 4 bytes per element ?

Or perhaps the allocation size was not large enough.

This is just some general programming advice, I will have to look deeper into your code later on, so I might get back to you on this, for now this is my first initial guess/observation/experience sharing (lol) External Image :)

It’s actually independent of which two statements are commented. It seems the lesser number of registers used, the lesser the number of errors thrown up. In fact, I just noticed that even if I comment out two statement, I get these weird errors once in a while. External Image

I need to correct myself: If I comment out either the first two or the last two char-to-float assignment statements in my kernel, there’s no memory error. If I comment out one from the first two (pf2FFTInX), and another from the second two (pf2FFTInY), errors still crop up, but less frequently. The kernel uses 6 registers with all four assignment statements uncommented, and uses 4 registers with two assignment statements commented out.

One final update for the day: Just for the heck of it, I inserted a __syncthreads() between the pf2FFTInX and the pf2FFTInY assignment statements, and memory errors disappeared for N = 32768. But at N = 65536, I still get errors.

UPDATE: Ignore my previous post. I still get errors at N = 32768.

Ok,

Now I see what the problem is, it’s probably very easy:

You are trying to typecast a char to a float.

A char is usually 1 byte, a float is usually 4 bytes.

I am not sure what the C language is supposed to do when you typecast:

(float)(char).

But I would not be surprised if this creates a “memory overrun” like a buffer overrun.

You should verify is it’s malfunctioning yes or no.

So what happens if a char is typecasted to a float ?!? Does this produce an overflow in C does this produce an overflow in Cuda C ?!

Once you know the answer to test we can proceed… But this is my best guess so far External Image

I am not going to test this myself now, because I have other things to do… but you should be able to test it somehow ! External Image

Let me know how that goes ! External Image Perhaps show a test example.

Actually I am going to test this in C because it’s kinda interesting.

Delphi sometimes does automatic conversions for these kind of typecasts by using temporarelies to scale them up/resize them…

I am interested in what C’s behaviour is External Image

Personally I never use code like that… this could introduce porting/translating issue’s so it’s quite interesting ! External Image :)

At least visual c/c++/cpu seems to work ok:

Perhaps you could test this code inside a kernel and see what happens for cuda/c/c++/nvcc/gpu:

// TestProgram.cpp : Defines the entry point for the console application.

//

/*

(C++) Test program to see what happens when C types are typecasted to larger types.

So far Visual C/C++ Compiler seems to get off the hook and works ok.

This leaves question what Cuda C/C++ (nvcc.exe) would do <img src='http://forums.nvidia.com/public/style_emoticons/<#EMO_DIR#>/wink.gif' class='bbc_emoticon' alt=';)' />

Especially for char4 type ! <img src='http://forums.nvidia.com/public/style_emoticons/<#EMO_DIR#>/wink.gif' class='bbc_emoticon' alt=';)' />

*/

#include "stdafx.h"

int _tmain(int argc, _TCHAR* argv[])

{

	int IntegerArray[4];

	long long LargeInteger;

	char CharArray[4];

	float FloatArrayA[2];

	float FloatArrayB[2];

//	int Sentinel;

	IntegerArray[0] = 0;

	IntegerArray[1] = 0;

	IntegerArray[2] = 0;

	IntegerArray[3] = 0;

	IntegerArray[0] = 5354432;

	IntegerArray[2] = 2654762;

//	Sentinel = 0;

	LargeInteger = (long long)(IntegerArray[0]) * (long long)(IntegerArray[2]);

	printf("IntegerArray[0]: %d\n", IntegerArray[0] );

	printf("IntegerArray[1]: %d\n", IntegerArray[1] );

	printf("IntegerArray[2]: %d\n", IntegerArray[2] );

	printf("IntegerArray[3]: %d\n", IntegerArray[3] );

	printf("LargeInteger: %lld \n", LargeInteger );

	CharArray[0] = (char)255;

	CharArray[1] = 124;

	CharArray[2] = 63;

	CharArray[3] = 14;

	FloatArrayA[0] = (float) CharArray[0];

	FloatArrayA[1] = (float) CharArray[1];

	FloatArrayB[0] = (float) CharArray[2];

	FloatArrayB[1] = (float) CharArray[3];

	printf("FloatArrayA[0]: %f \n", FloatArrayA[0] );

	printf("FloatArrayA[1]: %f \n", FloatArrayA[1] );

	printf("FloatArrayB[0]: %f \n", FloatArrayB[0] );

	printf("FloatArrayB[1]: %f \n", FloatArrayB[1] );

	// printf("Sentinel: %d \n", Sentinel );

	return 0;

}

Perhaps the problem is with something else :|

Maybe weird character values create some kind of weird floating point problem ? Seems unlikely… but still…

Here is 32 bit version of kernel in ptx form:

"
.version 2.3
.target sm_20
.address_size 32
// compiled with C:\Tools\CUDA\Toolkit 4.0\v4.0\bin/…/open64/lib//be.exe
// nvopencc 4.0 built on 2011-05-13

//-----------------------------------------------------------
// Compiling C:/Users/Skybuck/AppData/Local/Temp/tmpxft_00000f34_00000000-11_testfft.cpp3.i (C:/Users/Skybuck/AppData/Local/Temp/ccBI#.a04964)
//-----------------------------------------------------------

//-----------------------------------------------------------
// Options:
//-----------------------------------------------------------
//  Target:ptx, ISA:sm_20, Endian:little, Pointer Size:32
//  -O3	(Optimization level)
//  -g0	(Debug level)
//  -m2	(Report advisories)
//-----------------------------------------------------------

.file	1	"C:/Users/Skybuck/AppData/Local/Temp/tmpxft_00000f34_00000000-10_testfft.cudafe2.gpu"
.file	2	"c:\tools\microsoft visual studio 10.0\vc\include\codeanalysis\sourceannotations.h"
.file	3	"C:\Tools\CUDA\Toolkit 4.0\v4.0\bin/../include\crt/device_runtime.h"
.file	4	"C:\Tools\CUDA\Toolkit 4.0\v4.0\bin/../include\host_defines.h"
.file	5	"C:\Tools\CUDA\Toolkit 4.0\v4.0\bin/../include\builtin_types.h"
.file	6	"c:\tools\cuda\toolkit 4.0\v4.0\include\device_types.h"
.file	7	"c:\tools\cuda\toolkit 4.0\v4.0\include\driver_types.h"
.file	8	"c:\tools\cuda\toolkit 4.0\v4.0\include\surface_types.h"
.file	9	"c:\tools\cuda\toolkit 4.0\v4.0\include\texture_types.h"
.file	10	"c:\tools\cuda\toolkit 4.0\v4.0\include\vector_types.h"
.file	11	"c:\tools\cuda\toolkit 4.0\v4.0\include\builtin_types.h"
.file	12	"c:\tools\cuda\toolkit 4.0\v4.0\include\host_defines.h"
.file	13	"C:\Tools\CUDA\Toolkit 4.0\v4.0\bin/../include\device_launch_parameters.h"
.file	14	"c:\tools\cuda\toolkit 4.0\v4.0\include\crt\storage_class.h"
.file	15	"C:\Tools\Microsoft Visual Studio 10.0\VC\bin/../../VC/INCLUDE\time.h"
.file	16	"C:/Junk tijdelijk/testfft.cu"
.file	17	"C:\Tools\CUDA\Toolkit 4.0\v4.0\bin/../include\common_functions.h"
.file	18	"c:\tools\cuda\toolkit 4.0\v4.0\include\math_functions.h"
.file	19	"c:\tools\cuda\toolkit 4.0\v4.0\include\math_constants.h"
.file	20	"c:\tools\cuda\toolkit 4.0\v4.0\include\device_functions.h"
.file	21	"c:\tools\cuda\toolkit 4.0\v4.0\include\sm_11_atomic_functions.h"
.file	22	"c:\tools\cuda\toolkit 4.0\v4.0\include\sm_12_atomic_functions.h"
.file	23	"c:\tools\cuda\toolkit 4.0\v4.0\include\sm_13_double_functions.h"
.file	24	"c:\tools\cuda\toolkit 4.0\v4.0\include\sm_20_atomic_functions.h"
.file	25	"c:\tools\cuda\toolkit 4.0\v4.0\include\sm_20_intrinsics.h"
.file	26	"c:\tools\cuda\toolkit 4.0\v4.0\include\surface_functions.h"
.file	27	"c:\tools\cuda\toolkit 4.0\v4.0\include\texture_fetch_functions.h"
.file	28	"c:\tools\cuda\toolkit 4.0\v4.0\include\math_functions_dbl_ptx3.h"


.entry CopyDataForFFT (
	.param .u32 __cudaparm_CopyDataForFFT_pc4Data,
	.param .u32 __cudaparm_CopyDataForFFT_pf2FFTInX,
	.param .u32 __cudaparm_CopyDataForFFT_pf2FFTInY)
{
.reg .u32 %r<19>;
.reg .f32 %f<6>;
.loc	16	6	0

$LDWbegin_CopyDataForFFT:
.loc 16 10 0
mov.u32 %r1, %ctaid.x;
mov.u32 %r2, %ntid.x;
mul.lo.u32 %r3, %r1, %r2;
mov.u32 %r4, %tid.x;
add.u32 %r5, %r4, %r3;
mul.lo.u32 %r6, %r5, 4;
mul.lo.u32 %r7, %r5, 8;
ld.param.u32 %r8, [__cudaparm_CopyDataForFFT_pc4Data];
add.u32 %r9, %r8, %r6;
ld.param.u32 %r10, [__cudaparm_CopyDataForFFT_pf2FFTInX];
add.u32 %r11, %r10, %r7;
ld.global.v4.s8 {%r12,%r13,%r14,%r15}, [%r9+0];
cvt.rn.f32.s32 %f1, %r12;
.loc 16 11 0
cvt.rn.f32.s32 %f2, %r13;
st.global.v2.f32 [%r11+0], {%f1,%f2};
.loc 16 12 0
ld.param.u32 %r16, [__cudaparm_CopyDataForFFT_pf2FFTInY];
add.u32 %r17, %r16, %r7;
cvt.rn.f32.s32 %f3, %r14;
.loc 16 13 0
cvt.rn.f32.s32 %f4, %r15;
st.global.v2.f32 [%r17+0], {%f3,%f4};
.loc 16 15 0
exit;
$LDWend_CopyDataForFFT:
} // CopyDataForFFT

"

Maybe you should try compiling for 32 bit machine, maybe that help ?

Have you successfully compiled it for 32 bit machine ? (just requires a compiler switch for nvcc probably… no special libaries should be required I think… only one nvcuda.dll as far as I know…)Or maybe one nvcuda.dll is in 32 bit folder and one in 64 bit folder somewhere in windows… hmm…

I can’t really test the program since I have no data file… also this seems to be written for unix/linux or something… but I might be able to convert it to windows.

Perhaps you can provide the data file… then I can give it a test to see what happens on my card… then this could help determine if it’s your hardware at fault or the program External Image

(Could also be compiler bug ofcourse… hard to say… for now… External Image External Image

@Skybuck: Thank you for your interest. But it now looks like a bad card. Many of the GPU computing SDK samples fail, and one of them throws up pixel blocks randomly all over the screen. Besides, I had a friend run this program on a machine with a Fermi-architecture card, and he reports that it works for him.