Crash after large number of FFTs

There is a problem running many FFTs using CUFFT. After running approximately 400,000 2D FFTs, the CUDA program crashes. Sometimes the crash requires unloading and reloading the kernel module, other times I can start another CUDA program immediately.

Here some stripped down code that shows the problem:

#include <stdlib.h>

#include <time.h>

#include <cuda.h>

#include <cuda_runtime_api.h>

#include <cutil.h>

#include <cufft.h>

#include <iostream>

#include <sstream>

#include <iomanip>

using namespace std;

int nxfft = 256;

int nyfft = 64;

//int batchSize = 280;

//const int num_runs = 1500;

int batchSize = 1;

const int num_runs = 1500 * 280;

const int colwidth = 13;

///////////////////////////////////////////////////////////////////////////////

void fillrand(float2* array, unsigned int size);

int testFFT(int nx, int ny, int count, int direction);

///////////////////////////////////////////////////////////////////////////////

int main()

{

	cudaThreadSynchronize(); // forces CUDA runtime to initialize

	srand((int)time(NULL));

	cout << setfill('-');

	cout << setw(colwidth) << left << "-Run" << "|";

	cout << setw(colwidth) << left << "-Size" << "|";

	cout << setw(colwidth) << left << "-Direction" << "|";

	cout << setw(colwidth) << left << "-Time (ms)" << "|";

	cout << setw(colwidth) << left << "-Time/xform" << "|";

	cout << setfill(' ');

	cout << endl;

	for(int i = 0; i < num_runs; i++)

	{

  cout << setw(colwidth) << left << i << "|";

  if(testFFT(nxfft, nyfft, batchSize, CUFFT_FORWARD) != 0)

  {

  	cout << endl << "===FAILED===" << endl;

  	exit(1);

  }

 cout << setw(colwidth) << left << " " << "|";

  if(testFFT(nxfft, nyfft, batchSize, CUFFT_INVERSE) != 0)

  {

  	cout << endl << "===FAILED===" << endl;

  	exit(1);

  }

	}

	return 0;

}

///////////////////////////////////////////////////////////////////////////////

int testFFT(int nx, int ny, int count, int direction)

{

	ostringstream buf;

	buf << nx << "x" << ny << "x" << count;

	cout << setw(colwidth) << left << buf.str() << "|";

	cout << setw(colwidth) << left << (direction == CUFFT_FORWARD ? "forward" : "inverse") << "|";

	unsigned int timer = 0;

	cutCreateTimer(&timer);

	int size = nx * ny;

	int byte_size = sizeof(float2) * size;

	// Allocate and initalize host memory

	float2* data = (float2*)malloc(byte_size * count);

	if(data == NULL)

	{

  cout << "Allocating host buffer failed" << endl;

  return 1;

	}

	fillrand(data, size * count);

	// Allocate device memory and copy to device

	float2* d_data;

	CUDA_SAFE_CALL(cudaMalloc((void**)&d_data, byte_size * count));

	CUDA_SAFE_CALL(cudaMemcpy(d_data, data, byte_size * count, cudaMemcpyHostToDevice));

	// Run the transform

	cufftHandle cufftplan;

	CUFFT_SAFE_CALL(cufftPlan2d(&cufftplan, ny, nx, CUFFT_C2C));

	cutResetTimer(timer);

	cutStartTimer(timer);

	for(int n = 0; n < count; n++)

	{

  float2* d_ptr = d_data + n * nx * ny;

  CUFFT_SAFE_CALL(cufftExecC2C(cufftplan, (cufftComplex*)d_ptr, (cufftComplex*)d_ptr, direction));

	}

	cudaThreadSynchronize();

	cutStopTimer(timer);

	cout << setw(colwidth) << left << fixed << cutGetTimerValue(timer) << "|";

	cout << setw(colwidth) << left << fixed << cutGetTimerValue(timer) / (float)count << "|" << endl;

	// cleanup

	CUFFT_SAFE_CALL(cufftDestroy(cufftplan));

	CUDA_SAFE_CALL(cudaFree(d_data));

	free(data);

	cutDeleteTimer(timer);

	return 0;

}

///////////////////////////////////////////////////////////////////////////////

void fillrand(float2* array, unsigned int size)

{

	for(unsigned int i = 0; i < size; i++)

	{

  array[i].x = rand() / (float)RAND_MAX;

  array[i].y = rand() / (float)RAND_MAX;

	}

}

Snippet from program output at crash:

196803       |256x64x1     |forward      |0.135000     |0.135000     |

             |256x64x1     |inverse      |0.136000     |0.136000     |

196804       |256x64x1     |forward      |0.136000     |0.136000     |

             |256x64x1     |inverse      |0.135000     |0.135000     |

196805       |256x64x1     |forward      |0.135000     |0.135000     |

             |256x64x1     |inverse      |0.136000     |0.136000     |

cufft: ERROR: execute.cu, line 992

cufft: ERROR: CUFFT_EXEC_FAILED

cufft: ERROR: execute.cu, line 286

cufft: ERROR: CUFFT_EXEC_FAILED

cufft: ERROR: cufft.cu, line 115

cufft: ERROR: CUFFT_EXEC_FAILED

196806       |256x64x1     |forward      |1.785000     |1.785000     |

cufft: ERROR: plan.cu, line 41

cufft: ERROR: CUFFT_INTERNAL_ERROR

cufft: ERROR: context.cu, line 27

Aborted

When I start the program, dmesg shows:

NVRM: API mismatch: the client has the version 100.14.10, but

NVRM: this kernel module has the version 100.14.11.  Please

NVRM: make sure that this kernel module and all NVIDIA driver

NVRM: components have the same version.

(I also get this message for CUDA programs that run normally.) After the crash, dmesg contains:

NVRM: Xid (0001:00): 13, 0001 00000000 000050c0 00000368 00000000 00000080

The system specs are:

Q6600

MSI P6N Diamond motherboard (nForce 680i)

4GB RAM

2 XFX 8800GTX

Ultra X3 1600W power supply

CentOS 4.4 (RHEL 4.4)

CUDA 1.0

This is a particular problem for me - over 50 million FFTs are executed during the run of my algorithm. Let me know if I’ve overlooked something obvious or if I need to submit this as a bug.

Jim

The API mismatch errors suggest that you’re not running the final 1.0 release of CUDA. Please make sure that you’ve downloaded the version on NVIDIA’s website.

I’m not able to build your code on RHEL4. I’m seeing:
#########
]# nvcc foo.cu -o foo -I/root/NVIDIA_CUDA_SDK/common/inc
In file included from /usr/local/cuda/bin/…/include/common_functions.h:88,
from /usr/local/cuda/bin/…/include/crt/host_runtime.h:195,
from /tmp/tmpxft_00005d91_00000000-0.stub.c:5,
from foo.cu:128:
/usr/local/cuda/bin/…/include/math_functions.h: In function long long int __cuda_llabs(long long int)': /usr/local/cuda/bin/../include/math_functions.h:947: error: call of overloaded llabs(long long int&)’ is ambiguous
/usr/include/stdlib.h:783: note: candidates are: long long int llabs(long long int)
/usr/lib/gcc/i386-redhat-linux/3.4.5/…/…/…/…/include/c++/3.4.5/cstdlib:156: note: long long int __gnu_cxx::llabs(long long int)
/usr/lib/gcc/i386-redhat-linux/3.4.5/…/…/…/…/include/c++/3.4.5/bits/stl_iterator_base_types.h: In function typename std::iterator_traits<_Iterator>::iterator_category std::__iterator_category(const _Iter&) [with _Iter = char*]': /usr/lib/gcc/i386-redhat-linux/3.4.5/../../../../include/c++/3.4.5/bits/stl_iterator_base_funcs.h:117: instantiated from typename std::iterator_traits<_Iterator>::difference_type std::distance(_InputIterator, _InputIterator) [with _InputIterator = char*]’
/usr/lib/gcc/i386-redhat-linux/3.4.5/…/…/…/…/include/c++/3.4.5/bits/basic_string.tcc:147: instantiated from static _CharT* std::basic_string<_CharT, _Traits, _Alloc>::_S_construct(_InIterator, _InIterator, const _Alloc&, std::forward_iterator_tag) [with _FwdIterator = char*, _CharT = char, _Traits = std::char_traits<char>, _Alloc = std::allocator<char>]' /usr/lib/gcc/i386-redhat-linux/3.4.5/../../../../include/c++/3.4.5/bits/basic_string.h:1388: instantiated from static _CharT* std::basic_string<_CharT, _Traits, _Alloc>::_S_construct_aux(_InIterator, _InIterator, const _Alloc&, __false_type) [with _InIterator = char*, _CharT = char, _Traits = std::char_traits, _Alloc = std::allocator]’
/usr/lib/gcc/i386-redhat-linux/3.4.5/…/…/…/…/include/c++/3.4.5/bits/basic_string.h:1403: instantiated from static _CharT* std::basic_string<_CharT, _Traits, _Alloc>::_S_construct(_InIterator, _InIterator, const _Alloc&) [with _InIterator = char*, _CharT = char, _Traits = std::char_traits<char>, _Alloc = std::allocator<char>]' /usr/lib/gcc/i386-redhat-linux/3.4.5/../../../../include/c++/3.4.5/bits/basic_string.tcc:244: instantiated from std::basic_string<_CharT, _Traits, _Alloc>::basic_string(_InputIterator, _InputIterator, const _Alloc&) [with _InputIterator = char*, _CharT = char, _Traits = std::char_traits, _Alloc = std::allocator]’
/usr/lib/gcc/i386-redhat-linux/3.4.5/…/…/…/…/include/c++/3.4.5/sstream:147: instantiated from std::basic_string<_CharT, _Traits, _Alloc> std::basic_stringbuf<_CharT, _Traits, _Alloc>::str() const [with _CharT = char, _Traits = std::char_traits<char>, _Alloc = std::allocator<char>]' /usr/lib/gcc/i386-redhat-linux/3.4.5/../../../../include/c++/3.4.5/sstream:523: instantiated from std::basic_string<_CharT, _Traits, _Alloc> std::basic_ostringstream<_CharT, _Traits, _Alloc>::str() const [with _CharT = char, _Traits = std::char_traits, _Alloc = std::allocator]’
foo.cu:68: instantiated from here
/usr/lib/gcc/i386-redhat-linux/3.4.5/…/…/…/…/include/c++/3.4.5/bits/stl_iterator_base_types.h:165: error: dependent-name std::iterator_traits<_Iterator>::iterator_category' is parsed as a non-type, but instantiation yields a type /usr/lib/gcc/i386-redhat-linux/3.4.5/../../../../include/c++/3.4.5/bits/stl_iterator_base_types.h:165: note: saytypename std::iterator_traits<_Iterator>::iterator_category’ if a type is meant
#######

How are you building this code?

Thanks. It looks like I have toolkit and SDK files that have identical names to those on the download page, but the files are different (as tested by diff). I’ll re-install.

The program is a cpp file; compile it with g++, not nvcc. I have a makefile that does the work, but here are the commands from make:

g++ -I. -I/usr/local/cuda/include -I/usr/local/cudasdk/common/inc -I../../include -DUNIX -g -DDEBUG -o ffttest.o -c ffttest.cpp

g++ -fPIC -o ffttest ffttest.o   -L/usr/local/cuda/lib -L/usr/local/cudasdk/lib -L/usr/local/cudasdk/common/lib -lcuda -lcudart -lGL -lGLU -lcutilD -lcufft

Note: The CUDA SDK is installed in /usr/local/cudasdk on my system, not /root/NVIDIA_CUDA_SDK. You should be able to update the commands as necessary for your system.

Edit: Those compile commands have some leftovers from my build environment. If you need more information to get the program built, let me know.

BTW, thanks for the quick response. After fixing the driver/toolkit/sdk mismatch, I’ve been testing other possible sources of the problem.

I pulled the second 8800GTX out of the system and removed the grub.conf and kernel options necessary for multi-GPU. (uppermem 524288; vmalloc=256MB pci=nommconf) The program still crashes around 200,000 loops.

The system boots with the kernel parameter vga=0x307 to get a framebuffer console. I’ve removed this kernel option and only have one GPU in the system (as above). So far I’ve now run the program twice successfully through all 420,000 loops (840,000 FFTs). I’m trying again after putting the second GPU back in.

So it appears the problem may be related to enabling the framebuffer. The framebuffer console is not required in my case, so it’s not a showstopper if the frambuffer is the problem. I’ll test my full code later today without the framebuffer enabled to see if that fixes the problem that led me to look at CUFFT.

Jim

I’ve modified this test program to run in an infinite loop. With the console framebuffer disabled, the program has run for over 12 hours now. I can do without the framebuffer, but I’d be happy to open a bug if you like. Let me know.

Jim