Possible bug in the CUDA 3.0 beta when using CUFFT with the driver API

Hello!

I have a problem when using the 3.0 beta (cudadriver_3.0-beta1_linux_64_195.17-beta.run) and the driver API together with CUFFT. The host is native Linux, and gcc is:

Target: x86_64-linux-gnu

Configured with: ../src/configure -v --enable-languages=c,c++ --prefix=/usr --enable-shared --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --enable-nls --with-gxx-include-dir=/usr/include/c++/4.1.3 --program-suffix=-4.1 --enable-__cxa_atexit --enable-clocale=gnu --enable-libstdcxx-debug --with-tune=generic --enable-checking=release x86_64-linux-gnu

Thread model: posix

gcc version 4.1.3 20080704 (prerelease) (Ubuntu 4.1.2-27ubuntu1)

The problem is that I cannot destroy (or push) the context after a CUFFT call has been made. cuCtxPopCurrent reports no error, but a following cuCtxDestroy or cuCtxPushCurrent returns CUDA_ERROR_INVALID_VALUE.

The program I used to detect this is at:

http://pastebin.com/f38a282b6
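The excerpts wrap every call in CUDA_CHECK or CUFFT_CHECK, whose definitions are only in the linked file. As a rough sketch of what such a checking macro might look like (this is an assumption, not the definition from the pastebin), the block below uses a stand-in status type so it compiles without cuda.h. The names sketch_result, sketch_report and SKETCH_CHECK are hypothetical; only the numeric values 0 and 1 mirror CUDA_SUCCESS and CUDA_ERROR_INVALID_VALUE from cuda.h.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for CUresult so the sketch builds without cuda.h.
 * The values mirror CUDA_SUCCESS (0) and CUDA_ERROR_INVALID_VALUE (1). */
typedef enum {
    SKETCH_SUCCESS = 0,
    SKETCH_ERROR_INVALID_VALUE = 1
} sketch_result;

/* Print a diagnostic for a failed call.
 * Returns 0 if the status is success, 1 otherwise. */
static int sketch_report(sketch_result status, const char *expr,
                         const char *file, int line)
{
    if (status == SKETCH_SUCCESS)
        return 0;
    fprintf(stderr, "%s:%d: %s failed with status %d\n",
            file, line, expr, (int)status);
    return 1;
}

/* Evaluate the call exactly once; on failure, report where it
 * happened and terminate the program. */
#define SKETCH_CHECK(call)                                        \
    do {                                                          \
        if (sketch_report((call), #call, __FILE__, __LINE__))     \
            exit(EXIT_FAILURE);                                   \
    } while (0)
```

With the real headers, the same pattern would take a CUresult (or a cufftResult for the CUFFT variant) in place of sketch_result, so a failing cuCtxDestroy would be reported with its file and line instead of passing silently.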

Here are the highlights:

// Get handle for device 0

   CUdevice cuDevice = 0;

   CUDA_CHECK( cuDeviceGet(&cuDevice, 0) );

// Create context

   CUcontext cuContext;

   CUDA_CHECK( cuCtxCreate(&cuContext, 0, cuDevice) );

// Create module from binary file

   CUmodule cuModule;

   printf("Loading module PTX!\n");

   CUDA_CHECK( cuModuleLoad(&cuModule, "test_drv_kernel.ptx") );

// Get function handle from module

   CUfunction vecAdd;

   CUDA_CHECK( cuModuleGetFunction(&vecAdd, cuModule, "VecAdd") );

.... MORE CODE IN ACTUAL FILE ...

// OK, then freeze here

   CUDA_CHECK( cuCtxPopCurrent(NULL));

   printf("Waiting for 1 s %lu %lu\n", sizeof(CUcontext), sizeof(void*));

   sleep(1);

   CUDA_CHECK( cuCtxPushCurrent( cuContext ) );

.... MORE CODE IN ACTUAL FILE ...

   CUFFT_CHECK( cufftPlan1d( &fftkernel, size, CUFFT_C2C, 1 ) );

   CUFFT_CHECK( cufftExecC2C(fftkernel, (cufftComplex *)(size_t)d_a,
                             (cufftComplex *)(size_t)d_a, CUFFT_FORWARD) );

   CUFFT_CHECK( cufftDestroy(fftkernel) );

   CUDA_CHECK( cuMemFree(d_a) );

   CUDA_CHECK( cuMemFree(d_b) );

   CUDA_CHECK( cuMemFree(d_c) );

   CUDA_CHECK( cuCtxPopCurrent(NULL) );

   CUDA_CHECK( cuCtxDestroy( cuContext ) );

The first call that fails is cuCtxDestroy.

Am I doing something wrong, or is this really a bug?

Also, valgrind reports some memory leaks and conditional jumps on uninitialised values inside the libraries:

==12562== Memcheck, a memory error detector												

==12562== Copyright (C) 2002-2009, and GNU GPL'd, by Julian Seward et al.				  

==12562== Using Valgrind-3.5.0-Debian and LibVEX; rerun with -h for copyright info		 

==12562== Command: ./test_drv.exe														  

==12562==																				  

Creating context																		   

==12562== Syscall param ioctl(generic) points to uninitialised byte(s)					 

==12562==	at 0x82CD587: ioctl (in /lib/libc-2.10.1.so)								  

==12562==	by 0x4F29EB3: ??? (in /usr/lib/libcuda.so.195.17)							 

==12562==	by 0x4EF0EBE: ??? (in /usr/lib/libcuda.so.195.17)							 

==12562==	by 0x4ED4057: ??? (in /usr/lib/libcuda.so.195.17)							 

==12562==	by 0x4EACBEA: ??? (in /usr/lib/libcuda.so.195.17)							 

==12562==	by 0x4EA597D: ??? (in /usr/lib/libcuda.so.195.17)							 

==12562==	by 0x4F3D3A6: cuCtxCreate (in /usr/lib/libcuda.so.195.17)					 

==12562==	by 0x4011DB: main (test_drv.c:95)											 

==12562==  Address 0x7fefff6a0 is on thread 1's stack									  

==12562==																				  

Loading module PTX!																		

==12562== Conditional jump or move depends on uninitialised value(s)					   

==12562==	at 0x50C50EF: ??? (in /usr/lib/libcuda.so.195.17)							 

==12562==	by 0x50C4E9B: ??? (in /usr/lib/libcuda.so.195.17)							 

==12562==	by 0x50C53D1: ??? (in /usr/lib/libcuda.so.195.17)							 

==12562==	by 0x50C5750: ??? (in /usr/lib/libcuda.so.195.17)							 

==12562==	by 0x5089E96: ??? (in /usr/lib/libcuda.so.195.17)							 

==12562==	by 0x5090B73: ??? (in /usr/lib/libcuda.so.195.17)							 

==12562==	by 0x5086C81: ??? (in /usr/lib/libcuda.so.195.17)							 

==12562==	by 0x4EBD1DB: ??? (in /usr/lib/libcuda.so.195.17)							 

==12562==	by 0x4EBDEAD: ??? (in /usr/lib/libcuda.so.195.17)							 

==12562==	by 0x4EA9C81: ??? (in /usr/lib/libcuda.so.195.17)							 

==12562==	by 0x4F3C30B: cuModuleLoad (in /usr/lib/libcuda.so.195.17)					

==12562==	by 0x401225: main (test_drv.c:100)											

==12562==																				  

==12562== Conditional jump or move depends on uninitialised value(s)					   

==12562==	at 0x50C50F4: ??? (in /usr/lib/libcuda.so.195.17)							 

==12562==	by 0x50C4E9B: ??? (in /usr/lib/libcuda.so.195.17)							 

==12562==	by 0x50C53D1: ??? (in /usr/lib/libcuda.so.195.17)							 

==12562==	by 0x50C5750: ??? (in /usr/lib/libcuda.so.195.17)							 

==12562==	by 0x5089E96: ??? (in /usr/lib/libcuda.so.195.17)							 

==12562==	by 0x5090B73: ??? (in /usr/lib/libcuda.so.195.17)							 

==12562==	by 0x5086C81: ??? (in /usr/lib/libcuda.so.195.17)							 

==12562==	by 0x4EBD1DB: ??? (in /usr/lib/libcuda.so.195.17)							 

==12562==	by 0x4EBDEAD: ??? (in /usr/lib/libcuda.so.195.17)							 

==12562==	by 0x4EA9C81: ??? (in /usr/lib/libcuda.so.195.17)							 

==12562==	by 0x4F3C30B: cuModuleLoad (in /usr/lib/libcuda.so.195.17)					

==12562==	by 0x401225: main (test_drv.c:100)											

==12562==																				  

Allocating memory																		  

Memory allocated and transferred ok!													   

Doing 40 blocks each containing 128 threads												

Waiting for 1 s 8 8																		

Kernel done!																			   

Total diff was: 0.000000																   

 Testing FFT																			   

Bye bye!																				   

==12562==																				  

==12562== HEAP SUMMARY:																	

==12562==	 in use at exit: 16,592 bytes in 12 blocks									

==12562==   total heap usage: 6,507 allocs, 6,495 frees, 10,715,544 bytes allocated		

==12562==																				  

==12562== LEAK SUMMARY:																	

==12562==	definitely lost: 3,376 bytes in 2 blocks									  

==12562==	indirectly lost: 0 bytes in 0 blocks										  

==12562==	  possibly lost: 0 bytes in 0 blocks										  

==12562==	still reachable: 13,216 bytes in 10 blocks									

==12562==		 suppressed: 0 bytes in 0 blocks										  

==12562== Rerun with --leak-check=full to see details of leaked memory					 

==12562==																				  

==12562== For counts of detected and suppressed errors, rerun with: -v					 

==12562== Use --track-origins=yes to see where uninitialised values come from			  

==12562== ERROR SUMMARY: 3 errors from 3 contexts (suppressed: 4 from 4)				   

sundberg@mediapc:~/eigenor_local/tomosuite/branches/cuda_drv_test$ valgrind --leak-check=full ./test_drv.exe 

==12567== Memcheck, a memory error detector																  

==12567== Copyright (C) 2002-2009, and GNU GPL'd, by Julian Seward et al.									

==12567== Using Valgrind-3.5.0-Debian and LibVEX; rerun with -h for copyright info						   

==12567== Command: ./test_drv.exe																			

==12567==																									

Creating context																							 

==12567== Syscall param ioctl(generic) points to uninitialised byte(s)									   

==12567==	at 0x82CD587: ioctl (in /lib/libc-2.10.1.so)													

==12567==	by 0x4F29EB3: ??? (in /usr/lib/libcuda.so.195.17)											   

==12567==	by 0x4EF0EBE: ??? (in /usr/lib/libcuda.so.195.17)											   

==12567==	by 0x4ED4057: ??? (in /usr/lib/libcuda.so.195.17)											   

==12567==	by 0x4EACBEA: ??? (in /usr/lib/libcuda.so.195.17)											   

==12567==	by 0x4EA597D: ??? (in /usr/lib/libcuda.so.195.17)											   

==12567==	by 0x4F3D3A6: cuCtxCreate (in /usr/lib/libcuda.so.195.17)									   

==12567==	by 0x4011DB: main (test_drv.c:95)															   

==12567==  Address 0x7fefff6a0 is on thread 1's stack														

==12567==																									

Loading module PTX!																						  

==12567== Conditional jump or move depends on uninitialised value(s)										 

==12567==	at 0x50C50EF: ??? (in /usr/lib/libcuda.so.195.17)											   

==12567==	by 0x50C4E9B: ??? (in /usr/lib/libcuda.so.195.17)											   

==12567==	by 0x50C53D1: ??? (in /usr/lib/libcuda.so.195.17)											   

==12567==	by 0x50C5750: ??? (in /usr/lib/libcuda.so.195.17)											   

==12567==	by 0x5089E96: ??? (in /usr/lib/libcuda.so.195.17)											   

==12567==	by 0x5090B73: ??? (in /usr/lib/libcuda.so.195.17)											   

==12567==	by 0x5086C81: ??? (in /usr/lib/libcuda.so.195.17)											   

==12567==	by 0x4EBD1DB: ??? (in /usr/lib/libcuda.so.195.17)											   

==12567==	by 0x4EBDEAD: ??? (in /usr/lib/libcuda.so.195.17)											   

==12567==	by 0x4EA9C81: ??? (in /usr/lib/libcuda.so.195.17)											   

==12567==	by 0x4F3C30B: cuModuleLoad (in /usr/lib/libcuda.so.195.17)									  

==12567==	by 0x401225: main (test_drv.c:100)															  

==12567==																									

==12567== Conditional jump or move depends on uninitialised value(s)										 

==12567==	at 0x50C50F4: ??? (in /usr/lib/libcuda.so.195.17)											   

==12567==	by 0x50C4E9B: ??? (in /usr/lib/libcuda.so.195.17)

==12567==	by 0x50C53D1: ??? (in /usr/lib/libcuda.so.195.17)

==12567==	by 0x50C5750: ??? (in /usr/lib/libcuda.so.195.17)

==12567==	by 0x5089E96: ??? (in /usr/lib/libcuda.so.195.17)

==12567==	by 0x5090B73: ??? (in /usr/lib/libcuda.so.195.17)

==12567==	by 0x5086C81: ??? (in /usr/lib/libcuda.so.195.17)

==12567==	by 0x4EBD1DB: ??? (in /usr/lib/libcuda.so.195.17)

==12567==	by 0x4EBDEAD: ??? (in /usr/lib/libcuda.so.195.17)

==12567==	by 0x4EA9C81: ??? (in /usr/lib/libcuda.so.195.17)

==12567==	by 0x4F3C30B: cuModuleLoad (in /usr/lib/libcuda.so.195.17)

==12567==	by 0x401225: main (test_drv.c:100)

==12567==

Allocating memory

Memory allocated and transferred ok!

Doing 40 blocks each containing 128 threads

Waiting for 1 s 8 8

Kernel done!

Total diff was: 0.000000

 Testing FFT

Bye bye!

==12567==

==12567== HEAP SUMMARY:

==12567==	 in use at exit: 16,592 bytes in 12 blocks

==12567==   total heap usage: 6,507 allocs, 6,495 frees, 10,715,544 bytes allocated

==12567==

==12567== 32 bytes in 1 blocks are definitely lost in loss record 3 of 12

==12567==	at 0x4C25153: malloc (vg_replace_malloc.c:195)

==12567==	by 0x4ECB422: ??? (in /usr/lib/libcuda.so.195.17)

==12567==	by 0x4ECB80C: ??? (in /usr/lib/libcuda.so.195.17)

==12567==	by 0x4EACD34: ??? (in /usr/lib/libcuda.so.195.17)

==12567==	by 0x4EA597D: ??? (in /usr/lib/libcuda.so.195.17)

==12567==	by 0x4F3D3A6: cuCtxCreate (in /usr/lib/libcuda.so.195.17)

==12567==	by 0x4011DB: main (test_drv.c:95)

==12567==

==12567== 3,344 bytes in 1 blocks are definitely lost in loss record 10 of 12

==12567==	at 0x4C25153: malloc (vg_replace_malloc.c:195)

==12567==	by 0x508625D: ??? (in /usr/lib/libcuda.so.195.17)

==12567==	by 0x4EBD14E: ??? (in /usr/lib/libcuda.so.195.17)

==12567==	by 0x4EBDEAD: ??? (in /usr/lib/libcuda.so.195.17)

==12567==	by 0x4EA9C81: ??? (in /usr/lib/libcuda.so.195.17)

==12567==	by 0x4F3C30B: cuModuleLoad (in /usr/lib/libcuda.so.195.17)

==12567==	by 0x401225: main (test_drv.c:100)

==12567==

==12567== LEAK SUMMARY:

==12567==	definitely lost: 3,376 bytes in 2 blocks

==12567==	indirectly lost: 0 bytes in 0 blocks

==12567==	  possibly lost: 0 bytes in 0 blocks

==12567==	still reachable: 13,216 bytes in 10 blocks

==12567==		 suppressed: 0 bytes in 0 blocks

==12567== Reachable blocks (those to which a pointer was found) are not shown.

==12567== To see them, rerun with: --leak-check=full --show-reachable=yes

==12567==

==12567== For counts of detected and suppressed errors, rerun with: -v

==12567== Use --track-origins=yes to see where uninitialised values come from

==12567== ERROR SUMMARY: 5 errors from 5 contexts (suppressed: 4 from 4)

Just to point out that the push fails as well. Modifying the code to:

   CUFFT_CHECK( cufftPlan1d( &fftkernel, size, CUFFT_C2C, 1 ) );

   CUFFT_CHECK( cufftExecC2C(fftkernel, (cufftComplex *)(size_t)d_a,
                             (cufftComplex *)(size_t)d_a, CUFFT_FORWARD) );

   CUFFT_CHECK( cufftDestroy(fftkernel) );

   CUDA_CHECK( cuCtxPopCurrent(NULL) );

   printf("Waiting for 1 s %lu %lu\n", sizeof(CUcontext), sizeof(void*));

   sleep(1);

   CUDA_CHECK( cuCtxPushCurrent( cuContext ) );

This fails on the cuCtxPushCurrent call with CUDA_ERROR_INVALID_VALUE.

You can’t mix the driver and runtime APIs. CUFFT works with the runtime API.

I have the beta 3 drivers, and this thread says that

So I got the impression that I should be able to do that. Is that not the case?

You can’t use context migration with the runtime API (which CUFFT uses) at this time.