CPU hangs when calling thrust::copy_if

I am running a simple test program that loops on thrust::copy_if(). While running this code, the CPU hangs. The hang occurs after many millions of loop passes, or sometimes after just a few thousand. I tested the code with 4 types of GTX GPUs. The problem shows up only with some models:
GTX 780 Ti - tested on 5 cards. No problem.
Titan (Kepler) - No problem.
GTX 980 - tested on 6 cards. All hung.
Titan (Maxwell, 12 GB RAM) - tested on 2 cards. All hung.

Below is the source code, which consists of 2 source files: main.cpp and copyIfWrapper.cu.

Here is main.cpp, which initializes the pointers and loops on copyIfWrapper:

#include <iostream>
#include <cstdlib>
#include <cuda.h>
#include <cuda_runtime.h>

void copyIfWrapper(float *d_in, float *d_out, unsigned int length, float thresh);

int main(int argc, char **argv)
{
	if (argc < 2) {
		std::cout << "Format: " << argv[0] << " <devid>\n";
		exit(0);
	}
	unsigned int devId = atoi(argv[1]);
	std::cout << "devId: " << devId << "\n";

	if (cudaSuccess != cudaSetDevice((int)devId)) {
		std::cout << "cudaSetDevice err\n";
	}
	if (cudaSuccess != cudaDeviceReset()) {
		std::cout << "cudaDeviceReset err\n";
	}

	unsigned int arraySize = 400000;
	float *d_outVal;
	float *d_inVal;

	if (cudaSuccess != cudaMalloc((void **)&d_outVal, arraySize * sizeof(float))) {
		std::cout << "cudaMalloc err\n";
	}
	if (cudaSuccess != cudaMalloc((void **)&d_inVal, arraySize * sizeof(float))) {
		std::cout << "cudaMalloc err\n";
	}

	// prepare values for the copy
	float *hostBuf = new float[arraySize];
	for (unsigned int idx = 0; idx < arraySize; idx++) {
		hostBuf[idx] = 0; // the program hangs regardless of the value here
	}
	cudaMemcpy(d_inVal, hostBuf, arraySize * sizeof(float), cudaMemcpyHostToDevice);

	float m_thresh = 0.34f;
	unsigned int debugLoopCounter = 0;
	while (true) {
		// debug prints:
		std::cout << "\r" << debugLoopCounter++;
		std::cout << " m_maxVoxCount.length " << arraySize << std::flush;
		copyIfWrapper(d_inVal, d_outVal, arraySize, m_thresh);
	}

	return 0;
}

Next is copyIfWrapper.cu:

#include <thrust/device_ptr.h>
#include <thrust/copy.h>

struct aboveThresh_predicate
{
	__host__ __device__
	bool operator()(const float val) const
	{
		return val > 0.24f;
	}
};

void copyIfWrapper(float *d_in, float *d_out, unsigned int length, float thresh)
{
	thrust::device_ptr<float> d_inVPtr(d_in);
	thrust::device_ptr<float> d_outPtr(d_out);

	// apply threshold using copy_if with "stencil[i] = (values[i] > thresh)"
	// note: the predicate hardcodes 0.24, so the thresh argument is unused here
	thrust::copy_if(d_inVPtr, d_inVPtr + length, d_inVPtr, d_outPtr,
	                aboveThresh_predicate());
}

Here is the backtrace captured while the CPU hangs:

#14 in main () at main.cpp: 49
#13 in copyIfWrapper()
#12 in thrust::device_ptr&lt;float&gt; thrust::system::cuda::detail::copy_if&lt;thrust::system....etc..
#11 in thrust::system::cuda::detail::trivial_copy_n&lt;thrust::system::cuda::detail::tag, ....etc
#10 in cudaMemcpy () from libcudart.so.6.5
#9 in ?? () from libcudart.so.6.5
#8 in cuMemcpyDtoH_v2 () from libcuda.so.1
#7 in ?? () from libcuda.so.1
#6 in ?? () from libcuda.so.1
#5 in ?? () from libcuda.so.1
#3 in ?? () from libcuda.so.1
#2 in ?? () from libcuda.so.1
#1 in clock_gettime () from librt.so.1
#0 in clock_gettime ()

Note: In main.cpp, the copied value is set to 0, which is below the threshold (0.24). Still, any other value makes it hang too.
I would be grateful to get any idea about this behavior.
Thanks,
Ronen Halevy.

seems to be only maxwell then

“libcudart.so.6.5”

your version of 6.5 supports maxwell?

I would suggest trying both a more recent version of CUDA such as 7.0 or 7.5RC, and also making sure you have the latest driver. Bugs get fixed all the time.

I’ve run your code on a Fermi device and a Kepler device on CUDA 7.5RC and I was not able to see any problems after several million iterations each. I’ll try to test on a Maxwell device as time permits.

@little_jimmy:

https://developer.nvidia.com/cuda-downloads-geforce-gtx9xx
CUDA 6.5 Production Release with Support for GeForce GTX9xx GPUs

i remember an earlier 6.5 version without said support; hence the point

Hi Guys,
Problem is still not solved.
Following your suggestions, I have installed CUDA 7.0 + the latest driver, but that did not solve the problem.
Driver version: 346.46
OS: Linux Centos 6.5

I have just executed the code on 4 GTX 980s simultaneously. It didn't take long until all of them hung.
The CPU backtrace is much the same - please find it below.

Would anyone be able to duplicate the setup?

Thanks.
ronen.

#15 in main () at main.cpp: 49
#14 in copyIfWrapper()
#13 in thrust::device_ptr&lt;float&gt; thrust::system::cuda::detail::copy_if&lt;thrust::system....etc..
#12 in thrust::system::cuda::detail::trivial_copy_n&lt;thrust::system::cuda::detail::tag, ....etc
#11 in cudaMemcpy () from libcudart.so.7.0
#10 in ?? () from libcudart.so.7.0
#9 in ?? () from libcudart.so.7.0
#8 in cuMemcpyDtoH_v2 () from libcuda.so.1
#7 in ?? () from libcuda.so.1
#6 in ?? () from libcuda.so.1
#5 in ?? () from libcuda.so.1
#3 in ?? () from libcuda.so.1
#2 in ?? () from libcuda.so.1
#1 in clock_gettime () from librt.so.1
#0 in clock_gettime ()

your gcc version?

if you run the program in the debugger, and you suspend it whenever it stalls, how does the stack compare?
the debugger sometimes has a more elaborate stack report (less ??)

Hi little_jimmy,
gcc ver is 4.4.7.

The program was not executed from within the debugger.
Here is the procedure taken:

  1. Compile with -g. (It hangs without -g as well.)
  2. Run the program with the device id as an argument, e.g. ./main 0
  3. A counter, incremented by 1 on every pass, is printed at the command line (see source, lines 47-48).
  4. Wait until the running counter freezes. This is the indication of a hang.
  5. Attach the debugger to the process, and inspect the backtrace.
    Thanks,
    ronen.

isn’t gcc 4.4.7 (extremely) “old”?

i realize centos and rhel may require and build on more stable versions, but regardless

perhaps check that your gcc version is not too old for your cuda version - just check the cuda7/ centos prerequisites

gcc 4.4.7 ships with CentOS 6.x, and is listed as supported by CUDA 7.0.
Thanks,
ronen.

“#8 in cuMemcpyDtoH_v2 () from libcuda.so.1”

the code seems to trip on maxwell, during the D2H memory copy

this may be software or hardware related
perhaps a key hardware feature/discrepancy between maxwell and the others allows the tripping point

the stack report is normally more informative, and can indeed be helpful
i remember having identified that the code is polling indefinitely, etc from the stack report, before
in your case, the stack report is less informative, likely because you subsequently attach to the program “from behind”

i do not see why you would not be able to run the code from within the debugger, even if you pass in arguments
you should be able to get a more informative stack report then, with fewer (??)

Hi l_j,
It seems that the stack backtrace is less informative for .cu files.

  • Running from within a debugger or attaching to it later - the same backtrace results either way.
    Thanks,
    ronen.

looking at your options:
either this is a bug (i doubt), or it is not a bug
bugs take enormous time to fix (certainly not a day or two), hence you may just as well interrogate the matter yourself

i do not know how thrust synchronizes; the stack may point to the D2H memory copy simply because it is the point of synchronization (a non-async memory copy should ‘stall’ as it waits for completion); it then becomes a synchronization point that is never reached

i think you need to attack the thrust call, in order to see exactly what is upsetting its stomach
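
as a starting point, a sketch of what i mean (the checkCuda helper is my own, not from your code): bracket the call with an explicit error check and a synchronize, so a launch failure gets reported instead of silently turning into a stall at the next blocking call

```cuda
// sketch only: the original wrapper, bracketed with explicit error checks
#include <cstdio>
#include <cstdlib>
#include <thrust/device_ptr.h>
#include <thrust/copy.h>

static void checkCuda(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
        exit(1);
    }
}

struct aboveThresh_predicate
{
    __host__ __device__ bool operator()(const float val) const
    {
        return val > 0.24f;
    }
};

void copyIfWrapperChecked(float *d_in, float *d_out, unsigned int length)
{
    thrust::device_ptr<float> inPtr(d_in);
    thrust::device_ptr<float> outPtr(d_out);

    thrust::copy_if(inPtr, inPtr + length, inPtr, outPtr,
                    aboveThresh_predicate());

    // any launch error from thrust's internal kernels surfaces here
    checkCuda(cudaGetLastError(), "copy_if launch");
    // a kernel that never completes would now hang here, explicitly
    checkCuda(cudaDeviceSynchronize(), "copy_if sync");
}
```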

i would consider using a separate input array for the stencil as well - even if the values are the same, try using 2 different input arrays: d_inVPtr, d_inVPtr2
next, use the most elementary predicate possible - one from an example known to work perhaps
you could also demote the function - see if you can get a thrust::copy going with the same input parameters
if this yields nothing of note, you may very well simply code your own thrust version
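
as a reference point for that last option, a minimal hand-rolled copy_if could look like the sketch below (my own code, not equivalent to thrust: survivors may land out of order, whereas thrust::copy_if is stable) - it also shows why the hang surfaces inside cudaMemcpy: the blocking d2h copy of the count is where the host waits for the kernel

```cuda
#include <cuda_runtime.h>

// sketch of a hand-rolled copy_if: each surviving element grabs an output
// slot via atomicAdd; unlike thrust::copy_if, output order is not preserved
__global__ void copyIfKernel(const float *in, float *out, unsigned int n,
                             float thresh, unsigned int *count)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && in[i] > thresh) {
        out[atomicAdd(count, 1u)] = in[i];
    }
}

// returns the number of copied elements
unsigned int copyIfHandRolled(const float *d_in, float *d_out,
                              unsigned int n, float thresh)
{
    unsigned int *d_count;
    cudaMalloc((void **)&d_count, sizeof(unsigned int));
    cudaMemset(d_count, 0, sizeof(unsigned int));

    unsigned int block = 256;
    unsigned int grid = (n + block - 1) / block;
    copyIfKernel<<<grid, block>>>(d_in, d_out, n, thresh, d_count);

    // this blocking d2h copy is the synchronization point - a kernel that
    // never finishes makes the host stall right here, just like the
    // cudaMemcpy frame in the backtraces above
    unsigned int h_count = 0;
    cudaMemcpy(&h_count, d_count, sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_count);
    return h_count;
}
```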

Hi l_j,
Replacing the thrust functions with self-coded functions is indeed a solution. It's not only copy_if; thrust::transform() hangs as well.
Anyway, I'm testing with the CUDA 7.5 RC now. It has been running without a problem for more than 24 hours already, duplicated on 4 GPUs. Crossing fingers…
I’ll keep you updated soon.
Thanks,
ronen.

Hi,
This issue is solved: the CUDA 7.5 RC runs without hangs.
While migrating the entire program to CUDA 7.5, I'm hitting crashes which don't show up in previous CUDA SDK releases.
The problem is with thrust::reduce_by_key, which crashes for input sizes greater than 20000.
I’ll open a new topic dedicated to that issue.
Thanks,
ronen.