Error 719 (failure to launch) for JCUDA and PyCUDA; How to run GPU consecutive times for 'large' data blocks

Background:
Consecutive runs of the GPU with JCUDA code (through Eclipse Neon) end the second iterative step with lauch failure error 719. Similar issue with PyCUDA (running from VS 2015/NSight). However pure C version (.CU) in VS 2015/Nsight runs without issue.

Question(s):

  1. Is there any way to get JCuda to run in its entirety? Why wont it finish?
  2. Are there any NSight debugging capabilities on a windows machine for JCuda? No NSight tools were added to Eclipse during the NVIDIA Nsight installation.

Note:
For completeness, the “Sieve of Atkin” was implemented for CPU. It returns the correct answers and in less time - so getting the correct prime numbers isn’t the issue here. The objective here is to find out how to run the GPU consecutive times for large blocks of data.

System Information
DeviceQuery:

Detected 1 CUDA Capable device(s)
Device 0: “GeForce GT 755M”
CUDA Driver Version / Runtime Version 8.0 / 8.0
CUDA Capability Major/Minor version number: 3.0
Total amount of global memory: 2048 MBytes (2147483648 bytes) - [32768x64536]
( 2) Multiprocessors, (192) CUDA Cores/MP: 384 CUDA Cores
GPU Max Clock rate: 1020 MHz (1.02 GHz)
Memory Clock rate: 2700 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 262144 bytes
Total amount of constant memory: 65536 bytes - used for kernel param passing.
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)

MEMCHECK OUTPUT
“C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\bin\cuda-memcheck.exe” java -cp .;.\libs\JCuda-All-0.8.0-bin-windows-x86_64\jcuda-0.8.0.jar com.self.euler.search.numerical.Problem0027gpu

========= CUDA-MEMCHECK
Runtime=11
Driver=11
Step=0 start=0 endVal=200000
First = 0, Last = 199999 len = 200000
Host to device copy complete!
Kernel params set!
Grid Size & Block Size = 196, 1024
Context Sync starting:
Device to Host Copy complete!
For numElements = 200000, primes found = 17984

Reset = 0
Step=1 start=200000 endVal=400000
First = 200000, Last = 399999 len = 200000
Host to device copy complete!
Kernel params set!
Grid Size & Block Size = 196, 1024
Context Sync starting:
Exception in thread “main” jcuda.CudaException: CUDA_ERROR_LAUNCH_FAILED
at jcuda.driver.JCudaDriver.checkResult(JCudaDriver.java:337)
at jcuda.driver.JCudaDriver.cuCtxSynchronize(JCudaDriver.java:1983)
at com.self.euler.search.numerical.Problem0027gpu.getPrimesInRange(Problem0027gpu.java:165)
at com.self.euler.search.numerical.Problem0027gpu.main(Problem0027gpu.java:93)
========= Program hit CUDA_ERROR_LAUNCH_FAILED (error 719) due to “unspecified launch failure” on CUDA API call to cuCtxSynchronize.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:C:\Windows\SYSTEM32\nvcuda.dll (cuD3D10GetDevices + 0x18d2cc) [0x194bbd]
========= Host Frame:[0x2507f74]
=========
========= ERROR SUMMARY: 1 error

CODE (Excuse the similar looking code to other working examples)

JAVA CUDA VERSION (results in fail to launch on second iteration of main loop):
nvcc .cu file is compiled with 64 bit machine settings.

public static void main(String[] args) throws IOException {
	
	System.out.println("Runtime="+JCuda.cudaRuntimeGetVersion(null));
	System.out.println("Driver="+JCuda.cudaDriverGetVersion(null));
	
	int stepSize = 5;
	// Length of each portion part
	int stepinterval = 1000000 / stepSize;

	for (int step = 0; step < stepSize; step++) {
		int startVal = stepinterval * step;
		int endVal = startVal + stepinterval;
		System.out.println("Step=" + step + " start=" + startVal + " endVal=" + endVal);
		List<Integer> primeVals = getPrimesInRange(startVal, endVal);
		System.out.println("For numElements = " + (endVal - startVal) + ", primes found = " + primeVals.size());
		int res = JCuda.cudaDeviceReset();
		System.out.println(">>>Reset = "+res);
	}

}

private static List<Integer> getPrimesInRange(int startVal, int endVal) throws IOException {
	int res = 0;
	int numElements = endVal - startVal;

	// Enable exceptions and omit all subsequent error checks
	JCudaDriver.setExceptionsEnabled(true);

	// Create the PTX file by calling the NVCC
	String ptxFileName = preparePtxFile("path/to/file/cuda/JCudaIsPrimeKernel.cu");

	// Initialize the driver and create a context for the first device.
	cuInit(0);
	CUdevice device = new CUdevice();
	cuDeviceGet(device, 0);
	CUcontext context = new CUcontext();
	cuCtxCreate(context, 0, device);

	// Load the ptx file.
	CUmodule module = new CUmodule();
	cuModuleLoad(module, ptxFileName);

	// Obtain a function pointer to the "is_prime" function.
	// If this errors, look for the ptx file and delete, for the compiler to
	// regen.
	CUfunction function = new CUfunction();
	cuModuleGetFunction(function, module, "IsPrime");

	// Allocate and fill the host input data
	int hostInputVals[] = new int[numElements];
	for (int i = 0; i < numElements; i++)
		hostInputVals[i] = (int) i + startVal;
	System.out.println("First = " + hostInputVals[0] + ", Last = " + hostInputVals[hostInputVals.length - 1]
			+ " len = " + hostInputVals.length);

	// Allocate the device input data, and copy the
	// host input data to the device
	CUdeviceptr deviceInOutVals = new CUdeviceptr();
	cuMemAlloc(deviceInOutVals, (numElements) * (Sizeof.INT));
	cuMemcpyHtoD(deviceInOutVals, Pointer.to(hostInputVals), (numElements) * Sizeof.INT);

	System.out.println("Host to device copy complete!");
	// Set up the kernel parameters: A pointer to an array
	// of pointers which point to the actual values.
	Pointer kernelParameters = Pointer.to(Pointer.to(deviceInOutVals), Pointer.to(new int[]{numElements}));
	
	System.out.println(" Kernel params set!");
	// Call the kernel function.
	int blockSizeX = 1024;
	int gridSizeX = (int) Math.ceil((double) numElements / blockSizeX);

	System.out.println(" Grid Size & Block Size = "+gridSizeX+", "+blockSizeX);
	cuLaunchKernel(function, 
			gridSizeX, 1, 1, // Grid dimension
			blockSizeX, 1, 1, // Block dimension
			0, null, // Shared memory size and stream
			kernelParameters, null // Kernel- and extra parameters
	);
	System.out.println(" Context Sync starting:");
	res = cuCtxSynchronize();


	// Allocate host output memory and copy the device output
	// to the host.
	int hostOutput[] = new int[numElements];
	cuMemcpyDtoH(Pointer.to(hostOutput), deviceInOutVals, (numElements) * Sizeof.INT);
	System.out.println("Device to Host Copy complete!");

	List<Integer> primeVals = new ArrayList<Integer>();
	// Verify the result
	for (int i = 2; i < numElements; i++) {
		if (hostOutput[i] > 0)
			// System.out.println("At index " + i + " found " +
			// hostOutput[i]);
			primeVals.add(hostOutput[i]);
	}

	// Clean up.
	res = cuMemFree(deviceInOutVals);
	context = null;
		
	return primeVals;
}

path/to/file/cuda/JCudaIsPrimeKernel.cu is:
include <stdio.h>

extern “C”
global void IsPrime(int *val, int size)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
//printf(“With IDX:[%d], processing value %d”, idx, val[idx]);

if(idx<0)
	printf("ERROR: IDX < 0: %d\n",idx);

if(idx<size){

	if(val[idx]<0)
		printf("ERROR: val[%d] < 0: %d\n",val[idx]);

	int divisor = 2;
	if(val[idx] >= 2) {
		while(divisor < val[idx]) {
			if(val[idx] % divisor == 0) {
				val[idx] = 0;
				return;
			}
			divisor++;
		}
	} else {
		val[idx] = 0;
	}
	__syncthreads();
}

}

VS 2015 CUDA project
VS properties for CUDA project:
Code generation: compute_20, sm_20

Command line:

Driver API (NVCC Compilation Type is .cubin, .gpu, or .ptx)

set CUDAFE_FLAGS=–sdk_dir "C:\Program Files (x86)\Windows Kits\8.1"
“C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\bin\nvcc.exe” --use-local-env --cl-version 2015 -ccbin “C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\bin” -G --keep-dir Debug -maxrregcount=0 --machine 32 --compile -cudart static -o Debug%(Filename)%(Extension).obj “%(FullPath)”

Runtime API (NVCC Compilation Type is hybrid object or .c file)

set CUDAFE_FLAGS=–sdk_dir "C:\Program Files (x86)\Windows Kits\8.1"
“C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\bin\nvcc.exe” --use-local-env --cl-version 2015 -ccbin “C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\bin” -G --keep-dir Debug -maxrregcount=0 --machine 32 --compile -cudart static -g -Xcompiler "/EHsc /nologo /FS /Zi " -o Debug%(Filename)%(Extension).obj “%(FullPath)”

C CUDA (.cu) file runs without issue.

include “cuda_runtime.h”
include “device_launch_parameters.h”

include <math.h>
include <stdio.h>
include <time.h>

cudaError_t getWithCuda(int *v, unsigned int size);

global
void checkKernel(int *val, unsigned int size)
{
const unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
bool showOutput = false;

if(idx < size) 
{
	unsigned int denominator = 2;
	if (val[idx] >= 2) {
		while (denominator < (int)ceil(sqrt((float)val[idx]))) {// val[idx]) {//
			if (val[idx] % denominator == 0) {
				val[idx] = 0;
				return;
			}
			denominator++;
		}
	}
	else {
		val[idx] = 0;
	}
}

}

int main()
{
clock_t begin = clock();
const int maxVal = 1000000; // 1048576;
const int divisor = 5; // 32
const int arraySize = maxVal / divisor;
int v[arraySize];// = //{ 11, 12, 13, 14, 15, 16 };
int total = 0;

for (int i = 0; i < arraySize; i++) {
	v[i] = i;
}

for (int j = 0; j < divisor; j++) { 
	int offset = j * arraySize;
	if (NULL == v)
		printf("Error: Allocation failed ");
	else
		printf("[SECTION %d]: ArraySize = %d, Max = %d, sections = %d\n", j, arraySize, maxVal, divisor);
	for (int i = 0; i < arraySize; i++)
		v[i] = i + offset;

	printf("  BEFORE: {%d,%d,%d,%d,%d,...,%d}[%d]\n",
		v[0], v[1], v[2], v[3], v[4], v[arraySize-1], arraySize);

	// Calc primes in parallel.
	cudaError_t cudaStatus = getWithCuda(v, arraySize);
	if (cudaStatus != cudaSuccess) {
		fprintf(stderr, " call to getWithCuda failed! ");
		return 1;
	}

	printf("  AFTER : {%d,%d,%d,%d,%d,...,%d}[%d]\n",
		v[0], v[1], v[2], v[3], v[4], v[arraySize-1], arraySize);

	int subtotal = 0;
	for (int i = 0; i < arraySize; i++)
		if (v[i] != 0)
			subtotal++;
	total += subtotal;
	printf("  Total: %d, subtotal: %d\n", total, subtotal);
}
clock_t end = clock();
double time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
printf("Total of %d primes found in range (%d<=x<%d) [Duration: %d secs] \n", total, 0, maxVal, (unsigned int)time_spent);

// cudaDeviceReset must be called before exiting in order for profiling and
// tracing tools such as Nsight and Visual Profiler to show complete traces.
cudaError_t 
	cudaStatus = cudaDeviceReset();
if (cudaStatus != cudaSuccess) {
	fprintf(stderr, "cudaDeviceReset failed! ");
	return 1;
}

printf("Press enter a value when ready: ");
int i;
scanf("%d", &i);
//getchar();

return 0;

}

// Helper function for using CUDA to add vectors in parallel.
cudaError_t getWithCuda(int *host_vals, unsigned int arraySize)
{
int *dev_vals = 0;
// calc through 0 and up to and including the last number equal to the arraysize value.
int memSize = (arraySize) * sizeof(int);
cudaError_t cudaStatus;

// Choose which GPU to run on, change this on a multi-GPU system.
cudaStatus = cudaSetDevice(0);
if (cudaStatus != cudaSuccess) {
	fprintf(stderr, "cudaSetDevice failed!  Do you have a CUDA-capable GPU installed?");
	goto ExitPoint;
}

// Set device memory size
cudaStatus = cudaMalloc((void**)&dev_vals, memSize);
if (cudaStatus != cudaSuccess) {
	fprintf(stderr, "cudaMalloc failed! ");
	goto ExitPoint;
}

// Copy input vectors from host memory to GPU buffers.
cudaStatus = cudaMemcpy(dev_vals, host_vals, memSize, cudaMemcpyHostToDevice);
if (cudaStatus != cudaSuccess) {
	fprintf(stderr, "cudaMemcpy failed! ");
	goto ExitPoint;
}

// Calc block and threads
int dim = 1; // = ~1024/1024
if (arraySize > 1024)
	dim = arraySize / 1024 + 1;

dim3 blocks(dim, 1, 1);
dim3 threads(1024, 1, 1);

// Launch a kernel on the GPU with one thread for each element.
printf("    <<<blocks=%d,threads=%d>>>\n", blocks.x, threads.x);
checkKernel <<<blocks, 1024>>>(dev_vals, arraySize);

// Check for any errors launching the kernel
cudaStatus = cudaGetLastError();
if (cudaStatus != cudaSuccess) {
	fprintf(stderr, "kernel launch failed: %s\n ", cudaGetErrorString(cudaStatus));
	goto ExitPoint;
}

// cudaDeviceSynchronize waits for the kernel to finish, and returns
// any errors encountered during the launch.
cudaStatus = cudaDeviceSynchronize();
if (cudaStatus != cudaSuccess) {
	fprintf(stderr, "cudaDeviceSynchronize returned ExitPoint code %d after launching kernel!\n ", cudaStatus);
	goto ExitPoint;
}

// Copy output vector from GPU buffer to host memory.
cudaStatus = cudaMemcpy(host_vals, dev_vals, memSize, cudaMemcpyDeviceToHost);
if (cudaStatus != cudaSuccess) {
	fprintf(stderr, "cudaMemcpy failed! ");
	goto ExitPoint;
}

ExitPoint:
cudaFree(dev_vals);

return cudaStatus;

}