Kernel hangs when accessing array on Jetson TK1 (but works fine on Fermi)

As the title states, I’m having an issue where the kernel seems to hang for a few seconds when I access an array under certain conditions. This problem only seems to occur on my Jetson TK1 board with CUDA 6.5. I’ve tried the same code on both the Fermi and Kepler architectures, and it works fine there.

Here’s the problem code:


#include <cstdio>

using namespace std;

__global__ void kernel(float* mask, int cols, int rows) {
	int item=0;
	float c;
	int limi = 2500;
	int limk = 102;
	for(int i = 0; i < limi; i++) {
		for(int k = 0; k < limk; k++) {
			c = mask[i];
			item = item--;
			if(c == 0)
				item = i-1;
		}
	}
	item = item;
}

void kernel_caller(float* data_h, int rows, int cols) {
	float* data_d;
	cudaMalloc((void**)&data_d, cols*rows*sizeof(float));
	cudaMemcpy(data_d, data_h, cols*rows*sizeof(float), cudaMemcpyHostToDevice);
	dim3 gridDim((cols*rows)/THREADS_PER_BLOCK + 1, 1, 1);
	dim3 blockDim(THREADS_PER_BLOCK, 1, 1);
	printf("Running Kernel...\n");
	kernel<<<gridDim, blockDim>>>(data_d, cols, rows);
	printf("Kernel complete!\n");
}

int main() {
	int rows = 2800;
	int cols = 5000;
	float* data = new float[rows*cols];
	for(int i = 0; i < rows*cols; i++)
		data[i] = 0;

	kernel_caller(data, rows, cols);
	return 0;
}

Yes, I know this code does nothing useful. It’s simply a simplified form of the code I’m working on, and it still has the same issue. For some reason, when limk becomes sufficiently large, the kernel hangs for 4-18 seconds on execution. Has anyone encountered a similar issue? Or does anyone see anything wrong with my code?

Don’t you think that serial code should take longer to execute as you make limk larger?

That would normally be the case, yes. Here, however, the runtime jumps from ~1 second to between 4 and 18 seconds when I change limk from 101 to 102.

Your kernel has no “output”, so most of the computation might be optimized away by the compiler. Check the SASS: when you change limk from 101 to 102, the emitted code could be drastically different.
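One way to test this idea is to give the kernel a visible side effect, so the compiler has to keep the loops for every value of limk. The following is only a minimal sketch: the `out` buffer, the names `kernel_with_output` and `run_once`, and the small launch size are assumptions for illustration, not part of the original code. The statement `item = item--` is written as `item--` here to avoid undefined behaviour.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical variant of the kernel from the question. The `out` buffer is an
// assumption added here: storing the result gives the loops an observable
// side effect, so the compiler cannot discard them as dead code.
__global__ void kernel_with_output(const float* mask, int* out, int limi, int limk) {
    int item = 0;
    for (int i = 0; i < limi; i++) {
        for (int k = 0; k < limk; k++) {
            float c = mask[i];
            item--;
            if (c == 0.0f)
                item = i - 1;
        }
    }
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = item;  // one store per thread keeps the computation alive
}

// Runs one small launch on an all-zero mask and returns the value written by thread 0.
int run_once(int limi, int limk) {
    const int threads = 64, blocks = 2;  // small illustrative launch, not the question's grid
    float* mask_d = nullptr;
    int* out_d = nullptr;
    cudaMalloc(&mask_d, limi * sizeof(float));
    cudaMalloc(&out_d, threads * blocks * sizeof(int));
    cudaMemset(mask_d, 0, limi * sizeof(float));  // all zeros, like the question's data

    kernel_with_output<<<blocks, threads>>>(mask_d, out_d, limi, limk);

    int first = 0;
    // A blocking copy on the default stream waits for the kernel to finish.
    cudaMemcpy(&first, out_d, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(mask_d);
    cudaFree(out_d);
    return first;
}
```

With an all-zero mask, every iteration ends with `item = i - 1`, so `run_once(2500, 102)` should return 2498. Building this once with limk = 101 and once with limk = 102 and dumping each binary with `cuobjdump -sass` would show whether the emitted code really differs between the two cases.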

Alright, I’ll check this out in the morning and get back to you.

Janet is almost certainly correct. When limk is small, the inner loop is unrolled and the compiler optimizes everything away. I suspect when the loop is 100 or more, the compiler stops unrolling and that blocks it from noticing there are no side effects or stores in the code, so it leaves actual (but useless) computation in the compiled code.

I’m sure your kernel runtime is not ~1 second in the first case when it gets optimized to an empty kernel. It’s likely microseconds, but you haven’t measured it.
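To actually measure it, CUDA events can time the kernel on the GPU side. Note that a kernel launch is asynchronous, so a printf placed right after the launch runs before the kernel has finished and says nothing about its runtime. The sketch below is an assumption-laden illustration: `empty_kernel` and `time_empty_launch_ms` are hypothetical names, and the empty kernel merely stands in for a kernel that the compiler has reduced to nothing.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for a kernel the compiler has optimized down to nothing.
__global__ void empty_kernel() {}

// Returns the GPU-side time of one launch in milliseconds,
// or -1.0f if a CUDA call fails.
float time_empty_launch_ms(int blocks, int threads) {
    cudaEvent_t start, stop;
    if (cudaEventCreate(&start) != cudaSuccess) return -1.0f;
    if (cudaEventCreate(&stop) != cudaSuccess) {
        cudaEventDestroy(start);
        return -1.0f;
    }

    cudaEventRecord(start);               // recorded on the default stream
    empty_kernel<<<blocks, threads>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);           // wait until the kernel and the stop event complete

    float ms = -1.0f;
    cudaError_t err = cudaGetLastError(); // picks up launch errors, if any
    if (err == cudaSuccess)
        cudaEventElapsedTime(&ms, start, stop);
    else
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```

Calling something like `printf("%f ms\n", time_empty_launch_ms(1, 32));` prints the measured time. Wrapping the real kernel launch in the same pair of events, once with limk = 101 and once with limk = 102, would separate the kernel's own runtime from any delay elsewhere in the program.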