Weirdness at the 2 gigabyte boundary

Hi all! Been working on the Jetson TX1 for a week and have run into some weird behaviour. We wanted to make sure we can malloc two gigs of memory and it’d be no problem. I wrote the memory full of ints on the device and read it back to the host to check everything is OK. If I cudaMalloc exactly two gigabytes (2048 megs) of memory, the last megabyte of the area is full of -1. This doesn’t happen with other amounts of memory malloc’d with a resolution of 1MB:

ubuntu@tegra-ubuntu:~/testruns/TwoGigsCheck$ for i in seq 1 3072

do
~/nsight/TwoGigsCheck/Release/TwoGigsCheck $i
done
WRONG: 262144 with parameter ‘2048’

(Yeah, it takes a few hours to run)

The same artifact has been popping up on many different implementations of the kernel.

Ubuntu, “14.04.4 LTS, Trusty Tahr”.
Built with Nsight Eclipse Edition 7.0, Release version.
Generating PTX/GPU code 5.3
SSH into the Jetson, run from the command line: “path/to/TwoGigsCheck 2048”

#include <iostream>
#include <sys/time.h>

__global__ void add(int *c) {
	unsigned int tid = blockIdx.x*blockDim.x + threadIdx.x;
	c[tid] = tid;
}

#define ONE_MEG 1024*1024

__host__ int main(int argc, char *argv[]) {
	int c[ONE_MEG];
	int *dev_c;
	struct timeval start, end;
	cudaStream_t stream1;
	cudaStreamCreate(&stream1);

	int HOW_MANY = 2;
	if (argc == 2) {
		HOW_MANY = atoi(argv[1]);
	}
	//std::cout << "Trying to grab " << HOW_MANY << " megs of memory and fill it with ints\n";
	const unsigned int TWOGIGS = 1024U*1024*HOW_MANY;//2147483648;
	cudaMalloc((void**)&dev_c, TWOGIGS);

	gettimeofday(&start, NULL);
	add<<<(TWOGIGS/4/64),64,0,stream1>>>(dev_c); //4 bytes per int, 64 streams per block

	cudaStreamSynchronize(stream1);

	gettimeofday(&end, NULL);

	unsigned int i = 0;
	unsigned int count = 0;
	unsigned int wrong = 0;

	while (i < (TWOGIGS/sizeof(int)/(ONE_MEG))) {
    	cudaMemcpyAsync(&c, dev_c+i*ONE_MEG, sizeof(int)*ONE_MEG, cudaMemcpyDeviceToHost, stream1);
    	int j = 0;
    	while (j<1024*1024) {
    		if (c[j] != count) {
    			//std::cout << j << " (" << &c[j] << "): '" << c[j] << "' was not '" << count <<"'\n";
    			wrong++;
    		}
    		j++;
    		count++;
    	}
    	i++;
    }
	if (wrong) std::cout << "WRONG: " << wrong << " with parameter '" << HOW_MANY << "'\n";
    //std::cout << "\nusecs: " << ((end.tv_sec * 1000000 + end.tv_usec) - (start.tv_sec * 1000000 + start.tv_usec)) << "\n";
	cudaFree(dev_c);
	return 0;
}

Anyone care to confirm?

Another attempt, that does the validation on the GPU:

http://pastebin.com/FdrpFJKY

The original author (Hi pisto from #cuda on freenode!) told me it runs with no trouble on a K40. I made small mods: sync on line 88, parameter from argv and plain printing instead of throwing at the end.

I tested different parameters, and it seems the behaviour is the same than with the previous example:

ubuntu@tegra-ubuntu:~/nsight/PistoTest/Release$ for i in seq -261893 -261884; do ./PistoTest i; done -261893 ok -261892 ok -261891 ok -261890 ok -261889 ok -261888 failed -261887 failed -261886 failed -261885 failed -261884 failed ubuntu@tegra-ubuntu:~/nsight/PistoTest/Release for i in seq -4 5; do ./PistoTest $i; done
-4 failed
-3 failed
-2 failed
-1 failed
0 failed
1 ok
2 ok
3 ok
4 ok
5 ok

You have to take into account that the fillmem and validate divide by blocksize and the result is floored, so they don’t fill and validate up to the whole malloc’d memory, so they miss up to 255 locations. This is why -262144 (1024*256) up to -261889 seems to be valid.

I added one byte extra to the cudaMalloc on line 85 as so:
cudaMalloc(&ints, total_ints * 4 + 1) assertcu;

lo and behold,

ubuntu@tegra-ubuntu:~/nsight/PistoTest/Release$ ./PistoTest 0
0 ok

So it seems mallocing from 2GB-1MB up to 2GB does something weird indeed on the Jetson TX1.

Anyone else care to confirm the behaviour? Can I file a bug report somewhere?

Hi halides,

For this CUDA related issue, please file the bug here https://developer.nvidia.com/nvbugs/cuda/add
(you will need to be a registered developer and login first.)

Thanks

Hi halides,

I’ve tested this with new JetPack 2.2/L4T 24.1 using the new 64-bit executable support on TX1. The test completes successfully up to and beyond 2 GB, but there is a new “weirdness” near 3 GB causing the Unity desktop manager to restart. Running the test via ssh shell, however, shows it does complete successfully. This is a stressful test scenario, so I’m not surprised by this behavior with Unity, and would not consider this a bug (it does automatically restart).

Please download JetPack 2.2 from https://developer.nvidia.com/embedded/jetpack to see if this resolves your issue.

Thanks

Hello kayccc, bkeating!

Been swamped with other work - have been running JetPack 2.2 for a few days now and just tested with pisto’s test that indeed the 2GB weirdness is gone.

I have seen similiar behaviour bkeating pictured but that has been regards to running heavy loads in the default stream, which caused trouble with Unity. Running on other than default streams hasn’t been problematic. I’ll run some tests around the 3GB mark as soon as I have time!

Cheers,
Pekka.