CUDA 5.0 - cudaGetDeviceProperties returns very strange grid size

PiotrJustyna · June 5, 2013, 7:55pm

Hi guys,

I’m just starting to learn CUDA and was really surprised that, contrary to what I found on the Internet, my card (GeForce GTX 660M) supports some insane grid sizes (2147483647 x 65535 x 65535). Please take a look at the following results I’m getting from deviceQuery.exe provided with the toolkit:

c:\ProgramData\NVIDIA Corporation\CUDA Samples\v5.0\bin\win64\Release>deviceQuery.exe
deviceQuery.exe Starting…

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: “GeForce GTX 660M”
CUDA Driver Version / Runtime Version 5.5 / 5.0
CUDA Capability Major/Minor version number: 3.0
Total amount of global memory: 2048 MBytes (2147287040 bytes)
( 2) Multiprocessors x (192) CUDA Cores/MP: 384 CUDA Cores
GPU Clock rate: 950 MHz (0.95 GHz)
Memory Clock rate: 2500 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 262144 bytes
Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65536), 3D=(4096,4096,4096)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 2147483647 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
CUDA Device Driver Mode (TCC or WDDM): WDDM (Windows Display Driver Model)
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 5.5, CUDA Runtime Version = 5.0, NumDevs = 1, Device0 = GeForce GTX 660M

I was curious enough to write a simple program testing if it’s possible to use more than 65535 blocks in the first dimension of the grid, but it crashes confirming what I found on the Internet (or, to be more precise, works fine for 65535 blocks and crashes for 65536).

So my question is: is cudaGetDeviceProperties returning rubbish values or am I doing something wrong?

Thanks!
Piotr

allanmac · June 5, 2013, 8:24pm

I think you’re doing something wrong. I’ve successfully tested grids on sm_3x devices with grid.x > 65535.

Perhaps your kernel is getting killed by the Windows driver’s watchdog timer because it’s running for more than a few seconds?

Or perhaps your kernel is indexing based on the large grid values and generating a memory exception?

PiotrJustyna · June 5, 2013, 8:50pm

Thanks a lot for your quick reply.

I must have gotten something wrong then. My program is extremely simple and basically just adds two vectors. It definitely doesn’t take even a second to run (and that’s including all cudaMallocs and cudaMemcpys). Am I missing something obvious here? Please find my source below:

#include <cuda.h>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>
#include <stdio.h>
#include <math.h>

#pragma comment(lib, "cudart") 

typedef struct 
{
	float *content;
	const unsigned int size;
} pjVector_t;

__global__ void AddVectorsKernel(float *firstVector, float *secondVector, float *resultVector)
{
	unsigned int index = threadIdx.x + blockIdx.x * blockDim.x;
	resultVector[index] = firstVector[index] + secondVector[index];
}

int main(void)
{
	const unsigned int vectorLength = 67107840; // 1024 * 65535 - works fine
	//const unsigned int vectorLength = 67108864; // 1024 * 65536 - crashes
	const unsigned int vectorSize = sizeof(float) * vectorLength;
	int threads = 0;
	unsigned int blocks = 0;
	cudaDeviceProp deviceProperties;

	cudaGetDeviceProperties(&deviceProperties, 0);

	threads = deviceProperties.maxThreadsPerBlock;
	blocks = (unsigned int)ceil(vectorLength / (double)threads);

	pjVector_t firstVector = { (float *)calloc(vectorLength, sizeof(float)), vectorLength };
	pjVector_t secondVector = { (float *)calloc(vectorLength, sizeof(float)), vectorLength };
	pjVector_t resultVector = { (float *)calloc(vectorLength, sizeof(float)), vectorLength };

	float *d_firstVector;
	float *d_secondVector;
	float *d_resultVector;

	cudaMalloc((void **)&d_firstVector, vectorSize);
	cudaMalloc((void **)&d_secondVector, vectorSize);
	cudaMalloc((void **)&d_resultVector, vectorSize);

	for (unsigned int i = 0; i < vectorLength; i++)
	{
		firstVector.content[i] = 1.0f;
		secondVector.content[i] = 2.0f;
	}

	cudaMemcpy(d_firstVector, firstVector.content, vectorSize, cudaMemcpyHostToDevice);
	cudaMemcpy(d_secondVector, secondVector.content, vectorSize, cudaMemcpyHostToDevice);
	AddVectorsKernel<<<blocks, threads>>>(d_firstVector, d_secondVector, d_resultVector);
	cudaMemcpy(resultVector.content, d_resultVector, vectorSize, cudaMemcpyDeviceToHost);

	for (unsigned int i = 0; i < vectorLength; i++)
	{
		if(resultVector.content[i] != 3.0f)
		{
			free(firstVector.content);
			free(secondVector.content);
			free(resultVector.content);

			cudaFree(d_firstVector);
			cudaFree(d_secondVector);
			cudaFree(d_resultVector);
			cudaDeviceReset();

			printf("Error under index: %i\n", i);

			return 1;
		}
	}

	free(firstVector.content);
	free(secondVector.content);
	free(resultVector.content);

	cudaFree(d_firstVector);
	cudaFree(d_secondVector);
	cudaFree(d_resultVector);
	cudaDeviceReset();

	printf("Everything ok!\n");

	return 0;
}

Thanks,
Piotr

njuffa · June 5, 2013, 10:07pm

How does the app “crash”? What happens when you add error status checks to every CUDA API call and every kernel launch?

allanmac · June 5, 2013, 10:26pm

The code runs on a K20c and a GTX 680 (4GB) with both vectorLength sizes. I looked at the code and couldn’t spot any obvious problem.

As @njuffa suggests, all those error codes are worth checking.

You’re allocating a fair amount of memory (at least 786,420 MB) but your 660M should have most of 2GB free.

You might also want to check you don’t have an zombie instance of your app running in the background (I’ve had that happen before).

PiotrJustyna · June 5, 2013, 11:02pm

Thanks for your replies.

You’re right @njuffa, I should have clarified what it means “crashes”. When I run it from Visual Studio in debug mode (67108864 vector), the last cudaMemcpy always fills my resultVector with seemingly random data (very close to 0 if it matters) so that the result doesn’t pass the final validation. Where it actually seems like it crashes is in the profiler, which returns following error message:

2 events, 0 metrics and 0 source-level metrics were not associated with the kernels and will not be displayed

As a result, profiler measures only cudaMalloc and cudaMemcpy operations and doesn’t even show the kernel execution.

As per error status checking (and I’m not sure I’m doing it right - apologies if so), cudaPeekAtLastError function returns cudaErrorInvalidValue(11) error. All other operations (cudaMalloc and cudaMemcpy) return cudaSuccess(0).

I hope I shed some more light on my problem but please let me know if you have any further questions.

Thanks,
Piotr

PiotrJustyna · June 5, 2013, 11:29pm

@allanmac

Thanks for the zombie instance hint - I checked and unfortunately there is none. Also, thanks for running my code on your hardware, much appreciated.

Piotr

njuffa · June 6, 2013, 12:47am

From the symptoms you describe, it seems the kernel in question either never executed, or died, and thus the final device->host copy copies back garbage, and the profiler has no record for this kernel since it never ran to completion.

In conjunction with allanmac’s observations it seems now would be a good time to add 100% error check coverage to this code. The resulting error message(s) should let you pinpoint what the issue is.

PiotrJustyna · June 6, 2013, 12:46pm

I finally fixed it with a lot of help from stackoverflow (http://stackoverflow.com/questions/16954931/cuda-5-0-cudagetdeviceproperties-strange-grid-size-or-a-bug-in-my-code). As it turns out the code itself was ok, the only problem was the code generation flags I didn’t change. Visual Studio quite confusingly sets them initially to:

compute_10,sm_10

while for my card, more appropriate option would be

compute_30,sm_30

Thanks for all your help,
Piotr

PiotrJustyna · June 6, 2013, 12:48pm

What’s actually interesting, since I’ve been using compute profile 1.0, how was it possible to create 1024 threads per block? Have I missed something?

allanmac · June 6, 2013, 4:08pm

Interesting answer on StackOverflow!

I had always (wrongly) assumed that the special registers %tid, %ntid, %ctaid and %nctaid would entirely isolate kernels from architectural CTA-size differences.

Looking at the PTX manual reveals that “legacy code” will use a “mov.u16” to access the low 16 bits of these special registers.

Compiling for sm_10 and dumping the PTX (or SASS) verifies that 16-bit mov’s are used:

$> nvcc -m 32 -arch sm_10 -ptx addvecs.cu
...
$LDWbegin_AddVectorsKernel:
	mov.u16 	%rh1, %ctaid.x;
	mov.u16 	%rh2, %ntid.x;
	mul.wide.u16 	%r1, %rh1, %rh2;
	cvt.u32.u16 	%r2, %tid.x;
	add.u32 	%r3, %r2, %r1;
	mul.lo.u32 	%r4, %r3, 4;

1024 thread blocks continued to work because “mov.u16 %rh1,%ntid.x;” easily fits into 16 bits. The failure is in the legacy 16-bit move of “%ctaid.x”.

The descriptions for %ntid and %ctaid are at the very end of the PTX manual.

Topic		Replies	Views
CUDA very slow performance CUDA Programming and Performance	21	16424	March 6, 2020
Can a Kernel be too big?? CUDA_ERROR_NO_BINARY_FOR_GPU error 209 CUDA Programming and Performance	11	2975	November 13, 2017
gpu computing sdk 4.0 runtime failures build the sdk succesfully, but the run of any exe failed CUDA Programming and Performance	3	2793	August 8, 2011
GTX Titan Black Memory Issues CUDA Programming and Performance	8	1635	December 14, 2016
why "all CUDA-capable devices are busy or unavailable" ? CUDA Programming and Performance	34	64103	April 20, 2011
CUDA error, bandwithTest.exe CUDA Setup and Installation	12	2467	January 21, 2019
GTX295 Specefications & CUDA CUDA Programming and Performance	5	12269	October 7, 2010
code that crashes unpredictably CUDA Programming and Performance	15	12635	April 28, 2010
CUDA 2.1 Beta Problem/Bugs (Linux) CUDA Programming and Performance	5	1645	January 6, 2009
CUDA Address out of bounds error - help! CUDA Programming and Performance	0	826	August 8, 2019

CUDA 5.0 - cudaGetDeviceProperties returns very strange grid size

Related topics