32-bit nvcc generates faster GPU code than the 64-bit variant in CUDA 2.1

I’ve noticed that the 32-bit toolkit generates slightly better code than its bigger 64-bit brother. Mainly, the shared memory footprint is smaller. Sometimes the register usage is smaller by one or two registers, in which case the increased occupancy brings noticeable performance gains (anywhere between 10 and 15%).

On one particular kernel, the two compilers produce the following outputs (taken from the .cubin file):

32-bit:
lmem = 12
smem = 2096
reg = 20
bar = 1

64-bit:
lmem = 12
smem = 2112
reg = 21
bar = 1

That’s 16 extra bytes of shared memory, and one extra register for doing the exact same thing…

I’ve posted below the kernel that produces these results.

After an intense struggle comparing the PTX files from the two versions, I found that (correct me if I’m wrong) in the 64-bit code, pointers to device memory are treated as 64 bits wide, which would explain the larger shared memory usage (kernel arguments live in shared memory on these devices, so three pointer arguments growing from 4 to 8 bytes would account for roughly the 16-byte difference) and maybe the extra register pressure. Of course, that is complete and utter bulls*it, considering that the G80, G84, G86, G92, G200, and all the others that I’ve missed use a 32-bit memory space.

Here’s the kernel (the template parameter, LineSteps, is <16>):

[codebox]
// This union allows a kernel to use the same shared memory for three different arrays
// This is possible because the kernel will only use one array at a time
union kernelData
{
	// The indexing of this array is deliberately reversed to [y][x] in order to produce less indexing
	// overhead when it is read from [y][0] to [y][BLOCK_X_MT - 1]
	pointCharge<float> charge[BLOCK_Y_MT][BLOCK_X_MT];
	// A shared array of accumulators is also needed for the final summation step
	Vector3<float> smTemp[BLOCK_X_MT][BLOCK_Y_MT];
	// A shared array of points allows the starting point to be read and computed once per column (ty==0)
	Vector3<float> smPoint[BLOCK_X_MT];
};

template<unsigned int LineSteps>
__global__ void CalcField_MTkernel(float2* xyInterleaved, float* z, pointCharge<float> *Charges,
							unsigned int n, unsigned int p, unsigned int fieldIndex, float resolution)
{
	unsigned int tx = threadIdx.x;
	unsigned int ty = threadIdx.y;
	unsigned int ti = blockDim.x * blockIdx.x + tx;
	// Using a union between all needed data types allows massive smem economy
	__shared__ kernelData kData;
	// Previous point (used to calculate the current point) and cumulative field vector
	Vector3<float> point, temp;
	float2 ptXY;

	if(!ty)
	{
		// Load starting point
		// The field vectors are arranged as a structure of arrays in order to enable coalesced reads
		// The x and y coordinates are interleaved in one array, producing coalesced 64-byte reads,
		// and the z coordinates are placed in a separate array, producing coalesced 32-byte reads
		ptXY = xyInterleaved[n * (fieldIndex - 1) + ti];
		// Once the xy coordinates are read, place them in the appropriate variable
		point.x = ptXY.x;
		point.y = ptXY.y;
		// Now read the z coordinate
		point.z = z[n * (fieldIndex - 1) + ti];
		// Place the point in shared memory for other threads to access
		kData.smPoint[tx] = point;
	}

	for(unsigned int bigStep = 0; bigStep < LineSteps; bigStep++)
	{
		// Number of iterations of the main loop
		// Recalculating the number of steps here allows a while loop to be used rather than a for loop
		// This reduces the register usage by one register, allowing a higher warp occupancy
		unsigned int steps = (p + BLOCK_DIM_MT - 1) / BLOCK_DIM_MT;
		// Reset the cumulative field vector
		temp.x = temp.y = temp.z = 0;
		// All starting points need to be loaded to smem; otherwise threads may read back the wrong point
		__syncthreads();
		// Load the starting point
		point = kData.smPoint[tx];
		// Equivalent to a for loop counting down from steps - 1 to 0, with steps used as the index
		do{
			// It is important to decrement steps independently, and outside the while condition, for the register gain to happen
			steps--;
			// Load point charges from global memory
			// The unused charges must be padded until the next multiple of BLOCK_X
			kData.charge[ty][tx] = Charges[steps * BLOCK_DIM_MT + ty * BLOCK_X_MT + tx];
			// Wait for all loads to complete
			__syncthreads();
			// Unrolling the following loop completely saves one register compared to a partial unroll
			// While performance-wise there is no benefit in a complete unroll, the saved register enables
			// a higher warp occupancy
			#pragma unroll
			for(unsigned int i = 0; i < BLOCK_X_MT; i++)
			{
				temp += electroPartField(kData.charge[ty][i], point);	// ElectroPartFieldFLOP + 3 FLOPs
			}
			__syncthreads();
		}while(steps);
		// Now that each partial field vector is computed, it can be written to shared memory
		kData.smTemp[tx][ty] = temp;
		// Before summing up all partials, the stores must complete
		__syncthreads();
		// The next section sums the partial vectors and writes the result
		// This is done by threads with a y index of 0
		if(!ty)
		{
			// The first partial sum is already in registers, so it is only necessary to add the remaining components
			#pragma unroll
			for(unsigned int i = 1; i < BLOCK_Y_MT; i++)
			{
				temp += kData.smTemp[tx][i];
			}
			// Finally, add the unit vector of the field divided by the resolution to the previous point to get the next point
			point += vec3SetInvLen(temp, resolution);	// 13 FLOPs (10 set len + 3 add)
			// The results must be written back as interleaved xy and separate z coordinates
			ptXY.x = point.x;
			ptXY.y = point.y;
			xyInterleaved[n * fieldIndex + ti] = ptXY;
			z[n * fieldIndex + ti] = point.z;
			kData.smPoint[tx] = point;
			fieldIndex++;
		}
	}
}
[/codebox]
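For reference, a launch of this kernel looks roughly like the sketch below. The block dimensions, grid size, and wrapper function are placeholders for illustration, not the actual values used here:

[codebox]
// Illustrative launch only: BLOCK_X_MT, BLOCK_Y_MT and the grid size are placeholder
// values; they just show how the arguments and the <16> template parameter fit together.
#define BLOCK_X_MT 16
#define BLOCK_Y_MT 8
#define BLOCK_DIM_MT (BLOCK_X_MT * BLOCK_Y_MT)

void advanceFieldLines(float2 *d_xy, float *d_z, pointCharge<float> *d_charges,
						unsigned int n, unsigned int p, unsigned int fieldIndex, float resolution)
{
	// One thread column per field line; n is assumed to be a multiple of BLOCK_X_MT
	dim3 block(BLOCK_X_MT, BLOCK_Y_MT);
	dim3 grid(n / BLOCK_X_MT);
	// LineSteps = 16, matching the <16> template parameter mentioned above
	CalcField_MTkernel<16><<<grid, block>>>(d_xy, d_z, d_charges, n, p, fieldIndex, resolution);
}
[/codebox]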

Yeah, that definitely sucks. There is a reason for the 64-bit device pointers on 64-bit hosts, though: if you start moving structs of pointers between the host and the device (either through cudaMemcpy() or kernel parameters), things are going to get very confusing very quickly if the host and device don’t agree on the size of a pointer.
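To make the failure mode concrete, here’s a minimal sketch (the struct and kernel names are made up for illustration): if the host lays this struct out with 8-byte pointers while the device assumes 4-byte ones, every member after the pointer lands at a different offset on each side, and neither cudaMemcpy() nor the kernel launch will notice.

[codebox]
// Hypothetical example: a struct embedding a device pointer, passed by value to a kernel.
// Host and device must agree on sizeof(data); otherwise 'count' sits at a different
// offset in each compiler's view of the struct and both members are read as garbage.
struct DeviceList
{
	float *data;          // device pointer embedded in the struct
	unsigned int count;   // its offset depends on sizeof(data)
};

__global__ void scaleList(DeviceList list)
{
	unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;
	// Uses the member offsets computed by the *device* compiler
	if (i < list.count)
		list.data[i] *= 2.0f;
}
[/codebox]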

This does bring up an interesting problem, though: the largest CUDA device has 4 GB of memory. The next Tesla generation could easily have 8 GB of memory, at which point device pointers will have to be 64-bit regardless of the host pointer size. Will this mean that such cards require a 64-bit OS to avoid compiler issues?

Unless they drop subword addressing support, i.e., make a char and an int the same size.

That would give you room for 16GB of address space instead of 4GB.

Yes, an ugly hack, but it’d give some nice breathing room.

This isn’t really feasible for a CPU, but it’s not completely unworkable for CUDA where you don’t deal with chars and shorts very often.

OpenCL already has flags for something like this, basically telling you when subword addressing is possible.

Such a change could break existing code, but only if you use chars or shorts.

There would also be a lot of tech support hassle with programmer confusion… basically the rule would be “don’t use chars, since the host and device have different sizes.”

I believe you already need a 64-bit OS. 4 GB of device memory would mean you could have no memory on your host anymore; as far as I understand it, the total amount of memory (host + device(s)) determines the need for a 64-bit OS.

No, I don’t think so. The device memory is a separate memory space from the CPU memory anyway. It’s clearly stated that you cannot address the device memory from the host or vice versa.

The Tesla system requirements say a 32-bit OS is acceptable but a 64-bit OS is recommended, so maybe it has to do with this.

Frankly, if you’re buying a hypothetical future Tesla with > 4GB of device memory, I don’t think it is too much to require a 64-bit OS. :) People in that market are likely to have already upgraded to 64-bit anyway.

There’s nothing impossible about using a 64-bit device with more than 4 GB of RAM on a 32-bit OS. If the host is smart enough, it can use tricks like Physical Address Extension, or quad-word pointers when referencing device memory. That way the host could copy to any region of device memory. CPU-assisted, DMA, synchronous, and asynchronous memory copies are all possible with this scheme, and the device would be able to use all of its memory.

Of course there are some limitations, such as only being able to copy 2 GB at once, and a few minor ones. These limitations may call the feasibility of such an implementation into question, but not its possibility. It is possible, and the smart team of programmers at nVidia could do it.

Now, for the original concern about running a 32-bit device on a 64-bit host: I see no real need to make device pointers on the host 64 bits wide; however, it does make for a more streamlined programming environment, even though the upper half will always be zero. That’s for host code.
Device code, on the other hand, can only access and see device memory. Thus, on a 32-bit device, there is nothing wrong with employing a 32-bit scheme, considering the overhead a larger scheme would bring without increasing the amount of addressable memory. Doing this for “compatibility” reasons with the host is purely idiotic on a massively parallel architecture where every register counts.
It’s very simple to keep device pointers residing on the host quad-wide, and to truncate them to double-wide when sending them to the device, or have the device truncate them upon receipt. However, treating pointers as quad-wide in what we CUDA programmers call kernels has no justification. I wouldn’t care about it if an MP had eight million registers instead of eight thousand, or if this addressing mode didn’t bloat kernels.

The solution is quite simple, much simpler than in the case of running a 64-bit device on an obsolete OS, and any smart team of programmers could seamlessly implement it.

Translation when sending the pointers to the device is unworkable in general, since they can potentially be embedded in blocks of memory that are transferred with cudaMemcpy(). (This is why I mentioned structs as a potential problem.) However, the compiler could generate PTX that treated pointers stored in global memory as being 64 bits wide for alignment purposes, but only ever copied the lower 32 bits to a register. That would solve the performance problem.
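For what it’s worth, that scheme can be approximated by hand today. A rough sketch (the struct and helper below are invented for illustration): keep the pointer slot a fixed 64 bits wide in memory so host and device agree on the struct layout, and let the device use only the lower 32 bits when it actually dereferences:

[codebox]
// Hand-rolled approximation of the idea above: the pointer is stored in an 8-byte slot,
// so the struct layout is identical for 32-bit and 64-bit hosts, while the device only
// ever uses the lower 32 bits (all that today's 32-bit devices can address anyway).
struct PaddedPtr
{
	unsigned long long bits;  // always 8 bytes, regardless of host pointer size
};

__device__ float loadAt(PaddedPtr p, unsigned int i)
{
	// Only the low 32 bits of the slot are meaningful on current devices
	float *ptr = (float*)(size_t)(p.bits & 0xFFFFFFFFull);
	return ptr[i];
}
[/codebox]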

Such code could fail on future > 4GB CUDA devices, regardless of host OS, so this would have to be a compiler option. nvcc --small-pointers anyone?

And I would absolutely love to see that. It’s very frustrating to have to choose between a 10% performance gain on a GPU kernel vs. a 100% performance gain on a CPU kernel (32-bit vs. 64-bit modes respectively).

In the exact same way that 32-bit CUDA code can and will fail. If 32-bit code is to be able to run on the new devices, then a workaround is still needed.

One solution would be for the >4 GB devices to have a 32-bit mode, much like x64 CPUs running 32-bit binaries. Kernels written for older devices would not use more than 4 GB anyway. A compiler option such as -arch sm_14 would give the kernel the full 64-bit pointer treatment. Since a workaround is needed either way, we might as well have our 64-bit kernels optimized for the older architectures when we’re not using the brand new shiny chips.

To me, this seems the most logical resolution.

EDIT - By 32-bit CUDA code I mean code compiled with the 32-bit toolkit.

It seems that using ‘-maxrregcount 18’ produces code with the same amount of local memory as the 21-register version, and with the performance characteristics of the 32-bit variant.
Interesting…