GTX750Ti and buffers > 1GB on Win7

I (still) have a problem where (random) global memory lookups in buffers > 1GB have a severe performance impact on the achieved bandwidth. The program below creates a 1GB buffer and fills it with random numbers. The kernel then loads 128 * 128 coalesced bytes from random locations per group of threads. It is a simplified version of a cryptocurrency hashing algorithm (Ethereum).

When the buffer is exactly 1GB, the bandwidth is about 70GB/s, close to peak bandwidth on the GTX750Ti. But when it is increased to e.g. 1152MB, 1280MB or 1536MB, the bandwidth drops drastically (exponentially, I think). On a GTX780, I get about 135GB/s regardless of the buffer size.

I have strong reasons to believe that this doesn’t happen on GTX750Ti/Linux (field reports from my Ethereum miner that has the same behaviour), but I’ll have to double-check that. If anyone wants to try this out on Linux, that would save me a lot of hassle ;).

What causes this? Is it a bug?

#include <stdio.h>
#include <time.h>
#include <stdlib.h>
#include <stdint.h>
#include <limits.h> // for UINT_MAX

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#define GRID_SIZE  8192
#define BLOCK_SIZE 256
#define BUFFER_SIZE (1024 * 1024 * 1024) // <---- change buffer size here!
#define ITERATIONS 16
#define THREADS_PER_HASH 8 // 8 threads * sizeof(uint4) = 128 coalesced bytes per load
#define FNV_PRIME	0x01000193

#define fnv(x,y) ((x) * FNV_PRIME ^(y))
#define random() (rand() * rand()) // <---- RAND_MAX on Win is 32767

#define CUDA_SAFE_CALL(call)                                          \
do {                                                                  \
	cudaError_t err = call;                                           \
	if (cudaSuccess != err) {                                         \
		fprintf(stderr, "Cuda error in func '%s' at line %i : %s.\n", \
		         __FUNCTION__, __LINE__, cudaGetErrorString(err) );   \
		exit(EXIT_FAILURE);                                           \
		}                                                             \
} while (0)

typedef union {
	uint32_t uint32s[128 / sizeof(uint32_t)];
	uint4	 uint4s[128 / sizeof(uint4)];
} hash128_t;

__constant__ hash128_t * d_buffer;
__constant__ unsigned int d_buffer_size;

__device__ uint32_t fnv_reduce(uint4 v)
{
	return fnv(fnv(fnv(v.x, v.y), v.z), v.w);
}

__global__ void test(int search, volatile unsigned int * num_results)
{
	const unsigned int gid = blockIdx.x * blockDim.x + threadIdx.x;

	const int thread_id = threadIdx.x &  (THREADS_PER_HASH - 1);
	const int start_lane = threadIdx.x & ~(THREADS_PER_HASH - 1);

	unsigned int s = gid;
	unsigned int r;

	for (int i = 0; i < THREADS_PER_HASH; i++) {
		for (int j = 0; j < ITERATIONS; j++) {
			unsigned int index = __shfl(s, start_lane + i);
			uint4 v = d_buffer[index % d_buffer_size].uint4s[thread_id];
			s = fnv_reduce(v);
		}
		s = __shfl(s, start_lane + i);
		if (i == thread_id) {
			r = s;
		}
	}

	if (search == r) {
		atomicInc(const_cast<unsigned int *>(num_results), UINT_MAX);
	}
}

int main()
{
	unsigned int * buffer = (unsigned int *)malloc(BUFFER_SIZE);
	printf("Creating buffer of size %u bytes...\n", BUFFER_SIZE);
	for (unsigned int i = 0; i < BUFFER_SIZE / 4; i++) {
		buffer[i] = random();
	}

	hash128_t * h_buffer;
	volatile unsigned int * num_results;
	unsigned int h_buffer_size = BUFFER_SIZE / sizeof(hash128_t);

	CUDA_SAFE_CALL(cudaMallocHost((void**)&num_results, sizeof(unsigned int)));
	CUDA_SAFE_CALL(cudaMalloc((void**)&h_buffer, BUFFER_SIZE));
	CUDA_SAFE_CALL(cudaMemcpy(reinterpret_cast<void*>(h_buffer), buffer, BUFFER_SIZE, cudaMemcpyHostToDevice));
	CUDA_SAFE_CALL(cudaMemcpyToSymbol(d_buffer, &h_buffer, sizeof(hash128_t *)));
	CUDA_SAFE_CALL(cudaMemcpyToSymbol(d_buffer_size, &h_buffer_size, sizeof(unsigned int)));

	num_results[0] = 0;

	unsigned int target;
	target = random();

	cudaEvent_t start, stop;
	CUDA_SAFE_CALL(cudaEventCreate(&start));
	CUDA_SAFE_CALL(cudaEventCreate(&stop));
	cudaEventRecord(start, nullptr);
	int count = 16;
	for (int i = 0; i < count; i++) {
		test<<<GRID_SIZE, BLOCK_SIZE>>>(target, num_results);
	}
	cudaEventRecord(stop, nullptr);
	CUDA_SAFE_CALL(cudaEventSynchronize(stop));

	float duration;
	cudaEventElapsedTime(&duration, start, stop);
	printf("%f GB/s\n", (1000.0f / duration) * count * sizeof(uint4) * THREADS_PER_HASH * ITERATIONS * GRID_SIZE * BLOCK_SIZE / static_cast<float>(1 << 30));
	return num_results[0];
}

i take it that all threads would eventually reach:

if (i == thread_id) {
r = s;

such that

if (search == r)

does not represent a race of sorts

have you checked the size (sizeof()) of hash128_t across buffer sizes and across OSes?

when you change the buffer size, is there a difference in number of hashes found (num_results)?
if so, is it significant?

perhaps use the profiler to note whether there are any significant global memory statistics discrepancies when you change the buffer size
i am sure the profiler would be able to unpack and analyze global memory access a bit more for you, and this might be insightful

I think the profiler is a good suggestion (from the standpoint of understanding). Timing activity on Windows on a WDDM device is a sketchy process, in my opinion. You are executing the kernel repeatedly, but in between executions, Windows could be scheduling any number of unrelated graphics tasks, and these will affect the cudaEvent timing.

If your GTX750 is hosting a display, WDDM can also be doing bizarre things like paging data in and out of the GPU, including CUDA data. As you make the buffer larger, it’s more likely that WDDM will want to demand-page graphics data onto the device right after your kernel, then demand-page CUDA data back onto the device right before the next kernel call. This paging activity is likely to be affected by how much memory a GPU has, so a GTX780 is less likely than a GTX750 to be affected by this.

And of course, all of that could be completely non-existent on linux, depending on how the machines are set up.

In the final analysis, there may not be much you can do. If you want predictable, repeatably high performance for CUDA codes on windows, the best suggestion is to use a GPU operating in TCC mode.

Thanks for your responses. This kernel doesn’t really do anything useful. There may be race conditions, but I’m not really looking for results, I’m only after this bug.

As the GTX780 performs equal in all conditions, there’s no reason to suspect a different size of hash128_t, but I’ll double-check.

The GTX750Ti in this case is run in headless mode, so the full buffer is available and Windows doesn’t use it. The GTX780 is my primary display adapter.

I have also profiled the kernel on both cards and made sure the reported bandwidth by the profiler is near-to-equal to my calculated results.

It doesn’t seem to make sense to analyze the PTX and SASS code, because it would be the same for 1GB and 1.5GB buffers, right? Or does the compiler see that the buffer size is a constant? I’ll make it a cmd line argument then to be 100% sure.

So, when you profiled on the GTX750Ti in the “slow” (large buffer) case, were the kernels essentially back-to-back, or were there significant gaps in between the kernels?

– edit –

i did only single kernel launches when i profiled the memory (int count = 1)

“This kernel doesn’t really do anything useful. There may be race conditions, but I’m not really looking for results, I’m only after this bug.”

this was understood to be the case; however, you do not want the kernel/code to wander off too far on its own
i was curious as to how many times the following code was hit:

if (search == r) {
atomicInc(const_cast<unsigned int *>(num_results), UINT_MAX);

your view seems to be not too many times, because your assumption is that what you feed the kernel is sound
have you checked?

“I have also profiled the kernel on both cards and made sure the reported bandwidth by the profiler is near-to-equal to my calculated results.”

personally, i would be less interested in statistics across devices, and more interested in statistics for the ‘erroneous’ device, with different buffer sizes - when you change the buffer size, do any of the memory statistics change with the drop in bandwidth, or not?
also, bandwidth is only one measure; one can break down and analyze memory access a thousand ways

I did some testing, but a result is never found. I guess it is basically a chance of 1 in UINT_MAX that there is a result.

I’m certain the buffer contains random numbers. If I change the buffer fill line into “buffer[i] = i”, the L2 cache kicks in with a hit rate close to 100%. With random numbers the hit rate is close to 0%, as expected.

Below are the results of a test I did, with MB of buffer on the X-axis and global memory throughput in GB/s on the Y-axis:

The GTX750Ti is a 2GB model (headless), the 780 a 3GB model (primary display), all on stock clocks.

the 780 should have more SMs than the 750, and thus more load/store units

what does the profiler report in terms of global memory (read) coalescence, for the different buffer sizes?
your view is that they generally are coalesced
so, are they?

and what about alignment - number of ideal reads vs number of actual reads… something like that

I’m not surprised that the 780 does better :). I only added it to show it doesn’t seem to be influenced by the buffer size.

But now i’m beginning to think it’s not in the buffer size at all. I did a few more tests with 1.5GB of buffer.

  1. Split the buffer in two equal halves of 768MB. Bad results.
  2. Split the buffer in 1GB + 0.5GB. Bad results.
  3. Kept the single 1.5GB buffer but only looked up values in the lower 1GB. Full speed.
  4. Kept the single 1.5GB buffer but only looked up values in upper 0.5GB. Full speed.
  5. Kept the single 1.5GB buffer but only looked up values in upper 1GB. Full speed.
  6. Kept the single 1.5GB buffer but only looked up values in upper 1.25GB. Bad speed.

An interesting difference between the good and bad results is the size of the right-hand operand of the modulo operator. It looks like whenever I go over 8,388,608 (== 1GB / sizeof(hash128_t)), the performance starts dropping.

the 780 may cache differently than the 750 as well; i do not know maxwell that well

in my mind, alignment goes with coalescence
at the same time, it may be less of an issue when thread blocks access in the vicinity of each other, and more of an issue when they do not
restricting the region actively accessed may be a way of indirectly ensuring thread blocks always access in the vicinity of each other
the profiler should be able to reveal that kind of information with regard to memory access - alignment, caching, overall cache hits, ideal/actual reads, and what not

I can’t find anything regarding coalescence in the profiling reports. Cache efficiency of the L1 and L2 caches is 0% and 1.3% respectively, regardless of buffer size.

The global reads are always 128B coalesced: 8 threads * sizeof(uint4) = 128, only the index where the read starts is random.

Requested global memory load throughput (multi-context)
Global memory load throughput (single-context)
Number of global memory load transactions (single-context)
Average number of global memory load transactions performed for each global memory load (single-context)

there are others - see the profiler guide

surely something must deviate at some point, somewhere, if global memory access degrades

I’ll have a look at those. Meanwhile, I moved the GTX750Ti into a Windows 8 machine (headless, primary on Intel HD4600) and there the deterioration already starts at buffers larger than 512MB. This is in line with the behavior of the full crypto hashing kernel.

I tried running your code with a few minor modifications (changing the random() macro and converting nullptr->0) on linux on a GTX960 (closest device I have to a GTX750) which also has 2GB of memory.

For buffer sizes in the 500MB to 1800MB range, the bandwidth reported was pretty flat around 80-83GB/s. (bandwidthTest reports 82GB/s D->D for this GPU)

If you can confirm that the issue does not show up on your GTX750Ti under linux, then I think it’s a solid lead that it has to do with windows (and some effect windows is having on GPU memory). I think coalescing or a bug in the code is unlikely if the behavior is flat and predictable under linux. Just because your GPU is headless does not mean that WDDM is not managing it. My guess would be WDDM demand-paging of memory getting in your way. Why WDDM would do that on a GPU that is otherwise idle I’m not sure.

Since your code has no locality, the effect of WDDM demand paging would be to reduce the apparent memory bandwidth from the “normal” main memory bandwidth down to the PCIE bandwidth (asymptotically) which is pretty much what your graph is showing.

Profiling the repeated kernel case should also shed some light on this, and possibly running the GTX780 closer to its memory limit (3GB), i.e with buffer sizes above 2GB, might also yield some clues.

Thanks txbob, I’ve read up a bit on WDDM demand paging and this makes a lot of sense. I should get Linux running, but I’ve been particularly unlucky before in getting GPU drivers set up correctly. I guess I’ll have to take another chance. That is, if you in turn run the test on Windows/GTX960 ;-).

I have tried to run a test on the GTX780 allocating 2GB, but it wouldn’t even let me allocate a buffer of that size. I’ll have access to it after the weekend; I’ll make another attempt then.

Found somebody to run the test on GTX750Ti/Linux. Turns out it shows a similar pattern, only less severe:

I’m beginning to suspect not much can be done about this. Apparently nvidia already ‘fixed’ it in the 900 series. Pity there isn’t a model available with the same low power consumption as the GTX750…

Yikes, I’m wondering why no one has complained about this before?

Is there a definitive answer on whether this is WDDM demand paging?

If there aren’t already performance counters for WDDM paging then it would be great if NVIDIA could add them.

I guess a random access pattern in a 1GB+ buffer on a mid-end consumer GPU isn’t very common :). I still hope I’m making some terrible mistake somewhere though.

Presumably it’s not (exclusively) due to WDDM demand paging since the most recent test was on linux.