Slow local memory, feigned constant memory. coalesced? global?

Hello everyone. Sorry for the long post; I tried to explain everything I have learned about the memory types I used. I hope you read it to the very end and find something relevant to what you think or have noticed yourselves.

While trying to figure out the bottleneck in the program I’m working on, I tried several techniques: constant memory, shared memory, and local memory.

Playing around with the profiler, I noticed some strange (unexpected) global loads in the version that used constant memory (I put the matrix that never changes into constant memory). I believe those are loads from reading the constant memory space. Since the “hit” counter is very low (less than 10, not normalized), I assume the cache isn’t actually doing the work it was intended to do, and everything is simply read from device memory. How can the cache hit count be so low for truly constant data? It now seems to me that using shared memory as a buffer will perform better.

I also rewrote the same application to use some shared memory. Since shared memory is limited, I could only place 16 threads per block. Even though the most-used part (half of it, to be exact) of another data structure (the one used the most in the app) was placed in shared memory and bank conflicts were avoided (I used a hi-lo split of the doubles), the program performed much worse than the version that keeps the structure entirely in local memory. Something like 16 times slower. I wonder whether the extra computational effort (the hi-lo split, the extra indexing) cancels out the benefit of using on-chip memory.
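In case it is unclear what I mean by the hi-lo split: each double is stored in shared memory as two 32-bit halves, so that consecutive threads fall into consecutive banks. A minimal sketch of the idea (not my real kernel; hiloSketch, TPB and NELEM are made-up names for illustration):

#define TPB 16     // threads per block (16 in my shared-memory version)
#define NELEM 120  // elements of the upper triangle kept in shared memory

__global__ void hiloSketch(const double* in, double* out)
{
	// two separate 32-bit arrays: for a fixed element, consecutive threads
	// touch consecutive banks, so neither half causes a bank conflict.
	// Together this is ~15 KB of shared memory, which is why only
	// 16 threads per block fit.
	__shared__ int smHi[NELEM][TPB];
	__shared__ int smLo[NELEM][TPB];

	int tid = threadIdx.x;

	// store: split each double into its high and low 32-bit words
	for (int e = 0; e < NELEM; e++) {
		double v = in[e * TPB + tid];
		smHi[e][tid] = __double2hiint(v);
		smLo[e][tid] = __double2loint(v);
	}
	__syncthreads();

	// load: reassemble the double from the two words
	double acc = 0.0;
	for (int e = 0; e < NELEM; e++)
		acc += __hiloint2double(smHi[e][tid], smLo[e][tid]);

	out[blockIdx.x * TPB + tid] = acc;
}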

The program that simply keeps the volatile data structure in local memory outperformed everything I have tried so far. But its performance is still much lower than I expected; both the effective bandwidth I computed and the throughput reported by cudaprof are very low. I believe this is because of the local loads and stores. In cudaprof I see only two columns for local memory usage, load and store, and these counters hold large values — values that would be smaller if the local loads and stores were actually coalesced. For global memory, on the other hand, there are separate “gst 128b” and “gld 128b” counters. I recall a forum user asking whether local memory is a good choice; I am wondering the same. The local load/store counter values are too big for the accesses to be coalesced into 128-byte transactions. Since the partition size on current GPUs is 256 bytes, the bigger the transaction, the faster the code. So I assume that using global memory with hard-coded coalescing to 128 bytes will do better than using local memory.
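By “hard-coded coalescing” I mean laying each thread’s array out in global memory so that element e of all the threads in a block occupies one contiguous, aligned segment. Roughly like this (a sketch only; d_scratch, stripedScratchSketch and the sizes are placeholders, not code from my application):

#define TPB 64     // threads per block, as in my local-memory version
#define NELEM 256  // 16x16 doubles per thread

// the host would allocate one big scratch buffer up front, e.g.
//   cudaMalloc((void**)&d_scratch, numBlocks * TPB * NELEM * sizeof(double));
__global__ void stripedScratchSketch(double* d_scratch, const double* in)
{
	// element e of this thread lives at blockBase + e * TPB + threadIdx.x:
	// when a half-warp accesses element e, the accesses are 16 consecutive,
	// aligned 8-byte words, i.e. one 128-byte transaction
	double* base = d_scratch + (size_t)blockIdx.x * TPB * NELEM;

	for (int e = 0; e < NELEM; e++)
		base[e * TPB + threadIdx.x] = in[e];   // coalesced stores

	double sum = 0.0;
	for (int e = 0; e < NELEM; e++)
		sum += base[e * TPB + threadIdx.x];    // coalesced loads

	if (sum < 0.0) d_scratch[0] = sum;         // keep the compiler from removing the loops
}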

Please comment if you (dis)agree with anything related to this post. Anything that you could point out could be of much help for me.

Maybe you should post some of the code you tried… that might clarify some more details…

I’ve found textures to work better than constant memory, even though, according to the documentation, constant memory should work just as well.

The crucial difference between constant memory and shared memory access speed is that shared memory is multiported… each thread can read from a different bank simultaneously. This can give you 16 loads per clock. There’s also broadcast support, so the same value loaded by all the threads is also one clock.

But constant memory is only broadcast of a single value. If different threads simultaneously read different locations in constant memory, the accesses will be serialized, and that is quite often 16 times slower. This makes constant memory fine for single values, but terrible for array lookup tables.
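To make the two cases concrete, here is a minimal sketch (c_table and constAccess are made-up names): the first read is the broadcast case, the second is the serialized pattern described above:

__constant__ double c_table[256];   // made-up lookup table for illustration

__global__ void constAccess(double* out, int n)
{
	int tid = blockIdx.x * blockDim.x + threadIdx.x;
	if (tid >= n) return;

	// broadcast: every thread of the (half-)warp reads the SAME element,
	// so one constant-cache access serves all of them
	double a = c_table[blockIdx.x % 256];

	// serialized: each thread reads a DIFFERENT element, so the access
	// is replayed once per distinct address - up to 16x slower
	double b = c_table[tid % 256];

	out[tid] = a + b;
}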

Wow, I thought that the cache is what makes constant memory space fast - the cache is supposed to be fast, and is supposed to be enough for at least some data.

This is how I access constant memory space:

__constant__ double * constant_a;

__global__ void kernel(double* constant_a)
{
	double volatile_a[16][16];

	for (int i = 0; i < 16; i++){
		for (int j = 0; j < 16; j++){
			volatile_a[i][j] = constant_a[i * 16 + j];
		}
	}
}

Running 64 threads per block, each thread executes this outer loop one or more times (depending on the configuration and input data). We all hope that the threads in a block agree on their execution path, so that at any moment all threads in a block access the same element (the normalized “divergent branch” counter in cudaprof reports a value of 2). Nevertheless, cudaprof reports a substantial amount of 32-byte global memory loads. If I set the limit of i to 8 as follows

for (int i = 0; i < 8; i++){
	for (int j = 0; j < 16; j++){
		volatile_a[i][j] = constant_a[i * 16 + j];
	}
}

about half as many global loads occur (roughly 65000 for i = 16, 30000 for i = 8).

I think it’s clear that sometimes constant memory is not what you expect, so using faster on-chip memory is preferable.

The code for constant memory usage is given above. Constant memory is used for constant_a which is set on the host.

The second approach I tried was to use shared memory for storing part of the volatile_a matrix. I noticed that its upper triangular part is used the most, except that row 0 is read but never written to (apart from the initial copy from the constant_a matrix). So I employed the following scheme:

  • used reads from constant_a whenever row 0 of volatile_a is read

  • stored the upper triangular part of volatile_a (except row 0) in shared memory

  • left the rest of volatile_a in local memory

In order to avoid bank conflicts, I used more involved indexing for accessing shared memory, as well as a hi-lo split of the doubles. I had to write special, ugly getter and setter functions for accessing volatile_a elements, but those are inlined (because they are device functions), so the calls themselves don’t cause any overhead. Here is the getMatrixElement function:

__device__ double getMatrixElement(int rowIndex, int columnIndex, int sharedMatrix[241 * 16], double localMatrix[15][15], double* constant_a, int ithread, double dwip)
{
	// offsets of the packed upper-triangle rows 1..15
	int offsets[15] = {0, 15, 29, 42, 54, 65, 75, 84, 92, 99, 105, 110, 114, 117, 119};

	if (rowIndex == 0){
		// row 0 is never modified, so read it from constant memory
		if (columnIndex == 10){
			return dwip;
		}else{
			return constant_a[columnIndex];
		}
	}else if (rowIndex <= columnIndex){
		// the upper triangle lives in shared memory as two 32-bit halves (hi-lo split)
		int hi_index = 241 * ithread + 120 + (offsets[rowIndex - 1] + columnIndex - rowIndex + 8) - ((offsets[rowIndex - 1] + columnIndex - rowIndex + 8)/120)*120;
		int lo_index = 241 * ithread + offsets[rowIndex - 1] + columnIndex - rowIndex;
		return __hiloint2double(sharedMatrix[hi_index], sharedMatrix[lo_index]);
	}else{
		// the lower triangle stays in local memory
		return localMatrix[rowIndex - 1][columnIndex];
	}
}

Sure, it looks like a lot, the index computation is complicated, and the heavyweight (and not recommended) __hiloint2double function is used, but can that really account for a 16x slowdown compared to the simple implementation based on local memory? I ran both applications for the same amount of time, with the same number of threads per block. The program that used local memory processed 16 times more operations than the program that used the shared memory scheme.

Local memory scheme:

__constant__ double * constant_a;

__global__ void kernel(double* constant_a)
{
	double volatile_a[16][16];
	double vector_x[16];
	double vector_x0[16];

	for (int m = 0; m < someValue; m++)
	{
		for (int i = 0; i < 16; i++){
			for (int j = 0; j < 16; j++){
				volatile_a[i][j] = constant_a[i * 16 + j];
			}
		}

		solve(volatile_a, vector_x, vector_x0);

		// write result to global memory
	}
}

// solves the system of linear equations using Gaussian elimination

__device__ void solve(double a[16][16], double b[16], double x[16]){
	double c;

	for (int i = 0; i < 16; i++){
		x[i] = 0;
		b[i] = 0.;
	}
	b[0] = -1.0;

	// forward elimination
	for (int k = 0; k < 15; k++){
		for (int i = (k + 1); i < 16; i++){
			c = a[i][k] / a[k][k];
			a[i][k] = 0.0;
			for (int j = (k + 1); j < 16; j++) {
				a[i][j] = a[i][j] - c * a[k][j];
			}
			b[i] = b[i] - c * b[k];
		}
	}

	// back substitution
	x[15] = b[15] / a[15][15];
	for (int i = 14; i >= 0; i--){
		c = 0.0;
		for (int j = i + 1; j < 16; j++) {
			c = c + a[i][j] * x[j];
		}
		x[i] = (b[i] - c) / a[i][i];
	}
}

So I have three data structures in local memory - volatile_a, vector_x and vector_x0 (they are in local memory; I checked the PTX file).

When profiling this app, the normalized (per-block) counters for local load and store are 828416 and 3704830 respectively. The store counter is much bigger than the load counter, which is not what happens in the code. And overall the values are far too big: they amount to more than 50000 local stores per thread! This was tested with someValue = 8 iterations; there should be around 5000 loads and stores in total per iteration.

Actually, in the PTX file there are two extra local data structures:

.local .align 8 .b8 __cuda___cuda_result_32152[40];
.local .align 8 .b8 __cuda___cuda_result_72192[40];
.local .align 8 .b8 __cuda___cuda_x_112232[128];
.local .align 8 .b8 __cuda___cuda_x0_240360[128];
.local .align 8 .b8 __cuda___cuda_aa_368488[2048];

The first two are not present in the code and could be temporary storage for some data, but then why are they so meaningfully named?

To conclude:

- constant memory, as I used it, is no good for this app

- there is an unexplained slowdown when using shared memory

- there are extra local memory loads and stores (which could be coalesced into 128-byte transactions if the data were in global memory)

I don’t understand - you only need to load 256 elements???

what about something like this:

__shared__ double smData[ 16 * 16 ];

int iPos = threadIdx.x * 16 + threadIdx.y;

smData[ iPos ] = gmemData[ iPos ];
__syncthreads();

// and now you have the 16x16 matrix in shared mem

Isn’t that what you want?

eyal

[quote name=‘tonhead’ date=‘Jan 21 2010, 03:53 PM’ post=‘985402’]

__constant__ double * constant_a;

__global__ void kernel(double* constant_a)
{
	double volatile_a[16][16];

	for (int i = 0; i < 16; i++){
		for (int j = 0; j < 16; j++){
			volatile_a[i][j] = constant_a[i * 16 + j];
		}
	}
}

I am sure the compiler generated “global memory” loads inside the function. Check for the advisory warning (“Assuming global memory space…”) and also check the PTX. I don’t think you are using constant memory correctly.

Yes, this is exactly what I want. And I’ve mentioned this once or twice in my two previous posts.

__constant__ double * constant_a;

__global__ void kernel(double* constant_a)

Don’t pass the constant memory variable as an argument like that. Not only is it unnecessary (constant variables have compilation-unit scope), it will quite possibly confuse the hell out of the compiler and result in a symbol being generated which masks the constant memory variable you are trying to use in the first place.
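Something along these lines instead (just a sketch, and it assumes constant_a is declared as an actual array in constant memory rather than as a pointer):

__constant__ double constant_a[16 * 16];   // the data itself lives in constant memory

__global__ void kernel()                   // no pointer argument needed
{
	double volatile_a[16][16];

	for (int i = 0; i < 16; i++){
		for (int j = 0; j < 16; j++){
			// the file-scope constant symbol is visible here directly
			volatile_a[i][j] = constant_a[i * 16 + j];
		}
	}
}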

What are advisory warnings?

I’m sure because in the PTX this data structure (constant_a) is declared with

.const .u64

Well, yeah, decuda will reveal more, in case the post-PTX compilation steps decided to put it into global memory after all.

Anyway, this is how I declare and initialize constant_a:

__constant__ double* constant_a;

void wrapper(){
	cudaError_t error_here;
	size_t size_constant_a = 16 * 16 * sizeof(double);

	error_here = cudaMalloc((void**) &constant_a, size_constant_a);
	cout << "cudaMalloc matrix a: " << cudaGetErrorString(error_here) << endl;

	error_here = cudaMemcpy(constant_a, host_constant_a, size_constant_a, cudaMemcpyHostToDevice);
	cout << "cudaMemcpy matrix a: " << cudaGetErrorString(error_here) << endl;

	// calling kernel, passing constant_a as a parameter
}

Man, I tried what you say - just using this pointer to constant memory without ever passing it. It complained that I can’t do that - an access violation; I don’t remember the exact message.

Wait, can there even be a pointer to constant memory? Probably, if I change it to a plain array - even though it is effectively the same pointer - it won’t complain… I should give that a try.

The copying and initialization code you have there looks wrong to me. I don’t believe constant memory can be dynamically allocated like that. You can have statically declared constant variables and copy to them at run time using cudaMemcpyToSymbol, but I am pretty sure that code you posted will just wind up with your data in global memory.
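For reference, the statically declared route would look roughly like this (a sketch only; it assumes host_constant_a is a 256-element array on the host):

#include <cstdio>

__constant__ double constant_a[16 * 16];   // statically declared, no cudaMalloc needed

void wrapper(const double* host_constant_a){
	size_t size_constant_a = 16 * 16 * sizeof(double);

	// copy the host data into the constant memory symbol
	cudaError_t err = cudaMemcpyToSymbol(constant_a, host_constant_a, size_constant_a);
	if (err != cudaSuccess)
		printf("cudaMemcpyToSymbol: %s\n", cudaGetErrorString(err));

	// the kernel then reads constant_a directly, no parameter needed
}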

Ok, I should have checked how it is actually done :) Will do.

Hope you are right and constant memory accesses can perform better.

Some things to point out:

All of your code examples are completely serial. All threads perform exactly the same thing, which is kind of pointless. Are you aware of that?

Secondly:

__constant__ double* constant_a;

This means that a pointer to double resides in constant memory, but the actual data does not! As a result, your line

volatile_a[i][j] = constant_a[i * 16 + j];

will be compiled to something like this:

offset1 := i * 16 + j;
constant_a_base := constant_memory[constant_a];
var := global_memory[constant_a_base + offset1];
local_memory[volatile_a + offset1] := var;

If you want the whole array to be in constant memory, declare it like this instead:

__constant__ double constant_a[16 * 16];

Thirdly:

error_here = cudaMemcpy(constant_a, host_constant_a, size_constant_a, cudaMemcpyHostToDevice);

constant_a (as it is declared so far) has a value which resides on the GPU and thus cannot be accessed directly on the host side. I am surprised the compiler did not complain about this. Unless you use device emulation, you should read some garbage instead. So you move the host_constant_a data to some random position in GPU memory.

Similarly, &constant_a should raise a compile error. You use cudaGetSymbolAddress to get the address of device (or constant) variables. Alternatively, you may want to use cudaMemcpyToSymbol to store data into constant memory. Constant variables are like global variables: they are allocated upon declaration, so there is no need to call cudaMalloc on them.

Fourthly:

double volatile_a[16][16];

This declares an array for each thread. Since it is too big to fit into the 16 (or 32) available 32-bit registers, it is stored in local memory. Physically, local memory resides in global memory. As a result, all you do in that loop is copy data from one place in global memory to another place in global memory!

Dear Cygnus X1,

Thank you for your helpful comments.

I will follow your advice.

============================

  1. They are serial and do almost the same work because it is a Monte-Carlo algorithm implementation. But the code I posted here is not the whole program; there are some other calculations going on as well.

============================

  2. That makes sense! Same as const in C++. So in order to place the data in constant memory I do
__constant__ double constant_a[16][16];

============================

  3. This was taken from the NVIDIA CUDA Programming Guide, where they do

float* d_A;
cudaMalloc((void**)&d_A, size);
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);

and it works fine for me. Sure, in the case where the data is in constant memory, we want to get its address in the first place. Thank you for mentioning that.

This is the first time I have heard that.

============================

  4. Sure, it is in local memory :) There is no doubt about that, also because the indexing is not static and the loops are not unrolled.

Yes, all I do is copy data from the constant memory space into local memory (which is physically global), and then I do some operations on that local data (solve the system, to be exact; the function is given in my second post here).

All the answers so far concern constant memory. Any comments on the local memory usage? On whether it is coalesced or non-coalesced? :huh:

Local memory is scoped on a per thread basis. There should be no concept of “coalesced” access for local memory – coalesced implies all threads within a half-warp accessing sequential words within a 16 word-long, 16 word aligned memory segment. That clearly can’t happen with local memory structures or arrays, or anything which is more than word length, and probably doesn’t happen even for those cases.

Local memory is arranged in memory by the compiler and/or the runtime. One can always make it coalesced, and I believe this feature was introduced in some CUDA release…

I don’t doubt that, under the right conditions, thread-level reads from local memory might be serviced using 32-, 64-, or even 128-byte transactions, but that isn’t the same thing as coalescing memory reads. Coalescing would imply that half-warp or warp sized local memory is “striped” somehow, so that the half-warp can be serviced by a single transaction, and that some magic local access pattern at the half-warp level would yield better bandwidth than others. That can’t happen, can it?

Local memory accesses are always coalesced though since they are per-thread by definition.

As far as my understanding goes, I would prefer to think of it as a special way of laying out the data - so that accesses are always coalesced when all threads read element 0 of their local data, when they all read element 1, and so on. There was a thread on the forum where someone asked this exact question - what does coalescing of local data mean. I need to look up that thread; probably someone from NVIDIA clarified the matter.
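To illustrate what I imagine by that: something like the address mapping below, where all threads asking for the same element land in consecutive slots (localAddr is a made-up helper, and I don’t know whether the compiler/hardware really does it this way):

__device__ size_t localAddr(int element, int tid, int threadsPerBlock, size_t blockBase)
{
	// element k of thread t lives at blockBase + k * threadsPerBlock + t,
	// so "everyone reads element k" touches one contiguous, aligned run of words
	return blockBase + (size_t)element * threadsPerBlock + tid;
}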