Slow local memory, feigned constant memory. Coalesced? Global?

tonhead wrote on Jan 21 2010, 03:53 PM:

__constant__ double * constant_a;

__global__ void kernel(double* constant_a)

{

	double volatile_a[16][16];

	

	for (int i = 0; i < 16; i++){

		for (int j=0; j < 16; j++){

			volatile_a[i][j] = constant_a[i * 16 + j];

		}

	}

}

I am sure the compiler generated "global memory" loads inside the function. Check for the advisory warning ("Assuming global memory…") and also check the PTX. I don't think you are using constant memory correctly.

Yes, this is exactly what I want, and I've mentioned it once or twice in my two previous posts.

__constant__ double * constant_a;

__global__ void kernel(double* constant_a)

Don't pass the constant memory variable as an argument like that. Not only is it unnecessary (constant variables have compilation-unit scope), but it will quite possibly confuse the hell out of the compiler and result in a symbol being generated which masks the constant memory variable you are trying to use in the first place.
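A minimal sketch of that suggestion, assuming the data itself (not just a pointer) is what should live in constant memory; the __constant__ array has file scope, so the kernel refers to it directly:

__constant__ double constant_a[16 * 16];

__global__ void kernel()
{
	double volatile_a[16][16];	// per-thread array, will most likely end up in local memory

	for (int i = 0; i < 16; i++){
		for (int j = 0; j < 16; j++){
			// reads are served through the constant cache, no kernel argument needed
			volatile_a[i][j] = constant_a[i * 16 + j];
		}
	}
}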

What are advisory warnings?

I'm sure, because in the PTX this data structure (constant_a) is declared with

.const .u64

Well, yes, decuda will reveal more information, in case the post-PTX compilation steps decided to put it into global memory.

Anyway, this is how I declare and initialize constant_a:

__constant__ double* constant_a;

void wrapper(){

	cudaError_t error_here;

	size_t size_constant_a = 16 * 16 * sizeof(double);

	error_here = cudaMalloc((void**) &constant_a, size_constant_a);

	cout << "cudaMalloc matrix a: " << cudaGetErrorString(error_here) << endl;

	error_here = cudaMemcpy(constant_a, host_constant_a, size_constant_a, cudaMemcpyHostToDevice);

	cout << "cudaMemcpy matrix a: " << cudaGetErrorString(error_here) << endl;

	// calling kernel, passing constant_a as a parameter

}

Man, I tried what you said: just using this pointer to constant memory, without ever passing it. It complained that I can't do so, an access violation; I don't remember the exact message.

Wait, can there even be a pointer to constant memory? If I change it to a plain array (even though that is still a pointer underneath), it probably won't complain… I should give that a try.

The copying and initialization code you have there looks wrong to me. I don’t believe constant memory can be dynamically allocated like that. You can have statically declared constant variables and copy to them at run time using cudaMemcpyToSymbol, but I am pretty sure that code you posted will just wind up with your data in global memory.
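A minimal sketch of that approach, reusing the names from the snippet above (host_constant_a stands for the host-side source buffer):

__constant__ double constant_a[16 * 16];

double host_constant_a[16 * 16];	// host-side data, filled elsewhere

void wrapper(){
	// no cudaMalloc: the __constant__ array already exists for the lifetime of the application
	size_t size_constant_a = 16 * 16 * sizeof(double);
	cudaError_t error_here = cudaMemcpyToSymbol(constant_a, host_constant_a, size_constant_a);
	cout << "cudaMemcpyToSymbol matrix a: " << cudaGetErrorString(error_here) << endl;
}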

Ok, I should have checked how it is actually done :) Will do.

I hope you are right and that constant memory accesses will perform better.

Some things to point out:

All of your code examples are completely serial. All threads perform exactly the same work, which is rather pointless. Are you aware of that?

Secondly:

__constant__ double* constant_a;

This means that a pointer to double resides in constant memory, but the actual data does not! As a result, your line

volatile_a[i][j] = constant_a[i * 16 + j];

will be compiled to something like this:

offset1:=i * 16 + j;

constant_a_base:=constant_memory[constant_a];

var:=global_memory[constant_a_base+offset1];

local_memory[volatile_a+offset1]:=var;

If you want a whole array to be in constant memory, write this instead:

__constant__ double constant_a[16 * 16];

Thirdly:

error_here = cudaMemcpy(constant_a, host_constant_a, size_constant_a, cudaMemcpyHostToDevice);

constant_a (as it is declared so far) has a value which resides on the GPU and thus cannot be accessed directly on the host side. I am surprised the compiler did not complain about this. Unless you use device emulation, you should read garbage instead, so you end up moving the host_constant_a data to some random position in GPU memory.

Similarly, &constant_a should raise a compile error. You use cudaGetSymbolAddress to get the address of device (or constant?) variables. Alternatively, you may want to use the cudaMemcpyToSymbol function to store data into constant memory. Constant variables are like global variables: they are allocated upon declaration, so there is no need to call cudaMalloc on them.
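A short sketch of the symbol-address route mentioned above; cudaGetSymbolAddress is the actual function name, the rest is illustrative:

__constant__ double constant_a[16 * 16];

void symbol_address_example(){
	// taking &constant_a in host code does not give a usable device pointer;
	// ask the runtime for the device address of the symbol instead
	void* dev_ptr = 0;
	cudaError_t error_here = cudaGetSymbolAddress(&dev_ptr, constant_a);
	cout << "cudaGetSymbolAddress: " << cudaGetErrorString(error_here) << endl;
}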

Fourthly:

double volatile_a[16][16];

Declares an array for each thread. Since it is large and cannot fit into 16 (or 32) 32-bit registers, it is stored in local memory. Physically, local memory is global memory. As a result, all that loop does is copy data from one place in global memory to another place in global memory!
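One way to confirm where volatile_a ends up is to compile with verbose ptxas output; the sketch below is illustrative (kernel and variable names are made up), and the exact report format depends on the toolkit version:

// compile with something like: nvcc -Xptxas -v example.cu
// ptxas then reports per-kernel register, shared memory and local memory (lmem) usage,
// so a per-thread array this large should show up as lmem bytes
__global__ void spill_example(double* out)
{
	double volatile_a[16][16];	// 16 * 16 * 8 = 2048 bytes per thread
	for (int i = 0; i < 16; i++)
		for (int j = 0; j < 16; j++)
			volatile_a[i][j] = i + j;
	// dynamic indexing keeps the array from being promoted to registers
	out[threadIdx.x] = volatile_a[threadIdx.x % 16][(threadIdx.x / 16) % 16];
}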

Dear Cygnus X1,

Thank you for your helpful comments.

I will follow your path.

============================

  1. They are serial; they all do almost the same work because it is a Monte-Carlo algorithm implementation. But the code I posted here is not the whole program, there are some other calculations going on as well.

============================

  2. That makes sense! Same as const in C++. So in order to place the data in constant memory I do
__constant__ double constant_a[16][16];

============================

  3. This was taken from the NVIDIA CUDA Programming Guide, where they do
float* d_A; 

	cudaMalloc((void**)&d_A, size); 

	cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);

and it works fine for me. Sure, in the case where the data is in constant memory, we need to get its address in the first place. Thank you for mentioning that.

It's the first time I've heard that.

============================

Sure, it is in local memory :) There is no doubt it is in local memory, also because the indexing is not static and the loops are not unrolled.

Yes, all I do is copy data from constant memory to local memory (which physically lives in global memory), and then I do some operations on that local data (solve the system, to be exact; the function is given in my second post here).

All the answers concern constant memory so far. Any comments on local memory usage? On whether it is coalesced or non-coalesced? :huh:

Local memory is scoped on a per-thread basis. There should be no concept of "coalesced" access for local memory: coalesced implies all threads within a half-warp accessing sequential words within a 16-word-long, 16-word-aligned memory segment. That clearly can't happen with local memory structures or arrays, or anything longer than a word, and probably doesn't happen even in those cases.
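For comparison, the classic coalesced pattern for plain global memory looks roughly like this (a sketch with made-up names):

__global__ void coalesced_read(const double* global_data, double* out)
{
	// consecutive threads of a half-warp read consecutive words from one aligned segment,
	// so the hardware can service the half-warp with a single memory transaction
	int idx = blockIdx.x * blockDim.x + threadIdx.x;
	out[idx] = global_data[idx];
}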

Local memory is laid out in memory by the compiler and/or the runtime. It can always be made coalesced, and I believe this feature was introduced in some CUDA release…

I don't doubt that, under the correct conditions, thread-level reads from local memory might be serviced using 32, 64, or even 128 byte transactions, but that isn't the same thing as coalescing memory reads. Coalescing would imply that half-warp or warp sized local memory is "striped" somehow to allow the half-warp to be serviced by a single transaction, and that some magic local access pattern at the half-warp level would yield better bandwidth than others. That can't happen, can it?

Local memory accesses are always coalesced though since they are per-thread by definition.

As for my understanding, I prefer to think of it as a special way of laying out the data, so that accesses are coalesced whenever the threads read element 0 of their local data, then element 1, and so on. There was a thread on the forum where someone asked this exact question: what does coalescing of local data mean. I need to look that thread up; someone from NVIDIA probably clarified the matter.

who knows… :ermm:

If it can't, then one could get better performance with hard-coded coalesced accesses to global memory.
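Such a hard-coded version would amount to an element-major layout like the sketch below (hypothetical names; it only illustrates the idea of "striping" per-thread data, not how local memory is documented to behave):

// hand-rolled equivalent of "coalesced local memory": the per-thread arrays are
// interleaved so that element e of every thread in the grid sits contiguously
__global__ void striped_local(double* backing_store, int elements_per_thread)
{
	int tid = blockIdx.x * blockDim.x + threadIdx.x;
	int nthreads = gridDim.x * blockDim.x;

	for (int e = 0; e < elements_per_thread; e++){
		// when all threads touch the same element index e at the same time,
		// neighbouring threads hit neighbouring addresses, i.e. a coalesced access
		backing_store[e * nthreads + tid] = (double)tid;
	}
}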

Sure, but you did not copy the first line of that code snippet! You have cudaMalloc, you have cudaMemcpy, but you do not have "float* d_A;"; instead you have "__constant__ double* constant_a", which is completely different!

Programming Guide Page 20

Programming Guide Page 106, B.2.2 constant

Has the lifetime of an application, meaning you do not allocate it manually because it already exists!

However, local memory resides in global memory, and although accessing it may be faster than accessing plain global memory, it will still be much slower than accessing registers, shared memory, or constant memory (on a cache hit), and I believe it will still be reported by the CUDA profiler as global memory accesses (although I am not sure about this one).

Programming Guide page 88, 5.1.2.2 Local Memory

Firstly, thank you Cygnus X1 for pointing out the wrong usage of constant memory and providing correct code and references to the CUDA docs. That makes my understanding of the problem complete.

Yes, unfortunately registers and shared memory are scarce resources (a maximum of 128 registers per thread and 16 KB of shared memory per block), so we can't just do it using them.

In cudaprof there are special counters - local load and local store (Number of local memory loads and Number of local memory stores). For global memory there are other counters, such as gst_32b, gst_64b, gst_128b (Number of 32 byte, 64 byte and 128 byte global memory store transactions) and gld_32b, gld_64b, gld_128b (Number of 32 byte, 64 byte and 128 byte global memory load transactions).

[s]I actually found the answer to the question about the number of global stores exceeding the number of global loads in the cudaprof help doc:[/s]

Wrong wrong wrong. Cudaprof reported LOCAL stores exceeding LOCAL loads, so it is not applicable.

Dear Sarnath,

your post tells us how the local data is (or could be?) organized in the current CUDA release. Have you found any proof of that?

Thank you, avidday and Cygnus X1, for the comments on constant memory usage. Now I have almost no global memory reads, and the hit counter is high.

Actually, you also have a limit of 16K registers per SM (streaming multiprocessor), yet an SM can handle up to 1024 threads. Therefore, to reach maximum occupancy you are limited to 16 registers per thread. While it is not always good to sacrifice performance for maximum occupancy, as the gain may be smaller than what you pay, usually something like 50% occupancy is good. That gives 32 registers per thread. And this is only for machines with compute capability 1.2 or higher; smaller cards have only 8K registers per SM. On the other hand, if you use 128 registers, you will be able to run only 128 threads in parallel, which will be slow: the GPU won't be able to hide global memory latency, nor even some hazards with shared memory or registers! You need at least 192 threads per SM for that, if I recall correctly.
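For what it is worth, the arithmetic above as a sketch, plus the __launch_bounds__ qualifier that can be used to keep the compiler within such a register budget (assuming a compute capability 1.2/1.3 part and a CUDA version that supports __launch_bounds__; names are made up):

// register budget per SM, assuming 16384 registers and up to 1024 resident threads:
//   100% occupancy: 16384 / 1024 = 16 registers per thread
//    50% occupancy: 16384 /  512 = 32 registers per thread
//   at 128 registers per thread: 16384 / 128 = only 128 resident threads
__global__ void __launch_bounds__(256, 2) bounded_kernel(double* data)
{
	// 256 threads per block, at least 2 blocks per SM -> the compiler aims for <= 32 registers per thread
	int idx = blockIdx.x * blockDim.x + threadIdx.x;
	data[idx] *= 2.0;
}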

I stand corrected.