Pass arguments through constant memory

Hi all!

I’m working on the design of a kernel that requires a full set of arrays (~30) stored in the device’s global memory. The question is how to pass the addresses of these arrays to my kernel function.

The simplest strategy (the one used in the SDK samples) is to pass the pointers as arguments to the kernel. These (array) pointers will then be stored in shared memory (am I right on this point?) and will thus be accessible from all the threads in my block. There won’t be any bank conflict, as all threads will access the same shared memory bank to read the pointer. Each thread will then read global memory using this pointer.

As I may run short of shared memory, I tried to think about another solution. I’m not using much of my constant memory, and as it is cached, accessing it is pretty fast. I could thus store the pointers to global memory in constant memory and then have all my threads read from this memory to find the address of the array in global memory.
I haven’t found much information about constant memory in the documentation, so I don’t know how big the cache is or how fast a (broadcast) read from constant memory is. Could you help me here?
Could you also tell me whether the strategy I’m trying to implement is smart or not? Would you advise another way of doing this?

Thanks

P.S.: The other solution I have is to group all my arrays into a single one and then play with the indices on the device. This would however require a big refactoring of my code.
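For reference, here is roughly what I have in mind for the constant-memory variant (a sketch only; `g_arrayPtrs`, `NUM_ARRAYS` and the kernel body are placeholder names I made up, and error checking is omitted):

```cuda
#define NUM_ARRAYS 30

// Table of device pointers held in constant memory: it is cached, and a
// read of the same entry by all threads of a warp should be broadcast.
__constant__ float *g_arrayPtrs[NUM_ARRAYS];

__global__ void myKernel(int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        g_arrayPtrs[0][tid] += g_arrayPtrs[1][tid];
}

// Host side: gather the cudaMalloc'ed device pointers into a table,
// then copy the table to the constant symbol before launching.
void setArrayTable(float *hostTable[NUM_ARRAYS])
{
    cudaMemcpyToSymbol(g_arrayPtrs, hostTable,
                       NUM_ARRAYS * sizeof(float *));
}
```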

You could indeed use constant memory to do so. It’s limited in size as well (64 KB in total).

However, you can simply pass a pointer to any data to the kernel and that’s it, something like this:

struct MyData
{
  ...
  // lots of data members, values...
};

__global__ void MyKernel( MyData *pData )
{
  ...
  pData->...
}

Just allocate MyData on the host and on the device, copy it from the host to the device, and you’re set.

It will only take you 4 or 8 bytes of smem.

eyal
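A minimal host-side sketch of that setup (names are hypothetical and error checking is omitted):

```cuda
struct MyData
{
    float *a;   // device pointers, filled in on the host
    float *b;
    int    n;
    // ... roughly 30 such pointers in the real struct ...
};

void setup(float *devA, float *devB, int n, MyData **ppDevData)
{
    MyData h;
    h.a = devA;   // these already point into device global memory
    h.b = devB;
    h.n = n;

    // One extra allocation holds the struct itself on the device.
    cudaMalloc((void **)ppDevData, sizeof(MyData));
    cudaMemcpy(*ppDevData, &h, sizeof(MyData), cudaMemcpyHostToDevice);

    // MyKernel<<<grid, block>>>(*ppDevData);
}
```

The kernel then receives a single pointer, so only one pointer’s worth of parameter space is used.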

Mmmhh. Ok. But if I allocate this in device global memory, then reading any value in one of my arrays will require two reads from global memory instead of one. Am I missing something?

Why two reads?

One read to fetch the pointer from pData, and a second read to access pData->array[index]. Is this wrong?

That should only count for one read as far as I know.

eyal

pData->array[index] can be rewritten as:

*((*pData).array + index)

There are two “*”, hence two pointer dereferences. Which means, if I’m right, that the kernel will need two memory reads. No?

You are correct. The two reads will have disastrous performance consequences on compute 1.x hardware, not so bad on 2.x as the pointer will likely be kept in L1 cache. While we are on the topic of compute 2.x, it automatically puts all function parameters in constant memory for you.

Are you really so worried about losing 240 bytes of shared memory that you don’t want to pass the pointers into the kernel by value? At that level, what you should be worried about is the 256 byte limit on kernel function parameters in compute 1.x.
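Passing the whole table by value could look like this (a sketch; `PtrTable` is a made-up name, and 30 pointers at 8 bytes each is the 240 bytes in question):

```cuda
struct PtrTable
{
    float *a[30];   // 240 bytes with 8-byte pointers
};

// The table travels in the kernel parameter space (shared memory on
// compute 1.x, constant memory on 2.x), so no global memory read is
// needed just to obtain a pointer.
__global__ void kernel(PtrTable t, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        t.a[0][tid] = t.a[1][tid] + 1.0f;
}
```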

I use it a lot on CC 1.x and didn’t see any performance degradation (the first version passed each array separately).

I guess I mostly read the stuff once and then put it in shared memory or registers, or hoist it outside the loops in the kernel - so maybe its effect is lower.

What would happen if you do something like this:

__global__ void kernel( MyStruct *pData )
{
   float *parray = pData->array;
   ...
   for ( ... )
   {
      use parray[ ... ]
   }
}

The overhead, on any CC, should be very small, no?

eyal

Sure, if the compiler keeps the pointer around in a register. Common subexpression elimination would probably do this effectively even if you used pData->array everywhere.

I guess maybe I over-worry about the overhead - I’m still used to thinking about coalescing in compute 1.0, where an uncoalesced read runs at 1/16 performance. In a complicated kernel that reads pData->array dozens of times, the overhead of the hit for the pointer read is probably not that large. However, in a simple kernel it could be huge. For example:

__global__ void kernel1( Data *d )
{
	d->a[threadIdx.x] = d->b[threadIdx.x] + 1;
}

// vs.

__global__ void kernel2( float *a, float *b )
{
	a[threadIdx.x] = b[threadIdx.x] + 1;
}

In the first kernel, there are 2 pointer reads, 1 float read and 1 write, for a total of 2*8 + 2*4 = 24 bytes per thread.

In the second, only 8 bytes per thread are accessed - so the kernel will likely run 3x faster (or even more if you factor in the cost of the uncoalesced pointer read).

Arguably, one probably would not pass in a struct of 20 pointers for such a simple kernel - I’m just memory bandwidth bound in almost every kernel I write so I have a habit of avoiding absolutely all unnecessary memory reads.

Thanks for the detailed explanation :)

eyal