Resource usage & optimization: reading a cubin file


I need some clarification about the values reported in my cubin file:

lmem = 32
smem = 36
reg = 28
bar = 0

As there is no declaration of shared variables in my kernel, the SMEM value should only correspond to my parameters:

__global__ void
FooKernel ( unsigned w,
            unsigned h,
            float s,
            float sq,
            float* o );

So, with 4 bytes per float and unsigned, and 1 byte for the float*, I get 17 bytes. That is a big difference from 36 :glare:

blockDim and gridDim are also stored in shared memory. So, add an additional 20 bytes (x,y,z for blockDim and x,y for gridDim). Then, if you allow for 4 bytes for your unsigned (perhaps this is needed for packing reasons…), you get 36 bytes total.

Thx a lot, I forgot blockDim and gridDim.

But I have 3 more questions:


2 unsigneds (2 x 4 bytes) + 2 floats (2 x 4 bytes) + gridDim & blockDim (20 bytes) + float* (which I suppose is stored in 1 byte) = 37 bytes, not 36. Do you have any idea where this small difference comes from?


Are you suggesting in your post that an unsigned can be stored in fewer than 4 bytes? :dry:


The LMEM value suggests that my 2 float arrays

float e[4], d[4];

(declared in my kernel) are stored in local memory. As I need to keep the array form, is there a way to make sure they will be stored in registers? (Shared memory would cause a lot of bank conflicts.)



Oh, I missed the first unsigned.
2 unsigneds - 8 bytes
2 floats - 8 bytes
blockDim/gridDim - 20 bytes
float* - 4 bytes / 8 bytes on 64-bit platform
= 40 bytes / 44 bytes on 64-bit platform.

I don’t know why that doesn’t add up to exactly 36. As I said before, there may be some packing going on: blockDim, for example, doesn’t need 4 bytes for each value, so each one could be stored in 16 bits. I don’t know the full details of how these are addressed. To find out, either consult the PTX ISA manual and read the generated PTX, or use wumpus’s decuda tool.

Local memory has been discussed many times on the forums. The summary is that as long as you index into the arrays only with constants that can be evaluated at compile time, the arrays will be stored in registers.
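
A minimal sketch of that rule of thumb, assuming nothing about the real FooKernel: RegisterArrayKernel and its arithmetic are invented for illustration, and only the array declarations come from the question. The loops have a fixed trip count and are marked with #pragma unroll, so after unrolling every index is a compile-time constant.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, not the original FooKernel: it only illustrates
// how the small per-thread arrays are indexed.
__global__ void RegisterArrayKernel(float s, float sq, float* o)
{
    float e[4], d[4];

    // Fixed trip count: the compiler can fully unroll this loop, turning
    // each e[i] / d[i] into an access with a constant index.
    #pragma unroll
    for (int i = 0; i < 4; ++i) {
        e[i] = s * i;
        d[i] = sq + i;
    }

    float acc = 0.0f;
    #pragma unroll
    for (int i = 0; i < 4; ++i) {
        acc += e[i] * d[i];
    }

    // By contrast, an index only known at run time, such as
    // e[threadIdx.x & 3], would typically force the array into local memory.
    o[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}

int main()
{
    const int blocks = 2, threads = 32, n = blocks * threads;
    float* d_o = NULL;
    cudaMalloc((void**)&d_o, n * sizeof(float));

    RegisterArrayKernel<<<blocks, threads>>>(2.0f, 1.0f, d_o);

    float h_o[n];
    cudaMemcpy(h_o, d_o, sizeof(h_o), cudaMemcpyDeviceToHost);
    printf("o[0] = %f\n", h_o[0]);

    cudaFree(d_o);
    return 0;
}
```

Whether the arrays really ended up in registers is best confirmed from the lmem value itself, either in the cubin as above or by compiling with nvcc --ptxas-options=-v, which reports per-kernel register and local memory usage.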

Thx, I’m going to read the PTX manual.