Weird profiler results in occupancy analysis

Hi, I have been experimenting with workgroup sizes and got a result I did not expect. The profiler reports a register ratio of 1, with all 32768 registers occupied. But it is running 4 workgroups of 160 work-items, each work-item using 38 registers, so the total register usage should be 4 x 160 x 38 = 24320, not 32768. In fact, I assumed that 5 workgroups could be resident at a time: the total register usage would then be 30400, still within the limit, and that would give 25 warps, one more than with 3 workgroups of 256 work-items. Does anyone have an explanation for these results?

Kernel details : Grid size: 5284 x 1, Block size: 160 x 1 x 1

Register Ratio		= 1  ( 32768 / 32768 ) [38 registers per thread] 

Shared Memory Ratio	= 0.75 ( 36864 / 49152 ) [8976 bytes per Block] 

Active Blocks per SM	= 4 : 8

Active threads per SM	= 640 : 1536

Occupancy		= 0.416667  ( 20 / 48 )

Achieved occupancy 	= 0.416667  (on 16 SMs)

Occupancy limiting factor	= Block-Size
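To make my expectation explicit, here is the arithmetic as a small Python sketch. It assumes nothing is rounded up, and it takes a warp size of 32 from the profiler's own numbers (640 threads shown as 20 warps); the function name is just mine.

```python
REGS_PER_SM = 32768   # register file size reported by the profiler
WARP_SIZE = 32        # 640 active threads are reported as 20 warps

def naive_registers(blocks, block_size, regs_per_thread):
    """Registers needed if the hardware rounded nothing up."""
    return blocks * block_size * regs_per_thread

four_blocks = naive_registers(4, 160, 38)
five_blocks = naive_registers(5, 160, 38)

print(four_blocks, five_blocks)        # 24320 30400
print(five_blocks <= REGS_PER_SM)      # True: 5 workgroups should fit
print(5 * 160 // WARP_SIZE, 3 * 256 // WARP_SIZE)  # 25 24 warps
```

So by this count the profiler's 32768 is 8448 registers more than I can account for.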

Now that I look at the 256 work-item case, the counts do not add up there either: 3 x 256 x 38 = 29184, not 30720. Are the missing 1536 registers on holiday?

Kernel details : Grid size: 5408 x 1, Block size: 256 x 1 x 1

Register Ratio		= 0.9375  ( 30720 / 32768 ) [38 registers per thread] 

Shared Memory Ratio	= 0.90625 ( 44544 / 49152 ) [14352 bytes per Block] 

Active Blocks per SM	= 3 : 8

Active threads per SM	= 768 : 1536

Occupancy		= 0.5  ( 24 / 48 )

Achieved occupancy 	= 0.5  (on 16 SMs)

Occupancy limiting factor	= Registers , Shared-memory
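Here is the same check for the 256 case, plus one guess at a rounding rule. The rule is purely my assumption, fitted to these two data points and not taken from any documentation: round the warps per workgroup up to an even count, then round the registers per workgroup up to a multiple of 1024. The unit sizes and function names below are hypothetical.

```python
WARP_SIZE = 32

def round_up(value, unit):
    """Round value up to the next multiple of unit (integers only)."""
    return -(-value // unit) * unit

def guessed_registers(blocks, block_size, regs_per_thread,
                      warp_unit=2, reg_unit=1024):
    # ASSUMPTION: warps per block rounded up to a multiple of warp_unit,
    # then registers per block rounded up to a multiple of reg_unit.
    warps = round_up(round_up(block_size, WARP_SIZE) // WARP_SIZE, warp_unit)
    per_block = round_up(warps * WARP_SIZE * regs_per_thread, reg_unit)
    return blocks * per_block

def guessed_shared(blocks, bytes_per_block, unit=512):
    # ASSUMPTION: shared memory per block rounded up to a multiple of unit.
    return blocks * round_up(bytes_per_block, unit)

print(3 * 256 * 38)                    # 29184, my naive count
print(guessed_registers(4, 160, 38))   # 32768, matches the profiler
print(guessed_registers(3, 256, 38))   # 30720, matches the profiler
print(guessed_shared(4, 8976))         # 36864, matches the profiler
print(guessed_shared(3, 14352))        # 44544, matches the profiler
print(guessed_registers(5, 160, 38))   # 40960, would not fit in 32768
```

This guess reproduces all four reported totals, and if it is right a fifth workgroup of 160 would not fit. But it is only a fit to two kernels, so I would still like to know what the real allocation rule is.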

Thanks