Question regarding ptxas info and a weird problem

Hi all,
Can someone tell me what sm_10 and sm_20 mean in the ptxas info output?

When I compile my kernel, it reports two different register counts, as follows:
ptxas info : Compiling entry function '_Z13testKernelNewPiS_iiiiP12GPUstageInfoP14testClassifieriiPbS_S3_' for 'sm_10'
1>ptxas info : Used 14 registers, 13248+16 bytes smem, 24 bytes cmem[1]

1>ptxas info : Compiling entry function '_Z13testKernelNewPiS_iiiiP12GPUstageInfoP14testClassifieriiPbS_S3_' for 'sm_20'
1>ptxas info : Used 20 registers, 13184+0 bytes smem, 84 bytes cmem[0], 8 bytes cmem[16]

Can someone tell me how many registers per thread will actually be used in this case?

I am facing a weird problem: if the register count for sm_20 is below 20, all the data in my shared memory is correct, but once it reaches 20 or more, my shared memory data becomes zero. My GPU has 16kB of registers, and each block uses 256 threads. As far as I can tell there are enough registers, so what could be the reason for this behaviour?

It’s documented in the PTX ISA manual (included with the CUDA toolkit), but in short, sm_xy means Compute Capability x.y.

So, sm_10 = Compute Capability 1.0, and sm_20 = Compute Capability 2.0 (i.e., Fermi).
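To make the naming rule concrete, here is a minimal host-side sketch that splits a target name like the ones in the ptxas output into its major and minor parts. The helper `parse_sm_target` and the struct `ComputeCapability` are hypothetical names made up for illustration; they are not part of the CUDA toolkit, and the sketch only handles plain `sm_` names made of digits.

```cpp
// Hypothetical helper for illustration only; not a CUDA toolkit API.
struct ComputeCapability {
    int major_version;
    int minor_version;
};

// Parses a target name such as "sm_12": the last digit is the minor version
// and the digits before it form the major version.
// Returns {-1, -1} if the name is not "sm_" followed by at least two digits.
constexpr ComputeCapability parse_sm_target(const char* name) {
    const ComputeCapability invalid{-1, -1};
    if (name == nullptr || name[0] != 's' || name[1] != 'm' || name[2] != '_') {
        return invalid;
    }
    int value = 0;
    int digits = 0;
    for (const char* p = name + 3; *p != '\0'; ++p) {
        if (*p < '0' || *p > '9') {
            return invalid;
        }
        value = value * 10 + (*p - '0');
        ++digits;
    }
    if (digits < 2) {
        return invalid;
    }
    return ComputeCapability{value / 10, value % 10};
}
```

With this sketch, `parse_sm_target("sm_10")` gives major 1 and minor 0, and `parse_sm_target("sm_20")` gives major 2 and minor 0.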

This is probably just a coincidence. In a different topic you wrote that your GPU is a GeForce 335M, which has compute capability 1.2. So the code compiled for sm_20 will never be executed and should not influence your results at all.

To make optimal use of your GPU, you could set the architecture to sm_12 instead of sm_10.
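The reasoning above can be sketched in a few lines of host code. This is a simplified model, not the CUDA runtime's actual selection logic: it assumes binary code built for capability X.y runs on a device of capability X.z when z >= y, and it ignores the PTX JIT fallback. The names `can_run`, `select_target`, `registers_per_block` and `kTargets` are hypothetical, and targets are encoded as major*10 + minor.

```cpp
// Simplified compatibility model (PTX JIT fallback is ignored):
// code built for X.y can run on a device of capability X.z when z >= y.
// Targets are encoded as major*10 + minor, e.g. 12 for sm_12.
constexpr bool can_run(int built_for, int device) {
    return built_for / 10 == device / 10 && built_for % 10 <= device % 10;
}

// Returns the highest compatible target from the list, or -1 if none fits.
constexpr int select_target(const int* targets, int count, int device) {
    int best = -1;
    for (int i = 0; i < count; ++i) {
        if (can_run(targets[i], device) && targets[i] > best) {
            best = targets[i];
        }
    }
    return best;
}

// Registers needed by one block: registers per thread times threads per block.
// Hardware allocation granularity is ignored in this estimate.
constexpr int registers_per_block(int regs_per_thread, int threads_per_block) {
    return regs_per_thread * threads_per_block;
}

// Values from this thread: builds for sm_10 and sm_20.
constexpr int kTargets[] = {10, 20};
```

For a compute capability 1.2 device, `select_target(kTargets, 2, 12)` picks 10, so the register count that matters is the 14 reported for sm_10. With 256 threads per block that is `registers_per_block(14, 256)`, or 3584 registers per block, before any allocation granularity is applied.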

Ohh, so that means the change in the number of registers for sm_20 will not affect the results. Thanks for your reply. I will try to debug.

Is there any debugger that can run with only a single GPU?

Regards

Mohit
