Regarding how to optimize a basic mass spring system

Hi all,
I have a basic mass spring system implemented in CUDA. I now am looking to improve its performance. My question is
where do i start optimizing from? I know that the launch configuration is also critical but how do i determine the optimal
configuration? What tools can i use to aid me in all this. In case I add in the shared memory to reduce the read latency, what
should the size be of the shared mem. array? Is there an optimization guide available? Can anyone help me with the optimization
or help me in finding the pressure points in my kernel. If so I am willing to share my basic code.

Thanks,
Mobeen

There is the Best Practices Guide as well as Chapter 5 “Performance Guidelines” of the Programming Guide.

If your code isn’t too long you can also post it here and we may have a look.

HI thanks for the link I will surely have a look at this.

Attached is the complete source.
Cloth_CUDA2.zip (19.2 KB)

Instead of just dumping a tarfile with 1359 lines of code on us, I was more thinking of posting just the kernel, together with it’s execution configuration.

Your kernel is clearly memory bound. What compute capability is your device? On 1.x devices, using either a texture for [font=“Courier New”]_vertexData[/font] or preloading blocks from it into shared memory would be essential to avoid a lot of unnecessary memory accesses.

Thanks for the fast response and sorry for the zip file dump.

The total simulation grid size is 256x256 and the execution config. is 8 x 8.

I am on ComputeCapab=1.3. If i cache the data into the shared memory, what should the size of the shared memory array be? How do I determine this size?

10x10 would naturally allow each thread to read all of it’s neighbors.

Using a texture you can potentially avoid re-reading the borders from global memory. Part of that gain may also be realized from larger blocks. I’d try something like 32x4 or so, as horizontal borders are more expensive than vertical borders due to coalescing rules/partly wasted bandwidth.

HI Tera,

Before doing the optimization by using shared memeory/texture usage, I thought I would time my kernel without these optimizations and here are the results.

+---------------+-----------------+--------------------+

| Configuration | kernel time(ms) |   Simulation size  |

+---------------+-----------------+--------------------+

|     16x16     |      ~0.456     |      256 x 256     |

+---------------+-----------------+--------------------+

|     32x4      |      ~0.483     |      256 x 256     |

+---------------+-----------------+--------------------+

|     8x8       |      ~0.356     |      256 x 256     |

+---------------+-----------------+--------------------+

So you can see that the kernel performs best on 8x8 configuration. Now i have one question about using texture. In the convolve example that ships with the sdk, the texture version performs poorer than the array version, do u think that since textures will handle the edge case well, I will get any speedup? Currently, I map the vbo and pass that mapped pointer directly which is probably the fastest method to push data to GPU if I am not mistaken what do you say?

The larger blocks will only be useful once you cache data in shared memory, so I’m not surprised (apart from 8x8 performing so much faster - apparently there is not enough work to load your whole device with larger blocks).

I would expect the texture version to be slightly faster but its hard to predict.

I second that the texture version would probably be faster. Make sure to test a variety of block sizings if you go that route - 16x4 will probably be best on Tesla-class hardware, a different size might work better on Fermi.

The shared memory optimization is key.

A lot of duplicated code in this file could be eliminated, with no performance penalty, by moving the force calculation to a separate device function. The compiler will inline calls to that function.

You might want to consider using the intrinsics for reciprocal and reciprocal square root, if your application can tolerate the precision loss.

Thanks for your replies. Actually, this code is based on my CPU cloth code. I am quite new to CUDA and so I am trying to do the conversion myself by learning CUDA at the same time and it is not easy atleast for me. Could u suggest a good site/article that may help me with the conversion of this code to using shared / texture memory?

Hi all,

Sorry to revive my old thread. I have done the shared memory optimization and here is the new code, In this case the threads per block are 8x8.

__global__ void updatePoint(float dT, int _size, float stiffness, float mass, float* _vertexData, float* _velocities)

{

    __shared__ float s_Data[8][8][3];

    __shared__ float s_Vel[8][8][3];

unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;

    unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;

s_Data[threadIdx.x][threadIdx.y][0] = _vertexData[((x * _size) + y)*3];

    s_Data[threadIdx.x][threadIdx.y][1] = _vertexData[((x * _size) + y)*3 + 1];

    s_Data[threadIdx.x][threadIdx.y][2] = _vertexData[((x * _size) + y)*3 + 2];

s_Vel[threadIdx.x][threadIdx.y][0] = _velocities[( (x * _size) + y)*3];

    s_Vel[threadIdx.x][threadIdx.y][1] = _velocities[( (x * _size) + y)*3 + 1];

    s_Vel[threadIdx.x][threadIdx.y][2] = _velocities[( (x * _size) + y)*3 + 2];

__syncthreads();

//calculate new position based on velocity:

    // dS = v*dT

    float newPoint[3];

    if(x==0 || x==(_size-1) || y==0 || y==(_size-1)) {

	newPoint[0] = s_Data[threadIdx.x][threadIdx.y][0];

	newPoint[1] = s_Data[threadIdx.x][threadIdx.y][1];

	newPoint[2] = s_Data[threadIdx.x][threadIdx.y][2];

    } else {

	newPoint[0] = s_Data[threadIdx.x][threadIdx.y][0] += dT * s_Vel[threadIdx.x][threadIdx.y][0];

	newPoint[1] = s_Data[threadIdx.x][threadIdx.y][1] += dT * s_Vel[threadIdx.x][threadIdx.y][1];

	newPoint[2] = s_Data[threadIdx.x][threadIdx.y][2] += dT * s_Vel[threadIdx.x][threadIdx.y][2];

    }

	

    __syncthreads();

float distanceToSphereCenter = sqrt(newPoint[0] * newPoint[0] + newPoint[1] * newPoint[1] + newPoint[2] * newPoint[2]);

float force[] = {0.0f, -0.02f, 0.0f};

if(distanceToSphereCenter < 5.05f)

    {

        //position point on the sphere's surface

        newPoint[0] = newPoint[0] / distanceToSphereCenter * 5.05f;

        newPoint[1] = newPoint[1] / distanceToSphereCenter * 5.05f;

        newPoint[2] = newPoint[2] / distanceToSphereCenter * 5.05f;

s_Vel[threadIdx.x][threadIdx.y][0] = 0.0f;

        s_Vel[threadIdx.x][threadIdx.y][1] = 0.0f;

        s_Vel[threadIdx.x][threadIdx.y][2] = 0.0f;

    }

    else

    {

        //for each of the 8 neighbouring points

        //calculate the difference of the normal to the current length

        //apply a force due to this difference

if(x  < _size - 1)

        {

            float distance[3];

            distance[0] = s_Data[threadIdx.x+1][threadIdx.y][0] - newPoint[0];

            distance[1] = s_Data[threadIdx.x+1][threadIdx.y][1] - newPoint[1];

            distance[2] = s_Data[threadIdx.x+1][threadIdx.y][2] - newPoint[2];

float length = sqrt(distance[0] * distance[0] + distance[1] * distance[1] + distance[2] * distance[2]);

force[0] += (distance[0] / length) * (length - normalLength) * stiffness;

            force[1] += (distance[1] / length) * (length - normalLength) * stiffness;

            force[2] += (distance[2] / length) * (length - normalLength) * stiffness;

        }

if(y < _size - 1)

        {

            float distance[3];

            distance[0] = s_Data[threadIdx.x][threadIdx.y+1][0] - newPoint[0];

            distance[1] = s_Data[threadIdx.x][threadIdx.y+1][1] - newPoint[1];

            distance[2] = s_Data[threadIdx.x][threadIdx.y+1][2] - newPoint[2];

float length = sqrt(distance[0] * distance[0] + distance[1] * distance[1] + distance[2] * distance[2]);

force[0] += (distance[0] / length) * (length - normalLength) * stiffness;

            force[1] += (distance[1] / length) * (length - normalLength) * stiffness;

            force[2] += (distance[2] / length) * (length - normalLength) * stiffness;

        }

if(x > 0)

        {

            float distance[3];

            distance[0] = s_Data[threadIdx.x-1][threadIdx.y][0] - newPoint[0];

            distance[1] = s_Data[threadIdx.x-1][threadIdx.y][1] - newPoint[1];

            distance[2] = s_Data[threadIdx.x-1][threadIdx.y][2] - newPoint[2];

float length = sqrt(distance[0] * distance[0] + distance[1] * distance[1] + distance[2] * distance[2]);

force[0] += (distance[0] / length) * (length - normalLength) * stiffness;

            force[1] += (distance[1] / length) * (length - normalLength) * stiffness;

            force[2] += (distance[2] / length) * (length - normalLength) * stiffness;

        }

if(y > 0)

        {

            float distance[3];

            distance[0] = s_Data[threadIdx.x][threadIdx.y-1][0] - newPoint[0];

            distance[1] = s_Data[threadIdx.x][threadIdx.y-1][1] - newPoint[1];

            distance[2] = s_Data[threadIdx.x][threadIdx.y-1][2] - newPoint[2];

float length = sqrt(distance[0] * distance[0] + distance[1] * distance[1] + distance[2] * distance[2]);

force[0] += (distance[0] / length) * (length - normalLength) * stiffness;

            force[1] += (distance[1] / length) * (length - normalLength) * stiffness;

            force[2] += (distance[2] / length) * (length - normalLength) * stiffness;

        }

if(x > 0 && y > 0)

        {

            float distance[3];

            distance[0] = s_Data[threadIdx.x-1][threadIdx.y-1][0] - newPoint[0];

            distance[1] = s_Data[threadIdx.x-1][threadIdx.y-1][1] - newPoint[1];

            distance[2] = s_Data[threadIdx.x-1][threadIdx.y-1][2] - newPoint[2];

float length = sqrt(distance[0] * distance[0] + distance[1] * distance[1] + distance[2] * distance[2]);

force[0] += (distance[0] / length) * (length - normalLengthDiagonal) * stiffness;

            force[1] += (distance[1] / length) * (length - normalLengthDiagonal) * stiffness;

            force[2] += (distance[2] / length) * (length - normalLengthDiagonal) * stiffness;

        }

if(x > 0 && y < _size - 1)

        {

            float distance[3];

            distance[0] = s_Data[threadIdx.x-1][threadIdx.y+1][0] - newPoint[0];

            distance[1] = s_Data[threadIdx.x-1][threadIdx.y+1][1] - newPoint[1];

            distance[2] = s_Data[threadIdx.x-1][threadIdx.y+1][2] - newPoint[2];

float length = sqrt(distance[0] * distance[0] + distance[1] * distance[1] + distance[2] * distance[2]);

force[0] += (distance[0] / length) * (length - normalLengthDiagonal) * stiffness;

            force[1] += (distance[1] / length) * (length - normalLengthDiagonal) * stiffness;

            force[2] += (distance[2] / length) * (length - normalLengthDiagonal) * stiffness;

        }

if(x < _size - 1 && y > 0)

        {

            float distance[3];

            distance[0] = s_Data[threadIdx.x+1][threadIdx.y-1][0] - newPoint[0];

            distance[1] = s_Data[threadIdx.x+1][threadIdx.y-1][1] - newPoint[1];

            distance[2] = s_Data[threadIdx.x+1][threadIdx.y-1][2] - newPoint[2];

float length = sqrt(distance[0] * distance[0] + distance[1] * distance[1] + distance[2] * distance[2]);

force[0] += (distance[0] / length) * (length - normalLengthDiagonal) * stiffness;

            force[1] += (distance[1] / length) * (length - normalLengthDiagonal) * stiffness;

            force[2] += (distance[2] / length) * (length - normalLengthDiagonal) * stiffness;

        }

if(x < _size - 1 && y < _size - 1)

        {

            float distance[3];

            distance[0] = s_Data[threadIdx.x+1][threadIdx.y+1][0] - newPoint[0];

            distance[1] = s_Data[threadIdx.x+1][threadIdx.y+1][1] - newPoint[1];

            distance[2] = s_Data[threadIdx.x+1][threadIdx.y+1][2] - newPoint[2];

float length = sqrt(distance[0] * distance[0] + distance[1] * distance[1] + distance[2] * distance[2]);

force[0] += (distance[0] / length) * (length - normalLengthDiagonal) * stiffness;

            force[1] += (distance[1] / length) * (length - normalLengthDiagonal) * stiffness;

            force[2] += (distance[2] / length) * (length - normalLengthDiagonal) * stiffness;

        }

//derive velocity from the force (v = a*t and a = F / m)

        s_Vel[threadIdx.x][threadIdx.y][0] += dT * force[0] / mass;

        s_Vel[threadIdx.x][threadIdx.y][1] += dT * force[1] / mass;

        s_Vel[threadIdx.x][threadIdx.y][2] += dT * force[2] / mass;

	s_Vel[threadIdx.x][threadIdx.y][0] *= C;

        s_Vel[threadIdx.x][threadIdx.y][1] *= C;

	s_Vel[threadIdx.x][threadIdx.y][2] *= C;

    }

if(newPoint[1]<0)

        newPoint[1]=0;

__syncthreads();

_vertexData[((x * _size) + y)*3 + 0] = newPoint[0];

    _vertexData[((x * _size) + y)*3 + 1] = newPoint[1];

    _vertexData[((x * _size) + y)*3 + 2] = newPoint[2];

_velocities[((x * _size) + y)*3 + 0] = s_Vel[threadIdx.x][threadIdx.y][0];

    _velocities[((x * _size) + y)*3 + 1] = s_Vel[threadIdx.x][threadIdx.y][1];

    _velocities[((x * _size) + y)*3 + 2] = s_Vel[threadIdx.x][threadIdx.y][2];

}

While this version performs better there are two things I want to ask,

  1. Are there anyother optimizations that I may add in this to optimize the code further?

  2. Doing this way, I am restricted by the size of the shared memory. I cannot use this code for any grid size > 8x8 on my device. Could anyone tell me how to use the shared memory such that I may use it with larger grid size.

__shared__ float s_Data[8][8][3];
__shared__ float s_Vel[8][8][3];

unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;

s_Data[threadIdx.x][threadIdx.y][0] = _vertexData[((x * _size) + y)*3];
s_Data[threadIdx.x][threadIdx.y][1] = _vertexData[((x * _size) + y)*3 + 1];
s_Data[threadIdx.x][threadIdx.y][2] = _vertexData[((x * _size) + y)*3 + 2];

s_Vel[threadIdx.x][threadIdx.y][0] = _velocities[( (x * _size) + y)*3];
s_Vel[threadIdx.x][threadIdx.y][1] = _velocities[( (x * _size) + y)*3 + 1];
s_Vel[threadIdx.x][threadIdx.y][2] = _velocities[( (x * _size) + y)*3 + 2];

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

Hi, I think there is bank conflict …
Lets say for some threadId.y=0 we are accessing s_Data[threadIdx.x][threadIdx.y][0]
Then for
threadIdx.x=0 we access location 0 … access to bank 0
threadIdx.x=1 we access location 24 … access to bank 8
threadIdx.x=2 we access location 48 … access to bank 0
threadIdx.x=3 we access location 72 … access to bank 8
threadIdx.x=4 we access location 96 … access to bank 0
and so on …

Try posting the PTX of it, that could be interesting External Image

I wonder if [y][z] leads to more pointer lookups then just [x *… y *… + z]

Hi,

I am using VS 9. The command line to nvcc

c:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.0\bin\nvcc.exe"    -gencode=arch=compute_10,code=\"sm_10,compute_10\"   --machine 32 -ccbin "c:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\bin"    -Xcompiler "/EHsc /W3 /nologo /O2 /Zi   /MT  "  -I"c:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.0\include" -maxrregcount=0 --ptxas-options=-v -ptx -o "Debug/setupCuda.cu.obj" setupCuda.cu

It generates this ptx file.

//setupCUDA.cu.ptx file

	.version 1.4

	.target sm_10, map_f64_to_f32

	// compiled with c:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.0\bin/../open64/lib//be.exe

	// nvopencc 4.0 built on 2011-05-13

	//-----------------------------------------------------------

	// Compiling C:/Users/Mobeen/AppData/Local/Temp/tmpxft_00000414_00000000-11_setupCuda.cpp3.i (C:/Users/Mobeen/AppData/Local/Temp/ccBI#.a03088)

	//-----------------------------------------------------------

	//-----------------------------------------------------------

	// Options:

	//-----------------------------------------------------------

	//  Target:ptx, ISA:sm_10, Endian:little, Pointer Size:32

	//  -O3	(Optimization level)

	//  -g0	(Debug level)

	//  -m2	(Report advisories)

	//-----------------------------------------------------------

	.file	1	"C:/Users/Mobeen/AppData/Local/Temp/tmpxft_00000414_00000000-10_setupCuda.cudafe2.gpu"

	.file	2	"c:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\INCLUDE\crtdefs.h"

	.file	3	"c:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.0\include\crt/device_runtime.h"

	.file	4	"c:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.0\include\host_defines.h"

	.file	5	"c:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.0\include\builtin_types.h"

	.file	6	"c:\program files\nvidia gpu computing toolkit\cuda\v4.0\include\device_types.h"

	.file	7	"c:\program files\nvidia gpu computing toolkit\cuda\v4.0\include\driver_types.h"

	.file	8	"c:\program files\nvidia gpu computing toolkit\cuda\v4.0\include\surface_types.h"

	.file	9	"c:\program files\nvidia gpu computing toolkit\cuda\v4.0\include\texture_types.h"

	.file	10	"c:\program files\nvidia gpu computing toolkit\cuda\v4.0\include\vector_types.h"

	.file	11	"c:\program files\nvidia gpu computing toolkit\cuda\v4.0\include\builtin_types.h"

	.file	12	"c:\program files\nvidia gpu computing toolkit\cuda\v4.0\include\host_defines.h"

	.file	13	"c:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.0\include\device_launch_parameters.h"

	.file	14	"c:\program files\nvidia gpu computing toolkit\cuda\v4.0\include\crt\storage_class.h"

	.file	15	"c:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\INCLUDE\time.h"

	.file	16	"d:/PhD_Stuff/PhD/MyPapers/SIPS2011/codes/Cloth_CUDA2_Optimized_SharedMem/Cloth_CUDA2/setupCuda.cu"

	.file	17	"c:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.0\include\common_functions.h"

	.file	18	"c:\program files\nvidia gpu computing toolkit\cuda\v4.0\include\math_functions.h"

	.file	19	"c:\program files\nvidia gpu computing toolkit\cuda\v4.0\include\math_constants.h"

	.file	20	"c:\program files\nvidia gpu computing toolkit\cuda\v4.0\include\device_functions.h"

	.file	21	"c:\program files\nvidia gpu computing toolkit\cuda\v4.0\include\sm_11_atomic_functions.h"

	.file	22	"c:\program files\nvidia gpu computing toolkit\cuda\v4.0\include\sm_12_atomic_functions.h"

	.file	23	"c:\program files\nvidia gpu computing toolkit\cuda\v4.0\include\sm_13_double_functions.h"

	.file	24	"c:\program files\nvidia gpu computing toolkit\cuda\v4.0\include\sm_20_atomic_functions.h"

	.file	25	"c:\program files\nvidia gpu computing toolkit\cuda\v4.0\include\sm_20_intrinsics.h"

	.file	26	"c:\program files\nvidia gpu computing toolkit\cuda\v4.0\include\surface_functions.h"

	.file	27	"c:\program files\nvidia gpu computing toolkit\cuda\v4.0\include\texture_fetch_functions.h"

	.file	28	"c:\program files\nvidia gpu computing toolkit\cuda\v4.0\include\math_functions_dbl_ptx1.h"

	.const .f32 normalLength = 0f3f800000 /* 1 */;

	.const .f32 normalLengthDiagonal = 0f3fb504f3 /* 1.41421 */;

	.const .f32 C = 0f3f59999a /* 0.85 */;

	.entry _Z11updatePointfiffPfS_ (

		.param .f32 __cudaparm__Z11updatePointfiffPfS__dT,

		.param .s32 __cudaparm__Z11updatePointfiffPfS___size,

		.param .f32 __cudaparm__Z11updatePointfiffPfS__stiffness,

		.param .f32 __cudaparm__Z11updatePointfiffPfS__mass,

		.param .u32 __cudaparm__Z11updatePointfiffPfS___vertexData,

		.param .u32 __cudaparm__Z11updatePointfiffPfS___velocities)

	{

	.reg .u16 %rh<6>;

	.reg .u32 %r<41>;

	.reg .f32 %f<208>;

	.reg .pred %p<16>;

	.shared .align 4 .b8 __cuda___cuda_local_var_103770_31_non_const_s_Data24[768];

	.shared .align 4 .b8 __cuda___cuda_local_var_103771_31_non_const_s_Vel792[768];

	.loc	16	65	0

$LDWbegin__Z11updatePointfiffPfS_:

	.loc	16	77	0

	mov.u16 	%rh1, %ctaid.y;

	mov.u16 	%rh2, %ntid.y;

	mul.wide.u16 	%r1, %rh1, %rh2;

	mov.u16 	%rh3, %ctaid.x;

	mov.u16 	%rh4, %ntid.x;

	mul.wide.u16 	%r2, %rh3, %rh4;

	cvt.u32.u16 	%r3, %tid.x;

	mul24.lo.u32 	%r4, %r3, 24;

	cvt.u32.u16 	%r5, %tid.y;

	mul24.lo.u32 	%r6, %r5, 3;

	add.u32 	%r7, %r1, %r5;

	add.u32 	%r8, %r2, %r3;

	add.u32 	%r9, %r4, %r6;

	ld.param.s32 	%r10, [__cudaparm__Z11updatePointfiffPfS___size];

	mul.lo.u32 	%r11, %r10, %r8;

	mul.lo.u32 	%r12, %r9, 4;

	add.u32 	%r13, %r7, %r11;

	mov.u32 	%r14, __cuda___cuda_local_var_103770_31_non_const_s_Data24;

	add.u32 	%r15, %r12, %r14;

	mul.lo.u32 	%r16, %r13, 12;

	ld.param.u32 	%r17, [__cudaparm__Z11updatePointfiffPfS___vertexData];

	add.u32 	%r18, %r17, %r16;

	ld.global.f32 	%f1, [%r18+0];

	st.shared.f32 	[%r15+0], %f1;

	.loc	16	78	0

	ld.global.f32 	%f2, [%r18+4];

	st.shared.f32 	[%r15+4], %f2;

	.loc	16	79	0

	ld.global.f32 	%f3, [%r18+8];

	st.shared.f32 	[%r15+8], %f3;

	.loc	16	81	0

	mov.u32 	%r19, __cuda___cuda_local_var_103771_31_non_const_s_Vel792;

	add.u32 	%r20, %r12, %r19;

	ld.param.u32 	%r21, [__cudaparm__Z11updatePointfiffPfS___velocities];

	add.u32 	%r22, %r21, %r16;

	ld.global.f32 	%f4, [%r22+0];

	st.shared.f32 	[%r20+0], %f4;

	.loc	16	82	0

	ld.global.f32 	%f5, [%r22+4];

	st.shared.f32 	[%r20+4], %f5;

	.loc	16	83	0

	ld.global.f32 	%f6, [%r22+8];

	st.shared.f32 	[%r20+8], %f6;

	.loc	16	84	0

	bar.sync 	0;

	.loc	16	65	0

	mov.u32 	%r23, 0;

	setp.eq.u32 	%p1, %r8, %r23;

	@%p1 bra 	$Lt_0_15874;

	sub.u32 	%r24, %r10, 1;

	setp.eq.u32 	%p2, %r8, %r24;

	@%p2 bra 	$Lt_0_15874;

	mov.u32 	%r25, 0;

	setp.eq.u32 	%p3, %r7, %r25;

	@%p3 bra 	$Lt_0_15874;

	setp.ne.u32 	%p4, %r7, %r24;

	@%p4 bra 	$L_0_9986;

$Lt_0_15874:

$L_0_10242:

	.loc	16	90	0

	ld.shared.f32 	%f7, [%r15+0];

	.loc	16	91	0

	ld.shared.f32 	%f8, [%r15+4];

	.loc	16	92	0

	ld.shared.f32 	%f9, [%r15+8];

	bra.uni 	$L_0_9730;

$L_0_9986:

	.loc	16	94	0

	ld.param.f32 	%f10, [__cudaparm__Z11updatePointfiffPfS__dT];

	ld.shared.f32 	%f11, [%r15+0];

	ld.shared.f32 	%f12, [%r20+0];

	mad.f32 	%f13, %f12, %f10, %f11;

	st.shared.f32 	[%r15+0], %f13;

	mov.f32 	%f7, %f13;

	.loc	16	95	0

	ld.shared.f32 	%f14, [%r15+4];

	ld.shared.f32 	%f15, [%r20+4];

	mad.f32 	%f16, %f15, %f10, %f14;

	st.shared.f32 	[%r15+4], %f16;

	mov.f32 	%f8, %f16;

	.loc	16	96	0

	ld.shared.f32 	%f17, [%r15+8];

	ld.shared.f32 	%f18, [%r20+8];

	mad.f32 	%f19, %f18, %f10, %f17;

	st.shared.f32 	[%r15+8], %f19;

	mov.f32 	%f9, %f19;

$L_0_9730:

	.loc	16	99	0

	bar.sync 	0;

	.loc	16	101	0

	mul.f32 	%f20, %f8, %f8;

	mad.f32 	%f21, %f7, %f7, %f20;

	mad.f32 	%f22, %f9, %f9, %f21;

	sqrt.approx.f32 	%f23, %f22;

	.loc	16	103	0

	mov.f32 	%f24, 0f00000000;    	// 0

	mov.f32 	%f25, 0fbca3d70a;    	// -0.02

	mov.f32 	%f26, 0f00000000;    	// 0

	mov.f32 	%f27, 0f40a1999a;    	// 5.05

	setp.lt.f32 	%p5, %f23, %f27;

	@!%p5 bra 	$Lt_0_11266;

	.loc	16	108	0

	div.full.f32 	%f28, %f7, %f23;

	mov.f32 	%f29, 0f40a1999a;    	// 5.05

	mul.f32 	%f7, %f28, %f29;

	.loc	16	109	0

	div.full.f32 	%f30, %f8, %f23;

	mov.f32 	%f31, 0f40a1999a;    	// 5.05

	mul.f32 	%f8, %f30, %f31;

	.loc	16	110	0

	div.full.f32 	%f32, %f9, %f23;

	mov.f32 	%f33, 0f40a1999a;    	// 5.05

	mul.f32 	%f9, %f32, %f33;

	.loc	16	112	0

	mov.f32 	%f34, 0f00000000;    	// 0

	st.shared.f32 	[%r20+0], %f34;

	.loc	16	113	0

	mov.f32 	%f35, 0f00000000;    	// 0

	st.shared.f32 	[%r20+4], %f35;

	.loc	16	114	0

	mov.f32 	%f36, 0f00000000;    	// 0

	st.shared.f32 	[%r20+8], %f36;

	bra.uni 	$Lt_0_11010;

$Lt_0_11266:

	sub.u32 	%r24, %r10, 1;

	setp.lt.u32 	%p6, %r8, %r24;

	@!%p6 bra 	$Lt_0_11522;

	.loc	16	125	0

	ld.shared.f32 	%f37, [%r15+96];

	sub.f32 	%f38, %f37, %f7;

	.loc	16	126	0

	ld.shared.f32 	%f39, [%r15+100];

	sub.f32 	%f40, %f39, %f8;

	.loc	16	127	0

	ld.shared.f32 	%f41, [%r15+104];

	sub.f32 	%f42, %f41, %f9;

	.loc	16	131	0

	mul.f32 	%f43, %f40, %f40;

	mad.f32 	%f44, %f38, %f38, %f43;

	mad.f32 	%f45, %f42, %f42, %f44;

	sqrt.approx.f32 	%f46, %f45;

	ld.const.f32 	%f47, [normalLength];

	sub.f32 	%f48, %f46, %f47;

	ld.param.f32 	%f49, [__cudaparm__Z11updatePointfiffPfS__stiffness];

	div.full.f32 	%f50, %f38, %f46;

	mul.f32 	%f51, %f48, %f50;

	mul.f32 	%f24, %f49, %f51;

	.loc	16	132	0

	mov.f32 	%f52, 0fbca3d70a;    	// -0.02

	div.full.f32 	%f53, %f40, %f46;

	mul.f32 	%f54, %f48, %f53;

	mad.f32 	%f25, %f49, %f54, %f52;

	.loc	16	133	0

	div.full.f32 	%f55, %f42, %f46;

	mul.f32 	%f56, %f48, %f55;

	mul.f32 	%f26, %f49, %f56;

$Lt_0_11522:

	setp.lt.u32 	%p7, %r7, %r24;

	@!%p7 bra 	$Lt_0_12034;

	.loc	16	139	0

	ld.shared.f32 	%f57, [%r15+12];

	sub.f32 	%f58, %f57, %f7;

	.loc	16	140	0

	ld.shared.f32 	%f59, [%r15+16];

	sub.f32 	%f60, %f59, %f8;

	.loc	16	141	0

	ld.shared.f32 	%f61, [%r15+20];

	sub.f32 	%f62, %f61, %f9;

	.loc	16	145	0

	mul.f32 	%f63, %f60, %f60;

	mad.f32 	%f64, %f58, %f58, %f63;

	mad.f32 	%f65, %f62, %f62, %f64;

	sqrt.approx.f32 	%f66, %f65;

	ld.const.f32 	%f67, [normalLength];

	sub.f32 	%f68, %f66, %f67;

	ld.param.f32 	%f49, [__cudaparm__Z11updatePointfiffPfS__stiffness];

	div.full.f32 	%f69, %f58, %f66;

	mul.f32 	%f70, %f68, %f69;

	mad.f32 	%f24, %f49, %f70, %f24;

	.loc	16	146	0

	div.full.f32 	%f71, %f60, %f66;

	mul.f32 	%f72, %f68, %f71;

	mad.f32 	%f25, %f49, %f72, %f25;

	.loc	16	147	0

	div.full.f32 	%f73, %f62, %f66;

	mul.f32 	%f74, %f68, %f73;

	mad.f32 	%f26, %f49, %f74, %f26;

$Lt_0_12034:

	mov.u32 	%r26, 0;

	setp.ne.u32 	%p8, %r8, %r26;

	@!%p8 bra 	$Lt_0_12546;

	.loc	16	153	0

	ld.shared.f32 	%f75, [%r15+-96];

	sub.f32 	%f76, %f75, %f7;

	.loc	16	154	0

	ld.shared.f32 	%f77, [%r15+-92];

	sub.f32 	%f78, %f77, %f8;

	.loc	16	155	0

	ld.shared.f32 	%f79, [%r15+-88];

	sub.f32 	%f80, %f79, %f9;

	.loc	16	159	0

	mul.f32 	%f81, %f78, %f78;

	mad.f32 	%f82, %f76, %f76, %f81;

	mad.f32 	%f83, %f80, %f80, %f82;

	sqrt.approx.f32 	%f84, %f83;

	ld.const.f32 	%f85, [normalLength];

	sub.f32 	%f86, %f84, %f85;

	ld.param.f32 	%f49, [__cudaparm__Z11updatePointfiffPfS__stiffness];

	div.full.f32 	%f87, %f76, %f84;

	mul.f32 	%f88, %f86, %f87;

	mad.f32 	%f24, %f49, %f88, %f24;

	.loc	16	160	0

	div.full.f32 	%f89, %f78, %f84;

	mul.f32 	%f90, %f86, %f89;

	mad.f32 	%f25, %f49, %f90, %f25;

	.loc	16	161	0

	div.full.f32 	%f91, %f80, %f84;

	mul.f32 	%f92, %f86, %f91;

	mad.f32 	%f26, %f49, %f92, %f26;

$Lt_0_12546:

	mov.u32 	%r27, 0;

	setp.ne.u32 	%p9, %r7, %r27;

	@!%p9 bra 	$Lt_0_13058;

	.loc	16	167	0

	ld.shared.f32 	%f93, [%r15+-12];

	sub.f32 	%f94, %f93, %f7;

	.loc	16	168	0

	ld.shared.f32 	%f95, [%r15+-8];

	sub.f32 	%f96, %f95, %f8;

	.loc	16	169	0

	ld.shared.f32 	%f97, [%r15+-4];

	sub.f32 	%f98, %f97, %f9;

	.loc	16	173	0

	mul.f32 	%f99, %f96, %f96;

	mad.f32 	%f100, %f94, %f94, %f99;

	mad.f32 	%f101, %f98, %f98, %f100;

	sqrt.approx.f32 	%f102, %f101;

	ld.const.f32 	%f103, [normalLength];

	sub.f32 	%f104, %f102, %f103;

	ld.param.f32 	%f49, [__cudaparm__Z11updatePointfiffPfS__stiffness];

	div.full.f32 	%f105, %f94, %f102;

	mul.f32 	%f106, %f104, %f105;

	mad.f32 	%f24, %f49, %f106, %f24;

	.loc	16	174	0

	div.full.f32 	%f107, %f96, %f102;

	mul.f32 	%f108, %f104, %f107;

	mad.f32 	%f25, %f49, %f108, %f25;

	.loc	16	175	0

	div.full.f32 	%f109, %f98, %f102;

	mul.f32 	%f110, %f104, %f109;

	mad.f32 	%f26, %f49, %f110, %f26;

$Lt_0_13058:

	selp.s32 	%r28, 1, 0, %p9;

	selp.s32 	%r29, 1, 0, %p8;

	and.b32 	%r30, %r28, %r29;

	mov.u32 	%r31, 0;

	setp.eq.s32 	%p10, %r30, %r31;

	@%p10 bra 	$Lt_0_13570;

	.loc	16	181	0

	ld.shared.f32 	%f111, [%r15+-108];

	sub.f32 	%f112, %f111, %f7;

	.loc	16	182	0

	ld.shared.f32 	%f113, [%r15+-104];

	sub.f32 	%f114, %f113, %f8;

	.loc	16	183	0

	ld.shared.f32 	%f115, [%r15+-100];

	sub.f32 	%f116, %f115, %f9;

	.loc	16	187	0

	mul.f32 	%f117, %f114, %f114;

	mad.f32 	%f118, %f112, %f112, %f117;

	mad.f32 	%f119, %f116, %f116, %f118;

	sqrt.approx.f32 	%f120, %f119;

	ld.const.f32 	%f121, [normalLengthDiagonal];

	sub.f32 	%f122, %f120, %f121;

	ld.param.f32 	%f49, [__cudaparm__Z11updatePointfiffPfS__stiffness];

	div.full.f32 	%f123, %f112, %f120;

	mul.f32 	%f124, %f122, %f123;

	mad.f32 	%f24, %f49, %f124, %f24;

	.loc	16	188	0

	div.full.f32 	%f125, %f114, %f120;

	mul.f32 	%f126, %f122, %f125;

	mad.f32 	%f25, %f49, %f126, %f25;

	.loc	16	189	0

	div.full.f32 	%f127, %f116, %f120;

	mul.f32 	%f128, %f122, %f127;

	mad.f32 	%f26, %f49, %f128, %f26;

$Lt_0_13570:

	selp.s32 	%r32, 1, 0, %p7;

	and.b32 	%r33, %r32, %r29;

	mov.u32 	%r34, 0;

	setp.eq.s32 	%p11, %r33, %r34;

	@%p11 bra 	$Lt_0_14082;

	.loc	16	195	0

	ld.shared.f32 	%f129, [%r15+-84];

	sub.f32 	%f130, %f129, %f7;

	.loc	16	196	0

	ld.shared.f32 	%f131, [%r15+-80];

	sub.f32 	%f132, %f131, %f8;

	.loc	16	197	0

	ld.shared.f32 	%f133, [%r15+-76];

	sub.f32 	%f134, %f133, %f9;

	.loc	16	201	0

	mul.f32 	%f135, %f132, %f132;

	mad.f32 	%f136, %f130, %f130, %f135;

	mad.f32 	%f137, %f134, %f134, %f136;

	sqrt.approx.f32 	%f138, %f137;

	ld.const.f32 	%f139, [normalLengthDiagonal];

	sub.f32 	%f140, %f138, %f139;

	ld.param.f32 	%f49, [__cudaparm__Z11updatePointfiffPfS__stiffness];

	div.full.f32 	%f141, %f130, %f138;

	mul.f32 	%f142, %f140, %f141;

	mad.f32 	%f24, %f49, %f142, %f24;

	.loc	16	202	0

	div.full.f32 	%f143, %f132, %f138;

	mul.f32 	%f144, %f140, %f143;

	mad.f32 	%f25, %f49, %f144, %f25;

	.loc	16	203	0

	div.full.f32 	%f145, %f134, %f138;

	mul.f32 	%f146, %f140, %f145;

	mad.f32 	%f26, %f49, %f146, %f26;

$Lt_0_14082:

	selp.s32 	%r35, 1, 0, %p6;

	and.b32 	%r36, %r28, %r35;

	mov.u32 	%r37, 0;

	setp.eq.s32 	%p12, %r36, %r37;

	@%p12 bra 	$Lt_0_14594;

	.loc	16	209	0

	ld.shared.f32 	%f147, [%r15+84];

	sub.f32 	%f148, %f147, %f7;

	.loc	16	210	0

	ld.shared.f32 	%f149, [%r15+88];

	sub.f32 	%f150, %f149, %f8;

	.loc	16	211	0

	ld.shared.f32 	%f151, [%r15+92];

	sub.f32 	%f152, %f151, %f9;

	.loc	16	215	0

	mul.f32 	%f153, %f150, %f150;

	mad.f32 	%f154, %f148, %f148, %f153;

	mad.f32 	%f155, %f152, %f152, %f154;

	sqrt.approx.f32 	%f156, %f155;

	ld.const.f32 	%f157, [normalLengthDiagonal];

	sub.f32 	%f158, %f156, %f157;

	ld.param.f32 	%f49, [__cudaparm__Z11updatePointfiffPfS__stiffness];

	div.full.f32 	%f159, %f148, %f156;

	mul.f32 	%f160, %f158, %f159;

	mad.f32 	%f24, %f49, %f160, %f24;

	.loc	16	216	0

	div.full.f32 	%f161, %f150, %f156;

	mul.f32 	%f162, %f158, %f161;

	mad.f32 	%f25, %f49, %f162, %f25;

	.loc	16	217	0

	div.full.f32 	%f163, %f152, %f156;

	mul.f32 	%f164, %f158, %f163;

	mad.f32 	%f26, %f49, %f164, %f26;

$Lt_0_14594:

	and.b32 	%r38, %r32, %r35;

	mov.u32 	%r39, 0;

	setp.eq.s32 	%p13, %r38, %r39;

	@%p13 bra 	$Lt_0_15106;

	.loc	16	223	0

	ld.shared.f32 	%f165, [%r15+108];

	sub.f32 	%f166, %f165, %f7;

	.loc	16	224	0

	ld.shared.f32 	%f167, [%r15+112];

	sub.f32 	%f168, %f167, %f8;

	.loc	16	225	0

	ld.shared.f32 	%f169, [%r15+116];

	sub.f32 	%f170, %f169, %f9;

	.loc	16	229	0

	mul.f32 	%f171, %f168, %f168;

	mad.f32 	%f172, %f166, %f166, %f171;

	mad.f32 	%f173, %f170, %f170, %f172;

	sqrt.approx.f32 	%f174, %f173;

	ld.const.f32 	%f175, [normalLengthDiagonal];

	sub.f32 	%f176, %f174, %f175;

	ld.param.f32 	%f49, [__cudaparm__Z11updatePointfiffPfS__stiffness];

	div.full.f32 	%f177, %f166, %f174;

	mul.f32 	%f178, %f176, %f177;

	mad.f32 	%f24, %f49, %f178, %f24;

	.loc	16	230	0

	div.full.f32 	%f179, %f168, %f174;

	mul.f32 	%f180, %f176, %f179;

	mad.f32 	%f25, %f49, %f180, %f25;

	.loc	16	231	0

	div.full.f32 	%f181, %f170, %f174;

	mul.f32 	%f182, %f176, %f181;

	mad.f32 	%f26, %f49, %f182, %f26;

$Lt_0_15106:

	.loc	16	235	0

	ld.param.f32 	%f183, [__cudaparm__Z11updatePointfiffPfS__mass];

	ld.param.f32 	%f184, [__cudaparm__Z11updatePointfiffPfS__dT];

	ld.shared.f32 	%f185, [%r20+0];

	mul.f32 	%f186, %f184, %f24;

	div.full.f32 	%f187, %f186, %f183;

	add.f32 	%f188, %f185, %f187;

	st.shared.f32 	[%r20+0], %f188;

	.loc	16	236	0

	ld.shared.f32 	%f189, [%r20+4];

	mul.f32 	%f190, %f184, %f25;

	div.full.f32 	%f191, %f190, %f183;

	add.f32 	%f192, %f189, %f191;

	st.shared.f32 	[%r20+4], %f192;

	.loc	16	237	0

	ld.shared.f32 	%f193, [%r20+8];

	mul.f32 	%f194, %f184, %f26;

	div.full.f32 	%f195, %f194, %f183;

	add.f32 	%f196, %f193, %f195;

	st.shared.f32 	[%r20+8], %f196;

	.loc	16	239	0

	ld.const.f32 	%f197, [C];

	mul.f32 	%f198, %f188, %f197;

	st.shared.f32 	[%r20+0], %f198;

	.loc	16	240	0

	mul.f32 	%f199, %f192, %f197;

	st.shared.f32 	[%r20+4], %f199;

	.loc	16	241	0

	mul.f32 	%f200, %f196, %f197;

	st.shared.f32 	[%r20+8], %f200;

$Lt_0_11010:

	.loc	16	248	0

	bar.sync 	0;

	.loc	16	250	0

	st.global.f32 	[%r18+0], %f7;

	.loc	16	251	0

	mov.f32 	%f201, 0f00000000;   	// 0

	mov.f32 	%f202, 0f00000000;   	// 0

	setp.lt.f32 	%p14, %f8, %f202;

	selp.f32 	%f203, %f201, %f8, %p14;

	st.global.f32 	[%r18+4], %f203;

	.loc	16	252	0

	st.global.f32 	[%r18+8], %f9;

	.loc	16	254	0

	ld.shared.f32 	%f204, [%r20+0];

	st.global.f32 	[%r22+0], %f204;

	.loc	16	255	0

	ld.shared.f32 	%f205, [%r20+4];

	st.global.f32 	[%r22+4], %f205;

	.loc	16	256	0

	ld.shared.f32 	%f206, [%r20+8];

	st.global.f32 	[%r22+8], %f206;

	.loc	16	259	0

	exit;

$LDWend__Z11updatePointfiffPfS_:

	} // _Z11updatePointfiffPfS_