I don't understand the code in the documentation of the cudaMallocPitch() function:

// Host code 
int width = 64, height = 64; 
float* devPtr; 
size_t pitch; 
cudaMallocPitch(&devPtr, &pitch, 
                width * sizeof(float), height); 
MyKernel<<<100, 512>>>(devPtr, pitch, width, height);

// Device code
__global__ void MyKernel(float* devPtr,
                         size_t pitch, int width, int height)
{
    for (int r = 0; r < height; ++r) {
        float* row = (float*)((char*)devPtr + r * pitch);
        for (int c = 0; c < width; ++c) {
            float element = row[c];
        }
    }
}

I don't understand: we have already moved the work onto the GPU, so why use a loop inside the kernel function?
Can anyone help me? Thank you very much!

You don't have to use a loop; the loop is not the important part of that example. The important part is this line:

float* row = (float*)((char*)devPtr + r * pitch);

which demonstrates how to get a pointer to the first element of a row in a pitched allocation.

If you review the runtime API documentation for cudaMallocPitch:

http://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1g32bd7a39135594788a542ae72217775c

you will see that only that line of code is given as an example.
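The loop is just the simplest way for the documentation to touch every element from a single toy launch. In real code the usual pattern is one thread per element, with each thread computing its own row and column from the grid and applying the pitched-address arithmetic once. A minimal sketch (the kernel name, the doubling operation, and the launch dimensions are my own, not from the documentation):

```cuda
// One thread per element: no loop needed inside the kernel.
__global__ void ScaleKernel(float* devPtr, size_t pitch,
                            int width, int height)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    int r = blockIdx.y * blockDim.y + threadIdx.y;
    if (r < height && c < width) {
        float* row = (float*)((char*)devPtr + r * pitch);
        row[c] *= 2.0f;  // example operation: double this thread's element
    }
}

// Host-side launch with a 2D grid covering the whole image:
// dim3 block(16, 16);
// dim3 grid((width + block.x - 1) / block.x,
//           (height + block.y - 1) / block.y);
// ScaleKernel<<<grid, block>>>(devPtr, pitch, width, height);
```

The bounds check (`r < height && c < width`) matters because the grid is rounded up and some threads fall outside the image.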

Thank you very much!