Problem with programming guide example async memory copying

I’m having problems getting a piece of code running correctly. As it is mostly just taken from the programming guide I figured I’m making some stupid little mistake but I can’t see it. The code is as follows:

#include <stdio.h>

__global__ void myKernel(int *output, int *input) {

   int idx = blockIdx.x * blockDim.x + threadIdx.x;

    int temp = input[idx] * 2 + 7;

    output[idx] = 2 * temp - 4;


int main() {

   int size = 5120 * sizeof(int);

   int *hostPtr, *inputDevPtr, *outputDevPtr;

    cudaMallocHost((void**)&hostPtr, 2 * size);

    cudaMalloc((void**)&inputDevPtr, 2 * size);

    cudaMalloc((void**)&outputDevPtr, 2 * size);

   for (int i = 0; i < 2*5120; i++)

        hostPtr[i] = 5;

   cudaStream_t stream[2];

    for (int i = 0; i < 2; ++i)


   for (int i = 0; i < 2; ++i)

        cudaMemcpyAsync(inputDevPtr + i * size, hostPtr + i * size,

                    size, cudaMemcpyHostToDevice, stream[i]);

    for (int i = 0; i < 2; ++i)

        myKernel<<<10, 512, 0, stream[i]>>>

                    (outputDevPtr + i * size, inputDevPtr + i * size);

    for (int i = 0; i < 2; ++i)

        cudaMemcpyAsync(hostPtr + i * size, outputDevPtr + i * size,

                    size, cudaMemcpyDeviceToHost, stream[i]);


   printf("First block: %d\nSecond block: %d\n",hostPtr[0],hostPtr[5120]);




   return 0;


The first result is fine (30) but the second result is 5. It seems the second update is not running.

I’ve got this working now. I think there is a problem with the addressing that leads to bad access of memory and kernel launch failure. I switched the size relative addressing to array index addressing and now it works.

I know you have a solution, but it’s curious why your first version would fail. When in doubt, always check your cuda call error return values.

As a guess with your original code, it may be some alignment issue with the second memcpy since that’s at a start memory location that’s been derived.

cudaMemcpy doesn’t say much about alignment requirements, just “don’t overlap reads and writes”. You might see if your original worked if you offset your second array to be aligned to something like 1024 or even 16384 bytes. Again, the docs don’t say so, but it feels like it could be the problem.

memcpyasync is very picky, it needs the page-locked memory or it will fail… perhaps it also depends on STARTING at the beginning of a page.

The failure is in the second memcpy. In device emu the program seg faults there, on the card there is an ‘unspecified kernel failure’.

Your problem is pointer arithmetic. Your device pointer is of type int*, yet you are adding a byte offset to it. You should be adding a word offset to it, or use a char* pointer.

An added question: is it possible for multiple kernels to be running at the same time e.g. in my code above can both ‘myKernel’ instances be running together?

On current hardware, no. The hardware can only process one kernel at a time. NVIDIA has left open the possibility for future hardware to process multiple kernels concurrently, in which case kernels in different streams will be able to overlap.

Adding 1 to a type* adds sizeof(type) to the address.