The programmer's manual says that memory returned by the host-side cudaMalloc() is automatically aligned to at least 256 bytes. This could waste memory when the requested allocation is small.
"... the driver or runtime API is always aligned to at least 256 bytes" (CUDA 4.0 Manual, page 95)
But pointers returned by kernel-side malloc() calls are only guaranteed to be aligned to 16 bytes.
"The returned pointer is guaranteed to be aligned to a 16-byte boundary" (CUDA 4.0 Manual, page 126, Dynamic Global Memory Allocation)
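To make the possible waste concrete, here is a rough host-only sketch of the arithmetic. It assumes, purely for illustration, that every allocation occupies a whole number of alignment-sized units. The manual only guarantees the alignment of the returned address, so the real allocators may behave differently. The helper names round_up, padding and is_aligned are hypothetical and not part of the CUDA API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Round `size` up to the next multiple of `alignment`.
// `alignment` is expected to be a power of two (e.g. 16 or 256).
constexpr std::size_t round_up(std::size_t size, std::size_t alignment) {
    return (size + alignment - 1) & ~(alignment - 1);
}

// Bytes of padding IF the allocator handed out whole alignment-sized units.
// ASSUMPTION: this granularity is illustrative, not documented behaviour.
constexpr std::size_t padding(std::size_t size, std::size_t alignment) {
    return round_up(size, alignment) - size;
}

// Check whether a pointer value sits on an `alignment`-byte boundary.
inline bool is_aligned(const void* p, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}
```

Under that assumption, a 20-byte request would carry padding(20, 256) = 236 bytes of padding with 256-byte alignment, but only padding(20, 16) = 12 bytes with 16-byte alignment. is_aligned() can be applied to the pointer values returned by either allocator to check the documented guarantees.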
Does anyone know why there is this difference?