Using pthread_create to create 4 CPU threads that execute GPU code, the time is nearly the

When I use CUDA, I find that my GPU code uses very little memory, only about 200 MB, while the Tesla K20 has 5 GB. To make better use of that memory, I used pthread_create to create 4 CPU threads that each call the GPU code. I expected the total time to drop a lot, but after the change the run time is nearly four times the execution time of a single GPU run. What is the cause, and how can I fix it? Thank you!

Have you already read this: http://docs.nvidia.com/cuda/cuda-c-programming-guide/#performance-guidelines ?
In short, memory usage is only one limit, so your program may be limited more by memory (RAM) speed or ALU speed. Learn to profile so you can see what actually limits your code. Also look at CUDA streams, although they are probably not what you need right now.
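In case it helps, here is a rough sketch of what issuing the work on 4 CUDA streams from a single CPU thread could look like, with a host timer around it. It is not your code: the `scale` kernel, the buffer size, and the `CHECK` macro are all placeholders I made up for illustration. If one kernel launch already keeps the whole GPU busy, expect the 4 streams to take about 4 times as long as one, which matches what you are seeing with pthreads.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical error-check helper (not part of the CUDA API).
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err_)); \
            return 1;                                                 \
        }                                                             \
    } while (0)

// Placeholder kernel standing in for the real GPU code.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int kStreams = 4;          // assumed: one stream per former CPU thread
    const int n = 1 << 20;           // assumed problem size per stream
    const size_t bytes = n * sizeof(float);

    float *host[kStreams], *dev[kStreams];
    cudaStream_t streams[kStreams];
    for (int s = 0; s < kStreams; ++s) {
        // Pinned host memory is required for async copies to overlap with kernels.
        CHECK(cudaMallocHost(&host[s], bytes));
        CHECK(cudaMalloc(&dev[s], bytes));
        CHECK(cudaStreamCreate(&streams[s]));
        for (int i = 0; i < n; ++i) host[s][i] = 1.0f;
    }

    auto t0 = std::chrono::steady_clock::now();

    // Queue copy-in, kernel, and copy-out on each stream without waiting.
    for (int s = 0; s < kStreams; ++s) {
        CHECK(cudaMemcpyAsync(dev[s], host[s], bytes,
                              cudaMemcpyHostToDevice, streams[s]));
        scale<<<(n + 255) / 256, 256, 0, streams[s]>>>(dev[s], n, 2.0f);
        CHECK(cudaGetLastError());
        CHECK(cudaMemcpyAsync(host[s], dev[s], bytes,
                              cudaMemcpyDeviceToHost, streams[s]));
    }

    // Wait for all streams to finish before stopping the timer.
    CHECK(cudaDeviceSynchronize());
    auto t1 = std::chrono::steady_clock::now();

    double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    printf("%d streams took %.3f ms, first element = %.1f\n",
           kStreams, ms, host[0][0]);

    for (int s = 0; s < kStreams; ++s) {
        CHECK(cudaStreamDestroy(streams[s]));
        CHECK(cudaFree(dev[s]));
        CHECK(cudaFreeHost(host[s]));
    }
    return 0;
}
```

Compare the printed time with `kStreams` set to 1. If the time scales almost linearly with the number of streams, the GPU is already saturated by a single launch, and the free memory on the card is not the bottleneck.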