Running multiple instances of a program on a GPU

When I run several processes in parallel on one GPU card (different instances of the same program), the speed drops. Is that always the case? Should I handle multi-threading inside my code, or is there a way to run separate processes in parallel?
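
To make "the speed drops" measurable, here is a minimal timing sketch that starts N copies of a command at the same moment and reports the wall-clock time until all of them finish. The helper name `time_instances` and the command in `cmd` are placeholders of mine, not part of my actual program; the placeholder workload is CPU-only and would need to be replaced with the real GPU executable.

```python
import subprocess
import sys
import time


def time_instances(cmd, n):
    """Start n copies of cmd at once; return wall-clock seconds until all exit."""
    start = time.perf_counter()
    procs = [subprocess.Popen(cmd) for _ in range(n)]
    codes = [p.wait() for p in procs]
    elapsed = time.perf_counter() - start
    if any(codes):
        raise RuntimeError(f"an instance failed, exit codes: {codes}")
    return elapsed


if __name__ == "__main__":
    # Placeholder workload (assumption): swap in the real GPU program here,
    # for example ["./my_gpu_program", "input.dat"].
    cmd = [sys.executable, "-c", "sum(range(10**6))"]
    for n in (1, 2, 4):
        t = time_instances(cmd, n)
        print(f"{n} instance(s): {t:.2f} s total, {t / n:.2f} s per instance")
```

If the total time for n instances grows roughly n-fold compared with a single instance, the instances are effectively taking turns on the device; if it grows by less than that, at least some of their work is overlapping.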