Data transferring from device to host taking too much time

My code is something like this:

struct objectType { char* str1; char* str2; }

cudaMallocManaged(&o, sizeof(objectType) * n)

for (int i = 0; i < n; ++i) { // use cudaMallocManaged to copy data }

if (useGPU) compute_on_gpu(objectType* o, ….) else compute_on_cpu(objectType* o, ….)

function1(objectType* o, ….) // on host

when computing on GPU, ‘function1’ takes a longer time to execute (around 2 seconds) compared to when computing on CPU (around 0.01 seconds). What could be a work around for this? I guess this is the time it takes to transfer back data from GPU to CPU but I’m just a beginner so I’m not quite sure how to handle this.

Note: I am passing ‘o’ to CPU just for a fair comparison even tho it is not required to be accessible from GPU due to the cudaMallocManaged call.