Basically, it comes down to the cost of objects. Each float is an object, and the math makes two floats for each float, one of which is an intermediate float that will be garbage collected. Remember that we have 10 million of them in the final example. Run inside a very tight loop, that means lots of new memory allocations, lots of objects being pushed into old space, lots of garbage generated, and lots of object table entries. In short, lots and lots of wasted time. The VM doesn't really stand a chance as you scale up.
But as we can see, even for small numbers of floats we still get an advantage from using the GPU, which is actually a little surprising given how many spare cycles we have to waste on modern CPUs :)
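To make the "cost of objects" point a bit more concrete, here is a rough sketch in Python. This is only an analogy and not the code from this post: it assumes CPython, where every float is also a heap object, and the helper name `footprint_bytes` is made up for illustration. It compares the approximate storage for floats held as individual objects against the same values packed as raw 8-byte doubles.

```python
import struct
import sys
from array import array


def footprint_bytes(n):
    """Approximate storage for n floats: boxed objects vs. packed doubles.

    Hypothetical helper for illustration; the sizes are CPython-specific.
    """
    # Each element of this list is its own heap-allocated float object.
    values = [i * 0.5 for i in range(n)]

    # Boxed cost: every float object, plus one pointer-sized list slot each.
    pointer_size = struct.calcsize("P")
    boxed = sum(sys.getsizeof(v) for v in values) + n * pointer_size

    # Packed cost: array('d') stores raw doubles with no per-element object.
    packed_values = array("d", values)
    packed = packed_values.itemsize * len(packed_values)

    return boxed, packed


boxed, packed = footprint_bytes(100_000)
print(f"boxed: {boxed} bytes, packed: {packed} bytes")
```

The exact numbers depend on the runtime, but the boxed layout should come out several times larger, and that is before counting the intermediate objects created during the math itself. A flat buffer of raw doubles is much closer to what the GPU works with.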