This can be nice if user has memory for a the temporary buffer and doesnt want the sorting to allocate or be able to fail. Also counts are now stack allocated, there isn't really any reason to allocate them on the heap as 256x 64 bit values only adds up to 2 KiB