Add support for process-local futexes. These work exactly the same way
as global ones, but take only a process-specific lock and use a
process-specific hash map.
Also reduce the time the futex lock is held. There was no need to hold
the global lock while validating addresses in the process's address
space.
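
The two changes can be pictured with a minimal user-space sketch. It is
not the code from this change: FutexTable, FutexQueue, Process,
is_valid_user_address and futex_register_waiter are hypothetical names,
std::mutex stands in for the kernel lock, and keying both tables by the
virtual address is a simplification.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <unordered_map>

// Hypothetical stand-in for a futex wait queue; it only counts waiters.
struct FutexQueue {
    int waiters = 0;
};

// One lock plus one hash map, keyed by the address of the futex word.
struct FutexTable {
    std::mutex lock;
    std::unordered_map<std::uintptr_t, FutexQueue> queues;
};

// Hypothetical process with a single user region and its own futex table.
struct Process {
    std::uintptr_t region_base = 0;
    std::uintptr_t region_size = 0;
    FutexTable local_futexes; // process-specific lock and hash map

    bool is_valid_user_address(std::uintptr_t address) const
    {
        return address >= region_base && address - region_base < region_size;
    }
};

// Table shared by all processes (the "global" futexes).
FutexTable g_global_futexes;

// Registers a waiter on `address`. Returns false if the address is not
// part of the process's address space.
bool futex_register_waiter(Process& process, std::uintptr_t address, bool is_private)
{
    // Validation touches only per-process state, so it runs before any
    // futex lock is taken.
    if (!process.is_valid_user_address(address))
        return false;

    // Private futexes use the process's own table, so they never contend
    // on the global lock.
    FutexTable& table = is_private ? process.local_futexes : g_global_futexes;

    std::lock_guard<std::mutex> guard(table.lock);
    FutexQueue& queue = table.queues[address];
    queue.waiters++;
    assert(queue.waiters > 0);
    return true;
}
```

In this sketch an invalid address is rejected without touching either
lock, and a private futex only ever takes the lock of its own process.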