Current cpu index is stored at either segment. If userspace sets that
segment, kernel will not overwrite it on every reschedule. This is fine
as long as user program does not use anything that relies on it :)
We need to have interrupts enabled when signal kills the process as
process does mutex locking. Also signals are now only checked when
returning to userspace in the same place where userspace segments are
loaded.
Process's memory regions are now behind an rwlock instead of using the
full process lock. This allows most pointer validations to not block as
write operations to memory regions are rare.
Thread's userspace stack is now part of process's memory regions. This
simplifies code that explicitly looped over threads to see if the
accessed address was inside a thread's stack.
Only drawback of this is that MemoryRegions don't support guard pages,
so userspace stackoverflow will be handeled as cleanly as it was prior
to this.
This patch also fixes some unnecessary locking of the process lock and
moves locking to the internal helper functions instead of asserting that
the lock is held. Also we now make sure loaded ELF regions are in sorted
order as we previously expected.
Eeach futex object now has its own mutex to prevent unnecessary locking
of the process/global futex lock. This basically removes sys_futex from
profiles when running software with llvmpipe
Hashmap insertions and deletions made futex very slow to use. When
running SuperTuxKart, ~15% of cpu time was spent doing these.
Never freeing objects is not great either but at least the performance
is usable now :)
Add support for processor local futexes. These work the exact same way
as global ones, but only lock a process specific lock and use a process
specific hash map.
Also reduce the time futex lock is held. There was no need to hold the
global lock while validating addresses in the process' address space.
If the processor has invariant TSC it can be used to measure time. We
keep track of the last nanosecond and TSC values and offset them based
on the current TSC. This allows getting current time in userspace.
The implementation maps a single RO page to every processes' address
space. The page contains the TSC info which gets updated every 100 ms.
If the processor does not have invariant TSC, this page will not
indicate the capability for TSC based timing.
There was the problem about how does a processor know which cpu it is
running without doing syscall. TSC counters may or may not be
synchronized between cores, so we need a separate TSC info for each
processor. I ended up adding sequence of bytes 0..255 at the start of
the shared page. When a scheduler gets a new thread, it updates the
threads gs/fs segment to point to the byte corresponding to the current
cpu.
This TSC based timing is also used in kernel. With 64 bit HPET this
probably does not bring much of a benefit, but on PIT or 32 bit HPET
this removes the need to aquire a spinlock to get the current time.
This change does force the userspace to not use gs/fs themselves and
they are both now reserved. Other one is used for TLS (this can be
technically used if user does not call libc code) and the other for
the current processor index (cannot be used as kernel unconditionally
resets it after each load balance).
I was looking at how many times timer's current time was polled
(userspace and kernel combined). When idling in window manager, it was
around 8k times/s. When running doom it peaked at over 1 million times
per second when loading and settled at ~30k times/s.
Memory regions are now stored in a sorted array. This allows O(nlogn)
lookup for address validation instead of the old linear lookup.
Now inserting new regions is also O(nlogn) instead of the old constant
time, but lookups are **much** more frequent
Memory regions are now splitted when they get munmapped, mprotected, or
mmapped with MAP_FIXED. This is used by couple of ports, and without
this we were just leaking up memory or straight up crashing programs.
This also removes the now old recvfrom and sendto syscalls. These are
now implemented as wrappers around recvmsg and sendmsg.
Also replace unnecessary spinlocks from unix socket with mutexes
Some stuff tries to use shared futexes so make them all shared. Private
futexes would be faster as they are process specific but supporting both
would need some reworks