Kernel Blues, or Why x86 Is So Convoluted

2020-07-19

kernel

pwnyOS screenshot login

Preface

This weekend I casually played a CTF hosted by UIUC and spent most of my time on their kernel challenge. First, I'd like to commend @ravi for such a well-written kernel challenge: it was interesting and fun to crawl around in a full implementation of a kernel. I would've personally preferred an open kernel implementation (or at least one where the binaries are given), since that reduces the amount of guessing required, and I like being able to see exactly how the entire kernel works rather than having to guess my way through it.

In this writeup, I'm actually not gonna discuss a whole lot about the actual intended solutions, but rather focus on an unintended vulnerability in the kernel and how that came to be.

tl;dr

For those that are lazy and just scroll to the end, I'll give you a pass:

A number of I/O ports are actually not blocked

Imo the bug itself isn't really that interesting; why the bug is there is a lot more interesting. Stick around to see why that is the case ;)

Scripting

Because the custom OS only has a graphical frontend (no serial port or connection to internet), I ended up having to rely heavily on a series of scripts throughout the entire CTF to automate copying/pasting into the VNC connection:

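# paste.py
# Types the contents of a file into the focused window (here, the VNC viewer).
import sys
import time

import autopy

text = open(sys.argv[1], 'r').read()

time.sleep(3)  # time to focus the VNC window
for line in text.split('\n'):
    for c in line + '\n':
        autopy.key.type_string(c)
        time.sleep(.005)
    time.sleep(1)  # let the guest process the line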

#!/bin/bash

set -e

nasm "$1.asm"

if [ ! -f "$1" ]; then
  echo 'Not a valid file!'
  exit 1
fi

echo 'Yanking in 3 seconds...'
(#echo 'binexec' && 
  hexdump -ve '48/1 "%02x" "\n" ' "$1" &&
  echo -n 'done') |
  python3 paste.py /dev/stdin

Basically, my workflow is: assemble some x86 using nasm, turn the output into hex using hexdump, and then feed that hexdump to a Python script that "types" out the hex strings (kinda like AutoHotkey on Windows).

Finding the bug

I've actually done quite a few kernel challenges before, and also a full-on hypervisor-kernel-user challenge (back in HITCON 2018). With that challenge in particular, I remember being able to cheese it by directly executing in/out instructions, thereby circumventing the kernel and making hypercalls directly (corrupting the kernel in the process). I'm pretty sure this was unintended (the intended solution, I think, was some allocator bug).

Sure enough, testing it here, I noticed a number of I/O ports were actually not causing #GP (general protection) exceptions and just successfully passing through, though this wasn't the case for all ports:

Ports with large numbers still #GP.
Certain port numbers (namely 0x20...) also #GP.

I/O Ports

Before explaining the actual bug, I'd like to give a brief overview of how I/O ports work: how does an assembly instruction signal the CPU to communicate with its peripherals?

There are two main ways for a CPU to communicate with IO:

Isolated/port I/O (PIO): In this case, all I/O devices are mapped to an address space separate from the main memory address space (its addresses are often called "port numbers"). This is what the Intel 80x86 family does.
Memory-Mapped I/O (MMIO): In this case, I/O devices are mapped directly onto the same address space as main memory. This is used by processors such as MIPS and by the 6502-family CPUs in machines like the Commodore 64 and the NES/SNES.

Because x86 is backwards compatible with the entire family of 80x86 chips, it inherits a lot of the antiquities of the old chips. In this case, it uses a 16-bit number to identify a port, and can transfer an 8-bit/16-bit/32-bit value (through the al/ax/eax register) at once using the in/out instructions.

In fact, most I/O devices only support 8-bit data transfers, which means the 16-bit and 32-bit versions of the in/out instructions often aren't super useful, and I/O devices have to resort to more clever schemes if they need data that is 16/24/32 bits in size:

Some I/O devices use multiple I/O ports where each port correlates to a specific byte in the resulting value.
Some I/O devices use one port, but flip states so that the first byte written is the lower byte, the second byte written is the upper byte, etc. (in low-level digital logic this mechanism is referred to as a flip-flop: a set of gates that feed their signals back into themselves to "preserve" a state, with inputs that modify that state).
Some I/O devices have a selector port that selects which property to set in the "data" port.
Some I/O devices use a combination of the above
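To make the access patterns above concrete, here is a minimal Python sketch that simulates two of them in software. The class names (FlipFlopPort, IndexedDevice) and register sizes are hypothetical and only illustrate the idea; they don't model any specific real device.

```python
class FlipFlopPort:
    """Hypothetical 16-bit register behind one 8-bit port: low byte first, then high byte."""

    def __init__(self):
        self.value = 0
        self.expect_high = False  # the "flip-flop" state

    def write(self, byte):
        byte &= 0xFF
        if self.expect_high:
            self.value = (self.value & 0x00FF) | (byte << 8)
        else:
            self.value = (self.value & 0xFF00) | byte
        self.expect_high = not self.expect_high  # toggle for the next write


class IndexedDevice:
    """Hypothetical selector/data pair: pick a register index, then use the data port."""

    def __init__(self, size=128):
        self.regs = [0] * size
        self.index = 0

    def write_index(self, byte):
        self.index = byte % len(self.regs)

    def write_data(self, byte):
        self.regs[self.index] = byte & 0xFF

    def read_data(self):
        return self.regs[self.index]


counter = FlipFlopPort()
counter.write(0x34)  # first write: low byte
counter.write(0x12)  # second write: high byte
print(hex(counter.value))

dev = IndexedDevice()
dev.write_index(0x10)  # select register 0x10 through the selector port
dev.write_data(0xAB)   # then write it through the data port
print(hex(dev.read_data()))
```

The point of the sketch is that the meaning of a write depends on hidden device state, which is exactly why a kernel normally wraps these ports in a driver.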

So yeah, I/O ports are a mess in x86, and there is no consistent standard for how data should be passed to the peripherals, but then again, it is the job of the kernel to create layers of abstraction over all that stuff so we don't have to worry about it (until we do...).

And unfortunately, even the port numbers sometimes are nonstandard depending on the exact device(s) that you are plugging in, though there are a few standard port numbers that generally should be the same across all machines. I found that this list of I/O ports was very helpful when I was enumerating the ports on the vmware machine.

How x86 restricts I/O

Clearly, if we could write code that talks to the I/O peripherals directly using the in/out instructions, any user would be free to mess around with just about anything (including your hard drive)! No kernel could ever dream of protecting against that kind of bypass. In fact, on the original 16-bit 80x86 chips, that was the case: OSes in those days didn't serve as a gatekeeper supervising processes, but were just another API for managing common system functions (which was bypassable). Fortunately, once you enter protected mode (32-bit mode in our case), the x86 processor makes a series of checks to restrict access to I/O.

Now it's time to take a look at how that works. I generally like this website for reference material on x86. For the in instruction (and likewise for out), there are basically three conditions that must all hold for a #GP fault to be raised:

The CPU is running with PE set (protected mode).
CPL (current privilege level) > IOPL (I/O privilege level).
The port being accessed is not allowed by the "permission bit set".

Pseudocode of "in"

Obviously for (1) we are running in PE. For (2) our CPL is obviously 3, and it turns out the IOPL is a field in EFLAGS; dumping it with pushf; pop eax shows that it is set properly. Now here's the kicker: I've observed a number of custom kernels out there forgetting about (3) when restricting I/O access. There is a very interesting quirk about what this "permission bit set" is. In other texts it is also referred to as the I/O bitmap, and it determines which I/O ports are allowed, each port taking 1 bit. (Make careful note that 0 is ALLOW and 1 is DISALLOW.)
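The three conditions can be modeled in a few lines of Python. This is a rough sketch of my reading of the check, not the CPU's exact microcode: the helper names (make_tss, io_allowed) are made up, the TSS is assumed to be the 32-bit layout with the 16-bit I/O map base at offset 0x66, and tss_limit is the last valid byte offset of the segment.

```python
TSS_SIZE = 0x68  # size of the fixed part of a 32-bit TSS


def make_tss(iomap_base, bitmap=b""):
    """Build a zeroed TSS image with the given I/O map base and optional bitmap."""
    tss = bytearray(TSS_SIZE) + bytearray(bitmap)
    tss[0x66:0x68] = iomap_base.to_bytes(2, "little")
    return bytes(tss)


def io_allowed(tss, tss_limit, port, width=1, cpl=3, iopl=0, protected_mode=True):
    """Return True if in/out on `port` would succeed, False if it would #GP."""
    if not protected_mode or cpl <= iopl:
        return True  # conditions (1)/(2) not met: no bitmap check at all
    iomap_base = int.from_bytes(tss[0x66:0x68], "little")
    byte_off = iomap_base + port // 8
    # Two bytes are fetched so accesses spanning a byte boundary can be checked;
    # if they are not both inside the TSS limit, the access faults.
    if byte_off + 1 > tss_limit:
        return False
    word = int.from_bytes(tss[byte_off:byte_off + 2], "little")
    mask = ((1 << width) - 1) << (port % 8)
    return (word & mask) == 0  # 0 is ALLOW, 1 is DISALLOW


# A "well-behaved" TSS: bitmap right after the struct, everything denied
# except port 0x60 (an arbitrary choice for the demo).
bitmap = bytearray(b"\xff" * (8192 + 1))
bitmap[0x60 // 8] &= ~(1 << (0x60 % 8))
good_tss = make_tss(TSS_SIZE, bitmap)
good_limit = len(good_tss) - 1

print(io_allowed(good_tss, good_limit, 0x60))
print(io_allowed(good_tss, good_limit, 0x64))
print(io_allowed(good_tss, good_limit, 0x64, cpl=0))

# No bitmap at all, base pointing past the limit: every user-mode access faults.
no_bitmap_tss = make_tss(TSS_SIZE)
print(io_allowed(no_bitmap_tss, TSS_SIZE - 1, 0x60))
```

Note that the last case denies everything without needing an 8 KiB bitmap, simply because the base offset lies outside the segment limit.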

When I was researching this bitmap, I stumbled upon this forum post stating something about setting iomap base to some value:

Setting IO bitmap to some value?

So why would this be the case? Perhaps the limit = sizeof(tss_struct), I will just believe that for now. At this point I decided to look at a bunch of tss_structs in other codebases, and to my horror, I found kernel after kernel disregarding this little edge case.

I started from this PhotonOS implementation by HsTP in Xmas CTF 2019. Their tss_struct has the iomap_base field defined, and the TSS appears to be initialized in write_tss, but that function does not initialize iomap_base properly. I searched for iomap_base, and it seems to be set NOWHERE.

Okay, I guess it's time to upgrade to the Linux kernel :P . I found that Linux, unlike many of the custom kernels I've seen people casually write, does, unsurprisingly, handle this correctly. How do they do that? By default they set the I/O bitmap base offset to IO_BITMAP_OFFSET_INVALID, which, reading their comments, is a "Base offset outside of TSS_LIMIT so unpriviledged IO causes #GP". This actually turned out to be the key to this weird behavior.

Only the Linux kernel actually got this I/O edge-case correct :P

Unintended leak of I/O, just bad coding/specs, or coincidence?

Turns out this iomap is located in the TSS segment, and its location is given as an offset from the start of the TSS. By default this memory is ZERO'd, which means the offset is 0... exactly the start of the TSS segment/struct. In other words, the CPU interprets the TSS fields themselves as the I/O bitmap. Now remember how only certain I/O ports result in a #GP and others don't? One of those is 0x20, which happens to be the bit offset of the sp0 field! And since the size of the entire struct is 0x68 bytes, or 0x340 bits (which also happens to be the TSS limit, I think), any port at or above 0x340 will also #GP since it exceeds the TSS segment limit. This means we only get ports roughly between 0 and 0x340.
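Here is a small Python sketch of that overlap: it builds a zeroed 0x68-byte TSS with iomap_base = 0 and treats the struct's own bytes as the bitmap. The esp0 and ss0 values are purely hypothetical (I don't know the real ones in pwnyOS), and the two-byte limit check reflects my understanding of how the CPU reads the bitmap, so treat the exact upper bound as approximate.

```python
import struct

TSS_SIZE = 0x68
TSS_LIMIT = TSS_SIZE - 1  # assumed limit: last valid byte offset


def build_tss(esp0, ss0, iomap_base=0):
    """Zeroed 32-bit TSS with only esp0 (offset 4), ss0 (offset 8) and the I/O map base set."""
    tss = bytearray(TSS_SIZE)
    struct.pack_into("<I", tss, 0x04, esp0)
    struct.pack_into("<H", tss, 0x08, ss0)
    struct.pack_into("<H", tss, 0x66, iomap_base)
    return bytes(tss)


def port_allowed(tss, port):
    """Byte-wide in/out from ring 3: allowed only if the bit is 0 and inside the limit."""
    base = struct.unpack_from("<H", tss, 0x66)[0]
    byte_off = base + port // 8
    if byte_off + 1 > TSS_LIMIT:  # the CPU fetches two bitmap bytes per check
        return False
    return ((tss[byte_off] >> (port % 8)) & 1) == 0


# Hypothetical kernel stack pointer and stack segment, for illustration only.
tss = build_tss(esp0=0x007FFFFF, ss0=0x10)

for port in (0x20, 0x21, 0x60, 0x64, 0xA0, 0x1F0, 0x340):
    print(hex(port), "allowed" if port_allowed(tss, port) else "#GP")

allowed = [p for p in range(0x10000) if port_allowed(tss, p)]
print("highest allowed port:", hex(max(allowed)))
```

With these made-up values, the set bits of esp0 land on ports 0x20 and up (blocking the master PIC), while fields that stay zero, such as the bytes covering 0x60/0x64 and 0xa0, read as "allowed".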

Unexploitable?

Of course, armed with this knowledge, and knowing exactly which port numbers we can use, I went on to plan how to exploit it. My end goal would be one of two things:

Get some arbitrary R/W in the kernel, maybe using some DMA stuff
Try to directly read from harddisk/CD-ROM.

If I could accomplish either, I win big time. Sadly, I found absolutely no way to actually accomplish this. The underpinning reason in the end was: the authors installed a (read-only) CD-ROM drive, instead of a hard-disk to load kernel/filesystem.

Still, I tried every way possible, and at one point even tried fiddling with BIOS settings (to no avail). I also tried enumerating every single active I/O port that I had access to (my heuristic was that if a port just returns 0xffffffff when I run in eax, dx on it, it doesn't exist; I'm not sure whether I missed any ports/services because of this). Overall I found that not much was active; most of the ports seem "dead" (I also used this website to figure out which ports are what):

There were DMA controllers, one at 0x0-0xf and one at 0xc0-0xcf, plus a third-byte (page) extension located in the 0x80-0x8f region.
Keyboard was located at 0x60 and 0x64
PIC 2 (or slave PIC) was accessible at 0xa0-0xa1 (though I couldn't access PIC 1 or the master PIC, since sp0 was there, setting the allowed bitmap to 1 and restricting me from accessing that stuff :( )
Primary (0x1fx) and secondary (0x17x) hard-disk (!!!) ports seemed to return some other values, but they still seem dead because there are no hard drives attached
CMOS at 0x70-0x71

CMOS apparently is the interface for writing BIOS memory, so that's useless unless I also found a BIOS bug in VMware (unlikely), so I took that out.

I tried many times to check whether the primary/secondary hard-disk I/O ports were active (by trying to read the sector count or some other properties, but it just gave me 0x7f7f7f7f each time). At one point I even (jokingly) asked the organizers to attach an empty hard disk to the VMware machine so that I could install some code onto its MBR and get access that way.

The PIC, it seems to me, lets you set some interrupt vectors (I still don't entirely understand how that works), but you can only do that with the master PIC anyway, and that is entirely inaccessible.

Finally, I also spent a ton of time seeing if DMA helps my case at all (for a brief introduction, this guy's YouTube video covers in good depth how DMA works, though on the NES; tl;dr, DMA lets you configure a set amount of memory to be transferred from main memory to some peripheral or vice versa, without the intervention of the CPU). Turns out, I found a couple of hitches:

(ISA) DMA is slow (the hard disk DMA is something else, but not useful since there are no hard disks). This isn't a problem by itself, but it means no one uses it, so there is less documentation.
Normally DMA transfers occur between memory and peripherals, not from memory to memory (and since direct CPU memory transfers are faster than ISA DMA, no one uses this in practice); in fact, as far as I could find, memory-to-memory is undocumented.
DMA only addresses real (physical) addresses (which is to be expected since it bypasses the CPU), and only the lower 24 bits. This might've worked, but I would have had to guess where my memory chunk is mapped from virtual to physical addresses.
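To illustrate the 24-bit addressing constraint, here is a short Python sketch that splits a physical address into the pieces an 8-bit ISA DMA channel would be programmed with: a page byte (the "third byte" in the 0x80-0x8f region) plus a 16-bit offset written low byte first. The function name isa_dma_fields is hypothetical, and the sketch only computes values; it does not touch any ports.

```python
def isa_dma_fields(phys_addr, length):
    """Return (page, addr_low, addr_high, count) for an 8-bit ISA DMA transfer.

    Raises ValueError if the buffer is not reachable: it must sit below 16 MiB
    (24-bit addressing) and must not cross a 64 KiB boundary, since only the
    low 16 bits of the address are incremented during a transfer.
    """
    if phys_addr < 0 or length <= 0:
        raise ValueError("address must be non-negative and length positive")
    end = phys_addr + length - 1
    if end >= 1 << 24:
        raise ValueError("buffer is beyond the 24-bit (16 MiB) DMA range")
    if (phys_addr >> 16) != (end >> 16):
        raise ValueError("buffer crosses a 64 KiB boundary")
    page = phys_addr >> 16           # goes to the page register
    offset = phys_addr & 0xFFFF      # goes to the address register, low then high
    count = length - 1               # the controller counts transfers minus one
    return page, offset & 0xFF, offset >> 8, count


print([hex(x) for x in isa_dma_fields(0x123456, 0x100)])

try:
    isa_dma_fields(0x1FFF0, 0x20)    # straddles 0x20000
except ValueError as err:
    print("rejected:", err)
```

This also shows why guessing the virtual-to-physical mapping matters: every value here is derived from the physical address, which user code never gets to see directly.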

Well, so it looks like DMA might've worked, but it was too hard to figure out in the time available, so it was also a dud.

Conclusion

Well, it was kinda upsetting that I wasn't able to weaponize this attack, but I definitely did learn a good deal about why x86 is literally a clusterfuck of documentation upon documentation (at one point I even had to look into the x86 developers manual to verify something). I believe this would've been much easier if I had an actual hard disk to play with (in that case I could directly write to the hard disk and get root that way, or at least boot from the hard drive and get code-exec that way). The entire process of uploading/downloading stuff was also made hard by having to manually type everything into a GUI >:( and not being able to copy data off of that machine (we ended up using OCR on the hex data to extract it).

I also have a secret powerful weapon to use against custom kernels now :)
