Supporting Multiple Architectures

Published: Aug. 19, 2020, 10:28 a.m.

As an application or a library develops, chances are you will start to think about adding support for different use cases. This blog delves into how I would approach that by example: implementing support for more architectures in the memflow memory introspection toolkit.

A Look at Some Old Work

Let's first take a look at vmread. It is also a library that allows accessing Windows memory through Direct Memory Access (DMA). I tried to decouple memory access to potentially support PCILeech, coredumps, and so on, but in reality it was only ever made for QEMU/KVM, and only for the x86_64 architecture. The heart of architecture support for these tools is the virtual-to-physical memory translation function. Here is vmread's:

static uint64_t VTranslateInternal(const ProcessData* data, _tlb_t* tlb, uint64_t dirBase, uint64_t address)
{
	uint64_t pageOffset = address & ~(~0ul << PAGE_OFFSET_SIZE);
	uint64_t pte = ((address >> 12) & (0x1ffll));
	uint64_t pt = ((address >> 21) & (0x1ffll));
	uint64_t pd = ((address >> 30) & (0x1ffll));
	uint64_t pdp = ((address >> 39) & (0x1ffll));

	uint64_t pdpe = VtMemReadU64(data, tlb, 0, dirBase + 8 * pdp);
	if (~pdpe & 1)
		return 0;

	uint64_t pde = VtMemReadU64(data, tlb, 1, (pdpe & PMASK) + 8 * pd);
	if (~pde & 1)
		return 0;

	/* 1GB large page, use pde's bits 30-51 */
	if (pde & 0x80)
		return (pde & (~0ull << 42 >> 12)) + (address & ~(~0ull << 30));

	uint64_t pteAddr = VtMemReadU64(data, tlb, 2, (pde & PMASK) + 8 * pt);
	if (~pteAddr & 1)
		return 0;

	/* 2MB large page */
	if (pteAddr & 0x80)
		return (pteAddr & PMASK) + (address & ~(~0ull << 21));

	address = VtMemReadU64(data, tlb, 3, (pteAddr & PMASK) + 8 * pte) & PMASK;

	if (!address)
		return 0;

	return address + pageOffset;
}

The most important part is the signature of the function. The above is a private implementation that gets wrapped by this function:

/**
 * @brief Translate a virtual VM address into a physical one
 *
 * @param data VM process data
 * @param dirBase page table directory base of a process
 * @param address virtual address to translate
 *
 * @return
 * Translated linear address;
 * 0 otherwise
 */
uint64_t VTranslate(const ProcessData* data, uint64_t dirBase, uint64_t address);

This function takes an x86-specific dirBase and an address, and translates the address using 64-bit-specific page table offsets. The function is never aware of the architecture in question; it just assumes 64-bit x86. If I were to add architecture support, I could store some value inside the ProcessData structure and then pick a matching implementation inside VTranslate, but as you will see later on, that is not very sustainable.
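To make the "64-bit specific page table offsets" concrete, here is a minimal Rust sketch of just the index extraction that the C function does in its first few lines. The names `X64Split` and `split_x64` are mine, not vmread's, and the field names follow the usual x86_64 naming rather than the variable names in the C code.

```rust
/// Indices into the four x86_64 page table levels, plus the offset into a 4 KiB page.
/// Hypothetical type, named after the usual x86_64 terminology.
#[derive(Debug, PartialEq, Eq)]
pub struct X64Split {
    pub pml4: u64,
    pub pdpt: u64,
    pub pd: u64,
    pub pt: u64,
    pub offset: u64,
}

/// Splits a virtual address into four 9-bit table indices and a 12-bit page offset.
/// Every shift and mask here is hard-coded for 4-level x86_64 paging.
pub fn split_x64(address: u64) -> X64Split {
    const INDEX_MASK: u64 = 0x1ff; // 9 bits, 512 entries per table
    X64Split {
        pml4: (address >> 39) & INDEX_MASK,
        pdpt: (address >> 30) & INDEX_MASK,
        pd: (address >> 21) & INDEX_MASK,
        pt: (address >> 12) & INDEX_MASK,
        offset: address & 0xfff,
    }
}
```

Nothing in this code can be reused for 32-bit x86 as is: the number of levels, the index width and the offset width are all baked in.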

The Start of Memflow

Memflow is a memory introspection toolkit written from scratch in Rust. It started off one step ahead, because it did have a concept of an architecture, as you can see:

pub enum Architecture {
    /**
    An empty architecture with some sensible defaults and no virt_to_phys translation.
    This is usually most useful when running automated tests.
    */
    Null,
    /// x86_64 architecture.
    X64,
    /// x86 architecture with physical address extensions.
    X86Pae,
    /// x86 architecture.
    X86,
}

In theory this is fine. There are a number of architectures we can support, and we can match the right implementation to perform the correct translation:

pub fn virt_to_phys_iter<
    T: PhysicalMemory + ?Sized,
    B: TranslateData,
    VI: Iterator<Item = (Address, B)>,
    OV: Extend<(Result<PhysicalAddress>, Address, B)>,
>(
    self,
    mem: &mut T,
    dtb: Address,
    addrs: VI,
    out: &mut OV,
) {
    match self {
        Architecture::Null => {
            out.extend(addrs.map(|(addr, buf)| (Ok(PhysicalAddress::from(addr)), addr, buf)))
        }
        Architecture::X64 => x64::virt_to_phys_iter(mem, dtb, addrs, out),
        Architecture::X86Pae => x86_pae::virt_to_phys_iter(mem, dtb, addrs, out),
        Architecture::X86 => x86::virt_to_phys_iter(mem, dtb, addrs, out),
    }
}

Originally, we only supported X64, but we were prepared for X86 and X86Pae (32-bit x86 with the Physical Address Extension), and when the time came, we could easily add support for them. Just as we started to add that support, I noticed a peculiar detail.

The Unification

On x86, virtual address translation works by walking the page table tree. This mechanism by itself is very much the same across all three x86 architectures we were planning to support, and the differences can be described using data. So, I described an MMU:

pub struct ArchMMUSpec {
    /// defines the way virtual addresses get split (the last element
    /// being the final physical page offset, and thus treated a bit differently)
    pub virtual_address_splits: &'static [u8],
    /// defines at which page mapping steps we can return a large page.
    /// Steps are indexed from 0, and the list has to be sorted, otherwise the code may fail.
    pub valid_final_page_steps: &'static [usize],
    /// define the address space upper bound (32 for x86, 52 for x86_64)
    pub address_space_bits: u8,
    /// native pointer size in bytes for the architecture.
    pub addr_size: u8,
    /// size of an individual page table entry in bytes.
    pub pte_size: usize,
    /// index of a bit in PTE defining whether the page is present or not.
    pub present_bit: u8,
    /// index of a bit in PTE defining if the page is writeable.
    pub writeable_bit: u8,
    /// index of a bit in PTE defining if the page is non-executable.
    pub nx_bit: u8,
    /// index of a bit in PTE defining if the PTE points to a large page.
    pub large_page_bit: u8,
}

This is mostly all there is to x86 address translation. All I had to do was define the MMU for each of the architectures.
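To show how the bit indices in such a spec can drive the checks that vmread hard-coded (`~pde & 1` for presence, `pde & 0x80` for large pages), here is a small sketch. `PteBits`, `PteFlags` and `decode` are hypothetical names for illustration; this is not memflow's actual code, only a trimmed-down stand-in for the flag fields of `ArchMMUSpec`.

```rust
/// Hypothetical subset of `ArchMMUSpec`: only the flag bit indices.
pub struct PteBits {
    pub present_bit: u8,
    pub writeable_bit: u8,
    pub nx_bit: u8,
    pub large_page_bit: u8,
}

/// Bit indices as listed for x64 in this post.
pub const X64_BITS: PteBits = PteBits {
    present_bit: 0,
    writeable_bit: 1,
    nx_bit: 63,
    large_page_bit: 7,
};

/// Decoded flags of a single page table entry.
#[derive(Debug, PartialEq, Eq)]
pub struct PteFlags {
    pub present: bool,
    pub writeable: bool,
    pub no_execute: bool,
    pub large_page: bool,
}

/// Returns true if bit `index` of `pte` is set.
fn bit(pte: u64, index: u8) -> bool {
    (pte >> index) & 1 == 1
}

impl PteBits {
    /// Reads the flags out of a raw entry using the indices stored in the spec,
    /// instead of constants baked into the translation function.
    pub fn decode(&self, pte: u64) -> PteFlags {
        PteFlags {
            present: bit(pte, self.present_bit),
            writeable: bit(pte, self.writeable_bit),
            no_execute: bit(pte, self.nx_bit),
            large_page: bit(pte, self.large_page_bit),
        }
    }
}
```

The translation code only ever asks the spec where a flag lives, so a different architecture is a different constant rather than a different function.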

x64.rs:

pub fn get_mmu_spec() -> ArchMMUSpec {
    ArchMMUSpec {
        virtual_address_splits: &[9, 9, 9, 9, 12],
        valid_final_page_steps: &[2, 3, 4],
        address_space_bits: 52,
        addr_size: 8,
        pte_size: 8,
        present_bit: 0,
        writeable_bit: 1,
        nx_bit: 63,
        large_page_bit: 7,
    }
}

x86.rs:

pub fn get_mmu_spec() -> ArchMMUSpec {
    ArchMMUSpec {
        virtual_address_splits: &[10, 10, 12],
        valid_final_page_steps: &[1, 2],
        address_space_bits: 32,
        addr_size: 4,
        pte_size: 4,
        present_bit: 0,
        writeable_bit: 1,
        nx_bit: 31, //Actually, NX is unsupported in x86 non-PAE, we have to do something about it
        large_page_bit: 7,
    }
}

x86_pae.rs:

pub fn get_mmu_spec() -> ArchMMUSpec {
    ArchMMUSpec {
        virtual_address_splits: &[2, 9, 9, 12],
        valid_final_page_steps: &[2, 3],
        address_space_bits: 36,
        addr_size: 4,
        pte_size: 8,
        present_bit: 0,
        writeable_bit: 1,
        nx_bit: 63,
        large_page_bit: 7,
    }
}

The most notable differences are in virtual_address_splits. x64 has 4-level page tables, and thus 5 virtual address split parts (the last one being the page offset); x86 has 2 levels and 3 split parts; x86_pae has 3 levels and 4 split parts. Everything goes through the same translation function (it is rather complex due to other optimizations in place, so I am not including it). For an extended x64 address space implementation, we would just need to define an MMU with an extra page table level, and we would be golden. But didn't I tell you that this is still unsustainable? Why?
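The real translation function is too long to show, but the core idea of letting the split list drive everything fits in a few lines. The sketch below is my own simplified illustration, not memflow's implementation: `MmuSpec`, `split` and `page_size_at_step` are hypothetical names, and the page size rule (sum of the remaining split bits) is my reading of how the steps line up with the x86 page sizes.

```rust
/// Trimmed-down, hypothetical stand-in for `ArchMMUSpec`.
pub struct MmuSpec {
    pub virtual_address_splits: &'static [u8],
    pub valid_final_page_steps: &'static [usize],
}

// Values copied from the three specs above.
pub const X64: MmuSpec = MmuSpec {
    virtual_address_splits: &[9, 9, 9, 9, 12],
    valid_final_page_steps: &[2, 3, 4],
};
pub const X86: MmuSpec = MmuSpec {
    virtual_address_splits: &[10, 10, 12],
    valid_final_page_steps: &[1, 2],
};
pub const X86_PAE: MmuSpec = MmuSpec {
    virtual_address_splits: &[2, 9, 9, 12],
    valid_final_page_steps: &[2, 3],
};

impl MmuSpec {
    /// Splits `addr` into one value per entry of `virtual_address_splits`,
    /// most significant part first; the last value is the page offset.
    pub fn split(&self, addr: u64) -> Vec<u64> {
        // Start with the total number of translated bits and peel parts off the top.
        let mut shift: u32 = self
            .virtual_address_splits
            .iter()
            .map(|&bits| u32::from(bits))
            .sum();
        self.virtual_address_splits
            .iter()
            .map(|&bits| {
                shift -= u32::from(bits);
                (addr >> shift) & ((1u64 << bits) - 1)
            })
            .collect()
    }

    /// Size in bytes of the page mapped when the walk ends at `step`,
    /// or `None` if the walk may not end at that step.
    pub fn page_size_at_step(&self, step: usize) -> Option<u64> {
        if !self.valid_final_page_steps.contains(&step) {
            return None;
        }
        // Everything not yet consumed by table indices becomes the in-page offset.
        let bits: u32 = self.virtual_address_splits[step..]
            .iter()
            .map(|&b| u32::from(b))
            .sum();
        Some(1u64 << bits)
    }
}
```

With this, `X64.page_size_at_step(2)` comes out as 1 GiB and `X86.page_size_at_step(1)` as 4 MiB, which matches the large page sizes of those architectures.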

"Folly of my kind. There's always a yearning for more"

We have the entire modern x86 covered, in all cases where it would make sense. As these possibilities are achieved, new exciting opportunities come up. What about ARM? Windows has an ARM version. Taking a look at ARM's design (particularly AArch64), we find that it is pretty much the same, apart from user and kernel memory being split into 2 different page tables (the TTBR0 and TTBR1 registers). While we are at it, what about other architectures? RISC-V? POWER? What if Microsoft comes up with something called CISC-VI? Remember the enum:

pub enum Architecture {
    /**
    An empty architecture with some sensible defaults and no virt_to_phys translation.
    This is usually most useful when running automated tests.
    */
    Null,
    /// x86_64 architecture.
    X64,
    /// x86 architecture with physical address extensions.
    X86Pae,
    /// x86 architecture.
    X86,
}

Do we just put all the architectures in one enum? Do we make anyone wanting to use some new, unsupported architecture go and recompile memflow just to add support for it? This problem suddenly becomes much bigger than anticipated. We can't rely on Architecture as an enum anymore. We could, but why? Lots of work to change the design? Yes. Is it going to help us in the long run? Yes. Do we have anyone relying on the current design? No, memflow is not yet released! With 2 weeks to release, there isn't a better time than now to perform such big design changes, because after the fact, we may very well be stuck with the same design forever. Better to break the current one, and build something better.

Okay, yes, let's break it! But how?

For the most effective results, the core code should not have any idea about the specifics of an architecture. It should rather be a black box with a limited set of values that can be queried. Maybe this set will be enough?

pub trait Architecture: Send + Sync + 'static {
    /// Returns the number of bits of a pointer's width on an `Architecture`.
    /// Currently this will either return 64 or 32 depending on the pointer width of the target.
    /// This function is handy in cases where you only want to know the pointer width of the target,
    /// but you don't want to match against all architectures.
    fn bits(&self) -> u8;

    /// Returns the byte order of an `Architecture`.
    /// This will either be `Endianess::LittleEndian` or `Endianess::BigEndian`.
    fn endianess(&self) -> Endianess;

    /// Returns the smallest page size of an `Architecture`.
    ///
    /// In x86/64 and arm this will always return 4kb.
    fn page_size(&self) -> usize;

    /// Returns the `usize` of a pointer's width on an `Architecture`.
    fn size_addr(&self) -> usize;

    /// Returns the address space range in bits for the `Architecture`.
    fn address_space_bits(&self) -> u8;
}

Okay, how does one define such an architecture? How does one differentiate between 2 architectures? After a bit of thinking, I came to the conclusion that storing a single, unique, static reference to each architecture would be the best way, like so:

pub(super) const ARCH_SPEC: X86Architecture = X86Architecture {
    bits: 64,
    endianess: Endianess::LittleEndian,
    mmu: ArchMMUSpec {
        virtual_address_splits: &[9, 9, 9, 9, 12],
        valid_final_page_steps: &[2, 3, 4],
        address_space_bits: 52,
        addr_size: 8,
        pte_size: 8,
        present_bit: 0,
        writeable_bit: 1,
        nx_bit: 63,
        large_page_bit: 7,
    },
};

pub static ARCH: &dyn Architecture = &ARCH_SPEC;

X86Architecture is just a type that implements the functionality shared between the 32-bit and 64-bit architectures. Only ARCH is exposed, while the internal state is cleanly encapsulated. Perfect!
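Since each architecture is a unique static, telling two of them apart can come down to comparing addresses. The following is a minimal sketch of that idea under my own assumptions: the trait is cut down to one method, and `SimpleArch`, `X64`, `X86` and `is_same_arch` are hypothetical names, not memflow's API.

```rust
/// Cut-down version of the `Architecture` trait from above.
pub trait Architecture: Send + Sync + 'static {
    fn bits(&self) -> u8;
}

/// Hypothetical concrete type; it stays private, only the references are exposed.
struct SimpleArch {
    bits: u8,
}

impl Architecture for SimpleArch {
    fn bits(&self) -> u8 {
        self.bits
    }
}

// Each architecture lives in exactly one static, so its address identifies it.
static X64_SPEC: SimpleArch = SimpleArch { bits: 64 };
static X86_SPEC: SimpleArch = SimpleArch { bits: 32 };

pub static X64: &dyn Architecture = &X64_SPEC;
pub static X86: &dyn Architecture = &X86_SPEC;

/// Returns true if both references point at the same architecture object.
/// Only the data addresses are compared; the vtable half of the fat pointer is dropped.
pub fn is_same_arch(a: &'static dyn Architecture, b: &'static dyn Architecture) -> bool {
    let a = a as *const dyn Architecture as *const ();
    let b = b as *const dyn Architecture as *const ();
    std::ptr::eq(a, b)
}
```

A nice side effect is that a third-party crate can define its own static and get the same identity semantics without touching the core.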

But how do we do dem virtual translates?

So, since the architecture is unknown, and it could have a completely different mechanism for translating memory (for example, MIPS), we cannot have one 100% general implementation. The data needed for translation is of unknown size (for example, ARM separates user and kernel memory using 2 page tables), and that data must be linked to a particular architecture (we don't want to use the x86 translation function on MIPS, do we?). Thus, say hello to ScopedVirtualTranslate:

/// Translates virtual memory to physical using internal translation base (usually a process' dtb)
///
/// This trait abstracts virtual address translation for a single virtual memory scope.
/// On x86 architectures, it is a single `Address` - a CR3 register. But other architectures may
/// use multiple translation bases, or use a completely different translation mechanism (MIPS).
pub trait ScopedVirtualTranslate: Clone + Copy + Send {
    fn virt_to_phys<T: PhysicalMemory>(
        &self,
        mem: &mut T,
        addr: Address,
    ) -> Result<PhysicalAddress> {
        // forward implementation to virt_to_phys_iter
    }

    fn virt_to_phys_iter<
        T: PhysicalMemory + ?Sized,
        B: SplitAtIndex,
        VI: Iterator<Item = (Address, B)>,
        VO: Extend<(PhysicalAddress, B)>,
        FO: Extend<(Error, Address, B)>,
    >(
        &self,
        mem: &mut T,
        addrs: VI,
        out: &mut VO,
        out_fail: &mut FO,
        arena: &Bump,
    );

    fn translation_table_id(&self, address: Address) -> usize;

    fn arch(&self) -> &dyn Architecture;
}

This trait is tied to an architecture (well obvs it even has an arch function!). translation_table_id will be used by the caching system to query for uniqueness of the translation, but that's about it for the trait.

The x86-specific translator implements it, and forwards translation calls to the ArchMMUSpec:

impl ScopedVirtualTranslate for X86ScopedVirtualTranslate {
    fn virt_to_phys_iter<
        T: PhysicalMemory + ?Sized,
        B: SplitAtIndex,
        VI: Iterator<Item = (Address, B)>,
        VO: Extend<(PhysicalAddress, B)>,
        FO: Extend<(Error, Address, B)>,
    >(
        &self,
        mem: &mut T,
        addrs: VI,
        out: &mut VO,
        out_fail: &mut FO,
        arena: &Bump,
    ) {
        self.arch
            .mmu
            .virt_to_phys_iter(mem, self.dtb, addrs, out, out_fail, arena)
    }

    fn translation_table_id(&self, _address: Address) -> usize {
        self.dtb.0.as_u64().overflowing_shr(12).0 as usize
    }

    fn arch(&self) -> &dyn Architecture {
        self.arch
    }
}

This way, different architectures based on page table translation can share the ArchMMUSpec functionality, but still tweak the behaviour if they want to!
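As an illustration of why `translation_table_id` takes an address at all, here is how a translator with two translation bases could look. This is purely a hypothetical sketch: `Aarch64Translate` does not exist in memflow as far as this post is concerned, `Address` is a simplified stand-in, and I assume 48-bit virtual addresses in both halves with no pointer tagging. Unlike the trait above, the sketch returns `Option` to make unmapped ranges visible.

```rust
/// Simplified, hypothetical stand-in for memflow's `Address`.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Address(pub u64);

/// Hypothetical translator that carries both AArch64 translation bases.
#[derive(Clone, Copy)]
pub struct Aarch64Translate {
    /// Base of the lower (user) half, as held in TTBR0.
    pub ttbr0: Address,
    /// Base of the upper (kernel) half, as held in TTBR1.
    pub ttbr1: Address,
}

impl Aarch64Translate {
    /// Picks the table base responsible for `addr`.
    /// Assumption: with 48-bit virtual addresses, the upper 16 bits are all zeros
    /// for the TTBR0 range and all ones for the TTBR1 range; anything else is unmapped.
    pub fn base_for(&self, addr: Address) -> Option<Address> {
        match addr.0 >> 48 {
            0x0000 => Some(self.ttbr0),
            0xffff => Some(self.ttbr1),
            _ => None,
        }
    }

    /// Same page-frame-number idea as the x86 version above,
    /// except that the result depends on which half `addr` falls into.
    pub fn translation_table_id(&self, addr: Address) -> Option<usize> {
        self.base_for(addr).map(|base| (base.0 >> 12) as usize)
    }
}
```

The x86 implementation can ignore its `_address` argument because there is only one base; here the same scope yields two different table ids.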

Now with that out of the way, time to compile...

error: aborting due to 143 previous errors;

Okay, not so fast. These changes needed quite a lot more changes to accommodate them. In the end, this whole revamp took 1815 additions and 1449 deletions. Let's talk about the most notable of them!

Win32

The memflow-win32 crate provides all Windows-related things, and its Win32Process structure holds a virtual memory object. It did so like this:

#[derive(Clone)]
pub struct Win32Process<T: VirtualMemory> {
    pub virt_mem: T,
    pub proc_info: Win32ProcessInfo,
}

The biggest issue was the virtual memory object. I would no longer know its size, because the translator type is not known ahead of time. So, just Box it up, right?

pub virt_mem: Box<T>,

I did this at first, but I think you already have a feeling where this is going... For starters, you can't clone boxes with trait objects all that easily. You need to introduce a trait wrapping the trait we want to clone. Like so:

/// Wrapper trait around virtual memory which implements a boxed clone
pub trait CloneableVirtualMemory: VirtualMemory {
    fn clone_box(&self) -> Box<dyn CloneableVirtualMemory>;
}
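For readers who have not seen this pattern before, here is a self-contained sketch of how such a wrapper trait is typically wired up. The `VirtualMemory` trait is reduced to a single made-up method and `DummyMemory` is a hypothetical type, so treat it as an illustration of the technique rather than memflow's code.

```rust
/// Heavily reduced stand-in for the real `VirtualMemory` trait.
pub trait VirtualMemory {
    fn read_u8(&mut self, addr: u64) -> u8;
}

/// Wrapper trait around virtual memory which implements a boxed clone.
pub trait CloneableVirtualMemory: VirtualMemory {
    fn clone_box(&self) -> Box<dyn CloneableVirtualMemory>;
}

/// Every cloneable `VirtualMemory` gets `clone_box` for free.
impl<T> CloneableVirtualMemory for T
where
    T: VirtualMemory + Clone + 'static,
{
    fn clone_box(&self) -> Box<dyn CloneableVirtualMemory> {
        Box::new(self.clone())
    }
}

/// Makes the boxed trait object itself cloneable.
impl Clone for Box<dyn CloneableVirtualMemory> {
    fn clone(&self) -> Self {
        // Dereference down to the trait object so the call dispatches to `clone_box`.
        (**self).clone_box()
    }
}

/// Hypothetical memory object that returns the same byte everywhere.
#[derive(Clone)]
pub struct DummyMemory {
    pub fill: u8,
}

impl VirtualMemory for DummyMemory {
    fn read_u8(&mut self, _addr: u64) -> u8 {
        self.fill
    }
}
```

It works, but every clone is now a heap allocation plus a virtual call, and every structure holding the box inherits that cost.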

We did this with physical memory objects, because we added support for dynamic linking of them. But this is not the same case. It seemed alright to me at first, but it just wasn't. The solution?

Win32VirtualTranslate!

pub struct Win32VirtualTranslate {
    pub sys_arch: &'static dyn Architecture,
    pub dtb: Address,
}

impl ScopedVirtualTranslate for Win32VirtualTranslate {
    fn virt_to_phys_iter<
        T: PhysicalMemory + ?Sized,
        B: SplitAtIndex,
        VI: Iterator<Item = (Address, B)>,
        VO: Extend<(PhysicalAddress, B)>,
        FO: Extend<(memflow_core::Error, Address, B)>,
    >(
        &self,
        mem: &mut T,
        addrs: VI,
        out: &mut VO,
        out_fail: &mut FO,
        arena: &memflow_core::architecture::Bump,
    ) {
        let translator = x86::new_translator(self.dtb, self.sys_arch).unwrap();
        translator.virt_to_phys_iter(mem, addrs, out, out_fail, arena)
    }

    fn translation_table_id(&self, _address: Address) -> usize {
        self.dtb.as_u64().overflowing_shr(12).0 as usize
    }

    fn arch(&self) -> &dyn Architecture {
        self.sys_arch
    }
}

Wait, what? Did we make all these changes to remove predefined architectures, just to bring them back?

It was surprising to me that it led to the same code coming back, but if you really think about it, it does make sense. The core stays architecture-agnostic, but each operating system layer does not. After all, the OS support needs arch-specific paths and offsets to function.

And Windows is really convenient to handle this way, because even on ARM, which splits the page tables, it uses a single dtb address and practically a single page table, just with the first page split into two halves. This keeps the structure size consistent and the translation dispatch code very simple.

Summary

When you start working on a project, it is a good idea to imagine all possible features from the start. Chances are you won't, though; feature ideas will rather start popping up spontaneously, and you can't really do anything about it. Put them in a backlog, and if you see a particular deadline approaching that will freeze your API, do try to break it to make it much easier to accommodate those features that you will so eagerly want to add! And once you go and recreate the core systems in the code, chances are a similar version of them will appear somewhere else, but hopefully it will be enough to make things more flexible than they used to be.