Making the Perfect Injector: Abusing Windows Address Sanitization and CoW

By the end of this post, I aim to build an injector unlike any other: one that by design makes your DLL undebuggable from user-mode, makes your pages invisible to NtQueryVirtualMemory and NtReadVirtualMemory, and lets you execute code in the target process without holding a valid handle to it. On top of that, I want it to be compatible with PatchGuard and to have no kernel driver loaded while the target is running.

Now, this may seem like a stupidly complicated goal; however, it is in fact really simple because Windows will be helping us.

(Source code can be found at the bottom)

0x1: Abusing Windows Address Sanitization

Anyone who has opened ntoskrnl.exe in IDA probably noticed these checks:

__int64 __usercall MiReadWriteVirtualMemory@<rax>(ULONG_PTR BugCheckParameter1@<rcx>, unsigned __int64 a2@<rdx>, unsigned __int64 a3@<r8>, __int64 a4@<r9>, __int64 a5, int a6)
{
  ...
    if ( v10 < a3 || v9 > 0x7FFFFFFEFFFFi64 || v10 > 0x7FFFFFFEFFFFi64 )
      return 0xC0000005i64;
  ...
}

__int64 __fastcall MmQueryVirtualMemory(__int64 a1, unsigned __int64 a2, __int64 a3, unsigned __int64 a4, unsigned __int64 a5, unsigned __int64 *a6)
{
  ...
  if ( v12 > 0x7FFFFFFEFFFFi64 )
    return 0xC000000Di64;
  ...
}

Okay, so what is so interesting about these, you might be asking right now. 0x7FFFFFFEFFFF marks the end of user-mode memory, so the checks are obviously there to make sure kernel memory doesn't leak to user-mode.

Here's what makes them so interesting: these constants are hard-coded by the operating system and are NOT what the processor actually uses to decide whether a page is accessible from CPL3 or not.

In case you are not familiar with page tables, here’s how virtual memory works:

The lowest 12 bits (&0xFFF) of a virtual address give the offset into the resolved page. The next four 9-bit groups (&0x1FF000, &0x3FE00000, &0x7FC0000000, &0xFF8000000000) give the index of the entry in the page table, page directory, page directory pointer table and page map level 4 respectively. Apart from linking to the next lower level, these entries also contain certain flags like read/write, execute-disable, etc., as you can see from the definitions below.

#pragma pack(push, 1)
typedef union CR3_
{
  uint64_t value;
  struct
  {
    uint64_t ignored_1 : 3;
    uint64_t write_through : 1;
    uint64_t cache_disable : 1;
    uint64_t ignored_2 : 7;
    uint64_t pml4_p : 40;
    uint64_t reserved : 12;
  };
} PTE_CR3;

typedef union VIRT_ADDR_
{
  uint64_t value;
  void *pointer;
  struct
  {
    uint64_t offset : 12;
    uint64_t pt_index : 9;
    uint64_t pd_index : 9;
    uint64_t pdpt_index : 9;
    uint64_t pml4_index : 9;
    uint64_t reserved : 16;
  };
} VIRT_ADDR;

typedef uint64_t PHYS_ADDR;

typedef union PML4E_
{
  uint64_t value;
  struct
  {
    uint64_t present : 1;
    uint64_t rw : 1;
    uint64_t user : 1;
    uint64_t write_through : 1;
    uint64_t cache_disable : 1;
    uint64_t accessed : 1;
    uint64_t ignored_1 : 1;
    uint64_t reserved_1 : 1;
    uint64_t ignored_2 : 4;
    uint64_t pdpt_p : 40;
    uint64_t ignored_3 : 11;
    uint64_t xd : 1;
  };
} PML4E;

typedef union PDPTE_
{
  uint64_t value;
  struct
  {
    uint64_t present : 1;
    uint64_t rw : 1;
    uint64_t user : 1;
    uint64_t write_through : 1;
    uint64_t cache_disable : 1;
    uint64_t accessed : 1;
    uint64_t dirty : 1;
    uint64_t page_size : 1;
    uint64_t ignored_2 : 4;
    uint64_t pd_p : 40;
    uint64_t ignored_3 : 11;
    uint64_t xd : 1;
  };
} PDPTE;

typedef union PDE_
{
  uint64_t value;
  struct
  {
    uint64_t present : 1;
    uint64_t rw : 1;
    uint64_t user : 1;
    uint64_t write_through : 1;
    uint64_t cache_disable : 1;
    uint64_t accessed : 1;
    uint64_t dirty : 1;
    uint64_t page_size : 1;
    uint64_t ignored_2 : 4;
    uint64_t pt_p : 40;
    uint64_t ignored_3 : 11;
    uint64_t xd : 1;
  };
} PDE;

typedef union PTE_
{
  uint64_t value;
  VIRT_ADDR vaddr;
  struct
  {
    uint64_t present : 1;
    uint64_t rw : 1;
    uint64_t user : 1;
    uint64_t write_through : 1;
    uint64_t cache_disable : 1;
    uint64_t accessed : 1;
    uint64_t dirty : 1;
    uint64_t pat : 1;
    uint64_t global : 1;
    uint64_t ignored_1 : 3;
    uint64_t page_frame : 40;
    uint64_t ignored_3 : 11;
    uint64_t xd : 1;
  };
} PTE;
#pragma pack(pop)
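
To make the masks above concrete, here is a minimal sketch that splits a virtual address with plain masks and shifts. The `VaParts` struct and the `SplitVa` function are hypothetical names introduced only for this illustration; they are not part of the injector's source.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type: the four table indices plus the page offset.
struct VaParts
{
  uint64_t pml4_index;
  uint64_t pdpt_index;
  uint64_t pd_index;
  uint64_t pt_index;
  uint64_t offset;
};

// Split a 48-bit virtual address using the same masks quoted in the text.
constexpr VaParts SplitVa( uint64_t va )
{
  return VaParts{
    ( va & 0xFF8000000000ull ) >> 39,  // page map level 4 index
    ( va & 0x7FC0000000ull ) >> 30,    // page directory pointer table index
    ( va & 0x3FE00000ull ) >> 21,      // page directory index
    ( va & 0x1FF000ull ) >> 12,        // page table index
    va & 0xFFFull                      // offset into the 4 KiB page
  };
}
```

For example, `SplitVa( 0x7FFFFFFEFFFF )` yields a PML4 index of 0xFF and an offset of 0xFFF, which is the very last byte the hard-coded checks still consider user-mode.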

The flag that interests us is .user: the user/supervisor flag decides whether a memory region is accessible from user-mode. So in contrast to what people think, the microcode for these checks would be something like this:

Pte->user & Pde->user & Pdpte->user & Pml4e->user

instead of

Va >= 0xFFFF800000000000
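
The gap between the two checks can be modelled in a few lines. This is only a sketch under simplifying assumptions: the `Entry` struct and both function names are made up for the illustration, large pages are ignored, and other hardware features that affect access rights are left out.

```cpp
#include <cassert>
#include <cstdint>

// Simplified view of one paging entry: only the two bits this sketch needs.
struct Entry
{
  bool present;
  bool user;
};

// Software-style check, as in the decompiled snippets: compare against a constant.
constexpr uint64_t kHighestUserAddress = 0x7FFFFFFEFFFFull;

constexpr bool OsRangeCheckPasses( uint64_t va )
{
  return va <= kHighestUserAddress;
}

// Hardware-style check: every level of the walk must be present and user-accessible.
constexpr bool CpuAllowsUserAccess( Entry pml4e, Entry pdpte, Entry pde, Entry pte )
{
  return pml4e.present && pdpte.present && pde.present && pte.present
      && pml4e.user && pdpte.user && pde.user && pte.user;
}
```

In this model an address in the upper half fails `OsRangeCheckPasses`, yet `CpuAllowsUserAccess` still returns true once all four user bits are set, and that mismatch is all we need.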

Doesn’t this sound abusable to you? Because it definitely is. In our case, we will use it to create a page that is invisible to all user-mode APIs, which is as simple as:

BOOL ExposeKernelMemoryToProcess( MemoryController& Mc, PVOID Memory, SIZE_T Size, uint64_t EProcess )
{
  Mc.AttachTo( EProcess );

  BOOL Success = TRUE;

  Mc.IterPhysRegion( Memory, Size, [ & ] ( PVOID Va, uint64_t Pa, SIZE_T Sz )
  {
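    auto Info = Mc.QueryPageTableInfo( Va );

    // Fail before touching any entry if the walk did not resolve or the PTE is not present
    if ( !Info.Pde || ( Info.Pte && ( !Info.Pte->present ) ) )
    {
      Success = FALSE;
    }
    else
    {
      // The CPU requires the user bit at every level of the walk
      Info.Pml4e->user = TRUE;
      Info.Pdpte->user = TRUE;
      Info.Pde->user = TRUE;

      // Pte can be null (presumably when a large page ends the walk at the PDE)
      if ( Info.Pte )
        Info.Pte->user = TRUE;
    }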
  } );

  Mc.Detach();

  return Success;
}

PVOID Memory = AllocateKernelMemory( CpCtx, KrCtx, Size );
ExposeKernelMemoryToProcess( Controller, Memory, Size, Controller.CurrentEProcess );
ZeroMemory( Memory, Size );

Voila, now we have our super-secret page.
(I am using a wrapper I previously made for physical memory access, so if you want to see how the linear translation or the resolution of page table entries is implemented, you can check that out.)

0x2: Abusing Copy-on-Write

Now that we are done with hiding the memory, all that is left is to actually execute it, and to do that we will be abusing Copy-on-Write.

CoW is a technique used by operating systems to save memory by making processes share certain physical memory regions until they actually get modified.

We know that ntdll.dll gets loaded for every process and its code (.text) region is rarely modified if at all, so why allocate physical memory for it again and again for hundreds of processes? That is exactly why modern operating systems use the technique called CoW.

The implementation is very simple:

  1. When a PE file gets mapped, if it is already mapped in some other process and its VA is free in the current process as well, simply copy the PFN and set the flag to make the page read-only.
  2. When a page fault occurs due to an instruction trying to write to the page, allocate new physical memory, set the PFN of the PTE and remove the read-only flag.

This means that when we hook the DLL by writing through physical memory, no page fault ever occurs and the shared page itself gets modified, so we actually end up creating a system-wide hook (for every process still sharing the original page).
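
The two steps, and why a physical write sidesteps them, can be shown with a toy model. Everything here (`ToyPte`, `ToyMachine`, one-byte "frames") is a hypothetical simplification for illustration and has nothing to do with how the Windows memory manager is actually written.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model: "physical memory" is a list of one-byte frames; each process has one PTE.
struct ToyPte
{
  size_t frame;     // index into the frame list (stands in for the PFN)
  bool   writable;  // cleared while the frame is shared
};

struct ToyMachine
{
  std::vector<uint8_t> frames;

  // Step 1: map the shared image frame read-only into a process.
  ToyPte MapShared( size_t frame ) const { return ToyPte{ frame, false }; }

  // Step 2: a virtual write to a read-only page "faults", gets a private copy, then writes.
  void VirtualWrite( ToyPte& pte, uint8_t value )
  {
    if ( !pte.writable )
    {
      uint8_t copy = frames[ pte.frame ];
      frames.push_back( copy );
      pte.frame = frames.size() - 1;
      pte.writable = true;
    }
    frames[ pte.frame ] = value;
  }

  // Writing by physical address never consults the PTE, so no copy is made.
  void PhysicalWrite( size_t frame, uint8_t value ) { frames[ frame ] = value; }

  uint8_t Read( const ToyPte& pte ) const { return frames[ pte.frame ]; }
};
```

A process that already took its private copy through `VirtualWrite` keeps its own byte, while every process still pointing at the shared frame observes whatever `PhysicalWrite` put there.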

How can we hijack a thread with this?

Well, let’s pick a commonly called function and hook it: TlsGetValue.

Now, the PML4E changes from process to process, so the kernel memory we exposed is not accessible from all processes. We therefore need to find some padding in KERNEL32.dll where we can check the pid before jumping to our stub in our lovely kernel page.

The pid check will be very simple:

std::vector<BYTE> PidBasedHook =
{
  0x65, 0x48, 0x8B, 0x04, 0x25, 0x30, 0x00, 0x00, 0x00,        // mov rax, gs:[0x30]
  0x8B, 0x40, 0x40,                                            // mov eax,[rax+0x40] ; pid
  0x3D, 0xDD, 0xCC, 0xAB, 0x0A,                                // cmp eax, TargetPid
  0x0F, 0x85, 0x00, 0x00, 0x00, 0x00,                          // jne 0xAABBCC
  0x48, 0xB8, 0xAA, 0xEE, 0xDD, 0xCC, 0xBB, 0xAA, 0x00, 0x00,  // mov rax, KernelMemory
  0xFF, 0xE0                                                   // jmp rax
};

As PE sections are always 0x1000-aligned in memory, finding 35 bytes of padding will be a piece of cake, as long as we look for 0x00 (page padding) and not 0xCC/0x90 (intra-function padding).

In the execution stub, we will have to do some tricks as well. We only want one thread to execute our code, and we want to unhook TlsGetValue before we continue execution. I also noticed that sometimes the changes in physical memory didn’t instantly have an effect on the instructions executed, and we want to make sure they are applied. So we will implement three checks at the beginning of the stub.
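
As a rough sketch of how the placeholders in the stub could be filled in and how the padding could be located: the function names, the `fallthroughVa` parameter (where execution should go when the pid does not match) and the operand offsets below are my own reading of the 35 bytes above, not code taken from the injector.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Operand offsets inside the 35-byte stub (derived from the byte listing above).
constexpr size_t kPidImmOffset   = 13;  // imm32 of cmp eax, TargetPid
constexpr size_t kJneRelOffset   = 19;  // rel32 of jne
constexpr size_t kJneEndOffset   = 23;  // rel32 is relative to the next instruction
constexpr size_t kKernelVaOffset = 25;  // imm64 of mov rax, KernelMemory
constexpr size_t kNotFound       = static_cast<size_t>( -1 );

// Build the stub for a given location. memcpy of integers assumes a little-endian host.
std::vector<uint8_t> BuildPidHook( uint64_t stubVa, uint32_t targetPid,
                                   uint64_t kernelMemory, uint64_t fallthroughVa )
{
  std::vector<uint8_t> hook =
  {
    0x65, 0x48, 0x8B, 0x04, 0x25, 0x30, 0x00, 0x00, 0x00,
    0x8B, 0x40, 0x40,
    0x3D, 0x00, 0x00, 0x00, 0x00,
    0x0F, 0x85, 0x00, 0x00, 0x00, 0x00,
    0x48, 0xB8, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
    0xFF, 0xE0
  };
  int32_t rel = static_cast<int32_t>( fallthroughVa - ( stubVa + kJneEndOffset ) );
  std::memcpy( &hook[ kPidImmOffset ], &targetPid, sizeof( targetPid ) );
  std::memcpy( &hook[ kJneRelOffset ], &rel, sizeof( rel ) );
  std::memcpy( &hook[ kKernelVaOffset ], &kernelMemory, sizeof( kernelMemory ) );
  return hook;
}

// Offset of the first run of `needed` zero bytes, or kNotFound.
size_t FindZeroPadding( const std::vector<uint8_t>& image, size_t needed )
{
  auto it = std::search_n( image.begin(), image.end(), needed, uint8_t{ 0 } );
  return it == image.end() ? kNotFound : static_cast<size_t>( it - image.begin() );
}
```

Searching only for 0x00 runs mirrors the point above: 0xCC and 0x90 runs sit between functions and are usually too short for the whole stub.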

std::vector<BYTE> Prologue =
{ 
  0x00, 0x00, // data
  0xF0, 0xFE, 0x05, 0xF8, 0xFF, 0xFF, 0xFF,                     // lock inc byte ptr [rip-n]
                                                                // wait_lock:
  0x80, 0x3D, 0xF0, 0xFF, 0xFF, 0xFF, 0x00,                     // cmp byte ptr [rip-m], 0x0
  0xF3, 0x90,                                                   // pause
  0x74, 0xF5,                                                   // je wait_lock

  0x48, 0xB8, 0xAA, 0xEE, 0xDD, 0xCC, 0xBB, 0xAA, 0x00, 0x00,   // mov rax, 0xAABBCCDDEEAA
                                                                // data_sync_lock:
  0x0F, 0x0D, 0x08,                                             // prefetchw [rax]
  0x81, 0x38, 0xDD, 0xCC, 0xBB, 0xAA,                           // cmp dword ptr[rax], 0xAABBCCDD
  0xF3, 0x90,                                                   // pause
  0x75, 0xF3,                                                   // jne data_sync_lock

  0xF0, 0xFE, 0x0D, 0xCF, 0xFF, 0xFF, 0xFF,                     // lock dec byte ptr [rip-n]
  0x75, 0x41,                                                   // jnz continue_exec                         
  0x53,                                                         // --- start executing DllMain ---

The first spinlock, wait_lock, makes sure the threads entering this stub stall until we let them continue from our injector. The second spinlock, data_sync_lock, makes sure the original TlsGetValue bytes are visibly restored before execution continues. The final atomic instruction, lock dec, is the complement of the lock inc at the beginning of the stub: lock inc counted the number of threads waiting in the spinlock, and lock dec atomically decrements this count. If the value hits zero, the zero flag is set, and since the operation is atomic this happens only once, so we check the zero flag to decide whether we execute DllMain or continue execution.

Now that we have all the tricks set up, the implementation of the actual injector is very simple:
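
The same three checks can be modelled with std::atomic to see that exactly one thread ends up running the payload. This is a sketch, not the stub itself: `StubState`, `EnterStub` and `RunModel` are hypothetical names, `hook_removed` stands in for "the original bytes are back", and the model waits for all threads to arrive (rather than for a non-zero count) only to keep the result deterministic.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

struct StubState
{
  std::atomic<uint8_t> free{ 0 };              // released by the injector
  std::atomic<uint8_t> spinning{ 0 };          // threads currently inside the stub
  std::atomic<bool>    hook_removed{ false };  // stands in for the restored bytes
  std::atomic<int>     payload_runs{ 0 };      // how many threads "ran DllMain"
};

void EnterStub( StubState& s )
{
  s.spinning.fetch_add( 1 );                   // lock inc
  while ( s.free.load() == 0 )                 // wait_lock
    std::this_thread::yield();
  while ( !s.hook_removed.load() )             // data_sync_lock
    std::this_thread::yield();
  if ( s.spinning.fetch_sub( 1 ) == 1 )        // lock dec, result is zero only once
    s.payload_runs.fetch_add( 1 );             // this thread runs the payload
  // every other thread falls through to continue_exec
}

// threadCount must fit in the 8-bit counter (at most 255).
int RunModel( int threadCount )
{
  StubState s;
  std::vector<std::thread> threads;
  for ( int i = 0; i < threadCount; ++i )
    threads.emplace_back( EnterStub, std::ref( s ) );
  while ( s.spinning.load() != threadCount )   // model simplification: wait for all
    std::this_thread::yield();
  s.hook_removed = true;                       // unhook first
  s.free = 1;                                  // then release the spinning threads
  for ( auto& t : threads )
    t.join();
  return s.payload_runs.load();
}
```

`fetch_sub` returns the previous value, so comparing it with 1 corresponds to checking the zero flag after lock dec.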

  1. Load vulnerable driver
  2. Map physical memory to user-mode
  3. Search for certain offsets (UniqueProcessId, DirectoryTableBase, ActiveProcessLinks)
  4. Save current EProcess and CR3 values for user-mode use
  5. Allocate enough kernel pool memory for our injector stub and image
  6. Unload vulnerable driver
  7. Map our image to the kernel memory (fix .relocs and create a stub that resolves the imports for us, as I can't be bothered to read EProcess->Peb)
  8. Wait for target process
  9. Expose the kernel page to target process
  10. Hook TlsGetValue system-wide and make it check for pid before jumping to our stub at kernel memory
  11. Wait for Stub->SpinningThreadCount to be non-zero
  12. Unhook TlsGetValue, set Stub->Free = TRUE
  13. Profit.
Almost magic!


Source code: https://github.com/can1357/ThePerfectInjector

Forgive me for the hasty image mapping implementation and any debug code left behind.
This is meant to be a PoC rather than a ready-to-go pasta.

I'm an independent security researcher and a self-employed reverse engineer; mostly interested in Windows kernel development, low-level programming, and pen-testing anti-cheat, anti-debug, anti-re and anti-tampering software but I also occasionally do machine learning research and GPU accelerated programming.

6 Comments

  1. 1337

    You forgot to mention it’s using the Capcom driver to do stuff; I didn’t know that after reading this post.

  2. Kacper Kozera

    What about kernel32 modifications, aren’t anticheats checking for them? Or are they, but the period of time between checks is enough for injection?

    1. Can Bölük (post author)

      They are, but we make the assumption that they can’t check the kernel32 memory within the very small time frame between hook placed <-> hook removed.
