Splitting Data from Code, Forgotten x86 Feature: Segmentation

With the introduction of the sTLB in Intel Nehalem, TLB splitting, once a reliable technique, became a thing of the past. Those who had to hook user-mode code stealthily started looking into hypervisors, specifically EPT violations. However, implementing a hypervisor means writing bloated, platform-dependent code, which is not the best way to go when you are trying to ship software, especially if you are trying to be stealthy: virtualization is rather simple to detect and hard to hide.

This is where segmentation comes into play. Although we have been using the flat CS==SS==DS model for a long time, segmentation has been here since 1978, inactive but functional. The base of CS determines where instructions are fetched from and the base of DS determines where data is read from, which is practically what we wanted from TLB splitting.

Although we will have to disable PatchGuard to use this technique (which is relatively simple), it lets us do many interesting things, such as spoofing return pointers, hooking functions without changing .text, and extensive call instrumentation.

 

We will have to hook a bunch of kernel functions and create additional segments. But before that, let’s talk a bit about the way this works fundamentally:

This technique basically works by creating a shadow module, which we modify instead of the original module. We allocate a memory block equal in size to the original module and copy its contents 1:1. Although the bytes being executed now live in a different memory region, IP does not change, so we do not have to do any relocations. Then we clone the GDT entry of our original CS (whether it’s 0x23 or 0x33) and set the base to Newly Allocated Memory - Real Module Base, which is as simple as:

typedef struct _KGDTENTRY
{
  uint8_t Limit0;
  uint8_t Limit1;
  uint8_t Base0;
  uint8_t Base1;
  uint8_t Base2;
  uint8_t Access;
  uint8_t Limit2 : 4;
  uint8_t Unk : 1;
  uint8_t L : 1;
  uint8_t Db : 1;
  uint8_t Granularity : 1;
  uint8_t Base3;
} KGDTENTRY;

typedef struct _SET_ENTRY_DPC_ARGS
{
  uint16_t EntryId;
  uint64_t Entry;
  NTSTATUS Status;

  uint64_t Error_Trgt;
  uint64_t Error_Base;
  uint64_t Error_Lmt;
} SET_ENTRY_DPC_ARGS;

static void Gdt_SetEntryDpc( KDPC *Dpc, SET_ENTRY_DPC_ARGS* Args, PVOID SystemArgument1, PVOID SystemArgument2 )
{
  uint64_t Backup = DisableWP();

  GDTR Gdtr;
  _sgdt( &Gdtr );
  
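  // Last valid 8-byte slot of this processor's GDT, and the slot we want to overwrite
  uint64_t* Limit = ( uint64_t* ) ( Gdtr.Base + Gdtr.Limit + 1 - 8 );
  uint64_t* Target = ( uint64_t* ) ( Gdtr.Base + Args->EntryId * 8 );
  
  if ( Target > Limit )
  {
    // Requested index lies past the end of the GDT; report instead of writing
    Args->Error_Trgt = ( uint64_t ) Target;
    Args->Error_Base = ( uint64_t ) Gdtr.Base;
    Args->Error_Lmt = Gdtr.Limit;
    Args->Status = GDT_SEG_NOT_PRES;
    Log( "Target (%llx) > Limit (%llx) [%d]\n", ( uint64_t ) Target, ( uint64_t ) Limit, KeGetCurrentProcessorNumber() );
  }
  else
  {
    *Target = Args->Entry;
    Log( "Target (%llx) <= Limit (%llx) [%d]\n", ( uint64_t ) Target, ( uint64_t ) Limit, KeGetCurrentProcessorNumber() );
  }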

  KeSignalCallDpcSynchronize( SystemArgument2 );
  ResetWP( Backup );
  KeSignalCallDpcDone( SystemArgument1 );
}


static NTSTATUS Gdt_SetEntry( uint16_t EntryId, uint64_t Entry )
{
  static SET_ENTRY_DPC_ARGS Args;
  Args.EntryId = EntryId;
  Args.Entry = Entry;
  Args.Status = STATUS_SUCCESS;
  KeGenericCallDpc( Gdt_SetEntryDpc, &Args );
  return Args.Status;
}

static NTSTATUS Gdt_SetupSeg( uint32_t Seg, uint8_t Wow64, uint32_t Base, uint32_t Limit )
{
  BOOLEAN Granularity = Limit > 0xFFFFF;

  if ( Granularity )
    Limit /= 0x1000; // 4 kb

  if ( Limit > 0xFFFFF )
    return GDT_LIM_TOO_BIG;

  uint64_t SegBaseVal = Wow64 ? Gdt_GetEntry( GDT_ENTRY( 0x23 ) ) : Gdt_GetEntry( GDT_ENTRY( 0x33 ) );
  KGDTENTRY* SegBase = ( KGDTENTRY* ) &SegBaseVal; // edit the 8-byte descriptor copy in place

  SegBase->Base0 = ( Base >> 8 * 0 ) & 0xFF;
  SegBase->Base1 = ( Base >> 8 * 1 ) & 0xFF;
  SegBase->Base2 = ( Base >> 8 * 2 ) & 0xFF;
  SegBase->Base3 = ( Base >> 8 * 3 ) & 0xFF;

  SegBase->Limit0 = ( Limit >> 8 * 0 ) & 0xFF;
  SegBase->Limit1 = ( Limit >> 8 * 1 ) & 0xFF;
  SegBase->Limit2 = ( Limit >> 8 * 2 ) & 0xF;

  SegBase->Granularity = Granularity;
  return Gdt_SetEntry( GDT_ENTRY( Seg ), SegBaseVal );
}

Simply copy the value of the original segment, set the base, limit, and granularity accordingly, and then spawn a DPC on each processor that gets the GDT base with sgdt and writes to the specified index. (You might notice that I am not allocating a new GDT; this is because a few of my users were getting weird system freezes when the GDT pointer was replaced.)

Now that we have set up the new segment successfully, we run into our first problem: out-of-bounds IP.
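To make the descriptor arithmetic concrete, here is a minimal user-mode sketch of the same idea. It is an illustration under assumptions, not the driver code: the helper names (shadow_cs_base, linear_address, patch_descriptor, descriptor_base, descriptor_limit) are hypothetical, and the descriptor is treated as a plain 64-bit integer with the standard x86 bit layout instead of the KGDTENTRY bitfield.

```c
#include <assert.h>
#include <stdint.h>

/* CS base chosen so that base + EIP lands inside the shadow copy.
   The subtraction wraps modulo 2^32, which matches 32-bit segment math. */
static uint32_t shadow_cs_base(uint32_t shadow_alloc, uint32_t real_module_base)
{
    return shadow_alloc - real_module_base;
}

/* Linear address fetched for a given EIP under a given CS base. */
static uint32_t linear_address(uint32_t cs_base, uint32_t eip)
{
    return cs_base + eip;
}

/* Patch base, 20-bit limit and the granularity bit into a copy of an
   existing 8-byte descriptor; access byte and other flags stay untouched. */
static uint64_t patch_descriptor(uint64_t desc, uint32_t base,
                                 uint32_t limit20, int granularity)
{
    assert(limit20 <= 0xFFFFF);

    const uint64_t base_mask  = (0xFFFFFFull << 16) | (0xFFull << 56);
    const uint64_t limit_mask = 0xFFFFull | (0xFull << 48);
    const uint64_t g_mask     = 1ull << 55;

    desc &= ~(base_mask | limit_mask | g_mask);
    desc |= (uint64_t)(limit20 & 0xFFFF);              /* Limit 15:0  -> bits 0..15  */
    desc |= (uint64_t)(base & 0xFFFFFF) << 16;         /* Base 23:0   -> bits 16..39 */
    desc |= (uint64_t)((limit20 >> 16) & 0xF) << 48;   /* Limit 19:16 -> bits 48..51 */
    desc |= (uint64_t)(granularity ? 1 : 0) << 55;     /* G           -> bit 55      */
    desc |= (uint64_t)(base >> 24) << 56;              /* Base 31:24  -> bits 56..63 */
    return desc;
}

/* Read the 32-bit base back out of a descriptor. */
static uint32_t descriptor_base(uint64_t desc)
{
    return (uint32_t)((desc >> 16) & 0xFFFFFF) |
           ((uint32_t)((desc >> 56) & 0xFF) << 24);
}

/* Read the raw 20-bit limit back out of a descriptor. */
static uint32_t descriptor_limit(uint64_t desc)
{
    return (uint32_t)(desc & 0xFFFF) |
           ((uint32_t)((desc >> 48) & 0xF) << 16);
}
```

For example, with the module at 0x00400000 and a shadow allocation at an assumed address of 0x02A00000, shadow_cs_base gives 0x02600000, so an EIP of 0x00401000 is fetched from 0x02A01000. If the allocation ends up below the module, the base simply wraps around and the sum still lands in the shadow copy.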

Let’s look at the following example:

Module base (0x400000) /---------------------\  /---------------------\ (CS Base + 0x400000) Shadow module base
                       | ...                 |  | ...                 |
                       | CALL 0x600214       |  | CALL 0x600214       |
                       | MOV [0x312321], EAX |  | MOV [0x312321], EAX |
                       | CALL 0x333333       |  | CALL 0x333333       |
                       | ...                 |  | ...                 |
                       | ...                 |  | ...                 |
                       | ...                 |  | ...                 |
                       | ...                 |  | ...                 |
Module end (0x500000)  \---------------------/  \---------------------/ (CS Base + 0x500000) Shadow module end

This can be an issue when the target IP is lower than the module base (CALL 0x333333) or higher than the module end (CALL 0x600214), because those instructions will execute CS Base + 0x333333 and CS Base + 0x600214 respectively, neither of which we copied.

The case above the module end (CALL 0x600214) is simple to handle. Set the limit of the GDT entry to the module end (offsets are still real IP values, so the limit has to cover the end address rather than just the module size), and when a GPF occurs either restore CS and set IP = IP + Shadow Base - Real Base (and risk leaking the shadow base, as the CPU will push a return pointer that points into the shadow module onto the stack and then continue execution at 0x23:shadowmodule) or resolve the call yourself, like I do below:

static BOOLEAN ResolveCall( ITRAP_FRAME* Frame, UCHAR* Instruction, uint32_t* Target, uint8_t* InstructionSize )
{
  if ( Instruction[ 0 ] != 0xE8 && Instruction[ 0 ] != 0xFF )
    return FALSE;

  hde32s s = { 0 };
  *InstructionSize = hde32_disasm( Instruction, &s );

  if ( Instruction[ 0 ] == 0xFF && s.modrm_reg == 2 )
  {
    if ( s.sib )
    {
      if ( s.modrm_mod == 0 )
        *Target =  *( uint32_t* ) ( ResolveRegisterById( Frame, s.sib_index ) * ( 1 << s.sib_scale ) + s.disp.disp32 );
      else if ( s.modrm_mod == 1 )
        *Target =  *( uint32_t* ) ( ResolveRegisterById( Frame, s.sib_base ) + s.disp.disp32 );
      else
        *Target =  *( uint32_t* ) ( ResolveRegisterById( Frame, s.sib_base ) + ResolveRegisterById( Frame, s.sib_index ) * ( 1 << s.sib_scale ) + s.disp.disp32 );
    }
    else
    {
      if ( s.modrm_mod == 0 )
        *Target =  *( uint32_t* ) ( s.disp.disp32 );
      else if ( s.modrm_mod == 3 )
        *Target = ResolveRegisterById( Frame, s.modrm_rm );
      else if ( s.modrm_mod == 2 || s.modrm_mod == 1 )
        *Target =  *( uint32_t* ) ( ResolveRegisterById( Frame, s.modrm_rm ) + s.disp.disp32 );
    }
    return TRUE;
  }
  else if ( Instruction[ 0 ] == 0xE8 )
  {
    *Target = Frame->Rip + s.imm.imm32 + 5;
    return TRUE;
  }
  return FALSE;
}


static BOOLEAN NTAPI HkOnGpf( ITRAP_FRAME* TrapFrame )
{
  if ( TrapFrame->Cs == SHADOW_HOOK_SEG )
  {
    // CALL | JMP | RET to the outside of segment
    __swapgs();
    SHADOW_MODULE_ENTRY Sme = GetShadowModuleFromRip( PsGetCurrentProcessId(), TrapFrame->Rip );

    if ( Sme.ModuleReal )
    {
      uint64_t RspBackup = TrapFrame->Rsp;
      _enable();
      //Log( "Handling call to the outside of shadow module @ %llx\n", TrapFrame->Rip );
      __try
      {
        uint32_t Destination = 0;
        uint8_t InstructionSize = 0;
        if ( ResolveCall( TrapFrame, TrapFrame->Rip - Sme.ModuleReal + Sme.ModuleShadow, &Destination, &InstructionSize ) )
        {
          uint32_t IsPageMapped = FALSE;

          KIRQL Irql = RsAcquireSpinLockRaiseToDpc( &Rs_ProcessRecordSpinLock );
          DWORD Pid = PsGetCurrentProcessId();
          if ( Rs_ProcessRecordsMaxPid > Pid && Rs_ProcessRecords[ Pid ] )
          {
            for ( int i = 0; i < ARRAYSIZE( Rs_ProcessRecords[ Pid ]->SpoofedProtect ); i++ )
            {
              if ( !Rs_ProcessRecords[ Pid ]->SpoofedProtect[ i ].PageBase )
                break;
              if ( ( Rs_ProcessRecords[ Pid ]->SpoofedProtect[ i ].PageBase & ( ~0xFFF ) ) == ( TrapFrame->Rip & ( ~0xFFF ) ) )
              {
                IsPageMapped = TRUE;
                break;
              }
            }
          }
          RsReleaseSpinLock( &Rs_ProcessRecordSpinLock, Irql );


          TrapFrame->Rsp -= 0x4;
          *( uint32_t* ) TrapFrame->Rsp =
            IsPageMapped
            ? ( TrapFrame->Rip + InstructionSize - Sme.ModuleReal + Sme.ModuleShadow )
            : ( TrapFrame->Rip + InstructionSize );
          TrapFrame->Rip = Destination;
          TrapFrame->Cs = WOW64_SEG;
          //Log( " --> %llx (%d)\n", TrapFrame->Rip, IsPageMapped );

          _disable();
          __swapgs();
          return TRUE;
        }
      }
      __except ( 1 )
      {
      }
      _disable();
      __swapgs();
      TrapFrame->Rsp = RspBackup;
      TrapFrame->Rip -= Sme.ModuleReal;
      TrapFrame->Rip += Sme.ModuleShadow;
      TrapFrame->Cs = WOW64_SEG;
      return TRUE;
    }
    else
    {
      // You are fucked.
    }
    __swapgs();
  }
  return FALSE;
}
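The relative-call branch of ResolveCall comes down to a small piece of arithmetic that can be checked in user mode. The sketch below is an illustration only: resolve_rel32_call is a hypothetical helper, not part of the driver above, and it handles just the unprefixed E8 rel32 form.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode a near relative CALL (opcode E8 followed by a 32-bit displacement)
   located at address `eip`. Returns 1 and fills in the call target and the
   return address on success, 0 if the bytes are not an E8 call. */
static int resolve_rel32_call(const uint8_t *code, size_t len, uint32_t eip,
                              uint32_t *target, uint32_t *return_address)
{
    assert(code != NULL && target != NULL && return_address != NULL);

    if (len < 5 || code[0] != 0xE8)
        return 0;

    /* Assemble the little-endian displacement byte by byte. */
    uint32_t rel = (uint32_t)code[1]
                 | ((uint32_t)code[2] << 8)
                 | ((uint32_t)code[3] << 16)
                 | ((uint32_t)code[4] << 24);

    /* The displacement is relative to the next instruction; unsigned
       wraparound gives the right result for negative displacements too. */
    *return_address = eip + 5;
    *target = *return_address + rel;
    return 1;
}
```

The return address computed here is the value the handler has to place on the user stack itself, since the faulting CALL never completed.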

Now we have the second problem: CALL 0x333333. This is a perfectly valid operation no matter what we do, as the target is technically within the CS boundaries (since module base != 0x0), just not within the shadow module boundaries, so the processor won’t help us here.

To solve this problem we can simply reserve the virtual memory below the lower boundary of the module, before the DLLs required by the target are loaded, by hooking PsMapSystemDlls like so:
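The difference between the two out-of-bounds cases can be summed up in a few lines. This is a sketch under assumptions: classify_target and the enum names are hypothetical, and the segment limit is assumed to be set to the last byte of the module, as in the GPF approach above.

```c
#include <assert.h>
#include <stdint.h>

typedef enum {
    TARGET_IN_SHADOW,    /* inside the copied range: runs from the shadow module   */
    TARGET_FAULTS,       /* offset above the limit: #GP, our handler can fix it up */
    TARGET_SILENT_MISS   /* below the module: no fault, executes bytes we never copied */
} target_kind;

/* Classify a branch target (a plain EIP value) taken while running under
   the shadow CS. An expand-up segment is only checked against its upper
   limit, so offsets below the module base pass the check silently. */
static target_kind classify_target(uint32_t eip, uint32_t module_base,
                                   uint32_t module_end)
{
    assert(module_base < module_end);

    const uint32_t limit = module_end - 1;   /* assumed: limit = last module byte */

    if (eip > limit)
        return TARGET_FAULTS;
    if (eip < module_base)
        return TARGET_SILENT_MISS;
    return TARGET_IN_SHADOW;
}
```

With the module at 0x400000..0x500000 from the diagram, 0x600214 is classified as a fault we get to handle, while 0x333333 is the silent miss that needs a different fix.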

static NTSTATUS NTAPI HkPsMapSystemDlls( PEPROCESS Process, BOOLEAN UseLargePages )
{
  USING_SYMBOL( ZwAllocateVirtualMemory );
  fnNtAllocateVirtualMemory ZwAllocateVirtualMemory = GET_SYMBOL( ZwAllocateVirtualMemory );

  KAPC_STATE Apc;
  KeStackAttachProcess( Process, &Apc );

  MEMORY_BASIC_INFORMATION Mbi = { 0 };

  while ( 1 )
  {
    NTSTATUS Status = VvQueryVirtualMemory
    (
      NtCurrentProcess(),
      ( PUCHAR ) Mbi.BaseAddress + Mbi.RegionSize,
      MemoryBasicInformation,
      &Mbi,
      sizeof( Mbi ),
      0
    );

    if ( Status )
    {
      break;
    }
    else
    {
      struct
      {
        UNICODE_STRING Str;
        wchar_t Buffer[ 1024 ];
      } Buffer;
      RtlZeroMemory( &Buffer, sizeof( Buffer ) );
      VvQueryVirtualMemory( NtCurrentProcess(), Mbi.BaseAddress, 2ull, &Buffer, sizeof( Buffer ), 0ull );

      if ( Buffer.Str.Buffer )
      {
        if ( wcsstr( Buffer.Str.Buffer, L"<process name for simplicity>" ) )
        {
          Log( "<process name> found @ %llx [0x%x bytes] (%ls)\n", Mbi.BaseAddress, Mbi.RegionSize, Buffer.Str.Buffer );
          Log( "Wasting %d MB...!\n", ( ( uint64_t ) Mbi.BaseAddress ) / 1024 / 1024 );
          for ( PUCHAR Page = 0x10000; Page < ( (PUCHAR)Mbi.BaseAddress - 0x20000 ); Page += 0x1000 )
          {
            PVOID Base = Page;
            SIZE_T Size = ( uint64_t ) Mbi.BaseAddress - ( uint64_t ) Page - 0x20000;
            if ( ZwAllocateVirtualMemory( NtCurrentProcess(), &Base, 0ull, &Size, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE ) == 0 )
            {
              Log( "Allocated %llx -> %llx\n", Page, Page + Size );
              break;
            }
          }
          break;
        }
      }

    }
  }


  KeUnstackDetachProcess( &Apc );

  return ( ( fnPsMapSystemDlls ) PsMapSystemDllsHook->OriginalBytes ) ( Process, UseLargePages );
}

EPROCESS is barely set up at this stage and we cannot open a handle to the process, so we have to use KeStackAttachProcess and NtCurrentProcess(). Simply search for the process base using NtQueryVirtualMemory with MemorySectionName, then try allocating the memory below it.

And voila, problem solved! Although this may sound like we will be wasting a lot of memory, I haven’t seen Windows mapping a 32-bit process above the 30 MB mark.
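The range arithmetic of the reservation loop can be tried out without a kernel. The following is a simplified sketch, assuming a 32-bit address space: reserve_below_module, try_reserve_fn and demo_try_reserve are hypothetical names, and the callback stands in for the real ZwAllocateVirtualMemory call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RESERVE_START 0x10000u   /* lowest address the loop tries            */
#define RESERVE_GAP   0x20000u   /* slack left directly below the module     */
#define PAGE_SIZE_4K  0x1000u

/* Hypothetical allocator callback: returns nonzero if [base, base + size)
   could be reserved. */
typedef int (*try_reserve_fn)(uint32_t base, uint32_t size);

/* Walk upward page by page and reserve [page, module_base - gap) at the
   first page where the allocator succeeds. Returns the reserved size, or 0
   if nothing could be reserved. */
static uint32_t reserve_below_module(uint32_t module_base,
                                     try_reserve_fn try_reserve,
                                     uint32_t *reserved_base)
{
    assert(try_reserve != NULL && reserved_base != NULL);

    if (module_base <= RESERVE_START + RESERVE_GAP)
        return 0;

    const uint32_t end = module_base - RESERVE_GAP;
    for (uint32_t page = RESERVE_START; page < end; page += PAGE_SIZE_4K) {
        const uint32_t size = end - page;
        if (try_reserve(page, size)) {
            *reserved_base = page;
            return size;
        }
    }
    return 0;
}

/* Stand-in allocator for demonstration only: pretends that everything
   below 0x30000 is already in use. */
static int demo_try_reserve(uint32_t base, uint32_t size)
{
    (void)size;
    return base >= 0x30000u;
}
```

Trying one large block first and shrinking it from the bottom keeps the reservation to a single region in the common case, at the cost of a few failed attempts when low addresses are already taken.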

Now that we have our shadow module set up, the only thing left is to redirect the control flow to it when needed, which we can achieve by setting the XD (execute-disable) flag on the original page (or by calling NtProtectVirtualMemory and spoofing the protection returned from NtQueryVirtualMemory). We will need to handle paged-out memory (on execution) by ourselves, but apart from that our page fault hook will be relatively simple:

static BOOLEAN NTAPI HkOnPageFault( ITRAP_FRAME* TrapFrame )
{
  uint64_t FaultyPtr = __readcr2();

  if ( TrapFrame->Cs == WOW64_SEG ) // Wow64 NX
  {
    if ( TrapFrame->ExceptionCode & ( 1 << 4 ) &&		// Caused by instruction fetch
       TrapFrame->ExceptionCode & ( 1 << 0 ) )		// Page is present
    {
      if ( TrapFrame->Rip == FaultyPtr )
      {
        __swapgs();
        SHADOW_MODULE_ENTRY Sme = GetShadowModuleFromRip( PsGetCurrentProcessId(), TrapFrame->Rip );
        // Switch to fake CS
        if ( Sme.ModuleReal )
        {
          Log( "Handling call to remapped page of module @ %llx\n", TrapFrame->Rip );
          TrapFrame->Cs = SHADOW_HOOK_SEG;
          __swapgs();
          return TRUE;
        }
        __swapgs();
      }
      else if ( FaultyPtr - TrapFrame->Rip  < 15 )
      {
        __swapgs();
        SHADOW_MODULE_ENTRY Sme = GetShadowModuleFromRip( PsGetCurrentProcessId(), TrapFrame->Rip );
        if ( Sme.ModuleReal )
        {
          Log( "Fixed half instruction failure! %llx %llx\n", FaultyPtr, TrapFrame->Rip );
          PUCHAR From = TrapFrame->Rip;
          PUCHAR To = TrapFrame->Rip - Sme.ModuleReal + Sme.ModuleShadow;
          SIZE_T Size = FaultyPtr - TrapFrame->Rip;
          _enable();
          memcpy( To, From, Size );
          _disable();
          TrapFrame->Cs = SHADOW_HOOK_SEG;
          __swapgs();
          return TRUE;
        }
        __swapgs();
      }
    }
  }
  else if ( TrapFrame->Cs == SHADOW_HOOK_SEG )
  {
    if ( TrapFrame->ExceptionCode & ( 1 << 4 ) &&		// Caused by instruction fetch
       !( TrapFrame->ExceptionCode & ( 1 << 0 ) ) )		// Page is not present
    {
      // Page is not present

      __swapgs();
      _enable();

      SHADOW_MODULE_ENTRY Sme = GetShadowModuleFromRip( PsGetCurrentProcessId(), FaultyPtr );
      if ( Sme.ModuleReal )
      {
        //Log( "Handling paged out memory (%llx)\n", FaultyPtr );

        PUCHAR FaultyAdr = FaultyPtr - Sme.ModuleReal + Sme.ModuleShadow;
        __try
        {
          volatile uint64_t volatile PageIn[ 1 ];
          memcpy( PageIn, ( volatile UCHAR volatile * ) FaultyAdr, 8 );
        }
        __except ( 1 )
        {
        }
        _disable();
        __swapgs();

        return TRUE;
      }

      _disable();
      __swapgs();

      return FALSE;
    }
    else
    {
      // Real exception
      TrapFrame->Cs = WOW64_SEG; // Windows doesnt handle it otherwise...
    }
  }
  return FALSE;
}
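Stripped of the kernel plumbing, the decision logic of the hook above is a function of the selector, the error code, RIP and CR2. The sketch below models only that decision; classify_page_fault and the enum are hypothetical, the shadow module lookup is left out, and the selectors are passed in as parameters because the real values depend on the setup.

```c
#include <assert.h>
#include <stdint.h>

#define PF_PRESENT      (1u << 0)   /* set: page was present (protection fault) */
#define PF_INSTR_FETCH  (1u << 4)   /* set: fault caused by an instruction fetch */
#define MAX_INSN_LEN    15u         /* longest possible x86 instruction */

typedef enum {
    PF_PASS_THROUGH,        /* not ours: let the OS handle it                       */
    PF_ENTER_SHADOW,        /* NX hit exactly at RIP: switch CS to the shadow       */
    PF_ENTER_SHADOW_SPLIT,  /* instruction starts before the NX page and runs into it */
    PF_PAGE_IN_SHADOW       /* not-present fetch while already in the shadow CS     */
} pf_action;

static pf_action classify_page_fault(uint16_t cs, uint32_t error_code,
                                     uint64_t rip, uint64_t cr2,
                                     uint16_t normal_cs, uint16_t shadow_cs)
{
    assert(normal_cs != shadow_cs);

    const int fetch   = (error_code & PF_INSTR_FETCH) != 0;
    const int present = (error_code & PF_PRESENT) != 0;

    if (cs == normal_cs && fetch && present) {
        if (cr2 == rip)
            return PF_ENTER_SHADOW;
        /* Unsigned subtraction: a CR2 below RIP wraps to a huge value. */
        if (cr2 - rip < MAX_INSN_LEN)
            return PF_ENTER_SHADOW_SPLIT;
        return PF_PASS_THROUGH;
    }

    if (cs == shadow_cs && fetch && !present)
        return PF_PAGE_IN_SHADOW;

    return PF_PASS_THROUGH;
}
```

The split case is the one the handler fixes up by copying the first bytes of the instruction into the shadow module before switching CS.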

Now you can detour the shadow module with jmp 0x23:Trampoline and return back with jmp 0xAA:RetPtr or change the instructions as you wish. There’s absolutely no difference except the fact that you need to make the original page no-execute.
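A detour of this kind is just a far jump written into the shadow copy. As a rough sketch, the 32-bit direct far jump is encoded as opcode EA followed by a 4-byte offset and a 2-byte selector; emit_far_jmp32 below is a hypothetical helper that produces those 7 bytes, and the selector values in the usage note are only the ones used as examples in this post.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FAR_JMP_SIZE 7u

/* Emit `jmp far selector:offset` for 32-bit code:
   EA, offset (4 bytes, little-endian), selector (2 bytes, little-endian).
   Returns the number of bytes written, or 0 if the buffer is too small. */
static size_t emit_far_jmp32(uint8_t *out, size_t cap,
                             uint16_t selector, uint32_t offset)
{
    assert(out != NULL);

    if (cap < FAR_JMP_SIZE)
        return 0;

    out[0] = 0xEA;
    out[1] = (uint8_t)(offset & 0xFF);
    out[2] = (uint8_t)((offset >> 8) & 0xFF);
    out[3] = (uint8_t)((offset >> 16) & 0xFF);
    out[4] = (uint8_t)((offset >> 24) & 0xFF);
    out[5] = (uint8_t)(selector & 0xFF);
    out[6] = (uint8_t)((selector >> 8) & 0xFF);
    return FAR_JMP_SIZE;
}
```

Calling emit_far_jmp32 with selector 0x23 and the trampoline address gives the bytes to write into the shadow module; the same call with the shadow selector (0xAA in the example above) gives the jump back.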

There we go, hooks/.text modifications that are invisible from user mode when done correctly, without using any hypervisors. We can also use this for other cool things as I mentioned before which I will be talking about in another post.

 

Potential detection vectors:

  • They can check CS, either with mov ax, cs or call far; both can be avoided, as you only expose the value of CS to the pages you “map” by setting the XD flag, so you can read the code on those pages to see whether such instructions exist.
  • They can try jumping to their own code with different values of CS and see if it generates an exception, which you can avoid by setting up the segment when it’s needed as we get a notification both when we are exiting the shadow section and entering the shadow section.
  • They can read from cs:[0xABCD] directly, which you can avoid by not making your CS value static and following the tip above.
  • They can write to the original page and check if the execution matches, which you can avoid by making the page no-write, catching the exception from kernel, virtualizing the instruction to apply to both pages, incrementing IP and then continuing execution.

Security researcher and reverse engineer; mostly interested in Windows kernel development and low-level programming.
Founder of Verilave Inc.
