Distributed-Systems Exploits Software-Analysis Software-Engineering Web3 Windows-Internals X86
X86

Speculating the entire x86-64 Instruction Set In Seconds with This One Weird Trick

As cheesy as the title sounds, I promise it cannot beat the cheesiness of the technique I’ll be telling you about in this post. The morning I saw Mark Ermolov’s tweet about the undocumented instruction reading from/writing to the CRBUS, I had a bit of free time in my hands and I knew I had to find out the opcode so I started theory-crafting right away. After a few hours of staring at numbers, I ended up coming up with a method of discovering practically every instruction in the processor using a side(?)-channel. It’s an interesting method involving even more interesting components of the processor so I figured I might as well write about it, so here it goes.

You can find the full data-set and the implementation source at haruspex.can.ac / Github.

Preface 1/2: If storks are busy delivering babies, where do micro-instructions come from?

Modern processors are built with a crazy amount of microarchitectural complexity these days and as one would expect the good old instruction decoders no longer decode for the execution unit directly. They end up decoding them into micro-instructions according to the micro-code of the processor to dispatch to the execution ports of the processor. There are two units that perform this translation in a modern Intel processor:

  1. Micro-instruction Translation Engine (MITE). The unit responsible for the translation of simple legacy instructions that translate to four or less micro-instructions.
  2. Microcode Sequencer (MS). The unit responsible for the translation of much more complex instructions that drive the CISC craziness of the Intel architecture we all hold dearly.

Another unit that dispatches these micro-instructions is the Decoded Stream Buffer (DSB) more frequently known as the iCache but it is not really relevant to the experiment we’re going to do. Now, why am I telling you all of this? Mainly because we can profile these extremely low-level units thanks to the performance counters Intel gracefully provides us; mainly these two bad boys:

One of the advantages of utilizing these events compared to a Sandsifter-like approach is that even if the microcode of the instruction throws a #UD (say if the password does not match or if the conditions are not met) it cannot fool us as it’d still have to be decoded.

Preface 2/2: Executing in a parallel universe

Now the problem is getting those instructions there. There are a million of ways you can shoot yourself in the foot by executing random instructions. What happens if we hit an instruction that changes our interruptibility state, causes a system reset, messes with the internal caches or the very performance counters we’re using, the stack we’re executing on, the code of the profiler, the code segment we’re executing in…

The easy solution is to simply, not execute, at least in our visible universe, which brings us to our second cool microarchitectural detail: speculative and out-of-order execution. The processor evaluating the branch condition at the branch has apparently grown out of fashion so what really happens when you do a branch is that, first, the branch prediction engine attempts to guess the branch you are going to take to reduce the cost of the reversal of the instruction pipelines, given an equal possibility both branches execute at the same time where possible and no speculation fences are present and then one of them gets their net change reverted… which is pretty much what we wanted isn’t it? Now although probing the branch to reset the branch prediction state is possible and I’ve experimented with it before, it’s a bit bothersome so an easier way is to simply do a CALL. Consider this snippet:

1call x
2	<speculated code>
3x:
4	lea        rax,     [rip+z]
5	xchg       [rsp],   rax
6	ret
7z:

This will cause the execution of the speculated code and the invoked subroutine out-of-order if possible. The XCHG may seem a bit overkill compared to a simpler solution popping the stack but as far as my experiments went, the processor is too smart to split the execution if the routine is non-returning so we need to feed the branch target buffer what it wants. I’ve used XCHG here instead of a MOV since it implies LOCK which will cause the processor to end up in a memory stall given that it has to end up visible given the atomicity –or at least so I theorize.

We will also need to trash the Decoded Stream Buffer I’ve mentioned previously so that we do not get any micro-instructions fed from the cache but there’s a very simple solution for that. The instruction cache also has to handle self-rewriting code so execute memory is extremely sensitive to any memory writes, given this information adding the simple snippet below before every measurement takes care of this issue.

1auto ip = ia32::get_ip() & ~63ull;
2std::copy_n( ( volatile uint8_t* ) ip, 512, ( volatile uint8_t* ) ip );
3ia32::sfence();
4for ( size_t n = 0; n != 512; n += 64 )
5	ia32::clflush( ip + n );

0x0: Precise data collection

Finally, we’ve hit the exciting part of the experiment, experimenting. We want precise results which implies non-interruptibility. You might be tricked into thinking being in kernel-mode and doing a CLI solves this problem but this does not really work that way in reality. The first thing that worries me is an #SMI being delivered, although I keep hearing it freezes PMCs, as far as I’ve experimented it really doesn’t do a great job at that, but even if it did it’s still an impurity we have to eliminate so I’ll repeat the experiment until the mighty IA32_MSR_SMI_COUNT stays constant during the execution. #NMI is the other bugger, so setting the interrupt handler so that it signals a retry solves this problem (since I’m too lazy to unwind). #MC would also be considered in this category but at that point might as well let it all burn.

Repeating the experiment multiple times and picking the Mod(x), and writing the code in a tight manner eliminates pretty much every other problem we’ve left. The next step is simply writing the code and actually collecting the data. The speculated code will be 15 copies of NOP and a 0xCE at the end to cause an actual #UD and halt the speculative execution. We’ll be trying pretty much every opcode in the range of 0x00-0xFF and they will optionally take a prefix of 0x0F and optionally another prefix in the set { 0x66, 0xF2, 0xF3 } since Intel likes to use them to create new opcodes out of thin-air (e.g. the recent FRED instruction ERETS with its F2 0F 01 CA encoding). We also need to add a suffix for discovering ModR/M variants. This process completes in mere seconds, which gave the title to this post.

0x1: Reducing the results

First of all, we’ll get two baseline measurements, the NOPs left as is, and the 0xCE as the first opcode which will reveal the counter values for a complete execution and the for-real-#UD case (I’ve tested other opcodes, 0xCE really isn’t an NSA backdoor, it’s the real deal as far as #UD’s go.).

Simply removing all measurements matching the for-real-#UD case measurement gets rid of most of the garbage, now we need to get rid of the redundant prefixes, taking a look at the data below you can see a pattern emerge:

indexdecodingmitsms
90nop5480*
6690data16 nop5367
f290nop5380
f20f90seto byte ptr [rax-0x6f6f6f70]4880

Every nop translates to a single micro-instruction that will be handled by the MITE, which means if the prefix is redundant MITS should be always off by just one and MS should stay the same, additionally, we can filter out some redundant checks by declaring 0x0F never redundant. Combining both we get rid of most of the redundancies in a simple fashion and even might be able to calculate the instruction length, neat! The code below gets rid of 54954 entries.

 1const propertyMatch = (i1, i2) => {
 2	return i2.ms == i1.ms && i2.outOfOrder == i1.outOfOrder && i2.iclass == i1.iclass;
 3};
 4
 5// Purge redundant prefixes.
 6//
 7for (const k1 of Object.keys(instructions)) {
 8	// Skip if already deleted.
 9	//
10	const i1 = instructions[k1];
11	if (!i1) {
12		continue;
13	}
14
15	// Iterate each prefix (apart from 0f):
16	//
17	for (const pfx of prefixList) {
18		// If the instruction exists:
19		//
20		const k2 = pfx + k1;
21		if (k2 in instructions) {
22			// If the instruction has matching properties as the derived from parent, delete the entry.
23			//
24			const i2 = instructions[k2];
25			if (propertyMatch(i1, i2)) {
26				// MITS#1 == MITS#2 can indicate same instruction if instruction halts.
27				// Otherwise MITS#1 has to be one more than MITS#2 since it should execute one more NOP.
28				//
29				if (i1.mits != i2.mits) {
30					if (i1.mits != i2.mits + 1) {
31						continue;
32					}
33				} else if (i1.mits > faultBaseline.mits) {
34					continue;
35				}
36				delete instructions[k2];
37			}
38		}
39	}
40}

Suffix-based purging is also more or less the same logic which gets rid of 72869 instructions, leaving us with 1699 entries, which is good enough to start the analysis!

 1// Purge redundant suffixes.
 2//
 3for (const k1 of Object.keys(instructions)) {
 4	// Skip if already deleted or not relevant.
 5	//
 6	const i1 = instructions[k1];
 7	if (!i1 || k1.length <= 2) {
 8		continue;
 9	}
10
11	// Find maching entries:
12	//
13	for (const k2 of Object.keys(instructions)) {
14		// If it is matching except the last byte:
15		//
16		if (k2.startsWith(k1.substr(0, k1.length - 2)) && k2 != k1) {
17			// If it has matching properties ignoring the length, erase it
18			//
19			const i2 = instructions[k2];
20			if (propertyMatch(i1, i2)) {
21				delete instructions[k2];
22			}
23		}
24	}
25}

0x2: Deduction of behavior

Let’s demonstrate the amazing amount of information we’ve gathered just from these two counters and some bytes existing at some fixed place without even being executed. If MS is below the nop baseline, this would indicate that the control flow is interrupted meaning that it must be a branch or an exception and if the MITS is the same as the fault-baseline, this likely indicates a serializing instruction which dispatched a number of micro-instructions (given that it passed our initial filter of MS or MITS not remaining same) but then halted the speculative flow (since none of the NOP opcodes were decoded).

indexdecodingmitsmsserializingspeculationFence
668690xchg byte ptr [rax-0x6f6f6f70], dl4788truefalse
6cinsb39112truetrue
6dinsd3999truetrue
6eoutsb3998truetrue
6foutsd3998truetrue
8e90mov ss, word ptr [rax-0x6f6f6f70]4286truetrue
c290ret 0x909043107truetrue*
c3ret41106truetrue*
ca90ret far 0x909039145truetrue*
cbret far39145truetrue
ccint33994truetrue
cd90int 0x903991truetrue
cfiretd39136truetrue
e490in al, 0x9039110truetrue
e590in eax, 0x9039110truetrue
e690out 0x90, al39110truetrue
e790out 0x90, eax39110truetrue
ecin al, dx39109truetrue
edin eax, dx39109truetrue
eeout dx, al39109truetrue
efout dx, eax39109truetrue
f1int139112truetrue
f4hlt39124truetrue
0f0090lldt word ptr [rax-0x6f6f6f70]4793truetrue
0f0098ltr word ptr [rax-0x6f6f6f70]39110truetrue
0f0080sldt word ptr [rax-0x6f6f6f70]4787truefalse
0f0081sldt word ptr [rcx-0x6f6f6f70]4787truefalse
0f0088str word ptr [rax-0x6f6f6f70]4787truefalse
0f00a0verr word ptr [rax-0x6f6f6f70]4791truetrue
0f00a8verw word ptr [rax-0x6f6f6f70]4791truetrue
0f00d8ltr ax39108truetrue
0f0190lgdt ptr [rax-0x6f6f6f70]4794truetrue
0f0198lidt ptr [rax-0x6f6f6f70]4794truetrue
0f0180sgdt ptr [rax-0x6f6f6f70]4789truefalse
0f0188sidt ptr [rax-0x6f6f6f70]4788truefalse
0f01b0lmsw word ptr [rax-0x6f6f6f70]39103truetrue
0f01b8invlpg byte ptr [rax-0x6f6f6f70]39114truetrue
0f01a0smsw word ptr [rax-0x6f6f6f70]4785truefalse
f20f22a4mov cr4, rsp39103truetrue
f20f2396mov dr2, rsi39110truetrue
f20f2380mov dr0, rax39109truetrue
f20fc788cmpxchg8b qword ptr [rax-0x6f6f6f70]4695truefalse
f20fc78acmpxchg8b qword ptr [rdx-0x6f6f6f70]4695truefalse
f38690xrelease xchg byte ptr [rax-0x6f6f6f70], dl4788truefalse
f38790xrelease xchg dword ptr [rax-0x6f6f6f70], edx4788truefalse
f38890xrelease mov byte ptr [rax-0x6f6f6f70], dl4784truefalse
f38990xrelease mov dword ptr [rax-0x6f6f6f70], edx4784truefalse
f36crep insb39112truetrue
f36drep insd39112truetrue
f36erep outsb39111truetrue
f36frep outsd39111truetrue
f3a4rep movsb byte ptr [rdi], byte ptr [rsi]43118truetrue*
f3a6rep cmpsb byte ptr [rsi], byte ptr [rdi]43123truetrue*
f3a7rep cmpsd dword ptr [rsi], dword ptr [rdi]43123truetrue*
f3aarep stosb byte ptr [rdi]43125truetrue*
f3acrep lodsb byte ptr [rsi]43106truetrue*
f3adrep lodsd dword ptr [rsi]43106truetrue*
f3aerep scasb byte ptr [rdi]43123truetrue*
f3afrep scasd dword ptr [rdi]43123truetrue*
f30f0082sldt word ptr [rdx-0x6f6f6f70]4687truefalse
f30f0088str word ptr [rax-0x6f6f6f70]4687truefalse
f30f0180sgdt ptr [rax-0x6f6f6f70]4689truefalse
f30f018asidt ptr [rdx-0x6f6f6f70]4688truefalse
f30f01a1smsw word ptr [rcx-0x6f6f6f70]4685truefalse
f30f2190mov rax, dr239107truetrue
f30f22a4mov cr4, rsp39103truetrue
f30f2380mov dr0, rax39109truetrue
f30f238emov dr1, rsi39110truetrue
f30f78903987truetrue
f30f79903987truetrue
f30fc789cmpxchg8b qword ptr [rcx-0x6f6f6f70]4695truefalse
f30fc78fcmpxchg8b qword ptr [rdi-0x6f6f6f70]4695truefalse
f30fc7b0vmxon qword ptr [rax-0x6f6f6f70]39116truetrue
f30fc733vmxon qword ptr [rbx]39119truetrue
f30fc776vmxon qword ptr [rsi-0x70]39120truetrue

Considering how simplistic the conditions are, not bad right?

0x3: Deduction of speculative behavior

You might have noticed the “Speculation fence” indication in the previous data dump. I’ve gotten a bit greedy and wanted to also know if the instructions speculatively execute or if they halt the queue for the sake of side-channeling the results so I went ahead and collected another piece of information, which comes from a rather unexpected performance counter:

You might be wondering what this has to do with anything. This particular counter is very useful given the fact that we don’t do any divisions in our previous code. Why? Because this means adding a division right after the instruction and seeing if the cycles is non-zero will let us know if the speculative execution was halted or not. We achieve this by first squeezing a divps xmm4, xmm5 there right before the #UD‘ing 0xCE and finally we waste some cycles at the non-speculative counterpart to cause a stall giving the speculative code more execution time from the execution port. Changing the previous routine code to the following pretty much gives us the perfect setup:

 1vmovups    ymm0,    [temp]
 2vmovups    ymm1,    [temp]
 3vmovups    ymm2,    [temp]
 4vmovups    ymm3,    [temp]
 5vzeroupper
 6addps      xmm0,    xmm1
 7vaddps     ymm2,    ymm0,    ymm3
 8vaddps     ymm1,    ymm0,    ymm2
 9vaddps     ymm3,    ymm0,    ymm1
10vaddps     ymm0,    ymm0,    [temp]
11vaddps     ymm1,    ymm0,    [temp]
12vaddps     ymm2,    ymm0,    [temp]
13vaddps     ymm3,    ymm0,    [temp]
14vaddps     ymm0,    ymm0,    ymm1
15vaddps     ymm2,    ymm0,    ymm3
16vaddps     ymm1,    ymm0,    ymm2
17vaddps     ymm3,    ymm0,    ymm1
18vaddps     ymm0,    ymm0,    [temp]
19vaddps     ymm1,    ymm0,    [temp]
20vaddps     ymm2,    ymm0,    [temp]
21vaddps     ymm3,    ymm0,    [temp]
22vaddps     ymm0,    ymm0,    ymm1
23vaddps     ymm2,    ymm0,    ymm3
24vaddps     ymm1,    ymm0,    ymm2
25vaddps     ymm3,    ymm0,    ymm1
26vaddps     ymm0,    ymm0,    [temp]
27vaddps     ymm1,    ymm0,    [temp]
28vaddps     ymm2,    ymm0,    [temp]
29vaddps     ymm3,    ymm0,    [temp]
30vaddps     ymm0,    ymm0,    ymm1
31vaddps     ymm2,    ymm0,    ymm3
32vaddps     ymm1,    ymm0,    ymm2
33vaddps     ymm3,    ymm0,    ymm1
34lea        rax,     [rip+z]
35xchg       [rsp],   rax
36ret

AVX to SSE switch, memory stall, register dependencies, atomicity, this one has it all! Marking the entries that have non-zero counters as “speculative friendly” essentially lets us know what instructions we can speculatively execute and leak information from, and as you can see in the next example it seems to work pretty nicely.

These indeed leak data under speculative execution:

indexdecodingmitsmsserializingspeculationFence
8690xchg byte ptr [rax-0x6f6f6f70], dl4888falsefalse
e090loopne 0xffffffffffffff925687falsefalse
fbsti5683falsefalse
fccld5383falsefalse
0f0080sldt word ptr [rax-0x6f6f6f70]4787truefalse
0f0081sldt word ptr [rcx-0x6f6f6f70]4787truefalse
0f0088str word ptr [rax-0x6f6f6f70]4787truefalse
0f00c0sldt eax5185falsefalse
0f00c8str eax5185falsefalse
0f0009str word ptr [rcx]5187falsefalse
0f0180sgdt ptr [rax-0x6f6f6f70]4789truefalse
0f0188sidt ptr [rax-0x6f6f6f70]4788truefalse
0f01a0smsw word ptr [rax-0x6f6f6f70]4785truefalse
0f01a1smsw word ptr [rcx-0x6f6f6f70]4785truefalse
0f01d0xgetbv5188falsefalse
0f01d5xend5184falsefalse
0f01e0smsw eax5184falsefalse
0f010fsidt ptr [rdi]5188falsefalse
0f0140sgdt ptr [rax-0x70]5089falsefalse
0f2098mov rax, cr35188falsefalse
0f2080mov rax, cr05185falsefalse
0f31rdtsc5293falsefalse
0f77emms52111falsefalse
0fa1pop fs5287falsefalse
0fa390bt dword ptr [rax-0x6f6f6f70], edx5186falsefalse
f30f0082sldt word ptr [rdx-0x6f6f6f70]4687truefalse
f30f0088str word ptr [rax-0x6f6f6f70]4687truefalse
f30f0006sldt word ptr [rsi]5087falsefalse
f30f000astr word ptr [rdx]5087falsefalse
f30f0180sgdt ptr [rax-0x6f6f6f70]4689truefalse
f30f018asidt ptr [rdx-0x6f6f6f70]4688truefalse
f30f01a1smsw word ptr [rcx-0x6f6f6f70]4685truefalse
f30f010fsidt ptr [rdi]5088falsefalse
f30f0126smsw word ptr [rsi]5085falsefalse
f30f0140sgdt ptr [rax-0x70]4989falsefalse
f30faed0wrfsbase eax5087falsefalse
f30faed8wrgsbase eax5087falsefalse
f30faec0rdfsbase eax5086falsefalse
f30faec8rdgsbase eax5086falsefalse
f30fb391btr dword ptr [rcx-0x6f6f6f70], edx5086falsefalse
f30fb39fbtr dword ptr [rdi-0x6f6f6f70], ebx5086falsefalse
f30fbb92btc dword ptr [rdx-0x6f6f6f70], edx5086falsefalse
f30fbb94btc dword ptr [rax+rdx*4-0x6f6f6f70], edx4986falsefalse
f30fc789cmpxchg8b qword ptr [rcx-0x6f6f6f70]4695truefalse
f30fc78fcmpxchg8b qword ptr [rdi-0x6f6f6f70]4695truefalse

Yet these do not:

indexdecodingmitsmsserializingspeculationFence
6cinsb39112truetrue
6dinsd3999truetrue
6eoutsb3998truetrue
6foutsd3998truetrue
8e90mov ss, word ptr [rax-0x6f6f6f70]4286truetrue
9dpopfq5587falsetrue
c290ret 0x909043107truetrue
c3ret41106truetrue
c890enter 0x9090, 0x905093falsetrue
ca90ret far 0x909039145truetrue
cbret far39145truetrue
ccint33994truetrue
cd90int 0x903991truetrue
cfiretd39136truetrue
e190loope 0xffffffffffffff926483falsetrue
e490in al, 0x9039110truetrue
e590in eax, 0x9039110truetrue
e690out 0x90, al39110truetrue
e790out 0x90, eax39110truetrue
ecin al, dx39109truetrue
edin eax, dx39109truetrue
eeout dx, al39109truetrue
efout dx, eax39109truetrue
f1int139112truetrue
f390pause5286falsetrue
f4hlt39124truetrue
fdstd5383falsetrue
0f0090lldt word ptr [rax-0x6f6f6f70]4793truetrue
0f0098ltr word ptr [rax-0x6f6f6f70]39110truetrue
0f00a0verr word ptr [rax-0x6f6f6f70]4791truetrue
0f00a8verw word ptr [rax-0x6f6f6f70]4791truetrue
0f00d0lldt ax5191falsetrue
0f00d8ltr ax39108truetrue
0f00e0verr ax5190falsetrue
0f00e8verw ax5190falsetrue
0f0190lgdt ptr [rax-0x6f6f6f70]4794truetrue
0f0198lidt ptr [rax-0x6f6f6f70]4794truetrue
0f01b0lmsw word ptr [rax-0x6f6f6f70]39103truetrue
0f01b8invlpg byte ptr [rax-0x6f6f6f70]39114truetrue
0f01d1xsetbv51117falsetrue
0f01d23987truetrue
0f01d4vmfunc3983truetrue
0f2193mov rbx, dr239107truetrue

0x: Some of the interesting results

Here is the full list of undocumented easter-egg instructions my i7 6850k comes with:

indexdecodingmitsmsserializingspeculationFence
0f01d23987truetrue
0f01c63983truetrue
0f01cc39104truetrue
0f0c9039138truetrue*
0f0efemms52101falsetrue*
0faed03987truetrue
0fc7903987truetrue
660f38833981truetrue
660f38603987truetrue
660f3a803987truetrue
f30f78903987truetrue
f30f79903987truetrue
f30fe7fc7380falsetrue

Contrary to popular belief mov cr2, reg is not serializing.

indexdecodingmitsmsserializingspeculationFence
0f2090mov rax, cr25183falsefalse
0f2098mov rax, cr35188falsefalse
0f2080mov rax, cr05185falsefalse
0f2290mov cr2, rax5187falsetrue*
0f2298mov cr3, rax39161truetrue
0f2299mov cr3, rcx39151truetrue
0f229bmov cr3, rbx39155truetrue
0f2280mov cr0, rax39110truetrue
0f2281mov cr0, rcx39153truetrue
0f22a0mov cr4, rax39103truetrue
0f22a1mov cr4, rcx39120truetrue
0f22a4mov cr4, rsp39104truetrue
660f22a4mov cr4, rsp39103truetrue
f20f22a4mov cr4, rsp39103truetrue
f30f22a4mov cr4, rsp39103truetrue

Despite lacking the CPL check of int imm8, int1 has more logic in the microcode.

indexdecodingmitsmsserializingspeculationFence
ccint33994truetrue
cd90int 0x903991truetrue
f1int139112truetrue*

mov ss is a speculation fence whereas cli isn’t. lss is a speculation fence whereas lgs isn’t.

indexdecodingmitsmsserializingspeculationFence
8890mov byte ptr [rax-0x6f6f6f70], dl4980falsefalse
8990mov dword ptr [rax-0x6f6f6f70], edx4980falsefalse
668890mov byte ptr [rax-0x6f6f6f70], dl4880falsefalse
8a90mov dl, byte ptr [rax-0x6f6f6f70]4980falsefalse
8b90mov edx, dword ptr [rax-0x6f6f6f70]4980falsefalse
8c90mov word ptr [rax-0x6f6f6f70], ss5080falsefalse
8e90mov ss, word ptr [rax-0x6f6f6f70]4286truetrue*
facli5680falsefalse
0fb290lss edx, ptr [rax-0x6f6f6f70]3989truetrue*
0fb590lgs edx, ptr [rax-0x6f6f6f70]4789truefalse

Can Bölük

Can Bölük

Security researcher and reverse engineer. Interested in Windows kernel development, low-level programming, static program analysis and cryptography.