After improving the execution speed by about four times in the last article of this series, we will try out a somewhat different approach. It is called “threading” and allows running the compiled code without the need for any branching inside the main loop. This means our beloved large ‘switch’ will be dismantled.
Unstructured coding
If you know your way around assembly programming, you are probably aware how much of a comfort structured programming languages are. Most of the time, at least. If not, here is a little overview. Assembly pretty much consists of only three things: instructions to be executed, constant data, and labels. These so-called labels are positional markers that name certain sections of your code. They are useful, for example, to refer to an array filled with data by name instead of by a hex address, especially as the actual address might only be decided by the assembler when your source code is converted to machine code.
/* R0= 10;
do {
...
R0--;
} while( R0 ); */
start:
set R0 10
loop:
// ...
dec R0
br not-zero loop
/* if( R1 == 24 ) {
...
} else {
...
} */
start:
sub R1 24
br not-zero doElse
// ... R1 == 24
jmp endIf
doElse:
// ... R1 != 24
endIf:
// ... Code behind if clause
But labels have another, way more important use. They can serve as names for places to jump or branch to during execution. Using this mechanism, you can jump anywhere you want in your code (as long as it’s not WASM). If you look at the code area above, you can see two examples written in pseudo assembly, to keep it simple.
The first one is a loop. Register zero R0 is set to 10. At the end of the loop it is decremented. The following branch instruction br will jump to loop only if the last instruction didn’t yield a zero. Therefore, everything between the label and the branch instruction is executed ten times. It looks like a ‘do-while’ loop and functions like one.
‘If’-clauses look a bit more complicated. This one is supposed to execute one of two different code sections depending on whether the contents of R1 equal the number 24. To do the check, 24 is subtracted from R1. The next instruction branches to the ‘else’ block only if the subtraction didn’t yield a zero. Thus, the block behind the branch can only be reached if R1 contained 24, as 24 - 24 = 0. The ‘else’ block is placed directly behind it and runs into the end of the clause. The ‘if’ block, on the contrary, has to jump over the ‘else’ block to the ‘end’ label.
This exact example is flawed, as the contents of R1 are changed by the check. Usually, CPUs provide a compare instruction that does the subtraction without saving the result.
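With such an instruction the check leaves R1 untouched. Here is the previous example again, assuming our pseudo assembly had a cmp instruction that sets the flags exactly like sub would, but discards the result:

```
start:
cmp R1 24          // sets flags as if computing R1 - 24, but keeps R1 intact
br not-zero doElse
// ... R1 == 24
jmp endIf
doElse:
// ... R1 != 24
endIf:
// ... Code behind if clause
```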
Threaded code
Going back to Brainfuck, our little interpreter has to do several of these checks for every command it evaluates. Instead of containing an ASCII code referring to a command, the array could be filled with the addresses to jump to. This would eliminate all comparison operations and even take away the overhead of a jump table.
In C++ compiled with GCC, using an extension called “labels as values” (a feature not officially part of the language), you can actually do something like this:
#include <cstdio>

// Bytecode representation of some program
const int source[]= { 1, 3, 2, 0, 0, 2, 3, 0, 1, 1, 2, 3, 2, 0, 1, 4 };
const int sourceSize= sizeof(source) / sizeof(int);

int main() {
  // Alloc code arr
  void** code= new void*[ sourceSize ];

  // Convert to threaded code
  for( int i= 0; i != sourceSize; i++ ) {
    switch( source[i] ) {
      case 0: code[i]= &&add1; break;
      case 1: code[i]= &&add2; break;
      case 2: code[i]= &&sub1; break;
      case 3: code[i]= &&shft1; break;
      case 4: code[i]= &&stop; break;
    }
  }

  // Execute
  int pc= -1;
  int accu= 0;
  while( true ) {
    // Move to next instruction and jump to its address
    pc++;
    goto *code[pc];

    add1:
      accu+= 1;
      continue;

    add2:
      accu+= 2;
      continue;

    sub1:
      accu-= 1;
      continue;

    shft1:
      accu= accu << 1;
      continue;

    stop:
      break;
  }

  printf("Result: %d\n", accu);

  // Cleanup
  delete[] code;
}
JavaScriptCore, the JS engine used by WebKit, used this exact mechanism. It is called “direct threading”, as the executed code consists of the addresses to jump to directly. As an address takes 8 bytes on 64-bit machines, JSCore later switched to “indirect threading” because of the memory needed just for the commands. Now the label addresses are stored in an array, and whenever a command gets executed, a lookup is done into said array. As there aren’t that many different commands, the whole array can be indexed by a single byte. Therefore, the memory required for the opcodes (not their operands) can be reduced by a factor of eight.
Although JS has labels, they can only be used in combination with break and continue. There is no goto keyword that would allow a jump to some place, at least inside the same function body. So we need to get a bit creative.
Function based threading
As the heading probably already gave away, the solution is functions. Instead of filling our code array with addresses to jump to, we fill it with functions to call repeatedly. Our ‘compile’ function now needs to push a function to the array instead of opcode and operand pairs.
case ASCII.Plus:
case ASCII.Minus:
case ASCII.Left:
case ASCII.Right:
  // Count repetitions (check the bounds before reading!)
  let n= 0;
  while( i < src.length && src.readUInt8(i) === cur ) {
    i++;
    n++;
  }

  switch( cur ) {
    case ASCII.Plus:  funcs.push( function() { data[ptr] += n; } ); break;
    case ASCII.Minus: funcs.push( function() { data[ptr] -= n; } ); break;
    case ASCII.Left:
      funcs.push( function() {
        if( ptr < n ) { // moving n cells left would go below cell 0
          throw Error('Pointer underflow');
        }
        ptr-= n;
      });
      break;

    case ASCII.Right:
      funcs.push( function() {
        ptr+= n;
        while( ptr >= data.length ) {
          const zeros= Array(10).fill(0);
          data.push( ...zeros );
        }
      });
      break;
  }

  i--;
  break;
After counting the repetitions, a different function is pushed to the array for each command. Also, the loop-end command stores a function at its counterpart’s position instead of an operand.
case ASCII.CBracket:
  if( !loops.length ) {
    throw Error('Unexpected end of loop');
  }

  const x= loops.pop();
  const pos= funcs.length;

  // Loop start: skip past the loop end if the current cell is zero
  funcs[x]= function() {
    if( !data[ptr] ) {
      instr= pos;
    }
  };

  // Loop end: jump back to the loop start if the current cell is non-zero
  funcs.push( function() {
    if( data[ptr] ) {
      instr= x;
    }
  });
  break;
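To see the pair in action, here is a minimal standalone model (the variable names are assumed, matching the interpreter above) of the compiled loop [-] running on a cell that holds 3:

```javascript
let data= [ 3 ];
let ptr= 0;
let instr= 0;

const funcs= [];
funcs[0]= function() { if( !data[ptr] ) instr= 2; };  // '[' skips past ']'
funcs[1]= function() { data[ptr] -= 1; };             // '-' decrements the cell
funcs[2]= function() { if( data[ptr] ) instr= 0; };   // ']' jumps back to '['

for( instr= 0; instr < funcs.length; instr++ ) {
  funcs[instr]();
}

console.log( data[0] );  // → 0
```

Note that both jump targets are off by one on purpose: the for-loop’s instr++ runs right after each call, moving execution to the slot behind the one that was jumped to.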
The execution loop is also changed, to just call the function at the position currently indexed by the program counter.
function run() {
  console.log('Running...');

  data= [ 0 ];
  ptr= 0;
  str= '';

  for( instr= 0; instr < funcs.length; instr+= 1 ) {
    funcs[instr]();
  }
}
Running the same benchmark as last time, we see that the performance is pretty bad. It’s way slower than the bytecode-based interpreter.

As the compile function creates a new function object for every command appended to the code array, we also create a closure each time. My guess is that this hurts memory usage as well as V8’s ability to optimize that many individual functions.
To change that, we define all our functions once and give them access to the data array, program counter, and cell pointer via a call argument. As JS lacks call-by-reference functionality, all values are stored in a context object representing the current state of execution.

Furthermore, the operand is bundled with the function into an object, which is then appended to the code array instead. During execution this object is unpacked and the function is called with the operand and the context object.
function add( ctx, n ) { ctx.data[ctx.ptr] += n; }
function sub( ctx, n ) { ctx.data[ctx.ptr] -= n; }

function left( ctx, n ) {
  if( ctx.ptr < n ) { // moving n cells left would go below cell 0
    throw Error('Pointer underflow');
  }
  ctx.ptr-= n;
}

// ...

function enterLoop( ctx, n ) {
  if( !ctx.data[ctx.ptr] ) {
    ctx.instr= n;
  }
}
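The compile-side change is not shown above; here is a minimal sketch of it (the emit helper and the record field names fn and data are my own, not from the article’s listing):

```javascript
function add( ctx, n ) { ctx.data[ctx.ptr] += n; }  // shared handler from above

const funcs= [];

// Instead of creating a fresh closure per command, push a small
// record pairing the shared function with its operand
function emit( fn, data ) {
  funcs.push({ fn, data });
}

emit( add, 3 );  // e.g. a run of three '+' commands

// Executing a record: unpack it and pass operand plus context
const ctx= { instr: 0, data: [ 0 ], ptr: 0, str: '' };
const f= funcs[0];
f.fn( ctx, f.data );

console.log( ctx.data[0] );  // → 3
```

Every command now reuses one of a handful of shared function objects, so the engine only has that handful to optimize.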
The main interpreter loop now calls every function with this object as a parameter.
function run() {
  console.log('Running...');

  const ctx= {
    instr: 0,
    data: [ 0 ],
    ptr: 0,
    str: ''
  };

  for( ctx.instr= 0; ctx.instr < funcs.length; ctx.instr++ ) {
    const f= funcs[ctx.instr];
    f.fn( ctx, f.data );
  }
}
Aaaaaaaand… the performance still isn’t great. Although it improved, it is still slower than version 2 from last time.
But still, it was a great experiment, and writing this article took about three times as long as writing the code. Next time we get back on track, finding faster solutions instead of slower ones! 😊