Cooperative User-Level Threads (Coroutines)

CS 301: Assembly Language Programming Lecture, Dr. Lawlor

Don't call me, I'll call you.
-- Interface to every library ever invented.

Most of the time, different parts of your program cooperate by calling functions. The problem is that it's not always obvious who should call whom. For example, if the syntax tree traversal code calls the code optimizer, the optimizer has to take those calls from out of the blue; if the optimizer calls the tree traversal, the traversal needs to do the same thing. If both optimizer and traversal have complicated state, including nested functions, it's not easy to break up either flow into simple function arrival points.

One solution to this problem is to use threads, but threads have a number of drawbacks including 100+ nanosecond costs for nearly every thread operation, and OS-controlled and indeterminate switching times that complicate close coordination and result in hairy bugs.

I've always had a fondness for coroutines, also known as user-level threads, because:

Each coroutine has its own stack, so you can get three calls deep and jump back to the other coroutine without actually returning. This is great for things like depth-first tree traversals, which are naturally implemented using recursion.
Context switch times are excellent, about 12 nanoseconds per switch on a modern 64-bit Sandy Bridge machine. OS context switch times are about 100x worse than this.
You can build them yourself with just a little bit of assembly language.

Here's the 64-bit linux assembly used to switch between coroutines. Note that because this is a function call interface, we don't need to save any scratch registers.

; swap64(old,new): switch coroutines
global swap64
swap64:

; Save preserved registers to old stack
push rdi
push rbp
push rbx
push r12
push r13
push r14
push r15

; Save old stack pointer
mov [rdi],rsp
; Load new stack pointer
mov rsp,[rsi]

; Restore preserved regs from new stack
pop r15
pop r14
pop r13
pop r12
pop rbx
pop rbp
pop rdi
ret

(Try this in NetRun now!)

Here's the C++ side. The only complicated piece is creating a new coroutine, by pushing everything it will need onto the stack, so the "pop" and "ret" actually fire up the new coroutine.

typedef void (*coroutine_run_t)(void *arg);

void coroutine_exit(void) {
	std::cout<<"Coroutine has exited.  Goodbye.\n";
	exit(0);
}

class coroutine {
public:
	long *stack; // top of coroutine's stack
	long *stack_alloc; // allocated memory for stack
	
	// Used to make main into a coroutine
	coroutine() {
		stack=0;
		stack_alloc=0;
	}
	// Used to create a new coroutine
	coroutine(coroutine_run_t run,void *arg,int stacksize=1024*1024) {
		stack_alloc=new long[stacksize/sizeof(long)];
		stack=&stack_alloc[stacksize/sizeof(long)-1]; // top of stack
		*(--stack)=(long)coroutine_exit; // coroutine cleanup 
		*(--stack)=(long)run; // user's function to run (rop style!)
		*(--stack)=(long)arg; // user's function argument (rdi)
		for (int saved=0;saved<6;saved++)
			*(--stack)=0xdeadbeef; // initial values for saved registers
	}
	// Cleanup
	~coroutine() {
		delete[] stack_alloc;
	}
};

extern "C" void swap64(coroutine *old_co,coroutine *new_co);

coroutine *main_co;
coroutine *sub_co;

void sub(void *arg) {
	int i; // random local, to see stack pointer
	std::cout<<"  Inside sub-coroutine.  Stack="<<&i<<endl;
	swap64(sub_co,main_co); // back to main
	std::cout<<"  Back in sub.  Stack="<<&i<<endl;
	swap64(sub_co,main_co); // back to main
}

long foo(void) {
	int i;
	std::cout<<"Making coroutines.  Stack="<<&i<<std::endl;
	main_co=new coroutine();
	sub_co=new coroutine(sub,0);
	std::cout<<"Switching to coroutine"<<std::endl;
	swap64(main_co,sub_co);
	std::cout<<"Back in main from coroutine.  Stack="<<&i<<std::endl;
	swap64(main_co,sub_co);
	std::cout<<"Any questions?"<<std::endl;
	return 0;
}

(Try this in NetRun now!)

On my machine, this prints:

Making coroutines.  Stack=0x7fffffffe61c
Switching to coroutine
  Inside sub-coroutine.  Stack=0x2aaaab8f2fe4
Back in main from coroutine.  Stack=0x7fffffffe61c
  Back in sub.  Stack=0x2aaaab8f2fe4
Any questions?
Program complete.  Return 0 (0x0)

The only bummer here is the nonportable assembly language.

This program, by contrast, uses no assembly language, but does use the UNIX-specific "ucontext" functions such as swapcontext. It's portable to any UNIX but runs a lot slower, because it does a syscall so each coroutine can have its own set of signals, which I have never desired.

/**
  User-level threads example.
   Threads created using UNIX "ucontext" functions.
*/
#include <stdio.h>
#include <stdlib.h>
#include <ucontext.h> /* for makecontext/swapcontext routines */

/* Tiny little user-level thread library */
typedef void (*threadFn)(void);

/* Create a new user-level thread to run this function */
void makeThread(ucontext_t *T,threadFn fn) {
	getcontext(T);
	/* Allocate a stack for the new thread */
	int stackLen=32*1024;
	char *stack=new char[stackLen];
	printf("New thread stack is at %p\n",stack);
	T->uc_stack.ss_sp=stack;
	T->uc_stack.ss_size=stackLen;
	T->uc_stack.ss_flags=0;
	makecontext(T,fn,0);
}

/* Context structures for two user-level threads. */
ucontext_t A,B;
enum {runTimes=20, delayLen=1*1000*1000};
int total=0;

void runA(void) {
	int somelocal;
	printf("Thread A: stack at %p\n",&somelocal);
	for (int n=0;n<runTimes;n++) {
		for (int i=0;i<delayLen;i++) {}
		printf("A (it %d)  (total %d)\n",n,total);
		total++;
		swapcontext(&A,&B); /* switch from thread A to thread B */
	}
	printf("Thread A finished.  Exiting program.\n");
	exit(1);
}

void runB(void) {
	int somelocal;
	printf("Thread B: stack at %p\n",&somelocal);
	for (int n=0;n<runTimes;n++) {
		for (int i=0;i<delayLen;i++) {}
		printf("     B (it %d)  (total %d)\n",n,total);
		total++;
		swapcontext(&B,&A); /* switch from thread B to thread A */
	}
}

long foo() {
	makeThread(&A,runA);
	makeThread(&B,runB);
	setcontext(&A); /* Start running thread A first */
	return 0;
}

(Try this in NetRun now!)

If you use this in a real project, you should use a real library like boost::coroutines or libconcurrency.

I did a variety of benchmarks for user level and kernel level threads in this 2006 paper.