Cooperative User-Level Threads (Coroutines)

CS 301: Assembly Language Programming Lecture, Dr. Lawlor

    Don't call me, I'll call you.
                                    -- Interface to every library ever invented.

Most of the time, different parts of your program cooperate by calling functions.  The problem is that it's not always obvious who should call whom.  For example, if the syntax tree traversal code calls the code optimizer, the optimizer has to take those calls from out of the blue; if the optimizer calls the tree traversal, the traversal needs to do the same thing.  If both optimizer and traversal have complicated state, including nested functions, it's not easy to break up either flow into simple function arrival points.

One solution to this problem is to use threads, but threads have a number of drawbacks including 100+ nanosecond costs for nearly every thread operation, and OS-controlled and indeterminate switching times that complicate close coordination and result in hairy bugs.

I've always had a fondness for coroutines, also known as user-level threads, because:

Has its own...
Scratch registers and area of the stack
Coroutine /
Userspace Thread
see below
Registers and stack
OS Kernel Thread
std::thread t(foo);
... and thread-local storage
OS Process
... and memory space, open files, and permissions
OS Container
see docker
... and filesystem, and network configuration
Virtual Machine
see VM software
... and OS, and hardware

Here's the 64-bit linux assembly used to switch between coroutines.  Note that because this is a function call interface, we don't need to save any scratch registers.
; swap64(old,new): switch coroutines
global swap64

; Save preserved registers to old stack
push rdi
push rbp
push rbx
push r12
push r13
push r14
push r15

; Save old stack pointer
mov [rdi],rsp
; Load new stack pointer
mov rsp,[rsi]

; Restore preserved regs from new stack
pop r15
pop r14
pop r13
pop r12
pop rbx
pop rbp
pop rdi

(Try this in NetRun now!)

Here's the C++ side.  The only complicated piece is creating a new coroutine, by pushing everything it will need onto the stack, so the "pop" and "ret" actually fire up the new coroutine.
typedef void (*coroutine_run_t)(void *arg);

void coroutine_exit(void) {
	std::cout<<"Coroutine has exited.  Goodbye.\n";

class coroutine {
	long *stack; // top of coroutine's stack
	long *stack_alloc; // allocated memory for stack
	// Used to make main into a coroutine
	coroutine() {
	// Used to create a new coroutine
	coroutine(coroutine_run_t run,void *arg,int stacksize=1024*1024) {
		stack_alloc=new long[stacksize/sizeof(long)];
		stack=&stack_alloc[stacksize/sizeof(long)-1]; // top of stack
		*(--stack)=(long)coroutine_exit; // coroutine cleanup 
		*(--stack)=(long)run; // user's function to run (rop style!)
		*(--stack)=(long)arg; // user's function argument (rdi)
		for (int saved=0;saved<6;saved++)
			*(--stack)=0xdeadbeef; // initial values for saved registers
	// Cleanup
	~coroutine() {
		delete[] stack_alloc;

extern "C" void swap64(coroutine *old_co,coroutine *new_co);

coroutine *main_co;
coroutine *sub_co;

void sub(void *arg) {
	int i; // random local, to see stack pointer
	std::cout<<"  Inside sub-coroutine.  Stack="<<&i<<endl;
	swap64(sub_co,main_co); // back to main
	std::cout<<"  Back in sub.  Stack="<<&i<<endl;
	swap64(sub_co,main_co); // back to main

long foo(void) {
	int i;
	std::cout<<"Making coroutines.  Stack="<<&i<<std::endl;
	main_co=new coroutine();
	sub_co=new coroutine(sub,0);
	std::cout<<"Switching to coroutine"<<std::endl;
	std::cout<<"Back in main from coroutine.  Stack="<<&i<<std::endl;
	std::cout<<"Any questions?"<<std::endl;
	return 0;

(Try this in NetRun now!)

On my machine, this prints:
Making coroutines.  Stack=0x7fffffffe61c
Switching to coroutine
  Inside sub-coroutine.  Stack=0x2aaaab8f2fe4
Back in main from coroutine.  Stack=0x7fffffffe61c
  Back in sub.  Stack=0x2aaaab8f2fe4
Any questions?
Program complete.  Return 0 (0x0)

If you use this in a real project, you should use a real library like boost::coroutines or libconcurrency.

I did a variety of benchmarks for user level and kernel level threads in this 2006 paper.