x86 String Instructions

CS 301: Assembly Language Programming Lecture, Dr. Lawlor

x86 has some really strange instructions built in, and among the strangest are the "string instructions".
These string instructions read from register al, the low 8 bits of rax:
REP STOSB      Write AL to [RDI] a total of RCX times.
REPE SCASB     Find non-AL byte starting at [RDI] (keep repeating while [rdi]==al, for at most RCX bytes)
REPNE SCASB    Find AL, starting at [RDI] (keep repeating while [rdi]!=al, for at most RCX bytes)

These string instructions read from [rsi], which gets incremented like [rdi]:
REP MOVSB      Move RCX bytes from [RSI] to [RDI]
REPE CMPSB     Find nonmatching bytes in [RDI] and [RSI] (keep repeating while [rdi]==[rsi], for at most RCX bytes)
REPNE CMPSB    Find matching bytes in [RDI] and [RSI] (keep repeating while [rdi]!=[rsi], for at most RCX bytes)



Here's how you use them:
mov rcx,2 ; rep repeats this many times
mov rax,'X' ; stosb stores al
mov rdi,str ; stosb stores data to [rdi]
rep stosb

mov rdi,str
extern puts
call puts
ret

section .data
str:
	db 'lawlor',0,0,0

(Try this in NetRun now!)

    Prints "XXwlor", because stosb has overwritten rcx=2 chars with al='X'
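
The example above reloads rdi before calling puts. Here is a minimal sketch of why, written in the same NetRun style (the label buf is made up for illustration): rep stosb leaves rdi pointing just past the last byte it wrote and counts rcx down to zero, so the distance rdi has moved is the number of bytes stored.

```nasm
mov rcx,2 ; rep repeats this many times
mov rax,'X' ; stosb stores al
mov rdi,buf ; stosb stores data to [rdi]
rep stosb ; afterwards rdi == buf+2 and rcx == 0

mov rax,rdi ; where stosb stopped
mov rdx,buf ; where it started
sub rax,rdx ; return the number of bytes written: 2
ret

section .data
buf:
	db 'lawlor',0,0,0
```

Passing the advanced rdi straight to puts would print only "wlor", which is exactly the trick the scasb examples rely on.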
mov rcx,10 ; rep repeats up to this many times
mov rax,'l' ; scasb compares memory to al
mov rdi,str ; scasb reads memory at [rdi] (and increments rdi)
repe scasb

extern puts
call puts
ret


section .data
str:
	db 'lllllawlor',0,0,0

(Try this in NetRun now!)

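    Prints "wlor". Every iteration increments rdi, including the last one, which hit 'a' ([rdi]!=al) and stopped the loop, so rdi ends up one byte past the 'a'. Note that rdi is passed straight to puts without being reset.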
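
The count left in rcx also tells you how far the scan got. The sketch below, in the same NetRun style, returns the length of the leading run of 'l' bytes instead of printing; the labels run and hit_mismatch are made up for illustration. The jne right after repe scasb tests the flags from the final comparison, which separates "stopped on a mismatch" from "ran out of rcx".

```nasm
mov rcx,10 ; rep repeats up to this many times
mov rax,'l' ; scasb compares memory to al
mov rdi,run ; scasb reads memory at [rdi] (and increments rdi)
repe scasb
jne hit_mismatch ; ZF clear: the last byte examined was not 'l'

mov rax,10 ; all 10 bytes examined were 'l'
ret

hit_mismatch:
mov rax,9
sub rax,rcx ; (10-rcx) bytes were examined, minus the mismatching one
ret

section .data
run:
	db 'lllllawlor',0,0,0
```

Here rcx is 4 when the scan stops (six bytes examined), so this returns 5.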

mov rcx,10 ; rep repeats up to this many times
mov rax,'w' ; scasb compares memory to al
mov rdi,str ; scasb reads memory at [rdi] (and increments rdi)
repne scasb

extern puts
call puts
ret


section .data
str:
	db 'lllllawlor',0,0,0

(Try this in NetRun now!)
    Prints "lor" because the iterations stopped when they hit 'w' ([rdi]==al), leaving rdi one byte past the 'w'.
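
If the byte never shows up, repne scasb simply stops when rcx runs out, so real code should check the flags before trusting rdi. The sketch below is an illustration rather than a library routine, and the labels haystack and not_found are made up. It returns the index of the first 'w', or -1 if none is found within rcx bytes.

```nasm
mov rcx,10 ; search at most this many bytes
mov rax,'w' ; scasb compares memory to al
mov rdi,haystack ; scasb reads memory at [rdi] (and increments rdi)
repne scasb
jne not_found ; ZF clear: rcx ran out before any byte matched

mov rax,rdi ; one byte past the match
mov rdx,haystack
sub rax,rdx ; bytes examined, including the match
sub rax,1 ; index of the matching byte
ret

not_found:
mov rax,-1
ret

section .data
haystack:
	db 'lllllawlor',0,0,0
```

For this data the match is at index 6, so the function returns 6.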

mov rcx,3 ; rep repeats this many times
mov rsi,src ; movsb reads memory here (and increments)
mov rdi,str ; movsb writes memory here (and increments)
rep movsb

mov rdi,str
extern puts
call puts
ret


section .data
src:
	db 'NOPE',0
str:
	db 'lawlor',0,0,0

(Try this in NetRun now!)

    Prints "NOPlor" because rcx==3, so "rep movsb" copied 3 chars from [rsi] to [rdi].
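
The scan and move instructions combine naturally. As a rough sketch (an illustration of the idea, not how any particular C library implements strcpy), the code below uses repne scasb to count the bytes in src including its terminating 0, then rep movsb to copy exactly that many. The label dst and the stack adjustment around puts are additions for this example; the adjustment keeps the stack 16-byte aligned at the call, assuming the snippet itself was entered by an ordinary call.

```nasm
; Step 1: count the bytes in src, including its terminating 0
mov rdi,src ; scasb scans from here
mov rcx,-1 ; effectively no limit on the scan
mov rax,0 ; al = 0, the terminator we are looking for
repne scasb ; rcx drops by one per byte examined
not rcx ; rcx = bytes examined = length+1 (5 here)

; Step 2: copy that many bytes
mov rsi,src ; movsb reads memory here (and increments)
mov rdi,dst ; movsb writes memory here (and increments)
rep movsb

mov rdi,dst
sub rsp,8 ; align the stack for the call
extern puts
call puts
add rsp,8
ret

section .data
src:
	db 'NOPE',0
dst:
	db 'lawlor',0,0,0
```

Because the terminating 0 is copied too, this prints "NOPE" rather than "NOPlor".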

mov rcx,10 ; rep repeats up to this many times
mov rsi,B ; cmpsb reads memory here (and increments)
mov rdi,A ; cmpsb reads memory here (and increments)
repe cmpsb

extern puts
call puts
ret


section .data
A:
	db 'lolnope',0
B:
	db 'lolor',0,0,0

(Try this in NetRun now!)

    Prints "ope" because the repe cmpsb stopped when it hit the 'n' (the first place where [rdi]!=[rsi]), leaving rdi one byte past it.
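
The flags left behind by the final cmpsb say more than "different": cmpsb compares [rsi] against [rdi], so an unsigned conditional jump afterwards tells you which string sorts first. Here is a minimal memcmp-style sketch with the same data as above. The label done and the -1/0/1 return convention are choices made for this example.

```nasm
mov rcx,5 ; compare at most this many bytes
mov rsi,B ; cmpsb reads memory here (and increments)
mov rdi,A ; cmpsb reads memory here (and increments)
repe cmpsb

mov rax,0 ; mov leaves the flags from cmpsb untouched
je done ; all rcx bytes matched: return 0
mov rax,1
ja done ; [rsi] > [rdi] at the first mismatch: return 1
mov rax,-1 ; otherwise [rsi] < [rdi]: return -1
done:
ret

section .data
A:
	db 'lolnope',0
B:
	db 'lolor',0,0,0
```

The first mismatch is 'o' from B against 'n' from A, and 'o' is the larger byte, so this returns 1.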

mov rcx,10 ; rep repeats up to this many times
mov rsi,B ; cmpsb reads memory here (and increments)
mov rdi,A ; cmpsb reads memory here (and increments)
repne cmpsb

extern puts
call puts
ret


section .data
A:
	db 'lawlor was here',0
B:
	db 'yolobrozzz',0,0,0

(Try this in NetRun now!)
    Prints " was here" because the repne cmpsb stopped when it hit the 'r' (the first place where [rdi]==[rsi]), leaving rdi one byte past it.

But is it Fast?

To time a snippet in NetRun, choose Options -> Actions -> Time.

mov rcx,10 ; rep repeats up to this many times
mov rsi,B ; cmpsb reads memory here (and increments)
mov rdi,A ; cmpsb reads memory here (and increments)
repne cmpsb

ret

section .data
A:
	db 'lawlor was here',0
B:
	db 'yolobrozzz',0,0,0

(Try this in NetRun now!)

mov rcx,10 ; rep repeats up to this many times
mov rsi,B ; cmpsb reads memory here (and increments)
mov rdi,A ; cmpsb reads memory here (and increments)
;repne cmpsb

jmp check_first
start:
	mov al,[rsi] ; load byte from rsi
	add rsi,1
	mov dl,[rdi] ; load byte from rdi (into dl, not cl: cl is the low byte of the counter rcx)
	add rdi,1
	cmp al,dl
	je done ; repne == break if equal
	
	sub rcx,1  ; "rep": decrement rcx
	check_first:
		cmp rcx,0
		jne start

done:

ret


section .data
A:
	db 'lawlor was here',0
B:
	db 'yolobrozzz',0,0,0

(Try this in NetRun now!)

16 ns/call for the "repne cmpsb" version
5 ns/call for the hand-written loop

The big lesson is: assume nothing.  Here, a single instruction "repne cmpsb" is about three times slower than a big block of simple mov, add, and cmp instructions, probably because the CPU internally has to translate that weird single "repne cmpsb" into those simpler instructions.  Increasingly, CPUs are optimized for the common stuff, not the weird stuff.

(There are exceptions: "rep movsb" and "rep stosb" have good multi-core cache behavior, and are quite fast on some chips.)

See Dr. Agner Fog's optimization resources for all the gory details.