More PPU coding on the NES/Famicom

After getting sprites working in Lisp on the NES for our “What Remains” project, the next thing to figure out properly is the background tiles. With the sprites you simply have a block of memory you edit at any time, then copy the whole lot to the PPU each frame in one go – the tiles involve a bit more head scratching.

The PPU graphics chip on the NES was designed in a time where all TVs were cathode ray tubes, using an electron gun to build a picture up on a phosphor screen. As this scans back and forth across the screen the PPU is busy altering its signal to draw pixel colours. If you try and alter its memory while its doing this you get glitches. However, its not drawing all the time – the electron gun needs to reset to the top of the screen each frame, so you get a window of time (2273 cycles) to make changes to the PPU memory before it starts drawing the next frame.

(Trying out thematic images and some overlapping text via the display list)

The problem is that 2273 cycles is not very much – not nearly enough to run your game in, and only enough to update approx 192 background tiles per frame as DMA is a slow operation. It took me a while to figure out this situation – as I was trying to transfer an entire screenful in one go, which sort of works but leaves the PPU in an odd state.

The solution is a familiar one to modern graphics hardware – a display list. This is a buffer you can add instructions to at any time in your game, which are then acted on only in the PPU access window. It separates the game code from the graphics DMA, and is very flexible. We might want to do different things here, so we can have a set of ‘primitives’ that run different operations. Given the per-frame restriction the buffer can also limit the bandwidth so the game can add a whole bunch of primitives in one go, which are then gradually dispatched – you can see this in a lot of NES games as it takes a few frames to do things like clear the screen.

There are two kinds of primitives in the what remains prototype game engine so far, the first sets the tile data directly:

(display-list-add-byte 1)
(display-list-add-byte 2)
(display-list-add-byte 3)
(display-list-end-packet prim-tile-data 0 0 3)

This overwrites the first 3 tiles at the top left of the screen to patterns 1,2 and 3. First you add bytes to a ‘packet’, which can have different meanings depending on the primitive used, then you end the packet with the primitive type constant, a high and low 16 bit address offset for the PPU destination, and a size. The reason this is done in reverse is that this is a stack, read from the ‘top’ which is a lot faster – we can use a position index that is incremented when writing and decremented when reading.

We could clear a portion of the screen this way with a loop (a built in language feature in co2 Lisp) to add a load of zeros to the stack:

(loop n 0 255 (display-list-add-byte 0))
(display-list-end-packet prim-tile-data 0 0 256)

But this is very wasteful, as it fills up a lot of space in the display list (all of it as it happens). To get around this, I added another primitive called ‘value’ which does a kind of run length encoding (RLE):

(display-list-add-byte 128) ;; length
(display-list-add-byte 0) ;; value
(display-list-end-packet prim-tile-value 0 0 2)

With just 2 bytes we can clear 128 tiles – about the maximum we can do in one frame.

A 6502 lisp compiler, sprite animation and the NES/Famicom

For our new project “what remains”, we’re regrouping the Naked on Pluto team to build a game about climate change. In the spirit of the medium being the message, we’re interested in long term thinking as well as recycling e-waste – so in keeping with a lot of our work, we are unraveling the threads of technology. The game will run on the NES/Famicom console, which was originally released by Nintendo in 1986. This hardware is extremely resilient, the solid state game cartridges still work surprisingly well today, compared to fragile CDROM or the world of online updates. Partly because of this, a flourishing scene of new players are now discovering them. I’m also interested that the older the machine you write software for, the more people have access to it via emulators (there are NES emulators for every mobile device, browser and operating system).

Our NES with everdrive flashcart and comparatively tiny sdcard for storing ROMs.

These ideas combine a couple of previous projects for me – Betablocker DS also uses Nintendo hardware and although much more recent, the Gameboy DS has a similar philosophy and architecture to the NES. As much of the machines of this era, most NES games were written in pure assembly – I had a go at this for the Speccy a while back and while being fun in a mildly perverse way, it requires so much forward planning it doesn’t really encourage creative tweaking – or working collaboratively. In the meantime, for the weavingcodes project I’ve been dabbling with making odd lisp compilers, and found it very productive – so it makes sense to try one for a real processor this time, the 6502.

The NES console was one of the first to bring specialised processors from arcade machines into people’s homes. On older/cheaper 8 bit machines like the Speccy, you had to do everything on the single CPU, which meant most of the time was spent drawing pixels or dealing with sound. On the NES there is a “Picture Processing Unit” or PPU (a forerunner to the modern GPU), and an “Audio Processing Unit” or APU. As in modern consoles and PCs, these free the CPU up to orchestrate a game as a whole, only needing to sporadically update these co-processors when required.

You can’t write code that runs on the PPU or APU, but you can access their memory indirectly via registers and DMA. One of the nice things we can do if we’re writing a language for a compiling is building optimised calls that do specific jobs. One area I’ve been thinking about a lot is sprites – the 64 8×8 tiles that the PPU draws over the background tiles to provide you with animated characters.

Our sprite testing playpen using graphics plundered from Ys II: Ancient Ys Vanished.

The sprites are controlled by 256 bytes of memory that you copy (DMA) from the CPU to the PPU each frame. There are 4 bytes per sprite – 2 for x/y position, 1 for the pattern id and another for color and flipping control attributes. Most games made use of multiple sprites stuck together to get you bigger characters, in the example above there are 4 sprites for each 16×16 pixel character – so it’s handy to be able to group them together.

Heres an example of the the compiler code generation to produce the 6502 assembly needed to animate 4 sprites with one command by setting all their pattern IDs in one go – this manipulates memory which is later sent to the PPU.

(define (emit-animate-sprites-2x2! x)
   (emit-expr (list-ref x 2)) ;; compiles the pattern offset expression (leaves value in register a)
   (emit "pha")               ;; push the resulting pattern offset onto the stack
   (emit-expr (list-ref x 1)) ;; compile the sprite id expression (leaves value in a again)
   (emit "asl")               ;; *=2 (shift left)      
   (emit "asl")               ;; *=4 (shift left) - sprites are 4 bytes long, so = address
   (emit "tay")               ;; store offset calculation in y
   (emit "iny")               ;; +1 to get us to the pattern id byte position of the first sprite
   (emit "pla")               ;; pop the pattern memory offset back from the stack
   (emit "sta" "$200,y")      ;; sprite data is stored in $200, so add y to it for the first sprite
   (emit "adc" "#$01")        ;; add 1 to a to point to the next pattern location
   (emit "sta" "$204,y")      ;; write this to the next sprite (+ 4 bytes)
   (emit "adc" "#$0f")        ;; add 16 to a to point to the next pattern location
   (emit "sta" "$208,y")      ;; write to sprite 2 (+ 8 bytes)
   (emit "adc" "#$01")        ;; add 1 to a to point to the final pattern location
   (emit "sta" "$20c,y")))    ;; write to sprite 4 (+ 12 bytes)

The job of this function is to return a list of assembler instructions which are later converted into machine code for the NES. It compiles sub-expressions recursively where needed and (most importantly) maintains register state, so the interleaved bits of code don't interfere with each other and crash. (I learned about this stuff from Abdulaziz Ghuloum's amazing paper on compilers). The stack is important here, as the pha and pla push and pop information so we can do something completely different and come back to where we left off and continue.

The actual command is of the form:

(animate-sprites-2x2 sprite-id pattern-offset)

Where either arguments can be sub-expressions of their own, eg.:

(animate-sprites-2x2 sprite-id (+ anim-frame base-pattern))

This code uses a couple of assumptions for optimisation, firstly that sprite information is stored starting at address $200 (quite common on the NES as this is the start of user memory, and maps to a specific DMA address for sending to the PPU). Secondly there is an assumption how the pattern information in memory is laid out in a particular way. The 16 byte offset for the 3rd sprite is simply to allow the data to be easy to see in memory when using a paint package, as it means the sprites sit next to each other (along with their frames for animation) when editing the graphics:


You can find the code and documentation for this programming language on gitlab.

Neural Network livecoding and retrofitting ZX Spectrum hardware

An experimental, and quite angry neural network livecoding synth (with an audio ‘weave’ visualisation) for the ZX Spectrum: source code and TZX file (for emulators). It’s a bit hard to make out in the video, but you can move around the 48 neurons and modify their synapses and trigger levels. There are two clock inputs and the audio output is the purple neuron at the bottom right. It allows recurrent loops as a form of memory, and some quite strange things are possible. The keys are:

  • w,d: move diagonally north west <-> south east
  • s,e: move diagonally south west <-> north east
  • t,y,g,h: toggle incoming synapse connections for the current neuron
  • space: change the ‘threshold’ of the current neuron (bit shifts left)

This audio should load up on a real ZX Spectrum:

One of the nice things about tech like this is that it’s easily hackable – this is a modification to the video output better explained here, but you can get a standard analogue video signal by connecting the internal feed directly to the plug and detaching the TV signal de-modulator with a tiny bit of soldering. Look at all those discrete components!


Spork Factory: evolving a light follower robot

Continuing with the structured procrastination R&D project on evolvable hardware, I’m proud to report a pretty decent light following robot – this is a video of the first real-world test, with a program grown from primordial soup chasing me around:

After creating a software model simulation of the robot in the last post, I added some new bytecode instructions for the virtual machine: LEYE and REYE push 1 on the stack if we are detecting light from the left or right photoresistor, zero if it’s dark. LMOT and RMOT pop the top byte of the stack to turn the motors on and off. The strategy for the genetic algorithm’s fitness function is running each 16 byte generated program on the robot for 1000 cycles, moving the robot to a new random location and facing direction 10 times without stopping the program. At the end of each run the position of the robot was compared to the light position, and the distances were averaged as the fitness. Note that we’re not assigning fitness to how fast we get to the light.

This is pretty simple stuff, but it’s still interesting to look at what happens over time in the genetic algorithm. Both motors are running at startup by default, so the first successful programs learn how turn one motor off – otherwise the robot just shoots off and scores really low. So the first generations tend to just go round in circles. Then they start to learn how to plug the eyes in, one by one edging them closer to the goal – then it’s a case of improving the sample rate to improve accuracy, usually by using jmps and optimising the loops.

This is an example of a fairly simple and effective solution, the final generation shown in the animation above:

  nop nop nop nop nop 
  nop nop nop nop
  jmpz loop

Some explanation, the right and left eyes are plugged into the left and right motors, which is the essential bit making it work, the ‘nop’s are all values that are not executable. The ‘rmot’ before the ‘jmpz’ makes the robot scan around in circles if there is no light detected (strangely, a case which doesn’t happen in the simulation). The argumant to ‘jmpz’ is 0 (loop) which is actually the 17th byte – so it’s cheekily using memory which has been initialised to zero as part of it’s program.

This is a more complicated and stranger program which evolved after 70 generations with a high fitness, I haven’t worked out what it’s up to yet:

  pshl 171 
  pip 111 
  pip 30 
  pshl 214 
  jmp loop

Evolvable hardware

I’m modding a robot toy for the next Spork Factory experiment, the chassis provides twin motor driven wheels and I’m replacing it’s brains with a circuit based on the ATtiny85 for running the results of the genetic algorithm, and a pair of light dependant resistors for ‘seeing’ with.

Here’s the circuit (made in about 20 minutes after installing Fritzing for the first time). It’s quite simple – two LDR’s provide input, and some transistors are needed to provide enough power to the robot’s motors (it’s using PNP transistors as they were the first matching pair I could find, which means logical 0 is ‘on’ and 1 is ‘off’ in the code).

The robot needs to be emulated in software so the genetic algorithm can measure the fitness of hundreds of thousands of candidate programs. We need to simulate the effect of motor speeds on it’s position and velocity – here is a test running the right motor at a constant rate, and gradually increasing the speed of the left motor.

This is the code snippet that calculates the velocity from the 2 motor speeds – more work is needed to plug in the correct values for the distance between the wheels and the actual rotational speeds of the motor drives.

// update dir and pos from current motor speed
float relative_speed_diff=m_left_motor-m_right_motor;
float angle=atan(relative_speed_diff);
// rotate the direction by the angle
float nx=(m_dir.x*cos(angle))-(m_dir.y*sin(angle));
float ny=(m_dir.x*sin(angle))+(m_dir.y*cos(angle));
// update the position

SuperCollider Symposium

I had a great couple of days at the SuperCollider symposium, starting with a gameboy performance with Till Bovermann and ending with a talk on BetaBlocker with him and Tom Hall. As an outsider to the community (I have contributed code, but I’m not a regular user of Supercollider) it was interesting to pick up on the threads and burning issues of the scene.

Our performance went well, and I found it oddly satisfying to continually dismantle all repetitive dancable structures as they emerged in order to keep up with Till’s more fluid style. We were both running the Betablocker virtual machine, but using it in very different ways – I was running a single one at 4 or 5 cycles per second inside the DS, Till was running many at 44100 cycles per second inside Supercollider.

Photos by Steve Welburn

I also had a chance to experience Benoit and the Mandelbrots for the first time – both in livecoding performance and finding out more about their software during their talk. It seems that livecoding is very active with a lot of new approaches being tried – for example extensive use of text chat for communication during performances. Also I found out about BeeNoir an amazing hexagonal beehive sound installation made by the Mandelbrots which was inspired by Al Jazari!

One of the hightlights of the event was Takeko Akamatsu of CraftWife fame initiating a 5 minute code-off competition between Click Nilson, MCLD, redfrik and Juan Mandelbrot (including a fully loaded water pistol) during her keynote talk.

You can read about some of the other things at the conference on this BBC article. Thanks to Dan Stowell and the team for all the hard work putting on the symposium.

Betablocker @ SC2012 livecoding night

I’m travelling to London this weekend to appear as a guest performer with Till Bovermann at the SuperCollider Symposium livecoding night, alongside livecoding all stars Benoit and the Mandelbrots, Thor Magnusson, Alo Allik and Yota Morimoto.

Till and I will be collaboratively performing using the betablocker virtual machine, Till running it at audio rate inside Supercollider, while I’ll be livecoding it at much slower beat time on a GameBoy DS – projecting it’s screen via close up webcam.

BetaBlocker DS test tones

After a week of heavy development on Germination X, and with a fast approaching performance, it seemed the time to dive back into Gameboy DS homebrew and get the bits and pieces of assembler code I’d been playing with together into something I could build more sounds from.

After getting the low level stuff working properly, I got busy making a bunch of audio scenes that I can trigger from Betablocker DS code.

Sounds hosted by

Mostly I’m doing wavetable based things, with lots of frequency and ring modulation. I was also pleased to find this simple resonant filter thanks to the ever fabulous musicdsp archive (and thanks to Paul Kellett for originally posting it):

// set feedback amount given f and q between 0 and 1
fb = q + q/(1.0 - f);

// for each sample...
buf0 = buf0 + f * (in - buf0 + fb * (buf0 - buf1));
buf1 = buf1 + f * (buf0 - buf1);
out = buf1;

The next step was to read Jasper Vijn’s excellent explanation of fixed point maths which I’ve used a bit on the Android but not really totally understood before. Mobile devices tend to only have integer maths at the hardware level, so if we restrict the calculations to fixed point it’s much faster and kinder to batteries than the floating point equivalent. Here is the fixed point version of the filter:

// we are using .10 fixed point (10 bits for the fractional part)
#define FX_SHIFT 10

// convert an int into to it's fixed representation
inline s32 fx(s32 i) { return i<<FX_SHIFT; }

// multiply two fixed point numbers (returns fixed point)
inline s32 fxmul(s32 a, s32 b) { return (a*b)>>FX_SHIFT; }

// f is cutoff frequecy, q is resonance, s is the filter state
void lpf(s16* dst, s16* src, u32 length, s16 f, s16 q, s16 *s)
    u32 fb = q+fxmul(q,fx(1)-f);
    for (u32 i=0; i<length; ++i)

Betablocker DS: Table feedback after Claudius Maximus

After getting some dates for gigs with Betablocker DS, I am spending some time looking into audio algorithms, and implementing them on the Gameboy DS. Last Thursday I spent some time at the TAI studio/bunker with Till Bovermann investigating these PD patches by Claudius Maximus:

The algorithm uses feedback to create sounds that take some time to play through a wide range of frequencies. It works by writing into the same buffer it’s playing, but at a different rate/position – the resulting ugliness suits the style of BBDS very much. As an aid to our understanding Till converted the algorithm to Supercollider, then over the next couple of days I managed to get it running with an inner loop of 9 instructions on the DS (could probably be optimised further, but I’m still a beginner):

@ ----------------------------------------------------
@ qrz_maximus: *dst, length, readpos, writepos [freq, *tab]
@ ----------------------------------------------------
        .global qrz_maximus
	.type   qrz_maximus, %function
        push    {r4,r5,r6,r7}       @ push the registers we need
        ldr     r4, [r13,#20]       @ load freq from stack into r4 
        ldr     r5, [r13,#24]       @ load *tab from stack into r5 
        ldr     r6, .tablength      @ load the tablen into r6
        ldrh    r7, [r5,r2]         @ load the sample into r7
        strh    r7, [r0], #2        @ write output: *dst=r7 dst++
        strh    r7, [r5,r3]         @ feedback into tab[writepos]=r7 
        add     r2, r2, r4          @ readpos+=freq
        and     r2, r2, r6          @ mask readpos into range
        add     r3, r3, #2          @ writepos++
        and     r3, r3, r6          @ mask writepos into range
        subs    r1, r1, #1          @ length--
        bne     .maximus_loop       @ repeat until length zero
        mov     r0, r2              @ return readpos
        pop     {r4,r5,r6,r7}   
        bx      lr                  @ exit
        .word   0x00003FF    

And here it is running on autopilot with a test program in Betablocker: