NES/Famicom game programming discoveries

Working on a NES game, you are treading in the footsteps of programmers from the '80s, and going back to modern development afterwards feels strangely bloated and inefficient in comparison. This is a log of some of the things I've encountered writing game code for the What Remains project.

Firstly, although it seemed like a lunatic idea at the start, I'm very happy we decided to build a compiler first, so I could write in a high level language (Lisp) and compile it to the 6502 assembler that the NES/Famicom needs to run. This has made it so much faster to get stuff done – for example, we can optimise the compiler for whatever is needed as we go along, without having to change any game code. I'm thinking about how to bring this idea to other projects on less esoteric hardware.

These are the odds and ends that I've built into the game, and some of the reasons for decisions we've made.

0005

Collisions

As well as sprite drawing hardware, later games machines (such as the Amiga) had circuitry to automatically detect collisions between sprites and also background tiles. There is a feature on the NES that does this for only the first sprite (the 'sprite 0 hit' flag) – but it turns out this is mostly used for esoteric raster timing tricks; for normal collisions you need to do your own checks. For background collisions, to prevent walking through walls, we have a list of bounding boxes (originally obtained by drawing regions over screenshots in gimp) for each of the two map screens we're building for the demo. We check a series of 'hotspots' on the character against these bounding boxes depending on which way you are moving, shown in yellow below.

hotspots

You can also check for collisions between two sprites. All the sprites we're using are 2×2 'metasprites' as shown above, since 16×16 pixels is really the right size for players and characters, as used in most games. At the moment these collisions just trigger animations or a 'talk mode' with one of the characters.
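
To make the idea concrete, here is a minimal sketch of a hotspot check in co2-style Lisp – the function names, accessors and hotspot offsets are hypothetical, not the actual game code:

;; sketch only - hotspot-in-box? and blocked-right? are hypothetical names
;; a hotspot is a single point tested against an axis-aligned bounding box
(defun (hotspot-in-box? px py bx1 by1 bx2 by2)
  (and (>= px bx1) (<= px bx2)
       (>= py by1) (<= py by2)))

;; when moving right we only need to test the right-hand hotspots
;; against a wall bounding box for the current screen
(defun (blocked-right? x y)
  (or (hotspot-in-box? (+ x 15) (+ y 4) wall-x1 wall-y1 wall-x2 wall-y2)
      (hotspot-in-box? (+ x 15) (+ y 12) wall-x1 wall-y1 wall-x2 wall-y2)))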

Entity system

With the addition of scrolling, and thinking about how to do game mechanics, it became apparent that we needed a system for dealing with characters, including the player – normally this is called an entity system. The first problem is that with multiple screens and scrolling, a character's position in the world is different from the screen position you need to give to the sprites. NES screens are 256 pixels wide, and we are using two screens side by side, so the 'world position' of a character on the x axis can range from 0 to 512. The NES is an 8 bit console, and this range requires a 16 bit value to store. We also need to be able to turn off sprites when they are not visible due to scrolling, otherwise they pop back on the other side of the screen. The way we do this is to store a list of characters, or entities, each of which contains five bytes:

  1. Sprite ID (the first of the 4 sprites representing this entity)
  2. X world position (low byte)
  3. X world position (high byte) – ends up 0 or 1, equivalent to which screen the entity is on
  4. Y world position (only one byte needed as we are not scrolling vertically)
  5. Entity state (this is game dependent and gives us 8 flag bits for things such as "have I spoken to the player yet" or "am I holding a key")

We are already buffering the sprite data in "shadow memory" which we upload to the PPU at the start of every frame; the entity list provides a second level of indirection. Each time the game updates, it loops through every entity, converting the world position to a screen position by checking the current scroll value. It can also see if the sprite is too far away and needs clipping – the simplest way to do that seems to be simply setting the Y position off the bottom of the screen. In future we can use the state byte to store animation frames or game data – or quite easily add more of these bytes as needed.
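
As a rough sketch of that per-frame conversion (the accessor and setter names here are hypothetical – the real code works directly on the five byte records):

;; sketch only - the entity accessors and metasprite setters are illustrative names
(defun (update-entity n)
  ;; 16 bit world x minus the current scroll gives the screen x
  (let ((screen-x (- (entity-world-x n) (scroll-x))))
    (if (and (>= screen-x 0) (< screen-x 256))
        ;; visible: position the 2x2 metasprite at the converted coordinates
        (set-metasprite-pos! (entity-sprite-id n) screen-x (entity-world-y n))
        ;; clipped: park the sprites off the bottom of the screen (y >= 240)
        (set-metasprite-y! (entity-sprite-id n) 240))))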

0018

Sound

Obviously I was itching to write a custom sound driver to do all sorts of crazy stuff, but time is very limited and it's better to have something which is compatible with existing tracker software. So we gave in and used an 'off the shelf' open source sound driver. First I tried ggsound, but this was a bit too demanding on memory and I couldn't get it playing back music without glitching. The second one I tried was famitone, which worked quite quickly – the only problem I had was with the python export program! It only requires 3 bytes of zero page memory, which is quite simple to get the Lisp compiler to reserve, and a bit of RAM which you can easily configure.

New display primitives

I wrote before about how you need to use a display list to jam all the graphics data transfer into the first part of each frame, the precious time while the PPU is inactive and can be updated. As the project went on there were a couple more primitive drawing calls I needed to add to our graphics driver.

To display our RPG game world tiles, the easiest way is to put big lists describing an entire screen's worth of tiles into PRG-ROM. These take up quite a bit of space, but we can use the display list to transfer them chunk by chunk to the PPU without needing any RAM – just the ROM address in the display list. I'm expecting this will be useful for the PETSCII section too.
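
Something along these lines is the idea – the prim-rom-tile-data primitive and the hi/lo helpers are hypothetical stand-ins for however the real driver spells them, and the exact packet layout may differ:

;; sketch only - prim-rom-tile-data, hi and lo are illustrative names
;; queue one 32 tile row of the RPG screen, read straight out of PRG-ROM
(display-list-add-byte (hi rpg-screen-tiles)) ;; ROM address of the tile list
(display-list-add-byte (lo rpg-screen-tiles))
(display-list-end-packet prim-rom-tile-data 0 32 32) ;; second row of the nametable, 32 tiles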

I also realised that any PPU transfer ideally needs to be scheduled by the display list to avoid conflicts, so I added a palette switch primitive too. We are switching palettes quite a lot between game modes (we currently have three: intro, RPG and the PETSCII demo), but we're also using an entirely black palette to hide the screen refresh between these modes. So the last thing we do in the game mode switch process (which takes several frames) is set the correct palette, making everything appear at once.
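
The 'screen off' case is roughly this – prim-palette is a hypothetical name for the new primitive, and the packet layout is only illustrative:

;; sketch only - prim-palette is an illustrative name for the palette primitive
;; the NES has 32 bytes of palette memory, and colour $0f is black
(loop n 0 31 (display-list-add-byte 15)) ;; 15 = $0f, black
(display-list-end-packet prim-palette 0 0 32)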

Memory mapping

By default you get 8K of graphics data storage – this means only 256 sprite patterns and 256 background tiles. In practice this is not enough to do much more than very simple games, and we needed more than this even for our demo. The fact that games are shipped as hardware cartridges meant that this could be expanded fairly simply – so most games use a memory mapper to switch between banks of graphics data, and code can be switched too.

There are hundreds of variants of this technique employing different hardware, but based on Aymeric's cartridge reverse engineering we settled on MMC1 – one of the most common mappers.

In use this turns out to be quite simple – you can still only use 256 tiles/sprites at a time, but you can switch between lots of different sets, something else that happens between game modes. With MMC1 you talk to the mapper by writing repeatedly into the cartridge ROM address space (which would normally be read-only) – the mapper treats these writes as serial communication, one bit at a time, and different address ranges control different registers.
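
As a sketch of what talking to it looks like (poke! and >> are stand-ins here, not confirmed co2 forms): each mapper register is 5 bits wide and is loaded by writing five times, with only bit 0 of each write being kept:

;; sketch only - poke! and >> are illustrative, not confirmed co2 forms
;; writes anywhere in $A000-$BFFF load MMC1's first CHR bank register, one bit at a time
(defun (mmc1-select-chr-bank bank)
  (poke! #xA000 bank)         ;; bit 0 of the bank number
  (poke! #xA000 (>> bank 1))  ;; bit 1
  (poke! #xA000 (>> bank 2))  ;; bit 2
  (poke! #xA000 (>> bank 3))  ;; bit 3
  (poke! #xA000 (>> bank 4))) ;; bit 4 - the fifth write latches the register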

0019

More PPU coding on the NES/Famicom

After getting sprites working in Lisp on the NES for our “What Remains” project, the next thing to figure out properly is the background tiles. With the sprites you simply have a block of memory you edit at any time, then copy the whole lot to the PPU each frame in one go – the tiles involve a bit more head scratching.

The PPU graphics chip on the NES was designed at a time when all TVs were cathode ray tubes, using an electron gun to build a picture up on a phosphor screen. As this scans back and forth across the screen, the PPU is busy altering its signal to draw pixel colours. If you try and alter its memory while it's doing this you get glitches. However, it's not drawing all the time – the electron gun needs to reset to the top of the screen each frame, so you get a window of time (2273 CPU cycles) to make changes to the PPU memory before it starts drawing the next frame.

0014
(Trying out thematic images and some overlapping text via the display list)

The problem is that 2273 cycles is not very much – not nearly enough to run your game in, and only enough to update approx 192 background tiles per frame, as PPU transfers are slow. It took me a while to figure this out, as I was trying to transfer an entire screenful in one go – which sort of works, but leaves the PPU in an odd state.

The solution is a familiar one from modern graphics hardware – a display list. This is a buffer you can add instructions to at any time in your game, which are then acted on only in the PPU access window. It separates the game code from the graphics transfer, and is very flexible. We might want to do different things here, so we have a set of ‘primitives’ that run different operations. Given the per-frame restriction, the buffer can also limit the bandwidth – the game can add a whole bunch of primitives in one go, which are then gradually dispatched. You can see this in a lot of NES games, as it takes a few frames to do things like clear the screen.

There are two kinds of primitives in the What Remains prototype game engine so far; the first sets the tile data directly:


(display-list-add-byte 1)
(display-list-add-byte 2)
(display-list-add-byte 3)
(display-list-end-packet prim-tile-data 0 0 3)

This overwrites the first 3 tiles at the top left of the screen with patterns 1, 2 and 3. First you add bytes to a ‘packet’, which can have different meanings depending on the primitive used, then you end the packet with the primitive type constant, the high and low bytes of the 16 bit destination offset in PPU memory, and a size. The reason this is done in reverse is that the display list is a stack, read from the ‘top’, which is a lot faster – we can use a position index that is incremented when writing and decremented when reading.

We could clear a portion of the screen this way with a loop (a built in language feature in co2 Lisp) to add a load of zeros to the stack:


(loop n 0 255 (display-list-add-byte 0))
(display-list-end-packet prim-tile-data 0 0 256)

But this is very wasteful, as it fills up a lot of space in the display list (all of it as it happens). To get around this, I added another primitive called ‘value’ which does a kind of run length encoding (RLE):


(display-list-add-byte 128) ;; length
(display-list-add-byte 0) ;; value
(display-list-end-packet prim-tile-value 0 0 2)

With just 2 bytes we can clear 128 tiles – about the maximum we can do in one frame.

A 6502 lisp compiler, sprite animation and the NES/Famicom

For our new project “What Remains”, we’re regrouping the Naked on Pluto team to build a game about climate change. In the spirit of the medium being the message, we’re interested in long term thinking as well as recycling e-waste – so in keeping with a lot of our work, we are unraveling the threads of technology. The game will run on the NES/Famicom console, originally released by Nintendo in 1983 as the Famicom in Japan. This hardware is extremely resilient: the solid state game cartridges still work surprisingly well today, compared to fragile CD-ROMs or the world of online updates. Partly because of this, a flourishing scene of new players is now discovering these machines. I’m also interested in the fact that the older the machine you write software for, the more people have access to it via emulators (there are NES emulators for every mobile device, browser and operating system).

nes
Our NES with Everdrive flashcart and a comparatively tiny SD card for storing ROMs.

These ideas combine a couple of previous projects for me – Betablocker DS also uses Nintendo hardware, and although much more recent, the Nintendo DS has a similar philosophy and architecture to the NES. As with most machines of this era, most NES games were written in pure assembly – I had a go at this for the Speccy a while back, and while it was fun in a mildly perverse way, it requires so much forward planning that it doesn’t really encourage creative tweaking – or working collaboratively. In the meantime, for the weavingcodes project I’ve been dabbling with making odd lisp compilers and found it very productive – so it makes sense to try one for a real processor this time, the 6502.

The NES console was one of the first to bring specialised processors from arcade machines into people’s homes. On older/cheaper 8 bit machines like the Speccy, you had to do everything on the single CPU, which meant most of the time was spent drawing pixels or dealing with sound. On the NES there is a “Picture Processing Unit” or PPU (a forerunner to the modern GPU), and an “Audio Processing Unit” or APU. As in modern consoles and PCs, these free the CPU up to orchestrate a game as a whole, only needing to sporadically update these co-processors when required.

You can’t write code that runs on the PPU or APU, but you can access their memory indirectly via registers and DMA. One of the nice things about writing our own language and compiler is that we can build optimised calls that do specific jobs. One area I’ve been thinking about a lot is sprites – the 64 8×8 pixel tiles that the PPU draws over the background tiles to provide you with animated characters.

spriteemu
Our sprite testing playpen using graphics plundered from Ys II: Ancient Ys Vanished.

The sprites are controlled by 256 bytes of memory that you copy (DMA) from the CPU to the PPU each frame. There are 4 bytes per sprite – 2 for x/y position, 1 for the pattern id and another for color and flipping control attributes. Most games made use of multiple sprites stuck together to get you bigger characters, in the example above there are 4 sprites for each 16×16 pixel character – so it’s handy to be able to group them together.
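
For reference, setting one hardware sprite in that $200 shadow buffer looks roughly like this – poke! is an illustrative stand-in, and in practice the engine drives this through compiled calls like the one below:

;; sketch only - poke! is an illustrative stand-in
;; each hardware sprite is 4 bytes in the $200 shadow buffer
(defun (set-hardware-sprite n x y pattern attr)
  (poke! (+ #x200 (* n 4) 0) y)        ;; byte 0: y position
  (poke! (+ #x200 (* n 4) 1) pattern)  ;; byte 1: pattern id
  (poke! (+ #x200 (* n 4) 2) attr)     ;; byte 2: palette and flip attributes
  (poke! (+ #x200 (* n 4) 3) x))       ;; byte 3: x position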

Here’s an example of the compiler code generation that produces the 6502 assembly needed to animate 4 sprites with one command, by setting all their pattern IDs in one go – this manipulates memory which is later sent to the PPU.

(define (emit-animate-sprites-2x2! x)
  (append
   (emit-expr (list-ref x 2)) ;; compiles the pattern offset expression (leaves value in register a)
   (emit "pha")               ;; push the resulting pattern offset onto the stack
   (emit-expr (list-ref x 1)) ;; compile the sprite id expression (leaves value in a again)
   (emit "asl")               ;; *=2 (shift left)      
   (emit "asl")               ;; *=4 (shift left) - sprites are 4 bytes long, so = address
   (emit "tay")               ;; store offset calculation in y
   (emit "iny")               ;; +1 to get us to the pattern id byte position of the first sprite
   (emit "pla")               ;; pop the pattern memory offset back from the stack
   (emit "sta" "$200,y")      ;; sprite data is stored in $200, so add y to it for the first sprite
   (emit "adc" "#$01")        ;; add 1 to a to point to the next pattern location
   (emit "sta" "$204,y")      ;; write this to the next sprite (+ 4 bytes)
   (emit "adc" "#$0f")        ;; add 16 to a to point to the next pattern location
   (emit "sta" "$208,y")      ;; write to sprite 2 (+ 8 bytes)
   (emit "adc" "#$01")        ;; add 1 to a to point to the final pattern location
   (emit "sta" "$20c,y")))    ;; write to sprite 4 (+ 12 bytes)

The job of this function is to return a list of assembler instructions which are later converted into machine code for the NES. It compiles sub-expressions recursively where needed and (most importantly) maintains register state, so the interleaved bits of code don't interfere with each other and crash. (I learned about this stuff from Abdulaziz Ghuloum's amazing paper on compilers). The stack is important here, as the pha and pla push and pop information so we can do something completely different and come back to where we left off and continue.

The actual command is of the form:

(animate-sprites-2x2 sprite-id pattern-offset)

Where either argument can be a sub-expression of its own, e.g.:

(animate-sprites-2x2 sprite-id (+ anim-frame base-pattern))

This code uses a couple of assumptions for optimisation: firstly, that sprite information is stored starting at address $200 (quite common on the NES, as this is the start of user memory and maps to a specific DMA address for sending to the PPU). Secondly, there is an assumption that the pattern information in memory is laid out in a particular way. The 16 byte offset for the 3rd sprite is simply to allow the data to be easy to see in memory when using a paint package, as it means the sprites sit next to each other (along with their frames for animation) when editing the graphics:

spritepatternoffset
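
In other words, for a base pattern n, the four sprites pick up patterns laid out like this on the 16-wide pattern sheet (assuming the sprites run top left, top right, bottom left, bottom right):

;; how a 2x2 metasprite maps onto the pattern sheet (16 patterns per row)
;;   top left    = n        top right    = n + 1
;;   bottom left = n + 16   bottom right = n + 17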

You can find the code and documentation for this programming language on gitlab.