Sunday, November 12, 2017

Composite (NTSC) Video on mbed Nucleo (stm32f401)

This weekend, I put together a demo program for an STM32F401 development board that generates a composite video output.  My development board didn't have a DAC, and I needed three different output levels, so I used two digital output pins (SYNC and VID) and resistors.  This solution isn't incredibly robust - different monitors and displays require slightly different resistor values to function correctly.  I found that a resistor ratio of around 1 to 2 worked pretty well.  With only the SYNC pin high, the composite input should see about 0.3V, and with both the SYNC and VID pins high, it should see a signal in the range of 0.7 to 1.0V.
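To make the three levels concrete, here is a rough sketch of the arithmetic for a two-resistor summing network.  Everything specific in it is an assumption for illustration, not a measurement from my board: 3.3V push-pull pins, a 75 ohm termination inside the display, and example resistor values of 680 and 330 ohms (roughly the 1 to 2 ratio mentioned above).  The function name is made up as well.

```cpp
#include <cassert>

// Voltage seen at the composite input when two push-pull pins drive it
// through one resistor each. A pin that is low pulls its resistor to ground,
// so it still loads the node. r_load is the display's termination
// (assumed 75 ohms) and v_pin is the logic-high voltage (assumed 3.3 V).
double composite_level(bool sync_high, bool vid_high,
                       double r_sync, double r_vid,
                       double r_load = 75.0, double v_pin = 3.3)
{
    assert(r_sync > 0 && r_vid > 0 && r_load > 0);
    const double g_sync = 1.0 / r_sync; // conductances make the node equation simple
    const double g_vid  = 1.0 / r_vid;
    const double g_load = 1.0 / r_load;
    const double current_in = (sync_high ? v_pin * g_sync : 0.0)
                            + (vid_high  ? v_pin * g_vid  : 0.0);
    return current_in / (g_sync + g_vid + g_load);
}
```

With the assumed values, SYNC alone lands a little under 0.3V and SYNC plus VID lands between 0.7 and 1.0V, which matches the targets above.  Real displays vary, so treat the numbers as a starting point for trial and error.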

The goal of this project was to generate NTSC video output using only the mbed libraries.  I'm sure it's possible to do a much better job if the timers and interrupts are configured manually, but that's a lot of work...  To learn about NTSC, I used this PDF (link).  The basic idea is that each horizontal line begins with a synchronization signal, followed by the data for that line.  Each line is around 63 microseconds long, meaning you'll need better than 1 microsecond timing resolution if you want more than 63 horizontal pixels.  After all the horizontal lines are scanned, including a few bonus ones that don't end up shown on the screen, there is a vertical sync pattern, which starts the entire process over.  The sub-microsecond resolution turned out to be quite an issue - mbed timing functions are based on 1 microsecond timers, so I needed to get creative.  Also, the mbed "Ticker" class fails to time accurately (around 20 microseconds of jitter) if more than one Ticker is in use, so I could have exactly one accurate source of timing.
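The resolution problem is just integer division, sketched below.  The helper name is hypothetical and the 63 microsecond figure is the rounded line length from above.  It treats the whole line as pixel slots, so the real visible count is lower once the sync and blanking portion is subtracted.

```cpp
#include <cassert>
#include <cstdint>

// How many pixel slots fit in one scan line if each slot lasts exactly one
// timer step. Both arguments are in nanoseconds so the math stays in integers.
uint32_t pixels_per_line(uint32_t line_ns, uint32_t step_ns)
{
    assert(step_ns > 0);
    return line_ns / step_ns;
}
```

For example, pixels_per_line(63000, 1000) gives 63, and halving the step to 500 ns doubles that to 126.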

Here's what a single horizontal line looks like:

The vertical sync pattern is quite complicated:

Getting reliable v-sync timing on multiple monitors turned out to be incredibly challenging, so I eventually wrote a stupid program which slowly adjusts the waveform, and just watched the screen until it worked.  I found that a much simpler v-sync pattern was sufficient.

Due to the 1-microsecond resolution limit of the default mbed library, I was unable to set up per-pixel timing.  Horizontal line timing used a Ticker running every 63 microseconds.  This is slightly faster than the 63.5 microsecond NTSC standard, but it seems to work.  64 microseconds did not.  The ISR is surprisingly simple:

void isr()
{
    uint8_t nop = 0; //use nops or use wait_us
    uint8_t* sptr; //pointer to sync buffer for line
    uint8_t* vptr; //pointer to video buffer for line
    if(l < V_RES){ vptr = im_line_va + ((l/4)*H_RES); sptr = im_line_s; nop = 1; } //pick line buffers
    else if(l < 254){ vptr = bl_line_v; sptr = bl_line_s; nop = 0; }
    else{ vptr = vb_line_v; sptr = vb_line_s; nop = 1;}
    uint8_t lmax = nop?H_RES:12; //number of columns
    for(uint8_t i = 0; i < lmax; i++) //loop over each column
    {
        vout = vptr[i]; //set output pins
        sout = sptr[i];
        if(nop) //nop delay
        {
            //(run of nop instructions, not shown in this listing)
        }
        else {wait_us(1); if(i > 2) i++;} //wait delay
    }
    //move to next line
    l++;
    if(l > 255) l = 0;
}

The ISR gets slightly more complicated because it uses two different timing strategies:

  1. "nop": A number of "nop" instructions are run, delaying for an exact number of CPU cycles.  This is very accurate, but hogs the CPU and prevents other contexts from running.
  2. "wait_us":  This is low resolution (can only do multiples of 1 us, which is roughly 1/63 of a horizontal line), and low accuracy (sometimes waits too long).  With only this method, I managed to get 18 horizontal pixels.  However, it allows other tasks to run in the background while it is waiting - a modern microcontroller can do a ton in 1 microsecond.
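As a rough illustration of the "nop" strategy, the sketch below works out how many CPU cycles each pixel slot gets and emits a fixed run of nops.  It is not the code from the demo.  The 84 MHz clock is an assumption (it is the STM32F401's maximum), the helper names are made up, and the inline assembly uses GCC/Clang syntax, which may differ from what the online IDE's compiler expects.

```cpp
#include <cassert>
#include <cstdint>

// CPU cycles available for each pixel slot on one scan line.
uint32_t cycles_per_pixel(uint32_t cpu_hz, uint32_t line_ns, uint32_t pixels)
{
    assert(pixels > 0);
    // 64-bit intermediate: cpu_hz * line_ns does not fit in 32 bits.
    const uint64_t cycles_per_line =
        static_cast<uint64_t>(cpu_hz) * line_ns / 1000000000ULL;
    return static_cast<uint32_t>(cycles_per_line / pixels);
}

// Emit exactly N nop instructions with no loop counter, by unrolling the
// recursion at compile time (assumes GCC/Clang inline-asm syntax).
template <unsigned N>
inline void nop_delay()
{
    __asm__ volatile("nop");
    nop_delay<N - 1>();
}

template <>
inline void nop_delay<0>() {}
```

At the assumed 84 MHz, a 63 microsecond line is 5292 cycles, so 100 pixels would get 52 cycles each.  The actual nop count has to be smaller than that, since the pin writes and loop overhead use some of those cycles, and in practice it ends up being tuned by trial and error.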

There are three cases for setup:

  1. The line number is less than the vertical resolution.  If this is the case, store the memory location of the current line of the image buffer in vptr (each image row is reused for four consecutive scan lines, hence the l/4), store the memory location of the video synchronization pattern in sptr, and choose the "nop" timing.  More on this later.
  2. The line number is at least the vertical resolution, but less than 254.  In this case, prepare to display the blank patterns for video and sync.  Don't use "nop" timing, and use a horizontal resolution of 12.
  3. The last lines (254 and up): prepare for vertical sync, using "nop" timing.
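The same three cases can be written as a small pure function, which is easier to check on a PC than inside an ISR.  This is only a sketch: the enum, struct, and function names are made up, and the 254 and 12 constants are copied from the ISR listing.

```cpp
#include <cassert>
#include <cstdint>

enum class LineKind { Image, Blank, VSync };

struct LinePlan {
    LineKind kind;       // which pair of buffers to use
    uint16_t image_row;  // row of the image buffer (only meaningful for Image)
    bool     nop_timing; // true: nop delay, false: wait_us(1)
    uint8_t  columns;    // number of columns to clock out
};

// Decide what to do for a given scan line, mirroring the ISR's if/else chain.
LinePlan plan_line(uint16_t line, uint16_t v_res, uint8_t h_res)
{
    assert(v_res <= 254);
    if (line < v_res) {
        // Visible area: each image row covers four scan lines.
        return {LineKind::Image, static_cast<uint16_t>(line / 4), true, h_res};
    }
    if (line < 254) {
        // Off-screen blank lines: coarse wait_us timing, 12 columns.
        return {LineKind::Blank, 0, false, 12};
    }
    // Remaining lines carry the vertical sync pattern.
    return {LineKind::VSync, 0, true, h_res};
}
```

For instance, with a hypothetical v_res of 200, line 7 maps to image row 1, line 200 is the first blank line, and line 254 starts the vertical sync.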

In the display section, the video data loaded into vptr and sync data loaded into sptr are written to the VID and SYNC pins.  When important data (vertical sync and the actual video) is being sent, timing is done with a series of "nop" instructions.  For the less important signal (displaying "blank" on the bonus lines that don't show up on the screen), timing is done with a wait_us(1) command.  These wait_us(1) commands are very important - when using nop timing, the ISR takes around 62 microseconds to execute, leaving almost no time for other processing to be done. During the wait_us(1) command, the microcontroller is free to switch contexts and execute other code.  The wait_us function is terrible, and occasionally waits 2 or even 3 microseconds, so the horizontal resolution has to be reduced to 7.  This low resolution would look terrible when displaying video, so we can only use this trick when the stuff being drawn is off screen.

In the background, the microcontroller is busy updating the image buffer to display other items.  I have implemented code to translate and rotate 3D points, draw lines between points, draw checkerboards, draw simple bitmapped images, and calculate the position of a very simple bouncing ball (inspired by this).
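For a flavor of what the background drawing looks like, here is a sketch of a line-drawing routine for a one-byte-per-pixel buffer laid out like im_line_va (index = x + y*width).  It uses the standard integer Bresenham algorithm, which is not necessarily what the demo does, and the function name and signature are made up for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Draw a line from (x0, y0) to (x1, y1) into a w-by-h buffer with one byte
// per pixel. Points that fall outside the buffer are skipped.
void draw_line(std::vector<uint8_t>& buf, int w, int h,
               int x0, int y0, int x1, int y1, uint8_t value)
{
    assert(w > 0 && h > 0 && buf.size() == static_cast<std::size_t>(w) * h);
    const int dx = std::abs(x1 - x0);
    const int dy = -std::abs(y1 - y0);
    const int sx = (x0 < x1) ? 1 : -1; // step direction in x
    const int sy = (y0 < y1) ? 1 : -1; // step direction in y
    int err = dx + dy;                 // combined error term
    for (;;) {
        if (x0 >= 0 && x0 < w && y0 >= 0 && y0 < h) {
            buf[x0 + y0 * w] = value;
        }
        if (x0 == x1 && y0 == y1) break;
        const int e2 = 2 * err;
        if (e2 >= dy) { err += dy; x0 += sx; } // advance in x
        if (e2 <= dx) { err += dx; y0 += sy; } // advance in y
    }
}
```

Because it only uses integer adds and compares, a routine like this is cheap next to the floating point math used for the rotations.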

Sadly, none of the rest of the code can use any sort of timing, because the mbed Ticker class performs too poorly to keep the video output stable when multiple Tickers are running.  Instead, the background code is tuned to be more or less efficient so that the demo runs at a reasonable speed.  This involved doing all sorts of terrible modifications, compiling, loading, testing, and readjusting.  I don't know what compiler is used by the online IDE, other than it isn't gcc (or at least the error messages don't match gcc), and I suspect that it compiles with no optimization flags.  As a result, some code includes optimizations I'd normally rely on the compiler to do, and some code is intentionally left unoptimized so that it runs more slowly.  For whatever reason, the code to generate the checkerboard was incredibly slow compared to everything else, including lots of floating point math used to rotate the cube and draw lines.  In the end, the version that was fast enough, but not too fast, looked like this:

void draw_v_check(int8_t r,uint8_t tt)
{
    for(int i = 0; i < H_RES; i++)
        for(int j = 0; j < V_RES; j++)
            im_line_va[i+j*H_RES] = (i > 20) && (i < 98) && (tt ^ ((j%(r*2) >= r) ^ (i%(r*2) >= r)));
}
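The expression inside the loop is easier to follow when pulled out into a per-pixel function.  The sketch below is a readable restatement for testing on a PC, not the tuned version above.  The function name is made up, and it assumes tt is only ever 0 or 1, so it can be treated as an "invert" flag.

```cpp
#include <cassert>
#include <cstdint>

// True if pixel (i, j) is lit in a checkerboard with square size r.
// Columns outside 21..97 are always dark, matching the bounds in the loop.
bool checker_cell(int i, int j, int r, bool invert)
{
    assert(r > 0 && i >= 0 && j >= 0);
    const bool in_window = (i > 20) && (i < 98);
    const bool col_phase = (i % (2 * r)) >= r; // square wave along x
    const bool row_phase = (j % (2 * r)) >= r; // square wave along y
    // XOR of the two square waves gives the checker pattern, and XOR with
    // the invert flag swaps light and dark squares.
    return in_window && (invert != (row_phase != col_phase));
}
```

With r = 4, pixel (21, 0) is lit and pixel (24, 0) is dark.  Setting the invert flag swaps them within the window.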

In the end, the silly demo looks like this:

And the full code, including a very poorly-written demo:

Lesson learned: Don't use the mbed libraries for things that require complex timing!  This code is a disaster.