Special Affects - GBA Rust adventures, part 3

28 minute read Published: 2020-12-29

a boring forest and an interesting promotional image of the same forest

Yes, "Affects." Anodyne is an atmospheric, dreamlike game, and while proving that we can fit all its levels' tile data into the GBA's background VRAM may be a nice goal to have achieved, we're not really done representing them properly until we can give them the right humidity. For instance, just look at the header image above: On the left, I see what looks like a body of water, but that land above it is clearly sun-blasted and bone-dry! By contrast, the official game on the right looks the way being in a damp forest shaded by tall trees feels. Porting a game is about more than its assets and engine; you've gotta translate the feeling over as best you can, too[1]!

Mosaic

The Anodyne series very often uses a "pixelate" transition between scenes. In the first game (the one we're studying), it happens in every door, warp tile, and loading zone between worlds! By sheer coincidence, the GBA happens to have a memory-mapped I/O register called MOSAIC that gives us precisely this effect, allowing us to pixelate some or all of the elements up to 16-fold independently in the X and Y axes on the screen as a stock feature of the PPU.

How convenient! Okay, it's hardly a coincidence--it was very commonly used by 16-bit games of the early 1990's, such as Super Mario World on the Super NES (pictured below), and Anodyne's 16-bit stylings need not stop at limited palettes and tile grids.

So naturally, one of the first things I implemented was this effect, which I found sorely missing from the Scaleform-based[2] console ports of Anodyne!

original PC release | Switch/PS4/XbOne port | my GBA demo

It's as simple as setting the mosaic flags in all the background layers' control registers and sprite attributes to opt them into the effect, and then writing a new value of mosaic_amount to the MOSAIC register each time the PPU finishes drawing the screen:

gba::io::display::MOSAIC.write(
    MosaicSetting::new()
        .with_bg_horizontal_inc(mosaic_amount)
        .with_bg_vertical_inc(mosaic_amount)
        .with_obj_horizontal_inc(mosaic_amount)
        .with_obj_vertical_inc(mosaic_amount)
);
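
As a rough sketch of where mosaic_amount might come from, here's one plausible way to ramp it up and back down over a transition. The names mosaic_ramp, mosaic_bits, and TRANSITION_FRAMES are hypothetical (they aren't from the demo's source), and the ramp shape is only a guess at what looks right; the one hardware fact assumed is that each of the four MOSAIC fields is 4 bits wide.

```rust
/// Hypothetical transition length, in frames (about half a second at 60 Hz).
const TRANSITION_FRAMES: u32 = 32;

/// Mosaic amount for a given frame of the transition: ramps from 0 up to 15
/// over the first half, then back down to 0 over the second half.
fn mosaic_ramp(frame: u32) -> u16 {
    let half = TRANSITION_FRAMES / 2;
    let f = frame.min(TRANSITION_FRAMES);
    // Distance from the nearest end of the transition, in 0..=half.
    let dist = if f <= half { f } else { TRANSITION_FRAMES - f };
    // Scale to the 4-bit range that the MOSAIC fields accept.
    (dist * 15 / half) as u16
}

/// Packs one amount into all four 4-bit fields of a raw MOSAIC value
/// (BG horizontal, BG vertical, OBJ horizontal, OBJ vertical).
fn mosaic_bits(amount: u16) -> u16 {
    let a = amount & 0xF;
    a | (a << 4) | (a << 8) | (a << 12)
}
```

Calling mosaic_ramp once per VBlank and feeding the result to the register write above would pixelate the old scene out and the new one in, with the actual level swap happening at the midpoint.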

That was easy! What's next?

Color overlays

One interesting feature in Anodyne that diverges from the graphical style typical of 16-bit console games is the special lighting effects it uses in many areas. Let's take a look at a screen from our GBA demo, without any such effect implemented:

a canal opening up to a larger body of water

Looks good enough, everything's rendered properly as per the tile assets we converted and fed into VRAM... Say, what's this screen look like in the original game?

a beautiful sunset over the sea

Whoa! When'd that gorgeous sunset get here? What Anodyne is doing here is similar to something you'd see in a moderately sophisticated layered image editor: Special color blending modes, specifically overlay, hardlight, and multiply. It's taking the raw scene from above and this colorful gradient image and combining their colors according to the "Hard Light" algorithm, which Wikipedia helpfully describes for us along with the other two, and which I'll go ahead and translate to Rustic pseudocode for your reading pleasure:

pub fn overlay(a: f32, b: f32) -> f32 {
    if a < 0.5 {
        2.0 * a * b
    } else {
        1.0 - (2.0 * (1.0 - a) * (1.0 - b))
    }
}

pub fn hardlight(a: f32, b: f32) -> f32 { overlay(b, a) }
pub fn multiply(a: f32, b: f32) -> f32 { a * b }

Somewhere in either Flixel, Adobe AIR, or the GPU, this math is happening on each of the red, green, and blue channels of each pixel in the initial render of the game scene and the corresponding pixel of the aforementioned gradient image.
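
To make "on each of the red, green, and blue channels" concrete, here's a minimal sketch that applies one of the three modes to a whole RGB triple. BlendMode, blend_channel, and blend_rgb are names made up for this illustration, and treating the first argument as the scene and the second as the gradient simply follows the argument order of the pseudocode above.

```rust
#[derive(Clone, Copy)]
enum BlendMode {
    Overlay,
    HardLight,
    Multiply,
}

fn overlay(a: f32, b: f32) -> f32 {
    if a < 0.5 {
        2.0 * a * b
    } else {
        1.0 - (2.0 * (1.0 - a) * (1.0 - b))
    }
}

/// Blends a single channel; both inputs are expected to be in 0.0..=1.0.
fn blend_channel(mode: BlendMode, a: f32, b: f32) -> f32 {
    match mode {
        BlendMode::Overlay => overlay(a, b),
        // Hard light is overlay with the two layers swapped.
        BlendMode::HardLight => overlay(b, a),
        BlendMode::Multiply => a * b,
    }
}

/// Applies the same blend independently to red, green, and blue.
fn blend_rgb(mode: BlendMode, scene: [f32; 3], gradient: [f32; 3]) -> [f32; 3] {
    [
        blend_channel(mode, scene[0], gradient[0]),
        blend_channel(mode, scene[1], gradient[1]),
        blend_channel(mode, scene[2], gradient[2]),
    ]
}
```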

That's all well and good for the original, but how does it apply to us on the GBA? We don't have pixel-wise access to the rendered scene, and even if we did, iterating over 160x160 = 25,600 pixels and doing math on each one would be pretty memory- and CPU-time-consuming. There are some color-blending capabilities in the GBA's PPU, but they come with limitations: Only one blend of any type can be done at a time, sprites can't blend with other sprites, and the only blend modes available are alpha blending, fading to white, and fading to black. None of those are the special color blending techniques we're after.

So are we out of luck? Well, while we don't necessarily have access to the final pixels, we do have access to every color that can possibly appear in the scene, and we are on a platform that provides raster interrupts at the beginning and end of each line as they're drawn (rendering is done left-to-right, top-to-bottom, just like how you're reading this text).

So... Can we swap out the entire palette with a blended version corresponding to the color at each Y position in a vertical slice of the gradient image? Let's give it a try!

At level load, we'll make a big array and populate it by crunching those numbers on the whole level palette. We'll have to use fixed-point math for color blending, because the ARMv4 CPU with which we're working doesn't have floating-point support. Each color is 16 bits, and there are up to 256 colors for the background and 256 colors for the sprites, which means a whole copy of palette RAM is 2 bytes * (256+256) = 1KiB in size, and for 160 scanlines, that might mean 160KiB at full resolution. Where do we put it? Let's consult GBATek...

Region       Bus  Read len   Write len  Cycles    Offset range
BIOS ROM     32   8/16/32    -          1/1/1     000_0000-000_3fff
IWRAM 32K    32   8/16/32    8/16/32    1/1/1     300_0000-300_7fff
I/O Regs     32   8/16/32    8/16/32    1/1/1     400_0000-400_03fe
OAM          32   8/16/32    - /16/32   1/1/1*    700_0000-700_03ff
EWRAM 256K   16   8/16/32    8/16/32    3/3/6     200_0000-203_ffff
Pal. RAM     16   8/16/32    - /16/32   1/1/2*    500_0000-500_03ff
VRAM         16   8/16/32    - /16/32   1/1/2*    600_0000-601_7fff
Cart (WS0)   16   8/16/32    - /16/32   5/5/8**   800_0000-9ff_ffff
Cart (WS1)   "    "          "          "         a00_0000-bff_ffff
Cart (WS2)   "    "          "          "         c00_0000-dff_ffff
Cart SRAM    8    8/ - / -   8/ - / -   5/ -/ -   e00_0000-e00_ffff

*: Plus 1 cycle if GBA accesses video memory at the same time.
**: Separate timings for sequential and non-sequential accesses, depending on waitstate settings.

So the fast on-chip IWRAM's probably out, since we only have 32KiB of that to go around for everything performance-critical in the entire game. There's enough room in EWRAM, but it's a bit slower to read and write, with each 16-bit access costing 3 CPU cycles instead of 1. It'll probably add some "loading time" to a level if we go about it this way, but let's see where it takes us anyway.
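
Here's a hedged sketch of what that level-load number crunching could look like with integer-only math on 5-bit channels. Everything named here (overlay5, hardlight5, blend_color, build_scanline_palettes) is hypothetical, it assumes the "Hard Light" mode and the GBA's 15-bit color layout (red in the low 5 bits, then green, then blue), and it uses a Vec for brevity where the real thing would be a fixed buffer in EWRAM.

```rust
/// Size of one full copy of palette RAM: 2 bytes * (256 BG + 256 OBJ colors).
const PALETTE_BYTES: usize = 2 * (256 + 256);
/// One copy per visible scanline.
const FULL_TABLE_BYTES: usize = PALETTE_BYTES * 160;

/// Overlay on 5-bit channels (0..=31), using only integer arithmetic.
fn overlay5(a: u16, b: u16) -> u16 {
    if a < 16 {
        (2 * a * b) / 31
    } else {
        31 - (2 * (31 - a) * (31 - b)) / 31
    }
}

/// Hard light is overlay with the arguments swapped.
fn hardlight5(a: u16, b: u16) -> u16 {
    overlay5(b, a)
}

/// Blends one 15-bit scene color with one 15-bit gradient color.
fn blend_color(scene: u16, light: u16) -> u16 {
    let mut out = 0u16;
    for shift in [0u16, 5, 10] {
        let a = (scene >> shift) & 0x1F;
        let b = (light >> shift) & 0x1F;
        out |= hardlight5(a, b) << shift;
    }
    out
}

/// Builds one blended copy of `palette` per entry of `gradient`
/// (one gradient color per scanline), laid out back to back.
fn build_scanline_palettes(palette: &[u16], gradient: &[u16]) -> Vec<u16> {
    let mut table = Vec::with_capacity(palette.len() * gradient.len());
    for &light in gradient {
        for &scene in palette {
            table.push(blend_color(scene, light));
        }
    }
    table
}
```

Working directly on 5-bit channel values like this is the simplest option, at the cost of throwing away precision at every step.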

We're going to need to copy each modified palette in the time between when the rightmost pixel of line n-1 is drawn and when the leftmost pixel of line n is drawn, known as the 'horizontal blank period,' or HBlank for short--think of it like the time your hand takes to move back to the left margin of a paper you're writing. We may not be repositioning a cathode ray beam like consoles plugged into a TV would reasonably assume, but the GBA's PPU is architected similarly nonetheless.

Hey GBATek, how much time do we have during that, anyhow?

Visible     240 dots,  57.221 us,    960 cycles - 78% of h-time
H-Blanking   68 dots,  16.212 us,    272 cycles - 22% of h-time

272 cycles, so with an overhead of 3 cycles per color, that's... 90 colors, or 180 bytes, or about 5 and a half 16-color palette lines. Well, suffice to say if we try to use HBlank DMA to copy from EWRAM, we're probably not gonna get the whole palette copied in time. How about if we cache one palette at a time in IWRAM by copying it from EWRAM between HBlanks (so during HDraw)?
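
Before trying that, the budget arithmetic above is easy to double-check in code. This is just the napkin math restated, with made-up names (colors_per_window, palette_lines) and the 3-cycles-per-color figure taken as given:

```rust
const HBLANK_CYCLES: u32 = 272;
const EWRAM_CYCLES_PER_COLOR: u32 = 3;
const COLORS_PER_PALETTE_LINE: u32 = 16;

/// How many 16-bit colors fit in a window of `cycles` at a fixed per-color cost.
const fn colors_per_window(cycles: u32, cycles_per_color: u32) -> u32 {
    cycles / cycles_per_color
}

/// Splits a color count into (whole 16-color palette lines, leftover colors).
const fn palette_lines(colors: u32) -> (u32, u32) {
    (
        colors / COLORS_PER_PALETTE_LINE,
        colors % COLORS_PER_PALETTE_LINE,
    )
}
```

With the numbers above, that's 90 colors, or 5 whole palette lines plus 10 colors of a sixth.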

the sunset scene with noticeable color banding

Well, that's a start! It's got noticeable color banding, because despite having as many duplicate palettes stuffed in EWRAM as we do, they were calculated with fixed-point math based on 5-bit color channel values. (Also, note that I don't touch the HUD's palette in the otherwise-blended copies, so we conveniently maintain a look suggesting that they're not part of the play area.)

Say, I know what I might be able to achieve with this technique that wouldn't be bothered by color banding of any sort...

A brief detour: Textboxes

Anodyne's dialogue is shown inside of a 156x44 PNG of a semi-transparent rounded rectangle with a solid 1-pixel black border, using a fixed-width pixel font with 7x8 pixel letters. Our visible play area is 160 pixels wide, so naturally there are a couple of pixels on either side of the textbox in the original game.

A textbox from trying to open a locked treasure chest in Anodyne on PC

Now, as mentioned before, the GBA hardware does have alpha composition capabilities, and we could use them here to achieve the original look. However, we only can have one alpha blend active at any given time, and I'd prefer to save it for a more sophisticated effect, if I can help it. But alas, if we wanted to swap the palette out with a pre-alpha-blended-with-the-textbox-color version, we can only do that for an entire row of pixels, so we wouldn't have that nice rounded-rectangle thing going on - we'd just awkwardly have a plain rectangle of different color stretched from one side of the HUD to the other.

And hey, what's with those 7x8 letters? That's so frustratingly close to the 8x8 we would need it to be to just throw it onto the HUD's background layer without messing about with drawing text pixel-by-pixel into VRAM! If we wanted to do that, we'd have to make the textbox 8/7 times its current width of 156, which would place its borders just outside of the play area and onto... the HUD... Where we could just bake the rounded rectangle parts (pre-blended) into the HUD's tile set...

The same text in the GBA demo, with the slightly wider font and the sides of the textbox drawn over the inner edges of the HUD, as described

Ever have two problems cancel each other out? Anyhow,

Back to the color gradient overlays

In the previous post in this series, we talked about how we built up our palettes at compile-time. As you might imagine, pre-computing these blended palettes at compile-time could solve several of our current problems if the game pak is fast enough to read. At the very least, it'd be nice to do the color computations with something closer to the source data & floating-point arithmetic precision. Say, what was all that business in the footnote of that memory timing table up there about "waitstate settings?" Hey GBATek?

4000204h - WAITCNT - Waitstate Control (R/W)

This register is used to configure game pak access timings. The game pak ROM is mirrored to three address regions at 08000000h, 0A000000h, and 0C000000h, these areas are called Wait State 0-2. Different access timings may be assigned to each area (this might be useful in case that a game pak contains several ROM chips with different access times each).

Bit   Explanation                  Values
0-1   SRAM Wait Control            0..3 = 4,3,2,8 cycles
2-3   Wait State 0 First Access    0..3 = 4,3,2,8 cycles
4     Wait State 0 Second Access   0..1 = 2,1 cycles
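
As a small sketch of how those fields pack into a register value, covering only the three fields quoted in this excerpt: waitcnt_bits and the two lookup helpers are hypothetical names, and the field values in the usage note are just an example, not necessarily what the demo writes. On hardware the result would be written to 0x0400_0204.

```rust
/// Packs the three WAITCNT fields shown in the table into a raw value.
/// Each argument is the field's selector value, not a cycle count.
fn waitcnt_bits(sram: u16, ws0_first: u16, ws0_second: u16) -> u16 {
    (sram & 0b11) | ((ws0_first & 0b11) << 2) | ((ws0_second & 0b1) << 4)
}

/// Wait cycles selected by a 2-bit "First Access" (or SRAM) field.
fn first_access_cycles(field: u16) -> u16 {
    [4, 3, 2, 8][(field & 0b11) as usize]
}

/// Wait cycles selected by the 1-bit Wait State 0 "Second Access" field.
fn second_access_cycles(field: u16) -> u16 {
    [2, 1][(field & 0b1) as usize]
}
```

For example, waitcnt_bits(3, 1, 1) selects 8 cycles for SRAM, 3 for the first ROM access, and 1 for each sequential access after it, and comes out to 0x17.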

So if we write the appropriate flags to WAITCNT, we'll be able to speed up our random ("First") and sequential ("Second") access timings, provided the cartridge supports it[3]--crucially, we'll be able to get timings faster than EWRAM for sequential reads, which our palette streaming would definitely be. But we still won't be fast enough to copy both background and sprite palettes during HBlank alone - 272 cycles is less than the 512 colors we're dealing with, even if they took 1 cycle apiece to copy! HBlank DMA causes Bad Things to happen visually if it takes too long and locks the PPU out from the VRAM bus, too, so we'll need to find a reasonable alternative for copying the memory efficiently. Hm... Let's do some napkin math.

GBA screen timings:

      |<------- 240px ------->|<-68->|
     _ ______________________________
    ^ |>=== scanline == 1,232 cyc ==>|
    | |                       |      |
    | |>====== hdraw ========>|hblank|
160px |    960 cycles (~4/px) |272cyc|
    | |                       |      |
    | |     *  vdraw          |      |
    | | 197,120 cycles (= 1232 * 160)|
    v_|_______________________| __ __|
    ^ |  *  vblank                   |
68px| |  83,776 cycles (= 1232 * 68) |
    v_|______________________________|

Here's a rough diagram of the timings we're dealing with. As mentioned earlier, the GBA offers raster interrupts at the beginning and end of the horizontal drawing period. Strictly speaking, in our case we don't necessarily care about the screen being undisturbed during all of HDraw; if the colors look wonky underneath our HUD, nobody's gonna see it! So why not catch VCount (the interrupt that can fire at the beginning of HDraw) and try to do something sneaky before and after HBlank? With the HUD being 40 pixels wide on each side, we can probably reclaim a bunch of additional CPU cycles for our shenanigans.
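
To put a number on "a bunch," here's the diagram's arithmetic as constants, plus the window we'd get by starting under the right-hand HUD, running through HBlank, and finishing under the left-hand HUD of the next line. The constant names are mine, and the window definition is an assumption about how the trick would be laid out.

```rust
const CYCLES_PER_DOT: u32 = 4;
const HDRAW_DOTS: u32 = 240;
const HBLANK_DOTS: u32 = 68;
const VDRAW_LINES: u32 = 160;
const VBLANK_LINES: u32 = 68;
/// Width of the HUD on each side of the play area.
const HUD_DOTS: u32 = 40;

const fn cycles(dots: u32) -> u32 {
    dots * CYCLES_PER_DOT
}

const SCANLINE_CYCLES: u32 = cycles(HDRAW_DOTS + HBLANK_DOTS);
const VDRAW_CYCLES: u32 = SCANLINE_CYCLES * VDRAW_LINES;
const VBLANK_CYCLES: u32 = SCANLINE_CYCLES * VBLANK_LINES;

/// Right HUD + HBlank + left HUD of the following line.
const EXTENDED_WINDOW_DOTS: u32 = HUD_DOTS + HBLANK_DOTS + HUD_DOTS;
const EXTENDED_WINDOW_CYCLES: u32 = cycles(EXTENDED_WINDOW_DOTS);
/// The part of that window during which the PPU is still drawing.
const CONTENDED_CYCLES: u32 = cycles(HUD_DOTS * 2);
```

That's 592 cycles instead of 272, with 320 of them overlapping HDraw.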

We should remember, however, the footnote below the table of memory timings that says we incur an additional 1-cycle penalty when accessing the palette and video RAM from CPU while they're in use by the PPU (which they very much are during HDraw).

And we should also note that if any interrupt service routine is running, other interrupts may get dropped - so if we're in the middle of copying palette colors when the next VCount interrupt would've fired, we'll miss it, so we'll likely want to limit our vertical resolution such that we're not hitting any rows back-to-back.

GBA screen timings with Anodyne Advance HUD elements:

      |<40>|<-- 160px -->|<40>|<-68->|
     _ ______________________________
    ^ |hud |gameplay area|hud |hblank|
    | |<------200px----->|<----148px-@  {200*4 = 800cyc after VCount hits to start copy}
    | @--->|             |    |      |  {148*4 = 592cyc to copy blends to PalRAM,
160px |    |             |<----------@   (40*2)*4 = 320cyc of which have an addn'l waitstate}
    | @---------------456px----------@
    | @--->|   textbox   |    |      |  {456*4 = 1,824cyc to copy textbox to PalRAM,
    | |    |=============|    |      |   (240+40*2)*4 = 1,280cyc of which have addn'l waitstate}
    v_|____|_____________|____| __ __|
    ^ |                              |
68px| | 83,776cyc of vblank for game |  {copy y=0 blend into PalRAM before end of vblank}
    v_|______________________________|

So let's try the approach laid out in that diagram (copied straight from my notes, so pardon my brevity).

Is that... a total of 16.5 palette lines? Now we might be getting somewhere! Let's see what it looks like to copy a new palette from ROM during our "virtual extended HBlank" period, which we can fire every 4th line or so...

a much better-looking sunset in the GBA demo

Nice! It's a good thing indeed that we managed to slightly shrink our palettes last time :)

Say, how's that forest scene from the post header looking?

some strange artifacts on the water, but a much dimmer bit of mood lighting beneath the trees

Better! Oh, and notice that as in the original game, I omit the treetops from the colors being affected - they're part of the foreground, which for FOREST does not get included in my build tools' blended palette generation.

However, the water in this is probably one of the most stark examples of color banding with this approach. Despite being computed with higher precision, I suspect we might be hitting a limit of the color depth we're able to show with 5 bits per channel. Still, we're not without options to try to address it.

The GBA's screen has a slow response time that results in a dramatic enough LCD ghosting that each frame has about 50% of the previous frame's pixel colors mixed into it. These days, folks in the market for an LCD screen for gaming would scoff at such a thing, but from our perspective... Hey, that's a free color-averaging effect! GBA emulators will often expose this feature as "interframe blending" in a configuration menu, and it's shockingly off by default in many of them, despite many contemporary games exploiting this property for intentional graphical effects by flickering elements between visible and invisible on alternating frames.

what's inside that green orb? | oh, hi Sonic!
a fully-opaque shield in Sonic Advance 2 | a semi-opaque shield with Sonic inside

So if we stagger when we apply our colors by half of our gradient resolution (4 / 2 = 2), we should get a little bit of in-between color to make those transitions a little bit less glaring, right?

yeah, there are a couple of pixels of aquamarine between the navy blue and sea green

...Well, it helps a little, and I'll take it for now :) Technically, this would look a bit better if we used a lower blend resolution, which would give us a larger area of the in-between color; this is a point in favor of using per-level gradient resolutions so I can special-case FOREST here, but I've not implemented this yet.

Plus, if we're doing it this way, we can alternate that staggering with the sprite palettes, making sure that ~16.5 palette lines worth of virtual HBlank cycles we've given ourselves is sufficient for each of the background and sprite palette banks, while having the end result look as though they're being copied in simultaneously (since they're being blended anyhow).
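
One possible reading of that schedule, sketched as a pure function: on even frames the background bank gets refreshed on lines 0, 4, 8... and the sprite bank on lines 2, 6, 10..., and on odd frames the two swap. Bank, scheduled_copy, and GRADIENT_STEP are illustrative names, and this is an interpretation of the description above rather than the demo's actual logic.

```rust
/// Gradient resolution: a new blended palette every 4th scanline.
const GRADIENT_STEP: u32 = 4;
const VDRAW_LINES: u32 = 160;

#[derive(Debug, Clone, Copy, PartialEq)]
enum Bank {
    Background,
    Sprite,
}

/// Which palette bank (if any) should be refreshed when VCount hits `vcount`
/// on frame number `frame`.
fn scheduled_copy(frame: u32, vcount: u32) -> Option<Bank> {
    if vcount >= VDRAW_LINES {
        return None;
    }
    let half = GRADIENT_STEP / 2;
    let phase = vcount % GRADIENT_STEP;
    let even_frame = frame % 2 == 0;
    match (phase, even_frame) {
        (0, true) => Some(Bank::Background),
        (0, false) => Some(Bank::Sprite),
        (p, true) if p == half => Some(Bank::Sprite),
        (p, false) if p == half => Some(Bank::Background),
        _ => None,
    }
}
```

Copies never land on adjacent scanlines this way, which keeps us clear of the dropped-interrupt problem mentioned earlier.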

Another not-so-perfect thing is that the absence of a horizontal component to our scanline-based color gradient blending is apparent in certain levels, some more than others. FOREST has one of the radial gradients, so it doesn't look as close to the original as, say, the sunset we showed earlier in this post, which is mostly just a vertical gradient with a bit of curve to it. CLIFF, GO, and a few others work very well, though!

CLIFF (overlay) | GO (hardlight) (y'know, the sunset) | FOREST (overlay) | SUBURB (multiply)

Of all the areas, I'd have to say that SUBURB is the least-well-executed with this approach compared to its original. In the original game, the "multiply" color blend with the very-dark edges really helped engender a very ominous, claustrophobic, oppressive atmosphere, which isn't quite achieved to the same degree by our darkening of the top and bottom only:

original PC release | GBA demo with palette gradient

Hey wait, why do those screenshots have video controls--oh. What's with all that... film grain?

Random TV-static noise

Okay, remember how I was saving the alpha blend register for something trickier? This is something trickier.

In SUBURB, there's an ever-present bit of semitransparent random noise overlaying the screen, which combines well with the very creepy ambient music, the greyscale filter, and the aforementioned dramatically darkened screen edges. From where I'm sitting, the only reasonable method we have to produce per-pixel randomness at runtime without a large performance cost is to have the PPU blit an alpha-blended background layer containing the noise pixels over everything else in the scene. That means it's time to deploy the alpha blend register--but first, we need our layer of noise pixels!

The way Anodyne achieves this effect is with a four-frame, 8 FPS animation that it generates once at launch. We don't quite have the spare VRAM to store that much random junk for the entire game's runtime, so we'll just do it each time we load a level that expects to use the effect.

We'll need an efficient pseudo-random number generator without stdlib dependencies; rand_xoshiro should do just fine. We're still using tiled graphics, so our approach will be a bit different: Make a palette that's just a gradient of black to white, fill a character block (16KiB) with random pixels (being careful that none of them are palette index 0, which is fully transparent), and then fill four adjacent screenblocks (2KiB each) with random tile IDs.

Here's the VRAM layout I've manually allocated so far - with a note that I've made some inroads to designing a method to (ab)use const fn's and const generics to create a type-safe compile-time abstraction over allocating these, but that's for another post ;)

/// # Anodyne Advance's VRAM division by charblock:
///  0. level charblock
///  1. extra charblock (windmill)
///  2. hud charblock
///  3. screenblocks (add 24 to all of these):
///     0. and 1: level geometry (256x512 or 512x256 for scrolling)
///     2. and 3: foreground (as above)
///     4. and 5: skybox, or affine 512x512 for windmill
///     6. hud
///     7. ???
///  4. and 5. obj (subdivisions TBD)

Conveniently enough, SUBURB doesn't have a normal foreground layer or skybox, so we can borrow two pairs of adjacent screenblocks from those. We'll configure the background control register for the layer we're using to use our random-filled character block and screenblocks, tell it to use four screenblocks in a 512x512 pixel configuration, and then every 8 frames (at 60 Hz) we'll reposition the layer to the next position in a sequence. Since our play area is 160x160, we can fit 9 frames in a 3-by-3 configuration (taking 480x480 total). Bonus!
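
A minimal sketch of that repositioning sequence, assuming a simple row-major walk through the 3-by-3 grid (the actual order could be anything). static_scroll is a hypothetical helper that returns the top-left corner, in layer coordinates, of the 160x160 region to show; the real scroll registers would also need to account for where the play area sits on screen.

```rust
const PLAY_AREA: u16 = 160;
const GRID: u32 = 3;
const FRAMES_PER_STEP: u32 = 8;

/// Layer-space (x, y) of the noise region to display on a given frame.
fn static_scroll(frame: u32) -> (u16, u16) {
    // Advance one cell every 8 frames, wrapping after all 9 cells.
    let step = (frame / FRAMES_PER_STEP) % (GRID * GRID);
    let col = (step % GRID) as u16;
    let row = (step / GRID) as u16;
    (col * PLAY_AREA, row * PLAY_AREA)
}
```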

Here's what populating the character block looks like:

if let SkyboxBehavior::TvStatic = self.skybox_anim {
    let cb = CHAR_BASE_BLOCKS.index(WORLD_CHARBLOCK_SPECIAL_ID as usize);
    let cb_ptr_u32 = cb.as_usize() as *mut u32;
    let cb_len_u32 = cb.read().len() / size_of::<u32>();
    // can only index 256 tiles with 8-bit BitUnpack
    let slice_len = cb_len_u32 / 2;
    let cb_slice = unsafe { core::slice::from_raw_parts_mut(cb_ptr_u32, slice_len) };
    for i in cb_slice.iter_mut() {
        // avoid fully-transparent pixels. theoretically halves our color resolution,
        // but in practice reducing the alpha coefficient on this layer makes it
        // mathematically imperceptible at the GBA's own color output resolution.
        *i = rng.next_u32() | 0x11111111;
    }
}

Going a bit out of my way to work in u32's here; it's generally more efficient to do so when it's possible, as ARMv4 generally doesn't have different sizes of operations (like m68k's "move.b / move.w / move.l" for example), so operating on anything smaller sometimes emits superfluous masking operations. This admittedly doesn't look very much like idiomatic Rust--sometimes that happens when working at this level and experimenting as I'm doing here.
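
If the OR with 0x11111111 looks like magic: each u32 holds eight 4-bit pixels, and setting the lowest bit of every nibble guarantees none of them is palette index 0. Here's a host-side sketch that checks this, using a tiny xorshift32 generator purely as a stand-in for rand_xoshiro (XorShift32, noise_word, and has_transparent_pixel are all made up for this example).

```rust
/// Minimal xorshift32 PRNG; a stand-in for the real generator, not a
/// recommendation. The seed must be nonzero.
struct XorShift32(u32);

impl XorShift32 {
    fn next_u32(&mut self) -> u32 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 17;
        x ^= x << 5;
        self.0 = x;
        x
    }
}

/// One word of noise: eight 4bpp pixels, each forced to an odd
/// (and therefore nonzero) palette index.
fn noise_word(rng: &mut XorShift32) -> u32 {
    rng.next_u32() | 0x1111_1111
}

/// True if any of the eight 4-bit pixels in `word` is palette index 0.
fn has_transparent_pixel(word: u32) -> bool {
    (0..8).any(|i| (word >> (i * 4)) & 0xF == 0)
}
```

The cost is that only the 8 odd indices out of 16 ever show up, which is the "halves our color resolution" caveat from the comment.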

Here's what populating the screenblocks looks like:

#[repr(align(4))]
struct RandBitsBuffer(pub [u8; TV_STATIC_SBDATA_LEN]);

let mut random_bits = MaybeUninit::<RandBitsBuffer>::uninit();
let src = unsafe {
    driver.rng().fill_bytes(&mut (*random_bits.as_mut_ptr()).0);
    random_bits.assume_init().0
};

let params = BitUnpackParams {
    source_data_length: src.len() as u16,
    source_bit_width: BitUnpackSourceBitWidth::Eight,
    destination_bit_width: BitUnpackDestinationBitWidth::Sixteen,
    data_offset_and_zero_flag: BitUnPackDataParams::new()
        .with_data_offset(TextScreenblockEntry::new().with_palbank(1).0 as u32)
        .with_zero_data(true),
};

let dest = unsafe {
    WORLD_SCREENBLOCK_FG
        .offset(ScreenblockCorner::TopLeft.text_offset() as isize / 2)
        .as_usize() as *mut u32
};

gba::bios::bit_unpack(src.as_ptr(), dest, &params);

We're using a GBA BIOS function here called BitUnpack, which we use to convert each of our random bytes into a 16-bit tile ID in VRAM.
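
For anyone who hasn't met BitUnpack before, here's a rough host-side model of what this particular 8-bit to 16-bit configuration does, as I understand its documented behavior: every source byte is widened, and the data offset is added to it (to zero bytes as well, when the zero-data flag is set). bit_unpack_8_to_16 is a hypothetical stand-in, not the BIOS routine itself, and 0x1000 is what palette bank 1 works out to in a text-mode screenblock entry.

```rust
/// Host-side model of BitUnpack for 8-bit sources and 16-bit destinations.
fn bit_unpack_8_to_16(src: &[u8], data_offset: u16, zero_data: bool) -> Vec<u16> {
    src.iter()
        .map(|&byte| {
            let value = byte as u16;
            if value != 0 || zero_data {
                // The offset is added to nonzero units, and to zero units
                // too when the zero-data flag is set.
                value.wrapping_add(data_offset)
            } else {
                value
            }
        })
        .collect()
}
```

So with an offset of 0x1000, a random byte like 0x05 becomes the screenblock entry 0x1005: tile 5, palette bank 1.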

After we've got our noise layer, it's just a matter of configuring the alpha blend register to give us a roughly 1/8 (or 2:14) opacity for our static layer. BLDALPHA's coefficients are in 1.4 fixed-point, so those numbers below represent 2/16 and 14/16, respectively.

BLDCNT.write(
    ColorEffectSetting::new()
        .with_bg2_1st_target_pixel(true)
        .with_bg1_2nd_target_pixel(true)
        .with_obj_2nd_target_pixel(true)
        .with_backdrop_2nd_target_pixel(true)
        .with_color_special_effect(ColorSpecialEffect::AlphaBlending),
);
BLDALPHA.write(
    AlphaBlendingSetting::new()
        .with_eva_coefficient(2)
        .with_evb_coefficient(14),
);
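
To see what those coefficients do to an individual color channel, here's a sketch of the blend math as I understand it from GBATek: each 5-bit channel of the two target pixels is weighted in sixteenths, summed, and capped at 31. The helper names are mine, and the truncating division is an assumption about rounding.

```rust
/// Packs the two coefficients into a raw BLDALPHA value
/// (EVA in bits 0-4, EVB in bits 8-12).
fn bldalpha_bits(eva: u16, evb: u16) -> u16 {
    (eva & 0x1F) | ((evb & 0x1F) << 8)
}

/// Alpha-blends one 5-bit channel of the 1st and 2nd target pixels.
fn blend_channel(first: u16, second: u16, eva: u16, evb: u16) -> u16 {
    // Coefficients above 16 behave like 16 (that is, 16/16).
    let eva = eva.min(16);
    let evb = evb.min(16);
    ((first * eva + second * evb) / 16).min(31)
}
```

With 2 and 14, a full-brightness noise pixel over black contributes only 3 out of 31, while the scene underneath keeps 27 out of 31 of its own brightness.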

Well, that wasn't so hard! But we're still not quite there with regard to darkening the left and right screen edges. How on earth are we going to do that?

Faking a better radial multiply mask

Okay, remember in the last section when I said that was a tricky usage of the alpha blend register? I was lying. This is a tricky usage of the alpha blend register.

As described in my earlier diagram, the PPU draws its horizontal lines at a rate of 4 of our CPU cycles per pixel. What if we were to race the beam across the screen, writing new values of BLDALPHA repeatedly as we went? Strictly speaking, alpha compositing isn't "multiply" blending, but if we're trying to gradient toward black (0), we can achieve that easily enough by making our alpha coefficients not add up to 1.0.

We just need to write some code that takes a stable number of cycles to produce the next value we should write, and run that repeatedly starting (40 pixels of HUD * 4 cycles =) 160 cycles after our VCount interrupt service routine is called. How about a simple two-dimensional parabola? Whatever the compiler produces for that should be able to stick to some fast arithmetic operations on registers, and it's a matter of tweaking the translation and scale of the function to get the pixels drawn during however many cycles the function ends up taking to match the play area's width. We all remember high school math, right?

Beautiful[4]! This is much closer to what Anodyne looks like, and we did it despite our platform's limitations. Here's what the code looks like:

if vcount < VBLANK_SCANLINE {
    let b = 16 - (((vcount as i32) - 80).pow(2) / 512);
    for x in 0i32..=46 {
        let y = (b - ((x - 25).pow(2) / 32)).clamp(3, 14) as u16;
        BLDALPHA.write(AlphaBlendingSetting::new()
            .with_eva_coefficient(2)
            .with_evb_coefficient(y));
    }
    // fallback for line-at-a-time renderers (most emulators)
    BLDALPHA.write(AlphaBlendingSetting::new()
        .with_eva_coefficient(2)
        .with_evb_coefficient(14));
}
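
If you'd like to poke at the shape of that curve without a GBA handy, the same integer math can be pulled out into a pure function and run on a PC. evb_for and evb_profile are hypothetical wrappers around the expressions above; nothing here touches hardware.

```rust
/// The EVB coefficient the loop above would write at loop step `x`
/// on scanline `vcount`.
fn evb_for(vcount: i32, x: i32) -> u16 {
    // Vertical falloff: brightest at scanline 80, the middle of the screen.
    let b = 16 - ((vcount - 80).pow(2) / 512);
    // Horizontal falloff: brightest at step 25, clamped to 3..=14.
    (b - ((x - 25).pow(2) / 32)).clamp(3, 14) as u16
}

/// All 47 coefficients written across one scanline.
fn evb_profile(vcount: i32) -> Vec<u16> {
    (0..=46).map(|x| evb_for(vcount, x)).collect()
}
```

On the middle scanline this climbs from 3 at the edges to the full 14 near the center, while on the top scanline it never gets above 4.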

I have some ideas for how to optimize this, using a framebuffer of 4-bit brightness values that we can read into 10 registers and write every 8 cycles for a better resolution and fidelity of the effect, but this is definitely a good-enough-for-now point to direct our attention elsewhere for a bit.

Before we're through, though: What's that comment about emulators on about? Well, most GBA emulators out there (including the excellent mGBA which is otherwise quite accurate, and which I've used extensively during development) will take the state of the palette and alpha blending registers as it was at the beginning of HDraw, and render the entire horizontal line to their display without stopping every 4 cycles to check on the CPU's execution. This allows them to run much more efficiently, but it does mean they miss out on this cool effect. However, the behavior I've described would use whatever the value of BLDALPHA was at the end of the previous line, and in our case that would be invisibly dark! So for the pixels hidden beneath the right half of the HUD, I reset the alpha register to what it was before all these beam-chasing shenanigans.

The only emulator so far that I've found to actually render pixel-at-a-time and produce this effect more-or-less accurately is the GBA subsystem of the famously accuracy-focused higan family of emulators:

There's a strange sawtooth pattern (most apparent on the sidewalk in that particular scene), but that's easy to overlook. And as mentioned, the trade-off for this accuracy is performance: This struggles to run at 30FPS (out of 60) on my 3.8GHz POWER9 CPU. Regardless, well done!

Alright, we've covered a lot of ground today, and hopefully gotten much closer to ensuring that we'll be able to do the visuals of Anodyne justice so we can nail the atmospheric aspect of it. But there's another comparably important part to implement to that end: Music (part 4)!

[1] One of Anodyne's direct inspirations, Link's Awakening, certainly did its fair share of pushing the GBC's limits with HBlank shenanigans, so it's only right that we should try to follow in its footsteps and press the GBA as far as we can.

[2] Autodesk Scaleform is a third-party, commercial alternative to Adobe AIR, and its primary use case seems to be simple vector GUI drawing atop the likes of Unreal and CryEngine games. It happens to have modern game console support for this reason, but I don't think it's particularly well-optimized for being the main course for an ActionScript-heavy game, as anyone who's played the Switch port of Anodyne can tell you--the framerate is pretty low and inconsistent! (The PS4 and XbOne ports aren't so bad because their desktop-class CPUs can gloss over the poor optimization, but they are missing the pixelization effects that are iconic to the Anodyne series.)

[3] The only popular third-party GBA cartridge on the market that I'm aware of that can't handle the faster waitstate configurations is the cheap "SuperCard" you can (but shouldn't) find a bunch of on eBay. So sorry, SuperCard owners, if my palettes get too big you might get some weird color patterns on the left edge of the play area!

[4] How'd I get such a clean image from a hardware GBA? By using GameBoyInterface, which can send a stream of PNGs over UDP through a GameCube Broadband Adapter.