Writing a fast software renderer and audio engine in js (Pt.2 audio)
In my last post I gave a broad overview of the evolution of the renderer for Torchlight's Shadow: how I started by just rendering text, then moved to an HTML canvas drawn on with fillText and fillStyle, then to an offscreen canvas handled by a dedicated thread, and finally to an offscreen canvas that fills image data directly into integer arrays and only touches the canvas to draw the finished images and scale them to the screen. One thing I neglected to mention in that post is that this is still likely several orders of magnitude slower than a shader running entirely on the GPU. For that reason, I believe what I wrote is what's called a software renderer and not an actual shader. If I had written the lighting engine and other parts of my codebase with eventually writing a shader in mind, that would certainly be the fastest way to do it, but my software renderer is far from the biggest bottleneck in my current code, and really getting the benefits of the GPU would mean rewriting quite a bit, so I've opted to save writing a shader for some later time, or perhaps a later project.
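For anyone who missed that post, the final setup looks roughly like this; a minimal sketch with placeholder names and sizes, not my actual code:

```js
// Pixels are written into an integer view over an ImageData buffer, and the canvas is
// only touched at the end to draw the finished frame and scale it up to the screen.
const low = new OffscreenCanvas(240, 135);         // low-res render target (placeholder size)
const lowCtx = low.getContext('2d');
const frame = lowCtx.createImageData(low.width, low.height);
const pixels = new Uint32Array(frame.data.buffer); // one 32-bit integer per pixel

function drawFrame(screenCtx) {
  for (let i = 0; i < pixels.length; i++) {
    pixels[i] = 0xff000000 | (i & 0xff);           // stand-in for whatever the renderer computed
  }
  lowCtx.putImageData(frame, 0, 0);
  screenCtx.imageSmoothingEnabled = false;         // keep the chunky look when scaling up
  screenCtx.drawImage(low, 0, 0, screenCtx.canvas.width, screenCtx.canvas.height);
}
```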
Onto the audio engine: when I first added sound to the game, I just used JavaScript's Web Audio API. I created oscillator nodes for sine waves and audio buffer nodes for noise, and I started building up a library of effects and ways to manipulate those oscillators and audio buffers into various sounds. I was basically writing a crude synthesizer, and the primary means by which I created new sounds was by writing envelope functions that modulated various parameters as a function of time. I originally wrote some very simple code to implement ADSR envelopes because that was what I read was the standard implementation, but I quickly realized that if I wanted to create natural-sounding effects I would want something more granular, so eventually I switched over to using distinct functions for the envelopes. The envelopes were originally just to modulate volume, but I later added pitch modulation with the same method.
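To make the envelope idea concrete, here's a minimal sketch (placeholder names and numbers, not my actual code) of envelopes written as plain functions of time, applied to both volume and pitch while rendering a sound:

```js
// A hand-written envelope: any function of time (seconds) returning a gain from 0..1.
// This gives much finer control than a fixed attack/decay/sustain/release shape.
const thudEnvelope = t => Math.exp(-12 * t) * (1 - Math.exp(-200 * t));

// The same idea applied to pitch: frequency in Hz as a function of time.
const thudPitch = t => 80 * Math.exp(-6 * t) + 40;

// Sampled per-sample during synthesis.
function renderThud(sampleRate, seconds) {
  const out = new Float32Array(Math.floor(sampleRate * seconds));
  let phase = 0;
  for (let i = 0; i < out.length; i++) {
    const t = i / sampleRate;
    phase += (2 * Math.PI * thudPitch(t)) / sampleRate;
    out[i] = Math.sin(phase) * thudEnvelope(t);
  }
  return out;
}
```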
I spent about a week or two working out the kinks in the first version of the audio engine and created a whole library of sound effects, adding new functionality to the engine as needed, and I was pleasantly surprised by the results. I had a vague idea of how I wanted the game to sound. I knew I wanted it to match the visual aesthetic of the game, which in my mind was defined by the combination of very crude symbolic representations of objects and a very polished, natural-looking lighting system. My plan to make the audio engine match that aesthetic was to use simple sounds built up from noise, sine waves, simple overtones, and filters, but to add lush detail by way of effects. Primarily, I wanted to use reverb to make these simple sounds feel like they were in a real space. Reverb, and its interplay with effects like delays and filters, was the primary tool in my arsenal that I planned to use to elevate the very simple sound generation into something more, in the same way that I think my lighting engine elevates the simple ASCII graphics into something evocative that gives fodder to the imagination.
Perhaps I'll go deeper into my design philosophy, inspirations, and aesthetic aspirations for the project in another post, but in short: I wanted to represent what was happening without depicting it explicitly, so that it could play out in the theater of the player's mind, and at the same time I wanted the representation to be aesthetically pleasing and to evoke a certain tone that would inform and contextualize the player's imagination.
Fittingly, making a functional reverb that was fast and efficient and sounded as lush and detailed as I wanted turned out to be the single biggest project of my audio-engine rewrite (and will, also fittingly, take up most of this post). My thinking when I first came back to the audio engine was that I wanted to work out some bugs that had been present since I created it, and I wanted to make it run on its own thread like I had the renderer. I estimated that it wouldn't take much longer than the renderer rewrite, so it would only keep me busy for a weekend, but it turned out to be over a month of work, for a handful of reasons.
Similar to my experience with the renderer, the audio engine rewrite started with the discovery that there were tools for doing what I wanted that I had been unaware of, namely JavaScript's AudioWorklet. The worklet is an interface for running and processing audio on a dedicated audio rendering thread, intended as a platform that allows low-level code to run with very low latency. Again mirroring my experience rewriting the renderer to live on its own thread, I discovered that pretty much all of the data types I had been using in my single-threaded audio engine were not well-suited to being sent from the main thread to the dedicated audio thread. Once again I was going to switch from higher-level abstractions that hid away half the work the computer was doing to faster, low-level data manipulation that required more fine-tuning.
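For reference, the main-thread side of that setup looks roughly like this; the module and processor names are placeholders, but the API calls are the standard AudioWorklet ones:

```js
// Load the processor module onto the audio rendering thread and create a node for it.
const ctx = new AudioContext();
await ctx.audioWorklet.addModule('engine-processor.js');   // placeholder filename
const engineNode = new AudioWorkletNode(ctx, 'engine-processor');
engineNode.connect(ctx.destination);

// Only plain objects and transferable buffers cross the thread boundary, so finished
// audio gets posted as Float32Arrays rather than as WebAudio nodes.
const samples = new Float32Array(48000);
engineNode.port.postMessage({ type: 'play', samples }, [samples.buffer]);
```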
Instead of working with nodes whose frequencies and volumes could be modulated using scheduled linearRampToValueAtTime calls, I ended up writing code to handle each synth voice and its modulation individually. Sine waves needed to keep track of their phase and advance it each sample by a value proportional to the voice's frequency. Noise generators in my old engine had also used linearRampToValueAtTime, both for volume envelopes and for modulating the cutoff of the lowpass filters applied to them. After recreating the basics of my old engine entirely within the audio worklet's process function, which is called to output each 128-sample block (called a render quantum), I quickly found out that doing everything fully in real time gave me no margin for error: if the sounds I wanted to generate required heavy computational lifting, that work needed to happen on a separate thread, otherwise the worklet would fail to process the samples in time. At 48k samples per second, my worklet needed to output 375 render quanta per second, and even if the average speed of the process was fast, if even one block took more than 1/375th of a second to produce, it was going to fail.
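The processor side has roughly this shape; this is a minimal sketch with made-up numbers, not my actual voice code:

```js
// Runs in the AudioWorkletGlobalScope: every call to process() must fill one
// 128-sample render quantum, 375 times per second at 48kHz.
class SineProcessor extends AudioWorkletProcessor {
  constructor() {
    super();
    this.phase = 0;
    this.freq = 220;
    this.gain = 0.2;
  }
  process(inputs, outputs) {
    const channel = outputs[0][0];
    for (let i = 0; i < channel.length; i++) {
      channel[i] = Math.sin(this.phase) * this.gain;
      // Advance the phase by an amount proportional to the frequency each sample.
      this.phase += (2 * Math.PI * this.freq) / sampleRate;
    }
    return true; // keep the processor alive
  }
}
registerProcessor('sine-processor', SineProcessor);
```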
This was when I moved synthesis into multiple dedicated worker threads that would generate the sounds and then send them to the AudioWorklet to be played. I wrote some tooling to port over my old library of sounds so that they could be sent to the workers as a lightweight set of instructions for producing each sound. My envelopes took some work, because you can't send a function to a worker for it to evaluate in real time, so I had to extract all the key points where the envelopes specified changes and then interpolate between those values on the worker thread to keep everything sounding smooth. All of this was slow work, but it was manageable, and it was mostly a game of whack-a-mole where I simply had to squash one bug after another that was making sounds generate improperly or differently in the new engine. The big project I was chipping away at while making these small adjustments was reproducing the convolution reverb I had before, which was really the linchpin of the whole audio engine: it was the primary way I gave simple sound effects a sense of detail and of space.
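As a simplified sketch of that envelope hand-off (I'm assuming uniform sampling here for brevity; the real version extracts the envelope's own key points):

```js
// Main thread: flatten an envelope function into breakpoints that can be posted to a worker.
function toBreakpoints(envelope, duration, steps = 64) {
  const points = [];
  for (let i = 0; i <= steps; i++) {
    const t = (i / steps) * duration;
    points.push({ t, v: envelope(t) });
  }
  return points; // plain data, safe to structured-clone across threads
}

// Worker thread: linearly interpolate between breakpoints per sample so it still sounds smooth.
function valueAt(points, t) {
  if (t <= points[0].t) return points[0].v;
  for (let i = 1; i < points.length; i++) {
    if (t <= points[i].t) {
      const a = points[i - 1], b = points[i];
      const mix = (t - a.t) / (b.t - a.t);
      return a.v + (b.v - a.v) * mix;
    }
  }
  return points[points.length - 1].v;
}
```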
As far as I know, the convolution reverb was the only thing in my old audio engine that was already multi-threaded. This was not because I had intentionally made it so, but because convolution is a process so computationally intensive that the default Web Audio API actually spins up worker threads to process it faster. Convolution reverb, in brief, is a way of making one sound (an input) sound as though it were being played in a particular space by convolving it with a sound that actually was played in that space (called an impulse response, or IR). A convolution is a process by which you blend two signals together: you reverse one of the signals, slide it across the other, and at each step multiply the overlapping elements and sum the products. An impulse response can be thought of as just the response a room has to a functionally instantaneous input sound like a clap. To get an intuition for why this works, I think of it as breaking the input signal up into its individual samples and replacing each one with a copy of the room's response to the clap, scaled by that sample, then adding all of those copies together.
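In code, the naive version of that process is just a pair of nested loops (a sketch, not my engine's implementation):

```js
// Direct convolution: every input sample triggers a copy of the impulse response
// scaled by that sample, and all the copies are summed into the output.
function convolveNaive(input, ir) {
  const out = new Float32Array(input.length + ir.length - 1);
  for (let n = 0; n < input.length; n++) {
    for (let m = 0; m < ir.length; m++) {
      out[n + m] += input[n] * ir[m];
    }
  }
  return out;
}
```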
As you can imagine, this is not a computationally cheap process, and if you naively try to multiply every sample of the input signal with every sample of the IR, you end up with an algorithm that very quickly gets very, very slow. If you had a one-second input and a one-second IR, that's 48k samples times 48k samples = roughly 2.3 billion multiplies. My laptop has a 4.8GHz processor, so in principle it could do something like 4.8 billion multiplies per second on a single thread before any vectorization, but to get the kind of ambience I wanted, I needed to be able to handle inputs and impulse responses in the 5-15 second range, which no realistic amount of multi-threading or vectorization was going to make workable, and in practice any implementation of a convolution algorithm was going to be imperfect and have to share CPU time with all the other processes necessary to produce sound.
Luckily, a convolved signal can be produced much more quickly if you perform something called a Fourier transform on the input signal and the impulse response first. A Fourier transform takes a signal and decomposes it into a set of complex numbers that represent the magnitude and phase of a series of waves that, when added together, would perfectly recreate the original signal. Mathematically, the Fourier transform is continuous, but there is something called the discrete Fourier transform (or DFT) which, given a discrete input of a particular length and resolution, outputs a discrete set of waves that are guaranteed to reproduce the original discrete input signal; in practice it's computed with the fast Fourier transform (FFT). After both the input signal and the impulse response have been transformed this way, if their individual elements are multiplied together (1-to-1, not all-to-all) and the transformation is then reversed (going from a series of numbers describing frequency information back to a signal describing changes over time), the output is identical to that of the naive convolution described above, modulo a normalizing factor that is often included in the inverse transform, and provided both signals were zero-padded to the length of the full convolution so the result doesn't wrap around.
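As a sketch of that pipeline, assuming a hypothetical fft()/ifft() pair that take and return { re, im } arrays of a power-of-two size, with the element-wise multiply (mulSpectra) shown a little further down:

```js
// Frequency-domain convolution. Both signals are zero-padded to the full convolution
// length so the result matches the direct (linear) convolution above.
function convolveFFT(input, ir) {
  const outLen = input.length + ir.length - 1;
  let size = 1;
  while (size < outLen) size *= 2;          // FFT size: next power of two >= N + M - 1

  const X = fft(zeroPad(input, size));      // hypothetical fft(): returns { re, im }
  const H = fft(zeroPad(ir, size));
  const Y = mulSpectra(X, H);               // element-wise complex multiply (sketched below)
  return ifft(Y).re.subarray(0, outLen);    // imaginary part is ~0 for real inputs
}

function zeroPad(signal, size) {
  const padded = new Float32Array(size);
  padded.set(signal);
  return padded;
}
```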
The FFTs and IFFTs are not free, and even after they have been performed, multiplying the spectra together element-wise is also not free. In fact, because these are complex multiplications involving both real and imaginary components, each frequency bin costs four multiplies, an add, and a subtract. All of this needs to be priced into the computational cost of running a reverb this way, but the magic of the algorithm is that it doesn't scale with N-squared like the naive convolution (where N is the number of samples to be convolved); instead it scales roughly with N·log(N), which grows much more slowly and allows for very lengthy input signals and impulse responses.
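The element-wise multiply from the sketch above would look something like this, with the per-bin cost spelled out in the comment:

```js
// (a + bi)(c + di) = (ac - bd) + (ad + bc)i:
// four multiplies, one subtract, and one add per frequency bin.
function mulSpectra(X, H) {
  const n = X.re.length;
  const re = new Float32Array(n);
  const im = new Float32Array(n);
  for (let k = 0; k < n; k++) {
    re[k] = X.re[k] * H.re[k] - X.im[k] * H.im[k];
    im[k] = X.re[k] * H.im[k] + X.im[k] * H.re[k];
  }
  return { re, im };
}
```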
Lots and lots of trial and error later, I got my convolution reverb working. It required breaking impulse responses up into blocks that could be FFT'd individually and applied to incoming blocks of input, so that the reverb could start to be applied before the entire input sound had been generated (to reduce latency). I also switched to caching the FFTs of IRs of various lengths, because my engine also synthesizes IRs at runtime. It was still a huge performance bottleneck, though, because of the heavy lifting involved in all of those computations. It could run in real time with lengthy IRs without stuttering, but the added computation introduced latency. I could reduce that latency by processing smaller FFT blocks and sending them to the worklet thread as soon as they were processed, but that ran the risk of glitchy audio if the impulse response was too long: smaller FFT blocks mean more CPU use per sample and more chances for the rendering thread to catch up to the synthesis thread before the synthesis had actually produced the next block of samples.
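The IR caching part is the simplest piece of that; a minimal sketch, reusing the hypothetical fft() and zeroPad() helpers from the earlier sketches (irId is a placeholder for however the engine identifies a synthesized IR):

```js
// Cache the FFTs of impulse responses so each synthesized IR is only transformed
// once per FFT size, instead of on every convolution.
const irSpectrumCache = new Map();

function irSpectrum(irId, irSamples, fftSize) {
  const key = `${irId}:${fftSize}`;
  let spectrum = irSpectrumCache.get(key);
  if (!spectrum) {
    spectrum = fft(zeroPad(irSamples, fftSize));
    irSpectrumCache.set(key, spectrum);
  }
  return spectrum;
}
```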
This trade-off between CPU and latency meant that I could make it work, but the more computationally intensive the algorithms, the more latency I would need to introduce to prevent this under-run problem. I could have tried to make it work as-is, because the problem really only manifested with very long IRs applied to very long input signals (on the order of 10-20 seconds in length, i.e. ~500k-1M samples), but I didn't want to compromise on the most important effect in my engine, so I turned to WebAssembly.
I knew that the actual convolution step was perfectly suited to SIMD (single instruction, multiple data) vectorization because it was basically just a bunch of sequential multiply, add, and subtract operations, each one independent of the last. Left to my own devices on a desert island, I would never in a million years have figured out that convolution in the time domain is identical to element-wise multiplication in the frequency domain, but I had a loose understanding of the idea, because I understood that, in some sense, multiplying one element in a frequency-domain representation should affect every element of the time-domain output signal, since each element is doing the work of describing a wave that traverses the whole signal. In some sense the algorithm I was using was already leveraging a kind of parallelism. My audio engine was also running on multiple threads, which introduced another means of parallelizing, but I knew that if I wanted to make it run as fast as possible, I would also need SIMD instructions.
This is where we depart from pure JavaScript, because JS has no native implementation of SIMD instructions (as far as I can tell it was once worked on, but then abandoned in favor of leaving it to WebAssembly). First I rewrote the core of my convolution by hand-rolling WebAssembly to replace the step of element-wise complex multiplication of the two signals. After getting used to the syntax of Wasm, it was just a matter of getting all the indices right and figuring out how to get my data into Wasm memory in the right order to be operated on. After that, my reverb ran significantly faster, and the main bottleneck became the FFT and IFFT functions themselves, which I eventually rewrote in Wasm as well. For those I used a little bit of AssemblyScript. AssemblyScript is a language that is meant to resemble TypeScript as much as possible while still compiling directly to Wasm. It took a while of messing with the build tools to get AssemblyScript to output portable Wasm rather than Wasm that included the whole AssemblyScript runtime, but once I got it working, I was able to write code that was semantically identical to hand-rolling my own WebAssembly but with nicer syntax.
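To give an idea of what the JavaScript side of that handoff looks like (the module and export names here are placeholders, and I'm assuming the module exports a large enough memory; the actual SIMD kernel lives in the Wasm):

```js
// The real and imaginary arrays are copied into linear Wasm memory laid out
// contiguously, the exported kernel multiplies them in place, and the result is
// read back out of the same memory.
const { instance } = await WebAssembly.instantiateStreaming(fetch('convolve.wasm'));
const { memory, mulSpectraSimd } = instance.exports;   // placeholder export names

function mulSpectraWasm(X, H) {
  const n = X.re.length;
  const f32 = new Float32Array(memory.buffer);
  f32.set(X.re, 0);
  f32.set(X.im, n);
  f32.set(H.re, 2 * n);
  f32.set(H.im, 3 * n);
  mulSpectraSimd(n);                 // assumed to overwrite the first 2*n floats with the result
  return {
    re: f32.slice(0, n),
    im: f32.slice(n, 2 * n),
  };
}
```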
I think most people who use Wasm primarily use it as a compilation target for C or C++, but because the primary things I wanted from this Wasm rewrite were SIMD and a faster memory layout, I felt that it would be easier to write it myself rather than having to trust a compiler to implement the vectorization and memory formatting for me.
Having a fast reverb meant I could easily add multiple layers of it: I could define individual sound effects to have their own reverb that runs on the synthesizing worker threads, and then run another effects chain on the AudioWorklet thread that affects the entire output. After writing a delay that could dynamically change its length and feedback and could apply various effects only to the delayed signal (applying them multiple times over with feedback), I could finally realize a goal I'd had for a while: having the ambient effects chain change depending on player location. Big open spaces now feel yawning and cavernous. Rooms of dirt and mud and moss dampen sounds, while rooms of smooth stone reflect them with all their sparkling highs. Tight spaces feel small and claustrophobic, and everything now runs in full stereo with separate impulse responses for the left and right channels.
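The delay at the heart of that ambient chain is conceptually simple; here's a minimal single-channel sketch (not my actual implementation):

```js
// A feedback delay: a circular buffer whose read position trails the write position by
// a (changeable) delay length, with an effect applied only to the fed-back signal so it
// piles up a little more on every repeat.
class FeedbackDelay {
  constructor(maxSeconds, sampleRate) {
    this.buffer = new Float32Array(Math.ceil(maxSeconds * sampleRate));
    this.writePos = 0;
    this.delaySamples = Math.floor(0.3 * sampleRate); // can be changed at runtime
    this.feedback = 0.5;                              // can be changed at runtime
    this.effect = x => x;                             // e.g. a lowpass applied only to the delayed signal
  }
  process(sample) {
    const readPos = (this.writePos - this.delaySamples + this.buffer.length) % this.buffer.length;
    const delayed = this.effect(this.buffer[readPos]);
    this.buffer[this.writePos] = sample + delayed * this.feedback;
    this.writePos = (this.writePos + 1) % this.buffer.length;
    return sample + delayed;
  }
}
```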
Only after doing lots of reading on digital signal processing (DSP) did I realize how much more I could do in my engine than can be done in similar engines that have to take real-time input. For my use case, all of the sound is synthesized, so it can be generated at a rate orders of magnitude faster than real time, and my code for processing sounds has the luxury of being able to wait for large blocks of input to process all at once, because the time it spends waiting for that input is only a tiny fraction of the time the input will take to play.
In a truly real-time DSP paradigm, any process that requires a substantial buffer of input introduces a latency equal to the time it takes to collect that buffer. The default Web Audio API is quite powerful, but I was able to write something that leverages a huge advantage my use case permits and which the JS API cannot, because it has to be general-purpose. Rewriting a lower-level version of an audio engine that was already functional turned out to be not just an opportunity to learn to implement it myself, but also a chance to tailor the result specifically to my project's needs. To echo the conclusion of my last post: I've learned that JavaScript can be very fast, and modern JIT compilers can run some kinds of code at pretty much native speed, but the language offers developers a whole lot of abstractions so that they don't have to think about implementation details. That can be extremely useful when you just want something functional for testing, but I've found it very rewarding to eventually strip those merely functional abstractions away, like removing the scaffolding from a building as it nears completion. Once you know exactly what a piece of software needs to do, it is extremely useful to have a means of concisely writing those instructions in a format close to how the CPU will actually read them.
For future projects I will likely not use JavaScript, because it is not particularly suited to this way of thinking about writing software, but I've been pleasantly surprised at the capabilities of the language as I've pushed it to make performant and powerful tools.
In summary: I rewrote my renderer and audio engine to make extensive use of parallelism and to be far faster than my use case actually necessitates, all while using almost entirely JavaScript to do it (with a little bit of cheating by way of Wasm).
If anyone is curious about my implementation details, my email is located in the credits screen of Torchlight's Shadow here on Itch. Feel free to reach out!