The code is waaaay over my head, but I see a function called "findAccuratePeakFrequency". Does it take as input something like an array where each element represents a frequency and the value represents the loudness at that frequency? When I first tried to use WebAudio for my little app, I used input like that and simply looked for the frequency bucket with the highest value. That didn't work well at all. Is this function using the same kind of input, but with more complicated math to get a better answer?
Then I switched to very simple autocorrelation, and after that to a fancier autocorrelation-based method called "YIN".
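For anyone wondering what "very simple autocorrelation" looks like in practice, here is a minimal sketch in plain Python. It is not the code from the project being discussed and it is not YIN; the function name, the `min_hz`/`max_hz` search range, and the trick of skipping the lobe around lag 0 are all assumptions made for illustration.

```python
import math

def autocorrelation_pitch(samples, sample_rate, min_hz=50.0, max_hz=1000.0):
    """Estimate the fundamental frequency (Hz) of a mono signal.

    Hypothetical helper: min_hz and max_hz bound the search range.
    Returns None when no usable peak is found.
    """
    n = len(samples)
    min_lag = max(1, int(sample_rate / max_hz))
    max_lag = min(int(sample_rate / min_hz), n - 1)
    if max_lag <= min_lag:
        return None

    # Raw (biased) autocorrelation: fewer terms at larger lags, which
    # mildly favours the first period over its multiples.
    acf = [sum(samples[i] * samples[i + lag] for i in range(n - lag))
           for lag in range(max_lag + 1)]

    # Walk down the slope of the lobe around lag 0 so it is not mistaken
    # for a period.
    lag = min_lag
    while lag < max_lag and acf[lag] > acf[lag + 1]:
        lag += 1
    if lag >= max_lag:
        return None

    # The strongest remaining peak is taken as the period, in samples.
    best_lag = max(range(lag, max_lag + 1), key=lambda k: acf[k])
    if acf[best_lag] <= 0:
        return None
    return sample_rate / best_lag

# Usage: a pure 200 Hz sine sampled at 8 kHz.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 200 * i / sample_rate) for i in range(1024)]
estimate = autocorrelation_pitch(tone, sample_rate)
```

Because the lag is a whole number of samples, the result is quantized to `sample_rate / lag`, which is one reason fancier methods interpolate around the peak.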
There's also a comment: "The logarithmically binned spectrum has a resolution of one cent". Does that mean there is a bucket for each cent? When I was using WebAudio I think the buckets were something like 14 units of frequency apart (Hz?), and I didn't discover a way to get finer resolution.
The problem with FFTs is that the bins are evenly spaced in Hz while pitch is logarithmic, so at the lower frequencies you have very few bins per note, but at the higher end you get ridiculous accuracy, and there is no easy way to even this out. Binning on the high end saves some space but doesn't make the low end any more accurate.
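To put rough numbers on that, here is a small sketch that converts the width of one FFT bin into cents at a given frequency. The 44100 Hz sample rate and 2048-point FFT are assumed example values, not figures from the code under discussion, and the function names are made up for illustration.

```python
import math

def bin_width_hz(sample_rate, fft_size):
    """Spacing between adjacent FFT bins, in Hz (the same everywhere)."""
    return sample_rate / fft_size

def bin_width_cents(freq_hz, sample_rate, fft_size):
    """Musical width, in cents, of the bin that starts at freq_hz."""
    step = bin_width_hz(sample_rate, fft_size)
    return 1200.0 * math.log2((freq_hz + step) / freq_hz)

# Assumed example settings.
SAMPLE_RATE = 44100
FFT_SIZE = 2048

low = bin_width_cents(82.41, SAMPLE_RATE, FFT_SIZE)    # around a guitar's low E
high = bin_width_cents(4186.0, SAMPLE_RATE, FFT_SIZE)  # around a piano's top C
```

With these settings one bin near the low E spans about four semitones, while one bin near the top C spans less than a tenth of a semitone, even though both bins are equally wide in Hz.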
So you need to run multiple methods in parallel and decide, based on the very rough distribution of the energy in the spectrum, which method has the biggest chance of success or, alternatively, use the output of both methods to drive some logic that assigns a weight to the output of each.
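One way such a decision could be sketched: use the spectral centroid as the "very rough distribution of the energy" and turn it into a weight per method. Everything here is hypothetical, including the function names, the 500 Hz crossover, and the 200 Hz softness; a real system would have to tune these against recordings.

```python
import math

def spectral_centroid_hz(magnitudes, bin_width_hz):
    """Energy-weighted mean frequency; bin i is assumed to sit at i * bin_width_hz."""
    total = sum(magnitudes)
    if total == 0:
        return 0.0
    return sum(i * bin_width_hz * m for i, m in enumerate(magnitudes)) / total

def method_weights(centroid_hz, crossover_hz=500.0, softness_hz=200.0):
    """Logistic weighting: low centroid favours autocorrelation, high favours the FFT peak.

    crossover_hz and softness_hz are made-up tuning knobs.
    """
    w_fft = 1.0 / (1.0 + math.exp(-(centroid_hz - crossover_hz) / softness_hz))
    return {"autocorrelation": 1.0 - w_fft, "fft_peak": w_fft}

def pick_method(magnitudes, bin_width_hz):
    """Return the name of the favoured method together with both weights."""
    weights = method_weights(spectral_centroid_hz(magnitudes, bin_width_hz))
    return max(weights, key=weights.get), weights

# Usage: most energy in bin 1 (about 21.5 Hz per bin), i.e. a low-pitched signal.
low_spectrum = [0.0, 10.0, 1.0] + [0.0] * 100
chosen, weights = pick_method(low_spectrum, 21.5)
```

The weights could also scale each method's confidence instead of making a hard choice. Averaging the two frequency estimates directly is risky, since the methods often disagree by a whole octave.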
It's a tricky problem, to put it mildly. Also, this is the simplest form of the problem; doing this accurately for multiple pitches at once is much harder.
Another source of inspiration is 'Onsets and Frames', which powers some automated transcription software: