There are two things I'm talking about here. One is that I think the warring audio factions might be talking about two very different things (although the FR people seem to think there's only one thing?). The other is which one I think is more important. It's a wall of words, and in the end I'm not sure if I truly understand it myself, so I'm probably gonna get torn to shreds for suggesting it.
I probably should use the word "timing" instead of "time domain"
I think I personally value the timing realm more than the frequency (pitch) realm. The audio engineers are right: you can only discern so much in terms of pitch. The range is 20 Hz - 20,000 Hz, and even that's generous considering 16,000 Hz is already the limit for many older listeners. They're also right that there are psychoacoustic effects at play. BUT I wonder if they forget about timing, because from what I can tell all the standard audio 'measurements' relate to the Frequency Response (pitch) and not timing. A visual equivalent might be: Frequency Response is the Color Spectrum and timing is "Frames Per Second".
Maybe all the in-fighting over the topic comes from this misunderstanding? On one side you have the equivalent of the FR people focusing on 'color reproduction', saying "You can't even see infrared light!" or "If you adjust the color, the two pictures are exactly the same." But then team "timing" is talking about resolution and motion fidelity, not necessarily color reproduction.
For example: how do we determine the location of sounds? By the difference in timing between when audio reaches the left and right ears. It can be as low as 10 microseconds, according to this article:
https://www.sciencefocus.com/science/why-is-there-left-and-right-on-headphones
Another article mentions that humans can detect even less than 10 microseconds (3 - 5 microseconds?) of timing difference:
https://phys.org/news/2013-02-human-fourier-uncertainty-principle.html
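To put those timing numbers in perspective, here's a rough sketch of the classic Woodworth approximation for interaural time difference (my own illustration, not from either article; the head radius and speed of sound are assumed typical values):

```python
import math

HEAD_RADIUS_M = 0.0875   # assumed average adult head radius (~8.75 cm)
SPEED_OF_SOUND = 343.0   # m/s at room temperature

def itd_seconds(azimuth_rad):
    """Woodworth approximation: extra path length = r * (theta + sin(theta))."""
    return (HEAD_RADIUS_M / SPEED_OF_SOUND) * (azimuth_rad + math.sin(azimuth_rad))

# A source directly to one side (90 degrees) gives roughly the maximum ITD:
max_itd = itd_seconds(math.pi / 2)

# Invert the small-angle form (theta + sin(theta) ~ 2*theta) to estimate what
# angular resolution a 10 microsecond timing threshold would imply:
theta_min = 10e-6 * SPEED_OF_SOUND / (2 * HEAD_RADIUS_M)

print(f"max ITD: {max_itd * 1e6:.0f} us")
print(f"angle resolvable at 10 us: {math.degrees(theta_min):.1f} deg")
```

Under these assumptions the maximum ITD comes out around 650 μs, and a 10 μs threshold corresponds to roughly a one-degree shift in source position, which is in the ballpark of measured human localization acuity.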
So many things can be explained by this: spatial cues like staging and imaging. Transients and textures depend on the speed of changes in frequency, not the frequencies themselves. I think those same things help determine how detailed and resolving gear seems, and relate to micro- and macro-dynamics. It's known that if you compare a piano note to a guitar note, it's the brief attack characteristics, the pluck vs. the hammer, that clue us into which sound comes from which instrument. I think all of the "life-like" qualities are mostly timing-dependent rather than frequency- or pitch-dependent.
From what I can tell, the things that make Hi-Fi gear stand out from the cheapest gear with good EQ applied are tied to timing. I've been lucky enough to go to a CanJam before and listened to very expensive things and everything below them in price. To my ears, there IS a difference, and it didn't matter what the price tag said; I wasn't gonna buy the expensive stuff anyway, I just wanted to hear the differences for myself.
I've listened to things that "measure perfectly," like the near-perfect Dan Clark Stealth and Dan Clark Expanse. DC uses metamaterials to help dampen and "shape" the sound, and they coincidentally measure nearly perfectly against the Harman curve. I've listened to many Chi-Fi DACs and amps that also measure perfectly (they all use mounds of negative feedback). And to my ears, those are some of the most boring and lifeless things to listen to.
** So in my opinion, faithful reproduction of frequency is NOT the holy grail. You can EQ things any way you like, and I agree that EQ is excellent! It changes the sound more than most things. But good FR performance is cheap in my opinion, and that's great. What's not widely available is gear that performs well in timing. From what I can tell, that's what people pay up for.
I'd be interested to see if one day the industry starts creating ways to measure time-domain performance. In my analogy above I used the metaphor of "Frames Per Second," but timing resolution can also be expressed in Hz. In the first article, humans can use timing cues as small as 10 microseconds (μs), which equates to 100,000 Hz, to position a sound source. In the second article, humans can detect differences as small as 3 μs. That article mentions detection 10x to 13x better than expected, so if 3 μs is on the extreme 13x end, the other participants were closer to 4 μs, the 10x figure. Going by the 4 μs figure, that equates to 250,000 Hz of resolution. It's not about pitch, it's about changes in the audio.
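The conversion I'm doing above is just the reciprocal of the timing interval; a quick sanity check of the arithmetic:

```python
def timing_to_hz(delta_t_seconds):
    """A timing resolution of delta_t seconds corresponds to a 1/delta_t rate in Hz."""
    return 1.0 / delta_t_seconds

# The 10 us localization threshold from the first article:
print(timing_to_hz(10e-6))   # -> 100,000 Hz
# The ~4 us detection figure inferred from the second article:
print(timing_to_hz(4e-6))    # -> 250,000 Hz
```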
Time domain is eminently measurable and, indeed, measured as a direct consequence of frequency response measurements in essentially every measurement software package. This is because the default paradigm for measuring headphones is a Fourier transform of an impulse response, which gives us both the magnitude and the phase as functions of frequency.
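For anyone curious, that paradigm is easy to demo with numpy. This is a toy impulse response invented for illustration, not real measurement data, but one FFT really does hand you both quantities at once:

```python
import numpy as np

fs = 48000                        # sample rate, Hz
t = np.arange(256) / fs
# Toy "headphone" impulse response: a damped oscillation around 3 kHz
h = np.exp(-t * 2000) * np.sin(2 * np.pi * 3000 * t)

# One Fourier transform of the impulse response gives both quantities:
H = np.fft.rfft(h)
freqs = np.fft.rfftfreq(len(h), 1 / fs)
magnitude_db = 20 * np.log10(np.abs(H) + 1e-12)   # the frequency response
phase_rad = np.angle(H)                            # the phase response

# The magnitude response peaks near the 3 kHz resonance we built in:
peak_freq = freqs[np.argmax(magnitude_db)]
print(f"magnitude peaks at {peak_freq:.0f} Hz")
```

So a "frequency response graph" is typically just the magnitude half of a measurement that already contains the time-domain information in the phase half.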
The video metaphor is quite misleading because our eyes are capable of detecting multiple inputs at once, whereas our ears are pressure detectors. There are no "hearing pixels," just a set of bandpasses that come after the sum of sound pressure in our ears moves the drum. That is, there's only one variable we're looking at (the displacement of the eardrum at any given time), whereas with our eyes we have intensity across multiple points.
The reason that time-domain measurements of headphones, amplifiers, and so on are not discussed is that there simply is no 'there' there: headphones and amplifiers can be accurately approximated as minimum-phase systems within their intended operating bandwidth and level, and the only cases where this isn't true of DACs are when it's intentional (the use of linear-phase reconstruction filters, for example). This being the case, we can infer the time-domain behavior directly from the frequency response behavior.
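That inference can be demonstrated numerically. Below is a sketch of the standard cepstral method (numpy only; the one-pole decay is my own stand-in for a minimum-phase device) that recovers a system's phase response from its magnitude response alone:

```python
import numpy as np

def min_phase_from_magnitude(mag):
    """Recover the minimum-phase phase response from a full-length FFT
    magnitude via the real cepstrum (fold negative quefrencies forward)."""
    n = len(mag)
    cep = np.fft.ifft(np.log(mag)).real   # real cepstrum of the log-magnitude
    w = np.zeros(n)                       # causal folding window
    w[0] = 1.0
    w[1:n // 2] = 2.0
    if n % 2 == 0:
        w[n // 2] = 1.0
    # FFT of the folded cepstrum is log(H_min); its imaginary part is the phase
    return np.fft.fft(w * cep).imag

# A one-pole decay h[n] = 0.5^n is minimum phase (pole inside the unit circle)
h = 0.5 ** np.arange(1024)
H = np.fft.fft(h)
reconstructed = min_phase_from_magnitude(np.abs(H))

# The phase recovered from magnitude alone matches the measured phase:
max_error = np.max(np.abs(reconstructed - np.angle(H)))
print(f"max phase error: {max_error:.2e} rad")
```

For a minimum-phase system the magnitude determines the phase, so measuring "only" frequency response on such a device isn't discarding timing information.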
"timing" is also an issue where you really need to look at the source material - a bandwidth limited system (like a recording microphone, preamp, and ADC) can only produce a "transient" change at a given speed, which is given by the frequency response of the system. A faster rise time requires, symmetrically, a larger bandwidth (at high frequencies, specifically). This is why you see - or saw - people measuring amplifiers with square waves and other "instantaneous" rise time signals. But if you feed those through the lowpass inherent to your ADC, or for that matter the microphone used for the recording itself, you'll find that your transient is slowed, because those systems have a high frequency cutoff.
Maybe you know more about this or have a source for how it works, but your comment reminded me of something that's a bit of a mystery to me: if the ear is a pressure detector, how does stuff like staging work in headphones, when there are just 2 membranes for output and 2 ears for input?
I get how it works that you hear sounds more on the left than on the right, that's just a difference in volume… but precise positioning on something like a virtual stage?
Also, by the time difference of when the sound waves reach your ears… it's not a mystery at all; that's how we hear things in real life as well.