I've been tearing my hair out trying to repro this, and eventually managed to force some mojibake on my dev server by... well, never mind. The point is, after pritkiy_kaban's help with investigating the HTTP headers, I think the problem comes from the way the old LJ code tries to avoid ever invoking Perl's internal Unicode handling.
Basically, the site is accepting UTF-8 from the outside world, storing UTF-8 in the database, and outputting UTF-8 to the web... but it's never admitting as much to Perl, and tries to always treat text as a sequence of (mostly) opaque bytes. (In fact, it's not even telling MySQL it's storing UTF-8, so if I do a direct query for Unicode content on my dev server, I get absolute garbage.) There's a bunch of code to check for Unicode validity and convert old 8-bit encodings from the database (which isn't relevant to us, but was to LJ), but it all kinda does it on the down-low.
The problem is, if you ever combine "just bytes" text with a string that has, at some point, confessed to being real Unicode, Perl upgrades the "just bytes" text by reading every byte as one ISO-8859-1 character so that it can also Be Real Unicode Characters, resulting in garbage (because it was UTF-8 all along and was trying to stealth through the system without ever getting decoded).
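To make that concrete, here's a minimal sketch of the failure mode in isolation. The sample strings and variable names are made up for illustration, and the `binmode` line is only there so the result prints readably in a UTF-8 terminal; the real site code doesn't necessarily look anything like this.

```perl
use strict;
use warnings;
use Encode qw(decode);

# "café" as raw UTF-8 bytes. Perl has not been told this is UTF-8,
# so to Perl it is just five opaque bytes.
my $bytes = "caf\xC3\xA9";

# The same word, properly decoded into a four-character string.
my $chars = decode('UTF-8', "caf\xC3\xA9");

# Concatenating the two makes Perl upgrade $bytes, reading each byte
# as one Latin-1 character: 0xC3 becomes "Ã" and 0xA9 becomes "©".
my $mixed = $chars . ' / ' . $bytes;

binmode(STDOUT, ':encoding(UTF-8)');
print "$mixed\n";   # café / cafÃ©
```

The first half comes out fine and the second half is mojibake, even though both started as the same bytes.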
So, something in the chain is outputting a string that's marked as being UTF-8... but only under SOME circumstances, for SOME users, on SOME pages. And to fix the bug, someone's gotta figure out exactly what's doing that, and have it re-encode that text back to "just bytes" before passing it on. (That, or launch a multi-year inquisition to make the entire 20-year-old codebase unicode-aware.)
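For what it's worth, the "re-encode back to just bytes" step might look roughly like this once the culprit is found. This is only a sketch: `to_bytes` is a hypothetical helper, not something in the codebase, and `$flagged` stands in for whatever the still-unknown offender returns.

```perl
use strict;
use warnings;
use Encode qw(decode encode_utf8);

# Hypothetical helper: hand back UTF-8 bytes whether or not the input
# arrived flagged as a character string.
sub to_bytes {
    my ($text) = @_;
    return utf8::is_utf8($text) ? encode_utf8($text) : $text;
}

# Opaque UTF-8 bytes, which is what the legacy code expects everywhere.
my $bytes = "caf\xC3\xA9";

# Stand-in for the culprit's output: a properly decoded character string.
my $flagged = decode('UTF-8', "na\xC3\xAFve");

# Re-encode before combining, so nothing gets upgraded along the way.
my $joined = to_bytes($flagged) . ' ' . $bytes;

print "$joined\n";
```

Checking the flag with `utf8::is_utf8` is a heuristic rather than a guarantee (a string can hold Latin-1 characters with the flag off), so it only makes sense at a boundary where unflagged text is already known to be UTF-8 bytes.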
And no one working on the site has ever been able to repro the damn thing. UGH.
(Yes, I also think the br tag thing is unrelated. I'm interested in that, but it seems less urgent; also a big patch that interferes with that whole area of HTML-mangling just got merged, so it might act totally different after the next code deploy anyway.)