The following .html file encoded in UTF-8, when loaded from disk in Google Chrome (so no server headers hinting anything), yields document.characterSet == "UTF-8". If you make it "a" instead of "ä" it becomes "windows-1252".

    <html>ä

This renders correctly in Chrome and does not show mojibake, as you might have expected from old browsers. Explicitly specifying a character set just ensures you're not relying on the browser's heuristics.
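
To make the difference between the two files concrete, here is a toy sketch in Python. It is not Chrome's actual detector; `guess_encoding` is a hypothetical helper that only mimics the observed behaviour: pure ASCII gives the detector nothing to go on, so it falls back to the legacy default, while a valid multi-byte UTF-8 sequence is strong evidence for UTF-8.

```python
def guess_encoding(data: bytes) -> str:
    """Toy heuristic (an assumption for illustration, not Chrome's algorithm)."""
    if data.isascii():
        # Nothing distinguishes UTF-8 from windows-1252 here,
        # so fall back to the legacy default.
        return "windows-1252"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        # Non-ASCII bytes that are not valid UTF-8.
        return "windows-1252"
    return "UTF-8"


with_umlaut = "<html>ä".encode("utf-8")  # ä becomes the two bytes C3 A4
ascii_only = "<html>a".encode("utf-8")

print(guess_encoding(with_umlaut))         # UTF-8
print(guess_encoding(ascii_only))          # windows-1252
print(with_umlaut.decode("windows-1252"))  # <html>Ã¤ (the mojibake you would get)
```

The last line shows what a browser would render if it decoded the UTF-8 file as windows-1252 instead.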


There may be a difference here between local and network loads. It may also matter whether the multi-byte UTF-8 character appears in the first 1024 bytes, and how much network delay there is before that character arrives.
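
A rough sketch of why the position of the character could matter, assuming a detector that only inspects a fixed-size prefix. The 1024-byte window is the figure from the comment above; `sniff_prefix` is a hypothetical helper and not taken from any browser.

```python
import codecs


def sniff_prefix(data: bytes, window: int = 1024) -> str:
    """Guess an encoding from the first `window` bytes only (illustrative)."""
    prefix = data[:window]
    if prefix.isascii():
        return "windows-1252"
    # An incremental decoder with final=False tolerates a multi-byte
    # sequence that is cut off at the end of the window.
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(prefix, final=False)
    except UnicodeDecodeError:
        return "windows-1252"
    return "UTF-8"


early = "<html>ä".encode("utf-8") + b" " * 2000
late = b"<html>" + b" " * 2000 + "ä".encode("utf-8")

print(sniff_prefix(early))  # UTF-8
print(sniff_prefix(late))   # windows-1252
```

Under that assumption, the same document is guessed differently depending only on where the first non-ASCII byte falls.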


The original claim was that browsers don't ever use UTF-8 unless you specify it. Then ko27 provided a counterexample that clearly shows that a browser can choose UTF-8 without you specifying it. You then said "I'm pretty sure this is incorrect"--which part? ko27's counterexample is correct; I tried it and it renders correctly as ko27 said. If you do it, the browser does choose UTF-8. I'm not sure where you're going with this now. This was a minimal counterexample for a narrow claim.


I think when most people say "web browsers do x" they mean when browsing the world wide web.

My (intended) claim is that in practice the statement is almost always untrue. There may be weird edge cases when loading from local disk where it is sometimes true, but not in a way that web developers will usually encounter, since you don't put websites on local disk.

This part of the HTML5 spec isn't binding, so who knows what different browsers do, but the spec does recommend that browsers handle the charset of documents differently depending on whether they come from local disk or from the internet.

To quote: "User agents are generally discouraged from attempting to autodetect encodings for resources obtained over the network, since doing so involves inherently non-interoperable heuristics. Attempting to detect encodings based on an HTML document's preamble is especially tricky since HTML markup typically uses only ASCII characters, and HTML documents tend to begin with a lot of markup rather than with text content." https://html.spec.whatwg.org/multipage/parsing.html#determin...


Fair enough. I intended only to test the specific narrow claim OP made that you had quoted, which seemed to be about a local file test. This shows it is technically true that browsers are capable of detecting UTF-8, but only in one narrow situation and not the one that's most interesting.

Indeed, in the Chromium source code we can see a special case for local files with some comment explanation. https://github.com/chromium/chromium/blob/dea8b2608dd5d95e38...



