Working with Multibyte Encoding and Output Buffers


iwato

Question: Why must the same character encoding be set four times to obtain a proper rendering of multibyte characters when the output buffer created by the ob_start() function in the code below is finally flushed?

Discussion: Take out any of the first three lines, or remove the callback function from the ob_start() call in the fourth line, and anything written in multibyte characters fails to render properly in the resulting page.

Now, I can understand that the browser, the mb module, and my system each have their own method of encoding. Still, I cannot understand why all four of the above items are necessary.

<?php
    header('"Content-Type" content="text/html; charset=UTF-8"');
    mb_internal_encoding('UTF-8');
    mb_http_output('UTF-8');
    ob_start('mb_output_handler');
?>

Roddy


Well, what action would you expect? If you set the Content-Type header to "text/html; charset=UTF-8", would you expect changes anywhere else, or are you just trying to send a header? The multibyte extensions seem self-explanatory: one of them is for internal usage and one of them is for external usage. It makes sense to have both options. It would also make sense to have one function that sets both, but it does make sense to have both options individually. The output buffering call isn't necessarily related to the encoding; you're just telling it to use an output handler other than the default.

If you're asking why all of these steps are necessary, it's because it wouldn't make sense if doing any one of them also did everything else. If I set the Content-Type header to UTF-8, that doesn't automatically mean I also want to use the multibyte extension, change the internal and external encodings, and use an output buffer. It just means I want to set the Content-Type header to UTF-8. If you think all of those should be grouped, then write a function for yourself so that you can call one function in your code to execute whatever code you want to use to set the page up.
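For what it's worth, the "write a function for yourself" suggestion could look something like the sketch below. The name init_utf8_output() is made up for illustration. It assumes the mbstring extension is loaded and that it is called before any output is sent, and it writes the header in the usual "Name: value" form.

```php
<?php
// Hypothetical helper (not from the thread): groups the four setup calls.
// Assumes the mbstring extension is available.
function init_utf8_output(): void
{
    // Tell the client how to decode the response body.
    if (!headers_sent()) {
        header('Content-Type: text/html; charset=UTF-8');
    }
    // Default encoding assumed by the mb_* string functions.
    mb_internal_encoding('UTF-8');
    // Target encoding used by mb_output_handler().
    mb_http_output('UTF-8');
    // Buffer output and pass it through the mbstring handler on flush.
    ob_start('mb_output_handler');
}
```

A page would then call init_utf8_output() once, at the very top, before echoing anything.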


The multibyte extensions seem self-explanatory: one of them is for internal usage and one of them is for external usage.
Unless setting up an output buffer is considered an external act, I do not understand the externality of any of my code. I am in the same domain, using the same web server and web application. No HTTP message is either sent or received. Moreover, according to the online PHP manual, the value of the $encoding parameter of the mb_internal_encoding() function is used for the HTTP input character encoding conversion, the HTTP output character encoding conversion, and the default character encoding for string functions defined by the mbstring module.

Roddy

The thing is, the header() function doesn't really "know" that you're setting an encoding for the output, and doesn't even know the implications of the Content-Type for that matter. It just blindly writes out a line in the HTTP response as part of the headers. You are the one who is expected to know the semantics of whatever you write.*

Asking PHP to know that is like asking it to know that the string "<table>" is the start of an HTML table, and expecting PHP to somehow automatically write "</table>" somewhere. On that note, XML APIs allow you to do just that, but that's just it: the API itself knows the consequences of having an opening tag, and so it writes out a closing one upon document completion. A plain echo won't know that.

BTW, the correct syntax is:

header('Content-Type: text/html; charset=UTF-8');

And if PHP didn't warn you, it's for the same reason as above: the header() function doesn't know the semantics of HTTP headers. It just knows that they are part of the HTTP response and precede the output, so it writes out a line there.

It is for pretty much the same reason that the rest of the things are needed: they only know their own semantics, but don't know what other parts may expect.

You can avoid using the 3rd and 4th line if you save your file with UTF-8 encoding. I'm not exactly sure how you do that on a Mac though. On Windows, it's a separate dropdown menu in the "Save As..." dialog. And the mb_internal_encoding() value can be set in php.ini (though using this function is probably best for portability, for cases where you can't edit php.ini).

* I frequently suggest that people use a tool like Fiddler to see the raw HTTP requests and responses... things like this make MUCH more sense when you train yourself to map them to what you see there. For Mac, Firebug's Net tab is a close enough equivalent, though not as good IMHO.

Off-topic (sort of): Watched Tron: Legacy yet? If you haven't, WARNING, spoilers ahead...

OK, so Quorra and Kevin Flynn never knew what Clu was going to do during the 20 years they were in the grid. They constantly monitored each other. The only reason Quorra was able to help Sam Flynn was because she listened for events from the dome object (funny how close this sounds to "DOM object", yet it's not a tree... at least not literally). Clu could've killed Sam right there, and she wouldn't have known if the event wasn't dispatched. Add to that the fact that she wouldn't know the semantics (the emotional sadness and all that) until she presents Kevin Flynn with the event information and asks for input.

It's the same deal here, only instead of "the grid" you have "PHP", and instead of "programs" you have "functions" and the like.

BTW, the only reason the whole story in Tron: Legacy spins is because Kevin Flynn had the stupid idea of writing a global try-catch where the catch handler was removing the sender. Who knew the ISOs were going to be declared suddenly and make Clu throw an "UnknownClassException"?


Moreover, according to the online PHP manual, the value of the $encoding parameter of the mb_internal_encoding() function is used for the HTTP input character encoding conversion, the HTTP output character encoding conversion, and the default character encoding for string functions defined by the mbstring module.
I answered without checking the manual. It looks like using mb_http_output is redundant if you're using mb_internal_encoding.

BTW, the correct syntax is:
header('Content-Type: text/html; charset=UTF-8');

This is much less complicated, but the one I provided also worked. I copied it from the Dreamweaver template for creating PHP files.
You can avoid using the 3rd and 4th line if you save your file with UTF-8 encoding. I'm not exactly sure how you do that on a Mac though. On Windows, it's a separate dropdown menu in the "Save As..." dialog. And the mb_internal_encoding() value can be set in php.ini (though using this function is probably best for portability, for cases where you can't edit php.ini).
Unfortunately, this does not seem to be the case, unless, of course, the Dreamweaver preview function alters a file's content before sending it to the browser. Most of my PHP files are saved with UTF-8 character encoding.
I frequently suggest that people use a tool like Fiddler to see the raw HTTP requests and responses... things like this make MUCH more sense when you train yourself to map them to what you see there. For Mac, Firebug's Net tab is a close enough equivalent, though not as good IMHO.
I just tried the Firebug Net tab. What a vital source of information. Thank you very much for this very useful suggestion.

You must have enjoyed Tron: Legacy. The original Tron was produced in 1982, about the time I took my first and last course in computer science. I can still remember creating and inputting punch cards in order to obtain a hard-copy read-out. No, I have seen neither Tron nor Tron: Legacy, but I will keep the latter in mind should I someday soon find myself in a cinema.

Roddy

Unfortunately, this does not seem to be the case, unless, of course, the Dreamweaver preview function alters a file's content before sending it to the browser. Most of my PHP files are saved with UTF-8 character encoding.
Have you tried it with the syntax above? It's possible it didn't work because the Content-Type header was never really respected. Now that the syntax is correct, the header will be respected, so it should work.
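To make the difference between the two strings concrete, here is a small sketch. The function looks_like_header_line() is a hypothetical helper, not part of PHP. It only checks for the "Name: value" shape that an HTTP header line is expected to have, and says nothing about what PHP or the server does with a malformed string.

```php
<?php
// Hypothetical check (illustration only): does a string have the
// "Name: value" shape of an HTTP header line?
function looks_like_header_line(string $line): bool
{
    // A field name made of letters, digits and hyphens, then a colon.
    return preg_match('/^[A-Za-z0-9-]+:\s*\S/', $line) === 1;
}

// The string copied from the HTML meta tag template.
$fromMetaTag = '"Content-Type" content="text/html; charset=UTF-8"';
// The corrected header line.
$corrected = 'Content-Type: text/html; charset=UTF-8';

var_dump(looks_like_header_line($fromMetaTag)); // bool(false)
var_dump(looks_like_header_line($corrected));   // bool(true)
```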

Have you tried it with the syntax above? It's possible it didn't work because the Content-Type header was never really respected. Now that the syntax is correct, the header will be respected, so it should work.
I have learned now that the code for an HTML meta tag and that of the header() function, although similar, are not the same. Thank you for pointing this out. You were correct in noting that an important source of my confusion was an inappropriately formatted $string parameter in my header() call.

I have also come to realize the following:
1) The mb_http_output() function is unnecessary when the character encodings for the header() and mb_internal_encoding() functions agree.
2) The header() function is unnecessary when both the mb_internal_encoding() and mb_http_output() functions are present and agree.
3) The header() function overrides the mb_internal_encoding() function when the mb_http_output() function is absent.
4) If both mb_internal_encoding() and mb_http_output() are present, they must agree.
All of the above assumes a default character encoding of the form iso-8859-1.

It has been an interesting and very worthwhile dialogue. Many thanks to everyone involved.

Roddy

The header() call tells the browser what to expect (when written correctly, that is), mb_internal_encoding() specifies the default for the other mb functions, and every mb function, including mb_http_output(), can override that setting.

With that in mind, only your first realization is correct. As for the rest...
2. It might have been that according to your tests, but that's you relying on the browser to guess the character set, possibly based on the HTML meta. Anytime you're relying on a guess, you're risking a break.
3. header() overrides nothing. It deals with the HTTP header, while the mb functions deal with the content itself. That's exactly why saving your PHP file as UTF-8 and having an HTTP header that says UTF-8 is enough for outputting UTF-8 content: your content would already be UTF-8 by the time PHP receives it, and you'd inform the browser of that, whereas mb_http_output() (well... mb_output_handler() technically) can convert it to UTF-8 if it's not.
4. No. If both mb_internal_encoding() and mb_http_output() are present, mb_http_output() wins as far as the encoding of mb_output_handler() is concerned. For other mb functions, mb_internal_encoding() wins unless specified otherwise in the corresponding function.
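The point about mb_internal_encoding() being only a default for the other mb functions can be seen with mb_strlen(). This is a minimal sketch, assuming the mbstring extension is loaded. The variable names are just for illustration.

```php
<?php
// "\xC3\xA7" is the UTF-8 byte pair for "ç", written as escapes so the
// result does not depend on how this file is saved.
$word = "\xC3\xA7a"; // "ça" in UTF-8: 3 bytes, 2 characters

mb_internal_encoding('UTF-8');
$asUtf8 = mb_strlen($word);            // falls back to the internal encoding

mb_internal_encoding('ISO-8859-1');
$asLatin1 = mb_strlen($word);          // same bytes, counted as Latin-1
$explicit = mb_strlen($word, 'UTF-8'); // the per-call argument wins

printf("%d %d %d\n", $asUtf8, $asLatin1, $explicit); // prints "2 3 2"
```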


2. It might have been that according to your tests, but that's you relying on the browser to guess the character set, possibly based on the HTML meta. Anytime you're relying on a guess, you're risking a break.
No, there is zero meta-data. The HTML code in my experimental page consists only of the barest necessities: an HTML tag and a BODY tag. A TITLE tag and some other HTML formatting tags are also present, but the document contains no <HEAD> tag.
3. ... your content would already be UTF-8 by the time PHP receives it, and you'd inform the browser of that, whereas mb_http_output() (well... mb_output_handler() technically) can convert it to UTF-8 if it's not.
Perhaps "override" was not the appropriate word. What is certainly true is that the same result can be achieved either with the header() function alone, or alternatively with the mb_internal_encoding() and mb_http_output() functions in the absence of a header() call. Keep in mind that I am not using any other mbstring functions in my example.
4. No. If both mb_internal_encoding() and mb_http_output() are present, mb_http_output() wins as far as the encoding of mb_output_handler() is concerned. For other mb functions, mb_internal_encoding() wins unless specified otherwise in the corresponding function.
If you consider output to be what is sent to the output buffer by the ob_start() function, then in my system you are wrong. Create an HTML page as I described above and enter the following PHP code while substituting appropriately in the echo statement. It fails to render double and triple byte character strings properly.
<?php
    mb_internal_encoding('ISO-8859-1');
    mb_http_output('UTF-8');
    ob_start('mb_output_handler');
?>
<?php
    echo 'any mb-string you like';
?>

Roddy
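One possible reading of this result, assuming the handler converts from the internal encoding to the output encoding as the manual describes: text that is already UTF-8 but declared as ISO-8859-1 gets encoded a second time. The sketch below reproduces that effect with mb_convert_encoding() rather than with the output handler itself, so it shows the mechanism only and is not a trace of the page above.

```php
<?php
// Assumes the mbstring extension. "\xC3\xA9" is "é" encoded as UTF-8.
$utf8 = "\xC3\xA9";

// Treat those two bytes as two ISO-8859-1 characters and convert to UTF-8,
// which is what an internal encoding of ISO-8859-1 would request.
$twice = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($utf8), "\n";  // c3a9
echo bin2hex($twice), "\n"; // c383c2a9, which displays as "Ã©"
```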


Interesting... a little playing around with Fiddler seems to suggest that what mb_http_output() is really doing is adding ";charset=$encoding" to the Content-Type header and nothing else.

Furthermore, reading its manual page...

from internal character encoding to HTTP output character encoding.
So... mb_output_handler() inspects the current HTTP Content-Type header and converts from the internal encoding to whatever the header says.

When the file is saved as UTF-8, your internal encoding should also be UTF-8, since that's where it starts. If your desired output is UTF-8, that's also what you should specify in an HTTP header. Therefore, if your file is UTF-8 encoded and you want to output UTF-8 content, using mb_output_handler() is redundant at best. I tried to specify a different encoding in the HTTP header (one that contains the characters I want to output), and sure enough, the conversion works properly in that scenario.

Here's my experimentation file IN FULL (I didn't want the surrounding HTML to make a difference; it could, and with some combos, it did):
<?php
    header('Content-Type: text/html; charset=windows-1251');
    mb_internal_encoding('UTF-8');
    //mb_http_output('windows-1251');
?>
<?php
//  ob_flush();
//  flush();
    ob_start('mb_output_handler');
    echo 'any mb-string you like', ' Кирилица for example';
?>

(The test file is UTF-8 encoded; as it stands, it converts the UTF-8 to Windows-1251 in both the header and the content.)
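The UTF-8 to Windows-1251 conversion itself can be checked in isolation. Here is a minimal sketch, assuming the mbstring extension is loaded and the file is saved as UTF-8. It calls mb_convert_encoding() directly instead of going through the output handler.

```php
<?php
// Assumes this file is saved as UTF-8.
$utf8 = 'Кирилица';

// Windows-1251 stores each Cyrillic letter in a single byte.
$cp1251 = mb_convert_encoding($utf8, 'Windows-1251', 'UTF-8');

echo strlen($utf8), "\n";   // 16 bytes in UTF-8 (2 per letter)
echo strlen($cp1251), "\n"; // 8 bytes in Windows-1251

// Converting back should restore the original bytes.
var_dump(mb_convert_encoding($cp1251, 'UTF-8', 'Windows-1251') === $utf8);
```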


I tried to specify a different encoding in the HTTP header (one that contains the characters I want to output), and sure enough, the conversion works properly in that scenario.
Might it be that UNIX and Windows are different in this regard? The following code was entered into a Wrangler text file and saved as a UNIX file with UTF-8 character encoding. The file was completely devoid of all other code. It was then opened in two different browsers via the addresses of my local and virtual hosts. In all four cases the result was the same (see below).
<?php
    header('Content-Type: text/html; charset=ISO-8859-1');
    mb_internal_encoding('UTF-8');
    ob_start('mb_output_handler');
?>
<?php
    echo 'J\'étais comme ça, mais je ne le suis plus.';
?>

The Result: J'étais comme ça, mais je ne le suis plus.

Roddy

Edited by iwato

Now by adding the mb_http_output() function with my character code set to 'UTF-8', I am able to "override" the newly created header.

<?php
    header('Content-Type: text/html; charset=ISO-8859-1');
    mb_internal_encoding('UTF-8');
    mb_http_output('UTF-8');
    ob_start('mb_output_handler');
?>
<?php
    echo 'J\'étais comme ça, mais je ne le suis plus.';
?>

The Result: J'étais comme ça, mais je ne le suis plus.

Roddy


Further, if I save and view the following file just as before, but eliminate mb_internal_encoding('UTF-8'); mb_http_output('UTF-8'); ob_start('mb_output_handler'); and replace them with ob_start(); $str = mb_convert_encoding($str, 'ISO-8859-1', 'UTF-8'); I can also obtain a proper rendering.

<?php
    header('Content-Type: text/html; charset=ISO-8859-1');
    ob_start();
?>
<?php
    $str = 'J\'étais comme ça, mais je ne le suis plus.';
    $str = mb_convert_encoding($str, 'ISO-8859-1', 'UTF-8');
    echo $str;
?>

The Result: J'étais comme ça, mais je ne le suis plus.

Roddy
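The same explicit conversion could also be attached to the output buffer, so that everything echoed is converted on flush instead of string by string. This is only a sketch: utf8_to_latin1() is a hypothetical callback name, and it assumes the mbstring extension is loaded and that all buffered output is UTF-8.

```php
<?php
// Hypothetical ob_start() callback: converts the buffered UTF-8 output
// to ISO-8859-1.
function utf8_to_latin1(string $buffer): string
{
    return mb_convert_encoding($buffer, 'ISO-8859-1', 'UTF-8');
}

// Usage in a page (before any output):
//   header('Content-Type: text/html; charset=ISO-8859-1');
//   ob_start('utf8_to_latin1');

// Direct call, to show the effect on the bytes of "é":
echo bin2hex(utf8_to_latin1("\xC3\xA9")), "\n"; // e9
```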


Might it be that UNIX and Windows are different in this regard? The following code was entered into a Wrangler text file and saved as a UNIX file with UTF-8 character encoding. The file was completely devoid of all other code. It was then opened in two different browsers via the addresses of my local and virtual hosts. In all four cases the result was the same (see below).
<?php
    header('Content-Type: text/html; charset=ISO-8859-1');
    mb_internal_encoding('UTF-8');
    ob_start('mb_output_handler');
?>
<?php
    echo 'J\'étais comme ça, mais je ne le suis plus.';
?>

The Result: J'étais comme ça, mais je ne le suis plus.

Roddy

It could also be that the behaviour of mb_http_output() changed between versions, from also informing mb_output_handler() to only setting the header and letting mb_output_handler() figure it out itself.

Try:
<?php
    mb_http_output('ISO-8859-1');
    mb_internal_encoding('UTF-8');
    ob_start('mb_output_handler');
?>
<?php
    echo 'J\'étais comme ça, mais je ne le suis plus.';
?>

Also try my example... does it say "Кирилица" or something else?

And check the headers in all cases, just in case.

If the result is the same either way, then yeah... it's probably a Mac-specific issue.
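For checking the headers from inside PHP rather than with Fiddler or Firebug, something like the following could help. It is a sketch only: charset_from_headers() is a made-up helper that reads the charset out of raw header lines such as those returned by headers_list(), and it does not handle quoted charset values.

```php
<?php
// Hypothetical helper: extract the charset from a list of raw header lines,
// such as the array returned by headers_list().
function charset_from_headers(array $headers): ?string
{
    foreach ($headers as $line) {
        if (preg_match('/^Content-Type:.*;\s*charset=([^\s;]+)/i', $line, $m)) {
            return $m[1];
        }
    }
    return null; // no Content-Type header with a charset was found
}

// In a page: var_dump(charset_from_headers(headers_list()));
// Here, with a sample header line instead:
var_dump(charset_from_headers(['Content-Type: text/html; charset=ISO-8859-1']));
```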
