
preg_replace all &, <, >, " and all non-ascii characters with hex NCR's


Greywacke


Hi all,

You know I've got the problem with UTF-8 characters in XML. The file encoding is all UTF-8, and the charsets of the form, the HTML pages and the XML pages have all been set to UTF-8, but I still get the problem with Unicode characters. I am unable to change the PHP default encoding, so replacing the characters is the closest I can get.

I am trying to run a preg_replace on the string to replace &, <, >, " and any non-ASCII Unicode characters with the corresponding hex numeric character reference (NCR), so that any string can be written as a valid XML attribute value.

I am not that familiar with regular expressions in PHP (PCRE), and on the net I can only seem to find references to replacing NCRs with the associated UTF-8 character. I need to do the reverse of this.

This is what I've got so far, but I know it's wrong. How can I fix this?

$value = preg_replace(array("&","<",">","\"","/[^\x{0-255}]/ue"), array("&amp;","&lt;","&gt;","&quot;","/ord($0);/"), $value);

Please, will anyone point me in the right direction? This is of the utmost urgency, and I am not sure what to do. I've started researching PCRE (Perl Compatible Regular Expressions) for use with preg_replace, but so far I have not found a solution for replacing Unicode characters with their hex NCRs...
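One way to do what the question asks is to use preg_replace_callback() rather than the /e modifier, which was deprecated in PHP 5.5 and removed in PHP 7.0. The following is only a minimal sketch. It assumes PHP 7.2 or later with the mbstring extension and input that is valid UTF-8. The function name xml_attr_escape_hex_ncr is hypothetical and does not come from this thread.

```php
<?php
// Hypothetical helper (the name is an assumption, not from the thread).
// Replaces &, <, >, " and every non-ASCII character with a hex NCR.
function xml_attr_escape_hex_ncr(string $value): string
{
    $result = preg_replace_callback(
        // The /u modifier makes PCRE treat pattern and subject as UTF-8,
        // so [^\x00-\x7F] matches one whole code point above U+007F.
        '/[&<>"]|[^\x00-\x7F]/u',
        function (array $m): string {
            // mb_ord() returns the Unicode code point of the matched character.
            return sprintf('&#x%X;', mb_ord($m[0], 'UTF-8'));
        },
        $value
    );

    // preg_replace_callback() returns null on error, e.g. invalid UTF-8 input.
    if ($result === null) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    return $result;
}

// Usage with a made-up sample string:
echo xml_attr_escape_hex_ncr("Zo\u{EB} & <\"caf\u{E9}\">"), "\n";
// prints: Zo&#xEB; &#x26; &#x3C;&#x22;caf&#xE9;&#x22;&#x3E;
```

Because the output is plain ASCII, it should survive being written out even where the surrounding encoding settings cannot be changed.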


Okay, I have temporarily repaired this line with the following:

$value = str_replace(array("&","<",">","\"","ë","è","é","“","”"), array("&amp;","&lt;","&gt;","&quot;","&#xEB;","&#xE8;","&#xE9;","&#x201C;","&#x201D;"), $value);

I can't seem to locate any decent examples, but I still need a PCRE preg_replace that replaces the first four invalid ASCII characters (&, <, >, ") as well as any non-ASCII Unicode character with its hexadecimal NCR, while I wait for UTF-8 support on the server.

detailed instructions on setting apache to use utf-8
the first reference i found to setting up utf-8 as default character set


I'm always fascinated when looking at your exercises... Have you considered using htmlspecialchars()? Yes, it replaces the characters with named entities rather than numeric ones, but any HTML, XHTML or XML aware environment can read those entities, so that's certainly not a reason to use numeric entities.

Besides, I think I've already told you before how to properly store, extract and display non-ASCII characters in a database: set your DB collation to utf8_general_ci, use mysql_set_charset() every time you connect to the DB to set the connection to use UTF-8, save your PHP file as UTF-8, set

header('Content-Type: text/html;charset=UTF-8');

and the equivalent meta element.

If you already have some data, prepare for some experimentation with migrating it to proper storage. In particular, I think you can extract your old data on one connection, and then upload it on another connection that uses UTF-8. Take the time to do that just once, and you'll save yourself not only this headache, but the next several ones that you still haven't created topics for.
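As a rough illustration of the htmlspecialchars() suggestion, the sketch below escapes a made-up sample string for use in an XML attribute value. It assumes PHP 5.4 or later for the ENT_XML1 flag (the "\u{...}" escapes in the sample need PHP 7.0), and the variable names are hypothetical. Note that non-ASCII characters pass through unchanged, so this only helps once the output really is delivered as UTF-8.

```php
<?php
// Hypothetical sample value; assumed to be valid UTF-8.
$value = "Zo\u{EB} & \"friends\" <team>";

// ENT_COMPAT escapes double quotes; ENT_XML1 selects XML 1.0 entity handling.
// Only &, <, > and " are replaced; the non-ASCII character is left as-is.
$escaped = htmlspecialchars($value, ENT_COMPAT | ENT_XML1, 'UTF-8');

echo $escaped, "\n";
// prints: Zoë &amp; &quot;friends&quot; &lt;team&gt;
```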


Well yes, as a matter of fact I tried using htmlspecialchars at first. The only named entities that valid XML 1.0 supports are &amp;, &quot;, &gt; and &lt;. The rest of the entities need to be hex-based NCRs (e.g. &#xFF;); dec-based NCRs (e.g. &#256;) are not supported. :)

The header code is implemented like that, except there is a space before charset after the semi-colon...


XML 1.0 in Firefox cannot read all named entities, nor decimal NCRs. I can only seem to pass hexadecimal NCRs while I wait for the server to be set up for UTF-8.
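For what it's worth, the XML 1.0 specification defines both decimal (&#N;) and hexadecimal (&#xN;) character references, so a conforming parser should accept either form. A quick way to check this on a given setup is sketched below. It assumes PHP 7.0 or later with the SimpleXML extension enabled, and the one-element document is made up for the test. It says nothing about how a particular browser handled the original pages.

```php
<?php
// Hypothetical one-element document: the same code point U+0100 written
// once as a decimal and once as a hexadecimal character reference.
$xml = simplexml_load_string('<r a="&#256;&#x100;"/>');

// Both references should decode to the same character.
$decoded = (string) $xml['a'];
var_dump($decoded === "\u{100}\u{100}");
```

If the decimal form fails in practice, the cause is more likely elsewhere, for example in how the page is served or in a missing semicolon.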



This topic is now archived and is closed to further replies.
