
Windows-1252 Character Set Issue


son


When I use the upload form to put content into the database, I very often find weird characters inserted into the relevant fields when the text was copied and pasted rather than typed directly from the keyboard. I tried to combat this with regular expressions and str_replace, only to keep finding new characters that cause problems. I also saved the relevant file (which holds the test content) with UTF-8 encoding, but the problem remains.

I have already found that the issue is caused by the Windows-1252 character set (correct me if I am wrong), which is used by Microsoft Word, the program I do most of my copying and pasting from. The database is set to UTF-8 and each page (including the upload form) has the following in the head of the document:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

The current str_replace is:

$t = str_replace(array("\xE2\x80\x99","\xE2\x80\x98", "\xe2\x80\x9c","\xe2\x80\x9d", "\xe2\x80\x93","\xe2\x80\x93"), array('\'', '\'', '"', '"', '-', '-'), $_POST['pageText']);

I searched the internet thoroughly, but none of the advice I found worked for me. I am not sure whether the collation of the database needs to be changed (I am also not sure which collation is right for UK English). Is this a general issue with CMS systems and, if so, how can you avoid the inconvenience of having to enter everything from the keyboard? Any help, hints, web links etc. appreciated ;-)

Son


Use this to replace all of those characters in both get and post:

array_walk_recursive($_POST, 'sanitize_ms_chars');
array_walk_recursive($_GET, 'sanitize_ms_chars');

# replace fancy MS Office characters
function sanitize_ms_chars(&$val, $i = '')
{
  $find = array(
    '“',
    '”',
    '‘',
    '’',
    '…',
    '—',
    '–',
    chr(145),
    chr(146),
    chr(147),
    chr(148),
    chr(151),
    chr(0xe2) . chr(0x80) . chr(0x98),
    chr(0xe2) . chr(0x80) . chr(0x99),
    chr(0xe2) . chr(0x80) . chr(0x9c),
    chr(0xe2) . chr(0x80) . chr(0x9d),
    chr(0xe2) . chr(0x80) . chr(0x93),
    chr(0xe2) . chr(0x80) . chr(0x94)
  );
  $replace = array(
    '"',
    '"',
    "'",
    "'",
    '...',
    '-',
    '-',
    "'",
    "'",
    '"',
    '"',
    '-',
    "'",
    "'",
    '"',
    '"',
    '-',
    '-'
  );

  $val = str_replace($find, $replace, $val);
}

If you just want to replace a single string, use it like this:

replace_ms_chars($_POST['pageText']);

Notice the function doesn't have a return value, so this would be wrong:

$_POST['pageText'] = replace_ms_chars($_POST['pageText']);



Would it be OK then to use the value of the 'cleaned' string as:

$t = replace_ms_chars($_POST['pageText']), $_POST['pageText']);

?

Also, do you think there might be some more problematic characters in the future? I was quite surprised when this problem came up. Many of my friends use CMS products (off the shelf) and this does not seem to happen. Do you know how they combat this issue?

Son

The function doesn't return a value, it operates directly on the variable you send to it. If you do this:

sanitize_ms_chars($_POST['pageText']);

Then after that line $_POST['pageText'] will be sanitized. If you do this:

$t = sanitize_ms_chars($_POST['pageText']);

Then $_POST['pageText'] will be sanitized, and $t will be null, because the function doesn't return a value. If you do this:

array_walk_recursive($_POST, 'sanitize_ms_chars');

Then all values in $_POST will be sanitized.

I just noticed I used the wrong function name in the examples, it should be sanitize_ms_chars, not replace_ms_chars.
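To make the by-reference behaviour concrete, here is a minimal sketch. It uses a cut-down stand-in for sanitize_ms_chars with only two replacement rules, and clean_ms_chars is a hypothetical wrapper (not part of the function posted above) for anyone who prefers to get a cleaned copy back as a return value.

```php
<?php
// Cut-down stand-in for sanitize_ms_chars(): changes $val in place, returns nothing.
function sanitize_ms_chars(&$val, $i = '')
{
    $find    = array("\xE2\x80\x98", "\xE2\x80\x99"); // UTF-8 curly single quotes
    $replace = array("'", "'");
    $val = str_replace($find, $replace, $val);
}

// Hypothetical wrapper: works on a local copy and returns it.
function clean_ms_chars($val)
{
    sanitize_ms_chars($val);
    return $val;
}

$text   = "It\xE2\x80\x99s here";
$result = sanitize_ms_chars($text); // $text is modified, $result is null

$copy = clean_ms_chars("\xE2\x80\x98quoted\xE2\x80\x99");

var_dump($text, $result, $copy);
```

With the wrapper, an assignment such as $t = clean_ms_chars($_POST['pageText']); would behave the way the question expected, while $_POST itself stays untouched.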

Also, do you think there might be some more problematic characters in the future?
If Microsoft wants to make Office do more strange things then yeah, there could be more characters in the future. It's just a matter of identifying the characters and adding new rules to the find and replace.
Many of my friends use CMS products (off the shelf) and this does not seem to happen. Do you know how they combat this issue?
The CMS does something like above.

I have used your code in the relevant file and it works well for certain fields, but not for all. The relevant code snippet is:

if (isset($_POST['submitted'])) {

# replace fancy MS Office characters
function sanitize_ms_chars(&$val, $i = '')
{
  $find = array(
    '“',
    '”',
    '‘',
    '’',
    '…',
    '—',
    '–',
    chr(145),
    chr(146),
    chr(147),
    chr(148),
    chr(151),
    chr(0xe2) . chr(0x80) . chr(0x98),
    chr(0xe2) . chr(0x80) . chr(0x99),
    chr(0xe2) . chr(0x80) . chr(0x9c),
    chr(0xe2) . chr(0x80) . chr(0x9d),
    chr(0xe2) . chr(0x80) . chr(0x93),
    chr(0xe2) . chr(0x80) . chr(0x94)
  );
  $replace = array(
    '"',
    '"',
    "'",
    "'",
    '...',
    '-',
    '-',
    "'",
    "'",
    '"',
    '"',
    '-',
    "'",
    "'",
    '"',
    '"',
    '-',
    '-'
  );

  $val = str_replace($find, $replace, $val);
}

array_walk_recursive($_POST, 'sanitize_ms_chars');
array_walk_recursive($_GET, 'sanitize_ms_chars');

  if (isset($_POST['parent_id'])) {
    $parent_id = (int) $_POST['parent_id'];
  } else {
    $parent_id = 0;
  }

  if (isset($_POST['parent_id2'])) {
    $parent_id2 = (int) $_POST['parent_id2'];
  } else {
    $parent_id2 = 0;
  }

  //initialise error array
  $errors = array();

  if (!isset($_POST['list']) OR empty($_POST['list'])) {
    $list = FALSE;
  } else {
    if (eregi('^[0-9]{1,3}$', stripslashes(trim($_POST['list'])))) {
      $list = (int) $_POST['list'];
    } else {
      $list = FALSE;
      $errors['list'] = 'Please enter 3-digit number';
    }
  }

  if (!isset($_POST['file_name']) OR empty($_POST['file_name'])) {
    $fn = FALSE;
    $errors['file_name'] = '\'File name\' is a required field';
  } else {
    if (!eregi('^[[:alnum:]\-]{2,20}$', stripslashes(trim($_POST['file_name'])))) {
      $fn = FALSE;
      $errors['file_name'] = '\'File name\' not in correct format or too long';
    } else {
      $fn = escape_data($_POST['file_name']);
    }
  }

  if (!isset($_POST['title']) OR empty($_POST['title'])) {
    $tt = FALSE;
  } else {
    $tt = escape_data($_POST['title']);
  }

  if (!isset($_POST['description']) OR empty($_POST['description'])) {
    $dt = FALSE;
  } else {
    $dt = escape_data($_POST['description']);
  }

  if (!isset($_POST['keywords']) OR empty($_POST['keywords'])) {
    $kt = FALSE;
  } else {
    $kt = escape_data($_POST['keywords']);
  }

  $allowed = array ('jpg', 'gif');

  if (!isset($_FILES['img']['name']) OR empty($_FILES['img']['name']) OR $_FILES['img']['error'] == 4) {
    $errors['img'] = '\'Top Banner\' is a required field';
    $img = FALSE;
  } else {
    $ext = explode('.', $_FILES['img']['name']);
    $ext = $ext[count($ext)-1];
    if (!in_array(strtolower($ext), $allowed)) {
      $errors['img'] = '\'Top Banner\' accepts format: jpg and gif';
      $img = FALSE;
    } else {
      $img = "{$fn}.{$ext}";
    }
  }

  if (!isset($_FILES['img2']['name']) OR empty($_FILES['img2']['name']) OR $_FILES['img2']['error'] == 4) {
    $errors['img2'] = '\'Right photo\' is a required field';
    $img2 = FALSE;
  } else {
    $ext = explode('.', $_FILES['img2']['name']);
    $ext = $ext[count($ext)-1];
    if (!in_array(strtolower($ext), $allowed)) {
      $errors['img2'] = '\'Right photo\' accepts format: jpg and gif';
      $img2 = FALSE;
    } else {
      $img2 = "{$fn}.{$ext}";
    }
  }

  if (!isset($_POST['heading']) OR empty($_POST['heading'])) {
    $h = FALSE;
    $errors['heading'] = '\'Heading\' is a required field';
  } else {
    $h = escape_data($_POST['heading']);
  }

  if (!isset($_POST['pageText']) OR empty($_POST['pageText'])) {
    $t = FALSE;
    $errors['pageText'] = '\'Web Page Copy\' is a required field';
  } else {
    $t = escape_data($_POST['pageText']);
  }

  if (!isset($_POST['suitabilityHead']) OR empty($_POST['suitabilityHead'])) {
    $suitabilityHead = FALSE;
  } else {
    $suitabilityHead = escape_data($_POST['suitabilityHead']);
  }

  if (!isset($_POST['suitability']) OR empty($_POST['suitability'])) {
    $suitability = FALSE;
  } else {
    $suitability = escape_data($_POST['suitability']);
  }

  if (!isset($_POST['featuresHead']) OR empty($_POST['featuresHead'])) {
    $featuresHead = FALSE;
  } else {
    $featuresHead = escape_data($_POST['featuresHead']);
  }

  if (!isset($_POST['features']) OR empty($_POST['features'])) {
    $features = FALSE;
  } else {
    $features = escape_data($_POST['features']);
  }

  if (!isset($_POST['optionsHead']) OR empty($_POST['optionsHead'])) {
    $optionsHead = FALSE;
  } else {
    $optionsHead = escape_data($_POST['optionsHead']);
  }

  if (!isset($_POST['options']) OR empty($_POST['options'])) {
    $options = FALSE;
  } else {
    $options = escape_data($_POST['options']);
  }

  if ($fn && $h && $t && $img && $img2) {

The fields it does not work on are all those after heading. The form fields are displayed in the same order. Why could it be that it only works for certain fields?

Son


It should be applied to all fields. You can add an output line to verify that it's being run on each value in the array.
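For reference, an output line of the kind suggested here could look like the sketch below. array_walk_recursive passes the array key as the second argument to the callback, so the key can be echoed from inside the function. The $post array is only a stand-in for $_POST and the replacement list is shortened; both are assumptions for illustration.

```php
<?php
function sanitize_ms_chars(&$val, $i = '')
{
    // Debug line: show which key is currently being processed.
    echo 'sanitizing "' . $i . '"' . "\n";

    // Shortened rule list: UTF-8 curly double quotes become plain ones.
    $val = str_replace(array("\xE2\x80\x9C", "\xE2\x80\x9D"), '"', $val);
}

// Stand-in for $_POST.
$post = array(
    'heading'     => "\xE2\x80\x9CHello\xE2\x80\x9D",
    'optionsHead' => 'plain text',
);

array_walk_recursive($post, 'sanitize_ms_chars');
```

The debug line would be removed again once every expected key shows up in the output.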
sanitizing "MAX_FILE_SIZE"
sanitizing "parent_id"
sanitizing "parent_id2"
sanitizing "file_name"
sanitizing "list"
sanitizing "title"
sanitizing "heading"
sanitizing "description"
sanitizing "keywords"
sanitizing "pageText"
sanitizing "suitabilityHead"
sanitizing "suitability"
sanitizing "featuresHead"
sanitizing "features"
sanitizing "optionsHead"
sanitizing "options"
sanitizing "submitted"
sanitizing "submit"

I also did a var_dump($_POST['optionsHead']) on the optionsHead field, which is one of the fields where the page displays the weird character (copy/paste of text: upto £30m), and it looked fine. As I said, all fields after heading are displayed incorrectly on the web page. I thoroughly checked what is different about those fields and found that they are the ones where I use htmlentities for safe display of HTML characters. Applying the htmlentities function to the above-mentioned dump also resulted in the weird character being displayed.

However, having said this: I also checked the actual entries in the database and they all show the incorrect 'upto Â£30m' (I used the same text for all fields to test). It is only on the displayed web page that the fields without the htmlentities function show up fine...

So, although the sanitize function seems to work fine: why are the db entries still incorrect? The database itself is set up as:

MySQL charset: UTF-8 Unicode (utf8)
MySQL connection collation: utf8_unicode_ci

with the relevant table set to:

Collation: latin1_swedish_ci

I would have thought that the database takes content exactly as the var_dump on optionsHead, for example, showed. Is this not the case?

I really appreciate your help. This is way above me...

Son
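The symptom described here, 'Â£' showing up where '£' was pasted, is what one would expect when the two UTF-8 bytes of the pound sign are read as two separate single-byte Latin-1 characters somewhere between the form and the table. The sketch below only reproduces that effect in isolation. It assumes the mbstring extension is available and does not claim to identify where in this particular setup the misreading happens.

```php
<?php
$pound_utf8 = "\xC2\xA3"; // '£' encoded as UTF-8 (two bytes)

// Treat the two bytes as ISO-8859-1 and convert them to UTF-8 again:
// each byte now becomes a character of its own.
$misread = mb_convert_encoding($pound_utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($pound_utf8), "\n"; // c2a3
echo bin2hex($misread), "\n";    // c382c2a3, which displays as 'Â£'
```

If this is what is going on, one place to look is the character set of the database connection (the mysqli extension, for example, has mysqli_set_charset), since the table is latin1 while the pages are UTF-8.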


I can't find the documentation now, but I believe that PHP defaults to using the ISO-8859-1 charset, not UTF-8. You can convert to UTF-8 using utf8_encode, or you can convert to any encoding using iconv. Did you insert those things in the database before using the sanitize function, or is that all since you've gotten it set up?


Since the website was finalised I used the mentioned upload form, now with the sanitize function. The problem was there before the sanitize function and is still the same... Does PHP also defaults although I have on each page (including upload form):

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

Do I need to change the database to ISO-8859-1 then?

Son

I also tried:

if (!isset($_POST['optionsHead']) OR empty($_POST['optionsHead'])) {
  $optionsHead = FALSE;
} else {
  $optionsHead = utf8_encode($_POST['optionsHead']);
  $optionsHead = escape_data($optionsHead);
}

which did not help...
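One possible reason this did not help: utf8_encode assumes that its input is ISO-8859-1, so if the form data already arrives as UTF-8, the call encodes it a second time. The sketch below uses mb_convert_encoding with the same source and target encodings to show that effect, plus a hypothetical helper to_utf8_once that converts only when the input is not already valid UTF-8. The helper name is made up for illustration and the mbstring extension is assumed.

```php
<?php
// Hypothetical helper: convert from ISO-8859-1 only if $s is not valid UTF-8 yet.
function to_utf8_once($s)
{
    if (mb_check_encoding($s, 'UTF-8')) {
        return $s; // already UTF-8, leave it alone
    }
    return mb_convert_encoding($s, 'UTF-8', 'ISO-8859-1');
}

$already_utf8 = "upto \xC2\xA3" . "30m"; // 'upto £30m' as UTF-8
$latin1       = "upto \xA3" . "30m";     // the same text as ISO-8859-1

// Blind ISO-8859-1 to UTF-8 conversion of text that is UTF-8 already
// double-encodes the pound sign.
$double = mb_convert_encoding($already_utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($double), "\n";
echo bin2hex(to_utf8_once($already_utf8)), "\n";
echo bin2hex(to_utf8_once($latin1)), "\n";
```

Note that utf8_encode itself is deprecated as of PHP 8.2, which is another reason to prefer an explicit conversion function.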


The problem was there before the sanitize function and is still the same...
The sanitize function isn't going to change anything already in the database, only what gets submitted.
Does PHP also defaults although I have on each page (including upload form):<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
The character set that PHP uses and the content-type for the HTML page aren't related.
Do I need to change the database to ISO-8859-1 then?
You could just convert text to match the database.
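A minimal sketch of what such a conversion could look like, assuming the table stays latin1 and the form data arrives as UTF-8. MySQL's latin1 is, as far as I know, close to Windows-1252, so the example converts to that. The helper name to_db_charset is made up for illustration, and the mbstring extension is assumed.

```php
<?php
// Hypothetical helper: convert UTF-8 form input to Windows-1252 before storing it.
function to_db_charset($utf8_text)
{
    return mb_convert_encoding($utf8_text, 'Windows-1252', 'UTF-8');
}

$input  = "upto \xC2\xA3" . "30m"; // 'upto £30m' as UTF-8
$stored = to_db_charset($input);

echo bin2hex($input), "\n";
echo bin2hex($stored), "\n"; // the pound sign is now the single byte a3
```

The alternative design is to keep everything in UTF-8 and change the table and connection instead, which avoids losing characters that Windows-1252 cannot represent.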

You could just convert text to match the database.
I converted with:
if (!isset($_POST['optionsHead']) OR empty($_POST['optionsHead'])) {
  $optionsHead = FALSE;
} else {
  $optionsHead = utf8_encode($_POST['optionsHead']);
  $optionsHead = escape_data($optionsHead);
}

but the issue remained.

Also, by before and after the sanitize function I meant: entries added to the database with the upload form when it did not have the sanitize function, and then entries added with the same upload form, just this time with the sanitize function...

Son


OK, so what's the problem now? Are there other characters that aren't being replaced, or an issue with the slashes escaping quotes?
'upto Â£30m' gets inserted into the database instead of the copied and pasted 'upto £30m budgets'. Is this a new character?

Son

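I'm not sure if that's a new character. You might want to paste your text into something that has a hex editor (PSPad has one) and check the hex code of the pound character. If it's not a regular pound character then you could replace that also. asciitable.com lists a pound character at 156, but that is in its extended table (which appears to be an old DOS code page, not ASCII proper, since ASCII stops at 127), so a different value in your text would not be surprising.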


I downloaded PSPad and could verify that the £ symbol (in the char column) is hex 'A3', which is correct in the text I copy/paste. The hex for the space shows '20', which appears in the char column as an empty box. What does this mean?

Son

Check the chart at asciitable.com: a space character is (dec)32 = (hex)20 in the ASCII character set. Most hex editors give the hex value of characters. Hex A3 in decimal is 163. In the extended chart at asciitable.com character 163 is a u with an accent, but that chart is not the character set Word uses: in ISO-8859-1 and Windows-1252, byte A3 is the pound sign, so a single A3 is a regular pound sign in a single-byte encoding. Check if A3 is the only byte listed for the character; other character sets may use 2 bytes or more to represent a character. Some of the characters the function is searching for are 3-byte characters, e.g.:

chr(0xe2) . chr(0x80) . chr(0x98)

In UTF-8 a pound sign is a 2-byte character:

chr(0xc2) . chr(0xa3)
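A quick way to check byte values like these without a hex editor is bin2hex. The sketch below builds the pound sign from its Unicode code point (U+00A3, decimal 163) and prints its bytes in UTF-8 and in ISO-8859-1. It assumes PHP 7.2 or later with the mbstring extension, since it relies on mb_chr.

```php
<?php
// U+00A3 POUND SIGN, built from its code point as a UTF-8 string.
$pound_utf8 = mb_chr(0xA3, 'UTF-8');

// The same character in a single-byte encoding.
$pound_latin1 = mb_convert_encoding($pound_utf8, 'ISO-8859-1', 'UTF-8');

echo strlen($pound_utf8), ' byte(s): ', bin2hex($pound_utf8), "\n";
echo strlen($pound_latin1), ' byte(s): ', bin2hex($pound_latin1), "\n";
```

So a lone A3 in the hex editor points to text in a single-byte encoding, while C2 A3 would point to UTF-8.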


Now, I am completely lost. This is way above my head... Could you recommend an easy-to-read introduction to character sets? The ones I found are quite difficult to understand, but it seems to me now that I am not getting anywhere if I do not get the basics right...

Son

I can't think of any character set references off the top of my head, sorry. Wikipedia or something like that might be able to help. Also search for character encodings.
Thanks for your advice. I had a good read, but there are still some things I do not understand. As one of them concerns your sanitize function, I have one more question for you.

Before I started using your sanitize function I replaced problematic values for each input field with:

str_replace(array("\xE2\x80\x99","\xE2\x80\x98", "\xe2\x80\x9c","\xe2\x80\x9d", "\xe2\x80\x93","\xe2\x80\x93"), array('\'', '\'', '"', '"', '-', '-')

For obvious reasons I am relieved that this can be done in one place for all POST data, and I only use your function now. Still, what I do not get:

The replacement of '\xe2\x80\x93' worked well with my basic str_replace. Using your function, the problematic dash character does not get replaced (chr(0xe2) . chr(0x80) . chr(0x93)). Why could that be?

Also, it is rather confusing that sometimes the values are in four segments as opposed to three. Why is that?

Son

If it gets replaced using "\xe2\x80\x93", but not (chr(0xe2) . chr(0x80) . chr(0x93)), you could just use the other string in the code. I'm not sure why one would be different.
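One possible explanation, offered as a guess rather than a diagnosis: the two spellings are the same three bytes, so the difference is more likely to come from the order of the rules. str_replace works through the $find array from first to last, each rule operating on the result of the previous one, and chr(147) is the single byte 0x93, which is also the last byte of the UTF-8 en dash. If the single-byte rule runs first, the three-byte rule no longer finds anything to match. The sketch below shows this with a two-rule list.

```php
<?php
$en_dash = "\xE2\x80\x93";

// Both spellings produce exactly the same bytes.
$same = ($en_dash === chr(0xe2) . chr(0x80) . chr(0x93));

$text = "2001" . $en_dash . "2009";

// Single-byte rule first: 0x93 inside the en dash is replaced by a quote,
// so the three-byte rule that follows cannot match any more.
$broken = str_replace(array(chr(147), $en_dash), array('"', '-'), $text);

// Multi-byte rule first: the en dash is replaced as intended.
$fixed = str_replace(array($en_dash, chr(147)), array('-', '"'), $text);

var_dump($same);
echo bin2hex($broken), "\n";
echo $fixed, "\n";
```

Under that assumption, placing the three-byte rules before the chr(145) to chr(151) rules, or dropping the single-byte rules when the input is known to be UTF-8, would avoid the clash.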

Also, it is rather confusing that sometimes the values are in four segments as opposed to three. Why is that?
If you're talking about this type of thing:

(chr(0xe2) . chr(0x80) . chr(0x93))

UTF-8 is a variable-length encoding: a character is not a fixed number of bytes but takes between one and four. The first byte of each sequence tells the system how long the byte sequence is.
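As a small illustration of how the length is encoded in UTF-8: the high bits of the first byte determine how many bytes the sequence has. The function name utf8_sequence_length below is made up for this sketch.

```php
<?php
// Returns the length in bytes of the UTF-8 sequence that starts with $lead,
// or 0 if $lead is a continuation byte or not a valid first byte.
function utf8_sequence_length($lead)
{
    $b = ord($lead); // value of the first byte
    if ($b < 0x80) {
        return 1;                   // 0xxxxxxx: plain ASCII
    }
    if (($b & 0xE0) === 0xC0) {
        return 2;                   // 110xxxxx
    }
    if (($b & 0xF0) === 0xE0) {
        return 3;                   // 1110xxxx
    }
    if (($b & 0xF8) === 0xF0) {
        return 4;                   // 11110xxx
    }
    return 0;                       // 10xxxxxx or invalid
}

echo utf8_sequence_length("\xE2"), "\n"; // first byte of the curly quotes and dashes
echo utf8_sequence_length("\xC2"), "\n"; // first byte of the pound sign
echo utf8_sequence_length("A"), "\n";
```

This is why the curly quotes and dashes in the function appear as three chr() calls, while a plain apostrophe is a single byte.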

Thanks for your info :-)

Son

Archived

This topic is now archived and is closed to further replies.
