ala888

multi-charset whitespace normalization.


Title says it all. How do I normalize all white-space from all charsets and stuff?

preg_replace('/\s+/',' ',$str);

This does not work if the string is not in ASCII.


The space character is the same byte (0x20) in ASCII and in every ASCII-compatible encoding, such as ISO-8859-1, Windows-1252 and UTF-8. If you only need to handle spaces, that will work. If you also want to treat line breaks and tabs as whitespace, you can add the appropriate characters to the pattern.

 

Functions like trim() consider the following characters whitespace, and since they're all single-byte characters they should work for pretty much any encoding:

\x20 or " " Space (\s in a regex matches it too)

\t Tab

\n New line

\r Carriage return

\0 Null byte

\x0B Vertical tab

 

The following regular expression will normalize all those:

$str = preg_replace('/[\x20\t\n\r\0\x0B]+/',' ',$str);
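As a rough sketch, the expression above could be wrapped in a small helper that also trims the ends. The function name normalize_ascii_whitespace is made up for this illustration and is not a PHP built-in.

```php
<?php
// Hypothetical helper name, used only for illustration.
function normalize_ascii_whitespace(string $str): string
{
    // Same six characters that trim() strips by default:
    // space, tab, line feed, carriage return, NUL, vertical tab.
    // Fall back to the input if preg_replace() reports an error (null).
    $collapsed = preg_replace('/[\x20\t\n\r\0\x0B]+/', ' ', $str) ?? $str;

    // After collapsing, at most one space is left at each end.
    return trim($collapsed);
}

echo normalize_ascii_whitespace("  foo\t\tbar\r\n\r\nbaz \x0B "); // foo bar baz
```

Because the pattern only names single-byte characters below 0x80, it works on the raw bytes and needs no knowledge of the string's encoding, as long as that encoding is ASCII-compatible.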

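For these characters the character set doesn't matter, because almost all character sets in common use (ASCII, ISO-8859-x, Windows-125x, UTF-8) encode the values 0 to 127 identically as single bytes, and in UTF-8 those byte values never occur inside a multi-byte character. The main exceptions are UTF-16 and UTF-32, where every character takes at least two bytes.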
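The list above only covers ASCII whitespace. If the text may also contain Unicode spaces (for example the no-break space U+00A0 or the ideographic space U+3000), one possible approach, assuming the string is valid UTF-8, is to add the \p{Z} separator property and the /u modifier. The function name below is invented for this sketch.

```php
<?php
// Hypothetical helper; assumes $str is valid UTF-8.
function normalize_unicode_whitespace(string $str): string
{
    // \s          whitespace such as space, tab, LF, CR
    // \p{Z}       Unicode separators (U+00A0, U+2003, U+3000, ...)
    // \x00 \x0B   NUL and vertical tab, to match the trim() list
    // /u          treat pattern and subject as UTF-8
    $result = preg_replace('/[\s\p{Z}\x00\x0B]+/u', ' ', $str);

    // With /u, preg_replace() returns null when the subject is not
    // valid UTF-8; this sketch then returns the input unchanged.
    return $result === null ? $str : trim($result);
}

echo normalize_unicode_whitespace("foo\u{00A0}\u{00A0}bar\u{3000}baz"); // foo bar baz
```

For input in another encoding, converting to UTF-8 first (for example with mb_convert_encoding()) would be needed before this pattern can be applied.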


As someone who is new to the agonizingly painful world of strings and their various encodings: is there a good online tutorial that goes through everything from collations to hex, and how it all fits together? I don't know what's going on with strings in general.
