
multi-charset whitespace normalization.


ala888


The space character is the same in all encodings. If you just want to remove spaces that will work. If you want to include line breaks and tabs as whitespace then you can select the appropriate characters.

 

Functions like trim() consider the following characters whitespace, and since they're all single-byte characters they should work for pretty much any encoding:

\s or \x20 or " " Space

\t Tab

\n New line

\r Carriage return

\0 Null byte

\x0B Vertical tab
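As a quick check of that default set, here is a minimal sketch; the sample string and variable names are made up for illustration. Note that trim() only touches the two ends of the string, so the double space in the middle survives.

```php
<?php
// trim() with no second argument strips space, \t, \n, \r, \0 and \x0B
// from both ends of the string; interior whitespace is left alone.
$raw = " \t\n hello  world \r\0\x0B";   // illustrative input, not from the thread
$trimmed = trim($raw);

var_dump($trimmed === "hello  world");  // prints bool(true)
```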

 

The following regular expression will normalize all those:

preg_replace('/[\x20\t\n\r\0\x0B]+/',' ',$str);

For these characters the character set doesn't matter because almost all character sets share the same characters from 0 to 127.
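To try this on a real string, here is a rough sketch that wraps the same character class in a function. The name normalize_whitespace is just something I made up for the example, and it assumes the input is in an ASCII-compatible encoding (UTF-8, ISO-8859-1, Windows-1252 and the like). For UTF-16 or UTF-32 you would have to convert first, since those use more than one byte per character.

```php
<?php
// Hypothetical helper (not a built-in): collapse every run of the six
// single-byte whitespace characters into a single space.
function normalize_whitespace(string $str): string
{
    // The pattern is in single quotes, so PHP passes the backslashes
    // through untouched and PCRE interprets \x20, \t, \n, \r, \0, \x0B.
    $result = preg_replace('/[\x20\t\n\r\0\x0B]+/', ' ', $str);

    // preg_replace() returns null on error; fall back to the input then.
    return $result ?? $str;
}

echo normalize_whitespace("one \t\n two\r\n\x0Bthree"), PHP_EOL;  // prints: one two three
```

Leading and trailing runs are collapsed to one space rather than removed, so wrap the call in trim() if you want those gone as well.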


As an individual who is new to the agonizingly painful world of strings and their various encodings, is there a good online tutorial available that goes through everything from collations to hex, and how everything goes together? I don't know what's going on with strings in general.
