convert an ANSI text file to utf-8 format


Gilbert

Hi all,

I upload a text file to extract info to put into my database on GoDaddy, and when I run my PHP code on it, it tells me that it can't read the file because it is in ansi-xxxx format. In my PHP code I'm using $var = fgets() to read each line and then put the $vars into the correct table of the database.

So I clicked the button at the top of the code editor and converted the text file to UTF-8, but the conversion leaves the file with 2 odd characters at the beginning of the file and puts a blank line between each line. When I delete the 2 characters and the blank lines and run my code, everything works as it should and updates my tables.

My question is: can I do a conversion on the ANSI text file using PHP without any manual manipulation? I've read about utf8_encode(), but it says it encodes an ISO-8859-1 file, and mine is ANSI. Or is ISO-8859-1 an umbrella for a lot of different codes? Can I convert the whole file at once, or do I set up a loop to read a line, convert it and write it to a new file? Am I interpreting this correctly?

I'd appreciate a code snippet so I can see how to set it up, or a reference to more reading so I can learn. Thank you very much!


the conversion leaves the file with 2 odd characters at the beginning

That's the UTF-8 BOM (byte order mark).  You can use this to strip a BOM from the beginning of a string:

// check for a variety of byte order marks at the beginning of the string and remove them if present
function strip_bom($str)
{
    // Longer BOMs are listed first: the UTF-32 (LE) BOM begins with the same
    // two bytes as the UTF-16 (LE) BOM, so it has to be tested before it.
    $boms = [
        pack('CCCC', 0x0, 0x0, 0xfe, 0xff),   # UTF-32 (BE)
        pack('CCCC', 0xff, 0xfe, 0x0, 0x0),   # UTF-32 (LE)
        pack('CCC', 0xef, 0xbb, 0xbf),        # UTF-8
        pack('CC', 0xfe, 0xff),               # UTF-16 (BE)
        pack('CC', 0xff, 0xfe),               # UTF-16 (LE)
    ];

    foreach ($boms as $b) {
        // strict comparison so the byte strings are never compared numerically
        if (substr($str, 0, strlen($b)) === $b) {
            return substr($str, strlen($b));
        }
    }
    return $str;
}

You can also check if the line is empty and skip it if so.

while (($line = fgets($handle, 4096)) !== false) {
  $line = trim(strip_bom($line));
  if ($line === '') {
    continue;
  }

  // process $line
}

You can also convert the character encoding in PHP itself:

http://php.net/manual/en/function.mb-convert-encoding.php

http://php.net/manual/en/function.iconv.php
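
As a rough sketch of the loop approach you asked about, something like this reads the file line by line, converts each line with iconv() and writes it to a new file. It assumes your "ANSI" file is really Windows-1252, which is what that label usually means on a Western-language Windows machine; if yours is something else, change the $from argument. The function name convert_file_to_utf8 and the file names in the usage line are made up for the example.

```php
<?php
// Sketch: convert a text file to UTF-8 line by line.
// ASSUMPTION: the source file is Windows-1252 ("ANSI"); pass another $from if not.
// convert_file_to_utf8() is a hypothetical helper, not a built-in PHP function.
function convert_file_to_utf8(string $src, string $dst, string $from = 'Windows-1252'): int
{
    $in = fopen($src, 'rb');
    if ($in === false) {
        throw new RuntimeException("Cannot open $src for reading");
    }
    $out = fopen($dst, 'wb');
    if ($out === false) {
        fclose($in);
        throw new RuntimeException("Cannot open $dst for writing");
    }

    $written = 0;
    while (($line = fgets($in)) !== false) {
        // drop the line ending (CRLF or LF) and skip lines that are empty
        $line = rtrim($line, "\r\n");
        if ($line === '') {
            continue;
        }

        // iconv(from, to, string) returns false if a byte is invalid in $from
        $converted = iconv($from, 'UTF-8', $line);
        if ($converted === false) {
            fclose($in);
            fclose($out);
            throw new RuntimeException("Could not convert a line from $from");
        }

        fwrite($out, $converted . "\n");
        $written++;
    }

    fclose($in);
    fclose($out);
    return $written; // number of non-empty lines written
}
```

You would call it as convert_file_to_utf8('upload.txt', 'upload-utf8.txt') and then run your existing fgets() loop on the new file. For a small file you could also read it all at once with file_get_contents() and convert the whole string in one call; the loop just keeps memory use low and lets you skip blank lines along the way.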

If your database is set up to store UTF-8 data, make sure the data you insert is actually UTF-8 encoded.

Obviously, if the file is already in the correct encoding, then you don't need to do anything special.
