wilsonf1 Posted March 19, 2009

So frustrating... I'm scraping an RSS feed and some characters are coming through as question marks. These are the culprits: ” ‘ ’

I have identified the following ascii codes to match the above and have tried to replace them in the source of my scrape by 2 methods:

oScrape.Source = Replace(oScrape.Source, chr(145), """")
oScrape.Source = Replace(oScrape.Source, chr(146), """")
oScrape.Source = Replace(oScrape.Source, chr(147), """")
oScrape.Source = Replace(oScrape.Source, chr(148), """")

oScrape.Source = Replace(oScrape.Source, "”", """")
oScrape.Source = Replace(oScrape.Source, "‘", "'")
oScrape.Source = Replace(oScrape.Source, "’", "'")

but neither method picks them up and turns them into normal quotes. I've tested my Replace statement by replacing a standard word in the RSS feed and it worked, so how can I pick the buggers up if the above fails?

ProblemHelpPlease Posted March 19, 2009

Try using entity names instead. Have a look at http://www.w3schools.com/tags/ref_entities.asp

wilsonf1 Posted March 19, 2009

Thanks, but no joy when trying this, for example:

oScrape.Source = Replace(oScrape.Source, "lsquo;", """")

and

oScrape.Source = Replace(oScrape.Source, "‘", """") ' no hash

wilsonf1 Posted March 19, 2009

When I do:

asc("”")

it comes out as 226. When I search for that character on an ascii list, it's in the 140s... I can't work any of this out.

jlhaslip Posted March 19, 2009

Check to see what the character set is for the page and the RSS feed. Chances are they should both be utf-8 to work correctly.

justsomeguy Posted March 19, 2009

I use this to get rid of that stuff:

function sanitize_ms_chars(&$val, $i=0)
{
  $find = array(
    '“', '”', '‘', '’', '…', '—', '–',
    chr(145), chr(146), chr(147), chr(148), chr(151),
    chr(0xe2) . chr(0x80) . chr(0x98),
    chr(0xe2) . chr(0x80) . chr(0x99),
    chr(0xe2) . chr(0x80) . chr(0x9c),
    chr(0xe2) . chr(0x80) . chr(0x9d),
    chr(0xe2) . chr(0x80) . chr(0x93),
    chr(0xe2) . chr(0x80) . chr(0x94)
  );
  $replace = array(
    '"', '"', "'", "'", '...', '-', '-',
    "'", "'", '"', '"', '-',
    "'", "'", '"', '"', '-', '-'
  );
  $val = str_replace($find, $replace, $val);
}

That function doesn't return a value; it operates on the value directly (it takes $val by reference). It does that so I can use it to sanitize an entire array at once:

array_walk_recursive($_POST, 'sanitize_ms_chars');

... and I just noticed that this is the ASP forum. You can probably take the values in that function and convert it to VB instead of PHP. Most of the character sequences it's looking for are UTF-8 sequences, where the character contains more than one byte. When you use asc to get the value of the quote, it's telling you the wrong thing because asc assumes it's an ascii character, which is only a single byte long. These are multi-byte characters. Hopefully you can get the info from the function to see which sequences to look for and which characters to replace them with.

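To make the single-byte versus multi-byte point concrete, here is a minimal sketch in Python (used purely for illustration, not ASP or PHP). It shows that the right curly double quote is one byte in Windows-1252 (the "140s" value from the character table) but three bytes in UTF-8, the first of which is 226. That is consistent with the 226 reported by asc earlier in the thread, assuming the character reached the script as UTF-8 bytes.

```python
# Illustration only: the same curly quote in two encodings.
right_double_quote = "\u201d"  # the ” character

cp1252_bytes = right_double_quote.encode("cp1252")
utf8_bytes = right_double_quote.encode("utf-8")

# One byte in Windows-1252: this is the value found in the "140s".
print(list(cp1252_bytes))

# Three bytes in UTF-8: a byte-wise Asc() would only see the first one.
print(list(utf8_bytes))

# UTF-8 byte sequences for the four curly quotes; these correspond to the
# chr(0xe2) . chr(0x80) . chr(...) entries in the PHP function above.
utf8_sequences = {ch: ch.encode("utf-8") for ch in "\u2018\u2019\u201c\u201d"}
for ch, seq in utf8_sequences.items():
    print(repr(ch), [hex(b) for b in seq])
```

So a replace that searches for chr(148) can only match if the scraped text is really Windows-1252; if it is UTF-8, the three-byte sequence has to be matched instead.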
wilsonf1 Posted March 20, 2009

"Check to see what the character set is for the page and the RSS feed. Chances are they should both be utf-8 to work correctly."

I looked, and for this feed it is iso-8859-1, but I'm not setting a charset in my ASP. I just have a page of ASP code that rips the content, then sticks it into my Access database. I don't have a <head> section setting anything?

wilsonf1 Posted March 21, 2009

Little bump - any more thoughts on the charset?

eggie Posted March 22, 2009

I've used this function in the past:

function.asp <- I use it as an include

' replace user enters by breaks
Function showBreaks(str)
  showBreaks = replace(str, chr(10), " <br>")
End Function

' replace single quote by two single quotes
Function TwoSingleQ(str)
  TwoSingleQ = replace(str, "'", "''")
End Function

q = Chr(34)

Then in my execute page I have something like this:

pmemo = TwoSingleQ(pmemo)

I don't know if it will help, but I thought I'd throw it out there...

wilsonf1 Posted March 23, 2009

I appreciate it, eggie, but I don't think it will solve my problem. It's something to do with the charset and these strange single and double quotes in the scrape string. If you look closely at the characters I specify in my first post, they are curly quotes and not like this: ' and "

I'm sure it is a charset issue but, as I said, I'm not setting one in my document <head> as I don't have one. I'm just running some ASP and stashing the result in a database?

wilsonf1 Posted April 22, 2009

Still struggling with this. I've tried the following code to identify the left curly single quote, and only statement 4 is triggered when the single quote is defo in the RSS feed:

If InStr(oScrape.Source, "‘") > 0 Then
  response.Write "1 is there"
  response.end
Elseif InStr(oScrape.Source, chr(145)) > 0 Then
  response.Write "2 is there"
  response.end
Elseif InStr(oScrape.Source, "") > 0 Then
  response.Write "3 is there"
  response.end
Elseif InStr(oScrape.Source, " ") > 0 Then
  response.Write "4 is there"
  response.end
End If

The top of the RSS feed has:

<?xml version='1.0' encoding='iso-8859-1' ?>

Someone mentioned utf-8, but what can I do about it if the feed is not mine? I don't have a doctype because I am just running pure ASP which scrapes the feed then bungs it in a database. Please help!

justsomeguy Posted April 22, 2009

If the RSS is set to ISO-8859-1, but includes UTF-8 characters, you might be able to contact the publisher and get them to either convert the characters on their end, or change the content type.

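If the publisher can't be reached, one possible workaround is to ignore the declared encoding and decode the raw bytes of the feed yourself before doing any replacing. The following is a minimal Python sketch of that idea, not ASP code and not anything confirmed to work for this particular feed. The names `normalize_feed_text` and `SMART_PUNCTUATION` are hypothetical, and the fallback to Windows-1252 is an assumption about how mislabelled feeds usually behave.

```python
# Hypothetical helper: decode feed bytes, then flatten "smart" punctuation.
SMART_PUNCTUATION = {
    "\u2018": "'", "\u2019": "'",   # left/right single curly quotes
    "\u201c": '"', "\u201d": '"',   # left/right double curly quotes
    "\u2013": "-", "\u2014": "-",   # en dash, em dash
    "\u2026": "...",                # ellipsis
}


def normalize_feed_text(raw: bytes) -> str:
    """Decode as UTF-8 if possible, else Windows-1252, then replace quotes."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: bytes that are not valid UTF-8 are Windows-1252,
        # which is what "iso-8859-1" feeds with curly quotes often contain.
        text = raw.decode("cp1252", errors="replace")
    for smart, plain in SMART_PUNCTUATION.items():
        text = text.replace(smart, plain)
    return text


# The same quoted word, once as UTF-8 bytes and once as Windows-1252 bytes.
utf8_sample = b"\xe2\x80\x98quoted\xe2\x80\x99"
cp1252_sample = b"\x93quoted\x94"
print(normalize_feed_text(utf8_sample))
print(normalize_feed_text(cp1252_sample))
```

The point of the design is that the replacement happens on decoded characters rather than on raw bytes, so the same table works whichever of the two encodings the feed actually used.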