
Grr! How To Convert Special Characters During A Scrape


wilsonf1


So frustrating... I'm scraping an RSS feed and some characters are coming through as question marks. The culprits are: ” ‘ ’

I have identified the following character codes to match the above and have tried to replace them in the source of my scrape by 2 methods:

' Method 1: match by Windows-1252 character code (145/146 are single quotes, 147/148 are double quotes)
oScrape.Source = Replace(oScrape.Source, chr(145), "'")
oScrape.Source = Replace(oScrape.Source, chr(146), "'")
oScrape.Source = Replace(oScrape.Source, chr(147), """")
oScrape.Source = Replace(oScrape.Source, chr(148), """")

' Method 2: match the literal characters
oScrape.Source = Replace(oScrape.Source, "”", """")
oScrape.Source = Replace(oScrape.Source, "‘", "'")
oScrape.Source = Replace(oScrape.Source, "’", "'")

but neither method picks them up and turns them into normal quotes. I've tested my Replace statement by replacing a standard word in the RSS feed and it worked. So how can I pick the buggers up if the above fails?


I use this to get rid of that stuff:

function sanitize_ms_chars(&$val, $i=0)
{
  $find = array(
    // literal characters
    '“', '”', '‘', '’', '…', '—', '–',
    // Windows-1252 single-byte codes
    chr(145), chr(146), chr(147), chr(148), chr(151),
    // UTF-8 three-byte sequences
    chr(0xe2) . chr(0x80) . chr(0x98),
    chr(0xe2) . chr(0x80) . chr(0x99),
    chr(0xe2) . chr(0x80) . chr(0x9c),
    chr(0xe2) . chr(0x80) . chr(0x9d),
    chr(0xe2) . chr(0x80) . chr(0x93),
    chr(0xe2) . chr(0x80) . chr(0x94)
  );
  $replace = array(
    '"', '"', "'", "'", '...', '-', '-',
    "'", "'", '"', '"', '-',
    "'", "'", '"', '"', '-', '-'
  );

  $val = str_replace($find, $replace, $val);
}

That function doesn't return a value, it just operates on the value directly. It does that so I can use it to sanitize an entire array at once:

array_walk_recursive($_POST, 'sanitize_ms_chars');

... and I just noticed that this is the ASP forum. You can probably take the values in that function and convert it to VB instead of PHP.

Most of the character sequences it's looking for are UTF-8 sequences, where the character contains more than one byte. When you use asc to get the value of the quote it's telling you the wrong thing, because asc assumes it's an ASCII character, which is only a single byte long. These are multi-byte characters. Hopefully you can get the info from the function to see which sequences to look for and which characters to replace them with.
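To make the single-byte versus multi-byte point concrete, here is a small sketch in Python (not ASP or PHP, just used for illustration). It prints the bytes each of these characters takes in Windows-1252 and in UTF-8. The names `chars` and `byte_forms` are made up for this example.

```python
# Illustration only: the same "smart" punctuation stored in two encodings.
chars = {
    "left single quote": "\u2018",
    "right single quote": "\u2019",
    "left double quote": "\u201c",
    "right double quote": "\u201d",
    "en dash": "\u2013",
    "em dash": "\u2014",
}


def byte_forms(ch):
    """Return (Windows-1252 bytes, UTF-8 bytes) for one character."""
    return ch.encode("cp1252"), ch.encode("utf-8")


for name, ch in chars.items():
    cp1252_bytes, utf8_bytes = byte_forms(ch)
    # One byte in Windows-1252, three bytes in UTF-8.
    print(f"{name}: cp1252={cp1252_bytes.hex()} utf-8={utf8_bytes.hex()}")
```

The one-byte values correspond to the chr(145) style entries in the function above, and the three-byte values correspond to the chr(0xe2) . chr(0x80) . chr(...) entries, so which form you need to search for depends on how the scraped text was encoded.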


Check to see what the character set is for the page and the RSS feed. Chances are they should both be utf-8 to work correctly.

I looked, and for this feed it is iso-8859-1, but I'm not setting a charset in my ASP. I just have a page of ASP code that rips the content, then sticks it into my Access database. I don't have a <head> section setting anything?

I've used this function in the past (function.asp, which I use as an include):

' replace user enters by breaks
Function showBreaks(str)
  showBreaks = replace(str, chr(10), " <br>")
End Function

' replace single quote by two single quotes
Function TwoSingleQ(str)
  TwoSingleQ = replace(str, "'", "''")
End Function

q = Chr(34)

Then in my execute page I have something like this:

pmemo	= TwoSingleQ(pmemo)

I don't know if it will help, but I thought I'd throw it out there...


I appreciate it eggie, but I don't think it will solve my problem. It's something to do with the charset and these strange single and double quotes in the scrape string. If you look closely at the characters I specify in my first post, they are curly quotes and not like this: ' and "

I'm sure it is a charset issue but, as I said, I'm not setting one in my document <head> as I don't have one. I'm just running some ASP and stashing the result in a database?


  • 5 weeks later...

Still struggling with this. I've tried the following code to identify the left curly single quote, and only statement 4 is triggered, even though the single quote is definitely in the RSS feed:

If InStr(oScrape.Source, "‘") > 0 Then
    response.Write "1 is there"
    response.end
ElseIf InStr(oScrape.Source, chr(145)) > 0 Then
    response.Write "2 is there"
    response.end
ElseIf InStr(oScrape.Source, "") > 0 Then
    response.Write "3 is there"
    response.end
ElseIf InStr(oScrape.Source, " ") > 0 Then
    response.Write "4 is there"
    response.end
End If

The top of the RSS feed has: <?xml version='1.0' encoding='iso-8859-1' ?>

Someone mentioned utf-8, but what can I do about it if the feed is not mine? I don't have a doctype because I am just running pure ASP, which scrapes the feed and then bungs it in a database. Please help!
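One possible angle, offered as an assumption rather than a diagnosis of this particular feed: content labelled iso-8859-1 frequently carries Windows-1252 bytes in practice, and that is where the curly quotes live as single bytes 145 to 148. If that is the case here, decoding the raw bytes as Windows-1252 before doing any replacing should make the characters matchable. The sketch below is Python rather than ASP, and `REPLACEMENTS`, `normalize_feed` and the sample bytes are all hypothetical names and data made up for illustration.

```python
# Hypothetical mapping from "smart" punctuation to plain ASCII.
REPLACEMENTS = {
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\u2013": "-",    # en dash
    "\u2014": "-",    # em dash
    "\u2026": "...",  # ellipsis
}


def normalize_feed(raw: bytes) -> str:
    """Decode raw feed bytes as Windows-1252, then flatten smart punctuation."""
    # errors="replace" covers the few byte values cp1252 leaves undefined.
    text = raw.decode("cp1252", errors="replace")
    for fancy, plain in REPLACEMENTS.items():
        text = text.replace(fancy, plain)
    return text


# Made-up sample: Windows-1252 bytes for curly quotes, an en dash and an ellipsis.
sample = b"\x93It\x92s a \x91test\x92\x94 \x96 done\x85"
print(normalize_feed(sample))
```

The same idea should carry over to ASP if the scraping component gives access to the raw response bytes, or lets you specify the charset used to convert them to a string, before Replace is called. Whether oScrape offers that is something to check in its documentation.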


