
I need a tutorial for Loading Webpage Data - screen scraping


confused and dazed


Hello internet,

 

There is a webpage I would like to start building, but I need to be able to pull data from other webpages. The information is public, so it's nothing sketchy, but I don't know where to start. I began with Google searches, but it appears to be more involved than I originally thought.

If you have any recommendations for tutorials or forum discussions, I would be grateful.

 

Thanks.


What kind of information are you pulling from the other pages? I'm not entirely sure how to help you out here, because depending on the data and how it's used, you may need very different code. Are you including user content? Scripting code? Video/music embedding? The more detail the better, so I can point you in the right direction.

 

-Chris


No, putting .json wouldn't do anything. If they had an API they would say so in the documentation of their website.

 

Have you learned to use cURL yet?


Some APIs return JSON, others return XML, others might have their own format. But with PHP, you still need cURL to get the data from the API. I suggest you familiarize yourself with the cURL library. If after reading the PHP manual you still have trouble, I could help with a couple of examples.
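
Once the response body has been fetched, decoding it depends on the format. Here is a minimal sketch, assuming the body is already in a string; the function name decode_api_body() and the sample payloads are made up for illustration and do not come from any particular API.

```php
<?php
// Hypothetical helper: turn a raw API response body into PHP data.
function decode_api_body(string $body, string $format)
{
    switch ($format) {
        case 'json':
            // Decode to an associative array; throws JsonException on bad input.
            return json_decode($body, true, 512, JSON_THROW_ON_ERROR);
        case 'xml':
            $xml = simplexml_load_string($body);
            if ($xml === false) {
                throw new RuntimeException('Could not parse XML');
            }
            return $xml;
        default:
            throw new InvalidArgumentException("Unsupported format: $format");
    }
}

// Made-up sample payloads standing in for real API responses.
$json = decode_api_body('{"name": "example", "count": 3}', 'json');
$xml  = decode_api_body('<item><name>example</name><count>3</count></item>', 'xml');

echo $json['name'], "\n";
echo (string) $xml->name, "\n";
```

An API with its own format would need its own parser; this only covers the two common cases mentioned above.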


This is as far as I have been able to get. I know the code below works because it grabs the site and displays it on the page. When I try to get the info I want out of the source code, I end up with nothing. I am trying to grab the href links, but my arrays come up empty with no data.
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "some site");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$output = curl_exec($ch);
if (curl_exec($ch) === false) {
    echo 'Curl error: ' . curl_error($ch);
} else {
    echo 'Operation completed without any errors';
}
echo $output;
curl_close($ch);

You should check the value of $output in the if() statement, rather than calling curl_exec() a second time.

 

What value does $output have?
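
To illustrate the suggestion, here is a rough sketch that runs the request once and tests the stored result. The wrapper name fetch_page() is only for illustration, and throwing an exception on failure is one possible choice, not the only one.

```php
<?php
// Hypothetical wrapper: fetch a URL once and return the body as a string.
function fetch_page(string $url): string
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

    // Run the request a single time and keep the result.
    $output = curl_exec($ch);

    // Test the stored value instead of calling curl_exec() again.
    if ($output === false) {
        throw new RuntimeException('Curl error: ' . curl_error($ch));
    }

    return $output;
}
```

Calling it would look like `echo fetch_page("some site");` with a real URL in place of the placeholder. Calling curl_exec() a second time sends a second request to the server, so the body that was checked is not necessarily the body that was stored.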


So - I have been successful in pulling all the very specific links that I want from a webpage using the code below - it works well. However, I am now struggling to pull the text between the <a></a> tags.

Example: <a href=http...bla bla bla>THIS TEXT</a>. How do I get the text "THIS TEXT"? I won't be able to search for it directly, because the actual text will be different each time. Any thoughts?

 

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "some site");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// Run the request once and test the stored result.
$out = curl_exec($ch);
if ($out === false) {
    echo 'Curl error: ' . curl_error($ch);
} else {
    echo 'Operation completed without any errors';
    echo "<br>";
}
curl_close($ch);

$dom = new DOMDocument();
// @ suppresses the warnings loadHTML() raises on malformed markup.
@$dom->loadHTML($out);
foreach ($dom->getElementsByTagName('a') as $links) {
    $try = $links->getAttribute('href');
    // Keep only the hrefs that start with the wanted prefix.
    if (preg_match('#^some very specific links#i', $try) === 1) {
        print_r($try);
        echo "<br>";
    }
}
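
For the text between the tags, one option is the textContent property that every DOM node has; it returns the text of the element and of anything nested inside it. A minimal sketch, assuming the HTML is already in a string (the function name extract_links() and the sample markup are made up for illustration):

```php
<?php
// Hypothetical helper: return the href and visible text of every <a> element.
function extract_links(string $html): array
{
    // Collect libxml warnings quietly instead of using the @ operator.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($dom->getElementsByTagName('a') as $anchor) {
        $links[] = [
            'href' => $anchor->getAttribute('href'),
            // textContent includes the text of nested tags such as <b>.
            'text' => trim($anchor->textContent),
        ];
    }
    return $links;
}

// Made-up sample markup standing in for the fetched page.
$sample = '<p><a href="http://example.com/one">First link</a> and '
        . '<a href="http://example.com/two">Second <b>link</b></a></p>';

$links = extract_links($sample);
foreach ($links as $link) {
    echo $link['href'], ' => ', $link['text'], "\n";
}
```

The same href filter with preg_match() can be applied inside the loop before adding an entry.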

That worked. Thanks!

Each time I progress I figure out I need more... So here is my next quest: I need to get the text in the cell with class="sort", and I need to be able to group the 22 with THIS TEXT1 and the 0 with THIS TEXT2. Basically, I am sending the data to MySQL, and THIS TEXT1 and 22 need to be in the same row in the database.

 

<td align="left"><a href='some site' target='_blank'>THIS TEXT1</a></td><td align="center" class="sort">22</td>

<td align="left"><a href='some site' target='_blank'>THIS TEXT2</a></td><td align="center" class="sort">0</td>


One way to get it would be to look through the <td> elements inside the current <tr> element and if the class attribute is "sort" (using getAttribute()) then get the value of the child node.
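
A rough sketch of that idea, assuming each pair of cells sits inside its own <tr> (the sample markup below wraps the posted cells in a table for that reason). The function name extract_rows() and the array keys 'name' and 'count' are made up for illustration.

```php
<?php
// Hypothetical helper: pair each link text with the class="sort" value in its row.
function extract_rows(string $html): array
{
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $rows = [];
    foreach ($dom->getElementsByTagName('tr') as $tr) {
        $name = null;
        $count = null;
        foreach ($tr->getElementsByTagName('td') as $td) {
            if ($td->getAttribute('class') === 'sort') {
                // The cell marked class="sort" holds the number.
                $count = (int) trim($td->textContent);
            } else {
                // Any other cell may hold the link whose text we want.
                $anchor = $td->getElementsByTagName('a')->item(0);
                if ($anchor !== null) {
                    $name = trim($anchor->textContent);
                }
            }
        }
        // Only keep rows where both halves were found.
        if ($name !== null && $count !== null) {
            $rows[] = ['name' => $name, 'count' => $count];
        }
    }
    return $rows;
}

// The cells from the post, wrapped in <table>/<tr> (an assumption about the page).
$sample = "<table>"
    . "<tr><td align=\"left\"><a href='some site' target='_blank'>THIS TEXT1</a></td>"
    . "<td align=\"center\" class=\"sort\">22</td></tr>"
    . "<tr><td align=\"left\"><a href='some site' target='_blank'>THIS TEXT2</a></td>"
    . "<td align=\"center\" class=\"sort\">0</td></tr>"
    . "</table>";

$rows = extract_rows($sample);
foreach ($rows as $row) {
    echo $row['name'], ' => ', $row['count'], "\n";
}
```

Each entry of the result holds one name and its number together, so it maps onto one database row.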


Again - thanks for setting me off in the right direction. I was able to resolve that issue as well. So... now I have a new problem. I sent the href links to my database, planning to pull them one at a time into a curl session (with a while loop), when I realized you need a login and password to get to the pages. I have both, but I don't know how to code them in so the curl session can access the data from those pages. Where do I go from here?


I'm using this but it is not working... It displays the login page but it's not logging in. What do I do?

 

 

$username = 'usr1';
$password = 'pasw1';

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $link);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HTTPAUTH, CURLAUTH_ANY);
curl_setopt($ch, CURLOPT_USERPWD, "$username:$password");
// Run the request once and test the stored result.
$out = curl_exec($ch);
if ($out === false) {
    echo 'Curl error: ' . curl_error($ch);
}
echo $out;
curl_close($ch);

You're trying to use HTTP basic authentication there. I doubt the server is using that. If you're trying to use cURL to log a user in then you need to do the same thing that the user would do with their browser, i.e. submit a post request that contains the data from the login form with the correct names, and get the cookies that the server sends back.
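
A minimal sketch of that approach, under several assumptions: the function name login_and_fetch() is made up, the URLs and form field names have to be read from the site's actual login form, and some sites also require hidden fields (for example a token) that this sketch does not handle.

```php
<?php
// Hypothetical helper: post a login form, then fetch a page with the same session.
function login_and_fetch(string $loginUrl, array $formFields, string $protectedUrl): string
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        // An empty string turns on cookie handling without reading a file,
        // so cookies set by the login response are reused on this handle.
        CURLOPT_COOKIEFILE     => '',
    ]);

    // Step 1: submit the login form as a POST request.
    curl_setopt($ch, CURLOPT_URL, $loginUrl);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($formFields));
    if (curl_exec($ch) === false) {
        throw new RuntimeException('Login request failed: ' . curl_error($ch));
    }

    // Step 2: request the protected page with a GET on the same handle.
    curl_setopt($ch, CURLOPT_URL, $protectedUrl);
    curl_setopt($ch, CURLOPT_HTTPGET, true);
    $page = curl_exec($ch);
    if ($page === false) {
        throw new RuntimeException('Page request failed: ' . curl_error($ch));
    }

    return $page;
}
```

A call might look like `login_and_fetch($loginUrl, ['username' => 'usr1', 'password' => 'pasw1'], $link)`, where the keys 'username' and 'password' are placeholders for whatever name attributes the login form's inputs use. If the returned page is still the login page, comparing the form's fields in the browser with what is being posted is a reasonable next step.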
