
I need a tutorial for Loading Webpage Data - screen scraping


confused and dazed


Here is what json_decode() and var_dump() (of that decode) show me. How do I get "bingo"? I tried $this1 = $myjson->main_list->something->I_want_this; but a var_dump() of that came back NULL. I'm guessing I may need to pull I_want_this out of object(stdClass)#9, but I don't quite know how to do that...

 

object(stdClass)#5 (2) {
  ["main_list"]=> object(stdClass)#6 (6) {
    ["something"]=> int(1)
    ["areas"]=> array(5) {
      [0]=> object(stdClass)#7 (6) {
        ["donuts"]=> int(1)
        ["pop"]=> string(5) "words"
        ["here"]=> string(9) "somewhere"
        ["nice"]=> array(4) {
          [0]=> object(stdClass)#8 (5) {
            ["hot"]=> array(8) {
              [0]=> object(stdClass)#9 (25) {
                ["cold"]=> string(1) "X"
                ["here_id"]=> string(5) "00110"
                ["I_want_this"]=> string(5) "bingo"
                ... and it just goes on and on ...
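Judging from the dump above, I_want_this sits several levels down under areas, nice, and hot, and each of those is an array, so every level needs an index. A minimal sketch of the access path (the JSON string here is a stand-in I reconstructed from the dump, trimmed to the keys that matter):

```php
<?php
// Stand-in JSON reconstructed from the var_dump() above; the real feed
// has many more fields and array entries.
$json = '{"main_list":{"something":1,"areas":[{"donuts":1,"pop":"words",'
      . '"nice":[{"hot":[{"cold":"X","here_id":"00110","I_want_this":"bingo"}]}]}]}}';

$myjson = json_decode($json);

// areas, nice, and hot are arrays, so each level needs an [index]:
$value = $myjson->main_list->areas[0]->nice[0]->hot[0]->I_want_this;
echo $value; // bingo
```

Note that "something" holds int(1), not an object, which is why $myjson->main_list->something->I_want_this came back NULL.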



I don't know the full structure of the data. When copying the output of var_dump() to the forum, copy it from the page's source code so that it's formatted and easy to read. Put it into a code block.


I did a json_decode() on the string and saved the result in $data. Then I did a var_dump(), and the following output was in the page's source code:

 

[some] => stdClass Object
    (
        [extra] => 1
        [really] => Array
            (
                [0] => stdClass Object
                    (
                        [stupid] => 1
                        [long] => Array
                            (
                                [0] => stdClass Object
                                    (
                                        [thing] => Array
                                            (
                                                [0] => stdClass Object
                                                    (
                                                        [sent_me] => X
                                                        [to] => 00101
                                                        ...
                                                        [i_want_this] => bingo

 

I used the following call and it worked!!!!!!!

 

$this1 = $data->some->really[0]->long[0]->thing[0]->i_want_this;
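If any of those levels can be missing in some responses, it may be worth guarding the chain rather than accessing it blindly (a sketch; the structure is assumed from the dump above, and this uses the PHP 7 null coalescing operator):

```php
<?php
// $data is the object returned by json_decode(). Guard the array levels
// so a missing element yields null instead of a PHP notice.
$item  = $data->some->really[0]->long[0]->thing[0] ?? null;
$value = ($item !== null && isset($item->i_want_this)) ? $item->i_want_this : null;

if ($value === null) {
    // handle the missing-data case instead of failing mid-scrape
}
```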

 

I have a lot of work ahead of me now that I have figured out how to get to the data I want. It may be a little while before you hear from me again. But until then may all your code be syntax error free!!


So I am well on my way to getting my next project up and running. Thanks for the help through this. I do have another question, though: my program seems to time out, and the code cannot complete all the commands on the rows ($row) from the database. In all I'm grabbing about 50+ different webpages, scraping the data I care about, and then sending it to my database (one page at a time). Is there anything anyone can suggest for this issue?

 

while ($row = mysql_fetch_assoc($result)) {
    $username = 'usr';
    $password = 'pas';
    $ch = curl_init();
    $agent = $_SERVER["HTTP_USER_AGENT"];
    curl_setopt($ch, CURLOPT_USERAGENT, $agent);
    curl_setopt($ch, CURLOPT_URL, $thislink);
    curl_setopt($ch, CURLOPT_POST, 1);
    curl_setopt($ch, CURLOPT_POSTFIELDS, 'user=' . $username . '&pass=' . $password);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookie.txt');
    curl_setopt($ch, CURLOPT_COOKIEFILE, 'cookie.txt');
    $postResult = curl_exec($ch);
    curl_close($ch);
    $dom = new DOMDocument();
    @$dom->loadHTML($postResult);
    foreach ($dom->getElementsByTagName('script') as $links1) {
        $links2 = $dom->saveXML($links1);
        if (preg_match(something)) {
            mysql_query("UPDATE something SET this='bingo' WHERE name='thisstuff'");
        }
    }
}
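One easy win in the loop above: the credentials and most of the curl_setopt() calls never change between rows, so the handle can be created once outside the loop and reused, with only CURLOPT_URL updated per iteration. A sketch under that assumption ($result, $thislink, and the parsing step are placeholders from the original code):

```php
<?php
// One-time setup: the same handle, cookie jar, and credentials are
// reused for every page; only the URL changes per row.
$username = 'usr';
$password = 'pas';
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER["HTTP_USER_AGENT"]);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, 'user=' . $username . '&pass=' . $password);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookie.txt');
curl_setopt($ch, CURLOPT_COOKIEFILE, 'cookie.txt');
curl_setopt($ch, CURLOPT_TIMEOUT, 30); // don't let one slow page hang the whole run

while ($row = mysql_fetch_assoc($result)) {
    curl_setopt($ch, CURLOPT_URL, $thislink);
    $postResult = curl_exec($ch);
    // ... DOMDocument parsing and the UPDATE query, as before ...
}

curl_close($ch);
```

Reusing one handle also keeps the login session cookie warm across requests instead of re-posting credentials cold each time.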

Does it matter where the call is placed? It still appears to be stopping before the limit and not completing execution of the code. Right now I have placed it as shown below:

 

<?php
set_time_limit(200);

$con = mysql_connect("source", "db", "pw");
if (!$con) {
    die('Could not connect: ' . mysql_error());
}
mysql_select_db("db", $con);

while ($row = mysql_fetch_assoc($result)) {
    $ch = curl_init();
    curl_close($ch);
    $dom = new DOMDocument();
    @$dom->loadHTML($postResult);
    foreach ($dom->getElementsByTagName('script') as $links1) {
        $links2 = $dom->saveXML($links1);
        if (preg_match(something)) {
            mysql_query("UPDATE something SET this='bingo' WHERE name='thisstuff'");
        }
    }
}
?>
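Placement-wise, calling set_time_limit() once at the top is fine; it sets the allowed execution time from that point on. It can also be called inside the loop to restart the clock for each page, which helps when individual fetches are slow (a sketch, reusing the loop shape from the code above):

```php
<?php
while ($row = mysql_fetch_assoc($result)) {
    // Restarts PHP's timer each iteration: each page gets up to 60
    // seconds, instead of 200 seconds for the entire run.
    set_time_limit(60);

    // ... cURL fetch, DOM parsing, and UPDATE query, as before ...
}
```

Keep in mind set_time_limit() only governs PHP's own execution limit; it cannot stop the web server, a proxy, or a gateway from closing the connection earlier.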

Something interesting: the page won't load and show the echo statements I have in there. However, over time, as I refresh my database, all 50+ rows end up showing up. It will show 20, then I refresh and it shows 30, then I refresh and it shows 40, etc. But all of this happens after the tab has stopped showing the spinning refresh arrow. What does this mean?


So I did a bunch of test runs, and it appears my Chrome browser only wants to run 24 of the 52 requests with positive feedback through the echo statements (I used a die statement to limit the number of requests). Chrome will only leave the request running for 41 seconds. I looked in the browser settings and there does not seem to be anything for this. Any thoughts?


OK, so I did this, and the Network tab instantly shows the PHP file that the JavaScript function posts to.

 

The Network tab in the first window shows:

  • Name - the_name_of_my_php_file
  • Method - POST
  • Status - 504
  • Type - text/...
  • Initiator - the_name_of_my_js_file
  • Size - 68 B
  • Time - "pending" until it cuts off, then 45.99 s

The Timing tab in the second window shows:

  • Stalled - 1.818 ms
  • Request sent - 0.299 ms
  • Waiting (TTFB) - 45.99 s
  • Content Download - 0.782 ms

And the Preview and Response tabs in the second window show "Failed to load response data".


An HTTP 504 response means a gateway timeout. That's something on the network, like a router or proxy, closing the connection. It could be on your local network or on the network where the server is.


Ok. Having said that, I have no control over the networks I'm using, so it appears I need another way around this. First I thought I would streamline my code through the routines and functions I'm asking it to do, but the fact remains that I need to cURL 50+ pages and send the data to MySQL; I cannot get around that. So what I have done is limit each request to 20 pages and string the PHP files along.

my_php_file1.php handles the first 20, my_php_file2.php handles the second 20, and my_php_file3.php handles the remaining requests. At the end of each PHP file I have a simple form with a submit button that goes to the next one.

Any thoughts on what might work better than submitting three PHP requests in series to complete all 50+ pages?


Maybe PHP isn't the right tool for this job. In the scenario above you could make that a little more streamlined, though. You only need one file: it can take parameters for where to start and how many to do, and it can redirect the browser to the next batch. You could even have the file do one page at a time and keep redirecting. But since you're apparently only using the browser as a way to start this process, it probably makes sense to use something other than PHP.
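A minimal sketch of that single-file approach (the file name, parameter name, and batch size are made up for illustration):

```php
<?php
// scrape_batch.php?start=0 - processes one batch of pages, then
// redirects the browser to the next batch until all are done.
$batchSize = 20;
$total     = 52;  // total pages to scrape
$start     = isset($_GET['start']) ? (int)$_GET['start'] : 0;

for ($i = $start; $i < min($start + $batchSize, $total); $i++) {
    // ... fetch page $i with cURL and update the database, as before ...
}

if ($start + $batchSize < $total) {
    // Hand off to the next batch; each request stays well under the
    // 45-second gateway timeout observed earlier.
    header('Location: scrape_batch.php?start=' . ($start + $batchSize));
    exit;
}

echo "Done: all $total pages processed.";
```

Each HTTP request only does one batch of work, so no single request ever runs long enough for the gateway to kill it.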


You could use any language that your server supports: Python, Perl, etc. You could also run the PHP script from the command line; it doesn't need to go through a browser, so the browser and gateway timeouts no longer apply. You could set up a cron job to execute it periodically.
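For example, a crontab entry that runs the scraper from the PHP CLI every hour (the paths are placeholders):

```
# Run the scraper hourly via the PHP CLI: no browser, no gateway, and
# the CLI's max_execution_time defaults to unlimited.
0 * * * * /usr/bin/php /path/to/scraper.php >> /var/log/scraper.log 2>&1
```

Redirecting output to a log file replaces the echo-statements-in-the-browser style of progress checking.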

