
I need a tutorial for Loading Webpage Data - screen scraping


confused and dazed

Submit a POST request that contains the data from the login form, using the correct field names, and get the cookies that the server sends back.
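A minimal sketch of that advice using PHP's cURL extension, assuming the extension is installed. The helper names, the URLs, and the field names passed in are all placeholders for illustration; the real field names have to be read from the login form's HTML.

```php
<?php
// Hypothetical helper: builds the cURL options for a login POST.
// $fields must use the same names as the <input> elements of the real login form.
function buildLoginOptions(string $loginUrl, array $fields, string $cookieFile): array
{
    return [
        CURLOPT_URL            => $loginUrl,
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => http_build_query($fields),
        CURLOPT_RETURNTRANSFER => true,        // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,        // follow the redirect many logins issue
        CURLOPT_COOKIEJAR      => $cookieFile, // cookies are saved here when the handle is freed
        CURLOPT_COOKIEFILE     => $cookieFile, // turns on cookie handling for this handle
    ];
}

// Hypothetical helper: logs in, then requests a page with the same session cookies.
function fetchAfterLogin(string $loginUrl, array $fields, string $pageUrl, string $cookieFile)
{
    $ch = curl_init();
    curl_setopt_array($ch, buildLoginOptions($loginUrl, $fields, $cookieFile));
    curl_exec($ch);                            // the server answers with its cookies

    // Reuse the handle so the cookies received above go out with the next request.
    curl_setopt($ch, CURLOPT_URL, $pageUrl);
    curl_setopt($ch, CURLOPT_HTTPGET, true);   // switch back from POST to GET
    return curl_exec($ch);                     // HTML string on success, false on failure
}
```

The string returned by fetchAfterLogin() is what would then be handed to DOMDocument for parsing.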

I guess I have some investigating ahead of me. I understand what that is on the surface but have no idea how to accomplish it. I will be back at some point either thanking or asking more questions. Until then....... May all your code be syntax error free!


You can't use DOMDocument to extract data from JSON. You need to use the json_decode() function for that.

 

First, extract the content from the script element and store it as a string, then pass that string to json_decode to get an object or associative array. After that, you can obtain the value you need from the "I_want_this" property of the object.


I'm making assumptions as to how the HTML is structured, but it should be something like this:

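foreach ($dom->getElementsByTagName('script') as $script) {
    // Skip empty elements such as <script src="...">, which have no text node
    if ($script->firstChild === null) {
        continue;
    }
    // First child is a text node with the JSON string in it
    $JSON = $script->firstChild->nodeValue;
    $data = json_decode($JSON, true);
    echo $data['I_want_this'];
}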

If the $JSON variable does not actually contain an isolated JSON string, then you'll have to use ordinary string manipulation methods to extract it from there. The main idea is that once you've pulled the string out of the document, you don't need to use DOMDocument methods anymore, just treat it like any other string.


As I look back through the 3000-plus lines of source code between the script tags, a few things stick out:

1. There are MANY "function()"s before this particular bootstrap begins

2. There are several "_.extend(" methods before this particular bootstrap begins - but I'm guessing this does not mean much

3. The bootstrap is not inside any function - it stands alone

4. The bootstrap is the last set of data between the script tags


I can do that, but I would like to try to get it on my own. Are there any particular rules you can think of for me to try? Here is a very simplified version of what is between the script tags:

 

<script>

// comments

// comments

(function(){ A WHOLE LOT OF STUFF}).call(this);

// comments

// comments

(function(){ ANOTHER HUGE FUNCTION WITH A WHOLE LOT OF STUFF}).call(this);

// comments

// comments

Backbone.STUFF = (function() STUFF);

// comments

// comments

Wreqr.Commands = (function() STUFF);

// comments

// comments

var x1 = 'something1',
x2 = 'something2',
x3 = 'something3',
x4 = 'something4',
x5 = 'something5',
x6 = 'something6';
bootstrapthisstuff = {"status":"X","some_id":"112","getting_close":"ER","I_want_this":"bingo"}
</script>
This is basically what is between the scripts
Edited by confused and dazed

You'll need to write a regular expression to find the line you're looking for and extract it. The variable name on the line could be part of the pattern to look for, and it sounds like you need everything on that one line (assuming it's all on one line).
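A rough sketch of such a pattern, assuming the variable is named bootstrapthisstuff as in the simplified sample and that the whole object sits on one line. $scriptText is a shortened stand-in for the real script contents.

```php
<?php
// Stand-in for the text of the <script> element (shortened from the sample in this thread).
$scriptText = "var x1 = 'something1',\n"
    . "x6 = 'something6';\n"
    . 'bootstrapthisstuff = {"status":"X","some_id":"112","getting_close":"ER","I_want_this":"bingo"}' . "\n";

// The m flag lets ^ and $ match at line boundaries.
// (\{.*\}) captures from the first "{" to the last "}" on that line,
// because "." does not match a line break by default.
$pattern = '/^\s*bootstrapthisstuff\s*=\s*(\{.*\})\s*;?\s*$/m';

if (preg_match($pattern, $scriptText, $matches) === 1) {
    $json = $matches[1];   // only the {...} part, without the variable name
    echo $json, PHP_EOL;
} else {
    echo 'Bootstrap line not found', PHP_EOL;
}
```

The optional `;?` is there in case the real page ends the assignment with a semicolon, which the simplified sample does not show.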


The bootstrap is all on one line - one VERY LONG LINE.

I'm not looking for any of the var values - I'm looking to extract "bingo" from the bootstrap

bootstrapthisstuff = {"status":"X","some_id":"112","getting_close":"ER","I_want_this":"bingo"}

 

Also, I put in a counter { $i=1......$i++ } echo $i; and it lists the correct number of script tags in the source data, which is 46. So I know it's getting into the script tags.

Edited by confused and dazed

I'll mark out the steps:

1. Pull the text out of the <script> node with DOMDocument

2. Use a regular expression that will match everything between "bootstrapthisstuff = " and the following line break

3. Decode the JSON

4. Take the value you need out of the resulting object.
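The four steps above could be sketched roughly as follows. The $html string is a made-up stand-in for the downloaded page, and the pattern assumes the variable name from the simplified sample and a one-line object.

```php
<?php
// Stand-in page; in practice $html would come from the HTTP response.
$html = <<<'HTML'
<html><head><script src="lib.js"></script></head><body>
<script>
var x1 = 'something1';
bootstrapthisstuff = {"status":"X","some_id":"112","getting_close":"ER","I_want_this":"bingo"}
</script>
</body></html>
HTML;

libxml_use_internal_errors(true);   // real-world HTML is rarely valid; keep parser warnings quiet
$dom = new DOMDocument();
$dom->loadHTML($html);

$value = null;
foreach ($dom->getElementsByTagName('script') as $script) {
    // Step 1: the element's text content is a plain string
    $text = $script->textContent;

    // Step 2: capture everything after "bootstrapthisstuff = " up to the line break
    if (preg_match('/bootstrapthisstuff\s*=\s*(.+)/', $text, $m) !== 1) {
        continue;                   // not the script we are looking for
    }

    // Step 3: decode, dropping surrounding whitespace and a trailing semicolon if present
    $data = json_decode(rtrim(trim($m[1]), ';'), true);

    // Step 4: read the value
    if (is_array($data) && isset($data['I_want_this'])) {
        $value = $data['I_want_this'];
        break;
    }
}
echo $value, PHP_EOL;
```

Passing true as the second argument of json_decode() returns associative arrays; without it the result is an object and the value would be read as $data->I_want_this.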


No matter what I try, it appears step 2 is not treating $links1 as a string, so I don't even get a 0 when this code executes.

 

1. foreach($dom->getElementsByTagName('script') as $links1) {
2. echo preg_match('/no_matter_what_I_try/', $links1);
3. $data = json_decode($links1, true);
4. echo $data['I_want_this'];
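One likely cause, offered as a guess since the full script isn't shown: inside the loop, $links1 is a DOMElement object rather than a string. Both preg_match() and json_decode() expect strings, so the text has to be read from the element first, and json_decode() should receive the captured match instead of the whole element. A small sketch with a made-up one-line page:

```php
<?php
$dom = new DOMDocument();
$dom->loadHTML('<html><body><script>bootstrapthisstuff = {"I_want_this":"bingo"}</script></body></html>');

foreach ($dom->getElementsByTagName('script') as $links1) {
    var_dump($links1 instanceof DOMElement);  // the loop variable is an object
    $text = $links1->nodeValue;               // the script's text as a string
    var_dump(is_string($text));

    // preg_match() returns 1 on a match and 0 when there is none
    $found = preg_match('/bootstrapthisstuff\s*=\s*(.+)/', $text, $m);
    echo $found, PHP_EOL;
    if ($found === 1) {
        $data = json_decode($m[1], true);     // decode the captured part, not the element
        echo $data['I_want_this'], PHP_EOL;
    }
}
```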

OK so I was able to sort out the regular expression call and was able to get the bootstrap table into a string. I was able to echo the string out so I know I captured it. The string looks like this {the table} and is completely enclosed in curly brackets.

 

Now the next problem: I am not able to get "bingo" from "I_want_this". The bootstrap table is very complex; it follows this sort of structure:

{"main list":{"level1":1,"stuff1":[{"a1":1,"a2":"2","a3":"3","somelevel":[{"getstricky":[{"keepsgoing":"X", and so forth.......}

There also appears to be more than one "I_want_this" within the table, each with a different value: bingo is one, touchdown is another, and so on.

 

How would I capture all of the individual "I_want_this" values from the table?

Edited by confused and dazed

Did you manage to decode the JSON?

 

If you have done that, you can use var_dump() on the resulting object to see its structure. Show the structure here and I can give you an idea of what procedure needs to be followed to pull data from it.
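As a rough illustration of both ideas, the sketch below dumps a decoded structure and then collects every value stored under one key at any depth. The JSON string is invented to loosely resemble the description in this thread, and the approach assumes each "I_want_this" holds a plain value rather than a nested array.

```php
<?php
// Stand-in JSON loosely shaped like the nested structure described in the thread.
$json = '{"main list":{"level1":1,"stuff1":[{"a1":1,"I_want_this":"bingo",'
      . '"somelevel":[{"keepsgoing":"X","I_want_this":"touchdown"}]}]}}';

$data = json_decode($json, true);
if ($data === null) {
    // json_last_error_msg() explains why decoding failed
    exit('Decode failed: ' . json_last_error_msg() . PHP_EOL);
}

var_dump($data);   // prints the nested structure for inspection

// Visit every non-array value at any depth and keep those stored under the wanted key.
$values = [];
array_walk_recursive($data, function ($value, $key) use (&$values) {
    if ($key === 'I_want_this') {
        $values[] = $value;
    }
});

print_r($values);
```

array_walk_recursive() only calls the callback for leaf values, so if "I_want_this" ever contains an array itself, a hand-written recursive function would be needed instead.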
