
ADVANCED PHP Programming, OOP, Bots, Spiders, SE Crawlers


businessman332211@hotmail.com


I want this topic to end up being large. This is my new testing ground, a place where I am advancing my PHP knowledge on the things mentioned in the subject. I have some questions here that I would like help with, if anyone has time.

ADVANCED PHP PROGRAMMING

1. What is considered advanced PHP programming by most? I can do almost anything intermediate and a small number of advanced things, but what exactly is advanced?

OOP

1. I think I have the hang of OOP. I think it's creating an object that has methods, so you can call those methods more easily via

document.whatever.methodname

That is what I saw in examples of PHP OOP: creating an instance of the object in memory, appending methods to it, and going from there.

2. With PHP, the PEAR DB, and PEAR packages, they use syntax like

db=>whatever
include=>whatever.php

I have seen this syntax; is this OOP?

THE MOST IMPORTANT QUESTIONS BELOW

I want to learn some of these things: data harvesting, bots, spiders, search engine web crawlers, et cetera.

1. OK, someone sent me a file called bot.php, something he was showing me. There is something that can take Craigslist, for instance, and keep posting ads over and over again, with one page on the left and Craigslist open on the right; all that is required is manual entry of the text. This doesn't seem useful to me, but it's an illustration of what bots are and what they do. I also know they can harvest data. I know that with PHP, ASP, or JSP (I am using PHP), you can create web crawlers, data harvesters, and whatever else, using regular PHP pages: you open them in the browser and they start performing those actions at that moment. How do I learn how to do data harvesting, bots, spiders, search engine crawlers, and whatever else? What good tutorials are there? I know a lot of it is left to the imagination, but I could use help getting my foot in the world of bots.

2. What all can bots do? What other things can they do? What are their limitations? How were they created? Does anyone have any information on this?

3. Which of you use bots, and what kinds of things do you use them for?

4. Are bots also called web crawlers, search engine spiders, and data harvesters? Are they all considered one and the same thing, or are those all accomplished using bots as well?

5. Does it require me to use cURL in PHP in my bot programs to harvest data? I don't mean stripping private information or anything similar, but pulling public information off a public site without having to do all the data entry. I had a project a while back with XML that required converting a lot of huge text files, 50 of them, into XML-formatted files. Someone created a bot, and 3 hours later he came back to me with the completed files. How on earth is something of this caliber possible? If bots are THAT powerful, I could save a lot of time on a lot of things. A LOT.


I will be learning the C languages once I become better with PHP and learn regex, OOP, and bots. The main thing is the data harvesters. I need to strip all the data from

http://whitepages.addresses.com/zip_codes_by_state/AL/A.html

It's all public information. I am collecting some data from there on the 50 states. I want to build a data-harvesting bot to strip the data and put it into a Word document; I just have to figure out how to create one. Any more advice?


That might help some, but the thing is, I have to get really in depth. I need it to follow the right link to a city, get the data, and put it somewhere. Then go back and go to another city (in alphabetical order), all the way through the state of Alabama, getting all the cities, with 4-5 fields of information per city; if a city has more than one value for a field, I need it to do a little more. Then I need it to go to another state and start harvesting that data. I have to figure out something, and building it myself will teach me. I am looking at cURL now. Do you know any tutorials on building data harvesters, or anything? Or can anyone answer any of the questions I asked in the first post?


Your best bet is to "scrape" a whole page at a time, then parse the returned HTML to get out the info you want. Taking only a piece per request and then re-running the same request for another piece is going to slow things down a lot.
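To make that concrete, here is a minimal sketch of the scrape-once-then-parse approach. The URL and the pattern are placeholders, not anything from a real project:

<?php
// Minimal sketch: one request for the whole page, then parse locally.
// The URL and the pattern are placeholders.
$html = file_get_contents('http://example.com/some_page.html');
if ($html === false) {
    die('Request failed');
}

// Pull every matching piece out of the saved HTML in one pass.
preg_match_all('/<td class="F3">(.*?)<\/td>/s', $html, $matches);

foreach ($matches[1] as $cell) {
    echo trim(strip_tags($cell)) . "\n";
}
?>

One HTTP request brings down the whole page; after that, every field you want is extracted locally, which is what keeps this approach fast.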


http://www.freelancebusinessman.com/spider_bot.php

I am getting there. I am starting to become obsessed with learning; I am about to spend every waking moment studying. I work 10 hours per day, and I am going to start studying 5 of that, plus late into the night, aside from when I play video games. I am starting to really feel that with programming someone can control the internet.

Hmm, I'll have to see if ASP.NET has something like flush(). Mine just processes, then spits everything out at the end. I would like to have it show link by link.

EDIT: What do you know: Response.Flush(); :)

EDIT: We must have been posting at the same time. Thanks.


Nope, everything is organized into objects and classes. Unlike PHP, which has lots of built-in functions that are not organized, .NET uses namespaces to manage things. For example, the Response object has many functions:

Response.Flush();
Response.Write();
Response.Clear();

as does Request:

Request.Form[]
Request.QueryString[]

etc., etc., etc. I have a demo here: http://71.7.150.39/default.aspx


> 1. What is considered advanced PHP programming by most? I can do almost anything intermediate and a small number of advanced things, but what exactly is advanced?

"Beginner", "intermediate", "advanced", "expert", and so on are all relative terms. There is no absolute meaning to any of them, aside from maybe "beginner". What is an advanced topic to one person may be a beginner topic to another. The best thing to go by is probably years of experience, but even then, that's not to say that two people working with a language for 5 years will be at the same level. How quickly you pick things up and how much you are able to learn has a lot to do with how your mind works. Advanced topics are typically specialty topics that are very complex but don't get used very much.
> 2. With PHP, the PEAR DB, and PEAR packages, they use syntax like db=>whatever and include=>whatever.php. I have seen this syntax; is this OOP?

PHP uses the -> operator to reference a class's members. The => operator is used when defining an array. You can also use the :: operator to reference a class's static members.

http://www.php.net/manual/en/language.oop.php
http://www.php.net/manual/en/language.oop5.php
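A quick sketch showing all three operators side by side (the Db class here is just an illustration, not PEAR's actual API):

<?php
class Db {
    public $name = 'mydb';

    public function connect() {
        // -> reaches members through an instance
        echo "connecting to {$this->name}\n";
    }

    public static function version() {
        // :: reaches static members without an instance
        return '1.0';
    }
}

$db = new Db();
$db->connect();             // -> on an object
echo Db::version() . "\n";  // :: on the class itself

// => pairs keys with values when defining an array
$config = array('host' => 'localhost', 'user' => 'root');
echo $config['host'] . "\n";
?>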
> What all can bots do? What other things can they do? What are their limitations? How were they created? Does anyone have any information on this?

Bots, or spiders, or whatever else you want to call them, are just programs. They are programs that make connections, usually over HTTP, get some data, parse it, analyze it, and take some other action based on it. What they can and cannot do is up to the person who programs them; just like any other program, there aren't any rules for these types of things.
> Someone created a bot, and 3 hours later he came back to me with the completed files. How on earth is something of this caliber possible? If bots are THAT powerful, I could save a lot of time on a lot of things. A LOT.

You're just talking about software. It's not that a "bot" is powerful; it's that software in general is powerful. I bought an application that streams MP3s online that someone wrote in PHP, but since he sells it, he obfuscated the code and included some safeguards so that it would not execute if the code was modified. I wanted to see how he did what he did, so I wrote a little PHP script to reformat his code: add indenting, insert line breaks in the proper places, and so on. It's all just software; the point of software is to make our lives easier.

Actually, the point of software is to speed up and automate the generation of errors. But that's another issue.

So why can't I find ANY tutorials on bots, spiders, or anything similar? I can't figure out how to follow URLs on websites, or how to make an action happen on a website when I call its URL, or how to work on a file's actual data. For instance, if I wanted a program to take text files, break the information up into sections, and pull information from them, I couldn't do it. That is what I need to do here: record URLs, put them in an array, follow them, collect data from them, then format the data as SQL queries so I can put it into my database. I can't seem to get it to work however hard I try; I don't get it.


A tutorial on building a bot would be huge, since there are so many things to consider. Once you stop and break it down, you will realize that you can find tutorials or help on the specific pieces, for instance regular expressions to parse data.


OK, these are the things I do not know and need to learn:

1. When I get the information from a webpage, how do I record which URLs are present on the page, based on specific criteria?
2. How do I follow those URLs?
3. How do I follow the URLs of those URLs, and however deep I may need to go?
4. How do I pull information from a website?
5. I can build the queries once I get the data.


Who is this? I just might. I have been studying cURL for a while now, trying to learn it, because I think it'll do a lot of what I want.
Well, on this forum his name is dcole.ath.cx, but normally he is just dcole, or his real name is Dan. His website is http://dcole.ath.cx and his email is dcole07@gmail.com or autocaddan@gmail.com.

> 1. When I get the information from a webpage, how do I record which URLs are present on the page, based on specific criteria?

Regular expressions can be used to parse the page and return all of the links. The regular expressions for doing this will be fairly complex.
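For illustration, a rough sketch of pulling hrefs out of a page with preg_match_all. The URL is a placeholder, and a single pattern like this catches common link forms (including unquoted href values) but not every way a link can be written:

<?php
// Rough sketch: collect every href on a page.
// The URL is a placeholder; the pattern also accepts unquoted href values.
$html = file_get_contents('http://example.com/');

preg_match_all('/<a[^>]+href\s*=\s*["\']?([^"\'> ]+)/i', $html, $matches);

foreach ($matches[1] as $url) {
    echo $url . "\n";
}
?>

Once you have the list in $matches[1], applying your "specific criteria" is just a matter of filtering the array.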
> 2. How do I follow those URLs?

The same way you got the first one: fopen, sockets, however you want to do it. You need to send an HTTP request for that URL and get the response back.
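Here is a minimal sketch of doing it by hand with fsockopen, assuming a plain HTTP/1.0 GET; fopen($url, 'r') or the cURL functions shown further down do the same job with less ceremony. The host and path are placeholders:

<?php
// Sketch of a raw HTTP GET over a socket. The host and path are placeholders.
$host = 'example.com';
$path = '/some_page.html';

$fp = fsockopen($host, 80, $errno, $errstr, 30);
if (!$fp) {
    die("$errstr ($errno)");
}

// HTTP/1.0 keeps things simple: the server closes the connection when done.
fwrite($fp, "GET $path HTTP/1.0\r\nHost: $host\r\nConnection: close\r\n\r\n");

$response = '';
while (!feof($fp)) {
    $response .= fgets($fp, 1024);
}
fclose($fp);

// $response holds the status line and headers followed by the HTML body.
echo $response;
?>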
> 3. How do I follow the URLs of those URLs, and however deep I may need to go?

The same way you did it for #1 and #2. The program itself will operate in a loop, or use a recursive function, to keep following links.
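A sketch of the recursive version, with a visited list so it doesn't loop forever and a depth limit so it stops. extract_links() is just the href regex from above wrapped in a function, and the starting URL is a placeholder:

<?php
// Sketch of a depth-limited recursive crawl. $visited stops infinite loops.
function extract_links($html) {
    preg_match_all('/<a[^>]+href\s*=\s*["\']?([^"\'> ]+)/i', $html, $m);
    return $m[1];
}

function crawl($url, $depth, &$visited) {
    if ($depth <= 0 || isset($visited[$url])) {
        return;
    }
    $visited[$url] = true;

    $html = @file_get_contents($url);
    if ($html === false) {
        return;
    }
    echo $url . "\n";

    foreach (extract_links($html) as $link) {
        if (strpos($link, 'http') === 0) {  // only follow absolute URLs in this sketch
            crawl($link, $depth - 1, $visited);
        }
    }
}

$visited = array();
crawl('http://example.com/', 2, $visited);
?>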
> 4. How do I pull information from a website?

Define "information". What do you mean? You get the contents of a web page (the HTML) by sending a request and getting the response, using something like fopen, sockets, or the cURL library.
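Since cURL keeps coming up in this thread, here is a minimal sketch of fetching a page's HTML with the cURL extension (the URL is a placeholder):

<?php
// Minimal sketch of fetching a page with the cURL extension.
$ch = curl_init('http://example.com/some_page.html');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // hand back the body instead of printing it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);   // follow redirects

$html = curl_exec($ch);
if ($html === false) {
    echo 'cURL error: ' . curl_error($ch) . "\n";
} else {
    echo strlen($html) . " bytes fetched\n";      // $html is the page source, ready to parse
}
curl_close($ch);
?>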

http://whitepages.addresses.com/zip_codes_by_state/AL/A.html

The information within those cities: I want the city, state, zip code, area code, and time zone recorded. Here is what I want to do in total, and it'll be my first huge thing; I am doing it partially to learn and partially to get my database.

1. Get my PHP script to start looking at this address: http://zipcodes.addresses.com/zip_code_lookup.php
2. I want it to follow the Alabama link.
3. I want it to go through all cities (in alphabetical order), from A-Z on that page, and pull out the city, state, zip code, area code, and time zone for each one (as well as format ones that have two, like "30082, 30083" if there are two zip codes).
4. Repeat that for the second state (alphabetically), until it has gathered all the information together.

Then I want to take all of the trapped information and record it in SQL queries, so I can copy them, paste them onto another page, and run the script to put all the information into the database at once. That should be good, because if all the information is saved in an array, I can run a while or foreach control structure around it and create a whole bunch of insert queries, then run them all at once (see the sketch below). That should work out. That is the whole scheme of what I am trying to do, and why I am asking these kinds of questions.
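To illustrate that last step, here is a sketch of turning an array of harvested rows into INSERT queries with a foreach. The $cities rows and the zip_data table and column names are made up for the example:

<?php
// Sketch of turning harvested rows into INSERT queries.
// The rows and the zip_data table/columns are made up for the example.
$cities = array(
    array('city' => 'Abbeville',   'state' => 'AL', 'zip' => '36310',
          'areacode' => '334', 'timezone' => 'CST'),
    array('city' => 'Albertville', 'state' => 'AL', 'zip' => '35950, 35951',
          'areacode' => '256', 'timezone' => 'CST'),
);

$queries = array();
foreach ($cities as $row) {
    $queries[] = sprintf(
        "INSERT INTO zip_data (city, state, zip, areacode, timezone) "
            . "VALUES ('%s', '%s', '%s', '%s', '%s');",
        addslashes($row['city']),
        $row['state'],
        $row['zip'],
        $row['areacode'],
        $row['timezone']
    );
}

// Print them to copy and paste, or loop over $queries and run each one.
echo implode("\n", $queries) . "\n";
?>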

It's not so easy to automate this because of how that page is laid out. This is how they display the cities (except without the line breaks and indenting):

<tr>
  <td class="F3">
    <a class="L6"href=http://Abbeville.addresses.com/zip_codes_by_city/16021.html>Abbeville</a>
  </td>
  <td class="F3">
    <a class="L6"href=http://Albertville.addresses.com/zip_codes_by_city/15845.html>Albertville</a>
  </td>
  <td class="F3">
    <a class="L6"href=http://Altoona.addresses.com/zip_codes_by_city/15847.html>Altoona</a>
  </td>
  <td class="F3">
    <a class="L6"href=http://Arley.addresses.com/zip_codes_by_city/15696.html>Arley</a>
  </td>
  <td class="F3">
    <a class="L6"href=http://Auburn.addresses.com/zip_codes_by_city/16261.html>Auburn</a>
  </td>
</tr>

They are in a td with the class "F3", and then an anchor with the class "L6". The problem is, the cities aren't the only thing in that structure; the other states are also in it:

<td class="F3">
  <a class="L6"href=http://South-Carolina.addresses.com/zip_codes_by_state/SC/A.html>SC</a>
</td>
<td class="F3">
  <a class="L6"href=http://South-Dakota.addresses.com/zip_codes_by_state/SD/A.html>SD</a>
</td>
<td class="F3">
  <a class="L6"href=http://Tennessee.addresses.com/zip_codes_by_state/TN/A.html>TN</a>
</td>

But they do link to different pages. One set links to .../zip_codes_by_city/... and another links to .../zip_codes_by_state/... Also, the name of the city or state is given as a subdomain in the URL.

So you may be able to parse out the URLs and search for one of those two terms to determine whether you are looking at a city link or a state link. Also check for duplicates; some cities show up twice on the same page, in two different lists. The different city letters for each state will also be a special case.

This is how parsing works: the data you have to work with is the HTML of the page. Your task as a programmer is to figure out how the data is structured, so that you can find what you are looking for, or so that you know what you are looking at. It would be nice if they used different CSS class names for the different links, because then you could go off those, but at least they have separate folders in the URL. The key is finding a pattern that you can use to find the information you are looking for. In this case, it looks like the general pattern is the folder in the URL.
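Putting that together, a sketch of the approach just described: grab the class="L6" links, classify each one by which folder appears in its URL, and key the arrays by URL so duplicates drop out. It assumes the markup really looks like the snippets above:

<?php
// Sketch: pull out the class="L6" links, then classify each one by the
// folder in its URL. Keying the arrays by URL drops duplicates automatically.
$html = file_get_contents('http://zipcodes.addresses.com/zip_code_lookup.php');

preg_match_all('/<a class="L6"[^>]*href=["\']?([^"\'> ]+)["\']?[^>]*>([^<]+)<\/a>/i',
               $html, $m, PREG_SET_ORDER);

$cities = array();
$states = array();
foreach ($m as $link) {
    $url  = $link[1];
    $text = trim($link[2]);
    if (strpos($url, '/zip_codes_by_city/') !== false) {
        $cities[$url] = $text;
    } elseif (strpos($url, '/zip_codes_by_state/') !== false) {
        $states[$url] = $text;
    }
}

echo count($cities) . " city links, " . count($states) . " state links\n";
?>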

