
ADVANCED PHP Programming, OOP, Bots, Spiders, SE Crawlers


businessman332211@hotmail.com


I want this topic to end up being large. This is my new testing ground, a place where I am advancing my PHP knowledge on the things mentioned in the subject. I have some questions here that I would like help with, if anyone has time.

ADVANCED PHP PROGRAMMING

1. What is considered advanced PHP programming by most? I can do almost anything intermediate and a small number of advanced things, but what exactly is advanced?

OOP

1. I think I have the hang of OOP. I think it's creating an object that has methods, so you can call those methods more easily via

document.whatever.methodname

That is what I saw in examples of PHP OOP: creating an instance of the object in memory, appending methods to it, and going from there.

2. With PHP, the PEAR DB, and PEAR packages, they use syntax like

db=>whatever
include=>whatever.php

I have seen this syntax; is this OOP?

THE MOST IMPORTANT QUESTIONS BELOW

I want to learn some of these things: data harvesting, bots, spiders, search engine web crawlers, et cetera.

1. OK, someone sent me a file called bot.php, something he was showing me. There is something that can take Craigslist, for instance, and keep posting ads over and over again, with one page on the left and Craigslist open on the right; all that is required is manual entry of the text. This doesn't seem useful to me, but it's an illustration of what bots are and what they do. I also know they can harvest data. I know that with PHP, ASP, or JSP (I am using PHP), you can create web crawlers, data harvesters, and whatever else, using regular PHP pages: you open them in the browser and they start performing those actions at that moment. How do I learn how to do data harvesting, bots, spiders, search engine crawlers, and whatever else? What good tutorials are there? I know a lot of it is left to the imagination, but I could use help getting my foot in the world of bots.

2. What all can bots do? What other things can they do? What are their limitations? How were they created? Does anyone have any information on this?

3. Which of you use bots, and what kinds of things do you use them for?

4. Are bots also called web crawlers, search engine spiders, and data harvesters? Are they all considered one and the same thing, or are those all accomplished using bots as well?

5. Does it require me to use cURL in PHP in my bot programs to harvest data? I don't mean stripping private information or anything similar, but pulling public information off a public site without having to do all the data entry. I had a project a while back with XML that required converting a lot of huge text files, 50 of them, into XML-formatted files. Someone created a bot, and 3 hours later he came back to me with the completed files. How on earth is something of this caliber possible? If bots are THAT powerful, I could save a lot of time on a lot of things. A LOT.


I will be learning the C languages once I become better with PHP and learn regex, OOP, and bots. The main thing is the data harvesters. I need to strip all the data from

http://whitepages.addresses.com/zip_codes_by_state/AL/A.html

It's all public information. I am collecting some data from there on the 50 states. I want to build a data-harvesting bot to strip the data and put it into a Word document; I just have to figure out how to create one. Any more advice?


That might help some, but the thing is, I have to get really in depth. I need it to follow the right link to a city, get the data, and put it somewhere. Then go back and go to another city (in alphabetical order), all the way through the state of Alabama, getting all the cities, with 4-5 fields of information per city; if a city has more than one value for a field, I need it to do a little more. Then I need it to go to another state and start harvesting that data. I have to figure out something, and building it myself will teach me. I am looking at cURL now. Do you know any tutorials on building data harvesters, or anything? Or can anyone answer any of the questions I asked in the first post?


Your best bet is to "scrape" a whole page at a time, then parse the returned HTML to get out the info you want. Taking only a piece per request and then re-running the same request for another piece is going to slow things down a lot.
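To make that concrete, here is a minimal sketch of the scrape-once-then-parse approach. The URL and the pattern are placeholders, not anything from a real project:

<?php
// Minimal sketch: one request for the whole page, then parse locally.
// The URL and the pattern are placeholders.
$html = file_get_contents('http://example.com/some_page.html');
if ($html === false) {
    die('Request failed');
}

// Pull every matching piece out of the saved HTML in one pass.
preg_match_all('/<td class="F3">(.*?)<\/td>/s', $html, $matches);

foreach ($matches[1] as $cell) {
    echo trim(strip_tags($cell)) . "\n";
}
?>

One HTTP request brings down the whole page; after that, every field you want is extracted locally, which is what keeps this approach fast.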


http://www.freelancebusinessman.com/spider_bot.php

I am getting there. I am starting to become obsessed with learning; I am about to spend every waking moment studying. I work 10 hours per day, and I am going to start studying 5 of that, plus late into the night, aside from when I play video games. I am starting to really feel that with programming someone can control the internet.

Hmm, I'll have to see if ASP.NET has something like flush(). Mine just processes, then spits everything out at the end. I would like to have it show link by link.

EDIT: What do you know: Response.Flush(); :)

EDIT: We must have been posting at the same time. Thanks.


Nope, everything is organized into objects and classes. Unlike PHP, which has lots of built-in functions that are not organized, .NET uses namespaces to manage things. For example, the Response object has many functions:

Response.Flush();
Response.Write();
Response.Clear();

as does Request:

Request.Form[]
Request.QueryString[]

etc., etc., etc. I have a demo here: http://71.7.150.39/default.aspx


> 1. What is considered advanced PHP programming by most? I can do almost anything intermediate and a small number of advanced things, but what exactly is advanced?

"Beginner", "intermediate", "advanced", "expert", and so on are all relative terms. There is no absolute meaning to any of them, aside from maybe "beginner". What is an advanced topic to one person may be a beginner topic to another. The best thing to go by is probably years of experience, but even then, that's not to say that two people working with a language for 5 years will be at the same level. How quickly you pick things up and how much you are able to learn has a lot to do with how your mind works. Advanced topics are typically specialty topics that are very complex but don't get used very much.
> 2. With PHP, the PEAR DB, and PEAR packages, they use syntax like db=>whatever and include=>whatever.php. I have seen this syntax; is this OOP?

PHP uses the -> operator to reference a class's members. The => operator is used when defining an array. You can also use the :: operator to reference a class's static members.

http://www.php.net/manual/en/language.oop.php
http://www.php.net/manual/en/language.oop5.php
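A quick sketch showing all three operators side by side (the Db class here is just an illustration, not PEAR's actual API):

<?php
class Db {
    public $name = 'mydb';

    public function connect() {
        // -> reaches members through an instance
        echo "connecting to {$this->name}\n";
    }

    public static function version() {
        // :: reaches static members without an instance
        return '1.0';
    }
}

$db = new Db();
$db->connect();             // -> on an object
echo Db::version() . "\n";  // :: on the class itself

// => pairs keys with values when defining an array
$config = array('host' => 'localhost', 'user' => 'root');
echo $config['host'] . "\n";
?>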
> What all can bots do? What other things can they do? What are their limitations? How were they created? Does anyone have any information on this?

Bots, or spiders, or whatever else you want to call them, are just programs. They are programs that make connections, usually over HTTP, get some data, parse it, analyze it, and take some other action based on it. What they can and cannot do is up to the person who programs them; just like any other program, there aren't any rules for these types of things.
> Someone created a bot, and 3 hours later he came back to me with the completed files. How on earth is something of this caliber possible? If bots are THAT powerful, I could save a lot of time on a lot of things. A LOT.

You're just talking about software. It's not that a "bot" is powerful; it's that software in general is powerful. I bought an application that streams MP3s online that someone wrote in PHP, but since he sells it, he obfuscated the code and included some safeguards so that it would not execute if the code was modified. I wanted to see how he did what he did, so I wrote a little PHP script to reformat his code: add indenting, insert line breaks in the proper places, and so on. It's all just software; the point of software is to make our lives easier.

Actually, the point of software is to speed up and automate the generation of errors. But that's another issue.

So why can't I find ANY tutorials on bots, spiders, or anything similar? I can't figure out how to follow URLs on websites, or how to make an action happen on a website when I call its URL, or how to work on a file's actual data. For instance, if I wanted a program to take text files, break the information up into sections, and pull information from them, I couldn't do it. That is what I need to do here: record URLs, put them in an array, follow them, collect data from them, then format the data as SQL queries so I can put it into my database. I can't seem to get it to work however hard I try; I don't get it.


A tutorial on building a bot would be huge, since there are so many things to consider. Once you stop and break it down, you will realize that you can find tutorials or help on the specific pieces, for instance regular expressions to parse data.


OK, these are the things I do not know and need to learn:

1. When I get the information from a webpage, how do I record which URLs are present on the page, based on specific criteria?
2. How do I follow those URLs?
3. How do I follow the URLs of those URLs, and however deep I may need to go?
4. How do I pull information from a website?
5. I can build the queries once I get the data.


Who is this? I just might. I have been studying cURL for a while now, trying to learn it, because I think it'll do a lot of what I want.
Well, on this forum his name is dcole.ath.cx, but normally he is just dcole, or his real name is Dan. His website is http://dcole.ath.cx and his email is dcole07@gmail.com or autocaddan@gmail.com.

> 1. When I get the information from a webpage, how do I record which URLs are present on the page, based on specific criteria?

Regular expressions can be used to parse the page and return all of the links. The regular expressions for doing this will be fairly complex.
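For illustration, a rough sketch of pulling hrefs out of a page with preg_match_all. The URL is a placeholder, and a single pattern like this catches common link forms (including unquoted href values) but not every way a link can be written:

<?php
// Rough sketch: collect every href on a page.
// The URL is a placeholder; the pattern also accepts unquoted href values.
$html = file_get_contents('http://example.com/');

preg_match_all('/<a[^>]+href\s*=\s*["\']?([^"\'> ]+)/i', $html, $matches);

foreach ($matches[1] as $url) {
    echo $url . "\n";
}
?>

Once you have the list in $matches[1], applying your "specific criteria" is just a matter of filtering the array.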
> 2. How do I follow those URLs?

The same way you got the first one: fopen, sockets, however you want to do it. You need to send an HTTP request for that URL and get the response back.
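Here is a minimal sketch of doing it by hand with fsockopen, assuming a plain HTTP/1.0 GET; fopen($url, 'r') or the cURL functions shown further down do the same job with less ceremony. The host and path are placeholders:

<?php
// Sketch of a raw HTTP GET over a socket. The host and path are placeholders.
$host = 'example.com';
$path = '/some_page.html';

$fp = fsockopen($host, 80, $errno, $errstr, 30);
if (!$fp) {
    die("$errstr ($errno)");
}

// HTTP/1.0 keeps things simple: the server closes the connection when done.
fwrite($fp, "GET $path HTTP/1.0\r\nHost: $host\r\nConnection: close\r\n\r\n");

$response = '';
while (!feof($fp)) {
    $response .= fgets($fp, 1024);
}
fclose($fp);

// $response holds the status line and headers followed by the HTML body.
echo $response;
?>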
> 3. How do I follow the URLs of those URLs, and however deep I may need to go?

The same way you did it for #1 and #2. The program itself will operate in a loop, or use a recursive function, to keep following links.
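A sketch of the recursive version, with a visited list so it doesn't loop forever and a depth limit so it stops. extract_links() is just the href regex from above wrapped in a function, and the starting URL is a placeholder:

<?php
// Sketch of a depth-limited recursive crawl. $visited stops infinite loops.
function extract_links($html) {
    preg_match_all('/<a[^>]+href\s*=\s*["\']?([^"\'> ]+)/i', $html, $m);
    return $m[1];
}

function crawl($url, $depth, &$visited) {
    if ($depth <= 0 || isset($visited[$url])) {
        return;
    }
    $visited[$url] = true;

    $html = @file_get_contents($url);
    if ($html === false) {
        return;
    }
    echo $url . "\n";

    foreach (extract_links($html) as $link) {
        if (strpos($link, 'http') === 0) {  // only follow absolute URLs in this sketch
            crawl($link, $depth - 1, $visited);
        }
    }
}

$visited = array();
crawl('http://example.com/', 2, $visited);
?>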
> 4. How do I pull information from a website?

Define "information". What do you mean? You get the contents of a web page (the HTML) by sending a request and getting the response, using something like fopen, sockets, or the cURL library.
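Since cURL keeps coming up in this thread, here is a minimal sketch of fetching a page's HTML with the cURL extension (the URL is a placeholder):

<?php
// Minimal sketch of fetching a page with the cURL extension.
$ch = curl_init('http://example.com/some_page.html');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // hand back the body instead of printing it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);   // follow redirects

$html = curl_exec($ch);
if ($html === false) {
    echo 'cURL error: ' . curl_error($ch) . "\n";
} else {
    echo strlen($html) . " bytes fetched\n";      // $html is the page source, ready to parse
}
curl_close($ch);
?>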

http://whitepages.addresses.com/zip_codes_by_state/AL/A.html

The information within those cities: I want the city, state, zip code, area code, and time zone recorded. Here is what I want to do in total, and it'll be my first huge thing; I am doing it partially to learn and partially to get my database.

1. Get my PHP script to start looking at this address: http://zipcodes.addresses.com/zip_code_lookup.php
2. I want it to follow the Alabama link.
3. I want it to go through all cities (in alphabetical order), from A-Z on that page, and pull out the city, state, zip code, area code, and time zone for each one (as well as format ones that have two, like "30082, 30083" if there are two zip codes).
4. Repeat that for the second state (alphabetically), until it has gathered all the information together.

Then I want to take all of the trapped information and record it in SQL queries, so I can copy them, paste them onto another page, and run the script to put all the information into the database at once. That should be good, because if all the information is saved in an array, I can run a while or foreach control structure around it and create a whole bunch of insert queries, then run them all at once (see the sketch below). That should work out. That is the whole scheme of what I am trying to do, and why I am asking these kinds of questions.
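To illustrate that last step, here is a sketch of turning an array of harvested rows into INSERT queries with a foreach. The $cities rows and the zip_data table and column names are made up for the example:

<?php
// Sketch of turning harvested rows into INSERT queries.
// The rows and the zip_data table/columns are made up for the example.
$cities = array(
    array('city' => 'Abbeville',   'state' => 'AL', 'zip' => '36310',
          'areacode' => '334', 'timezone' => 'CST'),
    array('city' => 'Albertville', 'state' => 'AL', 'zip' => '35950, 35951',
          'areacode' => '256', 'timezone' => 'CST'),
);

$queries = array();
foreach ($cities as $row) {
    $queries[] = sprintf(
        "INSERT INTO zip_data (city, state, zip, areacode, timezone) "
            . "VALUES ('%s', '%s', '%s', '%s', '%s');",
        addslashes($row['city']),
        $row['state'],
        $row['zip'],
        $row['areacode'],
        $row['timezone']
    );
}

// Print them to copy and paste, or loop over $queries and run each one.
echo implode("\n", $queries) . "\n";
?>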

It's not so easy to automate this because of how that page is laid out. This is how they display the cities (except without the line breaks and indenting):

<tr>
  <td class="F3">
    <a class="L6"href=http://Abbeville.addresses.com/zip_codes_by_city/16021.html>Abbeville</a>
  </td>
  <td class="F3">
    <a class="L6"href=http://Albertville.addresses.com/zip_codes_by_city/15845.html>Albertville</a>
  </td>
  <td class="F3">
    <a class="L6"href=http://Altoona.addresses.com/zip_codes_by_city/15847.html>Altoona</a>
  </td>
  <td class="F3">
    <a class="L6"href=http://Arley.addresses.com/zip_codes_by_city/15696.html>Arley</a>
  </td>
  <td class="F3">
    <a class="L6"href=http://Auburn.addresses.com/zip_codes_by_city/16261.html>Auburn</a>
  </td>
</tr>

They are in a td with the class "F3", and then an anchor with the class "L6". The problem is, the cities aren't the only thing in that structure; the other states are also in it:

<td class="F3">
  <a class="L6"href=http://South-Carolina.addresses.com/zip_codes_by_state/SC/A.html>SC</a>
</td>
<td class="F3">
  <a class="L6"href=http://South-Dakota.addresses.com/zip_codes_by_state/SD/A.html>SD</a>
</td>
<td class="F3">
  <a class="L6"href=http://Tennessee.addresses.com/zip_codes_by_state/TN/A.html>TN</a>
</td>

But they do link to different pages. One set links to .../zip_codes_by_city/... and another links to .../zip_codes_by_state/... Also, the name of the city or state is given as a subdomain in the URL.

So you may be able to parse out the URLs and search for one of those two terms to determine whether you are looking at a city link or a state link. Also check for duplicates; some cities show up twice on the same page, in two different lists. The different city letters for each state will also be a special case.

This is how parsing works: the data you have to work with is the HTML of the page. Your task as a programmer is to figure out how the data is structured, so that you can find what you are looking for, or so that you know what you are looking at. It would be nice if they used different CSS class names for the different links, because then you could go off those, but at least they have separate folders in the URL. The key is finding a pattern that you can use to find the information you are looking for. In this case, it looks like the general pattern is the folder in the URL.
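Putting that together, a sketch of the approach just described: grab the class="L6" links, classify each one by which folder appears in its URL, and key the arrays by URL so duplicates drop out. It assumes the markup really looks like the snippets above:

<?php
// Sketch: pull out the class="L6" links, then classify each one by the
// folder in its URL. Keying the arrays by URL drops duplicates automatically.
$html = file_get_contents('http://zipcodes.addresses.com/zip_code_lookup.php');

preg_match_all('/<a class="L6"[^>]*href=["\']?([^"\'> ]+)["\']?[^>]*>([^<]+)<\/a>/i',
               $html, $m, PREG_SET_ORDER);

$cities = array();
$states = array();
foreach ($m as $link) {
    $url  = $link[1];
    $text = trim($link[2]);
    if (strpos($url, '/zip_codes_by_city/') !== false) {
        $cities[$url] = $text;
    } elseif (strpos($url, '/zip_codes_by_state/') !== false) {
        $states[$url] = $text;
    }
}

echo count($cities) . " city links, " . count($states) . " state links\n";
?>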

