ThePsion5 Posted August 7, 2006

Hi guys,

I'm writing a PHP script that parses GET values out of a log file with lines resembling something like this:

    IPAddress - - [Timestamp] "Retrieved Page" Number - "http://www.google.com/search?q=search+query&other=useless+stuff" "Request Details"

I'm trying to figure out how to do this quickly, because there are a LOT of log files to parse. I've considered using regular expressions, like so:

    preg_match_all('%\?[^"\s\S]*"%', $Sample, $Info);

But I've heard that regular expressions are none too fast. I've also thought of doing the same with string functions:

    $Lines = explode('?', $Sample);
    foreach ($Lines as $line) {
        $Info[] = explode('"', $line);
    }

But this seems like it would be comparably slow as well. Unfortunately, I'm not familiar with the exact speed of the regular expression functions in PHP, but I assume that they're somewhere between omega(n^2) and omega(n log n), whereas my string-splitting version has a time of omega(2Xn^2), where X is the time taken for each explode() call. Can anyone shed light on this, or point me somewhere that gives precise information on the execution time of these two approaches?

Thanks in advance!
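
For reference, here is a minimal sketch of the regular-expression route. It rests on two assumptions that are not in the post: the sample line is made up to fit the format described, and the pattern is changed to `[^"]*` inside a capture group. The character class `[^"\s\S]` excludes every possible character, so the pattern as posted can only ever match a `?` immediately followed by a `"`. `preg_match_all()` returns the number of matches and fills the array passed as its third argument.

```php
<?php
// Hypothetical sample line in the format described in the post.
$sample = '127.0.0.1 - - [07/Aug/2006:10:00:00 -0400] "GET /index.php HTTP/1.1" 200 - '
        . '"http://www.google.com/search?q=search+query&other=useless+stuff" "Mozilla/5.0"';

// Match a literal "?" followed by everything up to the next double quote.
// The capture group holds the raw query string without the delimiters.
$count = preg_match_all('%\?([^"]*)"%', $sample, $matches);

// $matches[1] lists the captured query strings, one per match.
$queries = $matches[1];

// parse_str() splits a query string into fields and URL-decodes the values.
parse_str($queries[0], $fields);
```

Whether this is faster or slower than the string-function version depends on the data, so it is worth timing both on a real log file rather than guessing.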
dcole.ath.cx Posted August 7, 2006

I would just go with what seems best. I use the easier of the methods. There's no point in writing 100 lines of code to do the work of one regular expression, even if regular expressions are slow.

When I question which one to use, I just think about how much work will happen each way, although most of the time I go with string functions. In your case I would use string functions, because it's much easier to explode by ? and " and then pick out the data than to come up with %\?[^"\s\S]*"%

When I think about it, both ways check every character, but the regular expression has to check the ones after it too, over and over again. The exception is when I'm working with an unknown amount of dynamic data (like with my search engine); then it's easier to go with regular expressions, because you don't always know how much you're going to find.
justsomeguy Posted August 7, 2006

You can use sscanf() to do this. If you are parsing Apache log files, you can probably use this:

    <?php
    // $line is assumed to hold one line read from the log file.
    $log = array();
    // sscanf() returns the number of values it managed to assign.
    $n = sscanf(trim($line), '%s %s %s [%[^]]] "%s %s %[^"]" %d %s "%[^"]" "%[^"]"',
        $log['ip'], $log['client'], $log['user'], $log['time'],
        $log['method'], $log['uri'], $log['prot'], $log['code'],
        $log['bytes'], $log['ref'], $log['agent']);
    ?>

Or else you can build your own custom format string to parse it however you want. Look up the documentation for sprintf() to see the syntax for creating the format string.
ThePsion5 Posted August 8, 2006 (Author)

Well, I wrote something that turned out to be pretty damn fast actually, lol. Fast enough to parse 1.6 gigs of information in about 75 seconds, at least. My general strategy was to minimize the code inside loops as much as possible, avoid double-quoted strings, and the like. I'm pretty satisfied with what I got, hehe.
MartinAustin Posted August 8, 2006

Unless there is confidential information in it, would you mind sharing with the community what your final solution consisted of?

Great work, 1.6G in 75 secs is pretty impressive.
ThePsion5 Posted August 8, 2006 (Author)

Sure, I'll give you a run-down of what my algorithm does. By the way, this is designed to parse out any search keywords contained in a referral URL.

There are actually two versions; the second uses less memory but isn't quite as fast as the first.

The uber-fast one (note: when testing this, Apache will use up to 170 megs of RAM, lol). For each file being parsed:

1. Pull the content into a variable.
2. Use strpos() on the content variable with ? (question mark) to find the start of the GET values.
3. Using the value from the previous strpos() call as an offset, call strpos() again with " (double quotation mark), which marks the end of the URL.
4. Armed with the two strpos() values, use substr() to pull the GET values out of the file content.
5. Use explode() with & (ampersand) to get all the individual GET fields into an array.
6. Iterate through each value of the array, calling strpos() for each of the following: 'q=', 'p=', or 'searchFor='.
7. If any of them match, call substr() and strlen() with the result of the previous strpos() to extract the value portion of the GET field.
8. Call urldecode() on it and use it as the key for an array, like so: $SearchWords[$value]++;
9. Repeat from step 2 until the first strpos() (in step 2) returns false.
10. Sort the array.
11. Put it into a file.
12. RULE THE WORLD :-D

The other version just exports the array to a different file for each file parsed, so it's pretty similar. It also decreases Apache's maximum usage to 110 megs instead of 170, but it's about 20%-25% slower.
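
The steps above can be sketched roughly as follows. This is not the poster's actual code, which was never shared. The function name `countSearchWords()`, the `$keys` parameter, and the sample log lines are all made up for illustration, and the sketch assumes PHP 7 or later. Reading the file (step 1) and writing the result (step 11) are left out; the content would typically come from `file_get_contents()`.

```php
<?php
/**
 * Count search keywords found in referrer URLs inside raw log content.
 * countSearchWords() and $keys are illustrative names, not from the post.
 */
function countSearchWords(string $content, array $keys = ['q=', 'p=', 'searchFor=']): array
{
    $searchWords = [];
    $offset = 0;

    // Steps 2 and 9: find each "?" until strpos() returns false.
    while (($start = strpos($content, '?', $offset)) !== false) {
        // Step 3: the next double quote marks the end of the URL.
        $end = strpos($content, '"', $start);
        if ($end === false) {
            break; // no closing quote, nothing more to extract
        }

        // Step 4: the GET values sit between the "?" and the quote.
        $query = substr($content, $start + 1, $end - $start - 1);

        // Steps 5 to 8: split into fields and tally the matching values.
        foreach (explode('&', $query) as $field) {
            foreach ($keys as $key) {
                // Assumption: the key must start the field, so 'p=' does not match 'top='.
                if (strpos($field, $key) === 0) {
                    $value = urldecode(substr($field, strlen($key)));
                    $searchWords[$value] = ($searchWords[$value] ?? 0) + 1;
                    break;
                }
            }
        }

        $offset = $end + 1;
    }

    // Step 10: most frequent keywords first.
    arsort($searchWords);
    return $searchWords;
}

// Made-up log lines in the format from the first post.
$log = '1.2.3.4 - - [t] "GET / HTTP/1.1" 200 - "http://www.google.com/search?q=php+logs&hl=en" "UA"' . "\n"
     . '1.2.3.5 - - [t] "GET / HTTP/1.1" 200 - "http://search.yahoo.com/search?p=php+logs" "UA"' . "\n"
     . '1.2.3.6 - - [t] "GET / HTTP/1.1" 200 - "http://www.google.com/search?q=regex" "UA"';

$counts = countSearchWords($log);
```

Like the steps it follows, this treats every ? in the content as the start of a query string, so a ? inside the requested page itself would be picked up as well.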