MrFish Posted December 30, 2013

I have a database of ~100,000 local businesses in my area, and a few scrapers to pull in more businesses from other locations. I've spent the last few days building a client-server application in Python so I can scale up the number of scrapers and feed all the scraped data to one machine. The clients scrape sites and the server records the data. With 100,000 businesses the server can't keep up with even one client's scraping!

Adding a business involves the following:

1. The business name and address are checked against the database to prevent duplicates. If no match is found, the business is added to the current database of 100,000 businesses.
2. Tags found for that business (e.g. "pizza", "mexican", "chinese") are checked against a tags table to see if they exist. If they don't, the tags are added.
3. A tag-to-business association table then records the business ID and the tag ID.
4. Lastly, a key-value table for random useful information like "known web address" is queried to see if the keys being added already exist; if they do, it updates them, and if they don't, it adds the key-value pairs.

There are a lot of string searches going on here, so I can see why it would slow down so much. I've considered indexing all of the fields that will be searched, but I don't know if that will slow down insert queries. One thing I've considered is to just add all this information without checking and run a cleanup process later to remove duplicate data. Not sure if that would help.

Any advice to speed up my database? I'd like a dozen or so scrapers, and not being able to keep up with just one scraper is bad news.
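The four steps above can be sketched as a schema. This is a minimal sketch, not MrFish's actual schema: the table and column names are assumptions (chosen to match the queries quoted later in the thread), and it uses Python's sqlite3 module so it runs standalone. The point is the UNIQUE constraints, which give the database an index to enforce deduplication instead of a string scan per insert.

```python
import sqlite3

# Hypothetical schema for the tables described above (names are assumptions).
# Each UNIQUE constraint creates an index, so every "does this already
# exist?" check becomes an index lookup rather than a table scan.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE businesses (
    b_id      INTEGER PRIMARY KEY,
    b_name    TEXT NOT NULL,
    b_address TEXT NOT NULL,
    UNIQUE (b_name, b_address)          -- one row per name+address pair
);
CREATE TABLE tags (
    tag_id   INTEGER PRIMARY KEY,
    tag_name TEXT NOT NULL UNIQUE       -- indexed, so tag lookups are fast
);
CREATE TABLE business_tags (
    bt_b_id   INTEGER REFERENCES businesses(b_id),
    bt_tag_id INTEGER REFERENCES tags(tag_id),
    UNIQUE (bt_b_id, bt_tag_id)         -- one association per pair
);
CREATE TABLE business_data (
    bd_b_id  INTEGER REFERENCES businesses(b_id),
    bd_key   TEXT NOT NULL,
    bd_value TEXT,
    UNIQUE (bd_b_id, bd_key)            -- one value per business/key
);
""")
```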
davej Posted December 30, 2013

You have the database doing string searches?
MrFish (Author) Posted December 30, 2013

You have the database doing string searches?

Yes. Names are strings, usually. I wish they'd all name themselves numbers, but they didn't.
davej Posted December 30, 2013

Maybe it would help clarify things if you did some loop timing and then posted one of the slowest queries?
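One lightweight way to do the loop timing davej suggests is a wrapper around each query helper. This is a sketch, not code from the thread; the `timed` helper is a hypothetical name.

```python
import time

# Hypothetical timing wrapper: call timed(fn, args...) instead of fn(args...)
# inside the loop, and the slowest query methods will stand out in the output.
def timed(fn, *args, **kwargs):
    t0 = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - t0
    print(f"{getattr(fn, '__name__', 'call')}: {elapsed:.6f}s")
    return result
```

For example, `timed(self.tagExists, Tag)` returns the same value as `self.tagExists(Tag)` but also prints how long the call took.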
MrFish (Author) Posted December 30, 2013

Well, I've done some indexing and it's helped a lot, but it's still behind. Here is the code that runs the queries, with the query called at each step shown in the comments.

    # Each tag is a string value
    for Tag in tags:
        # First check if the tag exists in the database.
        # SELECT * FROM tags WHERE tag_name=%s
        # tag_name is indexed
        if self.tagExists(Tag) == False:
            # If not, add the tag
            # INSERT INTO tags (tag_name) VALUES(%s)
            self.addTag(Tag)
        # Then check whether the business already has the tag associated with it
        # SELECT * FROM business_tags WHERE bt_b_id=%s AND bt_tag_id=%s
        # This method also looks up the business ID and tag ID given the Business and Tag objects
        # This can probably be trimmed down
        if self.businessHasTag(B, Tag) == False:
            # If the business doesn't have the tag, associate it
            # INSERT INTO business_tags (bt_b_id, bt_tag_id) VALUES(%s, %s)
            # bt_b_id is indexed
            self.bindBusinessTag(B, Tag)

    for key in B.getKeys():
        # Check if the business has this key set in the key-value data table
        # SELECT * FROM business_data WHERE bd_b_id=%s AND bd_key=%s
        # bd_b_id is indexed
        if self.businessHasKey(B, key):
            # If the business has the key, update it
            # UPDATE business_data SET bd_value=%s WHERE bd_b_id=%s AND bd_key=%s
            self.setBusinessKey(B, key, B.get(key))
        else:
            # If the business doesn't have the key, insert it
            # INSERT INTO business_data (bd_value, bd_b_id, bd_key) VALUES (%s, %s, %s)
            self.addBusinessKey(B, key, B.get(key))

I am doing a lot of redundant queries when looking up the business ID, but that's usually less than a millisecond, so I'll clean that bit up a bit. I still don't think it will be enough. I'm thinking I may need to create temporary databases and merge them later.
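Those redundant lookups can also be cut with a per-run cache, so each distinct tag name costs at most one SELECT and one INSERT for the whole run. This is a sketch under assumptions: `TagCache` is a hypothetical helper, not MrFish's actual class, and the table/column names follow the queries quoted in the post (sqlite3 is used here only so the example runs standalone).

```python
import sqlite3

# Hypothetical per-run cache mapping tag_name -> tag_id. Once a tag has been
# seen, repeat lookups are served from the dict with no query at all.
class TagCache:
    def __init__(self, conn):
        self.conn = conn
        self.ids = {}  # tag_name -> tag_id

    def get_id(self, tag_name):
        if tag_name in self.ids:          # hot path: no database round trip
            return self.ids[tag_name]
        row = self.conn.execute(
            "SELECT tag_id FROM tags WHERE tag_name = ?", (tag_name,)
        ).fetchone()
        if row is None:
            cur = self.conn.execute(
                "INSERT INTO tags (tag_name) VALUES (?)", (tag_name,)
            )
            tag_id = cur.lastrowid
        else:
            tag_id = row[0]
        self.ids[tag_name] = tag_id
        return tag_id

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tags (tag_id INTEGER PRIMARY KEY, tag_name TEXT UNIQUE)")
cache = TagCache(conn)
first = cache.get_id("pizza")    # one SELECT, then one INSERT
second = cache.get_id("pizza")   # served from the dict, no query
```

The same idea applies to business IDs: resolve the ID once per business and pass it around, instead of re-resolving it inside `businessHasTag` and friends.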
MrFish (Author) Posted December 30, 2013 (edited)

Got it down to an average of 6/s over 100 businesses processed. Here is a list of times taken to process individual businesses. Not very consistent. Oh boy, getting 146.963 queries/s average.

0.00933504104614
0.0192849636078
0.0720210075378
0.0231199264526
0.0527861118317
0.0250749588013
0.0146238803864
0.0366230010986
0.0320770740509
0.0191140174866
0.0448470115662
0.0512340068817
0.028666973114
0.0108671188354
0.0589709281921
0.0108740329742
0.000993013381958
0.0013689994812
0.0907890796661
0.0108981132507

Edited December 30, 2013 by MrFish
davej Posted December 31, 2013

You might take a look at... http://docs.oracle.com/cd/B19306_01/server.102/b14211/part4.htm
justsomeguy Posted January 6, 2014

In general, try to avoid doing queries inside loops. That's one way to have a big impact on performance. You can also give the tables primary or unique indexes, and then use INSERT IGNORE or INSERT ... ON DUPLICATE KEY UPDATE instead of doing a check first and then inserting or updating. If you're doing batch processing, you can also keep lookup data in memory instead of needing to get it from the database each time.
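A sketch of that suggestion, written with Python's sqlite3 module so it runs standalone. SQLite spells the two statements `INSERT OR IGNORE` and `INSERT ... ON CONFLICT ... DO UPDATE` (SQLite 3.24+); the MySQL equivalents named above are `INSERT IGNORE` and `INSERT ... ON DUPLICATE KEY UPDATE`. Table and column names follow the thread; the data values are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tags (tag_id INTEGER PRIMARY KEY, tag_name TEXT UNIQUE)")
conn.execute("""CREATE TABLE business_data (
    bd_b_id INTEGER, bd_key TEXT, bd_value TEXT,
    UNIQUE (bd_b_id, bd_key))""")

# One statement per tag: the UNIQUE index silently rejects duplicates,
# replacing the SELECT-then-INSERT round trip.
# (MySQL: INSERT IGNORE INTO tags ...)
for tag in ["pizza", "mexican", "pizza"]:
    conn.execute("INSERT OR IGNORE INTO tags (tag_name) VALUES (?)", (tag,))

# One statement per key: insert the pair, or update the value if the
# (business, key) pair already exists.
# (MySQL: INSERT ... ON DUPLICATE KEY UPDATE bd_value = VALUES(bd_value))
conn.execute("""
    INSERT INTO business_data (bd_b_id, bd_key, bd_value) VALUES (?, ?, ?)
    ON CONFLICT (bd_b_id, bd_key) DO UPDATE SET bd_value = excluded.bd_value
""", (1, "known web address", "http://example.com"))
```

Either spelling turns two round trips (check, then write) into one, and lets the database's own index do the duplicate detection.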