How can I speed up my databases? Much too slow!


MrFish

I have a database of ~100,000 local businesses in my area, plus a few scrapers to pull in more businesses from other locations. I've spent the last few days creating a client-server application in Python so I can scale up the number of scrapers and have all the scraped data feed into one machine. The client scrapes sites and the server records the data. With 100,000 businesses, the server can't keep up with even one client's scraping!

 

When adding a business, here's what's involved:

 

The business name and address are checked against the database to prevent duplicates. If no match is found, the business is added to the current database of 100,000 businesses. Then the tags found for that business (e.g. "pizza", "mexican", "chinese") are checked against a tags table to see if they exist; if they don't, they are added. A tag-to-business association table then records the business ID and the tag ID. Lastly, a key-value table for random useful information like "known web address" is queried to see if the keys being added already exist: if they do, the values are updated; if they don't, the key-value pairs are inserted.
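For reference, the four tables that workflow implies might look something like this. This is only a sketch using SQLite for illustration; the actual table and column names (beyond the ones quoted later in the thread) are assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE businesses (
    b_id    INTEGER PRIMARY KEY,
    b_name  TEXT NOT NULL,
    b_addr  TEXT NOT NULL,
    UNIQUE (b_name, b_addr)          -- name + address is the duplicate check
);
CREATE TABLE tags (
    tag_id   INTEGER PRIMARY KEY,
    tag_name TEXT NOT NULL UNIQUE    -- one row per distinct tag
);
CREATE TABLE business_tags (         -- tag-to-business association
    bt_b_id   INTEGER REFERENCES businesses(b_id),
    bt_tag_id INTEGER REFERENCES tags(tag_id),
    PRIMARY KEY (bt_b_id, bt_tag_id)
);
CREATE TABLE business_data (         -- key-value info, e.g. "known web address"
    bd_b_id  INTEGER REFERENCES businesses(b_id),
    bd_key   TEXT NOT NULL,
    bd_value TEXT,
    PRIMARY KEY (bd_b_id, bd_key)
);
""")
```

The UNIQUE and composite PRIMARY KEY constraints matter later: they let the database itself reject duplicates instead of the application checking first.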

 

There are a lot of string searches going on in this, so I can see why it would slow down so much. I've considered indexing all of the fields that will be searched, but I don't know if that will slow down insert queries.

 

One thing I've considered is to just add all this information without checking and run a cleanup process later to remove duplicate data. Not sure if that would help.
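If you do defer deduplication, the cleanup pass can be a single statement that keeps the lowest ID for each (name, address) pair. A sketch in SQLite, with illustrative column names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE businesses (b_id INTEGER PRIMARY KEY, b_name TEXT, b_addr TEXT)"
)
rows = [
    ("Joe's Pizza", "1 Main St"),
    ("Joe's Pizza", "1 Main St"),   # duplicate inserted without checking
    ("Taco Hut", "2 Elm St"),
]
conn.executemany("INSERT INTO businesses (b_name, b_addr) VALUES (?, ?)", rows)

# Cleanup pass: keep the earliest row for each (name, address) pair.
conn.execute("""
    DELETE FROM businesses
    WHERE b_id NOT IN (
        SELECT MIN(b_id) FROM businesses GROUP BY b_name, b_addr
    )
""")
count = conn.execute("SELECT COUNT(*) FROM businesses").fetchone()[0]
print(count)  # 2
```

The trade-off is that the tag and key-value tables would also need re-pointing at the surviving IDs, so a unique constraint that prevents the duplicate in the first place is often simpler.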

 

Any advice on speeding up my database? I'd like a dozen or so scrapers, and not being able to keep up with just one is bad news.


Well, I've done some indexing and it's helped a lot, but it's still falling behind. Here's the code that runs the queries, with the query called at each step shown as a comment.

		# Each tag is a string value
		for Tag in tags:
			# It first checks if the tag exists in the database.
			# SELECT * FROM tags WHERE tag_name=%s
			# tag_name is indexed
			if self.tagExists(Tag) == False:
				# If not it will add the tag
				# INSERT INTO tags (tag_name) VALUES(%s)
				self.addTag(Tag)
			# Then it will check if the business has the tag associated to it
			# SELECT * FROM business_tags WHERE bt_b_id=%s AND bt_tag_id=%s
			# This method also searches for the business ID and tag ID given the Business and Tag objects
			#	This can probably be trimmed down
			if self.businessHasTag(B, Tag) == False:
				# If the business doesn't have the tag it will associate it
				# INSERT INTO business_tags (bt_b_id, bt_tag_id) VALUES(%s, %s)
				# bt_b_id is indexed
				self.bindBusinessTag(B, Tag)

		for key in B.getKeys():
			# Checks if the business has this key set in the key-value data table
			# SELECT * FROM business_data WHERE bd_b_id=%s AND bd_key=%s
			# bd_b_id is indexed
			if self.businessHasKey(B, key):
				# If the business has the key it will update it
				# UPDATE business_data SET bd_value=%s WHERE bd_b_id=%s AND bd_key=%s
				self.setBusinessKey(B, key, B.get(key))
			else:
				# If the business doesn't have the key it will insert it
				# INSERT INTO business_data (bd_value, bd_b_id, bd_key) VALUES (%s, %s, %s)
				self.addBusinessKey(B, key, B.get(key))

I am doing a lot of redundant queries when finding the business ID, but each one usually takes less than a millisecond. I'll clean that bit up. I still don't think it will be enough; I'm thinking I may need to create temporary databases and merge them later.


Got it down to an average of 6 s per 100 businesses processed.

 

Here is a list of times (in seconds) taken to process individual businesses. Not very consistent.

 

Oh boy, now getting 146.963 queries/s on average.

0.00933504104614
0.0192849636078
0.0720210075378
0.0231199264526
0.0527861118317
0.0250749588013
0.0146238803864
0.0366230010986
0.0320770740509
0.0191140174866
0.0448470115662
0.0512340068817
0.028666973114
0.0108671188354
0.0589709281921
0.0108740329742
0.000993013381958
0.0013689994812
0.0907890796661
0.0108981132507

In general, try to avoid doing queries inside loops; that's one way to have a big impact on performance. You can also give the tables primary or unique indexes, and then use INSERT IGNORE or INSERT ... ON DUPLICATE KEY UPDATE instead of doing a check first and then inserting or updating. If you're doing batch processing, you can also keep lookup data in memory instead of fetching it from the database each time.
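Applied to the tag/key-value loops posted above, that advice might look like the sketch below. It uses SQLite syntax (INSERT OR IGNORE, ON CONFLICT ... DO UPDATE) so it's self-contained and runnable; on MySQL the equivalents are INSERT IGNORE and INSERT ... ON DUPLICATE KEY UPDATE. The `record_business` helper and the exact schema are assumptions for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tags (tag_id INTEGER PRIMARY KEY, tag_name TEXT UNIQUE);
CREATE TABLE business_tags (bt_b_id INTEGER, bt_tag_id INTEGER,
                            PRIMARY KEY (bt_b_id, bt_tag_id));
CREATE TABLE business_data (bd_b_id INTEGER, bd_key TEXT, bd_value TEXT,
                            PRIMARY KEY (bd_b_id, bd_key));
""")

def record_business(conn, b_id, tags, data):
    # One batched statement per table; the unique constraints do the
    # duplicate checking, so no SELECT-per-row is needed.
    conn.executemany("INSERT OR IGNORE INTO tags (tag_name) VALUES (?)",
                     [(t,) for t in tags])
    placeholders = ",".join("?" * len(tags))
    tag_ids = [r[0] for r in conn.execute(
        "SELECT tag_id FROM tags WHERE tag_name IN (%s)" % placeholders, tags)]
    conn.executemany("INSERT OR IGNORE INTO business_tags VALUES (?, ?)",
                     [(b_id, t) for t in tag_ids])
    # Upsert replaces the SELECT-then-UPDATE-or-INSERT dance entirely.
    conn.executemany("""INSERT INTO business_data VALUES (?, ?, ?)
                        ON CONFLICT (bd_b_id, bd_key)
                        DO UPDATE SET bd_value = excluded.bd_value""",
                     [(b_id, k, v) for k, v in data.items()])

record_business(conn, 1, ["pizza", "mexican"], {"known web address": "example.com"})
# Re-processing the same business updates values without creating duplicates.
record_business(conn, 1, ["pizza"], {"known web address": "example.org"})
```

This cuts the per-tag round trips from three (exists? / insert / associate) down to two batched statements for any number of tags, and the key-value loop down to a single batched upsert.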

