Jump to content

Help with sorting TONS of data


spoonraker

Recommended Posts

Well I'm not going to post exact source code because it's proprietary and would open up security holes on my website, plus it's 6 or 7 files and tens of thousands of lines of code, so I'll just describe my problemBasically what I have is a page where it lists a table of people, and various pieces of data (city/state/name/id/whatever...). There are different layouts that display different data, and you can filter and sort by just about everything displayed on every layout.The way it works is a large SQL query is built depending on what needs to be displayed/filtered/sorted.After that huge query is built it is passed onto a custom tag, which executes the query, and then loops through the result set to display the data.Now normally if I wanted to sort by something I would simply add an ORDER BY clause onto my query and then the already ordered result set would be sent to the custom tag and all would be well in the universe, however in one case, I can't do that.The problem is with the person's "rank". The rank its self isn't explicitly stored in the database, only several criteria which are updated monthly, which are used to calculate the rank. No...I can not just add rank to the database, I don't care to explain it more than that, but I can't.I can call a getRank() method on an ID to get the rank. btw the ranks are Strings like A, EA, BTL, STL, GTL, PTL.So...does anybody have any ideas on how I could sort by rank? I was thinking something along the lines of creating an array with ID/Rank in it, then sorting that, and then querying all the data somehow...blah I'm burned out, help please :):)

Link to comment
Share on other sites

like i said rank isnt in the db and therefore not in the result set, i cant simply use an order by statement.i can easily get the rank for a specific id using the data from the result set, but i need to somehow sort my result set by ranki can put the result set in an array or whatever needs to be done. what i need to do is select a bunch of data, manually add another piece of data to each row in that result set, and then sort all of the rows by the new data i just added

Link to comment
Share on other sites

like i said rank isnt in the db and therefore not in the result set, i cant simply use an order by statement.
You also said
I can call a getRank() method on an ID to get the rank
Not sure what kind of method you have in mind but if you can implement this logic in a database function getRank(), then you can return rank in the resultset (and therefore sort on it).
select col3, myfuncs.func1(col3) mycol from table1order by mycol

Link to comment
Share on other sites

Oh really? I'm intriguedI can just throw a java function in with an sql query like that?...will test today and report back later
No, not a java function, a database function. A user-defined function you define in the database.SQL Server: http://msdn2.microsoft.com/en-us/library/ms191007.aspxMySQL: http://dev.mysql.com/doc/refman/5.0/en/adding-functions.html
Link to comment
Share on other sites

It's a java function that gets the rank, not a MySQL function, and I can't write it just using an MySQL function.Anyway, I'm rewriting this entire program from scratch instead of continuing to modify it. It's really slow and inefficient right now, and adding on to these already ridiculously huge queries isn't going to help anything.

Link to comment
Share on other sites

It's a java function that gets the rank, not a MySQL function, and I can't write it just using an MySQL function.
Why not? SQL is a fairly flexible language (unless you need to get data from external sources).
Link to comment
Share on other sites

Why not? SQL is a fairly flexible language (unless you need to get data from external sources).
Because it would require a ton of extra manipulation, and even if it would work, it would slow the query down immensely, and it's already ungodly slow.All you do is pass the getRank() function the ID and it pulls a bunch more data in from several more tables, compares them, some of them are strings and require manipulation and testing, and it gives you your rank.Anyway, like I said, after working on this program and trying to add more to somebody else's work I realize how horribly inefficient it was. He was selecting almost 40 different pieces of data, from 15+ tables, all using left joins, and filtering/sorting, all in one single huge query. Needless to say it was insanely slow.What I'm going to do is create a new object for this program, which has get methods for each piece of data, and it uses a separate very small query for each table data needs to be pulled from.My "big" query is only going to need to select ID, and even with 240,000+ IDs to select (the person with the most in the DB) it only takes a few seconds to run the query. Then I'm going to make a vector of objects which take that ID as a parameter in the Constructor.All filtering and sorting will be done by the java server instead of the SQL server, so unnecessary searching through tons of extra data won't be done, and the whole thing should be loads faster.
Link to comment
Share on other sites

My "big" query is only going to need to select ID, and even with 240,000+ IDs to select (the person with the most in the DB) it only takes a few seconds to run the query.
It sounds like you need to learn about indexes. Some of the tables in our database have, literally, hundreds of millions of records and, because we have well designed indexes set on these tables, queries take milliseconds to execute.
Link to comment
Share on other sites

Anybody have some ideas for the best sorting methods? Know of any useful custom libraries? I'll be dealing with all types of data, ints, Strings, and dates.
The database still gets my vote. If you pull all the data to another server for filtering and sorting, performance is going to suffer, not improve. So you should avoid doing that unless it's completely unavoidable. An it doesn't sound as though it is unavoidable:
All you do is pass the getRank() function the ID and it pulls a bunch more data in from several more tables, compares them, some of them are strings and require manipulation and testing, and it gives you your rank.
So my recommendation remains the same: implement getRank() in the database, and do the sort in the query.
Link to comment
Share on other sites

I don't think you realize how this program worked before, it was incredibly inefficient.There are different formats for the report, each one displays different data. The way the program worked before is that it selected every piece of data, for every affiliate, despite the format. And it ran that query every time you changed a filter/sort/format or went to a new page on the list.Here is what the query looked like when I stopped continuing on in this mannerselect t1.*, t2.generation ,t5.currsvp as a_currsvp ,t5.prevsvp as a_prevsvp ,t5.ss_completed,t5.ssdate,t6.weight,t6.referralstl,t6.referrals3 ,t7.ea, t7.tl, t7.tlrank as rank ,t8.focus_id, t9.lastlogin ,t10.creationdate from affiliate_index t2, affiliates t1 left join affilaux t5 on t1.sfiid=t5.sfiid left join matrix.matrix_weight t6 on t1.sfiid=t6.sfiid left join sfi.tnet_rank t7 on t1.sfiid = t7.sfiid left join focus_map t8 on t1.sfiid = t8.sfiid left join lastlogin t9 on t1.sfiid = t9.sfiid left join affiliate_profile t10 on t1.sfiid = t10.sfiid where t2.refby=21505 and class in ('A','M','P') and t1.sfiid = t2.sfiid and t2.generation = 1 order by sfiid ASC limit 0,50and even with that much, I would have had to add on about another 5 joins to finish up adding the data I was supposed toWhat I'm going to do now is run a single query which only gets a list of IDs. Then I will pass that ID to a custom object which pulls all the other data out for that affiliate. Then I will create a vector of these objects. That way I only need to query the database once per person, and instead of having a billion joins and searching dozens of tables at a time, all the data pulling will be split up into smaller faster queries.Also, here is how we calculate Rank currently

 package com.sfimg.userroles;// ** Imports ** ===============================================================//Java SQL Importsimport java.sql.Connection;import java.sql.PreparedStatement;import java.sql.ResultSet;//SFI Affiliate Importsimport com.sfi.affiliate.Affiliate;import com.sfi.matrix.Matrix;/** * UserRoles.java * * Created on October 2, 2003, 1:32 PM * @author Dana */ /** * This class looks up and holds userroles to be used by sfimg.com website. * This class is used by every page (by the menu) at sfimg.com. */// ** Publicly Accessible Class ** =============================================public class UserRoles extends com.sfi.util.DAO {	// ** Initialize Class Variables ** ========================================	public static final int NONE = 0;	// non confirmed affiliate.	public static final int NCA = 1;	// confirmed affiliate.	public static final int AFFILIATE = 1;	// smartstart grad	public static final int SMARTSTART = 2;	// EAP, free ea program	public static final int EAP = 3;	// EA qualified	public static final int MIQ = 3;	// EA qualified	public static final int EA = 3;	// bronze team leader.	public static final int BTL = 4;	// Silver team leader	public static final int STL = 4;	// Gold team leader	public static final int GTL = 4;	// Platinum team leader	public static final int PTL = 4;	// ~general Team leader designation	public static final int TL = 4;	//public static final int GL = 5;	public static final String[] roles	= {"None","Affiliate","Smart Start Graduate","Executive Affiliate","Team Leader","Group Leader"};/*		if you bump numbers (i.e.) nca 1, aff 2, ss 3, miq 4  make sure you change the 'roles' string 	returned to the corresponding int above	public static final String[] roles	= {"None","Affiliate","Affiliate","Smart Start Graduate","Executive Affiliate","Executive Affiliate","Team Leader","Team Leader"};*/	private static String CHECK_IF_TL	= "select * from matrix.rank_mapping where rank > " + Matrix.RANK_EA + " and sfiid = ?";	private static String CHECK_IF_MIQ	= "select svp+bonus_svp+action_svp+promo_svp as currsvp from currsvp where sfiid = ?";// need to add a timeline to this ... like date = to this month or something.	private static String CHECK_IF_EAP	= "select * from sfi.transactions where sfiid=? and catid=4713 and paydate > '2007-01-31' and refund='N'";	private static String CHECK_IF_PAST_MIQ	= "select svp+bonus_svp+action_svp+promo_svp as prevsvp from prevsvp where sfiid = ?";	private static String CHECK_SO	= "select * from so.souser where sfiid = ? and active = 'Y'";	private static String CHECK_IF_SMARTSTART	= "select ss_completed from sfi.affilaux where sfiid = ? and ss_completed='Y'";	private static String CHECK_IF_AFFILIATE	= "select * from sfi.affiliates where sfiid = ? and class in ('A','M','P')";	//private static String CHECK_IF_GL = "select * from sfi.powerlines where groupldr = ?";	//private static String CHECK_IF_PTL = "select * from sfi.powerlines where owner = ?";	// ** Initialize Instance Variables ** =====================================		private int role = 0; 	// ** Class Constructor Methods ** =========================================	public UserRoles() {}	// ** Class Accessor Methods ** ============================================	public int getRole() {  return role; } 	// ** Class Mutator Methods ** =============================================	public void setRole(int role) { this.role = role; }	// ** Private Class Methods ** =============================================	// ** No Private Class Methods **	// ** Public Class Methods ** ==============================================/**  * main logic here, start by checking gl(4) -> ptl(4) -> miq(3) -> sscompleted(2) then affiliate(1) * if none of the previous return NONE (0) as their status, which means they are * not logged in, or don't meet any of the other 'status' queries. */	public int findRole(int sfiid) {		Connection c = null;		PreparedStatement ps = null;		ResultSet rs = null;		if (sfiid == 0) { 			return 0;		} else {			try {				try {					if (Affiliate.isTl(sfiid) >= 20) {						setRole(TL);						return getRole();										}				} catch (Exception e5) {					e5.printStackTrace();				}			   try {					if (Affiliate.isEa(sfiid)) {							setRole(EA);							return getRole();					}				} catch (Exception e3) {					e3.printStackTrace();					}			   try {					c = getDataSource();						ps = c.prepareStatement(CHECK_IF_PAST_MIQ);					ps.setInt(1,sfiid);					rs = ps.executeQuery();					if (rs.next()) {						if (rs.getDouble("prevsvp") >= 10) { 					//  -- Removed check for SO to allow all previous MIQ's access					//		Connection oraconn = getDataSource("jdbc/oracle");					//		System.out.println("got oraconn conn");					//		PreparedStatement ps2 = oraconn.prepareStatement(CHECK_SO);					//		ps2.setInt(1,sfiid);					//		ResultSet rs2 = ps2.executeQuery();					//		if (rs2.next()) {					//			setRole(3);					//		}							setRole(EA);							rs.close();							ps.close();					//		rs2.close();					//		ps2.close();					//		oraconn.close();														if (getRole() > 0) {								c.close();								return getRole();							}						}					}					rs.close();					ps.close();				} catch (Exception e3) {										e3.printStackTrace();						System.out.println("CHECK PREVSVP IN USERROLES threw exception");				}			   try {					ps = c.prepareStatement(CHECK_IF_EAP);					ps.setInt(1,sfiid);					rs = ps.executeQuery();					if (rs.next()) {						  setRole(EAP);					   	c.close();					   	ps.close();					   	rs.close();					   	return getRole();					}					rs.close();					ps.close();				} catch (Exception e3) {					e3.printStackTrace();					}				//try {					//ps = c.prepareStatement(CHECK_IF_SMARTSTART);					///ps.setInt(1,sfiid);					//rs = ps.executeQuery();					//if (rs.next()) {					//	c.close();					//	ps.close();					//	rs.close();					 //   setRole(SMARTSTART);					//	return getRole();					//}					//rs.close();					//ps.close();				//} catch (Exception e2) {				//	e2.printStackTrace();					//}						try {					ps = c.prepareStatement(CHECK_IF_AFFILIATE);					ps.setInt(1,sfiid);					rs = ps.executeQuery();					if (rs.next()) {						c.close();						ps.close();						rs.close();						setRole(AFFILIATE);						return getRole();					}					rs.close();					ps.close();				} catch (Exception e1) {					e1.printStackTrace();					}				// if we passed through all checks, close our connection				c.close();			} catch (Exception e) {					System.out.println("UserRoles.java CHECK c must have blown up");				e.printStackTrace();			}		}		// if we passed through all checks, setRole to 0, and return 0;	   // setRole(0);		return getRole();	}}

If you know how to implement that in an SQL query, be my guest.

Link to comment
Share on other sites

the point is, even if it is possible to do it, the query would ludicrously slow. It's already incredibly slow just from the shear amount of data that it's sorting through, I don't even want to imagine how bad it would be with 10 more joins and a bunch of comparison
Actually, I think the point is: certain data and certain logic is required to calculate rank. The data is all in the database. The logic could be implemented right there in the database too. Any optimization strategy you have in mind must be possible there too. Clearly you have strong feelings against doing this in the database (and we technical types do tend to produce strong feelings, don't we!) but, with all due respect, you have not produced any technical justification for doing this work on another server. So however impressive your code is on that server, with constructors taking vectors of ids and so forth, you will still, I think, produce an inferior performance compared to doing it in the database. And as well as increasing number of database calls, and increasing network traffic, you will also introduce the unneccessary complexity of integration with 3rd-party/custom sorting code, with consequent maintenance implications, when the database already offers what you need out of the box.So I urge you to reconsider! And unless you can in fact produce concrete technical justification for your current approach, showing why it will help performance, my own recommendation remains the same: implement getRank() in the database, and do the sort in the query.
Link to comment
Share on other sites

I think you guys are missing the point. I have absolutely no problem with pulling the rank using only SQL queries. The reason I originally asked if I could pull the rank in an SQL query was so I could implement that in the original method.The reason that I'm fundamentally changing the program isn't because I couldn't figure out how to pull rank out of the database, but because the way the program worked before was incredibly inefficient.There was going to have to be over twenty joins on the query, each of those tables having at minimum hundreds of thousands of rows, some even with millions. Not only that, but each time you changed a filter or sort, it would have to rebuild the query with different WHERE clauses and ORDER BY criteria, which is convenient for the programmer since all that logic is built in to SQL, but it makes the program terribly slow. The query took over 10 full seconds to run originally, and I was doing well more than tripling the amount of data with my modifications. The SQL tables are all properly indexed and made to run as efficiently as possible, so please don't try to tell me that I just don't know how to run SQL queries. There simply is no way to query that much data quickly.Did you guys not see this? This is what the SQL query looked like BEFORE I added anything. There would be 15+ more joins. By the time I added 2 or 3 more it was taking an ugodly amount of time to run, which is why I knew I had a problem. I posted the thing about the rank before I realized this.

select t1.*, t2.generation ,t5.currsvp as a_currsvp ,t5.prevsvp as a_prevsvp ,t5.ss_completed,t5.ssdate,t6.weight,t6.referralstl,t6.referrals3 ,t7.ea, t7.tl, t7.tlrank as rank ,t8.focus_id, t9.lastlogin ,t10.creationdate from affiliate_index t2, affiliates t1 left join affilaux t5 on t1.sfiid=t5.sfiid left join matrix.matrix_weight t6 on t1.sfiid=t6.sfiid left join sfi.tnet_rank t7 on t1.sfiid = t7.sfiid left join focus_map t8 on t1.sfiid = t8.sfiid left join lastlogin t9 on t1.sfiid = t9.sfiid left join affiliate_profile t10 on t1.sfiid = t10.sfiid where t2.refby=21505 and class in ('A','M','P') and t1.sfiid = t2.sfiid and t2.generation = 1 order by sfiid ASC

The way I'm solving this problem is quite simple. I'm creating an object that pulls the data from the DB using a bunch of small queries instead of one massive one, this is MUCH faster, and then I'm simply creating some kind of list (probably ArrayList) of those objects. All sorting and filtering will be done by the java server without having to query the database again.Sure, it's going to be harder for me to work with the data this way for sorting and filtering, but the end result should be a much faster operating page with far fewer errors.

Link to comment
Share on other sites

I don't know about Java, but in C#, which is based, partly, on Java, you can create classes that implement an IComparable interface which you can use to create custom sorting classes. Then, in your collection, like an ArrayList, you can call the Sort method and pass an instance of your IComparable class as a parameter and it will take care of all your sorting.However, at the risk of beating a dead horse here, databases are designed specifically to deal with the organizing and sorting of data. You would be reinventing the wheel if you decided to do this data work outside of the database.

Link to comment
Share on other sites

However, at the risk of beating a dead horse here, databases are designed specifically to deal with the organizing and sorting of data. You would be reinventing the wheel if you decided to do this data work outside of the database.
Databases are designed for storing and retrieving data, not sorting it. They can retrieve a sorted list of data, but they can't sort data without retrieving it all again. This is why the program was so slow before. In order to do a simple sort or filter, you had to pull ALL the data out again.How can any of you even attempt to argue that pulling the data once and sorting it with Java is slower than running a 30+ second query every time you change anything on the page.
Link to comment
Share on other sites

For those of you who were doubting my method, I've finally got it into a testing phase, and it's 1000000x faster than it was before.I made a few tweaks to greatly improve performance. Instead of creating a huge list of objects, I just made a huge hashtable of IDs and sort values. Whatever data was being sorted by was the key, and the ID was the value. Then I made a vector of keys, and sorted that, and created an Iterator for that vector to loop through my hashtable and pull out a sorted list of IDs. Then, instead of creating all the objects at once, I made it so it only does 50 at a time, since that's all that will be displayed on a single page. The controller object persists in the session so when they change pages I just had to make a simple function which iterates page*50 through the sorted list and then generateds 50 more objects.

Link to comment
Share on other sites

Archived

This topic is now archived and is closed to further replies.

×
×
  • Create New...