Sunday, February 26, 2012

Group By clause killing performance

I have recently started working with a new group of people and I find myself doing a lot of reporting. While doing this reporting I have been writing a TON of sql. Some of my queries were not performing up to par and another developer in the shop recommended that I stay away from the "GROUP BY" clause.

Backing away from the "GROUP BY" clause and using "INNER SELECTS" instead as been more effective and some queries have gone from over 1 minute to less that 1 second.

Obviously if it works then it works and there is no arguing that point. My question to the forum is more about gather some opinions so that I can build an opinion of my own.

If I cannot do a reasonable query of a couple of million records using a group by clause what is the problem and what is the best fix?

Is the best fix to remove the "GROUP BY" and write a query that is a little more complex or should I be looking at tuning the database with more indexes and statistics?

I want to make sure that this one point is crystal clear. I am not against following the advice of my coworker and avoiding the "GROUP BY" clause. I am only intersted in listening to a few others talk about why the agree or disagree with my coworked so that I can gain a broader understanding.

It is a combination of few factors.(1) It is imptant to have proper indexes on columns being queried in WHERE and GROUP BY (2) Also see if you can move the GROUP BY to the reporting tool? (3) Is it the GROUP BY thats killing it or something else? Are you using any functions on columns in the WHERE? like some CONVERT(Datecolumn,100) >= '2007/01/01' etc?

|||

It is definetly the group by that is killing it. When the query was rewritten to to remove the group by the execution time dropped through the floor.

from 1 minute to less than 1 second.

Here is a sudo example of what I mean

old query first

SELECT
column1,
column2,
column3,
SUM(something)
FROM
table1
inner join table 2 on 1.columna = 2.columnb
GROUP BY
column1,
column2,
column3

new query

SELECT
column1,
column2,
column3,
(SELECT SUM(something) From sometable) AS 'blah'
FROM
table1
inner join table 2 on 1.columna = 2.columnb

I know that there huge gap between what is really going on and the code above but you get the main idea. moving the sum to a select so that the group by is no longer required. This and this allow drastically reduced the amount of time that it took to get the data. I knew that group by was expensive I just didn't realize how expensive it was.

|||

The group by clause is not in itself a bad performer. There is something else at work but without more detail, I can't tell you what.

The two queries you gave aren't the same thing. The second query doesn't do a sum based on the contents of the current row (No where clause relating the two). Which then of course it runs much faster, it's only executing the sum once, and using it on every row of the outer query.

|||

No mystery. "If I cannot do a reasonable query of a couple of million records..."

Sorting a couple of million records is, well, expensive! That's what a group by does, it sorts. And you don't even have a where clause to limit the answer set.

If you put a clustered index on column1, column2, column3 it will be able to avoid the sort, but you need to look at that carefully since it may have an impact on other queries (and you may already have a clustered index)

|||

True, the second query is unsorted (Not that common to request a set of data and not care about it's sort order), which will obviously be a completely different query plan. I would venture to guess that you don't have a good index on the table either that can/will help you.

Instead of putting a clustered index on the table, if you put an index on column1,column2,column3 and the field you are summing, your query time will drop significantly as well.

Faster yet, would be to use an indexed view.

|||

I am hearing basically what I thought I would hear. GROUP BY equals SORTING, a couple million records is a lot of data, no need to aviod GROUP BY like the plauge, check the indexes and statistics too.

Thanks. There is never a right or wrong answer to this kind of thing, it always depends on the shop and the database.

No comments:

Post a Comment