Friday, March 23, 2012

Grouping by Combinations of Values

Here's a Brain Twister:
I have a list of Customers who buy from a list of Products. I need to
summarize (countdistinct) customers by the combination(s) of products
they buy.
Product Combo Distinct Customer Count
Product A 10
Product B 8
Product C 6
Product A&B 7
Product A&C 5
Product B&C 4
Product A&B&C 1
I'm sure somebody must have had to do this before and it's probably
totally simple, but I'm just not seeing it. It gets even worse in that
there are multiple levels in multiple hierarchies of products, but if I
can get the technique down, I can probably get the rest.
Any ideas will be greatly appreciated..
Thanks!!how many products do you have?|||9 Level 1's with 1-3 Level 2's
12 Level 2's with 1-6 Level 3's
25 Level 3's with 1-15 Level 4's
40 Level 4's with 1-10 Level 5's
61 Level 5's with 1-6 Level 6's
etc...
304 possible 8 level combinations at present, and products can be added
or recategorized pretty regularly.|||I have not done this "cut & paste" in awhile:
=========================
Relational division is one of the eight basic operations in Codd's
relational algebra. The idea is that a divisor table is used to
partition a dividend table and produce a quotient or results table.
The quotient table is made up of those values of one column for which a
second column had all of the values in the divisor.
This is easier to explain with an example. We have a table of pilots
and the planes they can fly (dividend); we have a table of planes in
the hangar (divisor); we want the names of the pilots who can fly every
plane (quotient) in the hangar. To get this result, we divide the
PilotSkills table by the planes in the hangar.
CREATE TABLE PilotSkills
(pilot CHAR(15) NOT NULL,
plane CHAR(15) NOT NULL,
PRIMARY KEY (pilot, plane));
PilotSkills
pilot plane
=========================
'Celko' 'Piper Cub'
'Higgins' 'B-52 Bomber'
'Higgins' 'F-14 Fighter'
'Higgins' 'Piper Cub'
'Jones' 'B-52 Bomber'
'Jones' 'F-14 Fighter'
'Smith' 'B-1 Bomber'
'Smith' 'B-52 Bomber'
'Smith' 'F-14 Fighter'
'Wilson' 'B-1 Bomber'
'Wilson' 'B-52 Bomber'
'Wilson' 'F-14 Fighter'
'Wilson' 'F-17 Fighter'
CREATE TABLE Hangar
(plane CHAR(15) NOT NULL PRIMARY KEY);
Hangar
plane
=============
'B-1 Bomber'
'B-52 Bomber'
'F-14 Fighter'
PilotSkills DIVIDED BY Hangar
pilot
=============================
'Smith'
'Wilson'
In this example, Smith and Wilson are the two pilots who can fly
everything in the hangar. Notice that Higgins and Celko know how to
fly a Piper Cub, but we don't have one right now. In Codd's original
definition of relational division, having more rows than are called for
is not a problem.
The important characteristic of a relational division is that the CROSS
JOIN (Cartesian product) of the divisor and the quotient produces a
valid subset of rows from the dividend. This is where the name comes
from, since the CROSS JOIN acts like a multiplication operator.
Division with a Remainder
There are two kinds of relational division. Division with a remainder
allows the dividend table to have more values than the divisor, which
was Codd's original definition. For example, if a pilot can fly more
planes than just those we have in the hangar, this is fine with us.
The query can be written in SQL-89 as
SELECT DISTINCT pilot
FROM PilotSkills AS PS1
WHERE NOT EXISTS
(SELECT *
FROM Hangar
WHERE NOT EXISTS
(SELECT *
FROM PilotSkills AS PS2
WHERE (PS1.pilot = PS2.pilot)
AND (PS2.plane = Hangar.plane)));
The quickest way to explain what is happening in this query is to
imagine an old World War II movie where a cocky pilot has just walked
into the hangar, looked over the fleet, and announced, "There ain't no
plane in this hangar that I can't fly!" We are finding the pilots for
whom there does not exist a plane in the hangar for which they have no
skills. The use of the NOT EXISTS() predicates is for speed. Most SQL
systems will look up a value in an index rather than scan the whole
table. The SELECT * clause lets the query optimizer choose the column
to use when looking for the index.
This query for relational division was made popular by Chris Date in
his textbooks, but it is not the only method nor always the fastest.
Another version of the division can be written so as to avoid three
levels of nesting. While it is not original with me, I have made it
popular in my books.
SELECT PS1.pilot
FROM PilotSkills AS PS1, Hangar AS H1
WHERE PS1.plane = H1.plane
GROUP BY PS1.pilot
HAVING COUNT(PS1.plane) = (SELECT COUNT(plane) FROM Hangar);
There is a serious difference in the two methods. Burn down the
hangar, so that the divisor is empty. Because of the NOT EXISTS()
predicates in Date's query, all pilots are returned from a division by
an empty set. Because of the COUNT() functions in my query, no pilots
are returned from a division by an empty set.
In the sixth edition of his book, INTRODUCTION TO DATABASE SYSTEMS
(Addison-Wesley; 1995 ;ISBN 0-201-82458-2), Chris Date defined another
operator (DIVIDEBY ... PER) which produces the same results as my
query, but with more complexity.
Exact Division
The second kind of relational division is exact relational division.
The dividend table must match exactly to the values of the divisor
without any extra values.
SELECT PS1.pilot
FROM PilotSkills AS PS1
LEFT OUTER JOIN
Hangar AS H1
ON PS1.plane = H1.plane
GROUP BY PS1.pilot
HAVING COUNT(PS1.plane) = (SELECT COUNT(plane) FROM Hangar)
AND COUNT(H1.plane) = (SELECT COUNT(plane) FROM Hangar);
This says that a pilot must have the same number of certificates as
there planes in the hangar and these certificates all match to a plane
in the hangar, not something else. The "something else" is shown by a
created NULL from the LEFT OUTER JOIN.
Please do not make the mistake of trying to reduce the HAVING clause
with a little algebra to:
HAVING COUNT(PS1.plane) = COUNT(H1.plane)
because it does not work; it will tell you that the hangar has (n)
planes in it and the pilot is certified for (n) planes, but not that
those two sets of planes are equal to each other.
Note on Performance
The nested EXISTS() predicates version of relational division was made
popular by Chris Date's textbooks, while the author is associated with
popularizing the COUNT(*) version of relational division. The Winter
1996 edition of DB2 ON-LINE MAGAZINE
(http://www.db2mag.com/96011ar:htm) had an article entitled "Powerful
SQL:Beyond the Basics" by Sheryl Larsen which gave the results of
testing both methods. Her conclusion for DB2 was that the nested
EXISTS() version is better when the quotient has less than 25% of the
dividend table's rows and the COUNT(*) version is better when the
quotient is more than 25% of the dividend table.|||Try this link:
http://www.windowsitpro.com/Article...20169.html?Ad=1
Although its a simple request i dont think its a 2minute job.
Ian
<datagal@.msn.com> wrote in message
news:1126559995.257988.29900@.g44g2000cwa.googlegroups.com...
> Here's a Brain Twister:
> I have a list of Customers who buy from a list of Products. I need to
> summarize (countdistinct) customers by the combination(s) of products
> they buy.
> link
> Product Combo Distinct Customer Count
> Product A 10
> Product B 8
> Product C 6
> Product A&B 7
> Product A&C 5
> Product B&C 4
> Product A&B&C 1
> I'm sure somebody must have had to do this before and it's probably
> totally simple, but I'm just not seeing it. It gets even worse in that
> there are multiple levels in multiple hierarchies of products, but if I
> can get the technique down, I can probably get the rest.
> Any ideas will be greatly appreciated..
> Thanks!!
>|||Still not getting where I need to go....
Maybe I should clarify that my 8 levels of product categories (classes,
whatever...) and all adjacencies in a single table, making the tough
part my problem the gouping by multiple, concatinated values at each
level.
Jody|||Just modify the code and change the Baskets (Hangar) in the example to
include a basket name. Here is the painful details and code:
DROP TABLE Sales;
CREATE TABLE Sales
(customer CHAR(15) NOT NULL,
item_code CHAR(15) NOT NULL,
PRIMARY KEY (customer, item_code));
INSERT INTO Sales VALUES ('Celko', 'Piper Cub');
INSERT INTO Sales VALUES ('Higgins', 'B-52 Bomber');
INSERT INTO Sales VALUES ('Higgins', 'F-14 Fighter');
INSERT INTO Sales VALUES ('Higgins', 'Piper Cub');
INSERT INTO Sales VALUES ('Jones', 'B-52 Bomber');
INSERT INTO Sales VALUES ('Jones', 'F-14 Fighter');
INSERT INTO Sales VALUES ('Smith', 'B-1 Bomber');
INSERT INTO Sales VALUES ('Smith', 'B-52 Bomber');
INSERT INTO Sales VALUES ('Smith', 'F-14 Fighter');
INSERT INTO Sales VALUES ('Motts', 'B-1 Bomber');
INSERT INTO Sales VALUES ('Motts', 'B-52 Bomber');
INSERT INTO Sales VALUES ('Motts', 'F-14 Fighter');
INSERT INTO Sales VALUES ('Wilson', 'B-1 Bomber');
INSERT INTO Sales VALUES ('Wilson', 'B-52 Bomber');
INSERT INTO Sales VALUES ('Wilson', 'F-14 Fighter');
INSERT INTO Sales VALUES ('Wilson', 'F-17 Fighter');
CREATE TABLE Baskets
(item_code CHAR(15) NOT NULL,
basket_name CHAR(15)NOT NULL,
PRIMARY KEY (item_code, basket_name));
INSERT INTO Baskets VALUES ('B-1 Bomber', '3 planes');
INSERT INTO Baskets VALUES ('B-52 Bomber', '3 planes');
INSERT INTO Baskets VALUES ('F-14 Fighter', '3 planes');
INSERT INTO Baskets VALUES ('B-1 Bomber', '2 bombers');
INSERT INTO Baskets VALUES ('B-52 Bomber', '2 bombers');
INSERT INTO Baskets VALUES ('Piper Cub', 'wimp');
CREATE PROCEDURE ReportBasket (@.my_basket CHAR(15))
AS
SELECT @.my_basket, COUNT (DISTINCT cust)
FROM (
SELECT S1.customer
FROM Sales AS S1
LEFT OUTER JOIN
(SELECT item_code FROM Baskets WHERE basket_name =
@.my_basket)AS B1
ON S1.item_code = B1.item_code
GROUP BY S1.customer
HAVING COUNT(S1.item_code) = (SELECT COUNT(item_code)
FROM Baskets AS B2 WHERE basket_name
= @.my_basket)
AND COUNT(B1.item_code) = (SELECT COUNT(item_code)
FROM Baskets AS B2 WHERE basket_name =
@.my_basket)
) AS X(cust);
You can further modify this procedure with a derived table of the
basket names instead of a parameter and get it all at once.|||I know purists will be mad at this solution, as it uses proprietary
tricks, but anyway:
create table sales(customer_id int, product varchar(3))
insert into sales values(1, 'a')
insert into sales values(1, 'b')
insert into sales values(1, 'c')
insert into sales values(2, 'a')
insert into sales values(2, 'b')
insert into sales values(2, 'c')
insert into sales values(3, 'a')
insert into sales values(3, 'b')
insert into sales values(3, 'c')
insert into sales values(4, 'a')
insert into sales values(4, 'b')
insert into sales values(5, 'b')
insert into sales values(5, 'c')
insert into sales values(6, 'b')
insert into sales values(6, 'c')
insert into sales values(7, 'a')
insert into sales values(8, 'b')
insert into sales values(9, 'c')
insert into sales values(10, 'a')
insert into sales values(11, 'b')
go
create function customer_products(@.customer_id int)
returns varchar(100)
as
begin
declare @.ret varchar(100)
set @.ret = ''
select @.ret = @.ret + ',' + product
from sales where customer_id = @.customer_id
order by product
return @.ret
end
go
-- testing the function
select distinct customer_id, dbo.customer_products(customer_id)
product_list
from sales
go
customer_id product_list
----
--
1 ,a,b,c
2 ,a,b,c
3 ,a,b,c
4 ,a,b
5 ,b,c
6 ,b,c
7 ,a
8 ,b
9 ,c
10 ,a
11 ,b
-- what I think you need
select product_list, count(*)
from(select distinct customer_id, dbo.customer_products(customer_id)
product_list
from sales) t
group by product_list
product_list
-- --
,a 2
,a,b 1
,a,b,c 3
,b 2
,b,c 2
,c 1
(6 row(s) affected)
go
drop function customer_products
drop table sales|||You should probably add OPTION(MAXDOP 1) to your select statement.
"Alexander Kuznetsov" <AK_TIREDOFSPAM@.hotmail.COM> wrote in message
news:1126792522.609984.131080@.g14g2000cwa.googlegroups.com...
> I know purists will be mad at this solution, as it uses proprietary
> tricks, but anyway:
> create table sales(customer_id int, product varchar(3))
> insert into sales values(1, 'a')
> insert into sales values(1, 'b')
> insert into sales values(1, 'c')
> insert into sales values(2, 'a')
> insert into sales values(2, 'b')
> insert into sales values(2, 'c')
> insert into sales values(3, 'a')
> insert into sales values(3, 'b')
> insert into sales values(3, 'c')
> insert into sales values(4, 'a')
> insert into sales values(4, 'b')
> insert into sales values(5, 'b')
> insert into sales values(5, 'c')
> insert into sales values(6, 'b')
> insert into sales values(6, 'c')
> insert into sales values(7, 'a')
> insert into sales values(8, 'b')
> insert into sales values(9, 'c')
> insert into sales values(10, 'a')
> insert into sales values(11, 'b')
> go
> create function customer_products(@.customer_id int)
> returns varchar(100)
> as
> begin
> declare @.ret varchar(100)
> set @.ret = ''
> select @.ret = @.ret + ',' + product
> from sales where customer_id = @.customer_id
> order by product
> return @.ret
> end
> go
> -- testing the function
> select distinct customer_id, dbo.customer_products(customer_id)
> product_list
> from sales
> go
> customer_id product_list
> --
> ----
--
> 1 ,a,b,c
> 2 ,a,b,c
> 3 ,a,b,c
> 4 ,a,b
> 5 ,b,c
> 6 ,b,c
> 7 ,a
> 8 ,b
> 9 ,c
> 10 ,a
> 11 ,b
> -- what I think you need
> select product_list, count(*)
> from(select distinct customer_id, dbo.customer_products(customer_id)
> product_list
> from sales) t
> group by product_list
> product_list
> -- --
> ,a 2
> ,a,b 1
> ,a,b,c 3
> ,b 2
> ,b,c 2
> ,c 1
> (6 row(s) affected)
>
> go
> drop function customer_products
> drop table sales
>|||I think this might be it!!! I'm going to play with it and see what
happens.
Thank you everybody!
Jodysql

No comments:

Post a Comment