Firehose mode: group Number for millions of records.

I am using the following query which I found from a fragment of code
by itzik ben-gan to assign a common group id for group of records in my
case which have similar SSN and first Name and Last Name. if the SSN is
the same it should also check the first name and last name of the
record. Because records have more than three AKA names; I need to check
all the possibilities of first name last name combination to verify the
records are the same.
This code works fine and can assign group numbers for all the rows.
I am trying this code on a database of 65,000 rows. It's taking around
20 minute to complete. but I'll have to run the same code on
800,000,000 rows.
It will take years to finish.
Even if the query is optimized to run in 1 second for 65,000 rows, it
will take more than 4 hours to run on the 800,000,000 row. This where I
realized I am in the "wrong jungle".
1. is there any other feasible and faster way to do this? Very
important issue.
2. While assigning group number, it doesn't give sequential numbers. It
skips some of the numbers( group Number 1,2 5,9...) I am just curious
about this(
not very important issue)
SELECT c1.fname, c1.lname, c1.ssn , c3.tu_id,
(SELECT 1 + count(*)
FROM distFLS AS c2
WHERE c2.ssn < c1.ssn
or (c2.ssn = c1.ssn and (substring(c2.fname,1,1) =
substring(c1.fname,1,1) or substring(c2.lname,1,1) =
substring(c1.lname,1,1)
or substring(c2.fname,1,1) =
substring(c1.lname,1,1) or substring(c2.lname,1,1) =
substring(c1.fname,1,1))
)) AS grp_num
into tmp_FLS
FROM distFLS AS c1
JOIN tu_people_data AS c3
ON (c1.ssn = c3.ssn and
c1.fname = c3.fname and
c1.lname= c3.lname)
GO
distinct firstname, lastname and SSN table from the tu_people_data.
I created this table to increase the query performance.
CREATE TABLE [distFLS] (
[fname] [varchar] (32) COLLATE Latin1_General_CI_AS NULL ,
[lname] [varchar] (32) COLLATE Latin1_General_CI_AS NULL ,
[ssn] [int] NULL
) ON [PRIMARY]
GO
CREATE TABLE [TU_People_Data] (
[tu_id] [bigint] NOT NULL ,
[count_id] [int] NOT NULL ,
[fname] [varchar] (32) COLLATE Latin1_General_CI_AS NULL ,
[lname] [varchar] (32) COLLATE Latin1_General_CI_AS NULL ,
[ssn] [int] NULL ,
CONSTRAINT [PK_tu_bulk_people] PRIMARY KEY CLUSTERED
(
[tu_id],
[count_id]
) ON [PRIMARY]
) ON [PRIMARY]
GO
sample data
there is a column count_id after the tu_id and before fname(tu_id and
count_id are primary keys)
tu_id fname lname SSN
156078480 KRISINA WALSH 999999000
156078480 KRISTINA GIERER 999999000
156078480 KRISTINA WALSH 999999000
151257883 J SOTO 999999111
151257883 JOSE LARIOS 999999111
151257883 JOSE SOTO 999999111
151257883 L SOTO 999999111
136312525 ELADIO GARCIA 999999222
136312525 ELADIO NAVA 999999222
136312525 ELADIO NAVAGARCIA 999999222
136312525 GARCIA NAVA 999999222
149180940 DARREN SAUERWINE 999999333
149180940 DARREN SUAERWIN 999999333I am assuming that SSN means U.S. Social Security Number. If this is not
the case, ignore the rest of this post.
If SSN is the same you should be able to assume the records are in fact the
same people (this is the rule, but due to human error there have been
exceptions). For the rare exceptions identify them after the fact. Odds
are you won't come accross a true duplicate SSN, although if your data is
innaccurate it is more likely.
If you can go this route, simply assign the same group to anyone with the
same SSN, it will make it much simpler and much, much, much faster. If you
have to do it with all the name comparisons, at least remove the following
subquery and replace it with a different method for assigning a unique
GRP_NUM. That extra full table lookup for every row is what is causing your
performance problems.
(SELECT 1 + COUNT(*)
FROM DISTFLS AS C2
WHERE C2.SSN < C1.SSN
OR (C2.SSN = C1.SSN
AND ( SUBSTRING(C2.FNAME,1,1) = SUBSTRING(C1.FNAME,1,1)
OR SUBSTRING(C2.FNAME,1,1) = SUBSTRING(C1.LNAME,1,1)
OR SUBSTRING(C2.LNAME,1,1) = SUBSTRING(C1.LNAME,1,1)
OR SUBSTRING(C2.LNAME,1,1) = SUBSTRING(C1.FNAME,1,1)
)
)
) AS GRP_NUM
<sql.greg@.gmail.com> wrote in message
news:1141742507.871093.35900@.i40g2000cwc.googlegroups.com...
> I am using the following query which I found from a fragment of code
> by itzik ben-gan to assign a common group id for group of records in my
> case which have similar SSN and first Name and Last Name. if the SSN is
> the same it should also check the first name and last name of the
> record. Because records have more than three AKA names; I need to check
> all the possibilities of first name last name combination to verify the
> records are the same.
> This code works fine and can assign group numbers for all the rows.
> I am trying this code on a database of 65,000 rows. It's taking around
> 20 minute to complete. but I'll have to run the same code on
> 800,000,000 rows.
> It will take years to finish.
> Even if the query is optimized to run in 1 second for 65,000 rows, it
> will take more than 4 hours to run on the 800,000,000 row. This where I
> realized I am in the "wrong jungle".
> 1. is there any other feasible and faster way to do this? Very
> important issue.
> 2. While assigning group number, it doesn't give sequential numbers. It
> skips some of the numbers( group Number 1,2 5,9...) I am just curious
> about this(
> not very important issue)
> SELECT c1.fname, c1.lname, c1.ssn , c3.tu_id,
> (SELECT 1 + count(*)
> FROM distFLS AS c2
> WHERE c2.ssn < c1.ssn
> or (c2.ssn = c1.ssn and (substring(c2.fname,1,1) =
> substring(c1.fname,1,1) or substring(c2.lname,1,1) =
> substring(c1.lname,1,1)
> or substring(c2.fname,1,1) =
> substring(c1.lname,1,1) or substring(c2.lname,1,1) =
> substring(c1.fname,1,1))
> )) AS grp_num
> into tmp_FLS
> FROM distFLS AS c1
> JOIN tu_people_data AS c3
> ON (c1.ssn = c3.ssn and
> c1.fname = c3.fname and
> c1.lname= c3.lname)
> GO
> distinct firstname, lastname and SSN table from the tu_people_data.
> I created this table to increase the query performance.
> CREATE TABLE [distFLS] (
> [fname] [varchar] (32) COLLATE Latin1_General_CI_AS NULL ,
> [lname] [varchar] (32) COLLATE Latin1_General_CI_AS NULL ,
> [ssn] [int] NULL
> ) ON [PRIMARY]
> GO
> CREATE TABLE [TU_People_Data] (
> [tu_id] [bigint] NOT NULL ,
> [count_id] [int] NOT NULL ,
> [fname] [varchar] (32) COLLATE Latin1_General_CI_AS NULL ,
> [lname] [varchar] (32) COLLATE Latin1_General_CI_AS NULL ,
> [ssn] [int] NULL ,
> CONSTRAINT [PK_tu_bulk_people] PRIMARY KEY CLUSTERED
> (
> [tu_id],
> [count_id]
> ) ON [PRIMARY]
> ) ON [PRIMARY]
> GO
> sample data
> there is a column count_id after the tu_id and before fname(tu_id and
> count_id are primary keys)
> tu_id fname lname SSN
> 156078480 KRISINA WALSH 999999000
> 156078480 KRISTINA GIERER 999999000
> 156078480 KRISTINA WALSH 999999000
> 151257883 J SOTO 999999111
> 151257883 JOSE LARIOS 999999111
> 151257883 JOSE SOTO 999999111
> 151257883 L SOTO 999999111
> 136312525 ELADIO GARCIA 999999222
> 136312525 ELADIO NAVA 999999222
> 136312525 ELADIO NAVAGARCIA 999999222
> 136312525 GARCIA NAVA 999999222
> 149180940 DARREN SAUERWINE 999999333
> 149180940 DARREN SUAERWIN 999999333
>|||A while back, I ran a query against a customer account table, and found over
1000 unique SSN was used with a different date of birth. Those SSNs with the
highest frequency were obviously fake or just stubbs (ex: 111111111), but
there were some apparently valid numbers that were duplicated several
hundred times by different account holders.
"They were shocked to learn that 18 of the 19 terrorists possessed either
state-issued or counterfeit driver's licenses or ID cards and all 19 had
obtained Social Security numbers (SSNs) - some real, some fake."
http://www.cis.org/articles/2002/back1202.html
"Linda Trevino, who lives in a Chicago suburb, applied for a job last year
at a local Target department store, and was denied. The reason? She already
worked there -- or rather, her Social Security number already worked there."
http://www.msnbc.msn.com/id/6814673/
"Jim Underwood" <james.underwoodATfallonclinic.com> wrote in message
news:%23drgo7fQGHA.5552@.TK2MSFTNGP10.phx.gbl...
>I am assuming that SSN means U.S. Social Security Number. If this is not
> the case, ignore the rest of this post.
> If SSN is the same you should be able to assume the records are in fact
> the
> same people (this is the rule, but due to human error there have been
> exceptions). For the rare exceptions identify them after the fact. Odds
> are you won't come accross a true duplicate SSN, although if your data is
> innaccurate it is more likely.
> If you can go this route, simply assign the same group to anyone with the
> same SSN, it will make it much simpler and much, much, much faster. If
> you
> have to do it with all the name comparisons, at least remove the following
> subquery and replace it with a different method for assigning a unique
> GRP_NUM. That extra full table lookup for every row is what is causing
> your
> performance problems.
> (SELECT 1 + COUNT(*)
> FROM DISTFLS AS C2
> WHERE C2.SSN < C1.SSN
> OR (C2.SSN = C1.SSN
> AND ( SUBSTRING(C2.FNAME,1,1) = SUBSTRING(C1.FNAME,1,1)
> OR SUBSTRING(C2.FNAME,1,1) = SUBSTRING(C1.LNAME,1,1)
> OR SUBSTRING(C2.LNAME,1,1) = SUBSTRING(C1.LNAME,1,1)
> OR SUBSTRING(C2.LNAME,1,1) = SUBSTRING(C1.FNAME,1,1)
> )
> )
> ) AS GRP_NUM
>
> <sql.greg@.gmail.com> wrote in message
> news:1141742507.871093.35900@.i40g2000cwc.googlegroups.com...
>

Monday, March 19, 2012

group Number for millions of records.

No comments:

Post a Comment

Firehose mode

Blog Archive

About Me