SQL: Finding all Duplicate Records in a Table CodeUnit 03 MAR 2010

Sometimes it is quite handy to see which records based on a particular key appear more than once in a database. The trick here is that we are focusing on a particular column and deeming that should that particular key appear more than once in the table  then we have a duplicate scenario.

So let us look at constructing the necessary SQL statement.

For our example, let’s use a table called emails_sent which contains a whole lot of various columns including email_address, which will act as our key to search upon. So wanting to find all complete records for all email_addresses that appear more than once in the table, we put down:

FROM emails_sent
WHERE email_address IN (
SELECT email_address
FROM emails_sent
GROUP BY email_address
HAVING (COUNT(email_address) > 1)

Breaking this down, we’ll see that the SQL statement is actually made up of two distinct parts. The first part is the meaty one that is responsible for grabbing all the duplicate records, while the second part simply uses the result set stemming from this first segment to then return all the column values for all the duplicate records. Now while the second part is fairly simple, just a SELECT *FROM x WHERE key IN statement, the first segment deserves a bit more of an explanation.

What we are doing is grouping all email_addresses together that look the same using the GROUP BY clause. Next, we are applying a HAVING condition to the GROUP BY clause, this time asking that the only groups to keep are those for which the count of the group key is greater than 1. At this point the light should be blinking on and you should be shouting out, “Ah ha, so that is how it works”  in case you are wondering.

So the first part returns all the duplicate record keys and the second part then grabs all information for records that correspond on those initially returned duplicate keys.

Simple really.

We could of course through a simple modification to our HAVING clause change the SQL that it only returns records where the key only appears once – i.e. change HAVING (COUNT(email_address) > 1 to HAVING (COUNT(email_address) = 1.

To return the duplicate keys and the number of times that they appear we could use this:

SELECT email_addresses, count(email_address) AS numOccurrences
FROM emails_sent
GROUP BY email_address
HAVING (COUNT(email_address) > 1)

About Craig Lotter

Software developer, husband and dad to two little girls. Writer behind An Exploring South African. I don't have time for myself any more.