Friday, August 28, 2009

CTE to delete duplicate rows from a table

Common-table expressions are a very useful new feature in SQL Server 2005. You can use them for recursive queries,

removing duplicates, and even simple looping procedures.

With some crafty TSQL, this is a relatively easy task to do when a primary key defined on the table. Luckily, the

new CTE feature in SQL Server 2005 makes it very easy to remove these duplicates, with or without a primary key.

The script below defines my CTE. I am using a windowing function named DENSE_RANK to group the records together

based on the Product, SaleDate, and SalePrice fields, and assign them a sequential value randomly. This means that

if I have two records with the exact same Product, SaleDate, and SalePrice values, the first record will be ranked

as 1, the second as 2, and so on.

WITH SalesCTE(Product, SaleDate, SalePrice, Ranking)
AS
(

SELECT Product, SaleDate, SalePrice,Ranking = DENSE_RANK() OVER(PARTITION BY Product, SaleDate, SalePrice ORDER BY

NEWID() ASC)
FROM SalesHistory

)

DELETE FROM SalesCTE WHERE Ranking > 1

Because a CTE acts as a virtual table, I am able to process data modification statements against it, and the

underlying table will be affected. In this case, I am removing any record from the SalesCTE that is ranked higher

than 1. This will remove all of my duplicate records.

To verify my duplicates have been removed, I can review the data in the table, which should now contain 8 records,

rather than the previous 10.

SELECT *FROM SalesHistory

No comments: