{"id":92064,"date":"2021-08-03T00:09:42","date_gmt":"2021-08-03T00:09:42","guid":{"rendered":"https:\/\/www.red-gate.com\/simple-talk\/?p=92064"},"modified":"2021-07-30T00:34:43","modified_gmt":"2021-07-30T00:34:43","slug":"using-filetable-to-find-duplicate-image-files-with-different-names","status":"publish","type":"post","link":"https:\/\/www.red-gate.com\/simple-talk\/blogs\/using-filetable-to-find-duplicate-image-files-with-different-names\/","title":{"rendered":"Using Filetable to Find Duplicate Image Files with Different Names"},"content":{"rendered":"<p>In my article \u201c<a href=\"https:\/\/www.red-gate.com\/simple-talk\/blogs\/filetables-great-for-business-great-for-hobbyist\/\">Filetables<\/a><a href=\"https:\/\/www.red-gate.com\/simple-talk\/blogs\/filetables-great-for-business-great-for-hobbyist\/\">, Great For Business, Great For Hobbyist<\/a>\u201d, I implemented the basis of a hobby database I use for my Twitter accounts (@disneypicaday and @dollywoodp), where I post daily pictures of theme parks.<\/p>\n<p>As my inventory of prepped pictures has grown, one thing has become a major pain. Duplicates. I have tens of thousands of pictures, and I scan them occasionally to add to my directories. Perhaps not surprisingly, the same pictures often get chosen a second time. I use a naming standard that integrates the files into my database, as well as for the copies to go out for a given day. So the second (and third and fourth) time I choose the same picture, it looks different, though it has the exact same bits.<\/p>\n<p>For example, consider the following 5 pictures. 4 are unique, but one is a duplicate of another.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-92070\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2021\/07\/word-image-85.png\" width=\"1027\" height=\"469\" \/><\/p>\n<p class=\"caption\">Figure 1. Five files I will use in this demo<\/p>\n<p>(I know this because I pre-created the files. Sometimes the files are not perfect duplicates, like if I cropped a picture even a pixel, then it is a different physical picture. This cannot be solved easily, but exact copies definitely can.)<\/p>\n<p>Using the following code, I will create a database that allows filestream. You may need to change directories, and you may have to configure your server to allow filestream. More on that here (<a href=\"https:\/\/docs.microsoft.com\/en-us\/sql\/relational-databases\/blob\/enable-the-prerequisites-for-filetable\">https:\/\/docs.microsoft.com\/en-us\/sql\/relational-databases\/blob\/enable-the-prerequisites-for-filetable<\/a>):<\/p>\n<pre class=\"lang:none theme:none\">CREATE DATABASE [DemoFiletable]\r\nON  PRIMARY \r\n (NAME = N'DemoFiletable', \r\nFILENAME = N'C:\\Program Files\\Microsoft SQL Server\\MSSQL15.THEMEPARK\\MSSQL\\DATA\\DemoFiletable.mdf' , \r\nSIZE = 10000KB , MAXSIZE = UNLIMITED, FILEGROWTH = 10000KB ), \r\n FILEGROUP [FilestreamData] CONTAINS FILESTREAM  DEFAULT\r\n( NAME = N'FilestreamDataFile1', \r\n  FILENAME = N'c:\\sql\\DemoFiletableFilestream' , \r\n  MAXSIZE = UNLIMITED)\r\n LOG ON \r\n( NAME = N'DemoFiletable_log', \r\n  FILENAME = N'C:\\Program Files\\Microsoft SQL Server\\MSSQL15.THEMEPARK\\MSSQL\\DATA\\DemoFiletable_log.ldf' ,\r\n  SIZE = 335872KB , MAXSIZE = 2048GB , FILEGROWTH = 65536KB )\r\n  WITH CATALOG_COLLATION = DATABASE_DEFAULT;\r\nALTER DATABASE DemoFiletable \r\n   SET FILESTREAM( NON_TRANSACTED_ACCESS = FULL, \r\n   DIRECTORY_NAME = N'DemoFiletable' ) WITH NO_WAIT;<\/pre>\n<p>After creating this database, next create a filetable. I used the following code:<\/p>\n<pre class=\"lang:none theme:none\">USE DemoFiletable\r\nGO\r\nCREATE SCHEMA FileAssets;\r\nGO\r\n--Used to hold the entire enventory of pictures \r\nCREATE TABLE FileAssets.Picture AS FILETABLE ON [PRIMARY]\r\nFILESTREAM_ON FilestreamData\r\nWITH\r\n(\r\nFILETABLE_DIRECTORY = N'Pictures', \r\nFILETABLE_COLLATE_FILENAME = Latin1_General_100_CI_AS, \r\nFILETABLE_PRIMARY_KEY_CONSTRAINT_NAME=PK_Picture\r\n);<\/pre>\n<p>Now, in windows you can go to the directory created by this Filetable. For me this directory is:<\/p>\n<p>\\\\Desktop-18e8d88\\themepark\\DemoFiletable\\Pictures<\/p>\n<p>Or broken down:<\/p>\n<p>\\\\ServerName\\InstanceName\\DatabaseDirectoryName\\FileTableName<\/p>\n<p>In that directory, I will drag the five files that were in Figure 1.You can see the files in Windows now, but also in SQL. To find the duplicate, you would think that we could just run the query and it would be obvious. These are 4 pretty different pictures. But his is not the case:<\/p>\n<pre class=\"lang:none theme:none\">SELECT name, Picture.file_stream\r\nFROM   FileAssets.Picture\r\nORDER  BY Picture.file_stream<\/pre>\n<p>Returns:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"1174\" height=\"127\" class=\"wp-image-92071\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2021\/07\/word-image-86.png\" \/><\/p>\n<p>The two pictures we know are duplicates are sorted together, but all the hex values up to there are exactly the same. But we can use just plain old SQL that we know and love:<\/p>\n<pre class=\"lang:none theme:none\">SELECT file_stream, COUNT(*)\r\nFROM   FileAssets.Picture\r\nGROUP  BY Picture.file_stream<\/pre>\n<p>The output of this statement shows us the duplicated image file\u2019s textual values:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"476\" height=\"138\" class=\"wp-image-92072\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2021\/07\/word-image-87.png\" \/><\/p>\n<p>And then you can write something like the following to see the files that are duplicated:<\/p>\n<pre class=\"lang:none theme:none\">SELECT Picture.stream_id, Picture.name, \r\n        CHECKSUM(Picture.file_stream),\r\n\t   ROW_NUMBER() OVER (PARTITION BY Picture.file_stream ORDER BY creation_time)\r\nFROM   FileAssets.Picture\r\nWHERE  Picture.file_stream IN (\r\n\t\t\t\t\tSELECT file_stream\r\n\t\t\t\t\tFROM FileAssets.Picture\r\n\t\t\t\t\tGROUP BY Picture.file_stream\r\n\t\t\t\t\tHAVING COUNT(*) &gt; 1\r\n\t\t\t\t\t     )<\/pre>\n<p>This returns the following. The CHECKSUM output can be handy if you are working with lots of data as it will give you a very simple representation of your data that is different.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"676\" height=\"98\" class=\"wp-image-92073\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2021\/07\/word-image-88.png\" \/><\/p>\n<p>Now, to remove the duplicates, all you have to do is delete the rows where that ROW_NUMBER value is greater than 1 (in my database, I also have to modify anywhere the #2 picture is historically used for tweets, but that isn\u2019t the point here. Executing the following code:<\/p>\n<pre class=\"lang:none theme:none\">WITH BaseRows AS (\r\nSELECT Picture.stream_id, \r\n       Picture.name, \r\n       ROW_NUMBER() OVER (PARTITION BY Picture.file_stream             \r\n                          ORDER BY creation_time) AS Ordering\r\nFROM   FileAssets.Picture\r\nWHERE  Picture.file_stream IN (\r\n\t\t\t\t\t\tSELECT file_stream\r\n\t\t\t\t\t\tFROM FileAssets.Picture\r\n\t\t\t\t\t\tGROUP BY Picture.file_stream\r\n\t\t\t\t\t\tHAVING COUNT(*) &gt; 1\r\n\t\t\t\t\t\t)\r\n)\r\nDELETE\r\nFROM   BaseRows\r\nWHERE BaseRows.Ordering &lt;&gt; 1;<\/pre>\n<p>And now the picture files are unique!<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"1681\" height=\"570\" class=\"wp-image-92074\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2021\/07\/word-image-89.png\" \/><\/p>\n<p class=\"caption\">Figure 2. Duplicate picture is gone<\/p>\n<p>The obvious, likely to the casual observer, issue with this could be performance. You cannot index a file_stream value (my image files are in the 2-4MB category,) so it is going to be disk intensive and not at all fast.<\/p>\n<p>But it isn\u2019t terrifyingly slow either. To run this query:<\/p>\n<pre class=\"lang:none theme:none\">SELECT Picture.stream_id, Picture.name, \r\n        CHECKSUM(Picture.file_stream),\r\n\t   ROW_NUMBER() OVER (PARTITION BY Picture.file_stream ORDER BY creation_time)\r\nFROM   FileAssets.Picture\r\nWHERE  Picture.file_stream IN (\r\n\t\t\t\t\tSELECT file_stream\r\n\t\t\t\t\tFROM FileAssets.Picture\r\n\t\t\t\t\tGROUP BY Picture.file_stream\r\n\t\t\t\t\tHAVING COUNT(*) &gt; 1\r\n\t\t\t\t\t     )<\/pre>\n<p>On my local machine with a 3 year old i7, 32GB of RAM, and an nVME 1TB drive took around 2 minutes with a directory that has over 4000 image files to tell me there were no duplicates. When I first started the process, it took a bit longer as I had nearly 100 duplicates, mostly I suspect to return several hundred megabytes of results.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In my article \u201cFiletables, Great For Business, Great For Hobbyist\u201d, I implemented the basis of a hobby database I use for my Twitter accounts (@disneypicaday and @dollywoodp), where I post daily pictures of theme parks. As my inventory of prepped pictures has grown, one thing has become a major pain. Duplicates. I have tens of&#8230;&hellip;<\/p>\n","protected":false},"author":56085,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[2,1],"tags":[],"coauthors":[19684],"class_list":["post-92064","post","type-post","status-publish","format-standard","hentry","category-blogs","category-uncategorized"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/92064","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/users\/56085"}],"replies":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/comments?post=92064"}],"version-history":[{"count":4,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/92064\/revisions"}],"predecessor-version":[{"id":92078,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/92064\/revisions\/92078"}],"wp:attachment":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/media?parent=92064"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/categories?post=92064"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/tags?post=92064"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/coauthors?post=92064"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}