Finding (and removing) duplicate files on your hard drive
secretGeek .:dot Nuts about dot Net:.
home .: about .: sign up .: sitemap .: secretGeek RSS

Finding (and removing) duplicate files on your hard drive

I generally hold to the philsophy that hard drive space is cheap, and your time is too valuable to waste on optimising hard drive space.

But one of those fun holiday activities, reserved for times when procrastination is at its peak, is to thoroughly clean up a hard drive and make extra room available.

My usual technique is to use SpaceSniffer (found courtesy of Scott Hanselman's tool list) but this time around I suspected that the biggest waste of space was caused by duplicate files (particularly music and photos) taking up a lot of space.

When confronted with a simple problem, the smart guys look for pre-existing solutions. But not me.

I like to employ something I call the 'my way is the best way' philosophy. Other people call it 'not invented here' syndrome, but I prefer to call it 'my way is the best way' because... well, my way is the best way.

thinking about duplicate files

Analysis is more fun than Action

Most of the duplicate-finding tools in this category have a feature where they will automatically delete all but one copy of each duplicate file found. That's not something I'm willing to do, at least not automatically. What I wanted to do was to create the full list of files, and then analyse it, for example in NimbleText. I wanted to create the list of files and then stand back, thoughfully stroking my long beard, just like Pai Mei from Kill Bill.

So I embarked on a special project, codenamed Dinomopabot, a name recommended by my 5 year old daughter who is very clever at these things. The final result is now named 'Dupes.exe': a command line tool for finding duplicate files on your hard drive.


You can browse, clone or fork the source-code, at Bitbucket:

'Dupes' sourcecode



Or download the executable, ready for use:

Download 'dupes.exe'


Here's the built-in help text:

Dupes Find duplicate files, by calculating checksums.

Usage: Dupes.exe [options]
Tip: redirect output to a .csv file, and manipulate with NimbleText.

Options:
  -p, --path=VALUE           the folder to scan
  -s, --subdirs              include subdirectories
  -f, --filter=VALUE         search filter (defaults to *.*)
  -a, --all                  show ALL checksums of files, even non-copies
  -?, -h, --help             show this message and exit

For each file it encounters, Dupes generates a sha256 checksum, with which to compare files. They're short and catchy, they look like this:

271EC103B44960B6A4C6A26FE13682A855133D3D95AC8ED81D7C90FA41571D1F

Cute hey? Almost adoption-worthy.

And for every member of a duplicate file set that the tool encounters, it spits out a row with four columns, separated by bar symbols ('|')

The four columns are:

CheckSum       Sha256 checksum of the file. (Hint: sort by this to get all duplicates together)
DuplicateNum   0 for the first file in the duplicate set, 1 for the second file, etc.
Filesize       In bytes. (Hint: sort by this, if you want to tackle big files first)
Path           Full path and filename for this duplicate.

So you run dupes.exe and direct the output into a textfile (using > [filename]), and from there you can manipulate it (with NimbleText for example), to create a batch file that carefully deletes all the hand-picked, unwanted duplicates of your choice.

Here's an example of a NimbleText pattern you could use with the output of Dupes. This will create a batch file that deletes all but the first copy of each file:

<% if ($1 > 0) { 'del ' + $3 } %>

That pattern is just a piece of embedded javascript (you can embed javascript in NimbleText patterns) that says "if column 1 is greater than Zero, then output the text 'del ' plus the text from column 3." Column 1 is the duplicate number, so it will be greater than zero for all but the first instance of the file. And column 3 is the full path and filename of the duplicate.

Thank you. I hope someone finds this thing useful. Also, please imagine suitably gigantic and terrifying disclaimers attached to this code. I wrote it after all.





'Chai' on Fri, 11 Jan 2013 06:42:55 GMT, sez:

Wow ! This is such a great idea. I think I am going to write my own dupes.exe. I will call it "Dupe of Edinburgh."



'lb' on Fri, 11 Jan 2013 07:01:04 GMT, sez:

nice one chai, how about "Dupe, where's my car?"



'Doeke' on Fri, 11 Jan 2013 07:59:03 GMT, sez:

I love the tool, but shouldn't the options be modelled after the "dir" command?

Like: dupes /s c:\temp
Or: dupes /s c:\prj\*.cs



'SumDumGuy' on Fri, 11 Jan 2013 19:25:04 GMT, sez:

Hm...adding DateCreated/Modified fields could also be useful when judiciously evaluating dupes...time for a pull request! ;)



'lb' on Fri, 11 Jan 2013 23:10:39 GMT, sez:

@SumDumGuy -- pull requests welcome, it's a good feature idea.

@Doeke -- I love using the options class from mono. I can't go back to dos style params.



'Larry' on Thu, 01 Aug 2013 12:58:40 GMT, sez:

Hi,
I appreciate your efforts for trying it for yourself even though there is the availability of the tool already. We will learn a lot while trying out new things, I practice it. Good work!




name


website (optional)


enter the word:
 

comment (HTML not allowed)


All viewpoints welcome. Incivility is not tolerated, such comments are deleted.

 

I'm the co-author of TimeSnapper, a life analysis system that stores and plays-back your computer use. It makes timesheet recording a breeze, helps you recover lost work and shows you how to sharpen your act.

 

NimbleText - FREE text manipulation and data extraction

NimbleText is a Powerful FREE Tool

I wrote this, and use it every day for:

  • extracting data from text
  • manipulating text
  • generating code

It makes you look awesome. You should use NimbleText, you handsome devil!

 

Articles

The Canine Pyramid The Canine Pyramid
Humans: A Tragedy. Humans: A Tragedy.
ACK! ACK!
OfficeQuest... Gamification for the Office Suite OfficeQuest... Gamification for the Office Suite
New product launch: NimbleSET New product launch: NimbleSET
Programming The Robot from Diary of a Wimpy Kid Programming The Robot from Diary of a Wimpy Kid
Happy new year 2014 Happy new year 2014
Downtime as a service Downtime as a service
The Shape of Your Irrationality The Shape of Your Irrationality
This is why I don't go to nice restaurants any more. This is why I don't go to nice restaurants any more.
A flowchart of what programmers do at work all day A flowchart of what programmers do at work all day
The Telepresent Man. The Telepresent Man.
Interview with an Ex-Microsoftie. Interview with an Ex-Microsoftie.
CRUMBS! Commandline navigation tool for Powershell CRUMBS! Commandline navigation tool for Powershell
Little tool for making Amazon affiliate links Little tool for making Amazon affiliate links
Extracting a Trello board as markdown Extracting a Trello board as markdown
hgs: Manage Lots of Mercurial Projects Simultaneously hgs: Manage Lots of Mercurial Projects Simultaneously
You Must Get It! You Must Get It!
AddDays: A Very Simple Date Calculator AddDays: A Very Simple Date Calculator
Google caught in a lie. Google caught in a lie.
NimbleText 2.0: More Than Twice The Price! NimbleText 2.0: More Than Twice The Price!
A Computer Simulation of Creative Work, or 'How To Get Nothing Done' A Computer Simulation of Creative Work, or 'How To Get Nothing Done'
NimbleText 1.9 -- BoomTown! NimbleText 1.9 -- BoomTown!
Line Endings. Line Endings.
**This** is how you pivot **This** is how you pivot
Art of the command-line helper Art of the command-line helper
Go and read a book. Go and read a book.
Slurp up mega-traffic by writing scalable, timeless search-bait Slurp up mega-traffic by writing scalable, timeless search-bait
Do *NOT* try this Hacking Script at home Do *NOT* try this Hacking Script at home
The 'Should I automate it?' Calculator The 'Should I automate it?' Calculator

Archives Complete secretGeek Archives

TimeSnapper -- Automated Screenshot Journal TimeSnapper: automatic screenshot journal

25 steps for building a Micro-ISV 25 steps for building a Micro-ISV
3 minute guides -- babysteps in new technologies: powershell, JSON, watir, F# 3 Minute Guide Series
Universal Troubleshooting checklist Universal Troubleshooting Checklist
Top 10 SecretGeek articles Top 10 SecretGeek articles
ShinyPower (help with Powershell) ShinyPower
Now at CodePlex

Realtime CSS Editor, in a browser RealTime Online CSS Editor
Gradient Maker -- a tool for making background images that blend from one colour to another. Forget photoshop, this is the bomb. Gradient Maker



[powered by Google] 

How to be depressed How to be depressed
You are not inadequate.



Recommended Reading


the little schemer


The Best Software Writing I
The Business Of Software (Eric Sink)

Recommended blogs

Jeff Atwood
Joseph Cooney
Phil Haack
Scott Hanselman
Julia Lerman
Rhys Parry
Joel Pobar
OJ Reeves
Eric Sink

InfoText - amazing search for SharePoint
LogEnvy - event logs made sexy
Computer, Unlocked. A rapid computer customization resource
Aussie Bushwalking
BrisParks :: best parks for kids in brisbane
PhysioTec, Brisbane Specialist Physiotherapy & Pilates
 
home .: about .: sign up .: sitemap .: secretGeek RSS .: © Leon Bambrick 2012 .: privacy

home .: about .: sign up .: sitemap .: RSS .: © Leon Bambrick 2006 .: privacy