Duplicates, again, but other sort of ;)

FFF
Posts: 1580
Joined: Fri Sep 25, 2015 4:52 pm
Location: Germany

Post by FFF »

I need to write a tool to find duplicate files in a folder. We are talking about some 25k files in there ;), so doing it "by hand" is not an option.
The filenames are arbitrary (I just found a file named "____ ___ ___ (_____, ____. ___).docx"), and so are the extensions.

The app that controls this folder "notices" if a given file is already present and changes the filename of the newly inserted file by adding a date plus an increment. E.g.:
If there's a "myTest.prg", the copy will be "myTest-Nov-26-1.prg"; if another occurs on the same date, it will be "myTest-Nov-26-2.prg"; if it appears tomorrow, it will be named "myTest-Nov-27-1.prg".
I have no control over this.
My first thought was to get a list of FileInfos and step through it, comparing file n with file n+1: if the front parts of the filenames are identical, check for same size and same change date, then delete the n+1 file (or better, move it to a backup ;->), and iterate until no dups are found.
Does that make sense?
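
A minimal sketch of that first-pass idea, written in Python only as a stand-in (not X#). The suffix pattern is an assumption read off the example names above ("-Nov-26-1" before the extension), and `base_name` and `candidate_groups` are hypothetical helper names. It only lists candidates; it does not delete or move anything.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumed marker shape, taken from the examples: "-Nov-26-1" right before the extension.
SUFFIX = re.compile(
    r"-(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)-\d{1,2}-\d+$"
)

def base_name(filename: str) -> str:
    """Return the file name with a trailing date-plus-increment marker removed."""
    p = Path(filename)
    return SUFFIX.sub("", p.stem) + p.suffix

def candidate_groups(folder: str) -> dict:
    """Group files by (base name, size, modification time); keep groups of 2+."""
    groups = defaultdict(list)
    for p in Path(folder).iterdir():
        if p.is_file():
            st = p.stat()
            key = (base_name(p.name), st.st_size, int(st.st_mtime))
            groups[key].append(p.name)
    return {k: sorted(v) for k, v in groups.items() if len(v) > 1}
```

Grouping by key instead of comparing file n with file n+1 avoids depending on the sort order of the listing. Whether the renamed copy really keeps the original change date is something to verify against the actual folder.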
I'd feel better if I could check whether the files are identical, but I didn't find a tool for that in .NET (probably searched wrongly).
Maybe one could send pairs to something like WinMerge?

The dups usually appear rather rarely, but every now and again there are hiccups in the upstream process, and I get 1000 new ones ;-(

Any idea is welcome!
EDIT: maybe I should have consulted the web before writing ;) - found some candidates, and found one which tells me how dumb I was, ignoring the first "marker": two identical files have to be the same size...
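
The "same size first" marker from the EDIT can be sketched like this, again in Python as a stand-in. Only files that share a size are read at all; within such a group a SHA-256 digest decides. The function names are hypothetical, and using a hash instead of a direct byte comparison is my choice for the sketch, not something taken from the candidates found.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files are not loaded into memory at once."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def duplicates_by_content(folder: str) -> list:
    """Return sorted lists of file names whose contents are identical."""
    by_size = defaultdict(list)
    for p in Path(folder).iterdir():
        if p.is_file():
            by_size[p.stat().st_size].append(p)

    result = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a unique size cannot have a duplicate
        by_hash = defaultdict(list)
        for p in same_size:
            by_hash[sha256_of(p)].append(p.name)
        result.extend(sorted(names) for names in by_hash.values() if len(names) > 1)
    return sorted(result)
```

With 25k files, most sizes are likely unique, so only a small fraction of the folder should need hashing.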
Regards
Karl
(on Win8.1/64, Xide32 2.20, X#2.20.0.3)
TerryB1
Posts: 306
Joined: Wed Jan 03, 2018 11:58 am

Post by TerryB1 »

Hi Karl

Don't know if this helps, but some time ago I had similar problems using C#.

I can't remember (or find) exactly what I did, but essentially it involved creating a newClass with separate fields, e.g. name, fullname, dates and so on. That class was initialised from FileInfos in the way you suggest.

I then added it all to a List<newClass>, which allowed me to jigger things about in any way I wished.

I was doing this over several directories. The overall processing time for the same ballpark figures you quote was just a few ms.
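
As an illustration of the record-class idea (in Python rather than C#, and not the original code, which I can't reproduce), a sketch could look like this. `FileRecord` and `load_records` are hypothetical names; the fields follow the ones mentioned above.

```python
from dataclasses import dataclass
from datetime import datetime
from pathlib import Path

@dataclass(frozen=True)
class FileRecord:
    """One entry per file, holding only the fields needed for comparisons."""
    name: str
    full_name: str
    size: int
    modified: datetime

def load_records(folder: str) -> list:
    """Read the folder once and build a plain list of records."""
    records = []
    for p in Path(folder).iterdir():
        if p.is_file():
            st = p.stat()
            records.append(
                FileRecord(p.name, str(p), st.st_size, datetime.fromtimestamp(st.st_mtime))
            )
    return records
```

Once the list is in memory it can be reordered freely without touching the disk again, for example `records.sort(key=lambda r: (r.size, r.name))`.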

Terry
TerryB1
Posts: 306
Joined: Wed Jan 03, 2018 11:58 am

Post by TerryB1 »

Hi Karl

Further to my last, have just remembered a bit more.

You'll need to do a number of passes generating new arrays as you go.

Order is important, so if things get out of order, make one long string and use sort. File names etc. will need to be padded out (with spaces) to a consistent minimum length, then added together. Don't use StringBuilder; "a" + "b" is, I think, far more efficient.

You can consider introducing some oddball Unicode characters as identifiers and so on.

One other point: you'll generate a lot of redundant strings in the process, so make sure they go out of scope asap or they'll fill up memory.

I hope this makes sense - you'll have absolute control over everything, no need for 3rd party tools.

Sorry it's C#.

Terry
ArneOrtlinghaus
Posts: 412
Joined: Tue Nov 10, 2015 7:48 am
Location: Italy

Post by ArneOrtlinghaus »

Hi Karl,

I have attached two files with some VO functions and the converted XS versions. I believe it isn't so difficult to do it with what we are used to:
- Create an array of the files with directory2arrayex.
- Verify that with this order the first file always appears first in the list; otherwise change the order.
- Make two for loops to compare all files. Use the file size (second parameter of the inner array) to make a quick comparison.
- If the file sizes are equal, use fFilesEqual to compare the contents. If they are equal, you can delete the second file.
- An alternative is to order the two-dimensional array by file size (second parameter) with ASortTwoDim (adir, 2, true). In this case the for loop can be enhanced a little bit, but it must be verified which file is the second one.

Arne
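
The sorted-by-size variant of these steps can be sketched as follows, in Python as a stand-in for the attached VO/XS functions. `files_equal` here is a hypothetical chunked byte comparison playing the role of fFilesEqual (its real implementation is in the attachments and may differ), and `find_second_copies` is likewise an invented name.

```python
from pathlib import Path

def files_equal(a: Path, b: Path, chunk_size: int = 1 << 16) -> bool:
    """Compare two files byte by byte, reading both in chunks."""
    if a.stat().st_size != b.stat().st_size:
        return False
    with a.open("rb") as fa, b.open("rb") as fb:
        while True:
            ca, cb = fa.read(chunk_size), fb.read(chunk_size)
            if ca != cb:
                return False
            if not ca:  # both files ended at the same point
                return True

def find_second_copies(folder: str) -> list:
    """Return names of files that duplicate an earlier file in (size, name) order."""
    files = sorted(
        (p for p in Path(folder).iterdir() if p.is_file()),
        key=lambda p: (p.stat().st_size, p.name),
    )
    redundant = set()
    for i, first in enumerate(files):
        if first in redundant:
            continue
        for second in files[i + 1:]:
            if second.stat().st_size != first.stat().st_size:
                break  # sorted by size, so no later file can match
            if second not in redundant and files_equal(first, second):
                redundant.add(second)
    return sorted(p.name for p in redundant)
```

Note that "second" here simply means later in (size, name) order, which is an assumption of the sketch: a renamed copy such as "myTest-Nov-26-1.prg" sorts before "myTest.prg", so the result should be checked before anything is deleted.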
Attachments
filecomparevo.txt (18 KiB)
filecomparexs.txt (28.72 KiB)
ic2
Posts: 1858
Joined: Sun Feb 28, 2016 11:30 pm
Location: Holland

Post by ic2 »

Hello Karl,

Not sure why you need to write it yourself. This is a great & free tool for finding duplicate files, using several criteria:

www.clonespy.com

Dick