For some time now I have found the Batch Manager's definition of "duplicate" very limited, as it only looks for photos with exactly the same filename.
This is a particular problem for me because I have always used Canon digital cameras, and each time I buy a new one the filename counter starts again at IMG_0001.jpg, so I have accumulated many 'false' duplicates over the years.
I think it would be useful to have some (optional) additional tests:
1) Also test whether the creation date (and time) is the same, in addition to the filename.
2) Optionally also test whether the pixel dimensions (width and height) are the same.
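As an illustration of how such a composite test could work, here is a minimal Python sketch (standalone, not Piwigo code). It groups files by filename, modification time, and file size; size is used as a cheap stand-in for pixel dimensions, which would need an image library to read.

```python
import os
from collections import defaultdict

def find_candidate_duplicates(root):
    """Group files under `root` by (filename, mtime, size).

    Files that share all three are likely true duplicates; files that
    merely share a name (e.g. IMG_0001.jpg from two different cameras)
    are not flagged. Pixel dimensions would need an image library such
    as Pillow, so file size is used here as a cheap stand-in.
    """
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            key = (name, int(st.st_mtime), st.st_size)
            groups[key].append(path)
    # keep only keys matched by more than one file
    return {k: v for k, v in groups.items() if len(v) > 1}
```

With this key, two IMG_0001.jpg files from different cameras land in different groups as soon as their timestamps or sizes differ.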
Does anyone have any other suggestions?
I also see that there is a 'bug' request reported by sakanaou:
"0002801: [Batch Manager] Find duplicates based on md5sum"
Hello
You're perfectly right. The md5 check is really the ultimate duplicate check.
There is also a configuration property named "uniqueness_mode", so I would suggest modifying the file "batch_manager.php" in the following way:
Code:
  // perform 2 queries instead. We hope there are not too many duplicates.
+ $field = ($conf['uniqueness_mode'] == 'md5sum') ? 'md5sum' : 'file';
  $query = '
-SELECT file
+SELECT '.$field.'
  FROM '.IMAGES_TABLE.'
- GROUP BY file
+ GROUP BY '.$field.'
  HAVING COUNT(*) > 1
;';
- $duplicate_files = array_from_query($query, 'file');
+ $duplicate_files = array_from_query($query, $field);

  $query = '
SELECT id
  FROM '.IMAGES_TABLE.'
- WHERE file IN (\''.implode("','", $duplicate_files).'\')
+ WHERE '.$field.' IN (\''.implode("','", $duplicate_files).'\')
;';
  array_push(
And we need an additional index for the field "md5sum" in the database.
this field already exists, but it is NULL for photos stored in the "galleries" folder
Then we have to add it!?
Where can I find the code that synchronizes the galleries folder?
there is certainly a good reason why it's not synced... or it's just historical
the code is in admin/site_update.php
anyway we need to keep in mind that if we start computing the md5sum of "galleries" photos, we can't do it retroactively for already-synced photos (I think computing the hash of thousands of photos is very time-consuming)
We should add the hash also to imports from the galleries folder:
Code:
  $insert = array(
    'id' => $next_element_id++,
    'file' => $filename,
    'name' => get_name_from_file($filename),
    'date_available' => CURRENT_DATE,
    'path' => $path,
    'representative_ext' => $fs[$path]['representative_ext'],
    'storage_category_id' => $db_fulldirs[$dirname],
    'added_by' => $user['id'],
+   'md5sum' => md5_file($path),
    );
We could add a batch function to recalculate the md5 hash. It is not that time-consuming (see http://php.net/manual/de/function.md5-file.php#81751 ).
I think the current duplicate filter is useless for most users anyway.
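As a language-neutral sketch of that batch recalculation (Python here rather than Piwigo's PHP; the `known` dict is a hypothetical stand-in for the md5sum column in the database):

```python
import hashlib

def md5_file(path, chunk_size=1 << 20):
    """Equivalent of PHP's md5_file(): hash a file in 1 MiB chunks so
    large photos never have to fit in memory at once."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def backfill_checksums(paths, known):
    """Compute checksums only for photos that do not have one yet.

    `known` maps path -> md5 hex digest (or None); it stands in for
    the md5sum column of the images table.
    """
    for path in paths:
        if known.get(path) is None:
            known[path] = md5_file(path)
    return known
```

Because the function skips entries that already have a hash, it is safe to re-run the batch at any time and it only pays for the photos that were synced before hashing existed.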
Last edited by thimo (2013-01-19 12:23:35)
Hello, I know this is an old thread, but another possibility is to find duplicated files using similarity, like convolution in maths. That way the same picture with different names and sizes may be classified as very similar.
that's indeed the best solution, but this kind of algorithm is much too complex and resource-consuming
thimo wrote:
We should add the hash also to imports from the galleries folder:
Code:
  $insert = array(
    'id' => $next_element_id++,
    'file' => $filename,
    'name' => get_name_from_file($filename),
    'date_available' => CURRENT_DATE,
    'path' => $path,
    'representative_ext' => $fs[$path]['representative_ext'],
    'storage_category_id' => $db_fulldirs[$dirname],
    'added_by' => $user['id'],
+   'md5sum' => md5_file($path),
    );
We could add a batch function to recalculate the md5 hash. It is not that time consuming (see http://php.net/manual/de/function.md5-file.php#81751 ).
I think the current duplicate filter is useless for most users anyway.
I just ran into the duplicate-images issue while trying to query the database to import comments from an old comment table I have from my old picture server (a home-grown, hacked-up PHPix). I have 30k images, and in some cases the images have the same filename. I worked around this in my old database by computing an md5 hash and storing it in the database before editing the comments. For a new comment, I would check the hash of the original file and see whether it was already in the table before the comment was updated or added. I have not had any problems with this in 5+ years of use.
The issue I saw tonight, while trying to figure out how to import my comments from one table to another correctly, is that one file name shows up 4 times:
select `file`,`id` from `piwigo_images` where FILE='DSC01450.JPG';
+--------------+-------+
| file | id |
+--------------+-------+
| DSC01450.JPG | 15255 |
| DSC01450.JPG | 16252 |
| DSC01450.JPG | 16952 |
| DSC01450.JPG | 25967 |
+--------------+-------+
4 rows in set (0.01 sec)
My original table showed the same file name only once in this example, as I only had a comment on one of the files:
SELECT * FROM comments WHERE fileid = "DSC01450.JPG";
+-------------------+-----------------------------+----------------------------------+
| fileid | desctxt | uniqueid |
+-------------------+-----------------------------+----------------------------------+
| DSC01450.JPG | from inside the hotel room. | 3cc71e12ca10a5d57859e7d17140f56c |
+-------------------+-----------------------------+----------------------------------+
1 row in set (0.00 sec)
I am not at all good at scripting/writing PHP and SQL queries, but I will figure it out at some point :)
Since md5sum already exists in the database, wouldn't it be reasonable for the find-duplicates function to use this instead of names? Or both.
You don't necessarily need to md5 (or otherwise hash) the whole file, only a part of it, like the first 2k bytes (for instance). I don't recall where in the file each part lives (it's been a while since I did image work), but depending on the type, e.g. .jpg, .png, etc., you can look at 2k (or whatever amount you decide) anywhere.
I wrote a Python dedupe program that reads the first 5k of each file to compare them; the creator/maintainer of the file browser SpaceFM used a very similar dedupe approach.
I know my Python, Java, and JavaScript (I'm a professional software engineer), but I'm rusty with my PHP (I haven't coded in PHP in about 4 years). You would have something like:
Code:
$f = "path/to/text/file.txt"; // String file path
$size = filesize($f);         // File size (how much data to read)
$fH = fopen($f,"r");          // File handle
$data = fread($fH,$size);     // Read data from file handle
fclose($fH);                  // Close handle
$hash = md5($data);
Of course, SHA-1 or another hash would work. I'm not familiar with each hash library's speed, but that should be taken into account.
I have not tried out this code, just throwing out the idea.
Code:
$f = "path/to/text/file.txt"; // String file path
$size = 2048;                 // Amount of data to read
$fH = fopen($f,"r");          // File handle
$data = fread($fH,$size);     // Read data from file handle
fclose($fH);                  // Close handle
$hash = md5($data);
Fixed to use a set byte count rather than the file size.
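The same partial-hash idea can be sketched in Python (a standalone illustration, not Piwigo code; 2048 bytes is just the amount chosen above, and matching prefixes only make files duplicate *candidates*):

```python
import hashlib
from collections import defaultdict

PROBE_SIZE = 2048  # bytes to hash; an arbitrary choice, tune as needed

def partial_md5(path, size=PROBE_SIZE):
    """Hash only the first `size` bytes of a file, like the PHP
    snippet above that freads a fixed amount instead of filesize()."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read(size)).hexdigest()

def group_by_partial_hash(paths):
    """Bucket files by partial hash. Buckets with more than one file
    are duplicate candidates: two different files can share their
    first 2 KiB, so a full-file hash (or a byte-by-byte compare)
    should confirm each bucket before anything is deleted."""
    buckets = defaultdict(list)
    for p in paths:
        buckets[partial_md5(p)].append(p)
    return {h: ps for h, ps in buckets.items() if len(ps) > 1}
```

The trade-off is speed versus certainty: hashing a fixed prefix makes the scan I/O cost constant per file, at the price of a confirmation pass on each candidate bucket.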
[Github] Piwigo issue #210
So I have added a new "checksum" checkbox to the list of "find duplicates based on" options in the Batch Manager. The main problem is that photos added by synchronization have no md5sum (in the database, I mean), so this filter won't work for them.
We need a feature to compute checksums. I don't think it's a good idea to have these features in core, because they are only useful to very few people, i.e. people searching for duplicates based on checksum who also use sync, which is used less and less compared to other upload methods. The best idea I have now is to create a "Compute Checksum" plugin that would add features to the Batch Manager:
* a filter "with no checksum"
* an action "compute checksum"
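A rough sketch of those two plugin features, using Python and sqlite3 as stand-ins for Piwigo's PHP and MySQL (the table and column names are assumed from the piwigo_images schema discussed above, not taken from the actual plugin):

```python
import hashlib
import sqlite3

def photos_without_checksum(conn):
    """Filter 'with no checksum': ids and paths of photos whose
    md5sum is still NULL (typically those added by synchronization)."""
    cur = conn.execute(
        "SELECT id, path FROM piwigo_images WHERE md5sum IS NULL")
    return cur.fetchall()

def compute_checksums(conn):
    """Action 'compute checksum': fill in the missing md5sum values
    by hashing each file on disk."""
    for photo_id, path in photos_without_checksum(conn):
        with open(path, "rb") as f:
            digest = hashlib.md5(f.read()).hexdigest()
        conn.execute(
            "UPDATE piwigo_images SET md5sum = ? WHERE id = ?",
            (digest, photo_id))
    conn.commit()
```

Running the action drains the filter: after `compute_checksums`, the "with no checksum" query returns nothing, and the checksum-based duplicate filter works for synced photos too.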
Everything I load goes in via FTP uploads to galleries, so the lack of checksums on that data makes the feature useless for me. (It's a public photo archive; pictures come in in big chunks and sometimes get updated, so FTP upload plus sync is by far the best way to handle it.) I was excited to see the checksum field in the database and the checkbox on the Batch Manager, but I also noticed that all my photos had NULL checksum fields.
Computing the checksum should be an automatic part of taking in an FTP upload. Is there any reason it isn't? Nobody has actually given a reason for it not being done in the threads I've found and read so far.