
#1 2012-12-18 17:29:52

Tadjio
Member
UK
2012-05-02
432

Duplicates

For some time now I have found the Batch Manager definition of "Duplicate" very limited, as it only looks for photos with exactly the same filename.

This is a particular problem for me: I have always used Canon digital cameras, and each time I buy a new one the filename numbering starts again at IMG_0001.jpg, so I have accumulated many 'false' duplicates over the years.

I think it would be useful to have some (optional) additional tests:

1) To also test if the Creation Date (and Time) is the same, as well as the filename.

2) To possibly test if the Pixel Dimensions (Width and Height) are also the same (see the query sketch after this list).
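For illustration, a rough SQL sketch of such a combined test (the column names date_creation, width and height are my assumption from the standard piwigo_images schema; untested):

Code:

SELECT file, date_creation, width, height
  FROM piwigo_images
  GROUP BY file, date_creation, width, height
  HAVING COUNT(*) > 1;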

Does anyone have any other suggestions?

I also see that there is a 'bug' request reported by sakanaou
"0002801: [Batch Manager] Find duplicates based on md5sum"


Tadjio


#2 2012-12-18 21:02:03

flop25
Piwigo Team
2006-07-06
7037

Re: Duplicates

Hello
You're perfectly right. The md5 check is really the ultimate duplicate check.


To get better help: politeness like "Hello", a link, and your past actions precisely described
Check my extensions: more than 30 available
Who I am and what I do: http://fr.gravatar.com/flop25
My gallery: an illustration of how to integrate Piwigo into your website


#3 2013-01-19 02:00:25

thimo
Member
2013-01-14
6

Re: Duplicates

There is also a configuration property named "uniqueness_mode", so I would suggest modifying the file "batch_manager.php" in the following way:

Code:

    // perform 2 queries instead. We hope there are not too many duplicates.

+     $field = ($conf['uniqueness_mode'] == 'md5sum') ? 'md5sum' : 'file';

     $query = '
-SELECT file
+SELECT '.$field.'
   FROM '.IMAGES_TABLE.'
-  GROUP BY file
+  GROUP BY '.$field.'
   HAVING COUNT(*) > 1
 ;';
-    $duplicate_files = array_from_query($query, 'file');
+    $duplicate_files = array_from_query($query, $field);
 
     $query = '
 SELECT id
   FROM '.IMAGES_TABLE.'
-  WHERE file IN (\''.implode("','", $duplicate_files).'\')
+  WHERE '.$field.' IN (\''.implode("','", $duplicate_files).'\')
 ;';
 
     array_push(

And we need an additional index for the field "md5sum" in the database.
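For example, something like this (the index name is my own choice, and I'm assuming the default piwigo_ table prefix):

Code:

ALTER TABLE piwigo_images
  ADD INDEX images_i_md5sum (md5sum);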


#4 2013-01-19 02:13:20

mistic100
Former Piwigo Team
Lyon (FR)
2008-09-27
3277

Re: Duplicates

this field already exists, but it is NULL for photos stored in the "galleries" folder
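A quick way to see how many photos are affected (assuming the default piwigo_ table prefix; synced photos show up here with a NULL checksum):

Code:

SELECT COUNT(*) AS without_checksum
  FROM piwigo_images
  WHERE md5sum IS NULL;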


#5 2013-01-19 02:43:58

thimo
Member
2013-01-14
6

Re: Duplicates

Then we have to add it!?
Where can I find the code that synchronizes the galleries folder?


#6 2013-01-19 11:22:44

mistic100
Former Piwigo Team
Lyon (FR)
2008-09-27
3277

Re: Duplicates

there is certainly a good reason why it's not synced... or it's just historical

the code is in admin/site_update.php

anyway, we need to keep in mind that if we start to compute the md5sum of "galleries" photos, we can't do it for already-synced photos (I think computing the hash of thousands of photos is very time consuming)


#7 2013-01-19 12:22:08

thimo
Member
2013-01-14
6

Re: Duplicates

We should add the hash also to imports from the galleries folder:

Code:

    $insert = array(
      'id'             => $next_element_id++,
      'file'           => $filename,
      'name'           => get_name_from_file($filename),
      'date_available' => CURRENT_DATE,
      'path'           => $path,
      'representative_ext'  => $fs[$path]['representative_ext'],
      'storage_category_id' => $db_fulldirs[$dirname],
      'added_by'       => $user['id'],
+      'md5sum'         => md5_file($path),
      );

We could add a batch function to recalculate the md5 hash. It is not that time consuming (see http://php.net/manual/de/function.md5-file.php#81751 ).
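A minimal sketch of such a batch recalculation, assuming Piwigo's pwg_query(), pwg_db_fetch_assoc() and single_update() helpers (untested; the batch size of 100 is arbitrary):

Code:

// recompute missing checksums in small batches to keep each request short
$query = '
SELECT id, path
  FROM '.IMAGES_TABLE.'
  WHERE md5sum IS NULL
  LIMIT 100
;';
$result = pwg_query($query);
while ($row = pwg_db_fetch_assoc($result))
{
  // "path" is stored relative to the Piwigo root, e.g. "./galleries/..."
  single_update(
    IMAGES_TABLE,
    array('md5sum' => md5_file($row['path'])),
    array('id' => $row['id'])
    );
}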

I think the current duplicate filter is useless for most users anyway.

Last edited by thimo (2013-01-19 12:23:35)


#8 2013-09-10 21:06:11

msakik
Translation Team
São Paulo, Brazil
2013-09-06
78

Re: Duplicates

Hello, I know this is an old thread, but another possibility is to find duplicated files by using similarity, like convolution in maths. This way the same picture with different names and sizes could be classified as very similar.
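One cheap variant of this idea is a perceptual "average hash": scale each photo down to 8x8 grayscale and record, per pixel, whether it is brighter than the mean; near-identical pictures then differ in only a few bits. A rough sketch using PHP's GD extension (my own illustration, not Piwigo code):

Code:

// average-hash (aHash) sketch: returns a 64-character bit string
function average_hash($path)
{
  $src = imagecreatefromstring(file_get_contents($path));
  $img = imagecreatetruecolor(8, 8);
  imagecopyresampled($img, $src, 0, 0, 0, 0, 8, 8, imagesx($src), imagesy($src));

  $gray = array();
  for ($y = 0; $y < 8; $y++)
  {
    for ($x = 0; $x < 8; $x++)
    {
      $rgb = imagecolorat($img, $x, $y);
      // rough luminance: average of the R, G and B channels
      $gray[] = (($rgb >> 16 & 0xFF) + ($rgb >> 8 & 0xFF) + ($rgb & 0xFF)) / 3;
    }
  }
  $mean = array_sum($gray) / 64;

  $bits = '';
  foreach ($gray as $g)
  {
    $bits .= ($g >= $mean) ? '1' : '0';
  }
  return $bits;
}

// two photos are "very similar" when few bit positions differ (Hamming distance)
function hash_distance($a, $b)
{
  return count(array_diff_assoc(str_split($a), str_split($b)));
}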


#9 2013-09-10 21:17:59

mistic100
Former Piwigo Team
Lyon (FR)
2008-09-27
3277

Re: Duplicates

that's indeed the best solution, but this kind of algorithm is much too complex and resource consuming


#10 2013-09-19 03:23:10

planetb
Member
2013-09-19
4

Re: Duplicates

thimo wrote:

We should add the hash also to imports from the galleries folder:

Code:

    $insert = array(
      'id'             => $next_element_id++,
      'file'           => $filename,
      'name'           => get_name_from_file($filename),
      'date_available' => CURRENT_DATE,
      'path'           => $path,
      'representative_ext'  => $fs[$path]['representative_ext'],
      'storage_category_id' => $db_fulldirs[$dirname],
      'added_by'       => $user['id'],
+      'md5sum'         => md5_file($path),
      );

We could add a batch function to recalculate the md5 hash. It is not that time consuming (see http://php.net/manual/de/function.md5-file.php#81751 ).

I think the current duplicate filter is useless for most users anyway.

I just found the issue of duplicate images while trying to query the database to import comments from an old comment table/db I have from my old picture server (a home-grown, hacked-up phpix). I have 30k images, and in some cases the images have the same filename. I worked around this in my old database by using an md5 hash and storing it in the database before editing the comments. For a new comment, I would check the hash of the original file and see whether it was already in the table before the comment was updated or added. I have had no problems with this in 5+ years of use.

The issue I saw tonight, while trying to figure out how to import my comments from one table to another correctly, is that one file name shows up four times:

select `file`,`id` from `piwigo_images` where FILE='DSC01450.JPG';
+--------------+-------+
| file         | id    |
+--------------+-------+
| DSC01450.JPG | 15255 |
| DSC01450.JPG | 16252 |
| DSC01450.JPG | 16952 |
| DSC01450.JPG | 25967 |
+--------------+-------+
4 rows in set (0.01 sec)


My original table showed the same file name only once in this example, as I only had a comment on one of the files:

SELECT * FROM comments WHERE fileid = "DSC01450.JPG";
+--------------+-----------------------------+----------------------------------+
| fileid       | desctxt                     | uniqueid                         |
+--------------+-----------------------------+----------------------------------+
| DSC01450.JPG | from inside the hotel room. | 3cc71e12ca10a5d57859e7d17140f56c |
+--------------+-----------------------------+----------------------------------+
1 row in set (0.00 sec)


I am not at all good at scripting/writing PHP and SQL queries, but I will figure it out at some point :)
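Since the old comments table already stores the file's md5 in uniqueid, one possible approach is to match on the checksum instead of the filename, once md5sum is filled in piwigo_images. A rough sketch (the piwigo_comments column list is my assumption from the default schema; check it, e.g. for required columns like anonymous_id, before running anything like this):

Code:

INSERT INTO piwigo_comments (image_id, author, content, date, validated)
SELECT i.id, 'admin', c.desctxt, NOW(), 'true'
  FROM comments c
  JOIN piwigo_images i ON i.md5sum = c.uniqueid;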


#11 2013-10-07 00:55:14

msakik
Translation Team
São Paulo, Brazil
2013-09-06
78

Re: Duplicates

As md5sum already exists in the database, isn't it reasonable that the find-duplicates function use it instead of names? Or both.


#12 2014-02-13 06:03:15

trash80
Guest

Re: Duplicates

You don't necessarily need to md5 (or any other hash) the whole file, only a part of it, like the first 2k bytes (for instance). I don't recall where in the file each part lives (it's been a while since I did image work), but depending on the type, e.g. .jpg, .png, etc., you can look at 2k (or whatever amount you decide) anywhere.

I wrote a Python dedupe program that reads the first 5k of files to compare them; the creator/maintainer of the file browser SpaceFM used a very similar dedupe.

I know my Python, Java, and JavaScript (I'm a professional software engineer), but I'm rusty with my PHP (I haven't coded in PHP in about 4 years). You would have something like:

Code:

$f = "path/to/text/file.txt";    //String file path
$size = filesize($f);  // File size (how much data to read)
$fH = fopen($f,"r");   // File handle
$data = fread($fH,$size);  // Read data from file handle
fclose($fH);  // Close handle
$hash = md5($data);

Of course, SHA-1 or another hash would work. I'm not familiar with each hash library's speed, but that should be taken into account.

I have not tried out this code, just throwing out the idea.

 

#13 2014-02-13 06:34:52

trash80
Member
2014-02-13
1

Re: Duplicates

Code:

$f = "path/to/text/file.txt";    //String file path
$size = 2048;  // File size (how much data to read)
$fH = fopen($f,"r");   // File handle
$data = fread($fH,$size);  // Read data from file handle
fclose($fH);  // Close handle
$hash = md5($data);

Fixed to use a set byte count rather than the file size.
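To actually find duplicates with this, every file gets hashed and equal partial hashes are grouped; a rough sketch of that loop (my own illustration; equal first-2k hashes are only candidates and should be confirmed with a full-file comparison):

Code:

// group files by the hash of their first 2048 bytes
function partial_hash($f, $size = 2048)
{
  $fH = fopen($f, 'r');
  $data = fread($fH, $size);   // fread reads at most $size bytes
  fclose($fH);
  return md5($data);
}

$groups = array();
foreach (glob('galleries/*/*.{jpg,JPG,png}', GLOB_BRACE) as $f)
{
  $groups[partial_hash($f)][] = $f;
}

foreach ($groups as $hash => $files)
{
  if (count($files) > 1)
  {
    // candidate duplicates; confirm with md5_file() on the whole files
    echo $hash.': '.implode(', ', $files)."\n";
  }
}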


#14 2016-12-11 20:05:59

plg
Piwigo Team
Nantes, France, Europe
2002-04-05
13786

Re: Duplicates

[Github] Piwigo issue #210

So I have added a new "checksum" checkbox to the list of "find duplicates based on" options in the batch manager. The main problem is that photos added with synchronization have no md5sum (in the database, I mean), so this filter won't work for them.

We need a feature to compute checksums. I don't think it's a good idea to have these features in the core, because they are only useful to very few people, i.e. people searching for duplicates based on checksum and using sync, which is less and less used compared to other upload methods. The best idea I have now is to create a plugin "Compute Checksum" that would add features to the Batch Manager (see the sketch after this list):

* a filter "with no checksum"
* an action "compute checksum"
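A very rough sketch of the filter half, based on my reading of the batch manager prefilter triggers in batch_manager.php (trigger and key names untested; the plugin name and function prefix are made up):

Code:

// main.inc.php of a hypothetical "Compute Checksum" plugin
add_event_handler('get_batch_manager_prefilters', 'ccs_add_prefilter');
function ccs_add_prefilter($prefilters)
{
  $prefilters[] = array('ID' => 'no_checksum', 'NAME' => 'With no checksum');
  return $prefilters;
}

add_event_handler('perform_batch_manager_prefilters', 'ccs_perform_prefilter');
function ccs_perform_prefilter($filter_sets, $prefilter)
{
  if ('no_checksum' == $prefilter)
  {
    $query = '
SELECT id
  FROM '.IMAGES_TABLE.'
  WHERE md5sum IS NULL
;';
    $filter_sets[] = array_from_query($query, 'id');
  }
  return $filter_sets;
}

The "compute checksum" action could then reuse a batch loop like the one sketched under post #7 above.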


#15 2019-08-09 19:29:24

dd-b
Member
Minneapolis, MN USA
2018-04-16
69

Re: Duplicates

Everything I load comes in as FTP uploads to galleries, so the lack of checksums on that data makes the feature useless for me. (It's a public display of a photo archive; pictures come in in big chunks, plus they get updated sometimes, so FTP upload and sync is by far the best way to handle it.) I was all excited when I saw the checksum field in the database and the checkbox on the batch manager, but I also noticed that all my photos had NULL checksum fields.

Computing the checksum should be an automatic part of taking in an FTP upload. Is there any reason it isn't? Nobody has actually given a reason for it not being done in the threads I've found and read so far.

