Piwigo.org


#1 2012-12-18 17:29:52

Tadjio
Member
Location: UK
Registered: 2012-05-02
Posts: 432

Duplicates

For some time now I have found the Batch Manager's definition of "Duplicate" very limited, as it only looks for photos with exactly the same filename.

This is a particular problem for me because I have always used Canon digital cameras, and each time I buy a new one the filenames start again at IMG_0001.jpg, so I have accumulated many 'false' duplicates over the years.

I think it would be useful to have some (optional) additional tests:

1) To also test if the Creation Date (and Time) is the same, as well as the filename.

2) To possibly test if the Pixel Dimensions (Width and Height) are also the same.
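
The two optional tests above can be sketched as a composite key: photos only count as duplicates when every selected field matches. This is a minimal illustration on plain arrays, not Batch Manager code; the function name and the column names (date_creation, width, height) are assumptions made for the example.

```php
<?php
// Hypothetical sketch: group photo rows by a composite key instead of the
// filename alone. Column names are assumptions, not confirmed Piwigo schema.

function find_duplicate_groups(array $photos, array $fields): array
{
    $groups = [];
    foreach ($photos as $photo) {
        $parts = [];
        foreach ($fields as $field) {
            $parts[] = (string) ($photo[$field] ?? '');
        }
        // NUL cannot appear in these values, so the joined key is unambiguous.
        $key = implode("\0", $parts);
        $groups[$key][] = $photo['id'];
    }
    // Keep only keys shared by more than one photo.
    return array_values(array_filter($groups, fn($ids) => count($ids) > 1));
}

// Made-up rows: three cameras' worth of IMG_0001.jpg, two of them identical.
$photos = [
    ['id' => 1, 'file' => 'IMG_0001.jpg', 'date_creation' => '2008-05-01 10:00:00', 'width' => 3072, 'height' => 2304],
    ['id' => 2, 'file' => 'IMG_0001.jpg', 'date_creation' => '2012-07-14 16:30:00', 'width' => 4000, 'height' => 3000],
    ['id' => 3, 'file' => 'IMG_0001.jpg', 'date_creation' => '2008-05-01 10:00:00', 'width' => 3072, 'height' => 2304],
];

$by_name = find_duplicate_groups($photos, ['file']);
$strict  = find_duplicate_groups($photos, ['file', 'date_creation', 'width', 'height']);
```

With the filename alone all three rows fall into one group; adding date and dimensions leaves only photos 1 and 3.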

Does anyone have any other suggestions?

I also see that there is a 'bug' request reported by sakanaou
"0002801: [Batch Manager] Find duplicates based on md5sum"


Tadjio

Offline

 

#2 2012-12-18 21:02:03

flop25
Piwigo Team
Registered: 2006-07-06
Posts: 6392
Website

Re: Duplicates

Hello
You're perfectly right. The md5 check is really the ultimate duplicate check.


To get a better help : Politeness like Hello-A link-Your past actions precisely described
Check my extensions : more than 30 available
who I am and what I do : http://fr.gravatar.com/flop25
My gallery : an illustration of how to integrate Piwigo in your website

Offline

 

#3 2013-01-19 02:00:25

thimo
Member
Registered: 2013-01-14
Posts: 6

Re: Duplicates

There is also a configuration property named "uniqueness_mode", so I would suggest modifying the file "batch_manager.php" as follows:

Code:

    // perform 2 queries instead. We hope there are not too many duplicates.

+     $field = ($conf['uniqueness_mode'] == 'md5sum') ? 'md5sum' : 'file';

     $query = '
-SELECT file
+SELECT '.$field.'
   FROM '.IMAGES_TABLE.'
-  GROUP BY file
+  GROUP BY '.$field.'
   HAVING COUNT(*) > 1
 ;';
-    $duplicate_files = array_from_query($query, 'file');
+    $duplicate_files = array_from_query($query, $field);
 
     $query = '
 SELECT id
   FROM '.IMAGES_TABLE.'
-  WHERE file IN (\''.implode("','", $duplicate_files).'\')
+  WHERE '.$field.' IN (\''.implode("','", $duplicate_files).'\')
 ;';
 
     array_push(

And we need an additional index for the field "md5sum" in the database.

Offline

 

#4 2013-01-19 02:13:20

mistic100
Piwigo Team
Location: Lyon (FR)
Registered: 2008-09-27
Posts: 3259
Website

Re: Duplicates

This field already exists, but it is NULL for photos stored in the "galleries" folder.
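
One consequence worth sketching: in SQL, GROUP BY places all NULL values in a single group, so a duplicate query on md5sum would report the photos without a checksum as one big "duplicate" value unless NULLs are filtered out. The helper below is a hypothetical illustration (the function name and whitelist are assumptions, not Piwigo code); it only builds the query string.

```php
<?php
// Sketch only: build_duplicate_query() and its whitelist are assumptions.

function build_duplicate_query(string $table, string $field): string
{
    // Only allow known column names, since the field is spliced into SQL.
    $allowed = ['file', 'md5sum'];
    if (!in_array($field, $allowed, true)) {
        throw new InvalidArgumentException("Unsupported field: $field");
    }
    return "SELECT $field FROM $table"
         . " WHERE $field IS NOT NULL"   // skip rows that have no checksum yet
         . " GROUP BY $field"
         . " HAVING COUNT(*) > 1;";
}

echo build_duplicate_query('piwigo_images', 'md5sum'), "\n";
```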


» All my plugins  » My website
For an efficient support give the URL of your website

Offline

 

#5 2013-01-19 02:43:58

thimo
Member
Registered: 2013-01-14
Posts: 6

Re: Duplicates

Then we have to add it!?
Where can I find the code that synchronizes the galleries folder?

Offline

 

#6 2013-01-19 11:22:44

mistic100
Piwigo Team
Location: Lyon (FR)
Registered: 2008-09-27
Posts: 3259
Website

Re: Duplicates

There is certainly a good reason why it's not synced... or it's just historical.

The code is in admin/site_update.php.

Anyway, we need to keep in mind that if we start to compute the md5sum of "galleries" photos, we can't do it for photos that are already synced (I think computing the hash of thousands of photos is very time-consuming).



Offline

 

#7 2013-01-19 12:22:08

thimo
Member
Registered: 2013-01-14
Posts: 6

Re: Duplicates

We should also add the hash to imports from the galleries folder:

Code:

    $insert = array(
      'id'             => $next_element_id++,
      'file'           => $filename,
      'name'           => get_name_from_file($filename),
      'date_available' => CURRENT_DATE,
      'path'           => $path,
      'representative_ext'  => $fs[$path]['representative_ext'],
      'storage_category_id' => $db_fulldirs[$dirname],
      'added_by'       => $user['id'],
+      'md5sum'         => md5_file($path),
      );

We could add a batch function to recalculate the md5 hash. It is not that time consuming (see http://php.net/manual/de/function.md5-file.php#81751 ).
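
How time-consuming it really is depends on disk and CPU, so it may be easier to measure than to argue. A rough timing sketch, assuming an arbitrary 8 MiB temporary file stands in for a photo:

```php
<?php
// Rough timing sketch; the file size is arbitrary and the result depends on
// the machine. It only uses md5_file() on a throwaway temporary file.

$path = tempnam(sys_get_temp_dir(), 'md5');
$chunk = str_repeat('x', 1024 * 1024);      // 1 MiB of filler data
$handle = fopen($path, 'wb');
for ($i = 0; $i < 8; $i++) {                // write an 8 MiB test file
    fwrite($handle, $chunk);
}
fclose($handle);

$size = filesize($path);
$start = microtime(true);
$hash = md5_file($path);
$elapsed = microtime(true) - $start;

printf("md5_file on %d bytes took %.4f s (%s)\n", $size, $elapsed, $hash);
unlink($path);                              // clean up the temporary file
```

Multiplying the measured time by the number of photos gives a first estimate for a full recalculation.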

I think the current duplicate filter is useless for most users anyway.

Last edited by thimo (2013-01-19 12:23:35)

Offline

 

#8 2013-09-10 21:06:11

msakik
Translation Team
Location: São Paulo, Brazil
Registered: 2013-09-06
Posts: 78

Re: Duplicates

Hello, I know this is an old thread, but another possibility is to find duplicate files by similarity, like convolution in maths. This way, the same picture stored under different names and sizes could be classified as very similar.
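
To give a feel for the idea, here is a minimal sketch of an "average hash", a simpler technique than convolution that is swapped in here purely for illustration. It assumes each picture has already been reduced to a tiny grayscale thumbnail, given as a 2D array of 0-255 values; producing that thumbnail is outside the sketch, and both function names are hypothetical.

```php
<?php
// Average hash: one bit per pixel, set when the pixel is at least as bright
// as the mean. Similar pictures give bit strings that differ in few positions.

function average_hash(array $pixels): string
{
    $flat = array_merge(...$pixels);
    if (count($flat) === 0) {
        throw new InvalidArgumentException('Empty thumbnail');
    }
    $mean = array_sum($flat) / count($flat);
    $bits = '';
    foreach ($flat as $value) {
        $bits .= ($value >= $mean) ? '1' : '0';
    }
    return $bits;
}

// Number of positions where two equal-length bit strings differ.
function hamming_distance(string $a, string $b): int
{
    if (strlen($a) !== strlen($b)) {
        throw new InvalidArgumentException('Hashes must have equal length');
    }
    $distance = 0;
    for ($i = 0; $i < strlen($a); $i++) {
        if ($a[$i] !== $b[$i]) {
            $distance++;
        }
    }
    return $distance;
}

// Made-up 2x2 thumbnails for illustration only.
$original  = [[10, 200], [220, 30]];
$brighter  = [[30, 220], [240, 50]];   // same picture, uniformly brighter
$different = [[200, 10], [30, 220]];   // inverted pattern

$near = hamming_distance(average_hash($original), average_hash($brighter));
$far  = hamming_distance(average_hash($original), average_hash($different));
```

A small distance suggests the same picture; where to put the threshold is a tuning decision this sketch does not answer.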

Offline

 

#9 2013-09-10 21:17:59

mistic100
Piwigo Team
Location: Lyon (FR)
Registered: 2008-09-27
Posts: 3259
Website

Re: Duplicates

That's indeed the best solution, but this kind of algorithm is too complex and too resource-consuming.



Offline

 

#10 2013-09-19 03:23:10

planetb
Member
Registered: 2013-09-19
Posts: 4

Re: Duplicates

thimo wrote:

We should also add the hash to imports from the galleries folder:

Code:

    $insert = array(
      'id'             => $next_element_id++,
      'file'           => $filename,
      'name'           => get_name_from_file($filename),
      'date_available' => CURRENT_DATE,
      'path'           => $path,
      'representative_ext'  => $fs[$path]['representative_ext'],
      'storage_category_id' => $db_fulldirs[$dirname],
      'added_by'       => $user['id'],
+      'md5sum'         => md5_file($path),
      );

We could add a batch function to recalculate the md5 hash. It is not that time consuming (see http://php.net/manual/de/function.md5-file.php#81751 ).

I think the current duplicate filter is useless for most users anyway.

I just ran into the duplicate image issue while trying to query the database to import comments from an old comment table/db I have from my old picture server (a home-grown, hacked-up phpix). I have 30k images, and in some cases images have the same filename. I worked around this in my old database by computing an md5 hash and storing it in the database before editing the comments. If it was a new comment, I would check the hash of the original file and then see whether it was already in the table before the comment was updated or added. I have not had any problems with this in 5+ years of use.

The issue I saw tonight, while trying to figure out how to import my comments from one table to another correctly, is that one file name shows up 4 times:

select `file`,`id` from `piwigo_images` where FILE='DSC01450.JPG';
+--------------+-------+
| file         | id    |
+--------------+-------+
| DSC01450.JPG | 15255 |
| DSC01450.JPG | 16252 |
| DSC01450.JPG | 16952 |
| DSC01450.JPG | 25967 |
+--------------+-------+
4 rows in set (0.01 sec)


My original table showed the same file name only once in this example, as I only had a comment on one of the files:

SELECT * FROM comments WHERE fileid = "DSC01450.JPG";
+--------------+-----------------------------+----------------------------------+
| fileid       | desctxt                     | uniqueid                         |
+--------------+-----------------------------+----------------------------------+
| DSC01450.JPG | from inside the hotel room. | 3cc71e12ca10a5d57859e7d17140f56c |
+--------------+-----------------------------+----------------------------------+
1 row in set (0.00 sec)


I am not at all good at scripting/writing PHP and SQL queries, but will figure it out at some point :)
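
The matching step could look roughly like the sketch below. It assumes that `uniqueid` in the old comments table is the MD5 of the original file and that every image row carries a matching `md5sum`; both are assumptions, as are the function name and the placeholder data.

```php
<?php
// Hypothetical sketch: attach old comments to images by checksum rather than
// by filename, so four files named DSC01450.JPG cannot be confused.

function match_comments_to_images(array $comments, array $images): array
{
    // Index image ids by checksum, skipping rows that have none.
    $id_by_hash = [];
    foreach ($images as $image) {
        if (!empty($image['md5sum'])) {
            $id_by_hash[$image['md5sum']] = $image['id'];
        }
    }
    $matched = [];
    foreach ($comments as $comment) {
        $hash = $comment['uniqueid'];
        if (isset($id_by_hash[$hash])) {
            $matched[] = ['image_id' => $id_by_hash[$hash], 'text' => $comment['desctxt']];
        }
    }
    return $matched;
}

// Placeholder rows with made-up ids and hashes, for illustration only.
$images = [
    ['id' => 1, 'file' => 'DSC01450.JPG', 'md5sum' => 'hash-a'],
    ['id' => 2, 'file' => 'DSC01450.JPG', 'md5sum' => 'hash-b'],
    ['id' => 3, 'file' => 'DSC01450.JPG', 'md5sum' => null],
];
$comments = [
    ['fileid' => 'DSC01450.JPG', 'desctxt' => 'from inside the hotel room.', 'uniqueid' => 'hash-b'],
];

$matched = match_comments_to_images($comments, $images);
```

Images whose md5sum is NULL can never match, which is the same limitation mentioned earlier in the thread for synchronized photos.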

Offline

 

#11 2013-10-07 00:55:14

msakik
Translation Team
Location: São Paulo, Brazil
Registered: 2013-09-06
Posts: 78

Re: Duplicates

As md5sum already exists in the database, wouldn't it be reasonable for the find-duplicates function to use it instead of names? Or both.

Offline

 

#12 2014-02-13 06:03:15

trash80
Guest

Re: Duplicates

You don't necessarily need to md5 (or otherwise hash) the whole file; hashing only part of it, such as the first 2k bytes, can be enough. I don't recall where each part sits in the file (it has been a while since I did image work), but depending on the type, e.g. .jpg, .png, etc., you can look at 2k (or whatever amount you decide) anywhere.

I wrote a python dedupe program that reads the first 5k of files to compare them; the creator/maintainer of the file browser SpaceFM used a very similar dedupe.

I know my Python, Java, and JavaScript (I'm a professional software engineer), but I'm rusty with PHP (I haven't coded in it in about 4 years). You would have something like:

Code:

$f = "path/to/text/file.txt";   // File path (string)
$size = filesize($f);           // Number of bytes to read (here: the whole file)
$fH = fopen($f, "rb");          // File handle, opened in binary mode
$data = fread($fH, $size);      // Read data from the file handle
fclose($fH);                    // Close the handle
$hash = md5($data);             // Hash the bytes that were read

Of course, sha1 or another hash would work. I'm not familiar with the speed of each hash library, but that should be taken into account.

I have not tried out this code, just throwing out the idea.

 

#13 2014-02-13 06:34:52

trash80
Member
Registered: 2014-02-13
Posts: 1

Re: Duplicates

Code:

$f = "path/to/text/file.txt";   // File path (string)
$size = 2048;                   // Number of bytes to read (first 2 KB only)
$fH = fopen($f, "rb");          // File handle, opened in binary mode
$data = fread($fH, $size);      // Read at most $size bytes from the handle
fclose($fH);                    // Close the handle
$hash = md5($data);             // Hash only the bytes that were read

Fixed to read a set number of bytes rather than the whole file.
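
One caveat: files that share their first 2 KB are not necessarily identical, so a partial hash can only nominate candidates. A possible two-stage sketch, with hypothetical function names, uses file size plus the partial hash as a cheap filter and confirms each candidate group with a full md5_file():

```php
<?php
// Two-stage sketch: cheap partial hash first, full hash only for candidates.

function partial_md5(string $path, int $bytes = 2048): string
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }
    $data = fread($handle, $bytes);      // at most $bytes from the start
    fclose($handle);
    return md5($data === false ? '' : $data);
}

function find_exact_duplicates(array $paths, int $bytes = 2048): array
{
    // Stage 1: group by file size and partial hash.
    $candidates = [];
    foreach ($paths as $path) {
        $key = filesize($path) . ':' . partial_md5($path, $bytes);
        $candidates[$key][] = $path;
    }
    // Stage 2: confirm each candidate group with a hash of the whole file.
    $duplicates = [];
    foreach ($candidates as $group) {
        if (count($group) < 2) {
            continue;
        }
        $confirmed = [];
        foreach ($group as $path) {
            $confirmed[md5_file($path)][] = $path;
        }
        foreach ($confirmed as $same) {
            if (count($same) > 1) {
                $duplicates[] = $same;
            }
        }
    }
    return $duplicates;
}

// Demo with temporary files: $a and $b are identical, $c differs only in its
// last byte, so it passes stage 1 but is rejected in stage 2.
$a = tempnam(sys_get_temp_dir(), 'dup');
$b = tempnam(sys_get_temp_dir(), 'dup');
$c = tempnam(sys_get_temp_dir(), 'dup');
file_put_contents($a, str_repeat('A', 4096));
file_put_contents($b, str_repeat('A', 4096));
file_put_contents($c, str_repeat('A', 4095) . 'B');

$result = find_exact_duplicates([$a, $b, $c]);
array_map('unlink', [$a, $b, $c]);       // clean up the temporary files
```

Only the files that survive the cheap first stage are read in full, which is where the time saving would come from.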

Offline

 

#14 2016-12-11 20:05:59

plg
Piwigo Team
Location: Nantes, France, Europe
Registered: 2002-04-05
Posts: 12949
Website

Re: Duplicates

[Github] Piwigo issue #210

So I have added a new checkbox "checksum" on the list of "find duplicates based on" options on the batch manager. The main problem is that photos added with synchronization have no md5sum (in the database, I mean), so this filter won't work for them.

We need a feature to compute checksums. I don't think it's a good idea to have this feature in core, because it's only useful to very few people, i.e. people searching for duplicates based on checksum who also use sync, which is used less and less compared to other upload methods. The best idea I have now is to create a plugin "Compute Checksum" that would add features to the Batch Manager:

* a filter "with no checksum"
* an action "compute checksum"
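
The two features could be sketched as below. This is a hypothetical illustration on plain arrays, not the plugin itself: the function names are made up, and a real plugin would read from and write to the images table instead.

```php
<?php
// Hypothetical sketch of the filter and the action described above.

// Filter "with no checksum": keep rows whose md5sum is missing or empty.
function filter_without_checksum(array $images): array
{
    return array_values(array_filter($images, fn($image) => empty($image['md5sum'])));
}

// Action "compute checksum": return id => md5 for files that exist on disk.
function compute_checksums(array $images): array
{
    $updates = [];
    foreach ($images as $image) {
        if (is_file($image['path'])) {
            $updates[$image['id']] = md5_file($image['path']);
        }
    }
    return $updates;
}

// Demo: one photo already has a checksum, one synchronized photo does not.
$tmp = tempnam(sys_get_temp_dir(), 'img');
file_put_contents($tmp, 'hello');
$images = [
    ['id' => 1, 'path' => '/uploaded/photo.jpg', 'md5sum' => md5('other')],
    ['id' => 2, 'path' => $tmp, 'md5sum' => null],
];

$pending = filter_without_checksum($images);
$updates = compute_checksums($pending);
unlink($tmp);                            // clean up the temporary file
```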

Offline

 
