Deduplicating Near Duplicates

Any modification to an image file, even one barely visible to the human eye, changes its binary representation, so standard deduplication methods that compare file contents fail.

Such modifications include:

Recoding the image to another file format
Changing the image size in pixels
Changing the image size in bytes by changing the JPEG quality
Tuning of color or brightness
Minor cropping
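To illustrate why byte-level deduplication breaks down, here is a minimal sketch using a content hash. The byte string is a hypothetical stand-in for an image file, not real image data, and SHA-256 is just one example of a "standard" exact-match method.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of a byte string as hex."""
    return hashlib.sha256(data).hexdigest()


# Stand-in for the raw bytes of an image file (hypothetical data).
original = bytes(range(256)) * 4

# Simulate a barely visible edit: nudge a single byte by one step.
edited = bytearray(original)
edited[100] = (edited[100] + 1) % 256
modified = bytes(edited)

# Identical bytes give identical hashes; one changed byte gives a different hash.
print(sha256_hex(original) == sha256_hex(bytes(original)))
print(sha256_hex(original) == sha256_hex(modified))
```

Because the digest changes completely after a one-byte edit, exact-match hashing cannot tell a near-duplicate from an unrelated file.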

To deduplicate a set of images that includes near-duplicates, one can use semantic search. The idea is to encode each image as a vector with an appropriate deep neural network and then compare distances between those vectors. Images that are close in this vector space are likely to be near-duplicates.
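The distance comparison can be sketched as follows. This is only an illustration: the three-dimensional vectors are made-up embeddings, and cosine distance is an assumption, since the text does not say which metric the search function uses.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity; 0 means the vectors point the same way."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical embeddings; real ones typically have hundreds of dimensions.
photo = [0.9, 0.1, 0.3]
photo_resized = [0.88, 0.12, 0.31]
other_image = [0.1, 0.9, 0.2]

print(cosine_distance(photo, photo_resized))  # small: likely near-duplicates
print(cosine_distance(photo, other_image))    # large: unrelated images
```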

Sounds complicated? It doesn't have to be! With a Nyckel Image Search function, this can be done by following the pseudo-code below.

INPUT: set of images that need to be deduplicated

1. Create a new image search function.
2. Add all images to the function. Store the id for each image.
3. Search the function for each image using sampleCount=2. Store each response.
4. For each image IMG:

   a. Get the search response corresponding to IMG.
   b. Read out the second searchSample entry from the response. (The first entry corresponds to the self-match.)
   c. If searchSample.distance < 1%, IMG and searchSample.sampleId are near-duplicates.

5. Delete the function.
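The per-image check in the pseudo-code above can be sketched locally, without any network calls. The response shape used here (a list of entries with sampleId and distance keys per image) is an assumption based on the field names in the pseudo-code, and the threshold assumes that "1%" means 0.01 on a 0-to-1 distance scale.

```python
from typing import Dict, List, Set, Tuple

THRESHOLD = 0.01  # assumption: "1%" corresponds to 0.01 on a 0-to-1 scale


def find_near_duplicates(
    responses: Dict[str, List[dict]], threshold: float = THRESHOLD
) -> Set[Tuple[str, ...]]:
    """Return unordered id pairs whose nearest non-self match is below threshold."""
    pairs: Set[Tuple[str, ...]] = set()
    for image_id, search_samples in responses.items():
        if len(search_samples) < 2:
            continue  # no second entry to inspect
        nearest = search_samples[1]  # entry 0 is expected to be the self-match
        if nearest["distance"] < threshold:
            # Sort so that (a, b) and (b, a) collapse into one pair.
            pairs.add(tuple(sorted((image_id, nearest["sampleId"]))))
    return pairs


# Hypothetical stored responses, keyed by the id saved when adding each image.
responses = {
    "a": [{"sampleId": "a", "distance": 0.0}, {"sampleId": "b", "distance": 0.004}],
    "b": [{"sampleId": "b", "distance": 0.0}, {"sampleId": "a", "distance": 0.004}],
    "c": [{"sampleId": "c", "distance": 0.0}, {"sampleId": "a", "distance": 0.35}],
}
print(find_near_duplicates(responses))
```

Storing pairs as sorted tuples avoids reporting the same near-duplicate twice when both images find each other.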

Optional speed-ups:

Resize images to 224x224 pixels for faster uploads.
Use multithreading when adding images to the function and when searching it (steps 2 and 3).
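A minimal sketch of the multithreading speed-up, using Python's standard thread pool. The fake_upload function is a hypothetical placeholder for whatever network call adds one image or runs one search; it is not part of any real client library.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def run_in_threads(
    task: Callable[[T], R], items: Iterable[T], max_workers: int = 8
) -> List[R]:
    """Apply task to every item concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(task, items))


def fake_upload(path: str) -> str:
    # Hypothetical stand-in for the request that adds one image
    # to the function and returns the id assigned to it.
    return f"id-for-{path}"


image_paths = ["cat.jpg", "cat_small.jpg", "dog.png"]
sample_ids = run_in_threads(fake_upload, image_paths)
print(dict(zip(image_paths, sample_ids)))
```

Threads suit this workload because each call spends most of its time waiting on the network. Since the results come back in input order, ids can be paired with their image paths directly.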

Code sample

A Python code sample is provided in our code samples repo.


        python -m dedupe <nyckel_client_id> <nyckel_secret_id> <path_to_folder_with_image_files>
