Deduplicating Near Duplicates
Any modification to an image file, even one barely visible to the human eye, changes its binary representation enough that standard deduplication methods fail.
Such modifications include:
- Re-encoding the image in another file format
- Changing the image size in pixels
- Changing the image size in bytes by changing the JPEG quality
- Adjusting color or brightness
- Minor cropping
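To see why exact-match deduplication breaks down, here is a minimal sketch. It assumes that "standard" deduplication means comparing a cryptographic hash of the file bytes, which is a common approach but not one the text names. The byte string is a stand-in for real image data.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return the SHA-256 hex digest, a typical exact-match dedup key."""
    return hashlib.sha256(data).hexdigest()


# Stand-in for image bytes; a real file would be read with open(path, "rb").
original = bytes(range(256)) * 4

# Flip the lowest bit of a single byte, as a tiny edit to the file might.
changed = bytearray(original)
changed[100] ^= 1
modified = bytes(changed)

# Identical bytes produce the same key; one changed bit produces a different one.
print(fingerprint(original) == fingerprint(original))
print(fingerprint(original) == fingerprint(modified))
```

Any of the modifications listed above changes far more than one bit, so the hashes of the two files will not match.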
To deduplicate a set of images that includes near-duplicates, one can use semantic search. The idea is to encode each image with an appropriate deep neural network and then compare the distances between the resulting vectors. Images that are close in this vector space are likely to be near-duplicates.
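The distance comparison can be sketched in a few lines. This is an illustration only: the vectors below are made up rather than produced by a real network, and cosine distance is an assumption, since the text does not say which distance measure is used.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def near_duplicate_pairs(embeddings, threshold):
    """Return every pair of names whose vectors are closer than threshold."""
    names = sorted(embeddings)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if cosine_distance(embeddings[first], embeddings[second]) < threshold:
                pairs.append((first, second))
    return pairs


# Hypothetical 3-dimensional embeddings; real ones have hundreds of dimensions.
embeddings = {
    "cat.jpg": [0.90, 0.10, 0.40],
    "cat_resized.png": [0.91, 0.09, 0.41],
    "dog.jpg": [0.10, 0.95, 0.20],
}
print(near_duplicate_pairs(embeddings, threshold=0.01))
```

Comparing every pair is quadratic in the number of images, which is one reason to hand the nearest-neighbor lookup to a search index instead.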
Sounds complicated? It doesn't have to be! With a Nyckel Image Search function, this can be done by following the pseudo-code below.
INPUT: set of images that need to be deduplicated
1. Create a new image search function.
2. Add all images to the function. Store the id for each image.
3. Search the function for each image using sampleCount=2. Store each response.
4. For each image IMG:
   - Get the search response corresponding to IMG.
   - Read out the second searchSample entry from the response. (The first entry corresponds to the self-match.)
   - If searchSample.distance < 1%, IMG and searchSample.sampleId are near-duplicates.
5. Delete the function.
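The per-image check on the stored responses can be sketched as follows. The response shape is an assumption: each response is taken to be a dict with a "searchSamples" list whose entries carry "sampleId" and "distance", and "< 1%" is read as a distance below 0.01. The function name find_near_duplicates is hypothetical, and the HTTP calls that create, fill, search, and delete the function are left out.

```python
def find_near_duplicates(responses, threshold=0.01):
    """Return sorted (id, id) pairs that the search responses flag as near-duplicates.

    responses maps each image's sample id to its stored search response
    (assumed shape, see the lead-in).
    """
    pairs = set()
    for sample_id, response in responses.items():
        # Skip the self-match. Normally that leaves the second entry,
        # since the first entry corresponds to the image itself.
        others = [m for m in response["searchSamples"] if m["sampleId"] != sample_id]
        if not others:
            continue
        nearest = others[0]
        if nearest["distance"] < threshold:
            # Sort the ids so that A-B and B-A collapse into one pair.
            pairs.add(tuple(sorted((sample_id, nearest["sampleId"]))))
    return sorted(pairs)


# Made-up responses for three images searched with sampleCount=2.
responses = {
    "s1": {"searchSamples": [{"sampleId": "s1", "distance": 0.0},
                             {"sampleId": "s2", "distance": 0.004}]},
    "s2": {"searchSamples": [{"sampleId": "s2", "distance": 0.0},
                             {"sampleId": "s1", "distance": 0.004}]},
    "s3": {"searchSamples": [{"sampleId": "s3", "distance": 0.0},
                             {"sampleId": "s1", "distance": 0.62}]},
}
print(find_near_duplicates(responses))
```

Storing pairs in a set keeps a duplicate from being reported twice when both images find each other.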
Optional speed-ups:
- Resize images to 224x224 pixels for faster uploads.
- Use multithreading when adding to and searching the function (Steps 2 & 3).
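The multithreading speed-up could look like the sketch below, which uses the standard library thread pool. The fake_upload function is a hypothetical stand-in for the network call that adds one image and returns its id. Threads help here because each call spends most of its time waiting on the network.

```python
from concurrent.futures import ThreadPoolExecutor


def run_in_parallel(task, items, max_workers=8):
    """Apply task to every item on a thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(task, items))


def fake_upload(path):
    # Hypothetical stand-in for an HTTP request that adds one image
    # to the function and returns the id assigned to it.
    return "id-for-" + path


paths = ["a.jpg", "b.jpg", "c.jpg"]
sample_ids = run_in_parallel(fake_upload, paths)

# Keep the path-to-id mapping so search results can be traced back to files.
print(dict(zip(paths, sample_ids)))
```

The same helper can be reused for the search step by passing a function that searches for one image and returns its response.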
Code sample
A Python code sample is provided in our codesamples repo. It can be run as follows:
python -m dedupe <nyckel_client_id> <nyckel_secret_id> <path_to_folder_with_image_files>