Find & Delete Duplicate Images with Ease

Welcome to the Image Duplication Detector project. This tool helps you clean up your photo library by finding and removing exact or near-duplicate images. It is built with Python and offers both a command-line interface and a graphical user interface for maximum flexibility.

How the Project Works

1. The script recursively scans a target directory and finds all supported image files (JPG, PNG, GIF, BMP).

2. For each image, it calculates a unique **perceptual hash** that represents its visual content, ignoring minor changes like compression or size. This is done using the `imagehash` library.

3. The tool then compares the hashes of all images. If the **Hamming distance** between two hashes is below a user-defined threshold, the images are considered duplicates.

4. Finally, based on the user's chosen strategy (`keep_first` or `keep_smallest`), the duplicate files are either marked for deletion (dry run) or permanently removed.

Workflow: Scan Directory & Find Images → Calculate Perceptual Hashes → Compare Hashes

Key Concepts

Perceptual Hashing

Unlike a cryptographic hash (like SHA-256) which changes drastically with a single-pixel change, a **perceptual hash** is designed to capture the "fingerprint" of an image's visual content. It generates a hash that is very similar for images that look similar to the human eye. This is what allows the tool to find not just exact duplicates, but also near-duplicates, like resized or slightly edited photos.
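
As a quick illustration (using a placeholder `photo.jpg` path), hashing the same photo at two sizes with `imagehash` yields hashes that are identical or nearly so, while a cryptographic digest of the pixel data changes completely:

# Minimal sketch: perceptual hashes survive resizing, cryptographic hashes do not.
import hashlib

import imagehash
from PIL import Image

original = Image.open("photo.jpg")                          # placeholder path
resized = original.resize((original.width // 2, original.height // 2))

print(imagehash.dhash(original))   # e.g. 8f4c4c2d2d969696
print(imagehash.dhash(resized))    # usually identical, or off by a few bits

print(hashlib.sha256(original.tobytes()).hexdigest()[:16])
print(hashlib.sha256(resized.tobytes()).hexdigest()[:16])   # completely different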

Hamming Distance

The **Hamming distance** is a metric used to compare two hashes. It counts the number of positions at which the corresponding symbols are different. In this project, it's used to measure how "different" two image hashes are. A low Hamming distance means the images are visually very similar, while a high distance means they are very different. The `threshold` setting controls the maximum acceptable distance for two images to be considered duplicates.
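
The `imagehash` library exposes this directly: subtracting two `ImageHash` objects returns their Hamming distance. The hashes below are made-up values for illustration:

# Minimal sketch: comparing two 64-bit hashes (made-up hex values).
import imagehash

h1 = imagehash.hex_to_hash("ff00ff00ff00ff00")
h2 = imagehash.hex_to_hash("ff00ff00ff00ff01")   # differs in a single bit

distance = h1 - h2        # ImageHash overloads "-" as the Hamming distance
print(distance)           # 1

threshold = 10
print(distance <= threshold)   # True -> the two images would be grouped as duplicates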

Object-Oriented Programming (OOP)

The project is structured using OOP principles, which helps to organize the code and make it more maintainable. The `Variables` class, for example, encapsulates all the state of the application (like the target directory, threshold, etc.) into a single object. This keeps the data separate from the functions that operate on it, making the code cleaner and easier to manage.
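
As an illustration only, a state object along these lines could look like the sketch below; the real `cli_backup/variables.py` is not shown here, so the attribute names (taken from `functions.py`) and the defaults are assumptions.

# A hypothetical sketch of a Variables-style state object; attribute names
# mirror those used in functions.py, the defaults are assumptions.
from dataclasses import dataclass, field

@dataclass
class Variables:
    target_directory: str = ""
    threshold: int = 10
    deletion_strategy: str = "keep_first"
    dry_run: bool = True
    duplicate_groups: list = field(default_factory=list)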

How to Run the Script

Run from the Command Line

The `_cli.py` script provides a powerful way to run the detector with customizable arguments.

python _cli.py "C:\path\to\your\images" --threshold 10 --strategy keep_first --dry_run no

The arguments are listed below; a sketch of a matching `argparse` setup follows the list.

  • `directory`: The path to the folder to scan.
  • `--threshold`: (Optional) The maximum Hamming distance. Default is 10.
  • `--strategy`: (Optional) `keep_first` or `keep_smallest`. Default is `keep_first`.
  • `--dry_run`: (Optional) `yes` or `no`. If set to `yes`, no files will be deleted. Default is `yes`.
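
A plausible `argparse` setup matching these options is sketched below; the real `_cli.py` may be organised differently, so treat the details as assumptions.

# A hypothetical argparse setup matching the documented options; the real
# _cli.py may differ.
import argparse

parser = argparse.ArgumentParser(description="Find and delete duplicate images.")
parser.add_argument("directory", help="Path to the folder to scan")
parser.add_argument("--threshold", type=int, default=10,
                    help="Maximum Hamming distance for two images to count as duplicates")
parser.add_argument("--strategy", choices=["keep_first", "keep_smallest"], default="keep_first",
                    help="Which file in each duplicate group to keep")
parser.add_argument("--dry_run", choices=["yes", "no"], default="yes",
                    help="If 'yes', only report duplicates instead of deleting them")

args = parser.parse_args()
dry_run = args.dry_run == "yes"   # convert the yes/no flag to a boolean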

Project Source Code


# cli_backup/functions.py

import os
import imagehash
import logging
from collections import defaultdict
from PIL import Image

logger = logging.getLogger(__name__)

def get_image_hashes(var, hash_size=8, hash_method='dhash'):
    """
    Recursively walks through a directory, computes a perceptual hash for each
    image file, and stores it in a dictionary.
    
    Args:
        var (Variables): The variables object containing the target directory.
        hash_size (int): The size of the hash, which can affect precision.
        hash_method (str): The hashing algorithm to use ('phash', 'ahash', 'dhash').
    
    Returns:
        dict: A dictionary where keys are image hashes and values are a list of
              file paths that share that hash.
    """
    logger.info(f"Scanning directory: {var.target_directory}")
    
    image_hashes = defaultdict(list)

    for dirpath, _, filenames in os.walk(var.target_directory):
        for filename in filenames:
            file_path = os.path.join(dirpath, filename)
            
            if not filename.lower().endswith(('.png', '.jpg', '.jpeg', '.gif', '.bmp')):
                continue

            try:
                img = Image.open(file_path)
                
                # Check if the image is valid
                img.verify()

                # Re-open the image to ensure the file pointer is at the start
                img = Image.open(file_path)
                
                # Compute the hash based on the selected method
                if hash_method == 'phash':
                    image_hash = str(imagehash.phash(img, hash_size=hash_size))
                elif hash_method == 'ahash':
                    image_hash = str(imagehash.average_hash(img, hash_size=hash_size))
                elif hash_method == 'dhash':
                    image_hash = str(imagehash.dhash(img, hash_size=hash_size))
                else:
                    logger.warning(f"Unsupported hash method: {hash_method}. Using 'dhash' by default.")
                    image_hash = str(imagehash.dhash(img, hash_size=hash_size))
                
                # Append the file path to the list for this hash
                image_hashes[image_hash].append(file_path)
            
            except (IOError, OSError) as e:
                logger.error(f"Error processing file {file_path}: {e}")
                continue

    return image_hashes

def find_duplicates(hashes_map, threshold=10):
    """
    Finds groups of duplicate images based on their hashes and a given threshold.
    
    Args:
        hashes_map (dict): A dictionary where keys are image hashes and values
                           are lists of file paths.
        threshold (int): The maximum Hamming distance for two images to be 
                         considered near-duplicates.
                         
    Returns:
        list: A list of lists, where each inner list contains the file paths of
              a group of duplicate images.
    """
    
    # Compare every hash against every other hash. Exact duplicates already
    # share a single hash entry with multiple paths; near-duplicates have
    # distinct hashes that differ by only a few bits, so hashes with a single
    # path must be kept in the comparison as well.
    hash_list = list(hashes_map.items())
    
    duplicate_groups = []
    processed_indices = set()
    
    for i in range(len(hash_list)):
        if i in processed_indices:
            continue
            
        current_hash_str, current_paths = hash_list[i]
        current_hash = imagehash.hex_to_hash(current_hash_str)
        
        # Start a new group with every file that shares this hash
        group = current_paths[:]
        
        # Mark this hash as processed
        processed_indices.add(i)
        
        for j in range(i + 1, len(hash_list)):
            if j in processed_indices:
                continue
            
            other_hash_str, other_paths = hash_list[j]
            other_hash = imagehash.hex_to_hash(other_hash_str)
            
            # Calculate the Hamming distance
            hamming_distance = current_hash - other_hash
            
            if hamming_distance <= threshold:
                # This is a duplicate; add all its paths to the group
                group.extend(other_paths)
                # Mark these files as processed to avoid re-checking
                processed_indices.add(j)
                
        # If the group has more than one file, it's a duplicate group
        if len(group) > 1:
            duplicate_groups.append(group)
            
    return duplicate_groups

def delete_duplicates(var, deletion_strategy='keep_first'):
    """
    Deletes duplicate files based on the specified strategy.
    
    Args:
        var (Variables): The variables object containing duplicate groups.
        deletion_strategy (str): The strategy to use for deletion: 'keep_first' 
                                 or 'keep_smallest'.
    """
    logger.info(f"Using deletion strategy: '{deletion_strategy}'")
    files_to_delete = []

    for group in var.duplicate_groups:
        if deletion_strategy == 'keep_first':
            # Keep the first file found, delete the rest
            files_to_delete.extend(group[1:])
        elif deletion_strategy == 'keep_smallest':
            # Sort files by size and keep the smallest one
            files_and_sizes = [(f, os.path.getsize(f)) for f in group]
            files_and_sizes.sort(key=lambda x: x[1])
            files_to_delete.extend([f for f, s in files_and_sizes[1:]])
        else:
            logger.error(f"Error: Unsupported deletion strategy '{deletion_strategy}'. Using 'keep_first'.")
            files_to_delete.extend(group[1:])

    logger.info("\n--- Duplicate files identified ---")
    if not files_to_delete:
        logger.info("No duplicates found to delete.")
    else:
        for group in var.duplicate_groups:
            if not group: continue
            kept_file = group[0]
            deleted_files_in_group = [f for f in group[1:] if f in files_to_delete]
            if deleted_files_in_group:
                logger.info(f"Group with original kept file: {kept_file}")
                logger.info("  - Files to delete:")
                for file_path in deleted_files_in_group:
                    logger.info(f"    - {file_path}")

    logger.info("-----------------------------------\n")
    
    deleted_count = 0
    if not var.dry_run:
        for file_path in files_to_delete:
            try:
                os.remove(file_path)
                logger.info(f"Deleted file: {file_path}")
                deleted_count += 1
            except OSError as e:
                logger.error(f"Error deleting {file_path}: {e}")
        logger.info(f"\n{deleted_count} files were successfully deleted.")
    else:
        # Dry run block
        logger.info("Dry run enabled. No files will be deleted. Above is a list of files that would have been deleted.")

Project Directory Structure

.
├── build.bat
├── requirements.bat
├── gui.py
├── _cli.py
├── cli_backup/
│   ├── functions.py
│   ├── logger.py
│   └── variables.py
├── gui_backup/
│   └── helper.py
└── logs/