Find & Delete Duplicate Images with Ease

Welcome to the Image Duplication Detector project. This tool helps you clean up your photo library by finding and removing exact or near-duplicate images. It is built with Python and offers both a command-line interface and a graphical user interface for maximum flexibility.

How the Project Works

1. The script recursively scans a target directory and finds all supported image files (JPG, PNG, GIF, BMP).

2. For each image, it calculates a unique **perceptual hash** that represents its visual content, ignoring minor changes like compression or size. This is done using the `imagehash` library.

3. The tool then compares the hashes of all images. If the **Hamming distance** between two hashes is below a user-defined threshold, the images are considered duplicates.

4. Finally, based on the user's chosen strategy (`keep_first` or `keep_smallest`), the duplicate files are either marked for deletion (dry run) or permanently removed.

Workflow: Scan Directory & Find Images → Calculate Perceptual Hashes → Compare Hashes

Key Concepts

Perceptual Hashing

Unlike a cryptographic hash (like SHA-256) which changes drastically with a single-pixel change, a **perceptual hash** is designed to capture the "fingerprint" of an image's visual content. It generates a hash that is very similar for images that look similar to the human eye. This is what allows the tool to find not just exact duplicates, but also near-duplicates, like resized or slightly edited photos.
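
As a quick illustration (using a placeholder `photo.jpg` path), hashing the same photo at two sizes with `imagehash` yields hashes that are identical or nearly so, while a cryptographic digest of the pixel data changes completely:

# Minimal sketch: perceptual hashes survive resizing, cryptographic hashes do not.
import hashlib

import imagehash
from PIL import Image

original = Image.open("photo.jpg")                          # placeholder path
resized = original.resize((original.width // 2, original.height // 2))

print(imagehash.dhash(original))   # e.g. 8f4c4c2d2d969696
print(imagehash.dhash(resized))    # usually identical, or off by a few bits

print(hashlib.sha256(original.tobytes()).hexdigest()[:16])
print(hashlib.sha256(resized.tobytes()).hexdigest()[:16])   # completely different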

Hamming Distance

The **Hamming distance** is a metric used to compare two hashes. It counts the number of positions at which the corresponding symbols are different. In this project, it's used to measure how "different" two image hashes are. A low Hamming distance means the images are visually very similar, while a high distance means they are very different. The `threshold` setting controls the maximum acceptable distance for two images to be considered duplicates.
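
The `imagehash` library exposes this directly: subtracting two `ImageHash` objects returns their Hamming distance. The hashes below are made-up values for illustration:

# Minimal sketch: comparing two 64-bit hashes (made-up hex values).
import imagehash

h1 = imagehash.hex_to_hash("ff00ff00ff00ff00")
h2 = imagehash.hex_to_hash("ff00ff00ff00ff01")   # differs in a single bit

distance = h1 - h2        # ImageHash overloads "-" as the Hamming distance
print(distance)           # 1

threshold = 10
print(distance <= threshold)   # True -> the two images would be grouped as duplicates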

Object-Oriented Programming (OOP)

The project is structured using OOP principles, which helps to organize the code and make it more maintainable. The `Variables` class, for example, encapsulates all the state of the application (like the target directory, threshold, etc.) into a single object. This keeps the data separate from the functions that operate on it, making the code cleaner and easier to manage.
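
As an illustration only, a state object along these lines could look like the sketch below; the real `cli_backup/variables.py` is not shown here, so the attribute names (taken from `functions.py`) and the defaults are assumptions.

# A hypothetical sketch of a Variables-style state object; attribute names
# mirror those used in functions.py, the defaults are assumptions.
from dataclasses import dataclass, field

@dataclass
class Variables:
    target_directory: str = ""
    threshold: int = 10
    deletion_strategy: str = "keep_first"
    dry_run: bool = True
    duplicate_groups: list = field(default_factory=list)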

How to Run the Script

Run from the Command Line

The `_cli.py` script provides a powerful way to run the detector with customizable arguments.

python _cli.py "C:\path\to\your\images" --threshold 10 --strategy keep_first --dry_run no

The arguments are listed below; a sketch of a matching `argparse` setup follows the list.

  • `directory`: The path to the folder to scan.
  • `--threshold`: (Optional) The maximum Hamming distance. Default is 10.
  • `--strategy`: (Optional) `keep_first` or `keep_smallest`. Default is `keep_first`.
  • `--dry_run`: (Optional) `yes` or `no`. If set to `yes`, no files will be deleted. Default is `yes`.
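
A plausible `argparse` setup matching these options is sketched below; the real `_cli.py` may be organised differently, so treat the details as assumptions.

# A hypothetical argparse setup matching the documented options; the real
# _cli.py may differ.
import argparse

parser = argparse.ArgumentParser(description="Find and delete duplicate images.")
parser.add_argument("directory", help="Path to the folder to scan")
parser.add_argument("--threshold", type=int, default=10,
                    help="Maximum Hamming distance for two images to count as duplicates")
parser.add_argument("--strategy", choices=["keep_first", "keep_smallest"], default="keep_first",
                    help="Which file in each duplicate group to keep")
parser.add_argument("--dry_run", choices=["yes", "no"], default="yes",
                    help="If 'yes', only report duplicates instead of deleting them")

args = parser.parse_args()
dry_run = args.dry_run == "yes"   # convert the yes/no flag to a boolean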

Project Source Code


# cli_backup/functions.py

import os
import imagehash
import logging
from collections import defaultdict
from PIL import Image

logger = logging.getLogger(__name__)

def get_image_hashes(var, hash_size=8, hash_method='dhash'):
    """
    Recursively walks through a directory, computes a perceptual hash for each
    image file, and stores it in a dictionary.
    
    Args:
        var (Variables): The variables object containing the target directory.
        hash_size (int): The size of the hash, which can affect precision.
        hash_method (str): The hashing algorithm to use ('phash', 'ahash', 'dhash').
    
    Returns:
        dict: A dictionary where keys are image hashes and values are a list of
              file paths that share that hash.
    """
    logger.info(f"Scanning directory: {var.target_directory}")
    
    image_hashes = defaultdict(list)

    for dirpath, _, filenames in os.walk(var.target_directory):
        for filename in filenames:
            file_path = os.path.join(dirpath, filename)
            
            if not filename.lower().endswith(('.png', '.jpg', '.jpeg', '.gif', '.bmp')):
                continue

            try:
                img = Image.open(file_path)
                
                # Check if the image is valid
                img.verify()

                # Re-open the image to ensure the file pointer is at the start
                img = Image.open(file_path)
                
                # Compute the hash based on the selected method
                if hash_method == 'phash':
                    image_hash = str(imagehash.phash(img, hash_size=hash_size))
                elif hash_method == 'ahash':
                    image_hash = str(imagehash.average_hash(img, hash_size=hash_size))
                elif hash_method == 'dhash':
                    image_hash = str(imagehash.dhash(img, hash_size=hash_size))
                else:
                    logger.warning(f"Unsupported hash method: {hash_method}. Using 'dhash' by default.")
                    image_hash = str(imagehash.dhash(img, hash_size=hash_size))
                
                # Append the file path to the list for this hash
                image_hashes[image_hash].append(file_path)
            
            except (IOError, OSError) as e:
                logger.error(f"Error processing file {file_path}: {e}")
                continue

    return image_hashes

def find_duplicates(hashes_map, threshold=10):
    """
    Finds groups of duplicate images based on their hashes and a given threshold.
    
    Args:
        hashes_map (dict): A dictionary where keys are image hashes and values
                           are lists of file paths.
        threshold (int): The maximum Hamming distance for two images to be 
                         considered near-duplicates.
                         
    Returns:
        list: A list of lists, where each inner list contains the file paths of
              a group of duplicate images.
    """
    
    # Compare every hash against every other hash. Exact duplicates already
    # share a single hash entry with multiple paths; near-duplicates have
    # distinct hashes that differ by only a few bits, so hashes with a single
    # path must be kept in the comparison as well.
    hash_list = list(hashes_map.items())
    
    duplicate_groups = []
    processed_indices = set()
    
    for i in range(len(hash_list)):
        if i in processed_indices:
            continue
            
        current_hash_str, current_paths = hash_list[i]
        current_hash = imagehash.hex_to_hash(current_hash_str)
        
        # Start a new group with every file that shares this hash
        group = current_paths[:]
        
        # Mark this hash as processed
        processed_indices.add(i)
        
        for j in range(i + 1, len(hash_list)):
            if j in processed_indices:
                continue
            
            other_hash_str, other_paths = hash_list[j]
            other_hash = imagehash.hex_to_hash(other_hash_str)
            
            # Calculate the Hamming distance
            hamming_distance = current_hash - other_hash
            
            if hamming_distance <= threshold:
                # This is a duplicate; add all its paths to the group
                group.extend(other_paths)
                # Mark these files as processed to avoid re-checking
                processed_indices.add(j)
                
        # If the group has more than one file, it's a duplicate group
        if len(group) > 1:
            duplicate_groups.append(group)
            
    return duplicate_groups

def delete_duplicates(var, deletion_strategy='keep_first'):
    """
    Deletes duplicate files based on the specified strategy.
    
    Args:
        var (Variables): The variables object containing duplicate groups.
        deletion_strategy (str): The strategy to use for deletion: 'keep_first' 
                                 or 'keep_smallest'.
    """
    logger.info(f"Using deletion strategy: '{deletion_strategy}'")
    files_to_delete = []

    for group in var.duplicate_groups:
        if deletion_strategy == 'keep_first':
            # Keep the first file found, delete the rest
            files_to_delete.extend(group[1:])
        elif deletion_strategy == 'keep_smallest':
            # Sort files by size and keep the smallest one
            files_and_sizes = [(f, os.path.getsize(f)) for f in group]
            files_and_sizes.sort(key=lambda x: x[1])
            files_to_delete.extend([f for f, s in files_and_sizes[1:]])
        else:
            logger.error(f"Error: Unsupported deletion strategy '{deletion_strategy}'. Using 'keep_first'.")
            files_to_delete.extend(group[1:])

    logger.info("\n--- Duplicate files identified ---")
    if not files_to_delete:
        logger.info("No duplicates found to delete.")
    else:
        for group in var.duplicate_groups:
            if not group: continue
            kept_file = group[0]
            deleted_files_in_group = [f for f in group[1:] if f in files_to_delete]
            if deleted_files_in_group:
                logger.info(f"Group with original kept file: {kept_file}")
                logger.info("  - Files to delete:")
                for file_path in deleted_files_in_group:
                    logger.info(f"    - {file_path}")

    logger.info("-----------------------------------\n")
    
    deleted_count = 0
    if not var.dry_run:
        for file_path in files_to_delete:
            try:
                os.remove(file_path)
                logger.info(f"Deleted file: {file_path}")
                deleted_count += 1
            except OSError as e:
                logger.error(f"Error deleting {file_path}: {e}")
        logger.info(f"\n{deleted_count} files were successfully deleted.")
    else:
        # Dry run block
        logger.info("Dry run enabled. No files will be deleted. Above is a list of files that would have been deleted.")

Project Directory Structure

.
├── build.bat
├── requirements.bat
├── gui.py
├── _cli.py
├── cli_backup/
│   ├── functions.py
│   ├── logger.py
│   └── variables.py
├── gui_backup/
│   └── helper.py
└── logs/