Taming the Repository: Finding and Removing Large Files in Git

Bruno Peixoto
2 min read · May 26, 2024

Keeping your Git repository lean and mean is crucial for efficient collaboration and performance. Large files bloat your repository, making clones and pushes slower. The script below helps you identify and remove these space hogs, keeping your repository in tip-top shape.

import os
from subprocess import run

def list_large_files(repo_path, size_limit):
    """Return paths of files in HEAD larger than size_limit bytes."""
    os.chdir(repo_path)
    command_list = ['git', 'ls-tree', '-r', '--long', 'HEAD']
    result = run(command_list, capture_output=True, text=True, check=True)

    large_files = []
    for line in result.stdout.splitlines():
        # Each line: "<mode> <type> <hash> <size>\t<path>"
        meta, _, file_path = line.partition('\t')
        parts = meta.split()
        if parts[1] != 'blob':  # submodule entries report '-' instead of a size
            continue
        if int(parts[3]) > size_limit:
            large_files.append(file_path)

    return large_files

def remove_large_files(repo_path, large_files):
    """Rewrite history to drop each path. Destructive: requires git-filter-repo,
    which refuses to run on anything but a fresh clone unless --force is given."""
    os.chdir(repo_path)
    for file_path in large_files:
        command_list = ['git', 'filter-repo', '--path', file_path, '--invert-paths']
        run(command_list, check=True)

repo_path = '.'
size_limit = 50 * 1024 * 1024  # 50 MB

large_files = list_large_files(repo_path, size_limit)

if large_files:
    print(f"Found large files: {large_files}")
    # remove_large_files(repo_path, large_files)
    # print("Large files removed successfully.")
else:
    print("No large files found.")

Finding the Culprits: Listing Large Files

The script starts by defining a function list_large_files. This function takes two arguments:

  • repo_path: The path to your Git repository.
  • size_limit: The maximum file size threshold in bytes (here, 50 MB).

It uses the subprocess module to execute git ls-tree -r --long HEAD, which lists every file in the current commit along with its details, including size. The script splits each line at the tab separating the metadata from the file path, skips entries that are not blobs (submodules report - instead of a size), and adds any path whose size exceeds the threshold to the large_files list.
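
For reference, here is how a single line of that output breaks down. The hash, size, and path below are made up for illustration:

# One illustrative line of `git ls-tree -r --long HEAD` output:
sample = '100644 blob 83baae61804e65cc73a7201a7252750c76066a30 5242880\tassets/demo.mp4'
meta, _, path = sample.partition('\t')
mode, obj_type, obj_hash, size = meta.split()
print(obj_type, int(size), path)  # -> blob 5242880 assets/demo.mp4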

Taking Action: Removing Large Files (Commented Out)

The script also includes a function remove_large_files; the call to it is commented out. This function takes two arguments:

  • repo_path: The path to your Git repository.
  • large_files: A list of file paths to be removed.

This function utilizes the git filter-repo command, a powerful but potentially risky operation. Note that filter-repo is not bundled with Git; it is a separate tool, installable for example via pip install git-filter-repo. It rewrites your Git history to exclude the specified files. Because history rewriting is destructive, the call to this function is commented out in the script.
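
If you want to see what a rewrite would do before running it for real, git filter-repo accepts a --dry-run flag. A minimal sketch, assuming git-filter-repo is installed and big/file.bin is a hypothetical large file from the earlier scan:

from subprocess import run

# Preview the rewrite without changing history; filter-repo leaves the
# before/after fast-export streams under .git/filter-repo/ for inspection.
command_list = ['git', 'filter-repo', '--path', 'big/file.bin',
                '--invert-paths', '--dry-run']
run(command_list, check=True)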

Running the Script

  1. Save the script as a Python file (e.g., find_large_files.py).
  2. Navigate to your Git repository using your terminal.
  3. Run the script using python find_large_files.py.

The script will scan your repository and report any files exceeding the size limit.
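
With the default 50 MB threshold, a run might print something like the following (the paths here are made up for illustration):

Found large files: ['assets/demo.mp4', 'data/dump.sql']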

Important Note:

  • Removing Large Files: Uncommenting the call to remove_large_files and rerunning the script will rewrite your Git history to drop those files. Once the rewritten history is pushed, the change is effectively irreversible for everyone sharing the repository, so make a backup clone first.
  • Alternatives: Before rewriting history, consider Git Large File Storage (LFS) for managing very large files (see the sketch below), or a version control system designed specifically for large assets.
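
A minimal sketch of the LFS route, assuming the git-lfs client is installed (the *.mp4 pattern is just an example). Note that LFS applies to files committed from now on; files already baked into history still require a rewrite like the one above.

from subprocess import run

# One-time setup: install the LFS hooks into this repository.
run(['git', 'lfs', 'install'], check=True)

# Track a pattern via .gitattributes (the pattern is an example),
# then commit the updated .gitattributes.
run(['git', 'lfs', 'track', '*.mp4'], check=True)
run(['git', 'add', '.gitattributes'], check=True)
run(['git', 'commit', '-m', 'Track video files with Git LFS'], check=True)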

Conclusion

This script provides a handy tool to identify large files in your Git repository. Remember to proceed with caution when removing them, and explore alternative solutions for managing exceptionally large files effectively.
