Skip to main content
U.S. flag

An official website of the United States government

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Using Bulk_extractor

Outbound Reviews: Scanning Tools for PII and Secrets

Bulk_extractor is a forensic tool used to extract useful information from disk images, files and other digital media. It looks for strings of interest, such as email addresses, URLs, credit card numbers and other artifacts. Unlike traditional forensic tools, bulk_extractor does not mount or parse the file system, making it faster and more suitable for unstructured data. It is commonly used for cybersecurity investigations, data recovery and digital forensics.

Gitleaks is an open-source tool designed to detect and prevent secrets (like API keys, tokens and passwords) from being committed to Git repositories. It scans repositories, including their entire history, for hardcoded secrets and generates reports to help developers address potential security vulnerabilities. Gitleaks is specifically tailored to source code repositories, making it highly effective in development environments.

Uses

Checks for GitLeaks Bulk-Extractor
Emails
Credit Cards
File fragments (JPEGs, PDF headers)
JSON snippets
Phone numbers
URLs
IP addresses
DNS queries
Base64-encoded data
PII
API keys
SSH keys
Hardcoded passwords
Tokens
Secrets

Key Differences

Bulk-extractor:

  • Analyzes raw data and disk images
  • Digital forensics and data recovery
  • Source: Files, disk images, raw data

GitLeaks:

  • Scan for Git repo secrets
  • Source code and secrets
  • Used for repos

Bulk-Extractor Installation for Mac

Ensure you are installing bulk_extractor in your local drive not within the repository or disk you are checking.

 $ brew install autoconf
 $ brew install automake
 $ brew install libtool
 $ brew install pkg-config

 $ git clone --recurse-submodules https://github.com/simsong/bulk_extractor.git

 $ cd bulk\_extractor

 $ ./bootstrap.sh
 $ ./configure
 $ make
 $ make install

Bulk-Extractor Installation for Windows

Note: Currently bulk_extractor 2.1 does not build on windows, but bulk_extractor 2.0 does.

  • Download the Bulk Extractor Windows binary from the official website.

  • Extract the downloaded zip file to your preferred directory.

  • Add the Bulk Extractor binary folder to your system PATH:

    • Press Win + R, type sysdm.cpl, and hit Enter.
    • Navigate to the "Advanced" tab and click on "Environment Variables."
    • Under "System Variables," locate and edit the Path variable.
    • Add the path to the Bulk Extractor folder and click OK.

    Or

  • Start with a clean Virtual Machine (VM) and use these commands:

 $ git clone --recurse-submodules https://github.com/simsong/bulk_extractor.git
 $ cd bulk_extractor/etc
 $ bash CONFIGURE_FEDORA36_win64.bash
 $ cd ..
 $ make win64
  • Open Command Prompt and test the installation:
bulk_extractor.exe -h

Bulk-Extractor Installation for Linux

  • Update your package manager:
  $ sudo apt-get update
  • Install required dependencies
  $ sudo apt-get install -y build-essential cmake
  • Clone the Bulk Extractor repository and build the source code:
  $ git clone https://github.com/simsong/bulk_extractor.git
  $ cd bulk_extractor
  $ mkdir build && cd build
  $ cmake ..
  $ make
  $ sudo make install
  • Verify installation:
  $ bulk_extractor -h

How to use

  • Create the directory you would like to save the output into. This can be in the repository being checked but the location is up to you.
  • Alternatively, if the repository you are checking has a directory already available for storing test results, you can use that.

You can run the basic:

$ bulk_extractor -o </path/to/output_directory> <path/to/input_file_or_image>

You can also extract only certain files:
Example:

$ bulk_extractor -o <output_directory> -E email,input <input_file_or_image>

Outputs are saved as .txt files.


When Not to Use Bulk Extractor

  • Highly Structured File Systems: When specific file system metadata needs to be preserved.
  • Real-Time Analysis: It’s not designed for active/live systems.
  • Deep Contextual Analysis: Bulk Extractor focuses on raw data extraction and lacks contextual file relationships.

Language use

Language Suitability Notes
Python ✅ Great Great for libraries and easy to use.
Java ✅ Great Well suited.
C++ ✅ Great High performance and control over resources.
JavaScript ⚠️ Moderate Good for web based extraction, but less efficient for bulk extraction.
Ruby ⚠️ Moderate Can handle small to medium scale tasks.
R ⚠️ Moderate Excellent for data analysis, but less efficient for bulk extraction.
Bash ❌ Poor Limited for large-scale processing.
Looking for U.S. government information and services?
Visit USA.gov