Custom Combo List Compilation

Quick Update/Intro
Hello there, and thank you for stopping by my blog!
It's been a while since my last post. Life and my new job have kept me busy, leaving little time to write up any new posts until now. I have still been working on some homelab projects and messing with some interesting radio hacking, which I look forward to sharing with you soon. I really enjoy these hands-on projects and can't wait to post about them.
That said, today’s post is all about data breaches and combo lists. I hope you enjoy it and learn something new.
Real quick interjection!
Soon after writing this blog post, I was able to get my hands on another severe data breach, this one from National Public Data. The breach includes full names, addresses, phone numbers, and full Social Security numbers, among much more.
Here is Troy Hunt's article about the breach. I highly recommend reading this as many articles and news outlets do not ever tell the full story or try to analyze the data themselves.
Inside the "3 Billion People" National Public Data Breach
And here is a checker from pentester.com (founded by Ryan Montgomery) to see if your information was included.
With that out of the way, let's get back to my post.
Data Breaches
Websites being breached and having their data dumped is the source of many of the compilations and collections available on the internet. Other sources include InfoStealer malware, keyloggers, credential harvesters, social engineering, etc.
Some breaches contain much more information than we are after with this project, such as PII, PHI, Credit Card numbers, and proprietary company information.
Some also contain hashed versions of passwords, which can be cracked using wordlists and tools like Hashcat or John the Ripper. I have a ton of wordlist resources and will write something up on them soon, along with how I use Hashcat, the tool I reach for most often, to crack hashes.
Today we are after the plain text compilations that have the credentials in an "email:password" format, known as combo lists.
These are released and can be found on various cybercrime and hacking forums, among other places like Telegram where there are channels dedicated to finding, releasing, and selling them.
Forums and Telegram Channels Offering Combo Lists
Combo lists are useful for credential stuffing attacks, where an attacker or pentester uses known credentials from one site to try and gain access to other sites with the same credentials. These attacks can be very effective since many people reuse their passwords.
Some of the largest and most well-known credential dumps include:
- Collection #1 - 1,013,050,906 records
- Collection #2 - 3,040,689,677 records
- Collection #3 - 69,747,990 records
- Collection #4 - 1,835,141,695 records
- Collection #5 - 540,972,614 records
- AntiPublic #1 - 1,737,991,372 records
- AntiPublic #2 - 517,524,658 records
- Breach Compilation - record count unknown (I processed it before checking); most if not all of these records are also within COMB.
- Compilation of Many Breaches (COMB) - 1,230,703,487 records
There are others, but this list is going to be my starting point since all of these dumps can be found easily on the internet.
Here is a list on GitHub with the magnet links if you wish to download them yourself:
- Email and Password Breach Collection on GitHub
*Be aware that all of these combined come to roughly 1.5TB to 2TB once downloaded and decompressed.

For some more information on these breaches, the following articles do a great job at getting you caught up on the details of each dataset.
Here’s one from SpyCloud regarding many of these breaches, which includes some awesome stats:
A great article from CyberNews about the Compilation of Many Breaches (COMB):
And a more recent article discusses a leak discovered by a cybersecurity researcher, called the Mother of All Breaches (MOAB). I have yet to find any remnants of it, but I will definitely keep an eye out:
Have I Been Pwned?
Given the number of data breaches that bad actors have access to, it is worth checking whether your email and/or password has been included in any of them. Below are the three free options I found, along with some information about each.
Have I Been Pwned (HIBP)
- Have I Been Pwned is an amazing website/service started by Troy Hunt that allows you to check if your email has been involved in any data breach within their own database. I highly recommend reading Troy Hunt's work in this area as it's very comprehensive.
- It will also let you know which data breaches your email/password was seen in. This is helpful to know if you’ve been breached previously, but it’s not as useful for the attacks mentioned earlier since it doesn’t show the actual password.


Breach Directory
- This is a similar site to HIBP, but it provides more information.
- Breach Directory will provide you with the first few characters of the passwords as well as the full SHA1 hash, which you can crack using a wordlist and cracking tool as mentioned above.
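
To make the SHA1 point concrete, here is a minimal Python sketch of what "cracking with a wordlist" boils down to: hash each candidate and compare it to the target. It assumes the hash is a plain, unsalted SHA1 of the UTF-8 password, and the function names are made up for illustration. Real tools like Hashcat do the same comparison far faster.

```python
import hashlib
from typing import Iterable, Optional


def sha1_hex(password: str) -> str:
    """Return the lowercase hex SHA1 digest of a password (UTF-8, unsalted)."""
    return hashlib.sha1(password.encode("utf-8")).hexdigest()


def find_match(target_hash: str, candidates: Iterable[str]) -> Optional[str]:
    """Return the first candidate whose SHA1 equals target_hash, else None."""
    target = target_hash.strip().lower()
    for candidate in candidates:
        if sha1_hex(candidate) == target:
            return candidate
    return None
```

A handy use for this is confirming that a hash shown for your own address really corresponds to a password you used, without typing that password into a third-party site.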

ProxyNova COMB Search
- This tool only searches through COMB but it does show the full password in the results.

What’s My Goal?
While these services are helpful, I wanted to create my own database that consolidates large collections and allows for easily updating with new breaches, so I can run searches locally on my machine and stay up to date with new data coming out.
Where Do We Begin?
I spent a bit of time researching folder structures and query options for this type of breach data. I eventually got lucky and found a tool on GitHub that does exactly what I was looking for with this project.
The repo is called 3.7-billion-passwords-tools and provides an easy way to parse, sort, combine, and deduplicate most of the breaches listed above. The tool sorts the end results into a hierarchical data structure, making queries almost instantaneous.
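To show why a hierarchical layout makes lookups so fast, here is a small Python sketch. It is not the repo's code; it assumes a layout where each pair is stored in a file chosen by the first three characters of the email (for example `data/a/l/i`), with a hypothetical `symbols` bucket for non-alphanumeric characters. The function names and the depth of three are my own choices for illustration.

```python
from pathlib import Path
from typing import List


def shard_path(root: Path, email: str, depth: int = 3) -> Path:
    """Map an email to its shard file, e.g. alice@... -> root/a/l/i."""
    # Pad short keys so every path has exactly `depth` components.
    key = email.strip().lower().ljust(depth, "_")
    parts = [ch if ch.isalnum() else "symbols" for ch in key[:depth]]
    return root.joinpath(*parts)


def add_pair(root: Path, line: str) -> None:
    """Append one 'email:password' line to the shard for its email."""
    email = line.split(":", 1)[0]
    path = shard_path(root, email)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as shard:
        shard.write(line.rstrip("\n") + "\n")


def lookup(root: Path, email: str) -> List[str]:
    """Return every stored line for an email by reading only its shard."""
    path = shard_path(root, email)
    if not path.is_file():
        return []
    prefix = email.strip().lower() + ":"
    with path.open(encoding="utf-8", errors="replace") as shard:
        return [ln.rstrip("\n") for ln in shard if ln.lower().startswith(prefix)]
```

A query only ever opens one small file instead of scanning the whole dataset, which is where the near-instant results come from.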
If you decide to do this on your own, I recommend a 2TB SSD, 16GB RAM, and 12 CPU cores at the very least. I moved the data back and forth between my main computer and a VM I set up with a lot more RAM and CPU cores to help process the data quicker.
*Parsing, cleaning, and merging took me about two weeks to get through all the breaches mentioned in the repo, as well as the many other collections I found throughout this project.
This tool worked well for the most part, but I did run into issues when parsing some other large databases, especially random dumps I came across that had horrible formatting.
For these, I utilized ChatGPT to help write a Python script to extract “email:password” pairs from all files found within the subdirectories you point it at. You can find the script on my GitHub here.

This script works well for messy collections or just throwing a bunch of random single combo lists together into a directory and running it to extract all the pairs, ignoring the data we don't need, into one large text file for later processing.
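As a rough illustration of the idea (a simplified sketch, not the script linked above), the extraction can be done with a directory walk and a regular expression. The pattern, the accepted separators (`:`, `;`, `|`), and the choice to lowercase emails are all assumptions of this example; messy dumps will need a more forgiving pattern.

```python
import os
import re

# Email, then one separator character, then the password up to the next whitespace.
PAIR_RE = re.compile(
    r"([A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,})[:;|](\S+)"
)


def extract_pairs(root_dir: str, out_path: str) -> int:
    """Walk root_dir, write every 'email:password' pair found to out_path.

    Returns the number of pairs written. Lines without a pair are ignored.
    """
    count = 0
    out_abs = os.path.abspath(out_path)
    with open(out_path, "w", encoding="utf-8") as out:
        for dirpath, _dirnames, filenames in os.walk(root_dir):
            for name in filenames:
                path = os.path.join(dirpath, name)
                if os.path.abspath(path) == out_abs:
                    continue  # never read our own output file
                with open(path, encoding="utf-8", errors="replace") as src:
                    for line in src:
                        match = PAIR_RE.search(line)
                        if match:
                            email, password = match.group(1), match.group(2)
                            out.write(f"{email.lower()}:{password}\n")
                            count += 1
    return count
```

Because the pattern searches anywhere in the line, leading fields such as a URL are dropped and only the email and password survive.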
Once you have all the pairs combined into one text file, you can run the following command (after creating a tmp folder) to sort the data before proceeding with the 3.7-billion-passwords-tools workflow:
pv -cN processing combined.txt | sort -T tmp -u -S 90% --parallel=12 > combined_sorted.txt
Make sure to adjust the command to fit your hardware: “-S 90%” sets how much of your memory sort may use as its buffer, and “--parallel=12” sets how many sorts run concurrently, which you would normally match to the number of CPU cores you want to allocate. Asking for more than you have available can cause the task to slow down or fail.
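If you would rather not hand-edit those two values, a small Python helper can fill them in. This is only a convenience sketch: the function name and the conservative 50% default are my own assumptions, and it simply rebuilds the same command shown above using the detected core count.

```python
import os
import shlex
from typing import Optional


def build_sort_command(src: str, dst: str, tmp_dir: str = "tmp",
                       mem_percent: int = 50,
                       cores: Optional[int] = None) -> str:
    """Build the pv | sort pipeline with values suited to this machine."""
    if cores is None:
        cores = os.cpu_count() or 1  # cpu_count() can return None
    if not 1 <= mem_percent <= 100:
        raise ValueError("mem_percent must be between 1 and 100")
    sort_args = ["sort", "-T", tmp_dir, "-u",
                 "-S", f"{mem_percent}%", f"--parallel={cores}"]
    return (f"pv -cN processing {shlex.quote(src)} | "
            f"{shlex.join(sort_args)} > {shlex.quote(dst)}")


if __name__ == "__main__":
    print(build_sort_command("combined.txt", "combined_sorted.txt"))
```

The helper only prints the command, so you can review it before pasting it into your shell.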
After combining most of the data breaches, I ended up with a dataset close to 700GB but ran into an issue with the 3.7-billion-passwords-tools not performing the deduplication operation at the end of the merge.
I ran through some troubleshooting but was still unable to figure out what was causing the tool to fail at the deduplication step. I used the tool to merge the datasets, then once again turned to ChatGPT to write a shell script to deduplicate the files. You can get that script on my GitHub as well:
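For anyone curious how deduplication at this scale can work without loading everything into memory, here is a Python sketch of one approach (again, not the shell script linked above). It assumes every input file is already sorted in plain byte order, for example with `LC_ALL=C sort`, so duplicates always end up next to each other during the merge. The function name is hypothetical.

```python
import heapq
from contextlib import ExitStack
from typing import Sequence


def merge_dedup(sorted_paths: Sequence[str], out_path: str) -> int:
    """Merge already-sorted files into out_path, dropping duplicate lines.

    Only one line per input file is held in memory at a time.
    Returns the number of unique lines written.
    """
    written = 0
    with ExitStack() as stack:
        files = [
            stack.enter_context(open(p, encoding="utf-8", errors="replace"))
            for p in sorted_paths
        ]
        out = stack.enter_context(open(out_path, "w", encoding="utf-8"))
        streams = [(ln.rstrip("\n") for ln in f) for f in files]
        previous = None
        # heapq.merge yields lines in sorted order across all inputs.
        for line in heapq.merge(*streams):
            if line != previous:
                out.write(line + "\n")
                written += 1
                previous = line
    return written
```

Since equal lines are adjacent in the merged stream, comparing each line with the previous one is enough to remove every duplicate.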

After days and days of combining all the collections and individual combo lists and finally running the deduplication script, I ended up with a dataset of 150GB. I was surprised at the end result and spent a day or two testing to ensure I wasn't losing too much useful data during this process.
Even with this massive reduction in size, the resulting dataset, which I’ve named ComboVault, contains 4,638,336,222 unique email and password pairs. In my limited testing, running many queries against my collection and comparing the results with the sites above turned up no missing passwords.

As mentioned before, running queries takes no time at all due to the way the dataset is structured. Because of this structure, we can also use the breach-parse tool created by Heath Adams. This tool takes a domain (for example, a company's email domain) as input, searches the dataset for every credential belonging to that company/organization, and writes the results into separate text files: one for passwords, one for users, and a master file containing the credential pairs.
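To illustrate the kind of split described here, below is a short Python sketch. It is not breach-parse itself, just the same idea in miniature: filter pairs by email domain, then separate them into master, users, and passwords. The function names and the output file naming are assumptions made for this example.

```python
from typing import Iterable, List, Tuple


def split_by_domain(lines: Iterable[str],
                    domain: str) -> Tuple[List[str], List[str], List[str]]:
    """Return (master, users, passwords) for pairs whose email is at domain."""
    suffix = "@" + domain.lower().lstrip("@")
    master: List[str] = []
    users: List[str] = []
    passwords: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        # Split on the first colon only, since passwords may contain colons.
        email, sep, password = line.partition(":")
        if sep and email.lower().endswith(suffix):
            master.append(line)
            users.append(email)
            passwords.append(password)
    return master, users, passwords


def write_results(prefix: str, master: List[str],
                  users: List[str], passwords: List[str]) -> None:
    """Write the three lists to prefix-master.txt, -users.txt, -passwords.txt."""
    for name, rows in (("master", master), ("users", users),
                       ("passwords", passwords)):
        with open(f"{prefix}-{name}.txt", "w", encoding="utf-8") as out:
            out.writelines(row + "\n" for row in rows)
```

On a pentest, the users file feeds username enumeration while the master file is what you would check against the client's password policy.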

Why Go Through All the Trouble?
Having access to not only a large dataset of credential combos but one that you can continuously add to with newly leaked credentials is invaluable for pentesting and security research. Although this process removes the URL, IP address, or other portions of some leaks (which could be important depending on your needs for the data), it serves my purposes just fine. The main advantage is that I can continue building upon it with more combo lists I come across in the future.
The other advantage is cleaning up the formatting and getting rid of duplicates. As you can tell from the original file sizes, the raw collections contain a significant amount of garbage that takes up a lot of space.
In the future, I’d like to improve this process by writing my own scripts and tools to perform these functions more efficiently, ensuring I’m not losing valuable data in the process, as I’m sure I did.
For now, I’m happy with the 4.6 billion pairs I have to work with in my current iteration of ComboVault.
That's all I have for you today!
I was initially going to post a link to the resulting ComboVault collection since the information is readily available, but after thinking on it I'd rather not add to the availability of such data.
If you have any questions regarding this post you can reach out to me through LinkedIn. I'm always open to helping out and collaborating!
I look forward to posting about my other projects and revisiting this one once my coding and scripting skills catch up.
Hope you have a great day!
*11/1/24 UPDATE
Just wanted to add a quick update here. I've been pulling new combo lists from different forums and a bunch of different Telegram channels. I usually check in weekly to grab new ones, then once a month I combine them and add them to my ComboVault collection. Since my initial collection was completed (4.6 billion pairs), I've added over 200 million more unique email:password pairs in the last two months and am now up to 4.9 billion pairs and growing.