[C++] Best method to organize massive word lists?


h4344

Hey there!

I am trying to write my own application for merging and organizing wordlists. The problem I am having is thinking of the best way to filter through all the candidates to remove duplicates, organize them, etc.

I have to be able to open pretty big existing word lists (anywhere from 1 GB to 5 GB PER FILE). I am also not sure how I would compare, say, candidate number 555740 against all the previous entries without going back through every one of them. I thought about splitting the results into separate temp .txt files organized alphabetically, but I can see the same problem coming back with huge lists.

Any advice on where I should go with these problems?


3 hours ago, Sebkinne said:

I'd probably use something like external merge-sort for this, but a database like @noncenz suggested should be a good choice for this.

I just looked up external merge sort and I think that might be perfect. A database sounds interesting, but I want this to be a simple tool I can redistribute, and I want to keep it simple for users!
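For anyone wondering what external merge sort could look like in C++, here is a minimal sketch, not a finished tool. It reads the inputs in chunks that fit in memory, sorts and dedups each chunk into a temporary "run" file, then does a k-way merge of the runs while skipping repeated words. The function name `external_sort_unique`, the `chunk_lines` parameter, and the `.runN` temp file naming are all made up for this example. It also assumes one word per line and plain byte-wise ordering.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <fstream>
#include <functional>
#include <memory>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Sort one in-memory chunk, drop duplicates, and write it out as a sorted run.
static void write_run(std::vector<std::string>& chunk, const std::string& path) {
    std::sort(chunk.begin(), chunk.end());
    chunk.erase(std::unique(chunk.begin(), chunk.end()), chunk.end());
    std::ofstream out(path, std::ios::binary);
    for (const auto& w : chunk) out << w << '\n';
    chunk.clear();
}

// Phase 1: split the inputs into sorted runs of at most chunk_lines lines each.
// Phase 2: k-way merge the runs, skipping any word equal to the last one written.
// Returns the number of unique words written to out_path.
std::size_t external_sort_unique(const std::vector<std::string>& inputs,
                                 const std::string& out_path,
                                 std::size_t chunk_lines) {
    assert(chunk_lines > 0);
    std::vector<std::string> runs, chunk;
    auto flush = [&] {
        runs.push_back(out_path + ".run" + std::to_string(runs.size()));  // hypothetical naming
        write_run(chunk, runs.back());
    };
    for (const auto& path : inputs) {
        std::ifstream in(path, std::ios::binary);
        for (std::string line; std::getline(in, line);) {
            if (!line.empty() && line.back() == '\r') line.pop_back();  // tolerate CRLF files
            if (line.empty()) continue;                                 // skip blank lines
            chunk.push_back(std::move(line));
            if (chunk.size() == chunk_lines) flush();
        }
    }
    if (!chunk.empty()) flush();

    // Min-heap of (current word, run index); each run contributes one word at a time.
    using Item = std::pair<std::string, std::size_t>;
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> heap;
    std::vector<std::unique_ptr<std::ifstream>> readers;
    for (std::size_t i = 0; i < runs.size(); ++i) {
        readers.push_back(std::make_unique<std::ifstream>(runs[i], std::ios::binary));
        std::string w;
        if (std::getline(*readers[i], w)) heap.emplace(std::move(w), i);
    }

    std::ofstream out(out_path, std::ios::binary);
    std::string last;
    std::size_t written = 0;
    while (!heap.empty()) {
        Item top = heap.top();
        heap.pop();
        if (written == 0 || top.first != last) {  // only write words we have not just written
            out << top.first << '\n';
            last = top.first;
            ++written;
        }
        std::string next;
        if (std::getline(*readers[top.second], next)) heap.emplace(std::move(next), top.second);
    }

    readers.clear();  // close the run files before deleting them
    for (const auto& r : runs) std::remove(r.c_str());
    return written;
}
```

Memory use is bounded by `chunk_lines` during the first phase and by one word per run during the merge. The sketch opens every run at once, so a very large input with a small chunk size could hit the OS limit on open files. A real tool would probably merge in several passes.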


For simple deduplication of arrays you can use an associative array/hash. Simply use the word as the key and give it a value of true.  Enumerating the keys then gives you the deduplicated list of words.  If you want that list sorted then simply sort it afterwards.  Here's a quick Perl example:

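#!/usr/bin/perl

use strict;
use warnings;

# Keys are the lowercased words; a hash cannot hold duplicate keys.
my %words;

while (my $line = <STDIN>) {
    chomp($line);
    $words{lc($line)} = 1;
}

# Print the unique words in sorted order, one per line.
print "$_\n" for sort keys %words;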

 


I might be missing the point, but this should not be so difficult that you would want to write a C program for it (other than maybe as a mental exercise in reinventing the wheel with some complex routines). Depending on how much you need to modify things, just do it from the command line with utilities already designed for this, where the heavy lifting of programming has been done for you. Tools like cat and sort handle exactly this kind of thing; it is what they were designed for.

If you just need to merge and sort, then a one-liner would be easiest under Linux.

On Linux:

cat file1 file2 file3 file4 | sort -u > list.txt

On Windows, put your files in the same directory, then run the following two commands (Windows needs a few more steps, but it's still easy-peasy):

:: Open a cmd prompt, then open powershell:
C:\ > powershell [hit enter]
:: Run each of the following commands
dir C:\path\to\files\* -include *.txt -rec | gc | out-file C:\unique\path\to\results\result.txt 
gc C:\unique\path\to\results\result.txt | Sort-Object | Get-Unique | out-file C:\path\to\sorted\wordlist.txt

On Windows, you can't chain the two commands above so that the first pipes into the second; you won't get an error, but it's a sharing violation on the file in use, so the second command has to be entered by itself after the first completes, or the wordlist at the end will be empty. (See https://forums.hak5.org/topic/42274-c-best-method-to-organize-massive-word-lists/?do=findComment&comment=300340 for how to chain the above in a single command.)

 

If you need anything more complicated than this, then I'd say do the database thing as suggested above so you can tag words in different groups and sort dynamically into different list categories, but to merge a sorted, unique list, keep it simple with the command line.

Edited by digip

2 hours ago, digip said:

On Windows, you can't chain the two commands above so that the first pipes into the second; you won't get an error, but it's a sharing violation on the file in use, so the second command has to be entered by itself after the first completes, or the wordlist at the end will be empty. Put this in a PowerShell script and it's just a matter of dropping the files in the path where it looks for them and running it (you could do something similar with a bash script). You can add a cleanup routine to remove unwanted files as well. Just be sure not to put the result file in the same directory you scan for files to merge, or you will fill up your HDD in a loop with a file that grows forever!

I like the idea of using the command line because you might be on a system where you actually want to cat some output, then merge and sort it for whatever reason, maybe even for exfiltration needs. Needing a compiled C program for this means being able to move the binary to those systems and/or compiling it for that specific OS, which isn't necessary if you know the tools available in your environment for each OS. I assume Mac OS X would be very similar to Linux/*nix in its commands and utilities as well.

If you need anything more complicated than this, then I'd say do the database thing as suggested above so you can tag words in different groups and sort dynamically into different list categories, but to merge a sorted, unique list, keep it simple with the command line.

Huh, I never knew Linux had built-in command line tools for this express purpose.

Well I suppose that makes the program a little useless in terms of practical use with the tools already available. It was a good brain stretch thinking about how I would have to handle the data and do this all myself in C++. I was at work the whole time thinking how I would disassemble the files to more manageable sized "chunks" and having separate methods to de-dup and organize the output depending on user choices.

Sounds like I really was trying to reinvent the wheel lol. Still, there is always pride to be had in writing all the code yourself and going through the process :P


Just wanted to add one thing. The sharing violation is not hit here if you sort right after gc. Sort-Object cannot output anything until it has received all of its input, so by the time the first line reaches the command that writes the file, gc has already read all the content and closed the files. To see this buffering in action, gc a large file and pipe it to Out-String. If the output were produced line by line, it would appear line by line, but instead it sits for a while collecting the contents before printing anything. Watch the process memory size and you will see it increase as the file is read. If you want to control the reading yourself, bit by bit, you can use the .NET classes: create a stream and read it in a while loop (or something of the sort), doing something with the contents until the stream is done.

So, the above command will work like so:

Get-Childitem C:\path\to\files\* -include *.txt -Recurse -File | gc | Sort-Object -Unique | Set-Content c:\path\to\sorted\wordlist.txt

 

The above will work. I include "-File" in Get-ChildItem to get only files, in case some folder is named something.txt. It's just a habit of mine to target the objects I want to work with.


I'll have to try that. I only ever piped from one command into the next and didn't know you could use them together like that. I figured it couldn't touch a file already held by another handle, for ownership reasons, but I guess that wasn't the case. I should have realized Sort-Object had a unique switch like Linux sort has.

Edited by digip

  • 1 month later...

I made some posts about this a few years back but I can't seem to find those older threads anymore. Is there an archive, or do those threads just get pruned out? That post had a lot of good syntax. Stuff I've pretty much forgotten again by now.

I compiled a wordlist from about 1300 other sources. The one problem that I ran into with a lot of the Linux command line sorting tools was that they are memory intensive, and sort just crashes when you feed it a file that is bigger than the amount of physical memory available.

The coolest thing ever would be to have a machine with a lot of RAM, then copy all of the files to a tmp folder and sort them there, shaving off the read/write time to and from disk and just copying the final product to your primary storage.

The split command came in handy for me since I don't have a lot of memory.  sed, awk, grep are pretty handy for doing a lot of this stuff.

sed awk grep sort split

Edited by vailixi