
Deep Web Crawler Building 101



Hey Guys,

I've recently been getting into web crawling, and I've been considering how one could make a web crawler to detect onion sites on the Tor network. I know there are already lots of deep-web/dark-web/dank-web indexing sites, such as Ahmia and the onion crate, where one can go to find active onions. However, because new onions appear and disappear daily, it would be handy to have a personal tool that automatically detects onions, possibly even extracts some basic information, and logs the findings for later. Maybe catch some sweet hacks before the feds get to them, or accidentally infect yourself with cutting-edge malware.

Idea 1: Brute Force

The obvious (and naive) implementation would be to brute-force onion names and check each one with something like requests.get from Python's requests library. Assuming you are routing traffic into the Tor network, the request will come back with status 200 when an onion site exists and is online at the given address, so any combination returning 200 should be flagged for later. If the request fails or comes back with any other status, such as 404, no action is taken and the loop moves on to the next candidate. By iterating through all possible onion links, one would eventually hit some real onions. This design is very similar to password brute-forcing, in both concept and effectiveness.
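As a rough sketch of that check (assuming Tor is running locally with its SOCKS proxy on the default port 9050, and requests installed with SOCKS support via pip install requests[socks]; the timeout is just a guess):

```python
import requests

# Assumes a local Tor client exposing a SOCKS5 proxy on the default port 9050.
# The socks5h scheme makes hostname (.onion) resolution happen inside Tor.
TOR_PROXIES = {
    "http": "socks5h://127.0.0.1:9050",
    "https": "socks5h://127.0.0.1:9050",
}

def onion_is_up(address, timeout=30):
    """Return True if the candidate onion answers with HTTP 200."""
    try:
        response = requests.get(f"http://{address}.onion/",
                                proxies=TOR_PROXIES, timeout=timeout)
        return response.status_code == 200
    except requests.RequestException:
        # Timeouts, refused connections, failed circuits: treat as offline.
        return False

# The Ahmia onion mentioned below should come back True while Tor is running.
print(onion_is_up("msydqstlz2kzerdg"))
```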

All onion addresses consist of 16-character hashes made up of any letter of the alphabet (case insensitive) and decimal digits from 2 to 7, thus representing an 80-bit number in base32. An example of an actual onion address is http://msydqstlz2kzerdg.onion/ which is the onion link to the Ahmia search engine for Tor sites. This leaves roughly 1208925820000000000000000 possible character combinations for onion addresses. For reference, the largest possible value of a "long", the largest primitive data type for storing integers in Java, is 9223372036854775807, a solid six digits too short to contain the number of potential onions. If you designed a simple program to count from 0 to 1208925820000000000000000 it would take... a long ass time to run (my pc takes about a minute get into 7 digit territory counting by one, and about eight minutes to get into 8 digit territory... the destination number has 24 digits).
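The arithmetic is easy to sanity-check in Python, which handles arbitrarily large integers natively:

```python
keyspace = 32 ** 16                  # 16 base32 characters: a-z plus 2-7
java_long_max = 2 ** 63 - 1          # 9223372036854775807

print(keyspace)                      # 1208925819614629174706176
print(keyspace == 2 ** 80)           # True
print(len(str(keyspace)))            # 25 digits
print(keyspace // java_long_max)     # 131072 -- about 2**17 times a Java long
```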

It isn't that important to me if the web crawler takes several days or even weeks to run through every possible combination, since the majority of onion sites with actual content do persist for a while anyway. As for fresh sites that may not last long, you would have to get lucky for your crawler to hit the correct address during the short period where the site is online. This crawler would be designed to run continuously, looping through every possible combination over and over to continually update the list. There would also be periodic checks of whether onions in the list are still online.

Pros: relatively straightforward to program and maintain, could potentially discover onions not contained in other indexes

Cons: inefficient and ineffective unless you have a supercomputer lying around

Idea 2: Crawler Crawler

The next possible implementation would be to leverage the work already done by others by creating an index of indexes. By checking for changes in popular existing indexes at arbitrary intervals, my onion list would update itself with far less computation and time. The one downside is that we can already access these indexes anyway, so we wouldn't get any juicy information before our deep-web peers do. Each site stores its index info in a different format, so the crawler would have to be tailored to read sites from each index differently. We would also have to manually account for index sites going down or new sites being discovered.
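Something like this per-index scrape is what I'm picturing (the Ahmia URL is just an example entry point, and a bare regex stands in for the per-site parsing each index would really need):

```python
import re
import requests

# Matches 16-character onion hostnames (a-z and 2-7) anywhere in a page.
ONION_RE = re.compile(r"\b([a-z2-7]{16})\.onion\b")

def scrape_index(url, known):
    """Fetch one index page and return onion addresses not already known."""
    html = requests.get(url, timeout=30).text
    return set(ONION_RE.findall(html)) - known

known_onions = set()
# Hypothetical entry point; every real index needs its own URL(s) and parser.
new = scrape_index("https://ahmia.fi/onions/", known_onions)
known_onions |= new
print(f"{len(new)} new onions discovered")
```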

Pros: less heavy-lifting for my PC, doesn't need to be run constantly

Cons: must be tailored to each individual index, more work to code, indexes could go down or change formats, onion sites discovered are ones I could already find anyway.

Idea 3: Google-Style Crawler

The last idea I have is to implement a crawler algorithm similar to the ones used by Google's own web spiders. My above crawler algorithms only consider the main 'home' addresses, consisting of the 16 characters plus .onion, even though most sites have many pages (fh5kdigeivkfjgk4.onion would be indexed, fh5kdigeivkfjgk4.onion/home would not). One way professional-grade search-engine crawlers build their indexes is by following the links on each page they visit. The algorithm would follow links contained in the page source to navigate around the website, and if addresses belonging to new onion sites are found (i.e. the 16 characters are different), it will add them to the index. This would be especially handy upon discovery of sites similar to the Hidden Wiki, which are stuffed full of links to other active (or inactive) onions.
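A bare-bones version of that loop might look like this (again assuming Tor on port 9050; a real crawler would want politeness delays, depth limits, and persistent storage instead of the in-memory sets used here):

```python
import re
from collections import deque
from urllib.parse import urljoin, urlparse

import requests

TOR_PROXIES = {"http": "socks5h://127.0.0.1:9050",
               "https": "socks5h://127.0.0.1:9050"}
ONION_RE = re.compile(r"[a-z2-7]{16}\.onion")
HREF_RE = re.compile(r'href=["\'](.*?)["\']', re.IGNORECASE)

def crawl(seed_url, max_pages=50):
    """Follow links breadth-first and collect every distinct onion domain seen."""
    queue, visited, onions = deque([seed_url]), set(), set()
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = requests.get(url, proxies=TOR_PROXIES, timeout=30).text
        except requests.RequestException:
            continue
        onions.update(ONION_RE.findall(html))       # any onion mentioned anywhere
        for href in HREF_RE.findall(html):
            link = urljoin(url, href)
            host = urlparse(link).hostname
            if host and host.endswith(".onion"):
                queue.append(link)                  # only keep walking onion pages
    return onions

# Seeding with a link-heavy page (Hidden-Wiki-style) would fill the set fastest.
print(crawl("http://msydqstlz2kzerdg.onion/"))
```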

Pros: Can take advantage of onion links discovered within new sites, index will fill faster

Cons: The Tor network is often quite slow, so navigating through sites could be time-consuming.

 

Right now I have some basic test code running to try out a few things, but nothing worth posting quite yet. I will post any progress I make here. Let me know if you guys have any recommendations.


5 hours ago, The Power Company said:

Multi-threading would probably help. I think I'll try implementing some of that sweet Cuda GPU Acceleration sauce as well, it works wonders for deep learning and password cracking.

CUDA won't matter. The bottleneck is waiting on that 200 response.

 

16^(32) = kabillion

My math is inaccurate, but this is a crazy amount of computing...

 

You need to distribute this over IoT. I have been logging IoT activity for a few months, and I have a decent list of infected IP addresses.


I would set up crunch real quick. See if you can create a quick multithreaded Python, Perl, or Ruby script.

 

Send out 10,000 DNS requests, dump the results to a log file for grepable filtering after the scan completes, and time this activity.

 

When it completes, compare the time it took against how many successful 200 responses you got.
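Rough sketch of that timing test in Python (the thread count, timeout, and the candidates.txt wordlist are placeholders; the wordlist would come from something like crunch):

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests

# Local Tor SOCKS proxy; socks5h keeps hostname resolution inside Tor.
TOR_PROXIES = {"http": "socks5h://127.0.0.1:9050",
               "https": "socks5h://127.0.0.1:9050"}

def check(address):
    """Return True if this 16-character candidate answers with HTTP 200."""
    try:
        r = requests.get(f"http://{address}.onion/",
                         proxies=TOR_PROXIES, timeout=20)
        return r.status_code == 200
    except requests.RequestException:
        return False

# Hypothetical wordlist, one candidate per line (e.g. generated by crunch).
candidates = [line.strip() for line in open("candidates.txt") if line.strip()]

start = time.time()
with ThreadPoolExecutor(max_workers=50) as pool:
    results = list(pool.map(check, candidates))
elapsed = time.time() - start

print(f"{sum(results)} of {len(candidates)} returned 200 in {elapsed:.1f}s")
```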


  • 2 weeks later...

The only thing I can think of is: why should the Deep Web be legalized in the first place? I mean, when we talk about the deep web, we are essentially talking about the possibility of surfing in total anonymity, and that aspect makes it wholly desirable for cybercriminals.

Edited by digininja
edited to remove white on white spam links

  • 3 weeks later...
On 2018-03-20 at 12:23 PM, athman8 said:

The only thing I can think of is: why should the Deep Web be legalized in the first place? I mean, when we talk about the deep web, we are essentially talking about the possibility of surfing in total anonymity, and that aspect makes it wholly desirable for cybercriminals.

And?


  • 1 month later...

Oops, forgot.

Not sure if it's possible, but ideally each thread would have its own connection to the Tor network.
That would help widen the bottleneck at the connection to the network.

I would use threads to create DNS response chunks, filter these and save the positives in chunk files.
These can then be crawled by other threads separately.
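One way to approximate one-circuit-per-thread without running several Tor clients is Tor's stream isolation: with the default IsolateSOCKSAuth behaviour, every distinct SOCKS username/password pair gets its own circuit, so each worker can simply authenticate with different (arbitrary) credentials. Rough sketch:

```python
import requests

def tor_session(worker_id):
    """Session whose SOCKS credentials push its traffic onto a separate circuit.

    Relies on Tor's default IsolateSOCKSAuth setting: connections that
    authenticate to the SOCKS port with different username:password pairs
    are kept on different circuits. The credentials themselves are arbitrary.
    """
    proxy = f"socks5h://worker{worker_id}:x@127.0.0.1:9050"
    session = requests.Session()
    session.proxies = {"http": proxy, "https": proxy}
    return session

# Each worker thread would build its own session once and reuse it.
sessions = [tor_session(i) for i in range(4)]
```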


  • 2 weeks later...

The bottleneck is waiting on that 200 response from the 10 million DNS requests that will fail.

 

16^(32) = kabillion possibilities

My math is inaccurate, but this is a crazy amount of bandwidth.

 

It's not a practical approach... I'll bet you will discover 1 working domain per day (or maybe even 1 working domain per week).

 

I have a practical approach for you: Google search dorks... This can be automated pretty quickly with Perl/Python/Ruby or even wget and bash.

 

google search allintext:.onion.

Crawl Google's results.

Use Google query modifiers to adjust how many results are displayed, or fine-tune by date or relevance.

 

Let's say you scrape from Google 10,000 domains that have http://*.onion written on their pages.

 

Then you crawl each of these domains and scrape together your list of possible working onions.

 

Next, you run this list through a TCP scan over the Tor network.
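Sketch of the scrape-then-probe half of that pipeline, assuming the Google-dorked pages have already been saved locally (the scraped_pages directory is a placeholder) and using an HTTP request over Tor in place of a raw TCP scan:

```python
import glob
import re

import requests

ONION_RE = re.compile(r"[a-z2-7]{16}\.onion")
TOR_PROXIES = {"http": "socks5h://127.0.0.1:9050",
               "https": "socks5h://127.0.0.1:9050"}

# Step 1: harvest candidate onions from the pages found via the dork.
candidates = set()
for path in glob.glob("scraped_pages/*.html"):      # hypothetical dump directory
    with open(path, errors="ignore") as fh:
        candidates.update(ONION_RE.findall(fh.read()))

# Step 2: probe each candidate over Tor and keep the ones that answer.
working = []
for onion in candidates:
    try:
        r = requests.get(f"http://{onion}/", proxies=TOR_PROXIES, timeout=20)
        if r.status_code == 200:
            working.append(onion)
    except requests.RequestException:
        pass

print(f"{len(working)} of {len(candidates)} candidates are reachable")
```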

 

If I find the time I can build a tool that does the hard work/crawling.

 

But I'm not the type to give away the tools I make for free.

Edited by i8igmac

7958661109946400884391936

7,958,661,109,946,400,884,391,936

Does this look like a kabillion?

katrillion?

36**16

36 characters. a-z + 0-9

26+10=36

An onion link is typically 16 characters long.

So 36 to the 16th power

36**16=7958661109946400884391936
 

Kazillion dns queries. impossible.

there are already onion search engines. You can pull millions of working onions.

 

Makes me think about scanning these onions for sqli/lfi injections...

 

Anyways, if you had 500 botnet GIG devices, you might accomplish this. I wish I had government access to perform this kind of scanning.

 

I deserve a government job after this post.

Edited by i8igmac

  • 3 years later...
On 6/6/2018 at 7:46 AM, i8igmac said:

7958661109946400884391936

7,958,661,109,946,400,884,391,936

Does this look like a kabillion?

katrillion?

36**16

36 characters. a-z + 0-9

26+10=36

An onion link is typically 16 characters long.

So 36 to the 16th power

36**16=7958661109946400884391936
 

Kazillion dns queries. impossible.

there are already onion search engines. You can pull millions of working onions.

 

Makes me think about scanning these onions for sqli/lfi injections...

 

Anyways, if you had 500 botnet GIG devices, you might accomplish this. I wish I had government access to perform this kind of scanning.

 

I deserve a government job after this post.

Wait, so is it actually possible to create a web crawler that works on Tor network websites?! That's an incredible idea! I never thought it could be possible. I mean, it's such a slow network to browse, which is why I didn't think it was possible. I'm also pretty new to web crawling, but the basics are something I have known since I was a student. I think a template for your project could be https://proxycrawl.com/scraping-api-avoid-captchas-blocks. They have some of the best web crawling features on the entire web. Besides that, they also tried to develop a service that would work on the Tor network.
