
Data Mining!


The Sorrow

I'm looking to set up a data-mining/spiderbot server for my own devious reasons. Can someone suggest a good piece of software that will run on Ubuntu Server and crawl the interwebs in search of information? I would like it to toss its findings into some sort of database if at all possible.


Hey Sorrow, I remember you mentioning that you code in Python. You should check out Scrapy (scrapy.org). I think it has DB options; if not, you should be able to pickle or shelve your findings for future reference.
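
A minimal sketch of the shelve idea, using only the standard library and assuming each finding is a small picklable dict keyed by the page URL. The names `save_finding` and `load_finding` and the layout of the record are made up for illustration; they are not part of Scrapy.

```python
import shelve


def save_finding(db_path, url, data):
    """Store one crawl result under its URL (hypothetical helper)."""
    # shelve pickles the value and writes it to a dbm file at db_path.
    with shelve.open(db_path) as db:
        db[url] = data


def load_finding(db_path, url):
    """Return the stored result for a URL, or None if it was never saved."""
    with shelve.open(db_path) as db:
        return db.get(url)
```

Something like `save_finding("findings", "http://example.com/", {"title": "Example"})` would then persist between runs. A shelf is only a single-process key/value store, so a real database is likely the better fit once the crawl gets big.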

Edited by bobbyb1980

The Majestic bot is downloadable to put on your own servers. Personally I block the damn thing: it doesn't follow robots.txt most of the time, and I have found a few sites running it that leave the admin interface open to the web. So I'm not sure how safe it is if someone found it running, or whether it's easy to get into the machine via the bot. I think people mod it to do what they want and ignore robots.txt. http://www.majestic12.co.uk/

edit: By the way, if you want a one-off for searching a specific site and following links on that site, wget will do this as well. Read the documentation.
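
For a home-grown crawler, honoring robots.txt takes only a few lines with Python's standard library. This is a rough sketch, assuming the robots.txt text has already been downloaded as a string; `is_allowed` and the `mybot` user agent are hypothetical names, not anything taken from the Majestic bot or wget.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, url):
    """Check a URL against robots.txt rules supplied as plain text."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access happens here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Example rules: everything under /private/ is off limits to all bots.
RULES = "User-agent: *\nDisallow: /private/\n"
blocked = is_allowed(RULES, "mybot", "http://example.com/private/page")
permitted = is_allowed(RULES, "mybot", "http://example.com/public")
```

Calling this before every fetch is the polite default; whether a given bot actually does it is up to whoever runs it.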

Edited by digip

Thanks, both of you guys. Those are leads I'll follow. But I'm looking to do some really deep web indexing, so something that follows links to other sites and such would be awesome too. It just sounds like fun to see what a bot can find on the tangle of routers and switches we call the internet.
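
The core of following links to other sites and tossing the findings into a database can be sketched with the standard library alone. This assumes the page HTML has already been fetched; the `links` table, `LinkExtractor`, and `store_links` are hypothetical names for illustration, and a real crawler would still need a fetch queue, deduplication, and rate limiting on top.

```python
import sqlite3
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the <a href="..."> tags of one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page they came from.
                self.links.append(urljoin(self.base_url, value))


def store_links(conn, page_url, html):
    """Save every link on a page, flagging the ones that leave the site."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS links "
        "(source TEXT, target TEXT, external INTEGER)"
    )
    extractor = LinkExtractor(page_url)
    extractor.feed(html)
    source_host = urlparse(page_url).netloc
    rows = [
        (page_url, link, int(urlparse(link).netloc != source_host))
        for link in extractor.links
    ]
    conn.executemany("INSERT INTO links VALUES (?, ?, ?)", rows)
    conn.commit()
    return rows
```

Rows with `external` set to 1 are the links that point at other hosts, which is where a deep crawl would go next.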
