The Sorrow Posted May 13, 2012
I'm looking to set up a data-mining/spider-bot server for my own devious reasons. Can someone suggest a good piece of software that will run on Ubuntu Server and crawl the interwebs in search of information? I would like it to toss its findings into some sort of database if at all possible.
bobbyb1980 Posted May 13, 2012 (edited)
Hey Sorrow, I remember you mentioning that you code in Python. You should check out Scrapy (scrapy.org). I think it has DB options; if not, you should be able to pickle or shelve your findings for future reference.
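For the shelve suggestion above, here is a minimal sketch using only Python's standard library. The helper names `save_finding` and `load_finding`, the file name, and the shape of the stored dictionary are all made up for illustration; this is not Scrapy's own storage mechanism.

```python
import os
import shelve
import tempfile


def save_finding(db_path, url, finding):
    """Store one crawl result under its URL (hypothetical helper)."""
    with shelve.open(db_path) as db:
        db[url] = finding  # shelve pickles the value behind the scenes


def load_finding(db_path, url):
    """Return the stored result for url, or None if nothing was saved."""
    with shelve.open(db_path) as db:
        return db.get(url)


if __name__ == "__main__":
    # Assumed location: a throwaway temp directory, just for the demo.
    path = os.path.join(tempfile.mkdtemp(), "findings")
    save_finding(path, "http://example.com/", {"title": "Example", "links": []})
    print(load_finding(path, "http://example.com/"))
```

A shelf behaves like a dictionary that persists on disk, which is probably enough for a hobby crawl; a real database becomes worthwhile once you want to query by anything other than the key.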
digip Posted May 13, 2012 (edited)
The Majestic bot is downloadable to put on your own servers. Personally, I block the damn thing: it doesn't follow robots.txt most of the time, and I have found a few sites running it that leave the admin interface open to the web, so I'm not sure how safe the damn thing is if someone found it running, or whether it's easy to get into the machine via the bot. I think people mod it to do what they want and ignore robots.txt. http://www.majestic12.co.uk/
Edit: by the way, if you want a one-off for searching a specific site and following links on that site, wget will do this as well. Read the documentation.
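On the robots.txt point above: if you write your own bot in Python, honoring robots.txt can be sketched with the standard library's `urllib.robotparser`. The `make_checker` helper, the user agent `mybot`, and the sample rules are assumptions for illustration; a real bot would first download each site's actual robots.txt.

```python
from urllib.robotparser import RobotFileParser


def make_checker(robots_txt, user_agent):
    """Build a function that says whether user_agent may fetch a URL.

    robots_txt is the text of a robots.txt file that has already been
    downloaded (hypothetical helper, not part of any crawler package).
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    def allowed(url):
        return parser.can_fetch(user_agent, url)

    return allowed


if __name__ == "__main__":
    # Made-up rules: everything is open except the /admin/ area.
    rules = "User-agent: *\nDisallow: /admin/\n"
    allowed = make_checker(rules, "mybot")
    print(allowed("http://example.com/index.html"))
    print(allowed("http://example.com/admin/login"))
```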
The Sorrow Posted May 13, 2012
Thanks, both of you. Those are leads I'll follow. But I'm looking to do some really deep web indexing, so something that follows links to other sites and such would be awesome too. It just sounds like fun to see what a bot can find on the tangle of routers and switches we call the internet.
The Sorrow Posted May 14, 2012
So I seem to have found a decent program to use. Sphider is a PHP/MySQL indexer and crawler. I'm just going to have to tweak it to do its job automatically and constantly follow links from a root site. Still open to ideas, though!
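Following links outward from a root site, as described above, is essentially a breadth-first traversal. The sketch below shows the idea in plain Python. It is not how Sphider or any other tool mentioned in this thread works internally; the names `LinkParser`, `extract_links`, `fetch`, and `crawl`, and the `max_pages` limit, are all made up for illustration.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return absolute http(s) links found in html, without #fragments."""
    parser = LinkParser()
    parser.feed(html)
    links = []
    for href in parser.links:
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if absolute.startswith(("http://", "https://")):
            links.append(absolute)
    return links


def fetch(url):
    """Download a page as text. Needs network access."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(root_url, fetch_page=fetch, max_pages=50):
    """Visit pages breadth-first from root_url, following links to any site."""
    seen = {root_url}
    queue = deque([root_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except Exception:
            continue  # skip pages that fail to load
        visited.append(url)
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

Calling `crawl("http://example.com/")` would return the list of pages visited, in order. The `fetch_page` parameter lets you swap in a fake fetcher for testing, and `max_pages` keeps the crawl from wandering off forever. A polite bot would also add a delay between requests and check robots.txt before each fetch.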