murder_face Posted December 7, 2012

A friend of mine is paying someone for SEO right now and isn't very happy with the results. The price is right, but I guess you get what you pay for. I took a quick glance at the page source and there are no keywords in the meta tags (I'm not really sure what he is paying for, then), and I found a redirect link in the source as well. I tried "wget --recursive blah" on the TLD and all I get is the index.html. I then tried the same thing on the subdomain that it redirects to and I get "403 Forbidden". Why can I browse the page in full, but not download it? I'm heading over to the backtrack tutorials after this, because I remember seeing something in there about downloading entire sites for phishing attacks, but will I get the same error?
digip Posted December 7, 2012

Use the user agent switch with wget and spoof your user agent; they probably block wget's default one. Something like --user-agent="Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" would work in most cases, though you can use a real browser's agent string instead of spoofing Google's bot.
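Something along these lines is the general idea (just a quick example; example.com stands in for the real site):

[CODE]
# Recursive fetch while identifying as Googlebot instead of wget
wget --recursive \
     --user-agent="Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" \
     http://example.com/
[/CODE]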
murder_face Posted December 8, 2012

Tried the user agent spoof. Still no dice. Also tried the -H option, which got me basically EVERYTHING other than the site I wanted. I have also tried HTTrack from the backtrack tutorials with no luck. I ran a whois on the domain, got the tech contact's email address, and told them my situation. We'll see if that works, but I'm doubtful...
digip Posted December 8, 2012

Is the page required to use a referrer, certificate, or cookies? There are other switches you can feed to wget to tell it to use cookies, to ignore certificate errors if it's HTTPS, or to allow it to follow and fetch HTTPS pages. Other than that, is there any other authentication mechanism on the page itself that you go through in the browser and might need to pass along in wget? Just trying to figure out why it can't be reached with wget but can with a regular browser. Try using wireshark to view the HTTP request and response from a browser and then what you get with wget, and compare the two; that might shed some light on the issue.

With respect to HTTrack, its default user agent is HTTrack too, so you need to configure a user agent spoof there as well. HTTrack also hammers a site so fast that throttling can sometimes kill your connection to the site. If they use a rate-limiting module for Apache or the like, they could just return 403s all day long if they think there are too many concurrent connections, so the number of threads hitting the server is an issue too. I sometimes run into this in Opera and Firefox, since I can configure how many concurrent threads connect to one site, and sites will just give you a 403 Forbidden or 500 error page if you have too many connections. My own site does it to me at times, so that's something to look into.
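If cookies or certificates do turn out to be the problem, this is roughly the shape of it (a sketch only; example.com and cookies.txt are placeholders, and you should double-check the switches against wget --help on your version):

[CODE]
# First request just to capture any cookies the site sets
wget --save-cookies=cookies.txt --keep-session-cookies -O /dev/null http://example.com/

# Recursive run reusing those cookies, sending a referrer, ignoring cert errors,
# and printing the server response headers so you can compare them with the browser's
wget --recursive \
     --load-cookies=cookies.txt \
     --referer="http://example.com/" \
     --no-check-certificate \
     --server-response \
     http://example.com/
[/CODE]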
murder_face Posted December 8, 2012

There is no authentication, and I can still browse the site with cookies disabled. I found this in the page source:

[CODE]
<script type="text/javascript">
var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
</script>
[/CODE]

But that just looks like tracking nonsense. There is also a meta tag titled verify-v1, but on the next line it refers to the same content in "google-site-verification". I also noticed that there is a line requesting the background from "/images" rather than from the redirected domain, but wget still only grabs index.html; it never sees an "images" subdirectory.

The whois on the TLD shows all of my buddy's information through and through: name, address, everything. Would it be legal to try more nefarious routes if he gave me his permission? Or would it be a waste of time, since everything I want is on a subdomain (cdn.blah.com) which doesn't belong to him?
digip Posted December 8, 2012

If there is no href link on the index.html page to the subdomain you want to spider, then it has nothing to follow. I'd have to see the script you wrote for wget to use, but here is what I use to spider and download files from sites. It's a Windows bat script; if you use Linux, just change it to suit your own needs. Just create the folder "SpiderDownloadShit" and place the bat script one level up from the download folder.

wget-spider-download.bat:

[CODE]
:123
@echo OFF
cls
echo Choose a site to download links from.
SET /P website="[example: www.google.com] : "
wget -erobots=off --accept="html,htm,php,phps,phtml,jpg,jpeg,gif,png,bmp,pl,txt,asp,aspx,jsp,js,chm,shtml,css,mov,avi,mpg,mp3,mp4,pdf,flv,swf,bz2,tar,rar,zip" -l 1000 -rH -P SpiderDownloadShit/ -D%website% --no-check-certificate --user-agent="Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" %website% -do SpiderDebug.txt
echo Links found in %website%. SpiderDownloadShit/ is to be ignored. > Spiderlinks.txt
find "saved" SpiderDebug.txt >> Spiderlinks.txt
::del SpiderDebug.txt
::rmdir SpiderDownloadShit /s /q
::pause
goto:123
[/CODE]
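If you're on Linux, a rough sh translation of that bat file would be something like this (untested sketch, same switches):

[CODE]
#!/bin/sh
# Rough Linux equivalent of wget-spider-download.bat above
echo "Choose a site to download links from."
printf "[example: www.google.com] : "
read WEBSITE
wget -erobots=off \
     --accept="html,htm,php,phps,phtml,jpg,jpeg,gif,png,bmp,pl,txt,asp,aspx,jsp,js,chm,shtml,css,mov,avi,mpg,mp3,mp4,pdf,flv,swf,bz2,tar,rar,zip" \
     -l 1000 -rH -P SpiderDownloadShit/ -D"$WEBSITE" \
     --no-check-certificate \
     --user-agent="Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" \
     "$WEBSITE" -do SpiderDebug.txt
echo "Links found in $WEBSITE. SpiderDownloadShit/ is to be ignored." > Spiderlinks.txt
grep "saved" SpiderDebug.txt >> Spiderlinks.txt
[/CODE]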
murder_face Posted December 8, 2012

Here is what I normally use (I added the user agent string you showed me):

[CODE]
#!/bin/sh
echo "Enter Site:"
read SITE
wget -r -p --save-headers --auth-no-challenge --user-agent="Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" $SITE
[/CODE]

My friend's site is amerangaragedoor.com and the redirect is cdn.amerangaragedoor.com. The site is nothing fancy, and my original plan was just to wget the site, change a few things, and upload it to another host so he could stop giving someone money for nothing.
digip Posted December 9, 2012 (edited)

Well, the link you posted doesn't even work. I think you have the wrong site name; I get a DNS error trying to reach both "amerangaragedoor.com" and "www.amerangaragedoor.com". But even still (assuming it was a typo), try the one line from my script above:

[CODE]
wget -erobots=off --accept="html,htm,php,phps,phtml,jpg,jpeg,gif,png,bmp,pl,txt,asp,aspx,jsp,js,chm,shtml,css,mov,avi,mpg,mp3,mp4,pdf,flv,swf,bz2,tar,rar,zip" -l 1000 -rH -P SpiderDownloadShit/ -Damerangaragedoor.com --no-check-certificate --user-agent="Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" amerangaragedoor.com -do SpiderDebug.txt
[/CODE]

Just change "amerangaragedoor.com" in the above line, in two places, to whatever your friend's site name actually is. By the way, no space after the -D switch.

Edited December 9, 2012 by digip
murder_face Posted December 9, 2012

Worked like a charm! Sorry about the typo; the actual URL was amerangaragedoors.com. I was only halfway into a pot of coffee. What was I doing wrong? I noticed that you use -rH; when I just used -H, it seemed like wget was trying to download the entire internet.
digip Posted December 9, 2012

Might have been your syntax, but first read up on all the switches I added in the string: -rH, -D, and -l, from the help file. I think with recursive you also have to tell it how many levels deep to go, with the -l command. I also tell it what file types to accept, where normally it would just download HTML files, I believe. You could probably get away with "*.*" (haven't tried that), but I use it for specific file types. I also write the requests and output to a text file, so I can see all the server requests and redirects it follows, and you can see what it downloaded, etc. That's the debug and file output part, "-do SpiderDebug.txt".

Some of the syntax is weird, though; the -D switch needs no space afterwards, or the command fails. Because I only passed one domain name, it only stays on that domain. Mine basically says: ignore robots.txt, ignore certificate errors, and limit it to 1000 levels deep. There are commands to pass cookies back and forth in case some sites want them, and even usernames and passwords if need be, or to force all referrers from the same domain as well.

Trial and error, though. wget has always been kind of tricky for me, but I just keep experimenting with it. I have a few other scripts I use for various things, like pulling down podcast videos and mp3s from RSS feeds. Some of the other examples are already on the forums; just search my name and wget for posts with the other code examples.
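For reference, this is roughly what each switch in that one-liner does, as I read the man page (double-check with wget --help; example.com is a placeholder and the accept list is trimmed here):

[CODE]
#   -e robots=off            ignore robots.txt
#   -r                       recursive download
#   -H                       span hosts (follow links off the starting domain)
#   -Dexample.com            ...but only stay on the listed domain(s); note no space after -D
#   -l 1000                  how many levels deep to recurse
#   -P SpiderDownloadShit/   directory prefix to save everything under
#   --accept="..."           comma-separated list of file extensions to keep
#   --no-check-certificate   don't abort on SSL certificate errors
#   --user-agent="..."       spoof the browser/bot identity
#   -d -o SpiderDebug.txt    write debug output (requests, redirects, saves) to a log file
wget -e robots=off -rH -Dexample.com -l 1000 -P SpiderDownloadShit/ \
     --accept="html,htm,php,jpg,jpeg,gif,png,css,js,pdf" \
     --no-check-certificate \
     --user-agent="Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" \
     -do SpiderDebug.txt example.com
[/CODE]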