
Posted

A friend of mine is paying someone for SEO right now and isn't very happy with the results. The price is right, but I guess you get what you pay for. I took a quick glance at the page source: there are no keywords in the meta tags (I'm not really sure what he is paying for, then), and I found a redirect link in the source as well. I tried "wget --recursive blah" on the TLD and all I get is the index.html. I then tried the same thing on the subdomain it redirects to and got "403 Forbidden". Why can I browse the page in full, but not download it?

I'm heading over to the BackTrack tutorials after this, because I remember seeing something in there about downloading entire sites for phishing attacks, but will I get the same error?

Posted

Use the user agent switch with wget and spoof your user agent; they probably block wget's default agent. Something like --user-agent="Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" would work in most cases, but you can use a real browser agent instead of spoofing Googlebot.
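
For example, something along these lines should do it (standard GNU wget; the URL is just a placeholder for the subdomain you're after):

[CODE]
# recursive fetch while pretending to be Googlebot instead of wget's default "Wget/x.y" agent
wget --recursive --user-agent="Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" http://cdn.example.com/
[/CODE]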

Posted

Tried the user agent spoof; still no dice. Also tried the -H option, which got me basically EVERYTHING other than the site I wanted. I have also tried HTTrack from the BackTrack tutorials with no luck. I ran a whois on the domain, got the tech contact's email address, and told them my situation. We'll see if that works, but I'm doubtful...

Posted

Is the page required to use a referrer, certificate, or cookies? There are switches and things you can feed to wget to tell it to use cookies, to ignore certificate errors if it's HTTPS, and to allow it to follow and go to HTTPS pages. Other than that, is there any other authentication mechanism on the page itself that you go through in the browser and might need to pass in wget? I'm just trying to figure out why it can't be reached with wget but can with a regular browser. Try using Wireshark to view the HTTP request and response from a browser and also what you get with wget, then compare the two; it might shed some light on the issue.
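
If Wireshark feels like overkill, wget itself can dump the headers for you; a quick sketch (standard GNU wget flags, placeholder URL):

[CODE]
# print the request/response headers (and full debug output) so you can compare them
# with what the browser sends; wget logs to stderr, so capture that to a file
wget --server-response --debug --no-check-certificate \
     --user-agent="Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" \
     http://cdn.example.com/ 2> wget-headers.txt
[/CODE]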

With respect to HTTrack, its default user agent identifies itself as HTTrack too, so you need to configure a user agent spoof there as well. HTTrack also hammers a site so fast that, without throttling, it can sometimes kill your connection to the site. If they use a rate-limiting module (for Apache or such), they could just return 403s all day long if they think there are too many concurrent connections, so the number of threads to the server is an issue too. I sometimes run into this in Opera and Firefox, since I can configure how many concurrent connections go to one site, and servers will just give you a 403 Forbidden or 500 error page if you open too many. My own site does it to me at times, so that can be something to look into.
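
If you go back to HTTrack, something roughly like this should cover both the user agent and the throttling (flags as I recall them from HTTrack's help; the URL and output folder are placeholders):

[CODE]
# spoof the user agent (-F), limit simultaneous connections (-c2), connections per
# second (-%c1) and transfer rate in bytes/sec (-A25000) so the site doesn't block us
httrack "http://cdn.example.com/" -O ./mirror \
        -F "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" \
        -c2 -%c1 -A25000
[/CODE]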

Posted

There is no authentication, and I can still browse the site with cookies disabled. I found this in the page source:


[CODE]
<script type="text/javascript">
var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
</script>
[/CODE]

But that just looks like tracking nonsense. There is also a meta tag named verify-v1, and on the next line another one that refers to "google-site-verification" in its content. I also noticed that there is a line that requests the background image from "/images" rather than from the redirected domain, but wget still only grabs index.html; it never sees an "images" subdirectory.
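
Maybe I need to tell wget to grab the page requisites and span hosts onto the CDN? Just guessing at the flags here (the domains are placeholders for the real site and its subdomain):

[CODE]
# -p pulls page requisites (images, CSS), -H lets wget leave the starting host,
# and -D limits the spanning to the listed domains so it doesn't crawl the whole internet
wget -r -p -H -Dexample.com,cdn.example.com \
     --user-agent="Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" \
     http://example.com/
[/CODE]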

The whois on the TLD shows all of my buddy's information through and through: name, address, everything. Would it be legal to try more nefarious routes if he gave me his permission? Or would it be a waste of time, since everything that I want is on a subdomain (cdn.blah.com) which doesn't belong to him?

Posted

If there is no href link on the index.html page to the subdomain you want to spider, then it has nothing to follow. I'd have to see the script you wrote for wget, but here is what I use to spider and download files from sites. It's a Windows .bat script; if you use Linux, just change it to suit your own needs (a rough shell translation follows after the batch file). Just create the folder "SpiderDownloadShit" and place the .bat script one level up from the download folder.

wget-spider-download.bat:


[CODE]
:123
@echo OFF
cls
echo Choose a site to download links from.
SET /P website="[example: www.google.com] : "
wget -erobots=off --accept="html,htm,php,phps,phtml,jpg,jpeg,gif,png,bmp,pl,txt,asp,aspx,jsp,js,chm,shtml,css,mov,avi,mpg,mp3,mp4,pdf,flv,swf,bz2,tar,rar,zip" -l 1000 -rH -P SpiderDownloadShit/ -D%website% --no-check-certificate --user-agent="Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" %website% -do SpiderDebug.txt
echo Links found in %website%. SpiderDownloadShit/ is to be ignored. > Spiderlinks.txt
find "saved" SpiderDebug.txt >> Spiderlinks.txt
:: optional cleanup between runs -- uncomment to wipe the debug log and downloads
::del SpiderDebug.txt
::rmdir SpiderDownloadShit /s /q
::pause
goto:123
[/CODE]
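
If it helps, here is roughly the same loop as a shell script (my rough translation, same placeholder folder name and flags as the batch version, with the accept list trimmed for readability; adjust as needed):

[CODE]
#!/bin/sh
# rough Linux equivalent of the .bat above: prompt for a site, spider it, log what was saved, repeat
mkdir -p SpiderDownloadShit
while true; do
    printf "Choose a site to download links from [example: www.google.com] : "
    read WEBSITE
    wget -erobots=off --accept="html,htm,php,jpg,jpeg,gif,png,css,js,pdf,zip" \
         -l 1000 -rH -P SpiderDownloadShit/ -D"$WEBSITE" --no-check-certificate \
         --user-agent="Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" \
         "$WEBSITE" -do SpiderDebug.txt
    echo "Links found in $WEBSITE. SpiderDownloadShit/ is to be ignored." > Spiderlinks.txt
    grep "saved" SpiderDebug.txt >> Spiderlinks.txt
done
[/CODE]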

Posted

Here is what I normally use (I added the user agent string you showed me):


[CODE]
#!/bin/sh
echo "Enter Site:"
read SITE
# -r recursive, -p grab page requisites, keep the response headers in the saved files
wget -r -p --save-headers --auth-no-challenge --user-agent="Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "$SITE"
[/CODE]


My friend's site is amerangaragedoor.com and the redirect is cdn.amerangaragedoor.com. The site is nothing fancy, and my original plan was just to wget the site, change a few things, and upload it to another host so he could stop giving someone money for nothing.

Posted (edited)

Well, the link you posted doesn't even work. I think you have the wrong site name; I get a DNS error trying to reach both "amerangaragedoor.com" and "www.amerangaragedoor.com". But even still (assuming it was a typo), try the one line from my script above.


[CODE]
wget -erobots=off --accept="html,htm,php,phps,phtml,jpg,jpeg,gif,png,bmp,pl,txt,asp,aspx,jsp,js,chm,shtml,css,mov,avi,mpg,mp3,mp4,pdf,flv,swf,bz2,tar,rar,zip" -l 1000 -rH -P SpiderDownloadShit/ -Damerangaragedoor.com --no-check-certificate --user-agent="Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" amerangaragedoor.com -do SpiderDebug.txt
[/CODE]

Just change "amerangaragedoor.com" in the above line, in two places, to your friend's actual site name. By the way, there is no space after the -D switch.

Edited by digip
Posted

Worked like a charm! Sorry about the typo; the actual URL was amerangaragedoors.com. I was only halfway into a pot of coffee. What was I doing wrong? I noticed that you use -rH; when I just used -H, it seemed like wget was trying to download the entire internet.

Posted

Might have been your syntax, but first read up on all the switches I added in the string: -rH, -D, and -l, from the help file. I think with recursive you also have to tell it how many levels deep to go, with the -l switch. I also tell it which file types to accept, where normally it would just download HTML files, I believe. You could probably get away with "*.*" (haven't tried that), but I use it for specific file types. I also write the requests and output to a text file, so I can see all the server requests and redirects it follows, what it downloaded, and so on. That's the debug and file output part, "-do SpiderDebug.txt".

Some of the syntax is weird, though; for example, -D needs no space after it or the command fails. Because I only passed one domain name, it only stays on that domain. Mine basically says: ignore robots.txt, ignore certificate errors, and go at most 1000 levels deep.
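
For what it's worth, flag by flag that one-liner breaks down like this (same command as above, just annotated, and with the accept list trimmed for readability):

[CODE]
# -e robots=off             ignore robots.txt
# --accept="..."            only keep the listed file types
# -l 1000 -rH               recursive, span hosts, at most 1000 levels deep
# -P SpiderDownloadShit/    download into this folder
# -Damerangaragedoors.com   ...but stay on this domain (no space after -D)
# --no-check-certificate    ignore certificate errors
# --user-agent="..."        pretend to be Googlebot instead of wget
# -do SpiderDebug.txt       -d (debug) plus -o (write the log to this file)
wget -erobots=off --accept="html,htm,css,js,jpg,jpeg,gif,png,pdf,zip" -l 1000 -rH \
     -P SpiderDownloadShit/ -Damerangaragedoors.com --no-check-certificate \
     --user-agent="Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" \
     amerangaragedoors.com -do SpiderDebug.txt
[/CODE]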

There are switches to even pass cookies back and forth, in case some sites want them, and even usernames and passwords if need be, or to force the referrer to be the same domain as well. Trial and error, though. wget has always been kind of tricky for me, but I just keep experimenting with it.
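
Something along these lines, for example (standard GNU wget switches; the login URL, form fields, and site are all placeholders, so adjust to whatever the site actually wants):

[CODE]
# log in once and save the cookies the site hands out
wget --save-cookies cookies.txt --keep-session-cookies \
     --post-data="user=myname&pass=mypass" \
     -O /dev/null http://example.com/login.php

# then mirror with those cookies and a matching referrer
wget -r -p --load-cookies cookies.txt --referer="http://example.com/" \
     --user-agent="Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" \
     http://example.com/
[/CODE]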

I have a few other scripts I use for various things, like pulling down podcast videos and MP3s from RSS feeds. Some of the other examples are on the forums already; just search my name and wget for posts with the other code examples.
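
The RSS ones are basically a variation on the same idea; a stripped-down sketch (the feed URL is a placeholder, and it assumes the enclosure links end in .mp3):

[CODE]
#!/bin/sh
# pull the feed, pick out the .mp3 enclosure URLs, and hand the list to wget
FEED="http://example.com/podcast/rss.xml"
wget -q -O - "$FEED" | grep -o 'http[^"<]*\.mp3' | sort -u > episodes.txt
wget -c -i episodes.txt -P podcasts/
[/CODE]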
