HTMLgetter: Python Code gets HTML code and outputs to .txt file

Forgiven · September 16, 2013

I don't know, maybe somebody will find this useful in their pentesting arsenal.

#!/usr/local/bin/python

# HTMLgetter v1.0 by Forgiven
# This is a handy bit of Python that will reap the HTML code of any page
# and output it to a txt file of your choice.

import urllib2

urlStr = raw_input('Input the full URL of the webpage whose HTML code you wish to reap: ')
fileName = raw_input('Input the *.txt filename for the output: ')
fileName = fileName + '.txt'

try:
    # Fetch the page and read the whole response body
    fileHandle = urllib2.urlopen(urlStr)
    str1 = fileHandle.read()
    fileHandle.close()
    print '-' * 50
    print 'HTML code of URL =', urlStr
    print '-' * 50
except (IOError, ValueError):
    # IOError covers urllib2.URLError; ValueError is raised for a URL with no scheme
    print 'Cannot open URL %s for reading' % urlStr
    str1 = 'error!'

# Write whatever we got (the HTML, or 'error!') to the output file
fileOut = open(fileName, 'w')
fileOut.write(str1)
fileOut.close()
print str1
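
For anyone on Python 3, where urllib2, raw_input and the print statement are gone, roughly the same idea might look like the sketch below. It is not part of HTMLgetter v1.0: save_html and its timeout parameter are names made up for illustration, and it writes the raw bytes rather than guessing the page's encoding.

```python
import urllib.error
import urllib.request


def save_html(url, file_name, timeout=10):
    """Fetch url and write the raw response body to file_name.

    Returns the number of bytes written, or None if the URL could not be opened.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read()  # bytes, exactly as the server sent them
    except (urllib.error.URLError, ValueError) as err:
        # URLError covers network and HTTP failures; ValueError covers
        # strings that are not usable URLs (for example, no scheme).
        print('Cannot open URL %s for reading: %s' % (url, err))
        return None
    # Open the output only after a successful fetch, so a failed request
    # does not leave behind a file holding just an error string.
    with open(file_name, 'wb') as file_out:
        file_out.write(body)
    return len(body)


# Usage (not run here):
# save_html('http://www.somesite.com/page.html', 'file.txt')
```

Unlike the script above, this version skips writing the output file when the fetch fails.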

I thought it was cool: it creates a nice txt file of the HTML from a web page. I guess I don't have permission to upload the .py file here, but the code is small and simple enough to copy and paste.

You can find it on github at the link.

Edited September 16, 2013 by Forgiven

digip · September 16, 2013

wget http://www.somesite.com/page.html -O file.txt

would work too, but it's good to see someone writing scripts. I'm a n00b at scripting, and it's nice to see how Python works, since I mostly just do simple bash scripts for things.

Forgiven · September 16, 2013

Gotta love linux man.

newbi3 · September 17, 2013

I just wrote something similar to this yesterday for a much different purpose and I used the requests library. I suggest you check it out!

http://docs.python-requests.org/en/latest/

Mr-Protocol · September 17, 2013

Even easier:

curl anysiteyouwish.com > local.html

Curl has a ton of flexibility:

man curl

Forgiven · September 17, 2013

The bash scripts you guys shared are so tight! I'm going to have to learn me some of that...science is my gig.

Here's a question for you gurus: let's say that I want to log on to my favorite horse wagering site, twinspires.com, from the command line. Is there a script that will pass the username and password through the form so that I can get at the live toteboard odds when the page redirects to the wagering home page? I can't find live odds data for horse tracks anywhere else, and I want to pass the odds to an app I'm writing. Or, once I have already logged on to a website, is there a simple script that will scarf the data I need and pass it to a .csv or .txt file?

...Requests and Mechanize are pretty awesome, but Bash is way awesomer.

Forgiven · September 17, 2013

Here's the HTML of the login section of twinspires:

<div class="column col1" id="sidebar-left">
<div id="sidebar-outer-wrapper">
<div class="bottom-wrapper">
<div class="sidebar-container">
<div id="logged-in-user">
<div class="ajax-loading"></div>
<div class="panel-pane pane-type1 anonymous-content" id="pane-login-block">
<h2 class="pane-title">Login</h2>
<div id="login-section" class="pane-content">

<form method="post" action="https://www.twinspires.com/php/login.php">
<input type="hidden" name="destination" value="">
<input type="hidden" value="user_login" name="form_id">
<input type="hidden" value="2800" name="affid">
<input type="hidden" value="0" name="blocklogin">
<input type="hidden" value="1" name="wager">
<input id="edit-redirect" type="hidden" value="http://www.twinspires.com/wager" name="redirect">

<ul class="field-set">
<li>
<label for="username">Username:</label>
<input type="text" name="acct" id="username" class="text-box" maxlength="100" size="20">
</li>
<li>
<label for="password">Password:</label>
<input type="password" name="pin" id="password" class="text-box" maxlength="16" size="20">
</li>
<li>
<span id="reset-login-link"><a href="http://www.twinspires.com/account/password/request">forgot your login information?</a></span>
<input type="submit" class="button" value="Login" id="Login" name="Login">
</li>
</ul>

</form>
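
Going by the form above, a scripted login would be a POST to the form's action URL carrying the hidden fields plus acct and pin. Below is a minimal Python 3 sketch of that, using only the standard library. The URL, field names and values are copied from the HTML above; build_login_request and make_session_opener are names made up for illustration. Whether the site accepts a plain form POST like this (no JavaScript-generated tokens, no extra checks) is an assumption that has not been tested here.

```python
import http.cookiejar
import urllib.parse
import urllib.request

# Action URL and hidden field values are copied from the form HTML above.
LOGIN_URL = 'https://www.twinspires.com/php/login.php'
HIDDEN_FIELDS = {
    'destination': '',
    'form_id': 'user_login',
    'affid': '2800',
    'blocklogin': '0',
    'wager': '1',
    'redirect': 'http://www.twinspires.com/wager',
}


def build_login_request(username, password):
    """Return a POST request carrying the fields named in the login form."""
    fields = dict(HIDDEN_FIELDS)
    fields['acct'] = username   # name="acct" is the username box
    fields['pin'] = password    # name="pin" is the password box
    fields['Login'] = 'Login'   # the submit button is a named field too
    body = urllib.parse.urlencode(fields).encode('ascii')
    # Supplying data makes urllib send a POST instead of a GET.
    return urllib.request.Request(LOGIN_URL, data=body)


def make_session_opener():
    """Return an opener that keeps cookies between requests."""
    jar = http.cookiejar.CookieJar()
    return urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))


# Usage (not run here; needs a real account):
# opener = make_session_opener()
# page = opener.open(build_login_request('myuser', 'mypin')).read()
```

The cookie jar is the part that matters for the second half of the question: any session cookie set by the login response is sent back on later requests made through the same opener, so a follow-up fetch of the odds page would go out as the logged-in session, if the login itself worked.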

digip · September 17, 2013

curl can do data posts with usernames and passwords, and so can wget. Some sites don't take a POST but use something like 401 (HTTP basic) auth, and for those you can just encode the credentials in the URL itself, i.e. http://user:pass@site.com. I DON'T recommend doing that on http sites or in a browser others use, since it can be seen in the address bar and is sent in the clear.
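
For the 401-auth case, one way to keep the credentials out of the URL is to send them in an Authorization header instead. This is a small Python 3 sketch under that assumption; basic_auth_request is a made-up name and site.com is just the placeholder from above. The header is only base64-encoded, not encrypted, so over plain http it still travels in the clear. All it avoids is the address bar and anything that logs URLs.

```python
import base64
import urllib.request


def basic_auth_request(url, username, password):
    """Build a request that carries HTTP Basic credentials in a header."""
    raw = ('%s:%s' % (username, password)).encode('utf-8')
    token = base64.b64encode(raw).decode('ascii')
    request = urllib.request.Request(url)
    request.add_header('Authorization', 'Basic ' + token)
    return request


# Usage (not run here):
# page = urllib.request.urlopen(
#     basic_auth_request('https://site.com/', 'user', 'pass')).read()
```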
