PHP scraping


digip

Is there a way to use PHP to scrape the contents of a frame or iframe before it renders and put it into a string or file for parsing?

I understand how to use file_get_contents to fetch a URL into a temp file and then paste the result wherever I want, such as into the current page. But I want to take data from the current page, from a subframe the user has navigated, and parse it as text without passing a URL to file_get_contents. So basically: take the inner HTML, pass it to the outer HTML, and then return it to the inner HTML sanitized.

For example, if I follow links in a frame or iframe, the pages I go to display in that frame. I want to scrape that data using something like the DOM or PHP to strip specific things out (images, ads, JavaScript, etc.) and write the result back to the frame or iframe.

Basically, I'm making a safe browser or text browser in PHP. When a person navigates within the frame, the script catches the request before the page loads into the frame and strips out certain data before returning it to the user. I'm thinking it would probably need some sort of XML request or something to do the pulling, and it would need to rewrite all links within the frame so they are sent back to the outer HTML/PHP side of the page.

The key reason is that with file_get_contents I usually supply a URL, either through a POST form or directly in the script as a variable, and I want this to happen on the fly as they move from page to page within the frame. That way they never have to post the URL through a form or script; they just surf normally, and the outer PHP page does all the safeguarding of the inner HTML contents dynamically.

Browsers prevent that type of access unless the page in the frame is from the same origin (the same domain, protocol, and port) as the page containing it.

Allowing it would be an XSS vulnerability.

You could have a form like:

<form action="cleanMe.php" method="get">
<input type="text" id="url" name="url" />
<input type="submit" />
</form>

and cleanMe.php could look like:

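<?php
// Accept only http(s) URLs so the script cannot be pointed at local files.
$url = $_GET['url'] ?? '';
$page = preg_match('#^https?://#i', $url) ? file_get_contents($url) : false;
if ($page === false) {
    exit('Could not fetch that URL');
}

// process $page

echo $page;
?>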

When you process $page, you would rewrite all the links as cleanMe.php?url=<url> (with the target URL-encoded), so that every click goes back through the script.
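
As a rough sketch of what the processing step might look like, the following uses PHP's DOM extension to drop scripts, images, and nested frames, and to route every link back through cleanMe.php. The function names `cleanPage` and `resolveUrl` are made up for illustration, and the relative-URL handling is deliberately simplified (it does not normalise `../` segments or honour a `<base>` tag).

```php
<?php
// Hypothetical helper: turn a link target into an absolute URL.
// Simplified on purpose; not a full RFC 3986 resolver.
function resolveUrl(string $href, string $baseUrl): string
{
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $href)) {
        return $href; // already has a scheme
    }
    $base = parse_url($baseUrl);
    $origin = $base['scheme'] . '://' . $base['host']
        . (isset($base['port']) ? ':' . $base['port'] : '');
    if (strncmp($href, '//', 2) === 0) {
        return $base['scheme'] . ':' . $href; // protocol-relative
    }
    if ($href[0] === '/') {
        return $origin . $href; // root-relative
    }
    // Relative to the directory of the current page.
    $path = $base['path'] ?? '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $href;
}

// Hypothetical helper: strip active content and rewrite links to the proxy script.
function cleanPage(string $html, string $baseUrl, string $proxy = 'cleanMe.php'): string
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings instead of printing them.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    // Remove elements that run scripts or load images and nested frames.
    foreach (['script', 'img', 'iframe', 'frame', 'object', 'embed'] as $tag) {
        // Copy to an array first: the node list is live and shrinks as nodes are removed.
        foreach (iterator_to_array($doc->getElementsByTagName($tag)) as $node) {
            $node->parentNode->removeChild($node);
        }
    }

    // Send every http(s) link back through the proxy; drop any other scheme.
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        if ($href === '' || $href[0] === '#') {
            continue; // leave in-page anchors alone
        }
        $target = resolveUrl($href, $baseUrl);
        if (preg_match('#^https?://#i', $target)) {
            $a->setAttribute('href', $proxy . '?url=' . rawurlencode($target));
        } else {
            $a->removeAttribute('href'); // e.g. javascript: or mailto:
        }
    }

    return $doc->saveHTML();
}
```

In cleanMe.php, the "process $page" step could then be something like `$page = cleanPage($page, $_GET['url']);`. This still leaves inline event handlers (onclick and friends), stylesheets, and form actions untouched, so treat it as a starting point rather than a complete sanitizer.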
