Google Book Downloader Update: High Quality

Until now, Google Book Downloader has used the default image resolution provided by Google, even though books.google.com allows you to zoom in on pages to see higher resolution. Version 2.1 fixes this, finding the highest resolution available for every book you download. To download Google Book Downloader, see the app page here.

Quality comparison between GBD v. 2.0.1 and 2.1: Click to see a larger version where you can see the difference.

As you can see, books downloaded with this new version look much better. Now, the technical details:

Books on Google Books have a zoom-in button that lets you see the pages of most books in higher detail. To get GBD to download higher-detail images, I needed to somehow hijack this button. My first attempt was, of course, with JavaScript. But the JavaScript code that does the online zooming turned out to be very evasive: it is hidden inside layers of anonymous functions and called by Google’s custom event handling code. I eventually gave up on this.

To get around this problem, I used JavaScript to first find the button I wanted to click (using getElementById), but rather than trying to “click” the button with JavaScript, I passed the location of the button to Objective-C code. The native Objective-C code then used an NSEvent to simulate a click in the web page.

If you’re interested, this code is all available on GitHub. It simply adds a few methods to WebKit’s WebView that allow you to find an element in a web page, click on a location in a web page, or click on an element.


Google Book Downloader Update: Version 2.0

Google Book Downloader 2.0 is out! What’s new: it is much faster, has fewer bugs, and includes other, less important changes. To download, see the product page here.

If you are into the technical details, maybe you are wondering what went on behind the scenes to make such improvements. First, a quick background on scraping:

One way to scrape: Start with an AJAX web app that you want to borrow some data from. Reverse engineer it until you understand the API that it is using to get data from the server. Then re-engineer your app to use this API.

However, there is another way to scrape: Start with someone else’s AJAX web app. Load it into a web browser engine. Use whatever hooks the web browser engine provides to do things like 1) run your own JavaScript on the page and 2) monitor the network traffic of the web app. Compared to the first way to scrape, this requires a lot less thinking!

As you may have guessed, GBD 1 used the first method, while GBD 2 uses the second. Not only does this save me from having to worry about Google’s AJAX calls, but it also simplifies the source code of the application. It also has the nice effect of speeding up the application.


[OS X] Install Pwntcha

Pwntcha is an open source tool for breaking CAPTCHAs. While it is a few years old and only works for very simple CAPTCHAs, it’s still an interesting project and would be a good place to start if you wanted to write a program to break more complex ones. To install it on OS X:

  1. Install the Simple DirectMedia Layer library: From Terminal and with MacPorts installed, type
    sudo port install libsdl_image

    As an alternative, installing imlib2 would probably also work.

    I encountered an error installing db46, one of the dependencies of libsdl and imlib2, which I fixed by installing the Java for Mac OS X 10.6 Update 3 Developer Package.

  2. Check out a copy of Pwntcha via SVN:
    svn co svn://svn.zoy.org/caca/pwntcha/trunk pwntcha
  3. Compile:
    cd pwntcha
    ./bootstrap
    ./configure
    sudo make install
  4. Lastly, run the program:
    curl -O http://hactheplanet.com/blog/wp-content/uploads/2011/01/authimage.jpeg
    pwntcha authimage.jpeg


The image is analyzed to be 5Z28AF.


[Ruby] Automate Facebook

Using the Ruby Mechanize library, I have been writing a Ruby class to allow automation of parts of Facebook like friending, status updating, and messaging.

You can get the class here and use it like so:

#!/usr/bin/ruby

# Require FacebookBot.rb from the same directory.
require File.join(File.dirname(__FILE__), 'FacebookBot.rb')

# Log in.
fb = FacebookBot.new("example@example.com", "secret")

# Accept all friend requests.
fb.acceptRequests

# Friend a whole page of suggested friends.
fb.suggestedFriends.each { |friendId| fb.requestFriend(friendId) }

# Display all personal messages and recent wall posts.
require "pp"
pp fb.personalMessages
pp fb.recentPosts

[OS X] View MHTML Files

Using code from the UnMHT QuickLook project, I put together an app that renders MHTML files using WebKit on OS X. Features: print, export to PDF, and export to webarchive.


Google Book Downloader Update: Download JPEGs

Google Book Downloader is my app that downloads Google Books/Book Previews in PDF format. The major problem with it so far has been that when it encodes JPEGs from Google’s servers into PDF format, there is a loss in quality of the image. So far I have found no way to avoid this loss of quality when making a PDF from JPEGs.

Today I’m releasing a new version (1.2) of Google Book Downloader in which you can choose to save a folder of JPEGs instead of a PDF. The folder also has an index.html file that makes it convenient to read the JPEGs in the right order.

Download it here.
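
To give an idea of what such an index.html involves, here is a minimal Ruby sketch. It is not the code Google Book Downloader uses: the function names, the page title, and the assumption that the images are named with page numbers (page1.jpg, page2.jpg, ...) are all made up for illustration. The one non-obvious step is sorting the names numerically, so that page2.jpg comes before page10.jpg.

```ruby
require "cgi"

# Sort file names so that embedded numbers compare numerically:
# "page2.jpg" sorts before "page10.jpg". (Hypothetical naming scheme.)
def natural_sort(names)
  names.sort_by do |name|
    name.scan(/\d+|\D+/).map do |part|
      part =~ /\A\d+\z/ ? [0, part.to_i] : [1, part]
    end
  end
end

# Build the HTML for a page that shows every image in reading order.
def build_index_html(image_names, title = "Book")
  tags = natural_sort(image_names).map do |name|
    safe = CGI.escapeHTML(name)
    %(<p><img src="#{safe}" alt="#{safe}"></p>)
  end
  [
    "<!DOCTYPE html>",
    "<html>",
    "<head><meta charset=\"utf-8\"><title>#{CGI.escapeHTML(title)}</title></head>",
    "<body>",
    *tags,
    "</body>",
    "</html>",
  ].join("\n") + "\n"
end

# Write index.html next to the JPEGs in the given folder.
def write_index(folder)
  names = Dir.entries(folder).grep(/\.jpe?g\z/i)
  File.write(File.join(folder, "index.html"), build_index_html(names))
end
```

Calling write_index on a folder of page images would then produce an index.html that any browser can open. The sketch assumes simple file names that need no URL encoding.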


Replacing C Functions with Dynamic Linking in OS X

Say we wanted to make a command line program like date use a fake time instead of the current one. We could do this by supplying a time() function to replace the time() function in libSystem.

How do we know that date uses time()? We use nm, which lists all the symbols used by a particular program:

$ nm -m /bin/date | grep _time
(undefined [lazy bound]) external _time (from libSystem)

Once we know what function to replace, we can write a replacement function:

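// time.c

#include <time.h>

// This function will override the one in /usr/lib/libSystem.dylib.
time_t time(time_t *tloc)
{
    // January 1st, 2000, at midnight local time.
    // Zero-initialize so that fields we do not set are not garbage.
    struct tm timeStruct = {0};
    timeStruct.tm_year = 2000 - 1900;
    timeStruct.tm_mon = 0;
    timeStruct.tm_mday = 1;
    timeStruct.tm_hour = 0;
    timeStruct.tm_min = 0;
    timeStruct.tm_sec = 0;
    timeStruct.tm_isdst = -1;

    time_t fakeTime = mktime(&timeStruct);

    // Callers may pass NULL, as in time(NULL), so only store the result
    // when a location was supplied.
    if (tloc != NULL)
        *tloc = fakeTime;
    return fakeTime;
}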

Then we compile the code as a dynamic library:

gcc -c time.c
gcc -flat_namespace -dynamiclib -current_version 1.0 time.o -o libTime.dylib

To tell OS X’s dynamic linker to load our dynamic library, we need to set DYLD_INSERT_LIBRARIES to the path of the library. We also need to set DYLD_FORCE_FLAT_NAMESPACE, or our function will not override the old one. These settings and more can be found on the dyld man page.

The result:

$ date
Sun Oct 24 13:21:12 EST 2010
$ DYLD_FORCE_FLAT_NAMESPACE=1 DYLD_INSERT_LIBRARIES=./libTime.dylib date
Sat Jan  1 00:00:00 EST 2000
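
If you want to launch programs this way from a script rather than typing the variables each time, the same two settings can be passed as a per-process environment. The following Ruby sketch is only an illustration; the helper names dyld_insert_env and run_with_inserted_library are made up here, and the only things taken from above are the two variable names and libTime.dylib.

```ruby
# Build the environment that asks dyld to load our library first and to
# use a flat namespace, so our time() replaces the one in libSystem.
def dyld_insert_env(library_path)
  {
    "DYLD_FORCE_FLAT_NAMESPACE" => "1",
    "DYLD_INSERT_LIBRARIES" => File.expand_path(library_path),
  }
end

# Run a command with the library inserted. Only the child process sees
# these variables; the environment of the calling script is unchanged.
def run_with_inserted_library(library_path, *command)
  system(dyld_insert_env(library_path), *command)
end

# Example (on OS X, with libTime.dylib in the current directory):
#   run_with_inserted_library("./libTime.dylib", "date")
```

Passing the hash as the first argument to system keeps the override scoped to one child process, which avoids accidentally faking the time for everything else the script starts.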

Upcoming OS X App: Music Player/Downloader

Here is a screenshot of an app I plan on releasing soon. As you can see, it has an iTunes-like interface, with some extra features for downloading music. You can search YouTube for songs, and download them from YouTube directly into your music library.

Leave a comment or contact me if you would like to beta test.

Posted in Uncategorized | Tagged , , , | 12 Comments

[PHP] Get Google’s Cache of a URL

This PHP function fetches the contents of a URL as it exists in Google’s cache:

function cachedHTMLForURL($url)
{
	// Request the cache from Google.
	$googleRequestURL = "http://webcache.googleusercontent.com/search?q=" . urlencode("cache:" . $url);
	$googleResponse = file_get_contents($googleRequestURL);

	// Return false if the request failed or Google did not have the page.
	if ($googleResponse === false || preg_match("/^.*<title>cache:/", $googleResponse))
		return false;

	// Remove the first 3 lines of the response, which are inserted by Google.
	$importantHTML = preg_replace("/^(.*\n){3}/", "", $googleResponse);

	// Keep one inserted line: the <base> tag, which corrects the base path of the site.
	$base = "";
	if (preg_match("/<base href=\"[^\"]*\">/", $googleResponse, $matches))
		$base = $matches[0] . "\n";

	return $base . $importantHTML;
}

Use like so:

echo cachedHTMLForURL("http://news.google.com/");
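
For readers who prefer Ruby, the two steps that do not touch the network (building the lookup URL and cleaning up the response) can be sketched as below. This is a loose port, not a drop-in replacement: the names google_cache_url and clean_cached_html are made up, the "not cached" check matches the title anywhere in the response rather than only on the first line, and it relies on the same assumption as the PHP version that Google's inserted banner is exactly the first three lines.

```ruby
require "cgi"

# Build the cache lookup URL the same way the PHP function does.
def google_cache_url(url)
  "http://webcache.googleusercontent.com/search?q=" + CGI.escape("cache:" + url)
end

# Turn a raw cache response into the original page's HTML.
# Returns nil when the response looks like a "not in cache" page.
def clean_cached_html(response)
  return nil if response =~ /<title>cache:/

  # Keep the <base> tag so relative links still resolve.
  base = response[/<base href="[^"]*">/]

  # Drop the first three lines (assumed to be Google's banner).
  body = response.lines.drop(3).join

  base ? "#{base}\n#{body}" : body
end
```

Splitting the work this way makes the clean-up step easy to try on a saved response without making a request.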

[PHP] Retrieve iTunes Store HTML/XML

In iTunes 9, almost all of the pages in the iTunes Store are rendered with HTML. This PHP function will retrieve the raw iTunes Store XHTML for a URL:

function htmlForiTunesStoreURL($path)
// Download and return the HTML for an iTunes Store page at the given URL.
{
	$ch = curl_init();
	curl_setopt($ch, CURLOPT_URL, $path);

	// The following header is what causes the server to think we are iTunes.app.
	// This header in particular is for the U.S. Store.
	curl_setopt($ch, CURLOPT_HTTPHEADER, array('X-Apple-Store-Front: 143441-1,5'));

	curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
	$html = curl_exec($ch);

	return trim($html);
}

For example, this is the music store home page:

echo htmlForiTunesStoreURL(
"http://ax.itunes.apple.com/WebObjects/MZStore.woa/wa/viewGrouping?id=38"
);

Some special pages will not return XHTML, but property list XML. For example, the advanced search page:

echo htmlForiTunesStoreURL(
"http://ax.search.itunes.apple.com/WebObjects/MZSearch.woa/wa/advancedSearch"
);

In older versions of iTunes (4-8) all pages were rendered from property list XML. This modified version of the function returns all pages as XML:

function xmlForiTunesStoreURL($path)
// Download and return the property list XML for an iTunes Store page at the given URL.
{
	$ch = curl_init();
	curl_setopt($ch, CURLOPT_URL, $path);

	// The iTunes user agent without a special header causes the server to give us XML for a page.
	curl_setopt($ch, CURLOPT_USERAGENT, 'iTunes/9.0.2 (Macintosh; Intel Mac OS X 10.5.8) AppleWebKit/531.21.8');

	curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
	$html = curl_exec($ch);

	return trim($html);
}

However, I wouldn’t rely on the XML from the second function for anything important because, as far as I know, iTunes no longer uses it, so it may stop working in the future.
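
The only difference between the two functions is which request header they send, so the pair can be folded into a single helper. Here is a rough Ruby sketch of that idea using Net::HTTP from the standard library. The function names and the :html/:xml switch are made up for illustration; the header value and user agent string are copied from the PHP above, and nothing here guarantees that the store still answers these requests.

```ruby
require "net/http"
require "uri"

# Values copied from the PHP functions above.
STORE_FRONT_US = "143441-1,5"
ITUNES_USER_AGENT = "iTunes/9.0.2 (Macintosh; Intel Mac OS X 10.5.8) AppleWebKit/531.21.8"

# Build (but do not send) a GET request for an iTunes Store page.
#   :html - send the store-front header, as htmlForiTunesStoreURL does
#   :xml  - send the iTunes user agent, as xmlForiTunesStoreURL does
def itunes_store_request(url, format = :html)
  request = Net::HTTP::Get.new(URI(url))
  case format
  when :html then request["X-Apple-Store-Front"] = STORE_FRONT_US
  when :xml  then request["User-Agent"] = ITUNES_USER_AGENT
  else raise ArgumentError, "format must be :html or :xml"
  end
  request
end

# Send the request and return the trimmed response body.
def fetch_itunes_store_page(url, format = :html)
  uri = URI(url)
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    http.request(itunes_store_request(url, format))
  end
  response.body.to_s.strip
end
```

Keeping the request construction separate from the sending makes it possible to inspect the headers without making a network call.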
