luafriends

learning lua

31 October 2007

spumblr.lua

A program to back up my tumblelog.

Simply put, we will perform a sequence of HTTP GET requests against the Tumblr API, grabbing 50 posts at a time and saving them to the local computer.

0001  #!/usr/local/bin/lua
0002  
0003  function put(f, c)
0004    local fw = io.open(f, "w+")
0005    fw:write(c)
0006    fw:close()
0007  end
0008  
0009  TUMBLRURL = arg[1]
0010  if not TUMBLRURL then
0011    io.write(string.format("Usage: %s tumblr_url\n", arg[0]))
0012    os.exit()
0013  end
0014  
0015  SITETITLE = string.gsub(TUMBLRURL, "^http://", '')
0016  XMLOUTDIR = os.getenv("HOME") .. "/Desktop/" .. SITETITLE
0017  d = io.open(XMLOUTDIR)
0018  if not d then
0019    os.execute("mkdir " .. XMLOUTDIR)
0020  else
0021    d:close()
0022  end
0023  
0024  local start = 0
0025  local num = 50
0026  TUMBLRAPIURL = TUMBLRURL .. "/api/read"
0027  
0028  http = require("socket.http")
0029  local xmlout = http.request(TUMBLRAPIURL)
0030  totalposts = string.match(xmlout, "<posts start=\".-\" total=\"(.-)\">")
0031  
0032  while start < tonumber(totalposts) do
0033    TUMBLRAPIURL = TUMBLRURL .. string.format("/api/read/?start=%s&num=%s", start, num)
0034    local xmlout = http.request(TUMBLRAPIURL)
0035    local outfile = XMLOUTDIR .. "/posts_" .. (start + 1) .. "_" .. (start + num)
0036    put(outfile, xmlout)
0037    start = start + num
0038    --if start > 200 then break end
0039  end

spumblr.lua accepts a single argument, the URL of your tumblelog.

./spumblr.lua http://azscoop.tumblr.com

Depending on how extensive your tumblr post collection is, the script might take a while to run. When it has completed (assuming no errors) you'll have a directory on your desktop named after your tumblelog, with each file holding a clump of 50 posts.

[~/Desktop/azscoop.tumblr.com]$ ls -lhtr
total 2080
-rw-r--r--   1 gnt  gnt    168K Oct 31 19:08 posts_1_50
-rw-r--r--   1 gnt  gnt    149K Oct 31 19:08 posts_51_100
-rw-r--r--   1 gnt  gnt    156K Oct 31 19:08 posts_101_150
-rw-r--r--   1 gnt  gnt    169K Oct 31 19:08 posts_151_200
-rw-r--r--   1 gnt  gnt    149K Oct 31 19:08 posts_201_250
-rw-r--r--   1 gnt  gnt    228K Oct 31 19:08 posts_251_300

Line 0003 - 0007

A convenience function that lets us write a file in one line.
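A slightly more defensive variant might look like the sketch below. The name put_safe and its return convention are my own assumptions for illustration, not part of spumblr.lua: it keeps the file handle local and reports a failed open instead of crashing on a nil handle.

```lua
-- Hypothetical variant of put(): local handle, error reporting.
local function put_safe(path, content)
  local fw, err = io.open(path, "wb")
  if not fw then
    return nil, err          -- could not open the file for writing
  end
  fw:write(content)
  fw:close()
  return true
end

-- Usage: write to a temporary file, then read it back.
local tmp = os.tmpname()
local ok = put_safe(tmp, "hello")
local fr = io.open(tmp, "rb")
local readback = fr:read("*a")
fr:close()
os.remove(tmp)
```

The caller can then decide what to do when a write fails, for example stopping the backup rather than silently skipping a chunk.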

Line 0009 - 0013

Accept the URL argument, and exit with a message if it is not provided.

Line 0015 - 0022

Check whether an output directory has already been created for this tumblr web address. If not, create it. Note that this chunk is written for Mac OS X, though Linux users can run it easily: just change the /Desktop/ directory to whatever value you like. We also rely on an external OS command, run via os.execute(), so Windows users will have to substitute their own command syntax, if that is possible on Windows. Sorry, I cannot offer help for Windows machines, at least at this point.
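One thing to watch with os.execute() is that the directory name is pasted straight into a shell command, so a path containing spaces would break it. Here is a minimal sketch of quoting the path first, assuming a POSIX shell. The helpers shell_quote and mkdir_command are hypothetical names, not part of spumblr.lua.

```lua
-- Hypothetical helper: wrap a string in single quotes for a POSIX shell,
-- escaping any single quotes it contains.
local function shell_quote(s)
  local escaped = s:gsub("'", "'\\''")
  return "'" .. escaped .. "'"
end

-- Build (but do not run) the mkdir command for a directory.
local function mkdir_command(dir)
  return "mkdir -p " .. shell_quote(dir)
end

print(mkdir_command("/tmp/my blog"))
```

Passing the result to os.execute() would then create the directory even when the path contains spaces.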

Line 0024 - 0030

Using the LuaSocket library, as detailed in the last luafriends episode, we fetch an initial page so that we can determine the value of totalposts. A Lua pattern, <posts start=".-" total="(.-)">, is employed to pluck out the value.
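To see the pattern at work without touching the network, here is a small sketch run against a made-up XML fragment. The sample string and the total_posts helper are assumptions for illustration; only the pattern itself comes from the script.

```lua
-- Hypothetical helper: pull the total post count out of an API response.
-- Returns nil when the <posts ...> element is not found.
local function total_posts(xml)
  local total = string.match(xml, '<posts start=".-" total="(.-)">')
  return total and tonumber(total)
end

-- Made-up sample in the shape the pattern expects.
local sample = '<tumblr version="1.0"><posts start="0" total="273">'
            .. '<post id="1"/></posts></tumblr>'

print(total_posts(sample))
print(total_posts("<html>not the API</html>"))
```

Checking for nil before entering the loop gives a clearer failure than comparing a number against nil.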

Line 0032 - 0039

Loop until our start variable is no longer less than totalposts. On each iteration, plug the new start value into the URL, and use a filename that marks the range of posts stored in it. Increment start, then continue to loop.
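The paging arithmetic can be tried on its own, with no HTTP requests, as in this sketch. The function page_plan and the example URL are hypothetical; the URL format and file naming mirror the script.

```lua
-- Hypothetical helper: list the URL and output filename for each chunk.
local function page_plan(base_url, total, num)
  local pages = {}
  local start = 0
  while start < total do
    pages[#pages + 1] = {
      url = base_url .. string.format("/api/read/?start=%d&num=%d", start, num),
      file = "posts_" .. (start + 1) .. "_" .. (start + num),
    }
    start = start + num
  end
  return pages
end

local pages = page_plan("http://example.tumblr.com", 120, 50)
for _, p in ipairs(pages) do
  print(p.file, p.url)
end
```

With 120 posts and chunks of 50 this gives three pages. As in the script, the last filename names the full range (posts_101_150) even though the file holds fewer posts.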

Exercises

  1. Write a program to collect all the posts in each of the files to make one file bundle, with optional header/footer data for the whole site bundle. Use this file (or the entire directory chunk file collection) to load a database.
  2. spumblr.lua only grabs the returned web page (which is actually XML). We could easily use this output to convert our tumblelog into a different format, or save it in an internal database, so that we can use it for a client application on our computer. However, we would only have the image tags and not the images themselves (or audio or movies…). Modify spumblr.lua to also extract the images.
    • Create an image directory (if it doesn't already exist)
    • Use Lua's string.gmatch to iterate over the image (<img>) URLs. For each one you find, issue another HTTP request to retrieve the image file. I'm not sure whether http.request will work here, but there are a variety of alternative methods, including the *nix utility commands curl or wget.
  3. Write a script that indexes all the content words so that you can enter search words and get back a list of posts (and a web link!) that contain those search words.
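As a starting hint for exercise 2, here is a minimal sketch of the string.gmatch step only. It assumes plain, unescaped <img> tags with a double-quoted src attribute. The sample markup and the image_urls helper are made up, and the real API output may encode its HTML differently, in which case the pattern would need adjusting. Downloading the files is left to you.

```lua
-- Hypothetical helper: collect the src URL of every <img> tag in a string.
local function image_urls(markup)
  local urls = {}
  for url in string.gmatch(markup, '<img[^>]-src="([^"]+)"') do
    urls[#urls + 1] = url
  end
  return urls
end

-- Made-up sample markup.
local sample = '<p><img src="http://example.com/a.jpg" alt="a"/> text '
            .. '<img class="x" src="http://example.com/b.png"></p>'

for _, url in ipairs(image_urls(sample)) do
  print(url)
end
```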