Simply put, we will perform a sequence of HTTP GET requests against the Tumblr API, grabbing 50 posts at a time and saving them to our host computer.
0001 #!/usr/local/bin/lua
0002
0003 function put(f, c)
0004     local fw = io.open(f, "w+")
0005     fw:write(c)
0006     fw:close()
0007 end
0008
0009 TUMBLRURL = arg[1]
0010 if not TUMBLRURL then
0011     io.write(string.format("Usage: %s tumblr_url\n", arg[0]))
0012     os.exit()
0013 end
0014
0015 SITETITLE = string.gsub(TUMBLRURL, "^http://", '')
0016 XMLOUTDIR = os.getenv("HOME") .. "/Desktop/" .. SITETITLE
0017 d = io.open(XMLOUTDIR)
0018 if not d then
0019     os.execute("mkdir " .. XMLOUTDIR)
0020 else
0021     d:close()
0022 end
0023
0024 local start = 0
0025 local num = 50
0026 TUMBLRAPIURL = TUMBLRURL .. "/api/read"
0027
0028 http = require("socket.http")
0029 local xmlout = http.request(TUMBLRAPIURL)
0030 totalposts = string.match(xmlout, "<posts start=\".-\" total=\"(.-)\">")
0031
0032 while start < tonumber(totalposts) do
0033     TUMBLRAPIURL = TUMBLRURL .. string.format("/api/read/?start=%d&num=%d", start, num)
0034     local xmlout = http.request(TUMBLRAPIURL)
0035     local outfile = XMLOUTDIR .. "/posts_" .. (start + 1) .. "_" .. (start + num)
0036     put(outfile, xmlout)
0037     start = start + num
0038     --if start > 200 then break end
0039 end
spumblr.lua accepts a single argument, the URL address of your tumblelog.
./spumblr.lua http://azscoop.tumblr.com
Depending upon how extensive your Tumblr post collection is, it might take a while for your script to execute. When it has completed (assuming no errors), you'll have a directory on your desktop named after your tumblelog, with each file holding a clump of 50 posts.
[~/Desktop/azscoop.tumblr.com]$ ls -lhtr
total 2080
-rw-r--r--  1 gnt  gnt  168K Oct 31 19:08 posts_1_50
-rw-r--r--  1 gnt  gnt  149K Oct 31 19:08 posts_51_100
-rw-r--r--  1 gnt  gnt  156K Oct 31 19:08 posts_101_150
-rw-r--r--  1 gnt  gnt  169K Oct 31 19:08 posts_151_200
-rw-r--r--  1 gnt  gnt  149K Oct 31 19:08 posts_201_250
-rw-r--r--  1 gnt  gnt  228K Oct 31 19:08 posts_251_300
Lines 0003 - 0007
A convenience function that enables one-line file-write operations.
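For example, this call writes a small file in one line (the path here is purely illustrative):

put("/tmp/hello.txt", "hello from spumblr\n")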
Lines 0009 - 0013
Accept the URL argument, and exit with a usage message if it is not provided.
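Run without an argument, the script prints its usage string and exits:

./spumblr.lua
Usage: ./spumblr.lua tumblr_url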
Lines 0015 - 0022
Check whether an output directory has already been created for this Tumblr web address; if not, create it. Note that this code chunk is Mac OS X specific, though Linux users can easily run it: just change the /Desktop/ directory to whatever value is desired. We are also relying on an external OS command, executed via os.execute(), so Windows users will have to substitute their own command syntax, if that is possible on Windows. Sorry, I cannot offer help for Windows machines, at least at this point.
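As an aside, a more portable sketch is possible if you have the LuaFileSystem module installed (an assumption on my part; spumblr.lua does not require it):

local lfs = require("lfs")                    -- LuaFileSystem, a separate install
if not lfs.attributes(XMLOUTDIR, "mode") then -- nil when the path does not exist
    lfs.mkdir(XMLOUTDIR)                      -- no external mkdir command needed
end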
Lines 0024 - 0030
Using the LuaSocket library, as detailed in the last luafriends episode, we fetch an initial page load so we can determine the value of totalposts. The Lua pattern on line 0030, <posts start=".-" total="(.-)">, captures the total attribute from the opening <posts> tag of the XML response.
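To see the capture in isolation, here is a minimal sketch using a made-up response fragment:

local sample = '<posts start="0" total="312">'  -- hypothetical XML fragment
local total = string.match(sample, '<posts start=".-" total="(.-)">')
print(total)  --> 312 (a string, hence the tonumber() on line 0032)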
Lines 0032 - 0040
Loop until our start variable is no longer less than totalposts. Each iteration, plug the new start value into the URL address, and use a filename that demarcates the range of posts it stores. Increment start, then continue the loop.
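As a sketch of what the loop produces, assuming a hypothetical total of 312 posts, the requests and output files line up like this:

-- trace of the pagination, with totalposts assumed to be 312
local start, num, totalposts = 0, 50, 312
while start < totalposts do
    print(string.format("?start=%d&num=%d  ->  posts_%d_%d",
                        start, num, start + 1, start + num))
    start = start + num
end
-- the last request asks for posts 301-350; the API simply returns the 12 that remain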
Exercises
- Write a program to collect all the posts from each of the files into one file bundle, with optional header/footer data for the whole site bundle. Use this file (or the entire collection of chunk files in the directory) to load a database.
- spumblr.lua only grabs the returned web page (which is actually XML). We could easily use this output to convert our tumblelog into a different format, or save it in a local database, so that we can use it in a client application on our computer. However, we would only have the image tags and not the images themselves (or audio or movies…). Modify spumblr.lua to also retrieve the images:
- Create an image directory (if it doesn't already exist)
- Use Lua's string.gmatch to iterate over the image (<img>) URL addresses. For each one you find, issue another HTTP request to retrieve the image file. I am not sure whether http.request will work here, but there are a variety of alternative methods, down to the *nix utilities curl or wget; see the sketch after this list.
- Write a script that indexes all the content words, so that you can enter search words and get back a list of posts (and a web link!) that contain them.
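A minimal sketch for the gmatch step above. The src-first attribute pattern, the images/ output naming, and the surrounding variables are my assumptions for illustration, not part of spumblr.lua; and whether http.request handles every image URL cleanly is worth testing, with curl or wget as fallbacks:

local http = require("socket.http")

-- xmlout holds one chunk of API output; the images/ subdirectory is assumed
-- to exist already (see the first step above)
local n = 0
for src in string.gmatch(xmlout, '<img src="(.-)"') do
    local body = http.request(src)  -- fetch the image itself
    if body then
        n = n + 1
        put(XMLOUTDIR .. "/images/img_" .. n, body)  -- reuses put() from spumblr.lua
    end
end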