learning lua

31 October 2007

spumblr.lua

A program to backup my tumblr log.

Simply, we will perform a sequence of HTTP GET requests using the Tumblr API specifications, grabbing 50 posts at a time and saving them to our host computer.

0001  #!/usr/local/bin/lua
0002  
0003  function put(f, c)
0004    local fw = io.open(f, "w+")
0005    fw:write(c)
0006    fw:close()
0007  end
0008  
0009  TUMBLRURL = arg[1] or nil
0010  if not TUMBLRURL then
0011    io.write(string.format("Usage: %s tumblr_url\n", arg[0]))
0012    os.exit()
0013  end
0014  
0015  SITETITLE = string.gsub(TUMBLRURL, "^http://", '')
0016  XMLOUTDIR = os.getenv("HOME") .. "/Desktop/" .. SITETITLE
0017  d = io.open(XMLOUTDIR)
0018  if not d then
0019    os.execute("mkdir " .. XMLOUTDIR)
0020  else
0021    d:close()
0022  end
0023  
0024  local start = 0
0025  local num = 50
0026  TUMBLRAPIURL = TUMBLRURL .. "/api/read"
0027  
0028  http = require("socket.http")
0029  local xmlout = http.request(TUMBLRAPIURL)
0030  totalposts = string.match(xmlout, "<posts start=\".-\" total=\"(.-)\">")
0031  
0032  while start < tonumber(totalposts) do
0033    TUMBLRAPIURL = TUMBLRURL .. string.format("/api/read/?start=%s&num=%s", start, num)
0034    local xmlout = http.request(TUMBLRAPIURL)
0035    local outfile = XMLOUTDIR .. "/posts_" .. (start + 1) .. "_" .. (start + num)
0036    put(outfile, xmlout)
0037    start = start + num
0038    --if start > 200 then break end
0039  end

spumblr.lua accepts a single argument, the URL address of your tumblelog.

./spumblr.lua http://azscoop.tumblr.com

Depending upon how extensive your tumblr post collection is, it might take a while for your script to execute. When it has completed (assuming no error) you'll have a directory on your desktop named after your tumblelog, with each file holding a clump of 50 posts.

[~/Desktop/azscoop.tumblr.com]$ ls -lhtr
total 2080
-rw-r--r--   1 gnt  gnt    168K Oct 31 19:08 posts_1_50
-rw-r--r--   1 gnt  gnt    149K Oct 31 19:08 posts_51_100
-rw-r--r--   1 gnt  gnt    156K Oct 31 19:08 posts_101_150
-rw-r--r--   1 gnt  gnt    169K Oct 31 19:08 posts_151_200
-rw-r--r--   1 gnt  gnt    149K Oct 31 19:08 posts_201_250
-rw-r--r--   1 gnt  gnt    228K Oct 31 19:08 posts_251_300

Line 0003 - 0007

Convenient file function to enable one line file write operations.
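A slightly more defensive take on the same idea, as a sketch only. The name put_checked and its return convention (true on success, nil plus a message on failure) are my own inventions for illustration, not part of spumblr.lua.

```lua
-- Hypothetical variant of put(): same job, but the file handle is local
-- and a failed open is reported to the caller instead of raising an error.
function put_checked(path, contents)
  local fh, err = io.open(path, "w")
  if not fh then
    return nil, err
  end
  fh:write(contents)
  fh:close()
  return true
end

-- Usage: write to a temporary file, then read it back.
local tmp = os.tmpname()
assert(put_checked(tmp, "hello tumblr\n"))
local fh = assert(io.open(tmp, "r"))
io.write(fh:read("*a"))
fh:close()
os.remove(tmp)
```

Returning the error rather than crashing lets the caller decide whether one unwritable file should abort the whole backup run.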

Line 0009 - 0013

Accept the URL argument, and exit with a message if it is not provided.
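Two small observations on this chunk: the "or nil" is redundant, since a missing argument is already nil, and os.exit() with no argument reports success to the shell, so a usage error might better exit with os.exit(1). Here is a minimal sketch of the same check as a function. The name parse_url_arg is hypothetical, and it takes a table shaped like the global arg so it can be tried without a command line.

```lua
-- Hypothetical helper: pull the URL out of an argument table shaped like
-- Lua's global `arg` (script name at index 0, first argument at index 1).
function parse_url_arg(argv)
  local url = argv[1]
  if not url then
    return nil, string.format("Usage: %s tumblr_url", argv[0] or "spumblr.lua")
  end
  return url
end

print(parse_url_arg({ [0] = "spumblr.lua", "http://azscoop.tumblr.com" }))
print(parse_url_arg({ [0] = "spumblr.lua" }))
```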

Line 0015 - 0022

Check to see if there is an output directory already created for this tumblr web address. If not, create the directory. Note that this code chunk is Mac OS X specific, though Linux users can easily run it: just change the /Desktop/ directory to whatever value is desired. We are also relying on an external OS command, executed via os.execute(), so Windows users will have to substitute their own command syntax, if that is possible on Windows. Sorry, I cannot offer help for Windows machines, at least at this point.
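One wrinkle: the mkdir command is built by plain concatenation, so a path containing a space or a quote would confuse the shell. A hedged sketch of quoting the path first follows. shell_quote and mkdir_command are made-up helper names, and the whole thing assumes a POSIX shell where mkdir -p is available.

```lua
-- Hypothetical helper: wrap a path in single quotes so a POSIX shell treats
-- it as one word, even if it contains spaces. Embedded single quotes are
-- closed, escaped, and reopened.
function shell_quote(s)
  return "'" .. string.gsub(s, "'", "'\\''") .. "'"
end

-- Hypothetical helper: build the command without running it.
-- mkdir -p does not complain if the directory already exists.
function mkdir_command(dir)
  return "mkdir -p " .. shell_quote(dir)
end

print(mkdir_command("/tmp/my tumblelog"))
-- To actually create the directory: os.execute(mkdir_command(dir))
```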

Line 0024 - 0030

Using the Lua socket library, as detailed in the last luafriends episode, we fetch an initial page load so we can determine the value of totalposts. A Lua pattern, <posts start=".-" total="(.-)">, is employed to pluck out the value.
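To see the pattern at work without touching the network, here is a small sketch run against a made-up fragment of an API response. The sample string and the total_posts function name are assumptions for illustration only.

```lua
-- Made-up sample of the opening of an /api/read response; only the
-- <posts ...> element matters here.
local sample = '<tumblr version="1.0"><posts start="0" total="285">'

-- Hypothetical helper: return the total as a number, or nil if the
-- <posts> element is not found.
function total_posts(xml)
  local total = string.match(xml, '<posts start=".-" total="(.-)">')
  return tonumber(total)
end

print(total_posts(sample))           -- 285
print(total_posts("<html></html>"))  -- nil
```

Getting nil back for an unexpected page is a handy cue to stop before the while loop tries to compare a number against nothing.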

Line 0032 - 0039

Loop until our start variable is no longer less than totalposts. Each iteration, plug in the new start value to the URL address. And let's use a filename that demarcates our post range store. Increment start, then continue to loop.
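The paging arithmetic can be tried on its own, with no HTTP requests at all. In this sketch page_plan is a hypothetical function name and the total of 285 posts is just an illustrative figure.

```lua
-- Hypothetical helper: list the URL and output file name for each batch,
-- using the same start/num stepping as the loop in spumblr.lua.
function page_plan(base_url, total, num)
  local pages = {}
  local start = 0
  while start < total do
    pages[#pages + 1] = {
      url  = string.format("%s/api/read/?start=%d&num=%d", base_url, start, num),
      file = string.format("posts_%d_%d", start + 1, start + num),
    }
    start = start + num
  end
  return pages
end

for _, page in ipairs(page_plan("http://azscoop.tumblr.com", 285, 50)) do
  print(page.file, page.url)
end
```

Note that the last file name overstates its range (posts_251_300 for a 285 post log), just as the original script does.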

Exercises

  1. Write a program to collect all the posts in each of the files to make one file bundle, with optional header/footer data for the whole site bundle. Use this file (or the entire directory chunk file collection) to load a database.
  2. spumblr.lua only grabs the returned web page (which is actually XML). We could easily use this output to convert our tumblelog into a different format, or save it in an internal database, so that we can use it for a client application on our computer. However, we would only have the image tags and not the images themselves (or audio or movies…). Modify spumblr.lua to also extract the images.
    • Create an image directory (if it doesn't already exist)
    • Use a Lua string.gmatch to iterate over picture image (<img> ) URL addresses. For each one you find, issue another HTTP request to retrieve the image file. I'm not sure whether http.request will work here, but there are a variety of alternative methods, including shelling out to *nix utility commands such as curl or wget.
  3. Write a script that indexes all the content words so that you enter search words and return a list of posts (and a web link!) that contain those search words.

30 October 2007

calcwebwordfreq.lua

Many have seen those word frequency distributions for presidential speeches. Or even presidential debate transcripts represented as a tag cloud with the more frequent words displayed in larger, bolder text.

Let's take a gander at putting Lua to the task.

Our first attempt goes something like this:

0001  #!/usr/local/bin/lua
0002  
0003  -- calcwebwordfreq.lua
0004  -- construct chart of word frequency, given a website URL
0005  
0006  if not arg[1] then
0007      print("Usage:", arg[0], "url", "[minimum_word_length]", "[max_display_lines]")
0008      os.exit()
0009  end
0010  
0011  FETCHURL = arg[1]
0012  MIN_WORD_LEN = tonumber(arg[2]) or 6
0013  MAX_DISPLAY_LINES = tonumber(arg[3]) or 40
0014  
0015  http = require("socket.http")
0016  
0017  -- return keys of an associative array, assumed that keys are of "string"
0018  -- or "numeric" type
0019  function keys (t)
0020      local listofkeys = {}
0021      for k, _ in pairs(t) do
0022          listofkeys[#listofkeys + 1] = k
0023      end
0024      return listofkeys
0025  end
0026  
0027  -- crude HTML tag stripping, needs refinement to eliminate javascript code,
0028  -- inline CSS, HTML entities, etc....
0029  function strip_tags (h) 
0030      local newstr = string.gsub(h, "<[^>]+>", ' ')
0031      return newstr
0032  end
0033  
0034  html = http.request(FETCHURL)
0035  if not html then
0036      print("Error encountered in retrieving " .. FETCHURL)
0037      os.exit()
0038  end
0039  
0040  wordchart = {}
0041  tex = strip_tags(html)
0042  for w in string.gmatch(tex, "%a+") do  -- gmatch supersedes the deprecated string.gfind
0043      if #w >= MIN_WORD_LEN then
0044          local lcw = string.lower(w)
0045          wordchart[lcw] = (wordchart[lcw] or 0) + 1
0046      end
0047  end
0048  
0049  wordlist = keys(wordchart)
0050  table.sort(wordlist, function(a, b) return wordchart[a] > wordchart[b] end)
0051  for i, w in ipairs(wordlist) do
0052      print(string.format("%-20s %10d", w, wordchart[w]))
0053      if (i >= MAX_DISPLAY_LINES) then break end
0054  end
0055  
0056  -- end of program

To put this puppy in play, we'll point it at a Project Gutenberg online copy of the William James classic, Pragmatism: A New Name for Some Old Ways of Thinking, originally published in 1907.

$ ./calcwebwordfreq.lua http://www.gutenberg.org/dirs/etext04/prgmt10.txt 5 25
which                       294
world                       239
their                       187
truth                       177
other                       159
these                       143
there                       134
things                      132
would                       125
universe                    107
being                       105
experience                  103
philosophy                  102
reality                     102
pragmatism                   98
facts                        94
absolute                     90
sense                        90
means                        81
human                        78
about                        74
whole                        69
should                       68
notion                       67
itself                       65

Yes, a real "production" script would omit common placeholder words like "these", "there", "their", etc. And our HTML tag stripper function is a crude construct that allows far too much uninteresting cruft to appear in the distribution. I'll leave those as follow-up exercises for the industrious programmer. But let's break down, chunk by chunk, our initial foray.

Parse script arguments (line 0006-0013)

Global variable arg contains any script arguments that are provided on the command line. It is a table array, and the 1st argument is arg[1]. arg[0] contains the name of the script being executed. Incidentally, there may be negative indices present also: these hold the interpreter name and any options that precede the script name on the command line (i.e., if we invoked the script in the vein of lua scriptname.lua arg1 arg2 arg3, then arg[-1] would be lua).

Argument number one is the web page URL we wish to decompose and is a required parameter — if not present, a "usage" line is printed and the script terminates. For the other two options, default settings are provided if the command line does not grant them. Note that the or expression "short circuits" just like other modern scripting languages.
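A quick sketch of the default-value idiom in isolation. The option function is a hypothetical wrapper, not something in the script.

```lua
-- Hypothetical wrapper around the `tonumber(x) or default` idiom.
-- tonumber returns nil for nil or non-numeric input, so `or` falls
-- through to the default.
function option(value, default)
  return tonumber(value) or default
end

print(option("5", 6))    -- 5
print(option(nil, 6))    -- 6
print(option("abc", 6))  -- 6

-- `or` only skips nil and false; 0 is a perfectly good value.
print(0 or 6)            -- 0
```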

Require the Lua Socket library (line 0015)

Oh, we'll need to install the Lua socket library. You'll need to refer to specific instructions for your OS platform of choice. The library is available here, and install is a snap, at least it was for yours truly, taking what seemed like seconds to download and install. I simply downloaded, moved to /usr/local, un-tar, and then make install. The socket.http assignment confines our access to the http submodule, and specifically we are interested in the http.request function. The Lua socket library will be revisited again here, as its url.escape and url.unescape will be most useful for CGI work. A reference to all the Lua socket modules is available here.

Line 0019-0025

The keys function will enable a shorthand method of returning the keys of an associative array, our wordchart that will be indexed by word value. Other languages have this built in, but in my brief Lua encounter thus far, I have not discovered a built-in manner of accomplishing this feat. We'll need it for sorting the generated word list in descending frequency sequence.
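Here is keys on its own with a tiny table, as a sketch. The three words and counts are borrowed from the sample run; the sort is only there because pairs hands the keys back in no particular order.

```lua
-- Same keys function as in the listing, repeated so this runs standalone.
function keys(t)
  local listofkeys = {}
  for k, _ in pairs(t) do
    listofkeys[#listofkeys + 1] = k
  end
  return listofkeys
end

local ks = keys({ truth = 177, world = 239, facts = 94 })
table.sort(ks)                  -- alphabetical, for a predictable printout
print(table.concat(ks, ", "))   -- facts, truth, world
```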

Line 0029-0032

As alluded to above, strip_tags is a very crude stab at stripping HTML tags. That it does, but it's woefully deficient as it still leaves the stuff between <script> and </script> tags intact, doesn't remove inline CSS styling (i.e., <style> ... </style> tags), and also neglects HTML entities (stuff like &nbsp;).
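For the curious, a rough sketch of one way to plug those three holes with nothing but Lua patterns. strip_html and the ci helper are hypothetical names, and this is still far from a real HTML parser: it assumes script and style elements are properly closed, and it blanks out entities rather than decoding them.

```lua
-- Build a case-insensitive pattern for a tag name: "script" -> "[sS][cC]..."
local function ci(word)
  return (string.gsub(word, "%a", function(c)
    return "[" .. string.lower(c) .. string.upper(c) .. "]"
  end))
end

-- Hypothetical replacement for strip_tags.
function strip_html(h)
  for _, tag in ipairs({ "script", "style" }) do
    -- drop the element together with everything between its tags
    h = string.gsub(h, "<" .. ci(tag) .. ".->.-</" .. ci(tag) .. ">", " ")
  end
  h = string.gsub(h, "<[^>]+>", " ")  -- any remaining tags
  h = string.gsub(h, "&#?%w+;", " ")  -- entities such as &nbsp; or &#160;
  return h
end

print(strip_html('<p>Hello&nbsp;<b>world</b></p><SCRIPT>var x = 1;</SCRIPT>'))
```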

Go fetch our web page and store the retrieved page in a variable (line 0034-0038)

http.request is the magic utterance. If there is a problem, Houston, we will abort the mission.

Line 0040-0047

Build our word frequency chart. Basically, for w in string.gmatch(tex, "%a+") do… iterates over successive matches of a pattern, in this case a string of alphabetic characters (string.gfind is the older Lua 5.0 name for the same function, deprecated in Lua 5.1). Lua, being lightweight, doesn't employ the full-featured regular expression bestiary present in other scripting platforms such as Ruby, Perl, PHP, Python, etc. But we can specify patterns, make captures and do most of the stringy stuff needed to be done. Here, %a+ denotes a string of alphabet letters of 1 or more characters in length. The Lua pattern lingo is similar to regular expression syntax, except that special characters are prefixed with a % (not a \ as with those other languages). Also, while Lua shares the metacharacter notion of ^ (start of string), $ (end of string), + (1 or more), * (0 or more), etc., it uses a - instead of *? to signal a "non-greedy" match of 0 or more (a ? on its own still means "optional"). I've blathered enough on Lua patterns; for the definitive word, consult the excellent Lua Reference Manual section on Patterns.
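A few one-liners to poke at in the interpreter, as a sketch of the pattern features just mentioned. The sample strings are made up.

```lua
-- Character classes: %a is a letter, %d is a digit.
print(string.match("pragmatism, 1907", "%a+"))  -- pragmatism
print(string.match("pragmatism, 1907", "%d+"))  -- 1907

-- Greedy * grabs as much as possible; lazy - grabs as little as possible.
print(string.match("<b>bold</b>", "<(.*)>"))    -- b>bold</b
print(string.match("<b>bold</b>", "<(.-)>"))    -- b

-- % escapes a magic character, here the end-of-string anchor $.
print(string.match("cost: $5", "%$(%d+)"))      -- 5
```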

Back to our program — we are capturing sequences of alphabetic letters, storing them in lcw after converting to lowercase. And to employ some terminology that I culled from Lua documentation, we are using a bag to store the word frequencies. Remember, by default, uninitialized table index access is met with nil so wordchart[lcw] = (wordchart[lcw] or 0) + 1 will always result in a successful increment.
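The bag idiom boiled down to a function, as a minimal sketch. count_words is a hypothetical name and the sentence is made up.

```lua
-- Hypothetical helper: count lowercase words of at least min_len letters.
function count_words(text, min_len)
  local bag = {}
  for w in string.gmatch(text, "%a+") do
    if #w >= min_len then
      local lcw = string.lower(w)
      -- a missing entry reads as nil, so `or 0` seeds the first count
      bag[lcw] = (bag[lcw] or 0) + 1
    end
  end
  return bag
end

local bag = count_words("The truth is that the Truth works", 3)
print(bag.the, bag.truth, bag.is)  -- 2  2  nil
```

The two-letter "is" never makes it into the bag, so looking it up simply yields nil.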

Show me the words… (line 0049-0054)

So we've churned through the web page and lifted out all the words. Now we need to sort the words by frequency total. Unlike in PHP, associative array order in Lua is arbitrary, so first it will be necessary to cull the keys out and sort them. Culling the keys is the work of the keys function.

Sorting in Lua is simple. All that is to be provided is a comparison function, and we use an "anonymous" function that will sort wordlist in descending frequency order.
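One thing the comparison function in the listing leaves open is what happens on ties: words with equal counts come out in whatever order the sort happens to produce. A sketch with an alphabetical tie-breaker follows. sort_by_frequency is a hypothetical name. Note the comparison must stay strict (> rather than >=), or table.sort can complain about an invalid order function.

```lua
-- Hypothetical helper: return the words of a frequency table, most
-- frequent first, with ties broken alphabetically.
function sort_by_frequency(freq)
  local words = {}
  for w in pairs(freq) do
    words[#words + 1] = w
  end
  table.sort(words, function(a, b)
    if freq[a] ~= freq[b] then
      return freq[a] > freq[b]
    end
    return a < b
  end)
  return words
end

local chart = { truth = 177, world = 239, reality = 102, philosophy = 102 }
print(table.concat(sort_by_frequency(chart), " "))
-- world truth philosophy reality
```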

Actually, later, we will write an iterator function that makes the wordlist table creation unnecessary. Well, not really, as it will be part of a generic iterator and we'll just provide the comparison function and iterate through each key returned in proper sequence. For now, we're just utilizing the built in Lua generators, ipairs and pairs. ipairs for numeric indexed arrays, pairs to traverse key/value table pairs.
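Until that installment arrives, here is one guess at what such a generic iterator might look like. sorted_pairs is my own hypothetical name and shape, not necessarily what a later post will settle on. It still builds the key list, but hides it inside the iterator.

```lua
-- Hypothetical generic iterator: walk a table's key/value pairs in the
-- order given by cmp (a comparison on keys; default is ascending <).
function sorted_pairs(t, cmp)
  local ks = {}
  for k in pairs(t) do
    ks[#ks + 1] = k
  end
  table.sort(ks, cmp)
  local i = 0
  return function()
    i = i + 1
    local k = ks[i]
    if k ~= nil then
      return k, t[k]
    end
  end
end

local chart = { truth = 177, world = 239, facts = 94 }
for w, n in sorted_pairs(chart, function(a, b) return chart[a] > chart[b] end) do
  print(string.format("%-20s %10d", w, n))
end
```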

Finally, we break out of the loop once our printed line total is equal to or greater than the provided parameter, or the default value set by the script.

23 October 2007

dealhand.lua

For our second program, let us create a deck of playing cards, shuffle the deck, and then deal ourselves a bridge hand (13 cards out of the pack of 52). We'll expand upon our card dealing program in coming installments, but for now it'll be used as a centerpiece for a beginning discussion on Lua types and values.

#!/usr/local/bin/lua

-- constant declarations

HANDLEN = 13
FACES = { 'A', 'K', 'Q', 'J', '0', '9', '8', '7', '6', '5', '4', '3', '2' }
SUITS = { 'S', 'H', 'D', 'C' }

-- function definitions

function shuffle (list)
  local n = #list
  while n > 1 do
    local k = math.random(n)
    if k ~= n then
      list[n], list[k] = list[k], list[n]
    end
    n = n - 1
  end
end

function slice (list, b, l)
  local sl = {}
  local l = l or (#list - b + 1)
  local e = b + l - 1
  for i = b, e do
    sl[#sl+1] = list[i]
  end
  return sl
end

-- main

math.randomseed (os.time())

deckofcards = {}
for _, f in ipairs(FACES) do
  for _, s in ipairs(SUITS) do
    deckofcards[#deckofcards+1] = { ["f"] = f, ["s"] = s }
  end
end

shuffle (deckofcards)

hand = slice(deckofcards, 1, HANDLEN)
for _, card in ipairs(hand) do
  io.write(card.f .. card.s .. ' ')
end

print('')

A sample run:

$ ./dealhand.lua 
0S 2S 7S QC 9C 6S QD 8S AS 3D JH 4D JC

In subsequent posts, we'll enhance our first stab at a bridge playing bot, but let's dive into Lua types and values. The definitive textbook on the matter, Programming in Lua, offers this breakdown:

nil
The non-value, indicates a variable that has not been initialized yet, or one that has been deleted.
booleans
true or false. Truth in Lua is unlike that of other modern programming languages: only nil and false evaluate as false. Take note, fellow coders, as the empty string ('') and the number 0 evaluate as true in Lua.
numbers
Lua makes no distinction between integers and floating point: in Lua 5.1, the version used here, every number is a double-precision float by default.
strings
Strings in Lua work similarly to those in the Java programming language: they are immutable, and to modify a string means you create a new string.
tables
In Unix parlance, everything is a file. Likewise, in Lua, everything is a table. At first glance, the notion of table is confusing; in essence, a table is akin to how arrays work in JavaScript and PHP. Tables can represent simple integer indexed arrays or associative arrays with string (or other Lua object type) keys. And tables can be nested within other tables. A new table is thrust into existence with the {} constructor.
functions
Functions are "first class objects" in Lua. That's a fancy way of saying that they can be stored in normal variables, be passed to functions, returned as function results, etc. In other words, Lua offers support for functional programming.
threads
Will save study and discussion of threads for eventual, advanced topic set.
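A few lines to try at the prompt, sketching the types and the truth rules listed above. The truthy function is a throwaway helper of my own.

```lua
-- type() reports the type of any value as a string.
print(type(nil), type(true), type(42), type("lua"), type({}), type(print))
-- nil  boolean  number  string  table  function
print(type(coroutine.create(function() end)))  -- thread

-- Throwaway helper: does a value count as true in a condition?
function truthy(v)
  if v then return true else return false end
end

print(truthy(0), truthy(""), truthy(nil), truthy(false))
-- true  true  false  false

-- Strings are immutable: upper() hands back a new string.
local s = "lua"
local u = string.upper(s)
print(s, u)  -- lua  LUA
```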

OK, those are the values that are bopped about. Before we can dive deep into script comprehension efforts, some basic Lua idiom grounding is necessary.

Lua Idioms

{} is the table constructor
Anything from an empty table to complex structures with vast nesting is possible. Tables are dynamically defined, and capable of storing any Lua type. They can also mix integer arrays and associative arrays. Individual elements are accessed with [] notation. Example: mytable[3] or groovy_table[x] or anothertable["rating"] (the latter could also be referenced via anothertable.rating)
# is the length operator
# can be applied to strings and to array-style tables (those indexed 1 to n without gaps); it does not count the keys of an associative array.
.. is the string concatenation operator
Used to join strings together.
-- denotes a comment
All following text until end of line is ignored by interpreter.
list[#list + 1] = newitem
Adding entries to an array. Could also be performed with this instruction — table.insert(list, newitem)
local declares a "local" variable, with scope confined to that code block
By default, all variables are global.
~= means "not equal"
Don't get it confused with the =~ regular expression match operator in Perl and Ruby. All of the other logical operators are fairly obvious — == for "equals", <=, >=, >, < should be self explanatory.
list assignment
Functions can return multiple values, and several assignments can be made in a single statement, such as a, b = b, a (which swaps the two values).
_ often serves as a "dummy" variable
Valid identifiers are strings of alphanumeric (letters and digits) characters and underscores that cannot begin with a digit. Avoid names that start with an underscore followed by uppercase characters (such as _VERSION), as those are reserved for Lua internal globals.
table indexing is one-based
Unlike just about any other modern programming language. A throwback to the days of COBOL and FORTRAN. Which is why list[#list + 1] can be an idiom! Also might lend to a slightly different algorithmic attack approach. Consider the slice function in the program listing. Function code would be much more concise if it had an argument signature of function slice(list, b, e) where e denoted the inclusive ending index. Then, the "1" juggling would be unneeded.
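To make the last point concrete, here is a sketch of the slice variant with an inclusive ending index, plus a few of the idioms above in action. slice_range is a hypothetical name, chosen so it doesn't clash with the slice in the listing.

```lua
-- Hypothetical variant of slice(): b and e are both inclusive indexes,
-- and e defaults to the end of the list.
function slice_range(list, b, e)
  local sl = {}
  for i = b, e or #list do
    sl[#sl + 1] = list[i]
  end
  return sl
end

local t = { "a", "b", "c", "d", "e" }
print(#t, #"hello")                             -- 5  5
print(table.concat(slice_range(t, 2, 4), " "))  -- b c d
print(t[1])                                     -- a  (one-based!)

-- list assignment: swap without a temporary
local x, y = 1, 2
x, y = y, x
print(x, y)                                     -- 2  1
```

With this signature, the bridge hand in dealhand.lua would be slice_range(deckofcards, 1, HANDLEN).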

22 October 2007

genesis.lua

The journey to programmer proficiency starts with a single script.

#!/usr/local/bin/lua

print("first post!")

The greater part of this task is to discover how a Lua script can be executed on a computing machine. That is, you have to get Lua and install it on your system. Or seek advisement from your friendly local system administrator. On a Mac OS X box, there are pre-built binary packages available, or you can use the source and do the untar, make and make install dance. I have not toiled on a Windows box for some time now, so you'll have to search the internets for Windows XP/Vista install instructions.

Lua is lightweight, so installing and getting set up is a snap, especially when compared to other development platform environments.

My Lua development environment is fairly spartan — it consists of Vim and Mac OS X Terminal.

$ ./genesis.lua 
first post!

In genesis.lua, the "shebang" line magic is employed to invoke the script directly. Lua scripts can also be executed by issuing the lua command and specifying the script file to be executed as a command argument.

$ lua -?
usage: lua [options] [script [args]].
Available options are:
  -e stat  execute string 'stat'
  -l name  require library 'name'
  -i       enter interactive mode after executing 'script'
  -v       show version information
  --       stop handling options
  -        execute stdin and stop handling options

Or you can try some interactive Lua:

$ lua
Lua 5.1.2  Copyright (C) 1994-2007 Lua.org, PUC-Rio
> print(math.pi)
3.1415926535898
> print(os.tmpname())
/tmp/lua_cSugQn
> =1600*1200
1920000

Strike control+D to exit the interpreter.

Yes, it's a silly little program of no merit on its own, but the first step of figuring out how to run Lua on your system is the gateway to Lua learning. Baby steps for now…