learning lua

30 October 2007

calcwebwordfreq.lua

Many have seen those word frequency distributions for presidential speeches. Or even presidential debate transcripts represented as a tag cloud with the more frequent words displayed in larger, bolder text.

Let's take a gander at putting Lua to the task.

Our first attempt goes something like this:

0001  #!/usr/local/bin/lua
0002  
0003  -- calcwebwordfreq.lua
0004  -- construct chart of word frequency, given a website URL
0005  
0006  if not arg[1] then
0007      print("Usage:", arg[0], "url", "[minimum_word_length]", "[max_display_lines]")
0008      os.exit()
0009  end
0010  
0011  FETCHURL = arg[1]
0012  MIN_WORD_LEN = tonumber(arg[2]) or 6
0013  MAX_DISPLAY_LINES = tonumber(arg[3]) or 40
0014  
0015  http = require("socket.http")
0016  
0017  -- return keys of an associative array, assumed that keys are of "string"
0018  -- or "numeric" type
0019  function keys (t)
0020      local listofkeys = {}
0021      for k, _ in pairs(t) do
0022          listofkeys[#listofkeys + 1] = k
0023      end
0024      return listofkeys
0025  end
0026  
0027  -- crude HTML tag stripping, needs refinement to eliminate javascript code, 0028  -- inline CSS, HTML entities, etc....
0029  function strip_tags (h) 
0030      local newstr = string.gsub(h, "<[^>]+>", ' ')
0031      return newstr
0032  end
0033  
0034  html = http.request(FETCHURL)
0035  if not html then
0036      print("Error encountered in retrieving " .. FETCHURL)
0037      os.exit()
0038  end
0039  
0040  wordchart = {}
0041  tex = strip_tags(html)
0042  for w in string.gfind(tex, "%a+") do
0043      if #w >= MIN_WORD_LEN then
0044          local lcw = string.lower(w)
0045          wordchart[lcw] = (wordchart[lcw] or 0) + 1
0046      end
0047  end
0048  
0049  wordlist = keys(wordchart)
0050  table.sort(wordlist, function(a, b) return wordchart[a] > wordchart[b] end)
0051  for i, w in ipairs(wordlist) do
0052      print(string.format("%-20s %10d", w, wordchart[w]))
0053      if (i >= MAX_DISPLAY_LINES) then break end
0054  end
0055  
0056  -- end of program

To put this puppy in play, we'll point it at a Project Gutenberg online copy of the William James classic, Pragmatism: A New Name for Some Old Ways of Thinking, originally published in 1907.

$ ./calcwebwordfreq.lua http://www.gutenberg.org/dirs/etext04/prgmt10.txt 5 25
which                       294
world                       239
their                       187
truth                       177
other                       159
these                       143
there                       134
things                      132
would                       125
universe                    107
being                       105
experience                  103
philosophy                  102
reality                     102
pragmatism                   98
facts                        94
absolute                     90
sense                        90
means                        81
human                        78
about                        74
whole                        69
should                       68
notion                       67
itself                       65

Yes, a real "production" script would omit common placeholder words like "these", "there", "their", etc.… And our HTML tag stripper function is a crude construct, that allows far too much uninteresting cruft to appear in the distribution composition. I'll leave those as follow up exercises for the arduous programmer in practice. But let's break down, chunk by chunk, our initial foray.

Parse script arguments (line 0006-0013)

Global variable arg contains any script arguments that are provided on the command line. It is a table array, and the 1st argument is arg[1]. arg[0] contains the name of the script being executed. Incidentally, there may be negative arguments present also — these denote any flags provided on the command line (i.e., if we invoked the script in the vein of lua scriptname.lua arg1 arg2 arg3).

Argument number one is the web page URL we wish to decompose and is a required parameter — if not present, a "usage" line is printed and the script terminates. For the other two options, default settings are provided if the command line does not grant them. Note that the or expression "short circuits" just like other modern scripting languages.

Require the Lua Socket library (line 0015)

Oh, we'll need to install the Lua socket library. You'll need to refer to specific instructions for your OS platform of choice. The library is available here, and install is a snap, at least it was for yours truly, taking what seemed like seconds to download and install. I simply downloaded, moved to /usr/local, un-tar, and then make install. The socket.http assignment confines our access to the http submodule, and specifically we are interested in the http.request function. The Lua socket library will be revisited again here, as its url.escape and url.unescape will be most useful for CGI work. A reference to all the Lua socket modules is available here.

Line 0019-0025

The keys function will enable a shorthand method of returning the keys of an associative array, our wordchart that will be indexed by word value. Other languages have this built in, but to my brief Lua encounter thus far, I have not discovered a built in manner of accomplishing this feat. We'll need it for sorting the generated word list in descending frequency sequence.

Line 0029-0032

As alluded to above, strip_tags is a very crude stab at stripping HTML tags. That it does, but it's woefully deficient as it still leaves the stuff between <script> and </script> tags intact, doesn't remove inline CSS styling (i.e., <style> ... </style> tags), and also neglects HTML entities (stuff like &nbsp;).

Go fetch our web page and store the retrieved page in a variable (line 0034-0039

http.request is the magic utterance. If there is a problem, Houston, we will abort the mission.

Line 0040-0047

Build our word frequency chart. Basically, for w in string.gfind(tex, "%a+") do… iterates around a pattern match, in this case a string of alphabetic characters. Lua, being lightweight, doesn't employ the full featured regular expression beastiary present in other scripting platforms such as Ruby, Perl, PHP, Python, etc.…. But we can specify patterns, make captures and do most of the stringy stuff needed to be done. Here, #a+ denotes a string of alphabet letters of 1 or more characters in length. The Lua pattern lingo is similar to regular expression syntax, except that special characters are prefixed with a % (not a \ as with those other languages). Also, while Lua shares the metacharacter notion of ^ (start of string>, $ (end of string), + (1 or more), * (0 or more), etc.… it uses a - instead of a ? to signal a "non-greedy" match. I've blathered enough on Lua patterns — for the definitive word, consult the excellent Lua Reference Manual section on Patterns.

Back to our program — we are capturing sequences of alphabetic letters, storing them in lcw after converting to lowercase. And to employ some terminology that I culled from Lua documentation, we are using a bag to store the word frequencies. Remember, by default, uninitialized table index access is met with nil so wordchart[lcw] = (wordchart[lcw] or 0) + 1 will always result in a successful increment.

Show me the words… (line 0049-0054)

So we've churned thorough the web page and lifted out all the words. Now we need to sort the words by frequency total. Unlike PHP, associative array order is arbitrary, so first it will be necessary to cull the keys out and sort them. That is the work of the keys function.

Sorting in Lua is simple. All that is to be provided is a comparison function, and we use an "anonymous" function that will sort wordlist in descending frequency order.

Actually, later, we will write an iterator function that makes the wordlist table creation unnecessary. Well, not really, as it will be part of a generic iterator and we'll just provide the comparison function and iterate through each key returned in proper sequence. For now, we're just utilizing the built in Lua generators, ipairs and pairs. ipairs for numeric indexed arrays, pairs to traverse key/value table pairs.

Finally, we will break out of the loop once our print line total is equal to or greater than provided parameter, or a default value set by the script.

No comments: