Many of us have seen those word frequency distributions for presidential speeches, or presidential debate transcripts represented as a tag cloud with the more frequent words displayed in larger, bolder text.
Let's take a gander at putting Lua to the task.
Our first attempt goes something like this:
0001 #!/usr/local/bin/lua
0002
0003 -- calcwebwordfreq.lua
0004 -- construct chart of word frequency, given a website URL
0005
0006 if not arg[1] then
0007 print("Usage:", arg[0], "url", "[minimum_word_length]", "[max_display_lines]")
0008 os.exit()
0009 end
0010
0011 FETCHURL = arg[1]
0012 MIN_WORD_LEN = tonumber(arg[2]) or 6
0013 MAX_DISPLAY_LINES = tonumber(arg[3]) or 40
0014
0015 http = require("socket.http")
0016
0017 -- return keys of an associative array, assumed that keys are of "string"
0018 -- or "numeric" type
0019 function keys (t)
0020 local listofkeys = {}
0021 for k, _ in pairs(t) do
0022 listofkeys[#listofkeys + 1] = k
0023 end
0024 return listofkeys
0025 end
0026
0027 -- crude HTML tag stripping, needs refinement to eliminate javascript code,
0028 -- inline CSS, HTML entities, etc....
0029 function strip_tags (h)
0030 local newstr = string.gsub(h, "<[^>]+>", ' ')
0031 return newstr
0032 end
0033
0034 html = http.request(FETCHURL)
0035 if not html then
0036 print("Error encountered in retrieving " .. FETCHURL)
0037 os.exit()
0038 end
0039
0040 wordchart = {}
0041 tex = strip_tags(html)
0042 for w in string.gfind(tex, "%a+") do
0043 if #w >= MIN_WORD_LEN then
0044 local lcw = string.lower(w)
0045 wordchart[lcw] = (wordchart[lcw] or 0) + 1
0046 end
0047 end
0048
0049 wordlist = keys(wordchart)
0050 table.sort(wordlist, function(a, b) return wordchart[a] > wordchart[b] end)
0051 for i, w in ipairs(wordlist) do
0052 print(string.format("%-20s %10d", w, wordchart[w]))
0053 if (i >= MAX_DISPLAY_LINES) then break end
0054 end
0055
0056 -- end of program
To put this puppy in play, we'll point it at a Project Gutenberg online copy of the William James classic, Pragmatism: A New Name for Some Old Ways of Thinking, originally published in 1907.
$ ./calcwebwordfreq.lua http://www.gutenberg.org/dirs/etext04/prgmt10.txt 5 25
which                       294
world                       239
their                       187
truth                       177
other                       159
these                       143
there                       134
things                      132
would                       125
universe                    107
being                       105
experience                  103
philosophy                  102
reality                     102
pragmatism                   98
facts                        94
absolute                     90
sense                        90
means                        81
human                        78
about                        74
whole                        69
should                       68
notion                       67
itself                       65
Yes, a real "production" script would omit common stop words like "these", "there", "their", and so on, and our HTML tag stripper function is a crude construct that allows far too much uninteresting cruft into the distribution. I'll leave those as follow-up exercises for the diligent programmer, although a quick sketch of the stop-word piece appears just below. With that said, let's break down, chunk by chunk, our initial foray.
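Here's that stop-word sketch, purely one possible approach to the exercise; the stopwords table and count_word helper are illustrative names, not part of the script above:

-- Illustrative stop-word set; a real list would be much longer.
local stopwords = {
    ["these"] = true, ["there"] = true, ["their"] = true,
    ["which"] = true, ["would"] = true, ["about"] = true,
}

-- Count a word only if it is long enough and not a stop word.
local function count_word (wordchart, w, min_len)
    local lcw = string.lower(w)
    if #lcw >= min_len and not stopwords[lcw] then
        wordchart[lcw] = (wordchart[lcw] or 0) + 1
    end
end

Dropping a test along those lines into the counting loop at lines 0042-0047 would thin out the filler considerably.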
Parse script arguments (lines 0006-0013)
The global variable arg contains any script arguments provided on the command line. It is a table, and the first argument is arg[1]; arg[0] contains the name of the script being executed. Incidentally, there may be negative indices present as well; these hold the interpreter name and any options passed to it before the script name (so if we invoked the script in the vein of lua scriptname.lua arg1 arg2 arg3, arg[-1] would hold "lua").
Argument number one is the web page URL we wish to decompose and is a required parameter; if it is not present, a "usage" line is printed and the script terminates. For the other two options, default settings are provided if the command line does not supply them. Note that the or expression "short circuits" just as in other modern scripting languages.
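To make that concrete, here is roughly what arg holds for a hypothetical invocation (the values and local names below are illustrative, not part of the script), along with the or-default idiom in isolation:

-- Hypothetical invocation:  lua calcwebwordfreq.lua http://example.com 5 25
-- arg[-1] = "lua"                   (the interpreter)
-- arg[0]  = "calcwebwordfreq.lua"   (the script name)
-- arg[1]  = "http://example.com"
-- arg[2]  = "5"
-- arg[3]  = "25"

-- tonumber(nil) is nil, so "or" falls through to the default value.
local min_word_len = tonumber(arg[2]) or 6     -- 5 here; 6 if arg[2] were absent
local max_lines    = tonumber(arg[3]) or 40    -- 25 here; 40 if arg[3] were absent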
Require the Lua Socket library (line 0015)
Oh, we'll need to install the LuaSocket library; refer to the specific instructions for your OS platform of choice. The library is available from the LuaSocket project page, and installation is a snap, at least it was for yours truly, taking what seemed like seconds to download and install. I simply downloaded the tarball, moved it to /usr/local, un-tarred it, and ran make install. The socket.http assignment confines our access to the http submodule, and specifically we are interested in the http.request function. The LuaSocket library will be revisited here later, as its url.escape and url.unescape will be most useful for CGI work. A reference to all the LuaSocket modules is available in the library's documentation.
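As a quick sketch of the pieces we care about (worth double-checking against whatever LuaSocket version you have installed): the simple form of http.request takes a URL string and returns the page body plus status information, while socket.url supplies the escape/unescape routines:

local http = require("socket.http")
local url  = require("socket.url")

-- Simple form: returns the page body, or nil plus an error message.
local body, code = http.request("http://www.lua.org/")
if body then
    print("fetched " .. #body .. " bytes, status " .. tostring(code))
else
    print("request failed: " .. tostring(code))
end

-- url.escape / url.unescape, handy later for CGI work.
print(url.escape("a query with spaces & symbols"))
print(url.unescape("a%20query%20with%20spaces%20%26%20symbols"))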
Lines 0019-0025
The keys function provides a shorthand for returning the keys of an associative array, in our case wordchart, which is indexed by word. Other languages have this built in, but in my brief encounter with Lua thus far, I have not discovered a built-in manner of accomplishing this feat. We'll need it for sorting the generated word list in descending frequency sequence.
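A throwaway example of the idea, assuming the keys function above is in scope:

-- A small associative table, then its keys extracted and sorted.
local ages = { alice = 31, bob = 27, carol = 45 }
local names = keys(ages)     -- e.g. { "carol", "alice", "bob" }, order arbitrary
table.sort(names)            -- now { "alice", "bob", "carol" }
for _, name in ipairs(names) do
    print(name, ages[name])
end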
Lines 0029-0032
As alluded to above, strip_tags is a very crude stab at stripping HTML tags. That it does, but it's woefully deficient: it leaves the material between <script> and </script> tags intact, doesn't remove inline CSS styling (i.e., <style> ... </style> blocks), and also neglects HTML entities (stuff like &amp; and &nbsp;).
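Purely as a sketch of where refinement might head (still nowhere near a real HTML parser, and the entity list is only illustrative), a second pass could drop script and style blocks wholesale before stripping whatever tags remain:

-- A slightly less crude tag stripper; still far from a real HTML parser.
function strip_tags (h)
    local s = h
    -- Remove <script>...</script> and <style>...</style> blocks, non-greedily.
    s = string.gsub(s, "<[Ss][Cc][Rr][Ii][Pp][Tt].-</[Ss][Cc][Rr][Ii][Pp][Tt]>", " ")
    s = string.gsub(s, "<[Ss][Tt][Yy][Ll][Ee].-</[Ss][Tt][Yy][Ll][Ee]>", " ")
    -- Drop whatever tags remain.
    s = string.gsub(s, "<[^>]+>", " ")
    -- Decode a handful of common HTML entities.
    s = string.gsub(s, "&nbsp;", " ")
    s = string.gsub(s, "&amp;", "&")
    s = string.gsub(s, "&lt;", "<")
    s = string.gsub(s, "&gt;", ">")
    return s
end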
Go fetch our web page and store the retrieved page in a variable (lines 0034-0039)
http.request is the magic utterance. If there is a problem, Houston, we will abort the mission.
Lines 0040-0047
Build our word frequency chart. Basically, for w in string.gfind(tex, "%a+") do ... iterates over a pattern match, in this case a run of alphabetic characters. Lua, being lightweight, doesn't employ the full-featured regular expression bestiary present in other scripting platforms such as Ruby, Perl, PHP, Python, and so on. But we can specify patterns, make captures, and do most of the stringy stuff that needs doing. Here, %a+ denotes a run of alphabetic letters one or more characters in length. The Lua pattern lingo is similar to regular expression syntax, except that character classes and escapes are prefixed with a % (not a \ as in those other languages). Also, while Lua shares the metacharacter notions of ^ (start of string), $ (end of string), + (1 or more), * (0 or more), and so on, it uses a - instead of a ? to signal a "non-greedy" match. I've blathered enough on Lua patterns; for the definitive word, consult the excellent Lua Reference Manual section on Patterns.
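A few throwaway lines illustrate the flavor (string.gfind, as used in the listing, is the older name for what Lua 5.1 calls string.gmatch):

local s = "Truth <em>happens</em> to an idea."

-- %a+ matches runs of one or more letters.
for w in string.gfind(s, "%a+") do
    io.write(w, " ")
end
print()                               --> Truth em happens em to an idea

-- "-" is the non-greedy repetition, so "<.->" stops at the first ">".
print(string.gsub(s, "<.->", ""))     --> Truth happens to an idea.    2

-- Character classes use % where other languages use a backslash.
print(string.match("published in 1907", "%d+"))   --> 1907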
Back to our program: we are capturing sequences of alphabetic letters, storing each in lcw after converting it to lowercase. And to employ some terminology culled from the Lua documentation, we are using a bag to store the word frequencies. Remember, by default an uninitialized table index yields nil, so wordchart[lcw] = (wordchart[lcw] or 0) + 1 will always result in a successful increment.
Show me the words… (lines 0049-0054)
So we've churned through the web page and lifted out all the words. Now we need to sort the words by frequency total. Unlike PHP's arrays, a Lua table's traversal order is arbitrary, so first it is necessary to cull out the keys and sort them. That is the work of the keys function.
Sorting in Lua is simple. All that needs to be provided is a comparison function, and here we use an "anonymous" function that sorts wordlist in descending frequency order.
Actually, later on we will write an iterator function that makes the wordlist table creation unnecessary. Well, not really, as it will be part of a generic iterator; we'll just provide the comparison function and iterate through each key returned in proper sequence. For now, we're utilizing the built-in Lua iterators, ipairs and pairs: ipairs for numerically indexed arrays, pairs to traverse key/value table pairs.
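Just to preview the shape such a thing might take (only a sketch, and not necessarily the version we'll end up with), a generic "traverse by sorted keys" factory could reuse the keys helper from earlier:

-- Sketch of a generic sorted-traversal iterator.
function sortedpairs (t, cmp)
    local ks = keys(t)          -- the keys helper defined earlier
    table.sort(ks, cmp)
    local i = 0
    return function ()
        i = i + 1
        if ks[i] ~= nil then
            return ks[i], t[ks[i]]
        end
    end
end

-- Usage: walk wordchart in descending frequency order, no separate wordlist needed.
-- for w, n in sortedpairs(wordchart, function(a, b) return wordchart[a] > wordchart[b] end) do
--     print(w, n)
-- end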
Finally, we break out of the loop once our printed line total is equal to or greater than the provided parameter (or the default value set by the script).