Fiona Tweedie deleted figures/Screen Shot 2014-10-09 at 5.47.28 pm/day-2-completed.ipynb  about 9 years ago

Commit id: fcffed6df793c09b9305f66a8b1df73e5b9cf17c

{  "metadata": {},  "nbformat": 3,  "nbformat_minor": 0,  "worksheets": [  {  "cells": [  {  "cell_type": "markdown",  "metadata": {},  "source": [  "
\n",
  "\n",  "
"
  ]  },  {  "cell_type": "heading",  "level": 1,  "metadata": {},  "source": [  "Session 4: The Fraser Speech Corpus"  ]  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "**Welcome back!**\n",  "\n",  "So, what did we learn yesterday? A brief recap:\n",  "\n",  "* The **IPython** Notebook\n",  "* **Python**: syntax, variables, functions, etc.\n",  "* **NLTK**: manipulating linguistic data\n",  "* **Corpus linguistic tasks**: tokenisation, keywords, collocation, stemming, concordances\n",  "\n",  "Today's focus will be on **developing more advanced NLTK skills** and using these skills to **investigate the Fraser Speeches Corpus**. In the final session, we will discuss **how to use what you have learned here in your own research**.\n",  "\n",  "*Any questions or anything before we dive in?*"  ]  },  {  "cell_type": "heading",  "level": 2,  "metadata": {},  "source": [  "Malcolm Fraser and his speeches"  ]  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "So, for much of the next two sessions, we are going to be working with a corpus of speeches made by Malcolm Fraser. "  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "# this code allows us to display images and webpages in our notebook\n",  "from IPython.display import display\n",  "from IPython.display import display_pretty, display_html, display_jpeg, display_png, display_svg\n",  "from IPython.display import Image\n",  "from IPython.display import HTML\n",  "import nltk"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "Image(url='http://www.unimelb.edu.au/malcolmfraser/photographs/family/105~36fam6p9.jpg')"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "Because our project here is *corpus driven*, we don't necessarily need to know about Malcolm Fraser and his speeches in order to analyse the data: we may be happy to let things emerge from the data themselves. Even so, it's nice to know a bit about him.\n",  "\n",  "Malcolm Fraser was a member of Australian parliament between 1955 and 1983, holding the seat of Wannon in western Victoria. He held a number of ministries, including Education and Science, and Defence. \n",  "\n",  "He became leader of the Liberal Party in March 1975 and Prime Minister of Australia in December 1975, following the dismissal of the Whitlam government in November 1975.\n",  "\n",  "He retired from parliament following the defeat of the Liberal party at the 1983 election and in 2009 resigned from the Liberal party after becoming increasingly critical of some of its policies.\n",  "\n",  "He can now be found on Twitter as **@MalcolmFraser12**"  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "HTML('')"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "In 2004, Malcolm Fraser made the University of Melbourne the official custodian of his personal papers. The collection consists of a large number of photographs, speeches and personal papers, including Neville Fraser's WWI diaries and materials relating to CARE Australia, which Mr Fraser helped to found in 1987. 
"  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "HTML('')"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "Every week, between 1954 until 1983, Malcolm Fraser made a talk to his electorate that was broadcast on Sunday evening on local radio. \n",  "\n",  "The speeches were transcribed years ago. Optical Character Recognition (OCR) was used to digitise the transcripts. This means that the texts are not of perfect quality. \n",  "\n",  "Some have been manually corrected, which has removed extraneous characters and mangled words, but even so there are still some quirks in the formatting. \n",  "\n",  "For much of this session, we are going to manipulate the corpus data, and use the data to restructure the corpus. "  ]  },  {  "cell_type": "heading",  "level": 2,  "metadata": {},  "source": [  "Cleaning the corpus"  ]  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "A common part of corpus building is corpus cleaning. Reasons for cleaning include:\n",  "\n",  "1. Not break the code with unexpected input\n",  "2. Ensure that searches match as many examples as possible\n",  "3. Increasing readability, the accuracy of taggers, stemmers, parsers, etc.\n",  "\n",  "The level of kind of cleaning depends on your data and the aims of your project. In the case of very clean data (lucky you!), there may be little that needs to be done. With messy data, you may need to go as far as to correct variant spellings (online conversation, very old books)."  ]  },  {  "cell_type": "heading",  "level": 3,  "metadata": {},  "source": [  "Discussion"  ]  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "*What are the characteristics of clean and messy data? Any personal experiences? Discuss with your neighbours.* \n",  "\n",  "It will be important to bear these characteristics in mind once you start building your own datasets and corpora. "  ]  },  {  "cell_type": "heading",  "level": 2,  "metadata": {},  "source": [  "Exploring the corpus"  ]  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "First of all, let's load in our text.\n",  "\n",  "Via file management, open and inspect one file in *corpora/UMA_Fraser_Radio_Talks*. What do you see? Are there any potential problems?\n",  "\n",  "We can also look at file contents within the IPython Notebook itself:"  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "import os\n",  "# import tokenizers\n",  "from nltk import word_tokenize\n",  "from nltk.text import Text"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "# make a list of files in the directory 'UMA_Fraser_Radio_Talks'\n",  "files = os.listdir('corpora/UMA_Fraser_Radio_Talks')\n",  "files[:3]"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "Actually, since we'll be referring to this path quite a bit, let's make it into a variable. 
This makes our code easier to use on other projects (and saves typing)"  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "corpus_path = 'corpora/UMA_Fraser_Radio_Talks'"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "We can now tell Python to get the contents of a file in the file list and print it:"  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "# print file contents\n",  "# change zero to something else to print a different file\n",  "f = open(os.path.join(corpus_path, files[0]), \"r\")\n",  "text = f.read()\n",  "print text"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "heading",  "level": 3,  "metadata": {},  "source": [  "Exploring further: splitting up text"  ]  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "We've had a look at one file, but the real strength of NLTK is to be able to explore large bodies of text. \n",  "\n",  "When we manually inspected the first file, we saw that it contained a metadata section, before the body of the text. \n",  "\n",  "We can ask Python to show us just the start of the file. For analysing the text, it is useful to split the metadata section off, so that we can interrogate it separately but also so that it won't distort our results when we analyse the text."  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "# open the first file, read it and then split it into two parts, metadata and body\n",  "data = open(os.path.join(corpus_path, os.listdir(corpus_path)[0])).read().split(\"\")\n",  "# notice that many different commands can be strung together in one line!"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "# view the first part\n",  "data[0]\n",  "# put print before this to change the way you see it!"  
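],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "The other item produced by the split should be the body of the speech. As a quick sketch (assuming the split above gave us a metadata part followed by a body part), we can peek at the start of the body in the same way:"  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "# view the first few hundred characters of the body text\n",  "# data[1] holds everything after the metadata\n",  "print data[1][:300]"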
],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "# split into lines, add '*' to the start of each line\n",  "# \\r is a carriage return, like on a typewriter.\n",  "# \\n is a newline character\n",  "for line in data[0].split('\\r\\n'):\n",  " print '*', line"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "# skip empty lines and any line that starts with '<'\n",  "for line in data[0].split('\\r\\n'):\n",  " if not line:\n",  " continue\n",  " if line[0] == '<':\n",  " continue\n",  " print '*', line"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "# split the metadata items on ':' so that we can interrogate each one\n",  "for line in data[0].split('\\r\\n'):\n",  " if not line:\n",  " continue\n",  " if line[0] == '<':\n",  " continue\n",  " element = line.split(':')\n",  " print '*', element"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "# actually, only split on the first colon\n",  "for line in data[0].split('\\r\\n'):\n",  " if not line:\n",  " continue\n",  " if line[0] == '<':\n",  " continue\n",  " element = line.split(':', 1)\n",  " print '*', element"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "heading",  "level": 3,  "metadata": {},  "source": [  "**Challenge**: Building a Dictionary"  ]  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "We've already worked with strings, integers, and lists. Another kind of data structure in Python is a *dictionary*.\n",  "\n",  "Here is how a simple dictionary works:"  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "# create a dictionary\n",  "commonwords = {'the': 4023, 'of': 3809, 'a': 3098}\n",  "# search the dictionary for 'of'\n",  "commonwords['of']"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "type(commonwords)"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "The point of dictionaries is to store a *key* (the word) and a *value* (the count). When you ask for the key, you get its value.\n",  "\n",  "Notice that you use curly braces for dictionaries, but square brackets for lists.\n",  "\n",  "Dictionaries are a great way to work with the metadata in our corpus. 
Let's build a dictionary called *metadata*:\n",  "\n",  "Your first line will look like this:\n",  "\n",  " metadata = {}"  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "metadata = {}\n",  "for line in data[0].split('\\r\\n'):\n",  " if not line:\n",  " continue\n",  " if line[0] == '<':\n",  " continue\n",  " element = line.split(':', 1)\n",  " metadata[element[0]] = element[-1]\n",  "print metadata"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "# look up the date\n",  "print metadata['Date']"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "heading",  "level": 3,  "metadata": {},  "source": [  "Building functions"  ]  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "**Challenge**: define a function that creates a dictionary of the metadata for each file and gets rid of the whitespace at the start of each element\n",  "\n",  "**Hint**: to get rid of the whitespace use the *.strip()* command."  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "# open the first file, read it and then split it into two parts, metadata and body\n",  "data = open(os.path.join(corpus_path, 'UDS2013680-100-full.txt'))\n",  "data = data.read().split(\"\")"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "def parse_metadata(text):\n",  " metadata = {}\n",  " for line in text.split('\\r\\n'):\n",  " if not line:\n",  " continue\n",  " if line[0] == '<':\n",  " continue\n",  " element = line.split(':', 1)\n",  " metadata[element[0]] = element[-1].strip(' ')\n",  " return metadata"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "Test it out!"  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "parse_metadata(data[0])"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "Now that we're confident that the function works, let's find out a bit about the corpus.\n",  "As a start, it would be useful to know which years the texts are from. Are they evenly distributed over time? A graph will tell us!"  
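]  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "Before we graph anything, here is a rough cross-check using only the standard library. This is just a sketch: it assumes each file's metadata contains a 'Date' field with a four-digit year somewhere in it, and it counts approximately dated files (like 'c1975') too."  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "# roughly tally the texts per year with a regex over the raw file contents\n",  "import re\n",  "from collections import Counter\n",  "year_counts = Counter()\n",  "for filename in os.listdir(corpus_path):\n",  "    text = open(os.path.join(corpus_path, filename)).read()\n",  "    match = re.search(r'Date:.*?([0-9]{4})', text)\n",  "    if match:\n",  "        year_counts[match.group(1)] += 1\n",  "print year_counts.most_common(10)"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "The cell below gives the same kind of information visually, using NLTK's conditional frequency distributions and skipping speeches without an exact date."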
]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "#import conditional frequency distribution\n",  "from nltk.probability import ConditionalFreqDist\n",  "import matplotlib\n",  "% matplotlib inline\n",  "cfdist = ConditionalFreqDist()\n",  "for filename in os.listdir(corpus_path):\n",  " text = open(os.path.join(corpus_path, filename)).read()\n",  " #split text of file on 'end metadata'\n",  " text = text.split(\"\")\n",  " #parse metadata using previously defined function \"parse_metadata\"\n",  " metadata = parse_metadata(text[0])\n",  " #skip all speeches for which there is no exact date\n",  " if metadata['Date'][0] == 'c':\n",  " continue\n",  " #build a frequency distribution graph by year, that is, take the final bit of the 'Date' string after '/'\n",  " cfdist['count'][metadata['Date'].split('/')[-1]] += 1\n",  "cfdist.plot()"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "Now let's build another graph, but this time by the 'Description' field:"  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "cfdist2 = ConditionalFreqDist()\n",  "for filename in os.listdir(corpus_path):\n",  " text = open(os.path.join(corpus_path, filename)).read()\n",  " text = text.split(\"\")\n",  " metadata = parse_metadata(text[0])\n",  " if metadata['Date'][0] == 'c':\n",  " continue\n",  " cfdist2['count'][metadata['Description']] += 1\n",  "cfdist2.plot()"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "heading",  "level": 4,  "metadata": {},  "source": [  "Discussion"  ]  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "We've got messy data! What's the lesson here?\n",  "
"
  ]  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "**Bonus chellenge**: Build a frequency distribution graph that includes speeches without an exact date.\n",  "Hint: you'll need to tell Python to ignore the 'c' and just take the digits"  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "cfdist3 = ConditionalFreqDist()\n",  "for filename in os.listdir(corpus_path):\n",  " text = open(os.path.join(corpus_path, filename)).read()\n",  " text = text.split(\"\")\n",  " metadata = parse_metadata(text[0])\n",  " date = metadata['Date']\n",  " if date[0] == 'c':\n",  " year = date[1:]\n",  " elif date[0] != 'c':\n",  " year = date.split('/')[-1]\n",  " cfdist3['count'][year] += 1\n",  "cfdist3.plot()"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "heading",  "level": 3,  "metadata": {},  "source": [  "Structuring our data by metadata feature"  ]  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "Because our data samples span a long stretch of time, we thought we'd investigate the ways in which Malcolm Fraser's language changes over time. This will be the key focus of the next session.\n",  "\n",  "In order to study this, it is helpful to structure our data according to the year of the sample. This simply means creating folders for each sample year, and moving each text into the correct one.\n",  "\n",  "We can use our metadata parser to help with this task. Then, after structuring our corpus by date, we want the metadata gone, so that when we count language features in the files, we are not also counting the metadata.\n",  "\n",  "So, let's try this:"  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "import re\n",  "# a path to our soon-to-be organised corpus\n",  "newpath = 'corpora/fraser-annual'\n",  "#if not os.path.exists(newpath):\n",  " #os.makedirs(newpath)\n",  "files = os.listdir(corpus_path)\n",  "# define a regex to match year portion of date\n",  "yearfinder = re.compile('[0-9]{4}')\n",  "for filename in files:\n",  " # split file contents at end of metadata\n",  " data = open(os.path.join(corpus_path, filename)).read().split(\"\")\n",  " # get date from data[0]\n",  " # use our metadata parser to get metadata\n",  " metadata = parse_metadata(data[0])\n",  " #look up date field of dict entry\n",  " date = metadata.get('Date')\n",  " # search date for year\n",  " yearmatch = re.search(yearfinder, str(date))\n",  " #get the year as a string\n",  " year = str(yearmatch.group())\n",  " # make a directory with the year name\n",  " if not os.path.exists(os.path.join(newpath, year)):\n",  " os.makedirs(os.path.join(newpath, year))\n",  " # make a new file with the same name as the old one in the new dir\n",  " fo = open(os.path.join(newpath, year, filename),\"w\")\n",  " # write the content portion, without metadata\n",  " fo.write(data[1])\n",  " fo.close()"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "Did it work? How can we check?"  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "# print os.listdir(newpath)\n",  "# print os.listdir(newpath + '/1981')"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "heading",  "level": 3,  "metadata": {},  "source": [  "Keywords in Fraser's speeches"  ]  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "So, we now have a structured, metadata-free corpus."  
]  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "Last time we tried keywording, we simply looked for keywords in a single text file corpus.\n",  "\n",  "A bit part of the power of programming is that we can perform a very similar operation again and again. We should be able to generate the keywords for each subcorpus, one after the other. Using a GUI (*graphical user interface*) tool for keywording would mean that you have to reload the tool with every subcorpus, run the keyworder, save the result, unload the subcorpus, and repeat.\n",  "\n",  "So, let's do it a much more sustainable way."  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "import os\n",  "# path to our new corpus\n",  "corpus = 'corpora/fraser-annual'\n",  "# and make an empty list to store all our output:\n",  "all_text = []\n",  "for subcorpus in os.listdir(corpus):\n",  " subcorpus_text = []\n",  " for txtfile in os.listdir(os.path.join(corpus, subcorpus)):\n",  " filepath = os.path.join(corpus, subcorpus, txtfile)\n",  " data = open(filepath).read()\n",  " data = data.lower() # make it lowercase!\n",  " # add the data from each file to subcorpus text\n",  " subcorpus_text.append(data)\n",  " # after going through each file, turn all the texts into a string\n",  " subcorpus_text = '\\n'.join(subcorpus_text)\n",  " all_text.append(subcorpus_text)"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "So now we have each subcorpus as a list item in *all_text*. We can generate keywords for each:"  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "# reimport keyworder\n",  "import sys\n",  "sys.path.insert(0, 'spindle-code-master/keywords')\n",  "from keywords import keywords_and_ngrams \n",  "results = []\n",  "for text in all_text:\n",  " print all_text.index(text)\n",  " result = keywords_and_ngrams(text)\n",  " results.append(result[0])"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "... and do whatever we like with our results. In the cell below, why don't you try to develop a way of printing some useful results?\n",  "\n",  "**Challenge**: print the year of each subcorpus before printing its top n keywords."  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "# print top 10 keywords and bigrams from each subcorpus, maybe?\n",  "subcorpora = os.listdir(corpus)\n",  "for index, result in enumerate(results):\n",  " print subcorpora[index]\n",  " for keyword in result[:10]:\n",  " print keyword"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "heading",  "level": 2,  "metadata": {},  "source": [  "Adding information to the corpus"  ]  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "So far, the kinds of tasks we've done have involved meaningfully reducing data into numbers, or words into stems, etc.\n",  "\n",  "At this point in the course, we begin to add additional data to the corpora. This allows us to 'go deeper' into the texts.\n",  "\n",  "Before we start annotating our own corpora, let's just quickly play with a pre-annotated corpus."  
]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "from nltk.corpus import brown\n",  "print(brown.words())\n",  "print(brown.tagged_words())"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "So, each word in the 1961 *Brown Corpus* is tagged for its part of speech, as well as some additional information. The tag descriptions are available here:"  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "HTML('')"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "So, we can pretty easily make lists containing all words of a given type. Below, we'll print the first 50 adverbs. Try changing the 'RB' to another kind of tag (in the list above), and see what results turn up. \n",  "\n",  "> JJ and RB are shorthand for adjective and adverb. It's linguistics jargon from the 50s that we're stuck with now."  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "from nltk.corpus import brown\n",  "adverbs = []\n",  "for tup in brown.tagged_words():\n",  " # get any word whose tag is adverb\n",  " if tup[1] == 'RB':\n",  " adverbs.append(tup[0])\n",  "adverbs[:50]"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "It's easy to grasp the potential power of annotation: think how difficult it would be to write regular expressions that locate all adverbs!\n",  "\n",  "> **Note:** John Sinclair, an early proponent of corpus linguistics generally, was famously resistant to the use of annotation and parsing. He felt that the corpus alone should be used to build theory, rather than using existing theories (grammars) to annotate data (e.g. [2004](#ref:sinclair)). Though this is an uncommon viewpoint today, it is still useful to remember that the process of 'value-adding' is never free of theory or interpretation."  ]  },  {  "cell_type": "heading",  "level": 2,  "metadata": {},  "source": [  "Part-of-speech tagging"  ]  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "Part-of-speech (POS) tagging is the process of assigning each token a label. Often, these labels are similar to what was used to tag the Brown Corpus.\n",  "\n",  "> **Note:** It is generally considered good practice to train your tagger by exposing it to well-annotated language of a similar variety. For reasons of scope, however, training taggers and parsers is not covered in these sessions."  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "text = word_tokenize(\"We can put any text we like here, and NLTK will tag each word with its part of speech.\")\n",  "tagged = nltk.pos_tag(text)\n",  "tagged"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "We could use this to search text by part of speech:"  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "for word, tag in tagged:\n",  " if tag == 'NN':\n",  " print word\n",  "\n",  "# > In the legal profession, during the discovery process, a legal team may receive hundreds of thousands of pages of text. Searching of POS-tagged data can locate documents likely to contain important information, or at the very least, can sort texts in order of their relevance. This can save countless hours of work.\n",  "\n",  "# Part of speech tagging still has some limitations, though. 
The problem is that words in a sentence are related in complicated ways. If we are interested in modal auxiliaries that modify the verb *tag*, we would like our search to match:\n",  "\n",  "# * it **will** tag ...\n",  "# * it **could** potentially tag\n",  "# * it **can**'t always easily tag\n",  "# * and so on...\n",  "\n",  "# In order to match these examples, we have to develop annotations not only of words, but groups of words. If we recognise *will tag*, *could potentially tag*, and *can't always easily tag* as verb phrases (VPs), it makes it much easier to search for the modal auxiliaries within them.\n",  "\n",  "# The idea of mapping out the grammatical relationships between words in a sentence is a very, very old idea indeed. Hundreds of different models of grammar have been proposed. Right now, we'll focus on a very influential and well-known model of language called *phrase structure grammar*."  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "heading",  "level": 3,  "metadata": {},  "source": [  "Phrase structure grammar"  ]  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "Phrase structure grammar is the tree-style representation of popularised by generative grammarians (i.e. [Chomsky 1965](#ref:chomsky)):\n",  "\n",  "
\n",
  "\n",  "
"
  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "HTML('')"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "Originally, generative grammarians were attempting to write rules that could account for any well-formed sentence in a language. The assumption was that we could do this for many languages, and then compare grammars in order to find *linguistic universals*---things common to all languages.\n",  "\n",  "These days, people aren't so interested in this task. Phrase structure grammars, however, are still common within natural language processing. *Automatic text comprehension* and *text generation* are two tasks that are commonly approached with phrase-structure as an underlying theory.\n",  "\n",  "The task of automatically annotating this level of information is called *parsing*."  ]  },  {  "cell_type": "heading",  "level": 2,  "metadata": {},  "source": [  "Parsing"  ]  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "Parsing involves determining parts of speech for each word, but also the underlying grammatical structure of a sentence. There are many different grammars for a language like English, and accordingly, many different parsers. There is no way of determining which parser is objectively *the best*: some work well for multiple languages, or for certain genres of communication like journalism. Speed and portability may in some contexts be very important values (think *Siri*).\n",  "\n",  "NLTK as a library contains many different kinds of parsers. It also provides interfaces to work with a number of popular parsers such as *BLLIP*, *MaltParser* or *Stanford CoreNLP*. Unfortunately for us, NLTK as a library is largely oriented toward *building* parsers, rather than simply *using* them. Building parsers is a *very* complicated thing. To build a parser, you need to write out a grammar, and train a machine to learn this grammar by feeding it a corpus of correctly annotated clauses. This kind of task is well beyond the scope of our short course.\n",  "\n",  "> If you're interested in the idea of developing a grammar, you can head [here](http://www.nltk.org/book/ch08.html) for NLTK's documentation.\n",  "\n",  "What we're going to do is use a parser that works 'out of the box', without any training. One of the simplest to use within NLTK/Python is *pyStatParser*. First, we have to get it from the web and install it."  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "# copy the parser files\n",  "! git clone https://github.com/emilmont/pyStatParser.git\n",  "import os\n",  "# go to parser directory\n",  "os.chdir('pyStatParser')\n",  "# install parser\n",  "! python setup.py install\n",  "# back to original directory\n",  "os.chdir('..')"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "#import parser\n",  "from stat_parser import Parser\n",  "# name and load the parser\n",  "parser = Parser()\n",  "# parse and print a sentence\n",  "tree = parser.parse(\"We act to prevent a wider war, to diffuse a powder keg at the heart of Europe \"\n",  " \"that has exploded twice before in this century with catastrophic results.\")\n",  "print tree"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "NLTK provides a *draw()* function for graphically representing these bracketted trees. 
With a bit of hacking, we can get IPython to show us the tree:"  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "from nltk import Tree\n",  "from nltk.draw.util import CanvasFrame\n",  "from nltk.draw import TreeWidget\n",  "cf = CanvasFrame()\n",  "# draw the tree\n",  "tc = TreeWidget(cf.canvas(),tree)\n",  "cf.add_widget(tc,10,10) # (10,10) offsets\n",  "# print it to file\n",  "cf.print_to_file('tree.ps')\n",  "# don't show it on screen (yet!)\n",  "cf.destroy()\n",  "# convert to displayable form\n",  "! convert tree.ps tree.png\n",  "# remove the old file\n",  "! rm tree.ps\n",  "# show the image\n",  "Image(filename='tree.png')"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "Below is a simple function that uses the code above to turn a sentence into a visualised tree. Insert whatever text you like as a string, and see what happens!"  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "def quicktree(sentence):\n",  " \"\"\"Parse a sentence and return a visual representation\"\"\"\n",  " from nltk import Tree\n",  " from nltk.draw.util import CanvasFrame\n",  " from nltk.draw import TreeWidget\n",  " from stat_parser import Parser\n",  " from IPython.display import display\n",  " from IPython.display import Image\n",  " parser = Parser()\n",  " parsed = parser.parse(sentence)\n",  " cf = CanvasFrame()\n",  " tc = TreeWidget(cf.canvas(),parsed)\n",  " cf.add_widget(tc,10,10) # (10,10) offsets\n",  " cf.print_to_file('tree.ps')\n",  " cf.destroy()\n",  " ! convert tree.ps tree.png\n",  " ! rm tree.ps\n",  " return Image(filename='tree.png')\n",  " ! rm tree.png"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "quicktree(\"It depends upon what the meaning of the word is is.\")"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "If we've got any spare time, you might like to try to build some functionality into *quicktree()*. Maybe it would be nice to be able to provide a filename, and not to delete the file after loading it ... ?"  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "# Use these cells to visualise some sentences, if you like!"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "code",  "collapsed": false,  "input": [  ""  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "code",  "collapsed": false,  "input": [  ""  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "heading",  "level": 2,  "metadata": {},  "source": [  "Summary"  ]  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "So, that's the end of Session 4. Now, we're able to do some pretty complex stuff!\n",  "\n",  "In this session, we've generated real insights into the Fraser Corpus using corpus linguistic/distant reading techniques (keywording, ngrams, and collocation).\n",  "\n",  "We've also learned about some more advanced computational linguistic ideas, like tagging and parsing. It's good to keep in mind, however, that any kind of POS tagging or parsing is an act of interpretation. Just because it's being done by a computer doesn't mean it's objective. 
Certain kinds of meaning can be systematically missed by processes like parsing.\n",  "\n",  "In the next lesson, we'll use a fully parsed version of the Fraser Corpus to look for longitudinal change in his use of language. The only reason we didn't all parse the texts ourselves is that the process is computationally intensive, and takes a few hours to complete. For this reason, you actually downloaded the parsed version of the corpus at the start of the first session.\n",  "\n",  "*See you soon!*"  ]  },  {  "cell_type": "heading",  "level": 1,  "metadata": {},  "source": [  "Session 5: Charting change in Fraser's speeches"  ]  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "In this lesson, we investigate a fully-parsed version of the Fraser Corpus. We do this using purpose-built tools.\n",  "\n",  "In the first part of the session, we will go through how to use each of the tools. Later, you'll be able to use the tools to navigate the data and visualise results in any way you like.\n",  "\n",  "The Fraser Speeches have been parsed for part of speech and grammatical structure by [*Stanford CoreNLP*](http://nlp.stanford.edu/software/corenlp.shtml), a parser that can be loaded within NLTK. We rely on [*Tregex*](http://nlp.stanford.edu/~manning/courses/ling289/Tregex.html) to interrogate the parse trees. Tregex allows very complex searching of parsed trees, in combination with [Java Regular Expressions](http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html), which are very similar to the regexes we've been using thus far.\n",  "\n",  "If you plan to work more with parsed corpora later, it's definitely worthwhile to learn the Tregex syntax in detail. For now, though, we'll use simple queries, and explain the query construction syntax as we go.\n",  "\n",  "Before we get started, we have to install Java, as some of our tools rely on some Java code. You'll very likely have Java installed on your local machine, but we need it on the cloud. To make it work, you should run the following line of code in the cloud Terminal:\n",  "\n",  " sudo yum install java"  ]  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "OK, that's out of the way. Next, let's import the functions we'll be using to investigate the corpus. These functions have been designed specifically for our investigation, but they will work with any parsed dataset.\n",  "\n",  "We'll take a look at the code used in this session a little later on, if there's time. Much of the code is derived from things we've learned here, combined with a lot of Google and Stack Overflow searching. All our code is on GitHub too, remember. 
It's open-source, so you can do whatever you like with it.\n",  "\n",  "Here's an overview of the functions we'll be using, and their purpose:\n",  "\n",  "| **Function name** | Purpose | |\n",  "| ----------------- | ---------------------------------- | |\n",  "| *searchtree()* | find things in a parse tree | |\n",  "| *interrogator()* | interrogate parsed corpora | |\n",  "| *plotter()* | visualise *interrogator()* results | |\n",  "| *quickview()* | view *interrogator()* results | |\n",  "| *tally()* | get total frequencies for *interrogator()* results | |\n",  "| *surgeon()* | edit *interrogator()* results | |\n",  "| *merger()* | merge *interrogator()* results | |\n",  "| *conc()* | complex concordancing of subcopora | |\n",  "\n",  "We can import them using IPython Magic:"  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "import os # for joining paths\n",  "from IPython.display import display, clear_output # for clearing huge lists of output\n",  "# import functions to be used here:\n",  "%run corpling_tools/interrogator.ipy\n",  "%run corpling_tools/resbazplotter.ipy\n",  "%run corpling_tools/additional_tools.ipy"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "We also need to set the path to our corpus as a variable. If you were using this interface for your own corpora, you would change this to the path to your data."  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "path = 'corpora/fraser-corpus-annotated' # path to corpora from our current working directory."  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "heading",  "level": 3,  "metadata": {},  "source": [  "Interrogating the corpus"  ]  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "To interrogate the corpus, we need a crash course in parse labels and Tregex syntax. Let's define a tree (from the Fraser Corpus, 1956), and have a look at its visual representation.\n",  "\n",  " Melbourne has been transformed over the let 18 months in preparation for the visitors."  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "melbtree = (r'(ROOT (S (NP (NNP Melbourne)) (VP (VBZ has) (VP (VBN been) (VP (VBN transformed) '\n",  " r'(PP (IN over) (NP (NP (DT the) (VBN let) (CD 18) (NNS months)) (PP (IN in) (NP (NP (NN preparation)) '\n",  " r'(PP (IN for) (NP (DT the) (NNS visitors)))))))))) (. .)))')"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "Notice that an OCR error caused a parsing error. Oh well. Here's a visual representation, drawn with NLTK:\n",  "\n",  "
\n",
  "\n",  "
"
  ]  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "The data is annotated at word, phrase and clause level. Embedded here is an elaboration of the meanings of tags *(ask Daniel if you need some clarification!)*:"  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "HTML('')"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "Note that the tags are a little bit different from the last parser we were using:"  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "quicktree(\"Melbourne has been transformed over the let 18 months in preparation for the visitors\")"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "Neither parse is perfect, but the one we just generated has a major flaw: *Melbourne* is parsed as an adverb! Stanford CoreNLP correctly identifies it as a proper noun, and also, did a better job of handling the 'let' mistake."  ]  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "*searchtree()* is a tiny function that searches a syntax tree. We'll use the sample sentence and *searchtree()* to practice our Tregex queries. We can feed it either *tags* (S, NP, VBZ, DT, etc.) or *tokens* enclosed in forward slashes."  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "# any plural noun\n",  "query = r'NNS'\n",  "searchtree(melbtree, query)"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "# A token matching the regex *Melb.?\\**\n",  "query = r'/Melb.?/'\n",  "searchtree(melbtree, query)"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "query = r'NP'\n",  "searchtree(melbtree, query)"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "To make things more specific, we can create queries with multiple criteria to match, and specify the relationship between each criterion we want to match. Tregex will print everything matching **the leftmost criterion**."  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "# NP with 18 as a descendent\n",  "query = r'NP << /18/'\n",  "searchtree(melbtree, query)"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "Using an exclamation mark negates the relationship. Try producing a query for a *noun phrase* (NP) without a *Melb* descendent:"  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "query = r'NP !<< /Melb.?/'\n",  "searchtree(melbtree, query)"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "The dollar specifies a sibling relationship between two parts of the tree---that is, two words or tags that are horizontally aligned."  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "# NP with a sister VP\n",  "# This corresponds to 'subject' in many grammars\n",  "query = r'NP $ VP'\n",  "searchtree(melbtree, query)"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "Try changing the **more than** symbols to **less than**, and see how it affects the results."  
]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "# Prepositional phrase in other prepositional phrases\n",  "query = r'PP >> PP'\n",  "searchtree(melbtree, query)"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "There is also a double underscore, which functions as a wildcard."  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "# anything with any kind of noun tag\n",  "query = r'__ > /NN.?/'\n",  "searchtree(melbtree, query)"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "Using brackets, it's possible to create very verbose queries, though this goes well beyond our scope. Just know that it can be done!"  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "# particle verb in verb phrase with np sister headed by Melb.\n",  "# the particle verb must also be in a verb phrase with a child preposition phrase\n",  "# and this child preposition phrase must be headed by the preposition 'over'.\n",  "query = r'VBN >> (VP $ (NP <<# /Melb.?/)) > (VP < (PP <<# (IN < /over/)))'\n",  "searchtree(melbtree, query)"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "Here are two more trees for you to query, from 1969 and 1973.\n",  "\n",  " We continue to place a high value on economic aid through the Colombo Plan, involving considerable aid to Asian students in Australia."  ]  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "
\n",
  "\n",  "
"
  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "colombotree = r'(ROOT (S (NP (PRP We)) (VP (VBP continue) (S (VP (TO to) (VP (VB place) (NP (NP (DT a) (JJ high) '\n",  " r'(NN value)) (PP (IN on) (NP (JJ economic) (NN aid)))) (PP (IN through) (NP (DT the) (NNP Colombo) (NNP Plan))) '\n",  " r'(, ,) (S (VP (VBG involving) (NP (JJ considerable) (NN aid)) (PP (TO to) (NP (NP (JJ Asian) (NNS students)) \n",  " r'(PP (IN in) (NP (NNP Australia))))))))))) (. .)))'"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  " As a result, wool industry and the research bodies are in a state of wonder and doubt about the future."  ]  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "
\n",
  "\n",  "
"
  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "wooltree = r'(ROOT (S (PP (IN As) (NP (DT a) (NN result))) (, ,) (NP (NP (NN wool) (NN industry)) (CC and) '\n",  " r'(NP (DT the) (NN research) (NNS bodies))) (VP (VBP are) (PP (IN in) (NP (NP (DT a) (NN state)) '\n",  " r'(PP (IN of) (NP (NN wonder) (CC and) (NN doubt))))) (PP (IN about) (NP (DT the) (NN future)))) (. .)))'"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "Try a few queries in the cells below.\n",  "\n",  "> If you need help constructing a Tregex query, ask Daniel. He writes them all day long for fun."  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "query = '?'\n",  "searchtree(colombotree, query)"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "code",  "collapsed": false,  "input": [  ""  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "code",  "collapsed": false,  "input": [  ""  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "code",  "collapsed": false,  "input": [  ""  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "So, now we understand the basics of a Tregex query (don't worry---many queries have already been written for you. We can start our investigation of the Fraser Corpus by generating some general information about it. First, let's define a query to find every word in the corpus. Run the cell below to define the *allwords_query* as the Tregex query.\n",  "\n",  "> *When writing Tregex queries or Regular Expressions, remember to always use **r'...'** quotes!*"  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "# any token containing letters or numbers (i.e. no punctuation):\n",  "# we specify here that it cannot have any descendants,\n",  "# just to be sure we only get tokens, not tags.\n",  "allwords_query = r'/[A-Za-z0-9]/ !< __' "  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "Next, we perform interrogations with *interrogator()*. Its most important arguments are:\n",  "\n",  "1. **path to corpus** (the *path* variable)\n",  "\n",  "2. Tregex **options**:\n",  " * **'-t'**: return only words\n",  " * **'-C'**: return a count of matches\n",  "\n",  "3. the **Tregex query**\n",  "\n",  "We only need to count tokens, so we can use the **-C** option (it's often faster than getting lists of matching tokens). The cell below will run *interrogator()* over each annual subcorpus and count the number of matches for the query."  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "allwords = interrogator(path, '-C', allwords_query) "  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "When the interrogation has finished, we can view the total counts by getting the *totals* branch of the *allwords* interrogation:"  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "# from the allwords results, print the totals\n",  "print allwords.totals"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "If you want to see the query and options that created the results, you can print the *query* branch."  
]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "print allwords.query"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "heading",  "level": 3,  "metadata": {},  "source": [  "Plotting results"  ]  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "Lists of years and totals are pretty dry. Luckily, we can use the *plotter()* function to visualise our results. At minimum, *plotter()* needs two arguments:\n",  "\n",  "1. a title (in quotation marks)\n",  "2. a list of results to plot"  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "plotter('Word counts in each subcorpus', allwords.totals)"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "Great! So, we can see that the number of words per year varies quite a lot. That's worth keeping in mind.\n",  "\n",  "Next, let's plot something more specific, using the **-t** option."  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "query = r'/(?i)\\baustral.?/' # australia, australian, australians, etc.\n",  "aust = interrogator(path, '-t', query) # -t option to get matching words, not just count"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "We now have a list of words matching the query stores in the *aust* variable's *results* branch:"  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "aust.results[:3] # just the first few entries"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "*Your turn!* Try this exercise again with a different term. "  ]  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "We can use a *fract_of* argument to plot our results as a percentage of something else. This helps us deal with the issue of different amounts of data per year."  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "# as a percentage of all aust* words:\n",  "plotter('Austral*', aust.results, fract_of = aust.totals)\n",  "# as a percentage of all words (using our previous interrogation)\n",  "plotter('Austral*', aust.results, fract_of = allwords.totals)"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "Great! So, we now have a basic understanding of the *interrogator()* and *plotter()* functions."  ]  },  {  "cell_type": "heading",  "level": 3,  "metadata": {},  "source": [  "Customising visualisations"  ]  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "By default, *plotter()* plots the absolute frequency of the seven most frequent results.\n",  "\n",  " We can use other *plotter()* arguments to customise what our chart shows. *plotter()*'s possible arguments are:\n",  "\n",  " | plotter() argument | Mandatory/default? | Use | Type |\n",  " | :------|:------- |:-------------|:-----|\n",  " | *title* | **mandatory** | A title for your plot | string |\n",  " | *results* | **mandatory** | the results you want to plot | *interrogator()* total |\n",  " | *fract_of* | None | results for plotting relative frequencies/ratios etc. 
| list (interrogator(-C) form) |\n",  " | *num_to_plot* | 7 | number of top results to display | integer |\n",  " | *multiplier* | 100 | result * multiplier / total: use 1 for ratios | integer |\n",  " | *x_label* | False | custom label for the x-axis | string |\n",  " | *y_label* | False | custom label for the y-axis | string |\n",  " | *yearspan* | False | plot a span of years | a list of two int years |\n",  " | *justyears* | False | plot specific years | a list of int years |\n",  " | *csvmake* | False | make csvmake the title of csv output file | string |\n",  "\n",  "You can easily use these to get different kinds of output. Try changing some parameters below:"  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "# maybe we want to get rid of all those non-words?\n",  "plotter('Austral*', aust.results, fract_of = allwords.totals, num_to_plot = 3, y_label = 'Percentage of all words')"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "# or see only the 1960s?\n",  "plotter('Austral*', aust.results, fract_of = allwords.totals, num_to_plot = 3, yearspan = [1960,1969])"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "**Your Turn**: mess with these variables, and see what you can plot. Try using some really infrequent results, if you like!"  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  ""  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "code",  "collapsed": false,  "input": [  ""  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "heading",  "level": 3,  "metadata": {},  "source": [  "Viewing and editing results"  ]  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "Aside from *interrogator()* and *plotter()*, there are also a few simple functions for viewing and editing results."  ]  },  {  "cell_type": "heading",  "level": 4,  "metadata": {},  "source": [  "quickview()"  ]  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "*quickview()* is a function that quickly shows the n most frequent items in a list. Its arguments are:\n",  "\n",  "1. an *interrogator()* result\n",  "2. number of results to show (default = 50)\n",  "\n",  "We can see the full glory of bad OCR here:"  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "quickview(aust.results, n = 20)"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "The number shown next to the item is its index. You can use this number to refer to an entry when editing results."  ]  },  {  "cell_type": "heading",  "level": 4,  "metadata": {},  "source": [  "tally()"  ]  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "*tally()* displays the total occurrences of results. Its first argument is the list you want tallies from. For its second argument, you can use:\n",  "\n",  "* a list of indices for results you want to tally\n",  "* a single integer, which will be interpreted as the index of the item you want\n",  "* a string, 'all', which will tally every result. 
This could be very many results, so it may be worth limiting the number of items you pass to it with [:n],"  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "tally(aust.results, [0, 3])"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "**Your turn**: Use 'all' to tally the result for the first 11 items in aust.results"  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "tally(aust.results[:10], 'all')"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "heading",  "level": 4,  "metadata": {},  "source": [  "surgeon()"  ]  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "Results lists can be edited quickly with *surgeon()*. *surgeon()*'s arguments are:\n",  "\n",  "1. an *interrogator()* results list\n",  "2. *criteria*: either a regex or a list of indices.\n",  "3. *remove = True/False*\n",  "\n",  "By default, *surgeon()* removes anything matching the regex/indices criteria, but this can be inverted with a *remove = False* argument. Because you are duplicating the original list, you don't have to worry about deleting *interrogator()* results.\n",  "\n",  "We can use it to remove some obvious non-words."  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "non_words_removed = surgeon(aust.results, [5, 9], remove = True)\n",  "plotter('Some non-words removed', non_words_removed, fract_of = allwords.totals)"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "Note that you do not access surgeon lists with *aust.non_words_removed* syntax, but simply with *non_words_removed*."  ]  },  {  "cell_type": "heading",  "level": 4,  "metadata": {},  "source": [  "merger()"  ]  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "*merger()* is for merging items in a list. Like *surgeon()*, it duplicates the old list. Its arguments are:\n",  "\n",  "1. the list you want to modify\n",  "2. the indices of results you want to merge, or a regex to match\n",  "3. newname = *str/int/False*: \n",  " * if string, the string becomes the merged item name.\n",  " * if integer, the merged entry takes the name of the item indexed with the integer.\n",  " * if not specified/False, the most most frequent item in the list becomes the name.\n",  "\n",  "In our case, we might want to collapse *Australian* and *Australians*, because the latter is simply the plural of the former."  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "# before:\n",  "plotter('Before merging Australian and Australians', aust.results, num_to_plot = 3)\n",  "# after:\n",  "merged = merger(aust.results, [1, 2], newname = 'australian(s)')\n",  "plotter('After merging Australian and Australians', merged, num_to_plot = 2)"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "heading",  "level": 4,  "metadata": {},  "source": [  "conc()"  ]  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "The final function is *conc()*, which produces concordances of a subcorpus based on a Tregex query. Its main arguments are:\n",  "\n",  "1. A subcorpus to search *(remember to put it in quotation marks!)*\n",  "2. 
A Tregex query"  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "# here, we search the subcorpus for a single year,\n",  "# rather than the whole corpus.\n",  "conc(os.path.join(path,'1966'), r'/(?i)\\baustral.?/') # matches words beginning with 'austral'"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "You can set *conc()* to print *n* random concordances with the *random = n* parameter. You can also store the output to a variable for further searching."  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "randoms = conc(os.path.join(path,'1963'), r'/(?i)\\baustral.?/', random = 5)\n",  "randoms"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "*conc()* takes another argument, *window*, which alters the amount of co-text appearing on either side of the match."  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "conc(os.path.join(path,'1981'), r'/(?i)\\baustral.?/', random = 5, window = 50)"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "*conc()* also allows you to view parse trees. By default, this option is switched off:"  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "conc(os.path.join(path,'1954'), r'/(?i)\\baustral.?/', random = 5, window = 30, trees = True)\n",  "\n",  "# Now that you're familiar with the corpus and the functions, it's time to explore the corpus in a more structured way. To do this, however, we need a little bit of linguistic knowledge."  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "heading",  "level": 3,  "metadata": {},  "source": [  "Some linguistics..."  ]  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "*Functional linguistics* is a research area concerned with how *realised language* (lexis and grammar) works to achieve meaningful social functions.\n",  "\n",  "One functional linguistic theory is *Systemic Functional Linguistics*, developed by Michael Halliday (Prof. Emeritus at University of Sydney).\n",  "\n",  "Central to the theory is a division between **experiential meanings** and **interpersonal meanings**.\n",  "\n",  "* Experiential meanings communicate what happened to whom, under what circumstances.\n",  "* Interpersonal meanings negotiate identities and role relationships between speakers.\n",  "\n",  "Halliday argues that these two kinds of meaning are realised **simultaneously** through different parts of English grammar.\n",  "\n",  "* Experiential meanings are made through **transitivity choices**.\n",  "* Interpersonal meanings are made through **mood choices**.\n",  "\n",  "Here's one visualisation of it. We're concerned with the two left-hand columns. Each level is an abstraction of the one below it.\n",  "\n",  "
\n",
  "\n",  "
\n",
  "\n",  "Transitivity choices include fitting together configurations of:\n",  "\n",  "* Participants (*a man, green bikes*)\n",  "* Processes (*sleep, has always been, is considering*)\n",  "* Circumstances (*on the weekend*, *in Australia*)\n",  "\n",  "Mood features of a language include:\n",  "\n",  "* Mood types (*declarative, interrogative, imperative*)\n",  "* Modality (*would, can, might*)\n",  "* Lexical density---the number of words per clause, the number of content to non-content words, etc.\n",  "\n",  "Lexical density is usually a good indicator of the general tone of texts. The language of academia, for example, often has a huge number of nouns to verbs. We can approximate an academic tone simply by making nominally dense clauses: \n",  "\n",  " The consideration of interest is the potential for a participant of a certain demographic to be in Group A or Group B*.\n",  "\n",  "Notice how not only are there many nouns (*consideration*, *interest*, *potential*, etc.), but that the verbs are very simple (*is*, *to be*).\n",  "\n",  "In comparison, informal speech is characterised by smaller clauses, and thus more verbs.\n",  "\n",  " A: Did you feel like dropping by?\n",  " B: I thought I did, but now I don't think I want to\n",  "\n",  "Here, we have only a few, simple nouns (*you*, *I*), with more expressive verbs (*feel*, *dropping by*, *think*, *want*)\n",  "\n",  "> **Note**: SFL argues that through *grammatical metaphor*, one linguistic feature can stand in for another. *Would you please shut the door?* is an interrogative, but it functions as a command. *invitation* is a nominalisation of a process, *invite*. We don't have time to deal with these kinds of realisations, unfortunately."  ]  },  {  "cell_type": "heading",  "level": 3,  "metadata": {},  "source": [  "Fraser's speeches and linguistic theory"  ]  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "So, from an SFL perspective, when Malcolm Fraser gives a speech, he is simultaneously making meaning about events in the real world (through transitivity choices) and about his role and identity (through mood and modality choices).\n",  "\n",  "With this basic theory of language, we can create two research questions:\n",  "\n",  "1. **How does Malcolm Fraser's tone change over time?**\n",  "2. **What are the major things being spoken about in Fraser's speeches, and how do they change?**\n",  "\n",  "As our corpus is well-structured and parsed, we can create queries to answer these questions, and then visualise the results."  ]  },  {  "cell_type": "heading",  "level": 4,  "metadata": {},  "source": [  "Interpersonal features"  ]  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "We'll start with interpersonal features of language in the corpus. First, we can devise a couple of simple metrics that can teach us about the interpersonal tone of Fraser's speeches over time. 
We don't have time to run all of these queries right now, but there should be some time later to explore the parts of this material that interest you."  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "# number of content words per clause\n",  "openwords = r'/\\b(JJ|NN|VB|RB)+.?\\b/'\n",  "clauses = r'S < __'\n",  "opencount = interrogator(path, '-C', openwords)\n",  "clausecount = interrogator(path, '-C', clauses)"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "plotter('Lexical density', opencount.totals, \n",  " fract_of = clausecount.totals, y_label = 'Lexical Density Score', multiplier = 1)"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "We can also look at the use of modal auxiliaries (*would*, *could*, *may*, etc.) over time. This can be interesting, as modality is responsible for communicating certainty, probability, obligation, etc.\n",  "\n",  "Modals are very easily and accurately located, as there are only a few possible words, and they occur in predictable places within clauses.\n",  "\n",  "Most grammars tag them with 'MD'.\n",  "\n",  "If modality interests you, it could be a good set of results to manipulate and plot later."  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "query = r'MD < __'\n",  "modals = interrogator(path, '-t', query)\n",  "plotter('Modals', modals.results, fract_of = modals.totals)"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "# percentage of tokens that are first person (I, me, my)\n",  "query = r'/PRP.?/ < /(?i)^(i|me|my)$/'\n",  "firstperson = interrogator(path, '-C', query)"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "plotter('First person', firstperson.totals, fract_of = allwords.totals)"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "# percentage of questions\n",  "query = r'ROOT <<- /.?\\?.?/'\n",  "questions = interrogator(path, '-C', query)"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "plotter('Questions/all clauses', questions.totals, fract_of = clausecount.totals)"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "# ratio of open/closed class words\n",  "closedwords = r'/\\b(DT|IN|CC|EX|W|MD|TO|PRP)+.?\\b/'\n",  "closedcount = interrogator(path, '-C', closedwords)"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "plotter('Open/closed word classes', opencount.totals, \n",  " fract_of = closedcount.totals, y_label = 'Open/closed ratio', multiplier = 1)"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "# ratio of nouns/verbs\n",  "nouns = r'/NN.?/ < __'\n",  "verbs = r'/VB.?/ < __'\n",  "nouncount = interrogator(path, '-C', nouns)\n",  "verbcount = interrogator(path, '-C', verbs)"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "plotter('Noun/verb ratio', nouncount.totals, fract_of = verbcount.totals, multiplier = 1)"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": 
"heading",  "level": 4,  "metadata": {},  "source": [  "Experiential features of Fraser's speech"  ]  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "We now turn our attention to what is being spoken about in the corpus. First, we can get the heads of grammatical participants:"  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "# heads of participants (heads of NPS not in prepositional phrases)\n",  "query = r'/NN.?/ >># (NP !> PP)'\n",  "participants = interrogator(path, '-t', query, lemmatise = True)"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "plotter('Participants', participants.results, fract_of = allwords.totals)"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "Next, we can get the most common processes. That is, the rightmost verb in a verbal group (take a look at the visualised tree!)\n",  "\n",  "> *Be careful not to confuse grammatical labels (predicator, verb), with semantic labels (participant, process) ... *"  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "# most common processes\n",  "query = r'/VB.?/ >># VP >+(VP) VP'\n",  "processes = interrogator(path, '-t', query, lemmatise = True)"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "plotter('Processes', processes.results[2:], fract_of = processes.totals)"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "It seems that the verb *believe* is a common process in 1973. Try to run *conc()* in the cell below to look at the way the word behaves."  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "# write a call to conc() that gets concordances for r'/VB.?/ < /believe/ in 1973\n",  "# conc('fraser-corpus-annotated/1973', r'/VB.?/ < /believe/)\n",  ""  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "For discussion: what events are being discussed when *believe* is the process? Why use *believe* here?\n",  "
\n",
  "\n",  "Next, let's chart noun phrases headed by a proper noun (*the Prime Minister*, *Sydney*, *John Howard*, etc.). We can define them like this:"  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "# any noun phrase headed by a proper noun\n",  "pn_query = r'NP <# NNP'"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "To make for more accurate results the *interrogator()* function has an option, *titlefilter*, which uses a regular expression to strip determiners (*a*, *an*, *the*, etc.), titles (*Mr*, *Mrs*, *Dr*, etc.) and first names from the results. This will ensure that the results for *Prime Minister* also include *the Prime Minister*, and *Fraser* results will include the *Malcolm* variety. The option is turned on in the cell below:"  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "# Proper noun groups\n",  "propernouns = interrogator(path, '-t', pn_query, titlefilter = True)"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "plotter('Proper noun groups', propernouns.results, fract_of = propernouns.totals, num_to_plot = 15)"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "Proper nouns are a really good category to investigate further, as it is through proper nouns that we can track discussion of particular people, places or things. So, let's look at the top 100 results:"  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "quickview(propernouns.results, n = 100)"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  " You can now use the *merger()* and *surgeon()* options to make new lists to plot. Here's one example: we'll use *merger()* to merge places in Victoria, and then *surgeon()* to create a list of places in Australia."  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "merged = merger(propernouns.results, [9, 13, 27, 36, 78, 93], newname = 'places in victoria')\n",  "quickview(merged, n = 100)\n",  "\n",  "ausparts = surgeon(merged, [7, 9, 23, 25, 33, 41, 49], remove = False)\n",  "plotter('Places in Australia', ausparts, fract_of = propernouns.totals)"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "Neat, eh? Well, that concludes the structured part of the lesson. You now have a bit of time to explore the corpus, using the tools provided. Below, for your convenience, is a table of the functions and their arguments.\n",  "\n",  "Particularly rewarding can be playing more with the proper nouns section, as in the cells above. Shout out if you find something interesting!"  ]  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "
\n",
  "\n",  "
"
  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  ""  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "code",  "collapsed": false,  "input": [  ""  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "code",  "collapsed": false,  "input": [  ""  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "code",  "collapsed": false,  "input": [  ""  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "code",  "collapsed": false,  "input": [  ""  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "code",  "collapsed": false,  "input": [  ""  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "code",  "collapsed": false,  "input": [  ""  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "code",  "collapsed": false,  "input": [  ""  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "code",  "collapsed": false,  "input": [  ""  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "code",  "collapsed": false,  "input": [  ""  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "By the way, here's the code behind some of the functions we've been using. With all your training, you can probably understand quite a bit of it!"  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "%load corpling_tools/additional_tools.ipy"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "That's it for this lesson, and for our interrogation of the Fraser Corpus. Remember that this is the first time anybody has conducted a sustained corpus linguistic investigation of this corpus. Everything we found here is a new discovery about the way language changes over time! (feel free to write it up and publish it!)\n",  "\n",  "The final session will look to the future: we hope to have a conversation about what you can do with the kind of skills you've learned here.\n",  "\n",  "*See you soon!*"  ]  },  {  "cell_type": "heading",  "level": 1,  "metadata": {},  "source": [  "Session 6: Getting the most out of what we've learned"  ]  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "So, now you know Python and NLTK! The main things we still have to do are:\n",  "\n",  "1. Manage resources and results\n",  "2. Brainstorm some other uses for NLTK\n",  "3. Integrate IPython into your existing workflow\n",  "4. Have an open discussion about what we've done\n",  "5. Summarise and say goodbye!\n",  "\n",  "This lesson is pretty light on content and structure. Please do jump in at any point, and tell us about your research, and whether or not what you've learned here will be of much use.\n",  "\n",  "Or, ask us if Python can do a certain thing. Maybe we have some tips!"  ]  },  {  "cell_type": "heading",  "level": 2,  "metadata": {},  "source": [  "Managing resources and results"  ]  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "You generate huge amounts of code, data and findings. Often, it's hard to know what to do with it all. In this section, we'll provide some suggestions designed to keep your work:\n",  "\n",  "1. Reproducible\n",  "2. Reusable\n",  "3. 
Comprehensible"  ]  },  {  "cell_type": "heading",  "level": 3,  "metadata": {},  "source": [  "Your code"  ]  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "1. Most importantly, **write comments on your code**. You **will** forget what bits of code are supposed to do. Using others' code is much easier if it's commented up. \n",  "2. A related point is to name your variables meaningfully: *variablexxy* does not tell us much about what it will contain. *for image in images:* is a very comprehensible line.\n",  "3. Also, write docstrings for your functions. Help messages come in very handy, not only for others, but for yourself. Simply stating what the function expects and what it returns is usually enough.\n",  "4. **Version control**. When editing your code, you may sometimes break it. [Here](https://drclimate.wordpress.com/2012/11/16/version-control/)'s a write-up about version control from Damien Irving.\n",  "5. **Share your code**. You are often doing novel things when you code, and sharing what you've done can save somebody else a lot of work. *GitHub* is free for open-source projects. GitHub provides version control, which is especially useful when you are working with a team."  ]  },  {  "cell_type": "heading",  "level": 4,  "metadata": {},  "source": [  "Developing as a programmer"  ]  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "We've only scratched the surface of Python, to be honest. In fact, we've only been treating Python as a programming language. Many of its users, however, see it as more than just a programming language: it is an ideology and culture, as well. \n",  "\n",  "You'll notice on Stack Overflow that people will remark that some solutions are more 'pythonic' than others. By this, they typically mean that the code is easy to read and broken into discrete functions. More broadly, *pythonic* refers to code that adheres to the *Zen of Python*:"  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "import this"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "So, as you explore Python more and more, you learn not only new ways to get tasks done, but also which ways are better than others. While at first you'll be content with making code that works, you'll later want to make sure your code is elegant as well. Fixing up your old code becomes a form of procrastination from thesis writing. Luckily, of all the kinds of procrastination, it's one of the better kinds.\n",  "\n",  "Another change you might notice is a switch toward *defensive programming*, where you write code to handle potential errors, and to provide useful messages when people do something wrong. This is a really awesome thing to do.\n",  "\n",  "Some code authors also try to use *test-driven development*. From [the wikipedia article](http://en.wikipedia.org/wiki/Test-driven_development):\n",  "\n",  "> First the developer writes an (initially failing) automated test case that defines a desired improvement or new function, then produces the minimum amount of code to pass that test, and finally refactors the new code to acceptable standards.\n",  "\n",  "This helps stop feature-creep, builds your confidence, and encourages the division of long code into well-defined functions.\n",  "\n",  "Oh, and you'll probably start dreaming in code. *Not* a joke."  
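,  "\n",  "To make those last few ideas concrete, here is a minimal sketch combining a docstring, a defensive check and a small test (the function name *count_words* is invented for the example):\n",  "\n",  "    def count_words(text):\n",  "        \"\"\"Return the number of whitespace-separated words in a string.\"\"\"\n",  "        # defensive: fail early with a helpful message if we get the wrong type\n",  "        if not hasattr(text, 'split'):\n",  "            raise TypeError('count_words() needs a string-like object')\n",  "        return len(text.split())\n",  "\n",  "    # a tiny automated test: in test-driven development, you write this first,\n",  "    # then write the minimum code that makes it pass\n",  "    assert count_words('a cat sat') == 3"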
]  },  {  "cell_type": "heading",  "level": 3,  "metadata": {},  "source": [  "Your data"  ]  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "It should now be clear to you that you have data!\n",  "Think about how you structure it. Without necessarily becoming an archivist, do think about your metadata. It will help you to manage your data later.\n",  "*Cloud computing* offers you access to more storage and compute-power than you might want to own. Plus you're unlikely to spill coffee on it."  ]  },  {  "cell_type": "heading",  "level": 3,  "metadata": {},  "source": [  "Your findings"  ]  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "[*Figshare*](http://www.figshare.com) is a site for storing tables and figures. It's particularly useful for working with large datasets, as we often generate far more raw tables and statistics than we can possibly publish.\n",  "\n",  "It's becoming more and more common to link journal publications to additional online resources such as GitHub code or Figshares. It's also more and more common to cite GitHub and Figshare---always nice to bump up your citation count!"  ]  },  {  "cell_type": "heading",  "level": 2,  "metadata": {},  "source": [  "Other uses of NLTK"  ]  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "What other things might we use NLTK for? A few examples, and possible workflows."  ]  },  {  "cell_type": "heading",  "level": 3,  "metadata": {},  "source": [  "Scenario 1: You have some old books."  ]  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "* Are they machine readable?\n",  "* OCR options---institutional or DIY?\n",  "* Structure them in a meaningful way---by author, by year, by language ... \n",  "* Start querying!"  ]  },  {  "cell_type": "heading",  "level": 3,  "metadata": {},  "source": [  "Scenario 2: You're interested in an online community."  ]  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "* Explore the site. Sign up for it, maybe.\n",  "* Download it: *Wget*, *curl*, *crawlers, spiders* ...\n",  "* Extract relevant data and metadata: Python's *Beautiful Soup* library.\n",  "* **Structure your data!**\n",  "* Annotate your data, save these annotations\n",  "* Start querying!"  ]  },  {  "cell_type": "heading",  "level": 3,  "metadata": {},  "source": [  "Scenario 3: Something of interest breaks in the news"  ]  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "* It will start being discussed all over the web.\n",  "* You can use the Twitter API to harvest tweets containing a term or hashtag of interest.\n",  "* You can get a list of RSS feeds and mine news articles\n",  "* You can use something like *WebBootCat* to harvest search engine results and make a plain text corpus\n",  "* Process these into a manageable form\n",  "* Structure them\n",  "* *Start querying!"  ]  },  {  "cell_type": "heading",  "level": 2,  "metadata": {},  "source": [  "Integrating IPython into your workflow"  ]  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "What you've learned here isn't much good unless you can pull things out of it and put them into your own research workflow.\n",  "\n",  "It's important to remember that IPython code may be a little different from vanilla Python, as it can contain Magics, shell commands, and the like.\n",  "\n",  "Perhaps the coolest thing about programming is you are simultaneously researching and developing. 
The functions that you write can be uploaded to the web and used by others who encounter the problem that necessitated your writing the function in the first place.\n",  "\n",  "In reality, NLTK is nothing more than a lot of Python functions, coupled with some datasets (corpora, stopword lists, etc.). You can even visit NLTK on GitHub, fork their repository, and start playing around with the code! If you find bugs in the code, or if you think documentation is lacking, you can either write directly to the people who maintain the code, or fix the problem yourself and request that they review your fix and integrate it into NLTK."  ]  },  {  "cell_type": "heading",  "level": 3,  "metadata": {},  "source": [  "Using IPython locally"  ]  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "We've done everything on the cloud so far, and it's been pretty good to us. You may also want to use IPython locally. To do this, you need to install it. There are many ways to install it, and these vary depending on your OS and what you already have installed. See the [IPython website](http://ipython.org/ipython-doc/2/install/install.html#installnotebook) for detailed instructions.\n",  "\n",  "> *[Anaconda](http://continuum.io/downloads)* is a large package of Python-based tools (including IPython and Matplotlib) that is easy to install. \n",  "\n",  "Once you have IPython installed, it's very easy to start using it. All you need to do is open up Terminal, navigate to the notebook directory and type:\n",  "\n",  " ipython notebook filename.ipynb\n",  "\n",  "This will open up the notebook in your browser, exactly like the kind of notebook we've been using on the cloud. The only difference will be that if you enter:\n",  "\n",  " os.listdir('.')\n",  "\n",  "you'll get a list of the files in the directory containing your notebook file, rather than a listing of your part of the cloud."  ]  },  {  "cell_type": "heading",  "level": 2,  "metadata": {},  "source": [  "Next steps - keep going!"  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  "Image(url='http://starecat.com/content/wp-content/uploads/two-states-of-every-programmer-i-am-god-i-have-no-idea-what-im-doing.jpg')"  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "We hope you've learned enough in these two days to be excited about what NLTK can add to your work, and that you're feeling confident to start working on your own.\n",  "\n",  "Code breaks. Often. Be patient and try not to get discouraged.\n",  "\n",  "The good thing about code breaking so often is that you can find help. Try:\n",  "\n",  "* Coming back to these notebooks and refreshing your memory\n",  "* Checking the NLTK book\n",  "* Googling your error messages. This will often lead you to Stack Overflow, the major online community for sharing coding questions.\n",  "* NLTK also has a Google group where people share their experiences and ask for help\n",  "* Keep in touch! Your community is a wonderful resource."  ]  },  {  "cell_type": "heading",  "level": 2,  "metadata": {},  "source": [  "Summaries and goodbye"  ]  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "Before we go, we should summarise what we've learned. 
Add all this to your CV!\n",  "\n",  "* Navigating the IPython notebook\n",  "* Python commands - defining a variable; building a function\n",  "* Using Python to perform basic quantitative analysis of text\n",  "* Tagging and parsing to perform more sophisticated analysis of language\n",  "* A crash course in corpus linguistics!\n",  "* An appreciation of clean vs messy data and data structure\n",  "* Data management practices"  ]  },  {  "cell_type": "heading",  "level": 2,  "metadata": {},  "source": [  "Bragging rights"  ]  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "The work you have been doing today on the Fraser corpus is actually pretty cutting edge. Very little analysis like this has been undertaken on an Australian political corpus.\n",  "\n",  "You have produced publishable work today. Really. Be proud. And if you feel like writing up your findings, do it!"  ]  },  {  "cell_type": "heading",  "level": 2,  "metadata": {},  "source": [  "Thanks!"  ]  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "That's the end of the course. Thank you to everybody for your participation.\n",  "\n",  "Please let us know how you found the course.\n",  "\n",  "Also, [submit a pull request](https://github.com/resbaz/lessons) and improve our teaching materials!"  ]  },  {  "cell_type": "heading",  "level": 2,  "metadata": {},  "source": [  "Bibliography"  ]  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "\n",  "Chomsky, N. (1965). Aspects of the Theory of Syntax (Vol. 11). The MIT Press.\n",  "\n",  "Eggins, S. (2004). Introduction to Systemic Functional Linguistics. Continuum International Publishing Group.\n",  "\n",  "Halliday, M., & Matthiessen, C. (2004). An Introduction to Functional Grammar. Routledge.\n",  "\n",  "Sinclair, J. (2004). Trust the Text: Language, Corpus and Discourse. Routledge. Available at\n",  "[http://books.google.com.au/books/about/Trust_the_Text.html?id=n6xU2lyVoeQC&redir_esc=y](http://books.google.com.au/books/about/Trust_the_Text.html?id=n6xU2lyVoeQC&redir_esc=y)."  
]  },  {  "cell_type": "heading",  "level": 3,  "metadata": {},  "source": [  "Workspace"  ]  },  {  "cell_type": "markdown",  "metadata": {},  "source": [  "Here are a few blank cells, in case you need them for anything:"  ]  },  {  "cell_type": "code",  "collapsed": false,  "input": [  ""  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "code",  "collapsed": false,  "input": [  ""  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "code",  "collapsed": false,  "input": [  ""  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "code",  "collapsed": false,  "input": [  ""  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "code",  "collapsed": false,  "input": [  ""  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "code",  "collapsed": false,  "input": [  ""  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "code",  "collapsed": false,  "input": [  ""  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "code",  "collapsed": false,  "input": [  ""  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "code",  "collapsed": false,  "input": [  ""  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "code",  "collapsed": false,  "input": [  ""  ],  "language": "python",  "metadata": {},  "outputs": []  },  {  "cell_type": "code",  "collapsed": false,  "input": [  ""  ],  "language": "python",  "metadata": {},  "outputs": []  }  ],  "metadata": {}  }  ]  }