Presidential-radio-broadcasts-analysis

An Analysis of Presidential Radio Broadcasts

by Brent Shulman

I. Introduction

Goals of Project:

Explore the uses and applications of natural language processing and text mining
Apply ideas from class to a text-analysis project
Expand coding knowledge
Become better at self-teaching
Examine a corpus of presidential Saturday radio transcripts from Reagan through Obama
Discover trends in word choice over time, both for individual Presidents and for the group as a whole
Look at word frequency for individual Presidents and for the group as a whole

II. Theory, Prior Research, and Background

While examining topics for this project, I was drawn back to the text-analysis examples used in class. Professor Mark Liberman took State of the Union addresses, looked at various lexical features of the speeches, and graphed them. This stimulated my interest in programming as well as computational linguistics and led me to wonder what other large corpora of documents could be investigated. I began to search for documents and was soon led back to political addresses, because they are often the most consistently and regularly produced. In addition, I thought that data of this kind would be ideal for following trends, as the political landscape is always changing, shaped by local and world events.

Following this, I researched techniques and strategies for text mining, topic tracking, and content analysis. While I only briefly scanned most websites for general information, two papers provided the greatest insights, and I read both in greater depth. A Survey of Topic Tracking Techniques by Kamaldeep Kaur and Vishal Gupta provided background on techniques based around this general concept:

In addition, this paper discussed hidden Markov models and provided inspiration for my own analysis. The other paper that influenced my research was Text Mining: A Brief Survey by Falguni Patel and Neha Soni. It gave a slightly different look at models for extracting information from text documents.

Following these readings, I delved further to find a viable corpus of data. After searching the internet and comparing sources, I found the University of California, Santa Barbara’s American Presidency Project. Here I discovered a large amount of data pertaining to presidential speeches and writings across much of recent history. For this project, I narrowed my analysis to only the available Presidential Saturday Radio Addresses [1]. I chose to do this because the amount of data was large but manageable for a first project. Additionally, the addresses were produced on a weekly basis and would therefore provide more lexically diverse data. The document corpus spans from Reagan through to Obama; however, a significant number of Bush Sr.’s addresses were missing. This limited my project’s view of his presidency.

The next step was to find the proper tools to accomplish my goals for the study. Very quickly the Natural Language Toolkit (NLTK [4]) for Python appeared as the natural choice. After reading the preface of the NLTK book, Natural Language Processing with Python, I decided that this was the path to follow.

III. Methods and Procedures

After downloading and setting up Enthought’s Canopy [5] Python environment, I installed and tested the NLTK package. I practiced analyzing the corpora that ship with NLTK using the functions provided and found success. Next, I needed to acquire the texts of the radio transcripts. Professor Liberman guided me through a series of programs in a UNIX environment that downloaded all of the HTML files and created my current corpus of documents.

The first program we created looked through all of the URLs by year and put those addresses into a file labeled RadioIndex“year”.html. This was done for each year in sequence from 1982 to 2014. Next, we noticed that all of the pages had very similar URLs except for an ID number for each page at the end. In addition, we observed that these IDs were not in direct numerical order by president. So, code was created that captured the URL ID numbers and placed them into a file called AllRadio.txt in chronological order. Next, we took the IDs from AllRadio.txt and wrote each transcript to its own file on my computer. However, because the IDs’ numerical order did not correspond with their chronological order, the files all became scrambled in the file explorer. I later solved this issue in my own Python code.
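
The ID-collection step described above can be sketched in Python. This is only an illustration, not the original UNIX programs: the “pid=” link pattern, the sample HTML, and the function names extract_ids and collect_ids are all assumptions made for the example.

```python
import re

# Assumed link shape: each transcript URL ends in a numeric ID, e.g. "...?pid=12345".
# The "pid=" parameter name is a guess for illustration, not taken from the project.
ID_PATTERN = re.compile(r'href="[^"]*pid=(\d+)"')


def extract_ids(index_html):
    """Return transcript IDs in the order they appear on one yearly index page."""
    seen = set()
    ids = []
    for match in ID_PATTERN.finditer(index_html):
        page_id = match.group(1)
        if page_id not in seen:  # skip repeated links to the same page
            seen.add(page_id)
            ids.append(page_id)
    return ids


def collect_ids(index_pages_by_year):
    """Concatenate IDs year by year so the result stays chronological.

    Relies on each index page listing its addresses in date order.
    """
    all_ids = []
    for year in sorted(index_pages_by_year):
        all_ids.extend(extract_ids(index_pages_by_year[year]))
    return all_ids


# Made-up index pages; note that the IDs are not in numerical order.
sample_pages = {
    1983: '<a href="index.php?pid=40111">Jan 1</a>',
    1982: '<a href="index.php?pid=42500">Apr 3</a> '
          '<a href="index.php?pid=42411">Apr 10</a>',
}
all_radio = collect_ids(sample_pages)
print(all_radio)
```

Because the list keeps page order rather than numerical order, the position of an ID in the list can later serve as a chronological sort key. That is one possible way to work around the scrambled file ordering, not necessarily the one used in this project.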

Next, I began writing a Python program called PresidentHTMLParser.py that would make the HTML files “analyzable”. Its functions are described here:

readInFiles():: Opens AllRadio.txt and reads it; Strips the formatting in the document so that only the file IDs remain; These are then each concatenated to a string called StaticVariables.fileIDs; This is then tokenized into a list/array to be used later
Presidents(president):: Takes in a user-inputted string naming the President they are interested in; Sets indices corresponding to that President’s transcripts (these variables are used later)
displayType(formOfDisplay):: Takes in user input for the display type (paragraph or tokenize) and implements it
printTranscripts():: Concatenates the cleaned HTML files corresponding to the chosen President into one outputHTML string or array (depending on whether the output is in paragraph or tokenized form); Makes all words lower case if the user specifies
frequencyDistribution():: Prepares data to be analyzed; Removes common stop words (e.g. “is”); Removes end punctuation, to ensure that “people” is counted the same as “people,”; Takes the frequency distribution of the output and plots it
conditionalFreqDist():: Plots a conditional frequency distribution of user-chosen words; Can be either cumulative or non-cumulative with respect to time
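
The counting steps behind frequencyDistribution() and conditionalFreqDist() can be sketched as follows. This is a minimal stand-in, not the project’s code: it uses the standard library’s Counter instead of NLTK’s FreqDist and ConditionalFreqDist, it omits plotting, and the stop list, function names, and sample sentences are invented for illustration.

```python
import string
from collections import Counter

# Tiny illustrative stop list; a real run would use a fuller one.
STOP_WORDS = {"the", "is", "a", "of", "and", "to", "our", "we"}


def clean_tokens(text):
    """Lowercase, split on whitespace, strip edge punctuation, drop stop words."""
    tokens = []
    for raw in text.lower().split():
        word = raw.strip(string.punctuation)  # "people," becomes "people"
        if word and word not in STOP_WORDS:
            tokens.append(word)
    return tokens


def frequency_distribution(text):
    """Count how often each cleaned token occurs."""
    return Counter(clean_tokens(text))


def conditional_freq_dist(transcripts, target_words, cumulative=False):
    """Count the target words per year, optionally as running totals over time.

    `transcripts` is a list of (year, text) pairs.
    """
    targets = set(target_words)
    years = sorted({year for year, _ in transcripts})
    per_year = {year: Counter() for year in years}
    for year, text in transcripts:
        for word in clean_tokens(text):
            if word in targets:
                per_year[year][word] += 1
    if not cumulative:
        return per_year
    running = Counter()
    result = {}
    for year in years:
        running.update(per_year[year])   # add this year's counts to the total
        result[year] = Counter(running)  # snapshot of the total so far
    return result


# Invented sample text, not taken from the corpus.
sample = [
    (1982, "The people want jobs, and the people want peace."),
    (1983, "Peace is our goal; jobs follow."),
]
by_year = conditional_freq_dist(sample, ["people", "peace"])
running_totals = conditional_freq_dist(sample, ["people", "peace"], cumulative=True)
print(frequency_distribution(sample[0][1]).most_common(2))
print(by_year[1983]["peace"], running_totals[1983]["peace"])
```

The cumulative option mirrors the choice described for conditionalFreqDist(): per-year counts show when a word spikes, while running totals show how its overall use accumulates across presidencies.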

IV. Findings and Results

V. Future Research, Exploration, and Improvements