EXPLANATION OF THE SOFTWARE AT: http://probability.ca/usscj/ SUMMARY: This directory contains various computer programs used during the writing of the research paper "Detecting Multiple Authorship of United States Supreme Court Legal Decisions Using Function Words", by J.S. Rosenthal and A. Yoon (2009), to appear in Annals of Statistics, available at: http://probability.ca/jeff/research.html The software automatically downloads and sorts plain-text versions of United States Supreme Court decisions, using the Justia source provided at supreme.justia.com. It then performs various statistical analyses on the text of these decisions, including the frequency of various "function words" and more. The programs are designed for use only on linux/unix/Mac machines. All programs in this directory are Copyright (c) 2010 by Jeffrey S. Rosenthal, and are licensed for general copying, distribution and modification according to the GNU General Public License (http://www.gnu.org/copyleft/gpl.html). TO PREPARE TO USE THIS SOFTWARE: First, on your own linux/unix/Mac computer, make sure you have "cc" and "lynx" installed. ("cc" or "gcc" is the C compiler; on linux/unix it should come pre-loaded, but for Mac you might need to install the Xcode package from developer.apple.com/tools/xcode. "lynx" is a plain-text web browser, widely available for free download, just Google "lynx browser".) Then, open a Unix shell window (on Mac, use the "Terminal" application). Then, create and move to a new directory (folder), e.g. "usscj": mkdir usscj cd usscj Then, download and run the software's INSTALL file, e.g. by typing: lynx -source http://probability.ca/usscj/INSTALL > INSTALL chmod +x INSTALL ./INSTALL (This may take a few minutes, as it downloads and installs all the needed software to enable you to download and analysis USSC decisions.) You are now ready to begin! SPECIAL LYNX NOTE: It seems that sometimes (e.g. in Mac OS) the command "lynx" cannot handle the https interface, which Justia now uses. If so, then first make sure "curl" is installed, and type either "setenv USECURL true" (in csh or tcsh) or "export USECURL=true" (in sh or bash), to use "curl" for the Justia downloads instead. (Here "lynx" is still required, to decode the resulting html files.) USING THIS SOFTWARE: From within that same "usscj" directory, proceed as follows. First, type "grabvol" followed by a volume number to automatically download as plain text (removing extraneous header and footer text), and sort (by author), all of the majority decisions in that volume: ./grabvol 541 ./grabvol 456 (etc.) Or, use "grabsequence" to download an entire sequence of volumes at once, e.g. to download all volumes from 480 to 520 inclusive, type: ./grabsequence 480 520 You can then perform statistical analysis on all downloaded decisions authored by a given justice, using "textvarit": ./textvarit BREYER ./textvarit SCALIA (etc.) You can also perform various bootstrap comparisons. For example, to bootstrap compare decisions written by two different justices, use "comptwo": ./comptwo BREYER SCALIA (etc.) MULTIPLE JUSTICES WITH THE SAME SURNAME: There are nine pairs of USSC justices with the same surname, e.g. Owen J. Roberts (1930-1945, volumes 280-326) and John G. Roberts (2005-present, volumes 543-present). In such cases, the earlier justice gets a "1" appended to their surname. So, for example, to refer to John G. Roberts use "ROBERTS", but to refer to Owen J. Roberts use "ROBERTS1". VOLUME-SPECIFIC ANALYSIS: There are also commands to perform statistical analysis on decisions by given justices for just certain USSC volume numbers. (This could be used to e.g. compare a justice's early decisions to his/her own later decisions, or to compare two justices over a certain specific time period, or to examine a justice's decisions for volumes early in USSC sessions, etc.) To use this feature, create one or more files consisting solely of lists of volume numbers that you want to isolate, e.g. create a file (named "vollist1", say) consisting solely of the text: 481 482 500 For example, one simple way to create that file is with a command like: echo 481 482 500 > vollist1 Then, to e.g. analyse all of Scalia's decisions for just volumes 481, 482, and 500 only, type: ./textvarvols vollist1 SCALIA Or, to compare Scalia's and Stevens' decisions for just those same three volumes, type: ./comptwovols vollist1 SCALIA vollist1 STEVENS Or, to compare Scalia's decisions in those three volumes, to Scalia's decisions in volumes 500, 512, 520, and 526, instead first create a second file named "vollist2" consisting solely of the text: 500 512 520 526 and then type: ./comptwovols vollist1 SCALIA vollist2 SCALIA To include ALL justices for certain volumes, use "-" as the justice name: ./textvarvols vollist1 - ./comptwovols vollist1 - vollist2 - ./comptwovols vollist1 - vollist2 SCALIA (etc.) A short-cut to directly create a volume list file (e.g. "vollist1") containing all volumes in a sequence (e.g. 342 through 375) is: ./createvollist 342 375 vollist1 LOG FILES FOR LISTS OF JUSTICES AND VOLUME RANGES: A short-cut to producing basic statistical information (#words/file, and V4) for each of an entire list of justices is provided as follows. First create a plain-text file containing a list of justices, e.g. with: echo SCALIA KENNEDY > namelist1 Then, to compute and save each justice's basic statistical information to a log file (e.g. to the file "mylogfile"), type: textvarlog namelist1 mylogfile You can also use commands like: ./justrange STEVENS f to run textvar on the first five years' of STEVENS opinions. The name can be any justice who has already reached 65 years of age, and the final letter can be 'f' for "first five years", or 'l' for "last five years", or 'y' for "young" (i.e. age < 65), or 'o' for "old" (i.e. age > 65). You can also use commands like: ./justrangelog namelist outputfile to get it to cycle through all the justice names in the file "namelist", and output to the file "outputfile" the results of textvarit plus all four "justrange" tests (plus f-l and y-o bootstrap tests), on each of the justices named in the file. Or, use a command like: ./decaderun SCALIA to get the basic output for a justice on a decade-by-decade basis (or replace "SCALIA" by "-" to include all justices). Or, use: ./volrun SCALIA for volume-by-volume analysis, or: ./yearlyrun SCALIA for year-by-year analysis (from 1870 onward), or: ./sessionlyrun SCALIA for session-by-session analysis (from 1870-1871 onward). AUTHORSHIP IDENTIFICATION: We also provide software for identifying authorship of cases based on function word patterns from other cases. To perform a cross-validation test of a naive Bayes classifier for determining which of two justices authored a decision, use "naivebayesit": naivebayesit KENNEDY SCALIA (etc.) Or, to perform a cross-validation test of a linear classifier for determining which of two justices authored a decision, use "lindiscit": lindiscit KENNEDY SCALIA (etc.) (Note that the matrix inversion required for the linear classifier may fail if too few cases have been downloaded.) Or, to see which judgment in a collection is the biggest "outlier" (i.e. the most likely to have a different authorship from all the others), use "outlierit", e.g. to try to pick out the single early judgment volume8/thecase8-172 from all of Justice Scalia's judgments, type: outlierit volume*/SCALIA/thecase* volume8/thecase8-172 You may contact me with questions. -- Jeffrey Rosenthal, jeff@math.toronto.edu, http://probability.ca/jeff/