Formatting deposition transcripts

One of the odd things I do on a regular basis is read deposition transcripts (not for fun; it’s part of my job). A deposition is testimony given under oath prior to trial; like trial testimony, it’s recorded by a court reporter on a stenotype machine (court reporters are also known as stenographers). I prefer having the transcripts emailed to me in plain text format so I can quickly search for words or phrases, but there’s nothing like having a printed copy when you need to read a transcript from start to finish.

Most stenographers will provide a printed version of the transcript in a condensed form, usually called a miniscript or some clever (and probably trademarked) variation on that term. By “condensed” I don’t mean zipped or gzipped or bzipped; I mean printed 4-up, like this:

+--------+--------+
|        |        |
|        |        |
| Page 1 | Page 3 |
|        |        |
|        |        |
+--------+--------+
|        |        |
|        |        |
| Page 2 | Page 4 |
|        |        |
|        |        |
+--------+--------+

The pages are printed on both sides of the paper, so each sheet contains 8 pages of transcript, a nice savings of paper.

For some time, I’ve been making my own miniscripts from the text transcripts by

  1. inserting troff commands into the files;

  2. running the files through groff, the GNU version of troff, to create a PostScript file; and

  3. running that PostScript file through psnup to create a 4-up version.

The second and third steps are purely mechanical, requiring no thought, but the first can be tricky. Deposition transcripts are done in many styles: sometimes they’re double-spaced, sometimes not; sometimes the page numbers are at the bottoms of the pages at the right margin, sometimes they’re at the top and flush left; sometimes there are several blank spaces at the beginning of each line, sometimes not. I wrote some notes to myself to help with the process, but it always seemed more time-consuming than it should have been.

(What is consistent from stenographer to stenographer is that the pages and lines of a transcript are always numbered so testimony can be cited with precision. Whatever stylistic transformations may be done, these numbers must be preserved.)

Late last week, I decided the time had come to automate the process in a Perl program. Some design decisions were easy:

  1. Blank lines (or lines with only whitespace) would be eliminated, making every transcript single-spaced.

  2. A fixed number of leading spaces would be chopped from each line, the number coming from the user as a command-line option.

These were simple, each requiring only one line of code (plus a few lines to handle the command line). The user (me) would know how many spaces to cut by a quick examination of the file. But handling the tops and bottoms of the page was more difficult. Troff needs a command (.bp) to break a page, and it seemed like every deposition I ran into needed a different rule for where to put that command.
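As a rough sketch of those two cleanup rules in isolation, assuming the transcript is already held in a single string: the helper name clean_transcript and the sample text are made up for illustration, while the two substitutions are the same ones the program below uses.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the program itself): apply the two
# simple cleanup rules to a transcript held in one string.
sub clean_transcript {
    my ($text, $blanks) = @_;
    $text =~ s/\n(\s*\n)+/\n/g;   # drop empty and whitespace-only lines
    $text =~ s/^ {$blanks}//mg;   # chop $blanks leading spaces from each line
    return $text;
}

# Made-up sample: double-spaced, every line indented by 4 spaces.
my $sample = join "\n",
    "    1  Q. State your name.",
    "",
    "    2  A. Jane Doe.",
    "   ",
    "    3  Q. Thank you.",
    "";

print clean_transcript($sample, 4);
```

Note that a line with fewer than the requested number of leading spaces is left alone, because the second substitution only matches the full run of spaces.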

Ultimately, my model for handling the page boundaries was Larry Wall’s rename script. It was in the pink version of Programming Perl, but has since been moved to the Perl Cookbook. The genius of this program was that it took advantage of the user’s knowledge of Perl to adapt itself to changing conditions. My program has a somewhat dumbed-down version of this flexibility.
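To make the idea concrete, here is a minimal sketch in the spirit of that script, not a copy of it: the caller hands over a snippet of Perl as a string, and the program evaluates it against each item in $_. The helper name apply_user_code and the file names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: run a caller-supplied snippet of Perl against each
# string. The snippet sees the string in $_ and may modify it in place.
sub apply_user_code {
    my ($code, @items) = @_;
    for (@items) {
        eval $code;                       # string eval: the user's Perl runs here
        die "Bad expression: $@" if $@;   # report syntax or runtime errors
    }
    return @items;
}

# The "rule" is plain Perl, so it can be changed without touching the program.
my @names = apply_user_code('s/\.txt$/.bak/', 'depo1.txt', 'depo2.txt', 'notes.md');
print "$_\n" for @names;
```

My program narrows this down: instead of arbitrary code, the user supplies only a regular expression that marks a page boundary.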

Here it is.

#!/usr/bin/perl

use strict;
use warnings;
use Getopt::Std;

my $usage = <<'USAGE';
dep24up -- Create a 4-up PostScript version of a deposition transcript
           so it can be printed out nicely.

usage:   dep24up [options] file

options:
    -h     : print this message
    -b n   : number of blank spaces to remove from beginning 
             of each line (default: 0)
    -p sss : Perl regexp that defines a page boundary
             (default: '^(\s{10,}(Page\s+)*(\d+))\s*$')
    -e     : page boundary regexp is at end of page, rather than
             at beginning
    -d     : create a troff version of the transcript for
             debugging, but don't process it into PostScript

notes:
     Page boundaries are processed after blank lines and initial
     blank spaces are removed. Only $1 from the -p regexp is
     preserved. The output file has the same base name as the
     input file, with a '-4up.ps' extension (or '.rf' if the -d
     option is used).
USAGE

# Handle command line
my %opt;
getopts('b:dehp:', \%opt);
my $file = shift;

die $usage if ($opt{h} || ! $file);
my $blanks = $opt{b} || 0;
my $page   = $opt{p} || q(^(\s{10,}(Page\s+)*(\d+))\s*$);
my $atend  = $opt{e};
my $debug  = $opt{d};

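# Slurp in whole file (three-argument open, so odd file names can't be
# misread as pipes or redirections)
open(my $in, '<', $file) or die "No $file: $!\n";
my $ts = do { local $/; <$in> };        # ts = transcript
close $in;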

# Simple cleanup
$ts =~ s/\n(\s*\n)+/\n/g;               # weed out blank lines
$ts =~ s/^ {$blanks}//mg;               # strip beginning blank spaces

# Add codes at page boundaries
if ($atend) {
  $ts =~ s/$page/$1\n.bp\n\.sp |.5i/mg; # $1 goes before .bp
} else {
  $ts =~ s/$page/.bp\n\.sp |.5i\n$1/mg; # $1 goes after .bp
  $ts =~ s/\.bp\n\.sp \|\.5i\n//;       # delete inadvertent .bp before 1st page
}

# Add overall formatting commands at beginning
my $prolog = <<'PROLOG';
.ft BMR
.ps 18
.vs 28
.po .5i
.ll 7.5i
.sp |.5i
.nf
.na
PROLOG

$ts = $prolog . $ts;

# Output
(my $base = $file) =~ s/\.[^.\/]*$//;   # drop last extension; keeps paths like ./depo.txt intact
if ($debug) {
  open OUT, "> $base.rf" or
    die "Can't open $base.rf for writing: $!\n";

} else {
  open OUT, "| groff | psnup -4 -c -q -m18 > $base-4up.ps" or
    die "Can't run pipeline: $!\n";
}

print OUT $ts;
close OUT or die "Error closing output: $! (exit status $?)\n";

I think the code is pretty straightforward Perl. There’s a header comment for each section of the program; only a few lines seemed to merit their own comments. Here are a few additional notes.

With a transcript file named depo.txt, a command of

dep24up -b 8 -p '^(\d{4})\s*$' depo.txt

will strip the first 8 spaces from each line and recognize the top of a page as being a sequence of four digits, flush left. The output file will be named depo-4up.ps.
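Here is a small sketch of what that -p pattern does once the leading spaces are gone. The helper mark_page_tops and the sample text are hypothetical; the two substitutions mirror the "top of page" branch of the program, and the pattern must capture the text to keep as $1.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the program's "top of page" branch:
# put a page break and top margin in front of each line matching $pattern,
# then remove the break that lands before the very first page.
sub mark_page_tops {
    my ($text, $pattern) = @_;
    $text =~ s/$pattern/.bp\n.sp |.5i\n$1/mg;   # $1 (the page number) goes after .bp
    $text =~ s/\.bp\n\.sp \|\.5i\n//;           # first occurrence only
    return $text;
}

# Made-up transcript fragment, already cleaned up: four-digit page numbers
# flush left, followed by numbered lines of testimony.
my $sample = join "\n",
    "0001",
    " 1  Q. State your name.",
    " 2  A. Jane Doe.",
    "0002",
    " 1  Q. And your address?",
    "";

print mark_page_tops($sample, '^(\d{4})\s*$');
```

In the output, 0001 stays at the top with nothing in front of it, and 0002 is preceded by the .bp and .sp lines, so groff starts a new page there. The numbered testimony lines are untouched because they never consist of exactly four digits.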