How to Generate VMP2.2s
(using a revised formula for generating Vocabulary-Management Profiles)
The formula for computing VMP1 scores each new word (type) as 1.0 and each
repeated word as 0.0, then computes the ratio of new vocabulary (types) to
tokens over a moving interval of 35 words, 55 words, or a similar length.
However, as long
texts unfold, repetitions increase and new vocabulary becomes rarer. Consequently,
VMP1s may bottom out at zero for long stretches, yielding no useful signals.
VMP2.1 solves this problem by assigning each repeated word a ratio, normally
greater than 0.0, based on how recently the word last occurred in the text.
The VMP2.1 formula for repeated words is

(Number of Current Word - Number of Previous Occurrence - 1) / (Total Tokens in the Text - 1)

Like VMP1, VMP2.1 starts out high and drops off quickly at the beginnings
of texts. In some respects,
this mirrors the first reading of a text: everything seems new at first,
then more familiar as the text unfolds. However, unlike VMP1, VMP2.1 never
bottoms out at zero even for long texts; instead, it continues to give
useful signals throughout the text.
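The two per-word scoring rules described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names and the sample sentence are hypothetical, not taken from the online program, and the averaging over a moving interval is left out.

```python
def vmp1_ratios(tokens):
    """VMP1 rule: 1.0 for a word's first occurrence, 0.0 for every repetition."""
    seen = set()
    ratios = []
    for word in tokens:
        ratios.append(0.0 if word in seen else 1.0)
        seen.add(word)
    return ratios


def vmp21_ratios(tokens):
    """VMP2.1 rule: 1.0 for a first occurrence; for a repetition,
    (current position - previous position - 1) / (total tokens - 1)."""
    n = len(tokens)
    if n < 2:
        raise ValueError("need at least two tokens")
    previous = {}  # word -> 1-based position of its most recent occurrence
    ratios = []
    for position, word in enumerate(tokens, start=1):
        if word in previous:
            ratios.append((position - previous[word] - 1) / (n - 1))
        else:
            ratios.append(1.0)
        previous[word] = position
    return ratios


# Hypothetical nine-word sample, already lower-cased and split into words.
tokens = "the cat saw the dog and the cat ran".split()
print(vmp1_ratios(tokens))
print(vmp21_ratios(tokens))
```

In this sample the second "cat" scores 0.0 under VMP1 but (8 - 2 - 1)/(9 - 1) = 0.625 under VMP2.1, which is how VMP2.1 keeps producing a signal after new vocabulary runs out.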
VMP2.2 uses the same formula for computing ratios as VMP2.1, except that
VMP2.2 calculates the ratios for the second pass through the text rather
than the first pass. For VMP2.1, the first occurrence of any word in a
text is assigned a maximum ratio of 1.0 (which is averaged with other
ratios over the moving interval). Thus, even common words such as "the",
"of", "and", "a" are assigned maximum ratios
of 1.0 at the beginnings of texts. By contrast, VMP2.2 computes ratios
wrap-around style, for the second pass through a text. Thus the first
occurrence of a word such as "the" (near the beginning of a
text) follows shortly after its last occurrence (near the end of the text),
so its ratio is nearer to 0.0 than to 1.0. The same is true for all
other repeated words; their first occurrences are assigned ratios greater
than 0.0 and less than 1.0. Words that appear only once in the text are
assigned ratios = 1.0. Unlike VMP2.1, VMP2.2 shows no rapid downtrend
at the beginning of a text. VMP2.2s mirror our second readings of texts,
when the beginnings are as familiar to us as the ends. Because we normally
associate rhetorical structure with second (and subsequent) readings rather
than first readings, VMP2.2 is the default program selected for this web
site.
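A minimal Python sketch of the wrap-around (second-pass) scoring described above. It assumes the same ratio formula as VMP2.1, with a word's first occurrence measured back to its last occurrence at the end of the text; the function name and sample sentence are hypothetical, and the online program may differ in details.

```python
def vmp22_ratios(tokens):
    """Second-pass ratios: every occurrence is scored by its distance from the
    previous occurrence, wrapping around from the start of the text to the end."""
    n = len(tokens)
    if n < 2:
        raise ValueError("need at least two tokens")
    # 1-based position of the last occurrence of each word in the text.
    last_position = {word: i for i, word in enumerate(tokens, start=1)}
    previous = {}
    ratios = []
    for position, word in enumerate(tokens, start=1):
        if word in previous:
            gap = position - previous[word]
        else:
            # First occurrence: count back across the text boundary to the
            # word's last occurrence in the preceding pass.
            gap = position + n - last_position[word]
        ratios.append((gap - 1) / (n - 1))
        previous[word] = position
    return ratios


# Hypothetical nine-word sample, already lower-cased and split into words.
tokens = "the cat saw the dog and the cat ran".split()
print(vmp22_ratios(tokens))
```

A word that occurs only once has a gap equal to the full length of the text, so its ratio is (n - 1)/(n - 1) = 1.0, in line with the description above. The first "the" in the sample scores 0.25 rather than 1.0 because its last occurrence is only three positions back across the wrap.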
How to Generate VMP2.2s
1. From the "Analysis Method" drop-down menu, select "VMP2.2
(2nd pass thru text)."
2. Select an odd-numbered interval greater than 1 and less than the
length of your text.
Note: This number will be the interval used to compute a moving
average of the number of new types/tokens over that interval. Choose
shorter intervals if you want your VMPs to be sensitive to short-term
fluctuations in new vocabulary. Choose longer intervals if you prefer
smoother VMPs, tracking longer-term trends. The default is 35, which
is a rather short interval suitable for tracking short-term changes.
For example, Youmans 1991 found a strong correlation between paragraph
boundaries and valleys on VMP1s that were constructed with 35-word moving
intervals. Shorter intervals can be used to highlight variations within
shorter constituents of discourse, for example sentences rather than
paragraphs. Longer intervals can be used to highlight longer constituents.
For example, Youmans 1994 used a 55-word interval to investigate the
correlation between VMP1 and the boundaries between numbered sections
in two short stories by William Faulkner.
3. If you don't want to see a plot of your type-token curves displayed on your
screen, then uncheck the box to the right of the prompt "Include
graph with your output?"
Note: This online program can generate VMP statistics and their accompanying
graphs for texts of novella length or shorter. Longer texts may produce an
error message unless the graphing option is turned off. If no graph is requested,
the program can generate VMP statistics for very long texts. These statistics
can be converted into graphical form by any standard spreadsheet/plotting
program such as Microsoft Excel or Corel Quattro Pro.
4. Select the text file you want to analyze. (For instructions: see
How to Upload Your Text File.)
5. Click the button labeled "Upload & process".
6. If all goes well, progress messages similar to the following should
appear:
Uploading file C:\My Documents\MyTextFile.txt for analysis...
File upload complete.
Processing file . . .
Processing of file MyTextFile.txt complete...
7. If you have chosen to include a graph with your output, then a plot of VMP2.2
should appear on your screen: the ratio types/tokens (y-axis) vs. tokens
(x-axis). If you receive an error message instead, then your text may
be too long for this online program to generate a graph. Try running
the program again with the graphical option turned off.
8. If you wish to view your VMP2.2 statistics on your screen, click
the button labeled "Download Output".
9. If you wish to save your VMP2.2 statistics, follow the instructions
described in How to Download Your Data.
 
VMP2.2 Statistics:
William Faulkner's short story "A Rose for Emily"
(for a 55-word moving interval)
VMP2.2:
"Corrected Rose.text 1" Interval: 55 Types=1071
Tokens=3693 Types/Tokens=0.2900 AvgR = the ratio of Types/Tokens
over the moving interval.
Average of the avgR = 0.28995
Standard Deviation = 0.06650
| Midpoint | AvgR    | Last Word in Interval | Token No. of Last Word |
| 1        | 0.30038 | fallen                | 28                     |
| 2        | 0.31574 | monument              | 29                     |
| 3        | 0.31511 | the                   | 30                     |
| 4        | 0.32049 | women                 | 31                     |
| 5        | 0.33845 | mostly                | 32                     |
| 6        | 0.34081 | out                   | 33                     |
| 7        | 0.34012 | of                    | 34                     |
| 8        | 0.34298 | cuiosity              | 35                     |
| 9        | 0.32489 | to                    | 36                     |
| 10       | 0.32809 | see                   | 37                     |
| 11       | 0.31251 | the                   | 38                     |
| 12       | 0.33068 | inside                | 39                     |
| 13       | 0.31592 | of                    | 40                     |
| 14       | 0.31590 | her                   | 41                     |
| 15       | 0.30356 | house                 | 42                     |

Types = the total vocabulary (graphically distinct words) used in the story
Tokens = the total number of words in the story
Types/Tokens = the ratio of types divided by tokens
Average of the avgR = the mean of all the moving interval averages in the text
Standard deviation = the standard deviation of the average ratio of types/tokens over the moving interval
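The two summary figures defined above can be recomputed from a column of AvgR values with a short Python sketch. The function name is hypothetical, and the use of the population standard deviation is an assumption: the source does not say whether the program uses the population or the sample formula.

```python
import statistics


def summarize(avg_ratios):
    """Return (mean, standard deviation) of the moving-interval averages.

    Assumption: population standard deviation (pstdev); the online program
    might use the sample formula (statistics.stdev) instead.
    """
    return statistics.mean(avg_ratios), statistics.pstdev(avg_ratios)


# First five AvgR values from the table above. The published summary figures
# cover the whole story, so this partial result will not match them.
sample = [0.30038, 0.31574, 0.31511, 0.32049, 0.33845]
mean_avg_r, sd_avg_r = summarize(sample)
print(f"Average of the avgR = {mean_avg_r:.5f}")
print(f"Standard Deviation = {sd_avg_r:.5f}")
```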
VMP2.2 plots the average of the per-word ratios (each between 0.0 and 1.0)
over a preset moving interval. Unlike VMP2.1, VMP2.2 computes these ratios
wrap-around style, as though the text were concatenated with itself. Hence,
the first ratio, 0.30038, plotted at token #1, is computed for a 55-word
interval that begins
with the 27th word from the end of the story and extends through the 28th
word of the beginning of the story. The next ratio is computed for the
26th word from the end through the 29th word of the beginning. This procedure
is repeated throughout the text, generating a moving average of ratios,
much like a moving average of stock market prices. Note that the last
word in the interval has a crucial effect upon the average ratios.
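The wrap-around moving average can be sketched in Python as below. The sketch assumes the per-word ratios have already been computed and that each interval is centered on its midpoint token, which matches the 27-before, 28-after example for a 55-word interval; the function name and the sample ratios are hypothetical.

```python
def wraparound_moving_average(ratios, interval):
    """Average per-word ratios over an odd-length interval centered on each
    token, treating the text as circular (end joined to beginning).

    Returns one (midpoint, average, token number of last word) row per token,
    with 1-based token numbers.
    """
    n = len(ratios)
    if interval % 2 == 0 or not 1 < interval < n:
        raise ValueError("interval must be odd, > 1, and < the text length")
    half = interval // 2
    rows = []
    for mid in range(n):
        # Python's % maps negative indexes onto the end of the text.
        window = [ratios[(mid + offset) % n] for offset in range(-half, half + 1)]
        last_token = (mid + half) % n + 1
        rows.append((mid + 1, sum(window) / interval, last_token))
    return rows


# Hypothetical per-word ratios for a five-word text, with a 3-word interval.
for midpoint, avg_r, last_token in wraparound_moving_average(
        [1.0, 0.0, 0.0, 1.0, 0.0], 3):
    print(midpoint, round(avg_r, 5), last_token)
```

With a 55-word interval, `half` is 27, so the row for midpoint 1 covers the last 27 words of the text plus words 1 through 28, and its last word is token 28.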
The ratio for this text is 0.30038 for the 55-word interval ending with the 28th
word "fallen", which occurs 3 times in the story. The ratio
increases to 0.31574 for the interval ending with the next word "monument",
which occurs only once in the story. Then the ratio declines to 0.31511
for the interval ending with the next word "the", which occurs
257 times in the story. Hence, upturns in VMP2.2s occur only when less-recently-used
words are added to the end of the interval, and downturns occur when more-recently-used
words are added to the end of the interval. This is why the program lists
the last words in the interval. Less-recently-used vocabulary at the ends
of moving intervals tends to correlate with a change in topics, whereas
more-recently-used vocabulary tends to correlate with a continuation of
the same topic. Hence, VMP2.2s are surprisingly sensitive indicators of
the ebb and flow of new topics in discourse.
Note that the program changes upper case ("Emily") to lower case ("emily").
This can result in occasional processing errors; for example, "Will"
(proper name), "will" (auxiliary verb), and "will"
(noun: 'volition') are all treated as being identical graphical words.
If you want to distinguish among homographs such as these, you will have
to recode your text, for example, by replacing "Will/will" with
"will1/will2" or something similar. (For further discussion,
see How to Prepare Your Text File for Analysis.)
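One way to recode homographs before uploading a file is sketched below in Python. The tokenizing rule (lower-case, then split on anything that is not a letter, digit, or apostrophe) is only an assumption about how graphical words are identified, and the function names and the "will1/will2" labels are illustrative.

```python
import re


def tokenize(text):
    """Lower-case the text and split it into graphical words.
    Assumed tokenization; the online program's exact rules are not documented here."""
    return re.findall(r"[a-z0-9']+", text.lower())


def recode_homographs(text, replacements):
    """Replace whole-word, case-sensitive forms so they stay distinct
    after lower-casing."""
    for old, new in replacements.items():
        text = re.sub(rf"\b{re.escape(old)}\b", new, text)
    return text


sample = "Will said he will go."
print(tokenize(sample))  # both forms collapse to 'will'
recoded = recode_homographs(sample, {"Will": "will1", "will": "will2"})
print(tokenize(recoded))  # the name and the verb are now different types
```

A case-based replacement like this only separates the proper name from the lower-case forms. Distinguishing the auxiliary verb from the noun 'volition' still requires editing each occurrence by hand.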
If single words such as "emily" are broken into two parts, such as "e"
and "mily", this probably means that the text file was stored
with "soft returns" rather than "hard returns" for
line breaks. Cure: store your text file with "hard returns"
to indicate line breaks.
Youmans, Gilbert. 1991. "A New Tool for Discourse Analysis: the Vocabulary-Management Profile." Language 67.4, 763-789.
Youmans, Gilbert. 1994. "The Vocabulary-Management-Profile: Two Stories by William Faulkner." Empirical Studies of the Arts 12.2, 113-130.
Youmans, Gilbert. 2001. Manuscript. "The Hierarchical Structure of Discourse: A New, Improved Vocabulary-Management Profile."