Class 4: Repetitive Tasks in Stata

So far we've learned a few things in Stata, and some of them already make repetitive tasks easier.

Lists

In certain commands, Stata can deal with lists. In fact, in most commands it will deal with variable lists (jargonized as varlists), so instead of typing
. sum v1
. sum v2
. sum v3
. sum v4

you can simply type
. sum v1-v4
if those variables sit next to each other, or
. sum v*
to get summaries of all the variables that start with the letter v. (If you need to use specific returned results from summarize variable by variable, this won't work: summarize will only return the results for the last variable summarized.)

Also, some commands support numeric lists, or numlists. You can have a numeric list simply as a set of numbers, say 2 3 5 7 11 13 17 19, or as a range 1/11, or as a range with a step 0(5)25, or as any mix of the three.
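To see how Stata reads a numeric list, you can ask it to expand one. This is a minimal sketch using the numlist command, which parses a list and leaves the expanded version in r(numlist); the particular numbers are made up for illustration.

```stata
* Expand a numeric list that mixes plain numbers, a range, and a range with a step
numlist "2 3 5 1/4 10(5)25"
* r(numlist) holds the fully expanded list
display "`r(numlist)'"
```

Commands that accept a numlist read it the same way, so this is a handy check when you are unsure what a range or a step will produce.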

by: construct

Stata's way to perform the same command separately on different parts of the data set identified by some varlist is to use by: (or the by option in such commands as egen or various incarnations of graph, such as histogram or scatter). The data must be sorted by that varlist first; bysort does the sorting and the by: in one step:
. bysort rep78: sum pri
We used it here instead of typing
. sum pri if rep == 1
. sum pri if rep == 2
. sum pri if rep == 3
. sum pri if rep == 4
. sum pri if rep == 5

(and for that we would also have to know that there are five values to write if conditions for).
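Here is a short sketch of both forms mentioned above, the by: prefix and the by() option, on the auto data. The variable name mean_price is made up for this example.

```stata
sysuse auto, clear
* by: needs the data sorted by the grouping variable;
* bysort sorts first and then runs the command group by group
bysort rep78: summarize price
* the by() option form, here with egen: the group mean is stored as a new variable
egen mean_price = mean(price), by(rep78)
list rep78 price mean_price in 1/5
```

Neither form requires you to know in advance how many distinct values rep78 takes.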

Now, let's get to some real stuff.

Do-files

Do-files are the simplest examples of Stata programs. You open a text file, you type in several Stata commands, and here is your do-file. At the basic level, you use do-files because you don't want to retype that sequence of Stata commands all over again. The real reason you would want to use do-files is to be able to reproduce your own work if you need to attend to it a few days, weeks, or months later. (You'll have surprisingly many occasions to do so; at least, I have.) So for instance here is the sequence of commands for our previous class:
cd h:\
log using class3, replace
sysuse auto

reg pri mpg wei for
estimates store Model1

reg pri mpg wei for if ~mi(rep)
estimates store Model2

reg pri mpg wei for , robust
estimates store Model3

reg pri mpg wei for if ~mi(rep), cluster(rep)
estimates store Model4

regress pri mpg wei for leng turn displ , cl(rep)

regress

estimates restore Model3
eret list

matrix list e(b)

vce

count if e(sample)
return list

di _b[_cons]
di _b[wei]
di _se[wei]

test wei
test wei mpg
test wei = mpg
test (wei + 2*mpg - 0.001*for = 50) (mpg)

testnl ln(_b[wei]/_b[mpg])=0

estimates restore Model1

hettest
hettest displ gear

ovtest
ovtest, rhs

imtest

vif

rvfplot, yline(0)
rvpplot wei
avplots
avplot gear
lvr2plot , mlab(make) mlabsize(small) xsc( r(0 0.16) )
I have added some spaces between commands to make it more readable. This is a reasonably good do-file, and let's see if we can run it. Go to Stata Do-file editor by either typing
. doedit
or by clicking the Do-editor button on the toolbar.
Now, select the above piece of code and Copy/Paste it (I am sorry, this is certainly a Windows idea, but I am pretty sure there are analogues for Unices and Macs) to the Do-editor. Then save it under some reasonable name (class3-draft.do may be used as a suggestion; also make sure you save it to the current directory, which would probably be h:\), and do it in Stata by
. do class3-draft
Self Check: Did it run? Any error messages?

But here's what we can do to make it more readable.

  1. First, we might want to put some safeguards in place to get the file started. Those would include clearing Stata memory and resetting it, as well as making sure no log file is open:
    clear
    set memory 1m
    capture log close
    version 8
    
    The mission of capture is to keep Stata from stopping in case there is an error. Here, an error would arise if no log file were open, and hence there were no log file to be closed. A couple of other relevant commands that you might find useful in your programs are quietly and its counterpart noisily. The former simply suppresses output but, unlike capture, does stop for errors, and the latter turns the output back on temporarily. Try this piece of code:
    qui {
      di 1
      noi di 2
      cap di #
      cap noi di !
    }
    The first line was swallowed by the quietly that works on the whole block given by the curly brackets. The second one was indeed shown by Stata. The next two lines would have returned errors of invalid syntax if they were not preceded by capture (try them without capture to see what happens). In the last line, however, the noisily statement turned the output back on, so we saw what happened.
    (I copied and pasted it to Stata command line; it does tolerate a few lines entered simultaneously by this trick.)
    Finally, the version command tells Stata that everything to follow was written with Stata 8 in mind. Just a precaution for future work with maybe Stata 11 if it becomes a totally different animal...
  2. Back to the Class 3 do-file. I would also want to exit my program gracefully, so this is what I'll put in the end:
    log close
    exit
  3. I shall also insert comments here and there to indicate what I am doing. The comments (either in a do-file, or in Stata command prompt; the two are functionally equivalent) are entered by putting * as the first character in a line / command; Stata treats the remainder as something it does not care to look at. See more on comments in the help comments.
  4. The final thing among bells-and-whistles is that Stata flashes through the graphs without giving me a chance to have a look at them. This can be circumvented with the more command that stops the output until a user presses a key. We've already seen that in the long list commands. This is a sequence that makes this happen:
    set more on
    more
    set more off
    The first command sets more to the state where it stops if Stata has to show more than one screen of output, or if the more command is issued explicitly. The last one sets more to the state where it disregards both the more command and long outputs. This is often helpful in long files! (Another way to deal with long outputs if you see a lot of --more-- messages is to press and hold the spacebar so that Stata will have enough spaces to use up, or press Break... and enter set more off in the header of your do-file or in the command line.)
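Put together, the four suggestions above give a do-file a header, some comments, a pause after each graph, and a graceful exit. The following is only a sketch of that layout, not the class3.do file itself; the file name class3-skeleton is made up for illustration.

```stata
* class3-skeleton.do: hypothetical name, a sketch of the layout only
clear
capture log close
version 8
log using class3-skeleton, replace

sysuse auto
* the regression we want to look at
regress price mpg weight foreign

* show a diagnostic graph and pause until a key is pressed
rvfplot, yline(0)
set more on
more
set more off

log close
exit
```

The same header and footer can be pasted at the top and bottom of any analysis do-file, with only the log name changed.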
Now, have a look at my resulting do-file: class3.do on the web page. Take a minute to run it now. You can do it by either downloading it to your local directory, or by
. do http://www.duke.edu/~skolenik/class3.do

GREAT!
Once you know how to create and run do-files, you are almost ready to apply this very powerful concept to your own analyses. A few words on the logistics of the process may be in order, however.

Stylish do-files

There are a few rules that will make your Stata operation far more efficient. I've learned some of them in Stata's NetCourses (NC 151, in particular). I consider this resource highly useful, and strongly recommend it to anybody whose work will involve running Stata for research purposes at least once a week. I've derived some others from my practice. They are only suggestions, of course, and you don't have to follow them, but once you have more than one project to concentrate on (and the meaning of a "project" may vary from a homework problem to a dissertation), you would want to revisit those suggestions and start wondering why you did not adopt them earlier.
  1. Each of your research projects should sit in a dedicated directory.
  2. Your original data files should remain intact no matter what. If you need to modify them in any manner, do so with a do-file, and document changes with comments. You may even put a read-only status on your freshly downloaded data files from a trusted data base. On UNIX, you can do this by
    . chmod a-w+r Census2000Excerpt.dta
    In Windows, you can set read-only attribute of the file by right-clicking on it in the Explorer, selecting Properties and checking Read-Only box.
  3. Data manipulation do-files and statistical analysis do-files should be separated. A do-file either handles the data (creates and labels new variables, keeps and drops variables and cases, blends together different files -- more on that later, or have a look at Carolina Population Center tutorial), or does the analysis (as our Class3.do file).
  4. Your do-files should have meaningful names. If you have a data set called, for whatever reasons, r15dac.dta, the do-files that work on that may be called r15dac-data.do for data management tasks, and r15dac-analysis.do for analysis tasks.
  5. For the new variables that you create, give them meaningful names showing that those are derived variables; give them short and concise labels; and give them extra notes if there is something special about those variables. Don't forget to label the resulting intermediate data file as well, unless it has a nice label already. I also add a little
    note : "The file r15dac-selectedvars.dta is created on `c(current_date)' at `c(current_time)' by r15dac-data.do"
    before I save the data (recall that the notes are cumulative, so it never hurts to add one). The `c(...)' items are special Stata return values (or rather creturn values, see help for creturn) that contain the current date and the current time. Copy the text in the quotes, quotes included, and paste it into Stata after the display command to see how Stata converts it to something more sensible.
  6. Each do-file should open and close a log-file with a meaningful name. The simplest way to make a name of the log-file meaningful is to have the same name as the do-file. Here, our class3.do opened and closed class3.smcl log-file.
  7. All of the do-files of a research project should be assembled into a master do-file that may have a structure like this (yes, you can call do-files from within do-files):
          clear
          do r15dac-clean
          do r16dac-clean
          do r15r16dac-combine
          do r15r16dac-regressions
          exit
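As an illustration of the rules above on derived variables, labels, notes and log files, here is a sketch of a small data-management do-file. It uses the auto data so that it can be run as is; the file names auto-data and auto-derived and the variable lprice are made up for this example.

```stata
* auto-data.do: hypothetical data-management do-file
capture log close
log using auto-data, replace
sysuse auto, clear

* derived variable: meaningful name, short label, extra note
generate lprice = log(price)
label variable lprice "Log of price"
notes lprice: Natural log of price, derived in auto-data.do

* label and annotate the resulting file, then save it under a new name
* so that the original data stay intact
label data "1978 automobile data with derived variables"
notes: Created on `c(current_date)' at `c(current_time)' by auto-data.do
save auto-derived, replace

log close
exit
```

Typing notes after loading auto-derived.dta would then show where the file and the derived variable came from.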
You can convert your interactive sessions into do-files:
  1. by trying some commands in the command prompt, and copying those that worked OK into a do-file;
  2. by editing your command log (cmdlog) and saving it as a do-file;
  3. by saving the contents of the Review window and proceeding as in 2. Both the Review window and
    . #review #
    only store the last 100 or so commands, so with long sessions, you may have to do it several times;
  4. you can also start creating your do-file directly and check how it runs either from Do-editor's Tools -> "Do", "Run", "Do selection" or "Do to bottom", or by typing
    . do yourdofilename
    in Stata's command prompt every now and then when you introduce substantial changes.
Make sure your do-file actually runs in Stata! You may want to restart Stata to be sure the do-file does not depend on the results currently in memory.

Other ways of running do-files

There is another Stata command to execute a do-file: run will launch it suppressing the output, except for errors. If the screen or graphics output takes a lot of time, this might be a good option to consider.

Also, you can launch Stata in the background mode and instruct it to run a specified do-file by typing something like
wstata /b do dofilename
in Windows, or
stata -b do dofilename
in Unix. (On some Unix servers that have special background job management tools, you might need to submit your job to a special queue to be executed. Contact your Unix administrator to find out the details.) In the background mode, Stata will redirect all output into the file dofilename.log.

Local and global macros

Knowing how to write and run a do-file is in fact a very small part of actually mastering Stata programming skills. Let us revisit some of the class3.do elements and see how those can be dealt with more efficiently.

There is not that much repetition of the same thing over and over again in that file, but here's the central piece I'd like to focus on for now (omitting estimates store ... and more commands):

reg pri mpg wei for
reg pri mpg wei for if ~mi(rep)
reg pri mpg wei for , robust
reg pri mpg wei for , cluster(rep)
We are entering the same command and the same list of variables, and only change options and selection criteria. How can we automate that? Historically, we were pressing the PgUp key on the keyboard (Ctrl+R in UNIX) and entering some modifications to the previous command (and later copying and pasting it into the do-file). What if you had to write it from scratch? The answer will probably again be that you copied and pasted the command you entered into the appropriate places.

Now let me get to a killer question: suppose you had to remove a variable from the regressor list, or add another one to the model. Here, you would have to go across four lines of your do-file code to modify it. If you had 150 lines (and I've seen do-files with an even greater number of regressions; don't ask me why somebody wanted to run 150 regressions at a time), you would make at least one error as you went along. Is there a neat solution?

Well, I wouldn't ask this rhetorical question in a tutorial if there were no neat answer. And the answer is, "Use local macros".

local reglist mpg wei for
reg pri `reglist'
reg pri `reglist' if ~mi(rep)
reg pri `reglist' , robust
reg pri `reglist' , cluster(rep)
The local macro is defined by the command local, and simply is a string of characters. If you typed the above sequence in Stata by copying the whole five lines and pasting it into Stata command prompt, you can try
. di "`reglist'"
to see what is in there. If you had it in your do-file (try it too! You can go to the Do-file editor, open a new document, paste those lines, and then do the current file by going to Tools / Do, pressing Ctrl+D, or hitting the "Do current" button) and then you tried it in Stata command prompt, you won't get the answer you expected.
Do you think Stata did not display anything? Well, you are almost right: Stata displayed an empty string. Which means, the local macro `reglist' turned out to be empty. Why is that?

An important (and in fact extremely powerful rather than annoying) property of the local macros, and the reason they are called "local", is that they only exist within the process where they were defined. A process may be your interactive Stata session, your do-file, or your programs (and Stata has a very specific definition of what a program is; you probably should not refer to your do-files as programs, although most people would refer to the art of writing ones as programming. We'll get to a real program by the end of this tutorial). If you have two local macros of the same name in two different processes, those will be different objects to Stata.

If you wanted to create a string that you would need to use in different processes (say more than one do-file, plus an interactive session), you can use the global command to create a global macro. The global macros are referred to with a dollar sign:

. global reglist mpg wei for
. di "$reglist"
. reg pri $reglist
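
The difference in scope can be seen in a small experiment. This is a sketch only: it uses a program (programs are covered at the end of this tutorial) as the second process, and the names mylocal, myglobal and showscope are made up for illustration.

```stata
local   mylocal  "set in the calling session"
global  myglobal "set in the calling session"

capture program drop showscope
program define showscope
    * the program is a separate process: the caller's local is not visible here
    display "local  inside the program: [`mylocal']"
    * the global is visible everywhere
    display "global inside the program: [$myglobal]"
end

showscope
* back in the process that defined it, the local is there again
display "local  outside the program: [`mylocal']"
```

Inside the program the brackets around the local come out empty, while the global shows its contents in both places.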

Going back to our example, specifying reg pri `reglist' etc. made it really simple to introduce modifications to the list of regressors. If we want to exclude mpg that was not significant in any of the regressions, all we need to do now is to exclude it from a single local definition:
local reglist wei for
or even
local reglist /* mpg */ wei for

Cycles

In fact, Stata uses local macros on many different occasions. One such occasion that is directly related to our topic of avoiding repetition is cycles.

Suppose I wanted to perform a certain action on a list of variables, say take logs. The brute force method of doing this will involve something like

g lprice = log(price)
lab var lprice "Log of price"
g lgear_ratio = log(gear)
lab var lgear "Log of gear_ratio"
g ltrunk = log(trunk)
lab var ltrunk "Log of trunk"
Just to see how tiring this is, type it all in your Do-file editor without Copy/Paste utilities. Now, count how many errors you made...

Stata cycles handle such jobs as follows.

foreach x of varlist price gear trunk {
  gen l`x' = log(`x')
  lab var l`x' "Log of `x' "
}
Stata's cycle command foreach created the local macro `x' and consecutively redefined it with the variable names price, gear_ratio and trunk inside the body of the cycle between the curly brackets. Indentation within the body of the cycle is not necessary, but it does add a lot of readability to your do-files and programs.

Here is an example of another cycle command:

forvalues p = 2/5 {
  gen mpg`p' = mpg^`p'
  lab var mpg`p' "mpg to power `p' "
}
forvalues used all the values of `p' between 2 and 5 in computing powers and making labels.
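
Cycles also get around the limitation noted earlier, that summarize only keeps the returned results for the last variable. The sketch below loops over a few variables of the auto data and picks up r(mean) one variable at a time; it also shows forvalues with a step. The choice of variables and the macro name k are arbitrary.

```stata
sysuse auto, clear
* summarize keeps returned results only for the last variable summarized,
* so loop over the variables and use r(mean) inside the cycle
foreach x of varlist price mpg weight {
    quietly summarize `x'
    display "`x': mean = " r(mean)
}
* forvalues also accepts a range with a step
forvalues k = 0(5)25 {
    display "k = `k'"
}
```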

A few more intricate examples are given at the UCLA Academic Technology Services Stata website.

Self Check: Write a do-file that produces a table of factorials. Hint: you would need to create a macro before the forvalues command that would contain the current product of integers, and update it inside your cycle. The output, as you remember, is performed by display command. You should be able to fit it into five lines.

Can you reproduce your results with variables and observations? Come up with a generate command and a replace command with some in qualifiers -- you would actually need only two commands to perform it.

Programs

Probably the most powerful way of putting Stata locals to action is by writing Stata programs. For Stata, a program is a sequence of commands that starts with program define and ends with end. The rules of learning programming dictate that the first program you ever write should print "Hello, world!" on the screen. Here we go:
program define hello
  di "Hello, world!"
end
exit
Type it into a new do-file in your Do-file editor, and let's call it hello-program.do. Let's try it by typing do hello-program.
Did Stata produce the "Hello, world!" message? Not quite. So what did it do? It stored the program in its memory; typing program dir lists the programs currently loaded.
(Your directory of programs stored by Stata in the current session may be way longer, but it should have the hello program somewhere.) Now we can get the message from Stata by typing
. hello
Cool! Let's incorporate this hello command into our do-file, so that it reads
program define hello
  di "Hello, world!"
end

hello

exit
Can we run it again?
Oops! Stata found that a program by that name is already loaded, so it issued an error message and stopped. We need to make sure no such program exists, and we can simply program drop it:
cap pro drop hello
program define hello
  di "Hello, world!"
end

hello

exit
Self Check: why is capture needed here? When would program drop without a capture issue an error message?
Now everything should work fine:

Let us now go back to Class3.do and imagine that we would need to run the four regressions (the basic one, with an if condition, with the robust option and with the cluster option) over different sets of variables. Should we copy / paste the code, or should we write a program? Here's an example that is almost as real in handling Stata's local macros as the "official" Stata programs:

cap pro drop checkreg
pro def checkreg
  syntax varlist(numeric min=1) [if] , cluster(passthru)
  marksample touse
  reg `varlist'
  reg `varlist' if `touse'
  reg `varlist' , robust
  reg `varlist' , `cluster'
end
The syntax command parses whatever follows the checkreg command. The specification here is that it needs:
  1. the list of numeric variables with at least one variable (surely, we would at least need the dependent variable to run a regression);
  2. optionally (as indicated by the square brackets), an if condition can be specified;
  3. the cluster option must also be given, and it is transferred as is (passthru) to whatever command is going to use it.
After you issue something like
. checkreg pri wei mpg for if ~mi(rep78), cluster(rep78)
the syntax command does the following:
  1. checks that a variable list follows the checkreg command -- yes, it does -- and places the list into the local macro varlist, so that we can use it later as in reg `varlist' . Think of it as
    . local varlist price weight mpg foreign
    -- the list as it was entered, with the abbreviated variable names expanded in full.
  2. checks if the if condition is specified -- yes, it is -- and places if ~mi(rep78), the word if included, into the local macro if, equivalent to
    . local if if ~mi(rep78)
    We don't use it directly, but the marksample command that follows syntax uses it to create a temporary variable `touse' that contains 1 when the if condition is satisfied (and none of the variables in varlist is missing), and 0 otherwise;
  3. checks if the required cluster option is specified -- yes, it is -- and places cluster(rep78) into the local macro `cluster' , equivalent to
    . local cluster cluster(rep78)
This is a very smart command, isn't it? It was added in Stata 7 release, and it simplified the Stata programmers' life tremendously.
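
One way to check what syntax puts into each local macro is to display the macros instead of running the regressions. The following is a minimal sketch with the same syntax line as in checkreg; the program name showsyntax is made up for illustration.

```stata
capture program drop showsyntax
program define showsyntax
    * same parsing rules as in checkreg
    syntax varlist(numeric min=1) [if] , cluster(passthru)
    * show the contents of the local macros that syntax created
    display "varlist: `varlist'"
    display "if:      `if'"
    display "cluster: `cluster'"
end

sysuse auto, clear
showsyntax pri wei mpg for if ~mi(rep78), cluster(rep78)
```

Calling showsyntax without the cluster option, or with a string variable such as make in the list, should make syntax stop with an error before any of the display lines run.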


Let us stop here and review what we've covered in this class:



Questions, comments? E-mail me!
Stas Kolenikov