[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

From |
n j cox <n.j.cox@durham.ac.uk> |

To |
statalist@hsphsun2.harvard.edu, pr@ctu.mrc.ac.uk |

Subject |
Random number routines [was: st: Simple Question - Use of rndbin] |

Date |
Fri, 21 Sep 2007 16:16:39 +0100 |

This is long. It starts off grumpy but ends with positive

solutions and suggestions. People interested in generating random

numbers will, with probability almost 1, find something

interesting or useful here.

Tiago Pereira wrote

===================

I am writing a do file that uses -rndbin- to generate a second

variable, xb. As you all know -rndbin- runs as:

rndbin obs prob numb

However, the do file repeats that task X times and, in some cases,

_n is greater than obs. Hence, for example, the following error

shows up:

obs must be between 32 and 582539

r(198);

Indeed, in one out of X runs of the loop, the local `observations'

assumes value of 19. The problem is that _n=32 in the dataset.

Thence,

rndbin 19 0.05 1

obs must be between 32 and 582539

r(198);

So, I would like to know if there is a way to solve this

problem... say... run -rndbin- in a way that the variable xb is

not created.

"Simple questions"

==================

I'd advise strongly against labelling something a "simple

question". It doesn't usually make posts more appealing.

* You can't know that your question really is simple until it's

been answered.

* For many Statalisters, the likely response is going to be

"Well, if it's simple, someone else will answer this" or "If it's

simple, answer it yourself". Such responses will be private

thoughts, followed by deletion of your post. You decrease interest

in your post, as many Statalisters filter on titles.

* In this case, the question isn't simple at all, or at least not

especially simple to explain.

Tiago's troubles with -rndbin-

==============================

-rndbin-, contrary to statement, is unlikely to be familiar to all

Statalisters.

The Statalist FAQ advises

"Say what command(s) you are using. If they are not part of

official Stata, say where they come from: the STB/SJ, SSC, or

other archives."

-rndbin- is a program written by Joe Hilbe, published in the -rnd-

package (as we now say) in STB-28 in 1995 and revised in STB-41 in

1998 and also accessible from SSC, as -findit rndbin- will show.

W. Linde-Zwirble is also credited as joint author of this package.

Joe is, at least intermittently, a member of Statalist, but my

Statascope or iStata indicates that he may be travelling, en route

to the Italian Stata Users Group meeting next week.

Be that as it may, Tiago's issue is getting -rndbin- to do what

is wanted in a .do file. I am not very clear on what Tiago is

trying to do, in the absence of code. Again, there is advice in

the Statalist FAQ, which is probably the most quoted part of that

FAQ:

"Say exactly what you typed and exactly what Stata typed (or did)

in response. N.B. exactly!"

The local -observations- must be Tiago's, as it does not appear in

-rndbin-. For mentions of _n, understand _N.

However, the code of -rndbin- is accessible to me, so let me work

from that towards the question. Here it is, so anyone interested

can see what we are talking about:

----------------------------------- rndbin.ado (from SSC)

*!version 1.2 1999 Joseph Hilbe

* version 1.1.1 1993 Joseph Hilbe rev:7-7-95 (sg44: STB-28)

* binomial distribution random number generator

* Example: rndbin 10000 .5 1 [set obs 10000; p = 0.5; n = 1]

program define rndbin

version 4.0

set type double

cap drop xb

qui {

local cases `1'

set obs `cases'

mac shift

local pp `1'

mac shift

local n `1'

if `pp' < 0.5 { local p = `pp' }

else { local p = 1.0 - `pp' }

local am = `n'*`p'

noi di in gr "( Generating " _c

if `n' < 25 {

tempvar bn1 ran1

gen `bn1' = 0

gen `ran1' = uniform()

local j=0

while `j' < `n' {

replace `bn1' = `bn1' + 1 if (`ran1' < `p')

local j = `j' + 1

replace `ran1' = uniform()

noi di in gr "." _c

}

}

else if `am' < 1.0 {

local g = exp(-`am')

tempvar t em ds sum1 ran1 bn1

gen `t' = 1.0

gen `em' = -1

gen `ran1'=uniform()

gen `ds' = 1

egen `sum1' = sum(`ds')

while `sum1' > 0 {

replace `em' = `em' + 1 if (`ds'==1)

replace `t' = `t' * `ran1' if (`ds'==1)

replace `ds' = 0 if (`g' > `t')

replace `ran1' = uniform()

drop `sum1'

egen `sum1' = sum(`ds')

}

gen `bn1' = `em'

replace `bn1' = `n' if (`em' > `n')

}

else {

tempvar ran1 ran2 ds ts sum1 e y em bn1

local en = `n'

local oldg = lngamma(`en'+1.0)

local pc=1.0-`p'

local plog = log(`p')

local pclog = log(`pc')

local sq = sqrt(2.0*`am'*`pc')

gen `em' = -1

gen `e' = -1

gen `ran1' = uniform()

gen `ran2' = uniform()

gen `ds' = 1

gen `ts' = 1

gen `y' = -1

egen `sum1' = sum(`ds')

while `sum1' > 0 {

replace `y' = sin(_pi*`ran1')/cos(_pi*`ran1')

replace `em' = `sq'*`y' + `am' if (`ds'==1)

replace `ts' =0 if (((0>`em') | (`em' >=(`en'+1.0))) & (`ds'==1))

#delimit ;

replace `e' = 1.2*`sq'*(1.0+(`y'*`y'))*exp(`oldg'-lngamma(`em'+1.0)

-lngamma(`en'-`em'+1.0) + (`em' *`plog')+(`en'-`em')*`pclog') if

(( `ds'==1) & (`ts'==1));

#delimit cr

replace `ds'=0 if ((`ran2'<`e') & (`ds'==1) & (`ts'==1))

replace `ran1' = uniform()

replace `ran2' = uniform()

replace `e'=-1

replace `ts' = 1

drop `sum1'

egen `sum1' = sum(`ds')

noi di in gr "." _c

}

gen `bn1' = int(`em'+.5)

}

replace `bn1' = `n'-`bn1' if `p' ~= `pp'

gen xb = `bn1'

noi di in gr " )"

noi di in bl "Variable " in ye "xb " in bl "created."

lab var xb "Binomial random variable"

set type float

}

end

----------------------------------

Some key facts:

1. -rndbin- produces a variable -xb- which is a binomial random

variable. That's wired in. If you have an existing variable -xb-,

it will get -drop-ped. I can't see that behaviour being explained

in the help. (Also, if you had say an existing variable -xbin- and

no other -xb*-, that would get -drop-ped, unless you had -set

varabbrev off-.) One of Tiago's questions is whether this can be

changed, and the answer is: No, not without rewriting -rndbin-

(which is not ours to rewrite). (You could clone and change, but

the end of this story is that you don't need to do that.)

2. -rndbin- tacitly assumes that you have -set type float- and

temporarily issues -set type double-. There are various small

issues here:

* If you had earlier -set type double-, or anything else other

than -float-, -rndbin- would override that earlier setting on

exit. In practice, this is unlikely (but certainly not

impossible).

* If -rndbin- exits prematurely, then (depending on when it

exits) you might well be working under -set type double- from then

on in your session. It is very likely to exit prematurely if you

misunderstand its expectations.

(Some readers may want more explanation here. -set type- sets the

default type for new variables. By default this is -float-. Most

users probably never touch this, nor should they. The best

practice is usually to indicate, explicitly in individual

commands, that you want a datatype other than -float-, as in -gen

double foo = <whatever>-.)

3. -rndbin- early on issues

set obs `cases'

where `cases' is the first argument you fed it. Consider some

possibilities:

* You start with no data in memory. No problem.

* You specify `cases' to be the existing number of

observations. -set obs- changes nothing. No problem.

* You specify `cases' larger than the number of

observations. -set obs- increases it. No problem. (The first

possibility is a special case of this one.)

* You specify `cases' smaller than the number of

observations. -set obs- refuses, and crashes -rndbin-. This is

what is biting Tiago.

An immediate solution to this problem is: Don't do that then! If

you want _fewer_ binomial random numbers than your number of

observations, just -generate- as many as the number of

observations, then ignore what you don't want after running

-rndbin-, using -if- or -in- to select.

However, I've started, so I'll finish. We are all on learning

curves. The Stata code I was writing in 1995 might look pretty

lousy to me now if I looked at it again. As a matter of style or

taste I suggest -- in 2007, clearly --

1'. A program should let you specify your own variable name and

never -drop- an existing variable, unless that is part of its

purpose and that is explicitly documented.

2',3'. A program should not mess with settings except temporarily,

unless (again) that is part of its purpose and that is explicitly

documented.

Alternatives to -rndbin-

========================

In any case, a program is barely necessary if there is a direct

alternative. I will now show that there are currently at least two

direct ways of doing it.

I started reading -rndbin- until it got complicated. I guess the

main complications in the algorithm are not to do things the

direct way when in principle there is a faster way. As the

complication involves a lot more code, which also includes much

looping, so there is extra interpretative overhead, it is not

obvious to me that you gain overall.

Incidentally, the 3rd edition of "Numerical recipes" from William

H. Press and friends, Cambridge University Press, 2007 goes over

the top in showing how binomial randoms can be done ultra-fast

with some convoluted code (in C++).

There is a large interpretative overhead in -rndbin- from repeated

calls to -egen, sum()- to get sums. That is, Stata works through

-egen.ado- and then -_gsum.ado- every time -egen, sum()- is

invoked. (In fact, in Stata 9 up, a call to -egen, sum()- also

fires up -_gtotal.ado-.) Nowadays, we could just use

-summarize, meanonly- and save on dozens if not hundreds of lines

of code.

Let's start again from first principles. A canonical binomial

problem I take to be something like this. I have 1000 families

with 6 people. The probability of being left-handed I take to be

0.1. What is the distribution of the number of left-handed people

in each family? (I don't know enough genetics to know if this null

model is as gauche as it may sound, but there is no sinister

motive here.)

How should you do this the modern Stata way? We can use -forval-,

not available to Joe in the mid-1990s:

set obs 1000

gen xb = 0

qui forval i = 1/6 {

replace xb = xb + (uniform() < 0.1)

}

Even without -forval-, this was possible in early Stata:

set obs 1000

gen xb = 0

local i = 0

qui while `i' < 6 {

local i = `i' + 1

replace xb = xb + (uniform() < 0.1)

}

As from Stata 9, we can also use Mata as our favourite calculator:

set obs 1000

gen xb = 0

mata: st_store(., "xb", ., rowsum(uniform(1000, 6) :<= 0.1))

In the code, ":<=" means either "Bill Gould is grinning at you,

because Mata is smart" or "elementwise <=" or both.

Naturally, in most cases something like -set obs 1000- will not

be necessary, as the number of observations is already what we want.

There is no need to speculate about speed. We can experiment:

-------------------------------------

clear

set obs 1000

set rmsg on

forval j = 1/100 {

gen xb`j' = 0

qui forval i = 1/6 {

replace xb`j' = xb`j' + (uniform() < 0.1)

}

}

drop xb*

set obs 1000

forval j = 1/100 {

gen xb`j' = 0

mata: st_store(., "xb`j'", ., rowsum(uniform(1000, 6) :<= 0.1))

}

qui forval j = 1/100 {

rndbin 1000 0.1 6

}

--------------------------------------

The timings on my somewhat mesolithic machine are

suppressed to save my embarrassment, but the order I get is

fastest is the Mata route (surprise)

second is a direct solution with -forval-

third is -rndbin-.

In short, direct alternatives to -rndbin- are possible, explicit

and fast. (As indicated, -rndbin- could be speeded up, but

that doesn't affect the main point.)

However, don't pester me for simple solutions in the same spirit

for your favourite distribution. Manifestly, the binomial is

especially easy.

The future lies ahead*

======================

(* any big boss, and some little ones)

But there's a broader issue that will interest many.

Isn't it high time for a decent set of routines from StataCorp for

random numbers from the more basic distributions? This grumble

keeps re-surfacing, for example at the 2007 Boston users' meeting.

Sometimes it seems that StataCorp find the request too boring to

be really interesting, compared with some exotic modish model that

1% of the community really, really want. (Or say they do: next

year, something else will be hot, or is it cool.)

For a long time, StataCorp have been relying, at least implicitly,

on three lines of defence when asked similar questions:

1. Given the existence of -uniform()-, random numbers for many

distributions can be got in just one line, as

-invnorm(uniform())-, -exp(invnorm(uniform()))-, etc., etc.

2. It's rather difficult to decide on a standard list. Weibull or

Pareto? Does anybody but you and two friends really work with the

Singh-Maddala distribution? And the issue is complicated in

many instances by tribal preferences for two or even more different

parameterisations.

3. See Joe Hilbe's commands.

To which the answers are, Sure, sure, sure, but StataCorp could now

look at it this way:

* Start with issuing Mata functions. That cuts out any need for the

programming of inbuilt Stata functions, strict sense, which

can only be done by StataCorp. (For example, -log()- is an inbuilt strict sense function; -egen- functions are not; commands are commands, and not functions.)

* Publicise for those not yet into Mata how easy it is to feed a

Mata vector into a variable. Or, better, from most points of view,

this could all be gracefully hidden by something like

rgen mygamma = rgamma(<alpha>, <beta>)

which could be a wrapper for

* example with 1000 obs

gen mygamma = .

mata : st_store(., "mygamma", ., rgamma(1000, <alpha>, <beta>))

and all the extensions that needs (support for -if-, -in-, default

parameter values, options, treatment of missing values). All you then need are the Mata functions like -rgamma()-. (Homage to S and R, for

which this has been standard for (say) 20 years.)

* Once a standard syntax is public with a decent set of examples,

user-programmers would happily add other distributions. Different

parameterisations would not be an issue. (No-one need tussle over

the better -beta- parameterisation, for example. It could be handled via

options or alternative -rgen- functions.)

In short, -rgen- could flourish in the way that -egen- has

flourished. (There are more user-written -egen- functions than

official, and more official -egen- functions that were once

user-written than were originally written by StataCorp.

StataCorp's key -- indeed essential -- contribution was to provide

the -egen- scaffolding and some examples to emulate.)

Various conversations with Kit Baum helped crystallise some of the

ideas in this last section.

Nick

n.j.cox@durham.ac.uk

*

* For searches and help try:

* http://www.stata.com/support/faqs/res/findit.html

* http://www.stata.com/support/statalist/faq

* http://www.ats.ucla.edu/stat/stata/

**Follow-Ups**:**Re: Random number routines [was: st: Simple Question - Use of rndbin]***From:*"Clive Nicholas" <clivelists@googlemail.com>

**Re: Random number routines [was: st: Simple Question - Use of rndbin]***From:*"Michael Blasnik" <michael.blasnik@verizon.net>

- Prev by Date:
**Re: st: Models ARIMA in STATA and Eviews** - Next by Date:
**Re: Re: st: levelsof** - Previous by thread:
**st: Mata run-time error when using gologit2** - Next by thread:
**Re: Random number routines [was: st: Simple Question - Use of rndbin]** - Index(es):

© Copyright 1996–2021 StataCorp LLC | Terms of use | Privacy | Contact us | What's new | Site index |