Suppose you wish to select a random sample from a large SAS dataset. No problem. The PROC SURVEYSELECT step below randomly selects a 2 percent sample:
proc surveyselect data=large out=sample
method=srs /* simple random sample */
n=1000000; /* sample size */
run;
|
Do you have a SAS/STAT license? If not, or if you seek maximum efficiency, keep reading.
I benchmarked the above program against a SAS dataset with 50 million rows and 3 columns, running SAS 9.4 on my 64-bit Windows 7 machine.
Time: 2.8 seconds
Is the DATA step more efficient? It might be, if you know some tricks. Let's look at a few.
The DATA step below selects a random sample of approximately 1 million rows.
899 data sample;
900 set large;
901 if rand('uniform') < .02;
902 run;
NOTE: There were 50000000 observations read from the data set WORK.LARGE.
NOTE: The data set WORK.SAMPLE has 1001125 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time 3.31 seconds
cpu time 3.29 seconds
|
The RAND function generates random numbers between 0 and 1. The UNIFORM argument means that every value between 0 and 1 has an equal chance of being generated. The subsetting IF statement selects a random sample of approximately 2 percent.
Time: 3.3 seconds (Winner: PROC SURVEYSELECT, but keep reading)
The SET statement above reads all 50 million rows by default, which are then filtered by the subsetting IF statement. We can, however, change the SET statement to read only 1 million rows, bypassing the other 49 million, for a huge performance improvement, like this:
data sample;
set large (obs=1000000);
run;
|
Time: .1 second
This, of course, is cheating, because the DATA step above reads the first million rows, which is not a random sample.
The DATA step below, like the one above, also reads only 1 million rows, bypassing the other 49 million, for very fast performance. This time, instead of reading the first million rows, the SET statement reads every 50th row.
data sample;
do RowNum=50 to 50000000 by 50;
set large point=RowNum;
output;
end;
stop;
run;
|
The POINT= option designates a numeric variable whose value represents a row number. The SET statement jumps directly to each designated row, and reads that row, bypassing intervening rows.
The selected sample may be more representative than the first million rows, but it's still not random.
Time: .9 second
The DATA step below, like the two above, also reads only 1 million rows, bypassing the other 49 million. This time, however, the datastep selects a random sample, comparable to PROC SURVEYSELECT.
data sample;
drop Need Pool Ratio Row;
Need= 1000000; * Desired sample size;
Pool=50000000; * Number of rows in dataset;
do until (Need=0); * Continue until no additional sample rows needed;
Row+1; * Row number. Every row in the pool is a candidate
for possible inclusion in the sample;
Ratio=Need/Pool; * Initial ratio is .02 or 2 percent.
Ratio is continuoulsy adjusted as necessary;
if rand('uniform') < ratio then do; * Expression will be true for
exactly 2 percent of rows;
set large point=Row; * Jump to desired row. Read desired row;
output; * Output desired row;
Need=Need-1; * Because a row was selected, subtract 1
from remaining rows needed.;
end;
Pool=Pool-1; * Subtract 1 from pool, regardless of
whether a row was selected or not;
end;
stop;
run;
|
Method Time Advantage
PROC SURVEYSELECT 2.8 seconds ease of use
DATA step 2.2 seconds speed