Suppose you wish to select a random sample from a large SAS dataset. No problem. The PROC SURVEYSELECT step below randomly selects a 2 percent sample:
Do you have a SAS/STAT license? If not, or if you seek maximum efficiency, keep reading.
I benchmarked the above program against a SAS dataset with 50 million rows and 3 columns, running SAS 9.4 on my 64-bit Windows 7 machine.
Time: 2.8 seconds
Is the DATA step more efficient? It might be, if you know some tricks. Let's look at a few.
The DATA step below selects a random sample of approximately 1 million rows.
The RAND function generates random numbers between 0 and 1. The UNIFORM argument means that every value between 0 and 1 has an equal chance of being generated. The subsetting IF statement selects a random sample of approximately 2 percent.
Time: 3.3 seconds (Winner: PROC SURVEYSELECT, but keep reading)
The SET statement above reads all 50 million rows by default, which are then filtered by the subsetting IF statement. We can, however, change the SET statement to read only 1 million rows, bypassing the other 49 million, for a huge performance improvement, like this:
Time: .1 second
This, of course, is cheating, because the DATA step above reads the first million rows, which is not a random sample.
The DATA step below, like the one above, also reads only 1 million rows, bypassing the other 49 million, for very fast performance. This time, instead of reading the first million rows, the SET statement reads every 50th row.
The POINT= option designates a numeric variable whose value represents a row number. The SET statement jumps directly to each designated row, and reads that row, bypassing intervening rows.
The selected sample may be more representative than the first million rows, but it's still not random.
Time: .9 second
The DATA step below, like the two above, also reads only 1 million rows, bypassing the other 49 million. This time, however, the datastep selects a random sample, comparable to PROC SURVEYSELECT.
Method Time Advantage
PROC SURVEYSELECT 2.8 seconds ease of use
DATA step 2.2 seconds speed
source from http://blogs.sas.com/content/sastraining/