Interview Questions   Tutorials   Discussions   Programs   Videos   

SAS - What is the difference between nodup and nodupkey options?




805
views
asked mar September 20, 2014 06:30 AM  

What is the difference between nodup and nodupkey options?


           

1 Answers



 
answered By vishnoiprem   0  

proc sort nodup gets rid of duplicate records with the same sort key but proc sort nodupkey gets rid of other records with the same sort key".

"nodup" is an alias for "noduprecs" which appears to mean "no duplicate records" but there is no way sas can know about these duplicate records unless they, by chance, land next to each other in sequence. That is a matter of chance. Take a look at "nodup" at work. Note the record with the "extra" value of 3. It's still there after the "nodup" sort.

eg.

  data test1; 
  input id1 $ id2 $ extra ; 
cards; 
aa ab 3 
aa ab 1 
aa ab 2 
aa ab 3 
; 
proc sort nodup data=test1; 
by id1 id2; 
run; 
options nocenter; 
proc print data=test1; 
run;


Obs    id1    id2    extra 
 1     aa     ab       3 
 2     aa     ab       1 
 3     aa     ab       2 
 4     aa     ab       3

Now look again where the two records with an "extra" value of 3 are next to each other in the input dataset. This time it has been removed by "nodup".

data test2; 
  input id1 $ id2 $ extra ; 
cards; 
aa ab 3 
aa ab 3 
aa ab 2 
aa ab 1 
; 
proc sort nodup data=test2; 
by id1 id2; 
run; 
options nocenter; 
proc print data=test2; 
run;


Obs    id1    id2    extra 
 1     aa     ab       3 
 2     aa     ab       2 
 3     aa     ab       1

If you sort "nodupkey" then you will only be left with one record with that key combination in the above case as you can see below.

data test3; 
  input id1 $ id2 $ extra ; 
cards; 
aa ab 3 
aa ab 3 
aa ab 2 
aa ab 1 
; 
proc sort nodupkey data=test3; 
by id1 id2; 
run; 
options nocenter; 
proc print data=test3; 
run;


Obs    id1    id2    extra 
 1     aa     ab       3

It is a big mistake to think sorting "nodup" will remove duplicate records. Sometime it will, sometime it won't. The only way you can be sure of removing duplicate records is to "proc sort nodupkey" and include enough key variables to be sure you will lose the duplicates you want to lose. In the case shown above, then if we knew of the same "extra" values being duplicates we wanted to remove then this variable should be included in the list of sort variables and then "nodupkey" will remove the duplicates as shown below.

data test4; 
   input id1 $ id2 $ extra ; 
 cards; 
 aa ab 3 
 aa ab 1 
 aa ab 2 
 aa ab 3 
 ; 
 proc sort nodupkey data=test4; 
 by id1 id2 extra; 
 run; 
 options nocenter; 
 proc print data=test4; 
 run;


Obs    id1    id2    extra 
 1     aa     ab       1 
 2     aa     ab       2 
 3     aa     ab       3 
flag   
   add comment

Your answer

Join with account you already have

FF

Preview


Ready to start your tutorial with us? That's great! Send us an email and we will get back to you as soon as possible!

Alert