answered By vishnoiprem 0

Before using a particular step to remove the duplicate observations, we should understand that the duplicate records present are pertaining to the key variables like usubjid, treatment, patientno. etc which are unique or exact duplicates (duplicates with respect to all the variables in the dataset).

If the observations are exact duplicates with respect to all the variables in the dataset, we can remove the exact duplicates by:

Using the noduprecs option in the PROC SORT with a by *all* statement:

```
proc sort data=dsn noduprecs;
by _all_;
run;
```

NODUPRECS compares all the variables in the data set and delete exact duplicates.

```
PROC SQL approach:
Proc SQL noprint;
create table unique as select distinct (*) from dsn;
quit;
```

Adding Asterisk means that we are telling SAS to identify distinct/unique observations with respect to all variables in the proposed dataset.

```
If the observations arenʼt the exact duplicates but they are duplicates with respect to some of the key variables in the dataset (ex: usubjid, studyid, patientid, visit etc) then we can remove the duplicates by using a:
PROC SQL approach:
proc SQL noprint;
create table unique as select distinct (usubjid) from dsn;
quit;
```

by considering usubjid as the unique variable, we are asking SAS to give us the one observation for each unique usubjid.

The same can be done by another approach i.e use proc sort:

```
Proc sort data=dsn nodupkey;
by usubjid;
run;
```

NODUPKEY compares only the variables in the data set and delete the duplicate observations pertaining to key variables.

```
PROC FREQ approach:
Proc freq data=dsn noprint;
tables usubjid/out=unique (keep=usubjid count where=(count=1));
run;
```

Noprint option is required because we donʼt want the procedure to print all the unique observations. We just want a dataset with all the unique observations.

```
Using Datastep approach:
This code keeps only unique observations.
proc sort data=dsn out=temp;
by usubjid;
run;
data unique;
set temp;
by usubjid;
if not first.usubjid and last.usubjid;
run;
```

If not first.usubjid and last.usubjid , SAS will check the number of observations for each usubjid (key variable) and if any usubjid has any duplicates then SAS will not include them in the output dataset(unique);

```
data nodups;
set temp;
by usubjid;
if first.usubjid;
run;
```

flag

answered By vishnoiprem 0

Before using a particular step to remove the duplicate observations, we should understand that the duplicate records present are pertaining to the key variables like usubjid, treatment, patientno. etc which are unique or exact duplicates (duplicates with respect to all the variables in the dataset).

If the observations are exact duplicates with respect to all the variables in the dataset, we can remove the exact duplicates by:

Using the noduprecs option in the PROC SORT with a by *all* statement:

```
proc sort data=dsn noduprecs;
by _all_;
run;
```

NODUPRECS compares all the variables in the data set and delete exact duplicates.

```
PROC SQL approach:
Proc SQL noprint;
create table unique as select distinct (*) from dsn;
quit;
```

Adding Asterisk means that we are telling SAS to identify distinct/unique observations with respect to all variables in the proposed dataset.

```
If the observations arenʼt the exact duplicates but they are duplicates with respect to some of the key variables in the dataset (ex: usubjid, studyid, patientid, visit etc) then we can remove the duplicates by using a:
PROC SQL approach:
proc SQL noprint;
create table unique as select distinct (usubjid) from dsn;
quit;
```

by considering usubjid as the unique variable, we are asking SAS to give us the one observation for each unique usubjid.

The same can be done by another approach i.e use proc sort:

```
Proc sort data=dsn nodupkey;
by usubjid;
run;
```

NODUPKEY compares only the variables in the data set and delete the duplicate observations pertaining to key variables.

```
PROC FREQ approach:
Proc freq data=dsn noprint;
tables usubjid/out=unique (keep=usubjid count where=(count=1));
run;
```

Noprint option is required because we donʼt want the procedure to print all the unique observations. We just want a dataset with all the unique observations.

```
Using Datastep approach:
This code keeps only unique observations.
proc sort data=dsn out=temp;
by usubjid;
run;
data unique;
set temp;
by usubjid;
if not first.usubjid and last.usubjid;
run;
```

If not first.usubjid and last.usubjid , SAS will check the number of observations for each usubjid (key variable) and if any usubjid has any duplicates then SAS will not include them in the output dataset(unique);

```
data nodups;
set temp;
by usubjid;
if first.usubjid;
run;
```

flag

Write A Tutorials

Question tags

Similar questions

Follow this question

Follow sas category

Question tags

sas
× 1

Asked
**2 years and 3 months ago ago**

Number of Views -669

Number of Answers -2

Last updated

**3 years and 2 months ago ago**

Number of Views -669

Number of Answers -2

Last updated

Similar questions