Matching two Columns in R


I have a big dataset df (354903 rows) with two columns named df$ColumnName and df$ColumnName.1

head(df)
       CompleteName       CompleteName.1
1   Lefebvre Arnaud Lefebvre Schuhl Anne
1.1 Lefebvre Arnaud              Abe Lyu
1.2 Lefebvre Arnaud              Abe Lyu
1.3 Lefebvre Arnaud       Louvet Nicolas
1.4 Lefebvre Arnaud   Muller Jean Michel
1.5 Lefebvre Arnaud  De Dinechin Florent

I am trying to create labels to see weather the name is the same or not. When I try a small subset it works [1 if they are the same, 0 if not]:

> match(df$CompleteName[1], df$CompleteName.1[1], nomatch = 0)
[1] 0
> match(df$CompleteName[1:10], df$CompleteName.1[1:10], nomatch = 0)
[1] 0 0 0 0 0 0 0 0 0 0

But as soon as I throw the complete columns, it gives me complete different values, which seem nonsense to me:

> match(df$CompleteName, df$CompleteName.1, nomatch = 0)
[1] 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101
[23] 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101
[45] 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101

Should I use sapply? I did not figured it out, I tried this with an error:

 sapply(df, function(x) match(x$CompleteName, x$CompleteName.1, nomatch = 0))

Please help!!!


Answers:


From the man page of match,

‘match’ returns a vector of the positions of (first) matches of its first argument in its second.

So your data seem to indicate that the first match of "Lefebvre Arnaud" (the first position in the first argument) is in the row 101. I believe what you intended to do is a simple comparison, so that's just the equality operator ==.

Some sample data:

> a <- rep ("Lefebvre Arnaud", 6)
> b <- c("Abe Lyu", "Abe Lyu", "Lefebvre Arnaud", rep("De Dinechin Florent", 3))
> x <- data.frame(a,b, stringsAsFactors=F)
> x
            a                   b
1 Lefebvre Arnaud             Abe Lyu
2 Lefebvre Arnaud             Abe Lyu
3 Lefebvre Arnaud     Lefebvre Arnaud
4 Lefebvre Arnaud De Dinechin Florent
5 Lefebvre Arnaud De Dinechin Florent
6 Lefebvre Arnaud De Dinechin Florent
> x$a == x$b
[1] FALSE FALSE  TRUE FALSE FALSE FALSE

EDIT: Also, you need to make sure that you are comparing apples to apples, so double check the data type of your columns. Use str(df) to see whether the columns are strings or factors. You can either construct the matrix with "stringsAsFactors = FALSE", or convert from factor to character. There are several ways to do that, check here: Convert data.frame columns from factors to characters