I have a need that I imagine could be satisfied by aggregate or reshape, but I can't quite figure out.

I have a list of names (brand), and accompanying ID number (id). This data is in long form, so names can have multiple ID's. I'd like to de-dupicate by the name (brand) and concatenate the multiple possible id's into a string separated by a comment.

For example:

brand            id 
RadioShack       2308
Rag & Bone       4466
Ragu             1830
Ragu             4518
Ralph Lauren     1638
Ralph Lauren     2719
Ralph Lauren     2720
Ralph Lauren     2721
Ralph Lauren     2722 

should become:

RadioShack       2308
Rag & Bone       4466
Ragu             1830,4518
Ralph Lauren     1638,2719,2720,2721,2722

How would I accomplish this?

Solution 1

Let's call your data.frame DF

> aggregate(id ~ brand, data = DF, c)
         brand                           id
1   RadioShack                         2308
2   Rag & Bone                         4466
3         Ragu                   1830, 4518
4 Ralph Lauren 1638, 2719, 2720, 2721, 2722

Another alternative using aggregate is:

result <- aggregate(id ~ brand, data = DF, paste, collapse = ",")

This produces the same result and now id is not a list anymore. Thanks to @Frank comment. To see the class of each column try:

> sapply(result, class)
      brand          id 
   "factor" "character"

As mentioned by @DavidArenburg in the comments, another alternative is using the toString function:

aggregate(id ~ brand, data = DF, toString)

Solution 2

A nice clean one line in data.table

library(data.table)
setDT(DF)

TWO OPTIONS:

results as a list

DF[ , .(id = list(id)), by = brand]
          brand                       id
1:   RadioShack                     2308
2:   Rag & Bone                     4466
3:         Ragu                1830,4518
4: Ralph Lauren 1638,2719,2720,2721,2722
> 

results as a string

DF[ , .(id = paste(id, collapse=",")), by = brand]
          brand                       id
1:   RadioShack                     2308
2:   Rag & Bone                     4466
3:         Ragu                1830,4518
4: Ralph Lauren 1638,2719,2720,2721,2722

Note

Even though the two results appear the same (that is when you print them, they look identical), they are in fact very different and allow for different functionality.

Namely, using the list option (the first one) allows you to then perform functions on the orignal ids.

The latter will allow you to display the information more easily (including exporting to CSV or excel), but to operate on the id's will require splicing them back.

Solution 3

Or using dplyr:

library(dplyr)
DF %>%
  group_by(brand) %>%
  summarise(id = paste(id, collapse = ","))

Where DF is the name of your data.frame.

Solution 4

Here's the information in base R:

myby <- by(df$id,df$brand,function(x)paste(x,collapse=","))

The formatting of "by" objects is weird. You can take data.frame(id=c(myby)) and the brands will become rownames:

#                                    id
# RadioShack                       2308
# Rag & Bone                       4466
# Ragu                        1830,4518
# Ralph Lauren 1638,2719,2720,2721,2722

Alternately, if you load the data.table package, this will work:

dt <- data.table(df)
dt[,paste(id,collapse=","),by=brand]
#           brand                       V1
# 1:   RadioShack                     2308
# 2:   Rag & Bone                     4466
# 3:         Ragu                1830,4518
# 4: Ralph Lauren 1638,2719,2720,2721,2722