R: Combining lapply and left_join to conditionally merge dataframes

I'm hoping someone out there might be able to help me get to the root of a frustrating problem I've been having with my code in R. I have a list consisting of dataframes, and I want to left join each element on one of two OTHER dataframes (call them A and B). Which of these secondary dataframes to join on depends on the element's position in the list. For my purposes, I'd like every odd element to be left-joined to A, and every even element to be left-joined to B.

library(dplyr)
DF <- data.frame(Num = c("1","2"), Let = c("a","b"), stringsAsFactors = FALSE)
A <- data.frame(Let = c("a","b"), Col = c("Yellow","Red"), stringsAsFactors = FALSE)
B <- data.frame(Let = c("a","b"), Col = c("Green","Blue"), stringsAsFactors = FALSE)
LIST <- list(DF, DF)

So far, I've tried to do this in two different ways. The first approach involved an if-else statement. If I apply such a statement to assign an integer value based on position, I obtain the expected result. Similarly, when I do away with the if-else statement and simply perform a series of left-joins on the list elements, everything works as expected.

lapply(seq_along(LIST), function(x, {ifelse((x %% 2)==0, y[[x]] <- 1, y[[x]] <- 2)}, y = LIST)
lapply(seq_along(LIST), function(x, {left_join(y[[x]], A, by = c("Let"))}, y = LIST)

Where I run into problems is when I attempt to combine the if-else statement and the left join. In particular, I end up with a list consisting of lists, each of which retains only the first column of the original corresponding dataframe.

lapply(seq_along(LIST), function(x, y) {ifelse((x %% 2)==0, left_join(y[[x]], A, by = c("Let")), left_join(y[[x]], B, by = c("Let")))}, y = LIST)

Here is the output I would LIKE to obtain:

[[1]]
  Let Num    Col
1   a   1 Yellow
2   b   2    Red

[[2]]
  Let Num   Col
1   a   1 Green
2   b   2  Blue

I'm certain there's an absurdly simple solution to the problem. Can anyone see it?

Thanks in advance! Matthew

P.S.: I also attempted a second approach, applying subsetting rather than an if-else statement. Again, however, I run into problems. The first line below works as expected, but the second returns an error, as though R doesn't recognize the list indices:

lapply(seq_along(LIST), function(x, y) {left_join(y[[x > 0]], A, by = c("Let"))}, y = LIST)
lapply(seq_along(LIST), function(x, y) {left_join(y[[x == 1]], A, by = c("Let"))}, y = LIST)

Error in y[[x == 1]] : attempt to select less than one element in integerOneIndex 

4 answers

  • answered 2018-04-14 15:41 Maurits Evers

    I'm not entirely sure I understand your issue.

    The following solution is based on a reproduction of the output of lapply(seq_along(LIST), function(x, y) {left_join(y[[x > 0]], A, by = c("Let"))}, y = LIST) from your postscript. Note that the other lapply lines throw errors.

    library(tidyverse);
    map(list(A, B), function(x) left_join(DF, x))
    #Joining, by = "Let"
    #Joining, by = "Let"
    #[[1]]
    #  Num Let    Col
    #1   1   a Yellow
    #2   2   b    Red
    #
    #[[2]]
    #  Num Let   Col
    #1   1   a Green
    #2   2   b  Blue
    

    We use purrr:map with dplyr::left_join to join A and B with DF.


    The same can be achieved in base R using Map and merge:

    mapply(function(x) merge(DF, x, by = "Let"), list(A, B), SIMPLIFY = F)
    #[[1]]
    #  Let Num    Col
    #1   a   1 Yellow
    #2   b   2    Red
    #
    #[[2]]
    #  Let Num   Col
    #1   a   1 Green
    #2   b   2  Blue
    

  • answered 2018-04-14 15:41 aspiringurbandatascientist

    Overview

    Use base::mapply() to return a list of data frames which were conditionally merged. Here, I supply two inputs:

    1. seq.along( along.with = LIST ) to obtain the number of elements in LIST; and
    2. LIST itself.

    The FUN argument is an anonymous function that takes in two inputs - i and j - and tests whether the current element in LIST is even or odd numbered prior to performing a left-join using base::merge().

    If the result of the modulus operator for the ith element in seq.along( along.with = LIST ) is equal to zero, then left-join B onto the jth element in LIST; if it does not equal zero, then perform the left-join A onto the jth element in LIST.

    # load data
    DF <- data.frame(Num = c("1","2"), Let = c("a","b"), stringsAsFactors = FALSE)
    A <- data.frame(Let = c("a","b"), Col = c("Yellow","Red"), stringsAsFactors = FALSE)
    B <- data.frame(Let = c("a","b"), Col = c("Green","Blue"), stringsAsFactors = FALSE)
    LIST <- list(DF, DF)
    
    # goal: left join all odd elements in LIST[[j]]
    #       to `A` and all even elements to `B`
    merged.list <- 
      mapply( FUN = function( i, j )
              if( i %% 2 == 0 ){
                merge( x = j
                       , y = B
                       , by = "Let"
                       , all.x = TRUE )
              } else{
                merge( x = j
                       , y = A
                       , by = "Let"
                       , all.x = TRUE )
              }
            , seq_along( along.with = LIST )
            , LIST
            , SIMPLIFY = FALSE )
    
    # view results
    merged.list
    # [[1]]
    # Let Num    Col
    # 1   a   1 Yellow
    # 2   b   2    Red
    # 
    # [[2]]
    # Let Num   Col
    # 1   a   1 Green
    # 2   b   2  Blue
    
    # end of script #
    

    Tidyverse approach

    The results are replicated below using functions from the purrr and dplyr packages.

    library( dplyr )
    library( purrr )
    
    merged.list <-
      map2( .x = seq_along( along.with = LIST )
            , .y = LIST
            , .f = function( i, j )
              if( i %% 2 == 0 ){
                left_join( x = j
                           , y = B
                           , by = "Let" )
              } else{
                left_join( x = j
                           , y = A
                           , by = "Let" )
              })
    
    # view results
    merged.list
    # [[1]]
    # Num Let    Col
    # 1   1   a Yellow
    # 2   2   b    Red
    # 
    # [[2]]
    # Num Let   Col
    # 1   1   a Green
    # 2   2   b  Blue
    
    # end of script #
    

  • answered 2018-04-14 15:41 42-

    It's possible that MauritsEvers has already answered your question but I thought I would address the obvious errors in R syntax and programming logic. Focus on the first lapply-call:

    lapply(seq_along(LIST), function(x, {ifelse((x %% 2)==0, y[[x]] <- 1, y[[x]] <- 2)}, y = LIST)
    

    First and perhaps trivial is the missing closing ) argument list in the first lapply-action. Next and more fundamental is the incorrect use of ifelse as a programming construct. The ifelse function is not designed for serial tests of data-objects. It is only designed to be applied along a single vector. The if(.){.}else{.}-function probably should have been used in that lapply call if serial choices were to be made.

    However, (and now trying to implement the first paragraph rather than continue correecting code) I think it would be much simpler to use logical indexing (with R's implicit recycling process) on the LIST-object rather than any looping process. (This is not a tidyverse solution.) This code would segment the LIST into "odd" and "even" components:

      oddList <- LIST[ c(TRUE,FALSE) ]  # implicit seq-along by virtue of recycling
      evenList <- LIST[ c(FALSE,TRUE) ]
    

    We could use that type of results to make two one-liners that would accomplish your stated goals. I made the LIST-object four wide rather than two wide.

    Abig <- Reduce( function(x,y) {merge(x,y,by="Let")}, LIST, init=A)
    Warning message:
    In merge.data.frame(x, y, by = "Let") :
      column names ‘Num.x’, ‘Num.y’ are duplicated in the result
    Bbig <- Reduce( function(x,y) {merge(x,y,by="Let")}, LIST, init=B)
    Warning message:
    In merge.data.frame(x, y, by = "Let") :
      column names ‘Num.x’, ‘Num.y’ are duplicated in the result
    

    That is only a warning and here you can see what it was warning about:

    > Abig
      Let    Col Num.x Num.y Num.x Num.y
    1   a Yellow     1     1     1     1
    2   b    Red     2     2     2     2
    

    If you need those duplicated column names labeled uniquely (and I thin that would be a good idea), then:

    names(Abig)[ grep("Num", names(Abig)) ] <- 
                        paste0("Num.", seq_along( grep("Num", names(Abig)) ) )
    Abig
      Let    Col Num.1 Num.2 Num.3 Num.4
    1   a Yellow     1     1     1     1
    2   b    Red     2     2     2     2
    

  • answered 2018-04-14 15:41 MKR

    This solution is quite similar to mapply based solution already posted here by (@MauritsEvers & @aspiringurbandatascientist) but it usages a different approach to join data.frames. The dplyr::left_join has been used to fit the purpose.

    library(dplyr)
    # Using mapply and left_join
    mapply(function(x,y){
      if(y %% 2 == 1){
        left_join(x, A, by="Let")
      }else {
        left_join(x, B, by="Let")
      }
    }, LIST, seq_along(LIST), SIMPLIFY = FALSE)
    
    # [[1]]
    #   Num Let    Col
    # 1   1   a Yellow
    # 2   2   b    Red
    # 
    # [[2]]
    #   Num Let   Col
    # 1   1   a Green
    # 2   2   b  Blue