Matching Columns with Overlapping Intervals (lubridate)

I have two data frames of different number of rows and number of columns: each of these data frames have a date interval. df has an additional column which indicates some kind of attribute. My goal is to extract information from df ( with the attributes) to df2 under certain conditions. The procedure should be the following:

For each date interval of df2, check if there is any interval in df which overlaps with the interval of df2. If yes, create a column in df2 which indicates the attributes matching with the overlapping interval of df. There can be multiple attributes that are matched to a specific interval of df2.

I created the following example of my data:

library(lubridate)
date1 <- as.Date(c('2017-11-1','2017-11-1','2017-11-4'))
date2 <- as.Date(c('2017-11-5','2017-11-3','2017-11-5'))
df <- data.frame(matrix(NA,nrow=3, ncol = 4)) 
names(df) <- c("Begin_A", "End_A", "Interval", "Attribute")
df$Begin_A <-date1
df$End_A <-date2

df$Interval <-df$Begin_A %--% df$End_A
df$Attribute<- as.character(c("Attr1","Attr2","Attr3"))

### Second df:

date1 <- as.Date(c('2017-11-2','2017-11-5','2017-11-7','2017-11-1'))
date2 <- as.Date(c('2017-11-3','2017-11-6','2017-11-8','2017-11-1'))
df2 <- data.frame(matrix(NA,nrow=4, ncol = 3)) 
names(df2) <- c("Begin_A", "End_A", "Interval")
df2$Begin_A <-date1
df2$End_A <-date2
df2$Interval <-df2$Begin_A %--% df2$End_A

This results in these data frames:

df:

Begin_A      End_A        Interval                         Attribute
2017-11-01   2017-11-05   2017-11-01 UTC--2017-11-05 UTC   Attr1
2017-11-01   2017-11-03   2017-11-01 UTC--2017-11-03 UTC   Attr2
2017-11-04   2017-11-05   2017-11-04 UTC--2017-11-05 UTC   Attr3

df2:

Begin_A      End_A        Interval
2017-11-02   2017-11-03   2017-11-02 UTC--2017-11-03 UTC
2017-11-05   2017-11-06   2017-11-05 UTC--2017-11-06 UTC
2017-11-07   2017-11-08   2017-11-07 UTC--2017-11-08 UTC
2017-11-01   2017-11-01   2017-11-01 UTC--2017-11-01 UTC

My desired data frames look like this:

Begin_A      End_A        Interval                         Matched_Attr 
2017-11-02   2017-11-03   2017-11-02 UTC--2017-11-03 UTC   Attr1;Attr2
2017-11-05   2017-11-06   2017-11-05 UTC--2017-11-06 UTC   Attr1;Attr3
2017-11-07   2017-11-08   2017-11-07 UTC--2017-11-08 UTC   NA
2017-11-01   2017-11-01   2017-11-01 UTC--2017-11-01 UTC   Attr1;Attr2

I already looked into the int_overlaps() function but could not make the "scanning through all intervals of another column"-part work. If yes, is there any solution that makes use of the tidyr environment?

1 answer

  • answered 2017-10-11 10:04 Patrik_P

    Using tidyverse´s lubridate package and it´s function int_overlaps(), you can create a simple for loop to go through the individual values of df2$Interval like follows:

    df2$Matched_Attr <- NA
    for(i in 1:nrow(df2)){
      df2$Matched_Attr[i] <-  paste(df$Attribute[int_overlaps(df2$Interval[i], df$Interval)], collapse=", ")
    }
    

    giving the following outcome

    #     Begin_A      End_A                       Interval Matched_Attr
    #1 2017-11-02 2017-11-03 2017-11-02 UTC--2017-11-03 UTC Attr1, Attr2
    #2 2017-11-05 2017-11-06 2017-11-05 UTC--2017-11-06 UTC Attr1, Attr3
    #3 2017-11-07 2017-11-08 2017-11-07 UTC--2017-11-08 UTC             
    #4 2017-11-01 2017-11-01 2017-11-01 UTC--2017-11-01 UTC Attr1, Attr2
    

    I left the NA strategy open, but additional line df2$Matched_Attr[df2$Matched_Attr==""]<-NA would return exact desired outcome.

    In response to your comment (only perform the above action when a df$ID[i]==df2$ID[i] condition is met), the inplementation follows:

    library(lubridate)
    #df
    df <- data.frame(Attribute=c("Attr1","Attr2","Attr3"),
                     ID = c(3,2,1),
                     Begin_A=as.Date(c('2017-11-1','2017-11-1','2017-11-4')),
                     End_A=as.Date(c('2017-11-5','2017-11-3','2017-11-5')))
    df$Interval <- df$Begin_A %--% df$End_A
    
    ### Second df:
    df2 <- data.frame(ID=c(3,4,5),
                      Begin_A=as.Date(c('2017-11-2','2017-11-5','2017-11-7')),
                      End_A=as.Date(c('2017-11-3','2017-11-6','2017-11-8')))
    df2$Interval <- df2$Begin_A %--% df2$End_A
    
    df2$Matched_Attr <- NA
    for(i in 1:nrow(df2)){
      if(df2$ID[i]==df$ID[i]){
      df2$Matched_Attr[i] <-  paste(df$Attribute[int_overlaps(df2$Interval[i], df$Interval)], collapse=", ")
      }
    }
    print(df2)
    #  ID    Begin_A      End_A                       Interval Matched_Attr
    #1  3 2017-11-02 2017-11-03 2017-11-02 UTC--2017-11-03 UTC Attr1, Attr2
    #2  4 2017-11-05 2017-11-06 2017-11-05 UTC--2017-11-06 UTC         <NA>
    #3  5 2017-11-07 2017-11-08 2017-11-07 UTC--2017-11-08 UTC         <NA>