(2) tab delimited text files, need to replace specific rows from one to the other.
Have tried data.table, no luck.
Programmer left the company and I know very little R, a C++ person.
Some help would be appreciated.
file1.txt 12 site data example row3 needs to replace row7 in file2.txt
file2.txt 32 site data
Need to extract specific rows from file1.txt and replace specific rows in file2.txt with file1.txt row.
thanks much
If the data have the same number and types of columns, this can be easily done.
#Invent toy data
File1Data <- data.frame(Source = "File1", Value = 1:4, stringsAsFactors = FALSE)
File1Data
Source Value
1 File1 1
2 File1 2
3 File1 3
4 File1 4
File2Data <- data.frame(Source = "File2", Value = 11:14, stringsAsFactors = FALSE)
File2Data
Source Value
1 File2 11
2 File2 12
3 File2 13
4 File2 14
#Use row 2 of File1 to replace row 3 of File2
File2Data[3, ] <- File1Data[2, ]
File2Data
Source Value
1 File2 11
2 File2 12
3 File1 2
4 File2 14
However, using that method for multiple rows seems inefficient. To provide better guidance, can you explain what determines the source and target rows? Is there some label that matches so that a process similar to a data base join could be used? Is the row order in the data important or can the original rows be deleted and the new rows appended to the end of the data?
I'll be off line for several hours. I hope others see your response and can carry on the thread.
Hi and thanks so much.
The (2) files will always have the same # or rows & columns for each specific file.
Every set of (2) files, original & retest will always have the same amount of columns, rows could be different if the same number of wafers are not tested in the retest file.
The files are are output from parametric semiconductor testing, (1) file was a retest of all or specific wafers.
The other file is the original data, all wafers. There will be specific rows, wafer site, that need to
replaced in the original file from the retest file. This will be determined by the site number, which is column parameter in each row. The files has headers in each column that define the column contents.
Example top, header row date /t lot# /t wafer /t site /t data pt1 /t data pt2 /t.............
They may have 10 columns or 50 columns, But the two files will always have the same amount of columns.
Every test structure(output file) will be different, hence the diff # of columns & rows per structure.
Rows may be different if all wafers were not retested.
They cannot be appended, the row from the retest file needs to replace the exact row in the original file.
thank you so much for your help.
The code below shows how I would do this. File1 has four rows of data and File2 has retest values for two of those rows. The idea is to pivot the two data files to a "long" format so there is one column listing the data point and one column showing the value. I then join the data sets. All of the rows from File1 are kept. Where there is a corresponding value from File2, that appears in the same row and an NA is shown if File2 did not have a value. The ifelse() function within the mutate() replaces the File1 values if the File2 value is not NA. I then drop the columns from File2 and pivot the data back to the wide format. The last pivot may not be needed. A lot of data analysis in R is easier in the long format but you will have to decide what works for your situation.
The code shows the steps after the initial pivot twice. The first is a verbose version that prints all of the intermediate steps. The second does the process in one chain of function calls.
library(dplyr)
library(tidyr)
library(tibble)
File1 <- tibble(date = as.Date("2022-04-26"),
lot = c("A", "A", "B", "B"),
site = c("S1", "S2", "S1", "S2"),
Data1 = c(1,2,3,4), Data2 = c(2,3,4,5), Data3 = c(3,4,5,6))
File1
#> # A tibble: 4 × 6
#> date lot site Data1 Data2 Data3
#> <date> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 2022-04-26 A S1 1 2 3
#> 2 2022-04-26 A S2 2 3 4
#> 3 2022-04-26 B S1 3 4 5
#> 4 2022-04-26 B S2 4 5 6
File2 <- tibble(date = as.Date("2022-04-27"),
lot = c("A", "B"),
site = c("S2", "S1"),
Data1 = c(2.1,3.1), Data2 = c(3.1,4.1), Data3 = c(4.1,5.1))
File2
#> # A tibble: 2 × 6
#> date lot site Data1 Data2 Data3
#> <date> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 2022-04-27 A S2 2.1 3.1 4.1
#> 2 2022-04-27 B S1 3.1 4.1 5.1
File1Long <- pivot_longer(data = File1, cols = -c("date", "lot", "site"), names_to = "DataPt")
File1Long
#> # A tibble: 12 × 5
#> date lot site DataPt value
#> <date> <chr> <chr> <chr> <dbl>
#> 1 2022-04-26 A S1 Data1 1
#> 2 2022-04-26 A S1 Data2 2
#> 3 2022-04-26 A S1 Data3 3
#> 4 2022-04-26 A S2 Data1 2
#> 5 2022-04-26 A S2 Data2 3
#> 6 2022-04-26 A S2 Data3 4
#> 7 2022-04-26 B S1 Data1 3
#> 8 2022-04-26 B S1 Data2 4
#> 9 2022-04-26 B S1 Data3 5
#> 10 2022-04-26 B S2 Data1 4
#> 11 2022-04-26 B S2 Data2 5
#> 12 2022-04-26 B S2 Data3 6
File2Long <- pivot_longer(data = File2, cols = -c("date", "lot", "site"), names_to = "DataPt")
#Step by step processing for clarity
FinalData <- left_join(File1Long, File2Long, by = c("lot", "site", "DataPt"), suffix = c(".F1", ".F2"))
FinalData
#> # A tibble: 12 × 7
#> date.F1 lot site DataPt value.F1 date.F2 value.F2
#> <date> <chr> <chr> <chr> <dbl> <date> <dbl>
#> 1 2022-04-26 A S1 Data1 1 NA NA
#> 2 2022-04-26 A S1 Data2 2 NA NA
#> 3 2022-04-26 A S1 Data3 3 NA NA
#> 4 2022-04-26 A S2 Data1 2 2022-04-27 2.1
#> 5 2022-04-26 A S2 Data2 3 2022-04-27 3.1
#> 6 2022-04-26 A S2 Data3 4 2022-04-27 4.1
#> 7 2022-04-26 B S1 Data1 3 2022-04-27 3.1
#> 8 2022-04-26 B S1 Data2 4 2022-04-27 4.1
#> 9 2022-04-26 B S1 Data3 5 2022-04-27 5.1
#> 10 2022-04-26 B S2 Data1 4 NA NA
#> 11 2022-04-26 B S2 Data2 5 NA NA
#> 12 2022-04-26 B S2 Data3 6 NA NA
FinalData <- mutate(.data = FinalData, value.F1 = ifelse(is.na(value.F2), value.F1, value.F2))
FinalData
#> # A tibble: 12 × 7
#> date.F1 lot site DataPt value.F1 date.F2 value.F2
#> <date> <chr> <chr> <chr> <dbl> <date> <dbl>
#> 1 2022-04-26 A S1 Data1 1 NA NA
#> 2 2022-04-26 A S1 Data2 2 NA NA
#> 3 2022-04-26 A S1 Data3 3 NA NA
#> 4 2022-04-26 A S2 Data1 2.1 2022-04-27 2.1
#> 5 2022-04-26 A S2 Data2 3.1 2022-04-27 3.1
#> 6 2022-04-26 A S2 Data3 4.1 2022-04-27 4.1
#> 7 2022-04-26 B S1 Data1 3.1 2022-04-27 3.1
#> 8 2022-04-26 B S1 Data2 4.1 2022-04-27 4.1
#> 9 2022-04-26 B S1 Data3 5.1 2022-04-27 5.1
#> 10 2022-04-26 B S2 Data1 4 NA NA
#> 11 2022-04-26 B S2 Data2 5 NA NA
#> 12 2022-04-26 B S2 Data3 6 NA NA
FinalData <- select(FinalData, date = date.F1, lot, site, DataPt, value = value.F1)
FinalData
#> # A tibble: 12 × 5
#> date lot site DataPt value
#> <date> <chr> <chr> <chr> <dbl>
#> 1 2022-04-26 A S1 Data1 1
#> 2 2022-04-26 A S1 Data2 2
#> 3 2022-04-26 A S1 Data3 3
#> 4 2022-04-26 A S2 Data1 2.1
#> 5 2022-04-26 A S2 Data2 3.1
#> 6 2022-04-26 A S2 Data3 4.1
#> 7 2022-04-26 B S1 Data1 3.1
#> 8 2022-04-26 B S1 Data2 4.1
#> 9 2022-04-26 B S1 Data3 5.1
#> 10 2022-04-26 B S2 Data1 4
#> 11 2022-04-26 B S2 Data2 5
#> 12 2022-04-26 B S2 Data3 6
FinalData <- pivot_wider(FinalData, names_from = DataPt, values_from = value)
FinalData
#> # A tibble: 4 × 6
#> date lot site Data1 Data2 Data3
#> <date> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 2022-04-26 A S1 1 2 3
#> 2 2022-04-26 A S2 2.1 3.1 4.1
#> 3 2022-04-26 B S1 3.1 4.1 5.1
#> 4 2022-04-26 B S2 4 5 6
#Process in one chain without needless print statemements
FinalData2 <- left_join(File1Long, File2Long, by = c("lot", "site", "DataPt"), suffix = c(".F1", ".F2")) %>%
mutate(value.F1 = ifelse(is.na(value.F2), value.F1, value.F2)) %>%
select(date = date.F1, lot, site, DataPt, value = value.F1) %>%
pivot_wider(names_from = DataPt, values_from = value)
#Test that the two process produce the same result
identical(FinalData, FinalData2)
#> [1] TRUE
Created on 2022-04-28 by the reprex package (v0.2.1)
So appreciate it, I will give it a shot.
forgive me, where do I enter my filenames?
The two places where I have code like
File1 <- tibble(date = as.Date("2022-04-26"),
lot = c("A", "A", "B", "B"),
site = c("S1", "S2", "S1", "S2"),
Data1 = c(1,2,3,4), Data2 = c(2,3,4,5), Data3 = c(3,4,5,6))
need to be replaced with code that reads in your data. If your data are in a csv, you would use something like
File1 <- read.csv("Path/to/the/file")
You will also want to remove all of the (implied) print statements where a line consists of just an object name. On a real data set those will just spam your console with output.
Sorry I wasn't more explicit earlier. I know it is hard to take on code in an unfamiliar language. Don't hesitate to ask more questions if you need to. I had to deal with some COBOL last year and it made my head swim.
thank you once again, I appreciate your help
This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.
If you have a query related to it or one of the replies, start a new topic and refer back with a link.