bash - Removing rows from one file which do not match another
I am looking for an efficient way to delete the rows in file1 that do not exist in file2, in bash.
file1.txt:
file1 <- 'probeset_id sample1 sample2 sample3
ax-2 100 200 180
ax-1 90 180 267
ax-3 80 890 124'
file1 <- read.table(text=file1, header=TRUE)
write.table(file1, "file1.txt", col.names=TRUE, quote=FALSE, row.names=FALSE)
file2.txt:
file2 <- 'probeset_id
ax-1
ax-2'
file2 <- read.table(text=file2, header=TRUE)
write.table(file2, "file2.txt", col.names=FALSE, quote=FALSE, row.names=FALSE)
The expected output:
out <- 'probeset_id sample1 sample2 sample3
ax-1 90 180 267
ax-2 100 200 180'
out <- read.table(text=out, header=TRUE)
write.table(out, "out.txt", col.names=TRUE, quote=FALSE, row.names=FALSE)
An additional problem is that file2 is not sorted in the same order as file1. I am trying to use:
head -n 1 file1.txt ; grep -f file2.txt file1.txt
However, this takes a long time. Any ideas on how to do this more efficiently (the real files are quite big)?
awk is of great use in this case:
awk 'NR==FNR{line[$1]++; next} $1 in line'
Example:
$ awk 'NR==FNR{line[$1]++; next} $1 in line' file2 file1
probeset_id sample1 sample2 sample3
ax-2 100 200 180
ax-1 90 180 267
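The one-liner can be tried end to end with the sample data from the question. The script below is only a minimal sketch: the temporary directory and the variable names workdir and result are assumptions made for illustration. It also adds FNR==1 to the second pattern, so that the header row of file1.txt is printed even when file2.txt contains no probeset_id line (which is the case when file2.txt is written with col.names=FALSE).

```shell
#!/usr/bin/env bash
set -eu

# Work in a throwaway directory (hypothetical location, for illustration only).
workdir=$(mktemp -d)
cd "$workdir"

# Sample data from the question.
printf '%s\n' 'probeset_id sample1 sample2 sample3' \
              'ax-2 100 200 180' \
              'ax-1 90 180 267' \
              'ax-3 80 890 124' > file1.txt
printf '%s\n' 'ax-1' 'ax-2' > file2.txt

# First pass (file2.txt): remember every id in the array "line".
# Second pass (file1.txt): print the header line, plus every row
# whose first column was remembered.
result=$(awk 'NR==FNR {line[$1]++; next} FNR==1 || ($1 in line)' file2.txt file1.txt)
printf '%s\n' "$result"
```

The rows come out in the order in which they appear in file1.txt, because file1.txt is the file being streamed; only the ids from file2.txt are held in memory.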
What does it do?
NR==FNR{line[$1]++; next} saves the lines of file2 in the associative array line (indexed by the first column).
NR==FNR is true only while reading the first file in the list, file2.
NR is the number of records read so far, across all files.
FNR is the number of records read in the current file.
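The difference between the two counters is easiest to see by printing them for every record. The following is a small sketch using shortened versions of the sample files; the temporary directory, the column name s1 and the variable name counters are assumptions for illustration.

```shell
#!/usr/bin/env bash
set -eu

# Throwaway directory with two small files (hypothetical, for illustration).
workdir=$(mktemp -d)
printf '%s\n' 'ax-1' 'ax-2' > "$workdir/file2.txt"
printf '%s\n' 'probeset_id s1' 'ax-2 100' 'ax-1 90' > "$workdir/file1.txt"

# For every record, print the current file name, NR and FNR.
counters=$(cd "$workdir" && awk '{print FILENAME, NR, FNR}' file2.txt file1.txt)
printf '%s\n' "$counters"
# file2.txt 1 1
# file2.txt 2 2
# file1.txt 3 1
# file1.txt 4 2
# file1.txt 5 3
```

NR and FNR agree only on the lines of the first file. One caveat worth keeping in mind: if the first file is empty, NR==FNR is also true while the second file is being read, so the idiom relies on file2 having at least one line.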
$1 in line checks whether column 1 of the current record in file1 was saved in line; if it was, awk takes the default action of printing the current record.
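The expected output in the question lists the rows in the order of file2, whereas the command above keeps the order of file1. If the file2 order matters, one possible variant is to read file1 first, store its rows, and then walk through file2. This is a sketch and not part of the original answer; it assumes that file1.txt fits in memory and that the ids in its first column are unique. The temporary directory and the names row and ordered are hypothetical.

```shell
#!/usr/bin/env bash
set -eu

# Throwaway directory with the sample data from the question.
workdir=$(mktemp -d)
cd "$workdir"
printf '%s\n' 'probeset_id sample1 sample2 sample3' \
              'ax-2 100 200 180' \
              'ax-1 90 180 267' \
              'ax-3 80 890 124' > file1.txt
printf '%s\n' 'ax-1' 'ax-2' > file2.txt

# Note the swapped argument order: file1.txt is read first here.
ordered=$(awk '
    NR==FNR {
        if (FNR==1) print        # pass the header of file1.txt through
        else row[$1] = $0        # store each data row under its id
        next
    }
    $1 in row { print row[$1] }  # emit stored rows in file2.txt order
' file1.txt file2.txt)
printf '%s\n' "$ordered"
```

The trade-off is memory: this variant holds all of file1.txt in the array, while the original command holds only the ids from file2.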