Skip to main content

Script to combine separate csv files [Resolved]

I have two separate files containing information on clients that I would like to merge into one. The first file contains global information, the second is for options.

Basically, every client has an entry in the first file, whereas only some have an entry in the second file.

What I'm aiming to do is to merge the second file into the first (complete) one, with (if possible) no duplicate columns. I've tried pasting and cat, but with no success...

Example :

complete.csv :

Number  Name   Surname   Price   Town
2       Alpha  Beta      10.0    Blob
1       Gamma  Delta     13.0    Upsy

options.csv :

Number  Name   Surname   Op_Price   Option
1       Alpha  Beta      65.0       Yawn

I would like to get something like this :

result.csv :

Number  Name   Surname   Price   Town  Op_Price  Option
2       Alpha  Beta      10.0    Blob  65.0      Yawn
1       Gamma  Delta     13.0    Upsy

Could you point me in a direction ?

Cheers


Question Credit: olirwin
Question Reference
Asked June 12, 2019
Posted Under: Unix Linux
11 views
2 Answers

You can achieve this with join and couple other tools, which are all part of coreutils and thus are present on every Linux box. BSDs have those tools as well, but they don't have the flags I use here.

join uses one of the columns as a key by which to compare lines in two files. To use multiple columns as a key, you'll have to remove spaces from them, e.g. replacing them by tabs. You didn't specify if that's what you want, so I'll just assume that we're joining on "Name". Ask a separate question if you need the columns combined.

One last prerequisite: both files need to be sorted on the key column. Your example data appear to already be sorted by "Name", but let's make sure:

$ sort --key=2 complete.csv > sorted-complete.csv
$ sort --key=2 options.csv  > sorted-options.csv

Now we're ready to paste the files together:

$ join -a 1 -j 2 -o '1.1 1.2 1.3 1.4 1.5 2.4 2.5' \
        --header sorted-complete.csv sorted-options.csv \
        | column -t
Number  Name   Surname  Price  Town  Op_Price  Option
2       Alpha  Beta     10.0   Blob  65.0      Yawn
1       Gamma  Delta    13.0   Upsy

Let's deconstruct that.

The backslashes (\) are there to break the line, for readability. You can remove them.

-a 1 means "print all lines from the first file` (i.e. sorted-complete.csv).

-j 2 means "the key column is the second one".

-o ... sets the output format. It contains specifiers in the format of <file number>.<column number. For example, "1.1" means first column of "sorted-complete.csv", while 2.5 is fifth column of "sorted-options.csv". If you don't specify the output format, the key column will be printed out first, and repeated after the "complete" data.

Finally, column -t reformats the output into a nicely aligned table. Without this, columns will be jagged, because they all will be separated by a single space.


credit: Alexander Batischev
Answered June 12, 2019
Your Answer