How to merge many TSV files by common key using Python Pandas?
For data analysis the most important thing is data and we need to prepare it before we can use it for analysis. Sometimes required data can be scattered in multiple files and we need to merge them. In this article, we are going to merge multiple TSV (Tab Separated Values) files with a common key. This can be possible by using the merge method of the pandas Python library. This method allows us to combine files by using a common key.
Approach:
- Import pandas library
- Then read first two tsv files and merge them using pd.merge() function by setting the ‘on’ parameter to the common column present in both files. Then store the result in a new dataframe called ‘Output_df’.
- Store remaining files in a list.
- Run a loop that will iterate over these file names. Read these files one by one and merge them with ‘Output_df’ dataframe
- Save ‘Output_df’ in tsv file
Example 1:
In this example, we will merge tsv files by using an inner join. We have taken four tsv files for this example as follows.
Used file: Customer.tsv, Account.tsv, Branch.tsv, Loan.tsv
Python3
import pandas as pd
tsv1 = pd.read_csv( "Documents/Customer.tsv" , sep = '\t' )
tsv2 = pd.read_csv( "Documents/Account.tsv" , sep = '\t' )
Output_df = pd.merge(tsv1, tsv2, on = 'ID' ,
how = 'inner' )
tsv_files = [ "Branch.tsv" , "Loan.tsv" ]
for i in tsv_files:
path = "Documents/" + i
tsv = pd.read_csv(path, sep = '\t' )
Output_df = pd.merge(Output_df, tsv,
on = 'ID' , how = 'inner' )
Output_df.to_csv( "Documents/Output.tsv" ,
sep = "\t" , header = True ,
index = False )
|
Output:
Output.tsv
Example 2:
In this example, we will merge tsv files by using an outer join. We have taken four tsv files for this example as follows.
Used file: Course.tsv, Teacher.tsv, Credits.tsv, Marks.tsv
Python3
import pandas as pd
tsv3 = pd.read_csv( "Documents/Course.tsv" , sep = '\t' )
tsv4 = pd.read_csv( "Documents/Teacher.tsv" , sep = '\t' )
Output_df2 = pd.merge(tsv3, tsv4, on = 'Course_ID' , how = 'outer' )
tsv_files = [ "Credits.tsv" , "Marks.tsv" ]
for i in tsv_files:
path = "Documents/" + i
tsv = pd.read_csv(path, sep = '\t' )
Output_df2 = pd.merge(Output_df2, tsv,
on = 'Course_ID' , how = 'outer' )
Output_df2.to_csv( "Documents/Output_outer.tsv" , sep = "\t" ,
header = True , index = False , na_rep = "NA" )
|
Output:
Last Updated :
23 Mar, 2022
Like Article
Save Article
Share your thoughts in the comments
Please Login to comment...