This blog shows how to convert XML data into CSV format using Pig commands.
We will start with some sample XML data. A Hadoop installation comes with many configuration files in XML format, and here we take hdfs-site.xml as our input data.
Our hdfs-site.xml file looks like this:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/home/kiran/hadoop/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/home/kiran/hadoop/datanode</value>
  </property>
  <property>
    <name>dfs.block.size</name>
    <value>67108864</value>
  </property>
</configuration>
Now we will convert the data inside this file to CSV format using Pig.
A = load '/hdfs-site.xml' using org.apache.pig.piggybank.storage.XMLLoader('property') as (x:chararray);
Here we load the XML file with the XMLLoader that ships in Pig's piggybank library. The argument 'property' tells the loader which element to extract, so each <property>...</property> block becomes one record, stored under the alias x as a chararray.
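If Pig cannot find the XMLLoader class, the piggybank jar probably needs to be registered first. The snippet below is a minimal sketch of the load step with that registration added. The jar path is an assumption and will differ between installations.

```pig
-- Assumption: piggybank.jar sits at this hypothetical path; adjust it for your setup.
REGISTER /usr/lib/pig/piggybank.jar;

-- Load one record per <property>...</property> element.
A = LOAD '/hdfs-site.xml'
    USING org.apache.pig.piggybank.storage.XMLLoader('property')
    AS (x:chararray);

-- Print the schema and the raw records to confirm the load worked.
DESCRIBE A;
DUMP A;
```

DUMP should print each property block as a single chararray, still containing its tags and line breaks.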
B = foreach A generate REPLACE(x,'[\\n]','') as x;
C = foreach B generate REGEX_EXTRACT_ALL(x,'.*(?:<name>)([^<]*).*(?:<value>)([^<]*).*');
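The first statement strips the newline characters so that each property sits on a single line. The second uses a regular expression to pull out the text between the <name> and <value> tags, discarding the tags themselves.
Before the FLATTEN statement, each record of C is a tuple nested inside another tuple, for example ((dfs.replication,1)).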
D = FOREACH C GENERATE FLATTEN($0);
FLATTEN removes the extra level of nesting, so each record becomes a plain two-field tuple such as (dfs.replication,1).
This output can be written to a file with the CSVExcelStorage function from piggybank, using the command below:
STORE D INTO '/pig_conversions/xml_to_csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage();
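Putting the steps together, the whole conversion can be sketched as a single script. This version flattens the regex result directly and names the two fields, which is a small variation on the statements above and not a required change. The piggybank jar path is again an assumption.

```pig
-- Assumption: piggybank.jar sits at this hypothetical path; adjust it for your setup.
REGISTER /usr/lib/pig/piggybank.jar;

-- One record per <property>...</property> element.
A = LOAD '/hdfs-site.xml'
    USING org.apache.pig.piggybank.storage.XMLLoader('property')
    AS (x:chararray);

-- Strip newlines so the regex below can match across the whole element.
B = FOREACH A GENERATE REPLACE(x, '[\\n]', '') AS x;

-- Capture the text inside <name> and <value>, flattened into two named fields.
C = FOREACH B GENERATE
    FLATTEN(REGEX_EXTRACT_ALL(x, '.*(?:<name>)([^<]*).*(?:<value>)([^<]*).*'))
    AS (name:chararray, value:chararray);

-- Write the result as CSV.
STORE C INTO '/pig_conversions/xml_to_csv'
    USING org.apache.pig.piggybank.storage.CSVExcelStorage();
```

Note that Pig refuses to write into an output directory that already exists, so remove /pig_conversions/xml_to_csv before running the script a second time.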
The output is stored in HDFS under /pig_conversions/xml_to_csv, in a file named part-m-00000. We can download the file and view its contents.
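The file can also be inspected without downloading it, straight from the Grunt shell. The commands below are a small sketch that assumes the job produced a single part file with the name mentioned above.

```pig
-- List the files that the STORE statement produced.
fs -ls /pig_conversions/xml_to_csv

-- Print the CSV rows; assumes a single output file named part-m-00000.
fs -cat /pig_conversions/xml_to_csv/part-m-00000
```

The same fs commands work inside a Pig script, and the equivalent hadoop fs commands can be used from a regular terminal.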
This is the final output in CSV format, with one name,value row per property. We can now easily perform analysis on this data.
We hope this blog helped you learn how to convert XML data into CSV.