Many people working with different kinds of data come across the problem that they need to convert between different data formats. Comma Separated Values (CSV) and the Extensible Markup Language (XML) are the most widely used formats for data, and conversion between these two formats needs often to be accomplished. Especially to XML, because this format is very well supported by modern applications, and is very well suited for furter data manipulation and customization.
XML to CSV conversion is trivial for some simple document structures. It can probably be done best with XSL Transformations (XSLT), and won't be discussed here. (Well, maybe some examples can help.)
Countless tools and methods for CSV to XML conversion do exist, nevertheless it is very often quite hard to find a method that suits one's particular needs. Different situations/users/documents require different tools. Requirements such as ease-of-installation, set-up, customization, GUI/CLI, encodings, availability of a scripting language interpreter, may have different impact on each use case. Here we will review a wide array of tools and methods for the conversion of CSV documents into XML.
Online Tools
The Creativyst® CSV to XML Converter let's you paste your CSV document in one form field, and copy the resulting XML from another field online and for free. Also check out their excellent documentation about the CSV file format.
Spreadsheet Applications
Spreadsheet applications such as Openoffice Calc, KSpread, Gnumeric or MS Excel are often very useful tools for manually entering data into tables. Once created, you can usually save (or export) the data as a CSV document. Maybe your preferred applications allows you to export your data as "simple" XML natively, or with help of a third-party plugin. Modern spreadsheet aplications support file formats that are based on XML, and thus can be processed with standard XML tools.
ODF + XSLT
The OpenDocument Format that is used by default in Openoffice Calc is a ZIP compressed archive, which contains XML data. Thus, the content of your spreadsheet can be processed with XSLT.
- Create your data in your spreadsheet application, or import a CSV document.
- Save the spreadsheet as 'document.ods'.
- Rename the file to 'document.zip'.
- Extract the archive, and navigate to the 'document' directory.
- Here, we are only interested in the 'content.xml' file. Convert it with an XSLT stylesheet and xsltproc:
$ xsltproc --output "output.xml" "odf.xsl" "content.xml"
A basic 'odf.xsl' stylesheet might look like:
<?xml version="1.0" encoding="UTF-8" ?> <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0" xmlns:table="urn:oasis:names:tc:opendocument:xmlns:table:1.0" xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0"> <xsl:template match="office:spreadsheet"> <document> <xsl:apply-templates/> </document> </xsl:template> <xsl:template match="table:table"> <table> <xsl:attribute name="name"> <xsl:value-of select="@table:name"/> </xsl:attribute> <xsl:apply-templates/> </table> </xsl:template> <xsl:template match="table:table-row"> <row> <xsl:apply-templates/> </row> </xsl:template> <xsl:template match="table:table-cell"> <xsl:element name="cell{position()}"> <xsl:value-of select="text:p"/> </xsl:element> </xsl:template> </xsl:stylesheet>
Please note, that the use of 'position()' in the 'table:table-cell' template might be problematic. It allows you to get distinct element names for each data column. It should be possible to use a similar mechanism for reading labels from the first row. Or you can get a flat element naming replacing <xsl:element> with <cell>. Flat elements are a little bit harder to process, but it can be done.
Graphical (GUI) Applications
A specialized tool based on Java should be mentionned here, the CSV2XML Konverter published by OIO for free use. It's possible to use it as a standalone or as a Java Web Start Application or integrated in own software projects. The tool gives several options, e.g. define the delimiter and define the first line as source for the element names (very useful!). Nevertheless, it's good for ASCII but it has some problems with Unicode.
Another tool for CSV2XML conversion is the Altova XMLSpy. It's not free, but when you work a lot with XML, maybe it's worth to you because it's like a Swiss Army Knife for XML use. How to convert now? In menu klick convert - import textfile - "convert CSV to XML" (2nd option). Then choose path and define delimiter as encoding (!). Additionally, as like as in the OIO tool, you may choose first line entries as element names. Finally import. The results are perfect. Unfortunately, only a Windows version exists.
Command Line Tools
The Unix shell is extremely powerful. It allows you to combine existing programs with redirections and pipes, creating your own custom tools. If some part is not available, you can write or modify a custom script in your scripting language of choice. For more information, you can read the Command Line Howto in this Wiki.
ffe Flat File Extractor
The Flat File Extractor is a command-line tool that allows you to convert arbitrary text files to XML. It is available as the 'ffe' package in Ubuntu Universe, and can be installed as follows:
$ sudo aptitude update $ sudo aptitude install ffe
Once installed, an 'input.csv' file can be converted to an 'output.xml' file using the ffe tool with the following instructions on the command line:
$ ffe -o output.xml -c csv2xml.fferc input.csv
Where the 'csv2xml.fferc' configuration file looks something like the following:
structure csv2xml { type separated ; output xml record data { field field_name_1 field field_name_2 field field_name_3 } } output xml { file_header "<?xml version=\"1.0\" encoding=\"OTF-8\"?>\n" data "<%n>%t</%n>\n" record_header "<%r>\n" record_trailer "</%r>\n" indent " " }
This will read a simple CSV file with four columns, and fields separated by ';'. You can edit the 'field' and separator values as needed.
If you want to read the field names from the first line of the CSV file, you can use the following alternative structure. Just make sure to add the correct number of fields inside the record:
structure csv2xml { type separated ; header all output xml record data { field * field * field * } }
csv2xml
csv2xml is a command line tool released under th BSD license. It is written in C to process very large files very quickly. There seem to be no packages available, so you must compile it manually.
After downloading the source from sourceforge (for example 'csv2xml-0.6.tar.gz'), make sure that you have 'build-essential' installed, and that the '/usr/local/man/man1' directory exists. Then you can simply make & make install:
$ sudo aptitude update $ sudo aptitude install build-essential $ tar -xvzf csv2xml-0.6.tar.gz $ cd csv2xml-0.6 $ sudo mkdir -p /usr/local/man/man1 $ make $ sudo make install
Now, you can process a comma-separated 'input.csv' file (without spaces), write it to 'output.xml'. The xml2csv tool will read the column headers from the first line in the input file:
$ csv2xml < input.csv > output.xml
Warning: This tool does not create a well-formed XML document, as it lacks a proper root element! Thus, you will have to add this to get an XML document, probably together with an XML declaration:
$ (echo '<?xml version="1.0" encoding="UTF-8"?>'; echo "<root>"; csv2xml < input.csv; echo "</root>") > output.xml
Perl
The Ubuntu 'libxml-csv-perl' package provides all the functionality needed for converting CSV to XML with Perl. At it's most basic, the following script converts 'input.csv' to 'output.xml', reading the "headings" labels from the first line of the CSV:
#!/usr/bin/perl use XML::CSV; $csv_obj = XML::CSV->new(); $csv_obj->parse_doc("input.csv", {headings => 1}); $csv_obj->print_xml("output.xml");
In order to use this script, you must save it as 'csv2xml.pl', and set it's executable flag. Or you can invoke perl explicitly:
$ perl csv2xml.pl
Warning: This script only works with ASCII text. All lines containing accented characters or other extended characters are silently ignored. A more robust approach would probably require the use of the Encode module.
A complete command-line tool written in Perl is available at the following place: http://sourceforge.net/forum/forum.php?thread_id=1179362&forum_id=422301.
Python
The following Python script reads the 'input.csv' document, and writes the data in XML to the standard output. You can adapt it to your personal requirements, adjusting the formatting parameters directly in the code below.
1 #! /usr/bin/env python
2
3 import csv
4
5 csv.register_dialect('custom',
6 delimiter=',',
7 doublequote=True,
8 escapechar=None,
9 quotechar='"',
10 quoting=csv.QUOTE_MINIMAL,
11 skipinitialspace=False)
12
13 with open('input.csv') as ifile:
14 data = csv.reader(ifile, dialect='custom')
15 print "<document>"
16 for record in data:
17 print " <record>"
18 for i, field in enumerate(record):
19 print " <field%s>" % i + field + "</field%s>" % i
20 print " <record>"
21 print "</document>"
In order to run it, you must save the previous as 'csv2xml.py' and change the permissions to executable. Or you can also invoke it explicitly:
$ python csv2xml.py input.csv > output.xml
Another script for Python3 allows you to set a lot of input and output formatting options directly with command-line options.
PHP
PHP is a widely-used general-purpose scripting language that fits particularly well with XML. It is most commonly used in servers, but you can easily install it with the 'php5' metapackage from the Ubuntu repository.
You can use this sample script to convert CSV to XML, or you could also write your own script, using the fgetcsv function.
Ruby
Ruby is probably not installed by default, but you can easily get it from the 'ruby' package. Due to it's simplicity and elegance, this language is becoming a favorite of many programmers.
#!/usr/bin/env ruby require 'csv' separator = ';' ifile = ARGV[0] csv = CSV::parse(File.open(ifile) {|f| f.read}, separator) lines = csv.shift puts '<?xml version="1.0"?>' puts "<table>" csv.each do |cell| puts "<row>" for i in 0..(lines.length - 1) puts "<cell#{i}>#{cell[i].strip}</cell#{i}>" end puts "</row>" end puts "</table>"
In order to execute this script it, you must save the previous as 'csv2xml.rb' and change the permissions to executable. Or you can also invoke it explicitly:
$ ruby csv2xml.rb input.csv > output.xml
XSLT
While the XSL Transformations template language looks like the natural choice for the conversion from XML to CSV, it is not very well suited for conversions the other way around.
Yet, there are a few solutions available on the web. Andrew Welch has developed the following XSLT 2.0 stylesheet:
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:fn="fn" exclude-result-prefixes="xs fn"> <xsl:output indent="yes" encoding="US-ASCII"/> <xsl:param name="pathToCSV" select="'file:///c:/csv.csv'"/> <xsl:function name="fn:getTokens" as="xs:string+"> <xsl:param name="str" as="xs:string"/> <xsl:analyze-string regex='("[^"]*")+' select="$str"> <xsl:matching-substring> <xsl:sequence select='replace(., "^""|""$|("")""", "$1")'/> </xsl:matching-substring> <xsl:non-matching-substring> <xsl:for-each select="tokenize(., '\s*,\s*')"> <xsl:sequence select="."/> </xsl:for-each> </xsl:non-matching-substring> </xsl:analyze-string> </xsl:function> <xsl:template match="/" name="main"> <xsl:choose> <xsl:when test="unparsed-text-available($pathToCSV)"> <xsl:variable name="csv" select="unparsed-text($pathToCSV)"/> <xsl:variable name="lines" select="tokenize($csv, '
')" as="xs:string+"/> <xsl:variable name="elemNames" select="fn:getTokens($lines[1])" as="xs:string+"/> <root> <xsl:for-each select="$lines[position() > 1]"> <row> <xsl:variable name="lineItems" select="fn:getTokens(.)" as="xs:string+"/> <xsl:for-each select="$elemNames"> <xsl:variable name="pos" select="position()"/> <elem name="{.}"> <xsl:value-of select="$lineItems[$pos]"/> </elem> </xsl:for-each> </row> </xsl:for-each> </root> </xsl:when> <xsl:otherwise> <xsl:text>Cannot locate : </xsl:text> <xsl:value-of select="$pathToCSV"/> </xsl:otherwise> </xsl:choose> </xsl:template> </xsl:stylesheet>
Save this in a file called 'csv2xml.xsl' and perform the conversion with help of the Saxon XSLT processor. You will probably have to install the 'libsaxonb-java' package, which provides the XSLT 2.0 aware version. To convert 'input.csv' to 'output.xml', use the following command in a terminal:
$ saxonb-xslt -ext:on -o:output.xml -xsl:csv2xml.xsl -s:input.csv
An XSLT 1.0 solution by Kirk Evans requires you to wrap the whole CSV document inside some dummy XML element tags.
Anyway, XSLT is not wery well suited for this task, these are quite advanced hacks, and you should use these methods at your own risk.
Troubleshooting
A very common cause of problems in all text-processing software, including XML processors are the encoding, special, and non-ASCII characters. If your source document contains any of these, you should check your output document very carefully, because very strange things may happen, and usually without a warning!
Encoding
You have to be very careful about the encoding of your source-document. Unicode UTF-8 is the standard today on all modern operating systems, and that's what you get usually when you create text documents with Ubuntu. If you are using legacy documents, they could be using other encodings, and you will probably have to tell your software somehow about that. Beware, many applications still don't handle Unicode, or behave badly/unexpectedly. You might have to change your document's encoding, maybe using the VIM or Emacs text editors.
Encoding only matters if you are using characters outside of the ASCII range, ie. accented or non-latin characters.
Special Characters
Some characters might cause problems with some applications/tools. The most common causes of conflict are summarized on the following table:
Character |
Unicode point |
Name |
Character Entity |
# |
U+0023 |
NUMBER SIGN |
# |
& |
U+0026 |
AMPERSAND |
& |
< |
U+003C |
LESS-THAN SIGN |
< |
> |
U+003E |
GREATER-THAN SIGN |
> |
" |
U+0022 |
QUOTATION MARK |
" |
' |
U+0027 |
APOSTROPHE |
' |
\ |
U+005C |
REVERSE SOLIDUS |
\ |
Some applications perform automatic substitution of these characters and/or accented and non-latin characters by their corresponding character entity references. Since these character entities are understood by standard XML processing software, this substitution should pose no problems to you. In fact, this is sometimes the only way you have for processing non-ASCII documents with software that is not Unicode aware. In these cases, you have to substitute all non-ASCII characters by their character entities.