My preference is .csv, because it can be read by almost any program.
Except the one most used for it: Excel. Excel sucks at CSV, especially if the recipients have different internationalized versions. It can't deal properly with multi-line strings and spaces, and the internationalization is hell (semicolons instead of commas, decimal separators, date formats).
Many times have I dealt with CSV exports that could be read in any software except the client's Excel version.
You left out the absolute worst bit: Excel tries to "helpfully" guess the correct data format for each cell in a CSV file. This can lead to silent data loss. Long strings of numbers are converted to a Numeric type that only stores a limited number of significant figures. The string "1-5" is converted to a Date field with the value "5-Jan".
Sure, it's a nice trick, but now it's not really a CSV file any more. If I have to create a file that can only be read by Excel it might as well be XLSX.
There's a slight difference between having to strip "=" from the column values for other applications to handle it, and requiring those other applications to be able to handle XLSX.
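For reference, a minimal Python sketch of that ="..." trick (the file name and columns are made up for illustration, and other CSV consumers will then see the literal ="..." wrapper):

    import csv

    rows = [
        {"id": "0000123", "range": "1-5"},   # values Excel would otherwise mangle
        {"id": "0000456", "range": "3-7"},
    ]

    with open("for_excel.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "range"])
        for row in rows:
            # Wrapping a value as ="..." makes Excel treat it as a formula that
            # evaluates to literal text, preserving leading zeros and stopping
            # the date conversion.
            writer.writerow(['="{}"'.format(row["id"]), '="{}"'.format(row["range"])])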
"Are the rules for Excel CSV parsing written down anywhere?"
They are loosely described in ECMA-376 section 18.8.30 under the "General" format code. But like with most details surrounding XLS and XLSX, the documentation is barely coherent and incorrect in some places.
Except that when users click on or download a csv-file, Excel opens it automatically, bypassing any options and not even bothering to check whether the contents of the csv file match its defaults.
This is 90% of use cases with regular users.
For the other 10%, the interface for specifying the field types sucks, doesn't cover certain options and you can't set your preferences.
Imagine being a user and having to do that every day in order to import your daily whatever report, and having to start all over if you make one little mistake.
And that's just for the options Excel supports. Don't even get me started on all the other quirks. Ever tried to read a csv file whose first line starts with "ID"? (Excel decides it must be a SYLK file and refuses to open it.)
… and precisely how many users do you think know to do this? They might if it's happened before, been correctly diagnosed and explained before and it's obvious as soon as you open the file. It's more common, however, that the Excel bug only causes data loss or invalid results in a percentage of records and, when people notice it, it's corrected by hand.
If you need to do this a lot, generate xlsx - it's not that hard and there are decent libraries for most languages. Relying on users correctly setting column types every time is like making your daily commute depend on having airbags.
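For instance, a minimal sketch with Python's openpyxl (hypothetical column names, not anyone's actual export code; "@" is Excel's Text number format):

    from openpyxl import Workbook

    wb = Workbook()
    ws = wb.active
    ws.append(["id", "incident_count"])

    for record in [("0000123", 7), ("0000456", 12)]:
        ws.append(record)
        # Force the id column to Excel's Text format so leading zeros survive.
        ws.cell(row=ws.max_row, column=1).number_format = "@"

    wb.save("report.xlsx")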
Any numbers with leading zeroes, e.g. 0000123, Excel will interpret as the integer 123 even if you change the column format to "text". It's infuriating.
The only way I found to get around this was to open a new workbook, change the column type to "text" and then paste the data in. I believe this was Excel 2010 on Windows.
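If you control the ingest side instead, forcing everything to text on read is the safer first pass; a sketch with pandas (hypothetical file name):

    import pandas as pd

    # dtype=str stops pandas from guessing numeric types, so "0000123" stays "0000123".
    df = pd.read_csv("export.csv", dtype=str, keep_default_na=False)
    print(df.head())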
Indeed. I don't know what version of Excel the parent is using, but Excel for Mac 2010 handles csv files just fine, even ones that have weird delimiters. I'd love to see an example of a non-readable csv file.
Edit: I haven't tried different Unicodes yet, so granted that may not work.
This is an example of a poorly laid-out menu structure. To force it to do the right thing, you have to wrap text (right click -> format cells -> alignment -> "Wrap text" on).
Basically, the default rendering is supposed to be a single space for any whitespace (tab, newline, space) and "wrap text" forces it to do the right thing.
From someone currently dealing with Excel's CSV writing for a business application, I heartily concur. There is a decade old bug (at least) where the CSV save as format does not properly handle Unicode, and there's no simple way to handle it down the pipeline. I guess the Office team would rather work on the metro ribbon.
"Save as CSV" produces text encoded in the users ANSI codepage which is atrocious, Excel 2007+ at least has "Save as Unicode Text" which produces \t separated UTF16 output.
As I said, no real specification. It reads "This section documents the format that seems to be followed by most implementations". Another part reads "Each field may or may not be enclosed in double quotes (however some programs, such as Microsoft Excel, do not use double quotes at all)."
And then laid down which ones would be used and exactly how, and how to specify that you conform. HTTP is just describing the process of how the spec came about. Because there is no metadata in CSV, you cannot even say how, or if, you conform.
I guess the Office team would rather work on the metro ribbon.
Of course they would! They've already achieved their goal with Excel: make it a lock-in product. In this case, they did it by breaking CSV compatibility, thereby forcing everyone to use XLSX.
I am fairly sure that other spreadsheets will make such errors, too, though. I recently saw one in Numbers that depended on the amount of data pasted (pasting hundreds of rows seemed to disable the smartness)
I've always found the C in CSVs to be needlessly limiting. I'm not aware of any tool that contextualises commas in addresses and other "free text" fields that might form part of the data dump/extract.
For this reason, I am strongly in favour of pipe-separated values in such files. The probability that a data extract has a pipe symbol in any data field is quite small (in my experience, it's been 0 so far).
The first thing I do in any Windows system I have to use on a regular basis is to change the system separator to '|' in the Regional Settings.
Then you run into ASCII art some end user made, or pre-Unicode text from Scandinavia where 0x7C was the code point of a letter. Commas make it pretty obvious which tools are unusably broken, where very rare characters let these bugs go undetected far too long.
Tab-separated is also very common, and I find it better - how often are you storing text formatting within a cell for a data interchange? Similar to pipes, tabs are much less common than commas.
Yes, pipe-delimited format is great, and much easier to parse than comma-delimited.
Tab-delimited is also better than comma-delimited, I think.
Commas just have far too high a probability of appearing within the data itself, and when you hit a comma that was supposed to be escaped, it adds an extra column to the row, screwing everything up.
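Switching delimiters is a one-line change in most CSV libraries; a Python sketch with made-up data:

    import csv

    rows = [["42 Main St, Apt 3", 'said "hello"'], ["1 Pipe|Lane", "plain"]]

    with open("out.psv", "w", newline="") as f:
        # Pipe as delimiter; the csv module still quotes any field that happens
        # to contain a pipe, so even the rare collision round-trips safely.
        csv.writer(f, delimiter="|").writerows(rows)

    with open("out.psv", newline="") as f:
        print(list(csv.reader(f, delimiter="|")))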
Tabs are great, until you try to explain to a non-programmer why your tool doesn't parse their file correctly, even though in the word-processor they're using as a text editor it looks identical to another file that parses correctly.
I also find the C in CSVs to be needlessly limiting. Especially since ASCII already comes defined with characters for record separator (30), unit separator (31), and group separator (29).
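They're trivial to use programmatically, even if no spreadsheet will open the result; a sketch (the field/record layout is just for illustration):

    # ASCII control characters: unit separator splits fields, record separator splits rows.
    US, RS = "\x1f", "\x1e"

    records = [["id", "address"], ["1", "42 Main St, Apt 3"], ["2", "7 Pipe|Lane"]]

    blob = RS.join(US.join(fields) for fields in records)

    # Parsing is a pair of splits -- no quoting or escaping rules needed,
    # as long as the data itself never contains these control characters.
    parsed = [rec.split(US) for rec in blob.split(RS)]
    assert parsed == records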
Excel and other tools can handle commas if you use quoted fields. Unfortunately this generates a new problem when you need to escape quotes inside a quoted field. There is no universal escape method.
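The closest thing to a convention is RFC 4180's quote-doubling, which is what Python's csv module does by default; a quick sketch:

    import csv, io

    buf = io.StringIO()
    # doublequote=True (the default) escapes a quote by doubling it, per RFC 4180.
    csv.writer(buf).writerow(['she said "hi"', "a, b"])
    print(buf.getvalue())   # -> "she said ""hi""","a, b"

    buf.seek(0)
    print(next(csv.reader(buf)))   # round-trips back to the original two fields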
Agreed, CSV sucks if your data contains punctuation. UTF-8 encoded tab-delimited tables are preferred. These are kind of a pain to import into Excel (on Mac at least) because you have to specify exactly what format they are in, but it can be done and there is no worry about quoting conventions and so forth unless your data contains literal tab characters that are important.
Use a flexible file format. My preference is .csv, because it can be read by almost any program. I'll tolerate .xls, but I'm not pleased with .xlsx (not everyone uses Excel!). And please, please, please do not use pdf.
Why is .xls tolerated but not .xlsx?
XLS and XLSX are both Excel formats, but XLS is a binary blob while XLSX is just a zip containing xml files. I'd much rather parse xml files (XLSX) than the binary blob of XLS.
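You can see the zip-of-XML structure with nothing but the standard library; a sketch (assumes some book.xlsx on disk; the sheet path shown is the usual layout, not guaranteed for every producer):

    import zipfile

    with zipfile.ZipFile("book.xlsx") as zf:
        print(zf.namelist())                       # [Content_Types].xml, xl/workbook.xml, ...
        xml = zf.read("xl/worksheets/sheet1.xml")  # raw XML for the first sheet, usually
        print(xml[:200])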
My experience is that open-source implementations like LibreOffice have much better support for the older file formats. docx/xlsx/etc may be a "zip of XML files" but it doesn't really matter if the original data is stored as XML or a blob when it comes to the end user's rendering experience -- it's about how the system can interpret that data. And the interpretations of the new file formats are mostly severely lacking.
Very much this. Sometimes, I don't want to (or can't) open OO.org, I just want to be able to get a quick glimpse in Emacs (http://www.emacswiki.org/emacs/UnXls) or mutt of a simple table of data someone sent me, but so far there are no quick CLI tools (that I've found) to do this. It almost feels as if Microsoft said "hmm, all these free software products can open our files, but they're still complaining about open file formats; how can we abuse our monopoly even more and guarantee that some truly open standard doesn't get mandated?" and thus was born docx, xlsx, pptx, etc.
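In a pinch, a few lines of Python get close to that quick glimpse (a sketch, assuming pandas plus an Excel engine such as openpyxl or xlrd is installed; hardly a polished CLI):

    import sys
    import pandas as pd

    # Usage: python peek.py somefile.xlsx
    path = sys.argv[1]
    df = pd.read_excel(path)      # pandas picks an engine based on the extension
    print(df.to_string(max_rows=40))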
Yes, it gets data into the sheet, but means that you need extra steps to run calculations. There is no option to specify the column as a number during the import phase. So what takes one step in Excel takes a minimum of two in Libre.
I thought you were worried about leading zeros or something.
In calc 4.0 on XP, I can choose from a few options for each column and the default (called "Standard") results in a column that I can use in calculations (if I highlight the cells it shows the sum, in formulas, etc.).
If I specify the column is "Text", leading zeros are retained.
The newest versions of LibreOffice are a lot better at handling xlsx, and it is getting better all the time. I still see some issues, but lately I have had a much better time importing.
XLSX and the other OpenXML formats are generally speaking very easy to query, particularly as Microsoft provide their OpenXML SDK to do it with as well as other 3rd party tools provide similar functionality.
Third-party spreadsheet programs have (or at least used to have) poor support for XLSX. That's from when it was new; I'd think that popular programs have improved, but the reputation stuck.
I have had reasonable success with using the Apache POI project for the XLSX and XLS. You might want to consider looking at the community edition of Pentaho Data Integration at the 5.0 level and up. It includes POI for XLSX. I use it all the time to deal with random spreadsheets full of data I get from clinical data (hospital reports).
But I hate excel of any type for data. People randomly change formats and give me updated spreadsheets with 100K's of rows. Then seem surprised when their data doesn't link up with other data sources anymore . . .
ProPublica recently published an article (with an unseemly title) about the nightmare of scraping data from pdfs and putting it into tables. [1] They wrote some custom software to do the scraping. At the end of the article they mention some software released recently by a Knight-Mozilla OpenNews Fellow, Manuel Aristarán, called Tabula, which is a PDF data table scraper.[2] I haven't tested it yet.
There is lots of bad advice in this article. OK maybe the article is shorter than I thought. Here, this is bad:
Use short variable names with no whitespace. Underscores are usually a safe bet, so instead of "Number of new tuberculosis cases" use "incident_tb"
While it might be clear to the domain experts, I had no idea that "incident_tb" meant "Number of new tuberculosis cases" other than that it was placed next to that text.
I think the message is right, but honestly if they are sending you data to parse in PDF, you better just give up now. That person will most likely not have seen a README file before in their life, they're probably not even trying.
I think the rest of the points are good suggestions but "use short variable names" is a terrible suggestion. "incident_tb" is one of the worst possible names that could be used while still being relevant to the data represented.
It's 2013, software can handle titles longer than 8 characters. Even something like "n_new_tb_incidents" (which is shorter than I'd like) buys a lot more clarity at the expense of a few characters. I'd take that bargain any day.
I agree with you, I actually had a client who demanded that the data in their project be done in excel (with multiple files ranging from 1-3 gb in size!). Both myself and my partner tried exhaustively to use a database or even JSON/XML solutions but the client kept repeating "in the real world business is done in excel!" Needless to say it was a nightmare.
I don't have the full picture here, but by moving away from an Excel format you are also potentially taking away a lot of the control they have over managing and working the data, which Excel is extremely good at with little to no programming.
Your point is true generally, but I have yet to use a version of Excel that can handle multiple gigabytes of data without serious performance degradation and probably a lot of crashes.
Which would matter, if businesses cared. I spend 4 days every month compiling a 1.5GB, Excel-based report. 80% of that time is spent with Excel frozen up and me playing on my phone, hoping that when Excel unfreezes it doesn't crash.
I suggested porting the process to Access with an Excel-based front end, which would cut the compilation time down to 4 hours at the most. I was denied because whoever supports it after me will more than likely not have Access experience, and told to continue with the 4 days of hours-long freeze-ups.
This is seen all the time in Accounting departments. GAAP and STAT financials in spreadsheets with millions of linked documents and macros. A large amount of this deals with business logic being tied directly in Excel. Their data and the constraints around it are embedded into the spreadsheet. Segregating the data from the spreadsheet means adapting the business logic from Excel and maintaining it in two places. (Source: I work with an accounting department that regularly deals with spreadsheets that take anywhere from 30 minutes to 4 hours to load/process/filter).
Excel is absolutely standard in accounting circles. It has the benefit of allowing accountants to do all sorts of things without knowing how to program.
Maybe your client was wanting ubiquitous Excel-compatible export/import - not for the back end to actually be Excel? Making any table in your app exportable to at least CSV is pretty straightforward.
I think that is perfectly logical, the project was basically a series of command line tools. I tried to swap out excel for CSV or at least TSV, the client demanded .xls in .xls out. Believe me we tried every suggestion and that was as far as the client would negotiate. We got it to work and I died a little inside.
That person is obviously bat-shit crazy. I am very surprised to hear that Excel or any other spreadsheet program will open multi-gigabyte files without exploding.
When moving from a mostly broken .xls data export system to a CSV file with the same data, we got lots of complaints from customers that it was too hard to use.
CSV files ARE broken in Excel. Have you ever tried importing a column of data with numbers that contain leading zeros / are not meant as numbers? There are at least 5 ways to do it wrong.
I know how to do this in LibreOffice (requires an extra step above and beyond double clicking) but honestly I do not know how to do it in Excel, and I would not attempt to learn or teach this knowledge to a person who does not have experience with a different spreadsheet tool first.
This can be helpful advice for many people. Why bash him? It's good advice. I would understand "incident_tb" in health context, and I'm 17. So a readme should be plenty, right?
Yes, honestly I read down to that point and began writing my comment, assumed there would be lots more bad advice to follow, then the article went and ended on me. The README is the correct solution in this case.
It seems like most federal and state agencies nowadays make much data - data that they are forced by law to make public - only available in PDF files. The reason for this seems to be to prevent, as much as possible, use and analysis of the data while technically complying with public disclosure requirements. It is often extremely difficult to get public data in a usable form out of most public web sites, and the ways in which this happens are so suboptimal they move past the level of mere government incompetence to intentional obfuscation.
The law requires state and federal agencies to make the data public. It does not require them to make the data useful for a specific use case.
But generally, the reasons PDF are used is: (1) PDF signals (internally) that the file has been examined from sensitive, confidential, or non-public content, and no longer needs further modification; (2) PDF is a long-lived standard with plenty of free viewers; and (3) PDF guarantees print fidelity across all known viewers, which is something most office software (including online suites) can't even promise between minor version updates.
Having worked for the US government, trying to collect trade and production data from other governments and private entities, I always assumed this was the case when I received PDFs. Interestingly, interest groups were always eager to give you the data in whatever format you requested...
Any format other than PDF and large numbers of citizens will not have the proper tools to access it - e.g. a person using a public computer in a library.
It's not malice even if it hampers outside researchers. Any format other than printed (e.g. PDF) is going to require the user to deal with the meta-data in order to extract useful information - that's just the nature of data.
They could publish the data in multiple formats, PDF and something actually useful if you're worried about someone at a library not having the tools. The point is, don't cripple the data from an analysis perspective.
Wait, how come they release newly-public data to traders on financial markets in easily computer-parseable format like JSON, but PDFs with tables for so much of this stuff?
I was really hoping to come to this comment page and see a dozen top-level comments claiming you could easily parse a PDF with XYZ tool, because the PDF represented the table with reasonable data, and how could you not know that, you poor excuse for a computer geek, in true HN (and Reddit and 4chan and whatever your flavor of BBS) fashion.
Perhaps we're at the point where some bunch of people have already created a Best Current Practice standard for preparing data prior to data sharing? Because it seems like something that someone, somewhere, should have done already.
---
People can be great at using their particular tool, and then they'll do something to make you realise that they're just an expert with that tool and that their wider computer knowledge isn't so great.
I've had people want online bank statements as PDFs not CSVs because "anyone can change the Excel file"; I had people asking me to scan engineering drawings on paper to GIF files ("Nice and small! We don't have enough space for that other format!" (we did have space)) and then import those GIFs into AutoCAD so we could update our versions of the drawings when the customer updated their versions. (The guy before me had given up trying to explain why it was a dumb idea and had just scanned about 1,000 (of about 10,000) documents and saved them as GIFs.)
Some people aren't particularly sharing data. They're just sharing their information that they've got from that data; the interpretation they're making.
I'm always surprised that FITS [0] doesn't get more of a look-in outside of astronomy. It works very well in my experience for storing tables of data as well as images.
This reads like a giant advertisement for why scientists should learn to code. Not only would they not be helpless in the face of stupid formats, they'd be far more likely to pick a decent one to store the data in the first place.
It does seem to be a big challenge to make data available both in analyzed, "crunched" format for readers of a report (in which case PDF is usually fine), and in "raw" format for analysts and developers (in which case one of CSV flatfile, JSON or XML is usually the best format). Too often, people don't seem to recognize the difference.
The internet isn't just about software and protocols, it's also about the data. Sharing data is kind of the whole point of networking, which is why making your data available in a machine-parseable format is so important.
Several good pieces of advice, but they miss the real problem here. Of course you CAN extract text from a PDF and you can extract a table from Excel. A few lines of Perl or your favourite language and you have it.
Data is worthless if you can't trust it. Plain CSV is easy to read but also easy to change, even on the fly. PDF can be changed too, but it's not as easy, and if someone makes a subtle change to a number the error doesn't propagate through all the pages into the totals like in Excel. Excel macros/formulas can be a source of headaches.
So to share your data use a format that:
1-you can trust (reasonably).
2-all your other reasons go here...
And if this format is full of nested tables, and you find it difficult to extract the info from those tables, don't throw out the format; ask for help instead.
PDF can support passwords, is very compact, not so easy to change on the fly and can be encrypted. Maybe not the best, but not the worst of the available tools, in my opinion.
Yes, a million times yes. I'd like to emphasize the following points:
(2), (3), (4) are all tied to the same idea: Your data will be parsed by a script, not by me. I'm not going to use Excel to analyse your data. I'm gonna import it into R/Python/... and then extract the features I'm interested in. I've been a graduate student in statistics for almost 2 years now, and when the data are correctly packaged, I've never had to use Excel.
(5) Come on, I know that nobody likes writing READMEs. Heck, I don't like it myself. But do you really expect me to guess what your two-letter variable name means? And please, make your readme searchable, and short. The CDC data comes with a 200-page manual explaining what the data is, and finding what you're looking for is a nightmare.
...and horrors of math equations and columns in PDF.
We should really come up with some light format where style is separated from data. Make a universal converter from doc and xls files and save the world from data stuck in PDFs.
As long as the PDF contains text (and isn't just scans of paper), it's usually not TOO hard to select and copy the relevant tables/text, paste into a text editor, and use regexes to transform it into whatever you want (CSV, SQL insert statements, etc.)
Since there's no universal data format (e.g. the data equivalent of PDF), I pretty much assume that whatever format data comes in, I'll have to be doing regex transforms to it in a text editor, in order to import it wherever I want.
So as long as it's copy-pastable as text, PDF doesn't seem appreciably worse than pretty much any other format.
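A sketch of the kind of regex pass this usually boils down to (it assumes runs of two or more spaces, or tabs, separate the columns in the pasted text, which is common but not guaranteed):

    import csv, re, sys

    writer = csv.writer(sys.stdout)
    with open("pasted.txt") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line.strip():
                continue
            # Two or more spaces (or a tab) is treated as the column boundary.
            writer.writerow(re.split(r"\t|\s{2,}", line))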
Maybe it's a naive question, but why shouldn't you use quotes to encapsulate things that are meant to be text so that you can catch things like white space?
Also, do people use formats other than comma- and tab-delimited?
My own experience with CSV is dealing with my own data and instrument readouts (which are almost always tab-delimited with a header section), so I don't know.
Many CSV generators and CSV parsers don't conform to RFC 4180, however, but if you're planning on transmitting data using CSV, it's simple enough to say "use an RFC 4180-compliant parser."
Yeah, I was just wondering why the author kind of explicitly recommended against using white space, when most parsers (including MS Excel in my experience) handle white space encapsulated in quotes just fine.
That's for variable names, not run-of-the-mill text. Variable names (and, in certain situations, dictionary keys) generally can't have spaces, so if you're generating code off of the data set, having to generate variable names (or dictionary keys) without spaces is an additional step.
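That extra step is usually a small helper worth writing once; a sketch (the naming rules here are just one reasonable choice):

    import re

    def to_identifier(label: str) -> str:
        """Turn a human-readable column label into a safe variable/key name."""
        slug = re.sub(r"[^0-9a-zA-Z]+", "_", label.strip().lower()).strip("_")
        # Identifiers can't start with a digit in most languages.
        return "_" + slug if slug[:1].isdigit() else slug

    print(to_identifier("Number of new tuberculosis cases"))  # number_of_new_tuberculosis_cases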
RDF took Prolog clauses, restricted them to triplets, and made their representation dramatically more verbose. Then it removed the reasoning engine part of logic programming and replaced it with OWL which is by default computationally intractable.
I have tried for many, many years to learn to appreciate RDF and semantic technologies in general, but every time I get involved it screams of a problem desperately looking for a solution, and the general case of the problem (reasoning about relations) was already solved better decades ago (and with the recent revived interest in logic/relational programming it seems likely to improve soon).
If you want to represent a graph structure in your data why not sets of json objects? Sure you end up with an ad hoc vocabulary, but I've never seen a successful project make use of the RDF ideal of massively shared vocabularies.
JSON doesn't really support datatypes. So not really useful for sending data to someone else. Even simple things such as float vs. double will be a challenge. Not to mention dates or locations.
Pity you haven't gotten to like it. It's being tested on a lot of stuff, some sticks some doesn't.
And OWL has a number of subsets for whatever complexity you feel like enjoying.
My personal preference is, whenever there's a lot of records to be processed, a tab separated UTF-8 text file. (I admit I am biased as I like to use Perl for first passes).
The number one difference between RDF and CSV (TSV in your case) is that RDF is understandable by a computer. CSV is merely parsable, while RDF comes with all the metadata the computer needs.
I much prefer the RDF graph data model over the document-centric tree model of JSON or XML for representing data. It's too bad the need for RDF to be a solid formal knowledge representation format trumped more practical concerns (like being able to represent an ordered list in a simple manner).
RDF is a graph based data format (serialised to XML). It is based on triples. Such as: Peter hasParent Paal.
To query RDF, we use the graph query language SPARQL. You create a graph with variables in it, and the query engine will find subgraphs that match the graph.
On top of this we can do reasoning (based on first-order logic and Datalog), with RDFS (simple) and OWL (advanced).
And to bring all this together there are many open vocabularies and open data sets linked together. One notable one: DBpedia (based on Wikipedia). For example:
    SELECT ?a (MIN(?b) AS ?label) (MIN(?date) AS ?date2) WHERE {
      ?a a dbpedia-owl:Software ;
         rdfs:label ?b ;
         dbpedia-owl:latestReleaseDate ?date .
    }
    GROUP BY ?a
    ORDER BY ?date2
    LIMIT 100
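For anyone curious, a sketch of running that kind of query from Python (assumes the public endpoint at https://dbpedia.org/sparql is reachable and the requests package is installed):

    import requests

    query = """
    PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>

    SELECT ?a (MIN(?date) AS ?date2) WHERE {
      ?a a dbpedia-owl:Software ;
         dbpedia-owl:latestReleaseDate ?date .
    }
    GROUP BY ?a
    ORDER BY ?date2
    LIMIT 10
    """

    # Ask the endpoint for SPARQL JSON results rather than the default HTML page.
    resp = requests.get(
        "https://dbpedia.org/sparql",
        params={"query": query, "format": "application/sparql-results+json"},
        timeout=30,
    )
    resp.raise_for_status()

    for row in resp.json()["results"]["bindings"]:
        print(row["a"]["value"], row["date2"]["value"])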
Deep in the clearance bin of web technologies (looking at you, XHTML) lie things like RDF and SPARQL. The only reason governments use them is for job security, and because only they have the endless budgets to bother with it.
This advice is pertinent even within Excel, especially for those who rely on any of its summary features (pivot charts, data analysis, etc.). It's essentially an illustration of the benefits of separating content from presentation.