My preference is .csv, because it can be read by almost any program.
Except the one most used for it: Excel. Excel sucks at CSV, especially if the recipients have different internationalized versions. It can't deal properly with multi-line strings and spaces, and the internationalization is hell (semicolons instead of commas, decimal separators, date formats).
Many times have I dealt with CSV exports that could be read in any software except the client's Excel version.
You left out the absolute worst bit: Excel tries to "helpfully" guess the correct data format for each cell in a CSV file. This can lead to silent data loss. Long strings of numbers are converted to a Numeric type that only stores a limited number of significant figures. The string "1-5" is converted to a Date field with the value "5-Jan".
Sure, it's a nice trick, but now it's not really a CSV file any more. If I have to create a file that can only be read by Excel it might as well be XLSX.
There's a slight difference between having to strip "=" from the column values for other applications to handle it, and requiring those other applications to be able to handle XLSX.
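For reference, a minimal Python sketch of that ="..." trick (the file name and columns are made up for illustration, and other CSV consumers will then see the literal ="..." wrapper):

    import csv

    rows = [
        {"id": "0000123", "range": "1-5"},   # values Excel would otherwise mangle
        {"id": "0000456", "range": "3-7"},
    ]

    with open("for_excel.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "range"])
        for row in rows:
            # Wrapping a value as ="..." makes Excel treat it as a formula that
            # evaluates to literal text, preserving leading zeros and stopping
            # the date conversion.
            writer.writerow(['="{}"'.format(row["id"]), '="{}"'.format(row["range"])])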
"Are the rules for Excel CSV parsing written down anywhere?"
They are loosely described in ECMA-376 section 18.8.30 under the "General" format code. But like with most details surrounding XLS and XLSX, the documentation is barely coherent and incorrect in some places.
Except that when users click on or download a csv-file, Excel opens it automatically, bypassing any options and not even bothering to check whether the contents of the csv file match its defaults.
This is 90% of use cases with regular users.
For the other 10%, the interface for specifying the field types sucks, doesn't cover certain options and you can't set your preferences.
Imagine being a user and having to do that every day in order to import your daily whatever report, and having to start all over if you make one little mistake.
And that's just for the options Excel supports. Don't even get me started on all the other quirks. Ever tried to read a csv file whose first line starts with "ID"? (Excel decides it must be a SYLK file and refuses to open it.)
… and precisely how many users do you think know to do this? They might if it's happened before, been correctly diagnosed and explained before and it's obvious as soon as you open the file. It's more common, however, that the Excel bug only causes data loss or invalid results in a percentage of records and, when people notice it, it's corrected by hand.
If you need to do this a lot, generate xlsx - it's not that hard and there are decent libraries for most languages. Relying on users correctly setting column types every time is like making your daily commute depend on having airbags.
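For instance, a minimal sketch with Python's openpyxl (hypothetical column names, not anyone's actual export code; "@" is Excel's Text number format):

    from openpyxl import Workbook

    wb = Workbook()
    ws = wb.active
    ws.append(["id", "incident_count"])

    for record in [("0000123", 7), ("0000456", 12)]:
        ws.append(record)
        # Force the id column to Excel's Text format so leading zeros survive.
        ws.cell(row=ws.max_row, column=1).number_format = "@"

    wb.save("report.xlsx")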
Any numbers with leading zeroes, e.g. 0000123, Excel will interpret as the integer 123 even if you change the column format to "text". It's infuriating.
The only way I found to get around this was to open a new workbook, change the column type to "text" and then paste the data in. I believe this was Excel 2010 on Windows.
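If you control the ingest side instead, forcing everything to text on read is the safer first pass; a sketch with pandas (hypothetical file name):

    import pandas as pd

    # dtype=str stops pandas from guessing numeric types, so "0000123" stays "0000123".
    df = pd.read_csv("export.csv", dtype=str, keep_default_na=False)
    print(df.head())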
Indeed. I don't know what version of Excel the parent is using, but Excel for Mac 2010 handles csv files just fine, even ones that have weird delimiters. I'd love to see an example of a non-readable csv file.
Edit: I haven't tried different Unicodes yet, so granted that may not work.
This is an example of a poorly laid-out menu structure. To force it to do the right thing, you have to wrap text (right click -> format cells -> alignment -> "Wrap text" on).
Basically, the default rendering is supposed to be a single space for any whitespace (tab, newline, space) and "wrap text" forces it to do the right thing.
From someone currently dealing with Excel's CSV writing for a business application, I heartily concur. There is a decade old bug (at least) where the CSV save as format does not properly handle Unicode, and there's no simple way to handle it down the pipeline. I guess the Office team would rather work on the metro ribbon.
"Save as CSV" produces text encoded in the users ANSI codepage which is atrocious, Excel 2007+ at least has "Save as Unicode Text" which produces \t separated UTF16 output.
As I said, no real specification. It reads "This section documents the format that seems to be followed by most implementations". Another part reads "Each field may or may not be enclosed in double quotes (however some programs, such as Microsoft Excel, do not use double quotes at all)."
And then laid down which ones would be used and exactly how, and how to specify that you conform. HTTP is just describing the process of how the spec came about. Because there is no metadata in CSV, you cannot even say how, or if, you conform.
I guess the Office team would rather work on the metro ribbon.
Of course they would! They've already achieved their goal with Excel: make it a lock-in product. In this case, they did it by breaking CSV compatibility, thereby forcing everyone to use XLSX.
I am fairly sure that other spreadsheets will make such errors, too, though. I recently saw one in Numbers that depended on the amount of data pasted (pasting hundreds of rows seemed to disable the smartness)
I've always found the C in CSVs to be needlessly limiting. I'm not aware of any tool that contextualises commas in addresses and other "free text" fields that might form part of the data dump/extract.
For this reason, I am strongly in favour of pipe-separated values in such files. The probability that a data extract has a pipe symbol in any data field is quite small (in my experience, it's been 0 so far).
The first thing I do in any Windows system I have to use on a regular basis is to change the system separator to '|' in the Regional Settings.
Then you run into ASCII art some end user made, or pre-Unicode text from Scandinavia where 0x7C was the code point of a letter. Commas make it pretty obvious which tools are unusably broken, where very rare characters let these bugs go undetected far too long.
Tab-separated is also very common, and I find it better - how often are you storing text formatting within a cell for a data interchange? Similar to pipes, tabs are much less common than commas.
Yes, pipe-delimited format is great, and much easier to parse than comma-delimited.
Tab-delimited is also better than comma-delimited, I think.
Commas just have far too high a probability of appearing within the data itself, and when you hit a comma that was supposed to be escaped, it adds an extra column to the row, screwing everything up.
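Switching delimiters is a one-line change in most CSV libraries; a Python sketch with made-up data:

    import csv

    rows = [["42 Main St, Apt 3", 'said "hello"'], ["1 Pipe|Lane", "plain"]]

    with open("out.psv", "w", newline="") as f:
        # Pipe as delimiter; the csv module still quotes any field that happens
        # to contain a pipe, so even the rare collision round-trips safely.
        csv.writer(f, delimiter="|").writerows(rows)

    with open("out.psv", newline="") as f:
        print(list(csv.reader(f, delimiter="|")))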
Tabs are great, until you try to explain to a non-programmer why your tool doesn't parse their file correctly, even though in the word-processor they're using as a text editor it looks identical to another file that parses correctly.
I also find the C in CSVs to be needlessly limiting. Especially since ASCII already comes defined with characters for record separator (30), unit separator (31), and group separator (29).
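They're trivial to use programmatically, even if no spreadsheet will open the result; a sketch (the field/record layout is just for illustration):

    # ASCII control characters: unit separator splits fields, record separator splits rows.
    US, RS = "\x1f", "\x1e"

    records = [["id", "address"], ["1", "42 Main St, Apt 3"], ["2", "7 Pipe|Lane"]]

    blob = RS.join(US.join(fields) for fields in records)

    # Parsing is a pair of splits -- no quoting or escaping rules needed,
    # as long as the data itself never contains these control characters.
    parsed = [rec.split(US) for rec in blob.split(RS)]
    assert parsed == records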
Excel and other tools can handle commas if you use quoted fields. Unfortunately this generates a new problem when you need to escape quotes inside a quoted field. There is no universal escape method.
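The closest thing to a convention is RFC 4180's quote-doubling, which is what Python's csv module does by default; a quick sketch:

    import csv, io

    buf = io.StringIO()
    # doublequote=True (the default) escapes a quote by doubling it, per RFC 4180.
    csv.writer(buf).writerow(['she said "hi"', "a, b"])
    print(buf.getvalue())   # -> "she said ""hi""","a, b"

    buf.seek(0)
    print(next(csv.reader(buf)))   # round-trips back to the original two fields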
Agreed, CSV sucks if your data contains punctuation. UTF-8 encoded tab-delimited tables are preferred. These are kind of a pain to import into Excel (on Mac at least) because you have to specify exactly what format they are in, but it can be done and there is no worry about quoting conventions and so forth unless your data contains literal tab characters that are important.
Use a flexible file format. My preference is .csv, because it can be read by almost any program. I'll tolerate .xls, but I'm not pleased with .xlsx (not everyone uses Excel!). And please, please, please do not use pdf.
Why is .xls tolerated but not .xlsx?
XLS and XLSX are both Excel formats, but XLS is a binary blob while XLSX is just a zip containing xml files. I'd much rather parse xml files (XLSX) than the binary blob of XLS.
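You can see the zip-of-XML structure with nothing but the standard library; a sketch (assumes some book.xlsx on disk; the sheet path shown is the usual layout, not guaranteed for every producer):

    import zipfile

    with zipfile.ZipFile("book.xlsx") as zf:
        print(zf.namelist())                       # [Content_Types].xml, xl/workbook.xml, ...
        xml = zf.read("xl/worksheets/sheet1.xml")  # raw XML for the first sheet, usually
        print(xml[:200])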
My experience is that open-source implementations like LibreOffice have much better support for the older file formats. docx/xlsx/etc may be a "zip of XML files" but it doesn't really matter if the original data is stored as XML or a blob when it comes to the end user's rendering experience -- it's about how the system can interpret that data. And the interpretations of the new file formats are mostly severely lacking.
Very much this. Sometimes, I don't want to (or can't) open OO.org, I just want to be able to get a quick glimpse in Emacs (http://www.emacswiki.org/emacs/UnXls) or mutt of a simple table of data someone sent me, but so far there are no quick CLI tools (that I've found) to do this. It almost feels as if Microsoft said "hmm, all these free software products can open our files, but they're still complaining about open file formats; how can we abuse our monopoly even more and guarantee that some truly open standard doesn't get mandated?" and thus was born docx, xlsx, pptx, etc.
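In a pinch, a few lines of Python get close to that quick glimpse (a sketch, assuming pandas plus an Excel engine such as openpyxl or xlrd is installed; hardly a polished CLI):

    import sys
    import pandas as pd

    # Usage: python peek.py somefile.xlsx
    path = sys.argv[1]
    df = pd.read_excel(path)      # pandas picks an engine based on the extension
    print(df.to_string(max_rows=40))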
Yes, it gets data into the sheet, but means that you need extra steps to run calculations. There is no option to specify the column as a number during the import phase. So what takes one step in Excel takes a minimum of two in Libre.
I thought you were worried about leading zeros or something.
In calc 4.0 on XP, I can choose from a few options for each column and the default (called "Standard") results in a column that I can use in calculations (if I highlight the cells it shows the sum, in formulas, etc.).
If I specify the column is "Text", leading zeros are retained.
The newest versions of LibreOffice are a lot better at handling xlsx, and it is getting better all the time. I still see some issues, but lately I have had a much better time importing.
XLSX and the other OpenXML formats are generally speaking very easy to query, particularly as Microsoft provide their OpenXML SDK to do it with as well as other 3rd party tools provide similar functionality.
Third-party spreadsheet programs have (or at least used to have) poor support for XLSX. That's from when it was new; I'd think that popular programs have improved, but the reputation stuck.
I have had reasonable success with using the Apache POI project for the XLSX and XLS. You might want to consider looking at the community edition of Pentaho Data Integration at the 5.0 level and up. It includes POI for XLSX. I use it all the time to deal with random spreadsheets full of data I get from clinical data (hospital reports).
But I hate excel of any type for data. People randomly change formats and give me updated spreadsheets with 100K's of rows. Then seem surprised when their data doesn't link up with other data sources anymore . . .
ProPublica recently published an article (with an unseemly title) about the nightmare of scraping data from pdfs and putting it into tables. [1] They wrote some custom software to do the scraping. At the end of the article they mention some software released recently by a Knight-Mozilla OpenNews Fellow, Manuel Aristarán, called Tabula, which is a PDF data table scraper.[2] I haven't tested it yet.
There is lots of bad advice in this article. OK maybe the article is shorter than I thought. Here, this is bad:
Use short variable names with no whitespace. Underscores are usually a safe bet, so instead of "Number of new tuberculosis cases" use "incident_tb"
While it might be clear to the domain experts, I had no idea that "incident_tb" meant "Number of new tuberculosis cases" other than that it was placed next to that text.
I think the message is right, but honestly if they are sending you data to parse in PDF, you better just give up now. That person will most likely not have seen a README file before in their life, they're probably not even trying.
I think the rest of the points are good suggestions but "use short variable names" is a terrible suggestion. "incident_tb" is one of the worst possible names that could be used while still being relevant to the data represented.
It's 2013, software can handle titles longer than 8 characters. Even something like "n_new_tb_incidents" (which is shorter than I'd like) buys a lot more clarity at the expense of a few characters. I'd take that bargain any day.
I agree with you, I actually had a client who demanded that the data in their project be done in excel (with multiple files ranging from 1-3 gb in size!). Both myself and my partner tried exhaustively to use a database or even JSON/XML solutions but the client kept repeating "in the real world business is done in excel!" Needless to say it was a nightmare.
I don't have the full picture here, but by moving away from an Excel format you are also potentially taking away a lot of the control they have over managing and working the data, which Excel is extremely good at with little to no programming.
Your point is true generally, but I have yet to use a version of Excel that can handle multiple gigabytes of data without serious performance degradation and probably a lot of crashes.
Which would matter, if businesses cared. I spend 4 days every month compiling a 1.5GB, Excel-based report. 80% of that time is spent with Excel frozen up and me playing on my phone, hoping that when Excel unfreezes it doesn't crash.
I suggested porting the process to Access with an Excel-based front end, which would cut the compilation time down to 4 hours at the most. I was denied because whoever supports it after me will more than likely not have Access experience, and told to continue with the 4 days of hours-long freeze-ups.
This is seen all the time in Accounting departments. GAAP and STAT financials in spreadsheets with millions of linked documents and macros. A large amount of this deals with business logic being tied directly in Excel. Their data and the constraints around it are embedded into the spreadsheet. Segregating the data from the spreadsheet means adapting the business logic from Excel and maintaining it in two places. (Source: I work with an accounting department that regularly deals with spreadsheets that take anywhere from 30 minutes to 4 hours to load/process/filter).
Excel is absolutely standard in accounting circles. It has the benefit of allowing accountants to do all sorts of things without knowing how to program.
Maybe your client was wanting ubiquitous Excel-compatible export/import - not for the back end to actually be Excel? Making any table in your app exportable to at least CSV is pretty straightforward.
I think that is perfectly logical, the project was basically a series of command line tools. I tried to swap out excel for CSV or at least TSV, the client demanded .xls in .xls out. Believe me we tried every suggestion and that was as far as the client would negotiate. We got it to work and I died a little inside.
That person is obviously bat-shit crazy. I am very surprised to hear that Excel or any other spreadsheet program will open multi-gigabyte files without exploding.
When moving from a mostly broken .xls data export system to a CSV file with the same data, we got lots of complaints from customers that it was too hard to use.
CSV files ARE broken in Excel. Have you ever tried importing a column of data with numbers that contain leading zeros / are not meant as numbers? There are at least 5 ways to do it wrong.
I know how to do this in LibreOffice (requires an extra step above and beyond double clicking) but honestly I do not know how to do it in Excel, and I would not attempt to learn or teach this knowledge to a person who does not have experience with a different spreadsheet tool first.
This can be helpful advice for many people. Why bash him? It's good advice. I would understand "incident_tb" in health context, and I'm 17. So a readme should be plenty, right?
Yes, honestly I read down to that point and began writing my comment, assumed there would be lots more bad advice to follow, then the article went and ended on me. The README is the correct solution in this case.
It seems like most federal and state agencies nowadays make much data - data that they are forced by law to make public - only available in PDF files. The reason for this seems to be to prevent, as much as possible, use and analysis of the data while technically complying with public disclosure requirements. It is often extremely difficult to get public data in a usable form out of most public web sites, and the ways in which this happens are so suboptimal they move past the level of mere government incompetence to intentional obfuscation.
The law requires state and federal agencies to make the data public. It does not require them to make the data useful for a specific use case.
But generally, the reasons PDF are used is: (1) PDF signals (internally) that the file has been examined from sensitive, confidential, or non-public content, and no longer needs further modification; (2) PDF is a long-lived standard with plenty of free viewers; and (3) PDF guarantees print fidelity across all known viewers, which is something most office software (including online suites) can't even promise between minor version updates.
Having worked for the US government, trying to collect trade and production data from other governments and private entities, I always assumed this was the case when I received PDFs. Interestingly, interest groups were always eager to give you the data in whatever format you requested...
Any format other than PDF and large numbers of citizens will not have the proper tools to access it - e.g. a person using a public computer in a library.
It's not malice even if it hampers outside researchers. Any format other than printed (e.g. PDF) is going to require the user to deal with the meta-data in order to extract useful information - that's just the nature of data.
They could publish the data in multiple formats, PDF and something actually useful if you're worried about someone at a library not having the tools. The point is, don't cripple the data from an analysis perspective.
Wait, how come they release newly-public data to traders on financial markets in easily computer-parseable format like JSON, but PDFs with tables for so much of this stuff?
I was really hoping to come to this comment page and see a dozen top-level comments claiming you could easily parse a PDF with XYZ tool, because the PDF represented the table with reasonable data, and how could you not know that, you poor excuse for a computer geek, in true HN (and Reddit and 4chan and whatever your flavor of BBS) fashion.
Perhaps we're at the point where some bunch of people have already created a Best Current Practice standard for preparing data prior to data sharing? Because it seems like something that someone, somewhere, should have done already.
---
People can be great at using their particular tool, and then they'll do something to make you realise that they're just an expert with that tool and that their wider computer knowledge isn't so great.
I've had people want online bank statements as PDFs not CSVs because "anyone can change the Excel file"; I had people asking me to scan engineering drawings on paper to GIF files ("Nice and small! We don't have enough space for that other format!" (we did have space)) and then import those GIFs into AutoCAD so we could update our versions of the drawings when the customer updated their versions. (The guy before me had given up trying to explain why it was a dumb idea and had just scanned about 1,000 (of about 10,000) documents and saved them as GIFs.)
Some people aren't particularly sharing data. They're just sharing their information that they've got from that data; the interpretation they're making.
I'm always surprised that FITS [0] doesn't get more of a look-in outside of astronomy. It works very well in my experience for storing tables of data as well as images.
This reads like a giant advertisement for why scientists should learn to code. Not only would they not be helpless in the face of stupid formats, they'd be far more likely to pick a decent one to store the data in the first place.
It does seem to be a big challenge to make data available both in analyzed, "crunched" format for readers of a report (in which case PDF is usually fine), and in "raw" format for analysts and developers (in which case one of CSV flatfile, JSON or XML is usually the best format). Too often, people don't seem to recognize the difference.
The internet isn't just about software and protocols, it's also about the data. Sharing data is kind of the whole point of networking, which is why making your data available in a machine-parseable format is so important.
Several good pieces of advice, but they miss the real problem here. Of course you CAN extract text from a PDF and you can extract a table from Excel. A few lines of Perl or your favourite language and you have it.
Data is worthless if you can't trust it. Plain CSV is easy to read but also easy to change, even on the fly. PDF can be changed too, but it's not as easy, and if someone makes a subtle change to a number the error doesn't propagate through all the pages into the totals like in Excel. Excel macros/formulas can be a source of headaches.
So to share your data use a format that:
1-you can trust (reasonably).
2-all your other reasons go here...
And if this format is full of nested tables, and you find it difficult to extract the info from those tables, don't throw out the format; ask for help instead.
PDF can support passwords, is very compact, not so easy to change on the fly and can be encrypted. Maybe not the best, but not the worst of the available tools, in my opinion.
Yes, a million times yes. I'd like to emphasize the following points:
(2), (3), (4) are all tied to the same idea: Your data will be parsed by a script, not by me. I'm not going to use Excel to analyse your data. I'm gonna import it into R/Python/... and then extract the features I'm interested in. I've been a graduate student in statistics for almost 2 years now, and when the data are correctly packaged, I've never had to use Excel.
(5) Come on, I know that nobody likes writing READMEs. Heck, I don't like it myself. But do you really expect me to guess what your two-letter variable name means? And please, make your readme searchable, and short. The CDC data comes with a 200-page manual explaining what the data is, and finding what you're looking for is a nightmare.
...and horrors of math equations and columns in PDF.
We should really come up with some light format where style is separated from data. Make a universal converter from doc and xls files and save the world from data stuck in PDFs.
As long as the PDF contains text (and isn't just scans of paper), it's usually not TOO hard to select and copy the relevant tables/text, paste into a text editor, and use regexes to transform it into whatever you want (CSV, SQL insert statements, etc.)
Since there's no universal data format (e.g. the data equivalent of PDF), I pretty much assume that whatever format data comes in, I'll have to be doing regex transforms to it in a text editor, in order to import it wherever I want.
So as long as it's copy-pastable as text, PDF doesn't seem appreciably worse than pretty much any other format.
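A sketch of the kind of regex pass this usually boils down to (it assumes runs of two or more spaces, or tabs, separate the columns in the pasted text, which is common but not guaranteed):

    import csv, re, sys

    writer = csv.writer(sys.stdout)
    with open("pasted.txt") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line.strip():
                continue
            # Two or more spaces (or a tab) is treated as the column boundary.
            writer.writerow(re.split(r"\t|\s{2,}", line))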
Maybe it's a naive question, but why shouldn't you use quotes to encapsulate things that are meant to be text so that you can catch things like white space?
Also, do people use formats other than comma- and tab-delimited?
My own experience with CSV is dealing with my own data and instrument readouts (which are almost always tab-delimited with a header section), so I don't know.
Many CSV generators and CSV parsers don't conform to RFC 4180, however, but if you're planning on transmitting data using CSV, it's simple enough to say "use an RFC 4180-compliant parser."
Yeah, I was just wondering why the author kind of explicitly recommended against using white space, when most parsers (including MS Excel in my experience) handle white space encapsulated in quotes just fine.
That's for variable names, not run-of-the-mill text. Variable names (and, in certain situations, dictionary keys) generally can't have spaces, so if you're generating code off of the data set, having to generate variable names (or dictionary keys) without spaces is an additional step.
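That extra step is usually a small helper worth writing once; a sketch (the naming rules here are just one reasonable choice):

    import re

    def to_identifier(label: str) -> str:
        """Turn a human-readable column label into a safe variable/key name."""
        slug = re.sub(r"[^0-9a-zA-Z]+", "_", label.strip().lower()).strip("_")
        # Identifiers can't start with a digit in most languages.
        return "_" + slug if slug[:1].isdigit() else slug

    print(to_identifier("Number of new tuberculosis cases"))  # number_of_new_tuberculosis_cases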
RDF took Prolog clauses, restricted them to triplets, and made their representation dramatically more verbose. Then it removed the reasoning engine part of logic programming and replaced it with OWL which is by default computationally intractable.
I have tried for many, many years to learn to appreciate RDF and semantic technologies in general, but every time I get involved it screams of a problem desperately looking for a solution, and the general case of the problem (reasoning about relations) was already solved better decades ago (and with the recent revived interest in logic/relational programming it seems likely to improve soon).
If you want to represent a graph structure in your data why not sets of json objects? Sure you end up with an ad hoc vocabulary, but I've never seen a successful project make use of the RDF ideal of massively shared vocabularies.
JSON doesn't really support datatypes. So not really useful for sending data to someone else. Even simple things such as float vs. double will be a challenge. Not to mention dates or locations.
Pity you haven't gotten to like it. It's being tested on a lot of stuff, some sticks some doesn't.
And OWL has a number of subsets for whatever complexity you feel like enjoying.
My personal preference is, whenever there's a lot of records to be processed, a tab separated UTF-8 text file. (I admit I am biased as I like to use Perl for first passes).
The number one difference between RDF and CSV (TSV in your case) is that RDF is understandable by a computer. CSV is merely parsable, while RDF comes with all the metadata the computer needs.
I much prefer the RDF graph data model over the document-centric tree model of JSON or XML for representing data. It's too bad the need for RDF to be a solid formal knowledge representation format trumped more practical concerns (like being able to represent an ordered list in a simple manner).
RDF is a graph based data format (serialised to XML). It is based on triples. Such as: Peter hasParent Paal.
To query RDF, we use the graph query language SPARQL. You create a graph with variables in it, and the query engine will find subgraphs that match the graph.
On top of this we can do reasoning (based on first-order logic and Datalog), with RDFS (simple) and OWL (advanced).
And to bring all this together there are many open vocabularies and open data sets linked together. One notable one: DBpedia (based on Wikipedia). For example:
    SELECT ?a (MIN(?b) AS ?label) (MIN(?date) AS ?date2) WHERE {
      ?a a dbpedia-owl:Software ;
         rdfs:label ?b ;
         dbpedia-owl:latestReleaseDate ?date .
    }
    GROUP BY ?a
    ORDER BY ?date2
    LIMIT 100
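For anyone curious, a sketch of running that kind of query from Python (assumes the public endpoint at https://dbpedia.org/sparql is reachable and the requests package is installed):

    import requests

    query = """
    PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>

    SELECT ?a (MIN(?date) AS ?date2) WHERE {
      ?a a dbpedia-owl:Software ;
         dbpedia-owl:latestReleaseDate ?date .
    }
    GROUP BY ?a
    ORDER BY ?date2
    LIMIT 10
    """

    # Ask the endpoint for SPARQL JSON results rather than the default HTML page.
    resp = requests.get(
        "https://dbpedia.org/sparql",
        params={"query": query, "format": "application/sparql-results+json"},
        timeout=30,
    )
    resp.raise_for_status()

    for row in resp.json()["results"]["bindings"]:
        print(row["a"]["value"], row["date2"]["value"])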
Deep in the clearance bin of web technologies (looking at you, XHTML) lie things like RDF and SPARQL. The only reason governments use them is for job security, and because only they have the endless budgets to bother with it.
This advice is pertinent even within Excel, especially for those who rely on any of its summary features (pivot charts, data analysis, etc.). It's essentially an illustration of the benefits of separating content from presentation.