A class like NSDateFormatter is designed to handle a wide range of date formats. This usually results in sub-optimal performance. If you find it too slow and you have a known, specific date format, you should write a specialized fast parser.
I did the same with Java's Integer.parseInt(...) method. It is an interesting task to go through.
Now I'll spend the rest of this rainy afternoon playing around with writing a fast ISO date parser :)
Edit: Seems Java's Joda Time library already parses ISO dates really quickly: 7 seconds for 4 million on my MBP.
Edit: A fast custom date parser for ISO dates I just wrote can parse 4 million dates in 150 milliseconds.
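For anyone curious what such a fixed-format parser looks like, here is a minimal sketch (my own illustration, not the actual code mentioned above), assuming every input is exactly "yyyy-MM-ddTHH:mm:ss" in UTC and skipping all validation:

    #include <time.h>

    // Days since 1970-01-01 for a Gregorian y/m/d (Howard Hinnant's
    // days_from_civil algorithm; exact for all Gregorian dates).
    static long days_from_civil(int y, int m, int d) {
        y -= m <= 2;
        int era = (y >= 0 ? y : y - 399) / 400;
        int yoe = y - era * 400;                                  // [0, 399]
        int doy = (153 * (m + (m > 2 ? -3 : 9)) + 2) / 5 + d - 1; // [0, 365]
        int doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;          // [0, 146096]
        return (long)era * 146097 + doe - 719468;
    }

    // Assumes s points at "yyyy-MM-ddTHH:mm:ss" (UTC). No validation at
    // all, which is exactly why it beats a general-purpose formatter.
    static time_t parse_iso8601_utc(const char *s) {
        int y  = (s[0]-'0')*1000 + (s[1]-'0')*100 + (s[2]-'0')*10 + (s[3]-'0');
        int mo = (s[5]-'0')*10 + (s[6]-'0');
        int d  = (s[8]-'0')*10 + (s[9]-'0');
        int h  = (s[11]-'0')*10 + (s[12]-'0');
        int mi = (s[14]-'0')*10 + (s[15]-'0');
        int se = (s[17]-'0')*10 + (s[18]-'0');
        return days_from_civil(y, mo, d) * 86400 + h*3600 + mi*60 + se;
    }

All it does is index into the string and do integer arithmetic, so throughput in the millions per second is plausible.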
>Many web services choose to return dates in something other than a unix timestamp (unfortunately)
Wrong. That's very fortunate. Unix time stamps have some serious deficiencies as a data type for storing time information: for one, they lack precision. One second just might not do it. Then they lack any time zone information. You never know what zone a specific time stamp is in: GMT? UTC? The time zone the server is in?
Sure, maybe you are lucky and it's documented (it probably isn't, because people who care about such things are not using unix time stamps to begin with), but using a string time stamp formatted in ISO means that no documentation is needed. The encoding is good enough to store any sub-second time stamp, including time zone info.
That way, you can turn any of these into whatever your environment uses internally, which you will then use in conjunction with the library routines to deal with all the difficulties of date math (how many days in a month? What about leap years? What about time zones? Not really hard issues, but many to keep in mind and many possible causes of bugs).
You are criticizing things which you do not understand (and getting upvoted for it on HN, which is a little disturbing).
As others have mentioned, Unix Timestamps can be arbitrarily precise by adding arbitrarily many places of decimal precision (and this is common practice, supported by the Unix "date" command, among other things).
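To illustrate (my example, not the parent's): on OS X/iOS, NSDate already hands you a fractional Unix timestamp as a double:

    // timeIntervalSince1970 is a double, so it carries sub-second precision.
    NSTimeInterval now = [[NSDate date] timeIntervalSince1970];
    NSLog(@"%.6f", now);   // e.g. 1378818896.427813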
Secondly, Unix Time is an absolute timescale that is not relative to any time zone. A Unix Timestamp alone unambiguously identifies an absolute point in time; there is no need to involve time zones, which are a political concept. A Unix Timestamp can be converted to any timezone and vice-versa. Any representation that is based on civil time is going to be more complicated and have more edge cases.
Thirdly, time zone offsets like -03:00 do not actually specify a time zone; they specify a time zone offset. These two are not the same thing. There are multiple time zones that can have a -03:00 offset, depending on the time of year. Even given a specific time of year, the time zone offset may not uniquely identify the time zone. For example, Arizona doesn't do daylight savings, so if you see a -07:00 time in the summer it could either be a PDT time (used on the west coast) or a MST time (used in Arizona).
Unix Timestamps have many advantages over text-based timestamp representations. They are much simpler to parse and have far fewer lexical variations. They are never invalid (whereas text-based dates like 2000-01-32 can be). They can be stored directly in a numeric variable. You can perform math on them directly.
I can't upvote this high enough.
Anybody who thinks iso8601 dates are a good way to store time should not be allowed to handle time.
Also, the same hour never occurs twice in unix timestamps, though it does in most timezones (but not in fixed time offsets): when clocks fall back at the end of daylight saving time, the same local hour repeats.
Conversion rules are a mess, and have changed over time.
Precision can be fixed by just adding a decimal point. And a "UNIX time stamp" doesn't need a time zone because it's always UTC.
However, your overall point remains valid, because people will try to pass off something as a "UNIX time stamp" that is actually in a different time zone. There is value to self-describing data.
The other major advantage of ISO 8601 is that it's human-readable. Very few people are going to be able to look at a Unix timestamp and convert it in their head (...if you can, that's a good party trick).
Another good reason to avoid POSIX timestamps: they ignore leap-seconds. Thus, to determine how many (for example) days are between two timestamps, you need an up-to-date database of leap-second insertions. Maybe an error-bar of a few seconds doesn't sound like much, but if that error bar happens to straddle midnight and (like most code) you get a date-stamp by truncation, you could be off by a day. If that error-bar happens to straddle midnight on December 31st and you're truncating to month or year values, you could be out by a whole lot more.
It doesn't matter either way. If you wish to account for leap seconds then the device that creates the timestamp needs access to that very same database.
The advantage of ignoring leap-seconds on the recorder is you can map any sufficiently precise monotonic clock to UNIX time with a simple linear equation. Personally I think it makes a lot more sense to keep the complexity contained to the decoder, rather than the encoder where bugs could mean you end up not recording an accurate timestamp to begin with.
Why would you ever care about leap seconds when calculating the number of days between timestamps? Leap seconds are necessary to calculate the exact number of seconds between two timestamps. But a day isn't exactly 86,400 seconds on leap-second days; it's a little bit longer. So the simple algorithm for calculating the number of days between timestamps (floor((ts2-ts1)/86400)) seems more correct than anything that takes leap seconds into account.
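Concretely, a sketch of that whole-days calculation (assuming ts2 >= ts1):

    #include <time.h>

    // Whole calendar days between two POSIX timestamps. Because POSIX time
    // pretends every day is exactly 86,400 seconds, no leap-second table is
    // needed for day arithmetic.
    static long days_between(time_t ts1, time_t ts2) {
        return (long)((ts2 - ts1) / 86400);
    }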
Actually the ISO format does not give you timezone information. It gives you offset from UTC, from which you cannot infer anything about the timezone in which the date resides.
A string timestamp needs more documentation, because there are many subtly different string formats. Does it refer to a specific instant, or a political time notion? What separator is in use? Are leap seconds a possibility?
The ISO date format's notion of timezones is a compromise that gives you the worst of all worlds. They complicate referring to a physical instant, because you can refer to it in several timezones, rather than the unique representation of a unix timestamp. But they're inadequate for political time, because what time comes 6 months after 13:00 (+00:00)? (It could be 13:00 (+01:00) or 13:00 (+00:00) or likely others - you need a symbolic timezone like "Europe/Lisbon").
For physical or "system" times unix time is great (unless you need the greater precision, but how common is that?). For user-facing times ISO is inadequate. The use case for ISO string datetime formats is very narrow.
"They complicate referring to a physical instant, because you can refer to it in several timezones, rather than the unique representation of a unix timestamp."
I thought it's supposed to be an external format. I'd always expect a computer system presenting the output information to the user in his local time zone, while accepting inputs from all time zones equally.
"But they're inadequate for political time, because what time comes 6 months after 13:00 (+00:00)? (It could be 13:00 (+01:00) or 13:00 (+00:00) or likely others - you need a symbolic timezone like "Europe/Lisbon")."
You can't standardize a changing practice. I'd never expect it to deal with these issues.
>I thought it's supposed to be an external format. I'd always expect a computer system presenting the output information to the user in his local time zone, while accepting inputs from all time zones equally.
We're talking about a web API here, not user display. But even so, IME users don't think of their timezones as "+8" or the like, so for human I/O you want to use symbolic timezone names, not offsets.
> Unix time, or POSIX time, is a system for describing instants in time, defined as the number of seconds that have elapsed since 00:00:00 Coordinated Universal Time (UTC), Thursday, 1 January 1970
More importantly (the GP missed this as well), Unix timestamps can't convey local time. Local time has UI implications, e.g. the query "is this event on a weekend" is not generally answerable without the time zone.
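A quick sketch of that point (the timestamp is chosen purely for illustration): the same Unix instant falls on different weekdays depending on which symbolic zone you evaluate it in:

    NSDate *instant = [NSDate dateWithTimeIntervalSince1970:1378512000]; // 2013-09-07 00:00 UTC
    NSCalendar *cal = [[NSCalendar alloc] initWithCalendarIdentifier:NSGregorianCalendar];

    cal.timeZone = [NSTimeZone timeZoneWithName:@"UTC"];
    NSInteger utcDay = [cal components:NSWeekdayCalendarUnit fromDate:instant].weekday; // 7 = Saturday

    cal.timeZone = [NSTimeZone timeZoneWithName:@"America/Los_Angeles"];
    NSInteger laDay = [cal components:NSWeekdayCalendarUnit fromDate:instant].weekday;  // 6 = still Friday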
For historical dates, I'd rather everyone knew how to convert accurately to and from UTC (2 conversions), rather than relying on everyone to have a bug-free and up to date implementation of 2N(N-1) conversions.
That said, the exception for local time, at least in my opinion, is agreeing on dates in the future meant for human interaction (e.g. "I'll meet you at 7 AM local time in Time Square on the 3rd of April 2068"). Here time zone rules may actually change before the date transpires, and you can't be sure of the representation in any other zone or format until closer to the event.
If you've ever moved country, and imported some data from the old country and mixed it with data from your new country, it quickly becomes obvious why preserving source timezone is a deeply useful attribute ("It's a beautifully sunny day! -- me, 04:17hrs").
As far as I can tell, GMT is now UT1 which may not diverge from UTC by more than one second. I don't think conflating the two is especially egregious in this context.
This is a standard as well as a convention. People breaking it are idiots. Most software out there assumes UTC time when dealing with Unix timestamps. Most Java date libraries handle this perfectly and get the correct timezone.
Why not see what sqlite is doing and do something in C yourself that solves the actual problem. It's not surprising that a general purpose Obj-C (or any language) class isn't terribly fast at one specific thing.
I'd suggest using timegm instead of mktime, or setting the TZ environment variable to UTC, to ensure all implementations return an identical date. I ran the same tests and found that strptime was quite fast, but gmtime was taking most of the time. To speed that up, you could borrow SQLite's implementation. Check out the computeJD function from SQLite's date.c - http://www.sqlite.org/src/doc/trunk/src/date.c
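For reference, a sketch of the strptime/timegm combination (timegm is a BSD/glibc extension rather than POSIX, but it's available on OS X and iOS):

    #include <time.h>

    // Parse a UTC ISO date without touching the TZ environment variable.
    struct tm tm = {0};
    strptime("2013-09-10T12:34:56", "%Y-%m-%dT%H:%M:%S", &tm);
    time_t utc = timegm(&tm); // interprets tm as UTC; mktime would apply the local zone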
If date formatting is a bottleneck for me (it is surprisingly often, because it's very slow in some languages) I typically just run it through the command-line program 'convdate' [1] from crush-tools, which is more or less just a wrapper around strptime+strftime.
Yes, and that's the workload convdate is intended for: it batch-converts an entire column of a tab-delimited file. The larger crush-tools suite is intended for unix-style batch processing of tabular data, but fills in some functionality that the classic set of POSIX tools (cut, sort, paste, join, etc.) didn't cover.
That's interesting. I'm not sure of the significance of that year or how that relates to the algorithm in Meeus' book. This web page talks about the date algorithms in Meeus' book in some detail but the math is beyond me - http://mysite.verizon.net/aesir_research/date/jdimp.htm. The Julian day conversion algorithm here is the same one used by SQLite.
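For reference, the core of the Meeus formula that computeJD implements is tiny, which is a big part of why it's fast. A sketch of the formula, not SQLite's exact code (SQLite does the same thing in integer arithmetic and stores the result in milliseconds):

    // Julian Day Number for a Gregorian Y/M/D at 00:00 UT (Meeus,
    // "Astronomical Algorithms", ch. 7). The JD of 1970-01-01 is 2440587.5,
    // so unix_time = (jd - 2440587.5) * 86400.
    static double julian_day(int Y, int M, int D) {
        if (M <= 2) { Y -= 1; M += 12; }
        int A = Y / 100;
        int B = 2 - A + A / 4;
        return (int)(365.25 * (Y + 4716)) + (int)(30.6001 * (M + 1))
               + D + B - 1524.5;
    }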
Hmm... can I see the code around NSDateFormatter? Because I feel like you're using it wrong. You need to cache the NSDateFormatter allocation somewhere (it is really expensive) and reuse the same instance to convert the strings to NSDate*.
Sadly, NSDateFormatter (and NSFormatter in general) is explicitly not thread-safe. You'll either have to allocate and release on demand or implement some sort of thread-local mechanism. Unfortunately, one of the drawbacks of the (generally excellent) concurrency APIs on OS X/iOS is that thread-locals are actually sort of a pain to implement. I've been writing cocoa apps for a few years now, and I find the platform to be generally quite good, but I do miss things like Joda Time.
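One common workaround is to stash one formatter per thread in NSThread's threadDictionary. A sketch (the key and the format string here are made up):

    // Returns a lazily created per-thread NSDateFormatter.
    static NSDateFormatter *PerThreadISOFormatter(void) {
        NSMutableDictionary *threadDict = [[NSThread currentThread] threadDictionary];
        NSDateFormatter *formatter = threadDict[@"com.example.ISOFormatter"];
        if (formatter == nil) {
            formatter = [[NSDateFormatter alloc] init];
            formatter.dateFormat = @"yyyy-MM-dd'T'HH:mm:ss";
            formatter.locale = [[NSLocale alloc] initWithLocaleIdentifier:@"en_US_POSIX"];
            formatter.timeZone = [NSTimeZone timeZoneForSecondsFromGMT:0];
            threadDict[@"com.example.ISOFormatter"] = formatter;
        }
        return formatter;
    }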
The NSDateFormatter is already being cached. That was my first suspicion on finding this issue too. We are using one formatter per thread in the production code, but that doesn't apply for the code I've posted since everything is done on the main thread using a single formatter instance.
The sqlite test happens after the objc version. Are the results any different when the sqlite code comes first? Actually, the fairest comparison would be to store the dataset in a file and have one process for timing NSDateFormatter and a different one for testing sqlite. This would eliminate any advantage that a warm cache might give.
> To parse a million randomly generated dates on an iPhone 5 running iOS 7, NSDateFormatter took a whooping 106.27 seconds, while the SQLite version took just 7.02 seconds.
Yes, NSDateFormatter is slower than other methods, including some C libraries out there or this novel approach for turning a string into an NSDate. However, in most instances it's plenty fast enough, and it has a bunch of useful functionality [1], the least interesting of which is easily turning a string into an NSDate.
If you are optimizing this aspect of your code first, you are likely wasting your time. I would suggest iOS/Mac developers get to know NSDateFormatter intimately, especially if you are displaying date/time information to users anywhere in your apps.
That's a fair point. NSDateFormatter is fast enough, and you should never replace it with anything else unless you know for sure that it's a problem. Well, that goes for any optimization. In my particular case, it really was a bottleneck, and the difference of 100 seconds vs 7 seconds means a user will not have to wait an extra 93 seconds during the initial import step. I am not suggesting we start getting rid of NSDateFormatter; it is a very valuable tool which we use for any and all date formatting ourselves, just not during massive imports anymore.
For those million timestamps I am sure that those extra allocations from the statement APIs are not helpful. Since he references the C code sqlite is using (at a quick glance it looks pretty contained) I don't know why he doesn't just include it directly in his project and call it from objc, no statement API needed.
[Edit: I see now that in the test there is only 1 statement object ever created for a test of a million dates. Better than I thought initially. But my guess is the statement object still creates some degree of inefficiency not found in directly calling the C version.]
7 seconds is a long time in CPU terms; I am sure he can do better.
Why convert each date with a select if they are putting it into the db anyway? Why not just let sqlite do the conversion as part of the insert statement?
That was to make a fair comparison between the two approaches since NSDateFormatter's dateFromString gives an NSDate, while SQLite was handing back an integer.
But you are right. In production, it makes more sense to let SQLite handle the conversion and insertion in the same statement.
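Something like this, using SQLite's built-in strftime('%s', ...) so the parse and the insert happen in one statement (a sketch assuming an open sqlite3 *db handle; the table, column, and input array are made up):

    #include <sqlite3.h>

    sqlite3_stmt *stmt;
    sqlite3_prepare_v2(db, "INSERT INTO samples(ts) VALUES (strftime('%s', ?))",
                       -1, &stmt, NULL);
    for (NSString *isoDate in isoDates) {   // hypothetical NSArray of ISO strings
        sqlite3_bind_text(stmt, 1, isoDate.UTF8String, -1, SQLITE_TRANSIENT);
        sqlite3_step(stmt);
        sqlite3_reset(stmt);
    }
    sqlite3_finalize(stmt);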