A class like NSDateFormatter is designed to handle a wide range of date formats. This usually results in sub-optimal performance. If you find it too slow and you have a known, specific date format, you should write a specialized fast parser.
I did the same with Java's Integer.parseInt(...) method. It is an interesting task to go through.
Now I'll spend the rest of this rainy afternoon playing around with writing a fast ISO date parser :)
Edit: Seems Java's Joda Time library already parses ISO dates really quickly: 7 seconds for 4 million on my MBP.
Edit: A fast custom date parser for ISO dates I just wrote can parse 4 million dates in 150 milliseconds.
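For anyone curious what such a fixed-format parser looks like, here is a minimal sketch (my own illustration, not the actual code mentioned above), assuming every input is exactly "yyyy-MM-ddTHH:mm:ss" in UTC and skipping all validation:

    #include <time.h>

    // Days since 1970-01-01 for a Gregorian y/m/d (Howard Hinnant's
    // days_from_civil algorithm; exact for all Gregorian dates).
    static long days_from_civil(int y, int m, int d) {
        y -= m <= 2;
        int era = (y >= 0 ? y : y - 399) / 400;
        int yoe = y - era * 400;                                  // [0, 399]
        int doy = (153 * (m + (m > 2 ? -3 : 9)) + 2) / 5 + d - 1; // [0, 365]
        int doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;          // [0, 146096]
        return (long)era * 146097 + doe - 719468;
    }

    // Assumes s points at "yyyy-MM-ddTHH:mm:ss" (UTC). No validation at
    // all, which is exactly why it beats a general-purpose formatter.
    static time_t parse_iso8601_utc(const char *s) {
        int y  = (s[0]-'0')*1000 + (s[1]-'0')*100 + (s[2]-'0')*10 + (s[3]-'0');
        int mo = (s[5]-'0')*10 + (s[6]-'0');
        int d  = (s[8]-'0')*10 + (s[9]-'0');
        int h  = (s[11]-'0')*10 + (s[12]-'0');
        int mi = (s[14]-'0')*10 + (s[15]-'0');
        int se = (s[17]-'0')*10 + (s[18]-'0');
        return days_from_civil(y, mo, d) * 86400 + h*3600 + mi*60 + se;
    }

All it does is index into the string and do integer arithmetic, so throughput in the millions per second is plausible.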
>Many web services choose to return dates in something other than a unix timestamp (unfortunately)
Wrong. That's very fortunate. Unix time stamps have some serious deficiencies as a data type for storing time information: for one, they lack precision. One second just might not do it. Then they lack any time zone information. You never know what zone a specific time stamp is in: GMT? UTC? The time zone the server is in?
Sure, maybe you are lucky and it's documented (it probably isn't, because people who care about such things are not using unix time stamps to begin with), but using a string time stamp formatted in ISO means that no documentation is needed. The encoding is good enough to store any sub-second time stamp, including time zone info.
That way, you can turn any of these into whatever your environment uses internally, which you will then use in conjunction with the library routines to deal with all the difficulties of date math (how many days in a month? What about leap years? What about time zones? Not really hard issues, but many to keep in mind and many possible causes of bugs).
You are criticizing things which you do not understand (and getting upvoted for it on HN, which is a little disturbing).
As others have mentioned, Unix Timestamps can be arbitrarily precise by adding arbitrarily many places of decimal precision (and this is common practice, supported by the Unix "date" command, among other things).
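To illustrate (my example, not the parent's): on OS X/iOS, NSDate already hands you a fractional Unix timestamp as a double:

    // timeIntervalSince1970 is a double, so it carries sub-second precision.
    NSTimeInterval now = [[NSDate date] timeIntervalSince1970];
    NSLog(@"%.6f", now);   // e.g. 1378818896.427813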
Secondly, Unix Time is an absolute timescale that is not relative to any time zone. A Unix Timestamp alone unambiguously identifies an absolute point in time; there is no need to involve time zones, which are a political concept. A Unix Timestamp can be converted to any timezone and vice-versa. Any representation that is based on civil time is going to be more complicated and have more edge cases.
Thirdly, time zone offsets like -03:00 do not actually specify a time zone; they specify a time zone offset. These two are not the same thing. There are multiple time zones that can have a -03:00 offset, depending on the time of year. Even given a specific time of year, the time zone offset may not uniquely identify the time zone. For example, Arizona doesn't do daylight savings, so if you see a -07:00 time in the summer it could either be a PDT time (used on the west coast) or a MST time (used in Arizona).
Unix Timestamps have many advantages over text-based timestamp representations. They are much simpler to parse and have far fewer lexical variations. They are never invalid (whereas text-based dates like 2000-01-32 can be). They can be stored directly in a numeric variable. You can perform math on them directly.
I can't upvote this high enough.
Anybody who thinks iso8601 dates are a good way to store time should not be allowed to handle time.
Also, the same hour never occurs twice in unix timestamps, though it does in most timezones (but not in fixed time offsets): when clocks fall back at the end of daylight saving time, the same local hour repeats.
Conversion rules are a mess, and have changed over time.
Precision can be fixed by just adding a decimal point. And a "UNIX time stamp" doesn't need a time zone because it's always UTC.
However, your overall point remains valid, because people will try to pass off something as a "UNIX time stamp" that is actually in a different time zone. There is value to self-describing data.
The other major advantage of ISO 8601 is that it's human-readable. Very few people are going to be able to look at a Unix timestamp and convert it in their head (...if you can, that's a good party trick).
Another good reason to avoid POSIX timestamps: they ignore leap-seconds. Thus, to determine how many (for example) days are between two timestamps, you need an up-to-date database of leap-second insertions. Maybe an error-bar of a few seconds doesn't sound like much, but if that error bar happens to straddle midnight and (like most code) you get a date-stamp by truncation, you could be off by a day. If that error-bar happens to straddle midnight on December 31st and you're truncating to month or year values, you could be out by a whole lot more.
It doesn't matter either way. If you wish to account for leap seconds then the device that creates the timestamp needs access to that very same database.
The advantage of ignoring leap-seconds on the recorder is you can map any sufficiently precise monotonic clock to UNIX time with a simple linear equation. Personally I think it makes a lot more sense to keep the complexity contained to the decoder, rather than the encoder where bugs could mean you end up not recording an accurate timestamp to begin with.
Why would you ever care about leap seconds when calculating the number of days between timestamps? Leap seconds are necessary to calculate the exact number of seconds between two timestamps. But a day isn't exactly 86,400 seconds on leap-second days; it's a little bit longer. So the simple algorithm for calculating the number of days between timestamps (floor((ts2-ts1)/86400)) seems more correct than anything that takes leap seconds into account.
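Concretely, a sketch of that whole-days calculation (assuming ts2 >= ts1):

    #include <time.h>

    // Whole calendar days between two POSIX timestamps. Because POSIX time
    // pretends every day is exactly 86,400 seconds, no leap-second table is
    // needed for day arithmetic.
    static long days_between(time_t ts1, time_t ts2) {
        return (long)((ts2 - ts1) / 86400);
    }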
Actually the ISO format does not give you timezone information. It gives you offset from UTC, from which you cannot infer anything about the timezone in which the date resides.
A string timestamp needs more documentation, because there are many subtly different string formats. Does it refer to a specific instant, or a political time notion? What separator is in use? Are leap seconds a possibility?
The ISO date format's notion of timezones is a compromise that gives you the worst of all worlds. They complicate referring to a physical instant, because you can refer to it in several timezones, rather than the unique representation of a unix timestamp. But they're inadequate for political time, because what time comes 6 months after 13:00 (+00:00)? (It could be 13:00 (+01:00) or 13:00 (+00:00) or likely others - you need a symbolic timezone like "Europe/Lisbon").
For physical or "system" times unix time is great (unless you need the greater precision, but how common is that?). For user-facing times ISO is inadequate. The use case for ISO string datetime formats is very narrow.
"They complicate referring to a physical instant, because you can refer to it in several timezones, rather than the unique representation of a unix timestamp."
I thought it's supposed to be an external format. I'd always expect a computer system presenting the output information to the user in his local time zone, while accepting inputs from all time zones equally.
"But they're inadequate for political time, because what time comes 6 months after 13:00 (+00:00)? (It could be 13:00 (+01:00) or 13:00 (+00:00) or likely others - you need a symbolic timezone like "Europe/Lisbon")."
You can't standardize a changing practice. I'd never expect it to deal with these issues.
>I thought it's supposed to be an external format. I'd always expect a computer system presenting the output information to the user in his local time zone, while accepting inputs from all time zones equally.
We're talking about a web API here, not user display. But even so, IME users don't think of their timezones as "+8" or the like, so for human I/O you want to use symbolic timezone names, not offsets.
> Unix time, or POSIX time, is a system for describing instants in time, defined as the number of seconds that have elapsed since 00:00:00 Coordinated Universal Time (UTC), Thursday, 1 January 1970
More importantly (the GP missed this as well), Unix timestamps can't convey local time. Local time has UI implications, e.g. the query "is this event on a weekend" is not generally answerable without the time zone.
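A quick sketch of that point (the timestamp is chosen purely for illustration): the same Unix instant falls on different weekdays depending on which symbolic zone you evaluate it in:

    NSDate *instant = [NSDate dateWithTimeIntervalSince1970:1378512000]; // 2013-09-07 00:00 UTC
    NSCalendar *cal = [[NSCalendar alloc] initWithCalendarIdentifier:NSGregorianCalendar];

    cal.timeZone = [NSTimeZone timeZoneWithName:@"UTC"];
    NSInteger utcDay = [cal components:NSWeekdayCalendarUnit fromDate:instant].weekday; // 7 = Saturday

    cal.timeZone = [NSTimeZone timeZoneWithName:@"America/Los_Angeles"];
    NSInteger laDay = [cal components:NSWeekdayCalendarUnit fromDate:instant].weekday;  // 6 = still Friday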
For historical dates, I'd rather everyone knew how to convert accurately to and from UTC (2 conversions), rather than relying on everyone to have a bug-free and up to date implementation of 2N(N-1) conversions.
That said, the exception for local time, at least in my opinion, is agreeing on dates in the future meant for human interaction (e.g. "I'll meet you at 7 AM local time in Time Square on the 3rd of April 2068"). Here time zone rules may actually change before the date transpires, and you can't be sure of the representation in any other zone or format until closer to the event.
If you've ever moved country, and imported some data from the old country and mixed it with data from your new country, it quickly becomes obvious why preserving source timezone is a deeply useful attribute ("It's a beautifully sunny day! -- me, 04:17hrs").
As far as I can tell, GMT is now UT1 which may not diverge from UTC by more than one second. I don't think conflating the two is especially egregious in this context.
This is a standard as well as a convention. People breaking it are idiots. Most software out there assumes UTC time when dealing with Unix timestamps. Most Java date libraries handle this perfectly and get the correct timezone.
Why not see what sqlite is doing and do something in C yourself that solves the actual problem. It's not surprising that a general purpose Obj-C (or any language) class isn't terribly fast at one specific thing.
I'd suggest using timegm instead of mktime, or setting the TZ environment variable to UTC, to ensure all implementations return an identical date. I ran the same tests and found that strptime was quite fast, but gmtime was taking most of the time. To speed that up, you could borrow SQLite's implementation. Check out the computeJD function from SQLite's date.c - http://www.sqlite.org/src/doc/trunk/src/date.c
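For reference, a sketch of the strptime/timegm combination (timegm is a BSD/glibc extension rather than POSIX, but it's available on OS X and iOS):

    #include <time.h>

    // Parse a UTC ISO date without touching the TZ environment variable.
    struct tm tm = {0};
    strptime("2013-09-10T12:34:56", "%Y-%m-%dT%H:%M:%S", &tm);
    time_t utc = timegm(&tm); // interprets tm as UTC; mktime would apply the local zone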
If date formatting is a bottleneck for me (it is surprisingly often, because it's very slow in some languages) I typically just run it through the command-line program 'convdate' [1] from crush-tools, which is more or less just a wrapper around strptime+strftime.
Yes, and that's the workload convdate is intended for: it batch-converts an entire column of a tab-delimited file. The larger crush-tools suite is intended for unix-style batch processing of tabular data, but fills in some functionality that the classic set of POSIX tools (cut, sort, paste, join, etc.) didn't cover.
That's interesting. I'm not sure of the significance of that year or how that relates to the algorithm in Meeus' book. This web page talks about the date algorithms in Meeus' book in some detail but the math is beyond me - http://mysite.verizon.net/aesir_research/date/jdimp.htm. The Julian day conversion algorithm here is the same one used by SQLite.
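For reference, the core of the Meeus formula that computeJD implements is tiny, which is a big part of why it's fast. A sketch of the formula, not SQLite's exact code (SQLite does the same thing in integer arithmetic and stores the result in milliseconds):

    // Julian Day Number for a Gregorian Y/M/D at 00:00 UT (Meeus,
    // "Astronomical Algorithms", ch. 7). The JD of 1970-01-01 is 2440587.5,
    // so unix_time = (jd - 2440587.5) * 86400.
    static double julian_day(int Y, int M, int D) {
        if (M <= 2) { Y -= 1; M += 12; }
        int A = Y / 100;
        int B = 2 - A + A / 4;
        return (int)(365.25 * (Y + 4716)) + (int)(30.6001 * (M + 1))
               + D + B - 1524.5;
    }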
Hmm... can I see the code around NSDateFormatter? Because I feel like you're using it wrong. You need to cache the NSDateFormatter allocation somewhere (it is really expensive) and reuse the same instance to convert the strings to NSDate*.
Sadly, NSDateFormatter (and NSFormatter in general) is explicitly not thread-safe. You'll either have to allocate and release on demand or implement some sort of thread-local mechanism. Unfortunately, one of the drawbacks of the (generally excellent) concurrency APIs on OS X/iOS is that thread-locals are actually sort of a pain to implement. I've been writing cocoa apps for a few years now, and I find the platform to be generally quite good, but I do miss things like Joda Time.
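One common workaround is to stash one formatter per thread in NSThread's threadDictionary. A sketch (the key and the format string here are made up):

    // Returns a lazily created per-thread NSDateFormatter.
    static NSDateFormatter *PerThreadISOFormatter(void) {
        NSMutableDictionary *threadDict = [[NSThread currentThread] threadDictionary];
        NSDateFormatter *formatter = threadDict[@"com.example.ISOFormatter"];
        if (formatter == nil) {
            formatter = [[NSDateFormatter alloc] init];
            formatter.dateFormat = @"yyyy-MM-dd'T'HH:mm:ss";
            formatter.locale = [[NSLocale alloc] initWithLocaleIdentifier:@"en_US_POSIX"];
            formatter.timeZone = [NSTimeZone timeZoneForSecondsFromGMT:0];
            threadDict[@"com.example.ISOFormatter"] = formatter;
        }
        return formatter;
    }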
The NSDateFormatter is already being cached. That was my first suspicion on finding this issue too. We are using one formatter per thread in the production code, but that doesn't apply for the code I've posted since everything is done on the main thread using a single formatter instance.
The sqlite test happens after the objc version. Are the results any different when the sqlite code comes first? Actually, the fairest comparison would be to store the dataset in a file and have one process for timing NSDateFormatter and a different one for testing sqlite. This would eliminate any advantage that a warm cache might give.
> To parse a million randomly generated dates on an iPhone 5 running iOS 7, NSDateFormatter took a whooping 106.27 seconds, while the SQLite version took just 7.02 seconds.
Yes, NSDateFormatter is slower than other methods, including some C libraries out there or this novel approach for turning a string into an NSDate. However, in most instances it's plenty fast enough, and it has a bunch of useful functionality [1], the least interesting of which is easily turning a string into an NSDate.
If you are optimizing this aspect of your code first, you are likely wasting your time. I would suggest iOS/Mac developers get to know NSDateFormatter intimately, especially if you are displaying date/time information to users anywhere in your apps.
That's a fair point. NSDateFormatter is fast enough, and you should never replace it with anything else unless you know for sure that it's a problem. Well, that goes for any optimization. In my particular case, it really was a bottleneck, and the difference of 100 seconds vs 7 seconds means a user will not have to wait an extra 93 seconds during the initial import step. I am not suggesting we start getting rid of NSDateFormatter; it is a very valuable tool which we use for any and all date formatting ourselves, just not during massive imports anymore.
For those million timestamps I am sure that those extra allocations from the statement APIs are not helpful. Since he references the C code sqlite is using (at a quick glance it looks pretty contained) I don't know why he doesn't just include it directly in his project and call it from objc, no statement API needed.
[Edit: I see now that in the test there is only 1 statement object ever created for a test of a million dates. Better than I thought initially. But my guess is the statement object still creates some degree of inefficiency not found in directly calling the C version.]
7 seconds is a long time in CPU terms; I am sure he can do better.
Why convert each date with a select if they are putting it into the db anyway? Why not just let sqlite do the conversion as part of the insert statement?
That was to make a fair comparison between the two approaches since NSDateFormatter's dateFromString gives an NSDate, while SQLite was handing back an integer.
But you are right. In production, it makes more sense to let SQLite handle the conversion and insertion in the same statement.
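Something like this, using SQLite's built-in strftime('%s', ...) so the parse and the insert happen in one statement (a sketch assuming an open sqlite3 *db handle; the table, column, and input array are made up):

    #include <sqlite3.h>

    sqlite3_stmt *stmt;
    sqlite3_prepare_v2(db, "INSERT INTO samples(ts) VALUES (strftime('%s', ?))",
                       -1, &stmt, NULL);
    for (NSString *isoDate in isoDates) {   // hypothetical NSArray of ISO strings
        sqlite3_bind_text(stmt, 1, isoDate.UTF8String, -1, SQLITE_TRANSIENT);
        sqlite3_step(stmt);
        sqlite3_reset(stmt);
    }
    sqlite3_finalize(stmt);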