
I know a thing or two about that: 2,400 commits to the scrapers powering openstates over the past 9 years.

Even with openstates, we have an API but don't "just" dump the bills to git for legacy nerd reasons.

The nice thing about laws is that the host websites (or PDFs) don't change templates _that_ often, so generally you can rescrape quarterly (or in some states, annually) without a ton of maintenance. With administrative codes you need to scrape more often, but the websites are still pretty stable.

The downside is that codes in particular are often big; a single scrape might need to make 20,000 or more requests, so you have to be very careful about rate limiting and proxies. That goes back to my original point: it sucks that accessing this stuff is such a mess.
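
(Not from the original comment, but for illustration: a minimal sketch of what that kind of rate-limited, proxy-rotating bulk scrape can look like in Python. The proxy hosts, delays, and URL list are made-up placeholders, not openstates code.)

    # Illustrative only: a steady per-request delay, a small rotating
    # proxy pool, and exponential backoff on transient errors.
    import itertools
    import random
    import time

    import requests

    # Placeholder proxy endpoints, cycled round-robin.
    PROXY_POOL = itertools.cycle([
        "http://proxy-a.example:8080",
        "http://proxy-b.example:8080",
    ])

    def fetch(url, retries=3, base_delay=2.0):
        """Fetch one page, rotating proxies and backing off on failure."""
        for attempt in range(retries):
            proxy = next(PROXY_POOL)
            try:
                resp = requests.get(
                    url,
                    proxies={"http": proxy, "https": proxy},
                    timeout=30,
                )
                resp.raise_for_status()
                return resp.text
            except requests.RequestException:
                # Back off with jitter before retrying through the next proxy.
                time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 1))
        return None

    # In a real code scrape the ~20,000 section URLs would come from the
    # code's table of contents; the fixed sleep keeps the run polite.
    for section_url in ["https://example.gov/code/section-1"]:
        html = fetch(section_url)
        time.sleep(1.0)  # rate limit between requests

Real scrapers usually also cache responses so re-runs don't re-fetch sections that haven't changed.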

The assessor's office in my county provides data older than two weeks (iirc) as a few SQL export dumps, because that's how things were done back in the day.

Current information is gated behind a web 2.0 view of their live data with severe limits. It wasn't designed to be scraped and is in fact hostile to the attempt. I'd imagine they're seeing rising hosting costs, and that those costs will keep rising.

I should reach out to them and see what this looks like from their angle. The local commercial real estate community is pretty tech-savvy and I'm wondering if we could all be a bit more proactive around data access.

I'd love to hear your thoughts on county vs state vs national data! I'd be very interested in any bandwidth usage or processing requirement info you might have recorded.


Fair, I stand corrected. Thanks for your work. All openness efforts are welcome.


