I personally place a lot of value on not having to install dependencies by hand on my system, hoping that the versions available in my OS package manager end up working with the software I want to use. Using dependencies managed by the language's dependency manager removes 99% of the dependency hassle.
(Yes, I know I could install a complete second distro in Docker and run the software there. But I won't, thank you.)
This is uncomfortably close to the way I imagine a glitch from something like a FOXACID exploit might look, and YouTube is a known exploitable service. Maybe I need to get someone more skilled than myself to have a close look at this machine.
Knowledgeable people here, would you please comment on whether there's any reason to rule out concerns about this?
No "knowledgeable people" can rule out anything. You must consider any machine that is connected to any network compromised, if you have anything "sensitive" on there. You have to define for yourself what "sensitive" means.
This is a fair point, but I find that in practice I usually need something more like a probability assessment to act on, or better information about mechanisms.
Disambiguating information such as 'the Flash component reloading normally occurs under conditions X and Y; you may have been compromised, but you'll have to decide whether that's more plausible than X or Y', or technical constraints like 'to pull off exploit delivery that way an attacker would have to do P, Q, and R; if they did, it should leave traces in S, which you could try to check by doing T', would be a real help.
I'm certainly not disagreeing in theory with taking a pessimistic view of the security of any networked machine, but unfortunately that's far too restrictive as a basis for action in most situations.
I hadn't thought about using the file size and hash to reduce collisions, but I chose to stick with the last-modified time in the article, because it can take hours to compute hashes for a big directory tree.
Tools like rsync rely on last-modified time by default, and since I want to use this to track my own files, I won't fake it, so I think it's not a big deal?
It's not just that it could be faked; it could also happen by accident that you modify a file but the file modification date doesn't change. For example, say you edit a photo, but later you run a script that sets the file's modification date from the EXIF data embedded in the photo.
So I guess the point is that also including the file size gives you one more (fast) data point to help ensure 'accurate' change tracking, without adding the overhead of computing content hashes.
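To make the idea concrete, here's a minimal sketch (not the article's actual code; the function name is mine) of building such an index: walk a directory tree and record each file's last-modified time and size, which are both available from a single cheap `stat` call.

```python
import os

def build_index(root):
    """Map each file's path (relative to root) to (mtime, size).

    Both values come from os.stat(), so no file contents are read --
    this is what keeps the snapshot fast compared to content hashing.
    """
    index = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            index[os.path.relpath(path, root)] = (st.st_mtime, st.st_size)
    return index
```

The mtime/size pair can still collide (an edit that preserves both), but for tracking your own files it's a reasonable trade-off against hashing every byte.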
The service is not continually running; I use this method to make incremental backups, with archives stored on AWS Glacier and metadata stored on S3 (the index is stored on S3, and I can't access files on Glacier to compute deltas).
I'm not sure what you mean by continually running, but inotifywait basically just waits on an event. As long as the process sticks around, it doesn't have to do anything until it receives an inotify event.
I don't want to have a running process just for this, I want to be able to take a "snapshot"/"index", store some data on Glacier, and the index on S3, and later, given the index, be able to compute deltas without accessing the full archive store on Glacier.
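Comparing two such snapshots is then just a dictionary diff. A minimal sketch (the function name and index layout are my assumptions, not the tool's API): given an old and a new index mapping paths to (mtime, size) pairs, classify each path as added, removed, or modified.

```python
def diff_index(old, new):
    """Compare two snapshot indexes {path: (mtime, size)}.

    Returns (added, removed, modified) path lists. Only the stored
    metadata is consulted, so the old archive contents (e.g. on
    Glacier) never need to be retrieved.
    """
    added = [p for p in new if p not in old]
    removed = [p for p in old if p not in new]
    modified = [p for p in new if p in old and new[p] != old[p]]
    return added, removed, modified
```

This is the property that matters here: the delta is computed entirely from the small index kept on S3, without touching the full archives.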
Is there a reason you want it in Python? Couldn't you do this with diff -q if you only want to know which files have changed, and then use diff to get the deltas?
I see the confusion; I tweaked the README to remove that claim. But I indeed meant "no external dependencies like Postgres/Redis...".