Hacker News | perone's comments

I share the same feeling; I think filesystems will have to reinvent themselves given how useful ML models have become in the past few years.


I built a local object store that was designed to replace file systems. You can create hundreds of millions of objects (e.g. files) and attach a variety of metadata tags to each one. A tag could be a number, string, or other data type (including vector info). Searches for objects with certain tags are exceptionally fast.

I invented it because I found searching conventional file systems that support extended attributes to be unbearably slow.


Got a demo?


Tons of demo videos on my YouTube channel. Free beta available for download on my website. Links in my profile.


I'm planning to support macOS; the only issue is with the encoders I'm using now. I will probably work more on it next week to try to make a release that works on macOS as well. Thanks!


Hi, there are no LLMs involved and it is all local: an embedding (vector representation) of the data is created and then used for search later. Nothing is sent to the cloud from your files, and there are no local LLMs running either, only the encoders (I use the Perception Encoder that Meta released a few weeks ago).


This is quite different from LanceDB. In VectorVFS I'm using the inodes directly to store the embeddings; there is no external metadata file or db. The db is your filesystem itself, and that's the key difference.
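A rough sketch of that idea, for the curious: pack the embedding into bytes and attach it to the file's own inode metadata via Linux extended attributes. The attribute name here is hypothetical, not necessarily what VectorVFS uses.

```python
import os
import struct

ATTR = "user.vectorvfs.embedding"  # hypothetical xattr name, for illustration

def pack_floats(vec):
    # Serialize the embedding as little-endian float32s.
    return struct.pack(f"<{len(vec)}f", *vec)

def unpack_floats(data):
    return list(struct.unpack(f"<{len(data) // 4}f", data))

def write_embedding(path, vec):
    # Attach the embedding directly to the file's metadata (Linux only).
    os.setxattr(path, ATTR, pack_floats(vec))

def read_embedding(path):
    return unpack_floats(os.getxattr(path, ATTR))
```

The point is that the vectors travel with the files themselves, so there is no separate database to keep in sync.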


That's an implementation detail, and it sounds more like a liability than a selling point, to have such tight coupling. (Why) do you see not using files as a good thing?

Let me ask another question: is this intended for production use, or is it more of a research project? Because as a user I care about things like speed, simplicity, flexibility, and robustness.


Hi, I'm not sure I understood what you meant by opaque embeddings either, but the reason files surface or not is the similarity score (which is basically the dot product of the embeddings).
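A minimal sketch of that scoring step: if the embeddings are L2-normalized, the dot product is the cosine similarity, and files surface in ranked order of that score.

```python
import numpy as np

def similarity(query_vec: np.ndarray, file_vecs: np.ndarray) -> np.ndarray:
    # Normalize both sides so the dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    f = file_vecs / np.linalg.norm(file_vecs, axis=1, keepdims=True)
    return f @ q

# Toy example: the first file embedding matches the query exactly,
# the second is orthogonal (no match).
scores = similarity(np.array([1.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0]]))
```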


How much work do you think it would be to also have a separate xattr with a human-readable description of the file contents? I wonder if that might already be an intermediate product of some of the embedding tools, like "arbitrary media" -> "text description of media" -> "embedding vector". You could store both of those as xattrs, and you could debug by comparing your text query with the text description of the file contents, as they should produce similar embedding vectors. You could even audit any file, assuming you know what its contents are, by checking the text description xattr generated by this program.
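To make the two-xattr idea concrete, here is a small sketch of the values that would be written alongside each other; the attribute names are hypothetical, and whether the tools expose an intermediate text description is the open question above.

```python
import json

def pack_metadata(description: str, embedding: list[float]) -> dict[str, bytes]:
    # Values as they would be written per-file with os.setxattr(path, name, value):
    # one human-readable description for debugging/auditing, one embedding vector.
    return {
        "user.vectorvfs.description": description.encode("utf-8"),
        "user.vectorvfs.embedding": json.dumps(embedding).encode("utf-8"),
    }

meta = pack_metadata("a photo of a dog on a beach", [0.12, -0.03, 0.88])
```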


Hi, it is quite different; there is no LLM involved. We could certainly use it for RAG, for example, but what is currently implemented is basically a way to generate embeddings (vector representations), which are then used for search later. It is all offline and local (no data from your files is ever sent to the cloud).


I understand that LLMs aren't involved in generating the embeddings and adding the xattrs. I was just wondering what the value add of this is if there's no other background process (like mds on macOS) which is using it to build a search index.

I guess what I'm asking is: how does VectorVFS enable search besides iterating through all files and iteratively comparing file embeddings with the embedding of a search query? The project description says "efficient and semantically searchable" and "eliminating the need for external index files or services" but I can't think of any more efficient way to do a search without literally walking the entire filesystem tree to look for the file with the most similar vector.

Edit: reading the docs [1] confirmed this. The `vfs search TERM DIRECTORY` command:

> will automatically iterate over all files in the folder, look for supported files and then embed the file or load existing embeddings directly from the filesystem.

[1]: https://vectorvfs.readthedocs.io/en/latest/usage.html#vfs-se...
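The linear scan the docs describe would look roughly like this sketch (the embedding loader is a stand-in; in VectorVFS it would read the vector back from the file's xattrs):

```python
import os

def linear_search(root, query_vec, load_embedding, top_k=5):
    # Walk the whole tree, score every supported file, keep the best matches.
    scored = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            vec = load_embedding(path)  # e.g. read from an xattr
            if vec is None:
                continue  # unsupported file type, skip
            score = sum(q * v for q, v in zip(query_vec, vec))
            scored.append((score, path))
    return sorted(scored, reverse=True)[:top_k]
```

This is O(n) in the number of files per query, which is the commenter's point: without a separate index, there is no way around walking the tree.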


Yeah, this kind of setup is indefinitely scalable, but not searchable without a meta db/index keeping track of all the nodes.


Using it for a RAG is smart indeed, especially with a multimodal encoder (vision-rag), as the implementation would be straightforward from what you already have.


Thanks, I'm working on implementing the commands to clean the embeddings (for now you can do that with the Linux xattr command-line tools). I'm supporting CPU or GPU (NVIDIA) for the encoders, and it only supports Linux at the moment.


I am curious: why Python, and not Rust for example?


Not OP, but despite working in an all-Go shop I just wrote a utility in Python the other week and caught some flak for it.

The reason I gave (which was accepted) was that the process of creating a proof of concept and iterating on it rapidly is vastly easier in Python (for me) than it is in Go. In essence, it would have taken me at least a week, possibly more, to write the program I ended up with in Golang, but it only took me a day to write it in Python, and, now that I understand the problem and have a working (production-ready) prototype, it would probably only take me another day to rewrite it in Golang.

Also, a large chunk of the functionality in this Python script seems to be libraries - pillow for image processing, but also pytorch and related vision/audio/codec libraries. Even if similar production-ready Rust crates are available (I'm not sure if they are), this kind of thing is something Python excels at and which these modules are already optimized for. Most of the "work" happening here isn't happening in Python, by and large.


Sure, but now your all-Go shop needs to support two languages, two sets of linters, CI/CD, etc. for a single utility. It might be faster for you, but if the utility is going to be used for more than a couple of weeks, it's now a real hassle for a Go developer to make sure they have the right version of the interpreter, remember all the ins and outs of Python, etc.


Hi, I think Rust wouldn't bring much benefit here, to be honest; the bottleneck is mainly the model and model loading. It would probably be a nightmare to load these models from Rust: I would have to use torch bindings and then port all the preprocessing, which is already in Python, over to Rust.


Thanks. There is a bit of nuance there; for example, you can build an index in a first pass, which will indeed be linear, but then keep it loaded in an interactive prompt for subsequent queries. I'm planning to implement that mode soon. But I agree, it is not intended to search 10 million files, though you seldom have that use case locally anyway.
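That planned mode could look something like this sketch (names are hypothetical): one linear pass populates an in-memory index, and every query after that is just a cheap matrix-vector product.

```python
import numpy as np

class SessionIndex:
    """In-memory index built once per session; queries reuse it."""

    def __init__(self):
        self.paths = []
        self.vecs = []

    def add(self, path, vec):
        # Called once per file during the initial linear pass.
        self.paths.append(path)
        self.vecs.append(np.asarray(vec, dtype=np.float32))

    def search(self, query_vec, top_k=5):
        # Subsequent queries: one matrix-vector product, no filesystem walk.
        mat = np.stack(self.vecs)
        scores = mat @ np.asarray(query_vec, dtype=np.float32)
        order = np.argsort(scores)[::-1][:top_k]
        return [(float(scores[i]), self.paths[i]) for i in order]
```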


I'm not sure I agree that the data manifolds are too rigid. When we look at the quality of score-based generative models and diffusion, we can see clear evidence of how flexible these representations are. We could say the same about statistical manifolds, and the fact that the Fisher information is the fundamental metric tensor for the statistical manifold is a fundamental piece of many 1st- and 2nd-order optimizers today.


Would applying https://en.wikipedia.org/wiki/Banach_fixed-point_theorem yield interesting convergence (and uniqueness) guarantees?


The Banach fixed-point theorem is extensively used for convergence proofs in reinforcement learning, but when you operate at the level of gradient descent for deep neural networks it's difficult to do so, because most commonly used optimizers are not guaranteed to converge to a unique fixed point.
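For reference, the statement being appealed to here is: on a complete metric space $(X, d)$, a map $T : X \to X$ that is a contraction,

```latex
d\big(T(x), T(y)\big) \le q\, d(x, y), \qquad q \in [0, 1),\ \forall x, y \in X,
```

has a unique fixed point $x^\ast = T(x^\ast)$, and the iterates $x_{n+1} = T(x_n)$ converge to it geometrically:

```latex
d(x_n, x^\ast) \le \frac{q^n}{1 - q}\, d(x_1, x_0).
```

The difficulty mentioned above is precisely that SGD-style updates on deep networks generally fail the contraction hypothesis, so neither uniqueness nor this rate carries over.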


The article seems to do the work of defining a Fisher information metric space, and contractions with the Stein score, which seems to satisfy the hypotheses of the Banach fixed-point theorem; I am just not quite sure what conclusion we would get in this instance.


This will be a good introduction, but it won't cover the costs and benefits of a statistical space, IID assumptions, etc.

https://youtu.be/q8gng_2gn70


I wrote an article about it and S2 some time ago as well for those interested: https://blog.christianperone.com/2015/08/googles-s2-geometry...

