NNX is a Neural Networks library for JAX that provides a simple yet powerful module system that adheres to standard Python semantics. Its aim is to combine the robustness of Flax with a simplified, Pythonic API akin to that of PyTorch.
I think we have to make a distinction here:
- On one hand, having access to these large-scale language models that can do few-shot learning is incredibly useful for industry, as it can be easily deployed to solve thousands of simple tasks.
- On the other hand, this approach will not solve harder problems (as Yann points out) and "just" creating bigger models using the same techniques is probably not the path forward in those domains.
One thing not mentioned in the original "Why Swift for TensorFlow" document, and a major source of conflict when the differentiable programming feature was formally proposed by the S4TF team as a standard Swift feature: Swift has no mechanisms for metaprogramming. This matters because Automatic Differentiation can be implemented entirely via metaprogramming; instead, the S4TF team had to build certain compiler features internally, which is probably one of the reasons it took so long to get even the most basic stuff working.
In retrospect you can really say Swift was a bad choice for the project, because the time to market was much slower than it could have been vs e.g. choosing Julia. The other thing they didn't take into account was the actual market: the Data Science ecosystem in Swift is non-existent. You have an excellent Deep Learning library standing alone without a numpy, a pandas, a scipy, an opencv, a pillow, etc., which makes building real applications with it nearly impossible.
That said, Swift as a language is amazing: doing parallel computation is easy, and not having a garbage collector makes it very efficient. It's the kind of thing we need, but the language right now is not in the right state.
I think the ML community really needs a better language than Python, but not because of the ML part (that works really well); it's because of the Data Engineering part (which is 80-90% of most projects), where Python really struggles for being slow and not having true parallelism (multiprocessing is suboptimal).
That said, I love Python as a language, but if it doesn't fix these issues, in the (very) long run it's inevitable that the data science community will move to a better solution. Python 4 should focus 100% on JIT compilation.
I've found it generally best to push as much of that data prep work down to the database layer as you possibly can. For small/medium datasets that usually means doing it in SQL; for larger data it may mean using Hadoop/Spark tools to scale horizontally.
I really try to take advantage of the database to avoid ever having to munge very large CSVs in pandas. So like 80-90% of my work is done in query languages in a database, the remaining 10-20% is in Python (or sometimes R) once my data is cooked down to a small enough size to easily fit in local RAM. If the data is still too big, I will just sample it.
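For illustration, here is a hedged sketch of that pushdown idea using Python's built-in `sqlite3` (the `events` table and its columns are made up; in practice this would be a real warehouse): the aggregation happens in SQL, and only the cooked-down result set crosses into Python.

```python
import sqlite3

# Hypothetical events table standing in for a large warehouse table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, 10.0), (1, 5.0), (2, 7.5)],
)

# Push the aggregation down to the database: only the small per-user
# totals come back to Python, never the raw rows.
rows = conn.execute(
    "SELECT user_id, SUM(amount) FROM events GROUP BY user_id ORDER BY user_id"
).fetchall()
```

At that point the result easily fits in local RAM and can be handed to pandas (or R) for the remaining analysis.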
It's an argument that Python being slow / single-threaded isn't the biggest problem with Python in data engineering. The biggest problem is the need to process data that doesn't fit in RAM on any single machine. So you need on-disk data structures and algorithms that can process them efficiently. If your strategy for data engineering is to load whole CSV files into RAM, replacing Python with a faster language will raise your vertical scaling limit a bit, but beyond a certain scale it won't help anymore and you'll have to switch to a distributed processing model anyway.
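The alternative to loading whole files is streaming. A minimal stdlib sketch of constant-memory processing (the in-memory `StringIO` stands in for a hypothetical CSV too large to load at once):

```python
import csv
import io

# Stand-in for a huge file; in practice this would be open("huge.csv")
# and the loop would still only hold one row at a time.
data = io.StringIO("key,value\na,1\nb,2\na,3\n")

totals = {}
for row in csv.DictReader(data):
    # Constant-memory aggregation: only the running totals live in RAM,
    # so the input can be arbitrarily larger than memory.
    totals[row["key"]] = totals.get(row["key"], 0) + int(row["value"])
```

This is the single-machine version of the idea; distributed frameworks generalize the same streaming/aggregating pattern across machines.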
Dask might be lightweight internally but resorting to it just to solve a simple task that requires concurrency is not "simple".
Streamz looks nice! However:
"Streamz relies on the Tornado framework for concurrency. This allows us to handle many concurrent operations cheaply and consistently within a SINGLE THREAD."
Apparently you can set it up to use Dask to escape the single thread, but that is kind of a global config. With Pypeline you can mix and match Processes, Threads, and asyncio.Tasks where it makes sense, and resource management per stage is simple and explicit. If you have some understanding of the multiprocessing, threading, and asyncio modules, Pypeline will save you tons of time.
Still, I will keep an eye on Streamz; it's very nice work with lots of features, and it should get more visibility.
"Pypeline was designed to solve simple medium data tasks that require concurrency and parallelism but where using frameworks like Spark or Dask feel exaggerated or unnatural."
It was actually because I've resorted to / hacked into tf.data and Dask in the past just to get concurrency and parallelism. Pypeline is way more natural for pure Python stuff.
3. If you first have to load in all the data, you can't handle streams. Pypeline accepts non-terminating iterables and also gives you back possibly non-terminating iterables.
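A minimal stdlib sketch of that streaming contract (iterables in, iterables out), with hypothetical stage names, showing why a non-terminating source is not a problem:

```python
import itertools

def sensor():
    # A non-terminating source: yields forever.
    n = 0
    while True:
        yield n
        n += 1

def double_stage(stream):
    # A stage that consumes lazily and yields lazily: it never needs
    # the whole input up front, so infinite streams work fine.
    for x in stream:
        yield x * 2

# The pipeline itself is also a (possibly non-terminating) iterable;
# the consumer decides how much of it to pull.
first_five = list(itertools.islice(double_stage(sensor()), 5))
```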