I used to work for a Go shop. We dealt with financial data. I found it so annoyi...

umvi · on Aug 13, 2024

I've mostly seen the opposite - pandas and Jupyter notebooks shipped directly to production because the data scientists and AI guys didn't know how to do anything but Python. As a result, the solutions were not performant and often had lots of runtime crashes due to Python's looser typing.

philote · on Aug 13, 2024

If they're shipping notebooks to production and having so many crashes, I'd question that they even know how to do Python.

Narhem · on Aug 13, 2024

When do you ship notebooks to production? Jupyter was never meant for external clients.

pclmulqdq · on Aug 13, 2024

While terrifying, it is not uncommon to see python notebooks make it to production.

EuAndreh · on Aug 13, 2024

Oh god why. I thought (and hoped) that GP didn't actually mean this.

I see how a team or an organization can eventually get to this point. It just saddens me that they got there.

nsonha · on Aug 13, 2024

Why not? Many data-related tasks are rather ad hoc; it's a waste of time to build long-lasting software out of every ad hoc request.

Narhem · on Aug 13, 2024

Seen the same with exposed matlab web apps. Not just the Python guys, kind of shocking how much group think exists on any platform.

mathgeek · on Aug 13, 2024

Quite a number of AI edtech sites use notebooks in production for assignments, as an example of when it happens.

umvi · on Aug 13, 2024

Usually what happens is:

1. "We got it all working in the notebook"

2. "Great, ship it"

3. (data scientist takes the notebook code almost verbatim, wraps it in a basic CLI or HTTP API and it gets shipped off in a docker container for other services to consume)

Narhem · on Aug 13, 2024

In practice that’s not too crazy. It’s fast, easy to debug, and works with most CI/CD tools.

If you ever want to work in teams that kind of setup works extremely well.

Spivak · on Aug 13, 2024

I'm not sure how you're going to fix this given the data science tools are in Python. Are you gonna implement a half-broken 20% of numpy/scipy for a one-off program and then try to port?

These libraries hide a lot of complexity and implementing even a few operators is a project.

supermatt · on Aug 13, 2024

I don't think they are complaining about the libraries, but rather about the scratchpads and notebooks that people use for ideation and evaluation being moved directly into a production environment because the authors don't have the experience or time to build more structured, efficient and maintainable code.

grahamjameson · on Aug 13, 2024

Rust + Polars comes to mind.

physicsguy · on Aug 13, 2024

I've had exactly the same experience, it's a nice language but using it for things it's not suited for like data exploration makes no sense to me.

Production data pipelines, on the other hand, are a good fit, but only after testing them well and, as you say, making sure there's good test coverage if you're implementing things like numerical routines.

Lyngbakr · on Aug 13, 2024

Do you have experience implementing ETL pipelines in Go? I think it'd be a better fit for us over our current language, but I'm curious to hear from people who've actually done it.

physicsguy · on Aug 13, 2024

Yes. It works fairly well. With that said, I've got a feeling that things would be a lot easier to change around if we weren't using it; we end up writing a lot of code to do relatively simple things.

brushyamoeba · on Aug 13, 2024

I do this at my job. Disclaimer: I’m a web dev (“architect”) who does some lightweight data engineering tasks to facilitate views in some of my apps.

My pipelines are very simple (no DAG-like dependencies across pipelines). I could just have separate scripts, but instead I have a monorepo of pipelines that implement an interface with Extract, Transform, Load methods. I run this as a single process that runs pipelines on a schedule and has an HTTP API for manually triggering pipelines.

At some point I felt guilty that I am doing something nobody else seems to do, and that I had rolled my own poor-man’s orchestrator. I played around with Dagster and it was pretty nice but I decided it was overkill for my needs (however I definitely think the actual data analysis team at my company should switch from Jenkins to Dagster heh…)

On a separate note, all of my pipelines Load into Elasticsearch, which I’m using as a data warehouse. I’ve realized this is another unconventional decision I’ve made, but it also seems to work well for my use-cases.

fifilura · on Aug 13, 2024

What is your current language, and have you considered doing it in SQL?

I don't think Go will be the right choice. It is just not its strength.

physicsguy · on Aug 13, 2024

It depends on what you're doing right? The commenter here replied to me, and we're processing really large data files that are deliberately not in a SQL database due to size, only artefacts of these files eventually make it into a time series DB. For us Go works well and is performant without any great difficulty. For domain specific analytics we generally use Python, and Go just calls out to an API to do them.

fifilura · on Aug 13, 2024

You are right. And I am mostly a "T" guy so I guess the answer was mostly about the transform.

For extracting the data, Go is probably a very good choice. But for transforming, pretty often not, although your use case may be suitable.

In the end, the question was very open ended.

caeril · on Aug 13, 2024

> it's a nice language but using it for things it's not suited for like data exploration

Pedantic point, but this is an issue of library support, not the language.

For whatever reason, data scientists and ML researchers decided to write their libraries and tools in a Kindergartner's language with meaningful whitespace, dynamic typing, and various other juvenile PL features specifically aimed at five year olds and the mentally infirm.

physicsguy · on Aug 14, 2024

Nobody would really use a compiled language for this; the compile-run-edit cycle just takes too long. Prior to Python, people really just used MATLAB and Mathematica for that sort of work on the physics/engineering side, and R and Stata/SPSS on the bio + maths side. MATLAB, Mathematica, Stata and SPSS are all commercial, and R has exactly the same problems as Python with environment management and compiled binaries: if you use it today you end up doing a lot of manual compilation of dependencies and putting them in the PATH, on Linux at least.

Python became popular because the key scientific ecosystem libraries copied the libraries from MATLAB closely which made it easy to pick up, and because it was free. Anaconda made a distribution that was easy to install with all the dependencies compiled for you, which worked on Linux/Mac/Windows which made it much easier to use than R. The other interactive languages around at the time were Ruby which was heavily web dev focused, and Perl. Node didn’t yet exist.

Once you have an ecosystem in a language it’s very hard to supplant. You need big resources to go against the grain. That no big company has decided to pour lots of money into alternatives even despite the problems probably tells us that it’s not viewed as being worth it.

theshrike79 · on Aug 13, 2024

> I used to work for a Go shop. We dealt with financial data.

People dealing with finance are THE MOST risk-averse people I know. No new tool will be used without years of vetting and it'll still be blamed if something goes wrong - even if the new tool never touched that bit of the process =)

Source: Consulted for a finance company and it was Ye Olde Java and COBOL all the way down =)

Someone · on Aug 13, 2024

If they’re risk averse, they wouldn’t, as the OP claimed, implement basic algorithms such as rolling median or finding a maximum instead of loading data into Pandas and doing a group by, would they?

sanderjd · on Aug 13, 2024

Using Go for data analysis is not the risk averse choice. If the comment were "everyone was using excel and maybe R and I couldn't get them to use python", then the risk aversion comment would be spot on :)

zerkten · on Aug 13, 2024

>> People dealing with finance are THE MOST risk-averse people I know. No new tool will be used without years of vetting and it'll still be blamed if something goes wrong - even if the new tool never touched that bit of the process =)

What exactly do you mean by "finance" in your context? Even within a huge organization like Bank of America you can have an extremely conservative part and an innovative part. Obviously, most employees are likely working for the conservative part, but that doesn't mean that there aren't some people playing with modern technology. It just doesn't get evenly distributed, and you can be many layers from that usage in your department.

dagw · on Aug 13, 2024

Depends what you mean by "dealing with finance". If you're talking about core back office 'plumbing' then yes. Other parts are far more adventurous. Poke around any huge financial institution and you'll probably find everything from COBOL to some experimental in-house variant of Haskell.

lolinder · on Aug 13, 2024

It sounds like every single data operation in this shop was done with a "new tool" written on the spot in Go.

noisy_boy · on Aug 14, 2024

People who are saying Finance == assembly/COBOL/Java either have never worked in the area or have decided about the elephant by touching a very small part of it.

Within a bank, you can have the accounting department using tested and tried approaches and/or vendor products whereas the machine learning group may be using the latest tech. But finance is not all banks - you can have non-banking financial orgs, rating firms, investment consultancies, private equity, hedge funds and on and on. A hedge fund probably uses super cutting edge stuff in one team while the other has been chugging along with stored procedures and Java.

It is too big and varied to generalize.

murray-buttchin · on Aug 13, 2024

I don't think this is true of every financial institution. Nubank runs almost entirely on Clojure except for some of its Data Science tooling which is in Scala. Sure it's JVM, but it doesn't exactly scream risk-averse to me.

martinclayton · on Aug 13, 2024

My experience is different to that.

I'd say people and businesses that best understand and balance risks against potential rewards are the most successful. I know a couple of skydiving accountants...

Use of innovative tech in finance is one way for companies to gain an "edge" which can make them a lot of money. I wouldn't characterise people working in those particular areas as more risk averse.

OTOH finance businesses (esp. those that manage other people's money) are regulated and can't put new tech into Production without careful change control. But this evolves over time, and controls were looser in the past.

There's a steady creep of regulation and control into tech used in finance outside and around IT: so called EUC (End-User Computing). This can be a source of "wrong tool for the job" syndrome. I've seen some hellish SQL written by non-IT people, but it got the job done. It's also an area where source code control, testing, and release processes are inevitably more human and error-prone.

There is a culture clash internal to large finance between "move fast and break things!" and "nothing can change, ever - too risky!" This leads to a mixture of the very old and very new, and inevitably, more complexity. Achieving homogeneity once the heterogeneous is out of the bottle is very hard - attempts to do so often (always?) lead to a variation on the xkcd "Standards" situation https://xkcd.com/927/

classified · on Aug 13, 2024

I'm surprised they were adventurous enough to use Java.

theshrike79 · on Aug 13, 2024

Someone picked Java at the start of the millennium-ish and they never dared to move away. =)

whateveracct · on Aug 13, 2024

> I used to work for a Go shop. We dealt with financial data.

Hopefully you never forgot to initialize a numeric field and had it default to 0!

ratorx · on Aug 14, 2024

I think you are implying that this is incorrect behaviour. Could you explain why?

The alternative would be to have it be undefined, which seems clearly worse. Or do you mean that it should be a compiler warning/error instead?

sapiogram · on Aug 14, 2024

I've worked in Go for two years, and I hate zero values with a passion. I would much prefer undefined/nil/whatever as a default value. At least that obviously represents an invalid value, and will crash on use, become `null` in the database and when serialized to JSON. `0`, however, is indistinguishable from "missing data", and will just sit there and slowly poison your production data over time.
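To make the complaint concrete, here is a small sketch, assuming a hypothetical `Payment` struct decoded with the standard `encoding/json` package. A document with no `amount` field and one with an explicit `0` end up as the same value.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Payment is a hypothetical record. Amount is a plain int, so a field that
// is absent from the input silently becomes 0.
type Payment struct {
	ID     string `json:"id"`
	Amount int    `json:"amount"`
}

// decode parses one JSON document into a Payment.
func decode(doc string) (Payment, error) {
	var p Payment
	err := json.Unmarshal([]byte(doc), &p)
	return p, err
}

func main() {
	missing, _ := decode(`{"id":"p1"}`)
	explicit, _ := decode(`{"id":"p1","amount":0}`)

	// Both inputs produce identical structs, so downstream code cannot
	// tell "no amount supplied" apart from "amount is zero".
	fmt.Println(missing == explicit) // true
}
```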

ratorx · on Aug 14, 2024

I don’t think you mean that. By undefined, I was referring to C-style undefined behaviour stemming from reading uninitialised memory (which is the only alternative to setting the memory to a known value when a variable is declared but not initialised). This is clearly worse than zero values, because with zero values the variable at least never starts out holding whatever happened to be in memory.

If “null” is better for your use case, that’s not too difficult to emulate in Go (although perhaps not as ergonomic as it is in other languages). What you want is effectively an “optional” type rather than the type itself. The easiest way to represent this in Go is to use pointers to the type, which have the zero-value of nil. Combining this with JSON encoding ‘omitempty’ would get you the properties you were looking for.

I don’t think the lack of default nullability is bad (look at all the languages that are trying to slowly phase it out by supporting non-nullable types).

sapiogram · on Aug 14, 2024

> I don’t think you mean that. By undefined, I was referring to C-style undefined behaviour stemming from reading uninitialised memory

You're right, I was referring to Javascript-style `undefined`, where it's basically a 2nd `null` value.

> The easiest way to represent this in Go is to use pointers to the type, which have the zero-value of nil.

This certainly works, but it has some serious drawbacks imo:

* Performance overhead

* The integer field is no longer copied along with the struct. The resulting aliasing shenanigans can easily trip up experienced developers, because "why would anyone store an `*int` in a struct instead of a plain `int`?"

* Other silly mistakes that no linters catch, such as comparing the pointers when you meant to compare the raw values.
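The aliasing and pointer-comparison pitfalls in the last two bullets can be reproduced in a few lines. This is only a sketch; `Position` and `sameQty` are hypothetical names.

```go
package main

import "fmt"

// Position is a hypothetical struct with an optional quantity.
type Position struct {
	Ticker string
	Qty    *int
}

// sameQty compares the pointed-to values, treating two nils as equal.
func sameQty(a, b Position) bool {
	if a.Qty == nil || b.Qty == nil {
		return a.Qty == b.Qty
	}
	return *a.Qty == *b.Qty
}

func main() {
	n := 10
	orig := Position{Ticker: "ABC", Qty: &n}

	// Copying the struct copies the pointer, not the int it points to.
	cp := orig
	*cp.Qty = 99
	fmt.Println(*orig.Qty) // 99: writing through the copy changed the original

	// Two distinct pointers to equal values compare unequal with ==.
	x, y := 5, 5
	a := Position{Ticker: "ABC", Qty: &x}
	b := Position{Ticker: "ABC", Qty: &y}
	fmt.Println(a.Qty == b.Qty) // false
	fmt.Println(sameQty(a, b))  // true
}
```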

whateveracct · on Aug 14, 2024

> Or do you mean that it should be a compiler warning/error instead?

Yes - it's pretty trivial to enforce that all fields be initialized.

RandomThoughts3 · on Aug 13, 2024

> rolling median, or finding a maximum

The real question is why are you writing code to make a plot when you can load a csv in excel and get a clean plot in 2 minutes.

noisy_boy · on Aug 14, 2024

Maybe because they need to do that every day with variations and automating away the monotonous chore is one of the main points of writing code.

RandomThoughts3 · on Aug 14, 2024

> I saw my colleagues again and again implementing basic algorithms such as rolling median, or finding a maximum.

Emphasis mine so I will go with no regarding your explanation. It’s obvious the discussion was about one shot plotting from the start. The whole Go vs Python discussion for graphing doesn’t make sense in the context of automation.

rcxdude · on Aug 16, 2024

Because Excel sucks at plotting, especially if you have non-trivial amounts of data.

bitbasher · on Aug 13, 2024

For that kind of processing, I prefer to use the database itself. I mean, it is a "data" base and it's pretty good at it. I get it-- sometimes you have a half cooked spreadsheet someone sends you and you need to run some analysis on it.. in that case I don't see any harm in using Go, using visidata, importing it in sqlite or using pandas or whatever.

wejick · on Aug 13, 2024

I have done the same as your colleagues. Recently I discovered that pandas + Jupyter notebooks are much better tools. However, as a gopher who only occasionally needs to crunch data, I rely entirely on an LLM chat agent to do that work.

But the rest of the work is still Go, so the context switching is there but bearable. Imagine a Go crawler dumping CSV and a Python script crunching the CSV.

pantsforbirds · on Aug 13, 2024

I've almost entirely replaced Pandas with DuckDB in my day-to-day work. I wonder if it would be an easier lift for someone who is familiar with SQL and doesn't want to pick up Pandas/Polars.

ansgri · on Aug 13, 2024

This comment might finally push me to try DuckDB, thanks. I’m quite proficient with scientific python in general, but Pandas API is just alien (and slow). SQL is more natural for its domain.

Daishiman · on Aug 13, 2024

How's DuckDB better than using SQLite on an in-memory DB?

pantsforbirds · on Aug 14, 2024

It's designed for analytics work. Some examples:

* You could run a bunch of analytics queries directly on a jsonlines, csv, or parquet file (or even glob of files).

* You can output directly to a pandas or polars dataframe.

* You can use numeric python functions inside of queries.

* You can also attach to S3 buckets, postgres databases, or sqlite databases and query them directly.

richrichie · on Aug 14, 2024

> I saw my colleagues again and again implementing basic algorithms such as rolling median, or finding a maximum. Instead of loading data into Pandas and doing a group by, they would create some kind of loopy solution that would use maps.

Isn’t this trivial? I use these functionalities and maintain my own Go package for such utilities. A software firm should be able to do this without much issue. In fact, it is better done in house. Don’t have to worry about dependency management and downloading the entire Pandas library for just rolling means, etc.

classified · on Aug 13, 2024

> Instead of loading data into Pandas and doing a group by

SQL can do `group by`. No need for overkill with "data science tools".

sanderjd · on Aug 13, 2024

If you have a csv, pandas is the "less overkill" route than loading it into a database to use sql.

chrisandchris · on Aug 13, 2024

To each their own tool, but that's nothing awk can't do [1]. I prefer a shell script, because shell is everywhere.

[1] https://stackoverflow.com/a/75073649

sanderjd · on Aug 13, 2024

Sorry, but no, awk and shell scripts are not a good choice for this kind of work. I'm sorry, but they just aren't!

Sure, go for it if you're just doing stuff for your own personal interest. But if you're doing serious data-driven work in a professional environment, this is going to be an awful choice.

infecto · on Aug 13, 2024

While definitely doable, using a SQL database to do data discovery is obtuse.

zerkten · on Aug 13, 2024

It's obtuse, but it's effective for some people. I know someone who hires devs from the .NET and Python space for finance. When asked to suggest a solution to any kind of data problem like this the interviewees split down the middle with the .NET ones almost entirely using a relational database and the Python ones using either the library du jour or suggesting something on the command line (since they often have Linux experience.)

In both parts of their org, the .NET area tends to use SQL exclusively for analysis and the Python folks use Pandas and a bunch of other stuff. These departments are also in significantly different parts of their organization with their own mandates and culture.

hermitdev · on Aug 13, 2024

As someone with 20+ years in finance (hedge funds/trading) who knows .NET, Java, Python, C++, shell, etc., the first question I'd ask is: where's the data?

If I'm being asked to do data analysis, it's because we need the answer yesterday. So, the tool I choose is always going to be a matter of which gets me the answer fastest and with the least amount of friction. That's almost always dictated by where the data is _now_. Not where it'd ideally be.

ryanjshaw · on Aug 13, 2024

It's in this email attachment right here. Now what?

dotancohen · on Aug 13, 2024

To me that sounds like a csv file and it'll likely be a python script from me.

pjmlp · on Aug 15, 2024

Depends, that is what OLAP is all about, naturally most of the good tools happen to be commercial for the enterprise space.

orthoxerox · on Aug 13, 2024

Not really. DuckDB can operate on raw csv without ingesting them. This is as easy if not easier than setting up python and pandas.

infecto · on Aug 13, 2024

Totally agree that it depends on the tools available, but given a typical toolset (SQL and Python), I would lean towards Python because some types of analysis are easier to express in code, especially when working across multiple data sets.

senkora · on Aug 13, 2024

XTX? I know that they use a lot of Go.

arberavdullahu · on Aug 13, 2024

You don’t need to spend time learning standard data science tools. These tasks are well-defined, thoroughly documented in different blogs, and with today’s AI-driven code generation, basic programming knowledge is sufficient. Attempting to manipulate CSV data and generate visualizations in Go would be a waste of time, delivering subpar results.